
Mini Project Report

Design and implementation of a classification system based on soft computing and statistical approaches

Submitted By:

Ashish Kumar Agrawal (2001114)


Abstract

This project, developed as part of an MHRD research project (Designing an Intelligent Robot for Explosive Detection and Decontamination, funded by MHRD, Govt. of India), explores the design and development of a classifier based on statistical methods and soft computing approaches. The classifier identifies mines and non-mines using various clustering, classification and rule-establishment algorithms, and the algorithms are compared on the basis of complexity and accuracy. Designing such a classifier is a big challenge: the data is not linearly separable and its features overlap, so it is not possible to design a classifier with 100% accuracy. This project uses PVC tubes, wood pieces and copper cylinders as non-mine data, in addition to data on various mines. The basic idea of the classification rests on the fact that it is safe to predict non-mine data as mine, but not safe to predict mine data as non-mine. The unsupervised ART algorithm therefore divides the data into several clusters, which are then merged on the basis of this fact. A genetic algorithm enhances the results by establishing rules that have negation in the antecedent part. In addition to these approaches, fuzzy approaches give the membership values of the data in each class, which helps to visualize the class of the data.

Candidate’s Declaration

I hereby declare that the work presented in this project, titled “Design and implementation of a classification system based on soft computing and statistical approaches”, submitted towards completion of the mini-project in the sixth semester of B.Tech (IT) at the Indian Institute of Information Technology (IIIT), Allahabad, is an authentic record of my original work pursued under the guidance of Dr. G. C. Nandi, Associate Professor, IIIT, Allahabad.

I have not submitted the matter embodied in this project for the award of
any other degree.

(Ashish Kumar Agrawal)

Place: Allahabad
Date: 17-5-2004

------------------------------------------------------------------------------------------------------------

Certificate

This is to certify that the above declaration made by the candidate is correct
to the best of my knowledge and belief.

(Dr G.C. Nandi)
Associate Professor
IIIT (Deemed University)
Deoghat, Jhalwa, Allahabad.

Place: Allahabad
Date: May, 2004

Acknowledgement

First and foremost, I would like to express my sincere gratitude to my project guide, Dr G.C. Nandi. I was privileged to experience sustained, enthusiastic and involved interest from his side. This fueled my enthusiasm even further and encouraged me to step boldly into what was a totally dark and unexplored expanse before me.
I would also like to thank my seniors, who were ready with a positive comment all the time, whether it was an off-hand comment to encourage me or a constructive piece of criticism. Special thanks go to the JRC database providers, who made a good database of mines available.
Last but not least, I would like to thank the IIIT-A staff members and the institute in general for extending a helping hand at every juncture of need.

Ashish Kumar Agrawal (2001114)

Table Of Contents

Abstract
Declaration
Certificate
Acknowledgements
List of Figures

CHAPTER I
Introduction and Statement of Problem
1.1 Introduction
1.2 Problem Statement

CHAPTER II
Challenges in This Field
2.1 Features extraction
2.2 Selection of an algorithm

CHAPTER III
Approaches in This Direction
3.1 Statistical approaches
3.1.1 Clustering algorithm: K-means
3.1.2 K-nearest neighbour
3.2 Soft computing approaches
3.2.1 Genetic algorithm
3.2.2 Adaptive resonance theory (ART)
3.2.3 Fuzzy c-means
3.2.4 Gustafson-Kessel algorithm
3.2.5 Gath-Geva algorithm
3.2.6 Kohonen SOM

CHAPTER IV
System Architecture
4.1 Data source name and login
4.2 Algorithm and table selection

CHAPTER V
Results and Conclusions
5.1 Results
5.2 Conclusion
5.3 Future Extensions
5.3.1 Improvement in the genetic algorithm
5.3.2 Distributed computing environment
5.3.3 Dealing with various platforms and formats

References
- Books
- Research Papers

List of Figures

Fig 1: Flow of information
Fig 2: Main frame of algorithms
Fig 3: Result of genetic algorithm
Fig 4: Result of ART algorithm
Fig 5: Result of Fuzzy c-means and Gustafson-Kessel algorithm
Fig 6: Result of K-means algorithm
Fig 7: Result of K-nearest neighbour algorithm
Fig 8: Result of Kohonen SOM algorithm

Chapter 1

Introduction & Statement of Problem


BRIEF OVERVIEW

1.1 Introduction
“If we already know about the upcoming hazards, it is very easy to find the way to abolish them.”
Here, this sentence is applied to the context of landmine detection and decontamination. My objective is to predict, with some confidence parameter, whether a particular point of the working area is occupied by a mine or not. The robot is designed to move toward these predicted areas to decontaminate the mines. The mine-occupied areas can be known before the robot starts moving or can be predicted dynamically; designing an obstacle-free path for the robot is another aspect, beyond the domain of this module.
To tackle this problem, a classification toolkit has been designed using statistical and soft computing based approaches to cluster the data, to predict the possible class of incoming data, and to generate rules with a confidence parameter. The data may be given in image form or in tabular form with all-numeric or categorized attributes.
It is impossible to design a classifier with 100% correct classification, because it is not easy to differentiate between the data of metallic debris, PVC tubes and actual mines.
On the basis of this prediction, path designers develop the obstacle-free path to decontaminate these mines.

1.2 Statement of Problem
Anti-personnel landmines are a significant barrier to economic and social development in a number of countries, so we need a classification system that can differentiate a mine from metallic debris on the basis of the given data. This data is generated by highly accurate sensors.

Chapter 2

Challenges in This Field

In the field of classification and rule establishment, the basic problems are feature extraction (features are the building blocks of the algorithms) and the selection of good algorithms that can generate results with a high certainty value.
2.1 Features Extraction
The initial problem is feature extraction. Generally the image data contains a blurred image of an object, so it is very difficult to extract the exact boundary of the object. Various features can be used as the raw material of the system. Here blobsize, blobaspectratio and blobintensity have been chosen. The given data may contain images of PVC tubes, metallic debris and mines.
The data can also be given in a tabular format with numerical or categorized attribute values, which is more suitable for the algorithms.
2.2 Selection of an Algorithm
The second problem is to choose an algorithm that can interpret the problem in the best way. The algorithms can be categorized in two parts:
(1) Statistical approaches
(2) Soft computing based approaches
Three types of algorithms can be applied here: classification, clustering and rule establishment with some certainty factor. The best way is therefore to design various algorithms and then check their efficiency and accuracy.

Chapter 3

APPROACHES IN THIS DIRECTION

This section discusses the various algorithms used to achieve the objective.

3.1 Statistical Approaches

Two algorithms have been used: one for clustering (the K-means algorithm) and another for predicting the class of incoming data (K-nearest neighbour).
3.1.1 Clustering Algorithm: K-means
Clustering partitions the data into groups (clusters) so that samples in the same cluster are more similar to each other than to samples in other clusters.
If the number of samples is less than the number of clusters, we assign each sample as the centroid of a cluster. Each centroid has a cluster number. If the number of samples is greater than the number of clusters, then for each sample we calculate the distance to every centroid and take the minimum; the sample is said to belong to the cluster whose centroid is at minimum distance from it. Since we are not sure about the location of the centroids, we adjust each centroid location based on the currently assigned data, and then reassign all the data to the new centroids. This process is repeated until no sample moves to another cluster. Mathematically, this loop can be proved to be convergent.
Since there are only two classes, mine and non-mine, the number of clusters is given as 2, as input along with the dataset [B1].
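The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's Java implementation: the function name `kmeans`, the random choice of initial centroids, the squared Euclidean distance and the `max_iter` guard are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into at most k clusters."""
    rng = random.Random(seed)
    # Fewer samples than clusters: every sample becomes its own centroid.
    if len(points) < k:
        return [tuple(p) for p in points], list(range(len(points)))
    # Assumption: initial centroids are k distinct samples picked at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each sample to the centroid at minimum (squared Euclidean) distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # no sample moved to another cluster: stop
            break
        labels = new_labels
        # Move each centroid to the mean of the samples currently assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels
```

Calling `kmeans(data, 2)` corresponds to the two-cluster (mine / non-mine) setting used here; which cluster number ends up holding the mines still has to be decided afterwards from labelled samples.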

3.1.2 K-nearest Neighbour

The K-nearest neighbour technique is used to predict the class of incoming data on the basis of the given training data. A density estimator (k-NN) estimates the confidence of the incoming sample for each class, and the class with the highest estimate is predicted.

Density estimator: qc(x) = (number of neighbours of class c) / K

The neighbours are the K closest points to the given sample. Their distances from the sample are calculated as city block distances. [B2]
The problem of choosing K still remains, but a general rule of thumb is to use K = sqrt(N), where N is the number of learning samples.
A disadvantage of this method is that it is computationally intensive for large data sets.
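A minimal Python sketch of this predictor follows. It uses the city block distance, the density estimator qc(x) and the K = sqrt(N) rule of thumb from the text; the function names and the `(features, label)` layout of the training data are assumptions made for illustration.

```python
import math
from collections import Counter

def city_block(a, b):
    """City block (Manhattan) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, sample, k=None):
    """train: list of (features, label) pairs. Returns (label, {label: q_c})."""
    if k is None:
        k = max(1, round(math.sqrt(len(train))))  # rule of thumb: K = sqrt(N)
    k = min(k, len(train))
    # The K training samples closest to the incoming sample.
    neighbours = sorted(train, key=lambda item: city_block(item[0], sample))[:k]
    counts = Counter(label for _, label in neighbours)
    # Density estimator: q_c(x) = (number of neighbours of class c) / K
    density = {label: n / k for label, n in counts.items()}
    best = max(density, key=density.get)
    return best, density
```

The sort over the whole training set for every query is what makes the method expensive on large data sets.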

3.2 Soft Computing Approaches

Soft computing approaches can be classified into several categories, like:
1.) Neural approaches
2.) Fuzzy clustering
3.) Adaptive resonance theory
4.) Kohonen SOM
5.) Genetic algorithm

3.2.1 Genetic algorithm to establish rules
Association rule mining can be used to establish rules between the attributes of the data, but it cannot predict the complete set of rules: rules that have negation in the attributes cannot be discovered. To overcome that disadvantage, a genetic algorithm (GA) has been used.
First of all, association rule mining is applied with support and confidence values entered by the user to generate some base rules. These rules are sent to the genetic algorithm as input, which helps to evolve new rules having negation in the attributes.
The three basic parts of the genetic algorithm are as follows:
(a) Selection: The roulette wheel technique is used to select the two parents [R1].
(b) Crossover: A random point (the crossover point) is generated, and the segments to the left of this point in the first parent and in the second parent are interchanged.
(c) Mutation: A mutation point is generated randomly and the bit value at this point is toggled.
After some iterations we find rules that follow the above properties and have a high fitness value, which can be calculated either from the confidence value or from a confusion matrix.
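The three operators can be illustrated with the short Python sketch below. It assumes that each rule is encoded as a string of '0'/'1' characters and that the fitness values are non-negative numbers supplied by the caller; the encoding of rules into bits and the function names are hypothetical and not taken from the project.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if fit > 0 and pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off

def crossover(parent_a, parent_b, rng):
    """Interchange the segments to the left of a random crossover point."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_b[:point] + parent_a[point:], parent_a[:point] + parent_b[point:]

def mutate(chromosome, rng):
    """Point mutation: toggle the bit at one randomly chosen position."""
    point = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[point] == "0" else "0"
    return chromosome[:point] + flipped + chromosome[point + 1:]
```

One generation would call `roulette_select` twice to obtain two parents, apply `crossover`, optionally `mutate` the children, and then re-evaluate the fitness of the resulting rules.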
3.2.2 Adaptive resonance theory (ART)
As we know, a backpropagation network is very powerful in the sense that it can approximate any continuous function given a certain number of hidden neurons and certain forms of activation function. But once a backpropagation network is trained, the number of hidden neurons and the weights are fixed. The network cannot learn from new patterns unless it is re-trained from scratch, so there is no plasticity. [R2]
ART is a neural network technique that addresses this problem. Our ultimate objective is to cluster the data into several chunks.
The samples are presented to the input neurons one by one. For each sample, the activation value is calculated for each of the existing output neurons and the highest value is chosen. If this value is higher than the threshold value, the weights of this connection are updated; otherwise a new output neuron is added.
After a certain number of iterations the clusters of the data are found. Our application has no more than two classes (mine and non-mine), so the clusters have to be merged. Another fact is that predicting non-mine data as mine is acceptable, but the reverse is not, because it may be dangerous. So among all the clusters, the cluster whose center is farthest from the mine data center is classified as non-mine, and the rest of the clusters are classified as mine.
Here the activation function is calculated from the city block distance between the incoming normalized data and the connection weights.
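The procedure above can be approximated by the following simplified Python sketch. It is not a full ART network: the activation is assumed to be 1 minus the mean city block distance (so that a higher value means a closer match for features normalized to [0, 1]), and the `vigilance` threshold, the `learning_rate` and the function names are illustrative assumptions.

```python
def art_cluster(samples, vigilance=0.8, learning_rate=0.5):
    """Return (centers, labels) for samples whose features lie in [0, 1]."""
    centers, labels = [], []
    for x in samples:
        # Assumed activation: 1 - mean city block distance to each output neuron.
        activations = [1 - sum(abs(a - w) for a, w in zip(x, c)) / len(x)
                       for c in centers]
        best = max(range(len(centers)), key=activations.__getitem__, default=None)
        if best is not None and activations[best] > vigilance:
            # Match above threshold: move the winning weights toward the sample.
            centers[best] = [w + learning_rate * (a - w)
                             for a, w in zip(x, centers[best])]
        else:
            # No neuron matches well enough: add a new output neuron.
            centers.append(list(x))
            best = len(centers) - 1
        labels.append(best)
    return centers, labels

def merge_clusters(centers, mine_center):
    """Cluster farthest from the mine center becomes 'non-mine', the rest 'mine'."""
    dist = [sum(abs(a - b) for a, b in zip(c, mine_center)) for c in centers]
    farthest = dist.index(max(dist))
    return ["non-mine" if i == farthest else "mine" for i in range(len(centers))]
```

The merge step encodes the safety rule from the text: only the single cluster that looks least like the mine data is released as non-mine.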
3.2.3 Fuzzy c-means
In a classical clustering algorithm we have crisp membership of a class (either one or zero), but while classifying the mine data it is not very easy to differentiate between mine and non-mine. So we need a method that can tell the membership of the data in each class. If this membership is average, we treat the sample as special data and classify it in the mine class (as mines are dangerous!). [R3]
The algorithm works with the following quantities, where X is the feature vector and p is the number of classes (p = 2 in our case):
Membership values
Euclidean distance
Mean center prototype (Ci)
If the difference between the membership values and the previous membership values is less than a threshold, the algorithm terminates with the membership values for each class.
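As an illustration, the sketch below follows the standard fuzzy c-means iteration: each center is the membership-weighted mean of the samples, distances are Euclidean, and the membership of sample j in class i is 1 / sum over k of (d_ij / d_kj)^(2/(m-1)). The fuzzifier m = 2, the random initial memberships, the tolerance and the `margin` used to send ambiguous samples to the mine class are assumptions for the example, not values taken from the project.

```python
import math
import random

def fuzzy_c_means(samples, p=2, m=2.0, tol=1e-4, max_iter=100, seed=0):
    """Return (centers, u) where u[i][j] is the membership of sample j in class i."""
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    # Random initial memberships, normalised so each sample's memberships sum to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(p)]
    for j in range(n):
        col = sum(u[i][j] for i in range(p))
        for i in range(p):
            u[i][j] /= col
    centers = []
    for _ in range(max_iter):
        # Mean center prototype: membership-weighted mean of the samples.
        centers = []
        for i in range(p):
            w = [u[i][j] ** m for j in range(n)]
            centers.append([sum(wj * x[d] for wj, x in zip(w, samples)) / sum(w)
                            for d in range(dim)])
        # Euclidean distance of every sample to every center (floored to avoid /0).
        dist = [[max(math.dist(x, c), 1e-12) for x in samples] for c in centers]
        # Membership update from the distance ratios.
        new_u = [[1.0 / sum((dist[i][j] / dist[k][j]) ** (2 / (m - 1)) for k in range(p))
                  for j in range(n)] for i in range(p)]
        change = max(abs(new_u[i][j] - u[i][j]) for i in range(p) for j in range(n))
        u = new_u
        if change < tol:  # memberships stopped changing: terminate
            break
    return centers, u

def label_with_safety(u_mine, margin=0.1):
    """Ambiguous samples (mine membership near 0.5) are treated as mines."""
    return "mine" if u_mine >= 0.5 - margin else "non-mine"
```

`label_with_safety` shows one possible way to apply the rule that samples with average membership are put in the mine class; deciding which cluster index corresponds to mines again needs labelled samples.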
3.2.4 Gustafson-Kessel Algorithm
It is an improvement of the fuzzy c-means clustering algorithm. The correlation between the data is not considered in c-means. In this algorithm we redefine our distance formula as: [R3]
Mahalanobis distance, where Ai is the norm-inducing matrix of cluster i, xj is the sample attribute vector and ci is the cluster center.
The matrix Ai is obtained from the covariance matrix, which is calculated as:
Fuzzy covariance matrix
Mean center prototype
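For reference, the quantities named above are usually written as follows in the standard Gustafson-Kessel formulation. The notation is an assumption and may differ from the formulas originally shown here: u_ij is the membership of sample x_j in cluster i, m is the fuzzifier, n is the number of features, N is the number of samples and rho_i is the cluster volume (commonly set to 1).

```latex
\begin{aligned}
c_i &= \frac{\sum_{j=1}^{N} u_{ij}^{m}\, x_j}{\sum_{j=1}^{N} u_{ij}^{m}}
  && \text{(mean center prototype)} \\
F_i &= \frac{\sum_{j=1}^{N} u_{ij}^{m}\,(x_j - c_i)(x_j - c_i)^{T}}{\sum_{j=1}^{N} u_{ij}^{m}}
  && \text{(fuzzy covariance matrix)} \\
A_i &= \left[\rho_i \det(F_i)\right]^{1/n} F_i^{-1}
  && \text{(norm-inducing matrix)} \\
d_{ij}^{2} &= (x_j - c_i)^{T} A_i\, (x_j - c_i)
  && \text{(Mahalanobis-type distance)}
\end{aligned}
```

Because A_i is built from the covariance of each cluster, correlated features stretch the distance along the cluster's own axes, which is what plain fuzzy c-means with Euclidean distance cannot do.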

3.2.5 Gath-Geva Algorithm
This algorithm assumes that the data is normally distributed. [R3]

Distance: the distance formula includes the a-priori probability of data belonging to cluster i, and the mean center prototype.

The symbols have the same explanation as above.

Before applying this algorithm, it is advisable to check whether the data is normally distributed.
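In the commonly cited form of the Gath-Geva algorithm, the distance is built from a Gaussian likelihood, as sketched below. The symbols are assumptions and may differ from the original formulas: P_i is the a-priori probability of cluster i, F_i is the fuzzy covariance matrix of cluster i, c_i is its center, u_ij is the membership of sample x_j in cluster i and N is the number of samples. Some presentations also include a constant factor (2π)^{n/2}.

```latex
\begin{aligned}
P_i &= \frac{1}{N} \sum_{j=1}^{N} u_{ij} \\
d_{ij}^{2} &= \frac{\sqrt{\det(F_i)}}{P_i}\,
  \exp\!\left(\tfrac{1}{2}\,(x_j - c_i)^{T} F_i^{-1} (x_j - c_i)\right)
\end{aligned}
```

The exponential term is the reason the normality assumption matters: the distance grows very quickly for samples that are unlikely under a Gaussian centred on the cluster.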
3.2.6 Kohonen SOM
A competitive network learns to categorize the input vectors presented to it. If a
neural network just needs to learn to categorize its input vectors, then a competitive
network will do. Competitive networks also learn the distribution of inputs by
dedicating more neurons to classifying parts of the input space with higher densities
of input.[B3]
A self-organizing map learns to categorize input vectors. It also learns the
distribution of input vectors. Feature maps allocate more neurons to recognize parts
of the input space where many input vectors occur and allocate fewer neurons to
parts of the input space where few input vectors occur.
Self-organizing maps also learn the topology of their input vectors. Neurons
next to each other in the network learn to respond to similar vectors. The layer of
neurons can be imagined to be a rubber net that is stretched over the regions in the
input space where input vectors occur.
Self-organizing maps allow neurons that are neighbors to the winning neuron to
output values. Thus the transition of output vectors is much smoother than that
obtained with competitive layers, where only one neuron has an output at a time.
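The behaviour described above, a winning neuron plus neighbours that also learn, can be sketched with a small one-dimensional map in Python. This is an illustrative sketch only: the number of neurons, the linear decay schedule, the Gaussian neighbourhood and the function names are assumptions, and input features are assumed to be normalized to [0, 1].

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the neuron whose weights are closest to x (winner-take-all)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def train_som(samples, n_neurons=4, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train a 1-D self-organizing map and return the neuron weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs           # shrink step size and neighbourhood
        sigma = max(radius * decay, 1e-3)
        for x in samples:
            winner = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Neighbours of the winner on the 1-D grid also move toward x.
                h = math.exp(-((i - winner) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * decay * h * (x[d] - w[d])
    return weights
```

After training, `best_matching_unit(weights, sample)` gives the cluster index of a new sample; setting the neighbourhood to zero would reduce this to a plain competitive layer.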
We now have a brief overview of the algorithms that have been implemented. Next I will discuss the architectural design of the classification system, followed by the results.

Chapter 4

System Architecture

As already discussed, the input can be in image form or tabular form. Matlab has been used to extract the features from the input images. This gives a table of numerical attributes with an entry stating whether each sample belongs to the mine or non-mine class. The genetic algorithm, however, requires a categorized table, so the data is categorized into three categories (Low, Medium and High), with a class value of simply mine or non-mine.
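One simple way to produce such a categorized column is equal-width binning, sketched below in Python. The binning scheme and the function name `categorize` are assumptions; the report does not state how the Low/Medium/High boundaries were actually chosen.

```python
def categorize(values, labels=("Low", "Medium", "High")):
    """Map numeric values to categories using equal-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels) or 1.0  # avoid zero width for constant columns
    result = []
    for v in values:
        # The maximum value falls on the top edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result
```

Applied to each numeric attribute (for example blobsize), this yields the categorized table that the genetic algorithm expects, while the class column stays mine or non-mine.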

4.1 Data source name and login


The data is maintained in MS Access. The user is free to enter any data but needs to configure the database first, using Control Panel -> Administrative Tools -> Data Sources (ODBC) -> System DSN -> Configure. After the configuration, the database is assigned a DSN name. The application asks for this DSN name at start-up, together with a user name and password that can be obtained from the help (because the application is designed for demonstration, the user name and password are given in the help). When the Connect button is pressed, if the user name and password are correct and the entered DSN exists, a new page opens with all the algorithms and the table selection facility.

4.2 Algorithm and table selection


Any table existing in the input database can be selected together with the algorithm (select a categorized table if you apply the genetic algorithm). The algorithm-specific results are then displayed.
Different algorithms can ask for input parameters; for example, a clustering algorithm can ask for the number of clusters.
The interface is self-explanatory, with proper help. Java has been used for the front end, Microsoft Access XP for the database at the back end, and a JDBC bridge to communicate between the algorithms and the database.

Fig 1: Flow of information

Fig 2: Main frame of algorithms


Chapter 5

Result And Conclusion


5.1 Results
This module has been implemented successfully. Its ultimate objective is to compare the various algorithms and differentiate between them on the basis of their accuracy and results.
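When comparing the algorithms, accuracy alone hides the difference between the two kinds of error, since a missed mine is far more serious than a false alarm. The Python sketch below shows one way to count both from the actual and predicted labels; the function names and the example labels are illustrative and not taken from the project's results.

```python
def confusion_counts(actual, predicted, positive="mine"):
    """Return (tp, fp, fn, tn) with `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # mine correctly detected
        elif p == positive:
            fp += 1   # non-mine predicted as mine (safe, but costs time)
        elif a == positive:
            fn += 1   # mine predicted as non-mine (dangerous)
        else:
            tn += 1   # non-mine correctly rejected
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of samples that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)
```

Two algorithms with the same accuracy can then still be ranked by their `fn` count, the number of mines predicted as non-mines.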
Genetic algorithm:

Fig 3 : Result of genetic algorithm


This snapshot displays the results of both the association rules and the genetic rules. It is clear that the genetic algorithm has generated rules having a negated attribute value in the antecedent part, so this algorithm is very useful for establishing rules.
ART

Fig 4 : Result of ART algorithm


The ART algorithm also gives good results. ART gives a multiple-class distribution of the given data. Because predicting a non-mine as a mine is not as dangerous as predicting a mine as a non-mine, all the clusters at a greater distance from the non-mine center have been assigned the mine class.

Fuzzy c-means

The fuzzy c-means algorithm gives rules with a membership value in each class. It is therefore easy to spot data that cannot be clearly classified as mine or non-mine, and this type of data can be put into the mine class to avoid danger.

Gustafson-Kessel

This algorithm removes the drawback of the fuzzy c-means algorithm because it considers the correlation between the data.

Fig 5 : Result of Fuzzy c-means and Gustafson-Kessel algorithm

K-means Algorithm

The K-means algorithm is non-adaptive and time consuming, and gives an accuracy of 65%.

Fig 6 : Result of Kmean algorithm

K-nearest neighbour algorithm

This algorithm is useful if someone wants to know the class of a given sample. First of all, the training data must be given along with the inputs and the number of nearest neighbours. On the basis of the classes of the nearest neighbours, the algorithm predicts the possible class of the input data.
The algorithm gives good results when K is large, which also makes it very time consuming.
Fig 7 : Result of Knearest neighbour algorithm

Kohonen SOM

This algorithm is also used for clustering, and it is quite a fast algorithm based on the ‘winner take all’ strategy. It differentiates mine and non-mine data with up to 80% accuracy.
Fig 8 : Result of Kohonen SOM algorithm

5.2 Conclusion
All eight algorithms have been implemented to compare the results. This classifier gives results with 80% accuracy. The best results are given by ART and the genetic algorithm. Fuzzy c-means and Gustafson-Kessel are also good because of the membership values for each class. This module can differentiate between the PVC tube, wood piece, brass tube and copper cylinder (non-mine data) and the mine data obtained from JRC Israel (http://apl-database.jrc.it).

5.3 Future Extensions
We contemplate the following future features, which can be incorporated into this project:
5.3.1 Improvement in the genetic algorithm: The genetic algorithm implemented in this module incorporates only point mutation, so other types of mutation, such as deletion, insertion and segment mutation, can also be tried, and the crossover and mutation probabilities can be tuned to get better results.
5.3.2 Distributed computing environment: Generally we have to deal with large databases, because it is very hard to predict the exact class of data on the basis of a 100-tuple database. In practical, real-life applications we have several GB of data. To operate on this much data we need distributed databases and distributed computing.
5.3.3 Dealing with various platforms and formats: The data may come in various formats and database systems, so the system should be flexible enough to handle various formats and DBMSs (Oracle, MySQL, etc.).

References

Books
B.1 Earl Gose, Steve Jost, Richard Johnsonbaugh (1996) Pattern Recognition and Image Analysis, Prentice Hall, ISBN 0132364158
B.2 Richard O. Duda, Peter E. Hart, David G. Stork (2001) Pattern Classification (2nd edition), Wiley, New York, ISBN 0471056693
B.3 Valluru B. Rao, C++ Neural Networks and Fuzzy Logic (2nd edition)

Research Papers
R.1 J. A. Vasconcelos, J. A. Ramírez, R. H. C. Takahashi, R. R. Saldanha, Improvements in Genetic Algorithms, IEEE Transactions on Magnetics, Vol. 37, No. 5, September 2001.
R.2 Gail A. Carpenter, Marin N. Gjaja, Sucharita Gopal, Curtis E. Woodcock, ART Neural Networks for Remote Sensing: Vegetation Classification from Landsat TM and Terrain Data.
R.3 J. C. Bezdek, S. K. Pal (1992) Fuzzy Models for Pattern Recognition, IEEE Press, New York.
