
A hybrid GA and SA algorithms for feature selection in
recognition of hand-printed Farsi characters
Reza Azmi, Boshra Pishgoo, Narges Norozi, Maryam Koohzadi, Fahimeh Baesi
Department of Computer
Alzahra University
Tehran, Iran
azmi@alzahra.ac.ir
Abstract-In this research, a hybrid feature selection
technique based on genetic and simulated annealing
algorithms is proposed. This approach is evaluated
using a Bayesian classifier on a dataset of hand-printed
Farsi characters that includes 100 samples for each of 33
hand-printed characters. The results are further
improved by modifying simulated annealing to use
two thresholds, a minimum and a maximum.
Keywords: Genetic Algorithm, Simulated Annealing,
pattern recognition, feature selection
I. INTRODUCTION
In a pattern recognition system, achieving good
effectiveness depends strongly on the selected features.
Pattern recognition datasets are often characterized
by a large number of irrelevant or redundant features
that may significantly degrade the recognition results
and reduce the learning speed of algorithms, so
adopting a relevant subset of features is important for
managing the dimensional complexity of this problem.
Using all extracted features does not always give
the desired result, and it also increases the
time complexity of the recognition process.
Generally, feature selection means finding a subset of
features that improves the recognition accuracy.
This process has two main phases. The first phase
is a search strategy that selects one feature subset
among all possible subsets; the second phase is a
method that evaluates the selected subsets by
assigning a fitness value to them [1].
978-1-4244-6585-9/10/$26.00 ©2010 IEEE
There are many algorithms for feature selection.
In general, feature selection methods can be classified
into three groups. In the first group, all possible subsets
of the complete feature set are considered and evaluated
to reach a suitable subset. In the second group,
heuristic methods such as forward and backward
selection are adopted. In this group the algorithm
starts with a subset of features, and other
features are iteratively and dynamically added
to or omitted from it. Finally, in the third group, random
search methods such as the genetic algorithm and
simulated annealing can be used. Reference
[2] has evaluated the efficiency of these algorithms
in detail. Reference [3] has divided feature
selection methods into wrapper and filter
methods according to whether they are dependent on
or independent of the classifier. References [4,5,6]
have reviewed and compared the effect of different
feature selection algorithms on the effectiveness of
text classifiers. In the present article, after feature
extraction by the characteristic loci method, we first
use the genetic algorithm, then its combination with
the SA algorithm, and finally the improved combined
algorithm as our feature selection method. Our
classifier in this test is a simple Bayesian classifier.
II. PRE-PROCESSING
Generally, in image processing, the first step is
pre-processing. It includes several stages, among
them slant correction, normalization and thinning.
Reference [7] reviews different normalization
techniques and their implementation. Reference [8]
also describes how to implement these three
stages on hand-printed Farsi script. In this
article, the Farsi alphabet characters are fully
normalized and each character is enclosed
in a rectangular image.
III. FEATURE EXTRACTION
In this article, the features in the images are
extracted by the characteristic loci method.
The basis of this method is that the vertical and
horizontal lines passing through each image pixel can
cross the black strokes in each of the four main
directions at 0, 1 or 2 points (the number may of
course exceed two, but in this method it is reduced
to two). Therefore, to each image pixel we can assign
a four-digit number in base three, in which each digit
represents the number of crossings in one of the
main directions and has a value between 0 and 2.
So we can assign to every pixel a decimal number
between 0 and 80. With this method, 81 features can
be extracted from each image. The value of feature i
equals the number of occurrences of the number i-1
among the decimal numbers assigned to the image
pixels.
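The extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the image is assumed to be a list of rows with 1 for a black pixel and 0 for white, a crossing is assumed to be a white-to-black transition along the ray, and the digit order (up, right, down, left) is an arbitrary choice. The names `ray_crossings` and `loci_features` are hypothetical.

```python
# Four main directions as (row step, column step): up, right, down, left.
# The digit order is an assumption; the text does not fix it.
DIRECTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def ray_crossings(image, r, c, dr, dc, cap=2):
    """Count white-to-black transitions along a ray leaving (r, c), capped at `cap`."""
    rows, cols = len(image), len(image[0])
    count = 0
    prev = image[r][c]
    r, c = r + dr, c + dc
    while 0 <= r < rows and 0 <= c < cols:
        cur = image[r][c]
        if cur == 1 and prev == 0:
            count += 1
            if count == cap:  # counts above two are reduced to two
                break
        prev = cur
        r, c = r + dr, c + dc
    return count


def loci_features(image):
    """Return 81 features: how often each base-3 code 0..80 occurs over all pixels."""
    histogram = [0] * 81
    for r in range(len(image)):
        for c in range(len(image[0])):
            code = 0
            for dr, dc in DIRECTIONS:
                # Each direction contributes one base-3 digit (0, 1 or 2).
                code = code * 3 + ray_crossings(image, r, c, dr, dc)
            histogram[code] += 1
    return histogram
```

In this sketch `loci_features(image)[i - 1]` plays the role of feature i in the text, since Python lists are zero-based.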
IV. FEATURE SELECTION
As mentioned in section 1, using all features
extracted from the image not only increases the
amount of computation but also reduces test speed,
and it does not always give the best
recognition percentage.
To demonstrate this, we first use all extracted
features and train our classifier with them. Then, by
evaluating on the test data, we obtain the recognition
percentage for this case. Next, we perform feature
selection through the genetic algorithm and simulated
annealing. Afterwards, the classifier is trained again
on the feature sets selected by these algorithms and
evaluated on the test data. Finally, the results are compared.
A. Feature selection through genetic algorithm
The genetic algorithm is a random search method
that uses the theory of gradual evolution for problem
solving. One of the main ideas of this algorithm is
keeping a set of the best answers in a population. As in
the biological theory of evolution, this algorithm has a
mechanism for choosing the best chromosomes in a
generation. In this process, operations such as
crossover and mutation are applied to the chosen
chromosomes. A great deal of research has been done
on feature selection through genetic algorithms.
In reference [9], an approach based on the genetic
algorithm is presented for the recognition of
hand-printed characters. In reference [8], the genetic
algorithm is used to select a suitable feature set
in order to improve a recognition system for
hand-printed Farsi characters.
In this article, as mentioned in the previous section,
81 features are extracted from each character.
For convenience, the number of extracted features is
denoted by N. The genetic algorithm should
select from these N features a suitable subset for
classification, so that the number of features is
reduced, which lowers the amount of computation,
while a suitable recognition percentage is still
obtained. Each chromosome of this
genetic algorithm is a binary string of length N.
In other words, each chromosome has N genes.
A gene equal to one indicates selection, and a gene
equal to zero indicates non-selection, of the feature
corresponding to that gene.
Figure 1 illustrates this concept.
Figure 1. Presentation of one chromosome [6]
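The binary encoding described above can be sketched in a few lines. The helper names (`random_chromosome`, `selected_indices`, `apply_mask`) are hypothetical and only illustrate how a chromosome of N genes selects a subset of a feature vector.

```python
import random


def random_chromosome(n_features, rng):
    """Create a chromosome of n_features genes, each 0 or 1 with equal chance."""
    return [rng.randint(0, 1) for _ in range(n_features)]


def selected_indices(chromosome):
    """Zero-based indices of the features whose gene is 1."""
    return [i for i, gene in enumerate(chromosome) if gene == 1]


def apply_mask(sample, chromosome):
    """Keep only the feature values whose corresponding gene is 1."""
    return [value for value, gene in zip(sample, chromosome) if gene == 1]


# Example: a 4-feature sample and a chromosome selecting features 1 and 4.
reduced = apply_mask([10, 20, 30, 40], [1, 0, 0, 1])
```

With N = 81, `random_chromosome(81, random.Random())` gives one member of a random initial population.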
The initial population of this algorithm is
selected completely at random. The algorithm then
calculates the fitness of each chromosome using the
fitness function. The fitness function used in this
article is in fact the same simple Bayesian classifier,
which is trained each time on the features selected by
a chromosome and returns the recognition percentage.
In each generation, three groups of chromosomes are
transferred to the next generation. The first group
consists of the chromosomes whose recognition
percentage is higher than a threshold. This threshold
depends on the generation number and increases in
later generations. The second and third groups are
chromosomes created, respectively, by crossover
applied to half of the chromosomes of the first group
and by mutation applied to one fourth of the other
chromosomes of the current generation, chosen at random.
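One possible reading of the three groups is sketched below. Several details are assumptions, not taken from the paper: the linear threshold schedule in `threshold_for` and its constants, the pairing of crossover parents, and the interpretation of "other chromosomes" as those at or below the threshold. The `fitness`, `crossover` and `mutate` arguments are callables supplied by the caller.

```python
import random


def threshold_for(generation, start=0.70, step=0.005):
    """Hypothetical schedule: the survival threshold rises with the generation number."""
    return start + step * generation


def next_generation(population, fitness, generation, crossover, mutate,
                    rng: random.Random):
    """Build the next generation from the three groups described in the text."""
    scored = [(fitness(ch), ch) for ch in population]
    limit = threshold_for(generation)

    # Group 1: chromosomes whose recognition rate exceeds the threshold.
    elite = [ch for score, ch in scored if score > limit]
    others = [ch for score, ch in scored if score <= limit]

    # Group 2: crossover applied to a random half of the first group,
    # taken two parents at a time.
    parents = rng.sample(elite, len(elite) // 2)
    children = []
    for a, b in zip(parents[0::2], parents[1::2]):
        children.extend(crossover(a, b, rng))

    # Group 3: mutation applied to a random quarter of the other chromosomes.
    mutants = [mutate(ch, rng) for ch in rng.sample(others, len(others) // 4)]

    return elite + children + mutants
```

Note that in this sketch the population size is not fixed from one generation to the next; the text does not say whether the original implementation tops the population up.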
The crossover operation chooses a random
integer between 1 and N and exchanges the tails of
two chromosomes from the crossover point.
The mutation operation also generates a
random number between 1 and N and flips the gene at
that position: it becomes zero if it was one, and vice versa.
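The two operators can be sketched as below, assuming chromosomes are Python lists of 0/1 genes. The function names are illustrative. The random position is drawn from 1 to N inclusive, as in the text, so a crossover point equal to N leaves both parents unchanged.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: swap the tails of two parents after a random point."""
    n = len(parent_a)
    point = rng.randint(1, n)  # 1..N inclusive
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, rng):
    """Flip one randomly chosen gene; the input chromosome is left untouched."""
    position = rng.randint(1, len(chromosome))  # 1-based position, as in the text
    mutant = list(chromosome)
    mutant[position - 1] = 1 - mutant[position - 1]
    return mutant
```

Returning new lists instead of modifying the parents in place keeps the current generation intact while the next one is being built.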
It must be noted that selecting features with
the genetic algorithm is very time consuming,
because for each chromosome the classifier
must be trained and evaluated on all the data at least
once. In the end, however, test samples are classified
faster and with lower error.
In this test, after a few generations the genetic
algorithm achieves a recognition percentage higher than
the initial percentage. Notably, the
number of features selected by this
method is also reduced: with the characteristic loci
method, the number of features is reduced
from 81 to 55. Results are shown in Table 1.
The genetic algorithm has one main problem:
it always selects the chromosomes with the best
recognition percentages and then performs mutation
and crossover on them. However, a chromosome that
does not itself have a suitable recognition percentage
may, with notable probability, yield
a very high recognition percentage when
combined with another chromosome. One suitable idea
for eliminating this problem is to combine the genetic
algorithm with the simulated annealing algorithm,
as discussed below.
B. Feature selection through combination of
Simulated Annealing and genetic algorithm
To solve this problem of the genetic algorithm, we
combine it with the SA algorithm. The procedure is the
same: here too we first create a few random
chromosomes and calculate the recognition percentage
of each chromosome through the fitness function.
However, instead of transferring to the next
generation only those chromosomes whose
recognition percentage is higher than the threshold,
all chromosomes of the current generation have a
chance of being present in the next generation,
according to their recognition percentage. Naturally,
a chromosome with a lower
recognition percentage has less chance of being
transferred to the next generation, and vice versa.
The important point is that this chance, although very
low, still exists, whereas in the genetic
algorithm chromosomes with a low recognition
percentage have no chance of being present in the
next generation.
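The paper does not give the exact survival probability, so the sketch below makes a simple assumption: each chromosome survives with probability equal to its recognition rate, expressed as a fraction between 0 and 1. The function name `probabilistic_survivors` is hypothetical.

```python
import random


def probabilistic_survivors(population, rates, rng):
    """Keep each chromosome with probability equal to its recognition rate.

    `rates[i]` is the recognition rate of `population[i]` as a fraction in [0, 1].
    Low-rate chromosomes survive rarely, but their chance is never exactly zero
    unless their rate is zero.
    """
    return [ch for ch, rate in zip(population, rates) if rng.random() < rate]
```

Because every decision is random, even the best chromosome of a generation can be lost under this rule unless its rate is exactly 1.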
Although the proposed combination of SA and GA
makes it possible to select such chromosomes,
it has one deficiency. Since the chromosomes of the
next generation are selected on a probabilistic basis,
the best chromosomes of a population
might not enter the next generation at all. Therefore,
although better chromosomes may appear after many
generations, it is also possible that
the population gradually deteriorates.
To resolve this problem, it is enough to modify SA
slightly in the creation of the next generation.
Here, the selection of chromosomes for the next
generation is done with two thresholds, a high one and
a low one. For convenience, these two levels are
abbreviated to t and B. If the recognition percentage
of a chromosome is lower than the low threshold t, it
will never have the chance of entering the next
generation. Conversely, if the recognition percentage
of a chromosome is higher than the high threshold B,
it will definitely be transferred to the next
generation. All chromosomes with a recognition
percentage between the two thresholds t and B are
given a chance of being present in the next generation
based on their recognition percentage. So the valuable
chromosomes of the present generation will definitely
be in the next generation, and the very poor
chromosomes of this generation will not be present in
the next generation. The remaining chromosomes have
a probability of being present in the next generation
that is proportional to their recognition percentage.
With this mechanism we can obtain more useful results
than with the two methods mentioned above.
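A minimal sketch of the two-threshold rule follows. The rule for the middle band is an assumption: the text only says the chance depends on the recognition percentage, so here the probability is rescaled linearly from 0 at the low threshold to 1 at the high threshold. The parameters `low` and `high` stand for the thresholds t and B, and the function names are illustrative.

```python
import random


def survival_probability(rate, low, high):
    """Probability of entering the next generation under the two-threshold rule."""
    if high <= low:
        raise ValueError("high threshold must be greater than low threshold")
    if rate < low:
        return 0.0  # below t: never survives
    if rate > high:
        return 1.0  # above B: always survives
    # Between the thresholds: assumed linear in the recognition rate.
    return (rate - low) / (high - low)


def two_threshold_survivors(population, rates, low, high, rng):
    """Select next-generation chromosomes using a low and a high threshold."""
    survivors = []
    for chromosome, rate in zip(population, rates):
        if rng.random() < survival_probability(rate, low, high):
            survivors.append(chromosome)
    return survivors
```

Since `rng.random()` returns values in [0, 1), a probability of 1.0 always passes and a probability of 0.0 never does, which gives the guaranteed survival and guaranteed removal described above.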
V. EVALUATION AND COMPARISON
The methods presented above
have been implemented and evaluated using a
Bayesian classifier on a dataset of hand-printed Farsi
characters that includes 100 samples for each of the 33
hand-printed characters. One fourth of all samples
are used as test samples and the remainder
as training samples. The features of these samples
are extracted by the characteristic loci
method after the pre-processing steps introduced above.
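The evaluation of one feature subset can be sketched as below. The paper only says "simple Bayesian classifier"; this sketch assumes a Gaussian naive Bayes model, and the variance floor `var_floor` is an added assumption to avoid division by zero on constant features. All function names are hypothetical, and the training and test sets are passed in already split (for example one fourth for testing, as in the text).

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples, labels, var_floor=1e-3):
    """Fit per-class log prior, feature means and (floored) feature variances."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[y] = (math.log(n / len(samples)), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=lambda y: log_posterior(model[y]))


def recognition_rate(chromosome, train_x, train_y, test_x, test_y):
    """Fitness of a chromosome: test accuracy using only its selected features."""
    keep = [i for i, gene in enumerate(chromosome) if gene == 1]

    def project(x):
        return [x[i] for i in keep]

    model = train_gaussian_nb([project(x) for x in train_x], train_y)
    hits = sum(predict(model, project(x)) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)
```

Each call retrains the classifier from scratch, which is why evaluating a whole population in every generation is expensive.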
Figure 2. Sample set of hand-printed Farsi characters
In the first step, a simple Bayesian classifier is
trained with all 81 extracted features, and in the
test stage approximately 77 percent
of samples are classified correctly.
In the second step, we performed feature selection
through the genetic algorithm as described in
section 4.1. In this case the genetic algorithm, by
omitting redundant features and selecting effective and
suitable ones, not only increases the
recognition percentage of characters to 80 percent
but also reduces the number of features used for
training the classifier from 81 to 55.
In the third step, in order to resolve the problem
of the genetic algorithm, SA was added to the method,
so that feature selection was performed
through the combined genetic algorithm and SA.
In this case the algorithm, by omitting
redundant features and selecting suitable feature sets,
increases the recognition percentage of characters to
about 82 percent. The number of selected
features was reduced from 81 to 60. Table 1
shows the results.
Figure 3. Comparative curves of the progress of classification
percentage over generations for GA, GA+SA, and
GA+SA with two high and low thresholds
Figure 3 gives a clear picture of the
progress of the mentioned algorithms over the
generations. As can be seen, the curves for the
genetic algorithm and for the combined genetic
algorithm and SA with two high and low thresholds are
ascending, whereas the curve for the combined genetic
algorithm and simple SA sometimes descends
and sometimes ascends. This variation is the effect
of the probability factor mentioned
in section 4.2. In addition, this comparison
shows the superiority of the combined genetic algorithm
and SA with two thresholds in selecting suitable
features for classification.
VI. SUMMARY
A character recognition system must have high
accuracy, high speed and simple tools. Many methods
can be devised that differ in feature
extraction and classification and produce different
results. However, extracting all features is not always
useful. In this article, to reduce the dimension of the
problem, the genetic algorithm and the combined
genetic and SA algorithm were used, and it was
ultimately shown that using all features extracted
from an image for classification not only increases the
complexity of the calculations but also does not always
yield the highest recognition percentage. Therefore,
reducing the dimension of the problem through such
algorithms seems necessary.
Table 1 - Results and Comparison
Classifier                                  Classification rate (%)   Number of features
No feature selection                        77                        81
Feature selection by GA                     80                        55
Feature selection by combination of GA      82                        60
and SA using two high and low thresholds
REFERENCES
[1] Ho-Duck Kim, Chang-Hyun Park, Hyun-Chang Yang,
Kwee-Bo, "Genetic Algorithm Based Feature Selection
Method Development for Pattern Recognition",
appears in: SICE-ICASE International Joint Conference,
pp. 1020-1025, 2006
[2] D. Zongker and A. Jain, "Algorithms for Feature Selection:
An Evaluation", appears in: Pattern Recognition,
Proceedings of the 13th International Conference, volume 2,
pp. 18-22, 1996
[3] Jalili Saeid, Bitarafan Mahi, "increment text
classification performance based improve feature
selection methods", volume 40, pp. 313-328, 2006 (in
Farsi)
[4] Janez Brank, Marko Grobelnik, Natasa Milic-Frayling, Dunja
Mladenic, "Interaction of Feature Selection Methods and
Linear Classification Models", Proceedings of the ICML-02
Workshop on Text Learning, 2002
[5] Anirban Dasgupta, Petros Drineas, Boulos Harb, "Feature
Selection Methods for Text Classification",
Proceedings of the 13th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining,
pp. 230-239, 2007
[6] Huiqing Liu, Jinyan Li, Limsoon Wong, "A Comparative
Study on Feature Selection and Classification Methods Using
Gene Expression Profiles and Proteomic Patterns", Genome
Informatics 13, pp. 51-60, 2002
[7] Cheng-Lin Liu, Kazuki Nakashima, Hiroshi Sako and
Hiromichi Fujisawa, "Handwritten digit recognition:
investigation of normalization and feature extraction
techniques", Pattern Recognition Society, published by
Elsevier Science B.V., Volume 37, pp. 265-279,
2004
[8] Kheirkhah Ahmad Reza, Rahmanian Esmaeil,
"optimization of recognition of Farsi handwriting
character based effective feature selection by GA",
8th intelligence system conference in Ferdosi university,
2007 (in Farsi)
[9] L. Cordella, C. De Stefano, F. Fontanella and C. Marrocco, "A
Feature Selection Algorithm for Handwritten Character
Recognition", appears in: Pattern Recognition, ICPR 2008,
19th International Conference, pp. 1-4, 2008