
Recognition of Handwritten Indic Script Using Clonal Selection Algorithm

Utpal Garain¹, Mangal P. Chakraborty¹, and Dipankar Dasgupta²

¹ Indian Statistical Institute, 203, B.T. Road, Kolkata 700108, India
² The University of Memphis, Memphis, TN 38152
utpal@isical.ac.in, dasgupta@memphis.edu

Abstract. This work explores the potential of a clonal selection algorithm in pattern recognition (PR). In particular, a retraining scheme for the clonal selection algorithm is formulated for better recognition of handwritten numerals (a 10-class classification problem). An empirical study with two datasets (each containing about 12,000 handwritten samples of the 10 numerals) shows that the proposed approach exhibits very good generalization ability. Experimental results show an average recognition accuracy of about 96%. The effect of the control parameters on the performance of the algorithm is analyzed, and the scope for further improvement in recognition accuracy is discussed.

Keywords: Clonal selection algorithm, character recognition, Indic scripts, handwritten digits.

1 Introduction
Several immunological metaphors are now being used (in a piecemeal fashion) for designing Artificial Immune Systems (AIS) [1]. These approaches can be broadly classified into three groups, namely immune network models [2], negative selection algorithms [3], and clonal selection algorithms [4]. This paper investigates a new training approach for the clonal selection algorithm (CSA) and its application to character recognition. Earlier, CSA was used for a 2-class problem, namely discriminating pairs of similar character patterns [5]; the present study extends it to an m-class classification problem. Training in CSA has so far been modeled as a one-pass method in which each antigen undergoes a single training phase. Once training on all antigens is over, an immune memory is produced and used for solving the classification problem (as in [5] and [6]). Our work presents a new training algorithm in which a refinement phase is used to fine-tune the initial immune memory built from the single-pass training. In the refinement stage, training of an antigen depends on its recognition score: incorrect recognition of an antigen triggers further training. This process continues until the immune system suffers from negative learning or becomes over-learned. Recognition of handwritten Indic numerals has been considered to study the performance of the modified CSA. Because of its numerous applications in postal automation, bank check reading, etc., document image analysis researchers have been studying this problem for the last several years, and a number of methods have been proposed.

While some of these are biologically inspired approaches such as neural networks [7], genetic algorithms [8], etc., AIS approaches have remained unexplored for this application, though AIS techniques have been applied to several pattern recognition problems [9-14].

The rest of the paper is organized as follows. Section 2 describes the CSA with the proposed retraining scheme. Section 3 provides the experimental details and reports results highlighting the performance of the CSA in classifying handwritten numerals; this section also compares the new retraining scheme with the previously used single-pass approach and discusses the effect of the CSA control parameters on its performance. Section 4 provides some concluding remarks.

2 Classification Using Clonal Selection Algorithm

Let AG represent a set of training data (antigens) and agi an individual member of this set: AG = {ag1, ag2, …, agk}. Each agi has two attributes: a class, ag.c ∈ C = {c1, c2, …, cn} (n = 10 for digit classification), and a feature vector, ag.f. Binary images of handwritten numerals are first size-normalized into a 48x48 matrix, each element of which is binary; this matrix is used as the feature map in the experiments. The similarity between two such feature matrices, S(F1, F2), is a measure of the autocorrelation coefficient between F1 and F2, defined as

  S(F1, F2) = 1/2 − (s10 s01 − s00 s11) / (2 √[(s11 + s10)(s01 + s00)(s11 + s01)(s10 + s00)])     (1)

where s00, s11, s01, and s10 denote the number of zero matches, one matches, zero mismatches, and one mismatches, respectively. It is to be noted that S gives values in the range [0, 1], where 1 indicates the highest and 0 the lowest similarity between two samples. We use this metric to measure similarity/affinity during antibody-antibody or antigen-antibody interactions.

Training has two phases: Phase-I is the same as was used in [6], while Phase-II incorporates a refinement process. Phase-I involves three stages, namely initialization of the immune memory, clone generation, and selection of clones to update the immune memory. These stages are briefly discussed below.

Initialization: This stage deals with choosing some antigens as initial memory cells to initialize the immune memory. Let the immune memory be IM = {m1, m2, …, mm}, where mi is a memory cell having two attributes similar to those of an individual antigen: mi.c ∈ C = {c1, c2, …, cn} is the class information and mi.f is the feature vector. In the present study, only one antigen from each class is randomly chosen to initialize the immune memory (IM). It is to be noted that the number of initial cells has a certain effect on the system's performance, as illustrated in [6].

Clone generation: For a given antigen agi, its closest match (say, mi) is first chosen from the existing IM as follows:

  stim(agi, mi) ≥ stim(agi, mj), for all j ≠ i with mj.c = agi.c     (2)

The function stim() is used to measure the response of a b-cell to an antigen or to another b-cell and is directly proportional to the similarity between the feature matrices as defined in equation (1).
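As a concrete illustration (not part of the original paper), the similarity of equation (1) can be computed as below. This is a minimal Python/NumPy sketch that reads equation (1) as the normalized phi (autocorrelation) coefficient over two 48x48 binary matrices; the function name and the NumPy dependency are assumptions made for illustration only.

import numpy as np

def similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    # S(F1, F2) of equation (1); both inputs are 48x48 binary (0/1) matrices.
    f1, f2 = f1.astype(bool), f2.astype(bool)
    s11 = int(np.sum(f1 & f2))      # both pixels are 1 ("one matches")
    s00 = int(np.sum(~f1 & ~f2))    # both pixels are 0 ("zero matches")
    s10 = int(np.sum(f1 & ~f2))     # 1 in F1, 0 in F2
    s01 = int(np.sum(~f1 & f2))     # 0 in F1, 1 in F2
    denom = 2.0 * np.sqrt(float((s11 + s10) * (s01 + s00) * (s11 + s01) * (s10 + s00)))
    if denom == 0.0:                # degenerate case (an all-0 or all-1 image)
        return 1.0 if np.array_equal(f1, f2) else 0.0
    return 0.5 - (s10 * s01 - s00 * s11) / denom

Under this reading, identical matrices score 1 and strongly dissimilar ones score close to 0, matching the range stated above.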

After a memory cell mi (renamed mmatch) is selected for a training antigen agi, mmatch goes through a proliferation process (Proliferation-I), known as somatic hyper-mutation, that generates a number of clones of mmatch. No clone is an exact copy of mmatch: each clone is produced through mutation (controlled by MUTATION_RATE, a user-defined parameter) at selected sites of mmatch's feature matrix. The exact number of clones is determined by three parameters, namely (i) the hyper-mutation rate, (ii) the clonal rate, and (iii) stim(agi, mmatch). Note that the first two parameters are user-defined. The algorithms for Proliferation-I and for the generation of mutated clones are outlined in Algorithm-I and Algorithm-II, respectively; they are similar to the ones described in [6]. Here bj denotes an individual b-cell clone and B represents the entire cloned population.

On completion of hyper-mutation, a stimulation value is computed for each element bj ∈ B as stim(bj, agi). In order to minimize the computational cost of generating clones, a modified version of the resource limitation policy [15] is incorporated. The modified version considers only the recent clones generated for the current antigen undergoing the (maturation) training process; it does not consider clones generated for previous antigens, i.e., the present implementation allots the entire resource to the current antigen's class only.

The stopping criterion defined in equation (3) is used to terminate the training on an antigen agi:

  ( Σ_{j=1..|B|} bj.stim ) / |B|  >  STIMULATION_THRESHOLD     (3)

If this criterion is not met, further proliferation of the existing (i.e., survived after resource limitation) b-cells is invoked. In this stage (i.e., Proliferation-II), each surviving b-cell bj is proliferated to produce a number of clones determined by the resources allocated to it. The Proliferation-II process is similar to Proliferation-I as outlined in Algorithm-I, except for the calculation of the number of clones to be generated from each surviving b-cell bj; this number is determined only by the CLONAL_RATE and stim(agi, bj).

Algorithm I. Hyper-mutation/Proliferation-I
Let B be the set of b-cell clones to be created by somatic hyper-mutation, started with mmatch. Initially B = {mmatch}. Let Nc denote the number of clones, calculated as
  Nc ← HYPER_MUTATION_RATE * CLONAL_RATE * stim(agi, mi)
While (|B| ≤ Nc) Do
  mut ← false            // mut is a Boolean variable
  Call mutate(mi, mut)
  Let bj denote a mutated clone of mi
  If (mut) Then B ← B ∪ {bj}
Done
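To make the clone-generation step concrete, the following is a hedged Python sketch (not the authors' implementation) of Algorithm-I together with the stopping test of equation (3). The mutate() and stim() helpers are assumed to exist with the behaviour described in the text, and all names and default rates are illustrative only.

def proliferation_one(m_match, antigen, stim, mutate,
                      hyper_mutation_rate=2, clonal_rate=10):
    # Algorithm-I: grow a clone pool B from m_match by somatic hyper-mutation.
    nc = hyper_mutation_rate * clonal_rate * stim(antigen, m_match)
    clones = [m_match]
    while len(clones) <= nc:
        clone, mutated = mutate(m_match)   # Algorithm-II; assumed to return (copy, flag)
        if mutated:                        # keep only clones that actually differ
            clones.append(clone)
    return clones

def stimulation_reached(clone_stims, stimulation_threshold):
    # Stopping criterion of equation (3): mean clone stimulation above the threshold.
    return sum(clone_stims) / len(clone_stims) > stimulation_threshold

When the criterion fails, the surviving clones would be proliferated again (Proliferation-II), with the clone count per cell driven only by the clonal rate and stim(agi, bj), as described above.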

Algorithm II. Production of Mutated Clones
mutate(x, flag) {
  For each binary feature element (i, j) in x.f     // note that x.f is basically a matrix
  Do
    Generate a random number r in [0, 1]
    If (r < MUTATION_RATE) Then
      x.fi,j ← toggle(x.fi,j)
      flag ← true
    Endif
  Done
}

Clone selection and update of immune memory: Once the training criterion in equation (3) is met for an antigen, the most stimulated b-cell (w.r.t. the current antigen undergoing training) among the survivors is selected as a candidate (let bcandidate denote this cell) to be inserted into the immune memory. This process is outlined in Algorithm III, which is similar to the one in [6]. The algorithm makes use of two parameters, AS (average stimulation) and α (a scalar value). The parameter α is user-defined, whereas AS is measured from the input training antigen set as the average stimulation between all pairs of the mean values of the antigen classes.

Algorithm III. Update of immune memory
CandStim ← stim(agi, bcandidate)
MatchStim ← stim(agi, mmatch)
CellAff ← stim(mmatch, bcandidate)
If (CandStim > MatchStim) Then
  IM ← IM ∪ {bcandidate}          // insertion into the immune memory
  If (CellAff > α × AS) Then
    IM ← IM − {mmatch}            // memory replacement
  Endif
Endif

Phase-II of the training algorithm: Note that the training in Phase-I is a one-pass method, i.e., the system is trained only once on each training antigen. At the end of this phase, the immune memory IM0 = {m1, m2, …, mm} is produced, which is then used for classification of all the training antigens. In the present implementation, training involves a second phase, namely Phase-II, that employs a refinement process. In this method, recognition and training go hand in hand to obtain a better immune memory from its initial version IM0. In this phase, recognition of all the training antigens is first done using the immune memory (IMi, i = 0, 1, …) obtained in the previous stage (say, the i-th stage). The classification strategy outlined below is used for recognition of the antigens, and the recognition accuracy is noted. Next, the antigens for which incorrect classification is recorded act as bootstrap samples that undergo further training involving clone generation, selection, and updating of the immune memory as outlined above in Phase-I. This results in an updated immune memory (IMi+1). This newer version is retained if a better recognition accuracy (than what was obtained using IMi) is achieved; otherwise, IMi is reloaded and Phase-II terminates. It is observed that for a few iterations of Phase-II the newer versions of the immune memory continue to produce better recognition accuracy, after which the accuracy degrades, signaling negative (or over-) learning in the system. However, instead of using the training antigen set, a separate validation set could be used in this refinement phase; this modification will be considered in a future extension of the present study.
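The Phase-II retraining loop described above can be summarized by the following hedged Python sketch (illustrative only, not the authors' code). Here classify() and retrain() are stand-ins for the classification strategy and the Phase-I steps described in the text, and the .label attribute on antigens is an assumption of this sketch.

def refine(initial_memory, antigens, classify, retrain):
    # Phase-II: recognition and retraining go hand in hand until accuracy degrades.
    def accuracy(memory):
        return sum(classify(memory, ag) == ag.label for ag in antigens) / len(antigens)

    best_memory, best_acc = initial_memory, accuracy(initial_memory)
    while True:
        wrong = [ag for ag in antigens if classify(best_memory, ag) != ag.label]
        if not wrong:
            break                                  # every training antigen recognized
        # retrain() must return a new memory (IM_{i+1}) so IM_i can be kept on revert.
        candidate = retrain(best_memory, wrong)    # Phase-I steps on the bootstrap samples
        cand_acc = accuracy(candidate)
        if cand_acc <= best_acc:                   # negative / over-learning detected
            break                                  # discard IM_{i+1}, keep IM_i
        best_memory, best_acc = candidate, cand_acc
    return best_memory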

Classification strategy: Classification is implemented by a k-nearest neighbor (k-NN) approach. For a target antigen ag, the k (an odd number) closest memory cells mi ∈ IM are selected from the immune memory IM; closeness is measured by the stim function, i.e., stim(ag, mi) for all i. Next, these k memory cells are grouped based on their class labels, and the class of the largest group (a majority-voting strategy) identifies ag (an illustrative sketch of this decision rule is given below, after the experimental setup).

3 Experimental Details

Two different datasets (DS1 and DS2) [16] have been used to test the proposed classification approach based on the clonal selection algorithm (CSA). Unlike for English, Chinese, Japanese, etc., datasets consisting of a large number of samples of handwritten digits in Indic scripts have only recently become available in the public domain [16]; this facilitates training and testing of an approach and comparing it with other competing methods. Moreover, studies in Indic script handwriting recognition are rare, and this provides additional motivation for the present work to deal with datasets of handwriting in Indian languages. The datasets DS1 and DS2 contain samples of handwritten numerals in two major Indic scripts, namely Devanagari (Hindi) and Bengali, respectively. Both datasets contain real samples collected from different kinds of handwritten documents such as postal mail, job application forms, railway ticket reservation forms, passport application forms, etc. DS1 samples are randomly selected from a collection of 22,556 Devanagari numerals written by 1,049 persons, and DS2 samples are taken from a set of 12,938 Bengali numerals written by 556 persons. For our experiment, each dataset consists of 12,000 samples (an equal number of samples for each class). Some samples from each digit class are shown in Fig. 1.

The datasets are divided into six equal-sized partitions. Training is conducted on samples from five partitions and classification is tested on the sixth partition. This realizes a six-fold experiment that results in six test runs; the results reported next are averaged over these six runs.

Experiments are carried out under two different training policies: L1, where training is single-pass, and L2, the proposed method that employs the refinement process. Recognition accuracies under these two environments are reported in Table 1, and it is observed that L2 outperforms L1 by a significant margin. L2 generates a slightly larger immune memory than the one produced by L1. A significant difference is observed in the time required for training: L1 takes considerably less CPU time than L2, which involves the additional refinement phase. However, there is hardly any difference in the time needed for classification by the two approaches; in fact, the system can classify about 50 characters per second on a Pentium-IV (733 MHz, 128 MB RAM) PC. The absolute time taken during training and testing is outlined in Table 2.
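As referenced above, the following is a minimal, hedged Python sketch of the k-NN decision rule used for classification (the k most stimulated memory cells, majority vote, reject when no single class wins). It is an illustration only, not the authors' implementation; the stim argument stands for the similarity of equation (1), memory cells are assumed to carry a .label attribute corresponding to mi.c, and the tie-based reject is one plausible reading of "no single class gets majority".

from collections import Counter

def classify(immune_memory, antigen, stim, k=5):
    # Pick the k memory cells most stimulated by the target antigen (k odd).
    nearest = sorted(immune_memory, key=lambda m: stim(antigen, m), reverse=True)[:k]
    votes = Counter(m.label for m in nearest)      # group the k cells by class label
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                                # no single winning class: reject
    return ranked[0][0]                            # class of the largest group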

Fig. 1. Hundred random samples from the dataset of Bengali handwritten numerals

The performance of the proposed refinement stage is studied to check how rapidly the system attains the maximum classification rate on the training set. The response of this additional training module is shown in Fig. 2 for the dataset DS1; a similar behaviour is obtained for the other dataset too. Please note that iteration 0 represents the initial Phase-I training, in which all 10,000 training antigens are used. The number of antigens undergoing training in each pass is also plotted as a line curve in Fig. 2. It is to be noted that the recognition accuracy gradually increases till the 8th iteration, after which the accuracy degrades and training terminates. In fact, this is the first local maximum at which the training terminates; at present, the system does not attempt to find the global one.

Table 1. Recognition accuracies and size of immune memory with the two training algorithms (L1: single-pass training; L2: proposed method with refinement)

            Recognition accuracy      Size of immune memory
  Dataset   L1         L2             L1        L2
  DS1       93.57%     96.23%         912       1283
  DS2       92.31%     95.68%         1103      1472

Table 2. CPU time for training and classification using the two training algorithms

            Time to train              Classification speed (characters/second)
  Dataset   L1           L2            L1        L2
  DS1       5 h 14 min   7 h 05 min    52        49
  DS2       5 h 19 min   7 h 22 min    51        47

Fig. 2. Performance analysis of the bootstrap module

Next, the effects of the control parameters are studied with respect to two different measures: (i) recognition accuracy and (ii) size of the immune memory. Results are reported here for the new training algorithm; almost identical effects have been observed on both datasets, and the results on DS1 are shown in Fig. 3. The overall results reported in Table 1 are obtained with stimulation threshold = 0.89, number of resources = 400, mutation rate = 0.008, affinity threshold scalar α = 0.4, hyper-mutation rate = 2, and clonal rate = 10 (the last two parameters are used in Algorithm-I of Section 2).

Finally, the effect of k in the k-nearest neighbour classification is examined, and it is observed that k = 5 gives the best performance. Recognition accuracies for different values of k are shown in Fig. 4; the overall results reported in Table 1 are obtained with k = 5.

Classification results are further grouped into three classes: correct (a sample is properly classified), incorrect (a sample is wrongly classified), and reject (the system cannot classify a sample). A rejection is reported when no single class gets a majority among the k choices returned by the classifier. Table 3 presents the average classification results taking these three aspects into consideration.

Table 3. Classification results

  Dataset   % correct   % incorrect   % reject
  DS1       96.23       2.63          1.14
  DS2       95.68       2.44          1.88

Fig. 3. Effect of different parameters on recognition accuracy and size of immune memory: (a) stimulation threshold (refer to equation (3)); (b) number of resources used for resource limitation; (c) mutation rate (refer to Algorithm-II); and (d) affinity threshold scalar α, as used in Algorithm-III

Fig. 5 presents the class-wise classification rates. Recognition of the digit '0' attains the highest recognition score in both scripts. On the other hand, samples of the digit '2' in Hindi and of the digit '9' in Bengali result in the lowest classification rates, 89.32% and 90.52%, respectively. A study of the confusion matrix identifies several similar-shaped character pairs; for example, many samples of the digits '1' and '2' in the Hindi dataset, and of the digits '1' and '9' in the Bengali dataset, resulted in confusion during classification. Some post-processing can be employed to discriminate such confusion pairs. In this context, a previous study [5] reported the promising ability of an AIS-based approach to discriminate similar-shaped character pairs; the same approach can also be employed here to further improve the classification accuracy. Such a multi-level recognition scheme is considered as a future extension of the present study.

Fig. 4. Recognition accuracies using the k-nearest neighbor approach with different values of k

Fig. 5. Class-wise recognition accuracies

Comparison with other existing studies: As mentioned earlier, there are many studies on the recognition of handwritten digits in English and Oriental scripts, but only a few reports on Indic scripts. A recent study [17] makes use of a fuzzy model based recognition scheme and reports a recognition accuracy of about 95% on a dataset containing about 3,500 handwritten samples of Devanagari digits. The study in [18] used a neural net as the classifier and achieved an accuracy of 93.26% on the same dataset as used here for the recognition of handwritten Bengali digits.

Compared to these approaches and achievements, the proposed scheme achieves a recognition accuracy of about 96%, which is comparable to the previous approaches. However, it is to be noted that no two studies employ the same feature set: [18] considers wavelet coefficients as features, the authors of [17] use some grid-based features, whereas a size-normalized binary image array has been used as the feature in the present study. The distance measures used also differ from one study to another. Therefore, a direct comparison needs replication of these experiments using a uniform feature set and the same distance measure. Our future study will consider this aspect to bring out a judicious comparison between an AIS-based framework and other approaches using different learning paradigms.

4 Conclusions

This paper presents an application of a clonal selection algorithm to the recognition of handwritten Indic numerals. In particular, a 2-phase clonal selection algorithm implementing a retraining scheme is proposed, and experiments using different datasets are performed. The reported results show that this new method outperforms the previously used single-pass method, and the overall classification performance shows that it compares well with existing approaches. This study uses a simple feature representation and a simple distance measure to explore the feasibility of an AIS-based approach as an alternative classification tool. Since encouraging results have been obtained in this experiment, the proposed AIS-based method can be viewed as a potential alternative, and a future extension of this study will include examination of different feature sets and distance measures to further improve the recognition accuracy.

References

1. D. Dasgupta, Z. Ji, and F. Gonzalez, "Artificial immune system (AIS) research in the last five years," in Congress on Evolutionary Computation (CEC'03), Vol. 1, pp. 123-130, 2003.
2. Z. Tang, K. Tashima, and Q. P. Cao, "Pattern recognition system using a clonal selection-based immune network," Systems and Computers in Japan, Vol. 34, Issue 12, pp. 56-63, 2003.
3. Z. Ji and D. Dasgupta, "Real-valued negative selection algorithm with variable-sized detectors," in LNCS 3102, Proceedings of GECCO, pp. 287-298, 2004.
4. L. N. de Castro and F. J. Von Zuben, "Learning and Optimization Using the Clonal Selection Principle," IEEE Transactions on Evolutionary Computation, Special Issue on Artificial Immune Systems, Vol. 6, No. 3, pp. 239-251, 2002.
5. U. Garain, M. P. Chakraborty, and D. Dasgupta, "Improvement of OCR Accuracy by Similar Character Pair Discrimination: an Approach based on Artificial Immune System," to be presented at the 18th Int. Conf. on Pattern Recognition (ICPR), Hong Kong, August 2006.
6. A. Watkins, "AIRS: a resource limited artificial immune classifier," Master's dissertation, Dept. of Computer Science, Mississippi State University, 2001.

7. Keith Price Bibliography on the use of Neural Networks for recognition of Numbers and Digits, http://iris.usc.edu/Vision-Notes/bibliography/char1019.html
8. C. de Stefano, A. Della Cioppa, and A. Marcelli, "Handwritten Numeral Recognition by Means of Evolutionary Algorithms," in Proc. Int. Conf. on Document Analysis and Recognition (ICDAR), pp. 804-808, 1999.
9. J. H. Carter, "The Immune System as a Model for Pattern Recognition and Classification," Journal of the American Medical Informatics Association, Vol. 7, No. 1, pp. 28-41, 2000.
10. A. Tarakanov and V. Skormin, "Pattern Recognition by Immunocomputing," in the proceedings of the special sessions on artificial immune systems, Congress on Evolutionary Computation, 2002 IEEE World Congress on Computational Intelligence, Honolulu, Hawaii, May 2002.
11. L. N. de Castro and J. Timmis, "Artificial Immune Systems: A Novel Approach to Pattern Recognition," in Artificial Neural Networks in Pattern Recognition (Eds. L. Alonso, J. Corchado, and C. Fyfe), University of Paisley, UK, pp. 67-84, January 2002.
12. S. Forrest, B. Javornik, R. E. Smith, and A. S. Perelson, "Using genetic algorithms to explore pattern recognition in the immune system," Evolutionary Computation, 1(3), pp. 191-211, 1993.
13. Y. Cao and D. Dasgupta, "An Immunogenetic Approach in Chemical Spectrum Recognition," in Advances in Evolutionary Computing (Eds. Ghosh & Tsutsui), Chapter 36, Springer-Verlag, 2003.
14. J. A. White and S. M. Garrett, "Improved Pattern Recognition with Artificial Clonal Selection," in Proc. of the 2nd Int. Conf. on Artificial Immune Systems (ICARIS), Napier University, Edinburgh, UK, September 1-3, 2003.
15. J. Timmis, "Artificial Immune Systems: a novel data analysis technique inspired by the immune network theory," PhD Thesis, University of Wales, Aberystwyth, UK, 2000.
16. U. Bhattacharya and B. B. Chaudhuri, "Databases for research on recognition of handwritten characters of Indian scripts," in Proc. of the 8th Int. Conf. on Document Analysis and Recognition (ICDAR), Seoul, Korea, pp. 789-793, 2005.
17. M. Hanmandlu and O. V. Ramana Murthy, "Fuzzy Model Based Recognition of Handwritten Hindi Numerals," in Int. Conf. on Cognition and Recognition, pp. 490-496, 2005, http://www.studentprogress.com/appln/colleges/cogrec/
18. U. Bhattacharya, T. K. Das, A. Dutta, S. K. Parui, and B. B. Chaudhuri, "A Hybrid scheme for handwritten numeral recognition based on Self Organizing Network and MLP," Int. Journal on Pattern Recognition and Artificial Intelligence (IJPRAI), Vol. 16, No. 7, pp. 845-864, 2002.