
ALGORITHM FOR DEVANAGARI CHARACTER RECOGNITION

Nilesh Kandalgaonkar
M.E. (Electronics), Electronics & Telecommunication Department, Government College of Engineering, Aurangabad, Maharashtra, India Phone: 9822592094 E-mail: knilesh@hotmail.com

Prof. S. S. Rautmare
Lecturer, Electronics & Telecommunication Department, Government College of Engineering, Aurangabad, Maharashtra, India Phone: 9822002302 E-mail: s_rautmare@yahoo.com

Abstract - This paper presents an algorithm for Devanagari character recognition. The input to the recognizer is an image of pure Devanagari text, and the output is the recognized Devanagari characters. Removing the horizontal and vertical lines of the characters reduces the complexity of the character images. The center of gravity method is then used for character recognition. The algorithm makes size-independent character recognition possible.

1. INTRODUCTION

The strategies used for optical character recognition (OCR) can be broadly classified into three categories. In the statistical approach, a pattern is represented as a vector: an ordered, fixed-length list of numeric features. An attempt is made to capture orthogonal features that can correctly partition the feature space so that each partitioned zone corresponds to a unique character class. In the structural or syntactic approach, a pattern is represented as a set of simpler shapes: an unordered, variable-length list of geometric features of mixed type. The simpler shapes include strokes, end points, loops and stroke relations. The features represent global and local properties of the characters. In the hybrid approach, these two approaches are combined at appropriate stages to represent characters and to classify unknown characters. The following subsections outline some of the major attempts.

2. CHARACTER RECOGNITION METHODS

2.1 Template Matching and Correlation Techniques

In 1929 Tauschek obtained a patent on OCR in Germany, and this is the first conceived idea of an OCR. His approach was what the literature refers to as template matching. The template matching process can be roughly divided into two sub-processes: superimposing an input shape on a template, and measuring the degree of coincidence between the input shape and the template. The template that matches the unknown most closely provides the recognition. Two-dimensional template matching is very sensitive to noise and difficult to adapt to a different font. A variation of the template matching approach is to test only selected pixels and employ a decision tree for further analysis. The peephole method is one of the simplest methods based on this selected-pixel matching approach. Its main difficulty lies in selecting an invariant, discriminating set of pixels for the alphabet. Moreover, from an artificial intelligence perspective, template matching has been ruled out as an explanation for human performance [1, 2].

2.2 Features Derived from the Statistical Distribution of Points

This technique is based on matching on feature planes or spaces, which are distributed in an n-dimensional space, where n is the number of features. This approach is referred to as the statistical or decision-theoretic approach. Unlike template matching, where an input character is directly compared with a standard set of stored prototypes, many samples of a pattern are used to collect statistics. This phase is known as the training phase. The objective is to expose the system to the natural variants of a character. The recognition process then uses these statistics to partition the feature space and to identify an unknown character.

For instance, in the Karhunen-Loève (K-L) expansion, one of the first attempts at statistical feature extraction, orthogonal vectors are generated from a data set. For these vectors, the covariance matrix is constructed and its eigenvectors are computed; these form the coordinates of the given pattern space. Initially, the correlation was pixel-based, which led to a large number of covariance matrices. The approach was later refined to use class-based correlation instead of pixel-based correlation, which led to a more compact space. However, this approach was very sensitive to noise and to variation in stroke thickness. To make the approach tolerant to variation and noise, a tree structure was used for making the decision and multiple prototypes were stored for each class. Researchers have also used Fourier, Walsh, Haar and Hadamard series expansions for classification.

2.3 Geometrical and Topological Features

The classifier is expected to recognize the natural variants of a character but still discriminate between similar-looking characters such as k and ph, or p and Sh. These contradictory requirements make the classification task challenging. The structural approach is capable of meeting them. Multiple prototypes are stored for each class to take care of the natural variants of the character. However, when the prototypes are generated automatically, a large number of prototypes per class is required to cover the natural variants. In contrast, the descriptions may be handcrafted, and a suitable matching strategy that incorporates the expected variations is relied upon to yield the true class. The matching strategies include dynamic programming, tests for isomorphism, inexact matching, relaxation techniques and multiple-to-one matching. Rocha has used a conceptual model of variations and noise along with multiple-to-one mapping.

Yet another class of structural approach is to use a phrase-structured grammar for the prototype descriptions and to parse the unknown pattern syntactically using the grammar. Here the terminal symbols of the grammar are the stroke primitives and the non-terminals represent the pattern classes. The production rules give the spatial relationships of the constituent primitives.

2.4 Hybrid Approach

The statistical approach and the structural approach both have their advantages and shortcomings. Statistical features are more tolerant to noise than structural descriptions, provided the sample space over which training has been performed is representative and realistic. The variation due to font or writing style, on the other hand, can be abstracted more easily in structural descriptions. The two approaches are complementary in terms of their strengths and have been combined. The primitives ultimately have to be classified using a statistical approach. The approaches are combined by mapping variable-length, unordered sets of geometrical shapes to fixed-length numerical vectors. This hybrid approach has been used for omnifont, variable-size character recognition systems.

2.5 Neural Networks

In the beginning, character recognition was regarded as a problem that could be easily solved. But the problem turned out to be more challenging than most researchers in the field expected. The challenge still exists, and an unconstrained document recognition system matching human performance is still nowhere in sight. The performance of a system deteriorates very rapidly with deterioration in the quality of the input or with the introduction of new fonts or handwriting. In other words, the systems do not adapt easily to a changed environment. The training phase aims at exposing the system to a large number of fonts and their natural variants.

Neural networks are based on the theory of learning from known inputs. A back-propagation neural network is composed of several layers of interconnected elements. Each element computes an output that is a function of the weighted sum of its inputs. The weights are modified until the desired output is obtained. Neural networks have been employed for character recognition with varying degrees of success. They have also been employed to integrate the results of several classifiers by adjusting weights to obtain the desired output. The main weakness of systems based on neural networks is their poor capability for generalization. There is always a chance of undertraining or overtraining the system. Besides this, a neural network does not provide a structural description, which is vital from an artificial intelligence viewpoint. The neural network approach has solved the problem of character classification no more than the approaches described earlier.

Recent research results call for the use of multiple features and intelligent ways of combining them. The combination of potentially conflicting decisions by multiple classifiers should take advantage of the strengths of the individual classifiers, avoid their weaknesses and improve the classification accuracy.
The intersection and union of decision regions are the two most obvious methods for combining classifications.

3. DEVANAGARI TEXT RECOGNITION

Devanagari script is alphabetic in nature, and its words are two-dimensional compositions of characters and symbols, which makes it different from Roman and ideographic scripts. Algorithms that perform well for other scripts can be applied only after extensive preprocessing, which makes simple adaptation ineffective. Therefore, research has to be done independently for Devanagari script. Some effort has been made in this direction, mainly in India. R. M. K. Sinha has reported on various aspects of Devanagari script recognition [3]. The post-processing system is based on contextual knowledge, which checks the composition syntax. Sethi and Chatterjee have described Devanagari numeral recognition based on the structural approach [4]. The primitives used are the horizontal line segment, vertical line segment, right slant and left slant. A decision tree is employed to perform the analysis based on the presence or absence of these primitives and their interconnection. A similar strategy was applied to constrained hand-printed Devanagari characters. Neural network approaches for isolated characters have also been reported. However, none of these works has considered real-life documents with character fusions and noisy environments.

The algorithm presented here is based on geometrical features of the Devanagari characters. The input image is parsed into many sub-images based on these features. Other properties, such as the distribution of points/pixels and edges within each sub-image, are then used as features to recognize the parsed symbols. The two major properties used to segment an input word (image) into sub-symbols are horizontal bars and vertical bars. Both form an integral part of Devanagari characters [5, 6].

3.1 Exploiting Horizontal Bars

The horizontal bar separates the matra zone from the consonant zone. Removing this bar has two advantages: 1) separation of the matra zone and the consonant zone and 2) segmentation of the characters in a word. Since the horizontal bar separates out the upper zone, or matra zone, of a word, its removal segments the word into two sub-images, the matra zone image and the consonant zone image. Both of these images can then be recognized using special properties of the symbols in each zone. Removal of the horizontal bar also facilitates the segmentation of characters.
Since characters are assumed not to touch each other, once the horizontal bar is removed all the characters are separated by spacers and can be easily extracted.

3.2 Exploiting Vertical Bars

The vertical bar is also an integral part of most Devanagari characters. Removing this bar has two advantages: 1) it reduces the complexity of the character image and 2) it divides the possible symbols into classes. Because this bar occurs so frequently in Devanagari characters, it can be easily removed, and its removal significantly reduces the complexity of the character image. Another observation is that there are the following types of Devanagari characters: 1) characters with a vertical bar somewhere in the middle of the character, 2) characters with a vertical bar at the right side of the character and 3) characters with no vertical bar.

Once the vertical bars are removed, three further classes of symbols can be derived: 1) symbols that occur to the left of the vertical bar in the first two types defined above (Class I), 2) symbols that occur to the right of the vertical bar in the first type defined above (Class II) and 3) characters with no vertical bar (Class III). In these classes, the symbols are the sub-symbols that result after removal of the horizontal and vertical bars. Note that these symbols are not the same as the original Devanagari alphabets; they are much simpler. A Devanagari alphabet such as k is divided into two sub-symbols, one in Class I and the other in Class II. Symbols in each of these classes can be recognized with relative ease by using their geometric properties.

4. ALGORITHM

4.1 Detection of Horizontal Bar

Once the word has been extracted from the image, take the horizontal histogram of the word image again. Since the headline is an integral part of all Devanagari alphabets, the peak in the horizontal histogram corresponds to the headline, or horizontal bar, of the word. The thickness of the horizontal line is estimated from the difference in the pixel density distribution in the histogram. When the pixel density drops by more than a threshold value, the corresponding row in the image is marked as one of the boundaries of the head bar.

4.2 Detection of Lower Matra Zone

Once the horizontal bar is detected, divide the word image into two sub-images: the part lying above the horizontal bar (referred to as the matra zone image) and the part lying below the horizontal bar. Consider again the horizontal histogram that was calculated to detect the horizontal bar, and look for a drop greater than a threshold value in the lower part of the histogram. This drop indicates the end of the consonant zone and the start of the lower character zone.
The threshold value was found experimentally to be 25% of the average pixel density in the histogram, excluding the contribution of the horizontal bar. Segment the lower image again into two sub-images: one lying above the detected boundary (referred to as the consonant zone) and the other lying below the boundary (referred to as the lower matra zone).

4.3 Segmentation of Characters

The matras in the matra zone and lower matra zone images, and the consonants in the consonant zone image, are now isolated and can be easily extracted. For each of these symbols, its position in the original image is stored. This information is required later to reconstruct the word from the recognized images [7].

4.4 Detection of Vertical Bars

Take the vertical histogram of the word image. The thickness of the vertical line is estimated from the difference in the pixel density distribution in the histogram. When the pixel density drops by more than a threshold value, the corresponding column in the image is marked as one of the boundaries of the vertical bar.
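The histogram-based bar detection of Sections 4.1 and 4.4 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the image is assumed to be a binary grid (1 = ink), the helper names row_profile, column_profile and find_bands are hypothetical, and the rule "the bar ends where the density drops by more than a threshold" is approximated by keeping the runs whose count stays above an assumed fraction (0.8) of the histogram peak.

```python
def row_profile(image):
    """Horizontal histogram: number of ink pixels in each row."""
    return [sum(row) for row in image]


def column_profile(image):
    """Vertical histogram: number of ink pixels in each column."""
    return [sum(column) for column in zip(*image)]


def find_bands(profile, fraction=0.8):
    """Return (start, end) index pairs of the runs whose count is at least
    fraction * peak. The fraction is an assumed stand-in for the paper's
    experimentally chosen drop threshold."""
    peak = max(profile)
    if peak == 0:
        return []
    level = fraction * peak
    bands, start = [], None
    for index, count in enumerate(profile):
        if count >= level and start is None:
            start = index                      # a band begins here
        elif count < level and start is not None:
            bands.append((start, index - 1))   # the band ended on the previous index
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


# Toy word image (hypothetical data): a two-row headline with three
# vertical strokes hanging below it and one matra pixel above it.
WORD = [
    "000000000",
    "001000000",
    "111111111",
    "111111111",
    "010010010",
    "011010110",
    "010010010",
    "000000000",
]
IMAGE = [[int(ch) for ch in row] for row in WORD]

headline = find_bands(row_profile(IMAGE))[0]       # rows of the head bar
body = IMAGE[headline[1] + 1:]                     # everything below the head bar
vertical_bars = find_bands(column_profile(body))   # columns of the vertical bars

print(headline)        # (2, 3)
print(vertical_bars)   # [(1, 1), (4, 4), (7, 7)]
```

On this toy word the headline occupies rows 2 to 3 and three one-pixel-wide vertical bars are found below it. On real scans the fraction would have to be tuned experimentally, as the paper does for its own thresholds.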

Figure 2. Example of thinned images

Note that skeletons produced by this method contain undesirable short spurs, produced by small irregularities in the boundary of the original object. A process called pruning, which is in fact just another sort of thinning, can remove these spurs. The structuring element for this operation is shown below, along with some other common structuring elements.

Figure 3. Structuring elements

4.6 Computation of Attributes Vector

A number of heuristic techniques are used to recognize the symbols. These techniques attempt to capture the basic skeleton, or the basic structural properties, of each symbol. These properties do not change drastically with variation in font or size. The following are some of the major heuristic techniques used in this implementation. The first element of the attribute vector is the class (I, II or III) of the symbol. This is already known from whether the symbol was derived by extracting the left or right part of the original image, or whether the symbol had no vertical bar in it.

4.7 Center of Gravity

The center of gravity (CG) of a symbol gives a lot of information about the structure of the symbol. If the symbol is simple and all its curves/edges are of unit thickness, as in our case, the relative position of its center of gravity with respect to the other curves in the symbol will not vary drastically.

4.8 4-Element Vector

The 4-element vector also captures structural information about the symbol image. Each element of this vector denotes the number of curves between the CG and the image boundary in one direction. To get a feel for it, imagine standing at the CG. The numbers of layers/curves visible in each direction (top, left, down and right) form the first, second, third and fourth elements of the 4-element vector respectively.
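The center of gravity and the 4-element vector described above can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, the image is assumed to be a thinned binary grid (1 = ink), the CG is rounded to the nearest pixel, and a "curve" is counted as one run of consecutive ink pixels met along a ray that starts next to the CG and ends at the image boundary.

```python
def center_of_gravity(image):
    """Mean (row, column) of all ink pixels, rounded to the nearest pixel."""
    points = [(r, c) for r, row in enumerate(image)
              for c, value in enumerate(row) if value]
    if not points:
        raise ValueError("image contains no ink pixels")
    n = len(points)
    return (round(sum(r for r, _ in points) / n),
            round(sum(c for _, c in points) / n))


def count_curves(ray):
    """Number of separate ink runs met while walking along a ray."""
    count, previous = 0, 0
    for value in ray:
        if value and not previous:   # background -> ink: a new curve starts
            count += 1
        previous = value
    return count


def four_element_vector(image, point=None):
    """Curves seen from `point` (default: the CG) looking top, left, down, right.
    The starting pixel itself is not counted (an assumption of this sketch)."""
    r, c = point if point is not None else center_of_gravity(image)
    column = [row[c] for row in image]
    up = column[:r][::-1]          # from the CG towards the top edge
    down = column[r + 1:]          # from the CG towards the bottom edge
    left = image[r][:c][::-1]      # from the CG towards the left edge
    right = image[r][c + 1:]       # from the CG towards the right edge
    return tuple(count_curves(ray) for ray in (up, left, down, right))


def parse(rows):
    """Turn rows of '0'/'1' characters into a grid of integers."""
    return [[int(ch) for ch in row] for row in rows]


# Two hypothetical unit-thickness symbols: a cup open at the top and a closed ring.
CUP = parse(["0000000", "0100010", "0100010", "0100010",
             "0100010", "0111110", "0000000"])
RING = parse(["0000000", "0111110", "0100010", "0100010",
              "0100010", "0111110", "0000000"])

print(center_of_gravity(CUP))       # (3, 3)
print(four_element_vector(CUP))     # (0, 1, 1, 1)
print(four_element_vector(RING))    # (1, 1, 1, 1)
```

The two toy symbols differ only in the first element, which shows how the vector separates an open shape from a closed one. The optional point argument lets the same count be taken at positions other than the CG.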

Figure 1. Example of complete parsing of the input image

The labels 0 and I in the above image show the class to which each symbol belongs. First the image is parsed into sub-images in this order, and then each sub-image is resolved against the candidates of its class. These sub-images are less complicated and hence easier to recognize. Once the leaf images in this parse tree are recognized, the original word is reconstructed using the reconstruction rules described in the section on reconstruction.

4.5 Thinning the Image

The word has now been broken into simple symbols, which form the basic blocks for recognition. Since this algorithm is based on the geometric or structural properties of Devanagari alphabets, the symbol image is thinned before further processing. In this way the attributes of the symbol image (to be computed later) are not affected by uneven thickness of the edges or lines in the symbol. Skeletonization is performed by morphological thinning with two structuring elements. At each iteration, the image is first thinned by the left-hand structuring element, then by the right-hand one, and then by the remaining six 90° rotations of the two elements. The process is repeated in cyclic fashion until none of the thinnings produces any further change. As usual, the origin of the structuring element is at the center.
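The thinning loop just described can be sketched as follows. The two structuring elements used here are an assumption: they are the pair commonly given for skeletonization by hit-and-miss thinning, since the paper's own elements appear only in a figure that is not reproduced in the text. The function names are hypothetical, the image is a binary grid (1 = ink), and pixels outside the image are treated as background.

```python
# Assumed structuring elements: 1 = ink, 0 = background, None = don't care.
SE_EDGE = [[0, 0, 0],
           [None, 1, None],
           [1, 1, 1]]
SE_CORNER = [[None, 0, 0],
             [1, 1, 0],
             [None, 1, None]]


def rotate(se):
    """Rotate a 3x3 structuring element by 90 degrees clockwise."""
    return [list(row) for row in zip(*se[::-1])]


def element_sequence():
    """Both elements, followed by their three successive 90-degree rotations."""
    a, b = SE_EDGE, SE_CORNER
    sequence = []
    for _ in range(4):
        sequence += [a, b]
        a, b = rotate(a), rotate(b)
    return sequence


def matches(image, r, c, se):
    """Hit-and-miss test: does the neighbourhood of (r, c) fit the element?"""
    rows, cols = len(image), len(image[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            wanted = se[dr + 1][dc + 1]
            if wanted is None:
                continue
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < rows and 0 <= cc < cols
            if (image[rr][cc] if inside else 0) != wanted:
                return False
    return True


def thin(image):
    """Thin with each element in turn until a full cycle changes nothing."""
    image = [row[:] for row in image]   # work on a copy
    sequence = element_sequence()
    changed = True
    while changed:
        changed = False
        for se in sequence:
            hits = [(r, c) for r in range(len(image))
                    for c in range(len(image[0])) if matches(image, r, c, se)]
            for r, c in hits:           # delete all matching pixels together
                image[r][c] = 0
            changed = changed or bool(hits)
    return image


# A three-pixel-thick horizontal stroke (hypothetical test data).
BAR = [[0] * 7] + [[0, 1, 1, 1, 1, 1, 0] for _ in range(3)] + [[0] * 7]
for row in thin(BAR):
    print("".join(map(str, row)))
```

Every pass only deletes ink pixels, so the loop always terminates, and running thin on its own output leaves it unchanged.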

The structural information that the 4-element vector captures is very basic to the symbol. Even if the curves are irregular, or varying fonts are used, the vector yields almost the same value for the symbol. Using this vector, we were able to reduce the number of candidates for a symbol by approximately 60%.

Figure 4. 4-Element vector for some symbols

4.9 Relative Position of Symbol in the Image

This attribute encodes the relative position of the symbol in the original consonant zone image. That is, it tells whether a significant fraction of the rows above or below the symbol is empty, or whether some rows are empty both above and below the symbol image. All the symbols can be divided into four very distinct classes, depending on their relative position of occurrence. This attribute is therefore also very useful in reducing the set of candidate symbols.

4.10 Finding Candidate Symbol Set and Resolving

Once the attribute vector has been computed for an input image, find the candidate vector of possible symbols for the symbol image. This candidate vector is generated by lookup in a table containing all the candidates for each possible attribute vector. To recognize the symbol within this set, a number of further heuristic techniques are applied. The major ones are 1) analyzing subsections of the image and 2) computing the 4-element vector at different points.

4.11 Moving Along Center of Gravity

After the number of candidates for the symbol has been reduced, the candidates left are very close in shape and structure. To differentiate among them, one needs to capture the slight variations in their structure. These symbols can be resolved by the following technique: compute the same 4-element vector for a set of points in the row corresponding to the CG. This set of points depends on the width of the symbol and the number of candidate symbols to be resolved.

4.12 Head Bar

Symbols such as m and bh, tha and ya, and dha and gha are almost the same in the consonant zone. They are almost pairwise identical, so the head bar information is used to differentiate them. While the horizontal bar is removed, in the process of segmenting the consonant and matra zones, the information about whether the head bar was present on an alphabet/symbol is preserved. Using this information, these symbols can be resolved easily. Using a combination of the above techniques, all the symbols in the matra zone, consonant zone and lower matra zone are recognized.

4.13 Reconstruction

Once all the symbols are recognized, reconstruct the word from these symbols. For each symbol image, the relative position in the original image has been stored. For each recognized symbol we also know whether it is of Class I type or of Class II type. This enables us to define rules to merge two symbols and construct back the original character. Examples of such rules are:

p [class I] + k [class II] = ph
p [class I] + NULL [class II] = p

Once the character is constructed, we merge the matra information to construct the complete character with its matra.

5. CONCLUSIONS

Character recognition is an important problem and is a constituent of any document image analysis and recognition system. A recognition system may be for machine-printed and/or hand-printed scripts. There are many applications of printed character recognition. Recognition can be achieved by many methods, such as dynamic programming, hidden Markov modeling, neural networks, nearest neighbor classifiers, expert systems and combination techniques. The characters may be of any size and style, since different publications print in different ways. The recognition of Devanagari script is difficult compared to many other language scripts, due to the large variations in stroke primitives and the very complex structure of the script. If these stroke primitives are broken, they add more problems for recognition. A recognition system must be robust in performance so that it can cope with the large variety of printed variations arising from different problems. The performance of a recognition system depends on the feature extraction methods used for classification. The more relevant the feature extraction method used for discrimination, the higher the recognition rate of the system. The features used for classification should be translation, scale and rotation invariant. Rotation invariance is very important, particularly when dealing with Devanagari character recognition. It is therefore beneficial to use features that have the above properties. In this work, the center of gravity has been used as a feature for recognizing Devanagari characters.

6. REFERENCES

[1] V. Bansal, R. M. K. Sinha, Integrating Knowledge Sources in Devanagari Text Recognition, IEEE Transactions on Systems, Man and Cybernetics, Vol. 30, No. 4, pp. 500-505, 2000.
[2] Mandeep Singh Chauhan, Hindi Character Recognition, May 2001.
[3] S. S. Marwah, S. K. Mullick, R. M. K. Sinha, Recognition of Devanagari Characters Using a Hierarchical Binary Decision Tree Classifier, IEEE International Conference on Systems, Man and Cybernetics, October 1994.
[4] I. K. Sethi, B. Chatterjee, Machine Recognition of Constrained Hand Printed Devanagari, Pattern Recognition (9), 1977, pp. 69-75.
R. M. K. Sinha, H. Mahabala, Recognition of Devanagari Script, IEEE Trans. on Systems, Man, and Cybernetics, Vol. 9, pp. 435-441, 1979.
[5] Veena Bansal, R. M. K. Sinha, A Complete OCR for Printed Hindi Text in Devanagari Script.
[6] Veena Bansal, R. M. K. Sinha, How to Describe Shapes of Devanagari Characters and Use Them for Recognition, Fifth International Conference on Document Analysis and Recognition.
[7] Veena Bansal, R. M. K. Sinha, Segmentation of Touching and Fused Devanagari Characters, Pattern Recognition, Vol. 35, No. 4, pp. 875-893, 2002.
[8] R. M. K. Sinha, Veena Bansal, Devanagari Document Processing, IEEE Transactions 0-7803-255911.
[9] C. V. Jawahar, M. N. S. S. K. Pavan Kumar, S. S. Ravi Kiran, A Bilingual OCR for Hindi-Telugu Documents and its Applications.
[10] Yungang Zhang, Changshui Zhang, A New Algorithm for Character Segmentation of License Plate.
[11] Øivind Due Trier, Anil K. Jain, Torfinn Taxt, Feature Extraction Methods for Character Recognition: A Survey, Pattern Recognition, Vol. 29, No. 4, pp. 641-662, 1996.
[12] R. M. K. Sinha, Birendra Prasada, Gilles F. Houle, Michael Sabourin, Hybrid Contextual Text Recognition with String Matching, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 9, September 1993, p. 915.
[13] Anjum Ali, Mahmood Ahmad, Nasir Rafiq, Javed Akber, Usman Ahmad, Shahwar Akmal, Language Independent Optical Character Recognition for Hand Written Text, IEEE Transactions, 0-7803-8680-9.
[14] B. B. Chaudhuri, U. Pal, An OCR System to Read Two Indian Language Scripts: Bangla and Devanagari, Proc. 4th Int. Conf. Document Analysis and Recognition, Ulm, Germany, pp. 1011-1015, Aug. 1997.