Image Processing Algorithms for Improved Character Recognition

Satadru Das, Nincy James School of Information Technology and Engineering, VIT University, Vellore – 632014, Tamil Nadu, India das.satadru@gmail.com; nincyjames@gmail.com; +91-8124251405 Guide: Prof E. Vijayan

Abstract In this paper we propose some methodologies for Optical Character Recognition (OCR). To recognize characters in a digital or scanned image, we use automatic thresholding based on Otsu’s method. Noise is also removed from the given image. Automatic selection of the threshold results in improved OCR performance. The characters present in the image are extracted and labeled from left to right and from top to bottom, using a row-wise search order followed by a column-wise search order.

Keywords: OCR, thresholding, Otsu’s method, search order

Introduction

Optical Character Recognition (OCR) classifies optical patterns in a digital image that correspond to alphanumeric or other characters. The OCR process involves several stages, such as segmentation, feature extraction, and classification.

Classification Process

Building a classifier involves two steps: training and testing. These can be broken down further into the following sub-steps:

1. Training: A library of characters is created, which the program later uses to compare input data against known characters. To identify a character, the program searches the library for the character that most closely matches the input data.
   a. Pre-processing: The steps usually followed are:
      Binarization – Choosing a threshold value when presented with a grayscale image.
      Morphological Operators – Removing holes and isolated specks in characters.
      Segmentation – Checking the connectivity of shapes, then labeling and isolating them. The bwlabel and regionprops functions can be used for this.
   b. Feature Extraction: Reduce the amount of data by extracting only relevant information.
   c. Model Estimation: From the finite set of feature vectors, estimate a statistical model for each class of training data.
2. Testing: To read a string of characters from a given image we employ three steps:
   a. Pre-processing: Images are converted into grayscale, background noise and irrelevant details are removed so that only the characters remain, and individual characters are segmented into their own blocks for further processing.
   b. Feature Extraction: Important features of each character are determined.
   c. Classification: The extracted features are compared to the library, and the character is classified based on a number of algorithms.

Figure 1 The pattern classification process

Otsu’s Thresholding Method

Otsu’s method is one of many binarization algorithms. In this method all possible threshold values are considered, a measure of spread of the pixel levels on each side of the threshold is calculated, and the threshold with the lowest sum of weighted variances is taken as the thresholding value. The calculations for finding the foreground and background variances are done as follows:

Background:

\[
W_b = \frac{\sum_{i=1}^{t} P(i)}{\text{No. of pixels}}, \qquad
\mu_b = \frac{\sum_{i=1}^{t} r_i \, P(i)}{\sum_{i=1}^{t} P(i)}, \qquad
\sigma_b^2 = \frac{\sum_{i=1}^{t} (r_i - \mu_b)^2 \, P(i)}{\sum_{i=1}^{t} P(i)}
\]

Foreground:

\[
W_f = \frac{\sum_{i=t+1}^{G} P(i)}{\text{No. of pixels}}, \qquad
\mu_f = \frac{\sum_{i=t+1}^{G} r_i \, P(i)}{\sum_{i=t+1}^{G} P(i)}, \qquad
\sigma_f^2 = \frac{\sum_{i=t+1}^{G} (r_i - \mu_f)^2 \, P(i)}{\sum_{i=t+1}^{G} P(i)}
\]

where r_i is the gray level, P(i) is the number of pixels at that gray level, t is the candidate threshold, and G is the number of gray levels.

Implementation

The OCR program is implemented in MATLAB, as it has built-in image reading tools and native array types. The implementation proceeds in the following steps:

1. The original image is converted to a grayscale image.
2. Global image thresholding is done using Otsu’s method, which chooses the threshold that minimizes the intraclass variance of the black and white pixels.
3. The result is converted to a binary image and the background is inverted to black.
4. Noise is removed from the image.
5. Connected components in the binary image are labeled using the bwlabel function. This returns a matrix L, of the same size as the input image, containing labels for the connected objects (which may be 4-connected or 8-connected) in the image.
6. To read the objects in sequence, we measure a set of properties for each connected component (object) in the binary image using the regionprops function. Here we calculate the BoundingBox (the upper left corner and the width of the bounding box along each dimension), the Centroid (the center of mass of the region), and the Extrema (an 8 by 2 matrix specifying the extrema points in the region). These properties are used so that the objects can be labeled properly and read in sequence.
7. A row-wise sorting is done based on the leftmost top value of each of the detected objects.
8. A column-wise sorting is then done based on the leftmost bottom values, after quantizing the bottom coordinates of each object.

Although the aforementioned steps work well in most cases, they may not work properly if the image is too blurred or its contrast is very poor.

Results and Discussion

The tests were made on images obtained from various sources such as scanned images, the internet, etc. Some of the samples were made with the help of MS Paint.
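As a rough illustration of the two sorting steps in the implementation (row-wise, then column-wise after quantizing the bottom coordinates), the following Python sketch orders bounding boxes in reading order. It is one plausible reading of those steps, not the MATLAB code used in this work: the function name reading_order, the line_height parameter used for quantization, and the example boxes are all hypothetical. Boxes are given as (left, top, width, height), similar to the BoundingBox property.

```python
from typing import List, Sequence, Tuple

# (left, top, width, height), with the origin at the top-left of the image
Box = Tuple[float, float, float, float]


def reading_order(boxes: Sequence[Box], line_height: float) -> List[int]:
    """Return box indices sorted top-to-bottom, then left-to-right.

    The bottom edge of each box is quantized into a text-line index
    (assumption: lines are roughly `line_height` pixels apart), so that
    characters on the same line share a row key even if their tops differ.
    """
    def key(i: int) -> Tuple[int, float]:
        left, top, width, height = boxes[i]
        bottom = top + height
        row = int(bottom // line_height)  # quantized bottom coordinate
        return (row, left)                # row first, then left edge

    return sorted(range(len(boxes)), key=key)


# Hypothetical boxes in the order a column-wise scan might find them:
# "F" (lower line, starts furthest left), then "s", then "t" (taller than "s").
boxes = [(8, 40, 10, 16), (10, 12, 8, 10), (20, 8, 6, 14)]
print(reading_order(boxes, line_height=30))  # [1, 2, 0]
```

In this example "s" and "t" share a quantized bottom row, so they are read before "F" even though "F" starts further left and "t" starts higher than "s". Simple floor quantization can misplace a character whose bottom edge falls near a bin boundary, so line_height would need to be chosen with care.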

Figure 2 (a) Original image. (b) After thresholding using Otsu’s method. (c) Extracted characters before sorting. (d) After row-wise sorting. (e) After row and column sorting

Figure 2 shows the results obtained after performing each of the steps followed to extract the objects. The ordering is not always straightforward. Suppose we have a scanned text image as given in Figure 3 (a). One would expect the “s” in the upper left corner to be labeled as the first object found by bwlabel. But that is not the case. Instead, “F” is labeled as the first object, because the serif on the upper left of “F” extends to the left of “s” (as shown in Figure 3 (b)). Hence the column-wise scan by bwlabel finds “F” first.

Figure 3 (a) Given image (b) Serif of “F” extending

A row-wise sorting based on the leftmost top value of each object does not give accurate results either: “t” is found first instead of “s”, as the vertical stem of “t” extends above the top of “s”. So no matter which way we search, “s” is still not labeled as the first object. Hence we need to do a column-wise sorting after quantizing each of the objects based on their leftmost bottom values. We therefore sort according to row-wise search order and then according to column-wise search order. This results in accurate detection of each of the connected components.

Another problem is that, in the presence of heavy noise, the thresholding algorithm used does not always produce correct output. Suppose we have an image as shown in Figure 4 (a). The character “L” in the upper left corner is misinterpreted as the character “I” as a result of thresholding, as shown in Figure 4 (b).
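The threshold selection discussed here follows Otsu’s criterion: for each candidate threshold, the weighted sum of the background and foreground variances is computed, and the threshold with the smallest sum is kept. The following is a minimal sketch in Python rather than the MATLAB used in this work. The function names class_stats and otsu_threshold and the 6-level example histogram are illustrative assumptions. Gray levels are indexed from 0, the background is taken as levels 0 to t, and the foreground as the levels above t.

```python
from typing import Optional, Sequence, Tuple


def class_stats(hist: Sequence[int], lo: int, hi: int) -> Tuple[int, float, float]:
    """Pixel count, mean and variance of gray levels lo..hi-1 of a histogram."""
    count = sum(hist[lo:hi])
    if count == 0:
        return 0, 0.0, 0.0
    mean = sum(i * hist[i] for i in range(lo, hi)) / count
    var = sum((i - mean) ** 2 * hist[i] for i in range(lo, hi)) / count
    return count, mean, var


def otsu_threshold(hist: Sequence[int]) -> Optional[int]:
    """Return the level t minimizing W_b * var_b + W_f * var_f.

    Returns None if no threshold splits the pixels into two non-empty classes.
    """
    total = sum(hist)
    best_t, best_score = None, float("inf")
    for t in range(len(hist) - 1):
        n_b, _, var_b = class_stats(hist, 0, t + 1)          # background: levels <= t
        n_f, _, var_f = class_stats(hist, t + 1, len(hist))  # foreground: levels > t
        if n_b == 0 or n_f == 0:
            continue  # skip thresholds that leave one class empty
        score = (n_b / total) * var_b + (n_f / total) * var_f
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Hypothetical histogram with 6 gray levels (pixel counts per level).
print(otsu_threshold([8, 7, 2, 6, 9, 4]))  # 2
```

Because the threshold is derived from the histogram alone, it takes no account of where the pixels lie in the image.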

Figure 4 (a), (c) Original image with noise. (b), (d) Image after thresholding

Thus we can see that the thresholding we have used here is not robust, and we need to find a better thresholding algorithm in order to detect characters accurately even in the presence of noise. We also face difficulty with characters that are not connected, e.g. the letter i, a semicolon, or a colon (; or :). Their parts are labeled as two different objects, which should not be the case.

Conclusion

In this work we have extracted the connected components (objects) from a scanned image using the functionalities of MATLAB. There are many areas on which we have not yet worked due to time constraints. In future we hope to take up those areas to enhance our project. To determine the correct character we can use statistical pattern recognition, which picks the character from the library that has the highest probability of being the same character as in the image. We can also use structural pattern recognition, in which 2-dimensional structures are extracted from the image and matched to structures in the library. Fuzzy logic and template matching techniques can also be applied for proper character recognition.