2 authors, including:
Ahmed Muaz
Systems Ltd.
All content following this page was uploaded by Ahmed Muaz on 15 September 2015.
Abstract: Optical character recognition (OCR), one of the subject areas of Natural Language Processing (NLP), plays a significant role in helping people digitize documents with minimal time and effort. This paper presents a complete OCR system for Khmer, the official language of Cambodia. It describes the four main processes of the system, namely pre-processing, segmentation, recognition, and mapping, each of which consists of sub-processes that are also explained in this paper. The training system, which uses the HTK Toolkit, is also presented. It adapts the technique employed by the Bangla OCR system.

Keywords: Text Band, Main Body, SuperScript, SubScript, CCDown, CC.

I. INTRODUCTION
OCR dates back to the early 1950s, when scientists found ways to capture images by both mechanical and optical means [1]. Early OCR technology was therefore solely a matter of hardware development: devices that could capture images and digitize them. Early OCR hardware such as scanners could digitize only one line at a time, by moving either the hardware itself or the paper, whereas modern technology such as the flatbed scanner allows full-page scans [1]. As technology advanced, researchers started to work on OCR software. As a result, computers came to model another special human ability, the ability to read printed documents. This ability, however, remains limited in computers to this day, because humans, through experience and context, can easily understand distorted text on a complicated background, while computers still cannot. Today, more and more researchers are interested in OCR, and more advanced OCR techniques are being developed at a rapid pace. Moreover, OCR software is widely used in many fields, including businesses, educational institutions, post offices, newspaper publishers, and many other industries [1].

The demand for document processing is increasing daily in developed countries as well as in developing countries such as Cambodia. Khmer, the official language of Cambodia, belongs to the Mon-Khmer group of Austro-Asiatic languages, and its script descends from the ancient Brahmi script of India [2]. Khmer has thirty consonants, twenty-one dependent vowels, and thirteen independent vowels. A single word can span five layers in total, although at most four layers are combined at once. The five layers are one base in the middle, two on top of the base, and two below the base. Khmer is written from left to right continuously to the end of the line, and a new line starts whenever the horizontal space runs out [3]. One difference between Khmer and English is that Khmer words are not separated by white space. Therefore, OCR for Khmer requires a technique different from that used for English.

II. SCOPE
This paper addresses one Khmer font only: Limon S1, size 22. This font was selected because most Khmer documents are printed in this font face and size.

III. METHODOLOGY
In this paper, we assume that the document is scanned in black and white and contains no pictures. The OCR system has four steps. The first step is pre-processing, which involves only line separation. The second step is segmentation, which is the most crucial step in the OCR process. The third step is recognition. The last step is mapping.

A. Pre-processing
1) Line Separation
Although Khmer is written without spaces between words, a page of Khmer text consists of clearly distinct lines, delimited by white space. Therefore, a horizontal projection profile is used to separate one line from another [4]. The horizontal projection profile is the histogram of ON (black) pixels accumulated along each horizontal row, so a row that contains only OFF (white) pixels is treated as a delimiter. Scanning from the top of the document downward, a line is extracted whenever it is found between two white rows. Fig. 1 shows white lines used as delimiters for extracting each line from the page.

Fig. 1 Example of white lines used as delimiters for extracting each line from a page
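As an illustration only, line separation with a horizontal projection profile can be sketched in Python as follows. The page is assumed to be a list of rows of 0/1 values (1 = ON, black; 0 = OFF, white). The function names `horizontal_profile` and `separate_lines` are hypothetical and are not taken from the system described in this paper. Unlike the description in the text, the sketch also closes a line that touches the bottom edge of the page.

```python
def horizontal_profile(page):
    """Count the ON (black) pixels in every row of a binary page."""
    return [sum(row) for row in page]


def separate_lines(page):
    """Return the (top, bottom) row indices, inclusive, of each text line.

    A row whose profile value is 0 contains only OFF (white) pixels and
    is treated as a delimiter between lines.
    """
    lines = []
    top = None  # first row of the line currently being scanned, if any
    for y, count in enumerate(horizontal_profile(page)):
        if count > 0 and top is None:
            top = y                      # a new text line starts here
        elif count == 0 and top is not None:
            lines.append((top, y - 1))   # a white row closes the line
            top = None
    if top is not None:                  # line runs to the page bottom
        lines.append((top, len(page) - 1))
    return lines
```

Each returned pair can then be used to slice the page into one image per text line, for example `page[top:bottom + 1]`.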
2) Text Band Calculation and Character Separation
The Text Band is the pair of margins that bound the Main Body. It consists of the start of the Text Band and the end of the Text Band. Text Band Calculation is the process of finding the Text Band for each line obtained from line separation. Since the top and bottom positions of the characters in a sentence differ, the start and the end of the Text Band are taken to be the average of the top positions and the average of the bottom positions of those Characters, respectively. Fig. 2(a) shows how this is done. Fig. 2(b) shows the Text Band detected for a sentence.

Fig. 2(a) Start of the Text Band as the average of the top position and End of the Text Band as the average of the bottom position

Fig. 2(b) Example of Text Band, start of the Text Band is on top and end of the Text Band is on the bottom

After the Text Band has been calculated, the next process is Character separation. Although the Text Band is not used for Character separation itself, Text Band detection is an essential step that lets the segmentation process run smoothly. Character separation can proceed once the Text Band has been detected. This process uses the vertical projection profile [4]. The idea is the same as for the horizontal projection profile, except that the horizontal projection profile is the histogram of ON (black) pixels accumulated along each horizontal line, whereas the vertical projection profile is the histogram of those accumulated along each vertical line. White space is therefore still the delimiter. The term Character refers to any shape separated by this process. Fig. 3(a) shows the vertical white space between two Characters. Fig. 3(b) shows each separable Character in a box.

Fig. 3(a) Vertical white space used as a delimiter for separating Characters

Fig. 3(b) Example of Characters, the term used to refer to any separated shape in Character separation process

B. Segmentation
As stated, this step is the most essential process because it breaks each individual Character down into atomic shapes, that is, shapes that cannot be separated any further. Those shapes are the input to the next process, the Recognition step. This step consists of two processes: Text Band adjustment, followed by Segmentation.

1) Text Band Adjustment
The Text Band is an essential threshold for the segmentation process. A wrong Text Band calculation leads to unexpected results for the whole OCR system. Hence, the Text Band needs to be adjusted before segmentation. Text Band adjustment is the process of finding the optimum position of the existing Text Band for an individual Character rather than for the whole line (sentence). In other words, Text Band adjustment has two significant characteristics. First, it works on an individual Character, so that each Character holds its own, more accurate Text Band. Second, it adjusts the Text Band for that Character. In this process, the Text Band of the whole line is used as the threshold. Then five pixels up are checked for the row with the fewest horizontally accumulated pixels, which becomes the new start of the Text Band. Likewise, five pixels down are checked for the new end of the Text Band. Fig. 4 illustrates the result.

Fig. 4 Text Band adjustment, (a) before adjustment, (b) after adjustment on each Character

2) Segmentation
After each individual Character has its own optimum Text Band, the final step is the segmentation process. This process is the most critical one because it breaks every Character down into atomic (no longer separable) shapes, which are then sent to the recognition system for the final output. After segmentation, each atomic shape has its own distinctive identity. These identities serve to distinguish between shapes and to determine which recognition system each shape should be sent to. In every case, each shape must belong to one of five categories: Main Body, SuperScript, SubScript, CCDown, and CC (Complex Character). All detected and extracted shapes are stored in the form of a box, which is defined by the coordinates of two specified points.
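The following Python sketch is an illustration under stated assumptions, not the implementation used in this system. It ties together three ideas from the text: Character separation by vertical projection profile, the box defined by two coordinates, and the line-level Text Band as the average of the top and bottom positions. The names `Category`, `Box`, `separate_characters`, `bounding_box`, and `text_band` are hypothetical. A text line is assumed to be a list of rows of 0/1 values (1 = ON, black). Computing the averages from the separated Character boxes is also an assumption, since the text does not say how the top and bottom positions are measured.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five categories every segmented shape must belong to."""
    MAIN_BODY = "Main Body"
    SUPERSCRIPT = "SuperScript"
    SUBSCRIPT = "SubScript"
    CCDOWN = "CCDown"
    CC = "CC"


@dataclass(frozen=True)
class Box:
    """A shape stored as two points: (left, top) and (right, bottom)."""
    left: int
    top: int
    right: int
    bottom: int


def bounding_box(line, left, right):
    """Tightest box around the ON pixels in columns left..right."""
    rows = [y for y, row in enumerate(line) if any(row[left:right + 1])]
    return Box(left, rows[0], right, rows[-1])


def separate_characters(line):
    """Split a text line at all-white columns (vertical projection)."""
    width = len(line[0])
    profile = [sum(row[x] for row in line) for x in range(width)]
    boxes = []
    start = None
    # A trailing 0 acts as a sentinel that closes the last Character.
    for x, count in enumerate(profile + [0]):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            boxes.append(bounding_box(line, start, x - 1))
            start = None
    return boxes


def text_band(boxes):
    """Return (start, end): mean top and mean bottom of the boxes."""
    start = sum(b.top for b in boxes) / len(boxes)
    end = sum(b.bottom for b in boxes) / len(boxes)
    return start, end
```

Storing only two corner points keeps each shape cheap to pass between the segmentation and recognition steps, and a category from `Category` can be attached to a box once the shape has been identified.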
Fig. 5 Two coordinates make up a box for a shape

Main Body is any shape that resides within the Text Band. Generally, a Main Body can be a consonant, an independent vowel, a dependent vowel, or the combination of a consonant and a dependent vowel.

Fig. 6 Example of Main Body

the information about the Character in an incomplete form, i.e. only CCDown and Main Body are recorded. While a Main Body can be found by detecting any shape that resides within the Text Band, a CCDown needs to be specially detected. In this technique, we check the Main Body. If it has a series of about ten ON (black) pixels attached to its bottom, it is assumed to be a CCDown and is marked as a CCDown instead of a Main Body. Figure 4(a) shows the difference between a CCDown and a Main Body. Figure 4(b) illustrates the series of ON (black) pixels that identifies a genuine CCDown.
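One possible reading of the CCDown check is sketched below in Python. It is an interpretation, not the authors' code: the "series of about ten ON pixels attached to the bottom" is assumed to mean that each of the ten pixel rows directly below the end of the Text Band contains at least one ON pixel within the columns of the Main Body. The names `has_ccdown_stroke` and `classify_main_body` and the parameter `run_length` are hypothetical.

```python
def has_ccdown_stroke(line, left, right, band_end, run_length=10):
    """Check for a vertical run of ON pixels below the Text Band.

    line      : text line as a list of rows of 0/1 values (1 = ON)
    left/right: inclusive column range of the Main Body
    band_end  : row index of the end of the Text Band
    run_length: number of consecutive rows that must contain ON pixels
                (about ten according to the text; an assumed default)
    """
    last_row = band_end + run_length
    if last_row >= len(line):
        return False  # not enough rows below the Text Band
    return all(
        any(line[y][left:right + 1])
        for y in range(band_end + 1, last_row + 1)
    )


def classify_main_body(line, left, right, band_end):
    """Label a shape inside the Text Band as 'CCDown' or 'Main Body'."""
    if has_ccdown_stroke(line, left, right, band_end):
        return "CCDown"
    return "Main Body"
```

The threshold is kept as a parameter because the text says "about ten", which suggests it may need tuning for other fonts or sizes.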
Fig. 10 Example of CC
D. Training System
Prior to recognition, the shapes need to be trained. The training system uses HTK for modeling the shapes. Unlike recognition, which is a repeated process, training is done only once and separately from the main OCR processes: Pre-processing, Segmentation, Recognition, and Mapping. Since training uses HTK, both the framing and the DCT calculation processes are needed, and they do not differ from those used in recognition. Furthermore, each category of shape is trained separately, because there are separate recognition systems. The following table shows the total number of training shapes in each category.

All in all, the recognition of Main Body can be improved by adding more entries to the mismatch files, since part of the error rate comes from the lack of information in those files. However, the major source of error is that, after recognition, some shapes are confused with shape ID 102. For SuperScript, some shapes are still confused; recognition can therefore be improved by increasing the number of training samples. Likewise, the recognition rate for SubScript can be enhanced by increasing the number of training samples. The extraction of SubScript also affects the recognition result: observation after extraction shows that some SubScripts are missing. This can be improved by modifying the existing code. Finally, the recognition rate for CC is low because of the extraction process. Whenever Main Body ID 043 occurs together with CC ID 01, both shapes are wrongly extracted, leaving unrecognized shapes.
V. ACKNOWLEDGEMENT
The authors would like to thank the International Development Research Centre (IDRC) of Canada and the Ministry of Education, Youth and Sport (MoEYS) for sponsoring this project.
VI. REFERENCES
[1] M. Cheriet, N. Kharma, C. L. Liu, C. Y. Suen (2007). Character
Recognition Systems: A Guide for Students and Practitioners, Hoboken,
New Jersey: John Wiley & Sons, Inc.
[2] Khmer alphabet. Retrieved December 1, 2008, from
http://www.omniglot.com/writing/khmer.htm
[3] Khmer. Retrieved December 1, 2008, from
http://www.ancientscripts.com/khmer.html
[4] T. V. Ashwin, P. S. Sastry (2002). A font and size-independent OCR
system for printed Kannada documents using support vector machines,
27, 35-58.
[5] M. A. Hasnat, S. M. Habib, M. Khan, Segmentation free Bangla OCR
using HMM: Training and Recognition, BRAC University,
Bangladesh.
[6] C. Chey, Khmer printed character using wavelet descriptor, M. Eng.
thesis, King Mongkut’s University of Technology, Thailand, 2004.
[7] L. Lensu (1998), Discrete Cosine Transform.
[8] A. B. Watson (1994). Image Compression Using the Discrete Cosine
Transform, Mathematica Journal, 4 (1), 81-88.
[9] I. LengIeng, K. Chenda (2008). Algorithm for Character Segmentation
in Khmer Optical Character Recognition (OCR), PAN Localization
Cambodia.