

Khmer Optical Character Recognition (OCR)

Research · September 2015


DOI: 10.13140/RG.2.1.2393.3926


All content following this page was uploaded by Ahmed Muaz on 15 September 2015.



Khmer Optical Character Recognition (OCR)
Ing LengIeng, Ahmed Muaz
#141, St. 04, Tuol Sangke, Russei Keo, Phnom Penh, Cambodia
PAN Localization Cambodia
lengieng_ing@yahoo.com
ahmed.muaz@nu.edu.pk

Abstract: As one of the subject areas of Natural Language Processing (NLP), OCR plays a significant role in helping people digitize documents with minimum time and effort. This paper presents a complete OCR system for Khmer, the official language of Cambodia. It describes the four main processes of the system (pre-processing, segmentation, recognition, and mapping), each of which consists of sub-processes that are also explained in this paper. The training system, which uses the HTK Toolkit, is also presented; it adapts the technique employed by the Bangla OCR system.

Keywords: Text Band, Main Body, SuperScript, SubScript, CCDown, CC.

I. INTRODUCTION
OCR has a history reaching back to the early 1950s, when scientists found ways to capture images by both mechanical and optical means [1]. Early OCR technology therefore meant solely the development of hardware that could capture images and digitize them. Whereas early OCR hardware such as scanners could digitize only one line at a time, by moving either the hardware itself or the paper, modern technology such as the flatbed scanner allows a full-page scan [1]. As technology advanced, researchers started to work on OCR software. As a result, the computer came to model another special human ability, namely the ability to read printed documents. This ability, however, remains limited to the present day, because humans, through experience and context, can easily understand distorted text on a complicated background, while the computer still cannot. Today, more and more researchers are interested in OCR, and more advanced OCR techniques are being developed at a rapid pace. Moreover, OCR software is widely used in many fields, such as businesses, educational institutions, post offices, newspaper publishers, and many other industries [1].

To date, the demand for document processing is increasing daily in developed countries as well as in developing countries such as Cambodia. Khmer, the official language of Cambodia, belongs to the Mon-Khmer group of Austro-Asiatic languages, and its script descends from the ancient Brahmi script of India [2]. The Khmer language consists of thirty consonants, twenty one dependent vowels and thirteen independent vowels. A single word can occupy five layers in total, although at most four layers can be combined at once. The five layers are one base in the middle, two on top of the base, and two below the base. Khmer is written continuously from left to right to the end of the line, and a new line starts whenever the horizontal space runs out [3]. One of the differences between Khmer and English is that there is no white space between Khmer words as there is in English. Therefore, OCR for Khmer requires a technique different from that used for English.

II. SCOPE
This paper addresses one Khmer font only, namely the Limon S1 font at size 22. This font was selected because most Khmer documents are printed in this font face and size.

III. METHODOLOGY
In this paper, we assume that the document is scanned in black and white and is picture-free. There are four steps in this OCR system. The first step is pre-processing, which involves only the line separation. The second step is segmentation, which is the most crucial step in the OCR process. The third step is recognition. The last step is mapping.

A. Pre-processing
1) Line Separation
Despite the continuous style of writing, a page of Khmer text consists of clearly distinct lines, and white space is the delimiter between them. Therefore, a horizontal projection profile method is used to separate one line from another [4]. The horizontal projection profile is the histogram of ON (black) pixels accumulated along each horizontal row, so a row in which only OFF (white) pixels accumulate is considered a delimiter. Thus, scanning from the top of the document downward, a line is extracted whenever it is found between two white lines. Fig. 1 shows white lines used as delimiters for extracting each line from the page.

Fig. 1 Example of white lines used as delimiters for extracting each line from a page
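The line separation described above can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the page is assumed to be already binarized into a list of rows holding 1 for ON (black) and 0 for OFF (white), and the names Image, horizontal_profile, and separate_lines are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels; 1 = ON (black), 0 = OFF (white)


def horizontal_profile(image: Image) -> List[int]:
    """Count the ON pixels in each row (the horizontal projection profile)."""
    return [sum(row) for row in image]


def separate_lines(image: Image) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, one per text line."""
    lines = []
    start = None
    for y, count in enumerate(horizontal_profile(image)):
        if count > 0 and start is None:
            start = y                  # first inked row after a white line
        elif count == 0 and start is not None:
            lines.append((start, y))   # a white row closes the current line
            start = None
    if start is not None:              # text touching the bottom edge of the page
        lines.append((start, len(image)))
    return lines


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(separate_lines(page))  # [(1, 3), (4, 5)]
```

A real scan would probably need a small noise threshold instead of the strict test for zero ON pixels; that threshold is left out here because the text does not specify one.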
2) Text Band Calculation and Character Separation
The Text Band is the pair of margins that enclose the Main Body. It consists of the start of the Text Band and the end of the Text Band. Text Band Calculation is the process of finding the Text Band for each line obtained from the line separation. Since the top and bottom positions of the different characters in a sentence clearly differ, the start and the end of the Text Band are taken to be the average of the top positions and the average of the bottom positions of those Characters, respectively. Fig. 2(a) shows how this is achieved. Fig. 2(b) shows the Text Band detected for a sentence.

Fig. 2(a) Start of the Text Band as the average of the top position and End of the Text Band as the average of the bottom position

Fig. 2(b) Example of Text Band, start of the Text Band is on top and end of the Text Band is on the bottom

After the Text Band has been calculated, the next process is Character separation. Although the Text Band is not needed for Character separation itself, Text Band detection is an essential step that allows the later segmentation process to go smoothly. Character separation can proceed once the Text Band has been detected. In this process, a vertical projection profile is used [4]. The idea is the same as for the horizontal projection profile, except that while the horizontal projection profile is the histogram of ON (black) pixels accumulated along each horizontal row, the vertical projection profile is the histogram of those accumulated along each vertical column. Thus, white space is still the delimiter. From this point on, the term Character refers to any shape separated by this process. Fig. 3(a) shows the vertical white space between two Characters. Fig. 3(b) shows each separable Character in a box.

Fig. 3(a) Vertical white space used as a delimiter for separating Characters

Fig. 3(b) Example of Characters, the term used to refer to any separated shape in Character separation process

B. Segmentation
As stated, this step is the most essential process because it involves breaking down each individual Character into atomic shapes, that is, shapes that cannot be separated any further. Moreover, those shapes are the input to the next process, the Recognition step.

This step consists of two processes. The first is Text Band adjustment. The second is the Segmentation process itself.

1) Text Band Adjustment
The Text Band is an essential threshold for the segmentation process. A wrong Text Band calculation leads to unexpected results for the whole OCR system. Hence, the Text Band needs to be adjusted before proceeding to the segmentation process. Text Band adjustment is the process of finding the optimum position of the existing Text Band on an individual Character rather than on the whole line (sentence). In other words, Text Band adjustment has two significant characteristics. On the one hand, it works on an individual Character, so that each Character holds its own, more accurate Text Band. On the other hand, it adjusts the position of the Text Band for that Character.

In this process, the Text Band for the whole line is used as the threshold. Then, by looking for the row with the fewest horizontally accumulated ON pixels, the five pixel rows above are checked for the new start of the Text Band. Likewise, the five pixel rows below are checked for the new end of the Text Band. Fig. 4 illustrates the result.

Fig. 4 Text Band adjustment, (a) before adjustment, (b) after adjustment on each Character

2) Segmentation
After each individual Character has its own optimum Text Band, the final step each of them has to go through is the segmentation process. This process is the most critical one inasmuch as it involves breaking down every Character into atomic (no longer separable) shapes, which are then sent to the recognition system for final output. Moreover, after the segmentation, each atomic shape has its own distinctive identity. These identities are defined both to distinguish between shapes and to indicate which recognition system each shape should be sent to. Furthermore, in every circumstance each shape must belong to one of five categories: Main Body, SuperScript, SubScript, CCDown, and CC (Complex Character). It should be noted that all the shapes detected and extracted are stored in the form of a box, which is the combination of the coordinates of two specified points.
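The Text Band calculation and Character separation of section A.2 can be sketched as follows. This is a simplified illustration, not the code used in this work. It assumes a binarized line image (1 = ON, 0 = OFF) containing at least one Character, it splits Characters first and then averages their top and bottom rows (the text does not fix this ordering), and it takes the floor of each average. The names separate_characters and text_band are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels; 1 = ON (black), 0 = OFF (white)


def separate_characters(line: Image) -> List[Tuple[int, int]]:
    """Return (left, right) column ranges, right exclusive, split at white columns."""
    width = len(line[0])
    # Vertical projection profile: ON pixels accumulated along each column.
    profile = [sum(row[x] for row in line) for x in range(width)]
    chars, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            chars.append((start, x))
            start = None
    if start is not None:
        chars.append((start, width))
    return chars


def text_band(line: Image) -> Tuple[int, int]:
    """Average the top and bottom inked rows over all Characters in the line."""
    tops, bottoms = [], []
    for left, right in separate_characters(line):
        inked = [y for y, row in enumerate(line) if any(row[left:right])]
        tops.append(inked[0])
        bottoms.append(inked[-1])
    # Floor of the average; assumes the line holds at least one Character.
    return sum(tops) // len(tops), sum(bottoms) // len(bottoms)


line = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
print(separate_characters(line))  # [(0, 2), (3, 4)]
print(text_band(line))            # (0, 3)
```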
Fig. 5 Two coordinates make up a box for a shape

Main Body is any shape that resides within the Text Band. Generally, a Main Body can be a consonant, an independent vowel, a dependent vowel, or the combination of a consonant and a dependent vowel.

Fig. 6 Example of Main Body

SuperScript is any shape that appears above the start of the Text Band. By this definition, a double quote is also considered a SuperScript. Fig. 7 shows SuperScripts in boxes.

Fig. 7 Example of SuperScripts

SubScript is any shape that resides below the Main Body and therefore appears below the end of the Text Band.

Fig. 8 Example of SubScript

CCDown is any shape that covers two layers, Main Body and SubScript.

Fig. 9 Example of CCDown

CC stands for Complex Character, a term coined to refer to any shape that covers all three layers: SuperScript, Main Body, and SubScript.

Fig. 10 Example of CC

There are eight processes in this step.

a) Initial Information Storing
Since this is the first process in the Segmentation process, it does not involve any segmentation yet. Instead, as the name suggests, Initial Information Storing is the process of storing the information about the Character in an incomplete form, i.e. only CCDown and Main Body are recorded. While a Main Body can be found by detecting any shape that resides within the Text Band, a CCDown needs to be specially detected. In this technique, we check the Main Body. If it has a series of about ten ON (black) pixels attached to the bottom of it, it is assumed to be a CCDown and is marked as a CCDown instead of a Main Body. Fig. 11(a) shows the differentiation between a CCDown and the Main Body. Fig. 11(b) illustrates the series of ON (black) pixels that is taken as the factor identifying a genuine CCDown.

Fig. 11(a) Initial Information Storing process stores the above Character as “cM”, [c: CCDown, M: Main Body]

Fig. 11(b) CCDown detection, any found shape that has ten or more pixels attached to the bottom of it is considered as a CCDown

b) Complex Character (CC) Detection
As mentioned in section 2 (Segmentation), CCDown and CC share some common characteristics. Without detection they are indistinguishable, because both cover the Main Body layer and the SubScript layer; therefore, to tell them apart, detection on the SuperScript layer is necessary.

CC detection is simple and works well provided that the Text Band is correctly identified. The following conditions tell whether a shape is a CC:
- The shape is a CCDown.
- The shape resides on the right side of the Main Body, e.g. Mc (M: Main Body, c: CCDown).
- There is a series of ON (black) pixels attached to the top of that shape.

The following figure shows how these conditions help identify the genuine CC.
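The CCDown and CC detection rules above can be expressed as a small sketch. It is a hedged illustration rather than the procedure used in this work: it assumes rows are numbered downward, it reads "ten or more pixels attached to the bottom" as at least ten pixel rows extending below the end of the Text Band, and it reads "pixels attached to the top" as the shape reaching above the start of the Text Band. Box, is_ccdown, is_cc, and initial_sequence are hypothetical names.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Two corner points of a shape; rows grow downward."""
    left: int
    top: int
    right: int
    bottom: int


def is_ccdown(box: Box, band_end: int, min_rows: int = 10) -> bool:
    """True when at least `min_rows` rows hang below the end of the Text Band."""
    return box.bottom - band_end >= min_rows


def is_cc(box: Box, main_body: Box, band_start: int, band_end: int) -> bool:
    """Apply the three CC conditions listed in the text."""
    return (
        is_ccdown(box, band_end)            # the shape is a CCDown
        and box.left >= main_body.right     # it sits to the right of the Main Body
        and box.top < band_start            # it also reaches into the SuperScript layer
    )


def initial_sequence(boxes: List[Box], band_end: int) -> str:
    """Record 'c' for CCDown and 'M' for Main Body, from left to right."""
    ordered = sorted(boxes, key=lambda b: b.left)
    return "".join("c" if is_ccdown(b, band_end) else "M" for b in ordered)


band_start, band_end = 20, 60
main = Box(0, 20, 30, 60)
ccdown = Box(32, 22, 50, 75)
cc = Box(32, 10, 50, 75)
print(is_cc(ccdown, main, band_start, band_end))  # False
print(is_cc(cc, main, band_start, band_end))      # True
print(initial_sequence([Box(25, 20, 50, 60), Box(0, 20, 22, 75)], band_end))  # cM
```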
Fig. 12 (a) CCDown covers two layers, Main Body and SubScript, (b) CC covers three layers, SuperScript, Main Body, and SubScript

c) SuperScript Detection and Extraction
Since a SuperScript is any shape that resides above the start of the Text Band (see section 2, Segmentation), the detection is done from the start of the Text Band up to the top of the Character. There are two main possibilities for the appearance of a SuperScript. One is that there is a white line between the Main Body and the SuperScript. The other is that the Main Body and the SuperScript are attached to each other, in which case special detection is needed. Once the detection is done, the position of the shape is recorded in the form of a box. After that, the shape is extracted according to the coordinates in the box. Finally, a function that fills the area with OFF (white) pixels is called, so that the extracted area is erased from the Character data.

Fig. 13 SuperScript Detection: Normal (a) vs Special Case (b)

d) SubScript Detection and Extraction
A SubScript can be detected based on its position. It usually appears below the Main Body. Thus, the detection is done from the end of the Text Band down to the bottom of the Character. A special case arises when either a CCDown or a CC is present. In this case, there is no white line separating the Main Body from the SubScript. After the detection, the position of the SubScript is recorded in the form of a box. Then, the extraction is done on the basis of that box. Finally, the extracted area is filled with OFF (white) pixels in the Character data. The following figure illustrates how a normal SubScript differs from a SubScript in the special case.

Fig. 14 Difference between normal SubScript and SubScript in special case, (a) White line between Main Body and SubScript, (b) CCDown blocks the white line between Main Body and SubScript.

e) Main Body Extraction
In the earlier process (Initial Information Storing), the Main Body has already been detected. So, the function here involves solely the extraction; repeating the detection would duplicate work. A Character may contain more than one Main Body; therefore, each Main Body is extracted in order from left to right. As in the other extraction processes, the area (expressed as a box) where the Main Body resides is filled with OFF (white) pixels. The following figure illustrates how the Main Body, which is at the second index of the sequence, is extracted by ignoring the first index, which is the CCDown.

Sequence: c M

Fig. 15 Main Body Extraction, Main Body is extracted by ignoring the CCDown

f) CCDown and CC Extraction
At this point, the only shape left is the CCDown and/or the CC. There is no need for detection, and the extraction process is also simple. Fig. 16(a) shows a Character whose Main Body has already been extracted, leaving only the CCDown. Fig. 16(b) shows the same case for the CC.

Fig. 16 Example of CCDown and CC extraction, (a) CCDown, (b) CC

g) Sequence Generation
This is the last process in the Segmentation. Although it no longer plays a role in the segmentation itself, it generates the complete sequence information about the Character, which is essential for the Mapping process. Unlike the Initial Information Storing process, which stores only the information of the CCDown and the Main Body, the Sequence Generation process stores all the information of the five
categories: Main Body, SuperScript, SubScript, CCDown, and CC.

C. Recognition
After all the shapes have been separated, they must be sent to the recognizer. Recognition is the process of sending a segmented shape to the recognizer and getting a result, specifically an ID, back from it. In our application, we use the HTK Toolkit both for modeling all the shapes and as the recognizer. Before being sent to the recognizer, each shape needs to undergo two processes. The first is the framing process, in which each shape is sliced into windows of equal size. The second is the DCT (Discrete Cosine Transform) calculation.

1) Framing
In this process, each shape is sliced into frames of equal size. This is because the number of frames is taken as the number of states for the HMM. The size of a frame is given by its width and its height. For example, the frame size for Main Body is 5 (width) by 60 (height). Since there are five distinct categories of shapes, five different frame sizes are set, based on careful observation of the samples collected. In this process, we implement a vertical static framing technique. The reason is that, by the nature of Khmer letters, their height does not vary much, whereas their width varies considerably. Vertical framing is therefore applied to each shape for a better result.

Fig. 17 Example of framing, (a) a shape before framing, (b) a shape is framed into 5 equal frames

2) Discrete Cosine Transform (DCT) calculation
DCT is a mathematical method that uses the cosine as the basis function for converting a continuous series of data into elementary frequency components.

D. Training System
Prior to the recognition, shapes need to be trained. The training system uses HTK for modeling the shapes. Unlike the recognition, which is a repeated process, training is done only once and separately from the main OCR processes (Pre-processing, Segmentation, Recognition, and Mapping). Moreover, since it uses HTK, both the framing and the DCT calculation are needed, and they do not differ from those in the recognition. Furthermore, each category of shape is trained separately, because there are separate recognition systems. The following table shows the number of training shapes in each category.

TABLE I
NUMBER OF TRAINING SHAPES
Category       Number of Samples
Main Body      8935
SuperScript    13500
SubScript      8600
CCDown         2700
CC             1500
Total          35235

E. Mapping
Mapping is the process of turning an ID obtained from the recognizer into an ASCII code. This is done by using the information in the defined code files. Each code file contains a list of unique IDs and the corresponding ASCII code for every shape. Hence, the process is not a challenging one, because all it has to do is match the ID obtained from the recognizer and return the ASCII code. However, for better performance there are mapping rule files as well. These files, also known as mismatch files, are used for verifying shapes that are easily confused.

IV. EXPERIMENTAL RESULT AND DISCUSSION
We applied the segmentation and the recognition to ten pages of text taken from a newspaper website (Kohsantepheap newspaper). The texts were specially arranged to avoid pictures, figures, tables, and other formats. In other words, the texts are plain text only. They were scanned using an HP LaserJet 3055 at a resolution of 300 dpi in black and white picture type. The following table shows the result after the recognition.

TABLE III
RECOGNITION RATE
Character      Total    Correct   Recognition Rate (%)
Main Body      12983    12535     96.54
SuperScript    2599     2505      96.38
SubScript      2062     1972      95.63
CCDown         807      785       97.27
CC             79       56        70.88
Total          18530    17853     96.34

All in all, the recognition for Main Body can be improved by adding more entries to the mismatch files, since part of the error comes from the lack of information in those files. However, the major source of error is that, after the recognition, some shapes are confused with shape ID 102 ( ). For SuperScript, some shapes are still confused; recognition can therefore be improved by increasing the number of training samples. Likewise, the recognition rate for SubScript can be enhanced by increasing the number of training samples. Moreover, the extraction of SubScripts also affects the recognition result. Observation after the extraction shows that some SubScripts are missing. This can be improved by modifying the existing code. Finally, the recognition rate for CC is low because of the extraction process. Whenever Main Body ID 043 appears together with CC ID 01, both shapes are wrongly extracted, which leaves unrecognized shapes.
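As a rough illustration of the feature extraction described in section C (framing followed by the DCT), the sketch below slices a shape into vertical frames of equal width and computes DCT coefficients for one frame. It rests on several assumptions that the text does not settle: the shape is a binarized list of rows, the last frame is padded with OFF pixels, and an unnormalised one-dimensional DCT-II is applied to the frame flattened row by row. It does not call HTK, and the names frame_columns, dct_ii, and frame_features are hypothetical.

```python
import math
from typing import List

Image = List[List[int]]  # rows of pixels; 1 = ON (black), 0 = OFF (white)


def frame_columns(shape: Image, frame_width: int) -> List[Image]:
    """Slice a shape into vertical frames of equal width, padding the last one."""
    width = len(shape[0])
    frames = []
    for x in range(0, width, frame_width):
        frame = [row[x:x + frame_width] for row in shape]
        pad = frame_width - len(frame[0])       # OFF pixels needed on the right
        frames.append([row + [0] * pad for row in frame])
    return frames


def dct_ii(values: List[float]) -> List[float]:
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    total = len(values)
    return [
        sum(v * math.cos(math.pi / total * (n + 0.5) * k)
            for n, v in enumerate(values))
        for k in range(total)
    ]


def frame_features(frame: Image) -> List[float]:
    """Flatten a frame row by row and return its DCT coefficients."""
    return dct_ii([float(p) for row in frame for p in row])


shape = [
    [1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0],
]
frames = frame_columns(shape, frame_width=2)
print(len(frames))                # 3
print(frame_features(frames[0]))  # four coefficients for the first 2x2 frame
```

With the Main Body frame size quoted in section C (5 by 60), each frame would yield 300 coefficients under this flattening; in practice only the leading coefficients are usually kept, but the number retained here is not stated in the text.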

V. ACKNOWLEDGEMENT
The authors would like to thank the International Development Research Center (IDRC) of Canada and the Ministry of Education, Youth and Sport (MoEYS) for sponsoring this project.

VI. REFERENCE
[1] M. Cheriet, N. Kharma, C. L. Liu, C. Y. Suen (2007). Character
Recognition Systems: A Guide for Students and Practitioners, Hoboken,
New Jersey: John Wiley & Sons, Inc.
[2] Khmer alphabet. Retrieved December 1, 2008, from
http://www.omniglot.com/writing/khmer.htm
[3] Khmer. Retrieved December 1, 2008, from
http://www.ancientscripts.com/khmer.html
[4] T. V. Ashwin, P. S. Sastry (2002). A font and size-independent OCR
system for printed Kannada documents using support vector machines,
27, 35-58.
[5] M. A. Hasnat, S. M. Habib, M. Khan, Segmentation free Bangla OCR
using HMM: Training and Recognition, BRAC University,
Bangladesh.
[6] C. Chey, Khmer printed character using wavelet descriptor, M. Eng.
thesis, King Mongkut’s University of Technology, Thailand, 2004.
[7] L. Lensu (1998), Discrete Cosine Transform.
[8] A. B. Watson (1994). Image Compression Using the Discrete Cosine
Transform, Mathematica Journal, 4 (1), 81-88.
[9] I. LengIeng, K. Chenda (2008). Algorithm for Character Segmentation
in Khmer Optical Character Recognition (OCR), PAN Localization
Cambodia.
