Download Tesseract from

com/p/tesseract-ocr/downloads/list

Here I chose the compiled one, Tesseract-2.01.ext.tar.gz. (It is better to use a newer version, but since I do not have a compiler at hand now, I’ll just use the compiled one.) Extract it to any location.

Download the English language data. Extract it and put it in the tessdata folder of your Tesseract folder.

by Kruy Vanna

Download the Tesseract source package. What I need are the files in the configs and tessconfigs folders under tessdata. (The tesseract.exe package you downloaded does not have these.)

Extract it somewhere and copy its tessdata folder into our previous Tesseract folder.

Now we can start training. I train with this image. They say you should train with enough data, so every character should appear many times. (I don’t know if I’m right.) Maybe each character should appear many times, but in different fonts?
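To check whether every character really appears often enough, one option is to count the characters in the text that was typed into the training image. This is only a sketch: the sample text and the minimum of 3 are made-up values for illustration, not figures from the Tesseract documentation, and it counts single code points, so Khmer characters built from several code points would need extra handling.

```python
from collections import Counter


def sample_counts(training_text):
    """Count how often each non-whitespace character occurs."""
    return Counter(ch for ch in training_text if not ch.isspace())


def undersampled(training_text, minimum):
    """Return the characters that occur fewer than `minimum` times."""
    counts = sample_counts(training_text)
    return sorted(ch for ch, n in counts.items() if n < minimum)


# Hypothetical training text; use whatever text is in your own image.
text = "ខងច ខងច ខង ខ"
for ch, n in sample_counts(text).most_common():
    print(ch, n)
print("too few samples:", undersampled(text, minimum=3))
```

Any character reported as having too few samples is a candidate for adding more lines to the training image.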

Make the box file. Go to the command line and set the current directory to your Tesseract folder:
tesseract fontfile.tif fontfile batch.nochop makebox
Got the file: fontfile.txt
Renamed it so that I can open it in Tessboxer. There I type the character in the “Letter” textbox and the UTF-8 code is filled in automatically.

Make the feature file:
tesseract fontfile.tif junk nobatch box.train
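If there are several training images, typing these two commands for each one gets tedious. The sketch below only builds and prints the two command lines used above for a given base name. The helper name training_commands is my own, and it deliberately does not run anything, because the box file has to be corrected by hand between the two steps.

```python
def training_commands(base):
    """Return the two tesseract command lines for the image <base>.tif."""
    image = base + ".tif"
    return [
        # Step 1: produce the box file text for manual correction.
        ["tesseract", image, base, "batch.nochop", "makebox"],
        # Step 2: produce the feature file from the corrected boxes.
        ["tesseract", image, "junk", "nobatch", "box.train"],
    ]


for cmd in training_commands("fontfile"):
    print(" ".join(cmd))
```

Each inner list can be passed to subprocess.run once the box file is ready, which avoids quoting problems with file names that contain spaces.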

Got this log:

read_variables_file:variable not found: textord_no_rejects
Tesseract Open Source OCR Engine
Image has 24 bits per pixel and size (746,387)
Resolution=96
APPLY_BOXES:
   Boxes read from boxfile: 19
   Initially labelled blobs: 17 in 3 rows
   Box failures detected: 2
   Duped blobs for rebalance: 2
   "ច" has fewest samples: 5
   Total unlabelled words: 1
   Final labelled words: 19
Generating training data
TRAINING ... Font name = UnknownFont.
Generated training data for 19 blobs

(You should change the current directory to “training” to use this command.)
mftraining

Now I got the files I should have.

inttemp
This is a binary file, so the human eye can’t understand it.

pffmtable
ខ 104
ង 93
ច 85
I don’t know what the numbers mean.

I got this file too: “Microfeat”, but they say it’s not used.

Another command:
cntraining
Got this file: normproto

Compute the Character Set

Got this file: unicharset

Dictionary Data
Created the “frequent_words_list” file. They said I must put at least one word in it, so I just put “ខងច” in it using Notepad.

Generate the frequent-words dictionary file using the command:
wordlist2dawg frequent_words_list freq-dawg

Got the file: freq-dawg
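Instead of typing the word list in Notepad, the file can also be written from a script, which makes the encoding explicit. This sketch assumes the word list should be plain UTF-8 text with one word per line and no byte-order mark. That is my understanding of what wordlist2dawg expects, so please verify it against your Tesseract version. The function name write_word_list is hypothetical.

```python
from pathlib import Path


def write_word_list(path, words):
    """Write one word per line as UTF-8 (no byte-order mark).

    Returns the number of words written; blank entries are skipped.
    """
    cleaned = [w.strip() for w in words if w.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    # Same single word as in the walkthrough.
    print(write_word_list("frequent_words_list", ["ខងច"]))
```

The same helper can write “words_list”, and an empty list gives a file that contains only a line break.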

Created “words_list” file with the content “


Generate the word list dictionary file using the command:
wordlist2dawg words_list word-dawg

Got the file: word-dawg

Created the “user-words” file. They say it’s usually empty, so I keep it empty.

The last file
The “DangAmbigs” file is created manually. Its purpose is to reduce ambiguity. For example, “m” can easily be confused with “rn” (r + n). Khmer characters may not have this kind of ambiguity (need to confirm), so I make it an empty file.

Putting it all together
Now I have all the files renamed with the prefix “khm.” (khm is the ISO 639-2 code for Khmer, the language of Cambodia). All of these files should be put in the “tessdata” folder:
khm.DangAmbigs
khm.freq-dawg
khm.inttemp
khm.normproto
khm.pffmtable
khm.unicharset
khm.user-words
khm.word-dawg
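Renaming eight files by hand is easy to get wrong, so here is a small sketch that copies them into tessdata with the prefix added. The folder arguments and the function name install_language are my own assumptions for illustration. The list of file names is the one given above.

```python
import shutil
from pathlib import Path

# The eight training outputs listed above, without the language prefix.
TRAINED_FILES = [
    "DangAmbigs", "freq-dawg", "inttemp", "normproto",
    "pffmtable", "unicharset", "user-words", "word-dawg",
]


def install_language(src_dir, tessdata_dir, lang="khm"):
    """Copy each training output to <tessdata_dir>/<lang>.<name>."""
    src, dst = Path(src_dir), Path(tessdata_dir)
    missing = [n for n in TRAINED_FILES if not (src / n).is_file()]
    if missing:
        raise FileNotFoundError("missing: " + ", ".join(missing))
    dst.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in TRAINED_FILES:
        target = dst / f"{lang}.{name}"
        shutil.copyfile(src / name, target)
        installed.append(target.name)
    return installed


# Example call (hypothetical folders):
# install_language("training", "tessdata", lang="khm")
```

It copies rather than renames, so the original outputs stay in place in case training has to be repeated.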

Now it is time to run the test! I have this image: khmer.tif

I run it with the command:
tesseract khmer.tif output -l khm
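The test run can also be scripted. The sketch below assumes that tesseract is on the PATH and that the result file is UTF-8 text. The names ocr_command and run_ocr are hypothetical helpers. Note that the language option is a plain ASCII hyphen followed by the letter l; a long dash pasted from a word processor will not be recognised as an option.

```python
import subprocess
from pathlib import Path


def ocr_command(image, output_base, lang):
    """Build the command line; tesseract adds .txt to output_base itself."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base="output", lang="khm"):
    """Run tesseract and return the recognised text."""
    subprocess.run(ocr_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


print(" ".join(ocr_command("khmer.tif", "output", "khm")))
# To actually run it (needs tesseract and the khm.* files installed):
# print(run_ocr("khmer.tif"))
```

With check=True, a failed run raises an exception instead of silently leaving an old output.txt in place.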

I got the output.txt with the content:

