Tamil Encoding Standards

P.Chellappan

What is Encoding?
Computers store all data as numbers.
Text data is also stored as numbers, not as glyphs (shapes).
The letter A is not stored as the shape A but as the number 65.
A number has to be assigned to every letter of the alphabet.
This assignment is called encoding.
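
The idea of "a number for every letter" can be tried directly in Python. This is a minimal sketch; the helper names `to_codes` and `from_codes` are made up for illustration.

```python
# Every character is stored as a number (its code point).

def to_codes(text):
    """Return the list of code points, one per character in text."""
    return [ord(ch) for ch in text]

def from_codes(codes):
    """Rebuild the text from a list of code points."""
    return "".join(chr(c) for c in codes)

print(ord("A"))            # 65
print(to_codes("and"))     # [97, 110, 100]
print(from_codes([65]))    # A
```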

What is Encoding?
A byte is a unit of storage and consists of 8 bits.
A byte usually represents one character.
A byte can hold a maximum of 256 combinations of 1s and 0s: 00000000, 00000001 ... 11111110, 11111111.
This means a single byte can represent any one of 256 different characters.
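
The count of 256 can be checked by listing every 8-bit pattern. A small sketch; `byte_patterns` is a hypothetical helper name.

```python
BITS_PER_BYTE = 8

def byte_patterns():
    """Return every possible byte value as an 8-character string of 0s and 1s."""
    return [format(value, "08b") for value in range(2 ** BITS_PER_BYTE)]

patterns = byte_patterns()
print(len(patterns))                  # 256
print(patterns[0], patterns[1])       # 00000000 00000001
print(patterns[-2], patterns[-1])     # 11111110 11111111
```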

Mac Roman Encoding

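Tamil Encoding
Tamil has 313 characters.
The number of slots available is less than 256.
So glyphs are encoded, not characters: the component shapes of the letters are given code points.
That is, a compound letter is not stored as a single letter but as a sequence of glyphs (glyph + glyph + glyph).
This is called glyph encoding.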

TAM Encoding

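English Text input, display and storage
[Diagram: a key press goes to the English keyboard driver, which produces the code point for the key (65). The English font supplies the shape for that code point, and "A" appears on the screen. In storage, the word "and" is held as the code points 97 110 100.]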

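Tamil Text input, display and storage
[Diagram: a key press goes to the Tamil keyboard driver, which produces the code point for the key (220). The Tamil font supplies the shape for that code point, which appears on the screen. In storage, the example Tamil text is held as the code points 220 139 241 163.]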

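8 Bit Encodings - Problems
The same 256 locations are used by all languages: code 97 is "a" in an English font but a different glyph in a Tamil font.
Using the wrong font displays garbage: the same stored codes read as "and" in an English font and as unrelated glyphs in a Tamil font.
Hence language information is also required to view the text.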
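
The same effect can be reproduced with any two 8-bit encodings. The sketch below decodes the byte 220 twice. It uses Latin-1 and Mac Roman as stand-ins, on the assumption that no TAM codec is available in the Python standard library; the variable names are illustrative only.

```python
# One stored byte, two different interpretations.
raw = bytes([220])

as_latin1 = raw.decode("latin-1")       # interpretation under ISO 8859-1
as_mac_roman = raw.decode("mac_roman")  # interpretation under Mac Roman

print(repr(as_latin1), repr(as_mac_roman))
print(as_latin1 == as_mac_roman)        # False: same byte, different character
```

Nothing in the byte itself says which table to use, which is why the encoding (or font) has to travel with the text.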

8 Bit Encodings - Problems
Font change is difficult, e.g. in text that mixes Tamil words with English words such as "and" and "mother".
Cross-platform data exchange is difficult.

Solution
Characters of different languages should be given different numbers.
256 locations are not sufficient.
A 2-byte (16-bit) character system can have 65,536 characters.
That is sufficient to handle most of the world's languages!
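
The arithmetic, and the idea of separate numbers per language, can be checked in a few lines. Unicode, introduced on the following slides, is used here as the example of such a system; the variable names are illustrative.

```python
print(2 ** 8)     # 256 slots in one byte
print(2 ** 16)    # 65536 slots in two bytes

# With a larger code space, letters of different scripts no longer share a number.
latin_a = ord("a")          # Latin small letter a
tamil_a = ord("\u0b85")     # TAMIL LETTER A

print(latin_a, hex(tamil_a))
```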

Solution
Moving to a 16-bit encoding system is a must.

Unicode
Designed as a 16-bit character encoding scheme (the code space has since been extended beyond 65,536 code points).

Contains characters of most of the world's languages.
9 Indian scripts, including Tamil, are encoded.
One script can cover more than one language.

Unicode
Primarily encodes characters, not glyphs.
The character-to-glyph ratio is not necessarily 1:1.
65 will be displayed as A.
Many characters can be represented by 1 glyph, e.g. a consonant followed by a vowel modifier is displayed as a single combined shape.

Unicode
Level 1 implementation: Character : Glyph is 1:1. A TTF font is sufficient.
Level 2 implementation (Tamil): Character : Glyph is not 1:1. OTF fonts are required (GSUB, GPOS tables, etc.).

Unicode
Tamil block: U+0B80 to U+0BFF (128 locations).
All vowels and the ayutham are encoded.
All akaram eriya mei (consonants carrying the inherent vowel a) are encoded.
Vowel modifiers (the dependent vowel signs, shown on a dotted circle) are encoded.
Tamil numerals and symbols are encoded.
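
The block can be inspected with Python's standard `unicodedata` module. This is a sketch only: the number of assigned code points depends on the Unicode version bundled with the interpreter, so no count is claimed here, and `assigned_in_block` is a hypothetical helper name.

```python
import unicodedata

TAMIL_BLOCK = range(0x0B80, 0x0BFF + 1)   # 128 locations

def assigned_in_block(block):
    """Return (code point, name) pairs for the assigned code points in block."""
    pairs = []
    for cp in block:
        name = unicodedata.name(chr(cp), None)   # None for unassigned slots
        if name is not None:
            pairs.append((cp, name))
    return pairs

assigned = assigned_in_block(TAMIL_BLOCK)
print(len(TAMIL_BLOCK))                   # 128
print(len(assigned))                      # fewer than 128; varies by Unicode version
for cp, name in assigned[:5]:
    print(f"U+{cp:04X} {name}")
```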

Unicode Code Chart

Unicode Sequences
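All uyirmeis are encoded only as consonant + vowel modifier sequences (consonant + vowel modifier = uyirmei).
Conjunct forms are likewise encoded as sequences, following the patterns consonant + combining sign + consonant and consonant + combining sign + consonant + vowel modifier.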
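
A short sketch of one such sequence, using the standard `unicodedata` module. The choice of KA with vowel sign I is an illustration picked here, not necessarily the example shown on the slide.

```python
import unicodedata

KA = "\u0b95"        # TAMIL LETTER KA (consonant with inherent a)
SIGN_I = "\u0bbf"    # TAMIL VOWEL SIGN I (vowel modifier)

ki = KA + SIGN_I     # shown as one letter by a font with complex script support

print(len(ki))       # 2 code points for one visible letter
for ch in ki:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```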

Advantages of Unicode
It is a common encoding used for all languages by all operating systems.
It is the standard used on the Internet for all data exchange.
Systems automatically detect the language and process information correctly.

Disadvantages of Unicode
File sizes are bigger.
Data processing time is longer.
NLP is more complex.
Meis (pure consonants) are not encoded as single code points.
Character widths are not uniform.
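
The size point can be made concrete by encoding one Tamil word. This is a sketch; the sample word is chosen here for illustration, and the comparison with an 8-bit glyph encoding is left to the reader because Python has no such codec to measure with.

```python
# The word "Tamil" in Tamil script: 5 code points, the last two forming a mei.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

utf8_size = len(word.encode("utf-8"))        # 3 bytes per Tamil code point
utf16_size = len(word.encode("utf-16-le"))   # 2 bytes per Tamil code point

print(len(word))     # 5 code points
print(utf8_size)     # 15 bytes
print(utf16_size)    # 10 bytes
```

The final mei in the word takes two code points (consonant + pulli), which illustrates the "meis are not encoded" point as well.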

Problems with Unicode
Not all applications support complex scripts.
Major DTP applications also do not support complex scripts at present.
The same character can be represented by more than one sequence, not all of them necessarily correct.
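
One case of multiple sequences can be shown with the two-part vowel sign O, which Unicode also allows to be written as vowel sign E followed by vowel sign AA. The example is chosen here for illustration; normalization with the standard `unicodedata` module is one common way to make such sequences comparable.

```python
import unicodedata

KA = "\u0b95"
one_sign = KA + "\u0bca"                # KA + VOWEL SIGN O
two_signs = KA + "\u0bc6" + "\u0bbe"    # KA + VOWEL SIGN E + VOWEL SIGN AA

# Both look the same on screen, but a plain comparison sees different data.
print(one_sign == two_signs)

# NFC normalization folds the two-sign form into the single-sign form.
normalized = unicodedata.normalize("NFC", two_signs)
print(normalized == one_sign)
```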

TACE: An alternate encoding
It is an alternate 16-bit encoding.
It encodes all uyirs, meis and uyirmeis.

TACE Code chart

TACE Design features
Every uyirmei has a unique code point.
Since every character is encoded, complex script support is not required.
The mei and uyir components of every character can be easily separated by simple bit operations, e.g. xxxx xxxc cccc vvvv (c = mei bits, v = uyir bits).
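
The bit pattern above can be taken apart with a shift and two masks. The sketch below assumes exactly the layout shown (7 prefix bits, 5 mei bits, 4 uyir bits); the function names and the sample values are hypothetical and are not actual TACE16 code assignments.

```python
VOWEL_BITS = 4                                # vvvv
CONSONANT_BITS = 5                            # c cccc
VOWEL_MASK = (1 << VOWEL_BITS) - 1            # 0b1111
CONSONANT_MASK = (1 << CONSONANT_BITS) - 1    # 0b11111

def split_code(code):
    """Split a 16-bit code into (prefix, mei index, uyir index)."""
    vowel = code & VOWEL_MASK
    consonant = (code >> VOWEL_BITS) & CONSONANT_MASK
    prefix = code >> (VOWEL_BITS + CONSONANT_BITS)
    return prefix, consonant, vowel

def join_code(prefix, consonant, vowel):
    """Build a 16-bit code from its three fields (inverse of split_code)."""
    return (prefix << (VOWEL_BITS + CONSONANT_BITS)) | (consonant << VOWEL_BITS) | vowel

# Made-up field values, used only to show the round trip.
code = join_code(0b1110001, 3, 5)
print(f"{code:016b}")
print(split_code(code))
```

Because the fields sit at fixed bit positions, finding the mei or uyir of a letter needs no table lookup, which is the design point the slide makes.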

TACE Advantages
Complex script support is not required.
Compact file sizes.
Fast data processing.
Fast NLP processing.

TACE Disadvantages
The major bottleneck is that it is encoded in the Private Use Area (PUA).
It cannot be used for global data interchange.
It can be used only within a closed user group.
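
Software can tell that a code point lies in a Private Use Area, but not what it means, which is why such text only works inside a group that has agreed on the assignments. A minimal sketch with the standard `unicodedata` module; `is_private_use` is a hypothetical helper, and the code points tested are generic examples rather than specific TACE16 values.

```python
import unicodedata

def is_private_use(ch):
    """True if ch has the Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

print(is_private_use("\ue000"))   # first code point of the BMP Private Use Area
print(is_private_use("\u0b95"))   # TAMIL LETTER KA, a standard assignment
```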

Tamil Nadu Government Standards
The Government of Tamil Nadu has taken a policy decision to convert to the 16-bit encoding system.
Unicode will be the primary 16-bit standard.
TACE16 will be the only alternate encoding standard in areas where support for Unicode is not available fully or partially.