
MYANMAR UNICODE

MYTHS VS. TRUTHS


Ngwe Tun
CEO, Solveware Solution
YMCA, 2008-02-12
1
Etymology: Myanmar Unicode

 Myanmar
> The Myanmar script is used to write Burmese, the majority
language of Myanmar. Variations and extensions of the script are
used to write other languages of the region, such as Shan and Mon,
Karen as well as Pali and Sanskrit.
Ref: http://www.unicode.org/versions/Unicode5.0.0/ch11.pdf
> The Myanmar writing system derives from a Brahmi-related
script borrowed from South India in about the eighth century to
write the Mon language.
> The basic consonants, independent vowels, and dependent
vowel signs required for writing the Myanmar language are encoded
at the beginning of the Myanmar range (U+1000 to U+109F).
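
As a minimal sketch of what that range means in practice, the Python snippet below tests whether a character falls inside the Myanmar block. The helper name `in_myanmar_block` is hypothetical and not part of any library.

```python
MYANMAR_START = 0x1000  # first code point of the Myanmar block
MYANMAR_END = 0x109F    # last code point of the Myanmar block

def in_myanmar_block(ch: str) -> bool:
    """Return True if the single character ch lies in U+1000..U+109F."""
    return MYANMAR_START <= ord(ch) <= MYANMAR_END

print(in_myanmar_block("\u1000"))  # a Myanmar letter
print(in_myanmar_block("A"))       # a Latin letter
```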

2
Myanmar Unicode 5.0.0

3
Unicode 5.1: What's New?

4
Etymology: Myanmar Unicode

 Unicode
> The Unicode Standard is the universal character encoding
standard for written characters and text. It defines a consistent way
of encoding multilingual text that enables the exchange of text data
internationally and creates the foundation for global software.
> It provides the capacity to encode all characters used for the
written languages of the world—more than 1 million characters can
be encoded.
> The Unicode character encoding treats alphabetic characters,
ideographic characters, and symbols equivalently, which means they
can be used in any mixture and with equal facility.

5
Etymology: Myanmar Unicode

 Unicode
> The Unicode Standard specifies a numeric value (code point)
and a name for each of its characters.
> The Unicode Standard defines these and other semantic values,
and it includes application data such as case mapping tables (a, A)
and character property tables as part of the Unicode Character
Database (UCD). Character properties define a character’s identity
and behavior; they ensure consistency in the processing and
interchange of Unicode data.
> Unicode characters are represented in one of three encoding
forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit
form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been
designed for ease of use with existing ASCII-based systems.
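
The points above (code point, name, properties, and the three encoding forms) can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch; the helper `describe` is a hypothetical name introduced here.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect the code point, name, general category and encoded sizes of ch."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        # Size in bytes of the same character in each encoding form
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-be")),
        "utf32_bytes": len(ch.encode("utf-32-be")),
    }

print(describe("\u1000"))  # first letter of the Myanmar block
print(describe("A"))       # an ASCII letter stays one byte in UTF-8
print("a".upper())         # case mapping (a, A) comes from the same database
```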

6
What Is Unicode's Design Goal?

 The primary goal of the development effort for the Unicode
Standard was to remedy two serious problems common to most
multilingual computer programs.
 The first problem was the overloading of the font mechanism when
encoding characters. Fonts have often been indiscriminately
mapped to the same set of bytes. For example, the bytes 0x00 to
0xFF are often used for both characters and dingbats.
 The second major problem was the use of multiple, inconsistent
character codes because of conflicting national and industry
character standards. In Western European software environments,
for example, one often finds confusion between the Windows Latin 1
code page 1252 and ISO/IEC 8859-1.
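
The second problem is easy to reproduce. The sketch below decodes the same three bytes with both encodings; the byte values are chosen here only for illustration.

```python
# The same bytes, interpreted under two "Latin 1" conventions
raw = bytes([0x80, 0x93, 0xE9])

as_cp1252 = raw.decode("cp1252")      # Windows code page 1252
as_latin1 = raw.decode("iso-8859-1")  # ISO/IEC 8859-1

# 0x80 and 0x93 are printable in CP 1252 (euro sign, left double quote)
# but map to C1 control characters in ISO/IEC 8859-1.
print([f"U+{ord(c):04X}" for c in as_cp1252])
print([f"U+{ord(c):04X}" for c in as_latin1])
```

Only the last byte (0xE9) decodes to the same character under both interpretations, which is exactly the kind of confusion described above.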

7
What Is Unicode?

 Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
 Adopted by industry
 Required by modern standards
 Official implementation of ISO/IEC 10646
 Standard maintained by the Unicode Consortium:
http://www.unicode.org
 Current version: 5.0
 Version 5.1 is due for release later this year (March 2008)

8
Why is Unicode needed?

 Enables computer systems to support virtually all
the world’s written languages.
 De facto requirement for multilingual applications.

[Figure: two screenshots, labeled "Using Unicode" and "Using CP 1252"]


9
Character: an Abstraction

 ှ has 4 variant shapes (different glyphs):
မှု, ညှိ, နှာ, လျှာ
 ူ has 2 variant shapes (different glyphs):
ကူ, စက္ကူ
 As an exception, ဳ and ါ are each assigned their own
separate position. (different characters)
 So that ဿ and ္သ can be told apart, each is assigned its own
separate position. (different characters)
10
Characters ≠ Glyphs

 ဋ္ဌ, ဏ္ဍ and other stacked (pat-sint) forms: no single code point
is assigned.
 ကြ, ခြ: no single code point is assigned to the variant shapes of
medial consonant signs.
 သင်္ကေတ: no single code point is assigned to kinzi.

11
Unicode Design Principles

1. Universal repertoire
2. Processing efficiency
3. Characters, not glyphs
4. Semantics (properties)
5. Plain text
6. Logical order in memory
7. Unification (within scripts across languages)
8. Dynamic composition
9. Equivalent Sequences (compatibility with current standards)
10. Convertibility

12
Life of a Character

[Figure: life of a character. Labels: keyboard (G key pressed), driver, keyboard layout, memory (code point 0047), CPU, renderer, font file, displayed glyph G.]
13
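Rendering a complex script

[Figure: the code points 1000 103B 1031 (က ျ ေ) held in memory are put into display order, then glyph selection and positioning apply the font's positional shapes and ligatures to produce the displayed syllable ကျေ.]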
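
To make the logical (memory) order concrete, the short sketch below lists the three code points 1000 103B 1031 with their Unicode names, assuming the character names in the Unicode data bundled with Python. The helper `logical_order` is a hypothetical name. The vowel sign E is stored last even though it is drawn to the left of the consonant; reordering for display is the renderer's job.

```python
import unicodedata

# Stored (logical) order: consonant, medial, vowel sign
syllable = "\u1000\u103b\u1031"

def logical_order(text: str) -> list:
    """Return (hex code point, Unicode name) pairs in memory order."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code, name in logical_order(syllable):
    print(code, name)
```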
14
Scripts

[Figure: kinds of scripts]
Syllabic: かきく…
Alphabetic: ถไณ…
Ideographic: 漢字…
Alphabetic, bi-cameral: Α Β Γ… α β γ…
Alphabetic, right-to-left: ا ب ت…
Shared: !,:;…
Symbols: √ ≠…
15
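Letter → Codes → Glyphs

[Figure: the letter ĝ can be encoded either as a single coded character, ĝ (011D), or as a coded character sequence, g (0067) followed by the combining circumflex (0302). Either encoding may be displayed with any of several ĝ glyphs.]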
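
The two encodings of ĝ (011D versus 0067 0302) are canonically equivalent, and normalization converts between them. A minimal sketch using Python's standard `unicodedata` module:

```python
import unicodedata

precomposed = "\u011d"  # single coded character
sequence = "g\u0302"    # base letter + combining circumflex

# The raw code point sequences differ...
print(precomposed == sequence)

# ...but normalizing to a common form makes them compare equal.
composed = unicodedata.normalize("NFC", sequence)       # compose
decomposed = unicodedata.normalize("NFD", precomposed)  # decompose
print(composed == precomposed, decomposed == sequence)
```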
16
What Is the Standard?

[Figure: the Standard
+ Unicode Standard Annexes (UAX #9, #11, #14, #15, #24, #29, ...)
+ Unicode Character Database (data files)
+ Unicode minor versions (UMV), such as 4.1* and 4.2*]
* when and if they exist
17
Standards Annexes

 UAX #9 The Bidirectional Algorithm
 UAX #11 East Asian Width
 UAX #14 Line Breaking Properties
 UAX #15 Unicode Normalization Forms
 UAX #24 Script Names
 UAX #29 Text Boundaries
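
Some of the properties these annexes rely on can be queried directly. The sketch below uses Python's standard `unicodedata` module to look up the bidirectional class (used by UAX #9) and the East Asian width (UAX #11) of a few sample characters; the sample characters and the helper `annex_properties` are choices made here for illustration.

```python
import unicodedata

def annex_properties(ch: str) -> tuple:
    """Return (bidirectional class, East Asian width) for one character."""
    return unicodedata.bidirectional(ch), unicodedata.east_asian_width(ch)

print(annex_properties("A"))       # Latin letter
print(annex_properties("\u0627"))  # Arabic letter alef (right-to-left)
print(annex_properties("\u6f22"))  # CJK ideograph (wide)
print(annex_properties("\u1000"))  # Myanmar letter ka
```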

18
Conclusion…

 Unicode is a universal character set
 Unicode encodes characters, not glyphs
 Unicode includes character properties and
implementation specifications
 Unicode's increasing adoption enables
multilingual applications

19
Myth #1
(based on Unicode Std)
 …ဆိုတာ ယူနီကုတံအဖွဲဲအစညံဵက သတံမဿတံထာဵတဲဴ ြမနံမာစာနဿငံဴ
ပတံသကံတဲဴ စဳသတံမဿတံချကံ​ ေတွကိုအ​ေ ြခခဳ ြပီဵ ကွနံပျူတာမဿာ ြမနံမာစာကို
သုဳဵလိုေရ​ေအာငံ လုပံထာဵတဲဴစနစံ ြဖစံပ၂တယံ။…

 အမဿနံ

ယူနီကုဒံစဳသတံမဿတံချကံမျာဵကို တစံ​ေသွမတိမံဵလိုကံနာလုပံ​ေဆာငံရာတွငံ
သတံမဿတံထာဵ​ေသာ Code points သာမက အ ြခာဵ စဳသတံမဿတံချကံမျာဵကိုပ၂
လိုကံနာရနံ လိုအပံပ၂သညံ။

20
Myth #2
(Code Page Or Code Block)
 "A code page is assigned to each language. For the Myanmar
language, the code page is assigned from 1000 to 109F."

 Truth

Wikipedia defines it as: “Code page is the traditional IBM term used for a specific
character encoding table: a mapping in which a sequence of bits, usually a single
octet representing integer values 0 through 255, is associated with a specific
character. IBM and Microsoft often allocate a code page number to a character
set even if that charset is better known by another name.”
The correct term is code block, and a block is assigned per script, not per language.
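
A quick way to see that the block belongs to a script rather than to one language is to look at letters for other languages encoded in the same range. The sketch below assumes the Unicode data bundled with a current Python, which includes Mon and Shan letters added to the Myanmar block after Unicode 5.0; the sample code points are chosen here for illustration.

```python
import unicodedata

# One letter each for Burmese, Mon and Shan, all inside U+1000..U+109F
samples = ["\u1000", "\u105a", "\u1075"]

names = [unicodedata.name(ch) for ch in samples]
for ch, name in zip(samples, names):
    print(f"U+{ord(ch):04X}", name)
```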

21
Myth #3
(just drag and drop)
 "On any OS, you can use it simply by dragging and dropping the
font file."

 Bravo!

 Truth

To use Myanmar script, a rendering engine must be installed in the
OS concerned.

22
Myth #4
(word break or syllable break)
 "The input method inserts the necessary word breaks as you type,
so that words can be recognized automatically. This is not just for
Myanmar; every language needs word breaks."

 Truth
 UTN #11 states: “From this we can say that a syllable break
may occur before a Myanmar digit, an independent vowel, one of the
various signs or a base consonant so long as the consonant:”

 is not devowelised with an asat and has no stacked consonant below it and
 is not a kinzi.

 သူ အလွန်ကြိုးစားသဖြင့် ကန်ထ
 ရိုက်တာ ဖြစ်သွားသည်။ ("He worked very hard and so became a contractor." The line breaks at a syllable-break opportunity.)
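
The UTN #11 rule quoted above can be sketched roughly as follows. This is a simplified illustration, not a full implementation: it assumes text in the Unicode 5.1 encoding (asat U+103A, virama U+1039), treats U+1000..U+1021 as base consonants, and the names `break_positions` and `syllables` are hypothetical helpers.

```python
ASAT, VIRAMA, DOT_BELOW = "\u103a", "\u1039", "\u1037"

def is_consonant(ch: str) -> bool:
    return "\u1000" <= ch <= "\u1021"

def is_other_starter(ch: str) -> bool:
    # Independent vowels, digits, and punctuation signs
    return ("\u1023" <= ch <= "\u102a") or ("\u1040" <= ch <= "\u104f")

def break_positions(text: str) -> list:
    """Indexes before which a syllable break may occur."""
    positions = []
    for i in range(1, len(text)):
        ch = text[i]
        if is_other_starter(ch):
            positions.append(i)
            continue
        if not is_consonant(ch):
            continue
        if text[i - 1] == VIRAMA:
            continue  # stacked below the previous consonant, not a base
        following = text[i + 1:i + 3]
        if following[:1] in (ASAT, VIRAMA):
            continue  # devowelised (also covers kinzi) or carries a stack
        if following == DOT_BELOW + ASAT:
            continue  # devowelised, with dot below stored before asat
        positions.append(i)
    return positions

def syllables(text: str) -> list:
    cuts = [0] + break_positions(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

# "contractor": ka na asat, hta, ra i u ka asat, ta aa
print(syllables("\u1000\u1014\u103a\u1011\u101b\u102d\u102f\u1000\u103a\u1010\u102c"))
```

For the word in the example sentence, the sketch yields four clusters, with one of the break opportunities falling exactly where the slide wraps the line.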

23
Myth #5
(vowel sign in Myanmar)
 "There are no separate letters for vowels. So I was very surprised
that ဩ, ဪ, ဥ, ဦ were given Unicode values as vowels."

 Truth
 Saya Maung Khin Min (Danubyu), in his book on the Myanmar language
and script (မြန်မာစကား၊ မြန်မာစာ ရုပ်ပုံလွှာ), page 121: “Vowel symbols can be
divided into two kinds: pure (independent) vowels and vowels attached
to consonants.”

24
Myth #6
(fake, partial, pseudo Unicode)
 Modifying the Unicode specifications.
 Adding new Unicode code points.
 Having to add separate standard forms.

 Truth
The main purposes of standardizing on Unicode are:
1) To make data interchange easy and accurate.
2) The benefit of using Unicode is obtained only when text stays readable
even if the two users do not have identical systems.

25
Myth #7
(Windows enabled Myanmar 100%)
 Myanmar Unicode can be used 100 percent on MS Windows.
 Myanmar words can be searched and sorted alphabetically on Windows.
 Computers can now be used in the Myanmar language.

 Truth
The additional Unicode specifications for Myanmar will be approved in
March 2008; they are still under discussion. On Windows, typed text
merely displays. Compared with the level of support available for other
languages, it is only about 50 percent.

26
Myth #8
(vowel should not be placed after consonants)
 "About placing thawaythoe (the vowel sign ေ) after the consonant, ...
if thawaythoe comes before the consonant, composing, sorting, and
breaking text will be quite a headache."

 Truth
In Myanmar script, the specification of character order is a very
important point. The order consonant, medial, vowel is determined by
the structure of the word.
Example: လိေမမောံ, အိေဒနြေ

27
Myth #9
(* Unicode Font)
 "Five Myanmar experts invented the Unicode font in 2005."

 Truth
Merely reassigning the positions of ordinary glyph shapes, without
following the specifications laid down by Unicode, should not be
called an invention.

28
Myth #10
(Government forces the use of Unicode fonts)
 "The state is forcing us to use standards that do not work."

 Truth
Laying down standards, and having the relevant authorities set policy
for their trial use, is the way it is done internationally.

29
There are more myths. Send them to us.

 ngwetun@solvewaresolution.net
 www.parabaik.info
30
