Week 4 - A Comparative Study of UTF-8, UTF-16 and UTF-32
Introduction
Unicode is the first truly successful multilingual character set standard (Ken Lunde,
2008). It enables a single software product or website to be targeted across multiple
platforms, languages and countries without reengineering. It provides more than a
million code points and covers the characters of all widely used languages of the
world. Each character is assigned a unique number called its Unicode code point,
conventionally written in hexadecimal. For example, Latin capital 'A' is given the
code point U+0041 in hexadecimal, or 65 in decimal. Unicode is a superset of all
earlier character set codes. ASCII can represent only 128 characters, comprising the
Latin upper- and lowercase letters, digits and some special characters. Single-byte
extended character sets such as ISO 8859-1 (Latin-1) can represent only 256
characters; the first 128 characters are the same as in ASCII and the other 128 are
used to represent characters of other European languages.
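As a quick illustration of code points, here is a minimal sketch using Python's built-in Unicode support (nothing beyond the standard library is assumed):

```python
# Inspect the Unicode code point of a character and convert back.
ch = "A"
print(hex(ord(ch)))  # 0x41, i.e., code point U+0041
print(ord(ch))       # 65 in decimal
print(chr(0x41))     # 'A'
```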
The computer industry has adopted Unicode, giving a global outlook to building
international software that can be easily adapted to meet the needs of particular
locations and cultures.1 Incorporating Unicode into client-server or multitiered
applications and websites offers significant cost savings over the use of legacy2
character sets. Unicode enables a single software product or a single website to be
targeted across multiple platforms, languages and countries without modification.
It allows data to be transported through many different systems without corruption,
which means software products developed earlier can be converted into Unicode
without losing data. The standard includes the European alphabetic scripts, Middle
Eastern right-to-left scripts, and Asian and African scripts.

* Faculty, Computer Application, Desh Bhagat Institute of Management and Computer Sciences, Mandi Gobindgarh, Punjab, India. E-mail: san.jaidka@gmail.com
1. Available at www.unicode.org
© 2012 IUP. All Rights Reserved. The IUP Journal of Telecommunications, Vol. IV, No. 2, 2012
Unicode makes globalization possible, i.e., single server, single build, single
installation, single instance, serving all clients in all languages.
– Markus Kuhn, 2009
Computers deal with just numbers. They store letters and other characters by
assigning a number for each one. Before Unicode was invented, there were hundreds
of different encoding systems for assigning these numbers.3 No single encoding could
contain enough characters: for example, the European Union alone requires several
different encodings to cover all its languages. Even for a single language like English
no single encoding was adequate for all the letters, punctuation and technical symbols
in common use.
A given machine may support only one kind of encoding, so transferring text from
one machine to another can cause loss of information. Moreover, conversion of
information between encodings can require additional mapping data of several
megabytes and considerable time.
These encoding systems also conflict with one another—two encodings can use the
same number for two different characters, or use different numbers for the same
character. Any given computer (especially servers) needs to support many different
encodings; yet whenever data is passed between different encodings or platforms, that
data always runs the risk of corruption.4 Unicode Standard eliminates all such deficiencies
and relieves the programmer community of unnecessary burden by:
1. Allowing software such as a website to switch from one language to
another without loss of semantics.
2. Allowing data to be passed between different encodings or platforms
without any risk of corruption.
3. Removing the need for multiple encoding schemes, since Unicode supports
all platforms and native languages, and thus requires no additional
conversion data of several megabytes, nor considerable conversion time.
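The corruption-free round trip described in point 2 can be checked directly; a minimal sketch in Python (the sample text is an arbitrary multilingual string chosen for illustration):

```python
# Encode the same text in all three Unicode encoding forms and
# verify that decoding recovers it exactly (lossless round trip).
text = "Grüße, नमस्ते, ☺"
for enc in ("utf-8", "utf-16", "utf-32"):
    assert text.encode(enc).decode(enc) == text
print("round trip lossless in all three forms")
```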
2. Legacy encoding is a character encoding that continues to be used despite being obsoleted by another encoding.
3. Available at www.unicode.org
4. Available at http://www.ibm.com/developerworks/library/codepages.html
The four characters shown above occupy the display space of a single character only.
In the same way, some scripts follow a right-to-left sequence, such as Urdu.
In the late 1980s, a project known as the Unicode project was initiated by a
consortium of manufacturers, mostly from the US.
5. Available at www.unicode.org
6. Ibid.
• A single visible character may be built from several code points, e.g., Ç
(C followed by a combining cedilla), or several letters may be treated as one
unit for sorting, e.g., the digraph 'ch' in Slovak.
• Characters in writing systems other than English are not necessarily rendered
visibly one by one in rectangles from left to right. In many cases, character
positioning is quite complex and does not proceed in a linear fashion, e.g.,
Devanagari.
• Some scripts share common characters across languages, such as Chinese/
Japanese/Korean (CJK), where consolidation is achieved by assigning a single
code point to each common ideograph.
The Unicode Standard gives programmers extensive descriptions and a vast amount
of data about how characters function: how to form words and break lines; how to
sort text in different languages; how to format numbers, dates, times and other
elements appropriate to different languages; how to display languages whose written
form flows from right to left, such as Arabic and Hebrew, or whose written form
splits, combines and reorders, such as the languages of South Asia; and how to deal
with security concerns regarding the many 'look-alike' characters from alphabets
around the world. Without the properties, algorithms and other specifications in the
Unicode Standard and its associated specifications, interoperability between different
implementations would be impossible.
Unicode Principles7
Universality
The Unicode Standard encodes a single, very large set of characters, encompassing
all the characters needed for worldwide use. The Unicode Standard is designed to
meet the needs of diverse user communities within each language, serving business,
educational, liturgical and scientific users and covering the needs of both modern
and historical texts.
Efficiency
There are no escape characters or shift states (as found in legacy encodings such as
ISO 2022) in the Unicode character encoding model. Each character code has the
same status as any other character code.
7. Available at www.unicode.org
UTF-16
It uses one or two 16-bit code units for each code point. For example, the character
'A' has the code point U+0041, which can be stored in a single 16-bit code unit;
a second code unit is not required and space is saved. On the other hand, a
supplementary character such as U+1D11E (musical symbol G clef) requires 32 bits,
i.e., two code units forming a surrogate pair.9 UTF-16 is a well-designed compromise
between space and handling: all commonly used characters can be stored with one
code unit per code point, where the code unit has the same integer value as the
code point. This is the default encoding form for Unicode.10 UTF-8 is more
space-efficient for ASCII-heavy text, but more complex algorithms are required to
handle its variable-length code units.
On average, more than 99% of all UTF-16 data is expressed using single code
units. This includes nearly all of the typical characters that software products need
to handle with special operations on text—for example, format control characters.
As a consequence, most text scanning operations do not need to unpack UTF-16
surrogate pairs at all, but rather can safely treat them as an opaque part of a
character string.
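The one-versus-two code unit behaviour described above can be observed in Python; the big-endian codec "utf-16-be" is used here only to avoid the byte-order mark:

```python
# A BMP character such as U+0041 needs one 16-bit code unit (2 bytes).
assert len("A".encode("utf-16-be")) == 2
# A supplementary character such as U+1D11E (musical symbol G clef)
# needs a surrogate pair: two 16-bit code units (4 bytes).
clef = "\U0001D11E"
assert len(clef.encode("utf-16-be")) == 4
print("BMP: 1 code unit; supplementary: 2 code units")
```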
UTF-8
The encoding known today as UTF-8 was invented by Ken Thompson (Markus Kuhn,
2009). It is a variable-length (1 to 4 bytes), byte-order independent encoding. It uses
one to four 8-bit code units for each code point. If a character's code point falls
within the ASCII range (U+0000 to U+007F), only one 8-bit code unit is allocated.
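The one-to-four-byte behaviour can be demonstrated in Python (the sample characters are arbitrary picks, one from each length range):

```python
# UTF-8 byte counts grow with the code point value:
# U+0041 -> 1 byte, U+00E9 -> 2, U+20AC -> 3, U+1F600 -> 4.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")
```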
8. Available at http://linuxgazette.net/147/pfeiffer.html, Tips on Using Unicode with C/C++, René Pfeiffer
9. Available at www.unicode.org
10. Available at http://scripts.sil.org/, Mapping codepoints to Unicode encoding forms, §1: UTF-32
A lead byte of the form 110xxxxx indicates that the code point occupies two bytes.
Every byte other than the first byte of a multibyte sequence is a continuation byte,
with the 8th bit set and the 7th bit clear (10xxxxxx).
Therefore, if four bytes are used to store a code point, 11 bits are overhead and
only 21 bits are available to represent the code point.
The xxx bit positions are filled with the bits of the character's code number in
binary representation, with the rightmost x bit as the least significant bit. Only the
shortest possible multibyte sequence that can represent the code number is used.
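The lead-byte and continuation-byte patterns above can be verified by decoding a two-byte sequence by hand; a sketch in Python:

```python
# U+00E9 ('é') encodes as the two-byte sequence 110xxxxx 10xxxxxx.
b = "\u00e9".encode("utf-8")     # b'\xc3\xa9'
lead, cont = b[0], b[1]
assert lead >> 5 == 0b110        # lead byte marks a two-byte sequence
assert cont >> 6 == 0b10         # continuation byte: 8th bit set, 7th clear
# Reassemble the code point from the payload bits (5 + 6 = 11 bits).
cp = ((lead & 0x1F) << 6) | (cont & 0x3F)
assert cp == 0xE9
print(hex(cp))  # 0xe9
```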
11. Available at www.unicode.org
12. Available at http://linuxgazette.net/147/pfeiffer.html, Tips on Using Unicode with C/C++, René Pfeiffer
Conclusion
All three encoding forms can be used to represent the full range of encoded
characters in the Unicode Standard; they are thus fully interoperable for
implementations that may choose different encoding forms for various reasons. Each
of the three Unicode encoding forms can be efficiently transformed into either of
the other two without any loss of data.13
UTF-16 is both memory-efficient and speed-efficient. Almost 99% of UTF-16 data is
expressed using single code units. This includes nearly all of the typical characters
that software needs to handle with special operations on text, for example, format
control characters.
13. Available at www.unicode.org
References
1. Jukka K Korpela (2006), “Unicode Explained”.
3. Markus Kuhn (2009), “UTF-8 and Unicode FAQ for Unix/Linux”, available at http://www.cl.cam.ac.uk/~mgk25/unicode.html
6. http://www.ibm.com/developerworks/xml/library/x-utf8/
7. http://perldoc.perl.org/perlunicode.html#Unicode-Encodings
8. http://unicode.org/versions/Unicode5.2.0
Reference # 70J-2012-05-05-01