
A Comparative Study of UTF-8, UTF-16,

and UTF-32 of Unicode Code Point


Sanjeev Kumar*

Unicode is a critical enabling technology for developers who want to internationalize


applications for global environments. Unicode assigns a unique number for every character,
irrespective of what the platform, or the program, or the language is. The Unicode Standard
has been adopted in the industry by Apple, HP, IBM, Microsoft, Oracle, SAP, Sun, Sybase, and
many others. Unicode is required by modern standards such as XML, Java and WML, and is
the official way to implement ISO/IEC 10646. It is supported in many operating systems, all
modern browsers, and many other products. The emergence of the Unicode standard, and
the availability of tools supporting it, is among the most significant recent global software
technology advances. Each of the three encoding forms, UTF-8, UTF-16 and UTF-32, has its own pros
and cons. This paper compares these three formats.

Keywords: UTF-8, UTF-16, UTF-32, UCS, ASCII

Introduction
Unicode is the first truly successful multilingual character set standard (Ken Lunde,
2008). It enables a single software product or website to be targeted across multiple
platforms, languages and countries without reengineering. It codes more than a
million characters and includes code points for all the characters of popular languages
of the world. Each character is assigned a unique number called Unicode code point.
It uses hexadecimal numbers to represent code points. For example, Latin capital letter 'A'
is assigned the code point U+0041 (65 in decimal). Unicode is a superset
of all earlier character set codes. ASCII can represent only 128 characters,
covering the Latin upper and lower case letters and some special characters.
Single-byte extended sets such as ISO/IEC 8859-1 (Latin-1) can represent only 256 characters;
the first 128 characters are the same as in ASCII and the other 128 are used for
Western European languages.
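The code point assignment described above can be checked directly in most programming languages; the short Python sketch below (Python is used here only for illustration) confirms the 'A' example:

```python
# 'A' carries the Unicode code point U+0041, i.e., 65 in decimal
assert ord("A") == 0x41 == 65

# Conversely, the code point maps back to the character
assert chr(0x41) == "A" == "\u0041"
```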
The computer industry has adopted Unicode, giving a global outlook to the building of
international software that can be easily adapted to meet the needs of particular
locations and cultures.1 Incorporating Unicode into client-server or multitiered

* Faculty, Computer Application, Desh Bhagat Institute of Management and Computer Sciences,
Mandi Gobindgarh, Punjab, India. E-mail: san.jaidka@gmail.com

1. Available at www.unicode.org

50
© 2012 IUP. All Rights Reserved. The IUP Journal of Telecommunications, Vol. IV, No. 2, 2012
applications and websites offers significant cost savings over the use of legacy2
character sets. Unicode enables a single software product or a single website to be
targeted across multiple platforms, languages and countries without modification.
It allows data to be transported through many different systems without corruption,
which means software products developed earlier can be converted into Unicode
without losing data. The standard includes the European alphabetic scripts, Middle
Eastern right-to-left scripts, and Asian and African scripts.
Unicode makes globalization possible, i.e., single server, single build, single
installation, single instance and serves all clients in all languages.
– Markus Kuhn, 2009
Computers deal with just numbers. They store letters and other characters by
assigning a number for each one. Before Unicode was invented, there were hundreds
of different encoding systems for assigning these numbers.3 No single encoding could
contain enough characters: for example, the European Union alone requires several
different encodings to cover all its languages. Even for a single language like English
no single encoding was adequate for all the letters, punctuation and technical symbols
in common use.
A given machine may support only one kind of encoding, so transferring text from one
machine to another risks loss of information. Moreover, converting information between
encodings requires additional data, often several megabytes, and considerable time.
These encoding systems also conflict with one another—two encodings can use the
same number for two different characters, or use different numbers for the same
character. Any given computer (especially servers) needs to support many different
encodings; yet whenever data is passed between different encodings or platforms, that
data always runs the risk of corruption.4 The Unicode Standard eliminates all such deficiencies
and relieves the programmer community of unnecessary burden by:
1. Allowing software such as a website to be converted from one language to
another without loss of semantics.
2. Allowing data to be passed between different encodings or platforms without
any risk of corruption.
3. Removing the need for many encoding schemes: Unicode supports all platforms
and native languages, so no additional megabytes of data and no considerable
conversion time are needed to move from one encoding to another.

2. Legacy encoding is a character encoding that continues to be used despite being rendered obsolete by another encoding.
3. Available at www.unicode.org
4. Available at http://www.ibm.com/developerworks/library/codepages.html



As the World Wide Web and computer usage grow, people around the world
prefer information in their local language to learning another language.
"Software engineers felt the need to develop a standard that supports almost all
characters: scientific notations and special characters, artistic scripts, graphical
and typological symbols, international phonetic alphabets, OCR fonts, etc." (Markus Kuhn,
2009)
The Unicode Standard does not encode idiosyncratic, personal, novel or private-use
characters, nor does it encode logos or graphics. Graphologies unrelated to text, such
as dance notations, are likewise outside the scope of the Unicode Standard. Font variants
are explicitly not encoded. The Unicode Standard reserves 6,400 code points in the
BMP for private use which may be used to assign codes to characters not included
in the repertoire of the Unicode Standard. Another 131,068 private-use code points
are available outside the BMP, should 6,400 prove insufficient for particular
applications.5
The Unicode Standard directly addresses only the encoding and semantics of text
and not any other actions performed on the text.6 In certain writing systems, such as
Devanagari, some characters do not need separate space but are accommodated in the
space of the preceding character; for example, four Devanagari characters may together
occupy the space equivalent to a single character. In the same way, some scripts, such
as Urdu, follow a right-to-left sequence.

In the late 1980s, a project known as the Unicode project was initiated by a consortium
of manufacturers, mostly from the US.

Challenges in Designing Unicode


One of the challenges in designing a character encoding comes from the fact that
there is no universal set of fundamental units of text. Instead, the division of text
into text elements necessarily varies from one language to another and from one text
process to another, as discussed below.
1. Composite Characters
For composite characters such as Ç, Unicode assigns two different code points:
one for C, i.e., U+0043, and one for the combining cedilla '¸', i.e., U+0327. Unicode
also supports a single code point for such composite characters, e.g., U+00C7 for Ç,
Latin capital letter C with cedilla (Tony Graham, 2000).

5. Available at www.unicode.org
6. Ibid.



Many languages, such as German, support composite characters, and these
characters require space equivalent to a single character.
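The two representations of Ç described above can be demonstrated with Python's standard unicodedata module (an illustrative sketch; the paper itself does not prescribe any particular API):

```python
import unicodedata

precomposed = "\u00C7"   # Ç as the single code point U+00C7
decomposed = "C\u0327"   # C (U+0043) followed by combining cedilla (U+0327)

# The two sequences use different code points...
assert precomposed != decomposed

# ...but normalization maps between them: NFC composes, NFD decomposes
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```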
2. Collation Unit
A sequence of characters may be treated as a single text element for sorting or
hyphenation. For example, in traditional German orthography, the letter combination
'ck' is a text element for the process of hyphenation, where it appears as 'k-k'
(Jukka, 2006). Similarly, in Slovak, 'ch' is treated as a single collation unit even
though it is written with the two characters 'c' and 'h'.

3. In Devanagari, many characters do not need a separate rectangular space, as the
letters of the English alphabet do.
4. English words, by contrast, map easily to memory as a linear sequence of
rectangular character cells.

Text Processes and Encoding


In the case of English text using an encoding scheme such as ASCII, the relationships
between the encoding and the basic text processes built on it are straightforward:
characters are generally rendered visible one by one in distinct rectangles from left
to right in linear order.

When designing an international and multilingual text encoding such as the


Unicode Standard, the relationship between the encoding and implementation of
basic text processes must be considered explicitly for several reasons:

• Characters in writing systems other than English are not necessarily rendered
visibly one by one in rectangles from left to right. In many cases, character
positioning is quite complex and does not proceed in a linear fashion, e.g., in
Devanagari.

• Unicode follows two approaches to encoding the accented characters commonly
used in French or Swedish: characters such as 'ä' and 'ö' are encoded as
individual characters, and they can also be represented by composing a base
letter with combining diacritics.



• Unicode defines separate codes for uppercase and lowercase letters. This
choice causes some text processes such as rendering to be carried out more
easily, but other processes such as comparison to become more difficult.
A different encoding design for English such as case-shift control codes
would have the opposite effect.

• Some scripts share common characters across languages, as in Chinese/Japanese/
Korean (CJK), where consolidation is achieved by assigning a single code point
to each common ideograph.

The Unicode Standard gives programmers extensive descriptions and a vast amount of
data about how characters function: how to form words and break lines; how to
sort text in different languages; how to format numbers, dates, times, and other
elements appropriately for different languages; how to display languages whose written
form flows from right to left, such as Arabic and Hebrew, or whose written form
splits, combines and reorders, such as the languages of South Asia; and how to deal with
security concerns regarding the many 'look-alike' characters from alphabets around
the world. Without the properties, algorithms and other specifications in the Unicode
Standard and its associated specifications, interoperability between different
implementations would be impossible.

The Unicode character encoding treats alphabetic characters, ideographic


characters and symbols equivalently, which means they can be used in any mixture
and with equal facility. In addition to character codes and names, other information
is crucial to ensure legible text: a character’s case, directionality and alphabetic
properties are also well-defined in Unicode. The Unicode Standard defines these and
other semantic values, and it includes application data such as case mapping tables
and character property tables as part of the Unicode Character Database.

Unicode Principles7
Universality
The Unicode Standard encodes a single, very large set of characters, encompassing
all the characters needed for worldwide use. The Unicode Standard is designed to
meet the needs of diverse user communities within each language, serving business,
educational, liturgical and scientific users and covering the needs of both modern
and historical texts.

Efficiency
There are no escape characters or shift states (as are found in some legacy encodings)
in the Unicode character encoding model. Each character code has the same status
7. Available at www.unicode.org



as any other character code; all codes are equally accessible. All Unicode encoding
forms are self-synchronizing and non-overlapping. This makes randomly accessing and
searching inside streams of characters efficient.
By convention, characters of a script are grouped together as far as it is practical.
Not only is this practice convenient for looking up characters in the code charts, but
it also makes implementations more compact and compression methods more
efficient. The common punctuation characters are shared.
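The self-synchronizing property mentioned above can be illustrated with UTF-8, where continuation bytes always match the pattern 10xxxxxx. The hypothetical helper below (an illustrative sketch, not part of the standard) resynchronizes from an arbitrary byte offset by backing up to the start of the enclosing character:

```python
def char_start(buf: bytes, i: int) -> int:
    """Back up from byte offset i to the first byte of the enclosing character."""
    while buf[i] & 0xC0 == 0x80:   # 10xxxxxx marks a continuation byte
        i -= 1
    return i

data = "a\u00E4\u4E16".encode("utf-8")   # bytes: 61 C3 A4 E4 B8 96

assert char_start(data, 2) == 1   # 0xA4 belongs to the character starting at offset 1
assert char_start(data, 5) == 3   # 0x96 belongs to the character starting at offset 3
```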

Characters, Not Glyphs


The Unicode Standard draws a distinction between characters and glyphs. Characters
are the abstract representations of the smallest components of written language that
have semantic value. They represent primarily but not exclusively, the letters,
punctuation, and other signs that constitute natural language text and technical
notation. Glyphs represent the shapes that characters can have when they are
rendered or displayed.
In contrast to characters, glyphs appear on the screen or paper as particular
representations of one or more characters. A repertoire of glyphs makes up a font.
Glyph shape and methods of identifying and selecting glyphs are the responsibility
of individual font vendors and of appropriate standards and are not part of the
Unicode Standard.
Various relationships may exist between character and glyph: a single glyph may
correspond to a single character or to a number of characters, or multiple glyphs
may result from a single character.
Text rendering requires that characters in memory be mapped to glyphs. The
final appearance of rendered text may depend on context (neighboring characters
in the memory representation), variations in typographic design of the fonts used,
and formatting information (point size, superscript, subscript and so on). The
results on screen or paper can differ considerably from the prototypical shape of
a letter or character. The following three methods are available to store Unicode
strings.

UTF-32 (Universal Character Set Transformation Format-32)


The UTF-32 form of a character is a direct representation of its code point; no mapping
is required (Markus Kuhn, 2009). All other Unicode transformation formats use
variable-length encodings. UTF-32 uses one 32-bit code unit to store each code point.
One advantage is that it can represent more than a million characters; another
advantage of UTF-32 over variable-length encodings is that Unicode code points are
directly indexable, allowing random access. But commonly used characters require only
8 to 16 bits, so this encoding requires a lot of space to store each character and is
thus not memory-efficient. Though a fixed number of bytes per code point



appears convenient, it is not as useful as it appears. It makes truncation easier but
not significantly so compared to UTF-8 and UTF-16.
For strings, however, storing 32 bits for each character takes up too much space,
especially considering that the highest code point, 0x10FFFF, takes up only 21 bits:
11 bits are always unused in a 32-bit word storing a Unicode code point. Therefore,
software generally uses 16-bit or 8-bit units as a compromise, with a variable number
of code units per Unicode code point. It is a tradeoff between ease of programming
and storage space.8
The performance of UTF-32 as a processing code may actually be worse than
the performance of UTF-16 for the same data, because the additional memory
overhead means that cache limits will be exceeded more often and memory paging
will occur more frequently. For systems with processor designs that impose penalties
for 16-bit aligned access but have very large memories, this effect may be less
noticeable.
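The fixed width and direct indexability of UTF-32 can be sketched in Python (an illustrative example; big-endian form without a byte-order mark is assumed):

```python
text = "Hello, \u4E16\u754C"      # mixed ASCII and CJK sample text
utf32 = text.encode("utf-32-be")  # big-endian, no byte-order mark

# Fixed width: every code point occupies exactly 4 bytes
assert len(utf32) == 4 * len(text)

# Random access: the i-th code point starts at byte offset 4*i
i = 7
assert int.from_bytes(utf32[4 * i : 4 * i + 4], "big") == ord(text[i])
```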

UTF-16
It uses one or two 16-bit code units for each code point. For example, the character 'A'
has code point U+0041, which can be stored in a single 16-bit code unit; a second code
unit is not required, and space is saved. On the other hand, a character outside the
Basic Multilingual Plane, such as U+1D11E (musical symbol G clef), requires two 16-bit
code units, a surrogate pair.9 UTF-16 is a well-designed compromise between handling
and space: all commonly used characters can be stored with one code unit per code
point, and for those the code unit has the same integer value as the code point. It is
the default encoding in many environments.10 UTF-8 is more space-efficient for Latin
text, but more complex algorithms are required to handle its variable-length code
unit sequences.
On average, more than 99% of all UTF-16 data is expressed using single code
units. This includes nearly all of the typical characters that software products need
to handle with special operations on text—for example, format control characters.
As a consequence, most text scanning operations do not need to unpack UTF-16
surrogate pairs at all, but rather can safely treat them as an opaque part of a
character string.
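The one-or-two code unit behaviour can be sketched in Python (illustrative; the helper name and the big-endian, BOM-less form are assumptions for the example):

```python
def utf16_units(s: str) -> int:
    # number of 16-bit code units = encoded length in bytes / 2
    return len(s.encode("utf-16-be")) // 2

assert utf16_units("A") == 1            # U+0041: one code unit, value 0x0041
assert utf16_units("\u263A") == 1       # U+263A is still inside the BMP
assert utf16_units("\U0001D11E") == 2   # U+1D11E: two code units, a surrogate pair

# The surrogate pair for U+1D11E is D834 DD1E
assert "\U0001D11E".encode("utf-16-be").hex() == "d834dd1e"
```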

UTF-8
The encoding known today as UTF-8 was invented by Ken Thompson (Markus Kuhn,
2009). It is a variable-length (1 to 4 bytes), byte-order-independent encoding. It uses
one to four 8-bit code units for each code point. If a character's code point falls
within the ASCII range (U+0000 to U+007F), only one 8-bit code unit is needed.

8. Available at http://linuxgazette.net/147/pfeiffer.html, Tips on Using Unicode with C/C++, René Pfeiffer
9. Available at www.unicode.org
10. Available at http://scripts.sil.org/, Mapping codepoints to Unicode encoding forms, §1: UTF-32



UTF-8 is reasonably compact in terms of the number of bytes used. It is at a significant
size disadvantage mainly for East Asian implementations such as Chinese, Japanese and
Korean, which use Han ideographs or Hangul syllables requiring three-byte code unit
sequences in UTF-8. It is also somewhat less efficient to process than the other
encoding forms.11
UTF-8 has the following properties (Markus Kuhn, 2009):
• UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00
to 0x7F (ASCII compatibility). Therefore, files containing only 7-bit ASCII
characters have the same encoding under both ASCII and UTF-8.
• All UCS characters above U+007F are encoded as a sequence of several bytes,
each of which has the most significant bit set. Therefore, no ASCII byte
(0x00-0x7F) can appear as part of any other character.
• The first byte of a multi-byte sequence that represents a non-ASCII character
is always in the range 0xC0 to 0xF7, and it indicates how many bytes follow
for this character. All further bytes in a multi-byte sequence are in the range
0x80 to 0xBF. This allows easy resynchronization and makes the encoding
stateless and robust against missing bytes.
The following byte sequences are used to represent a character; the sequence to
be used depends on the Unicode code number of the character:

Table 1: Byte Sequences for UTF-8

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

According to Table 1, if the code point of a character fits in one byte, the most
significant bit of that byte is always 0. If a character requires two bytes for its
code number, the first byte starts with the bits 110, which indicate that the sequence
has two bytes. Every byte other than the first in a multi-byte sequence starts with
the bits 10, marking it as a continuation byte. Therefore, if four bytes are used to
store a code point, 11 bits are overhead and only 21 bits carry the code point.
The xxx bit positions are filled with the bits of the character code number in
binary representation. The rightmost x bit is the least significant bit. Only the shortest
11. Available at www.unicode.org



possible multi-byte sequence which can represent the code number of the character
can be used. Note that in multi-byte sequences, the number of leading 1 bits in the
first byte is identical to the number of bytes in the entire sequence.
Examples: The Unicode character U+00A9 = 1010 1001 (©, copyright sign) is encoded in UTF-8 as:
11000010 10101001 = 0xC2 0xA9
and the character U+2260 = 0010 0010 0110 0000 (≠, not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
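The byte sequences of Table 1 can be implemented directly. The sketch below is an illustrative, minimal encoder (it does not validate surrogates or out-of-range input) that reproduces the two worked examples:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point following the byte sequences of Table 1."""
    if cp <= 0x7F:                                           # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                                          # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:                                         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,   # 11110xxx 10xxxxxx ...
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

assert utf8_encode(0x00A9) == b"\xC2\xA9"       # © copyright sign
assert utf8_encode(0x2260) == b"\xE2\x89\xA0"   # ≠ not equal to
```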
For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are
longer than necessary to encode a character. For example, the character U+000A (line
feed) must be accepted from a UTF-8 stream only in the form 0x0A, and not in any
of the following three possible overlong forms:
0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
Any overlong UTF-8 sequence could be abused to bypass UTF-8 substring tests
that look only for the shortest possible encoding. All overlong UTF-8 sequences start
with one of the byte patterns listed in Table 2.
Unicode is an open character set, which means it keeps growing as less frequently
used characters are added. Code points are now stable, which means that once a code
point is assigned to a character, the assignment is permanent.
Table 2: Overlong UTF-8 Sequences
1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
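A conforming decoder's behaviour on the overlong forms above can be checked in practice. Python's built-in UTF-8 decoder, used here purely as one example of a strict decoder, rejects all three overlong encodings of U+000A:

```python
# Each of these is an overlong encoding of U+000A (line feed)
for overlong in (b"\xC0\x8A", b"\xE0\x80\x8A", b"\xF0\x80\x80\x8A"):
    try:
        overlong.decode("utf-8")
        raise AssertionError("overlong sequence was accepted")
    except UnicodeDecodeError:
        pass   # correctly rejected by the strict decoder

# The only acceptable UTF-8 encoding of U+000A is the single byte 0x0A
assert b"\x0A".decode("utf-8") == "\n"
```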

Table 3 shows that a character such as ä (code point U+00E4) is encoded as the byte
sequence C3 A4 in UTF-8, while the same character is encoded as 00E4 in UTF-16.
The representation of a character therefore differs across the three formats, but
each of the three Unicode encoding forms can be efficiently transformed into
either of the other two without any loss of data.12

Table 3: Representation of Code Points Using UTF-8 and UTF-16

Character  Code Point  UTF-16  UTF-8
a          U+0061      0061    61
ä          U+00E4      00E4    C3 A4
σ          U+03C3      03C3    CF 83
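The lossless transformation between the three forms can be sketched for the Table 3 characters (illustrative Python; big-endian forms without byte-order marks are assumed):

```python
text = "a\u00E4\u03C3"                     # a, ä, σ from Table 3

utf8 = text.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-be")
utf32 = utf16.decode("utf-16-be").encode("utf-32-be")

assert utf8.hex(" ") == "61 c3 a4 cf 83"        # the UTF-8 column of Table 3
assert utf16.hex(" ") == "00 61 00 e4 03 c3"    # the UTF-16 column of Table 3
assert utf32.decode("utf-32-be") == text        # the round trip is lossless
```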

12. Available at http://linuxgazette.net/147/pfeiffer.html, Tips on Using Unicode with C/C++, René Pfeiffer



A summary of the comparison among UTF-32, UTF-16 and UTF-8 is presented
in Table 4.

Table 4: Summary of Comparison Among UTF-32, UTF-16 and UTF-8

Representation:
• UTF-32: Direct representation of the code point; no mapping is required.*
• UTF-16: Mapping is required.
• UTF-8: Complex mapping is required.

Length:
• UTF-32: Fixed length, i.e., 32 bits for each character.
• UTF-16: Variable length, i.e., one or two 16-bit code units for each code point.
• UTF-8: Variable length, i.e., one to four 8-bit code units for each code point.

Access:
• UTF-32: Code points are directly indexable, allowing random access.
• UTF-16: Not directly indexable, so only limited random access is possible, but better than UTF-8.
• UTF-8: Not directly indexable; indexed searching is difficult.

Space Complexity:
• UTF-32: Not memory-efficient.
• UTF-16: Best compromise between handling and space; all commonly used characters can be stored with one code unit per code point.
• UTF-8: Memory-efficient; good for Latin alphabets but worst for Asian scripts.

Execution Speed:
• UTF-32: Better, as no mapping is required.
• UTF-16: Requires more time, as a code point maps to one or two 16-bit code units.
• UTF-8: Most time-consuming, as one to four 8-bit code units are required.

Ease of Programming:
• UTF-32: Easy.
• UTF-16: Moderate.
• UTF-8: Difficult.

Note: * Markus Kuhn (2009)

Conclusion
All three encoding forms can be used to represent the full range of encoded
characters in the Unicode Standard; they are thus fully interoperable for
implementations that may choose different encoding forms for various reasons. Each
of the three Unicode encoding forms can be efficiently transformed into either of
the other two without any loss of data.13
UTF-16 is both memory-efficient and speed-efficient. Almost 99% of UTF-16 data is
expressed using single code units. This includes nearly all of the typical characters that
software needs to handle with special operations on text, for example, format
13. Available at www.unicode.org



control characters. As almost all commonly used characters in all languages can be
represented with a single code unit, little processing is needed for the characters
that require two code units. As a consequence, most text scanning operations do not
need to unpack UTF-16 surrogate pairs at all, but rather can safely treat them as an
opaque part of a character string. UTF-8 requires more processing time, as many
characters are represented with multi-byte code unit sequences. UTF-32 always uses a
single code unit per character, so no extra processing time is required, but it wastes
a lot of memory space.

References
1. Jukka K Korpela (2006), Unicode Explained.
2. Ken Lunde (2008), CJKV Information Processing, 2nd Edition.
3. Markus Kuhn (2009), "UTF-8 and Unicode FAQ for Unix/Linux", available at http://www.cl.cam.ac.uk/~mgk25/unicode.html
4. Richard Gillam (2002), Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard.
5. Tony Graham (2000), Unicode: A Primer.
6. http://www.ibm.com/developerworks/xml/library/x-utf8/
7. http://perldoc.perl.org/perlunicode.html#Unicode-Encodings
8. http://unicode.org/versions/Unicode5.2.0

Reference # 70J-2012-05-05-01



