

Analysis of text encodings in computer systems

Conference Paper · March 2016


1 author:

Ivan Muzyka
Kryvyi Rih National University

All content following this page was uploaded by Ivan Muzyka on 08 March 2017.



Global Science and Innovation March 23-24th, 2016.

ANALYSIS OF TEXT ENCODINGS IN COMPUTER SYSTEMS

Muzyka I.O.

Kryvyi Rih National University

Ukraine

Abstract
The paper discusses the advantages and drawbacks of ASCII and the Unicode encodings (UTF-8,
UTF-16, and UTF-32). It presents a short performance analysis of basic text operations. The author
suggests an approach for high-performance processing of textual information in computer systems.

Key words: text encoding, Unicode, text processing, performance analysis.

Introduction
Modern information technologies are characterized by the processing of huge amounts of
textual information. Web sites, up-to-date applications, integrated development environments,
databases and many other types of software use a lot of textual data and try to provide a multilingual
graphical user interface. Today almost all operating systems and software developers choose
approaches based on storing and processing Unicode text. However, supporting all possible
Unicode symbols in computer systems is often not easy because of excessive memory usage or
ineffective processing algorithms.

Background
Today, the most prevalent Unicode encoding, especially on the Web, is UTF-8 (Fig. 1). This
encoding is gradually displacing all others (ISO-8859-1, Windows-1251, GB 2312, Big5 and others) [1, 2].

[Figure: usage (%, 0 to 100) of UTF-8, ISO-8859-1, Windows-1251 and other encodings by year, 2010 to 2016]

Fig. 1. Usage of character encodings for websites

Using the full power of the Unicode standard, which defines 1,112,064 valid code points, is a
great idea for modern software development. Nevertheless, implementing it raises some issues,
especially on mobile and micro devices. The most important factor limiting the use of 32-bit
characters (UTF-32) is still excessive memory consumption, so many developers recommend using the
UTF-8 encoding by default in all new applications [3, 4]. However, the standard C++ library does not
provide full support for this encoding. Furthermore, different programming languages (C, C++, C#,
Swift, PHP) and operating systems (Windows, Linux and Mac OS) use various encoding approaches.
Therefore, selecting a suitable text encoding for the task at hand is a real issue nowadays.

Character Encodings: Performance Analysis


There are several Unicode encodings (Unicode Transformation Formats): UTF-8, UTF-16
(BE, LE) and UTF-32 (BE, LE) [5]. UTF-8 uses 1 to 4 bytes per code point; it is compact for
Latin scripts and ASCII-compatible. Table 1 shows how many bytes characters take in the Unicode
encodings.

Table 1
Size in memory of some characters using different encodings
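
The sizes discussed above can be checked with a short Python sketch. The sample characters below are illustrative assumptions and are not necessarily the ones used in the paper's table; the function name encoded_sizes is hypothetical as well.

```python
# Illustrative sample characters (assumptions, not the rows of the paper's table):
# Latin A, a with diaeresis, Cyrillic Zhe, the euro sign, an emoji outside the BMP.
SAMPLES = ["A", "\u00e4", "\u0416", "\u20ac", "\U0001F600"]


def encoded_sizes(ch):
    """Return the size in bytes of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codecs are used so that no byte order mark is added to the count.
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always takes 4 bytes per code point, while the UTF-8 and UTF-16 sizes depend on the code point value.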

In UTF-8, any character with a code of 127 (0x7F) or below takes one byte. That byte holds
the lowest 7 bits of the Unicode value, which is the same as the ASCII value. For characters
from 128 (0x80) to 2047 (0x07FF), the UTF-8 representation is spread across two bytes.
The first byte has the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second
byte has the top bit set and the second bit clear (i.e. 0x80 to 0xBF). For all characters from
2048 (0x0800) to 65535 (0xFFFF), the UTF-8 representation is spread across three
bytes [6]. The formats of UTF-8 byte sequences are presented in the following table (Table 2).

Table 2
Binary format of bytes in UTF-8 sequence

1st Byte   2nd Byte   3rd Byte   4th Byte   Maximum Expressible      Code points range
                                            Unicode Value
0xxxxxxx   –          –          –          007F hex (127)           U-00000000 – U-0000007F
110xxxxx   10xxxxxx   –          –          07FF hex (2047)          U-00000080 – U-000007FF
1110xxxx   10xxxxxx   10xxxxxx   –          FFFF hex (65535)         U-00000800 – U-0000FFFF
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   10FFFF hex (1,114,111)   U-00010000 – U-001FFFFF
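
The bit patterns of Table 2 can be turned into a small encoder. The following Python function is a minimal sketch written for illustration (the name utf8_encode is hypothetical); it rejects surrogates and values above 0x10FFFF, which is an assumption that follows the current Unicode limit rather than anything stated in the table.

```python
def utf8_encode(cp):
    """Encode one code point into UTF-8 bytes following the patterns of Table 2."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x20AC).hex())  # bytes of the euro sign
```

The result can be compared with Python's built-in codec, for example chr(cp).encode("utf-8").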

With the UTF-8 encoding, Unicode can be used in a backward-compatible way in environments
that were designed entirely around ASCII, such as Unix. The main disadvantage of UTF-8 is that
it does not allow a particular character to be accessed directly by index, which can reduce
text-processing performance significantly. Moreover, user-perceived characters are not fixed-width
in UTF-16 and UTF-32 either. The Unicode standard defines rules for diacritical marks, so one
user-perceived symbol can be encoded by two or three code points. For example, the symbol “ä” can be
represented by two independent code points (“a” + “◌̈”) or by a single one. The Unicode standard
requires that both representations be considered equal. Two long strings can differ in every single
byte and still look identical when written.
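
The cost of index access in UTF-8 can be illustrated with a short Python sketch. Finding the n-th code point requires scanning the bytes from the start, because the width of each preceding character is not known in advance. The function name code_point_offset and the sample text are hypothetical, and the input is assumed to be valid UTF-8.

```python
def code_point_offset(data, index):
    """Return the byte offset of code point number `index` in valid UTF-8 `data`."""
    seen = -1
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == index:
                return offset
    raise IndexError("code point index out of range")


# Sample text with 1-byte, 2-byte and 3-byte characters.
text = "na\u00efve \u20ac5".encode("utf-8")
print(code_point_offset(text, 6))  # byte offset of the euro sign
```

The loop is linear in the length of the text, whereas the same lookup in a UTF-32 buffer is a single multiplication of the index by 4.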
Another big issue arises when comparing strings. All UTF encodings require strings to be
normalized before a byte-wise comparison. Programs should always treat canonically equivalent
Unicode strings as equal. The Unicode Standard provides well-defined normalization forms for
this purpose: Normalization Form Canonical Composition (NFC) and Normalization Form
Canonical Decomposition (NFD) [7, 8].
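
A comparison that respects canonical equivalence can be sketched in Python with the standard unicodedata module. The helper name canonically_equal is hypothetical, and choosing NFC rather than NFD for the comparison is an arbitrary assumption; either form works as long as both strings use the same one.

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e4"     # a with diaeresis as a single code point
decomposed = "a\u0308"  # letter a followed by a combining diaeresis

print(composed == decomposed)                   # raw comparison of code points
print(canonically_equal(composed, decomposed))  # comparison after normalization
```

The raw comparison fails because the code point sequences differ, while the normalized comparison treats the two spellings as the same text.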
Text files in the UTF-16 or UTF-32 encodings use a magic number known as the byte order
mark (BOM). These encodings can be stored in big-endian or little-endian byte order. The order
depends on the platform architecture and may cause incompatibilities in program code.
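
A minimal Python sketch of BOM detection is given below, using the constants from the standard codecs module. The function name detect_bom is hypothetical. The UTF-8 signature is included as an extra assumption, since some files carry it even though UTF-8 has no byte order.

```python
import codecs

# Longer marks are checked first: the UTF-32 LE mark begins with the same
# two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (codec name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
codec, skip = detect_bom(sample)
print(codec, sample[skip:].decode(codec))
```

The check is a heuristic: UTF-16 LE text whose first character is U+0000 would be reported as UTF-32 LE.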
Additionally, case conversion and sorting are somewhat tricky across the Unicode symbols of
the world's languages (or special Unicode blocks). Some of these peculiarities are presented in Table 3.

Table 3
Some difficulties of converting and sorting capital/small Unicode letters

Language /          Letters and their codes    Converting upper ↔ lower        Sorting order
Unicode block                                                                  by code value

English             A      B      C            upperCode ← lowerCode - 0x20    A<B<C<a<b<c
                    U+0041 U+0042 U+0043
                    a      b      c            lowerCode ← upperCode + 0x20
                    U+0061 U+0062 U+0063

Russian             А      Б      В            upperCode ← lowerCode - 0x20    А<Б<В<а<б<в
                    U+0410 U+0411 U+0412
                    а      б      в            lowerCode ← upperCode + 0x20
                    U+0430 U+0431 U+0432

Georgian            Ⴀ      Ⴁ      Ⴂ            upperCode ← lowerCode - 0x30    Ⴀ<Ⴁ<Ⴂ<ა<ბ<გ
                    U+10A0 U+10A1 U+10A2
                    ა      ბ      გ            lowerCode ← upperCode + 0x30
                    U+10D0 U+10D1 U+10D2

Latin Extended-B    Ȧ      Ḃ      Ċ            upperCode ← lowerCode - 0x01    Ċ<ċ<Ȧ<ȧ<Ḃ<ḃ
                    U+0226 U+1E02 U+010A
                    ȧ      ḃ      ċ            lowerCode ← upperCode + 0x01
                    U+0227 U+1E03 U+010B
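
The difference between offset-based and table-based case conversion in Table 3 can be illustrated in Python. The helper shift_case_ascii is a hypothetical name that implements only the English row; the built-in str.lower and str.upper methods rely on the Unicode character database instead of a fixed offset.

```python
def shift_case_ascii(ch):
    """Swap the case of an ASCII letter by subtracting or adding 0x20."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 0x20)
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x20)
    return ch  # every other character is left untouched


# Letters of the last table row, written as escapes: upper/lower pairs differ by 0x01.
letters = ["\u0226", "\u0227", "\u1E02", "\u1E03", "\u010A", "\u010B"]

# Python compares strings by code point, so this is the "by code value" order.
by_code_value = sorted(letters)

print(shift_case_ascii("a"), "\u0226".lower() == "\u0227")
print([f"U+{ord(ch):04X}" for ch in by_code_value])
```

Sorting by code value puts U+010A and U+010B before the other letters, which matches the last column of the table and differs from alphabetical order.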

All the above reasons reduce the performance of text processing in computer systems.
Let us compare the performance of basic Unicode text operations (comparison, searching
and case changing) with that of regular null-terminated ASCII strings in C (Fig. 2). The ICU
library was used for this purpose. Almost all procedures operating on UTF-encoded symbols
took 2 to 6 times longer.

Fig. 2. Average processing time of ASCII and Unicode strings (less is better)
One possible way of processing text data at high speed is based on the UTF-32 encoding.
It can be presented schematically as the following sequence of operations (Fig. 3).

Reading file or network stream → UTF-8 data → UTF-32 data → Data processing → UTF-8 data → Saving to file or network stream

Fig. 3. The generic algorithm of text processing
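
The scheme of Fig. 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: all function names are hypothetical, the working buffer is an array of integers holding one code point per element, and the processing step (uppercasing ASCII letters in place) is only a placeholder for real work.

```python
from array import array


def decode_to_code_points(raw):
    """UTF-8 bytes -> fixed-width buffer with one code point per element."""
    # Type code "L" guarantees at least 4 bytes per element, enough for U+10FFFF.
    return array("L", (ord(ch) for ch in raw.decode("utf-8")))


def uppercase_ascii_in_place(cps):
    """Placeholder processing step that relies on direct access by index."""
    for i in range(len(cps)):
        if 0x61 <= cps[i] <= 0x7A:  # ASCII letters a to z
            cps[i] -= 0x20


def encode_to_utf8(cps):
    """Fixed-width buffer -> UTF-8 bytes for storage or transmission."""
    return "".join(map(chr, cps)).encode("utf-8")


def run_pipeline(raw):
    cps = decode_to_code_points(raw)
    uppercase_ascii_in_place(cps)
    return encode_to_utf8(cps)


print(run_pipeline("caf\u00e9 \u20ac".encode("utf-8")).decode("utf-8"))
```

The conversions happen only at the input and output boundaries, so all intermediate work operates on fixed-width elements.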

Unfortunately, very few scenarios allow the UTF-8 encoding to be used in all parts of
a computer program. UTF-8 is a very convenient and compact encoding for text files and for data
transmission over a network. It is a good choice for Unix environments as well. However, for
historical reasons the Windows API does not provide full support for this encoding. Therefore,
up-to-date cross-platform software and libraries should use all types of Unicode encodings
depending on the situation (Table 4).

Table 4
General applicable scenarios of using different text encodings

Text encoding   Applicable scenarios

ASCII           The application supports only the English language, the computer system has
                a small amount of RAM, and only one platform is supported.

UTF-8           It should be used when full Unicode support is needed. It is a good choice for
                storing textual data in files and transmitting data over networks. The Unix API
                supports UTF-8 natively.

UTF-16          It is used by the Windows API. It is not fixed-width and requires more memory
                than UTF-8 for predominantly ASCII text.

UTF-32          High-performance text processing in computer systems that have a large amount
                of memory. This encoding can be suitable for applications like text editors,
                parsers, etc.

Summary
Modern applications clearly need to use Unicode text. Considering all the benefits and
disadvantages of the UTF encodings, one of the best choices is UTF-8. It is rather compact for
representing strings in memory, on disk and in communication. However, if performance is the most
important criterion for a particular task, it is worth using UTF-32. According to preliminary
estimates, this approach provides a performance boost of 5–15%, depending on the data
processing algorithm.

References
[1] Encoding Usage Statistics – BuiltWith, Web, 04 Jan 2016, http://trends.builtwith.com/encoding.
[2] Usage of character encodings for websites – W3Techs, Web, 10 Jan 2016, http://w3techs.com/technologies/overview/character_encoding/all.
[3] P. Radzivilovsky, Y. Galka, S. Novgorodov, UTF-8 Everywhere, Web, 2015, http://utf8everywhere.org.
[4] R. Ishida, Choosing & applying a character encoding – W3C, Web, 31 Mar 2014, https://www.w3.org/International/questions/qa-choosing-encodings.
[5] R. Gillam, Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard, first ed., Addison-Wesley Professional, 2002.
[6] UTF-8 Encoding – FileFormat.Info, Web, 10 Sep 2015, http://www.fileformat.info/info/unicode/utf8.htm.
[7] J.K. Korpela, Unicode Explained, first ed., O'Reilly Media, 2006.
[8] Unicode Normalization – Unicode.org, Web, 7 Jul 2015, http://unicode.org/faq/normalization.html.

ALGORITHMIC METHODS IN CELLULAR AUTOMATA MODELING


Palmov S.V.¹, Popov A.V.², Rezepkyn A.V.³

¹ associate professor, Candidate of Engineering Sciences; ² student; ³ student.

Volga State University of Telecommunications and Informatics

Russia

Abstract
Since the start of the project, we have studied the basic mathematical algorithm of Conway’s
"Game of Life". After gaining this basic knowledge, we modified the mathematical model and
developed two versions of cellular automata of our own. With the new versions we modeled
automata with different cells that have different properties, such as infection, immunity, or
staying alive with no neighbors. We obtained excellent results and showed that our model works
and creates cell colonies similar to natural bacterial colonies. The other model generates streams
of liquid. We added a few new cell properties such as weight, inertia and motion vector. As a result,
this implementation can model a stress situation inside a water supply network and notify staff in
case of high pressure at any point in the pipeline.

Key words. Cellular Automata, Conway’s “Game of life”, CA, Math, Mathematical algorithms,
liquids, gas, cell, colony, algorithmic modeling.

Annotation
Since the start of the project, we have studied the basic mathematical algorithm of John Conway's
"Life". After gaining the basic knowledge, we modified the mathematical model and developed
two versions of a cellular automaton of our own. With our new versions we modeled an environment
with different cells that have different properties, such as the ability to infect neighbors, to cure
neighbors of infection, or to stay alive in complete isolation, etc. We obtained excellent results.
We proved that our model works and creates cell colonies similar to natural bacterial colonies.
The other model was created to generate streams of liquid. We added several new cell properties,
such as:
