
The Bytext Standard

INTRODUCTION
"Once and awhile you'd be better listening to the fools for a change."
--C.O.C.

Bytext is an open standard for encoding text that uses a variable number of bytes to encode
each character. It is a single charset, an unambiguous bytewise (and bitwise) serialization of
a character repertoire that is a superset of Unicode normalization form C. It is a major
improvement upon ASCII and the multiple charsets associated with each version of Unicode.
It is a canonical superset of and can easily interoperate with all associated standards. In
many ways Bytext is simpler than Unicode and it is designed to be easier to implement. In
particular, it is far easier to use faster and simpler byte oriented regular expression
algorithms with Bytext than with Unicode.
Bytext provides practical, novel solutions for all 3 of the major challenges of the charset
protocol layer:

1. Publishing. Bytext provides a mechanism to encode an unlimited number of characters.


Bytext characters include all Unicode characters in normalization form C, which is the
form used by the latest version of the W3C Character Model for the World Wide Web.
Bytext also encodes many useful characters not found in Unicode.
2. Comparison. Comparison is the fundamental concept behind searching and collation. Bytext
is designed to make it easier and faster to do context free searches with both accuracy
and precision. Bytext categorizes characters by similar usage so it is fast and easy to
create regular expressions that match proper names and other words even though they may
be written in other scripts. Further, Bytext provides a mechanism for ignoring entire
classes of characters in a search without context or lookup. This makes comparisons even
more useful because an author can make searches match user expectations much better.
Notably, markup text can be made ignorable without having to parse the markup.
3. Compression. With a variable number of bytes per character, it is possible to order
characters with their frequency of use in mind to achieve an inherent compression. It is a
challenge to order the characters by frequency of use while maintaining the other 2
important features of Bytext. These challenges are met in Bytext with ordering principles
that reflect an emphasis on global business oriented practicality. Bytext also provides the
non overloaded OBS characters which can be used to represent binary data in text while
achieving better inherent compression than base64 which DOES use overloaded
characters.

At any given time, there should be a single preferred charset for worldwide information
interchange. This will directly facilitate global communication and the other noble goals of
groups like the W3C. Bytext seeks to fulfill this need. A lot of thought and leadership needs
to go into such an effort in order to best provide for all the sometimes conflicting
requirements: “culturally inclusive”, “backwards compatible”, “flexible”, “built to last”,
“consistent”, “easy to learn”, “feature rich”, “practical”, and “limited in scope”.
Each character in Bytext has a single meaning that does not change with context except as
controlled by higher protocols such as a markup language. Except as used to a limited
extent only for round trip compatibility, there are no combining characters in Bytext.
Instead, any potentially useful abstract character formed by combining characters is
precomposed. There are no surrogate pairs; and no formatting or control characters that
normatively modify the properties of adjacent characters. All characters are categorized in a
way that reflects their composite relationships. In Bytext, there is no confusion about the
difference between a plain character, an abstract character, or a grapheme cluster; or what
unit should be used to store a character or string.

Variable length encoding has been maligned by protocol developers because of the false
assumption that such schemes require the implementation of complicated modes that
overlap such as in shift-JIS; and because of a misplaced fear about the difficulty in
determining string length. Unicode has already shown that variable length encodings can be
self synchronizing (explained below). Determining string length is a trivial concern with
Bytext because Bytext is specifically designed to be iterated through as a sequence of bytes, so
it is not necessary for a string data type to know the number of characters in order to iterate
through those characters. If required, it is a simple task to count the number of characters in
such a string by iterating through the bytes.
Unicode is a variable length encoding system like Bytext; the difference is that Bytext takes
variable length encoding to its logical conclusion so it can be fully utilized. UTF-8 and
UTF-16 are certainly variable length encodings, and arguably even UTF-32 is in how abstract
characters are composed. Like Unicode, Bytext is self synchronizing. When randomly
accessing a string, a program can find the boundary of a character with limited backup. The
maximum length of fully defined non private characters in this introductory version of
Bytext cannot be determined because allocation is incomplete, but it will probably be
under 16 bytes. This is contrasted with the indefinitely long sequences of bytes required to
encode abstract Unicode characters such as combining or conjoining sequences. A
guaranteed limit may be set in the future for the maximum length of a non private Bytext
character, which may be 256 bytes. This will allow Bytext to encode more characters than
there are atoms in the known universe. Even 256 bytes definitely qualifies as a limited
backup.
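As an informal illustration (not part of the standard, and the names are made up), limited backup needs nothing more than a scan for the nearest starting byte, using the start byte rule given later in the DEFINITIONS section (byte values 0 to 127 start a character, 128 to 255 continue one):

    // Illustrative sketch only. Back up from an arbitrary index to the first
    // byte of the character containing it. uBytes holds unsigned byte values
    // 0..255; stray leading continuation bytes resolve to index 0.
    static int characterStart(int[] uBytes, int index) {
        while (index > 0 && uBytes[index] >= 128) {
            index--;
        }
        return index;
    }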
Bytext has a single binary form; there is no context required to interpret a sequence of bits as
Bytext. There are no illegal character codes and no context outside each character that will
change the properties of a character, except as controlled by higher protocols. All characters
are either completely defined or only incompletely defined by PDT and category (all private
characters and reserved characters are incompletely defined this way). Bytext does not
require the generation of error messages or any kind of filtering. Any binary data can be
interpreted as Bytext without error.
Bytext is limited in scope so as to be consistent with a well designed protocol stack for
processing information. In particular, the role of Bytext is clearly separated from the role of
markup. This is contrasted with many normative or informative properties of Unicode
characters such as interlinear annotation characters (U+FFF9..U+FFFB); the object
replacement character (U+FFFC); the nesting bidirectional control characters; and the “Tag
Characters” (U+E0000..U+E007F). Most control functions of ASCII fell into disfavor, and the
same is sure to happen with the Unicode control functions. As discussed further in the
Related Projects page, control characters that act as markup syntax put a burden on the
user to deal with overlapping functionality. Instead of just a markup view to deal with, there
is a markup view and a control character view. The control character view requires special
provisions for entering control characters instead of just spelling them out with text.
Additional syntax creates additional complexity when dealing with the same class of
problems; it puts additional burdens on the user and creates the need for redundant
software.
Bytext answers the needs of online East Asian societies. The infinite number of private
characters available in Bytext can accommodate unabridged character sets or specific
physical type sets that can include every character variant that has ever been used or that
ever will be used. Such characters may be needed for writing the personal and place names
(Gaiji) of many people and places in East Asian societies; for more accurately creating online
libraries of historical and technical documents; and in general for precise and accurate
conversion between any and all other character related standards. With CO subtypes, all
such sets can be made collision free. Markup can be used to specify the fonts or
programming libraries necessary to use such sets, and can make other private sets collision
free. More importantly, these characters can be allocated in a way that makes them
readable with fonts made for unified sets such as Unicode. Proper allocation of such
characters will also make them searchable with well designed regular expressions made for
unified sets. Such a strategy, unique to Bytext, is a potentially useful approach to
computerizing print shops and publishing offices in East Asia.
Bytext is also a blessing to syllable oriented scripts such as Hangul and Brahmic scripts
because each possible syllable is a single character. The term “syllable” is used in the

structural sense, not necessarily in the English sense as a unit of pronunciation. Syllable
oriented scripts are loved by many users and they provide advantages over alphabet
oriented scripts. Bytext remains true to the nature of these scripts and allows their
advantages to be fully realized. Since syllables are not spelled with abstract control
characters, each syllable can be unambiguously identified by spell checking or search
algorithms. Variants that do not contribute to the semantic meaning of a syllable can be
readily identified without lookup or complicated processing. Inherent compression is one of
the advantages of syllabic scripts, and in fact this feature of the scripts often makes up for
the “deep” allocation of some syllable scripts. For example, even though the most basic Bengali
characters take 3 bytes to encode, a complicated Bengali syllable with 4 consonants and 2
marks will take 8 bytes to encode in Bytext compared to 12 bytes in UTF-16 and 18 bytes in
UTF-8.
Bytext is not old fashioned or elitist. Bytext character names reflect modern common usage
and are chosen to be as clear and as easy to remember as possible. Equivalent names from
other standards can be obtained from the Bytext database. Also, Bytext character codes are
quoted as a series of easy to understand decimal values, instead of the hexadecimal values
that are used in most quoting conventions associated with Unicode.
The lack of canonically equivalent sequences in Bytext frees it from numerous problems found
in Unicode. In Unicode, two canonically equivalent inputs to case folding do not always
produce two canonically equivalent outputs, so certain string matches using case folding can
be missed which can cause errors and data corruption. Not only is there no such problem in
Bytext, there is little need for case folding in the first place because of the built in
convenience of open composition searching. Another problem presents itself in Unicode
because canonical ordering can rearrange characters at string boundaries even if both
strings start in the same normalization form. This means that editing operations must record
and coordinate themselves with each canonical ordering operation in order to be reversible.
In Bytext, when two strings are concatenated, all characters are guaranteed to not
rearrange in logical order and are guaranteed to not lose information, except as controlled
by higher protocols. The strings remain distinct units which keeps editing operations simple
and intuitive.

Other problems with Unicode


Unicode is messed up beyond repair. This can be said simply based on known problems such
as misnamed characters, redefined characters, obsolete characters, and the problems with
Khmer script -- combined with the policy of not changing characters once defined (never mind
the various exceptions). Unicode essentially defines itself as not being repairable, but the
real reason it is messed up beyond repair is the many unnecessary complexities needed to
process it which cannot be resolved by minor changes.
Many of the problems with Unicode stem from the fact that it is not really one thing, despite
what the name implies. Each version of Unicode has numerous binary forms, which makes it
difficult to truly understand your data on a bitwise level, which in turn makes it harder to
process and maintain. The following table illustrates some of this complexity.

Current standard Unicode serializations:
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-16PE (also referred to as
UTF-16M), UTF-32PE (also referred to as UTF-32M)

Other serializations associated with Unicode:
CESU-8, UCS-2, UCS-4, UTF-7, UTF-EBCDIC, UTF-8B, UTF-8N

Quoting conventions associated with Unicode:
Unicode Character Literals (U+nnnn, or U+nnnnnn)
HTML Numeric Character References (&#number), http://www.w3.org/TR/html401/charset.html#h-5.3.1
Java Unicode Escape Sequences (\unnnn)
Comma-delimited C/C++ Hexadecimal Integer Constants (0xnnnn)
Visual Basic Hexadecimal Integer Constants (&Hnnnn)
Unicode Sequence Identifier (<nnnn, nnnn> or <nn nn nn nn>)

Charsets associated with Unicode:
SCSU, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, UTF-7

Unicode is not a flexible or general purpose format for processing binary data gracefully.
Unicode can never encode all possible byte sequences without the help of higher protocols.
The standard compression scheme for Unicode (SCSU) is not guaranteed to interpret illegal
or reserved values consistently. Unicode does not always provide a standardized bitwise
serialization of text, which limits its general purpose utility.
Canonical equivalence is not complete in Unicode. UAX #15 describes canonical equivalence as
"fundamental equivalence.. indistinguishable to users when correctly rendered". This is
understood as textual equivalence of the most rigorous kind. But who would argue that the
sequence GLUE, then HYPHEN, then GLUE does not have the most rigorous textual
equivalence to NON BREAKING HYPHEN? There is also the well known problem with “cluster
Jamos” which are textually equivalent but can still have multiple forms even after Unicode
normalization. For example, the sequence HANGUL CHOSEONG KIYEOK (U+1100) +
HANGUL CHOSEONG KIYEOK (U+1100) has the most rigorous textual equivalence to HANGUL
CHOSEONG SSANGKIYEOK (U+1101), but it is not put into a single form in any standard
Unicode normalization. These duplicate forms of equivalent texts are flaws in Unicode
normalizations which are easy to resolve in Bytext by using normalizations such as CF
normalization.
There are other incomplete normalizations due to combining class “simplifications” that allow
what are considered invalid sequences or duplicate sequences. For example: the Thai
sequence KO KAI + MAI EK + SARA II (as opposed to KO KAI + SARA II + MAI EK); or a
musical symbol composition consisting of a notehead + flag + stem (as opposed to
notehead + stem + flag). Invalid Thai sequences may not be interoperable with Thai
national language standards, they may not be consistently transcoded to TIS 620-2533
(1990) for example. Invalid sequences will likely be displayed as equivalent to their
corresponding valid sequences in Unicode, but they are not converted to a consistent form
by any Unicode normalization. In Bytext, the invalid sequences are normalization subtypes,
so they can easily be converted to a consistent, valid form; and so that regular expression
searches will best match user expectations.

The following concepts in Unicode are manifestly absurd:


“Noncharacter” code points. They occupy a code point like a character, they must be
preserved by all Unicode transformation formats like a character... but they are not
characters. They are “characterized as noncharacters”. This semantic wizardry is the result
of compromises and cloudy thinking. It is simply an unnecessary complexity. There was
apparently not enough foresight that a character oriented charset would come along that
would need to transcode these code points, creating the inevitable mind bending semantic of
“noncharacter characters”, which Bytext must use.
Applying a combining character to a space (space then the combining character) is apparently
the preferred way to indicate the standalone mark equivalent of combining characters. This
puts the whole abstract character at risk of being erased by programs that clip whitespace.
It can also create unexpected results when processed through different normalization forms.
Apparently Unicode recognizes this problem, calling these “irregularly decomposing
characters” in UAX #15, but still doesn’t change the convention. Bytext does not suffer from
this potential problem because the standalone mark equivalents of combining characters are
explicitly defined characters; there is no weird convention necessary to create them. ASCII
“grave accent” is not an irregularly decomposing character even though one would expect it to

be according to Unicode decomposition principles. In this case it’s a good thing that the
principles are not consistently applied because grave accent is an important character in
programming languages.
Digraph characters such as LATIN CAPITAL LETTER DZ (U+01F1), and their titlecase forms, are
positively non intuitive. The whole point of using a digraph is so that a new character doesn’t
need to be created; this is why they are useful in phonetics. Just because a digraph may be
sorted differently does not mean it is a single character: some collation schemes may sort
entire words differently, but that doesn’t mean that words are single characters. If it is important to
be able to transcode from Unicode to a legacy encoding, it is not necessary that the
transcoding be one to one. Not only does this complicate the process of determining a
titlecase, it makes for ambiguous behavior when such a digraph has a mark applied to it.
Ligatures are analogous to digraphs, but ligatures do not have a titlecase property in
Unicode; this is another inconsistency as well as an inadequacy. Bytext treats these
characters and ypogegrammeni combining characters as ligating variants, which means that
a titlecase operation simply uppercases the first character in each appropriate word, just like
one would expect. The trailing ligating variant(s) would then be converted to their prevariants
in the titlecase operation, just like one would expect. The uppercases of such ligating variants
are not ligating variants, which is a way to break out of this compatibility usage.
The normative requirement that spaces at the end of a line cannot contribute to line width for
purposes of line breaking. What if a special purpose editor wants all non explicit line breaks
to occur at exactly 80 characters regardless of visibility? What if it is necessary to actually
see each character and horizontal scrolling is not available or desired? Since spacing the
characters is not allowed, this type of editor cannot be Unicode compliant even if it is working
under “emergency” line breaking principles. Also, some people would like to see if there are
extra spaces at the end of a line by having them carry over to the next line; it makes a
common typographical mistake easy to identify. These are perfectly reasonable display
preferences, but are not Unicode compliant.
Mirroring.
Step X10 of the Unicode bidirectional algorithm. UAX #9 says: “Although the term embedding
is used for some explicit codes, the text within the scope of the codes is not independent of
the surrounding text. Characters within an embedding can affect the ordering of characters
outside, and vice versa. The algorithm is designed so that the use of explicit codes can be
equivalently represented by out-of-line information, such as stylesheet information”. The
first 2 sentences contradict the last. Because of step X10, these codes cannot in fact be
equivalently represented by markup, because out-of-line information or any sensible kind of
markup will only affect the text within its scope. Step X10 affects text outside the range
between the control characters. Conventional markup codes would have to be shifted with
respect to their corresponding Unicode control characters.

DEFINITIONS
“I would do away with those great long compounded words; or require the
speaker to deliver them in sections, with intermissions for refreshments.”
--Mark Twain

Bytext uses a concept referred to as Fixed To Variable encoding, or FTV, in order to convert a
fixed length encoding system into a variable length encoding system. The fixed length
encoding system Bytext uses is, naturally, a byte stream. In particular, an unsigned integer
byte stream, where each value is a cardinal 8 bit number. In other words, it is a sequence of
values where each value is zero to 255. Bytext uses FTV to get a byte stream to encode a
sequence of values where each value is indefinitely variable, that is each value can range
from zero to infinity. A simple way to think of FTV is that the sign bit of a signed byte is used
to determine character boundaries and the rest of the bits are used to determine the scalar
values.
The FTV concept can be applied to any fixed length encoding system, and since this will come
up later there is a term to identify the particular fixed length that is started with. The
notation used reads the number of bits after the term “fixed” in FTV. So when starting with

an 8 bit fixed length encoding system (a byte stream), the term used is F8TV when it is
converted to a variable length encoding system. This is read “fixed 8 to variable” encoding.
The way F8TV works is each fixed length value is divided in two, the lower half represents the
start of a variable length value, and the upper half represents the continuation of a variable
length value. So, for Bytext, all byte values from 0 to 127 represent the starting byte of a
character. All byte values from 128 to 255 represent the continuation bytes of a character.
Starting bytes are exactly equivalent to positive signed byte values (as in the Java “byte”
data type), and continuation bytes are exactly equivalent to negative signed byte values.
Continuation bytes that start a sequence must be ignored and preserved by default. The
starting byte and all continuation bytes of a single character form the character code. The
character code, once isolated, can be thought of as a single base 128 number with a big
endian byte order, and the scalar value of this number is called the character code value.
The character code values for all Bytext characters comprise the coded character set of
Bytext.
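A minimal sketch of how these rules might be applied follows; it is illustrative only, and the class and method names are not part of the standard. The stream is held as a Java int array of uByte values (0 to 255):

    final class BytextScan {

        // A byte value 0..127 starts a character; 128..255 continues one.
        static boolean isStartByte(int uByte) {
            return uByte < 128;
        }

        // Counting characters is just counting start bytes; stray leading
        // continuation bytes (ignored and preserved by default) add nothing.
        static int countCharacters(int[] uBytes) {
            int count = 0;
            for (int b : uBytes) {
                if (isStartByte(b)) count++;
            }
            return count;
        }

        // The character code value: the low 7 bits of each byte of the
        // isolated character code, read as a big endian base 128 number.
        static long characterCodeValue(int[] uBytes, int start) {
            long value = uBytes[start];
            for (int i = start + 1; i < uBytes.length && !isStartByte(uBytes[i]); i++) {
                value = value * 128 + (uBytes[i] & 0x7F);
            }
            return value;
        }
    }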
A character that is only composed of 1 byte is called an F1, as in finish level 1. A character
composed of 2 bytes is called an F2, etc. The value in these terms is called the finish level;
this is how many bytes are required to finish the character. Composition level is a more
general term for the ordinal byte number being dealt with in a character or a set of
characters, whether the character is finished or not. The term C1 refers to the 1st byte of a
character, and so on with C2, C3, etc.

F8TV bit distribution:

Character code value (scalar value),       1st byte    2nd byte    3rd byte    4th byte
big endian from left to right              (C1)        (C2)        (C3)        (C4)

0xxxxxxx                                   0xxxxxxx
00xxxxxx xyyyyyyy                          0xxxxxxx    1yyyyyyy
000xxxxx xxyyyyyy yzzzzzzz                 0xxxxxxx    1yyyyyyy    1zzzzzzz
0000xxxx xxxyyyyy yyzzzzzz zuuuuuuu        0xxxxxxx    1yyyyyyy    1zzzzzzz    1uuuuuuu
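Read in the other direction, the same distribution packs the low 7 bits of each base 128 digit into successive bytes, with the high bit set on every byte after the first. A hedged sketch (the method name is illustrative, and choosing an appropriate finish level is outside its scope):

    // Pack a character code value into the given number of bytes per the bit
    // distribution above: the last byte takes the low 7 bits of the value.
    static int[] toBytes(long characterCodeValue, int finishLevel) {
        int[] uBytes = new int[finishLevel];
        for (int i = finishLevel - 1; i >= 0; i--) {
            uBytes[i] = (int) (characterCodeValue & 0x7F);
            if (i > 0) {
                uBytes[i] |= 0x80;           // continuation byte: 128..255
            }
            characterCodeValue >>= 7;
        }
        return uBytes;
    }

For example, toBytes(1, 3) yields the bytes 0, 128, 129, and characterCodeValue applied to those bytes yields 1 again.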

It is recommended that “uByte”, as in “unsigned byte”, be the name of the data type that is
used to store each byte that is to be interpreted as Bytext. It is a trivial difference but it
better matches the actual values that each byte is supposed to represent. For programming
languages that create default values automatically, it is recommended that the default value
for a uByte be zero. In Bytext documentation a uByte will be referred to as simply a byte
except in source code.
Each byte must be in big endian bit order. That is, a bit stream set to be interpreted as Bytext
will be interpreted as being a byte stream with big endian bit order unless additional context
is provided by higher protocols. If the first bit in a bit stream is 1, it indicates that either the
bit ordering within a byte needs to be reversed before interpreting the stream as Bytext; or
that the bit stream starts with raw binary data. Trailing bits in a bit stream that do not make
up a whole byte must be ignored.
A character with a given finish level is said to be a supertype of the characters with higher
finish levels that utilize the same initial sequence of bytes. Because supertypes are used to
organize and categorize subtypes, a supertype can also be referred to as a category of a
subtype. A CATegory Symbol, (CATS), is a visible symbol for a category. A CATS may be an
existing character such as how PLUS-MINUS SIGN is used as the technical symbols CATS; or
it may be a generic symbol such as any of the script CATS.
All non F1 characters are said to be a composition of their supertype. A character is a subtype
of its supertype unless it is a singular placeholder or group placeholder. Unless a specific
composition level is specified for a supertype, simply referring to “the supertype” refers to
the supertype with one lower finish level than the character in question (not necessarily the
prevariant). When the finish level is known for a supertype and a subtype (and when
singular placeholders are not used to extend the subtype), the difference is known as the
subtype level, and the subtype can be referred to in relation to the supertype by the
notation S1, S2, S3, etc. By logical extension, S0 can be used to refer to the prevariant.

A letter prevariant is the supertype with the lowest finish level that defines a specific letter. A
letter prevariant is meant to be a category for all the variants of a grapheme. The specific
variants a particular letter has will vary widely between individual characters and between
scripts, but each letter of a given script will have its variants in the same relative order as
other letters in the script. Any gaps are left as reserved unless otherwise specified, so the
relative ordering of variants will only change for different scripts.
The ordering of characters determined by the character code value from small to big is called
Bytext order; this ordering is seen in UEB. This is different from the ordering determined by
the Bytext collation algorithm, which is simply called Bytext collation order. The ordering
found within the Bytext character properties table is yet another ordering, called zero
terminated Bytext order, the purpose of which is to display subtypes immediately under
their supertypes.
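One plausible way to realize such an ordering, assuming “zero terminated” means comparing the byte sequences as if each were followed by a terminating zero byte (an assumption for illustration; the Bytext database definition governs), is a comparator like the following. It places an extended code such as B125-128 immediately after B125 and before B126, so subtypes appear directly under their supertypes:

    // Sketch of a comparator under the stated assumption: compare two
    // character codes byte by byte, treating "end of code" as a zero byte.
    static int zeroTerminatedCompare(int[] a, int[] b) {
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int av = i < a.length ? a[i] : 0;   // implicit zero terminator
            int bv = i < b.length ? b[i] : 0;
            if (av != bv) {
                return Integer.compare(av, bv);
            }
        }
        return 0;
    }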
A specific character code should be represented in text with the following format: an
uppercase “B” followed immediately by a fixed 3 digit cardinal decimal number for each byte
value of the character, with each 3 digit byte value separated by dashes, as in: B001-128-128.
This format is called Bytext decimal notation.
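A small sketch of producing this notation from a character's bytes (the method name is illustrative only):

    // Format uByte values as Bytext decimal notation, e.g. the bytes
    // 1, 128, 128 become "B001-128-128".
    static String toDecimalNotation(int[] uBytes) {
        StringBuilder sb = new StringBuilder("B");
        for (int i = 0; i < uBytes.length; i++) {
            if (i > 0) sb.append('-');
            sb.append(String.format("%03d", uBytes[i]));
        }
        return sb.toString();
    }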
The searching technique of matching characters only up to a certain composition level is called
open composition searching. Open composition searching is a tool to help make effective
regular expressions faster and easier to write. “Regular expressions”, or “regex” is the term
for a formal notation that defines an algorithm used to search for a pattern within a
sequence. The term “regular expression” can thus be used interchangeably with the pattern
it formalizes. A character code ending in a dash like “B127-” is unfinished and can be used
to indicate a pattern for an open composition search.
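As a hedged sketch of the idea (names are illustrative), an open composition search reduces to a byte prefix comparison: an unfinished code such as B127- matches any character whose first bytes equal the given bytes, whatever continuation bytes follow:

    // True if the character starting at charStart begins with prefixBytes.
    // Further continuation bytes are allowed; a shorter character, or any
    // differing byte, fails the match.
    static boolean openCompositionMatch(int[] uBytes, int charStart, int[] prefixBytes) {
        for (int i = 0; i < prefixBytes.length; i++) {
            int pos = charStart + i;
            if (pos >= uBytes.length || uBytes[pos] != prefixBytes[i]) {
                return false;
            }
        }
        return true;
    }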
A mark is a visible modification of a prevariant. Diacritics are marks for example. A mark is a
component of a character but it is not a character itself. A mark is said to apply to a
character, analogous to a combining character in Unicode. Each mark has a corresponding
standalone character that does not combine with a prevariant, these are called standalone
marks.
In Bytext the preferred term for what Unicode calls “double diacritics” is “spanning marks”, so
as to not be confused with the case when a character has two diacritic marks applied to it. It
is natural to call the latter a double diacritic also, but in Bytext the preferred term is
“stacked mark”. The term ‘stacked’ refers to logical stacking, and is not meant to imply any
specific arrangement of marks. The arrangement of marks is determined by the relative
properties of each mark. There are no mark variants of enclosing marks, so the composition
level specifying an enclosing mark is the last of all stacked marks.
Visible or invisible is a term used in Bytext to indicate whether a character is normally
considered to take up space in a display. As a property it is informative. “Space” refers to
what is commonly called user space and not device space. Visible is considered more
descriptive than the term “spacing” which is used in Unicode since the latter may imply
some kind of property for space characters. A combining character cannot be guaranteed to
not affect the width along the baseline compared to the base character alone, so to describe
such a character as “nonspacing” is somewhat meaningless. Also, new Unicode characters
such as INVISIBLE TIMES (U+2061) use this term instead of “nonspacing”, presumably
because it is more intuitive.
Letters, syllables, ideographs, punctuation and symbols are considered visible, and all type 2
ignorables are considered invisible. All type 1 ignorables like SOFT HYPHEN (U+00AD) are
considered to have context dependent visibility, as are whitespace characters. Tab and
space would normally not take up space on a line if they were at the end of that line. PEC
would not necessarily take up space on a line if there was already a line break at that
position. Visibility is a property derived from the informative sample glyph property: if a
character has a sample glyph it is considered to be a visible character.

Terms from Unicode and elsewhere


See also www.unicode.org/glossary/.
Cardinal - Integer numbering starting from zero. This can be contrasted with ordinal which is
integer numbering starting from one used to order an actual list such as 1st, 2nd, 3rd, etc.
Case folding - The process of mapping strings to a normalized form where there are no case
distinctions.

Character repertoire - The set of encodable characters.
Charset - A MIME content type parameter for text content specifying what is essentially the
bytewise serialization of a character repertoire.
Collation - The process of ordering sequences of characters. An elaboration upon
alphabetization so that it applies to every character. Different languages sometimes have
incompatible conventions for ordering characters.
Glyph - The actual image of a character. In Bytext whitespace is not considered to have a
glyph but it is considered to be visible (depending on context).
Grapheme - A distinct unit of writing, one that contributes to the definition of a word in a
natural language. For example, lowercase and uppercase versions of a letter are not
separate graphemes because substituting one for the other would not generally change the
definition of a word. This can be contrasted with phoneme, which is a distinct unit of voiced
sound that contributes to the definition of a word.
Informative - Not required for conformance but may be useful for implementation. Informative
properties and algorithms can be thought of as being functions for a default locale. The word
“should” indicates an informative statement.
Logical order - The order of a sequence of characters in memory, determined by the order
entered. It is generally the order that the sequence would be pronounced.
Normalization - A process where forms of data that for some purposes can be thought of as
equivalent are transformed into a single preferred form.
Normative - Required for conformance. The word “must” indicates a normative statement.
Overloaded - Used to describe a character used for 2 or more very different purposes that
each require different semantics. For example, HYPHEN-MINUS requires different semantics
when used as a hyphen vs as a minus sign.
Serialization - A linearized form of data, usually the form taken when the data is represented
by a stream of bytes or bits. Bytes are the common binary words, so it is usually best to let
the serialization of each byte be handled by existing common mechanisms, but ultimately
all binary serializations must be defined in terms of bits to be complete.

Special subtypes
A character that differs only in a specified property or set of properties from another character
is called a variant of the character with a lower character code value. A variant is named
after the property or set of properties that differ. For example, a line breaking variant only
differs in line breaking properties, and a mark variant is a character that differs only in a
mark. The character with a lower character code value that has a particular variant is called
the prevariant of that variant.
The term for a group of variants with their corresponding prevariant is called a type. Type
mappings are properties that map each variant to its prevariant, and each prevariant to all
of its variants. A type has one prevariant.
Bytext decimal notation is extended as follows so it can represent simple regular expressions
for patterns and open composition searches. A prefixed hyphen is used to indicate that any
preceding bytes of a character are part of that pattern. A postfix hyphen is used exactly as
previously indicated for open composition searching, it indicates that any following bytes of
a character are part of the pattern. An exclamation preceding a byte value in the notation
indicates that the byte value is not allowed in that position in the pattern.
A group of semantically related characters that are identified by a simple regular expression
(other than a regular expression used to identify a category or open composition search) is
called a pattern defined type, PDT. A PDT can have many prevariants. A PDT pattern without
a postfixed hyphen identifies what are called PDT supertypes. PDT’s in Bytext:

PDT name                        Pattern

Guaranteed private PDT          -255-255-
Singular placeholders           !B127-128-
Mirrored PDT                    -255-128
Group placeholders              !B127-!B128-254
Normalization subtypes          -255-
Variation selector variants     -255-253-
CO subtypes                     -255-128-255-

All possible byte values represent characters that are either completely defined (complete), or
incompletely defined by PDT and category (incomplete). There are a finite number of
complete characters and reserved characters. All incomplete characters are either reserved
or private.
Reserved characters are incomplete characters that are identified by the Bytext database as
explicitly restricted from private use. Reserved characters are also called reserved’s.
Reserved characters often have informative notes and cross references. A reserved
character with a single cross reference(s) property is referred to as a cross reference
character. Cross reference characters do not have sample glyphs.
The SS cross reference property is a property that can be non null only for reserved
characters. A reserved character with a non null value for this property is called an SS cross
reference character. These are particular types of cross reference characters.
Cross reference characters are essentially just information in the Bytext database that exists
as an algorithmic convenience to reference a separate order. The appearance of a cross
reference character in text may mean that the text is in a special encoded form such as a
compressed form or a special bidirectional compatibility form; that the text contains binary
data (presumably delimited by markup); or that the text contains encoding errors.
Cross reference characters are named with the following convention: “CROSS REFERENCE TO
”, then the character code of the cross reference(s) property without the preceding “B”, then
“ CHARACTER”. Likewise, SS cross reference characters are named with the following
convention: “SEARCH SIMILAR CROSS REFERENCE TO ”, then the character code of the SS
cross reference(s) property without the preceding “B”, then “ CHARACTER”. All other
reserved characters are named with the following convention: “RESERVED CHARACTER ”
followed by its character code without the preceding “B”.
All other characters that are not completely defined are private characters. Characters may be
incompletely defined by several patterns, so only the remaining semantics are considered to
be for private use. A private character must meet the semantics of any PDT that it is in;
there are no exceptions to a PDT pattern (exceptions are built into the pattern). There are
exceptions to ordinary category relationships, however (OBS characters are an example), but
not to what are called fixed categories.
A fixed category defines a semantic such that all subtypes must have the semantic without
exception and all characters with the semantic must be in the category. Private characters
other than PRIVATE CHARACTER 127 may be completely defined in future Bytext versions
unless they have the guaranteed private PDT -255-255-, or a CO subtype pattern. In other
words, private characters may not always be available for private use unless explicitly
guaranteed to be. All private characters are named with the following convention: “PRIVATE
CHARACTER ” followed by its character code without the preceding “B”.
Fixed categories in Bytext:

Fixed category name              Pattern

Non substitutable CC’s           B126-255-130-
Excessive CSCC’s                 B126-255-252-
Defective CSCC’s                 B126-255-253-
Ignorable                        B126- or B127-
Type 1 ignorables                B126-
Type 2 ignorables                B127-
Context dependent visibility     B048- or B049- or B126-
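Several of these categories can be tested from the first byte alone, without lookup, which is part of what keeps context free scanning cheap. A minimal sketch (method names are illustrative, not part of the standard):

    // First byte tests for the fixed categories listed above.
    static boolean isIgnorable(int firstByte) {                    // B126- or B127-
        return firstByte == 126 || firstByte == 127;
    }
    static boolean isType1Ignorable(int firstByte) {               // B126-
        return firstByte == 126;
    }
    static boolean isType2Ignorable(int firstByte) {               // B127-
        return firstByte == 127;
    }
    static boolean hasContextDependentVisibility(int firstByte) {  // B048-, B049-, or B126-
        return firstByte == 48 || firstByte == 49 || firstByte == 126;
    }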

Singular placeholders
Singular placeholders are characters that act as an extension mechanism for a single
character. Singular placeholders are defined as not having subtypes of their own outside of
byte value 128. Singular placeholders are a PDT defined by the pattern !B127-128-. All
characters that use a singular placeholder's composition, except for other singular
placeholders, are considered to be subtypes of the singular placeholder’s supertype.
B127-128 is not a singular placeholder because of the special way type 2 ignorables are
allocated.
Singular placeholders enable each character to have an infinite number of subtypes solely for
it’s own use. Any subtype level can thus be made infinite using singular placeholders to
extend the subtype. The characters on a given subtype level are said to be peers because
they all differ from a supertype by the same number of subtype levels.
The singular placeholder level refers to how many singular placeholders are used to extend the
subtype of a character. Each singular placeholder for a given character has a special
notation, the letter P followed by its singular placeholder level, as in: P1, P2, P3, etc. For
example, B125-128 is at the P1 location of B125; B125-128-128 is at P2 of B125; and
B125-128-128-128 is at P3 of B125.
Without singular placeholders, a character would only have 128 S1 subtypes. Beyond that,
additional characters would have to be categorized as a subtype of a subtype (S2), which is
not the same as S1. A subtype character may take on new characteristics different from its
supertype such that it would be inappropriate to categorize characters under it that would be
appropriate to categorize as a subtype of it’s supertype. For example, if all 128 S1 subtypes
of LATIN LOWERCASE A were taken, it is not appropriate to categorize LATIN LOWERCASE A
WITH TILDE as a subtype of LATIN UPPERCASE A, or GREEK LOWERCASE ALPHA, or any
other subtype of LATIN LOWERCASE A.
All !B127-!255-128 characters are private. B127-128 is not private because of the special way
type 2 ignorables are allocated. Many characters with the pattern -255-128 are not private
because they are mirrored variants.

Mirrored variants
For compatibility with Unicode, the mirrored form of each mirrorable character can be directly
encoded in Bytext. These are called mirrored variants, and are all located at -255-128,
which is the mirrored PDT.
Mirrored variants can be generated when converting Unicode bidirectional text to Bytext:

With this condition            this Unicode character        maps to this in Bytext
Even embedding level           a character with the          the same character
at UBA step I2                 mirrored property             (example: LEFT PARENTHESIS maps
                                                             to LEFT PARENTHESIS)
Odd embedding level            a character with the          the mirrored variant of the character
at UBA step I2                 mirrored property             (example: LEFT PARENTHESIS maps
                                                             to MIRRORED LEFT PARENTHESIS)

What a mirrored variant in Bytext maps to in Unicode depends on other characters in the
same paragraph. The conditional mapping of mirrored variants to Unicode is displayed in the
following table:

With this condition      this Bytext character    maps to this in Unicode
Even BBA embedding       mirrored variant         RLM RLO the prevariant of the mirrored variant PDF RLM
level                                             (example: MIRRORED LEFT PARENTHESIS, which looks like ")",
                                                  maps to RLM RLO LEFT PARENTHESIS PDF RLM)
Odd BBA embedding        mirrored variant         the prevariant of the mirrored variant
level                                             (example: MIRRORED LEFT PARENTHESIS, which looks like ")",
                                                  maps to LEFT PARENTHESIS)

Group placeholders
Group placeholders are characters that act as an extension mechanism for a peer group. A
peer group is a group of characters on a subtype level that are distinguished from other
characters on the same subtype level. For example, in the European script family, uppercase
variants, mark variants, and SS variants exist as separate peer groups on each subtype
level. Each peer group is extended by a group placeholder composition of the last character
in the group at that subtype level.
Group placeholders are a PDT defined by the pattern !B127-!B128-254. All characters that use
a group placeholder's composition, except for other group placeholders, are considered to be
peers of the group placeholder’s supertype. B127-254 is not a placeholder because of the
special way type 2 ignorables are allocated. -128-254 cannot be a group placeholder
because it is on a singular placeholder level.
The group placeholder level refers to how many group placeholders are used to extend a peer
group. Each group placeholder for a given character has a special notation, the letter G
followed by its group placeholder level, as in: G1, G2, G3, etc. For example, B125-254 is at
the G1 location of B125; B125-254-254 is at G2 of B125; and B125-254-254-254 is at G3 of
B125.
The subtypes of a group placeholder are used to extend the group, not the group placeholder
character itself. Each group placeholder character has no subtypes of its own, but can
otherwise be used as a normal character, just like a singular placeholder character. Some
kinds of characters do not need subtypes of their own, such as mirrored variants and East
Asian full width variants. So whereas mirrored variants are always at P1, East Asian full
width variants are always at G1.

East Asian width


If a character has an East Asian full width variant, it is always at the G1 placeholder, -254, of
its prevariant. For example: LATIN LOWERCASE A, B010, has an East Asian full width
variant; since it is the G1 placeholder of B010, its character code is B010-254. If a non G1
prevariant has an East Asian full width variant (if it has a non private character at G1), the
prevariant should map to a narrow character in East Asian typography. If a non G1
prevariant does not have an East Asian full width variant (if it has a private character at G1),
the prevariant should map to a wide character in East Asian typography. East Asian full
width variant characters themselves should always map to wide characters in East Asian
typography.
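A small sketch of the G1 relationship described above (illustrative only): the full width variant's character code is simply the prevariant's code with a 254 byte appended.

    // B010 (LATIN LOWERCASE A) -> B010-254, its East Asian full width variant.
    static int[] fullWidthVariantCode(int[] prevariant) {
        int[] variant = java.util.Arrays.copyOf(prevariant, prevariant.length + 1);
        variant[variant.length - 1] = 254;      // the G1 placeholder byte
        return variant;
    }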

Search similar subtypes


A search similar (SS) subtype is a character that is categorized to maximize the usefulness of
open composition searches. Consideration is made for how characters are used and how
people might reasonably search for them.

For letters, this mostly involves categorizing in terms of how they might be used to spell proper
names and titles. SS letters are likely to be transliterated letter for letter when spelling
proper names and titles in different languages. A SS relationship is similar to a
transliteration mapping but much less rigorous and invariably incomplete for a given script.
SS allocation provides what can be called “inherent transliteration”.
For symbols, SS allocation involves categorizing in terms of the overall expected visual
appearance, and in terms of how the characters might be used the same way on a character
for character basis. Visual appearance is considered in regard to text related usage, such as
when using a small font size suitable for fast reading, and in terms of what is distinguishable
in handwritten text. For example, DEGREE SIGN is a SS subtype of STANDALONE RING
MARK.
Punctuation characters and overloaded math operators do not have SS subtypes based on
function alone, only if the function and the glyph are both similar. Non overloaded math
operators have SS subtypes based on function alone, for example: INVISIBLE TIMES
(U+2062) is a SS subtype of DOT OPERATOR (U+22C5).
Letterlike symbols, including letters exclusively for math usage, are SS subtypes of letters only
if they are single letters and are not currency symbols. COPYRIGHT SIGN and REGISTERED
SIGN are not SS subtypes, they are subtypes of the misc symbols CATS.
Search similar allocation rearranges characters from what is called their native location and
allocates them into what is called their search similar location, which is their actual location
in the Bytext database. Native refers to the original, natural allocation of a character. For
example, a letter is originally allocated, naturally, among other letters of the same script
family. Decimal digits and most symbols are not considered SS characters. Other numeric
types can be SS characters. Search similar characters are arranged as subtypes in such a
way as to preserve scripts order.
An SS cross reference occupies the native location of a search similar character and points to
the search similar location. A SS character has a native cross reference property that
identifies its native location.
Search similar allocation slightly complicates the process of collation and the process of using
composition fragments to form characters such as might be used in a user interface. It is
justified because a user can benefit more from increased usefulness of open composition
searches than they would from having it slightly easier to program collation algorithms and
the user interfaces used to select characters. The methods used in collation and user
interfaces are transparent to the user and do not need to be as fast. The search similarity
concept could also be implemented algorithmically by a search tool but this places additional
burdens on users of search tools, and it would also make the equivalent regular
expression algorithmically slower.
Searching is something that will always be important to do algorithmically fast because even
though computers tend to get faster, it’s easy for the amount of data that needs to be searched
to grow even faster.
When open composition searching is not used, search similar allocation does not
algorithmically slow down searching for ANY character except in terms of the inherent
compression of the character. However, inherent compression is primarily determined by the
frequency of use principle, not by search similar allocation. There is a tradeoff, and as
mentioned the overall guiding principle is global business oriented practicality.
Tables that display SS relationships are part of the Bytext database.

Normalization subtypes
The normalization subtype is used to categorize all characters that may be important for most
types of normalizations. With this allocation, all such characters can be quickly scanned and
identified without lookup. With Bytext, common types of normalizations can therefore be
implemented faster than is possible with Unicode since Unicode normalizations require
extensive lookup for each character.
All ligating base counterparts, all ligating mark counterparts, all logical peers, all CO subtypes,
all variation selector variants, all font variants, all full width variants, and all compatibility
characters are normalization subtypes.

Normalization subtypes are defined by the pattern -255-. A character matching the pattern
-255 is said to be a normalization subtype supertype, even if there are no explicitly defined
subtypes of it. It is the supertype of all characters in the normalization subtype.
TAB as a character in text can best be described as a “markup-dependent-width space”. Text
editors that do not use TAB formatting are advised to treat it as a very wide space character
rather than an arrow. It is therefore a subtype of space. TAB is most commonly used to
indicate a leading indent paragraph format and to delimit row elements. Both of these uses
are more appropriate to indicate with markup. TAB is the normalization subtype supertype
of space because various normalization routines may choose to convert TAB to more
appropriate forms of markup.

Ligating variants
Ligating variants consist of ligating base variants and ligating mark variants. Ligating variants
are characters that should be adjacent to one or more other specified characters called
counterparts. Ligating variants are the Bytext way of handling most ligature characters from
Unicode; the Unicode “double diacritics” (U+0360, U+0361, and U+0362); and Unicode
“half marks” (U+FE20..U+FE23). Ligating variants are analogous to conjoining or subjoining
characters in Unicode except the properties in the Bytext database specify each allowable
counterpart on each side of the character. Without such a specified counterpart, a ligating
variant is called fractured.
Characters such as Arabic letters that ligate according to context are not ligating variants
because ligating variants must always ligate. Rather than use a new formatting character
that creates a context, or worse, an ugly sequence of other formatting characters such as
how Unicode uses ZWJ+ZWNJ+ZWJ to specify exceptions to normal ligating behavior;
Bytext provides a specific set of individual ligating variants that form such ligatures.
Most ligature characters from Unicode are divided into 2 or more ligating variants. Exceptions
include Hangul characters that some may consider to be ligatures, such as HANGUL
CHOSEONG SSANGKIYEOK (U+1101). The word “ligating” is a synonym of “joining”, and its
usage in Bytext is similar. Ligating can be considered a specific kind of joining, a distinct
style with a limited extent and application. Because of these limitations, ligating variants are
considered fractured if arranged improperly, whereas joining variants are not.
An implementation should not allow editing that creates fractured ligatures. Ligating variants
should be converted to their non ligating prevariant before splitting.
Ligating base variants are not mirrorable, but implementations will want to either keep them
from being reversed or (better yet) to convert them to their prevariant before reversing
them. Otherwise the act of reversing them will make them fractured.
In Bytext the preferred term for what Unicode calls a “double diacritic” is “spanning mark”, so
as to not be confused with the case when two diacritic marks are to be applied to a
character. Unicode has characters like “combining double acute accent” (U+030B) which
seem to confuse the issue further. A spanning mark itself is not a character but a concept
analogous to a regular mark, and there is likewise a standalone mark and a CSCC for each
spanning mark.
Instead of attempting to apply a single spanning mark to multiple characters which would
create a context, Bytext has 2 separate ligating marks for each character that is to have a
spanning mark. When applied to 2 separate characters appropriately, the 2 ligating marks
will combine to form 2 characters with a spanning mark. Each character with a ligating mark
is said to be a ligating mark variant, even if fractured.
Ligating mark variants map to Unicode half mark compositions. For each of these there is a
subtype character that maps to the corresponding Unicode “double diacritic” composition.
Ligating base variants may have many characters as counterparts on each side. However,
ligating mark variants must have only one counterpart for each side of the character. A
character can have more than one ligating mark. The counterpart of each side of each mark
is defined in the Bytext database. See the property info table in the Bytext database for
more information about how the ligating properties work.
Ligating mark variants are mirrorable. Their mirror is the same character with the ligating
mark(s) of its counterpart instead of its original ligating mark(s). This will not change the
direction of the ligating mark(s) itself(themselves).

The idea behind ligating variants is to have a form that is searchable with consistent open
composition regular expressions. For example, a user searching for the word “find” may be
unaware that the “f” and “i” may have been combined into a ligature. If the regular
expression does not account for this the word will be missed. Having to account for each
form places burdens on the searcher and slows down the regular expression algorithm
compared to open composition searching.

Compatibility characters
As in Unicode, compatibility characters (CC’s) are characters that are included in Bytext solely
for compatibility with other character standards. That is, they are otherwise considered to be
outside of the scope of Bytext. In Bytext there are 3 types of compatibility characters:
1. Substitutable CC’s, SCC’s. These are CC’s with a mapping to a compatibility substitute
(CS). All non ignorable CC’s are of this type. These are all allocated as normalization
subtypes (-255-). The semantic difference between a SCC and its corresponding CS is
indicated by the CS type property.
2. Non substitutable CC’s, NSCC’s. These are ignorable CC’s without any kind of mapping to a
CS. These include the ignorable non formatting control characters like null and start of
heading. NSCC’s are a fixed category identified by the pattern B126-255-130-.
3. Conditionally substitutable CC’s, CSCC’s. Ignorable CC’s with conditional mappings to CS’s.
The mapping is said to be conditional because whether a normalization form should map
them to one or more CS’s depends on surrounding characters. A CSCC without
surrounding characters will always map to a single Unicode character. CSCC’s are included
in Bytext so that all legal Unicode text is round trip convertible thru Bytext. CSCC’s are
divided into:
a. Excessive CSCC’s are a fixed category identified by the pattern B126-255-252-.
b. Defective CSCC’s are a fixed category identified by the pattern B126-255-253-.

Variation selector variants


These variants are meant to be equivalent to the abstract characters formed by the variation
selector characters in version 3.2 of Unicode. Only characters that do not fit into other
patterns are encoded in the variation selector patterns. If a character is allocated elsewhere
in Bytext before being designated as a variation selector variant in Unicode, a reserved
character with a cross reference to the original location will be left in its place in the
variation selector pattern.

ALLOCATION PRINCIPLES
"Those are my principles. If you don't like them I have others."
--Groucho Marx

The principles used to determine the allocation of characters in Bytext are numbered in order
of priority. Specific interpreted consequences of each principle are lettered beneath the
principle:

1. Characters are categorized by function.


a. Lowercase and uppercase share the same letter prevariant.
b. Characters are categorized by usage, or “search similarity”. Search similar letters
are likely to be transliterated letter for letter when spelling proper names and titles in
different languages.
c. All decimal digits are subtypes of the F1 decimal digit characters.
d. Braille characters are subtypes of the corresponding numbers, letters, and
symbols that they represent.
e. All arrow symbols are subtypes of the 4 F1 arrows.
f. All non overloaded currency symbols have the same F1 supertype.

2. More frequently used before less frequently used.
a. Lowercase letters are ordered before uppercase.
b. Characters directly entered from common keyboards are considered qualitatively
more frequently used than characters that require other input methods.
c. Modern scripts are ordered before archaic scripts.
d. DOLLAR SIGN (U+0024) is the CATS for currency symbols because it is
quantitatively more frequently used than CURRENCY SIGN (U+00A4), and
qualitatively more frequently used than any other currency symbol, since it is an
overloaded ASCII character. Also, DOLLAR SIGN is already an ambiguous currency
symbol, there are Canadian dollars, Australian dollars, US dollars, etc.
3. Simple before complex.
a. Decimal digits are ordered before letters or symbols.
b. Prevariants are always ordered (in terms of Bytext order) before variants.
c. Basic letters are ordered before modified letters, such as letters with diacritic
marks.
4. Characters more important for comprehension before characters less important for
comprehension.
a. Graphemes are ordered before other variant characters.
b. “GLUE” is ordered before “UNGLUE” because the former is more important for
comprehension.
5. General before specific.
a. Characters commonly used for meta information, such as programming languages,
are ordered before those that are not.
b. UNGLUE is ordered before SOFT HYPHEN.
6. Existing ordering is preserved.
a. The order of mark variants preserves the relative ordering of their corresponding
standalone marks. Reserved characters fill in any gaps.
b. The order of SS variants preserves scripts order. Reserved characters fill in any
gaps.
c. Characters are ordered to achieve mnemonic order. This is ordering that contains
some sort of text oriented mnemonic or expectation, to provide a series of
standard arbitrary orders. It is useful to have a common understanding of order,
even if it is arbitrary. A common arbitrary order for things that are often listed by
authors establishes expectations that make it easier to find things in the list. The
ordering of all Bytext characters is considered to have some potential mnemonic
order, especially when a rationale is given for an ordering.
d. Directions have a rationale for their ordering. Ordering linear before radial; left
handed (radial) before right handed; little endian before big endian; and rows
before columns can establish useful common expectations. Leftward arrow is
before rightward arrow; and left parenthesis is before right parenthesis. The
mnemonic is that most scripts write text left to right. Most text also moves from
left to right before going from top to bottom, so both left and right oriented
symbols are before any top and bottom symbols. Upward arrow is before
downward arrow. The mnemonic is that most scripts write text from top to
bottom. An arrow such as “north west arrow” will be a category of leftwards arrow
to preserve the existing ordering of left before up.
e. Miscellaneous mnemonic associations are preserved whenever possible: Small
before big; nothing before something; black before white; general before specific;
low energy before high energy; low frequency before high frequency. Out, off,
zero correspond to in, on, one, etc.
f. Unicode ordering is preserved whenever possible. Cross reference characters are
often used to achieve this.

F1 overview
19 ASCII punctuation and symbol characters have the same character codes in Bytext as
they do in ASCII. This is the maximum amount possible consistent with other Bytext
principles. SPACE and PEC are allocated early in the list of F1's because they are perhaps the
most shared of all characters, so it is convenient to not have to rearrange them for selective
differencing compression. From there, other visible ASCII characters are in the same relative
order as in ASCII.
PLUS-MINUS SIGN is the technical symbols CATS in an effort to increase inherent
compression. PLUS-MINUS SIGN is considered much more frequently used than DIMENSION
ORIGIN which is used in the icon representing the Unicode “Miscellaneous Technical” block.
PLUS-MINUS SIGN could logically belong in either this category or the Mathematical Operators
category, which helps to clarify that the difference between these two categories can be
vague or subtle. It is useful to separate the categories largely just for inherent compression.
BULLET, FONGMAN, dingbats, and similar characters are all allocated in the misc symbols
CATS because all misc symbols are considered to be potential bullets or dingbats.
Parenthesized and period formed characters are compatibility subtypes of the misc symbols
CATS.

Character code of first in group | Number of characters in group | Description of group | Approximate number of OBS subtypes in group | Additional notes
B000 | 10 | Decimal digits | |
B010 | 26 | Lowercase Basic Latin alphabet | |
B036 | 12 | Identical to ASCII codes | 1320 |
B048 | 1 | Space | 100 | Whitespace
B049 | 1 | PEC | 100 |
B050 | 8 | More ASCII | 880 |
B058 | 7 | Identical to ASCII codes | 770 |
B065 | 3 | More ASCII | 330 |
B068 | 4 | Arrows | 256 |
B072 | 36 | Private | 4440 |
B108 | 1 | Greek script CATS | | Script CATS
B109 | 1 | Cyrillic script CATS | |
B110 | 1 | Hebrew script CATS | |
B111 | 1 | Arabic script CATS | |
B112 | 1 | Brahmic script CATS | |
B113 | 1 | Additional Brahmic script CATS | |
B114 | 1 | Hangul script CATS | |
B115 | 1 | Kana script CATS | |
B116 | 1 | Han script CATS | |
B117 | 1 | Additional Han script CATS | |
B118 | 1 | Additional scripts CATS | |
B119 | 1 | Combined punctuation CATS | |
B120 | 1 | Combined standalone mark CATS | |
B121 | 1 | Math symbols CATS (∞) | |
B122 | 1 | Technical symbols CATS (±) | |
B123 | 1 | Geometric symbols CATS (▲) | |
B124 | 1 | Music symbols CATS (♪) | |
B125 | 1 | Misc symbols CATS (☺) | |
B126 | 1 | GLUE | | Ignorables
B127 | 1 | PRIVATE CHARACTER 127 | |

Numeric types
Numeric types that are SS to F1 decimal digits are allocated in scripts order with a few
additions due to the fact that some scripts have more than one numbering system
associated with them. Extended Arabic-Indic SS subtypes are ordered directly after Arabic-
Indic SS subtypes. Hangzhou SS subtypes are ordered directly after Han accounting SS
subtypes which are ordered directly after ideographic SS subtypes. The rest follow Unicode
order.
Symbols used as constants like GREEK LOWERCASE LETTER PI (U+03C0) do not have a
numerical value in the Bytext database. Only symbol marks and 2 Koranic annotation marks
apply to all numeric types.

Numeric type | SS allocation
decimal digit | All are SS to F1 decimal digits.
symbol decimal digit | All are normalization subtypes of F1 decimal digits.
numeral | Most are SS to letters.
Kanbun annotation | All are normalization subtypes of ideographic numeric types which are SS to F1 decimal digits.
symbol | All with numeric values from 0..9 are SS to F1 decimal digits except parenthesized forms, period forms, and numerator and/or denominator forms which are compatibility subtypes of the misc symbols CATS.
ideograph | All with numeric values from 0..9 are SS to F1 decimal digits. The rest are dispersed thruout B116- and B117-, not necessarily ordered by numeric value.
Han accounting | All with numeric values from 0..9 are SS to F1 decimal digits. The rest are dispersed thruout B116- and B117-, not necessarily ordered by numeric value.
binary | No SS allocation

Marks
Marks are divided into 3 basic types based on the kinds of characters they apply to:
1. Script family specific. These apply only to certain characters within a script family, with
limited exceptions.
2. Symbol. These apply to most visible characters. These consist of characters in the
Unicode “Combining Diacritical Marks” block under the headings “Overstruck Diacritics”
and “Additions”; and also the marks of the Unicode “Combining Marks for Symbols”
block.
3. Additional. These consist of the remaining marks in the Unicode “Combining Diacritical
Marks” block except for IPA specific marks which only apply to base characters found in
IPA.

The Combined Standalone Marks CATS consists of basic European diacritics (a subset of
European script family specific marks); followed by any script specific marks that could
benefit from inherent compression by being allocated in this category instead of their own
script category; followed by “Symbol” marks; followed by the “Additional” marks.
With a few exceptions, multiple marks must be composed in such a way as to reflect Unicode
canonical ordering. For example, LATIN LOWERCASE A WITH DIAERESIS has an acute mark
variant, and LATIN LOWERCASE A WITH ACUTE has a diaeresis mark variant. Since diaeresis
and acute accent marks have the same combining class, rearranging the order of how the
marks are encoded would change the character. Since they are different characters, they
are encoded as separate characters in Bytext. However, LATIN LOWERCASE A WITH TILDE
OVERLAY has a grave mark variant, but LATIN LOWERCASE A WITH GRAVE does NOT have
a tilde overlay mark variant. Since the tilde overlay mark has a different combining class
than the grave mark, rearranging how the marks are encoded would not change the
character (it would not map to a different character or sequence in Unicode normalization
form C), so the redundant form is not allocated with a character in Bytext (the encoding
position is private).
The exceptions to the above include sequences of Thai and Lao marks that are considered
invalid but allowed because of combining class “simplifications”. Invalid sequences include
the Thai sequence KO KAI + MAI EK + SARA II (as opposed to KO KAI + SARA II + MAI
EK); or a musical symbol composition consisting of a notehead + flag + stem (as opposed to
notehead + stem + flag). In these cases, any duplicate or invalid sequences are encoded as
normalization subtypes, and there is a cross reference character at the position where each
would otherwise be encoded.
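
The canonical ordering rule above can be checked mechanically with Unicode combining class data. A minimal sketch, using Python's standard unicodedata module (the function name is illustrative, not part of the standard):

    import unicodedata

    def order_is_significant(mark_a, mark_b):
        """True when applying mark_a then mark_b differs from mark_b then
        mark_a under Unicode normalization form C, i.e. when both marks share
        the same nonzero canonical combining class. Only in that case does
        Bytext allocate characters for both orders."""
        ccc_a = unicodedata.combining(mark_a)
        ccc_b = unicodedata.combining(mark_b)
        return ccc_a == ccc_b and ccc_a != 0

    # Diaeresis (U+0308) and acute (U+0301) share combining class 230, so the
    # two orders are different characters and both are allocated.
    print(order_is_significant("\u0308", "\u0301"))  # True
    # Tilde overlay (U+0334, class 1) and grave (U+0300, class 230) reorder
    # under normalization, so only one order is allocated; the other position
    # is private.
    print(order_is_significant("\u0334", "\u0300"))  # False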

European script family


The European script family is defined as including Latin, Greek, and Cyrillic scripts. This
section describes the allocation of each script.
Default uppercase variants are at -129. Locale specific uppercase mappings are not identified
by pattern and are not in the Bytext database. When a locale specific uppercase variant is
allocated as the uppercase of a different character, the -129 subtype will be occupied by a
reserved character with a cross reference to the locale specific uppercase variant.
Greek ypogegrammeni is a ligating base variant with 3 allowable counterparts to the left:
lowercase alpha, lowercase eta, and lowercase omega ligating base variants. A titlecase
operation uppercases the first character in each appropriate word, so if a lowercase ligating
base variant does not have an uppercase ligating base variant, it uses an uppercase that is
not a ligating base variant. Since this would fracture the ligature, the ypogegrammeni
ligating base variant should be converted to it’s prevariant (lowercase iota) by the titlecase
operation. An uppercase operation would likewise convert both ligating base variants to
uppercase variants that are not ligating base variants. This results in the accepted Unicode
behavior of these characters while simplifying the definition of a titlecase operation. Instead
of a titlecase operation requiring a whole new property set for each character (the titlecase
property), a titlecase operation simply uppercases the first character in each appropriate
word just like one would expect.
GREEK LOWERCASE FINAL SIGMA is arguably a word position variant, but it is treated as an SS
subtype of GREEK LOWERCASE SIGMA in Bytext, ie not with an Arabic style word position
variant pattern. European letters are thus defined as not having word position variants.
LATIN LOWERCASE SHARP S (U+00DF), commonly used in German, is divided into ligating
base variants of LOWERCASE LONG S (U+017F) and LOWERCASE S. This simplifies
searching, transliteration, titlecasing and uppercasing. To maximize inherent compression,
these ligating base variants are the normalization subtype supertype of their prevariant.
Digraph characters such as LATIN CAPITAL LETTER DZ (U+01F1) are also divided into
ligating variants.
Non SS subtypes of Basic Latin letter prevariants are allocated as subtypes of the Additional
Latin script CATS, which is at B118-129. This category includes all non SS IPA characters.
In the Cyrillic script CATS, Cyrillic short I is encoded as a letter prevariant, not as a mark
variant of Cyrillic letter I. This is because the most frequently used language that uses the
script (Russian) considers it part of the basic alphabet. The same cannot be said of
characters like ä, ö, and ü as used for German, or ñ as used for Spanish. The pattern based
position of the breve mark variant of Cyrillic letter I is occupied with a reserved character.

Allocation table
An allocation table shows how specific character types are composed at each composition
level. Columns represent the character information that can be obtained at each composition
level. To compose a particular character, one chooses a character type in each succeeding
composition level. Character types available for composition must be located directly to the
right of the previous character type or to the right of any contiguous blank cell beneath it.
For simplicity, normalization subtypes are generally not shown in allocation tables.
Normalization subtypes are allocated by patterns that are consistent thruout all scripts
unless explicitly stated otherwise. These patterns are described in the normalization
subtypes section.
Only 4 stacking levels for marks are shown, for simplicity. More stacking levels may be
allowed in the final standard. Mark variants in this script family always begin at -130 and SS
subtypes always begin at -245. Mark variants and SS subtypes are peer groups, so if either
list needs to be extended, they are extended with group placeholders.
The following table describes the allocation of Basic Latin and SS subtypes. Non SS Greek,
Cyrillic, and Extended Latin letter prevariants are allocated with a similar scheme except the
compositions are preceded with their corresponding script CATS.

___Key:

L = Basic Latin letter prevariants


U = uppercase variants
M = mark variants
SS = search similar subtypes
SS-CATS = SS script category symbols, for SS scripts in the same script family. Largely
consists of SS cross reference characters. Each script has its own standalone mark
category symbol.

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
L U M M M M SS
SS
SS
SS
SS
M M M M SS
SS
SS
SS
SS
SS-
CATS

Arabic script family


Instead of specifying joining behavior with formatting control characters like Unicode, Bytext
encodes each word position form (nominal, left, right, medial) as normalization subtypes.
Specific word position forms should only be used to indicate an exception to standard Arabic
joining rules. Thus character variants that do not normally contribute to word meaning can
be systematically disregarded by, for example, spell checking operations.
All Arabic ligature presentation forms are divided into ligating variants. Ligating variants, as
always, are normalization subtypes. Arabic ligating variants should only be used to indicate
an exception to Unicode Arabic ligating rules. Arabic ligating variants are generally not
recommended for use, since ligation is considered best handled at the font level; the
“lam alef” ligating variants that can be assumed from word position form and standard
Arabic ligating rules are especially discouraged. To further aid in the normalization of
lam-alef ligating variants and to further discourage their encoding, they are on the second
singular placeholder level of the normalization subtype supertype.
Only script specific marks are applied to Arabic characters. The 2 Arabic Koranic annotation
signs that enclose digits (U+06DD and U+06DE) are applied to all decimal digits from all
scripts.

Syriac allocation is essentially the same as Arabic with the addition of 3 glyph forms for
SYRIAC LETTER ALAPH in the normalization subtype. These are the final joining; final non-
joining except following dalath and rish; and final non-joining following dalath and rish glyph
variants. Whether to encode Estrangela, Serto, and East Syriac type style variants as
normalization subtypes is listed in the Discussion Topics.

Allocation table

___Key:

CATS = Arabic script family category symbol


N = normalization subtype supertype
WP = word position variants: nominal, left, right, and medial
L = Arabic letter prevariants
SMCATS = Arabic standalone mark category symbol
M = mark variants
Pun = punctuation and script specific symbols

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
CATS L M M M M N WP
N WP
N WP
N WP
N WP
Pun
SMCATS

Brahmic script family


The Brahmic script family is defined as including the Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Myanmar, Tibetan, Sinhala, Thai, Lao, and
Khmer scripts. This is perhaps the most complex of all script families, but Bytext features
such as ligating variants, cross references, and normalization subtypes are in place to deal
with it gracefully.
The most important feature that makes it possible to describe all these scripts with the same
allocation table is that each possible syllable is directly encoded as a single character. It is
believed that this syllable oriented approach is the only way to remain true to the nature of
these scripts, and it solves the various problems with the approaches used to spell syllable
variants with combining characters and formatting control characters.
Half consonant characters are normalization subtype supertypes of syllables consisting of full
consonants.
SS allocation for Brahmic scripts is a work in progress open to discussion. The basic idea is to
base it on phonetic similarity. It might even be possible to fit Brahmic allocation into one
CATS.

Indic notes
The term “Indic” will be used to describe a subset of the Brahmic script family consisting of
the Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam,
and Myanmar scripts. This is useful to describe certain features of these scripts as a group
that are not shared with other Brahmic scripts.

Only script specific marks apply to Indic characters.
Some Indic characters are in the Unicode composition exclusion table, meaning that they are
not found in normalization form C. These characters are invariably mark variants of other
characters. Bytext encodes these as cross reference characters that map to the
corresponding mark variants. These cross reference characters in the Devanagari script can
function as categories for search similar subtypes.
There are no vowel mark variants of independent vowels. The Unicode construct of applying a
dependent vowel sign to independent vowel A is encoded as a compatibility variant of the
independent vowel syllable.
There are a limited number of vowel mark variants of full consonants, currently limited to
vowel U and vowel UU mark variants of the full consonant RA.
For each Indic syllable with x consonants, there are 3^(x-1) - (x-1) compatibility
variants, one for each possible way explicit virama and explicit half consonant forms could be
applied to the syllable as a whole. When x = 1, the compatibility variant is the explicit
virama because the explicit half consonant form already has an allocation (the normalization
subtype supertype).
When the consonant R (any form of letter RA) appears in a syllable, the glyph of the syllable
should conform to Unicode consonant RA rules, as the syllables will map to Unicode
sequences that are subject to the rules. Exceptions to these rules are provided as
compatibility variants of syllables. The consonant RA rules are summarized in the following
table:

___Key:

R = consonant R, any form of letter RA


C = any consonant other than R
V = any vowel including the inherent vowel of a consonant

___Table:

Syllable | Description
RCV, RCCV, or RCCCV | R is in Rsup form. The mark Rsup applies to the next following non explicit half consonant form.
CRV, CCRV, or CCCRV | R is in Rsub form. The mark Rsub applies to the previous C, which is in its nominal form, and may form a conjunct with R.
CRCV, CCRCV, or CRCCV | The consonants in the syllable are in a dead consonant form which may include a visible virama mark.
RRCV, RRCCV, CRRCV, or CCRRV | The first R is in Rsup form applied to the second R which is in explicit virama form.

Explicit virama, explicit half form, and consonant RA variants are called systematic Indic
syllable variants because they are determined by a regular set of rules. Systematic syllable
variants are listed first in the list of normalization subtypes, then syllable variants that
consist of other conjunct forms. Only the most common conjunct variants are included.

Tibetan and Sinhala notes
Fitting Tibetan and Sinhala into SS subtypes of Devanagari is a work in progress. SS subtypes
will be those that have one to one transliteration with Devanagari. Non SS prevariants will
be listed in the Additional Brahmic CATS.
Tibetan syllables that use fixed form subjoined consonants and TIBETAN FIXED FORM LETTER
-A (U+0FB0) are encoded as normalization subtypes of their non fixed form prevariants.
Each yig-go head mark ligature will be available as ligating variants of its component
characters.

Thai and Lao notes


Thai and Lao are treated as CV (consonant-vowel) syllable oriented scripts. Non combining
vowels from Unicode are treated as if they were combining characters, except if they are not
preceded by an appropriate Thai or Lao consonant in Unicode text, they are transcoded into
Bytext as standalone marks instead of as CSCC’s. This prevents multiple spellings for the
same syllable and provides the following advantages relative to Unicode encoding:
1. Characters are stored in logical order. This is a better long term solution than glyph based
encoding for the same reasons that it was advocated for Indic scripts by ISCII and
Unicode.
2. Maintains a consistent model with all other Brahmic scripts. Also is consistent with all
other syllable oriented scripts such as Ethiopic and Canadian Aboriginal Syllabics scripts.
3. Simplifies collation.

Note that editing methods can exactly mimic the effect of using standalone marks to spell
syllables without actually encoding standalone marks.

Khmer notes
Khmer encoding in Unicode is deeply problematic. There is an official request by the government of
Cambodia for rescission of the Unicode encoding approach. The Unicode encoding approach
to Khmer clearly does not reflect the true nature of the script: numerous characters were
invented to force it into the model Unicode uses for Indic scripts.
In Bytext Khmer script is treated as a C(C)V (consonant-optional consonant-vowel) syllable
oriented script. All possible syllables are directly encoded in Bytext. The Bytext encoding
model solves the problems with Unicode Khmer encoding:
KHMER INDEPENDENT VOWEL QAQ (U+17A3) is a compatibility subtype of KHMER LETTER QA
(U+17A2). This allows for direct Pali/Sanskrit transliteration and provides a fallback to
KHMER LETTER QA.
There are a limited number of vowel mark variants of independent vowels encoded in Khmer,
similar to how there are a limited number of vowel mark variants of full consonants in Indic
scripts. KHMER INDEPENDENT VOWEL QAA (U+17A4) is a vowel AA mark variant of KHMER
LETTER QA (U+17A2).
KHMER INDEPENDENT VOWEL QOO TYPE TWO (U+17B2) is a normalization subtype of KHMER
INDEPENDENT VOWEL QOO TYPE ONE (U+17B1). This is the kind of thing the normalization
subtype was made for.
KHMER SIGN BEYYAL (U+17D8) is divided into ligating base variants of KHMER SIGN KHAN
(U+17D4), KHMER LETTER LO (U+179B) and KHMER SIGN KHAN (U+17D4). KHMER
INDEPENDENT VOWEL QUK (U+17A8) is divided into ligating base variants of KHMER
INDEPENDENT VOWEL QU (U+17A7) and KHMER LETTER KA (U+1780).

Allocation table

___Key:

CATS = Brahmic script category symbol


A-CATS = Additional Brahmic script category symbol

C = full consonants. A cross reference character that maps to independent vowel A is
listed before the consonants.
AC = Additional consonants, consists of nukta variants and non SS consonants
V = vowels
M = marks that apply to syllables including vowel modifier mark variants and tone mark
variants
SS = search similar subtypes
D-SMCATS = Devanagari standalone mark category symbol, subtypes include the
standalone marks Rsub, Rsup, and virama --and SS subtypes of each
SS-CATS = SS script category symbols, for SS scripts in the same script family. Largely
consists of SS cross reference characters. Each script has its own standalone mark
category symbol.
SS-SMCATS = Standalone mark category symbols of each Brahmic script other than
Devanagari.
N = normalization subtype supertype, subtypes include explicit virama variants and
explicit half form variants --and SS subtypes of each
Pun = punctuation and script specific symbols

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
CATS V M SS
SS
N
C V M SS
SS
N
C V M SS
SS
N
C V M SS
SS
N
C V M SS
SS
N
SS
SS
SS
SS
A-CATS
AC V M SS
SS
N
AC V M SS
SS
N
AC V M SS
SS
N
AC V M SS
SS
N
SS
SS
SS

SS
Pun
D-
SMCATS
SS-CATS
SS-
SMCATS

Hangul script
Hangul is a syllable oriented script so all possible syllables are encoded in Bytext. There are no
joining or ligating variants of Hangul prevariants. The only script specific marks that apply to
Hangul characters are HANGUL SINGLE DOT TONE MARK (U+302E) and HANGUL DOUBLE
DOT TONE MARK (U+302F).
Hangul syllables composed of only a vowel map to precomposed vowel syllables in Unicode, or
if there is not a precomposed syllable equivalent they map to HANGUL CHOSEONG FILLER
(U+115F) followed by a Unicode conjoining Jamo vowel. Vowel letters (corresponding to
U+314F..U+3163 and U+3187..U+318E) are encoded as compatibility variants of vowel
syllables. The choseong filler aspect of such syllables, ie the appearance of Ieung, is
considered to be a font issue.
The position in C2 that HANGUL CHOSEONG IEUNG (U+110B) would occupy based on Hangul
consonant order is filled with a reserved character with a cross reference to HANGUL
CHOSEONG IEUNG. HANGUL CHOSEONG IEUNG (U+110B) is a normalization subtype of
HANGUL CHOSEONG FILLER (U+115F). HANGUL CHOSEONG FILLER itself is listed in C2
after the consonants. The HANGUL FILLER character (U+3164) is the normalization subtype
supertype of HANGUL CHOSEONG FILLER. HANGUL JUNGSEONG FILLER (U+1160) is a
normalization subtype of HANGUL CHOSEONG FILLER.

Allocation table

___Key:

CATS = Hangul script category symbol


C = consonants, including Choseong filler
V = vowels
M = mark variants
SMCATS = Hangul standalone mark category symbol

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
CATS C V C M M
M M
M M
V C M M
M M
SMCATS

Kana script family


The Kana script family is defined as including the Hiragana and Katakana scripts.
Except for the voiced iteration mark characters, characters with precomposed marks from
Unicode are encoded as mark variants. There are cross references in the (non mark variant)
peer group to preserve the original Unicode order.

Full width variants are the normalization subtype supertype.

Allocation table

___Key:

CATS = Kana script family category symbol


H = Hiragana
K = Katakana
S = small variants
M = mark variants
AK = Additional Katakana without a Hiragana equivalent, KATAKANA LETTER SMALL
KA (U+30F5) and KATAKANA LETTER SMALL KE (U+30F6)
Pun = punctuation and script specific symbols
SMCATS = Kana standalone mark category symbol

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
CATS H S M M M K
K
K
K
M M M K
K
K
K
AK
Pun
SMCATS

Han script family


The Han script family is defined as encompassing all East Asian or “CJK” ideographs. The term
“Han” is simply considered more elegant than “CJK” and does not imply nationality. As a
script family it can be assumed to encompass other related scripts that are not strictly “Han
script”.
All non punctuation, non SS Han characters are classified as a subtype of a single KangXi
radical. Subtypes of radicals are ordered by usage frequency which is to be determined.
There are mark variants of each ideograph.
SS allocation will likely consist of simplified variants followed by other variant types. There will
be a pattern list of simplified variants such as Chinese simplified, Japanese simplified, and
Korean simplified (reserved). Some variants may be ligating variants. If the simplified
mapping of a traditional ideograph consists of more than 1 character, the simplified
characters will not be SS subtypes. These issues are all open to discussion.

Allocation table

___Key:

CATS = Han script category symbol


A-CATS = Additional Han script category symbol

P1 = level 1 singular placeholder character
P2 = level 2 singular placeholder character
i1 = instances of characters with the KangXi radical of its
supertype, ordered by usage frequency
i2 = continuation of i1
i3 = continuation of i2
R1 = KangXi radicals 1 to 120
R2 = KangXi radicals 121 to 214
M = mark variants
SMCATS = Han standalone mark category symbol
SS = search similar subtypes

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
CATS R1 P1 P2 i3 M SS
SS
i2 M SS
SS
i1 M SS
SS
A-CATS R2 P1 P2 i3 M SS
SS
i2 M SS
SS
i1 M SS
SS
SMCATS

Ethiopic script

Allocation table

___Key:

A-CATS = Additional Scripts category symbol


CATS = Ethiopic script category symbol
C = consonants
V = vowels
M = mark variants
Pun = punctuation and script specific symbols

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
A-CATS CATS C V M
Pun

Canadian Aboriginal Syllabics script
Canadian Aboriginal Syllabics script is treated as a CV (consonant-vowel) syllable oriented
script. Each possible syllable is encoded. Vowel /e/ is treated as the inherent vowel.

Allocation table

___Key:

A-CATS = Additional Scripts category symbol


CATS = Canadian Aboriginal Syllabics script category symbol
C = consonants with vowel /e/
V = vowels
SS = SS regional variants in order: Algonquian, Inuktitut,
Athapascan.

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
A-CATS CATS C V SS
SS
Pun

Mongolian script
Most Mongolian letters are SS to Cyrillic letters, so the allocation table mostly indicates how
the SS cross reference characters are allocated. SS to Cyrillic allocation doesn’t change
inherent compression and is useful so users can easily (albeit roughly) search for Altaic
words written with Mongolian and Cyrillic scripts simultaneously.

Allocation table

___Key:

A-CATS = Additional Scripts category symbol


CATS = Mongolian script category symbol
N = normalization subtype supertype
V = Nominal, left, right, and medial word position variants;
Mongolian free variation variants; Mongolian vowel separator
variants (only for vowels A and E).
L = Mongolian letter prevariants
SMCATS = Mongolian standalone mark category symbol, the standalone
version of MONGOLIAN LETTER ALI GALI DAGALGA (U+18A9)
M = mark variants
Pun = punctuation and script specific symbols

___Table:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
A-CATS CATS L M M M N V
N V
N V

N V
Pun
SMCATS

NEW CHARACTERS
“What can I get for a rib?”
--Adam

Characters are units of language that are inherently finite. All new Bytext characters must
satisfy this requirement of having a finite nature. "New" refers to characters that are not in
Unicode. Even tho the nature of a character is finite, there is no particular limit on the
number of finite elements that are useful, because that number will tend to grow over time.
In order to remain a finite concept, characters must limit the extent to which they are
distinguished based on composition. Font variations and color variations are composite
concepts that have infinite potentiality and are discouraged. Algorithms that completely
define an infinite number of characters are not appropriate for the character encoding
protocol layer and will be excluded from Bytext. Visible characters to date do not imply any
color except black and white because the symbols evolved primarily as a handwritten
language. Characters that imply color variations should be excluded unless there is
compelling reason to include them and there is some finite nature to the number of potential
variations.
Ideographs seem like a contradiction of this “finite principle” in how they can be used to
represent words, but they are not so long as there is not an algorithm that defines an
infinite number of variations. Each ideograph must be actually used and part of an existing
standard; this is the main factor preventing ideographs from having an infinite nature.
Characters that have not necessarily been used before that are allocated by an algorithmic
pattern, such as mark variants, are allowed but the total number of compositions must be
limited (as they are for mark variants). Extensive composition belongs in the realm of
markup language. New formatting characters that may be needed to interoperate with
associated standards are allowed.
There are vast numbers of potential characters, even more than the 1,111,998 potentially available in
Unicode. See the references section. CJK scripts may require new ideographs or pop culture
may become saturated with common icons or symbols that would be useful to have as
characters. Humanity may evolve new characters or may contact new intelligent life with
new characters.
It is short sighted to put an arbitrary limit on the number of characters people may choose to
create and utilize --that is, if the arbitrary limit is small. With Bytext, even if a limit of 256
bytes is set on the maximum length of a Bytext character, for all practical purposes there
will be no limits on the total number of characters (256 to the power 256 is a larger number
than the number of atoms in the known universe). There’s a lot you can do with these large
numbers, as the characters in this section demonstrate.
The primary role of Bytext is to reflect existing usage of units of language that can be
considered to be characters, not to invent characters. Any characters invented by Bytext
without pre-existing widespread usage must have exceptional utility; there must be a solid
reason to expect widespread usage. This is believed to be the case for the characters in this
section.

Type 1 ignorables
First, a description of ignorables in general. Ignorable characters ("ignorables") are characters
with the ignorable fixed category, which is identified by the pattern B126- or B127-. Ignorable
refers to how these characters are recommended to be treated in searches. Ignorable means
they may be ignored in searches, not that they must be. This is separate from characters with glyphs that
might not be available for display and is separate from the concept of compatibility.

Type 1 ignorables are ignorables that have an informative context dependent visibility
property. The informative context dependent visibility fixed category is identified by the
pattern B048- or B049- or B126-. Type 1 ignorables are a fixed category identified by the
pattern B126-.
The fallback glyphs property of type 1 ignorables maps to a Bytext character that best
matches the glyph of the character when it is displayed. For example, the fallback glyphs
property for SOFT HYPHEN (U+00AD) references HYPHEN (U+2010).
Bytext does not specify any automatic editing of Bytext sequences, so if a character is ignored
it should not be erased or changed except as indicated by higher protocols.
Not all control characters are ignorable; some are non ignorable normalization subtypes. TAB
is a prominent example. Normalization subtypes will tend to fall back to their supertype
depending on interpretation; whereas ignorables will tend to be ignored or not be ignored
depending on interpretation.
B126, the GLUE character (U+2060), is the F1 supertype for all type 1 ignorables. GLUE is not
a new character but the type 1 ignorable category is new. COMBINING GRAPHEME JOINER
(U+034F) is also a type 1 ignorable.
GLUE and UNGLUE are not considered formatting characters in Bytext because they do not
change any properties of adjacent characters, they only change the resulting behavior of the
boundary between 2 characters which is explicitly defined as being dependent on the
properties of both characters. Formatting characters are not used in Bytext except for
compatibility.

AUTO LEC
AUTO LINE ENDING CHARACTER, B126-255-132, also called AUTO LEC, is a new character
that indicates an optional line break. This is the character to use when a line break character
(not just a line break opportunity character) is necessary but not explicitly requested. LINE
ENDING CHARACTER, (LEC), is a character that indicates a mandatory line break, but AUTO
LEC does not indicate a mandatory line break and it does not even indicate a line break
opportunity. Typically, an implementation that does not have automatic line breaking or that
requires explicit line breaks will break at every AUTO LEC, and a system that does have
automatic line breaking will ignore every AUTO LEC. It is useful so as to not propagate
incorrect line breaking information when text is exchanged between implementations with
different ways of handling line breaking.
With AUTO LEC, the original intent of the author can be preserved, while line breaking
imposed by different implementations may be used or ignored. There is no equivalent of
AUTO LEC in Unicode.
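
As a rough illustration of the two behaviors described above, the following sketch assumes text has already been split into Bytext characters and uses a placeholder value for AUTO LEC; neither the representation nor the function name is defined by the standard.

    AUTO_LEC = "<AUTO LEC>"  # placeholder standing in for B126-255-132

    def layout(chars, has_automatic_line_breaking):
        """Illustrate the two typical treatments of AUTO LEC.
        Without automatic line breaking, the implementation breaks at every
        AUTO LEC; with automatic line breaking, AUTO LEC is simply ignored and
        the implementation's own algorithm chooses the break points."""
        if has_automatic_line_breaking:
            return ["".join(c for c in chars if c != AUTO_LEC)]  # wrapped later
        lines, current = [], []
        for c in chars:
            if c == AUTO_LEC:
                lines.append("".join(current))
                current = []
            else:
                current.append(c)
        lines.append("".join(current))
        return lines

    sample = list("first line") + [AUTO_LEC] + list("second line")
    print(layout(sample, has_automatic_line_breaking=False))  # two lines
    print(layout(sample, has_automatic_line_breaking=True))   # AUTO LEC ignored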

BELI and BELD


BIDIRECTIONAL EMBEDDING LEVEL INCREASER, referred to as BELI, B126-255-135, is a
formatting character for bidirectional text as defined in the Bidirectionality algorithm section.
The context that makes this character have context dependent visibility is whether or not
the display component implements the Bytext bidirectionality algorithm. If the display
component does implement the algorithm, the character itself should not be visible except
as indicated by higher protocols. There are numerous other ways to format text to make
complex bidirectional embeddings easy and unambiguous to read. If the display component
does not implement the algorithm, the character should be displayed as its first fallback
glyph character: SYMBOL FOR BIDIRECTIONAL EMBEDDING LEVEL INCREASER, B125-135.
Sample glyphs for these and other bidirectionality symbols are available in the Bidi symbols
section.
A unique transformation is something like a color change, for the purpose of distinguishing the
fallback glyph appearance of BELI from the actual SYMBOL FOR BIDIRECTIONAL EMBEDDING
LEVEL INCREASER character. This is the same principle used for any character that is
displayed as its fallback glyph.
BIDIRECTIONAL EMBEDDING LEVEL DECREASER, referred to as BELD, B126-255-136, is the
complementary formatting character to BELI. As with BELI, if a display component does not
implement the Bytext bidirectionality algorithm, BELD should be displayed as a unique
transformation of the first fallback glyph character: SYMBOL FOR BIDIRECTIONAL
EMBEDDING LEVEL DECREASER, B125-136.
In the Bytext Bidirectionality Algorithm (BBA), BELI and BELD override the effects of Unicode
Bidirectional Algorithm (UBA) formatting characters (RLE, LRE, RLM, LRM, RLO, LRO, and
PDF) present in the same paragraph. An editor inserting BELI and BELD should (but not
must) erase any UBA formatting characters present in the same paragraph.

AUTO variants
AUTO BIDIRECTIONAL EMBEDDING LEVEL INCREASER, referred to as AUTO BELI, B126-255-
135-255; and AUTO BIDIRECTIONAL EMBEDDING LEVEL DECREASER, referred to as AUTO
BELD, B126-255-136-255, duplicate the effects of UBA formatting characters present in the
same paragraph. AUTO BELI and AUTO BELD are automatically inserted into Unicode
bidirectional text during the transcoding of Unicode to Bytext in such a way as to imitate the
bidirectional display of UBA formatting characters. This way, an application that only
implements the Bytext Bidirectionality Algorithm (BBA) will display bidirectional Unicode text
the same way as a UBA compliant display device would.
AUTO BELI and AUTO BELD are always treated equivalently to BELI and BELD; within Bytext
they are respectively considered to be equivalent codes. The differences only come into play
when Bytext is transcoded into Unicode as described below. The differences between the
characters can also optionally be used for troubleshooting purposes, to determine what kind
of auto editing functions are implemented in the input method being used.
As with BELI and BELD, if a display component does not implement the BBA, AUTO BELI and
AUTO BELD should be displayed as a unique transformation of their first fallback glyph
characters: SYMBOL FOR AUTO BIDIRECTIONAL EMBEDDING LEVEL INCREASER, B125-137;
and SYMBOL FOR AUTO BIDIRECTIONAL EMBEDDING LEVEL DECREASER, B125-138,
respectively.

Unicode to Bytext conversion


AUTO BELI and AUTO BELD duplicate the effects of UBA formatting characters present in the
same paragraph. They are automatically inserted into Unicode bidirectional text during the
transcoding of Unicode to Bytext in such a way as to imitate the bidirectional embedding
levels resulting after step I2 of the UBA. The UBA formatting characters are preserved
during transcoding.
The following grid demonstrates how BBA formatting characters are used to mimic the specific
numbered embedding levels resulting after step I2 of the UBA. Each row represents a
paragraph. A run of characters represented by a given UBA embedding level can be empty.
In the following grid, color is only used to indicate repetitive patterns:

0
AUTO BELI   1   AUTO BELD
AUTO BELI   1   AUTO BELI   2   AUTO BELD   1   AUTO BELD
AUTO BELI   1   AUTO BELI   2   AUTO BELI   3   AUTO BELD   2   AUTO BELD   1   AUTO BELD
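
One way to read the grid: between consecutive runs, one AUTO BELI is inserted for each unit increase in embedding level and one AUTO BELD for each unit decrease, with any open embeddings closed at the end of the paragraph. A minimal sketch under that reading, assuming the paragraph is already available as (run, embedding level) pairs produced by a UBA implementation after step I2 (the constants and function name are placeholders, not part of the standard):

    AUTO_BELI = "<AUTO BELI>"  # placeholder for the AUTO BELI character
    AUTO_BELD = "<AUTO BELD>"  # placeholder for the AUTO BELD character

    def insert_auto_bidi(runs):
        """runs: list of (text, embedding_level) pairs for one paragraph, in
        logical order, as produced after step I2 of the UBA. Returns the
        character stream with AUTO BELI / AUTO BELD inserted so that a
        BBA-only display reproduces the same embedding levels."""
        out, level = [], 0
        for text, target in runs:
            while level < target:
                out.append(AUTO_BELI)
                level += 1
            while level > target:
                out.append(AUTO_BELD)
                level -= 1
            out.append(text)
        while level > 0:  # close any embeddings still open at paragraph end
            out.append(AUTO_BELD)
            level -= 1
        return out

    # Mirrors the third row of the grid: levels 1, 2, 1 around a level 0 frame.
    print(insert_auto_bidi([("abc", 0), ("HEB", 1), ("eng", 2), ("WEH", 1), ("xyz", 0)]))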

Bytext to Unicode conversion


Text created in Bytext using the BBA then transcoded to Unicode will display as intended in a
UBA compliant application.
If UBA formatting characters are present in a Bytext paragraph, and BELI and BELD are not
present, AUTO BELI and AUTO BELD are eliminated when text containing them is transcoded
to the UCS. Otherwise AUTO BELI and AUTO BELD map to the same private UBA formatting
character sequences as BELI and BELD. These conditions are illustrated in the following
table:

With this condition | this Bytext character | maps to this in Unicode
In a given paragraph, if UBA formatting characters are present, and both BELI and BELD are not present | AUTO BELI | nothing, not even private characters
In a given paragraph, if UBA formatting characters are present, and both BELI and BELD are not present | AUTO BELD | nothing, not even private characters
In a given paragraph, if UBA formatting characters are not present | AUTO BELI | the same sequences that BELI would map to, described below
In a given paragraph, if UBA formatting characters are not present | AUTO BELD | the same sequences that BELD would map to, described below
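
The decision in the table above reduces to a per paragraph test, sketched below with placeholder constants (the real values are the characters named in the text; the helper name is illustrative):

    BELI, BELD = "<BELI>", "<BELD>"
    AUTO_BELI, AUTO_BELD = "<AUTO BELI>", "<AUTO BELD>"
    UBA_FORMATTING = {"RLE", "LRE", "RLM", "LRM", "RLO", "LRO", "PDF"}

    def auto_bidi_is_dropped(paragraph):
        """True when AUTO BELI / AUTO BELD map to nothing on transcoding to
        Unicode: UBA formatting characters are present in the paragraph and
        neither BELI nor BELD is present. Otherwise the AUTO variants map to
        the same sequences as BELI / BELD."""
        has_uba = any(c in UBA_FORMATTING for c in paragraph)
        has_bba = any(c in (BELI, BELD) for c in paragraph)
        return has_uba and not has_bba

    print(auto_bidi_is_dropped(["a", "RLE", AUTO_BELI, "b", AUTO_BELD, "PDF"]))  # True
    print(auto_bidi_is_dropped(["a", AUTO_BELI, "b", AUTO_BELD]))                # False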

BELI and BELD map to special sequences of UBA formatting characters when transcoded to
Unicode. The following grid demonstrates how Bytext characters with the numbered BBA
embedding levels shown are transcoded into Unicode sequences. Each row represents a
paragraph. A run of characters represented by a given BBA embedding level can be empty.
In the following grid, color is only used to indicate repetitive patterns:

0*
RLM 0**
LRM LRO 0*** PDF LRM
RLM RLO 1 PDF RLM
RLM RLO 1 LRM LRO 2 PDF LRM 1 PDF RLM
RLM RLO 1 LRM LRO 2 RLM RLO 3 PDF RLM 2 LRM LRO 1 PDF RLM

* is for any contiguous span of characters with the embedding level indicated except those
with the Unicode “strong R” property and except those characters without a “strong R” or
“strong L” property that are at the start of a paragraph (as defined in the BBA) and followed
by a BELI, an AUTO BELI, or a character with the “strong R” property.
** is for any contiguous span of characters with the embedding level indicated and without a
“strong R” or “strong L” property, that are at the start of a paragraph (as defined in the
BBA) and followed by a BELI, an AUTO BELI, or a character with the “strong R” property.
Authoring software spelling characters with these properties must be careful because BBA
embedding levels following these characters will be one less than corresponding UBA
embedding levels, so they should not exceed a BBA embedding level of 61 in order to
display correctly in the UBA.
*** is for any contiguous span of characters with the embedding level indicated and with the
Unicode “strong R” property. Authoring software spelling characters with these properties
must be careful because BBA embedding levels following these characters will be two less
than corresponding UBA embedding levels, so they should not exceed a BBA embedding
level of 60 in order to display correctly in the UBA.

Non substitutable CC’s


Non substitutable CC’s, NSCC’s, are ignorable compatibility characters (CC’s) without any kind
of mapping to a compatibility substitute (CS). These include characters that are deprecated
in Unicode and the ignorable non formatting control characters like null and start of heading.
They are thus not new characters but are in the new type 1 ignorable category. NSCC’s are
all allocated in B126-255-130-.
The object replacement character from Unicode is the supertype for all type 1 ignorables that
are compatibility characters, such as all control characters and all of the 66 “noncharacters”
found in Unicode. Formatting characters are not considered control characters since Unicode
text containing formatting characters can map to a non compatibility Bytext equivalent.
Form feed is not considered a control character because its normative Unicode properties
define its usage as being identical to PEC, so it is a subtype of PEC. Horizontal tab is not
considered a control character; it is considered whitespace. Vertical tab is considered a
control character because it is not commonly used as whitespace. The object replacement
character and all 3 interlinear annotation characters are also considered control characters.

All NSCC’s are listed in the following table:

Unicode | Character literal | Character description and explanation
U+FFFC | | Object replacement character. This is the supertype for all NSCC's.
U+070F | | SYRIAC ABBREVIATION MARK
U+0000..U+001F, except U+0009 and U+000C | | ASCII control characters.
U+0080..U+009F, except U+0085 | | C1 (non ASCII) control characters.
U+FFF9..U+FFFB | | The interlinear annotation characters.
U+FEFF | | ZERO WIDTH NO-BREAK SPACE. This character no longer has glue semantics in Unicode 3.2, so its definition as a byte order mark (BOM) means it is an NSCC. Regardless of the usage of U+FEFF, it will ultimately be a subtype of GLUE as all NSCC's are.
U+1D173..U+1D17A | | Musical format control characters.
U+FDD0..U+FDEF | | "Noncharacter" characters. These are informally called the "Arabic noncharacters" because of their location in the Arabic Presentation Forms-A block of Unicode.
U+FFFE..U+FFFF | | "Noncharacter" characters. These are informally called the "special noncharacters" because of their location in the Specials block of Unicode.
U+1FFFE, U+2FFFE, U+3FFFE, U+4FFFE, U+5FFFE, U+6FFFE, U+7FFFE, U+8FFFE, U+9FFFE, U+AFFFE, U+BFFFE, U+CFFFE, U+DFFFE, U+EFFFE, U+FFFFE, U+10FFFE, U+1FFFF, U+2FFFF, U+3FFFF, U+4FFFF, U+5FFFF, U+6FFFF, U+7FFFF, U+8FFFF, U+9FFFF, U+AFFFF, U+BFFFF, U+CFFFF, U+DFFFF, U+EFFFF, U+FFFFF, U+10FFFF | | "Noncharacter" characters. These are informally called the "trailing noncharacters" for lack of a better name.

Excessive CSCC’s
Excessive CSCC’s are characters left over from Unicode text that contains what is deemed to
be either excessive or inappropriate (“degenerate”) use of combining or formatting
characters. The term degenerate is used in Unicode to refer to cases “that never occur in
practice”, such as a LATIN UPPERCASE A followed by an Indic combining mark. There are an
infinite number of such abstract characters possible in Unicode that would otherwise not
have an equivalent in Bytext. For example, a Latin base character with hundreds of
combining characters applied to it can only be encoded in Bytext using excessive CSCC’s.
While it would be possible to encode an infinite set of abstract characters in Bytext, doing so
would violate the idea that characters are finite units of language.

A character with an excessive number of marks is coded in Bytext as a regular character with
as many marks as are available, followed by excessive CSCC characters as needed. For now,
Bytext directly encodes a maximum of 8 marks per European letter, with similar limitations
for different scripts.

Defective CSCC’s
Defective CSCC’s are characters left over from Unicode text containing defective or invalid
sequences such as text containing only combining characters; combining characters applied
to control characters; or meaningless use of formatting characters.
Another type of defective CSCC consists of conjoining Hangul Jamo characters that do not
form syllables consisting of at least 2 Jamo characters. The same is not true of characters
used to form syllables in other syllable oriented scripts such as Indic scripts because there
are no corresponding precomposed syllables to map them to.

Type 2 ignorables
Type 2 ignorables are a fixed category identified by the pattern B127-. Type 2 ignorables
consist of invisible private characters and invisible “made ignorables”. Invisibility is an
informative property. Any character in this category that is not a made ignorable is a private
character, and this includes the F1 supertype itself. Type 2 ignorables are useful to markup
languages that wish to make the markup invisible and ignorable without parsing.
Made ignorables are subtypes of B127 that are allocated by a convention that interprets its S1
subtypes with byte values 128 to 254 as if they were the F1 characters with byte values
from 0 to 126. The F1 character generated is called the made F1 character. This “pseudo F1”
subtype level is called MS0 (M-S-zero). Subsequent subtypes of MS0 (all continuation byte
values, from 128 to 255) are interpreted as if they were subtypes of the made F1 character.
This is a context free way of applying the type 2 ignorable category and its associated
ignorable and informative invisibility properties to any other non type 2 ignorable character
in Bytext.
An algorithm that takes a sequence of original characters and converts them to made
ignorables is called a make ignorable algorithm. Making a character sequence ignorable is a
reversible function that can be applied to any Bytext character sequence. Any preexisting
type 2 ignorables will remain as such before and after applying a make ignorable algorithm;
and before and after the reverse make ignorable algorithm. A make ignorable algorithm
does not loop and apply to type 2 ignorables to create an infinite series of nested levels; it
only applies once, so as to double the number of Bytext characters.
Made ignorable character names are the names of their non made ignorable originals
prepended with “MADE IGNORABLE CHARACTER “.
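
A minimal sketch of a make ignorable algorithm and its reverse, assuming each Bytext character is handled as a tuple of byte values (first byte 0 to 127, continuation bytes 128 to 255); the function names are illustrative and the representation is not mandated by the standard.

    def make_ignorable(chars):
        """Convert a sequence of Bytext characters into made ignorables.
        A character whose F1 byte is f (0..126) becomes B127 followed by the
        MS0 byte f + 128, with its continuation bytes carried over unchanged.
        Existing type 2 ignorables (F1 byte 127) are left alone, so the
        algorithm never nests."""
        out = []
        for c in chars:
            if c[0] == 127:  # already a type 2 ignorable
                out.append(tuple(c))
            else:
                out.append((127, c[0] + 128) + tuple(c[1:]))
        return out

    def reverse_make_ignorable(chars):
        """Reverse of make_ignorable: made ignorables (MS0 byte 128..254) are
        restored to their originals; other characters pass through unchanged."""
        out = []
        for c in chars:
            if c[0] == 127 and len(c) > 1 and 128 <= c[1] <= 254:
                out.append((c[1] - 128,) + tuple(c[2:]))
            else:
                out.append(tuple(c))
        return out

    # Arbitrary example characters; a markup language could hide its markup
    # this way without requiring searches to parse the markup.
    original = [(0,), (10, 130), (127, 255, 200)]
    hidden = make_ignorable(original)
    assert reverse_make_ignorable(hidden) == original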

Private character 127


PRIVATE CHARACTER 127 is the name of the F1 supertype of type 2 ignorables. PRIVATE
CHARACTER 127 is the supertype of the made ignorables and of private characters that have
the informative property of invisibility. As the name indicates, it is a private character, and
the informative property of invisibility of its category applies to itself and all subtypes.
PRIVATE CHARACTER 127 is the only character without singular placeholders. The only way to
extend the subtype of PRIVATE CHARACTER 127 is thru the normalization subtype, and all
such subtypes are private.
Private character 127 cannot have a mirrored variant or an East Asian full width variant.

OCR binary symbols


OCR binary symbols (OBS’s) are characters that represent a sequence of bits. The glyphs are
meant to be machine readable, hence the term OCR which stands for optical character
recognition. Some fonts may incorporate color for such OCR glyphs when space used must
be extra compact and color is available.

Representing binary data in text is useful for things like displaying encryption keys, digital
signatures, and other long codes with the convenience of being able to store, edit, and print
them along with text. Since the binary forms are meant to have distinct OCR glyphs, people
can use Bytext enabled software to easily and efficiently store digital data on paper. The
data can be easily recovered by scanning.
All OBS’s are (currently) base 8192, meaning that they represent all possible sequences of 13
bits. Different base OBS’s may be added in the future. There are 8192 base 8192 OBS’s, and
all are F2’s. Half of all F2 characters are base 8192 OBS’s. With this allocation, 13 out of
every 16 bits can be used to represent a binary sequence without overloading any
characters. This provides more inherent compression than base64 or similar text based
binary encoding schemes which DO overload characters.
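
A rough sketch of the core packing step, assuming the input is an ordinary byte string: bytes are concatenated big endian and read off 13 bits at a time, with zero bits padding the final symbol. The values produced are OBS numeric values rather than full Bytext byte sequences, and the truncate character convention described below is not implemented here.

    def bytes_to_obs_values(data):
        """Pack a byte sequence into base 8192 OBS numeric values (13 bits
        each), big endian, zero padding the final symbol. ceil(8n/13) symbols
        are produced for n input bytes, so 13 of every 16 encoded bits carry
        payload when each OBS occupies 2 bytes."""
        bits = len(data) * 8
        n_symbols = -(-bits // 13)  # ceiling division
        value = int.from_bytes(data, "big") << (n_symbols * 13 - bits)
        return [(value >> (13 * (n_symbols - 1 - i))) & 0x1FFF
                for i in range(n_symbols)]

    print(bytes_to_obs_values(b"\xff\x00\xff"))  # 24 bits -> 2 symbols: [8160, 1020]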
The numeric value of OCR’s are defined as their unsigned big endian binary number
interpretation written in base 10 form. The character names for all base 8192 OCR binary
symbols begins: “BASE 8192 OCR BINARY SYMBOL “ followed by the base 10 number form
of it’s numeric value written with decimal digits (not spelled out). OBS characters and
private characters are the only characters that have decimal digits in their name. The base
10 number form of an OBS is also defined as it’s numeric value, which has a numeric type
called “binary”. All OBS characters and only OBS characters have the binary numeric type.
The convention for using base 8192 OBS characters is that all trailing bits that do not fit into
an 8 bit unit must be zero and must be truncated. When an 8 bit unit itself must be
truncated in order to represent a certain number of bytes, the 8 bit unit must consist
entirely of zeros and the truncation must be indicated by a trailing truncate character.
Multiple truncate characters do not truncate more than a single 8 bit unit. The base 8192
OBS set has it’s own truncate character called BASE 8192 OBS TRUNCATE CHARACTER,
which is not to be used with any other future OBS set. This convention minimizes space
requirements, especially for small binary sequences. 5 out of 8 arbitrary length byte
sequences will require this truncate character. If only 16 bit units or multiples thereof are
encoded with OBS’s, then the truncate character will not be required at all.

Arrow parentheses
Arrow parentheses are regular visible characters invented by the author to be used in a simple
and flexible spelling convention that, along with existing characters, can be used to formally
or informally write virtually any mathematical expression in a way that is easy to read in plain text.

Name | Sample glyph | Notes
ARROW PARENTHESIS LEFT UP; ARROW PARENTHESIS RIGHT UP | (glyphs not reproduced) | Used to delimit the superscript of the previous term, ie the power argument of the previous term. Graphically, it would appear to the upper right (or perhaps to the left in a bidi context) of the previous term.
ARROW PARENTHESIS LEFT DOWN; ARROW PARENTHESIS RIGHT DOWN | (glyphs not reproduced) | Used to delimit the subscript of the previous term. Graphically, it would appear to the lower right (or perhaps to the left in a bidi context) of the previous term.
ARROW PARENTHESIS LEFT DOUBLE UP; ARROW PARENTHESIS RIGHT DOUBLE UP | (glyphs not reproduced) | Used to delimit the “top expression” of the previous term. As applied to integral or summation characters, this would refer to the upper limit expression. This construct can be used to label arrows or product and set theory characters; or to apply marks and various embellishments as a single unit to the top of entire constructs. Graphically, it would appear directly above the previous term.
ARROW PARENTHESIS LEFT DOUBLE DOWN; ARROW PARENTHESIS RIGHT DOUBLE DOWN | (glyphs not reproduced) | Used to delimit the “bottom expression” of the previous term. As applied to integral or summation characters, this would refer to the lower limit expression. This construct can be used to label arrows or product and set theory characters; or to apply marks and various embellishments as a single unit to the bottom of entire constructs. Graphically, it would appear directly below the previous term.

The characters can be considered to be a way of “visualizing” a common form of markup. The
glyphs for these characters are considered particularly useful for this purpose. They are
unique and intuitive. In comparison, an arrow character alone is not an intuitive way to
delimit a sequence since the beginning cannot be easily distinguished from the end of the
sequence. Explicitly delimiting a sequence with a parenthetical construct is much more
flexible than relying on character properties to determine the end of a sequence. Utilizing
arrows and parentheses as separate characters working together contributes to clutter and
requires the characters to be even more overloaded than they already are.
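
As a purely illustrative reading of the table above (using character names in place of the glyphs, which are not reproduced here): the letter x followed by ARROW PARENTHESIS LEFT UP, the digit 2, and ARROW PARENTHESIS RIGHT UP would spell x squared; the letter a followed by ARROW PARENTHESIS LEFT DOWN, the letter i, and ARROW PARENTHESIS RIGHT DOWN would spell a with subscript i; and the double up and double down pairs would attach upper and lower limit expressions to an integral or summation sign in the same way.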

CO subtypes
CO subtypes are non private characters defined only by the name of an existing “charset
organization” (CO). All subtypes of these subtypes are private, but considered to be for the
exclusive use of the charset organization indicated by name. This is designed to prevent
collisions between an important class of privately defined characters. CO subtypes are a PDT
defined by the pattern -255-128-255-. All CO subtype supertypes fit the pattern -255-128-
255. CO subtype supertypes consist of an ordered list of CO’s, extended by singular
placeholders as necessary.
For this purpose, charset organizations are defined as organizations that have a charset
registered with IANA. All charset organizations, including Bytext itself, will have a CO subtype
whether they request one or not. The Bytext database will not change solely to add a new
charset organization, so until a CO subtype is officially added for a new charset organization,
one should be assumed to exist, ordered by the date of that organization's IANA charset
registration and effective as of that date.
It is up to each charset organization using these private characters to allocate them in a way
that provides a useful supertype to fall back to if the private character is not recognized.
Organizations without a registered charset are encouraged to publish any of their own usage
of private characters that they wish to avoid collisions with in a centralized source outside of
Bytext.

Misc symbols
Common expressive symbols used in text based messaging systems will be included in Bytext.

Name Notes
KISS IMPRINT
BROKEN HEART there are already several hearts in Unicode
WRAPPED GIFT
FLOWER There are florettes in Unicode but a florette does not
typically include a stem whereas a flower would
TREE SYMBOL
LIGHTBULB ON eureka
FLASHING CAMERA picture
MICROPHONE

Generic symbols
A generic symbol is a symbol obtained by a formula or convention to represent a character
because the character would otherwise have no widely accepted symbol available to
represent it. For example, there are no widely accepted symbols available to represent
CSCC’s.
All completely defined characters that do not have a sample glyph have an informative fallback
glyphs property that maps to one or more symbols that can be used to represent the
character if required by an implementation. In some cases a character may have more than
one symbol that can be used to represent it. In other cases, a generic symbol will need to be
created. The purpose is not so that all characters can be represented by a compact unique
symbol; but rather that all characters that do not have a sample glyph can be represented
by a compact symbol that is unique except among generic symbols. It is trivial to distinguish
the glyph of a generic symbol character from the glyph of the generic symbol as it is used to
represent another character (it can be a different color for example), but this is of course
outside the scope of Bytext.
Fallback glyphs and generic symbols make it easy for an implementation to have a unique
display for each text string, which makes it easy for users to verify that the screen display is
faithful to the encoding. This helps eliminate spoofing, which is a non trivial security
concern. For example, a display component could choose to provide a specific
transformation, such as a color, for each character with a non null fallback glyphs property.
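As a small illustration of this idea (a sketch only, following the abstract-object style of the
sample code later in this standard; the BytextCharacter, renderer, and color objects are
hypothetical), a display component might simply branch on whether the fallback glyphs
property is null:

for(BytextCharacter ch : displayRun)
{
//Characters drawn with a fallback or generic symbol get a distinct color so
//users can tell them apart from characters drawn with their own sample glyph.
if(ch.getFallbackGlyphs() != null) renderer.draw(ch, highlightColor);
else renderer.draw(ch, defaultColor);
}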
In Bytext the recommended convention for generating generic symbols is to use the
corresponding block icon used in the Unicode code charts for the symbol. A rounded box is
not a mark in Unicode or Bytext and this will help contribute to the visual distinction of these
generic symbols. If there is no corresponding block icon in Unicode, a compact English
abbreviation in a rounded box is recommended. An abbreviation may include a number,
such as “CSCC1”.
SYMBOL FOR NULL, B125-137, is the supertype for all generic symbols of control characters --
what Unicode calls “control pictures”. Bytext encodes the ASCII control character generic
symbols (U+2400..U+243F); followed by the other control character generic symbols which
are not in Unicode.

Bidi symbols
Because bidirectional formatting characters are such an integral part of the encoding of
several major natural languages, it is appropriate to develop user friendly symbols that can
be used to represent them in special circumstances, rather than generic symbols. The
following images are proposed glyphs for such symbols:

Name Character code

SYMBOL FOR BIDIRECTIONAL EMBEDDING LEVEL INCREASER B125-135
SYMBOL FOR BIDIRECTIONAL EMBEDDING LEVEL DECREASER B125-136
SYMBOL FOR AUTO BIDIRECTIONAL EMBEDDING LEVEL INCREASER B125-137
SYMBOL FOR AUTO BIDIRECTIONAL EMBEDDING LEVEL DECREASER B125-138

Emoticons
Bytext is committed to providing a rich set of graphical emoticons as single characters.
Emoticons are a very practical unit of written language and are in common use. Unicode has
smiling and frowning faces but not enough to round out the spectrum of basic universal
emotions.
Several principles give emoticons a finite nature. There are a limited number of universally
understood facial expressions. There are a limited number of commonly used ASCII style
emoticons. There are also a limited number of character-like images used in popular text
based messaging systems, and of those even fewer are in common use.
Many users will find it particularly convenient to be able to copy a conversation from a text
based messaging system, then paste it into a plain text editor and have the graphical
emoticons preserved as characters. The reverse operation would also be convenient.

Name ASCII style emoticon approximation Sample glyph image &/or notes

TIRED-BORED FACE $-|
ANGRY FACE >:-(
SAD FACE <:-(
FEAR FACE <:-I
SADISTIC FACE >:-D
OPEN GRIN FACE :-D
O FACE :-o
TONGUE OUT FACE :-P
RIGHT WINK WITH SMILE ;-)
LEFT WINK WITH SMILE (-;
QUEASY FACE :-S
BLANK FACE :-|
HALO FACE
HORNY FACE
DEVIL FACE
YUKKY FACE
DROOLING FACE
SCREAMING FACE
YAWNING FACE (eyes closed, open mouth not pulled back)
QUESTION FACE (people's eyebrow with curled lip)

Potential new characters


Potential characters may be privately defined as Bytext CO pattern subtypes.

___Button symbols:
Button symbols may be useful for describing command sequences to beginners. These would
be glyphs of the keys on a keyboard; or common buttons and widget elements in a graphical
user interface.
Button symbols may include a vast number of existing glyphs surrounded by an outline meant
to mimic the appearance of an actual key on a keyboard, which should be different (3
dimensional appearance perhaps) than the rounded box used for generic symbols. These
may be allocated as mark variants.

Name (before appending “ BUTTON SYMBOL”)


MINIMIZE
PARTIAL WINDOW

FULL WINDOW
CANCEL
NEW FILE
OPEN FILE
SAVE
PRINT
SUPERSCRIPT
SUBSCRIPT
RIGHT JUSTIFY
LEFT JUSTIFY
LEFT MOUSE CLICK
RIGHT MOUSE CLICK
etc

___Common hand signs:

Name Notes
THUMB UP HAND SIGN
THUMB DOWN HAND SIGN
MIDDLE FINGER HAND SIGN
HORNS HAND SIGN
HORNS WITH THUMB HAND
SIGN
HANG TEN HAND SIGN
CLENCHED FIST HAND SIGN
LETTER L HAND SIGN
OK HAND SIGN
GUN HAND SIGN
HAND SIGN ZERO 0
HAND SIGN INDEX ONE 1
HAND SIGN INDEX TWO 2
HAND SIGN INDEX THREE 3
HAND SIGN INDEX FOUR 4
HAND SIGN THUMB ONE 1
HAND SIGN THUMB TWO 2
HAND SIGN THUMB THREE 3
HAND SIGN THUMB FOUR 4
HAND SIGN PINKY ONE 1
HAND SIGN PINKY TWO 2
HAND SIGN PINKY THREE 3
OPEN HAND SIGN 5

___Emoticons maybe too difficult to iconize:

Name
clears throat
coughs
kick
hug
laugh
ignore
amazed
think
confused

jump
scold
apology
listen

___Stick figures:
Various positions may be both universal and useful...

Name Notes
DISPLAY OF STRENGTH Stick figure flexing both biceps for example
DISPLAY OF SUBMISSION Stick figure kneeling/bowing for example
DISPLAY OF EXUBERANCE Stick figure jumping up kicking feet together to one side for
example
BUFFALO STANCE

___Other symbols:

Name Notes
TWO PI New symbol for math constant equal to 2π (2 times pi),
maybe should just be an existing character
COPYLEFT SIGN Symbolizes something expressly not copyrighted. Perhaps a
slash thru a copyright circle may be more understandable
which would better be called “NON COPYRIGHT SYMBOL”.
UPSIDE DOWN CROSS

Renamed characters
Bytext character names reflect modern common usage and are chosen to be as clear and as
easy to remember as possible. Some Unicode characters need to be renamed to achieve this
goal.
Bytext names use Basic Latin script with descriptive terms in English (such as ARABIC LETTER
ALEF WITH MADDA ABOVE). Only a single description, the most common one, is chosen as the
name. Other terms describing an overloaded usage of the character can be included in the
list of alternative names. HYPHEN-MINUS is the only exception to this principle; it is kept as
is.
In the following table, characters found in ASCII are listed first:

Unicode name or part of name    Bytext name or part of name    Reason
(empty) (control Unicode does not provide a name (field 1 in the Unicode
character database) for control characters. The Bytext name
names) property provides the ASCII name for ASCII control
characters, and the Unicode alias names for other
control characters. This better matches user
expectations and is better database modelling practice
since the name property can then be used as a primary
key.
SMALL LETTER LOWERCASE Easier to understand and remember. The Bytext name
is by far the more common term for this concept. The
term “SMALL LETTER” is vague considering the many
font variants in Unicode. This changes many character
names.
CAPITAL UPPERCASE Provides a logical complement to the term
“LOWERCASE”, and is just as commonly used as the
term “CAPITAL”. This changes many character names.
LOW LINE UNDERSCORE Easier to understand and remember. The Bytext name
is by far the more common term for this character.
VERTICAL LINE BAR Easier to understand and remember. The Bytext name
is by far the more common term for this character.
FULL STOP PERIOD Easier to understand and remember. The Bytext name
is by far more commonly recognized. This changes
other character names that incorporate the term such
as “DIGIT ONE FULL STOP” (U+2488).
SOLIDUS FORWARDSLASH Easier to understand and remember. The Bytext name
is by far the more common term for this character.
REVERSE BACKSLASH Easier to understand and remember. The Bytext name
SOLIDUS is by far the more common term for this character.
HORIZONTAL TAB Easier to understand and remember. The Bytext name
TAB is by far the more common term for this character.
GRAVE ACCENT BACKTICK Easier to understand and remember. The name
provides a mnemonic with “BACKSLASH”. It is in
common usage, used in “Programming Perl” for
example.
LINE LINE ENDING The term provides additional information about how it is
SEPARATOR CHARACTER used and it makes for a catchy acronym (LEC). A LEC is
positively associated with preceding characters, not
what comes after it. The term “LINE SEPARATOR”
leaves this important fact ambiguous.
PARAGRAPH PARAGRAPH The term provides additional information about how it is
SEPARATOR ENDING used and it makes for a catchy acronym (PEC) which is
CHARACTER appropriate for such a ubiquitous character. A PEC is
positively associated with preceding characters, not
what comes after it. The term “PARAGRAPH
SEPARATOR” leaves this important fact ambiguous.
GREEK PI GREEK Easier to understand and remember. As explained in
SYMBOL LOWERCASE the informative notes of Unicode this character looks
OMEGA PI nothing like a pi symbol.
GREEK PHI GREEK Easier to understand and remember. The Unicode term
SYMBOL LOWERCASE PHI is likely to be confused with the Greek letter phi.
(U+03D5) WITH FULL
VERTICAL LINE
HORIZONTAL FULL DASH Easier to understand and remember. A HORIZONTAL
BAR BAR is expected to combine with other HORIZONTAL
BARs to form a continuous line. If this is traditionally
not the case, a different Unicode character will be
chosen to be FULL DASH.
WORD JOINER GLUE Easier to understand and remember. Provides a
(U+2060) mnemonic with UNGLUE.
ZERO WIDTH UNGLUE Easier to understand and remember. “ZERO WIDTH”
SPACE could be interpreted as an absolute width that prevents
expansion during justification. Also, zero width space
sounds like something that might be used to prevent
joining behavior (instead of non joiner).
INVERTED UPSIDE DOWN Easier to understand and remember.
EXCLAMATION EXCLAMATION
MARK MARK
INVERTED UPSIDE DOWN Easier to understand and remember.
QUESTION QUESTION
MARK MARK
KANNADA KANNADA According to chapter 9.8 in The Unicode Standard 3.0,
LETTER FA LETTER LLLA the Unicode name is a mistake.
(U+0CDE)

LATIN CAPITAL LATIN The Unicode Indic FAQ says the name “should be” as
LETTER OI UPPERCASE changed.
(U+01A2) LETTER GHA
LATIN SMALL LATIN The Unicode Indic FAQ says the name “should be” as
LETTER OI LOWERCASE changed.
(U+01A3) LETTER GHA
TAMIL LETTER TAMIL LETTER http://www.tamil.net/people/sivaraj/unicode.html
LLLA (U+0BB4) ZHA
MALAYALAM MALAYALAM http://www.tamil.net/people/sivaraj/unicode.html
LETTER LLLA LETTER ZHA
(U+0D34)
GURMUKHI GURMUKHI http://members.tripod.com/~jhellingman/IndianScripts
LETTER EE LETTER E Unicode.html
GURMUKHI GURMUKHI http://members.tripod.com/~jhellingman/IndianScripts
LETTER OO LETTER O Unicode.html
TIBETAN MARK TIBETAN MARK According to The Unicode Standard 3.0, the Unicode
INTERSYLLABIC TSHEG INTERSYLLABIC TSEK name is a misnomer.
(U+0F0B)
TIBETAN MARK TIBETAN MARK According to The Unicode Standard 3.0, the Unicode
DELIMITER TSHEG BSTAR DELIMITER TSEK TAR name is a misnomer.
(U+0F0C)

Potential renames
ACUTE ACCENT renamed to FORWARDTICK. It seems easier to understand and remember,
and provides a mnemonic with FORWARDSLASH and BACKTICK.
TIBETAN renamed to ? to better suit people from Nepal.

LINE BREAKING ALGORITHM


“Make like an envelope and get stuffed!”
--Unknown

Bytext will attach it’s name to a single line breaking algorithm for informative purposes. The
idea is to keep line breaking very simple so that customizations that require dictionary
lookup or context may be implemented separately. This context dependent line breaking can
be documented in the text using markup that implies or specifies a specific style of line
breaking; or by directly inserting line breaking characters. With this strategy, the usage of
markup is distinct from the usage of line breaking characters. The markup would make the
line breaking characters unnecessary, not act as a substitute for them. For example, markup
may indicate that a sequence of characters is a “last name” and this may indicate that there
are no line breaking opportunities in the sequence.
Markup commands should not be used as a substitute for PEC or any other line breaking
character. Since Bytext is a distinct protocol layer, there is nothing preventing higher
protocols from overriding normative Bytext functionality, but users should know that a well
designed protocol stack will not have unnecessary overlaps in functionality. As is usually the
case with poor design, it ends up making things more difficult and complex for the user.
Markup is traditionally written as plain text and it is considered more important for the
source code to look neat than for the text to look neat. There is no inherent need for such a
compromise; it is simply a hack to cater to the inadequacies of state of the art source code
editors. Mandatory line breaking is traditionally in the realm of character encoding and for
good reason. Markup should only be used when it is necessary to create a tree-like data
structure, or to delimit context free operators that are not already associated with basic text
processes and even the most basic character encoding systems. Just like Unicode has a

tendency to try to specify things in the realm of markup, markup like HTML and XML has a
tendency to try to specify things in the realm of text. Bytext is designed to have a clear
distinction between the two.
PEC, LEC, LINE FEED, NEL, and FORM FEED have the mandatory break property. All but PEC
are compatibility characters.
A mandatory break after CARRIAGE RETURN (U+000D), unless it is followed by LINE FEED
(U+000A), is a well known necessity in order to cater to legacy systems, but by principle
Bytext cannot declare a context dependent property value to be normative. Since
“mandatory break” is a normative property, declaring that the property value for “carriage
return” is dependent on context would cause the whole property to have what is known as
“mixed domains”, which is poor database modeling practice. Instead, Bytext provides
informative guidelines for all context dependent values.
Unlike Unicode, Bytext does not recognize any “page break” property, even as an informative
property because page formatting is definitely in the domain of markup. Since FORM FEED
looks and acts like a PEC in screen display it can produce unanticipated results that are very
frustrating if it causes unnecessary pages to print, wasting paper.
How to provide for spelling differentiation caused by word wrap? It is a problem for the
implementation to deal with. Bytext is helpful because the ignorability mechanism can be
used to ignore the hyphen. A search for the word will simply have to account for the
different spellings as if they were spelled differently without the hyphen.
Line breaking of course only matters for visible characters. All invisible characters should be
considered to be associated with the first visible preceding character in logical order; this will
best match user expectations.
“Emergency line breaking” is in effect when a break must be made but a breaking opportunity
(BO) was not found. There should be an ordered list of informative line breaking
opportunities in order from more ideal to less ideal. This concept is called BO priority
number and is found in the 2 informative BO priority number properties: Break Opportunity
Before Priority number (BOBP); and Break Opportunity After Priority number (BOAP). These
can be seen in the table in the next section. The lower the priority number the more ideal it
is to break. Implementations should try to find breaks at one level before moving on to the
next level. Note that the terms “after” and “before” are relative to logical order, so they are
independent of bidirectionality.
Medial ligating variants have null BO priority number property values, so there is never a
break opportunity between them unless they are fractured. Likewise, initial ligating variants
have a null BOAP, and final ligating variants have a null BOBP, so there is never a break
opportunity within a properly ordered ligature or spanning mark.
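
As a sketch of how the two properties combine (the helper name is hypothetical; the
combination rule is the one given in step 3 of the basic algorithm later in this section, with
the glue rule left to the caller), the BO priority number for an offset can be derived from the
BOAP of the character before the offset and the BOBP of the character after it:

//Returns the BO priority number for the offset between two characters, or
//null if neither character provides a break opportunity there. A null BOAP or
//BOBP means that character provides no opportunity on that side. Lower
//numbers are more ideal places to break.
static Integer offsetBOPriority(Integer boapOfPrecedingChar, Integer bobpOfFollowingChar)
{
if(boapOfPrecedingChar == null) return bobpOfFollowingChar;
if(bobpOfFollowingChar == null) return boapOfPrecedingChar;
return Math.min(boapOfPrecedingChar, bobpOfFollowingChar);
}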

Break opportunity priority summary

Character(s) or character class(es) BOBP BOAP
(BOBP = Break Opportunity Before Priority number; BOAP = Break Opportunity After Priority
number. The pair of numbers given with the first entry of each group applies to the whole
group.)
HYPHEN-MINUS (U+002D). 7 1
All characters from U+3000..U+33FF not covered elsewhere. 2 2
___From Unicode line breaking property “Contingent Break Opportunity”
(CB):
OBJECT REPLACEMENT CHARACTER (U+FFFC).
___From Unicode line breaking property “Ideographic”, (ID):
IDEOGRAPHIC SPACE (U+3000).
CJK Unified Ideographs, (U+4E00..U+9FAF).
CJK Unified Ideographs Extension A, (U+3400..U+4DBF).
CJK Compatibility Ideographs, (U+F900..U+FAFF).

Hiragana, (U+3040..U+309F), except small variants.
Katakana, (U+30A0..U+30FF), except small variants.
Yi Syllables, (U+A000..U+A48F).
Yi Radicals, (U+A490..U+A4CF). CJK Radicals Supplement, KangXi
Radicals, and Ideographic Description Symbols, (U+2E80..U+2FFF).
FULL DASH. 2 7
EM DASH (U+2014).
HORIZONTAL BAR (U+2015).
___From Unicode line breaking property “Break Opportunity Before”, (BB):
ACUTE ACCENT (U+00B4).
MODIFIER LETTER VERTICAL LINE (U+02C8).
MODIFIER LETTER LOW VERTICAL LINE (U+02CC).
SYRIAC END OF PARAGRAPH (U+0700). 7 2
FORWARDSLASH (U+002F).
HYPHEN (U+2010).
ARMENIAN HYPHEN (U+058A).
TIBETAN MARK INTERSYLLABIC TSHEG (U+0F0B).
ETHIOPIC WORDSPACE (U+1361).
OGHAM SPACE MARK (U+1680).
KHMER SIGN BARIYOOSAN (U+17D5).
HYPHENATION POINT (U+2027).
BAR (U+007C).
All of the fullwidth Latin letters. 3 3
SMALL PLUS SIGN..SMALL EQUALS SIGN, (U+FE62..U+FE66).
Wide digits, (U+FF10..U+FF19).
All characters with Unicode line breaking property “Opening Punctuation”, 4 7
(OP), equivalent to characters with Unicode general category Ps.
All characters with Unicode line breaking property “Prefix (Numeric)”, (PR).
All currency symbols (Unicode general category Sc), except those with
Unicode line breaking property “Postfix (Numeric)”, (PO).
PLUS (U+002B).
REVERSE SOLIDUS (U+005C).
PLUS-MINUS SIGN(U+00B1).
MINUS SIGN (U+2212).
NUMERO SIGN (U+2116).
MINUS-OR-PLUS SIGN (U+2213).
AUTO LINE ENDING CHARACTER, this is basically ignored for line breaking. 7 4
___From Unicode line breaking property “Closing Punctuation”, (CL),
COMMA and PERIOD are not listed here because they are listed in
“Infix Separator” property:
U+3001-3002, IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP.
SMALL COMMA (U+FE50).
SMALL FULL STOP (U+FE52).
FULLWIDTH COMMA (U+FF0C).
FULLWIDTH FULL STOP (U+FF0E).
HALFWIDTH IDEOGRAPHIC FULL STOP (U+FF61).
HALFWIDTH IDEOGRAPHIC COMMA (U+FF64).
___From Unicode line breaking property “Exclamation/Interrogation”,
(EX):
EXCLAMATION MARK (U+0021).
QUESTION MARK (U+003F).
SMALL QUESTION MARK..SMALL EXCLAMATION MARK,
(U+FE56..U+FE57).
FULLWIDTH EXCLAMATION MARK (U+FF01).
FULLWIDTH QUESTION MARK (U+FF1F).
___From Unicode line breaking property “Postfix (Numeric)”, (PO):
PERCENT SIGN (U+0025).
CENT SIGN (U+00A2).

DEGREE SIGN (U+00B0).
PER MILLE SIGN (U+2030).
PER TEN THOUSAND SIGN (U+2031).
PRIME..REVERSED TRIPLE PRIME, (U+2032..U+2035).
PESETA SIGN (U+20A7).
DEGREE CELSIUS (U+2103).
DEGREE FAHRENHEIT (U+2109).
OHM SIGN (U+2126).
SMALL PERCENT SIGN (U+FE6A).
FULLWIDTH PERCENT SIGN (U+FF05).
FULLWIDTH CENT SIGN (U+FFE0).
___From Unicode line breaking property “Non starter”, (NS):
Characters with an East Asian width type prevariant (the equivalent of
Unicode East Asian width property W).
Halfwidth characters are only defined in terms of an implementation that is
using East Asian typography. Halfwidth characters are defined as
characters being used in East Asian typography that have an East
Asian width variant. For example, ASCII characters (prevariants) are
considered halfwidth in East Asian typography. East Asian typography
implementations will apparently treat halfwidth characters as non
starters, so this is an informative guideline.
___The following characters:
All characters with Unicode General Category Lm (Letter, Modifier) and all
characters with General Category Sk (Symbol, Modifier).
All small Hiragana and Katakana.
THAI CHARACTER ANGKHANKHU..THAI CHARACTER KHOMUT,
(U+0E5A..U+0E5B).
KHMER SIGN KHAN (U+17D4).
KHMER SIGN CAMNUC PII KUUH..KHMER SIGN KOOMUUT,
(U+17D6..U+17DA).
DOUBLE EXCLAMATION MARK (U+203C).
FRACTION SLASH (U+2044).
WAVE DASH (U+301C).
KATAKANA MIDDLE DOT (U+30FB).
IDEOGRAPHIC ITERATION MARK (U+3005).
KATAKANA-HIRAGANA VOICED SOUND MARK..HIRAGANA VOICED
ITERATION MARK, (U+309B..U+309E).
KATAKANA ITERATION MARK (U+30FD).
SMALL SEMICOLON..SMALL COLON, (U+FE54..U+FE55).
FULLWIDTH COLON..FULLWIDTH SEMICOLON, (U+FF1A..U+FF1B).
HALFWIDTH KATAKANA MIDDLE DOT (U+FF65).
HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK (U+FF70).
All visible characters with unspecified line breaking properties, such as 4 4
incompletely specified private characters.
All characters with Unicode general category Sc, Sk, or So (not math
symbols), except as indicated elsewhere.
___All characters with Unicode line breaking property “Alphabetic”, (AL): 5 5
Characters with Unicode general category Lu, Ll, Lt or Lo, (everything
starting with “L” except Lm).
___All characters with Unicode line breaking property “Complex Context
(South East Asian)”, (SA):
Thai and Lao characters (U+0E00..U+0EFF), Myanmar characters
(U+1000..U+109F), and Khmer characters (U+1780..U+17FF) with
Unicode general category Lo or Lm.
Ligating variants with BCA or MCA property (not null). 5 null
Ligating variants with BCB or MCB property (not null). null 5
___From Unicode line breaking property “Numeric”, (NU): 6 7
All characters with Unicode General Category Nd --except if fullwidth.

___From Unicode line breaking property “Inseparable”, (IN):
ONE DOT LEADER (U+2024), this is also part of the closing punctuation
category.
TWO DOT LEADER (U+2025), this is also part of the closing punctuation
category.
HORIZONTAL ELLIPSIS (U+2026), this is also part of the closing
punctuation category.
___All characters with Unicode General Category property Sm.
___From Unicode line breaking property “Infix Separator (Numeric)”, (IS):
COMMA (U+002C).
FULL STOP (U+002E).
COLON (U+003A).
SEMICOLON (U+003B).
ARMENIAN FULL STOP (U+0589).
___From Unicode line breaking property “Ambiguous Quotation”, (QU):
QUOTATION MARK (U+0022).
APOSTROPHE (U+0027).
GLUE (U+2060). 8 8
___Compatibility glue:
NO-BREAK SPACE (U+00A0).
NARROW NO-BREAK SPACE (U+202F).
FIGURE SPACE (U+2007).
NON-BREAKING HYPHEN (U+2011).
TIBETAN MARK DELIMITER TSHEG BSTAR (U+0F0C).

The basic algorithm


1. Scan for characters that have the mandatory break property and break after each one.
Break after CARRIAGE RETURN (U+000D) unless it is followed by LINE FEED (U+000A).
2. As soon as a line’s width exceeds its margins, backtrack in logical order one offset at a
time to the first pair of characters that provide a mandatory break opportunity at that
offset.
3. If a line breaking opportunity is still needed, proceed to find one using characters that
have the lowest BO priority numbers. The BO priority number for an offset is the lowest
one provided by each character, unless either of the characters has the glue property in
which case the BO priority number for that offset is that of GLUE.

Sample Java code


The following is a Java code segment to take a string and obtain a tree structure where every
character is referenced by its paragraph index within the string, its line index within a
paragraph, and its character index within a line. AUTO LEC’s are ignored. Abstract objects
and methods are used, working code is not yet available. The code is provided for
descriptive purposes only.

ModifiedArrayList pLC = new ModifiedArrayList();


//This class works similar to an ArrayList except for the “put” and “cut”
//methods. “put” is like an ArrayList “add” except if the index is out of
//range the range is extended instead of throwing an exception. Also, instead
//of shifting elements that it inserts between, it deletes everything after
//the insertion point. There is a corresponding “putAll” method. The “cut”
//method cuts everything after and including an index then returns the result
//as a ModifiedArrayList. Cut means the elements are deleted from their
//original list. The class definition is not included in this algorithm.
//pLC stands for paragraph, line, character.
int paragraphIndex = 0;
int lineIndex = 0;
int characterIndex = 0;

BytextString s = new BytextString(stringArgument);
//BytextString methods access the Bytext API. The class definition is not
//included in this algorithm.
for(int cCount = 0; cCount < s.length(); cCount++) //character loop
{
if(s.getMBProperty(cCount)) //note that LEC behaves exactly like PEC
{
pLC.put(paragraphIndex, new ModifiedArrayList());
paragraphIndex = paragraphIndex + 1;
lineIndex = 0;
characterIndex = 0;
}
if(lineMetrics.lengthTooLongTest(paragraphIndex, lineIndex, characterIndex))
{
for(int pLCount = 1; pLCount <= 8; pLCount++) //priority level loop
{
int backstepOBL = 1; //OBL stands for Outside Backstep Loop
for(int bCount = 0; bCount < characterIndex; bCount++) //backstep loop
{
if(s.getLineBreakPriorityNumber(cCount - bCount) == pLCount)
//getLineBreakPriorityNumber refers to an offset, which is the point
//just before the character that would be indexed by the same number.
//The argument is thus interpreted as an offset, not an index.
{
backstepOBL = bCount;
break; //break backstep loop;
}
}
int[] testPosition = {paragraphIndex, lineIndex,
(characterIndex - backstepOBL)};
if(!(lineMetrics.lengthTooShortTest(testPosition))
&& !(lineMetrics.lengthTooLongTest(testPosition)))
{
ModifiedArrayList excessLineFragment = pLC.cut(backstepOBL);
pLC.get(paragraphIndex).put(lineIndex, new ModifiedArrayList());
lineIndex = lineIndex + 1;
pLC.get(paragraphIndex).putAll(0, excessLineFragment);
characterIndex = backstepOBL + 1;
break; //break priority level loop
}
}
}
pLC.get(paragraphIndex).get(lineIndex).add(s.getChar(cCount));
characterIndex = characterIndex + 1;
}

COLLATION ALGORITHM
“Each word more useless than the next.”
--Adam Sandler

The number one stated goal listed in UTR #10 for the Unicode collation algorithm is “a
complete, unambiguous, specified ordering for all characters in Unicode”. This sounds like a
useful general purpose tool and it implies that for a given version of Unicode, you have an
unambiguous ordering. But this is not the case since compliant implementations can supply
their own “collation element table” and can even skip the normalization step. Bytext will
attach its name to a single collation scheme, so when the term ‘Bytext collation’ is used, it
has a single meaning and it is in fact a complete, unambiguous, specified ordering for all
possible characters in Bytext. It is provided as an informative general purpose tool.

There are numerous potential ways to sort character sequences. The Bytext collation
algorithm uses a few basic techniques that take advantage of the unique Bytext allocation
principles. These techniques can be used as a framework for other collation styles.
The Bytext collation algorithm is intended to be serviceable for all locales. Serviceable means
that most users should find it good enough for “manual” indexing. Typically, a user will be
searching in a list for words or phrases in terms of a common order that they have
memorized. Practically all users will have numeric digits and their alphabet memorized.
Comparatively few users will have an order for diacritic marks memorized, but most will
regard characters with diacritics as being distinct graphemes, so they would expect them to
tend to group within a list accordingly. Most users will regard case variants as not being
distinct graphemes, so they would expect them to be dispersed as if case did not matter.
Almost no users will memorize an order for whitespace, punctuation, and symbols, so most
users will expect that they be dispersed as if they did not matter. In particular, when
searching for a word or phrase such as “De Luge” that they know has a space, most users
will still scan for “D” then “e” then “L” --and not “D” then “e” then “space” (which might be
before the letter a or before the number 1 or after the alphabet or who knows where). In a
short list, minor differences in collation styles will not matter too much because there are
fewer items to scan. In a long list, minor differences in collation styles can be learned by
observing patterns within the list because a long list is likely to demonstrate each pattern.
This is especially true if the patterns are straightforward and easy to predict, as in
(regarding Swedish) “Ä is not found after z, so it must be with the A’s”. This is what is
meant by being good enough for manual indexing.
Several consequences of Bytext collation make it especially general purpose and useful: for
the most common numbering systems, if the numbers in a list are prefixed with zeros so
they are all the same length, the numbers will sort in numerical order. Notably, this is true
for hexadecimal numbers and even if decimal numbering systems from different scripts are
mixed. Such numbers can be prepended to other lists to affect their collation order. Also,
many numeric date formats will sort in chronological order if the units of time each have
consistent lengths (prefixed with zeros as necessary) and go from large scale to small scale
units (year to month to day to hour, etc). These conventions are convenient ways to
organize file systems.
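
A small illustration of these conventions (not part of the standard; it uses plain lexicographic
sorting of ASCII strings as a stand-in for collation, which agrees with it for decimal digits):

//Zero prefixed numbers of equal length sort in numerical order, and
//large-to-small date formats sort in chronological order.
String[] invoices = {"042", "007", "115"};
String[] dates = {"2004-12-09", "2003-02-01", "2003-01-31"};
java.util.Arrays.sort(invoices); //007, 042, 115
java.util.Arrays.sort(dates); //2003-01-31, 2003-02-01, 2004-12-09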

Setup
The algorithm generates a single sort key for each collation sequence. The collation sequence
is the sequence that is to be collated. The sort key of a collation sequence is used to
determine the order of collation sequences by comparison with other sort keys. The sort key
is a byte sequence that is interpreted as a single base 256 number in big endian order. All
sort keys are thus zero terminated to be the same length as others they are to be compared
with.
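
A minimal sketch of comparing two sort keys under this definition (the keys are represented
here as int arrays of unsigned byte values, a simplifying assumption; the shorter key is
logically padded with trailing zero bytes, which is the zero termination described above):

static int compareSortKeys(int[] a, int[] b)
{
int length = Math.max(a.length, b.length);
for(int i = 0; i < length; i++)
{
int x = (i < a.length) ? a[i] : 0; //zero termination
int y = (i < b.length) ? b[i] : 0;
if(x != y) return (x < y) ? -1 : 1;
}
return 0; //equal sort keys
}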
Each character has an informative collation element array property, which is an array of
integers from 0 to 255 that are algorithmically related to the character byte values. Each
integer (byte) in the array is called a collation element, or CE.
Each character also has an informative property called the collation element priority number
array. This is an array that provides a priority number for each collation element. A collation
element is said to “have” a collation element priority number by means of them sharing a
corresponding position in each array. The priority number is an integer used as explained in
the algorithm. Priority numbers are always in numerical order in the array, that is, 1’s come
before 2’s which come before 3’s, etc. The priority numbers are determined algorithmically
as described below. If a character is at a SS location, the criteria are in terms of the
composition of its native location; likewise, if a character is at a native location, the criteria
are in terms of the SS location if there is one.

Priority number 1
Characters that may have a collation element with this priority number: letters, syllables,
ideographs, and characters with numerical value.
Composition levels used to generate these collation elements: for letters, C1 and each higher
composition level until the letter prevariant is reached and included; for ideographs and
syllable characters, C1 and each higher composition level until the mark prevariant is reached
and included; for decimal digits, C1; for HANGUL CHOSEONG FILLER and subtypes, every
composition level.

Priority number 2
Characters: letters, ideographs and syllables that are mark variants.
Composition levels used: for letters, the lowest composition level that specifies a mark
variant, and each higher composition level that is used to specify a mark variant (such as a
stacking diacritic) for a letter also qualifies; for ideographs and syllables, each higher
composition level that is used to specify a mark variant.

Priority number 3
Characters: letters.
Composition levels used: the composition level that specifies non mark subtypes of the letter
prevariant, such as uppercase variants.

Priority number 4
Characters: non ignorables.
Composition levels used: every composition level that has not been used so far, in order from
lowest to highest.

Priority number 5
Characters: ignorables.
Composition levels used: every composition level, from lowest to highest.

Collation elements are not equivalent to the byte values of the composition levels given in the
listing above. They are put into a special encoded form that allows certain optimizations and
a simplification of the overall algorithm. This encoded form reserves the number zero for the
algorithm to use. In order to reserve the number zero, byte values from 0 to 125 are shifted
so they are recorded as 1 to 126. Other byte values are not changed. Having the collation
element value of C1 of B125- being the same as the collation element value of C1 of B126-
does not affect the relative sorting ability of the algorithm because ignorables are only
sorted after every other non ignorable has already been sorted (so to speak). The
information about which characters are ignorable is provided to the algorithm by the priority
numbers.
Collation elements are further modified for Kana scripts. There is an extra collation element for
Kana scripts that enables small variants to sort before their non small prevariants.
When a CE is requested for a given priority number and a character has no more (meaning it
had at least one) CE’s with that priority level, the algorithm reports that the CE is zero for
that round. Every other CE requested in subsequent rounds for that character and that
priority number will also be zero.
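
A minimal sketch of the encoded form described above (the method name is hypothetical):

//Byte values from 0 to 125 are recorded as 1 to 126 so that the value zero is
//reserved for the algorithm; other byte values are not changed.
static int encodeCollationElement(int compositionLevelByteValue)
{
if(compositionLevelByteValue <= 125) return compositionLevelByteValue + 1;
return compositionLevelByteValue;
}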

The basic algorithm


The sort key is generated by appending collation elements to the sort key in a series of
rounds. Append is the process of adding a digit to the least significant end of a number. For
example, appending 9 to 11 will result in the value 119.
Each round appends the collation elements with a certain priority number for each character in
the collation sequence. The first round appends the first collation elements with priority
number 1 for each character in the collation sequence. The second round appends the
second (that is, the “ordinalOfCEWithCurrentPriorityNumberUsed” value in the source code
below = 2) collation elements with priority number 1, if any, and so on until there are no
more non zero collation elements with priority number 1. Then the priority number used is
increased and the rounds continued until all non zero collation elements for all priority
numbers have been appended.

Sample Java code


The following is a Java code segment to obtain a sort key for a collation sequence. Note that
to collate a list of collation sequences it will usually not be necessary to obtain the entire
sort key for each collation sequence in a list. It is a straightforward matter to modify the
algorithm to stop when enough of a collation value is obtained to collate a particular list.
Abstract objects and methods are used, working code is not yet available. The code is
provided for descriptive purposes only.

cs.sortKey.setToZero();

//cs is an object for each collation sequence
for(int priorityNumberUsed = 1; priorityNumberUsed <= 5; priorityNumberUsed++)
{
int ordinalOfCEWithCurrentPriorityNumberUsed = 1;
tempSortKeyForCurrentPriorityNumber.setToZero();
boolean priorityNumberUsedFinished = false;
while(!priorityNumberUsedFinished)
{
tempSortKeyForCurrentRound.setToZero();
for(int c = 1; c <= cs.numberOfCharacters(); c++)
tempSortKeyForCurrentRound.append(cs.getCE(
priorityNumberUsed,
ordinalOfCEWithCurrentPriorityNumberUsed,
c));
if(tempSortKeyForCurrentRound.equalsZero())
priorityNumberUsedFinished = true;
else
{
tempSortKeyForCurrentPriorityNumber.append(
tempSortKeyForCurrentRound.getValue());
ordinalOfCEWithCurrentPriorityNumberUsed =
ordinalOfCEWithCurrentPriorityNumberUsed + 1;
}
}
cs.sortKey.append(tempSortKeyForCurrentPriorityNumber.getValue());
}

BIDIRECTIONALITY ALGORITHM
“Excuse my backward friend for being so forward.”
--Larry, Three Stooges

The Bytext bidirectionality algorithm (BBA) is used to format text containing simplified
formatting characters that act as stand-ins for markup codes. The formatting characters the
BBA uses, BELI and BELD, are simplified relative to the Unicode bidirectional algorithm
(UBA) formatting characters so it is easier to convert them to markup codes; and so they
can be directly used by straightforward input methods that eliminate the hard to predict
results of the UBA as described in questions 12 and 13 of the Bytext FAQ. Only 2 codes are
required to format bidirectional text using the BBA, compared with 7 codes using the UBA.
This makes the BBA easier to learn and it frees up keyboard real estate (or its analog in
whatever input method is used).
The BBA allows language specific conventions for formatting bidirectional text to be
implemented by an input method. This is a more flexible, long term solution than the UBA
approach of “hard coding” spelling conventions (such as how to format a list of numbers)
into the display method. These spelling conventions also violate the principle that Unicode
encodes characters, not languages.
Using fewer formatting characters also enables each to be more readily understandable as a
graphic symbol. In Bytext, an application that does not support the BBA or the UBA can
display the graphic symbols of the formatting characters along with the text they were
intended to format --in lieu of actual formatting. This allows the text to be readable even in
the absence of the intended formatting. Trying to apply the same idea to Unicode, not only
would recognizing up to 7 different kinds of UBA formatting symbols be difficult, and not
only would the intended formatting of these UBA codes be hard to understand, but such a
display would not be Unicode compliant. Unicode conformance requirements forbid the
display of any right-to-left character (Hebrew and Arabic characters for example) unless the
application formats them in accordance with the UBA.
The 7 UBA formatting characters:

RLE Right to Left Embedding

LRE Left to Right Embedding
RLO Right to Left Override
LRO Left to Right Override
PDF Pop Directional Formatting
RLM Right to Left Mark
LRM Left to Right Mark

In the BBA, higher nesting levels will not force boundary types to change (see step X10 of the
Unicode bidirectional algorithm in UAX #9). This does not translate well to conventional
markup syntax at all and it forces users to know obscure bidirectional properties of every
character they use. All bidi (bidirectional) formatting in the BBA uses only explicit codes,
which means that character properties do not alter the formatting. This eliminates multiple
spellings for the same formatting. As with UBA codes, the scope of BBA codes ends at the
end of a paragraph.
The following grid demonstrates how BBA formatting characters are used to achieve specific
numbered embedding levels. Each row represents a paragraph. A run of characters
represented by a given BBA embedding level can be blank. In the following grid, color is only
used to indicate repetitive patterns:

0
BELI 1 BELD
BELI 1 BELI 2 BELD 1 BELD
BELI 1 BELI 2 BELI 3 BELD 2 BELD 1 BELD

The BBA always treats AUTO BELI and AUTO BELD identically to BELI and BELD, respectively.
The differences between the “auto” bidi formatting characters and their prevariants only
come into effect during Bytext to Unicode conversion.
The mirroring property of Unicode is not used in Bytext except during transcoding to and from
Unicode. Mirroring is positively counterintuitive to the way most people would expect a
keyboard or typewriter to work. If text is being written right to left then it is more natural to
enter a right parenthesis first to enclose something. Mirroring is used to promote the
misguided idea that left=open and right=closed when it comes to characters commonly used
for enclosing and quoting. As explained in question 13 of the Bytext FAQ, this idea doesn’t
always work if a quoted phrase spans multiple lines (automatic line breaking is the default
display preference for the UBA). Characters that have the mirroring property of Unicode will
be referred to as mirrorable characters. Instead of simply using the character with the
appropriate name and sample glyph to achieve the required display, the mirroring property
requires one to know which characters are mirrorable and what the mirrored form of each of
these characters looks like. For compatibility with Unicode, the mirrored form of each
mirrorable character can be directly encoded in Bytext using mirrored variants.
There are no bidirectional controls in Unicode or Bytext that can be used to express a
boustrophedon style directionality (alternating direction right to left then left to right for
each line) while allowing automatic line breaking. The best that can be done in plain text,
currently, is to encode each line explicitly using LEC. There is an additional problem that this
style typically requires more extensive mirroring. Bytext is better suited to deal with these
kinds of complex mirroring issues because mirrored characters are explicitly encoded.
The BBA is drastically simplified compared to the Unicode bidirectional algorithm primarily
because the responsibility for making the input method for right-to-left languages match
locale expectations is put upon the input method (the editing software) and not the display
method (the display software, the software using the BBA).

The basic algorithm

Definitions:

D1. A paragraph formatting character is defined as any of the following characters: PEC, LINE
FEED, NEL, FORM FEED, or CARRIAGE RETURN.
D2. A bidirectional formatting character is defined as any of the following characters: BELI,
AUTO BELI, BELD, or AUTO BELD.
D3. A glyph is defined as any other single character not in D1 or D2, with the following
provisions: (a) One or more contiguous CSCC’s are grouped with the preceding character,
unless the preceding character is in D1 or D2, and together the group constitutes a single
glyph. If there is no such preceding character, each CSCC constitutes a single glyph. (b)
A contiguous matching sequence of ligating variants together constitute a single glyph.
Fractured ligating variants constitute a single glyph.
D4. A paragraph is defined (in this algorithm only) as any sequence delimited by a character
in D1.
D5. A bidirectional iteration unit, BIU, is defined as a paragraph formatting character, a
bidirectional formatting character, or a glyph, as defined in D1, D2, and D3.

Encoded embedding:

1. For each BIU, repeat steps 2 to 6.


2. Set the embedding level of each PEC, LINE FEED, NEL, FORM FEED, and CARRIAGE
RETURN to zero.
3. If the first character in a paragraph is BELI or AUTO BELI, set the paragraph embedding
level to one. Note: any paragraph formatting such as indenting, justification, and bulleting
should reflect the paragraph embedding level except as indicated by higher protocols.
4. Increase the current embedding level by one if BELI or AUTO BELI is encountered. Excess
BELI or AUTO BELI characters that would create an embedding level greater than 62 are
ignored.
5. Decrease the current embedding level by one if BELD or AUTO BELD is encountered. Excess
BELD or AUTO BELD characters that would create an embedding level less than the
paragraph embedding level are ignored.
6. Set the embedding level of the current BIU to the current embedding level.

Line breaking:

7. Calculate line breaking, do not break lines before whitespace characters.

Display embedding:

8. Set the embedding level of the following characters to the paragraph embedding level:
a. TAB, VERTICAL TAB, INFORMATION SEPARATOR ONE, PEC, LINE FEED, NEL, FORM
FEED, and CARRIAGE RETURN.
b. Any sequence of whitespace characters at the end of a line or preceding a
character listed in part a. Note: A higher protocol that requires whitespace
characters to be displayed by visible characters such as a middle dot should
override this step.

Reordering BIU’s for a single left to right advancing direction on each line:

9. Start with the highest embedding level on each line. Reverse any contiguous sequence of
BIU’s that are at that level or higher (called the BEL span).

10. Until the embedding level of one is reached, reduce the embedding level used in step 9 by
one and repeat step 9.
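
The following is a minimal sketch of the encoded embedding steps (steps 2 to 6) for a single
paragraph. The BIU objects and their isBELI(), isBELD(), isParagraphFormatting() and
setEmbeddingLevel() methods are hypothetical, and the BELI and BELD tests are assumed to
also match the AUTO variants. It is provided for descriptive purposes only.

int paragraphEmbeddingLevel = 0;
if(!bius.isEmpty() && bius.get(0).isBELI()) paragraphEmbeddingLevel = 1; //step 3
int currentLevel = paragraphEmbeddingLevel;
for(BIU biu : bius)
{
if(biu.isParagraphFormatting()) currentLevel = 0; //step 2
else if(biu.isBELI() && currentLevel < 62) currentLevel = currentLevel + 1; //step 4
else if(biu.isBELD() && currentLevel > paragraphEmbeddingLevel)
currentLevel = currentLevel - 1; //step 5
biu.setEmbeddingLevel(currentLevel); //steps 2 and 6
}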

COMPRESSION FRAMEWORK
“Lookout honey cuz I’m using technology”
--R.H.C.P.

Compression, other than inherent compression, is an advanced topic with many options that
are appropriate under different circumstances. It is therefore best handled at the markup
level and not at the charset level. Compression for Bytext will be covered as a recommended
framework for markup, not as a single recommended algorithm. A compression scheme that
takes advantage of the features of textual content will be referred to as a Text-specialized
Compression Scheme (TCS). TCS’s are contrasted with general purpose data compression
schemes such as Huffman or LZW. As mentioned in UTS #6, using both types of
compression is generally more effective than either type alone for compressing long strings
of text. For small sequences such as fields in a database, a TCS alone is often more
effective.
A sample TCS is appropriate to cover in The Bytext Standard because of its dependence on
the structure of Bytext and to show how it can work with the features of Bytext. Unicode
allocation provides no inherent benefit in terms of compression over Bytext allocation. In
fact, depending on how Bytext is compressed, open composition searches and other features
of Bytext can still be partially or fully usable directly in the compressed form. This is a much
more flexible model than SCSU, and is likely to prove more useful.
One of the reasons for developing Unicode was the difficulty with specifying different charsets
within a single document. If a specialized text compression scheme is encoded as a separate
charset, as the Standard Compression Scheme for Unicode (SCSU) is, it means that many
authors will be faced with the same difficulty all over again. For example, since SCSU is not
the default charset of XML, most XML documents that wish to use SCSU will need to specify
it as an alternative charset in the XML markup.
Since most documents will require compression to be specified in markup anyway, it makes
sense to use markup not only to specify both a TCS and a general purpose compression
scheme, but also as an integral part of the TCS itself.
The efficiency of any particular specialized text compression scheme depends heavily on the
type of content being compressed and on the implementation. Direct comparison of the
Bytext compression framework with SCSU will be somewhat abstract not only because of the
dependence on content, but also because the Bytext compression framework is not intended
to be limited to only a TCS. After all, once the markup language is in place, it is a trivial task
to implement a general purpose compression algorithm on top of it. Since Bytext
compression must be specified by markup, the effectiveness may depend heavily on the
efficiency and capabilities of the markup language. XML is quite poor in this regard as can be
expected since terseness is explicitly excluded as a design goal of XML. The markup
language proposed by the creator of Bytext will have a maximally efficient and flexible set of
TCS’s built in.

Rearranging
Bytext allocation can be systematically rearranged to a form suitable for different types of
TCS’s. A rearranged form of Bytext for a given range of text is itself a TCS. One can simply
specify a custom version of Bytext to suit the particular needs of a given range of text.
The basic elements that can be used to specify any type of rearrangement are described in the
following table. The descriptions are in terms of how to interpret the rearranged form. How
to obtain the rearranged form can be derived from the descriptions.

Element name Description

range
The range of text the other parameters apply to. If specified as “out of band” information, it
is most efficiently indexed in terms of bytes rather than characters. In this case, the
recommended practice is to cut out of the range any character split by an improper byte
index.

switch SS cross reference characters of pattern(s) x
This indicates that any SS cross reference characters in the range of pattern(s) x are to be
switched with the SS character that they map to before the range is to be interpreted as
Bytext. This parameter enables SS scripts to be handled as a clean range of characters,
which makes it easier to specify the parameter described next.

switch pattern(s) x with pattern(s) y
This indicates that pattern(s) x are to be switched with pattern(s) y and vice versa before
interpretation as Bytext. Patterns can be described in terms of a range of patterns, with
wildcards and other regular expressions, so long as there is a one to one correspondence.
For example, x = “B116-129-129-129- to B116-129-129-192-” and y = “B010- to B036-”
and “B077- to B114-”. In this example, if Bengali SS characters are switched, most Bengali
prevariants will be encoded as F1 characters, while still allowing punctuation, digits and many
other F1 characters to be left as is.

Selective differencing
Selective differencing is a novel technique for applying differencing compression only to
certain characters. The idea is to account for a major characteristic of international textual
content which is that there are certain characters that tend to be shared among different
languages, while the other characters that are used tend to be clustered within an allocation
unit. The former set of characters will be referred to as “S” for shared. The latter set will be
referred to as “N” for non shared. N characters are appropriate for differencing, while S
characters are not --hence selective differencing.
The characters that constitute S or N are defined only in terms of F1 supertypes, optionally
after a rearrangement. All subtypes are in the same S or N set as their F1 supertype. The
sets are mutually exclusive and together the sets encompass all possible character codes. N
must be a contiguous sequence of F1 characters, and a rearrangement can be used to make
it so. S generally includes whitespace and European digits and punctuation. The Basic Latin
alphabet, could also belong in the S set. For demonstration, S will consist of F1 digits,
lowercase Basic Latin alphabet, uppercase Basic Latin alphabet, 30 ASCII punctuation
characters (all except circumflex accent and backtick), and the ignorables. Bytext is
rearranged so the remaining 32 F1 characters in N are allocated at B094..B125.
The character code value of the first F1 character in N is called minN, which is 094 in this
case. The number of F1 characters in N is called numN. The sample code is designed to work
only when numN is an even number, so in the description and samples it will be assumed to
be even.
Selective differencing will be described by stepping thru the compression of one character at a
time in a sample sequence of characters.
P is defined as bitwise equivalent to N except it is interpreted as strictly a byte pattern instead
of a character code value. Since N is B094-..B125-, P is 094-..125-. The first P in a sequence
is interpreted as an N. The next P and every other P after that is interpreted as a “delta”.
Delta is the character code value input minus the character code value of the mid. The mid is
a character code with a composition identical to that of the previous N character, except the
last byte is replaced by a fixed reference value determined by numN and minN. The last byte
of the mid can be obtained with the following Java code:

int numN; //=32 in demo


int minN; //=94 in demo
int lastUByteOfPNC = (int)(pNC.getLastUByte()); //A uByte is cast to an integer
if(lastUByteOfPNC < 128) return (UByte)((numN / 2) + minN); //An integer is
//cast to a UByte. pNC is an object for the previous N character.
if(lastUByteOfPNC == 128) return (UByte)(numN / 2);
int x = 0;
while(lastUByteOfPNC >= 129 + (x * numN))
{
x = x + 1;
}
return (UByte)(129 + (x * numN / 2));

Sample compression, stepping thru one character code at a time:

Step 1: character code input B001; output bytes 001. The character is identified as S by its
range, so differencing does not apply.
Step 2: input B123-200-129; mid 123-200-145; output bytes 123-200-145. The character is
identified as N by its range. This is the first N character, so the character code input is used
as the basis for the mid.
Step 3: input B002-200; mid 123-200-145; output bytes 002-200. The character is
immediately identified as S by its range, so differencing does not apply.
Step 4: input B123-200-130; mid 123-200-145; delta -15; output byte 111. The character is
identified as N by its range. The mid from step 2 is subtracted from the character code input.
The character code input becomes the basis for the mid for the next character.
Step 5: input B123-200-129; mid 123-200-145; delta -16; output byte 110. The character is
identified as N by its range. The mid from step 4 is subtracted from the character code input.
The character code input becomes the basis for the mid for the next character.
Step 6: input B124-200-145; mid 123-200-145; delta 0; output byte 094. The character is
identified as N by its range. The mid from step 5 is subtracted from the character code input.
The character code input becomes the basis for the mid for the next character.
Step 7: input B003; mid 123-200-145; output bytes 003. The character is identified as S by
its range, so differencing does not apply.

Selective differencing compression can be thought of as going thru 2 stages. First a signed
delta value is converted to a signed “mixed base” number. A mixed base number is an
intermediate form of selective differencing output. It is called mixed base because the first
(most significant) positional coefficient can be in one or two different bases, while
subsequent positional coefficients are simultaneously in base 128 form. The first positional
coefficient can be in one or two bases depending on it’s size because preceding zeros are not
needed to determine when the number starts and stops in a sequence since that is handled
by F8TV. So a preceding zero can be used as a digit to increase the size of the base for
numbers with a second positional coefficient.
The second stage is when the mixed base number is subsequently converted to output bytes
by converting the signed byte form of the first byte in the mixed base number to an
unsigned form by adding numN. Then minN is added to this unsigned form. Then the
preliminary output bytes are packaged up into a single character code with F8TV by adding
128 to all C2+ bytes.
The rest of this section introduces the terms used in the sample source code and the
mathematical relationships involved.

sign = -1 if delta is negative, or 1 if delta is zero or positive


negBoolInt = 1 if delta is negative, or 0 if delta is zero or positive
posBoolInt = 1 if delta is positive or zero, or 0 if delta is negative
M = numN / 2
k = M + negBoolInt

For integers n, where n > 0, and for coefficients c, where 1 <= cn <= M and
0 <= c0..c(n-1) <= 127:

delta = (sign)(term1 + trailingTerms)

term1 = MSTerm = upperMSTerm + lowerMSTerm

upperMSTerm = (cn - 1)(128^n)

lowerMSTerm = the sum, for j = 1 to n, of (k)(128^(n-j))

trailingTerms = the sum of term2 through term(n+1) = the sum, for x = 0 to n-1, of
(cx)(128^x)

Sample values:

Delta         n    sign  cn     c(n-1)  c(n-2)
0..15         n/a  1     0..15
-1..-16       n/a  -1    1..16
-17..-144     1    -1    1      0..127
-145..-272    1    -1    2      0..127
16..143       1    1     1      0..127
144..271      1    1     2      0..127
272..399      1    1     3      0..127
1936..2063    1    1     16     0..127
2064..2191    2    1     1      0       0..127
18448..18575  2    1     2      0       0..127

Byte B for numB = 1 is obtained as:

B1 = (cn)(sign) + (negBoolInt)(numN) + minN

Bytes B for numB > 1 are obtained as:

B1 = (cn - posBoolInt)(sign) + (negBoolInt)(numN) + minN


B2..(n+1) = c(n-1)..c0 + 128 (each trailing coefficient, most significant first, plus 128)
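
For example, using the demo values numN = 32 (so M = 16) and minN = 94: for delta = 144
the sample values table gives n = 1, cn = 2, and c0 = 0, so B1 = (2 - 1)(1) + (0)(32) + 94 = 95
and B2 = 0 + 128 = 128. For delta = -15 (numB = 1), B1 = (15)(-1) + (1)(32) + 94 = 111,
matching step 4 of the sample compression above.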

Sample Java code


The following code segment converts a signed delta value to output bytes. This is the key part
of the selective differencing compression algorithm. Code for decompression is not included.
Abstract objects and methods are used; working code is not yet available. The code is
provided for descriptive purposes only.

int delta; //input: the signed delta value to convert
int numN; // = 32 in demo, numN must be an even number
int minN; // = 94 in demo
int M = numN / 2;
int absDelta = Math.abs(delta);
int signOfDelta;
if(delta < 0) signOfDelta = -1; else signOfDelta = 1;
//signOfDelta = 1 if delta is positive or zero
//signOfDelta = -1 if delta is negative
int negBoolInt;
int posBoolInt;
if(delta < 0) negBoolInt = 1; else negBoolInt = 0;
//negBoolInt = 0 if delta is positive or zero
//negBoolInt = 1 if delta is negative
posBoolInt = negBoolInt + signOfDelta;
//posBoolInt = 1 if delta is positive or zero
//posBoolInt = 0 if delta is negative
int k = M + negBoolInt;
ModifiedArrayList uByteList = new ModifiedArrayList();
//This is the ModifiedArrayList described (not defined) in the line
//breaking algorithm sample Java code.
UByte uB = new UByte(0); //UByte is an unsigned byte integer class

//If output is 1 byte (numB = 1), obtain the byte. numB is the number of output
//bytes.
if(absDelta < k)
{
uB = new UByte(delta + minN + (negBoolInt * numN));
uByteList.put(uB);
return uByteList;
}

//Obtain n, the power of the most significant term

int n = 0; //numB = n + 1
int maxValueOfNumBTerms = 0;
int maxValueOfMSTerm = 0; //MS is short for most significant
int maxValueOfTrailingTerms = 0;
int maxValueOfUpperMSTerm = 0;
int maxValueOfLowerMSTerm = 0;
while(absDelta > maxValueOfNumBTerms)
{
n = n + 1;
maxValueOfUpperMSTerm = (M - 1) * powi(128, n);
maxValueOfLowerMSTerm = 0;
for(int j = 1; j <= n; j++)
{
int sT = k * powi(128, n - j); //powi() is just like Math.pow() except it
//inputs and outputs integers.
maxValueOfLowerMSTerm = maxValueOfLowerMSTerm + sT;
}
maxValueOfMSTerm = maxValueOfUpperMSTerm + maxValueOfLowerMSTerm;
maxValueOfTrailingTerms = powi(128, n) - 1;
maxValueOfNumBTerms = maxValueOfMSTerm + maxValueOfTrailingTerms;
}

//Obtain cSubN, the coefficient of the most significant term

int cSubN = 0;
int maxValueOfTermsWithCSubN = 0;
int valueOfMSTermWithCSubN = 0;
while(absDelta > maxValueOfTermsWithCSubN)
{
cSubN = cSubN + 1;
valueOfMSTermWithCSubN = (cSubN - 1) * powi(128, n) + maxValueOfLowerMSTerm;
maxValueOfTermsWithCSubN = valueOfMSTermWithCSubN + maxValueOfTrailingTerms;
}
uB = new UByte((cSubN - posBoolInt) * signOfDelta + negBoolInt * numN + minN);
//a new UByte is created for each output byte so entries already in the list
//are not overwritten
uByteList.put(uB);

//Obtain trailing coefficients

int valueOfMSTerms = valueOfMSTermWithCSubN;
for(int x = n - 1; x >= 0; x--)
{
int cSubX = 0;
int valueOfTermWithCSubX = 0;
int maxValueOfLSTermsWithCSubX = powi(128, x) - 1; //LS is short for least
//significant.
while((absDelta - valueOfMSTerms) > maxValueOfLSTermsWithCSubX)
{
cSubX = cSubX + 1;
valueOfTermWithCSubX = cSubX * powi(128, x);
maxValueOfTrailingTerms = powi(128, x) - 1;
maxValueOfLSTermsWithCSubX = valueOfTermWithCSubX + maxValueOfTrailingTerms;
}
//The value consumed by this coefficient is added to valueOfMSTerms only after
//cSubX is known, so the remainder used for the next (less significant)
//coefficient is correct.
valueOfMSTerms = valueOfMSTerms + valueOfTermWithCSubX;
uB = new UByte(cSubX + 128);
uByteList.put(uB);
}
return uByteList;
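
For example, with numN = 32 and minN = 94, delta = 2100 falls thru to the multi byte path
(n = 2, cSubN = 1) and the returned list contains the bytes 94, 128, and 164, consistent with
the formulas in the previous section.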

CONVERSION INFO
“You need a 3 sided dice to play this game.”
--Andrew K.

Bytext is a superset of Unicode normalization form C. Bytext characters or sequences
normatively map to Unicode normalization form C (NFC) characters or sequences. As a
consequence, Unicode conformant text in a given version of Unicode in normalization form C
or KC is round trip convertible thru Bytext. This means that Unicode text in normalization
form C or KC can be converted to Bytext and back to its original form without loss and
without requiring external information. Bytext is dedicated toward maintaining this principle
throughout all versions of Bytext.
If information about the normalization form of Unicode text is available from higher protocols,
this can be used to make any Unicode normalization form round trip convertible thru Bytext.
The following embedding conventions are not designed to be a substitute for transcoding.
Transcoding is separate from embedding.

ASCII Embedded Bytext


ASCII Embedded Bytext, AEB, is a format for embedding Bytext within ASCII. It will also work
for encoding Bytext within Unicode but it achieves less compression than UEB, as explained
in the next section. 64 (6 bits worth of) visible ASCII characters are used instead of the 128 (7
bits) available so that no control characters have to be used. This enables AEB to be read by
humans and by optical character recognition (OCR) technology. AEB uses F6TV to convert
the 6 bits worth of ASCII characters into a variable length encoding that can accommodate
Bytext.
The 64 ASCII characters chosen are the base64 characters put in Bytext order: digits 0 to 9;
lowercase Basic Latin letters; uppercase Basic Latin letters; plus sign; then forwardslash.
So as with base64, AEB characters can be delimited by quotation marks.
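
The following sketch (illustrative only, not part of the standard) builds the AEB alphabet in
the order just described; the method name and the use of a Java char array are arbitrary
choices, and the F6TV packaging of values into a variable length encoding is not shown.

static char[] buildAebAlphabet()
{
char[] alphabet = new char[64];
int i = 0;
for(char c = '0'; c <= '9'; c++) alphabet[i++] = c; //values 0..9: digits
for(char c = 'a'; c <= 'z'; c++) alphabet[i++] = c; //values 10..35: lowercase Basic Latin
for(char c = 'A'; c <= 'Z'; c++) alphabet[i++] = c; //values 36..61: uppercase Basic Latin
alphabet[i++] = '+'; //value 62
alphabet[i++] = '/'; //value 63
return alphabet;
}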

Unicode Embedded Bytext


Unicode Embedded Bytext, UEB, is a format for embedding Bytext within Unicode. It uses the
same concept as AEB, except 15 bits worth of Unicode characters are used to represent a
string of variable length values using F15TV. 15 bits worth are chosen instead of 16 bits
worth so that control characters, combining characters, formatting characters and other
characters that might generate errors in Unicode processing are not used.

The 32,768 Unicode characters are chosen to minimize the need to use the continuation bit.
No compatibility characters are chosen. The UEB characters are in Bytext order, not Unicode
order. There is no attempt to only encode visible characters because it is assumed that OCR
technology will be unable to distinguish so many characters, and also for the convenience of
not having to use the continuation bit at all for the vast majority of applications. Some
applications may choose to limit their repertoire to the UEB characters. This will free them of
most of the problematic features of Unicode and would make conversion to Bytext trivial.
The format does not specify how to mark up the embedding so it is recognized as such; this
would be handled by higher protocols. Quotation marks cannot be used because the ASCII
quotation mark character is part of the UEB repertoire.

CF Normalization
CF Normalization is an informative transformation of data that eliminates Compatibility
characters and Fractured ligatures, hence the term CF. Applying CF normalization must be
controlled by higher protocols, and it should only be applied to data when the content is
known to only be text --and not binary data. In particular, it should be an optional part of
converting Unicode to Bytext when it is known that round trip convertibility with Unicode is
not necessary.
The purpose of CF normalization is analogous to Unicode normalization forms KD and KC: to
eliminate redundant forms like formatting but to retain what is considered textual
information; and to ensure that equivalent textual information has an equivalent binary
form.

BYTEXT DATABASE
“That's just scratching the surface of the iceberg!”
--Unknown

The Bytext database is a simplified relational data model of the Bytext API (application
programming interface). Since an infinite number of characters are partially defined by
category and since the vast majority of the property values of Bytext characters can be
generated algorithmically, an actual implementation must extend the database into an
implementation of the Bytext API.
The Bytext database is called a simplified relational data model because some properties in
the Character Properties Table would normally be considered to be multivalued attributes in
a relational database but instead, in this database, the property value domain is an array of
values. This allows each character to have a single value (which could be null or an array)
for each property, which provides a dramatic overall simplification of the database. Each of
the potential multivalued attributes is an informative property: alternative name(s);
informative note(s); cross reference(s); and Unicode general category (categories).
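
As a rough illustration of this model (the class and field names below are hypothetical, not
part of the standard), one row of the Character Properties Table could be represented with
the potential multivalued attributes as array fields:

class CharacterPropertiesRow
{
String characterCode; //primary key: a Bytext decimal notation unit, e.g. "B123-200-145"
String name; //normative Bytext name
String[] alternativeNames; //informative; potential multivalued attribute
String[] informativeNotes; //informative; potential multivalued attribute
String[] crossReferences; //informative; Bytext decimal notation units
String[] unicodeGeneralCategories; //informative; 2 letter Unicode general category codes
//.. remaining normative and informative properties (one field per column)
}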

Ordered lists
Ordered lists are ordered lists of concepts that are used to derive the Character Properties
Table. Some ordered lists are provided here or elsewhere in this document; other long lists,
such as instances of Han radical subtypes ordered by usage frequency, will be provided in
other documents.
Most ordered lists consist of things like consonant order and vowel order of various scripts
which can be assumed from Unicode data. The focus of this document is on any exceptional
information.
Eventually these lists will be completed in a format that can be used by the Bytext API to
derive the Character Properties Table.
The following type of grid is called a statement grid. Each row in the grid is a statement
defining a word or describing a list. The statements are in a grid to help itemize repetitive

features. The grid is not a table with defined columns. “=” can be read as “is (are) defined
as”.

font variants = superscript variants


subscript variants
bold variants
italic variants
negative color variants
monospace variants
sans-serif variants
script variants
fraktur variants
basic European diacritics = Dot above 130
Diaeresis 131
Ring 132
Acute 133
Macron 134
backtick 135
Cedilla 136
circumflex 137
tilde 138
Breve 139
F1 order = 0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
a 10
b 11
c 12
d 13
e 14
f 15
g 16
h 17
i 18
j 19
k 20
l 21
m 22
n 23
o 24
p 25
q 26
r 27
s 28
t 29
u 30
v 31
w 32

x 33
y 34
z 35
$ 36
% 37
& 38
‘ 39
( 40
) 41
* 42
+ 43
, 44
hyphen-minus 45
period 46
/ 47
Space 48
PEC 49
! 50
quotation mark 51
# 52
[ 53
\ 54
] 55
_ 56
{ 57
: 58
; 59
< 60
= 61
> 62
? 63
@ 64
| 65
} 66
~ 67
← 68
→ 69
↑ 70
↓ 71
Private 72
Private 73
Private 74
Private 75
Private 76
Private 77
Private 78
Private 79
Private 80
Private 81
Private 82
Private 83
Private 84
Private 85
Private 86
Private 87
Private 88

Private 89
Private 90
Private 91
Private 92
Private 93
Private 94
Private 95
Private 96
Private 97
Private 98
Private 99
Private 100
Private 101
Private 102
Private 103
Private 104
Private 105
Private 106
Private 107
Greek script CATS 108
Cyrillic script CATS 109
Hebrew script CATS 110
Arabic script CATS 111
Brahmic script CATS 112
Additional Brahmic script CATS 113
Hangul script CATS 114
Kana script CATS 115
Han script CATS 116
Additional Han script CATS 117
Additional scripts CATS 118
Combined punctuation CATS 119
Combined standalone mark CATS 120
∞ 121
± 122
▲ 123
♪ 124
☺ 125
GLUE 126
PRIVATE CHARACTER 127 127
script families = European script family
Hebrew script family
Arabic script family
Brahmic script family
Hangul script family
Kana script family
Han script family
Additional script family
European script family = Latin script
Greek script
Cyrillic script
Hebrew script family = Hebrew script
Arabic script family = Arabic script

Brahmic script family = Devanagari script
Bengali script
Gurmukhi script
Gujarati script
Oriya script
Tamil script
Telugu script
Kannada script
Malayalam script
Myanmar script
Tibetan script
Sinhala script
Thai script
Lao script
Khmer script
Hangul script family = Hangul script
Kana script family = Hiragana script
Katakana script
Han script family = all East Asian or “CJK” ideographs
Additional script family = Additional Latin script
Armenian script
Syriac script
Thaana script
Georgian script
Ethiopic script
Cherokee script
Unified Canadian Aboriginal Syllabic script
Ogham script
Runic script
Sinhala script
Tibetan script
Mongolian script
Yi script
Old Italic script
Gothic script
Deseret script
Tagalog script
Hanunóo script
Buhid script
Tagbanwa script
scripts order = Latin script
Greek script
Cyrillic script
Hebrew script
Arabic script
Devanagari script
Bengali script
Gurmukhi script
Gujarati script
Oriya script
Tamil script
Telugu script
Kannada script

Malayalam script
Myanmar script
Tibetan script
Sinhala script
Thai script
Lao script
Khmer script
Hangul script
Hiragana script
Katakana script
Han script
Additional Latin script
Armenian script
Syriac script
Thaana script
Georgian script
Ethiopic script
Cherokee script
Unified Canadian Aboriginal Syllabic script
Ogham script
Runic script
Sinhala script
Tibetan script
Mongolian script
Yi script
Old Italic script
Gothic script
Deseret script
Tagalog script
Hanunóo script
Buhid script
Tagbanwa script

SS Tables
SS tables display SS relationships by listing the prevariants in the leftmost column and each
SS subtype in the same row. Rows may be grouped into categories such as letters, vowel
signs (abbreviated “vs”), and also into a group that lists remaining non search similar (NSS)
prevariants.
SS Tables are in a rudimentary stage of development and are open to discussion. The
following SS tables are available:

European_SS_Table in HTML format.

Brahmic_SS_Table in HTML format.

Character properties
All character properties are modeled by a single table that lists each Bytext character as a row
in the table with character properties being columns. Each character has a value for each
column.
The number of fully defined characters in Bytext is not infinite, but the Character Properties
Table is expected to report properties for all characters, so the Character Properties Table is
really a conceptual model for the Bytext API.
The Basic_Character_Properties_Table is a large document that demonstrates the Character
Properties Table model and is useful to view information about common characters while the
Bytext API is being developed. It will be available soon.

___Table conventions:

• The character code property value is divided into columns for each composition level. This
is to simplify editing.
• The "sample glyph" property domain is a vector based image file (a font instance), but the
values given here consist of font codes which are only complete if the font codes are
recognized and an instance is provided. Since many Bytext characters are new, there are
no known font codes or fonts available for them yet.
• Null values are left blank, but since the table is incomplete, a blank value does not
necessarily indicate null. In most cases the correct value will be obvious from other
information given in The Bytext Standard.
• All property values that are arrays have bar separated elements, as in the example below.
This is a sort of hack for display convenience and for easy conversion to an actual array.
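
For example (hypothetical values, for illustration only), an alternative name(s) cell for a
character might read: LEFTWARDS ARROW|LEFT POINTING ARROW.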

Property info
The property info table contains information about each property. Since property values are
attributes of a character, this table can be thought of as consisting of attributes of
attributes, which some may call metadata about the character properties table.
Any property that is derived from a normative property value such as a character name is
considered to be normative. An implementation shouldn’t have to parse text values like a
character’s name for information.
Facts about a character that can be algorithmically derived from character codes (as is the
case for PDT’s or fixed categories), or facts that can be derived from combinations of the
character code and another property value (without parsing text), are not included in the
Bytext database. This includes information about whether a character is an ignorable, a
CSCC, or an implicit bidirectionality variant.

___Table conventions:
• N? stands for Normative and the column contains a boolean value (Y=yes, N=no). No
means that the property is informative.
• The property value domain is a description of the set of values that are acceptable for that
property. Such descriptions do not prescribe any particular standardized datatype (except
for maybe boolean); this must be determined by each implementation of the database. In
particular, array values do not need to be ordered unless explicitly described as ordered.
• NFC stands for Normalization Form C.

Property info table


Links to property categories in the table:

General
Numeric
Compatibility
Mark
Line breaking
Bidirectional
Ligating
Collation
Search similar

Property name (primary key) | Property category | N? | Property value domain | Notes
Character code | General | Y | Single Bytext decimal notation unit | This is the primary key in the character properties table.
Sample glyph | General | N | Vector based image file (font instance); or null | Null for non private characters indicates that it is normally always invisible. If an image is given here it indicates the character is normally visible. No particular glyph or font is prescribed for any character.
Unicode NFC equivalent | General | Y | Ordered array of Hex numbers, each from 0000 to 1FFFFF; or null | This maps each Bytext character to its normalization form C character(s); or is null if there is either no equivalent or no context free equivalent.
Name | General | Y | Text | The Bytext name. Bytext names are English language names. Foreign language names may be included in the alternative names property.
Informative note(s) | General | N | Array of text strings. | A potential multivalued attribute. Commas are not used inside an element of the array; semicolon is used instead.
Cross reference(s) | General | N | Array of Bytext decimal notation units; or null | Maps to zero or more Bytext characters. A potential multivalued attribute. These mappings are to characters that are related somehow but not indicated by other properties.
Alternative name(s) | General | N | Array of text strings. | A potential multivalued attribute. Does not include the Unicode name unless it is acceptable for use in Bytext. Includes all acceptable names. Commas are not used inside an element of the array; semicolon is used instead.
Unicode name | General | Y | Text |
Fallback glyphs | General | N | Ordered array of Bytext decimal notation units; or null | Maps to zero or more Bytext characters that have informative glyphs that can be used to represent the glyph of this character. The references are ordered in terms of preference. Implementations that are not equipped to display a character may choose to leave the character invisible; or if the character is not an ignorable character, an implementation may display something from this array; or may display something to indicate that it is an unrecognized character.
Jamo short name | General | Y | Text | Maps each single consonant or single vowel syllable to a short name as defined in Unicode, like “G”, “GG”, “N”, etc. Used to form other Hangul syllable names.
Private | General | Y | Boolean |
Reserved | General | Y | Boolean |
Unicode general category (categories) | General | N | Array of 2 letter text strings. | A potential multivalued attribute. Each item in the array contains 2 letters to indicate the Unicode general category that the character is a member of. Each possible value is listed and described in the Unicode general category (categories) table.
Numeric value | Numeric | Y | String of F1 digits with space, hyphen/minus and forwardslash allowed. | The characters are interpreted as a single number.
Numeric type | Numeric | Y | Text | One of the terms from the numeric types table.
CS type mapping | Compatibility | N | Array of Bytext decimal notation units; or null | Mapping from a compatibility character to a CS, or from a CS prevariant to a compatibility character of the type specified in the “CS type” property.
CS type | Compatibility | N | Text | A word or phrase summarizing the difference between a CC and its CS. Each possible value is listed and described in the CS types table.
Mark type mappings | Mark | Y | Array of Bytext decimal notation units; or null | Maps a standalone mark to all mark variants with its mark. Maps a mark variant to its standalone mark.
Mandatory Break (MB) | Line breaking | Y | Boolean | PEC. LEC. FORM FEED (U+000C). NEL. LINE FEED.
Mandatory break opportunity before (MBOB) | Line breaking | Y | Boolean | MONGOLIAN TODO SOFT HYPHEN (U+1806).
Mandatory break opportunity after (MBOA) | Line breaking | Y | Boolean | UNGLUE. SOFT HYPHEN (U+00AD). SPACE. Forget all that nonsense in Unicode about how groups of spaces are treated as a single word, what’s the point? EN QUAD (U+2000). EM QUAD (U+2001). EN SPACE (U+2002). EM SPACE (U+2003). THREE-PER-EM SPACE (U+2004). FOUR-PER-EM SPACE (U+2005). SIX-PER-EM SPACE (U+2006). PUNCTUATION SPACE (U+2008). THIN SPACE (U+2009). HAIR SPACE (U+200A). TAB (U+0009).
Glue property | Line breaking | Y | Boolean | Prevents break before and after. Overrides all break opportunities except MB. All characters with this property that are not the GLUE character are converted to a sequence of characters. GLUE (U+2060). NO-BREAK SPACE (U+00A0). NARROW NO-BREAK SPACE (U+202F). FIGURE SPACE (U+2007). NON-BREAKING HYPHEN (U+2011). TIBETAN MARK DELIMITER TSHEG BSTAR (U+0F0C).
Line breaking type mappings | Line breaking | Y | Array of Bytext decimal notation units; or null | Zero or more mappings to all characters of this type. Line breaking variants include a no-break version of a space, hyphen, or other punctuation.
Break opportunity before priority (BOBP) | Line breaking | N | Integer or null | Context dependent treatment will just have to be known by the implementation; it is not marked in the Bytext database.
Break opportunity after priority (BOAP) | Line breaking | N | Integer or null | Context dependent treatment will just have to be known by the implementation; it is not marked in the Bytext database.
Mirror mapping | Bidirectional | Y | Bytext decimal notation unit; or null | Single mapping to a character that can be used as the character’s mirror. This is a specific kind of fallback glyph. The other character will always have a mirror mapping back to the first character.
Strong L implicit bidirectionality | Bidirectional | N | Boolean | This property is used during transcoding to and from Unicode. It is the same property as in Unicode, but it is informative in Bytext. Applications can use this property to implement a kind of “auto-insert” functionality when inputting bidirectional text. A character cannot have both strong R and strong L implicit bidirectionality.
Strong R implicit bidirectionality | Bidirectional | N | Boolean |
base counterparts after (BCA) | Ligating | Y | Array of Bytext decimal notation units; or null | Maps to one or more Bytext characters. These characters represent all the ligating variants that are acceptable to be the character immediately following this character in order for the sequence to not be considered fractured. If a ligating variant has a base counterpart after property, it is called a ligating base counterpart before variant.
base counterparts before (BCB) | Ligating | Y | Array of Bytext decimal notation units; or null | Maps to one or more Bytext characters. These characters represent all the ligating variants that are acceptable to be the character immediately preceding this character in order for the sequence to not be considered fractured. If a ligating variant has a base counterpart before property, it is called a ligating base counterpart after variant.
mark counterparts after (MCA) | Ligating | Y | Ordered array of Bytext decimal notation units; or null | The element number of each mapping identifies which mark of a character with multiple marks the property applies to, in order of composition level. The mapping is to a standalone ligating mark, which is used to infer each possible counterpart character.
mark counterparts before (MCB) | Ligating | Y | Ordered array of Bytext decimal notation units; or null | The element number of each mapping identifies which mark of a character with multiple marks the property applies to, in order of composition level. The mapping is to a standalone ligating mark, which is used to infer each possible counterpart character.
Collation element array | Collation | N | Ordered array of integer values from 0 to 255. |
Collation element priority number array | Collation | N | Ordered array of integer values from 1 to 5. |
SS cross reference | Search similar | Y | Single Bytext decimal notation unit; or null | Only reserved characters can have a non null value for this property. Maps to the SS location of the character that would otherwise occupy the native location of the reserved character.
Native cross reference | Search similar | Y | Single Bytext decimal notation unit; or null | Maps to the native location of a SS character. Only SS characters can have a non null value for this property.

CS types

Possible CS type property values    Description
Font A font variant. For example, a black letter form.
noBreak A no-break version of a space, hyphen, or other punctuation.
Initial An initial presentation form of an Arabic script character.
Medial A medial presentation form of an Arabic script character.
Final A final presentation form of an Arabic script character.
Isolated An isolated presentation form of an Arabic script character.
Circle An encircled form.

Super A superscript form.

Sub A subscript form.

Vertical A vertical layout presentation form.

Square A CJK squared font variant.

Fraction A vulgar fraction form.

ClusterJamo A Multi Jamo mapping form of a single Jamo.


MultiMark A form of a single mark meant to span multiple characters.
SystematicIndic A systematic Indic syllable variant form.
LigatureForm A compatibility ligature form of two or more characters.
LamAlefLigatureForm A Lam-Alef ligature form of two characters.

Mirrored A mirrored compatibility form.
Auto An auto form of a formatting character: AUTO LEC, AUTO BELI, or
AUTO BELD.

Unicode general categories

Possible Unicode general category (categories) property values    Description    Notes
Mn Mark nonspacing
Mc Mark spacing combining
Me Mark enclosing
Nd Number decimal digit
Nl Number letter
No Number other
Zs Separator space
Cc Other control
Cf Other format
Co Other private
Cn Other not assigned
Lm Letter modifier
Lo Letter other
Pc Punctuation connector
Pd Punctuation dash
Ps Punctuation open
Pe Punctuation close
Pi Punctuation initial quote May behave like Ps or Pe depending on usage, says Unicode.
Pf Punctuation final quote May behave like Ps or Pe depending on usage, says Unicode.
Po Punctuation other
Sm Symbol math All these symbols are assumed to have the Unicode “mathematical property” and vice versa.
Sc Symbol currency
Sk Symbol modifier
So Symbol other

Numeric types
The symbol, numeral, and Kanbun annotation types combined are equivalent to the “numeric”
Unicode type.

Numeric type          Description
decimal digit         Equivalent to Unicode “decimal” numeric type.
symbol decimal digit  Equivalent to Unicode “digit” numeric type.
numeral               Unicode “numeric” types with the term “NUMERAL” in their name.
Kanbun annotation     U+3192..U+3195 from Unicode “numeric” type.
symbol                Includes remaining characters from the “numeric” Unicode numeric type: non decimal dingbats; parenthesized and period forms; and the “ideographic number zero” character.
ideograph             Ideograph numeric types are only for characters in B116- or B117-.
Han accounting        Characters that may be used as accounting numbers (“spelled out” numbers).
binary                Only for OBS characters.

META INFO
“Have I taken to answering my own questions? Apparently so.”
--Bruno Campos

Title
The Bytext Standard

Summary
Bytext is an open standard for encoding text vying to fill the need for a single preferred
charset for worldwide information interchange. Bytext is a major improvement upon ASCII
and Unicode. This is the authoritative source of information on Bytext version 0.102. This
version is an incomplete specification.

Status update
Version 0.102 is the 3rd publicly distributed version of Bytext. It is an incomplete specification.
Check www.bytext.org for the latest information and archives. The previous version was 0.101,
published 2002-01-28.
Changes from previous version: Major changes to bidirectionality encoding. Changes to
placeholder allocation. IGNORABLE LINE ENDING CHARACTER renamed. Changes to how
precomposed forms of Thai and Lao syllables that have a non logically ordered Unicode
decomposition are transcoded to and from Unicode. New terminology for pointers. Numerous
edits for clarification.

Version
0.102

Date of publication
2002-05-29. May 29, 2002.

Author(s)
Bernard Rafael Miller.
• Email: Bernard_R_Miller@bytext.org
• Personal web site: http://www.geocities.com/bernard_r_miller/index.htm

Intellectual property rights
Copyright © 2002 Bernard Rafael Miller. All rights reserved. This applies to the Bytext
documentation and the Bytext.org website as a whole.
Copyrights are basically distribution rights. There are 2 main reasons for reserving these
rights:
1. To control the content as it appears publicly on the web. In other words, if you wish to
mirror the site for public display you must get permission. The purpose of getting
permission is basically to form an agreement that updates will be maintained --this is for
the public good.
2. For the exclusive right to sell or distribute a physical media publication (book or cd)
consisting of this information in whole or in part. Proceeds are for a good cause.

The Bytext name; the blue, green and yellow Bytext emblem; and the 7 color Bytext logo are
trademarks of Bernard Rafael Miller.
Bytext is an open standard. Notwithstanding the rights reserved above, the right to use the
Bytext standard is freely granted to everyone.
Unicode is a trademark of the Unicode Consortium. Java is a trademark of Sun Microsystems.

Keywords
Bytext, character encoding scheme, coded character set, charset, Unicode, ASCII, indexary,
keyboard, markup, search similarity, placeholders, ignorables, normalization, line breaking,
collation, bidirectional, bidirectionality, LEC, PEC, AEB, UEB, FTV, F6TV, F8TV, F15TV,
grapheme, character, open standard, publishing, comparison, compression, international,
i18n, symbol, letter, prevariant, variant, emoticon, ligating, joining,
subjoining, conjoining, corpus, selective differencing, rearranging, rearrangement, TCS,
text-specialized compression scheme, SCSU, Gaiji, BELI, BELD

Conventions

___Conventions specific to Bytext:


Major finalized versions will be whole numbers by convention. Version 0.5 will be made to
roughly correspond to the release of a known implementation of Bytext. Documentation
changes will not affect the version number or the date of publication unless the changes
affect how the standard actually works. After version 0.3 and possibly not before, minor
documentation changes will be noted with the date in the status update section. Check
www.bytext.org for the latest information.
Character names are written with uppercase letters, except as section titles they follow the
capitalization scheme of their section (header) level. The character name written is the
Bytext character name unless explicitly specified as being the Unicode character name. A
character name, whether a Bytext character name or Unicode character name, will usually
have its Unicode character literal following it in parentheses.
A character is sometimes described as having a property or not when in fact all characters
have all properties. What is meant is that the property value is either null or not.

___General writing conventions:


The index and glossary are combined and this concept is called an indexary. It’s actually not
really an index because it does not list every occurrence of a word or phrase, just those that
are considered particularly relevant. It’s also not really a glossary because it does not
contain definitions, only links to where a definition can be found in freeform text.
It is intended that the definition of a word will appear before or shortly after its first use, or
that the first use will be a link to the definition. Defining things in freeform text is natural
and efficient. Rewording everything so it can be defined individually is a hassle and is
sometimes not appropriate. It can lead to inadequate descriptions when it is the only thing
available and it contributes to clutter.
All paragraphs in this document have a hanging indent except for single paragraphs within a
table cell. It seems to be more compact, clear and consistent than any other paragraph

style. It reduces the need for bulleting, numbering, and paragraph spacing. It is also
consistent with source code conventions as explained next.
The label of a column is in bold. This helps to differentiate between a table with labeled
columns and a grid without labeled columns.
A block or a statement of source code is treated as a paragraph with a hanging indent. Such
indents can nest. A statement is considered to end when the comments that are part of that
statement end. Comments that are not part of a statement but rather go between
statements are treated as their own paragraph with a nesting and hanging indent. In more
conventional styles such as in “Kernighan & Ritchie”, some of the text of a block is treated
as if it has a hanging indent (the part inside the curly braces) and some of the text is not
(the actual curly braces). All styles like that are considered visually inconsistent.

___Spelling and punctuation conventions:


Two periods are used to indicate a range. For example, 1 to 20 is written 1..20. Two periods
followed by a space is sometimes used as an ellipsis.
Whitespace is one word not two.
“Who” is considered a spelling variation of “whom”. It is not considered useful to distinguish
the two.
“Tho” is considered a spelling variation of “though”.
“Thru” is considered a spelling variation of “through”.
Occasionally 2 dots are used instead of 3 to represent an ellipsis. Two dots are considered
sufficient for understanding as an ellipsis so long as they are followed by a space or ending
punctuation such as an ending quote or a right parenthesis.
A period within a quotation or parenthesis is not used as the period for the sentence using the
quotation or parenthesis. A quotation that ends a sentence but does not itself end with a
period can be assumed to end with a period unless it ends with an ellipsis.
Bulleted or numbered list items are written as if each were separate paragraphs, not as if it
was one long sentence.
3 underscores then a statement followed by a colon is the format used for a small subsection
that is not intended to appear in the table of contents.
Anything delimited by 3 curly braces on each side is considered to be there for historical
reference, sort of a preliminary status before it is archived or deleted. Several question
marks in a row is a marker used to indicate a topic for discussion or further research, even if
the preceding statement is not in question form. These are typically removed for publication
but some may be left in if it seems useful.

References
Misc references and notes in no particular order:

ASCII. American Standard Code for Information Interchange. ANSI X3.4-1986 (R1997). ISO
646 is basically the “international” version of ASCII. ECMA-6 is a free equivalent to 3rd
edition of ISO 646: http://www.ecma.ch/ecma1/STAND/ECMA-006.HTM. Praise is due to
ECMA. Charging for informational access to an open standard seems inherently contrary to
the whole idea of openness. ANSI is guilty of this also, and until recently, so was Unicode.

Free equivalent to the ISO/IEC 6429 standard: http://www.ecma.ch/ecma1/STAND/ECMA-048.HTM.

Free equivalent to ISO/IEC 2022:1994: http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM.

Markus Kuhn’s site, excellent: http://www.cl.cam.ac.uk/~mgk25/unicode.html

General encoding info, line breaking critique: http://www.cs.tut.fi/~jkorpela/chars/

ASCII: American Standard Code for Information Infiltration, from the creator of FidoNet:
http://www.wps.com/texts/codes/

www.vt100.net

Symbols: http://www.lib.ohio-state.edu/OSU_profile/refweb/resources/symbol2.htm

Unicode. The vast majority of information useful for creating Bytext was provided by Unicode
free of charge on their website www.unicode.org. However, it wasn’t until fairly recently that
the online version was available on their website, about 9 years after Unicode started.
Before then, there was a charge for informational access to the so called open standard.

http://www.hastingsresearch.com/net/04-unicode-limitations.shtml

types of scripts: http://www.ontopia.net/i18n/script-types.jsp


South Asia free ebooks: http://www.columbia.edu/cu/lweb/indiv/southasia/cuvl/ebooks.html
comments on Tamil in Unicode: http://www.tamil.net/people/sivaraj/tamil_unicode.html
http://www.isoc.org/isoc/whatis/conferences/inet/96/proceedings/a5/a5_2.htm
“TSCII”: http://www.geocities.com/Athens/5180/tsic.html
Character Frequency in Tamil http://www.thunaivan.com/thunaivan/Compare/comparat.htm

Try the ICU transliteration page --it will display hex codes. http://oss.software.ibm.com/cgi-bin/icu/tr

Writing systems: http://www.omniglot.com/index.htm


Article on alphabets: http://encarta.msn.com/find/Concise.asp?ti=03643000

Unicode Support for Mathematics http://www.unicode.org/unicode/reports/tr25

Ozideas: http://home.vicnet.net.au/~ozideas

<FONT FACE> considered harmful: http://babel.alis.com:8080/web_ml/html/fontface.html

http://oss.software.ibm.com/icu/

Text Encoding Initiative: www.uic.edu/orgs/tei

GB 18030-2000: http://oss.software.ibm.com/icu/docs/papers/gb18030.html
http://www.uselessknowledge.com/word/dollar.shtml
ConScript: http://www.evertype.com/standards/csur/index.html

http://nim.cit.cornell.edu/usr/share/man/info/C/a_doc_lib/aixprggd/genprogc/input_method.htm#A8F046

Internationalized Resource Identifiers: http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt

http://www.dot-dot-dot.org/summer2001/issue3/issue3_peter.html

Mark Davis: http://www.macchiato.com/

nameprep http://search.ietf.org/internet-drafts/draft-ietf-idn-nameprep-06.txt
'IETF Stringprep list' http://search.ietf.org/internet-drafts/draft-hoffman-stringprep-00.txt

history of unicode.org: http://web.archive.org/web/*/http://www.unicode.org

info on competitors to Unicode:


http://www-106.ibm.com/developerworks/unicode/library/u-secret.html
Rosetta: http://www.kotovnik.com/~avg/rosetta/
Tron: http://tronweb.super-nova.co.jp/tadenvironment.html

the IETF Charset Policy [RFC 2277] specifies that "Protocols MUST be able to use the UTF-8
charset". http://www.ietf.org/rfc/rfc2277.txt

Japanese character frequency, book description: http://www.sanseido-publ.co.jp/publ/NTT_English.html

Linguist magazine: http://www.emich.edu/~linguist/


http://www.emich.edu/~linguist/www-vl.html

CCCII: ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/info/cjk-codes/94x94x94.html
general info: ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/info/cjk-codes/index.html

http://www.microsoft.com/globaldev/

shift-JIS:
http://www.microsoft.com/globaldev/reference/dbcs/932.htm
http://www.geocities.com/fontboard/cjk/jis.html#sjis

general info from Marco Cimarosti: http://www.geocities.com/Tokyo/1763/efhan.html

Unicode on unix: http://czyborra.com/

http://tronweb.super-nova.co.jp/characcodehist.html

http://www.planet-typography.com/directory/institutions.html

International Signal Flags: http://www.envmed.rochester.edu/wwwrlp/flags/flags.htm

indent styles:
http://cs.wellesley.edu/~cs111/lectures/conditionals.html
http://www.comsc.ucok.edu/~mccann/indent_c.html#ThreeTwo

Security issues in Unicode:


http://www.counterpane.com/crypto-gram-0007.html#9
http://www.w3.org/TR/xmldsig-core/#sec-Seen

Table of contents

INTRODUCTION .......................................................................................................................................................... 1
OTHER PROBLEMS WITH UNICODE ................................................................................................................................... 3
DEFINITIONS ................................................................................................................................................................ 5
TERMS FROM UNICODE AND ELSEWHERE ......................................................................................................................... 7
SPECIAL SUBTYPES ......................................................................................................................................................... 8
Singular placeholders ................................................................................................................................................ 9
Group placeholders ................................................................................................................................................. 11
Search similar subtypes ........................................................................................................................................... 11
Normalization subtypes............................................................................................................................................ 12
ALLOCATION PRINCIPLES .................................................................................................................................... 14
F1 OVERVIEW ............................................................................................................................................................... 15
Numeric types .......................................................................................................................................................... 17
Marks ....................................................................................................................................................................... 17
EUROPEAN SCRIPT FAMILY ............................................................................................................................................ 18
Allocation table........................................................................................................................................................ 18
ARABIC SCRIPT FAMILY................................................................................................................................................. 19
Allocation table........................................................................................................................................................ 20

BRAHMIC SCRIPT FAMILY .............................................................................................................................................. 20
Indic notes................................................................................................................................................................ 20
Tibetan and Sinhala notes........................................................................................................................................ 22
Thai and Lao notes .................................................................................................................................................. 22
Khmer notes ............................................................................................................................................................. 22
Allocation table........................................................................................................................................................ 22
HANGUL SCRIPT ............................................................................................................................................................ 24
Allocation table........................................................................................................................................................ 24
KANA SCRIPT FAMILY ................................................................................................................................................... 24
Allocation table........................................................................................................................................................ 25
HAN SCRIPT FAMILY...................................................................................................................................................... 25
Allocation table........................................................................................................................................................ 25
ETHIOPIC SCRIPT ........................................................................................................................................................... 26
Allocation table........................................................................................................................................................ 26
CANADIAN ABORIGINAL SYLLABICS SCRIPT .................................................................................................................. 27
Allocation table........................................................................................................................................................ 27
MONGOLIAN SCRIPT ...................................................................................................................................................... 27
Allocation table........................................................................................................................................................ 27
NEW CHARACTERS .................................................................................................................................................. 28
TYPE 1 IGNORABLES ..................................................................................................................................................... 28
AUTO LEC............................................................................................................................................................... 29
BELI and BELD....................................................................................................................................................... 29
Non substitutable CC’s ............................................................................................................................................ 31
Excessive CSCC’s .................................................................................................................................................... 32
Defective CSCC’s .................................................................................................................................................... 33
TYPE 2 IGNORABLES ..................................................................................................................................................... 33
Private character 127 .............................................................................................................................................. 33
OCR BINARY SYMBOLS ................................................................................................................................................ 33
ARROW PARENTHESES .................................................................................................................................................. 34
CO SUBTYPES ............................................................................................................................................................... 35
MISC SYMBOLS ............................................................................................................................................................. 35
Generic symbols....................................................................................................................................................... 36
Bidi symbols............................................................................................................................................................. 36
Emoticons ................................................................................................................................................................ 37
POTENTIAL NEW CHARACTERS ...................................................................................................................................... 38
RENAMED CHARACTERS ................................................................................................................................................ 40
Potential renames .................................................................................................................................................... 42
LINE BREAKING ALGORITHM ............................................................................................................................. 42
BREAK OPPORTUNITY PRIORITY SUMMARY .................................................................................................................... 43
THE BASIC ALGORITHM ................................................................................................................................................. 46
SAMPLE JAVA CODE ...................................................................................................................................................... 46
COLLATION ALGORITHM ..................................................................................................................................... 47
SETUP ........................................................................................................................................................................... 48
THE BASIC ALGORITHM ................................................................................................................................................. 49
SAMPLE JAVA CODE ...................................................................................................................................................... 49
BIDIRECTIONALITY ALGORITHM ...................................................................................................................... 50
THE BASIC ALGORITHM ................................................................................................................................................. 51
COMPRESSION FRAMEWORK .............................................................................................................................. 53
REARRANGING.............................................................................................................................................................. 53
SELECTIVE DIFFERENCING ............................................................................................................................................. 54
Sample Java code..................................................................................................................................................... 56
CONVERSION INFO .................................................................................................................................................. 58
ASCII EMBEDDED BYTEXT .......................................................................................................................................... 58
UNICODE EMBEDDED BYTEXT ...................................................................................................................................... 58
CF NORMALIZATION .................................................................................................................................................... 59

BYTEXT DATABASE ................................................................................................................................................. 59
ORDERED LISTS ............................................................................................................................................................ 59
SS TABLES ................................................................................................................................................................... 64
CHARACTER PROPERTIES .............................................................................................................................................. 64
PROPERTY INFO ............................................................................................................................................................ 65
Property info table ................................................................................................................................................... 65
CS types ................................................................................................................................................................... 69
Unicode general categories ..................................................................................................................................... 70
Numeric types .......................................................................................................................................................... 70
META INFO ................................................................................................................................................................. 71
TITLE ........................................................................................................................................................................... 71
SUMMARY .................................................................................................................................................................... 71
STATUS UPDATE............................................................................................................................................................ 71
VERSION....................................................................................................................................................................... 71
DATE OF PUBLICATION .................................................................................................................................................. 71
AUTHOR(S)................................................................................................................................................................... 71
INTELLECTUAL PROPERTY RIGHTS ................................................................................................................................. 72
KEYWORDS .................................................................................................................................................................. 72
CONVENTIONS .............................................................................................................................................................. 72
REFERENCES ................................................................................................................................................................. 73
TABLE OF CONTENTS..................................................................................................................................................... 75
INDEXARY................................................................................................................................................................... 77
END .............................................................................................................................................................................. 78

INDEXARY
“I could take that to so many bad places.”
--‘DJ Launchpad’ from Microsoft

An indexary is an index and glossary combined. It simply contains links to bookmarks that
define the word or phrase, and possibly links to other occurrences of the word.

AEB (ASCII embedded Bytext)


allocation table
AUTO BELD (AUTO Bidirectional Embedding Level Decreaser)
AUTO BELI (AUTO Bidirectional Embedding Level Increaser)
AUTO LEC (AUTO Line Ending Character)
BBA (Bytext Bidirectionality Algorithm)
BELD (Bidirectional Embedding Level Decreaser)
BELI (Bidirectional Embedding Level Increaser)
bidi (BIDIrectional or BIDIrectionality)
Bytext API
Bytext collation order
Bytext order
category
CATS (CATegory Symbols)
CC's (Compatibility Characters)
counterparts, ligating properties
cross reference character, cross reference(s) property
CS (Compatibility Substitute)
CSCC's (Conditionally Substitutable Compatibility Character)
fixed category
fractured, ligating properties
group placeholder, East Asian width variants
ignorables

LEC (Line Ending Character)
ligating marks, ligating properties
mark
mirrorable
mirrored variants
NSCC's (Non Substitutable Compatibility Character)
OBS's (OCR Binary Symbol’s)
PDT (Pattern Defined Type)
PEC (Paragraph Ending Character), mandatory break property
peer group
prevariant
SCC's (Substitutable Compatibility Character’s)
singular placeholder, mirrored variants
SS (Search Similar)
standalone marks
statement grid
systematic Indic syllable variants, SystematicIndic CS type
TCS (Text-specialized Compression Scheme)
type
type mappings
UBA (Unicode Bidirectional Algorithm)
UEB (Unicode Embedded Bytext)
zero terminated Bytext order

End
