Unicode and Character Sets


By Christopher
I have a lot of developer friends who are still confused about the idea of character sets.
The internet is a global phenomenon; in today’s world, every developer must understand
character sets if they are to create applications that work around the world.
What are character sets?
Let’s start with the basics. Every character of a string is stored as some binary number in
your computer. A character set is a map between those numbers and the actual glyphs that
you can recognize. For example, in the ASCII character set, the code for ‘a’ is 97.
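A quick way to see this mapping for yourself is with a couple of lines of Python (I’ll use
Python for the little sketches in this article, simply because it makes the byte-level
details easy to poke at):

    # The built-in ord() and chr() functions expose the character <-> number mapping.
    print(ord('a'))  # 97 -- the code for 'a'
    print(chr(97))   # a  -- and back again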
Back in the day there was ASCII (well there were more before that, but let’s not go too far
back in history). ASCII used numbers from 0-127 to map certain control characters (like
newlines) and English letters and punctuation (a-z, 0-9, quotes etc). If you spoke English,
then life was good.
But if you spoke a different language that used different characters (for example, accented
characters), you were pretty much screwed.
To solve this problem, engineers in different parts of the world started to come up with
different character sets. That is, they created character sets so they could represent glyphs
relevant to their language.
In a pre-internet era, this wasn’t all too bad. But once people started sharing documents,
the problem became all too clear. An American reading a report drafted in Russia would
see a page of garbled text because the character set used on his American-bought
computer was completely different from the character set used on the Russian-bought
computer.
Every common character set agrees on the characters from 0-127 (thankfully!). But the
characters from 128 and up differ greatly. For example, on some early PCs the code 219
represented an opaque block (used to draw boxes and such in old apps), while on other
computers that same code represented a U with a circumflex instead.
Further complicating the problem, the size of a character code was sometimes different.
Asian writing systems contain thousands of characters, which means there are too many
codes to fit in the single byte (0-255) that most computers were using. To overcome this
problem, special character sets were created in which some characters took 1 byte and
other characters took 2 bytes.
This might not seem so bad, but think about it for a second. Common tasks like finding the
length of a string now become more difficult. You can’t simply count the number of bytes
the string occupies, because that would give you an erroneous answer (you don’t know how
many characters are 1 byte and how many are 2 bytes). Even iterating through a string a
letter at a time is no longer a simple matter of moving a memory pointer forward by 1 byte.
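Here is a quick illustration of the same problem using a modern multi-byte encoding
(UTF-8, which I’ll get to below). It’s just a sketch, but it shows that the character count
and the byte count are two different numbers:

    # Character count and byte count are no longer the same thing.
    s = "héllo"
    print(len(s))                  # 5 characters
    print(len(s.encode("utf-8")))  # 6 bytes -- 'é' takes 2 bytes in UTF-8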
Unicode
Now that the problem is sufficiently explained, let’s get to the solution.
Unicode was created as the one character set to rule them all. In Unicode, every possible
glyph imaginable is mapped to a unique number.
Unicode determines which number, called a code point, represents which character. The
standard way to write a code point is like this: U+0061 for the letter ‘a’ (97 in decimal
is 61 in hex). That is ‘U+[hex]’.
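You can print the code point of any character in that notation:

    # Print code points in the standard U+XXXX notation.
    for ch in "a€":
        print(f"U+{ord(ch):04X}")  # U+0061, then U+20AC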
But here’s where it gets a bit tricky. The Unicode standard only defines which numbers map
to which characters. The actual means of storing the number is still up in the air.
Most people think that Unicode is limited to a maximum of 16 bits, but this is untrue: the
code space runs from U+0000 all the way up to U+10FFFF, which is over a million possible
code points. Unicode itself doesn’t care about storage; it only cares about mapping code
points to characters. That gives it more than enough room for the world’s writing systems.
Encodings
So if Unicode only cares about code points, how does a computer store the actual data?
This is done with a specific encoding.
So far we’ve been using the term ‘character set’ to mean two things: which characters are
in the set and how they are represented. Representing text in a computer really involves
two components:
1. A so called character repertoire which defines which characters can be
represented. ASCII’s character repertoire says we can represent English letters,
numbers and punctuation (amongst others). Unicode on the other hand can
represent everything.
2. A character encoding states how the computer actually stores these characters. It
defines how a bunch of binary bits can be converted into the actual characters.
With ASCII, we know that a character is always stored in 1 byte, and that ‘a’ is 97
and ‘b’ is 98, etc. With Unicode, there are multiple encodings (see the sketch after
this list).
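To make the distinction concrete, here is the same string run through two different
encodings. This is just a sketch; latin-1 is one of many legacy single-byte encodings:

    # Same characters (repertoire), different bytes on disk (encoding).
    text = "café"
    print(text.encode("utf-8"))    # b'caf\xc3\xa9' -- 'é' stored as two bytes
    print(text.encode("latin-1"))  # b'caf\xe9'     -- 'é' stored as one byte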
The most popular encoding you hear about today is UTF-8. UTF-8 uses 1 to 4 bytes to store
each code point. For code points 0-127, 1 byte is used. As the code points get larger, the
number of bytes it takes to represent them increases, up to a maximum of 4 bytes.
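You can watch the byte count grow as the code points get larger:

    # UTF-8 uses more bytes as the code point gets larger.
    for ch in ["a", "é", "€", "😀"]:
        print(ch, f"U+{ord(ch):04X}", len(ch.encode("utf-8")), "byte(s)")
    # a U+0061 1 byte(s)
    # é U+00E9 2 byte(s)
    # € U+20AC 3 byte(s)
    # 😀 U+1F600 4 byte(s)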
Since Unicode defines code points 0-127 the same as ASCII (for example, ‘a’ is code point
97, just like ‘a’ in ASCII is 97), and UTF-8 encodes code points 0-127 as a single byte,
UTF-8 is directly backwards compatible with ASCII. For most English systems, this makes
converting to UTF-8 very easy. For example, if you have a bunch of English web pages, you
could switch the character encoding header on your web server to ‘UTF-8’ without any other
work. If your systems contain different languages (and thus, non-ASCII characters), then
converting to UTF-8 will require other tools.
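A quick check of that backwards compatibility: plain ASCII text produces exactly the same
bytes whether you encode it as ASCII or as UTF-8, so an existing English file is already
valid UTF-8:

    # ASCII bytes are already valid UTF-8.
    ascii_bytes = "Hello, world".encode("ascii")
    utf8_bytes = "Hello, world".encode("utf-8")
    print(ascii_bytes == utf8_bytes)    # True -- identical bytes
    print(ascii_bytes.decode("utf-8"))  # Hello, world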
When are character sets important?
In short: ALL THE TIME
There is no such thing as text without a character set. There always needs to be a way for a
computer to convert the random bits and bytes in a file to characters that humans can
understand. Most of the time when you read about “plain text”, it means a file using ASCII
code-points 0-127.
How a program decodes the characters in a file depends on the program. For example, some
browsers might guess the character set by looking at common patterns of code points. But
most web sites these days specify the character set right in the response headers, so a
browser doesn’t need to guess.
If you have ever opened up a web page or a file and seen a bunch of ???’s, then it means
the application is trying to decode the text data using an incorrect character set.
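You can reproduce that kind of garbling yourself by decoding bytes with the wrong
character set:

    # Decoding UTF-8 bytes as latin-1 produces the classic garbled text.
    data = "café".encode("utf-8")
    print(data.decode("latin-1"))  # cafÃ©  -- wrong character set
    print(data.decode("utf-8"))    # café   -- correct character set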
Conclusion
I hope you learned something about character sets today. In an upcoming article I will write
about the role character sets and Unicode play in our applications, specifically with PHP.
