Most people think that Unicode is limited to a maximum of 16 bits, but this is untrue.
Unicode itself doesn’t care about storage; it only cares about mapping code points to
characters, so it isn’t constrained by any fixed number of bits. In practice, this means
Unicode can represent the characters of virtually every written language.
Encodings
So if Unicode only cares about code points, how does a computer store the actual data?
This is done with a specific encoding.
So far we’ve been using the term ‘character set’ to mean two things: which characters are
in the set and how they are represented. In fact, representing text in a computer involves
two components:
1. A so-called character repertoire, which defines which characters can be
represented. ASCII’s character repertoire says we can represent English letters,
numbers and punctuation (amongst others). Unicode, on the other hand, aims to
cover every character in every writing system.
2. A character encoding, which states how the computer actually stores these
characters. It defines how a sequence of binary bits is converted into actual
characters. With ASCII, we know that a character is always stored in 1 byte, and
that ‘a’ is 97, ‘b’ is 98, and so on. With Unicode, there are multiple encodings.
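The repertoire/encoding split above can be seen directly in Python (a quick sketch, not part of the original article): a character’s code point is one thing, and the bytes an encoding produces for it are another.

```python
# Code points vs. encodings: ord() gives the code point (repertoire side),
# while .encode() shows how one particular encoding stores it as bytes.
print(ord('a'))             # 97, the same code point as in ASCII
print('a'.encode('utf-8'))  # b'a' -- one byte, value 0x61 == 97
print(chr(0x20AC))          # '€', a code point ASCII's repertoire can't represent
```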
The most popular encoding you hear about today is UTF-8. UTF-8 uses 1 to 4 bytes to store
each code point. For code points 0-127, 1 byte is used. As the code points increase, the
bytes it takes to represent them increases up to a maximum of 4 bytes.
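You can verify the variable-width behaviour described above with a small Python sketch (the sample characters are my own choices for illustration):

```python
# UTF-8 is variable-width: the same encoder handles every code point,
# only the number of bytes produced changes.
for ch in ['a', 'é', '€', '😀']:
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), encoded)
# 'a' takes 1 byte, 'é' 2 bytes, '€' 3 bytes, and '😀' 4 bytes.
```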
Since Unicode defines code points 0-127 the same as ASCII (for example, ‘a’ is code point
97, just like ‘a’ in ASCII is 97), and UTF-8 encodes code points 0-127 as a single byte,
UTF-8 is directly backwards compatible with ASCII. For most English systems, this makes
converting to UTF-8 very easy. For example, if you have a bunch of English web pages,
you could switch the character encoding header on your web server to ‘UTF-8’ without any
other work. If your systems contain other languages (and thus, non-ASCII characters), then
converting to UTF-8 will require other tools.
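This backwards compatibility is easy to check in Python (a sketch for illustration): pure-ASCII text produces exactly the same bytes under both encodings.

```python
# For characters in the ASCII range, UTF-8 output is byte-for-byte
# identical to ASCII output -- which is why English pages can be
# relabelled as UTF-8 without re-encoding anything.
text = 'hello world'
print(text.encode('ascii') == text.encode('utf-8'))  # True
```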
When are character sets important?
In short: ALL THE TIME
There is no such thing as text without a character set. There always needs to be a way for a
computer to convert the random bits and bytes in a file to characters that humans can
understand. Most of the time when you read about “plain text”, it means a file using ASCII
code points 0-127.
How a program decides which character set to use when decoding a file varies. For example,
some browsers guess by looking at common patterns in the bytes. But most web sites these
days specify the character set right in the response headers, so a browser doesn’t need to
guess.
If you have ever opened a web page or a file and seen a bunch of question marks or garbled
characters, it means the application is trying to decode the text data using an incorrect
character set.
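You can reproduce that kind of garbling in Python (a sketch; the Latin-1 mismatch is my example, not from the article): decode UTF-8 bytes with the wrong character set and the multi-byte sequences turn into junk.

```python
# Encode with UTF-8, then decode with Latin-1: each byte of the
# two-byte 'é' sequence is misread as its own character.
data = 'café'.encode('utf-8')   # b'caf\xc3\xa9'
print(data.decode('latin-1'))   # 'cafÃ©' -- classic mojibake
```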
Conclusion
I hope you learned something about character sets today. In an upcoming article I will write
about the role character sets and Unicode play in our applications, specifically with PHP.