Unicode and Character Sets


By Christopher | Comments (0) | Trackbacks (1)

I have a lot of developer friends who are still confused about the idea of character sets. The internet is a global phenomenon; in today’s world, every developer must understand character sets to create applications that work around the world.

What are character sets?
Let’s start with the basics. Every character of a string is stored as some binary number in your computer. A character set is a map that links each number to an actual glyph that you can recognize. For example, in the ASCII character set, the code for ‘a’ is 97.

Back in the day there was ASCII (well, there were character sets before that, but let’s not go too far back in history). ASCII used the numbers 0-127 to map certain control characters (like newlines) plus English letters and punctuation (a-z, 0-9, quotes, etc.). If you spoke English, life was good. But if you spoke a different language that used different characters (for example, accented characters), you were pretty much screwed. To solve this problem, engineers in different parts of the world came up with different character sets, each one able to represent the glyphs relevant to their language.

In a pre-internet era, this wasn’t all that bad. But once people started sharing documents, the problem became all too clear. An American reading a report drafted in Russia would see a page of garbled text, because the character set used on his American-bought computer was completely different from the character set used on the Russian-bought computer. Every common character set agrees on the characters from 0-127 (thankfully!), but the characters from 128 and up differ greatly. For example, on some early PCs the code 219 represented an opaque block (used to draw boxes and such in old apps), while on other computers that same code represented a U with a circumflex.

Further complicating things, the size of a character’s code was sometimes different. Asian alphabets contain thousands of characters: too many codes to fit in the single byte (0-255) that most computers were using. To overcome this problem, special character sets were created in which some characters took 1 byte and other characters took 2 bytes.
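The two claims above — that ‘a’ is 97, and that the same byte means different glyphs in different legacy character sets — can be checked directly. Here is a short sketch in Python (used only for illustration; the article itself is language-agnostic):

```python
# A character set maps numbers to glyphs. Python's ord() and chr()
# expose the number assigned to each character.
assert ord('a') == 97          # 'a' is 97 in ASCII (and in Unicode)
assert chr(97) == 'a'

# The same byte value 219 means different things in different legacy
# character sets: an opaque block in the old IBM PC set (cp437), but
# a U with a circumflex in Latin-1.
byte = bytes([219])
print(byte.decode('cp437'))    # █  (full block)
print(byte.decode('latin-1'))  # Û
```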
This might not seem so bad, but think about it for a second. Even the most common tasks, like finding the length of a string, now become more difficult. You can’t simply count the number of bytes the string occupies, because that would give you an erroneous number (you can’t know how many characters are 1 byte and how many are 2 bytes). And even iterating through a string a letter at a time is no longer a simple matter of moving a memory pointer forward by 1 byte.
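The byte-count-versus-character-count problem is easy to demonstrate. A minimal Python sketch, using UTF-8 as the multi-byte encoding:

```python
# In a multi-byte encoding, counting bytes is not counting characters.
s = "naïve"                    # 5 characters, but 'ï' needs 2 bytes in UTF-8
data = s.encode('utf-8')

print(len(s))      # 5 characters
print(len(data))   # 6 bytes

# Iterating over raw bytes would split 'ï' in half; iterating over
# the decoded string yields whole characters.
for ch in s:
    print(ch)
```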


Now that the problem is sufficiently explained, let’s get to the solution. Unicode was created as the one character set to rule them all. In Unicode, every possible glyph imaginable is mapped to a unique number, called a code point. The standard way to write a code point is ‘U+’ followed by the number in hexadecimal: U+0061 for the letter ‘a’ (97 in decimal). But here’s where it gets a bit tricky. The Unicode standard only defines which numbers map to which characters. The actual means of storing those numbers is still up in the air.
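The U+ notation is just the code point printed in hexadecimal, which a couple of lines of Python make concrete:

```python
# A code point is written as 'U+' plus the number in hex.
# 'a' is decimal 97 = hex 61, so its code point is U+0061.
cp = ord('a')
print(f"U+{cp:04X}")           # U+0061

# Code points go far beyond one byte, e.g. the euro sign:
print(f"U+{ord('€'):04X}")     # U+20AC
```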

Most people think that Unicode is limited to a maximum of 16 bits, but this is untrue: Unicode is limitless. It’s limitless because Unicode itself doesn’t care about storage; it only cares about mapping code points to characters. So Unicode can theoretically represent an unlimited number of characters, and thus an unlimited number of languages.

Encodings
So far we’ve been using the term ‘character set’ to mean two things: which characters are in the set and how they are represented. Basically, representing text in a computer involves two components: 1. A so-called character repertoire, which defines which characters can be represented. 2. A character encoding, which states how the computer actually stores those characters. For example, ASCII’s character repertoire says we can represent English letters, numbers and punctuation (amongst others), and that ‘a’ is 97, ‘b’ is 98, etc. ASCII’s encoding is trivial: 1 byte is used per character. Unicode, on the other hand, can represent everything.

So if Unicode only cares about code points, how does a computer store the actual data? This is done with a specific encoding. An encoding defines how a bunch of binary bits can be converted into the actual characters. With Unicode, there are multiple encodings. The most popular encoding you hear about today is UTF-8. UTF-8 uses 1 to 4 bytes to store each code point. For code points 0-127, a character is always stored in 1 byte; as the code points increase, the number of bytes it takes to represent them increases, up to a maximum of 4. Since Unicode defines code points 0-127 the same as ASCII (for example, ‘a’ is code point 97, just like ‘a’ in ASCII is 97), and UTF-8 encodes code points 0-127 as a single byte, UTF-8 is directly backwards compatible with ASCII.

When are character sets important?
In short: ALL THE TIME. There is no such thing as text without a character set. There always needs to be a way for a computer to convert the random bits and bytes in a file into characters that humans can understand. Most of the time when you read about “plain text”, it means a file using ASCII code points 0-127. How programs decode the characters in files varies: some browsers might guess by looking at common patterns of code points, but most web sites these days will specify the character set right in the response headers, so a browser doesn’t need to guess. If you have ever opened up a web page or a file and seen a bunch of ???’s, it means the application is trying to decode the text data using an incorrect character set.

UTF-8’s backwards compatibility with ASCII makes converting most English systems to UTF-8 very, very easy. For example, if you have a bunch of English web pages, you could switch the character encoding header on your web server to ‘UTF-8’ without any other work. If your systems contain different languages (and thus non-ASCII characters), then converting to UTF-8 will require other tools.

Conclusion
I hope you learned something about character sets today. In an upcoming article I will write about the role character sets and Unicode play in our applications, specifically with PHP.
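As a closing demonstration, the UTF-8 behavior described above — 1 byte for ASCII-range code points, up to 4 bytes for higher ones, and garbled output when the wrong character set is used for decoding — can be verified in a few lines of Python:

```python
# UTF-8 stores each code point in 1 to 4 bytes; ASCII-range code
# points take exactly one byte, which is why UTF-8 is ASCII-compatible.
for ch in ['a', 'é', '€', '𝄞']:
    print(ch, f"U+{ord(ch):04X}", len(ch.encode('utf-8')), "byte(s)")
# 'a' takes 1 byte, 'é' 2, '€' 3, and '𝄞' 4.

# Decoding with the wrong character set produces garbled text,
# which is exactly what a broken page full of ???'s is showing you.
data = "héllo".encode('utf-8')
print(data.decode('latin-1'))  # hÃ©llo
```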
