You are on page 1of 47

All you wanted to know about Character Encodings but were too afraid to ask

Whyaretheresomanydifferentencodings? Whatdoesalltheterminologymean? WhatdoIneedtokeepinmind? WhatisallthisUnicodestuffanyway? HowdoesUnicodeworkinErlang? HowdoweswitchoursystemtoUnicode?

In the beginning, there was the Teleprinter

Incomingcharactersareprintedtopaper Keypressissenttotheremoteteleprinteror computer(andprintedonthepaper,forfeedback) Thefirstteleprinterswereinventedinthe1870s Baudotcode(5bit,shiftedstateforletters/symbols)

Baudot code?

mileBaudot(1baud=1'signallingevent'/second) 1874,some70yearsbeforethefirstcomputers 5bits(32combinations) AZ,09,&()=%/;!'?,:.,ERASE,FIGSHIFT,... ~60charactersintotal,thankstoshiftstate Binary,fixedlength

Only2signallingstates:on/off Alwaysingroupsof5,e.g.,A=00001

Yeah? Why not Morse code, while you're at it?

Evenearlier,inventedin1844 Varyinglengthofsymbols(E=,Z=)

AsinHuffmancoding,commonsymbolsareshorter Shorttone(1unitlength) Longtone(3units) Gap(1unitsilencebetweentoneswithinasymbol) Letterspace(3unitssilence) Wordspace(7unitssilence)

5differentsignallingstates

Keep it simple

Simpleanduniformisgoodformachines 2levels(on/off)areeasierthan5 Fixedlengthgroupsiseasierthanvariablelength BadpartofBaudotcode:stateful

Lastpressedshiftkeydecidesif00001=Aor1,etc. Thehumanoperatorused5keysatoncelikeapiano chord,andtheshiftstatedidn'tchangeveryoften Stillprettysimpletobuildmachinesthathandleit

Control confusion

Wehavealreadymixedsymbolswithcontrolcodes

Erase(orDelete,orBackspace) Changeshiftstate LineFeed,Bell,andothercontrolcodessoonadded positiononpaper,shiftstate,alerttheoperator...

We'retalkingaboutcontrollingateleprinter...

...ratherthanrepresentingcharacters Andin1874,nobodythoughtaboutdataprocessing

85 years later: ASCII

AmericanStandardCodeforInformationInterchange 1963,7bitcode(128combinations) Stateless(nocurrentshiftmodetoremember) Bothuppercaseandlowercasecharacters

Bit5worksasaCapsShift,andbit6asCtrl

1000001=A 1100001=a 0000001=CtrlA

AZ,az,and09arecontiguous,foreasyprocessing

But what about 8-bit bytes?

Thetermbytehasbeenusedsince1954 Earlycomputershadveryvaryingbytesizes

Typically46bits Computermemoriesweresmallandexpensive Everybitcountednotaneasydecision,backthen Inthelongrun,8wasmorepractical(powerof2) UsedIBM'sown8bitcharacterencoding,EBCDIC

TheIBMSystem/360startedthe8bittrend,1964

...whicheverybodyhated,butstillexistsoutthere

Extended ASCII

ASCIIassumesthatonlyEnglishwillbeused Moresymbolswereneededforotherlanguages Many8bitextendedASCIIencodingscreated

Oftencalledcodepages,andnumbered

E.g.,IBMPCstandardcodepage437(DOSUS)

Countriespickedtheirownpreferredencoding Bigmess,nointeroperabilitybetweenoperating systemsandcountries

Latin-1

DECVT220terminal,1983

(Onlylike100yearsafterthosefirstteleprinters...) BecametheISO88591standard,alsocalledLatin1 Variants:88592885915,Windows1252, Stillalotofconfusion

MultinationalCharacterSetextendedASCII

Not everyone uses Roman

E.g.,ShiftJIS(8bit)forJapanese

0127areasinASCII,with2exceptions:

~(tilde)(overline) \(backslash)(yen)

161223arekatakana(alphabeticalsymbols) Variablelengthencoding 129159and224239arethefirstof2bytes

6355commonkanji(Chinese)+othersymbols

Hardtojumpinatanypoint:firstorsecondbyte?

Stateful encodings

E.g.,ISCIIforIndianlanguages,orISO2022for Japanese,KoreanandChinese Escapecodeschangethemeaningofallthe followingbytesuntilthenextescapecode

Mayalsochangenumberofbytespercharacter

Evenhardertojumpinatanypoint:mustscan backwardstofindlatestescapecode

How do you know which encoding a text is using?

Mostsoftwareleavesthisproblemtotheuser

Mostfileformatshavenowaytoindicateencoding Arawtextfilehasnoplacetostoresuchmetadata Peopleonsimilarsystemsandinthesamecountry usuallyusethesameencoding Ifnot,youhaveaproblem(yes,you) Convertingencodingscandestroyinformation

Thatitworksatallismainlybyconvention

IANA encoding names

TheIANA(InternetAssignedNumbersAuthority) maintainsthelistofofficialnamesofencodings tobeusedincommunication Thenamesarenotcasesensitive Forexample:

Name:ANSI_X3.41968[RFC1345,KXS2] Alias:ASCII Alias:USASCII(preferredMIMEname) Alias:ISO646US

Charset declarations

Email/HTTPheaders:

Content-Type: text/html; charset=utf-8

XMLdeclaration(atstartofXMLdocument)

<?xml version="1.0" encoding="iso-8859-1" ?>

Ifyoudon'tknowthecharset,howcanyoureadthe textthatsayswhatcharsetitis?

Email/HTTPheadersarealwaysASCIIonly Orguessencodingandseeifyoucanreadenoughto findthedeclarationthattellsyoutherealencoding MostencodingsaredownwardsASCIIcompatible

8 bits can be unsafe

Someoldsystemsassumethat7bitASCIIisused

Theextrabitineachbytewasusedasachecksum (parity)indatatransfers,orasaspecialflag Thatbitwasoftenautomaticallyresetto0 Zerobytes,linebreaks,etc.,couldalsobemangled

Tosend8bitdatathroughsuchsystemsandbe surethatitsurvives,youhavetouseonly charactercodesthatwillnotbetouched

E.g.,Base64(ortheolderUuencode)foremail

[1,128,254]=>"AYD+"

A few simple rules

Rememberwhatencodingyourstringsarein Nevermixdifferentencodingsinthesamestring Whenyouwritetoafile,socket,ordatabase:

Whatencodingisexpected? Doyouneedtoconvertbeforewriting? Isityourjobtoprovideacharsetdeclaration? Doyouknowwhattheencodingis? Doyouneedtoconverttoyourlocalencoding?

Whenyourprogramgetstextfromsomewhere:

Never trust your eyes

Therecanbeseverallayersofinterpretation betweenyouroutputandwhatyousee

Someofthesemaytrytobecleverandhelpyou Therearemanysystemsettingsthatcanaffectthings

Donottrustwhatyoureditorshowsyou Donottrustwhatyouseeintheterminal Ifyouwanttoknowwhat'sinafileordatastream, usealowleveltoollike'od -c <filename>'

Unicode

Ash nazg durbatulk, ash nazg gimbatul, Ash nazg thrakatulk agh burzum-ishi krimpatul. One ring to rule them all, one ring to find them, One ring to bring them all and in the darkness bind them.

Terminology

Peopleusethetermscharacterencoding, charactersetorcharset,codepage,etc., informallytomeanmoreorlessthesamething

Asetofcharacters(andcontrolcodes) ...andhowtorepresentthemasbytes

Buttherearemanydifferentconceptswhenit comestolanguagesandwritingsystems Unicodeterminologytriestoseparatetheconcepts

What is a character?

It'snotthenumberit'sbeengiveninanencoding It'snotthebyte(orbytes)storedinafile It'snotthegraphicalshape(whichfont?) It'sreallymoretheideaofthecharacter Unicodetalksaboutabstractcharactersandgives themuniquenames,suchasLATINCAPITAL LETTERX,orRUNICLETTEROE()

Code points

Eachabstractcharacterisassignedacodepoint Anaturalnumberbetween0and10FFFFhex(overa millionpossiblecodepoints) Dividedintoplanesof216codepointseach

00000000FFFFiscalledtheBasicMultilingualPlane andcontainsmostcharactersinactiveusetoday Mostplanesarecurrentlyunused,soplentyofspace

Justnumbers,notbytesormachinewords Thesubset0255isLatin1(and0127isASCII)

But where do fonts come in?

Encodings(andstringsingeneral)justexpressthe abstractcharacters,nothowtheyshouldlook Afontmapstheabstractcodepointstoglyphs (specificgraphicalshapes):A,A,A,A, A... Youcanusuallyforgetaboutfonts,but:

Thatyoucanhandleacharacter(codepoint)inyour codedoesn'tmeaneverybodycanseethesymbol NosinglefontcontainsallUnicodecodepoints Withoutasuitablefont,theuserseesorsimilar.

Unicode encodings

21bitsareneededforfullUnicode

Unicodewasfirstdesignedfor16bits(notenough) Buttherange010FFFFisfinal(theypromise)

ThereisafamilyofofficialencodingsforUnicode calledUnicodeTransformationFormat(UTF)

UTF8 UTF16 UTF32 andacoupleofothersthatarenotveryuseful

UTF-8

Quicklybecomingtheuniversallypreferred encodinginexternalrepresentationseverywhere DownwardsASCIIcompatible(0127unchanged) UTF8isnotthesamethingasLatin1!

Latin1isfixedlength(exactly1bytepercharacter) UTF8isvariablelength(16bytespercharacter)

CharactersintheBMP(0FFFF)use13bytes Latin1codes128255need2bytes(,etc.) ASCIIcharactersuseasinglebyte

UTF-8 details
Bits Max Byte 1 0....... 110..... 1110.... 11110... 111110.. 1111110. Byte 2 10...... 10...... 10...... 10...... 10...... Byte 3 Byte 4 Byte 5 Byte 6 7 7F 11 7FF 16 FFFF 21 1FFFFF 26 3FFFFFF 31 7FFFFFFF

10...... 10...... 10...... 10...... 10...... 10...... 10...... 10...... 10...... 10......

Firstbytesaysexactlyhowmanybytesareused Bytes26havehighbits10easytofindstartbyte Ifdataiscorrupted,itiseasytofindnextstartbyte PreservessortingorderofUnicodecodepoints EasytorecognizeUTF8text;also,FE/FFnotvalid

UTF-16

Somesystems(e.g.,JavaandtheWindowsAPI) use16bitcharactersafterall,Unicodefirstsaid theyweregoingtousenomorethan16bits UTF16uses2bytespercharacterfor0000FFFF CodepointsaboveFFFFareencodedusingtwo16 bitcharacters(sametrickasinShiftJIS)called surrogatepairs

D800DBFF(high)followedbyDC00DFFF(low) Togethertheygiveanumberfrom1000010FFFF Onlyvalidwhenusedinpairs(andincorrectorder)

UTF-32

UTF32meansthata32bitintegerisusedforeach codepointinthestring

Canofcourseuse64bitintegersorwideraswell Nosurrogatepairsmaybeusedalwaysexpanded

MainlyusedinternallyinAPIsandinchars/strings inprogramminglanguages

Inparticularforfunctionsthatworkonasingle characterratherthanonastring Unixlikesystemsnormallyuse32bitwide characters(wchar_t)alsoinstrings

What about UCS?

UniversalCharacterSet(ISO10646)

MostlythesameasUnicode(samecodepoints) Byteencodingsthatnobodyliked(UTF1,UCS2) thathavebeenobsoletedbyUTF8andUTF16 Notenoughacceptancefromsoftwarevendors UnicodewonandUCSwasadaptedtoUnicode

Basically,theytriedtocreateastandardtoosoon

IfyouseeUCS,thinkUnicodeorobsolete

Endianism

Ifyousend16bitor32bitdataasastreamof bytes,youhavetodecideinwhichordertosend thebytesofeachinteger:highorlowbytesfirst?

Someonewhoreadsthedatamustknowinwhich ordertoreassemblethebytesintointegers

DefaultbyteorderinInternetprotocolsisbig endian(networkorder)

Butusuallyitdependsonthesystem,i.e.,little endianistypicallyusedonx86machines,etc.

UTF8hasnoendianism;thebyteorderisfixed

Byte order mark (BOM)

ThecodepointFEFFcanbeusedtoindicatethe byteorder,byinsertingitfirstinthetext

Iffound,itisnotreallypartofthetext,justahint Zerowidth,nonbreakingspace,i.e.,invisible,ifit occursanywhereelse FEfollowedbyFFimpliesbigendian FFfollowedbyFEimplieslittleendian ThecodepointFFFEisreservedasinvalidin Unicode,sotherecanonlybeoneinterpretation

Don'tuseBOMinUTF8(causesconfusion)

Not only encodings

Unicodeincludesanumberofrulesandalgorithms fornormalizing,sorting,anddisplayingtext Therulesforalphabeticsortingdependonthe language(English,SwedishandGermandiffer) HowdoyouconvertCroatiantexttouppercase? Therearefreesupportlibrariesthathandlethisfor you,likeICUforC/C++

ManyprogramminglanguageshaveICUbindings

Normalization forms

Regardlessofencodingandsurrogatepairs, Unicodetextisvariablelength(eveninUTF32) Therearebasecharactersandcombiningcharacters

(776)isacombiningcharacter(diaeresis) canberepresentedby[196]orby[65,776]

Itcangetmuchmorecomplicatedthanthat Thecombiningcharactersfollowthebasecharacter

Tocomparestrings,theymustbenormalized

NFC(normalformcomposed)isusuallybest

Erlang, Latin-1 and Unicode

Chars? We don't need no steenkin' chars!

Strings are already Unicode

Erlangstringsarejustlistsofintegers Erlangintegersarenotboundedinsize WealreadyuseLatin1(0255)inourstrings Unicodesimplymeanslargernumbersinthelists

Technicallyit'sUTF32oneintegerpercodepoint

Youhavetothinkaboutotherencodingswhenyou readorwritetext,orconverttoorfrombinaries ErlangsourcecodeisstillalwaysLatin1!No Russianinsourcefilesfornow,sorry!

Binaries are chunks of bytes

Thequestionis:howarethecodepointsofthe stringrepresentedasbytesinthebinary? ForLatin1,youuseexactlyonebytepercharacter

list_to_binary(String)/binary_to_list(Bin)

TopackastringasaUTF8binary:

unicode:characters_to_binary("Motrhead")

TounpackaUTF8binarytoastring:

unicode:characters_to_list(<<"Motrhead">>)

IO-lists are also just bytes

AnIOlistcanbe:

Abinary,containinganybytes Alistofintegersand/orotherIOlists(toanydepth) Allintegersmustbebetween0and255

E.g.,[88,[89,90],[32,<<90,89,88>>],] ConcatenatingIOlistsischeap:L3=[L1,L2] IOlistscanbewrittendirectlytofiles/socketsor convertedtobinaryusingiolist_to_binary(List) Latin1stringscanbeuseddirectlyasIOlists

New Unicode type: chardata

ChardataissimilartoanIOlist.Chardatacanbe:

Abinary,containingUTF8encodedcharacters Alistofintegersand/orotherchardata,toanydepth AllintegersareUnicodecodepoints(010FFFF)

E.g.,[1234,<<195,132>>] Librarysupportforconvertingto/fromother encodings,andforwritingtooutput


io:format(~ts, [CharData]) insteadof~s

I/O stream encodings

BeforeUnicodeinErlang,I/Ostreamsassumedthe datawasinLatin1,andnevermodifiedthebytes Thefile:open/2functionnowtakesanoption {encoding, ...}whichbydefaultislatin1 Youcanchangetheencodingofanexistingstream withio:setopts/2orinspectitwithio:getopts/1 TheshellnowusesUnicodeI/Obydefaultifthe OSlanguagesettings(LANGetc.)indicateit

Think before you output

AplainoldflatLatin1stringinErlangisacharlist andcanbeprintedwith~tsaswellaswith~s ALatin1binarycannotbeprintedwith~ts, because~tsexpectsbinariestocontainUTF8 Chardatacannotbewrittendirectlytoafileor socketitmaycontainintegersabove255

IfyouknowitcontainsonlyASCIIintegers(0127) and/orUTF8binaries,andtheoutputstream expectsUTF8,youcantreatitasanIOlist

Switching to Unicode

How can you gradually change your system to support Unicode all the way through?

Start with the output!

Firstofall,makesureyoucanoutputUnicode,in webpages,mail,PDFs,etc.(Safest,andmakes sureyoucanseetheresultsoflaterchanges) Startwithonepage/documentatatimeandtest

GeneratecharsetdeclarationsforUTF8 TransformthetexttoUTF8whenyououtput

Checkthatresultlooksgood,bothforLatin1text (e.g.,),andfullUnicodetext(e.g.,Cyrillic characters)

Get a grip on internal data

Startassumingthatstringscancontaincodes>255 Mostcodeworkingonliststringsneedsnochange Checkallpacking/unpackingofbinaries UseUTF8inbinaries,notLatin1,inparticular whenyoustoretheminthedatabaseorondisk

Considerconvertingdatainoldtables/filestoUTF8

PreferUTF8encodedbinariesratherthanlistsof characterswheneveryouarestoringtext

Don't accept any Unicode input until you're ready!

AllinputtothesystemmustbelimitedtoLatin1 untilyoucanhandleUnicodeallthewaythrough! MakesurethatwebpagesdonotpostUTF8 encodedtextbacktoyou,evenifthetextonthe pageisdeclaredasUTF8

accept-charsetattribute

Checktheinputatruntimeforsafety

Testsendingine.g.,Cyrillicandseewhatbreaks!

Not all code at once!

Incrementallyabigbangupdateofallcodeto suddenlyhandlefullUnicodeisnotfeasibleina largeandcontinuouslyrunningsystem Takecareofonepartofthecodeatatime Convertto/fromLatin1whenyouinterfacewith othercodethatisnotUnicodecompliantyet

Convertgracefully;don'tcrashiftheconversion fails.Replacechars>255with'?'instead.

Last: accept Unicode input

OnceyouletfullUnicodeintothesystematsome point,itwillstarttospreadallovertheplace Entrypoints:HTTPPOSTorGET(URLqueries), XMLRPC,filesviaFTP,etc.Etc. EnsureallUnicodeinputgetsnormalizedtoNFC (composednormalform)attheentrypoint!

Shouldalreadybeguaranteedbytextsentfromweb browsers(W3Cstandard),butyouneedtomakesure

That's all there is to it

What could possibly go wrong?

You might also like