Encodings, Unicode and Erlang by Richard Carlsson

All you wanted to know about Character Encodings but were too afraid to ask
Whyaretheresomanydifferentencodings? Whatdoesalltheterminologymean? WhatdoIneedtokeepinmind? WhatisallthisUnicodestuffanyway? HowdoesUnicodeworkinErlang? HowdoweswitchoursystemtoUnicode?
In the beginning, there was the Teleprinter
Incomingcharactersareprintedtopaper Keypressissenttotheremoteteleprinteror computer(andprintedonthepaper,forfeedback) Thefirstteleprinterswereinventedinthe1870s Baudotcode(5bit,shiftedstateforletters/symbols)
Baudot code?
mileBaudot(1baud=1'signallingevent'/second) 1874,some70yearsbeforethefirstcomputers 5bits(32combinations) AZ,09,&()=%/;!'?,:.,ERASE,FIGSHIFT,... ~60charactersintotal,thankstoshiftstate Binary,fixedlength
Only2signallingstates:on/off Alwaysingroupsof5,e.g.,A=00001
Yeah? Why not Morse code, while you're at it?
Evenearlier,inventedin1844 Varyinglengthofsymbols(E=,Z=)
AsinHuffmancoding,commonsymbolsareshorter Shorttone(1unitlength) Longtone(3units) Gap(1unitsilencebetweentoneswithinasymbol) Letterspace(3unitssilence) Wordspace(7unitssilence)
5differentsignallingstates

Keep it simple
Simpleanduniformisgoodformachines 2levels(on/off)areeasierthan5 Fixedlengthgroupsiseasierthanvariablelength BadpartofBaudotcode:stateful
Lastpressedshiftkeydecidesif00001=Aor1,etc. Thehumanoperatorused5keysatoncelikeapiano chord,andtheshiftstatedidn'tchangeveryoften Stillprettysimpletobuildmachinesthathandleit
Control confusion
Wehavealreadymixedsymbolswithcontrolcodes
Erase(orDelete,orBackspace) Changeshiftstate LineFeed,Bell,andothercontrolcodessoonadded positiononpaper,shiftstate,alerttheoperator...
We'retalkingaboutcontrollingateleprinter...
...ratherthanrepresentingcharacters Andin1874,nobodythoughtaboutdataprocessing
85 years later: ASCII
AmericanStandardCodeforInformationInterchange 1963,7bitcode(128combinations) Stateless(nocurrentshiftmodetoremember) Bothuppercaseandlowercasecharacters
Bit5worksasaCapsShift,andbit6asCtrl
1000001=A 1100001=a 0000001=CtrlA
AZ,az,and09arecontiguous,foreasyprocessing
But what about 8-bit bytes?
Thetermbytehasbeenusedsince1954 Earlycomputershadveryvaryingbytesizes
Typically46bits Computermemoriesweresmallandexpensive Everybitcountednotaneasydecision,backthen Inthelongrun,8wasmorepractical(powerof2) UsedIBM'sown8bitcharacterencoding,EBCDIC
TheIBMSystem/360startedthe8bittrend,1964

...whicheverybodyhated,butstillexistsoutthere
Extended ASCII
ASCIIassumesthatonlyEnglishwillbeused Moresymbolswereneededforotherlanguages Many8bitextendedASCIIencodingscreated
Oftencalledcodepages,andnumbered
E.g.,IBMPCstandardcodepage437(DOSUS)
Countriespickedtheirownpreferredencoding Bigmess,nointeroperabilitybetweenoperating systemsandcountries
Latin-1
DECVT220terminal,1983
(Onlylike100yearsafterthosefirstteleprinters...) BecametheISO88591standard,alsocalledLatin1 Variants:88592885915,Windows1252, Stillalotofconfusion
MultinationalCharacterSetextendedASCII

Not everyone uses Roman
E.g.,ShiftJIS(8bit)forJapanese
0127areasinASCII,with2exceptions:

~(tilde)(overline) \(backslash)(yen)
161223arekatakana(alphabeticalsymbols) Variablelengthencoding 129159and224239arethefirstof2bytes
6355commonkanji(Chinese)+othersymbols
Hardtojumpinatanypoint:firstorsecondbyte?
Stateful encodings
E.g.,ISCIIforIndianlanguages,orISO2022for Japanese,KoreanandChinese Escapecodeschangethemeaningofallthe followingbytesuntilthenextescapecode
Mayalsochangenumberofbytespercharacter
Evenhardertojumpinatanypoint:mustscan backwardstofindlatestescapecode
How do you know which encoding a text is using?
Mostsoftwareleavesthisproblemtotheuser
Mostfileformatshavenowaytoindicateencoding Arawtextfilehasnoplacetostoresuchmetadata Peopleonsimilarsystemsandinthesamecountry usuallyusethesameencoding Ifnot,youhaveaproblem(yes,you) Convertingencodingscandestroyinformation
Thatitworksatallismainlybyconvention
IANA encoding names
TheIANA(InternetAssignedNumbersAuthority) maintainsthelistofofficialnamesofencodings tobeusedincommunication Thenamesarenotcasesensitive Forexample:
Name:ANSI_X3.41968[RFC1345,KXS2] Alias:ASCII Alias:USASCII(preferredMIMEname) Alias:ISO646US
Charset declarations
Email/HTTPheaders:
Content-Type: text/html; charset=utf-8
XMLdeclaration(atstartofXMLdocument)
<?xml version="1.0" encoding="iso-8859-1" ?>
Ifyoudon'tknowthecharset,howcanyoureadthe textthatsayswhatcharsetitis?
Email/HTTPheadersarealwaysASCIIonly Orguessencodingandseeifyoucanreadenoughto findthedeclarationthattellsyoutherealencoding MostencodingsaredownwardsASCIIcompatible
8 bits can be unsafe
Someoldsystemsassumethat7bitASCIIisused
Theextrabitineachbytewasusedasachecksum (parity)indatatransfers,orasaspecialflag Thatbitwasoftenautomaticallyresetto0 Zerobytes,linebreaks,etc.,couldalsobemangled
Tosend8bitdatathroughsuchsystemsandbe surethatitsurvives,youhavetouseonly charactercodesthatwillnotbetouched
E.g.,Base64(ortheolderUuencode)foremail
[1,128,254]=>"AYD+"
A few simple rules
Rememberwhatencodingyourstringsarein Nevermixdifferentencodingsinthesamestring Whenyouwritetoafile,socket,ordatabase:
Whatencodingisexpected? Doyouneedtoconvertbeforewriting? Isityourjobtoprovideacharsetdeclaration? Doyouknowwhattheencodingis? Doyouneedtoconverttoyourlocalencoding?
Whenyourprogramgetstextfromsomewhere:

Never trust your eyes
Therecanbeseverallayersofinterpretation betweenyouroutputandwhatyousee
Someofthesemaytrytobecleverandhelpyou Therearemanysystemsettingsthatcanaffectthings
Donottrustwhatyoureditorshowsyou Donottrustwhatyouseeintheterminal Ifyouwanttoknowwhat'sinafileordatastream, usealowleveltoollike'od -c <filename>'
Unicode
Ash nazg durbatulk, ash nazg gimbatul, Ash nazg thrakatulk agh burzum-ishi krimpatul. One ring to rule them all, one ring to find them, One ring to bring them all and in the darkness bind them.
Terminology
Peopleusethetermscharacterencoding, charactersetorcharset,codepage,etc., informallytomeanmoreorlessthesamething
Asetofcharacters(andcontrolcodes) ...andhowtorepresentthemasbytes
Buttherearemanydifferentconceptswhenit comestolanguagesandwritingsystems Unicodeterminologytriestoseparatetheconcepts
What is a character?
It'snotthenumberit'sbeengiveninanencoding It'snotthebyte(orbytes)storedinafile It'snotthegraphicalshape(whichfont?) It'sreallymoretheideaofthecharacter Unicodetalksaboutabstractcharactersandgives themuniquenames,suchasLATINCAPITAL LETTERX,orRUNICLETTEROE()
Code points
Eachabstractcharacterisassignedacodepoint Anaturalnumberbetween0and10FFFFhex(overa millionpossiblecodepoints) Dividedintoplanesof216codepointseach
00000000FFFFiscalledtheBasicMultilingualPlane andcontainsmostcharactersinactiveusetoday Mostplanesarecurrentlyunused,soplentyofspace
Justnumbers,notbytesormachinewords Thesubset0255isLatin1(and0127isASCII)
But where do fonts come in?
Encodings(andstringsingeneral)justexpressthe abstractcharacters,nothowtheyshouldlook Afontmapstheabstractcodepointstoglyphs (specificgraphicalshapes):A,A,A,A, A... Youcanusuallyforgetaboutfonts,but:
Thatyoucanhandleacharacter(codepoint)inyour codedoesn'tmeaneverybodycanseethesymbol NosinglefontcontainsallUnicodecodepoints Withoutasuitablefont,theuserseesorsimilar.
Unicode encodings
21bitsareneededforfullUnicode
Unicodewasfirstdesignedfor16bits(notenough) Buttherange010FFFFisfinal(theypromise)
ThereisafamilyofofficialencodingsforUnicode calledUnicodeTransformationFormat(UTF)
UTF8 UTF16 UTF32 andacoupleofothersthatarenotveryuseful
UTF-8
Quicklybecomingtheuniversallypreferred encodinginexternalrepresentationseverywhere DownwardsASCIIcompatible(0127unchanged) UTF8isnotthesamethingasLatin1!
Latin1isfixedlength(exactly1bytepercharacter) UTF8isvariablelength(16bytespercharacter)

CharactersintheBMP(0FFFF)use13bytes Latin1codes128255need2bytes(,etc.) ASCIIcharactersuseasinglebyte
UTF-8 details
Bits Max Byte 1 0....... 110..... 1110.... 11110... 111110.. 1111110. Byte 2 10...... 10...... 10...... 10...... 10...... Byte 3 Byte 4 Byte 5 Byte 6 7 7F 11 7FF 16 FFFF 21 1FFFFF 26 3FFFFFF 31 7FFFFFFF
10...... 10...... 10...... 10...... 10...... 10...... 10...... 10...... 10...... 10......
Firstbytesaysexactlyhowmanybytesareused Bytes26havehighbits10easytofindstartbyte Ifdataiscorrupted,itiseasytofindnextstartbyte PreservessortingorderofUnicodecodepoints EasytorecognizeUTF8text;also,FE/FFnotvalid
UTF-16
Somesystems(e.g.,JavaandtheWindowsAPI) use16bitcharactersafterall,Unicodefirstsaid theyweregoingtousenomorethan16bits UTF16uses2bytespercharacterfor0000FFFF CodepointsaboveFFFFareencodedusingtwo16 bitcharacters(sametrickasinShiftJIS)called surrogatepairs
D800DBFF(high)followedbyDC00DFFF(low) Togethertheygiveanumberfrom1000010FFFF Onlyvalidwhenusedinpairs(andincorrectorder)
UTF-32
UTF32meansthata32bitintegerisusedforeach codepointinthestring
Canofcourseuse64bitintegersorwideraswell Nosurrogatepairsmaybeusedalwaysexpanded
MainlyusedinternallyinAPIsandinchars/strings inprogramminglanguages
Inparticularforfunctionsthatworkonasingle characterratherthanonastring Unixlikesystemsnormallyuse32bitwide characters(wchar_t)alsoinstrings
What about UCS?
UniversalCharacterSet(ISO10646)
MostlythesameasUnicode(samecodepoints) Byteencodingsthatnobodyliked(UTF1,UCS2) thathavebeenobsoletedbyUTF8andUTF16 Notenoughacceptancefromsoftwarevendors UnicodewonandUCSwasadaptedtoUnicode
Basically,theytriedtocreateastandardtoosoon

IfyouseeUCS,thinkUnicodeorobsolete
Endianism
Ifyousend16bitor32bitdataasastreamof bytes,youhavetodecideinwhichordertosend thebytesofeachinteger:highorlowbytesfirst?
Someonewhoreadsthedatamustknowinwhich ordertoreassemblethebytesintointegers
DefaultbyteorderinInternetprotocolsisbig endian(networkorder)
Butusuallyitdependsonthesystem,i.e.,little endianistypicallyusedonx86machines,etc.
UTF8hasnoendianism;thebyteorderisfixed
Byte order mark (BOM)
ThecodepointFEFFcanbeusedtoindicatethe byteorder,byinsertingitfirstinthetext
Iffound,itisnotreallypartofthetext,justahint Zerowidth,nonbreakingspace,i.e.,invisible,ifit occursanywhereelse FEfollowedbyFFimpliesbigendian FFfollowedbyFEimplieslittleendian ThecodepointFFFEisreservedasinvalidin Unicode,sotherecanonlybeoneinterpretation
Don'tuseBOMinUTF8(causesconfusion)
Not only encodings
Unicodeincludesanumberofrulesandalgorithms fornormalizing,sorting,anddisplayingtext Therulesforalphabeticsortingdependonthe language(English,SwedishandGermandiffer) HowdoyouconvertCroatiantexttouppercase? Therearefreesupportlibrariesthathandlethisfor you,likeICUforC/C++
ManyprogramminglanguageshaveICUbindings
Normalization forms
Regardlessofencodingandsurrogatepairs, Unicodetextisvariablelength(eveninUTF32) Therearebasecharactersandcombiningcharacters
(776)isacombiningcharacter(diaeresis) canberepresentedby[196]orby[65,776]

Itcangetmuchmorecomplicatedthanthat Thecombiningcharactersfollowthebasecharacter
Tocomparestrings,theymustbenormalized
NFC(normalformcomposed)isusuallybest
Erlang, Latin-1 and Unicode
Chars? We don't need no steenkin' chars!
Strings are already Unicode
Erlangstringsarejustlistsofintegers Erlangintegersarenotboundedinsize WealreadyuseLatin1(0255)inourstrings Unicodesimplymeanslargernumbersinthelists
Technicallyit'sUTF32oneintegerpercodepoint
Youhavetothinkaboutotherencodingswhenyou readorwritetext,orconverttoorfrombinaries ErlangsourcecodeisstillalwaysLatin1!No Russianinsourcefilesfornow,sorry!
Binaries are chunks of bytes
Thequestionis:howarethecodepointsofthe stringrepresentedasbytesinthebinary? ForLatin1,youuseexactlyonebytepercharacter
list_to_binary(String)/binary_to_list(Bin)
TopackastringasaUTF8binary:
unicode:characters_to_binary("Motrhead")
TounpackaUTF8binarytoastring:
unicode:characters_to_list(<<"Motrhead">>)
IO-lists are also just bytes
AnIOlistcanbe:
Abinary,containinganybytes Alistofintegersand/orotherIOlists(toanydepth) Allintegersmustbebetween0and255
E.g.,[88,[89,90],[32,<<90,89,88>>],] ConcatenatingIOlistsischeap:L3=[L1,L2] IOlistscanbewrittendirectlytofiles/socketsor convertedtobinaryusingiolist_to_binary(List) Latin1stringscanbeuseddirectlyasIOlists
New Unicode type: chardata
ChardataissimilartoanIOlist.Chardatacanbe:
Abinary,containingUTF8encodedcharacters Alistofintegersand/orotherchardata,toanydepth AllintegersareUnicodecodepoints(010FFFF)
E.g.,[1234,<<195,132>>] Librarysupportforconvertingto/fromother encodings,andforwritingtooutput

io:format(~ts, [CharData]) insteadof~s
I/O stream encodings
BeforeUnicodeinErlang,I/Ostreamsassumedthe datawasinLatin1,andnevermodifiedthebytes Thefile:open/2functionnowtakesanoption {encoding, ...}whichbydefaultislatin1 Youcanchangetheencodingofanexistingstream withio:setopts/2orinspectitwithio:getopts/1 TheshellnowusesUnicodeI/Obydefaultifthe OSlanguagesettings(LANGetc.)indicateit
Think before you output
AplainoldflatLatin1stringinErlangisacharlist andcanbeprintedwith~tsaswellaswith~s ALatin1binarycannotbeprintedwith~ts, because~tsexpectsbinariestocontainUTF8 Chardatacannotbewrittendirectlytoafileor socketitmaycontainintegersabove255
IfyouknowitcontainsonlyASCIIintegers(0127) and/orUTF8binaries,andtheoutputstream expectsUTF8,youcantreatitasanIOlist
Switching to Unicode
How can you gradually change your system to support Unicode all the way through?
Start with the output!
Firstofall,makesureyoucanoutputUnicode,in webpages,mail,PDFs,etc.(Safest,andmakes sureyoucanseetheresultsoflaterchanges) Startwithonepage/documentatatimeandtest
GeneratecharsetdeclarationsforUTF8 TransformthetexttoUTF8whenyououtput
Checkthatresultlooksgood,bothforLatin1text (e.g.,),andfullUnicodetext(e.g.,Cyrillic characters)
Get a grip on internal data
Startassumingthatstringscancontaincodes>255 Mostcodeworkingonliststringsneedsnochange Checkallpacking/unpackingofbinaries UseUTF8inbinaries,notLatin1,inparticular whenyoustoretheminthedatabaseorondisk
Considerconvertingdatainoldtables/filestoUTF8
PreferUTF8encodedbinariesratherthanlistsof characterswheneveryouarestoringtext
Don't accept any Unicode input until you're ready!
AllinputtothesystemmustbelimitedtoLatin1 untilyoucanhandleUnicodeallthewaythrough! MakesurethatwebpagesdonotpostUTF8 encodedtextbacktoyou,evenifthetextonthe pageisdeclaredasUTF8
accept-charsetattribute
Checktheinputatruntimeforsafety
Testsendingine.g.,Cyrillicandseewhatbreaks!
Not all code at once!
Incrementallyabigbangupdateofallcodeto suddenlyhandlefullUnicodeisnotfeasibleina largeandcontinuouslyrunningsystem Takecareofonepartofthecodeatatime Convertto/fromLatin1whenyouinterfacewith othercodethatisnotUnicodecompliantyet
Convertgracefully;don'tcrashiftheconversion fails.Replacechars>255with'?'instead.
Last: accept Unicode input
OnceyouletfullUnicodeintothesystematsome point,itwillstarttospreadallovertheplace Entrypoints:HTTPPOSTorGET(URLqueries), XMLRPC,filesviaFTP,etc.Etc. EnsureallUnicodeinputgetsnormalizedtoNFC (composednormalform)attheentrypoint!
Shouldalreadybeguaranteedbytextsentfromweb browsers(W3Cstandard),butyouneedtomakesure
That's all there is to it
What could possibly go wrong?

Encodings, Unicode and Erlang by Richard Carlsson

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Encodings, Unicode and Erlang by Richard Carlsson

Uploaded by

Copyright:

Available Formats

All you wanted to know about Character Encodings but were too afraid to ask

Whyaretheresomanydifferentencodings? Whatdoesalltheterminologymean? WhatdoIneedtokeepinmind? WhatisallthisUnicodestuffanyway? HowdoesUnicodeworkinErlang? HowdoweswitchoursystemtoUnicode?

In the beginning, there was the Teleprinter

Incomingcharactersareprintedtopaper Keypressissenttotheremoteteleprinteror computer(andprintedonthepaper,forfeedback) Thefirstteleprinterswereinventedinthe1870s Baudotcode(5bit,shiftedstateforletters/symbols)

mileBaudot(1baud=1'signallingevent'/second) 1874,some70yearsbeforethefirstcomputers 5bits(32combinations) AZ,09,&()=%/;!'?,:.,ERASE,FIGSHIFT,... ~60charactersintotal,thankstoshiftstate Binary,fixedlength

Yeah? Why not Morse code, while you're at it?

AsinHuffmancoding,commonsymbolsareshorter Shorttone(1unitlength) Longtone(3units) Gap(1unitsilencebetweentoneswithinasymbol) Letterspace(3unitssilence) Wordspace(7unitssilence)

Simpleanduniformisgoodformachines 2levels(on/off)areeasierthan5 Fixedlengthgroupsiseasierthanvariablelength BadpartofBaudotcode:stateful

Lastpressedshiftkeydecidesif00001=Aor1,etc. Thehumanoperatorused5keysatoncelikeapiano chord,andtheshiftstatedidn'tchangeveryoften Stillprettysimpletobuildmachinesthathandleit

Erase(orDelete,orBackspace) Changeshiftstate LineFeed,Bell,andothercontrolcodessoonadded positiononpaper,shiftstate,alerttheoperator...

85 years later: ASCII

AmericanStandardCodeforInformationInterchange 1963,7bitcode(128combinations) Stateless(nocurrentshiftmodetoremember) Bothuppercaseandlowercasecharacters

1000001=A 1100001=a 0000001=CtrlA

But what about 8-bit bytes?

Typically46bits Computermemoriesweresmallandexpensive Everybitcountednotaneasydecision,backthen Inthelongrun,8wasmorepractical(powerof2) UsedIBM'sown8bitcharacterencoding,EBCDIC

ASCIIassumesthatonlyEnglishwillbeused Moresymbolswereneededforotherlanguages Many8bitextendedASCIIencodingscreated

Countriespickedtheirownpreferredencoding Bigmess,nointeroperabilitybetweenoperating systemsandcountries

(Onlylike100yearsafterthosefirstteleprinters...) BecametheISO88591standard,alsocalledLatin1 Variants:88592885915,Windows1252, Stillalotofconfusion

Not everyone uses Roman

161223arekatakana(alphabeticalsymbols) Variablelengthencoding 129159and224239arethefirstof2bytes

E.g.,ISCIIforIndianlanguages,orISO2022for Japanese,KoreanandChinese Escapecodeschangethemeaningofallthe followingbytesuntilthenextescapecode

How do you know which encoding a text is using?

Mostfileformatshavenowaytoindicateencoding Arawtextfilehasnoplacetostoresuchmetadata Peopleonsimilarsystemsandinthesamecountry usuallyusethesameencoding Ifnot,youhaveaproblem(yes,you) Convertingencodingscandestroyinformation

IANA encoding names

TheIANA(InternetAssignedNumbersAuthority) maintainsthelistofofficialnamesofencodings tobeusedincommunication Thenamesarenotcasesensitive Forexample:

Name:ANSI_X3.41968[RFC1345,KXS2] Alias:ASCII Alias:USASCII(preferredMIMEname) Alias:ISO646US

Content-Type: text/html; charset=utf-8

<?xml version="1.0" encoding="iso-8859-1" ?>

Email/HTTPheadersarealwaysASCIIonly Orguessencodingandseeifyoucanreadenoughto findthedeclarationthattellsyoutherealencoding MostencodingsaredownwardsASCIIcompatible

8 bits can be unsafe

Theextrabitineachbytewasusedasachecksum (parity)indatatransfers,orasaspecialflag Thatbitwasoftenautomaticallyresetto0 Zerobytes,linebreaks,etc.,couldalsobemangled

Tosend8bitdatathroughsuchsystemsandbe surethatitsurvives,youhavetouseonly charactercodesthatwillnotbetouched

A few simple rules

Rememberwhatencodingyourstringsarein Nevermixdifferentencodingsinthesamestring Whenyouwritetoafile,socket,ordatabase:

Whatencodingisexpected? Doyouneedtoconvertbeforewriting? Isityourjobtoprovideacharsetdeclaration? Doyouknowwhattheencodingis? Doyouneedtoconverttoyourlocalencoding?

Never trust your eyes

Donottrustwhatyoureditorshowsyou Donottrustwhatyouseeintheterminal Ifyouwanttoknowwhat'sinafileordatastream, usealowleveltoollike'od -c <filename>'

Peopleusethetermscharacterencoding, charactersetorcharset,codepage,etc., informallytomeanmoreorlessthesamething

Buttherearemanydifferentconceptswhenit comestolanguagesandwritingsystems Unicodeterminologytriestoseparatetheconcepts

It'snotthenumberit'sbeengiveninanencoding It'snotthebyte(orbytes)storedinafile It'snotthegraphicalshape(whichfont?) It'sreallymoretheideaofthecharacter Unicodetalksaboutabstractcharactersandgives themuniquenames,suchasLATINCAPITAL LETTERX,orRUNICLETTEROE()

Eachabstractcharacterisassignedacodepoint Anaturalnumberbetween0and10FFFFhex(overa millionpossiblecodepoints) Dividedintoplanesof216codepointseach

00000000FFFFiscalledtheBasicMultilingualPlane andcontainsmostcharactersinactiveusetoday Mostplanesarecurrentlyunused,soplentyofspace

But where do fonts come in?

Encodings(andstringsingeneral)justexpressthe abstractcharacters,nothowtheyshouldlook Afontmapstheabstractcodepointstoglyphs (specificgraphicalshapes):A,A,A,A, A... Youcanusuallyforgetaboutfonts,but:

Thatyoucanhandleacharacter(codepoint)inyour codedoesn'tmeaneverybodycanseethesymbol NosinglefontcontainsallUnicodecodepoints Withoutasuitablefont,theuserseesorsimilar.

UTF8 UTF16 UTF32 andacoupleofothersthatarenotveryuseful

Quicklybecomingtheuniversallypreferred encodinginexternalrepresentationseverywhere DownwardsASCIIcompatible(0127unchanged) UTF8isnotthesamethingasLatin1!

CharactersintheBMP(0FFFF)use13bytes Latin1codes128255need2bytes(,etc.) ASCIIcharactersuseasinglebyte

Firstbytesaysexactlyhowmanybytesareused Bytes26havehighbits10easytofindstartbyte Ifdataiscorrupted,itiseasytofindnextstartbyte PreservessortingorderofUnicodecodepoints EasytorecognizeUTF8text;also,FE/FFnotvalid

Somesystems(e.g.,JavaandtheWindowsAPI) use16bitcharactersafterall,Unicodefirstsaid theyweregoingtousenomorethan16bits UTF16uses2bytespercharacterfor0000FFFF CodepointsaboveFFFFareencodedusingtwo16 bitcharacters(sametrickasinShiftJIS)called surrogatepairs

D800DBFF(high)followedbyDC00DFFF(low) Togethertheygiveanumberfrom1000010FFFF Onlyvalidwhenusedinpairs(andincorrectorder)

Inparticularforfunctionsthatworkonasingle characterratherthanonastring Unixlikesystemsnormallyuse32bitwide characters(wchar_t)alsoinstrings

What about UCS?

MostlythesameasUnicode(samecodepoints) Byteencodingsthatnobodyliked(UTF1,UCS2) thathavebeenobsoletedbyUTF8andUTF16 Notenoughacceptancefromsoftwarevendors UnicodewonandUCSwasadaptedtoUnicode

Ifyousend16bitor32bitdataasastreamof bytes,youhavetodecideinwhichordertosend thebytesofeachinteger:highorlowbytesfirst?

Byte order mark (BOM)

Iffound,itisnotreallypartofthetext,justahint Zerowidth,nonbreakingspace,i.e.,invisible,ifit occursanywhereelse FEfollowedbyFFimpliesbigendian FFfollowedbyFEimplieslittleendian ThecodepointFFFEisreservedasinvalidin Unicode,sotherecanonlybeoneinterpretation

Not only encodings

Unicodeincludesanumberofrulesandalgorithms fornormalizing,sorting,anddisplayingtext Therulesforalphabeticsortingdependonthe language(English,SwedishandGermandiffer) HowdoyouconvertCroatiantexttouppercase? Therearefreesupportlibrariesthathandlethisfor you,likeICUforC/C++

Regardlessofencodingandsurrogatepairs, Unicodetextisvariablelength(eveninUTF32) Therearebasecharactersandcombiningcharacters

Erlang, Latin-1 and Unicode

Chars? We don't need no steenkin' chars!

Strings are already Unicode

Erlangstringsarejustlistsofintegers Erlangintegersarenotboundedinsize WealreadyuseLatin1(0255)inourstrings Unicodesimplymeanslargernumbersinthelists