Professional Documents
Culture Documents
Baudot code?
Only2signallingstates:on/off Alwaysingroupsof5,e.g.,A=00001
Evenearlier,inventedin1844 Varyinglengthofsymbols(E=,Z=)
5differentsignallingstates
Keep it simple
Control confusion
Wehavealreadymixedsymbolswithcontrolcodes
We'retalkingaboutcontrollingateleprinter...
...ratherthanrepresentingcharacters Andin1874,nobodythoughtaboutdataprocessing
Bit5worksasaCapsShift,andbit6asCtrl
AZ,az,and09arecontiguous,foreasyprocessing
Thetermbytehasbeenusedsince1954 Earlycomputershadveryvaryingbytesizes
TheIBMSystem/360startedthe8bittrend,1964
...whicheverybodyhated,butstillexistsoutthere
Extended ASCII
Oftencalledcodepages,andnumbered
E.g.,IBMPCstandardcodepage437(DOSUS)
Latin-1
DECVT220terminal,1983
MultinationalCharacterSetextendedASCII
E.g.,ShiftJIS(8bit)forJapanese
0127areasinASCII,with2exceptions:
~(tilde)(overline) \(backslash)(yen)
6355commonkanji(Chinese)+othersymbols
Hardtojumpinatanypoint:firstorsecondbyte?
Stateful encodings
Mayalsochangenumberofbytespercharacter
Evenhardertojumpinatanypoint:mustscan backwardstofindlatestescapecode
Mostsoftwareleavesthisproblemtotheuser
Thatitworksatallismainlybyconvention
Charset declarations
Email/HTTPheaders:
XMLdeclaration(atstartofXMLdocument)
Ifyoudon'tknowthecharset,howcanyoureadthe textthatsayswhatcharsetitis?
Someoldsystemsassumethat7bitASCIIisused
E.g.,Base64(ortheolderUuencode)foremail
[1,128,254]=>"AYD+"
Whenyourprogramgetstextfromsomewhere:
Therecanbeseverallayersofinterpretation betweenyouroutputandwhatyousee
Someofthesemaytrytobecleverandhelpyou Therearemanysystemsettingsthatcanaffectthings
Unicode
Ash nazg durbatulk, ash nazg gimbatul, Ash nazg thrakatulk agh burzum-ishi krimpatul. One ring to rule them all, one ring to find them, One ring to bring them all and in the darkness bind them.
Terminology
Asetofcharacters(andcontrolcodes) ...andhowtorepresentthemasbytes
What is a character?
Code points
Justnumbers,notbytesormachinewords Thesubset0255isLatin1(and0127isASCII)
Unicode encodings
21bitsareneededforfullUnicode
Unicodewasfirstdesignedfor16bits(notenough) Buttherange010FFFFisfinal(theypromise)
ThereisafamilyofofficialencodingsforUnicode calledUnicodeTransformationFormat(UTF)
UTF-8
Latin1isfixedlength(exactly1bytepercharacter) UTF8isvariablelength(16bytespercharacter)
UTF-8 details
Bits Max Byte 1 0....... 110..... 1110.... 11110... 111110.. 1111110. Byte 2 10...... 10...... 10...... 10...... 10...... Byte 3 Byte 4 Byte 5 Byte 6 7 7F 11 7FF 16 FFFF 21 1FFFFF 26 3FFFFFF 31 7FFFFFFF
10...... 10...... 10...... 10...... 10...... 10...... 10...... 10...... 10...... 10......
UTF-16
UTF-32
UTF32meansthata32bitintegerisusedforeach codepointinthestring
Canofcourseuse64bitintegersorwideraswell Nosurrogatepairsmaybeusedalwaysexpanded
MainlyusedinternallyinAPIsandinchars/strings inprogramminglanguages
UniversalCharacterSet(ISO10646)
Basically,theytriedtocreateastandardtoosoon
IfyouseeUCS,thinkUnicodeorobsolete
Endianism
Someonewhoreadsthedatamustknowinwhich ordertoreassemblethebytesintointegers
DefaultbyteorderinInternetprotocolsisbig endian(networkorder)
Butusuallyitdependsonthesystem,i.e.,little endianistypicallyusedonx86machines,etc.
UTF8hasnoendianism;thebyteorderisfixed
ThecodepointFEFFcanbeusedtoindicatethe byteorder,byinsertingitfirstinthetext
Don'tuseBOMinUTF8(causesconfusion)
ManyprogramminglanguageshaveICUbindings
Normalization forms
(776)isacombiningcharacter(diaeresis) canberepresentedby[196]orby[65,776]
Itcangetmuchmorecomplicatedthanthat Thecombiningcharactersfollowthebasecharacter
Tocomparestrings,theymustbenormalized
NFC(normalformcomposed)isusuallybest
Technicallyit'sUTF32oneintegerpercodepoint
list_to_binary(String)/binary_to_list(Bin)
TopackastringasaUTF8binary:
unicode:characters_to_binary("Motrhead")
TounpackaUTF8binarytoastring:
unicode:characters_to_list(<<"Motrhead">>)
AnIOlistcanbe:
ChardataissimilartoanIOlist.Chardatacanbe:
Switching to Unicode
How can you gradually change your system to support Unicode all the way through?
GeneratecharsetdeclarationsforUTF8 TransformthetexttoUTF8whenyououtput
Considerconvertingdatainoldtables/filestoUTF8
PreferUTF8encodedbinariesratherthanlistsof characterswheneveryouarestoringtext
accept-charsetattribute
Checktheinputatruntimeforsafety
Testsendingine.g.,Cyrillicandseewhatbreaks!
Convertgracefully;don'tcrashiftheconversion fails.Replacechars>255with'?'instead.
Shouldalreadybeguaranteedbytextsentfromweb browsers(W3Cstandard),butyouneedtomakesure