018 Repraesentation III Online

PSI-EiRBS-B
Introduction to Computer and Operating Systems (Einführung in Rechner- und Betriebssysteme)
INFORMATION AND REPRESENTATION

PART III: ENCODING OF CHARACTERS AND STRINGS
Prof. Dr. Dominik Herrmann

Privacy and Security in Information Systems Group
https://www.uni-bamberg.de/psi/
Contents of this unit
1. Characters and strings
2. Unicode
3. Other encodings
To communicate with humans, computers must be able to process characters and
character strings (strings).
Character (char): letters, digits, symbols ("special characters"), spaces,
control characters
Character string: string (a sequence of characters)
Escaping:
(1) signals that the following sequence refers to a special non-printable
character (see below)
(2) cancels the special role of a character, e.g. \" when strings are written
as literals

Control character   ASCII code   Escape sequence       Use
Line feed           0x0a (10)    \n ("new line")       line break, e.g. on Linux
Carriage return     0x0d (13)    \r                    line break on Windows: \r\n
Alert/Bell          0x07 (7)     \a                    visual or audible signal
any character                    \xhh (hh: hex code)
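The escape sequences from the table can be tried out directly. The following is a small sketch in Python (the slides do not prescribe a language; the variable names are chosen for illustration only):

```python
# Each escape sequence from the table denotes exactly one character.
newline = "\n"
carriage_return = "\r"
bell = "\a"
letter_a = "\x41"      # \xhh: any character by its hex code (0x41 = 65 = "A")

print(ord(newline), ord(carriage_return), ord(bell))  # 10 13 7
print(letter_a)                                        # A

# Second role of escaping: \" cancels the special meaning of the quote,
# so it does not terminate the string literal.
quoted = "He said \"Hallo\""
print(quoted)       # He said "Hallo"
print(len("\""))    # 1
```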
Historical problems
Back in the semi-olden days, when Unix was being invented and K&R were writing
The C Programming Language, everything was very simple. EBCDIC was on its way
out. The only characters that mattered were good old unaccented English
letters, and we had a code for them called ASCII which was able to represent
every character using a number between 32 and 127. Space was 32, the letter
“A” was 65, etc. This could conveniently be stored in 7 bits.
Because bytes have room for up to eight bits, lots of people got to thinking,
“gosh, we can use the codes 128-255 for our own purposes.” The trouble was,
lots of people had this idea at the same time, and they had their own ideas of
what should go where in the space from 128 to 255.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
ASCII: American Standard Code for Information Interchange
Historical problems (2)
The IBM-PC had something that came to be known as the OEM character set […]
As soon as people started buying PCs outside of America all kinds of different
OEM character sets were dreamed up, which all used the top 128 characters for
their own purposes. On some PCs the character code 130 would display as é, but
on computers sold in Israel it was the Hebrew letter Gimel (ג), so when
Americans would send their résumés to Israel they would arrive as rגsumגs.
Eventually this OEM free-for-all got codified in the ANSI standard.
Figure: graphical user interface built from OEM characters
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/,
Ferdinand Ferber (https://de.wikipedia.org/wiki/Datei:VBDOS_IDE.PNG), „VBDOS IDE“, https://creativecommons.org/licenses/by-sa/3.0/de/legalcode
Historical problems (3)
In the ANSI standard (American National Standards Institute), everybody agreed
on what to do below 128, which was pretty much the same as ASCII, but there
were lots of different ways to handle the characters from 128 and on up,
depending on where you lived. These different systems were called code pages.
So for example in Israel DOS used a code page called 862, while Greek users
used 737.
The national versions of MS-DOS had dozens of these code pages, handling
everything from English to Icelandic and they even had a few “multilingual”
code pages that could do Esperanto and Galician on the same computer! Wow!
But getting, say, Hebrew and Greek on the same computer was a complete
impossibility unless you wrote your own custom program that displayed
everything using bitmapped graphics, because Hebrew and Greek required
different code pages with different interpretations of the high numbers.
(Standard 8-bit character set used in Western Europe: ISO-8859-1, aka latin1)
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/,
Ferdinand Ferber (https://de.wikipedia.org/wiki/Datei:VBDOS_IDE.PNG), „VBDOS IDE“, https://creativecommons.org/licenses/by-sa/3.0/de/legalcode
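The code page problem can be reproduced with the legacy codecs that ship with Python. This is only an illustrative sketch: the codec names cp437, cp862 and cp737 are Python's names, and cp437 is assumed here to stand in for the original US OEM character set.

```python
raw = bytes([130])   # one byte with the value 130 (0x82)

# The same byte means something different under every code page.
for codec in ("cp437", "cp862", "cp737"):
    print(codec, raw.decode(codec))

# A résumé written on a US PC (cp437) and displayed on an Israeli PC (cp862):
sent = "résumés".encode("cp437")
received = sent.decode("cp862")
print(received)
```

The bytes on the wire never change; only the table used to interpret the values above 127 differs.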
Programming languages use different schemes for storing strings in memory.
C strings (null-terminated strings)
– The string is stored in memory character by character, followed by NUL
  (\0, i.e. 0x00) as the end marker.
– Determining the length of the string requires reading it completely.
– If the end marker gets lost, the computer keeps reading "forever". (This can
  be abused for buffer overflow attacks.)
– Example: 48 61 6c 6c 6f 00
Pascal strings (length prefix)
– The first byte of the string contains the length (0…255), followed by that
  many characters.
– Longer strings? Use two or more bytes for the length field (trade-off).
– Example: 05 48 61 6c 6c 6f
More modern languages store strings as objects.
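The two memory layouts can be modelled on byte strings. The sketch below uses Python bytes as a stand-in for raw memory; the function names are hypothetical and chosen for illustration only.

```python
def to_c_string(text: bytes) -> bytes:
    """Null-terminated layout: the characters followed by a NUL byte."""
    return text + b"\x00"

def to_pascal_string(text: bytes) -> bytes:
    """Length-prefixed layout: one length byte, then the characters."""
    if len(text) > 255:
        raise ValueError("a one-byte length prefix allows at most 255 characters")
    return bytes([len(text)]) + text

def c_strlen(memory: bytes) -> int:
    """Scan for the end marker; the cost grows with the string length."""
    return memory.index(b"\x00")   # raises ValueError if the NUL is missing

def pascal_strlen(memory: bytes) -> int:
    """The length is available immediately in the first byte."""
    return memory[0]

print(to_c_string(b"Hallo").hex(" "))        # 48 61 6c 6c 6f 00
print(to_pascal_string(b"Hallo").hex(" "))   # 05 48 61 6c 6c 6f
print(c_strlen(to_c_string(b"Hallo")), pascal_strlen(to_pascal_string(b"Hallo")))  # 5 5
```

In this model a missing terminator raises an error. A real C program has no such safety net and would continue reading adjacent memory.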

1. Characters and strings
Char, string, escaping
ASCII, ANSI, code pages
Null-terminated and Pascal strings
2. Unicode
Unicode
Goal: a uniform system for all character sets
The Unicode standard defines symbols for all known languages and cultures.
– Unicode (code points): which symbol is meant? Example: code point U+1F638,
  GRINNING CAT FACE WITH SMILING EYES
– Unicode Collation Algorithm (UCA): how do we sort symbols?
– Unicode Normalization: unifies symbols that can be represented in different
  ways, e.g. "ü" as a single precomposed character or as "u" followed by a
  combining diaeresis (trema)
– Encoding: how are symbols stored or transmitted?
Figure: excerpt from the Unicode standard (C1 Controls and Latin-1 Supplement)
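Code points and normalization can be inspected with Python's standard unicodedata module. A minimal sketch, with variable names chosen for illustration:

```python
import unicodedata

cat = "\U0001F638"
print(f"U+{ord(cat):04X}", unicodedata.name(cat))
# U+1F638 GRINNING CAT FACE WITH SMILING EYES

precomposed = "\u00fc"   # ü as one code point
combined = "u\u0308"     # u followed by COMBINING DIAERESIS

# Both look the same on screen but are different code point sequences.
print(precomposed == combined)                                 # False

# Normalization maps both spellings to one agreed form.
print(unicodedata.normalize("NFC", combined) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combined)   # True
```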
An encoding is a rule for how code points are represented as bytes.
Early encoding proposal: UCS-2 (two bytes per code point)
"Hello" has the following code points:
U+0048 U+0065 U+006C U+006C U+006F
Encoding in bytes:
00 48 00 65 00 6C 00 6C 00 6F
Or perhaps rather like this?
48 00 65 00 6C 00 6C 00 6F 00
Solution: Unicode Byte Order Mark (BOM)
FE FF at the start of the file for big endian, FF FE for little endian
Problems of UCS-2
– The space required for English texts doubles.
– Many C string functions would cut the string off at the first 0x00.
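Python has no codec named UCS-2, but for code points up to U+FFFF its UTF-16 codecs produce the same two-byte units, so they are used here as a stand-in (an assumption of this sketch, not something the slides state):

```python
import codecs

text = "Hello"
print([f"U+{ord(ch):04X}" for ch in text])

big = text.encode("utf-16-be")
little = text.encode("utf-16-le")
print(big.hex(" "))      # 00 48 00 65 00 6c 00 6c 00 6f
print(little.hex(" "))   # 48 00 65 00 6c 00 6c 00 6f 00

# The byte order mark tells the reader which of the two layouts was used.
print(codecs.BOM_UTF16_BE.hex(" "), codecs.BOM_UTF16_LE.hex(" "))   # fe ff ff fe
print((codecs.BOM_UTF16_BE + big).decode("utf-16"))       # Hello
print((codecs.BOM_UTF16_LE + little).decode("utf-16"))    # Hello

# A C-style function that stops at the first 0x00 sees only one character.
print(little.split(b"\x00")[0])   # b'H'
```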
Encodings over time
Figure: Chris55 (https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg#mw-jump-to-license), https://creativecommons.org/licenses/by-sa/4.0/legalcode

Most popular encoding: UTF-8 (Unicode Transformation Format, 8-bit)
Advantages of UTF-8:
– compatible with ASCII
– compact representation for frequent code points
– efficient to implement
– self-synchronizing, i.e. a reader can start "anywhere" in the string and
  still find the next character boundary
UTF-8: variable-length encoding

Code points              Bytes (bit representation)
0000 0000 – 0000 007F    0xxxxxxx
0000 0080 – 0000 07FF    110xxxxx 10xxxxxx
0000 0800 – 0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000 – 0010 FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
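The variable length and the self-synchronizing property can be observed directly. The sketch below relies on Python's built-in UTF-8 codec; next_boundary is a hypothetical helper written for this illustration.

```python
# One sample character per row of the table.
for ch in ["A", "ö", "€", "\U0001F638"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), " ".join(f"{b:08b}" for b in encoded))

# ASCII compatibility: pure ASCII text has identical bytes in both encodings.
print("Hallo".encode("utf-8") == "Hallo".encode("ascii"))   # True

def next_boundary(data: bytes, pos: int) -> int:
    """Skip continuation bytes (10xxxxxx) to reach the start of a character."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos

data = "a€b".encode("utf-8")          # 61 e2 82 ac 62
start = next_boundary(data, 2)        # position 2 lies in the middle of €
print(start, data[start:].decode("utf-8"))   # 4 b
```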
Encoding rule for UTF-8
Example: € (U+20AC)
U+20AC lies in the range 0000 0800 – 0000 FFFF, so the three-byte pattern
1110xxxx 10xxxxxx 10xxxxxx applies.
How do we get from U+20AC to the byte representation e2 82 ac?
20 AC corresponds to 0010.0000 1010.1100
The 16 bits fill the x positions in groups of 4, 6 and 6 bits: 0010 | 000010 | 101100
First byte:  1110.0010 = e2
Second byte: 1000.0010 = 82
Third byte:  1010.1100 = ac
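The encoding rule can be written down as a short function. This is a teaching sketch that follows the four rows of the table; utf8_encode is a hypothetical name, and unlike a production encoder it does not reject surrogate code points (U+D800 to U+DFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x positions of the UTF-8 patterns."""
    low6 = 0b11_1111                      # mask for one 6-bit group
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        return bytes([0b1100_0000 | (code_point >> 6),
                      0b1000_0000 | (code_point & low6)])
    if code_point <= 0xFFFF:
        return bytes([0b1110_0000 | (code_point >> 12),
                      0b1000_0000 | ((code_point >> 6) & low6),
                      0b1000_0000 | (code_point & low6)])
    if code_point <= 0x10FFFF:
        return bytes([0b1111_0000 | (code_point >> 18),
                      0b1000_0000 | ((code_point >> 12) & low6),
                      0b1000_0000 | ((code_point >> 6) & low6),
                      0b1000_0000 | (code_point & low6)])
    raise ValueError("code point outside the Unicode range")

print(utf8_encode(0x20AC).hex(" "))   # e2 82 ac
```

Comparing the result with Python's built-in codec, e.g. `"€".encode("utf-8")`, gives the same three bytes.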
The encoding must be stored alongside the string to avoid interpretation errors.
UTF-8 bytes interpreted as latin1: HallÃ¶chen ("mojibake")
Protocols and data formats usually contain explicit information about the
encoding.
Automatic detection of the encoding is often possible (how?), but it does not
work reliably.
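The mojibake from the slide is easy to reproduce, and a naive detector shows why guessing is unreliable. The function guess_decode is a hypothetical heuristic written for this sketch, not a recommended detection method.

```python
stored = "Hallöchen".encode("utf-8")

print(stored.decode("utf-8"))     # Hallöchen
print(stored.decode("latin-1"))   # HallÃ¶chen  (mojibake)

def guess_decode(data: bytes) -> str:
    """Naive detection: try strict UTF-8 first, otherwise fall back to latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")

# Works here: the latin-1 byte for ö (0xf6) is not valid UTF-8.
print(guess_decode("Hallöchen".encode("latin-1")))   # Hallöchen

# Fails here: the latin-1 text "Ã¶" happens to be valid UTF-8 for "ö".
print(guess_decode("Ã¶".encode("latin-1")))          # ö
```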
How can Unicode characters be entered?
– the operating system's character map
– generate the UTF-8 byte sequence directly
– copy and paste
– where applicable, via dedicated escape sequences, e.g. in HTML by means of
  so-called entities that contain the Unicode code point:
<html>…<body>
&#x1F638;
</body></html>
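Several of these input routes can be imitated in a few lines. The sketch uses Python's standard html module to resolve entities; the variable names are for illustration only.

```python
import html

cat = "\U0001F638"

# HTML entities carry the code point in hexadecimal or decimal form.
entity_hex = f"&#x{ord(cat):X};"
entity_dec = f"&#{ord(cat)};"
print(entity_hex, entity_dec)                  # &#x1F638; &#128568;
print(html.unescape(entity_hex) == cat)        # True
print(html.unescape(entity_dec) == cat)        # True

# Producing the UTF-8 byte sequence directly.
print(bytes.fromhex("f0 9f 98 b8").decode("utf-8") == cat)   # True

# Escape sequence by character name inside a Python string literal.
print("\N{GRINNING CAT FACE WITH SMILING EYES}" == cat)      # True
```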
Peculiarities of working with Unicode
https://www.xkcd.com/1137/
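The right-to-left character (U+202E, RIGHT-TO-LEFT OVERRIDE)
photo_high_re*U+202E*gnp.js
The characters after U+202E are displayed from right to left, so the file name
appears as photo_high_resj.png although it really ends in .js.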
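A program can notice such a file name because the stored characters are unchanged; only the display is reversed. A minimal sketch, where has_bidi_control is a hypothetical check written for this illustration:

```python
import unicodedata

RLO = "\u202e"
filename = "photo_high_re" + RLO + "gnp.js"

print(unicodedata.name(RLO))       # RIGHT-TO-LEFT OVERRIDE
print(filename.endswith(".js"))    # True: the real extension is .js

# Bidirectional classes of the explicit embedding, override and isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def has_bidi_control(text: str) -> bool:
    """Report whether the text contains a direction-changing control character."""
    return any(unicodedata.bidirectional(ch) in BIDI_CONTROLS for ch in text)

print(has_bidi_control(filename))               # True
print(has_bidi_control("photo_high_res.png"))   # False
```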
Further surprises in practice
Zero-width whitespace
– e.g. for watermarking documents
– … or to deceive users
https://lifehacker.com/watch-out-for-this-fake-whatsapp-app-in-the-google-play-1820222637
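Both uses can be sketched in a few lines. The example names and the bit mapping below are assumptions made for illustration; they do not describe which character the app in the linked article actually used.

```python
ZWSP = "\u200b"   # ZERO WIDTH SPACE
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER

# Deceiving users: two names that look alike but compare as different.
genuine = "WhatsApp"
fake = "Whats" + ZWSP + "App"
print(genuine == fake, len(genuine), len(fake))   # False 8 9

# Watermarking: hide a bit string as invisible characters after the text.
def add_watermark(text: str, bits: str) -> str:
    return text + "".join(ZWSP if bit == "0" else ZWNJ for bit in bits)

def read_watermark(text: str) -> str:
    return "".join("0" if ch == ZWSP else "1" for ch in text if ch in (ZWSP, ZWNJ))

def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in (ZWSP, ZWNJ))

marked = add_watermark("Confidential report", "1011")
print(read_watermark(marked))                               # 1011
print(strip_zero_width(marked) == "Confidential report")    # True
```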
Errors when determining the length of strings
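What "length" means depends on the unit that is counted. The following sketch counts code points, UTF-8 bytes and UTF-16 code units for three strings that each appear as a single symbol on screen (the sample selection is an assumption of this illustration):

```python
samples = {
    "precomposed ü": "\u00fc",
    "u + combining diaeresis": "u\u0308",
    "cat emoji": "\U0001F638",
}

def lengths(text: str) -> tuple:
    """Return (code points, UTF-8 bytes, UTF-16 code units)."""
    return (len(text),
            len(text.encode("utf-8")),
            len(text.encode("utf-16-le")) // 2)

for label, text in samples.items():
    print(label, lengths(text))
# precomposed ü (1, 2, 1)
# u + combining diaeresis (2, 3, 2)
# cat emoji (1, 4, 2)
```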
Normalization
Normalisation is very important for identifiers, such as usernames, to help
people enter values in different ways but process them consistently. One
common way of normalising identifiers is to transform everything into
lowercase, making sure that JamesBond is the same as jamesbond.
With so many similar characters and overlapping sets, different languages or
unicode processing libraries might apply different normalisation strategies,
potentially opening security risks if normalisation is done in several places.
In short, don’t assume that lowercase transformations work the same in
different parts of your application. Mikael Goldmann from Spotify wrote up a
nice incident analysis about this issue in 2013, after one of their users
discovered a way to hijack accounts.
Attackers could register unicode variants of other people’s usernames (such as
ᴮᴵᴳᴮᴵᴿᴰ), which would be translated to the same canonical account name
(bigbird). Different layers of the application normalised the word
differently, allowing people to register spoof accounts but reset the password
of the target account.
https://gojko.net/2017/11/07/five-things-about-unicode.html, https://labs.spotify.com/2013/06/18/creative-usernames/
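The effect of two layers normalising differently can be illustrated as follows. This is not Spotify's code; layer_a and layer_b are hypothetical functions that only show how a plain lowercase step and a compatibility normalisation (NFKC) followed by lowercase can disagree.

```python
import unicodedata

spoof = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"   # ᴮᴵᴳᴮᴵᴿᴰ

def layer_a(name: str) -> str:
    """One layer only lowercases."""
    return name.lower()

def layer_b(name: str) -> str:
    """Another layer applies compatibility normalisation, then lowercases."""
    return unicodedata.normalize("NFKC", name).lower()

# For plain ASCII names both layers agree.
print(layer_a("JamesBond"), layer_b("JamesBond"))   # jamesbond jamesbond

# For the spoofed name they do not: only layer_b maps it to the target account.
print(layer_a(spoof) == "bigbird")   # False
print(layer_b(spoof))                # bigbird
```

If registration checks uniqueness with one function and password reset looks up the account with the other, the spoofed name passes the first check and hits the victim's account in the second.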
2. Unicode
Code points versus encoding
UCS-2, byte order mark
UTF-8
Right-to-left
Zero-width whitespace
3. Other encodings
Encodings for other purposes
base64
base64 can represent arbitrary bit strings using printable characters:
For encoding, every three bytes are split into four 6-bit blocks.
Encoding with base64:
1. Append 0 bits (padding) and form blocks of 6 bits each.
   Input:  0100011100100000001001101110001111111000
   Padded: 010001110010000000100110111000111111100000
2. Interpret each block as a number (0 to 63).
   17 50 0 38 56 63 32
3. Look up the characters in the lookup table.
   R y A m 4 / g
4. Append as many = as the input falls short of a length (in bytes) that is
   divisible by 3.
   Encoded string: RyAm4/g=

Lookup table for base64:
0 1 2 3 4 … 26 27 28 29 … 57 58 59 60 61 62 63
A B C D E … a  b  c  d  … 5  6  7  8  9  +  /

By how much does a string grow when it is base64-encoded?
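The four steps can be followed literally in code. The sketch below works on a string of bits for readability rather than speed; b64_encode is a hypothetical name, and the result is compared with Python's standard base64 module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 6)                       # step 1: pad with 0 bits
    blocks = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    numbers = [int(block, 2) for block in blocks]        # step 2: numbers 0..63
    chars = "".join(ALPHABET[n] for n in numbers)        # step 3: table lookup
    return chars + "=" * (-len(data) % 3)                # step 4: append =

data = bytes([0b01000111, 0b00100000, 0b00100110, 0b11100011, 0b11111000])
print(b64_encode(data))                      # RyAm4/g=
print(base64.b64encode(data).decode())       # RyAm4/g=
print(len(data), len(b64_encode(data)))      # 5 8
```

Every started group of three input bytes yields four output characters, so the encoded text is about one third longer than the input.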
base32
– base64 is for machines, base32 is for humans.
– combines every five bytes into eight ASCII characters
– uses 32 characters (digits and uppercase letters)
– The digits 0 and 1 are not used, to avoid confusion with O and I.
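A quick check with the base32 implementation in Python's standard base64 module (assumed here to match the variant meant on the slide) shows the five-to-eight ratio and the restricted alphabet:

```python
import base64

chunk = b"Hallo"                      # exactly five bytes
encoded = base64.b32encode(chunk).decode("ascii")

print(encoded, len(chunk), len(encoded))   # JBQWY3DP 5 8

# Only uppercase letters and the digits 2 to 7 occur; 0 and 1 are never used.
alphabet = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567")
print(set(encoded) <= alphabet)            # True
print(base64.b32decode(encoded) == chunk)  # True
```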
base85
– combines every four bytes into five ASCII characters
– requires 85 characters
– used e.g. in PostScript files
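Python's standard library provides an Ascii85 encoder, base64.a85encode, which is used below as one concrete base85 variant (an assumption of this sketch; other base85 alphabets exist):

```python
import base64

chunk = b"Hall"                        # exactly four bytes
encoded = base64.a85encode(chunk)

print(len(chunk), len(encoded))            # 4 5
print(base64.a85decode(encoded) == chunk)  # True

# Size comparison for 60 input bytes.
payload = bytes(range(60))
print(len(base64.b32encode(payload)),      # 96
      len(base64.b64encode(payload)),      # 80
      len(base64.a85encode(payload)))      # 75
```

The larger the alphabet, the fewer output characters are needed per input byte.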
3. Other encodings
base64 and variants
Data compression
Another class of encodings serves data compression.
Simple lossless method: run-length encoding (RLE)
WWWWWBBBWWWWWB
5 W 3 B 5 W 1 B (reduction by 43%)
But W B W B W B W W W gets longer (worst case):
1W1B1W1B1W1B3W
Variant: a run is marked by writing the character twice, followed by the run
length; single characters are written unchanged.
W W W W W B B B W W W W W B becomes WW5BB3WW5B
W B W B W B W W W becomes W B W B W B W W 3
Image: http://images17.newegg.com/is/image/newegg/12-423-144-TS?$S640W$
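Both RLE schemes fit in a few lines. The function names are hypothetical, and the sketch assumes that the input contains no digits and that a run of exactly two characters is written with the count 2 in the variant (the slide shows no such case).

```python
from itertools import groupby

def rle_basic(text: str) -> str:
    """Write every run as <count><symbol>, even runs of length 1."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))

def rle_variant(text: str) -> str:
    """Write single symbols unchanged; mark runs as <symbol><symbol><count>."""
    parts = []
    for symbol, run in groupby(text):
        count = len(list(run))
        parts.append(symbol if count == 1 else f"{symbol}{symbol}{count}")
    return "".join(parts)

print(rle_basic("WWWWWBBBWWWWWB"))     # 5W3B5W1B
print(rle_basic("WBWBWBWWW"))          # 1W1B1W1B1W1B3W
print(rle_variant("WWWWWBBBWWWWWB"))   # WW5BB3WW5B
print(rle_variant("WBWBWBWWW"))        # WBWBWBWW3
```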
More efficient compression is possible with a dictionary, for example as
proposed by Lempel, Ziv and Welch (LZW); this is used e.g. in GIFs.
Explanation of the idea using strings (A–Z) as an example.
Initial dictionary for A–Z; a 5-bit code is sufficient for it.

Character  Binary  Dec.      Character  Binary  Dec.
#          00000   0         N          01110   14
A          00001   1         O          01111   15
B          00010   2         P          10000   16
C          00011   3         Q          10001   17
D          00100   4         R          10010   18
E          00101   5         S          10011   19
F          00110   6         T          10100   20
G          00111   7         U          10101   21
H          01000   8         V          10110   22
I          01001   9         W          10111   23
J          01010   10        X          11000   24
K          01011   11        Y          11001   25
L          01100   12        Z          11010   26
M          01101   13

Example from https://en.wikipedia.org/w/index.php?title=LZW
We compress the text TOBEORNOTTOBEORTOBEORNOT#

Current seq.  Next  Output code  Bits    Dict. extension  Comment
T             O     20           10100   27: TO           27 = first free slot in the dictionary (after 0–26)
O             B     15           01111   28: OB
B             E     2            00010   29: BE
E             O     5            00101   30: EO
O             R     15           01111   31: OR
R             N     18           10010   32: RN           Code 32 needs 6 bits (100000); output 6-bit codes from here on
N             O     14           001110  33: NO
O             T     15           001111  34: OT
T             T     20           010100  35: TT
TO            B     27           011011  36: TOB
BE            O     29           011101  37: BEO
OR            T     31           011111  38: ORT
TOB           E     36           100100  39: TOBE
EO            R     30           011110  40: EOR
RN            O     32           100000  41: RNO
OT            #     34           100010                   # ends the algorithm
                    0            000000                   Stop code as terminator

Space required uncompressed: 125 bits = 25 characters x 5 bits/character
Compressed: 6 x 5 bits/code + 11 x 6 bits/code = 96 bits
Compression rate (space saved): 23.2%
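The table can be reproduced with a short encoder. This is a sketch under the assumptions of the example (alphabet A–Z, stop symbol #, non-empty input, code width growing from 5 to 6 bits as soon as entry 32 is added); lzw_compress is a hypothetical name and the function is not a general-purpose LZW implementation.

```python
import string

def lzw_compress(text: str, stop: str = "#") -> list:
    """Return (code, bit width) pairs for an A-Z text that ends with `stop`."""
    dictionary = {stop: 0}
    for number, letter in enumerate(string.ascii_uppercase, start=1):
        dictionary[letter] = number
    next_code = 27              # first free slot after 0..26
    width = 5                   # 5 bits cover the codes 0..31
    output = []
    sequence = ""
    for char in text:
        if sequence + char in dictionary:
            sequence += char    # keep extending the current sequence
            continue
        output.append((dictionary[sequence], width))
        if char != stop:        # as in the table, no entry is added for "#"
            dictionary[sequence + char] = next_code
            if next_code >= 2 ** width:   # the new code needs one more bit
                width += 1
            next_code += 1
        sequence = char
    output.append((dictionary[sequence], width))   # stop code
    return output

codes = lzw_compress("TOBEORNOTTOBEORTOBEORNOT#")
print([code for code, _ in codes])
total_bits = sum(width for _, width in codes)
print(total_bits, f"{1 - total_bits / 125:.1%}")   # 96 23.2%
```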
3. Other encodings
base64 and variants
RLE method
LZW method
Content from Chapter 2 of the lecture notes that was not discussed in the lecture
– Structured data (approx. 3 pages)
– Problem and algorithm (approx. 7 pages)
