
Distributed Systems

Data Representation

Prof. Dr.-Ing. Torben Weis


Universität Duisburg-Essen
Data

Data transmitted over a network is just bytes

It can represent:
Text
  ASCII, UTF-8, UTF-16, ...
Simple data types
  byte, int, long, double, float, decimal, ...
Objects (data)
  Objects in the sense of OO languages
Classes (code)
  A *.class file, a DLL, or a Python module
Images
  JPEG, GIF, PNG, ...
Documents
  Office Open XML, OpenDocument, PDF, ...
Heterogeneity

Data representation can differ from machine to machine
  Byte order (little endian vs. big endian)
  Text encoding (US-ASCII vs. ISO 8859-1 vs. ...)
  Integers or pointers may have different lengths on different CPUs
    FischerTechnik, Lego Mindstorms: 8 bit
    Intel 8086: 16 bit
    x86: 32 bit
    x86-64: 64 bit

Transformation

A transformation is necessary
From one data format to another
Simple in theory
but usually with impact on performance
Two transformation principles:
1. Pair-wise transformation for every representation
2. Transform to canonical representation

1. Pair-wise Transformation

More code (on the order of n² transformations, one per pair of representations)
Better performance (runs 1 transformation)
Usually the sender transmits an indication of its representation and the receiver "makes it right"
Drawback: the receiver must understand all formats
[Figure: representations R1, R2, R3, ..., Rn on the sender side and R1, R2, R3, ..., Rn on the receiver side]
2. Transform to Canonical Representation

Less code (2n transformations: n to and n from the canonical representation)
Lower performance (runs 2 transformations)
External representation given by the protocol specification or negotiated by the communication partners
[Figure: representations R1, R2, R3, ..., Rn on both sides, with the canonical representation RC in the middle]
Outline

Byte order
Text encoding
Data serialization formats
CDR
ASN.1
XML
JSON
Java object serialization

Byte Order

How to crack an egg?

If you crack it at the big end, you are a big ender
If you crack it at the little end, you are a little ender


Byte Order (2)

How to store a number as a 4-byte integer?

2004 decimal = 0x000007D4 hexadecimal

Big-endian: most significant byte first (start at the left end of 0x 00 00 07 D4)
Address  0  1  2  3
Value    00 00 07 D4

Little-endian: least significant byte first (start at the right end of 0x 00 00 07 D4)
Address  0  1  2  3
Value    D4 07 00 00
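
The two layouts of 2004 can be reproduced with java.nio.ByteBuffer. This is a minimal sketch; the class name ByteOrderDemo and its helper methods are chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Encode a 32-bit integer using the given byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Decode 4 bytes, interpreting them with the given byte order.
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    // Format bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(2004, ByteOrder.BIG_ENDIAN)));    // 00 00 07 D4
        System.out.println(hex(encode(2004, ByteOrder.LITTLE_ENDIAN))); // D4 07 00 00
        // Reading little-endian bytes as big endian yields a different number.
        System.out.println(decode(encode(2004, ByteOrder.LITTLE_ENDIAN), ByteOrder.BIG_ENDIAN));
    }
}
```

The last line of main shows why sender and receiver have to agree on the order: the same four bytes decode to a different value.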
Byte Order (3)

Identical number, different sequence of bytes


But what if sender and receiver use different orders?

Most computer systems use little endian
  Intel x86, x86-64, ...
Some use big endian
  Motorola 68x, PowerPC, ...
Most network protocols use big endian ("network byte order")
  IP, TCP, UDP, DNS, HTTP/2, ...
Text Encoding

Characters and symbols are encoded as bytes


  E.g. A = 0x41, B = 0x42
Number of available characters (character set) depends on the size of an encoded character
  E.g. one character = one byte → maximum of 256 characters
  What about Chinese, Japanese, Greek, ... characters?
Multiple common encodings (character sets)
  US-ASCII, ISO 8859, Unicode, ...

US-ASCII

128 characters
Non-printable
  Carriage return (CR)
  Line feed / newline (LF)
  Tab

Printable
Latin alphabet
Some symbols

Unicode

Unicode assigns a code point to each character
  1.1 million code points in the Universal Coded Character Set (UCS, defined in ISO 10646)
  First 128 code points identical to ASCII
Multiple encoding formats available
  UTF-8: variable length, 1–4 bytes
  UTF-16: variable length, 2 or 4 bytes
  UCS-2: fixed length, 2 bytes
    Cannot represent all possible Unicode characters
  UTF-32: fixed length, 4 bytes
You can use Unicode escapes, e.g. in Java: \u2260
UTF-8

Variable length: 1–4 bytes per character
  Single byte for code points U+0000 to U+007F
    ASCII characters encoded as single byte 0x00 to 0x7F
    Backward compatible with ASCII
  Multiple bytes for code points ≥ U+0080

Code point range           Byte sequence
U-00000000 - U-0000007F    0xxxxxxx
U-00000080 - U-000007FF    110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-8 by Example

The Unicode character U+00A9 = 1010 1001 (©) is encoded in UTF-8 as
  11000010 10101001 = 0xC2 0xA9

The character U+2260 = 0010 0010 0110 0000 (≠) is encoded as:
  11100010 10001001 10100000 = 0xE2 0x89 0xA0

UTF-8 wastes bits on non-ASCII characters

UTF-8 Decoding Algorithm

First bit 0?
  Process 1 byte and return the character
Second bit 0? (we are in the middle of a character)
  Skip 1 byte and go to start
Third bit 0?
  Process 2 bytes and return the character
Fourth bit 0?
  Process 3 bytes and return the character
Otherwise (fifth bit 0)
  Process 4 bytes and return the character

Code point range           Byte sequence
U-00000000 - U-0000007F    0xxxxxxx
U-00000080 - U-000007FF    110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
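
The decision steps above can be sketched in Java as follows. This is a simplified illustration, not a conforming decoder: it does not reject overlong forms, encoded surrogates, or truncated input, and the class name Utf8Decoder is an assumption made for this example. In practice, new String(bytes, StandardCharsets.UTF_8) does the same job.

```java
import java.util.ArrayList;
import java.util.List;

class Utf8Decoder {
    // Decode UTF-8 bytes into a list of Unicode code points (no validation).
    static List<Integer> decode(byte[] in) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            int extra; // number of continuation bytes that follow the lead byte
            int cp;    // code point bits collected so far
            if ((b & 0x80) == 0x00) {        // 0xxxxxxx: single byte
                extra = 0; cp = b;
            } else if ((b & 0xC0) == 0x80) { // 10xxxxxx: middle of a character, skip
                i++;
                continue;
            } else if ((b & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
                extra = 1; cp = b & 0x1F;
            } else if ((b & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
                extra = 2; cp = b & 0x0F;
            } else {                         // 11110xxx: 4-byte sequence
                extra = 3; cp = b & 0x07;
            }
            i++;
            // Each continuation byte contributes its lower 6 bits.
            for (int k = 0; k < extra && i < in.length; k++, i++) {
                cp = (cp << 6) | (in[i] & 0x3F);
            }
            out.add(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC2, (byte) 0xA9, (byte) 0xE2, (byte) 0x89, (byte) 0xA0, 0x41};
        for (int cp : decode(data)) {
            System.out.printf("U+%04X%n", cp); // U+00A9, U+2260, U+0041
        }
    }
}
```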

UTF-16

2 bytes per character for U+0000 to U+FFFF
4 bytes per character above U+FFFF
  Encoded as two 16-bit words called a surrogate pair
UTF-16: byte order of the 16-bit words is unspecified
  Explicitly negotiate UTF-16LE or UTF-16BE (little/big endian)
  Or use a Byte Order Mark (BOM) before the first character
    U+FEFF = Byte Order Mark (invisible character)
    U+FFFE = reserved, not used, so a byte-swapped BOM can be recognized
  A BOM is sometimes used in UTF-8 to recognize UTF-8 text
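
Surrogate pairs and the BOM can be observed with the standard Java charsets. This is a small sketch; the class name Utf16Demo and the code point U+1F600 are chosen only for illustration. Java's generic UTF-16 charset writes a big-endian BOM when encoding, while UTF-16BE and UTF-16LE write none.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    // Format bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it needs two 16-bit code units.
        char[] units = Character.toChars(0x1F600);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00

        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // The generic UTF-16 charset prepends the BOM U+FEFF in big-endian order.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```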

UTF-16

Basic Multilingual Plane (BMP): U+0000 to U+FFFF
  2^16 code points
16 supplementary planes (the first one spans U+010000 to U+01FFFF)
  2^16 additional code points per plane
Surrogate code points refer to another plane
UTF-32

Uses fixed-length encoding
  All characters require 32 bits
In contrast: UTF-8 & UTF-16 use variable-length encoding
UTF-32 only rarely used as external format
  Wastes too much space on average characters
Sometimes used as internal format
  Efficient addressing of single characters
  Access the character at index 100: data[4*100]

DATA SERIALIZATION

Serialization

Serialization: transform data into a byte stream


Purpose: store or transfer data object
External representation of data

Deserialization: load serialized data


Marshalling/Unmarshalling
  Mostly synonymous with serialization/deserialization
  Context: moving data across boundaries

Pickling/Unpickling
  Synonymous with serialization/deserialization
Schema: External Specification

How to indicate the schema of data structures?


E.g. <String, int>

External definition in protocol specification


Protocol defines syntax and semantics
Syntax: server must send string, then number
Semantics: string is text message, number is user id

What if the protocol changes?


Server sends <int, String, int>
Old implementation will interpret int as a String!

Schema: In-band Signaling

In-band signaling of schema


<<Type, Length, String>, <Type, Length, int>>

Client can detect and handle unexpected data


Unknown Type? Skip and continue

But: schema does not define semantics


What is the meaning of the String or the int?
Still relying on external protocol specification

Send a version field in your protocol to recognize changes in semantics
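
The "unknown type? skip and continue" idea can be sketched as follows. The wire layout (1-byte type, 2-byte length, value), the type codes 1 and 2, and the class name TlvDemo are assumptions made for this example and are not taken from any real protocol.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TlvDemo {
    // Hypothetical type codes for this sketch.
    static final int TYPE_STRING = 1;
    static final int TYPE_INT = 2;

    // Append one field as <type (1 byte), length (2 bytes), value>.
    static void putField(ByteBuffer out, int type, byte[] value) {
        out.put((byte) type).putShort((short) value.length).put(value);
    }

    // Read all fields; a field with an unknown type is skipped via its length.
    static List<Object> readFields(ByteBuffer in) {
        List<Object> fields = new ArrayList<>();
        while (in.remaining() >= 3) {
            int type = in.get() & 0xFF;
            int length = in.getShort() & 0xFFFF;
            byte[] value = new byte[length];
            in.get(value);
            if (type == TYPE_STRING) {
                fields.add(new String(value, StandardCharsets.UTF_8));
            } else if (type == TYPE_INT && length == 4) {
                fields.add(ByteBuffer.wrap(value).getInt());
            }
            // Unknown type: the value was consumed, continue with the next field.
        }
        return fields;
    }

    // Build a message with a string, a field of unknown type 99, and an int.
    static ByteBuffer sample() {
        ByteBuffer buf = ByteBuffer.allocate(64);
        putField(buf, TYPE_STRING, "hello".getBytes(StandardCharsets.UTF_8));
        putField(buf, 99, new byte[] {1, 2, 3});
        putField(buf, TYPE_INT, ByteBuffer.allocate(4).putInt(42).array());
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(readFields(sample())); // [hello, 42]
    }
}
```

The reader recovers the structure without knowing every type, but it still cannot tell what the string or the int mean; that remains the job of the protocol specification.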
Common Data Representation (CDR)

Used in CORBA middleware since version 2.0


Principle: "receiver makes it right"
CORBA has
  Simple types (short, long, float, char, ...)
  Complex types (sequence, string, union, struct, ...)
Alignment of types according to size
  Each value starts at an offset that is a multiple of its size
Byte order of sender
  Sender declares its architecture (byte order) in the header
CDR Example

struct <string, unsigned long>
"HELLO", 2004
Sender's byte order: little endian

Address   Value             Meaning
0-3       05 00 00 00       String length (5)
4-8       48 45 4C 4C 4F    Characters H E L L O
9-11      00 00 00          Alignment (padding to a multiple of 4)
12-15     D4 07 00 00       Unsigned long 2004
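
The byte layout of this example can be reproduced with a ByteBuffer. This is a sketch of the simplified layout shown on the slide, not a CORBA implementation: real CDR strings additionally include a terminating NUL byte that is counted in the length. The class name CdrSketch and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class CdrSketch {
    // Write zero bytes until the position is a multiple of `size`.
    static void align(ByteBuffer buf, int size) {
        while (buf.position() % size != 0) {
            buf.put((byte) 0);
        }
    }

    // Encode struct <string, unsigned long> in the sender's (little-endian) order.
    static byte[] encode(String s, long value) {
        byte[] chars = s.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(4 + chars.length + 3 + 4)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(chars.length); // string length as 4-byte integer
        buf.put(chars);           // characters, one byte each
        align(buf, 4);            // the 4-byte value must start at a multiple of 4
        buf.putInt((int) value);  // unsigned long occupies 4 bytes in CDR
        byte[] out = new byte[buf.position()];
        buf.flip();
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : encode("HELLO", 2004)) {
            sb.append(String.format("%02X ", b));
        }
        // 05 00 00 00 48 45 4C 4C 4F 00 00 00 D4 07 00 00
        System.out.println(sb.toString().trim());
    }
}
```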

ISO ASN.1

Abstract Syntax
Describes the structure of data independent of any
programming language

Concrete Syntax
Describes the structure of data for the selected
programming language

Transfer Syntax
Structure of data used for transfer
(e.g. <Type, Length, Value>)

Encoding Rules, e.g. Basic Encoding Rules (BER)


Describe the transformation from concrete to transfer
syntax
ISO ASN.1

Language-independent definition of data types
  Based on predefined data types
    BOOLEAN, INTEGER, ENUMERATED, STRING, ...
Concrete definition derived from the abstract definition
  INTEGER (ASN.1 abstract syntax) → int (concrete syntax, here Java syntax)

ASN.1 Example

Example type definition:


BookType ::= INTEGER {
    paperback(0),
    hardcover(1)
}
Example value definition:
hardcover BookType ::= hardcover(1)

Transfer Syntax: Basic Encoding Rules

Data is transferred in TLV coding (tag, length, value)

Tag | Length | Value

The tag consists of: class | p/c | number
  Tag class (2 bit)
  Primitive or complex type (1 bit)
  Tag number (variable length)

BER Example

Example definition:
BookType ::= INTEGER {
    paperback(0),
    hardcover(1)
}
hardcover BookType ::= hardcover(1)

hardcover is transferred as:

Tag          Length     Value
00 0 00010   00000001   00000001

(Tag: universal class 00, primitive 0, tag number 2 = INTEGER)
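
BER assigns the built-in INTEGER type the universal tag number 2, so a one-byte integer is encoded as three octets. The following is a minimal sketch for values that fit into a single byte; the class name BerSketch and its methods are hypothetical, and multi-byte lengths and values are deliberately left out.

```java
class BerSketch {
    // Encode an INTEGER in the range -128..127 as <tag, length, value>.
    static byte[] encodeSmallInteger(int value) {
        if (value < -128 || value > 127) {
            throw new IllegalArgumentException("sketch supports one-byte values only");
        }
        int tagClass = 0b00;   // universal
        int constructed = 0b0; // primitive
        int tagNumber = 2;     // INTEGER
        byte tag = (byte) ((tagClass << 6) | (constructed << 5) | tagNumber);
        return new byte[] {tag, 1, (byte) value};
    }

    // Render one byte as 8 binary digits.
    static String bits(byte b) {
        return String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (byte b : encodeSmallInteger(1)) {
            sb.append(bits(b)).append(' ');
        }
        System.out.println(sb.toString().trim()); // 00000010 00000001 00000001
    }
}
```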

ASN.1 / BER Summary

Many more features, quite complex standard


Pros:
ASN.1/BER very powerful

Cons:
Coding is not time efficient
(e.g. accessing single bits)
Many applications know the type information, so there is no need to transfer tag and length

Still in use, for example in UMTS or X.509

ASN.1 Example Application: X.509 Certificates

X.509 certificates are ASN.1 encoded


Used for HTTPS or other TLS-secured connections

Can include extension fields
  Ignored by applications that don't know them (unless marked as critical)

BER versus CDR

BER encodes data as Tag & Value


Any BER-encoded sequence of bytes can be decomposed
into its values
However, the semantics of these values is not encoded
Disadvantage: Tags consume space

CDR encodes values only


A CDR-encoded sequence of bytes cannot be decoded
without the corresponding structure definition
However, the receiver most likely knows what to expect and
how to interpret it
Advantage: No data wasted on tags

XML

Data interchange format


Extensible Markup Language
Text-based, self-describing data representation
Meta language (i.e. use this language to define your own data structure)
Platform independent
Widely supported
Specified by W3C for the World Wide Web

XML Basics I

XML is based on markup


Opening and closing tags
<tag></tag>
A tag has
a name
optional content between opening and closing tag
an optional namespace
optional attributes
<tag key="value"></tag>

XML Example

<book bookid="1234">
  <title>XML in 5 minutes</title>
  <author>T.Knowsitall</author>
  <price>0.50</price>
  <availability>1</availability>
  <bookoftheyear/>
</book>

XML Basics II

XML documents must be well-formed


Begin with the XML declaration
<?xml version="1.0"?>
Have one unique root element
All start tags must match end tags (case sensitive)
All elements must be closed
All elements must be properly nested
All attribute values must be quoted
XML entities must be used for special characters
e.g. &lt; for <

XML Structure

XML document: elements in a hierarchical structure
  XML is good at encoding tree structures

Structure somewhat evident for humans
  Only syntax; the semantics are left to guessing
  Verbose text representation

How to encode arbitrary graphs in XML?


E.g. a graph with circular references?

XML Validation

XML documents can be validated


Predefine structure of a document
Check if document matches structure
1. DTDs (Document Type Definitions)
  Extended context-free grammar
  Part of the XML specification; not covered here
2. XML Schema
  Define the structure of a document in XML
  Support for data types (built-in and user-defined)
  http://www.w3.org/TR/xmlschema-0/
XML Schema Example
<element name="book">
  <complexType>
    <sequence>
      <element name="title" type="string"/>
      <element name="author" type="string"/>
      <element name="price" type="string"/>
      <element name="availability" type="string"/>
      <element name="bookoftheyear" minOccurs="0" maxOccurs="1"/>
    </sequence>
    <!-- attribute declarations belong inside the complexType, after the sequence -->
    <attribute name="bookid" type="decimal"/>
  </complexType>
</element>

APIs for Using XML

How to read data from an XML document?


Document Object Model (DOM)
Parses XML as tree into memory
Easy traversal via API (e.g. go to child element)

Simple API for XML (SAX)


Fires events for each XML element (e.g. "here comes a tag", "here comes an attribute", etc.)
Not simple to use at all

XPath & XPointer: address elements in document

XPath / XPointer

XPath: Language for accessing XML elements


The root element: /*
The fifth child element named "FOOB" of the current element: FOOB[5]
The element FOOB whose BAZ attribute is equal to "hello":
FOOB[ @BAZ = "hello" ]

XPointer: another language, e.g.

<foobar id="foo"><bar/><baz><bom/></baz></foobar>
element(foo/2/1)
(selects the first child of the second child of the element with ID foo, i.e. <bom/>)
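
Both the DOM and XPath are available in the JDK through javax.xml.parsers and javax.xml.xpath. The sketch below parses a shortened version of the earlier book example into a DOM tree and evaluates a few XPath expressions against it; the class name XPathDemo and the helper query are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class XPathDemo {
    static final String XML =
            "<book bookid=\"1234\">"
            + "<title>XML in 5 minutes</title>"
            + "<author>T.Knowsitall</author>"
            + "<price>0.50</price>"
            + "</book>";

    // Parse the XML into a DOM tree and evaluate an XPath expression as a string.
    static String query(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(query(XML, "name(/*)"));      // book
        System.out.println(query(XML, "/book/title"));   // XML in 5 minutes
        System.out.println(query(XML, "/book/*[2]"));    // T.Knowsitall
        System.out.println(query(XML, "/book/@bookid")); // 1234
    }
}
```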

XML Summary

Meta language for defining other languages


i.e. data structure in your application

Data interchange format


Breaks documents into hierarchical parts
Gives the parts names
Lets you write rules for how parts form a document

Inefficient
  Expensive conversion from binary data to text
  Text format uses a lot of space

XML Myths

"XML is simple"
"XML is human-readable"
Proof: MS Word → Save → Unpack → XML

JavaScript Object Notation (JSON)

Data interchange format


Originates from JavaScript for Web usage
But has been also implemented for other
programming languages

Text format like XML


But, unlike XML, without query languages

Convert data object to JSON & send


Receive & parse JSON to data object

JSON Example

{
  "bookid": 1234,
  "title": "JSON in 5 minutes",
  "author": "T.Knowsitall",
  "price": 0.50,
  "availability": 1,
  "bookoftheyear": true
}

JSON Data Types

Number (a single numeric type that covers signed integers and floating point)
String
Boolean
Array
  "someArray": [ 1, 2, 3, 4 ]
Object (key/value map)
  { "foo": "bar", "moep": { "nested": "structure" } }
null

XML vs. JSON

JSON is simpler and easier than XML


XML has more features
Query languages
Document transformation via XSLT
Validation via XML Schema

Both are text formats


JSON slightly less verbose

Java Object Serialization

Binary serializer
ObjectOutputStream/ObjectInputStream
ObjectOutputStream objOut = new ObjectOutputStream(outputStream);
objOut.writeObject(nodes[0]);

XML serializer
JAXB Marshaller/Unmarshaller
Object-to-XML mapping customizable
JAXBContext context = JAXBContext.newInstance(Node.class);
Marshaller m = context.createMarshaller();
m.marshal(nodes[0], outputStream);

Binary Serializer

Serializable interface indicates that an object can be serialized
Default serializer:
  Retrieves type/value of all fields via reflection
  transient keyword indicates that a field should be skipped during serialization

import java.io.Serializable;

public class Foobar implements Serializable {
    public int foo;            // written by the default serializer
    public transient int bar;  // skipped by the default serializer
}
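
The effect of transient can be checked with a round trip through ObjectOutputStream and ObjectInputStream. This is a minimal sketch; the class names TransientDemo and Foobar and the helper roundTrip are chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientDemo {
    static class Foobar implements Serializable {
        private static final long serialVersionUID = 1L;
        int foo;
        transient int bar; // not written to the stream
    }

    // Serialize the object into a byte array and deserialize it again.
    static Foobar roundTrip(Foobar original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Foobar) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Foobar f = new Foobar();
        f.foo = 1;
        f.bar = 2;
        Foobar copy = roundTrip(f);
        System.out.println(copy.foo); // 1
        System.out.println(copy.bar); // 0, the default value, because bar is transient
    }
}
```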

Binary Serializer (2)

Custom writeObject() / readObject() methods


Control interchange format
Necessary when generic serializer fails because
special processing required

private void readObject(java.io.ObjectInputStream stream)
        throws IOException, ClassNotFoundException;
private void writeObject(java.io.ObjectOutputStream stream)
        throws IOException;
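
A possible use of these hooks is sketched below: writeObject appends a format version after the default field data, and readObject checks it and rebuilds a derived transient field. The class Samples, the version constant, and the helper roundTrip are hypothetical and serve only as illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class CustomSerializationDemo {
    static class Samples implements Serializable {
        private static final long serialVersionUID = 1L;
        private static final int FORMAT_VERSION = 1; // assumed version number
        int[] values;
        transient int sum; // derived from values, not part of the stream

        Samples(int... values) {
            this.values = values;
            for (int v : values) sum += v;
        }

        private void writeObject(ObjectOutputStream stream) throws IOException {
            stream.defaultWriteObject();     // non-transient fields
            stream.writeInt(FORMAT_VERSION); // extra data controlled by this class
        }

        private void readObject(ObjectInputStream stream)
                throws IOException, ClassNotFoundException {
            stream.defaultReadObject();
            if (stream.readInt() != FORMAT_VERSION) {
                throw new InvalidObjectException("unknown format version");
            }
            sum = 0;
            for (int v : values) sum += v;   // special processing: rebuild derived field
        }
    }

    // Serialize the object into a byte array and deserialize it again.
    static Samples roundTrip(Samples original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Samples) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Samples copy = roundTrip(new Samples(1, 2, 3));
        System.out.println(copy.sum); // 6, recomputed in readObject
    }
}
```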

Binary Serializer (3)

Example output
[Figure: dump of the serialized object with a hex column and a text column]

How Serialization works internally

The serializer traverses the object graph
  This takes time
  Has to detect cycles
The serializer serializes all objects found during the graph traversal
  Every object is stored as a name/value pair
  Type information is stored to recreate the object during deserialization
What happens if the deserializing process lacks the implementation of a serialized type?
XML Serializer

Java Architecture for XML Binding (JAXB)


Alternative frameworks also available

Map Java objects to XML documents


Mapping controlled via annotations

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Foobar {
    @XmlElement
    public int foo;
    public int bar;
}

XML Serializer (2)

Example output

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<foobar>
  <foo>0</foo>
  <bar>0</bar>
</foobar>

Discussion

Dealing with data representation is necessary for interoperability
Use standards in open heterogeneous environments (Internet)
  Interoperable but slow, e.g. XML, JSON
  Binary formats are more efficient than text
Use a specialized format in closed environments
  Fast but less interoperable, e.g. Java Object Serialization
