
Lecture Notes on Information Theory

Prof. Peter Adam Hoeher, Information and Coding Theory Lab, Faculty of Engineering, University of Kiel, Germany, ph@tf.uni-kiel.de, www-ict.tf.uni-kiel.de

Summer Term 2009, last update: 07/07/2008, © Peter A. Hoeher 2001-2009



Contents
Introduction: What is information theory?
Fundamental notions: Probability, information, entropy
Source coding: Typical sequences, Shannon's source coding theorem (lossless source coding), Shannon's rate distortion theorem (lossy source coding), Markov sources, Huffman algorithm, Willems algorithm, Lempel-Ziv algorithm
Channel coding: Shannon's channel coding theorem, channel capacity of discrete and of continuous channels, joint source and channel coding, MAP and ML decoding, Bhattacharyya bound and Gallager bound, Gallager exponent
Cryptology: Classical cipher systems, Shannon's theory of secrecy, redundancy, public key systems

Chapter 0: Introduction
What is information theory?
History of information theory
Block diagram of a communication system
Separation theorem of information theory
Fundamental questions of information theory
Entropy and channel capacity
Cipher systems

What is Information Theory?


Information theory establishes the scientific foundation for all areas and applications of communications (including digital communications).
Claude E. Shannon, 1948: "A Mathematical Theory of Communication"
Source coding (data compression): lossless source coding, lossy source coding
Channel coding (data security: error detection/correction/concealment)
Cryptology (data encryption, cryptanalysis, authentication)
Information theory provides solutions for many other scientific areas (and hence is not just a part of coding theory).
Information theory provides fundamental bounds, but is rarely constructive.

Contributions of Information Theory


Information theory contributes to: communication theory (fundamental bounds), physics (thermodynamics), computer science (Kolmogorov complexity), mathematics (inequalities), statistics (hypothesis testing), probability theory (limit theorems), and economics (portfolio and game theory).

History of Information Theory


1842     S.F.B. Morse    Efficient coding of characters (intuitive)
1918     G.S. Vernam     First secure cipher system (intuitive)
1928     R.V.L. Hartley  First mathematical information measure
1928     H. Nyquist      Fundamentals of digital communications
1948/49  C.E. Shannon    Establishment of information theory
1950     R.W. Hamming    First channel codes
1952     D.A. Huffman    Optimal source coding

Block Diagram of a Communication System

Source → Transmitter → Channel (subject to disturbance) → Receiver → Sink

Examples:
Source: continuous-time signals (e.g. voice, audio, video, analog measurement signals) or discrete-time signals (e.g. characters, data sequences, sampled analog signals)
Channel: wireless channel (radio channel, acoustical channel, infrared), wireline channel (cable or fiber), CD/DVD, magnetic recording, etc.

Separation Theorem of Information Theory


Motivation: Any telecommunication may be done digitally! Shannon, 1948: Information processing and information transmission may be separated, theoretically without loss of quality, into two different problems:
representation of the source outputs by means of binary symbols (bits), irrelevance and redundancy reduction (source coding)
transmission of random symbol sequences via the channel (channel coding)
Remark 1: The separation theorem does not say that the separation must be done in order to achieve optimal performance; it is rather an option, which is (given some constraints) lossless and of great practical interest.
Remark 2: Data encryption may also be separated and should be placed between source and channel coding.

Block Diagram of a Communication System with Source Coding, Encryption, and Channel Coding

Source → Source encoder → Encryption → Channel encoder → Modulator → Channel (subject to disturbance) → Demodulator → Channel decoder → Decryption → Source decoder → Sink
(Binary data is exchanged between source encoder/decoder, encryption/decryption, and channel encoder/decoder.)

Examples for Source Coding, Encryption, and Channel Coding


Source coding:
1. Example: Characters A-Z encoded with ⌈log2(26)⌉ = 5 bits, no data compression:
A [00000], B [00001], C [00010], D [00011], ...
2. Example: Characters A-Z encoded with data compression (Morse code, dot = 0, dash = 1):
A [01], B [1000], C [1010], D [100], ...

Encryption:
Example: A key word is added modulo 2 to the plaintext, e.g. plaintext [100] (= D), key word [101] (random sequence): ciphertext [100] ⊕ [101] = [001]

Channel coding:
Example: repetition code, e.g. info word [001] → code word [00 00 11]

Fundamental Questions of Information Theory


Consider a discrete memoryless source. What is the minimum number of bits/source symbol, R, after lossless source encoding? Answer: Entropy H. On average, each symbol of a discrete-time source can be represented (and recovered with an arbitrarily small error) by R bits/source symbol if R ≥ H, but not if R < H. What is the maximum number of bits/channel symbol, R, after channel encoding? Answer: Channel capacity C. On average, R randomly generated bits/channel symbol can be transmitted via a noisy channel with arbitrarily low error probability if R ≤ C, but not if R > C.

Symbols should not be transmitted individually. Instead, the channel encoder should map the info bits onto the coded symbols so that each info bit influences as many coded symbols as possible.


Entropy and Channel Capacity

Entropy: Uncertainty measure of a random variable X:

H(X) = - Σ_x pX(x) log2 pX(x)  [bits/source symbol],

where pX(x) is the probability mass function of X.

Channel capacity: Maximum number of bits/channel symbol that can be transmitted via a noisy channel with arbitrarily small error probability:

C = max_{pX(x)} ( H(X) - H(X|Y) )  [bits/channel symbol].
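For illustration (added here, not part of the original notes), a minimal Python sketch that evaluates the entropy formula above; the example pmf is the one that reappears later in the prefix-free coding example.

import math

def entropy(pmf, base=2.0):
    """H(X) = -sum_x p(x) * log_b p(x); zero-probability symbols are skipped."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0.0)

print(entropy([0.45, 0.30, 0.15, 0.10]))   # approx. 1.78 bits/source symbol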


Classical Encryption System

Plaintext → Encryption → Ciphertext → (insecure channel) → Decryption → Plaintext

The key is distributed to the receiver via a secure channel.

Encryption System with Public Key

Plaintext → Encryption → Ciphertext → (insecure channel) → Decryption → Plaintext

Encryption uses the public key (published); decryption uses the private key (known to the receiver only).

Chapter 1: Fundamentals
Discrete probability theory: Random variable, probability mass function, joint probability mass function, conditional probability mass function, statistical independence, expected value (mean), variance, sample mean, Bayes' rule
Shannon's information measure: Information, mutual information, entropy, conditional entropy, chain rule for entropy, binary entropy function
Fano's inequality
Data processing theorem


Discrete Probability Theory


A set is a collection of objects (here: realizations of a random experiment) called elements. The space Ω is the set of all possible objects. We will restrict ourselves to a finite space, i.e., Ω = {ω(1), ω(2), . . . , ω(L)}. The elements ω(1), ω(2), . . . , ω(L) are called elementary events. A subset B of a set A is another set whose elements are also elements of A. In probability theory, each subset of Ω is called an event. A discrete random variable X with values over a given alphabet X = {x(1), x(2), . . . , x(Lx)} is a surjective mapping from the space Ω onto the set {x(1), x(2), . . . , x(Lx)}: X : Ω → {x(1), x(2), . . . , x(Lx)}. The number of possible events, Lx = |X|, is called cardinality. If no confusion is possible, we briefly write L instead of Lx for convenience. The event that the value x(i) is obtained by the assignment X(ω(l)) is denoted as {X = x(i)}, where 1 ≤ l ≤ L and 1 ≤ i ≤ Lx.


Discrete Probability Theory


Each event {X = x(i)} is associated with a probability P({X = x(i)}). We use the short-hand notation P(X = x(i)) instead of P({X = x(i)}). The probability mass function of X is defined as

pX(x(i)) = P(X = x(i)),  i ∈ {1, 2, . . . , L},

where 0 ≤ pX(x(i)) ≤ 1 ∀ i and Σ_{i=1}^{L} pX(x(i)) = 1.

If no confusion is possible, we briefly write pX(x(i)) = p(x). A random variable X is called uniformly distributed if pX(x(i)) = 1/L ∀ i, i.e., if all events are equally probable. (Please note that a uniform distribution should not be mixed up with a sequence of identically distributed random variables.)

Discrete Probability Theory


Let f(X) be a real function of a random variable X. The expected value (mean) μ and the variance σ² of f(X) are defined as

μ := E{f(X)} = Σ_{i=1}^{L} f(x(i)) pX(x(i))

and

σ² := E{(f(X) − μ)²} = Σ_{i=1}^{L} (f(x(i)) − μ)² pX(x(i)),

where x(i), i ∈ {1, 2, . . . , L}, are the possible events of X.

Now, we generate n random draws of the random variable X, which we denote as x1, x2, . . . , xn. The sample mean is defined as

f̄(X) = (1/n) Σ_{i=1}^{n} f(xi).

Expected value, variance, and sample mean are only defined if f(X) is real-valued.
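A minimal sketch (added for illustration; the pmf, f(X) = X, and n are arbitrary choices) comparing the sample mean of n random draws with the expected value:

import random

pmf = {0: 0.7, 1: 0.3}                                  # example pmf of X (assumed values)
expected = sum(x * p for x, p in pmf.items())           # E{X} = 0.3

n = 100_000
draws = random.choices(list(pmf), weights=list(pmf.values()), k=n)
sample_mean = sum(draws) / n                            # close to E{X} for large n
print(expected, sample_mean)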


Discrete Probability Theory


Random variables may consist of several single random variables. Example: Z = (X, Y), where X and Y are random variables with values over the alphabets {x(1), x(2), . . . , x(Lx)} and {y(1), y(2), . . . , y(Ly)}. Hence, the events of the random variable Z are over the alphabet {(x(1), y(1)), (x(1), y(2)), . . . , (x(Lx), y(Ly))}. The joint probability mass function pXY(x(i), y(j)) (or p(x, y) for short) is the probability that the events {X = x(i)} and {Y = y(j)} occur at the same time:

pXY(x(i), y(j)) = P({X = x(i)} ∩ {Y = y(j)}) = P(X = x(i), Y = y(j)),

where i ∈ {1, 2, . . . , Lx}, j ∈ {1, 2, . . . , Ly}, 0 ≤ pXY(x(i), y(j)) ≤ 1 ∀ i, j, and Σ_{i=1}^{Lx} Σ_{j=1}^{Ly} pXY(x(i), y(j)) = 1. Furthermore, pXY(x(i), y(j)) = pYX(y(j), x(i)).

Random variables X and Y are called statistically independent if pXY(x(i), y(j)) = pX(x(i)) · pY(y(j)) ∀ i, j.

Discrete Probability Theory


The probability mass function pX(x(i)) is obtained from the joint probability mass function pXY(x(i), y(j)) by a summation over all possible events y(j):

pX(x(i)) = Σ_{j=1}^{Ly} pXY(x(i), y(j)).

Correspondingly, pY(y(j)) = Σ_{i=1}^{Lx} pXY(x(i), y(j)).

Example (x(1) = 0, x(2) = 1, y(1) = 0, y(2) = 1, Lx = Ly = L = 2):
P(X = 0, Y = 0) = pXY(0, 0) = 0.4
P(X = 0, Y = 1) = pXY(0, 1) = 0.3
P(X = 1, Y = 0) = pXY(1, 0) = 0.2
P(X = 1, Y = 1) = pXY(1, 1) = 0.1
P(X = 0) = pX(0) = 0.7, P(X = 1) = pX(1) = 0.3
P(Y = 0) = pY(0) = 0.6, P(Y = 1) = pY(1) = 0.4
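A small sketch (added for illustration, not from the original notes) that recomputes the marginal pmfs of this example from the joint pmf:

# Joint pmf of the example above: p_xy[(x, y)]
p_xy = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

# Marginalization: pX(x) = sum_y pXY(x, y), pY(y) = sum_x pXY(x, y)
p_x = {x: sum(p for (a, b), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (a, b), p in p_xy.items() if b == y) for y in (0, 1)}
print(p_x)   # approx. {0: 0.7, 1: 0.3}
print(p_y)   # approx. {0: 0.6, 1: 0.4}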


Discrete Probability Theory


The conditional probability mass function is defined as

pX|Y(x(i)|y(j)) = pXY(x(i), y(j)) / pY(y(j)),

where pY(y(j)) > 0 ∀ j (Bayes' rule).

Theorem: pX|Y(x(i)|y(j)) = pX(x(i)), if X and Y are statistically independent.
Proof: pX|Y(x(i)|y(j)) = pXY(x(i), y(j)) / pY(y(j)) = pX(x(i)) pY(y(j)) / pY(y(j)) = pX(x(i)).  q.e.d.

According to Bayes' rule, pXY(x, y) = pX(x) · pY|X(y|x). A generalization results in the chain rule

pX1X2X3X4...(x1, x2, x3, x4, . . .) = pX1(x1) · pX2X3X4...|X1(x2, x3, x4, . . . |x1)
= pX1(x1) · pX2|X1(x2|x1) · pX3X4...|X1X2(x3, x4, . . . |x1, x2)
= . . .

Discrete Probability Theory Summary


Given: Random variables X, Y of cardinality Lx = |X|, Ly = |Y|, i.e., X = {x(1), x(2), . . . , x(Lx)} and Y = {y(1), y(2), . . . , y(Ly)}.
Probability mass function: pX(x(i))
Conditional probability mass function: pX|Y(x(i)|y(j))
Joint probability mass function: pXY(x(i), y(j)) = pYX(y(j), x(i))
Definition: X and Y are statistically independent if pXY(x(i), y(j)) = pX(x(i)) · pY(y(j)) ∀ i, j
Bayes' rule: pXY(x(i), y(j)) = pX(x(i)) · pY|X(y(j)|x(i)) = pY(y(j)) · pX|Y(x(i)|y(j))

Hartley's vs. Shannon's Information Measure

Definition of R.V.L. Hartley (1928): The information per symbol is defined as I = logb L = −logb(1/L), where L is the cardinality of the symbol alphabet. Problem: Different probabilities are not considered.

Definition of C.E. Shannon (1948): The information of an event {X = x(i)} is defined as I(X = x(i)) = −logb P(X = x(i)), i ∈ {1, . . . , L}. The smaller the probability of the event, the larger is the information. For a uniformly distributed random variable X, Hartley's and Shannon's information measures are identical. For the basis b = 2 the unit of the information is called bit(s), for b = e ≈ 2.71828 it is called nat(s), and for b = 10 it is called Hartley. Note that "bit" does not mean "binary digit" in this context.

Shannon's Information Measure

Information theory deals with situations in which events are observed in order to learn about other events. Examples:
Given observations about the weather in Hamburg, we would like to estimate the weather in Kiel.
Given a received signal, we would like to estimate the information sequence.
Given a cipher text (cryptogram), we would like to decrypt the text.
Given observations about the stock market, we would like to optimize a portfolio.
Which information gain do we obtain concerning a random variable if we observe another random variable?

Shannon's Information Measure

Definition: The mutual information, which we obtain for an event {X = x(i)} given observations about an event {Y = y(j)}, is denoted as I(X = x(i); Y = y(j)) and is defined as

I(X = x(i); Y = y(j)) = logb [ P(X = x(i)|Y = y(j)) / P(X = x(i)) ] = logb [ pX|Y(x(i)|y(j)) / pX(x(i)) ],

where P(X = x(i)) > 0 and P(Y = y(j)) > 0 are assumed. The basis b is arbitrary and hence is omitted in the following. Since

I(X = x(i); Y = y(j)) = −log P(X = x(i)) + log P(X = x(i)|Y = y(j)) = I(X = x(i)) − I(X = x(i)|Y = y(j)),

the mutual information may be interpreted as the information gain which we obtain concerning an event {X = x(i)} given an event {Y = y(j)}, because I(X = x(i)) can be interpreted as an a priori information and I(X = x(i)|Y = y(j)) as an a posteriori information.

Shannon's Information Measure

Properties of the mutual information:
I(X = x(i); Y = y(j)) = I(Y = y(j); X = x(i)): It is the same whether we obtain information about {X = x(i)} by observing {Y = y(j)} or vice versa.
−∞ ≤ I(X = x(i); Y = y(j)) ≤ −log P(X = x(i)): Equality on the left hand side if P(X = x(i)|Y = y(j)) = 0, equality on the right hand side if P(X = x(i)|Y = y(j)) = 1.
X and Y are statistically independent ⇒ I(X = x(i); Y = y(j)) = 0 ∀ i, j.
The information is obtained as the special case I(X = x(i); X = x(i)) =: I(X = x(i)) = −log P(X = x(i)).

Summary of Important Definitions and Theorems

Assumptions: Let X, Y be random variables and let x(i), y(j) be events, where i ∈ {1, . . . , Lx} and j ∈ {1, . . . , Ly}.

Def. 1: Probability mass function: pX(x(i)) = P({X = x(i)}) = P(X = x(i)).
Def. 2: Joint probability mass function: pXY(x(i), y(j)) = P({X = x(i)} ∩ {Y = y(j)}) = P(X = x(i), Y = y(j)).
Def. 3: Conditional probability mass function: pX|Y(x(i)|y(j)) = pXY(x(i), y(j)) / pY(y(j))  (Bayes' rule).
Def. 4: Information: I(X = x(i)) = −log2 P(X = x(i)) bit = −log2 pX(x(i)) bit. I(X = x(i)) is the necessary information in order to know that the event {X = x(i)} occurs.
Def. 5: Mutual information of events {X = x(i)} and {Y = y(j)}: I(X = x(i); Y = y(j)) = log2 [ P(X = x(i)|Y = y(j)) / P(X = x(i)) ] bit = log2 [ pX|Y(x(i)|y(j)) / pX(x(i)) ] bit. I(X = x(i); Y = y(j)) is the information gain concerning an event {X = x(i)} given an event {Y = y(j)}.
Def. 6: X and Y are statistically independent if pXY(x(i), y(j)) = pX(x(i)) pY(y(j)) ∀ i, j.

Theorem: pX|Y(x(i)|y(j)) = pX(x(i)), if X and Y are statistically independent.
Proof: pX|Y(x(i)|y(j)) = pXY(x(i), y(j)) / pY(y(j)) = pX(x(i)) pY(y(j)) / pY(y(j)) = pX(x(i)).  q.e.d.

Corollary: I(X = x(i); Y = y(j)) = 0 if X and Y are statistically independent.
Proof: I(X = x(i); Y = y(j)) = log2 [ pX|Y(x(i)|y(j)) / pX(x(i)) ] bit = log2 [ pX(x(i)) / pX(x(i)) ] bit = 0.  q.e.d.

Shannon's Information Measure

Definition:

H(X) = E{I(X = x(i))} = E{−log pX(x(i))} = − Σ_{i=1}^{Lx} pX(x(i)) log pX(x(i))

is called entropy of the random variable X. The entropy is the average information (uncertainty) of a random variable.

Definition:

I(X; Y) = E{I(X = x(i); Y = y(j))} = Σ_{i=1}^{Lx} Σ_{j=1}^{Ly} pXY(x(i), y(j)) log [ pX|Y(x(i)|y(j)) / pX(x(i)) ]

is called mutual information between the random variables X and Y. The mutual information is the average information gain of X given Y (or vice versa).

Definition:

H(X, Y) = E{−log pXY(x(i), y(j))} = − Σ_{i=1}^{Lx} Σ_{j=1}^{Ly} pXY(x(i), y(j)) log pXY(x(i), y(j))

is called joint entropy of a pair of discrete random variables X and Y.

Shannon's Information Measure

Theorem: The relation between the mutual information, the entropy, and the joint entropy, respectively, is given by

I(X; Y) = H(X) + H(Y) − H(X, Y),  i.e.,  H(X, Y) = H(X) + H(Y) − I(X; Y).

Theorem: If X takes values over the alphabet {x(1), x(2), . . . , x(Lx)}, then

0 ≤ H(X) ≤ log Lx,

with equality on the left hand side if ∃ i so that pX(x(i)) = 1, and with equality on the right hand side if pX(x(i)) = 1/Lx ∀ i.

Definition:

H(X|Y = y(j)) = − Σ_{i=1}^{Lx} pX|Y(x(i)|y(j)) log pX|Y(x(i)|y(j))

is called conditional entropy of X given the event {Y = y(j)}.

Shannon's Information Measure

Corollary: If pY(y(j)) > 0 (i.e., if pX|Y(x(i)|y(j)) exists), then

0 ≤ H(X|Y = y(j)) ≤ log Lx,

with equality on the left hand side if ∃ i so that pX|Y(x(i)|y(j)) = 1, and with equality on the right hand side if pX|Y(x(i)|y(j)) = 1/Lx ∀ i.

Definition:

H(X|Y) = E{−log pX|Y(x(i)|y(j))}

is the conditional entropy of X given the random variable Y. Therefore,

H(X|Y) = Σ_{j=1}^{Ly} pY(y(j)) H(X|Y = y(j)).

Typically, this is the simplest rule in order to compute H(X|Y).

Corollary: The mutual information is the average information gain about X when observing Y:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).

Shannon's Information Measure

Corollary:

0 ≤ H(X|Y) ≤ log Lx,

with equality on the left hand side if for any j with pY(y(j)) > 0 ∃ i so that pX|Y(x(i)|y(j)) = 1, and with equality on the right hand side if for any j with pY(y(j)) > 0 we have pX|Y(x(i)|y(j)) = 1/Lx ∀ i.

Theorem (chain rule for entropy):

H(X1, X2, . . . , Xn) = H(X1) + H(X2|X1) + H(X3|X1, X2) + . . . + H(Xn|X1, X2, . . . , Xn−1).

Theorem:

H(X|Y) ≤ H(X),

with equality if X and Y are statistically independent.

Corollary: I(X; Y) ≥ 0, with equality if X and Y are statistically independent.

Relations Between the Entropy, Conditional Entropy, Joint Entropy, and Mutual Information (Venn Diagram)

I(X; Y) = H(X) + H(Y) − H(X, Y)
I(X; Y) = H(X) − H(X|Y)
I(X; Y) = H(Y) − H(Y|X)
I(X; Y) = I(Y; X)
I(X; X) = H(X)

(Venn diagrams for the two cases: a) I(X; Y) > 0, b) I(X; Y) = 0.)
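The identities above can be checked numerically. A small sketch (added for illustration) using the joint pmf example from the probability section (pXY(0,0) = 0.4, pXY(0,1) = 0.3, pXY(1,0) = 0.2, pXY(1,1) = 0.1):

from math import log2

p_xy = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}
p_x = {x: sum(p for (a, b), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (a, b), p in p_xy.items() if b == y) for y in (0, 1)}

H_X  = -sum(p * log2(p) for p in p_x.values())
H_Y  = -sum(p * log2(p) for p in p_y.values())
H_XY = -sum(p * log2(p) for p in p_xy.values())
H_X_given_Y = H_XY - H_Y                 # chain rule: H(X,Y) = H(Y) + H(X|Y)

I1 = H_X + H_Y - H_XY                    # I(X;Y) = H(X) + H(Y) - H(X,Y)
I2 = H_X - H_X_given_Y                   # I(X;Y) = H(X) - H(X|Y)
print(round(I1, 6), round(I2, 6))        # both expressions give the same value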

Shannon's Information Measure

Example: Let X be a binary random variable with values x(1) and x(2) (i.e., Lx = 2) and let pX(x(1)) = p and pX(x(2)) = 1 − p. Hence, we obtain H(X) = −p log2 p − (1 − p) log2(1 − p). Since this expression is often used, we denote

h(p) := −p log2 p − (1 − p) log2(1 − p)

and call h(p) the binary entropy function. It follows that h(p) = h(1 − p).

(Plot: h(p) versus p; h(0) = h(1) = 0, maximum h(1/2) = 1 bit.)

Summary of Important Definitions

The information of an event {X = x(i)} is

I(X = x(i)) = −log pX(x(i)),  i ∈ {1, . . . , Lx};

it is the necessary information in order to know that the event {X = x(i)} occurs.

The mutual information between the events {X = x(i)} and {Y = y(j)} is

I(X = x(i); Y = y(j)) = I(X = x(i)) − I(X = x(i)|Y = y(j)) = log [ pX|Y(x(i)|y(j)) / pX(x(i)) ] = log [ pXY(x(i), y(j)) / (pX(x(i)) pY(y(j))) ],

where pX(x(i)) = P(X = x(i)) > 0 and pY(y(j)) = P(Y = y(j)) > 0, i ∈ {1, . . . , Lx}, j ∈ {1, . . . , Ly}.

The entropy (uncertainty) of a random variable X is

H(X) = E{I(X = x(i))} = − Σ_{i=1}^{Lx} pX(x(i)) log pX(x(i)).

The entropy is the average information about this random variable.

The mutual information between the random variables X and Y is

I(X; Y) = E{I(X = x(i); Y = y(j))} = Σ_{i=1}^{Lx} Σ_{j=1}^{Ly} pXY(x(i), y(j)) log [ pXY(x(i), y(j)) / (pX(x(i)) pY(y(j))) ].

Summary of Important Definitions and Theorems

Definitions:
entropy: H(X) := E{−log pX(x(i))} = Σ_{i=1}^{Lx} pX(x(i)) [−log pX(x(i))]
joint entropy: H(X, Y) := E{−log pXY(x(i), y(j))} = Σ_{i=1}^{Lx} Σ_{j=1}^{Ly} pXY(x(i), y(j)) [−log pXY(x(i), y(j))]
conditional entropy: H(X|Y) := E{−log pX|Y(x(i)|y(j))} = Σ_{i=1}^{Lx} Σ_{j=1}^{Ly} pXY(x(i), y(j)) [−log pX|Y(x(i)|y(j))]
mutual information: I(X; Y) := E{ log [ pX|Y(x(i)|y(j)) / pX(x(i)) ] } = Σ_{i=1}^{Lx} Σ_{j=1}^{Ly} pXY(x(i), y(j)) log [ pX|Y(x(i)|y(j)) / pX(x(i)) ]

Theorems:
0 ≤ H(X) ≤ log Lx
0 ≤ H(X|Y) ≤ log Lx
H(X|Y) ≤ H(X) (equality, if X and Y are statistically independent)
chain rule of entropy: H(X1, X2) = H(X1) + H(X2|X1), H(X1, X2|X3) = H(X1|X3) + H(X2|X1, X3)
I(X; Y) ≥ 0 (equality, if X and Y are statistically independent)
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

Example: binary entropy function (Lx = 2): H(X) := h(p) = −p log p − (1 − p) log(1 − p)

Fano's Inequality

Suppose we know a random variable Y and we wish to guess the value of a correlated random variable X, where X and Y share the same alphabet (therefore Lx = Ly = L). Fano's inequality relates the probability of error Pe in guessing the random variable X to its conditional entropy H(X|Y).

Fano's inequality: Let X and Y be random variables with values over the same alphabet {x(1), x(2), . . . , x(L)} and let Pe = P(X ≠ Y). Then,

H(X|Y) ≤ h(Pe) + Pe log2(L − 1).

We can estimate X from Y with zero probability of error if and only if H(X|Y) = 0.
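As an illustration (added, not in the original notes), a tiny sketch evaluating the right-hand side h(Pe) + Pe log2(L − 1) of Fano's inequality; the arguments in the print call are arbitrary example values.

from math import log2

def h(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def fano_bound(pe, L):
    """Upper bound on H(X|Y) for error probability pe and alphabet size L."""
    return h(pe) + pe * log2(L - 1)

print(fano_bound(0.1, 10))   # bound on H(X|Y) for Pe = 0.1, L = 10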


Fano's Inequality

Proof of Fano's inequality: We introduce an error indicator Z:

Z := 0 if X = Y,  Z := 1 else.

Therefore H(Z) = h(Pe), because Z is a binary random variable. According to the chain rule for entropy,

H(X, Z|Y) = H(X|Y) + H(Z|X, Y) = H(X|Y) + 0 = H(X|Y),   (1)

since X and Y unequivocally determine Z. Furthermore, according to the chain rule for entropy,

H(X, Z|Y) = H(Z|Y) + H(X|Y, Z) ≤ H(Z) + H(X|Y, Z).   (2)

Side calculation 1: H(X|Y, Z = 0) = H(X|X) = 0.
Side calculation 2: H(X|Y, Z = 1) ≤ log2(L − 1).

With H(X|Y, Z) = P(Z = 0) H(X|Y, Z = 0) + P(Z = 1) H(X|Y, Z = 1) we get

H(X|Y, Z) ≤ P(Z = 1) log2(L − 1) = Pe log2(L − 1).   (3)

By substituting (3) into (2) and using (1) we obtain Fano's inequality.   q.e.d.

Fano's Inequality

If an error occurs (i.e., if X ≠ Y), Fano's inequality is fulfilled with equality if all L − 1 remaining values are equally probable. The sum h(Pe) + Pe log2(L − 1) is positive for all Pe with 0 < Pe ≤ 1.

(Plot: h(Pe) + Pe log2(L − 1) versus Pe for L = 2, 4, 10.)

Due to Fano's inequality we obtain lower and upper bounds on the error probability Pe given the conditional entropy H(X|Y).

Data Processing Theorem


Suppose we are given two (or more) cascaded channels and/or processors:

X → Channel 1 (or Processor 1) → Y → Channel 2 (or Processor 2) → Z

The channels/processors may be deterministic or stochastic. Z can be influenced by X only indirectly via Y. We say that the sequence (X, Y, Z) forms a Markov chain.

Data processing theorem: Let (X, Y, Z) be a Markov chain. Then,

I(X; Z) ≤ I(X; Y)   and   I(X; Z) ≤ I(Y; Z).

According to the data processing theorem, no information gain can be obtained by means of sequential data processing. (However, information may be transformed in order to become more accessible.)

Data Processing Theorem


Proof of the data processing theorem:
Side calculation: H(X|Z) ≥ H(X|Y, Z), H(Z|X) ≥ H(Z|X, Y).

a) I(X; Z) ≤ I(X; Y):
I(X; Z) = H(X) − H(X|Z) ≤ H(X) − H(X|Y, Z) = H(X) − H(X|Y) = I(X; Y).  q.e.d.

b) I(X; Z) ≤ I(Y; Z):
I(X; Z) = H(Z) − H(Z|X) ≤ H(Z) − H(Z|X, Y) = H(Z) − H(Z|Y) = I(Y; Z).  q.e.d.

Chapter 2: Source Coding I


Typical sequences
Jointly typical sequences
Asymptotic equipartition property (AEP)
Shannon's source coding theorem

Typical Sequences
In order to motivate the abstract notion of typical sequences, we conduct the following experiment: We are given an unfair coin (trick coin): The probability for head is p, the probability for number is q = 1 − p. With just one draw, we cannot estimate p for sure. With n draws, we can estimate p as accurately as desired, if n is sufficiently large.

Proof: Let rn = (number of heads)/n, i.e., the relative frequency of head. According to Chebyshev's inequality:

P(|rn − p| ≥ ε) ≤ σ²/ε² = pq/(n ε²) → 0 for n → ∞,

where σ² is the variance of rn and ε > 0.  q.e.d.

Typical Sequences
Motivated by the previous experiment, from now on we exclusively consider long sequences (e.g. n → ∞). Let us denote the cardinality of the symbol alphabet by L. Hence, L^n possible sequences exist. We separate the set of all L^n possible sequences into two subsets:
1. The first subset consists of all sequences which show about n·p times head. This is the set of so-called typical sequences.
2. The second subset consists of all remaining sequences. This is the set of so-called non-typical sequences.
To start with, we restrict ourselves to sequences of n independent and identically distributed (i.i.d.) random variables.

Typical Sequences
Formal motivation: Let X1, X2, . . . , Xj, . . . , Xn be a sequence of independent and identically distributed (i.i.d.) random variables, where each Xj is defined over an L-ary symbol alphabet X. We denote a certain event (i.e., a sequence of draws) as x = [x1, x2, . . . , xj, . . . , xn] ∈ X^n, where xj ∈ {x(1), x(2), . . . , x(L)}. Since the random variables are assumed to be independent,

pX(x) = Π_{j=1}^{n} pX(xj)

and hence

−(1/n) log pX(x) = −(1/n) Σ_{j=1}^{n} log pX(xj) = (1/n) Σ_{j=1}^{n} I(X = xj),

which is the sample mean of I(X = xj). Furthermore,

H(X) = E{I(X = x(i))},  i ∈ {1, 2, . . . , L}.

We say that an event x = [x1, x2, . . . , xn] is an ε-typical sequence if the sample mean of I(X = xj) and the entropy H(X) differ by ε or less.

Typical Sequences
Definition: A set Aε(X) of ε-typical sequences x = [x1, x2, . . . , xn] is defined as follows:

Aε(X) = { x : | −(1/n) logb pX(x) − H(X) | ≤ ε }.

Theorem (asymptotic equipartition property (AEP)): For any ε > 0 ∃ n, where n is an integer number, so that Aε(X) fulfills the following conditions:
1. P(x ∈ Aε(X)) ≥ 1 − ε
2. ∀ x ∈ Aε(X): | −(1/n) logb pX(x) − H(X) | ≤ ε
3. (1 − ε) b^{n(H(X)−ε)} ≤ |Aε(X)| ≤ b^{n(H(X)+ε)},
where |Aε(X)| is the number of ε-typical sequences.

Typical Sequences
Proof:
Property 1 follows from the convergence of −(1/n) logb pX(x) in the definition of Aε(X).
Property 2 directly follows from the definition of Aε(X).
Property 3 follows from

1 ≥ Σ_{x ∈ Aε(X)} pX(x) ≥ Σ_{x ∈ Aε(X)} b^{−n(H(X)+ε)} = |Aε(X)| b^{−n(H(X)+ε)}

and

1 − ε ≤ Σ_{x ∈ Aε(X)} pX(x) ≤ Σ_{x ∈ Aε(X)} b^{−n(H(X)−ε)} = |Aε(X)| b^{−n(H(X)−ε)}.  q.e.d.

Corollary: For any ε-typical sequence x ∈ Aε(X),

b^{−n(H(X)+ε)} ≤ pX(x) ≤ b^{−n(H(X)−ε)}

holds.

Typical Sequences
Example 1 for typical sequences: We are given a binary random variable X (L = 2) with the values x(1) = head and x(2) = number, where P(X = x(1)) := p = 2/5 and P(X = x(2)) = 1 − p := q = 3/5. According to the binary entropy function, H(X) = h(p = 2/5) = 0.971 bit. We choose n = 5 and ε = 0.0971 (10 % of H(X)). According to the corollary, a sequence x is ε-typical if 0.0247 ≤ pX(x) ≤ 0.0484. Table I lists the L^n = 2^5 = 32 possible sequences (H: head, N: number). According to Table I, 10 out of the 32 possible sequences are ε-typical. (This is a combinatorial problem rather than a stochastic problem.) Note that P(x ∈ Aε(X)) ≈ 0.346, since n = 5 is rather small. Hence, the supposition P(x ∈ Aε(X)) ≥ 1 − ε is not fulfilled.
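A brute-force check (added for illustration, not part of the original notes) that enumerates all 2^5 sequences and counts the ε-typical ones for this example:

from itertools import product
from math import log2

p, n, eps = 2/5, 5, 0.0971
H = -p*log2(p) - (1-p)*log2(1-p)            # H(X) = h(2/5) ≈ 0.971 bit

typical = []
for seq in product("HN", repeat=n):          # H = head (prob p), N = number (prob 1-p)
    prob = p**seq.count("H") * (1-p)**seq.count("N")
    if abs(-log2(prob)/n - H) <= eps:        # epsilon-typicality condition
        typical.append((seq, prob))

print(len(typical))                          # 10 typical sequences
print(sum(prob for _, prob in typical))      # P(x in A_eps) ≈ 0.346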

Typical Sequences: Table I

(Table I lists all 2^5 = 32 sequences of length n = 5 with their probabilities; the 10 ε-typical sequences are marked.)

Typical Sequences: Table II


Example 2 for typical sequences: Now we choose p = 0.11 (therefore H(X) = h(p) ≈ 0.5 bit) and ε = 0.05 (10 % of H(X)), and we consider large values of n. The results are summarized in Table II:

n      | (1−ε)·2^{n(H(X)−ε)} | |Aε(X)|     | 2^{n(H(X)+ε)} | P(x ∈ Aε(X)) ≥ 1 − ε = 0.95 ?
100    | 2^44.92             | 2^50.10     | 2^54.99       | 0.3676 < 0.9500
200    | 2^89.91             | 2^105.38    | 2^109.98      | 0.5711 < 0.9500
500    | 2^224.88            | 2^269.19    | 2^274.96      | 0.7760 < 0.9500
1000   | 2^449.84            | 2^541.87    | 2^549.92      | 0.9049 < 0.9500
2000   | 2^899.76            | 2^1090.53   | 2^1099.83     | 0.9834 > 0.9500
5000   | 2^2249.51           | 2^2731.75   | 2^2749.58     | 0.9998 > 0.9500
10000  | 2^4499.09           | 2^5471.45   | 2^5499.16     | 1.0000 > 0.9500
20000  | 2^8998.25           | 2^10951.35  | 2^10998.32    | 1.0000 > 0.9500
50000  | 2^22495.72          | 2^27389.08  | 2^27495.80    | 1.0000 > 0.9500
100000 | 2^44991.52          | 2^54787.76  | 2^54991.60    | 1.0000 > 0.9500

The number of ε-typical sequences of length n = 10000 amounts to only 2^5471.45. This fraction of about 10^−1361 of all possible sequences contributes almost 100 % to the total probability. The data sequence can be compressed by almost 50 %.
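The probabilities in the last column can be recomputed by summing the binomial distribution over the typical set. A sketch (added for illustration; exact values may differ slightly depending on rounding):

from math import comb, log2

def prob_typical(n, p=0.11, eps=0.05):
    """P(x in A_eps(X)) for an i.i.d. binary source with P(1) = p."""
    H = -p*log2(p) - (1-p)*log2(1-p)
    total = 0.0
    for k in range(n + 1):                        # k = number of ones in the sequence
        info = -(k*log2(p) + (n-k)*log2(1-p))/n   # -(1/n) log2 p(x)
        if abs(info - H) <= eps:
            total += comb(n, k) * p**k * (1-p)**(n-k)
    return total

print(prob_typical(100))    # ≈ 0.37 (cf. Table II)
print(prob_typical(1000))   # ≈ 0.90 (cf. Table II)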

Typical Sequences
Properties of typical sequences:
All typical sequences have about the same probability b^{−nH(X)}.
The sum of the probabilities of all typical sequences is nearly one.
Nevertheless, the number of typical sequences, |Aε(X)| ≈ b^{nH(X)}, is very small when compared with the total number of all possible sequences.
Although the number of typical sequences is very small, they contribute almost the entire probability. However, the typical sequences are not the most likely sequences! (Example: The sequence with n times number is more likely than any typical sequence, if p < 1/2.)
In information theory, only typical sequences are considered, because the probability that a non-typical sequence occurs is arbitrarily small if n → ∞.

Jointly Typical Sequences


Now we consider random variables X and Y with joint probability mass function pXY(x, y) and denote by (X1, Y1), (X2, Y2), . . . , (Xn, Yn) the sequence of pairs (Xi, Yi), where i ∈ {1, 2, . . . , n}. Examples: Xi is the i-th channel input symbol, Yi is the i-th channel output symbol.

Definition: The set Aε(X, Y) of jointly ε-typical sequences (x, y) = [(x1, y1), (x2, y2), . . . , (xn, yn)] of length n is defined as follows:

Aε(X, Y) = { (x, y) : | −(1/n) logb pX(x) − H(X) | ≤ ε, | −(1/n) logb pY(y) − H(Y) | ≤ ε, | −(1/n) logb pXY(x, y) − H(X, Y) | ≤ ε },

where pXY(x, y) = Π_{i=1}^{n} pXY(xi, yi).

Jointly Typical Sequences


Hence, a pair of sequences x and y is jointly ε-typical if x, y, and (x, y) are each ε-typical.

Theorem (asymptotic equipartition property (AEP)): For any ε > 0 ∃ n, where n is an integer number, so that Aε(X, Y) fulfills the following conditions:
1. P((x, y) ∈ Aε(X, Y)) ≥ 1 − ε
2. ∀ (x, y) ∈ Aε(X, Y): | −(1/n) logb pXY(x, y) − H(X, Y) | ≤ ε
3. (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |Aε(X, Y)| ≤ 2^{n(H(X,Y)+ε)},
where |Aε(X, Y)| is the number of jointly ε-typical sequences.

Jointly Typical Sequences


Proof: The proof is the same as above, where X is substituted by (X, Y).

(Figure: there are about 2^{nH(X)} typical x-sequences and 2^{nH(Y)} typical y-sequences, but only about 2^{nH(X,Y)} jointly ε-typical pairs.)

Corollary: For any jointly ε-typical sequence (x, y) ∈ Aε(X, Y),

2^{−n(H(X,Y)+ε)} ≤ pXY(x, y) ≤ 2^{−n(H(X,Y)−ε)}

holds.

Block Diagram of a Communication System with Source Coding and Channel Coding

Source → q → Source encoder → u (binary data) → Channel encoder → x → Modulator → Channel (subject to distortion) → Demodulator → y → Channel decoder → û (binary data) → Source decoder → q̂ → Sink

Shannon's Source Coding Theorem

Consider a discrete memoryless source generating a sequence of source symbols q = [q1, q2, . . . , qn], where each source symbol is defined over a finite L-ary alphabet Q. The entropy of the source symbols is denoted as H(Q) bits/source symbol. Furthermore, let us assume error-free transmission, i.e., û = u. What is the minimum number of bits/source symbol at the output of a lossless source encoder?

Theorem (Shannon's source coding theorem): Given the constraint that n → ∞, it is necessary and sufficient that lossless source encoding is done on average with H(Q) bits/source symbol.

A possible encoding rule is to enumerate each sequence in Aε(Q) and to encode the ε-typical sequences by the binary representation of their index.
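A toy sketch of this enumerative encoding rule (added for illustration; it reuses the binary coin source from the typical-sequence example and simply ignores the non-typical sequences, whose total probability vanishes for large n):

from itertools import product
from math import log2, ceil

p, n, eps = 2/5, 5, 0.0971
H = -p*log2(p) - (1-p)*log2(1-p)

def is_typical(seq):
    prob = p**seq.count("H") * (1-p)**seq.count("N")
    return abs(-log2(prob)/n - H) <= eps

# Enumerate the epsilon-typical sequences and assign each an index.
typical = [seq for seq in product("HN", repeat=n) if is_typical(seq)]
index = {seq: i for i, seq in enumerate(typical)}
bits = ceil(log2(len(typical)))            # bits needed per typical sequence

def encode(seq):
    """Binary representation of the index of a typical sequence."""
    return format(index[seq], f"0{bits}b")

print(bits, encode(("H", "H", "N", "N", "N")))   # 4-bit index for one typical sequence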


Shannon's Source Coding Theorem

Proof of the source coding theorem: For an arbitrary ε > 0 a sufficiently large integer n exists, so that a sequence of n source symbols, q = [q1, q2, . . . , qn], is uniquely encoded with n(H(Q) + ε) bits, apart from a set of non-typical source sequences whose total probability is less than ε. This follows from the existence of a set Aε(Q) of sequences q which fulfill the first and the second condition of the AEP. According to the right hand side of the inequality of the third property of the AEP, n(H(Q) + ε) bits are sufficient, while the first condition of the AEP guarantees that P(q ∉ Aε(Q)) < ε. However, if we would use just n(H(Q) − 2ε) bits, we could encode only a small subset of all typical sequences. This follows from the left hand side of the inequality of the third condition of the AEP. Hence, H(Q) bits/source symbol are not just sufficient, but also necessary, if for long sequences the probability for a perfect reconstruction approaches one.  q.e.d.

Chapter 3: Channel Coding I


Discrete memoryless channel (DMC)
Channel encoding and decoding, random code
Shannon's channel coding theorem, channel capacity
Examples for the channel capacity: uniformly dispersive DMC, uniformly focussing DMC, strongly symmetric DMC, symmetric DMC

Block Diagram of a Communication System with Source and Channel Coding


Source → q → Source encoder → u (binary data) → Channel encoder → x → Modulator → Channel (subject to distortion) → Demodulator → y → Channel decoder → û (binary data) → Source decoder → q̂ → Sink
The discrete channel model pY|X(y|x) comprises the modulator, the physical channel, and the demodulator.

Discrete Memoryless Channel (DMC)


We now consider a special class of transmission channels. Let x ∈ X denote a channel input symbol and y ∈ Y denote a channel output symbol, respectively, where X is the input alphabet, Y is the output alphabet, Lx = |X| ≥ 2, and Ly = |Y| ≥ 2.

Definition: A channel model is called a discrete memoryless channel if the conditional joint probability mass function pY|X(y|x) can be factorized as follows:

pY|X(y|x) = pY1...Yn|X1...Xn(y1, . . . , yn|x1, . . . , xn) = Π_{i=1}^{n} pY|X(yi|xi).

Hence, the discrete memoryless channel is completely defined by the conditional probability mass function pY|X(y|x). Interpretation: Given the constraint that symbol x is transmitted, the received symbol y randomly occurs with probability pY|X(y|x).

Discrete Memoryless Channel (DMC)


Example: Binary symmetric channel (BSC)

Input X ∈ {0, 1}, output Y ∈ {0, 1}; each input bit is received correctly with probability 1 − p and flipped with probability p:

pY|X(0|0) = pY|X(1|1) = 1 − p,
pY|X(1|0) = pY|X(0|1) = p.
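A minimal simulation sketch of the BSC (added for illustration; the crossover probability and the input sequence are arbitrary choices):

import random

def bsc(bits, p, rng=random.Random(0)):
    """Pass a bit sequence through a binary symmetric channel with crossover probability p."""
    return [b ^ (rng.random() < p) for b in bits]

x = [0, 1, 1, 0, 1, 0, 0, 1]
y = bsc(x, p=0.1)
print(x)
print(y)   # each bit of x is flipped independently with probability 0.1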


Channel Coding and Channel Decoding


The binary data sequence delivered by the source encoder will now be partitioned into sequences [u1, u2, . . . , uk], where each sequence consists of k bits. The channel encoder uniquely assigns to each info word u = [u1, u2, . . . , uk] of length k a code word x = [x1, x2, . . . , xn] of length n. According to the unique assignment, only 2^k instead of 2^n different code words exist. The elements of the info word, ui, are binary. The elements of the code word, xi, are not necessarily binary; these elements are called channel symbols.

Definition: The transmission rate is R = k/n bits/channel symbol. Hence, 2^{nR} code words exist. We assume that all info words (and hence all code words) are equally likely. We denote the average word error probability after decoding by Pw:

Pw = 2^{−nR} Σ_u P(û ≠ u | u transmitted).

Channel Coding and Channel Decoding

u (k bits) → Channel encoder (opt. nonbinary) → x (n symbols) → DMC → y (n symbols) → Channel decoder → û (k bits)
Rate R = k/n [bits/channel symbol]; word error probability after decoding: Pw.

The technical device for coding is called encoder. The set of all coded sequences, i.e. the set of all code words, is called code. We restrict ourselves to so-called (n, k) block codes.

Random Code
Definition: An (n, k) random code of rate R = k/n consists of 2^{nR} randomly generated code words x of length n, so that x ∈ Aε(X). The 2^{nR} info words u are randomly but uniquely assigned to the code words.

Example: p = 2/5, n = 5, k = 3, ε = 0.0971 (as in Example 1): We would like to generate a binary random code of rate R = k/n = 3/5. The set of all ε-typical sequences consists of |Aε(X)| = 10 sequences (see table). From the set of all ε-typical sequences, we randomly select 2^{nR} = 8 sequences. These are our code words. The info words are assigned arbitrarily:

Code words x (candidate ε-typical sequences): [00011], [00101], [00110], [01001], [01010], [01100], [10001], [10010], [10100], [11000]
Indices i of the selected code words: 3, 6, 0, 2, 5, 1, 7, 4
Assigned info words u: [011], [110], [000], [010], [101], [001], [111], [100]

Example: u(3) = [011], x(3) = [00011]

Shannon's Channel Coding Theorem

Repetition: The mutual information can be written as

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).

Here, I(X; Y) is the average information gain with respect to the channel input symbol given an observation of the channel output symbol.

Special designations:
I(X; Y): transinformation
H(Y|X): irrelevance
H(X|Y): equivocation

Shannon's Channel Coding Theorem

Definition: For a DMC a transmission rate R = k/n is achievable if for any arbitrarily small ε > 0 and all sufficiently large n an (n, k) code exists so that Pw < ε. The channel capacity C of a discrete memoryless channel is the maximum of all achievable rates R for ε → 0.

Theorem (Shannon's channel coding theorem): The channel capacity of a discrete memoryless channel is

C = max_{pX(x)} I(X; Y).

The unit of the channel capacity is bits/channel symbol (we also say bits/channel use) if the logarithm with basis b = 2 is used.

Proof: We have to verify that:
Each rate R < C is achievable, i.e., there exists at least one code of length n so that Pw → 0 if n → ∞.
The corresponding converse states that a positive lower bound for Pw exists if R > C, i.e., there exists a lower bound on Pw that cannot be improved by any channel code.

Shannon's Channel Coding Theorem

Proof of the channel coding theorem:
a) Achievability: Firstly, we prove the asymptotic (i.e., for large n) achievability. Let p*_X(x) be a probability mass function which maximizes the mutual information,

C = max_{pX(x)} I(X; Y).

We randomly (!) choose the 2^{nR} code words, where the code words are independent and identically distributed sequences x. Hence, each code word occurs with probability

p*_X(x) = Π_{i=1}^{n} p*_X(xi).

We number all code words and denote the corresponding index by w ∈ {1, 2, . . . , 2^{nR}}, i.e., x(w) is the w-th code word. Decoding is done by choosing an index ŵ for each received word y so that

(x(ŵ), y) ∈ Aε(X, Y),

assuming that such an index exists and that the index is unique.

Shannon's Channel Coding Theorem

The average error probability consists of two independent error events:
1. Such an ε-typical pair does not exist.
2. This pair is not unique.
To 1): According to the first condition of the AEP, the error probability of this event converges to zero if n → ∞.
To 2): The error probability of this event is less than or equal to 2^{n(R−C)}, i.e., this error probability also converges to zero if R < C and n → ∞ (proof: see literature).
From 1) and 2): The average error probability for the randomly chosen code converges to zero. (Note that this is an existence proof, rather than a design rule for good codes!) There exists at least one code so that Pw → 0 for n → ∞ and R < C.  q.e.d.

Shannon's Channel Coding Theorem

Proof of the channel coding theorem:
b) Converse: Now we prove that it is impossible to transmit information with arbitrarily small error probability at a rate R > C via a DMC. The proof is conducted for a so-called binary symmetric source (BSS): U = [U1, U2, . . . , Uk] is a sequence of k independent and identically distributed binary random variables with pU(U = 0) = pU(U = 1) = 1/2. Hence, the entropy of the BSS is H(U) = 1 bit/source symbol, therefore H(U) = k bits for the whole info word U. According to the relation between entropy and mutual information,

H(U|Û) = H(U) − I(U; Û).

Applying the data processing theorem yields

H(U|Û) ≥ H(U) − I(X; Y)   (here: X channel input, Y channel output).

According to the definition of the channel capacity of a DMC,

H(U|Û) ≥ H(U) − nC = k − nC.   (*)

Shannon's Channel Coding Theorem

Furthermore, according to Fano's inequality,

H(U|Û) ≤ h(Pw) + Pw log(L − 1),  where L = 2^k.   (**)

By combining (*) and (**), we get

h(Pw) + Pw log(2^k − 1) ≥ k(1 − C/R),  where R = k/n.

Therefore,

Pw ≥ ( k(1 − C/R) − h(Pw) ) / log(2^k − 1).

This inequality can be further bounded as

Pw ≥ ( k(1 − C/R) − 1 ) / k = 1 − C/R − 1/k.

Hence Pw is bounded away from zero (for large k) if R > C.  q.e.d.

Examples for Shannon's Channel Coding Theorem

Definition: A DMC with input alphabet X and output alphabet Y is called uniformly dispersive if the probabilities for the Ly = |Y| branches which leave an input symbol take for each of the Lx = |X| input symbols the same values p1, p2, . . ., p|Y|. Possibly, some of the transition probabilities are zero, i.e., the corresponding branches are missing.

Theorem: The channel capacity of a uniformly dispersive DMC is given by

C = max_{pX(x)} H(Y) + Σ_{j=1}^{|Y|} pj log pj.

Example: The channel capacity of a binary symmetric channel (BSC) with transition probabilities p and q = 1 − p is given by

C_BSC = max_{pX(x)} H(Y) + p log p + (1 − p) log(1 − p) = 1 − h(p).
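A one-liner (added for illustration) evaluating C_BSC = 1 − h(p):

from math import log2

def bsc_capacity(p):
    """Capacity of the binary symmetric channel, C = 1 - h(p), in bits/channel use."""
    h = 0.0 if p in (0.0, 1.0) else -p*log2(p) - (1-p)*log2(1-p)
    return 1.0 - h

print(bsc_capacity(0.11))   # ≈ 0.5 bits/channel use (cf. the typical-sequence example)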


Examples for Shannon's Channel Coding Theorem

Definition: A DMC with input alphabet X and output alphabet Y is called uniformly focussing if the probabilities for the Lx = |X| branches which merge in an output symbol take for each of the Ly = |Y| output symbols the same values p1, p2, . . ., p|X|. Possibly, some of the transition probabilities are zero, i.e., the corresponding branches are missing.

Theorem: For a uniformly focussing DMC,

max_{pX(x)} H(Y) = log |Y|

holds.

Example: The BSC is uniformly focussing.

Examples for Shannon's Channel Coding Theorem

Definition: A DMC which is both uniformly dispersive as well as uniformly focussing is called strongly symmetric.

Theorem: The channel capacity of a strongly symmetric DMC is given by

C = log |Y| + Σ_{j=1}^{|Y|} pj log pj.

The channel capacity is obtained for pX(x) = 1/|X| for all x.

Example: (Figure: a ternary DMC with X = Y = {0, 1, 2} and transition probabilities 1 − p and p.) This DMC is strongly symmetric and has the channel capacity

C = log2(3) − h(p) bits/channel symbol.

Examples for Shannon's Channel Coding Theorem

Definition: A DMC with input alphabet X and output alphabet Y is called symmetric if the DMC can be partitioned into m strongly symmetric component channels (i.e., Yi ∩ Yj = ∅ for i ≠ j, ∪_{i=1}^{m} Yi = Y), where every component channel i consists of |X| inputs and |Yi| outputs, where |Yi| < |Y| and i = 1, 2, . . . , m. The probabilities for the component channels are denoted by qi.

Theorem: The channel capacity of a symmetric DMC is given by

C_sym = Σ_{i=1}^{m} qi Ci,

where qi is the probability and Ci is the channel capacity of the i-th strongly symmetric component channel.

Example: Binary erasure channel (BEC).

Examples for Shannon's Channel Coding Theorem

Definition: An adder channel is a DMC whose inputs and outputs may be partitioned such that an input symbol of one set may not lead to an output symbol of another set. Hence, an adder channel may be regarded as a combination of m > 1 component channels, where we always use only one of these component channels.

Theorem: The channel capacity of an adder channel is given by

C_sum = log2 Σ_{i=1}^{m} 2^{Ci}   (in general: C_sum = logb Σ_{i=1}^{m} b^{Ci}),

where Ci is the channel capacity of the i-th component channel.

Example: (Figure: a DMC with inputs and outputs {0, 1, 2}, where the inputs {0, 1} form a BSC with crossover probability p and the input 2 is received as 2 with probability 1.) The channel capacity of this DMC is

C_sum = log2(1 + 2^{1−h(p)}) bits/channel symbol.

Chapter 4: Rate Distortion Theory


Distortion, distortion measures
Rate distortion function
Shannon's rate distortion theorem
Joint source and channel coding
Summary of Shannon's fundamental coding theorems

Shannon's Rate Distortion Theorem

Motivation: The efficiency of lossy source encoding increases with the admissible average distortion (examples: MPEG, JPEG). The fundamental question of rate distortion theory is: What is, on average, the minimum number of bits/source symbol at the output of a lossy source encoder which is necessary in order to reconstruct the source sequence with a given average distortion?

In this section, we consider discrete memoryless sources only. Let q ∈ Q be a source symbol. Let q̂ ∈ Q̂ be the corresponding reconstructed source symbol. The alphabets Q and Q̂ need not necessarily agree. For any pair (q, q̂) we define a non-negative distortion d(q, q̂) ≥ 0. By definition, d(q, q̂) = 0 for every pair (q, q̂) corresponds to lossless source coding.

Block Diagram of a Communication System with Source and Channel Coding

Source → q → Source encoder → u (binary data) → Channel encoder → x → Modulator → Channel (subject to distortion) → Demodulator → y → Channel decoder → û (binary data) → Source decoder → q̂ → Sink
The discrete channel is described by pY|X(y|x); the overall mapping from q to q̂ is described by the test channel pQ̂|Q(q̂|q).

Shannon's Rate Distortion Theorem

Examples of different distortion measures:
a) Hamming distortion: d(q, q̂) = 0 if q = q̂, d(q, q̂) = 1 else. Note that E{d(q, q̂)} = P(q ≠ q̂).
b) Mean square error distortion: d(q, q̂) = (q − q̂)². The mean square error distortion is popular in many applications, particularly in applications with continuous symbol alphabets.

Shannon's Rate Distortion Theorem

We denote the probability mass function of the source symbols q as pQ(q) and the conditional probability mass function of the estimated symbols q̂ given q as pQ̂|Q(q̂|q) (test channel). According to Bayes' rule, the joint probability mass function can be written as pQQ̂(q, q̂) = pQ(q) · pQ̂|Q(q̂|q). Hence, the average distortion per source symbol can be calculated as follows:

E{d(q, q̂)} = Σ_q Σ_q̂ pQQ̂(q, q̂) d(q, q̂) = Σ_q pQ(q) Σ_q̂ pQ̂|Q(q̂|q) d(q, q̂).

The set of all test channels whose average distortion is D or less is denoted as

T(D) = { pQ̂|Q(q̂|q) : E{d(q, q̂)} ≤ D }.

Shannon's Rate Distortion Theorem

Examples for test channels:
a) Trivial test channel with I(Q; Q̂) = H(Q): every source symbol is reproduced perfectly (identity mapping); maximum mutual information, but no data compression.
b) Trivial test channel with I(Q; Q̂) = 0: all source symbols are mapped onto a single reconstruction symbol; zero mutual information, but maximum data compression.

Perception: We would like to have a test channel with minimal (!) mutual information which causes an acceptable distortion at the same time.

Shannon's Rate Distortion Theorem

Definition: The rate distortion function is defined as

R(D) = min_{pQ̂|Q(q̂|q) ∈ T(D)} I(Q; Q̂),

where D ≥ 0. Hence, R(D) is the smallest mutual information I(Q; Q̂) for which the average distortion is D or less. Note that R(D) < H(Q) if D > 0 and that R(0) = H(Q).

Theorem (Shannon's rate distortion theorem): For any arbitrary ε1 > 0 and ε2 > 0 a sufficiently large n exists, so that sequences of n symbols from a memoryless source can be encoded, on average, with n(R(D) + ε1) bits. Given the encoded bits, the n symbols can be reconstructed with an average total distortion of n(D + ε2) or less.

Theorem (converse of the rate distortion theorem): Less than R(D) bits/source symbol are not sufficient if an average distortion of D or less shall be achieved.

Joint Source and Channel Coding


Rate distortion theorem: A sequence of k symbols from a memoryless source can be reconstructed with an average total distortion of kD or less if encoding is done with kR(D) bits and if these bits are transmitted error-free.

Channel coding theorem: We protect the source encoded bits by a channel code of rate R. Let R_tot = k/n be the total rate (including source and channel coding); then R = k R(D)/n = R_tot · R(D). The kR(D) source encoded bits can be transmitted quasi error-free if R ≤ C, where C is the channel capacity:

R_tot · R(D) ≤ C.

The smallest achievable distortion is obtained by setting R_tot · R(D) = C and by solving for D.

Joint Source and Channel Coding

q (k source symbols) → Source encoder → u (k·R(D) bits) → Channel encoder (opt. nonbinary) → n channel symbols
Total rate R_tot = k/n; channel code rate R [bits/channel use].

Summary of Shannon's Fundamental Coding Theorems

Shannon's channel coding theorem: At least one channel code of length n exists so that a sequence of k = nR info bits can be transmitted quasi error-free via a DMC if R = k/n < C, where C = max_{pX(x)} I(X; Y).

Shannon's source coding theorem: A sequence of n symbols from a memoryless source which is (source) encoded with at least nH(Q) bits can be reconstructed quasi error-free.

Shannon's rate distortion theorem: A sequence of n symbols from a memoryless source which is (source) encoded with nR(D) bits can be reconstructed with an average total distortion of nD, where

R(D) = min_{pQ̂|Q(q̂|q) ∈ T(D)} I(Q; Q̂).

All theorems only hold in the limit n → ∞!

Chapter 5: Source Coding II


Prefix-free source coding
Average code word length
Probability tree
Kraft's inequality
Massey's path length theorem
Markov sources
Huffman algorithm
Willems algorithm
Lempel-Ziv algorithm

Prefix-free Source Coding

In this chapter, we return to the problem of lossless source coding and consider the following situation:

Memoryless source → q → Source encoder → u

1. The memoryless source generates a random variable Q over a finite alphabet Q with L elements, or a sequence of random variables with this property. In the latter case the source symbols q are assumed to be statistically independent.
2. For each L-ary source symbol q the source encoder delivers a bit sequence u. The bit sequence u has a variable length. The average code word length is a measure for the efficiency of the source encoder.

Prefix-free Source Coding

We denote the code word corresponding to the source symbol q(i) as u(i) and its length as w(i), i ∈ {1, 2, . . . , L}. The average code word length is chosen as a measure for the efficiency of source coding:

E{W} = Σ_{i=1}^{L} pQ(q(i)) w(i)  [bits/source symbol].

The shorter the average code word length, the more efficient is the source encoder. The source encoder must fulfill the following properties:
1. Any two code words are not allowed to be identical, i.e., u(i) ≠ u(j) for i ≠ j. Then H(Q) = H(U).
2. A code word is not allowed to be a prefix of a longer code word. Then we are able to identify a code word as soon as we have received its last symbol, even if the source encoder operates continuously.
Codes which fulfill both properties are called prefix-free.

Prefix-free Source Coding

Examples:

q(1) → [0], q(2) → [10], q(3) → [11]: This code is prefix-free.

q(1) → [1], q(2) → [10], q(3) → [11]: This code is not prefix-free, since u(1) is a prefix of u(2) and u(3).

Prefix-free codes allow for a so-called comma-free transmission, if no transmission errors occur.

Prefix-free Source Coding

Definition: A binary prefix-free code is called optimal if no other binary prefix-free code with a smaller average code word length E{W} (per source symbol) exists.

Theorem: The average code word length of an optimal binary prefix-free code for a random variable Q can be bounded as

H(Q) ≤ E{W} < H(Q) + 1.

Remark 1: The lower bound follows from Shannon's source coding theorem.
Remark 2: For optimal codes the upper bound is necessary, but not sufficient. Not every code with E{W} < H(Q) + 1 is an optimal code.
Remark 3: The average code word length corresponds to one source symbol.

Definition: A binary (probability) tree is a tree with a single root, so that either two or no branches diverge from a node.
Definition: A full binary tree of depth n is a binary tree which consists of 2^n final nodes (at depth n).

Prefix-free Source Coding

Example: Consider the code

q(1) (pQ = 0.45): u = [0]
q(2) (pQ = 0.30): u = [10]
q(3) (pQ = 0.15): u = [110]
q(4) (pQ = 0.10): u = [111]

This code can be represented by a binary tree with root probability 1.00 and inner node probabilities 0.55 and 0.25 (the final nodes carry the symbol probabilities 0.45, 0.30, 0.15, 0.10).

It is

H(Q) = − Σ_{i=1}^{4} pQ(q(i)) log2 pQ(q(i)) = 1.782 bits/source symbol

and

E{W} = Σ_{i=1}^{4} pQ(q(i)) w(i) = 1.80 bits/source symbol.

Hence E{W} ≈ 1.01 · H(Q). Obviously, the code is a good code, but we do not know at this point whether the code is optimal or not.

Prefix-free Source Coding

Kraft's inequality: A binary prefix-free code with code word lengths w(1), w(2), . . . , w(L) exists if

Σ_{i=1}^{L} 2^{−w(i)} ≤ 1.

Example: Design a code with lengths w(1) = 2, w(2) = 2, w(3) = 2, w(4) = 3 and w(5) = 4!

Solution: Since Σ_{i=1}^{5} 2^{−w(i)} = 15/16 < 1, we know that such a code exists. Successively, we let the tree grow:

q(1) → [00], q(2) → [01], q(3) → [10], q(4) → [110], q(5) → [1110]
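A tiny check of Kraft's inequality for these lengths (added for illustration):

lengths = [2, 2, 2, 3, 4]
kraft_sum = sum(2**(-w) for w in lengths)
print(kraft_sum, kraft_sum <= 1)   # 0.9375 (= 15/16), so a prefix-free code exists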

Prefix-free Source Coding

Path length theorem (J.L. Massey): In a probability tree the average code word length is equal to the sum of the probabilities of the inner nodes, including the root.

Example: For a probability tree whose inner nodes (including the root) have the probabilities 1.00, 0.75, 0.25, 0.25, and 0.10, the path length theorem yields

E{W} = 1.00 + 0.75 + 0.25 + 0.25 + 0.10 = 2.35 bits/source symbol.

Prefix-free Source Coding

Definition: Let {u} be a binary prefix-free code for Q. We call the code {ũ} with ũ(1) = u(1), ũ(2) = u(2), . . . , ũ(L−2) = u(L−2) and ũ(L−1), where u(L−1) = ũ(L−1) ∘ 0 and u(L) = ũ(L−1) ∘ 1, a merged code for the random variable Q̃, which is defined as follows:

pQ̃(q̃(i)) = pQ(q(i)) for i = 1, 2, . . . , L−2,
pQ̃(q̃(L−1)) = pQ(q(L−1)) + pQ(q(L)).

Here, ∘ symbolizes a concatenation of symbol sequences. Each binary prefix-free code for Q is a merged code for a binary prefix-free code. According to Massey's path length theorem we obtain

E{W} = E{W̃} + pQ(q(L−1)) + pQ(q(L)).

E{W} is minimal if E{W̃} is minimal.

Optimal Algorithms for Lossless Source Coding


Huffman algorithm: independent source symbols with known probability mass function; a fixed source word length is mapped onto a variable code word length.
Tunstall algorithm: independent source symbols with known probability mass function; a variable source word length is mapped onto a fixed code word length.
Willems algorithm: universal source coding; a fixed source word length is mapped onto a variable code word length.
Lempel-Ziv algorithm: universal source coding; a variable source word length is mapped onto a fixed code word length.

Huffman Algorithm

The Huffman algorithm is based on the following lemmas:

Lemma: The tree of an optimal binary prefix-free code has no unused final nodes.

Lemma: Let q(L−1) and q(L) be the least probable events of a random variable Q. There exists an optimal binary prefix-free code for Q, so that the least probable code words u(L−1) and u(L) differ only in the last position, i.e.,

u(L−1) = ũ(L−1) ∘ 0   and   u(L) = ũ(L−1) ∘ 1,

where ũ(L−1) is a common prefix.

Lemma: The binary prefix-free code for Q with u(L−1) = ũ(L−1) ∘ 0 and u(L) = ũ(L−1) ∘ 1 is optimal if the merged code for the random variable Q̃ is optimal.

Huffman Algorithm

The Huffman algorithm consists of the following three steps:
Step 0 (initialization): Denote the L final nodes as q(1), q(2), . . . , q(L). To each node q(i) the corresponding probability pQ(q(i)), i ∈ {1, 2, . . . , L}, is assigned. All nodes are declared as being active.
Step 1: Merge the two least probable active nodes. These two nodes are declared passive, whereas the new node is declared active. The new active node is assigned the sum of the probabilities of the two merged nodes.
Step 2: Stop if only one active node is left. (In this case we have reached the root.) Otherwise, go to step 1.

Huffman Algorithm

Example: Let us design the Huffman code for the following scenario:

q(1): pQ = 0.05, q(2): pQ = 0.10, q(3): pQ = 0.15, q(4): pQ = 0.20, q(5): pQ = 0.23, q(6): pQ = 0.27

Solution: Merging the two least probable active nodes step by step (0.05 + 0.10 = 0.15; 0.15 + 0.15 = 0.30; 0.20 + 0.23 = 0.43; 0.27 + 0.30 = 0.57; 0.43 + 0.57 = 1.00) yields the code

q(1) → [0000], q(2) → [0001], q(3) → [001], q(4) → [10], q(5) → [11], q(6) → [01].

The average code word length is

E{W} = 1.00 + 0.57 + 0.43 + 0.30 + 0.15 = 2.45 bits/source symbol.

The entropy is

H(Q) = − Σ_{i=1}^{6} pQ(q(i)) log2 pQ(q(i)) = 2.42 bits/source symbol.
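A compact Huffman implementation sketch (added for illustration; it computes an optimal set of code word lengths for the example pmf, ties may be broken differently than in the tree above, but E{W} is the same):

import heapq

def huffman_lengths(probs):
    """Return optimal binary prefix-free code word lengths for the given probabilities."""
    # heap entries: (probability, unique id, list of leaf indices in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)   # two least probable active nodes
        p2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:         # merging adds one bit to every leaf below
            lengths[leaf] += 1
        heapq.heappush(heap, (p1 + p2, len(heap) + len(probs), leaves1 + leaves2))
    return lengths

probs = [0.05, 0.10, 0.15, 0.20, 0.23, 0.27]
lengths = huffman_lengths(probs)
print(lengths)                                          # [4, 4, 3, 2, 2, 2]
print(sum(p * w for p, w in zip(probs, lengths)))       # E{W} = 2.45 bits/source symbol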



Summary of Huffman Algorithm

The Huffman algorithm
assumes that the probability mass function of the source symbols is known,
assumes that the source symbols are statistically independent,
generates an optimal binary prefix-free code, i.e., H(Q) ≤ E{W} < H(Q) + 1,
does not permit any transmission errors.
The code words are of variable length. The average code word length E{W} may be computed by means of Massey's path length theorem. Instead of doing symbol-by-symbol encoding, the efficiency (in [bits/source symbol]) may be further increased by collecting multiple source symbols before source encoding.

Markov Sources
So far we assumed statistically independent source symbols (i.e., memoryless sources). Now, we consider sources with memory. Motivation (Shannon, 1948): Consider a random variable Q with cardinality L = 27 (26 capital characters plus the space symbol). 1st order approximation (based on the statistics of an English-written text):

2nd order approximation (based on the statistics of an English-written text):

3rd order approximation (based on the statistics of an English-written text):


Markov Sources
Definition: A Markov source is a sequence of random variables Q0, Q1, . . ., Qn with the following properties:
1. Each random variable Qi = f(Si) takes values over a finite alphabet Q with cardinality |Q| = L, where i is the symbol index (0 ≤ i ≤ n).
2. The sequence [Q0, Q1, . . . , Qn] is stationary, i.e., its statistical properties are not time-varying.
3. The sequence [S0, S1, . . . , Sn] forms a Markov chain with transition matrix Π.
4. f(.) is a function whose domain of definition contains the finite set S of all states and whose range contains the finite set Q of output symbols.
5. The initial state is randomly generated according to the stationary distribution π = [π(1), π(2), . . . , π(Ls)], where π(i) := P(S0 = s(i)) for all i ∈ {1, . . . , Ls} and where Ls = |S| denotes the number of states. (Given the transition matrix Π, the stationary distribution can be computed by solving π = π Π.)

Markov Sources
Definition: The entropy rate (per symbol) of the sequence [Q0, Q1, . . . , Qn] is defined as

H∞(Q) = lim_{n→∞} H(Qn|Q0 Q1 . . . Qn−1).

The entropy rate may be interpreted as the uncertainty of a symbol given its history.

Definition: The alphabet rate of a random variable Q is equal to the logarithm of the cardinality of the symbol alphabet: H0 := log |Q| = log L.

Theorem: H∞(Q) ≤ H(Qn|Q0 Q1 . . . Qn−1) ≤ H(Qn) ≤ log |Q| = log L = H0.

Example: For English-written texts with 27 letters (26 capital characters plus the space symbol) we obtain: H0 = log2 27 bits ≈ 4.75 bits, H ≈ 4.1 bits, H∞ ≈ 1.3 bits.

Universal Source Coding


If the source statistics p_Q(q^(i)), i ∈ {1, ..., L}, are known (and if the source is memoryless), the Huffman algorithm and the Tunstall algorithm are optimal source coding algorithms, i.e., the average code word length approaches the entropy H(Q). If the probability mass function is unknown, the source statistics must be estimated. If a wrong source statistics is assumed, the encoder is not optimal. The Huffman algorithm is sensitive with respect to the source statistics.
Definition: A source coding algorithm is called universal if the source statistics are not known before encoding is performed.
A fundamental question is: Can we approach the entropy H(Q) or even the entropy rate H∞(Q) with universal source coding algorithms?

103

Willems Algorithm
The Willems algorithm can be applied to arbitrarily long source sequences with a finite alphabet. The main steps are as follows:
Step 0 (initialization): We compute a coding table, as described afterwards, and initialize a buffer of length 2^N - 1.
Step 1: We divide the source sequence of length n into m subblocks q1, q2, ..., qm of length N.
Step 2: These subblocks are subsequently attached to the buffer on the right hand side. The encoder determines the repetition time tr of the current subblock. If the repetition time does not exceed the buffer length (tr < 2^N), we obtain a code word from the table representing the repetition time. Otherwise, a prefix is attached to the subblock and no further encoding is done.
Step 3: The contents of the buffer is shifted to the left hand side by N steps. We stop if the end of the source sequence is reached. Otherwise, we proceed with step 2.

104

Willems Algorithm
The coding table is computed as follows: We define the sets
Ti = {tr : 2^i ≤ tr < 2^(i+1)} for i = 0, 1, ..., N-1, and TN = {tr : tr ≥ 2^N}.

The encoder assigns an index i to any repetition time tr so that tr ∈ Ti. The encoded index is the first part of the code word (prefix). The prefix consists of ⌈log2(N + 1)⌉ bits. If i < N, the rest of the code word (suffix) is determined by encoding j = tr - 2^i. In this case, the suffix consists of i bits. If i = N, the actual subblock is chosen as the suffix. In this case, the suffix consists of N bits.

105

Willems Algorithm
Example: For N = 3 the following coding table is obtained:
tr    i   j   Prefix   Suffix   Length
1     0   0   [00]     -        2
2     1   0   [01]     [0]      3
3     1   1   [01]     [1]      3
4     2   0   [10]     [00]     4
5     2   1   [10]     [01]     4
6     2   2   [10]     [10]     4
7     2   3   [10]     [11]     4
>=8   3   -   [11]     q        5
106

Willems Algorithm
Example (cont'd): N = 3, m = 7, 2^N - 1 = 7.
Let [100,000,011,111,011,101,001] be the sequence of binary source symbols and let [0100100] be the initial contents of the buffer.
1. subblock: [0100100]100   tr = 3    u1 = [01 1]
2. subblock: [0100100]000   tr = 1    u2 = [00]
3. subblock: [0100000]011   tr >= 8   u3 = [11 011]
4. subblock: [0000011]111   tr = 1    u4 = [00]
5. subblock: [0011111]011   tr = 6    u5 = [10 10]
6. subblock: [1111011]101   tr = 4    u6 = [10 00]
7. subblock: [1011101]001   tr >= 8   u7 = [11 001]
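A minimal Python sketch of the encoder described above (our own code and function name, not from the slides). It assumes the repetition time is the smallest shift t >= 1 at which a copy of the current subblock starts, as read off the example; with the buffer length 2^N - 1 and a prefix of ceil(log2(N+1)) bits it reproduces the seven code words of the example.

```python
import math

def willems_encode(bits, N, buf):
    """Encode a binary string in subblocks of length N, given an initial buffer."""
    prefix_bits = math.ceil(math.log2(N + 1))
    codewords = []
    for start in range(0, len(bits), N):
        block = bits[start:start + N]
        seq = buf + block                      # attach subblock on the right
        # repetition time: smallest t >= 1 with a copy of the subblock t positions earlier
        tr = next((t for t in range(1, len(buf) + 1)
                   if seq[-N - t:len(seq) - t] == block), 2 ** N)
        if tr < 2 ** N:
            i = tr.bit_length() - 1            # 2^i <= tr < 2^(i+1)
            prefix = format(i, f"0{prefix_bits}b")
            suffix = format(tr - 2 ** i, f"0{i}b") if i > 0 else ""
        else:                                  # no repetition found: send block verbatim
            prefix = format(N, f"0{prefix_bits}b")
            suffix = block
        codewords.append(prefix + suffix)
        buf = seq[-len(buf):]                  # shift buffer to the left by N positions
    return codewords

source = "100000011111011101001"
print(willems_encode(source, N=3, buf="0100100"))
# ['011', '00', '11011', '00', '1010', '1000', '11001']
```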

107

Willems Algorithm
We recognize that
- in the best case the source sequence is compressed by the factor ⌈log2(N + 1)⌉/N
- in the worst case the source sequence is expanded by the factor 1 + ⌈log2(N + 1)⌉/N.
How efficient is Willems' algorithm?
Definition: The average rate is R = E{length(u)}/N = E{W}/N, where the code word length W is averaged w.r.t. the statistics of the source symbols. For stationary ergodic sequences q = [q1, q2, ..., qm] we can prove that
lim_{N→∞} R = H∞(Q).

Correspondingly, the Willems algorithm is optimal if the buffer as well as the subblocks are of infinite length.
108

Lempel-Ziv Algorithm
We explain the Lempel-Ziv algorithm (LZ78) for the example of a binary data sequence of length n. A generalization for sources with finite output alphabets is straightforward.
Step 1: The source symbol sequence is subsequently divided into subblocks, which are as short as possible and which did not occur before. We denote
- the total number of subblocks as m
- the last symbol of each subblock as the suffix
- the remaining symbols of each subblock as the prefix.
Step 2: We encode the position of any prefix and attach the suffix. ⌈log2 m⌉ bits are needed in order to encode the position and 1 bit for the suffix, i.e., m (1 + ⌈log2 m⌉) bits are needed in total for a sequence of length n. Correspondingly, we need m (1 + ⌈log2 m⌉)/n bits per source symbol.

109

Lempel-Ziv Algorithm
Example: Let [1011010100010] be the sequence of binary source symbols. Correspondingly, the subblocks are obtained as [1],[0],[11],[01],[010],[00],[10]. We observe that n = 13 and m = 7, i.e., we need 3 bits in order to encode a position. The encoded sequence is [000,1] [000,0] [001,1] [010,1] [100,0] [010,0] [001,0]. In this tutorial example we need 28 bits in order to encode 13 source bits. However, the efficiency grows with increasing length of the source sequence. For stationary ergodic sequences q = [q1, q2, ..., qn] we can prove that
lim_{n→∞} m (1 + ⌈log2 m⌉)/n = H∞(Q).

Therefore, the Lempel-Ziv algorithm is optimal for infinitely long source sequences. The version of the Lempel-Ziv algorithm introduced so far needs two passes and delivers code words of equal length. Many modifications exist.
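The parsing and encoding of the example can be checked with the following two-pass Python sketch (our own code and function name, not from the slides); it reproduces the phrase list and the encoded sequence given above.

```python
import math

def lz78_encode(bits):
    """Two-pass LZ78 sketch: parse into new phrases, then emit fixed-length code words."""
    # Pass 1: split into subblocks that are as short as possible and did not occur before.
    phrases, seen, current = [], {"": 0}, ""
    for b in bits:
        current += b
        if current not in seen:
            seen[current] = len(phrases) + 1
            phrases.append(current)
            current = ""
    if current:                          # possible incomplete phrase at the end
        phrases.append(current)
    # Pass 2: encode (position of prefix, suffix bit) with ceil(log2 m) + 1 bits each.
    m = len(phrases)
    pos_bits = math.ceil(math.log2(m))
    out = []
    for ph in phrases:
        prefix, suffix = ph[:-1], ph[-1]
        out.append(format(seen.get(prefix, 0), f"0{pos_bits}b") + "," + suffix)
    return out

print(lz78_encode("1011010100010"))
# ['000,1', '000,0', '001,1', '010,1', '100,0', '010,0', '001,0']
```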
110

Practical Examples for Text Compression


The Huffman algorithm and the Lempel-Ziv algorithm as well as adaptive variations thereof are extensively used for the purpose of data compression on many computer systems. Some routines (in increasing efficiency) are as follows:
- pack: Huffman algorithm
- compact: adaptive Huffman algorithm
- compress, ARC: Lempel-Ziv algorithm (LZW)
- gzip, zip, PKZIP: Lempel-Ziv algorithm (LZ77)
Given an English-written text (without pictures), the data can be reduced by about a factor of two when using compress and by about a factor of three when using gzip.

111

Chapter 6: Channel Coding II


Word error probability, bit error probability
Decoding rules: MAP decoding and ML decoding
Bhattacharyya bound
Gallager bound
Gallager bound for random coding
Gallager function, Gallager exponent
R0 criterion

112

Word Error Probability, Bit Error Probability


In this chapter, we return to the problem of channel coding and channel decoding. We consider the following coded transmission system (code rate R = k/n):

info word u → channel encoder → code word x → channel (e.g. DMC) → received word y → channel decoder → decoded info word û (or decoded code word x̂)

Quality criteria:
- Word error probability Pw = P(û ≠ u)
- Bit error probability (for info words with binary elements) Pb = (1/k) ∑_{j=1}^{k} P(û_j ≠ u_j)

Note that Pb ≤ Pw ≤ k Pb and Pw/k ≤ Pb ≤ Pw, respectively.

113

Word Error Probability

Figure: the 2^k code words, the 2^n possible received words y, the estimated code word x̂(y), and the decoding regions separated by decoding thresholds. Example: k = 2, n = 4.

114

Word Error Probability


The general, exact formula for the word error probability can be derived as follows: By definition we have
Pw = P(û ≠ u) = 1 - P(û = u).
The last term on the right hand side may be expressed as
P(û = u) = P(x̂ = x) = ∑_y p_{YX}(y, x̂(y)) = ∑_y p_{Y|X}(y | x̂(y)) p_X(x̂(y)),
where x̂(y) is the estimated code word for a given received word y. Hence, x̂(y) describes the decoding rule. Correspondingly,
Pw = 1 - ∑_y p_{Y|X}(y | x̂(y)) p_X(x̂(y)).

This equation holds for arbitrary discrete channels and decoding rules, and is exact.

115

Decoding Rules
Let xi be the i-th code word, i ∈ {1, 2, ..., 2^k}, k = nR.
Maximum a posteriori (MAP) decoding: p_{X|Y}(x̂ | y) ≥ p_{X|Y}(xi | y) for all i.
Maximum-likelihood (ML) decoding: p_{Y|X}(y | x̂) ≥ p_{Y|X}(y | xi) for all i.
According to Bayes' rule,
p_{X|Y}(xi | y) = p_{Y|X}(y | xi) p_X(xi) / p_Y(y).

Since the denominator is independent of i, the MAP rule and the ML rule are equal for equally probable code words, i.e., for p_X(xi) = 1/2^k for all i ∈ {1, 2, ..., 2^k}.
116

Decoding Rules
MAP decoding:
ûMAP = arg max_{ui} p_{U|Y}(ui | y) or equivalently x̂MAP = arg max_{xi} p_{X|Y}(xi | y).

The MAP rule estimates the most probable info word (or the most probable code word, respectively) given the received word y. The a priori probabilities p_X(xi) have to be known at the receiver.
ML decoding:
ûML = arg max_{ui} p_{Y|U}(y | ui) or equivalently x̂ML = arg max_{xi} p_{Y|X}(y | xi).

The ML rule selects, among all hypotheses xi, i ∈ {1, 2, ..., 2^k}, the one that maximizes the likelihood of the received word y. The ML decoder does not make use of a priori information.
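To make the two rules concrete, here is a small Python sketch (our own, using a made-up binary symmetric channel and made-up priors, not the ternary example of the next slide) that computes both decisions for one received word.

```python
import math

# Hypothetical setup: (3,1) repetition code over a BSC with crossover probability 0.2.
codewords = {"u=0": (0, 0, 0), "u=1": (1, 1, 1)}
prior = {"u=0": 0.9, "u=1": 0.1}                # made-up a priori probabilities
eps = 0.2

def likelihood(y, x):
    """p_{Y|X}(y|x) for a memoryless BSC."""
    return math.prod(eps if yi != xi else 1 - eps for yi, xi in zip(y, x))

y = (1, 1, 0)                                   # received word
u_ml = max(codewords, key=lambda u: likelihood(y, codewords[u]))
u_map = max(codewords, key=lambda u: likelihood(y, codewords[u]) * prior[u])
print("ML :", u_ml)    # 'u=1' (two of the three received symbols are 1)
print("MAP:", u_map)   # 'u=0' (the strong prior on u=0 overrides the likelihood)
```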

117

Decoding Rules
Example: Ternary DMC, (2,1) repetition code of length n = 2

Figure: ternary DMC with input alphabet X ∈ {0, 1, 2}, output alphabet Y ∈ {0, 1, 2}, and transition probabilities of 1/2 on the branches shown in the original diagram.
Code words and a priori probabilities: A = [00] with p_X = 1/8, B = [11] with p_X = 5/8, C = [22] with p_X = 2/8.

Which decision table is used by a MAP decoder, and which table is used by an ML decoder? Are the decision tables unique? What are the corresponding word error probabilities?

118

Decoding Rules

y           [00] [01] [02] [10] [11] [12] [20] [21] [22]
x̂ (ML)      A    B    B    B    C    C    C    C    C
x̂ (MAP)     B    B    B    B    B    C    C    C    C

119

Bhattacharyya Bound
Motivation: An exact computation of the word error probability is very complex, since we have to add 2^n terms, where decoding regions have to be taken into account. For typical block codes (n ≈ 100 ... 1000), the computational effort is not manageable. Hence, in the following we derive two upper bounds on the word error probability.
We define the decoding regions as
Di = {y : û(y) = ui} = {y : x̂(y) = xi},   i ∈ {1, 2, ..., 2^k}.

Furthermore, we define the conditional word error probability
Pw|i = P(û ≠ ui | ui transmitted) = P(x̂ ≠ xi | xi transmitted).
Since p_U(ui) = p_X(xi), the word error probability can be written as
Pw = ∑_{i=1}^{2^k} p_X(xi) Pw|i,   k = nR,
where the expected value is taken over all info words given a fixed code.

Bhattacharyya Bound

Figure: two code words x1 and x2 (2^k = 2 code words) with their decoding regions D1 and D2 among the 2^n received words, separated by the decoding threshold. Example: k = 1, n = 4.

121

Bhattacharyya Bound
According to the definitions it follows that
Pw|i = P(y ∉ Di | ui transmitted) = P(y ∉ Di | xi transmitted),
or equivalently
Pw|i = ∑_{y ∉ Di} p_{Y|X}(y | xi).

Special case: k = 1 (code with two code words):
Pw|1 = ∑_{y ∈ D2} p_{Y|X}(y | x1).

For ML decoding an upper bound is obtained by multiplying each term of the sum with √(p_{Y|X}(y | x2) / p_{Y|X}(y | x1)), since this square root is ≥ 1 if y ∈ D2 (and ≥ 0 otherwise):
Pw|1 ≤ ∑_{y ∈ D2} p_{Y|X}(y | x1) √(p_{Y|X}(y | x2) / p_{Y|X}(y | x1)) = ∑_{y ∈ D2} √(p_{Y|X}(y | x1) p_{Y|X}(y | x2)).

Bhattacharyya Bound
The bound is reasonably tight, since the decisions of the ML decoder are least reliable near the decision threshold between D1 and D2, where the square root is approximately one. The right hand side may be further bounded by taking all possible received words y into account:
Pw|1 ≤ ∑_y √(p_{Y|X}(y | x1) p_{Y|X}(y | x2)).

Note that the computational complexity reduces significantly, since no decision thresholds have to be computed. Similarly,
Pw|2 ≤ ∑_y √(p_{Y|X}(y | x1) p_{Y|X}(y | x2)).

For the special case of a DMC we get:
Pw|i ≤ ∏_{j=1}^{n} ∑_y √(p_{Y|X}(y | x_{1j}) p_{Y|X}(y | x_{2j})),   i ∈ {1, 2}.

This derivation proves the following theorem:

123

Bhattacharyya Bound
Theorem (Bhattacharyya bound for two code words): For a given code of length n with two code words x1 and x2, which are transmitted with arbitrary probabilities p_X(x1) and p_X(x2) via a discrete memoryless channel (DMC), the conditional word error probability for ML decoding can be upper bounded as
Pw|i ≤ ∏_{j=1}^{n} ∑_y √(p_{Y|X}(y | x_{1j}) p_{Y|X}(y | x_{2j})),   i ∈ {1, 2}.

Generalization:
Theorem (Bhattacharyya bound): For a given code of length n with code words xi, i ∈ {1, 2, ..., 2^k}, which are transmitted with arbitrary probabilities p_X(xi) via a discrete memoryless channel (DMC), the conditional word error probability for ML decoding can be upper bounded as
Pw|i ≤ ∑_{ℓ=1, ℓ≠i}^{2^k} ∏_{j=1}^{n} ∑_y √(p_{Y|X}(y | x_{ij}) p_{Y|X}(y | x_{ℓj})).
124

Bhattacharyya Bound
Example 1: (n, 1) repetition code, BSC with crossover probability p: Pw|i ≤ (2 √(p(1-p)))^n.
Example 2: (n, 1) repetition code, BEC with erasure probability q: Pw|i ≤ q^n; exact result: Pw = q^n / 2.

Remarks concerning the Bhattacharyya bound:
- The Bhattacharyya bound depends on i and hence on the source statistics p_X(xi). The word error probability can be bounded as Pw ≤ max_i Pw|i.
- For codes with many code words, the Bhattacharyya bound typically is not tight, since too many decision regions are taken into account. The so-called Gallager bound, which is derived next, is tighter.
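The two examples can be checked numerically. The following Python sketch (our own, with made-up values of n, p and q) evaluates the per-symbol Bhattacharyya parameter of a DMC and raises it to the n-th power, and compares the BEC bound with the exact result q^n/2.

```python
import math

def bhattacharyya_dmc(py_given_x, x1, x2):
    """Per-symbol Bhattacharyya parameter sum_y sqrt(p(y|x1) p(y|x2)) for a DMC."""
    return sum(math.sqrt(py_given_x[x1][y] * py_given_x[x2][y]) for y in py_given_x[x1])

n, p, q = 5, 0.05, 0.1

# BSC with crossover probability p: outputs {0, 1}.
bsc = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
bound_bsc = bhattacharyya_dmc(bsc, 0, 1) ** n           # (2 sqrt(p(1-p)))^n

# BEC with erasure probability q: outputs {0, 1, 'e'}.
bec = {0: {0: 1 - q, 1: 0.0, 'e': q}, 1: {0: 0.0, 1: 1 - q, 'e': q}}
bound_bec = bhattacharyya_dmc(bec, 0, 1) ** n           # q^n
exact_bec = 0.5 * q ** n                                # decoder guesses if all symbols erased

print(bound_bsc)             # ~ (2*sqrt(0.0475))^5
print(bound_bec, exact_bec)  # 1e-05 vs 5e-06: the bound is loose by the factor 2
```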

125

Gallager Bound
As proven before,
Pw|i = ∑_{y ∉ Di} p_{Y|X}(y | xi).     (*)

For ML decoding: y ∉ Di ⇒ p_{Y|X}(y | xj) ≥ p_{Y|X}(y | xi) for at least one j ≠ i.

Let ρ and s be real numbers with 0 ≤ ρ ≤ 1 and 0 ≤ s ≤ 1. Therefore,
[ ∑_{j=1, j≠i}^{2^k} ( p_{Y|X}(y | xj) / p_{Y|X}(y | xi) )^s ]^ρ ≥ 1   for y ∉ Di.

We multiply each term in (*) with the left hand side of this inequality and obtain
Pw|i ≤ ∑_{y ∉ Di} p_{Y|X}(y | xi)^{1-sρ} [ ∑_{j=1, j≠i}^{2^k} p_{Y|X}(y | xj)^s ]^ρ.

By means of the parameters ρ and s, which have been introduced by Gallager, the Gallager bound is tighter than the Bhattacharyya bound.

Gallager Bound
According to Gallager we choose s = 1/(1+ρ), 0 ≤ ρ ≤ 1, take all received words y into account, and finally obtain:
Theorem (Gallager bound): For a given code of length n with code words xi, i ∈ {1, 2, ..., 2^k}, which are transmitted with arbitrary probabilities p_X(xi) via an arbitrary discrete channel, the conditional word error probability for ML decoding can be upper bounded as
Pw|i ≤ ∑_y p_{Y|X}(y | xi)^{1/(1+ρ)} [ ∑_{ℓ=1, ℓ≠i}^{2^k} p_{Y|X}(y | xℓ)^{1/(1+ρ)} ]^ρ,   0 ≤ ρ ≤ 1.
127

Gallager Bound
Special case:
Theorem (Gallager bound for DMCs): For a given code of length n with code words xi, i ∈ {1, 2, ..., 2^k}, which are transmitted with arbitrary probabilities p_X(xi) via a discrete memoryless channel (DMC), the conditional word error probability for ML decoding can be upper bounded as
Pw|i ≤ ∏_{j=1}^{n} ∑_y p_{Y|X}(y | x_{ij})^{1/(1+ρ)} [ ∑_{ℓ=1, ℓ≠i}^{2^k} p_{Y|X}(y | x_{ℓj})^{1/(1+ρ)} ]^ρ,   0 ≤ ρ ≤ 1.

For ρ = 1 the Gallager bound and the Bhattacharyya bound are identical.

128

Gallager Bound for Random Coding


For practical codes, a computation of the Gallager bound or the Bhattacharyya bound still appears to be too complex. As Shannon observed, it is simpler to obtain bounds for the average word error probability, where averaging is with respect to an ensemble of codes.
Theorem: For random codes of length n with code words xi, i ∈ {1, 2, ..., 2^k}, which are transmitted with arbitrary probabilities p_X(xi) via an arbitrary discrete channel, the average word error probability for ML decoding can be upper bounded as
P̄w ≤ (2^k - 1)^ρ ∑_y [ ∑_x p_{Y|X}(y | x)^{1/(1+ρ)} p_X(x) ]^{1+ρ},   0 ≤ ρ ≤ 1.

The averaging must be performed over a large set of codes.

129

Gallager Bound for Random Coding


For statistically independent code symbols with probability mass function
p_X(x) = ∏_{j=1}^{n} p_X(xj)
and for the special case of a DMC it immediately follows that
P̄w ≤ (2^k - 1)^ρ [ ∑_y ( ∑_x p_{Y|X}(y | x)^{1/(1+ρ)} p_X(x) )^{1+ρ} ]^n,   0 ≤ ρ ≤ 1.

With the so-called Gallager function
E0(ρ, p_X(x)) := -log2 ∑_y [ ∑_x p_{Y|X}(y | x)^{1/(1+ρ)} p_X(x) ]^{1+ρ},   0 ≤ ρ ≤ 1,
we get
P̄w ≤ (2^k - 1)^ρ 2^{-n E0(ρ, p_X(x))},   0 ≤ ρ ≤ 1.
With 2^k - 1 < 2^k and k = nR we obtain
P̄w < 2^{-n (E0(ρ, p_X(x)) - ρR)},   0 ≤ ρ ≤ 1.
130

Gallager Bound for Random Coding


Theorem (Gallager bound for random coding): Consider a set of codes with 2^k = 2^{nR} code words x of length n, where the code words are statistically independent and the code symbols xi, i ∈ {1, 2, ..., n}, are statistically independent. The code symbols are assumed to be distributed according to a given probability mass function p_X(x) over the input alphabet of a DMC. The average word error probability for ML decoding can be upper bounded as
P̄w < 2^{-n EG(R)},
where
EG(R) := max_{p_X(x)} max_{0 ≤ ρ ≤ 1} (E0(ρ, p_X(x)) - ρR)
is the so-called Gallager exponent. Finally, we define
EG(0) = max_{p_X(x)} E0(1, p_X(x)) := R0.

R0 is called cut-off rate or R0-criterion, respectively.
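As an illustration (our own Python sketch, not from the slides), E0(ρ, p_X) and the Gallager exponent can be evaluated numerically for a BSC with uniform input distribution; for ρ = 1 this reproduces the well-known cut-off rate R0 = 1 - log2(1 + 2√(p(1-p))). The channel matrix, crossover probability and rate below are made-up example values.

```python
import numpy as np

def E0(rho, p_x, p_y_given_x):
    """Gallager function E0(rho, p_X) for a DMC (binary logarithm)."""
    inner = (p_x[:, None] * p_y_given_x ** (1.0 / (1.0 + rho))).sum(axis=0)  # sum over x
    return -np.log2((inner ** (1.0 + rho)).sum())                            # sum over y

p = 0.02                                    # BSC crossover probability
W = np.array([[1 - p, p], [p, 1 - p]])      # W[x, y] = p_{Y|X}(y|x)
px = np.array([0.5, 0.5])                   # uniform input distribution

R0 = E0(1.0, px, W)
print(R0, 1 - np.log2(1 + 2 * np.sqrt(p * (1 - p))))   # both ~0.64 bit

# Gallager exponent for this fixed input distribution: maximize over 0 <= rho <= 1.
R = 0.4
rhos = np.linspace(0.0, 1.0, 1001)
EG = max(E0(r, px, W) - r * R for r in rhos)
print(EG)                                   # > 0 as long as R < C
```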

131

Gallager Bound for Random Coding

A typical sketch of the Gallager exponent is illustrated as follows:

Figure: typical shape of the Gallager exponent EG(R) versus the rate R; the curve starts at EG(0) = R0, decreases with increasing R, and reaches zero at R = C.
132

Gallager Bound for Random Coding


Since for a fixed rate R, 0 ≤ R < C, the Gallager exponent is positive, the average word error probability exponentially approaches zero with increasing code length n. Therefore, a particular code with code length n must exist whose word error probability is less than the average word error probability of random codes with the same code length. With this particular code, a quasi error-free transmission is possible. Therefore, Gallager's random coding bound proves Shannon's channel coding theorem.
EG(R0) is large enough, so that the word error probability for practical code lengths (n ≈ 100 ... 1000) is very small. Hence, R0 is sort of a practical bound. Nowadays, channel codes are known (turbo codes) which make a quasi error-free transmission possible even for rates between R0 and C (however, only for code lengths n > 1000).
The Gallager bound is a tight bound. This confirms the choice of s = 1/(1 + ρ).

133

Chapter 7: Channel Coding III


Continuous probability theory: Cumulative distribution function, probability density function, conditional probability density function
Differential entropy, conditional differential entropy
Discrete-time Gaussian channel
Channel capacity of the discrete-time Gaussian channel
Water-filling principle
Sampling theorem, Nyquist rate
Bandlimited Gaussian channel
Channel capacity of the bandlimited Gaussian channel, Shannon limit

134

Continuous Probability Theory


So far we restricted ourselves to discrete random variables with values over a finite alphabet. Now, we will study continuous random variables.
Definition: Let X be a random variable with a cumulative distribution function (cdf) P_X(x) = P(X ≤ x). If P_X(x) is continuous, X is called a continuous random variable. Furthermore, we denote the derivative of P_X(x) as
p_X(x) = d P_X(x) / dx,
if the derivative exists. If ∫_{-∞}^{+∞} p_X(x) dx = 1, p_X(x) is called the probability density function (pdf) of X. Therefore, p_X(x) ≥ 0 for all x and
P(a ≤ x ≤ b) = ∫_a^b p_X(x) dx.

Hence, the probability of the event {a ≤ x ≤ b} is the area under the probability density function in the range a ≤ x ≤ b.

135

Continuous Probability Theory


Definition: Two continuous random variables X and Y are called statistically independent if p_XY(x, y) = p_X(x) p_Y(y).
Definition: The conditional probability density function is defined as
p_{X|Y}(x|y) = p_XY(x, y) / p_Y(y),
where p_Y(y) > 0 (Bayes' rule).

136

Differential Entropy

For the binary representation of real values an infinite number of bits is necessary: The self-information is infinite. Still, we define the following expressions:
Definition: Let X be a continuous random variable with probability density function p_X(x). Then
H(X) = -∫_{-∞}^{+∞} p_X(x) log p_X(x) dx
is called the differential entropy of X. In contrast to the entropy of a discrete random variable X, the differential entropy has no fundamental meaning. Particularly, the differential entropy cannot be interpreted as the uncertainty of a random variable. Especially, H(X) may even be negative.

137

Differential Entropy

Example: Let X be a Gaussian distributed random variable with mean μ and variance σ², i.e., let X be a random variable with probability density function
p_X(x) = 1/√(2πσ²) · e^{-(x-μ)²/(2σ²)}.
(A Gaussian distribution with zero mean and variance one is called a normal distribution.) The differential entropy of X is
H(X) = 1/2 log(2πe σ²).
Depending on the value of σ², H(X) may be positive, zero, or negative.
Theorem: Among all continuous random variables with a given mean μ and a given variance σ², the Gaussian distributed random variable has the maximal differential entropy.

138

Differential Entropy

Definition: The mutual information between two continuous random variables X and Y is defined as
I(X; Y) = ∫∫ p_XY(x, y) log [ p_{X|Y}(x|y) / p_X(x) ] dx dy = ∫∫ p_XY(x, y) log [ p_XY(x, y) / (p_X(x) p_Y(y)) ] dx dy,
and the conditional differential entropy is defined as
H(X|Y) = -∫∫ p_XY(x, y) log p_{X|Y}(x|y) dx dy.

Like for discrete random variables,
I(X; Y) = H(X) - H(X|Y) ≥ 0;
however, I(X; Y) ≤ H(X) generally does not apply any more.

139

Differential Entropy

Theorem: Let Y = X + Z, where X and Z are independent, Gaussian distributed random variables with variances S and N, respectively. We obtain:
I(X; Y) = 1/2 log(1 + S/N).

Proof: Since X and Z are Gaussian distributed, Y is Gaussian distributed as well. Since X and Z are statistically independent, their variances are additive. Hence, Y is Gaussian distributed with variance S + N. With I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(Y) - H(Z) and the result of the example treated above we get
I(X; Y) = 1/2 log(2πe(S + N)) - 1/2 log(2πeN) = 1/2 log(1 + S/N).   q.e.d.
140

The Discrete-Time Gaussian Channel


Definition: Let X, Y, and Z be discrete-time, continuous random variables. A real-valued, discrete-time, memoryless Gaussian channel is denoted as y = x + z, where x are the (continuous or discrete) channel input symbols, z are Gaussian distributed noise values with zero mean and variance N, and y are the channel output symbols. The noise values z are assumed to be statistically independent from sample to sample. Furthermore, X and Z are assumed to be statistically independent. Since Z is assumed to be zero-mean, the variance N corresponds to the average noise power.
If the channel input symbols x are Gaussian distributed with variance S, we are able to increase I(X; Y) = 1/2 log(1 + S/N) as much as we like by increasing S. Hence, the channel capacity C = max_{p_X(x)} I(X; Y) is only meaningful if the channel input symbols are restricted in some sense. Often, we limit the average input power:
E{X²} = ∫_{-∞}^{+∞} x² p_X(x) dx ≤ P.

141

The Discrete-Time Gaussian Channel


Definition: The channel capacity of a real-valued, discrete-time, memoryless channel with an average input power of P or less is defined as
C = max_{p_X(x): E{X²} ≤ P} I(X; Y).

The unit is [bit/channel use] (or [bit/channel symbol]) if we take the binary logarithm.
Theorem: The channel capacity of a real-valued, discrete-time, memoryless Gaussian channel with an average input power of P or less is
C = 1/2 log2(1 + P/N)   [bit/channel use]
and is obtained if X is Gaussian distributed.
Proof: We previously obtained that I(X; Y) = H(Y) - H(Z). Since H(Y) is maximal if Y is Gaussian distributed, a probability density function p_X(x) must be found which leads to Gaussian distributed output symbols. Due to Cramér, the sum of two Gaussian distributed random variables is Gaussian distributed once again. Hence, I(X; Y) is maximized if X is Gaussian distributed.   q.e.d.
142

Channel Capacity of the Discrete-Time Gaussian Channel


Figure: channel capacity C in bit/channel use versus P/N in dB (P/N from -20 dB to 30 dB; C up to about 5 bit/channel use).

143

The Discrete-Time Gaussian Channel


In practice, instead of Gaussian distributed input symbols typically other distributions are used.
Example: The channel capacity of a discrete-time, memoryless Gaussian channel with binary input symbols X ∈ {+√P, -√P} can be computed as follows:
C = max_{p_X(x): E{X²} ≤ P} I(X; Y)
  = max_{p_X(x)} ∑_{i=1}^{2} ∫ p_XY(xi, y) log2 [ p_XY(xi, y) / (p_X(xi) p_Y(y)) ] dy
  = max_{p_X(x)} ∑_{i=1}^{2} ∫ p_{Y|X}(y|xi) p_X(xi) log2 [ p_{Y|X}(y|xi) / p_Y(y) ] dy.

Due to symmetry, the maximum is obtained for p_X(xi) = 1/2 for all i. Therefore,
C = 1/2 ∑_{i=1}^{2} ∫ p_{Y|X}(y|xi) log2 [ p_{Y|X}(y|xi) / p_Y(y) ] dy   [bit/channel use]

with p_{Y|X}(y|xi) = 1/√(2πσ²) e^{-(y-xi)²/(2σ²)}, p_Y(y) = (p_{Y|X}(y|x1) + p_{Y|X}(y|x2))/2, σ² = N, x1 = +√P and x2 = -√P. It is not possible to simplify the integral further.
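The integral can be evaluated numerically. The sketch below (our own code, not from the slides) does this with a simple Riemann sum and prints the Gaussian-input capacity 1/2 log2(1 + P/N) for comparison; integration range, grid size and the SNR values are arbitrary choices.

```python
import numpy as np

def capacity_binary_input(P, N, num=20001, span=10.0):
    """Numerically evaluate C = 1/2 sum_i int p(y|x_i) log2(p(y|x_i)/p(y)) dy."""
    sigma, a = np.sqrt(N), np.sqrt(P)
    y = np.linspace(-a - span * sigma, a + span * sigma, num)
    dy = y[1] - y[0]
    def pdf(mean):                        # Gaussian pdf with variance N
        return np.exp(-(y - mean) ** 2 / (2 * N)) / np.sqrt(2 * np.pi * N)
    p1, p2 = pdf(+a), pdf(-a)
    py = 0.5 * (p1 + p2)
    eps = 1e-300                          # avoids log(0) where a pdf underflows
    integrand = 0.5 * (p1 * np.log2((p1 + eps) / (py + eps))
                       + p2 * np.log2((p2 + eps) / (py + eps)))
    return np.sum(integrand) * dy

for snr_db in [-10.0, 0.0, 10.0, 20.0]:
    snr = 10 ** (snr_db / 10)             # P/N
    print(snr_db, round(capacity_binary_input(P=snr, N=1.0), 4),
          round(0.5 * np.log2(1 + snr), 4))   # binary input saturates at 1 bit/channel use
```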

144

Channel Capacity of the Discrete-Time Gaussian Channel


Figure: channel capacity C in bit/channel use versus P/N in dB for Gaussian distributed input symbols and for binary input symbols (+√P, -√P); the binary-input curve saturates at 1 bit/channel use, whereas the Gaussian-input curve grows without bound.

145

The Discrete-Time Gaussian Channel


Among all channels with Gaussian distributed input symbols and additive noise with a given average noise power N, the channel whose noise is zero-mean Gaussian distributed has the lowest channel capacity. In other words: According to information theory, the Gaussian channel is the worst channel model one can think of.
In the following, we investigate several parallel, discrete-time Gaussian channels with statistically independent noise from subchannel to subchannel. The fundamental question is: How is a given average total input power optimally distributed among the parallel channels if their noise variances are different?

146

The Discrete-Time Gaussian Channel


Theorem (Water-filling principle): We are given n parallel, real-valued, discrete-time, memoryless Gaussian channels with statistically independent noise values with variances Ni, i = 1, 2, ..., n. The average total input power is assumed to be P or less, i.e.,
∑_{i=1}^{n} E{Xi²} ≤ P.

The channel capacity is
C = ∑_{i=1}^{n} 1/2 log2(1 + Si/Ni)   [bit/(total) channel use]
and is reached if the channel input symbols are independent, Gaussian distributed random variables with zero mean and variances Si, i = 1, 2, ..., n, where
Si + Ni = θ   for Ni < θ,
Si = 0        for Ni ≥ θ,
and the water level θ is chosen such that ∑_{i=1}^{n} Si = P.

147

The Discrete-Time Gaussian Channel

Figure: power distribution S1, ..., S5 over five subchannels with noise variances N1, ..., N5 according to the water-filling principle. The average total input power may be interpreted as a mass of water which is dammed up over an irregular terrain. The terrain represents the noise variances of the subchannels, whereas the water level corresponds to θ.
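A small Python sketch of the water-filling allocation (our own code, with made-up noise variances and total power): the water level θ is found by bisection so that the powers S_i sum to P.

```python
import numpy as np

def water_filling(noise, P, iters=100):
    """Return powers S_i with S_i = max(theta - N_i, 0) and sum(S_i) = P."""
    lo, hi = 0.0, max(noise) + P            # theta lies between these bounds
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        S = np.maximum(theta - noise, 0.0)
        if S.sum() > P:
            hi = theta
        else:
            lo = theta
    S = np.maximum(0.5 * (lo + hi) - noise, 0.0)
    C = 0.5 * np.log2(1.0 + S / noise).sum()
    return S, C

noise = np.array([1.0, 2.0, 4.0, 0.5, 8.0])   # made-up noise variances N_i
S, C = water_filling(noise, P=6.0)
print(S, S.sum(), C)   # subchannels with large N_i may get no power at all
```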
148

The Bandlimited Gaussian Channel


In the following, we consider analog random variables.
Theorem (Sampling theorem): Let x(t) ∈ ℝ (we restrict ourselves to real-valued signals) be a bandlimited deterministic signal with spectrum X(f), i.e., X(f) = 0 for |f| > W, or a bandlimited stochastic process with power density spectrum Φ_xx(f), i.e., Φ_xx(f) = 0 for |f| > W, where W is the one-sided bandwidth. Given the samples x[k] := x(t = k/(2W)), in both cases x(t) can be reconstructed by means of the so-called Nyquist-Shannon interpolation formula:
x(t) = ∑_{k=-∞}^{+∞} x[k] · sin(2πW(t - k/(2W))) / (2πW(t - k/(2W))),

if the sampling rate is (at least) 2W samples per second, i.e., the sampling interval is at most 1/(2W) seconds. This minimum sampling rate is called the Nyquist rate.

149

The Bandlimited Gaussian Channel


Example (thermal noise): The power density spectrum of a noise process z(t) ∈ ℝ, which is obtained by filtering a white noise process with an ideal low-pass filter (bandwidth W, unit amplification in the pass-band, zero amplification in the stop-band), is
Φ_zz(f) = N0/2 for |f| ≤ W and Φ_zz(f) = 0 else,
where N0 is the one-sided noise power density. Therefore, the average noise power is N = N0 W. For thermal noise, N0 is proportional to the effective noise temperature.
Definition: A bandlimited channel with additive, Gaussian distributed noise with a brick-wall power density spectrum with one-sided bandwidth W and one-sided noise power density N0 is called a bandlimited Gaussian channel.

150

The Bandlimited Gaussian Channel


According to a result from signal theory, the samples of a bandlimited Gaussian process with brick-wall-shaped power density spectrum are independent, Gaussian distributed random variables with variance σ_z² = N0 W = N, if the sampling rate is equal to the Nyquist rate 2W.
Now, we observe a real-valued bandlimited Gaussian channel over a time interval of T seconds. The sampled input values are denoted as x[k] = x(t = k/(2W)) and the sampled output values are denoted as y[k] = y(t = k/(2W)). Correspondingly, the process is completely described by n = 2WT samples. Therefore, the real-valued bandlimited Gaussian channel is equivalent to n = 2WT parallel, real-valued, discrete-time Gaussian channels. Each subchannel has a channel capacity of
Ck = 1/2 log2(1 + P/N)   [bit/sample].
151

The Bandlimited Gaussian Channel


With N = N0 W it follows that
Ck = 1/2 log2(1 + P/(N0 W))   [bit/sample].

Therefore, the total channel capacity for the n parallel channels is
∑_{k=1}^{n} Ck = W T log2(1 + P/(N0 W))   [bit/set of samples].

Definition: The channel capacity of a continuous-time channel is
C = lim_{T→∞} C_T / T   [bit/s],
where C_T is the maximal mutual information which can be transmitted in an interval of T seconds. The unit is [bit/s] if we take the binary logarithm.

152

The Bandlimited Gaussian Channel


Therefore, we have proven the following theorem:
Theorem: The channel capacity of a real-valued bandlimited Gaussian channel with one-sided bandwidth W and one-sided noise power density N0 is
C = W log2(1 + P/(N0 W)) = W log2(1 + P/N)   [bit/s].

The ratio P/N is called the signal/noise ratio.
Example: Typically, an analog telephone channel has a bandwidth of about W = 3 kHz and a signal/noise ratio of about P/N = P/(N0 W) = 30 dB (if no digital switching is applied). Therefore, the channel capacity is about C = 30 kbit/s. With state-of-the-art telephone modems (V.34), data rates of up to 28.8 kbit/s may be obtained under these circumstances.

153

Channel Capacity of the Bandlimited Gaussian Channel


Figure: normalized channel capacity C/(2W) in bit/s/Hz versus P/N in dB (P/N from -20 dB to 30 dB).
30.0
154

The Bandlimited Gaussian Channel


Given a fixed average signal power P, we now increase the channel capacity by increasing the bandwidth W. The limiting value is denoted as C∞. We obtain
C∞ = lim_{W→∞} W log2(1 + P/(N0 W)) = P/(N0 loge 2)   [bit/s].

Let the average signal energy per observation interval T be denoted as E = P T. Accordingly, the average signal energy per info bit is Eb = P T / k, where k is the number of info bits per observation interval T. If we denote the data rate (in [bit/s]) as R = k/T, we get Eb = P/R. By means of substitution we finally obtain
C∞/R = (Eb/N0) · 1/loge 2.
For a reliable transmission R < C∞, hence
Eb/N0 > loge 2 ≈ 0.69 (i.e., -1.6 dB).
This threshold value is the so-called Shannon limit.

155

The Bandlimited Gaussian Channel


Finally, we study again channels with finite bandwidth. Given
C = W log2(1 + P/N)   [bit/s],
we derive the expression
C/(2W) = 1/2 log2(1 + (Eb/N0)(R/W))   [bit/s/Hz].

In the limit for R → C we get
C/(2W) = 1/2 log2(1 + (Eb/N0)(C/W))   [bit/s/Hz].

The solution for Eb/N0 can be written as
Eb/N0 = (2^{2·C/(2W)} - 1) / (2·C/(2W)).

This equation is illustrated in the next picture.
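The curve can be generated with a few lines of Python (our own sketch, not from the slides); it evaluates the last expression and confirms that for C/(2W) → 0 the required Eb/N0 approaches loge 2, i.e., the Shannon limit of about -1.59 dB.

```python
import numpy as np

def ebn0_db(spectral_eff):
    """Required Eb/N0 in dB at R = C for a given C/(2W) in bit/s/Hz (real-valued channel)."""
    x = np.asarray(spectral_eff, dtype=float)
    ebn0 = (2.0 ** (2.0 * x) - 1.0) / (2.0 * x)
    return 10.0 * np.log10(ebn0)

for x in [1e-4, 0.5, 1.0, 3.0]:
    print(x, ebn0_db(x))          # 0 dB at x = 0.5, 1.76 dB at x = 1, ...
print(10 * np.log10(np.log(2)))   # Shannon limit ~ -1.59 dB, reached as x -> 0
```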


156

Channel Capacity of the Bandlimited Gaussian Channel


Figure: normalized channel capacity C/(2W) in bit/s/Hz (0 to 7) versus Eb/N0 in dB (-5 dB to 30 dB); the vertical asymptote at Eb/N0 ≈ -1.6 dB marks the Shannon limit.

157

Summary: Channel Capacity of Real-Valued Gaussian Channels with Limited Input Power
Discrete-time channel, Gaussian distributed input symbols:
C = 1/2 log2(1 + P/N)   [bit/channel use],
where P: average input symbol power, N: average noise power.
Continuous-time channel, Gaussian distributed input signals, finite bandwidth:
C = W log2(1 + P/N)   [bit/s],
where P: average input signal power, N: average noise power (N = W N0), W: bandwidth.
For Tsample = 1/(2W) the product C · Tsample equals the channel capacity C of the discrete-time Gaussian channel.

158

Summary: Channel Capacity of Real-Valued Gaussian Channels with Limited Input Power
Discrete-time channel, binary input symbols ±√P:
C = 1/2 ∑_{i=1}^{2} ∫ p_{Y|X}(y|xi) log2 [ p_{Y|X}(y|xi) / p_Y(y) ] dy   [bit/channel use],
where
p_{Y|X}(y|xi) = 1/√(2πσ²) e^{-(y-xi)²/(2σ²)},
p_Y(y) = 1/2 (p_{Y|X}(y|x1) + p_{Y|X}(y|x2)),
x1 = +√P, x2 = -√P,
P: average input symbol power, σ²: average noise power.

159

Remarks
The Gaussian channel model often is called the Additive White Gaussian Noise (AWGN) channel model.
For a matched-filter receiver,
P/N = (Es/Ts) / (N0 W) = (Es/N0) · 1/(Ts W) = (Es/N0) · 2 Tsample/Ts = 2 Es/N0   (for Ts = Tsample),
where Es: energy per info symbol, Ts: symbol duration, 1/Tsample: sampling rate.
For a bandlimited AWGN channel model we further obtain:
σ² = N = N0 W = N0/(2 Ts) = (N0/2) · (1/Ts).
160

Channel Capacity for a Finite Error Probability


Figure: transmission chain with overall rate Rtot = k/n; k source symbols → lossy source encoder with rate R(D) (output: kR(D) bits) → channel encoder with rate R (output: n code symbols) → AWGN channel with capacity C.

- Lossy, binary source encoder: Hamming distortion D = Pb, R(D) = 1 - h(Pb)
- Ideal channel encoder: R = kR(D)/n = Rtot R(D)
- Channel: e.g. C = 1/2 log2(1 + P/N)

Error-free transmission (errors are only introduced by the lossy source encoder!):
R ≤ C   ⇔   Rtot R(D) ≤ C,   where R(D) = 1 - h(Pb).

For Rtot R(D) = C we finally obtain
P/N = 2^{2 Rtot R(D)} - 1.
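The resulting SNR limit for a tolerated bit error probability can be evaluated as follows (our own Python sketch, not from the slides). The conversion to Eb/N0 uses the relation P/N = 2 Rtot Eb/N0 for the real-valued bandlimited channel; this relation is our assumption here, stated explicitly in the code.

```python
import numpy as np

def h2(p):
    """Binary entropy function in bits."""
    p = np.asarray(p, dtype=float)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def snr_limit(Rtot, Pb):
    """Minimum P/N and Eb/N0 (both in dB) for overall rate Rtot and tolerated Pb."""
    RD = 1.0 - h2(Pb)                       # rate distortion function, Hamming distortion
    snr = 2.0 ** (2.0 * Rtot * RD) - 1.0    # from Rtot*R(D) = C = 1/2 log2(1 + P/N)
    ebn0 = snr / (2.0 * Rtot)               # assumption: P/N = 2*Rtot*Eb/N0
    return 10 * np.log10(snr), 10 * np.log10(ebn0)

for Pb in [1e-1, 1e-2, 1e-3, 1e-5]:
    print(Pb, snr_limit(Rtot=0.5, Pb=Pb))
```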

161

SNR Limit of the Discrete-Time Gaussian Channel with Gaussian Distributed Input Symbols given a Finite Bit Error Probability

Figure: AWGN channel with Gaussian distributed input symbols; bit error probability (from 10^0 down to 10^-5) versus Eb/N0 in dB (-4 dB to 1 dB) for overall rates Rtot = 0, 1/3, 1/2, and 2/3.
162

163

Chapter 8: Cryptology
Fundamentals of cryptology: Cryptography, cryptanalysis, authentication
Ciphertext-only attack, known-plaintext attack, chosen-plaintext attack
Classical cipher systems: Caesar cipher, Vigenère cipher, Vernam cipher
Shannon's theory of secrecy, absolutely secure cryptosystem
Key entropy, message entropy
Conditional key entropy, conditional message entropy
Redundancy, unicity distance
Private key, public key
One-way function, trapdoor one-way function
RSA system

164

Fundamentals of Cryptology
Cryptology is the art or science of the design and the attack of security systems. Cryptology includes
- cryptography (art or science of encryption)
- cryptanalysis (art or science of attacking a security system)
- authentication (checking whether a message is genuine or not).
As an art, cryptology has been used by armed forces, diplomats, and spies for thousands of years. As a science, cryptology was established in 1948 by Shannon. The demand for cryptosystems has increased significantly in the past decade (due to the internet, home banking, e-commerce, m-commerce, pay TV, etc.).

165

Fundamentals of Cryptology

Figure: the scientific foundations of cryptology are information theory, number theory, and probability theory.

166

Fundamentals of Cryptology
Goals of cryptology:
Secrecy: How can we securely communicate with somebody else?
- organizing actions (e.g. secret meeting)
- physical actions (e.g. invisible ink)
- encryption of messages: symmetrical techniques (private key techniques) and asymmetrical techniques (public key techniques)
Authentication: How can we prove one's identity? How can we check that a message is authentic?
- user authentication (e.g. personal identification number (PIN), automated teller machine (ATM))
- message authentication (e.g. electronic signature)

167

Fundamentals of Cryptology
Goals of cryptology (cont'd):
Anonymity: How can we preserve our privacy during (tele-)communication?
- cash money, help line, box number advertisement
- electronic cash
Protocols: A protocol is the whole set of rules. Protocols are necessary if two or more users participate (key management, service provision, etc.)
- cellular radio
- internet

168

Block Diagram of a Classical Cipher System for Secure Communication (Symmetrical Technique)
Figure: plaintext M → encryption E_K(.) → ciphertext C = E_K(M) → (insecure channel, subject to interception and plaintext estimation) → decryption E_K^{-1}(.) → M = E_K^{-1}(C); the key K is delivered to both sides via a secure channel.

M: Message (to be transmitted via an insecure channel to an authorized receiver)
C: Ciphertext, cryptogram
E_K(.): Encryption rule, E_K^{-1}(.): Decryption rule
K: Key (is known to authorized users only)
M̂: Estimated message of the attacker (the probability that M̂ = M should approach zero).

169

Fundamentals of Cryptology
Encryption: C := f(K, M) := f_K(M) := E_K(M)
Decryption: M := f^{-1}(K, C) := f_K^{-1}(C) := E_K^{-1}(C) := D_K(C)
Classification of encryption and decryption techniques:
- block enciphering: block-wise encryption and decryption (e.g. Data Encryption Standard (DES))
- stream enciphering: continuous encryption and decryption (e.g. one-time pad)
Kerckhoffs' principle (1883): The security of cryptosystems must not depend on the concealment of the algorithm. Security is only based on the secrecy of the key.

170

Fundamentals of Cryptology
The security of cryptosystems is based on the effort needed to break the system. We have to assume that the attacker knows all about the nature of the plaintext (such as language and alphabet) and about the algorithm. In this case, sufficiently long messages can be uniquely recovered by an exhaustive search with respect to all possible keys, if the necessary computational power is available. Depending on the a priori knowledge, we classify attacks as follows:
- ciphertext-only attack: Besides the decryption algorithm, only the ciphertext is given to the attacker
- known-plaintext attack: Additionally, the attacker knows a set of plaintext/ciphertext pairs
- chosen-plaintext attack: The attacker may choose an arbitrary plaintext and obtain the corresponding ciphertext without having access to the key.
State-of-the-art cryptosystems are resistant against chosen-plaintext attacks (e.g. UNIX password).

171

Some Classical Encryption Systems


We consider the Latin alphabet with 26 capital characters and a key of length one. A classical cipher is
E_K: M → M + K (mod 26),   M ∈ {0, 1, ..., 25}.
26 different keys K exist. Julius Caesar used K = 3, Augustus used K = 4.
The so-called Caesar cipher is a special case of a simple substitution, where characters are interchanged. Example:
plaintext alphabet: ABCDEFGHIJKLMNOPQRSTUVWXYZ
cipher alphabet:    XGUACDTBFHRSLMQVYZWIEJOKNP
26! different keys K of length one exist.

Some Classical Encryption Systems


The main problem of the simple substitution is that due to interchanging characters the distributions in plaintext and ciphertext are identical. This problem may be solved by using a key of length n instead of a key of length one. In that case, (26!)n dierent keys exist. This technique is called polyalphabetic substitution. An example of polyalphabetic substitution is the Vigen`re cipher (about 1550). e Encryption and decryption is handled by means of the so-called Vigen`re-tableau. e In the Vigen`re-tableau shown next the plaintext alphabet is plotted on the horizontal e axis and the key alphabet is plotted on the vertical axis.

173

Vigenère Tableau
Table: the 26 x 26 Vigenère tableau; the row for key character K contains the plaintext alphabet cyclically shifted by K positions, i.e., the entry in row K and column M is the cipher character M + K (mod 26).
174

Some Classical Encryption Systems


In the Vigenère cipher, the key is used periodically, i.e., the period is n.
Example: The key is THOMPSON and the plaintext is FOR WOODSTOCK MY FRIEND OF FRIENDS:
plaintext:  FORWOODSTOCKMYFRIENDOFFRIENDS
key:        THOMPSONTHOMPSONTHOMPSONTHOMP
ciphertext: YVFIDGRFMVQWBQTEBLBPDXTEBLBPH
The main problem of the Vigenère cipher is the periodicity of the key. Therefore, identical patterns of the plaintext may be repeated in the ciphertext:
plaintext:  FORWOODSTOCKMYFRIENDOFFRIENDS
key:        THOMPSONTHOMPSONTHOMPSONTHOMP
ciphertext: --------------TEBLBP--TEBLBP-
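The example can be reproduced with a few lines of Python (our own sketch and function name, not from the slides), using the mapping A = 0, ..., Z = 25 and ignoring spaces.

```python
def vigenere_encrypt(plaintext, key):
    """Encrypt capital letters A..Z with the Vigenere cipher (A = 0, ..., Z = 25)."""
    out = []
    for i, ch in enumerate(plaintext):
        m = ord(ch) - ord('A')
        k = ord(key[i % len(key)]) - ord('A')   # the key is used periodically
        out.append(chr((m + k) % 26 + ord('A')))
    return "".join(out)

plaintext = "FORWOODSTOCKMYFRIENDOFFRIENDS"
key = "THOMPSON"
print(vigenere_encrypt(plaintext, key))
# YVFIDGRFMVQWBQTEBLBPDXTEBLBPH
```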

175

Some Classical Encryption Systems


This problem of the Vigenère cipher can be solved by the Vernam cipher (1918):
- The length of the key word is equal to the message length
- The key is changed after each message
- All keys are equally probable
- The entropy of any key symbol must be at least as large as the entropy per message symbol.
The Vernam cipher is the first secure cipher system. The Vernam cipher is called one-time key or one-time pad.

176

Shannon's Theory of Secrecy

Shannon's model of a cipher system:
- Shannon considers a classical cipher system: C = E_K(M). The key K is changed after each message M.
- According to Kerckhoffs' principle, only the keys are secret. The attacker knows E_K(.) for any key K as well as the a priori probability mass functions p_M(m) and p_K(k).
- The attacker computes p_{M|C}(m|c) (i.e., the most probable message) or p_{K|C}(k|c) (i.e., the most probable key).
- The message M and the key K are assumed to be statistically independent.
- Only the ciphertext C is given to the attacker (ciphertext-only attack).
- The attacker has access to infinite computational power.

177

Shannon's Theory of Secrecy

Definition: The key entropy is defined as
H(K) = -∑_k p_K(k) log p_K(k)
and the message entropy is defined as
H(M) = -∑_m p_M(m) log p_M(m).

Definition: The conditional key entropy (key equivocation) is defined as
H(K|C) = -∑_k ∑_c p_{KC}(k, c) log p_{K|C}(k|c)
and the conditional message entropy (message equivocation) is defined as
H(M|C) = -∑_m ∑_c p_{MC}(m, c) log p_{M|C}(m|c).

The conditional key entropy (or the conditional message entropy) is the uncertainty of an attacker trying to find the actual key (or the actual message).
178

Shannon's Theory of Secrecy

We know that
H(K|C) ≤ H(K)   and   H(M|C) ≤ H(M),
i.e., knowledge of the ciphertext can, on average, never increase the uncertainty about the key or about the message.
Theorem: The conditional message entropy cannot exceed the conditional key entropy, i.e.,
H(M|C) ≤ H(K|C).
Proof: H(M|C) ≤ H(K, M|C) = H(K|C) + H(M|K, C), where the right hand side follows from the chain rule for entropy. For well-designed cipher systems H(M|K, C) = 0, which proves the theorem.   q.e.d.

179

Shannon's Theory of Secrecy

Let us be given an alphabet with L symbols. The message is denoted by M = [M1, M2, ..., MN] and the ciphertext by C = [C1, C2, ..., CN].
Definition: The entropy H0 = log L is called the alphabet rate. The per-symbol entropy H̄(M) = (1/N) H(M) is called the message rate. Furthermore, let
H∞ = lim_{N→∞} (1/N) H(M).

Definition: The difference between the alphabet rate and the message rate,
D = H0 - H̄(M),
is called redundancy.
Example: For English-written texts H0 = log2(26) bits = 4.7 bits and H∞ ≈ 1.5 bits (per symbol). Therefore, for N → ∞ the redundancy is D = H0 - H∞ ≈ 3.2 bits.
180

Shannon's Theory of Secrecy

Theorem: H(K|C) ≥ H(K) - N D.
Proof: For all cipher systems H(M, K) = H(C, K). Since M and K are statistically independent,
H(M, K) = H(M) + H(K) = N H̄(M) + H(K).
Furthermore,
H(C, K) = H(C) + H(K|C) ≤ N H0 + H(K|C).
We join the last three equations to obtain
H(K|C) ≥ N H̄(M) + H(K) - N H0 = H(K) - N (H0 - H̄(M)),
which proves the theorem.   q.e.d.
From this theorem we deduce the following fundamental results:

181

Shannon's Theory of Secrecy

- If the redundancy is D = 0, the conditional key entropy and the key entropy are identical: H(K|C) = H(K). Therefore, data compression is an essential recipe in order to improve the security of a cipher system: Perfect source coding in conjunction with non-trivial enciphering results in an absolutely secure system. This only holds, however, if the key is changed after each message.
- For N ≥ H(K)/D (i.e., for H(K) - N D ≤ 0) the lower bound of the conditional key entropy is zero: We risk a successful attack.
Definition: The smallest message length for which a unique estimate of the message given the ciphertext is possible (i.e., H(K|C) = 0 and therefore H(M|C) = 0) is called the unicity distance and is denoted as Nmin.
Theorem: Nmin = H(K)/D. (The proof of this theorem follows from the previous theorem.)

182

Shannon's Theory of Secrecy

Remarks:
- Shannon's cryptosystem is absolutely secure only if the message length is < Nmin.
- The last theorem has been proven for Shannon's model only. However, this theorem is often applied for computing the unicity distance of practical cipher systems too: For the simple substitution, H(K) = log2(26!) bits = 88.4 bits. With D = 3.2 bits (for English-written texts), Nmin ≈ 28. Hence, a message should consist of less than 28 characters. Otherwise, we risk a successful attack.
- The advantageous effect of source coding upon the security of cipher systems is exploited by doing source coding (data compression), then encryption, then channel coding (in that order!). Channel coding must be applied after encryption, since channel coding adds redundancy.
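The unicity distance of the simple substitution can be recomputed in a few lines (our own Python sketch, not from the slides):

```python
import math

def unicity_distance(key_entropy_bits, redundancy_bits_per_symbol):
    """N_min = H(K) / D."""
    return key_entropy_bits / redundancy_bits_per_symbol

H_K = math.log2(math.factorial(26))   # simple substitution: ~88.4 bits
D = 3.2                               # redundancy of English text in bits per character
print(H_K, unicity_distance(H_K, D))  # ~88.4 bits, N_min ~ 28 characters
```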

183

Shannon's Theory of Secrecy

A secure cryptosystem can be realized as follows:
- Source encoding of the message (which should be as long as possible)
- Segmentation of the source encoded bit sequence into subblocks, where the length of each subblock should not exceed the unicity distance
- Key generation:
  - the key is changed after each subblock; messages and keys must be independent
  - the conditional key entropy must be larger than or equal to the conditional message entropy
  - the lengths of the key words must be the same as the lengths of the corresponding subblocks
  - the keys must be transmitted via a secure channel
- Encryption by means of a modulo-2 addition
- Channel encoding in order to protect against transmission errors
184

Cipher Systems with Public Key


So far, we assumed the use of a private key. This requires that
- the demand for encryption is known a priori (n(n-1)/2 keys for n users)
- the keys are transmitted via a secure channel.
Now, we investigate a cipher system with two keys (Diffie, Hellman, 1976):
- a public key is used for encryption
- a private key is used for decryption.
The first cipher system with two keys, the so-called RSA system, was published by Rivest, Shamir, and Adleman in 1978. It is still quite popular. The RSA system is not secure in the sense of information theory.

185

Block Diagram of a Cipher System with Public Key (Asymmetrical Technique)

Figure: plaintext M → encryption E_K(.) with the public key (published) → ciphertext C = E_K(M) → (insecure channel, subject to interception and plaintext estimation) → decryption D_K(.) with the private key (known to the receiver only) → M = D_K(C).

186

Cipher Systems with Public Key


Public key systems are based on the following fundamentals:
- A one-way function is a function f(x) such that for all x in the domain of definition y = f(x) is easily computable. However, for almost all y in the range of the function it is practically impossible to identify x.
- A trap-door one-way function is a family of invertible functions f_K(x) such that, for a given trap-door parameter K, y = f_K(x) for all x in the domain of definition and x = f_K^{-1}(y) for all y in the range of the function are easily computable. However, for almost all K and y in the range of the function it is practically impossible to compute f_K^{-1}(y), even if f_K(x) is known. Without knowledge of the secret trap-door parameter K it is practically impossible to recover the message.
In the previous block diagram, the trap-door one-way function f_K(x) corresponds to the public key E_K, whereas f_K^{-1}(y) corresponds to the private key D_K. The user who is supposed to receive the encrypted message determines and publishes the trap-door one-way function E_K. However, the trap-door parameter K is kept secret.

187

Some Results from Number Theory


Definition (Euler's function): Let φ(n) be the number of integer values i in the interval [1, n-1] which are relatively prime with respect to n, i.e., for which gcd(i, n) = 1. (gcd: greatest common divisor.) Per definition, φ(1) = 1.
Example: Let n = p·q, where p and q are prime numbers. Therefore, φ(p) = p - 1 and φ(q) = q - 1. Since p, 2p, ..., (q-1)p (q-1 terms) and q, 2q, ..., (p-1)q (p-1 terms) are not relatively prime with respect to n, we obtain
φ(n) = n - 1 - (q-1) - (p-1) = (p-1)(q-1) = φ(p)·φ(q).

Euler's theorem: If gcd(a, n) = 1, we get a^φ(n) = 1 (mod n).
Example: Let a = 2 and n = 55. Therefore (with n = p·q = 5·11), φ(55) = (5-1)(11-1) = 40. Hence,
a^φ(n) = 2^40 = 1024^4 = 34^4 = 1156^2 = 1 (mod 55).
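Both statements can be verified with a brute-force check (our own Python sketch, not from the slides; the totient is computed by counting, which is only feasible for small n):

```python
import math

def phi(n):
    """Euler's totient: count of i in [1, n-1] with gcd(i, n) = 1 (brute force sketch)."""
    return sum(1 for i in range(1, n) if math.gcd(i, n) == 1) if n > 1 else 1

p, q = 5, 11
n = p * q
print(phi(n), (p - 1) * (q - 1))   # both 40
print(pow(2, phi(n), n))           # Euler's theorem: 2^40 mod 55 = 1
```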
188

The RSA Cipher System


The RSA cipher system is based on the problem of factoring large numbers and on the problem of solving the so-called discrete logarithm.
One chooses a large number n = p·q, where p and q are large prime numbers, and one determines a random integer number e with 1 < e < φ(n) so that gcd(e, φ(n)) = 1. According to the Bézout identity, the inverse of e (mod φ(n)), denoted as d, exists and is unique:
e·d = 1 (mod φ(n)).
The public key corresponds to the pair (n, e), the private key is (n, d). The trap-door parameter K is (p, q, φ(n)) or d, respectively.
Remark: Secrecy relies on the assumption that it is impossible to factor n. Otherwise, for given p and q it is straightforward to compute φ(n) = φ(p)·φ(q) = (p-1)(q-1). Given e, it is then possible to compute d = e^{-1} (mod φ(n)).

189

The RSA Cipher System


The RSA cipher system works as follows: The characters of the message are represented by numbers. Then, the sequence of numbers is divided into subblocks M, where gcd(M, n) = 1 and 0 < M < n (e.g. 0 < M < min(p, q)).
The encryption algorithm is C = E_K(M) = M^e (mod n).
The decryption algorithm is M = D_K(C) = C^d (mod n).
Proof: D_K(C) = C^d = (M^e)^d = M^{ed} = M^{1+k·φ(n)} = (M^φ(n))^k · M = M (mod n),
where e·d = 1 (mod φ(n)) and Euler's theorem are applied.   q.e.d.

Remark: Users who know (n, e) are able to encrypt a message. However, only users who additionally know (p, q, φ(n)) or d are able to decrypt the message. Secrecy relies on the assumption that it is impossible to solve C = M^e (mod n) by means of the discrete logarithm, where e and C are known.
190

The RSA Cipher System


Example: Let p = 41 and q = 73, so n = p·q = 2993 and φ(n) = (p-1)·(q-1) = 2880. Furthermore, let e = 17, so d = e^{-1} (mod φ(n)) = 2033.
The characters of the plaintext RSA ALGORITH... are represented as A = 01, B = 02, ..., Z = 26, space = 27. The message is divided into subblocks, where each subblock comprises two characters:
1819 0127 0112 0715 1809 2008 ...
The ciphertext reads
1375 1583 0259 0980 1866 1024 ...
RSA Laboratories recommend a key word length of 768 bits for private applications, 1024 bits for business applications, and 2048 bits for very important applications. (The key word length equals the number of binary digits of the private key d.)
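The example can be reproduced with the following Python sketch (our own code, not from the slides; the helper encode and the block layout follow the A = 01, ..., space = 27 convention above; pow(e, -1, phi_n) requires Python 3.8 or newer):

```python
p, q = 41, 73
n = p * q                          # 2993
phi_n = (p - 1) * (q - 1)          # 2880
e = 17
d = pow(e, -1, phi_n)              # modular inverse, here 2033 (Python 3.8+)

def encode(text):
    """A=01, ..., Z=26, space=27; two characters per subblock, as on the slide."""
    digits = "".join(f"{(ord(c) - ord('A') + 1) if c != ' ' else 27:02d}" for c in text)
    return [int(digits[i:i + 4]) for i in range(0, len(digits), 4)]

blocks = encode("RSA ALGORITH")              # [1819, 127, 112, 715, 1809, 2008]
cipher = [pow(M, e, n) for M in blocks]      # encryption C = M^e mod n
plain = [pow(C, d, n) for C in cipher]       # decryption M = C^d mod n
print(d, cipher, plain == blocks)
# 2033 [1375, 1583, 259, 980, 1866, 1024] True
```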

191

Literature (tutorial)
A. Beutelspacher, Kryptologie. Braunschweig/Wiesbaden: Vieweg, 5th ed., 1996.
A. Beutelspacher, J. Schwenk, K.-D. Wolfenstetter, Moderne Verfahren der Kryptographie. Braunschweig/Wiesbaden: Vieweg, 3rd ed., 1999.
O. Mildenberger, Informationstheorie und Codierung. Braunschweig/Wiesbaden: Vieweg, 2nd ed., 1992.
H. Rohling, Einführung in die Informations- und Codierungstheorie. Stuttgart: Teubner, 1995.
A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography. CRC Press, 5th ed., 2001.

192

Literature (advanced)
T.M. Cover, J.A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, 1991.
R.G. Gallager, Information Theory and Reliable Communication. New York: John Wiley & Sons, 1968.
R. Johannesson, Informationstheorie: Grundlagen der (Tele-)Kommunikation. Lund: Addison-Wesley, 1992.
J.C.A. van der Lubbe, Information Theory. Cambridge (UK): Cambridge University Press, 1997.
J.M. Wozencraft, I.M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965.
C.E. Shannon's paper "A Mathematical Theory of Communication" (July/Oct. 1948) can be found on our homepage: http://www-ict.tf.uni-kiel.de (see Lectures: Scripts and exercises)
