Prof. Peter Adam Hoeher, Information and Coding Theory Lab, Faculty of Engineering, University of Kiel, Germany. ph@tf.uni-kiel.de, www-ict.tf.uni-kiel.de
Contents
Introduction: What is information theory?
Fundamental notions: Probability, information, entropy
Source coding: Typical sequences, Shannon's source coding theorem (lossless source coding), Shannon's rate distortion theorem (lossy source coding), Markov sources, Huffman algorithm, Willems algorithm, Lempel-Ziv algorithm
Channel coding: Shannon's channel coding theorem, channel capacity of discrete and of continuous channels, joint source and channel coding, MAP and ML decoding, Bhattacharyya bound and Gallager bound, Gallager exponent
Cryptology: Classical cipher systems, Shannon's theory of secrecy, redundancy, public key systems
Chapter 0: Introduction
What is information theory?
History of information theory
Block diagram of a communication system
Separation theorem of information theory
Fundamental questions of information theory
Entropy and channel capacity
Cipher systems
Mathematics (inequalities)
Examples:
Source: continuous-time signals (e.g. voice, audio, video, analog measurement signals) or discrete-time signals (e.g. characters, data sequences, sampled analog signals)
Channel: wireless channel (radio channel, acoustical channel, infrared), wireline channel (cable or fiber), CD/DVD, magnetic recording, etc.
Block Diagram of a Communication System with Source Coding, Encryption, and Channel Coding
[Figure: Source → Source encoder → Encryption → Channel encoder → Modulator → Channel (with disturbance) → Demodulator → Channel decoder → Decryption → Source decoder → Sink.]
Encryption example: A key word is added modulo 2 to the plaintext, e.g. plaintext [100] (= D), key word [101] (random sequence) → ciphertext [100] ⊕ [101] = [001].
Channel coding example: repetition code, e.g. info word [001] → code word [00 00 11].
Symbols should not be transmitted individually. Instead, the channel encoder should map the info bits onto the coded symbols so that each info bit influences as many coded symbols as possible.
where p_X(x) is the probability mass function of X.
Channel capacity: the maximum number of bits per channel symbol that can be transmitted via a noisy channel with arbitrarily small error probability:
C = max_{p_X(x)} [H(X) − H(X|Y)]  [bits/channel symbol].
Chapter 1: Fundamentals
Discrete probability theory: random variable, probability mass function, joint probability mass function, conditional probability mass function, statistical independence, expected value (mean), variance, sample mean, Bayes' rule
Shannon's information measure: information, mutual information, entropy, conditional entropy, chain rule for entropy, binary entropy function
Fano's inequality
Data processing theorem
A discrete random variable X takes L possible values x^(i), i ∈ {1, 2, ..., L}, with
∑_{i=1}^{L} p_X(x^(i)) = 1.
If no confusion is possible, we briefly write p_X(x^(i)) = p(x). A random variable X is called uniformly distributed if p_X(x^(i)) = 1/L ∀i, i.e., if all events are equally probable. (Please note that a uniform distribution should not be confused with a sequence of identically distributed random variables.)
The expected value (mean) of f(X) is defined as
μ := E{f(X)} = ∑_{i=1}^{L} f(x^(i)) p_X(x^(i))
and the variance as
σ² := E{(f(X) − μ)²} = ∑_{i=1}^{L} (f(x^(i)) − μ)² p_X(x^(i)),
where x^(i), i ∈ {1, 2, ..., L}, are the possible events of X. Now, we generate n random draws of the random variable X, which we denote as x_1, x_2, ..., x_n. The sample mean is defined as
f(X)‾ = (1/n) ∑_{i=1}^{n} f(x_i).
Expected value, variance, and sample mean are only defined if f(X) is real-valued.
The joint probability mass function p_XY(x^(i), y^(j)) satisfies ∑_{i=1}^{L_x} ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)) = 1.
Random variables X and Y are called statistically independent if p_XY(x^(i), y^(j)) = p_X(x^(i)) · p_Y(y^(j)) ∀i, j.
The marginal distributions are obtained as
p_X(x^(i)) = ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)).
Correspondingly, p_Y(y^(j)) = ∑_{i=1}^{L_x} p_XY(x^(i), y^(j)).
Example (x^(1) = 0, x^(2) = 1, y^(1) = 0, y^(2) = 1, L_x = L_y = L = 2):
P(X = 0, Y = 0) = p_XY(0, 0) = 0.4
P(X = 0, Y = 1) = p_XY(0, 1) = 0.3
P(X = 1, Y = 0) = p_XY(1, 0) = 0.2
P(X = 1, Y = 1) = p_XY(1, 1) = 0.1
P(X = 0) = p_X(0) = 0.7, P(X = 1) = p_X(1) = 0.3
P(Y = 0) = p_Y(0) = 0.6, P(Y = 1) = p_Y(1) = 0.4
where p_Y(y^(j)) > 0 ∀j (Bayes' rule).
Theorem: p_{X|Y}(x^(i)|y^(j)) = p_X(x^(i)), if X and Y are statistically independent.
Proof: p_{X|Y}(x^(i)|y^(j)) = p_XY(x^(i), y^(j)) / p_Y(y^(j)) = p_X(x^(i)) p_Y(y^(j)) / p_Y(y^(j)) = p_X(x^(i)). q.e.d.
According to Bayes' rule, p_XY(x, y) = p_X(x) · p_{Y|X}(y|x). A generalization yields the chain rule:
p_{X1X2X3X4...}(x_1, x_2, x_3, x_4, ...) = p_{X1}(x_1) · p_{X2X3X4...|X1}(x_2, x_3, x_4, ... | x_1)
= p_{X1}(x_1) · p_{X2|X1}(x_2|x_1) · p_{X3X4...|X1X2}(x_3, x_4, ... | x_1, x_2) = ...
Definition of C.E. Shannon (1948): The information of an event {X = x^(i)} is defined as
I(X = x^(i)) = −log_b P(X = x^(i)), i ∈ {1, ..., L}.
The smaller the probability of the event, the larger the information. For a uniformly distributed random variable X, Hartley's and Shannon's information measures are identical. For the basis b = 2 the unit of the information is called bit(s), for b = e ≈ 2.71828 it is called nat(s), and for b = 10 it is called Hartley. Note that 'bit' does not mean 'binary digit' in this context.
The mutual information may be interpreted as the information gain which we obtain concerning an event {X = x^(i)} given an event {Y = y^(j)}, because I(X = x^(i)) can be interpreted as a priori information and I(X = x^(i)|Y = y^(j)) as a posteriori information.
I(X = x^(i)) is the information necessary to know that the event {X = x^(i)} occurs.
I(X = x^(i); Y = y^(j)) = log2 [ p_{X|Y}(x^(i)|y^(j)) / p_X(x^(i)) ] bit
is the information gain concerning an event {X = x^(i)} given an event {Y = y^(j)}.
Def. 6: X and Y are statistically independent, if p_XY(x^(i), y^(j)) = p_X(x^(i)) p_Y(y^(j)) ∀i, j.
Theorem: p_{X|Y}(x^(i)|y^(j)) = p_X(x^(i)), if X and Y are statistically independent.
Proof: p_{X|Y}(x^(i)|y^(j)) = p_XY(x^(i), y^(j)) / p_Y(y^(j)) = p_X(x^(i)) p_Y(y^(j)) / p_Y(y^(j)) = p_X(x^(i)). q.e.d.
Corollary: I(X = x^(i); Y = y^(j)) = 0, if X and Y are statistically independent.
Proof: I(X = x^(i); Y = y^(j)) = log2 [ p_{X|Y}(x^(i)|y^(j)) / p_X(x^(i)) ] bit = log2 [ p_X(x^(i)) / p_X(x^(i)) ] bit = 0. q.e.d.
Definition:
H(X) = E{I(X = x^(i))} = −∑_{i=1}^{L_x} p_X(x^(i)) log p_X(x^(i))
is called the entropy of the random variable X. The entropy is the average information (uncertainty) of a random variable.
Definition:
I(X; Y) = E{I(X = x^(i); Y = y^(j))} = ∑_{i=1}^{L_x} ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)) log [ p_{X|Y}(x^(i)|y^(j)) / p_X(x^(i)) ]
is called the mutual information between the random variables X and Y. The mutual information is the average information gain of X given Y (or vice versa).
Definition:
H(X, Y) = −∑_{i=1}^{L_x} ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)) log p_XY(x^(i), y^(j))
is called the joint entropy of X and Y.
Theorem: If X takes values over the alphabet {x^(1), x^(2), ..., x^(L_x)}, then
0 ≤ H(X) ≤ log L_x,
with equality on the left-hand side if ∃i so that p_X(x^(i)) = 1, and with equality on the right-hand side if p_X(x^(i)) = 1/L_x ∀i.
Definition:
H(X|Y = y^(j)) = −∑_{i=1}^{L_x} p_{X|Y}(x^(i)|y^(j)) log p_{X|Y}(x^(i)|y^(j))
and
H(X|Y) = ∑_{j=1}^{L_y} p_Y(y^(j)) H(X|Y = y^(j)).
Typically, this is the simplest rule in order to compute H(X|Y).
Corollary: The mutual information is the average information gain about X when observing Y:
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).
0 ≤ H(X|Y) ≤ log L_x and H(X|Y) ≤ H(X).
Relations Between the Entropy, Conditional Entropy, Joint Entropy, and Mutual Information (Venn Diagram)
I(X; Y) = H(X) + H(Y) − H(X, Y)
I(X; Y) = H(X) − H(X|Y)
I(X; Y) = H(Y) − H(Y|X)
I(X; Y) = I(Y; X)
I(X; X) = H(X)
I(X = x^(i); Y = y^(j)) = I(X = x^(i)) − I(X = x^(i)|Y = y^(j)) = log [ p_{X|Y}(x^(i)|y^(j)) / p_X(x^(i)) ] = log [ p_XY(x^(i), y^(j)) / (p_X(x^(i)) p_Y(y^(j))) ],
where p_X(x^(i)) = P(X = x^(i)) > 0 and p_Y(y^(j)) = P(Y = y^(j)) > 0, i ∈ {1, ..., L_x}, j ∈ {1, ..., L_y}. The entropy (uncertainty) of a random variable X is
H(X) = E{I(X = x^(i))} = −∑_{i=1}^{L_x} p_X(x^(i)) log p_X(x^(i)).
The entropy is the average information about this random variable. The mutual information between the random variables X and Y is
I(X; Y) = E{I(X = x^(i); Y = y^(j))} = ∑_{i=1}^{L_x} ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)) log [ p_XY(x^(i), y^(j)) / (p_X(x^(i)) p_Y(y^(j))) ].
Definitions (summary):
entropy: H(X) := E{−log p_X(x)} = −∑_{i=1}^{L_x} p_X(x^(i)) log p_X(x^(i))
joint entropy: H(X, Y) := E{−log p_XY(x, y)} = −∑_{i=1}^{L_x} ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)) log p_XY(x^(i), y^(j))
conditional entropy: H(X|Y) := E{−log p_{X|Y}(x|y)} = −∑_{i=1}^{L_x} ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)) log p_{X|Y}(x^(i)|y^(j))
mutual information: I(X; Y) := E{ log [ p_XY(x, y) / (p_X(x) p_Y(y)) ] } = ∑_{i=1}^{L_x} ∑_{j=1}^{L_y} p_XY(x^(i), y^(j)) log [ p_XY(x^(i), y^(j)) / (p_X(x^(i)) p_Y(y^(j))) ]
Theorems:
0 ≤ H(X) ≤ log L_x
0 ≤ H(X|Y) ≤ log L_x
H(X|Y) ≤ H(X) (equality, if X and Y are statistically independent)
Chain rule of entropy: H(X_1, X_2) = H(X_1) + H(X_2|X_1) and H(X_1, X_2|X_3) = H(X_1|X_3) + H(X_2|X_1, X_3)
I(X; Y) ≥ 0 (equality, if X and Y are statistically independent)
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
Binary entropy function (L_x = 2): H(X) := h(p) = −p log p − (1 − p) log(1 − p)
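These quantities are easy to verify numerically. The following Python sketch (helper names are illustrative, not part of the lecture) computes H(X), H(X|Y), and I(X; Y) for the joint probability mass function of the example above (p_XY(0,0) = 0.4, p_XY(0,1) = 0.3, p_XY(1,0) = 0.2, p_XY(1,1) = 0.1):

```python
import math

# Joint pmf from the earlier example: keys are (x, y) pairs.
p_xy = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

# Marginals p_X(x) = sum_y p_XY(x, y) and p_Y(y) = sum_x p_XY(x, y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

def entropy(pmf):
    """H = -sum p log2 p (terms with p = 0 are omitted)."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

H_X = entropy(p_x)        # h(0.7) ~ 0.881 bit
H_Y = entropy(p_y)        # h(0.6) ~ 0.971 bit
H_XY = entropy(p_xy)      # joint entropy H(X, Y)

# I(X;Y) = H(X) + H(Y) - H(X,Y); H(X|Y) = H(X,Y) - H(Y).
I_XY = H_X + H_Y - H_XY
H_X_given_Y = H_XY - H_Y
print(f"H(X)={H_X:.4f}  H(X|Y)={H_X_given_Y:.4f}  I(X;Y)={I_XY:.4f}")
```

The printed values satisfy I(X; Y) = H(X) − H(X|Y), as stated in the theorems above.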
Fano's Inequality
Suppose we know a random variable Y and we wish to guess the value of a correlated random variable X, where X and Y share the same alphabet (therefore L_x = L_y = L). Fano's inequality relates the probability of error P_e in guessing the random variable X to its conditional entropy H(X|Y).
Fano's inequality: Let X and Y be random variables with values over the same alphabet {x^(1), x^(2), ..., x^(L)} and let P_e = P(X ≠ Y). Then
H(X|Y) ≤ h(P_e) + P_e log2(L − 1).
We can estimate X from Y with zero probability of error if and only if H(X|Y) = 0.
Fano's Inequality
Proof of Fano's inequality: We introduce an error indicator Z:
Z := 0 if X = Y, 1 else.
Therefore H(Z) = h(P_e), because Z is a binary random variable. (1)
According to the chain rule for entropy,
H(X, Z|Y) = H(X|Y) + H(Z|X, Y) = H(X|Y) + 0 = H(X|Y),
since X and Y unequivocally determine Z. Furthermore, according to the chain rule for entropy,
H(X, Z|Y) = H(Z|Y) + H(X|Y, Z) ≤ H(Z) + H(X|Y, Z). (2)
Side calculation 1: H(X|Y, Z = 0) = H(X|X) = 0.
Side calculation 2: H(X|Y, Z = 1) ≤ log2(L − 1).
With H(X|Y, Z) = P(Z = 0) H(X|Y, Z = 0) + P(Z = 1) H(X|Y, Z = 1) we get
H(X|Y, Z) ≤ P(Z = 1) log2(L − 1) = P_e log2(L − 1). (3)
By substituting (3) into (2) and with (1) we obtain Fano's inequality. q.e.d.
Fano's Inequality
If an error occurs (i.e., if X ≠ Y), Fano's inequality is fulfilled with equality if all L − 1 remaining values are equally probable. The sum h(P_e) + P_e log2(L − 1) is positive for all P_e with 0 < P_e ≤ 1:
[Figure: h(P_e) + P_e log2(L − 1) plotted versus P_e ∈ [0, 1] for L = 2, 4, 10.]
Due to Fano's inequality we obtain lower and upper bounds on the error probability P_e given the conditional entropy H(X|Y).
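As a numerical illustration of the plotted bound, a minimal Python sketch (function names are illustrative):

```python
import math

def h(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_bound(pe, L):
    """Upper bound on H(X|Y): h(Pe) + Pe * log2(L - 1)."""
    return h(pe) + pe * math.log2(L - 1)

for L in (2, 4, 10):
    row = ", ".join(f"{fano_bound(pe, L):.3f}" for pe in (0.1, 0.3, 0.5, 0.9))
    print(f"L = {L:2d}: {row}")
```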
[Diagram: X → Processor 1 (or Channel 1) → Y → Processor 2 (or Channel 2) → Z.]
The channels/processors may be deterministic or stochastic. Z can be influenced by X only indirectly via Y. We say that the sequence (X, Y, Z) forms a Markov chain.
Data processing theorem: Let (X, Y, Z) be a Markov chain. Then I(X; Z) ≤ I(X; Y) and I(X; Z) ≤ I(Y; Z).
According to the data processing theorem, no information gain can be obtained by means of sequential data processing. (However, information may be transformed in order to become more accessible.)
b) I(X; Z) ≤ I(Y; Z):
Typical Sequences
In order to motivate the abstract notion of typical sequences, we conduct the following experiment: We are given an unfair coin (trick coin): the probability for head is p, the probability for number is q = 1 − p. With just one draw, we cannot estimate p for sure. With n draws, we can estimate p as accurately as desired, if n is sufficiently large.
Proof: Let r_n = (number of heads)/n, i.e., the relative frequency of head. According to Chebyshev's inequality,
P(|r_n − p| ≥ ε) ≤ σ²/ε² = pq/(nε²) → 0 for n → ∞,
where σ² is the variance of r_n and ε > 0.
q.e.d.
Typical Sequences
Motivated by the previous experiment, from now on we exclusively consider long sequences (n → ∞). Let us denote the cardinality of the symbol alphabet by L. Hence, L^n possible sequences exist. We separate the set of all L^n possible sequences into two subsets:
1. The first subset consists of all sequences which show about n·p times head. This is the set of so-called typical sequences.
2. The second subset consists of all remaining sequences. This is the set of so-called non-typical sequences.
To start with, we restrict ourselves to sequences of n independent and identically distributed (i.i.d.) random variables.
Typical Sequences
Formal motivation: Let X_1, X_2, ..., X_j, ..., X_n be a sequence of independent and identically distributed (i.i.d.) random variables, where each X_j is defined over an L-ary symbol alphabet 𝒳. We denote a certain event (i.e., a sequence of draws) as x = [x_1, x_2, ..., x_j, ..., x_n] ∈ 𝒳^n, where x_j ∈ {x^(1), x^(2), ..., x^(L)}. Since the random variables are assumed to be independent,
p_X(x) = ∏_{j=1}^{n} p_X(x_j)
and hence
I(X = x) = ∑_{j=1}^{n} I(X_j = x_j).
We say that an event x = [x_1, x_2, ..., x_n] is an ε-typical sequence if the sample mean of I(X = x_j) and the entropy H(X) differ by ε or less.
Typical Sequences
Definition: The set A_ε(X) of ε-typical sequences x = [x_1, x_2, ..., x_n] is defined as follows:
A_ε(X) = { x : | −(1/n) log_b p_X(x) − H(X) | ≤ ε },
where −(1/n) log_b p_X(x) is the per-symbol sample mean of I(X = x) and H(X) = E{I(X = x_j)}.
Theorem (asymptotic equipartition property (AEP)): For any ε > 0 there exists an integer n so that A_ε(X) fulfills the following conditions:
1. P(x ∈ A_ε(X)) ≥ 1 − ε
2. ∀x ∈ A_ε(X): | −(1/n) log_b p_X(x) − H(X) | ≤ ε
3. (1 − ε) b^{n(H(X)−ε)} ≤ |A_ε(X)| ≤ b^{n(H(X)+ε)}
Typical Sequences
Proof:
Property 1 follows from the convergence of −(1/n) log_b p_X(x) in the definition of A_ε(X).
Property 2 follows directly from the definition of A_ε(X).
Property 3 follows from property 1 and from
1 ≥ ∑_{x ∈ A_ε(X)} p_X(x) ≥ ∑_{x ∈ A_ε(X)} b^{−n(H(X)+ε)} = |A_ε(X)| · b^{−n(H(X)+ε)},
1 − ε ≤ ∑_{x ∈ A_ε(X)} p_X(x) ≤ ∑_{x ∈ A_ε(X)} b^{−n(H(X)−ε)} = |A_ε(X)| · b^{−n(H(X)−ε)}.
q.e.d.
Typical Sequences
Example 1 for typical sequences: We are given a binary random variable X (L = 2) with the values x^(1) = head and x^(2) = number, where P(X = x^(1)) := p = 2/5 and P(X = x^(2)) = 1 − p := q = 3/5. According to the binary entropy function, H(X) = h(p = 2/5) = 0.971 bit. We choose n = 5 and ε = 0.0971 (10% of H(X)). According to the corollary, a sequence x is ε-typical if 0.0247 ≤ p_X(x) ≤ 0.0484. Table I lists the L^n = 2^5 = 32 possible sequences (H: head, N: number). According to Table I, 10 of the 32 possible sequences are ε-typical. (This is a combinatorial rather than a stochastic problem.) The ε-typical sequences are marked accordingly. Note that P(x ∈ A_ε(X)) ≈ 0.346, since n = 5 is rather small. Hence, the supposition P(x ∈ A_ε(X)) ≥ 1 − ε is not fulfilled.
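This example can be checked by brute force; the following sketch (assuming the parameters above; names are illustrative) enumerates all 32 sequences:

```python
import itertools, math

p, n, H = 2/5, 5, 0.971                      # P(head), block length, h(2/5)
eps = 0.1 * H                                # epsilon = 10% of H(X)
lo, hi = 2**(-n*(H+eps)), 2**(-n*(H-eps))    # 0.0247 ... 0.0484

typical, prob_mass = 0, 0.0
for seq in itertools.product("HN", repeat=n):   # all 2^5 = 32 sequences
    k = seq.count("H")                           # number of heads
    p_seq = p**k * (1 - p)**(n - k)
    if lo <= p_seq <= hi:                        # epsilon-typical?
        typical += 1
        prob_mass += p_seq

print(typical, round(prob_mass, 3))   # 10 typical sequences, P ~ 0.346
```

It reports 10 ε-typical sequences with total probability ≈ 0.346, in agreement with Table I.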
[Table II: Example 2 with p = 2/5, ε = 0.0971 and growing block length n: the probability P(x ∈ A_ε(X)) increases monotonically (0.3676, 0.5711, 0.7760, 0.9049, 0.9834, 0.9998, ...) and exceeds 1 − ε = 0.9500 for sufficiently large n, while the cardinality |A_ε(X)| grows as a power of two.]
The number of ε-typical sequences of length n = 10000 amounts to only 2^5471.45. A 10^−1361-th fraction of all possible sequences contributes almost 100% of the total probability. The data sequence can be compressed by almost 50%.
Typical Sequences
Properties of typical sequences:
All typical sequences have about the same probability b^{−nH(X)}.
The sum of the probabilities of all typical sequences is nearly one.
Nevertheless, the number of typical sequences, |A_ε(X)| ≈ b^{nH(X)}, is very small compared with the total number of all possible sequences.
Although the number of typical sequences is very small, they contribute almost the entire probability. However, the typical sequences are not the most likely sequences! (Example: any sequence with n times number is more likely than any typical sequence, if p < 1/2.)
In information theory, only typical sequences are considered, because the probability that a non-typical sequence occurs is arbitrarily small as n → ∞.
where p_XY(x, y) = ∏_{j=1}^{n} p_XY(x_j, y_j).
3. (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |A_ε(X, Y)| ≤ 2^{n(H(X,Y)+ε)}, where |A_ε(X, Y)| is the number of jointly ε-typical sequences.
Corollary: For any jointly ε-typical sequence (x, y) ∈ A_ε(X, Y),
2^{−n(H(X,Y)+ε)} ≤ p_XY(x, y) ≤ 2^{−n(H(X,Y)−ε)}
holds.
Block Diagram of a Communication System with Source Coding and Channel Coding
[Figure: Source —q→ Source encoder —u (binary data)→ Channel encoder —x→ Modulator → Channel (disturbance) → Demodulator —y→ Channel decoder —û (binary data)→ Source decoder —q̂→ Sink.]
[Figure: same chain as before; the channel between modulator input x and demodulator output y is described by p_{Y|X}(y|x).]
p_{Y|X}(y|x) = ∏_{i=1}^{n} p_{Y|X}(y_i|x_i).
Hence, the discrete memoryless channel is completely defined by the conditional probability mass function p_{Y|X}(y|x). Interpretation: Given that symbol x is transmitted, the received symbol y occurs randomly with probability p_{Y|X}(y|x).
[Figure: binary symmetric channel (BSC) with input X ∈ {0, 1} and output Y ∈ {0, 1}; crossover probability p.]
p_{Y|X}(0|0) = p_{Y|X}(1|1) = 1 − p, p_{Y|X}(1|0) = p_{Y|X}(0|1) = p
[Figure: info word u (k bits) → Channel encoder (opt. nonbinary) → code word x (n symbols) → channel → received word y → Channel decoder (word error probability P_w); code rate R = k/n bits/channel symbol.]
The technical device for coding is called encoder. The set of all coded sequences, i.e., the set of all code words, is called code. We restrict ourselves to so-called (n, k) block codes.
Random Code
Definition: An (n, k) random code of rate R = k/n consists of 2^{nR} randomly generated code words x of length n, so that x ∈ A_ε(X). The 2^{nR} info words u are randomly but uniquely assigned to the code words.
Example (p = 2/5, n = 5, k = 3, ε → 0): We would like to generate a binary random code of rate R = k/n = 3/5. The set of all ε-typical sequences consists of |A_ε(X)| = 10 sequences (see table). From the set of all ε-typical sequences, we randomly select 2^{nR} = 8 sequences. These are our code words. The info words are assigned arbitrarily:
Code word x | Index i | Info word u
[00011] | 3 | [011]
[00101] | 6 | [110]
[00110] | 0 | [000]
[01001] | 2 | [010]
[01010] | 5 | [101]
[01100] | 1 | [001]
[10001] | 7 | [111]
[10010] | 4 | [100]
[10100] | - | (not selected)
[11000] | - | (not selected)
[Figure: relation between H(X), H(Y), H(Y|X), and I(X; Y) at the channel input and output.]
Theorem (Shannon's channel coding theorem): The channel capacity of a discrete memoryless channel is C = max_{p_X(x)} I(X; Y).
The unit of the channel capacity is bits/channel symbol (we also say bits/channel use), if the logarithm with basis b = 2 is used.
Proof: We have to verify that each rate R < C is achievable, i.e., there exists at least one code of length n so that P_w → 0 as n → ∞. The corresponding converse states that a positive lower bound for P_w exists if R > C, i.e., there exists an ε with P_w ≥ ε that cannot be improved by any channel code.
We randomly (!) choose the 2^{nR} code words, where the code words are independent and identically distributed sequences x. Hence, each code word occurs with probability
p_X(x) = ∏_{i=1}^{n} p_X(x_i).
We number all code words and denote the corresponding index by w ∈ {1, 2, ..., 2^{nR}}, i.e., x(w) is the w-th code word. Decoding is done by choosing an index ŵ for each received word y so that (x(ŵ), y) ∈ A_ε(X, Y), assuming that such an index exists and that the index is unique.
Applying the data processing theorem yields H(U|Û) ≥ H(U) − I(X; Y) (here: X channel input, Y channel output). According to the definition of the channel capacity of a DMC,
H(U|Û) ≥ H(U) − nC = k − nC. (∗)
C = log |𝒴| + ∑_j p_j log p_j (capacity of a symmetric DMC, where the p_j are the transition probabilities of one row of the channel matrix).
Example: The channel capacity of a binary symmetric channel (BSC) with transition probabilities p and q = 1 − p is given by
C_BSC = max_{p_X(x)} H(Y) + p log p + (1 − p) log(1 − p) = 1 − h(p).
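A minimal numerical check of this formula (a sketch; function names are illustrative):

```python
import math

def h(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def c_bsc(p):
    """Capacity of the binary symmetric channel: C = 1 - h(p)."""
    return 1.0 - h(p)

for p in (0.0, 0.01, 0.1, 0.5):
    print(f"p={p}: C = {c_bsc(p):.4f} bit/channel use")
# p = 0.5 gives C = 0: the channel output is independent of the input.
```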
C = log |𝒴| + ∑_j p_j log p_j.
The channel capacity is obtained for p_X(x) = 1/|𝒳| for all x.
Example: The ternary DMC
[Figure: ternary channel, X ∈ {0, 1, 2} → Y ∈ {0, 1, 2}, with correct-transition probability 1 − p and cross-transition probabilities p, as labeled in the original figure.]
is strongly symmetric, and its channel capacity follows from the formula above.
C_sym = ∑_i q_i C_i,
where q_i is the probability and C_i is the channel capacity of the i-th strongly symmetric component channel. Example: binary erasure channel (BEC).
C_sum = log2 ∑_i 2^{C_i}.
[Figure: sum channel with X ∈ {0, 1, 2} → Y ∈ {0, 1, 2}: the symbols 0 and 1 form a BSC with crossover probability p, while symbol 2 is transmitted error-free.]
The channel capacity of this DMC is C_sum = log2(1 + 2^{1−h(p)}) bits/channel symbol.
[Figure: same transmission chain as before; the cascade from source encoder input q to source decoder output q̂ is modeled as a test channel p_{Q̂|Q}(q̂|q).]
E{d(Q, Q̂)} = ∑_q ∑_q̂ p_{QQ̂}(q, q̂) d(q, q̂) = ∑_q p_Q(q) ∑_q̂ p_{Q̂|Q}(q̂|q) d(q, q̂).
The set of all test channels whose average distortion is D or less is denoted as
T(D) = { p_{Q̂|Q}(q̂|q) : E{d(Q, Q̂)} ≤ D }.
a) Identity test channel (Q̂ = Q): maximum mutual information, but no data compression.
b) Trivial test channel with I(Q; Q̂) = 0: zero mutual information, but maximum data compression.
Perception: We would like to have a test channel with minimal (!) mutual information which causes an acceptable distortion at the same time.
R(D) = min_{p_{Q̂|Q}(q̂|q) ∈ T(D)} I(Q; Q̂),
[Figure: k source symbols q → Source encoder → u → Channel encoder (opt. nonbinary); total rate R_tot.]
Shannon's source coding theorem: A sequence of n symbols from a memoryless source, which is (source) encoded with at least nH(Q) bits, can be reconstructed quasi error-free.
Shannon's rate distortion theorem: A sequence of n symbols from a memoryless source, which is (source) encoded with nR(D) bits, can be reconstructed with an average total distortion of nD, where
R(D) = min_{p_{Q̂|Q}(q̂|q) ∈ T(D)} I(Q; Q̂).
[Figure: source symbol q → Source encoder → code word u.]
1. The memoryless source generates a random variable Q over a finite alphabet 𝒬 with L elements, or a sequence of random variables with this property. In the latter case the source symbols q are assumed to be statistically independent.
2. For each L-ary source symbol q the source encoder delivers a bit sequence u. The bit sequence u has a variable length. The average code word length is a measure for the efficiency of the source encoder.
The shorter the average code word length, the more efficient is the source encoder. The source encoder must fulfill the following properties:
1. No two code words are identical, i.e., u^(i) ≠ u^(j) for i ≠ j. ⇒ H(Q) = H(U).
2. No code word is the prefix of a longer code word. ⇒ We are able to identify a code word as soon as we have received its last symbol, even if the source encoder operates continuously.
Codes which fulfill both properties are called prefix-free.
This code is not prefix-free, since u^(1) is a prefix of u^(2) and u^(3).
Prefix-free codes allow for a so-called comma-free transmission, if no transmission errors occur.
[Figure: code tree with intermediate node probabilities 0.55, 0.30, 0.25, 0.15, 0.10.]
Hence E{W} ≈ 1.01 · H(Q). Obviously, the code is a good code, but we don't know at this point whether the code is optimal or not.
∑_{i=1}^{L} 2^{−w^(i)} ≤ 1 (Kraft inequality).
Example: Design a code with the lengths w^(1) = 2, w^(2) = 2, w^(3) = 2, w^(4) = 3 and w^(5) = 4!
Solution: Since ∑_{i=1}^{5} 2^{−w^(i)} = 1/4 + 1/4 + 1/4 + 1/8 + 1/16 = 15/16 ≤ 1, a prefix-free code with these lengths exists, e.g. (reading the code tree):
q^(1) → [00], q^(2) → [01], q^(3) → [10], q^(4) → [110], q^(5) → [1110]
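The Kraft sum of this example can be checked mechanically (a minimal sketch):

```python
lengths = [2, 2, 2, 3, 4]                 # desired code word lengths w(i)
kraft = sum(2**-w for w in lengths)       # 1/4 + 1/4 + 1/4 + 1/8 + 1/16
print(kraft, kraft <= 1)                  # 0.9375 True -> a prefix-free
                                          # code with these lengths exists
```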
Here, ◦ symbolizes the concatenation of symbol sequences. Each binary prefix-free code for Q is a merged code of a binary prefix-free code for Q̃. According to Massey's path length theorem we obtain:
E{W} = E{W̃} + p_Q(q^(L−1)) + p_Q(q^(L)).
E{W} is minimal, if E{W̃} is minimal.
Huffman Algorithm
The Huffman algorithm is based on the following lemmas:
Lemma: The tree of an optimum binary prefix-free code has no unused final nodes.
Lemma: Let q^(L−1) and q^(L) be the least probable events of a random variable Q. There exists an optimal binary prefix-free code for Q so that the least probable code words u^(L−1) and u^(L) differ in the last position, i.e.,
u^(L−1) = ũ^(L−1) ◦ 0 and u^(L) = ũ^(L−1) ◦ 1,
where ũ^(L−1) is a common prefix.
Lemma: The binary prefix-free code for Q with u^(L−1) = ũ^(L−1) ◦ 0 and u^(L) = ũ^(L−1) ◦ 1 is optimal, if the merged code for the random variable Q̃ is optimal.
Huffman Algorithm
The Huffman algorithm consists of the following three steps:
Step 0 (initialization): Let us denote the L final nodes as q^(1), q^(2), ..., q^(L). To each node q^(i), the corresponding probability p_Q(q^(i)), i ∈ {1, 2, ..., L}, is assigned. All nodes are declared active.
Step 1: We merge the two least probable active nodes. These two nodes are declared passive, whereas the new node is declared active. The new node is assigned the sum of the probabilities of the two merged nodes.
Step 2: Stop, if only one active node is left. (In this case we have reached the root.) Otherwise, go to step 1.
Huffman Algorithm
Example: Let us design the Huffman code for the following scenario:
q: q^(1), q^(2), q^(3), q^(4), q^(5), q^(6)
p_Q(q): 0.05, 0.10, 0.15, 0.20, 0.23, 0.27
Solution:
[Figure: Huffman tree; the merging steps produce the node probabilities 0.05 + 0.10 = 0.15, 0.15 + 0.15 = 0.30, 0.20 + 0.23 = 0.43, 0.27 + 0.30 = 0.57, and 0.43 + 0.57 = 1.00.]
The average code word length is E{W } = 1.00 + 0.57 + 0.43 + 0.30 + 0.15 = 2.45 bits/source symbol.
Equivalently, E{W} = ∑_{i=1}^{6} p_Q(q^(i)) w^(i), where w^(i) is the length of the i-th code word.
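The example can be reproduced with a compact Huffman implementation; the following Python sketch (illustrative names, using the standard-library heapq module) yields E{W} = 2.45 bits/source symbol:

```python
import heapq

def huffman(pmf):
    """Return a dict symbol -> binary code word for the given pmf."""
    # Heap entries: (probability, tie-breaker, subtree); a subtree is
    # either a symbol or a pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                  # merge the two least probable nodes
        p0, _, t0 = heapq.heappop(heap)
        p1, _, t1 = heapq.heappop(heap)
        heapq.heappush(heap, (p0 + p1, count, (t0, t1)))
        count += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # inner node: branch 0 / 1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: store the code word
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

pmf = {"q1": 0.05, "q2": 0.10, "q3": 0.15, "q4": 0.20, "q5": 0.23, "q6": 0.27}
code = huffman(pmf)
E_W = sum(pmf[s] * len(code[s]) for s in pmf)
print(code, f"E{{W}} = {E_W:.2f} bits/source symbol")   # 2.45
```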
Markov Sources
So far we assumed statistically independent source symbols (i.e., memoryless sources). Now, we consider sources with memory. Motivation (Shannon, 1948): Consider a random variable Q with cardinality L = 27 (26 capital characters plus the space symbol). 1st order approximation (based on the statistics of an English-written text):
Markov Sources
Definition: A Markov source is a sequence of random variables Q_0, Q_1, ..., Q_n with the following properties:
1. Each random variable Q_i = f(S_i) takes values over a finite alphabet 𝒬 with cardinality |𝒬| = L, where i is the symbol index (0 ≤ i ≤ n).
2. The sequence [Q_0, Q_1, ..., Q_n] is stationary, i.e., its statistical properties are not time-varying.
3. The sequence [S_0, S_1, ..., S_n] forms a Markov chain with transition matrix Π.
4. f(.) is a function whose domain contains the finite set 𝒮 of all states and whose range contains the finite set 𝒬 of output symbols.
5. The initial state is randomly generated according to the stationary distribution π = [π^(1), π^(2), ..., π^(L_s)], where π^(i) := P(S_0 = s^(i)) for all i ∈ {1, ..., L_s} and where L_s = |𝒮| denotes the number of states. (Given the transition matrix Π, the stationary distribution can be computed by solving π = π Π.)
Markov Sources
Definition: The entropy rate (per symbol) of the sequence [Q_0, Q_1, ..., Q_n] is defined as
H_∞(Q) = lim_{n→∞} H(Q_n | Q_0 Q_1 ... Q_{n−1}).
The entropy rate may be interpreted as the uncertainty of a symbol given its history.
Definition: The alphabet rate of a random variable Q is equal to the logarithm of the cardinality of the symbol alphabet: H_0 := log |𝒬| = log L.
Theorem: H_∞(Q) ≤ H(Q_n | Q_0 Q_1 ... Q_{n−1}) ≤ H(Q_n) ≤ log |𝒬| = log L = H_0.
Example: For English-written texts with 27 letters (26 capital characters plus the space symbol) we obtain: H_0 = log2 27 bits ≈ 4.75 bits, H_1 ≈ 4.1 bits, H_∞ ≈ 1.3 bits.
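As a small illustration, the stationary distribution π = πΠ and, for f equal to the identity (Q_i = S_i), the entropy rate can be computed numerically. The transition matrix below is a hypothetical example, not taken from the lecture:

```python
import numpy as np

# Hypothetical two-state transition matrix Pi (rows sum to 1).
Pi = np.array([[0.9, 0.1],
               [0.4, 0.6]])

# Stationary distribution: solve pi = pi @ Pi together with sum(pi) = 1.
A = np.vstack([Pi.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("pi =", pi)                               # [0.8, 0.2]

# If f is the identity, the entropy rate is sum_i pi_i * H(row i of Pi).
def h_row(row):
    return -sum(p * np.log2(p) for p in row if p > 0)

H_rate = sum(pi[i] * h_row(Pi[i]) for i in range(2))
print("entropy rate =", round(float(H_rate), 4), "bit/symbol")
```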
Willems Algorithm
The Willems algorithm can be applied to arbitrarily long source sequences with a finite alphabet. The main steps are as follows:
Step 0 (initialization): We compute a coding table, as described below, and initialize a buffer of length 2^N − 1.
Step 1: We divide the source sequence of length n into m subblocks q_1, q_2, ..., q_m of length N.
Step 2: These subblocks are subsequently attached to the buffer on the right-hand side. The encoder determines the repetition time t_r of the current subblock. If the repetition time does not exceed the buffer length, we obtain a code word from the table representing the repetition time. Otherwise, a prefix is attached to the subblock and no further encoding is done.
Step 3: The contents of the buffer is shifted to the left-hand side by N steps. We stop if the end of the source sequence is reached. Otherwise, we proceed with step 2.
Willems Algorithm
The coding table is computed as follows: We define the sets
T_i = {t_r : 2^i ≤ t_r < 2^{i+1}} for i = 0, 1, ..., N − 1 and T_N = {t_r : t_r ≥ 2^N}.
The encoder assigns an index i to any repetition time t_r so that t_r ∈ T_i. The encoded index is the first part of the code word (prefix). The prefix consists of ⌈log2(N + 1)⌉ bits. If i < N, the rest of the code word (suffix) is determined by encoding j = t_r − 2^i. In this case, the suffix consists of i bits. If i = N, the actual subblock is chosen as the suffix. In this case, the suffix consists of N bits.
Willems Algorithm
Example: For N = 3 the following coding table is obtained:
t_r | i | j | Prefix | Suffix | Length
1 | 0 | 0 | [00] | - | 2
2 | 1 | 0 | [01] | [0] | 3
3 | 1 | 1 | [01] | [1] | 3
4 | 2 | 0 | [10] | [00] | 4
5 | 2 | 1 | [10] | [01] | 4
6 | 2 | 2 | [10] | [10] | 4
7 | 2 | 3 | [10] | [11] | 4
≥8 | 3 | - | [11] | subblock q | 5
Willems Algorithm
Example (cont'd): N = 3, m = 7, 2^N − 1 = 7. Let [100,000,011,111,011,101,001] be the sequence of binary source symbols. Let [0100100] be the initial contents of the buffer.
1. subblock: [0100100]100 → t_r = 3 → u_1 = [01 1]
2. subblock: [0100100]000 → t_r = 1 → u_2 = [00]
3. subblock: [0100000]011 → t_r ≥ 8 → u_3 = [11 011]
4. subblock: [0000011]111 → t_r = 1 → u_4 = [00]
5. subblock: [0011111]011 → t_r = 6 → u_5 = [10 10]
6. subblock: [1111011]101 → t_r = 4 → u_6 = [10 00]
7. subblock: [1011101]001 → t_r ≥ 8 → u_7 = [11 001]
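The worked example can be reproduced with the following sketch, which implements the buffer search and the N = 3 coding table described above (function and variable names are illustrative):

```python
def willems_encode(bits, N=3, buf=None):
    """Willems encoder: buffer length 2^N - 1, coding table as above."""
    if buf is None:
        buf = "0" * (2**N - 1)            # default initial buffer contents
    prefix_bits = N.bit_length()          # ceil(log2(N+1)); 2 bits for N = 3
    words = []
    for pos in range(0, len(bits), N):
        block = bits[pos:pos + N]
        window = buf + block              # buffer with the new subblock attached
        start = len(buf)                  # position of the current subblock
        # Repetition time: distance to the most recent earlier occurrence.
        tr = next((start - s for s in range(start - 1, -1, -1)
                   if window[s:s + N] == block), 2**N)
        if tr < 2**N:
            i = tr.bit_length() - 1       # index i with 2^i <= tr < 2^(i+1)
            suffix = format(tr - 2**i, f"0{i}b") if i else ""
        else:
            i, suffix = N, block          # escape: send the subblock itself
        words.append(format(i, f"0{prefix_bits}b") + suffix)
        buf = window[-(2**N - 1):]        # shift the buffer to the left by N
    return words

print(willems_encode("100000011111011101001", buf="0100100"))
# ['011', '00', '11011', '00', '1010', '1000', '11001']
```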
Willems Algorithm
We recognize that in the best case the source sequence is compressed by the factor ⌈log2(N + 1)⌉/N, and in the worst case the source sequence is expanded by the factor 1 + ⌈log2(N + 1)⌉/N. How efficient is the Willems algorithm?
Definition: The average rate is R̄ = E{length(u)}/N = E{W}/N, where the code word length W is averaged w.r.t. the statistics of the source symbols. For stationary ergodic sequences q = [q_1, q_2, ..., q_m] one can prove that
lim_{N→∞} R̄ = H_∞(Q).
Correspondingly, the Willems algorithm is optimal if the buffer as well as the subblocks are of infinite length.
Lempel-Ziv Algorithm
We explain the Lempel-Ziv algorithm (LZ78) using the example of a binary data sequence of length n. A generalization to sources with finite output alphabets is straightforward.
Step 1: The source symbol sequence is subsequently divided into subblocks which are as short as possible and which did not occur before. We denote the total number of subblocks as m, the last symbol of each subblock as the suffix, and the remaining symbols of each subblock as the prefix.
Step 2: We encode the position of each prefix and attach the suffix. ⌈log2 m⌉ bits are needed in order to encode the position and 1 bit for the suffix, i.e., m(1 + ⌈log2 m⌉) bits are needed in total for a sequence of length n. Correspondingly, we need m(1 + ⌈log2 m⌉)/n bits per source symbol.
Lempel-Ziv Algorithm
Example: Let [1011010100010] be the sequence of binary source symbols. Correspondingly, the subblocks are obtained as [1], [0], [11], [01], [010], [00], [10]. We observe that n = 13 and m = 7, i.e., we need 3 bits in order to encode a position. The encoded sequence is [000,1] [000,0] [001,1] [010,1] [100,0] [010,0] [001,0]. In this tutorial example we need 28 bits in order to encode 13 source bits. However, the efficiency grows with increasing length of the source sequence. For stationary ergodic sequences q = [q_1, q_2, ..., q_n] one can prove that
lim_{n→∞} m(1 + ⌈log2 m⌉)/n = H_∞(Q).
Therefore, the Lempel-Ziv algorithm is optimal for infinitely long source sequences. The version of the Lempel-Ziv algorithm introduced so far needs two passes and delivers code words of equal length. Many modifications exist.
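A compact LZ78 sketch (illustrative names) reproduces the parsing and the encoded pairs of this example:

```python
import math

def lz78_encode(bits):
    """LZ78 for a binary sequence: returns (prefix position, suffix) pairs."""
    dictionary = {"": 0}          # prefix -> position index (0 = empty prefix)
    pairs, phrase = [], ""
    for b in bits:
        if phrase + b in dictionary:      # extend the current subblock
            phrase += b
        else:                             # new subblock: emit and store it
            pairs.append((dictionary[phrase], b))
            dictionary[phrase + b] = len(dictionary)
            phrase = ""
    if phrase:                            # trailing, already-seen subblock
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs

pairs = lz78_encode("1011010100010")
m = len(pairs)
bits_pos = math.ceil(math.log2(m))        # 3 bits per position for m = 7
encoded = [(format(pos, f"0{bits_pos}b"), suf) for pos, suf in pairs]
print(encoded)   # [('000','1'), ('000','0'), ('001','1'), ('010','1'),
                 #  ('100','0'), ('010','0'), ('001','0')]  -> 28 bits
```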
[Figure: info word u → Channel encoder → code word x → channel → received word y → Channel decoder → estimates û (or x̂).]
Bit error probability: P_b = (1/k) ∑_{j=1}^{k} P(û_j ≠ u_j).
1 − P_w = ∑_y p_YX(y, x̂(y)),
where x̂(y) is the estimated code word for a given received word y. Hence, x̂(y) describes the decoding rule. Correspondingly,
P_w = 1 − ∑_y p_YX(y, x̂(y)).
This equation holds for arbitrary discrete channels and decoding rules, and is exact.
Decoding Rules
Let x_i be the i-th code word, i ∈ {1, 2, ..., 2^k}, k = nR.
Maximum a posteriori (MAP) decoding: p_{X|Y}(x̂|y) ≥ p_{X|Y}(x_i|y) ∀i.
Maximum-likelihood (ML) decoding: p_{Y|X}(y|x̂) ≥ p_{Y|X}(y|x_i) ∀i.
According to Bayes' rule, p_{X|Y}(x_i|y) = p_{Y|X}(y|x_i) p_X(x_i) / p_Y(y).
Since the denominator is independent of i, the MAP rule and the ML rule are equal for equally probable code words, i.e., for p_X(x_i) = 1/2^k ∀i ∈ {1, 2, ..., 2^k}.
Decoding Rules
MAP decoding: û_MAP = arg max_{u_i} p_{U|Y}(u_i|y), or equivalently x̂_MAP = arg max_{x_i} p_{X|Y}(x_i|y).
The MAP rule estimates the most probable info word (or the most probable code word, respectively) given the received word y. The a priori probabilities p_X(x_i) have to be known at the receiver.
ML decoding: û_ML = arg max_{u_i} p_{Y|U}(y|u_i), or equivalently x̂_ML = arg max_{x_i} p_{Y|X}(y|x_i).
The ML rule maximizes the probability of the received word y over all hypotheses x_i, i ∈ {1, 2, ..., 2^k}. The ML decoder does not make use of a priori information.
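Both rules can be stated in a few lines of code. The following sketch uses a hypothetical BSC with a (3,1) repetition code and a strongly asymmetric prior to show how MAP and ML decisions can differ (all names and numbers are illustrative):

```python
import math

p = 0.1                                       # BSC crossover probability
code = {"0": "000", "1": "111"}               # info word -> code word
prior = {"0": 0.95, "1": 0.05}                # a priori probabilities

def p_y_given_x(y, x):
    """DMC: product of per-symbol transition probabilities."""
    return math.prod(p if yi != xi else 1 - p for yi, xi in zip(y, x))

def ml_decode(y):
    return max(code, key=lambda u: p_y_given_x(y, code[u]))

def map_decode(y):
    return max(code, key=lambda u: p_y_given_x(y, code[u]) * prior[u])

for y in ("001", "011"):
    print(y, "-> ML:", ml_decode(y), " MAP:", map_decode(y))
# For y = "011" the ML rule decides 1, but the strong prior for 0
# flips the MAP decision to 0.
```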
Decoding Rules
Example: ternary DMC, (2,1) repetition code of length n = 2.
[Figure: ternary DMC, X ∈ {0, 1, 2} → Y ∈ {0, 1, 2}, with the transition probabilities of the original slide.]
Which decision table is used by a MAP decoder, which table is used by an ML decoder? Are the decision tables unique? What are the corresponding word error probabilities?
Decoding Rules
[Tables: decision tables mapping the nine received words y ∈ {0, 1, 2}² to the code word hypotheses; one table (entries A B B B C C C C C) belongs to the ML decoder, the other (entries B B B B B C C C C) to the MAP decoder, as in the original slide.]
Bhattacharyya Bound
Motivation: An exact computation of the word error probability is very complex, since we have to add 2^n terms, where decoding regions have to be taken into account. For typical block codes (n ≈ 100 ... 1000), the computational effort is not manageable. Hence, in the following we derive two upper bounds on the word error probability. We define the decoding regions as
D_i = {y : û(y) = u_i} = {y : x̂(y) = x_i}, i ∈ {1, 2, ..., 2^k}.
Furthermore, we define the conditional word error probability
P_w|i = P(û ≠ u_i | u_i transmitted) = P(x̂ ≠ x_i | x_i transmitted).
Since p_U(u_i) = p_X(x_i), the word error probability can be written as
P_w = ∑_{i=1}^{2^k} p_X(x_i) P_w|i, k = nR,
where the expected value is taken over all info words given a fixed code.
Bhattacharyya Bound
[Figure: received-word space with the 2^k code words x_1, x_2, ...; the decoding regions D_1 and D_2 around x_1 and x_2 are separated by the decoding threshold; y denotes a received word.]
Bhattacharyya Bound
According to the definitions it follows that
P_w|i = P(y ∉ D_i | u_i transmitted) = P(y ∉ D_i | x_i transmitted),
or equivalently
P_w|i = ∑_{y ∉ D_i} p_{Y|X}(y|x_i).
For two code words, P_w|1 = ∑_{y ∈ D_2} p_{Y|X}(y|x_1). For ML decoding an upper bound is obtained by multiplying each term of the sum by √(p_{Y|X}(y|x_2)/p_{Y|X}(y|x_1)), since this square root is ≥ 1 if y ∈ D_2 (and ≤ 1 else):
P_w|1 ≤ ∑_{y ∈ D_2} p_{Y|X}(y|x_1) √(p_{Y|X}(y|x_2)/p_{Y|X}(y|x_1)) = ∑_{y ∈ D_2} √(p_{Y|X}(y|x_1) p_{Y|X}(y|x_2)).
Bhattacharyya Bound
The approximation is reasonably tight, since the decisions of the ML decoder are least reliable near the decision threshold between D_1 and D_2, where the square root is approximately one. The right-hand side may be further upper-bounded by taking all possible received words y into account:
P_w|1 ≤ ∑_y √(p_{Y|X}(y|x_1) p_{Y|X}(y|x_2)).
Note that the computational complexity reduces significantly, since no decision thresholds have to be computed. Similarly,
P_w|2 ≤ ∑_y √(p_{Y|X}(y|x_1) p_{Y|X}(y|x_2)).
For a DMC the bound factorizes symbol-wise:
P_w|i ≤ ∏_{j=1}^{n} ∑_y √( p_{Y|X}(y|x_{1j}) p_{Y|X}(y|x_{2j}) ), i ∈ {1, 2}.
Bhattacharyya Bound
Theorem (Bhattacharyya bound for two code words): For a given code of length n with two code words x_1 and x_2, which are transmitted with arbitrary probabilities p_X(x_1) and p_X(x_2) via a discrete memoryless channel (DMC), the conditional word error probability for ML decoding can be upper-bounded as
P_w|i ≤ ∏_{j=1}^{n} ∑_y √( p_{Y|X}(y|x_{1j}) p_{Y|X}(y|x_{2j}) ), i ∈ {1, 2}.
Generalization:
Theorem (Bhattacharyya bound): For a given code of length n with code words x_i, i ∈ {1, 2, ..., 2^k}, which are transmitted with arbitrary probabilities p_X(x_i) via a discrete memoryless channel (DMC), the conditional word error probability for ML decoding can be upper-bounded as
P_w|i ≤ ∑_{ℓ=1, ℓ≠i}^{2^k} ∏_{j=1}^{n} ∑_y √( p_{Y|X}(y|x_{ij}) p_{Y|X}(y|x_{ℓj}) ).
Bhattacharyya Bound
Example 1: (n, 1) repetition code, BSC: P_w|i ≤ (2√(p(1 − p)))^n.
Example 2: (n, 1) repetition code, BEC: P_w|i ≤ q^n; exact result: P_w = q^n/2.
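Example 1 can be evaluated numerically and compared with the exact word error probability of majority-logic decoding (a sketch; names are illustrative):

```python
import math

def bhattacharyya_bsc(p, n):
    """Bhattacharyya bound for the (n,1) repetition code on a BSC."""
    return (2 * math.sqrt(p * (1 - p))) ** n

def exact_pw_bsc(p, n):
    """Exact ML word error probability (majority decision, n odd)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

p = 0.1
for n in (3, 5, 7):
    print(n, f"bound={bhattacharyya_bsc(p, n):.4e}",
          f"exact={exact_pw_bsc(p, n):.4e}")
```

As expected, the bound holds but is loose; it decays with the same exponential rate as the exact error probability.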
Remarks concerning the Bhattacharyya bound: The Bhattacharyya bound depends on i and hence on the source statistics p_X(x_i). The word error probability can be bounded as P_w ≤ max_i P_w|i.
For codes with many code words, the Bhattacharyya bound typically is not tight, since too many decision regions are taken into account. The so-called Gallager bound, which is derived next, is tighter.
Gallager Bound
As proven before,
P_w|i = ∑_{y ∉ D_i} p_{Y|X}(y|x_i). (∗)
We multiply each term in (∗) with a factor that is ≥ 1 for y ∉ D_i and obtain
P_w|i ≤ ∑_{y ∉ D_i} p_{Y|X}(y|x_i)^{1−sρ} [ ∑_{j=1, j≠i}^{2^k} p_{Y|X}(y|x_j)^s ]^ρ.
By means of the parameters ρ and s, which have been introduced by Gallager, the Gallager bound is tighter than the Bhattacharyya bound.
Gallager Bound
According to Gallager we choose s = 1/(1+ρ), 0 ≤ ρ ≤ 1, take all received words y into account, and finally obtain:
Theorem (Gallager bound): For a given code of length n with code words x_i, i ∈ {1, 2, ..., 2^k}, which are transmitted with arbitrary probabilities p_X(x_i) via an arbitrary discrete channel, the conditional word error probability for ML decoding can be upper-bounded as
P_w|i ≤ ∑_y p_{Y|X}(y|x_i)^{1/(1+ρ)} [ ∑_{ℓ=1, ℓ≠i}^{2^k} p_{Y|X}(y|x_ℓ)^{1/(1+ρ)} ]^ρ, 0 ≤ ρ ≤ 1.
Gallager Bound
Special case:
Theorem (Gallager bound for DMCs): For a given code of length n with code words x_i, i ∈ {1, 2, ..., 2^k}, which are transmitted with arbitrary probabilities p_X(x_i) via a discrete memoryless channel (DMC), the conditional word error probability for ML decoding can be upper-bounded as
P_w|i ≤ ∏_{j=1}^{n} ∑_y p_{Y|X}(y|x_{ij})^{1/(1+ρ)} [ ∑_{ℓ=1, ℓ≠i}^{2^k} p_{Y|X}(y|x_{ℓj})^{1/(1+ρ)} ]^ρ, 0 ≤ ρ ≤ 1.
For ρ = 1 the Gallager bound and the Bhattacharyya bound are identical.
P̄_w ≤ (2^k − 1)^ρ ∑_y [ ∑_x p_X(x) p_{Y|X}(y|x)^{1/(1+ρ)} ]^{1+ρ}, 0 ≤ ρ ≤ 1.
For a memoryless channel with i.i.d. code symbols, p_X(x) = ∏_{j=1}^{n} p_X(x_j).
With the so-called Gallager function
E_0(ρ, p_X(x)) := −log2 ∑_y [ ∑_x p_X(x) p_{Y|X}(y|x)^{1/(1+ρ)} ]^{1+ρ}
we get
P̄_w ≤ (2^k − 1)^ρ 2^{−n E_0(ρ, p_X(x))}, 0 ≤ ρ ≤ 1.
With 2^k − 1 < 2^k and k = nR we obtain
P̄_w < 2^{−n (E_0(ρ, p_X(x)) − ρR)}, 0 ≤ ρ ≤ 1.
The exponent
E_G(R) := max_{0 ≤ ρ ≤ 1} max_{p_X(x)} [ E_0(ρ, p_X(x)) − ρR ]
130
is the so-called Gallager exponent. Finally, we define R_0 := E_G(0) = max_{p_X(x)} E_0(1, p_X(x)).
[Figure: Gallager exponent E_G(R) versus rate R, starting at R_0 for R = 0 and reaching zero at R = C.]
P(a ≤ x ≤ b) = ∫_a^b p_X(x) dx.
Hence, the probability of the event {a ≤ x ≤ b} is the area under the probability density function in the range a ≤ x ≤ b.
Differential Entropy
For the binary representation of real values an infinite number of bits is necessary: the self-information is infinite. Still, we define the following expressions:
Definition: Let X be a continuous random variable with probability density function p_X(x).
H(X) = −∫_{−∞}^{∞} p_X(x) log p_X(x) dx
is called the differential entropy of X. In contrast to the entropy of a discrete random variable X, the differential entropy has no fundamental meaning. Particularly, the differential entropy cannot be interpreted as the uncertainty of a random variable. Especially, H(X) may even be negative.
Differential Entropy
Example: Let X be a Gaussian distributed random variable with mean μ and variance σ², i.e., let X be a random variable with probability density function
p_X(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}.
(A Gaussian distribution with zero mean and variance one is called a normal distribution.) The differential entropy of X is
H(X) = (1/2) log(2πe σ²).
Depending on the value of σ², H(X) may be positive, zero, or negative.
Theorem: Among all continuous random variables with a given mean μ and a given variance σ², the Gaussian distributed random variable has the maximal differential entropy.
Differential Entropy
Definition: The mutual information between two continuous random variables X and Y is defined as
I(X; Y) = ∫∫ p_XY(x, y) log [ p_XY(x, y) / (p_X(x) p_Y(y)) ] dx dy.
With the conditional differential entropy
H(X|Y) = −∫∫ p_XY(x, y) log p_{X|Y}(x|y) dx dy
we have
I(X; Y) = H(X) − H(X|Y) ≥ 0; however, I(X; Y) ≤ H(X) generally does not apply any more.
Differential Entropy
Theorem: Let Y = X + Z, where X and Z are independent, Gaussian distributed random variables with variances S and N, respectively. We obtain:
I(X; Y) = (1/2) log(1 + S/N).
Proof: Since X and Z are Gaussian distributed, Y is Gaussian distributed as well. Since X and Z are statistically independent, their variances are additive. Hence, Y is Gaussian distributed with variance S + N. With I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(Y) − H(Z) and the result of the example treated above we get
I(X; Y) = (1/2) log(2πe(S + N)) − (1/2) log(2πeN) = (1/2) log(1 + S/N). q.e.d.
E{X²} = ∫_{−∞}^{∞} x² p_X(x) dx ≤ P.
C = max_{p_X(x): E{X²} ≤ P} I(X; Y).
The unit is bit/channel use (or bit/channel symbol), if we take the binary logarithm.
Theorem: The channel capacity of a real-valued, discrete-time, memoryless Gaussian channel with an average input power of P or less is
C = (1/2) log2(1 + P/N) bit/channel use
and is obtained if X is Gaussian distributed.
Proof: We previously obtained that I(X; Y) = H(Y) − H(Z). Since H(Y) is maximal if Y is Gaussian distributed, a probability density function p_X(x) must be found which leads to Gaussian distributed output symbols. Due to Cramér, the sum of two independent Gaussian distributed random variables is Gaussian distributed once again. Hence, I(X; Y) is maximized if X is Gaussian distributed. q.e.d.
[Figure: channel capacity C in bit/channel use versus P/N in dB.]
C = max_{p_X(x): E{X²} ≤ P} I(X; Y) = (1/2) ∑_{i=1}^{2} ∫_{−∞}^{∞} p_{Y|X}(y|x_i) log2 [ p_{Y|X}(y|x_i) / p_Y(y) ] dy,
with p_{Y|X}(y|x_i) = (1/√(2πσ²)) e^{−(y−x_i)²/(2σ²)}, p_Y(y) = (p_{Y|X}(y|x_1) + p_{Y|X}(y|x_2))/2, σ² = N, x_1 = +√P and x_2 = −√P. It is not possible to simplify the integral further.
[Figure: channel capacity C in bit/channel use versus P/N in dB, for binary input symbols (+√P, −√P) and for Gaussian distributed input symbols.]
C = ∑_{i=1}^{n} (1/2) log2(1 + S_i/N_i)
and is reached if the channel input symbols are independent, Gaussian distributed random variables with zero mean and variances S_i, i = 1, 2, ..., n, where
S_i + N_i = θ if θ > N_i, and S_i = 0 otherwise, with ∑_{i=1}^{n} S_i = P.
[Figure: power levels S_i and noise levels N_i over subchannel index i = 1, ..., 5; one subchannel with large noise variance receives no power.]
Power distribution according to the water-filling principle. The average total input power may be interpreted as a mass of water which is dammed up over an irregular terrain. The terrain represents the variances of the subchannels, whereas the water level corresponds to θ.
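A small water-filling sketch (the noise levels are hypothetical; bisection is used to find the water level θ):

```python
import math

def water_filling(noise, P, iters=60):
    """Powers S_i = max(theta - N_i, 0) with sum(S_i) = P (bisection)."""
    lo, hi = min(noise), max(noise) + P       # the water level lies in here
    for _ in range(iters):
        theta = (lo + hi) / 2
        if sum(max(theta - n, 0.0) for n in noise) > P:
            hi = theta                        # too much water: lower the level
        else:
            lo = theta
    return [max(theta - n, 0.0) for n in noise]

noise = [1.0, 4.0, 2.0, 9.0, 3.0]             # hypothetical noise variances N_i
S = water_filling(noise, P=6.0)
print([round(s, 3) for s in S])               # [3.0, 0.0, 2.0, 0.0, 1.0]

C = sum(0.5 * math.log2(1 + s / n) for s, n in zip(S, noise))
print(f"C = {C:.3f} bit/channel use")
```

Note that the two noisiest subchannels receive no power, exactly as in the figure.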
x[k] = x(k · T_sample): a signal strictly bandlimited to W can be reconstructed from its samples if the sampling rate 1/T_sample is (at least) 2W samples per second. This minimum sampling rate is called the Nyquist rate.
C_T = ∑_{k=1}^{2WT} C_k = W T log2(1 + P/N),
where C_T is the maximal mutual information which can be transmitted in an interval of T seconds. The unit is bit/s, if we take the binary logarithm.
The ratio P/N is called the signal-to-noise ratio. Example: Typically, an analog telephone channel has a bandwidth of about W = 3 kHz and a signal-to-noise ratio of about P/N = P/(N_0 W) = 30 dB (if no digital switching is applied). Therefore, the channel capacity is about C ≈ 30 kbit/s. With state-of-the-art telephone modems (V.34), data rates of up to 28.8 kbit/s may be obtained under these circumstances.
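The quoted capacity follows directly from C = W log2(1 + P/N) (a quick sketch):

```python
import math

W = 3000.0                          # bandwidth in Hz
snr = 10 ** (30 / 10)               # 30 dB -> P/N = 1000
C = W * math.log2(1 + snr)          # bit/s
print(f"C = {C/1000:.1f} kbit/s")   # ~29.9 kbit/s, i.e. about 30 kbit/s
```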
[Figure: normalized channel capacity C/(2W) in bit/s/Hz versus P/N in dB.]
[Figure: bandwidth efficiency in bit/s/Hz versus SNR in dB.]
Summary: Channel Capacity of Real-Valued Gaussian Channels with Limited Input Power
Discrete-time, Gaussian distributed input symbols:
C = (1/2) log2(1 + P/N) bit/channel use,
where P: average input symbol power; N: average noise power.
Continuous-time, Gaussian distributed input signals, finite bandwidth:
C = W log2(1 + P/N) bit/s,
where P: average input signal power; N: average noise power, N = W N_0; W: bandwidth.
For T_sample = 1/(2W), the product C · T_sample equals the channel capacity C of the discrete-time Gaussian channel.
Summary: Channel Capacity of Real-Valued Gaussian Channels with Limited Input Power
Discrete-time, binary input symbols ±√P:
C = (1/2) ∑_{i=1}^{2} ∫_{−∞}^{∞} p_{Y|X}(y|x_i) log2 [ p_{Y|X}(y|x_i) / p_Y(y) ] dy,
where
p_{Y|X}(y|x_i) = (1/√(2πσ²)) e^{−(y−x_i)²/(2σ²)},
p_Y(y) = (1/2) (p_{Y|X}(y|x_1) + p_{Y|X}(y|x_2)),
x_1 = +√P, x_2 = −√P,
P: average input symbol power; σ²: average noise power.
Remarks
The Gaussian channel model is often called the Additive White Gaussian Noise (AWGN) channel model. For a matched-filter receiver,
P/N = (E_s/T_s)/(N_0 W) = (E_s/N_0) · 1/(T_s W) = (E_s/N_0) · 2 T_sample/T_s = 2 E_s/N_0 (for T_s = T_sample),
where E_s: energy per info symbol; T_s: symbol duration; 1/T_sample: sampling rate.
For a bandlimited AWGN channel model we further obtain:
σ² = N = N_0 W = (N_0/2) · (1/T_s).
(kR(D) bits)
Lossy, binary source encoder: Hamming distortion D = P_b, R(D) = 1 − h(P_b)
Ideal channel encoder: R = k R(D)/n = R_tot · R(D)
Channel: e.g. C = (1/2) log2(1 + P/N)
Error-free transmission (errors are only introduced by the lossy source encoder!):
R ≤ C ⇔ R_tot · R(D) ≤ C, where R(D) = 1 − h(P_b).
SNR Limit of the Discrete-Time Gaussian Channel with Gaussian Distributed Input Symbols given a Finite Bit Error Probability
[Figure: bit error probability P_b (10^−1 ... 10^−5) versus SNR in dB.]
Chapter 8: Cryptology
Fundamentals of cryptology: cryptography, cryptanalysis, authentication
Ciphertext-only attack, known-plaintext attack, chosen-plaintext attack
Classical cipher systems: Caesar cipher, Vigenère cipher, Vernam cipher
Shannon's theory of secrecy, absolutely secure cryptosystem
Key entropy, message entropy
Conditional key entropy, conditional message entropy
Redundancy, unicity distance
Private key, public key
One-way function, trapdoor one-way function
RSA system
Fundamentals of Cryptology
Cryptology is the art or science of the design and the attack of security systems. Cryptology includes:
Cryptography (art or science of encryption)
Cryptanalysis (art or science of attacking a security system)
Authentication (checking whether a message is genuine or not).
As an art, cryptology has been used by armed forces, diplomats and spies for thousands of years. As a science, cryptology was established in 1948 by Shannon. The demand for cryptosystems has significantly increased in the past decade (due to the internet, home banking, e-commerce, m-commerce, pay TV, etc.).
Fundamentals of Cryptology
[Diagram: cryptology draws on information theory, number theory, and probability theory.]
Fundamentals of Cryptology
Goals of Cryptology
Secrecy: How can we securely communicate with somebody else?
- organizing actions (e.g. secret meeting)
- physical actions (e.g. invisible ink)
- encryption of messages: symmetrical techniques (private key techniques), asymmetrical techniques (public key techniques)
Authentication: How can we prove one's identity? How can we check that a message is authentic?
- user authentication (e.g. personal identification number (PIN), automated teller machine (ATM))
- message authentication (e.g. electronic signature)
Fundamentals of Cryptology
Goals of cryptology (cont'd)
Anonymity: How can we preserve our privacy during (tele-)communication?
- cash money, help line, box number advertisement
- electronic cash
Protocols: A protocol is the whole set of rules. Protocols are necessary if two or more users participate (key management, service provision, etc.)
- cellular radio
- internet
Block Diagram of a Classical Cipher System for Secure Communication (Symmetrical Technique)
[Figure: Plaintext M → Encryption E_K(.) → Ciphertext C = E_K(M) → (insecure channel, interception and plaintext estimation possible) → Decryption E_K^−1(.) → Plaintext M̂ = E_K^−1(C); the key K is distributed via a secure channel.]
M: message (to be transmitted via an insecure channel to an authorized receiver)
C: ciphertext, cryptogram
E_K(.): encryption rule, E_K^−1(.): decryption rule
K: key (known to authorized users only)
M̂: estimated message (the probability that M̂ ≠ M should approach zero)
Fundamentals of Cryptology
Encryption: C := f(K, M) := f_K(M) := E_K(M)
Decryption: M := f^−1(K, C) := f_K^−1(C) := E_K^−1(C) := D_K(C)
Classification of encryption and decryption techniques:
- block enciphering: block-wise encryption and decryption (e.g. Data Encryption Standard (DES))
- stream enciphering: continuous encryption and decryption (e.g. one-time pad)
Kerckhoffs' principle (1883): The security of cryptosystems must not depend on the concealment of the algorithm. Security is only based on the secrecy of the key.
Fundamentals of Cryptology
The security of cryptosystems is based on the effort needed to break the system. We have to assume that the attacker knows all about the nature of the plaintext (such as language and alphabet) and about the algorithm. In this case, sufficiently long messages can be uniquely recovered by an exhaustive search over all possible keys, if the necessary computational power is available. Depending on the a priori knowledge, we classify attacks as follows:
- ciphertext-only attack: besides the decryption algorithm, only the ciphertext is given to the attacker
- known-plaintext attack: additionally, the attacker knows a set of plaintext/ciphertext pairs
- chosen-plaintext attack: the attacker may choose an arbitrary plaintext and obtain the corresponding ciphertext without having access to the key.
State-of-the-art cryptosystems are chosen-plaintext attack resistant (e.g. UNIX password).
M ∈ {0, 1, ..., 25}.
Vigenère Tableau
[Table: the 26 × 26 Vigenère tableau; the entry in row K and column M is the letter (M + K) mod 26, i.e., each row is the alphabet cyclically shifted one position further than the row above.]
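A table lookup in the Vigenère tableau is equivalent to modulo-26 addition of the key; a minimal sketch (illustrative names; the key "KIEL" is a hypothetical example):

```python
def vigenere(text, key, decrypt=False):
    """Vigenere cipher over A-Z: C_i = (M_i + K_(i mod keylen)) mod 26."""
    out = []
    for i, ch in enumerate(text):
        m = ord(ch) - ord("A")                    # letter -> 0..25
        k = ord(key[i % len(key)]) - ord("A")
        c = (m - k) % 26 if decrypt else (m + k) % 26
        out.append(chr(c + ord("A")))
    return "".join(out)

cipher = vigenere("INFORMATIONTHEORY", "KIEL")
print(cipher, vigenere(cipher, "KIEL", decrypt=True))
```

The Caesar cipher corresponds to a Vigenère cipher with a key of length one.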
Definition: The conditional key entropy (key equivocation) is defined as
H(K|C) = −∑_k ∑_c p_KC(k, c) log p_{K|C}(k|c)
and the conditional message entropy (message equivocation) is defined as
H(M|C) = −∑_m ∑_c p_MC(m, c) log p_{M|C}(m|c).
The conditional key entropy (or the conditional message entropy) is the uncertainty of an attacker trying to find the actual key (or the actual message).
i.e., knowledge of the ciphertext can, on average, never increase the uncertainty about the key or about the message.
Theorem: The conditional message entropy cannot exceed the conditional key entropy, i.e., H(M|C) ≤ H(K|C).
Proof: H(M|C) ≤ H(K, M|C) = H(K|C) + H(M|K, C), where the right-hand side follows from the chain rule for entropy. For well-designed cipher systems H(M|K, C) = 0, which proves the theorem. q.e.d.
Definition: The difference between the alphabet rate and the message rate, D = H_0 − H(M), is called redundancy.
Example: For English-written texts, H_0 = log2(26) bits ≈ 4.7 bits and H_∞ ≈ 1.5 bits (per symbol). Therefore, for N → ∞ the redundancy is D = H_0 − H_∞ ≈ 3.2 bits.
H(K|C) ≥ N · H(M) + H(K) − N · H_0 = H(K) − N (H_0 − H(M)),
which proves the theorem. q.e.d.
The theorem yields the following fundamental results:
If the redundancy is D = 0, the conditional key entropy and the key entropy are identical: H(K|C) = H(K). Therefore, data compression is an essential recipe for improving the security of cipher systems: perfect source coding in conjunction with non-trivial enciphering results in an absolutely secure system. This only holds, however, if the key is changed after each message.
For N ≥ H(K)/D (i.e., for H(K) − N·D ≤ 0) the lower bound of the conditional key entropy is zero: we risk a successful attack.
Definition: The smallest message length for which a unique estimate of the message given the ciphertext is possible (i.e., H(K|C) = 0 and therefore H(M|C) = 0) is called the unicity distance and is denoted as N_min.
Theorem: N_min = H(K)/D. (The proof follows from the previous theorem.)
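For illustration, the unicity distance can be evaluated for two classical ciphers under the D ≈ 3.2 bits redundancy figure from above (the cipher choices are illustrative examples, not from the slides):

```python
import math

D = 3.2                                    # redundancy of English, bits/symbol

# Caesar cipher: 26 equiprobable keys.
H_K_caesar = math.log2(26)
print(f"Caesar:       N_min = H(K)/D = {H_K_caesar/D:.1f} symbols")

# Monoalphabetic substitution: 26! equiprobable keys.
H_K_subst = math.log2(math.factorial(26))
print(f"Substitution: N_min = H(K)/D = {H_K_subst/D:.1f} symbols")
```

Roughly 28 ciphertext symbols suffice, in principle, to break a monoalphabetic substitution cipher on English text.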
[Figure: Plaintext M → Encryption E_K(.) → Ciphertext C = E_K(M) → Decryption D_K(.) → Plaintext M = D_K(C).]
with respect to n, we obtain φ(n) = n − 1 − (q − 1) − (p − 1) = (p − 1)(q − 1) = φ(p) φ(q).
Euler's theorem: If gcd(a, n) = 1, then a^φ(n) = 1 (mod n).
Example: Let a = 2 and n = 55. Therefore (with n = p·q = 5·11), φ(55) = (5 − 1)(11 − 1) = 40. Hence, a^φ(n) = 2^40 = 1024^4 ≡ 34^4 = (34²)² ≡ 1² = 1 (mod 55).
188
189
Remark: Users who know (n, e) are able to encrypt a message. However, only users who additionally know (p, q, φ(n)) or d are able to decrypt the message. Secrecy relies on the assumption that it is infeasible to solve C = M^e (mod n) for M by means of the discrete logarithm, where e and C are known.
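A toy RSA walk-through with the tiny parameters from the Euler example above (p = 5, q = 11; real systems use far larger primes, and all names below are illustrative):

```python
# Toy RSA with p = 5, q = 11 (illustration only; insecure key sizes).
p, q = 5, 11
n = p * q                     # public modulus, n = 55
phi = (p - 1) * (q - 1)       # phi(n) = 40

e = 3                         # public exponent with gcd(e, phi) = 1
d = pow(e, -1, phi)           # private exponent: d*e = 1 (mod phi), d = 27

M = 42                        # plaintext, 0 <= M < n
C = pow(M, e, n)              # encryption: C = M^e (mod n)
M_hat = pow(C, d, n)          # decryption: M = C^d (mod n)
print(C, M_hat)               # 3, 42

# Euler's theorem check from the example above: 2^40 = 1 (mod 55).
assert pow(2, phi, n) == 1
```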
Literature (tutorial)
A. Beutelspacher, Kryptologie. Braunschweig/Wiesbaden: Vieweg, 5th ed., 1996.
A. Beutelspacher, J. Schwenk, K.-D. Wolfenstetter, Moderne Verfahren der Kryptographie. Braunschweig/Wiesbaden: Vieweg, 3rd ed., 1999.
O. Mildenberger, Informationstheorie und Codierung. Braunschweig/Wiesbaden: Vieweg, 2nd ed., 1992.
H. Rohling, Einführung in die Informations- und Codierungstheorie. Stuttgart: Teubner, 1995.
A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography. CRC Press, 5th ed., 2001.
Literature (advanced)
T.M. Cover, J.A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, 1991.
R.G. Gallager, Information Theory and Reliable Communication. New York: John Wiley & Sons, 1968.
R. Johannesson, Informationstheorie: Grundlagen der (Tele-)Kommunikation. Lund: Addison-Wesley, 1992.
J.C.A. van der Lubbe, Information Theory. Cambridge (UK): Cambridge University Press, 1997.
J.M. Wozencraft, I.M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965.
C.E. Shannon's paper "A Mathematical Theory of Communication" (July/Oct. 1948) can be found on our homepage: http://www-ict.tf.uni-kiel.de (see Lectures: Scripts and exercises)