Ravi Ramanathan
Proof. Suppose |𝒦| < |ℳ| . We will show that there exists a distribution on ℳ, a message m, and a ciphertext c such that
Pr[M = m | C = c] ≠ Pr[M = m] .
Take the uniform distribution on ℳ, i.e. Pr[M = m] = 1/|ℳ| for all m ∈ ℳ .
Fix a ciphertext c with Pr[C = c] > 0 and let ℳ(c) := {m : m = Dec_k(c) for some k ∈ 𝒦} .
These are the only possible m that could yield the ciphertext c through any key k .
Optimality of the One-Time Pad
Theorem. If (Gen, Enc, Dec) is a perfectly secret encryption scheme with message space ℳ
and key space 𝒦, then | 𝒦 | ≥ | ℳ | .
Proof. Continued. Observe that |ℳ(c)| ≤ |𝒦| < |ℳ| . This is because Dec is deterministic, so each key decrypts c to at most one message.
Hence there exists some message m* ∈ ℳ with m* ∉ ℳ(c), and for this message Pr[M = m* | C = c] = 0 .
But by the assumption that the message distribution is uniform, we know that Pr[M = m*] = 1/|ℳ| > 0 . Therefore
Pr[M = m* | C = c] ≠ Pr[M = m*] .
Thus, the condition of perfect secrecy is not satisfied and the scheme is not perfectly secret. ◼
1. Every key k ∈ 𝒦 is chosen by Gen with equal probability 1/|𝒦| .
2. For every m ∈ ℳ and every c ∈ 𝒞 there exists a unique key k ∈ 𝒦 such that Enc_k(m) outputs c .
It can be readily seen that the one-time pad satisfies both conditions, since Gen chooses the key uniformly at random from {0,1}^l and, for every m and c, the unique key mapping m to c is k = m ⊕ c (see the sketch below).
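As a quick illustration (a minimal Python sketch, using a hypothetical 3-bit message space so everything can be enumerated), the following checks both conditions for the one-time pad Enc_k(m) = m ⊕ k:

import itertools

l = 3                                        # hypothetical small bit-length for exhaustive checking
keys = msgs = ctxts = list(itertools.product([0, 1], repeat=l))

def xor(a, b):
    return tuple(x ^ y for x, y in zip(a, b))

# Condition 2: for every (m, c) exactly one key maps m to c, namely k = m XOR c.
for m in msgs:
    for c in ctxts:
        assert [k for k in keys if xor(m, k) == c] == [xor(m, c)]

# Condition 1: Gen draws the key uniformly, so Pr[K = k] = 1/|K| for every key.
print(len(keys), 1 / len(keys))              # |K| = 8, Pr[K = k] = 0.125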
Suppose that the distributions over ℳ and 𝒞 are such that every m and every c is assigned non-zero probability.
Firstly, we observe that for every m, c there must be at least one k ∈ 𝒦 such that Enc_k(m) = c ; otherwise Pr[C = c | M = m] = 0 ≠ Pr[C = c] > 0, contradicting perfect secrecy.
This implies that for any fixed m, the set of all ciphertexts {Enc_k(m)}_{k∈𝒦} satisfies |{Enc_k(m)}_{k∈𝒦}| ≥ |𝒞| .
Proof. Contd. Since |{Enc_k(m)}_{k∈𝒦}| ≤ |𝒦| and, by the assumption of Shannon's theorem, |𝒦| = |𝒞|, we have |{Enc_k(m)}_{k∈𝒦}| = |𝒦| . Hence, for every pair (m, c) there is exactly one key k with Enc_k(m) = c, which is Condition 2.
Fix ciphertext c . Let ℳ = {m1, …, mn} and let ki denote the unique key that encrypts mi to c . Note that the ki are distinct: a single key cannot encrypt two different messages to the same ciphertext, since decryption would then fail.
By perfect secrecy, Pr[C = c | M = mi] = Pr[C = c], and Pr[C = c | M = mi] = Pr[K = ki] . Therefore, we have Pr[K = ki] = Pr[C = c] . Similarly Pr[K = kj] = Pr[C = c] for every j .
Since the ki are distinct and n = |ℳ| = |𝒦|, they exhaust all of 𝒦 and their probabilities sum to 1. Thus, we obtain that Pr[K = ki] = 1/|𝒦| for every i, which is Condition 1.
Proof of Shannon’s Theorem
Proof. Contd. Conditions 1 and 2 ⟹ Perfect Secrecy .
When Conditions 1 and 2 hold, we have that for every m, c it holds that Pr[C = c | M = m] = 1/|𝒦| : by Condition 2 there is a unique key k with Enc_k(m) = c, and by Condition 1 this key is chosen with probability 1/|𝒦| .
This implies that for every pair of messages m, m′ ∈ ℳ and every ciphertext c, we have
Pr[C = c | M = m] = 1/|𝒦| = Pr[C = c | M = m′],
which is equivalent to perfect secrecy. ◼
Q1. Suppose the message space is ℳ = {0,…,4}, Gen chooses the key uniformly at random from 𝒦 = {0,…,5}, and Enc_k(m) = (m + k) mod 5 . Is this scheme perfectly secret?
A1. By definition, perfect secrecy requires Pr[C = c | M = m] = Pr[C = c], i.e. M and C must be independent random variables.
Pr[C = c | M = m] = ∑_{k : Enc_k(m) = c} Pr[K = k]
For M = 1, say, applying every possible key in {0,…,5} gives the ciphertexts {1,2,3,4,0,1}
We get Pr[C = 0 | M = 1] = 1/6 . This is because C = 0 occurs from M = 1 only when k = 4.
Taking the message distribution to be uniform, C = 0 can only be obtained from the following 6 of the 30 equally likely (m, k) combinations: {(0,0), (0,5), (1,4), (2,3), (3,2), (4,1)} .
Therefore, Pr[C = 0] = 6/30 = 1/5 ≠ 1/6 = Pr[C = 0 | M = 1] . This scheme is not perfectly secret. ◼
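The calculation can be checked mechanically. A small Python sketch, assuming (as above) Enc_k(m) = (m + k) mod 5 and a uniform message distribution:

from fractions import Fraction
from itertools import product

M, K = range(5), range(6)
enc = lambda m, k: (m + k) % 5

# Pr[C = 0 | M = 1]: fraction of the 6 equally likely keys mapping 1 to 0
pr_c0_given_m1 = Fraction(sum(enc(1, k) == 0 for k in K), len(K))

# Pr[C = 0]: fraction of the 30 equally likely (m, k) combinations giving ciphertext 0
pr_c0 = Fraction(sum(enc(m, k) == 0 for m, k in product(M, K)), len(M) * len(K))

print(pr_c0_given_m1, pr_c0)   # 1/6 vs 1/5, so the scheme is not perfectly secret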
Perfect Secrecy Exercise
Consider an encryption scheme with message space ℳ := {a, b, c} (with respective probabilities for a, b, c being 0.6, 0.3, 0.1) and key space 𝒦 := {k1, k2, k3} (with respective probabilities for choosing keys k1, k2, k3 being 0.4, 0.3, 0.3).
The encryption table takes the form (rows are messages, columns are keys):

     k1   k2   k3
a     1    2    3
b     2    1    3
c     3    1    2
What is the probability that the ciphertext is 2? Calculate Pr[M = b | C = 2], Pr[M = b | C = 3] .
Pr[C = 2] = Pr[K = k2] Pr[M = a] + Pr[K = k1] Pr[M = b] + Pr[K = k3] Pr[M = c] = 0.3 ⋅ 0.6 + 0.4 ⋅ 0.3 + 0.3 ⋅ 0.1 = 0.18 + 0.12 + 0.03 = 0.33.
Pr[M = b | C = 2] = Pr[C = 2 | M = b] Pr[M = b] / Pr[C = 2] = (0.4 × 0.3) / 0.33 = 12/33 ≈ 0.364.
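The same numbers can be verified programmatically. The sketch below encodes the table and probabilities given above and applies Bayes' rule; it also prints the remaining requested quantity Pr[M = b | C = 3]:

from itertools import product

pM = {'a': 0.6, 'b': 0.3, 'c': 0.1}
pK = {'k1': 0.4, 'k2': 0.3, 'k3': 0.3}
enc = {('a', 'k1'): 1, ('a', 'k2'): 2, ('a', 'k3'): 3,
       ('b', 'k1'): 2, ('b', 'k2'): 1, ('b', 'k3'): 3,
       ('c', 'k1'): 3, ('c', 'k2'): 1, ('c', 'k3'): 2}

def pr_C(c):                                   # Pr[C = c]
    return sum(pM[m] * pK[k] for m, k in product(pM, pK) if enc[m, k] == c)

def pr_M_given_C(m, c):                        # Pr[M = m | C = c] via Bayes' rule
    return sum(pM[m] * pK[k] for k in pK if enc[m, k] == c) / pr_C(c)

print(pr_C(2))                                 # 0.33
print(pr_M_given_C('b', 2))                    # ≈ 0.364
print(pr_M_given_C('b', 3))                    # the remaining posterior asked for above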
Entropy
Definition. The entropy H(X) of a random variable X taking values in a set 𝒳 is defined as
H(X) = − ∑_{x∈𝒳} Pr[X = x] log₂ Pr[X = x] .
The entropy is equivalently stated as H(X ) = 𝔼 (−log2 Pr[X ]), where 𝔼 denotes the
expectation value.
E.g. If X takes one value with probability 1 and other values with probability 0, then
the entropy of X is zero.
If X takes n values each with probability 1/n then the entropy of X is log2 n .
In general, we note that H(X ) ≤ log2 | 𝒳 |
Entropy: Example
Example. Suppose we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) .
Suppose that we want to send a message indicating which horse won the race.
How many bits on average should we use in the best description as to which horse won?
We want to send the index of the winning horse. It makes sense to use shorter descriptions for
more probable horses and longer descriptions for the less probable ones.
We could use the following set of bit strings: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
The average description length above is 2 bits as opposed to 3 for the uniform code.
Entropy of a r.v. is a lower bound on the average number of bits required to represent the r.v.
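A short Python check that the entropy of this distribution and the average length of the code above are both 2 bits:

import math

p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code = ['0', '10', '110', '1110', '111100', '111101', '111110', '111111']

H = -sum(pi * math.log2(pi) for pi in p)                # entropy of the winning horse's identity
avg_len = sum(pi * len(w) for pi, w in zip(p, code))    # expected description length
print(H, avg_len)                                       # 2.0 and 2.0 (vs. 3 bits for a uniform code)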
The binary entropy H(p) = − p log₂ p − (1 − p) log₂(1 − p) of a random variable taking two values with probabilities p and 1 − p is shown in Figure 2.1. The figure illustrates some of the basic properties of entropy: it is a concave function of the distribution and equals 0 when p = 0 or 1. This makes sense, because when p = 0 or 1, the variable is not random and there is no uncertainty. Similarly, the uncertainty is maximum when p = 1/2, which also corresponds to the maximum value of the entropy.
Entropy: Example
Example 2.1.2. Calculate the entropy of a r.v. X with distribution
X = a with probability 1/2, b with probability 1/4, c with probability 1/8, d with probability 1/8.
The entropy of X is H(X) = − (1/2) log₂(1/2) − (1/4) log₂(1/4) − (1/8) log₂(1/8) − (1/8) log₂(1/8) = 7/4 bits.
Suppose we wish to determine the value of X with the minimum number of binary questions.
An efficient first question is: Is X = a?
The expected number of binary questions required to determine the value of X in the best possible questioning scheme is 1.75.
In general, the expected number of binary questions required to determine a random variable X lies between H(X) and H(X) + 1 .
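For the distribution above, the natural questioning scheme "Is X = a?", then "Is X = b?", then "Is X = c?" needs 1, 2, 3 and 3 questions for a, b, c, d respectively; a one-line Python check that its expected cost matches H(X) = 7/4 bits:

p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
questions = {'a': 1, 'b': 2, 'c': 3, 'd': 3}   # questions needed to identify each outcome
print(sum(p[x] * questions[x] for x in p))     # 1.75 expected binary questions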
Conditional Entropy
Definition. The conditional entropy of X given Y is defined as
H(X | Y) = − ∑_{x∈𝒳, y∈𝒴} Pr[X = x, Y = y] log₂ Pr[X = x | Y = y] .
It is also customary to use the notation H(X | Y = y) := − ∑_{x∈𝒳} Pr[X = x | Y = y] log₂ Pr[X = x | Y = y] .
We observe that H(X | Y) = ∑_{y∈𝒴} Pr[Y = y] H(X | Y = y) .
Chain Rule and other entropy relations
Chain rule: H(XY) = H(X) + H(Y | X) .
Here H(XY) = H(X, Y) = − ∑_{x∈𝒳, y∈𝒴} Pr[X = x, Y = y] log₂ Pr[X = x, Y = y] denotes the entropy of the joint random variable (X, Y) .
In general, we have H(X1X2…Xn) = H(X1) + H(X2 | X1) + H(X3 | X1X2) + … + H(Xn | X1X2…Xn−1) .
The uncertainty of a random variable X can never increase by knowledge of the outcome of
another random variable Y, i.e. H(X | Y ) ≤ H(X ) with equality iff X, Y are independent.
We thus have H(XY ) ≤ H(Y ) + H(X ) with equality iff X, Y are independent.
Proof of the chain rule: write log₂ Pr[X = x, Y = y] = log₂ Pr[X = x] + log₂ Pr[Y = y | X = x] and take the expectation of both sides of the equation to obtain the theorem. ◼
Corollary. H(X, Y | Z) = H(X | Z) + H(Y | X, Z) .
Proof: The proof follows along the same lines as the theorem. ◼
Conditional Entropy and Joint Entropy: Example
Example 2.2.1. Let (X, Y) have the following joint distribution p(x, y) (columns x = 1, …, 4, rows y = 1, …, 4):

        x=1    x=2    x=3    x=4
y=1     1/8    1/16   1/32   1/32
y=2     1/16   1/8    1/32   1/32
y=3     1/16   1/16   1/16   1/16
y=4     1/4    0      0      0

The marginal distribution of X is (1/2, 1/4, 1/8, 1/8). So H(X) = 7/4 bits.
The marginal distribution of Y is (1/4, 1/4, 1/4, 1/4). So H(Y) = 2 bits.
What are H(X | Y), H(Y | X) and H(X, Y)?
Conditional Entropy and Joint Entropy: Example
H(X | Y) = ∑_{y} Pr[Y = y] ⋅ H(X | Y = y)
        = (1/4) H(1/2, 1/4, 1/8, 1/8) + (1/4) H(1/4, 1/2, 1/8, 1/8) + (1/4) H(1/4, 1/4, 1/4, 1/4) + (1/4) H(1, 0, 0, 0)
        = (1/4) × (7/4) + (1/4) × (7/4) + (1/4) × 2 + (1/4) × 0 = 11/8 bits.
Similarly, we find that H(Y | X) = 13/8 bits and H(X, Y) = 27/8 bits. Note that H(X | Y) ≠ H(Y | X) .
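A minimal Python sketch, assuming the joint table as written above, that recomputes these entropies via the chain rule H(X | Y) = H(X, Y) − H(Y):

import math
from fractions import Fraction as F

# joint distribution p(x, y); rows are y = 1..4, columns are x = 1..4
p = [[F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
     [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
     [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
     [F(1, 4),  0,        0,        0       ]]

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

px = [sum(row[j] for row in p) for j in range(4)]    # marginal of X: (1/2, 1/4, 1/8, 1/8)
py = [sum(row) for row in p]                         # marginal of Y: (1/4, 1/4, 1/4, 1/4)
Hxy = H([q for row in p for q in row])               # H(X, Y)
print(H(px), H(py), Hxy - H(py), Hxy - H(px), Hxy)   # 1.75, 2.0, 1.375, 1.625, 3.375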
Definition. The relative entropy D (p(X )∥q(X )) between two probability distributions p(X ), q(X )
of a random variable X is defined as
D (p(X) ∥ q(X)) = ∑_{x∈𝒳} p(X = x) log₂ ( p(X = x) / q(X = x) ) .
The relative entropy is a measure of the distance between two probability distributions.
It can be thought of as the inefficiency of assuming distribution q(X ) when the correct
distribution is p(X ) .
Note that D (p(X) ∥ q(X)) = 𝔼_p ( log₂ ( p(X) / q(X) ) ) .
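For instance, taking p to be the distribution (1/2, 1/4, 1/8, 1/8) of Example 2.1.2 and q the uniform distribution on four values, the inefficiency is D(p∥q) = log₂ 4 − H(p) = 1/4 bit. A tiny Python check:

import math

def D(p, q):   # relative entropy in bits; assumes q(x) > 0 wherever p(x) > 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(D([1/2, 1/4, 1/8, 1/8], [1/4] * 4))   # 0.25 bits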
Definition. The Mutual Information I(X; Y ) between random variables X and Y is defined as
I(X; Y) = D (Pr[X, Y] ∥ Pr[X] Pr[Y]) = ∑_{x∈𝒳, y∈𝒴} Pr[X = x, Y = y] log₂ ( Pr[X = x, Y = y] / (Pr[X = x] Pr[Y = y]) )
The mutual information I(X; Y) measures the information (in bits) we receive about the random variable X when observing the random variable Y.
It also describes the information we receive about the random variable Y when observing the random variable X.
Mutual information is also a measure of the price for encoding (X, Y) as a pair of independent random variables when in fact they are not.
I(X; X ) = H(X )
Analogously, the conditional relative entropy is defined as
D (p[X | Z] ∥ q[X | Z]) := ∑_{z∈𝒵} p(Z = z) ∑_{x∈𝒳} p(X = x | Z = z) log₂ ( p(X = x | Z = z) / q(X = x | Z = z) ) .
The mutual information is the relative entropy between the joint distribution and the product distribution.
I(X; Y) = ∑_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )
        = ∑_{x,y} p(x, y) log ( p(x | y) / p(x) )
        = − ∑_{x,y} p(x, y) log p(x) + ∑_{x,y} p(x, y) log p(x | y)
        = − ∑_x p(x) log p(x) − ( − ∑_{x,y} p(x, y) log p(x | y) ) = H(X) − H(X | Y) .
Mutual Information
Thus the mutual information I(X; Y ) is the reduction in uncertainty of X due to knowledge of Y .
Therefore, the entropy is also sometimes referred to as the self-information.
The relationship between H(X), H(Y), H(X, Y), H(X | Y), H(Y | X) and I(X; Y) is expressed in a Venn diagram (Figure 2.2). Notice that the mutual information I(X; Y) corresponds to the intersection of the information in X with the information in Y.
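Using the joint distribution of Example 2.2.1 once more, a short sketch verifying that the defining formula for I(X; Y) and the identity H(X) − H(X | Y) give the same value, 3/8 bit:

import math
from fractions import Fraction as F

p = [[F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
     [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
     [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
     [F(1, 4),  0,        0,        0       ]]
px = [sum(row[j] for row in p) for j in range(4)]
py = [sum(row) for row in p]

I = sum(p[i][j] * math.log2(p[i][j] / (px[j] * py[i]))       # definition of I(X; Y)
        for i in range(4) for j in range(4) if p[i][j] > 0)
print(I)                                                     # 0.375 = H(X) - H(X|Y) = 7/4 - 11/8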
We have the key entropy H(K) = − ∑_{k∈𝒦} Pr[K = k] log₂ Pr[K = k]
and the message entropy H(M) = − ∑_{m∈ℳ} Pr[M = m] log₂ Pr[M = m] .
The key entropy describes the uncertainty Eve faces regarding the unknown key a priori
and the message entropy describes the uncertainty regarding the transmitted message.
The key equivocation H(K | C) = − ∑_{k∈𝒦, c∈𝒞} Pr[K = k, C = c] log₂ Pr[K = k | C = c]
and the message equivocation H(M | C) = − ∑_{m∈ℳ, c∈𝒞} Pr[M = m, C = c] log₂ Pr[M = m | C = c]
describe the remaining uncertainty after Eve observes the transmitted ciphertext.
Perfect Secrecy in Information-Theoretic terms
We have H(K | C) ≤ H(K) and H(M | C) ≤ H(M). This reflects the fact that the uncertainties never increase by knowledge of the ciphertext.
In a system with perfect secrecy, the plaintext and the ciphertext are independent.
When Eve observes the ciphertext, she obtains no information at all about the plaintext, i.e. H(M | C) = H(M) .
Theorem. For an encryption scheme with perfect secrecy we have H(M ) ≤ H(K ) .
Proof. When key and ciphertext are given, the plaintext is uniquely determined since Dec is deterministic. Hence H(M | C, K) = 0.
Since H(M | C) ≤ H(K, M | C) (the uncertainty about the joint variable (K, M) is at least as large as the uncertainty about M alone), and since by the chain rule H(K, M | C) = H(K | C) + H(M | K, C) = H(K | C), we have that H(M | C) ≤ H(K | C) .
Since H(K | C) ≤ H(K ) (conditioning cannot increase entropy), we have that H(M | C) ≤ H(K ) .
Now, by definition of perfect secrecy we have H(M | C) = H(M ) . So H(M ) ≤ H(K ) . ◼
For perfect secrecy, the length N of the key bit string should be at least the entropy of the plaintext language.
Exercise: Information-Theory for Cryptography
Q . Let K, M, C be the random variables denoting the key, message and ciphertext respectively.
(a) . For a general private-key encryption scheme, is it the case that H(M | K, C) = 0? Explain your answer.
(b) . For a general private-key encryption scheme, is it the case that H(C | K, M ) = 0? Explain your answer.
(c) . In the One-Time Pad scheme with ℳ = 𝒦 = 𝒞 = {0,1}^l, how many bits of information
about the message and key are revealed by a single ciphertext on average? Explain your answer.
Assume that the key is chosen uniformly at random, independent of the message.
Exercise: Information-Theory for Cryptography
A . (a) . Yes: if you know the ciphertext and the key, then you know the plaintext.
This must hold since otherwise decryption will not work correctly, so H(M | K, C) = 0.
A . (b) . Not necessarily. H(C | K, M) = 0 means that if you know the plaintext and key, then you know the ciphertext.
This holds when encryption is deterministic, but not for general private-key encryption schemes, where Enc may be randomized.
Exercise: Information-Theory for Cryptography
A . (c) . In the One-Time Pad, where the key is chosen uniformly from {0,1}l, we have H(K ) = l, the maximum.
To find how many bits of information are revealed about the key on average by a ciphertext,
we must compute H(K ) − H(K | C), i.e., the mutual information between key and ciphertext.
H(K | C) is the key equivocation, i.e., the amount of uncertainty about the key left after one ciphertext is observed.
Similarly, to find how many bits of information are revealed about the message on average by a ciphertext,
we must compute H(M ) − H(M | C), i.e., the mutual information between message and ciphertext.
Exercise: Information-Theory for Cryptography
A . (c) . Contd. Since the One-Time Pad is a perfectly secret scheme, it holds that
H(M ) = H(M | C) .
That is, the message equivocation is equal to the message entropy, where the message equivocation
denotes the amount of uncertainty about the message left after one ciphertext is observed.
Therefore H(M ) − H(M | C) = 0 bits of information about the message are revealed on average by
a single ciphertext.
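A numerical illustration, with a hypothetical non-uniform 2-bit message distribution and a uniform independent key, of both quantities in part (c): I(M; C) is zero, while I(K; C) = H(K) − H(K | C) need not be:

import math
from itertools import product
from collections import defaultdict

l = 2
bits = list(product([0, 1], repeat=l))
pM = dict(zip(bits, [0.5, 0.25, 0.15, 0.10]))    # hypothetical message distribution
pK = {k: 1 / len(bits) for k in bits}            # uniform, independent key

pMC, pKC = defaultdict(float), defaultdict(float)
for m, k in product(bits, bits):
    c = tuple(mi ^ ki for mi, ki in zip(m, k))   # one-time pad: C = M XOR K
    pMC[m, c] += pM[m] * pK[k]
    pKC[k, c] += pM[m] * pK[k]

H = lambda probs: -sum(q * math.log2(q) for q in probs if q > 0)
pC = [sum(v for (_, c2), v in pMC.items() if c2 == c) for c in bits]

print(H(pM.values()) - (H(pMC.values()) - H(pC)))   # I(M;C) ≈ 0 bits (up to rounding)
print(H(pK.values()) - (H(pKC.values()) - H(pC)))   # I(K;C) ≈ 0.26 bits > 0 here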
A communication channel is a system in which the output depends probabilistically on its input.
The channel is characterised by a transition matrix with elements p(y | x) that determine the probability of observing the output y given the input x.
For a communication channel with input X and output Y we define the capacity C by
C = max_{p(x)} I(X; Y) .
The capacity is the maximum rate at which we can send information over the channel and recover the information at the output with a vanishingly small probability of error.
[Figure: a channel with inputs 1, 2, 3, 4 and outputs 1, 2, 3, 4.]
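A small Python sketch computing I(X; Y) from a transition matrix and an input distribution. For a noiseless channel like the one the figure appears to depict (each input delivered unchanged), the uniform input gives I(X; Y) = log₂ 4 = 2 bits, which is the capacity:

import math

def mutual_information(p_x, channel):
    # channel[x][y] = p(y | x); returns I(X; Y) in bits for input distribution p_x
    n_in, n_out = len(p_x), len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
    return sum(p_x[x] * channel[x][y] * math.log2(channel[x][y] / p_y[y])
               for x in range(n_in) for y in range(n_out)
               if p_x[x] > 0 and channel[x][y] > 0)

noiseless = [[1.0 if x == y else 0.0 for y in range(4)] for x in range(4)]
print(mutual_information([0.25] * 4, noiseless))   # 2.0 bits = log2(4)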