
Differential Entropy

Definition
Let X be a random variable with cumulative distribution function F(x) = Pr(X ≤ x). If F(x) is continuous, the r.v. is said to be continuous. Let f(x) = F′(x) when the derivative is defined. If ∫ f(x) dx = 1, then f(x) is called the probability density function (pdf) of X.

The set where f(x) > 0 is called the support set of X.

Definition
The differential entropy h(X) of a continuous r.v. X with density function f(x) is defined as

    h(X) = −∫_S f(x) log f(x) dx        (1)

where S is the support set of the r.v. Since h(X) depends only on f(x), the differential entropy is sometimes written h(f) rather than h(X).

Ex. 1 (Uniform distribution):

    f(x) = 1/a for 0 ≤ x < a, and 0 otherwise.

Then

    h(X) = −∫_0^a (1/a) log(1/a) dx = log a.

Note: for a < 1, log a < 0, so h(X) = log a < 0; differential entropy can be negative. However, 2^{h(X)} = 2^{log a} = a is the volume of the support set, which is always non-negative.

Ex. 2 (Normal distribution):

Let X ~ φ(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}. Then, in nats,

    h(φ) = −∫ φ ln φ
         = −∫ φ(x) [ −x²/(2σ²) − ln √(2πσ²) ] dx
         = E[X²]/(2σ²) + (1/2) ln(2πσ²)
         = 1/2 + (1/2) ln(2πσ²).

Changing the base of the logarithm, we have

    h(φ) = (1/2) ln e + (1/2) ln(2πσ²) = (1/2) ln(2πeσ²) nats,

    h(φ) = (1/2) log(2πeσ²) bits.
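The closed form in Ex. 2 can be sanity-checked by direct numerical integration of −∫ f ln f. A minimal sketch using only the Python standard library (the function names, truncation range, and step count are illustrative choices, not part of the lecture):

```python
import math

def normal_pdf(x, sigma):
    """Density of N(0, sigma^2)."""
    return math.exp(-x * x / (2 * sigma * sigma)) / math.sqrt(2 * math.pi * sigma * sigma)

def differential_entropy_nats(pdf, lo, hi, n=200_000):
    """Midpoint-rule approximation of -integral of f ln f over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0.0:
            total -= p * math.log(p) * dx
    return total

sigma = 2.0
numeric = differential_entropy_nats(lambda x: normal_pdf(x, sigma), -10 * sigma, 10 * sigma)
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma * sigma)
print(numeric, closed_form)  # the two agree to high precision
```

Truncating the integral at ±10σ is safe because the tail mass beyond that point is negligible.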

Theorem 1
Let X₁, X₂, …, Xₙ be a sequence of r.v.s drawn i.i.d. according to the density f(x). Then

    −(1/n) log f(X₁, X₂, …, Xₙ) → E[−log f(X)] = h(X)  in probability.

Proof: The proof follows directly from the weak law of large numbers.
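The convergence in Theorem 1 can be illustrated by simulation. A sketch for the N(0,1) density (sample size and seed are arbitrary choices):

```python
import math
import random

random.seed(0)

def log_f(x):
    # ln of the N(0,1) density
    return -x * x / 2 - 0.5 * math.log(2 * math.pi)

n = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
empirical = -sum(log_f(x) for x in samples) / n   # -(1/n) ln f(X_1,...,X_n)
h = 0.5 * math.log(2 * math.pi * math.e)          # h(X) for N(0,1), in nats
print(empirical, h)  # close for large n, by the weak law of large numbers
```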

Def: For ε > 0 and any n, we define the typical set A_ε^{(n)} w.r.t. f(x) as follows:

    A_ε^{(n)} = { (x₁, x₂, …, xₙ) ∈ Sⁿ : | −(1/n) log f(x₁, x₂, …, xₙ) − h(X) | ≤ ε },

where f(x₁, x₂, …, xₙ) = ∏_{i=1}^n f(xᵢ).

Def: The volume Vol(A) of a set A ⊂ Rⁿ is defined as

    Vol(A) = ∫_A dx₁ dx₂ … dxₙ.

Thm: The typical set A_ε^{(n)} has the following properties:
1. Pr(A_ε^{(n)}) > 1 − ε for n sufficiently large.
2. Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)} for all n.
3. Vol(A_ε^{(n)}) ≥ (1 − ε) 2^{n(h(X)−ε)} for n sufficiently large.

Thm: The set A_ε^{(n)} is the smallest-volume set with probability ≥ 1 − ε, to first order in the exponent.

The volume of the smallest set that contains most of the probability is approximately 2^{nh}. This is an n-dimensional volume, so the corresponding side length is (2^{nh})^{1/n} = 2^h. The differential entropy is thus the logarithm of the equivalent side length of the smallest set that contains most of the probability: low entropy implies that the r.v. is confined to a small effective volume, and high entropy indicates that the r.v. is widely dispersed.

Relation of Differential Entropy to Discrete Entropy

[Figure: quantization of a continuous r.v. with density f(x) into bins of width Δ]

Suppose we divide the range of X into bins of length Δ, and assume that the density is continuous within each bin.

By the mean value theorem, there is a value xᵢ within each bin such that

    f(xᵢ) Δ = ∫_{iΔ}^{(i+1)Δ} f(x) dx.

Consider the quantized r.v. X^Δ, which is defined by

    X^Δ = xᵢ, if iΔ ≤ X < (i+1)Δ.

Then the probability that X^Δ = xᵢ is

    pᵢ = ∫_{iΔ}^{(i+1)Δ} f(x) dx = f(xᵢ) Δ.

The entropy of the quantized version is

    H(X^Δ) = −Σᵢ pᵢ log pᵢ
           = −Σᵢ f(xᵢ)Δ log( f(xᵢ)Δ )
           = −Σᵢ Δ f(xᵢ) log f(xᵢ) − Σᵢ f(xᵢ)Δ log Δ
           = −Σᵢ Δ f(xᵢ) log f(xᵢ) − log Δ,

since Σᵢ f(xᵢ)Δ = ∫ f(x) dx = 1.

If f(x) log f(x) is Riemann integrable, then

    −Σᵢ Δ f(xᵢ) log f(xᵢ) → −∫ f(x) log f(x) dx,  as Δ → 0.

This proves the following theorem: If the density f(x) of the r.v. X is Riemann integrable, then

    H(X^Δ) + log Δ → h(f) = h(X),  as Δ → 0.

Since Δ = 2^{−n} for an n-bit uniform quantizer, the entropy of an n-bit quantization of a continuous r.v. X is approximately h(X) + n.
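This limit can be checked numerically. The sketch below quantizes a standard normal with an n-bit uniform quantizer (Δ = 2^{−n}) and compares H(X^Δ) with h(X) + n; the truncation range and per-bin integration scheme are illustrative choices:

```python
import math

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def quantized_entropy_bits(pdf, lo, hi, delta, sub=64):
    """Discrete entropy of the delta-quantized r.v.; bin masses via midpoint sums."""
    entropy = 0.0
    nbins = round((hi - lo) / delta)
    for i in range(nbins):
        a = lo + i * delta
        p = sum(pdf(a + (j + 0.5) * delta / sub) for j in range(sub)) * delta / sub
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy

h = 0.5 * math.log2(2 * math.pi * math.e)  # h(X) for N(0,1), in bits
nbits = 8
delta = 2.0 ** -nbits
H = quantized_entropy_bits(normal_pdf, -8.0, 8.0, delta)
print(H, h + nbits)  # H(X^Delta) is approximately h(X) + n
```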

Joint and Conditional Differential Entropy:

    h(X₁, X₂, …, Xₙ) = −∫ f(xⁿ) log f(xⁿ) dxⁿ

    h(X|Y) = −∫ f(x, y) log f(x|y) dx dy

    h(X|Y) = h(X, Y) − h(Y)

Theorem (Entropy of a multivariate normal distribution)

Let X₁, X₂, …, Xₙ have a multivariate normal distribution with mean μ and covariance matrix K. Then

    h(X₁, X₂, …, Xₙ) = h(N_n(μ, K)) = (1/2) log( (2πe)ⁿ |K| ) bits,

where |K| denotes the determinant of K.

pf: Let (X₁, X₂, …, Xₙ) ~ N_n(μ, K), so that

    f(x) = (2π)^{−n/2} |K|^{−1/2} exp( −(1/2)(x − μ)ᵀ K⁻¹ (x − μ) ).

Then

    h(f) = −∫ f(x) [ −(1/2)(x − μ)ᵀ K⁻¹ (x − μ) − ln( (2π)^{n/2} |K|^{1/2} ) ] dx
         = (1/2) E[ Σ_{i,j} (Xᵢ − μᵢ)(K⁻¹)_{ij}(Xⱼ − μⱼ) ] + (1/2) ln( (2π)ⁿ |K| )
         = (1/2) Σ_{i,j} E[ (Xⱼ − μⱼ)(Xᵢ − μᵢ) ] (K⁻¹)_{ij} + (1/2) ln( (2π)ⁿ |K| )
         = (1/2) Σⱼ Σᵢ K_{ji} (K⁻¹)_{ij} + (1/2) ln( (2π)ⁿ |K| )
         = (1/2) Σⱼ (K K⁻¹)_{jj} + (1/2) ln( (2π)ⁿ |K| )
         = (1/2) Σⱼ I_{jj} + (1/2) ln( (2π)ⁿ |K| )
         = n/2 + (1/2) ln( (2π)ⁿ |K| )
         = (1/2) ln( (2πe)ⁿ |K| ) nats
         = (1/2) log( (2πe)ⁿ |K| ) bits.
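The closed form h(N_n(0, K)) = (1/2) ln((2πe)ⁿ |K|) can be checked by Monte Carlo estimation of E[−ln f(X)]. A sketch for a 2×2 covariance (the matrix entries, sample size, and seed are arbitrary choices):

```python
import math
import random

random.seed(1)

# K = [[2.0, 0.6], [0.6, 1.0]]; Cholesky factor L with K = L L^T, done by hand for 2x2
k11, k12, k22 = 2.0, 0.6, 1.0
l11 = math.sqrt(k11)
l21 = k12 / l11
l22 = math.sqrt(k22 - l21 * l21)
det_k = k11 * k22 - k12 * k12

def sample():
    # X = L z with z standard normal gives X ~ N_2(0, K)
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return l11 * z1, l21 * z1 + l22 * z2

def neg_log_density(x1, x2):
    # -ln f for N_2(0, K), using the explicit 2x2 inverse of K
    q = (k22 * x1 * x1 - 2 * k12 * x1 * x2 + k11 * x2 * x2) / det_k
    return 0.5 * q + 0.5 * math.log((2 * math.pi) ** 2 * det_k)

n = 200_000
estimate = sum(neg_log_density(*sample()) for _ in range(n)) / n
closed_form = 0.5 * math.log((2 * math.pi * math.e) ** 2 * det_k)
print(estimate, closed_form)  # h(N_2(0,K)) in nats
```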

Relative Entropy and Mutual Information

    D(f ‖ g) = ∫ f log (f/g)

    I(X; Y) = ∫ f(x, y) log [ f(x, y) / ( f(x) f(y) ) ] dx dy

    I(X; Y) = h(X) − h(X|Y) = h(Y) − h(Y|X) = h(X) + h(Y) − h(X, Y)

    I(X; Y) = D( f(x, y) ‖ f(x) f(y) )
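As a worked example of these identities: for a bivariate normal with unit variances and correlation ρ, the closed forms for differential entropy give I(X; Y) = h(X) + h(Y) − h(X, Y) = −(1/2) ln(1 − ρ²). A sketch (the value of ρ is an arbitrary choice):

```python
import math

rho = 0.8
sigma_x = sigma_y = 1.0
det_k = sigma_x**2 * sigma_y**2 * (1 - rho**2)   # |K| for the bivariate normal

h_x = 0.5 * math.log(2 * math.pi * math.e * sigma_x**2)
h_y = 0.5 * math.log(2 * math.pi * math.e * sigma_y**2)
h_xy = 0.5 * math.log((2 * math.pi * math.e) ** 2 * det_k)

mutual_info = h_x + h_y - h_xy                   # I(X;Y) = h(X)+h(Y)-h(X,Y)
closed_form = -0.5 * math.log(1 - rho**2)
print(mutual_info, closed_form)
```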

Remark: The mutual information between two continuous r.v.s is the limit of the mutual information between their quantized versions:

    I(X^Δ; Y^Δ) = H(X^Δ) − H(X^Δ | Y^Δ)
                ≈ ( h(X) − log Δ ) − ( h(X|Y) − log Δ )
                = I(X; Y).

Properties of h(X), D(p ‖ q), I(X; Y)

    D(f ‖ g) ≥ 0

pf: −D(f ‖ g) = ∫_S f log (g/f) ≤ log ∫_S f · (g/f)  (Jensen's inequality)
              = log ∫_S g ≤ log 1 = 0.

Corollaries: I(X; Y) ≥ 0 and h(X|Y) ≤ h(X).

    h(X₁, X₂, …, Xₙ) = Σᵢ h(Xᵢ | X₁, X₂, …, X_{i−1})

    h(X₁, X₂, …, Xₙ) ≤ Σ_{i=1}^n h(Xᵢ)

Theorems:
    h(X + c) = h(X): translation does not change the differential entropy.
    h(aX) = h(X) + log |a|.

pf: Let Y = aX. Then f_Y(y) = (1/|a|) f_X(y/a), and

    h(aX) = −∫ f_Y(y) log f_Y(y) dy
          = −∫ (1/|a|) f_X(y/a) log( (1/|a|) f_X(y/a) ) dy
          = −∫ f_X(x) log f_X(x) dx + log |a|
          = h(X) + log |a|.

Corollary: h(AX) = h(X) + log |det(A)|.
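The scaling law can be cross-checked against Ex. 2: if X ~ N(0, σ²), then aX ~ N(0, (aσ)²), so h(aX) − h(X) = (1/2) ln(a²) = ln |a|. A one-line verification (the values of a and σ are arbitrary):

```python
import math

def h_normal_nats(sigma):
    # h(N(0, sigma^2)) = 0.5 ln(2*pi*e*sigma^2), from Ex. 2
    return 0.5 * math.log(2 * math.pi * math.e * sigma * sigma)

a, sigma = 3.0, 1.5
lhs = h_normal_nats(a * sigma)                  # h(aX), since aX ~ N(0, (a*sigma)^2)
rhs = h_normal_nats(sigma) + math.log(abs(a))   # h(X) + ln|a|
print(lhs, rhs)
```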

Theorem: The multivariate normal distribution maximizes the entropy over all distributions with the same covariance.

Let the random vector X ∈ Rⁿ have zero mean and covariance K = E[XXᵀ] (i.e., K_{ij} = E[XᵢXⱼ], 1 ≤ i, j ≤ n). Then

    h(X) ≤ (1/2) log( (2πe)ⁿ |K| ),  with equality iff X ~ N_n(0, K).

Pf: Let g(x) be any density satisfying ∫ g(x) xᵢxⱼ dx = K_{ij} for all i, j. Let φ_K be the density of a N_n(0, K) vector. Note that log φ_K(x) is a quadratic form (plus a constant), and ∫ xᵢxⱼ φ_K(x) dx = K_{ij}. Then

    0 ≤ D(g ‖ φ_K) = ∫ g log (g/φ_K)
                   = −h(g) − ∫ g log φ_K
                   = −h(g) − ∫ φ_K log φ_K
                   = −h(g) + h(φ_K),

where the substitution ∫ g log φ_K = ∫ φ_K log φ_K follows from the fact that g and φ_K yield the same moments of the quadratic form log φ_K(x). Hence h(g) ≤ h(φ_K): the Gaussian distribution maximizes the entropy over all distributions with the same covariance.

Let X be a random variable with differential entropy h(X), in nats. Let X̂ be an estimate of X, and let E(X − X̂)² be the expected prediction error.

Theorem: For any r.v. X and estimator X̂,

    E(X − X̂)² ≥ (1/(2πe)) e^{2h(X)},

with equality iff X is Gaussian and X̂ is the mean of X.

pf: Let X̂ be any estimator of X. Then

    E(X − X̂)² ≥ min_{X̂} E(X − X̂)²                    (1)
              = E(X − E[X])²   [the mean of X is the best estimator for X]
              = Var(X)
              ≥ (1/(2πe)) e^{2h(X)}                    (2)

where (2) holds because the Gaussian distribution has the maximum entropy for a given variance, i.e., h(X) ≤ (1/2) ln(2πe Var(X)).

We have equality in (1) only if X̂ is the best estimator (i.e., X̂ is the mean of X), and equality in (2) only if X is Gaussian.

Corollary: Given side information Y and an estimator X̂(Y), it follows that

    E(X − X̂(Y))² ≥ (1/(2πe)) e^{2h(X|Y)}.

This is the continuous analog of Fano's inequality.
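The bound is tight only in the Gaussian case. As a quick check that it is strict otherwise, compare Var(X) with (1/(2πe)) e^{2h(X)} for a uniform r.v. on [0, a], where Ex. 1 gives h(X) = ln a nats and Var(X) = a²/12 (the value of a is an arbitrary choice):

```python
import math

a = 4.0
variance = a * a / 12                              # Var of Uniform[0, a]
h = math.log(a)                                    # h(X) = ln a nats, from Ex. 1
bound = math.exp(2 * h) / (2 * math.pi * math.e)   # (1/(2*pi*e)) e^{2h(X)} = a^2/(2*pi*e)
print(variance, bound)  # the variance strictly exceeds the bound
```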