Renyi's entropy
JOSE C. PRINCIPE, U. OF FLORIDA, EEL 6935, principe@cnel.ufl.edu, copyright 2009

History: Alfred Renyi was looking for the most general definition of information measures that would preserve additivity for independent events and be compatible with the axioms of probability.

He started with Cauchy's functional equation: if p and q are independent, then I(pq) = I(p) + I(q).

Apart from a normalizing constant this is compatible with Hartley's information content, I(p) = -log p. If we assume that the events X = {x_1, ..., x_N} have different probabilities {p_1, ..., p_N}, and each delivers I_k bits of information, then the total amount of information for the set is

$$I(P) = \sum_{k=1}^{N} p_k I_k$$

This can be recognized as Shannon's entropy. But Renyi reasoned that there is an implicit assumption in this equation: it uses the linear average, which is not the only mean that can be used.
In the general theory of means, for any function g(x) with inverse g^{-1}, the mean can be computed as

$$\bar{x} = g^{-1}\!\left(\sum_{k=1}^{N} p_k\, g(x_k)\right)$$
Applying this definition to I(P) we get

$$I(P) = g^{-1}\!\left(\sum_{k=1}^{N} p_k\, g(I_k)\right)$$

When the postulate of additivity for independent events is applied, only two forms of g(x) remain:

$$g(x) = cx \qquad\qquad g(x) = c\,2^{(1-\alpha)x}$$

The first form gives Shannon information and the second gives

$$I_\alpha(P) = \frac{1}{1-\alpha}\log\!\left(\sum_{k=1}^{N} p_k^\alpha\right)$$
for nonnegative α different from 1. This gives a parametric family of information measures that are today called Renyi's entropies.

It can be shown that Shannon's entropy is the special case α → 1. Applying L'Hopital's rule,

$$\lim_{\alpha\to 1} H_\alpha(X) = \lim_{\alpha\to 1}\frac{1}{1-\alpha}\log\sum_{k=1}^{N} p_k^\alpha
= \frac{\displaystyle\lim_{\alpha\to 1}\sum_{k=1}^{N} p_k^\alpha\log p_k \Big/ \sum_{k=1}^{N} p_k^\alpha}{\displaystyle\lim_{\alpha\to 1}(-1)}
= -\sum_{k=1}^{N} p_k\log p_k = H_S(X)$$

So Renyi's entropies contain Shannon's entropy as a special case.
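As a quick numerical check, the short Python sketch below evaluates the discrete Renyi entropy for several orders and shows that it approaches Shannon's entropy as α → 1 (the PMF and the α values are arbitrary choices for illustration, not from the slides):

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Discrete Renyi entropy H_alpha(p) = 1/(1-alpha) * log(sum_k p_k^alpha)."""
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):                 # alpha -> 1 limit is Shannon's entropy
        return -np.sum(p * np.log(p + 1e-300))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.15, 0.1])           # example PMF (assumption)
for alpha in [0.5, 0.99, 1.0, 1.01, 2.0, 5.0]:
    print(f"alpha={alpha:4.2f}  H_alpha={renyi_entropy(p, alpha):.4f}")
# the values at alpha = 0.99 and 1.01 bracket the Shannon value printed for alpha = 1.0
```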

Meaning of α

If we compare the two definitions we see that instead of weighting log p_k by the probabilities, here the log is external to the sum and its argument is the α power of the PMF, i.e.

$$H_\alpha(X) = \frac{1}{1-\alpha}\log\big(V_\alpha(X)\big) = -\log\sqrt[\alpha-1]{V_\alpha(X)} = -\log\sqrt[\alpha-1]{E\big[p^{\alpha-1}(X)\big]}, \qquad V_\alpha(X) = \sum_k p_k^\alpha$$

Note that V_α(X) is the quantity whose α-th root gives the α-norm of the PMF.

A geometric view will help here. All the PMFs of a discrete r.v. with N outcomes live on what is called the simplex, in an N-dimensional space.
[Figure: PMFs on the simplex. Left: p = (p_1, p_2) in 2-D; right: p = (p_1, p_2, p_3) in 3-D. The quantity $\sum_{k=1}^{n} p_k^\alpha = \|p\|_\alpha^\alpha$ (the α-norm of p raised to the power α) is the argument of the entropy.]
The norm measures the distance of the PMF to the origin, and α plays the role of the index l of the l-norm (the Euclidean norm is l = 2). Changing α therefore changes the metric in the simplex.
[Figure: entropy contour plots on the 3-symbol simplex, with vertices [1 0 0], [0 1 0], [0 0 1], for α = 0.2, 0.5, 1, 2, 4, and 20.]

From another perspective, the α-th root of V_α(X) is the α-norm of the PMF. So we conclude that Renyi's entropy is more flexible than Shannon's and includes Shannon's as a special case.

We will use α = 2 extensively.
Properties of Renyi’s entropies
Renyi's quadratic entropy

We will rely heavily on Renyi's entropy with α = 2, called the quadratic entropy,

$$H_2(X) = -\log\!\left(\sum_k p_k^2\right)$$

It has been used in physics, in signal processing, and in economics.

We also want to stress that the argument of the log, V_2(X) (the 2-norm of the PMF raised to the power 2), has a meaning in itself. In fact it is E[p(X)]; if we make the change of variables ξ_k = p(x_k), it is the mean of the transformed variable when the PMF is the transformation.
[Figure: a PDF p(x) with the mean of X marked on the horizontal axis and the mean of p(X) marked on the vertical axis.]

For us Renyi's quadratic entropy is appealing because we have found an easy way to estimate it directly from samples.
Extensions to continuous variables

It is possible to show that Renyi's entropy measure extends to continuous r.v. and reads

$$H_\alpha(X) = \lim_{n\to\infty}\big(I_\alpha(P_n) - \log n\big) = \frac{1}{1-\alpha}\log\int p^\alpha(x)\,dx$$

and for the quadratic entropy

$$H_2(X) = -\log\int p^2(x)\,dx$$

Notice however that the entropy is no longer necessarily positive; in fact it can become arbitrarily large and negative.
Estimation of Renyi's quadratic entropy

We will use ideas from density estimation, specifically kernel density estimation. This is an old method (Rosenblatt) that is now called Parzen density estimation because Parzen proved many important statistical properties of the estimator.

The idea is very simple: place a kernel over each sample and sum with the proper normalization, i.e.

$$\hat{p}_X(x) = \frac{1}{N\sigma}\sum_{i=1}^{N}\kappa\!\left(\frac{x - x_i}{\sigma}\right)$$

The kernel has the following properties:

1. $\sup_{x\in\mathbb{R}}\kappa(x) < \infty$
2. $\int_{\mathbb{R}}\kappa(x)\,dx < \infty$
3. $\lim_{x\to\infty} x\,\kappa(x) = 0$
4. $\kappa(x) \ge 0,\ \int_{\mathbb{R}}\kappa(x)\,dx = 1$

Parzen proved that the estimator is asymptotically unbiased and consistent, with good efficiency. The free parameter σ is called the kernel size. Normally a Gaussian kernel is used and σ becomes its standard deviation.
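A minimal Python sketch of the Parzen estimator (the Gaussian kernel, data, and kernel size below are illustrative assumptions, not prescriptions from the slides):

```python
import numpy as np

def gaussian_kernel(u):
    """Unit-variance Gaussian kernel kappa(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def parzen_pdf(x_eval, samples, sigma):
    """Parzen estimate p_hat(x) = 1/(N*sigma) * sum_i kappa((x - x_i)/sigma)."""
    x_eval = np.atleast_1d(x_eval)
    u = (x_eval[:, None] - samples[None, :]) / sigma   # (n_eval, N) scaled differences
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * sigma)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=200)               # example data (assumption)
grid = np.linspace(-4, 4, 9)
print(np.round(parzen_pdf(grid, samples, sigma=0.3), 3))
```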
But notice that here we are not interested in estimating the PDF, which is a function; we are interested in a single number, the 2-norm of the PDF. For a Gaussian kernel, substituting the estimator yields immediately

$$\hat{H}_2(X) = -\log\int_{-\infty}^{\infty}\!\left(\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)\right)^2 dx
= -\log\frac{1}{N^2}\int_{-\infty}^{\infty}\!\left(\sum_{i=1}^{N}\sum_{j=1}^{N} G_\sigma(x - x_j)\,G_\sigma(x - x_i)\right) dx$$
$$= -\log\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\int_{-\infty}^{\infty} G_\sigma(x - x_j)\,G_\sigma(x - x_i)\,dx
= -\log\!\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)\right)$$

We call this estimator for V_2(X) the Information Potential (IP).

Let us look at the derivation: we never had to compute the integral explicitly, since the integral of the product of two Gaussians is the value of a Gaussian (with a larger kernel size) evaluated at the difference of their arguments.

Let us also look at the expression: the algorithm is O(N^2), and there is a free parameter σ that the user has to select from the data. Cross-validation or Silverman's rule suffices for most cases:

$$\sigma_{opt} = \sigma_X\big(4 N^{-1}(2d + 1)^{-1}\big)^{1/(d+4)}$$
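A direct sketch of this estimator in Python (my own illustration, assuming 1-D data and natural logarithms): it builds the pairwise differences, evaluates the Gaussian with kernel size σ√2, averages to get the IP, and takes -log. The 1-D (d = 1) form of Silverman's rule is also included.

```python
import numpy as np

def silverman_sigma(x):
    """Silverman's rule for 1-D data: sigma_opt = std(x) * (4/(3N))**(1/5)."""
    return np.std(x) * (4.0 / (3.0 * len(x))) ** 0.2

def information_potential(x, sigma):
    """IP: V2_hat = 1/N^2 * sum_ij G_{sigma*sqrt(2)}(x_j - x_i)."""
    d = x[:, None] - x[None, :]                      # pairwise differences
    s2 = 2.0 * sigma**2                              # variance of G_{sigma*sqrt(2)}
    G = np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return G.mean()

def renyi_quadratic_entropy(x, sigma=None):
    """H2_hat = -log(V2_hat)."""
    if sigma is None:
        sigma = silverman_sigma(x)
    return -np.log(information_potential(x, sigma))

x = np.random.default_rng(1).normal(size=500)        # example samples (assumption)
print(renyi_quadratic_entropy(x))                    # true H2 of N(0,1) is log(2*sqrt(pi)) ~ 1.2655
```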

Extended Estimator for Renyi's Entropy

We can come up with yet another estimator for Renyi's entropy of any order,

$$H_\alpha(X) \stackrel{\Delta}{=} \frac{1}{1-\alpha}\log\int_{-\infty}^{\infty} p_X^\alpha(x)\,dx = \frac{1}{1-\alpha}\log E_X\big[p_X^{\alpha-1}(X)\big]$$

by approximating the expected value by the empirical mean,

$$H_\alpha(X) \approx \frac{1}{1-\alpha}\log\frac{1}{N}\sum_{j=1}^{N} p_X^{\alpha-1}(x_j)$$

to yield

$$\hat{H}_\alpha(X) = \frac{1}{1-\alpha}\log\frac{1}{N}\sum_{j=1}^{N}\!\left(\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1}
= \frac{1}{1-\alpha}\log\frac{1}{N^\alpha}\sum_{j=1}^{N}\!\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1}$$
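A compact sketch of this plug-in estimator for arbitrary order (a Gaussian kernel is assumed; the data are an arbitrary example):

```python
import numpy as np

def renyi_entropy_estimator(x, alpha, sigma):
    """H_alpha_hat = 1/(1-alpha) * log( (1/N) * sum_j p_hat(x_j)**(alpha-1) ),
    with p_hat the Parzen estimate built from the same samples."""
    assert alpha > 0 and not np.isclose(alpha, 1.0)
    d = x[:, None] - x[None, :]
    G = np.exp(-d**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    p_at_samples = G.mean(axis=1)                 # p_hat(x_j), j = 1..N
    return np.log(np.mean(p_at_samples ** (alpha - 1.0))) / (1.0 - alpha)

x = np.random.default_rng(2).normal(size=400)     # example samples (assumption)
for a in [1.01, 2.0, 3.0]:
    print(a, renyi_entropy_estimator(x, a, sigma=0.4))
```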
We will now present a set of properties of these estimators.

Properties of the Estimators

Property 2.1: The estimator of Eq. (2.18) for Renyi's quadratic entropy using Gaussian kernels only differs from the IP of Eq. (2.14) by a factor of √2 in the kernel size.

Property 2.2: For any Parzen kernel that obeys the relation

$$\kappa_{new}(x_j - x_i) = \int_{-\infty}^{\infty}\kappa_{old}(x - x_i)\,\kappa_{old}(x - x_j)\,dx \qquad (2.19)$$

the estimator of Renyi's quadratic entropy of Eq. (2.18) matches the estimator of Eq. (2.14) using the IP.

Property 2.3: The kernel size must be a parameter that satisfies the scaling property $\kappa_{c\sigma}(x) = \kappa_\sigma(x/c)/c$ for any positive factor c.

Property 2.4: The entropy estimator in Eq. (2.18) is invariant to the mean of the underlying density of the samples, as is the actual entropy.

Property 2.5: The limit of Renyi's entropy as α → 1 is Shannon's entropy. The limit of the entropy estimator in Eq. (2.18) as α → 1 is Shannon's entropy estimated using Parzen windowing with the expectation approximated by the sample mean.
Property 2.6: In order to maintain consistency with the scaling property of the actual entropy, if the entropy of samples {x_1, ..., x_N} of a random variable X is estimated using a kernel size σ, the entropy of the samples {cx_1, ..., cx_N} of the random variable cX must be estimated using a kernel size |c|σ.

Property 2.7: When estimating the joint entropy of an n-dimensional random vector X from its samples {x_1, ..., x_N}, use a multidimensional kernel that is the product of single-dimensional kernels. This way, the estimate of the joint entropy and the estimates of the marginal entropies are consistent.

Theorem 2.1: The entropy estimator in Eq. (2.18) is consistent if the Parzen windowing and the sample mean are consistent for the actual PDF of the iid samples.

Theorem 2.2: If the maximum value of the kernel κ_σ(ξ) is achieved when ξ = 0, then the minimum value of the entropy estimator in Eq. (2.18) is achieved when all samples are equal to each other, i.e., x_1 = ... = x_N = c.
Theorem 2.3: If the kernel function κ_σ(·) is continuous, differentiable, symmetric, and unimodal, then the global minimum described in Theorem 2.2 of the entropy estimator in Eq. (2.18) is smooth, i.e., it has a zero gradient and a positive semi-definite Hessian matrix.

Property 2.8: If the kernel function satisfies the conditions in Theorem 2.3, then in the limit, as the kernel size tends to infinity, the quadratic entropy estimator approaches the logarithm of a scaled and biased version of the sample variance.

Property 2.9: In the case of joint entropy estimation, if the multidimensional kernel function satisfies $\kappa_\Sigma(\xi) = \kappa_\Sigma(R^{-1}\xi)$ for all orthonormal matrices R, then the entropy estimator in Eq. (2.18) is invariant under rotations, as is the actual entropy of a random vector X. Notice that the condition on the joint kernel function requires hyperspherical symmetry.
Theorem 2.4: $\lim_{N\to\infty}\hat{H}_\alpha(X) = H_\alpha(\hat{X}) \ge H_\alpha(X)$, where X̂ is a random variable with PDF $f_X(\cdot) * \kappa_\sigma(\cdot)$. The equality (in the inequality portion) occurs if and only if the kernel size is zero. This result is also valid on average for the finite-sample case.

Bias of the Information Potential:

$$\mathrm{Bias}[\hat{V}(X)] = E[\hat{V}(X)] - \int f^2(x)\,dx = \frac{\sigma^2}{2}E[f''(X)]$$

Variance of the Information Potential:

$$\mathrm{Var}(\hat{V}(X)) = E(\hat{V}^2(X)) - \big(E(\hat{V}(X))\big)^2 = \frac{aN(N-1)(N-2) + bN(N-1)}{\sigma N^4} \approx \frac{a}{N\sigma} \quad \text{as } N\to\infty$$

$$a = E\big[K_\sigma(x_i - x_j)K_\sigma(x_j - x_l)\big] - E\big[K_\sigma(x_i - x_j)\big]E\big[K_\sigma(x_j - x_l)\big], \qquad b = \mathrm{Var}\big[K_\sigma(x_i - x_j)\big]$$

The AMISE (asymptotic mean integrated square error):

$$\mathrm{AMISE}(\hat{V}(X)) = E\Big[\int\big(\hat{V}(X) - V(X)\big)^2 dx\Big] = \frac{\sigma^4}{2}\int\big(f''(x)\big)^2 dx + \frac{aN(N-1)(N-2) + bN(N-1)}{\sigma N^4}$$

This is expected from kernel density estimation theory; the IP belongs to this class of estimators.
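These bias and variance trends are easy to verify empirically. The sketch below (my own experiment, not from the slides) repeatedly estimates the IP for unit-variance Gaussian data, for which the true value is V(X) = 1/(2√π):

```python
import numpy as np

def information_potential(x, sigma):
    """V2_hat = mean over pairs of G_{sigma*sqrt(2)}(x_j - x_i)."""
    d = x[:, None] - x[None, :]
    s2 = 2.0 * sigma**2
    return np.mean(np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2))

rng = np.random.default_rng(3)
true_V = 1.0 / (2.0 * np.sqrt(np.pi))            # exact V(X) for a unit-variance Gaussian
for N in [100, 200, 400]:
    est = [information_potential(rng.normal(size=N), sigma=0.5) for _ in range(200)]
    print(f"N={N:4d}  bias={np.mean(est) - true_V:+.4f}  var={np.var(est):.2e}")
# with sigma fixed, the bias is dominated by the kernel-smoothing term,
# while the variance shrinks roughly like 1/N
```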
Physical Interpretation of the Information Potential Estimator

There is a useful physical interpretation of the IP; think of the gravitational or the electrostatic field. Assume the samples are particles (information particles) of unit mass that interact with each other according to the rules specified by the Parzen kernel used. This analogy is exact.

The information potential field is given by the kernel density estimate,

$$\hat{V}_2(x_j) \stackrel{\Delta}{=} \frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)$$

and the Information Potential is the total potential,

$$\hat{V}_2(X) = \frac{1}{N}\sum_{j=1}^{N}\hat{V}_2(x_j)$$

Now, if there is a potential there are forces in the space of the samples, which can be easily computed as

$$\frac{\partial}{\partial x_j}\hat{V}_2(x_j) = \frac{1}{N}\sum_{i=1}^{N} G'_{\sigma\sqrt{2}}(x_j - x_i) = \frac{1}{2N\sigma^2}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)(x_i - x_j)$$

$$\hat{F}_2(x_j; x_i) \stackrel{\Delta}{=} \frac{1}{N} G'_{\sigma\sqrt{2}}(x_j - x_i), \qquad
\hat{F}_2(x_j) \stackrel{\Delta}{=} \frac{\partial}{\partial x_j}\hat{V}_2(x_j) = \sum_{i=1}^{N}\hat{F}_2(x_j; x_i)$$
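The field and force expressions translate directly into code; the sketch below (mine, 1-D samples assumed) returns the potential field at each sample, the total IP, and the information force on each sample:

```python
import numpy as np

def ip_field_and_forces(x, sigma):
    """Information potential field, total IP, and information forces for 1-D samples.
    The pairwise kernel is G_{sigma*sqrt(2)}, as in the IP estimator."""
    N = len(x)
    d = x[:, None] - x[None, :]                       # d[j, i] = x_j - x_i
    s2 = 2.0 * sigma**2
    G = np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    V_field = G.mean(axis=1)                          # V2_hat(x_j)
    V_total = V_field.mean()                          # V2_hat(X)
    # F2_hat(x_j) = 1/(2 N sigma^2) * sum_i G(x_j - x_i) * (x_i - x_j)
    F = (G * (-d)).sum(axis=1) / (2.0 * N * sigma**2)
    return V_field, V_total, F

x = np.array([-1.0, -0.2, 0.1, 1.5])                  # toy samples (assumption)
print(ip_field_and_forces(x, sigma=0.5))
```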
[Figure: interaction field around one information particle placed at the center of a 2-D space.]

Another way to look at the interactions among samples is to create two matrices in the sample space,

$$D = \{d_{ij}\},\quad d_{ij} = x_i - x_j \qquad\qquad \varsigma = \{\hat{V}_{ij}\},\quad \hat{V}_{ij} = G_{\sigma\sqrt{2}}(d_{ij})$$

and the corresponding fields,

$$\hat{V}(i) = \frac{1}{N}\sum_{j=1}^{N}\hat{V}_{ij} \qquad \hat{F}(i) = -\frac{1}{2N\sigma^2}\sum_{j=1}^{N}\hat{V}_{ij}\,d_{ij} \qquad
\hat{V}(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\hat{V}_{ij} = \frac{1}{N}\sum_{i=1}^{N}\hat{V}(i)$$

D is a matrix of distances and ς is a matrix of scalars. As we can see, the samples create a metric space given by the kernel.
Example of forces pulling samples apart when the entropy is maximized. The lines show the direction of the force and their length gives the intensity of the pull.

[Figure: samples in the unit square with a force vector attached to each sample.]
Extension to any α and kernel

The framework of information potentials and forces extends to any Parzen kernel, where the shape of the kernel creates the interaction field. Changing α also changes the field.

The α information potential field is

$$\hat{V}_\alpha(x_j) \stackrel{\Delta}{=} \frac{1}{N^{\alpha-1}}\!\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-1}, \qquad \hat{V}_\alpha(X) = \frac{1}{N}\sum_{j=1}^{N}\hat{V}_\alpha(x_j)$$

and it can be expressed as a function of the IP as

$$\hat{V}_\alpha(x_j) = \frac{1}{N^{\alpha-2}}\!\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-2}\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i) = \hat{p}^{\alpha-2}(x_j)\,\hat{V}_2(x_j)$$

The α information force is likewise given by

$$\hat{F}_\alpha(x_j) \stackrel{\Delta}{=} \frac{\partial}{\partial x_j}\hat{V}_\alpha(x_j) = \frac{\alpha-1}{N^{\alpha-1}}\!\left(\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right)^{\alpha-2}\!\left(\sum_{i=1}^{N}\kappa'_\sigma(x_j - x_i)\right) = (\alpha-1)\,\hat{p}_X^{\alpha-2}(x_j)\,\hat{F}_2(x_j)$$

so conceptually all fields can be composed from the IP. α < 2 emphasizes regions of lower sample density (the opposite holds for α > 2).
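A short sketch (mine; Gaussian kernel and 1-D samples assumed) of how the α-order field and force can be composed from the order-2 quantities, following the relations above:

```python
import numpy as np

def alpha_field_and_force(x, alpha, sigma):
    """V_alpha_hat(x_j) = p_hat(x_j)**(alpha-2) * V2_hat(x_j) and
    F_alpha_hat(x_j) = (alpha-1) * p_hat(x_j)**(alpha-2) * F2_hat(x_j)."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
    Kp = K * (-d) / sigma**2                     # kappa'_sigma(x_j - x_i)
    p_hat = K.mean(axis=1)                       # Parzen estimate at each sample
    V2 = p_hat                                   # order-2 field (same kernel convention)
    F2 = Kp.mean(axis=1)                         # order-2 force field
    V_alpha = p_hat ** (alpha - 2.0) * V2
    F_alpha = (alpha - 1.0) * p_hat ** (alpha - 2.0) * F2
    return V_alpha, F_alpha

x = np.random.default_rng(4).normal(size=50)      # example samples (assumption)
print(alpha_field_and_force(x, alpha=3.0, sigma=0.5)[0][:5])
```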
Divergence Measures

Renyi proposed a divergence measure derived from his entropy definition which is different from the Kullback-Leibler divergence discussed in Chapter 1.

Renyi's divergence:

$$D_\alpha(f\|g) \stackrel{\Delta}{=} \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty} f(x)\left(\frac{f(x)}{g(x)}\right)^{\alpha-1} dx$$

with properties

i. $D_\alpha(f\|g) \ge 0,\ \forall f, g,\ \alpha > 0$
ii. $D_\alpha(f\|g) = 0$ iff $f(x) = g(x)\ \forall x \in \mathbb{R}$
iii. $\lim_{\alpha\to 1} D_\alpha(f\|g) = D_{KL}(f\|g)$

Writing the divergence as an expectation over f and approximating it by the empirical mean over the samples of f (with Parzen estimates plugged in for both densities) gives

$$D_\alpha(f\|g) = \frac{1}{\alpha-1}\log E_f\!\left[\left(\frac{f(x)}{g(x)}\right)^{\alpha-1}\right] \approx \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\hat{f}(x_j^f)}{\hat{g}(x_j^f)}\right)^{\alpha-1}$$

$$\hat{D}_\alpha(f\|g) = \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\sum_{i=1}^{N}\kappa_\sigma(x_j^f - x_i^f)}{\sum_{i=1}^{N}\kappa_\sigma(x_j^f - x_i^g)}\right)^{\alpha-1}$$
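The plug-in divergence estimator maps to a few lines of Python; in the sketch below (mine) Gaussian kernels and equal sample counts are assumed, so the Parzen normalizations cancel in the ratio:

```python
import numpy as np

def gauss(d, sigma):
    """Gaussian kernel evaluated on an array of differences."""
    return np.exp(-d**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

def renyi_divergence(xf, xg, alpha, sigma):
    """D_alpha_hat(f||g): Parzen estimates of f and g evaluated at the f-samples."""
    f_at_xf = gauss(xf[:, None] - xf[None, :], sigma).mean(axis=1)   # f_hat(x_j^f)
    g_at_xf = gauss(xf[:, None] - xg[None, :], sigma).mean(axis=1)   # g_hat(x_j^f)
    return np.log(np.mean((f_at_xf / g_at_xf) ** (alpha - 1.0))) / (alpha - 1.0)

rng = np.random.default_rng(5)
xf = rng.normal(0.0, 1.0, 300)          # samples from f (assumption)
xg = rng.normal(0.5, 1.0, 300)          # samples from g (assumption)
print(renyi_divergence(xf, xg, alpha=2.0, sigma=0.4))   # positive, and near zero when f = g
```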
Renyi's Mutual Information

Renyi's mutual information is defined as the divergence between the joint PDF and the product of the marginals,

$$I_\alpha(X) \stackrel{\Delta}{=} \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty}\frac{p_X^\alpha(x_1, \ldots, x_n)}{\prod_{o=1}^{n} p_{X_o}^{\alpha-1}(x_o)}\,dx_1\cdots dx_n$$

and we can come up with a kernel estimate for it if we approximate the expectation by the empirical mean,

$$\hat{I}_\alpha(X) \stackrel{\Delta}{=} \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\frac{1}{N}\sum_{i=1}^{N}\kappa_\Sigma(x_j - x_i)}{\prod_{o=1}^{n}\frac{1}{N}\sum_{i=1}^{N}\kappa_{\sigma_o}(x_j^o - x_i^o)}\right)^{\alpha-1}
= \frac{1}{\alpha-1}\log\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\frac{1}{N}\sum_{i=1}^{N}\prod_{o=1}^{n}\kappa_{\sigma_o}(x_j^o - x_i^o)}{\prod_{o=1}^{n}\frac{1}{N}\sum_{i=1}^{N}\kappa_{\sigma_o}(x_j^o - x_i^o)}\right)^{\alpha-1}$$

which also approaches the KL-based estimator as α → 1. However, Renyi's divergence and mutual information are not as general as the KL versions, because Shannon's measure of information is the only one for which the increase in information equals the negative of the decrease in uncertainty.
Quadratic Divergence and Mutual Information

Measuring dissimilarity in probability space is a complex issue and there are many other divergence measures that can be used. If we transform the simplex into a hypersphere preserving the Fisher information, the coordinates of each PMF become √p_k. The geodesic distance between PMFs f and g then satisfies $\cos D_G = \sum_k\sqrt{f_k g_k}$. This is related to the Bhattacharyya distance (which is Renyi's divergence with α = 1/2),

$$D_B(f, g) = -\ln\!\left(\int\sqrt{f(x)g(x)}\,dx\right)$$

Now, if we measure the distance between f and g in a linear projection space (the chordal distance) we get $D_H = \big(\sum_k(\sqrt{f_k} - \sqrt{g_k})^2\big)^{0.5}$. This resembles the Hellinger distance, which is also related to Renyi's divergence,

$$D_H(f, g) = \left[\int\big(\sqrt{f(x)} - \sqrt{g(x)}\big)^2 dx\right]^{1/2} = \left[2\left(1 - \int\sqrt{f(x)g(x)}\,dx\right)\right]^{1/2}$$

Since we have an estimator for Renyi's quadratic entropy (the IP), we will use α = 2 instead of α = 1/2 and define two quadratic divergences between f and g, called the Euclidean distance and the Cauchy-Schwarz divergence.
Euclidean distance between PDFs

We will drop the square root and define the distance as

$$D_{ED}(f, g) = \int\big(f(x) - g(x)\big)^2 dx = \int f^2(x)\,dx - 2\int f(x)g(x)\,dx + \int g^2(x)\,dx$$

Notice that each of the three terms can be estimated with the IP. The middle term involves samples from both PDFs and so will be called the Cross Information Potential (CIP); it measures the potential created by one PDF at the locations specified by the samples of the other PDF.

We will define the quadratic mutual information QMI_ED between two random variables as

$$I_{ED}(X_1, X_2) = D_{ED}\big(f_{X_1X_2}(x_1, x_2),\ f_{X_1}(x_1)f_{X_2}(x_2)\big)$$

Notice that I_ED is zero if (and only if) the two r.v. are independent. The extension to multidimensional random variables is also possible:

$$I_{ED}(X_1, \ldots, X_k) = D_{ED}\Big(f_X(x_1, \ldots, x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\Big)$$

Although a distance, we sometimes refer to D_ED as a divergence.
Cauchy-Schwarz divergence

The other divergence is related to the Bhattacharyya distance but can be formally derived from the Cauchy-Schwarz inequality,

$$\int f^2(x)\,dx\int g^2(x)\,dx \ge \left(\int f(x)g(x)\,dx\right)^2$$

where equality holds iff f(x) = g(x). The divergence is defined as

$$D_{CS}(f, g) = -\log\frac{\int f(x)g(x)\,dx}{\sqrt{\int f^2(x)\,dx\int g^2(x)\,dx}}$$

D_CS is always greater than or equal to zero, it is zero for f(x) = g(x), and it is symmetric (but does not obey the triangle inequality). In practice we often work with the square of the ratio inside the log, in which case the divergence becomes

$$D_{CS}(f, g) = \log\!\left(\int f^2(x)\,dx\right) + \log\!\left(\int g^2(x)\,dx\right) - 2\log\!\left(\int f(x)g(x)\,dx\right)$$

(twice the value above); this is the form used in the estimators that follow. We can also define the quadratic mutual information QMI_CS as

$$I_{CS}(X_1, X_2) = D_{CS}\big(f_{X_1X_2}(x_1, x_2),\ f_{X_1}(x_1)f_{X_2}(x_2)\big)$$

and, as before, it is zero for independent variables. It can easily be extended to multiple variables:

$$I_{CS}(X_1, \ldots, X_k) = D_{CS}\Big(f_X(x_1, \ldots, x_k),\ \prod_{i=1}^{k} f_{X_i}(x_i)\Big)$$
Relation between D_CS and Renyi's relative entropy

There is a very interesting interpretation of D_CS in terms of Renyi's entropy. Lutwak defines Renyi's relative entropy as

$$D_{R\alpha}(f, g) = \log\frac{\left(\int_{\mathbb{R}} g^{\alpha-1}(x)f(x)\,dx\right)^{\frac{1}{1-\alpha}}\left(\int_{\mathbb{R}} g^{\alpha}(x)\,dx\right)^{\frac{1}{\alpha}}}{\left(\int_{\mathbb{R}} f^{\alpha}(x)\,dx\right)^{\frac{1}{\alpha(1-\alpha)}}}$$

For α = 2 this gives exactly D_CS, so

$$D_{CS}(X, Y) = -\log\!\left(\int f(x)g(x)\,dx\right) + \tfrac{1}{2}\log\!\left(\int f^2(x)\,dx\right) + \tfrac{1}{2}\log\!\left(\int g^2(x)\,dx\right) = H_2(X;Y) - \tfrac{1}{2}H_2(X) - \tfrac{1}{2}H_2(Y)$$

where the first term can be shown to be Renyi's quadratic cross-entropy. There is a striking similarity between D_CS and Shannon's mutual information, but notice that here we are dealing with Renyi's quadratic entropy.
Geometric Interpretation of Quadratic Mutual Information

Let us define

$$V_J = \iint f_{X_1X_2}^2(x_1, x_2)\,dx_1dx_2 \qquad
V_M = \iint\big(f_{X_1}(x_1)f_{X_2}(x_2)\big)^2 dx_1dx_2 \qquad
V_c = \iint f_{X_1X_2}(x_1, x_2)\,f_{X_1}(x_1)f_{X_2}(x_2)\,dx_1dx_2$$

Therefore QMI_ED and QMI_CS can be rewritten as

$$I_{ED} = V_J - 2V_c + V_M \qquad\qquad I_{CS} = \log V_J - 2\log V_c + \log V_M$$

The figure illustrates these quantities in the simplex.

[Figure: the joint PDF f_{X1X2}(x_1, x_2) and the factorized PDF f_{X1}(x_1)f_{X2}(x_2) shown as two vectors of squared lengths V_J and V_M separated by an angle θ; I_ED is the Euclidean distance between them, I_S the K-L divergence, and I_CS = -log(cos²θ), with V_c = cosθ·√(V_J·V_M).]
Example

[Figure: a 2x2 joint PMF P_X over (X_1, X_2) with marginal P_{X_1} = (0.6, 0.4) held fixed while P(1,1) is varied over [0, 0.6] and P(2,1) over [0, 0.4]; the panels show the resulting values of I_S (Shannon mutual information), I_ED, and I_CS as functions of P_X(1,1) and P_X(2,1).]
$$I_{ED}(X_1, X_2) = \sum_{i=1}^{n}\sum_{j=1}^{m}\big(P_X(i,j) - P_{X_1}(i)P_{X_2}(j)\big)^2$$

$$I_{CS}(X_1, X_2) = \log\frac{\left(\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i,j)^2\right)\left(\sum_{i=1}^{n}\sum_{j=1}^{m}\big(P_{X_1}(i)P_{X_2}(j)\big)^2\right)}{\left(\sum_{i=1}^{n}\sum_{j=1}^{m} P_X(i,j)\,P_{X_1}(i)P_{X_2}(j)\right)^2}$$
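These discrete expressions are easy to check numerically; the sketch below (mine) uses an arbitrary 2x2 joint PMF:

```python
import numpy as np

def qmi_ed_cs(P):
    """Quadratic mutual informations for a discrete joint PMF P[i, j]."""
    P1 = P.sum(axis=1)                    # marginal of X1
    P2 = P.sum(axis=0)                    # marginal of X2
    Q = np.outer(P1, P2)                  # product of marginals
    V_J = np.sum(P**2)
    V_M = np.sum(Q**2)
    V_c = np.sum(P * Q)
    I_ED = V_J - 2.0 * V_c + V_M          # equals sum((P - Q)**2)
    I_CS = np.log(V_J * V_M / V_c**2)
    return I_ED, I_CS

P = np.array([[0.4, 0.2],                 # example joint PMF (assumption)
              [0.1, 0.3]])
print(qmi_ed_cs(P))   # both are zero when P equals the outer product of its marginals
```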
Information Potential and Forces in the Joint Spaces

The important lesson to be learned from the IP framework and the distance measures we defined is that each PDF creates its own potential, which interacts with the others. This is an additive process, so the forces are also additive, and everything stays relatively simple.

Euclidean and Cauchy-Schwarz divergences

The three potentials needed are

$$\hat{V}_f = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_f(i) - x_f(j)\big) \qquad
\hat{V}_g = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_g(i) - x_g(j)\big) \qquad
\hat{V}_c = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_f(i) - x_g(j)\big)$$

and the divergences and forces are written as

$$\hat{D}_{ED}(f, g) = \hat{V}_{ED} = \hat{V}_f + \hat{V}_g - 2\hat{V}_c \qquad
\frac{\partial\hat{V}_{ED}}{\partial x_i} = \frac{\partial\hat{V}_f}{\partial x_i} + \frac{\partial\hat{V}_g}{\partial x_i} - 2\frac{\partial\hat{V}_c}{\partial x_i}$$

$$\hat{D}_{CS}(f, g) = \hat{V}_{CS} = \log\frac{\hat{V}_f\hat{V}_g}{\hat{V}_c^2} \qquad
\frac{\partial\hat{V}_{CS}}{\partial x_i} = \frac{1}{\hat{V}_f}\frac{\partial\hat{V}_f}{\partial x_i} + \frac{1}{\hat{V}_g}\frac{\partial\hat{V}_g}{\partial x_i} - \frac{2}{\hat{V}_c}\frac{\partial\hat{V}_c}{\partial x_i}$$
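A sketch of these divergence estimators from samples (Gaussian kernels and equal sample counts for f and g are assumptions of this illustration):

```python
import numpy as np

def pairwise_gaussian(a, b, sigma):
    """Matrix of G_{sigma*sqrt(2)}(a_i - b_j) values."""
    d = a[:, None] - b[None, :]
    s2 = 2.0 * sigma**2
    return np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

def ed_cs_divergences(xf, xg, sigma):
    """D_ED_hat = Vf + Vg - 2Vc and D_CS_hat = log(Vf*Vg/Vc^2)."""
    Vf = pairwise_gaussian(xf, xf, sigma).mean()    # information potential of f
    Vg = pairwise_gaussian(xg, xg, sigma).mean()    # information potential of g
    Vc = pairwise_gaussian(xf, xg, sigma).mean()    # cross information potential
    return Vf + Vg - 2.0 * Vc, np.log(Vf * Vg / Vc**2)

rng = np.random.default_rng(6)
xf = rng.normal(0.0, 1.0, 250)                      # samples from f (assumption)
xg = rng.normal(1.0, 1.0, 250)                      # samples from g (assumption)
print(ed_cs_divergences(xf, xg, sigma=0.5))         # both shrink toward zero as f and g coincide
```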
Quadratic Mutual Information

The QMIs are a bit more detailed because now we have to deal with the joint, the marginals, and their product, but everything is still additive. The three Parzen estimates are

$$\hat{f}_{X_1X_2}(x_1, x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\big(x - x(i)\big) \qquad
\hat{f}_{X_1}(x_1) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\big(x_1 - x_1(i)\big) \qquad
\hat{f}_{X_2}(x_2) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\big(x_2 - x_2(i)\big)$$

and they create the following fields:

$$\hat{V}_J = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x(i) - x(j)\big) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\,G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)$$

$$\hat{V}_M = \hat{V}_1\hat{V}_2 \quad\text{with}\quad \hat{V}_k = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_k(i) - x_k(j)\big),\ k = 1, 2$$

$$\hat{V}_C = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\right)\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right) \qquad (2.104)$$
Let us exemplify for V_C:

$$\hat{V}_C = \iint \hat{f}(x_1, x_2)\,\hat{f}(x_1)\hat{f}(x_2)\,dx_1dx_2
= \iint\left[\frac{1}{N}\sum_{k=1}^{N} G_\sigma(x_1 - x_1(k))G_\sigma(x_2 - x_2(k))\right]\left[\frac{1}{N}\sum_{i=1}^{N} G_\sigma(x_1 - x_1(i))\right]\left[\frac{1}{N}\sum_{j=1}^{N} G_\sigma(x_2 - x_2(j))\right] dx_1dx_2$$
$$= \frac{1}{N}\sum_{i=1}^{N}\frac{1}{N}\sum_{j=1}^{N}\frac{1}{N}\sum_{k=1}^{N}\int G_\sigma(x_1 - x_1(i))G_\sigma(x_1 - x_1(k))\,dx_1\int G_\sigma(x_2 - x_2(k))G_\sigma(x_2 - x_2(j))\,dx_2$$
$$= \frac{1}{N}\sum_{k=1}^{N}\left[\frac{1}{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}\big(x_1(k) - x_1(i)\big)\right]\left[\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_2(k) - x_2(j)\big)\right] \qquad (2.102)$$

Notice that V_C, written as a triple sum, has complexity O(N^3) if computed directly. Finally we have

$$\hat{I}_{ED}(X_1, X_2) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big) + \hat{V}_1\hat{V}_2
- \frac{2}{N}\sum_{i=1}^{N}\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\right)\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right)$$

$$\hat{I}_{CS}(X_1, X_2) = \log\frac{\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right)\hat{V}_1\hat{V}_2}{\left(\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_1(i) - x_1(j)\big)\right)\left(\frac{1}{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(x_2(i) - x_2(j)\big)\right)\right)^2}$$
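A sketch of the sample-based QMI estimators (1-D marginals, Gaussian kernels, and a shared kernel size are assumptions of this illustration):

```python
import numpy as np

def qmi_estimators(x1, x2, sigma):
    """QMI_ED_hat and QMI_CS_hat from paired samples (x1(i), x2(i))."""
    s2 = 2.0 * sigma**2
    G1 = np.exp(-(x1[:, None] - x1[None, :])**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    G2 = np.exp(-(x2[:, None] - x2[None, :])**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    V_J = np.mean(G1 * G2)                              # joint potential
    V1, V2 = G1.mean(), G2.mean()                       # marginal potentials
    V_M = V1 * V2                                       # potential of the product of marginals
    V_C = np.mean(G1.mean(axis=1) * G2.mean(axis=1))    # cross potential, O(N^2) after factoring
    I_ED = V_J - 2.0 * V_C + V_M
    I_CS = np.log(V_J * V_M / V_C**2)
    return I_ED, I_CS

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=300)       # correlated pair (assumption)
print(qmi_estimators(x1, x2, sigma=0.5))          # both are near zero if x1 and x2 are independent
```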
The interactions are always based on the marginals, but there are three levels: the individual marginal samples (the joint space), a marginal sample against the marginal fields (the CIP), and the product of the two marginal fields. The framework generalizes to any number of variables:

$$\hat{I}_{ED}(X_1, \ldots, X_K) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat{V}_k(ij) - \frac{2}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\hat{V}_k(i) + \prod_{k=1}^{K}\hat{V}_k$$

$$\hat{I}_{CS}(X_1, \ldots, X_K) = \log\frac{\left(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\prod_{k=1}^{K}\hat{V}_k(ij)\right)\prod_{k=1}^{K}\hat{V}_k}{\left(\frac{1}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}\hat{V}_k(i)\right)^2}$$

The information forces for each field are also easily derived:

$$\hat{F}_{ED}(i) = \frac{\partial\hat{I}_{ED}}{\partial x_k(i)} = \frac{\partial\hat{V}_J}{\partial x_k(i)} - 2\frac{\partial\hat{V}_C}{\partial x_k(i)} + \frac{\partial\hat{V}_k}{\partial x_k(i)} \qquad
\hat{F}_{CS}(i) = \frac{\partial\hat{I}_{CS}}{\partial x_k(i)} = \frac{1}{\hat{V}_J}\frac{\partial\hat{V}_J}{\partial x_k(i)} - \frac{2}{\hat{V}_C}\frac{\partial\hat{V}_C}{\partial x_k(i)} + \frac{1}{\hat{V}_k}\frac{\partial\hat{V}_k}{\partial x_k(i)}$$

These expressions will be very useful when adapting systems with divergences or quadratic mutual information.
Fast Information and Cross Information Potential Calculations

One of the problems with this IP estimation methodology is the computational complexity, which is O(N^2) or O(N^3). There are two ways to approximate the IP and CIP with arbitrary accuracy: the Fast Gauss Transform and the incomplete Cholesky decomposition. They both reduce the complexity to O(NM) with M << N.

Fast Gauss Transform

The FGT takes advantage of the fast decay of the Gaussian function and can efficiently compute weighted sums of scalar Gaussians,

$$S(y_i) = \sum_{j=1}^{N} w_j\, e^{-(y_j - y_i)^2/4\sigma^2}, \qquad i = 1, \ldots, M$$

The savings come from the shifting property of the Gaussian function, which reads

$$e^{-\left(\frac{y_j - y_i}{\sigma}\right)^2} = e^{-\left(\frac{y_j - y_c - (y_i - y_c)}{\sigma}\right)^2} = e^{-\left(\frac{y_j - y_c}{\sigma}\right)^2}\sum_{n=0}^{\infty}\frac{1}{n!}\left(\frac{y_i - y_c}{\sigma}\right)^n H_n\!\left(\frac{y_j - y_c}{\sigma}\right), \qquad H_n(x) = (-1)^n e^{x^2}\frac{d^n}{dx^n}e^{-x^2}$$

and from the efficient approximation obtained by truncating the Hermite expansion at a small order p.
The shifting property is useful because there is no need to evaluate every Gaussian at every point. Instead, a p-term sum is computed around a small number of cluster centers y_c with O(Np) computation. These sums are then shifted to the M desired locations y_i in another O(Mp) operations. Normally the centers are found with the farthest-point clustering algorithm, which is efficient.

The information potential calculation (M = N) can be immediately written as

$$\hat{V}(y) \approx \frac{1}{2\sigma N^2\sqrt{\pi}}\sum_{j=1}^{N}\sum_{b=1}^{B}\sum_{n=0}^{p-1}\frac{1}{n!}\,h_n\!\left(\frac{y_j - y_{C_b}}{2\sigma}\right) C_n(b), \qquad
C_n(b) = \sum_{y_i\in b}\left(\frac{y_i - y_{C_b}}{2\sigma}\right)^n$$

which requires O(NpB) calculations. p is normally 4 or 5 (independent of N), and B, the number of clusters, is also relatively small (around 10), with only a weak dependence on N.

The problem with this algorithm arises when the data is multidimensional, because the complexity term p becomes p^D, where D is the dimension. In order to cope with high dimensions, a vector Taylor series has been proposed to approximate the high-dimensional Gaussian as
$$\exp\!\left(-\frac{\|\mathbf{y}_j - \mathbf{y}_i\|^2}{4\sigma^2}\right) = \exp\!\left(-\frac{\|\mathbf{y}_j - \mathbf{c}\|^2}{4\sigma^2}\right)\exp\!\left(-\frac{\|\mathbf{y}_i - \mathbf{c}\|^2}{4\sigma^2}\right)\exp\!\left(\frac{2(\mathbf{y}_j - \mathbf{c})\cdot(\mathbf{y}_i - \mathbf{c})}{4\sigma^2}\right)$$

and the cross term is approximated by a multi-index Taylor expansion,

$$\exp\!\left(\frac{2(\mathbf{y}_j - \mathbf{c})\cdot(\mathbf{y}_i - \mathbf{c})}{4\sigma^2}\right) = \sum_{\alpha\ge 0}\frac{2^{|\alpha|}}{\alpha!}\left(\frac{\mathbf{y}_j - \mathbf{c}}{2\sigma}\right)^{\alpha}\left(\frac{\mathbf{y}_i - \mathbf{c}}{2\sigma}\right)^{\alpha} + \varepsilon(\alpha), \qquad \alpha! = \alpha_1!\alpha_2!\cdots\alpha_d!, \quad |\alpha| = \alpha_1 + \alpha_2 + \cdots + \alpha_d$$

Now the IP for multidimensional data becomes

$$\hat{V}_T(\mathbf{y}) \approx \frac{1}{N^2\big(4\pi\sigma^2\big)^{d/2}}\sum_{j=1}^{N}\sum_{B}\sum_{\alpha\ge 0} C_\alpha(B)\exp\!\left(-\frac{\|\mathbf{y}_j - \mathbf{c}_B\|^2}{4\sigma^2}\right)\left(\frac{\mathbf{y}_j - \mathbf{c}_B}{2\sigma}\right)^{\alpha}, \qquad
C_\alpha(B) = \frac{2^{|\alpha|}}{\alpha!}\sum_{\mathbf{y}_i\in B}\exp\!\left(-\frac{\|\mathbf{y}_i - \mathbf{c}_B\|^2}{4\sigma^2}\right)\left(\frac{\mathbf{y}_i - \mathbf{c}_B}{2\sigma}\right)^{\alpha}$$

For a D-dimensional data set the calculation is O(N B r_{p,D}) instead of O(N B p^D), with

$$r_{p,D} = \binom{p+D}{D} = \frac{(p+D)!}{D!\,p!}$$

but normally p must be larger than before for the same precision.
Incomplete Cholesky Decomposition

It turns out that the Gram matrix created by the pairwise evaluation of the Gaussian kernel is full rank, but its eigenspectrum decays very fast, and we can take advantage of this fact.

An N x N symmetric positive definite matrix K can be factored as K = GG^T, where G is a lower triangular matrix with positive diagonal entries. When the eigenspectrum falls off quickly we can approximate K by an N x D (D << N) lower triangular factor G̃ such that $\|K - \tilde{G}\tilde{G}^T\| < \varepsilon$. There are efficient ways of computing G̃, and the computational complexity of the procedure becomes O(ND).

The IP can then be computed as

$$\hat{V}(X) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - x_j) = \frac{1}{N^2}\,\mathbf{1}_N^T K_{XX}\mathbf{1}_N = \frac{1}{N^2}\big\|\tilde{G}_{XX}^T\mathbf{1}_N\big\|_2^2$$

and for the quadratic mutual information measures,

$$\hat{I}_{ED} = \frac{1}{N^2}\,\mathbf{1}_{D_x}^T\big(\tilde{G}_{XX}^T\tilde{G}_{YY}\circ\tilde{G}_{XX}^T\tilde{G}_{YY}\big)\mathbf{1}_{D_y} + \frac{1}{N^4}\big\|\tilde{G}_{XX}^T\mathbf{1}_N\big\|^2\big\|\tilde{G}_{YY}^T\mathbf{1}_N\big\|^2 - \frac{2}{N^3}\big(\mathbf{1}_N^T\tilde{G}_{XX}\big)\big(\tilde{G}_{XX}^T\tilde{G}_{YY}\big)\big(\tilde{G}_{YY}^T\mathbf{1}_N\big)$$

$$\hat{I}_{CS} = \log\frac{\Big(\mathbf{1}_{D_x}^T\big(\tilde{G}_{XX}^T\tilde{G}_{YY}\circ\tilde{G}_{XX}^T\tilde{G}_{YY}\big)\mathbf{1}_{D_y}\Big)\big\|\tilde{G}_{XX}^T\mathbf{1}_N\big\|^2\big\|\tilde{G}_{YY}^T\mathbf{1}_N\big\|^2}{\Big(\big(\mathbf{1}_N^T\tilde{G}_{XX}\big)\big(\tilde{G}_{XX}^T\tilde{G}_{YY}\big)\big(\tilde{G}_{YY}^T\mathbf{1}_N\big)\Big)^2}$$

where ∘ denotes the elementwise (Hadamard) product and G̃_XX, G̃_YY are the incomplete Cholesky factors (of ranks D_x and D_y) of the Gram matrices of X and Y.
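A minimal sketch (mine) of the incomplete Cholesky route for the IP, using a straightforward greedy pivoted factorization rather than an optimized implementation: the truncated factor G̃ satisfies K ≈ G̃G̃ᵀ, and the IP follows from one matrix-vector product.

```python
import numpy as np

def gauss_kernel_column(x, idx, sigma):
    """Column idx of the Gram matrix K[i, j] = G_{sigma*sqrt(2)}(x_i - x_j)."""
    s2 = 2.0 * sigma**2
    d = x - x[idx]
    return np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

def incomplete_cholesky(x, sigma, tol=1e-6, max_rank=None):
    """Greedy pivoted Cholesky: returns G_tilde (N x D) with K approx G_tilde @ G_tilde.T."""
    N = len(x)
    max_rank = max_rank or N
    diag = gauss_kernel_column(x, 0, sigma)[0] * np.ones(N)   # all diagonal entries are equal
    G = np.zeros((N, 0))
    while np.sum(diag) > tol and G.shape[1] < max_rank:
        p = int(np.argmax(diag))                              # pivot = largest residual diagonal
        k_p = gauss_kernel_column(x, p, sigma)
        if G.shape[1] > 0:
            k_p = k_p - G @ G[p, :]                           # subtract already explained part
        new_col = k_p / np.sqrt(k_p[p])
        G = np.hstack([G, new_col[:, None]])
        diag = np.maximum(diag - new_col**2, 0.0)
    return G

def ip_via_cholesky(x, sigma, tol=1e-6):
    """V_hat(X) = (1/N^2) * || G_tilde.T @ 1 ||^2."""
    G = incomplete_cholesky(x, sigma, tol)
    s = G.T @ np.ones(len(x))
    return np.dot(s, s) / len(x)**2, G.shape[1]

x = np.random.default_rng(8).normal(size=500)   # example samples (assumption)
V, D = ip_via_cholesky(x, sigma=0.5)
print(V, D)                                     # close to the direct O(N^2) IP, with D << N
```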

However, the Gram matrix K_XY needed for the CIP is not itself a symmetric positive definite matrix, so we extend K_XX to the positive definite matrix

$$K_{ZZ} = \begin{bmatrix} K_{XX} & K_{XY} \\ K_{XY}^T & K_{YY} \end{bmatrix}$$

where K_XY denotes the Gram matrix for the CIP. The calculation of the CIP becomes

$$\hat{V}(X; Y) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - y_j) = \frac{1}{N^2}\,e_1^T K_{ZZ}\,e_2 = \frac{1}{N^2}\big(e_1^T\tilde{G}_{ZZ}\big)\big(\tilde{G}_{ZZ}^T e_2\big), \qquad
e_1 = \{\underbrace{1, \ldots, 1}_{N}, \underbrace{0, \ldots, 0}_{N}\}^T, \quad e_2 = \{\underbrace{0, \ldots, 0}_{N}, \underbrace{1, \ldots, 1}_{N}\}^T$$

with complexity O(ND^2). The divergence measures can then be computed efficiently as

$$\hat{D}_{ED} = \frac{1}{N^2}\big(e_1^T\tilde{G}_{ZZ}\big)\big(\tilde{G}_{ZZ}^T e_1\big) + \frac{1}{N^2}\big(e_2^T\tilde{G}_{ZZ}\big)\big(\tilde{G}_{ZZ}^T e_2\big) - \frac{2}{N^2}\big(e_1^T\tilde{G}_{ZZ}\big)\big(\tilde{G}_{ZZ}^T e_2\big)$$

$$\hat{D}_{CS} = \log\frac{\big(e_1^T\tilde{G}_{ZZ}\big)\big(\tilde{G}_{ZZ}^T e_1\big)\big(e_2^T\tilde{G}_{ZZ}\big)\big(\tilde{G}_{ZZ}^T e_2\big)}{\Big(\big(e_1^T\tilde{G}_{ZZ}\big)\big(\tilde{G}_{ZZ}^T e_2\big)\Big)^2}$$
