
KL Divergence

• We can think of the KL divergence as a distance metric (although it isn’t symmetric) that quantifies the difference between two probability distributions.
• One common scenario where this is useful is when we are working with a
complex distribution.
• Rather than working with the distribution directly, we can make our life easier by using another distribution with well-known properties (e.g. the normal distribution) that does a decent job of describing the data.
• In other words, we can use the KL divergence to tell whether a Poisson distribution or a normal distribution is better at approximating the data (see the sketch after this list).
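For reference, the KL divergence between two discrete distributions P and Q is KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)). As a rough sketch of the last point (not part of the original walkthrough, and using purely synthetic data and illustrative variable names), we could fit both a Poisson and a normal distribution to some count data and compare the two divergences with scipy.stats.entropy, which computes exactly this sum:
import numpy as np
from scipy.stats import norm, poisson, entropy

# Synthetic count data; in practice this would be the observed sample.
np.random.seed(0)
data = np.random.poisson(lam=4.0, size=10000)

# Empirical distribution of the data over its support.
p_emp = np.bincount(data) / len(data)
values = np.arange(len(p_emp))

# Candidate approximations evaluated on the same support.
q_poisson = poisson.pmf(values, mu=data.mean())
q_normal = norm.pdf(values, loc=data.mean(), scale=data.std())

# entropy(p, q) normalizes both inputs and returns KL(p || q);
# the candidate with the smaller divergence approximates the data better.
print('KL(data || Poisson):', entropy(p_emp, q_poisson))
print('KL(data || Normal): ', entropy(p_emp, q_normal))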
Python code
import numpy as np
from scipy.stats import norm
from matplotlib import pyplot as plt
import tensorflow as tf  # this walkthrough uses the TensorFlow 1.x API (tf.placeholder, tf.Session)
import seaborn as sns
sns.set()
Next, we define a function to calculate the KL divergence of two
probability distributions. We need to make sure that we don’t include
any probabilities equal to 0 because the log of 0 is negative infinity.
def kl_divergence(p, q):
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))
# The KL divergence between a normal distribution with a mean of 0 and a standard
# deviation of 2 and another distribution with a mean of 2 and a standard deviation
# of 2 is equal to 500.
x = np.arange(-10, 10, 0.001)
p = norm.pdf(x, 0, 2)
q = norm.pdf(x, 2, 2)
plt.title('KL(P||Q) = %1.3f' % kl_divergence(p, q))
plt.plot(x, p)
plt.plot(x, q, c='red')
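As a sanity check (this isn't part of the original code), the KL divergence between two normal distributions has a closed form, log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 * s2^2) - 1/2, which works out to 0.5 for these parameters. The function above sums raw pdf values sampled on a grid with spacing 0.001 instead of weighting them by that spacing, which is why it reports roughly 500 rather than 0.5:
# Closed-form KL divergence between N(m1, s1^2) and N(m2, s2^2).
def kl_normal(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_normal(0, 2, 2, 2))          # 0.5
print(kl_divergence(p, q) * 0.001)    # approximately 0.5 once weighted by the grid spacing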
If we measure the KL divergence between the initial probability
distribution and another distribution with a mean of 5 and a standard
deviation of 4, we expect the KL divergence to be higher than in the
previous example.
It’s important to note that the KL divergence is not symmetrical. In other words, if
we switch P for Q and vice versa, we get a different result.
q = norm.pdf(x, 5, 4)
plt.title('KL(P||Q) = %1.3f' % kl_divergence(p, q))
plt.plot(x, p)
plt.plot(x, q, c='red')
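To see the asymmetry in numbers (a small addition to the walkthrough), we can evaluate the divergence in both directions and confirm that the two results differ:
print('KL(P||Q) = %1.3f' % kl_divergence(p, q))
print('KL(Q||P) = %1.3f' % kl_divergence(q, p))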
The lower the KL divergence, the closer the two distributions are to one
another. Therefore, as in the case of t-SNE and Gaussian Mixture
Models, we can estimate the Gaussian parameters of one distribution
by minimizing its KL divergence with respect to another.
• Minimizing KL Divergence
• Let’s see how we could go about minimizing the KL divergence between two probability distributions using gradient descent. To begin, we create a probability distribution with a known mean (0) and standard deviation (2). Then, we create another distribution with random parameters.
Given that we are using gradient descent, we need to select values for the hyperparameters (e.g. the learning rate and the number of epochs).
x = np.arange(-10, 10, 0.001)
p_pdf = norm.pdf(x, 0, 2).reshape(1, -1)
np.random.seed(0)
random_mean = np.random.randint(10, size=1)
random_sigma = np.random.randint(1, 10, size=1)  # keep sigma strictly positive
random_pdf = norm.pdf(x, random_mean, random_sigma).reshape(1, -1)
learning_rate = 0.001
epochs = 100
Just like in NumPy, in TensorFlow we need to create our variables before we can use them. For the variable q, we use the equation for a normal distribution given mu and sigma, except that we drop the constant in front of the exponent since we normalize the result ourselves.
p = tf.placeholder(tf.float64, shape=p_pdf.shape)
mu = tf.Variable(np.zeros(1))
sigma = tf.Variable(np.eye(1))

# Note that sigma plays the role of the variance here (2 * sigma in the denominator).
normal = tf.exp(-tf.square(x - mu) / (2 * sigma))
q = normal / tf.reduce_sum(normal)

kl_divergence = tf.reduce_sum(
    tf.where(tf.equal(p, 0), tf.zeros(p_pdf.shape, tf.float64), p * tf.log(p / q))
)
Next, we initialize an instance of the GradientDescentOptimizer class and call the minimize method with the KL divergence tensor as an argument.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(kl_divergence)

Only after running the initializer (i.e. sess.run(init) below) will the variables hold the values we set when we declared them (i.e. np.zeros and np.eye).

init = tf.global_variables_initializer()
All operations in TensorFlow must be run within a session. In the following code block, we minimize the KL divergence using gradient descent.
with tf.Session() as sess:
    sess.run(init)

    history = []
    means = []
    variances = []

    for i in range(epochs):
        sess.run(optimizer, { p: p_pdf })

        if i % 10 == 0:
            history.append(sess.run(kl_divergence, { p: p_pdf }))
            means.append(sess.run(mu)[0])
            variances.append(sess.run(sigma)[0][0])
Finally, we plot the target distribution, the approximating distribution at different points in time, and the history of the KL divergence.
for mean, variance in zip(means, variances):
    q_pdf = norm.pdf(x, mean, np.sqrt(variance))
    plt.plot(x, q_pdf.reshape(-1, 1), c='red')

plt.title('KL(P||Q) = %1.3f' % history[-1])
plt.plot(x, p_pdf.reshape(-1, 1), linewidth=3)
plt.show()

plt.plot(history)
plt.show()
