q = norm.pdf(x, 5, 4)
plt.title('KL(P||Q) = %1.3f' % kl_divergence(p, q))
plt.plot(x, p)
plt.plot(x, q, c='red')
The lower the KL divergence, the closer the two distributions are to one
another. Therefore, as in the case of t-SNE and Gaussian Mixture
Models, we can estimate the Gaussian parameters of one distribution
by minimizing its KL divergence with respect to the other.
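The kl_divergence helper called in the snippet above is not shown in this section; a minimal sketch, assuming p and q are arrays of probability values evaluated on the same grid, might look like this:

import numpy as np

def kl_divergence(p, q):
    # Discrete KL divergence: sum of p * log(p / q), skipping points
    # where p is zero (their contribution to the sum is zero)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))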
Minimizing KL Divergence

Let's see how we could go about minimizing the KL divergence
between two probability distributions using gradient descent. To
begin, we create a probability distribution with a known mean (0) and
standard deviation (2). Then, we create another distribution with random
parameters.
Given that we are using gradient descent, we need to select
values for the hyperparameters (i.e. step size, number of
iterations).
x = np.arange(-10, 10, 0.001)
p_pdf = norm.pdf(x, 0, 2).reshape(1, -1)
np.random.seed(0)
random_mean = np.random.randint(10, size=1)
# Draw sigma from 1-9 so the scale of the random distribution can't be zero
random_sigma = np.random.randint(1, 10, size=1)
random_pdf = norm.pdf(x, random_mean, random_sigma).reshape(1, -1)
learning_rate = 0.001
epochs = 100
Just as with NumPy arrays, in TensorFlow we need to define our variables up
front. For the variable q, we use the equation for a normal distribution with
parameters mu and sigma, only we drop the constant factor in front of the
exponent since we normalize the result ourselves.
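To make this explicit, the two lines defining normal and q below implement (with sigma playing the role of the variance):

$$q_i = \frac{\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma}\right)}{\sum_j \exp\!\left(-\frac{(x_j - \mu)^2}{2\sigma}\right)}$$

so any constant factor in front of the exponent would cancel between the numerator and the denominator.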
p = tf.placeholder(tf.float64, shape=p_pdf.shape)
mu = tf.Variable(np.zeros(1))
sigma = tf.Variable(np.eye(1))
# Unnormalized Gaussian; dividing by its sum normalizes it, so the
# constant in front of the exponent can be dropped
normal = tf.exp(-tf.square(x - mu) / (2 * sigma))
q = normal / tf.reduce_sum(normal)
# Treat 0 * log(0 / q) as 0 so points where p is zero don't produce NaNs
kl_divergence = tf.reduce_sum(
    tf.where(tf.equal(p, 0), tf.zeros(p_pdf.shape, tf.float64), p * tf.log(p / q))
)
Next, we initialize an instance of the
GradientDescentOptimizer class and call its minimize method
with the KL divergence tensor as an argument.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(kl_divergence)
init = tf.global_variables_initializer()
All operations in TensorFlow must be run within a session. In the
following code block, we minimize the KL divergence using gradient
descent.
with tf.Session() as sess:
    sess.run(init)
    history = []
    means = []
    variances = []
    for i in range(epochs):
        sess.run(optimizer, { p: p_pdf })
        # Record the KL divergence and current parameters every 10 epochs
        if i % 10 == 0:
            history.append(sess.run(kl_divergence, { p: p_pdf }))
            means.append(sess.run(mu)[0])
            variances.append(sess.run(sigma)[0][0])
Then, we plot the probability distribution and the KL divergence at
different points in time.
for mean, variance in zip(means, variances):
    # Plot the approximating distribution Q at each recorded step
    q_pdf = norm.pdf(x, mean, np.sqrt(variance))
    plt.plot(x, q_pdf.reshape(-1, 1), c='red')
plt.title('KL(P||Q) = %1.3f' % history[-1])
plt.plot(x, p_pdf.reshape(-1, 1), linewidth=3)
plt.show()
# Plot how the KL divergence decreased over training
plt.plot(history)
plt.show()
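As an aside, the snippets above use the TensorFlow 1.x API (tf.placeholder, tf.Session, tf.train.GradientDescentOptimizer). A rough, hypothetical sketch of the same minimization on TensorFlow 2.x with tf.GradientTape might look like the following; here P is additionally normalized so that both distributions sum to one, and the learning rate and number of steps may need tuning:

import numpy as np
import tensorflow as tf
from scipy.stats import norm

# Target distribution P, normalized so it sums to one (like Q below)
x = np.arange(-10, 10, 0.001)
p_pdf = norm.pdf(x, 0, 2)
p = tf.constant(p_pdf / p_pdf.sum())
x_t = tf.constant(x)

mu = tf.Variable(1.0, dtype=tf.float64)
sigma = tf.Variable(1.0, dtype=tf.float64)  # plays the role of the variance
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

for i in range(100):
    with tf.GradientTape() as tape:
        # Unnormalized Gaussian, then normalize so Q sums to one
        normal = tf.exp(-tf.square(x_t - mu) / (2 * sigma))
        q = normal / tf.reduce_sum(normal)
        # Treat 0 * log(0 / q) as 0 to avoid NaNs where p is zero
        kl = tf.reduce_sum(
            tf.where(tf.equal(p, 0), tf.zeros_like(q), p * tf.math.log(p / q))
        )
    grads = tape.gradient(kl, [mu, sigma])
    optimizer.apply_gradients(zip(grads, [mu, sigma]))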