The number of applications for restricted Boltzmann machines (RBMs) has grown rapidly in the past few years. They have now been applied to multiclass classification (Larochelle & Bengio, 2008), collaborative filtering (Salakhutdinov et al., 2007), and motion capture modeling (Taylor & Hinton, 2009), among others.

∗ This work was done while Hugo was at the University of Toronto.

[Figure 1: (A) an RBM with visible units v, hidden units h, weights W, and biases b_v, b_h; (B) a CRBM with an additional conditioning vector u and weight matrices W^{vh}, W^{uv}, and W^{uh}.]

In this work, we argue that CD learning may not be a very good algorithm for training CRBMs and propose two new algorithms for tackling structured output prediction problems in two different settings. In the first setting, we can assume that the variability in the output space is limited, meaning that the available training data covers the set of likely output configurations relatively well. One example of such a problem is multi-label classification. We propose to use semantic hashing to define and efficiently compute a small set of possible outputs to predict given some input. This allows us to perform exact inference over this set under the CRBM's energy function.
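To make "exact inference over this set" concrete, the following minimal NumPy sketch scores every candidate output in a given set under a supplied free-energy function and returns the lowest-energy candidate. The function names are illustrative and not from the paper; the construction of the candidate set via semantic hashing is assumed to have happened elsewhere, and a toy scoring function stands in for the CRBM free energy.

```python
import numpy as np

def predict_from_candidates(u, candidates, free_energy):
    """Exact inference restricted to a small candidate set.

    u           -- conditioning vector (the input)
    candidates  -- array of shape (num_candidates, num_visible); the small set
                   of plausible outputs (in the paper this set would come from
                   semantic hashing; here it is simply given)
    free_energy -- callable free_energy(v, u) returning a scalar, lower
                   meaning more probable under the model
    """
    scores = np.array([free_energy(v, u) for v in candidates])
    return candidates[np.argmin(scores)]  # candidate with the lowest free energy

# Toy usage with a made-up scoring function (not the CRBM's actual free energy).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.normal(size=5)
    candidates = rng.integers(0, 2, size=(8, 3)).astype(float)
    toy_score = lambda v, u: float(np.sum((v - u[:3]) ** 2))
    print(predict_from_candidates(u, candidates, toy_score))
```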
For an RBM, the marginal distribution over the visible units can be written as p(v) = exp(-F(v)) / \sum_{v'} exp(-F(v')), where F(v) is called the free energy and can be computed in time linear in the number of elements in v and h:

F(v) = -\log \sum_h \exp(-E(v, h))    (4)
     = -v^T b_v - \sum_j \log\left(1 + \exp\left(b_{h_j} + v^T W_{\cdot j}\right)\right)    (5)

2.2 Contrastive Divergence

The first practical method for training RBMs was introduced by Hinton (2002), who showed that the negative gradient can be approximated using samples obtained by starting a Gibbs chain at a training vector and running it for a few steps. This method approximately minimizes an objective function known as the Contrastive Divergence. Even though it has been shown that the resulting gradient estimate is not the gradient of any function (Sutskever & Tieleman, 2010), CD learning has been used extensively for training RBMs and other energy-based models.
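To make Eqs. (4)-(5) and the CD procedure concrete, below is a minimal NumPy sketch of the free energy of a binary RBM and of a single CD-1 parameter update. The learning rate, toy dimensions, and variable names are illustrative choices, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W, b_v, b_h):
    """Eq. (5): F(v) = -v^T b_v - sum_j log(1 + exp(b_hj + v^T W_:j))."""
    return -v @ b_v - np.sum(np.logaddexp(0.0, b_h + v @ W))

def cd1_update(v_data, W, b_v, b_h, lr=0.01):
    """One CD-1 step for a binary RBM: start a Gibbs chain at the data,
    run it for one full step, and use the result for the negative phase."""
    # Positive phase: hidden activations given the data.
    p_h_data = sigmoid(b_h + v_data @ W)
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # Negative phase: one step of block Gibbs sampling.
    p_v_model = sigmoid(b_v + h_sample @ W.T)
    v_model = (rng.random(p_v_model.shape) < p_v_model).astype(float)
    p_h_model = sigmoid(b_h + v_model @ W)
    # Approximate gradient step on the negative log-likelihood.
    W += lr * (np.outer(v_data, p_h_data) - np.outer(v_model, p_h_model))
    b_v += lr * (v_data - v_model)
    b_h += lr * (p_h_data - p_h_model)
    return W, b_v, b_h

# Toy usage on a single 6-dimensional binary training vector.
W = 0.01 * rng.normal(size=(6, 4))
b_v, b_h = np.zeros(6), np.zeros(4)
v = rng.integers(0, 2, size=6).astype(float)
print(free_energy(v, W, b_v, b_h))
W, b_v, b_h = cd1_update(v, W, b_v, b_h)
```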
The CRBM model defines the following probability distribution:

p(v|u) = \frac{\exp(-F(v, u))}{\sum_{v'} \exp(-F(v', u))}    (11)

where F(v, u) is the conditional free energy defined below in Eq. (9). Learning in conditional RBMs generally involves doing gradient descent on the negative log conditional likelihood. The gradient of the negative log conditional likelihood for a conditional RBM is given by

\frac{\partial (-l(\theta))}{\partial \theta} = \frac{\partial F(v, u)}{\partial \theta} - \sum_{v'} p(v'|u) \frac{\partial F(v', u)}{\partial \theta}.    (12)

In some cases this gradient can be computed exactly (Larochelle & Bengio, 2008), but it is intractable in general. However, since p(v|h, u) and p(h|v, u) are both factorial over v and h, CD can be used to train a conditional RBM – the positive gradient is still tractable and it is still possible to do block Gibbs sampling to approximate the negative gradient.

2.3 Persistent Contrastive Divergence

One problem with CD learning is that it provides biased estimates of the gradient. The Persistent Contrastive Divergence (PCD) algorithm (Tieleman, 2008) addressed this problem. In the positive phase, PCD does not differ from CD training. In the negative phase, however, instead of running a new chain for each parameter update, PCD maintains a single persistent chain. The update at time t takes the state of the Gibbs chain at time t − 1, performs one round of Gibbs sampling, and uses this state in the negative gradient estimates. When the learning rate is small, the model does not change much between updates and the state of the persistent chain is able to stay close to the model distribution, leading to more accurate gradient estimates.
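For comparison, here is the corresponding PCD update under the same binary-RBM assumptions as the CD sketch above. The only difference is that the negative phase advances a persistent chain state v_persist rather than restarting the chain at the data; names and the learning rate are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, v_persist, W, b_v, b_h, lr=0.01):
    """One PCD step: positive phase as in CD, negative phase taken from a
    persistent chain that is advanced by one round of Gibbs sampling."""
    p_h_data = sigmoid(b_h + v_data @ W)          # positive phase
    p_h_pers = sigmoid(b_h + v_persist @ W)       # advance the persistent chain
    h_pers = (rng.random(p_h_pers.shape) < p_h_pers).astype(float)
    p_v_pers = sigmoid(b_v + h_pers @ W.T)
    v_persist = (rng.random(p_v_pers.shape) < p_v_pers).astype(float)
    p_h_model = sigmoid(b_h + v_persist @ W)
    W += lr * (np.outer(v_data, p_h_data) - np.outer(v_persist, p_h_model))
    b_v += lr * (v_data - v_persist)
    b_h += lr * (p_h_data - p_h_model)
    # The chain state is returned and carried over to the next update.
    return v_persist, W, b_v, b_h
```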
A number of other algorithms have been proposed for obtaining better gradient estimates for training RBMs (Desjardins et al., 2010; Salakhutdinov, 2010; Tieleman & Hinton, 2009). However, these algorithms make use of persistent Markov chains for obtaining improved gradient estimates and, as we will later show, this makes these algorithms, as well as PCD, not applicable to training of conditional RBMs.

3 Training Conditional RBMs

A CRBM (see Figure 1) models the distribution p(v|u) by using an RBM to model v and using u to dynamically determine the biases or weights of that RBM. In this paper we only allow u to determine increments to the visible and hidden biases of the RBM:
E(v, h, u) = -v^T W^{vh} h - v^T b_v - u^T W^{uv} v - u^T W^{uh} h - h^T b_h    (8)

with the associated free energy

F(v, u) = -\log \sum_h \exp(-E(v, h, u))    (9)
        = -\sum_j \log\left(1 + \exp\left(b_{h_j} + v^T W^{vh}_{\cdot j} + u^T W^{uh}_{\cdot j}\right)\right) - v^T b_v - u^T W^{uv} v.    (10)
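The sketch below is one way to read Eqs. (8)-(10) together with the CD approximation to Eq. (12): the conditioning vector u only shifts the visible and hidden biases, so ordinary block Gibbs sampling can be reused with those shifted ("dynamic") biases. Shapes, names, and the use of CD-1 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def crbm_free_energy(v, u, W_vh, W_uv, W_uh, b_v, b_h):
    """Eq. (10): u adds u^T W_uv to the visible term and u^T W_uh to the
    hidden input."""
    hidden_in = b_h + v @ W_vh + u @ W_uh
    return -(v @ b_v) - u @ W_uv @ v - np.sum(np.logaddexp(0.0, hidden_in))

def crbm_cd1_update(v_data, u, W_vh, W_uv, W_uh, b_v, b_h, lr=0.01):
    """One CD-1 step approximating the negative phase of Eq. (12).
    The effective biases are computed once per training case from u."""
    eff_b_v = b_v + W_uv.T @ u        # dynamic visible bias
    eff_b_h = b_h + u @ W_uh          # dynamic hidden bias
    # Positive phase.
    p_h_data = sigmoid(eff_b_h + v_data @ W_vh)
    h = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # Negative phase: one step of block Gibbs sampling under p(.|u).
    p_v = sigmoid(eff_b_v + h @ W_vh.T)
    v_model = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_model = sigmoid(eff_b_h + v_model @ W_vh)
    # Gradient steps on all parameter blocks.
    W_vh += lr * (np.outer(v_data, p_h_data) - np.outer(v_model, p_h_model))
    W_uv += lr * np.outer(u, v_data - v_model)
    W_uh += lr * np.outer(u, p_h_data - p_h_model)
    b_v += lr * (v_data - v_model)
    b_h += lr * (p_h_data - p_h_model)
    return W_vh, W_uv, W_uh, b_v, b_h
```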
Almost all algorithms that have been proposed for training RBMs are not applicable to training conditional RBMs because they make use of persistent chains. With a conditional RBM, each conditioning vector u generally leads to a unique distribution p(v|u), hence one needs to sample from a different model distribution for each training case, making it impossible to use a single persistent chain. One could run a separate persistent chain for each training case, but this is only feasible on very small datasets. To learn efficiently on large datasets we need to update the weights on mini-batches of training data that are much smaller than the whole training set, so by the time we revisit a training case the weights will have changed substantially and the persistent chain for that case will be far from its stationary distribution.

Unlike algorithms based on persistent chains, a number of the algorithms recently proposed for training non-normalized statistical models could be applied to conditional RBM training (Hyvärinen, 2007; Vickrey et al., 2010). While we only compare our approaches to algorithms that have been applied to conditional RBMs, the Contrastive Constraint Generation (CCG) algorithm (Vickrey et al., 2010) is the most closely related approach to our work. CCG is a batch algorithm that maintains a contrastive set of points for each training case, where the contrastive points are generated by running some inference procedure. Both of our proposed approaches have similarities to the CCG algorithm. The HashCRBM algorithm uses spectral hashing to define the contrastive neighborhoods and only includes values of v that are in the training set, which ensures that the total number of unique contrastive points is reasonable. The CD-PercLoss algorithm generates contrastive examples using inference, but it is an online algorithm, and hence can scale up to large datasets unlike CCG.

Note that the algorithms we propose in this paper are not restricted to this model and can be used to train any energy-based model for which the free energy (or just the energy, for models without latent variables) can be computed. For the CD-PercLoss algorithm there is an additional requirement that Gibbs sampling can be performed.

4 Making Predictions with Conditional RBMs

While RBMs have primarily been used for learning new representations of data, conditional RBMs are generally used for making predictions. This typically requires inferring either the modes of the conditional marginals p(v_i|u) or the mode of the full conditional p(v|u). Such inferences are intractable in general, hence approximate inference algorithms must be used.

The most popular approach in the RBM literature for conditional models is mean-field inference (Salakhutdinov et al., 2007; Mandel et al., 2011), which is a fast message passing algorithm and can work well in practice. In the context of structured prediction, it assumes a full conditional that factorizes into its marginals, q(v|u) = \prod_i q(v_i|u), and finds the marginals that provide the best approximation of the true conditional distribution.
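As an illustration of mean-field inference in a binary CRBM, the sketch below alternates the fully factorized updates for the hidden and visible marginals and then thresholds the visible marginals to form a prediction. The damping constant, iteration count, and parameterization follow the illustrative CRBM snippets above rather than any particular published implementation.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def crbm_mean_field(u, W_vh, W_uv, W_uh, b_v, b_h, num_iters=50, damping=0.5):
    """Fully factorized mean-field approximation q(v|u) = prod_i q(v_i|u).

    Returns the approximate marginals mu_i = q(v_i = 1 | u)."""
    eff_b_v = b_v + W_uv.T @ u            # dynamic visible bias
    eff_b_h = b_h + u @ W_uh              # dynamic hidden bias
    mu = sigmoid(eff_b_v)                 # initialize visible marginals
    for _ in range(num_iters):
        tau = sigmoid(eff_b_h + mu @ W_vh)            # update hidden marginals
        mu_new = sigmoid(eff_b_v + W_vh @ tau)        # update visible marginals
        mu = damping * mu + (1.0 - damping) * mu_new  # damped update for stability
    return mu

def predict(u, params, threshold=0.5):
    """Predict v by taking the mode of each approximate marginal."""
    mu = crbm_mean_field(u, *params)
    return (mu > threshold).astype(float)
```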
Loopy belief propagation (Murphy et al., 1999) is another possible message passing algorithm that could be used, and it tends to provide better approximations of the conditional marginals of CRBMs (Mandel et al., 2011). Unfortunately, it is also slower than mean-field, and we have found it to be practical only on problems with relatively small output dimensionality and hidden layer size.

5 Suitability of CD Training

If the samples produced by the negative phase Gibbs chain have the same distribution as the data v, the loss function L_CD will be zero on average, meaning that we have successfully learned the data distribution p(v|u). However, since the negative phase Gibbs chain starts at the data, this objective function will also be small if the chain is mixing slowly. Indeed, if the chain starts at v and is mixing slowly then, for small k, v_k will be very close to v, in turn making L_CD small. This is particularly troubling because, when training a conditional RBM, the aim is usually to be able to make good predictions for v when given a conditioning vector u, and making L_CD small gives no guarantees about the quality of the predictions.
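The passage that defines L_CD is not part of this excerpt; a common reading of the per-case CD objective that is consistent with the argument above is the difference of free energies between the data and the k-step Gibbs reconstruction, which makes the slow-mixing problem explicit:

```latex
% Assumed per-case form of the CD-k objective (the defining equation is not
% shown in this excerpt): free energy of the data minus that of the k-step
% reconstruction v_k obtained by starting the Gibbs chain at v.
\begin{align*}
  L_{\mathrm{CD}}(v, u) \approx F(v, u) - F(v_k, u).
\end{align*}
% If the chain started at v mixes slowly, then v_k \approx v for small k, so
% L_CD \approx 0 regardless of how well the model predicts v from u.
```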
It is interesting to note that CD training of non-conditional RBMs also suffers from the same problem, i.e. the gradient estimate will be small when the negative phase Gibbs chain is mixing slowly. Nevertheless, this is not as much of an issue when training non-conditional RBMs because they are generally used for learning representations of the data. If a Gibbs chain started at the data v is mixing slowly and not moving far from v, then the hidden representation h must contain enough information to reconstruct v well, possibly making h a reasonable representation of v. In a conditional RBM, being able to reconstruct the data v is not sufficient for making good predictions, because at prediction time one only has access to u and not v.

Therefore, it seems that while CD is a reasonable training procedure for non-conditional RBMs, it may not be a reasonable training procedure for conditional RBMs, because it does not directly encourage the model to make good predictions. We provide experimental support for this argument by showing that CD training of conditional RBMs can fail on seemingly simple structured output prediction problems.
References

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A tutorial on energy-based learning. Predicting Structured Data. MIT Press.

Mandel, M., Pascanu, R., Larochelle, H., & Bengio, Y. (2011). Autotagging music with conditional restricted Boltzmann machines. ArXiv e-prints.

Mandel, M. I., Eck, D., & Bengio, Y. (2010). Learning tags that vary within a song. Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR) (pp. 399–404).

Mandel, M. I., & Ellis, D. P. W. (2008). A web-based game for collecting music metadata. Journal of New Music Research, 37, 151–165.

Memisevic, R., & Hinton, G. E. (2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22, 1473–1492.

Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI) (pp. 467–475).

Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. Advances in Neural Information Processing Systems 20 (NIPS) (pp. 1121–1128). MIT Press.

Salakhutdinov, R. (2010). Learning deep Boltzmann machines using adaptive MCMC. Proceedings of the 27th International Conference on Machine Learning (ICML).

Salakhutdinov, R., & Hinton, G. E. (2007). Semantic hashing. SIGIR Workshop on Information Retrieval and Applications of Graphical Models.

Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model. Advances in Neural Information Processing Systems 22 (NIPS) (pp. 1607–1614). MIT Press.

Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. Proceedings of the 24th International Conference on Machine Learning (ICML) (pp. 791–798). ACM.

Sutskever, I., & Tieleman, T. (2010). On the convergence properties of Contrastive Divergence. Proceedings of the Conference on AI and Statistics (AISTATS).

Taylor, G. W., & Hinton, G. E. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. Proceedings of the 26th International Conference on Machine Learning (ICML) (p. 129). ACM.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. Proceedings of the 25th International Conference on Machine Learning (ICML) (pp. 1064–1071). ACM.

Tieleman, T., & Hinton, G. (2009). Using fast weights to improve Persistent Contrastive Divergence. Proceedings of the 26th International Conference on Machine Learning (ICML) (pp. 1033–1040). ACM.

Vickrey, D., Lin, C. C.-Y., & Koller, D. (2010). Non-local contrastive objectives. Proceedings of the International Conference on Machine Learning (ICML) (pp. 1103–1110).

Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (NIPS) (pp. 1753–1760). MIT Press.