
Annealing Gaussian into ReLU: A New Sampling Strategy for Leaky-ReLU RBM

Chun-Liang Li Siamak Ravanbakhsh Barnabás Póczos


Department of Machine Learning
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{chunlial,mravanba,bapoczos}@cs.cmu.edu

Abstract

Restricted Boltzmann Machine (RBM) is a bipartite graphical model that is used as the building block in energy-based deep generative models. Due to its numerical stability and the quantifiability of its likelihood, the RBM is commonly used with Bernoulli units. Here, we consider an alternative member of the exponential family RBM with leaky rectified linear units, called leaky RBM. We first study the joint and marginal distributions of the leaky RBM under different leakiness, which provides important insights connecting the leaky RBM model and truncated Gaussian distributions. This connection leads us to a simple yet efficient method for sampling from the model, where the basic idea is to anneal the leakiness rather than the energy; i.e., start from a fully Gaussian/linear unit and gradually decrease the leakiness over iterations. This serves as an alternative to annealing the temperature parameter and enables numerical estimation of the likelihood that is more efficient and more accurate than the commonly used annealed importance sampling (AIS). We further demonstrate that the proposed sampling algorithm enjoys a faster mixing property than the contrastive divergence algorithm, which benefits the training without any additional computational cost.

1 Introduction
In this paper, we are interested in deep generative models. There is a family of directed deep genera-
tive models which can be trained by back-propagation (e.g., Kingma & Welling, 2013; Goodfellow
et al., 2014). The other family is the deep energy-based models, including deep belief network (Hin-
ton et al., 2006) and deep Boltzmann machine (Salakhutdinov & Hinton, 2009). The building block
of deep energy-based models is a bipartite graphical model called restricted Boltzmann machine
(RBM). The RBM model consists of two layers, a visible layer and a hidden layer, where the hidden units model higher-order correlations of the visible units. The absence of interactions between variables within each layer also makes inference easier.
The conventional RBM uses Bernoulli units for both the hidden and visible units (Smolensky, 1986).
One extension is using Gaussian visible units to model general natural images (Freund & Haussler,
1994). For hidden units, we can also generalize Bernoulli units to the exponential family (Welling
et al., 2004; Ravanbakhsh et al., 2016).
Nair & Hinton (2010) propose one special case that uses the Rectified Linear Unit (ReLU) for the hidden layer with a heuristic sampling procedure, which has promising performance in terms of reconstruction error and classification accuracy. Unfortunately, due to its lack of strict monotonicity, the ReLU RBM does not fit within the framework of exponential family RBMs (Ravanbakhsh et al., 2016). Instead, we study the leaky-ReLU RBM (leaky RBM) in this work and address two important issues: i) a better training (sampling) algorithm for ReLU RBM; and ii) a better quantification of leaky RBM, i.e., its performance in terms of likelihood.
We study some of the fundamental properties of leaky RBM, including its joint and marginal dis-
tributions (Section 2). By analyzing these distributions, we show that the leaky RBM is a union
of truncated Gaussian distributions. In this paper we will show that training leaky RBM involves
underlying positive definite constraints. Because of this, the training can diverge if these constraints
are not satisfied. This is an issue that was previously ignored in ReLU RBM, as it was mainly used
for pre-training rather than generative modeling. Our contribution in this paper is three-fold: I) We
systematically identify and address model constraints in leaky RBM (Section 3); II) For the training
of leaky RBM, we propose a meta algorithm for sampling, which anneals leakiness during the Gibbs
sampling procedure (Section 3) and empirically show that it can boost contrastive divergence with
faster mixing (Section 5); III) We demonstrate the power of the proposed sampling algorithm on
estimating the partition function. In particular, comparison on several benchmark datasets shows
that the proposed method outperforms the conventional AIS (Salakhutdinov & Murray, 2008) (Sec-
tion 4). Moreover, we provide an incentive for using leaky RBM by showing that the leaky ReLU
hidden units perform better than the Bernoulli units in terms of the model log-likelihood (Section 4).

2 Restricted Boltzmann Machine and ReLU


The Boltzmann distribution is defined as
$$p(x) = \frac{e^{-E(x)}}{Z},$$
where $Z = \sum_x e^{-E(x)}$ is the partition function. The Restricted Boltzmann Machine (RBM) is a Boltzmann distribution with a bipartite structure. It is also the building block for many deep models (e.g., Hinton et al., 2006; Salakhutdinov & Hinton, 2009; Lee et al., 2009), which are widely used in numerous applications (Bengio, 2009). The conventional Bernoulli RBM models the joint probability $p(v, h)$ of the visible units $v \in \{0, 1\}^I$ and the hidden units $h \in \{0, 1\}^J$ as $p(v, h) \propto \exp(-E(v, h))$, where
$$E(v, h) = a^\top v - v^\top W h + b^\top h.$$
The parameters are $a \in \mathbb{R}^I$, $b \in \mathbb{R}^J$ and $W \in \mathbb{R}^{I \times J}$. We can derive the conditional probabilities as
$$p(v_i = 1 \mid h) = \sigma\left(\sum_{j=1}^{J} W_{ij} h_j + a_i\right) \quad \text{and} \quad p(h_j = 1 \mid v) = \sigma\left(\sum_{i=1}^{I} W_{ij} v_i + b_j\right), \tag{1}$$
where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function.


One extension of Bernoulli RBM is to replace the binary visible units by linear units $v \in \mathbb{R}^I$ with independent Gaussian noise. The energy function in this case is given by
$$E(v, h) = \sum_{i=1}^{I} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{v_i}{\sigma_i} W_{ij} h_j + b^\top h.$$
To simplify the notation, we eliminate $a_i$ and $\sigma_i$ in this paper, so the energy function simplifies to $E(v, h) = \frac{\|v\|^2}{2} - v^\top W h + b^\top h$. Note that this elimination does not affect the discussion, and one can easily extend all the results in this paper to the model that includes $a_i$ and $\sigma_i$. The conditional distributions are as follows:
$$p(v_i \mid h) = \mathcal{N}\left(\sum_{j=1}^{J} W_{ij} h_j,\ 1\right) \quad \text{and} \quad p(h_j = 1 \mid v) = \sigma\left(\sum_{i=1}^{I} W_{ij} v_i + b_j\right), \tag{2}$$
where $\mathcal{N}(\mu, V)$ is a Gaussian distribution with mean $\mu$ and variance $V$.

2.1 ReLU RBM with Continuous Visible Units

From (1) and (2), we can see that the mean of $p(h_j \mid v)$ is the evaluation of a sigmoid function at the response $\sum_{i=1}^{I} W_{ij} v_i + b_j$, which is the non-linearity of the hidden units. From this perspective, we can replace the sigmoid with other functions and thus allow the RBM to have more expressive power (Ravanbakhsh et al., 2016). Nair & Hinton (2010) propose to use the rectified linear unit (ReLU) in place of the conventional sigmoid hidden units. The activation function is defined as the one-sided function $\max(0, x)$.

However, as shown in Ravanbakhsh et al. (2016), only strictly monotonic activation functions yield feasible joint and conditional distributions¹. Therefore, we consider the leaky ReLU (Maas et al., 2013) in this paper. The activation function of leaky ReLU is defined as $\max(cx, x)$, where $c \in (0, 1)$ is the leakiness parameter.
To simplify the notation, we define $\eta_j = \sum_{i=1}^{I} W_{ij} v_i + b_j$. Following Ravanbakhsh et al. (2016), the conditional probability for an activation $f$ is defined as $p(h_j \mid v) = \exp\left(-D_f(\eta_j \| h_j) + g(h_j)\right)$, where $D_f(\eta_j \| h_j)$ is a Bregman divergence and $g(h_j)$ is the base measure. The Bregman divergence of $f$ is given by $D_f(\eta_j \| h_j) = -\eta_j h_j + F(\eta_j) + F^*(h_j)$, where $F$ with $\frac{d F(\eta_j)}{d \eta_j} = f(\eta_j)$ is the anti-derivative of $f$ and $F^*$ is the anti-derivative of $f^{-1}$. We then get the conditional distributions of leaky RBM as
$$p(h_j \mid v) = \begin{cases} \mathcal{N}(\eta_j, 1), & \text{if } \eta_j > 0 \\ \mathcal{N}(c\eta_j, c), & \text{if } \eta_j \le 0. \end{cases} \tag{3}$$
Note that the conditional distribution of the visible units is
$$p(v_i \mid h) = \mathcal{N}\left(\sum_{j=1}^{J} W_{ij} h_j,\ 1\right), \tag{4}$$
which can also be written as $p(v_i \mid h) = \exp\left(-D_{\tilde f}(\nu_i \| v_i) + g(v_i)\right)$, where $\nu_i = \sum_{j=1}^{J} W_{ij} h_j$ and $\tilde f(x) = x$. Given these two conditional distributions, we can train and do inference on a leaky RBM model by using contrastive divergence (Hinton, 2002) or other algorithms (Tieleman, 2008; Tieleman & Hinton, 2009).
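To make the conditionals in (3) and (4) concrete, the following sketch shows one block-Gibbs sweep for a leaky RBM. This is a minimal NumPy illustration rather than the authors' implementation; the function and variable names (`sample_h_given_v`, `gibbs_step`, etc.) are our own.

```python
import numpy as np

def sample_h_given_v(v, W, b, c):
    """Draw h ~ p(h|v) per (3): N(eta_j, 1) if eta_j > 0, else N(c*eta_j, c)."""
    eta = v @ W + b                              # eta_j = sum_i W_ij v_i + b_j
    mean = np.where(eta > 0, eta, c * eta)
    var = np.where(eta > 0, 1.0, c)
    return mean + np.sqrt(var) * np.random.randn(*eta.shape)

def sample_v_given_h(h, W):
    """Draw v ~ p(v|h) per (4): each v_i is N(sum_j W_ij h_j, 1)."""
    mean = h @ W.T
    return mean + np.random.randn(*mean.shape)

def gibbs_step(v, W, b, c):
    """One block-Gibbs sweep: sample h given v, then v given h."""
    h = sample_h_given_v(v, W, b, c)
    return sample_v_given_h(h, W), h
```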

3 Training and Sampling from Leaky RBM


First, we explore the joint and marginal distributions of the leaky RBM. Given the conditional distributions $p(v \mid h)$ and $p(h \mid v)$, the joint distribution $p(v, h)$ from the general treatment for MRF models given by Yang et al. (2012) is
$$p(v, h) \propto \exp\left(v^\top W h - \sum_{i=1}^{I}\big(\tilde F^*(v_i) + g(v_i)\big) - \sum_{j=1}^{J}\big(F^*(h_j) + g(h_j)\big)\right). \tag{5}$$
By (5), we can derive the joint distribution of the leaky-ReLU RBM as
$$p(v, h) \propto \exp\left(v^\top W h - \frac{\|v\|^2}{2} - \sum_{\eta_j > 0}\left(\frac{h_j^2}{2} + \log\sqrt{2\pi}\right) - \sum_{\eta_j \le 0}\left(\frac{h_j^2}{2c} + \log\sqrt{2c\pi}\right) + b^\top h\right),$$
and the marginal distribution as
$$p(v) \propto \exp\left(-\frac{\|v\|^2}{2}\right)\prod_{\eta_j > 0}\exp\left(\frac{\eta_j^2}{2}\right)\prod_{\eta_j \le 0}\exp\left(\frac{c\eta_j^2}{2}\right)$$
$$\propto \exp\left(-\frac{1}{2} v^\top\Big(I - \sum_{\eta_j > 0} W_j W_j^\top - c\sum_{\eta_j \le 0} W_j W_j^\top\Big)v + \sum_{\eta_j > 0} b_j W_j^\top v + c\sum_{\eta_j \le 0} b_j W_j^\top v\right), \tag{6}$$
where $W_j$ is the $j$-th column of $W$.

3.1 Leaky RBM as Union of Truncated Gaussian Distributions

From (6), the marginal probability is determined by the affine constraints $\eta_j > 0$ or $\eta_j \le 0$ for all $j$. By combinatorics, these constraints divide $\mathbb{R}^I$ into at most $M = \sum_{i=0}^{I}\binom{J}{i}$ convex regions $R_1, \cdots, R_M$. An example with $I = 2$ and $J = 3$ is shown in Figure 1. If $I > J$, then we have at most $2^J$ regions.
¹ Nair & Hinton (2010) use the heuristic noisy ReLU for sampling.

Figure 1: A two dimensional example with 3 hidden units.
Figure 2: A one dimensional example of truncated Gaussian distributions with different variances.
Figure 3: A three dimensional example with 3 hidden units, where the $W_j$ are orthogonal to each other.

We discuss the two types of these regions. For bounded regions, such as $R_1$ in Figure 1, the integral of (6) is also bounded, which results in a valid distribution. Before we discuss the unbounded cases, we define $\Omega = I - \sum_{j=1}^{J} \alpha_j W_j W_j^\top$, where $\alpha_j = \mathbb{1}_{\eta_j > 0} + c\,\mathbb{1}_{\eta_j \le 0}$. For an unbounded region, if $\Omega \in \mathbb{R}^{I \times I}$ is a positive definite (PD) matrix, then the probability density is proportional to a multivariate Gaussian distribution with mean $\mu = \Omega^{-1}\big(\sum_{j=1}^{J} \alpha_j b_j W_j\big)$ and precision matrix $\Omega$ (covariance matrix $\Omega^{-1}$), but over an affine-constrained region. Therefore, the distribution of each unbounded region can be treated as a truncated Gaussian distribution.
On the other hand, if $\Omega$ is not PD and the region $R_i$ contains eigenvectors of $\Omega$ with negative eigenvalues, the integral of (6) over $R_i$ diverges (is infinite), which cannot result in a valid probability distribution. In practice, with this type of parameter, Gibbs sampling on the conditional distributions diverges. However, it is infeasible to check exponentially many regions for each gradient update.
Theorem 1. If $I - WW^\top$ is positive definite, then $I - \sum_j \alpha_j W_j W_j^\top$ is also positive definite, for all $\alpha_j \in [0, 1]$.

The proof is shown in Appendix A. From Theorem 1 we can see that if $I - WW^\top$ is PD, then one can guarantee that the distribution of every region is a valid truncated Gaussian distribution. Therefore, we introduce the following projection step for each $W$ after the gradient update:
$$\begin{aligned} \operatorname*{argmin}_{\tilde W}\ & \|W - \tilde W\|_F^2 \\ \text{s.t.}\ & I - \tilde W \tilde W^\top \succeq 0 \end{aligned} \tag{7}$$
Theorem 2. The above projection step (7) can be done by shrinking the singular values to be less than 1.

The proof is shown in Appendix B. The training algorithm of the leaky RBM is shown in Algorithm 1. By using the projection step (7), we can treat the leaky RBM as a union of truncated Gaussian distributions, which uses weight vectors to divide the space of visible units into several regions and models each region with a truncated Gaussian distribution. Note that the leaky RBM model is different from Su et al. (2016), which uses a truncated Gaussian distribution to model the conditional distribution $p(h \mid v)$ instead of the marginal distribution. An empirical study of the divergent values and the necessity of the projection step is given in Appendix C.

Algorithm 1 Training Leaky RBM


for t = 1, . . . , T do
  Estimate the gradient $g_\theta$ by CD or other algorithms with (3) and (4), where $\theta = \{W, a, b\}$.
  $\theta^{(t)} \leftarrow \theta^{(t-1)} + \eta g_\theta$.
  Project $W^{(t)}$ by (7).
end for
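As a concrete illustration of the projection step (7) used in Algorithm 1, the sketch below shrinks the singular values of $W$ as justified by Theorem 2. This is our own illustrative code (assuming $W$ is stored as an $I \times J$ NumPy array), not the authors' implementation; the small `margin` is an assumption that keeps $I - WW^\top$ strictly positive definite.

```python
import numpy as np

def project_weights(W, margin=1e-6):
    """Project W onto {W : I - W W^T is PSD} by clipping singular values at 1 (Theorem 2)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S = np.minimum(S, 1.0 - margin)   # margin keeps I - W W^T strictly PD, so Theorem 1 applies
    return U @ np.diag(S) @ Vt

# Usage inside the loop of Algorithm 1:
#   W = W + lr * grad_W        # gradient ascent step from CD or another estimator
#   W = project_weights(W)     # keep every regional precision matrix positive definite
```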

3.2 Sampling from Leaky-ReLU RBM

Gibbs sampling is the core procedure for RBM, including training, inference, and estimating the partition function (Fischer & Igel, 2012; Tieleman, 2008; Salakhutdinov & Murray, 2008). For each task, we start by initializing $v$ from an arbitrary distribution $q$ and iteratively sample from the conditional distributions. Gibbs sampling guarantees that the procedure converges to the stationary distribution in the long run for any initial distribution $q$. However, if $q$ is close to the target distribution $p$, it can significantly reduce the number of iterations needed to reach the stationary distribution.
If we set the leakiness $c$ to 1, then (6) becomes a simple multivariate Gaussian distribution $\mathcal{N}\big((I - WW^\top)^{-1} Wb,\ (I - WW^\top)^{-1}\big)$, which can be sampled easily without Gibbs sampling. Also, the projection step (7) guarantees it is a valid Gaussian distribution. We then decrease the leakiness by a small $\epsilon$ and use the samples from the multivariate Gaussian distribution at $c = 1$ as the initialization for Gibbs sampling. Note that the distribution of each region is a truncated Gaussian distribution. When we decrease the leakiness by only a small amount, the resulting distribution is a "similar" truncated Gaussian distribution with a more concentrated density. From this observation, we can expect the original multivariate Gaussian distribution to serve as a good initialization. A one-dimensional example is shown in Figure 2. We then repeat this procedure until we reach the target leakiness. The algorithm can be seen as annealing the leakiness during the Gibbs sampling procedure. The meta algorithm is shown in Algorithm 2. Next, we show that the proposed sampling algorithm helps both the partition function estimation and the training of leaky RBM.

Algorithm 2 Meta Algorithm for Sampling from Leaky RBM


Sample $v$ from $\mathcal{N}\big((I - WW^\top)^{-1} Wb,\ (I - WW^\top)^{-1}\big)$
$c' = 1$
for t = 1, . . . , T do
  if $c' > c$ then
    $c' = c' - \epsilon$
  end if
  Do Gibbs sampling using (3) and (4) with leakiness $c'$
end for
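The following sketch spells out Algorithm 2 in NumPy. It is a hedged illustration, not the authors' code: `gibbs_step` is the conditional sampler sketched after (3) and (4), and the names `sample_leaky_rbm`, `eps`, and `n_samples` are our own.

```python
import numpy as np

def sample_leaky_rbm(W, b, c, eps, T, n_samples=100):
    """Algorithm 2: initialize at the c = 1 Gaussian, then anneal the leakiness down to c."""
    I = W.shape[0]
    Omega = np.eye(I) - W @ W.T                 # precision matrix at c = 1 (PD after projection)
    Sigma = np.linalg.inv(Omega)                # covariance (I - W W^T)^{-1}
    mu = Sigma @ (W @ b)                        # mean (I - W W^T)^{-1} W b
    v = np.random.multivariate_normal(mu, Sigma, size=n_samples)
    c_cur = 1.0
    for _ in range(T):
        if c_cur > c:
            c_cur = max(c, c_cur - eps)         # decrease the leakiness by a small step
        v, _ = gibbs_step(v, W, b, c_cur)       # block-Gibbs sweep at the current leakiness
    return v
```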

4 Partition Function Estimation


It is known that estimating the partition function of RBM is intractable (Salakhutdinov & Murray,
2008). Existing approaches, including Salakhutdinov & Murray (2008); Grosse et al. (2013); Liu
et al. (2015); Carlson et al. (2016) focus on using sampling to approximate the partition function of
the conventional Bernoulli RBM instead of the RBM with Gaussian visible units and non-Bernoulli
hidden units. In this paper, we focus on extending the classic annealed importance sampling (AIS)
algorithm (Salakhutdinov & Murray, 2008) to leaky RBM.
Assuming that we want to estimate the partition function $Z$ of $p(v)$ with $p(v) = p^*(v)/Z$ and $p^*(v) \propto \sum_h \exp(-E(v, h))$, Salakhutdinov & Murray (2008) start from an initial distribution $p_0(v) \propto \sum_h \exp(-E_0(v, h))$, where computing the partition function $Z_0$ of $p_0(v)$ is tractable and we can draw samples from $p_0(v)$. They then use the "geometric path" to define the intermediate distributions as $p_k(v) \propto p_k^*(v) = \sum_h \exp\left(-\beta_k E_0(v, h) - (1 - \beta_k)E(v, h)\right)$, where the $\beta_k$ form a grid from 1 to 0. If we let $\beta_0 = 1$, we can draw samples $v_k$ from $p_k(v)$ by using samples $v_{k-1}$ from $p_{k-1}(v)$ for $k \ge 1$ via Gibbs sampling. The partition function is then estimated via $Z = \frac{Z_0}{M}\sum_{i=1}^{M}\omega^{(i)}$, where
$$\omega^{(i)} = \frac{p_1^*(v_0^{(i)})}{p_0^*(v_0^{(i)})}\,\frac{p_2^*(v_1^{(i)})}{p_1^*(v_1^{(i)})}\cdots\frac{p_{K-1}^*(v_{K-2}^{(i)})}{p_{K-2}^*(v_{K-2}^{(i)})}\,\frac{p_K^*(v_{K-1}^{(i)})}{p_{K-1}^*(v_{K-1}^{(i)})},$$
and $\beta_K = 0$.
Salakhutdinov & Murray (2008) use an initial distribution with independent visible units and without hidden units. Therefore, we extend Salakhutdinov & Murray (2008) to the leaky-ReLU case with $E_0(v, h) = \frac{\|v\|^2}{2}$, which results in a multivariate Gaussian distribution $p_0(v)$. Compared with the meta algorithm shown in Algorithm 2, which anneals the leakiness, this extension of Salakhutdinov & Murray (2008) anneals between energy functions.
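For reference, the sketch below shows the generic AIS bookkeeping described above: accumulate the ratios $p_k^*(v_{k-1})/p_{k-1}^*(v_{k-1})$ along the path. It is a schematic of the standard estimator, not the authors' code; `log_pstar` (the log unnormalized intermediate marginal) and `gibbs_transition` (a sweep leaving $p_k$ invariant) are assumed to be supplied by the user for the chosen path.

```python
import numpy as np

def ais_log_weights(log_pstar, gibbs_transition, v0, betas):
    """AIS along a path indexed by betas (from 1 down to 0), starting from samples v0 ~ p_0."""
    v = v0.copy()
    log_w = np.zeros(v.shape[0])
    for k in range(1, len(betas)):
        # accumulate log p*_k(v_{k-1}) - log p*_{k-1}(v_{k-1})
        log_w += log_pstar(v, betas[k]) - log_pstar(v, betas[k - 1])
        v = gibbs_transition(v, betas[k])       # move the samples toward p_k
    return log_w

def log_partition_estimate(log_Z0, log_w):
    """log Z is estimated as log Z_0 + log mean(exp(log_w)), computed stably."""
    m = log_w.max()
    return log_Z0 + m + np.log(np.mean(np.exp(log_w - m)))
```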

                         J = 5      J = 10     J = 20     J = 30
Log partition function   2825.48    2827.98    2832.98    2837.99

Table 1: The true log partition function for leaky-ReLU RBM with different numbers of hidden units.

              J = 5           J = 10          J = 20          J = 30
AIS-Energy    1.76 ± 0.011    3.56 ± 0.039    7.95 ± 0.363    9.60 ± 0.229
AIS-Leaky     0.02 ± 0.001    0.04 ± 0.002    0.08 ± 0.003    0.13 ± 0.004

Table 2: The difference between the true log partition function and the estimates of the two algorithms, with standard deviations.

4.1 Study on Toy Examples

As we discussed in Section 3.1, a leaky RBM with $J$ hidden units is a union of $2^J$ truncated Gaussian distributions. Here we perform a study of the leaky RBM with a small number of hidden units. Since in this example the number of hidden units is small, we can integrate out all possible configurations of $h$. However, integrating a truncated Gaussian distribution with general affine constraints does not have an analytical solution, and several approximations have been developed (e.g., Pakman & Paninski, 2014). To compare our results with the exact partition function, we consider a special case that has the following form:
$$p(v) \propto \exp\left(-\frac{1}{2} v^\top\Big(I - \sum_{\eta_j > 0} W_j W_j^\top - c\sum_{\eta_j \le 0} W_j W_j^\top\Big)v\right). \tag{8}$$
Compared to (6), it is equivalent to the setting where $b = 0$. Geometrically, every $W_j$ passes through the origin. We further impose the additional constraint $W_i \perp W_j, \forall i \ne j$. Therefore, we divide the whole space into $2^J$ equally-sized regions. A three dimensional example is shown in Figure 3. The partition function of this special case then has the analytical form
 − 12
J
1 X
− I2
X
Z= J (2π) I − αj Wj Wj>  .
2 j=1
αj ∈{1,c},∀j
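For small $J$, the closed form above can be evaluated directly by enumerating the $2^J$ sign patterns. The sketch below is our own illustration under the stated assumptions ($b = 0$, orthogonal columns, $I - WW^\top \succ 0$); it is only practical for a handful of hidden units.

```python
import numpy as np
from itertools import product

def exact_log_partition(W, c):
    """Enumerate the 2^J configurations of alpha_j in {1, c} and sum the closed form."""
    I, J = W.shape
    log_terms = []
    for alphas in product([1.0, c], repeat=J):
        Omega = np.eye(I)
        for j, a in enumerate(alphas):
            Omega -= a * np.outer(W[:, j], W[:, j])
        _, logdet = np.linalg.slogdet(Omega)
        log_terms.append(-0.5 * logdet)          # log |Omega|^{-1/2}
    log_terms = np.array(log_terms)
    m = log_terms.max()
    log_sum = m + np.log(np.sum(np.exp(log_terms - m)))
    # log Z = -J log 2 + (I/2) log(2 pi) + log sum_alpha |Omega_alpha|^{-1/2}
    return -J * np.log(2.0) + 0.5 * I * np.log(2.0 * np.pi) + log_sum
```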

We randomly initialize $W$ and use SVD to make the columns orthogonal to each other. Also, we scale $\|W_j\|$ to satisfy $I - WW^\top \succ 0$. The leakiness parameter is set to 0.01. For Salakhutdinov & Murray (2008) (AIS-Energy), we use $10^5$ particles with $10^5$ intermediate distributions. For the proposed method (AIS-Leaky), we use only $10^4$ particles with $10^3$ intermediate distributions. In this small problem we study the cases where the model has 5, 10, 20 and 30 hidden units and 3072 visible units. The true log partition function $\log Z$ is shown in Table 1 and the differences between $\log Z$ and the estimates given by the two algorithms are shown in Table 2.
From Table 2, we observe that AIS-Leaky has significantly better and more stable estimates than AIS-Energy, especially when $J$ is large. For example, when we increase $J$ from 5 to 30, the bias (difference) of AIS-Leaky only increases from 0.02 to 0.13; however, the bias of AIS-Energy increases from 1.76 to 9.6. Moreover, we note that AIS-Leaky uses fewer particles and fewer intermediate distributions, and is therefore more computationally efficient than AIS-Energy. We further study the implicit connection between the proposed AIS-Leaky and AIS-Energy in Appendix D, which shows that AIS-Leaky is a special case of AIS-Energy under certain conditions.

4.2 Comparison between Leaky-ReLU RBM and Bernoulli-Gaussian RBM

It is known that the reconstruction error is not a proper approximation of the likelihood (Hinton, 2012). With an accurate estimate of the partition function, we can study the power of leaky RBM when our goal is to use the likelihood function as the objective instead of the reconstruction error.

                          CIFAR-10    SVHN
Bernoulli-Gaussian RBM    −2548.3     −2284.2
Leaky-ReLU RBM            −1031.1     −182.4

Table 3: The log-likelihood performance of Bernoulli-Gaussian RBM and leaky RBM.

We compare leaky RBM with the Bernoulli-Gaussian RBM², which has Bernoulli hidden units and Gaussian visible units. We trained both models with CD-20³ and momentum. For both models, we used 500 hidden units. We initialized $W$ by sampling from Unif(0, 0.01), $a = 0$, $b = 0$ and $\sigma = 1$. The momentum parameter was 0.9 and the batch size was set to 100. We tuned the learning rate between $10^{-1}$ and $10^{-6}$. We studied two benchmark data sets, CIFAR10 and SVHN. The data was normalized to have zero mean and a standard deviation of 1 for each pixel. The log-likelihood values are reported in Table 3.
From Table 3, leaky RBM outperforms Bernoulli-Gaussian RBM significantly. The unsatisfactory performance of Bernoulli-Gaussian RBM may be in part due to the optimization procedure. If we tune the decay schedule of the learning rate for each dataset in an ad-hoc way, we observe that the performance of Bernoulli-Gaussian RBM can be improved by ∼300 nats for both datasets. Also, increasing the number of CD steps brings a slight improvement. The other possibility is bad mixing during the CD iterations; more advanced algorithms (Tieleman, 2008; Tieleman & Hinton, 2009) may help. Although Nair & Hinton (2010) demonstrate the power of ReLU in terms of reconstruction error and classification accuracy, this does not imply superior generative capability. Our study confirms that leaky RBM can have much better generative performance than Bernoulli-Gaussian RBM.

5 Better Mixing by Annealing Leakiness

In this section, we show that the idea of annealing the leakiness also benefits the mixing of Gibbs sampling in other settings. A common procedure for comparing sampling methods for RBM is through visualization. Here, we are interested in more quantitative metrics and the practical benefits of improved sampling. For this, we consider optimization performance as the evaluation metric.
The gradient of the log-likelihood function $\mathcal{L}(\theta \mid v_{\text{data}})$ of general RBM models is
$$\frac{\partial \mathcal{L}(\theta \mid v_{\text{data}})}{\partial \theta} = \mathbb{E}_{h \mid v_{\text{data}}}\left[\frac{\partial E(v, h)}{\partial \theta}\right] - \mathbb{E}_{v, h}\left[\frac{\partial E(v, h)}{\partial \theta}\right]. \tag{9}$$
Since the second expectation in (9) is usually intractable, different algorithms (Fischer & Igel, 2012) are used to approximate it.
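As a concrete instance of approximating (9), the sketch below implements a CD-k gradient estimate for the leaky RBM, reusing the conditional samplers sketched in Section 2.1. It is our own hedged illustration: with $E(v, h) = \|v\|^2/2 - v^\top W h + b^\top h$, the ascent direction for $W$ is $\langle v h^\top\rangle_{\text{data}} - \langle v h^\top\rangle_{\text{model}}$; using sampled rather than mean hidden activations in the positive phase is a simplification.

```python
import numpy as np

def cd_gradient(v_data, W, b, c, k=1):
    """CD-k estimate of the log-likelihood gradient of the leaky RBM."""
    n = v_data.shape[0]
    h_data = sample_h_given_v(v_data, W, b, c)   # positive-phase statistics
    v_model = v_data.copy()
    for _ in range(k):                           # negative phase: k block-Gibbs sweeps from the data
        v_model, h_model = gibbs_step(v_model, W, b, c)
    grad_W = (v_data.T @ h_data - v_model.T @ h_model) / n
    grad_b = (h_data - h_model).mean(axis=0)
    return grad_W, grad_b
```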
In this section, we compare two gradient approximation procedures. The first is the conventional contrastive divergence (CD) (Hinton, 2002). The second is Algorithm 2 (Leaky) with the same number of mixing steps as CD. The experiment setup is the same as that of Section 4. The results are shown in Figure 4. The proposed sampling procedure is slightly better than typical CD steps. The reason is that we only anneal the leakiness for 20 steps, while an accurate estimation requires thousands of steps, as shown in Section 4 when we estimate the partition function. Therefore, the estimated gradient is still inaccurate. However, it still outperforms the conventional CD algorithm, which demonstrates the better mixing power of the proposed sampling algorithm, as expected.
The drawback of using Algorithm 2 is that sampling $v$ from $\mathcal{N}\big((I - WW^\top)^{-1} Wb,\ (I - WW^\top)^{-1}\big)$ requires computing the mean, the covariance, and the Cholesky decomposition of the covariance matrix in every iteration, which is computationally expensive. We therefore study a mix algorithm that combines CD with the idea of annealing the leakiness: instead of sampling $v$ from $\mathcal{N}\big((I - WW^\top)^{-1} Wb,\ (I - WW^\top)^{-1}\big)$, we sample from the empirical data distribution. The resulting mix algorithm is almost the same as the CD algorithm, except that it anneals the leakiness over the iterations as in Algorithm 2. The results of the mix algorithm are also shown in Figure 4.
² Our GPU implementation with gnumpy and cudamat can reproduce the results of http://www.cs.toronto.edu/~tang/code/GaussianRBM.m
³ CD-n means that contrastive divergence was run for n steps.

Figure 4: Training leaky RBM with different sampling algorithms. (a) SVHN; (b) CIFAR10. Both panels plot the log-likelihood against the number of training iterations for the CD, Mix, and Leaky algorithms.

The mix algorithm is slightly worse than the original leaky algorithm, but outperforms the conventional CD algorithm. Starting from the data distribution is biased relative to $\mathcal{N}\big((I - WW^\top)^{-1} Wb,\ (I - WW^\top)^{-1}\big)$, which causes the mix algorithm to perform worse than Algorithm 2. However, by sampling from the data distribution, it is as efficient as the CD algorithm (without additional computational cost). Annealing the leakiness helps the mix algorithm explore different modes of the distribution, which benefits the training. The idea could also be combined with more advanced algorithms (Tieleman, 2008; Tieleman & Hinton, 2009)⁴.
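For completeness, here is a sketch of the mix algorithm described above: it is identical to CD-k except that the chain starts from the data and the leakiness is annealed from 1 down to $c$ over the $k$ Gibbs sweeps. The linear annealing schedule and the function name are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def mix_cd_gradient(v_data, W, b, c, k=20):
    """CD-k with the leakiness annealed from 1 to c during the negative-phase Gibbs sweeps."""
    n = v_data.shape[0]
    h_data = sample_h_given_v(v_data, W, b, c)   # positive phase at the target leakiness
    v_model = v_data.copy()
    for step in range(k):
        c_cur = 1.0 + (c - 1.0) * (step + 1) / k # anneal the leakiness linearly over k sweeps
        v_model, h_model = gibbs_step(v_model, W, b, c_cur)
    grad_W = (v_data.T @ h_data - v_model.T @ h_model) / n
    grad_b = (h_data - h_model).mean(axis=0)
    return grad_W, grad_b
```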

6 Conclusion

In this paper, we study the properties of the distributions of leaky RBM. The study links the leaky RBM model and truncated Gaussian distributions. Also, our study reveals and addresses an underlying positive definite constraint in training leaky RBM. Based on this study, we further propose a meta sampling algorithm, which anneals the leakiness during the Gibbs sampling procedure. We first demonstrate that the proposed sampling algorithm is more effective and more efficient at estimating the partition function than the conventional AIS algorithm. Second, we show that the proposed sampling algorithm has a better mixing property under evaluation via optimization.
A few directions are worth further study. For example, one is how to speed up the naive projection step. One potential direction is using a barrier function as shown in Hsieh et al. (2011) to avoid the projection step.

References
Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2009.

Y. Burda, R. B. Grosse, and R. Salakhutdinov. Accurate and conservative estimates of MRF log-likelihood using reverse annealing. In AISTATS, 2015.

D. E. Carlson, P. Stinson, A. Pakman, and L. Paninski. Partition functions from Rao-Blackwellized tempered sampling. In ICML, 2016.

A. Fischer and C. Igel. An introduction to restricted Boltzmann machines. In CIARP, 2012.

Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer
networks. Technical report, 1994.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
⁴ We studied the PCD extension of the proposed sampling algorithm. However, the performance is not as stable as CD.

R. B. Grosse, C. J. Maddison, and R. Salakhutdinov. Annealing between distributions by averaging
moments. In NIPS, 2013.
G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computa-
tion, 2002.
G. E. Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade (2nd ed.). 2012.
G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural
Computation, 2006.
C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estima-
tion using quadratic approximation. In NIPS, 2011.
D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, 2013.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable
unsupervised learning of hierarchical representations. In ICML, 2009.
Q. Liu, J. Peng, A. Ihler, and J. Fisher III. Estimating the partition function by discriminance
sampling. In UAI, 2015.
A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic
models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing,
2013.
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 2014.
N. Parikh and S. Boyd. Proximal algorithms. Found. Trends Optim., 2014.
S. Ravanbakhsh, B. Póczos, J. G. Schneider, D. Schuurmans, and R. Greiner. Stochastic neural
networks with monotonic activation functions. In AISTATS, 2016.
R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
R. Salakhutdinov and I. Murray. On the quantitative analysis of Deep Belief Networks. In ICML,
2008.
P. Smolensky. Parallel distributed processing: Explorations in the microstructure of cognition, vol.
1. 1986.
Q. Su, X. Liao, C. Chen, and L. Carin. Nonlinear statistical learning with truncated Gaussian graphical models. In ICML, 2016.
L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR,
2016.
T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
T. Tieleman and G.E. Hinton. Using Fast Weights to Improve Persistent Contrastive Divergence. In
ICML, 2009.
M. Welling, M. Rosen-Zvi, and G. E. Hinton. Exponential family harmoniums with an application
to information retrieval. In NIPS, 2004.
E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models. In
NIPS, 2012.

A Proof of Theorem 1

Proof. Since $WW^\top - \sum_j \alpha_j W_j W_j^\top = \sum_j (1 - \alpha_j) W_j W_j^\top \succeq 0$, we have $WW^\top \succeq \sum_j \alpha_j W_j W_j^\top$. Therefore, $I - \sum_j \alpha_j W_j W_j^\top \succeq I - WW^\top \succ 0$.

B Proof of Theorem 2

Proof. Let the SVD decompositions of $W$ and $\tilde W$ be $W = U S V^\top$ and $\tilde W = \tilde U \tilde S \tilde V^\top$. Then we have
$$\|W - \tilde W\|_F^2 = \|U S V^\top - \tilde U \tilde S \tilde V^\top\|_F^2 \ge \sum_{i=1}^{I}(S_{ii} - \tilde S_{ii})^2, \tag{10}$$
and the constraint $I - \tilde W \tilde W^\top \succeq 0$ can be rewritten as $0 \le \tilde S_{ii} \le 1, \forall i$. The transformed problem has a Lasso-like formulation and can be solved by $\tilde S_{ii} = \min(S_{ii}, 1)$ (Parikh & Boyd, 2014). Also, the lower bound $\sum_{i=1}^{I}(S_{ii} - \tilde S_{ii})^2$ in (10) becomes tight when we set $\tilde U = U$ and $\tilde V = V$, which completes the proof.

C Necessity of the Projection Step

We conduct a short comparison to demonstrate that the projection step is necessary for the leaky RBM on generative tasks. We train two leaky RBMs as follows. The first model is trained with the same setting as in Section 4, using the convergence of the log-likelihood as the stopping criterion. The second model is trained by CD-1 with weight decay and without the projection step; we stop the training when the reconstruction error is less than $10^{-2}$. After we train these two models, we run Gibbs sampling with 1000 independent chains for several steps and output the average value of the visible units. Note that the visible units are normalized to zero mean. The results on SVHN and CIFAR10 are shown in Figure 5.

Figure 5: Divergence results on two datasets. (a) SVHN; (b) CIFAR10. Both panels plot the average value of the visible units (log scale) against the number of Gibbs sampling iterations for the model trained with weight decay (no projection) and the model trained with the projection step.

From Figure 5, the model trained with weight decay and without the projection step suffers from diverging values. This confirms the study in Section 3.1. It also implies that we cannot train leaky RBM with larger CD steps without the projection; otherwise, the gradients diverge. Therefore, the projection is necessary for training leaky RBM for generative purposes. However, we also observe that the projection step is not necessary for classification and reconstruction tasks. The reason may be the independence of different evaluation criteria (Hinton, 2012; Theis et al., 2016) or other implicit reasons to be studied.

D Equivalence between Annealing the Energy and Leakiness
We analyze the performance gap between AIS-Leaky and AIS-Energy. One major difference is the initial distribution. The intermediate marginal distribution of AIS-Energy has the following form:
$$p_k(v) \propto \exp\left(-\frac{1}{2} v^\top\Big(I - (1 - \beta_k)\sum_{\eta_j > 0} W_j W_j^\top - (1 - \beta_k)c\sum_{\eta_j \le 0} W_j W_j^\top\Big)v\right). \tag{11}$$
Here we eliminated the bias terms $b$ for simplicity. Compared with Algorithm 2, (11) not only anneals the leakiness term $(1 - \beta_k)c\sum_{\eta_j \le 0} W_j W_j^\top$ where $\eta_j \le 0$, but also the term $(1 - \beta_k)\sum_{\eta_j > 0} W_j W_j^\top$ where $\eta_j > 0$, which brings more bias to the estimation. In other words, AIS-Leaky is a one-sided leakiness annealing while AIS-Energy is a two-sided leakiness annealing method.
To address the higher bias problem of AIS-Energy, we replace the initial distribution with the one used in Algorithm 2. By elementary calculation, the intermediate marginal distribution becomes
$$p_k(v) \propto \exp\left(-\frac{1}{2} v^\top\Big(I - \sum_{\eta_j > 0} W_j W_j^\top - \big(\beta_k + (1 - \beta_k)c\big)\sum_{\eta_j \le 0} W_j W_j^\top\Big)v\right), \tag{12}$$
which recovers the proposed Algorithm 2. From this analysis, we see that AIS-Leaky is a special case of the conventional AIS-Energy with a better initialization inspired by the study in Section 3. Also, by this connection between AIS-Energy and AIS-Leaky, we note that AIS-Leaky can be combined with other extensions of AIS (Grosse et al., 2013; Burda et al., 2015) as well.

