This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TFUZZ.2015.2406889, IEEE Transactions on Fuzzy Systems
Abstract—In recent years, deep learning has carved out a research wave in machine learning. With outstanding performance, more and more applications of deep learning in pattern recognition, image recognition, speech recognition and video processing have been developed. The restricted Boltzmann machine (RBM) plays an important role in current deep learning techniques, as most existing deep networks are based on or related to it. For the regular RBM, the relationships between the visible units and hidden units are restricted to be constants, which certainly downgrades the representation capability of the RBM. To avoid this flaw and enhance the deep learning capability, the fuzzy restricted Boltzmann machine (FRBM) and its learning algorithm are proposed in this paper, in which the parameters governing the model are replaced by fuzzy numbers. In this way, the original RBM becomes a special case of the FRBM, when there is no fuzziness in the FRBM model. In the process of learning the FRBM, the fuzzy free energy function is defuzzified before the probability is defined. The experimental results based on the Bar-And-Stripe benchmark inpainting and MNIST handwritten digits classification problems show that the representation capability of the FRBM model is significantly better than that of the traditional RBM. Additionally, the FRBM also reveals better robustness than the RBM when the training data are contaminated by noises.

Keywords—Deep learning, Fuzzy restricted Boltzmann machine, RBM, Fuzzy deep networks, Image inpainting, Image classification.

I. INTRODUCTION

RESTRICTED Boltzmann machine (RBM), as illustrated in Fig. 1, is a stochastic graph model that can learn a joint probability distribution over its n visible units x = [x_1, ..., x_n] and m hidden feature units h = [h_1, ..., h_m]. The model is governed by parameters θ denoting connection weights and biases between cross-layer units. The RBM [1] was invented in 1986; however, it received a new birth when G. Hinton and his partners recently proposed several deep networks and corresponding fast learning algorithms, including the deep autoencoder [2], deep belief networks [3], and the deep Boltzmann machine [4]. The RBM and its deep architectures have a large number of applications [5], such as dimensionality reduction [6], classification [7], collaborative filtering [8], feature learning [9], and topic modelling [10]. A detailed knowledge of the RBM and its deep architectures can be found in [11] [12].

Fig. 1: Restricted Boltzmann Machine

Most researchers in the deep learning field focus on deep network design and corresponding fast learning algorithms. Some research works try to improve the deep learning technique from the model representation. For example, Gaussian restricted Boltzmann machines (GRBMs) with Gaussian linear units are proposed to learn representations from real-valued data [13]; the GRBM improves the RBM by replacing binary-valued visible units with Gaussian ones. Deep networks based on the GRBM have also been developed in recent years, such as the Gaussian-Bernoulli deep Boltzmann machine [14] [15]. Conditional versions of RBMs have also been developed for collaborative filtering [16], and temporal RBMs and recurrent RBMs are proposed to model high-dimensional sequence data such as motion capture data [17] and video sequences [18].

For regular RBMs and their existing variants, the parameters that represent the relationships between units in the visible and hidden layers are restricted to be constants. This structure design leads to several problems. Firstly, it constrains the representation capability, since the variables often interact in uncertain ways. Secondly, the RBMs are not very robust when the training data samples are corrupted by noises. Thirdly, the parameter learning process of the RBMs is confined to a relatively small space, which is contrary to the merits of deep learning. All of these constraints are reflected in the fitness of the target joint probability distribution. To overcome these disadvantages and reduce the inaccuracy and distortion induced by linearization of the relationship between cross-layer units, this paper proposes the fuzzy restricted Boltzmann machine (FRBM) and a corresponding learning algorithm, where the parameters governing the model are all fuzzy numbers. After the vagueness in the relationship between the visible units and their high-level features is introduced into the model, the fuzzy RBMs demonstrate competitive performances in both data representation capability and robustness to cope with noisy data.

C. L. Philip Chen, Chun-Yang Zhang and Long Chen are with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China.
Min Gan is with the School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, China.
This research is sponsored by the Macau Science and Technology Development Fund under Grant No. 008/2010/A1, UM Multiyear Research Grants, and the National Natural Science Foundation of China under Grant No. 61203106.
1063-6706 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
III. FUZZY RESTRICTED BOLTZMANN MACHINE AND ITS LEARNING ALGORITHM

A. Fuzzy Restricted Boltzmann Machine

The proposed novel fuzzy restricted Boltzmann machine (FRBM) is illustrated in Fig. 3, in which the connection weights and biases are fuzzy parameters denoted by θ̃. There are several merits of the FRBM model. The first is that the FRBM has a much better representation capability than the regular RBM in modelling probabilities over visible and hidden units; specifically, the RBM is only a special case of the FRBM when no fuzziness exists in the FRBM model. The second is that the robustness of the FRBM model surpasses that of the RBM model: the FRBM shows more robustness when it comes to fitting the model with noisy data. All these advantages spring from the fuzzy extension of the relationships between cross-layer variables, and inherit the characteristics of fuzzy models.

Fig. 3: Fuzzy Restricted Boltzmann Machine

Since the FRBM is an extension of the RBM model, the discussion starts from a brief introduction of the RBM model. An RBM is an energy-based probabilistic model, in which the probability distribution is defined through an energy function. Its probability is defined as

P(x, h, θ) = e^{-E(x,h,θ)} / Z,   (7)

Z = Σ_{x̃} Σ_{h̃} e^{-E(x̃,h̃,θ)},   (8)

where E(x, h, θ) is the energy function, θ are the parameters governing the model, Z is the normalizing factor called the partition function, and x̃ and h̃ are two vector variables representing visible and hidden units that are used to traverse and summarize all the configurations of units on the graph. The energy function for the RBM is defined by

E(x, h, θ) = -b^T x - c^T h - h^T W x,   (9)

where b_j and c_i are the offsets, W_{ij} is the connection weight between the j-th visible unit and the i-th hidden unit, and θ = {b, c, W}.

To establish the fuzzy restricted Boltzmann machine, it is necessary to first define the fuzzy energy function for the model. The fuzzy energy function can be extended from Eqn. (9) in accordance with the extension principle as follows:

Ẽ(x, h, θ̃) = -b̃^T x - c̃^T h - h^T W̃ x,   (10)

where Ẽ(x, h, θ̃) is the fuzzified energy function and θ̃ = {b̃, c̃, W̃} are fuzzy parameters. Correspondingly, the fuzzy free energy F̃, which marginalizes the hidden units and maps Eqn. (7) into a simpler one, is deduced as

F̃(x, θ̃) = -log Σ_{h̃} e^{-Ẽ(x,h̃,θ̃)},   (11)

where F̃ is extended from the crisp free energy function F:

F(x, θ) = -log Σ_{h̃} e^{-E(x,h̃,θ)}.   (12)

If the fuzzy free energy function is directly employed to define the probability, it leads to a fuzzy probability [27], and the optimization in the learning process turns into a fuzzy maximum likelihood problem. However, this kind of problem is quite intractable, because the fuzzy objective function is nonlinear and the membership function is difficult to compute, since the computation of its alpha-cuts becomes an NP-hard problem [29]. Therefore, it is necessary to transform the problem into a regular maximum likelihood problem by defuzzifying the fuzzy free energy function (11). The centre of area (centroid) method [30] is employed to defuzzify the fuzzy free energy function F̃(x). Then the likelihood function can be defined by the defuzzified fuzzy free energy function. Consequently, the fuzzy optimization problem becomes a real-valued problem, and conventional optimization approaches can be directly applied to find the optimal solutions. The centroid of the fuzzy number F̃(x) is denoted by F_c(x), and has the following form

F_c(x, θ̃) = ∫ θ F(x, θ) dθ / ∫ F(x, θ) dθ,   θ ∈ θ̃.   (13)

Naturally, after the fuzzy free energy is defuzzified, the probability can be defined as

P_c(x, θ̃) = e^{-F_c(x,θ̃)} / Z,   Z = Σ_{x̃} e^{-F_c(x̃,θ̃)}.   (14)

In the fuzzy RBM model, the objective function is the negative log-likelihood, which is given by

L(θ̃, D) = -Σ_{x∈D} log P_c(x, θ̃),   (15)

where D is the training dataset.

The learning problem is to find optimal solutions for the parameters θ̃ that minimize the objective function L(θ̃, D), i.e.,

min_{θ̃} L(θ̃, D).   (16)

In the following subsection, the detailed procedure to address the dual problem of maximum likelihood by utilizing the stochastic gradient descent method will be investigated.
F̃(x, θ̃)[α] = [F(x, θ_R), F(x, θ_L)].

2) Approximation of Centroid: An approximation of the centroid of the fuzzy free energy function is provided by discretizing the fuzzy output F̃(x, θ̃) and calculating M of its α-cuts. By combining these α-cuts, the approximate centroid is given by

F_c(x, θ̃) ≈ Σ_{i=1}^{M} α_i [F(x, θ_i^L) + F(x, θ_i^R)] / (2 Σ_{i=1}^{M} α_i),   (18)

where α = (α_1, ..., α_M), α ∈ [0, 1]^M, and θ̃[α_i] = [θ_i^L, θ_i^R].

As all the α-cuts are bounded intervals [31], for convenience, we only consider a special case in which the fuzzy numbers are degraded into intervals (α = 1). Let θ̃ = [θ_L, θ_R]. According to Eqn. (18), the free energy function can be written as

F_c(x, θ̃) ≈ (1/2)[F(x, θ_L) + F(x, θ_R)].   (19)

After defuzzifying the fuzzy free energy, the probability defined on it returns to Eqn. (14). Then the problem is transformed into a regular optimization problem, which can be solved by the gradient-descent-based stochastic maximum likelihood method.

3) Gradient Related Optimization: The gradients of the negative log-probability with respect to θ_L then have a particular form (see Appendix A):

-∂ log P_c(x, θ̃)/∂θ_L = ∂F_c(x, θ_L)/∂θ_L - E_P[∂F_c(x, θ_L)/∂θ_L],   (20)

where E_P(·) denotes the expectation over the target probability distribution P. Similarly, the gradients of Eqn. (16) with respect to θ_R can be obtained.

In the commonly studied case of the RBM with binary units, where x_j, h_i ∈ {0, 1}, the probabilistic version of the usual neuron activation functions can be derived from the conditional probabilities. They have the following affine forms

P(h_i = 1|x) = e^{c_i + W_i x} / (1 + e^{c_i + W_i x}) = σ(c_i + W_i x),   (25)

P(x_j = 1|h) = e^{b_j + W_{.j}^T h} / (1 + e^{b_j + W_{.j}^T h}) = σ(b_j + W_{.j}^T h),   (26)

where W_i and W_{.j} denote the i-th row and j-th column of W, respectively, and σ is the logistic sigmoid function

σ(x) = e^x / (e^x + 1) = 1 / (1 + e^{-x}).

For the fuzzy RBM, the conditional probabilities also become fuzzy and can be extended from Eqn. (25) and (26) as

P̃(h_i = 1|x) = σ(c̃_i + W̃_i x),
P̃(x_j = 1|h) = σ(b̃_j + W̃_{.j}^T h).

After defuzzifying the objective function, the Markov chain Monte Carlo (MCMC) method can be employed to sample from these conditional distributions. This process is crucial to approximate the objective function, due to the difficulty of calculating the expectations. For the predefined α, the α-cuts of the fuzzy conditional probabilities are consistent with the target probability distribution. They are given as follows

P̃(h_i = 1|x)[α] = [P_L(h_i = 1|x), P_R(h_i = 1|x)],
P̃(x_j = 1|h)[α] = [P_L(x_j = 1|h), P_R(x_j = 1|h)],

where P_L(h_i|x), P_R(h_i|x), P_L(x_j|h) and P_R(x_j|h) are the conditional probabilities with respect to the lower and upper bounds of the parameters governing the model.
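The centroid approximation of Eqn. (18), and its reduction to the interval average of Eqn. (19) when a single α = 1 cut is used, can be sketched as follows. This is an illustrative fragment assuming NumPy; the sample α-cut values are hypothetical:

```python
import numpy as np

def centroid_free_energy(alphas, F_L, F_R):
    """Approximate centroid of a fuzzy free energy, Eqn (18):
    F_c ~= sum_i alpha_i [F(x, theta_i^L) + F(x, theta_i^R)] / (2 sum_i alpha_i)."""
    alphas = np.asarray(alphas, float)
    F_L = np.asarray(F_L, float)   # free energies at the lower ends of the alpha-cuts
    F_R = np.asarray(F_R, float)   # free energies at the upper ends of the alpha-cuts
    return (alphas * (F_L + F_R)).sum() / (2.0 * alphas.sum())

# With a single alpha = 1 cut, the formula degrades to the interval case of
# Eqn (19): F_c = (F(x, theta_L) + F(x, theta_R)) / 2.
assert np.isclose(centroid_free_energy([1.0], [-3.0], [-1.0]), -2.0)

# M = 3 alpha-cuts of a (hypothetical) triangular fuzzy free energy
alphas = [0.25, 0.5, 1.0]
F_L = [-3.0, -2.5, -2.0]
F_R = [-1.0, -1.5, -2.0]
Fc = centroid_free_energy(alphas, F_L, F_R)   # weighted average of cut midpoints
```

For a symmetric fuzzy number, as in this hypothetical example, every α-cut has the same midpoint, so the centroid coincides with it.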
They have the following forms

P_L(h_i|x) = P(h_i|x; θ_L) = σ(c_i^L + W_i^L x),
P_R(h_i|x) = P(h_i|x; θ_R) = σ(c_i^R + W_i^R x),   (27)

and

P_L(x_j|h) = P(x_j|h; θ_L) = σ(b_j^L + W_{.j}^L h),
P_R(x_j|h) = P(x_j|h; θ_R) = σ(b_j^R + W_{.j}^R h).   (28)

There are six kinds of parameters for visible unit j and hidden unit i in the FRBM model, i.e., the lower bounds of the connection weight W_{ij}^L, visible bias b_j^L, and hidden bias c_i^L, and their upper bounds W_{ij}^R, b_j^R, and c_i^R. For simplicity, the energy function is denoted by a sum of terms each associated with only one hidden unit,

E(x, h) = -µ(x) + Σ_{i=1}^{m} φ_i(x, h_i),   (29)

where

µ(x) = b^T x,   φ_i(x, h_i) = -h_i(c_i + W_i x).   (30)

Then, the free energy of the RBM with binary units can be further simplified explicitly to (see Appendix B)

F(x) = -b^T x - Σ_{i=1}^{m} log(1 + e^{c_i + W_i x}).   (31)

The gradients of the free energy can be calculated explicitly when the RBM has binary units and the energy function has the form of Eqn. (29).

No matter whether the fuzzy parameters in the fuzzy RBM are symmetric or not, their α-cuts are always intervals with lower and upper bounds that need to be learned in the training phase. According to Eqn. (19), (20) and (21), all the gradients (see details in Appendix C) can be obtained. Associated with Eqn. (20) and (31), it is easy to get the following negative log-likelihood gradients for the fuzzy RBM with binary units

-∂ log P_c(x)/∂W_{ij}^L = E_P[P_L(h_i|x) · x_j^L] - P_L(h_i|x) · x_j^L,
-∂ log P_c(x)/∂c_i^L = E_P[P_L(h_i|x)] - P_L(h_i|x),
-∂ log P_c(x)/∂b_j^L = E_P[P_L(x_j|h)] - x_j^L,
-∂ log P_c(x)/∂W_{ij}^R = E_P[P_R(h_i|x) · x_j^R] - P_R(h_i|x) · x_j^R,
-∂ log P_c(x)/∂c_i^R = E_P[P_R(h_i|x)] - P_R(h_i|x),
-∂ log P_c(x)/∂b_j^R = E_P[P_R(x_j|h)] - x_j^R,

where P_c(x) is the centroid probability defined through Eqn. (14), combined with Eqn. (18) and (19).

5) Contrastive Divergence: To approximate the expectations, samples of P_L(x) and P_R(x) can be obtained by running two Markov chains to convergence, using Gibbs sampling as the transition operator. One chain is for the lower bounds, and the other is for the upper bounds. Gibbs sampling of the joint distribution over N random variables S = (S_1, ..., S_N) is done through a sequence of N sampling sub-steps of the form S_i ∼ P(S_i|S_{-i}), where S_{-i} contains the N - 1 other random variables in S excluding S_i.

For both the RBM and the fuzzy RBM model, S consists of the set of visible and hidden units. However, since they are conditionally independent, one can perform block Gibbs sampling. In a step of the Markov chain, visible units are sampled given the hidden units, and hidden units are sampled given the visible units, as illustrated in Fig. 4:

h^{(k+1)} ∼ P(h^{(k+1)}|x^{(k)}),   (32)
x^{(k+1)} ∼ P(x^{(k+1)}|h^{(k+1)}).   (33)

As k → ∞, the samples (x^{(k)}, h^{(k)}) are guaranteed to be accurate samples of P(x, h). However, Gibbs sampling is very time-consuming, as k needs to be large enough. An efficient learning approach, called contrastive divergence (CD) learning [32], was proposed in 2002. It shows that the learning process still performs very well even though only a small number of steps are run in the Markov chain [33]. CD learning uses two tricks to speed up the sampling process: the first is to initialize the Markov chain with a training example, and the second is to obtain samples after only k steps of Gibbs sampling. This is referred to as the CD-k learning algorithm. A lot of experiments show that the performance of the approximation is still very good even when k = 1.

Fig. 4: k-step Gibbs Sampling

Contrastive divergence is a different function from the Kullback-Leibler (KL) divergence for measuring the difference between the approximated distribution and the true distribution. Why is CD learning efficient? Because CD learning provides an approximation of the log-likelihood gradient that has been found to be a successful update rule for training probabilistic models. A variational justification provides a theoretical proof of the convergence of the learning process [11] [34]. Conducting CD-1 learning by using Eqn. (25) and (26), namely x = x^{(0)} → h^{(0)} → x^{(1)} → h^{(1)}, it is easy to get the updating rules for all the parameters (θ_L and θ_R) in the FRBM model. The pseudo-code is demonstrated in Algorithm 1.
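A minimal sketch of one CD-1 step for the interval (α = 1) case, with one chain per parameter bound. This is an illustration assuming NumPy; the helper name cd1_update, the toy sizes, and the initialization are assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n, m, lr = 6, 4, 0.05            # toy visible/hidden sizes; illustrative learning rate

# Interval parameters theta = [theta_L, theta_R]: one set per bound
params = {side: {"W": rng.normal(scale=0.1, size=(m, n)),
                 "b": np.zeros(n), "c": np.zeros(m)} for side in ("L", "R")}

def cd1_update(x, p):
    """One CD-1 step on one bound's chain: x(0) -> h(0) -> x(1) -> h(1)."""
    W, b, c = p["W"], p["b"], p["c"]
    ph0 = sigmoid(c + W @ x)                  # P(h_i = 1 | x), Eqns (25)/(27)-(28)
    h0 = (rng.random(m) < ph0).astype(float)  # h(0) ~ P(h | x), Eqn (32)
    px1 = sigmoid(b + W.T @ h0)               # P(x_j = 1 | h), Eqn (26)
    x1 = (rng.random(n) < px1).astype(float)  # x(1) ~ P(x | h), Eqn (33)
    ph1 = sigmoid(c + W @ x1)
    # Negative log-likelihood gradients approximated by the k = 1 sample:
    # data-dependent term minus model (chain) term
    p["W"] += lr * (np.outer(ph0, x) - np.outer(ph1, x1))
    p["b"] += lr * (x - x1)
    p["c"] += lr * (ph0 - ph1)

x = rng.integers(0, 2, size=n).astype(float)  # one binary training example
for side in ("L", "R"):                       # the two Markov chains: lower and upper bounds
    cd1_update(x, params[side])
```

Looping this update over mini-batches, and keeping the two bound chains separate throughout, is the essence of the FRBM training scheme described above.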
One can recover the benchmark by conducting Gibbs sampling over P(x_1, ..., x_l | x_{l+1}, ..., x_n). The unknown observations are initially assigned to be 0.5.

In the following experiments, conducted to evaluate the FRBM on different energy functions and different types of fuzzy numbers for the RBM's parameters, the energy function defined as Eqn. (10) is regarded as the type-1 energy function (EF-1), and the energy of a joint configuration ignoring the bias terms is regarded as the type-2 energy function (EF-2). The fuzzy RBM with symmetric triangular fuzzy numbers is denoted as FRBM-STFN. Accordingly, the fuzzy RBM with Gaussian membership functions is denoted as FRBM-GMF. In the experiments, the FRBM and RBM models have 25 visible units and 50 hidden units. For the two models, the mini-batch learning scheme (100 samples per batch) and the CD-1 learning approach are employed to speed up the training processes. The biases b and c are initialized with zero values, and the connection weights are generated randomly from a standard Gaussian distribution in the initial stages. The learning rates for updating W, b and c are set to 0.05. The MSEs (the summation of the reconstruction errors over all the 60,000 training samples) produced in the two learning phases (50 hidden units) are shown in Fig. 6. It is easy to see that the FRBM model generates smaller reconstruction errors than the RBM model, which means the FRBM can learn the probability distribution more accurately than the traditional RBM model.

One can conclude that the FRBM has stronger representation capability than the RBM. This advantage springs from the fuzzy relationship between the visible and hidden units. Intuitively, the parameter searching space of the RBM can be regarded as a sub-space of the FRBM's searching space when the relationships are extended.

Fig. 7: BAS Benchmark Inpainting Based on RBM
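The experimental setup above (25 visible and 50 hidden units, zero biases, standard Gaussian weights, learning rate 0.05, mini-batches of 100) can be mirrored in a short sketch. The reconstruction_mse helper is a hypothetical stand-in for the reported reconstruction-error metric, using one deterministic mean-field pass rather than the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 25, 50       # sizes used in the BAS experiments
batch_size, lr = 100, 0.05         # mini-batch size and learning rate from the text

# Initialization described in the text: zero biases, standard Gaussian weights
W = rng.standard_normal((n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

def reconstruction_mse(X, W, b, c):
    """Summed reconstruction error after one mean-field up-down pass
    (an assumed progress metric, per batch rather than per epoch)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = sigmoid(X @ W.T + c)       # hidden activations given the visibles
    X_rec = sigmoid(H @ W + b)     # reconstructed visible activations
    return ((X - X_rec) ** 2).sum()

# One synthetic binary mini-batch standing in for BAS patterns
X = rng.integers(0, 2, size=(batch_size, n_visible)).astype(float)
err = reconstruction_mse(X, W, b, c)
```

Summing this quantity over all training samples after each epoch gives a learning curve of the kind plotted in Fig. 6.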
…the FRBM model learn a more accurate distribution than the regular RBM model.

Both the RBM and FRBM models perform much better when the bias terms are taken into account in the energy functions for this case. The energy functions for current variants of the RBM are designed by considering the structures of the variants. In addition, the membership function of the fuzzy parameters also has an effect on the representation capability of the FRBM model. From the results in TABLE I, the FRBM model with Gaussian membership functions represents the BAS dataset better than that with symmetric triangular membership functions. The learning time for the FRBM model rises accordingly, as the number of parameters increases when fuzziness is introduced into the model.
Fig. 11: Learning Processes of RBM and FRBM Based on MNIST Dataset

…increased by relaxing the restrictions on the relationships between cross-layer units. When the number of hidden units is 100, the fuzzy RBM improves the recognition accuracy by 1.44%. This is a significant advancement in MNIST digit recognition, as state-of-the-art machine learning recognition algorithms only produce a 0.5% improvement compared with previous approaches. As the number of hidden units increases, the fuzzy RBMs produce a lower error rate than the RBM, but the improvement of the fuzzy RBM decreases to 1.06%. This is because both the RBMs and the fuzzy RBMs reach their capacities to model the data. As every image in the MNIST dataset is different and has low resolution (noises already exist, see Fig. 12), there is no need to artificially add noises into the images.

Fig. 12: MNIST Samples

V. CONCLUSION

In order to improve the representation capability and robustness of the RBM model, a novel FRBM model is proposed, in which the connection weights and biases between the visible and hidden units are fuzzy numbers. This means that the relationship between the input and its high-level features (the hidden units in the graph) is extended. After relaxing the relationship between the random variables by introducing fuzzy parameters, the RBM is only a special case of the FRBM when no fuzziness exists in the fuzzy model. In other words, the parameter searching space of the RBM is only a sub-space of the FRBM's searching space. Therefore, the FRBM has a more powerful representation capability than the regular RBM. On the other hand, the robustness of the FRBM also surpasses that of the RBM model. These merits are attributed to the fuzziness of the FRBM model.

Both the RBM and the FRBM can be trained in both supervised and unsupervised manners.
TABLE II: MNIST Classification: the error rate is the percentage of incorrectly recognized samples among the 10,000 testing samples; m is the number of hidden units
APPENDIX A
Then the likelihood function can be given by

P(x) = (1/Z) Σ_{h̃} e^{-E(x,h̃)}
     = (1/Z) Σ_{h_1} Σ_{h_2} ⋯ Σ_{h_m} e^{µ(x) - Σ_{i=1}^{m} φ_i(x,h_i)}
     = (1/Z) Σ_{h_1} Σ_{h_2} ⋯ Σ_{h_m} e^{µ(x)} Π_{i=1}^{m} e^{-φ_i(x,h_i)}
     = (e^{µ(x)}/Z) Σ_{h_1} e^{-φ_1(x,h_1)} Σ_{h_2} e^{-φ_2(x,h_2)} ⋯ Σ_{h_m} e^{-φ_m(x,h_m)}
     = (e^{µ(x)}/Z) Π_{i=1}^{m} Σ_{h_i} e^{-φ_i(x,h_i)}.   (37)

For the RBM model with binary-valued units, its free energy can be written as

F(x) = -log P(x) - log Z
     = -µ(x) - Σ_{i=1}^{m} log Σ_{h_i} e^{-φ_i(x,h_i)}
     = -b^T x - Σ_{i=1}^{m} log(1 + e^{c_i + W_i x}).   (38)

APPENDIX C

As the free energy of the RBM model is explicitly deduced as

F(x) = -b^T x - Σ_{i=1}^{m} log(1 + e^{c_i + W_i x}),

corresponding to Eqn. (19) and (30), it is easy to get

F_c(x) = -(1/2) µ(x; θ_L) - (1/2) µ(x; θ_R)
         - (1/2) Σ_{i=1}^{m} log(1 + e^{c_i^L + W_i^L x})
         - (1/2) Σ_{i=1}^{m} log(1 + e^{c_i^R + W_i^R x}).   (39)

In terms of the following denotations

P_L(h_i|x) = σ(c_i^L + W_i^L x),
P_R(h_i|x) = σ(c_i^R + W_i^R x),   (40)

all the gradients can be expressed as

∂F_c(x)/∂W_{ij}^L = -P_L(h_i|x) · x_j^L,
∂F_c(x)/∂c_i^L = -P_L(h_i|x),
∂F_c(x)/∂b_j^L = -x_j^L,
∂F_c(x)/∂W_{ij}^R = -P_R(h_i|x) · x_j^R,
∂F_c(x)/∂c_i^R = -P_R(h_i|x),
∂F_c(x)/∂b_j^R = -x_j^R.   (41)

REFERENCES

[1] P. Smolensky, "Parallel distributed processing: Explorations in the microstructure of cognition." MIT Press, 1986, vol. 1, pp. 194–281.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," Advances in Neural Information Processing Systems, vol. 19, p. 153, 2007.
[3] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[4] R. Salakhutdinov and G. Hinton, "An efficient learning procedure for deep Boltzmann machines," Neural Computation, vol. 24, no. 8, pp. 1967–2006, 2012.
[5] R. Salakhutdinov, J. Tenenbaum, and A. Torralba, "Learning with hierarchical-deep models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1958–1971, Aug 2013.
[6] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[7] J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), ser. CVPR '12. IEEE Computer Society, 2012, pp. 3642–3649.
[8] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[9] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[10] R. Salakhutdinov and G. Hinton, "Replicated softmax: an undirected topic model," in Advances in Neural Information Processing Systems, 2010.
[11] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[12] G. Hinton, "A practical guide to training restricted Boltzmann machines," UTML TR 2010-003, 2010.
[13] N. Wang, J. Melchior, and L. Wiskott, "Gaussian-binary restricted Boltzmann machines on modeling natural image statistics," CoRR, vol. abs/1401.5900, 2014.
[14] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Two distributed-state models for generating high-dimensional time series," Journal of Machine Learning Research, vol. 12, pp. 1025–1068, 2011.
[15] K. H. Cho, T. Raiko, and A. Ilin, "Gaussian-Bernoulli deep Boltzmann machine," in Neural Networks (IJCNN), The 2013 International Joint Conference on, Aug 2013, pp. 1–7.
[16] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann machines for collaborative filtering," in Proceedings of the 24th International Conference on Machine Learning, ser. ICML '07. New York, NY, USA: ACM, 2007, pp. 791–798.
[17] G. W. Taylor, G. E. Hinton, and S. Roweis, "Modeling human motion using binary latent variables," in Advances in Neural Information Processing Systems. MIT Press, 2007, pp. 1345–1352.
[18] I. Sutskever and G. E. Hinton, "Learning multilevel distributed representations for high-dimensional sequences," in International Conference on Artificial Intelligence and Statistics, ser. JMLR Proceedings, vol. 2. JMLR.org, 2007, pp. 548–555.
[19] H.-F. Wang and R.-C. Tsaur, "Insight of a fuzzy regression model," Fuzzy Sets and Systems, vol. 112, no. 3, pp. 355–369, 2000.
[20] P.-Y. Hao and J.-H. Chiang, "Fuzzy regression analysis by support vector learning approach," IEEE Transactions on Fuzzy Systems, vol. 16, no. 2, pp. 428–441, April 2008.
[21] A. Fischer and C. Igel, "Training restricted Boltzmann machines: An introduction," Pattern Recognition, vol. 47, no. 1, pp. 25–39, 2014.
[22] C.-Y. Zhang and C. Chen, "An automatic setting for training restricted Boltzmann machine," in Systems, Man and Cybernetics (SMC), 2014 IEEE International Conference on, 2014, pp. 4037–4041.
[23] M. Aoyagi, "Learning coefficient in Bayesian estimation of restricted Boltzmann machine," Journal of Algebraic Statistics, vol. 4, no. 1, pp. 31–58, 2013.
[24] M. Aoyagi and K. Nagata, "Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix-type singularity," Neural Computation, vol. 24, no. 6, pp. 1569–1610, 2012.
[25] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall, 1995.
[26] S. Dutta and M. Chakraborty, "Fuzzy relation and fuzzy function over fuzzy sets: a retrospective," Soft Computing, pp. 1–14, 2014.
[27] J. J. Buckley, Fuzzy Probabilities: New Approach and Applications, ser. Studies in Fuzziness and Soft Computing. Springer, 2009.
[28] W. Pedrycz, A. Skowron, and V. Kreinovich, Handbook of Granular Computing. Wiley-Interscience, 2008.
[29] W. A. Lodwick and J. Kacprzyk, Fuzzy Optimization: Recent Advances and Applications. Springer, 2010.
[30] N. N. Karnik and J. M. Mendel, "Centroid of a type-2 fuzzy set," Information Sciences, vol. 132, no. 1-4, pp. 195–220, 2001.
[31] O. Castillo and P. Melin, Type-2 Fuzzy Logic: Theory and Applications. Springer, 2008.
[32] G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[33] M. A. Carreira-Perpinan and G. E. Hinton, "On contrastive divergence learning," Artificial Intelligence and Statistics, 2005.

C. L. Philip Chen (S'88-M'88-SM'94-F'07) received the M.S. degree in electrical engineering from the University of Michigan, Ann Arbor, in 1985 and the Ph.D. degree in electrical engineering from Purdue University, West Lafayette, IN, in 1988. After having worked in the U.S. for 23 years as a tenured professor, a department head, and an associate dean in two different universities, he is currently the Dean of the Faculty of Science and Technology, University of Macau, Macau, China, and a Chair Professor of the Department of Computer and Information Science.
Dr. Chen is a Fellow of the IEEE and the AAAS. After being the IEEE SMC Society President (2012-2013), he is currently the Editor-in-Chief of IEEE Transactions on Systems, Man, and Cybernetics: Systems (2014-) and an associate editor of several IEEE Transactions. He is also the Chair of TC 9.1 Economic and Business Systems of IFAC. His research areas are systems, cybernetics, and computational intelligence. In addition, he is a program evaluator for the Accreditation Board of Engineering and Technology Education (ABET) in Computer Engineering, Electrical Engineering, and Software Engineering programs.

Chun-Yang Zhang received the B.S. degree in Mathematics from Beijing Normal University Zhuhai, China, in 2010 and the M.S. degree in Mathematics from the University of Macau, Macau, in 2012. He is currently working toward the Ph.D. degree in the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China.
His research interests include computational intelligence, data mining and machine learning algorithms.

Long Chen (M'11) received the B.S. degree in in-
[34] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An formation sciences from Peking University, Beijing,
introduction to variational methods for graphical models,” Machine China in 2000, the M.S.E. degree from Institute
Learning, vol. 37, no. 2, pp. 183–233, 1999. of Automation, Chinese Academy of Sciences in
[35] C. M. Bishop, Pattern Recognition and Machine Learning. Secaucus, 2003, the M.S. degree in computer engineering, from
NJ, USA: Springer-Verlag New York, Inc., 2006. University of Alberta, Canada in 2005, and the Ph.D.
degree in electrical engineering from the University
[36] G. E. Hinton, “Learning multiple layers of representation,” Trends in of Texas at San Antonio, USA in 2010. From 2010 to
Cognitive Sciences, vol. 11, no. 10, pp. 428 – 434, 2007. 2011 he was a postdoctoral fellow in the University
[37] C. P. Chen and C.-Y. Zhang, “Data-intensive applications, challenges, of Texas at San Antonio. Dr. Chen currently is an
techniques and technologies: A survey on big data,” Information Sci- assistant professor with the department of Computer
ences, vol. 275, no. 0, pp. 314 – 347, 2014. and Information Science, University of Macau, China.
His current research interests include computational intelligence, Bayesian
methods, and other machine learning techniques and their applications. Dr.
Chen has been working in publication matters for many IEEE conferences
and was the Publications Co-chair of the IEEE International Conference on
Systems, Man and Cybernetics (SMC) 2009, 2012, and 2014.
1063-6706 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.