28. Deep and Modular Neural Networks

Ke Chen

In this chapter, we focus on two important areas in neural computation, i.e., deep and modular neural networks, given the fact that both deep and modular neural networks are among the most powerful machine learning and pattern recognition techniques for complex AI problem solving. We begin by providing a general overview of deep and modular neural networks to describe the general motivation behind such neural architectures and the fundamental requirements imposed by complex AI problems. Next, we describe the background and motivation, methodologies, major building blocks, and the state-of-the-art hybrid learning strategy in the context of deep neural architectures. Then, we describe the background and motivation, taxonomy, and learning algorithms pertaining to various typical modular neural networks in a wider context. Furthermore, we examine relevant issues and discuss open problems in the deep and modular neural network research areas.

28.1 Overview
28.2 Deep Neural Networks
  28.2.1 Background and Motivation
  28.2.2 Building Blocks and Learning Algorithms
  28.2.3 Hybrid Learning Strategy
  28.2.4 Relevant Issues
28.3 Modular Neural Networks
  28.3.1 Background and Motivation
  28.3.2 Tightly Coupled Models
  28.3.3 Loosely Coupled Models
  28.3.4 Relevant Issues
28.4 Concluding Remarks
References

28.1 Overview
The human brain is a generic, effective, and efficient system that solves complex and difficult problems and generates the traits of intelligence and creativity. Neural computation has been inspired by brain-related research in different disciplines, e.g., biology and neuroscience, on various levels ranging from a single neuron to complex neuronal structures and organization [28.1]. Among the many discoveries in the brain-related sciences, two of the most important properties are the modularity and the hierarchy of neuronal organization in the human brain.

Neuroscientific research has revealed that the central nervous system (CNS) in the human brain is a distributed, massively parallel, and self-organizing modular system [28.1–3]. The CNS is composed of several regions such as the spinal cord, medulla oblongata, pons, midbrain, diencephalon, cerebellum, and the two cerebral hemispheres. Each such region forms a functional module, and all regions are interconnected with other parts of the brain [28.1]. In particular, the cerebral cortex consists of several regions attributed to the main perceptual and cognitive tasks, where modularity emerges in two different aspects: structural and functional modularity. Structural modularity is observable from the fact that there are sparse connections between different neuronal groups but neurons are often densely connected within a neuronal group, while functional modularity is evident from the different response patterns produced by neural modules for different perceptual and cognitive tasks. Modularity evidence in the human brain strongly suggests that domain-specific modules are required by specific tasks and that different modules can cooperate for high-level, complex tasks, which primarily motivates modular neural network (MNN) development in neural computation (NC) [28.4, 5].

Apart from modularity, the human brain also exhibits a functional and structural hierarchy, given the fact that information processing in the human brain is done in a hierarchical way. Previous studies [28.6, 7] suggested that there are different cortical visual areas that lead to hierarchical information representations to carry out highly complicated visual tasks, e.g., object recognition. In general, hierarchical information processing enables the human brain to accomplish complex perceptual and cognitive tasks in an effective and extremely efficient way, which mainly inspires the study of deep neural networks (DNNs) of multiple layers in NC.

In general, both DNNs and MNNs can be categorized into biologically plausible [28.8] and artificial models [28.9] in NC. The main difference between biologically plausible and artificial models lies in their methodologies: a biologically plausible model often takes both structural and functional resemblance to its biological counterpart into account, while an artificial model simply works towards modeling the functionality of a biological system without considering those bio-mimetic factors. Due to limited space, in this chapter we focus only on artificial DNNs and MNNs. Readers interested in biologically plausible models are referred to the literature, e.g., [28.4], for useful information.

28.2 Deep Neural Networks


In this section, we overview the main deep neural network (DNN) techniques with an emphasis on the latest progress. We first review the background and motivation for DNN development. Then we describe the major building blocks and the relevant learning algorithms for constructing different DNNs. Next, we present a hybrid learning strategy in the context of NC. Finally, we examine relevant issues related to DNNs.

28.2.1 Background and Motivation

The study of NC dates back to the 1940s, when McCulloch and Pitts modeled a neuron mathematically. After that, NC was an active area in AI studies until Minsky and Papert published their influential book, Perceptrons [28.10], in 1969. In the book, they formally proved the limited capacities of the single-layer perceptron and further concluded that there was little chance of expanding its capacities with a multi-layer version, which significantly slowed down NC research until the back-propagation (BP) algorithm was invented (or reinvented) to solve the learning problem in the multi-layer perceptron (MLP) [28.11].

In theory, the BP algorithm enables one to train an MLP of many hidden layers to form a powerful DNN. Such an attractive technique aroused tremendous enthusiasm for applying DNNs in different fields [28.9]. Apart from a few exceptions, e.g., [28.12], researchers soon found that an MLP of more than two hidden layers often failed [28.13], due to the well-known fact that MLP learning involves an extremely difficult non-convex optimization problem and the gradient-based local search used in the BP algorithm easily gets stuck in an unwanted local minimum. As a result, most researchers gradually gave up deep architectures and instead devoted their attention to shallow learning architectures with theoretical justification, e.g., the formal but non-constructive proof that an MLP with a single hidden layer can be a universal function approximator [28.14], and the support vector machine (SVM) [28.15]. It has been shown that shallow architectures often work well with the support of effective feature extraction techniques (although these are often handcrafted). However, recent theoretical justification suggests that learning models of insufficient depth have a fundamental weakness: they cannot efficiently represent the very complicated functions often required in complex AI tasks [28.16, 17].

To solve the complex non-convex optimization problem encountered in DNN learning, Hinton and his colleagues made a breakthrough by proposing a hybrid learning strategy in 2006 [28.18]. The novel learning strategy combines the unsupervised and supervised learning paradigms: layer-wise greedy unsupervised learning is first used to construct an initial DNN with chosen building blocks (such an initial DNN alone can also be used for different purposes, e.g., unsupervised feature learning [28.19]), and supervised learning is then carried out based on the pre-trained DNN. Their seminal work led to an emerging machine learning (ML) area, deep learning. As a result, different building blocks and learning algorithms have been developed to construct various DNNs. Both theoretical justification and empirical evidence suggest that the hybrid learning strategy [28.18] greatly facilitates the learning of DNNs [28.17].

Since 2006, DNNs trained with the hybrid learning strategy have been successfully applied in different and complex AI tasks, such as pattern
recognition [28.20–23], various computer vision tasks [28.24–26], audio classification and speech information processing [28.27–31], information retrieval [28.32–34], natural language processing [28.35–37], and robotics [28.38]. Thus, DNNs have become one of the most promising ML and NC techniques for tackling challenging AI problems [28.39].

28.2.2 Building Blocks and Learning Algorithms

In general, a building block is composed of two parametric models, an encoder and a decoder, as illustrated in Fig. 28.1. An encoder transforms a raw input or a low-level representation x into a high-level, abstract representation h(x), while a decoder generates an output x̂, a reconstructed version of x, from h(x). Learning a building block is a self-supervised learning task that minimizes an elaborate reconstruction cost function to find appropriate parameters in the encoder and decoder. Thus, the distinction between building blocks of different types lies in their encoder and decoder mechanisms and reconstruction cost functions (as well as the optimization algorithms used for parameter estimation). Below we describe different building blocks and their learning algorithms in terms of the generic architecture shown in Fig. 28.1.

Fig. 28.1 Schematic diagram of a generic building block architecture

Auto-Encoders
The auto-encoder [28.40] and its variants are simple building blocks used to build an MLP of many layers. An auto-encoder is carried out by an MLP of one hidden layer. As depicted in Fig. 28.2, the input and hidden layers constitute an encoder that generates an M-dimensional representation h(x) = (h_1(x), …, h_M(x))^T (hereinafter, we use the notation h(x) = (h_m(x))_{m=1}^M to indicate a vector–element relationship for simplicity of presentation) for a given input x = (x_n)_{n=1}^N in N-dimensional space

$$ h(x) = f(Wx + b_h) \,, $$

where W is the connection weight matrix between the input and hidden layers, b_h is the bias vector for all hidden neurons, and f(·) is a transfer function, e.g., the sigmoid function [28.9]. Let f(u) = (f(u_k))_{k=1}^K be a collective notation for the output of all K neurons in a layer. Accordingly, the hidden and output layers form a decoder that yields a reconstructed version x̂ = (x̂_n)_{n=1}^N

$$ \hat{x} = f(W^{T} h(x) + b_o) \,, $$

where W^T is the transpose of the weight matrix W and b_o is the bias vector for all output neurons. Note that the auto-encoder can be viewed as a special case of the auto-associator in which the same weights are tied to be used in the connections between different layers, as will be clearly seen in the learning algorithm later on. Doing so avoids an unwanted solution when an over-complete representation, i.e., M > N, is required [28.22].

Fig. 28.2 Auto-encoder architecture

Further studies [28.41] suggest that the auto-encoder is unlikely to lead to the discovery of a more useful representation than the input, despite the fact that a representation should encode much of the information conveyed in the input whenever the auto-encoder produces a good reconstruction of its input. As a result, a variant named the denoising auto-encoder (DAE) was proposed to capture stable structures underlying the distribution of its observed input. The basic idea is as follows: instead of learning the auto-encoder from the intact input, the DAE will be trained to recover the
original input from its distorted version obtained by partial destruction [28.41]. As illustrated in Fig. 28.3, the DAE leads to a more useful representation h(x̃) by restoring the corrupted input x̃ to a reconstructed version x̂ that is as close to the clean input x as possible. Thus, the encoder yields a representation

$$ h(\tilde{x}) = f(W \tilde{x} + b_h) \,, $$

and the decoder produces a restored version x̂ via the representation h(x̃)

$$ \hat{x} = f(W^{T} h(\tilde{x}) + b_o) \,. $$

Fig. 28.3 Denoising auto-encoder architecture

To produce a corrupted input, we need to distort a clean input by corrupting it with appropriate noise. Depending on the attribute nature of the input, three kinds of noise are used in the corruption process: isotropic Gaussian noise, N(0, σ²I); masking noise (setting some randomly chosen elements of x to zero); and salt-and-pepper noise (flipping some randomly chosen elements of x to the maximum or the minimum of a given range). Normally, Gaussian noise is used for input of real or continuous values, while masking and salt-and-pepper noise are applied to input of discrete values, e.g., pixel intensities of gray images. It is worth stating that the variance σ² of the Gaussian noise and the number of randomly chosen elements in masking and salt-and-pepper noise are hyper-parameters that affect DAE learning. By corrupting a clean input with the chosen noise, we obtain an example, (x̃, x), for self-supervised learning.

Given a training set of T examples, {(x_t, x_t)}_{t=1}^T (auto-encoder) or {(x̃_t, x_t)}_{t=1}^T (DAE), two reconstruction cost functions are commonly used for learning auto-encoders, as follows

$$ L(W, b_h, b_o) = \frac{1}{2T} \sum_{t=1}^{T} \sum_{n=1}^{N} (x_{tn} - \hat{x}_{tn})^2 \,, \qquad (28.1a) $$

$$ L(W, b_h, b_o) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{n=1}^{N} \left( x_{tn} \log \hat{x}_{tn} + (1 - x_{tn}) \log(1 - \hat{x}_{tn}) \right) \,. \qquad (28.1b) $$

The cost function in (28.1a) is used for input of real or discrete values, while the cost function in (28.1b) is employed especially for input of binary values.

To minimize the reconstruction cost functions in (28.1), applying the stochastic gradient descent algorithm [28.12] leads to a generic learning algorithm for training the auto-encoder and its variant, summarized as follows:

Auto-Encoder Learning Algorithm. Given a training set of T examples, {(z_t, x_t)}_{t=1}^T, where z_t = x_t for the auto-encoder or z_t = x̃_t for the DAE, and a transfer function f(·), randomly initialize all parameters, W, b_h, and b_o, in the auto-encoder and pre-set a learning rate η. Furthermore, the training set is randomly divided into several batches of T_B examples, {(z_t, x_t)}_{t=1}^{T_B}, and the parameters are then updated based on each batch:

• Forward computation
  For the input z_t (t = 1, …, T_B), the output of the hidden layer is
  $$ h(z_t) = f(u_h(z_t)) \,, \quad u_h(z_t) = W z_t + b_h \,, $$
  and the output of the output layer is
  $$ \hat{x}_t = f(u_o(z_t)) \,, \quad u_o(z_t) = W^{T} h(z_t) + b_o \,. $$

• Backward gradient computation
  For the cost function in (28.1a),
  $$ \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} = \left( (\hat{x}_{tn} - x_{tn}) f'(u_{o,n}(z_t)) \right)_{n=1}^{N} \,, $$
  where f'(·) is the first-order derivative function of f(·). For the cost function in (28.1b),
  $$ \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} = \hat{x}_t - x_t \,. $$
  Then, the gradient with respect to h(z_t) is
  $$ \frac{\partial L(W, b_h, b_o)}{\partial h(z_t)} = W \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} \,. $$
  Applying the chain rule gives the gradient with respect to u_h(z_t) as
  $$ \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} = \left( \frac{\partial L(W, b_h, b_o)}{\partial h_m(z_t)} f'(u_{h,m}(z_t)) \right)_{m=1}^{M} \,. $$
  Gradients with respect to the biases are
  $$ \frac{\partial L(W, b_h, b_o)}{\partial b_o} = \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} \,, \quad \text{and} \quad \frac{\partial L(W, b_h, b_o)}{\partial b_h} = \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} \,. $$

• Parameter update
  Applying the gradient descent method and the tied weights leads to the update rules
  $$ W \leftarrow W - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \left[ \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} \, z_t^{T} + h(z_t) \left( \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} \right)^{T} \right] \,, $$
  $$ b_o \leftarrow b_o - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} \,, \quad \text{and} \quad b_h \leftarrow b_h - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} \,. $$

The above three steps are repeated for all batches, which leads to a training epoch. The learning algorithm runs iteratively until a termination condition is met (typically based on a cross-validation procedure [28.12]).
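To make the tied-weight update concrete, the following is a minimal NumPy sketch of the generic learning algorithm above for a (denoising) auto-encoder with sigmoid units and the squared-error cost (28.1a). It is an illustrative sketch rather than a reference implementation: the function name train_dae, the masking-noise corruption, and all hyper-parameter values are assumptions chosen for the example (setting mask_prob to zero recovers the plain auto-encoder).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_dae(X, M, lr=0.1, epochs=10, batch=20, mask_prob=0.3, rng=None):
    """Tied-weight (denoising) auto-encoder trained with the squared-error
    cost (28.1a) and stochastic gradient descent. X is a T-by-N data matrix."""
    rng = rng or np.random.default_rng(0)
    T, N = X.shape
    W = 0.01 * rng.standard_normal((M, N))   # shared encoder/decoder weights
    bh, bo = np.zeros(M), np.zeros(N)
    for _ in range(epochs):
        for start in range(0, T, batch):
            x = X[start:start + batch]                    # clean targets
            z = x * (rng.random(x.shape) > mask_prob)     # masking noise; z = x gives a plain AE
            # forward computation
            uh = z @ W.T + bh;  h = sigmoid(uh)
            uo = h @ W + bo;    xhat = sigmoid(uo)
            # backward gradients for cost (28.1a); f'(u) = f(u)(1 - f(u)) for the sigmoid
            d_uo = (xhat - x) * xhat * (1 - xhat)
            d_h = d_uo @ W.T
            d_uh = d_h * h * (1 - h)
            # parameter update with tied weights: encoder and decoder gradients are summed
            W -= lr * (d_uh.T @ z + h.T @ d_uo) / len(x)
            bo -= lr * d_uo.mean(axis=0)
            bh -= lr * d_uh.mean(axis=0)
    return W, bh, bo
```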

The Restricted Boltzmann Machine
Strictly speaking, the restricted Boltzmann machine (RBM) [28.42] is an energy-based generative model, a simplified version of the generic Boltzmann machine. As illustrated in Fig. 28.4, an RBM can be viewed as a probabilistic NN of two layers, i.e., a visible and a hidden layer, with bi-directional connections. Unlike the Boltzmann machine, there are no lateral connections among neurons in the same layer in an RBM. With the bottom-up connections from the visible to the hidden layer, the RBM forms an encoder that yields a probabilistic representation h = (h_m)_{m=1}^M for input data v = (v_n)_{n=1}^N

$$ P(h \mid v) = \prod_{m=1}^{M} P(h_m \mid v) \,, \quad P(h_m \mid v) = \sigma\!\left( \sum_{n=1}^{N} W_{mn} v_n + b_{h,m} \right) \,, \qquad (28.2) $$

where W_{mn} is the connection weight between the visible neuron n and the hidden neuron m, b_{h,m} is the bias of the hidden neuron m, and σ(u) = 1/(1 + e^{−u}) is the sigmoid transfer function. As h_m is assumed to take a binary value, i.e., h_m ∈ {0, 1}, P(h_m|v) is interpreted as the probability of h_m = 1. Accordingly, the RBM performs a probabilistic decoder via the top-down connections from the hidden to the visible layer to reconstruct an input with the probability

$$ P(v \mid h) = \prod_{n=1}^{N} P(v_n \mid h) \,, \quad P(v_n \mid h) = \sigma\!\left( \sum_{m=1}^{M} W_{nm} h_m + b_{v,n} \right) \,, \qquad (28.3) $$

where W_{nm} is the connection weight between the hidden neuron m and the visible neuron n, and b_{v,n} is the bias of the visible neuron n. Like the connection weights in auto-encoders, the bi-directional connection weights are tied, i.e., W_{mn} = W_{nm}, as shown in Fig. 28.4. By learning a parametric model of the data distribution P(v) derived from the joint probability P(v, h) for a given data set, the RBM yields a probabilistic representation that tends to reconstruct any data subject to P(v).

Fig. 28.4 Restricted Boltzmann machine (RBM) architecture
The joint probability P(v, h) is defined based on the following energy function for v_n ∈ {0, 1}

$$ E(v, h) = -\sum_{m=1}^{M} \sum_{n=1}^{N} W_{mn} h_m v_n - \sum_{m=1}^{M} h_m b_{h,m} - \sum_{n=1}^{N} v_n b_{v,n} \,. \qquad (28.4) $$

As a result, the joint probability is subject to the Boltzmann distribution

$$ P(v, h) = \frac{e^{-E(v,h)}}{\sum_{v} \sum_{h} e^{-E(v,h)}} \,. \qquad (28.5) $$

Thus, we obtain the data probability by marginalizing the joint probability as follows

$$ P(v) = \sum_{h} P(v, h) = \sum_{h} P(v \mid h) P(h) \,. $$

In order to achieve the most likely reconstruction, we need to maximize the log-likelihood of P(v). Therefore, the reconstruction cost function of an RBM is its negative log-likelihood function

$$ L(W, b_h, b_v) = -\log P(v) = -\log \sum_{h} P(v \mid h) P(h) \,. \qquad (28.6) $$

From (28.5) and (28.6), it is observed that the direct use of a gradient descent method to find optimal parameters often leads to intractable computation, due to the fact that the exponential number of possible hidden-layer configurations needs to be summed over in (28.5) and then used in (28.6). Fortunately, an approximation algorithm named contrastive divergence (CD) has been proposed to solve this problem [28.42]. The key ideas behind the CD algorithm are (i) using Gibbs sampling based on the conditional distributions in (28.2) and (28.3), and (ii) running only a few iterations of Gibbs sampling by treating the data x input to an RBM as the initial state, i.e., v^0 = x, of the Markov chain at the visible layer. Many studies have suggested that the use of only one iteration of the Markov chain in the CD algorithm works well for building up a deep belief network (DBN) in practice [28.17, 18, 22], and hence the algorithm is dubbed CD-1 in this situation, a special case of the CD-k algorithm that executes k iterations of the Markov chain in the Gibbs sampling process.

Fig. 28.5 Gibbs sampling process in the CD-1 algorithm

Figure 28.5 illustrates the Gibbs sampling process used in the CD-1 algorithm as follows:

i) Estimate the probabilities P(h_m^0 | v^0), for m = 1, …, M, with the encoder defined in (28.2) and then form a realization of h^0 by sampling with these probabilities.
ii) Apply the decoder defined in (28.3) to estimate the probabilities P(v_n^1 | h^0), for n = 1, …, N, and then produce a reconstruction v^1 via sampling.
iii) With the reconstruction, estimate the probabilities P(h_m^1 | v^1), for m = 1, …, M, with the encoder.

With the above Gibbs sampling procedure, the CD-1 algorithm is summarized as follows:

Algorithm 28.1 RBM CD-1 Learning Algorithm
Given a training set of T instances, {x_t}_{t=1}^T, randomly initialize all parameters, W, b_h, and b_v, in an RBM and pre-set a learning rate η:

• Positive phase
  – Present an instance to the visible layer, i.e., v^0 = x_t.
  – Estimate probabilities with the encoder: P̂(h^0 | v^0) = (P(h_m^0 | v^0))_{m=1}^M by using (28.2).
• Negative phase
  – Form a realization of h^0 by sampling with the probabilities P̂(h^0 | v^0).
  – With the realization of h^0, apply the decoder to estimate the probabilities P̂(v^1 | h^0) = (P(v_n^1 | h^0))_{n=1}^N by using (28.3), and then produce a reconstruction v^1 via sampling based on P̂(v^1 | h^0).
  – With the encoder and the reconstruction, estimate the probabilities P̂(h^1 | v^1) = (P(h_m^1 | v^1))_{m=1}^M by using (28.2).
• Parameter update
  Based on the Gibbs sampling results in the positive and
  the negative phases, the parameters are updated as follows:
  $$ W \leftarrow W + \eta \left( \hat{P}(h^0 \mid v^0)(v^0)^{T} - \hat{P}(h^1 \mid v^1)(v^1)^{T} \right) \,, $$
  $$ b_h \leftarrow b_h + \eta \left( \hat{P}(h^0 \mid v^0) - \hat{P}(h^1 \mid v^1) \right) \,, \quad \text{and} \quad b_v \leftarrow b_v + \eta \left( v^0 - v^1 \right) \,. $$

The above three steps are repeated for all instances in the given training set, which leads to a training epoch. The learning algorithm runs iteratively until it converges.
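As an illustration of Algorithm 28.1, the following NumPy sketch performs one CD-1 training epoch for a binary RBM, updating the parameters after each instance as described above. It is a minimal sketch under the stated assumptions (per-instance updates, sigmoid conditionals from (28.2) and (28.3)); the function name and the learning rate are illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cd1_epoch(X, W, bh, bv, lr=0.05, rng=None):
    """One training epoch of CD-1 for a binary RBM.
    X is a T-by-N matrix of binary inputs; W is M-by-N."""
    rng = rng or np.random.default_rng(0)
    for x in X:
        v0 = x
        # positive phase: P(h = 1 | v0) via the encoder (28.2)
        p_h0 = sigmoid(W @ v0 + bh)
        # negative phase: sample h0, reconstruct v1, re-estimate hidden probabilities
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(W.T @ h0 + bv)                 # decoder (28.3)
        v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
        p_h1 = sigmoid(W @ v1 + bh)
        # parameter update from the positive/negative statistics
        W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
        bh += lr * (p_h0 - p_h1)
        bv += lr * (v0 - v1)
    return W, bh, bv
```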
Predictive Sparse Decomposition
Predictive sparse decomposition (PSD) [28.43] is a building block obtained by combining sparse coding [28.44] and auto-encoder ideas. In a PSD building block, the encoder is specified by

$$ h(x_t) = G \tanh(W_E x_t + b_h) \,, \qquad (28.7) $$

where W_E is the M × N connection matrix between the input and hidden neurons in the encoder, G = diag(g_{mm}) is an M × M learnable diagonal gain matrix for an M-dimensional representation of an N-dimensional input x_t, b_h holds the biases of the hidden neurons, and tanh(·) is the hyperbolic tangent transfer function [28.9]. Accordingly, the decoder is implemented by the linear mapping used in sparse coding [28.44]

$$ \hat{x}_t = W_D h(x_t) \,, \qquad (28.8) $$

where W_D is the N × M connection matrix between the hidden and output neurons in the decoder, and each column of W_D always needs to be normalized to a unit vector to avoid trivial solutions [28.44].

Given a training set of T instances, {x_t}_{t=1}^T, the PSD cost function is defined as

$$ L_{PSD}\!\left( G, W_E, W_D, b_h; h^{*}(x_t) \right) = \sum_{t=1}^{T} \left( \| W_D h^{*}(x_t) - x_t \|_2^2 + \alpha \| h^{*}(x_t) \|_1 + \beta \| h^{*}(x_t) - h(x_t) \|_2^2 \right) \,, \qquad (28.9) $$

where h*(x_t) is the optimal sparse hidden representation of x_t, while h(x_t) is the output of the encoder in (28.7) based on the current parameter values. In (28.9), α and β are two hyper-parameters that control the regularization strengths, and ‖·‖_1 and ‖·‖_2 are the L1 and L2 norms, respectively. Intuitively, in the multi-objective cost function defined in (28.9), the first term specifies the reconstruction errors, the second term refers to the magnitude of non-sparse representations, and the last term drives the encoder towards yielding the optimal representation.

For learning a PSD building block, the cost function in (28.9) needs to be optimized simultaneously with respect to the hidden representation and all the parameters. As a result, a learning algorithm of two alternating steps has been proposed to solve this problem [28.43], as follows:

Algorithm 28.2 PSD Learning Algorithm
Given a training set of T instances, {x_t}_{t=1}^T, randomly initialize all the parameters, W_E, W_D, G, b_h, and the optimal sparse representations {h*(x_t)}_{t=1}^T in a PSD building block, and pre-set the hyper-parameters α and β as well as the learning rates η_i (i = 1, …, 4):

• Optimal representation update
  In this step, the gradient descent method is applied to find the optimal sparse representation based on the current parameter values of the encoder and the decoder, which leads to the following update rule
  $$ h^{*}(x_t) \leftarrow h^{*}(x_t) - \eta_1 \left( \alpha\, \mathrm{sign}(h^{*}(x_t)) + \beta \left( h^{*}(x_t) - h(x_t) \right) + (W_D)^{T} \left( W_D h^{*}(x_t) - x_t \right) \right) \,, $$
  where sign(·) is the sign function: sign(u) = ±1 if u ≷ 0 and sign(u) = 0 if u = 0.

• Parameter update
  In this step, h*(x_t) obtained in the above step is fixed. Then the gradient descent method is applied to the cost function (28.9) with respect to all encoder and decoder parameters, which results in the following update rules
  $$ W_E \leftarrow W_E - \eta_2\, g(x_t)(x_t)^{T} \,, \quad b_h \leftarrow b_h - \eta_3\, g(x_t) \,. $$
  Here g(x_t) is obtained by
  $$ g(x_t) = \left( g_{mm}^{-1} \left( g_{mm}^2 - h_m^2(x_t) \right) \left( h_m(x_t) - h_m^{*}(x_t) \right) \right)_{m=1}^{M} \,. $$
  Furthermore,
  $$ G \leftarrow G - \eta_2\, \mathrm{diag}\!\left[ \left( h_m^{*}(x_t) - h_m(x_t) \right) \tanh\!\left( \sum_{n=1}^{N} [W_E]_{mn} x_{tn} + b_{h,m} \right) \right] \,, $$
  and
  $$ W_D \leftarrow W_D - \eta_4 \left( W_D h^{*}(x_t) - x_t \right) \left( h^{*}(x_t) \right)^{T} \,. $$
  Normalize each column of W_D such that ‖[W_D]_n‖_2^2 = 1.

The above two steps are repeated for all the instances in the given training set, which leads to a training epoch. The learning algorithm runs iteratively until it converges.
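The alternating PSD updates of Algorithm 28.2 can be sketched as follows for a single instance, with the diagonal gain matrix G stored as a vector of gains g_mm. This is only an illustrative sketch: the inner-loop length, the learning rates, and the grouping of constant factors (e.g., 2β) into the learning rates are assumptions, and the sign conventions follow the gradients of (28.9) rather than matching the text line by line.

```python
import numpy as np

def psd_step(x, WE, WD, G, bh, h_star, alpha=0.5, beta=1.0,
             lrs=(0.01, 0.001, 0.001, 0.001), n_inner=20):
    """One pass of the two alternating PSD updates for a single instance x:
    first refine the optimal code h_star, then update the encoder/decoder
    parameters with h_star held fixed. G is the vector of diagonal gains."""
    lr1, lr2, lr3, lr4 = lrs
    # optimal representation update (sparse code inference)
    for _ in range(n_inner):
        grad = alpha * np.sign(h_star) \
               + beta * (h_star - G * np.tanh(WE @ x + bh)) \
               + WD.T @ (WD @ h_star - x)
        h_star = h_star - lr1 * grad
    # parameter update with h_star fixed
    u = WE @ x + bh
    h = G * np.tanh(u)                      # encoder output (28.7)
    g = (G - h**2 / G) * (h - h_star)       # gradient w.r.t. the encoder pre-activation
    WE -= lr2 * np.outer(g, x)
    bh -= lr3 * g
    G -= lr2 * (h - h_star) * np.tanh(u)
    WD -= lr4 * np.outer(WD @ h_star - x, h_star)
    WD /= np.linalg.norm(WD, axis=0, keepdims=True)   # keep decoder columns unit-norm
    return WE, WD, G, bh, h_star
```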
Other Building Blocks
While the auto-encoders and the RBM are the building blocks most widely used to construct DNNs, there are other building blocks that are either derived from existing building blocks for performance improvement or developed from an alternative principle. Such building blocks include regularized auto-encoders and RBM variants. Due to limited space, we briefly overview them below.

Recently, a number of auto-encoder variants have been developed by adding a regularization term to the standard reconstruction cost function in (28.1) and are hence dubbed regularized auto-encoders. The contractive auto-encoder (CAE) is a typical regularized version of the auto-encoder, obtained by introducing the norm of the Jacobian matrix of the encoder, evaluated at each training example x_t, into the standard reconstruction cost function [28.45]

$$ L_{CAE}(W, b_h, b_o) = L(W, b_h, b_o) + \alpha \sum_{t=1}^{T} \| J(x_t) \|_F^2 \,, \qquad (28.10) $$

where α is a trade-off parameter that controls the regularization strength and ‖J(x_t)‖_F^2 is the squared Frobenius norm of the Jacobian matrix of the encoder, calculated as

$$ \| J(x_t) \|_F^2 = \sum_{m=1}^{M} \sum_{n=1}^{N} \left( f'(u_{h,m}(x_t)) \right)^2 W_{mn}^2 \,. $$

Here, f'(·) is the first-order derivative of the transfer function f(·), and f'(u_{h,m}(x_t)) = h_m(x_t)(1 − h_m(x_t)) when f(·) is the sigmoid function [28.9]. It is straightforward to apply the stochastic gradient method [28.12] to the CAE cost function in (28.10) to derive a learning algorithm for training a CAE. Furthermore, an improved version of the CAE was also proposed by penalizing additional higher-order derivatives [28.46].

The sparse auto-encoder (SAE) is another class of regularized auto-encoders. The basic idea underlying SAEs is the introduction of a sparse regularization term, working on either the hidden neuron biases, e.g., [28.47], or their outputs, e.g., [28.48], into the standard reconstruction cost function. Different forms of sparsity penalty, e.g., the L1 norm and Student-t, are employed for regularization, and the learning algorithm is derived by applying a coordinate descent optimization procedure to the new reconstruction cost function [28.47, 48].

The RBM described above works only for input of binary values. When the input has real values, a variant named the Gaussian RBM (GRBM) [28.49] has been proposed with the following energy function

$$ E(v, h) = -\sum_{m=1}^{M} \sum_{n=1}^{N} W_{mn} h_m \frac{v_n}{\sigma_n} - \sum_{m=1}^{M} h_m b_{h,m} - \sum_{n=1}^{N} \frac{(v_n - b_{v,n})^2}{2 \sigma_n^2} \,, \qquad (28.11) $$

where σ_n is the standard deviation of the Gaussian noise for the visible neuron n. In the CD learning algorithm, the update rule for the hidden neurons remains the same except that each v_n is substituted by v_n/σ_n, and the update rule for all visible neurons needs to use reconstructions v produced by sampling from a Gaussian distribution with mean σ_n Σ_{m=1}^M W_{nm} h_m + b_{v,n} and variance σ_n^2 for n = 1, …, N. In addition, an improved GRBM was also proposed by introducing an alternative parameterization of the energy function in (28.11) and incorporating it into the CD algorithm [28.50]. Other RBM variants will be discussed later on, as they often play a different role from being used to construct a DNN.

28.2.3 Hybrid Learning Strategy

Based on the building blocks described in Sect. 28.2.2, we describe a systematic approach to establishing a feed-forward DNN for supervised and semi-supervised learning. This approach employs a hybrid learning strategy that combines the unsupervised and supervised learning paradigms to overcome the optimization difficulty in training DNNs. The hybrid learning
strategy [28.18, 40] first applies layer-wise greedy unsupervised learning to set up a DNN and initialize its parameters with input data only, and then uses a global supervised learning algorithm with teacher information to train all the parameters in the initialized DNN for a given task.

Layer-Wise Unsupervised Learning
In the hybrid learning strategy, unsupervised learning is a layer-wise greedy learning process that constructs a DNN with a chosen building block and initializes its parameters in a layer-by-layer way.

Suppose we want to establish a DNN of K (K > 1) hidden layers; denote the output of hidden layer k as h_k(x) (k = 1, …, K) for a given input x, and the output of the output layer as o(x), respectively. To facilitate the presentation, we stipulate h_0(x) = x. Then, the generic layer-wise greedy learning procedure can be summarized as follows:

Algorithm 28.3 Layer-Wise Greedy Learning Procedure
Given a training set of T instances, {x_t}_{t=1}^T, randomly initialize the parameters in a chosen building block and pre-set all hyper-parameters required for learning such a building block:

• Train a building block for hidden layer k
  – Set the number of neurons required by hidden layer k to be the dimension of the hidden representation in the chosen building block.
  – Use the training data set {h_{k−1}(x_t)}_{t=1}^T to train the building block and obtain its optimal parameters.
• Construct a DNN up to hidden layer k
  With the building block trained in the above step, discard its decoder part, including all associated parameters, and stack its hidden layer on the existing DNN with the connection weights of the encoder and the biases of the hidden neurons obtained in the above step (the input layer h_0(x) = x is viewed as the starting architecture of the DNN).

The above steps are repeated for k = 1, …, K. Then, the output layer o(x) is stacked onto hidden layer K with randomly initialized connection weights so as to finalize the initial DNN construction and its parameter initialization.

Fig. 28.6a,b Construction of a DNN with a building block via layer-wise greedy learning. (a) Auto-encoder or its variants. (b) RBM or its variants

Figure 28.6 illustrates two typical instances of constructing an initial DNN via the layer-wise greedy learning procedure described above. Figure 28.6a shows a schematic diagram of the layer-wise greedy learning process with the auto-encoder or its variants; to construct hidden layer k, the output layer and its associated parameters W_k^T and b_{o,k} are removed and the remaining part is stacked onto hidden layer k − 1, and W_o is a randomly initialized weight matrix for the connection between hidden layer K and the output layer. When a DNN is constructed with the RBM or its variants, all backward connection weights in the decoder are abandoned after training and only the hidden layer with the forward connection weights and the biases of the hidden neurons is used to construct the DNN, as depicted in Fig. 28.6b.
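The following NumPy sketch mirrors Algorithm 28.3 with a tied-weight auto-encoder as the building block: each block is trained on the output of the previous hidden layer, its decoder is discarded, and only the encoder parameters are stacked. The helper names (train_block, greedy_pretrain) and the per-example sigmoid/MSE training inside train_block are illustrative assumptions, not the only possible choice of building block.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_block(H, M, lr=0.1, epochs=5, rng=None):
    """Train one tied-weight auto-encoder building block on the data H
    (T-by-D) and return only its encoder parameters (W, bh)."""
    rng = rng or np.random.default_rng(0)
    T, D = H.shape
    W, bh, bo = 0.01 * rng.standard_normal((M, D)), np.zeros(M), np.zeros(D)
    for _ in range(epochs):
        for x in H:
            h = sigmoid(W @ x + bh)
            xhat = sigmoid(W.T @ h + bo)
            d_uo = (xhat - x) * xhat * (1 - xhat)
            d_uh = (W @ d_uo) * h * (1 - h)
            W -= lr * (np.outer(d_uh, x) + np.outer(h, d_uo))
            bo -= lr * d_uo
            bh -= lr * d_uh
    return W, bh

def greedy_pretrain(X, hidden_sizes):
    """Layer-wise greedy construction of a DNN: train a building block on
    the current representation, keep its encoder, and feed its hidden
    output to the next block."""
    layers, H = [], X
    for M in hidden_sizes:
        W, bh = train_block(H, M)
        layers.append((W, bh))
        H = sigmoid(H @ W.T + bh)     # h_k(x) becomes the input for layer k+1
    return layers                      # an output layer with random weights is stacked on top later
```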
Global Supervised Learning
Once a DNN is constructed and initialized based on the layer-wise greedy learning procedure, it is ready to be further trained in a supervised fashion for a classification or regression task. There is a variety of optimization methods for supervised learning, e.g., stochastic gradient descent and the second-order Levenberg–Marquardt method [28.9, 12]. Also, there are cost functions of different forms used for various supervised learning tasks and for regularization towards improving the generalization of a DNN. Due to limited space, we only review the stochastic gradient descent algorithm with a generic cost function for global supervised learning.

For a generic cost function L(Θ, D), where Θ is a collective notation for all parameters in a DNN (Fig. 28.6) and D is a training data set for a given supervised learning task, applying the stochastic gradient descent method [28.9, 12] to L(Θ, D) leads to the following learning algorithm for fine-tuning the parameters.

Algorithm 28.4 Global Supervised Learning Algorithm
Given a training set of T examples, D = {(x_t, y_t)}_{t=1}^T, pre-set a learning rate η (and other hyper-parameters if required). Furthermore, the training set is randomly divided into many mini-batches of T_B examples, {(x_t, y_t)}_{t=1}^{T_B}, and the parameters are then updated based on each mini-batch. Θ = ({W_k}_{k=1}^{K+1}, {b_k}_{k=1}^{K+1}) are all the parameters in the DNN, where W_k is the weight matrix for the connection between hidden layers k and k − 1, and b_k holds the biases of the neurons in layer k (Fig. 28.6). Here, the input and output layers are stipulated as layers 0 and K + 1, respectively, i.e., h_0(x_t) = x_t, W_o = W_{K+1}, b_o = b_{K+1}, and o(x_t) = h_{K+1}(x_t):

• Forward computation
  Given the input x_t, for k = 1, …, K + 1, the output of layer k is
  $$ h_k(x_t) = f(u_k(x_t)) \,, \quad u_k(x_t) = W_k h_{k-1}(x_t) + b_k \,. $$

• Backward gradient computation
  Given a cost function on each mini-batch, L_B(Θ, D), calculate the gradients at the output layer, i.e.,
  $$ \frac{\partial L_B(\Theta, D)}{\partial h_{K+1}(x_t)} = \frac{\partial L_B(\Theta, D)}{\partial o(x_t)} \,, \quad \frac{\partial L_B(\Theta, D)}{\partial u_{K+1}(x_t)} = \left( \frac{\partial L_B(\Theta, D)}{\partial h_{(K+1)j}(x_t)} f'(u_{(K+1)j}(x_t)) \right)_{j=1}^{|o|} \,, $$
  where f'(·) is the first-order derivative of the transfer function f(·).
  For all hidden layers, i.e., k = K, …, 1, applying the chain rule leads to
  $$ \frac{\partial L_B(\Theta, D)}{\partial u_k(x_t)} = \left( \frac{\partial L_B(\Theta, D)}{\partial h_{kj}(x_t)} f'(u_{kj}(x_t)) \right)_{j=1}^{|h_k|} \,, \quad \text{and} \quad \frac{\partial L_B(\Theta, D)}{\partial h_k(x_t)} = \left( W_{k+1} \right)^{T} \frac{\partial L_B(\Theta, D)}{\partial u_{k+1}(x_t)} \,. $$
  For k = K + 1, …, 1, the gradients with respect to the biases of layer k are
  $$ \frac{\partial L_B(\Theta, D)}{\partial b_k} = \frac{\partial L_B(\Theta, D)}{\partial u_k(x_t)} \,. $$

• Parameter update
  Applying the gradient descent method results in the following update rules: for k = K + 1, …, 1,
  $$ W_k \leftarrow W_k - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \frac{\partial L_B(\Theta, D)}{\partial u_k(x_t)} \left( h_{k-1}(x_t) \right)^{T} \,, \quad b_k \leftarrow b_k - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \frac{\partial L_B(\Theta, D)}{\partial u_k(x_t)} \,. $$

The above three steps are repeated for all mini-batches, which leads to a training epoch. The learning algorithm runs iteratively until a termination condition is met (typically based on a cross-validation procedure [28.12]).

For the above learning algorithm, the BP algorithm [28.11] is a special case in which the transfer function is the sigmoid function, i.e., f(u) = σ(u), and the cost function is the mean square error (MSE) function, i.e., for each mini-batch

$$ L_B(\Theta, D) = \frac{1}{2 T_B} \sum_{t=1}^{T_B} \| o(x_t) - y_t \|_2^2 \,. $$

Thus, we have

$$ \sigma'(u) = \sigma(u)(1 - \sigma(u)) \,, \quad \frac{\partial L_B(\Theta, D)}{\partial o(x_t)} = \frac{1}{T_B} \sum_{t=1}^{T_B} \left( o(x_t) - y_t \right) \,, $$
and

$$ \frac{\partial L_B(\Theta, D)}{\partial u_{K+1}(x_t)} = \left[ \frac{1}{T_B} \sum_{t=1}^{T_B} \left( o_j(x_t) - y_{tj} \right) o_j(x_t) \left( 1 - o_j(x_t) \right) \right]_{j=1}^{|o|} \,. $$
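Putting the pieces together, the sketch below fine-tunes a greedily pre-trained stack in the sigmoid/MSE special case of Algorithm 28.4, i.e., plain back-propagation over mini-batches. It assumes a list of (W, b) encoder pairs such as the one returned by the hypothetical greedy_pretrain helper sketched earlier and appends a randomly initialized output layer; the names and hyper-parameters are illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def finetune(layers, X, Y, lr=0.1, epochs=20, batch=10, rng=None):
    """Global supervised fine-tuning with sigmoid transfer functions and the
    MSE cost. `layers` is a list of (W, b) pairs from greedy pre-training;
    a randomly initialized output layer is appended as layer K+1."""
    rng = rng or np.random.default_rng(0)
    Ws = [W.copy() for W, _ in layers]
    Ws.append(0.01 * rng.standard_normal((Y.shape[1], layers[-1][0].shape[0])))
    bs = [b.copy() for _, b in layers] + [np.zeros(Y.shape[1])]
    for _ in range(epochs):
        for s in range(0, len(X), batch):
            x, y = X[s:s + batch], Y[s:s + batch]
            # forward computation through all K+1 layers
            hs = [x]
            for W, b in zip(Ws, bs):
                hs.append(sigmoid(hs[-1] @ W.T + b))
            # backward gradient computation (delta = dL_B / du_k, batch-averaged)
            delta = (hs[-1] - y) * hs[-1] * (1 - hs[-1]) / len(x)
            for k in reversed(range(len(Ws))):
                gW, gb = delta.T @ hs[k], delta.sum(axis=0)
                if k > 0:
                    delta = (delta @ Ws[k]) * hs[k] * (1 - hs[k])
                Ws[k] -= lr * gW
                bs[k] -= lr * gb
    return Ws, bs
```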
28.2.4 Relevant Issues

In the literature, the hybrid learning strategy described in Sect. 28.2.3 is often called the semi-supervised learning strategy [28.17, 39]. Nevertheless, semi-supervised learning implies a situation in which there are few labeled examples but many unlabeled instances in a training set. Indeed, such a strategy works well in a situation where both the unlabeled and the labeled data in a training set are used for layer-wise greedy learning, and only the labeled data are used for fine-tuning in global supervised learning. However, other studies, e.g., [28.28, 29, 51], also show that this strategy can considerably improve the generalization of a DNN even when there are abundant labeled examples in a training data set. Hence we would rather name it hybrid learning. On the other hand, our review focuses only on primary supervised learning tasks in the context of NC. In a wider context, the unsupervised learning process itself develops a novel approach to automatic feature discovery/extraction via learning, which is an emerging ML area named representation learning [28.39]. In such a context, some DNNs can perform as generative models. For instance, the DBN [28.18] is an RBM-based DNN obtained by retaining both forward and backward connections during layer-wise greedy learning. To be a generative model, the DBN needs an alternative learning algorithm, e.g., the wake–sleep algorithm [28.18], for global unsupervised learning. In general, global unsupervised learning for a generative DNN is still a challenging problem.

While the hybrid learning strategy has been successfully applied to many complex AI tasks, in general it is still not entirely clear why such a strategy works well empirically. A recent study attempted to provide some justification of the role played by layer-wise greedy learning for supervised learning [28.51]. The findings can be summarized as follows: such an unsupervised learning process brings about a regularization effect that initializes the DNN parameters towards the basin of attraction corresponding to a good local minimum, which facilitates global supervised learning in terms of generalization [28.51]. In general, a deeper understanding of such a learning strategy will be required in the future. On the other hand, a successful story was recently reported [28.52] where no unsupervised pre-training was used in the DNN learning for a non-trivial task, which poses another open problem as to when and where such a learning strategy must be employed for training a DNN to yield a satisfactory generalization performance.

Recent studies also suggest that the use of artificially distorted or deformed training data and unsupervised front-ends can considerably improve the performance of DNNs regardless of the hybrid learning strategy. As DNN learning is of a data-driven nature, augmenting training data with known input deformations amounts to the use of more representative examples conveying the intrinsic variations underlying a class of data in learning. For example, speech corrupted by some known channel noise and images deformed by affine transformations and added noise have significantly improved DNN performance in various speech and visual information processing tasks [28.22, 28, 29, 51, 52]. On the other hand, the generic building blocks reviewed in Sect. 28.2.2 can be extended to become specialist front-ends by exploiting intrinsic data structures. For instance, the RBM has several variants, e.g., [28.53–55], that capture covariance and other statistical information underlying an image. After unsupervised learning, such front-ends generate powerful representations that greatly facilitate further DNN learning in visual information processing.

While our review focuses only on fully connected feed-forward DNNs, there are alternative and more effective DNN architectures for specific tasks. Convolutional DNNs [28.12] make use of topological locality constraints underlying images to form a more effective, locally connected DNN architecture. Furthermore, various pooling techniques [28.56] used in convolutional DNNs facilitate learning invariant and robust features. With appropriate building blocks, e.g., the PSD reviewed in Sect. 28.2.2, convolutional DNNs work very well with the hybrid learning strategy [28.43, 57]. In addition, novel DNN architectures need to be developed by exploring the nature of a specific problem; e.g., a regularized Siamese DNN was recently developed for generic speaker-specific information extraction [28.28, 29]. As a result, novel DNN architecture development and model selection are among the important DNN research topics.

Finally, theoretical justification of deep learning and the hybrid learning strategy, along with other recently developed techniques, e.g., parallel graphics processing unit (GPU) computing, enables researchers to develop
large-scale DNNs to tackle very complex real-world problems. While some theoretical justification has been provided in the literature, e.g., [28.16, 17, 39], to show the strengths of DNNs in terms of their potential capacity and efficient representational schemes, more and more successful applications of DNNs, particularly working with the hybrid learning strategy, lend evidence to support the argument that DNNs are one of the most promising learning systems for dealing with complex and large-scale real-world problems. For example, such evidence can be found in one of the latest developments in a DNN application to computer vision, where it is demonstrated that applying a DNN of nine layers constructed with the SAE building block via layer-wise greedy learning results in favorable performance in object recognition over 22 000 categories [28.58].

28.3 Modular Neural Networks


In this section, we review the main modular neural networks (MNNs) and their learning algorithms with our own taxonomy. We first review the background and motivation for MNN research and present our MNN taxonomy. Then we describe the major MNN architectures and the relevant learning algorithms. Finally, we examine relevant issues related to MNNs in a broader context.

28.3.1 Background and Motivation

Soon after neural network (NN) research resurged in the middle of the 1980s, MNN studies emerged; they have become an important area in NC since then. There is a variety of motivations that inspire MNN research, e.g., biological, psychological, computational, and implementation motivations [28.4, 5, 9]. Here, we only describe the background and motivation of MNN research from the learning and computational perspectives.

From the learning perspective, MNNs have several advantages over monolithic NNs. First of all, MNNs adopt an alternative methodology for learning, so that a complex problem can be solved based on an ensemble of simple NNs, which may avoid or alleviate the complex optimization problems encountered in monolithic NN learning without decreasing the learning capacity. Next, modularity enables MNNs to use a priori knowledge flexibly and facilitates knowledge integration and update in learning. To a great extent, MNNs are immune to temporal and spatial cross-talk, a problem faced by monolithic NNs during learning [28.9]. Finally, theoretical justification and abundant empirical evidence show that an MNN often yields better generalization than its component networks [28.5, 59]. From the computational perspective, modularization in MNNs leads to more efficient and robust computation, given the fact that MNNs often do not suffer from the high coupling burden of a monolithic NN and hence tend to have a lower overall structural complexity in tackling the same problem [28.5]. This main computational merit makes MNNs scalable and extensible towards large-scale MNN implementation.

There are two highly influential principles that are often used in artificial MNN development: divide-and-conquer and diversity promotion. The divide-and-conquer principle refers to a generic methodology that tackles a complex and difficult problem by dividing it into several relatively simple and easy subproblems whose solutions can be combined seamlessly to yield a final solution. On the other hand, theoretical justification [28.60, 61] and abundant empirical studies [28.62] suggest that, apart from the condition that component networks need to reach a certain accuracy, the success of MNNs is largely attributed to the diversity among them. Hence, the promotion of diversity in MNNs becomes critical in their design and development. To understand the motivations and ideas underlying different MNNs, we believe that it is crucial to examine how the two principles are applied in their development.

There are different taxonomies of MNNs [28.4, 5, 9]. In this chapter, we present an alternative taxonomy that highlights the interaction among component networks in an MNN during learning. As a result, there is a dichotomy between tightly and loosely coupled models in MNNs. In a tightly coupled MNN, all component networks are jointly trained in a dependent way by taking their interaction into account during a single learning stage, and hence all parameters of the different networks (and of the combination mechanisms, if there are any) need to be updated simultaneously by minimizing a cost function defined at the global level. In contrast, the training of a loosely coupled MNN often undergoes multiple stages in a hierarchical or sequential way, where the learning undertaken in different stages may be either correlated or uncorrelated via different strategies. We believe that such a taxonomy facilitates not only
understanding different MNNs, especially from a learning perspective, but also relating MNNs to generic ensemble learning in a broader context.

28.3.2 Tightly Coupled Models

There are two typical tightly coupled MNNs: the mixture of experts (MoE) [28.63, 64] and MNNs trained via negative correlation learning (NCL) [28.65].

Mixture of Experts
The MoE [28.63, 64] refers to a class of MNNs that dynamically partition the input space to facilitate learning in a complex and non-stationary environment. By applying the divide-and-conquer principle, a soft-competition idea was proposed to develop the MoE architecture. That is, at every input data point, multiple expert networks compete to take on a given supervised learning task. Instead of winner-take-all, all expert networks may work together, but the winning expert plays a more important role than the losers.

The MoE architecture is composed of N expert networks and a gating network, as illustrated in Fig. 28.7. The n-th expert network produces an output vector, o_n(x), for an input, x. The gating network receives the vector x as input and produces N scalar outputs that form a partition of the input space at each point x. For the input x, the gating network outputs N linear combination coefficients used to weigh the importance of all the expert networks for a given supervised learning task. The final output of the MoE is a convex weighted sum of all the outputs yielded by the N expert networks. Although NNs of different types can be used as expert networks, a class of generalized linear NNs is often employed, where such an NN is linear with a single output nonlinearity [28.64]. As a result, the output of the n-th expert network is a generalized linear function of the input x

$$ o_n(x) = f(W_n x) \,, $$

where W_n is a parameter matrix, a collective notation for both connection weights and biases, and f(·) is a nonlinear transfer function. The gating network is also a generalized linear model, and its n-th output g(x, v_n) is the softmax function of v_n^T x

$$ g(x, v_n) = \frac{e^{v_n^{T} x}}{\sum_{k=1}^{N} e^{v_k^{T} x}} \,, $$

where v_n is the n-th column of the parameter matrix V in the gating network and is responsible for the linear combination coefficient regarding expert network n. The overall output of the MoE is the weighted sum resulting from the soft competition at the point x

$$ o(x) = \sum_{n=1}^{N} g(x, v_n) o_n(x) \,. $$

Fig. 28.7 Architecture of the mixture of experts

There is a natural probabilistic interpretation of the MoE [28.64]. For a training example (x, y), the values of g(x, V) = (g(x, v_n))_{n=1}^N are interpreted as the multinomial probabilities associated with the decision that terminates in a regressive process that maps x to y. Once a decision has been made that leads to the choice of regressive process n, the output y is chosen from a probability distribution P(y|x, W_n). Hence, the overall probability of generating y from x is the mixture of the probabilities of generating y from each component distribution, where the mixing proportions are subject to a multinomial distribution

$$ P(y \mid x, \Theta) = \sum_{n=1}^{N} g(x, v_n) P(y \mid x, W_n) \,, \qquad (28.12) $$

where Θ is a collective notation for all the parameters in the MoE, including both the expert and gating network parameters. For different learning tasks, specific component distribution models are required. For example, the probabilistic component model should be a Gaussian distribution for a regression task, while Bernoulli and multinomial distributions are required for binary and multi-class classification tasks, respectively. In general, the MoE is viewed as a conditional mixture model for supervised learning, a non-trivial extension of the finite mixture model for unsupervised learning.
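The following sketch shows the MoE forward computation described above: softmax gating coefficients computed from v_n^T x and a convex combination of generalized linear expert outputs. The sigmoid output nonlinearity for the experts and the variable names are assumptions made for the example.

```python
import numpy as np

def moe_forward(x, experts, V):
    """Forward pass of a mixture of experts: each expert is a generalized
    linear model o_n(x) = f(W_n x), the gating network turns v_n^T x into
    softmax coefficients, and the overall output is their convex
    combination. `experts` is a list of weight matrices W_n and V stores
    the gating vectors v_n as columns."""
    scores = V.T @ x                                   # v_n^T x for every expert
    g = np.exp(scores - scores.max())
    g /= g.sum()                                       # softmax gating coefficients g(x, v_n)
    # one row of expert output per expert; sigmoid chosen as an example nonlinearity
    outputs = np.array([1.0 / (1.0 + np.exp(-(W @ x))) for W in experts])
    return g @ outputs, g                              # o(x) = sum_n g(x, v_n) o_n(x)
```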
By means of the above probabilistic interpretation, learning in the MoE is treated as a maximum likelihood problem defined based on the model in (28.12). An expectation-maximization (EM) algorithm was proposed to update the parameters in the MoE [28.64]. It is summarized as follows:

Algorithm 28.5 EM Algorithm for MoE Learning
Given a training set of T examples, D = {(x_t, y_t)}_{t=1}^T, pre-set the number of expert networks, N, and randomly initialize all the parameters Θ = {V, (W_n)_{n=1}^N} in the MoE:

• E-step
  For each example (x_t, y_t) in D, estimate the posterior probabilities, h_t = (h_{nt})_{n=1}^N, with the current parameter values, V̂ and {Ŵ_n}_{n=1}^N
  $$ h_{nt} = \frac{g(x_t, \hat{v}_n) P(y_t \mid x_t, \hat{W}_n)}{\sum_{k=1}^{N} g(x_t, \hat{v}_k) P(y_t \mid x_t, \hat{W}_k)} \,. $$

• M-step
  – For expert network n (n = 1, …, N), solve the maximization problem
    $$ \hat{W}_n = \arg\max_{W_n} \sum_{t=1}^{T} h_{nt} \log P(y_t \mid x_t, W_n) \,, $$
    with all examples in D and the posterior probabilities {h_t}_{t=1}^T obtained in the E-step.
  – For the gating network, solve the maximization problem
    $$ \hat{V} = \arg\max_{V} \sum_{t=1}^{T} \sum_{n=1}^{N} h_{nt} \log g(x_t, v_n) \,, $$
    with the training examples {(x_t, h_t)}_{t=1}^T derived from the posterior probabilities {h_t}_{t=1}^T.

Repeat the E-step and the M-step alternately until the EM algorithm converges.

To solve the optimization problems in the M-step, the iteratively re-weighted least squares (IRLS) algorithm was proposed [28.64]. Although the IRLS algorithm has the strength to solve maximum likelihood problems arising in MoE learning, it might result in unstable performance due to its incorrect assumption for multi-class classification [28.66]. As learning in the gating network is a multi-class classification task in essence, the problem always exists if the IRLS algorithm is used in the EM algorithm. Fortunately, improved algorithms were proposed to remedy this problem in the EM learning [28.66]. In summary, numerous MoE variants and extensions have been developed in the past 20 years [28.59], and the MoE architecture has turned out to be one of the most successful MNNs.

Negative Correlation Learning
NCL [28.65] is a learning algorithm used to establish an MNN consisting of diverse neural networks (NNs) by promoting the diversity among component networks during learning. The NCL development was clearly inspired by the bias-variance analysis of generalization errors [28.60, 61]. As a result, NCL encourages cooperation among component networks via interaction during learning to manage the bias-variance trade-off.

In NCL, an unsupervised penalty term is introduced into the MSE cost function for each component NN so that the error diversity among component networks is explicitly managed via training towards negative correlation. Suppose that an NN ensemble F(x, Θ) is established by simply taking the average of N neural networks f(x, W_n) (n = 1, …, N), where W_n denotes all the parameters in the n-th component network and Θ = {W_n}_{n=1}^N. Given a training set D = {(x_t, y_t)}_{t=1}^T, the NCL cost function for the n-th component network [28.65] is defined as follows

$$ L(D, W_n) = \frac{1}{2T} \sum_{t=1}^{T} \| f(x_t, W_n) - y_t \|_2^2 - \frac{\lambda}{2T} \sum_{t=1}^{T} \| f(x_t, W_n) - F(x_t, \Theta) \|_2^2 \,, \qquad (28.13) $$

where F(x_t, Θ) = (1/N) Σ_{n=1}^N f(x_t, W_n) and λ is a trade-off hyper-parameter. In (28.13), the first term is the MSE cost for network n and the second term refers to the negative correlation cost. By taking all component networks into account, minimizing the second term leads to maximum negative correlation among them. Therefore, λ needs to be set properly to control the penalty strength [28.65].

For NCL, all N cost functions specified in (28.13) need to be optimized together for parameter estimation. Based on the stochastic gradient descent method, the generic NCL algorithm is summarized as follows:
Algorithm 28.6 Negative Correlation Learning Algorithm
Given a training set of T examples, D = {(x_t, y_t)}_{t=1}^T, pre-set the number of component networks, N, and the learning rate, η, and randomly initialize all the parameters Θ = {W_n}_{n=1}^N in the component networks:

• Output computation
  For each example (x_t, y_t) in D, calculate the output of each component network, f(x_t, W_n), and that of the NN ensemble
  $$ F(x_t, \Theta) = \frac{1}{N} \sum_{n=1}^{N} f(x_t, W_n) \,. $$

• Gradient computation
  For component network n (n = 1, …, N), calculate the gradient of the NCL cost function in (28.13) with respect to the parameters based on all training examples in D
  $$ \frac{\partial L(D, W_n)}{\partial W_n} = \frac{1}{T} \sum_{t=1}^{T} \left[ \left( f(x_t, W_n) - y_t \right) - \lambda \frac{N-1}{N} \left( f(x_t, W_n) - F(x_t, \Theta) \right) \right] \frac{\partial f(x_t, W_n)}{\partial W_n} \,. $$

• Parameter update
  For component network n (n = 1, …, N), update the parameters
  $$ W_n \leftarrow W_n - \eta \frac{\partial L(D, W_n)}{\partial W_n} \,. $$

Repeat the above three steps until a pre-specified termination condition is met.

While NCL was originally proposed based on the MSE cost function, the NCL idea can be extended to other cost functions without difficulty. Hence, applying appropriate optimization techniques to alternative cost functions leads to NCL algorithms of different forms accordingly.
28.3.3 Loosely Coupled Models

In a loosely coupled model, component networks are trained independently, or there is no direct interaction among them during learning. There is a variety of MNNs that can be classified as loosely coupled models. Below we review several typical loosely coupled MNNs.

Neural Network Ensemble
A neural network ensemble here refers to a committee machine in which a number of NNs are trained independently but their outputs are somehow combined to reach a consensus as a final solution. The development of NN ensembles is explicitly motivated by the diversity-promotion principle [28.67, 68].

Intuitively, errors made by component NNs can be corrected by taking their diversity or mutual complement into account. For example, consider three NNs, NN_i (i = 1, 2, 3), trained on the same data set that have different yet imperfect performance on test data. Given three test points, x_t (t = 1, 2, 3), NN_1 yields the correct output for x_2 and x_3 but not for x_1, NN_2 yields the correct output for x_1 and x_3 but not for x_2, and NN_3 yields the correct output for x_1 and x_2 but not for x_3. In such circumstances, an error made by one NN can be corrected by the other two NNs via a majority vote, so that the ensemble can outperform any of its component NNs. Formally, there is a variety of theoretical justification [28.60, 61] for NN ensembles. For example, it has been proven for regression that the NN ensemble performance is never inferior to the average performance of all component NNs [28.60]. Furthermore, a theoretical bias-variance analysis [28.61] suggests that the promotion of diversity can improve the performance of NN ensembles provided that there is an adequate trade-off between bias and variance. In general, there are two non-trivial issues in constructing NN ensembles: creating diverse component NNs and ensembling strategies.

Depending on the nature of a given problem [28.5, 62], there are several methodologies for creating diverse component NNs. First of all, the NN learning process itself can be exploited. For instance, learning in a monolithic NN often needs to solve a complex non-convex optimization problem [28.9]. Hence, a local-search-based learning algorithm, e.g., BP [28.11], may end up with various solutions corresponding to local minima due to random initialization. In addition, model selection is required to find an appropriate NN structure for a given problem. Such properties can be exploited to create component networks in a homogeneous NN ensemble [28.67]. Next, NNs of different types trained on the same data may also yield different performance and hence are candidates for a heterogeneous NN ensemble [28.5]. Finally, exploration/exploitation of the input space and of different representations is an alternative
488 Part D Neural Networks
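As an illustration of two of these construction recipes, the sketch below builds a homogeneous ensemble from different random initializations and a subspace ensemble from random feature subsets, and then averages the component outputs; scikit-learn's MLPClassifier and the synthetic data are used purely for convenience and are not tied to any implementation cited above.

```python
# A minimal sketch of two ways to create diverse component NNs:
# (1) the same architecture with different random initializations (homogeneous ensemble);
# (2) networks trained on random feature subspaces (subspace ensemble).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)

# (1) homogeneous ensemble: diversity comes only from random initialization
homogeneous = [
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=seed).fit(X, y)
    for seed in range(5)
]

# (2) subspace ensemble: each component sees a random subset of the features
subspaces = [rng.choice(X.shape[1], size=10, replace=False) for _ in range(5)]
subspace_nets = [
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=i).fit(X[:, idx], y)
    for i, idx in enumerate(subspaces)
]

def average_vote(nets, X, subspaces=None):
    """Average the class-posterior outputs of the component networks."""
    probs = [
        net.predict_proba(X if idx is None else X[:, idx])
        for net, idx in zip(nets, subspaces or [None] * len(nets))
    ]
    return np.mean(probs, axis=0).argmax(axis=1)

print((average_vote(homogeneous, X) == y).mean(),
      (average_vote(subspace_nets, X, subspaces) == y).mean())
```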

Ensembling strategies are required for different tasks. For regression, some optimal fusion rules have been developed for NN ensembles, e.g., [28.68], which are supported by theoretical justification, e.g., [28.60, 61]. For classification, ensembling strategies are more complicated but have been well studied in a wider context, named combination of multiple classifiers. As shown in Fig. 28.8, ensembling strategies are generally divided into two categories: learnable and non-learnable. Learnable strategies use a parametric model to learn an optimal fusion rule, while non-learnable strategies fulfil the combination by directly using the statistics of all component network outputs along with simple measures. As depicted in Fig. 28.8, there are six main non-learnable fusion rules: sum, product, min, max, median, and majority vote; details of such non-learnable rules can be found in [28.71]. Below, we focus on the main learnable ensembling strategies in terms of classification.

Fig. 28.8 A taxonomy of ensembling strategies: learnable strategies are either input-dependent (soft competition, associative switch) or input-independent (Bayesian fusion, evidence reasoning, linear combination), while the non-learnable strategies are the sum, product, min, max, median, and majority-vote rules
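For concreteness, a minimal sketch of the six non-learnable rules applied to class posteriors is given below; it assumes that each component NN outputs a probability vector over the classes, which is one common convention rather than the only setting treated in [28.71].

```python
# The six non-learnable fusion rules (sum, product, min, max, median,
# majority vote) applied to the class posteriors of N component classifiers
# for a single input.
import numpy as np

# posteriors[n, l]: output of the n-th component NN for the l-th class
posteriors = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.5, 0.4, 0.1]])

fused = {
    "sum":     posteriors.sum(axis=0),
    "product": posteriors.prod(axis=0),
    "min":     posteriors.min(axis=0),
    "max":     posteriors.max(axis=0),
    "median":  np.median(posteriors, axis=0),
}
decisions = {rule: int(scores.argmax()) for rule, scores in fused.items()}

# the majority vote works on the hard decisions of the component classifiers
votes = posteriors.argmax(axis=1)
decisions["majority_vote"] = int(np.bincount(votes).argmax())

print(decisions)
```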
In general, learnable ensembling strategies are viewed as an application of the stacked generalization principle [28.72]. In light of stacked generalization, all component NNs serve as level 0 generalizers, and a learnable ensembling strategy carried out by a combination mechanism acts as a level 1 generalizer working on the output space of all component NNs to improve the overall generalization. In this sense, such a combination mechanism is trained on a validation data set that is different from the training data set used in component NN learning. As shown in Fig. 28.8, combination mechanisms have been developed from different perspectives, i.e., input-dependent and input-independent.

An input-dependent mechanism combines component NNs based on test data; i.e., given two inputs x_1 and x_2, there is the property c(x_1|θ) ≠ c(x_2|θ) if x_1 ≠ x_2, where c(x|θ) = (c_n(x|θ))_{n=1}^{N} is an input-dependent mechanism used to combine an ensemble of N component NNs and θ collectively denotes all learnable parameters in this parametric model. As a result, the output of an NN ensemble with the input-dependent ensembling strategy is of the following form

$$o(x) = \Omega\bigl(o_1(x), \dots, o_N(x) \mid c(x|\theta)\bigr)\,,$$

where o_n(x) is the output of the n-th component NN for n = 1, ..., N and Ω indicates a method for applying c(x|θ) to the component NNs. For example, Ω may be a linear combination scheme such that

$$o(x) = \sum_{n=1}^{N} c_n(x|\theta)\, o_n(x)\,. \qquad (28.14)$$
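The sketch below illustrates (28.14) with a softmax gate standing in for the input-dependent mechanism c(x|θ); the gating parameters and the two toy component models are invented for illustration and are not taken from any implementation cited here.

```python
# A minimal sketch of the input-dependent linear combination in (28.14):
# a softmax gate c(x | theta) weights the outputs o_n(x) of the components.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# two toy component "networks" (scalar regression outputs)
components = [lambda x: 2.0 * x, lambda x: np.sin(3.0 * x)]

# gating parameters theta (one weight/bias pair per component), illustrative values
theta_w = np.array([1.5, -1.5])
theta_b = np.array([0.0, 0.0])

def ensemble_output(x):
    c = softmax(theta_w * x + theta_b)          # c_n(x | theta): input-dependent weights
    o = np.array([f(x) for f in components])    # o_n(x): component outputs
    return float(np.dot(c, o))                  # o(x) = sum_n c_n(x | theta) o_n(x)

for x in (-1.0, 0.0, 1.0):
    print(x, ensemble_output(x))
```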


As listed in Fig. 28.8, soft competition and the associative switch are two commonly used input-dependent combination mechanisms. The soft-competition mechanism can be regarded as a special case of the MoE described earlier in which all expert networks were trained on a data set independently in advance. In this case, the gating network plays the role of the combination mechanism by deciding the importance of component NNs via soft competition. Although various learning models may be used as such a gating network, an RBF-like (radial basis function) parametric model [28.73] trained with the EM algorithm has been widely used for this purpose. Unlike a soft-competition mechanism, which produces the continuous-valued weight vector c(x) used in (28.14), the associative switch [28.74] adopts a winner-take-all strategy, i.e., Σ_{n=1}^{N} c_n(x|θ) = 1 and c_n(x|θ) ∈ {0, 1}. Thus, an associative switch yields a specific code for a given input so that the output of the best-performing component NN can be selected as the final output of the NN ensemble according to (28.14). Learning the associative switch is a multi-class classification problem, and an MLP is often used to carry it out [28.74]. Although an input-dependent ensembling strategy is applicable to most NN ensembles, it is difficult to apply it to multi-view NN ensembles, since different representations need to be considered simultaneously in training a combination mechanism. Fortunately, such issues have been explored in a wider context on how to use different representations simultaneously for ML [28.70, 75–79], so that both soft-competition and associative switch mechanisms can be extended to multi-view NN ensembles.

In contrast, an input-independent mechanism combines component NNs based on the dependence of their outputs without considering input data directly. Given two inputs x_1 and x_2 with x_1 ≠ x_2, the same c(·) may be applied to the outputs of the component NNs, where c(·) = (c_n(·))_{n=1}^{N} is an input-independent combination mechanism used to combine an ensemble of N component NNs. Several input-independent mechanisms have been developed [28.62], which often fall into one of three categories, i.e., Bayesian fusion, evidence reasoning, and the linear combination scheme, as shown in Fig. 28.8. Bayesian fusion [28.80] refers to a class of combination schemes that use the information collected from errors made by component NNs on a validation set in order to find the optimal output of the maximum a posteriori probability, C* = arg max_{1≤l≤L} P(C_l | o_1(x), ..., o_N(x); θ), via Bayesian reasoning, where C_l is the label of the l-th class in a classification task of L classes and θ here encodes the information gathered, e.g., a confusion matrix obtained during learning [28.80]. Similarly, evidence reasoning mechanisms make use of alternative reasoning theories [28.80], e.g., the Dempster–Shafer theory, to yield the best output for NN ensembles via an evidence reasoning process that works on all outputs of the component NNs in an ensemble. Finally, linear combination schemes of different forms are also popular as input-independent combination mechanisms [28.62]. For instance, the work presented in [28.68] exemplifies how to obtain optimal linear combination weights in a linear combination scheme.
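A compact sketch of confusion-matrix-based Bayesian fusion in the spirit of [28.80] follows; the confusion matrices are invented, a uniform class prior is assumed, and the components' decisions are treated as conditionally independent given the true class, which is a simplification rather than the exact scheme of [28.80].

```python
# Bayesian fusion of the crisp labels produced by N component classifiers.
# Each component n is summarized by a validation confusion matrix
# conf[n][true_class, predicted_class].
import numpy as np

L = 3                                             # number of classes
conf = np.array([                                 # illustrative counts for N = 2 components
    [[40, 5, 5], [4, 42, 4], [6, 3, 41]],
    [[35, 10, 5], [8, 38, 4], [5, 5, 40]],
], dtype=float)

# P(predicted = j | true = l) per component, from row-normalized counts
likelihood = conf / conf.sum(axis=2, keepdims=True)

def bayesian_fuse(predicted_labels):
    """Return argmax_l P(C_l | o_1, ..., o_N) under the stated assumptions."""
    log_post = np.zeros(L)                        # uniform prior, so the constant is dropped
    for n, j in enumerate(predicted_labels):
        log_post += np.log(likelihood[n, :, j])   # add log P(o_n = j | C_l) for every l
    return int(log_post.argmax())

print(bayesian_fuse([0, 1]))   # the components disagree; fusion picks the MAP class
print(bayesian_fuse([2, 2]))
```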
Constructive Modularization Learning
Efforts have also been made towards constructive modularization learning for a given supervised learning task. In such work, the divide-and-conquer principle is explicitly applied in order to develop a constructive learning strategy for modularization. The basic idea behind such methods is to divide a difficult and complex problem into a number of subproblems that are easily solvable by NNs of proper capacities matching the requirements of the subproblems, and then to combine the solutions to the subproblems seamlessly to form a solution to the original problem. On the other hand, constructive modularization learning may alleviate the model selection problem encountered by a monolithic NN. As NNs of simple and even different architectures may be used to solve the subproblems, empirical studies suggest that an MNN generated via constructive modularization learning is often insensitive to the component NN architectures and hence is less likely to suffer from overall overfitting or underfitting [28.81]. Below we describe two constructive modularization learning strategies [28.81–83] for exemplification.

The partitioning-based strategy [28.81, 82] performs constructive modularization learning by applying the divide-and-conquer principle explicitly. For a given supervised learning task, the strategy consists of two learning stages: dividing and conquering. In the dividing stage, it recursively partitions the input space into overlapping subspaces, which facilitates dealing with various uncertainties, by taking supervision information into account, until the nature of each subproblem defined on the generated subspaces matches the capacity of one pre-selected NN. In the conquering stage, an NN works on a given input subspace to complete the corresponding learning subtask. As a result, a tree-structured MNN is self-generated, where a learnable partitioning mechanism, P, is situated at the intermediate levels and NNs work at the leaves of the tree, as illustrated in Fig. 28.9. To enable the partitioning-based constructive modularization learning, two generic algorithms have been proposed, i.e., the growing and credit-assignment algorithms [28.81, 82], as summarized below.

Fig. 28.9 A self-generated tree-structured MNN: partitioning mechanisms (P) occupy the internal nodes and component NNs reside at the leaves

Algorithm 28.7 Growing algorithm
Given a training set D, set X ← D. Randomly initialize the parameters of all component NNs in a given repository and pre-set the hyper-parameters of the learnable partitioning mechanism and the compatibility criteria, respectively:

• Compatibility test
For a training (sub)set X, apply the compatibility criteria to X to examine whether the learning task defined on X matches the capacity of a component NN in the repository.

• Partitioning space
If none in the repository can solve the problem defined on X, then train the partitioning mechanism on the current X to partition it into two overlapping subsets X_l and X_r. Set X ← X_l and go to the compatibility test step; set X ← X_r and go to the compatibility test step.
Otherwise, go to the subproblem solving step.

• Subproblem solving
Train this NN on X with an appropriate learning algorithm. The trained NN resides at the current leaf node.

The growing algorithm expands a tree-structured MNN until the learning problems defined on all partitioned subspaces are solvable with NNs in the repository.
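The recursion behind the growing algorithm can be sketched as follows; the one-dimensional median-split partitioner, the MSE-based compatibility test, and the least-squares leaf models are simplified stand-ins for the heuristics and the NN repository used in [28.81, 82].

```python
# A minimal sketch of the growing algorithm: recursively split a training
# (sub)set until a simple leaf model fits it well enough, then place the
# trained model at the leaf of the generated tree.
import numpy as np

def fit_leaf(X, y):
    """Leaf model stand-in: ordinary least squares on [x, 1]."""
    A = np.c_[X, np.ones(len(X))]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return ("leaf", w)

def compatible(X, y, tol=0.05):
    """Compatibility test stand-in: is the leaf model's MSE small enough?"""
    _, w = fit_leaf(X, y)
    pred = np.c_[X, np.ones(len(X))] @ w
    return np.mean((pred - y) ** 2) < tol

def grow(X, y, depth=0, max_depth=5):
    if compatible(X, y) or depth == max_depth or len(X) < 10:
        return fit_leaf(X, y)
    s = np.median(X[:, 0])                         # partitioner stand-in: median split on feature 0
    tau = 0.1 * X[:, 0].std()                      # overlap width, so the two subsets overlap
    left, right = X[:, 0] <= s + tau, X[:, 0] >= s - tau
    return ("node", s, grow(X[left], y[left], depth + 1, max_depth),
                       grow(X[right], y[right], depth + 1, max_depth))

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.where(X[:, 0] < 0, -1.0, 1.0) * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)
tree = grow(X, y)
print(tree[0])    # 'node' if the input space had to be partitioned
```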
For a given test data point, the output of such a tree-structured MNN may depend on several component NNs at the leaves of the tree, since the input space is partitioned into overlapping subspaces. A credit-assignment algorithm [28.81, 82] has been developed to weight the importance of the component NNs contributing to the overall output, which is summarized as follows:

Algorithm 28.8 Credit-assignment algorithm
P(x) is a trained partitioning mechanism that resides at a nonterminal node and partitions the current input (sub)space into two subspaces with an overlap defined by −τ ≤ P(x) ≤ τ (τ > 0); C_L(·) and C_R(·) are the two credit-assignment functions for the two subspaces, respectively. For a test data point x̃:

• Initialization
Set α(x̃) ← 1 and Pointer ← Root.

• Credit assignment
As a recursive credit propagation process that assigns credits to all the component NNs at the leaf nodes that x̃ can reach, CR[α(x̃), Pointer] consists of three steps:
– If Pointer points to a leaf node, then output α(x̃) and stop.
– If P(x̃) ≤ τ, set α(x̃) ← α(x̃) C_L(P(x̃)) and invoke CR[α(x̃), Pointer.Leftchild].
– If P(x̃) ≥ −τ, set α(x̃) ← α(x̃) C_R(P(x̃)) and invoke CR[α(x̃), Pointer.Rightchild].

Thus, the output of a self-generated MNN is

$$o(\tilde{x}) = \sum_{n\in\mathcal{N}} \alpha_n(\tilde{x})\, o_n(\tilde{x})\,,$$

where 𝒩 denotes all the component NNs that x̃ can reach, and α_n(x̃) and o_n(x̃) are the credit assigned to and the output of the n-th component NN in 𝒩 for x̃, respectively.
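A sketch of this credit propagation over a toy two-level tree follows; the piecewise-linear credit functions below satisfy C_L(x) + C_R(x) = 1, as required, but they are only one plausible choice rather than the exact functions designed in [28.81, 82].

```python
# Credit assignment on a toy partition tree: internal nodes hold a partitioning
# function P(x) with overlap width tau, leaves hold a component model.  Credits
# multiply along every branch the test point can reach, and the leaf outputs
# are credit-weighted and summed.
TAU = 0.5

def C_L(p):
    """Piecewise-linear left credit; C_L(p) + C_R(p) = 1 over the overlap."""
    return 1.0 if p <= -TAU else (0.0 if p >= TAU else (TAU - p) / (2 * TAU))

def C_R(p):
    return 1.0 - C_L(p)

# toy tree: P(x) = x at the root, two constant leaf models
tree = ("node", lambda x: x, ("leaf", lambda x: -1.0), ("leaf", lambda x: +1.0))

def propagate(node, x, credit=1.0):
    """Yield (credit, leaf output) pairs for every leaf that x can reach."""
    if node[0] == "leaf":
        yield credit, node[1](x)
        return
    _, P, left, right = node
    p = P(x)
    if p <= TAU:                  # x falls into the left subspace (overlap included)
        yield from propagate(left, x, credit * C_L(p))
    if p >= -TAU:                 # x falls into the right subspace (overlap included)
        yield from propagate(right, x, credit * C_R(p))

def mnn_output(x):
    return sum(a * o for a, o in propagate(tree, x))

print(mnn_output(-2.0), mnn_output(0.0), mnn_output(2.0))   # -1.0, 0.0, 1.0
```

Because the credits multiply along each root-to-leaf path and C_L + C_R = 1 inside every overlap, the credits of all leaves reachable from a test point sum to one.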
To implement such a strategy, hyperplanes placed with heuristics [28.81] or linear classifiers trained with Fisher discriminant analysis [28.82] were first used as the partitioning mechanism, and NNs such as the MLP or RBF network can be employed to solve the subproblems. Accordingly, two piecewise-linear credit-assignment functions [28.81, 82] were designed for the hyperplane partitioning mechanism, so that C_L(x) + C_R(x) = 1. Heuristic compatibility criteria were developed by considering learning errors and efficiency [28.81, 82]. By using the same constructive learning algorithms described above, an alternative implementation was also proposed that uses the self-organizing map as the partitioning mechanism and SVMs for subproblem solving [28.84]. Empirical studies suggest that the partitioning-based strategy leads to favorable results in various supervised learning tasks despite the different implementations [28.81, 82, 84].

By applying the divide-and-conquer principle, task decomposition [28.83] is yet another constructive modularization learning strategy for classification. Unlike the partitioning-based strategy, the task decomposition strategy converts a multi-class classification task into a number of binary classification subtasks in a brute-force way, and each binary classification subtask is expected to be fulfilled by a simple NN. If a subtask is too difficult to carry out with a given NN, the subtask is allowed to be further decomposed into simpler binary classification subtasks. For a multi-class classification task of M categories, the task decomposition strategy first exhaustively decomposes it into M(M−1)/2 different primary binary subtasks, where each subtask merely concerns classification between two different classes without taking the remaining M − 2 classes into account, which differs from the commonly used one-against-rest decomposition method. In general, the original multi-class classification task may be decomposed into more binary subtasks if some primary subtasks are too difficult. Once the decomposition is completed, all the subtasks are undertaken in parallel by pre-selected simple NNs, e.g., MLPs with one hidden layer. For a final solution to the original problem, three non-learnable operations, min, max, and inv, were proposed to combine the individual binary classification results achieved by all the component NNs. By applying the three operations properly, all the component NNs are integrated together to form a min-max MNN [28.83].
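A sketch of the pairwise decomposition with a min-max style combination follows; logistic regression stands in for the simple per-subtask NNs, each class's discriminant is the min over its pairwise modules, and the inv operation and further subtask decomposition of [28.83] are omitted.

```python
# Pairwise task decomposition with a min-max combination: train one binary
# module per class pair (i, j), take the MIN over each class's modules as
# that class's discriminant, and predict by argmax over the classes.
from itertools import combinations
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# module (i, j) is trained only on classes i and j, with class i as positive
modules = {}
for i, j in combinations(classes, 2):
    mask = np.isin(y, [i, j])
    modules[(i, j)] = LogisticRegression(max_iter=1000).fit(X[mask], (y[mask] == i).astype(int))

def min_max_predict(X):
    scores = np.empty((len(X), len(classes)))
    for i in classes:
        # module output interpreted as support for class i; MIN over its pair modules
        outs = [modules[(min(i, j), max(i, j))].predict_proba(X)[:, 1 if i < j else 0]
                for j in classes if j != i]
        scores[:, i] = np.min(outs, axis=0)
    return scores.argmax(axis=1)

print((min_max_predict(X) == y).mean())
```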
28.3.4 Relevant Issues

In general, studies of MNNs closely relate to several areas in different disciplines, e.g., ML and statistics. We here examine several important issues related to MNNs in a wider context.

As described above, a tightly coupled MNN leads to an optimal solution to a given supervised learning problem. The MoE is rooted in the finite mixture model (FMM) studied in probability and statistics and becomes a non-trivial extension to conditional models, where each expert is a parametric conditional probabilistic model and the mixture coefficients also depend on the input [28.64]. While the MoE has been well studied for 20 years [28.59] in different disciplines, there still exist some open problems in general, e.g., model selection, the global optimal solution, and the convergence of its learning algorithms for arbitrary component models. Different from the FMM, the product of experts (PoE) [28.42] was also proposed to combine a number of experts (parametric probabilistic models) by taking their product and normalizing the result. The PoE has been argued to have some advantages over the MoE [28.42] but has so far merely been developed in the context of unsupervised learning. As a result, extending the PoE to conditional models for supervised learning would be a non-trivial topic in tightly coupled MNN studies. On the other hand, the NCL [28.65] directly applies the bias-variance analysis [28.60, 61] to the construction of an MNN. This implies that MNNs could also be built up via alternative loss functions that properly promote diversity among component NNs during learning.

Almost all existing NN ensemble methods are now included in ensemble learning [28.85], which is an important area in ML, or in the multiple classifier system [28.62] in the context of pattern recognition. In statistical ensemble learning, generic frameworks, e.g., boosting [28.86] and bootstrapping [28.87], were developed to construct ensemble learners, where any learning models, including NNs, may be used as component learners. Hence, most of the common issues raised for ensemble learning are applicable to NN ensembles. Nevertheless, ensemble learning research suggests that the behavior of component learners may considerably affect the stability and overall performance of ensemble learning. As exemplified in [28.88], the properties of different NN ensembles are worth investigating from both theoretical and application perspectives.

While constructive modularization learning provides an alternative way of model selection, it is generally a less developed area in MNNs, and existing methods are subject to limitations due to a lack of theoretical justification and underpinning techniques. For example, a critical issue in the partitioning-based strategy [28.81, 82] is how to measure the nature of a subproblem to decide whether any further partitioning is required, and how to judge the appropriateness of a pre-selected NN for a subproblem in terms of its capacity. In previous studies [28.81, 82], a number of heuristic and simple criteria were proposed based on learning errors and efficiency. Although such heuristic criteria work practically, there is no theoretical justification. As a result, more sophisticated compatibility criteria need to be developed for such a constructive learning strategy based on the latest ML developments, e.g., manifold and adaptive kernel learning. Fortunately, the partitioning-based strategy has inspired recent developments in ML [28.89]. In general, constructive modularization learning is still a non-trivial topic in MNN research.

Finally, it is worth stating that our MNN review here focuses only on supervised learning due to the limited space; treatments of other learning paradigms, e.g., semi-supervised learning and unsupervised (cluster) ensembles, are available in the literature, e.g., [28.90, 91].

28.4 Concluding Remarks


In this chapter, we have reviewed two important areas, DNNs and MNNs, in NC. While we have presented several sophisticated techniques that are ready for applications, we have also discussed several challenging problems in both deep and modular neural network research. Apart from the other non-trivial issues discussed in the chapter, it is worth emphasizing that it is still an open problem to develop large-scale DNNs and MNNs and to integrate them for modeling highly intelligent behaviors, although some progress has been made recently [28.58]. In a wider context, DNNs and MNNs are closely related to two active areas in ML, deep learning and ensemble learning. We anticipate that motivation and methodologies from different perspectives will mutually benefit each other and lead to effective solutions to common challenging problems in the NC and ML communities.

References

28.1 E.R. Kandel, J.H. Schwartz, T.M. Jessell: Principle of Neural Science, 4th edn. (McGraw-Hill, New York 2000)
28.2 G.M. Edelman: Neural Darwinism: Theory of Neural Group Selection (Basic Books, New York 1987)
28.3 J.A. Fodor: The Modularity of Mind (MIT Press, Cambridge 1983)
28.4 F. Azam: Biologically inspired modular neural networks, Ph.D. Thesis (School of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg 2000)
28.5 G. Auda, M. Kamel: Modular neural networks: A survey, Int. J. Neural Syst. 9(2), 129–151 (1999)
28.6 D.C. Van Essen, C.H. Anderson, D.J. Fellman: Information processing in the primate visual system, Science 255, 419–423 (1992)
28.7 J.H. Kaas: Why does the brain have so many visual areas?, J. Cogn. Neurosci. 1(2), 121–135 (1989)
28.8 G. Bugmann: Biologically plausible neural computation, Biosystems 40(1), 11–19 (1997)
28.9 S. Haykin: Neural Networks and Learning Machines, 3rd edn. (Prentice Hall, New York 2009)
28.10 M. Minsky, S. Papert: Perceptrons (MIT Press, Cambridge 1969)
28.11 D.E. Rumelhurt, G.E. Hinton, R.J. Williams: Learning internal representations by error propagation, Nature 323, 533–536 (1986)
28.12 Y. LeCun, L. Bottou, Y. Bengio, P. Haffner: Gradient based learning applied to document recognition, Proc. IEEE 86(9), 2278–2324 (1998)
28.13 G. Tesauro: Practical issues in temporal difference learning, Mach. Learn. 8(2), 257–277 (1992)
28.14 G. Cybenko: Approximations by superpositions of sigmoidal functions, Math. Control Signals Syst. 2(4), 302–314 (1989)
28.15 N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge University Press, Cambridge 2000)
28.16 Y. Bengio, Y. LeCun: Scaling learning algorithms towards AI. In: Large-Scale Kernel Machines, ed. by L. Bottou, O. Chapelle, D. DeCoste, J. Weston (MIT Press, Cambridge 2006), Chap. 14
28.17 Y. Bengio: Learning deep architectures for AI, Found. Trends Mach. Learn. 2(1), 1–127 (2009)
28.18 G.E. Hinton, S. Osindero, Y. Teh: A fast learning algorithm for deep belief nets, Neural Comput. 18(9), 1527–1554 (2006)
28.19 Y. Bengio: Deep learning of representations for unsupervised and transfer learning, JMLR: Workshop Conf. Proc., Vol. 7 (2011) pp. 1–20
28.20 H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio: An empirical evaluation of deep architectures on problems with many factors of variation, Proc. Int. Conf. Mach. Learn. (ICML) (2007) pp. 473–480
28.21 R. Salakhutdinov, G.E. Hinton: Learning a nonlinear embedding by preserving class neighbourhood structure, Proc. Int. Conf. Artif. Intell. Stat. (AISTATS) (2007)
28.22 H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin: Exploring strategies for training deep neural networks, J. Mach. Learn. Res. 10(1), 1–40 (2009)
28.23 W.K. Wong, M. Sun: Deep learning regularized Fisher mappings, IEEE Trans. Neural Netw. 22(10), 1668–1675 (2011)
28.24 S. Osindero, G.E. Hinton: Modeling image patches with a directed hierarchy of Markov random field, Adv. Neural Inf. Process. Syst. (NIPS) (2007) pp. 1121–1128
28.25 I. Levner: Data driven object segmentation, Ph.D. Thesis (Department of Computer Science, University of Alberta, Edmonton 2008)
28.26 H. Mobahi, R. Collobert, J. Weston: Deep learning from temporal coherence in video, Proc. Int. Conf. Mach. Learn. (ICML) (2009) pp. 737–744
28.27 H. Lee, Y. Largman, P. Pham, A. Ng: Unsupervised feature learning for audio classification using convolutional deep belief networks, Adv. Neural Inf. Process. Syst. (NIPS) (2009)
28.28 K. Chen, A. Salman: Learning speaker-specific characteristics with a deep neural architecture, IEEE Trans. Neural Netw. 22(11), 1744–1756 (2011)
28.29 K. Chen, A. Salman: Extracting speaker-specific information with a regularized Siamese deep network, Adv. Neural Inf. Process. Syst. (NIPS) (2011)
28.30 A. Mohamed, G.E. Dahl, G.E. Hinton: Acoustic modeling using deep belief networks, IEEE Trans. Audio Speech Lang. Process. 20(1), 14–22 (2012)
28.31 G.E. Dahl, D. Yu, L. Deng, A. Acero: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Trans. Audio Speech Lang. Process. 20(1), 30–42 (2012)
28.32 R. Salakhutdinov, G.E. Hinton: Semantic hashing, Proc. SIGIR Workshop Inf. Retr. Appl. Graph. Model. (2007)
28.33 M. Ranzato, M. Szummer: Semi-supervised learning of compact document representations with deep networks, Proc. Int. Conf. Mach. Learn. (ICML) (2008)
28.34 A. Torralba, R. Fergus, Y. Weiss: Small codes and large databases for recognition, Proc. Int. Conf. Comput. Vis. Pattern Recogn. (CVPR) (2008) pp. 1–8
28.35 R. Collobert, J. Weston: A unified architecture for natural language processing: Deep neural networks with multitask learning, Proc. Int. Conf. Mach. Learn. (ICML) (2008)
28.36 A. Mnih, G.E. Hinton: A scalable hierarchical distributed language model, Adv. Neural Inf. Process. Syst. (NIPS) (2008)
28.37 J. Weston, F. Ratle, R. Collobert: Deep learning via semi-supervised embedding, Proc. Int. Conf. Mach. Learn. (ICML) (2008)
28.38 R. Hadsell, A. Erkan, P. Sermanet, M. Scoffier, U. Muller, Y. LeCun: Deep belief net learning in a long-range vision system for autonomous offroad driving, Proc. Intell. Robots Syst. (IROS) (2008) pp. 628–633
28.39 Y. Bengio, A. Courville, P. Vincent: Representation learning: A review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1827 (2013)
28.40 Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle: Greedy layer-wise training of deep networks, Adv. Neural Inf. Process. Syst. (NIPS) (2006)
28.41 P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11, 3371–3408 (2010)
28.42 G.E. Hinton: Training products of experts by minimizing contrastive divergence, Neural Comput. 14(10), 1771–1800 (2002)
28.43 K. Kavukcuoglu, M. Ranzato, Y. LeCun: Fast inference in sparse coding algorithms with applications to object recognition, CoRR, arXiv:1010.3467 (2010)
28.44 B.A. Olshausen, D.J. Field: Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vis. Res. 37, 3311–3325 (1997)
28.45 S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio: Contracting auto-encoders: Explicit invariance during feature extraction, Proc. Int. Conf. Mach. Learn. (ICML) (2011)
28.46 S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, X. Glorot: Higher order contractive auto-encoder, Proc. Eur. Conf. Mach. Learn. (ECML) (2011)
28.47 M. Ranzato, C. Poultney, S. Chopra, Y. LeCun: Efficient learning of sparse representations with an energy based model, Adv. Neural Inf. Process. Syst. (NIPS) (2006)
28.48 M. Ranzato, Y. Boureau, Y. LeCun: Sparse feature learning for deep belief networks, Adv. Neural Inf. Process. Syst. (NIPS) (2007)
28.49 G.E. Hinton, R. Salakhutdinov: Reducing the dimensionality of data with neural networks, Science 313, 504–507 (2006)
28.50 K. Cho, A. Ilin, T. Raiko: Improved learning of Gaussian-Bernoulli restricted Boltzmann machines, Proc. Int. Conf. Artif. Neural Netw. (ICANN) (2011)
28.51 D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, S. Bengio: Why does unsupervised pre-training help deep learning?, J. Mach. Learn. Res. 11, 625–660 (2010)
28.52 D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber: Deep big simple neural nets for handwritten digit recognition, Neural Comput. 22(1), 1–14 (2010)
28.53 M. Ranzato, A. Krizhevsky, G.E. Hinton: Factored 3-way restricted Boltzmann machines for modeling natural images, Proc. Int. Conf. Artif. Intell. Stat. (AISTATS) (2010) pp. 621–628
28.54 M. Ranzato, V. Mnih, G.E. Hinton: Generating more realistic images using gated MRF's, Adv. Neural Inf. Process. Syst. (NIPS) (2010)
28.55 A. Courville, J. Bergstra, Y. Bengio: Unsupervised models of images by spike-and-slab RBMs, Proc. Int. Conf. Mach. Learn. (ICML) (2011)
28.56 H. Lee, R. Grosse, R. Ranganath, A.Y. Ng: Unsupervised learning of hierarchical representations with convolutional deep belief networks, Commun. ACM 54(10), 95–103 (2011)
28.57 D. Hau, K. Chen: Exploring hierarchical speech representations with a deep convolutional neural network, Proc. U.K. Workshop Comput. Intell. (UKCI) (2011)
28.58 Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng: Building high-level features using large scale unsupervised learning, Proc. Int. Conf. Mach. Learn. (ICML) (2012)
28.59 S.E. Yuksel, J.N. Wilson, P.D. Gader: Twenty years of mixture of experts, IEEE Trans. Neural Netw. Learn. Syst. 23(8), 1177–1193 (2012)
28.60 A. Krogh, J. Vedelsby: Neural network ensembles, cross validation, and active learning, Adv. Neural Inf. Process. Syst. (NIPS) (1995)
28.61 N. Ueda, R. Nakano: Generalization error of ensemble estimators, Proc. Int. Conf. Neural Netw. (ICNN) (1996) pp. 90–95
28.62 L.I. Kuncheva: Combining Pattern Classifiers (Wiley-Interscience, Hoboken 2004)
28.63 R.A. Jacobs, M.I. Jordan, S. Nowlan, G.E. Hinton: Adaptive mixture of local experts, Neural Comput. 3(1), 79–87 (1991)
28.64 M.I. Jordan, R.A. Jacobs: Hierarchical mixture of experts and the EM algorithm, Neural Comput. 6(2), 181–214 (1994)
28.65 Y. Liu, X. Yao: Simultaneous training of negatively correlated neural networks in an ensemble, IEEE Trans. Syst. Man Cybern. B 29(6), 716–725 (1999)
28.66 K. Chen, L. Xu, H.S. Chi: Improved learning algorithms for mixture of experts in multi-class classification, Neural Netw. 12(9), 1229–1252 (1999)
28.67 L.K. Hansen, P. Salamon: Neural network ensembles, IEEE Trans. Pattern Anal. Mach. Intell. 12(10), 993–1001 (1990)
28.68 M.P. Perrone, L.N. Cooper: Ensemble methods for hybrid neural networks. In: Artificial Neural Networks for Speech and Vision, ed. by R.J. Mammone (Chapman-Hall, New York 1993) pp. 126–142
28.69 T.K. Ho: The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 823–844 (1998)
28.70 K. Chen, L. Wang, H.S. Chi: Methods of combining multiple classifiers with different feature sets and their applications to text-independent speaker identification, Int. J. Pattern Recogn. Artif. Intell. 11(3), 417–445 (1997)
28.71 J. Kittler, M. Hatef, R.P.W. Duin, J. Matas: On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)
28.72 D.H. Wolpert: Stacked generalization, Neural Netw. 2(3), 241–259 (1992)
28.73 L. Xu, M.I. Jordan, G.E. Hinton: An alternative model for mixtures of experts, Adv. Neural Inf. Process. Syst. (NIPS) (1995)
28.74 L. Xu, A. Krzyzak, C.Y. Suen: Associative switch for combining multiple classifiers, J. Artif. Neural Netw. 1(1), 77–100 (1994)
28.75 K. Chen: A connectionist method for pattern classification with diverse feature sets, Pattern Recogn. Lett. 19(7), 545–558 (1998)
28.76 K. Chen, H.S. Chi: A method of combining multiple probabilistic classifiers through soft competition on different feature sets, Neurocomputing 20(1–3), 227–252 (1998)
28.77 K. Chen: On the use of different representations for speaker modeling, IEEE Trans. Syst. Man Cybern. C 35(3), 328–346 (2005)
28.78 Y. Yang, K. Chen: Temporal data clustering via weighted clustering ensemble with different representations, IEEE Trans. Knowl. Data Eng. 23(2), 307–320 (2011)
28.79 Y. Yang, K. Chen: Time series clustering via RPCL ensemble networks with different representations, IEEE Trans. Syst. Man Cybern. C 41(2), 190–199 (2011)
28.80 L. Xu, A. Krzyzak, C.Y. Suen: Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. Syst. Man Cybern. 22(3), 418–435 (1992)
28.81 K. Chen, X. Yu, H.S. Chi: Combining linear discriminant functions with neural networks for supervised learning, Neural Comput. Appl. 6(1), 19–41 (1997)
28.82 K. Chen, L.P. Yang, X. Yu, H.S. Chi: A self-generating modular neural network architecture for supervised learning, Neurocomputing 16(1), 33–48 (1997)
28.83 B.L. Lu, M. Ito: Task decomposition and module combination based on class relations: A modular neural network for pattern classification, IEEE Trans. Neural Netw. Learn. Syst. 10(5), 1244–1256 (1999)
28.84 L. Cao: Support vector machines experts for time series forecasting, Neurocomputing 51(3), 321–339 (2003)
28.85 T.G. Dietterich: Ensemble learning. In: Handbook of Brain Theory and Neural Networks, ed. by M.A. Arbib (MIT Press, Cambridge 2002) pp. 405–408
28.86 Y. Freund, R.E. Schapire: Experiments with a new boosting algorithm, Proc. Int. Conf. Mach. Learn. (ICML) (1996) pp. 148–156
28.87 L. Breiman: Bagging predictors, Mach. Learn. 24(2), 123–140 (1996)
28.88 H. Schwenk, Y. Bengio: Boosting neural networks, Neural Comput. 12(8), 1869–1887 (2000)
28.89 J. Wang, V. Saligrama: Local supervised learning through space partitioning, Adv. Neural Inf. Process. Syst. (NIPS) (2012)
28.90 X.J. Zhu: Semi-supervised learning literature survey, Technical Report, School of Computer Science (University of Wisconsin, Madison 2008)
28.91 J. Ghosh, A. Acharya: Cluster ensembles, WIREs Data Min. Knowl. Discov. 1(2), 305–315 (2011)
