28. Deep and Modular Neural Networks
Ke Chen
Part D | 28.1
In this chapter, we focus on two important areas in neural computation, i.e., deep and modular neural networks, given the fact that both deep and modular neural networks are among the most powerful machine learning and pattern recognition techniques for complex AI problem solving. We begin by providing a general overview of deep and modular neural networks to describe the general motivation behind such neural architectures and fundamental requirements imposed by complex AI problems. Next, we describe background and motivation, methodologies, major building blocks, and the state-of-the-art hybrid learning strategy in context of deep neural architectures. Then, we describe background and motivation, taxonomy, and learning algorithms pertaining to various typical modular neural networks in a wide context. Furthermore, we also examine relevant issues and discuss open problems in deep and modular neural network research areas.

28.1 Overview
28.2 Deep Neural Networks
  28.2.1 Background and Motivation
  28.2.2 Building Blocks and Learning Algorithms
  28.2.3 Hybrid Learning Strategy
  28.2.4 Relevant Issues
28.3 Modular Neural Networks
  28.3.1 Background and Motivation
  28.3.2 Tightly Coupled Models
  28.3.3 Loosely Coupled Models
  28.3.4 Relevant Issues
28.4 Concluding Remarks
References
28.1 Overview
The human brain is a generic, effective, and efficient system that solves complex and difficult problems and generates the trait of intelligence and creation. Neural computation has been inspired by brain-related research in different disciplines, e.g., biology and neuroscience, on various levels ranging from a simple single neuron to complex neuronal structure and organization [28.1]. Among many discoveries in brain-related sciences, two of the most important properties are modularity and hierarchy of neuronal organization in the human brain.

Neuroscientific research has revealed that the central nervous system (CNS) in the human brain is a distributed, massively parallel, and self-organizing modular system [28.1–3]. The CNS is composed of several regions such as the spinal cord, medulla oblongata, pons, midbrain, diencephalon, cerebellum, and the two cerebral hemispheres. Each such region forms a functional module and all regions are interconnected with other parts of the brain [28.1]. In particular, the cerebral cortex consists of several regions attributed to main perceptual and cognitive tasks, where modularity emerges in two different aspects, i.e., structural and functional modularity. Structural modularity is observable from the fact that there are sparse connections between different neuronal groups but neurons are often densely connected within a neuronal group, while functional modularity is evident from different response patterns produced by neural modules for different perceptual and cognitive tasks. Modularity evidence in the human brain strongly suggests that domain-specific modules are required by specific tasks and that different modules can cooperate for high-level, complex tasks, which primarily motivates the modular neural network (MNN) development in neural computation (NC) [28.4, 5].

Apart from modularity, the human brain also exhibits a functional and structural hierarchy given the fact that information processing in the human brain is done in a hierarchical way. Previous studies [28.6, 7] suggested that there are different cortical visual areas that lead to hierarchical information representations to carry out highly complicated visual tasks, e.g., object recognition. In general, hierarchical information processing enables the human brain to accomplish complex perceptual and cognitive tasks in an effective and extremely efficient way, which mainly inspires the study of deep neural networks (DNNs) of multiple layers in NC.

In general, both DNNs and MNNs can be categorized into biologically plausible [28.8] and artificial models [28.9] in NC. The main difference between biologically plausible and artificial models lies in their methodologies: a biologically plausible model often takes both structural and functional resemblance to its biological counterpart into account, while an artificial model simply works towards modeling the functionality of a biological system without considering those bio-mimetic factors. Due to the limited space, in this chapter we merely focus on artificial DNNs and MNNs. Readers interested in biologically plausible models are referred to the literature, e.g., [28.4], for useful information.
tion [28.20–23], various computer vision tasks [28.24–26], audio classification and speech information processing [28.27–31], information retrieval [28.32–34], natural language processing [28.35–37], and robotics [28.38]. Thus, DNNs have become one of the most promising ML and NC techniques to tackle challenging AI problems [28.39].

28.2.2 Building Blocks and Learning Algorithms

In general, a building block is composed of two parametric models, encoder and decoder, as illustrated in Fig. 28.1. An encoder transforms a raw input or a low-level representation $x$ into a high-level and abstract representation $h(x)$, while a decoder generates an output $\hat{x}$, a reconstructed version of $x$, from $h(x)$. Learning a building block is a self-supervised learning task that minimizes an elaborate reconstruction cost function to find appropriate parameters in encoder and decoder. Thus, the distinction between two building blocks of different types lies in their encoder and decoder mechanisms and reconstruction cost functions (as well as optimization algorithms used for parameter estimation). Below we describe different building blocks and their learning algorithms in terms of the generic architecture shown in Fig. 28.1.

Fig. 28.1 Schematic diagram of a generic building block architecture

Auto-Encoders
The auto-encoder [28.40] and its variants are simple building blocks used to build an MLP of many layers. It is carried out by an MLP of one hidden layer. As depicted in Fig. 28.2, the input and the hidden layers constitute an encoder to generate an $M$-dimensional representation $h(x) = (h_1(x), \ldots, h_M(x))^T$ (hereinafter, we use the notation $h(x) = (h_m(x))_{m=1}^{M}$ to indicate a vector–element relationship for simplifying the presentation) for a given input $x = (x_n)_{n=1}^{N}$ in $N$-dimensional space

$$h(x) = f(Wx + b_h),$$

where $W$ is a connection weight matrix between the input and the hidden layers, $b_h$ is the bias vector for all hidden neurons, and $f(\cdot)$ is a transfer function, e.g., the sigmoid function [28.9]. Let $f(u) = (f(u_k))_{k=1}^{K}$ be a collective notation for output of all $K$ neurons in a layer. Accordingly, the hidden and the output layers form a decoder that yields a reconstructed version $\hat{x} = (\hat{x}_n)_{n=1}^{N}$

$$\hat{x} = f(W^T h(x) + b_o),$$

where $W^T$ is the transpose of the weight matrix $W$ and $b_o$ is the bias vector for all output neurons. Note that the auto-encoder can be viewed as a special case of auto-associator when the same weights are tied to be used in connections between different layers, which will be clearly seen in the learning algorithm later on. Doing so avoids an unwanted solution when an over-complete representation, i.e., $M > N$, is required [28.22].

Fig. 28.2 Auto-encoder architecture

Further studies [28.41] suggest that the auto-encoder is unlikely to lead to the discovery of a more useful representation than the input despite the fact that a representation should encode much of the information conveyed in the input whenever the auto-encoder produces a good reconstruction of its input. As a result, a variant named the denoising auto-encoder (DAE) was proposed to capture stable structures underlying the distribution of its observed input. The basic idea is as follows: instead of learning the auto-encoder from the intact input, the DAE will be trained to recover the clean input from a corrupted version of it.
For a corrupted input $\tilde{x}$, the encoder yields the representation

$$h(\tilde{x}) = f(W\tilde{x} + b_h),$$

and the decoder produces a restored version $\hat{x}$ via the representation $h(\tilde{x})$

$$\hat{x} = f(W^T h(\tilde{x}) + b_o).$$

Fig. 28.3 Denoising auto-encoder architecture

To produce a corrupted input, we need to distort a clean input by corrupting it with appropriate noise. Depending on the nature of the input attributes, three kinds of noise are used in the corruption process: isotropic Gaussian noise, $N(0, \sigma^2 I)$, masking noise (setting some randomly chosen elements of $x$ to zero), and salt-and-pepper noise (flipping the values of some randomly chosen elements of $x$ to the maximum or the minimum of a given range). Normally, Gaussian noise is used for input of real or continuous values, while masking and salt-and-pepper noise are applied to input of discrete values, e.g., pixel intensities of gray images. It is worth stating that the variance $\sigma^2$ in Gaussian noise and the number of randomly chosen elements in masking and salt-and-pepper noise are hyper-parameters that affect DAE learning. By corrupting a clean input with the chosen noise, we achieve an example, $(\tilde{x}, x)$, for self-supervised learning.

Given a training set of $T$ examples $\{(x_t, x_t)\}_{t=1}^{T}$ (auto-encoder) or $\{(\tilde{x}_t, x_t)\}_{t=1}^{T}$ (DAE), two reconstruction cost functions are commonly used for learning auto-encoders as follows

$$L(W, b_h, b_o) = \frac{1}{2T} \sum_{t=1}^{T} \sum_{n=1}^{N} (x_{tn} - \hat{x}_{tn})^2, \tag{28.1a}$$

$$L(W, b_h, b_o) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{n=1}^{N} \bigl( x_{tn} \log \hat{x}_{tn} + (1 - x_{tn}) \log (1 - \hat{x}_{tn}) \bigr). \tag{28.1b}$$

The cost function in (28.1a) is used for input of real or discrete values, while the cost function in (28.1b) is employed especially for input of binary values.

To minimize the reconstruction cost functions in (28.1), application of the stochastic gradient descent algorithm [28.12] leads to a generic learning algorithm for training the auto-encoder and its variant, summarized as follows:

Auto-Encoder Learning Algorithm. Given a training set of $T$ examples, $\{(z_t, x_t)\}_{t=1}^{T}$, where $z_t = x_t$ for the auto-encoder or $z_t = \tilde{x}_t$ for the DAE, and a transfer function, $f(\cdot)$, randomly initialize all parameters, $W$, $b_h$, and $b_o$, in auto-encoders and pre-set a learning rate $\eta$. Furthermore, the training set is randomly divided into several batches of $T_B$ examples, $\{(z_t, x_t)\}_{t=1}^{T_B}$, and then parameters are updated based on each batch:

Forward computation
For the input $z_t$ $(t = 1, \ldots, T_B)$, output of the hidden layer is

$$h(z_t) = f(u_h(z_t)), \quad u_h(z_t) = W z_t + b_h,$$

and output of the output layer is

$$\hat{x}_t = f(u_o(z_t)), \quad u_o(z_t) = W^T h(z_t) + b_o.$$

Backward gradient computation
For the cost function in (28.1a),

$$\frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} = \bigl( (\hat{x}_{tn} - x_{tn}) f'(u_{o,n}(z_t)) \bigr)_{n=1}^{N},$$

where $f'(\cdot)$ is the first-order derivative function of $f(\cdot)$. For the cost function in (28.1b),

$$\frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} = \hat{x}_t - x_t.$$

Then, the gradient with respect to $h(z_t)$ is

$$\frac{\partial L(W, b_h, b_o)}{\partial h(z_t)} = W \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)},$$

and the gradient with respect to $u_h(z_t)$ is

$$\frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} = \left( \frac{\partial L(W, b_h, b_o)}{\partial h_m(z_t)} f'(u_{h,m}(z_t)) \right)_{m=1}^{M}.$$

The gradients with respect to the biases are

$$\frac{\partial L(W, b_h, b_o)}{\partial b_o} = \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)},$$

and

$$\frac{\partial L(W, b_h, b_o)}{\partial b_h} = \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)}.$$

Parameter update
Applying the gradient descent method and tied weights leads to update rules

$$W \leftarrow W - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \left[ \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} [z_t]^T + h(z_t) \left[ \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} \right]^T \right],$$

$$b_h \leftarrow b_h - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)}, \quad b_o \leftarrow b_o - \frac{\eta}{T_B} \sum_{t=1}^{T_B} \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)}.$$

The Restricted Boltzmann Machine
Strictly speaking, the restricted Boltzmann machine (RBM) [28.42] is an energy-based generative model, a simplified version of the generic Boltzmann machine. As illustrated in Fig. 28.4, an RBM can be viewed as a probabilistic NN of two layers, i.e., visible and hidden layers, with bi-directional connections. Unlike the Boltzmann machine, there are no lateral connections among neurons in the same layer in an RBM. With the bottom-up connections from the visible to the hidden layer, RBM forms an encoder that yields a probabilistic representation $h = (h_m)_{m=1}^{M}$ for input data $v = (v_n)_{n=1}^{N}$

$$P(h|v) = \prod_{m=1}^{M} P(h_m|v), \quad P(h_m|v) = \sigma\left( \sum_{n=1}^{N} W_{mn} v_n + b_{h,m} \right), \tag{28.2}$$

with $\sigma(\cdot)$ denoting the sigmoid function.

Fig. 28.4 Restricted Boltzmann machine (RBM) architecture

[...]

where $W_{nm}$ is the connection weight between the hidden neuron $m$ and the visible neuron $n$, and $b_{v,n}$ is the bias of visible neuron $n$. Like connection weights in auto-encoders, bi-directional connection weights are tied, i.e., $W_{mn} = W_{nm}$, as shown in Fig. 28.4. By learning a parametric model of the data distribution $P(v)$ derived from the joint probability $P(v, h)$ for a given data set, RBM yields a probabilistic representation that tends to reconstruct any data subject to $P(v)$.
[...]

$$E(v, h) = -\sum_{m=1}^{M} \sum_{n=1}^{N} W_{mn} h_m v_n - \sum_{m=1}^{M} h_m b_{h,m} - \sum_{n=1}^{N} v_n b_{v,n}. \tag{28.4}$$

[...] visible layer. Many studies have suggested that using only one iteration of the Markov chain in the CD algorithm works well for building up a deep belief network (DBN) in practice [28.17, 18, 22], and hence the algorithm is dubbed CD-1 in this situation, a special case of the CD-$k$ algorithm that executes $k$ iterations of the Markov chain in the Gibbs sampling process.

[...] produce a reconstruction $v^1$ via sampling based on $P(\hat{v}^1|h^0)$.
– With the encoder and the reconstruction, estimate probabilities: $P(\hat{h}^1|v^1) = (P(h^1_m|v^1))_{m=1}^{M}$ by using (28.2).

Parameter update
Based on Gibbs sampling results in the positive and the negative phases, parameters are updated with a learning rate $\eta$ as follows:

$$W \leftarrow W + \eta \left[ P(\hat{h}^0|v^0)(v^0)^T - P(\hat{h}^1|v^1)(v^1)^T \right],$$

$$b_h \leftarrow b_h + \eta \left[ P(\hat{h}^0|v^0) - P(\hat{h}^1|v^1) \right],$$

and

$$b_v \leftarrow b_v + \eta \left[ v^0 - v^1 \right].$$

The above three steps repeat for all instances in the given training set, which leads to a training epoch. The learning algorithm runs iteratively until it converges.

Predictive Sparse Decomposition
Predictive sparse decomposition (PSD) [28.43] is a building block obtained by combining sparse coding [28.44] and auto-encoder ideas. In a PSD building block, the encoder is specified by

$$h(x_t) = G \tanh(W_E x_t + b_h), \tag{28.7}$$

where $W_E$ is the $M \times N$ connection matrix between input and hidden neurons in the encoder, $G = \mathrm{diag}(g_{mm})$ is an $M \times M$ learnable diagonal gain matrix for an $M$-dimensional representation of an $N$-dimensional input $x_t$, $b_h$ are biases of hidden neurons, and $\tanh(\cdot)$ is the hyperbolic tangent transfer function [28.9]. Accordingly, the decoder is implemented by a linear mapping used in the sparse coding [28.44]

$$\hat{x}_t = W_D h(x_t), \tag{28.8}$$

where $W_D$ is an $N \times M$ connection matrix between hidden and output neurons in the decoder, and each column of $W_D$ always needs to be normalized to a unit vector to avoid trivial solutions [28.44].

Given a training set of $T$ instances, $\{x_t\}_{t=1}^{T}$, the PSD cost function is defined as

$$L_{PSD}\bigl(G, W_E, W_D, b_h; h^*(x_t)\bigr) = \sum_{t=1}^{T} \Bigl( \|W_D h^*(x_t) - x_t\|_2^2 + \alpha \|h^*(x_t)\|_1 + \beta \|h^*(x_t) - h(x_t)\|_2^2 \Bigr), \tag{28.9}$$

where $h^*(x_t)$ is the optimal sparse hidden representation of $x_t$ while $h(x_t)$ is the output of the encoder in (28.7) based on the current parameter values. In (28.9), $\alpha$ and $\beta$ are two hyper-parameters to control regularization strengths, and $\|\cdot\|_1$ and $\|\cdot\|_2$ are the $L_1$ and $L_2$ norm, respectively. Intuitively, in the multi-objective cost function defined in (28.9), the first term specifies reconstruction errors, the second term refers to the magnitude of non-sparse representations, and the last term drives the encoder towards yielding the optimal representation.

For learning a PSD building block, the cost function in (28.9) needs to be optimized simultaneously with respect to the hidden representation and all the parameters. As a result, a learning algorithm of two alternate steps has been proposed to solve this problem [28.43] as follows:

Algorithm 28.2 PSD learning algorithm
Given a training set of $T$ instances $\{x_t\}_{t=1}^{T}$, randomly initialize all the parameters, $W_E$, $W_D$, $G$, $b_h$, and the optimal sparse representation $\{h^*(x_t)\}_{t=1}^{T}$ in a PSD building block and pre-set hyper-parameters $\alpha$ and $\beta$ as well as learning rates $\eta_i$ $(i = 1, \ldots, 4)$:

Optimal representation update
In this step, the gradient descent method is applied to find the optimal sparse representation based on the current parameter values of the encoder and the decoder, which leads to the following update rule

$$h^*(x_t) \leftarrow h^*(x_t) - \eta_1 \Bigl[ \alpha\, \mathrm{sign}(h^*(x_t)) + \beta \bigl( h^*(x_t) - h(x_t) \bigr) + (W_D)^T \bigl( W_D h^*(x_t) - x_t \bigr) \Bigr],$$

where $\mathrm{sign}(\cdot)$ is the sign function; $\mathrm{sign}(u) = \pm 1$ if $u \gtrless 0$ and $\mathrm{sign}(u) = 0$ if $u = 0$.

Parameter update
In this step, $h^*(x_t)$ achieved in the above step is fixed. Then the gradient descent method is applied to the cost function (28.9) with respect to all encoder and decoder parameters, which results in the following update rules

$$W_E \leftarrow W_E - \eta_2\, g(x_t)(x_t)^T, \quad b_h \leftarrow b_h - \eta_3\, g(x_t).$$

Here $g(x_t)$ is obtained by

$$g(x_t) = \Bigl( g_{mm}^{-1} \bigl( g_{mm}^2 - h_m^2(x_t) \bigr) \bigl( h_m(x_t) - h_m^*(x_t) \bigr) \Bigr)_{m=1}^{M}.$$

$$G \leftarrow G - \eta_2\, \mathrm{diag}\left( \bigl( h_m(x_t) - h_m^*(x_t) \bigr) \tanh\left( \sum_{n=1}^{N} [W_E]_{mn} x_{tn} + b_{h,m} \right) \right),$$

[...]
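To make the three terms of the PSD cost concrete, the sketch below evaluates the encoder (28.7), the decoder (28.8), and the single-example term of the cost (28.9) in pure Python. It assumes the gain matrix $G$ is stored as a list `g` of its diagonal entries; all function names and toy values are hypothetical and only illustrate the formulas.

```python
import math

def psd_encode(W_E, b_h, g, x):
    """Encoder (28.7): h(x) = G tanh(W_E x + b_h), with G = diag(g)."""
    return [g_m * math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b, g_m in zip(W_E, b_h, g)]

def psd_decode(W_D, h):
    """Decoder (28.8): x_hat = W_D h, a linear mapping."""
    return [sum(w * hm for w, hm in zip(row, h)) for row in W_D]

def psd_cost(W_E, W_D, b_h, g, h_star, x, alpha, beta):
    """Single-example term of the cost (28.9)."""
    h = psd_encode(W_E, b_h, g, x)
    x_hat = psd_decode(W_D, h_star)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x_hat, x))
    sparsity = sum(abs(v) for v in h_star)
    prediction = sum((a - b) ** 2 for a, b in zip(h_star, h))
    return reconstruction + alpha * sparsity + beta * prediction

# Toy example (illustrative values only): N = M = 2.
W_E = [[1.0, 0.0], [0.0, 1.0]]
W_D = [[1.0, 0.0], [0.0, 1.0]]   # columns already have unit norm
b_h = [0.0, 0.0]
g = [1.0, 1.0]
x = [0.0, 0.0]
h_star = [1.0, 0.0]
cost = psd_cost(W_E, W_D, b_h, g, h_star, x, alpha=0.5, beta=2.0)
```

In this toy setting each of the three terms equals 1 before weighting, so the effect of $\alpha$ and $\beta$ on the total cost can be read off directly.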
Layer-Wise Unsupervised Learning
In the hybrid learning strategy, unsupervised learning is a layer-wise greedy learning process that constructs a DNN with a chosen building block and initializes parameters in a layer-by-layer way.

Suppose we want to establish a DNN of $K$ $(K > 1)$ hidden layers and denote output of hidden layer $k$ as $h_k(x)$ $(k = 1, \ldots, K)$ for a given input $x$ and output of the output layer as $o(x)$, respectively. To facilitate the presentation, we stipulate $h_0(x) = x$. Then, the generic layer-wise greedy learning procedure can be summarized as follows:

Algorithm 28.3 Layer-wise greedy learning procedure
Given a training set of $T$ instances $\{x_t\}_{t=1}^{T}$, randomly initialize the parameters in a chosen building block and pre-set all hyper-parameters required for learning such a building block:

Train a building block for hidden layer $k$
– Set the number of neurons required by hidden layer $k$ to be the dimension of the hidden representation in the chosen building block.
– Use the training data set $\{h_{k-1}(x_t)\}_{t=1}^{T}$ to train the building block to achieve its optimal parameters.

Construct a DNN up to hidden layer $k$
With the trained building block in the above step, discard its decoder part, including all associated parameters, and stack its hidden layer on the existing DNN with connection weights of the encoder and biases of hidden neurons achieved in the above step (the input layer $h_0(x) = x$ is viewed as the starting architecture of a DNN).

The above steps are repeated for $k = 1, \ldots, K$. Then, the output layer $o(x)$ is stacked onto hidden layer $K$ with randomly initialized connection weights so as to finalize the initial DNN construction and its parameter initialization.

Fig. 28.6a,b Construction of a DNN with a building block via layer-wise greedy learning. (a) Auto-encoder or its variants. (b) RBM or its variants

Figure 28.6 illustrates two typical instances for constructing an initial DNN via the layer-wise greedy learning procedure described above. Figure 28.6a shows a schematic diagram of the layer-wise greedy learning process with the auto-encoder or its variants; to construct the hidden layer $k$, the output layer and its associated parameters $W_k^T$ and $b_{o,k}$ are removed and the remaining part is stacked onto hidden layer $k - 1$, and $W_o$ is a randomly initialized weight matrix for the connection between the hidden layer $K$ and the output layer. When a DNN is constructed with the RBM or its variants, all backward connection weights in the decoder are abandoned after training and only the hidden layer with those forward connection weights and biases of hidden neurons are used to construct the DNN, as depicted in Fig. 28.6b.
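The layer-wise greedy procedure can be sketched in pure Python with a tied-weight auto-encoder as the building block. This is a minimal illustration under stated assumptions, not the chapter's reference implementation: a sigmoid transfer function, the squared-error cost (28.1a), per-example gradient descent (i.e., $T_B = 1$), and hypothetical names and hyper-parameter values (`train_autoencoder`, `greedy_pretrain`, `lr`, `epochs`). The randomly initialized output layer $o(x)$ is omitted.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def encode(W, b_h, x):
    """h(x) = f(W x + b_h) with a sigmoid transfer function."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, b_h)]

def decode(W, b_o, h):
    """x_hat = f(W^T h + b_o), reusing the tied (transposed) weights."""
    return [sigmoid(sum(W[m][n] * h[m] for m in range(len(W))) + b_o[n])
            for n in range(len(W[0]))]

def train_autoencoder(data, n_hidden, lr=0.5, epochs=200, seed=0):
    """Train one tied-weight auto-encoder on (28.1a); return encoder parameters."""
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h, b_o = [0.0] * n_hidden, [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h = encode(W, b_h, x)
            x_hat = decode(W, b_o, h)
            # dL/du_o = (x_hat - x) * f'(u_o); for the sigmoid, f' = f * (1 - f)
            d_o = [(xh - xi) * xh * (1.0 - xh) for xh, xi in zip(x_hat, x)]
            # dL/dh = W dL/du_o, then dL/du_h = dL/dh * f'(u_h)
            d_h = [sum(W[m][n] * d_o[n] for n in range(n_in)) * h[m] * (1.0 - h[m])
                   for m in range(n_hidden)]
            for m in range(n_hidden):
                for n in range(n_in):
                    # Tied weights: encoder contribution plus decoder contribution.
                    W[m][n] -= lr * (d_h[m] * x[n] + h[m] * d_o[n])
                b_h[m] -= lr * d_h[m]
            for n in range(n_in):
                b_o[n] -= lr * d_o[n]
    return W, b_h  # the decoder bias b_o is discarded with the decoder

def greedy_pretrain(data, layer_sizes):
    """Stack encoders layer by layer; layer k is trained on h_{k-1}(x_t)."""
    layers, reps = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_autoencoder(reps, n_hidden)
        layers.append((W, b_h))
        reps = [encode(W, b_h, r) for r in reps]  # training inputs for the next layer
    return layers

def forward(layers, x):
    """Compute h_K(x) through the stacked encoders."""
    for W, b_h in layers:
        x = encode(W, b_h, x)
    return x

# Toy data (illustrative only): four 4-dimensional binary patterns, K = 2.
data = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
layers = greedy_pretrain(data, layer_sizes=[3, 2])
top = forward(layers, data[0])
```

Each call to `train_autoencoder` sees only the representations produced by the layers below it, which is what makes the procedure greedy: lower layers are frozen once trained, and only the encoder weights and hidden biases are kept for the stacked network.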
28.2.4 Relevant Issues

In the literature, the hybrid learning strategy described in Sect. 28.2.3 is often called the semi-supervised learning strategy [28.17, 39]. Nevertheless, semi-supervised learning implies the situation that there are few labeled examples but many unlabeled instances in a training set. Indeed, such a strategy works well in a situation where both unlabeled and labeled data in a training set are used for layer-wise greedy learning, and only labeled data are used for fine-tuning in global supervised learning. However, other studies, e.g., [28.28, 29, 51], also show that this strategy can considerably improve the generalization of a DNN even though there are abundant labeled examples in a training data set. Hence we would rather name it hybrid learning. On the other hand, our review focuses on only primary supervised learning tasks in the context of NC. In a wider context, the unsupervised learning process itself develops a novel approach to automatic feature discovery/extraction via learning, which is an emerging ML area named representation learning [28.39]. In such a context, some DNNs can act as a generative model. For instance, the DBN [28.18] is an RBM-based DNN that retains both forward and backward connections during layer-wise greedy learning. To be a generative model, the DBN needs an alternative learning algorithm, e.g., the wake–sleep algorithm [28.18], for global unsupervised learning. In general, the global unsupervised learning for a generative DNN is still a challenging problem.

While the hybrid learning strategy has been successfully applied to many complex AI tasks, in general, it is still not entirely clear why such a strategy works well empirically. A recent study attempted to provide some justification of the role played by layer-wise greedy learning for supervised learning [28.51]. The findings can be summarized as follows: such an unsupervised learning process brings about a regularization effect that initializes DNN parameters towards the basin of attraction corresponding to a good local minimum, which facilitates global supervised learning in terms of generalization [28.51]. In general, a deeper understanding of such a learning strategy will be required in the future.

Recent studies also suggest that the use of artificially distorted or deformed training data and unsupervised front-ends can considerably improve the performance of DNNs regardless of the hybrid learning strategy. As DNN learning is data-driven in nature, augmenting training data with known input deformation amounts to the use of more representative examples conveying intrinsic variations underlying a class of data in learning. For example, speech corrupted by some known channel noise and images deformed by affine transformation and added noise have significantly improved the DNN performance in various speech and visual information processing tasks [28.22, 28, 29, 51, 52]. On the other hand, the generic building blocks reviewed in Sect. 28.2.2 can be extended to be specialist front-ends by exploiting intrinsic data structures. For instance, the RBM has several variants, e.g., [28.53–55], to capture covariance and other statistical information underlying an image. After unsupervised learning, such front-ends generate powerful representations that greatly facilitate further DNN learning in visual information processing.

While our review focuses on only fully connected feed-forward DNNs, there are alternative and more effective DNN architectures for specific tasks. Convolutional DNNs [28.12] make use of topological locality constraints underlying images to form a more effective, locally connected DNN architecture. Furthermore, various pooling techniques [28.56] used in convolutional DNNs facilitate learning invariant and robust features. With appropriate building blocks, e.g., the PSD reviewed in Sect. 28.2.2, convolutional DNNs work very well with the hybrid learning strategy [28.43, 57]. In addition, novel DNN architectures need to be developed by exploring the nature of a specific problem, e.g., a regularized Siamese DNN was recently developed for generic speaker-specific information extraction [28.28, 29]. As a result, novel DNN architecture development and model selection are among important DNN research topics.

Finally, theoretical justification of deep learning and the hybrid learning strategy, along with other recently developed techniques, e.g., parallel graphics processing unit (GPU) computing, enable researchers to develop large-scale DNNs to tackle very complex real world problems. While some theoretical justification has been provided in the literature, e.g., [28.16, 17, 39], to show strengths in the potential capacity and efficient representational schemes of DNNs, more and more successful applications of DNNs, particularly those working with the hybrid learning strategy, lend evidence to support the argument that DNNs are one of the most promising learning systems for dealing with complex and large-scale real world problems. For example, such evidence can be found from one of the latest developments in a DNN application to computer vision where it is demonstrated that applying a DNN of nine layers constructed with the SAE building block via layer-wise greedy learning results in favorable performance in object recognition of over 22 000 categories [28.58].
derstanding different MNNs especially from a learning perspective but also relating MNNs to generic ensemble learning in a broader context.

28.3.2 Tightly Coupled Models

There are two typical tightly coupled MNNs: the mixture of experts (MoE) [28.63, 64] and MNNs trained via negative correlation learning (NCL) [28.65].

Mixture of Experts
The MoE [28.63, 64] refers to a class of MNNs that dynamically partition input space to facilitate learning in a complex and non-stationary environment. By applying the divide-and-conquer principle, a soft-competition idea was proposed to develop the MoE architecture. That is, at every input data point, multiple expert networks compete to take on a given supervised learning task. Instead of winner-take-all, all expert networks may work together but the winner expert plays a more important role than the losers.

The MoE architecture is composed of $N$ expert networks and a gating network, as illustrated in Fig. 28.7. The $n$-th expert network produces an output vector, $o_n(x)$, for an input, $x$. The gating network receives the vector $x$ as input and produces $N$ scalar outputs that form a partition of the input space at each point $x$. For the input $x$, the gating network outputs $N$ linear combination coefficients used to determine the importance of all expert networks for a given supervised learning task. The final output of MoE is a convex weighted sum of all the outputs yielded by the $N$ expert networks. Although NNs of different types can be used as expert networks, a class of generalized linear NNs are often employed where such an NN is linear with a single output nonlinearity [28.64]. As a result, output of the $n$-th expert network is a generalized linear function of the input $x$

$$o_n(x) = f(W_n x),$$

where $W_n$ is a parameter matrix, a collective notation for both connection weights and biases, and $f(\cdot)$ is a nonlinear transfer function. The gating network is also a generalized linear model, and its $n$-th output $g(x, v_n)$ is the softmax function of $v_n^T x$

$$g(x, v_n) = \frac{e^{v_n^T x}}{\sum_{k=1}^{N} e^{v_k^T x}},$$

where $v_n$ is the $n$-th column of the parameter matrix $V$ in the gating network and is responsible for the linear combination coefficient regarding the expert network $n$. The overall output of the MoE is the weighted sum resulting from the soft-competition at the point $x$

$$o(x) = \sum_{n=1}^{N} g(x, v_n)\, o_n(x).$$

Fig. 28.7 Architecture of the mixture of experts

There is a natural probabilistic interpretation of the MoE [28.64]. For a training example $(x, y)$, the values of $g(x, V) = (g(x, v_n))_{n=1}^{N}$ are interpreted as the multinomial probabilities associated with the decision that terminates in a regressive process that maps $x$ to $y$. Once a decision has been made that leads to a choice of regressive process $n$, the output $y$ is chosen from a probability distribution $P(y|x, W_n)$. Hence, the overall probability of generating $y$ from $x$ is the mixture of the probabilities of generating $y$ from each component distribution and the mixing proportions are subject to a multinomial distribution

$$P(y|x, \Theta) = \sum_{n=1}^{N} g(x, v_n)\, P(y|x, W_n), \tag{28.12}$$

where $\Theta$ is a collective notation of all the parameters in the MoE, including both expert and gating network parameters. For different learning tasks, specific component distribution models are required. For example, the probabilistic component model should be a Gaussian distribution for a regression task, while a Bernoulli distribution and multinomial distributions are required for binary and multi-class classification tasks, respectively. In general, MoE is viewed as a conditional mixture model for supervised learning, a non-trivial extension of the finite mixture model for unsupervised learning.
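The forward computation of the MoE described above can be sketched in a few lines of pure Python. The sketch assumes $\tanh$ as the expert transfer function $f(\cdot)$, stores the gating parameters as a list of the vectors $v_n$, and, following the collective notation for $W_n$, expects any bias to be handled by appending a constant 1 to $x$. The names `moe_output`, `experts`, and `V` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def moe_output(experts, V, x, f=math.tanh):
    """Return (o(x), gates), where o(x) = sum_n g(x, v_n) o_n(x).

    experts : list of N parameter matrices W_n, so that o_n(x) = f(W_n x)
    V       : list of N gating vectors v_n
    """
    gates = softmax([sum(vi * xi for vi, xi in zip(v, x)) for v in V])
    outputs = [[f(u) for u in matvec(W, x)] for W in experts]
    dim = len(outputs[0])
    mix = [sum(g * o[k] for g, o in zip(gates, outputs)) for k in range(dim)]
    return mix, gates

# Toy example (illustrative values only): N = 2 experts with scalar outputs.
experts = [[[1.0, 0.0]], [[0.0, 1.0]]]
V = [[0.0, 0.0], [0.0, 0.0]]          # equal gating scores for every input
mix, gates = moe_output(experts, V, [1.0, 0.0])
```

Because the gating outputs are non-negative and sum to one, the result is a convex combination of the expert outputs; a large score $v_n^T x$ lets expert $n$ dominate without fully silencing the others, which is the soft-competition idea.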
By means of the above probabilistic interpretation, learning in the MoE is treated as a maximum likelihood problem defined based on the model in (28.12). An expectation-maximization (EM) algorithm was proposed to update parameters in the MoE [28.64]. It is summarized as follows:

[...]

[...] task in essence, the problem always exists if the IRLS algorithm is used in the EM algorithm. Fortunately, improved algorithms were proposed to remedy this problem in the EM learning [28.66]. In summary, numerous MoE variants and extensions have been developed in the past 20 years [28.59], and the MoE architecture turns out to be one of the most successful [...]
Algorithm 28.6 Negative correlation learning algorithm
Given a training set of T examples $D = \{(\mathbf{x}_t, \mathbf{y}_t)\}_{t=1}^{T}$, pre-set the number of component networks, N, and learning rate, $\eta$, as well as randomly initialize all the parameters $\Theta = \{W_n\}_{n=1}^{N}$ in component networks:

Output computation
For each example $(\mathbf{x}_t, \mathbf{y}_t)$ in D, calculate the output of each component network $f(\mathbf{x}_t; W_n)$ and that of the NN ensemble by

$$F(\mathbf{x}_t; \Theta) = \frac{1}{N} \sum_{n=1}^{N} f(\mathbf{x}_t; W_n).$$

Gradient computation
For component network n (n = 1, ..., N), calculate the gradient of the NCL cost function in (28.13) with respect to the parameters based on all training examples in D

$$\frac{\partial L(D, W_n)}{\partial W_n} = \frac{1}{T} \sum_{t=1}^{T} \left[ \| f(\mathbf{x}_t; W_n) - \mathbf{y}_t \|_2 - \lambda \frac{(N-1)}{N} \| f(\mathbf{x}_t; W_n) - F(\mathbf{x}_t; \Theta) \|_2 \right] \frac{\partial f(\mathbf{x}_t; W_n)}{\partial W_n}.$$

Parameter update
For component network n (n = 1, ..., N), update the parameters

$$W_n \leftarrow W_n - \eta \frac{\partial L(D, W_n)}{\partial W_n}.$$

Repeat the above three steps until a pre-specified termination condition is met.

While the NCL was originally proposed based on the MSE cost function, the NCL idea can be extended to other cost functions without difficulty. Hence, applying appropriate optimization techniques on alternative cost functions leads to NCL algorithms of different forms accordingly.

28.3.3 Loosely Coupled Models

In a loosely coupled model, component networks are trained independently or there is no direct interaction among them during learning. There are a variety of MNNs that can be classified as loosely coupled models. Below we review several typical loosely coupled MNNs.

Neural Network Ensemble

A neural network ensemble here refers to a committee machine where a number of NNs are trained independently but their outputs are somehow combined to reach a consensus as a final solution. The development of NN ensembles is explicitly motivated by the diversity-promotion principle [28.67, 68].

Intuitively, errors made by component NNs can be corrected by taking their diversity or mutual complement into account. For example, three NNs, $\mathrm{NN}_i$ (i = 1, 2, 3), trained on the same data set have different yet imperfect performance on test data. Given three test points, $\mathbf{x}_t$ (t = 1, 2, 3), $\mathrm{NN}_1$ yields the correct output for $\mathbf{x}_2$ and $\mathbf{x}_3$ but not for $\mathbf{x}_1$, $\mathrm{NN}_2$ yields the correct output for $\mathbf{x}_1$ and $\mathbf{x}_3$ but not for $\mathbf{x}_2$, and $\mathrm{NN}_3$ yields the correct output for $\mathbf{x}_1$ and $\mathbf{x}_2$ but not for $\mathbf{x}_3$. In such circumstances, an error made by one NN can be corrected by the other two NNs with a majority vote so that the ensemble can outperform any component NN. Formally, there is a variety of theoretical justifications [28.60, 61] for NN ensembles. For example, it has been proven for regression that the NN ensemble performance is never inferior to the average performance of all component NNs [28.60]. Furthermore, a theoretical bias-variance analysis [28.61] suggests that the promotion of diversity can improve the performance of NN ensembles provided that there is an adequate trade-off between bias and variance. In general, there are two non-trivial issues in constructing NN ensembles, i.e., creating diverse component NNs and ensembling strategies.

Depending on the nature of a given problem [28.5, 62], there are several methodologies for creating diverse component NNs. First of all, an NN learning process itself can be exploited. For instance, learning in a monolithic NN often needs to solve a complex non-convex optimization problem [28.9]. Hence, a local-search-based learning algorithm, e.g., BP [28.11], may end up with various solutions corresponding to local minima due to random initialization. In addition, model selection is required to find out an appropriate NN structure for a given problem. Such properties can be exploited to create component networks in a homogeneous NN ensemble [28.67]. Next, NNs of different types trained on the same data may also yield different performance and hence are candidates in a heterogeneous NN ensemble [28.5]. Finally, exploration/exploitation of input space and different representations is an alternative methodology for creating different component NNs. Instead of training an NN on the input space, NNs can be trained on different input subspaces achieved by a partitioning method, e.g., random partitioning [28.69], which results in a subspace NN ensemble. Whenever raw data can be characterized by different representations, NNs trained on different feature sets would

improve the overall generalization. In this sense, such a combination mechanism is trained on a validation data set that is different from the training data set used in component NN learning. As is shown in Fig. 28.8, combination mechanisms have been developed from different perspectives, i.e., input-dependent and input-independent.
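The three-network majority-vote example given in the Neural Network Ensemble discussion can be checked with a few lines of Python. This is only an illustrative sketch: the class labels, the dictionary `nn_outputs`, and the helper names `majority_vote` and `accuracy` are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most component networks."""
    return Counter(predictions).most_common(1)[0][0]


def accuracy(predicted, target):
    """Fraction of test points that are labeled correctly."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


# Hypothetical ground-truth labels for the three test points x1, x2, x3.
truth = ["a", "b", "c"]

# Each component NN is wrong on exactly one, distinct, test point ("z").
nn_outputs = {
    "NN1": ["z", "b", "c"],  # wrong on x1
    "NN2": ["a", "z", "c"],  # wrong on x2
    "NN3": ["a", "b", "z"],  # wrong on x3
}

# Combine the three outputs point by point with a majority vote.
ensemble = [majority_vote(votes) for votes in zip(*nn_outputs.values())]

component_acc = {name: accuracy(out, truth) for name, out in nn_outputs.items()}
ensemble_acc = accuracy(ensemble, truth)
```

Under these assumed outputs every component NN is correct on two of the three points, whereas the vote is correct on all three, because each error is outvoted by the two networks that did not make it. The benefit disappears if the networks make the same errors, which is why diversity matters.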
[Figure: ensembling strategies, divided into learnable and non-learnable]
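As a complement to Algorithm 28.6, the following is a minimal sketch of negative correlation learning in plain Python. It rests on several assumptions that are not in the chapter: each component network is a linear model with scalar output, and the per-network cost is taken to be the mean squared error minus `lam` times the mean squared deviation from the ensemble output (both halved), which yields the (N-1)/N factor seen in the gradient step. The function names, the penalty strength `lam`, and the toy data are all hypothetical.

```python
import random


def predict(w, x):
    """Linear component network f(x; W); the last entry of w is the bias."""
    return sum(wi * xi for wi, xi in zip(w[:-1], x)) + w[-1]


def ensemble_predict(theta, x):
    """Ensemble output F(x; Theta): simple average of component outputs."""
    return sum(predict(w, x) for w in theta) / len(theta)


def ncl_train(data, n_components=3, lam=0.5, eta=0.1, epochs=500, seed=0):
    """Batch NCL with an assumed MSE-based cost (see the lead-in)."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    T, N = len(data), n_components
    # Random initialization of Theta = {W_n}.
    theta = [[rng.uniform(-0.5, 0.5) for _ in range(dim + 1)] for _ in range(N)]
    for _ in range(epochs):
        # Output computation: component outputs and ensemble output.
        outs = [[predict(w, x) for w in theta] for x, _ in data]
        ens = [sum(o) / N for o in outs]
        # Gradient computation over all training examples.
        grads = [[0.0] * (dim + 1) for _ in range(N)]
        for t, (x, y) in enumerate(data):
            for n in range(N):
                delta = (outs[t][n] - y) - lam * (N - 1) / N * (outs[t][n] - ens[t])
                for d in range(dim):
                    grads[n][d] += delta * x[d] / T
                grads[n][dim] += delta / T
        # Parameter update: one gradient-descent step per component.
        for n in range(N):
            theta[n] = [w - eta * g for w, g in zip(theta[n], grads[n])]
    return theta


# Toy regression data from the hypothetical target y = 2x + 1.
xs = [i / 10.0 for i in range(-10, 11)]
data = [([x], 2.0 * x + 1.0) for x in xs]
theta = ncl_train(data)
```

With `lam = 0` the networks are trained independently. A positive `lam` couples them through the ensemble output, which is what makes the model tightly coupled. A fixed number of epochs stands in here for the termination condition.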
by deciding the importance of component NNs via soft-competition. Although various learning models may be used as such a gating network, an RBF-like (radial basis function) parametric model [28.73] trained on the EM algorithm has been widely used for this purpose. Unlike a soft-competition mechanism that produces the continuous-value weight vector $\mathbf{c}(\mathbf{x})$ used in (28.14), the associative switch [28.74] adopts a winner-take-all strategy, i.e., $\sum_{n=1}^{N} c_n(\mathbf{x} \mid \theta) = 1$ and $c_n(\mathbf{x} \mid \theta) \in \{0, 1\}$. Thus, an associative switch yields a specific code for a given input so that the output of the best-performing component NN can be selected as the final output of the NN ensemble according to (28.14). The associative switch learning is a multi-class classification problem, and an MLP is often used to carry it out [28.74]. Although an input-dependent ensembling strategy is applicable to most NN ensembles, it is difficult to apply it to multi-view NN ensembles, since different representations need to be considered simultaneously in training a combination mechanism. Fortunately, such issues have been explored in a wider context on how to use different representations simultaneously for ML [28.70, 75–79] so that both soft-competition and associative switch mechanisms can be extended to multi-view NN ensembles.

In contrast, an input-independent mechanism combines component NNs based on the dependence of their outputs without considering input data directly. Given two inputs $\mathbf{x}_1$ and $\mathbf{x}_2$, and $\mathbf{x}_1 \neq \mathbf{x}_2$, the same $\mathbf{c}(\theta)$ may be applied to outputs of component NNs, where $\mathbf{c}(\theta) = (c_n(\theta))_{n=1}^{N}$ is an input-independent combination mechanism used to combine an ensemble of N component NNs. Several input-independent mechanisms have been developed [28.62], which often fall into one of three categories, i.e., Bayesian fusion, evidence reasoning, and a linear combination scheme, as shown in Fig. 28.8. Bayesian fusion [28.80] refers to a class of combination schemes that use the information collected from errors made by component NNs on a validation set in order to find out the optimal output of the maximum a posteriori probability, $C^{*} = \arg\max_{1 \le l \le L} P(C_l \mid o_1(\mathbf{x}), \ldots, o_N(\mathbf{x}), \Lambda)$, via Bayesian reasoning, where $C_l$ is the label for the l-th class in a classification task of L classes, and $\Lambda$ here encodes the information gathered, e.g., a confusion matrix achieved during learning [28.80]. Similarly, evidence reasoning mechanisms make use of alternative reasoning theories [28.80], e.g., the Dempster–Shafer theory, to yield the best output for NN ensembles via an evidence reasoning process that works on all outputs of component NNs in an ensemble. Finally, linear combination schemes of different forms are also popular as input-independent combination mechanisms [28.62]. For instance, the work presented in [28.68] exemplifies how to achieve optimal linear combination weights in a linear combination scheme.

Constructive Modularization Learning

Efforts have also been made towards constructive modularization learning for a given supervised learning task. In such work, the divide-and-conquer principle is explicitly applied in order to develop a constructive learning strategy for modularization. The basic idea behind such methods is to divide a difficult and complex problem into a number of subproblems that are easily solvable by NNs of proper capacities, matching the requirements of the subproblems, and then the solutions to subproblems are combined seamlessly to form a solution to the original problem. On the other hand, constructive modularization learning may alleviate the model selection problem encountered by a monolithic NN. As NNs of simple and even different architectures may be used to solve subproblems, empirical studies suggest that an MNN generated via constructive modularization learning is often insensitive to component NN architectures and hence is less likely to suffer from overall overfitting or underfitting [28.81]. Below we describe two constructive modularization learning strategies [28.81–83] for exemplification.

Fig. 28.9 A self-generated tree-structured MNN

The partitioning-based strategy [28.81, 82] performs the constructive modularization learning by applying the divide-and-conquer principle explicitly. For a given supervised learning task, the strategy consists of two learning stages: dividing and conquering. In the dividing stage, it first recursively partitions the input space into overlapping subspaces, which facilitates dealing with various uncertainties, by taking supervision information into account until the nature of each subproblem defined in generated subspaces
matches the capacity of one pre-selected NN. In the conquering stage, an NN works on a given input subspace to complete the corresponding learning subtask. As a result, a tree-structured MNN is self-generated, where a learnable partitioning mechanism, P, is situated at intermediate levels and NNs work at the leaves of the tree, as illustrated in Fig. 28.9. To enable the

Partitioning space
If none in the repository can solve the problem defined on $\mathcal{X}$, then train the partitioning mechanism on the current $\mathcal{X}$ to partition it into two overlapped $\mathcal{X}_l$ and $\mathcal{X}_r$. Set $\mathcal{X} \leftarrow \mathcal{X}_l$, then go to the compatibility test step. Set $\mathcal{X} \leftarrow \mathcal{X}_r$, then go to the compatibility test step.
Otherwise, go to the subproblem solving step.

Subproblem solving
Train this NN on $\mathcal{X}$ with an appropriate learning algorithm. The trained NN resides at the current leaf node.

The growing algorithm expands a tree-structured MNN until learning problems defined on all partitioned subspaces are solvable with NNs in the repository.

For a given test data point, output of such a tree-structured MNN may depend on several component NNs at the leaves of the tree since the input space is partitioned into overlapping subspaces. A credit-assignment algorithm [28.81, 82] has been developed to weight the importance of component NNs contributed to the overall output, which is summarized as follows:

Algorithm 28.8 Credit-assignment algorithm
$P(\mathbf{x})$ is a trained partitioning mechanism that resides at a nonterminal node and partitions the current input (sub)space into two subspaces with an overlapping defined by $-\tau \le P(\mathbf{x}) \le \tau$ $(\tau > 0)$. $C_L(\cdot)$ and $C_R(\cdot)$ are two credit assignment functions for two subspaces, respectively. For a test data point $\tilde{\mathbf{x}}$:

Initialization
Set $\alpha(\tilde{\mathbf{x}}) \leftarrow 1$ and Pointer $\leftarrow$ Root.

Credit assignment

where $\mathcal{N}$ denotes all the component NNs that $\tilde{\mathbf{x}}$ can reach, and $\alpha_n(\tilde{\mathbf{x}})$ and $o_n(\tilde{\mathbf{x}})$ are the credit assigned and the output of the n-th component NN in $\mathcal{N}$ for $\tilde{\mathbf{x}}$, respectively.

To implement such a strategy, hyper-planes placed with heuristics [28.81] or linear classifiers trained with the Fisher discriminative analysis [28.82] were first used as the partition mechanism and NNs such as MLP or RBF can be employed to solve subproblems. Accordingly, two piece-wise linear credit assignment functions [28.81, 82] were designed for the hyper-plane partitioning mechanism, so that $C_L(\mathbf{x}) + C_R(\mathbf{x}) = 1$. Heuristic compatibility criteria were developed by considering learning errors and efficiency [28.81, 82]. By using the same constructive learning algorithms described above, an alternative implementation was also proposed by using the self-organization map as a partitioning mechanism and SVMs were used for subproblem solving [28.84]. Empirical studies suggest that the partitioning-based strategy leads to favorable results in various supervised learning tasks despite different implementations [28.81, 82, 84].

By applying the divide-and-conquer principle, task decomposition [28.83] is yet another constructive modularization learning strategy for classification. Unlike the partitioning-based strategy, the task decomposition strategy converts a multi-class classification task into
a number of binary classification subtasks in a brute-force way and each binary classification subtask is expected to be fulfilled by a simple NN. If a subtask is too difficult to carry out by a given NN, the subtask is allowed to be further decomposed into simpler binary classification subtasks. For a multi-class classification task of M categories, the task decomposition strategy first exhaustively decomposes it into $\frac{1}{2} M(M-1)$ different primary binary subtasks where each subtask merely concerns classification between two different classes without taking the remaining $M-2$ classes into account, which differs from the commonly used one-against-rest decomposition method. In general, the original multi-class classification task may be decomposed into more binary subtasks if some primary subtasks are too difficult. Once the decomposition is completed, all the subtasks are undertaken by pre-selected simple NNs, e.g., MLP of one hidden layer, in parallel. For a final solution to the original problem, three non-learnable operations, min, max, and inv, were proposed to combine individual binary classification results achieved by all the component NNs. By applying the three operations properly, all the component NNs are integrated together to form a min-max MNN [28.83].

28.3.4 Relevant Issues

In general, studies of MNNs closely relate to several areas in different disciplines, e.g., ML and statistics. We here examine several important issues related to MNNs in a wider context.

As described above, a tightly coupled MNN leads to an optimal solution to a given supervised learning problem. The MoE is rooted in the finite mixture model (FMM) studied in probability and statistics and becomes a non-trivial extension to conditional models where each expert is a parametric conditional probabilistic model and the mixture coefficients also depend on input [28.64]. While the MoE has been well studied for 20 years [28.59] in different disciplines, there still exist some open problems in general, e.g., model selection, global optimal solution, and convergence of its learning algorithms for arbitrary component models. Different from the FMM, the product of experts (PoE) [28.42] was also proposed to combine a number of experts (parametric probabilistic models) by taking their product and normalizing the result. The PoE has been argued to have some advantages over the MoE [28.42] but has so far merely been developed in the context of unsupervised learning. As a result, extending the PoE to conditional models for supervised learning would be a non-trivial topic in tightly coupled MNN studies. On the other hand, the NCL [28.65] directly applies the bias-variance analysis [28.60, 61] to construction of an MNN. This implies that MNNs could also be built up via alternative loss functions that properly promote diversities among component MNNs during learning.

Almost all existing NN ensemble methods are now included in ensemble learning [28.85], which is an important area in ML, or the multiple classifier system [28.62] in the context of pattern recognition. In statistical ensemble learning, generic frameworks, e.g., boosting [28.86] and bootstrapping [28.87], were developed to construct ensemble learners where any learning models including NNs may be used as component learners. Hence, most of the common issues raised for ensemble learning are applicable to NN ensembles. Nevertheless, ensemble learning research suggests that behaviors of component learners may considerably affect the stability and overall performance of ensemble learning. As exemplified in [28.88], properties of different NN ensembles are worth investigating from both theoretical and application perspectives.

While constructive modularization learning provides an alternative way of model selection, it is generally a less developed area in MNNs, and existing methods are subject to limitation due to a lack of theoretical justification and underpinning techniques. For example, a critical issue in the partitioning-based strategy [28.81, 82] is how to measure the nature of a subproblem to decide if any further partitioning is required and the appropriateness of a pre-selected NN to a subproblem in terms of its capacity. In previous studies [28.81, 82], a number of heuristic and simple criteria were proposed based on learning errors and efficiency. Although such heuristic criteria work practically, there is no theoretical justification. As a result, more sophisticated compatibility criteria need to be developed for such a constructive learning strategy based on the latest ML development, e.g., manifold and adaptive kernel learning. Fortunately, the partitioning-based strategy has inspired the latest developments in ML [28.89]. In general, constructive modularization learning is still a non-trivial topic in MNN research.

Finally, it is worth stating that our MNN review here only focuses on supervised learning due to the limited space. Most MNNs described above may be extended to other learning paradigms, e.g., semi-supervised and unsupervised learning. More details on such topics are available in the literature, e.g., [28.90, 91].
work research as well. Apart from other non-trivial issues discussed in the chapter, it is worth emphasizing that it is still an open problem to develop large-scale DNNs and MNNs and integrate them for

gies from different perspectives will mutually benefit each other and lead to effective solutions to common challenging problems in the NC and ML communities.
References
28.1 E.R. Kandel, J.H. Schwartz, T.M. Jessell: Principles of Neural Science, 4th edn. (McGraw-Hill, New York 2000)
28.2 G.M. Edelman: Neural Darwinism: Theory of Neural Group Selection (Basic Books, New York 1987)
28.3 J.A. Fodor: The Modularity of Mind (MIT Press, Cambridge 1983)
28.4 F. Azam: Biologically inspired modular neural networks, Ph.D. Thesis (School of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg 2000)
28.5 G. Auda, M. Kamel: Modular neural networks: A survey, Int. J. Neural Syst. 9(2), 129–151 (1999)
28.6 D.C. Van Essen, C.H. Anderson, D.J. Felleman: Information processing in the primate visual system, Science 255, 419–423 (1992)
28.7 J.H. Kaas: Why does the brain have so many visual areas?, J. Cogn. Neurosci. 1(2), 121–135 (1989)
28.8 G. Bugmann: Biologically plausible neural computation, Biosystems 40(1), 11–19 (1997)
28.9 S. Haykin: Neural Networks and Learning Machines, 3rd edn. (Prentice Hall, New York 2009)
28.10 M. Minsky, S. Papert: Perceptrons (MIT Press, Cambridge 1969)
28.11 D.E. Rumelhart, G.E. Hinton, R.J. Williams: Learning internal representations by error propagation, Nature 323, 533–536 (1986)
28.12 Y. LeCun, L. Bottou, Y. Bengio, P. Haffner: Gradient based learning applied to document recognition, Proc. IEEE 86(9), 2278–2324 (1998)
28.13 G. Tesauro: Practical issues in temporal difference learning, Mach. Learn. 8(2), 257–277 (1992)
28.14 G. Cybenko: Approximations by superpositions of sigmoidal functions, Math. Control Signals Syst. 2(4), 302–314 (1989)
28.15 N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge University Press, Cambridge 2000)
28.16 Y. Bengio, Y. LeCun: Scaling learning algorithms towards AI. In: Large-Scale Kernel Machines, ed. by L. Bottou, O. Chapelle, D. DeCoste, J. Weston (MIT Press, Cambridge 2006), Chap. 14
28.17 Y. Bengio: Learning deep architectures for AI, Found. Trends Mach. Learn. 2(1), 1–127 (2009)
28.18 G.E. Hinton, S. Osindero, Y. Teh: A fast learning algorithm for deep belief nets, Neural Comput. 18(9), 1527–1554 (2006)
28.19 Y. Bengio: Deep learning of representations for unsupervised and transfer learning, JMLR: Workshop Conf. Proc., Vol. 7 (2011) pp. 1–20
28.20 H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio: An empirical evaluation of deep architectures on problems with many factors of variation, Proc. Int. Conf. Mach. Learn. (ICML) (2007) pp. 473–480
28.21 R. Salakhutdinov, G.E. Hinton: Learning a nonlinear embedding by preserving class neighbourhood structure, Proc. Int. Conf. Artif. Intell. Stat. (AISTATS) (2007)
28.22 H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin: Exploring strategies for training deep neural networks, J. Mach. Learn. Res. 10(1), 1–40 (2009)
28.23 W.K. Wong, M. Sun: Deep learning regularized Fisher mappings, IEEE Trans. Neural Netw. 22(10), 1668–1675 (2011)
28.24 S. Osindero, G.E. Hinton: Modeling image patches with a directed hierarchy of Markov random field, Adv. Neural Inf. Process. Syst. (NIPS) (2007) pp. 1121–1128
28.25 I. Levner: Data driven object segmentation, Ph.D. Thesis (Department of Computer Science, University of Alberta, Edmonton 2008)
28.26 H. Mobahi, R. Collobert, J. Weston: Deep learning from temporal coherence in video, Proc. Int. Conf. Mach. Learn. (ICML) (2009) pp. 737–744
28.27 H. Lee, Y. Largman, P. Pham, A. Ng: Unsupervised feature learning for audio classification using convolutional deep belief networks, Adv. Neural Inf. Process. Syst. (NIPS) (2009)
28.28 K. Chen, A. Salman: Learning speaker-specific characteristics with a deep neural architecture, IEEE Trans. Neural Netw. 22(11), 1744–1756 (2011)
28.29 K. Chen, A. Salman: Extracting speaker-specific information with a regularized Siamese deep network, Adv. Neural Inf. Process. Syst. (NIPS) (2011)
28.30 A. Mohamed, G.E. Dahl, G.E. Hinton: Acoustic modeling using deep belief networks, IEEE Trans. Audio Speech Lang. Process. 20(1), 14–22 (2012)
28.31 G.E. Dahl, D. Yu, L. Deng, A. Acero: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Trans. Audio Speech Lang. Process. 20(1), 30–42 (2012)
28.32 R. Salakhutdinov, G.E. Hinton: Semantic hashing, Proc. SIGIR Workshop Inf. Retr. Appl. Graph. Model. (2007)
28.33 M. Ranzato, M. Szummer: Semi-supervised learning of compact document representations with deep networks, Proc. Int. Conf. Mach. Learn. (ICML) (2008)
28.34 A. Torralba, R. Fergus, Y. Weiss: Small codes and large databases for recognition, Proc. Int. Conf. Comput. Vis. Pattern Recogn. (CVPR) (2008) pp. 1–8
28.35 R. Collobert, J. Weston: A unified architecture for natural language processing: Deep neural networks with multitask learning, Proc. Int. Conf. Mach. Learn. (ICML) (2008)
28.36 A. Mnih, G.E. Hinton: A scalable hierarchical distributed language model, Adv. Neural Inf. Process. Syst. (NIPS) (2008)
28.37 J. Weston, F. Ratle, R. Collobert: Deep learning via semi-supervised embedding, Proc. Int. Conf. Mach. Learn. (ICML) (2008)
28.38 R. Hadsell, A. Erkan, P. Sermanet, M. Scoffier, U. Muller, Y. LeCun: Deep belief net learning in a long-range vision system for autonomous off-road driving, Proc. Intell. Robots Syst. (IROS) (2008) pp. 628–633
28.39 Y. Bengio, A. Courville, P. Vincent: Representation learning: A review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1827 (2013)
28.40 Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle: Greedy layer-wise training of deep networks, Adv. Neural Inf. Process. Syst. (NIPS) (2006)
28.41 P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11, 3371–3408 (2010)
28.42 G.E. Hinton: Training products of experts by minimizing contrastive divergence, Neural Comput. 14(10), 1771–1800 (2002)
28.43 K. Kavukcuoglu, M. Ranzato, Y. LeCun: Fast inference in sparse coding algorithms with applications to object recognition. CoRR, arXiv:1010.3467 (2010)
28.44 B.A. Olshausen, D.J. Field: Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vis. Res. 37, 3311–3325 (1997)
28.45 S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio: Contractive auto-encoders: Explicit invariance during feature extraction, Proc. Int. Conf. Mach. Learn. (ICML) (2011)
28.46 S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, X. Glorot: Higher order contractive auto-encoder, Proc. Eur. Conf. Mach. Learn. (ECML) (2011)
28.47 M. Ranzato, C. Poultney, S. Chopra, Y. LeCun: Efficient learning of sparse representations with an energy based model, Adv. Neural Inf. Process. Syst. (NIPS) (2006)
28.48 M. Ranzato, Y. Boureau, Y. LeCun: Sparse feature learning for deep belief networks, Adv. Neural Inf. Process. Syst. (NIPS) (2007)
28.49 G.E. Hinton, R. Salakhutdinov: Reducing the dimensionality of data with neural networks, Science 313, 504–507 (2006)
28.50 K. Cho, A. Ilin, T. Raiko: Improved learning of Gaussian-Bernoulli restricted Boltzmann machines, Proc. Int. Conf. Artif. Neural Netw. (ICANN) (2011)
28.51 D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, S. Bengio: Why does unsupervised pre-training help deep learning?, J. Mach. Learn. Res. 11, 625–660 (2010)
28.52 D.C. Ciresan, U. Meier, L.M. Gambardella, J. Schmidhuber: Deep big simple neural nets for handwritten digit recognition, Neural Comput. 22(1), 1–14 (2010)
28.53 M. Ranzato, A. Krizhevsky, G.E. Hinton: Factored 3-way restricted Boltzmann machines for modeling natural images, Proc. Int. Conf. Artif. Intell. Stat. (AISTATS) (2010) pp. 621–628
28.54 M. Ranzato, V. Mnih, G.E. Hinton: Generating more realistic images using gated MRF’s, Adv. Neural Inf. Process. Syst. (NIPS) (2010)
28.55 A. Courville, J. Bergstra, Y. Bengio: Unsupervised models of images by spike-and-slab RBMs, Proc. Int. Conf. Mach. Learn. (ICML) (2011)
28.56 H. Lee, R. Grosse, R. Ranganath, A.Y. Ng: Unsupervised learning of hierarchical representations with convolutional deep belief networks, Commun. ACM 54(10), 95–103 (2011)
28.57 D. Hau, K. Chen: Exploring hierarchical speech representations with a deep convolutional neural network, Proc. U.K. Workshop Comput. Intell. (UKCI) (2011)
28.58 Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng: Building high-level features using large scale unsupervised learning, Proc. Int. Conf. Mach. Learn. (ICML) (2012)
28.59 S.E. Yuksel, J.N. Wilson, P.D. Gader: Twenty years of mixture of experts, IEEE Trans. Neural Netw. Learn. Syst. 23(8), 1177–1193 (2012)
28.60 A. Krogh, J. Vedelsby: Neural network ensembles, cross validation, and active learning, Adv. Neural Inf. Process. Syst. (NIPS) (1995)
28.61 N. Ueda, R. Nakano: Generalization error of ensemble estimators, Proc. Int. Conf. Neural Netw. (ICNN) (1996) pp. 90–95
28.62 L.I. Kuncheva: Combining Pattern Classifiers (Wiley-Interscience, Hoboken 2004)
28.63 R.A. Jacobs, M.I. Jordan, S. Nowlan, G.E. Hinton: Adaptive mixture of local experts, Neural Comput. 3(1), 79–87 (1991)
28.64 M.I. Jordan, R.A. Jacobs: Hierarchical mixture of experts and the EM algorithm, Neural Comput. 6(2), 181–214 (1994)
28.65 Y. Liu, X. Yao: Simultaneous training of negatively correlated neural networks in an ensemble, IEEE Trans. Syst. Man Cybern. B 29(6), 716–725 (1999)
28.66 K. Chen, L. Xu, H.S. Chi: Improved learning algorithms for mixture of experts in multi-class classification, Neural Netw. 12(9), 1229–1252 (1999)
28.67 L.K. Hansen, P. Salamon: Neural network ensembles, IEEE Trans. Pattern Anal. Mach. Intell. 12(10), 993–1001 (1990)
28.68 M.P. Perrone, L.N. Cooper: Ensemble methods for hybrid neural networks. In: Artificial Neural Networks for Speech and Vision, ed. by R.J. Mammone (Chapman-Hall, New York 1993) pp. 126–142
28.69 T.K. Ho: The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 823–844 (1998)
28.70 K. Chen, L. Wang, H.S. Chi: Methods of combining multiple classifiers with different feature sets and their applications to text-independent speaker identification, Int. J. Pattern Recogn. Artif. Intell. 11(3), 417–445 (1997)
28.71 J. Kittler, M. Hatef, R.P.W. Duin, J. Matas: On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)
28.72 D.H. Wolpert: Stacked generalization, Neural Netw. 2(3), 241–259 (1992)
28.73 L. Xu, M.I. Jordan, G.E. Hinton: An alternative model for mixtures of experts, Adv. Neural Inf. Process. Syst. (NIPS) (1995)
28.74 L. Xu, A. Krzyzak, C.Y. Suen: Associative switch for combining multiple classifiers, J. Artif. Neural Netw. 1(1), 77–100 (1994)
28.75 K. Chen: A connectionist method for pattern classification with diverse feature sets, Pattern Recogn. Lett. 19(7), 545–558 (1998)
28.76 K. Chen, H.S. Chi: A method of combining multiple probabilistic classifiers through soft competition on different feature sets, Neurocomputing 20(1–3), 227–252 (1998)
28.77 K. Chen: On the use of different representations for speaker modeling, IEEE Trans. Syst. Man Cybern. C 35(3), 328–346 (2005)
28.78 Y. Yang, K. Chen: Temporal data clustering via weighted clustering ensemble with different representations, IEEE Trans. Knowl. Data Eng. 23(2), 307–320 (2011)
28.79 Y. Yang, K. Chen: Time series clustering via RPCL ensemble networks with different representations, IEEE Trans. Syst. Man Cybern. C 41(2), 190–199 (2011)
28.80 L. Xu, A. Krzyzak, C.Y. Suen: Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. Syst. Man Cybern. 22(3), 418–435 (1992)
28.81 K. Chen, X. Yu, H.S. Chi: Combining linear discriminant functions with neural networks for supervised learning, Neural Comput. Appl. 6(1), 19–41 (1997)
28.82 K. Chen, L.P. Yang, X. Yu, H.S. Chi: A self-generating modular neural network architecture for supervised learning, Neurocomputing 16(1), 33–48 (1997)
28.83 B.L. Lu, M. Ito: Task decomposition and module combination based on class relations: A modular neural network for pattern classification, IEEE Trans. Neural Netw. 10(5), 1244–1256 (1999)
28.84 L. Cao: Support vector machines experts for time series forecasting, Neurocomputing 51(3), 321–339 (2003)
28.85 T.G. Dietterich: Ensemble learning. In: Handbook of Brain Theory and Neural Networks, ed. by M.A. Arbib (MIT Press, Cambridge 2002) pp. 405–408
28.86 Y. Freund, R.E. Schapire: Experiments with a new boosting algorithm, Proc. Int. Conf. Mach. Learn. (ICML) (1996) pp. 148–156
28.87 L. Breiman: Bagging predictors, Mach. Learn. 24(2), 123–140 (1996)
28.88 H. Schwenk, Y. Bengio: Boosting neural networks, Neural Comput. 12(8), 1869–1887 (2000)
28.89 J. Wang, V. Saligrama: Local supervised learning through space partitioning, Adv. Neural Inf. Process. Syst. (NIPS) (2012)
28.90 X.J. Zhu: Semi-supervised learning literature survey, Technical Report, School of Computer Science (University of Wisconsin, Madison 2008)
28.91 J. Ghosh, A. Acharya: Cluster ensembles, WIREs Data Min. Knowl. Discov. 1(2), 305–315 (2011)