
Deep Machine Learning - A New Frontier in Artificial Intelligence Research

Itamar Arel, Derek C. Rose, and Thomas P. Karnowski
The University of Tennessee, USA

I. Introduction
Mimicking the efficiency and robustness by which the human brain represents information has been a core challenge in artificial intelligence research for decades. Humans are exposed to a myriad of sensory data received every second of the day and are somehow able to capture critical aspects of this data in a way that allows for its future use in a concise manner. Over 50 years ago, Richard Bellman, who introduced dynamic programming theory and pioneered the field of optimal control, asserted that high dimensionality of data is a fundamental hurdle in many science and engineering applications. The main difficulty that arises, particularly in the context of pattern classification applications, is that the learning complexity grows exponentially with a linear increase in the dimensionality of the data. He coined this phenomenon the curse of dimensionality [1]. The mainstream approach to overcoming the curse has been to pre-process the data in a manner that reduces its dimensionality to one that can be effectively processed, for example by a classification engine. This dimensionality reduction scheme is often referred to as feature extraction. As a result, it can be argued that the intelligence behind many pattern recognition systems has shifted to the human-engineered feature extraction process, which at times can be challenging and highly application-dependent [2]. Moreover, if incomplete or erroneous features are extracted, the classification process is inherently limited in performance.

Recent neuroscience findings have provided insight into the principles governing information representation in the mammalian brain, leading to new ideas for designing systems that represent information. One of the key findings has been that the neocortex, which is associated with many cognitive abilities, does not explicitly pre-process sensory signals, but rather allows them to propagate through a complex hierarchy [3] of modules that, over time, learn to represent observations based on the regularities they exhibit [4]. This discovery motivated the emergence of the subfield of deep machine learning, which focuses on computational models for information representation that exhibit characteristics similar to those of the neocortex.

In addition to the spatial dimensionality of real-life data, the temporal component also plays a key role. An observed sequence of patterns often conveys a meaning to the observer, whereby independent fragments of this sequence would be hard to decipher in isolation. Meaning is often inferred from events or observations that are received closely in time [5] [6]. To that end, modeling the temporal component of the observations plays a critical role in effective information representation. Capturing spatiotemporal dependencies, based on regularities in the observations, is therefore viewed as a fundamental goal for deep learning systems.

Assuming robust deep learning is achieved, it would be possible to train such a hierarchical network on a large set of observations and later extract signals from this network to a relatively simple classification engine for the purpose of robust pattern recognition. Robustness here refers to the ability to exhibit classification invariance to a diverse range of transformations and distortions, including noise, scale, rotation, various lighting conditions, displacement, etc.

This article provides an overview of the mainstream deep learning approaches and research directions proposed over the past decade. It is important to emphasize that each approach has strengths and weaknesses, depending on the application and context in which it is being used. Thus, this article presents a summary of the current state of the deep machine learning field and some perspective into how it may evolve. Convolutional Neural Networks (CNNs) and Deep Belief Networks (DBNs) (and their respective variations) are focused on primarily because they are well established in the deep learning field and show great promise for future work. Section II introduces CNNs, followed by details of DBNs in Section III. For an excellent further in-depth look at the foundations of these technologies, the reader is referred to [7].

Digital Object Identifier 10.1109/MCI.2010.938364
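Bellman's curse of dimensionality, discussed above, can be made concrete with a two-line count. This is our illustrative sketch, not an example from the article: gridding the input space at a fixed per-axis resolution requires a number of cells that is exponential in the dimension, so the data needed to populate the grid grows accordingly.

```python
# Illustrative sketch: a regular grid over the unit hypercube [0, 1]^d with
# 10 bins per axis has 10^d cells, so the number of samples needed to cover
# the input space grows exponentially with a linear increase in dimension d.
def bins_needed(d, bins_per_axis=10):
    """Number of cells in a regular grid over [0, 1]^d."""
    return bins_per_axis ** d

for d in (1, 2, 3, 10):
    print(f"d = {d:2d}: {bins_needed(d)} cells")
```

Already at d = 10 the grid has ten billion cells, which is why dimensionality reduction (feature extraction) became the mainstream workaround.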

1556-603X/10/$26.00 ©2010 IEEE | NOVEMBER 2010 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE


Section IV contains other deep architectures that are currently being proposed. Section V contains a brief note about how this research has impacted government and industry initiatives. The conclusion provides a perspective on the potential impact of deep-layered architectures as well as key questions that remain to be answered.

II. Convolutional Neural Networks
CNNs [8] [9] are a family of multi-layer neural networks particularly designed for use on two-dimensional data, such as images and videos. CNNs are influenced by earlier work in time-delay neural networks (TDNN), which reduce learning computation requirements by sharing weights in a temporal dimension and are intended for speech and time-series processing [53]. CNNs are the first truly successful deep learning approach in which many layers of a hierarchy are successfully trained in a robust manner. A CNN is a choice of topology or architecture that leverages spatial relationships to reduce the number of parameters which must be learned and thus improves upon general feed-forward back propagation training. CNNs were proposed as a deep learning framework that is motivated by minimal data preprocessing requirements. In CNNs, small portions of the image (dubbed a local receptive field) are treated as inputs to the lowest layer of the hierarchical structure. Information generally propagates through the different layers of the network, whereby at each layer digital filtering is applied in order to obtain salient features of the data observed. The method provides a level of invariance to shift, scale and rotation, as the local receptive field allows the neuron or processing unit access to elementary features such as oriented edges or corners.

FIGURE 1 The convolution and subsampling process: the convolution process consists of convolving an input (image for the first stage or feature map for later stages) with a trainable filter fx, then adding a trainable bias bx to produce the convolution layer Cx. The subsampling consists of summing a neighborhood (four pixels), weighting by scalar wx+1, adding trainable bias bx+1, and passing through a sigmoid function to produce a roughly 2× smaller feature map Sx+1.

One of the seminal papers on the topic [8] describes an application of CNNs to the classification of handwritten digits in the MNIST database. Essentially, the input image is convolved with a set of N small filters whose coefficients are either trained or pre-determined using some criteria. Thus, the first (or lowest) layer of the network consists of feature maps which are the result of the convolution processes, with an additive bias and possibly a compression or normalization of the features. This initial stage is followed by a subsampling (typically a 2×2 averaging operation) that further reduces the dimensionality and offers some robustness to spatial shifts (see Figure 1). The subsampled feature map then receives a weighting and trainable bias and finally propagates through an activation function. Some variants of this exist with as few as one map per layer [13] or summations of multiple maps [8].

When the weighting is small, the activation function is nearly linear and the result is a blurring of the image; other weightings can cause the activation output to resemble an AND or OR function. These outputs form a new feature map that is then passed through another sequence of convolution, sub-sampling and activation function flow, as illustrated in Figure 2. This process can be repeated an arbitrary number of times.

FIGURE 2 Conceptual example of a convolutional neural network. The input image is convolved with three trainable filters and biases as in Figure 1 to produce three feature maps at the C1 level. Each group of four pixels in the feature maps is added, weighted, combined with a bias, and passed through a sigmoid function to produce the three feature maps at S2. These are again filtered to produce the C3 level. The hierarchy then produces S4 in a manner analogous to S2. Finally, these pixel values are rasterized and presented as a single vector input to the conventional neural network at the output.

It should be noted that subsequent layers can combine one or more of the previous layers; for example, in [8] the initial six feature maps are combined to form 16 feature maps in the subsequent layer. As described in [33], CNNs create their invariance to object translations by a method dubbed feature pooling (the S layers in Figure 2). However, feature pooling is hand crafted by the network organizer, not trained or learned by the system; in CNNs, the pooling is tuned by parameters in the learning process, but the basic mechanism (the combination of inputs to the S layers, for example) is set by the network designer.
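The convolution-plus-subsampling stage of Figure 1 can be condensed into a short sketch. This is a hedged illustration in plain Python, not the code of [8]; the valid convolution, 2×2 averaging, weighting, bias and sigmoid follow the caption above:

```python
import math

# Minimal sketch of one CNN stage: convolve the input map with a trainable
# filter f and bias b to get a convolution layer C, then average 2x2 blocks,
# weight, add a bias, and squash through a sigmoid to get the feature map S.

def convolve2d(img, filt, bias):
    """Valid 2-D convolution (really cross-correlation, as in most CNNs)."""
    fh, fw = len(filt), len(filt[0])
    out = []
    for i in range(len(img) - fh + 1):
        row = []
        for j in range(len(img[0]) - fw + 1):
            s = sum(img[i + u][j + v] * filt[u][v]
                    for u in range(fh) for v in range(fw))
            row.append(s + bias)
        out.append(row)
    return out

def subsample2x2(fmap, w, bias):
    """Average each 2x2 block, weight by w, add bias, apply a sigmoid."""
    out = []
    for i in range(0, len(fmap) - 1, 2):
        row = []
        for j in range(0, len(fmap[0]) - 1, 2):
            s = (fmap[i][j] + fmap[i][j + 1]
                 + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            row.append(1.0 / (1.0 + math.exp(-(w * s + bias))))
        out.append(row)
    return out
```

For a 6×6 input and a 3×3 filter, this yields a 4×4 convolution layer C and, after subsampling, a 2×2 feature map S, mirroring the roughly 2× reduction noted in the caption.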



Finally, at the final stage of the process, the activation outputs are forwarded to a conventional feedforward neural network that produces the final output of the system.

The intimate relationship between the layers and spatial information in CNNs renders them well suited for image processing and understanding, and they generally perform well at autonomously extracting salient features from images. In some cases Gabor filters have been used as an initial pre-processing step to emulate the human visual response to visual excitation [10]. In more recent work, researchers have applied CNNs to various machine learning problems including face detection [11] [13], document analysis [38], and speech detection [12]. CNNs have recently [25] been trained with a temporal coherence objective to leverage the frame-to-frame coherence found in videos, though this objective need not be specific to CNNs.

III. Deep Belief Networks
DBNs, initially introduced in [14], are probabilistic generative models that stand in contrast to the discriminative nature of traditional neural nets. Generative models provide a joint probability distribution over observable data and labels, facilitating the estimation of both P(Observation|Label) and P(Label|Observation), while discriminative models are limited to the latter, P(Label|Observation). DBNs address problems encountered when traditionally applying back-propagation to deeply-layered neural networks, namely: (1) the necessity of a substantial labeled data set for training, (2) slow learning (i.e., convergence) times, and (3) inadequate parameter selection techniques that lead to poor local optima.

DBNs are composed of several layers of Restricted Boltzmann Machines (RBMs), a type of neural network (see Figure 3). These networks are restricted to a single visible layer and a single hidden layer, where connections are formed between the layers (units within a layer are not connected). The hidden units are trained to capture higher-order data correlations that are observed at the visible units. Initially, aside from the top two layers, which form an associative memory, the layers of a DBN are connected only by directed top-down generative weights. RBMs are attractive as a building block, over more traditional and deeply layered sigmoid belief networks, due to their ease of learning these connection weights.

FIGURE 3 Illustration of the Deep Belief Network framework: an observation vector v (e.g., a 32×32 image) feeds the visible units of an RBM layer; stacked hidden layers carry detection and generative weights, and the top-level units, together with label units, form an associative memory.

To obtain generative weights, the initial pre-training occurs in an unsupervised greedy layer-by-layer manner, enabled by what Hinton has termed contrastive divergence [15]. During this training phase, a vector v is presented to the visible units, which forward values to the hidden units. Going in reverse, the visible unit inputs are then stochastically found in an attempt to reconstruct the original input. Finally, these new visible neuron activations are forwarded such that one-step reconstruction hidden unit activations, h, can be attained. Performing these back-and-forth steps is a process known as Gibbs sampling, and the difference in the correlation of the hidden activations and visible inputs forms the basis for a weight update. Training time is significantly reduced, as it can be shown that only a single step is needed to approximate maximum likelihood learning. Each layer added to the network improves the log-probability of the training data, which we can think of as increasing true representational power. This meaningful expansion, in conjunction with the utilization of unlabeled data, is a critical component in any deep learning application.

At the top two layers, the weights are tied together, such that the output of the lower layers provides a reference clue or link for the top layer to associate with its memory contents. We often encounter problems where discriminative performance is of ultimate concern, e.g., in classification tasks. A DBN may be fine-tuned after pre-training for improved discriminative performance by utilizing labeled data through back-propagation. At this point, a set of labels is attached to the top layer (expanding the associative memory) to clarify category boundaries in the network, through which a new set of bottom-up, recognition weights are learned. It has been shown in [16] that such networks often perform better than those trained exclusively with back-propagation. This may be intuitively explained by the fact that back-propagation for DBNs is only required to perform a local search on the weight (parameter) space, speeding training and convergence time in relation to traditional feed-forward neural networks.
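For a single binary RBM, the pre-training procedure described above reduces to a compact update rule. The following is a hedged sketch, not the implementation of [14] or [15]: biases, mini-batching and learning-rate schedules are all omitted for brevity.

```python
import math
import random

# One CD-1 (contrastive divergence) step for a binary RBM: sample hidden
# units h from the data vector v, stochastically reconstruct v', compute the
# hidden probabilities for v', and update W by the difference of the two
# correlation terms <v h> (data) and <v' h'> (reconstruction).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(p):
    """Draw a binary sample with probability p of being 1 (Gibbs sampling)."""
    return 1 if random.random() < p else 0

def cd1_update(W, v, lr=0.1):
    """Apply one CD-1 weight update in place; W is |v| x |h| (no biases)."""
    n_v, n_h = len(W), len(W[0])
    # up-pass: hidden probabilities and samples given the data vector v
    ph = [sigmoid(sum(v[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    h = [sample(p) for p in ph]
    # down-pass: stochastic reconstruction v' of the visible units
    pv = [sigmoid(sum(h[j] * W[i][j] for j in range(n_h))) for i in range(n_v)]
    v1 = [sample(p) for p in pv]
    # second up-pass: hidden probabilities for the reconstruction
    ph1 = [sigmoid(sum(v1[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    # positive (data) correlation minus negative (reconstruction) correlation
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += lr * (v[i] * ph[j] - v1[i] * ph1[j])
    return W
```

Stacking then proceeds greedily: once a layer's weights converge, its hidden activations become the "visible" data for the next RBM up the hierarchy.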



Performance results obtained when applying DBNs to the MNIST handwritten character recognition task have demonstrated significant improvement over feedforward networks. Shortly after DBNs were introduced, a more thorough analysis presented in [17] solidified their use with unsupervised tasks as well as continuous-valued inputs. Further tests in [18] [19] illustrated the resilience of DBNs (as well as other deep architectures) on problems with increasing variation.

The flexibility of DBNs was recently expanded [20] by introducing the notion of Convolutional Deep Belief Networks (CDBNs). DBNs do not inherently embed information about the 2D structure of an input image, i.e., inputs are simply vectorized formats of an image matrix. In contrast, CDBNs utilize the spatial relationship of neighboring pixels with the introduction of what are termed convolutional RBMs to provide a translation-invariant generative model that scales well with high-dimensional images. DBNs do not currently explicitly address learning the temporal relationships between observables, though there has been recent work in stacking temporal RBMs [22] or generalizations of these, dubbed temporal convolution machines [23], for learning sequences. The application of such sequence learners to audio signal processing problems, where DBNs have made recent headway [24], offers an avenue for exciting future research.

Static image testing for DBNs and CNNs occurs most commonly with the MNIST database [27] of handwritten digits and the Caltech-101 database [28] of various objects (belonging to 101 categories). Classification error rates for each of the architectures can be found in [19] [20] [21]. A comprehensive and up-to-date performance comparison for various machine learning techniques applied to the MNIST database is provided in [27]. Recent works pertaining to DBNs include the use of stacked auto-encoders in place of RBMs in traditional DBNs [17] [18] [21]. This effort produced deep multi-layer neural network architectures that can be trained with the same principles as DBNs but are less strict in the parameterization of the layers. Unlike DBNs, auto-encoders use discriminative models from which the input sample space cannot be sampled by the architecture, making it more difficult to interpret what the network is capturing in its internal representation. However, it has been shown [21] that denoising auto-encoders, which utilize stochastic corruption during training, can be stacked to yield generalization performance that is comparable to (and in some cases better than) traditional DBNs. The training procedure for a single denoising autoencoder corresponds to the goals used for generative models such as RBMs.

IV. Recently Proposed Deep Learning Architectures
There are several computational architectures that attempt to model the neocortex. These models have been inspired by sources such as [42], which attempt to map various computational phases in image understanding to areas in the cortex. Over time these models have been refined; however, the central concept of visual processing over a hierarchical structure has remained. These models invoke the simple-to-complex cell organization of Hubel and Wiesel [44], which was based on studies of the visual cortical cells of cats.

Similar organizations are utilized by CNNs as well as other deep-layered models (such as the Neocognitron [40] [41] [43] and HMAX [32] [45]), yet more explicit cortical models seek a stronger mapping of their architecture to biologically-inspired models. In particular, they attempt to solve problems of learning and invariance through diverse mechanisms such as temporal analysis, in which time is considered an inseparable element of the learning process.

One prominent example is Hierarchical Temporal Memory (HTM), developed at the Numenta Corporation [30] [33]. HTMs have a hierarchical structure based on concepts described in [39] and bear similarities to other work pertaining to the modeling of cortical circuits. With a specific focus on visual information representation, in an HTM the lowest level of the hierarchy receives its inputs from a small region of an input image. Higher levels of the hierarchy correspond to larger regions (or receptive fields) as they incorporate the representation constructs of multiple lower receptive fields. In addition to the scaling change across layers of the hierarchy, there is an important temporal-based aspect to each layer, which is created by translation or scanning of the input image itself.

During the learning phase, the first layer compiles the most common input patterns and assigns indices to them. Temporal relationships are modeled as probability transitions from one input sequence to another and are clustered together using graph partitioning techniques. When this stage of learning concludes, the subsequent (second) layer concatenates the indices of the currently observed inputs from its children modules and learns the most common concatenations as an alphabet (another group of common input sequences, but at a higher level). The higher layers' characterization can then be provided as feedback down to the lower-level modules. The lower level, in turn, incorporates this broader representation information into its own inference formulation. This process is repeated at each layer of the hierarchy. After a network is trained, image recognition is performed using the Bayesian belief propagation algorithm [46] to identify the most likely input pattern given the beliefs at the highest layer of the hierarchy (which corresponds to the broadest image scope). Other architectures proposed in the literature, which resemble HTMs, include the Hierarchical Quilted SOMs of Miller & Lommel [47], which employ two-stage spatial clustering and temporal clustering using self-organizing maps, and the Neural Abstraction Pyramid of Behnke [48].

A framework recently introduced by the authors for achieving robust information representation is the Deep SpatioTemporal Inference Network (DeSTIN) model [26]. In this framework, a common cortical circuit (or node) populates the entire hierarchy, and each of these nodes operates independently and in parallel to all other nodes.



TABLE I Summary of mainstream deep machine learning approaches.

APPROACH (ABBREVIATION) | UNSUPERVISED PRE-TRAINING? | GENERATIVE VS. DISCRIMINATIVE | NOTES
Convolutional Neural Networks (CNNs) | No | Discriminative | Utilizes spatial/temporal relationships to reduce learning requirements
Deep Belief Networks (DBNs) | Helpful | Generative | Multi-layered recurrent neural network trained with energy minimizing methods
Stacked (Denoising) Auto-Encoders | Helpful | Discriminative (denoising encoder maps to generative model) | Stacked neural networks that learn compressed encodings through reconstruction error
Hierarchical Temporal Memory (HTM) | No | Generative | Hierarchy of alternating spatial recognition and temporal inference layers with supervised learning method at top layer
Deep SpatioTemporal Inference Network (DeSTIN) | No | Discriminative | Hierarchy of unsupervised spatial-temporal clustering units with Bayesian state-to-state transitions and top-down feedback
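The parameter-reduction note for CNNs in the table can be made concrete with a back-of-envelope count. The layer sizes below are our assumptions chosen for illustration, not figures from the article:

```python
# Compare learned parameters for mapping a 32x32 image to six 28x28 feature
# maps: a fully connected layer learns one weight per input-output pair,
# while a convolutional layer shares each 5x5 filter across the whole image.
def dense_params(n_inputs, n_outputs):
    """Fully connected layer: weights plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs

def conv_params(filter_h, filter_w, n_filters):
    """Convolutional layer: one shared filter plus one bias per feature map."""
    return (filter_h * filter_w + 1) * n_filters

full = dense_params(32 * 32, 6 * 28 * 28)   # millions of weights
shared = conv_params(5, 5, 6)               # a few hundred weights
```

Weight sharing collapses the learnable-parameter count by several orders of magnitude, which is precisely the "reduced learning requirements" the table refers to.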

This solution is not constrained to a layer-by-layer training procedure, making it highly attractive for implementation on parallel processing platforms. Nodes independently characterize patterns through the use of a belief state construct, which is incrementally updated as the hierarchy is presented with data. This rule is comprised of two constructs: one representing how likely system states are for segments of the observation, P(observation|state), and another representing how likely state-to-state transitions are given feedback from above, P(subsequent state|state, feedback). The first construct is unsupervised and driven purely by observations, while the second, modulating the first, embeds the dynamics in the pattern observations. Incremental clustering is carefully applied to estimate the observation distribution, while state transitions are estimated based on frequency. It is argued that the value of the scheme lies in its simplicity and repetitive structure, facilitating multi-modal representations and straightforward training.

Table I provides a brief comparison summary of the mainstream deep machine learning approaches described in this paper.

V. Deep Learning Applications
There have been several studies demonstrating the effectiveness of deep learning methods in a variety of application domains. In addition to the MNIST handwriting challenge [27], there are applications in face detection [10] [51], speech recognition and detection [12], general object recognition [9], natural language processing [24], and robotics. The reality of data proliferation and the abundance of multimodal sensory information is admittedly a challenge and a recurring theme in many military as well as civilian applications, such as sophisticated surveillance systems. Consequently, interest in deep machine learning has not been limited to academic research. Recently, the Defense Advanced Research Projects Agency (DARPA) announced a research program exclusively focused on deep learning [29]. Several private organizations, including Numenta [30] and Binatix [31], have focused their attention on commercializing deep learning technologies with applications to broad domains.

VI. The Road Ahead
Deep machine learning is an active area of research. There remains a great deal of work to be done in improving the learning process, where current focus is on lending fertile ideas from other areas of machine learning, specifically in the context of dimensionality reduction. One example includes recent work on sparse coding [57], where the inherent high dimensionality of data is reduced through the use of compressed sensing theory, allowing accurate representation of signals with very small numbers of basis vectors. Another example is semi-supervised manifold learning [58], where the dimensionality of data is reduced by measuring the similarity between training data samples, then projecting these similarity measurements to lower-dimensional spaces. In addition, further inspiration and techniques may be found in evolutionary programming approaches [59], [60], where conceptually adaptive learning and core architectural changes can be learned with minimal engineering effort.

Some of the core questions that necessitate immediate attention include: how well does a particular scheme scale with respect to the dimensionality of the input (which in images can be in the millions)? What is an efficient framework for capturing both short- and long-term temporal dependencies? How can multimodal sensory information be most naturally fused within a given architectural framework? What are the correct attention mechanisms that can be used to augment a given deep learning technology so as to improve robustness and invariance to distorted or missing data? How well do the various solutions map to parallel processing platforms that facilitate processing speedup?

While deep learning has been successfully applied to challenging pattern inference tasks, the goal of the field is far beyond task-specific applications. This scope may make the comparison of various methodologies increasingly complex and will likely necessitate a collaborative effort by the research community to address. It should also be noted that, despite the great prospect offered by deep learning technologies, some domain-specific tasks may not be directly improved by such schemes.



An example is identifying and reading the routing numbers at the bottom of bank checks. Though these digits are human readable, they are comprised of restricted character sets which specialized readers can recognize flawlessly at very high data rates [49]. Similarly, iris recognition is not a task that humans generally perform; indeed, without training, one iris looks very similar to another to the untrained eye, yet engineered systems can produce matches between candidate iris images and an image database with high precision and accuracy to serve as a unique identifier [50]. Finally, recent developments in facial recognition [51] show equivalent performance relative to humans in their ability to match query images against large numbers of candidates, potentially matching far more than most humans can recall [52]. Nevertheless, these remain highly specific cases and are the result of lengthy feature engineering optimization processes (as well as years of research) that do not map to other, more general applications. Furthermore, deep learning platforms can also benefit from engineered features while learning more complex representations which engineered systems typically lack.

Despite the myriad of open research issues and the fact that the field is still in its infancy, it is abundantly clear that advancements made with respect to developing deep machine learning systems will undoubtedly shape the future of machine learning and artificial intelligence systems in general.

References
[9] F.-J. Huang and Y. LeCun, "Large-scale learning with SVM and convolutional nets for generic object categorization," in Proc. Computer Vision and Pattern Recognition Conf. (CVPR'06), 2006.
[10] B. Kwolek, "Face detection using convolutional neural networks and Gabor filters," in Lecture Notes in Computer Science, vol. 3696, 2005, p. 551.
[11] F. H. C. Tivive and A. Bouzerdoum, "A new class of convolutional neural networks (SICoNNets) and their application of face detection," in Proc. Int. Joint Conf. Neural Networks, 2003, vol. 3, pp. 2157–2162.
[12] S. Sukittanon, A. C. Surendran, J. C. Platt, and C. J. C. Burges, "Convolutional networks for speech detection," Interspeech, pp. 1077–1080, 2004.
[13] Y.-N. Chen, C.-C. Han, C.-T. Wang, B.-S. Jeng, and K.-C. Fan, "The application of a convolution neural network on face and license plate detection," in Proc. 18th Int. Conf. Pattern Recognition (ICPR'06), 2006, pp. 552–555.
[14] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, pp. 1527–1554, 2006.
[15] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Comput., vol. 14, pp. 1771–1800, 2002.
[16] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[17] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems 19 (NIPS'06), 2007, pp. 153–160.
[18] M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in Proc. Computer Vision and Pattern Recognition Conf., 2007.
[19] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proc. 24th Int. Conf. Machine Learning (ICML'07), 2007, pp. 473–480.
[20] H. Lee, R. Grosse, R. Ranganath, and A. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. 26th Int. Conf. Machine Learning, 2009, pp. 609–616.
[21] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. 25th Int. Conf. Machine Learning (ICML'08), 2008, pp. 1096–1103.
[22] I. Sutskever and G. Hinton, "Learning multilevel distributed representations for high-dimensional sequences," in Proc. 11th Int. Conf. Artificial Intelligence and Statistics, 2007.
[23] A. Lockett and R. Miikkulainen, "Temporal convolution machines for sequence learning," Dept. Comput. Sci., Univ. Texas, Austin, Tech. Rep. AI-09-04, 2009.
[24] H. Lee, Y. Largman, P. Pham, and A. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems 22 (NIPS'09), 2009.
[25] H. Mobahi, R. Collobert, and J. Weston, "Deep learning from temporal coherence in video," in Proc. 26th Annu. Int. Conf. Machine Learning, 2009, pp. 737–744.
[26] I. Arel, D. Rose, and B. Coop, "DeSTIN: A deep
…tex," in Proc. Nat. Conf. Artificial Intelligence, 2007, vol. 22, p. 1597.
[35] T. Dean, "A computational model of the cerebral cortex," in Proc. Nat. Conf. Artificial Intelligence, 2005, vol. 20, pp. 938–943.
[36] T. S. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," J. Opt. Soc. Amer. A, vol. 20, no. 7, pp. 1434–1448, 2003.
[37] M. Szarvas, U. Sakai, and J. Ogata, "Real-time pedestrian detection using LIDAR and convolutional neural networks," in Proc. 2006 IEEE Intelligent Vehicles Symp., pp. 213–218.
[38] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proc. 7th Int. Conf. Document Analysis and Recognition, 2003, pp. 958–963.
[39] J. Hawkins and S. Blakeslee, On Intelligence. Times Books, Oct. 2004.
[40] K. Fukushima, "Neocognitron for handwritten digit recognition," Neurocomputing, vol. 51, pp. 161–180, 2003.
[41] K. Fukushima, "Restoring partly occluded patterns: A neural network model," Neural Netw., vol. 18, no. 1, pp. 33–43, 2005.
[42] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, 1983.
[43] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biol. Cybern., vol. 36, no. 4, pp. 193–202, 1980.
[44] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," J. Physiol., vol. 160, pp. 106–154, 1962.
[45] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nat. Neurosci., vol. 2, no. 11, pp. 1019–1025, 1999.
[46] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
[47] J. W. Miller and P. H. Lommel, "Biometric sensory abstraction using hierarchical quilted self-organizing maps," Proc. SPIE, vol. 6384, 2006.
[48] S. Behnke, Hierarchical Neural Networks for Image Interpretation. New York: Springer-Verlag, 2003.
[49] S. V. Rice, F. R. Jenkins, and T. A. Nartker, "The fifth annual test of OCR accuracy," Information Sciences Res. Inst., Las Vegas, NV, TR-96-01, 1996.
[50] E. M. Newton and P. J. Phillips, "Meta-analysis of third-party evaluations of iris recognition," IEEE Trans. Syst., Man, Cybern. A, vol. 39, no. 1, pp. 4–11, 2009.
[51] M. Osadchy, Y. LeCun, and M. Miller, "Synergistic face detection and pose estimation with energy-based models," J. Mach. Learn. Res., vol. 8, pp. 1197–1215, May 2007.
[52] A. Adler and M. Schuckers, "Comparing human and automatic face recognition performance," IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 5, pp. 1248–1255, 2007.
[53] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, "Phoneme recognition using time-delay neural networks," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 328–339, 1989.
[54] K. Lang, A. Waibel, and G. Hinton, "A time-delay neural-network architecture for isolated word recognition," Neural Netw., vol. 3, no. 1, pp. 23–44, 1990.
[1] R. Bellman, Dynamic Programming. Princeton, NJ:
learning architecture with application to high-dimension- [55] R. Hadseel, A. Erkan, P. Sermanet, M. Scoffier,
Princeton Univ. Press, 1957.
al robust pattern recognition, in Proc. 2008 AAAI Work- U. Muller, and Y. LeCun, Deep belief net learning in a
[2] R. Duda, P. Hart, and D. Stork, Pattern Recognition, shop Biologically Inspired Cognitive Architectures (BICA). long-range vision system for autonomous off-road driving,
2nd ed. New York: Wiley-Interscience, 2000. [27] The MNIST database of handwritten digits [On- in Proc. Intelligent Robots and Systems, 2008, pp. 628633.
[3] T. Lee and D. Mumford, Hierarchical Bayesian in- line]. Available: http://yann.lecun.com/exdb/mnist/ [56] G. E. Hinton and R. R. Salakhutdinov, Reducing
ference in the visual cortex, J. Opt. Soc. Amer., vol. 20, [28] Caltech 101 dataset [Online]. Available: http:// the dimensionality of data with neural networks, Science,
pt. 7, pp. 14341448, 2003. www.vision.caltech.edu/Image_Datasets/Caltech101/ vol. 313, pp. 504507, 2006.
[4] T. Lee, D. Mumford, R. Romero, and V. Lamme, [29] http://www.darpa.mil/IPTO/solicit/baa/BAA09- [57] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. Le-
The role of the primary visual cortex in higher level vi- 40_PIP.pdf Cun, Learning invariant features through topographic
sion, Vision Res., vol. 38, pp. 24292454, 1998. [30] http://www.numenta.com filter maps, in Proc. Int. Conf. Computer Vision and Pattern
[5] G. Wallis and H. Blthoff, Learning to recognize [31] http://www.binatix.com Recognition, 2009.
objects, Trends Cogn. Sci., vol. 3, no. 1, pp. 2331, 1999. [32] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and [58] J. Weston, F. Ratle, and R. Collobert, Deep learn-
[6] G. Wallis and E. Rolls, Invariant face and object rec- T. Poggio, Robust object recognition with cortex-like ing via semi-supervised embedding, in Proc. 25th Int.
ognition in the visual system, Prog. Neurobiol., vol. 51, mechanisms, IEEE Trans. Pattern Anal. Machine Intell., Conf. Machine Learning, 2008, pp. 11681175.
pp. 167194, 1997. vol. 29, no. 3, pp. 411426, 2007. [59] K. A. DeJong, Evolving intelligent agents: A 50
[7] Y. Bengio, Learning deep architectures for AI, [33] D. George, How the brain might work: A hierar- year quest, IEEE Comput. Intell. Mag., vol. 3, no. 1, pp.
Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1127, 2009. chical and temporal model for learning and recognition, 1217, 2008.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Ph.D. dissertation, Stanford Univ., Stanford, CA, 2008. [60] X. Yao and M. Islam, Evolving artificial neural net-
Gradient-based learning applied to document recogni- [34] T. Dean, G. Carroll, and R. Washington, On the work ensembles, IEEE Comput. Intell. Mag., vol. 2, no. 1,
tion, Proc. IEEE, vol. 86, no. 11, pp. 22782324, 1998. prospects for building a working model of the visual cor- pp. 3142, 2008.
18 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | NOVEMBER 2010