



For the Degree of
Doctor of Philosophy

Submitted to

Gajraula, Amroha (UTTAR PRADESH)

Research Supervisor:
Dr. Name



I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief, it contains no material previously published or written by
another person nor material which to a substantial extent has been accepted for the
award of any other degree or diploma of the university or other institute of higher
learning, except where due acknowledgment has been made in the text.

Signature of Research Scholar

Name :

Enrollment No.



Certified that Name of Student (Enrollment No. ...) has carried out the research work
presented in this thesis entitled "Title of Thesis" for the award of Doctor
of Philosophy from Shri Venkateshwara University, Gajraula under my/our (print only
that is applicable) supervision. The thesis embodies results of original work, and
studies as are carried out by the student himself/ herself (print only that is applicable)
and the contents of the thesis do not form the basis for the award of any other degree to
the candidate or to anybody else from this or any other University/Institution.

Signature Signature

(Name of Supervisor) (Name of Supervisor)

(Designation) (Designation)

(Address) (Address)



(To be submitted in duplicate)

1. Name :
2. Enrollment No. :

3. Thesis title:....

4. Degree for which the thesis is submitted:

5. Department of the University to which the thesis is submitted :


6. Faculty of the University to which the thesis is submitted :


7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No
8. Specifications regarding thesis format have been closely followed. Yes No
9. The contents of the thesis have been organized based on the guidelines. Yes No
10. The thesis has been prepared without resorting to plagiarism. Yes No
11. All sources used have been cited appropriately. Yes No
12. The thesis has not been submitted elsewhere for a degree. Yes No
13. Submitted two copies of spiral bound thesis plus one CD. Yes No
14. Submitted five copies of synopsis approved by RDC. Yes No
15. Submitted two copies of spiral bound research summary. Yes No

Name...Enrollment No

(To be submitted in duplicate)
1. Name : .........

2. Enrollment No. :

3. Thesis title:...



4. Degree for which the thesis is submitted: ........

5. Department of University to which the thesis is submitted :


6. Faculty of the University to which the thesis is submitted :

7. Thesis Preparation Guide was referred to for preparing the thesis. Yes No
8. Specifications regarding thesis format have been closely followed. Yes No
9. The contents of the thesis have been organized based on the guidelines. Yes No
10. The thesis has been prepared without resorting to plagiarism. Yes No
11. All sources used have been cited appropriately. Yes No
12. The thesis has not been submitted elsewhere for a degree. Yes No
13. All the corrections have been incorporated. Yes No
14. Submitted five hard bound copies of the thesis plus one CD. Yes No
15. Submitted five copies of research summary. Yes No

(Signature(s) of the Supervisor(s)) (Signature of the Candidate)


Name(s): Name..

Enrollment No
Object recognition systems constitute a deeply entrenched and omnipresent
component of modern intelligent systems. Research on object recognition algorithms
has led to advances in factory and office automation through the creation of optical
character recognition systems, assembly-line industrial inspection systems, as well as
chip defect identification systems. It has also led to significant advances in medical
imaging, defence and biometrics. In this paper we discuss the evolution of computer-
based object recognition systems over the last fifty years, and overview the successes
and failures of proposed solutions to the problem. We survey the breadth of
approaches adopted over the years in attempting to solve the problem, and highlight
the important role that active and attentive approaches must play in any solution that
bridges the semantic gap in the proposed object representations, while simultaneously
leading to efficient learning and inference algorithms. From the earliest systems
which dealt with the character recognition problem, to modern visually-guided agents
that can purposively search entire rooms for objects, we argue that a common thread
of all such systems is their fragility and their inability to generalize as well as the
human visual system can. At the same time, however, we demonstrate that the
performance of such systems in strictly controlled environments often vastly
outperforms the capabilities of the human visual system. We conclude our survey by

arguing that the next step in the evolution of object recognition algorithms will
require radical and bold steps forward in terms of the object representations, as well
as the learning and inference algorithms used.

Table: 1.
Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors
(HKDES) on CIFAR10, with extensive comparisons against current state-of-the-art
algorithms in terms of accuracy.
Table: 2.
Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images
and depth denotes features over depth images.
Table: 3.
Comparisons to existing recognition approaches using a combination of depth features
and image features. Nonlinear SVMs use Gaussian kernel.


Figure 1:
Different components of an object recognition system are shown
Figure 2:
Hierarchical Kernel Descriptors
Figure 3:
Examples of correspondences established between frames of a database image (left) and a
query image (right).
Figure 4:
Examples of corresponding query (left columns) and database (right columns) images from
the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination and viewpoint
and orientation changes
Figure 5:
Examples of corresponding query (left columns) and database (right columns) images from
the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination and viewpoint
and orientation changes.
Figure 6:
Image retrieval on the FOCUS dataset: query localisation results, showing query images,
database images, and query localisations.
Figure 7:
An example of matches established on a wide-baseline stereo pair.
Figure 8:
Overview of the spatiotemporal (4-D) approach to dynamic vision (adapted from [50, 268]).

Chart 1: Summary of the 1989-2009 papers in Table 5 on active object detection. By
definition, search efficiency is not the primary concern in these systems, since by assumption
the object is always in the sensor's field of view. However, inference scalability constitutes a
significant component of such systems. We notice very little use of function and context in
these systems. Furthermore, training such systems is often non-trivial.
Figure 9:
A sequence of viewpoints from which the system developed by Wilkes and Tsotsos [266]
actively recognizes an origami object.
Figure 10:
The object verification and next viewpoint selection algorithm used in [280] (diagram
adapted from [280]).
Figure 11:
Graphical model for next-view-planning as proposed in [284, 285].
Figure 12:
The aspects of an object and its congruence classes (adapted from Gremban and Ikeuchi [287]).
Figure 13:
An aspect resolution tree used to determine whether there is a single interval of values that
satisfies certain constraints (adapted from Gremban and Ikeuchi [287]).
Figure 14:
The two types of view degeneracies proposed by Dickinson et al. [49].
Chart 2: Summary of the 1992-2012 papers on active object localization and recognition from
Table 6. As expected, search efficiency and the role of 3D information is significantly more
prominent in these papers (as compared to Chart 7)
Figure 15:

Reconstructionist vision vs. Selective Perception, after Rimey and Brown [302]
feature vector c. Laporte and Arbel [291] build upon this work and choose the best next
viewpoint by calculating the symmetric KL divergence (Jeffrey divergence) of the likelihood
of the observed data given the assumption that this data resulted from two views of two
distinct objects. By weighing each Jeffrey divergence by the product of the probabilities of
observing the two competing objects and their two views, they can determine the next view
which best disambiguates the object identity hypothesis, thus again demonstrating the active
vision system's direct applicability in the standard recognition pipeline (see Fig. 1).
Figure 16:
A PART-OF Bayes net for a table-top scenario, similar to what was proposed by Rimey and
Brown [302].
Figure 17:
An IS-A Bayes tree for a table-top scenario that was used by Rimey and Brown [302].
Figure 18:
The direct-search model, which includes nodes that affect direct search efficiency (unboxed
nodes) and explicit model parameters (boxed nodes). Adapted from Wixson and Ballard.
Figure 19:
Junction types proposed by Malik [321] and used by Brunnstrom et al. [306] for recognizing
man-made objects.
Figure 20:
An ASIMO humanoid robot was used by Andreopoulos et al. [24] to actively search an
indoor environment.
Figure 21:
An example of ASIMO pointing at an object once the target object is successfully localized
in a 3D environment [24].
Figure 22:

The twenty object classes that the 2011 PASCAL dataset contains. Some of the earlier
versions of the PASCAL dataset only used subsets of these object classes. Adapted from
Chart 3: Summary of the PASCAL Challenge papers from Table 7 which correspond to
algorithms published between 2002 and 2011. Notice that the winning PASCAL challenge
algorithms typically make little use of function, context, 3D and make a moderate use of
Figure 23:
The HOG detector of Dalal and Triggs (from [335] with permission). (a): The average
gradient image over a set of registered training images. (b), (c): Each pixel demonstrates the
maximum and minimum (respectively) SVM weight of the corresponding block. (d): The test
image used in the rest of the subfigures. (e): The computed R-HOG descriptor of the image in
subfigure (d). (f), (g): The R-HOG descriptor weighted by the positive and negative SVM
weights respectively.
Figure 24:
Examples of the Harris-Laplace detector and the Laplacian detector, which were used
extensively in [142] as interest-point/region detectors (figure reproduced from [142] with permission).
Figure 25:
The distributions of various object classes corresponding to six feature classes.
Figure 26:
Example of the algorithm by Felzenszwalb et al. [366] localizing a person using the coarse
template representation and the higher resolution subpart templates of the person (from [366]
with permission).
Figure 27:
The HOG feature pyramid used in [366], showing the coarse root-level template and the
higher resolution templates of the person's subparts (from [366] with permission).
Figure 28:

The distribution of edges and appearance patches of certain car model training images used
by Chum and Zisserman [365], with the learned regions of interest overlaid (from [365], with permission).
Figure 29:
The 35 most frequent 2AS constructed from 10 outdoor images (from [367] with permission).
It is easier to understand the left image's contents (e.g., a busy road with mountains in the
background) if the cars in the image have first been localized. Conversely, in the right
image, occlusions make the object localization problem difficult. Thus, prior knowledge that
the image contains exclusively cars can make the localization problem easier (from [361]
with permission).
Figure 30:
Demonstrating how top-down category-specific attentional biases can modulate the
shape-words during the bag-of-words histogram construction (from [358] with permission).
Low-level features (e.g., edges, color) are grouped in more complex ways in order
to achieve more universal representations of object parts. In terms of object verification and
object hypothesizing (see Fig. 1), the work by Felzenszwalb et al. [366] represents the most
successful approach tested in PASCAL 2007 for using a coarse generative model of object parts
to improve recognition performance.
Figure 31:
(a)The 3-layer tree-like object representation in [348]. (b) A reference template without any
part displacement, showing the root-node bounding box (blue), the centers of the 9 parts in
the 2nd layer (yellow dots), and the 36 parts at the last layer in purple. (c) and (d) denote
object localizations (from [348] with permission).
Figure 32:
On using context to mitigate the negative effects of ambiguous localizations [350]. The
greater the ambiguities, the greater the role contextual knowledge plays (from [350] with permission).
Figure 33:

An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input
image (or a feature map) is passed through a non-linear filterbank, followed by rectification,
local contrast normalization and spatial pooling/sub-sampling.
Figure 34:
Test error rate vs. number of training samples per class on the NORB dataset. Although pure
random features perform surprisingly well when training data is very scarce, for large
numbers of training samples learning improves the performance significantly. Absolute value
rectification (R_abs) and local normalization (N) are shown to improve the performance in all cases.
Figure 35:
Left: random stage-1 filters, and corresponding optimal inputs that maximize the response
of each corresponding complex cell in an F_CSG - R_abs - N - P_A architecture.
Figure 36:
Left: A dictionary with 128 elements, learned with a patch-based sparse coding model. Right:
A dictionary with 128 elements, learned with a convolutional sparse coding model. The
dictionary learned with the convolutional model spans the orientation space much more
uniformly. In addition, the diversity of filters obtained by the convolutional sparse model
is much richer compared to the patch-based one.
Figure 37:
Top Left: Smooth shrinkage function. Parameters β and b control the smoothness and
location of the kink of the function. As β increases, it converges more closely to the soft
thresholding operator. Top Right: Total loss as a function of the number of iterations. The
vertical dotted line marks the iteration number when the diagonal Hessian approximation was
updated. It is clear that for both encoder functions, the Hessian update improves the
convergence significantly. Bottom: 128 convolutional filters (k) learned in the encoder
using the smooth shrinkage function.
Figure 38:
Second stage filters. Left: Encoder kernels that correspond to the dictionary elements.
Right: 128 dictionary elements, each row shows 16 dictionary elements, connecting to a

single second-layer feature map. It can be seen that each group extracts similar types of
features from its corresponding inputs.
Figure 39:
Results on the INRIA dataset with per-image metric. Left: Comparing the two best systems
with unsupervised initialization (UU) vs. random initialization (RR). Right: Effect of
bootstrapping on final performance for the unsupervised initialized system.
Figure 40:
Results on the INRIA dataset with per-image metric. These curves are computed from the
bounding boxes and confidences made available by (Dollar et al., 2009b), comparing our
two best systems, labeled U+U+ and R+R+, with all the other methods.
Figure 41:
Reconstruction error vs. l1-norm sparsity penalty for coordinate descent sparse coding and
variational free energy minimization.

Figure 42:
Angle between representations obtained for two consecutive frames for different parameter
values, using sparse coding and variational free energy minimization.


a. System Component
b. Complexity of Object Recognition
c. Two Dimensional
d. Three Dimensional
e. Segmented
a. Kernel Descriptor
b. Kernel Descriptor Over Kernel Descriptor
c. Everyday Object Recognition Using RGB-D
d. Experiments
i. Cifar 10
ii. RGB-D Object Dataset
a. Classes of object recognition methods
i. Appearance based methods
ii. Geometry Based Methods
a. The approach of David Lowe
b. The approach of Mikolajczyk & Schmid
c. The approach of Tuytelaars, Ferrari & Van Gool
d. The LAF approach of Matas
e. The approach of Zisserman

i. Indexing and Matching
ii. Verification
f. Scale Saliency by Kadir & Brady
g. Local PCA, approaches of Jugessur & Ohba
h. The approach of Selinger and Nelson
i. Applications
a. Active and Dynamic Vision
b. Active object detection literature survey
c. Active object localization
a. Dataset and evaluation techniques
b. Sampling the current state of the art in the recognition literature
i. Pascal 2005
ii. Pascal 2006
iii. Pascal 2008
iv. Pascal 2009
v. Pascal 2010
vi. Pascal 2011
c. The evolving landscape
a. Modules for hierarchical systems
b. Combining modules into a hierarchy
c. Training protocol
d. Experiments with the Caltech 101 Dataset
e. Using a single stage of feature extraction
f. Using two stages of feature extraction
g. NORB dataset

h. Random filter performance
i. Handwritten digit recognition
j. Convolutional sparse coding
k. Algorithms and Methods
i. Learning convolutional dictionaries
ii. Learning an efficient encoder
iii. Patch-based vs convolutional sparse modelling
l. Multi-stage architecture
m. Experiments
i. Object recognition using Caltech 101 dataset
ii. Pedestrian detection
iii. Architecture and Training
iv. Per image evaluation
n. Sparse coding by variational marginalization
o. Variational marginalization for sparse coding
p. Stability experiments
9. References


2) Introduction
Object recognition is a fundamental and challenging problem and is a major focus of
research in computer vision, machine learning and robotics. The task is difficult partly
because images lie in a high-dimensional space and can change with viewpoint, while
the objects themselves may be deformable, leading to large intra-class variation. The
core of building object recognition systems is to extract meaningful representations
(features) from high-dimensional observations such as images, videos, and 3D point
clouds. This paper aims to discover such representations using machine learning
methods. An object recognition system finds objects in the real world from an image
of the world, using object models which are known a priori. This task is surprisingly
difficult. Humans perform object recognition effortlessly and instantaneously.
Algorithmic description of this task for implementation on machines has been very
difficult. In this chapter we will discuss different steps in object recognition and
introduce some techniques that have been used for object recognition in many
applications. We will discuss the different types of recognition tasks that a vision
system may need to perform. We will analyze the complexity of these tasks and
present approaches useful in different phases of the recognition task.

Over the past few years, there has been increasing interest in feature learning for
object recognition using machine learning methods. Deep belief nets (DBNs) are
appealing feature learning methods that can learn a hierarchy of features. DBNs are
trained one layer at a time using contrastive divergence, where the features learned by
the current layer become the data for training the next layer. Deep belief nets have
shown impressive results on handwritten digit recognition, speech recognition and
visual object recognition. Convolutional neural networks (CNNs) are another example
that can learn multiple layers of nonlinear features. In CNNs, the parameters of the
entire network, including a final layer for recognition, are jointly optimized using the
back-propagation algorithm.
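The multiple layers of nonlinear features that CNNs compute can be illustrated with a minimal forward pass. The sketch below uses random filters as stand-ins for learned ones (no training is shown); the filter shapes and layer sizes are illustrative, not taken from any particular network.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive valid-mode 2D correlation of one image with one kernel."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def layer(feature_map, kernels, pool=2):
    """One nonlinear layer: filterbank -> ReLU rectification -> average pooling."""
    maps = []
    for k in kernels:
        m = np.maximum(conv2d_valid(feature_map, k), 0.0)  # nonlinearity
        h, w = m.shape[0] // pool * pool, m.shape[1] // pool * pool
        m = m[:h, :w].reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
        maps.append(m)
    return maps

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
layer1 = layer(image, rng.standard_normal((4, 3, 3)))      # 4 maps of 7x7
layer2 = layer(layer1[0], rng.standard_normal((2, 3, 3)))  # deeper, coarser features
print(layer1[0].shape, layer2[0].shape)
```

In a real CNN all of these filters, plus a final classification layer, would be optimized jointly by back-propagation rather than fixed at random.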
The object recognition problem can be defined as a labeling problem based on models
of known objects. Formally, given an image containing one or more objects of interest
(and background) and a set of labels corresponding to a set of models known to the
system, the system should assign correct labels to regions, or a set of regions, in the
image. The object recognition problem is closely tied to the segmentation problem:
without at least a partial recognition of objects, segmentation cannot be done, and
without segmentation, object recognition is not possible.
In this chapter, we discuss basic aspects of the object recognition hierarchy and its analysis.
We present the architecture and main components of object recognition and discuss
their role in object recognition systems of varying complexity.

Figure 1: Different components of an object recognition system are shown

a) System Component
An object recognition system must have the following components to perform a given
recognition task:
Model database (also called model base)
Feature detector
Hypothesizer
Hypothesis verifier
A block diagram showing interactions and information flow among different
components of the system is given in Figure 1. The model database contains all the
models known to the system. The information in the model database depends on the

approach used for the recognition. It can vary from a qualitative or functional
description to precise geometric surface information. In many cases, the models of
objects are abstract feature vectors, as discussed later in this section. A feature is some
attribute of the object that is considered important in describing and recognizing the
object in relation to other objects. Size, color, and shape are some commonly used features.
The feature detector applies operators to images and identifies locations of features
that help in forming object hypotheses. The features used by a system depend on the
types of objects to be recognized and the organization of the model database. Using
the detected features in the image, the hypothesizer assigns likelihoods to objects
present in the scene. This step is used to reduce the search space for the recognizer
using certain features.
The model base is organized using some type of indexing scheme to facilitate
elimination of unlikely object candidates from possible consideration. The verifier
then uses object models to verify the hypotheses and refines the likelihood of objects.
The system then selects the object with the highest likelihood, based on all the
evidence, as the correct object.
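The flow described above, from detected features through hypothesis formation to verification, can be sketched as a hypothetical skeleton. The model base here maps labels to feature vectors; the hypothesizer ranks candidates by feature distance to prune the search, and the verifier accepts the best candidate only if it matches closely enough. All object names, feature values and thresholds are illustrative, not from the text.

```python
import numpy as np

# Illustrative model base: each known object is an abstract feature vector
# (e.g. size, color, shape attributes).
model_base = {
    "bolt":   np.array([1.0, 0.2, 0.1]),
    "washer": np.array([0.3, 0.9, 0.5]),
    "nut":    np.array([0.4, 0.3, 0.8]),
}

def detect_features(image_region):
    # Stand-in feature detector: the region is already a feature vector here.
    return np.asarray(image_region, dtype=float)

def hypothesize(features, top_k=2):
    """Rank model-base entries by similarity to reduce the search space."""
    ranked = sorted(model_base,
                    key=lambda name: np.linalg.norm(model_base[name] - features))
    return ranked[:top_k]

def verify(features, candidates, tol=0.5):
    """Accept the best hypothesis only if its model matches closely enough."""
    best = candidates[0]
    return best if np.linalg.norm(model_base[best] - features) < tol else None

feats = detect_features([0.95, 0.25, 0.15])
cands = hypothesize(feats)
print(cands, verify(feats, cands))
```

A real system would replace the distance computation with the indexing, matching and model-based verification techniques discussed in this chapter.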
All object recognition systems use models either explicitly or implicitly and employ
feature detectors based on these object models. The hypothesis formation and
verification components vary in their importance in different approaches to object
recognition. Some systems use only hypothesis formation and then select the object

with the highest likelihood as the correct object. Pattern classification approaches are a
good example of this approach. Many artificial intelligence systems, on the other
hand, rely little on the hypothesis formation and do more work in the verification
phases. In fact, one of the classical approaches, template matching, bypasses the
hypothesis formation stage entirely.
An object recognition system must select appropriate tools and techniques for the
steps discussed above. Many factors must be considered in the selection of
appropriate methods for a particular application. The central issues that should be
considered in designing an object recognition system are:
Object or model representation: How should objects be represented in the
model database? What are the important attributes or features of objects that must be
captured in these models? For some objects, geometric descriptions may be available
and may also be efficient, while for another class one may have to rely on generic or
functional features.
The representation of an object should capture all relevant information without any
redundancies and should organize this information in a form that allows easy access
by different components of the object recognition system.
Feature extraction: Which features should be detected, and how can they be
detected reliably? Most features can be computed in two-dimensional images but they
are related to three-dimensional characteristics of objects. Due to the nature of the

image formation process, some features are easy to compute reliably while others are
very difficult. Feature detection issues were discussed in many chapters in this book.
Feature-model matching: How can features in images be matched to models
in the database? In most object recognition tasks, there are many features and
numerous objects. An exhaustive matching approach will solve the recognition
problem but may be too slow to be useful. Effectiveness of features and efficiency of
a matching technique must be considered in developing a matching approach.
Hypotheses formation: How can a set of likely objects based on the feature
matching be selected, and how can probabilities be assigned to each possible object?
The hypothesis formation step is basically a heuristic to reduce the size of the search
space. This step uses knowledge of the application domain to assign some kind of
probability or confidence measure to different objects in the domain. This measure
reflects the likelihood of the presence of objects based on the detected features.
Object verification: How can object models be used to select the most likely
object from the set of probable objects in a given image? The presence of each likely
object can be verified by using their models. One must examine each plausible
hypothesis to verify the presence of the object or ignore it. If the models are
geometric, it is easy to precisely verify objects using camera location and other scene
parameters. In other cases, it may not be possible to verify a hypothesis.

Depending on the complexity of the problem, one or more modules in Figure 1 may
become trivial. For example, pattern recognition-based object recognition systems do
not use any feature-model matching or object verification; they directly assign
probabilities to objects and select the object with the highest probability.
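A minimal pattern-classification recognizer of the kind just described assigns a probability to every known object directly and selects the maximum, with no matching or verification stage. In this sketch the probabilities come from a softmax over negative distances to prototype vectors; the object names and feature values are illustrative only.

```python
import numpy as np

# Hypothetical prototype feature vectors for two known objects.
prototypes = {"cup": np.array([0.9, 0.1]), "plate": np.array([0.1, 0.9])}

def classify(features):
    """Assign a probability to each object and select the most likely one."""
    names = list(prototypes)
    dists = np.array([np.linalg.norm(prototypes[n] - features) for n in names])
    scores = np.exp(-dists)
    probs = scores / scores.sum()          # normalize into probabilities
    return names[int(np.argmax(probs))], dict(zip(names, probs))

label, probs = classify(np.array([0.8, 0.2]))
print(label)
```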

b) Complexity of Object Recognition
As we studied in earlier chapters in this book, images of scenes depend on
illumination, camera parameters, and camera location. Since an object must be
recognized from images of a scene containing multiple entities, the complexity of
object recognition depends on several factors. A qualitative way to consider the
complexity of the object recognition task would consider the following factors:
Scene constancy: The scene complexity will depend on whether the images are
acquired in similar conditions (illumination, background, camera parameters, and
viewpoint) as the models. As seen in earlier chapters, scene conditions affect images
of the same object dramatically. Under different scene conditions, the performance of
different feature detectors will be significantly different. The nature of the
background, other objects, and illumination must be considered to determine what
kind of features can be efficiently and reliably detected.
Image-models spaces: In some applications, images may be obtained such that
three-dimensional objects can be considered two-dimensional. The models in such

cases can be represented using two-dimensional characteristics. If models are three-
dimensional and perspective effects cannot be ignored, then the situation becomes
more complex. In this case, the features are detected in two-dimensional image space,
while the models of objects may be in three-dimensional space. Thus, the same three-
dimensional feature may appear as a different feature in an image. This may also
happen in dynamic images due to the motion of objects.
Number of objects in the model database: If the number of objects is very
small, one may not need the hypothesis formation stage. A sequential exhaustive
matching may be acceptable. Hypothesis formation becomes important for a large
number of objects. The amount of effort spent in selecting appropriate features for
object recognition also increases rapidly with an increase in the number of objects.
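The indexing scheme mentioned earlier, which lets a large model base avoid sequential exhaustive matching, can be sketched as a hash over quantized feature vectors: lookup then returns only a short candidate list. The models, feature values and cell size below are entirely illustrative.

```python
from collections import defaultdict

def key(features, cell=0.5):
    """Quantize a feature vector into a hashable grid cell."""
    return tuple(round(f / cell) for f in features)

# Build the index once over the model base (illustrative entries).
index = defaultdict(list)
models = {"bolt": (1.0, 0.2), "washer": (0.3, 0.9), "nut": (1.1, 0.3)}
for name, feats in models.items():
    index[key(feats)].append(name)

# At recognition time, only models whose features fall in the same cell
# are considered, instead of scanning every model sequentially.
candidates = index[key((0.98, 0.22))]
print(candidates)
```

Real systems use more robust schemes (e.g. probing neighboring cells, or geometric hashing), since quantization boundaries can split near-identical features into different cells.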
Number of objects in an image and possibility of occlusion: If there is only
one object in an image, it may be completely visible. With an increase in the number
of objects in the image, the probability of occlusion increases. Occlusion is a serious
problem in many basic image computations. Occlusion results in the absence of
expected features and the generation of unexpected features. Occlusion should also be
considered in the hypothesis verification stage. Generally, the difficulty in the
recognition task increases with the number of objects in an image. Difficulties in
image segmentation are due to the presence of multiple occluding objects in images.
The object recognition task is affected by several factors. We classify the object

recognition problem into the following classes.

c) Two-dimensional
In many applications, images are acquired from a distance sufficient to consider the
projection to be orthographic. If the objects are always in one stable position in the
scene, then they can be considered two-dimensional. In these applications, one can
use a two-dimensional modelbase. There are two possible cases:
Objects will not be occluded, as in remote sensing and many industrial
applications.
Objects may be occluded by other objects of interest or be partially visible, as
in the bin of parts problem.
In some cases, though the objects may be far away, they may appear in different
positions resulting in multiple stable views. In such cases also, the problem may be
considered inherently as two-dimensional object recognition.

d) Three-dimensional
If the images of objects can be obtained from arbitrary viewpoints, then an object may
appear very different in its two views. For object recognition using three-dimensional

models, the perspective effect and viewpoint of the image have to be considered. The
fact that the models are three-dimensional and the images contain only two-
dimensional information affects object recognition approaches. Again, the two factors
to be considered are whether objects are separated from other objects or not.
For three-dimensional cases, one should consider the information used in the object
recognition task. Two different cases are:
Intensity: There is no surface information available explicitly in intensity
images. Using intensity values, features corresponding to the three-dimensional
structure of objects should be recognized.
2.5-dimensional images: In many applications, surface representations with
viewer-centered coordinates are available, or can be computed, from images. This
information can be used in object recognition.
Range images are also 2.5-dimensional. These images give the distance to different
points in an image from a particular viewpoint.
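Because a range image stores a depth per pixel, each pixel can be back-projected to a viewer-centered 3D point. The sketch below assumes a pinhole camera with a hypothetical focal length f and principal point (cx, cy); it is an illustration of the 2.5-D representation, not a calibrated model.

```python
import numpy as np

def range_to_points(depth, f=1.0, cx=None, cy=None):
    """Back-project a range image to viewer-centered 3D points (pinhole model)."""
    h, w = depth.shape
    cx = (w - 1) / 2 if cx is None else cx
    cy = (h - 1) / 2 if cy is None else cy
    v, u = np.mgrid[0:h, 0:w]            # pixel coordinates
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.full((3, 3), 2.0)             # flat surface 2 units from the camera
pts = range_to_points(depth)
print(pts.shape)
```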

e) Segmented
The images have been segmented to separate objects from the background. As
discussed in Chapter 3 on segmentation, object recognition and segmentation
problems are closely linked in most cases. In some applications, it is possible to

segment out an object easily. In cases when the objects have not been segmented, the
recognition problem is closely linked with the segmentation problem.

Kernel descriptors highlight the kernel view of orientation histograms, such as SIFT
and HOG, and show that they are a particular type of match kernels over patches. This
novel view suggests a unified framework for turning pixel attributes (gradient, color,
local binary pattern, etc.) into patch-level features:
(1) design match kernels using pixel attributes;
(2) learn compact basis vectors using kernel principal component analysis (KPCA);
(3) construct kernel descriptors by projecting the infinite-dimensional feature vectors
to the learned basis vectors.
The key idea of this work is that we can apply the kernel descriptor framework not
only over sets of pixels (patches), but also sets of kernel descriptors. Hierarchical
kernel descriptors aggregate spatially nearby patch-level features to form higher level
features by using kernel descriptors recursively, as shown in Fig. 1. This procedure
can be repeated until we reach the final image-level features.
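The three steps of the framework can be sketched end-to-end with a Gaussian kernel over pixel attribute vectors: (1) a match kernel between attributes, (2) KPCA over a sampled set of basis attribute vectors, and (3) a finite-dimensional descriptor obtained by projecting onto the learned basis. The attribute dimensionality, sample sizes and kernel width below are illustrative; a full implementation would also center the kernel matrix in feature space, which this sketch omits.

```python
import numpy as np

def k_gauss(X, Y, gamma=2.0):
    """Gaussian match kernel between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
basis = rng.random((20, 3))              # sampled pixel-attribute vectors

# Step 2: KPCA on the basis set to learn compact projection coefficients.
K = k_gauss(basis, basis)
eigvals, eigvecs = np.linalg.eigh(K)     # ascending eigenvalues
top = eigvecs[:, -5:] / np.sqrt(np.maximum(eigvals[-5:], 1e-12))  # 5 components

def kernel_descriptor(patch_attrs):
    """Step 3: project the patch's implicit feature map onto the basis."""
    return k_gauss(patch_attrs, basis).sum(axis=0) @ top

patch = rng.random((50, 3))              # 50 pixels, 3 attributes each
print(kernel_descriptor(patch).shape)
```

Applying the same recipe to sets of these descriptors, rather than sets of pixels, gives the hierarchical construction described above.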


a) Kernel Descriptors
Patch-level features are critical for many computer vision tasks. Orientation
histograms like SIFT and HOG are popular patch-level features for object recognition.
Kernel descriptors include SIFT and HOG as special cases, and provide a principled
way to generate rich patch-level features from various pixel attributes.
The gradient match kernel, K_grad, is based on the pixel gradient attribute:

    K_grad(P, Q) = Σ_{z∈P} Σ_{z'∈Q} m̃_z m̃_{z'} k_o(θ_z, θ_{z'}) k_p(z, z')
Figure 2: Hierarchical Kernel Descriptors.
In the first layer, pixel attributes are aggregated into patch-level features. In the
second layer, patch-level features are turned into aggregated patch-level features. In
the final layer, aggregated patch-level features are converted into image-level
features. Kernel descriptors are used in every layer.


where P and Q are patches from two different images, and z denotes the 2D position of a pixel in an image patch. Let θ_z and m_z be the orientation and magnitude of the image gradient at pixel z.
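As a concrete illustration, the double-sum structure of such a match kernel can be evaluated naively. The sketch below is illustrative only: the kernel parameters gamma_o and gamma_p and the unit-circle embedding of orientations are assumptions, not the tuned settings used in the experiments.

```python
import numpy as np

def gaussian_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-gamma * np.dot(d, d))

def gradient_match_kernel(P, Q, gamma_o=5.0, gamma_p=3.0, eps=1e-8):
    """Naive evaluation of a gradient-style match kernel K_grad(P, Q).

    P and Q are lists of (z, theta, m): pixel position (normalized to
    [0,1]^2), gradient orientation and gradient magnitude. Magnitudes
    are normalized within each patch so the value is insensitive to
    overall contrast.
    """
    def normalize(patch):
        m = np.array([m for _, _, m in patch], float)
        return m / np.sqrt(np.sum(m ** 2) + eps)

    mP, mQ = normalize(P), normalize(Q)
    total = 0.0
    for (z, th, _), mz in zip(P, mP):
        for (z2, th2, _), mz2 in zip(Q, mQ):
            # orientations are compared through their unit-circle
            # embedding so that angles wrap around correctly
            o1 = np.array([np.sin(th), np.cos(th)])
            o2 = np.array([np.sin(th2), np.cos(th2)])
            total += mz * mz2 * gaussian_kernel(o1, o2, gamma_o) \
                             * gaussian_kernel(z, z2, gamma_p)
    return total
```

By construction the value is symmetric in its two arguments, as a kernel must be.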
The color kernel descriptor, K_col, is based on the pixel intensity attribute:

    K_col(P, Q) = Σ_{z∈P} Σ_{z'∈Q} k_c(c_z, c_{z'}) k_p(z, z')

where c_z is the pixel color at position z (intensity for gray images and RGB values for color images) and k_c(c_z, c_{z'}) = exp(−γ_c ‖c_z − c_{z'}‖²) is a Gaussian kernel. The shape kernel descriptor, K_shape, is based on the local binary pattern attribute:

    K_shape(P, Q) = Σ_{z∈P} Σ_{z'∈Q} s̃_z s̃_{z'} k_b(b_z, b_{z'}) k_p(z, z')

where s̃_z is the normalized standard deviation of pixel intensities around z and b_z is the local binary pattern at z.
Gradient, color and shape kernel descriptors are strong in their own right and complement one another. Their combination turns out to be consistently (and often much) better than the best individual feature. Kernel descriptors are able to generate rich visual feature sets by turning various pixel attributes into patch-level features, and are superior to current state-of-the-art recognition algorithms on many standard visual object recognition datasets.
b) Kernel Descriptors over Kernel Descriptors
The match kernels used to aggregate patch-level features have a similar structure to those used to aggregate pixel attributes:

    K(P̄, Q̄) = Σ_{A∈P̄} Σ_{A'∈Q̄} W̃_A W̃_{A'} k_F(F_A, F_{A'}) k_C(C_A, C_{A'})

where A and A' denote image patches, and P̄ and Q̄ are sets of image patches. The patch position Gaussian kernel k_C(C_A, C_{A'}) = exp(−γ_C ‖C_A − C_{A'}‖²) = φ_C(C_A)ᵀ φ_C(C_{A'}) describes the spatial relationship between two patches, where C_A is the center position of patch A (normalized to [0, 1]). The patch Gaussian kernel k_F(F_A, F_{A'}) = exp(−γ_F ‖F_A − F_{A'}‖²) = φ_F(F_A)ᵀ φ_F(F_{A'}) measures the similarity of two patch-level features, where the F_A are gradient, shape or color kernel descriptors in our case. The linear kernel W̃_A W̃_{A'} weights the contribution of each patch-level feature, where W̃_A = W_A / √(Σ_{A∈P̄} W_A² + ε_W) and ε_W is a small positive constant. W_A is the average of gradient magnitudes for the gradient kernel descriptor, the average of standard deviations for the shape kernel descriptor, and always 1 for the color kernel descriptor.
Note that although efficient match kernels [1] used match kernels to aggregate patch-level features, they do not consider spatial information in the match kernels, so a spatial pyramid is required to integrate spatial information. In addition, they do not weight the contribution of each patch, which can be suboptimal. The joint match kernel above provides a way to integrate patch-level features, patch variation, and spatial information jointly.
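The joint match kernel combines a feature kernel, a position kernel and normalized patch weights. A minimal numerical sketch follows, with illustrative patch representations and kernel parameters (the gamma values and epsilon are assumed, not taken from the experiments):

```python
import numpy as np

def rbf(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    d = np.asarray(x, float).ravel() - np.asarray(y, float).ravel()
    return np.exp(-gamma * (d @ d))

def joint_match_kernel(patches_p, patches_q,
                       gamma_c=3.0, gamma_f=1.0, eps_w=1e-6):
    """Joint match kernel over two sets of patches.

    Each patch is a tuple (C, F, W): center position in [0,1]^2,
    patch-level feature vector, and raw weight (e.g. average gradient
    magnitude). Weights are normalized within each set, then every
    pair of patches contributes W~ * W~' * k_F(F, F') * k_C(C, C').
    """
    def norm_weights(patches):
        w = np.array([w for _, _, w in patches], float)
        return w / np.sqrt(np.sum(w ** 2) + eps_w)

    wp, wq = norm_weights(patches_p), norm_weights(patches_q)
    total = 0.0
    for (C, F, _), wa in zip(patches_p, wp):
        for (C2, F2, _), wb in zip(patches_q, wq):
            total += wa * wb * rbf(F, F2, gamma_f) * rbf(C, C2, gamma_c)
    return total
```

Because the position kernel enters the sum directly, spatial layout is integrated without a separate spatial pyramid, which is exactly the contrast drawn with efficient match kernels above.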


Evaluating the joint match kernel directly is expensive. Both for computational efficiency and for representational convenience, we again extract compact low-dimensional features from it using the idea behind kernel descriptors. The inner product representation of the two Gaussian kernels is given by

    k_F(F_A, F_{A'}) k_C(C_A, C_{A'}) = [φ_F(F_A) ⊗ φ_C(C_A)]ᵀ [φ_F(F_{A'}) ⊗ φ_C(C_{A'})]

Following [1], we learn compact features by projecting the infinite-dimensional vector φ_F(F_A) ⊗ φ_C(C_A) onto a set of basis vectors. Since C_A is a two-dimensional vector, we can generate the set {φ_C(X_1), …, φ_C(X_{d_C})} of basis vectors by sampling X on 5 × 5 regular grids (d_C = 25). However, patch-level features F_A live in a high-dimensional space and it is infeasible to sample them on dense, uniform grids. Instead, we cluster patch-level features from training images using K-means, similar to the bag-of-visual-words method, and take the resulting centers as the set {φ_F(Y_1), …, φ_F(Y_{d_F})}. If 5000 basis vectors are generated from patch-level features, the dimensionality of the second-layer kernel descriptors is 5000 × 25 = 125,000. To obtain second-layer kernel descriptors of reasonable size, we can reduce the number of basis vectors using KPCA. KPCA finds the linear combination of basis vectors that best preserves the variance of the original data; the first kernel principal component is computed by maximizing the variance of the projected data under a normalization condition. KPCA is performed on the joint kernel, the product of the spatial kernel k_C and the feature kernel k_F, which can be written as a single Gaussian kernel. This procedure is optimal in the sense of minimizing the least-squares approximation error. However, it is intractable to compute the eigenvectors of a 125,000 × 125,000 matrix on a modern personal computer. Here we propose a fast algorithm for finding the eigenvectors of the Kronecker product of kernel matrices. Since kernel matrices are symmetric positive definite, we have

    K_F ⊗ K_C = (U_F ⊗ U_C)(Λ_F ⊗ Λ_C)(U_F ⊗ U_C)ᵀ

where K_F = U_F Λ_F U_Fᵀ and K_C = U_C Λ_C U_Cᵀ are the eigendecompositions of the two factors. This suggests that the top r eigenvectors of K_F ⊗ K_C can be chosen from the Kronecker products of the eigenvectors of K_F and those of K_C, which significantly reduces computational cost. The second-layer kernel descriptors are obtained by projecting onto these eigenvectors. Recursively applying kernel descriptors in a similar manner, we can get kernel descriptors of more layers, which represent features at different levels.
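The Kronecker eigendecomposition shortcut can be verified numerically on small random kernel matrices; the matrix sizes here are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    """Random symmetric positive definite matrix (a stand-in kernel matrix)."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

KF, KC = spd(4), spd(3)

# Eigendecompose the small factors instead of the large Kronecker product.
lf, UF = np.linalg.eigh(KF)
lc, UC = np.linalg.eigh(KC)

# Eigenvalues of KF (x) KC are all products lf[i] * lc[j]; the matching
# eigenvectors are Kronecker products of the factors' eigenvectors.
vals = np.kron(lf, lc)
vecs = np.kron(UF, UC)

# Verify K v = lambda v for every reconstructed eigenpair.
K = np.kron(KF, KC)
err = np.max(np.abs(K @ vecs - vecs * vals))
print("max eigen-residual:", err)
```

Only a 4 × 4 and a 3 × 3 eigenproblem are solved here instead of a 12 × 12 one; the saving is exactly what makes the 125,000-dimensional case tractable.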
c) Everyday Object Recognition using RGB-D
We recorded with the camera mounted at three different heights relative to the turntable, giving viewing angles of approximately 30, 45 and 60 degrees with the horizon. One revolution of each object was recorded at each height. Each video sequence is recorded at 20 Hz and contains around 250 frames, giving a total of 250,000 RGB + depth frames. A combination of visual and depth cues (mixture-of-Gaussian fitting on RGB, RANSAC plane fitting on depth) produces a segmentation for each frame separating the object of interest from the background. The objects are organized into a hierarchy taken from WordNet hypernym/hyponym relations and form a subset of the categories in ImageNet. Each of the 300 objects in the dataset belongs to one of 51 categories.

Our hierarchical kernel descriptors, being a generic approach based on kernels, have no trouble generalizing from color images to depth images. Treating a depth image as a grayscale image, i.e. using depth values as intensity, gradient and shape kernel descriptors can be directly extracted, and they capture edge and shape information in the depth channel. However, color kernel descriptors extracted over the raw depth image do not have any significant meaning. Instead, we make the observation that the distance d of an object from the camera is inversely proportional to the square root of its area s in RGB images, so that for a given object the product d·√s is approximately constant. Since we have the segmentation of objects, we can represent s by the number of pixels belonging to the object mask. Finally, we multiply depth values by √s before extracting color kernel descriptors over this normalized depth image. This yields a feature that is sensitive to the physical size of the object.

In the experiments section, we compare in detail the performance of our hierarchical kernel descriptors on RGB-D object recognition to that in [15]. Our approach consistently outperforms the state of the art in [15]. In particular, our hierarchical kernel descriptors on the depth image perform much better than the combination of depth features (including spin images) used in [15], increasing depth-only object category recognition from 53.1% (linear SVMs) and 64.7% (nonlinear SVMs) to 75.7% (hierarchical kernel descriptors and linear SVMs). Moreover, our depth features served as the backbone in the object-aware situated interactive system that was successfully demonstrated at the Consumer Electronics Show 2011 despite adverse lighting conditions.
d) Experiments
In this section, we evaluate hierarchical kernel descriptors on CIFAR10 and the RGB-D Object Dataset. We also provide extensive comparisons with current state-of-the-art algorithms in terms of accuracy.

Features       KDES [1]   HKDES (this work)
Color          53.9       63.4
Shape          68.2       69.4
Gradient       66.3       71.2
Combination    76.0       80.0

Table 1: Comparison of kernel descriptors (KDES) and hierarchical kernel descriptors (HKDES) on CIFAR10.
In all experiments we use the same parameter settings as the original kernel descriptors for the first layer of hierarchical kernel descriptors. For SIFT as well as gradient and shape kernel descriptors, all images are transformed into grayscale ([0, 1]). Image intensity and RGB values are normalized to [0, 1]. Like HOG [5], we compute gradients using the mask [−1, 0, 1] for gradient kernel descriptors. We also evaluate the performance of the combination of the three hierarchical kernel descriptors by concatenating the image-level feature vectors. Our experiments suggest that this combination always improves accuracy.
i) CIFAR10
CIFAR10 is a subset of the 80 million tiny images dataset [26, 14]. These images are downsampled to 32 × 32 pixels. The training set contains 5,000 images per category, while the test set contains 1,000 images per category.
Due to the tiny image size, we use two-layer hierarchical kernel descriptors to obtain image-level features. We keep the first layer the same as kernel descriptors. Kernel descriptors are extracted over 8 × 8 image patches on dense regular grids with a spacing of 2 pixels. We split the whole training set into a 10,000/40,000 training/validation set, and optimize the kernel parameters of the second-layer kernel descriptors on the validation set using grid search. Finally, we train linear SVMs on the full training set using the optimized kernel parameter setting. Our hierarchical model can handle large numbers of basis vectors. We tried both 1000 and 5000 basis vectors for the patch-level Gaussian kernel k_F, and found that a larger number of visual words is slightly better (0.5% to 1% improvement depending on the type of kernel descriptor). In the second layer, we use 1000 basis vectors, require KPCA to keep 97% of the energy for all kernel descriptors, and produce roughly 6000-dimensional image-level features. Note that the second layer of hierarchical kernel descriptors gives image-level features, and should be compared to image-level features formed by EMK rather than to kernel descriptors over image patches. The dimensionality of EMK features [1] is 14,000, higher than that of hierarchical kernel descriptors.
Method                            Accuracy
Logistic regression               36.0
Support Vector Machines           39.5
GIST                              54.7
SIFT                              65.6
fine-tuning GRBM                  64.8
GRBM two layers                   56.6
mcRBM                             68.3
mcRBM-DBN                         71.0
Tiled CNNs                        73.1
improved LCC                      74.5
KDES + EMK + linear SVMs          76.0
Convolutional RBM                 78.9
K-means (Triangle, 4k features)   79.6
HKDES + linear SVMs (this work)   80.0

Table 2: Comparison with state-of-the-art feature learning algorithms on CIFAR10.

We compare kernel descriptors and hierarchical kernel descriptors in Table 1. As we see, hierarchical kernel descriptors consistently
outperform kernel descriptors. The shape hierarchical kernel descriptor is slightly better than the shape kernel descriptor. The other two hierarchical kernel descriptors are much better than their counterparts: the gradient hierarchical kernel descriptor is about 5 percent higher than the gradient kernel descriptor, and the color hierarchical kernel descriptor is 10 percent better than the color kernel descriptor. Finally, the combination of all three hierarchical kernel descriptors outperforms the combination of all three kernel descriptors by 4 percent. We were not able to run nonlinear SVMs with Laplacian kernels at the scale of this dataset in reasonable time, given the high dimensionality of the image-level features. Instead, we make comparisons on a subset of 5,000 training images, and our experiments suggest that nonlinear SVMs have performance similar to linear SVMs when hierarchical kernel descriptors are used.
We compare hierarchical kernel descriptors with the current state-of-the-art feature learning algorithms in Table 2. Deep belief nets and sparse coding have been extensively evaluated on this dataset [25, 31]. mcRBM can model pixel intensities and pairwise dependencies between them jointly. A factorized third-order restricted Boltzmann machine, followed by deep belief nets, has an accuracy of 71.0%. Tiled CNNs have the best accuracy among deep networks. The improved LCC extends the original local coordinate coding by including local tangent directions and is able to integrate geometric information. As we have seen, sophisticated feature extraction can significantly boost accuracy and is much better than using raw pixel features. SIFT features have an accuracy of 65.2% and work reasonably well even on tiny images. The combination of three hierarchical kernel descriptors has an accuracy of 80.0%, higher than all other competing techniques; its accuracy is 14.4 percent higher than SIFT, 9.0 percent higher than mcRBM combined with DBNs, and 5.5 percent higher than the improved LCC. Hierarchical kernel descriptors also slightly outperform the very recent convolutional RBM and triangle K-means with 4000 centers.
ii) RGB-D Object Dataset

We evaluated hierarchical kernel descriptors on the RGB-D Object Dataset. The goals of this experiment are to: 1) verify that hierarchical kernel descriptors work well for both RGB and depth images; 2) test whether using depth information can improve object recognition. We subsampled the turntable video data by taking every fifth frame, giving around 41,877 RGB-depth image pairs. To the best of our knowledge, the RGB-D Object Dataset presented here is the largest multi-view object dataset where both RGB and depth images are provided for each view.
We use two-layer hierarchical kernel descriptors to construct image-level features. We keep the first layer the same as kernel descriptors and tune the kernel parameters of the second-layer kernel descriptors by cross-validation. We extract the first layer of kernel descriptors over 16 × 16 image patches on dense regular grids with a spacing of 8 pixels. In the second layer, we use 1000 basis vectors for the patch-level Gaussian kernel k_F, require that KPCA keep 97% of the energy for all kernel descriptors as mentioned in Section 4.1, and produce roughly 3000-dimensional image-level features. Finally, we train linear SVMs on the training set and apply them to the test set. We also tried three-layer kernel descriptors, but they gave performance similar to the two-layer ones.
As in [15], we distinguish between two levels of object recognition: instance recognition and category recognition. Instance recognition is recognizing distinct objects, for example a coffee mug with a particular appearance and shape. Category recognition is determining the category name of an object (e.g. coffee mug). One category usually contains many different object instances.
To test the generalization ability of our approaches, for category recognition we train models on a set of objects and at test time present to the system objects that were not present in the training set [15]. At each trial, we randomly leave one object out from each category for testing and train classifiers on the remaining 300 − 51 = 249 objects. For instance recognition we also follow the experimental setting suggested by [15]: train models on the video sequences of each object where the viewing angles are 30° and 60° with the horizon, and test them on the 45° video sequence.
For category recognition, the average accuracy over 10 random train/test splits is reported in the second column of Table 2. For instance recognition, the accuracy on the test set is reported in the third column of Table 2. As we expect, the combination of hierarchical kernel descriptors is much better than any single descriptor. The underlying reason is that each depth descriptor captures different information, and the weights learned by linear SVMs using supervised information can automatically balance the importance of each descriptor across objects.

Method                          Category      Instance
Color HKDES (RGB)               60.1 ± 2.1    58.4
Shape HKDES (RGB)               72.6 ± 1.9    74.6
Gradient HKDES (RGB)            70.1 ± 2.9    75.9
Combination of HKDES (RGB)      76.1 ± 2.2    79.3
Color HKDES (depth)             61.8 ± 2.4    28.8
Shape HKDES (depth)             65.8 ± 1.8    36.7
Gradient HKDES (depth)          70.8 ± 2.7    39.3
Combination of HKDES (depth)    75.7 ± 2.6    46.8
Combination of all HKDES        84.1 ± 2.2    82.4

Table 2: Comparisons on the RGB-D Object Dataset. RGB denotes features over RGB images and depth denotes features over depth images.

Approaches                  Category      Instance
Linear SVMs [15]            81.9 ± 2.8    73.9
Nonlinear SVMs [15]         83.8 ± 3.5    74.8
Random Forest [15]          79.6 ± 4.0    73.1
Combination of all HKDES    84.1 ± 2.2    82.4

Table 3: Comparisons to existing recognition approaches using a combination of depth features and image features. Nonlinear SVMs use a Gaussian kernel.

In Table 3, we compare hierarchical kernel descriptors with the rich feature set used in [15], where SIFT, color and texton features were extracted from RGB images, and 3-D bounding boxes and spin images over depth images. Hierarchical kernel descriptors are slightly better than this rich feature set for category recognition, and much better for instance recognition.
It is worth noting that, using depth alone, we improve the category recognition accuracy in [15] from 53.1% (linear SVMs) to 75.7% (hierarchical kernel descriptors and linear SVMs). This shows the power of our hierarchical kernel descriptor formulation when applied to a non-conventional domain. The depth-alone results are meaningful for many scenarios where color images are not used for privacy or robustness reasons.
As a comparison, we also extracted SIFT features on both RGB and depth images and trained linear SVMs over image-level features formed by spatial pyramid EMK. The resulting classifier has an accuracy of 71.9% for category recognition, much lower than the result of the combination of hierarchical kernel descriptors (84.2%). This is not surprising, since SIFT fails to capture shape and object size information. Nevertheless, hierarchical kernel descriptors provide a unified way to generate rich feature sets over both RGB and depth images, giving significantly better accuracy.


Recognition of general three-dimensional objects from 2D images and videos is a challenging task. The common formulation of the problem is essentially: given some knowledge of how certain objects may appear, plus an image of a scene possibly containing those objects, find which objects are present in the scene and where. Recognition is accomplished by matching features of an image to a model of an object. The two most important issues a method must address are the definition of a feature and how the matching is found.
What is the goal in designing an object recognition system? Achieving generality, i.e. the ability to recognise any object without hand-crafted adaptation to a specific task; robustness, i.e. the ability to recognise objects in arbitrary conditions; and easy learning, i.e. avoiding special or demanding procedures to obtain the database of models. Obviously these requirements are impossible to achieve in full generality, as it is for example impossible to recognise objects in images taken in complete darkness. The challenge is then to develop a method with minimal constraints.
Object recognition methods can be classified according to a number of characteristics. We focus on model acquisition (learning) and invariance to image formation conditions. Historically, two main trends can be identified. In so-called geometry- or model-based object recognition, the knowledge of an object's appearance is provided by the user as an explicit CAD-like model. Typically, such a model describes only the 3D shape, omitting other properties such as colour and texture. On the other end of the spectrum are the appearance-based methods, where no explicit user-provided model is required. The object representations are usually (but not necessarily) acquired through an automatic learning phase, and the model typically relies on surface reflectance (albedo) properties. Recently, methods which put local image patches into correspondence have emerged. Models are learned automatically, and objects are represented by the appearance of small local elements. The global arrangement of the representation is constrained by weak or strong geometric models.
The rest of the paper is structured as follows. In Section 2, an overview of classes of object recognition methods is given. A survey of methods based on matching of local features is presented in Section 3, and Section 4 describes some of their successful applications. Section 5 concludes the paper.
i) Appearance Based Methods
The central idea behind appearance-based methods is the following. Having seen all possible appearances of an object, can recognition be achieved by just efficiently remembering all of them? Could recognition thus be implemented as an efficient visual (pictorial) memory? The answer obviously depends on what is meant by all appearances. The approach has been successfully demonstrated for scenes with unoccluded objects on a black background [34]. But remembering all possible object appearances in the case of arbitrary background, occlusion and illumination is currently computationally prohibitive.

Appearance-based methods [6, 70, 20, 3, 40, 33, 68, 21, 30, 34] typically include two phases. In the first phase, a model is constructed from a set of reference images. The set includes the appearance of the object under different orientations, different illuminants and potentially multiple instances of a class of objects, for example faces. The images are highly correlated and can be efficiently compressed using e.g. the Karhunen-Loeve transformation (also known as Principal Component Analysis, PCA).
In the second phase, recall, parts of the input image (subimages of the same size as the training images) are extracted, possibly by segmentation (by texture, colour, motion) or by exhaustive enumeration of image windows over the whole image. The recognition system then compares an extracted part of the input image with the reference images (e.g. by projecting the part into the Karhunen-Loeve space).
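The two phases can be sketched with a PCA-compressed visual memory; the subspace dimension and the nearest-neighbour recall rule are illustrative assumptions:

```python
import numpy as np

def build_visual_memory(train_images, k=8):
    """Phase 1: compress reference views with PCA (Karhunen-Loeve) and
    keep the low-dimensional projections as the 'pictorial memory'."""
    X = np.stack([im.ravel().astype(float) for im in train_images])
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD gives the principal directions without forming the covariance matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:k]
    coords = Xc @ basis.T
    return mean, basis, coords

def recall(window, memory):
    """Phase 2: project an image window into the KL space and return the
    index of the nearest stored reference view."""
    mean, basis, coords = memory
    q = (window.ravel().astype(float) - mean) @ basis.T
    return int(np.argmin(np.linalg.norm(coords - q, axis=1)))
```

In a full system, `recall` would be run on candidate windows produced by segmentation or exhaustive window enumeration, exactly as described above.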
A major limitation of the appearance-based approaches is that they require isolation of the complete object of interest from the background. They are thus sensitive to occlusion and require good segmentation. A number of attempts have been made to address recognition with occluded or partial data [32, 30, 65, 5, 21, 4, 64, 20, 15, 19].

The family of appearance-based object recognition methods includes global histogram matching methods. In [66, 67], Swain and Ballard proposed to represent an object by a colour histogram. Objects are identified by matching histograms of image regions to histograms of a model image. While the technique is robust to object orientation, scaling, and occlusion, it is very sensitive to lighting conditions, and it is not suitable for recognition of objects that cannot be identified by colour alone. The approach was later modified by Healey and Slater [14] and Funt and Finlayson [12] to exploit illumination invariants. Recently, the concept of histogram matching was generalised by Schiele [52, 51, 50]: instead of pixel colours, responses of various filters are used to form the histograms (then called receptive field histograms).
To summarise, appearance-based approaches are attractive since they do not require image features or geometric primitives to be detected and matched. But their limitations, i.e. the necessity of dense sampling of training views and the low robustness to occlusion and cluttered background, make them suitable mainly for applications with limited or controlled variations in the image formation conditions, e.g. for industrial inspection.

ii) Geometry-Based Methods
In geometry- (or shape-, or model-) based methods, the information about the objects is represented explicitly. Recognition can then be interpreted as deciding whether (a part of) a given image can be a projection of the known (usually 3D) model [41] of an object.
Generally, two representations are needed: one to represent the object model, and another to represent the image content. To facilitate finding a match between model and image, the two representations should be closely related. In the ideal case there will be a simple relation between primitives used to describe the model and those used to describe the image. Were the object, for example, described by a wireframe model, the image might be best described in terms of linear intensity edges; each edge could then be matched directly to one of the model wires. However, the model and image representations often have distinctly different meanings. The model may describe the 3D shape of an object while the image edges correspond only to visible manifestations of that shape, mixed together with false edges (discontinuities in surface albedo) and illumination effects (shadows).
To achieve pose and illumination invariance, it is preferable to employ model primitives that are at least somewhat invariant with respect to changes in these conditions. Considerable effort has been directed at identifying primitives that are invariant with respect to viewpoint change.
The main disadvantages of geometry-based methods are: the dependency on reliable extraction of geometric primitives (lines, circles, etc.), the ambiguity in interpretation of the detected primitives (presence of primitives that are not modelled), modelling capabilities restricted to the class of objects composed of a few easily detectable elements, and the need to create the models manually.

iii) Recognition as a Correspondence of Local Features
Neither the geometry-based nor the appearance-based methods discussed previously do well on the requirements stated at the beginning of the paper, i.e. generality, robustness, and easy learning. Geometry-based approaches require the user to specify the object models, and can usually handle only objects consisting of simple geometric primitives. They are not general, nor do they support easy learning. Appearance-based methods demand an exhaustive set of learning images, taken from densely distributed views and illuminations. Such a set is only available when the object can be observed in a controlled environment, e.g. placed on a turntable. The methods are also sensitive to occlusion of the objects and to unknown background, and thus they are not robust.
As an attempt to address the above-mentioned issues, methods based on matching local features have been proposed. Objects are represented by a set of local features, which are automatically computed from the training images. The learned features are organised into a database. When recognising a query image, local features are extracted as in the training images. Similar features are then retrieved from the database, and the presence of objects is assessed in terms of the number of local correspondences. Since it is not required that all local features match, the approaches are robust to occlusion and cluttered background.
To recognise objects from different views, it is necessary to handle all variations in object appearance. The variations might be complex in general, but at the scale of the local features they can be modelled by simple, e.g. affine, transformations. Thus, by allowing simple transformations at a local scale, significant viewpoint invariance is achieved even for objects with complicated shapes. As a result, it is possible to obtain models of objects from only a few views, taken e.g. 90 degrees apart.
The main advantages of the approaches based on matching local features are summarised below.
[1] Learning, i.e. the construction of internal models of known objects, is done automatically from images depicting the objects. No user intervention is required except for providing the training images.
[2] The local representation is based on appearance. There is no need to extract geometric primitives (e.g. lines), which are generally hard to detect reliably.
[3] Segmentation of objects from the background is not required prior to recognition, and yet objects are recognised against an unknown background.
[4] Objects of interest are recognised even if partially occluded by other unknown objects in the scene.
[5] Complex variations in object appearance caused by varying viewpoint and illumination conditions are approximated by simple transformations at a local scale.
[6] Measurements on both database and query images are obtained and represented in an identical way.
Putting local features into correspondence is an approach that is in principle robust to object occlusion and cluttered background. When a part of an object is occluded by other objects in the scene, only the features of that part are missed. As long as enough features are detected in the unoccluded part, the object can be recognised. The problem of cluttered background is solved in the final step of the recognition process, when a hypothesised match is verified and confirmed, and false correspondences are rejected.
Several approaches based on local features have been proposed. Generally, they follow a common structure, which is summarised below.
Detectors. First, image elements of interest are detected. These elements serve as anchor locations in the images; descriptors of local appearance will be computed at these locations. Thus, an image element is of interest if it depicts a part of an object which can be repeatedly detected and localised in images taken over a large range of conditions. The challenge is to find a definition of interest that allows fast, reliable and precisely localised detection of such elements. The brute-force alternative to detectors is to generate local descriptors at every point. This course is obviously infeasible due to its computational complexity.
Descriptors. Once the elements of interest are found, the local image appearance in their neighbourhood has to be encoded in a way that allows searching for similar elements.
When designing a descriptor (also called a feature vector), several aspects have to be taken into account. First, the descriptors should be discriminative enough to distinguish between features of the objects stored in the database. If we wanted, for example, to distinguish between two or three objects, each described by some ten-odd features, the descriptions of local appearance could be as simple as, e.g., four-bin colour histograms. On the other hand, handling thousands of database objects requires the ability to distinguish between a vast number of descriptors, thus demanding a highly discriminative representation. This problem can be partially alleviated by grouping, i.e. simultaneous consistent matching of several detected elements.
Another aspect in designing a descriptor is that it has to be invariant, or at least to some degree robust, to variations in an object's appearance that are not reflected by the detector. If, for example, the detector detects circular or elliptical regions without assigning an orientation to them, the descriptor must be made invariant to orientation (rotational invariants). Or, if the detector is imprecise in locating the elements of interest, e.g. having a tolerance of a few pixels, the descriptor must be insensitive to such small misalignments. Such a descriptor might be based, e.g., on colour moments (integral statistics over the whole region) or on local histograms.
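A colour-moment descriptor of this kind can be sketched in a few lines; the particular moments kept (per-channel mean and standard deviation) are one common choice, not the only one:

```python
import numpy as np

def color_moments(region):
    """Per-channel mean and standard deviation over the whole region.

    Being integral statistics over the region, the moments barely change
    when the detected region is shifted by a few pixels, which makes the
    descriptor robust to small localisation errors of the detector.
    """
    pixels = region.reshape(-1, region.shape[-1]).astype(float)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])
```

For an RGB region this yields a six-dimensional vector: three channel means followed by three channel standard deviations.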
It follows that the major factors affecting the discriminative potential of a method, and thus its ability to handle large object databases, are the repeatability and the localisation precision of the detector.
Indexing. During learning of object models, descriptors of local appearance are stored
in a database. In the recognition phase, descriptors are computed on the query
image, and the database is searched for similar descriptors (potential matches). The
database should be organised (indexed) in a way that allows efficient retrieval of
similar descriptors. The character of a suitable indexing structure generally depends on
the properties of the descriptors (e.g. their dimensionality) and on the distance
measure used to determine which descriptors are similar (e.g. Euclidean distance).
Generally, for optimal performance of the index (fast retrieval times), a combination of
descriptor and distance measure should be sought that minimises the ratio of
distances to correct and to false matches.
The choice of indexing scheme has a major effect on the speed of the recognition
process, especially on how the speed scales to large object databases. Commonly,
though, the database searches are done simply by sequential scan, i.e. without using
any indexing structure.

Matching. When recognising objects in an unknown query image, local features are
computed in the same form as for the database images. None, one, or possibly more
tentative correspondences are then established for every feature detected in the query
image. When searching the database, Euclidean or Mahalanobis distance is typically
evaluated between the query feature and the features stored in the database. The
closest match, if close enough, is retrieved. These tentative correspondences are based
purely on the similarity of the descriptors. A database object which exhibits a high
(non-random) number of established correspondences is considered a candidate match.
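The matching step can be sketched as a brute-force nearest-neighbour search. Accepting a match only when the closest descriptor is clearly closer than the runner-up is one common way to implement "if close enough"; the 0.8 ratio below is an assumed illustrative threshold:

```python
import numpy as np

def tentative_matches(query_desc, db_desc, ratio=0.8):
    """For each query descriptor, find the closest database descriptor
    (Euclidean distance) and accept it only when it is sufficiently
    closer than the second-closest one, i.e. when the match is
    unambiguous."""
    matches = []
    for i, q in enumerate(query_desc):
        d = np.linalg.norm(db_desc - q, axis=1)   # distances to all db entries
        first, second = np.argsort(d)[:2]
        if d[first] < ratio * d[second]:          # unambiguous -> tentative match
            matches.append((i, int(first)))
    return matches
```

A real system would replace the linear scan with the indexing structures discussed above; the acceptance logic stays the same.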
Verification. The similarity of descriptors, on its own, is not a measure reliable
enough to guarantee that an established correspondence is correct. As a final step of
the recognition process, the presence of the model in the query image is verified. A
global transformation connecting the images is estimated in a robust
way (e.g. using the RANSAC algorithm). Typically, the global transformation has the
form of an epipolar geometry constraint for general (but rigid) 3D objects, or of a
homography for planar objects. More complex transformations can be derived for
non-rigid or articulated (piecewise rigid) objects.
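The hypothesise-and-verify loop of RANSAC can be sketched as follows. For brevity the "global transformation" here is a pure 2-D translation (one correspondence per hypothesis); a homography or an epipolar constraint would need 4 resp. 7+ pairs per minimal sample, but the loop structure is identical:

```python
import numpy as np

def ransac_translation(src, dst, tol=2.0, iters=200, seed=0):
    """Minimal RANSAC sketch over point correspondences src[i] <-> dst[i].

    Repeatedly: pick a minimal sample, form a transformation hypothesis,
    and count how many correspondences it explains within `tol` pixels.
    The hypothesis with the largest consensus set wins.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        k = rng.integers(len(src))
        t = dst[k] - src[k]                       # hypothesis from a minimal sample
        resid = np.linalg.norm(src + t - dst, axis=1)
        inliers = resid < tol                     # verify against all correspondences
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```

Correspondences outside the returned consensus set are the ones rejected in the final scoring step described below.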
As mentioned before, if a detector cannot recover certain parameters of the image
transformations, the descriptor must be made invariant to them. It is preferable, though, to
have a covariant detector rather than an invariant descriptor, as that allows for more
powerful global consistency verification. If, for example, the detector does not
provide the orientations of the image elements, rotational invariants have to be
employed in the descriptor. In such a case, it is impossible to verify that all of the
matched elements agree in their orientation.
Finally, tentative correspondences which are not consistent with the estimated global
transformation are rejected, and only the remaining correspondences are used to estimate
the final score of the match.
In the following, the main contributions to the field of object recognition based on local
correspondences are reviewed. The approaches follow the aforementioned structure,
but differ in the individual steps: in how the local features are obtained
(detectors), and in what the features themselves are (descriptors).


a) The Approach of David Lowe
David Lowe has developed an object recognition system with emphasis on efficiency,
achieving real-time recognition times. Anchor points of interest are detected with
invariance to scale, rotation and translation. Since local patches undergo more
complicated transformations than similarities, a local-histogram-based descriptor is
proposed, which is robust to imprecisions in the alignment of the patches.
Detector. The detection of regions of interest proceeds as follows:
1. Detection of scale-space peaks. Circular regions with maximal response of the
difference-of-Gaussians (DoG) filter are detected at all scales and image locations. An
efficient implementation exploits the scale-space pyramid. The initial image is
repeatedly convolved with a Gaussian filter to produce a set of scale-space images.
Adjacent scale-space images are then subtracted to produce a set of DoG images. In
these images, local minima and maxima (i.e. extrema of the DoG filter response) are
detected, both in the spatial and scale domains. The result of the first phase is thus a set
of triplets (x, y, σ): image locations and characteristic scales.
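The first step can be sketched with a small scale-space stack. The sigma progression and the response threshold below are assumed values, not Lowe's exact parameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_extrema(image, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=1e-3):
    """Build Gaussian scale-space images, subtract adjacent pairs
    (difference-of-Gaussians), and keep points whose DoG magnitude is
    a maximum in the 3x3x3 space-and-scale neighbourhood, i.e. the
    local minima and maxima of the DoG response."""
    blurred = np.stack([gaussian_filter(image, s) for s in sigmas])
    dog = blurred[1:] - blurred[:-1]            # one DoG image per adjacent pair
    mag = np.abs(dog)                           # treats minima and maxima alike
    peaks = (maximum_filter(mag, size=3) == mag) & (mag > thresh)
    scale, ys, xs = np.nonzero(peaks)
    return list(zip(xs.tolist(), ys.tolist(), scale.tolist()))  # (x, y, scale index)
```

A production implementation would additionally downsample between octaves rather than blur one full-resolution image with ever larger kernels.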
2. The location of the detected points is refined. The DoG responses are locally
fitted with a 3D quadratic function, and the location and characteristic scale of the
circular regions are determined with subpixel accuracy. The refinement is necessary
as, at higher levels of the pyramid, a displacement by a single pixel might result in a
large shift in the image domain. Unstable regions are then rejected; the stability is
given by the magnitude of the DoG response. Regions with a response lower than a
predefined threshold are removed. Further regions are discarded which were found
along linear edges; these, although having a high DoG response, have unstable
localisation in one direction.
3. One or more orientations are assigned to each region. Local histograms of
gradient orientations are formed, and peaks in the histogram determine the
characteristic orientations.
The SIFT Descriptor. Local image gradients are measured at the region's
characteristic scale, weighted by the distance from the region centre, and combined
into a set of orientation histograms. Thanks to the histograms, small misalignments in
the localisation do not affect the final description. The construction of the descriptors
allows for approximately 20 degrees of 3D rotation before the similarity model fails.
In the end, every detected region is represented by a 128-dimensional vector.
Indexing. To support fast retrieval of database vectors, a modification of the kD-tree
algorithm, called BBF (best bin first), is adopted. The algorithm is approximate in the
sense that it returns the closest neighbour with high probability, or else another point
that is very close in distance to the closest neighbour. The BBF algorithm modifies
the kD-tree search to visit bins in feature space in the order of their closest
distance from the query location, instead of the order given by the tree hierarchy.
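SciPy's kD-tree can stand in for the idea: with a nonzero `eps` the query is approximate (the returned neighbour is guaranteed to be within a factor 1 + eps of the true nearest distance), which is the same accuracy-for-speed trade-off that BBF makes. The data below is made up:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))     # stand-in for 128-D SIFT descriptors
tree = cKDTree(db)

query = db[42] + 0.01                 # slightly perturbed copy of entry 42
dist, idx = tree.query(query, k=1, eps=0.5)   # eps > 0 -> approximate search
```

Note that BBF proper bounds the *number of bins visited* rather than the distance ratio, but both mechanisms trade exactness of the nearest neighbour for a large constant-factor speedup in high dimensions.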

Verification. The Hough transform is used to identify clusters of tentative
correspondences with a consistent geometric transformation. Since the actual
transformation is approximated by a similarity, the Hough accumulator is
4-dimensional and is partitioned into rather broad bins. Only clusters with at least 3
entries in a bin are considered further. Each such cluster is then subject to a
geometric verification procedure in which an iterative least-squares fitting is used to
find the best affine projection relating the query and database images.
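The pose-space voting can be sketched with a hash map over quantised pose bins. The bin widths (64-pixel location bins, one octave in scale, 30 degrees in orientation) are assumed stand-ins for the "rather broad bins":

```python
import numpy as np
from collections import defaultdict

def hough_pose_clusters(matches, min_votes=3):
    """Each tentative correspondence predicts a similarity pose
    (x, y, scale, orientation) of the object; poses are quantised
    into broad bins and only bins receiving >= min_votes entries
    survive to the geometric verification stage."""
    bins = defaultdict(list)
    for m in matches:
        x, y, scale, angle = m["pose"]
        key = (int(x // 64), int(y // 64),        # broad location bins
               int(np.log2(scale)),               # one bin per scale octave
               int(angle // (np.pi / 6)))         # 30-degree orientation bins
        bins[key].append(m)
    return [v for v in bins.values() if len(v) >= min_votes]
```

A fuller implementation would also vote into the neighbouring bins to avoid boundary effects of the quantisation.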

b) The Approach of Mikolajczyk & Schmid
The approach by Schmid et al. is described in [44, 28, 56, 54, 53, 55, 27, 10]. Based on an
affine generalisation of the Harris corner detector, anchor points are detected and described by
Gaussian derivatives of image intensities in shape-adapted elliptical neighbourhoods.
Detector. In their work, Mikolajczyk and Schmid implement an affine-adapted Harris point
detector. Since the three-parametric affine Gaussian scale space is too complex to be
practically useful, they propose a solution which iteratively searches for affine shape
adaptation in neighbourhoods of points detected in uniform scale space. For initialisation,
approximate locations and scales of interest points are extracted by the standard multi-scale
Harris detector. These points are not affine invariant because of the uniform Gaussian kernel
used. Given the initial approximate solution, their algorithm iteratively modifies the shape,
the scale and the spatial location of the neighbourhood of each point, and converges to
affine-invariant interest points. For more details see [28].
Descriptors and Matching. The descriptors are composed of Gaussian derivatives
computed over the shape-normalised regions. Invariance to rotation is obtained by steering
the derivatives in the direction of the gradient. Using derivatives up to 4th order, the
descriptors are 12-dimensional. The similarity of descriptors is, in a first approximation,
measured by the Mahalanobis distance. Promising close matches are then confirmed or
rejected by a cross-correlation measure computed over normalised neighbourhood windows.
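The Mahalanobis comparison used here (and by several of the other approaches below) can be written out directly; in practice `cov` would be the covariance of the descriptor components estimated over the whole database:

```python
import numpy as np

def mahalanobis(u, v, cov):
    """Mahalanobis distance between two descriptors, given the
    covariance matrix of the descriptor components; it reduces to
    the Euclidean distance when cov is the identity matrix."""
    diff = u - v
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

The effect is to down-weight descriptor components with large variance, so that no single noisy component dominates the similarity measure.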
Verification. Once the point-to-point correspondences are obtained, a robust estimation of the
geometric transformation between the two images is computed using the RANSAC algorithm.
The transformation used is either a homography or a fundamental matrix.
Recently, Dorko and Schmid [10] extended the approach towards object categorisation. Local
image patches are detected and described by the same approach as described above. Patches
from several examples of objects from a given category (e.g. cars) are collected together, and
a classifier is trained to distinguish them from patches of different categories and from
background patches.

c) The Approach of Tuytelaars, Ferrari & van Gool

Luc van Gool and his collaborators developed an approach based on matching of local image
features [73, 75, 11, 72, 71, 74, 69]. They start with the detection of elliptical or parallelogram
image regions. The regions are described by a vector of photometrically invariant generalised
colour moments, and matching is typically verified by the epipolar geometry constraint.
Detector. Two methods for the extraction of affinely invariant regions are proposed, yielding
geometry- and intensity-based regions. The regions are affine covariant; they adapt their
shape to the underlying intensity profile, in order to keep representing the same physical
part of an object. Apart from the geometric invariance, photometric invariance allows for
independent scaling and offsets for each of the three colour channels. The region extraction
always starts by detecting stable anchor points. The anchor points are either Harris points
[13], or local extrema of image intensity. Although the detection of Harris points is not really
affine invariant, as the support set over which the response is computed is circular, the points
are still fairly stable under viewpoint changes, and can be precisely localised (even to
subpixel accuracy). Intensity extrema, on the other hand, are invariant to any continuous
geometric transformation and to any monotonic transformation of the intensity, but are not
localised as accurately. On colour images, the detection is performed three times, separately
on each of the colour bands.
Descriptors and Matching. In the case of geometry-based regions, each of the regions is
described by a vector of 18 generalised colour moments [29], invariant to photometric
transformations. For the intensity-based regions, 9 rotation-invariant generalised colour
moments are used. The similarity between the descriptors is given by the Mahalanobis
distance; correspondences between two images are formed from regions with mutually
smallest distance. Once corresponding regions have been found, the cross-correlation
between them is computed as a final check before accepting the match. In the case of the
intensity-based regions, where the rotation is unknown, the cross-correlation is maximised
over all rotations. Good matches are further fine-tuned by non-linear optimisation: the
cross-correlation is maximised over small deviations of the transformation parameters.
Verification. The set of tentative correspondences is pruned by both geometric and
photometric constraints. The geometric constraint basically rejects correspondences
contradicting the epipolar geometry. The photometric constraint assumes that there is always
a group of corresponding regions that undergo the same transformation of intensities.
Correspondences that have a singular photometric transformation are rejected. Recently, a
growing flexible homography approach was presented, which allows for accurate model
alignment even for non-rigid objects. The size of the aligned area is then used as a measure of
the match quality.

d) The LAF Approach of Matas
The approach of Matas et al. [25, 37, 26, 36] starts with the detection of Maximally Stable
Extremal Regions. Affine covariant local coordinate systems (called Local Affine Frames,
LAFs) are then established, and measurements taken relative to them describe the regions.

Figure 3: Examples of correspondences established between frames of a database image (left)
and a query image (right).
Detector. The Maximally Stable Extremal Regions (MSERs) were introduced in [25].
The attractive properties of MSERs are: 1. invariance to affine transformations of image
coordinates, 2. invariance to monotonic transformations of intensity, 3. computational
complexity almost linear in the number of pixels and consequently near real-time run time,
and 4. since no smoothing is involved, both very fine and coarse image structures are
detected. Starting from the contours of a detected region, local frames (coordinate systems)
are constructed in several affine covariant ways. Affine covariant properties of the covariance
matrix, bi-tangent lines, and line parallelism are exploited. As demonstrated in Figure 3, local
affine frames facilitate normalisation of image patches into a canonical frame and enable
direct comparison of photometrically normalised intensity values, eliminating the need for
invariant descriptors.
Descriptor. Three different descriptors were used. The first is directly the intensities of the
local patches. The intensities are discretised into 15 × 15 × 3 rasters, yielding
675-dimensional descriptors. This size is discriminative enough to distinguish between a large
number of database objects, yet coarse enough to be tolerant to decent misalignments in the
frame localisation. The second type of descriptor employs the discrete cosine transformation,
which is applied to the discretised patches [38]. The number of low-frequency DCT
coefficients that are kept in the database is used to adapt the trade-off between descriptor
discriminativity and localisation tolerance. Finally, rotational invariants were used.
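The first two descriptor types can be sketched as follows; the patch is assumed to be already geometrically and photometrically normalised, and the default number of DCT coefficients kept is an illustrative choice, not a value from the papers:

```python
import numpy as np
from scipy.fft import dctn

def raster_descriptor(patch):
    """Discretise a normalised colour patch into a 15 x 15 x 3 raster,
    i.e. a 675-dimensional vector of raw intensities."""
    assert patch.shape == (15, 15, 3)
    return patch.reshape(-1)

def dct_descriptor(patch, keep=6):
    """Keep only the low-frequency DCT coefficients of each colour
    channel; `keep` is the knob that trades discriminativity against
    localisation tolerance."""
    assert patch.shape == (15, 15, 3)
    coeffs = dctn(patch, axes=(0, 1))            # 2-D DCT per colour channel
    return coeffs[:keep, :keep, :].reshape(-1)   # keep * keep * 3 dimensions
```

Truncating high frequencies makes the DCT descriptor blind to fine pixel-level detail, which is exactly the tolerance to small frame misalignments the text describes.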
Verification. In wide-baseline stereo problems, the correspondences are verified by
robustly selecting only those conforming to the epipolar geometry constraint. For object
recognition it is typically sufficient to approximate the global geometric transformation by a
homography with a flexible tolerance increasing towards the object boundaries.

e) The Approach of Zisserman
A. Zisserman and his collaborators developed strategies for matching of local features, mainly
in the context of the wide-baseline stereo problem [43, 42, 48, 45, 46]. Recently they
presented an interesting work relating the image retrieval problem to text retrieval [63, 47, 49].
They introduced an image retrieval system, called Video Google, which is capable of
processing and indexing full-length movies.

Detectors and Descriptors. Two types of detectors of local image elements are employed.
One is the shape-adapted elliptical regions by Mikolajczyk and Schmid, as described in
Section 3.2; the second is the Maximally Stable Extremal Regions from Section 3.4.
Representation of the local appearance is realised by the SIFT descriptors introduced by
David Lowe (see Section 3.1). Knowing that a motion video sequence is being processed,
noisy and unstable regions can be eliminated. The regions detected in each frame of the video
are tracked using a simple constant-velocity dynamic model and correlation. Any region
which does not survive for more than three frames is rejected. The estimate of the descriptor
for a region is then computed by averaging the descriptors throughout the track.
i) Indexing and Matching.
The descriptors are grouped into clusters, based on their similarity. In analogy to stop-lists in
text retrieval, where common words, like "the", are ignored, large clusters are eliminated.
When a new image is observed, each descriptor of the new image is matched only against
representatives of individual clusters. Selection of the nearest cluster immediately generates
matches for all frames of the cluster, throughout the whole movie. The exhaustive
comparison with every descriptor of every frame is thus avoided. The similarity measure,
used both for the clustering and for the closest-cluster determination, is given by the
Mahalanobis distance of the descriptors.
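The clustering-plus-stop-list idea can be sketched with plain k-means over the descriptors. The number of clusters, the iteration count and the fraction of large clusters dropped are all assumed parameters, and Euclidean distance stands in for the Mahalanobis distance used in the actual system:

```python
import numpy as np

def build_vocabulary(descriptors, k=3, iters=10, stop_fraction=0.3, seed=0):
    """k-means cluster the descriptors, then drop the largest clusters,
    the visual analogue of a text-retrieval stop-list: very common
    'visual words' carry little discriminative information."""
    rng = np.random.default_rng(seed)
    centres = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):                                  # plain Lloyd iterations
        d = np.linalg.norm(descriptors[:, None] - centres[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = descriptors[labels == j].mean(axis=0)
    sizes = np.bincount(labels, minlength=k)
    keep = sizes <= np.quantile(sizes, 1 - stop_fraction)   # drop the largest clusters
    return centres, keep
```

A new descriptor is then matched against the kept cluster centres only, which is what makes retrieval cost independent of the number of frames in the movie.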
ii) Verification.
Video frames are first retrieved using the frequency of matched descriptors, and then
re-ranked based on a measure of spatial consistency of the correspondences. The matched
regions provide an affine transformation between the query and the retrieved image, so a
point-to-point correspondence is locally available. A search area around each match is
defined by a few nearest neighbours. Other regions which also match within this area cast a
vote for that frame. Matches with no support are rejected. The final rank of the frame is
determined by the total number of votes.
f) Other Related Work - Scale Saliency by Kadir & Brady
Kadir and Brady presented an algorithm [17] that defines image regions as salient if they are
unpredictable in some specific feature space, i.e. if they exhibit high entropy with respect to a
chosen representation of local appearance. The approach offers a more general model of
feature saliency compared with conventional techniques, which define saliency only with
respect to a particular set of properties, chosen in advance.
In its basic form, the algorithm is invariant only to similarity transformations (hence the
name scale saliency; only the scale of circular regions is estimated on top of their
locations). Recently, an affine extension to the scale selection was presented [18], capable of
detecting elliptical regions. The modified saliency measure is then a function of three
parameters representing the affine deformation, instead of the single scale parameter.
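The core of the saliency measure is just the entropy of a local appearance histogram; a sketch over grey levels, with an assumed bin count, looks as follows:

```python
import numpy as np

def patch_entropy(patch, n_bins=16):
    """Shannon entropy of the grey-level histogram of a local patch.
    Kadir-Brady saliency prefers locations (and scales) where this
    entropy peaks, i.e. where the local appearance is unpredictable."""
    hist, _ = np.histogram(patch, bins=n_bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                        # 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())
```

A flat patch scores zero, a richly textured one scores close to log2 of the bin count; scale selection then searches for the radius at which this score peaks.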

g) Local PCA, approaches of Jugessur and Ohba
As discussed in Section 2.1, global PCA (principal component analysis) based methods are
sensitive to variations in the background behind objects of interest, to changes in the
orientation of the objects, and to occlusion. Ohba and Ikeuchi, and Jugessur and Dudek,
propose appearance-based object recognition methods robust to variations in the background
and to occlusion of a substantial fraction of the image.

In order to apply eigenspace analysis to the recognition of partially occluded objects, they
propose to divide the object appearance into small windows, referred to as eigen windows,
and to apply eigenspace analysis to them. As in other approaches exploiting local
appearance, even if some of the windows are occluded, the remaining ones are still effective
and can recover the object identity and pose.
In addition to robustness to occlusions, Jugessur and Dudek [16] also address the problem of
rotation invariance. The proposed solution is to compute the PCA not on the intensity
patches, but rather on the frequency-domain representation of the windows in polar
coordinates.

h) The Approach of Selinger & Nelson
The object recognition system developed by Nelson and Selinger at the University of
Rochester exploits a four-level hierarchy of grouping processes [35, 59, 61, 58, 57, 60]. The
system architecture is similar to other local feature-based approaches, though a different
terminology is used. Inspired by the Gestalt laws and perceptual grouping principles, a
four-level grouping hierarchy is built, where higher levels contain groups of elements from
lower levels.
The hierarchy is constructed as follows. At the fourth, highest level, a 3D object is represented
as a topologically structured set of flexible 2D views. The geometric relation between the
views is stored here. This level is used for geometric reasoning, but not for recognition.
Recognition takes place at the third level, the level of the component views. In these views
the visual appearance of an object, derived from a training image, is represented as a loosely
structured combination of a number of local context regions. Local context regions (local
features) are represented at the second level. The regions can be thought of as local image
patches that surround first-level features. At the first level are features (detected image
elements) that are the result of grouping processes run on the image, typically representing
connected contour fragments or locally homogeneous regions.


Figure 4: Examples of corresponding query (left columns) and database (right columns)
images from the ZuBuD dataset. The image pairs exhibit occlusion, varying illumination,
and viewpoint and orientation changes.
Efficient recognition is achieved by using a database implemented as an associative memory
of keyed context patches. An unknown keyed context patch recalls associated hypotheses for
all known views of objects that could have produced such a context patch. These hypotheses
are processed by a second associative memory, indexed by the view parameters, which
partitions the hypotheses into clusters that are mutually consistent within a loose geometric
framework (these clusters are the third-level groups). The looseness is obtained by tolerating
a specified deviation in position, size, and orientation. The bounds are set to be consistent
with a given distance between training views (e.g. approximately 20 degrees). The output of
the recognition stage is a set of third-level groupings that represent hypotheses of the identity
and pose of objects in the scene, ranked by the total evidence for each hypothesis.
Approaches matching local features have been experimentally shown to obtain state-of-the-
art results. Here we present a few examples of the addressed problems. Results are
demonstrated using the approach of Matas et al. [37, 36, 38], although comparable results
have been shown by others.


Figure 5: Image retrieval on the FOCUS dataset: query localisation results (query images,
database images, and query localisations).
Object Recognition. In object recognition experiments, the Columbia Object Image Library
(COIL-100) [1], or more often its subset COIL-20, has been widely used, and for comparison
purposes it has become a de facto standard benchmark dataset.

Figure 6: An example of matches established on a wide-baseline stereo pair.

COIL-100 is a set of colour images of 100 different objects, where 72 images of each
object were taken at pose intervals of 5 degrees. The objects are unoccluded and on an
uncluttered black background. Such a configuration is benign for appearance-based methods.
Table 1 compares recognition rates achieved by the LAF approach with the rates of several
appearance-based object recognition methods. Results are presented for five experimental
set-ups, differing in the number of training views per object. Decreasing the number of
training views increases the demands on a method's generalisation ability and on its
insensitivity to image deformations. The LAF approach performs best in all experiments,
regardless of the number of training views. With only four training views, the recognition rate
is almost 95%, demonstrating remarkable robustness to local affine distortions.
Image retrieval. The retrieval performance of the LAF method was evaluated on the FOCUS
dataset, containing 360 colour high-resolution images of advertisements scanned from
magazines. The task was to retrieve adverts for a given product, given a query image of the
product logo. Examples of query logos, retrieved images, and visualised localisations of the
logos are depicted in Figure 5.
Another challenging retrieval problem involved the recognition of buildings in urban scenes.
Given an image of an unknown building, taken from an unknown viewpoint, the algorithm
was to identify the building. The experiments were conducted on a set of images of 201
different buildings. The dataset was provided by ETH Zurich and is publicly available [62].
The database contains five photographs of each of the 201 buildings, and a separate set of
115 query images is provided. Examples of corresponding query and database images are
shown in Figure 4. The LAF method achieved a 100% recognition rate at rank 1.
Video retrieval. The problem of retrieving video frames from full-length movies was
addressed in [63]. Local descriptors were computed on key frames and stored in a database.
To reduce the otherwise enormous database size, descriptors were clustered according to their
similarity. Impressive real-time retrieval was achieved for a closed system, i.e. for the case of
query images originating from the movie itself.
Wide-baseline stereo matching. For a significant variety of scenes, the epipolar geometry can
be computed automatically from two (or possibly more) uncalibrated images showing the
scene from significantly different viewpoints. The role of matching in the wide-baseline
stereo problem is to provide corresponding points, i.e. the points which in the two images
represent an identical element of the 3D scene. Correspondences found in a difficult stereo
pair are shown in Figure 6.


a) Active and Dynamic Vision
In the introduction we overviewed some of the advantages and disadvantages of the active
vision framework. The human visual system has two main characteristics: the eyes can move,
and visual sensitivity is highly heterogeneous across visual space [33]. Curiously, these
characteristics are largely ignored by the vision community.
The human eyes exhibit four types of behaviours: saccades, fixation, smooth pursuit, and
vergence. Saccades are ballistic movements associated with visual search. Fixation is
partially associated with recognition tasks which do not require overt attention. Smooth
pursuit is associated with tracking tasks, and vergence is associated with vision tasks which
change the relative directions of the optical axes. How do these behaviours fit within the
active vision framework in computer vision? As discussed in Sec. 2.7, it is believed that
during early childhood development, the association between the sight of an object and its
function is primed by manipulation, randomly at first, and then in a more and more refined
way. This hints that there exists a strong association between active vision and learning.
Humans are excellent at recognizing and categorizing objects even from static images. It
can thus be argued that active vision research is at least as important for learning object
representations as it is for online recognition tasks. Findlay and Gilchrist [33] make a
compelling argument in support of more research in the active approach to human vision:

1. Vision is a difficult problem consisting of many building blocks that can be characterized
in isolation. Eye movements are one such building block.
2. Since visual sensitivity is the highest in the fovea, in general, eye movements are needed
for recognizing small stimuli.
3. During a fixation, a number of things happen concurrently: the visual information around
the fixation is analysed, and visual information away from the current fixation is analyzed to
help select the next saccade target.
The exact processes involved in this are still largely unknown. Findlay and Gilchrist [33] also
pose a number of questions, in order to demonstrate that numerous basic problems in vision
still remain open for research.
1. What visual information determines the target of the next eye movement?
2. What visual information determines when eyes move?
3. What information is combined across eye movements to form a stable representation of the
visual world?
As discussed earlier [29], a brute-force approach to object localization subject to a cost
constraint is often intractable as the search space size increases. Furthermore, the human
brain would have to be some hundreds of thousands of times larger than it currently is if
visual sensitivity across the visual space were the same as that in the fovea [29]. Thus, active
and attentive approaches to the problem are usually proposed as a means of addressing these
issues.
We will show in this section that, within the context of the general framework for object
recognition illustrated in Fig. 1, previous work on active object recognition systems
has conclusively demonstrated that active vision systems are capable of leading to significant
improvements in both the learning and inference phases of object recognition. This includes
improvements in the robustness of all the components of the feature extraction → feature
grouping → object hypothesis → object verification → object recognition pipeline.
Some of the problems inherent in single-view object recognition include [266]:
1. The impossibility of inverting projection and the fragility of 3D inference. It is, in general,
impossible to recover a three-dimensional world from its two-dimensional projection on an
image, unless we make restrictive assumptions about the world.
2. Occlusion. Features necessary for recognition might be self-occluded or occluded by other
objects.
3. Detectability. Features necessary for recognition might be missing due to low image
contrast, illumination conditions or incorrect camera placement [73].
4. View degeneracies. As discussed in [49], view degeneracies that are caused by accidental
alignments can easily lead to wrong feature detection and bad model parameterizations.

Figure 7: Overview of the spatiotemporal (4-D) approach to dynamic vision (adapted from
[50, 268]).
It is straightforward to see how the above problems can adversely influence the components
of a typical object recognition system shown in Fig. 1. Various attempts have been made to
address these problems. The various 3D active object recognition systems that have been
proposed so far in the literature can be compared based on the following four main
characteristics [267]:
1. Nature of the Next View Planning Strategy. Often the features characterizing two views of
two distinct objects are identical, making single-view recognition very difficult. A common
goal of many active recognition strategies is to plan camera movements and adjust the
camera's intrinsic parameters in order to obtain different views of the object that will enable
the system to escape from single-view ambiguities. While classical research on active vision
from the field of psychology has largely focused on eye and head movements, the next-view
planning literature in computer vision and robotics assumes more degrees of freedom, since
there are no constraints on how the scene can be sensed or what types of actuators the robotic
platform can have.
2. Uncertainty Handling Capability of the Hypothesis Generation Mechanism. One can
distinguish between Bayesian and non-Bayesian approaches to the hypothesis generation
problem and the handling of uncertainty in inference.
3. Efficient Representation of Domain Knowledge. The efficiency of the mechanism used to
represent domain knowledge and form hypotheses is another feature distinguishing the
recognition algorithms. This domain knowledge could emerge in the form of common
features such as edges, moments, etc., as well as other features that are appropriate for
using context or an object's function to perform recognition.
4. Speed and Efficiency of Algorithms for Both Hypothesis Generation and Next View
Planning. Complexity issues arise, for example, in terms of the reasoning and next-view
planning algorithm that is used, but also in terms of other components of the recognition
algorithm. The complexity of those sub-components can play a decisive role as to whether we
will have a real-time active object recognition algorithm, even if we use a highly
efficient representation scheme for the domain knowledge from point 3.

As indicated in the introduction, the dynamic vision paradigm subsumes the active vision
paradigm, and is more focused on dynamic scenes where vision algorithms (such as
recognition algorithms) are applied concurrently to the actions being executed. Within this
context, a significant topic of research in dynamic vision systems is the incorporation of
predictions of future developments and possibilities [50, 268]. Historically, dynamic vision
systems have focused on the construction of vision systems that are reliable in indoor and
outdoor environments. Within this context, dynamic vision systems are also more tightly
coupled to the research interests of the robotics community, as compared to classical
computer vision research.
Historically, dynamic vision research emerged due to the need to integrate recursive
estimation algorithms (e.g.,Kalman filters) with spatio-temporal models of objects observed
from moving platforms. As pointed out by Dickmanns, applying vision algorithms
concurrently to the actions performed requires the following (also see Fig. 24):
(i) The computation of the expected visual appearance from fast-moving image sequences, and the representation of models for motion in 3-D space and time.
(ii) Taking into account the time delays of the different sensor modalities, in order to synchronize the image interpretation.
(iii) The ability to robustly fuse different elements of perception (such as inertial, visual and odometry information) whose strengths and weaknesses complement each other in different situations. For example, visual feedback is better for addressing the long-term stability and drift problems that can emerge from inertial signals, while inertial signals are better for short-term stability when implementing ego-motion and gaze stabilization algorithms.
(iv) Incorporating a knowledge base of manoeuvre elements to help with situational assessment.
(v) Incorporating a knowledge base of behavioural capabilities for various scene objects, so that an object's behaviour and identity can be identified more easily from small temporal action elements.
(vi) Taking into consideration the interdependence of the perceptual and behavioural capabilities and actions across the system's various levels, all the way down to the actual hardware components.
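The complementary strengths of visual and inertial signals described in point (iii) can be illustrated with a simple complementary filter. The sketch below is illustrative only: the gain, the signal model and the constants are invented, and this is not Dickmanns' implementation.

```python
import numpy as np

def complementary_filter(visual, inertial_rate, dt, alpha=0.98):
    """Fuse a drift-free but noisy visual angle estimate with a fast but
    drifting integrated inertial rate. alpha close to 1 trusts the
    short-term inertial signal; the (1 - alpha) visual term corrects
    long-term drift."""
    est = visual[0]
    out = []
    for v, w in zip(visual, inertial_rate):
        est = alpha * (est + w * dt) + (1.0 - alpha) * v
        out.append(est)
    return np.array(out)

# Toy scenario: constant true angle of 1.0 rad, gyro with a constant bias.
dt = 0.01
n = 2000
true_angle = 1.0
rng = np.random.default_rng(0)
visual = true_angle + rng.normal(0, 0.05, n)  # noisy but unbiased
gyro = np.zeros(n) + 0.1                      # pure bias: would drift alone
fused = complementary_filter(visual, gyro, dt)
print(abs(fused[-1] - true_angle))  # stays bounded instead of drifting
```

Integrating the biased gyro alone would accumulate an error of roughly 2 rad over this run; the small visual correction keeps the fused estimate near the true angle.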
We see that dynamic vision systems incorporate what is often also referred to as contextual
information, thus taking a much broader and holistic approach to the vision problem. A
significant insight of Dickmanns' spatio-temporal approach to vision was that modelling objects and motion processes over time in 3-D space (rather than directly in the image plane), and then perspectively projecting those models into the image plane, drastically improved the calculation of the respective Jacobian matrices used in the recursive estimation processes; such models thus became necessary components of a dynamic vision system. This approach led to the creation of robust vision systems that
were far more advanced than what had been considered to be the state-of-the-art up until
then. Examples of such systems include pole balancing using an electro-cart [269], the first
high-speed road vehicle guidance by vision on a highway [50] (which includes modules for
road recognition [270, 271, 272, 273], lane recognition, road curvature estimation, and lane
switching [274, 50], obstacle detection and avoidance [275], recognition of vehicles and
humans [276], and autonomous off-road driving [50]), as well as aircraft and helicopters with a sense of vision for autonomous landing [277, 278]. Within the context of the
recognition pipeline shown in Fig. 1, we see that the work by Dickmanns improved the reliability of the measured features, of the predicted features, and of the object hypotheses and their subsequent grouping, when attempting to extract these features under egomotion. These improvements led to innovations in vision that were significant and, for the time, surprising, for example by demonstrating the first self-driving vision-based road vehicles.
In Sec. 1 we discussed some of the execution costs associated with an active vision system.
These problems (such as the problem of determining correspondences under an imperfect
stereo depth extraction algorithm and the problem of addressing dead-reckoning errors) are
further exacerbated in dynamic vision systems where the actions are executed concurrently to
the vision algorithms. This is one major reason why the related problems are more prominently identified and addressed in the literature on dynamic vision systems, since addressing these problems usually becomes a necessary component of any dynamic vision system.
At this point we need to make a small digression and discuss the difference between passive
sensors, active sensors, active vision and passive vision. While passive and active vision refer to the use (or lack thereof) of intelligent control strategies applied to the data acquisition
process, an active sensor refers to a sensor which provides its own energy for emitting
radiation, which in turn is used to sense the scene. In practice, active sensors are meant to
complement classical passive sensors such as light-sensitive cameras. The Kinect [295] is a popular example of a sensor that combines a passive RGB sensor and an active sensor (an infrared laser combined with a monochrome CMOS camera for interpreting the active sensor data and extracting depth).

Chart 1: Summary of the 1989-2009 papers in Table 5 on active object detection. By definition, search efficiency is not the primary concern in these systems, since by assumption the object is always in the sensor's field of view. However, inference scalability constitutes a significant component of such systems. We notice very little use of function and context in these systems. Furthermore, training such systems is often non-trivial.

Figure 8: A sequence of viewpoints from which the system developed by Wilkes and Tsotsos [266] actively recognizes an origami object.

One could classify vision
systems into those which have access to depth information (3D) and those that do not. One
could argue that the use of active sensors for extracting depth information is not essential in
the object recognition problem, since the human eyes are passive sensors and stereo depth
information is not an essential cue for the visual cortex. In practice, however, active sensors
are often superior for extracting depth under variable illumination. Furthermore, depth is a
useful cue in the segmentation and object recognition process. One of the earliest active
recognition systems [286] made use of laser range-finders. Within the context of more recent
work, the success of Kinect-based systems [296, 297, 298, 299] demonstrates how combined
active and passive sensing systems improve recognition. For example, the work by Tang et
al. [296] achieved top ranked performance in a related recognition challenge, by leveraging
the ability of the Kinect to provide accurate depth information in order to build reliable 3D
object models. Within the context of the recognition pipeline shown in Fig.1, active sensors
enable us to better register the scene features with the scene depth. This enables the creation
of higher fidelity object models, which in turn are useful in improving the feature grouping
phase (e.g., determining the features which lie at similar depths) as well as the object
hypothesis and recognition phases (by making 3D object model matching more reliable).
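As a toy illustration of how registered depth can aid the feature grouping phase, the sketch below clusters features whose depths are similar. The feature names, depths and gap threshold are invented for illustration.

```python
import numpy as np

def group_features_by_depth(features, depths, gap=0.15):
    """Group image features whose registered depths are similar: sort by
    depth and start a new group whenever the gap between consecutive
    depths exceeds a threshold (in metres). A simple stand-in for the
    depth-based grouping described in the text."""
    order = np.argsort(depths)
    groups, current = [], [order[0]]
    for i_prev, i in zip(order, order[1:]):
        if depths[i] - depths[i_prev] > gap:
            groups.append(current)
            current = []
        current.append(i)
    groups.append(current)
    return [[features[i] for i in g] for g in groups]

feats = ["corner_a", "corner_b", "edge_c", "blob_d", "corner_e"]
depth = np.array([0.52, 0.50, 1.80, 1.85, 0.55])  # metres from the sensor
print(group_features_by_depth(feats, depth))
```

Here the three corner features at roughly half a metre form one group and the two distant features form another, which is the kind of depth-consistent grouping that benefits 3D model matching.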

b) Active Object Detection Literature Survey

With the growing popularity in the 1990s of machine learning and Bayesian approaches for solving computer vision problems, active vision approaches lost their appeal. The related number of publications decreased significantly between the late 1990s and the following decade.
This lack of interest in active vision systems is partially attributable to the fact that power
efficiency is not a major factor in the design of vision algorithms. This is also evidenced by
the evaluation criteria of vision algorithms in popular conferences and journals, where
usually no power metrics are presented. Note that an algorithm's asymptotic space and time
complexity is not necessarily a sufficiently accurate predictor of power efficiency, since this
does not necessarily model well the degree of communication between CPU and memory in a
von-Neumann architecture.
One of the main research interests of the object recognition community over the last 10-15
years, has been on the interpretation of large datasets containing images and video. This has
been mainly motivated by the growth of the internet, online video, and smartphones, which
make it extremely easy for anyone to capture high-quality pictures and video. As a result, most of the vision community's resources have been focused on addressing the industry's need for good vision algorithms to mine all this data, and research on active approaches to vision was not a priority.
Recently, however, there has been a significant upsurge of interest in active vision related
research. This is evidenced by some of the more recent publications on active vision, which
are also discussed in Secs. 3.1-3.2. In this section we focus on the active object detection problem, which involves the use of intelligent data acquisition strategies in order to robustly
choose the correct value of at least one binary label/classification associated with a small 3D
region. The main distinguishing characteristic of the active object detection literature, as
compared to the literature on active object localization and recognition, is that in the
detection problem we are interested in improving the classification performance in some
small 3D region, and are not as interested in searching a large 3D region to determine the
positions of one or more objects. In Charts 7, 8 and Tables 5,6 we compare, along certain
dimensions, a number of the papers surveyed in Secs.3.1,3.2. Notice that the active object
detection systems of Table 5 make little use of function and context. In contrast to the non-
active approaches, all the active vision systems rely on 3D depth extraction mechanisms
through passive (stereo) or active sensors. From Tables 5,6 we notice that no active
recognition system
is capable of achieving consistently good performance along all the compared dimensions. In
this respect it is evident that the state of the art in passive recognition (Table 7) surpasses the
capabilities of active recognition systems.
Wilkes and Tsotsos [266] published one of the first papers to discuss the concept of active
object detection (see Fig. 25) by presenting an algorithm to actively determine whether a
particular object is present on a table. As the authors argue, single view object recognition
has many problems because of various ambiguities that might arise in the image, and the
inability of standard object recognition algorithms to move the camera and obtain a more
suitable viewpoint of the object and thus, escape from these ambiguities. The paper describes
a behaviour-based approach to camera motion and describes some solutions to the above-mentioned ambiguities. These ambiguities are discussed in more detail by Dickinson et al.
[280]. Inspired by the arguments in [266], the authors begin by presenting certain reasons as
to why the problem of recognizing objects from single images is so difficult. The reasons
were discussed in the previous section and include the impossibility of inverting projection,
occlusions, feature detectability issues, the fragility of 3D inference, and view degeneracies.
To address these issues the authors define a special view as a view of the object, optimizing
some function f of the features extracted from the image data. Let P0, P1, P2 be three points
on the object, and let dij denote the length of the projected line segment between points Pi and Pj. The authors try to locate a view of the object maximizing d01 and d02, subject to the
constraint that the distance of the camera from the center of the line joining P0 and P1 is at
some constant distance r. The authors argue that such a view will make it less likely that they
will end up in degeneracies involving points P0, P1, P2 [49]. Once they have found this
special view, the authors suggest using any standard 2D pattern recognition algorithm to do
the recognition. Within the context of the standard recognition pipeline in Fig.1, we see that
[266] showed how an active vision system can escape from view-degeneracies, thus leading
to more reliable feature extraction and grouping. Callari and Ferrie [300, 279] introduce a
method for view selection that uses prior information about the objects in the scene. The
work is an example of an active object detection system that incorporates contextual knowledge. This contextual knowledge is used to select viewpoints that are optimal with respect to a
criterion. This constrains the gaze control loop, and leads to more reliable object detection.

The authors define contextual knowledge as the combination of a discrete set of prior hypotheses about the relative likelihood of various model parameters, given a set of object views, with
the likelihood of each object hypothesis as the agent explores the scene. The active camera
control mechanism is meant to augment this contextual knowledge and, thus, enable a
reduction in the amount of data needed to form hypotheses and provide us with more reliable
object recognition. The paper describes three main operations that an agent must perform: (a)
Data collection, registration with previous data and modelling using a pre-defined scene
class. (b) Classification of the scene models using a set of object hypotheses. (c)
Disambiguation of ambiguous hypotheses/classifications by collecting new object views/data
to reduce ambiguity. The paper does not discuss how to search through an arbitrary 3D region
to discover the objects of interest. The paper assumes that the sensor is focused on some
object, and any motion along the allowed degrees of freedom will simply sense the object
from a different viewpoint (i.e., it tackles a constrained version of the object search problem).
Thus, this active vision system provided a methodology for improving the object hypothesis
and verification phases of the pipeline in Fig.1.
Dickinson et al. [280] combine various computer vision techniques in a single framework in
order to achieve robust object recognition. The algorithm is given the target object as its
input. Notice that even though the paper does deal with the problem of object search and
localization within a single image, its next viewpoint controller deals mostly with verifying
the object identity from a new viewpoint, which is the reason we refer to this algorithm as an
active object detector.
The paper combines a Bayesian-based attention mechanism with aspect-graph-based object recognition and viewpoint control, in order to achieve robust recognition in the presence of
ambiguous views of the object. See Figs. 26, 27 for an overview of the various modules
implemented in the system. The object representation scheme is a combination of Object
Centered Modelling and Viewer Centered Modelling. The object centered modelling is
accomplished by using 10 geons. These geons can be combined to describe more complex
types of objects. The Viewer Centered modelling is accomplished by using aspects to
represent a small set of volumetric parts from which an object is constructed, rather than
directly representing an object. One obvious advantage of this is the decrease in the size of
the aspect hierarchy. However, if a volumetric part is occluded, this could cause problems in
the recognition. To solve this problem, the authors extend the aspect graph representation into
an aspect graph hierarchy (see Fig. 12) which consists of three levels: the set of aspects that model the chosen volumes, the set of component faces of the aspects, and the set of boundary groups representing all subsets of contours bounding the faces. The idea is that if an aspect is occluded, they can use some of these more low-level features to achieve the recognition.
Figure 26: The face recovery and attention mechanism used in [280] (diagram adapted from [280]).

Figure 9: The object verification and next viewpoint selection algorithm used in [280] (diagram adapted from [280]).

From this hierarchy of geon primitives, aspects, faces and boundary groups, the authors create a Bayesian network,
and extract the associated conditional probabilities. The probabilities are extracted in a
straightforward manner by uniformly sampling the geons using a Gaussian sphere. For example, to estimate the probability of face x occurring given that boundary group y is
currently visible, they use the sampled data to calculate the related probability. From this
data, the authors use a slight modification of Shannon's entropy formula to discover that, for
the geon based representation, faces are more discriminative than boundary groups.
Therefore, they use faces as a focus feature for the recovery of volumetric parts. Using
various segmentation algorithms described in the literature, the authors segment the images
and create region topology graphs (denoting region adjacencies), region boundary topology
graphs (denoting relations between partitioned segments of bounding contours) and face
topology graphs (indicating the labelled face hypothesis for all regions in the image). Each
region's shape in the image is classified by matching its region boundary graph to those
graphs representing the faces in the augmented aspect hierarchy graph using interpretation
tree search. This enables the creation of face topology graphs labelling the current image.
They use this face topology graph labelling with attention driven recognition in order to limit
search in both the image and the model database. Given as input the object they wish to
detect, the authors define a utility function U that can be used in conjunction with the
previously defined conditional probabilities and the aspect graph, to determine the most
likely face to start their search with, given the object they are trying to find. The search uses
concepts inspired by game theory, and searches until there is a good match between
the face topology graph for the image and the augmented aspect graph.
Then, a verification step is done, by using various metrics to see if the aspects and volumes
also match. If there is no match the authors proceed with the next most likely matching face,
and the process iterates. Extensions of this recognition algorithm to multipart objects are also described and involve some extra steps in the verification phase searching for
connectedness among their part aspects. The final component of the recognition algorithm
involves viewpoint control. Viewpoint control makes it possible to resolve viewpoint
degeneracies. As already discussed in this survey (also see the discussion towards the end of
this section), such degeneracies have been shown to frequently occur in practice. The authors
define an aspect prediction graph which is a more compact version of the aspect graph and
specifies transitions between topologically equivalent views of the object. They use this
graph to decide the direction of camera motion. The main idea is to move the camera to the
most likely aspect (excluding the already viewed aspects), based on the previously
calculated conditional probabilities and the most likely volume currently viewed, in order to
verify whether it is indeed this hypothesized volume that is in the field of view. Then the
algorithm described above is repeated.
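The viewpoint-control rule just described, moving to the most likely aspect while excluding those already viewed, can be sketched as follows. The probability table and aspect names below are hypothetical and are not taken from [280].

```python
def next_aspect(p_aspect_given_volume, volume, viewed):
    """Pick the most probable aspect of the hypothesized volume that has
    not been viewed yet: a minimal stand-in for the aspect-prediction-graph
    policy described in the text."""
    candidates = {a: p for a, p in p_aspect_given_volume[volume].items()
                  if a not in viewed}
    if not candidates:
        return None  # every aspect of this volume has been examined
    return max(candidates, key=candidates.get)

# Hypothetical conditional probabilities P(aspect | volume) for one geon.
table = {"cylinder": {"side": 0.55, "end_on": 0.15, "oblique": 0.30}}
print(next_aspect(table, "cylinder", viewed={"side"}))  # -> "oblique"
```

Once the predicted aspect is confirmed or rejected from the new viewpoint, the viewed set grows and the rule is applied again, mirroring the repeat step in the text.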
The main innovation of the paper is the combination of many ideas in computer vision (attention, object recognition, viewpoint control) in a single framework. Limitations of the
paper include the assumption that objects can be represented as constructions of volumetric
parts (which is difficult for various objects such as clouds or trees), and its reliance on
salient homogeneous regions in the image for the segmentation. Real objects contain a lot of
details, and the segmentation is in general difficult. Notice that there is room for
improvement in the attentive mechanisms used. No significant effort is made to create a
model that adaptively adjusts its expressive power during learning, potentially making proper
training of the model somewhat of an art and dependent on manual intervention by the user.
As is the case with many of the papers described so far, the model relies heavily on the extraction of edges and corners, which might make it difficult to distinguish an object based
on its texture or color. Within the context of Fig.1, the work by Dickinson et al. [280]
proposes an active vision framework for improving all the components of the standard vision
pipeline. This also includes the object databases component, since the use of a hierarchy is
meant to provide a space-efficient representation of the objects.
Schiele and Crowley [281] describe the use of a measure called transinformation for building
a robust recognition system. The authors use this to describe a simple and robust algorithm
for determining the most discriminating viewpoint of an object. Spectacular recognition rates
of nearly 100% are presented in the paper. The main idea of the paper is to represent the 3D
objects by using the probability density function of local 2D image characteristics acquired
from different viewpoints. The authors use Gaussian derivatives as the local characteristics of
the object.
These derivatives allow them to build histograms (probability density functions) of the image
resulting after the application of the filter. Assuming that a measurement set M of some local characteristics {mk} is acquired from the image (where the local characteristics might be, for example, the x-coordinate derivatives or the image's Laplacian), they obtain a probability distribution p(M|on, R, T, P, L, N) for the object on (where R, T, P, L, N denote
the rotation, translation, partial occlusion, light changes and noise). The authors argue that for
various reasons (the filters they use, the use of histograms and so on) the distribution is
conditionally independent of various of these variables and it suffices to build histograms for
p(M|on, S) where S denotes the rotation and one of the translation parameters. The authors define the quantity

p(on, Sj|M) = p(M|on, Sj) p(on, Sj) / p(M),

which gives the probability of object on in pose Sj occurring given that we know the resulting images under the set {mk} of filters. The probabilities on the right-hand side are known
through the histogram based probability density estimation we described above. We can use
this probability to recognize the object we currently view and its pose by simply maximizing
the probability over all values for variables n, j. Test results that the authors cite indicate that
this performs very well even in cases where only 40% of the object is visible. The authors
then describe the object recognition process in terms of the transmission of information. The quantity

T(O, M) = Σn Σk p(on, mk) log( p(mk|on) / p(mk) )

(for the sets O, M of the objects and image features respectively) is the transinformation. Intuitively, the lower the quantity, the closer the two sets are to being statistically independent, implying that one set's values do not affect the other set's values. This is used to choose the salient viewpoints of an object and, thus, provide an algorithm for active object
detection. By rewriting the previous equation for transinformation as

T(O, M) = Σn p(on) T(on, M),

we see that the transinformation can be interpreted as the average over all objects of each object on's transinformation

T(on, M) = Σk=1..K p(mk|on) log( p(mk|on) / p(mk) ).

By going one step further and incorporating the pose Sj of an object in the previous definition of transinformation, we get

T(on, Sj, M) = Σk=1..K p(mk|on, Sj) log( p(mk|on, Sj) / p(mk) ),

and we see that we can find the most significant viewpoints of an object by finding the
maximum over all j of this equation. The authors use this last formula to hypothesize the
object identity and pose from an image. Then, they use again this last formula to estimate the
most discriminating viewpoint for the hypothesized object, move the camera to that
viewpoint, perform verification and proceed until some threshold is passed, indicating that
the object has been identified.
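The per-viewpoint transinformation selection can be sketched as below. The three quantized filter-response bins, the per-view distributions and the marginal are invented for illustration, not taken from [281].

```python
import numpy as np

def transinformation(p_m_given_view, p_m):
    """T(on, Sj, M) = sum_k p(mk|on,Sj) * log( p(mk|on,Sj) / p(mk) )."""
    p = np.asarray(p_m_given_view, float)
    return float(np.sum(p * np.log(p / p_m)))

def most_discriminating_view(views, p_m):
    """Return the pose index maximizing the transinformation, i.e. the
    viewpoint whose feature statistics deviate most from the marginal."""
    scores = [transinformation(v, p_m) for v in views]
    return int(np.argmax(scores))

# Hypothetical feature distributions of one object from three viewpoints,
# and the marginal distribution p(mk) over the whole database.
views = [[0.34, 0.33, 0.33],   # near the marginal: uninformative
         [0.80, 0.10, 0.10],   # distinctive view
         [0.50, 0.25, 0.25]]
p_m = np.array([1 / 3, 1 / 3, 1 / 3])
print(most_discriminating_view(views, p_m))  # -> 1
```

A view whose feature distribution matches the database marginal carries almost no information about identity, so the selection rule favours the view that deviates most from it.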
Overall, the main advantage of the paper is that it provides an elegant and simple method to
perform object recognition. The test results provide strong evidence of the power of the
active object recognition framework. The more verification steps performed, the lower the
misclassification rate. A drawback of the method is that it has not been tested on much larger
datasets, and little work has been done to see how it performs under non-uniform imaging conditions. Furthermore, a question arises regarding the algorithm's performance as the errors in the estimation
of the camera position increase. As discussed in [45], the implications could be significant.
Similarly to the above paper, Borotschnig et al. [282] use an information theoretic based
quantity (entropy) in order to decide the next view of the object that the camera should take
to recognize the object and obtain more robust object recognition in the presence of ambiguous viewpoints. The approach uses an appearance-based recognition system (inspired by Murase and Nayar's [141] popular PCA-based recognition algorithm) that is augmented
by probability distributions. The paper begins by describing how to obtain an eigenbasis of
all the objects in our database from all views. Then, given a new image, the algorithm can
project that new image on the eigenbasis to obtain a point g in that basis, denoting the image.
Denote by p(g|oi, φj) the probability of point g occurring in the eigenspace of all objects when projecting an image of object oi with pose parameters φj. Under ideal circumstances p(g|oi, φj) would be a spike function. In other words, the function
would be zero for all values of g, except for one value for which it would be equal to 1.
However, due to various sources of error (fluctuations in imaging conditions, pan, tilt, zoom
errors, segmentation errors etc.) the authors estimate this probability from a set of sample
images with fixed oi and φj values.


Figure 10: Graphical model for next-view-planning as proposed in [284, 285].
The probability density function is modelled as a multivariate Gaussian with mean and
standard deviation estimated from the sample images. By Bayes theorem it can be shown

In the experiments the authors assumed that p(oi) and p(j |oi) are uniformly distributed. In
their test cases the authors choose a number of bins in which they will discretize the possible
number of viewpoints and use them to build these probability distribution functions. Then,
given some vector g in the eigenspace of shapes, the conditional probability of seeing object
oi is given by p(oi|g) =Pj P(oi, j|g).
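The posterior computation described above can be sketched as follows, with one Gaussian per (object, pose) in a two-dimensional eigenspace. The means, covariances and object names are invented for illustration.

```python
import numpy as np

def gaussian_log_pdf(g, mean, cov):
    """Log-density of a multivariate Gaussian at point g."""
    d = g - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ inv @ d + logdet + len(g) * np.log(2 * np.pi))

def object_posterior(g, models):
    """P(oi | g) = sum_j P(oi, phi_j | g), with p(g | oi, phi_j) modelled
    as a Gaussian per (object, pose) and uniform priors, as in the text."""
    joint = {k: np.exp(gaussian_log_pdf(g, m, c)) for k, (m, c) in models.items()}
    z = sum(joint.values())
    post = {}
    for (obj, _pose), p in joint.items():
        post[obj] = post.get(obj, 0.0) + p / z
    return post

models = {
    ("duck", 0):  (np.array([0.0, 0.0]), np.eye(2) * 0.05),
    ("duck", 90): (np.array([1.0, 0.0]), np.eye(2) * 0.05),
    ("car", 0):   (np.array([0.0, 2.0]), np.eye(2) * 0.05),
}
post = object_posterior(np.array([0.9, 0.1]), models)
print(max(post, key=post.get))  # -> "duck"
```

Marginalizing the pose index out of the joint posterior, as in the last line of the text, makes the object decision robust to pose uncertainty.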
By iterating over all the objects in the database and finding the most likely object, objects are
recognized. The authors then further expand on this idea and present an algorithm for actively
controlling the camera. They show that in cases where the object database contains objects
that share similar views, the active object recognition framework leads to striking
improvements. The key to this is the use of planned camera movements that lead to
viewpoints from which the object appears distinct. Note also that the authors use only one
degree of freedom for rotating around the object along a constant radius. However,
extensions to arbitrary rotations should be straightforward to implement. The authors define a
metric s(Δφ) which gives the average entropy reduction over the object identity if the point of view is changed by Δφ. Since there is a discrete number of views, finding the optimal Δφ is a
simple linear search problem. The authors draw three major conclusions based on their results: (a) The dimension of the eigenspace can be lowered significantly if active recognition is guiding the object classification; in other words, active recognition might open the way to the use of very large object databases, suitable for real-world applications. (b) Even objects that
share most views can be successfully disambiguated. (c) The number of steps needed to
obtain good recognition results is much lower than with random camera placement, again
indicating the usefulness of the algorithm (2.6 vs. 12.8 steps on average). This last point is
further supported in [24]. The three above points demonstrate how an active vision
framework might decrease the size of the object database needed to represent an object, and
help improve the object hypotheses and verification phase, by improving the disambiguation
of objects that share many views (see Fig. 1).
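A sketch of the average-entropy-reduction criterion for choosing the next rotation is given below. The candidate rotations and their predicted posteriors are invented rather than derived from a learned appearance model.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def best_rotation(current_posterior, predicted_posteriors):
    """Score each candidate rotation by the entropy reduction it is
    predicted to achieve in the object posterior, and return the best:
    a sketch of the s(delta-phi) criterion described in the text."""
    h0 = entropy(current_posterior)
    scores = {d: h0 - entropy(p) for d, p in predicted_posteriors.items()}
    return max(scores, key=scores.get)

current = [0.5, 0.5]                      # two objects, fully ambiguous
predicted = {30:  [0.55, 0.45],           # barely helps
             90:  [0.95, 0.05],           # nearly disambiguates
             180: [0.60, 0.40]}
print(best_rotation(current, predicted))  # -> 90
```

Because the number of candidate views is discrete, the maximization is the simple linear search mentioned in the text.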
These ideas were further expanded upon by Paletta and Prantl [283], where the authors
incorporated temporal context as a means of helping disambiguate initial object hypotheses.
Notice that in their previous work, the authors treated all the views as bags of features
without taking advantage of the view/temporal context. In [283] the authors work on this
shortcoming by adding a few constraints to their probabilistic quantities. They add in their
probabilistic formulation temporal context, by encoding that the probability of observing a view (oi, φj) after a viewpoint change Δφ must be equal to the probability of observing the view (oi, φj − Δφ) before the change. This leads to a slight change in the Bayesian equations used to fuse the
data and leads to an improvement in recognition performance. In [301] the authors use a
radial-basis-function network to learn object identity. The authors point out that the on-line evaluation of the information gain (and of most probabilistic quantities, as a matter of fact) is intractable, and that, therefore, learned mappings of decision policies have to be applied in
next view planning to achieve real-time performance.
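The temporal-context constraint, shifting each pose hypothesis by the known viewpoint change before fusing the new view's likelihood, can be sketched as a sequential Bayes update over an objects-by-poses belief table. All numbers below are invented for illustration.

```python
import numpy as np

def fuse_views(prior, likelihoods_per_step, shifts):
    """Sequentially fuse view likelihoods over an (objects x poses) belief.
    Between observations, the pose index of every hypothesis is shifted by
    the known camera rotation, enforcing the temporal-context constraint
    described in the text."""
    belief = np.asarray(prior, float)
    for lik, shift in zip(likelihoods_per_step, shifts):
        belief = np.roll(belief, shift, axis=1)    # known viewpoint change
        belief = belief * np.asarray(lik, float)   # Bayes update
        belief = belief / belief.sum()
    return belief

# 2 objects x 4 discretized poses, uniform prior.
prior = np.full((2, 4), 1 / 8)
step1 = [[0.4, 0.1, 0.1, 0.1], [0.4, 0.1, 0.1, 0.1]]  # ambiguous view
step2 = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1]]  # object 0, predicted pose
belief = fuse_views(prior, [step1, step2], shifts=[0, 1])
obj_posterior = belief.sum(axis=1)
print(obj_posterior.argmax())  # -> 0
```

The first view alone cannot separate the two objects; the second view, consistent with the pose predicted from the known rotation, concentrates the belief on object 0.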
Roy et al. [284] present an algorithm for pose estimation and next-view planning. A novelty of this paper is that it presents an active object recognition algorithm for objects that might not fit entirely in the camera's field of view, and does not assume calibrated intrinsic
parameters. In other words it improves the feature grouping and object hypothesis modules of
the standard recognition pipeline (see Fig.1), through the use of a number of invariants that
enable the recognition of objects which do not fit in a camera's field of view, and thus are not
recognizable using a passive approach to vision. It should be pointed out that this was the
first active recognition/detection system to tackle this important and often encountered real
world problem. The paper introduces the use of inner camera invariants for pose estimation.
These image-computable quantities, in general, do not depend on most intrinsic camera
parameters, but assume a zero skew. The authors use a probabilistic reasoning framework
that is expressed in terms of a graphical model, and use this framework for next-view
planning to further help them with disambiguating the object. Andreopoulos and Tsotsos
[239] also present an active object localization algorithm that can localize objects that might
not fall entirely in the sensor's field of view (see Fig. 23). Overall this system was shown to
be robust in the case of occlusion/clutter. A drawback of the method is that it was only tested
with simple objects that contained parallelograms. It is interesting to see how the method
would extend if we were processing objects containing more complicated features. Again, its
sensitivity to dead-reckoning errors is not investigated. Roy and Kulkarni [285] present a
related paper with a few important differences. First of all, the paper does not make use of invariant features as [284] does. Furthermore, the graphical model is used to describe an
appearance-based aspect graph: features φij represent the aspects of the various objects in our database, and the classes Ck represent sets of topologically equivalent aspects. These
aspects might belong to different parts of the same object, or to different objects altogether,
yet they are identical with respect to the features we measure. For each class Ck the authors
build an eigenspace Uk of object appearances. Given any image I, they find the eigenspace
parameter c, and affine transformation parameter a, that would minimize

E(c, a) = Σx ρ( I(f(x; a)) − (Uk c)(x), σ ),

where ρ is a robust error function, σ is a scale parameter and f is an affine transformation. They use this (c, a) to find the most likely class Ck corresponding to the object. The probabilities
are estimated by the reconstruction error induced by projecting the image I on each one of the
class eigenspaces Uk. The smaller the reconstruction error, the more likely we have found the
corresponding class. Then, the a priori estimated probabilities P(i j|Ck) are used to find the
most likely object Om corresponding to the viewed image. If the probability of the most
likely object is not high enough, we need to move to a next view to disambiguate the
currently viewed object. The view-planning is similar to that of paper [284], only that there is
just 1 degree of freedom in this paper (clockwise or counter clockwise rotation around some
axis). By using a heuristic that is very similar to the one in paper [284] and based on
knowledge from previously viewed images of the object, the authors form a list of the camera
movements that we should make to disambiguate the object. This procedure is repeated until
the object is disambiguated.
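The class-selection step described above can be sketched as follows. This is a minimal sketch assuming precomputed orthonormal class eigenspaces U_k; it omits the affine alignment and robust error function of the paper, and the Gaussian error model used to turn reconstruction errors into class probabilities is our assumption:

```python
import numpy as np

def most_likely_class(image, eigenspaces, sigma=1.0):
    """Pick the class whose eigenspace reconstructs `image` best.

    `eigenspaces` is a list of (d, m) orthonormal bases, one per class C_k.
    The affine alignment step of the paper is omitted, and a Gaussian error
    model (an assumption, not the paper's exact choice) converts the
    reconstruction errors into class probabilities.
    """
    x = image.ravel().astype(float)
    errors = []
    for U in eigenspaces:
        c = U.T @ x                 # eigenspace parameters for this class
        reconstruction = U @ c      # projection of the image onto U_k
        errors.append(np.sum((x - reconstruction) ** 2))
    errors = np.array(errors)
    # Smaller reconstruction error -> higher class probability.
    logp = -errors / (2.0 * sigma ** 2)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return int(np.argmax(p)), p
```

The returned probability vector plays the role of the per-class confidences that drive the decision of whether a next view is needed.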

The authors use the COIL-20 object database from Columbia University to do their testing.
The single-view based correct recognition rate was 65.70% while the multi-view recognition
rate increased to 98.19%, indicating the usefulness of the recognition results and the promise
in general of the active object recognition framework under good dead-reckoning.
Furthermore, the average number of camera movements to achieve recognition was 3.46 vs.
5.40 moves for the case of random camera movements, again indicating the usefulness of the
heuristic the authors defined for deciding the next view. Notice that this is consistent with the
results in [24, 282]. Disadvantages of the paper include testing the method only on objects
against a black background (with apparently little occlusion) and using only a single degree
of freedom in moving the camera to disambiguate the object.
Hutchinson and Kak [286] present one of the earliest attempts at active object detection. The
authors generalize their work by assuming that they have at their disposal a variety of
different sensors (monocular cameras, laser range finders, manipulator fingers, etc.). Thus, within the
context of the standard object recognition pipeline (Fig.1), this is an example of a system that
combines multiple types of feature extractors. It also represents one of the earliest active
approaches for object hypothesis and verification. Each one of those sensors provides various
surface features that can be used to disambiguate the object. These features include surface
normal vectors, Gaussian and Mean curvatures, area of each polyhedral surface and
orientation, amongst others. By creating an aspect graph for each object and by associating
with each aspect the features corresponding to the surfaces represented by that aspect, the
algorithm can formulate hypotheses as to the objects in a database that might correspond to
the observed object.

Figure 11: The aspects of an object and its congruence classes (adapted from Gremban and
Ikeuchi [287]).

The authors then do a brute-force search on all the aspects of each aspect graph in the
hypotheses, and move the camera to a viewpoint of the object that will lead to the greatest
reduction in the number
of hypotheses. In general, this is one of the first papers to address the active object detection
problem. A disadvantage of this paper is the oversimplifying assumption of polyhedral
objects. Another disadvantage is the heuristic used to select the camera movements, since in
general it gives no guarantee that the sensor movements will be optimal in terms of the
number of movements until recognition takes place. Notice that complexity issues need to be
addressed, since in practice the aspect graphs of objects are quite large and can make a brute-
force search through the aspects of all aspect graphs infeasible. Furthermore, as is the case
with most of the active object recognition algorithms described so far, the issue of finding the
optimal sequence of actions subject to a time constraint is not addressed.
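The hypothesis-pruning viewpoint selection described above can be sketched as follows. The names `hypotheses`, `viewpoints` and `predict` are our illustrative structures, not the paper's; the sketch assumes uniform priors over the surviving hypotheses:

```python
from collections import defaultdict

def best_next_viewpoint(hypotheses, viewpoints, predict):
    """Brute-force next-view selection in the spirit of Hutchinson and Kak.

    `hypotheses` is a set of (object, aspect) pairs still consistent with
    what has been seen; `predict(hypothesis, viewpoint)` returns the feature
    signature we would observe at `viewpoint` if `hypothesis` were true.
    We pick the viewpoint whose observations split the hypotheses into the
    smallest expected surviving set.
    """
    n = len(hypotheses)
    best, best_expected = None, float("inf")
    for v in viewpoints:
        groups = defaultdict(int)
        for h in hypotheses:
            groups[predict(h, v)] += 1
        # A signature shared by k hypotheses occurs with probability k/n and
        # leaves k survivors, so E[survivors] = sum(k^2) / n.
        expected = sum(k * k for k in groups.values()) / n
        if expected < best_expected:
            best, best_expected = v, expected
    return best, best_expected
```

As the text notes, this brute-force enumeration over all aspects of all aspect graphs is exactly what becomes infeasible for large aspect graphs.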

Gremban and Ikeuchi [287] investigate the sensor planning phase of object recognition, and
thus their work constitutes another effort in improving the object hypothesis and verification
stage of the standard recognition pipeline (see Fig.1). Like many of the papers described in this
survey, the algorithm uses aspect graphs to determine the next sensor movement. Similarly to
[285] and [284], the authors of this paper make use of so-called congruent aspects. In a
computer vision system, aspects can be defined in various ways. The most typical way of defining
them is based on the set of visible surfaces or the presence/absence of various features.
Adjacent viewpoints over a contiguous object region, for which the features defining the
aspect remain the same, give an aspect equivalence class. In practice, however, researchers
who work with aspect graphs have noticed that the measured features can be identical over
many disparate viewpoints of the object. This makes it impossible to determine the exact
aspect viewed. These indistinguishable aspects, which share the same features, are called
congruence classes. The authors argue that any given feature set will consist of congruent
aspects and this is responsible for the fact that virtually every object recognition system uses
a unique feature set in order to improve the performance of the algorithm on that particular
domain and distinguish between the congruent aspects. Other reasons why congruent aspects
might arise include noise and occlusion. The authors argue that since congruent aspects
cannot be avoided, sensing strategies are needed to discriminate between them. Figure 11 gives an
example of the aspects of an object and its congruence classes, where the feature used to
define the aspects is the topology of the viewed surfaces in terms of the visible edges. The
authors use Ikeuchi and Kanade's aspect classification algorithm [41] to find the congruence
class corresponding to the aspect viewed by the camera. The camera motion is used to decide

the aspect that this particular class corresponds to. This is referred to as aspect resolution.
This enables the system to recognize whether the image currently viewed contains the target
object. The authors define a class-restricted observation function O(θ, φ) that returns the
congruence class currently viewed by the camera. The variable θ defines the angle of rotation
of the sensor around some axis in the object's coordinate system (the authors assume
initially that the only permissible motion is rotation around one axis) and φ denotes the
rotation of the object with respect to the world coordinate frame. An observation function
O(θ, φ) can be constructed for the object model that is to be identified in the image. The
authors discuss in the paper only how to detect instances of a single object, not how to
perform image understanding.

Figure 12: An aspect resolution tree used to determine if there is a single interval of values
of φ that satisfy certain constraints (adapted from Gremban and Ikeuchi [287]).

The authors initially position the camera at θ = 0; they
assume that the object they wish to recognize is manually positioned in front of the camera
with an appropriate pose, and estimate the congruence class that is currently viewed by
investigating the extracted features (see Figure 12). By scanning through the function O(θ, φ)
they find the set of values of φ (if any) for which O(0, φ) equals the observed congruence
class. If no values of φ satisfy this constraint, the object viewed is not the one they are
searching for. Otherwise, by using a heuristic, the authors move the camera to a new value of
θ, estimate the congruence class currently viewed by the camera, and use this new knowledge
to further constrain the values of φ satisfying the new constraint (see Figure 12). If they end
up with a single interval of values of φ that satisfy all these constraints, they have
recognized an instance of the object they are looking for. The authors can also use this
knowledge to extrapolate the aspect that the sensor is currently viewing, and thus achieve
aspect resolution. The authors describe various data
structures for extending this idea to more than a single degree of camera motion.
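The pose-constraint step above can be sketched as follows, for a discretized set of object rotations. The function name `obs_fn` stands in for the class-restricted observation function, and the dictionary of observations per sensor angle is our illustrative structure:

```python
def consistent_poses(observations, obs_fn, phis):
    """Aspect-resolution sketch after Gremban and Ikeuchi.

    `obs_fn(theta, phi)` plays the role of the class-restricted observation
    function: the congruence class seen with the sensor rotated by `theta`
    and the object rotated by `phi`. `observations` maps each tried sensor
    angle theta to the congruence class actually observed; we keep only the
    object rotations phi that explain every observation so far.
    """
    candidates = list(phis)
    for theta, observed_class in observations.items():
        candidates = [phi for phi in candidates
                      if obs_fn(theta, phi) == observed_class]
    return candidates
```

An empty result means the viewed object is not the target; recognition is declared once the surviving values of φ form a single interval.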
Dickinson et al. [49] quantify an observation that degenerate views occupy a significant
fraction of the viewing sphere surrounding an object and show how active and purposive
control of the sensor could enable such a system to escape from these degeneracies, thus
leading to more reliable recognition. A view of an object is considered degenerate if at least
one of the two conditions below holds (see Figure 13): a zero-dimensional (point-like) object
feature is collinear with both the front nodal point of the lens and either:
(a) another zero-dimensional object feature, or

(b) some point on a line (finite or infinite) defined by two zero-dimensional object features.
The paper gives various examples of when such degeneracies might occur. An example of
degeneracy is when we have two cubes such that the vertex x of one cube is touching a point
y on an edge of the other cube. If the front nodal point of the lens lies on the line defined by
points x, y the authors say that this view of the object is degenerate. Of course, in the case of
infinite camera resolution, the chances of this happening are virtually non-existent. However,
cameras have finite resolution. Therefore, the chances of degeneracies occurring are no
longer negligible.
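The finite-resolution version of this condition can be sketched as a simple angular test; the function and its parameters are our illustration of the idea, not the paper's formulation:

```python
import math

def is_degenerate_view(nodal_point, feature_a, feature_b, resolution_rad):
    """Finite-resolution view-degeneracy test (our sketch of the condition in
    Dickinson et al.): with a perfect pinhole, degeneracy requires exact
    collinearity of the nodal point and the two features, but with a finite
    angular resolution any pair whose viewing rays differ by less than one
    resolution step is effectively degenerate.
    """
    def direction(p, q):
        v = [qi - pi for pi, qi in zip(p, q)]
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    u = direction(nodal_point, feature_a)
    w = direction(nodal_point, feature_b)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, w))))
    return math.acos(dot) < resolution_rad
```

Note how a coarse angular resolution declares near-collinear features degenerate, while at the human foveal acuity of 20 seconds of arc the same configuration is resolvable, matching the observation in the text.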
The authors conduct various experiments under realistic assumptions and observe that for a
typical computer vision setup the chances of degenerate views are not negligible and can be
as high as 50%. They also tested a parameterization which partially matched the human
foveal acuity of 20 seconds of arc, and noticed that the probability of degeneracies is then
extremely small. The authors argue that this is one reason why the importance of degenerate
views in computer vision has been traditionally underestimated.

Figure 13: The two types of view degeneracies proposed by Dickinson et al. [49].
Obviously an active vision system could be of immense help in disambiguating these
degeneracies. The authors argue that if the goal is to avoid degenerate views in a viewer-
centered object representation, or to avoid making inferences from such viewpoints, the vision
system must have a mechanism for detecting degeneracies and for actively controlling the
sensor to move it out of the degeneracy. One solution to the problem of reducing the
probability of degeneracy, or of reducing the chance of having to move the camera, is to simply
change the focal length of the camera to increase the resolution in the region of interest. The analysis
performed in the paper indicates that it is important to compensate for degeneracies in
computer vision systems and also further motivates the benefits of an active approach to
vision. Intelligent solutions to the view-degeneracy problem can decrease the probability of
executing expensive and unnecessary camera movement to recognize an object. Within the
context of the recognition pipeline in Fig.1, we see that these degeneracies could potentially
affect all the modules in the pipeline, from the quality of the low-level feature extracted, to
the way the features are grouped, and to the reliability of the final object verification.
Herbin [288] presents an active recognition system whose actions can influence the external
environment (camera position) or the internal recognition system. The author assumes the
processing of segmented images, and uses the silhouettes of the objects (chess pieces) to
recognize them. The objects are encoded in aspect graphs, where each aspect contains
the views with identical singularities of the object's contour. Each view is encoded by a
constant vector indicating whether a convex point, a concave point or no extremum was

found. Three types of actions are defined: a camera movement of 5 degrees upwards, a camera
movement of 5 degrees downwards, and a switch between two different feature detection scales.
The author defines a training phase for associating an action a_t at time t with the sequence
of states up until time t. This simply learns the permissible actions for a certain object.
Standard Bayesian methods
determine whether there is high enough confidence so far on the object identity, or whether
more aspects should be learned.
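The Bayesian confidence test mentioned above can be sketched as a sequential posterior update over object identities; the dictionary structures and the confidence threshold are our illustrative assumptions:

```python
def update_identity_posterior(prior, likelihoods, observed_aspect):
    """One step of a standard sequential Bayesian identity update (a sketch
    of the 'standard Bayesian methods' referred to above). `prior[o]` is the
    current belief in object o and `likelihoods[o][aspect]` approximates
    P(aspect | o); both structures are illustrative.
    """
    posterior = {o: prior[o] * likelihoods[o].get(observed_aspect, 1e-9)
                 for o in prior}
    z = sum(posterior.values())
    return {o: p / z for o, p in posterior.items()}

def confident(posterior, threshold=0.95):
    """Stop acquiring views once one object identity dominates."""
    return max(posterior.values()) >= threshold
```

Each observed aspect encoding sharpens the posterior; the active loop keeps taking actions until `confident` fires.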
Kovacic et al. [289] present a method for planning view sequences to recognize objects.
Given a set of objects and object views, where the silhouette of each object view is
characterized by a vector of moment-based features, the feature vectors are clustered. Given a
detected silhouette, the corresponding cluster is determined. For each candidate new
viewpoint, the object vectors in the cluster are mapped onto another feature set of the same
objects but from the new viewpoint. A number of different mappings are attempted, where
each mapping depends on the next potential view, and each mapping's points are clustered.
The next view which results in the greatest number of clusters is chosen, since this will on
average lead to the quickest disambiguation of the object class. This procedure is repeated
until clusters with only one feature vector remain, at which point recognition is possible.
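The view-selection loop above can be sketched as follows; the grid-quantization "clustering" and the data structures are our crude stand-ins for the moment-feature clustering used in the paper:

```python
def next_view_by_cluster_count(candidate_views, features_at_view, grid=1.0):
    """Next-view choice in the spirit of Kovacic et al.: for each candidate
    view, map the still-ambiguous objects to their feature vectors at that
    view and count the resulting clusters; the view producing the most
    clusters disambiguates fastest on average. Clustering here is a simple
    grid quantization (our simplification of the paper's clustering step).
    """
    def cluster_count(vectors):
        return len({tuple(round(x / grid) for x in v) for v in vectors})

    return max(candidate_views,
               key=lambda v: cluster_count(features_at_view[v]))
```

The loop terminates when every cluster holds a single feature vector, at which point the object class is determined.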
Denzler and Brown [290] use a modification of mutual information to determine optimal
actions. They determine the action a_l that leads to the greatest conditional mutual information
between the object identity and the observed feature vector c.

Chart 2: Summary of the 1992-2012 papers on active object localization and recognition from
Table 6. As expected, search efficiency and the role of 3D information are significantly more
prominent in these papers (as compared to Chart 1).

Figure 14: Reconstructionist vision vs. Selective Perception, after Rimey and Brown [302].

Laporte and Arbel [291] build upon this work and choose the best next
viewpoint by calculating the symmetric KL divergence (Jeffrey divergence) of the likelihood
of the observed data given the assumption that this data resulted from two views of two

distinct objects. By weighting each Jeffrey divergence by the product of the probabilities of
observing the two competing objects and their two views, they can determine the next view
that best distinguishes the competing object identity hypotheses, thus again demonstrating the
active vision system's direct applicability in the standard recognition pipeline.
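The scoring step of this view selection can be sketched as follows; the `(prob, likelihood)` pair structure is our reading of the weighting described above, not the paper's exact formulation:

```python
import math

def jeffrey_divergence(p, q, eps=1e-12):
    """Symmetric KL (Jeffrey) divergence between two discrete likelihoods."""
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def view_score(pairs):
    """Score a candidate viewpoint, in our reading of Laporte and Arbel:
    sum, over competing (object, view) hypothesis pairs, of the Jeffrey
    divergence of their predicted observation likelihoods, weighted by the
    product of the two hypotheses' probabilities. `pairs` is a list of
    (prob_1, likelihood_1, prob_2, likelihood_2) tuples (our structure).
    """
    return sum(w1 * w2 * jeffrey_divergence(l1, l2)
               for w1, l1, w2, l2 in pairs)
```

The candidate view with the highest score is the one whose observations are expected to separate the competing hypotheses most strongly.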
Mishra and Aloimonos [292] and Mishra et al. [293] suggest that recognition algorithms
should always include an active segmentation module. By combining monocular cues with
motion or stereo, they identify the boundary edges in the scene. This supports the algorithm's
ability to trace the depth boundaries around the fixation point, which in turn can be of help
in challenging recognition problems. These two papers provide an example of a different
approach to recognition, where the intrinsic recognition module parameters are intelligently
controlled and are more tightly coupled to changes to the low-level feature cues and their
grouping in the standard recognition pipeline (see Fig.1).
Finally, Zhou et al. [294] present an interesting paper on feature selectivity. Even though the
authors present the paper as having an application to active recognition, and cite the relevant
literature, they limit their paper to the medical domain (Ultrasound) by selecting the most
likely feature(s) that would lead to accurate diagnosis. The authors present three slight
modifications to information gain and demonstrate how to choose the feature y that would
lead to maximally reducing the uncertainty in classification, given that a set of features X is
used. They perform tests to determine the strengths and weaknesses of each approach and
recommend a hybrid approach based on the presented metrics as the optimal approach to

conditional feature selection. Within the context of an active vision system, feature selection
algorithms could be used to choose the optimal next sensor action.
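One simple variant of the conditional selection criterion discussed above can be sketched as follows; the table-of-samples estimator and the function names are our assumptions, and the paper's three modified information-gain metrics are not reproduced here:

```python
import math
from collections import defaultdict

def conditional_entropy(rows, cond_idx, class_idx):
    """Estimate H(class | features at cond_idx) from a table of samples."""
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = tuple(row[i] for i in cond_idx)
        groups[key][row[class_idx]] += 1
    n = len(rows)
    h = 0.0
    for counts in groups.values():
        total = sum(counts.values())
        for c in counts.values():
            h -= (total / n) * (c / total) * math.log2(c / total)
    return h

def best_next_feature(rows, chosen, candidates, class_idx):
    """Conditional feature selection sketch: choose the candidate feature y
    that most reduces class uncertainty given the already-chosen set X,
    i.e. maximize H(C|X) - H(C|X, y).
    """
    base = conditional_entropy(rows, chosen, class_idx)
    return max(candidates,
               key=lambda y: base - conditional_entropy(rows, chosen + [y],
                                                        class_idx))
```

In an active vision setting, `best_next_feature` plays the role of choosing the next sensor action whose measurement is expected to be most informative.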
While most of the methods discussed in this section mainly show that active image
acquisition makes the problem easier, the last few papers discussed give an insight of a
general nature for object recognition, where active image acquisition is tightly coupled to the
more classical vision and recognition modules. Another general conclusion is that very few of
the papers surveyed so far take into consideration the effects of cost constraints, noise
constraints (e.g., dead-reckoning errors) or object representational power. As was
previously argued [26], taking such constraints into account is important, since they can
lead to a reassessment of proper strategies for next-view planning and recognition.

c) Active Object Localization and Recognition Literature Survey
We now present an overview of the literature on the active object localization and recognition
problems. In more recent literature, the problems are sometimes referred to under the title of
semantic object search. In Table 6 and Chart 2 we compare the algorithms discussed in this
subsection along a number of dimensions. A general conclusion one can reach is that, on
average, the scalability of inference for active object localization algorithms is worse than the
current state of the art in passive recognition (see Table 7 of Sec.4.2 for example). This is
partially attributable to the online requirements of active localization/recognition
mechanisms, which make the construction of such real-time and online systems a significant
challenge.
Notice that in contrast to the Simultaneous Localization and Mapping (SLAM) problem, in
the active object localization problem the vision system is tasked with determining an optimal
sequence of sensor movements that enables it to determine the position of an a priori
specified object as quickly as possible. In the SLAM problem, by contrast, the scene
features/objects are usually learnt/determined online during the map building process.
Notice that within the context of Sec. 1, the localization and recognition problems subsume
the detection problem, since the detection problem is a limited/constrained version of the
localization and recognition problems.

Figure 15: A PART-OF Bayes net for a table-top scenario, similar to what was proposed by
Rimey and Brown [302].
When dealing with the vision-based SLAM problem, the issue of extracting scene structure
from a moving platform and using this information to build a map of the environment
emerges. While this problem also emerges in active object localization and recognition, in
practice it is typically of secondary importance, since the main research effort in
constructing active object localization and recognition systems is focused on the creation of
the object recognition module and of the next-viewpoint selection algorithm. As was pointed
out at the beginning of Sec. 3, active object localization and
recognition research on dynamic scenes is limited, and in this regard it is less developed than
the structure from motion and SLAM literature.
For example, Ozden et al. [317] indicate that the main requirements for building a robust
dynamic structure from motion framework include:
1. constantly determining the number of independently moving objects;
2. segmenting the moving object tracks;
3. computing the object 3D structure and camera motion with sufficient accuracy;
4. resolving geometric ambiguities;
5. achieving robustness against degeneracies caused by occlusion, self-occlusion and motion;
6. scaling the system to non-trivial recording times.
It is straightforward to see that these also constitute important requirements when
constructing an active object localization and recognition system, since making a recognition

system robust to these challenges would likely require changes to all the components of the
standard recognition pipeline (see Fig.1). However, none of the active localization and
recognition systems that we will survey is capable of dealing with dynamic scenes,
demonstrating that the field is still evolving. Note that this last point differentiates active
vision research from dynamic vision research (see Sec. 3).
In the active object localization and recognition problems, any reduction in the total number
of mechanical movements involved would have a significant effect on the search time and
the commercial viability of the solution. Thus, a central tenet of the discussion in this section
involves efficient algorithms for locating objects in an environment subject to various
constraints [45, 26]. The constraints include time constraints, noise rates, and object and
scene representation lengths, amongst others. In Table 6 and Chart 2 we present a comparison,
along certain dimensions, of a number of the papers surveyed in Sec.3.2.
Rimey and Brown [302] present the TEA-1 vision system that can search within a static
image for a particular object and that can also actively control a camera if the object is not

within its field of view.

Figure 16: An IS-A Bayes tree for a table-top scenario that was used by Rimey and Brown
[302].

Within the context of Minsky's frame theory [124], which we discussed in Sec.2.7, the authors
define a knowledge representation framework that uses PART-OF, IS-A and adjacent
relationships (a form of contextual knowledge) for guiding the search. The authors [302] also
focus on the
decision making algorithms that are used to control the current focus of attention during the
search for the object. A Bayesian network is used to encode the confidences regarding the
various hypotheses. As the authors point out, a significant constraint in any vision system that
purposively controls an active sensor, such as a camera, is resource allocation and
minimization of the time-consuming camera movements.

Purposiveness is necessary in any active vision system: the system must attempt specific
tasks. Open-ended tasks, such as "randomly move the camera around the entire room until the
desired object falls in our field of view", lack the purposiveness constraint. A number of
papers [282, 285, 24] have experimentally demonstrated that random search exhibits
significantly worse reliability and localization speed than purposive search, giving further
credence to the arguments given in Rimey and Brown [302]. This approach to vision is
inspired by the apparent effect that task specification has on human eye movements. As
Yarbus demonstrated [318], human foveation patterns depend on the task at hand and the
fixated objects seem to be the ones relevant for solving a particular task. Somehow, irrelevant
features are ignored and humans do not search through the entire scene. This is exactly what
Rimey and Brown are trying to accomplish in their paper, namely, to perform sequential
actions that extract the most useful information and perform the task in the shortest period of
time. Thus, within the context of the standard recognition pipeline in Fig.1, this constitutes an
effort in improving the object hypothesis generation module. The authors provide a nice
summary of the main differences between the selective/attentive approach to vision and the
reconstructionist/non-active/non-attentive approach to vision (see Figure 14).
The authors use two different Bayesian-network-like structures for knowledge representation:
composite nets and two-nets. The composite net, as its name suggests, is composed of four
kinds of nets: PART-OF nets, IS-A trees, expected area nets and task nets (see Figures 15 and 16).
PART-OF nets are graphical models which use PART-OF relations to model the feasible
structure of the scene and the associated conditional probabilities (see Figure 15). Each node is
a Boolean variable indicating the presence or absence of a particular item. For example, a

node might represent a tabletop, its children might represent different kinds of tables, and
each kind of table might have nodes denoting the types of utensils located on the particular
table type. Expected area nets have the same structure as PART-OF nets and identify the area
in the particular scene where the object is expected to be located and the area it will take up.
These are typically represented using 2D discrete random variables representing the
probability of the object being located in a certain grid location. Also values for the height
and width of objects are typically stored in the expected area net. A relation-map is also
defined which uses the expected area net to specify the relative location probability of one
object given another object. An IS-A tree is a taxonomic hierarchy representing mutually
exclusive subset relationships of objects (see Figure 16).
For example, one path in the hierarchy might be object → table-object → bowl → black-bowl.
A task-net specifies what kind of scene information could help with solving a recognition
problem but it does not specify how to obtain that information. The two-net is a simpler
version of the composite net, and is useful for experimental analysis. The authors then define
a number of actions, such as moving the camera or applying a simple object detection algorithm.
By iteratively choosing the most appropriate action to perform, and updating the probabilities
based on the evidence provided by the actions, recognition is achieved. Each action has a cost
and profit associated with it. The cost might include the cost of moving a camera, and the
profit increases if the next action is consistent with the probability tables' likelihoods. Three
different methods for updating the probabilities are suggested. The dummy-evidence method
sets a user-specified node in the composite-nets and two-nets to a constant value, specifying
judgemental values about the node's values. The instantiate-evidence method is used when a
specific value of a random variable is observed as true.
Finally, the IS-A evidence approach uses the values output by an action to update the IS-A
nets' probabilities using the likelihood ratio for some evidence e, λ = p(e|S)/p(e|¬S), where
S and ¬S denote whether or not a specific set of nodes in the IS-A tree was detected by the action.
The costs and profits are used to define a goodness function which is used to select the best
next action. A depth-first search in the space of all action sequences is used to select the best
next action, one that would minimize the cost and lead to the most likely estimation of the
unknown object or variable. The authors perform some tests on the problem of classifying
whether a particular tabletop scene corresponds to a fancy or non-fancy meal, and present
some results on the algorithm's performance as the values of the various costs are adjusted.
The method is tested only for recognizing a single 2D scene.
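The action-selection and IS-A evidence mechanics described above can be sketched as follows. The ratio form of the goodness function and the dictionary structures are our assumptions, since the exact combination of cost and profit is a design parameter of the system:

```python
def choose_best_action(actions, goodness=lambda profit, cost: profit / cost):
    """Pick the next action by a cost/profit goodness function, a sketch of
    TEA-1's action selection (the ratio form of `goodness` is our choice).
    Each action is a dict with 'name', 'cost' and 'profit' keys (our
    structure, not the paper's).
    """
    return max(actions, key=lambda a: goodness(a["profit"], a["cost"]))

def isa_update(prior_odds, p_e_given_s, p_e_given_not_s):
    """IS-A evidence update via the likelihood ratio
    lambda = p(e|S) / p(e|not S): posterior odds = lambda * prior odds."""
    return (p_e_given_s / p_e_given_not_s) * prior_odds
```

In the full system, each chosen action produces evidence that is fed back through one of the three update methods before the next action is selected.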
Wixson and Ballard [303] present an active object localization algorithm that uses
intermediate objects to maximize the efficiency and accuracy of the recognition system (see
Figure 17). The paper was quite influential and similar ideas are explored in more recent
work [304, 319, 320]. The system by Wixson and Ballard [303] incorporates some sort of
contextual knowledge about the scene by encoding the relation between intermediate
objects. Such intermediate objects are usually easy to recognize at low resolutions and are,
thus, located quickly. Since we typically have some clues about the target object's location
relative to the intermediate object's location, we can use intermediate objects to speed up the
search for the target. The authors present a mathematical model of search efficiency that
estimates the factors which affect search efficiency, and they use these factors to improve
search efficiency.
They note that in their experiments, indirect search provided an 8-fold increase in efficiency.
As the authors indicate, the higher the resolution needed to accurately recognize an object,
the smaller the field of view of the camera has to be, because, for example, we might need
to bring the camera closer to the object. However, this forces more mechanical movements of
the camera to acquire more views of the scene, which are typically quite time consuming.

Figure 17: The direct-search model, which includes nodes that affect direct search efficiency
(unboxed nodes) and explicit model parameters (boxed nodes). Adapted from Wixson and
Ballard [303].

This indicates a characteristic trade-off in the active localization literature

that many researchers in the field have attempted to address, namely, search accuracy vs.
total search time.
In this work the authors speed up the search through the use of intermediate objects. An
example is the task of searching for a pencil by first locating a desk, since pencils are
usually located on desks. Thus, within the context of the standard recognition pipeline in
Fig.1, this constitutes an effort in improving the feature grouping and object hypothesis
generation modules, by using intermediate objects to influence the grouping probabilities and
relevant hypotheses of various objects or object parts. The authors demonstrate the smaller
number of images required to detect the pencil if the intermediate object detected was the
desk (an almost two-thirds decrease). The efficiency of a search is defined as p/T, where p is
the probability that the search finds the object and T is the expected time to do the search.
The authors model direct and indirect search. Direct search (see Figure 17) is a brute-force
search defined in terms of the random variable R denoting the number of objects detected by
the object detection algorithm over a search sequence spanning the search space, the
probability of detecting a false positive, the number of possible views V for the
intermediate object, and c_j, the average cost of each view j. Usually c_j is a constant c for all j.
The success probability of indirect search and the expected cost of direct search are
expressed in terms of a function η(k, r, V), which denotes the expected number of images that
must be examined before finding k positive responses, given that r positive responses can
occur in V images. A close look at the underlying parameters shows that the false-positive
probability and P(R) are coupled: if everything else remains constant, a greater number of
positive responses (a smaller value of P(R = 0)) causes the expected value of R to be higher,
but it also increases the false-positive probability.
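The function η(k, r, V) has a simple closed form if we assume the r positive-response images are placed uniformly at random among the V images (a negative-hypergeometric modelling assumption on our part, consistent with the definition above):

```python
def expected_images(k, r, V):
    """eta(k, r, V): expected number of images examined before finding k
    positive responses, assuming the r positives are uniformly distributed
    among the V images (our modelling assumption). Under this negative-
    hypergeometric model, E = k * (V + 1) / (r + 1).
    """
    assert 0 < k <= r <= V, "need 0 < k <= r <= V"
    return k * (V + 1) / (r + 1)
```

For instance, with r = 3 positive views among V = 10, the first positive response is expected after 11/4 = 2.75 images, which scales linearly in the number k of required responses.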
An indirect search model (see Figure 17) is defined recursively by applying a direct search
around the neighbourhood indicated by each indirectly detected object. The authors perform a
number of tests on some simple scenes using simple object detectors. One type of test they
perform, for example, is detecting plates by first detecting tables as intermediate objects. An
almost 8-fold increase in detection speed is observed. These mathematical models examine
the conditions under which spatial relationships between objects can provide more efficient
searches.

Figure 18: Junction types proposed by Malik [321] and used by Brunnstrom et al. [306] for
recognizing man-made objects.

The models and experiments demonstrate that indirect search may require fewer images/foveations and
increases the probability of detecting an object, by making it less likely that we will process
irrelevant information. As with most early research, the work is not tested on the large
datasets that more recent papers usually are tested on. Nevertheless, the results are consistent
with the results presented in [24], regarding the significant speed up of object search that is
achieved if we use a purposive search strategy, as compared to random search. We should
point out that this paper does not take into account the effects of various cost constraints and
dead-reckoning errors. Instead, it concentrates mostly on the next-view-planner, while
somewhat ignoring the possible effects of the next-view-planner's synergy with an object
detector, in terms of simulating attentional priming effects to speed up the search for the
target.
Brunnstrom et al. [305, 306] present a set of computational strategies for choosing fixation
points in a contextual and task-dependent manner. As shown in Fig. 37, a number of
junctions are specified, and a grouping strategy for these junctions is specified, where this
grouping strategy is dependent on depth discontinuities (determined by a stereo camera), and
also affects the sensor's fixation strategy (see Fig. 1). The authors present a methodology for
determining the junction type present in the image, and argue that this strategy could be quite
useful for recognizing an even larger variety of textureless objects.
Ye and Tsotsos [307] provide an early systematic study of the problem of sensor planning for
3D object search. The authors propose a sensor planning strategy for a robot that is equipped
with a pan, tilt and zoom camera. The authors show that under a particular probability
updating scheme, the problem of object search (maximizing the probability of detecting the
target with minimal cost) is NP-Complete; they thus propose a heuristic strategy for
solving it. The special case of the problem under Bayesian
updating was discussed in [45, 322]. The search agent's knowledge of object location is
encoded as a discrete probability density, and each sensing action is defined by a viewpoint, a
viewing direction, a field of view and the application of a recognition algorithm. The most
obvious approach to solving this problem is by performing a 360° pan of the scene using
wide-angle camera settings and searching for the object in this whole scene. However, this
might not work well if we are searching for a small object that is relatively far away, since
the object might be too small to detect. The authors propose a greedy heuristic approach to
solving the problem, which consists of choosing the action that maximizes the ratio of the
expected object detection probability to the expected cost of the action. Thus, within
the context of the recognition pipeline in Fig.1, this constitutes an algorithm for
hypothesizing and verifying the objects present in the scene, by adjusting the viewpoint
parameters with which the object is sensed.
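The greedy rule can be sketched as follows; the action encoding (covered cells, detector reliability, cost) and all names are our simplification of the formulation in [307], not its actual interface.

```python
def choose_next_action(actions, belief):
    """Pick the sensing action with the highest expected detection
    probability per unit cost (greedy one-step lookahead)."""
    def utility(action):
        # Probability mass of the target falling inside the field of view,
        # discounted by the detector's reliability there.
        p = sum(belief.get(cell, 0.0) for cell in action['cells'])
        return p * action['p_detect'] / action['cost']
    return max(actions, key=utility)

belief = {0: 0.1, 1: 0.6, 2: 0.3}   # discrete density over scene cells
actions = [
    {'name': 'zoom_on_1', 'cells': [1], 'p_detect': 0.9, 'cost': 1.0},
    {'name': 'wide_scan', 'cells': [0, 1, 2], 'p_detect': 0.4, 'cost': 1.0},
]
best = choose_next_action(actions, belief)  # zooming beats the wide scan here
```

Note how the narrow zoom wins despite covering less area: its reliability-weighted probability mass per unit cost (0.6 × 0.9 = 0.54) exceeds that of the unreliable wide scan (1.0 × 0.4 = 0.4).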
Minut and Mahadevan [308] present a reinforcement learning approach to next viewpoint
selection using a pan-tilt-zoom camera.

Figure 19: An ASIMO humanoid robot was used by Andreopoulos et al. [24] to actively
search an indoor environment.
They use a Markov Decision Process (MDP) and the Q-learning algorithm to determine the
next saccade given the current state, where states are defined as clusters of images

representing the same region in the environment. A simple histogram intersection using
color information is used to match an image I with a template M. If a match is found with
a low resolution version of the image, the camera zooms in and obtains a higher resolution
image and verifies the matching. If no match is found (i.e., the desired object is not found),
they use the pan-tilt unit to direct the camera to the most salient region (saliency is
determined by a symmetry operator defined in the paper) located within one of 8 subregions.
Choosing the subregion to search within is determined by the MDP and the prior contextual
knowledge it has about the room.
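The coarse matching step can be sketched as a normalized histogram intersection (in the style of Swain and Ballard); the bin counts and the acceptance threshold below are illustrative assumptions, not values from the paper.

```python
def histogram_intersection(image_hist, template_hist):
    """Normalized histogram intersection: 1.0 for identical colour
    distributions, 0.0 for disjoint ones."""
    overlap = sum(min(a, b) for a, b in zip(image_hist, template_hist))
    return overlap / float(sum(template_hist))

def matches(image_hist, template_hist, threshold=0.8):
    # A coarse low-resolution match would trigger a zoom-in for verification.
    return histogram_intersection(image_hist, template_hist) >= threshold

template = [10, 30, 60]   # colour histogram of the template M
candidate = [12, 28, 55]  # colour histogram of an image region I
```

A region passing this cheap test at low resolution is then re-imaged at higher zoom, mirroring the coarse-to-fine verification loop described above.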
Kawanishi et al. [309] use multiple pan-tilt-zoom cameras to detect known objects in 3D
environments. They demonstrate that with multiple cameras the object detection and
localization problems can become more efficient (2.5 times faster) and more accurate than
with a single camera. The system collects images under various illumination conditions,
object views, and zoom rates, which are categorized as reference images for prediction (RIP)
and verification (RIV). RIP images are small images that are discriminative for roughly
predicting the existence of the object.
RIV images are higher resolution images for verifying the existence of objects. For each
image region where a likely object was detected using the RIP images, the cameras zoom in,
pan and tilt, in order to verify whether the object was indeed located at that image region.
More recently, Ekvall et al. [310] integrated a SLAM approach with an object recognition
algorithm based on receptive-field co-occurrence histograms. Other recent algorithms
combine image saliency mechanisms with bags of feature approaches [311, 312]. Saidi et al.

[313] present an implementation, on a humanoid robot, of an active object localization
system that uses SIFT features [72] and is based on the next-view-planner described by Ye
and Tsotsos [307].
Masuzawa and Miura [314] use a robot equipped with vision and range sensors to localize
objects. The range finder is used to detect free space and vision is used to detect the objects.
The detection module is based on color histogram information and SIFT features. Color
features are used for coarse object detection, and the SIFT features are used for verification of
the candidate objects' presence. Two planning strategies are proposed. One is for the coarse
object detection and one is for the object verification. The object detection planner maximizes
a utility function for the next movement, which is based on the increase in the observed area
divided by the cost of making this movement.
The verification planner proposes a sequence of observations that minimizes the total cost
while making it possible to verify all the relevant candidate object detections. Thus, this
paper makes certain proposals for improving the object hypothesis and verification module of
the standard recognition pipeline (see Fig.1) by using a utility function to choose the optimal
next viewpoint.
Sjoo et al. [315] present an active search algorithm that uses a monocular camera with zoom
capabilities. A robot that is equipped with a camera and a range finder is used to create an
occupancy grid and a map of the relevant features present in the search environment. The
search environment consists of a number of rooms. The closest unvisited room is searched
next, where the constructed occupancy grid is used to guide the robot. For each room, a

greedy algorithm is used to select the order in which the room's viewpoints are sensed, so
that all possible object locations in the
Figure 20: An example of ASIMO pointing at an object once the target object is successfully
localized in a 3D environment [24].
map are sensed. The algorithm uses receptive field co-occurrence histograms to detect
potential objects. If potential objects are located, the sensor's zoom settings are appropriately
adjusted so that SIFT based recognition is possible.
If recognition using SIFT features is not possible, this viewpoint hypothesis is pruned (also
see Fig. 1), and the process is repeated until recognition has been attempted for all the
positions in the room where an object might be located.
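The per-room viewpoint selection can be viewed as a greedy set-cover problem. The sketch below is our abstraction of that idea, not the exact planner of [315]; viewpoint names and visibility sets are hypothetical.

```python
def plan_viewpoints(visible_from, locations):
    """Greedily order viewpoints so that every candidate object location
    is sensed; each step picks the viewpoint covering the most locations
    that are still unsensed (classic greedy set cover)."""
    uncovered = set(locations)
    order = []
    while uncovered:
        best = max(visible_from, key=lambda v: len(visible_from[v] & uncovered))
        gain = visible_from[best] & uncovered
        if not gain:
            break  # remaining locations are not visible from any viewpoint
        order.append(best)
        uncovered -= gain
    return order

room_views = {'v1': {1, 2, 3}, 'v2': {3, 4}, 'v3': {4, 5}}
plan = plan_viewpoints(room_views, {1, 2, 3, 4, 5})
```

Greedy set cover is not optimal in general, but it guarantees coverage of all candidate locations whenever coverage is possible, which matches the stated goal of sensing every possible object position in the room.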
Ma et al. [316] use a two-wheeled non-holonomic robot with an actuated stereo camera
mounted on a pan-tilt unit, to search for 3D objects in an indoor environment. A global search
based on color histograms is used to perform coarse search, somewhat similar in spirit to the
idea of indirect search by Wixson and Ballard [303] which we previously discussed.
Subsequently, a more refined search (based on SIFT features and a stereo depth extraction

algorithm) is used in order to determine the object's actual position and pose. An Extended
Kalman Filter is used for sustained tracking and the A* graph search is used for navigation.
Andreopoulos et al. [24] present an implementation of an online active object localization
system, using an ASIMO humanoid robot developed by Honda (see Figs. 38, 39). A
normalized metric for target uniqueness, both within a single image and across multiple
images of the scene captured from different viewpoints, is introduced. This metric provides
a robust probability updating methodology. The paper makes certain proposals for building
more robust active visual search systems under the presence of various errors. Imperfect
disparity estimates, an imperfect recognition algorithm, and dead-reckoning errors, place
certain constraints on the conditions chosen for determining when the object of interest has
been successfully localized. A combination of multiple-view recognition and single view
recognition approaches is used to achieve robust and real-time object search in an indoor
environment. A hierarchical object recognition architecture, inspired by human vision, is used
[218]. The object training is done by in-hand demonstration and the system is extensively
tested on over four-hundred test scenarios. The paper demonstrates the feasibility of using
state of the art vision-based robotic systems for efficient and reliable object localization in an
indoor 3D environment. This constitutes an example of a neuromorphic vision system applied
to robotics, due to the use of (i) a humanoid robot that emulates human locomotion, (ii) the
use of a hierarchical feed-forward recognition system inspired by human vision, and (iii) the
use of a next-view planner that shares many of the behavioural properties of the ideal
searcher [323]. Within the context of the recognition pipeline in Fig.1, this constitutes a
proposal for hypothesizing and verifying the objects present in the scene (by adjusting the

viewpoint parameters with which the object is sensed) and for extracting and grouping low-
level features more reliably based on contextual knowledge about the relative object scale.
As previously indicated, on average, the scalability of inference for active object localization
algorithms is worse than the current state of the art in passive recognition. This is partially
attributable to the online requirements of active localization/recognition mechanisms, which
make the construction of such real-time and online systems a significant challenge.
Furthermore, powerful vision systems implemented on current popular CPU architectures are
extremely expensive power-wise. This makes it difficult to achieve the much coveted
mobility threshold that is often a necessary requirement of active object localization systems.

Figure 21: The twenty object classes that the 2011 PASCAL dataset contains. Some of the
earlier versions of the PASCAL dataset only used subsets of these object classes. Adapted
from [324].


In this section we present a number of case studies that exemplify the main characteristics of
algorithms that have been proven capable of addressing various facets of the recognition
problem. Based on this exposition we also provide a brief discussion as to where the field
appears to be headed.

a) Datasets and Evaluation Techniques
Early object recognition systems were for the most part tested on a handful of images. With
the exception of industrial inspection related systems, basic research related publications
tended to focus on the exposition of novel recognition algorithms, with a lesser focus on
actually quantifying the performance of these algorithms. More recently, however, large
annotated datasets of images containing a significant number of object classes, have become
readily available, precipitating the use of more quantitative methodologies for evaluating
recognition systems. Everingham et al. [324] overview the PASCAL challenge dataset, which
is updated annually (see Fig. 40). Other popular datasets for testing the performance of
object/scene classification and object localization algorithms include the Caltech-101 and
Caltech-256 datasets (Fei-Fei et al. [325], Griffin et al. [326]), Flickr groups, the
TRECVID dataset (Smeaton et al. [327]), the MediaMill challenge (Snoek et al. [328]), the
Lotus-Hill dataset (Yao et al. [329]), the ImageCLEF dataset (Sanderson et al. [330]), the

COIL-100 dataset (Nene et al. [331]), the ETH-80 dataset (Leibe and Schiele [332]), the
Xerox7 dataset (Willamowski et al. [333]), the KTH action dataset (Laptev and Lindeberg
[334]), the INRIA person dataset (Dalal and Triggs [335]), the Graz dataset (Opelt et al.
[336]), the LabelMe dataset (Russell et al. [337]), the TinyImages dataset (Torralba et al.
[338]), the ImageNet dataset (Deng et al. [339]), and the Stanford action dataset (Yao et al.
[340]). Notice that such offline datasets have almost exclusively been applied to passive
recognition algorithms, since active vision systems cannot be easily tested using offline
batches of datasets. Testing an active vision system using offline datasets would require an
inordinate number of images that sample the entire search space under all possible intrinsic
and extrinsic sensor and algorithm parameters. Typically, such systems are initially tested
using simple simulations, followed by a significant amount of time that is spent field testing
the system.
A number of metrics are commonly used to provide succinct descriptors of system
performance. Receiver Operating Characteristic (ROC) curves are often used to visualize the
true positive rate versus the false positive rate of an object detector (see Sec. 1) as a class
label threshold is changed, assuming of course that the algorithm uses such a threshold (note
that sometimes in the literature the false positive rate is also referred to as the false accept
rate, and the false negative rate is referred to as the false reject rate). In certain cases
Detection Error Tradeoff (DET) curves are used to provide a better visualization of an
algorithm's performance [341], especially when small probabilities are involved. The equal
error rate (EER) corresponds to the false positive value FP achieved when the corresponding
ROC curve point maps to a true positive value TP that satisfies FP = 1 − TP. This metric is

convenient as it provides
a single value of algorithm quality (a lower EER value indicates a better detector). The area
under the curve of an ROC curve is also often used as a metric of algorithm quality. The use
of the average precision (AP) metric in the more recent instantiations of the PASCAL
challenge has also gained acceptance [324, 342]: The average precision (AP) is defined as

AP = (1/|R|) Σ_{k=1}^{n} (|R ∩ Mk| / k) · 1[ik ∈ R],

where R is the set of positive examples in the validation or test set (and |R| its cardinality),
and Mk = {i1, ..., ik} is the list of the top k best performing test set samples. Standard tests of
statistical significance (e.g., t-tests, ANOVA tests, Wilcoxon rank-sum tests, Friedman tests)
are sometimes used when comparing the performance of two or more algorithms which
output continuous values (e.g., comparing the percentage of overlap between the automated
object localization/segmentation with the ground-truth segmentation). See [343, 344, 345] for
a discussion on good strategies for annotating datasets and evaluating recognition algorithms.
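The AP definition above can be computed directly from a ranked list of test samples; the sketch below follows the formula as stated (the mean of precision-at-k over the ranks k at which a positive is retrieved), with function and variable names of our choosing.

```python
def average_precision(ranked, positives):
    """AP = (1/|R|) * sum, over ranks k that retrieve a positive,
    of the precision-at-k value |R ∩ Mk| / k."""
    positives = set(positives)
    hits = 0
    total = 0.0
    for k, sample in enumerate(ranked, start=1):
        if sample in positives:
            hits += 1
            total += hits / k   # precision at rank k
    return total / len(positives)
```

For example, a ranking that places the two positives at ranks 1 and 3 out of three samples scores (1/1 + 2/3)/2 ≈ 0.833, while a perfect ranking scores 1.0.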
Our discussion on evaluation techniques for recognition algorithms would be incomplete
without the presentation of the criticism associated with the use of such datasets. Such
criticism is sometimes encountered in the literature or in conferences on vision research (see
[193, 73, 194, 346] for example). In other words, the question arises as to how good

indicators these datasets and their associated tests are for determining whether progress is
being made in the field of object recognition. One argument is that the current state-of-the-art
algorithms in object recognition identify correlations in images, and are unable to determine
true causality, leading to fragile recognition systems. An example of this problem arose in
early research on neural networks, where the task was to train a neural network to determine
the presence or absence of a certain vehicle type in images.
The neural network was initially capable of reliably detecting the objects of interest from the
images of the original dataset. However, on a new validation dataset of images, the
performance dropped drastically. On careful examination it was determined that in the
original dataset, the images containing the object of interest had on average a higher intensity.
During training, the neural network learned to decide whether the object was present or
absent from the image, by calculating this average image intensity and thresholding this
intensity value. It is evident that in the original dataset there existed a correlation between
average image intensity and the object presence. However in the new dataset this correlation
was no longer present, making the recognition system unable to generalize in this new
situation that the human visual system is capable of addressing almost effortlessly. It has
been argued that only correlation can be perceived from experience, and determining true
causality is an impossibility. In medical research the mitigation of such problems is often
accomplished through the use of control groups and the inclusion of placebo groups, which
allow the scientist to test the effect of a particular drug by also testing the effect of the drug
under an approximation of a counter-factual state of the world. However, as experience has
shown, and as is often the case in computer vision research, the results of such controlled

experiments, whose conclusions ultimately rely on correlations, are often wrong. Ioannidis
[347] analyses the problem, and provides a number of suggestions as to why this
phenomenon occurs, which we quote below:
- The smaller the case studies, the more likely the findings are false.

- The smaller the effect sizes in a research field, the less likely the research findings are true.
For example, a study of the impact of smoking on cardiovascular disease will more likely lead
to correct results than an epidemiological study that targets a small minority of the population.

- The greater the number and the lesser the selection of tested relationships in a scientific
field, the less likely the research findings are to be true. As a result, confirmatory designs
such as large controlled trials are more likely to be true than the results of initial hypothesis-
generating experiments.

- The greater the flexibility in designs, definitions, outcomes and analytical models in a
scientific field, the less likely the research findings are to be true. For example, flexibility
increases the potential of turning negative results into positive results. Similarly, fields
that use stereotyped and commonly agreed analytical methods typically result in a larger
proportion of true findings.

- The greater the financial and other interests and prejudices in a scientific field, the less
likely the research findings are to be true. As empirical evidence shows, expert opinion is
extremely unreliable.

- The hotter a scientific field, with more scientific teams involved, the less likely the research
findings are true.
The fact that usually only positive results supporting a particular hypothesis are submitted
for publication, while negative results not supporting a particular hypothesis are often not
submitted for publication, can make it more difficult to understand the limitations of many
methodologies [347]. Despite these potential limitations, the hard reality is that the use of
datasets currently constitutes the most reliable means of testing recognition algorithms. As
Pinto et al. [193] indicate, an improvement in evaluation methodologies might entail
simulating environments and testing recognition systems on these environments. But of
course creating environments that are acceptable to the vision community and which are
sufficiently realistic is a challenging task. As argued in [73], typically offline datasets are
pre-screened for good quality in order to eliminate images with saturation effects, poor
contrast, or significant noise. Thus, this pre-screening introduces an implicit bias in the
imaging conditions of such datasets. In the case of active and dynamic vision systems, which
typically sense an environment from a greater number of viewpoints and under more
challenging imaging conditions, it becomes more difficult to predict the performance of a
vision system by using exclusively such datasets.


b) Sampling the Current State-of-the-Art in the Recognition Literature
A survey on the object recognition literature that does not attempt to determine what the
state-of-the-art is in terms of performance would be incomplete. To this end, we present in
some detail some of the algorithms for which there is some consensus in the community as
belonging to the top-tier of algorithms that reliably address the object
detection, localization and recognition problems (see Sec. 1). In Chart 9 and Table 7 we
present a comparison, along certain dimensions, for a number of the papers that will be
surveyed in Sec.4.2. For the reasons earlier elaborated upon, determining the best performing
algorithms remains a difficult problem. In the active and dynamic vision literature there does
not currently exist a standardized methodology for evaluating the systems in

Chart 3: Summary of the PASCAL Challenge papers from Table 7 which correspond to
algorithms published between 2002-2011. Notice that the winning PASCAL challenge
algorithms typically make little use of function, context, and 3D, and make moderate use of
texture.

Figure 22: The HOG detector of Dalal and Triggs (from [335] with permission). (a): The
average gradient image over a set of registered training images. (b), (c): Each pixel
demonstrates the maximum and minimum (respectively) SVM weight of the corresponding
block. (d): The test image used in the rest of the subfigures. (e): The computed R-HOG
descriptor of the image in subfigure (d). (f),(g): The R-HOG descriptor weighed by the
positive and negative SVM weights respectively.
terms of their performance and search efficiency. However, sporadically, there have been
certain competitions (such as the semantic robot vision challenge) attempting to address these
questions. Arguably the most popular competition for evaluating passive recognition
algorithms is the annual PASCAL challenge. We thus focus our discussion in this section on
presenting in some detail the general approaches taken by some of the best performing
algorithms in the annual PASCAL challenge for classifying and localizing the objects present
in images. In general, good performance on the PASCAL datasets is a necessary condition for
a solution to the recognition problem, but it is not a sufficient condition. In other words, good
performance on a dataset does not guarantee that we have found a solution, but it can be
used as a hint, or a simple guiding principle, for the construction of vision systems, which is

why we focus on these datasets in this section. For each annual PASCAL challenge, we
discuss some of the best performing algorithms and discuss the reasons as to why the
approaches from each year were able to achieve improved performance. These annual
improvements are always characterized within the general setting described in Fig.1.
From Table 7 and Chart 9 we notice that the top-ranked PASCAL systems make very little
use of 3D object representations. In modern work, 3D is mostly used within the context of
robotics and active vision systems (see Tables 5-6). In general, image
categorization/classification algorithms (which indicate whether an image contains an
instance of a particular object class), are significantly more reliable than object localization
algorithms whose task is to localize (or segment) in an image all instances of the object of
interest. Good localization performance has been achieved for restricted object classes: in
general there still does not exist an object localization algorithm that can consistently and
reliably localize arbitrary object classes. As Chum and Zisserman [365] indicate, image

Figure 23: Examples of the Harris-Laplace detector and the Laplacian detector, which were
used extensively in [142] as interest-point/region detectors (figure reproduced from [142]
with permission).

classification algorithms have achieved significant improvements since early 2000, and this is
in general attributed to the advent of powerful classifiers and feature extraction techniques.

i) Pascal 2005
We now briefly discuss some of the best performing approaches tested during the 2005
Pascal challenge for the image classification and object localization problems (see Fig. 41).
This is not meant to be an exhaustive listing of the relevant approaches, but rather to provide
a sample of some relatively successful approaches tested over the years. 2005 was the first
year of the PASCAL Visual Object Challenge. One of the best performing approaches was
presented by Leibe et al. [199], which we also overviewed in Sec. 2.9.
Dalal and Triggs [335] tested their Histogram of Oriented Gradient (HOG) descriptors in this
challenge. In their original paper, Dalal and Triggs focused on the pedestrian localization
problem, but over the years HOG-based approaches have become quite popular, and
constitute some of the most popular descriptors in the object recognition literature. See Fig.
42 for an overview of the pipeline proposed by Dalal and Triggs. The authors' experiments
suggest that the other best-performing keypoint-based approaches have false positive rates
that are at least 1-2 orders of magnitude greater than their presented HOG dense grid
approach for human detection. As the authors indicate, the fine orientation sampling and the
strong photometric normalization used by their approach constitute the best strategy for
improving the performance of pedestrian detectors, because they accommodate limbs and body
segments that change their position and appearance (see Fig. 43). The authors evaluated
numerous pixel colour representations such as greyscale, RGB and LAB colour spaces, with
and without gamma equalization. The authors also tested various approaches for evaluating
gradients, and based on their results the simplest scheme, which relied on point derivatives
with Gaussian smoothing, gave the best results. The main constituent component of the HOG
representation is the orientation binning with normalization that is applied to various
descriptor blocks/cells. The cells tested are both rectangular and radial. Orientation
votes/histograms are accumulated in each one of those cells. The orientation bins tested are
both unsigned (0-180 degrees) and signed (0-360 degrees). The authors choose to use 9
orientation bins since more bins only lead to marginal improvements at best. Furthermore, the
authors note that the use of signed orientations decreases performance. The authors also
tested various normalization schemes, which mostly entail dividing the cell histograms by the
orientation energy present in a local neighborhood. The above-described combinations for
constructing histograms of orientation were then used in conjunction with linear and non-
linear SVMs, achieving state-of-the art performance for pedestrian detection. Note, however,
that the system was tested on images where the size of the pedestrian's projection on the
image was significant. A final observation that the authors make is that any significant
amount of smoothing before gradient calculation degrades the system performance,
demonstrating that the most important discriminative information is from sudden changes in
the image at fine scales.
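The orientation-binning step at the heart of the HOG cell descriptor can be sketched as follows. The 9-bin unsigned configuration mirrors the choices described above, but the function names and the toy gradient list are ours.

```python
import math

def orientation_histogram(gradients, num_bins=9, signed=False):
    """Accumulate (magnitude, angle-in-degrees) gradient votes into
    orientation bins for one HOG cell; unsigned mode folds angles
    into [0, 180), the option preferred by Dalal and Triggs."""
    span = 360.0 if signed else 180.0
    hist = [0.0] * num_bins
    for magnitude, angle in gradients:
        folded = angle % span
        hist[int(folded / span * num_bins) % num_bins] += magnitude
    return hist

def l2_normalize(hist, eps=1e-6):
    # Block normalization: divide by the local orientation energy.
    norm = math.sqrt(sum(v * v for v in hist)) + eps
    return [v / norm for v in hist]

# In unsigned mode, a gradient at 190° folds onto 10°: both votes share a bin.
cell = orientation_histogram([(1.0, 10.0), (2.0, 190.0)])
```

In a full pipeline, the normalized histograms of the cells in each block would then be concatenated into the descriptor vector fed to the SVM.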
Zhang et al. [142] discuss a number of local-image-feature extraction techniques for texture
and object category classification. In conjunction with powerful discriminative classifiers,

these approaches have led to top-tier performance in the VOC2005, VOC2006 and
VOC2007 competitions. Their work is mostly focused on the problem of classifying an image
as containing an instance of a particular object, and is not as much focused on the object
localization problem. As we discussed earlier, and as we will discuss in more detail later in this
section, a good classifier does not necessarily lead to a good solution to the object
localization problem. This is due to the fact that simple brute force sliding-window
approaches to the object localization problem are extremely slow, due to the need to
Figure 24: The distributions of various object classes corresponding to six feature classes.
These results were generated by the self-organizing-map

algorithm used in the PicSOM framework [263]. Darker map regions represent SOM areas
where images of the respective object class have been densely mapped based on the
respective feature (from [263] with permission). evaluate a classifier at all possible positions,
scales, and aspect ratios of a bounding-box for the object position.
As Zhang et al. [142] indicate, in the texture recognition problem local features play the role
of frequently repeated elements, while in the object recognition problem, these local features
play the role of words which are often powerful predictors of a certain object class. The
authors show that using a combination of multiple interest-point detectors and descriptors
usually achieves much better results than the use of a single interest-point detector/descriptor
pair achieves. They also reach the conclusion that using local features/descriptors with the
highest possible degree of invariance, does not necessarily lead to the optimal performance.
As a result, they suggest that when designing recognition algorithms, only the minimum
necessary degree of feature invariance should be used. The authors note that many popular
approaches make use of both foreground and background features. They argue that the use of
background features could often be seen as a means of providing contextual information for
recognition. However, as the authors discover during their evaluation, such background
features tend to aid when dealing with easy datasets, while for more challenging datasets,
the use of both foreground and background features does not improve the recognition performance.
Zhang et al. [142] use affine-invariant versions of two interest point detectors: the Harris-
Laplace detector [102] which responds to corners, and the Laplace detector [372] which

responds to blob regions (see Fig. 44). These elliptical regions are normalized into circular
regions from which descriptors are subsequently extracted. The authors also test these
interest-point detectors using scale invariance only, using scale with rotation invariance, and
by using affine invariance. As descriptors, the authors investigated the use of SIFT, SPIN,
and RIFT descriptors [373, 374].
The SIFT descriptor was discussed in Sec. 2.9. The SPIN descriptor is a two dimensional
rotation invariant histogram of intensities in a neighborhood surrounding an interest-point,
where each histogram cell (d, i) corresponds to the distance d from the center of the region
and the weight of intensity value i at that distance. The RIFT descriptor is similar to SIFT and
SPIN, where rotation invariant histograms of orientation are created for a number of
concentric circular regions centered at each interest point. The descriptors are made invariant
to affine changes in illumination, by assuming pixel intensity transformations of the form
aI(x) + b at pixel x, and by normalizing those regions with respect to the mean and standard
deviation. The authors use various combinations of interest-point detectors, descriptors,


Figure 25: Example of the algorithm by Felzenszwalb et al. [366] localizing a person using
the coarse template representation and the higher resolution subpart templates of the person
(from [366] with permission).
and classifiers to determine the best performing combination. Given training and test images,
the authors create a more compact representation of the extracted image features by
clustering the descriptors in each image to discover its signature {(p1, u1), ...,(pm, um)},
where m is the number of clusters discovered by a clustering algorithm, pi is the cluster's
center and ui is the fraction of image descriptors present in that cluster. The authors discover
that signatures of length 20-40 tend to provide the best results. The Earth Mover's Distance
(EMD) [375] is used to define a distance D(S1, S2) between two signatures S1 and S2. The
authors also consider the use of mined vocabularies/words from training sets of images,
corresponding to clusters of common features. Two histograms S1 = (u1, ..., um) and S2 =
(w1, ..., wm) of such words can be compared to determine if a given image belongs to a
particular object class.
The authors use the χ² distance to compare two such histograms:

χ²(S1, S2) = (1/2) Σi (ui − wi)² / (ui + wi).

Image classification is tested on SVMs with linear, quadratic, Radial-Basis-Function, χ² and
EMD kernels, where the χ² and EMD kernels are given by

K(S1, S2) = exp(−D(S1, S2) / A),

where D(·, ·) can represent the EMD or χ² distance and A is a normalization constant. The
bias term of the SVM decision function is varied to obtain ROC curves of the various tests
performed. The system is evaluated on texture and object datasets. As we have already
indicated, the authors discover that greater affine invariance does not necessarily help
improve the system performance. The Laplacian detector tends to extract four to five times
more regions per image than the Harris-Laplace detector, leading to better performance in the
image categorization task; overall, the combination of Harris-Laplace and Laplacian
detectors with SIFT and SPIN descriptors performs best. Both the EMD and χ² kernels seem to provide
good and comparable performance. Furthermore, the authors notice that randomly
varying/shuffling the backgrounds during training, results in more robust classifiers.
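The χ² histogram comparison and the exponential kernel built on it can be sketched as follows; the normalization constant A is typically set to the mean distance over the training set, and the small eps guard is an implementation detail, not from the paper:

```python
import numpy as np

def chi2_distance(u, w, eps=1e-10):
    # chi^2(S1, S2) = 1/2 * sum_i (u_i - w_i)^2 / (u_i + w_i)
    u, w = np.asarray(u, float), np.asarray(w, float)
    return 0.5 * np.sum((u - w) ** 2 / (u + w + eps))

def chi2_kernel(u, w, A):
    # K(S1, S2) = exp(-(1/A) * D(S1, S2)), with D the chi^2 distance here;
    # the EMD kernel has the same form with D replaced by the EMD
    return np.exp(-chi2_distance(u, w) / A)
```

The distance is symmetric and zero for identical histograms, so the kernel evaluates to 1 on identical inputs and decays toward 0 as the histograms diverge.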
Within the context of Fig.1 (i.e., the feature-extraction → feature-grouping → object-
hypotheses → object-verification → object-recognition pipeline), we see that the best
performing systems of PASCAL 2005 demonstrate how careful pre-processing during
the low-level feature extraction phase makes a significant difference in system reliability.
Small issues, such as the number of orientation bins, the number of scales, or whether to
normalize the respective histograms, make a significant difference in system performance.
This demonstrates the importance of carefully studying the feature-processing strategies
adopted by the winning systems. One could argue that vision systems should not be as
sensitive to these parameters. However, the fact remains that current state-of-the-art systems
have not reached the level of maturity that would make them robust against such variations in
the low-level parameters. Another observation with respect to Fig.1 is that the object
representations of the winning systems in PASCAL 2005 were, for the most part, flat and
made little use of the object hierarchies whose importance we have emphasized in this work.

Figure 26: The HOG feature pyramid used in [366], showing the coarse root-level template
and the higher resolution templates of the person's subparts (from [366] with permission).
As we will see, in more recent work, winning systems have made greater use of such
hierarchies. Finally, while one could argue that Leibe et al. [199] made use of a generative
object hypothesis and verification phase, in general the winning algorithms of PASCAL 2005
were discriminative, and did not make use of sophisticated modules for implementing
the full pipeline of Fig.1.
ii) Pascal 2006
In addition to the previously described methodologies, a combination of the approaches
described in [263], [264] was proven successful for many of the object classes tested in
VOC2006 (see Fig. 41). The presented algorithm ([263], [264]) is used both for the VOC
challenges image classification task as well as for the object localization task. In testing their

algorithm for the object localization task, the authors consider an object as successfully
localized if a0 > 0.5, where

a0 = area(Bgt ∩ Bp) / area(Bgt ∪ Bp)

and Bgt, Bp denote the ground truth and localized image regions. The object classification
and localization system tested relies to a large extent on the PicSOM framework for creating
self-organizing maps (see Fig. 45). The authors in [264] take advantage of the topology
preserving nature of the SOM mapping to achieve an image's classification by determining
the distance of the image's representation on the grid from positive and negative examples of
the respective object class hypothesis. For the classification task a greedy sequential forward
search is performed to enlarge the set of features used in determining the distance metric,
until the classification performance stops increasing on the test dataset. The feature
descriptors used include many of the descriptors used in the MPEG-7 standard as well as
some non-standard descriptors. The authors experimented with using numerous color
descriptors. These include, for example, color histograms in HSV and HMMD color spaces
and their moments, as well as color layout descriptors, where the image is split into non-
overlapping blocks and the dominant colors in YCbCr space are determined for each block
(the corresponding discrete cosine transform coefficients are used as the final descriptors).
Furthermore, Fourier descriptors of segment contours are used as features, as well as
histograms and co-occurrence matrices of Sobel edge directions. The object localization
algorithm relies to a large extent on the use of a simple greedy hierarchical segmentation

algorithm that merges regions with high similarity. These regions are provided as input to the
classifier, which in turn enables the object localization.
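For axis-aligned bounding boxes the a0 localization criterion above reduces to the familiar intersection-over-union computation; a minimal sketch:

```python
def overlap_score(b_gt, b_p):
    """a0 = area(Bgt intersect Bp) / area(Bgt union Bp) for axis-aligned
    boxes given as (x1, y1, x2, y2); localization counts as correct
    when a0 > 0.5."""
    # intersection rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(b_gt[0], b_p[0]), max(b_gt[1], b_p[1])
    ix2, iy2 = min(b_gt[2], b_p[2]), min(b_gt[3], b_p[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(b_gt) + area(b_p) - inter
    return inter / union
```

For example, two unit-height boxes overlapping over half their width share an intersection of 2 against a union of 6, giving a0 = 1/3, which would not count as a successful localization.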
Thus, within the context of Fig.1 we see that during PASCAL 2006, and as compared to
PASCAL 2005, one of the winning systems evolved by making use of a significantly greater
number of low level features. Furthermore, the use of a self organizing map by Viitaniemi
and Laaksonen [264] demonstrated that the proper grouping and representation of these
features plays a conspicuous role in the best performing algorithms.

Felzenszwalb et al. [366] (see Figs. 25, 26) score a HOG pyramid feature vector x as

fβ(x) = max_{z ∈ Z(x)} β · Φ(x, z),

where β · Φ(x, z) is the score of positioning the object representation according to deformation
z, and Z(x) denotes all possible deformations of the object representation. Given a training
dataset D = (⟨x1, y1⟩, ..., ⟨xn, yn⟩) (where xi denotes the i-th HOG pyramid vector and yi ∈
{−1, +1} denotes a label), the authors attempt to find the optimal vector β*(D), which is
defined as

β*(D) = argmin_β λ‖β‖² + Σi max(0, 1 − yi fβ(xi)).
Notice, however, that due to the existence of positively labelled examples (yi = +1), this is not a
convex optimization problem. As a result the authors execute the following loop a number of
times: (i) Keep β fixed and find the optimal latent variable zi for each positive example. (ii)

Then, holding the latent variables of the positive examples constant, optimize β by solving the
corresponding convex problem. The authors try to ignore the easy negative training
examples, since these examples are not necessary to achieve good performance. During the
initial stage of the training process, a simple SVM is trained for only the root filter. The
optimal position of this filter is then discovered in each training image. Since the training data
only contains a bounding box of the entire object and does not specify the subpart-positions,
during training the subparts are initialized by finding high-energy subsets of the root filter's
bounding box. This results in a new training dataset that specifies object subpart positions.
This dataset is iteratively solved using the methodologies above in order to find the filter
representations for the entire object and its subparts. The authors decide to use six subparts
since this leads to the best performance.
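The alternation described above can be illustrated with a toy subgradient version; the actual solver in [366] differs, and `lam` and `lr` are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def latent_svm_epoch(beta, examples, lam=0.01, lr=0.1):
    """One pass of the alternation: for each example, fix the latent
    deformation z at the argmax of the current score (step i), then take
    a subgradient step on the resulting convex hinge objective (step ii).
    `examples` is a list of (feats, y), where feats holds one feature
    vector Phi(x, z) per candidate deformation z and y is the +/-1 label."""
    for feats, y in examples:
        # f_beta(x) = max_z beta . Phi(x, z)
        phi = feats[int(np.argmax(feats @ beta))]
        if y * (beta @ phi) < 1:          # hinge loss is active
            beta = beta - lr * (lam * beta - y * phi)
        else:                             # only the regularizer contributes
            beta = beta - lr * lam * beta
    return beta
```

Repeating the epoch drives the score of the best placement on positives above that of any placement on negatives, which is the behavior the full solver achieves at scale.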
Perronnin and Dance [364] use the Fisher kernel for image categorization. The authors
extract a gradient vector from a generative probability model of the extracted image features (local
SIFT and RGB statistics). These gradient vectors are then used in a discriminative classifier.
An SVM and a logistic regression classifier with a Laplacian prior are tested; both
perform similarly. The authors indicate that historically, even on databases containing very
few object classes, the best performance is achieved when using large vocabularies with
hundreds or thousands of visual words. However, the use of such high-dimensional histogram
computations can have a high associated computational cost. Often the vocabularies extracted
from a training image dataset are not universal, since they tend to be tailored to the particular
object categories being learnt. The authors indicate that an important goal in vision research
is to discover truly universal vocabularies, as we already discussed in Sec. 2. However, the

lack of significant progress on this problem has caused some researchers to abandon this idea.
In more detail, given a set of visual words X = {x1, x2, ..., xT } extracted from an image, a
probability distribution function p(X|λ) with parameters λ is calculated. In

Figure 27: The distribution of edges and appearance patches of certain car model training
images used by Chum and Zisserman [365], with the learned regions of interest overlaid
(from [365], with permission).
practice, this pdf is modelled as a Gaussian Mixture Model. Given the Fisher information
matrix Fλ = EX[∇λ log p(X|λ) ∇λ log p(X|λ)ᵀ],
the authors obtain the corresponding normalized gradient vectors Fλ^(−1/2) ∇λ log p(X|λ). The
authors derive analytical expressions for these gradients with respect to the mean, variance
and weight associated with each one of the Gaussians in the mixture that model this
probability. These gradients were used to train powerful classifiers, which provided state-of-
the-art image classification performance on the Pascal datasets.
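A sketch of the Fisher-vector gradient with respect to the GMM means, using the diagonal closed-form normalization common in this literature (the full method also includes gradients with respect to the variances and mixture weights; the function name is illustrative):

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Gradient of (1/T) log p(X|lambda) w.r.t. the GMM means, with the
    diagonal closed-form Fisher normalization:
    G_k = (1 / (T * sqrt(w_k))) * sum_t gamma_t(k) (x_t - mu_k) / sigma_k."""
    T = len(X)
    # log-probability of each descriptor under each diagonal Gaussian
    log_p = []
    for w, mu, sig in zip(weights, means, sigmas):
        log_p.append(np.log(w)
                     - 0.5 * np.sum(((X - mu) / sig) ** 2, axis=1)
                     - np.sum(np.log(sig))
                     - 0.5 * len(mu) * np.log(2 * np.pi))
    log_p = np.array(log_p)                      # shape (K, T)
    # soft assignments gamma_t(k), computed stably in the log domain
    gamma = np.exp(log_p - log_p.max(0))
    gamma /= gamma.sum(0)
    G = [(gamma[k][:, None] * (X - means[k]) / sigmas[k]).sum(0)
         / (T * np.sqrt(weights[k]))
         for k in range(len(weights))]
    return np.concatenate(G)
```

As a sanity check, for a single Gaussian centered at the empirical mean of the descriptors the gradient vanishes, since the data cannot move the mean any further.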

Viitaniemi and Laaksonen [265] give an overview of a general approach for image classification, object
localization, and object segmentation. The methodology relies on the fusion of multiple
classifiers. The authors report the slightly counter-intuitive observation that while their
approach provides the best performing segmentation results, and some of the best image
classification results, the approach is unable to provide the best object localization results.
van de Weijer and Schmid [368] expand local feature descriptors by appending to the
respective feature vectors photometric invariant color descriptors. These descriptors were
tested during the 2007 Pascal competition. The authors survey some popular photometric
invariants and test the effects they have on recognition performance. It is demonstrated that
for images where color is a highly discriminative feature, such color invariants can be quite
useful. However, there is no single color descriptor that consistently gives good results. In
other words, the optimal color descriptor to use is application dependent.
Chum and Zisserman [365] introduced a model for learning and generating a region of
interest around instances of the object, given labelled and unsegmented training images. The
algorithm achieves good localization performance in various PASCAL challenges it was
tested on. In other words, the algorithm is given as input only images of the object class in
question, with no further information on the position, scale or orientation of the object in the
image. From this data, an object representation is learnt that is used to localize instances of the object
of interest. Given an input or training set of images, a hierarchical spatial pyramidal
histogram of edges is created. Also a set of highly discriminative words is learned from a

set of mined appearance patches (see Fig. 48). A cost function that is the sum of the distances
between all pairs of training examples is used to automatically learn the object position from
an input image. The cost function takes into account the distances between the discriminative
words and the edge histograms.
A similar procedure, with a number of heuristics, is used to measure the similarity between
two images and localize any instances of the target object in an image.
Ferrari et al. [367] present a family of translation and scale-invariant feature descriptors
composed of chains of k-connected approximately straight contours, referred to as kAS. See
Fig. 49 for examples of kAS for k = 2. It is shown that for kAS of intermediate complexity,
these fragments have significant repeatability and provide a simple framework for simulating
certain perceptual grouping characteristics of the human visual system (see Fig. 7). The
authors show that kAS substantially outperform interest points for detecting shape-based
classes. Given a vocabulary of kAS, an input image is split into cells, and a histogram of the
kAS present in each cell is calculated. An SVM is then used to classify the object present in

an image, by using a multiscale sliding window approach to extract the candidate windows.

Figure 28: The 35 most frequent 2AS constructed from 10 outdoor images (from [367] with permission).
Compared to the Pascal competitions from previous years, a push towards the use of more
complex hierarchies is evident in Pascal 2007. The use of these hierarchies resulted in
improved performance. Despite the belief of many researchers that the search for truly
universal word/part-based representations has so far failed, this research indicates
that for class-specific datasets such representations can be of help. Within the context of
Fig.1, these hierarchies represent a more complex type of feature grouping. Effectively the
authors are using similar

Figure 29: It is easier to understand the left image's contents (e.g., a busy road with
mountains in the background) if the cars in the image have first been localized. Conversely,
in the right image, occlusions make the object localization problem difficult. Thus, prior
knowledge that the image contains exclusively cars can make the localization problem easier
(from [361] with permission).

Figure 30: Demonstrating how top-down category-specific attentional biases can modulate
the shape-words during the bag-of-words histogram construction (from [358] with permission).
low level features (e.g., edges, color) and they are grouping them in more
complex ways in order to achieve more universal representations of object parts. In terms of

object verification and object hypothesizing (see Fig.1) the work by Felzenszwalb et al. [366]
represents the most successful approach tested in Pascal 2007, for using a coarse generative
model of object parts to improve recognition performance.

iii) Pascal 2008
Harzallah et al. [360, 361] present a framework in which the outputs of object localization
and classification algorithms are combined to improve each other's results. For example,
knowing the type of image can help improve the localization of certain objects (see Fig. 50).
Motivated by the cascade of classifiers proposed by Viola and Jones [227, 235, 236] (see Sec.
2.11) the authors propose a low-computational cost linear SVM classifier for pre-selection of
regions, followed by a costly but more reliable non-linear SVM (based on a χ² kernel) for
scoring the final localization output, providing a good trade-off between speed and accuracy.
A winning image classifier from VOC 2007 is used for the image classification algorithm.
Objects are represented using a combination of shape and appearance descriptors.
Shape descriptors consist of HOG descriptors calculated over 40 and 350 overlapping or non-
overlapping tiles (the authors compare various approaches for splitting the image into tiles).
The appearance descriptors are built using SIFT features that are quantized into words and
calculated over multiple scales. These words are used to construct visual word histograms
summarizing the content of each one of the tiles. The authors note that overlapping square
tiles seem to give the best performance. The number of positive training set examples used by
the linear SVM is artificially increased, and a procedure for retaining only the hard negative

examples during training is presented. The final image classification and localization
probabilities are combined via simple multiplication, to obtain the probability of having an
object in an image given the window's score (localization) and the image's score
(classification). Various results presented by the authors show that the combination of the two
generally improves the localization and classification results for both VOC 2007 and VOC 2008.
Tahir et al. [362, 342] propose the use of Spectral Regression combined with Kernel
Discriminant Analysis for classifying images in a particular class. The authors show that this
classifier is appropriate for large scale visual category recognition, since its training is much
faster than the SVM-based approaches that they tested, while at the same time achieving at
least as good performance as SVMs. This makes SR-KDA approaches a straightforward
replacement for the SVM modules often used in the literature. The image representation is
based on classical interest point detection, combined with various extensions of the SIFT
descriptor, combined with a visual codebook extraction phase. The algorithm achieves top-
ranked performance on the PASCAL VOC 2008 and Mediamill challenges. Within the context of
Fig.1, the main innovation evident in the top-ranked algorithms of Pascal 2008 lies in their
use of more powerful discriminative classifiers which enabled an improvement of the object
verification modules.

iv) Pascal 2009

Felzenszwalb et al. [363] present an extension of their previous work [366]. In contrast to
their earlier work, they now use stochastic gradient descent to perform the latent SVM
training. Furthermore, they investigate the use of PCA based dimensionality reduction
techniques to transform the object representation vectors and obtain lower dimensional
vectors for representing the image cells. They also introduce the use of contextual knowledge
to improve object localization performance. They achieve this by obtaining the set of
localizations from k detections, constructing a context vector from these
scores, and then using this vector in conjunction with a quadratic-kernel based SVM to
rescore the images. The authors test their algorithm on various PASCAL challenge datasets,
achieving comparatively excellent performance.
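The rescoring idea can be sketched as follows; the real construction in [363] also encodes detection geometry, and `svm_decision` here stands in for the trained quadratic-kernel SVM:

```python
import numpy as np

def context_vector(image_detections, num_classes):
    """Build a per-image context vector holding the top detection score
    of each class detector; a simplified version of the rescoring input."""
    v = np.zeros(num_classes)
    for cls, score in image_detections:
        v[cls] = max(v[cls], score)
    return v

def rescore(det_score, ctx, svm_decision):
    """Rescore one detection with a classifier applied to its own score
    concatenated with the image's context vector."""
    return svm_decision(np.concatenate([[det_score], ctx]))
```

The intuition is that a weak "boat" detection, say, becomes more plausible when the context vector shows confident "water"-correlated detections elsewhere in the image.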
Vedaldi et al. [356] investigate the use of a combination of kernels, where each kernel
corresponds to a different feature channel (such as bag of visual words, dense words,
histograms of oriented edges and self-similarity features).
The use of combinations of multiple kernels results in excellent performance, demonstrating
that further research on kernel methods has a high likelihood of further improving the
performance of vision systems. Similarly to the work in [360, 361], the authors use a cascade
of progressively more costly but more accurate kernels (linear, quasi-linear and non-linear
kernels) to efficiently localize the objects. However, as the authors note, further work could
be done to reduce the computational complexity of the framework. This algorithm also results
in comparatively excellent results on the PASCAL datasets it was tested on.
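The core operation is a convex combination of per-channel kernel matrices; a minimal sketch (in [356] the channel weights are learned by multiple-kernel learning rather than supplied by hand as here):

```python
import numpy as np

def combined_kernel(kernel_matrices, d):
    """Convex combination K = sum_k d_k K_k of per-feature-channel kernel
    matrices; the weights d_k are assumed non-negative and are
    renormalized here so they sum to one."""
    d = np.asarray(d, float)
    d = d / d.sum()
    return sum(dk * Kk for dk, Kk in zip(d, kernel_matrices))
```

A non-negative combination of positive semi-definite kernels is itself positive semi-definite, which is what makes the combined kernel valid for SVM training.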

Similarly, Wang et al. [357] present the Locality-constrained Linear Coding (LLC) approach
for obtaining sparse representations of scenes. These sparse bases are obtained through the
projection of the data onto various local coordinate frames. Linear weight combinations of
these bases are used to reconstruct local descriptors. The authors also propose a fast
approximation to LLC which speeds up the LLC computations significantly. An SVM is used
to classify the resulting image descriptors, achieving top-ranked performance when tested
with various benchmarks.
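The fast approximated coding step for a single descriptor can be sketched as below; the parameter names `k` and `beta` are illustrative, and the paper's notation may differ:

```python
import numpy as np

def llc_code(x, codebook, k=5, beta=1e-4):
    """Fast approximated locality-constrained coding: reconstruct the
    descriptor x from its k nearest codebook entries by solving the small
    constrained least-squares problem min ||x - B^T c||^2 s.t. 1^T c = 1
    over those neighbors."""
    d2 = ((codebook - x) ** 2).sum(1)
    idx = np.argsort(d2)[:k]                 # k nearest bases (locality)
    z = codebook[idx] - x                    # shift bases to the origin
    C = z @ z.T + beta * np.eye(k)           # local covariance, regularized
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                             # enforce the sum-to-one constraint
    code = np.zeros(len(codebook))
    code[idx] = w
    return code
```

The resulting code is sparse by construction (at most k non-zeros) and concentrates its weight on the codebook entry closest to the descriptor.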
Khan et al. [358, 359] attempt to bridge the gap between the bottom-up bag-of-words
paradigm, which has been quite successful in the PASCAL challenges, and top-down
processing, by incorporating a top-down attention mechanism that can selectively bias the features extracted in an image
based on their dominant color (see Fig. 51). As the authors point out, the two main
approaches for fusing color and shape information into a bag-of-words representation are
early fusion (where joint shape-color descriptors are used) and late fusion (where
histogram representations of color and shape are simply concatenated). Given separate
vocabularies for shape and color, each training image's corresponding color histogram is
estimated and a class specific posterior p(class|word) is estimated. By concatenating the
posteriors for all the color words of interest, the corresponding low-level features are primed.
Difference of Gaussian detectors, Harris Laplace detectors and SIFT descriptors are used to
obtain the shape descriptors. The Color Name and HUE descriptors are used as color
descriptors [368, 376, 377]. A standard χ²-kernel SVM is used for classifying images. These top-
down approaches are compared to early-fusion based approaches that combine SIFT
descriptors with color descriptors, and which are known to perform well [378]. It is shown

that for certain types of images the top-down priming can result in drastic classification improvements.
Within the context of Fig.1, it is evident that during Pascal 2009 there was a significant shift
towards more complex object representations and more complex object inference and
verification algorithms. This is evidenced by the incorporation of top down priming
mechanisms, complex kernels that incorporate contextual knowledge, as well as by novel
local sparse descriptors which achieved top-ranked performance. Consistent in all this work
is the preference for using SVMs for contextual classification, model building during training,
and object recognition, demonstrating that the use of SVMs has become more subtle
and less monolithic compared to early recognition algorithms.

v) Pascal 2010
Perronnin et al. [354] present an extension of their earlier work [364], which we have already
described in this section. The modifications they introduce result in an increase of over 10%
in the average precision. An interesting aspect of this work is that during the 2010 challenge
the system was also trained on its own dataset (non-VOC related) and subsequently tested
successfully on various tasks, demonstrating the algorithm's ability to generalize. The authors
achieve these results by using linear classifiers. This last point is important since linear SVMs
have a training cost of O(N), while non-linear SVMs have a training cost of around O(N²) to
O(N³), where N is the number of training images. Thus, training non-linear SVMs becomes

impractical with tens of thousands of training images. The authors achieve this improvement
in their results by normalizing the respective gradient vectors first described in [364].
Another problem with the gradient representation is the sparsity of many vector dimensions.
As a result the authors apply to each dimension a function f(z) = sign(z)|z|^α for some α ∈ [0,
1], which results in a significant classification improvement.
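The power-normalization step, followed by the L2 normalization also used in [354], is straightforward to sketch:

```python
import numpy as np

def power_normalize(v, alpha=0.5):
    """f(z) = sign(z) * |z|**alpha applied elementwise, alpha in [0, 1];
    this dampens the peaky, sparse dimensions of the gradient vector."""
    v = np.asarray(v, float)
    return np.sign(v) * np.abs(v) ** alpha

def l2_normalize(v, eps=1e-12):
    # subsequent L2 normalization of the whole vector
    v = np.asarray(v, float)
    return v / (np.linalg.norm(v) + eps)
```

With α = 0.5 this is simply a signed square root, e.g. 4 maps to 2 and −9 maps to −3, after which the vector is rescaled to unit length so that linear classifiers see comparable magnitudes across images.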
A number of other novel ideas were tested within the context of the 2010 Pascal challenge.
van de Sande et al. [351] proposed a selective search algorithm for efficiently searching
within a single image, without having to exhaustively search the entire image (see Secs.
2.11, 3.2 for more related work). They achieve this by adopting segmentation as a selective
search strategy, so that rather than aiming for a few accurate object localizations, they
generate more approximate object localizations, thus placing a higher emphasis on high
recall rates. A novel object-class-specific part


Figure 31: (a) The 3-layer tree-like object representation in [348]. (b) A reference template
without any part displacement, showing the root-node bounding box (blue), the centers of the
9 parts in the 2nd layer (yellow dots), and the 36 parts at the last layer (purple). (c) and
(d) denote object localizations (from [348] with permission).
representation was also introduced for human pose estimation [352, 353]. It achieved state-
of-the-art performance for localizing people, demonstrating the significance of properly
choosing the object representation.
Overall, in the top-ranked systems of Pascal 2010 there is evidence of an effort to mitigate
the effects of training set biases. This has motivated Perronnin et al. [354] to test the
generalization ability of their system even when trained on a non-Pascal-related dataset.
Approaches proposed to improve the computational complexity of training and online search
algorithms include the use of combinations of linear and non-linear SVMs as well as various
image search algorithms. Within the context of Fig.1, this corresponds to ways of improving
the hypothesis generation and object verification process.

vi) Pascal 2011
Zhu et al. [348] present an incremental concave-convex procedure (iCCP) which enables the
authors to efficiently learn both two- and three-layer object representations. The authors
demonstrate that their algorithm outperforms the model by Felzenszwalb et al. [363]. These
results are used by the authors as evidence that deep structures (3 layers) are better than 2-
layer object representations (see Fig. 52). The authors begin their exposition by

describing the task of structural SVM learning. Let (x1, y1, h1), ..., (xN, yN, hN) ∈ X × Y × H
denote training samples, where the xi are training patches, the yi are class labels, and hi = (Vi,
p⃗i), with Vi denoting a viewpoint and p⃗i denoting the positions of object parts. In other
words, the hi encode the spatial arrangement of the object representation. In structural SVM
learning the task is to learn a function Fw(x) = argmax_{y,h} [w · Φ(x, y, h)], where Φ is a joint
feature vector encoding the relation between the input x and the structure (y, h). In practice Φ
encodes spatial and appearance information similarly to [363]. If the structure information h
is not labelled in the training set (as is usually the case, since in training data we are
customarily only given the bounding box of the object of interest and not part-relation
information), then we deal with the latent structural SVM problem, where we need to solve

min_w (1/2)‖w‖² + C Σi [ max_{y,h} (w · Φi,y,h + Li,y,h) − max_h (w · Φi,yi,h) ],

where C is a constant penalty value, Φi,y,h = Φ(xi, y, h), and Li,y,h = L(yi, y, h) is the loss
function, which is equal to 1 if and only if yi ≠ y. The authors use some previous results from the latent
structural SVM training literature: by splitting the above expression into two terms, they
iteratively find a hyperplane (a function of w) which bounds the last max term (which is
concave in terms of w), replace the max term with this hyperplane, solve the resulting convex
problem, and repeat the process. This trains the model and enables the authors to use Fw to

localize objects in an image, achieving

Figure 32: On using context to mitigate the negative effects of ambiguous localizations [350].
The greater the ambiguities, the greater the role contextual knowledge plays (from [350] with permission).
comparatively excellent results. Chen et al. [349] present a similar latent hierarchical model
which is also solved using a concave-convex procedure, and whose results are comparable to
other state-of-the-art algorithms. The latent-SVM procedure is again used to learn the
hierarchical object representation. A top-down dynamic programming algorithm is used to
localize the objects.
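Inference with a learned structural model, Fw(x) = argmax over (y, h) of w · Φ(x, y, h), can be sketched as an exhaustive search over a small candidate set; real implementations use dynamic programming over the part structure rather than enumeration, and `joint_feature` here plays the role of Φ:

```python
import numpy as np

def predict(w, x, candidates, joint_feature):
    """F_w(x): score every candidate (label, structure) pair with
    w . Phi(x, y, h) and return the best pair and its score."""
    best, best_score = None, -np.inf
    for y, h in candidates:
        s = w @ joint_feature(x, y, h)
        if s > best_score:
            best, best_score = (y, h), s
    return best, best_score
```

The same maximization appears inside the training objective above, which is why efficient structured inference is the computational core of both learning and localization.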
Song et al. [350] present a paper on using context to improve image classification and object
localization performance when dealing with ambiguous situations where
methodologies that do not use context tend to fail. Overall, in one of the top-ranked
approaches of Pascal 2011, Zhu et al. [348] demonstrated that even deeper hierarchies are
achievable. They showed that such hierarchies can provide even better results than another
top-ranked Pascal competition algorithm [363]. Within the context of Fig.1, the work by Zhu
et al. [348] provides an approach for building deeper hierarchies which affect the grouping,

hypothesis generation and verification modules of the standard recognition pipeline. Song et
al. [350] provided an elegant way for adaptively controlling the object hypothesis module, by
using context as an index that adaptively selects a different classifier that is appropriate for
the current context.

c) The Evolving Landscape
In 1965, Gordon Moore stated that the number of transistors that could be incorporated per
integrated circuit would increase exponentially with time [379, 380]. This provided one of the
earliest technology roadmaps for semiconductors. Even earlier, Engelbart [381] made a
similar prediction on the miniaturization of circuitry. Engelbart would later join SRI and
found the Augmentation Research Center (ARC), which is widely credited as a pioneer of
modern Internet-era computing, due to the center's early proposals for the
mouse, videoconferencing, interactive text editing, hypertext and networking [382]. As
Engelbart would later point out, it was his early prediction of the rapid increase in
computational power that convinced him of the promise of the research topics later pursued
by his ARC laboratory. The early identification of trends and shifts in technology, can
provide a competitive edge for any individual or corporation. The question arises as to
whether we are currently entering a technological shift of the same scope and importance as
the one identified by Moore and Engelbart fifty years ago.
For all intents and purposes, Moore's law is coming to an end. While Moore's law is still
technically valid, since multicore technologies have enabled circuit designers to

inexpensively pack more transistors on a single chip, this no longer leads to commensurate
increases in application performance. Moore's law has historically provided a vital
technology roadmap that influenced the agendas of diverse groups in academia and business.
Today, fifty years after the early research on object recognition systems, we are
simultaneously confronted with the end of Moore's law and with a gargantuan explosion in
multimedia data growth [253]. Fundamental limits on processing speed, power consumption,
reliability and programmability are placing severe constraints on the evolution of the
computing technologies that have driven economic growth since the 1950s [383]. It is
becoming clear that traditional von Neumann architectures are unsuitable for
human-level intelligence tasks, such as vision, since the machine complexity in terms of the
number of gates and their power requirements tends to grow exponentially with the size of
the input and the environment complexity [384, 383]. The question for the near future is to
what extent the end of Moore's law will lead to a significant evolution in
vision research that will be capable of accommodating the shifting needs of industry. As the
wider vision community slowly begins to address this fact, it will define the evolution of
object recognition research, it will influence the vision systems that remain relevant, and it
will lead to significant changes in vision and computer science education in general by
affecting other related research areas that are strongly dependent on vision (such as robotics).
According to the experts responsible for the International Technology Roadmap for Semiconductors [384], the most promising future strategy for chip and system design is that of complementing current information technology with low-power computing systems inspired by the architecture of the brain [383]. How would von-Neumann architectures compare to a non-von-Neumann architecture that emulates the organization of the organic brain? The two architectures should be suitable for complementary applications. The complexity of neuromorphic architectures should increase more gradually with increasing environment complexity, and they should tolerate noise and errors [383]. However, such neuromorphic architectures would likely not be suitable for high-precision numerical analysis tasks.
Modern von-Neumann computing relies on programs executed by synchronous, serial, centralized, hardwired, general-purpose and brittle circuits [385]. The brain architecture, on the other hand, relies on neurons and synapses operating in a mixed digital-analog mode; it is asynchronous, parallel, fault-tolerant, distributed and slow, with a blurred distinction between CPU and memory (as compared to von-Neumann architectures), since memory is, to a large extent, represented by the synaptic weights.
How does our current understanding of the human brain differentiate it from typical von-Neumann architectures? Turing made the argument that since brains are computers, brains are computable [386]. But if that is indeed the case, why do reliable image understanding algorithms still elude us? Churchland [387] and Hawkins [388] argue that general-purpose AI is difficult because (i) computers must have a large knowledge base, which is difficult to construct, and because (ii) it is difficult to extract the most relevant and contextual information from such a knowledge base. As was demonstrated throughout our discussion of object recognition systems, the problem of efficient object representations and efficient feature extraction is central to any non-trivial recognition system, which supports the viewpoint of Churchland and Hawkins.
There is currently a significant research thrust towards the construction of neuromorphic systems, both at the hardware and the software level. This is evidenced by recent high-profile projects, such as EU funding of the Human Brain Project with over a billion euros over 10 years [383], U.S. funding for the NIH BRAIN Initiative [389], and by the growing interest in academia and industry in related projects [390, 391, 392, 393, 394, 395, 396, 397]. The appeal of neuromorphic architectures lies in [398]: (i) the possibility of such architectures achieving human-like intelligence by utilizing unreliable devices that are similar to those found in neuronal tissue, (ii) the ability of neuromorphic strategies to deal with anomalies, caused for example by noise and hardware faults, and (iii) their low-power requirements, due to their lack of a power-intensive bus and to the blurred distinction between CPU and memory.

Vision and object recognition should assume a central role in any such research endeavour. About 40% of the neocortex is devoted to the visual areas V1 and V2 [388], which in turn are devoted just to low-level feature extraction. It is thus reasonable to argue that solving the general AI problem is similar in scope to solving the image understanding problem (see Sec. 1). Current hardware and software architectures for vision systems are unable to scale to the massive computational resources required for this task. The elegance of the brain's solution to the vision problem is astounding. The human neocortex constitutes about 80% of the human brain, which has around 100 billion neurons and 10^14 synapses, consumes just 20-30 Watts, and is to a large extent self-trained [399]. One of the most astounding results in neuroscience is
attributable to Mountcastle [400, 388, 401]. By investigating the detailed anatomy of the
neocortex, he was able to show that the micro-architecture of the regions looks extremely
similar regardless of whether a region is for vision, hearing or language. Mountcastle
proposed that all parts of the neocortex operate based on a common principle, with the
cortical column being the unit of computation. What distinguishes different regions is simply
their input (whether the input is vision-based, auditory-based, etc.). From a machine learning perspective this is a surprising and puzzling result, since the no-free-lunch theorem, according to which it is best to use a problem-specific optimization/learning algorithm, permeates much of machine learning research. In contrast, the neocortex seems to rely on a single learning architecture for all its tasks and input modalities. Looking back at the object
recognition algorithms surveyed in this paper, it becomes clear that no mainstream vision
system comes close to achieving the generalization abilities of the neocortex. This sets the
stage for what may well become one of the most challenging and rewarding scientific
endeavours of this century.

8) Multi-Stage Architectures for Object Recognition
The PSD algorithm can be used to build multi-layer feed-forward object recognition architectures that can be trained in a layer-by-layer fashion, as introduced in (Hinton and Salakhutdinov, 2006). The hierarchy stacks one or several feature extraction stages, each of which consists of a filter bank layer, non-linear transformation layers, and a pooling layer that combines filter responses over local neighborhoods using an average or max operation, thereby achieving invariance to small distortions. In order to achieve efficient feature extraction and supervised fine-tuning, we only use the predictor module Fe(x; K) to build the hierarchy.

a) Modules for Hierarchical Systems
The core component of the feature extraction stage is the filterbank. In conventional systems that contain only a single layer of feature extraction, Gabor functions (Serre et al., 2005; Mutch and Lowe, 2006; Pinto et al., 2008) or oriented edge detectors (Lowe, 2004; Dalal and Triggs, 2005) are the most common choice. In other methods, where the feature extraction stage is trained (Ranzato et al., 2007b; Kavukcuoglu et al., 2009; Lee et al., 2009), the first layer filterbank ends up being similar to Gabor-like features; however, these methods also provide a way to train successive layers. On the other hand, it is not clear how to design higher-level features that follow Gabor functions.

Another important component of multi-stage architectures is the non-linearities. Convolutional networks (LeCun et al., 1998a) use a simple point-wise sigmoid or tanh function following the filterbank, while models that are strongly inspired by biology have included rectifying non-linearities, such as positive part, absolute value, or squaring functions (Pinto et al., 2008), often followed by a local contrast normalization (Pinto et al., 2008), which is inspired by divisive normalization models (Lyu and Simoncelli, 2008).
The final component of a stage is the pooling function that can be applied over space (LeCun
et al., 1998a; Lazebnik et al., 2006; Ranzato et al., 2007b; Ahmed et al., 2008), over scale and
space (Serre et al., 2005; Mutch and Lowe, 2006; Pinto et al., 2008), or over similar feature
types and space (Kavukcuoglu et al., 2009). This layer builds robustness to small distortions
by computing an average or a max of the filter responses within the pool.
Filter Bank Layer - F_CSG: the input of a filter bank layer is a 3D array with n1 2D feature maps of size n2 x n3. Each component is denoted x_ijk, and each feature map is denoted x_i. The output is also a 3D array y, composed of m1 feature maps of size m2 x m3. A filter k_ij in the filter bank has size l1 x l2 and connects input feature map x_i to output feature map y_j. The module computes the features as defined by equation 4.1:

y_j = g_j \tanh\Big( \sum_i k_{ij} * x_i \Big)    (4.1)

Taking the border effects into account, we have m2 = n2 - l1 + 1 and m3 = n3 - l2 + 1. This layer is denoted F_CSG because it is composed of a set of convolution filters (C), a sigmoid/tanh non-linearity (S), and gain coefficients (G). In the following, superscripts are used to denote the size of the filters. For instance, a filter bank layer with 64 filters of size 9x9 is denoted 64F_CSG^{9x9}. Note that this layer is the predictor function Fe(x; K) defined in chapter 2, applied convolutionally.
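To make the layer definition concrete, the following NumPy sketch implements equation (4.1) as a "valid" cross-correlation followed by a tanh and a per-map gain. The function name and array layout are illustrative choices, not those of the original implementation:

```python
import numpy as np

def fcsg_layer(x, kernels, gains):
    """F_CSG sketch: y_j = g_j * tanh(sum_i k_ij (*) x_i).

    x       : (n1, n2, n3) input feature maps
    kernels : (n1, m1, l1, l2) filter bank, k_ij = kernels[i, j]
    gains   : (m1,) gain coefficients g_j
    """
    n1, n2, n3 = x.shape
    _, m1, l1, l2 = kernels.shape
    m2, m3 = n2 - l1 + 1, n3 - l2 + 1     # 'valid' output size
    y = np.zeros((m1, m2, m3))
    for j in range(m1):
        acc = np.zeros((m2, m3))
        for i in range(n1):
            k = kernels[i, j]
            # valid cross-correlation of x_i with k_ij, written out explicitly
            for a in range(m2):
                for b in range(m3):
                    acc[a, b] += np.sum(x[i, a:a + l1, b:b + l2] * k)
        y[j] = gains[j] * np.tanh(acc)
    return y
```

The output shape directly reflects the border-effect formula: a (2, 12, 12) input with 3x3 filters yields maps of size 12 - 3 + 1 = 10.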
Rectification Layer - R_abs: This module simply applies the absolute value function to all the components of its input: y_ijk = |x_ijk|. Several rectifying non-linearities were tried, including the positive part, and produced similar results.
Local Contrast Normalization Layer - N: This module performs local subtractive and divisive normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps. The subtractive normalization operation for a given location x_ijk computes:

v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \, x_{i, j+p, k+q}

where w_pq is a Gaussian weighting window (of size 9x9 with standard deviation 1.591 in our experiments) normalized so that \sum_{ipq} w_{pq} = 1. The divisive normalization then computes:

y_{ijk} = v_{ijk} / \max(c, \sigma_{jk}), \quad \text{where } \sigma_{jk} = \Big( \sum_{ipq} w_{pq} \, v_{i, j+p, k+q}^2 \Big)^{1/2}

For each sample, the constant c is set to mean(\sigma_{jk}) in the experiments. The denominator is the weighted standard deviation of all features over a spatial neighborhood. The local contrast normalization layer is inspired by computational neuroscience models (Pinto et al., 2008; Lyu and Simoncelli, 2008).
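The two normalization steps can be sketched directly from the formulas above. In the following NumPy illustration the window construction, padding convention, and array layout are our own assumptions:

```python
import numpy as np

def lcn(x, size=9, sigma=1.591):
    """Local contrast normalization sketch: subtractive then divisive,
    pooling over all feature maps and a size x size spatial neighborhood.
    x: (n_maps, h, w)."""
    n, h, wd = x.shape
    ax = np.arange(size) - size // 2
    w = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    w /= w.sum() * n                      # so sum over i,p,q of w_pq equals 1

    def local_sum(maps):                  # 'same'-size correlation, summed over maps
        p = size // 2
        padded = np.pad(maps, ((0, 0), (p, p), (p, p)))
        out = np.zeros((h, wd))
        for a in range(size):
            for b in range(size):
                out += w[a, b] * padded[:, a:a + h, b:b + wd].sum(axis=0)
        return out

    v = x - local_sum(x)                  # subtractive normalization
    sigma_jk = np.sqrt(local_sum(v ** 2)) # weighted local standard deviation
    c = sigma_jk.mean()
    return v / np.maximum(c, sigma_jk)    # divisive normalization
```

Note that the subtractive local mean and the divisive standard deviation are both shared across feature maps, which is what enforces competition between features at the same spatial location.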
Average Pooling and Subsampling Layer - P_A: The purpose of this layer is to build robustness to small distortions, playing the same role as the complex cells in models of visual perception. Each output value is:

y_{ijk} = \sum_{pq} w_{pq} \, x_{i, j+p, k+q}

where w_pq is a uniform weighting window (boxcar filter). Each output feature map is then subsampled spatially by a factor S horizontally and vertically.
Max-Pooling and Subsampling Layer - P_M: Building local invariance to shift can be performed with any symmetric pooling operation. The max-pooling module is similar to the average pooling module, except that the average operation is replaced by a max operation. In our experiments, the pooling windows were non-overlapping. A max-pooling layer with 4x4 down-sampling is denoted P_M^{4x4}.
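Both pooling layers reduce to the same windowed reduction. A minimal NumPy sketch for the non-overlapping case used in our experiments (the function name and array layout are illustrative):

```python
import numpy as np

def pool(x, k, mode="max"):
    """Non-overlapping k x k pooling with k-fold down-sampling.
    x: (n_maps, h, w); mode: 'max' (P_M) or 'avg' (P_A with a boxcar window)."""
    n, h, w = x.shape
    x = x[:, :h - h % k, :w - w % k]      # drop ragged borders, if any
    blocks = x.reshape(n, x.shape[1] // k, k, x.shape[2] // k, k)
    return blocks.max(axis=(2, 4)) if mode == "max" else blocks.mean(axis=(2, 4))
```

For instance, a P_M^{4x4} layer corresponds to pool(x, 4, "max").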

b) Combining Modules into a Hierarchy
Different architectures can be produced by cascading the above-mentioned modules in various ways. An architecture is composed of one or two stages of feature extraction, each of which is formed by cascading a filtering layer with different combinations of rectification, normalization, and pooling. Recognition architectures are composed of one or two such stages, followed by a classifier, generally a multinomial logistic regression. A single stage with filterbank, rectification, contrast normalization and pooling layers is shown in Fig. 33.
F_CSG - P_A: This is the basic building block of traditional convolutional networks, alternating tanh-squashed filter banks with average down-sampling layers (LeCun et al., 1998a; Huang and LeCun, 2006). A complete convolutional network would have several sequences of F_CSG - P_A followed by a linear classifier.

Figure 33: An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input image (or a feature map) is passed through a non-linear filterbank, followed by rectification, local contrast normalization and spatial pooling/sub-sampling.

F_CSG - R_abs - P_A: The tanh-squashed filter bank is followed by an absolute value non-linearity, and by an average down-sampling layer.

F_CSG - R_abs - N - P_A: The tanh-squashed filter bank is followed by an absolute value non-linearity, by a local contrast normalization layer, and by an average down-sampling layer.

F_CSG - P_M: This is also a typical building block of convolutional networks, as well as the basis of the HMAX and other architectures (Serre et al., 2005; Ranzato et al., 2007b), which alternate tanh-squashed filter banks with max-pooling layers.

c) Training Protocol
Given a particular architecture, a number of training protocols have been considered and tested. Each protocol is identified by a letter R, U, R+, or U+. A single letter (e.g. R) indicates an architecture with a single stage of feature extraction followed by a classifier, while a double letter (e.g. RR) indicates an architecture with two stages of feature extraction followed by a classifier. R and RR represent feature extraction using random kernels. U and UU represent feature extraction using kernels initialized with the PSD algorithm. R and U followed by a superscript + represent supervised training of the combined feature extractor and classifier.
Random Features and Supervised Classifier - R and RR: The filters in the feature
extraction stages are set to random values and kept fixed (no feature learning takes place),
and the classifier stage is trained in supervised mode.
Unsupervised Features, Supervised Classifier - U and UU: The predictor (Fe) filters of the feature extraction stages are trained using the unsupervised PSD algorithm, described in chapter 2, and kept fixed. The classifier stage is trained in supervised mode.
Random Features, Global Supervised Refinement - R+ and R+R+: The filters in the feature extractor stages are initialized with random values, and the entire system (feature stages + classifier) is trained in supervised mode by gradient descent. The gradients are computed using back-propagation, and all the filters are adjusted by stochastic online updates. This is identical to the usual method for training supervised convolutional networks.
Unsupervised Feature, Global Supervised Refinement - U+ and U+U+: The filters in the feature extractor stages are initialized with the PSD unsupervised learning algorithm, and the entire system (feature stages + classifier) is then trained (refined) in supervised mode by gradient descent. The system is trained the same way as random features with global refinement, using online stochastic updates. This is reminiscent of the deep belief network strategy, in which the stages are first trained in unsupervised mode one after the other, and then globally refined using supervised learning (Hinton and Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007b).

A traditional convolutional network with a single stage initialized at random (LeCun et al., 1998a) would be denoted by an architecture motif like F_CSG - P_A, and the training protocol would be denoted by R+. The stages of a convolutional network with max-pooling would be denoted by F_CSG - P_M. A system with two such stages trained in unsupervised mode, with only the classifier trained in supervised mode, as in (Ranzato et al., 2007b), is denoted UU, and a two-stage deep belief network is represented as U+U+.

d) Experiments with Caltech-101 Dataset
We used the Caltech-101 dataset (Fei-Fei et al., 2004) to compare different training protocols and architectures. A more detailed study can be found in our work "What is the best multi-stage architecture for object recognition?" (Jarrett et al., 2009).
Images from the Caltech-101 dataset were pre-processed with a procedure similar to (Pinto et al., 2008). The steps are: 1) converting to gray-scale (no color) and resizing to 151x151 pixels; 2) subtracting the image mean and dividing by the image standard deviation; 3) applying subtractive/divisive normalization (N layer with c = 1); 4) zero-padding the shorter side to 143 pixels.
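Steps 1, 2 and 4 of this pipeline can be sketched as follows. Resizing and the N-layer step are omitted for brevity, and the centering convention for the padding is an assumption:

```python
import numpy as np

def preprocess(img, pad_to=143):
    """Sketch of preprocessing steps 1, 2 and 4 above.
    img: (h, w) grayscale or (h, w, 3) color array."""
    if img.ndim == 3:
        img = img.mean(axis=2)                        # 1) to gray-scale
    img = (img - img.mean()) / (img.std() + 1e-8)     # 2) standardize
    h, w = img.shape
    out = np.zeros((max(h, pad_to), max(w, pad_to)))  # 4) zero-pad short sides
    y0, x0 = (out.shape[0] - h) // 2, (out.shape[1] - w) // 2
    out[y0:y0 + h, x0:x0 + w] = img
    return out
```

Because step 2 removes the image mean before padding, the zero border introduced in step 4 matches the (zero) mean of the standardized image.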

e) Using a Single Stage of Feature Extraction
The first stage is composed of an F_CSG layer with 64 filters of size 9x9 (64F_CSG^{9x9}), followed by an abs rectification layer (R_abs), a local contrast normalization layer (N) and an average pooling layer with a 10x10 boxcar filter and 5x5 down-sampling (P_A). The output of the first stage is a set of 64 feature maps of size 26x26. This output is then fed to a multinomial logistic regression classifier that produces a 102-dimensional output vector representing a posterior distribution over class labels. Lazebnik's PMK-SVM classifier (Lazebnik et al., 2006) was also tested.

Table 4.1: Average recognition rates on Caltech-101 with 30 training samples per class. Each
row contains results for one of the training protocols, and each column for one type of
architecture. All columns use an F_CSG as the first module, followed by the modules shown
in the column label. The error bars for all experiments are within 1%, except where noted.

f) Using Two Stages of Feature Extraction

In two-stage systems, the second-stage feature extractor is fed with the output of the first stage. The first layer of the second stage is an F_CSG module with 256 output feature maps, each of which combines a random subset of 16 feature maps from the previous stage using 9x9 kernels. Hence the total number of convolution kernels is 256 x 16 = 4096. The average pooling module uses a 6x6 boxcar filter with a 4x4 down-sampling step. This produces an output of 256 feature maps of size 4x4, which is then fed to a multinomial logistic regression classifier. The PMK-SVM classifier was also tested.
Results for these experiments are summarized in Table 4.1.
1. The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance (53.3% for R and 62.9% for RR), as long as they include absolute value rectification and contrast normalization (R_abs - N - P_A).
2. Comparing experiments from rows R vs R+, RR vs R+R+, U vs U+ and UU vs U+U+, we see that supervised fine-tuning consistently improves the performance, particularly with weak non-linearities: 62.9% to 64.7% for RR and 63.7% to 65.5% for UU using R_abs - N - P_A, and 35.1% to 59.5% for RR using R_abs - P_A.
3. It appears clear that two-stage systems (UU, U+U+, RR, R+R+) are systematically and significantly better than their single-stage counterparts (U, U+, R, R+): for instance, 54.2% obtained by U+ compared to 65.5% obtained by U+U+ using R_abs - N - P_A. However, when using the P_A architecture, adding a second stage without supervised refinement does not seem to help. This may be due to cancellation effects of the P_A layer when rectification is not present.

4. Unsupervised training (U, UU, U+, U+U+) does not seem to significantly improve the performance (compared with R, RR, R+, R+R+) if both rectification and normalization are used (62.9% for RR versus 63.7% for UU). When contrast normalization is removed, the performance gap becomes significant (33.7% for RR versus 46.7% for UU). If no supervised refinement is performed, it looks as if appropriate architectural components are a good substitute for unsupervised training.
5. It is clear that abs rectification is a crucial component for achieving good performance, as shown with the U+U+ protocol by comparing columns P_A (32.0%) and R_abs - N - P_A (65.5%). However, using max pooling seems to alleviate the need for abs rectification, confirming the hypothesis that average pooling without rectification falls victim to cancellation effects between neighboring filter outputs.
6. A single-stage system with PMK-SVM reaches the same performance as a two-stage
system with logistic regression (around 65%) as shown in the last two rows of Table 4.1. It
looks as if the pyramid match kernel is able to play the same role as a second stage of feature
extraction. Perhaps it is because PMK first performs a K-means based vector
quantization, which can be seen as an extreme form of sparse coding, followed by local
histogramming, which is a form of spatial pooling. Hence, the PM kernel is conceptually
similar to a second stage based on sparse coding and pooling as recently pointed out in
(Yang et al., 2009). Furthermore, these numbers are similar to the performance of the
original PMK-SVM system which used dense SIFT features (64.6%) (Lazebnik et al., 2006).
Again, this is hardly surprising, as the SIFT module is conceptually very similar to our feature extraction stage. When using features extracted with the UU architecture, the performance of the PMK-SVM classifier drops significantly. This behaviour might be caused by the very small spatial density (18x18) of the features at the second layer.
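The "vector quantization followed by local histogramming" reading of the pyramid match kernel can be sketched for a single pyramid level as follows. The names, shapes, and the hard-assignment choice are illustrative, not those of the original PMK implementation:

```python
import numpy as np

def vq_spatial_hist(feats, codebook, grid=2):
    """Hard vector quantization (an extreme form of sparse coding) followed
    by per-cell histogramming (a form of spatial pooling).
    feats: (h, w, d) local descriptors; codebook: (k, d) K-means centers."""
    h, w, _ = feats.shape
    k = codebook.shape[0]
    # nearest codeword for every spatial location
    d2 = ((feats[:, :, None, :] - codebook[None, None]) ** 2).sum(-1)
    assign = d2.argmin(-1)
    hist = np.zeros((grid, grid, k))
    for gy in range(grid):
        for gx in range(grid):
            cell = assign[gy * h // grid:(gy + 1) * h // grid,
                          gx * w // grid:(gx + 1) * w // grid]
            hist[gy, gx] = np.bincount(cell.ravel(), minlength=k)
    return hist
```

Under this reading, the codebook assignment plays the role of a (one-hot) sparse code and the per-cell histogram plays the role of a pooling layer.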

g) NORB Dataset
Caltech-101 is very peculiar in that many objects have distinctive textures and few pose variations. More importantly, the particularly small size of the training set favors methods that minimize the role of learning and maximize the role of clever engineering. A diametrically opposed object dataset is NORB (LeCun et al., 2004). The Small NORB dataset has 5 object categories (humans, airplanes, cars, trucks, animals), with 5 object instances for training and 5 different object instances for testing. Each object instance has 972 samples (18 azimuths, 9 elevations, and 6 illuminations), for a total of 24,300 training samples and 24,300 test samples (4,860 per class). Each image is 96x96 pixels, grayscale. Experiments were conducted to elucidate the importance of the non-linearity, and the performance of random filter systems when many labeled samples are available.

Figure 34: Test error rate vs. number of training samples per class on the NORB dataset. Although pure random features perform surprisingly well when training data is very scarce, for large numbers of training samples learning improves the performance significantly. Absolute value rectification (R_abs) and local normalization (N) are shown to improve the performance in all cases.
Only the RR, R+R+, UU, U+U+ protocols were used, with 8 feature maps with 5x5 filters at the first stage and 4x4 average pooling, followed by 24 feature maps with 6x6 filters, each of which combines input from 4 randomly picked stage-1 feature maps, followed by 3x3 average pooling. The last stage is a 5-way multinomial logistic regressor.

Several systems with various non-linearities were trained on subsets of the training set with 20, 50, 100, 200, 500, 1000, 2000, and 4860 training samples per class. The results are shown in Figure 34 in log-log scale. All the curves except the blue one represent models using absolute value rectification and contrast normalization. It appears that the use of abs and normalization makes a big difference when labeled samples are scarce, but the difference diminishes as the number of training samples increases. Training seems to compensate for architectural simplicity, or conversely architectural sophistication seems to compensate for lack of training. Comparing UU and RR, it is apparent that unsupervised learning improves the recognition performance significantly; however, R+R+ shows that labeled information is more effective than unsupervised information. Moreover, comparing R+R+ and U+U+, it can be seen that unsupervised training is most effective when the number of samples is not large (< 100). The best error rate when trained on the full training set with a model including abs and normalization is 5.6% and 5.5% for R+R+ and U+U+ respectively, but 6.9% with neither (LeCun et al. (2004) reported 6.6%).
More interesting is the behavior of the system with random filters: While its error rate is
comparable to that of the network trained in supervised mode for small training sets (in the
Caltech-101 regime), the error rate remains high as samples are added. Hence, while
random filters perform well on Caltech-101, they would most likely not perform as well as
learned filters on tasks with more labeled samples.


h) Random Filter Performance
The NORB experiments show that random filters yield sub-par performance when labeled samples are abundant. But the experiments also show that random filters seem to require the presence of abs and normalization. To explore why random filters work at all, starting from randomly initialized inputs, we used gradient descent minimization to find the optimal input patterns that maximize each output unit (after pooling) in an F_CSG - R_abs - N - P_A stage. The surprising finding is that the optimal stimuli for random filters are oriented gratings (albeit noisy and faint ones), similar to the optimal stimuli for trained filters. As shown in Figure 35, it appears that random weights, combined with the abs/norm/pooling stages, create a spontaneous orientation selectivity.

Figure 35: Left: random stage-1 filters, and corresponding optimal inputs that maximize the response of each corresponding complex cell in an F_CSG - R_abs - N - P_A stage. The small asymmetry in the random filters is sufficient to make them orientation selective. Right: same for PSD filters. The optimal input patterns contain several periods since they maximize the output of a complete stage that contains rectification, local normalization, and average pooling with down-sampling. Shifted versions of each pattern yield similar outputs.

i) Handwritten Digits Recognition
As a sanity check for the overall training procedures and architectures, experiments were run on the MNIST dataset, which contains 60,000 gray-scale 28x28 pixel digit images for training and 10,000 images for testing. An architecture with two stages of feature extraction was used: the first stage produces 32 feature maps using 5x5 filters, followed by 2x2 average pooling and down-sampling. The second stage produces 64 feature maps, each of which combines 16 feature maps from stage 1 with 5x5 filters (1024 filters total), followed by 2x2 pooling/down-sampling. The classifier is a 2-layer fully-connected neural network with 200 hidden units and 10 outputs. The loss function is equivalent to that of a 10-way multinomial logistic regression (also known as cross-entropy loss). The two feature stages use abs rectification.
The parameters for the two feature extraction stages are first trained with PSD, as explained in Section 2. The classifier is initialized randomly. The whole system is fine-tuned in supervised mode (the protocol could be described as U+U+R+R+). A validation set of

size 10,000 was set apart from the training set to tune the only hyper-parameter: the sparsity constant λ. Nine different values were tested between 0.1 and 1.6, and the best value was found to be 0.2. The system was trained with stochastic gradient descent on the 50,000 non-validation training samples until the best error rate on the validation set was reached (this took 30 epochs). We also used a diagonal approximation to the Hessian of the parameters, computed with the Levenberg-Marquardt method; the Hessian approximation was updated every epoch using 500 samples. The whole system was then tuned for another 3 epochs on the whole training set. A test error rate of 0.53% was obtained. To our knowledge, this is the best error rate ever reported on the original MNIST dataset, without distortions or preprocessing. The best previously reported error rate was 0.60% (Ranzato et al., 2006).

j) Convolutional Sparse Coding
Although sparse modeling has been used successfully for image analysis (Aharon et al., 2005; Mairal et al., 2009) and for learning feature extractors (Kavukcuoglu et al., 2009), these systems are trained on single image patches whose dimensions match those of the filters. In sparse coding, a sparse feature vector z is computed so as to best reconstruct the input x through a linear operation with a fixed dictionary matrix D. The inference procedure produces a code z by minimizing an energy function:

E(x, z, D) = \frac{1}{2} \| x - Dz \|_2^2 + \lambda \sum_i |z_i|    (5.1)

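As one concrete way to minimize this energy, the following sketch uses ISTA (iterative shrinkage-thresholding, with a gradient step on the reconstruction term followed by a soft-threshold). ISTA is a standard sparse coding solver, not necessarily the one used in our experiments:

```python
import numpy as np

def ista(x, D, lam=0.1, n_iter=100):
    """Minimize E(x, z, D) = 0.5 * ||x - D z||^2 + lam * ||z||_1 via ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ z - x)              # gradient of the reconstruction term
        z = z - g / L                      # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return z
```

The soft-threshold step is what produces exact zeros in the code, which matters for the sparsity comparisons discussed later in this chapter.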
As explained in Chapter 2, the dictionary can be learned by minimizing the energy (5.1) with respect to D: min_{z,D} E(x, z, D), averaged over a training set of input samples. After training, patches in the image are processed separately. This procedure completely ignores the fact that the filters are eventually going to be used in a convolutional fashion. Learning will produce a dictionary of filters that are essentially shifted versions of each other over the patch, so as to reconstruct each patch in isolation. Inference is performed on all (overlapping) patches independently, which produces a highly redundant representation of the whole image.
To address this problem, we apply sparse coding to the entire image at once, and we view the dictionary as a convolutional filter bank:

E(x, z, D) = \frac{1}{2} \Big\| x - \sum_i D_i * z_i \Big\|_2^2 + \lambda \sum_i \| z_i \|_1    (5.2)

where D_i is an s x s 2D filter kernel, x is a w x h image (instead of an s x s patch), z_i is a 2D feature map of dimension (w - s + 1) x (h - s + 1), and * denotes the discrete convolution operator. Convolutional sparse coding has been used by several authors, including (Zeiler et al., 2010).
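A direct NumPy evaluation of the energy in eq. (5.2) makes the dimensions explicit. This is a sketch; the full convolution is written out by hand only to keep the example self-contained:

```python
import numpy as np

def conv_full(z, d):
    """Full 2D convolution of an (hz, wz) map with an (s, s) kernel."""
    hz, wz = z.shape
    s = d.shape[0]
    out = np.zeros((hz + s - 1, wz + s - 1))
    for p in range(s):
        for q in range(s):
            out[p:p + hz, q:q + wz] += d[p, q] * z
    return out

def conv_sc_energy(x, D, z, lam):
    """Energy of eq. (5.2): 0.5 * ||x - sum_i D_i * z_i||^2 + lam * sum_i ||z_i||_1.
    x: (h, w); D: (n, s, s); z: (n, h - s + 1, w - s + 1)."""
    recon = sum(conv_full(z[i], D[i]) for i in range(len(D)))
    return 0.5 * ((x - recon) ** 2).sum() + lam * np.abs(z).sum()
```

Note that the full convolution of a (w - s + 1) x (h - s + 1) feature map with an s x s kernel recovers exactly the w x h image support, so the reconstruction and the input are directly comparable.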

Utilizing the same idea proposed in PSD, we can modify the convolutional sparse coding formulation in eq. (5.2) to include a trainable feed-forward predictor F_e:

L(x, z, D, K) = \frac{1}{2} \Big\| x - \sum_i D_i * z_i \Big\|_2^2 + \lambda \sum_i \| z_i \|_1 + \sum_i \| z_i - F_e(x; k_i, g_i) \|_2^2    (5.3)

where z* = arg min_z L(x, z, D, K), K = {k, g}, k_i is an encoding convolution kernel of size s x s, g_i is a singleton per feature map i, and F_e is the encoder function. Two crucially important questions are the form of F_e and the optimization method used to find z. Both questions are discussed at length below.

Figure 36: Left: a dictionary with 128 elements, learned with the patch-based sparse coding model. Right: a dictionary with 128 elements, learned with the convolutional sparse coding model. The dictionary learned with the convolutional model spans the orientation space much more uniformly. In addition, it can be seen that the diversity of filters obtained by the convolutional sparse model is much richer than that of the patch-based one.


k) Algorithms and Method
In this section, we analyze the benefits of convolutional sparse coding for object recognition systems, and propose convolutional extensions to the coordinate descent sparse coding (CoD) algorithm (Li and Osher) and the dictionary learning procedure.

1) Learning Convolutional Dictionaries
The key observation for modeling convolutional filter banks is that the convolution of a signal with a given kernel can be represented as a matrix-vector product by constructing a special Toeplitz-structured matrix for each dictionary element and concatenating all such matrices to form a new dictionary. Any existing sparse coding algorithm can then be used. Unfortunately, this method incurs a cost, since the size of the dictionary then depends on the size of the input signal. Therefore, it is advantageous to use a formulation based on convolutions rather than following the naive method outlined above. In this work, we use the coordinate descent sparse coding algorithm (Li and Osher) as a starting point and generalize it using convolution operations. Two important issues arise when learning convolutional dictionaries: 1) the boundary effects due to convolutions need to be properly handled; 2) the derivative of equation (5.2) should be computed efficiently. Since the loss is not jointly convex in D and z, but is convex in each variable when the other one is kept fixed, sparse dictionaries are usually learned by an approach similar to block coordinate descent, which alternately minimizes over z and D (e.g., see (Olshausen and Field, 1997; Mairal et al., 2009; Ranzato et al., 2006, 2007a)). One can use either batch updates (Aharon et al., 2005) (accumulating derivatives over many samples) or online updates (Mairal et al., 2009; Zeiler et al., 2010; Kavukcuoglu et al., 2009) (updating the dictionary after each sample). In this work, we use a stochastic online procedure for updating the dictionary.
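A single such online step can be sketched for the simpler patch-based energy (the convolutional case additionally requires the boundary handling discussed next); the learning rate and the unit-norm convention are our own illustrative choices:

```python
import numpy as np

def dict_update(D, x, z, lr=0.01):
    """One online dictionary step for E = 0.5 * ||x - D z||^2 + lam * ||z||_1:
    a gradient step on D followed by renormalizing columns to unit norm
    (a common convention to remove the scale ambiguity between D and z)."""
    D = D - lr * np.outer(D @ z - x, z)              # dE/dD = (D z - x) z^T
    norms = np.maximum(np.linalg.norm(D, axis=0), 1e-8)
    return D / norms                                  # unit-norm columns
```

The code z is held fixed during this step; alternating it with the sparse inference step implements the block coordinate descent scheme described above.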
The updates to the dictionary elements, calculated from equation 5.2, are sensitive to the boundary effects introduced by the convolution operator. The code units that are at the boundary might grow much larger than the middle elements, since the outermost boundaries of the reconstruction take contributions from only a single code unit, compared to the middle ones that combine s × s units. Therefore the reconstruction error, and correspondingly the derivatives, grow proportionally larger. One way to properly handle this situation is to apply a mask on the derivatives of the reconstruction error wrt z: Dᵀ(x − D ∗ z) is replaced by Dᵀ(mask(x) − D ∗ z), where mask is a term-by-term multiplier that either puts zeros or gradually scales down the contributions near the boundaries.
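As a 1-D illustration of such a mask (a hedged sketch of our own; the exact weighting used in the thesis may differ), one can weight each reconstruction sample by the number of code units that contribute to it:

```python
import numpy as np

def boundary_mask(n, s):
    """Down-weight reconstruction samples near the borders.
    A length-n reconstruction built from a 'valid'-size code and an
    s-tap kernel receives s overlapping contributions in the interior
    but as few as one at the outermost samples."""
    counts = np.convolve(np.ones(n - s + 1), np.ones(s))  # contributions per sample
    return counts / s  # interior -> 1, outermost samples -> 1/s
```

Multiplying the reconstruction target by this mask keeps boundary code units from receiving disproportionately large gradients.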
The second important point in training convolutional dictionaries is the computation of the S = DᵀD operator. For most algorithms like coordinate descent (Li and Osher), FISTA (Beck and Teboulle, 2009) and matching pursuit (Mallat and Zhang, 1993), it is advantageous to store the similarity matrix (S) explicitly and use a single column at a time for updating the corresponding component of code z. For convolutional modeling, the same approach can be followed with some additional care. In patch-based sparse coding, each element (i, j) of S equals the dot product of dictionary elements i and j. Since the similarity of a pair of dictionary elements has to be considered in the spatial dimensions as well, each term is expanded as the full convolution of two dictionary elements (i, j), producing a (2s − 1) × (2s − 1) matrix. It is more convenient to think of the resulting matrix as a 4D tensor of size d × d × (2s − 1) × (2s − 1). One should note that, depending on the input image size, proper alignment of the corresponding column of this tensor has to be applied in the z space. One can also use the steepest descent algorithm for finding the solution to the convolutional sparse coding problem given in equation 5.2; however, this method would be orders of magnitude slower than specialized algorithms like CoD (Li and Osher), and the solution would never contain exact zeros. In algorithm 3 we explain the extension of the coordinate descent algorithm (Li and Osher) to convolutional inputs. Having formulated convolutional sparse coding, the overall learning procedure is simple stochastic (online) gradient descent over the dictionary D.
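A direct (unoptimized) construction of this 4D similarity tensor might look as follows; the dictionary shape and function names here are our own illustration, not the thesis's implementation:

```python
import numpy as np
from scipy.signal import correlate2d

def gram_tensor(D):
    """Pairwise similarity of convolutional dictionary elements.
    D has shape (d, s, s); S[i, j] is the full 2-D cross-correlation of
    elements i and j, so S has shape (d, d, 2s-1, 2s-1)."""
    d, s, _ = D.shape
    S = np.zeros((d, d, 2 * s - 1, 2 * s - 1))
    for i in range(d):
        for j in range(d):
            # similarity of elements i and j at every relative spatial offset
            S[i, j] = correlate2d(D[i], D[j], mode="full")
    return S
```

The center entry S[i, i, s−1, s−1] is the squared norm of element i, i.e. the zero-offset dot product that patch-based coding would store as a single scalar.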

The columns of D are normalized after each iteration. A convolutional dictionary with 128 elements, trained on images from the Berkeley dataset (Martin et al., 2001), is shown in the figure.
2) Learning an Efficient Encoder

In (Ranzato et al., 2007a), (Jarrett et al., 2009) and (Gregor and LeCun, 2010) a feedforward regressor was trained for fast approximate inference. In this work, we extend their encoder module training to the convolutional domain and also propose a new encoder function that approximates sparse codes more closely. The encoder used in (Jarrett et al., 2009) is a simple feedforward function which can also be seen as a small convolutional neural network: z = g_i tanh(x ∗ k_i) (i = 1..d). This function has been shown to produce good features for object recognition (Jarrett et al., 2009); however, it does not include a shrinkage operator, so its ability to produce sparse representations is very limited. Therefore, we propose a different encoding function with a shrinkage operator. The standard soft thresholding operator has the nice property of producing exact zeros around the origin; however, over a very wide region its derivatives are also zero. In order to be able to train a filter bank that is applied to the input before the shrinkage operator, we propose to use an encoder with a smooth shrinkage operator z = sh_{β_i,b_i}(x ∗ k_i), where i = 1..d and:

Note that each β_i and b_i is a singleton per feature map i. The shape of the smooth shrinkage operator is shown in figure 5.2 for several different values of β.
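As a concrete stand-in (a softplus-based surrogate of our own; the thesis defines its own sh_{β,b}, which is not reproduced here), a smooth shrinkage with the same qualitative behavior, converging to soft thresholding as β grows, can be written as:

```python
import numpy as np

def smooth_shrink(s, beta, b):
    """Softplus-based smooth shrinkage (illustrative surrogate).
    Differentiable everywhere, near zero for |s| < b, and approaching
    soft thresholding sign(s) * max(|s| - b, 0) as beta -> infinity."""
    return np.sign(s) / beta * np.log1p(np.exp(beta * (np.abs(s) - b)))
```

Unlike hard soft thresholding, this surrogate has nonzero gradients everywhere, which is what makes the filter bank k_i trainable through the shrinkage stage.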
Updating the parameters of the encoding function is performed by minimizing equation 5.3. The additional cost term penalizes the squared distance between the optimal code z and the prediction produced by the encoder. In a sense, training the encoder module is similar to training a ConvNet. To aid faster convergence, we use the stochastic diagonal Levenberg-Marquardt method (LeCun et al., 1998b) to calculate a positive diagonal approximation to the Hessian. We update the Hessian approximation every 10000 samples; the effect of Hessian updates on the total loss is shown in figure 5.2. It can be seen that, especially for the tanh encoder function, the effect of using second order information on convergence is significant.
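The per-parameter step sizes in the stochastic diagonal Levenberg-Marquardt scheme take the form η/(h_i + μ), where h_i is the running Hessian-diagonal estimate and μ a safeguard constant; a minimal sketch (names and constants are our own, not from the thesis):

```python
import numpy as np

def sdlm_step_sizes(diag_hessian, eta=0.02, mu=0.1):
    """Stochastic diagonal Levenberg-Marquardt step sizes
    (LeCun et al., 1998b): high-curvature parameters get small steps,
    and mu prevents blow-up where the curvature estimate is near zero."""
    return eta / (np.abs(diag_hessian) + mu)
```

The Hessian-diagonal estimate itself is refreshed only periodically (every 10000 samples above), since it is expensive relative to an ordinary gradient step.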

3) Patch Based vs Convolutional Sparse Modeling
Natural images, sounds, and more generally, signals that display translation invariance in any dimension are better represented using convolutional dictionaries. The convolution operator enables the system to model local structures that appear anywhere in the signal. For example, if m × m image patches are sampled from a set of natural images, an edge at a given orientation may appear at any location, forcing local models to allocate multiple dictionary elements to represent a single underlying orientation. By contrast, a convolutional model only needs to record

Figure 37: Top Left: Smooth shrinkage function. Parameters β and b control the smoothness and the location of the kink of the function. As β → ∞, it converges to the soft thresholding operator. Top Right: Total loss as a function of the number of iterations. The vertical dotted line marks the iteration at which the diagonal Hessian approximation was updated. It is clear that for both encoder functions, the Hessian update improves convergence significantly. Bottom: 128 convolutional filters (k) learned in the encoder using the smooth shrinkage function.

the oriented structure once, since dictionary elements can be used at all locations. Figure 5.1 shows atoms from patch-based and convolutional dictionaries comprising the same number of elements. The convolutional dictionary does not waste resources modeling similar filter structure at multiple locations. Instead, it models more orientations, frequencies, and different structures, including center-surround filters, double center-surround filters, and corner structures at various angles.
In this work, we present two encoder architectures: 1. steepest descent sparse coding with the tanh encoding function g_i tanh(x ∗ k_i); 2. convolutional CoD sparse coding with the shrink encoding function sh_{β,b}(x ∗ k_i). The time required for training the first system is much higher than for the second due to the steepest descent sparse coding. However, the performance of the two encoding functions is almost identical.

l) Multi-stage architecture
Our convolutional encoder can be used to replace the patch-based sparse coding modules used in multi-stage object recognition architectures such as the one proposed in our previous work (Jarrett et al., 2009). Building on our previous findings, for each stage the encoder is followed by absolute value rectification, contrast normalization and average subsampling. Absolute Value Rectification is a simple pointwise absolute value function applied to the output of the encoder. Contrast Normalization is the same operation used for pre-processing the images. This type of operation has been shown to reduce the dependencies between components (Schwartz and Simoncelli, 2001; Lyu and Simoncelli, 2008) (feature maps in our case). When used between layers, the mean and standard deviation are calculated across all feature maps within a 9 × 9 neighborhood in the spatial dimensions. The last operation, average pooling, is simply a spatial pooling operation that is applied to each feature map independently.
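A sketch of this subtractive-plus-divisive normalization (our own implementation of the description above, using a uniform rather than weighted 9 × 9 window, which may differ from the exact weighting used):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def contrast_normalize(maps, size=9, eps=1e-8):
    """maps: array of shape (n_maps, H, W). Mean and variance are pooled
    across ALL feature maps within a size x size spatial neighborhood,
    then subtracted / divided out at every location."""
    mean = uniform_filter(maps.mean(axis=0), size)            # local mean over all maps
    centered = maps - mean                                    # subtractive normalization
    var = uniform_filter((centered ** 2).mean(axis=0), size)  # local variance over all maps
    return centered / np.sqrt(var + eps)                      # divisive normalization
```

Because the statistics are pooled across feature maps, strong responses in one map suppress co-located responses in the others, which is the decorrelating effect cited above.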
One or more additional stages can be stacked on top of the first one. Each stage then takes the output of its preceding stage as input and processes it using the same series of operations with different architectural parameters, such as size and connections. When the input to a stage is a series of feature maps, each output feature map is formed by the summation of multiple convolutions over different input maps. In the next sections, we present experiments showing that using convolutionally trained encoders in this architecture leads to better object recognition performance.

m) Experiments
We closely follow the architecture proposed in (Jarrett et al., 2009) for object recognition experiments. As stated above, in our experiments we use two different systems: 1. steepest descent sparse coding with the tanh encoder: SDtanh; 2. coordinate descent sparse coding with the shrink encoder: CDshrink. In the following, we give details of the unsupervised training and supervised recognition experiments.

1) Object Recognition using Caltech 101 Dataset
The Caltech-101 dataset (Fei-Fei et al., 2004) contains up to 30 training images per class, and each image contains a single object. We process the images in the dataset as follows: 1. each image is converted to gray-scale and resized so that the largest edge is 151 pixels; 2. images are contrast normalized to obtain locally zero mean and unit standard deviation input using a 9 × 9 neighborhood; 3. the short side of each image is zero padded to 143 pixels. We report the results in Tables 5.1 and 5.2. All results in these tables are obtained using 30 training samples per class and 5 different choices of the training set. We use the background class during training and testing.
Architecture: We use the unsupervised trained encoders in a multi-stage system identical to the one proposed in (Jarrett et al., 2009). In the first layer 64 features are extracted from the input image, followed by a second layer that produces 256 features. Second layer features are connected to first layer features through a sparse connection table to break the symmetry and to decrease the number of parameters.
Unsupervised Training: The input to unsupervised training consists of contrast normalized gray-scale images (Pinto et al., 2008) obtained from the Berkeley segmentation dataset (Martin et al., 2001). Contrast normalization consists of processing each feature map value by removing the mean and dividing by the standard deviation, both calculated over a 9 × 9 region centered at that value across all feature maps.

First Layer: We have trained both systems using 64 dictionary elements. Each dictionary item is a 9 × 9 convolution kernel. The resulting problem to be solved is a 64-times overcomplete sparse coding problem. Both systems are trained for 10 different sparsity values ranging between 0.1 and 3.0.
Second Layer: Using the 64 feature maps output by the first layer encoder on Berkeley images, we train a second layer of convolutional sparse coding. At the second layer, the number of feature maps is 256 and each feature map is connected to 16 randomly selected input features out of 64. Thus, we aim to learn 4096 convolutional kernels at the second layer. To the best of our knowledge, none of the previous convolutional RBM (Lee et al., 2009) and sparse coding (Zeiler et al., 2010) methods have learned such a large number of dictionary elements. Our aim is motivated by the fact that, with such a large number of elements and a linear classifier, (Jarrett et al., 2009) reports recognition results similar to (Lee et al., 2009) and (Zeiler et al., 2010); both of those studies use the more powerful Pyramid Match Kernel SVM classifier (Lazebnik et al., 2006) to reach the same level of performance. Figure 5.3 shows 128 filters that connect to 8 first layer features. Each row of filters connects to a particular second layer feature map. It can be seen that each row of filters extracts similar features, since their output responses are summed together to form one output feature map.

One Stage System: We train 64 convolutional unsupervised features using both the SDtanh and CDshrink methods. We use the encoder function obtained from this training, followed by absolute value rectification, contrast normalization and

Figure 38: Second stage filters. Left: Encoder kernels that correspond to the dictionary elements. Right: 128 dictionary elements; each row shows 16 dictionary elements connecting to a single second layer feature map. It can be seen that each group extracts a similar type of feature from its corresponding inputs.

Table 5.1: Comparison of the SDtanh encoder to the CDshrink encoder on the Caltech 101 dataset using a single stage architecture. Each system is trained using 64 convolutional filters. The recognition accuracies are very similar for both systems.

average pooling. The convolutional filters used are 9 × 9. The average pooling is applied over a 10 × 10 area with a 5 pixel stride. The output of the first layer is then 64 × 26 × 26 and is fed into either a logistic regression classifier or Lazebnik's PMK-SVM classifier (Lazebnik et al., 2006) (that is, the spatial pyramid pipeline is used, with our features replacing the SIFT features).
Two Stage System: We train 4096 convolutional filters with the SDtanh method, using the 64 input feature maps from the first stage to produce 256 feature maps. The second layer filters are also 9 × 9, producing 256 × 18 × 18 features. After applying absolute value rectification, contrast normalization and average pooling (over a 6 × 6 area with stride 4), the output features are 256 × 4 × 4 (4096) dimensional. We use only a multinomial logistic regression classifier after the second layer feature extraction stage.
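The feature map sizes quoted above follow from standard valid-convolution and pooling arithmetic (assuming the 143-pixel padded inputs described earlier and no extra padding):

```python
def conv_out(n, k):
    """Spatial size after a 'valid' convolution with a k x k kernel."""
    return n - k + 1

def pool_out(n, size, stride):
    """Spatial size after pooling with the given window and stride."""
    return (n - size) // stride + 1

# Stage 1: 143 -> 9x9 conv -> 135 -> 10x10 pool, stride 5 -> 26
assert pool_out(conv_out(143, 9), 10, 5) == 26
# Stage 2: 26 -> 9x9 conv -> 18 -> 6x6 pool, stride 4 -> 4
assert pool_out(conv_out(26, 9), 6, 4) == 4
```

With 256 maps of size 4 × 4, the final descriptor is indeed 256 × 4 × 4 = 4096 dimensional.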

We denote one stage systems trained unsupervised by U, two stage unsupervised trained systems by UU, and a trailing + indicates that supervised training is performed afterwards. R stands for randomly initialized systems with no unsupervised training. Comparing our U system using both SDtanh and CDshrink (57.1% and 57.3%) with the 52.2% reported in (Jarrett et al., 2009), we see that convolutional training results in a significant improvement. With two layers of purely unsupervised features (UU, 65.3%), we even achieve the same performance as the patch-based model of Jarrett et al. (Jarrett et al., 2009) after supervised fine-tuning (63.7%). Moreover, with additional supervised fine-tuning (U+U+) we match or come very close

Table 5.2: Recognition accuracy on the Caltech 101 dataset for a variety of feature representations, using two stage systems and two different classifiers.

to (66.3%) similar models (Lee et al., 2009; Zeiler et al., 2010) with two layers of convolutional feature extraction, even though those models use the more complex spatial pyramid classifier (PMK-SVM) instead of the logistic regression we have used; the spatial pyramid framework comprises a codeword extraction step and an SVM, thus effectively adding one layer to the system. We get 65.7% with a spatial pyramid on top of our single-layer U system (with 256 codewords jointly encoding 2 × 2 neighborhoods of our features by hard quantization, then max pooling in each cell of the pyramid, with a linear SVM, as proposed by the authors in (Boureau et al., 2010)).
Our experiments have shown that sparse features achieve superior recognition performance compared to features obtained using a dictionary trained by a patch-based procedure, as shown in Table 5.1. It is interesting to note that the improvement is larger when using feature extractors trained in a purely unsupervised way than when unsupervised training is followed by a supervised training phase (57.1 to 57.6). Recalling that the supervised tuning is a convolutional procedure, this last training step might have the additional benefit of decreasing the redundancy between patch-based dictionary elements. On the other hand, this contribution would be minor for dictionaries which have already been trained convolutionally in the unsupervised stage.

2) Pedestrian Detection
We train and evaluate our architecture on the INRIA Pedestrian dataset (Dalal and Triggs, 2005), which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, we also augment the positive set with small translations and scale variations to learn invariance to small transformations, yielding 11370 and 1000 positive examples for training and validation, respectively. The negative set is obtained by sampling patches from the negative full images at random scales and locations. Additionally, we include samples from the positive set at larger and smaller scales to avoid false positives from very different scales. With these additions, the negative set is composed of 9001 training and 1000 validation samples.

3) Architecture and Training
A similar architecture to the one in the previous section was used, with 32 filters, each 7 × 7, for the first layer and 64 filters, also 7 × 7, for the second layer. We used 2 × 2 average pooling between each layer. A fully connected linear layer with 2 output scores (for pedestrian and background) was used as the classifier. We trained this system on 78 × 38 inputs where pedestrians are approximately 60 pixels high. We have trained our system with and without unsupervised initialization, followed by fine-tuning of the entire architecture in a supervised manner. Figure 5.5 shows comparisons of our system with other methods, as well as the effect of unsupervised initialization.

Figure 39: Results on the INRIA dataset with the per-image metric. Left: Comparing the two best systems with unsupervised initialization (UU) vs random initialization (RR). Right: Effect of bootstrapping on the final performance of the unsupervised initialized system.
After one pass of unsupervised and/or supervised training, several bootstrapping passes were performed to augment the negative set with the 10 most offending samples from each full negative image and the bigger/smaller scaled positives. The most offending sample is the one with the biggest opposite score. We limit the number of extracted false positives to 3000 per bootstrapping pass. As (Walk et al., 2010) showed, the number of bootstrapping passes matters more than the initial training set. We find that the best results were obtained after four passes, improving the miss rate from 23.6% to 11.5%, as shown in figure 5.5.


Figure 40: Results on the INRIA dataset with the per-image metric. These curves are computed from the bounding boxes and confidences made available by (Dollar et al., 2009b), comparing our two best systems, labeled (U+U+ and R+R+), with all the other methods.

4) Per-Image Evaluation
Performance on the INRIA set is usually reported with the per-window methodology to avoid post-processing biases, assuming that better per-window performance yields better per-image performance. However, (Dollar et al., 2009b) empirically showed that the per-window methodology fails to predict per-image performance and is therefore not adequate for real applications. Thus, we evaluate the per-image accuracy using the source code available from (Dollar et al., 2009b), which matches bounding boxes with the 50% PASCAL matching measure.
In figure 5.5, we compare our best result (11.5%) to the latest state-of-the-art results (8.7%) gathered and published on the Caltech Pedestrians website. The results are ordered by miss rate (the lower the better) at 1 false positive per image on average (1 FPPI). The value of 1 FPPI is meaningful for pedestrian detection because in real world applications it is desirable to limit the number of false alarms.
It can be seen from figure 5.4 that unsupervised initialization significantly improves performance (14.8% vs 11.5%). The number of labeled images in the INRIA dataset is relatively small, which limits the capability of supervised learning algorithms. However, an unsupervised method can model large variations in pedestrian pose, scale and clutter with much better success.
Top performing methods (Dollar et al., 2009a), (Dollar et al., 2010), (Felzenszwalb et al., 2010), (Walk et al., 2010) also contain several components that our simplistic model does not. Probably the most important of all is color information, whereas we have trained our systems only on gray-scale images. Another important aspect is training on multi-resolution inputs (Dollar et al., 2009a), (Dollar et al., 2010), (Felzenszwalb et al., 2010). Currently, we train our systems on fixed scale inputs with very small variation. Additionally, we have used much lower resolution images than the top performing systems to train our models (78 × 38 vs 128 × 64 in (Walk et al., 2010)). Finally, some models (Felzenszwalb et al., 2010) use deformable body part models to improve their performance, whereas we rely on a much simpler pipeline of feature extraction and linear classification.
Our aim in this work was to show that an adaptable feature extraction system that learns its parameters from available data can perform comparably to the best systems for pedestrian detection. We believe that by including color features and using multi-resolution inputs, our system's performance would increase.

n) Sparse Coding by Variational Marginalization
In chapters 2 and 3, we have shown that the optimal solution to the sparse coding problem is unstable under small perturbations of the input signal. The predictor function Fe provides a smooth mapping and alleviates the stability problem, as shown in chapter 2. Additionally, in chapter 3 we developed an extension to learn invariance from data by extending the sparse coding problem using complex cell models from neuroscience, which is equivalent to the group sparse coding formulation with overlapping pools. However, we have always assumed that the optimal solution is the minimum of a certain energy function, amounting to finding the maximum a posteriori (MAP) solution rather than the maximum likelihood solution. In this chapter, we provide a variational marginalization (MacKay, 2003) model of the sparse coding problem for finding more stable representations. Very similar approaches that use a variational formulation for sparse modeling have already been published in the literature (Girolami, 2001; Wipf et al., 2004; Seeger, 2008). In those works, both the code inference and dictionary update steps are modeled using variational marginalization; here we only formulate the sparse coding step using variational methods and keep using stochastic gradient descent for dictionary updates, since the instability of the solution is due to the MAP solution of code inference using an overcomplete dictionary.

o) Variational Marginalization for Sparse Coding
One can write the probability distribution corresponding to a given energy function E(x) through the Gibbs distribution as:

P(x) = exp(−βE(x)) / Z(β, λ)     (6.1)

where Z(β, λ) is the partition function (normalization coefficient). For high dimensional problems, computing Z is intractable, and Energy-Based Models (LeCun et al., 2006) avoid computing this term. On the other hand, using variational free energy minimization techniques (MacKay, 2003), one can find a simpler distribution Q(x) that approximates P(x) and obtain a lower bound on Z(β, λ).
The free energy of a system is:

F = −(1/β) ln Z(β, λ)

and the corresponding variational free energy under the approximating distribution Q(x) is obtained by substituting the probability distribution from equation (6.1):

F̃ = ⟨E(x)⟩_Q + (1/β) ⟨ln Q(x)⟩_Q = F + (1/β) KL(Q ‖ P)

One can see that the variational free energy F̃ is an upper bound on the free energy F, since the difference is the (non-negative) KL-divergence between Q and P.
In this section we focus on the ℓ1 minimization sparse coding model without the predictor:

and, rewriting eq. (6.4), the corresponding variational free energy is:

The above equation is valid for any distribution Q(x); we assume a Gaussian with mean m and diagonal covariance S (such that σ_i² = S_ii, i = 1..d).
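The bound can be checked numerically on a toy discrete system (an illustration of our own, with β = 1): the variational free energy E_Q[E] − H(Q) exceeds the true free energy −log Z by exactly KL(Q‖P), and the bound is tight when Q equals the Gibbs distribution.

```python
import numpy as np

def free_energies(E, q):
    """True free energy F = -log Z and variational free energy
    F_var = E_q[E] - H(q) for a discrete energy vector E, at beta = 1."""
    Z = np.exp(-E).sum()
    F = -np.log(Z)
    F_var = (q * E).sum() + (q * np.log(q)).sum()  # energy term minus entropy
    return F, F_var

E = np.array([0.0, 1.0, 3.0])
q = np.array([0.5, 0.3, 0.2])           # arbitrary trial distribution
p = np.exp(-E) / np.exp(-E).sum()       # the Gibbs distribution itself
F, F_var = free_energies(E, q)
assert F_var >= F                        # variational upper bound
assert np.isclose(free_energies(E, p)[1], F)  # tight when Q = P
```

Minimizing F̃ over the parameters of Q therefore drives Q toward P while simultaneously tightening the bound on −log Z.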

Below we give a detailed expansion of equation (6.8). The expectation of the energy term can be expanded as follows:

By using the linearity of expectation and the fact that zᵀAz = Tr(Azzᵀ), the quadratic term can be simplified as:

A random variable y such that y = |x| with x ∼ N(μ, σ²) has a folded normal distribution (Leone et al., 1961). The pdf of y is given as:
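For reference, the folded normal density, the standard result cited above (Leone et al., 1961), can be written down and sanity-checked numerically:

```python
import numpy as np

def folded_normal_pdf(y, mu, sigma):
    """pdf of y = |x| for x ~ N(mu, sigma^2): the normal density
    reflected about zero, i.e. phi(y) + phi(-y) for y >= 0,
    and zero for y < 0."""
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    val = c * (np.exp(-(y - mu) ** 2 / (2 * sigma ** 2))
               + np.exp(-(y + mu) ** 2 / (2 * sigma ** 2)))
    return np.where(y >= 0, val, 0.0)

# the density integrates to one over [0, inf)
ys = np.linspace(0.0, 25.0, 50001)
mass = folded_normal_pdf(ys, 1.0, 2.0).sum() * (ys[1] - ys[0])
assert abs(mass - 1.0) < 1e-3
```

This is the density that appears when taking expectations of the ℓ1 term |z_i| under the Gaussian approximating distribution.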

p) Stability Experiments

Next, we compare the stability of representations obtained from the sparse coding solution and from variational free energy minimization. We take 60 consecutive frames from a video and plot the angle between the representations of all consecutive pairs of frames, as shown in figure 6.4. It can be seen that, as the sparsity coefficient is decreased, the sparse coding solution obtained using the coordinate descent method (Li and Osher) produces less stable representations, since the solution will contain many more active code units to achieve a better reconstruction, and with an overcomplete dictionary there are many similar configurations that achieve the same error level. The curves corresponding to variational free energy minimization are all produced using a fixed value of 0.5 for the variance parameter while varying the sparsity coefficient, and it can be seen that even for high values of the sparsity coefficient, variational free energy minimization produces more stable representations.
In order to quantify the effect of this stability improvement on recognition performance using variational free energy minimization, more controlled experiments on recognition and classification tasks should be performed.

Figure 41: Reconstruction error vs ℓ1-norm sparsity penalty for coordinate descent sparse coding and variational free energy minimization.

Figure 42: Angle between the representations obtained for two consecutive frames for different parameter values, using sparse coding and variational free energy minimization.


In this paper we analysed and reviewed object recognition methods, focusing on those based on matching of local features. We presented a literature survey and stated the relationship to other recognition methods. Examples of successful applications in realistic conditions were presented, demonstrating the strengths of the local methods. The applications included recognition of household objects in a database of 100 objects, recognition of buildings in a database of 200 buildings, retrieval of advertisements and wide-baseline stereo matching. The challenging and interesting problem of object categorisation was not covered.

In this thesis we have developed unsupervised methods for learning feature extractors that can be used in deep architectures for visual recognition. We have: 1. trained sparse models simultaneously with a feed-forward regressor to achieve efficient inference; 2. merged complex cell models with sparse models to learn locally invariant feature extractors; 3. developed convolutional sparse modeling that learns translation invariant dictionaries over images; 4. shown that the non-linear functions that are used for building deep architectures have a very significant effect on recognition performance.
We have extended sparse modeling algorithms by jointly training a feed-forward predictor function. The resulting algorithm was named Predictive Sparse Decomposition (PSD). The output representations from this predictor function were then compared with optimal sparse representations on an object recognition task using the Caltech 101 dataset, and slightly better performance was achieved. This modification enabled the use of sparse modeling algorithms for training generic feature extractors.
We have shown that modeling invariant representations can be achieved using a group sparsity criterion with overlapping group definitions instead of a per-component ℓ1 sparsity criterion. When the trained feature extractor was compared with the SIFT feature extractor on the Caltech 101 dataset, it achieved very similar recognition performance; however, when applied to the MNIST handwritten digit classification task, the proposed system performed significantly better, since its parameters are adapted to the particular task at hand rather than being optimized only for natural images.
We have then used the PSD algorithm to build hierarchical feature extractors, as explained in chapter 4, and tested them on three benchmark datasets: Caltech 101, MNIST and NORB. We have shown that the form of the non-linear functions used between the stages of a hierarchical system becomes very important, especially when there is a lack of supervised labeled data.

Finally, in order to reduce the redundancy introduced into the dictionary by training on local image patches, we have developed convolutional sparse modeling. A dictionary that is trained convolutionally contains a richer set of features than a dictionary of the same size trained on image patches, and using feature extractors trained convolutionally increases recognition performance compared to patch based training.

The unsupervised methods proposed in this thesis are a continuation of a general unsupervised learning framework that contains a feed-forward predictor function and a linear decoding function, first proposed in (Ranzato et al., 2006). (Ranzato et al., 2007b) extended the algorithm proposed in (Ranzato et al., 2006) to encode translational invariance and to build hierarchical feature extraction systems for object recognition. In (Ranzato et al., 2007a) a new sparsifying penalty function was introduced instead of the temporal soft-max used in (Ranzato et al., 2006), and the predictor and dictionary parameters were constrained to be transposes of each other to prevent them from growing. In PSD (Kavukcuoglu et al.), we used an ℓ1 penalty for enforcing sparsity on the representation and also replaced the symmetry constraint with a unit L2-norm constraint on the dictionary elements, which also enabled us to use various different architectures for the predictor function. We then extended PSD using complex cell models to learn invariant representations in (Kavukcuoglu et al., 2009), and in (Kavukcuoglu et al., 2010) we introduced convolutional sparse modeling together with better predictor functions for sparse representations and efficient convolutional sparse coding algorithms.
In summary, we have developed algorithms and methods to achieve a successful object recognition system that has a hierarchical structure, can be trained using unlabeled data for improved accuracy, can produce locally invariant representations, and is efficient enough to be used on large datasets and in real-time vision applications. There is much more to be done to achieve this goal; however, we believe the models proposed in this work are valuable contributions towards realizing this objective. Several interesting problems are left open for future research. In particular, developing an adaptive contrast normalization function that reduces correlations between features rather than using a fixed one, integrating invariant models into the hierarchical structure, and developing integrated learning algorithms for deep networks that improve on layer-wise training seem like the most important ones.
We have proposed hierarchical kernel descriptors for extracting image features layer by layer. Our approach is based on the observation that kernel descriptors can be recursively used to produce features at different levels. We have compared hierarchical kernel descriptors to current state-of-the-art algorithms and shown that our hierarchical kernel descriptors have the best accuracy to date on CIFAR10, a large scale visual object recognition dataset. In addition, we also evaluated our hierarchical kernel descriptors on a large RGB-D dataset and demonstrated their ability to generate a rich feature set from multiple sensor modalities, which is critical for boosting accuracy. In the future, we plan to investigate deep hierarchies of kernel descriptors to see whether more layers are helpful for object recognition.


[23] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International
Journal on Computer Vision, 2004.
[24] D.G. Lowe. Object recognition from local scale-invariant fea-tures. In ICCV99, pages
11501157, 1999.
[25] D.G. Lowe. Local feature view clustering for 3d object recog-nition. In CVPR01, pages
I:682688, 2001.
[26] Jir Matas, Ondrej Chum, Martin Urban, and Tomas Pajdla. Robust wide baseline
stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision
Conference, pages 384393, 2002.
[27] Jir Matas, Stepan Obdrzalek, and Ondrej Chum. Local ane frames for wide-
baseline stereo. In ICPR02, August 2002.
[28] K. Mikolajczyk and C. Schmid. Indexing based on scale in-variant interest points. In
ICCV01, pages I: 525531, 2001.

[29] K. Mikolajczyk and C. Schmid. An ane invariant interest point detector. In ECCV02,
page I: 128 ., 2002.
[30] F. Mindru, T. Moons, and L. Van Gool. Recognizing color patterns irrespective of
viewpoint and illumination. In Pro-ceedings of the Computer Vision and Pattern Recognition,
pages 368373, 1999.
[31] B. Moghaddam and A. Pentland. Probabilistic visual learn-ing for object detection. In
International Conference on Computer Vision (ICCV95), pages 786793, Cambridge, USA,
June 1995.
[32] J.L. Mundy and A. Zisserman. Geometric Invariance in Computer Vision. Book, 1992.
[33] H. Murase and S.K. Nayar. Image spotting of 3d objects using parametric eigenspace
representation. In SCIA95, pages 325 332, 1995.
[34] P. Navarrete and J. Ruiz del Solar. Comparative study be-tween dierent eigenspace-
based approaches for face recogni-tion. Lecture Notes in Computer Science, 2275:178??,
[35] S.K. Nayar, S.A. Nene, and H. Murase. Real-time 100 object recognition system. In
ARPA96, pages 12231228, 1996.
[36] R.C. Nelson and A. Selinger. Perceptual grouping hierarchy for 3d object recognition and
representation. In DARPA98, pages 157163, 1998.
[37] Stepan Obdrzalek and Jir Matas. Local ane frames for im-age retrieval. In The
Challenge of Image and Video Retrieval (CIVR2002), July 2002.
[38] Stepan Obdrzalek and Jir Matas. Object recognition using local ane frames on
distinguished regions. In The British Machine Vision Conference (BMVC02), September
[39] Stepan Obdrzalek and Jir Matas. Image retrieval using local compact dct-based
representation. In DAGM 2003: Proceed-ings of the 25th DAGM Symposium, pages 490
497, 9 2003.
[40] K. Ohba and K. Ikeuchi. Detectability, uniqueness and re-liability of eigen windows for
stable verification of partially occluded objects. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 19(9):10431048, September 1997.
[41] A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for
face recognition. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR94), Seattle, WA, June 1994.
[42] A.R. Pope. Model-based object recognition: A survey of recent research. In Univ. of
British Columbia, 1994.

[43] Philip Pritchett and Andrew Zisserman. Matching and re-construction from widely
separated views. Lecture Notes in Computer Science, 1506:7885, 1998.
[44] Philip Pritchett and Andrew Zisserman. Wide baseline stereo matching. In ICCV, pages
754760, 1998.
[45] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3d ob-ject modeling and
recognition using ane-invariant patches and multi-view spatial constraints. In CVPR03,
pages II: 272277, 2003.
[46] F. Schaalitzky and A. Zisserman. Geometric grouping of repeated elements within
images. In BMVC98, 1998.
[47] F. Schaalitzky and A. Zisserman. Viewpoint invariant tex-ture matching and wide
baseline stereo. In ICCV01, pages II: 636643, 2001.
[48] F. Schaalitzky and A. Zisserman. Automated scene match-ing in movies. In CIVR02,
pages 186197, 2002.
[49] F. Schaalitzky and A. Zisserman. Multi-view matching for unordered image sets, or
how do i organize my holiday snaps?. In ECCV02, page I: 414 ., 2002.
[50] F. Schaalitzky and A. Zisserman. Automated location matching in movies. CVIU, 92(2-
3):236264, November 2003.
[51] B. Schiele and J.L. Crowley. Object recognition using mul-tidimensional receptive field
histograms. In ECCV96, pages I:610619, 1996.
[52] B. Schiele and J.L. Crowley. Probabilistic object recogni-tion using multidimensional
receptive field histograms. In ICPR96, 1996.
[53] B. Schiele and J.L. Crowley. Recognition without correspon-dence using
multidimensional receptive field histograms. In-ternational Journal on Computer Vision,
36(1):3150, Jan-uary 2000.
[54] C. Schmid. Constructing models for content-based image retrieval. In CVPR01, pages
II:3945, 2001.
[55] C. Schmid and R. Mohr. Combining grey value invariants with local constraints for object
recognition. In CVPR96, pages 872877, 1996.
[56] C. Schmid and R. Mohr. Image retrieval using local charac-terization. In ICIP96, page
18A1, 1996.
[57] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. PAMI,
19(5):530535, May 1997.

[58] A. Selinger and R.C. Nelson. A perceptual grouping hierarchy for appearance-based 3d
object recognition. CVIU, 76(1):83 92, October 1999.
[59] A. Selinger and R.C. Nelson. Improving appearance-based object recognition in cluttered
backgrounds. In ICPR00, pages Vol I: 4650, 2000.
[60] A. Selinger and R.C. Nelson. Appearance-based object recog-nition using multiple views.
In CVPR01, pages I:905911, 2001.
[61] A. Selinger and R.C. Nelson. Minimally supervised acqui-sition of 3d recognition models
from cluttered images. In CVPR01, pages I:213220, 2001.
[62] Andrea Selinger. Analysis and Applications of Feature-Based Object Recognition. PhD
thesis, Dept. of Computer Science, University of Rochester, New York, 2001.
[63] Hao Shao, Tomas Svoboda, and Luc Van Gool. ZuBuD Zurich Buildings Database
for Image Based Recognition. Technical Report 260, Computer Vision Laboratory, Swiss
Federal Institute of Technology, March 2003.
[64] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in
videos. In ICCV03, pages 14701477, 2003.
[65] D. Skocaj, H. Bischof, and A. Leonardis. A robust pca al-gorithm for building
representations from panoramic images. In ECCV02, page IV: 761 ., 2002.
[66] D. Skocaj and A. Leonardis. Weighted and robust incremen-tal method for subspace
learning. In ICCV03, pages 1494 1501, 2003.
[67] M.J. Swain and D.H. Ballard. Indexing via color histograms. In Ph. D., 1990.
[68] M.J. Swain and D.H. Ballard. Color indexing. International Journal on Computer Vision,
7(1):1132, November 1991.
[69] Daniel L. Swets and Juyang Weng. Using discriminant eigen-features for image retrieval.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):831836, 1996.
[70] A. Turina, T. Tuytelaars, T. Moons, and L.J. Van Gool. Grouping via the matching of
repeated patterns. In ICAPR01, pages 250259, 2001.
[71] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive
Neuroscience, 3(1):7186, 1991.
[72] T. Tuytelaars, L. Van Gool, L. Dhaene, and R. Koch. Match-ing of anely invariant
regions for visual servoing. In In-ternational Conference on Robotics and Automation, pages
16011606, 1999.

[73] T. Tuytelaars, A. Turina, and L.J. Van Gool. Noncombinato-rial detection of regular
repetitions under perspective skew. PAMI, 25(4):418432, April 2003.
[74] T. Tuytelaars and L.J. Van Gool. Wide baseline stereo matching based on local, anely
invariant regions. In BMVC00, 2000.
[75] Tinne Tuytelaars. Local Invariant Features for Registration and Recognition. PhD
thesis, University of Leuven, ESAT - PSI, 2000.
[76] Tinne Tuytelaars and Luc J. Van Gool. Content-based image retrieval based on local
anely invariant regions. In Visual Information and Information Systems, pages 493
500, 1999.
[77] Weiss. Geometric invariants and object recognition. Inter-national Journal on
Computer Vision, 10(3):207231, June 1993.
[78] M. H. Yang, D. Roth, and N. Ahuja. Learning to Recognize 3D Objects with SNoW.
In ECCV 2000, pages 439454, 2000. L. Bo, X. Ren, and D. Fox. Kernel Descriptors for
Visual Recognition. In NIPS, December 2010.
[79] L. Bo and C. Sminchisescu. Efficient Match Kernel between Sets of Features for
Visual Recognition. In NIPS, 2009.
[80] M. Carreira-Perpinan and G. Hinton. On Contrastive Diver-gence Learning. In
AISTATS, 2005.
[81] A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised
feature learning. In NIPS*2010 Workshop on Deep Learning, 2010.
[82] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
CVPR, 2005.
[83] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-fei. ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR, 2009.
[84] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ra-manan. Object detection with
discriminatively trained part based models. IEEE PAMI, 32(9):16271645, 2009.
[85] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets.
Neural Computation, 18(7):15271554, 2006.
[86] G. Hinton and R. Salakhutdinov. Reducing the Dimensional-ity of Data with Neural
Networks. Science, 313(5786):504 507, July 2006.
[87] G. Hua, M. Brown, and S. Winder. Discriminant embedding for local image
descriptors. In ICCV, 2007.

[88] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage
architecture for object recog-nition? In ICCV, 2009.
[89] A. Johnson and M. Hebert. Using spin images for efficient object recognition in
cluttered 3D scenes. IEEE PAMI, 21(5), 1999.
[90] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features
through topographic filter maps. In CVPR, 2009.
[91] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical
report, 2009.
[92] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierar-chical Multi-View RGB-D
Object Dataset. In IEEE Inter-national Conference on on Robotics and Automation, 2011.
[93] Q. Le, J. Ngiam, Z. C. Chia, P. Koh, and A. Ng. Tiled con-volutional neural networks.
In NIPS. 2010.
[94] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In NIPS,
pages 801808, 2006.
[95] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[96] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91110,
[97] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns. IEEE PAMI, 24(7):971987,
[98] B. Olshausen and D. Field. Emergence of simple-cell re-ceptive field properties by
learning a sparse code for natural images. Nature, 381:607609, 1996.
[99] J. Philbin, M. Isard, J. Sivic, and A. Zisserman. Descriptor learning for efficient
retrieval. In ECCV, 2010.
[100] M. Ranzato, K. A., and G. Hinton. Factored 3-way restricted boltzmann machines for
modeling natural images. In AIS-TATS, 2010.
[101] M. Ranzato and G. Hinton. Modeling pixel means and co-variances using factorized
third-order boltzmann machines. In CVPR, 2010.
[102] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for
nonparametric object and scene recognition. IEEE PAMI, 30(11):19581970, 2008.
[103] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Guo. Locality-constrained linear
coding for image classification. In CVPR, 2010.

[104] C. Y and L. Saul. Kernel methods for deep learning. In NIPS, 2009.
[105] Y. B. Y. LeCun, L. Bottou and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):22782324, 1998.
[106] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyra-mid matching using
sparse coding for image classification. In CVPR, 2009.
[107] K. Yu and T. Zhang. Improved local coordinate coding using local tangents. In ICML,
pages 12151222, 2010.
[108] K. Yu, T. Zhang, and Y. Gong. Nonlinear Learning using Local Coordinate Coding .
In NIPS, December 2009.