
Arcangelo Distante
Cosimo Distante

Handbook of Image Processing and Computer Vision

Volume 3: From Pattern to Object

Arcangelo Distante
Institute of Applied Sciences and Intelligent Systems
Consiglio Nazionale delle Ricerche
Lecce, Italy

Cosimo Distante
Institute of Applied Sciences and Intelligent Systems
Consiglio Nazionale delle Ricerche
Lecce, Italy

ISBN 978-3-030-42377-3
ISBN 978-3-030-42378-0 (eBook)


https://doi.org/10.1007/978-3-030-42378-0
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my parents and my family, Maria and Maria Grazia
—Arcangelo Distante

To my parents, to my wife Giovanna, and to my children Francesca and Davide
—Cosimo Distante
Preface

In the last 20 years, interdisciplinary research in the fields of physics, information technology and cybernetics, the numerical processing of Signals and Images, and electrical and electronic technologies has led to the development of Intelligent Systems.
The so-called Intelligent Systems (or Intelligent Agents) represent the most advanced and innovative frontier of research in the electronic and computer fields, able to directly influence the quality of life, the competitiveness and production methods of companies, to monitor and evaluate the environmental impact, to make public service and management activities more efficient, and to protect people's safety.
The study of an intelligent system, regardless of the area of use, can be simplified into three essential components:

1. The first interacts with the environment for the acquisition of data of the domain
of interest, using appropriate sensors (for the acquisition of Signals and Images);
2. The second analyzes and interprets the data collected by the first component,
also using learning techniques to build/update adequate representations of the
complex reality in which the system operates (Computational Vision);
3. The third chooses the most appropriate actions to achieve the objectives assigned to the intelligent system (choice of Optimal Decision Models), interacting with the first two components and with human operators in the case of application solutions based on man–machine cooperative paradigms (the current evolution of automation, including industrial automation).
The information content of this manuscript is framed within this scenario of knowledge advancement for the development of Intelligent Systems; it reports the authors' multi-year research and teaching experience, together with the scientific insights available in the literature. In particular, the manuscript, divided into three parts (volumes), deals with the aspects of the sensory subsystem that allow an intelligent system to perceive the environment in which it is immersed and to act autonomously.
The first volume describes the set of fundamental processes of artificial vision that lead to the formation of the digital image from energy. It analyzes the phenomena of light propagation (Chaps. 1 and 2), the theory of color perception (Chap. 3), the impact of the optical system (Chap. 4), the transduction of light energy (the luminous flux) into an electrical signal (by the photoreceptors), and the transduction of the electrical signal (with continuous values) into discrete values (pixels), i.e., the conversion of the signal from analog to digital (Chap. 5). These first five chapters summarize the process of acquisition of the 3D scene, in symbolic form, represented numerically by the pixels of the digital image (a 2D projection of the 3D scene).
Chapter 6 describes the geometric, topological, quality, and perceptual information of the digital image. It defines the metrics and the modes of aggregation and correlation between pixels, useful for defining symbolic structures of the scene at a higher level than the pixel. The organization of the data for the different processing levels is described in Chap. 7, while Chap. 8 presents the representation and description of the homogeneous structures of the scene.
Chapter 9 begins the description of the image processing algorithms for improving the visual quality of the image, based on point, local, and global operators. Algorithms operating in the spatial domain and in the frequency domain are shown, highlighting with examples the significant differences between the various algorithms, also from the point of view of the computational load.
The second volume begins with the chapter describing the boundary extraction
algorithms based on local operators in the spatial domain and on filtering techniques
in the frequency domain.
Chapter 2 presents the fundamental linear transformations that have immediate application in the field of image processing, in particular to extract the essential characteristics contained in the images. These characteristics, which effectively summarize the global informational content of the image, are then used for other image processing tasks: classification, compression, description, etc. Linear transforms are also used, as global operators, to improve the visual quality of the image (enhancement), to attenuate noise (restoration), or to reduce the dimensionality of the data (data reduction).
In Chap. 3, the geometric transformations of the images are described; they are necessary in different applications of artificial vision, either to correct geometric distortions introduced during acquisition (for example, images acquired while the objects or the sensors are moving, as in the case of satellite and/or aerial acquisitions) or to introduce desired visual geometric effects. In both cases, the geometrical operator must be able to reproduce, as accurately as possible, the image with the same initial information content through the image resampling process.
In Chap. 4, Reconstruction of the degraded image (image restoration), a set of techniques is described that perform quantitative corrections on the image to compensate for the degradations introduced during the acquisition and transmission process. These degradations are represented by the fog or blurring effect caused by the optical system and by the motion of the object or the observer, by the noise caused by the opto-electronic system and by the nonlinear response of the sensors, and by random noise due to atmospheric turbulence or, more generally, to the process of digitization and transmission. While the enhancement techniques tend to reduce the degradations present in the image in qualitative terms, improving its visual quality even when there is no knowledge of the degradation model, the restoration techniques are used instead to eliminate or quantitatively attenuate the degradations present in the image, starting from the hypothesis that the degradation models are known.
Chapter 5, Image Segmentation, describes different segmentation algorithms; segmentation is the process of dividing the image into homogeneous regions, where all the pixels that correspond to an object in the scene are grouped together. The grouping of pixels into regions is based on a homogeneity criterion that distinguishes them from one another. Segmentation algorithms based on criteria of similarity of pixel attributes (color, texture, etc.) or based on geometric criteria of spatial proximity of pixels (Euclidean distance, etc.) are reported. These criteria are not always valid, and in different applications it is necessary to integrate other information in relation to the a priori knowledge of the application context (application domain). In this last case, the grouping of the pixels is based on comparing the hypothesized regions with the a priori modeled regions.
Chapter 6, Detectors and descriptors of points of interest, describes the most used algorithms to automatically detect significant structures (known as points of interest, corners, features) present in the image and corresponding to stable physical parts of the scene. The ability of such algorithms is to detect and identify physical parts of the same scene in a repeatable way, even when the images are acquired under varying illumination and a changing observation point, with a possible change of the scale factor.
The third volume describes the artificial vision algorithms that detect objects in the scene and attempt their identification, their 3D reconstruction, their arrangement and location with respect to the observer, and their possible motion.
Chapter 1, Object recognition, describes the fundamental algorithms of artificial vision to automatically recognize the objects of the scene, an essential capability of all vision systems of living organisms. While a human observer recognizes even complex objects, apparently in an easy and timely manner, for a vision machine the recognition process is difficult, requires considerable computation time, and the results are not always optimal. The algorithms for selecting and extracting features become fundamental to the process of object recognition. In various applications, it is possible
to have an a priori knowledge of all the objects to be classified because we know the
sample patterns (meaningful features) from which we can extract useful information
for the decision to associate (decision-making) each individual of the population to a
certain class. These sample patterns (training set) are used by the recognition system
to learn significant information about the objects population (extraction of statistical
parameters, relevant characteristics, etc.). The recognition process compares the
features of the unknown objects to the model pattern features, in order to uniquely
identify their class of membership. Over the years, various disciplinary sectors (machine learning, image analysis, object recognition, information retrieval, bioinformatics, biomedicine, intelligent data analysis, data mining, …) and application sectors (robotics, remote sensing, artificial vision, …) have seen researchers propose different recognition methods and develop different algorithms based on different classification models. Although the proposed algorithms have a common purpose, they differ in the properties attributed to the classes of objects (the clusters) and in the model with which these classes are defined (connectivity, statistical distribution, density, …). The diversity of disciplines, especially between automatic data extraction (data mining) and machine learning, has led to subtle differences, especially in the use of results and in terminology, which is sometimes contradictory, perhaps because of the different objectives. For example, in data mining the dominant interest is the automatic extraction of groupings, while in automatic classification the discriminating power of the pattern classes is fundamental. The topics of this chapter span both aspects related to machine learning and those of recognition based on statistical methods. For simplicity, the algorithms described are grouped according to the method of classifying objects into supervised methods (based on deterministic, statistical, neural, and nonmetric models such as syntactic models and decision trees) and non-supervised methods, i.e., methods that do not use any prior knowledge to extract the classes to which the patterns belong.
In Chap. 2, RBF, SOM, Hopfield and deep neural networks, four different types of neural networks are described: Radial Basis Functions (RBF), Self-Organizing Maps (SOM), Hopfield networks, and deep neural networks. RBF uses a different approach in the design of a neural network, based on a hidden layer (unique in the network) composed of neurons in which radial basis functions are defined, hence the name Radial Basis Functions, which performs a nonlinear transformation of the input data supplied to the network. These neurons are the basis functions for the input data (vectors). The reason a nonlinear transformation is used in the hidden layer, followed by a linear one in the output layer, is that a pattern classification problem cast (by the nonlinear transformation from the input layer to the hidden one) into a much larger space is more likely to be linearly separable than in a small-sized space. From this observation derives the reason why the hidden layer is generally larger than the input one (i.e., the number of hidden neurons is greater than the cardinality of the input signal).
The SOM network, on the other hand, has an unsupervised learning model and has the originality of autonomously grouping the input data on the basis of their similarity, without evaluating the convergence error with external information on the data. It is useful when there is no exact knowledge of the data with which to classify them. It is inspired by the topology of the brain cortex, considering the connectivity of the neurons and, in particular, the behavior of an activated neuron and its influence on neighboring neurons, whose connections are reinforced compared to those further away, which become weaker.
With the Hopfield network, the learning model is supervised, with the ability to store information and retrieve it from even partial content of the original information. Its originality rests on physical foundations that have revitalized the entire field of neural networks. The network is associated with an energy function to be minimized during its evolution through a succession of states, until a final state corresponding to the minimum of the energy function is reached. This feature allows it to be used to set up and solve an optimization problem, in terms of an objective function to be associated with an energy function. The chapter concludes with the description of convolutional neural networks (CNN), by now the most widespread since 2012, based on the deep learning architecture.
In Chap. 3, Texture Analysis, the algorithms that characterize the texture present in the images are shown. Texture is an important component for the recognition of objects. In the field of image processing, the term texture has come to denote any geometric and repetitive arrangement of the gray levels (or colors) of an image. In this context, texture becomes an additional strategic component for solving the problem of object recognition, the segmentation of images, and problems of synthesis. Some of the algorithms described are based on the mechanisms of human visual perception of texture. They are useful for the development of systems for the automatic analysis of the information content of an image, obtaining a partitioning of the image into regions with different textures.
In Chap. 4, 3D Vision Paradigms, the algorithms that analyze 2D images to reconstruct a scene, typically of 3D objects, are reported. A 3D vision system faces the fundamental difficulty typical of inverse problems: from single 2D images, which are only a two-dimensional projection of the 3D world (a partial acquisition), it must be able to reconstruct the 3D structure of the observed scene and eventually define a relationship between the objects. The 3D reconstruction takes place starting from 2D images that contain only partial information about the 3D world (loss of information in the 3D→2D projection) and possibly using the geometric and radiometric calibration parameters of the acquisition system. The mechanisms of human vision are illustrated, based also on the a priori prediction and knowledge of the world. In the field of artificial vision, the current trend is to develop 3D systems oriented to specific domains but with characteristics that go in the direction of imitating certain functions of the human visual system. 3D reconstruction methods are described that use multiple cameras observing the scene from multiple points of view, or sequences of time-varying images acquired from a single camera. Theories of vision are described, from the Gestalt laws to the paradigm of Marr's vision and the computational models of stereo vision.
In Chap. 5, Shape from Shading (SfS), the algorithms to reconstruct the shape of the visible 3D surface using only the brightness variation information (shading, that is, the variations of gray level or color) present in the image are reported. The inverse problem of reconstructing the shape of the visible surface from the changes in brightness in the image is known as the Shape from Shading problem. The reconstruction of the visible surface should not be strictly understood as a 3D reconstruction of the surface. In fact, from a single point of observation of the scene, a monocular vision system cannot estimate a distance measure between observer and visible object, so with the SfS algorithms there is a nonmetric but qualitative reconstruction of the 3D surface. The theory of SfS is described, based on the knowledge of the light source (direction and distribution), the reflectance model of the scene, the observation point, and the geometry of the visible surface, which together contribute to the image formation process. The relationships between the light intensity values of the image and the geometry of the visible surface are derived (in terms of the orientation of the surface, point by point) under some lighting conditions and reflectance models. Other 3D surface reconstruction algorithms based on the Shape from xxx paradigm are also described, where xxx can be texture, structured light projected onto the surface to be reconstructed, or 2D images of the focused or defocused surface.
In Chap. 6, Motion Analysis, the algorithms for perceiving the dynamics of the scene are reported, analogous to what happens in the vision systems of different living beings. With motion analysis algorithms, it is possible to derive the 3D motion, almost in real time, from the analysis of sequences of time-varying 2D images.
Paradigms of motion analysis have shown that the perception of movement derives from information about the objects, evaluating the presence of occlusions, texture, contours, etc. The algorithms described concern the perception of the movement occurring in physical reality, and not of the apparent movement. Different methods of motion analysis are examined, from those with limited computational load, such as those based on the difference of time-varying images, to the more complex ones based on optical flow, considering application contexts with different amounts of motion and scene-environments of different complexity.
In the context of rigid bodies, the algorithms are described that, from the motion analysis derived from a sequence of time-varying images, estimate, in addition to the movement (translation and rotation), the reconstruction of the 3D structure of the scene and the distance of this structure from the observer. Useful information is obtained in the case of a mobile observer (robot or vehicle) to estimate the collision time. In fact, the methods for solving the problem of the 3D reconstruction of the scene acquire a sequence of images with a single camera whose intrinsic parameters remain constant even if not known (uncalibrated camera), and the motion is also unknown. The proposed methods address the solution of an inverse problem. Algorithms are described to reconstruct the 3D structure of the scene (and the motion), i.e., to calculate the coordinates of the 3D points of the scene whose 2D projection is known in each image of the time-varying sequence.
Finally, Chap. 7, Camera Calibration and 3D Reconstruction, presents the algorithms for calibrating the image acquisition system (normally a single camera or a stereo system), which are fundamental for extracting metric information from the image (detecting an object's size or determining accurate measurements of the object–observer distance). The various camera calibration methods are described that determine the intrinsic parameters (focal length, horizontal and vertical dimension of the single photoreceptor of the sensor, or the aspect ratio, the size of the sensor matrix, the coefficients of the radial distortion model, the coordinates of the principal point or optical center) and the extrinsic parameters that define the geometric transformation from the world reference system to that of the camera. The epipolar geometry introduced in Chap. 5 is described in this chapter to solve the problem of the correspondence of homologous points in a stereo vision system with the two cameras either calibrated or not. Epipolar geometry simplifies the search for the homologous points between the stereo images by introducing the Essential matrix and the Fundamental matrix. The algorithms for estimating these matrices are also described, given a priori the corresponding points of a calibration platform.
With epipolar geometry, the problem of searching for homologous points is reduced to mapping a point of one image onto the corresponding epipolar line in the other image. It is possible to simplify the correspondence problem further through a one-dimensional point-to-point search between the stereo images. This is accomplished with the image alignment procedure known as stereo image rectification. The different algorithms are described, some based on the constraints of epipolar geometry (uncalibrated cameras, where the fundamental matrix includes the intrinsic parameters) and some on the knowledge, or not, of the intrinsic and extrinsic parameters of calibrated cameras. Chapter 7 ends with the section on the 3D reconstruction of the scene in relation to the knowledge available to the stereo acquisition system. The triangulation procedures for the unambiguous 3D reconstruction of the geometry of the scene are described, given the 2D projections of the homologous points of the stereo images and the known calibration parameters of the stereo system. If only the intrinsic parameters are known, the 3D geometry of the scene is reconstructed by estimating the extrinsic parameters of the system, up to an undeterminable scale factor. If the calibration parameters of the stereo system are not available but only the correspondences between the stereo images are known, the structure of the scene is recovered through an unknown homography transformation.

Francavilla Fontana, Italy
December 2020

Arcangelo Distante
Cosimo Distante
Acknowledgments

We thank all the fellow researchers of the Department of Physics of Bari, of the
Institute of Intelligent Systems for Automation of the CNR (National Research
Council) of Bari, and of the Institute of Applied Sciences and Intelligent Systems
“Eduardo Caianiello” of the Unit of Lecce, who have indicated errors and parts to
be reviewed. We mention them in chronological order: Grazia Cicirelli, Marco Leo,
Giorgio Maggi, Rosalia Maglietta, Annalisa Milella, Pierluigi Mazzeo, Paolo
Spagnolo, Ettore Stella, and Nicola Veneziani. A thank you is addressed to Arturo
Argentieri for the support on the graphic aspects of the figures and the cover.
Finally, special thanks are given to Maria Grazia Distante who helped us realize the
electronic composition of the volumes by verifying the accuracy of the text and the
formulas.

Contents

1 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Prior Knowledge and Features Selection . . . . . . . . . . . . . . . . . . 4
1.4 Extraction of Significant Features . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.2 Selection of Significant Features . . . . . . . . . . . . . . . . . . 10
1.5 Interactive Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Deterministic Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6.1 Linear Discriminant Functions . . . . . . . . . . . . . . . . . . . 19
1.6.2 Generalized Discriminant Functions . . . . . . . . . . . . . . . 20
1.6.3 Fisher’s Linear Discriminant Function . . . . . . . . . . . . . . 21
1.6.4 Classifier Based on Minimum Distance . . . . . . . . . . . . . 28
1.6.5 Nearest-Neighbor Classifier . . . . . . . . . . . . . . . . . . . . . . 30
1.6.6 K-means Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.6.7 ISODATA Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6.8 Fuzzy C-means Classifier . . . . . . . . . . . . . . . . . . . . . . . 35
1.7 Statistical Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.7.1 MAP Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.7.2 Maximum Likelihood Classifier—ML . . . . . . . . . . . . . . 39
1.7.3 Other Decision Criteria . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.7.4 Parametric Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . 48
1.7.5 Maximum Likelihood Estimation—MLE . . . . . . . . . . . . 49
1.7.6 Estimation of the Distribution Parameters
with the Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.7.7 Comparison Between Bayesian Learning
and Maximum Likelihood Estimation . . . . . . . . . . . . . . 56
1.8 Bayesian Discriminant Functions . . . . . . . . . . . . . . . . . . . . . . . . 57
1.8.1 Classifier Based on Gaussian Probability Density . . . . . . 58
1.8.2 Discriminant Functions for the Gaussian Density . . . . . . 61


1.9 Mixtures of Gaussian—MoG . . . . . . . . . . . . . . . . . . . . . . . . 66


1.9.1 Parameters Estimation of the Gaussians Mixture
with the Maximum Likelihood—ML . . . . . . . . . . . . . . . . 68
1.9.2 Parameters Estimation of the Gaussians Mixture
with Expectation–Maximization—EM . . . . . . . . . . . . . . 69
1.9.3 EM Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.9.4 Nonparametric Classifiers . . . . . . . . . . . . . . . . . . . . . . . 76
1.10 Method Based on Neural Networks . . . . . . . . . . . . . . . . . . . . . . 87
1.10.1 Biological Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 87
1.10.2 Mathematical Model of the Neural Network . . . . . . . . . 89
1.10.3 Perceptron for Classification . . . . . . . . . . . . . . . . . . . . . 93
1.10.4 Linear Discriminant Functions and Learning . . . . . . . . . 100
1.11 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
1.11.1 Multilayer Perceptron—MLP . . . . . . . . . . . . . . . . . . . . 109
1.11.2 Multilayer Neural Network for Classification . . . . . . . . . 111
1.11.3 Backpropagation Algorithm . . . . . . . . . . . . . . . . . . . . . 113
1.11.4 Learning Mode with Backpropagation . . . . . . . . . . . . . . 118
1.11.5 Generalization of the MLP Network . . . . . . . . . . . . . . . 120
1.11.6 Heuristics to Improve Backpropagation . . . . . . . . . . . . . 121
1.12 Nonmetric Recognition Methods . . . . . . . . . . . . . . . . . . . . . . . . 125
1.13 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
1.13.1 Algorithms for the Construction of Decision Trees . . . . . 127
1.14 ID3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
1.14.1 Entropy as a Measure of Homogeneity
of the Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
1.14.2 Information Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
1.14.3 Other Partitioning Criteria . . . . . . . . . . . . . . . . . . . . . . . 135
1.15 C4.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
1.15.1 Pruning Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . 138
1.15.2 Post-pruning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 140
1.16 CART Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
1.16.1 Gini Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
1.16.2 Advantages and Disadvantages of Decision Trees . . . . . 146
1.17 Hierarchical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
1.17.1 Agglomerative Hierarchical Clustering . . . . . . . . . . . . . . 149
1.17.2 Divisive Hierarchical Clustering . . . . . . . . . . . . . . . . . . 151
1.17.3 Example of Hierarchical Agglomerative Clustering . . . . 152
1.18 Syntactic Pattern Recognition Methods . . . . . . . . . . . . . . . . . . . 154
1.18.1 Formal Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
1.18.2 Language Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 158
1.18.3 Types of Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
1.18.4 Grammars for Pattern Recognition . . . . . . . . . . . . . . . . 163
1.18.5 Notes on Other Methods of Syntactic Analysis . . . . . . . 171

1.19 String Recognition Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 172


1.19.1 String Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
1.19.2 Boyer–Moore String-Matching Algorithm . . . . . . . . . . . 174
1.19.3 Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
1.19.4 String Matching with Error . . . . . . . . . . . . . . . . . . . . . . 188
1.19.5 String Matching with Special Symbol . . . . . . . . . . . . . . 190
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
2 RBF, SOM, Hopfield, and Deep Neural Networks . . . . . . . . . . . . . . 193
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
2.2 Cover Theorem on Pattern Separability . . . . . . . . . . . . . . . . . . . 194
2.3 The Problem of Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
2.4 Micchelli’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
2.5 Learning and Ill-Posed Problems . . . . . . . . . . . . . . . . . . . . . . . . 200
2.6 Regularization Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
2.7 RBF Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
2.8 RBF Network Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
2.9 Learning Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
2.9.1 Centers Set and Randomly Selected . . . . . . . . . . . . . . . 208
2.9.2 Selection of Centers Using Clustering Techniques . . . . . 210
2.10 Kohonen Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
2.10.1 Architecture of the SOM Network . . . . . . . . . . . . . . . . . 211
2.10.2 SOM Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
2.11 Network Learning Vector Quantization-LVQ . . . . . . . . . . . . . . . 220
2.11.1 LVQ2 and LVQ3 Networks . . . . . . . . . . . . . . . . . . . . . 223
2.12 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
2.12.1 Hopfield Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
2.12.2 Application of Hopfield Network to Discrete States . . . . 228
2.12.3 Continuous State Hopfield Networks . . . . . . . . . . . . . . . 233
2.12.4 Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
2.13 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
2.13.1 Deep Traditional Neural Network . . . . . . . . . . . . . . . . . 239
2.13.2 Convolutional Neural Networks-CNN . . . . . . . . . . . . . . 240
2.13.3 Operation of a CNN Network . . . . . . . . . . . . . . . . . . . . 248
2.13.4 Main Architectures of CNN Networks . . . . . . . . . . . . . . 256
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
3 Texture Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
3.2 The Visual Perception of the Texture . . . . . . . . . . . . . . . . . . . . . 261
3.2.1 Julesz’s Conjecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
3.2.2 Texton Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
3.2.3 Spectral Models of the Texture . . . . . . . . . . . . . . . . . . . 265

3.3 Texture Analysis and its Applications . . . . . . . . . . . . . . . . . . . . 265


3.4 Statistical Texture Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
3.4.1 First-Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
3.4.2 Second-Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 268
3.4.3 Higher Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 268
3.4.4 Second-Order Statistics with Co-Occurrence Matrix . . . . 270
3.4.5 Texture Parameters Based on the Co-Occurrence
Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
3.5 Texture Features Based on Autocorrelation . . . . . . . . . . . . . . . . 276
3.6 Texture Spectral Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
3.7 Texture Based on the Edge Metric . . . . . . . . . . . . . . . . . . . . . . . 281
3.8 Texture Based on the Run Length Primitives . . . . . . . . . . . . . . . 282
3.9 Texture Based on MRF, SAR, and Fractals Models . . . . . . . . . . 286
3.10 Texture by Spatial Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
3.10.1 Spatial Filtering with Gabor Filters . . . . . . . . . . . . . . . . 295
3.11 Syntactic Methods for Texture . . . . . . . . . . . . . . . . . . . . . . . . . . 302
3.12 Method for Describing Oriented Textures . . . . . . . . . . . . . . . . . . 303
3.12.1 Estimation of the Dominant Local Orientation . . . . . . . . 304
3.12.2 Texture Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
3.12.3 Intrinsic Images with Oriented Texture . . . . . . . . . . . . . 307
3.13 Tamura’s Texture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
4 Paradigms for 3D Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
4.1 Introduction to 3D Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
4.2 Toward an Optimal 3D Vision Strategy . . . . . . . . . . . . . . . . . . . 316
4.3 Toward the Marr’s Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
4.4 The Fundamentals of Marr’s Theory . . . . . . . . . . . . . . . . . . . . . 319
4.4.1 Primal Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
4.4.2 Toward a Perceptive Organization . . . . . . . . . . . . . . . . . 325
4.4.3 The Gestalt Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
4.4.4 From the Gestalt Laws to the Marr Theory . . . . . . . . . . 338
4.4.5 2.5D Sketch of Marr’s Theory . . . . . . . . . . . . . . . . . . . 342
4.5 Toward 3D Reconstruction of Objects . . . . . . . . . . . . . . . . . . . . 344
4.5.1 From Image to Surface: 2.5D Sketch Map
Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
4.6 Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
4.6.1 Binocular Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
4.6.2 Stereoscopic Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
4.6.3 Stereopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
4.6.4 Neurophysiological Evidence of Stereopsis . . . . . . . . . . 358
4.6.5 Depth Map from Binocular Vision . . . . . . . . . . . . . . . . 374
4.6.6 Computational Model for Binocular Vision . . . . . . . . . . 377

4.6.7 Simple Artificial Binocular System . . . . . . . . . . . . . . . . 383


4.6.8 General Binocular System . . . . . . . . . . . . . . . . . . . . . . . 389
4.7 Stereo Vision Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
4.7.1 Point-Like Elementary Structures . . . . . . . . . . . . . . . . . 394
4.7.2 Local Elementary Structures and Correspondence
Calculation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
4.7.3 Sparse Elementary Structures . . . . . . . . . . . . . . . . . . . . 406
4.7.4 PMF Stereo Vision Algorithm . . . . . . . . . . . . . . . . . . . . 406
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
5 Shape from Shading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
5.2 The Reflectance Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
5.2.1 Gradient Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
5.3 The Fundamental Relationship of Shape from Shading
for Diffuse Reflectance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
5.4 Shape from Shading-SfS Algorithms . . . . . . . . . . . . . . . . . . . . . 423
5.4.1 Shape from Stereo Photometry with Calibration . . . . . . . 426
5.4.2 Uncalibrated Stereo Photometry . . . . . . . . . . . . . . . . . . 433
5.4.3 Stereo Photometry with Calibration Sphere . . . . . . . . . . 436
5.4.4 Limitations of Stereo Photometry . . . . . . . . . . . . . . . . . 439
5.4.5 Surface Reconstruction from the Orientation Map . . . . . 440
5.5 Shape from Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
5.6 Shape from Structured Light . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
5.6.1 Shape from Structured Light with Binary Coding . . . . . . 454
5.6.2 Gray Code Structured Lighting . . . . . . . . . . . . . . . . . . . 456
5.6.3 Pattern with Gray Level . . . . . . . . . . . . . . . . . . . . . . . . 458
5.6.4 Pattern with Phase Modulation . . . . . . . . . . . . . . . . . . . 458
5.6.5 Pattern with Phase Modulation and Binary Code . . . . . . 462
5.6.6 Methods Based on Colored Patterns . . . . . . . . . . . . . . . 462
5.6.7 Calibration of the Camera-Projector Scanning
System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
5.7 Shape from (de)Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
5.7.1 Shape from Focus (SfF) . . . . . . . . . . . . . . . . . . . . . . . . 465
5.7.2 Shape from Defocus (SfD) . . . . . . . . . . . . . . . . . . . . . . 472
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
6 Motion Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
6.2 Analogy Between Motion Perception and Depth Evaluated
with Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
6.3 Toward Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
6.3.1 Discretization of Motion . . . . . . . . . . . . . . . . . . . . . . . . 487
6.3.2 Motion Estimation—Continuous Approach . . . . . . . . . . 493

6.3.3 Motion Estimation—Discrete Approach . . . . . . . . . . . . . 495


6.3.4 Motion Analysis from Image Difference . . . . . . . . . . . . 496
6.3.5 Motion Analysis from the Cumulative Difference
of Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
6.3.6 Ambiguity in Motion Analysis . . . . . . . . . . . . . . . . . . . 498
6.4 Optical Flow Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
6.4.1 Horn–Schunck Method . . . . . . . . . . . . . . . . . . . . . . . . . 504
6.4.2 Discrete Least Squares Horn–Schunck Method . . . . . . . 509
6.4.3 Horn–Schunck Algorithm . . . . . . . . . . . . . . . . . . . . . . . 511
6.4.4 Lucas–Kanade Method . . . . . . . . . . . . . . . . . . . . . . . . . 513
6.4.5 BBPW Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
6.4.6 Optical Flow Estimation for Affine Motion . . . . . . . . . . 521
6.4.7 Estimation of the Optical Flow for Large
Displacements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
6.4.8 Motion Estimation by Alignment . . . . . . . . . . . . . . . . . 526
6.4.9 Motion Estimation with Techniques Based
on Interest Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
6.4.10 Tracking Based on the Object Dynamics—Kalman
Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
6.5 Motion in Complex Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
6.5.1 Simple Method of Background Subtraction . . . . . . . . . . 557
6.5.2 BS Method with Mean or Median . . . . . . . . . . . . . . . . . 558
6.5.3 BS Method Based on the Moving Gaussian Average . . . 559
6.5.4 Selective Background Subtraction Method . . . . . . . . . . . 560
6.5.5 BS Method Based on Gaussian Mixture
Model (GMM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
6.5.6 Background Modeling Using Statistical Method
Kernel Density Estimation . . . . . . . . . . . . . . . . . . . . . . 563
6.5.7 Eigen Background Method . . . . . . . . . . . . . . . . . . . . . . 564
6.5.8 Additional Background Models . . . . . . . . . . . . . . . . . . . 565
6.6 Analytical Structure of the Optical Flow of a Rigid Body . . . . . . 566
6.6.1 Motion Analysis from the Optical Flow Field . . . . . . . . 570
6.6.2 Calculation of Collision Time and Depth . . . . . . . . . . . . 573
6.6.3 FOE Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
6.6.4 Estimation of Motion Parameters for a Rigid Body . . . . 580
6.7 Structure from Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
6.7.1 Image Projection Matrix . . . . . . . . . . . . . . . . . . . . . . . . 585
6.7.2 Methods of Structure from Motion . . . . . . . . . . . . . . . . 590
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
7 Camera Calibration and 3D Reconstruction . . . . . . . . . . . . . . . . . . . 599
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
7.2 Influence of the Optical System . . . . . . . . . . . . . . . . . . . . . . . . . 600

7.3 Geometric Transformations Involved in Image Formation . . . . . . 602


7.4 Camera Calibration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
7.4.1 Tsai Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
7.4.2 Estimation of the Perspective Projection Matrix . . . . . . . 610
7.4.3 Zhang Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
7.4.4 Stereo Camera Calibration . . . . . . . . . . . . . . . . . . . . . . 625
7.5 Stereo Vision and Epipolar Geometry . . . . . . . . . . . . . . . . . . . . 627
7.5.1 The Essential Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 629
7.5.2 The Fundamental Matrix . . . . . . . . . . . . . . . . . . . . . . . 634
7.5.3 Estimation of the Essential and Fundamental Matrix . . . 638
7.5.4 Normalization of the 8-Point Algorithm . . . . . . . . . . . . . 641
7.5.5 Decomposition of the Essential Matrix . . . . . . . . . . . . . 642
7.5.6 Rectification of Stereo Images . . . . . . . . . . . . . . . . . . . . 645
7.5.7 3D Stereo Reconstruction by Triangulation . . . . . . . . . . 655
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
1 Object Recognition

1.1 Introduction

The ability to recognize objects is an essential feature of all living organisms. Various creatures have different abilities and modes of recognition. Very important are the sensorial nature and the modality of interpretation of the available sensorial data. Evolved organisms like humans can recognize other humans through sight, voice, or the way they write, while less evolved organisms like the dog can recognize other animals or humans simply using the olfactory and visual sense organs. These activities are classified as recognition.
While a human observer recognizes even complex objects, apparently in an easy and timely manner, for a vision machine the recognition process is difficult, requires considerable calculation time, and the results are not always optimal. The goal of a vision machine is to automatically recognize the objects that appear in the scene. Normally, a generic object observed for the recognition process is called pattern. In several applications, the pattern adequately describes a generic object with the purpose of recognizing it.
A pattern recognition system can be specialized to recognize people, animals,
territory, artifacts, electrocardiogram, biological tissue, etc. The most general ability
of a recognition system can be to discriminate between a population of objects and
determine those that belong to the same class. For example, an agro-food company
needs a vision system to recognize different qualities of fruit (apples, pears, etc.),
depending on the degree of ripeness and size. This means that the recognition system
will have to examine the whole population and classify each fruit thus obtaining
different groupings that identify certain quality classes. The de facto recognition
system is a classification system.
If we consider, for simplicity, the population of only apples to determine the class
to which each apple belongs, it is necessary that be adequately described, that is to


find its intrinsic characteristics (features),1 which are functional to determining the correct class of membership (classification). The expert, in this case, can propose to characterize the apple population using the color (which indicates the state of ripeness) and the geometric shape (almost circular), possibly with a measure of the area; to give greater robustness to the classification process, the characteristic weight could also be used. We have actually illustrated how to describe a pattern (in this case, the apple pattern is seen as a set of features that characterizes the apple object), an activity that in the literature is known as the selection of significant features (namely, feature selection). The activity of producing the measures associated with the characteristics of the pattern is instead called feature extraction.
The selection and extraction of the features are important activities in the design of a recognition system. In various applications, it is possible to have a priori knowledge of the population of the objects to be classified because we know the sample patterns from which useful information can be extracted for the decision to associate (decision-making) each individual of the population with a specific class. These sample patterns (e.g., the training set) are used by the recognition system to learn meaningful information about the population (extraction of statistical parameters, relevant features, etc.).
In this context of object recognition, Clustering (group analysis) becomes central, i.e., the task of grouping a collection of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those of the other groups. In the example of apples, the concept of similarity is associated with the color, area, and weight measurements used as descriptors of the apple pattern to define apple clusters with different qualities. A system for the recognition of very different objects, for example, the apple and pear patterns, requires a more complex description of the patterns, in terms of selection and extraction of significant features, and of the clustering methods, for a correct classification.
In this case, for the recognition of the quality classes of the apple or pear pattern, it is reasonable to add the shape feature (elliptical), in the context of feature selection, to discriminate between the two patterns. The recognition process compares the features of the unknown objects with the features of the sample patterns, in order to uniquely identify the class to which they belong. In essence, classification is the final goal of a recognition system based on clustering.
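As a concrete illustration of the grouping idea just described, the sketch below clusters fruit patterns described by the feature vector (hue, area, weight, elongation) with a minimal k-means procedure written from scratch in NumPy. The feature values, the number of clusters, and the helper name kmeans are hypothetical choices made only for this example, not the specific procedure adopted in this book (the k-means classifier is treated in Sect. 1.6.6).

```python
import numpy as np

# Hypothetical fruit patterns; each row is a feature vector
# x = (hue, area, weight, elongation). The first three rows mimic
# "apple-like" objects, the last three "pear-like" ones.
X = np.array([
    [0.95, 78.0, 150.0, 1.05],
    [0.90, 80.0, 155.0, 1.10],
    [0.92, 75.0, 148.0, 1.00],
    [0.35, 95.0, 180.0, 1.60],
    [0.30, 99.0, 185.0, 1.70],
    [0.33, 92.0, 178.0, 1.65],
])

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means clustering: returns one label per pattern and the centroids."""
    rng = np.random.default_rng(seed)
    # Standardize each feature so that no single measurement (e.g., weight)
    # dominates the Euclidean distance used as the similarity measure.
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each pattern to the nearest centroid (minimum-distance rule).
        d = np.linalg.norm(Xn[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of the patterns assigned to it.
        new_centroids = np.array([
            Xn[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, _ = kmeans(X, k=2)
print(labels)  # the two fruit groups emerge, e.g., [0 0 0 1 1 1] (label ids may swap)
```

Standardizing the heterogeneous measurements before clustering is the design choice that keeps hue, area, and weight comparable under a single Euclidean metric.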
In this introduction to the analysis of the observed data of a population of objects, two activities emerged: the selection and the extraction of the features. The feature extraction activity also has the task of expressing the features with an adequate metric to be used, appropriately, by the decision component, that is, the classifier that determines, based on the chosen clustering method, which class to associate with each object. In reality, the features of an object do not always describe it correctly; they often represent only a good approximation, and therefore, even when developing a complex classification process with different recognition strategies, it can be difficult to assign the object to a unique class of membership.

1 From now on, the two words “characteristic” and “feature” will be used interchangeably.

1.2 Classification Methods

The first group analysis studies were introduced to classify the psychological traits of personality [1,2]. Over the years, several disciplinary sectors (machine learning, image analysis, object recognition, information retrieval, bioinformatics, biomedicine, intelligent data analysis, data mining, ...) and application sectors (robotics, remote sensing, artificial vision, ...) have seen researchers propose different clustering methods and develop different algorithms based on different types of clusters. Although the proposed algorithms have a common purpose, they differ in the properties attributed to the clusters and in the model with which the clusters are defined (connectivity, statistical distribution, density, ...). The diversity of disciplines, especially those of data mining2 and machine learning, has led to subtle differences, especially in the use of results, and to sometimes contradictory terminology, perhaps caused by the different objectives. For example, in data mining the dominant interest is the automatic extraction of groups, while in automatic classification the discriminating power of the classes to which the patterns belong is fundamental. The topics of this chapter overlap between aspects related to machine learning and those of recognition based on statistical methods. Object classification methods can be divided into two categories:

Supervised Methods, i.e., methods based on a priori knowledge of the data (cluster model, decision rules, ...) to better define the discriminating power of the classifier. Given a population of objects and a set of measures with which to describe the single object, represented by a pattern vector x, the goal of these methods is to assign the pattern to one of the N possible classes Ci, i = 1, ..., N. A decision-making strategy is adopted to partition the space of the measures into N regions Ωi, i = 1, ..., N. The expectation is to use rules or decision functions such as to obtain separate, nonoverlapping regions Ωi, bounded by separation surfaces. This can be achieved by iterating the classification process after analyzing the decision rules and the adequacy of the measures associated with the patterns. This category also includes methods in which the probability density function (PDF) of the feature vector for a given class is assumed to be known.

In this category, the following methods will be described:

(a) Deterministic models


(b) Statistical models
(c) Neural models
(d) Nonmetric Models (Syntactic and Decision Trees)

2 Literally, it means automatic data extraction, normally coming from a large population of data.

Non-Supervised Methods, i.e., methods that do not use any prior knowledge to extract the classes to which the patterns belong. In the statistical literature, these methods are often referred to as clustering. In this context, the objects initially cannot be labeled, and the goal is to explore the data to find an approach that groups them, once features that distinguish one group from another have been selected. The assignment of the labels for each cluster takes place subsequently, with the intervention of the expert. An intermediate phase can also be considered in which a supervised approach is applied to the first classes generated, in order to extract representative samples of some classes.

This category includes the hierarchical and partitioning clustering methods, which in turn differ in the clustering criteria (models) adopted (a minimal example follows the list below):

(a) Single Linkage (Nearest Neighbor)


(b) Complete Linkage (Furthest Neighbor)
(c) Centroid Linkage
(d) Sum of squares
(e) Partitioning.
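As a minimal sketch of how the linkage criteria (a)–(d) differ in practice, the example below builds an agglomerative grouping of hypothetical fruit feature vectors with SciPy, changing only the linkage method; here 'ward' stands in for the sum-of-squares criterion, and all data values are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical feature vectors (hue, area, weight), standardized below.
X = np.array([
    [0.95, 78.0, 150.0],
    [0.90, 80.0, 155.0],
    [0.92, 75.0, 148.0],
    [0.35, 95.0, 180.0],
    [0.30, 99.0, 185.0],
    [0.33, 92.0, 178.0],
])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Only the linkage criterion changes between the methods listed above:
# 'single'   = (a) nearest neighbor
# 'complete' = (b) furthest neighbor
# 'centroid' = (c) centroid linkage
# 'ward'     = (d) sum-of-squares criterion (Ward's method)
for method in ("single", "complete", "centroid", "ward"):
    Z = linkage(X, method=method)                    # build the dendrogram bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into two clusters
    print(f"{method:8s} -> {labels}")
```

Partitioning methods, item (e), instead fix the number of groups in advance, as in the k-means sketch shown earlier.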

1.3 Prior Knowledge and Features Selection

The description of an object is defined by a set of scalar entities of the object itself that are combined to constitute a vector of features x (also called feature vector). Considering the features studied in Chap. 8 on forms in Vol. I, an object can be completely described by a vector x whose components x = (x1, x2, ..., xM) (represented, for example, by measures such as dimensions, compactness, perimeter, area, etc.) are some of the geometric and topological information of the object. In the case of multispectral images, the problem of feature selection becomes the choice of the most significant bands, where the feature vector represents the pixel pattern with the radiometric information related to each band (for example, the spectral bands from the visible to the infrared). In other contexts, for example, in the case of a population of economic data, the problem of feature selection may be more complex because some measurements are not accessible.
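To make the feature vector x = (x1, x2, ..., xM) concrete, the following sketch computes three of the geometric measures mentioned above (area, an approximate perimeter, and compactness) from a binary object mask and assembles them into a vector. The synthetic mask, the 4-neighbor perimeter estimate, and the particular choice of components are assumptions made only for this illustration.

```python
import numpy as np

def geometric_features(mask):
    """Assemble a feature vector x = (area, perimeter, compactness) from a binary mask."""
    mask = mask.astype(bool)
    area = float(mask.sum())

    # Approximate perimeter: count object pixels with at least one 4-connected
    # background neighbor (padding handles pixels on the image border).
    padded = np.pad(mask, 1, constant_values=False)
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] &
        padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    perimeter = float((mask & ~interior).sum())

    # Compactness perimeter**2 / (4*pi*area): close to 1 for a disk-like object
    # (only approximately, with this pixel-based perimeter estimate) and larger
    # for elongated or irregular shapes.
    compactness = perimeter**2 / (4.0 * np.pi * area)
    return np.array([area, perimeter, compactness])

# Synthetic "almost circular" object, standing in for an apple silhouette.
yy, xx = np.mgrid[0:32, 0:32]
disk = (yy - 16) ** 2 + (xx - 16) ** 2 <= 10 ** 2
print(geometric_features(disk))  # the feature vector x for this object
```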
An approach to the analysis of the features to be selected consists in considering
the nature of the available measures, i.e., whether of a physical nature (as in the case
of spectral bands and color components) or of a structural nature (such as geometric
ones) or derived from mathematical transformations of the previous measures. The
physical features are derived from sensory measurements of which information on the
behavior of the sensor and the level of uncertainty of the observed measurements may
be available. Furthermore, information can be obtained (or found experimentally) on
the correlation level of the measurements observed for the various sensors. Similarly,
information on the characteristics of structural measures can be derived.

Fig. 1.1 Functional scheme of an object recognition system based on knowledge

Fig. 1.2 Functional scheme of an object recognition system based on template matching

It is useful to analyze and select the features by also considering the strategies to
be adopted in the recognition process. In general, the recognition process involves
the analysis of the features extracted from the observed data (population of objects,
images, ...) and the formulation of hypotheses to define elements of similarity
between the observed data and those extracted from the sample data (in the
supervised context). Furthermore, it provides for the formal verification of the
hypotheses using the models of the objects, possibly reformulating the object
similarity approach. Hypothesis generation can also be useful to reduce the search
domain by considering only some features. Finally, the recognition process selects,
among the various hypotheses, as the correct pattern the one with the highest
similarity value on the basis of the evidence (see Fig. 1.1).
Typical object recognition vision systems formulate hypotheses and identify the
object on the basis of the best similarity. Vision systems based on a priori knowledge
consider hypotheses only as the starting point, while the verification phase is assigned
the task of selecting the object. For example, when an object is recognized by
comparing the characteristics extracted from the observed scene with those of the
sample prototypes, the approach called template matching is used, and the hypothesis
formation phase is completely eliminated (see Fig. 1.2).
From the previous considerations, it emerges that a recognition system is
characterized by different components:

1. types of a priori knowledge,
2. types of features extracted and to be considered,
3. type of comparison between the features extracted from the observed data and
   those of the model,
4. how hypotheses are formed,
5. verification methods.

Points (1), (2), and (3) are mutually interdependent. The representation of an object
depends on the type of the object itself. In some cases, certain geometric and
topological characteristics are significant; in other cases, they may be of little
significance and/or redundant. Consequently, the feature extraction component must
determine, among those foreseen for the representation of the object, the most
adequate ones, considering their robustness and the difficulty of extracting them from
the input data. At the same time, in the selection of features, those that can best be
extracted from the model and that can best be used for comparison must be chosen.
Using many exhaustive features can make the recognition process more difficult
and slow. Hypothesis formation is normally a heuristic approach that can reduce the
search domain. Depending on the type of application, a priori knowledge can be
formalized by associating a prior probability or confidence level with the various
model objects. These prediction measures constitute the elements used to evaluate
the likelihood of the presence of objects based on the determined characteristics. The
verification of these hypotheses leads to the methods of selecting the models of
the objects that best resemble those extracted from the input data. All plausible
hypotheses must be examined to verify the presence of the object or discard it.
In machine vision applications, where geometric modeling is used, objects can be
modeled and verified using camera location information or other known information
about the scene (e.g., known references in the scene). In other applications, the
hypotheses cannot be verified and one can proceed with non-supervised approaches.
From the above considerations, the functional scheme of a recognition system can be
modified by eliminating the verification phase or the hypothesis formation phase (see
Fig. 1.1).

1.4 Extraction of Significant Features

In Chap. 5 Vol. II on Segmentation and in Chap. 8 Vol. I on Forms, we examined
the image processing algorithms that extract some significant characteristics to
describe the objects of a scene. Depending on the type of application, the
object to be recognized as a physical entity can be anything. In the study of terres-
trial resources and their monitoring, the objects to be recognized are for example the
various types of woods and crops, lakes, rivers, roads, etc. In the industrial sector, on
the other hand, for the vision system of a robot cell, the objects to be recognized are,
for example, the individual components of a more complex object, to be assembled
or to be inspected. For a food industry, for example, a vision system can be used to
recognize different qualities of fruit (apples, pears, etc.) in relation to the degree of
ripeness and size.
In all these examples, similar objects can be divided into several distinct subsets.
This new grouping of objects makes sense only if we specify the meaning of a similar
object and if we find a mechanism that correctly separates similar objects to form so-
called classes of objects. Objects with common features are considered similar. For
example, the set of ripe and first quality apples are those characterized by a particular
color, for example, yellow-green, with a given almost circular geometric shape,

and a measure of the area, higher than a certain threshold value. The mechanism
that associates a given object to a certain class is called clustering. The object
recognition process essentially uses the most significant features of objects to group
them by classes (classification or identification).
Question: is the number of classes known before the classification process?
Answer: normally classes are known and intrinsically defined in the specifications
imposed by the application but are often not known and will have to be explored by
analyzing the observed data.
For example, in the application that automatically separates classes of apples,
the number of classes is imposed by the application itself (for commercial reasons,
three classes would be sufficient) deciding to classify different qualities of apples
considering the parameters of shape, color, weight, and size.
In the application of remotely sensed images, the classification of forests, in rela-
tion to their deterioration caused by environmental impact, should take place without
knowing in advance the number of classes that should automatically emerge in rela-
tion to the types of forests actually damaged or polluted. The selection of significant
features is closely linked to the application. It is usually made based on experience
and intuition. Some considerations can, however, be made for an optimum choice of
the significant features.
The first consideration concerns their discriminating ability: different values should
correspond to different classes. In the previous example, the measurement of the
area was a significant feature for the classification of apples into three classes of
different sizes (small, medium, and large).
The second consideration regards reliability: objects belonging to the same class
should always yield similar values of the feature. Considering the same example, it
can happen that a large apple has a color different from that typical of the class of
ripe apples. In this case, color may be a nonsignificant feature.
The third consideration relates to the correlation between the various features.
The two features area and weight of the apple are strongly correlated, since it is
foreseeable that the weight increases in proportion to the measure of the area. This
means that these two characteristics express the same property and are therefore
redundant features, which it may not be meaningful to select together for the
classification process. Correlated features can instead be used together when the goal
is to attenuate noise. In fact, in multisensory applications, it occurs that some
sensors are strongly correlated with each other, but the relative measurements are
affected by different noise models.
In other situations, it is possible to accept the redundancy of a feature, on the
condition that it is not correlated with at least one of the selected features. In the case
of the ripe-apple class, it has been seen that the color and area features alone may not
be discriminating; in this case, the weight feature becomes useful, although strongly
correlated with the area, under the hypothesis that ripe apples are on average less
heavy than less mature ones.

1.4.1 Feature Space

The number of selected features must be adequate and contained to limit the level
of complexity of the recognition process. A classifier that uses few features can
produce inappropriate results. In contrast, the use of a large number of features
leads to an exponential growth of computational resources without the guarantee
of obtaining good results. The set of all the components xi of the pattern vector
x = (x1, x2, . . . , xM) constitutes the M-dimensional feature space. We have already
seen how, for remote sensing images, different spectral characteristics (infrared,
visible, etc.) are available for each pixel pattern, in order to extract from the observed
data the homogeneous regions corresponding to different areas of the territory. In
these applications, classifying the territory using a single characteristic (a single
spectral component) would be difficult. In this case, the analyst must select the
significant bands, filtered from noise, and possibly reduce the dimensionality of the
pixel patterns.
In other contexts, the features associated with each object are extracted producing
features with higher level information (for example, normalized spatial moments,
Fourier descriptors, etc., described in Chap. 8 Vol. I on Forms) related to elementary
regions representative of objects. An optimal feature selection is obtained when the
pattern vectors x belonging to the same class lie close to each other when projected
into the feature space.
These vectors constitute the set of similar objects represented in a single class
(cluster). In the feature space, different classes (corresponding to different types of
objects) can accumulate and can be separated using appropriate discriminant
functions. The latter represent the cluster separation hypersurfaces in the
M-dimensional feature space, which control the classification process. Hypersurfaces
can be simplified with hyperplanes, and in this case we speak of linearly separable
discriminant functions.
Figure 1.3 shows a two-dimensional (2D) example of the feature space (x1 , x2 )
where 4 clusters are represented separated by linear and nonlinear discriminant func-
tions, representing the set of homogeneous pixels corresponding in the spatial domain
to 5 regions. This example shows that the two selected features x1 and x2 exhibit an
adequate discriminating ability to separate the population of pixels in the space of
the features (x1 , x2 ) which in the spatial domain belong to 5 regions corresponding
to 4 different classes (the example schematizes a portion of land wet by the sea with
a river: class 1 indicates the ground, 2 the river, 3 the shoreline and 4 the sea).
The capability of the classification process is based on the ability to separate the
various clusters without error; in different applications, these are located very close
to each other or are superimposed, generating an incorrect classification (see
Fig. 1.4). This example demonstrates that the selected features do not intrinsically
exhibit a good discriminating power to separate, in the feature space, the patterns that
belong to different classes, regardless of the discriminant function that describes the
cluster separation surface.

Fig. 1.3 Spatial domain represented by a multispectral image with two bands and 2D domain of
features where 4 homogeneous pattern classes are grouped corresponding to different areas of the
territory: bare terrain, river, shoreline, and sea

Fig. 1.4 Features domain (x1, x2) with 3 pattern classes and the graphical representation (2D and
1D) of the relevant Gaussian density functions

The following considerations emerge from the previous examples:

(a) The characteristics of the objects must be analyzed, eliminating the strongly
    correlated ones that do not help to discriminate between objects; correlated
    features can instead be used to filter out noise between related measurements.
(b) The underlying problem of a classifier depends on the fact that the classes of
    objects are not always well separated in the feature space. Very often the feature
    pattern vectors can belong to more than one cluster (see Fig. 1.4) and the
    classifier can make mistakes, associating some of them with an incorrect class of
    membership. This can be avoided by eliminating features that have little
    discriminating power (for example, the feature x1 in Fig. 1.4 does not separate the
    object classes {A, B, C}, while x2 separates the classes {A, C} well).

(c) A solution to the previous point is obtained by performing transformations on the
    feature measures such as to produce, in the feature space, a minimization of the
    intra-class distance, i.e., the mean squared distance between patterns of the same
    class, and a maximization of the inter-class distance, i.e., the mean squared
    distance between patterns that belong to different classes (see Fig. 1.4; a minimal
    sketch of these two quantities is given after this list).
(d) In the selection of the features, the opportunity to reduce their dimensionality
    must be considered. This can be accomplished by trial, verifying the classifier's
    performance after filtering out some features. A more direct way is to use the
    principal component transform (see Sect. 2.10.1 Vol. II), which estimates the
    level of significance of each feature. The objective then becomes the reduction of
    the dimensionality from M dimensions to a reduced space of p dimensions
    corresponding to the most significant components (extraction of significant
    features).
(e) The significance of the features can also be evaluated with nonlinear
    transformations, assessing the performance of the classifier and then giving a
    different weight to the various features. The extraction of significant features can
    also be based on a priori knowledge of the probability distribution of the classes.
    In fact, if the latter is known for each class, the selection of the features can be
    done by minimizing the entropy to evaluate the similarity of the classes.
(f) A classifier does not always obtain better results when a considerable number of
    characteristics is used; indeed, in different contexts, it has been shown that a few
    properly selected features prove to be more discriminating.
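As a concrete illustration of point (c), the following minimal Python/NumPy sketch (with
hypothetical toy data; the function name is ours, not from the text) computes the mean squared
intra-class and inter-class distances for a labeled set of feature vectors:

import numpy as np

def mean_squared_distances(X, labels):
    """Return (intra, inter): mean squared distance between patterns of the
    same class and between patterns of different classes, respectively."""
    # Pairwise squared Euclidean distances between all patterns.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    same = labels[:, None] == labels[None, :]
    # Exclude the diagonal (distance of a pattern from itself).
    off_diag = ~np.eye(len(X), dtype=bool)
    intra = d2[same & off_diag].mean()
    inter = d2[~same].mean()
    return intra, inter

# Hypothetical toy data: two 2D classes.
X = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [3.2, 2.9]])
labels = np.array([0, 0, 1, 1])
print(mean_squared_distances(X, labels))  # small intra-class, large inter-class value

A transformation of the features is considered useful, in the spirit of point (c), when it decreases
the first quantity and increases the second.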

1.4.2 Selection of Significant Features

Let us analyze with an example (see Fig. 1.4) the hypothesized normal distribution
of three classes of objects in the one-dimensional feature spaces x1 and x2 and in the
two-dimensional space (x1, x2). It is observed that class A is clearly separated from
classes B and C, while the latter show overlapping zones in both features x1 and x2.
From the analysis of the features x1 and x2, it is observed that they are correlated
(the patterns are distributed in a dominant way along the diagonal of the plane
(x1, x2)), i.e., to similar values of x1 correspond analogous values of x2. From the
analysis of the one-dimensional distributions, it is observed that class A is well
separable with the single feature x2, while classes B and C cannot be accurately
separated with the distribution of the features x1 and x2, as the latter are highly
correlated.
In general, it is convenient to select only those features that have a high level of
orthogonality, that is, the distribution of the classes across the various features should
be very different: while in one feature the distribution of a class is located toward the
low values, in another feature the same class should have a different distribution.
This can be achieved by selecting or generating uncorrelated features. From the
geometric point of view, this can be visualized by imagining a transformation of the
original variables such that, in the new system of orthogonal axes, they are ordered in
terms of the amount of variance of the original data (see Fig. 1.5).

Fig. 1.5 Geometric interpretation of the PCA transform (Y_PCA = AX). The transform from the
original measures x1, x2 to the principal components y1, y2 is equivalent to rotating the coordinate
axes until the maximum variance is obtained when all the patterns are projected on the OY1 axis

In the chapter on Linear Transformations, Sect. 2.10.1 Vol. II, we described the
proper orthogonal decomposition (POD) transform, better known as principal
component analysis (PCA), which has this property and represents the original data
in a significant way through a small group of new variables, precisely the principal
components. With this data transformation, the expectation is to describe most of the
information (variance) of the original data with a few components. The PCA is a
direct transformation of the original data; no assumption is made about their
distribution or the number of classes present, and it therefore behaves like an
unsupervised feature extraction method.
To better understand how the PCA behaves, consider the pattern distribution (see
Fig. 1.5) represented by the features (x1 , x2 ) which can express, for example, respec-
tively, the length measurements (meters, centimeters, ...) and weight (grams, kilo-
grams, ...), that is, quantities with different units of measurement. For a better graph-
ical visibility of the pattern distribution, in the domain of the features (x1 , x2 ), we
imagine that these realizations have a Gaussian distribution.
With this hypothesis in the feature domain, the patterns are arranged in ellip-
soidal form. If we now rotate the axes from the reference system (x1 , x2 ) to the
system (y1 , y2 ), the ellipsoidal shape of the patterns remains the same while only
the coordinates have changed. In this new system, there can be a convenience in
representing these realizations. Since the axes are rotated, the relationship between
the two reference systems can be expressed as follows:
    
y1 cos θ sin θ x1 y1 = x1 cos θ + x2 sin θ
= or (1.1)
y2 − sin θ cos θ x2 y2 = x1 sin θ + x2 cos θ

where θ is the angle between the homologous (horizontal and vertical) axes of the two
reference systems. It can be observed from these equations that the new coordinate,
i.e., the transformed feature y1, is a linear combination of the length and weight
measurements (with both coefficients positive), while the second new coordinate,
i.e., the feature y2, is also a linear combination of the length and weight
measurements but with coefficients of opposite sign. That said, we observe from
Fig. 1.5 that there is an imbalance in the pattern distribution, with a more pronounced
dispersion along the first axis.

This means that the projection of the patterns on the new axis y1 can be a good
approximation of the entire ellipsoidal distribution. This is equivalent to saying that
the set of realizations represented by the ellipsoid can be significantly represented
by the single new feature y1 = x1 cos θ + x2 sin θ instead of indicating for each
pattern the original measures x1 and x2.
It follows that, considering only the new feature y1, we get a dimensionality
reduction from 2 to 1 to represent the population of the patterns. But be careful: the
concept of meaningful representation with the new feature y1 must be specified. In
fact, y1 can take different values as the angle θ varies. It is, therefore, necessary to
select a value of θ that best represents the relationship that exists between the
population patterns in the feature domain. This is guaranteed by selecting the value
of θ which minimizes the displacement of the points, caused by the projection, with
respect to their original position.
Given that the coordinates of the patterns with respect to the Y1 axis are their
orthogonal projections on the OY1 axis, the solution is given by the line whose
distance from the points is minimal. Indicating with P_k a generic pattern and with
P'_k its orthogonal projection on the axis OY1, the orientation of the best line is
the one that minimizes the sum, given by [3]

$$\sum_{k=1}^{N} \overline{P_k P'_k}^{\,2}$$

If the triangle $P_k O P'_k$ is considered and the Pythagorean theorem is applied, we
obtain

$$\overline{OP_k}^{\,2} = \overline{P_k P'_k}^{\,2} + \overline{OP'_k}^{\,2}$$

Repeating for all the N patterns, adding and dividing by N − 1, we have the following:

$$\frac{1}{N-1}\sum_{k=1}^{N}\overline{OP_k}^{\,2} = \frac{1}{N-1}\sum_{k=1}^{N}\overline{P_k P'_k}^{\,2} + \frac{1}{N-1}\sum_{k=1}^{N}\overline{OP'_k}^{\,2} \qquad (1.2)$$

Analyzing (1.2), it results that the first member is constant for all the patterns
and is independent of the reference system. It follows that choosing the orientation
of the OY1 axis is equivalent to minimizing the expression of the first addend of
(1.2) or maximizing the expression of the second addend of the same equation.
Under the hypothesis that O represents the center of mass of all the patterns,3 the
expression of the second addend, $\frac{1}{N-1}\sum_{k=1}^{N}\overline{OP'_k}^{\,2}$,
corresponds to the variance of the pattern projections on the new axis Y1.
Choosing the OY1 axis that minimizes the sum of the squares of the perpendicular
distances from this axis is equivalent to selecting the OY1 axis in such a way that the

3 Without losing generality, this is achieved by expressing both the input variables x_i and the
output y_i in terms of deviations from the mean.

projections of the patterns on it result in the maximum variance. These assumptions
are based on the search for principal components formulated by Hotelling [4]. It
should be noted that with the principal component approach, reported above, the sum
of the squared perpendicular distances between each pattern and the axis is
minimized, while the least squares approach is different: the distances of the patterns
from the line (represented here by the OY1 axis) are measured along a coordinate
direction rather than perpendicularly, and their squares are summed. This leads to a
different solution (linear regression).
Returning to the principal component approach, the second component is defined in
the direction orthogonal to the first, and represents the maximum of the remaining
variance of the pattern distribution. For patterns with M dimensions, subsequent
components are obtained in a similar way. It is understood that the peculiarity of
the PCA is to represent a set of patterns in the most dispersed way possible along the
principal axes. Imagining a Gaussian distribution of M-dimensional patterns, the
dispersion assumes an ellipsoidal shape,4 whose axes are oriented with the
principal components.
In Sect. 2.10.1 Vol. II, we have shown that to calculate the axes of these ellipsoidal
structures (they represent the density level set), the covariance matrix K S was cal-
culated for a multispectral image S of N pixels with M bands (which here represent
the variables of each pixel pattern). Recall that the covariance matrix K S integrates
the variances of the variables and the covariances between different variables.
Furthermore, in the same paragraph, we showed how to obtain the transform to the
principal components:

$$\mathbf{Y}_{PCA} = \mathbf{A} \cdot \mathbf{S}$$

through the diagonalization of the covariance matrix K_S, obtaining the orthogonal
matrix A having the eigenvectors a_k, k = 1, 2, . . . , M and the diagonal matrix Λ
of the eigenvalues λ_k, k = 1, 2, . . . , M. It follows that the i-th principal component
is given by

$$y_i = \langle \mathbf{x}, \mathbf{a}_i \rangle = \mathbf{a}_i^T\mathbf{x} \quad\text{with } i = 1, \ldots, M \qquad (1.3)$$

with the variances and covariances of the new components expressed by:

$$Var(y_i) = \lambda_i \quad i = 1, \ldots, M \qquad Cov(y_i, y_j) = 0 \quad\text{for } i \neq j \qquad (1.4)$$

4 Indeed, an effective way to represent the graph of the multivariate normal density function
N(0, Σ) is by its level curves of value c. In this case, the function is positive and the level curves
to be examined concern values c > 0, with a positive definite, invertible covariance matrix. It is
shown that the equation of a level curve is the ellipsoid $\mathbf{x}^T\Sigma^{-1}\mathbf{x} = c$,
centered at the origin. In the reference system of the principal components, expressed in the basis
of the eigenvectors of the covariance matrix Σ, the equation of the ellipsoid becomes
$\frac{y_1^2}{\lambda_1} + \cdots + \frac{y_M^2}{\lambda_M} = c$, with the lengths of the
semi-axes equal to $\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_M}$, where λ_i are the eigenvalues of
the covariance matrix. For M = 2, we have elliptic contour lines. If μ ≠ 0, the ellipsoid is centered
at μ.

Finally, we highlight the property that the initial total variance of the pattern
population is equal to the total variance of the principal components:

$$\sum_{i=1}^{M} Var(x_i) = \sigma_1^2 + \cdots + \sigma_M^2 = \sum_{i=1}^{M} Var(y_i) = \lambda_1 + \cdots + \lambda_M \qquad (1.5)$$

although this variance is distributed with different, decreasing weights over the
principal components, given the decreasing order of the eigenvalues:

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M \geq 0$$

In the 2D geometrical representation of the principal components, the major and
minor axes of the ellipsoids mentioned above are aligned, respectively, with the
eigenvectors a_1 and a_2, while their lengths are given by $\sqrt{\lambda_1}$ and
$\sqrt{\lambda_2}$ (considering the eigenvalues ordered in descending order). The
plane identified by the first two axes is called the principal plane, and projecting the
patterns on this plane means calculating their coordinates, given by

$$y_1 = \langle \mathbf{x}, \mathbf{a}_1 \rangle = \mathbf{a}_1^T\mathbf{x} \qquad y_2 = \langle \mathbf{x}, \mathbf{a}_2 \rangle = \mathbf{a}_2^T\mathbf{x} \qquad (1.6)$$

where x indicates the generic input pattern. The process can be extended for all
M principal axes, mutually orthogonal, even if the graphic display becomes difficult
beyond the three-dimensional (3D) representation. Often the graphical representation
is useful at an exploratory level to observe how the patterns are grouped in the features
space, i.e., how the latter are related to each other. For example, in the classification
of multispectral images, it may be useful to explore different 2D projections of
the principal components to explore how the ellipsoidal structures are arranged to
separate the homogeneous pattern classes (as informally anticipated with Fig. 1.4).
The elongated shape of an ellipsoid indicates that one axis is very short with respect
to the other, which informs us of the small variability of that component in that
direction; consequently, projecting all the patterns as ⟨x, a_1⟩, we get the least loss of
information.
This last aspect, the variability of the new components, is connected to the problem
of the variability of the input data, which can differ both in the type of measurement
(for example, in the area-weight case, the first is a measure linked to length, the
second indicates a measure of the force of gravity) and in the dynamics of the range
of variability, even when expressed in the same unit of measurement. The different
variability of the input data tends to influence the first principal components,
thus distorting the exploratory analysis. The solution to this problem is obtained
by activating a standardization procedure of the input data, before applying the
transform to the principal components. This procedure consists in transforming the
original data xi , i = 1, . . . , N into the normalized data zi as follows:
$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \qquad (1.7)$$

where μ_j and σ_j are, respectively, the mean and standard deviation of the j-th
feature, that is,

$$\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij} \qquad \sigma_j = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_{ij} - \mu_j)^2} \qquad (1.8)$$

In this way, each feature has the same mean zero and the same standard deviation
1, and if we calculate the covariance matrix Kz , this coincides with the correlation
matrix Rx . Each element r jk of the latter represents the normalized covariance
between two features, called Pearson’s correlation coefficient (linear relation mea-
sure) between the features x j and xk obtained as follows:

$$r_{jk} = \frac{Cov(x_j, x_k)}{\sigma_{x_j}\,\sigma_{x_k}} = \frac{1}{N-1}\sum_{i=1}^{N} z_{ij}\,z_{ik} \qquad (1.9)$$

By virtue of Hölder's inequality [5], it is shown that the correlation coefficients r_{jk}
satisfy |r_{jk}| ≤ 1; in particular, the elements r_{kk} of the principal diagonal are all
equal to 1 and represent the variance of a standardized feature. A large absolute
value of r_{jk} corresponds to a strong linear relationship between the two features.
For |r_{jk}| = 1, the values of x_j and x_k lie exactly on a line (with positive slope if
the coefficient is positive, negative slope if it is negative). This property of the
correlation matrix explains its invariance to a change of unit of measure (scale
invariance), unlike the covariance matrix. Another property of the correlation
coefficient is that if the features are uncorrelated (E[x_j x_k] = μ_{x_j} μ_{x_k}), we
have r_{jk} = 0, while a zero covariance, Cov(x_j, x_k) = 0, does not imply that the
features are statistically independent.
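The standardization of (1.7)-(1.8) and the resulting correlation matrix of (1.9) can be sketched as
follows (a minimal Python/NumPy example on hypothetical data; it also checks that the covariance
matrix of the standardized data coincides with the correlation matrix of the original data):

import numpy as np

def standardize(X):
    """Transform each feature to zero mean and unit standard deviation (Eq. 1.7)."""
    mu = X.mean(axis=0)                       # mu_j, Eq. (1.8)
    sigma = X.std(axis=0, ddof=1)             # sigma_j, Eq. (1.8)
    return (X - mu) / sigma

# Hypothetical data: N = 100 patterns, M = 3 features on very different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.01])
Z = standardize(X)

Kz = np.cov(Z, rowvar=False)                  # covariance of the standardized data
Rx = np.corrcoef(X, rowvar=False)             # Pearson correlation matrix, Eq. (1.9)
print(np.allclose(Kz, Rx))                    # True: the two matrices coincide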
Having two operating modes available, with the original data and standardized
data, the principal components could be calculated, respectively, by diagonalizing
the covariance matrix (data sensitive to the change of scale) or the correlation matrix
(normalized data). This choice must be carefully evaluated based on the nature of
the available data considering that the analysis of the principal components leads to
different results in using the two matrices on the same data.
A reasonable criterion could be to standardize data when these are very different
in terms of scale. For example, in the case of multispectral images, a characteristic,
represented by the broad dynamics of a band, can be in contrast with another that
represents a band with a very restricted dynamic. If instead homogeneous data are
available for the various features, it would be better to carry out the principal
component analysis without performing data standardization. A direct advantage of
operating with the correlation matrix R is the possibility of comparing data of the
same nature but acquired at different times. For example,
the classification of multispectral images relating to the same territory, acquired with
the same platform, but at different times.
We now return to the interpretation of the principal components (in the litera-
ture also called latent variables). The latter term is indicated precisely to express

the impossibility of giving a direct formalization and meaning to the principal com-
ponents (for example, the creation of a model). In essence, the mathematical tool
indicates the direction of the components where the information is significantly con-
centrated, but the interpretation of these new components is left to the analyst. For a
multispectral image, the characteristics represent the spectral information associated
with the various bands (in the visible, in the infrared, ...); when these are projected
onto the principal components, mathematically we know that the original physical
data are redistributed and represented by new hidden variables that have lost the
original physical meaning, even if, by evaluating the explained variances, they give us
quantitative information on the energy content of each component in this new space.
In various applications, this property of the principal components is used to reduce
the dimensionality of the data by considering the first p most significant components
of the M original dimensions.
For example, the first 3 principal components can be used as the components
of a suitable color space (RGB, ...) to display a multispectral image of dozens of
bands (see an example shown in Sect. 2.10.1 Vol. II). Another example concerns
the aspects of image compression where the analysis of the principal components is
strategic to evaluate the feasible compression level even if then the data compression
is performed with computationally more performing transforms (see Chap. 2 Vol. II).
In the context of clustering, the analysis of principal components is useful in
selecting the most significant features and eliminating the redundant ones. The analyst
can decide the acceptable percentage d of explained variance to retain with the first p
components, computable (considering Eq. 1.5) with the following ratio:

$$d = 100\,\frac{\sum_{k=1}^{p}\lambda_k}{\sum_{k=1}^{M}\lambda_k} \qquad (1.10)$$

Once the number p of components that maintain the percentage d of the total
variance has been calculated, the data are transformed (projected into a space with
p < M dimensions) by applying Eq. (1.3) to the first p components, using the first p
eigenvectors a_k, k = 1, . . . , p of the transformation matrix A_p, which now has
M × p dimensions. In Sect. 2.10.1 Vol. II, we demonstrated the efficacy of PCA
applied to multispectral images (5 bands), where the first principal component
manages to maintain more than 90% of the total variance, due also to the high
correlation of the bands.
Other heuristic approaches are proposed in the literature to select the principal
components. The so-called Kaiser rule [6] proposes to select only the components
associated with eigenvalues λ_i ≥ 1 (which almost always correspond to the first
two components). This criterion is often used with a threshold of 0.7 instead of 1, in
order to select more components and capture more of the variability of the observed
samples. A more practical method is to calculate the mean of the variances, that is,
the mean value of the eigenvalues $\bar{\lambda} = \frac{1}{M}\sum_{k=1}^{M}\lambda_k$,
and select the first p components whose variance exceeds this average, with p the
largest value of k such that λ_k > $\bar{\lambda}$.

Another approach is to graph the value of the eigenvalues on the ordinates with
respect to their order of extraction and choose the number p of the most significant
components where an abrupt change of the slope occurs with the rest of the graph
almost flat.
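As a hedged illustration of the ideas above (PCA via the covariance or correlation matrix, explained
variance as in Eq. 1.10, and component selection), the following Python/NumPy sketch, on
hypothetical data, computes the principal components and retains the first p components that reach
a desired percentage of explained variance:

import numpy as np

def pca(X, standardize=False, d_percent=90.0):
    """PCA of the N x M data matrix X; returns (A_p, Y, explained percentages)."""
    Xc = X - X.mean(axis=0)                 # center (deviations from the mean)
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)     # standardized data -> correlation matrix
    K = np.cov(Xc, rowvar=False)            # covariance (or correlation) matrix
    eigval, eigvec = np.linalg.eigh(K)      # eigendecomposition (symmetric matrix)
    order = np.argsort(eigval)[::-1]        # decreasing eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = 100.0 * np.cumsum(eigval) / eigval.sum()   # cumulated Eq. (1.10)
    p = int(np.searchsorted(explained, d_percent) + 1)     # smallest p reaching d
    A_p = eigvec[:, :p]                     # first p eigenvectors (M x p)
    Y = Xc @ A_p                            # projections y_i = a_i^T x
    return A_p, Y, explained[:p]

# Hypothetical correlated 2D data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
A_p, Y, expl = pca(X, d_percent=90.0)
print(expl)   # percentage of variance retained by the selected components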

1.5 Interactive Method

A very simple interactive classification method is to consider the feature space as a
look-up table. At each point in the feature space, a number identifying the class of
the object is stored. In this case, the number of classes is interactively determined
by the expert in the feature space, assuming that the corresponding clusters are well
separated. In essence, the expert uses 2D or 3D projections of the selected significant
features (for example, using the principal components) to visually evaluate the level
of class separation. The method is direct and very fast. The classes are identified in
the feature space and interactively delimited with closed rectangles or polygons (in
2D projections), as in the case of Fig. 1.3, assuming strongly uncorrelated features.
With this approach, it is the expert who interactively decides, in the space of
significant features, the modalities (e.g., the decision boundaries) to partition and
delimit the clusters, choosing the most effective 2D projections. In this way, the
expert directly defines the association between homogeneous patterns and the class to
which they belong (i.e., the label to be assigned for each type of territory).
Figure 1.6 shows the functional scheme of this interactive method to directly produce
a thematic map of territorial classification.
This approach proves useful in the preprocessing phase of the multispectral images
to extract the sample areas (e.g., training set) belonging to parts of the territory
(vegetation, water, bare soil, ...) to be used later in the automatic classification
methods. A typical 2D projection is shown in Fig. 1.4, where we can see that the
class of A patterns is easily separable, while there is a higher correlation between the
features x1 and x2 for the patterns of class B (highly correlated, well distributed
along the positive diagonal line) and a lower correlation for the class C patterns.
Although with some errors due to the slight overlapping of the clusters, it would be
possible to separate the B and C classes to extract samples to be used as a training
set for subsequent elaborations.
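A minimal sketch of the look-up table mechanism of Fig. 1.6 could look as follows (Python/NumPy,
with hypothetical class boundaries interactively chosen by the expert; band values are assumed
quantized to 8 bits):

import numpy as np

# Hypothetical 2D look-up table over two 8-bit bands (x1, x2); 0 = unlabeled.
lut = np.zeros((256, 256), dtype=np.uint8)

# Rectangles delimited interactively by the expert in the (x1, x2) feature space.
lut[0:80,    0:80]    = 1   # class 1: bare terrain
lut[80:140,  0:80]    = 2   # class 2: river
lut[140:200, 80:160]  = 3   # class 3: shoreline
lut[200:256, 160:256] = 4   # class 4: sea

def classify(band1, band2, lut):
    """Thematic map: each pixel's (x1, x2) values index the look-up table."""
    return lut[band1, band2]

# Hypothetical two-band image of 4 x 4 pixels.
band1 = np.array([[10, 90, 150, 210]] * 4, dtype=np.intp)
band2 = np.array([[10, 20, 100, 200]] * 4, dtype=np.intp)
print(classify(band1, band2, lut))   # thematic map with the class labels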

1.6 Deterministic Method

More generally, we can think of a deterministic classifier as an operator that receives
as input a pattern x with M features and outputs the single value y_r identifying
(labeling) the pattern among the R possible classes ω_r, r = 1, . . . , R. This

Fig. 1.6 Functional scheme of the interactive deterministic method. a Spatial domain represented
by two bands (x1, x2); b 2D feature space where the expert checks how the patterns cluster into
nonoverlapping clusters; c definition of the look-up table after having interactively defined the
limits of separation of the classes in the feature domain; d thematic map, obtained using the values
of the features x1, x2 of each pixel P as a pointer to the 2D look-up table, where the classes to be
associated are stored

operator can be defined in the form:

d(x) = yr (1.11)

where d(x) is called decision function or discriminant function. While in the inter-
active classification method, the regions associated with the different classes were
defined by the user observing the data projections in the features domain, with the
deterministic method, instead, these regions are delimited by the decision func-
tions that are defined by analyzing sample data for each observable class. In practice,
the decision functions divide the feature space into R disjoint classes
ω_r, r = 1, . . . , R, each of which constitutes the subset of the M-dimensional pattern
vectors x for which the decision d(x) = y_r is valid. The classes ω_r are separated
by discriminating hypersurfaces.
In relation to the R classes ω_r, the discriminating hypersurfaces can be defined by
means of the scalar functions d_r(x), which are precisely the discriminant functions
of the classifier, with the following property:

$$d_r(\mathbf{x}) > d_s(\mathbf{x}) \;\Rightarrow\; \mathbf{x} \in \omega_r \qquad s = 1, \ldots, R;\; s \neq r \qquad (1.12)$$

The discriminating hypersurface is given by the following:

dr (x) − ds (x) = 0 (1.13)

That said, with (1.12), a pattern vector x is associated with the class with the largest
value of the discriminant function:

$$d(\mathbf{x}) = y_r \;\Longleftrightarrow\; r = \arg\max_{s=1,\ldots,R} d_s(\mathbf{x}) \qquad (1.14)$$

Different discriminant functions are used in the literature: linear (defined with d(x)
as a linear combination of the features x_j) and nonlinear multiparametric (defined
as d(x, γ), where γ represents the parameters of the model d to be determined in the
training phase, given the known sample patterns, as is the case for the multilayer
perceptron). Discriminant functions can also be considered as a linear regression of
the data to a model where y is the class to be assigned (the dependent variable) and
the regressors are the pattern vectors (the independent variables).
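The decision rule (1.14) can be sketched as follows (a minimal Python/NumPy example with
hypothetical linear discriminant functions d_r(x) = w_r^T x on augmented patterns):

import numpy as np

def classify(x, W):
    """Assign the augmented pattern x to the class with the largest d_r(x), Eq. (1.14).

    W is an R x (M+1) matrix whose rows are the weight vectors w_r."""
    d = W @ x                    # values d_r(x) = w_r^T x, r = 1..R
    return int(np.argmax(d))     # index r of the winning class (0-based here)

# Hypothetical example: M = 2 features, R = 3 classes.
W = np.array([[ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0],
              [ 0.0, -1.0,  2.0]])
x = np.array([2.0, 0.5, 1.0])    # augmented pattern (x1, x2, 1)
print(classify(x, W))            # class index with the maximum discriminant value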

1.6.1 Linear Discriminant Functions

The linear discriminant functions are the simplest and are normally the most used.
They are obtained as a linear combination of the features of the x patterns:


$$d_r(\mathbf{x}) = \mathbf{w}_r^T\mathbf{x} = \sum_{i=1}^{M+1} w_{r,i}\,x_i \;\begin{cases} > 0, & \mathbf{x} \in \omega_r \\ < 0, & \text{otherwise} \end{cases} \qquad (1.15)$$

where x = (x_1, . . . , x_M, 1)^T is the augmented pattern vector (to obtain a simpler
notation), r = 1, . . . , R indicates the class, while

$$\mathbf{w}_r = (w_{r,1}, \ldots, w_{r,M+1})^T$$

indicates the (M+1)-dimensional weight vector associated with the discriminant
function d_r(x), which linearly separates the class ω_r from the other R − 1 classes.
While the first M terms of the weight vector determine the orientation of the
decision plane in the feature space, the term w_{M+1}, added to the weight vector,
determines the translation of the decision hyperplane from the origin (see Fig. 1.7).
In essence, the discriminant function d(x) represents the hyperplane of separation
between two classes, given by the equation:

d(x) = woT x = w1 x1 + w2 x2 + · · · + w M x M + w M+1 = 0 (1.16)

with normal unit vector in the direction of the weight vector:

wo = (w1 , w2 , . . . , w M )T

Fig. 1.7 Geometry of the linear discriminant function

Fig. 1.8 Linear and nonlinear discriminant functions. a A linear function that separates two classes
of patterns ω1 and ω2; b absolute separation between classes ω1, ω2, and ω3, separated respectively
by lines with equations d1(x) = 0, d2(x) = 0, and d3(x) = 0; c example of separation between
two classes with a nonlinear discriminant function described by a parabolic curve

and it is shown [7] to have perpendicular distance $D_o = \frac{|w_{M+1}|}{\|\mathbf{w}_o\|}$
from the origin and distance $D_z = \frac{\mathbf{w}_o^T\mathbf{z} + w_{M+1}}{\|\mathbf{w}_o\|}$
from an arbitrary pattern vector z (see Fig. 1.7). If D_o = 0, the hyperplane passes
through the origin. The value of the discriminant function associated with a pattern
vector x represents the measure of its perpendicular distance from the hyperplane
given by (1.16).
For M = 2, the linear discriminant function (1.16) corresponds to the equation
of the straight line (separation line between two classes) given by

d(x) = w1 x1 + w2 x2 + w3 = 0 (1.17)

where the coefficients w_i, i = 1, 2, 3 are chosen to separate the two classes ω1 and
ω2, i.e., for each pattern x ∈ ω1 we have d(x) > 0, while for every x ∈ ω2 we have
d(x) < 0, as shown in Fig. 1.8a. In essence, d(x) is the linear discriminant
function of the class ω1 . More generally, we can say that the set of R classes are
absolutely separable if each class ωr , r = 1, . . . , R is linearly separated from the
remaining pattern classes (see Fig. 1.8b).
For M = 3, the linear discriminant function is represented by the plane and for
M > 3 by the hyperplane.
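The geometric quantities of Fig. 1.7 and Eqs. (1.16)-(1.17) can be sketched as below (Python/NumPy,
hypothetical weights and function name): the sign of d(z) decides the class and its magnitude,
normalized by the norm of w_o, gives the distance D_z of a pattern z from the hyperplane.

import numpy as np

def hyperplane_side_and_distance(z, w_o, w_last):
    """Return (side, distance) of pattern z w.r.t. the hyperplane w_o^T x + w_last = 0."""
    d = w_o @ z + w_last                # value of the discriminant function d(z)
    D_z = d / np.linalg.norm(w_o)       # signed perpendicular distance from the hyperplane
    side = 1 if d > 0 else 2            # class omega_1 if d(z) > 0, omega_2 otherwise
    return side, D_z

# Hypothetical line w1*x1 + w2*x2 + w3 = 0 in 2D (Eq. 1.17).
w_o, w_last = np.array([1.0, -1.0]), 0.5
print(hyperplane_side_and_distance(np.array([2.0, 1.0]), w_o, w_last))
print(abs(w_last) / np.linalg.norm(w_o))   # D_o: distance of the line from the origin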

1.6.2 Generalized Discriminant Functions

So far, we have considered linear discriminant functions to separate pairs of classes


directly in the features domain. Separation with more complex class configurations
can be done using piecewise linear functions. Even more complex situations cannot
be solved with linear functions (see Fig. 1.8c). In these cases, the class boundaries
can be described with the generalized linear discriminant functions given in the
following form:

d(x) = w1 φ1 (x) + · · · + w M φ M (x) + w M+1 (1.18)



where φ_i(x) are the M scalar functions associated with the pattern x with M features
(x ∈ R^M). In vector form, introducing the augmented vectors w and z, with the
substitution of the original variables x by z_i = φ_i(x), we have

$$d(\mathbf{x}) = \sum_{i=1}^{M+1} w_i\,\phi_i(\mathbf{x}) = \mathbf{w}^T\mathbf{z} \qquad (1.19)$$

where z = (φ_1(x), . . . , φ_M(x), 1)^T is the vector function of x and
w = (w_1, . . . , w_M, w_{M+1})^T. The discriminant function (1.19) is linear in the
z_i through the functions φ_i (i.e., in the new transformed variables) and not in the
measures of the original features x_i. In essence, by transforming the input patterns x,
via the scalar functions φ_i, into the new augmented (M+1)-dimensional domain of
the features z_i, the classes can be separated by a linear function as described in the
previous paragraph.
In the literature, several functions φ_i have been proposed to linearly separate
patterns. The most common are the polynomial, quadratic, radial basis,5 and
multilayer perceptron discriminant functions. For example, for M = 2, the quadratic
generalized discriminant function results in

$$d(\mathbf{x}) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6 \qquad (1.20)$$

with w = (w_1, . . . , w_6)^T and z = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)^T. The
number of weights, that is, the free parameters of the problem, is 6. For M = 3, the
number of weights is 10, with the pattern vector z having 10 components. With the
increase of M, the number of components of z becomes very large.
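For M = 2, the mapping x → z used in (1.20) can be sketched as follows (a minimal Python example;
the function name is hypothetical); any linear classifier operating on z then realizes a quadratic
decision boundary in the original features:

import numpy as np

def quadratic_map(x):
    """Map x = (x1, x2) to z = (x1^2, x1*x2, x2^2, x1, x2, 1), as in Eq. (1.20)."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2, x1, x2, 1.0])

# Hypothetical weights w = (w1, ..., w6) of a quadratic discriminant function.
w = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -4.0])   # d(x) = x1^2 + x2^2 - 4 (a circle)
x = np.array([1.0, 1.0])
print(w @ quadratic_map(x))   # d(x) < 0: x lies inside the circular boundary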

1.6.3 Fisher’s Linear Discriminant Function

Fisher's linear discriminant analysis (known in the literature as Linear Discriminant
Analysis, LDA) [8] represents an alternative approach to finding a linear combination
of features for object classification. LDA, like PCA, is based on a linear
transformation of the data to extract relevant information and reduce the
dimensionality of the data. However, while PCA provides the best linear
transformation to reduce the dimensionality of the data by projecting them in the
directions of maximum variance on the new axes, without necessarily producing new
features useful for the separability of the classes, LDA projects the data into the new
space so as to improve the separability of the classes by maximizing the ratio between
the inter-class variance and the intra-class variance (see Fig. 1.9).
As shown in the figure, in the case of two classes, the basic idea is to project the
sample patterns X on a line where the classes are better separated. Therefore, if v

5 A function of real variables with real values that depends exclusively on the distance from a fixed
point, called centroid x_c. An RBF is expressed in the form φ : R^M → R such that
φ(x) = φ(‖x − x_c‖).

Fig. 1.9 Fisher linear discriminant function. a Optimal projection line to separate two classes of
patterns ω1 and ω2; b nonoptimal line of separation of the two classes, where the partial overlap of
the projections of patterns of different classes is noted

is the vector that defines the orientation of the line onto which the patterns
x_i, i = 1, . . . , N are projected, the goal is to find for every x_i a scalar value y_i
that represents the distance from the origin of its projection on the line. This distance
is given by

$$y_i = \mathbf{v}^T\mathbf{x}_i \qquad i = 1, \ldots, N \qquad (1.21)$$

The figure shows the projection of a sample in the one-dimensional case. Let us now
see how to determine v, i.e., the optimal direction of the sample projection line that
best separates the K classes ω_k, each consisting of n_k samples. In other words, we
need to find v so that, after the projection, the ratio between the inter-class variance
and the intra-class variance is maximized.

1.6.3.1 Fisher Linear Discrimination—2 Classes


Using the same symbolism adopted so far, we initially consider a dataset of samples
X = {x_1, . . . , x_N} with two classes, of which we can calculate the averages μ_1
and μ_2 of each class, consisting of n_1 and n_2 samples, respectively. Applying
(1.21) to the samples of the two classes, we get the averages of the projections
μ̂_k, as follows:

$$\hat{\mu}_k = \frac{1}{n_k}\sum_{\mathbf{x}_i \in \omega_k} y_i = \frac{1}{n_k}\sum_{\mathbf{x}_i \in \omega_k} \mathbf{v}^T\mathbf{x}_i = \mathbf{v}^T\boldsymbol{\mu}_k \qquad k = 1, 2 \qquad (1.22)$$

A criterion for defining a separation measure of the two classes consists in considering
the distance (see Fig. 1.9) between the projected averages |μ̂2 − μ̂1 | which represents
the inter-class distance (measure of separation):

J (v) = |μ̂2 − μ̂1 | = |v T (μ2 − μ1 )| (1.23)



thus obtaining an objective function J(v) to be maximized, dependent on the vector v.
However, the distance between projected averages is not a robust measure since
it does not take into account the standard deviation of the classes or their dispersion
level. Fisher suggests maximizing the inter-class distance by normalizing it with the
dispersion information (scatter) of each class. Therefore, a dispersion measure
Ŝ_k^2 is defined, analogous to the variance, for each class ω_k, given by

$$\hat{S}_k^2 = \sum_{y_i \in \omega_k} (y_i - \hat{\mu}_k)^2 \qquad k = 1, 2 \qquad (1.24)$$

The measure of dispersion ( Ŝ12 + Ŝ22 ) obtained is called intra-class dispersion (within-
class scatter) of the samples projected in the direction v in this case with two classes.
The Fisher linear discriminating criterion is given by the linear function defined
by the (1.21) which projects the samples on the line in the direction v and maximizes
the following linear function:

$$J(\mathbf{v}) = \frac{|\hat{\mu}_2 - \hat{\mu}_1|^2}{\hat{S}_1^2 + \hat{S}_2^2} \qquad (1.25)$$

The goal of (1.25) is to obtain projections in which the samples of each class are
compact (that is, Ŝ_k^2 is very small) and, simultaneously, the projected centroids are
as far apart as possible (i.e., the distance |μ̂_2 − μ̂_1|^2 is very large). This is
achieved by finding a vector v* which maximizes J(v) through the following
procedure.

1. Calculation of the dispersion matrices S_k in the original feature space x:

$$S_k = \sum_{\mathbf{x}_i \in \omega_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T \qquad k = 1, 2 \qquad (1.26)$$

from which we obtain the intra-class dispersion matrix S_v = S_1 + S_2.


2. The dispersion of the projections y can be expressed as a function of the dispersion
matrix Sv in the feature space x, as follows:

Ŝk2 = (yi − μ̂k )2 for the (1.24)
yi ∈ωk

= (v T xi − v T μk )2 for the (1.21) and (1.22)
yi ∈ωk (1.27)

= (v (xi − μk )(xi − μk ) v
T T

yi ∈ωk

= v T Sk v

where Sk is the dispersion matrix in the source space of the f eatur e. From the
(1.27), we get
Ŝ12 + Ŝ22 = v T Sv v (1.28)

3. Definition of the inter-class dispersion matrix in the source space, given by

S B = (μ1 − μ2 )(μ1 − μ2 )T (1.29)

   which includes the separation measure between the centroids of the two classes
   before the projection. It is observed that S_B is obtained from the outer product
   of two vectors and has rank at most one.
4. Calculation of the difference between the centroids, after the projection, expressed
in terms of the averages in the space of the features of origin:

$$(\hat{\mu}_1 - \hat{\mu}_2)^2 = (\mathbf{v}^T\boldsymbol{\mu}_1 - \mathbf{v}^T\boldsymbol{\mu}_2)^2 = \mathbf{v}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{v} = \mathbf{v}^T S_B \mathbf{v} \qquad (1.30)$$

5. Reformulation of the objective function (1.25) in terms of the dispersion matrices
   S_v and S_B, expressed as follows:

$$J(\mathbf{v}) = \frac{|\hat{\mu}_2 - \hat{\mu}_1|^2}{\hat{S}_1^2 + \hat{S}_2^2} = \frac{\mathbf{v}^T S_B \mathbf{v}}{\mathbf{v}^T S_v \mathbf{v}} \qquad (1.31)$$

6. Find the maximum of the objective function J(v). This is achieved by
   differentiating J with respect to the vector v and setting the result to zero:

$$\frac{d}{d\mathbf{v}} J(\mathbf{v}) = \frac{d}{d\mathbf{v}}\left[\frac{\mathbf{v}^T S_B \mathbf{v}}{\mathbf{v}^T S_v \mathbf{v}}\right] = \frac{[\mathbf{v}^T S_v \mathbf{v}]\frac{d[\mathbf{v}^T S_B \mathbf{v}]}{d\mathbf{v}} - [\mathbf{v}^T S_B \mathbf{v}]\frac{d[\mathbf{v}^T S_v \mathbf{v}]}{d\mathbf{v}}}{(\mathbf{v}^T S_v \mathbf{v})^2} = \frac{[\mathbf{v}^T S_v \mathbf{v}]\,2S_B\mathbf{v} - [\mathbf{v}^T S_B \mathbf{v}]\,2S_v\mathbf{v}}{(\mathbf{v}^T S_v \mathbf{v})^2} = 0$$
$$\Longrightarrow\; [\mathbf{v}^T S_v \mathbf{v}]\,2S_B\mathbf{v} - [\mathbf{v}^T S_B \mathbf{v}]\,2S_v\mathbf{v} = 0 \qquad (1.32)$$

   and dividing the last expression by $\mathbf{v}^T S_v \mathbf{v}$, we have

$$\left[\frac{\mathbf{v}^T S_v \mathbf{v}}{\mathbf{v}^T S_v \mathbf{v}}\right] S_B\mathbf{v} - \left[\frac{\mathbf{v}^T S_B \mathbf{v}}{\mathbf{v}^T S_v \mathbf{v}}\right] S_v\mathbf{v} = 0 \;\Longrightarrow\; S_B\mathbf{v} - J(\mathbf{v})\,S_v\mathbf{v} = 0 \;\Longrightarrow\; S_v^{-1} S_B\mathbf{v} - J(\mathbf{v})\,\mathbf{v} = 0 \qquad (1.33)$$

7. Solve the generalized eigenvalue problem given by Eq. (1.33), assuming S_v has
   full rank (so that its inverse exists). Solving for v, the optimal vector v* is
   obtained as follows:

$$\mathbf{v}^* = \arg\max_{\mathbf{v}} J(\mathbf{v}) = \arg\max_{\mathbf{v}} \frac{\mathbf{v}^T S_B \mathbf{v}}{\mathbf{v}^T S_v \mathbf{v}} = S_v^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \qquad (1.34)$$

Fig. 1.10 Fisher's multiclass linear discriminant function with a 3-class example and samples with
2D features
With (1.34) we have thus obtained the Fisher linear discriminant function, although,
more than a discriminant, it is rather an appropriate choice of the direction of the
one-dimensional projection of the data.
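Under the assumptions of this section, the closed form (1.34) can be sketched in a few lines
(Python/NumPy, hypothetical Gaussian data; the function name is ours):

import numpy as np

def fisher_direction(X1, X2):
    """Fisher projection vector v* = Sv^{-1} (mu1 - mu2), Eq. (1.34)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)       # class scatter matrices, Eq. (1.26)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sv = S1 + S2                         # within-class scatter matrix
    return np.linalg.solve(Sv, mu1 - mu2)

# Hypothetical 2D Gaussian classes.
rng = np.random.default_rng(2)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(100, 2))
v = fisher_direction(X1, X2)
y1, y2 = X1 @ v, X2 @ v                  # one-dimensional projections y_i = v^T x_i
print(y1.mean(), y2.mean())              # well-separated projected means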

1.6.3.2 Fisher Linear Discrimination—C Classes


Fisher's LDA extension to C classes can be generalized with good results. Assume
we have a dataset X = {x_1, x_2, . . . , x_N} of N samples with d dimensions
belonging to C classes. In this case, instead of a single projection y, we have C − 1
projections y = {y_1, y_2, . . . , y_{C−1}} onto the linear subspaces given by

$$y_i = \mathbf{v}_i^T\mathbf{x} \;\Longrightarrow\; \mathbf{y}_{[C-1]\times 1} = \mathbf{V}_{d\times[C-1]}^T\,\mathbf{x}_{d\times 1} \qquad (1.35)$$

where the second equation expresses in compact matrix form the vector y of the
C − 1 projections generated by the C − 1 projection vectors v_i assembled in the
C − 1 columns of the projection matrix V.
Let us now see how the equations seen above for LDA are generalized to C classes
(Fig. 1.10 presents an example with 3 classes and samples with 2D features).

1. The intra-class dispersion matrix for 2 classes, S_v = S_1 + S_2, is generalized
   for C classes as:

$$S_v = \sum_{i=1}^{C} S_i \qquad (1.36)$$

   where

$$S_i = \sum_{\mathbf{x}_j \in \omega_i} (\mathbf{x}_j - \boldsymbol{\mu}_i)(\mathbf{x}_j - \boldsymbol{\mu}_i)^T \qquad \boldsymbol{\mu}_i = \frac{1}{n_i}\sum_{\mathbf{x}_j \in \omega_i} \mathbf{x}_j \qquad (1.37)$$

with n i indicating the number of samples in the class ωi .


2. The inter-class dispersion matrix for 2 classes, the equation (1.29), which mea-
sures the distance between classes considering the centroids (i.e., the averages μi
of the classes) is so generalized:

C
1 
N
1 
C
SB = n i (μi − μ)(μi − μ)T con μ= xj = n i μi (1.38)
N N
i=1 j=1 i=1

where μi = 1/n i x j ∈ωi x j is the average of the samples of each class ωi , each
of n i samples, while N is the total number of samples. The total dispersion matrix
is given by ST = S B + SV .
3. Similarly, the average vector μ̂_i of the projected samples y of each class and the
   total average μ̂ are defined:

$$\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i}\sum_{\mathbf{y}_j \in \omega_i}\mathbf{y}_j \qquad \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{j=1}^{N}\mathbf{y}_j \qquad (1.39)$$

4. The dispersion matrices of the projected samples y result in

$$\hat{S}_V = \sum_{i=1}^{C}\hat{S}_i = \sum_{i=1}^{C}\sum_{\mathbf{y}_j \in \omega_i} (\mathbf{y}_j - \hat{\boldsymbol{\mu}}_i)(\mathbf{y}_j - \hat{\boldsymbol{\mu}}_i)^T \qquad \hat{S}_B = \sum_{i=1}^{C} n_i (\hat{\boldsymbol{\mu}}_i - \hat{\boldsymbol{\mu}})(\hat{\boldsymbol{\mu}}_i - \hat{\boldsymbol{\mu}})^T \qquad (1.40)$$

5. In the 2-class approach, we expressed the dispersion matrices of the projected
   samples in terms of the original samples. For C classes, we have

$$\hat{S}_V = \mathbf{V}^T S_V \mathbf{V} \qquad (1.41)$$

   Analogously, the dispersion matrix Ŝ_B = V^T S_B V remains valid for LDA
   with C classes.
6. Our goal is to find a projection that maximizes the relationship between inter-class
and intra-class dispersion. Since the projection is no longer one-dimensional but
has dimensions C − 1, the determinant of the dispersion matrices is used to obtain
a scalar objective function, as follows:

$$J(\mathbf{V}) = \frac{|\hat{S}_B|}{|\hat{S}_V|} = \frac{|\mathbf{V}^T S_B \mathbf{V}|}{|\mathbf{V}^T S_V \mathbf{V}|} \qquad (1.42)$$

It is now necessary to find the projections defined by the column vectors of the
projection matrix V, or a projection matrix V∗ that maximizes the ratio of the
objective function J (V).
7. Calculation of the matrix V∗ . In analogy to the 2-class case, the maximum of
J (V) is found differentiating the objective function (1.42) and equalizing the
result to zero. Subsequently, the problem with the eigenvalue method is solved by
generalizing the Eq. (1.33) previously obtained for 2 classes. It is shown that the
optimal projection matrix V∗ is the matrix whose columns are the eigenvectors

Fig. 1.11 Application of LDA for a 2D dataset with 2 and 3 classes. a Calculation of the linear
discriminant projection vector for the 2-class example. The figure, produced with MATLAB, also
reports the principal component of PCA together with the FDA projection vector, highlighting how
FDA separates the classes better than PCA, whose principal component is more oriented toward
the direction of greatest variance of the data distribution. b Calculation of the 2 projection vectors
for the example with 3 classes

   that correspond to the greatest eigenvalues of the following generalization (the
   analogue of Eq. 1.33) of the eigenvalue problem:

$$\mathbf{V}^* = [\mathbf{v}_1^*\,|\,\mathbf{v}_2^*\,|\,\cdots\,|\,\mathbf{v}_{C-1}^*] = \arg\max_{\mathbf{V}} \frac{|\mathbf{V}^T S_B \mathbf{V}|}{|\mathbf{V}^T S_V \mathbf{V}|} \;\Longrightarrow\; S_V^{-1} S_B \mathbf{v}_i^* = \lambda_i \mathbf{v}_i^* \qquad (1.43)$$

   where λ_i = J(v_i*) with i = 1, 2, ..., C − 1. From (1.43), we have that if S_V is
   invertible (i.e., non-singular), then the Fisher objective function is maximized
   when the columns of the projection matrix V* are the eigenvectors associated
   with the greatest eigenvalues λ_i of S_V^{-1} S_B. The matrix S_B is the sum of
   C matrices of rank ≤ 1, so that its rank is ≤ (C − 1) and consequently only
   C − 1 eigenvalues will be different from zero.
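A sketch of the multiclass case, under the same assumptions (Python/NumPy with SciPy's generalized
eigensolver; the data and function name are hypothetical), solves S_B v = λ S_V v and keeps the
C − 1 leading eigenvectors as the columns of V*:

import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels):
    """Columns of V*: eigenvectors of Sv^{-1} S_B with the largest eigenvalues (Eq. 1.43)."""
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sv, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sv += (Xc - mu_c).T @ (Xc - mu_c)                       # Eqs. (1.36)-(1.37)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)          # Eq. (1.38)
    eigval, eigvec = eigh(Sb, Sv)                               # generalized eigenproblem
    order = np.argsort(eigval)[::-1][:len(classes) - 1]         # keep the C-1 leading ones
    return eigvec[:, order]

# Hypothetical 2D data with 3 classes.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, scale=0.4, size=(50, 2))
               for m in ([0, 0], [2, 1], [0, 2])])
labels = np.repeat([0, 1, 2], 50)
V = lda_projection(X, labels)       # 2 x 2 projection matrix (C - 1 = 2 directions)
print((X @ V).shape)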

Figure 1.11 shows the application of Fisher’s discriminant analysis (FDA) for two
datasets with 2 features but with 2 and 3 classes. Figure (a) also shows the prin-
cipal component of the PCA applied for the same 2-class dataset. As previously
highlighted, PCA tends to project data in the direction of maximum variance which
is useful for concentrating data information on a few possible components while
it is less useful for separating classes. FDA on the other hand determines the pro-
jection vectors where the data are better separated and therefore more useful for
classification.
Let us now analyze some limitations of LDA. A first aspect concerns the reduction of
the dimensionality of the data, which is at most C − 1, unlike PCA, which can reduce
the dimensionality down to a single feature. For complex data, not even the best
one-dimensional projection may separate samples of different classes. Similarly to
PCA, if the classes are very dispersed, even with a very large J(v) value, the classes
have large overlaps on any projection line. LDA is in fact a parametric approach, in
that it assumes an essentially Gaussian and unimodal distribution of the samples.
For the classification problem, if the distributions are significantly non-Gaussian, the
LDA projections will not be able to correctly separate complex data (see Fig. 1.8c).
In the literature, there are several variants of LDA [9,10] (nonparametric,
orthonormal, generalized, and in combination with neural networks).

1.6.4 Classifier Based on Minimum Distance

1.6.4.1 Single Prototypes


Let’s say we know a prototype pattern pr for each of the R classes ωr , r = 1, . . . , R of
patterns with M features. A classifier based on minimum distance assigns (classifies)
a generic pattern x to class ωr whose prototype pr results at the minimum distance:

d(x) = ω_r  ⟺  |x − p_r| = min_{i=1,...,R} |x − p_i|     (1.44)

If there are several minimum candidates, the pattern x is assigned to the class ω_r
corresponding to the first index r found. This classifier can be considered a special
case of a classifier based on discriminant functions. In fact, if we consider the
Euclidean distance D_i between the generic pattern x and the prototype p_i, we have
that this pattern is assigned to the class ω_i which satisfies the relation D_i < D_j for
all j ≠ i. Finding the minimum of D_i is equivalent to finding the minimum of D_i²
(the distances being positive), for which we have

D_i² = |x − p_i|² = (x − p_i)^T (x − p_i) = x^T x − 2 (x^T p_i − ½ p_i^T p_i)     (1.45)

where, in the final expression, the term x^T x can be neglected, being independent of
the index i, and the problem reduces to maximizing the expression in brackets (•).
It follows that we can express the classifier in terms of the following discriminant
function:

d_i(x) = x^T p_i − ½ p_i^T p_i     i = 1, . . . , R     (1.46)

and the generic pattern x is assigned to the class ω_i if d_i(x) > d_j(x) for each j ≠ i.
The discriminant functions d_i(x) are linear, expressed in the form:

d_i(x) = w_i^T x     i = 1, . . . , R     (1.47)

where x is given in the form of an augmented vector (x_1, . . . , x_M, 1)^T, while the
weights w_i = (w_{i1}, . . . , w_{iM}, w_{i,M+1}) are determined as follows:

w_{ij} = p_{ij}     w_{i,M+1} = −½ p_i^T p_i     i = 1, . . . , R;  j = 1, . . . , M     (1.48)

Fig. 1.12 Minimum distance classifier with only one prototype p_i per class. The pattern x to be
classified is assigned to the class ω_2, being closest to the prototype p_2 at distance D_2. The
discriminant lines separating each pair of the three prototypes are shown

It can be shown that the discriminating surface that separates each pair of prototype
patterns p_i and p_j is the hyperplane that perpendicularly bisects the segment joining
the two prototypes. Figure 1.12 shows an example of minimum distance classification
with a single prototype per class and three classes. If the prototype patterns p_i coincide with the
mean patterns μ_i of the classes, we have a minimum-distance-from-the-mean
classifier.
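As an illustrative sketch (under the assumption of a Euclidean metric, with hypothetical function names, not the book's own code), the single-prototype minimum distance classifier of Eqs. (1.44) and (1.46) can be written in Python/NumPy as follows:

import numpy as np

def min_distance_classify(x, prototypes):
    # Eq. (1.44): assign x to the class of the nearest prototype (ties -> first index).
    d2 = ((prototypes - x) ** 2).sum(axis=1)     # squared Euclidean distances
    return int(np.argmin(d2))

def linear_discriminants(x, prototypes):
    # Eq. (1.46): d_i(x) = x^T p_i - 0.5 * p_i^T p_i; the argmax gives the same decision.
    return prototypes @ x - 0.5 * (prototypes ** 2).sum(axis=1)

# usage sketch: with prototypes of shape (R, M) and a pattern x of shape (M,),
# np.argmax(linear_discriminants(x, prototypes)) equals min_distance_classify(x, prototypes)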

1.6.4.2 Multiple Prototypes


The extension to multiple prototypes per class is immediate. In this case, we
assume that in the generic class ω_i a set of prototype patterns p_i^(1), . . . , p_i^(n_i),
i = 1, . . . , R, has been aggregated (classified), where n_i indicates the number of
prototypes in the i-th class. The distance between a pattern x to classify
and the generic class ω_i is given by

D_i = min_{k=1,...,n_i} |x − p_i^(k)|     (1.49)

As before, also in this case the discriminant function is determined for the generic
class ω_i as follows:

d_i(x) = max_{k=1,...,n_i} d_i^(k)(x)     (1.50)

where d_i^(k)(x) is a linear discriminant function given by

d_i^(k)(x) = x^T p_i^(k) − ½ (p_i^(k))^T p_i^(k)     i = 1, . . . , R;  k = 1, . . . , n_i     (1.51)
2
The pattern x is assigned to the class ω_i for which the discriminant function d_i(x)
assumes the maximum value, i.e., d_i(x) > d_j(x) for each j ≠ i. In other words, the x

pattern is assigned to the class ω_i which has the closest prototype pattern. The linear
discriminant functions given by (1.51) partition the feature space into Σ_{i=1}^{R} n_i
regions, known in the literature as the Dirichlet tessellation.6
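A minimal sketch of the multiple-prototype rule of Eqs. (1.49)–(1.50), again with hypothetical names and a Euclidean metric, follows the same pattern as the single-prototype case:

import numpy as np

def min_distance_multi(x, prototypes_per_class):
    # prototypes_per_class: list of (n_i, M) arrays, one per class omega_i.
    # Eq. (1.49): the distance of x from a class is the distance from its closest prototype.
    D = [np.min(np.linalg.norm(P - x, axis=1)) for P in prototypes_per_class]
    return int(np.argmin(D))                     # class with the closest prototype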

1.6.5 Nearest-Neighbor Classifier

This classifier can be considered as a generalization of the classification scheme
based on the minimum distance. Suppose we have a training set of pairs (p_i, z_i), i =
1, . . . , n, where p_i indicates a sample pattern of which we know a priori its class
z_i = j, that is, it is a sample of the class ω_j, j = 1, . . . , R. If we
denote by x the generic pattern to be classified, the Nearest-Neighbor (NN) classifier
assigns it to the class of the i-th pair whose sample p_i is closest according to a chosen
distance metric, namely

D_i = |x − p_i| = min_{1≤k≤n} |x − p_k|  ⟹  (p_i, z_i);  x ∈ ω_{z_i}     (1.52)

A version of this classifier, called k-nearest neighbor (k-NN), operates as follows:

(a) Determine, among the n sample–class pairs (p_i, z_i), the k samples closest to the pattern
x to be classified (always measuring the distance with an appropriate metric).
(b) Assign to x the most representative class (the most voted class),
that is, the class that has the greatest number of samples among the k nearest
found.

With the k-NN classifier, the probability of erroneous attribution of the class is
reduced. Obviously, the choice of k must be adequate. A high value reduces the
sensitivity to data noise, while a low value avoids extending
the concept of proximity into the domain of other classes. Finally, it should be noted
that as k increases, the probability of error of the k-NN classifier approaches
the probability of error of the Bayes classifier, which will be described in the following
paragraphs.
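The NN and k-NN rules can be sketched as follows in Python/NumPy (an illustration only; the Euclidean metric, the function name, and the default value k = 3 are arbitrary choices for the example):

import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    # k-NN rule: majority vote among the k samples p_i closest to x (k = 1 gives Eq. 1.52).
    dist = np.linalg.norm(samples - x, axis=1)   # distances to all training samples
    nearest = np.argsort(dist)[:k]               # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)  # count the classes z_i of the neighbors
    return votes.most_common(1)[0][0]            # most voted class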

1.6.6 K-means Classifier

The K-means method [11] is also known as C-means clustering, applied in different
contexts, including the compression of images and speech signals and the recognition
of thematic areas in satellite images. Compared to the previous supervised classifiers,

6 Also called the Voronoi diagram (from the name of Georgij Voronoi), it is a particular type of
decomposition of a metric space, determined by the distances with respect to a given finite set of
points. For example, in the plane, given a finite set of points S, the Dirichlet tessellation for
S is the partition of the plane that associates a region R(p) with each point p ∈ S, such that all
points of R(p) are closer to p than to any other point of S.

K-means does not have a priori knowledge of the patterns to be classified. The only
information available is the number of classes K in which to group the patterns.
So far, we have adopted a clustering criterion based on the minimum Euclidean
distance to establish a similarity7 measure between two patterns, to decide whether
they are elements of the same class or not. Furthermore, this similarity measure can
be considered relative by associating a threshold that defines the level of acceptability
of a pattern as similar to another or as belonging to another class. K-means introduces
a clustering criterion based on a performance index that minimizes the sum of the
squares of the distances between all the points of each cluster and its own
cluster center.
Suppose we have available the dataset X = {x_i}_{i=1}^{N} consisting of N observations
of an M-dimensional physical phenomenon. The goal is to partition the dataset into
a number K of groups.8 Each partition group is represented by a prototype that
on average has an intra-class distance9 smaller than the distances between the
prototype of the group and an observation belonging to another group (inter-class
distance). We then denote by μ_k an M-dimensional vector representing the
prototype of the k-th group (with k = 1, . . . , K). In other words, the prototype
represents the center of the group. We are interested in finding the set of prototypes
of the dataset X with the aforementioned clustering criterion, so that the sum of the
squares of the distances of each observation x_i from the nearest prototype is minimal.
We now introduce a notation to define the way each observation is assigned to a
prototype, with the binary variable r_{ik} ∈ {0, 1} indicating whether the i-th
observation belongs to the k-th group (r_{ik} = 1) or to some group other than k
(r_{ik} = 0). In general, we will have a binary membership matrix R of
dimension N × K which highlights whether the i-th observation
belongs to the k-th class. Suppose for now we have the K prototypes μ_1, μ_2, . . . , μ_K
(later we will see how to calculate them analytically); we will then say that
an observation x belongs to the class ω_k if the following is satisfied:

x − μk = min x − μj ⇒ x ∈ ωk (1.53)
j=1,K

At this point, temporarily assigning all the dataset patterns to the K clusters, one can
evaluate the error made in choosing the prototype of each group by introducing
a functional, named distortion measure of the data or total reconstruction error, with
the following function:

J = Σ_{i=1}^{N} Σ_{k=1}^{K} r_{ik} ‖x_i − μ_k‖²     (1.54)

7 Given two patterns x and y, a similarity measure S(x, y) can be defined such that
lim_{x→y} S(x, y) = 0 ⇒ x = y.
8 Suppose for the moment that we know the number K of groups to search for.
9 The distance between the observations belonging to the same group and the representative
prototype of the group.


 
The goal is then to find the values of {r_{ik}} and μ_k that minimize the objective
function J. An iterative procedure is used which involves two different steps for each
iteration. The cluster centers μ_k, k = 1, . . . , K, are initialized randomly. Then, at each
iteration, in a first phase J is minimized with respect to r_{ik} while keeping the centers
μ_k fixed. In the second phase, J is minimized with respect to μ_k while keeping fixed the
membership functions r_{ik} (also called characteristic functions of the classes).
These two steps are repeated until convergence is reached. We will see later that
these two update steps of r_{ik} and μ_k correspond to the E (Expectation)
and M (Maximization) steps of the EM algorithm.
If the membership function r_{ik} is 1, it tells us that the vector x_i is closest to the
center μ_k, that is, we assign each point of the dataset to the nearest cluster center as
follows:

r_{ik} = 1  if k = arg min_j ‖x_i − μ_j‖,  and r_{ik} = 0 otherwise     (1.55)

Since a given observation x can belong to only one group, the matrix R has the
following property:

Σ_{k=1}^{K} r_{ik} = 1     ∀ i = 1, . . . , N     (1.56)

and

Σ_{k=1}^{K} Σ_{i=1}^{N} r_{ik} = N.     (1.57)

We now derive the update formulas for r_{ik} and μ_k that minimize the function
J. If we optimize μ_k while keeping r_{ik} fixed, we can see that
the function J in (1.54) is a quadratic function of μ_k, which can be minimized by
setting its first derivative to zero:

∂J/∂μ_k = 2 Σ_{i=1}^{N} r_{ik} (x_i − μ_k) = 0     (1.58)

from which we get

μ_k = Σ_{i=1}^{N} r_{ik} x_i / Σ_{i=1}^{N} r_{ik}     (1.59)

Note that the denominator of (1.59) represents the number of points assigned to the
k-th cluster, i.e., μ_k is computed as the average of the points that fall within the
cluster. For this reason, the algorithm is called K-means.
So far we have described the batch version of the algorithm, in which the whole
dataset is used in a single pass to update the prototypes, as described in
Algorithm 1. A stochastic online version of the algorithm has been proposed in the

literature [12] by applying the Robbins–Monro procedure to the problem of searching
for the roots of the regression function given by the derivatives of J with respect to
μ_k. This allows us to formulate a sequential version of the update as follows:

μ_k^{new} = μ_k^{old} + η_i (x_i − μ_k^{old})     (1.60)

with η_i the learning parameter, which is monotonically decreased based on the number
of observations that compose the dataset. Figure 1.13 shows the result of the
quantization, or classification, of color pixels. In particular, Fig. 1.13a shows the original
image and the following panels show the result of the method with different
numbers of prototypes K = 3, 5, 6, 7, 8. Each color indicates a particular cluster, and each
original pixel has been replaced by the value (the RGB color triple) of the nearest
prototype. The computational load is O(K N t), where t indicates the
number of iterations, K the number of clusters, and N the number of patterns to
classify. In general, we have K, t ≪ N.

Algorithm 1 K-means algorithm
1: Initialize the centers μ_k for 1 ≤ k ≤ K randomly
2: repeat
3:   for x_i with i = 1, . . . , N do
4:     r_{ik} = 1 if k = arg min_j ‖x_i − μ_j‖, otherwise r_{ik} = 0
5:   end for
6:   for μ_k with k = 1, 2, . . . , K do
7:     μ_k = Σ_{i=1}^{N} r_{ik} x_i / Σ_{i=1}^{N} r_{ik}
8:   end for
9: until convergence of parameters
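As an illustration of Algorithm 1 (a minimal sketch, not an optimized implementation; the random initialization and the stopping test are arbitrary choices made for the example), the batch K-means can be written in Python/NumPy as:

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    # Batch K-means: alternate the assignment step (Eq. 1.55) and the center update (Eq. 1.59).
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], K, replace=False)]          # random initial centers
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(max_iter):
        # assignment: each x_i goes to the nearest center (hard membership r_ik)
        labels = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # update: each center becomes the mean of the points assigned to it
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                            # convergence of parameters
            break
        mu = new_mu
    J = ((X - mu[labels]) ** 2).sum()                          # distortion (Eq. 1.54)
    return mu, labels, J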

1.6.6.1 K-means Limitations

The solution found with the K-means algorithm, that is, the prototypes of the K
classes, can be evaluated in terms of the distances between the dataset points and the
prototypes themselves. This distance information provides a measure of data distortion. Different
solutions are obtained with the different initializations that can be given to the
algorithm. The value of the distortion can guide toward the best solution found on the

Fig. 1.13 Classification of the RGB pixels of the image in (a) with the K-means method. In b for
K = 3; c for K = 5; d for K = 6; e for K = 7; and f for K = 8

same dataset, i.e., the solution with minimum distortion. It is often useful to proceed by trial
and error, varying the number K of classes. In general, this algorithm converges
in a dozen steps, although there is no rigorous proof of its convergence.
It is also influenced by the order in which the patterns are presented. Furthermore, it
is sensitive to noise and outliers: a small number of the latter can substantially
influence the average value. It is not suitable when the cluster distribution has non-convex
geometric shapes. Another limitation is given by the membership variables, also
called responsibility variables z_{it}, which assign the i-th data point to the cluster t in a hard,
or binary, way. In the fuzzy C-means, but also in the mixture of Gaussians dealt
with below, these variables are treated as soft, with values that vary between zero
and one.

1.6.7 ISODATA Classifier

The ISODATA classifier [13] (Iterative Self-Organizing Data Analysis Technique
A10) can be seen as an improvement of the K-means classifier. In fact, K-means
can produce clusters that contain too few patterns to be significant. Some clusters
may also be very close to each other, and it can be useful to combine them
(merging). Other clusters, on the other hand, can show very elongated geometric
configurations, and in this case it can be useful to divide them into two new clusters
(splitting) on the basis of predefined criteria, such as a threshold on the
standard deviation calculated for each cluster and the distance calculated between
cluster centers [14].

10 The final A is added to simplify the pronunciation.



The ISODATA procedure requires several input parameters: the approximate
desired number K of clusters, the minimum number N_minc of patterns per cluster,
the maximum standard deviation σ_s to be used as the threshold for decomposing clusters,
the maximum distance D_union permissible for a merge, and the maximum number
N_mxunion of clusters eligible for merging. The essential steps of the ISODATA iterative
procedure are as follows:

1. Apply the K-means clustering procedure.
2. Delete clusters with too few patterns, according to N_minc.
3. Merge pairs of clusters that are very close to each other, according to D_union.
4. Split large clusters, according to σ_s, presumably containing dissimilar patterns.
5. Iterate the procedure starting from step 1, or end if the maximum number of
admissible iterations is reached.
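The following Python/NumPy fragment is only a simplified sketch of the discard/split/merge heuristics of steps 2–4: it omits, among other things, the N_mxunion limit and several refinements of the original procedure [13,14], and the split offset along the feature of maximum spread is just one possible choice.

import numpy as np

def isodata_heuristics(clusters, N_minc, sigma_s, D_union):
    # clusters: list of (n_k, M) arrays of patterns, one per current cluster.
    # Returns a new list of cluster centers; the patterns are then reassigned
    # by a further K-means pass (step 1 of the procedure).
    clusters = [P for P in clusters if len(P) >= N_minc]       # discard small clusters
    centers = [P.mean(axis=0) for P in clusters]
    split = []
    for P, c in zip(clusters, centers):                        # split elongated clusters
        sigma = P.std(axis=0)
        j = int(np.argmax(sigma))
        if sigma[j] > sigma_s:
            shift = np.zeros_like(c)
            shift[j] = sigma[j]
            split += [c + shift, c - shift]                    # two centers along feature j
        else:
            split.append(c)
    merged, used = [], set()
    for a in range(len(split)):                                # greedy merge of close centers
        if a in used:
            continue
        for b in range(a + 1, len(split)):
            if b not in used and np.linalg.norm(split[a] - split[b]) < D_union:
                merged.append((split[a] + split[b]) / 2.0)
                used.update((a, b))
                break
        else:
            merged.append(split[a])                            # no close partner: keep center a
    return merged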

The ISODATA classifier has been applied with good results to multispectral images with
a high number of bands. The heuristics adopted to limit insignificant
clusters, together with the ability to split dissimilar clusters and merge similar
ones, make the classifier very flexible and effective. The problem
remains that geometrically curved clusters are difficult to
manage even with ISODATA. Obviously, the initial parameters must be refined, with different
attempts, repeating the procedure several times. Like K-means, ISODATA
does not guarantee convergence a priori, even if, in real applications with clusters
that do not overlap much, convergence is obtained after dozens of iterations.

1.6.8 Fuzzy C-means Classifier

The Fuzzy C-Means (FCM) classifier is the fuzzy version of the K-means and is
characterized by the fuzzy theory, which allows three conditions:

1. Each pattern can belong, with a certain degree, to multiple classes.
2. The sum of the degrees to which each pattern belongs to all clusters must be equal
to 1.
3. The sum of the memberships of all the patterns in each cluster cannot exceed N
(the number of patterns).

The fuzzy version of the K-means proposed by Bezdek [15], also known as Fuzzy
ISODATA, differs from the previous one in the membership function. In this
algorithm, each pattern x has a smooth membership function r, i.e., it is not
binary but defines the degree to which the data belongs to each cluster. This algorithm
partitions the dataset X = {x_i}_{i=1}^{N} of N observations into K fuzzy groups and finds the
cluster centers, in a similar way to K-means, so as to minimize the similarity cost
function J. The partitioning of the dataset is therefore done in a fuzzy way, so that each
given x_i ∈ X has a membership value for each cluster between 0 and 1. Therefore,
the membership matrix R is not binary, but has values between 0 and 1. In any case,

the condition (1.56) is imposed. So the objective function becomes

J = Σ_{i=1}^{N} Σ_{k=1}^{K} r_{ik}^m ‖x_i − μ_k‖²     (1.61)

with m ∈ [1, ∞) an exponent representing a weight (the fuzzification constant, which
sets the fuzziness level of the classifier). The necessary condition for the minimum
of (1.61) can be found by introducing the Lagrange multipliers λ_i,
which define the following cost function subject to the N constraints of (1.56):

J̄ = J + Σ_{i=1}^{N} λ_i (Σ_{k=1}^{K} r_{ik} − 1)
  = Σ_{k=1}^{K} Σ_{i=1}^{N} r_{ik}^m ‖x_i − μ_k‖² + Σ_{i=1}^{N} λ_i (Σ_{k=1}^{K} r_{ik} − 1)     (1.62)

Now, differentiating (1.62) with respect to μ_k and λ_i and setting the result to zero, we obtain:

μ_k = Σ_{i=1}^{N} r_{ik}^m x_i / Σ_{i=1}^{N} r_{ik}^m,     (1.63)

and

r_{ik} = 1 / Σ_{t=1}^{K} (‖x_i − μ_k‖ / ‖x_i − μ_t‖)^{2/(m−1)}     k = 1, . . . , K;  i = 1, . . . , N     (1.64)

In the batch version, the algorithm is reported in Algorithm 2. We observe the iterative
nature of the algorithm, which alternately determines the centroids μ_k of the clusters
and the memberships r_{ik} until convergence.
It should be noted that, if the exponent m = 1 in the objective function (1.61), the
fuzzy C-means algorithm approaches the hard K-means algorithm, since the membership
levels of the patterns to the clusters produced by the algorithm become
0 and 1. At the extreme value m → ∞, the objective function tends to J → 0.
Normally m is chosen equal to 2. The FCM classifier is often applied, in particular, to
the classification of multispectral images. However, performance remains limited by
the intrinsic geometry of the clusters. As for K-means, also for FCM an elongated
or curved grouping of the patterns in the feature space can produce unrealistic results.

Algorithm 2 Fuzzy C-means algorithm
1: Initialize the membership matrix R with random values between 0 and 1, keeping Eq. (1.56)
satisfied.
2: Calculate the fuzzy cluster centers with (1.63) ∀ k = 1, . . . , K.
3: Calculate the cost function according to Eq. (1.61). Evaluate the stop criterion: stop if J <
Threshold or if the difference J_t − J_{t−1} < Threshold, with t the iteration step.
4: Calculate the new matrix R using (1.64) and go to step 2.
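By way of example (a minimal sketch of Algorithm 2 with arbitrary initialization, tolerance, and m = 2; the small constant added to the distances only avoids division by zero), the fuzzy C-means iteration can be written in Python/NumPy as:

import numpy as np

def fuzzy_cmeans(X, K, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # Alternate the center update (Eq. 1.63) and the membership update (Eq. 1.64).
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    R = rng.random((N, K))
    R /= R.sum(axis=1, keepdims=True)            # rows sum to 1, as required by Eq. (1.56)
    J_old = np.inf
    for _ in range(max_iter):
        W = R ** m                               # fuzzified memberships r_ik^m
        mu = (W.T @ X) / W.sum(axis=0)[:, None]  # fuzzy cluster centers (Eq. 1.63)
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) + 1e-12
        J = np.sum(W * dist ** 2)                # objective function (Eq. 1.61)
        if abs(J_old - J) < tol:                 # stop criterion of Algorithm 2
            break
        J_old = J
        # membership update (Eq. 1.64): r_ik = 1 / sum_t (d_ik / d_it)^(2/(m-1))
        R = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1)), axis=2)
    return mu, R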

1.7 Statistical Method

The statistical approach, in analogy to the deterministic one, uses a set of decision
rules based, however, on statistical theory. In particular, the discriminating functions
can be constructed by estimating the density functions and applying the Bayes rules.
In this case, the proposed classifiers are of the parametric type extracting information
directly from the observations.

1.7.1 MAP Classifier

A deterministic classifier systematically assigns a pattern to a given class. In reality,


a vector pattern x, described by the features x1 , . . . , x M , can assume values such that
a classifier can incorrectly associate it with one of the classes ωk , k = 1, . . . , K .
Therefore, the need emerges for an optimal classifier that can identify patterns
based on criteria with minimal error. Bayes' decision theory is based on the
foundation that the decision/classification problem can be formulated in stochastic terms,
under the hypothesis that the densities of the variables are known or estimated from the
observed data. Moreover, this theory can estimate costs or risks in a probabilistic
sense related to the decisions taken.
Let x be the pattern to be classified in one of the K classes ω_1, . . . , ω_K of which we
know the a priori probabilities p(ω_1), . . . , p(ω_K) (which are independent of
observations). A simple decision rule, which minimizes the probability of error in
assigning the class ω_i, is the following:

p(ω_i) > p(ω_k)     k = 1, . . . , K;  k ≠ i     (1.65)

This rule assigns all the patterns to a single class, that is, the class with the highest a priori
probability. This rule makes sense if the a priori probabilities of the classes are very
different from each other, that is, p(ω_i) ≫ p(ω_k). We can now assume to know,
for each class ω_i, an adequate number of sample patterns x, from which we can
evaluate the conditional probability distribution p(x|ω_i) of x given the class ω_i, that is,
estimating the probability density of x assuming the association to the class ω_i. At
this point, it is possible to adopt a probabilistic decision rule to associate a generic
pattern x with a class ω_i, in terms of conditional probability, if the probability of
the class ω_i given the generic pattern x, or p(ω_i|x), is greater than that of all other classes.
In other words, the generic pattern x is assigned to the class ω_i if the following
condition is satisfied:

p(ω_i|x) > p(ω_k|x)     k = 1, . . . , K;  k ≠ i     (1.66)

The probability p(ωi |x) is known as the posterior probability of the class ωi given x,
that is, the probability that having observed the generic pattern x, the class to which

it belongs is ω_i. This probability can be estimated with the Bayes theorem11:

p(ω_i|x) = p(x|ω_i) p(ω_i) / p(x)     (1.67)
where

(a) ω_i is the unknown class, to be estimated, to associate with the observed
pattern x;
(b) p(ω_i) is the a priori probability of the class ω_i, that is, it represents the part of our
knowledge on which the classification is based (the classes can also be
equiprobable);
(c) p(x|ω_i) is the conditional probability density function of the class, interpreted
as the likelihood of the pattern x, which occurs when its features are known to
belong to the class ω_i;

11 The Bayes theorem can be derived from the definition of conditional probability and the total
probability theorem. If A and B are two events, the probability of the event A when the event B has
already occurred is given by

p(A|B) = p(A ∩ B) / p(B)     if p(B) > 0

and is called conditional probability of A conditioned on B, or simply probability of A given B.
The denominator p(B) simply normalizes the joint probability p(A, B) of the two events occurring
together. If we consider the space S of the events partitioned into B_1, . . . , B_K, any event A
can be represented as

A = A ∩ S = A ∩ (B_1 ∪ B_2 ∪ · · · ∪ B_K) = (A ∩ B_1) ∪ (A ∩ B_2) ∪ · · · ∪ (A ∩ B_K).

If B_1, . . . , B_K are mutually exclusive, we have that

p(A) = p(A ∩ B_1) + · · · + p(A ∩ B_K)

and, replacing the conditional probabilities, the total probability of any event A is given by

p(A) = p(A|B_1) p(B_1) + · · · + p(A|B_K) p(B_K) = Σ_{k=1}^{K} p(A|B_k) p(B_k)

By combining the definitions of conditional probability and the total probability theorem, we obtain
the probability of the event B_i, if we suppose that the event A happened, as follows:

p(B_i|A) = p(A ∩ B_i) / p(A) = p(A|B_i) p(B_i) / Σ_{k=1}^{K} p(A|B_k) p(B_k)

known as the Bayes Rule or Theorem, which represents one of the most important relations in the
field of statistics.

(d) p(x) is known as evidence, i.e., the absolute probability density given by

p(x) = Σ_{k=1}^{K} p(x|ω_k) p(ω_k)     with Σ_{k=1}^{K} p(ω_k) = 1     (1.68)

which represents a normalization constant and does not influence the decision.

From the (1.67), the discriminant functions dk (x) can be considered as

dk (x) = p(x|ωk ) p(ωk ) k = 1, . . . , K (1.69)

which, up to a constant factor, corresponds to the value of the posterior probability
p(ω_k|x), which expresses how often a pattern x belongs to the class ω_k. The rule (1.65)
can, therefore, be rewritten as the optimal rule that classifies the generic pattern
x and associates it with the class ω_k if the posterior probability p(ω_k|x) is the highest
of all possible posterior probabilities:

p(ω_k|x) = arg max_{i=1,...,K} p(ω_i|x)     (1.70)

known as the maximum a posteriori (MAP) probability decision rule. Also known
as the Bayes optimal rule for the minimum error of classification.

1.7.2 Maximum Likelihood Classifier—ML

The MAP decision rule12 can be re-expressed in another form. For simplicity, we
consider the rule for two classes ω_1 and ω_2, and we apply the rule defined by
(1.66), which assigns a generic pattern x to the class with the highest posterior
probability. In this case, applying the Bayes rule (1.67) to (1.66), and eliminating
the common term p(x), we would have

p(x|ω1 ) p(ω1 ) > p(x|ω2 ) p(ω2 ) (1.71)

which assigns x to the class ω_1 if satisfied, otherwise to the class ω_2. This last
relationship can be rewritten as follows:

Λ(x) = p(x|ω_1) / p(x|ω_2) > p(ω_2) / p(ω_1)     (1.72)

which assigns x to the class ω_1 if satisfied, otherwise to the class ω_2. Λ(x) is called
the likelihood ratio and the corresponding decision rule is known as the likelihood test. We

12 In Bayesian statistics, MAP (Maximum A Posteriori) indicates an estimate of an unknown
quantity that equals the mode of the posterior probability distribution. In essence, the mode is the
most frequently occurring value in a distribution (peak value).

Fig. 1.14 The conditional density functions of the classes and their decision regions
observe that in the likelihood test the evidence p(x) does not appear (while it is
necessary for the MAP rule for the calculation of the posterior probability p(ω_k|x)),
since it is a constant not influenced by the class ω_k. The likelihood test is in fact
a test that estimates how good the assignment decision is, based on the comparison
between ratios of a priori knowledge, i.e., conditional probabilities (likelihoods) and
a priori probabilities. If the latter p(ω_k) turn out to be equiprobable, then the test
is performed only by comparing the likelihoods p(x|ω_k), thus becoming the rule of
Maximum Likelihood (ML). This last rule is also used when the p(ω_k) are not known.
The decision rule (1.71) can also be expressed in geometric terms by defining the
decision regions. Figure 1.14 shows the decision regions ℜ_1 and ℜ_2 for the separation
of two classes, assuming that the classes ω_1 and ω_2 both have Gaussian distribution. In
the figure, the graphs of p(x|ω_i) p(ω_i), i = 1, 2, are displayed, with different a priori
probabilities p(ω_i). The theoretical boundary of the two regions is determined
by

p(x|ω_1) p(ω_1) = p(x|ω_2) p(ω_2)

In the figure, the boundary corresponds to the point of intersection of the two Gaussians.
Alternatively, the boundary can be determined by calculating the likelihood
ratio Λ(x) and setting a threshold θ = p(ω_2)/p(ω_1). Therefore, with the likelihood test,
the decision regions result in

ℜ_1 = {x : Λ(x) > θ}     and     ℜ_2 = {x : Λ(x) < θ}

1.7.2.1 Example of Nonparametric Bayesian Classification


We are interested in classifying the land areas (class ω_1) and water areas (class ω_2) of a
territory, having two spectral bands available. No knowledge of the a priori and conditional
probabilities is assumed other than that obtained by observing the spectral measurements
of samples of the two classes extracted from the two available bands. As shown in
Fig. 1.15, the samples associated with the two classes are extracted from the 4
windows (i.e., the training set, 2 for each class) identified on the two components of the

[Fig. 1.15: spatial domain (bands x1 and x2 of the multispectral image with the training windows) and feature domain with the projected training samples]
Fig. 1.15 Example of the nonparametric Bayes classifier that classifies 2 types of territory (land
and river) in the spectral domain starting from the training sets extracted from two bands (x1 , x2 )
of a multispectral image

multispectral image. In the 2D spectral domain, the training set samples of which
we know the membership class are projected. A generic pixel pattern with spectral
measurements x = (x1 , x2 ) (in the figure indicated with the circle) is projected in
the features domain and associated with one of the classes using the nonparametric
MAP classifier. From the training sets, we can have a very rough estimate of the a
priori probabilities p(ωi ), of the likelihoods p(x|ωi ), and of the evidence p(x), as
follows:
p(ω_1) = n_{ω1}/N = 18/68 = 0.26     p(ω_2) = n_{ω2}/N = 50/68 = 0.74

p(x|ω_1) = ni_{ω1}/n_{ω1} = 4/18 = 0.22     p(x|ω_2) = ni_{ω2}/n_{ω2} = 7/50 = 0.14

p(x) = Σ_{i=1}^{2} p(x|ω_i) p(ω_i) = 0.22 × 0.26 + 0.14 × 0.74 = 0.1608
i=1

where n_{ω1} and n_{ω2} indicate the number of training set samples belonging,
respectively, to the land and water classes, while ni_{ω1} and ni_{ω2} indicate the number of
samples of the land and water classes, respectively, found in the window
centered at the pattern x to be classified.
Applying the Bayes rule (1.67), we obtain the posterior probabilities:
p(ω_1|x) = p(ω_1) p(x|ω_1) / p(x) = (0.26 × 0.22) / 0.1608 = 0.36     p(ω_2|x) = p(ω_2) p(x|ω_2) / p(x) = (0.74 × 0.14) / 0.1608 = 0.64

For the MAP decision rule (1.70), the pattern x is assigned to the class ω2 (water
zone).
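The arithmetic of this example can be reproduced with a few lines of Python (the counts are those read from the training windows of Fig. 1.15; the variable names are chosen only for the illustration):

# counts from the training windows of Fig. 1.15
n_w1, n_w2 = 18, 50                  # training samples of land (w1) and water (w2)
N = n_w1 + n_w2
ni_w1, ni_w2 = 4, 7                  # samples of each class in the window around x

p_w1, p_w2 = n_w1 / N, n_w2 / N                  # a priori probabilities (about 0.26, 0.74)
lik_w1, lik_w2 = ni_w1 / n_w1, ni_w2 / n_w2      # likelihoods p(x|w_i) (about 0.22, 0.14)
p_x = lik_w1 * p_w1 + lik_w2 * p_w2              # evidence (Eq. 1.68), about 0.16

post_w1 = lik_w1 * p_w1 / p_x                    # posterior p(w1|x), about 0.36
post_w2 = lik_w2 * p_w2 / p_x                    # posterior p(w2|x), about 0.64
print("assign x to", "w1 (land)" if post_w1 > post_w2 else "w2 (water)")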

1.7.2.2 Calculation of the Bayes Error Probability


To demonstrate that the Bayes rule (1.71) is optimal, in the sense that it minimizes
the classification error of a pattern x, we need to evaluate p(error) in terms
of the posterior probability p(error|x). In the classification context with K classes, the
probability of Bayes error can be expressed as follows:

p(error) = Σ_{k=1}^{K} p(error|ω_k) p(ω_k)     (1.73)

where p(error|ω_k) is the probability of incorrect classification of a pattern associated
with the class ω_k, which is given by

p(error|ω_k) = ∫_{C[ℜ_k]} p(x|ω_k) dx     (1.74)

With C[ℜ_k] we indicate the set of regions complementary to the region ℜ_k, that is,
C[ℜ_k] = ∪_{j=1; j≠k}^{K} ℜ_j. That being said, we can rewrite the probability of incorrect
classification of a pattern in the following form:

p(error) = Σ_{k=1}^{K} ∫_{C[ℜ_k]} p(x|ω_k) p(ω_k) dx
         = Σ_{k=1}^{K} p(ω_k) [ 1 − ∫_{ℜ_k} p(x|ω_k) dx ]     (1.75)
         = 1 − Σ_{k=1}^{K} p(ω_k) ∫_{ℜ_k} p(x|ω_k) dx

From which it is observed that the minimization of the error is equivalent to maximizing
the probability of correct classification given by

Σ_{k=1}^{K} p(ω_k) ∫_{ℜ_k} p(x|ω_k) dx     (1.76)

This goal is achieved by maximizing the integrals in (1.76), which is equivalent
to choosing the decision regions ℜ_k for which p(ω_k) p(x|ω_k) is the highest value among
all regions, exactly as imposed by the MAP rule (1.70). This ensures that the MAP
rule minimizes the probability of error.
It can be observed (see Fig. 1.16) how the decision boundary shifts with respect to
the point of equal likelihood p(x|ω_1) = p(x|ω_2) for different values
of the a priori probability.


Fig. 1.16 Elements that characterize the probability of error, considering the conditional density
functions of the classes with normal distributions of equal variance and unequal a priori probability.
The blue area corresponds to the probability of error in assigning a pattern of the class ω_1 (lying
in the region ℜ_1) to the class ω_2. The area in red represents the opposite situation

1.7.2.3 Calculation of the Minimum Risk for the Bayes Rule


With the Bayes Rule, (1.70) has shown that the assignment of a pattern to a class,
choosing the class with the highest a posteriori probability, the choice minimizes the
classification error. With the calculation of the minimum error given by the (1.75),
it is highlighted that a pattern is assigned to a class with the probability of error with
the same unit cost. In real applications, a wrong assignment can have a very different
intrinsic meaning due to a wrong classification. Incorrectly classifying a pixel pattern
of an image (normally millions of pixels) is much less severe than a pattern context
used to classify a type of disease. It is, therefore, useful to formulate a new rule that
defines, through a cost function, how to differently weigh the probability of assigning
a pattern to a class. In essence, the problem is formulated in terms of the minimum
risk theory (also called utility theory in economy) which, in addition to considering
the probability of an event occurrence, also takes into account a cost associated with
the decision/action (in our case, the action is that of assigning a pattern to a class).
Let  = {ω1 , . . . , ω K } the set of K classes and be A = {α1 , . . . , αa } the set of
a possible actions. We now define a cost function C(αi |ω j ) that indicates the cost
that would be done by performing the αi action that would assign a x pattern to the
class ωi when instead the class (the state of nature) is ω j . With this setting, we can
evaluate the conditional risk (or expected cost) R(αi |x) associated with the action
αi to assign the observed pattern x to a class.
Knowing the posterior probabilities p(ω_j|x), but not the true class to
associate with the pattern, the conditional risk associated with the action α_i is given by

R(α_i|x) = Σ_{j=1}^{K} C(α_i|ω_j) p(ω_j|x)     i = 1, . . . , a     (1.77)

The zero-order conditional risk R, considering the zero-order cost function defined
by

C(α_i|ω_j) = 0 if i = j,  1 if i ≠ j     i, j = 1, . . . , K     (1.78)

results, according to the Bayes decision rule, in

R(α_i|x) = Σ_{j≠i} p(ω_j|x) = 1 − p(ω_i|x)     i = 1, . . . , a     (1.79)

from which it can be deduced that we can minimize the conditional risk by selecting the
action that minimizes R(α_i|x) to classify the observed pattern x. It follows that we
need to find a decision rule α(x) which relates the input space of the features with
that of the actions, and calculate the overall risk R_T given by

R_T = Σ_{i=1}^{K} R(α_i|x) = Σ_{i=1}^{K} ∫_{ℜ_i} [ Σ_{j=1}^{K} C(α_i|ω_j) p(ω_j|x) ] p(x) dx     (1.80)

which will be minimal by selecting αi for which R(αi |x) is minimum for all x. The
Bayes rule guarantees overall risk minimization by selecting the action α ∗ which
minimizes the conditional risk (1.77):


α* = arg min_{α_i} R(α_i|x) = arg min_{α_i} Σ_{j=1}^{K} C(α_i|ω_j) p(ω_j|x)     (1.81)

thus obtaining the Bayes Risk which is the best achievable result.
Let us now calculate the minimum risk for an example of binary classification.
Let α_1 be the action of deciding that the correct class is ω_1, and similarly α_2 for ω_2.
We evaluate the conditional risks by rewriting (1.77) in extended form:

R(α_1|x) = C_{11} p(ω_1|x) + C_{12} p(ω_2|x)
R(α_2|x) = C_{21} p(ω_1|x) + C_{22} p(ω_2|x)

The fundamental decision rule is to decide for ω1 if

R(α1 |x) < R(α2 |x)

which in terms of posterior probability (remembering the 1.71) is equivalent to decid-


ing for ω1 if
(C21 − C11 ) p(ω1 |x) > (C12 − C22 ) p(ω2 |x)

highlighting that the posterior probability is scaled by the cost differences (normally
positive). Applying the Bayes rule to the latter (remembering the 1.71), we decide

Fig. 1.17 Thresholds of the likelihood ratio Λ(x) (related to the distributions of Fig. 1.16) for the
zero-order cost function
for ω1 if
(C21 − C11 ) p(x|ω1 ) p(ω1 ) > (C12 − C22 ) p(x|ω2 ) p(ω2 ) (1.82)

Assuming that C21 > C11 and remembering the definition of likelihood ratio
expressed by the (1.72), the previous Bayes rule can be rewritten as follows:

Λ(x) = p(x|ω_1) / p(x|ω_2) > (C_{12} − C_{22}) p(ω_2) / [(C_{21} − C_{11}) p(ω_1)]     (1.83)
Equation (1.83) states the optimal decision property: if the likelihood ratio exceeds a
certain threshold, which is independent of the observed pattern x, decide for the
class ω_1. An immediate application of (1.83) is given by the zero-order cost function
(1.78). In this case, the decision rule is the MAP rule, that is, x is classified to
the class ω_i if p(ω_i|x) > p(ω_j|x) for each j ≠ i. Expressing this in terms of likelihood
with (1.83), we will have

(C_{12} − C_{22}) p(ω_2) / [(C_{21} − C_{11}) p(ω_1)] = θ_C

for which, once the threshold θ_C is known, it is decided that x belongs to the class ω_1 if
p(x|ω_1)/p(x|ω_2) > θ_C. For the zero-order cost function with the cost matrix C = [0 1; 1 0], we have
θ_C = 1 · p(ω_2)/p(ω_1) = θ_1, while for C = [0 2; 1 0], we have θ_C = 2 · p(ω_2)/p(ω_1) = θ_2.
Figure 1.17 shows the graph of the likelihood ratio Λ(x) and the thresholds θ_1
and θ_2. Considering the generic threshold θ = C_{12}/C_{21}, it is observed that an
increase of the cost on the class ω_1 leads to a reduction of the corresponding region
ℜ_1. This implies that for equiprobable classes (with threshold θ_1) p(ω_1) = p(ω_2),
we have C_{12} = C_{21} = 1. With the threshold θ_2, we have p(ω_1) > p(ω_2), for which
the region ℜ_1 is reduced.
This highlights the advantage of handling the decision problem, through the
likelihood ratio, with the scalar value of the threshold θ, without direct
knowledge of the regions that normally describe the N-dimensional feature
space.
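As an illustration of the threshold test (1.83) (a sketch only: the Gaussian class-conditional densities, the parameter values, and the cost matrix below are arbitrary assumptions, in the spirit of Figs. 1.16 and 1.17):

import numpy as np

def gauss_pdf(x, mu, sigma):
    # one-dimensional normal density N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def likelihood_ratio_decision(x, mu1, s1, mu2, s2, p1, p2, C12=1.0, C21=1.0, C11=0.0, C22=0.0):
    # Eq. (1.83): decide omega_1 if Lambda(x) exceeds the cost-weighted threshold.
    Lam = gauss_pdf(x, mu1, s1) / gauss_pdf(x, mu2, s2)        # likelihood ratio
    theta = (C12 - C22) * p2 / ((C21 - C11) * p1)              # threshold theta_C
    return "omega_1" if Lam > theta else "omega_2"

# with zero-order costs and equal priors the test reduces to the ML rule
print(likelihood_ratio_decision(0.5, mu1=0.0, s1=1.0, mu2=2.0, s2=1.0, p1=0.5, p2=0.5))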

1.7.2.4 Bayes Decision Rule with Rejection


In the preceding paragraphs, we have seen that a Bayesian classifier, as good as
possible, can happen that erroneously assigns a pattern to a class. When the error
turns out to be very expensive, it may be rational not to risk the wrong decision
and it is useful not to take a decision at the moment and, if necessary, delay it for
a later decision-making phase. In this way, the patterns that would potentially be
classified incorrectly are grouped in a class ω0 and classified as r ejected belonging
to a region 0 in the feature domain. Subsequently, they can be analyzed with a
manual or automatic ad hoc classification procedure. In classification applications
with differentiated decision costs, the strategy is to define a compromise between
acceptable error and rational rejection. This error-rejection compromise was initially
formulated in [16] defining a general relationship between the probability of error
and rejection. According to Chow’s rule, a x pattern is r ejected if the maximum
posterior probability is lower than a threshold value t ∈ [0, 1]:

arg max_{i=1,...,K} p(ω_i|x) = p(ω_k|x) < t     (1.84)

Alternatively, a pattern x is accepted and assigned according to the Bayes rule to
the class ω_k if

arg max_{i=1,...,K} p(ω_i|x) = p(ω_k|x) ≥ t     (1.85)

It is shown that the threshold t to be chosen to carry out the rejection must be
t < (K − 1)/K, where K is the number of classes. In fact, if the classes are equiprobable,
the minimum value reachable by max_i p(ω_i|x) is 1/K, because the following relation
must be satisfied:

1 = Σ_{i=1}^{K} p(ω_i|x) ≤ K · max_{i=1,...,K} p(ω_i|x)     (1.86)

Figure 1.18 shows the rejection region ℜ_0 associated with a threshold t for the two
Gaussian classes of Fig. 1.16. The patterns that fall into the regions ℜ_1 and ℜ_2 are
regularly classified with the Bayes rule. It is observed that the value of the threshold
t strongly influences the size of the rejection region.
For a given threshold t, the probability of correct classification c(t) is given by
(1.76), considering only the acceptance regions (ℜ_0 is excluded):

c(t) = Σ_{k=1}^{K} p(ω_k) ∫_{ℜ_k} p(x|ω_k) dx     (1.87)

The unconditional probability of rejection r(t), that is, the probability of a pattern
falling into the region ℜ_0, is given by

r(t) = ∫_{ℜ_0} p(x) dx     (1.88)

Fig. 1.18 Chow's rule applied for classification with rejection for two classes with Gaussian
distribution. The rejection threshold t defines the rejection region ℜ_0, the area where patterns with
a high level of classification uncertainty fall

The value of the error e(t), associated with the probability of accepting to classify a
pattern and classifying it incorrectly, is given by

e(t) = Σ_{k=1}^{K} ∫_{ℜ_k} [1 − max_i p(ω_i|x)] p(x) dx = 1 − c(t) − r(t)     (1.89)

From this relation, it is evident that a given value of correct classification c(t) =
1 − r(t) − e(t) can be obtained by choosing to reduce the error e(t) and simultaneously
increase the rejection rate r(t), that is, by harmonizing the error–rejection compromise,
the two being inversely related to each other.
If a cost C_{ij} is considered also for the assignment of a pattern to the rejection class
ω_0 (normally lower than the cost of a wrong classification), the cost function is modified
as follows:

C_{ij} = 0 if i = j;  1 if i ≠ j;  t if i = 0 (rejection class ω_0)     i = 0, . . . , K;  j = 1, . . . , K     (1.90)

In [16], the following decision rule with optimal rejection α(x) is demonstrated (see Fig. 1.18),
which is also the minimum risk rule if the cost function is uniform
within each decision class:

α(x) = ω_i  if (p(ω_i|x) > p(ω_j|x)) ∧ (p(ω_i|x) > t) ∀ j ≠ i;     ω_0 (reject) otherwise     (1.91)

where the rejection threshold t is expressed according to the cost of error e, the cost
of rejection r, and the cost of correct classification c, as follows:

t = (e − r) / (e − c)     (1.92)

where c ≤ r guarantees that t ∈ [0, 1], while if e = r we are back to the
Bayes rule. In essence, Chow's rejection rule attempts to reduce the error by rejecting
border patterns between regions whose classification is uncertain.
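Chow's rule (1.91), given the posterior probabilities, amounts to a simple test on their maximum; a minimal sketch (with hypothetical names and example values) is:

import numpy as np

def chow_classify(posteriors, t):
    # posteriors: array of p(omega_i|x), i = 1..K, summing to 1; t must satisfy t < (K-1)/K.
    k = int(np.argmax(posteriors))
    return k if posteriors[k] >= t else -1       # -1 stands for the rejection class omega_0

# an uncertain pattern near the class boundary is rejected, a confident one is accepted
print(chow_classify(np.array([0.55, 0.45]), t=0.6))   # -> -1 (rejected)
print(chow_classify(np.array([0.90, 0.10]), t=0.6))   # -> 0  (class omega_1)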

1.7.3 Other Decision Criteria

The Bayes criteria described, based on the MAP decision rule or on maximum likelihood
(ML), require defining the cost values C_{ij} and knowing the a priori probabilities p(ω_i).
In applications where this information is not known, the Minimax and the
Neyman–Pearson decision criteria are proposed in the literature [17,18]. The Minimax
criterion is used in applications where the recognition system must guarantee good
behavior over a range of possible values rather than for a given a priori probability
value. In these cases, although the a priori probability is not known, its variability
can be known for a given interval. The strategy used in these cases is to minimize
the maximum value of the risk as the prior probability varies.
The Neyman–Pearson criterion is used in applications where there is a need to limit
the probability of error within a class instead of optimizing the overall conditional
risk as with the Bayes criterion. For example, we may want to fix a certain bound on
the probability of error associated with a false alarm and minimize the probability
of a missed alarm, as required in radar applications. This criterion evaluates the
probability of error ε_1 of classifying patterns of ω_1 in the class ω_2 and, vice versa, the
probability of error ε_2 for patterns of the class ω_2 attributed to the class ω_1.
The strategy of this criterion is to minimize the error on the class ω_1 by finding
the minimum of ε_1 = ∫_{ℜ_2} p(x|ω_1) dx while constraining the error ε_2, limiting it
below a value α, that is, ε_2 = ∫_{ℜ_1} p(x|ω_2) dx < α. The criterion is set as a constrained
optimization problem, whose solution is obtained with the Lagrange multipliers approach,
which minimizes the objective function:

F = ε_1 + λ(ε_2 − α)

Note the absence of the p(ω_i) and of the costs C_{ij}, while the decision regions ℜ_i
are to be determined with the minimization procedure.

1.7.4 Parametric Bayes Classifier

The Bayes decision rule (1.67) requires knowledge of all the conditional probabilities
of the classes and of the a priori probabilities. The functional form and exact parameters of
these density functions are rarely available. Once the nature of the observed patterns
is known, it is possible to hypothesize a parametric model for the probability density
functions and estimate the parameters of this model through sample patterns.
Therefore, an approach used to estimate the conditional probabilities p(x|ω_i) is based
on a training set of patterns P_i = {x_1, . . . , x_{n_i}}, x_j ∈ R^d, associated with the class
ω_i. In the parametric context, we assume the form (for example, Gaussian) of the
probability distribution of the classes and the unknown parameters θ_i that describe
it. The estimation of the parameters θ_k, k = 1, . . . , n_p (for example, in the Gaussian
case θ_1 = μ, θ_2 = σ and p(x) = N(μ, σ)) can be done with the known approaches
of maximum likelihood or Bayesian estimation.

1.7.5 Maximum Likelihood Estimation—MLE

The form of the hypothesized model (e.g., Gaussian) is assumed known, but its parameters
(for example, mean and variance) are to be determined (they represent
the unknowns). The estimation of the parameters can be influenced by the choice
of the training sets, and an optimal result is obtained using a significant number
of samples. With the MLE method, the goal is to estimate the parameters θ̂_i which
maximize the likelihood function p(x|ω_i) = p(x|θ_i) defined using the training set
P_i:

θ̂_i = arg max_{θ_i} [p(P_i|θ_i)] = arg max_{θ_i} [p(x_1, . . . , x_{n_i}|θ_i)]     (1.93)

If we assume that the patterns of the training set P_i = {x_1, . . . , x_{n_i}} form a sequence
of independent and identically distributed (iid) random variables,13 the likelihood
function p(P_i|θ_i) associated with class ω_i can be expressed as follows:

p(P_i|θ_i) = Π_{k=1}^{n_i} p(x_k|θ_i)     (1.94)

This probability density is considered as an ordinary function of the variable θ_i,
dependent on the n_i patterns of the training set. In real applications, the i.i.d.
assumption does not always hold, and the training sets should be chosen as far as possible
under conditions of independence of the observed patterns. In common practice,
assuming this independence, the MLE method (1.93) is the best solution to estimate
the parameters that describe the known model of the probability density function
p(P_i|θ_i).
A mathematical simplification is adopted by expressing (1.93) in logarithmic terms;
substituting (1.94), we get

θ̂_i = arg max_{θ_i} log[ Π_{k=1}^{n_i} p(x_k|θ_i) ] = arg max_{θ_i} Σ_{k=1}^{n_i} log p(x_k|θ_i)     (1.95)

The logarithmic function has the property of being monotonically increasing and allows
(1.93) to be expressed in terms of sums instead of products, thus
simplifying the search for the maximum, especially when the probability
function model has exponential terms, as happens under the assumption of a Gaussian
distribution. Given the independence of the training sets P_i = {x_1, . . . , x_{n_i}} of patterns
associated with the K classes ω_i, i = 1, . . . , K, we will omit the index i which
indicates the class when estimating the related parameters θ_i. In essence, the parameter
estimation procedure is repeated independently for each class.

13 Implies that the patterns all have the same probability distribution and are all statistically inde-

pendent.

1.7.5.1 MLE Estimate for Gaussian Distribution with Mean Unknown


According to the classical definition of the central limit theorem, the probability
distribution of the sum (or mean) of i.i.d. variables with finite mean and variance
approaches the Gaussian distribution. We denote by P = {x_1, . . . , x_n} the training
set of d-dimensional patterns of a generic class ω whose distribution is assumed to
be Gaussian, p(x) = N(μ, Σ), where μ is the unknown mean vector and Σ is the
covariance matrix. Considering the generic pattern x_k, the MLE estimate of the mean
vector μ, by (1.95), is

θ̂ = μ̂ = arg max_θ Σ_{k=1}^{n} log(p(x_k|θ))
      = arg max_θ Σ_{k=1}^{n} log[ (1/((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x_k − μ)^T Σ^{-1} (x_k − μ)) ]     (1.96)
      = arg max_θ Σ_{k=1}^{n} [ log(1/((2π)^{d/2} |Σ|^{1/2})) − ½ (x_k − μ)^T Σ^{-1} (x_k − μ) ]

The maximum likelihood value of the function for the sample patterns P is obtained
by differentiating with respect to the parameter θ and setting the result to zero:

∂[Σ_{k=1}^{n} log(p(x_k|θ))] / ∂θ = Σ_{k=1}^{n} Σ^{-1} (x_k − μ) = 0     (1.97)

from which, by eliminating the factor Σ^{-1}, we get

μ̂ = (1/n) Σ_{k=1}^{n} x_k     (1.98)

It is observed that the estimate of the mean (1.98) obtained with the MLE approach
leads to the same result as the mean calculated in the traditional way, as the average
of the training set patterns.

1.7.5.2 MLE Estimate for Gaussian Distribution with μ and Σ Unknown

The problem is solved as before; the only difference consists in the calculation of the
gradient ∇_θ instead of the derivative, since we have two variables to estimate, θ̂ =
(θ̂_1, θ̂_2)^T = (μ̂, Σ̂)^T. For simplicity, let us consider the one-dimensional case with
the training set of patterns P assumed with normal distribution p(x) = N(μ, σ²),
with μ and σ to be estimated. For the generic pattern x_k, calculating the gradient, we have

∇_θ = [ ∂/∂θ_1 Σ_{k=1}^{n} log(p(x_k|θ));  ∂/∂θ_2 Σ_{k=1}^{n} log(p(x_k|θ)) ]
    = Σ_{k=1}^{n} [ (1/θ_2)(x_k − θ_1);  −1/(2θ_2) + (x_k − θ_1)²/(2θ_2²) ]     (1.99)

The maximum likelihood condition is obtained by setting the gradient (1.99) to
zero (∇_θ = 0), obtaining

Σ_{k=1}^{n} (1/θ̂_2)(x_k − θ̂_1) = 0     and     − Σ_{k=1}^{n} 1/(2θ̂_2) + Σ_{k=1}^{n} (x_k − θ̂_1)²/(2θ̂_2²) = 0

from which, solving for θ̂_1 and θ̂_2, we get the estimates of μ and σ², respectively,
as follows:

θ̂_1 = μ̂ = (1/n) Σ_{k=1}^{n} x_k     θ̂_2 = σ̂² = (1/n) Σ_{k=1}^{n} (x_k − μ̂)²     (1.100)

The MLE estimates (1.100) of the mean and variance correspond to the
traditional mean and variance calculated on the training set patterns. Similarly, it can
be shown [17] that the MLE estimates, for a d-dimensional multivariate Gaussian
distribution, are the traditional mean vector μ and covariance matrix Σ, given
by

μ̂ = (1/n) Σ_{k=1}^{n} x_k     Σ̂ = (1/n) Σ_{k=1}^{n} (x_k − μ̂)(x_k − μ̂)^T     (1.101)

Although the maximum likelihood estimates correspond to the traditional calculation
methods of the mean and covariance, the degree of reliability of these values
with respect to the real ones remains to be verified. In other words, how well the
hypothesized Gaussian distribution fits the training set of the selected patterns.
In statistics, this is verified by evaluating whether the estimated parameter has a bias
(distortion or deviation), that is, whether there is a difference between the expected value of
an estimator and the real value of the parameter to be estimated. The MLE estimator is
distorted if its expected value is different from the quantity it
estimates. In this case, for the mean parameter, we have

E[μ̂] = E[ (1/n) Σ_{k=1}^{n} x_k ] = (1/n) Σ_{k=1}^{n} E[x_k] = μ     (1.102)

from which it results that the estimated mean is not distorted (unbiased), while for
the variance estimated with MLE, we have

E[σ̂²] = E[ (1/n) Σ_{k=1}^{n} (x_k − μ̂)² ] = ((n − 1)/n) σ² ≠ σ²     (1.103)

from which it emerges that the variance is distorted (biased). It is shown that the
magnitude of a distorted estimate is related to the number of samples considered,

for n → ∞ the bias asymptotically goes to zero. A simple unbiased estimate of the
covariance matrix is given by

Σ̂_U = (1/(n − 1)) Σ_{k=1}^{n} (x_k − μ̂)(x_k − μ̂)^T     (1.104)
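As a numerical illustration (the synthetic data, the true parameters, and the sample size below are arbitrary assumptions made only for the example), the MLE estimates (1.101) and the unbiased covariance (1.104) can be computed directly in Python/NumPy:

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=500)

n = X.shape[0]
mu_hat = X.mean(axis=0)                          # MLE mean vector (Eq. 1.101)
D = X - mu_hat
Sigma_mle = (D.T @ D) / n                        # MLE covariance, biased (Eq. 1.101)
Sigma_unb = (D.T @ D) / (n - 1)                  # unbiased estimate (Eq. 1.104)
print(mu_hat)
print(Sigma_mle)
print(Sigma_unb)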

1.7.6 Estimation of the Distribution Parameters with the Bayes Theorem

The starting conditions with the Bayesian approach are identical to those of maximum
likelihood: from the training set of patterns P = {x_1, . . . , x_n}, x_j ∈ R^d,
associated with the generic class ω, we assume the form (for example, Gaussian) of
the probability distribution and the unknown parameter vector θ describing it. With
the Bayesian estimate (also known as Bayesian learning), θ is assumed to be a random
variable whose a priori probability distribution p(θ) is known and intrinsically
contained in the training set P.
The goal is to derive the a posteriori probability distribution p(θ|x, P) from the
training set of patterns of the class ω. Having said this, the formula of the Bayes theorem
(1.67) is rewritten as follows:

p(ω|x, P) = p(θ̂|x, P) = [ p(x|θ̂, P) / p(x) ] p(ω)     (1.105)

where p(x|θ̂, P) is the parametric conditional probability density (likelihood) to
estimate, derivable from the training set P associated with the generic class ω; p(ω) is
the a priori probability of the class, of known form, while p(x) can be considered
as a normalization constant. In reality, the explicit probability density p(x) is not
known; what is assumed to be known is the parametric form of this probability
density, of which we want to estimate the vector θ. In other words, the relationship
between the density p(x) and the training set P passes through the vector θ, which
models the assumed form of the probability density, so that the conditional
density function p(x|θ̂) can then be considered known. From the analysis of the training set of
observed patterns, with (1.105) we arrive at the posterior probability p(θ̂|P), with the hope
of getting an estimate of the value of θ with the least uncertainty.
By the definition of conditional probability, we have

p(x, θ |P) = p(x|θ , P) p(θ |P)

where it is evident that the probability p(x|θ , P) is independent of P since with


the knowledge of θ , this probability is completely parametrically determined by the
mathematical form of the probability distribution of x. It follows that we can write

p(x, θ |P) = p(x|θ ) p(θ |P)



and by the total probability theorem we can calculate the conditional density function
p(x|P) (as close as possible to p(x)) by integrating the joint probability
density p(x, θ|P) over the variable θ:

p(x|P) = ∫ p(x|θ) p(θ|P) dθ     (1.106)
where the integration is extended over the entire parametric domain. With (1.106),
we have a relationship linking the conditional probability of the class to the
parametric conditional probability of the class (whose form is known) and to the posterior
probability p(θ|P) of the variable θ to be estimated. With the Bayes theorem it is
possible to express the posterior probability p(θ|P) as follows:

p(θ|P) = p(P|θ) p(θ) / p(P) = p(P|θ) p(θ) / ∫ p(P|θ) p(θ) dθ     (1.107)

Assuming that the patterns of the training set P form a sequence of independent
and identically distributed (iid) random variables, the likelihood probability function
p(P|θ) of (1.107) can be calculated as the product of the conditional probability
densities of the class ω:

p(P|θ) = Π_{k=1}^{n} p(x_k|θ)     (1.108)

1.7.6.1 Bayesian Estimation for Gaussian Distribution with Unknown Mean

Let us now consider an example of a Bayesian learning application to estimate
the mean of a one-dimensional normal distribution with known variance. Let us
indicate with P = {x_1, . . . , x_n} the training set of one-dimensional iid patterns of
a generic class ω whose distribution is assumed to be Gaussian, N(μ, σ²), where μ
is the unknown mean and σ² is the known variance. We therefore indicate with
p(x|θ = μ) = N(x; μ, σ²) the resulting likelihood probability.
We also assume that the a priori probability for the mean θ = μ has the normal
distribution N(μ_0, σ_0²):

p(μ) = (1/(√(2π) σ_0)) exp[ −(1/(2σ_0²)) (μ − μ_0)² ]     (1.109)
Applying the Bayes rule (1.107), we can calculate the posterior probability p(μ|P):

p(μ|P) = p(P|μ) p(μ) / p(P) = [ p_0(μ) / p(P) ] Π_{k=1}^{n} p(x_k|μ)
       = (1/(√(2π) σ_0)) exp[ −(1/(2σ_0²)) (μ − μ_0)² ] · (1/p(P)) Π_{k=1}^{n} (1/(√(2π) σ)) exp[ −(1/(2σ²)) (x_k − μ)² ]     (1.110)

We observe that the posterior probability p(μ|P) depends on the a priori probability
p(μ) and on the training set of selected patterns P. This dependence
influences the Bayesian estimate, that is, the value of p(μ|P), observable as the
number n of training set samples increases. The maximum of p(μ|P) is obtained by
computing the partial derivative of the logarithm of (1.110) with respect to μ, that
is, ∂/∂μ log p(μ|P), and setting it to zero; we have

∂/∂μ [ −(1/(2σ_0²)) (μ − μ_0)² − Σ_{k=1}^{n} (1/(2σ²)) (x_k − μ)² ] = 0     (1.111)

from which, after some algebraic manipulation, we obtain

μ_n = [ σ² / (σ² + nσ_0²) ] μ_0 + [ nσ_0² / (σ² + nσ_0²) ] · (1/n) Σ_{k=1}^{n} x_k     (1.112)
          (initial mean μ_0)              (MLE estimate)

It is highlighted that for n → ∞, the estimate μ_n tends to the MLE estimate
(i.e., μ̂ = (1/n) Σ_k x_k), starting from the initial value μ_0. The posterior variance σ_n² is
calculated in the same way, given by

1/σ_n² = n/σ² + 1/σ_0²  ⟹  σ_n² = σ_0² σ² / (nσ_0² + σ²)     (1.113)

from which it emerges that the posterior variance of μ, σ_n², tends to zero as 1/n for
n → ∞. In other words, with the posterior probability p(μ|P) calculated with the (1.110),
we get the best estimate μ_n of μ from the training set of n observed patterns, while σ_n²
represents the uncertainty of μ, i.e., its posterior variance.
Figure 1.19 shows how Bayesian learning works: as the number of samples in the training
set increases, p(μ|P) becomes increasingly peaked and narrow around the true value of the
mean μ. The extension to the multivariate case [18] of the Bayesian estimate for a Gaussian
distribution with unknown mean μ and known covariance matrix Σ is more complex, as is the
estimation of the mean and of the covariance matrix when both are unknown for a normal
distribution [17].
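To make the update rule (1.112)–(1.113) concrete, the following minimal Python sketch (using only NumPy; the prior values μ_0, σ_0² and the known variance σ² are illustrative assumptions, not taken from the text) computes μ_n and σ_n² and shows how the posterior narrows as n grows.

import numpy as np

def bayes_mean_posterior(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) of the mean with known variance (Eqs. 1.112-1.113)."""
    n = len(x)
    xbar = np.mean(x)                              # MLE of the mean
    w = n * sigma0_sq / (n * sigma0_sq + sigma_sq)
    mu_n = (1 - w) * mu0 + w * xbar                # weighted mix of prior mean and MLE
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

# illustrative values (assumed): true mean 0.8, known variance 0.09, vague prior
rng = np.random.default_rng(0)
for n in (1, 5, 10, 20):
    x = rng.normal(0.8, 0.3, size=n)
    print(n, bayes_mean_posterior(x, mu0=0.0, sigma0_sq=1.0, sigma_sq=0.09))

As n increases, the printed posterior variance shrinks roughly as 1/n, mirroring the behavior illustrated in Fig. 1.19.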

1.7.6.2 Bayesian Estimate of Conditional Density for Normal Distribution
With the (1.110), we obtained the posterior density of the mean p(μ|P). Let us now
calculate the conditional density of the class p(x|P) = p(x|ω, P), remembering that for
simplicity we omitted the indication of the generic class, it being understood that the
training set P considered was associated with a generic class ω.
The class density p(x|P) is obtained by substituting in the (1.106) (considering


Fig. 1.19 Bayesian learning of the mean of a Gaussian distribution with known variance starting
from a training set of patterns

θ = μ) the posterior density p(μ|P) given by the (1.110) and assumed with normal
distribution N(μ_n, σ_n²):

p(x|P) = \int p(x|\mu)\, p(\mu|P)\, d\mu = \int \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^2\Big]\,\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\Big[-\frac{1}{2}\Big(\frac{\mu-\mu_n}{\sigma_n}\Big)^2\Big]\,d\mu    (1.114)
       = \frac{1}{2\pi\,\sigma\,\sigma_n}\exp\Big[-\frac{1}{2}\,\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\Big]\, f(\sigma,\sigma_n)

where
f(\sigma,\sigma_n) = \int\exp\Big[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\Big(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\Big)^2\Big]\,d\mu

We highlight that the density p(x|P), as a function of x, has a normal distribution:

p(x|P) \sim N(\mu_n, \sigma^2 + \sigma_n^2)    (1.115)

being proportional to the expression exp[−(1/2)(x − μ_n)²/(σ² + σ_n²)]. In conclusion, to
obtain the conditional density of the class p(x|P) = p(x|ω, P), with a known parametric form
described by the normal distribution p(x|μ) ∼ N(μ, σ²), the parameters of the normal are
replaced by μ = μ_n and by the variance σ² + σ_n². In other words, the value of μ_n is
considered as the true mean, while the known initial variance σ², once the posterior density
of the mean p(μ|P) has been calculated, is increased by σ_n² to account for the uncertainty
about the significance of the training set due to the poor knowledge of the mean μ. This
contrasts with the MLE approach, which obtains a point estimate of the parameters μ̂ and σ̂²
instead of directly estimating the class distribution p(x|ω, P).

1.7.7 Comparison Between Bayesian Learning and Maximum Likelihood Estimation

With the MLE approach, a point value of the parameter θ is estimated which maximizes the
likelihood density p(P|θ). Therefore, with MLE we get an estimated value θ̂ without
considering the parameter as a random variable. In other words, with reference to the Bayes
equation (1.107), MLE treats the ratio p(θ)/p(P) = prior probability/evidence as a constant
and does not take the a priori probability into account in the calculation of the θ estimate.
In contrast, Bayesian learning considers the parameter to be estimated θ as a random
variable. Given the conditional density and the a priori probability, the Bayesian estimator
obtains a probability distribution p(θ|P) associated with θ instead of a point value, as
happens with MLE. The goal is to select an expected value of θ assuming the smallest possible
variance of the posterior density p(θ|P). If the variance is very large, the estimate of θ is
considered poor.
The Bayesian estimator incorporates the a priori information: if this is not significant,
the posterior density is determined by the training set (data-driven estimator); if it is
significant, the posterior density is determined by the combination of the prior density and
the training set of patterns. If, moreover, the training set contains a significant number of
patterns, these dominate over the a priori information, making it less important. From this,
it follows that the two estimators are related when the number of patterns n of the training
set is very high. Considering the Bayes equation (1.107), we observe that the denominator can
be neglected as independent of θ, and we have that

p(θ |P) ∝ p(P|θ) p(θ ) (1.116)

where the likelihood density has a peak at the maximum θ = θ̂ . With n very large, the
likelihood density shrinks around its maximum value while the integral that estimates
the conditional density of the class with the Bayesian method can be approximated
(see Eq. 1.106) as follows:

p(x|P) = \int p(x|\theta)\, p(\theta|P)\, d\theta \cong \int p(x|\hat{\theta})\, p(\theta|P)\, d\theta = p(x|\hat{\theta})    (1.117)

remembering that ∫ p(θ|P)dθ = 1. In essence, Bayesian learning, instead of finding a precise
value of θ, calculates an average of the density p(x|θ) over all values of θ, weighted with
the posterior density of the parameters p(θ|P).
In conclusion, the two estimators tend, approximately, to similar results, when n
is very large, while for small values, the results are very different.

1.8 Bayesian Discriminant Functions

An effective way to represent a pattern classifier is based on the discriminant functions.
If we denote by gi(x) a discriminant function (gi : R^d → R) associated with the class ωi,
a classifier will assign a pattern x ∈ R^d to the class ωi if

g_i(x) > g_j(x) \quad \forall j \ne i    (1.118)

A classifier based on K discriminant functions, as many as there are classes, constitutes a
computational model that assigns a generic input pattern to the class whose discriminant
function has the highest value (see Fig. 1.20).
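As a minimal illustration of this computational model (not taken from the text; the discriminant functions are passed in as ordinary Python callables, and the weights in the toy example are illustrative only), the following sketch assigns a pattern to the class whose discriminant value is largest:

import numpy as np

def classify(x, discriminants):
    """Assign pattern x to the class whose discriminant g_i(x) is maximal (Eq. 1.118)."""
    scores = [g(x) for g in discriminants]   # one g_i per class
    return int(np.argmax(scores))            # index of the winning class

# toy example with two linear discriminants (illustrative weights only)
g1 = lambda x: np.dot([1.0, -0.5], x) + 0.2
g2 = lambda x: np.dot([-0.3, 0.8], x) - 0.1
print(classify(np.array([0.4, 0.7]), [g1, g2]))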
Considering the discriminating functions based on the Bayes theory, a general
form based on the minimum risk (see Eq. 1.77) is the following:

g_i(x) = -R(\alpha_i|x) = -\sum_{j=1}^{K} C(\alpha_i|\omega_j)\, p(\omega_j|x)    (1.119)

where the sign is motivated by the fact that the minimum conditional risk corresponds
to the maximum discriminating function.
In the case of a minimum-error (zero–one) cost function, the Bayesian discriminant function
further simplifies to gi(x) = p(ωi|x). The choice of discriminant functions is not unique,
since a generic function gi(x) can be replaced with f(gi(x)), where f(•) is a monotonically
increasing function that does not affect the accuracy of the classification. We will see that
these transformations are useful for simplifying expressions and calculations.


Fig. 1.20 Functional scheme of a statistical classifier. The computational model is of the type
bottom-up as shown by the arrows. In the first level are the features of the patterns processed in the
second level with the discriminating functions to choose the one with the highest value that assigns
the pattern to the class to which it belongs

The discriminating functions for the classification with minimum error are

g_i(x) = p(\omega_i|x) = \frac{p(x|\omega_i)\,p(\omega_i)}{\sum_{k=1}^{K} p(x|\omega_k)\,p(\omega_k)}    (1.120)
g_i(x) = p(x|\omega_i)\,p(\omega_i)    (1.121)
g_i(x) = \log p(x|\omega_i) + \log p(\omega_i)    (1.122)

which produce the same classification results. As already described in Sect. 1.6, the
discriminant functions partition the feature space into the K decision regions Ωi
corresponding to the classes ωi according to the following:

\Omega_i = \{x \mid g_i(x) > g_j(x)\ \forall j \ne i\}    (1.123)

The decision boundaries that separate the regions correspond to the valleys between
the discriminant functions described by the equation gi (x) = g j (x). If we consider
a two-class classification ω1 and ω2 , we have a single discriminant function g(x)
which can be expressed as follows:

g(x) = g1 (x) − g2 (x) (1.124)

In this case, the decision rule results

x ∈ ω1 if g(x) > 0; otherwise x ∈ ω2 (1.125)

with the decision made in relation to the sign of g(x).

1.8.1 Classifier Based on Gaussian Probability Density

The Gaussian probability density function, also called the normal distribution, has already
been described in different contexts in this volume, given its ability to model well the
observations of various physical phenomena and its analytical tractability. In the context of
classification, it is widely used to model the observed measurements of the various classes,
which are often subject to random noise.
We also know from the central limit theorem that the distribution of the sum of a large
number n of independent and identically distributed random variables tends to be normal,
independently of the distribution of the single random variables.
A Bayesian classifier is based on the conditional probability density p(x|ωi) and the a
priori probability p(ωi) of the classes. Now let us see how to obtain the discriminant
functions of a Bayesian classifier by assuming classes with the
multivariate normal distribution (MND). The objective is to derive simple forms of
the discriminating functions by exploiting some properties of the covariance matrix
of the MNDs.
A univariate normal density is completely described by the mean μ and the
variance σ 2 and abbreviated as p(x) ∼ N (μ, σ 2 ). A multivariate normal density is

described by the mean vector μ and by the covariance matrix Σ, and in short form is
indicated with p(x) ∼ N(μ, Σ) (see Eq. 1.101).
For an arbitrary class ωi with patterns described by d-dimensional vectors x = (x1, ..., xd)
with normal density, the mean vector is given by μ = (μ1, ..., μd) with μi = E[xi], while the
covariance matrix is

\Sigma = E[(x - \mu)(x - \mu)^T]

with dimensions d × d. We will see that the covariance matrix is fundamental to characterize
the discriminant functions of a classifier based on the Gaussian model. We, therefore, recall
some properties of Σ. It contains the covariance between each pair of features of a pattern
x, represented by the elements outside the principal diagonal, given by

\Sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)] \quad i \ne j    (1.126)

while the diagonal components represent the variances of the features:

\Sigma_{ii} = \sigma_i^2 = E[(x_i - \mu_i)^2] \quad i = 1, \dots, d    (1.127)

By definition, Σ is symmetric, that is, Σ_ij = Σ_ji, with d(d + 1)/2 free parameters, which
with the d means μ_i become d(d + 3)/2. If the off-diagonal covariance elements are null
(Σ_ij = 0), the features x_i and x_j are statistically independent.
Let us now look at the distribution of the patterns x ∈ R^d from the geometric point of view
in the feature space. Under the hypothesis of normal probability density, the patterns of a
generic class ωi tend to be grouped together, in a cluster whose form is described by the
covariance matrix Σ and whose center of mass is defined by the mean vector μi. From the
analysis of the eigenvectors and eigenvalues (see Sect. 2.10 Vol. II) of the covariance
matrix, we know that the eigenvectors φ_i of Σ correspond to the principal axes of the
hyperellipsoid, while the eigenvalues λ_i determine the lengths of the axes (see Fig. 1.21).
An important characteristic of the multivariate normal density is the quadratic form that
appears in the exponential of the normal function (1.96), which we rewrite here in the form:

D_M^2 = (x - \mu)^T\,\Sigma^{-1}(x - \mu)    (1.128)

where D_M is known as the Mahalanobis distance between the mean vector μ and the pattern
vector x. The locus of points with constant density value described by the (1.128) is the
hyperellipsoid with constant Mahalanobis distance.14

14 If Σ = I, where I is the identity matrix, the (1.128) becomes the Euclidean distance
(norm 2). If Σ is diagonal, the resulting measure becomes the normalized Euclidean distance
given by D(x, \mu) = \sqrt{\sum_{i=1}^{d}(x_i-\mu_i)^2/\sigma_i^2}. It should also be pointed
out that the Mahalanobis distance can also be defined as a dissimilarity measure between two
vector patterns x and y with the same probability density function and with covariance matrix
Σ, defined as D(x, y) = \sqrt{(x - y)^T\Sigma^{-1}(x - y)}.


Fig. 1.21 2D geometric representation in the feature domain of a Gaussian pattern distribution.
We observe their grouping centered on the average vector μ and the contour lines, which in the 2D
domain are ellipses, which represent the set of points with equal probability density of the Gaussian
distribution. The orientation of the grouping is determined by the eigenvectors of the covariance
matrix, while the eigenvalues determine the extension of the grouping

The dispersion of the class patterns centered on the mean vector is measurable by the volume
of the hyperellipsoid in relation to the values of D_M and Σ.
In this context, it may be useful to proceed with a linear transformation (see Chap. 2 Vol.
II) of the x patterns to analyze the correlation level of the features and reduce their
dimensionality, or to normalize the vectors x to have uncorrelated components with unit
variance. The normalization of the features is obtained through the so-called whitening of
the observations, that is, by means of a linear transformation (known as the whitening
transform [19]) such as to have uncorrelated features with unitary variance.15
With this transformation, the ellipsoidal distribution in the feature space becomes spherical
(see Fig. 1.21) (the covariance matrix is equal to the identity matrix after the
transformation, Σ_y = I), and the Euclidean metric is used instead of the Mahalanobis
distance (Eq. 1.128).

15 The whitening transform is always possible, and the method used is again based on the
eigendecomposition of the covariance matrix Σ = ΦΛΦ^T calculated on the input patterns x. It
can be shown that the whitening transformation is given by y = Λ^{-1/2}Φ^T x, which in fact
is equivalent to first executing the orthogonal transform y = A^T x = Φ^T x and then
normalizing the result with Λ^{-1/2}. In other words, with the first transformation we obtain
the principal components, and with the normalization the distribution of the data is made
symmetrical. The direct (whitening) transformation is y = A_w x = Λ^{-1/2}Φ^T x.
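As a small illustration of the whitening transform described in the footnote (a sketch under the assumption of a well-conditioned sample covariance; the function name whiten and the synthetic data are ours), the eigendecomposition Σ = ΦΛΦ^T can be computed with NumPy and applied as y = Λ^{-1/2}Φ^T x:

import numpy as np

def whiten(X, eps=1e-10):
    """Whitening transform: returns Y with (approximately) identity covariance.
    X is an (n, d) array of patterns; rows are samples."""
    Xc = X - X.mean(axis=0)                            # center the data
    Sigma = np.cov(Xc, rowvar=False)                   # sample covariance matrix
    lam, Phi = np.linalg.eigh(Sigma)                   # Sigma = Phi diag(lam) Phi^T
    A_w = np.diag(1.0 / np.sqrt(lam + eps)) @ Phi.T    # A_w = Lambda^{-1/2} Phi^T
    return Xc @ A_w.T                                  # y = A_w x applied to every row

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500)
print(np.cov(whiten(X), rowvar=False).round(2))        # close to the identity matrix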

1.8.2 Discriminant Functions for the Gaussian Density

Among the Bayesian discriminant functions gi(x), described above for the classification with
minimal error, we consider the (1.122), which under the hypothesis of multivariate
conditional normal density p(x|ωi) ∼ N(μi, Σi) is rewritten in the form:

g_i(x) = -\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\log|\Sigma_i| - \underbrace{\frac{d}{2}\log 2\pi}_{\text{constant}} + \log p(\omega_i)    (1.129)

having replaced in the discriminant function (1.122), for the class ωi, its conditional
density of multivariate normal distribution p(x|ωi) given by

p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\,\exp\Big[-\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)\Big]

The (1.129) is strongly characterized by the covariance matrix Σi, for which different
assumptions can be made.

1.8.2.1 Assumption: Σi = σ²I
With this hypothesis, the features (x1, x2, ..., xd) are statistically independent and have
the same variance σ² with different means μi. The patterns are distributed in the feature
space forming hyperspherical groupings of equal size centered in μi. In this case, the
determinant and the inverse matrix of Σi are, respectively, |Σi| = σ^{2d} and
Σi^{-1} = (1/σ²)I (I denotes the identity matrix). Moreover, since the constant term in the
(1.129) and |Σi| are both independent of i, they can be ignored as irrelevant. It follows
that a simplification of the discriminant functions is obtained:

g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} + \log p(\omega_i) = -\frac{(x-\mu_i)^T(x-\mu_i)}{2\sigma^2} + \log p(\omega_i) = -\frac{1}{2\sigma^2}\big[x^T x - 2\mu_i^T x + \mu_i^T\mu_i\big] + \log p(\omega_i)    (1.130)

From (1.130), it is noted that the discriminant functions are characterized by the Euclidean
distance between the patterns and the mean of each class (‖x − μi‖²) and by the normalization
terms given by the variance (2σ²) and the prior density (offset log p(ωi)). There is actually
no need to compute the distances explicitly. In fact, expanding the quadratic form
(x − μi)^T(x − μi) in the (1.130), it is evident that the quadratic term x^T x is identical
for all i and can be eliminated. This allows us to obtain the equivalent linear discriminant
functions as follows:

gi (x) = wiT x + wi0 (1.131)

where
w_i = \frac{1}{\sigma^2}\,\mu_i \qquad w_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^T\mu_i + \log p(\omega_i)    (1.132)
σ2 i 2σ 2 i i


Fig. 1.22 1D geometric representation for two classes in the feature space. If the covariance matrices
for the two distributions are equal and with identical a priori density p(ωi ) = p(ω j ), the Bayesian
surface of separation in the 1D representation is the line passing through the intersection point
of the two Gaussians p(x|ωi ). For d > 1, the separation surface is instead the hyperplane to
(d − 1)-dimensions with the groupings of the spherical patterns in d-dimensions

The term wi0 is called the threshold (or bias) of the i-th class. The Bayesian decision
surfaces are hyperplanes defined by the equations:
gi (x) = g j (x) ⇐⇒ gi (x) − g j (x) = (wi − w j )T x + (wi0 − w j0 ) = 0 (1.133)

Considering the (1.131) and (1.132), we can rewrite the hyperplane equation in the form:
w T (x − x0 ) = 0 (1.134)

where
w = \mu_i - \mu_j    (1.135)

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\log\frac{p(\omega_i)}{p(\omega_j)}\,(\mu_i - \mu_j)    (1.136)

Equation (1.134) describes the decision hyperplane separating the class ωi from ω j
and is perpendicular to the line joining the centroids μi and μj. The point x0, determined by
the values of p(ωi) and p(ωj), is the point through which the hyperplane normal to the vector
w passes. A special case occurs when p(ωi) = p(ωj) for each class. Figures 1.22 and 1.16 show
the distributions in feature space for two classes, respectively, for equal a priori
densities p(ωi) = p(ωj) and in the case of an appreciable difference with p(ωi) > p(ωj).
In the first case, from the (1.136), we observe that the second addend becomes zero
and the separation point of the classes x0 is at the midpoint between the vectors μi
and μ j , and the hyperplane bisects perpendicularly the line joining the two averages.

The discriminant function becomes

g_i(x) = -\|x - \mu_i\|^2    (1.137)

thus obtaining a so-called minimum-distance classifier. Considering that the (1.137) computes
the Euclidean distance between x and the means μi, which in this context represent the
prototypes of each class, the discriminant function is typical of a template matching
classifier.
In the second case, with p(ωi) ≠ p(ωj), the point x0 moves away from the more probable class.
As can be guessed, if the variance has low values (more tightly grouped patterns) compared
with the distance between the means ‖μi − μj‖, the a priori densities will have a lower
influence.
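A minimal sketch of this case (assuming Σi = σ²I, with class means, a common σ² and priors estimated beforehand; the function name and the numeric values are illustrative assumptions) implements the linear discriminant (1.131)–(1.132):

import numpy as np

def linear_discriminants(x, means, sigma_sq, priors):
    """g_i(x) = w_i^T x + w_i0 for the case Sigma_i = sigma^2 I (Eqs. 1.131-1.132)."""
    scores = []
    for mu, p in zip(means, priors):
        w = mu / sigma_sq
        w0 = -mu @ mu / (2 * sigma_sq) + np.log(p)
        scores.append(w @ x + w0)
    return np.array(scores)

means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # illustrative class prototypes
print(np.argmax(linear_discriminants(np.array([1.5, 0.8]), means, sigma_sq=0.5, priors=[0.5, 0.5])))

With equal priors this reduces to the minimum-distance classifier of (1.137).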

1.8.2.2 Assumption: Σi = Σ (Diagonal)

This case includes identical but arbitrary covariance matrices for each class, and the
features (x1, x2, ..., xd) are not necessarily independent. The geometric configuration of
the grouping of the patterns forms, for each class, a hyperellipsoid of the same size
centered in μi.
The general discriminant function expressed by the (1.129) in this case reduces to the
following form:

g_i(x) = -\frac{1}{2}\,\underbrace{(x-\mu_i)^T\Sigma^{-1}(x-\mu_i)}_{\text{Mahalanobis distance, Eq. 1.128}} + \log p(\omega_i)    (1.138)

having eliminated the constant term and the term with |Σ|, both being independent of i. If
the a priori densities p(ωi) were identical for all classes, their contribution could be
ignored, and the (1.138) would reduce to the Mahalanobis distance term alone. Basically, we
have a classifier based on the following decision rule: a pattern x is classified by
assigning it to the class whose centroid μi is at the minimum Mahalanobis distance. If
p(ωi) ≠ p(ωj), the separation boundary moves in the direction of the lower prior probability.
It is observed that the Mahalanobis distance becomes the Euclidean distance
‖x − μ‖² = (x − μ)^T(x − μ) if Σ = I.
Expanding in Eq. (1.138) only the expression of the Mahalanobis distance and eliminating the
terms independent of i (i.e., the quadratic term x^T Σ^{-1} x), the linear discriminant
functions are again obtained:

gi (x) = wiT x + wi0 (1.139)

where
w_i = \Sigma^{-1}\mu_i \qquad w_{i0} = -\frac{1}{2}\,\mu_i^T\Sigma^{-1}\mu_i + \log p(\omega_i)    (1.140)
The term wi0 is called the threshold (or bias) of the i-th class. The linear discriminant
functions, also in this case, geometrically represent the hypersurfaces of

separation of the adjacent classes defined by the equation:

w T (x − x0 ) = 0 (1.141)

where
w = \Sigma^{-1}(\mu_i - \mu_j)    (1.142)

x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\log[p(\omega_i)/p(\omega_j)]}{(\mu_i - \mu_j)^T\Sigma^{-1}(\mu_i - \mu_j)}\,(\mu_i - \mu_j)    (1.143)

It is observed that the vector w, given by the (1.142), is not, in general, in the direction
of the vector (μi − μj). It follows that the separating hyperplane (between the regions Ωi
and Ωj) is not perpendicular to the line joining the two means μi and μj. As in the previous
case, the hyperplane always intersects the line joining the means at x0, whose position
depends on the values of the a priori probabilities. If the latter are different, the
hyperplane moves toward the class with the lower prior probability.
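A hedged sketch of this shared-covariance case (assuming a common Σ and the class means and priors already estimated from a training set; the helper name and numeric values are ours) computes the linear discriminants (1.139)–(1.140):

import numpy as np

def shared_cov_discriminants(x, means, Sigma, priors):
    """g_i(x) = w_i^T x + w_i0 with a covariance matrix common to all classes (Eqs. 1.139-1.140)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for mu, p in zip(means, priors):
        w = Sigma_inv @ mu
        w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(p)
        scores.append(w @ x + w0)
    return np.array(scores)

means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]       # illustrative prototypes
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])                 # common covariance (assumed)
print(np.argmax(shared_cov_discriminants(np.array([1.2, 0.4]), means, Sigma, [0.6, 0.4])))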

1.8.2.3 Assumption: Σi Arbitrary (Non-diagonal)


In the general case of multivariate normal distribution, the covariance matrices are
different for each class and the features (x1, x2, ..., xd) are not necessarily independent.
Therefore, the general discriminant function expressed by the (1.129) in this case has the
following form:

g_i(x) = -\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log p(\omega_i)    (1.144)

having only been able to eliminate the constant term (d/2) log 2π, while the other terms all
depend on i.
The discriminant functions (1.144) are quadratic functions and can be rewritten as follows:

g_i(x) = \underbrace{x^T W_i\, x}_{\text{quadratic in } x} + \underbrace{w_i^T x}_{\text{linear in } x} + \underbrace{w_{i0}}_{\text{constant}}    (1.145)

where
W_i = -\frac{1}{2}\,\Sigma_i^{-1} \qquad w_i = \Sigma_i^{-1}\mu_i    (1.146)

w_{i0} = -\frac{1}{2}\,\mu_i^T\Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log p(\omega_i)    (1.147)
2 2

Fig. 1.23 One-dimensional representation of decision regions for two classes with normal distribution with different variance. It is observed that the relative decision regions are not connected in this case with equal a priori probabilities

The term wi0 is called the threshold (or bias) of the i-th class. The decision surface
between two classes is a quadric hypersurface.16 These decision surfaces may not be connected
even in the one-dimensional case (see Fig. 1.23).
Any hypersurface can be generated from two Gaussian distributions. The surfaces
of separation become more complex when the number of classes is greater than 2
even with Gaussian distributions. In these cases, it is necessary to identify the pair
of classes involved in that particular area of the feature space.
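The following sketch (a non-authoritative illustration assuming per-class means, covariance matrices and priors are already available; the function name and numbers are ours) evaluates the quadratic discriminant (1.145)–(1.147) for one pattern:

import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) for arbitrary Sigma_i (Eqs. 1.145-1.147)."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

# two illustrative classes with different covariance matrices
x = np.array([1.0, 0.5])
g1 = quadratic_discriminant(x, np.zeros(2), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.array([2.0, 1.0]), np.array([[2.0, 0.4], [0.4, 1.0]]), 0.5)
print(0 if g1 > g2 else 1)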

1.8.2.4 Conclusions
Let us briefly summarize the specificities of the Bayesian classifiers described. Under the
hypothesis of Gaussian class distributions, in the most general case, the Bayesian classifier
is quadratic. If the hypothesized Gaussian classes all have equal variance, the Bayesian
classifier is linear. It is also highlighted that a classifier based on the Mahalanobis
distance is optimal in the Bayesian sense if the classes have a normal distribution, equal
covariance matrix, and equal a priori probability. Finally, it is pointed out that a
classifier based on the Euclidean distance is optimal in the Bayesian sense if the classes
have a normal distribution, equal covariance matrix proportional to the identity matrix, and
equal a priori probability.
Both the Euclidean and Mahalanobis distance classifiers are linear. In various applications,
different distance-based classifiers (Euclidean or Mahalanobis) are used, making implicit
statistical assumptions. Often such assumptions, for example those on the normality of the
class distribution, are rarely true, leading to poor results. The strategy is to verify
pragmatically whether these classifiers solve the problem.

16 In geometry, a quadric surface is defined as a hypersurface of a d-dimensional space over
the real (or complex) numbers, represented by a second-order polynomial equation.


The hypersurface can take various forms: hyperplane, hyperellipsoid, hyperspheroid, hypersphere
(special case of the hyperspheroid), hyperparaboloid (elliptic or circular), hyperboloid (one or two
sheets).

In the parametric approach, where assumptions are made about class distribu-
tions, it is important to carefully extract the associated parameters through signifi-
cant training sets and an interactive pre-analysis of sample data (histogram analysis,
transformation to principal components, reduction of dimensionality, verification of
significant features, number of significant classes, data normalization, noise attenu-
ation, ...).

1.9 Mixtures of Gaussian—MoG

The Gaussian distribution has some limitations in modeling real-world datasets.


Very complex probability densities can be modeled with a linear combination of several
suitably weighted Gaussians. The applications that make use of these mixtures are manifold,
in particular the modeling of the background of a video streaming sequence in video
surveillance applications. The problem we want to solve is, therefore, the following: given a
set of patterns P = {x_i}_{i=1}^n, we want to find the probability distribution p(x) that
generated this set.
In the implementation (framework) of the Gaussian mixtures, this generating
distribution of the P dataset consists of a set composed of K Gaussian distributions
(see Fig. 1.24); each of them is therefore:

p(x; \theta_k) \sim N(x_i|\mu_k, \Sigma_k)    (1.148)

with μ_k and Σ_k, respectively, the mean and the covariance matrix of the k-th Gaussian
distribution, defined as follows:

N(x_i|\mu_k, \Sigma_k) = (2\pi)^{-d/2}|\Sigma_k|^{-1/2}\exp\Big[-\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\Big]    (1.149)

that is, the probability of generating the observation x using the k-th model is provided by
the Gaussian distribution having the parameter vector θ_k = (μ_k, Σ_k) (mean and covariance
matrix, respectively).

Fig. 1.24 Example of probability density p(x) obtained as a mixture of 3 Gaussian distributions

We introduce in this model a further variable z_i associated with the observation x_i. The
variable z_i indicates which of the K models has generated the data x_i according to the a
priori probability π_k = p(z_i = k). It is easy to think of the variable z_i as a
K-dimensional binary vector having a 1 in the element corresponding to the Gaussian component
that generated the corresponding data x_i:

z_i = [\underbrace{0\; 0\; \dots\; 0\; 1\; 0\; \dots\; 0\; 0}_{K\ \text{elements}}]    (1.150)

Since we do not know the corresponding z_i for each x_i,17 these variables are called hidden
variables. Our problem is now reduced to the search for the parameters μ_k and Σ_k of each of
the K models and the respective a priori probabilities π_k which, if incorporated in the
generative model, give it a high probability of generating the observed distribution of the
data. The density of the mixture is given by

p(x_i|\theta) = \sum_{k=1}^{K}\pi_k\, N(x_i|\mu_k, \Sigma_k)    (1.151)

Each Gaussian density

P_{ik} = p(x_i|z_i = k, \mu_k, \Sigma_k) = N(x_i|\mu_k, \Sigma_k)

represents a component of the mixture having mean μ_k and covariance Σ_k.
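As a hedged illustration of (1.148)–(1.151) (the parameter values are illustrative and the function name mixture_density is ours), the density of a mixture can be evaluated as a weighted sum of Gaussian components:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)  (Eq. 1.151)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

# three illustrative 1D components (covariances given as variances)
pis = [0.5, 0.3, 0.2]
mus = [0.0, 2.0, 5.0]
Sigmas = [0.5, 1.0, 0.8]
print(mixture_density(1.0, pis, mus, Sigmas))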


The parameters πk are called the coefficients of the mixture and serve to weigh the relative
Gaussian component in modeling the generic random variable x. If we integrate both sides of
(1.151) with respect to x, since both p(x) and the individual Gaussian components are
normalized, we obtain

\sum_{k=1}^{K}\pi_k = 1.    (1.152)

Also note that both p(x) ≥ 0 and N(x|μ_k, Σ_k) ≥ 0, resulting in π_k ≥ 0 for each k.
Combining these conditions with the (1.152), we get

0 \le \pi_k \le 1    (1.153)

and so the mixture coefficients are probabilities. We are interested in maximizing the
likelihood L(θ) = p(P; θ) of generating the observed data with the parameters of the model
θ = {μ_k, Σ_k, π_k}_{k=1}^K.

This approach is called ML—Maximum Likelihood estimate since it finds the parameters
maximizing the likelihood function that generates the data. An alternative approach is given
by the EM (Expectation–Maximization) algorithm, where the latter calculates θ by maximizing
the posterior probability (MAP—Maximum A Posteriori estimate) and presents a simpler
mathematical treatment. So, to sum up, our parameter vector can be estimated in the following
two ways:

17 If we knew them, we would group all x_i based on their z_i and we would model each
grouping with a single Gaussian.

\theta_{ML} = \arg\max_{\theta} p(P|\theta) \qquad \theta_{MAP} = \arg\max_{\theta} p(\theta|P)    (1.154)

1.9.1 Parameters Estimation of the Gaussians Mixture with the Maximum Likelihood—ML

Our goal is, therefore, to find the parameters of the K Gaussian distributions and the
coefficients πk based on the dataset P we have. Calculating the mixture density on the entire
dataset of statistically independent measurements, we have

p(X|\theta) = \prod_{i=1}^{n}\sum_{k=1}^{K}\pi_k\, p(x_i|z_i = k, \mu_k, \Sigma_k)    (1.155)

To find the parameter vector θ, one way is the maximum likelihood estimate. Calculating the
logarithm of (1.155), we obtain

L = \sum_{i=1}^{n}\log\sum_{k=1}^{K}\pi_k\, p(x_i|z_i = k, \mu_k, \Sigma_k)    (1.156)

which we will differentiate with respect to θ = {μ, Σ} and then set to zero to find the
maximum, that is,

\frac{\partial L}{\partial\theta_k} = \sum_{i=1}^{n}\frac{\pi_k}{\sum_{j=1}^{K}\pi_j P_{ij}}\,\frac{\partial P_{ik}}{\partial\theta_k} = \sum_{i=1}^{n}\underbrace{\frac{\pi_k P_{ik}}{\sum_{j=1}^{K}\pi_j P_{ij}}}_{r_{ik}}\,\frac{\partial\log P_{ik}}{\partial\theta_k} = \sum_{i=1}^{n} r_{ik}\,\frac{\partial\log P_{ik}}{\partial\theta_k}    (1.157)

in which we used the identity ∂p/∂θ = p × ∂ log p/∂θ and defined r_{ik}, the responsibility,
i.e., the variable representing how likely the i-th point is modeled or explained by the k-th
Gaussian, namely

r_{ik} = \frac{p(x_i, z_i=k|\mu_k,\Sigma_k)}{p(x_i|\mu_k,\Sigma_k)} = p(z_i=k|x_i,\mu_k,\Sigma_k)    (1.158)

which are a posteriori probabilities of class membership, with \sum_{k=1}^{K} r_{ik} = 1.
Now, we have to calculate the derivative with respect to π_k. We take the objective function
L and add a Lagrange multiplier λ which enforces the fact that the a priori probabilities
must sum to 1:

\tilde{L} = L + \lambda\Big(1 - \sum_{k=1}^{K}\pi_k\Big)    (1.159)

therefore, we differentiate with respect to π_k and λ, and set to zero as before, obtaining

\frac{\partial\tilde{L}}{\partial\pi_k} = \sum_{i=1}^{n}\frac{P_{ik}}{\sum_{j=1}^{K}\pi_j P_{ij}} - \lambda = 0 \;\Leftrightarrow\; \sum_{i=1}^{n} r_{ik} - \lambda\pi_k = 0    (1.160)

\frac{\partial\tilde{L}}{\partial\lambda} = 1 - \sum_{k=1}^{K}\pi_k = 0    (1.161)

and considering that \sum_{k=1}^{K}\sum_{i=1}^{n} r_{ik} - \lambda\sum_{k=1}^{K}\pi_k = n - \lambda = 0, we get λ = n, so the a priori probability of the k-th class is given by

\pi_k = \frac{1}{n}\sum_{i=1}^{n} r_{ik}    (1.162)

We now find the mean and covariance from the objective function L as follows:

\frac{\partial L}{\partial\mu_k} = \sum_{i=1}^{n} r_{ik}\,\Sigma_k^{-1}(x_i - \mu_k)    (1.163)

\frac{\partial L}{\partial\Sigma_k^{-1}} = \sum_{i=1}^{n} r_{ik}\big[\Sigma_k - (x_i - \mu_k)(x_i - \mu_k)^T\big]    (1.164)

By setting (1.163) and (1.164) to zero, we get

\mu_k = \frac{\sum_{i=1}^{n} r_{ik}\,x_i}{\sum_{i=1}^{n} r_{ik}} \qquad \Sigma_k = \frac{\sum_{i=1}^{n} r_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} r_{ik}}    (1.165)

Note that the denominators \sum_{i=1}^{n} r_{ik} = n\pi_k represent the total number of
points assigned to the k-th class. Furthermore, the mean in (1.165) is similar to that
obtained for the k-means method, except for the responsibilities r_{ik}, which in this case
are soft (i.e., 0 ≤ r_{ik} ≤ 1).

1.9.2 Parameters Estimation of the Gaussians Mixture with Expectation–Maximization—EM

A smart way to find maximum likelihood estimates for models with latent variables is the
Expectation–Maximization (EM) algorithm [20]. Instead of finding the maximum likelihood
estimate (ML) of the observed data p(P; θ), we will try to maximize the likelihood of the
joint distribution of P and Z = {z_i}_{i=1}^n, p(P, Z; θ). In this regard, we prefer to
maximize the logarithm of the likelihood, that is,

l_c(\theta) = \log p(P, Z; \theta),

quantity known as complete log-likelihood. Since we cannot observe the values of the
random variables zi , we have to work with the expected values of the quantity lc (θ)
with respect to some distribution Q(Z). The logarithm of the complete likelihood
function is defined as follows:
l_c(\theta) = \log p(P, Z; \theta) = \log\prod_{i=1}^{n} p(x_i, z_i; \theta) = \log\prod_{i=1}^{n}\prod_{k=1}^{K}\big[p(x_i|z_{ik}=1;\theta)\,p(z_{ik}=1)\big]^{z_{ik}}    (1.166)
            = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\log p(x_i|z_{ik}=1;\theta) + z_{ik}\log\pi_k.

Since we have assumed that each of the models is a Gaussian, the quantity p(x_i|z_ik = 1; θ)
represents the conditional probability of generating x_i given the k-th model:

p(x_i|z_{ik} = 1; \theta) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\,\exp\Big[-\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\Big]    (1.167)

Considering the expected value with respect to Q(Z), we have

\langle l_c(\theta)\rangle_{Q(Z)} = \sum_{i=1}^{n}\sum_{k=1}^{K}\langle z_{ik}\rangle\log p(x_i|z_{ik}=1;\theta) + \langle z_{ik}\rangle\log\pi_k.    (1.168)

1.9.2.1 The M Step

The “M” step considers the expected value of the l_c(θ) function defined in (1.168) and
maximizes it with respect to the parameters estimated in the other step (the “E”,
expectation), which are, therefore, π_k, μ_k, and Σ_k. Differentiating Eq. (1.168) with
respect to μ_k and setting it to zero, we get

\frac{\partial\langle l_c(\theta)\rangle_{Q(Z)}}{\partial\mu_k} = \sum_{i=1}^{n}\langle z_{ik}\rangle\,\frac{\partial}{\partial\mu_k}\log p(x_i|z_{ik}=1;\theta) = 0    (1.169)

We can now calculate \frac{\partial}{\partial\mu_k}\log p(x_i|z_{ik}=1;\theta)

using the (1.167) as follows:

\frac{\partial}{\partial\mu_k}\log p(x_i|z_{ik}=1;\theta) = \frac{\partial}{\partial\mu_k}\log\Big[\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\big(-\tfrac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\big)\Big]
= -\frac{1}{2}\,\frac{\partial}{\partial\mu_k}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k) = (x_i-\mu_k)^T\Sigma_k^{-1}    (1.170)

where the last equality derives from the relation \frac{\partial}{\partial x}\,x^T A x = x^T(A + A^T). By substituting the result of (1.170) in (1.169), we get

\sum_{i=1}^{n}\langle z_{ik}\rangle\,(x_i-\mu_k)^T\Sigma_k^{-1} = 0    (1.171)

which gives us the following update equation:

\mu_k = \frac{\sum_{i=1}^{n}\langle z_{ik}\rangle\, x_i}{\sum_{i=1}^{n}\langle z_{ik}\rangle}    (1.172)

Let us now calculate the estimate of the covariance matrix: differentiating Eq. (1.168) with
respect to Σ_k^{-1}, we have

\frac{\partial\langle l_c(\theta)\rangle_{Q(Z)}}{\partial\Sigma_k^{-1}} = \sum_{i=1}^{n}\langle z_{ik}\rangle\,\frac{\partial}{\partial\Sigma_k^{-1}}\log p(x_i|z_{ik}=1;\theta) = 0    (1.173)


We can calculate \frac{\partial}{\partial\Sigma_k^{-1}}\log p(x_i|z_{ik}=1;\theta) using the (1.167) as follows:

\frac{\partial}{\partial\Sigma_k^{-1}}\log p(x_i|z_{ik}=1;\theta) = \frac{\partial}{\partial\Sigma_k^{-1}}\log\Big[\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}\exp\big(-\tfrac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\big)\Big]
= \frac{\partial}{\partial\Sigma_k^{-1}}\Big[\tfrac{1}{2}\log|\Sigma_k^{-1}| - \tfrac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\Big]    (1.174)
= \tfrac{1}{2}\Sigma_k - \tfrac{1}{2}(x_i-\mu_k)(x_i-\mu_k)^T

where the last equality is obtained from the following relationships:

\frac{\partial}{\partial P}\log|P| = (P^{-1})^T \qquad \frac{\partial}{\partial A}\,x^T A x = x\,x^T
Replacing this result of the (1.174) in the (1.173), we get

\sum_{i=1}^{n}\langle z_{ik}\rangle\Big[\frac{1}{2}\Sigma_k - \frac{1}{2}(x_i-\mu_k)(x_i-\mu_k)^T\Big] = 0    (1.175)

which gives us the update equation of the covariance matrix for the k-th component of the
mixture:

\Sigma_k = \frac{\sum_{i=1}^{n}\langle z_{ik}\rangle\,(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_{i=1}^{n}\langle z_{ik}\rangle}    (1.176)

Now, we need to find the update equation of the prior probability π_k for the k-th component
of the mixture. This means maximizing the expected value of the log-likelihood ⟨l_c⟩
(Eq. 1.168) subject to the constraint that \sum_k \pi_k = 1. To do this, we introduce the
Lagrange multiplier λ by augmenting (1.168) as follows:

L(\theta) = \langle l_c(\theta)\rangle_{Q(Z)} - \lambda\Big(\sum_{k=1}^{K}\pi_k - 1\Big)    (1.177)

Differentiating this expression with respect to π_k, we get

\frac{\partial}{\partial\pi_k}\langle l_c(\theta)\rangle_{Q(Z)} - \lambda = 0 \qquad 1 \le k \le K    (1.178)
Using the (1.168), we have

\frac{1}{\pi_k}\sum_{i=1}^{n}\langle z_{ik}\rangle - \lambda = 0 \quad\text{or equivalently}\quad \sum_{i=1}^{n}\langle z_{ik}\rangle - \lambda\pi_k = 0, \qquad 1 \le k \le K    (1.179)

Now summing Eq. (1.179) over all K models, we have

\sum_{k=1}^{K}\sum_{i=1}^{n}\langle z_{ik}\rangle - \lambda\sum_{k=1}^{K}\pi_k = 0    (1.180)

and since \sum_{k=1}^{K}\pi_k = 1, we have

\lambda = \sum_{k=1}^{K}\sum_{i=1}^{n}\langle z_{ik}\rangle = n    (1.181)

Replacing this result in Eq. (1.179), we obtain the following update formula:

\pi_k = \frac{\sum_{i=1}^{n}\langle z_{ik}\rangle}{n}    (1.182)

which satisfies the constraint \sum_{k=1}^{K}\pi_k = 1.

1.9.2.2 The E Step

Now that we have derived the parameter updating formulas that maximize the expected value of
the complete likelihood function log p(P, Z; θ), we must also make sure that we are
maximizing the incomplete version log p(P; θ) (which is actually the quantity that we really
want to maximize). As mentioned above, we are sure to maximize the incomplete version of the
log-likelihood only when we take the expected value with respect to the posterior
distribution of Z, that is, p(Z|P; θ). Thus, each of the expected values ⟨z_ik⟩ appearing in
the update equations derived in the previous paragraphs should be calculated as follows:

\langle z_{ik}\rangle_{p(Z|P;\theta)} = 1\cdot p(z_{ik}=1|x_i;\theta) + 0\cdot p(z_{ik}=0|x_i;\theta) = p(z_{ik}=1|x_i;\theta)
= \frac{p(x_i|z_{ik}=1;\theta)\,p(z_{ik}=1)}{\sum_{j=1}^{K} p(x_i|z_{ij}=1;\theta)\,p(z_{ij}=1)} = \frac{p(x_i|z_{ik}=1;\theta)\,\pi_k}{\sum_{j=1}^{K} p(x_i|z_{ij}=1;\theta)\,\pi_j}    (1.183)

1.9.3 EM Theory

Since it is difficult to analytically maximize the quantity log p(P; θ), we choose to
maximize the complete version of the log-likelihood ⟨log p(P, Z; θ)⟩_{Q(Z)} by using a
theorem known as Jensen’s inequality. Our goal is, therefore, to maximize
⟨log p(P, Z; θ)⟩_{Q(Z)} with the hope of also maximizing its incomplete version log p(P; θ)
(which in fact represents the quantity we are interested in maximizing). Before going into
this justification, we introduce Jensen’s inequality in the next paragraph.

1.9.3.1 Jensen’s Inequality


If f (x) is a convex function (Fig. 1.25), defined in an interval (a, b), then

∀x1 , x2 ∈ (a, b) ∀λ ∈ [0, 1] : λ f (x1 )+(1−λ) f (x2 ) ≥ f [λx1 +(1−λ)x2 ] (1.184)

If alternatively f (x) is a concave function, then

λ f (x1 ) + (1 − λ) f (x2 ) ≤ f [λx1 + (1 − λ)x2 ] (1.185)

That is, if we evaluate the function at a point between x1 and x2, say
x* = λx1 + (1 − λ)x2, then the value f(x*) lies below the line joining f(x1) and f(x2) (in
the convex case, and vice versa if concave). We are interested

Fig. 1.25 Representation of a convex function for Jensen’s inequality

in evaluating the logarithm, which is a concave function and for which we will consider the
last inequality (1.185). We rewrite log p(P; θ) as follows:

\log p(P;\theta) = \log\int p(P, Z;\theta)\, dZ    (1.186)

Now let’s multiply and divide by an arbitrary distribution Q(Z) in order to find a lower
bound of the log p(P; θ), and we use the result of Jensen’s inequality to continue with
Eq. (1.186):

\log\int p(P,Z;\theta)\,dZ = \log\int Q(Z)\,\frac{p(P,Z;\theta)}{Q(Z)}\,dZ \ge \int Q(Z)\log\frac{p(P,Z;\theta)}{Q(Z)}\,dZ \quad\text{(Jensen)}
= \underbrace{\int Q(Z)\log p(P,Z;\theta)\,dZ}_{\text{expected value of the log-likelihood}}\;\underbrace{-\int Q(Z)\log Q(Z)\,dZ}_{\text{entropy of }Q(Z)}    (1.187)
= \langle\log p(P,Z;\theta)\rangle_{Q(Z)} + H[Q(Z)] = F(Q,\theta)

So we arrive at the following lower bound of the function log p(P; θ):

log p(P; θ ) ≥ F(Q, θ ) (1.188)

Since Q(Z) is an arbitrary distribution, it is independent of θ, and therefore, to maximize
the functional F(Q, θ), it is sufficient to maximize ⟨log p(P, Z; θ)⟩_{Q(Z)} (hence the “M”
step). Even if we have found a lower bound F(Q, θ) for the log p(P; θ) function, this does
not imply that at every step an improvement in the maximization of F translates into an
improvement of the maximum of log p(P; θ). If, instead, we set Q(Z) = p(Z|P; θ) in (1.187),
we can observe that the lower bound becomes an equality, as follows:

\int Q(Z)\log\frac{p(P,Z;\theta)}{Q(Z)}\,dZ = \int p(Z|P;\theta)\log\frac{p(P,Z;\theta)}{p(Z|P;\theta)}\,dZ = \int p(Z|P;\theta)\log\frac{p(Z|P;\theta)\,p(P;\theta)}{p(Z|P;\theta)}\,dZ
= \int p(Z|P;\theta)\log p(P;\theta)\,dZ = \log p(P;\theta)\int p(Z|P;\theta)\,dZ = \log p(P;\theta)    (1.189)

This means that when we calculate the expected value of the complete version of the
log-likelihood ⟨log p(P, Z; θ)⟩_{Q(Z)}, it should be taken with respect to the true posterior
probability p(Z|P; θ) of the hidden variables (hence the “E” step).

Assume a model having observable (or visible) variables x, unobservable (hidden) variables y,
and the relative vector of parameters θ. Our goal is to maximize the logarithm of the
likelihood with respect to the variable θ containing the parameters:

L(\theta) = \log p(x|\theta) = \log\int p(x, y|\theta)\, dy    (1.190)

Now let us again multiply and divide by an arbitrary distribution q(y) defined on the latent
variables y. We can then take advantage of Jensen’s inequality, since we have a convex
combination, weighted by q(y), of a function of the latent variables; in practice, we
consider as f(y) a function of the latent variables y as indicated in (1.191). Any
distribution q(y) on the hidden variables can be used to obtain a lower bound of the
log-likelihood function:

L(\theta) = \log\int q(y)\,\frac{p(x,y|\theta)}{q(y)}\,dy \ge \int q(y)\log\frac{p(x,y|\theta)}{q(y)}\,dy = F(q,\theta)    (1.191)
This lower bound derives from Jensen’s inequality and from the fact that the logarithm
function is concave.18 In the EM algorithm, we alternately optimize F(q, θ) with respect to
q(y) and θ. It can be proved that this mode of operation never decreases L(θ). In summary,
the EM algorithm alternates between the following two steps:

1. Step E optimizes F(Q, θ) with respect to the distribution of the hidden variables, keeping
   the parameters fixed:

   Q^{(k)}(z) = \arg\max_{Q(z)} F(Q(z), \theta^{(k-1)}).    (1.192)

2. Step M maximizes F(Q, θ) with respect to the parameters, keeping the distribution of the
   hidden variables fixed:

   \theta^{(k)} = \arg\max_{\theta} F(Q^{(k)}(z), \theta) = \arg\max_{\theta}\int Q^{(k)}(z)\log p(x, z|\theta)\,dz    (1.193)

where the second equality derives from the fact that the entropy of q(z) does not
depend directly on θ .

The intuition at the basis of the EM algorithm can be schematized as follows:

Step E finds the values of the hidden variables according to their posterior proba-
bilities;
Step M learns the model as if the hidden variables were not hidden.

The EM algorithm is very useful in many contexts, since in many models, if the hidden
variables were no longer hidden, learning would become very simple (as in the case of
Gaussian mixtures). Furthermore, the algorithm breaks down the complex learning problem into
a sequence of simpler learning problems. The pseudo-code of the EM algorithm for Gaussian
mixtures is reported in Algorithm 3.

18 The logarithm of the average is greater than the average of the logarithms.

Algorithm 3 EM algorithm for Gaussian mixtures

1: Initialize ⟨z_ik⟩, π_k, μ_k and Σ_k for 1 ≤ k ≤ K
2: repeat
3:   for i = 1, 2, ..., n do
4:     for k = 1, 2, ..., K do
5:       p(x_i|z_ik = 1; θ) = (2π)^{-d/2} |Σ_k|^{-1/2} exp[-(1/2)(x_i - μ_k)^T Σ_k^{-1} (x_i - μ_k)]
6:       ⟨z_ik⟩ = p(x_i|z_ik = 1; θ) π_k / Σ_{j=1}^{K} p(x_i|z_ij = 1; θ) π_j
7:     end for
8:   end for
9:   for k = 1, 2, ..., K do
10:      Σ_k = Σ_{i=1}^{n} ⟨z_ik⟩ (x_i - μ_k)(x_i - μ_k)^T / Σ_{i=1}^{n} ⟨z_ik⟩
11:      μ_k = Σ_{i=1}^{n} ⟨z_ik⟩ x_i / Σ_{i=1}^{n} ⟨z_ik⟩
12:      π_k = Σ_{i=1}^{n} ⟨z_ik⟩ / n
13:   end for
14: until convergence of parameters
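The following Python sketch is our own illustrative translation of Algorithm 3 using NumPy and SciPy, not the authors' code; the initialization (K random points, a shared sample covariance) and the fixed number of iterations are simplifying assumptions:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture (Algorithm 3): returns (pi, mu, Sigma, responsibilities)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)]          # simple initialization: K random points
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities <z_ik> (Eq. 1.183)
        r = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)
        # M step: update parameters (Eqs. 1.172, 1.176, 1.182)
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (r[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, r

On an (n, d) data matrix X, em_gmm(X, K=3) would return the mixture weights, means, covariances, and the soft responsibilities ⟨z_ik⟩.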

1.9.4 Nonparametric Classifiers

In the preceding paragraphs, we have explicitly assumed the conditional probability densities
p(x|ωi) and the a priori probability density p(x) to be known, or have assumed them known in
parametric form (for example, Gaussian) and estimated (for example, with

the maximum likelihood (ML) method), with a supervised approach, the relative
characteristic parameters. Moreover, these densities have been considered unimodal, whereas
in real applications they are always multimodal. The extension with Gaussian mixtures is
possible, even if it is necessary to determine the number of components and hope that the
algorithm (for example, EM) that estimates the relative parameters converges toward a global
optimum.
With the nonparametric methods, no assumption is made on the knowledge of the
density functions of the various classes that can take arbitrary forms. These methods
can be divided into two categories, those based on the Density Estimation—DE and
those that explicitly use the features of the patterns considering the training sets
significant for the classification. In Sect. 1.6.5, some of these have been described
(for example, the k-Nearest-Neighbor algorithm). In this section, we will describe the
simple method based on the Histogram and the more general form for estimating
density together with the Parzen Window.

1.9.4.1 Estimation of Probability Density

The simplest way to obtain a nonparametric DE is given by the histogram. The histogram method
partitions the pattern space into several distinct containers (bins) of width Δ_i and
approximates the density p_i(x) at the center of each bin with the fraction of the n_i
patterns of the training set P = {x_1, ..., x_N} that fall into the corresponding i-th bin:

p_i(x) = \frac{n_i}{N\,\Delta_i}    (1.194)

where the density is constant over the whole width of the bin, which is normally chosen with
the same size for all bins, Δ_i = Δ (see Fig. 1.26). With (1.194), the objective is to model
the normalized density p(x) from the N observed patterns P.
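A minimal sketch of (1.194) (with an assumed bin width Δ and a one-dimensional training set; numpy.histogram does the counting, and the function name is ours) is:

import numpy as np

def histogram_density(x_query, samples, delta=0.08):
    """Histogram estimate p_i(x) = n_i / (N * delta) at the bin containing x_query (Eq. 1.194)."""
    lo, hi = samples.min(), samples.max()
    edges = np.arange(lo, hi + delta, delta)            # bins of constant width delta
    counts, edges = np.histogram(samples, bins=edges)
    i = np.clip(np.searchsorted(edges, x_query) - 1, 0, len(counts) - 1)
    return counts[i] / (len(samples) * delta)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.3, 0.05, 300), rng.normal(0.7, 0.1, 200)])
print(histogram_density(0.35, samples))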

Fig. 1.26 The density of the histogram seen as a function of the bin width Δ (panels with Δ = 0.04, 0.08, 0.25). With large values of Δ, the resulting density is very smooth and the correct form of the p(x) distribution, in this case consisting of a mixture of two Gaussians, is lost. For very small values of Δ, we can see very pronounced and isolated peaks that do not reproduce the true distribution p(x)

From the figure, it can be observed that the approximation p(x) can be attributed to a
mixture of Gaussians. The approximation p(x) is characterized by Δ. For very large values,
the density is too smooth and the bimodal configuration of p(x) is lost, while for smaller
values of Δ, a good approximation of p(x) is obtained, recovering its bimodal structure.
The histogram method for estimating p(x), although very simple to compute from the training
set as the pattern sequence is observed, has some limitations:

(a) Discontinuity of the estimated density, due to the bin boundaries rather than to an
intrinsic property of the density.
(b) Scaling problem of the number of bins with d-dimensional patterns: we would have M^d
bins, with M bins for each dimension.

Normally the histogram is used for a fast qualitative display (up to 3 dimensions) of the
pattern distribution. A general formulation of the DE is obtained based on probability
theory. Consider pattern samples x ∈ R^d with associated density p(x). Let ℛ be a bounded
region in the feature domain; the probability P that a pattern x falls in the region ℛ is
given by

P = \int_{\mathcal{R}} p(x')\, dx'    (1.195)

P can be considered in some way an approximation of p(x), that is, as an average of p(x) over
the region ℛ. The (1.195) is useful for estimating the density p(x). Now suppose we have a
training set P of N independent (iid) pattern samples x_i, i = 1, ..., N drawn from the
distribution p(x). The probability that k of these N samples fall in the region ℛ is given by
the binomial distribution:

P(k) = \binom{N}{k}\, P^k (1-P)^{N-k}    (1.196)

It is shown by the properties of the binomial distribution that the mean and the variance of
the ratio k/N (considered as a random variable) are given by

E\Big[\frac{k}{N}\Big] = P \qquad Var\Big[\frac{k}{N}\Big] = E\Big[\Big(\frac{k}{N} - P\Big)^2\Big] = \frac{P(1-P)}{N}    (1.197)

For N → ∞, the distribution becomes more and more peaked with small variance
(Var(k/N) → 0), and we can expect a good estimate of the probability P to be obtained from
the fraction of samples that fall into ℛ:

P \cong \frac{k}{N}    (1.198)

At this point, if we assume that the region ℛ is very small and p(x) is continuous and does
not vary appreciably within it (i.e., is approximately constant), we can write

\int_{\mathcal{R}} p(x')\,dx' \cong p(x)\int_{\mathcal{R}} dx' = p(x)\,V    (1.199)

where x is a pattern inside ℛ and V is the volume enclosed by the region ℛ. By virtue of the
(1.195), (1.198), and the last equation, combining the results, we obtain

P = \int_{\mathcal{R}} p(x')\,dx' \cong p(x)\,V \quad\text{and}\quad P \cong \frac{k}{N} \;\Longrightarrow\; p(x) \cong \frac{k/N}{V}    (1.200)

In essence, this last result assumes that the two approximations are identical. Furthermore,
the estimate of the density p(x) becomes more and more accurate as the number of samples N
increases and the volume V simultaneously shrinks. Note, however, that this leads to the
following contradiction:

1. Reducing the volume implies shrinking the region ℛ until it is sufficiently small (so that
   the density is approximately constant within it), but with the risk that no samples fall
   inside it.
2. Alternatively, a sufficiently large ℛ would contain enough samples k to produce a
   pronounced binomial peak (see Fig. 1.27).

If instead the volume is fixed (and consequently ℛ) and we increase the number of samples of
the training set, then the ratio k/N will converge as desired. But this only produces an
estimate of the spatial average of the density:

\frac{P}{V} = \frac{\int_{\mathcal{R}} p(x')\,dx'}{\int_{\mathcal{R}} dx'}    (1.201)

In reality, we cannot have V very small, considering that the number of samples N is always
limited. It follows that we should accept that the density estimate is a spatial average
associated with a nonzero variance. Now let us see whether these limitations can be avoided
when an unlimited number of samples is available. To evaluate p(x) at x, let us consider a
sequence of regions ℛ_1, ℛ_2, ..., containing patterns x, with ℛ_1 having 1 sample, ℛ_2
having 2 samples, and so on. Let V_n be the volume of ℛ_n, let k_n be the number of samples
falling in ℛ_n, and let p_n(x) be the n-th estimate of p(x); we have

p_n(x) = \frac{k_n/N}{V_n}    (1.202)

Fig. 1.27 Convergence of the probability density estimate. The curves are the binomials associated with the number N of samples (N = 20, 50, 100; axes: relative probability vs. k/N). As N grows, the associated binomial has a peak at the correct value P = 0.7. For N → ∞, the curve tends to the delta function

If we want p_n(x) to converge to p(x), we need the following three conditions:

\lim_{N\to\infty} V_n = 0    (1.203)
\lim_{N\to\infty} k_n = \infty    (1.204)
\lim_{N\to\infty} k_n/N = 0    (1.205)

The (1.203) ensures that the spatial average P/V converges to p(x). The (1.204) essentially
ensures that the frequency ratio k/N converges to the probability P, with the binomial
distribution sufficiently peaked. The (1.205) is required for the convergence of p_n(x) given
by the (1.202).
There are two ways to obtain regions that satisfy the three conditions indicated above
(Eqs. (1.203), (1.204), and (1.205)):

1. Parzen windows shrink the region ℛ_n by specifying the volume V_n as a function of N, for
   example, V_n = 1/√N, and show that p_n(x) converges to p(x) for N → ∞. In other words, the
   region, and therefore the volume used to make the estimate, is fixed without directly
   considering the number of samples included.
2. k_n-nearest neighbors specify k_n as a function of N, for example, k_n = √N, and then
   increase the volume V_n until the associated region ℛ_n contains the k_n samples nearest
   to x.

Figure 1.28 shows a graphical representation of the two methods. The two sequences
represent random variables that normally converge and allow us to estimate the
probability density at a given point in the circular region.

Fig. 1.28 Two methods for estimating density, that of Parzen windows (a) and that of knn-nearest
neighbors (b). The two sequences represent random variables that generally converge to estimate
the probability density at a given point in the circular (or square) region. The Parzen method starts
with a large initial value of the region which decreases as n increases, while the knn method specifies
a number of samples kn and the region V increases until the predefined samples are included near
the point under consideration x

1.9.4.2 Parzen Window

The density estimation with Parzen windows (also known as Kernel Density Estimation—KDE) is
based on windows of variable size, assuming that the region ℛ_n that includes k_n samples is
a hypercube with side length h_n centered in x. With reference to the (1.202), the goal is to
determine the number of samples k_n in the hypercube once its volume is fixed. The volume V_n
of this hypercube is given by

V_n = h_n^d    (1.206)

where d indicates the dimensionality of the hypercube. To find the number of samples k_n that
fall within the region ℛ_n, the window function ϕ(u) (also called the kernel function) is
defined as

\varphi(u) = \begin{cases} 1 & \text{if } |u_j| \le 1/2,\ \forall j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}    (1.207)

This kernel, which corresponds to a unit hypercube centered at the origin, is known as a
Parzen window. It should be noted that the following expression has a unitary value:

\varphi\Big(\frac{x - x_i}{h_n}\Big) = \begin{cases} 1 & \text{if } x_i \text{ falls into the hypercube of volume } V_n \text{ with side } h_n \text{ centered on } x \\ 0 & \text{otherwise} \end{cases}    (1.208)

It follows that the total number of samples inside the hypercube is given by

k_n = \sum_{i=1}^{N}\varphi\Big(\frac{x - x_i}{h_n}\Big)    (1.209)


Fig. 1.29 One-dimensional example of density estimation with Parzen windows. a The training set consists of 7 samples P = {2, 3, 5, 6, 7, 9, 10}, and the window has width h_n = 3. The estimate is calculated with the (1.210) starting from x = 0.5; subsequently, the window is centered on each sample, finally obtaining p(x) as the sum of 7 rectangular functions, each of height 1/(N h_n^d) = 1/(7 · 3¹) = 1/21. b Analogy between the density estimation carried out with the histogram and with the rectangular (hypercubic in the d-dimensional case) Parzen windows, where we observe strong discontinuities of p(x) with very small h_n (as happens for small values of Δ for the bins), while obtaining a very smooth shape of p(x) for large values of h_n, in analogy with the histogram for high values of Δ

By replacing the (1.209) in the Eq. (1.202), we get the KDE density estimate:

p_n(x) = \frac{k_n/N}{V_n} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\,\varphi\Big(\frac{x - x_i}{h_n}\Big)    (1.210)

The kernel function ϕ, in this case called the Parzen window, tells us how to weigh all the
samples in ℛ_n to determine the density p_n(x) at a particular point x. The density estimate
is obtained as the average of the kernel functions evaluated on x and x_i. In other words,
each sample x_i contributes to the estimation of the density in relation to its distance from
x (see Fig. 1.29a). It is also observed that the Parzen window is analogous to the histogram,
with the exception that the bin locations are determined by the samples (see Fig. 1.29b).
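A minimal sketch of the hypercube estimate (1.207)–(1.210) (one-dimensional here; the sample values follow the example of Fig. 1.29, while the function name is ours):

import numpy as np

def parzen_hypercube(x, samples, h):
    """KDE with a rectangular (hypercube) Parzen window, Eq. (1.210), for 1D data."""
    u = (x - samples) / h
    k = np.sum(np.abs(u) <= 0.5)          # samples falling in the window centered on x
    return k / (len(samples) * h)         # V_n = h^d with d = 1

samples = np.array([2, 3, 5, 6, 7, 9, 10], dtype=float)
print(parzen_hypercube(5.5, samples, h=3))   # sums the rectangular contributions of height 1/21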
Now consider a more general form of the kernel function instead of the hypercube. The kernel
function can be seen as an interpolator placed on the various samples x_i of the training set
P instead of considering only the position x. This means that the kernel function ϕ must
satisfy the conditions of a density function, that is, be nonnegative and integrate to 1:

\varphi(x) \ge 0 \qquad \int\varphi(u)\,du = 1    (1.211)

For the hypercube previously considered with the volume V_n = h_n^d, it follows that the
density p_n(x) satisfies the conditions indicated by the (1.211):

\int p_n(x)\,dx = \int\frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\,\varphi\Big(\frac{x-x_i}{h_n}\Big)dx = \frac{1}{N V_n}\sum_{i=1}^{N}\underbrace{\int\varphi\Big(\frac{x-x_i}{h_n}\Big)dx}_{\text{hypercube volume}} = \frac{1}{N V_n}\sum_{i=1}^{N} V_n = 1    (1.212)

If instead we consider an interpolating kernel function that satisfies the density conditions
(1.211), integrating by substitution and putting u = (x − x_i)/h_n, for which du = dx/h_n, we
get:

\int p_n(x)\,dx = \int\frac{1}{N}\sum_{i=1}^{N}\frac{1}{V_n}\,\varphi\Big(\frac{x-x_i}{h_n}\Big)dx = \frac{1}{N V_n}\sum_{i=1}^{N} h_n^d\int\varphi(u)\,du = \frac{1}{N}\sum_{i=1}^{N}\int\varphi(u)\,du = 1    (1.213)

Parzen windows based on the hypercube have several drawbacks. In essence, they produce a very
discontinuous density estimate, and the contribution of each sample x_i is not weighted in
relation to its distance from the point x at which the estimate is calculated. For this
reason, the Parzen window is normally replaced with a kernel function with a smoothing
property. In this way, not only are the samples falling in the window counted, but their
contribution is weighted by the interpolating function. With a number of samples N → ∞ and
choosing an appropriate window size, it can be shown that the estimate converges toward the
true density, p_n(x) → p(x). The most popular choice is the Gaussian kernel:

\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}    (1.214)

considering a one-dimensional density with zero mean and unit variance. The resulting
density, according to the KDE (1.210), is given by

p_\varphi(x) = \frac{1}{N h_n\sqrt{2\pi}}\sum_{i=1}^{N}\exp\Big[-\frac{1}{2h_n^2}(x - x_i)^2\Big]    (1.215)

The Gaussian Parzen window eliminates the problem of the discontinuity of the rectangular
window. The samples x_i of the training set P that are closest to the point x have higher
weight, thus obtaining a smoothed density p_ϕ(x) (see Fig. 1.30). It is observed how the
shape of the estimated density is modeled by the Gaussian kernel functions located on the
observed samples.
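A hedged sketch of the Gaussian-kernel estimate (1.215) (one-dimensional, reusing the sample set of Fig. 1.30; the function name is ours) is:

import numpy as np

def parzen_gaussian(x, samples, h):
    """Gaussian Parzen window estimate p_phi(x), Eq. (1.215)."""
    return np.mean(np.exp(-0.5 * ((x - samples) / h) ** 2)) / (h * np.sqrt(2 * np.pi))

samples = np.array([2, 3, 5, 6, 7, 9, 10], dtype=float)
xs = np.linspace(0, 12, 5)
print([round(parzen_gaussian(x, samples, h=1.0), 4) for x in xs])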

Fig. 1.30 Use of Parzen windows with Gaussian kernels centered on each sample of the training set for density estimation. In the one-dimensional example, the form of the density p_ϕ(x) is given by the sum of the 7 Gaussians, each centered on a sample and scaled by a factor 1/7

We will now analyze the influence of the window width (also called the smoothing parameter) $h_n$ on the resulting density $p_\varphi(x)$. A large value tends to produce a smoother density that masks the structure of the samples; on the contrary, small values of $h_n$ tend to produce a very peaked density function that is difficult to interpret. To have a quantitative measure of the influence of $h_n$, we consider the function $\delta_n(\mathbf{x})$ defined as follows:

$$\delta_n(\mathbf{x}) = \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right) = \frac{1}{h_n^d}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right) \qquad (1.216)$$

where $h_n$ affects the horizontal scale (width) while the volume $h_n^d$ affects the vertical scale (amplitude). The function $\delta_n(\mathbf{x})$ also satisfies the conditions of a density function; in fact, integrating by substitution with $\mathbf{u} = \mathbf{x}/h_n$, we get

$$\int \delta_n(\mathbf{x})\,d\mathbf{x} = \int \frac{1}{h_n^d}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right)d\mathbf{x} = \frac{1}{h_n^d}\int \varphi(\mathbf{u})\,h_n^d\,d\mathbf{u} = \int \varphi(\mathbf{u})\,d\mathbf{u} = 1 \qquad (1.217)$$

By rewriting the density as an average value, we have

$$p_n(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N}\delta_n(\mathbf{x}-\mathbf{x}_i) \qquad (1.218)$$

The effect of the h n parameter (i.e., the volume Vn ) on the function δn (x) and con-
sequently on the density pn (x) results as follows:

(a) For h n that tends toward high values, a contrasting action is observed on the
function δn(x): on one side there is a reduction of the vertical scale factor (amplitude) and, on the other, an increase of the horizontal scale factor (width). In this case,
we get a poor (very smooth) resolution of the density pn (x) considering that it
will be the sum of many delta functions centered on the samples (analogous to
the convolution process).
(b) For h n that tends toward small values, δn (x) becomes very peaked and pn (x) will
result in the sum of N peaked pulses with high resolution and with an estimate

Fig. 1.31 Parzen window estimates associated to a univariate normal density for different values of the parameters $h_1$ and $N$, respectively, window width and number of samples. It is observed that for very large $N$, the influence of the window width is negligible

affected by a lot of statistical variability (especially in the presence of noisy


samples).

The theory suggests that for an unlimited N of samples, with the volume Vn
tending to zero, the density pn (x) converges toward an unknown density p(x).
In reality, having a limited number of samples, the best that can be done is to
find a compromise between the choice of h n and the limited number of samples.
Considering the training set of samples $\mathcal{P} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ as random variables on which the density $p_n(\mathbf{x})$ depends, for any value of $\mathbf{x}$, if $p_n(\mathbf{x})$ has mean estimate $\hat{p}_n(\mathbf{x})$ and variance estimate $\hat{\sigma}_n^2(\mathbf{x})$, it is proved [18] that

$$\lim_{n\to\infty}\hat{p}_n(\mathbf{x}) = p(\mathbf{x}) \qquad \lim_{n\to\infty}\hat{\sigma}_n^2(\mathbf{x}) = 0 \qquad (1.219)$$

Let us now consider a training set of i.i.d. samples drawn from a normal distribution $p(x) \sim N(0, 1)$. If a Parzen Gaussian window given by (1.214) is used, setting $h_n = h_1/\sqrt{N}$ where $h_1$ is a free parameter, the resulting estimate of the density is given by (1.215), which is the average of normal densities centered on the samples $x_i$. Figure 1.31 shows the estimate of the true density $p(x) = N(0, 1)$ using a Parzen window with Gaussians as the free parameter varies, $h_1 = 1, 0.4, 0.1$, and the number of samples $N = 1, 10, 100, \infty$. It is observed that in the approximation

process, starting with a single sample (N = 1), the estimate is just the Gaussian window centered on it and, while varying $h_1$, the mean and variance of $p_n(x)$ differ from those of the true density.
As the samples grow to infinity, the pn (x) tends to converge to the true density
p(x) of the training set regardless of the free parameter h 1 . It should be noted that in
the approximation process, for values of $h_1 = 0.1$, i.e., with a small window width, the contribution of the individual samples remains distinguishable, highlighting their possible
noise. From the examples, it can be seen that it is strategic to find the best value of
h n , possibly adapting it to various training sets.

1.9.4.3 Classifier with Parzen Windows


A Parzen window-based classifier basically works as follows:

(a) The conditional densities $p(\mathbf{x}|\omega_i)$ are estimated for each class (Eq. (1.215)) and the test samples are classified according to the maximum posterior probability (Eq. (1.70)). The prior probabilities can also be taken into account, in particular when they are very different.
(b) The decision regions for this type of classifier depend a lot on the choice of the
kernel function used.

A 2D binary classifier based on Parzen windows is characterized by the choice of


the width $h_n$ of the window, which influences the decision regions. In the learning phase, small values of $h_n$ keep the classification errors on the training set to a minimum but produce very complex decision regions, so that in the test phase (the final, important phase of the classification) problems known as generalization errors can arise. For large values of $h_n$, the classification on the training set is not perfect but simpler decision regions are obtained.
This last solution tends to minimize the generalization error in the test phase with
new samples. In these cases, you can use a cross-validation19 approach since there
is no robust theory for the choice of the exact width of the window.
In conclusion, a Parzen window-based classifier is a good method, applicable to samples drawn from any distribution, and in theory it converges to the true density as the number of samples tends to infinity. The negative aspect of this classifier

19 Cross-validation is a statistical technique that can be used in the presence of an acceptable number

of the observed sample (training set). In essence, it is a statistical method to validate a predictive
model. Given a sample of data, it is divided into subsets, some of which are used for the construction of the model (the training sets) and others to be compared with the predictions of the model (the validation sets). By averaging the quality of the predictions over the various validation sets, we have a measure of the accuracy of the predictions. In the context of classification, the training set
consists of samples of which the class to which they belong is known in advance, ensuring that this
set is significant and complete, i.e., with a sufficient number of representative samples of all classes.
For the verification of the recognition method, a validation set is used, also consisting of samples
whose class is known, used to check the generalization of the results. It consists of a set of samples
different from those of the training set.

concerns the limited number of samples available in concrete applications, which makes the choice of $h_n$ difficult. It also requires a high computational complexity: the classification of a single sample requires the evaluation of a function that potentially depends on all the samples. Moreover, the number of samples required grows exponentially as the dimension of the feature space increases.

1.10 Method Based on Neural Networks

The Neural Network—NN has seen an explosion of interest over the years, and is
successfully applied in an extraordinary range of sectors, such as finance, medicine,
engineering, geology, and physics.
Neural networks are applicable in practically every situation in which a relationship exists between predictor (independent, input) and predicted (dependent, output) variables, even when this relationship is very complex and not easy to define in terms of correlation or similarity between the various classes.
also used for the classification problem that aims to determine which of a defined
number of classes a given sample belongs to.

1.10.1 Biological Motivation

Studies on neural networks are inspired by the attempt to understand the functioning
mechanisms of the human brain and to create models that mimic this functionality.
This has been possible over the years with the advancement of knowledge in neuro-
physiology which has allowed various physicists, physiologists, and mathematicians
to create simplified mathematical models, exploited to solve problems with new
computational models called neurocomputing. Biological motivation is seen as a
source of inspiration for developing neural networks by imitating the functionality of
the brain regardless of its actual model of functioning. The nervous system plays the
fundamental role of intermediary between the external environment and the sensory
organs to guarantee the appropriate responses between external stimuli and internal
sensory states. This interaction occurs through the receptors of the sense organs,
which, excited by the external environment (light energy, ...), transmit the signals to
other nerve cells which are in turn processed, producing informational patterns useful
to the executing organs (effectors, an organ, or cell that acts in response to a stimu-
lus). The neuron is a nerve cell, which is the basic functional building element
of the nervous system, capable of receiving, processing, storing and transmitting
information.
Ramón y Cajal (1911) introduced the idea of neurons as elements of the structure of the human brain. The response times of neurons are 5–6 orders of magnitude slower than the gates of silicon circuits: the propagation of the signals in a silicon chip takes a few nanoseconds ($10^{-9}$ s), while the neural activity propagates with times of the order of milliseconds ($10^{-3}$ s). However, the human brain is made up of about 100

Fig. 1.32 Biological structure of the neuron (soma with nucleus, dendrites, axon with myelin sheaths and nodes of Ranvier, synaptic terminals; the impulses travel from the soma toward the synaptic terminals)

billion ($10^{12}$) nerve cells, also called neurons, interconnected with each other through up to 1 trillion ($10^{18}$) special structures called synapses or connections. The number of synapses per neuron can range from 2,000 to 10,000. In this way, the brain is in
fact a massively parallel, efficient, complex, and nonlinear computational structure.
Each neuron constitutes an elementary process unit. The computational power of
the brain depends above all on the high degree of interconnection of neurons, their
hierarchical organization, and the multiple activities of the neurons themselves.
This organizational capacity of the neurons constitutes a computational model that
is able to solve complex problems such as object recognition, perception, and motor
control with a speed considerably higher than that achievable with traditional large-
scale computing systems. It is in fact well known how spontaneously a person is able to perform visual recognition functions, for example recognizing a particular object among many other unknown ones, requiring only a few milliseconds. Since birth, the
brain has the ability to acquire information about objects, with the construction of
its own rules that in other words constitute knowledge and experience. The latter is
realized over the years together with the development of the complex neural structure
that occurs particularly in the first years of birth. The growth mechanism of the neural
structure involves the creation of new connections (synapses) between neurons and
the modification of existing synapses. The rate of development of the synapses is about 1.8 million per second (from the first 2 months after birth until 2–3 years of age; the number is then reduced on average by half in adulthood). The structure of a neuron is schematized
in Fig. 1.32. It consists of three main components: the cellular body or soma (the
central body of the neuron that includes the genetic heritage and performs cellular
functions), the axon (filiform nerve fiber), and the dendrites.
A synaptic connection is made by the axons which constitute the transmission
lines of output of the electrochemical signals of neurons, whose signal reception
structure (input) consists of the dendrites (the name derives from the similarity to
the tree structure) that have different ramifications. Therefore, a neuron can be seen
as an elementary unit that receives electrochemical impulses from different dendrites
and once processed in the soma several electrochemical impulses are transmitted to
other neurons through the axon. The end of the latter branches forming terminal

fibers from which the signals are transmitted to the dendrites of other neurons. The
transmission between axon and dendrite of other neurons does not occur through a
direct connection but there is a space between the two cells called synaptic fissure
or cleft or simply synapse.
A synapse is a junction between two neurons. A synapse is configured as a
mushroom-shaped protrusion called the synaptic node or knob that is modeled from
the axon to the surface of the dendrite. The space between the synaptic node and the
dendritic surface is precisely the synaptic fissure through which the excited neuron
propagates the signal through the emission of fluids called neurotransmitters.
These come into contact with the dendritic structure (consisting of post-synaptic
receptors) causing the exchange of electrically charged ions-atoms (entering and
leaving the dendritic structure) thus modifying the electrical charge of the dendrite.
In essence, an electrical signal is propagated from the axon, a chemical transmission
is propagated in the synaptic cleft, and then an electrical signal is propagated in the
dendritic structure.
The body of the neuron receiving the signals from its dendrites processes them by
adding them together and triggers an excitator y response (increasing the frequency
of discharge of the signals) or inhibitor y (decreasing the discharge frequency) in
the post-synaptic neuron. Each post-synaptic neuron accumulates signals from other
neurons that add up to determine its excitation level. If the excitation level of the
neuron has reached a threshold level limit, this same neuron produces a signal
guaranteeing the further propagation of the information toward other neurons that
repeat the process.
During the propagation of each signal, the synaptic permeability, as well as the
thresholds, are slightly adapted in relation to the signal intensity, for example, the
activation threshold (firing) is lowered if the transfer is frequent or is increased if
the neuron has not been stimulated for a long time. This represents the plasticity of the neuron, that is, the ability to adapt to stimulations and stresses, leading to a reorganization of the nerve cells. The synaptic plasticity leads to the continuous remodeling of the
synapses (removal or addition) and is the basis of the learning of the brain’s abilities
in particular during the period of development of a living organism (in the early years of a child new synaptic connections are formed at a rate of about one million per second), during adult life (when plasticity is reduced), and also in the phases of functional recovery after injuries.

1.10.2 Mathematical Model of the Neural Network

The simplified functional scheme, described in the previous paragraph, of the bio-
logical neural network is sufficient to formulate a model of artificial neural network
(ANN) from the mathematical point of view. An ANN can be made using electronic components or it can be simulated in software on traditional digital computers.
An ANN is also called neuro-computer, connectionist network, parallel distributed
processor (PDP), associative network, etc.

Fig. 1.33 Mathematical model of the perceptron (a) and its geometric representation used as a linear binary classifier (b). In this 2D example of the feature space, the hyperplane g(x) = 0 is the dividing line of the two classes ω1 and ω2

In analogy to the behavior of the human brain that learns from experience, even a
neural computational model must solve the problems in the same way without using
an algorithmic approach. In other words, an artificial neural network is seen as an
adaptive machine with some features. The network must adapt the nodes (neurons)
for the learning of knowledge through a learning phase observing examples, organiz-
ing and modeling this knowledge through the synaptic weights of the connections,
and finally making this knowledge available for its generalized use.
Before addressing how a neural network can be used to solve the classification
problem, it is necessary to analyze how an artificial neuron can be modeled, what the
neuron connection architecture can be, and how the learning phase can be realized.

1.10.2.1 Model of Artificial Neuron


A neuron can represent any significant entity (for example, a pixel of an image, an
element of the features vector) that must be processed with a neural model.
A first neuron model was proposed by McCulloch and Pitts [21]. In this model, the
essential components of the neuron, an elementary unit of neural network processing,
are

• the set of synaptic inputs,


• the adder of input signals, and
• the activation function that generates the output value of the neuron (see Fig. 1.33a).

This model describes a neuron represented by N input values x = (x1, x2, . . . , xN) that


model the signals coming from dendrites. These signals are associated with synaptic
weights w = (w1 , w2 , . . . , w N ) which measure their permeability. According to the
neuro-biological motivation some of these synaptic weights are negative to express
their inhibitory characteristic. The weighted sum of the input values $x_i$ represents the

excitation level ξ of the neuron:


$$\xi = \sum_{i=1}^{N} w_i x_i \qquad (1.220)$$

The excitation level ξ compared to a suitable θ threshold associated with the neuron
determines its final state, i.e., it produces the output y of the neuron that models the
electric-chemical signal generated by the axon.
The nonlinear growth of the output value y after the excitation value has reached
the θ threshold level is determined by the activation function (i.e., the transfer func-
tion) σ for which, we have
$$y = \sigma(\xi) = \sigma\!\left(\sum_{i=1}^{N} w_i x_i\right) = \sigma(\mathbf{w}\mathbf{x}) = \begin{cases} 1 & \text{if } \xi \ge \theta \\ 0 & \text{if } \xi < \theta \end{cases} \qquad (1.221)$$

where the output y is binary with the step activation function σ (see Fig. 1.33a). A
more compact mathematical formulation of the neuron is obtained with a simple arti-
fice, considering the activation function σ with threshold zero and the true threshold
with opposite sign, seen as an additional input x0 = 1 with constant unitary value,
appropriately modulated by a weight coefficient w0 = −θ (also called bias) which
has the effect of controlling the translation of the activation threshold with respect
to the origin of the signals. It follows that the (1.221) becomes
$$y = \sigma(\xi) = \sigma\!\left(\sum_{i=0}^{N} w_i x_i\right) = \sigma(\mathbf{w}\mathbf{x}) = \begin{cases} 1 & \text{if } \xi \ge 0 \\ 0 & \text{if } \xi < 0 \end{cases} \qquad (1.222)$$

where the vector of the input signals x and that of the weights w are augmented,
respectively, with x0 and w0 .
The activation function σ considered in the (1.222) is inspired by the functionality
of the biological neuron. Together with the θ threshold, they have the effect of limiting
the amplitude of the output signal y of a neuron. Alternatively, different activation
functions are used to simulate different functional models of the neuron based on
mathematical or physical criteria.
The simplest activation function used in neural network problems is linear given
by y = σ(ξ) = ξ (the output is the same as the input and the function is defined in the range [−∞, +∞]). Other types of activation functions (see Fig. 1.34) are more commonly used: binary step, linear piecewise with threshold, nonlinear sigmoid, and hyperbolic tangent.

1. Binary step:
$$\sigma(\xi) = \begin{cases} 1 & \text{if } \xi \ge 0 \\ 0 & \text{if } \xi < 0 \end{cases} \qquad (1.223)$$

Fig. 1.34 Activation functions that model the neuron output. From left to right: binary step; linear piecewise; sigmoid; and hyperbolic tangent

2. Linear piecewise:
$$\sigma(\xi) = \begin{cases} 1 & \text{if } \xi > 1 \\ \xi & \text{if } 0 \le \xi \le 1 \\ 0 & \text{if } \xi < 0 \end{cases} \qquad (1.224)$$

3. Standard sigmoid or logistic:
$$\sigma(\xi) = \frac{1}{1+e^{-\xi}} \qquad (1.225)$$
4. Hyperbolic tangent:
$$\sigma(\xi) = \frac{1-e^{-\xi}}{1+e^{-\xi}} = \tanh\!\left(\frac{\xi}{2}\right) \qquad (1.226)$$

All activation functions assume values between 0 and 1 (except in some cases where
the interval can be defined between −1 and 1, as is the case for the hyperbolic tangent
activation function). As we shall see later, when we analyze the learning methods of a
neural network, these activation functions do not properly model the functionality of
a neuron and above all of a more complex neural network. In fact, synapses modeled
with simple weights are only a rough approximation of the functionality of biological
neurons that are a complex nonlinear dynamic system.
In particular, also the nonlinear activation functions of sigmoid and hyperbolic
tangent are inadequate when they saturate around the extremes of the interval 0 or
1 (or −1 and +1), where the gradient tends to vanish. In these regions the activation function effectively stops operating, no output signal is generated by the neuron, and consequently the update of the weights is blocked. Other aspects
concern the non-zero-centered output, as is the case for the sigmoid activation function, and the slow convergence. Also,
the function of hyperbolic tangent, although with zero-centered output, presents
the problem of saturation. Despite these limitations in the applications of machine
learning, sigmoid and hyperbolic tangent functions were frequently used.
In recent years, with the great diffusion of deep learning, new activation functions have become very popular: ReLU, Leaky ReLU, Parametric ReLU, and ELU. These functions are simple and overcome the limitations of the previous functions. The description of these new activation functions is reported in Sect. 2.13 on Deep Learning.

1.10.3 Perceptron for Classification

The neuron model of McCulloch and Pitts (MP) does not learn: the weights and thresholds are determined analytically and operate with binary and discrete input and output signal values. The first successful neuro-computational model was the perceptron devised by Rosenblatt, based on the neuron model defined by MP.
The objective of the perceptron is to classify a set of input patterns (stimuli)
x = (x1 , x2 , . . . , x N ) into two classes ω1 and ω2 . In geometric terms, the clas-
sification is characterized by the hyperplane that divides the space of the input
patterns. This hyperplane is determined by the linear combination of the weights
w = (w1, w2, . . . , wN) of the perceptron with the features of the pattern x.
According to the (1.222), the hyperplane of separation of the two decision regions
is given by
$$\mathbf{w}^T\mathbf{x} = 0 \quad \text{or} \quad \sum_{i=1}^{N} w_i x_i + w_0 = 0 \qquad (1.227)$$

where the vector of the input signals x and that of the weights w are augmented,
respectively, with x0 = 1 and w0 = −θ to include the bias20 of the level of excite-
ment. Figure 1.33b shows the geometrical interpretation of the perceptron used to
determine the hyperplane of separation, in this case a straight line, to classify two-
dimensional patterns in two classes. In essence, the N input stimuli to the neuron are
interpreted as the coordinates of a N -dimensional pattern projected in the Euclidean
space and the synaptic weights w0 , w1 , . . ., w N , including the bias, are seen as the
coefficients of the hyperplane equation we denote with g(x). A generic pattern x is
classified as follows:

$$g(\mathbf{x}) = \sum_{i=1}^{N} w_i x_i + w_0 \;\;\begin{cases} > 0 \implies \mathbf{x} \in \omega_1 \\ < 0 \implies \mathbf{x} \in \omega_2 \\ = 0 \implies \mathbf{x} \text{ on the hyperplane} \end{cases} \qquad (1.228)$$

In particular, with reference to (1.222), a pattern x ∈ ω1 (the region Ω1 in the figure) if the neuron is activated, i.e., it assumes the state y = 1; otherwise it belongs to the class ω2 (the region Ω2) if the neuron is passive, being in the state y = 0. The perceptron thus modeled can classify patterns into two classes that are referred to as linearly separable patterns. Considering some properties of vector algebra and the proposed neuron model in terms of the stimulus vector x and weight vector w, we observe that the excitation level is the inner product of these augmented vectors, that is, ξ = wT x. According to (1.227), we can rewrite in the equivalent

20 In this context, the bias is seen as a constant that makes the perceptron more flexible. It has a function analogous to the constant b of a linear function y = ax + b, which geometrically allows the line to be positioned without necessarily passing through the origin (0, 0). In the context of the perceptron, it allows a more flexible displacement of the separating line to better fit the data.

vector form:
$$\mathbf{w}^T\mathbf{x} = w_0 x_0 \qquad (1.229)$$

where the signal and weight vectors do not include the bias. Equation (1.229) in this form is useful in observing that, if $\|\mathbf{w}\|$ and $w_0 x_0$ are constant, then the projection $x_w$ of the vector $\mathbf{x}$ on $\mathbf{w}$ is constant, since$^{21}$:

$$x_w = \frac{w_0 x_0}{\|\mathbf{w}\|} \qquad (1.230)$$
We also observe (see Fig. 1.33b) that $\mathbf{w}$ determines the orientation of the decision plane (1.228), being orthogonal to it (in the space of augmented patterns) by (1.227), while the bias $w_0$ determines the location of the decision surface. Equation (1.228) also informs us that, in 2D space, all the patterns $\mathbf{x}$ on the separation line of the two classes have the same projection $x_w$.
It follows that a generic pattern $\mathbf{x}$ can be classified, in essence, by evaluating whether the excitation level of the neuron is greater or less than the threshold value, as follows:

$$x_w \;\begin{cases} > \dfrac{w_0 x_0}{\|\mathbf{w}\|} \implies \mathbf{x} \in \omega_1 \\[2mm] < \dfrac{w_0 x_0}{\|\mathbf{w}\|} \implies \mathbf{x} \in \omega_2 \end{cases} \qquad (1.231)$$

21 By definition, the scalar or inner product between two vectors $\mathbf{x}$ and $\mathbf{w}$ belonging to a vector space $\mathbb{R}^N$ is a symmetric bilinear form that associates these vectors to a scalar in the real number field $\mathbb{R}$, indicated in analytic geometry with:
$$\langle \mathbf{w}, \mathbf{x}\rangle = \mathbf{w}\cdot\mathbf{x} = (\mathbf{w},\mathbf{x}) = \sum_{i=1}^{N} w_i x_i$$
In matrix notation, considering the product among matrices, where $\mathbf{w}$ and $\mathbf{x}$ are seen as $N\times 1$ matrices, the scalar product is formally written
$$\mathbf{w}^T\mathbf{x} = \sum_{i=1}^{N} w_i x_i$$
The (convex) angle $\theta$ between the two vectors in any Euclidean space is given by
$$\theta = \arccos\frac{\mathbf{w}^T\mathbf{x}}{|\mathbf{w}||\mathbf{x}|}$$
from which a useful geometric interpretation can be derived, namely the orthogonal projection of one vector on the other (without calculating the angle $\theta$). For example, considering that $x_w = |\mathbf{x}|\cos\theta$ is the length of the orthogonal projection of $\mathbf{x}$ over $\mathbf{w}$ (or vice versa for $w_x$), this projection is obtained considering that $\mathbf{w}^T\mathbf{x} = |\mathbf{w}|\,|\mathbf{x}|\cos\theta = |\mathbf{w}|\,x_w$, from which we have
$$x_w = \frac{\mathbf{w}^T\mathbf{x}}{|\mathbf{w}|}$$

Fig. 1.35 Geometrical configuration a of two classes ω1 and ω2 linearly separable and b configuration of nonlinear classes not separable with the perceptron

1.10.3.1 Learning with the Perceptron


Recall that the objective of the perceptron is to classify a set of patterns by receiving
as input stimuli the features N -dimension of a generic vector pattern x and assign it
correctly to one of the two classes.
To do this, it is necessary to know the hyperplane of separation given by the (1.228)
or to know the weight vector w. This can be done in two ways:

1. Direct calculation. Possible in the case of simple problems such as the creation
of logic circuits (AND, OR, ...).
2. Calculation of synaptic weights through an iterative process. To emulate biolog-
ical learning from experience, synaptic weights are adjusted to reduce the error
between the output value generated by the appropriately stimulated perceptron
and the correct output defined by the pattern samples (training set).

The latter is the most interesting aspect if one wants to use the perceptron as a neural
model for supervised classification. In this case, the single perceptron is trained
(learning phase) offering as input stimuli the features xi of the sample patterns and
the values y j of the belonging classes in order to calculate the synaptic weights
describing a classifier with the linear separation surface between the two classes.
Recall that the single perceptron can classify only linearly separable patterns (see
Fig. 1.35a).
It can be shown that the perceptron performs the learning phase by minimizing a cost function that compares the current value y(t) of the neuron response at time t with the desired value d(t), appropriately adjusting the synaptic weights during the various iterations until converging to the optimal result.
Let $\mathcal{P} = \{(\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), \ldots, (\mathbf{x}_M, d_M)\}$ be the training set consisting of $M$ pairs of pattern samples $\mathbf{x}_k = (x_{k0}, x_{k1}, \ldots, x_{kN})$, $k = 1, \ldots, M$ (augmented vectors with $x_{k0} = 1$) and the corresponding desired classes $d_k \in \{0, 1\}$ selected by the
expert. Let w = (w0 , w1 , . . . , w N ) be the augmented weight vector. Considering
that xi0 = 1, w0 corresponds to the bias that will be learned instead of the constant
bias θ . We will denote by w(t) the value of the weight vector at time t during the
iterative process of perceptron learning.
The convergence algorithm of the perceptron learning phase consists of the fol-
lowing phases:

1. Initialization. At time t = 0, the weights are initialized with small random values (or w(0) = 0).
2. For each adaptation step t = 1, 2, 3, . . ., a pair (x_k, d_k) of the training set P is presented.
3. Activation. The perceptron is activated by providing the feature vector x_k. Its current response y_k(t) is computed, compared to the desired output d_k, and the error d_k − y_k(t) is then calculated, where

$$y_k(t) = \sigma[\mathbf{w}(t)\cdot\mathbf{x}_k] = \sigma[w_0(t) + w_1(t)x_{k1} + \cdots + w_N(t)x_{kN}] \qquad (1.232)$$

4. Adaptation of the synaptic weights:

$$w_j(t+1) = w_j(t) + \eta\,[d_k - y_k(t)]\,x_{kj} \qquad \forall j,\; 0 \le j \le N \qquad (1.233)$$

where $0 < \eta \le 1$ is the parameter that controls the degree of learning (known as the learning rate).

Note that the expression [dk − yk (t)] in the (1.233) indicates the discrepancy between
the actual perceptron response yk (t) calculated for the input pattern xk and the desired
associated output dk of this pattern. Essentially the error of the t-th output of the
perceptron is determined with respect to the k-th training pattern. Considering that $d_k, y_k(t) \in \{0, 1\}$, it follows that $(d_k - y_k(t)) \in \{-1, 0, 1\}$. Therefore, if this error
is zero, the relative weights are not changed. Alternatively, this discrepancy can take
the value 1 or −1 because only binary values are considered in output. In other words,
the perceptron-based classification process is optimized by iteratively adjusting the
synaptic weights that minimize the error.
A very small training parameter η (η ≈ 0) implies a very limited modification of
the current weights that remain almost unchanged with respect to the values reached
with the previous adaptations. With high values (η ≈ 1) the synaptic weights are
significantly modified resulting in a high influence of the current training pattern
together with the error dk − yk (t), as shown in the (1.233). In the latter case, the
adaptation process is very fast. Normally if you have a good knowledge of the training
set, you tend to use a fast adaptation with high values of η.

1.10.3.2 Comparison Between Statistical Classifier and Perceptron


While in the statistical approach, the cost function J is evaluated starting from the
statistical information, in the neural approach, it is not necessary to know the statis-
tical information of the pattern vectors. The perceptron with single neuron operates
under the conditions that the objects to be classified are linearly separable. When
class distributions are Gaussian, the Bayes classifier is reduced to a linear classifier
like the perceptron. When the classes are not separable (see Fig. 1.35b), the learning
algorithm of the perceptron oscillates continuously for patterns that fall into the over-
lapping area. The statistical approach attempts to solve problems even when it has to
classify patterns belonging to the overlapping zone. The Bayes classifier, assuming

the distribution of the Gaussian classes, controls the possible overlap of the class
distributions with the statistical parameters of the covariance matrix.
The learning algorithm of the perceptron, not depending on the statistical param-
eters of the classes, is effective when it has to classify patterns whose features are
dependent on nonlinear physical phenomena and the distributions are strongly dif-
ferent from the Gaussian ones as assumed in the statistical approach. The perceptron
learning approach is adaptive and very simple to implement, requiring only memory
space for synaptic weights and thresholds.

1.10.3.3 Batch Perceptron Algorithm


The perceptron described above is based on the adaptation process to find the hyper-
plane of separation of two classes. We now describe the perceptron convergence
algorithm that calculates the synaptic vector w based on the cost function J (w).
The approach considers a function that allows the application of the gradient search.
Using the previous terminology, the cost function of the perceptron is defined as
follows:

$$J(\mathbf{w}) = \sum_{\mathbf{x}\in\mathcal{M}} (-\mathbf{w}^T\mathbf{x}) \qquad (1.234)$$

where M is the set of patterns x misclassified from the perceptron using the weight
vector w [18]. If all the samples were correctly classified, the set M would be
empty and consequently the cost function J (w) is zero. The effectiveness of this
cost function is due to its differentiation with respect to the weight vector w. In fact,
differentiating the J (w) (Eq. 1.234) with respect to w, we get the gradient vector:

$$\nabla J(\mathbf{w}) = \sum_{\mathbf{x}\in\mathcal{M}} (-\mathbf{x}) \qquad (1.235)$$

where the gradient operator is

$$\nabla = \left[\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_N}\right]^T \qquad (1.236)$$
An analytical solution to the problem would consist in solving the equation ∇J(w) = 0 with respect to w, but this is difficult to implement. To minimize the cost function J(w), the iterative algorithm called gradient descent is used instead. In essence, this approach seeks the minimum point of J(w), which corresponds to the point where the gradient becomes zero.22

22 It should be noted that the optimization approach based on the gradient descent guarantees to

find the local minimum of a function. It can also be used to search for a global minimum, randomly
choosing a new starting point once a local minimum has been found, and repeating the operation
many times. In general, if the number of minimums of the function is limited and the number of
attempts is very high, there is a good chance of converging toward the global minimum.

Fig. 1.36 Perceptron learning model based on the gradient descent (the plot shows J(w) and the weight w_t at which ∇J(w_t) = 0)

This purpose is achieved through an iterative process starting from an initial


configuration of the weight vector w, after which it is modified appropriately in the
direction in which the gradient decreases more quickly (see Fig. 1.36). In fact, if at
the iteration t-th we have a value of the weight vector w(t) (at the first iteration the
choice in the domain of w can be random), and we calculate with the (1.235) for
the vector w the gradient ∇J(w(t)), then we update the weight vector by moving a
small distance in the direction in which J decreases more quickly, i.e., in the opposite
direction of the gradient vector (−∇J(w(t))), we have the following update rule of
weights:
w(t + 1) = w(t) − η∇J(w(t)) (1.237)

where 0 < η ≤ 1 is still the parameter that controls the degree of learning (learning
rate) that defines the entity of the modification of the weight vector. We recall the
criticality highlighted earlier in the choice of η for convergence.
The perceptron batch update rule based on the gradient descent, considering the
(1.235), has the following form:

$$\mathbf{w}(t+1) = \mathbf{w}(t) + \eta \sum_{\mathbf{x}\in\mathcal{M}} \mathbf{x} \qquad (1.238)$$

The name batch rule derives from the fact that the adaptation of the weights at the t-th iteration occurs with the sum, weighted by η, of all the sample patterns of the set M that are misclassified.
From the geometrical point of view, this perceptron rule represents the sum of the
algebraic distances between the hyperplane given by the weight vector w and the
sample patterns M of the training set for which one has a classification error. From
(1.238), we can derive the adaptation rule (also called on-line) of the perceptron based on a single misclassified sample xM, given by

w(t + 1) = w(t) + η · xM (1.239)



Fig. 1.37 Geometric interpretation of the perceptron learning model based on the gradient descent. a In the example, a pattern xM misclassified by the current weight w(t) is on the wrong side of the dividing line (that is, w(t)T xM < 0); with the addition of ηxM to the current vector, the weight vector moves the decision line in the appropriate direction to correctly classify the pattern xM. b Effect of the learning parameter η: large values, after adaptation to the new weight w(t + 1), can misclassify a previous pattern, denoted by xt, which was instead correctly classified. c Conversely, small values of η are likely to still leave xM misclassified

From Fig. 1.37a, we can observe the effect of weight adaptation considering the single
sample xM misclassified by the weight vector w(t) as it turns out w(t)T xM ≤ 0.23
Applying the rule (1.239) to the weight vector w(t), that is, adding to this η·xM , we
obtain the displacement of the decision hyperplane (remember that w is perpendicular
to it) in the correct direction with respect to the misclassified pattern. It is useful to
recall in this context the role of η. If it is too large (see Fig. 1.37b), a previous pattern
xt correctly classified with w(t), after the adaptation to the new weight w(t + 1), would now be
classified incorrectly.
If it is too small (see Fig. 1.37c), the pattern xM after the adaptation to the
weight w(t + 1) would still not be classified correctly. Applying the rule to the
single sample, once the training set samples are augmented and normalized, these
are processed individually in sequence and if at the iteration t for the j-th sample we
have w T x j < 0, i.e., the sample is misclassified we perform the adaptation of the
weight vector with the (1.239), otherwise we leave the weight vector unchanged and
go to the next (j +1)-th sample.
For a constant value of η predefined, if the classes are linearly separable, the
perceptron converges to a correct solution both with the batch rule (1.238) and with
the single sample (1.239). The iterative process of adaptation can be blocked to

23 In this context, the pattern vectors of the training set P, besides being augmented ($x_0 = 1$), are also normalized, that is, all the patterns belonging to the class $\omega_2$ are replaced by their negative vector:
$$\mathbf{x}_j \leftarrow -\mathbf{x}_j \quad \forall\, \mathbf{x}_j \in \omega_2$$
It follows that a sample is classified incorrectly if:
$$\mathbf{w}^T\mathbf{x}_j = \sum_{k=0}^{N} w_k x_{jk} < 0.$$

limit processing times or for infinite oscillations in the case of nonlinearly separable
classes.
The stopping can take place by setting a maximum number of iterations or by imposing a minimum threshold on the cost function J(w(t)), while being aware that the quality of the generalization is not guaranteed. In addition, with reference to the Robbins–Monro algorithm [22], convergence can be analyzed by imposing an adaptation also for the learning rate η(t), starting from an initial value and then decreasing over time according to η(t) = η0/t, where η0 is a constant and t the current iteration.
This type of classifier can also be tested on nonlinearly separable classes, although convergence toward an optimal solution is not ensured: the weight adaptation procedure oscillates in an attempt to minimize the error, despite the trick of also updating the learning parameter.

1.10.4 Linear Discriminant Functions and Learning

In this section, we will describe some classifiers based on discriminating functions


whose characteristic parameters are directly estimated by the training sets of the
observed patterns. In the Sects. 1.6.1 and 1.6.2, we have already introduced similar
linear and generalized discriminant functions. The discriminating functions will be
derived in a nonparametric way, i.e., no assumption will be made on the knowledge
of density both in the analytic and parametric form as done in the Bayesian context
(see Sect. 1.8). In essence, we will continue the approach, previously used for the
perceptron, based on minimizing a cost function but using alternative criteria to that
of the perceptron: Minimum squared error (MSE), Widrow–Hoff gradient descent,
Ho–Kashyap Method. With these approaches, particular attention is given to conver-
gence both in computational terms and to the different ways of minimizing the cost
function based on the gradient-descent.

1.10.4.1 Algorithm of Minimum Square Error—MSE


The perceptron rule finds the weight vector w that classifies the patterns x_i satisfying the inequality w^T x_i > 0, considering only the misclassified samples that do not satisfy this inequality to update the weights and converge toward the minimum error.
The MSE algorithm instead tries to find a solution to the following set of equations:

w T xi = bi (1.240)

where xi = (xi0 , xi1 , . . . , xid ), i = 1, . . . , N are the augmented vectors of the


training set (TS), w = (w0 , w1 , . . . , wd ) is the augmented weight vector to be found
as a solution to the system of linear equations (1.240), and $b_i$, $i = 1, \ldots, N$ are arbitrarily specified positive constants (also called margins).
In essence, we have converted the problem of finding the solution to a set of linear
inequalities with the more classical problem of finding the solution to a system of

Fig. 1.38 Geometric interpretation of the MSE algorithm: it computes the distance of the pattern vectors from the class separation hyperplane g(x) = 0, normalized with respect to the weight vector w (e.g., $\mathbf{w}^T\mathbf{x}_j/\|\mathbf{w}\|$ and $\mathbf{w}^T\mathbf{x}_k/\|\mathbf{w}\|$, with g(x) > 0 on one side and g(x) < 0 on the other)

linear equations. Moreover, with the MSE algorithm, all TS patterns are considered simultaneously, not just the misclassified ones. From the geometrical point of view, the MSE algorithm with $\mathbf{w}^T\mathbf{x}_i = b_i$ proposes to calculate for each sample $\mathbf{x}_i$ the distance $b_i$ from the hyperplane, normalized with respect to $|\mathbf{w}|$ (see Fig. 1.38). The compact matrix form of (1.240) is given by

$$\underbrace{\begin{pmatrix} x_{10} & x_{11} & \cdots & x_{1d} \\ x_{20} & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N0} & x_{N1} & \cdots & x_{Nd} \end{pmatrix}}_{N\times(d+1)} \underbrace{\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix}}_{(d+1)\times 1} = \underbrace{\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{pmatrix}}_{N\times 1} \iff \mathbf{X}\mathbf{w} = \mathbf{b} \qquad (1.241)$$

The goal is now to solve the system of linear equations (1.241). If the number of
equations N is equal to the number of unknowns, i.e., the number of augmented
features d + 1, we have the exact formal solution:

$$\mathbf{w} = \mathbf{X}^{-1}\mathbf{b} \qquad (1.242)$$

where X needs to be non-singular. Normally, X is a rectangular matrix with many more rows (samples) than columns.
It follows that when the number of equations exceeds the number of unknowns ($N \gg d+1$), the unknown vector $\mathbf{w}$ is overdetermined, and an exact solution cannot in general be found. We can, however, look for a weight vector $\mathbf{w}$ which minimizes some error function $\boldsymbol{\varepsilon}$ between the model $\mathbf{X}\mathbf{w}$ and the desired vector $\mathbf{b}$:

ε = Xw − b (1.243)

One approach is to try to minimize the norm of the error vector, which corresponds to minimizing the sum-of-squared-error function:

$$J_{MSE}(\mathbf{w}) = \|\mathbf{X}\mathbf{w}-\mathbf{b}\|^2 = \sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - b_i)^2 \qquad (1.244)$$

The minimization of the (1.244) can be solved by analytically calculating the gradient
and setting it to zero, differently from what is done with the perceptron. From the

gradient calculation, we get

$$\nabla J_{MSE}(\mathbf{w}) = \frac{dJ_{MSE}}{d\mathbf{w}} = \sum_{i=1}^{N}\frac{d}{d\mathbf{w}}(\mathbf{w}^T\mathbf{x}_i - b_i)^2 = \sum_{i=1}^{N} 2(\mathbf{w}^T\mathbf{x}_i - b_i)\,\frac{d}{d\mathbf{w}}(\mathbf{w}^T\mathbf{x}_i - b_i) = \sum_{i=1}^{N} 2(\mathbf{w}^T\mathbf{x}_i - b_i)\,\mathbf{x}_i = 2\mathbf{X}^T(\mathbf{X}\mathbf{w}-\mathbf{b}) \qquad (1.245)$$

from which setting equal to zero, we have

2XT (Xw − b) = 0 =⇒ XT Xw = XT b (1.246)

In this way, instead of solving the system Xw = b, we solve Eq. (1.246), with the advantage that $\mathbf{X}^T\mathbf{X}$ is a square $(d+1)\times(d+1)$ matrix which is often non-singular. Under these conditions, we can solve (1.246) uniquely with respect to w, obtaining the sought MSE solution:

$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{b} = \mathbf{X}^\dagger\mathbf{b} \qquad (1.247)$$

where
$$\mathbf{X}^\dagger = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$

is known as the pseudo-inverse matrix of X. We observe the following property:

$$\mathbf{X}^\dagger\mathbf{X} = \big((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big)\mathbf{X} = (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{X}) = \mathbf{I}$$

where I is the identity matrix and the matrix $\mathbf{X}^\dagger$ is a left inverse (in general it is not a right inverse, i.e., $\mathbf{X}\mathbf{X}^\dagger \ne \mathbf{I}$ in general). Furthermore, it is observed that if X is square and non-singular, the pseudo-inverse coincides with the ordinary inverse matrix, $\mathbf{X}^\dagger = \mathbf{X}^{-1}$.
Like all regression problems, the solution can be conditioned by the uncertainty
of the initial data, which then propagates to the error committed on the final result. If the training data are highly correlated, the $\mathbf{X}^T\mathbf{X}$ matrix could become almost singular and therefore not admit an inverse, preventing the use of (1.247).
This type of ill-conditioning can be approached with the linear regularization
method, also known as ridge regression. The ridge estimator is defined in this way
[23]:
$$\mathbf{w}_\lambda = \big(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_d\big)^{-1}\mathbf{X}^T\mathbf{b} \qquad (1.248)$$

where λ (0 < λ < 1) is a nonnegative constant called shrinkage parameter that


controls the contraction level of the identity matrix. The choice of λ is made on the
basis of the correlation level of the features, i.e., the existing multicollinearity, trying

to guarantee an appropriate balance between the variance and the bias of the estimator. For λ = 0, the ridge regression (1.248) coincides with the pseudo-inverse
solution.
Normally, the proper choice of λ is found through a cross-validation approach. A
graphical exploration that represents the components of w in relation to the values
of λ is useful when analyzing the curves (traces of the ridge regressions) that tend to
stabilize for acceptable values of λ.
The MSE solution also depends on the initial value of the margin vector b which
conditions the expected result w∗. The arbitrary choice of positive values of b gives an MSE solution whose discriminant function can handle both linearly separable classes (although separation is not guaranteed) and non-separable classes. For b = 1, the MSE solution becomes identical to the Fisher linear discriminant solution. If the number of samples tends to infinity, the MSE solution approximates the Bayes discriminant function $g(\mathbf{x}) = p(\omega_1|\mathbf{x}) - p(\omega_2|\mathbf{x})$.

1.10.4.2 Widrow–Hoff Learning


The sum function of the squared error JMSE (w), Eq. (1.244), can be minimized using
the gradient descent procedure. On the one hand, there is the advantage of avoiding
the singularity conditions of the matrix XT X and on the other, we avoid working
with large matrices.
Assuming an arbitrary initial value of w(1) and considering the gradient equation
(1.245), the weight vector update rule results

w(t + 1) = w(t) − η(t)XT (Xw(t) − b) (1.249)

It can be shown that if η(t) = η(1)/t, with an arbitrary positive value of η(1), this rule generates a weight vector w(t) which converges to the MSE solution, i.e., a weight w such that $\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{b}) = 0$. Although the memory required for this update rule is already reduced, considering the dimensions (d + 1) × (d + 1) of the $\mathbf{X}^T\mathbf{X}$ matrix with respect to the (d + 1) × N matrix $\mathbf{X}^\dagger$, the Widrow–Hoff procedure (or Least Mean Squares rule, LMS) achieves a further memory reduction by considering single samples sequentially:

w(t + 1) = w(t) + η(t)[bk − w(t)T xk ]xk (1.250)

The pseudo-code of the Widrow–Hoff procedure is reported in Algorithm 4. With


the perceptron rule, convergence is guaranteed only if the classes are linearly separable. The MSE method always converges but may not find the separation hyperplane even if the classes are linearly separable (see Fig. 1.39), because it only minimizes the squares of the distances of the samples from the hyperplane associated with the margin b.

Algorithm 4 Widrow–Hoff algorithm
1: Initialization: w, b, threshold θ = 0.02, t ← 0, η(t) = 0.1
2: do t ← (t + 1) mod N
3:    w ← w + η(t)[b_t − w^T x_t] x_t
4: until |η(t)(b_t − w^T x_t) x_t| < θ
5: return the final weight vector w
6: end

Fig. 1.39 The MSE algorithm minimizes the sum of the squares of the sample pattern distances with respect to the class separation hyperplane and may not find this hyperplane even if it exists, as shown in the example, where it is instead always found by the perceptron for linearly separable classes (the figure compares the LMS and perceptron separation lines)

1.10.4.3 Ho–Kashyap Algorithm


The main limitation of the MSE method is that it does not guarantee finding the hyperplane of separation in the case of linearly separable classes. In fact, with the MSE method, we have imposed the minimization of $\|\mathbf{X}\mathbf{w} - \mathbf{b}\|^2$ choosing an arbitrary and constant margin vector $\mathbf{b}$. Whether or not MSE converges to the hyperplane of separation depends precisely on how the margin vector is chosen.
In the hypothesis that the two classes are linearly separable, there must exist two
vectors w∗ and b∗ such that Xw∗ = b∗ > 0, where we can assume that the samples
are normalized (i.e., x ← (−x) ∀x ∈ ω2 ). If we arbitrarily choose b, in the MSE
method, we would have no guarantee of finding the optimal solution w∗ .
Therefore, if b were known, the MSE method would be used with the pseudo-
inverse matrix for the calculation of the weight vector w = X† b. Since b∗ is not
known, the strategy is to find both w and b. This is feasible using an alternative learn-
ing algorithm for linear discriminant functions, known as the Ho–Kashyap method
which is based on the JMSE functional to be minimized with respect to w and b,
given by
$$J_{MSE}(\mathbf{w}, \mathbf{b}) = \|\mathbf{X}\mathbf{w} - \mathbf{b}\|^2 \qquad (1.251)$$

In essence, this algorithm is implemented in three steps:

1. Find the optimal value of b through the gradient descent.


2. Calculate the weight vector w with the MSE solution.
3. Repeat the previous steps until convergence.

To perform the first step, the gradient $\nabla_\mathbf{b} J_{MSE}$ of the MSE functional (1.244) with respect to the margin vector b is calculated, given by

∇b JMSE (w, b) = −2(Xw − b) (1.252)

which suggests a possible update rule for b. Since b is subject to the constraint
b > 0, we start from this condition and, following the gradient descent, we prevent any component of the vector b from being reduced to negative values. In other words, the gradient descent is not free to move in any direction but is always forced to move so that b remains positive. This is achieved through the following adaptation rule of the margin vector (t is the iteration index):

b(t + 1) = b(t) − η(t)∇b JMSE (w, b) = b(t) + 2η(t)(Xw − b) (1.253)

and setting to zero all the positive components of ∇b JMSE or equivalently, keeping
the positive components in the second term of the last expression. Choosing the first
option, the adaptation rule (1.253) for b is given by
$$\mathbf{b}(t+1) = \mathbf{b}(t) - \eta(t)\,\frac{1}{2}\Big[\nabla_\mathbf{b}J_{MSE}(\mathbf{w}(t),\mathbf{b}(t)) - \big|\nabla_\mathbf{b}J_{MSE}(\mathbf{w}(t),\mathbf{b}(t))\big|\Big] \qquad (1.254)$$
where | • | indicates a vector to which we apply the absolute value to all of its
components. Remember that η indicates the learning parameter.
Summing up, the equations used for the Ho–Kashyap algorithm are (1.252), for the calculation of the gradient $\nabla_\mathbf{b} J_{MSE}(\mathbf{w}, \mathbf{b})$; (1.254), which is the adaptation rule to find the margin vector b for a fixed weight vector w; and (1.247), which minimizes the gradient $\nabla_\mathbf{w} J_{MSE}(\mathbf{w}, \mathbf{b})$ with respect to the weight vector w and which we rewrite as

w(t) = X† b(t) (1.255)

where $\mathbf{X}^\dagger = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the pseudo-inverse matrix of X. Remember that with the MSE solution, the gradient with respect to the weight vector is zeroed, that is

$$\nabla_\mathbf{w} J_{MSE} = 2\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{b}) = 2\mathbf{X}^T\big(\mathbf{X}\mathbf{X}^\dagger\mathbf{b} - \mathbf{b}\big) = 0$$

At this point, we can get the Ho–Kashyap algorithm with the following adaptation
equations for the iterative calculation of both margin and weight vectors:
$$\begin{cases} \mathbf{b}(t+1) = \mathbf{b}(t) + 2\eta\,\boldsymbol{\varepsilon}^+(t) \\ \mathbf{w}(t) = \mathbf{X}^\dagger\mathbf{b}(t) \end{cases} \qquad t = 1, 2, \ldots \qquad \text{(Ho–Kashyap adaptation equations)} \qquad (1.256)$$

where ε+ indicates the positive part of the error vector:


$$\boldsymbol{\varepsilon}^+(t) = \frac{1}{2}\big[\boldsymbol{\varepsilon}(t) + |\boldsymbol{\varepsilon}(t)|\big] \qquad (1.257)$$

and the error vector, recalling (1.252), is

$$\boldsymbol{\varepsilon}(t) = \mathbf{X}\mathbf{w}(t) - \mathbf{b}(t) = -\frac{1}{2}\,\nabla_\mathbf{b}J_{MSE}\big(\mathbf{w}(t), \mathbf{b}(t)\big) \qquad (1.258)$$
The complete algorithm of Ho–Kashyap is reported in Algorithm 5.

Algorithm 5 Ho–Kashyap algorithm
1: Initialization: w, b > 0, 0 < η(.) < 1, t = 0, thresholds b_min, t_max
2: do t ← (t + 1) mod N
3:    ε ← Xw − b
4:    ε+ ← 1/2(ε + abs(ε))
5:    b ← b + 2η(t)ε+
6:    w ← X† b
7:    if abs(ε) ≤ b_min then w and b are the solution found, exit
8: until t = t_max
9: no solution reached
10: end

If the two classes are linearly separable, the Ho–Kashyap algorithm always pro-
duces a solution reaching the condition ε(t) = 0 and freezes (otherwise it continues
the iteration if some components of the error vector are positive). In the case of non-
separable classes, it occurs that ε(t) will have only negative components proving the
condition of non-separable classes. It is not possible to know after how many itera-
tions this condition of non-separability is encountered. The pseudo-inverse matrix is
calculated only once, depending only on the samples of the training set. Considering the high number of iterations that may be required, to limit the computational load the algorithm can be terminated by defining a maximum number of iterations or by setting a minimum threshold for the error vector.

1.10.4.4 MSE Algorithm Extension for Multiple Classes


A multiclass classifier based on linear discriminant functions is known as a linear machine. Given K classes to be separated, K linear discriminant functions (LDF) are required:

$$g_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0} \qquad k = 1, \ldots, K \qquad (1.259)$$

A pattern x is assigned to the class $\omega_k$ if

$$g_k(\mathbf{x}) \ge g_j(\mathbf{x}) \qquad \forall\, j \ne k \qquad (1.260)$$

With (1.260), the feature space is partitioned into K regions (see Fig. 1.40). The k-th discriminant function with the largest value $g_k(\mathbf{x})$ assigns the pattern x under consideration to the region $\Omega_k$. In the case of equality, the pattern can be considered unclassified (it lies on the separation hyperplane).

Fig. 1.40 Classifier based on the MSE algorithm in the multiclass context. a and b show the ambiguous regions obtained when binary classifiers are used to separate the 3 classes. c and d instead show the correct classification of a multiclass MSE classifier that uses a number of discriminant functions $g_k(\mathbf{x})$ up to the maximum number of classes

A multiclass classifier based on linear discriminant functions can be realized in


different ways considering also the binary classifiers described above (perceptron,
...). One approach would be to use K − 1 discriminant functions, each of which
separates a ωk class from all remaining classes (see Fig. 1.40a). This approach has
ambiguous (undefined) areas in the feature space.
A second approach uses K (K −1)/2 discriminant functions g jk (x) each of which
separates two classes ωk , ω j with respect to the others. Also in this case, we would
have ambiguous (undefined) areas in the feature space (see Fig. 1.40b).
These problems are avoidable by defining K LDF functions given by the (1.259),
i.e., with the linear machine approach (see Fig. 1.40c). If $\Omega_k$ and $\Omega_j$ are contiguous regions, their decision boundary is represented by a portion of the hyperplane $H_{jk}$
defined as follows:

g j (x) = gk (x) −→ (w j − wk )T x + (w j0 − wk0 ) = 0 (1.261)



From (1.261), it follows that the difference of the weight vectors $\mathbf{w}_j - \mathbf{w}_k$ is normal to the hyperplane $H_{jk}$ and that the distance of a pattern x from $H_{jk}$ is given by

$$\frac{g_j(\mathbf{x}) - g_k(\mathbf{x})}{\|\mathbf{w}_j - \mathbf{w}_k\|}$$

It follows that with the linear machine, the difference of the vectors is important
and not the vectors themselves. Furthermore not all K (K − 1)/2 region pairs must
be contiguous and a lower number of separation hyperplanes may be required (see
Fig. 1.40d). A multiclass classifier can be implemented as a direct extension of the
MSE approach used for two classes based on the pseudo-inverse matrix.
In this case, the $N \times (d+1)$ matrix of the training set $\mathbf{X} = \{\mathbf{X}_1, \ldots, \mathbf{X}_K\}$ can be organized by partitioning the rows so that it contains the patterns ordered by the $K$ classes, that is, all the samples associated with a class $\omega_k$ are contained in the submatrix $\mathbf{X}_k$. Likewise, the weight matrix $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K]$ of size $(d+1)\times K$ is constructed. Finally, the margin matrix $\mathbf{B} = [\mathbf{B}_1, \mathbf{B}_2, \ldots, \mathbf{B}_K]$ of size $N \times K$ is partitioned into submatrices $\mathbf{B}_j$ (like $\mathbf{X}$) whose elements are zero except those in the $j$-th column, which are set to 1. In essence, the problem is set as $K$ MSE solutions in the generalized form:

$$\mathbf{X}\mathbf{W} = \mathbf{B} \qquad (1.262)$$

The objective function is given by

    J(W) = Σ_{i=1}^{K} ‖X w_i − b_i‖²    (1.263)

where J(W) is minimized using the pseudo-inverse matrix:

    W = X† B = (X^T X)^{−1} X^T B    (1.264)
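The training step (1.262)–(1.264) can be sketched as follows in Python/NumPy; the use of a least-squares solver instead of explicitly forming (X^T X)^{−1} X^T is an implementation choice for numerical stability, and the function names are illustrative, not taken from the text:

import numpy as np

def mse_multiclass_train(X, labels, K):
    # X: (N, d) training patterns; labels: (N,) class indices in 0..K-1
    N, d = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])      # augment with the fictitious input x0 = 1
    B = np.zeros((N, K))
    B[np.arange(N), labels] = 1.0             # margin matrix: 1 in the column of the true class
    W, *_ = np.linalg.lstsq(Xa, B, rcond=None)  # solves X W = B in the least-squares sense
    return W                                   # shape (d+1, K)

def mse_multiclass_predict(X, W):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xa @ W, axis=1)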

1.10.4.5 Summary
A binary classifier based on the perceptron always finds the separation hyperplane
of the two classes only if these are linearly separable; otherwise, it oscillates without
ever converging. Convergence can be controlled by adequately updating the learning
parameter, but there is no guarantee on the convergence point. A binary classifier
that uses the MSE method converges both for linearly separable and non-separable
classes, but in some cases, it may not find the separation hyperplane even for
linearly separable classes. The solution with the pseudo-inverse matrix is used if the
sample matrix X^T X is non-singular and not too large. Alternatively, the Widrow–
Hoff algorithm can be used. In later sections, we will describe how to develop
a multiclass classifier based on multilayer perceptrons able to classify nonlinearly
separable patterns.

1.11 Neural Networks

In Sects. 1.10.1 and 1.10.2 we described, respectively, the biological and mathemat-
ical model of a neuron, explored in the 1940s by McCulloch and Pitts, with the aim
of verifying the computational capabilities of a network made up of simple neurons. A first
application of a neural network was the perceptron, described previously for binary
classification and applied to solve logical functions.
An artificial neural network (ANN) consists of simple neurons connected to each
other in such a way that the output of each neuron serves as an input to many neurons
in a similar way as the axon terminals of a biological neuron are connected via
synaptic connections with dendrites of other neurons. The number of neurons and
the way in which they are connected (topology) determine the architecture of a
neural network. After the perceptron, in 1959, Bernard Widrow and Marcian Hoff
of Stanford University developed the first neural network models (based on the Least
Mean Squares—LMS algorithm) to solve a real problem. These models are known
as ADALINE (ADAptive LInear NEuron) and MADALINE (a multilayer network of
ADALINE units), realized, respectively, to eliminate echo in telephone lines and for
pattern recognition.
Research on neural networks went through a period of darkness in the 1970s
after the Perceptrons book of 1969 (by M. Minsky and S. Papert), which questioned
the ability of neural models, limited to solving only linearly separable functions.
This led to limited funding in the field, which was revitalized only in the
early 1980s when Hopfield [24] demonstrated, through a mathematical analysis,
what could and could not be achieved through neural networks (he introduced the
concepts of bidirectional connections between neurons and associative memory).
Subsequently, research on neural networks took off intensely with the contribution
of various researchers after whom the proposed neural network models were named:
Grossberg–Carpenter for the ART—Adaptive Resonance Theory network; Kohonen
for the SOM—Self-Organizing Map; Y. LeCun, D. Parker, and Rumelhart–Hinton–
Williams, who independently proposed the learning algorithm known as Backprop-
agation for an ANN network; Barto, Sutton, and Anderson for incremental learning
based on Reinforcement Learning; ...
While it is understandable how to organize the topology of a neural network,
years of research have been necessary to model the computational aspects of state
change and the aspects of adaptation (configuration change). In essence, neural net-
works have been developed by gradually defining the modalities of interconnection
between neurons, their dynamics (how their state changes), and how to model the
process of adaptation of the synaptic weights. All this takes place in the context of a
neural network made up of many interconnected neurons.

1.11.1 Multilayer Perceptron—MLP

A linear machine that implements linear discriminant functions with the minimum
error approach, in general, does not adequately meet the requirements imposed

Fig. 1.41 Notations and symbols used to represent an MLP neural network with three layers and its
extension for the calculation of the objective function

by the various applications. In theory, finding suitable nonlinear
functions could be a solution, but we know how complex the appropriate choice of
such functions is. A multilayer neural network that has the ability to learn from a
training set, regardless of the linearity or nonlinearity of the data, can be the solution
to these problems.
A neural network created with MultiLayer Perceptrons—MLP is a feedforward
network of simple processing units with at least one hidden layer of neurons.
The processing units are similar to the perceptron, except that the threshold function
is replaced by a nonlinear differentiable one, which guarantees the calculation of the gradient.
The feedforward (acyclic) network defines the type of architecture, that is, the
way in which neurons are organized in layers and how neurons are connected from
one layer to another. In particular, all the neurons of a layer are connected with all
the neurons of the next layer, that is, with only forward connections (feedforward).
Figure 1.41 shows a 3-layer feedforward MLP network: the d-dimensional input layer
(input neurons), the intermediate layer of hidden neurons, and the layer of output neu-
rons. The input and output neurons represent, respectively, the receptors and the
effectors, and their connections create the channels through which the respective
signals and information are propagated. In the mathematical model
of the neural network, these channels are known as paths. The propagation of signals
and the processing of information through these paths of a neural network is achieved by
modifying the state of neurons along these paths. The states of all neurons realize the
overall state of the neural network, and the synaptic weights associated with all con-
nections give the neural network configuration. Each path in a feedforward MLP

network leads from the input layer to the output layer through the individual neurons
contained in each layer.
The ability of a neural network (NN) to process information depends on its
interconnectivity, on the states of the neurons that change, and on the synaptic weights that
are updated through an adaptation process representing the learning activity of the
network, starting from the samples of the training set. This last aspect, i.e., the network
update mode, is controlled by the equations or rules that determine the dynamics and
functionality of the NN over time. The computational dynamics specifies the initial
state of an NN and the update rule over time, once the configuration and topology
of the network itself have been defined. A feedforward NN is characterized by a
time-independent data flow (static system) where the output of each neuron depends
only on the current input, in the manner specified by the activation function.
The adaptation dynamic specifies the initial configuration of the network and
the method of updating weights over time. Normally, the initial state of synaptic
weights is assigned with random values. The goal of the adaptation is to achieve
a network configuration such that the synaptic weights realize the desired function
from the input data (training pattern) provided. This type of adaptation is called
supervised learning. In other words, it is the expert who provides the network with
the input samples and the desired output values, and during learning it is evaluated how
much the network response agrees with the desired target value known a priori. A
supervised feedforward NN is normally used as a function approximator. This is
done with different learning models, for example, the backpropagation algorithm, which we
will describe later.

1.11.2 Multilayer Neural Network for Classification

Let us now see in detail how a supervised MLP network can be used for the
classification of d-dimensional patterns into K classes. With reference to Fig. 1.41, we
describe the various components of an MLP network following the flow of data from
the input layer to the output layer.

(a) Input layer. With supervised learning, each sample pattern x = (x1 , . . . , xd ) is
presented to the network input layer.
(b) Intermediate layer. The j-th neuron of the middle (hidden) layer calculates the
activation value net_j obtained from the inner product between the input vector
x and the vector of synaptic weights coming from the first layer of the network:

    net_j = Σ_{i=1}^{d} w_ji x_i + w_{j0}    (1.265)

where the pattern and weight vectors are augmented to include the fictitious
input component x0 = 1, respectively.

(c) Activation function for hidden neurons. The j-th neuron of the intermediate layer
emits an output signal y_j through the nonlinear activation function σ, given by

    y_j = σ(net_j) = { 1 if net_j ≥ 0;  −1 if net_j < 0 }    (1.266)

(d) Output layer. Each output neuron k calculates the activation value net_k obtained
with the inner product between the vector y (the output of the hidden neurons)
and the vector w_k of the synaptic weights from the intermediate layer:

    net_k = Σ_{j=1}^{N_h} w_kj y_j + w_{k0}    (1.267)

where Nh is the number of neurons in the intermediate layer. In this case, the
weight vector wk is augmented by considering a neuron bias which produces a
constant output y0 = 1.
(e) Activation function for output neurons. The k-th neuron of the output layer emits
an output signal z_k through the nonlinear activation function σ, given by

    z_k = σ(net_k) = { 1 if net_k ≥ 0;  −1 if net_k < 0 }    (1.268)

The output z_k of each output neuron can be considered as a direct function of an input
pattern x through the feedforward operations of the network. Furthermore, we can
consider the entire feedforward process as associated with a discriminant function
g_k(x) capable of separating a class (of the K classes) represented by the k-th output
neuron. This discriminant function is obtained by combining the last four equations
as follows:

    g_k(x) = z_k = σ( Σ_{j=1}^{N_h} w_kj σ( Σ_{i=1}^{d} w_ji x_i + w_{j0} ) + w_{k0} )    (1.269)

where the argument of the outer activation function σ represents the activation net_k of
the k-th output neuron, while the internal expression σ(·) represents the activation of the
j-th hidden neuron net_j given by the (1.265).
The activation function σ(net) must be continuous and differentiable. It can also be
different in different layers, or even different for each neuron. The (1.269) represents
a category of discriminant functions that can be implemented by a three-layer MLP
network starting from the samples of the training set {x_1, x_2, . . . , x_N} belonging to
K classes. The goal now is to find the network learning paradigm to obtain the synaptic
weights w_kj and w_ji that describe the functions g_k(x) for all K classes.
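A minimal Python/NumPy sketch of the feedforward pass (1.265)–(1.269) follows; it uses the differentiable sigmoid (1.225) in place of the threshold of (1.266) and (1.268), as required later for learning, and stores the bias weights in the first column of each (hypothetical) weight matrix:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_hid, W_out):
    # W_hid: (Nh, d+1) rows [w_j0, w_j1, ..., w_jd]; W_out: (K, Nh+1) rows [w_k0, w_k1, ..., w_kNh]
    x_aug = np.concatenate(([1.0], x))     # fictitious input x0 = 1 (bias)
    net_j = W_hid @ x_aug                  # Eq. (1.265)
    y = sigmoid(net_j)                     # hidden outputs y_j
    y_aug = np.concatenate(([1.0], y))     # bias neuron y0 = 1
    net_k = W_out @ y_aug                  # Eq. (1.267)
    z = sigmoid(net_k)                     # outputs z_k, i.e., g_k(x) of Eq. (1.269)
    return y, z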

1.11.3 Backpropagation Algorithm

It is shown that an MLP network, with three layers, an adequate number of nodes
per layer, and appropriate nonlinear activation functions, is sufficient to generate
discriminating functions capable of separating classes also nonlinearly separable in
a supervised context. The backpr opagation algorithm is one of the simplest and
most general methods for supervised learning in an MLP network. The theory on
the one hand demonstrates that it is possible to implement any continuous function
from the training set through an MLP network but from a practical point of view, it
does not give explicit indications on the network configuration in terms of number
of layers and necessary neurons. The network has two operating modes:

Feedforward or testing, which consists of presenting a pattern to the input layer;
the information processed by each neuron propagates forward through the
network, producing a result in the output neurons.
Supervised learning, which consists of presenting an input pattern and modifying
(adapting) the synaptic weights of the network to produce a result very close to
the desired one (the target value).

1.11.3.1 Supervised Learning


Learning with backpropagation initially involves executing, for the network to be
trained, the feedforward phase, calculating and storing the outputs of all the
neurons (of all the layers). The values of the output neurons z_k are compared to the
desired values t_k, and an error function (objective function) is used to evaluate this
difference for each output neuron and for each training set sample. The overall error
of the feedforward phase is a scalar value that depends on the
current values of all the weights W of the network, which during the learning phase
must be adequately updated in order to minimize the error function. This is achieved
by evaluating the SSE (Sum of Squared Errors) of all the output
neurons for each training set sample, and minimizing the following objective
function J(W) with respect to the weights of the network:


    J(W) = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{K} (t_nk − z_nk)² = (1/2) Σ_{n=1}^{N} ‖t_n − z_n‖²    (1.270)

where N indicates the number of samples in the training set, K is the number of
neurons in the output layer (coinciding with the number of classes), and the factor
of 1/2 is included to cancel the contribution of the exponent upon differentiation,
as we will see later.
The backpropagation learning rule is based on the gradient descent. Once the
weights are initialized with random values, their adaptation at the t-th iteration occurs
in the direction that will reduce the error:

    w(t + 1) = w(t) + Δw = w(t) − η ∂J(W)/∂w    (1.271)

where η is the learning parameter that establishes the extent of the weight change. Rule
(1.271) drives the minimization of the objective function (1.270), which is never
negative. The learning rule guarantees that the adaptation process converges once
all input samples of the training set have been presented. Now let us look at the essential steps
of supervised learning based on backpropagation. The data of the problem are
the samples of the training set, the outputs of the MLP network, and the desired
target values t_k. The unknowns are the weights of all the layers, to be updated
with the (1.271), for which we must determine the adaptation Δw with the gradient
descent:

    Δw = −η ∂J(W)/∂w    (1.272)

for each weight of the net (weights are updated in the opposite direction to the gradient).
For simplicity, we will consider the objective function (1.270) for a single sample
(N = 1):

    J(W) = (1/2) Σ_{k=1}^{K} (t_k − z_k)² = (1/2) ‖t − z‖²    (1.273)

The essential steps of the backpropagation algorithm are:

1. Feedforward calculation. The sample (x_1, . . . , x_d) is presented to the input layer
of the MLP network. For each j-th neuron of the hidden layer, the activation value
net_j is calculated with the (1.265), and the output y_j of this neuron is calculated
and stored with the activation function σ, Eq. (1.266). Similarly, the
activation value net_k is calculated and stored with the (1.267), and the output z_k of
the output neuron with the activation function σ, Eq. (1.268). For each
neuron, the values of the derivatives of the activation functions σ′ are stored (see
Fig. 1.42).
2. Backpropagation in the output layer. At this point, we begin to compute the first
set of the partial derivatives ∂J/∂w_kj of the error with respect to the weights w_kj
between hidden and output neurons. This is done with the rule of derivation of
composite functions (chain rule), since the error does not depend directly on
w_kj. By applying this rule, we obtain

    ∂J/∂w_kj = (∂J/∂z_k)(∂z_k/∂net_k)(∂net_k/∂w_kj)    (1.274)

Let us now calculate each partial derivative of the three terms of the (1.274)
separately.


Fig. 1.42 Reverse path with respect to the feedforward one shown in Fig. 1.41, during the learning
phase, for the backward propagation of the backpropagation error δ_k^o of the k-th output neuron and
the backpropagation error δ_j^h associated with the j-th hidden neuron

The first term, considering the (1.273), results in:

    ∂J/∂z_k = ∂/∂z_k [ (1/2) Σ_{m=1}^{K} (z_m − t_m)² ] = (z_k − t_k)    (1.275)

The second term, considering the activation value of the k-th output neuron given
by the (1.267) and the corresponding output signal z_k given by its nonlinear
activation function σ, Eq. (1.268), results in:

    ∂z_k/∂net_k = ∂σ(net_k)/∂net_k = σ′(net_k)    (1.276)
The activation function is generally nonlinear, and commonly the sigmoid func-
tion24 given by the (1.225) is chosen, which, substituted in the (1.276), gives

    ∂z_k/∂net_k = ∂/∂net_k [ 1/(1 + exp(−net_k)) ]
                = exp(−net_k)/(1 + exp(−net_k))² = (1 − z_k) z_k    (1.277)

24 The sigmoid or sigmoid curve function (in the shape of an S) is often used as a transfer function
in neural networks considering its nonlinearity and easy differentiability. In fact, the derivative is
given by

    dσ(x)/dx = d/dx [ 1/(1 + exp(−x)) ] = σ(x)(1 − σ(x))

and is easily implementable.

The third term, considering the activation value net_k given by the (1.267), results in:

    ∂net_k/∂w_kj = ∂/∂w_kj [ Σ_{n=1}^{N_h} w_kn y_n ] = y_j    (1.278)

From the (1.278), we observe that only one element in the sum netk (that is of the
inner product between the output vector y of the hidden neurons and the weight
vector wk of the output neuron) depends on wk j .
Combining the results obtained for the three terms, (1.275), (1.277), and (1.278),
we get
    ∂J/∂w_kj = (z_k − t_k)(1 − z_k) z_k y_j = δ_k y_j    (1.279)

where we assume y_j = 1 for the bias weight, that is, for j = 0. The expression
indicated with δ_k defines the backpropagation error. It is highlighted that in
the (1.279), the weight w_kj is the variable entity while its input y_j is a constant.
3. Backpropagation in the hidden layer. The partial derivatives of the objective
function J (W) with respect to the weights w ji between input and hidden neurons
must now be calculated. Applying the chain rule again gives

    ∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji)    (1.280)

In this case, the first term ∂ J/∂ y j of the (1.280) cannot be determined directly
because we do not have a desired value t j to compare with the output y j of a hid-
den neuron. The error signal must instead be recursively inherited from the error
signal of the neurons to which this hidden neuron is connected. For the MLP in
question, the derivative of the error function must consider the backpropagation
of the error of all the output neurons. In the case of an MLP with more hidden layers,
reference would be made to the neurons of the next layer. Thus, the derivative of the error
on the output y_j of the j-th hidden neuron is obtained by considering the errors
propagated backward by the output neurons:

    ∂J/∂y_j = Σ_{n=1}^{K} (∂J/∂z_n)(∂z_n/∂net_n^o)(∂net_n^o/∂y_j)    (1.281)

The first two terms in the summation of the (1.281) have already been calculated,
respectively, with the (1.275) and (1.277) in the previous step, and their product
corresponds to the backpropagation error associated with the n-th output neuron:

    (∂J/∂z_n)(∂z_n/∂net_n^o) = (z_n − t_n)(1 − z_n) z_n = δ_n^o    (1.282)

where here the propagation error is explicitly denoted with the superscript “o”
to indicate the association with the output neuron. The third term in the summation
of the (1.281) is given by

    ∂net_n^o/∂y_j = ∂/∂y_j [ Σ_{s=1}^{N_h} w_ns y_s ] = w_nj^o    (1.283)

From the (1.283), it is observed that only one element in the sum net_n (that is,
of the inner product between the output vector y of the hidden neurons and the
weight vector w_n^o of the output neuron) depends on y_j. Combining the results of the
derivatives (1.282) and (1.283), the derivative of the error on the output y_j of
the j-th hidden neuron, given by the (1.281), becomes

    ∂J/∂y_j = Σ_{n=1}^{K} (∂J/∂z_n)(∂z_n/∂net_n^o)(∂net_n^o/∂y_j)
            = Σ_{n=1}^{K} (z_n − t_n)(1 − z_n) z_n w_nj = Σ_{n=1}^{K} δ_n^o w_nj    (1.284)

From the (1.284), we highlight how the error propagates backward to the j-th
hidden neuron, accumulating the error signals coming backward from all the K
output neurons to which it is connected (see Fig. 1.42).
Moreover, this backpropagation error is weighted by the connection strength of the
hidden neuron with all the output neurons. Returning to the (1.280), the second term
∂y_j/∂net_j and the third term ∂net_j/∂w_ji are calculated in a similar way to those
of the output layer given by Eqs. (1.277) and (1.278), which in this case are

    ∂y_j/∂net_j = (1 − y_j) y_j    (1.285)

    ∂net_j/∂w_ji = x_i    (1.286)

The final result of the partial derivatives ∂J/∂w_ji of the objective function, with
respect to the weights of the hidden neurons, is obtained by combining the results
of the single derivatives (1.284) and of the last two equations, as follows:

    ∂J/∂w_ji = [ Σ_{n=1}^{K} δ_n^o w_nj ] (1 − y_j) y_j x_i = δ_j^h x_i    (1.287)

where δ_j^h indicates the backpropagated error related to the j-th hidden neuron.
Recall that for the bias weight, the associated input value is x_i = 1.
4. Weights update. Once all the partial derivatives are calculated, all the weights of
the MLP network are updated in the direction of the negative gradient with the
(1.271), considering the (1.279) and the (1.287). For the hidden → output weights
w_kj, we have

    w_kj(t + 1) = w_kj(t) − η ∂J/∂w_kj = w_kj(t) − η δ_k^o y_j    k = 1, . . . , K;  j = 0, 1, . . . , N_h    (1.288)

remembering that for j = 0 (the bias weight), we assume y_j = 1. For the
input → hidden weights w_ji, we have

    w_ji(t + 1) = w_ji(t) − η ∂J/∂w_ji = w_ji(t) − η δ_j^h x_i    i = 0, 1, . . . , d;  j = 1, . . . , N_h    (1.289)

Let us now analyze the gradient Eqs. (1.279) and (1.287) and see how they affect
the network learning process. The gradient descent procedure is conditioned by the
initial values of the weights, which are normally set to random values. The update
amount for the weights of the k-th output neuron is proportional
to (z_k − t_k). It follows that no update occurs when the output of the neuron and the
desired value coincide.
The sigmoid activation function is always positive and controls the output of the
neurons. According to the (1.279), y_j and (z_k − t_k) concur, based on their sign,
to adequately modify (decrease or increase) the weight value. It can also happen that
a pattern presented to the network produces no signal (y_j = 0), and this implies no
update of the corresponding weights.
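The four steps above can be condensed into a single online update for one sample, sketched below in Python/NumPy; it reuses the hypothetical mlp_forward sketch given earlier and assumes sigmoid activations, so that the derivatives reduce to z(1 − z) and y(1 − y):

import numpy as np

def backprop_update(x, t, W_hid, W_out, eta=0.1):
    # One step of the rule (1.288)-(1.289) for a single sample (x, t), sigmoid activations.
    x_aug = np.concatenate(([1.0], x))
    y, z = mlp_forward(x, W_hid, W_out)                      # step 1: feedforward
    y_aug = np.concatenate(([1.0], y))
    delta_o = (z - t) * (1.0 - z) * z                        # step 2, Eq. (1.279): output errors
    delta_h = (W_out[:, 1:].T @ delta_o) * (1.0 - y) * y     # step 3, Eq. (1.287): hidden errors
    W_out -= eta * np.outer(delta_o, y_aug)                  # step 4, Eq. (1.288)
    W_hid -= eta * np.outer(delta_h, x_aug)                  # step 4, Eq. (1.289)
    return W_hid, W_out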

1.11.4 Learning Mode with Backpropagation

The learning methods concern how to present the samples of the training set and how
to update the weights. The three most common methods are:

1. Online. Each sample is presented only once and the weights are updated after
the presentation of the sample (see Algorithm 6).
2. Stochastic. The samples are randomly chosen from the training set and the
weights are updated after the presentation of each sample (see Algorithm 7).
3. Batch Backpropagation. Also called off-line, the weights are updated after the
presentation of all the samples of the training set. The variations of the weights
for each sample are stored and the update of the weights takes place only when
all the samples have been presented only once. In fact, the objective function of
batch learning is the (1.270) and its derivative is the sum of the derivatives for

Algorithm 6 Online Backpropagation algorithm

1: Initialize: w, N_h, η, convergence criterion θ, n ← 0
2: do n ← n + 1
3:   x_n ← choose a pattern sequentially
4:   w_ji ← w_ji − η δ_j^h x_i ;  w_kj ← w_kj − η δ_k^o y_j
5: until ‖∇J(w)‖ < θ
6: return w
7: end

Algorithm 7 Stochastic Backpropagation algorithm

1: Initialize: w, N_h, η, convergence criterion θ, n ← 0
2: do n ← n + 1
3:   x_n ← choose a pattern randomly
4:   w_ji ← w_ji − η δ_j^h x_i ;  w_kj ← w_kj − η δ_k^o y_j
5: until ‖∇J(w)‖ < θ
6: return w
7: end

each sample:

    ∂J(W)/∂w = (1/2) Σ_{n=1}^{N} ∂/∂w [ Σ_{k=1}^{K} (t_nk − z_nk)² ]    (1.290)

where the partial derivatives of the expression [•] have been calculated previ-
ously and are those related to the objective function of the single sample (see
Algorithm 8).

Algorithm 8 Batch Backpropagation algorithm

1: Initialize: w, N_h, η, convergence criterion θ, epoch ← 0
2: do epoch ← epoch + 1
3:   m ← 0;  Δw_ji ← 0;  Δw_kj ← 0
4:   do m ← m + 1
5:     x_m ← select a sample
6:     Δw_ji ← Δw_ji − η δ_j^h x_i ;  Δw_kj ← Δw_kj − η δ_k^o y_j
7:   until m = N
8:   w_ji ← w_ji + Δw_ji ;  w_kj ← w_kj + Δw_kj
9: until ‖∇J(w)‖ < θ
10: return w
11: end

From the experimental analysis, the stochastic method is faster than the batch
even if the latter fully uses the direction of the gradient descent to converge. Online

training is used when the number of samples is very large but is sensitive to the order
in which the samples of the training set are presented.

1.11.5 Generalization of the MLP Network

An MLP network is able to approximate any nonlinear function if the training set
of samples (input data/desired output data) presented is adequate. Let us now see
what the level of generalization of the network is, that is, the ability to recognize a
pattern not presented in the training phase and not very different from the sample
patterns. The learning dynamics of the network is such that, at the beginning, the
error on the samples is very high and then decreases asymptotically, tending
to a value that depends on: the Bayesian error of the samples, the size of the training
set, the network configuration (number of neurons and layers), and the initial value of
the weights.
A graphical representation of the learning dynamics (see Fig. 1.43) is obtained by
plotting on the ordinate how the error varies with the number of epochs
performed. From the resulting learning curve, one can assess the level of training and
decide when to stop it. Normally, learning is stopped when the required error is reached or when
an asymptotic value is reached. A saturation of the learning can occur, in
the sense that the training data are approximated excessively (for
example, the samples are presented too many times), generating the phenomenon of overfitting
(in this context, overtraining) with the consequent loss of the generalization of the
network when it is then used in the test context.
A strategy to control the adequacy of the level of learning achieved is to use test
samples, other than the training samples, and validate the generalization behavior of
the network. On the basis of the results obtained, it is also possible to reconfigure
the network in terms of number of nodes. A strategy that allows an appropriate
configuration of the network (and avoids the problem of overfitting) is that of having
a third set of samples, called the validation set. The dynamics of learning are analyzed with
the two curves obtained from the training set and the one related to the

Fig. 1.43 Learning curves (mean squared error versus epochs) related to the three operational
contexts of the MLP network: training, validation, and testing; the point of early stopping
corresponds to the minimum of the validation curve

validation set (see Fig. 1.43). From the comparison of the curves, one can decide to
stop the learning at the local minimum found on the validation curve.
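A minimal sketch of this validation-based stopping strategy is given below; it reuses the hypothetical mlp_forward and backprop_update sketches above, and the patience counter is a common practical device for detecting the minimum of the validation curve, not something prescribed by the text:

import numpy as np

def train_with_early_stopping(W_hid, W_out, train_set, val_set, max_epochs=500, patience=10):
    # Stop when the validation MSE has not improved for `patience` consecutive epochs.
    best_err, best_weights, wait = np.inf, (W_hid.copy(), W_out.copy()), 0
    for epoch in range(max_epochs):
        for x, t in train_set:                               # one online epoch
            W_hid, W_out = backprop_update(x, t, W_hid, W_out)
        val_err = np.mean([np.sum((mlp_forward(x, W_hid, W_out)[1] - t) ** 2)
                           for x, t in val_set])
        if val_err < best_err:
            best_err, best_weights, wait = val_err, (W_hid.copy(), W_out.copy()), 0
        else:
            wait += 1
            if wait >= patience:
                break                                        # past the minimum of the validation curve
    return best_weights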

1.11.6 Heuristics to Improve Backpropagation

We have seen that a neural network like the MLP is based on mathemati-
cal/computational foundations inspired by biological neural networks. The backprop-
agation algorithm used for supervised learning is set up as an error minimization
problem associated with the training set patterns. Under these conditions, it is shown that the
convergence of backpropagation is possible, both in probabilistic and in deter-
ministic terms. Despite this, it is useful in real applications to introduce heuristics
aimed at optimizing the implementation of an MLP network, in particular for the
aspects of classification and pattern recognition.

1.11.6.1 Dynamic Learning Improvement: Momentum


We know that the learning factor η controls the behavior of the backpropagation algo-
rithm, in the sense that small values give slow convergence but better
effectiveness, while large values can make the network unstable. Another
aspect to consider concerns the typical local minimum problem of the gradient
descent method (see Fig. 1.44). This implies that a sub-optimal solution is reached.
In these contexts, it would be useful to modify the weight adaptation rule. One
solution derived from physics is the use of momentum.
Objects in motion tend to stay in motion unless external actions intervene to
change this situation. In our context, we should dynamically vary the learning factor
as a function of the variation of the previous partial derivatives (the dynamics of
the system in this case is altered by the gradient of the error function). In particular,
this factor should be increased where the variation of the partial derivatives is almost
constant, and decrease it where the value of the partial derivative undergoes a change.
This results in modifying the weight adaptation rule by including some fraction of
the previously updated weight changes (the idea is to link the updating of the current

Fig. 1.44 The problem of the backpropagation algorithm which, based on the gradient descent,
finds a local minimum in minimizing the error function J(w)

weights taking into account the past iterations). Let Δw(t) = w(t) − w(t − 1) be the
variation of the weights at the t-th iteration; the adaptation rule of the weights (for
example, considering the (1.289)) is modified as follows:

    w(t + 1) = w(t) + (1 − α) [ −η ∂J/∂w ] + α Δw(t − 1)    (1.291)

where α (also called momentum) is a positive number with values between 0 and
1, and the expression [•] is the variation of the weights associated with the gradient
descent, as expected for the backpropagation rule. In essence, the α parameter
determines the amount of influence of the previous iterations over the current one.
The momentum introduces a sort of damping on the dynamics of the weight adaptation,
avoiding oscillations in the irregular areas of the surface of the error function
by averaging gradient components with opposite sign, and speeding up the
convergence in the flat areas. This attempts to prevent the search process from getting
stuck in a local minimum. For α = 0, we have the gradient descent rule; for
α = 1, the gradient descent is ignored and the weights are updated with a constant variation.
Normally, α = 0.9 is used.
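A one-line sketch of the momentum rule (1.291), with Δw(t − 1) carried over between iterations (names are illustrative):

def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    # Eq. (1.291): blend the gradient-descent step with the previous weight variation.
    delta = (1.0 - alpha) * (-eta * grad) + alpha * prev_delta
    return w + delta, delta    # updated weights and Δw(t) to reuse at the next iteration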

1.11.6.2 Properties of the Activation Function


The backpropagation algorithm accepts any type of activation function σ(•) as long as it
is differentiable. Nevertheless, it is convenient to choose functions with the following
properties:

(a) Nonlinear, to ensure nonlinear decision boundaries.
(b) Saturating, i.e., with a minimum and maximum output value. This allows the
weights and activation potentials to be kept in a limited range, as well as the
training time.
(c) Continuous and differentiable, that is, σ(•) and σ′(•) are defined in the whole
input range. The derivative is important to derive the weight adaptation rule.
Backpropagation can accept piecewise linear activation functions even if this adds
complexity and few benefits.
(d) Monotonicity, to avoid the introduction of local minima.
(e) Linearity, for small values of net, so that the network can implement linear models
according to the type of data.
(f) Antisymmetry, to guarantee a faster learning phase. An antisymmetric function
(σ(−x) = −σ(x)) is the hyperbolic tangent (1.226).

The activation function that satisfies all the properties described above is the
following sigmoid function:

    σ(net) = a · tanh(b · net) = a (e^{b·net} − e^{−b·net}) / (e^{b·net} + e^{−b·net})    (1.292)

with the optimal values a = 1.716 and b = 2/3, and the linear interval
−1 < net < 1.
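A sketch of this activation and of its derivative, as needed by backpropagation, assuming the constants suggested above (function names are illustrative):

import numpy as np

def sigma_tanh(net, a=1.716, b=2.0 / 3.0):
    # Antisymmetric sigmoid of Eq. (1.292): a * tanh(b * net)
    return a * np.tanh(b * net)

def sigma_tanh_prime(net, a=1.716, b=2.0 / 3.0):
    # Derivative a * b * (1 - tanh^2(b * net)), needed for the weight adaptation rule
    return a * b * (1.0 - np.tanh(b * net) ** 2)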

1.11.6.3 Preprocessing of Input Data


Training set data must be adequately processed ensuring that the calculated average
is zero or in any case with small values compared to variance. In essence, the input
data must be normalized to homogenize the variability range. Possibly with the
normalization, the data is transformed (xn = (x − μx )/σx ) to result with zero mean
and unit variance.
This data normalization is not required for online backpropagation where the entire
training set is not processed simultaneously. Avoid presenting patterns with highly
correlated or redundant features to the network. In these cases, it is convenient to
apply a transform to the principal components to verify the level of their correlation.
When the network is used for classification, the desired target value that identifies a
class must be compatible with the range of definition of the activation function. For
any finite value of net, the output σ (net) must never reach the saturation values of
the sigmoid function (±1.716), and so there would be no error.
Conversely, if the error would be great the algorithm would never converge and the
weights would tend with values toward infinity. One solution is to use target vectors
of type t = (−1, −1, 1)T where 1 indicates the class and −1 indicates nonclass
membership (in the example, the target vector represents class ω3 in the 3-class
classification problem).
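The standardization and the class coding of the targets can be sketched as follows (function names are illustrative; the small eps guards against zero-variance features and is not part of the text):

import numpy as np

def standardize(X, eps=1e-12):
    # Transform features to zero mean and unit variance: x_n = (x - mu_x) / sigma_x
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / (sd + eps), mu, sd

def targets_for_class(labels, K, on=1.0, off=-1.0):
    # Target vectors of type t = (-1, ..., +1, ..., -1), kept inside the sigmoid range
    T = np.full((len(labels), K), off)
    T[np.arange(len(labels)), labels] = on
    return T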

1.11.6.4 Network Configuration: Neurons and Weights


The number of neurons in the input and output layers is imposed by the dimension-
ality of the problem: the size of the pattern features and the number of classes. The number
of hidden neurons N_h instead characterizes the potential of the network itself. For
very small values of N_h, the network may be insufficient to learn complex decision
functions (or to approximate generic nonlinear functions). For very large N_h, we may
encounter the problem of overfitting the training set and lose the generaliza-
tion of the network when new patterns are presented. Although in the literature there
are several suggestions for choosing N_h, the problem remains unresolved. Normally,
each MLP network is configured with N_h neurons after several tests for a given
application.
For example, we start with a small number and then gradually increase it or, on
the contrary, we start with a large number and then gradually decrease it. Another rule can
be to correlate N_h with the number N of samples of the training set, choosing, for
example, N_h = N/10.
The initial value of the weights must be different from zero otherwise the learning
process does not start. A correct approach is to initialize them with small random
values. Those relating to the H-O (Hidden-Output) connection must be larger than
the I-H (Input-Hidden) connections, since they must carry back the propagation error.
Very small values for the weights H-O involve a very small variation of the weights

in the hidden layer, with the consequent slowing of the learning process. According
to the interval of definition of the sigmoid function, the heuristic used for the choice
of the weights of the I-H layer is a uniform random distribution in the interval
[−1/√d, 1/√d], while for the H-O layer it is [−1/√N_h, 1/√N_h], where d indicates the
dimensionality of the samples and N_h the number of hidden neurons.
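A sketch of this initialization heuristic, assuming the bias weights are drawn from the same interval as the other weights of their layer (an assumption for illustration):

import numpy as np

def init_weights(d, Nh, K, rng=None):
    # Uniform initialization: input-hidden in [-1/sqrt(d), 1/sqrt(d)],
    # hidden-output in [-1/sqrt(Nh), 1/sqrt(Nh)], bias column included.
    if rng is None:
        rng = np.random.default_rng(0)
    W_hid = rng.uniform(-1.0 / np.sqrt(d),  1.0 / np.sqrt(d),  size=(Nh, d + 1))
    W_out = rng.uniform(-1.0 / np.sqrt(Nh), 1.0 / np.sqrt(Nh), size=(K, Nh + 1))
    return W_hid, W_out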
The backpropagation algorithm is applicable for MLP with more than one hidden
layer. The increase in the number of hidden layers does not improve the approxi-
mation power of any function. A 3-layer MLP is sufficient. From the experimental
analysis, for some applications, it was observed that the configuration of an MLP
with more than 3 layers presents a faster learning phase with the use of a smaller
number of hidden neurons altogether. However, there is a greater predisposition of
the network to the problem of the local minimum.

1.11.6.5 Learning Speed, Stop Criterion, and Weight Decay


Backpropagation theory does not establish an exact criterion to stop the learning
phase of the MLP. We have also seen the problem of overfitting when the training
is not stopped properly. One approach may be to monitor the error function J(W)
during the training with the validation set.
The method of early stopping can be used to prevent overfitting by monitoring the
mean square error (MSE) of the MLP on the validation set during the learning phase.
In essence, the training is stopped at the minimum of the curve of the validation set
(see Fig. 1.43).
In general, like all algorithms based on the gradient descent, the backpropagation
depends on the learning parameter η. This essentially indicates the speed of learning,
which for small values (starting from 0.1) ensures convergence (that is, a
minimum of the error function J(W) is reached), although it does not guarantee the generaliza-
tion of the network. A possible heuristic is to dynamically adapt η in relation to the
current value of the function J during the gradient descent. If J fluctuates, η may be
too large and should be decreased. If instead J decreases very slowly, η is too small
and should be increased.
Another heuristic that can avoid the problem of overfitting is to keep the weight
values small. There is no theoretical motivation to justify that weight decay should
always lead to an improvement in network performance (in fact, there may be spo-
radic cases where it leads to performance degradation), although experimentally it
is observed that in most cases, it is useful. Weight reduction is performed after each
update as follows:

    w^(new) = w^(old) (1 − ε)    0 < ε < 1    (1.293)
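A sketch of the decay step (1.293), to be applied to all weight matrices after each update (the value of ε here is an assumption for illustration):

def weight_decay(W_hid, W_out, eps=1e-4):
    # Eq. (1.293): shrink all weights after each update, w_new = w_old * (1 - eps)
    return W_hid * (1.0 - eps), W_out * (1.0 - eps)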

A further advantage of weight reduction concerns in particular the weights that
are not updated by the backpropagation algorithm. In this case, with the weight
reduction heuristic, these unmodified weights tend to be eliminated, as they are not
influential in reducing the error function. It is shown that the reduction of weights is

equivalent to the descent of the gradient which uses an objective function based on
the regularization method.25

1.11.6.6 Conclusions: Advantages and Disadvantages of a Neural Network
The ability of an MLP neural network to generalize is useful for pattern recognition
problems. An MLP learns complex discriminant functions between input and out-
put data, based on a set of significant samples of the domain of interest. The
relationship between the data can be linear or nonlinear, and the physical–mathematical
model that links the input data to the output data may not be known. Fault tolerance
is achieved when the network is stimulated by a test pattern that is different (within certain limits)
from the training set samples with which it has been trained.
Among the disadvantages, we see the difficulty of configuring an MLP (in terms of layers
and number of neurons), as one cannot have a priori knowledge of its
behavior. From this point of view, it is like a black box. It may require a long learning
process and may not reach an optimal solution (being trapped in a nonoptimal local minimum).
Therefore, different heuristics and attempts are required to achieve experimental
results appropriate to the context and the required performance.

1.12 Nonmetric Recognition Methods

All previous methods of object recognition are based on a quantitative evaluation of
the descriptors of the objects themselves, while the nonmetric methods are based on
a qualitative evaluation. In other words, while with metric methods it is possible to
apply a metric on the vectors of features that describe objects (patterns), nonmetric
methods are used when the intrinsic nature of the data is not suitable to apply any
metric, for example, a measure of similarity. In these cases, it may be useful to
represent a complex object using hierarchical approaches based on decision trees
or syntactic approaches based on a grammatical structure.

1.13 Decision Tree

The decision tree is a model used for classification, generated by analyzing
the various attributes (in this context, the attributes or features constitute the
instance space, i.e., the attribute/value pairs) that describe a pattern (object) to be

25 In the fields of machine learning and inverse problems, regularization consists of the introduction
of additional information or regularity conditions in order to solve an ill-conditioned problem
or to prevent overfitting.


Fig. 1.45 Functional scheme of a classifier based on a decision tree. The induction learning of
the decision tree prediction model takes place by analyzing the instances (attributes/values and
associated class) of the training set samples. The validity of the model is verified with the test set
patterns, predicting the class of patterns whose true class is known

classified. The classification function is expressed by logical evaluations of attribute
values (features). Decision tree learning is a method based on inductive inference to
approximate the target function that produces discrete values, or the class to which a
pattern belongs. The generation of the tree takes place considering the examples of
the training set (learning phase) with known class (see Fig. 1.45).
With a test on the value of an attribute, the instance space is partitioned; navigating
the tree from the root until a leaf is reached gives the expected value of the
class of a pattern, described by the path connecting the leaf to
the root of the tree. Therefore, starting from the topmost node (see Fig. 1.46), the
root node, to which an attribute is associated, is linked to the other internal nodes
(each representing an attribute) through arcs. The latter represent the values of the
attributes of the node from which they branch.
The arcs represent the result of the test on the attribute. A terminal node reached
by a path that starts from the root node is called a leaf node. The classification process
with a decision tree starts from the root node which, representing the first attribute
(the one considered most important) of the pattern to be classified, tests the value of
this attribute and starts the navigation by choosing the arc with the appropriate value,
pointing to the descendant node.
The process is iterated on this node (which can be considered the root of a
subtree), testing the value of the attribute represented by this node and
continuing the navigation through the appropriate value of the arc. The process continues
analyzing all the attributes and ends at the leaf node that represents the expected
value of the class of the pattern being examined. In other words, all internal nodes
(therefore not leaves) divide their instance space into two or more subspaces according
to the test performed on the attribute values.

Fig. 1.46 Top-down generation of a decision tree starting from the root node and using the samples
of the training set whose membership class is known

Figure 1.47 shows the structure of a decision tree created with a training set of
patterns x characterized by attributes of the type color, size, shape, taste. We can
see how easy it is to understand and interpret the decision tree for a classification
problem. For example, the pattern x = (yellow, medium, thin, sweet) is classified as
banana because the path from the root to the leaf banana encodes a conjunction of tests
on attributes: (color = yellow) and (shape = thin). It is also observed that different
paths can lead to the same class by encoding a disjunction of conjunctions. In the
figure, the tree shows two paths leading to the class apple: (color = green) and (size
= medium), or (color = red) and (size = medium).
Although decision trees give a concise representation of a classification process,
it can be difficult to interpret, especially for an extended tree. In this case, we can
use a simpler representation through classification rules, easily obtainable from the
tree. In other words, a path can be transformed into a rule. For example, the following
rule can be derived from the tree in Fig. 1.47 to describe the grape pattern: (size =
small) and (taste = sweet) and not (color = yellow).

1.13.1 Algorithms for the Construction of Decision Trees

We have already seen the structure of a decision tree (see Fig. 1.45) and how easy
it is to interpret it for pattern classification (see Fig. 1.47). When the training set D
of patterns, with defined attributes and classes, is very extensive, building a decision
tree to classify generic patterns can be very complex. In this case, the tree can grow
significantly and become hardly interpretable. Control criteria are therefore needed to
limit growth, based on the maximum depth reachable by the tree or on the minimum number
of pattern samples present in each node, in order to carry out the appropriate partition


Fig. 1.47 Complete example of a decision tree for classification

(division) of the training set into ever smaller subsets. The top-down approach to
building the learning tree, which partitions the training set using logical test conditions
on one attribute at a time, involves the following steps (a code sketch follows the list):

1. Create the root node by assigning it the most significant attribute to trigger the
classification process, assigning it the whole training set D, and create the arcs
for all possible test values associated with that attribute. The samples
of the training set are distributed to the descendant nodes in relation to the value
of the attribute of the source node. The process continues recursively, considering
the samples associated with the descendant nodes and choosing for these the most
significant attribute for the test.
2. Examine the current node:

a. If all the samples of the node belong to the same class ω, assign this class to
the node, which becomes a leaf, and the process stops.
b. Evaluate a measure of significance for each attribute of the partitioned sam-
ples.
c. Associate the test that maximizes the significance measure to the node.
d. For each test value, create a descendant node by associating the arc with
the condition of maximum significance, creating a subtree with the samples
that satisfy this condition.

3. For each node created, repeat step 2
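A minimal recursive sketch of this divide-and-conquer procedure is given below; the significance function is left abstract (for example, the information gain of Sect. 1.14.2), and the representation of samples as (attribute dictionary, class) pairs and of internal nodes as dictionaries is an illustrative choice, not the book's notation:

from collections import Counter

def build_tree(samples, attributes, significance):
    # samples: list of (dict_of_attribute_values, class_label)
    classes = [c for _, c in samples]
    if len(set(classes)) == 1:                       # step 2a: pure node -> leaf
        return classes[0]
    if not attributes:                               # no attribute left: majority class
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: significance(samples, a))   # steps 2b-2c
    node = {"attribute": best, "branches": {}}
    for value in {s[best] for s, _ in samples}:      # step 2d: one arc per attribute value
        subset = [(s, c) for s, c in samples if s[best] == value]
        node["branches"][value] = build_tree(subset,
                                             [a for a in attributes if a != best],
                                             significance)
    return node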

This type of algorithm for the top-down construction of the decision tree, which
recursively extends the tree and partitions the training set, is known as the divide and
conquer (from the Latin divide et impera) algorithm. In each iteration, the algorithm
partitions the training set using the results of the attribute test. The various algorithms
differ in the test function used to limit the growth of the tree and in the
partitioning mode that controls the distribution of the training set into smaller subsets

in each node until a stop criterion is encountered that does not compromise the
accuracy of the classification or prediction.
The main algorithms known in the literature are ID3 and C4.5 by Quinlan
[25,26], and CART [27]. The ID3 algorithm performs the top-down construction of the
tree by growing it and appropriately choosing an attribute at each node, while the
C4.5 and CART algorithms in addition implement a pruning phase of the
tree, checking that the tree does not become too extensive and complex in terms of
number of nodes, number of leaves, depth, and number of attributes. Several other
algorithms, characterized by the introduction of some variants of those mentioned
above, are available in the literature. An evaluation and comparison
of these algorithms are reported in [28].

1.14 ID3 Algorithm

ID3 (Iterative Dichotomiser 3) [25] is a classification algorithm based on the con-
struction of a decision tree starting from a training set. The first objective of ID3
is the appropriate choice of the attribute to be tested in each tree node
under construction. The strategy is to select, among the attributes of the instances
(attribute-value pairs) of the training set, the attribute that best separates the samples of
the training set. How can a good quantitative measure of the value of an attribute be evaluated? The
different decision tree algorithms are characterized precisely by the way an attribute
is selected. ID3, in building the tree, is based on the statistical properties of
entropy and information gain to measure how well a given attribute sepa-
rates the training set samples into the classes they belong to. ID3 chooses the attribute
with the highest information content.

1.14.1 Entropy as a Measure of Homogeneity of the Samples

Recall that the concept of entropy, defined in thermodynamics, provides a measure of
disorder. In information theory, this concept is used as a measure of the uncertainty
or information in a random variable. In this context, entropy is used to measure the
information content of an attribute, that is, a measure of the impurity (or homogeneity)
associated with the training set samples. In other words, in this context, the measure
of entropy is interpreted to indicate the value of the disorder (or diversity or impurity)
when a set of samples belongs to different classes, while, if the samples of this set all
have the same class, the information content is zero.
Given a training set D = {(x_1, ω(x_1)), (x_2, ω(x_2)), . . . , (x_N, ω(x_N))}, contain-
ing samples belonging to classes ω_i, i = 1, . . . , C, the entropy of the training set D is
defined by:

    H(D) = − Σ_{i=1}^{C} p(ω_i) log₂ p(ω_i)    (1.294)

Table 1.1 Training set: play tennis


Day Outlook Temperature Humidity Wind Play tennis
D01 Sunny Hot High Weak No
D02 Sunny Hot High Strong No
D03 Overcast Hot High Weak Yes
D04 Rain Mild High Weak Yes
D05 Rain Cool Normal Weak Yes
D06 Rain Cool Normal Strong No
D07 Overcast Cool Normal Strong Yes
D08 Sunny Mild High Weak No
D09 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

where p(ω_i) indicates the fraction of the samples in D belonging to the class ω_i.
If we consider a training set D with samples belonging to two classes with boolean
value (see Table 1.1), nominally indicated with the symbols “⊕” and “⊖”, the value
of the entropy H(D) for this boolean classification is

    H(D) = − p_⊕ log₂ p_⊕ − p_⊖ log₂ p_⊖
         = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.94    (1.295)

where p_⊕ = 9/14 is the fraction of the samples with positive class and p_⊖ =
1 − p_⊕ = 5/14 is the fraction of samples belonging to the negative class. It is
observed that the entropy is zero if the samples in D all belong to the
same class (purity), in this case, if they are all positive or all negative. In fact, we have
the following:

    if p_⊕ = 0 → p_⊖ = 1 − p_⊕ = 1 ⇒ H(D) = −0 log₂ 0 − 1 log₂ 1 = 0

    if p_⊕ = 1/2 → p_⊖ = 1 − p_⊕ = 1/2 ⇒ H(D) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1

considering the properties of the logarithm function.26 Figure 1.48 shows the plot
of the entropy H (D) for the Boolean classification according to p⊕ between 0

26 Normally, the logarithm values are given with respect to base 10 or to Napier's number e. With
the change of base, we can obtain the logarithm in base 2 of a number x as log₂(x) = log₁₀(x)/log₁₀(2).
The entropy values calculated above are obtained considering that log₂(1) = 0; log₂(2) = 1;
log₂(1/2) = −1; and (1/2) log₂(1/2) = (1/2)(−1) = −1/2.

Fig. 1.48 Graph of the entropy H(D) associated with the binary classification of the training set
samples D, expressed as a function of the distribution p_⊕ of the positive examples. The entropy
is maximum when the classes are equally distributed and minimum (zero) when D contains
samples of a single class

(corresponding to all samples with negative class) and 1 (all samples positive). The
maximum value of the entropy is reached when the distribution of the classes is uniform
(in the example p_⊕ = p_⊖ = 1/2). We can interpret the entropy as a value of the
disorder (impurity) of the distribution of classes in D.
According to information theory, entropy in this context has the meaning of the mini-
mum information content needed to encode, in terms of bits, the class to which a generic
sample x belongs. The logarithm is expressed in base 2 precisely in relation to the binary
coding of the value of the entropy.
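A short Python sketch of the entropy (1.294) over a list of class labels, which reproduces the value 0.94 of the (1.295) for the 9 positive and 5 negative samples of Table 1.1:

import numpy as np
from collections import Counter

def entropy(class_labels):
    # H(D) = -sum_i p(w_i) * log2 p(w_i), Eq. (1.294)
    counts = np.array(list(Counter(class_labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# entropy(["Yes"] * 9 + ["No"] * 5)  ->  about 0.940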

1.14.2 Information Gain

Let us now look at a measure based on entropy that evaluates the effectiveness of an
attribute in the classification process of a training set. The information gain measures
the reduction of entropy obtained by partitioning the training set with respect
to a single attribute. Let A be a generic attribute; the information gain G(D, A) of A
relative to the training set of samples D is defined as follows:

    G(D, A) = H(D) − Σ_i (|D_i|/|D|) H(D_i)    (1.296)

where the summation represents the average entropy and D_i, i = 1, . . . , n_A, are the
subsets deriving from the partition of the entire set D using the attribute A with n_A values.
It is observed that the first term of the (1.296) is the entropy of the whole set D
calculated with the (1.295), while the second term represents the mean entropy, that
is, the sum of the entropies of each subset D_i weighted with the fraction of samples |D_i|/|D|
belonging to D_i. The most significant attribute to choose is the one that maximizes
the information gain G, which is equivalent to minimizing the average entropy (the
second term of the (1.296)), since the first term, the entropy H(D), is constant for all
attributes. In other words, the attribute that maximizes G is the one that most reduces
entropy (i.e., disorder).

Let us go back to the samples of Table 1.1 and calculate the significance of the
attribute Humidity, which has 7 occurrences with value high and 7 with value
normal. For Humidity = high, 3 samples are positive and 4 negative, while for
Humidity = normal, 6 are positive and 1 negative. Applying the (1.296)
and using the value of the entropy H(D) given by the (1.295), the information
gain G(D, Humidity), obtained by partitioning the training set D with respect to the attribute
Humidity, results in:

    G(D, Humidity) = H(D) − Σ_i (|D_i|/|D|) H(D_i)
                   = 0.940 − (7/14) H(D_High) − (7/14) H(D_Normal)
                   = 0.940 − (7/14) · 0.985 − (7/14) · 0.592 = 0.151    (1.297)

where H(D_High) and H(D_Normal) are the entropies calculated for the subsets D_High
and D_Normal, selected from the samples having the Humidity attribute with
value, respectively, High and Normal.
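Reusing the entropy sketch above and the same (attribute dictionary, class) representation of the samples, the information gain (1.296) can be computed as follows; on the 14 samples of Table 1.1 it returns about 0.151 for the Humidity attribute, as in the (1.297):

def information_gain(samples, attribute):
    # G(D, A) = H(D) - sum_i |D_i|/|D| * H(D_i), Eq. (1.296)
    labels = [c for _, c in samples]
    g = entropy(labels)
    for value in {s[attribute] for s, _ in samples}:
        subset = [c for s, c in samples if s[attribute] == value]
        g -= len(subset) / len(samples) * entropy(subset)
    return g

# With the 14 samples of Table 1.1, information_gain(D, "Humidity") gives about 0.151.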
The ID3 algorithm for building the tree selects, as the significant attribute, the one with the
highest value of the information gain G. Therefore, with the training set of
Table 1.1, to decide the best weather day to play tennis in a 2-week time frame, ID3
can build the decision tree. The classification leads to a binary result, i.e., one plays if
the path on the tree leads to the positive class ω1 = yes, or one does not play if it leads to
the negative class ω2 = no. The root node of the tree is generated by evaluating G
for the attributes (Outlook, Wind, Temperature, and Humidity). The information gain
for these attributes is calculated by applying the (1.296), as done for the Humidity
attribute in the (1.297), obtaining
G(D, Outlook) = 0.246
G(D, Wind) = 0.048
G(D, Temperature) = 0.029
G(D, Humidity) = 0.151

The choice falls on the Outlook attribute, which has the highest value of the information
gain, G(D, Outlook) = 0.246. The top-down construction of the tree (see
Fig. 1.49) continues by creating 3 branches from the root node, as many as the values
that the Outlook attribute can take: Sunny, Overcast, Rain. The question
to ask now is the following: are the nodes corresponding to the 3 branches leaf nodes,
or root nodes of subtrees to be created by growing the tree further?
Analyzing the training set samples (see Table 1.1), we have the 3 subsets D_Sunny =
{D1, D2, D8, D9, D11}, D_Overcast = {D3, D7, D12, D13}, and D_Rain =
{D4, D5, D6, D10, D14} associated, respectively, with the 3 values of the Outlook
attribute. It is observed that all the samples of the subset D_Overcast have positive class
(associated with Outlook = Overcast) and, therefore, the corresponding node is a
leaf node with class/action PlayTennis = Yes.


Fig. 1.49 Generation of the decision tree associated with the training set D consisting of the samples
of Table 1.1. a Once the Outlook attribute is selected as the most significant, 3 child nodes are
created, as many as the values of that attribute, which partition D into 3 subsets. The child node
associated with the value Outlook = Overcast is a leaf node, since it includes samples with the
same class ω1 = Yes, while the other two nodes are still to be partitioned, having samples with different
classes. b For the first child node, associated with the value Outlook = Sunny, the information gain
G is calculated for the remaining attributes (Wind, Temperature, and Humidity) to select the
most significant one. c The attribute Humidity is selected, which generates two leaf nodes associated
with the two values (High, Normal) of this attribute. The process of building the tree continues,
for the third node associated with Outlook = Rain, as done in (b), selecting the most significant
attribute

For the other two branches, Sunny and Rain, the associated nodes are not leaves
and will become root nodes of subtrees to be built. For the node associated with
the Sunny branch, what was done on the initial node of the tree is repeated, i.e., the
most significant attribute must be chosen. Since we have already used the Outlook
attribute, it remains to select from the 3 attributes Humidity, Temperature, and Wind.
Considering the subset D_Sunny and applying (1.296), the information gain for the
3 attributes results in
G(D_Sunny, Humidity) = 0.970
G(D_Sunny, Temperature) = 0.570
G(D_Sunny, Wind) = 0.019

The choice falls on the attribute Humidity, which has the highest value of G.
This attribute has two values: High and Normal. Therefore, from this node there will
be two branches, and the process of building the tree proceeds by selecting attributes
(excluding those already selected at higher levels of the tree) and partitioning the
training set until leaf nodes are obtained, i.e., nodes whose associated samples all have the same
class (zero entropy of the associated subset). Figure 1.50 shows the entire decision
tree built. The same tree can be expressed by rules (see the rules in Algorithm 9).
These are generated, for each leaf node, by testing each attribute along a path that

[Complete tree of Fig. 1.50: the root Outlook receives all 14 samples (9 Yes, 5 No); the Sunny branch ({D1, D2, D8, D9, D11}, 2 Yes, 3 No) is split by Humidity into High (D1, D2, D8: 3 No) and Normal (D9, D11: 2 Yes); the Overcast branch ({D3, D7, D12, D13}, 4 Yes) is a leaf Yes; the Rain branch ({D4, D5, D6, D10, D14}, 3 Yes, 2 No) is split by Wind into Strong (D6, D14: 2 No) and Weak (D4, D5, D10: 3 Yes).]

Fig. 1.50 The complete decision tree, associated with the training set D relative to the samples of
Table 1.1, generated with the ID3 algorithm. Starting from the root node, to which the whole training set D
is associated, each child node reports the subset deriving from the partition together
with the frequency of the two classes in that subset

starts from the root node and arrives at the leaf node (precondition of the rule), while
the classification of the leaf is the rule's post-condition. The pseudo-code of the ID3

Algorithm 9 Rules of the decision tree in Fig. 1.50

1: if (Outlook = Sunny) and (Humidity = High) then
2:    Play = No;
3: else if (Outlook = Sunny) and (Humidity = Normal) then
4:    Play = Yes;
5: else if (Outlook = Overcast) then
6:    Play = Yes;
7: else if (Outlook = Rain) and (Wind = Strong) then
8:    Play = No;
9: else if (Outlook = Rain) and (Wind = Weak) then
10:   Play = Yes;
11: else
12:   Play = Null
13: end if

algorithm is reported in Algorithm 10.



Algorithm 10 Pseudo-code of the decision tree learning algorithm: ID3

1: Function ID3
2: Input: training set D
3: Output: Decision Tree DT
4: if the samples in D all have the same class ω then
5:    Return a new leaf node and label it with class ω
6: else
7:    1. Select a significant attribute A according to the Entropy and Information Gain function
      2. Create a new node in DT and use the attribute A as a test
      3. For each value v_i of A
         a. Assign to D_i all samples of D with A = v_i
         b. Use ID3 to construct the decision subtree DT_i associated with the subset D_i
         c. Create a branch that connects DT and DT_i
8: end if

The peculiarity of the ID3 algorithm is its greedy search, that is, it chooses the best
attribute and never goes back to reconsider previous choices. ID3 is a non-incremental
algorithm, meaning that it derives its classes from a fixed training set of instances.
An incremental algorithm [29] revises the current class definition, if necessary, when
a new sample is presented.
The classes created by ID3 are inductive, that is, the classification carried out by
ID3 is based on the intrinsic knowledge of the instances contained in the training set,
which is assumed to hold also for the future instances presented in the test
phase. The induction of classes cannot be shown to always work, since an infinite
number of patterns may have to be classified. Therefore, ID3 (or any
inductive algorithm) may not classify correctly.
The description and examples provided show that ID3 is easy to use. It is
mainly used to replace the expert who would normally build a decision tree classifier
manually. ID3 has been used in various industrial applications, for medical diagnosis,
and for the assessment of credit risk (or insolvency).

1.14.3 Other Partitioning Criteria

Information gain as a partitioning indicator tends to favor tests on
attributes with many values. Another problem arises when the training set has a
limited number of samples or the data are affected by noise. In all these cases, the

problem of overfitting^27 may occur, i.e., the selection of a nonoptimal attribute for
prediction, and the problem of strong fragmentation, when the training subsets
become too small to represent the samples of a certain class in a significant way. Returning
to the samples of Table 1.1, if we add a Date attribute we get a situation in which
the information gain G favors this attribute for the partition because it has a
high number of possible values. Therefore, the Date attribute, with its large number
of values, would correctly predict the samples of the training set and would be preferred
for the root node, producing a very wide tree of depth 1. The
problem occurs when test samples are presented, where such a tree is unable to generalize.
In fact, the Date attribute, having many values, distorts the construction of the
tree by partitioning the training set into many small subsets. It follows that the entropy of the
partition caused by Date would be zero (each day would produce a different and pure
subset, as it consists of a single sample of a single class) with the corresponding
maximum value of G(D, Date) = 0.940 bits.
To eliminate this problem, Quinlan [25] introduced an alternative measure to G,
known as the Gain Ratio GR, which normalizes the information gain. This new measure
penalizes attributes with many values through the term
Split Information SplitIn(D, A), which measures the information due to the partition
of D with respect to the values of the attribute A. The Gain Ratio GR is defined as
follows:
$$GR(D, A) = \frac{G(D, A)}{SplitIn(D, A)} \qquad (1.298)$$

where

$$SplitIn(D, A) = -\sum_{i=1}^{C} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|} \qquad (1.299)$$

with D_i, i = 1, ..., C the subsets of the training set D partitioned by the
C values of the attribute A. It should be noted that SplitIn(D, A) is the entropy of
D calculated with respect to the values of the attribute A. Furthermore,
SplitIn(D, A) penalizes attributes with a large number of values that partition

27 In the context of supervised learning, a learning algorithm uses the training set samples to predict
the class of other samples in the test phase, which were not presented during the learning phase.
In other words, it is assumed that the learning model is able to generalize. It can happen instead,
especially when the learning adapts too closely to the training samples or when there is a limited number
of training samples, that the model fits characteristics that are specific only to the training set
and does not have the same prediction capacity (for example, to classify) on the samples of
the test phase. We are, therefore, in the presence of overfitting, where the performance (i.e., the ability
to fit/predict) on the training data increases, while the performance on unseen data gets worse. In
general, the problem of overfitting can be limited with cross-validation in statistics or with
early stopping in the learning context. Decision trees that are too large are not easily understood, and
often their overfitting is known in the literature as a violation of Occam's Razor, the philosophical
principle which suggests the futility of formulating more hypotheses than those strictly
necessary to explain a given phenomenon when the initial ones may be sufficient.

D into many subsets all of the same cardinality. For the attribute Date, we would
have

$$SplitIn(D, Date) = -\sum_{i=1}^{14} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|} = -14 \cdot \frac{1}{14} \log_2 \frac{1}{14} = 3.807$$

and

$$GR(D, Date) = \frac{G(D, Date)}{SplitIn(D, Date)} = \frac{0.940}{3.807} = 0.246$$

It should be pointed out that, for the training set example considered, the Date
attribute would still be the winner, having a value of GR higher than that of the other
attributes. Nevertheless, GR proves more reliable than the information gain G. The
choice of attributes must, however, be made carefully, first selecting those with a value
of G higher than the average information gain of all attributes; the final
choice is then made on the attribute with the greatest GR. The same heuristic is used in
cases where the denominator of (1.298) tends toward zero.
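As a complement, the short Python sketch below (not part of the book; names and data are illustrative) computes the Split Information of Eq. (1.299) and the Gain Ratio of Eq. (1.298) for a Date-like attribute with 14 distinct values.

import math
from collections import Counter

def split_information(values):
    # SplitIn(D, A): entropy of D with respect to the values of attribute A, Eq. (1.299)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    # GR(D, A) = G(D, A) / SplitIn(D, A), Eq. (1.298)
    si = split_information(values)
    return gain / si if si > 0 else 0.0

# A Date-like attribute with 14 distinct values: SplitIn = log2(14)
date_values = ['day-%d' % i for i in range(14)]
print(round(split_information(date_values), 3))  # 3.807
print(round(gain_ratio(0.940, date_values), 3))  # ~0.247 (0.246 in the text, due to rounding)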
In summary, the ID3 algorithm uses an approach based on the information gain measure
to navigate the attribute space of the decision tree. The result converges on a
single hypothesis, and it does so without ever navigating backward. The
construction of the tree (learning phase) is based only on the samples of the training
set and therefore represents a non-incremental method of learning. This produces the
drawback of not being able to update the tree when a new sample is misclassified,
requiring the regeneration of the tree. It uses the statistical information of the whole training
set, and this makes it less sensitive to the noise of individual training
samples. ID3 is limited to testing one attribute at a time and cannot handle numeric
attributes.

1.15 C4.5 Algorithm

The C4.5 algorithm is an evolution of ID3, also proposed by Quinlan [26]. C4.5
uses the Gain Ratio as the partitioning criterion as well. Compared to the ID3 algorithm,
it has the following improvements:

1. It handles both discrete and continuous attributes. In the case of attributes with
numerical values, the test is performed on an interval, for example by splitting it
appropriately in binary mode. If A is an attribute with numeric values (as is the
case for the attribute Temperature of the training set of Table 1.1), it can be
discretized or represented with a boolean value by dividing the interval with an
appropriate threshold t (for example, if A < t → Ac = True, otherwise Ac = False).
The appropriate choice of the threshold t can be evaluated with the maximum
information gain G (or with the Gain Ratio) by partitioning the training set
according to the values of A once the two subsets of values A ≤ t and A > t are
obtained (a sketch of this threshold search is given right after this list).
2. Partitioning stops when the number of instances to be partitioned is less than a
certain threshold.
3. It manages training set samples with missing attribute values. If an attribute A with
missing values needs to be tested during the learning phase, we can use the following
approaches:

   (a) We choose the most probable value, that is, the one occurring most frequently in
       the samples associated with A.
   (b) We consider all the values v_i of x.A, assigning to each an estimated probability
       p(v_i) computed on the samples belonging to the node under examination. This
       probability is calculated as the observed frequency of the various values of
       A among the samples of the node under examination. We assign a fraction
       p(v_i) of x to each descendant in the tree, which thus receives a weight proportional
       to its importance. The calculation of the information gain G(D, A) is then performed
       using these proportional weights.

4. It executes the pruning of the tree to avoid the problem of overfitting.
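With reference to point 1 of the list, the following Python sketch (an illustration under assumed data, not the book's code) shows one common way to choose the threshold t for a numeric attribute: candidate cut points are taken midway between consecutive distinct sorted values, and the one maximizing the information gain of the binary split A ≤ t / A > t is retained.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Return (t, gain): the cut A <= t maximizing the information gain of the binary split.
    pairs = sorted(zip(values, labels))
    h_total = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = h_total - len(left) / len(pairs) * entropy(left) \
                       - len(right) / len(pairs) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical numeric Temperature values with a binary play/no-play outcome
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
plays = ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'yes', 'no']
print(best_threshold(temps, plays))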

1.15.1 Pruning Decision Tree

In the construction of the decision tree, we have previously highlighted the possibility
of blocking growth through a stopping criterion based, for example, on a maximum
depth or on a maximum number of partitions. These methods are limited
in that they tend to create trees that are either too small and undersized to classify correctly, or
very large and over-sized with respect to the training set. This last aspect
concerns the problem of overfitting highlighted in the previous paragraphs.
A method to avoid the problem of overfitting is to prune the tree [27].
The strategy is to grow the tree without restrictive stopping criteria, initially accepting
an over-sized tree with respect to the training set, and then, in a second phase, to prune
the tree by adequately removing branches that do not contribute to the generalization
and accuracy of the classification process. Before describing the pruning methods,
we give the formal definition of overfitting in this context.

Let h be a hypothesis belonging to the hypothesis space H and let h' ∈ H be an
alternative hypothesis; we have overfitting on the training set D if h produces a
smaller error than h' on D, but h' has a smaller error than h when measured over
all instances, including those not contained in the training set.

Figure 1.51 shows the consequences of overfitting in the context of learning with the
decision tree. The graph shows the accuracy trend of the predictions made by the

Fig. 1.51 Graph of the accuracy of the decision tree measured on the training set samples during the
construction (learning) phase, on the samples of the test set, and then on these last samples after
applying the tree pruning operator. The accuracy (vertical axis, 0.5–0.9) is plotted against the number
of nodes, i.e., the tree size (horizontal axis, 0–100), for the three curves Training Set, Test Set, and
Post-Pruned on Test Set

tree in the learning phase using the training set, and in the test phase considering the
samples of the test set not processed by the tree during training. Accuracy varies
with the number of nodes as the tree grows by examining the training set samples in
the learning phase. It is observed how the accuracy on the test set decreases after the tree reaches a certain size (depending on the type of application)
in terms of number of nodes. This decrease in accuracy is due to the random noise
of the training set and test set samples.
There are various methods of pruning decision trees which, for simplicity, can be
grouped as follows:

Pre-pruning. The growth of the tree is blocked, during construction, when it is
determined that the added information is no longer reliable. This group includes
methods based on statistical tests: the growth of the tree is blocked when there
is no significant statistical dependency between the attributes and the
class at a particular node. In essence, the growth of the tree uses the whole training
set and applies the statistical test to decide, at each step, whether to extend
or prune a node in order to produce greater accuracy beyond the training set samples. The
ID3 algorithm uses the chi-square test to select the most significant attributes and
then, in addition, evaluates the information gain.
Post-pruning. The tree is generated completely, probably over-sized with respect to the
training set, and subsequently the insignificant subtrees are removed (pruned), replacing
them with leaf nodes.

Generally, the post-pruning methods are preferred, since those based on pre-pruning,
although faster, can prematurely stop the growth of the tree. Regardless of the
method used, it is necessary to evaluate the accuracy of the tree built in order to select the best
one. This can be assessed by considering the performance of the decision tree with
a training set D for the construction (learning) phase and a set of validation samples
V (validation set) presented only in the test phase. It is assumed that, although the
learning may be affected by an error caused by the noise and coincidental irregularities
of the training set samples, it is unlikely that the samples of the validation set will

present the same statistical fluctuation. Normally, the validation set can help verify
the existence of some abnormal training set samples.
This is made possible in practice when the two sets of samples are adequately
sized, in the ratio of 2/3 for the training set and 1/3 for the validation set.
An alternative measure of tree performance after pruning is given by calculating
the minimum description length (MDL) of the tree [30].
This measure is expressed as the number of bits required to encode the decision
tree. This evaluation method selects decision trees with shorter description length.

1.15.2 Post-pruning Algorithms

The basic idea of these algorithms is to first build the complete tree, including all possible
attributes, and later, with pruning, remove the parts of the tree associated with attributes
due to random effects. The simplification of the tree takes
place using a post-pruning strategy based on two post-pruning operators: subtree
replacement and subtree raising. In the first case, subtree replacement,
the initial tree is modified after all its subtrees have been analyzed and possibly replaced
with leaf nodes. For example, the entire subtree shown in Fig. 1.52, which includes
two internal nodes and 4 leaf nodes, is replaced with a single leaf node. This pruning
involves lower accuracy if evaluated on the training set, while the accuracy can increase if
the test set is considered.
This operator is implemented starting from the leaf nodes and proceeds backward
toward the root node. In fact, in the example of Fig. 1.52, it is first considered whether to replace
the 3 child nodes of the subtree with root X with a single leaf node. Later, going
backward, it is evaluated whether to prune the subtree with root B, which now has only two
child nodes, and replace it with a single leaf node, as shown in the figure.
In the second case, the subtree raising operator, the deletion of a node
(and consequently of the subtree of which this node is the root) involves the raising of an
entire subtree, as shown in Fig. 1.53.

[Figure: pruning operator, substitution of the subtree with root node B by a leaf node.]

Fig. 1.52 Pruning with the subtree replacement operator. In the example, the subtree with the root
B is replaced with a leaf node ω1

[Figure: pruning operator, raising of the subtree with root node C after eliminating node B; the leaves 1, 2, 3 of C become 1′, 2′, 3′ after absorbing the instances of the removed leaves 4 and 5.]

Fig. 1.53 Pruning with the subtree raising operator. In the example, the subtree with root C is raised
by deleting the subtree with root B and redistributing the instances of the leaf nodes 4 and 5 into
the node C

Although in the figure the child nodes of B and C are depicted as leaves, they
can be subtrees. Furthermore, when using the subtree raising operator, it is necessary to
reclassify the samples associated with the suppressed nodes, which in the example
correspond to nodes 4 and 5, into the new subtree with root node C. This explains
why the child nodes of C, after pruning, are indicated differently as 1′, 2′, and 3′ (the
instances are redistributed to account for the samples associated with the initial nodes
4 and 5) with respect to the initial subtree before pruning. This last operator, based on
deleting a node, is slower.
We now describe two strategies of pruning (Reduced Error Pruning and Rule
Post-Pruning) and how to evaluate the accuracy of the pruned tree.

1.15.2.1 Reduced Error Pruning


A simple pruning algorithm, known as Reduced Error Pruning [31], visits the entire
tree with a bottom-up approach. Each internal node is checked to decide whether it can be replaced
(together with the subtree of which it is the root) by a leaf node, labeled with the most
frequent class among the training set samples associated with it. The
nodes are removed only if the tree thus pruned does not worsen its original accuracy
when tested on the test set. The pruning of the nodes continues until a further
removal produces a worsening of the accuracy of the tree, always evaluated on the
test set. The graph of Fig. 1.51 shows the trend of the accuracy of the tree, in relation
to the number of nodes, evaluated on the training set and on the test set before and
after pruning. A generic post-pruning algorithm is reported in Algorithm 11.
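The following Python sketch (not from the book; the nested-dictionary tree format and helper names are assumptions made for illustration) outlines reduced error pruning: the tree is visited bottom-up and an internal node is replaced by its majority-class leaf whenever the accuracy on the validation samples reaching that node does not decrease (ties favor the simpler tree).

def classify(tree, sample):
    # A leaf is a class label (string); an internal node is a dictionary
    # {'attr': attribute, 'branches': {value: subtree, ...}, 'majority': majority class}.
    while isinstance(tree, dict):
        tree = tree['branches'].get(sample.get(tree['attr']), tree['majority'])
    return tree

def reduced_error_pruning(tree, validation):
    # 'validation' is the list of (sample, class) pairs routed to this node; replacing
    # the subtree only changes the predictions of these samples, so the accuracy
    # comparison with the majority-class leaf can be done locally, bottom-up.
    if not isinstance(tree, dict):
        return tree
    for value in list(tree['branches']):
        subset = [(s, c) for s, c in validation if s.get(tree['attr']) == value]
        tree['branches'][value] = reduced_error_pruning(tree['branches'][value], subset)
    if not validation:
        return tree  # no validation evidence: leave the subtree unchanged
    correct_subtree = sum(classify(tree, s) == c for s, c in validation)
    correct_leaf = sum(c == tree['majority'] for _, c in validation)
    # Replace the node with its majority-class leaf if accuracy does not worsen
    return tree['majority'] if correct_leaf >= correct_subtree else tree

# Toy tree (the structure of Fig. 1.50) and a tiny hypothetical validation set
tree = {'attr': 'Outlook', 'majority': 'Yes',
        'branches': {'Sunny': {'attr': 'Humidity', 'majority': 'No',
                               'branches': {'High': 'No', 'Normal': 'Yes'}},
                     'Overcast': 'Yes',
                     'Rain': {'attr': 'Wind', 'majority': 'Yes',
                              'branches': {'Strong': 'No', 'Weak': 'Yes'}}}}
validation = [({'Outlook': 'Sunny', 'Humidity': 'High'}, 'No'),
              ({'Outlook': 'Rain', 'Wind': 'Weak'}, 'Yes')]
print(reduced_error_pruning(tree, validation))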

1.15.2.2 Rule Post-pruning


This approach is used when the data available is limited. Furthermore, it is a practical
approach to find hypotheses with high accuracy. The 4 essential steps of Rule Post-
Pruning are

Algorithm 11 Pseudo-code of a Post-Pruning algorithm

1: Input: training set D and validation (test) set V
2: Output: simplified decision tree DTs
3: Build the complete and consistent decision tree DT that classifies the samples of the training set D
4: while the accuracy evaluated on V does not decrease do

5:    1. Select an internal node of DT not yet examined
      2. Replace it (together with the subtree of which it is the root) with a leaf node labeled with the most frequent class of the training samples associated with it
      3. Evaluate the accuracy of the pruned tree on V and keep the replacement only if the accuracy does not worsen

6: end while

1. Build the DT decision tree from the training set D, possibly allowing overfitting.
2. Convert the DT decision tree into an equivalent set of rules by creating a rule
   for each path from the root node to a leaf node (see the rules in Algorithm 9).
3. Generalize (prune) each rule, i.e., try to remove any precondition of the rule
   whose removal produces an improvement in accuracy.
4. Order the rules thus obtained according to their estimated accuracy and consider
   them in this sequence when classifying new instances.

We report rule 1 (see Algorithm 9) of the tree of Fig. 1.50:

IF (Outlook = Sunny) AND (Humidity = High) THEN Play = No;

The preconditions considered for removal are (Outlook = Sunny) and
(Humidity = High). The precondition to be pruned is the one whose removal produces an
improvement in accuracy. Pruning is not performed if the elimination of a precondition
produces a decrease in accuracy. The process is iterated for each rule.
The accuracy can be evaluated, as done previously, using the validation set, if the
data are numerous enough to keep the training set separate from the validation set.
The C4.5 algorithm evaluates the performance based on the training set itself,
evaluating if the estimated error is reduced by deriving the confidence intervals with
a statistical test on the learning data. In particular, C4.5 assumes that the realistic
error is at the upper limit of this confidence interval (pessimistic error estimate) [26].
The accuracy estimate on the training set D is made by assuming that the probability
of error has a binomial distribution.

With this assumption, the standard deviation σ is calculated. For a given confidence
level d (for example, d = 0.95), the realistic error e falls, d% of the times,
within the confidence interval that depends on σ. As a pessimistic estimate of the
error, the maximum value of the interval is chosen, which corresponds to the estimated
accuracy − 1.96 · σ. This method of pessimistic pruning, despite being
a heuristic approach without solid statistical foundations, in practice produces acceptable
results. If an internal node is pruned, then all the nodes descending from it are removed,
thus obtaining a fast pruning procedure.
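The idea of the pessimistic estimate can be sketched as follows in Python (illustrative only; the exact formula and the default confidence level used by C4.5 differ in detail from this normal-approximation upper bound).

import math

def pessimistic_error(errors, n, z=1.96):
    # Upper bound of the error rate under a binomial assumption, using the normal
    # approximation: e + z * sqrt(e * (1 - e) / n). Illustrative only; the exact
    # C4.5 formula and its default confidence level differ in detail.
    e = errors / n
    return e + z * math.sqrt(e * (1.0 - e) / n)

# A node misclassifying 2 of 20 training samples: observed error 0.10,
# pessimistic estimate about 0.23, so the node looks worse than it appears
print(round(pessimistic_error(2, 20), 2))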

1.16 CART Algorithm

CART [27] stands for Classification And Regression Trees. The CART algorithm
is among the best known for constructing classification and regression trees.^28 The
supervised CART algorithm generates binary trees, i.e., trees in which each node
has only two arcs. This does not limit the possibility of building complex trees. A
CART tree is built in a greedy way, as for C4.5 trees, but the type of tree produced
is substantially different. An important feature of CART is the ability to generate
regression trees (see note 28). CART uses discrete and continuous attributes, and
constructs the binary tree using the information gain (see the previous paragraph) or
the Gini index as the criterion for splitting the training set at each node.

1.16.1 Gini Index

The splitting criterion based on the Gini index^29 was applied in the CART algorithm
by Breiman [27] for the construction of binary decision trees, with the advantage
that it can also be used well for regression problems. As an alternative to the
entropy, the Gini index, indicated with Gini(D), is a measure of the impurity (disorder)
of the training set D to be partitioned at a certain node t of the tree. The generalized
form of this index is given by


$$Gini(D) = 1 - \sum_{i=1}^{C} p^2(\omega_i) \qquad (1.300)$$

28 Decision trees that predict a categorical variable (i.e., the class to which a pattern vector belongs)
are commonly called classification trees, while those that predict continuous variables (i.e., real
numbers) rather than a class are called regression trees. However, classification trees can also describe
the attributes of a pattern in the form of discrete intervals.
29 Introduced by the Italian statistician Corrado Gini in 1912 in Variability and Mutability to represent
a measure of inequality of the distribution of a random variable. It is indicated with a number from 0 to 1:
low index values indicate a homogeneous distribution, with the value 0 corresponding to a perfectly
equal distribution, while high index values indicate a very unequal distribution.

where p(ω_i) indicates the relative frequency (i.e., the probability) of the class ω_i in
the training set D at the node under examination t, and C indicates the number of
classes. If the training set D consists of samples all belonging to the same class, the Gini
index is 1 − 1² = 0, considering that the probability distribution satisfies Σ_{i=1}^{C} p(ω_i) = 1;
this corresponds to the maximum purity of D. If instead the probability distribution is
uniform, i.e., p(ω_i) = 1/C ∀i, the Gini index takes its maximum value
1 − Σ_{i=1}^{C} (1/C)² = 1 − 1/C.
As a splitting criterion, when a node t is partitioned into K subsets D_i, i =
1, ..., K with respect to a generic attribute A, the average Gini index (as an alternative
to the average entropy) Gini_split(D, A) is used, defined by

$$Gini_{split}(D, A) = \sum_{j=1}^{K} \frac{|D_j|}{|D|}\, Gini(D_j) \qquad (1.301)$$

where | • | indicates the number of samples of the training set D and of the subsets D_j. Basically,
for the training set D associated with the node under examination
t, the values of the Gini index relative to the subsets D_j of the partition
of D are weighted by the ratios |D_j|/|D|. The best attribute selected for the node t is
the one corresponding to the minimum value of Gini_split(D, A), evaluated for each
attribute.
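A minimal Python sketch of Eqs. (1.300) and (1.301) is given below (not from the book; the partition used in the example is hypothetical).

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_i p(w_i)^2, Eq. (1.300)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(partitions):
    # Average Gini index of a partition {value: list of labels}, Eq. (1.301)
    n = sum(len(part) for part in partitions.values())
    return sum(len(part) / n * gini(part) for part in partitions.values())

# Hypothetical binary split of 10 two-class samples on an attribute A
partitions = {'A<=t': ['yes'] * 4 + ['no'] * 1, 'A>t': ['yes'] * 1 + ['no'] * 4}
all_labels = [lab for part in partitions.values() for lab in part]
print(round(gini(all_labels), 2))        # impurity before the split: 0.5
print(round(gini_split(partitions), 2))  # weighted impurity after the split: 0.32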
In the simplest case, that is, with a training set D consisting of samples belonging
to only two classes (ω1 and ω2 ), the Gini index given by the (1.300) and the average
Gini index are
Gini(D) = p(ω1 ) p(ω2 ) (1.302)

$$Gini_{split}(D_1, D_2, A) = \frac{|D_1|}{|D|}\, Gini(D_1) + \frac{|D_2|}{|D|}\, Gini(D_2) \qquad (1.303)$$

where the training set D relative to the node under examination t is partitioned into
the two subsets D_1 and D_2. In two-class classification problems, the Gini index (1.302)
can be interpreted as a variance of the impurity. In fact, we can imagine, for the node t, that
the associated samples of D are the realizations of a random Bernoulli experiment
in which the Bernoulli random variable is the class ω to be assigned to each sample.
Assigning the value 1 to the class ω1 and 0 to the class ω2, we will have

$$\omega = \begin{cases} 1 & \text{with probability } p(\omega = 1) = p(\omega_1) \\ 0 & \text{with probability } p(\omega = 0) = 1 - p(\omega_1) \end{cases} \qquad (1.304)$$

Thus we obtain, respectively, the expected value E(ω) = μ = p(ω1) and the
variance Var(ω) = σ² = p(ω1)(1 − p(ω1)) = p(ω2)(1 − p(ω2)) = p(ω1) p(ω2) = Gini.
If the two classes are equiprobable, that is, p(ω1) = p(ω2) = 1/2 (the worst condition for the
classification), the variance reaches its maximum value, while it

Fig. 1.54 Splitting criteria compared for a two-class classification problem: the curves of the entropy
H(D), the Gini index Gini(D), and the misclassification error are shown as a function of the probability
p (horizontal axis from 0 to 1, vertical axis from 0 to 1)

will take its minimum value (zero) when the samples belong to a single class, i.e., when
p(ω1) is equal to 0 or 1 (the best condition for classification).
The splitting criterion based on the Gini index, therefore, involves partitioning the
training set D associated with the node under examination t into two subsets, minimizing
the value of the variance. Figure 1.54 shows the comparison between the splitting
criteria based on the entropy, Eq. (1.294), on the Gini index, Eq. (1.303),
and on the misclassification error MC(D) given by MC(D) = min(p(ω1), p(ω2)),
for a two-class problem, remembering that p(ω1) and p(ω2) indicate,
respectively, the probability that a generic sample belongs to the class ω1 or to
the class ω2.
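The three impurity measures compared in Fig. 1.54 can be tabulated with the short sketch below (illustrative Python code, using the two-class Gini form p(ω1) p(ω2) adopted in the text).

import math

def entropy2(p):
    # Two-class entropy H(p) in bits: 0 for a pure node, maximum 1 at p = 0.5
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini2(p):
    # Two-class Gini impurity in the form used in the text: p(w1) * p(w2)
    return p * (1 - p)

def misclassification(p):
    # Misclassification error MC(D) = min(p, 1 - p)
    return min(p, 1 - p)

# All three measures vanish for pure nodes (p = 0 or 1) and peak at p = 0.5
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print('p=%.1f  H=%.3f  Gini=%.3f  MC=%.3f'
          % (p, entropy2(p), gini2(p), misclassification(p)))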
Returning to the CART algorithm for the construction of the binary tree, the Gini
index (a measure of impurity) given by (1.303) is used to choose the attribute
corresponding to the minimum value of the impurity when partitioning the training set
of the node under examination into two subsets (relative to the two child nodes).
Equivalently, as an optimal splitting criterion, the measure that maximizes the impurity
gradient (i.e., the reduction of the impurity) is given by

$$Gini_{split}(D_1, D_2, A) = 1 - \frac{|D_1|}{|D|}\, Gini(D_1) - \frac{|D_2|}{|D|}\, Gini(D_2) \qquad (1.305)$$
Therefore, the splitting strategy is to choose the attribute A that maximizes
Gini_split as given by (1.305). If instead the entropy is used as a measure of impurity, the
corresponding strategy is to choose the attribute that produces the highest
value of the information gain. The Gini index tends to isolate the largest class from
all the others, while the entropy-based criteria tend to find sets of more balanced
classes.
The construction of a multiclass binary decision tree is made with the Twoing
criterion, an extension of the Gini index. The strategy is to find the partition that best
divides the groupings of samples belonging to C classes. The approach is to optimally divide the
samples of all the classes into two superclasses: C1, which contains a subset of all
the classes, and C2, which contains the rest of the remaining samples. The strategy
then consists in choosing the attribute A by applying the two-class Gini index to the
superclasses C1 and C2 thus created. In other words, to obtain an optimal splitting,
one has to choose the optimal superclasses in addition to the attribute.

1.16.2 Advantages and Disadvantages of Decision Trees

The classification based on decision trees requires the prediction of a discrete value
of the class to which a sample belongs. This is achieved by learning the decision
tree starting from the training set of the samples whose class is known in advance.
The tree is constructed by iteratively selecting the best attribute and partitioning the
training set according to this attribute once the relative information content has been
evaluated using entropy, or the information gain or impurity measurement of the
partitioned training set in each node. The pruning process is applied after the tree is
built to eliminate redundant nodes or subtrees. In essence, the inductive decision tree
is a nonparametric method for creating a classification model. Therefore, it does not
require any prior knowledge of the probability of class distribution.
Among the advantages of decision trees are the following:

(a) They are self-explanatory, in particular when they are well compacted, and are then
easily understandable. If the number of leaf nodes is small, they are accessible
even to nonexperts.
(b) Their immediate conversion into a set of easily understandable rules.
(c) They can handle samples whose attributes can be nominal or numerical (discrete
and continuous, CART).
(d) They allow accurate estimates even when the training set data contains noise,
for example samples with incorrect class or attributes with inaccurate values
(CART).
(e) They can also manage training sets with missing attributes.
(f) They manage attributes with different costs. In some applications it is convenient
to build the decision tree by placing the least expensive attributes as close as
possible to the root node with the highest probability to verify them. One way
to influence the choice of the meaningful attribute in relation to the cost of the
attribute A is achieved by inserting the term Cost (A) into the function that
chooses the optimal attribute.
(g) Decision trees are based on nonparametric approaches and this implies that they
have no assumptions about the distribution of attributes and the structure of the
classifier.

Among the disadvantages, we highlight the following:

(a) Limited scalability to very large training sets, in particular when there are many
attributes, which also require considerable computational complexity.
(b) Different algorithms, such as ID3 and C4.5, require target attributes with only
discrete values.

(c) The greedy nature of the tree growth algorithms, based on selecting an attribute
to partition the training set, does not take into account, with the top-down approach,
the intrinsic irrelevance that an attribute may have on future partitions.
(d) The approach used for the construction of decision trees is of the divide and
conquer type, which tends to work correctly if there are highly relevant attributes,
but less so if many complex instances are present. One reason is that other classifiers
can compactly describe a decision function that would be very difficult to represent with
a decision tree. Furthermore, since most decision trees divide the instance space into
mutually exclusive regions to represent a concept, in some cases the tree must
contain several duplicates of the same subtree in order to represent the classifier.

1.16.2.1 Computational Complexity of Decision Trees


The construction of an optimal decision tree falls into the computational complexity
class of NP-complete problems.^30
These types of NP-complete problems often require exponential computation times
that make them intractable, so one falls back on heuristic approaches, even if they give
approximate solutions. This explains the development of various algorithms for the
construction of decision trees based on greedy, top-down, recursive
partitioning strategies that require reasonable computation times even with a significant number
of samples.
If we assume a training set with N samples each with M attributes, the extension of
the tree, given by its maximum depth p, would be of the order of log N or O(log N ).
In the first instance, the computational cost for the construction of an inductive tree
is O(M N log N ). In fact, for each level of the tree, the entire training set of N
instances must be considered, and since there are log N number of levels (depth),
the computational load required for an attribute is O(N log N ). It follows that the
overall computational load is O(M N log N ) having to consider all M attributes in
each node.
The impact of the pruning operators on the computational load results in O(N) for
subtree replacement and O(N (log N)^2) for subtree raising. The higher cost of the
latter operator is due to the reclassification of the instances for each node between the

30 In the theory of computational complexity, decision problems are grouped into two
classes: the problems P and NP. The first includes all those decision problems that can be solved
(on a deterministic Turing machine) in a time that is polynomial with respect
to the size of the problem, that is, they admit algorithms whose worst case is polynomial;
it includes all tractable problems. The second class contains the problems whose solutions can be verified
in polynomial time but for which no polynomial-time solving algorithm is known: for these, every known
solving algorithm requires, in the worst case, a computation time that is exponential, or in any case
asymptotically greater than polynomial. In the latter case, we have problems also called intractable in terms of
computation time. The NP-complete problems are the most difficult problems of the
class NP (nondeterministic polynomial time). In fact, if an algorithm were found that solves any
NP-complete problem in a reasonable time (i.e., in polynomial time), then it could be used to
reasonably solve every NP problem. The theory of complexity has not yet established whether the class
NP is strictly larger than the class P or whether the two coincide.

leaf node and the root node which is of the order of O(log N ) (average cost). The total
number of reclassifications is O(N log N ). It follows that the overall computational
load of the inductive decision tree is

O(M N log N ) + O(N (log N )2 )

1.17 Hierarchical Methods

In the previous paragraphs, we described supervised classifiers based on the knowledge
of a training set and a test set of samples whose class is known in advance.
Unsupervised classification methods (also known as clustering) have been previously described,
based either on nonparametric unsupervised learning, such as the partitional algorithm
k-means (see Sect. 1.6.6) and the methods using proximity measures (Sect. 6.4 Vol. I), or on
parametric unsupervised learning, such as the ML and EM algorithms (see Sect. 1.7).
This section describes a hierarchical clustering method based on nonparametric
unsupervised learning, where there is no knowledge of the class distribution
function. The goal is to find a natural grouping of the patterns (clusters) in the multidimensional
feature space. In this context, we do not even know the
number of classes and, while in partitional clustering (k-means) we obtain a partition
of the set of patterns into nonoverlapping clusters, in hierarchical clustering we
have nested clusters organized in a tree. Each tree node represents a nested cluster
(excluding the leaf nodes, which represent patterns), that is, the union of its child subtrees (which
represent sub-clusters).
The root of the tree represents the cluster containing all the patterns to be classified
(see Fig. 1.55). Hierarchical clustering is commonly used in biological taxonomy to
hierarchically classify animal species (for example, insects) starting from very large
families to smaller ones.
Hierarchical clustering methods are divided into two groups: agglomerative and divisive.
The agglomerative methods, also called bottom-up or merging, start with clusters
consisting of individual patterns and then, going up in the tree structure, group
into one cluster the patterns or sub-clusters that are similar according to a similarity
criterion.
The divisive methods, also called top-down or splitting, start with a single cluster
that includes all the patterns and, at each tree level, partition into sub-clusters
the patterns that are most dissimilar from each other, until each pattern
ends up alone in a cluster.
A hierarchical cluster, once the procedure is completed, is normally represented
graphically as a binary tree, named dendr ogram to highlight the cluster struc-
ture. Basically, a dendrogram is a tree graph whose extreme vertices (the leaves)
represent the classified patterns. This representation has the utility of highlighting
the similarity between the clusters in the various levels of the tree (vertical axis).
Figure 1.55a shows an example of a dendrogram for the hierarchical classification

Fig. 1.55 Hierarchical clustering of 5 patterns represented graphically in a as a dendrogram (the
agglomerative direction proceeds bottom-up, the divisive direction top-down) and in b as a nested
Venn diagram, which does not show the quantitative information of similarity

of 5 patterns. An alternative representation of hierarchical clustering is given by
the nested Venn diagram, where each level of the cluster contains the sub-cluster sets
(see Fig. 1.55b). A further representation, based on sets, is expressed in textual
form: {C1, {C2, {C3, {C4, C5}}}}. These last representations cannot convey the quantitative
information of similarity that is instead highlighted by the dendrogram, which is
therefore normally preferred.

1.17.1 Agglomerative Hierarchical Clustering

The pseudo-code of the agglomerative hierarchical clustering algorithm is shown


below (Algorithm 12).

Algorithm 12 Pseudo-code of the basic hierarchical agglomerative clustering algorithm

1: Input: Number of clusters C and number of patterns N
2: Output: Construction of the dendrogram with K clusters
3: Initialize the algorithm by imposing that each pattern of the set D is a cluster: C = N
4: repeat
5:    Find the two closest clusters
6:    Group them into a new cluster
7: until a desired number K of clusters remains

The algorithm can end earlier if you also impose a predefined number of clusters
Cmax to be extracted. As described in the algorithm, the essential step (line 5) is
the calculation of similarity, i.e., closeness between two clusters. The similarity
calculation method characterizes the various hierarchical agglomerative clustering
algorithms. The similarity measure is normally associated with the measurement of

Fig. 1.56 Graphical representation of the types of calculation of distance measurement (proximity)
between clusters: a MIN, minimum distance (single linkage); b MAX, maximum distance (complete
linkage); c average distance (average linkage); d distance between centroids

distance between the patterns of two clusters. Figure 1.56 graphically shows 4 ways
to calculate the distance between the patterns of two clusters: minimum distance dmin ,
maximum distance dmax , average distance dmed , and distance between centroids dcnt .

1.17.1.1 Single Linkage or Nearest Neighbor—NN


dmin indicates the minimum distance between two patterns x and y of a pair of clusters
ωi and ω j , defined as follows:

$$d_{min}(\omega_i, \omega_j) = \min_{x \in \omega_i,\, y \in \omega_j} \|x - y\| \qquad (1.306)$$

The hierarchical agglomerative clustering algorithm based on the minimum distance


dmin is also called Single Linkage or Nearest Neighbor. If the algorithm ends with a
single cluster, a minimum spanning tree (MST) is obtained.31 This algorithm tends
to favor elongated groupings and is sensitive to noise.

1.17.1.2 Complete Linkage or Farthest Neighbor


dmax indicates the maximum distance between two patterns x and y of a pair of
clusters ωi and ω j , defined as follows:

$$d_{max}(\omega_i, \omega_j) = \max_{x \in \omega_i,\, y \in \omega_j} \|x - y\| \qquad (1.307)$$

The hierarchical agglomerative clustering algorithm based on the maximum distance


dmax is also called Complete Linkage or Farthest Neighbor. This algorithm tends to
create compact groupings. According to graph theory, each cluster generated with
this algorithm constitutes a complete subgraph (each node representing a pattern is
connected with all adjacent nodes).

31 In graph theory, given a spanning tree with weighted arcs, it is also possible to define the
minimum spanning tree (MST), that is, the spanning tree for which the sum of the weights of the
arcs is minimum.

1.17.1.3 Average Distance


dmed indicates the distance between the two clusters ωi and ω j by first finding the
values of the mean for all the patterns in the two clusters and then calculating the
distance between the clusters as the distance between the values of the averages,
defined as follows:
1  
dmed (ωi , ω j ) = x−y (1.308)
Ni N j x∈ω y∈ω
i j

where Ni and N j are the number of the patterns, respectively, of the classes ωi and
ωj.

1.17.1.4 Distance Between Centroids


dcnt indicates the distance between the two clusters ωi and ω j that corresponds to
the distance calculated between the respective centroids μi and μ j , given by

$$d_{cnt}(\omega_i, \omega_j) = \|\mu_i - \mu_j\| \qquad (1.309)$$

Distance measures based on d_cnt and d_med are more robust in the presence
of noisy patterns (outliers) than the distance measures d_min and d_max, which
are more sensitive to outliers. Furthermore, it should be noted that the distance
between centroids d_cnt is computationally simpler than d_med, which instead requires
the calculation of N_i × N_j distances.
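The four inter-cluster distances of Eqs. (1.306)–(1.309) can be sketched in Python as follows (illustrative code, not from the book; the two small clusters are hypothetical).

import numpy as np

def pairwise_distances(A, B):
    # Euclidean distances between every pattern of cluster A and every pattern of cluster B
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def d_min(A, B):                                # single linkage, Eq. (1.306)
    return pairwise_distances(A, B).min()

def d_max(A, B):                                # complete linkage, Eq. (1.307)
    return pairwise_distances(A, B).max()

def d_med(A, B):                                # average distance, Eq. (1.308)
    return pairwise_distances(A, B).mean()

def d_cnt(A, B):                                # distance between centroids, Eq. (1.309)
    return np.linalg.norm(np.mean(np.asarray(A, float), axis=0) -
                          np.mean(np.asarray(B, float), axis=0))

# Two small hypothetical 2-D clusters
A = [(0, 0), (1, 0)]
B = [(4, 0), (5, 0)]
print(d_min(A, B), d_max(A, B), d_med(A, B), d_cnt(A, B))  # 3.0 5.0 4.0 4.0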

Algorithm 13 Pseudo-code of the basic hierarchical divisive clustering algorithm

1: Input: Number of clusters C, desired number of clusters Cmax, and number of patterns N
2: Output: Construction of the dendrogram
3: Initialize the algorithm by imposing C = 1, that is, the set D is a single cluster
4: while C ≤ Cmax do
5:    Find the "worst" cluster
6:    Divide it into two smaller clusters
7:    C = C + 1
8: end while

1.17.2 Divisive Hierarchical Clustering

Several algorithms have been developed for divisive hierarchical clustering; they
differ in the way the worst cluster is defined (for example,
the one with the largest number of patterns, the largest diameter, the highest variance, or the largest
sum of squared errors) and in how clusters are divided (for example, a median split along

Table 1.2 Initial distance matrix


Foggia Bari Taranto Brindisi Lecce
Foggia – 137 217 250 283
Bari 137 – 97 117 151
Taranto 217 97 – 55 107
Brindisi 250 117 55 – 39
Lecce 283 151 107 39 –

the direction of an attribute, perpendicular to the direction of the highest variance,


evaluation of the highest dissimilarity).
The pseudo-code of the divisive hierarchical clustering algorithm is reported in
Algorithm 13. A divisive clustering algorithm named DiAna (Divisive Analysis Clustering)
[32] is based on the criterion of maximum distance (i.e., minimum similarity)
between the patterns in a cluster for dividing the clusters, and on the maximum
diameter for selecting the cluster to split. The process is iterated until the imposed number of
clusters is reached. The computational load required by divisive clustering is higher
than that of the agglomerative approach, which is the most commonly used.
In general, the complexity of a hierarchical clustering of N patterns is O(N^3),
since N steps are needed to construct the entire dendrogram and each step requires
scanning and updating the similarity matrix. The memory space
required by the similarity matrix is O(N^2) (although the matrix is symmetric), and the matrix
is updated and analyzed in each iteration of the algorithm. This results in an excessive
computational load for sets of patterns with very large N.
Hierarchical clustering algorithms, despite being simple, are strongly conditioned
by the criteria used to select the patterns to be aggregated or divided. This conditioning
becomes critical because, once a set of patterns has been aggregated or divided, the process in the
subsequent iteration operates directly on the new sets without taking into account the previous
aggregations or divisions. It follows that any merge or split error occurring in
a given iteration is propagated until the end of the process, generating incorrect clusters.
To mitigate these errors, algorithms have been developed that improve clustering
performance by combining hierarchical methods with other classification methods,
such as the well-known BIRCH [33], ROCK [34], and Chameleon [35] methods.

1.17.3 Example of Hierarchical Agglomerative Clustering

To illustrate the procedure of a hierarchical agglomerative algorithm, Table 1.2
reports the distances between 5 cities. The distance between two clusters is
defined with the minimum distance method (maximum similarity); in this case,
the distance between the cities, which represent the patterns, is used.

Table 1.3 Distance matrix after Step 1


Foggia Bari Taranto Brindisi, Lecce
Foggia – 137 217 250
Bari 137 – 97 117
Taranto 217 97 – 55
Brindisi, Lecce 250 117 55 –

The procedure involves the following steps:

Step 0 Initially, we start with all the patterns as individual clusters {Foggia},
{Bari}, {Taranto}, {Brindisi}, {Lecce}. From Table 1.2, we observe that the minimum
distance between clusters is the one between the two clusters {Brindisi}
and {Lecce}, with value 39.
Step 1 Once the clusters at minimum distance are found, they are merged into
a single cluster and the distance table is updated considering the current clusters
{Foggia}, {Bari}, {Taranto}, {Brindisi, Lecce}. Table 1.3 shows
that the minimum distance between clusters is the one between the clusters
{Brindisi, Lecce} and {Taranto}, with value 55.
Step 2 The clusters found at minimum distance are merged into a single
cluster and the distance table is updated considering the current clusters
{Foggia}, {Bari}, {Brindisi, Lecce, Taranto}. From Table 1.4, the
minimum distance between clusters is the one between the clusters
{Brindisi, Lecce, Taranto} and {Bari}, with value 97.
Step 3 The clusters found at minimum distance are merged into a single
cluster and the distance table is updated considering the current clusters
{Foggia} and {Brindisi, Lecce, Taranto, Bari} (see Table 1.5). This last table
indicates that the minimum distance between the only two remaining clusters
is 137, and they are merged into a single cluster {Brindisi, Lecce,
Taranto, Bari, Foggia}.
Step 4 The procedure ends, obtaining the single cluster that includes all the
patterns.

Figure 1.57 shows, through a dendrogram, the results of the hierarchical agglomerative
algorithm in the single linkage version with the procedure described above.
The vertical position (height) at which two clusters are merged in the dendrogram
represents the distance between the two clusters. It can be observed that in step 0 the
clusters {Brindisi} and {Lecce} are merged at height 39, corresponding to their
minimum distance.
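The same single linkage clustering of the five cities can be reproduced with SciPy's hierarchical clustering utilities (an external library not used in the book; the sketch below is only illustrative). The merge heights returned in the third column of the linkage matrix match those of Fig. 1.57.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

cities = ['Foggia', 'Bari', 'Taranto', 'Brindisi', 'Lecce']
D = np.array([[  0, 137, 217, 250, 283],
              [137,   0,  97, 117, 151],
              [217,  97,   0,  55, 107],
              [250, 117,  55,   0,  39],
              [283, 151, 107,  39,   0]], dtype=float)

# Single linkage (minimum distance) on the condensed form of the distance matrix
Z = linkage(squareform(D), method='single')
print(Z[:, 2])  # merge heights: 39, 55, 97, 137, as in Fig. 1.57

# Leaf order of the resulting dendrogram (no_plot=True avoids the need for matplotlib)
print(dendrogram(Z, labels=cities, no_plot=True)['ivl'])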

Table 1.4 Distance matrix after Step 2


Foggia Bari Brindisi, Lecce,
Taranto
Foggia – 137 217
Bari 137 – 97
Brindisi, Lecce, Taranto 217 97 –

Table 1.5 Distance matrix after Step 3


Foggia Brindisi, Lecce, Taranto, Bari
Foggia – 137
Brindisi, Lecce, Taranto, Bari 137 –

[Dendrogram of Fig. 1.57: vertical axis, distance in km; the merges occur at heights 39 (Brindisi–Lecce), 55 (joining Taranto), 97 (joining Bari), and 137 (joining Foggia), going from 5 clusters down to a single cluster.]

Fig. 1.57 Dendrogram generated with the hierarchical agglomerative algorithm in the single linkage
version, which considers the minimum distance between two clusters. In the example considered,
with 5 cities, the clusters are initially 5, and in the final step the algorithm ends with a single cluster
containing all the patterns (cities)

1.18 Syntactic Pattern Recognition Methods

While the statistical approach to the recognition of an object (pattern) is based on
a quantitative evaluation of its descriptors and on the prior knowledge of a model,
the syntactic approach is based on a more qualitative description. This latter approach
can be used when a complex object is not easily described by its features, but a
hierarchical representation of it is available, composed of elementary
components of the object for which some relational information can be considered.
In fact, the basic idea of a syntactic (also called structural) method
for recognition is to decompose a complex image into a hierarchy of elementary
structures and to define the rules with which these elementary structures (sub-patterns
or primitives) relate to each other, allowing their recombination to
generate higher level structures. With this syntactic or structural approach, it is therefore
possible to describe complex patterns by adequately breaking them down into
robust primitives and by using composition or production rules that include the

[Figure: the scene S is described by the chain-code string 0 0 3 3 5 5 0 0 2 2 4 4 6 6 and decomposed hierarchically into a triangle A, with sides coded 0, 3, 5, and a rectangle B, with sides coded 0, 2, 4, 6.]

Fig. 1.58 Syntactic description of a scene whose objects are decomposed into oriented segments
representing primitives (terminal symbols)

relational information of such primitives normally not considered with quantitative


approaches.
The various syntactic pattern recognition approaches adopt as their strategy the
decomposition of a complex pattern into elementary sub-parts related to each other
hierarchically, in analogy to what happens in natural language when a proposition
is broken down into words formed by the letters of an alphabet. For example, the objects in
the image of Fig. 1.58 can be represented by their contours, which can be broken down
into horizontal, vertical, and oblique segments, indicated with Freeman's chain code
(see Sect. 7.3 Vol. I), which are considered the primitives.
The objects of the scene (triangle and rectangle) are represented graphically in a
hierarchical way, broken down into their segments (the primitives), just as a
proposition of natural language can be decomposed (see Fig. 1.59). In this example,
the letters of the alphabet are considered the primitives and sets of words are used
to describe propositions, just as a set of segments can describe a geometric figure
(pattern) through a structural description obtained with an adequate language. In
analogy with the construction of a proposition, which requires a set of production rules
characterizing a given language, the composition of a pattern through its
primitives also needs a set of production rules, i.e., a grammar [36]. Different grammars
can generate different languages.
In the syntactic approach to pattern recognition, it is assumed that the primitives
and the production rules have been well defined; through grammatical syntactic
analysis (called parsing, or syntax analysis, performed by a syntactic analyzer), it is then determined
whether a pattern belongs to the class of patterns describable by the associated grammar.
In summary, the syntactic or structural method applied to images involves the
following processes:

1. Definition of the primitive structures of a pattern.
2. Definition of the relations existing between the primitives.
3. Construction of the grammar for each class of patterns (grammatical inference).
4. Extraction of the primitives and intermediate-level structures for each class of patterns
   and description of the relationships between them.
5. From the results of the syntactic analysis, identification of the class of the pattern
   associated with the appropriate grammar.

[Figure: parse tree of the sentence "Lecce is a baroque city": Proposition → Noun_phrase (Noun_proper: Lecce), Verb_phrase (Verb: is), Noun_phrase (Article: a, Adjective: baroque, Substantive: city).]

Fig. 1.59 Syntactic tree of a proposition of a natural language

In essence, the first two processes constitute the learning phase; the third process
concerns the construction of the grammar, which normally requires user intervention and is
rarely automated; the last two processes constitute the recognition and verification
phase.

1.18.1 Formal Grammar

A set of patterns belonging to a class can be represented by a set of strings (concatenated symbols), that is, by the words of a formal language. Such strings are produced through a set of rules defined by a grammar G. In the pattern recognition context, a grammar G constitutes the model with which to syntactically generate the strings representing the patterns of a given class or, given a string, to verify whether it can be syntactically generated by G. A formal grammar [37] G = (VT , VN , P, S) is characterized by the following 4 components:

1. Terminal Symbols VT . A finite and non-empty set of terminal symbols, also


known as primitive, which constitutes a subset of the alphabet V. For example,
the letters of a word in formal language.
2. Nonterminal symbols VN . A finite and non-empty set of nonterminal symbols,
also known as variables or intermediate symbols or internal symbols. The sets
VT and VN satisfy the following equations:
< =
VT VN = " VT VN = V (1.310)

3. Production rules P. A set of production rules, also called rewrite rules. Each production rule is a pair of strings of the type <α, β> which, as a binary relation of finite cardinality on V × V, is indicated as

α→β (1.311)

where the left string α of the rule must contain at least one nonterminal symbol,
i.e.,
α ∈ (VT ∪ VN)* ◦ VN ◦ (VT ∪ VN)*   and   β ∈ (VT ∪ VN)*      (1.312)

where α is a string (word) that contains at least one nonterminal symbol and β
is any string of terminal and nonterminal symbols or empty.32
4. Nonterminal initial symbol S. Also called axiom where S ∈ VN .

It should therefore be noted that every production rule (1.311) states that the string α can be replaced by the string β and defines how, starting from the axiom, strings of terminal and nonterminal symbols can be generated until a string of only primitives is reached. Given a grammar G, the language L(G) generated by the grammar is the set of words (strings) of primitives derivable by applying a finite sequence of production rules of the form (1.311).

1.18.1.1 Example 1: Generating Strings a^n b^n with n ≥ 1


Let G(VT , VN , P, S) be the grammar with the following features:

VT = {a, b} VN = {S} P = {S → aSb, S → ab}

G can generate all the strings of the type:

a ◦ a ◦ ··· ◦ a (n times) ◦ b ◦ b ◦ ··· ◦ b (n times) = a^n b^n

with n ≥ 1 and remembering that “◦” is the symbol of character concatenation.


Applying the first rule of P to the axiom (i.e., replacing S with the string aSb), we
get aSb. By applying this rule (n − 1) times, we would get

S → aSb → aaSbb → ··· → a^{n−1} S b^{n−1}

32 The characters "*" and "◦" have the following meaning. If V is an alphabet that defines strings or words as sequences of characters (symbols) of V, the set of all strings defined on the alphabet V (including the empty string) is normally denoted by V*. The string 110011001 is a string of length 9 defined on the alphabet {0, 1} and therefore belongs to {0, 1}*. The symbol "◦" instead defines the concatenation or product operation ◦ : V* × V* → V*, which consists in juxtaposing two words of V*. This operation is not commutative but only associative (for example, mono ◦ block = monoblock, while abb ◦ bba = abbbba ≠ bba ◦ abb = bbaabb). It should also be noted that an empty string consists of 0 symbols, therefore of length 0, and is normally denoted by the neutral symbol ε, with |ε| = 0. It follows that x ◦ ε = ε ◦ x = x, ∀x ∈ V*. It can be shown that, given an alphabet V, the triad ⟨V*, ◦, ε⟩ is a monoid, that is, a set closed with respect to the concatenation operator "◦" for which ε is the neutral element. It is called the syntactic monoid defined on V because it is the basis of the syntactic definition of languages. The set of non-empty strings is indicated with V+ and it follows that V* = V+ ∪ {ε}.

Finally, applying the second rule S → ab, we obtain as a result the strings of the type a^n b^n. A more compact way to represent production rules uses the "|" symbol with the meaning of OR, writing:

α → β1 |β2 | · · · |βn (1.313)

instead of the explicit form:

α → β1 ,   α → β2 ,   ··· ,   α → βn      (1.314)

For the previous example, we would have P = {S → aSb | ab}.
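As a concrete illustration, the following minimal Python sketch (our own, not from the text; the names generate and in_language are arbitrary) applies the two productions P = {S → aSb | ab} to derive the strings a^n b^n and to test membership in L(G).

# Sketch: derive a^n b^n with the grammar of Example 1 and test membership.
def generate(n: int) -> str:
    """Derive a^n b^n by applying S -> aSb (n-1 times) and then S -> ab."""
    form = "S"
    for _ in range(n - 1):
        form = form.replace("S", "aSb", 1)   # first production
    return form.replace("S", "ab", 1)        # second production

def in_language(w: str) -> bool:
    """Check membership in {a^n b^n, n >= 1} by undoing the derivation."""
    while w.startswith("a") and w.endswith("b"):
        w = w[1:-1]
        if not w:                            # all (a, b) pairs consumed
            return True
    return False

print([generate(n) for n in (1, 2, 3)])              # ['ab', 'aabb', 'aaabbb']
print(in_language("aaabbb"), in_language("aabbb"))   # True False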

1.18.2 Language Generation

Having defined a formal grammar G = (VT , VN , P, S), we can generate a formal language L(G) consisting of all the strings on VT that can be generated by starting from the axiom S and iteratively applying the production rules of P until no intermediate (nonterminal) symbols remain. A precise formalization of a language L(G) generated by a grammar G involves the following:

1. Each string x of L(G) is made up only of primitives.
2. Each string can be derived from the axiom S by applying the production rules of P, expressed in the form (1.311) under the conditions given by (1.312).
3. The derivation of a string can occur by direct derivation, which consists in applying a single production rule, or by derivation, which consists in the repeated application of direct derivations.

Direct derivation: A string ψ is obtained by direct derivation from a string φ


(and is indicated as φ =⇒ ψ) if there are two strings γ and δ (also empty)
such that φ = γ αδ, ψ = γβδ, and the production rule α → β is contained in
P. For example, given the grammar G = ({a, b, c}, {S, B, C}, P, S) with the
production rules:

P = {S → aS | B, B → bB | bC, C → cC | c}

the string ψ = aaabbC is obtained by direct derivation from the string φ =


aaabB applying the fourth rule (B → bC), that is, aaabB =⇒ aaabbC.
Derivation: A string ψ is obtained by derivation from a string φ (denoted by φ ⇒* ψ) if there are n strings (n ≥ 1) α1 , α2 , . . . , αn such that α1 = φ, αn = ψ, and each string αi+1 is obtained from αi by direct derivation. Considering the grammar of the previous example, the string ψ = aabbC is obtained by

derivation from the string φ = aS as follows:

aS →(rule 1) aaS →(rule 2) aaB →(rule 3) aabB →(rule 4) aabbC   ⟺   aS ⇒* aabbC

where rule i indicates the i-th rule contained in P applied in the corresponding direct derivation.

4. A sentential form is any string x derivable from the axiom S of the grammar G, such that x ∈ V* and S ⇒* x.
5. According to point 1, a sentence or proposition of the language generated by G consists of the sentential forms made up of terminals only, that is,

L(G) = {x | x ∈ VT* ∧ S ⇒* x}

The grammar of the previous Example 1 generates the language:

{a^n b^n | n ≥ 1}

while the grammar:

G = ({a, b}, {S, B}, P = {S → aB | b, B → aS}, S)

generates the language:

{a^n b | n ≥ 0 and n even}

6. Two grammars G1 and G2 are said to be equivalent if L(G1) = L(G2). For example, the grammars:

G1 = ({a, b}, {S, A}, P1 = {S → Ab | b, A → aAa | aa}, S)

G2 = ({a, b}, {S, A}, P2 = {S → Ab, A → Aaa | ε}, S)

are equivalent because they both generate the language:

{a^n b | n ≥ 0 and n even}

(a brute-force check of this equivalence, up to a bounded string length, is sketched right after this list).

7. The application of the production rules does not guarantee the generation of a language string; in fact, it may produce a sentential form to which no production rule can be applied.
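The following Python sketch (our own; names and representation of the productions are arbitrary) enumerates, up to a maximum string length, the terminal strings derivable from the two grammars G1 and G2 above and compares them. It is only a finite sanity check, not a proof of equivalence, and it relies on the fact that in these grammars every production adds terminal symbols, so the breadth-first enumeration terminates.

from collections import deque

EPS = ""
G1 = {"S": ["Ab", "b"], "A": ["aAa", "aa"]}
G2 = {"S": ["Ab"], "A": ["Aaa", EPS]}

def language_up_to(grammar, axiom, max_len):
    """All terminal strings of length <= max_len derivable from the axiom."""
    words, seen, queue = set(), {axiom}, deque([axiom])
    while queue:
        form = queue.popleft()
        # expand the leftmost nonterminal (sufficient for context-free grammars)
        idx = next((i for i, c in enumerate(form) if c in grammar), None)
        if idx is None:
            if len(form) <= max_len:
                words.add(form)
            continue
        for body in grammar[form[idx]]:
            new_form = form[:idx] + body + form[idx + 1:]
            # prune: terminals already present can never disappear
            terminals = sum(1 for c in new_form if c not in grammar)
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return words

l1 = language_up_to(G1, "S", 9)
l2 = language_up_to(G2, "S", 9)
print(sorted(l1, key=len))   # ['b', 'aab', 'aaaab', ...]: a^n b with n even
print(l1 == l2)              # True within this length bound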

Returning to the proposition of Fig. 1.59, it is generated by a complex grammar, that of the Italian language, where the words represent the primitives while the intermediate strings are parts of the language structure. The components of this grammar for this sentence are

(a) Primitives:
VT = {Lecce, is, a, baroque, city}.
(b) Nonterminal words:
VN = {<proposition>, <noun_phrase>, <verb_phrase>, <noun_proper>, <verb>, <article>, <adjective>, <substantive>}.
(c) Partial production rules P:
P = {<proposition> → <noun_phrase><verb_phrase><noun_phrase>,
<noun_phrase> → <noun_proper> | <article><adjective><substantive>,
<verb_phrase> → <verb> | <verb><noun_phrase>,
<noun_proper> → Lecce,
<verb> → is,
<article> → a,
<adjective> → baroque,
<substantive> → city}

(d) The axiom, in this case, is S = <proposition>.

Starting from the axiom, root of the tree structure, and applying the above production
rules, the intermediate words belonging to VN (intermediate levels of the tree) are
generated up to the terminal words (the tree’s terminal nodes) which constitute a
proposition of the Italian language.
A grammar can be used in two ways:

Generative method: the grammar, by applying its structural production rules P, generates a string of primitives that represents a sentence of the language L(G).
Recognition method: given a sentence of a certain language, defined by the grammar G, a recognition algorithm verifies whether this sentence belongs or not to the language L(G), that is, whether or not it was generated by the grammar G.

In evaluating the applicability of the syntactic approach for pattern recognition, it is strategic to choose an adequate grammar defining a language L(G) (a subset of V*) such that, through the production rules P, a recognition algorithm can verify the membership of a pattern to a certain class. Considering only the terminal symbols (primitives) VT , L(G) is a subset of VT* and we are interested in a finite number of primitives. In general, languages have infinite size, considering that VT* is infinite and can generate an indefinite number of languages. In fact, the cardinality of the (infinite) set of the languages defined on a given alphabet V is greater than the cardinality of the infinite set of possible recognition algorithms. In other words, there are more languages than algorithms for recognizing elements of these languages.

1.18.3 Types of Grammars

The formal grammars used in the computer science context are known as Chomsky grammars [37], introduced in linguistics to study the syntactic analysis of natural languages, even though they proved more adequate for studying the syntactic characteristics of computer programming languages. Basically, Chomsky introduces restrictions on the type of production to differentiate the various grammars and thus characterize different classes of languages.

Type 0 grammars, not limited (Unrestricted Grammar, UR), with the most general productions:

α → β    α ∈ V* ◦ VN ◦ V*,  β ∈ V*      (1.315)

remembering that V = VT ∪ VN and that V* also includes the empty string ε. The languages that can be generated by type 0 grammars are type 0 languages. These grammars do not include any particular production restrictions. In fact, the language with strings {a^n b^n, n ≥ 0} is of type 0 since it can be generated with the grammar G0 = ({a, b}, {S, A, X}, P, S) with the nonrestrictive productions (different from those considered in the previous example for the same language) given by

P = {S → aAbX | ε, aA → aaAb, bX → b, A → ε}

Type 1 grammars, dependent on the context (Context Sensitive, CS), with productions of the type:

α → β    α ∈ V* ◦ VN ◦ V*,  β ∈ V+      (1.316)

with |α| ≤ |β| and remembering that V+ does not include the empty string. It follows from (1.316) that type 1 productions do not reduce the length of the sentential forms, unlike type 0 productions, which can admit empty strings in β, generating derivations that shorten the strings. The languages that can be generated by type 1 grammars are type 1 languages.
For example, the language {a^n b^n c^n, n ≥ 1} is of type 1 because it is generated by the type 1 grammar:

G1 = ({a, b, c}, {S, B, C}, P, S)

with the following productions:

P = {S → aSBC | aBC, CB → BC, aB → ab, bB → bb, bC → bc, cC → cc}

These productions respect the restriction imposed by (1.316) (they do not reduce the length of the sentential forms), i.e., for each production αi → βi we have |αi| ≤ |βi|. The derivation of the strings of this grammar G1

is obtained as follows:

S →(rule 1) aSBC →(rule 1) aaSBCBC →(rule 2) aaaBCBCBC →(rule 3) aaaBCBBCC →(rule 3) aaaBBCBCC →(rule 3) aaaBBBCCC →(rule 4) aaabBBCCC →(rule 5) aaabbBCCC →(rule 5) aaabbbCCC →(rule 6) aaabbbcCC →(rule 7) aaabbbccC →(rule 7) aaabbbccc

where rule i indicates the i-th rule contained in P applied in the corresponding direct derivation; in each step, the substring being rewritten is the one matching the left side of the rule indicated.
Type 2 grammars, called Context Free (CF), have productions of the following form:

A → β    A ∈ VN ,  β ∈ V+      (1.317)

It should be noted that the left part of a rule is formed by a single nonterminal symbol. The languages that can be generated by type 2 grammars are the non-contextual languages of type 2. For example, the language {a^n b^n, n ≥ 1} considered in the previous Example 1 is of type 2 because it is generated by the type 2 grammar G2 = ({a, b}, {S}, P, S) with the two production rules P = {S → aSb | ab}.
Type 3 grammars, called Regular or Finite State (FS), are based on productions of the following form:

A → aB   or   A → a,    A, B ∈ VN ,  a ∈ VT      (1.318)

Similarly to non-contextual grammars (type 2), regular grammars (type 3) also have production rules whose left side is a single nonterminal symbol. Furthermore, in a regular grammar the right side of a production is restricted to be ε (null), a terminal, or a terminal followed by a nonterminal symbol. The strings {a^n b, n ≥ 0} form a regular language, since they can be generated with the productions P = {S → aS, S → b}, which agree with the production rules given by (1.318). The regular languages generated with type 3 grammars are recognizable by finite state automata (a minimal recognizer for this grammar is sketched right after this list). FS grammars are the most widespread in the computer automation sector and are often used with graphical representations derived from graph theory.
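As an illustration of the last point, the following minimal Python sketch (our own; the state names and the TRANSITIONS table are arbitrary choices) encodes the regular grammar P = {S → aS, S → b} as a finite state recognizer for the language {a^n b, n ≥ 0}.

# Sketch: finite-state recognizer derived from the regular grammar S -> aS | b.
TRANSITIONS = {("S", "a"): "S",       # S -> aS : read 'a', stay in state S
               ("S", "b"): "ACCEPT"}  # S -> b  : read 'b', accept if input ends

def fs_recognize(w: str, start: str = "S") -> bool:
    state = start
    for ch in w:
        state = TRANSITIONS.get((state, ch))
        if state is None:             # no transition: the string is rejected
            return False
    return state == "ACCEPT"

print([fs_recognize(w) for w in ("b", "aab", "aaab", "aba", "aa")])
# [True, True, True, False, False]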

We have seen that the various grammars are characterized by the restrictions imposed on the production rules. It can be shown that the grammar classes of type i, which generate the language classes Li, include all grammars of type i + 1, for i = 0, 1, 2. Explicitly, we have that

L0 ⊃ L1 ⊃ L2 ⊃ L3      (1.319)

where the symbol ⊃ indicates the concept of superset, and (1.319) represents the hierarchy of Chomsky grammars. A language is strictly of type i if it is generated by a grammar of type i and there is no higher level grammar, of type j > i, that can generate it. It can be shown that the language {a^n b^n, n ≥ 1}, reported in the previous examples, is strictly of type 2, since it cannot be generated by any higher level grammar. Likewise, the language {a^n b^n c^n, n ≥ 1} is strictly of type 1, because there are no type 2 or type 3 grammars that can generate it.
All the languages presented are based on nondeterministic grammars. In fact, in the various production rules, the same string on the left can have different forms on the right, and there is no specific criterion for choosing which rule to select. Therefore, a language generated by a nondeterministic grammar does not privilege particular words. In some cases, it is possible to keep track of the number of substitutions made with the productions, so as to learn the frequency of the generated words and to estimate the probability with which a certain rule is applied. If a probability of application is associated with the productions, the grammar is called stochastic. In cases where probability properties cannot be applied, it may be useful to apply the fuzzy approach [38].
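The idea of a stochastic grammar can be illustrated with the following minimal Python sketch (our own; the probabilities 0.3/0.7 and the sampler name are arbitrary assumptions): each production of the grammar for {a^n b^n} is given an application probability, and strings are sampled accordingly, so that more frequent derivations yield more frequent words.

# Sketch: sampling strings from a stochastic version of S -> aSb | ab.
import random

RULES = {"S": [("aSb", 0.3), ("ab", 0.7)]}

def sample(axiom: str = "S", rng=random) -> str:
    form = axiom
    while any(c in RULES for c in form):
        i = next(i for i, c in enumerate(form) if c in RULES)
        bodies, weights = zip(*RULES[form[i]])
        body = rng.choices(bodies, weights=weights, k=1)[0]
        form = form[:i] + body + form[i + 1:]
    return form

print([sample() for _ in range(5)])   # e.g. ['ab', 'aabb', 'ab', 'aaabbb', 'ab']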

1.18.4 Grammars for Pattern Recognition

The theory of syntactic analysis introduced in the previous paragraphs can be used to define a syntactic classifier that assigns a membership class to a pattern (word). Given a pattern x (a string characterized by the pattern feature set, seen as a sequence of symbols) and a given language L generated by an appropriate grammar G, the recognition problem is reduced to determining whether the pattern string belongs to the language, x ∈ L(G). In problems with C classes, the classification of a pattern x can be solved by associating a grammar with each class. As shown in Fig. 1.60, an unknown pattern x is syntactically analyzed to find the language of membership Li associated with the grammar Gi that identifies the class ωi , i = 1, . . . , C. To avoid assigning the pattern to different classes, it is necessary to define the grammars Gi adequately, i.e., the languages Li must be mutually disjoint. In the presence of noise in the patterns, it can happen that a pattern does not belong to any language.
From a practical point of view, it is necessary that the patterns of the various classes consist of features or terminal elements (primitives) that represent the set VT . The selection and extraction of the primitives, and in particular of their structural relationships, is strategic and depends on the type of application. Once the primitives are defined, it is necessary to design the appropriate grammar through which the associated patterns are described. The construction of a grammar in the context of classification is difficult to achieve automatically; generally, it is defined by an expert who knows the application context well.

Fig. 1.60 Block diagram of a syntactic pattern recognition system: the input pattern is processed by a structural analyzer (parser) and the resulting structural pattern is compared against a library of structured class descriptions ω1, . . . , ωC

Once the grammar has been defined, a syntactic analysis process must be carried out for the recognition of the patterns generated by this grammar, i.e., given the string x, the recognition problem consists in finding L(Gi) such that:

x ∈ L(Gi)    i = 1, . . . , C      (1.320)

This syntactic analysis process, also called parsing, consists in constructing the derivation tree, useful for understanding the structure of the pattern (string) x and for verifying its syntactic correctness. The syntactic analysis is based on the attempt to construct the pattern string to be classified by applying an appropriate sequence of productions starting from the axiom symbol. If the applied productions are successful, the process converges to the test string and we have x ∈ L(G); otherwise, the pattern string is not classified. Let us now see how to formalize a syntactic tree (or derivation tree):

(a) A syntactic tree of a string pattern x of G has as root the axiom of G.
(b) Each non-leaf node is associated with a nonterminal symbol of VN .
(c) Each leaf node has a terminal symbol of VT .
(d) For a non-leaf node containing the symbol A, there exists a production A → β in P such that the sequence of the symbols of the node's children corresponds to the string β.
(e) The leaf-node symbols, read from left to right, correspond to the string x.
(f) A pattern string x ∈ L(G) if and only if there is a syntactic tree for x derived from G.

The search for a string of a language with syntactic analysis is in general a nondeterministic operation and can involve exponential computation times. To be efficient, syntactic analysis normally requires that the syntactic tree be constructed by analyzing a few symbols of the pattern string at a time (usually one symbol at a time). This implies that the production rules must be chosen so as to be suitable for syntactic analysis. Ambiguous grammars, which generate multiple syntactic trees for a given string, should be avoided. The reconstruction of the syntactic tree can take place in two ways: top-down (descending derivation) or bottom-up (ascending reduction).

Fig. 1.61 Descending syntactic analysis for the grammar with productions (1) S → AB, (2) S → cSc, (3) A → a, (4) B → bB, (5) B → b: starting from the axiom S, rules 2, 1, 3, 4, and 5 are applied in sequence up to the final tree on the right, whose leaves represent the derived string x = cabbc

1.18.4.1 Descending Syntactic Analysis


The top-down syntactic process attempts to construct the tree starting from the root node with the axiom S and then applies the productions of P appropriately to converge toward the test pattern string. At each level of the tree, each production expands a nonterminal symbol (for example, proceeding from left to right) of the current sentential form. For example, consider the grammar G = ({a, b, c}, {S, A, B}, P, S)
with the following production rules:

P = {S → AB | cSc, A → a, B → bB | b}

The construction of the tree (see Fig. 1.61) with the top-down approach to derive the
string pattern x = cabbc is given by the following derivations:

S →(rule 2) cSc →(rule 1) cABc →(rule 3) caBc →(rule 4) cabBc →(rule 5) cabbc

It is noted that the top-down process starts by expanding the axiom with the appropriate rule 2 (S → cSc) instead of rule 1 (S → AB), and that the third derivation expands the nonterminal symbol A with rule 3 (A → a) instead of expanding the nonterminal symbol B. The choice of another sequence of rules would have generated a pattern string different from x.
Figure 1.61 shows the entire tree built for the pattern string x = cabbc. As can be seen from the example, a strategy is needed for choosing the production rule with which to expand a nonterminal symbol, to avoid, in the case of a wrong choice, having to go back to the root of the tree and restart the parsing. To this end, several syntactic analysis algorithms have been developed. A deterministic analyzer is the following.
At any time, the analyzer reads only the next symbol of the string to be analyzed (one at a time, from left to right) and finds the set of productions of P that produce that symbol. Subsequently, the production rule is applied iteratively to expand a nonterminal symbol. If a is the leftmost current symbol of the string x under consideration and A is the current nonterminal symbol that must be replaced, only those productions whose expansion starts with a are chosen.
This syntactic analyzer is based on the construction of a binary matrix which, for each nonterminal symbol A, determines all the terminal and nonterminal symbols β such that there is a production A → β ···. In the top-down analysis, one must also manage the presence of left recursions, that is, of symbols X ∈ VN for which there is a derivation of the type X ⇒* Xα (with α ∈ V*), and the presence of common prefixes in the rules, that is, of distinct rules such as A → aα, A → aβ (with a ∈ VT and α, β ∈ V*). Left recursions and common prefixes are eliminated through equivalent transformations of the rules of G, leaving the generated language L(G) unchanged. For example, the rules with recursion on the left:

A → Aα1 | Aα2   and   A → β1 | β2

are replaced with the following transformed rules:

A → β1 A′ | β2 A′   and   A′ → ε | α1 A′ | α2 A′

with A′ a new nonterminal symbol. In a similar way, we proceed to eliminate the common prefixes present in the rules.
The top-down analyzer can be inefficient, especially when, during the construction of the tree, it follows several wrong paths and must repeatedly backtrack in the tree; the problem can be attenuated with better a priori knowledge of the rules to be applied. An alternative solution to backtracking is to apply different production rules (after eliminating ambiguities and left-recursive forms) in parallel, generating several syntactic trees and stopping the process as soon as one of the trees successfully converges.
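The following Python sketch (our own, not the table-driven deterministic analyzer described above) shows the top-down idea with brute-force backtracking for the grammar used in this section, P = {S → AB | cSc, A → a, B → bB | b}: the leftmost symbol of the current sentential form is either expanded (if nonterminal) or matched against the next input character.

# Sketch: backtracking top-down (descending) recognizer for the example grammar.
from functools import lru_cache

GRAMMAR = {"S": ("AB", "cSc"), "A": ("a",), "B": ("bB", "b")}
AXIOM = "S"

def top_down_accepts(text: str) -> bool:
    @lru_cache(maxsize=None)
    def expand(form: str, rest: str) -> bool:
        # `form` is the part of the sentential form still to be matched
        # against the unread suffix `rest` of the input string.
        if not form:
            return not rest                      # both exhausted: success
        head, tail = form[0], form[1:]
        if head in GRAMMAR:                      # nonterminal: try each production
            return any(expand(body + tail, rest) for body in GRAMMAR[head])
        # terminal: it must match the next input character
        return bool(rest) and head == rest[0] and expand(tail, rest[1:])
    return expand(AXIOM, text)

print([top_down_accepts(w) for w in ("cabbc", "ab", "ccabbbcc", "cab")])
# [True, True, True, False]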

1.18.4.2 Ascending Syntactic Analysis


The syntactic analyzer bottom-up constructs the syntactic tree of the input pattern
string starting from the leaves, i.e., the terminal symbols of the string, going up
to the root node, the axiom. In essence, it uses production rules in an inverse way
(with respect to the top-down analyzer), that is, it realizes a right reduction (the
inverse of the derivation) of the input string x ∈ VT∗ to the initial nonterminal symbol
of G. In each reduction, a substring β is being searched for in the current string
under examination which is similar with the string of the right part of a production
rule A → β ∈ G and then is replaced with β with A. The substring β is named
handle. With the appropriate selection of the substring β, in each reduction, leads
to a sequence of reductions that ends with the axiom symbol. Let us now consider
an example of a right reduction using the same grammar as the previous top-down
approach G = ({a, b, c}, {S, A, B}, P, S) with the rules of production:

P = {S → AB | cSc, A → a, B → bB | b}

Fig. 1.62 Ascending syntactic analysis for the same grammar of Fig. 1.61: rules 3, 5, 4, 1, and 2 are applied as reductions, from the leaves of the string x = cabbc up to the axiom S

The reconstruction of the tree (see Fig. 1.62) with the bottom-up approach to reduce
the pattern string x = cabbc is given by the following reductions:

cabbc →r(rule 3) cAbbc →r(rule 5) cAbBc →r(rule 4) cABc →r(rule 1) cSc →r(rule 2) S

where →r indicates the right reduction operator and rule i indicates the production of P whose right side (the handle, the substring being rewritten in each step) is replaced by its left side.
The bottom-up process, as highlighted in the example, finds in each reduction step a substring (the handle) that corresponds to the right side of a production of G. These reduction steps may not be unique; in fact, in the previous grammar, in the first step we could have reduced the symbol b with rule 5 instead of the symbol a with rule 3. Therefore, it is important to decide which substring is the handle to be reduced and which is the most appropriate reduction to select.
The substring to be reduced is defined by formalizing the characteristics of the handle. Let n be a positive integer, let x and β be two strings of symbols, and let A be a nonterminal symbol such that A → β is a production of the grammar G. The pair (A → β, n) is called a handle for the string (sentential form) x if there exist two strings of symbols α and γ such that

S ⇒* αAγ ⇒ αβγ = x      (1.321)

with n = |α| + 1 indicating the position in x of the leftmost symbol of β. In other words, a handle of a string x is a substring that is identical to the right side of a production A → β and whose reduction corresponds to one step of a rightmost derivation in reverse. Furthermore, γ is a string of terminal symbols, since αAγ ⇒ αβγ is a step of a rightmost derivation.
If we consider the previous grammar G, with the production rules P, the string cabbc has the handles (A → a, 2) and (B → b, 3). If G also had the production B → a, we should add the handle (B → a, 2); in this case, the grammar G would be ambiguous due to the existence of two handles with identical right side a in position 2. For the sentential form x = cAB, the resulting handle is (S → AB, 2). It is shown that, for an unambiguous grammar G, any right sentential form has a single handle.
The implementation of a bottom-up analyzer based on the handle uses different data structures: a stack to contain the grammar symbols (initially empty), an input vector (buffer) to contain the part of the input string x still to be examined (at the beginning, the whole string remains to be read), and a decision table. The bottom-up analyzer uses the following shift/reduce approach:

Shift: move the pointer of the vector containing the input string one character to the right and move the terminal symbol under consideration to the top of the stack. For example, if the current configuration is ABC ↑ abc, the shift produces ABCa ↑ bc, that is, the pointer is moved and the terminal symbol a is loaded on the stack.
Reduce: the analyzer knows that the handle is at the top of the stack; it locates it in the stack and decides with which nonterminal symbol (the left side of the corresponding production) to replace it.

A parser that uses the shift/reduce paradigm decides between the two actions by recognizing the presence of a handle at the top of the stack. The procedure ends correctly, issuing an accept message, when the symbol S appears at the top of the stack and the input string has been completely analyzed. An error is reported when the parser encounters a syntax error. The following (Algorithm 14) is a simple algorithm for a bottom-up parser based on shift/reduce operations.
Example of the operations of a shift/reduce parser.
Given the grammar G with the following productions:

P = {(1) S → E + E, (2) E → E ∗ E, (3) E → (E), (4) E → ID}

The useful reductions, with the bottom-up approach, to reduce the string x = ID + ID ∗ ID are

ID + ID ∗ ID →r(rule 4) E + ID ∗ ID →r(rule 4) E + E ∗ ID →r(rule 4) E + E ∗ E →r(rule 2) E + E →r(rule 1) S

Table 1.6 reports the results of the parsing algorithm below (Algorithm 14), based on the shift/reduce operations, for the syntactic analysis of the input string x = ID + ID ∗ ID using the grammar G with the 4 productions listed above.
Figure 1.63a shows the syntactic tree obtained with the results reported in Table 1.6, related to the parsing of the string x = ID + ID ∗ ID. A shift/reduce parser, even for a context-free (CF) grammar, can produce conflicts in deciding which action to apply, shift or reduce, or in deciding which production to choose for the reduction. For example, in step 6, the action performed is a shift (appropriate), but the action of reduce could have been applied with the production S → E + E, since the handle β = E + E is on the stack. This potential shift/reduce conflict means that the parser is not able to automatically decide which of the two actions to perform in order to converge to the axiom S.

Algorithm 14 Pseudo-code of a simple syntactic analyzer algorithm shift/reduce

1: Input: String x of length n; productions P of type A → β and root symbol S of grammar G.
2: Output: Bottom-up syntactic tree construction
3: Initialize the top of the stack TOF
4: k ← 1
5: car ← x(k)
6: while TOF ≠ S do
7:   if the top of the stack contains a handle β of a production A → β then
8:     Execute REDUCE: reduce β to A
9:     Remove the |β| symbols of β from the stack
10:    Insert A in the stack
11:  else if (k < n + 1) then
12:    Execute SHIFT:
13:    Insert car on the stack
14:    k ← k + 1
15:    car ← x(k) (if k ≤ n)
16:  else
17:    Report an error: the input string has been analyzed without reaching the root S; stop
18:  end if
19: end while

In fact, a wrong decision in favor of the reduction would have caused the interruption of the analysis of the input string, reaching the axiom prematurely (see Fig. 1.63b).
Therefore, it is not sufficient to verify whether there exists in the stack a string β identical to the right side of some production A → β; a strategy is needed to choose the appropriate handle so as to avoid generating different derivations. Normally, a syntactic analyzer based on an LR(1) grammar [39] is used, which through a deterministic recognizer is capable of realizing the right reduction of the input string. LR(1) grammars analyze the input from left to right (L) and produce a rightmost derivation in reverse (R), making the decision by observing one input symbol in advance. For an unambiguous context-free (CF) grammar, it is shown that a shift/reduce parser is possible using the stack, with the property that the handle is always at the top of the stack. Combining the stack information (current state) and the next symbol of the string under examination, a deterministic finite state (DFS) decision table is constructed to choose the appropriate action (shift, reduce), eliminating the conflicts mentioned above.

Table 1.6 Applied shift/reduce actions for the parsing of the input string x = ID + ID ∗ ID

Step | Stack          | Input string  | Action | Handle
1    | ↑              | ID + ID ∗ ID  | Shift  | –
2    | ↑ ID           | + ID ∗ ID     | Reduce | E → ID
3    | ↑ E            | + ID ∗ ID     | Shift  | –
4    | ↑ E +          | ID ∗ ID       | Shift  | –
5    | ↑ E + ID       | ∗ ID          | Reduce | E → ID
6    | ↑ E + E        | ∗ ID          | Shift  | –
7    | ↑ E + E ∗      | ID            | Shift  | –
8    | ↑ E + E ∗ ID   | (empty)       | Reduce | E → ID
9    | ↑ E + E ∗ E    | (empty)       | Reduce | E → E ∗ E
10   | ↑ E + E        | (empty)       | Reduce | S → E + E
11   | ↑ S            | (empty)       | Accept | –

Fig. 1.63 Bottom-up syntactic analysis with shift/reduce operations: a generation of the bottom-up parsing tree applied to the string x = ID + ID ∗ ID, relative to the actions shown in Table 1.6. b Construction of the tree as in (a), but in step 6, instead of shift, the action of reduce with production 1 is applied, highlighting the shift/reduce conflict situation which breaks the parsing of the string

Generally, the class of LR(1) grammars available to bottom-up parsers is wider than the class of LL(1) grammars on which top-down parsers are based. This can be explained by the fact that the bottom-up approach uses the information of the input string more effectively, considering that it begins the construction of the syntactic tree from the terminal symbols, the leaf nodes, instead of from the axiom. The disadvantage of the bottom-up parser is that only at the end of the procedure does it check whether the constructed tree ends in the axiom S; therefore, all trees are expanded, even those that cannot converge to the root node S. A possible strategy is a mixed one that combines the top-down and bottom-up approaches, with the aim of realizing a parser that prevents the expansion of trees that cannot converge to the axiom S and prevents the expansion of trees that cannot end on the input string.
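To make the shift/reduce mechanism and its conflicts concrete, the following Python sketch (our own; it is a brute-force recognizer, not the deterministic LR(1) analyzer discussed above) resolves shift/reduce and reduce/reduce ambiguities by exhaustive backtracking: at every step it tries each reduction whose right side lies on top of the stack, as well as a shift of the next input symbol.

# Sketch: backtracking shift/reduce recognizer for the example grammar of Fig. 1.61/1.62.
from functools import lru_cache

PRODUCTIONS = [            # productions A -> beta, stored as (A, beta)
    ("S", "AB"),
    ("S", "cSc"),
    ("A", "a"),
    ("B", "bB"),
    ("B", "b"),
]

def accepts(text: str, axiom: str = "S") -> bool:
    @lru_cache(maxsize=None)
    def search(stack: str, rest: str) -> bool:
        # accept when the whole input is consumed and only the axiom remains
        if not rest and stack == axiom:
            return True
        # try every reduction whose right side (a candidate handle) is on top
        for head, body in PRODUCTIONS:
            if stack.endswith(body) and search(stack[:-len(body)] + head, rest):
                return True
        # otherwise try to shift the next input symbol onto the stack
        return bool(rest) and search(stack + rest[0], rest[1:])
    return search("", text)

print([accepts(w) for w in ("cabbc", "cabc", "cab")])   # [True, True, False]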

1.18.5 Notes on Other Methods of Syntactic Analysis

Other methods of syntactic analysis use structural information among the primitives to better characterize the classes of objects. In these cases, the structural information and the relative relationships are well represented by relational graphs, whose nodes and arcs are associated with the primitives and with their structural relations. In particular, for each class of objects, the relational graphs of the prototype objects are built, and the class to which a test object belongs (also characterized in terms of a pattern string) is determined by comparing the prototype relational graphs with that of the test object. The problem is reduced to the isomorphism33 between relational graphs [40]. In the context of pattern recognition in images, nodes represent primitives while arcs are the relationships between them. Therefore, a complex object of the image can be represented by its contours, expressed with the Freeman code, or by elementary regions characterized by topological relations and other information (color, texture, ...), and described with a relational graph G(V, E, AV , AE , fV , fE ), where V represents the set of nodes, E the set of arcs, AV and AE the sets of attributes of nodes and arcs, respectively, while fV and fE are the functions that relate nodes and arcs to each other. The recognition between the test object and the prototype objects is reduced to the comparison (isomorphism) between the relational graph Gt of the test object and the prototype graphs Gi .
Considering the complexity of the description of the objects, the errors introduced by the vision system due to the variability of the lighting conditions and of the observation points, and the errors introduced by the algorithms for the extraction of the primitives from the image, an exact graph similarity is not evaluated; rather, the problem is reduced to evaluating, with an adequate metric, how similar two relational graphs are (graph matching). A further problem is the non-completeness of the observed test object (partial view) with respect to the prototype object; in this case, the isomorphism between graph and subgraph is evaluated or, in more complex situations, between subgraphs [41]. Another method of syntactic analysis is based on inferential grammars, which are normally learned from a set of prototype patterns. In these cases, the knowledge of a set of propositions belonging to a given grammar (the model grammar) is assumed. This inferential grammar is used to derive (or verify) whether a test proposition is structurally analogous to the prototype propositions [42–44].
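The comparison of attributed relational graphs can be sketched in Python with the NetworkX library (our own choice of tool; the attribute names "kind" and "rel" and the toy triangle graphs are arbitrary assumptions). The exact matcher shown here illustrates the isomorphism and subgraph-isomorphism tests mentioned above; inexact graph matching would require a similarity metric instead.

# Sketch: attributed relational graph comparison with NetworkX.
import networkx as nx
from networkx.algorithms import isomorphism

def make_graph(edges):
    g = nx.Graph()
    for (u, ku), (v, kv), rel in edges:
        g.add_node(u, kind=ku)          # node attribute: primitive type
        g.add_node(v, kind=kv)
        g.add_edge(u, v, rel=rel)       # edge attribute: structural relation
    return g

prototype = make_graph([(("s1", "segment"), ("s2", "segment"), "adjacent"),
                        (("s2", "segment"), ("s3", "segment"), "adjacent"),
                        (("s3", "segment"), ("s1", "segment"), "adjacent")])
test = make_graph([(("a", "segment"), ("b", "segment"), "adjacent"),
                   (("b", "segment"), ("c", "segment"), "adjacent"),
                   (("c", "segment"), ("a", "segment"), "adjacent")])

gm = isomorphism.GraphMatcher(
    prototype, test,
    node_match=isomorphism.categorical_node_match("kind", None),
    edge_match=isomorphism.categorical_edge_match("rel", None))
print(gm.is_isomorphic())           # True: same attributed structure
print(gm.subgraph_is_isomorphic())  # also usable when the test view is partial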

33 We speak of isomorphism between two complex structures when trying to evaluate the correspondence between the two structures or a similarity level between their structural elements. In mathematical terms, an isomorphism is a bijective map f between two sets consisting of similar structures of the same structural nature, such that both f and its inverse preserve the same structural characteristics.

Fig. 1.64 a The problem of exact string matching: given a text string T and a substring pattern x, normally much shorter than the text, we want to find the occurrences of x in T, with the relative displacements s. In the example, the pattern x occurs in T only once, for s = 10. b Occurrences of x in T can also overlap, as happens for s = 9 and s = 11

1.19 String Recognition Methods

Let us now analyze some pattern recognition methods where a pattern is understood as a sequence of elements (string), belonging to an alphabet, to be located in a larger string (also called text). In the literature, these methods are also known as the string-matching problem, or the correspondence between strings (patterns). String recognition algorithms have been developed for various applications, in particular in the processing of texts of various languages (Italian, English, ...), in compilers, in search engines for online texts, in the recognition of parts of contours of objects encoded as strings, and above all in bioinformatics for the analysis of genes (the DNA alphabet indicating the bases {Adenine A, Cytosine C, Guanine G, Thymine T}). The key problems of string recognition concern

String Matching, determines whether a string x appears in a text as a substring and, once found, also locates its position in the text.
Edit Distance, determines a measure of similarity between two strings x and y belonging to the same alphabet V, calculating the minimum number of basic operations, such as insertion, deletion, modification, and transposition, necessary to transform x into y (a minimal sketch is given right after this list).
String Matching with Error, evaluates, for an input string x, the minimum distance (or minimum cost) and the position in the text of any of its occurrences.
String Matching with special symbol, a problem analogous to string matching, but in the comparison between strings a special wildcard symbol is considered, with the meaning of being negligible and therefore comparable with any other symbol of the alphabet.
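As an example of the second problem, the following Python sketch (our own) computes the edit distance by dynamic programming, limited to insertion, deletion, and substitution; the transposition operation mentioned above is omitted for brevity.

# Sketch: Levenshtein edit distance by dynamic programming.
def edit_distance(x: str, y: str) -> int:
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all remaining characters of x
    for j in range(n + 1):
        d[0][j] = j                      # insert all remaining characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]

print(edit_distance("guanine", "adenine"))   # 3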

1.19.1 String Matching

Let V be an alphabet of distinguishable symbols (or characters, or letters). Let there be defined on V a long string T, called text, of length n, and a string x, called pattern, of length m < n. The key problem of string recognition (known as string matching) is to find the set of all positions in the text T starting from which the pattern x appears (occurrence), if it exists (see Fig. 1.64a). The occurrences of x in T can be multiple and overlapping (see Fig. 1.64b).
The pseudo-code of a simple string-matching algorithm is given below (Algorithm 15), assuming that the text T[1 : n] is longer than the pattern x[1 : m]; s + 1 indicates the position in the text where we have the occurrence x[1 : m] = T[s + 1 : s + m], that is, where x appears aligned in T (s is the displacement necessary to align the first character of x with the character in position s + 1 of T).

Algorithm 15 Pseudo-code of a simple algorithm of string-matching

1: Input: Text T, pattern x, n length of T and m length of x
2: Output: Find the positions s of the occurrences of x in T
3: s ← 0
4: while s ≤ n − m do
5:   if x[1 : m] = T[s + 1 : s + m] then
6:     print: "Pattern appears in position", s
7:   end if
8:   s ← s + 1
9: end while

This simple algorithm starts by aligning (initially with s = 0) the first characters of x and of T; it then compares, from left to right, the corresponding characters of x and T until it finds different characters (no-match) or reaches the end of x (match: the occurrence is reported at the position s + 1 of the character of T that corresponds to the first character of x). In both the match and the no-match situation, the pattern x is translated to the right by one character, and the procedure is repeated until the last character of x, that is x[m], exceeds the last character of the text T, that is T[s + m].
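A direct Python rendering of Algorithm 15 is the following sketch (our own, with 0-based indexing, whereas the text uses 1-based indexing); the example string and pattern are those of Fig. 1.64b.

# Sketch: naive string matching (all shifts s, including overlapping occurrences).
def naive_match(text: str, pattern: str):
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):              # s = 0 .. n - m
        if text[s:s + m] == pattern:        # compare the aligned window
            shifts.append(s)                # occurrence starts at position s + 1 (1-based)
    return shifts

print(naive_match("aabbabazzvkvkvkzzf", "vkv"))   # overlapping occurrences: [9, 11]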
The considerable computational load of this simple procedure is immediately evident: in the worst case (where x = a^m and T = a^n, strings of the same repeated character a) it has complexity O((n − m + 1)m). If, on the other hand, the characters of x and T vary randomly, the procedure tends to be more efficient. The weakness of this algorithm is the systematic displacement of the pattern x by a single position to the right, which is not always convenient.
A possible strategy is to consider larger displacements without losing occurrences. More efficient algorithms have been developed that tend to learn, through appropriate heuristics, the configuration of the pattern x and of the text T. In 1977, the KMP algorithm of Knuth, Morris, and Pratt [45] was published, which goes in the direction of making the string-matching procedure more efficient. It breaks the process down into two steps: preprocessing and searching. With the first step,

Fig. 1.65 At the displacement s, the comparison between each character of the pattern and the text proceeds from right to left, until the discordant characters, respectively x[j] and T[s + j], are met

it performs a preliminary analysis of the pattern to learn its internal configuration in adequate time; then, with the information learned, it optimizes the second step, that of searching for the occurrences, moving the string x to the right in the text T, possibly by several positions at a time. In essence, the complexity becomes linear with respect to the lengths of the pattern and of the text, obtaining O(n) for the search step and O(n + m) for the two steps together.

1.19.2 Boyer–Moore String-Matching Algorithm

The algorithm that over the years has turned out to be an excellent reference for the scientific community of the sector is that of Boyer–Moore [46]. It is essentially distinguished from the other algorithms by the following aspects:

(a) Compare the pattern x and the text T from right to left. When comparing characters from right to left, starting from x[m] down to x[1], as soon as a discordance (no-match) is found between text and pattern, the pointer j + 1 of the pattern is returned, which corresponds to the last position in which the characters of the pattern and of the text matched (see Fig. 1.65).
(b) Reduce the comparisons by computing shifts greater than 1 without compromising the detection of valid occurrences. This occurs with the use of two heuristics that operate in parallel as soon as a discordance is signaled in the comparison step. The two heuristics are

1. Heuristic of the discordant character (bad character rule): uses the position where the discordant character of the text T[s + j] is found (if it exists) in the pattern x to propose a new appropriate displacement s. This heuristic proposes to increase the displacement by the value necessary to align the rightmost occurrence in x of the discordant character with the one identified in the text T. However, it must be guaranteed not to skip valid occurrences (see Fig. 1.66).

Fig. 1.66 Boyer–Moore heuristics proposing the amount by which to increase the displacement of the pattern x with respect to the text T. With the current s there is no occurrence of the pattern: scanning from right to left, the character "a" of the text is discordant with the character "o" of the pattern in position j = 2. The heuristic of the discordant character, not finding the character "a" in the pattern, proposes to increase s by j = 2, i.e., to move the pattern immediately past the discordant character. The heuristic of the good suffix, instead, verifies whether the pattern contains another substring identical to the good suffix (in the example "olto") and proposes to move the pattern by the amount needed to align this substring with the good suffix previously found in the text. In this example, such a substring does not exist in the pattern, and it proposes to move the whole pattern immediately past the good suffix of the text, that is, to increase s by the length m of the pattern

2. Heuristic of the good suffix (good suffix rule)34: operating in parallel with the bad character rule, it attenuates the number of shifts of the pattern. The good suffix is determined efficiently thanks to the adoption of the right-to-left comparison. This heuristic also proposes, independently, to increase the displacement by the value necessary to align the next occurrence in x of the good suffix with the one identified in the text T (see Fig. 1.66).

When the discordance between the pattern and text characters is found, it is established that the current shift s does not correspond to a valid occurrence of the pattern; at this point, each heuristic proposes a value by which the shift s can be increased without compromising the determination of the occurrences. The authors of the algorithm choose the larger of the values suggested by the two heuristics to increase the shift s and then continue the search for occurrences. Returning to the example of Fig. 1.66, the heuristic of the good suffix is chosen, since it suggests the larger increment: a shift to the right of 5 characters (the pattern length) instead of the 2 characters proposed by the heuristic of the discordant character.

34 Prefix and suffix string formalism. A string y ∈ V* is a substring of x if there are two strings α and β on the alphabet V such that x = α ◦ y ◦ β (concatenated strings), and we say that y occurs in x. The string α is a prefix of x (denoted by α ⊂ x), i.e., it corresponds to the initial characters of x. Similarly, β is a suffix of x (denoted by β ⊃ x) and coincides with the final characters of x. In this context, a good suffix is defined as the suffix substring of the pattern x that occurs in the text T, for a given value of the shift s, starting from the character j + 1 + s. It should be noted that the relations ⊂ and ⊃ enjoy the transitive property.

Let us now see in more detail how the two heuristics manage to attenuate the number of comparisons without missing any occurrence.

1.19.2.1 Heuristic of Discordant Character


Let s be the current displacement that aligns the first character of the pattern x[1]
with the character T [s + 1] of the text and suppose, after starting to compare the
two strings from right to left, that a discordance occurs between patterns and text,
respectively, between the character x[ j] (where j indicates the position in x) and the
discordant character T [s + j].
The strategy of this heuristic suggests to find the first occurrence of the discordant
character of the text T [s + j] in the pattern x, analyzing the latter always from right
to left, and if it exists, we have

x[k] = T [s + j] (1.322)

where k is the maximum value of the position in x where the discordant character is
found. The action proposed by the heuristic is to move the pattern x of s + k, and
update the shift s as follows:

s ← s + ( j − k) j = 1, . . . , m k = 1, . . . , m (1.323)

which has the effect of aligning the discordant character of the text T [s + j] with the
identical character found inside the pattern in the position k. Figure 1.67 schematizes
the functionality of this heuristic for the 3 possible configurations depending on the
value of the index k. Let’s now analyze the 3 configurations:

1. k = 0: is the configuration in which is not satisfied the Eq. (1.322), that is, the
discordant character does not appear inside the pattern x and the proposed action,
according to the (1.323), is to increase the shift s by j:

s←s+ j (1.324)

which has the effect of aligning the first character of the pattern x[1] with the
character of T after the discordant character (see Fig. 1.67a).
2. k < j: is the configuration in which the discordant character of T is present in
the pattern x in the k position and is to the left of the j position of the discordant
character in x resulting in j −k > 0. It follows that we can increase the movement
s of j − k (to the right), according to the (1.323), which has the effect of aligning
the x[k] character with the discordant character in the text T (see Fig. 1.67b).
3. k > j: is the configuration where the discordant character T [s + j] is present in
the pattern x, but to the right of the position j of the discordant character in x,
resulting j − k < 0, which would imply a negative shift, to the left, and therefore,
the increment of s is ignored and not applied with the (1.323) but we could only
increase s by 1 character to the right (see Fig. 1.67c).

(a) 1 2 ...... s+j n


Text T A G G G C G G A C C G C T A G A A T A G ......
s=0 j scansion
Pattern x G G T A A G G A k=0
s=0+j=5 G G T A A G G A

(b) 1 2 ...... s+j n


Text T A G G G C G G A C C G C T A G A A T A G ......
k j
s=11 k<j
j-k=7-3>0
Pattern x G G T A A G G A
s=11+(j-k)=15
G G T A A G G A

(c)
1 2 ...... s+j n
Text T A G G G C G G A C C G C T A G A A T A G ...... k>j
j k
s=8
Pattern x G G T A A G G A
s=8+(j-k)=6 j-k=6-8<0 Negative
G G T A A G G A
Proposal ignored increase

Fig. 1.67 The different configurations of the heuristic of the discordant character. a The discordant
character T [s + j] is not found in the pattern x and it is proposed to move the pattern to the left
immediately after the discordant character (incremented s by 5). b The discordant character occurs
in the pattern, at the rightmost position k, with k < j, and the movement of the pattern of j − k
characters (incremented s by 4) is proposed. This is equivalent to aligning the discordant character
T [s + j] = “T ” of the text with the identical character x[k] found in the pattern. c Situation identical
to (b) but the discordant character T [s + j] = “A” is found in the pattern in the rightmost position
where k > j. In the example, j = 6 and k = 8 and the heuristic proposing a negative shift is
ignored

In the Boyer–Moore algorithm, the heuristic of the discordant character is realized


with the function of the last occurrence:

λ : {σ1 , σ2 , . . . , σ|V | } → {0, 1, . . . , m}

given as follows:

max{k : 1 ≤ k ≤ m and x[k] = σi if σi ∈ x
λ[σi ] = (1.325)
λ[σi ] = 0 otherwise

where σi is the i-th symbol of the alphabet V. The function of the last occurrence
defines λ[σi ] as the pointer of the rightmost position (i.e., of the last occurrence) in x
where the character σi appears, for all the characters of the alphabet V. The pointer
is zero if σi does not appear in x. The pseudo-code that implements the algorithm of
the last occurrence function is given below (Algorithm 16).

Fig. 1.68 The different configurations of the good suffix heuristic. a k does not exist: in the pattern x there is no copy of the good suffix, nor a prefix of x that is also a suffix of T[s + j + 1 : s + m]; a pattern shift equal to its length m is proposed. b k does not exist, but there is a prefix α of x (in the example α = "CA") which is also a suffix of T[s + j + 1 : s + m], indicated with β; a pattern shift is proposed that matches the prefix α with the text suffix β (s incremented by 7 characters). c k exists: in the pattern there is a substring (in the example "ACA") coinciding with the suffix occurring in T[s + j + 1 : s + m] and satisfying the condition x[k] ≠ x[j]; as in (b), a shift of the pattern is proposed that aligns the substring found in x with the above suffix of the text (in the example, the increment is 3 characters)

1.19.2.2 Good Suffix Heuristics


The strategy of this heuristic is to analyze, as soon as the discordant character is found,
the presence of any identical substrings between text and pattern. Figure 1.68a shows
a possible configuration while this heuristic operates as soon as a discordant character
is detected.
We observe the presence of the substring suffixed in the pattern that coincides with
the identical substring in the text (in the figure they have the same color). Having
found the identical substring, between text and pattern, the heuristic of the good suffix
proposes to verify if in the pattern there exist other substrings identical to the suffix
and in the affirmative case, it suggests to move the pattern to the right to align the
substring found in the pattern with that of the good suffix in the text (see figure).
In formal terms, using the previous meaning of the symbols, we have the discordant
character between text T [s + j] and pattern x[ j], with s the current shift and j
indicates the position in x of the discordant character. The suffix x[ j + 1 : m] is the
same as the substring in the text T [s + j + 1 : s + m]. We want to find, if there is in
x, a copy of the suffix at the rightmost position k < j, such that

x[k + 1 : k + m − j] = T [s + j + 1 : s + m] x[k] = x[ j] (1.326)



Algorithm 16 Pseudo-code of the algorithm of the last occurrence function: Last_Occurrence(x, V)

1: Input: pattern x, m length of x, alphabet V
2: Output: λ
3: for ∀σ ∈ V do
4:   λ[σ] ← 0
5: end for
6: for j ← 1 to m do
7:   λ[x[j]] ← j
8: end for
9: return λ
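A Python sketch of Algorithm 16 and of the shift proposed by the discordant-character heuristic (Eqs. 1.322–1.324) is given below (our own; the helper names and the example alphabet are arbitrary, and positions are 1-based as in the text).

# Sketch: last-occurrence table and bad-character shift proposal.
def last_occurrence(pattern: str, alphabet: str):
    lam = {c: 0 for c in alphabet}
    for j, c in enumerate(pattern, start=1):   # later (rightmost) positions overwrite
        lam[c] = j
    return lam

def bad_character_shift(pattern: str, lam: dict, j: int, mismatched: str) -> int:
    """Increment of s proposed when x[j] disagrees with the text character."""
    k = lam.get(mismatched, 0)
    return j - k if k < j else 1     # k > j would give a negative shift: advance by 1

lam = last_occurrence("molto", "abcdefghilmnopqrstuvz")
print(lam["o"], lam["a"])                          # 5 (rightmost 'o'), 0 ('a' absent)
print(bad_character_shift("molto", lam, 2, "a"))   # 2: the case of Fig. 1.66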

The first expression indicates the existence of a copy of the good suffix starting from position k + 1, while the second indicates that this copy is preceded by a character different from the one that caused the discordance, namely x[j]. Once (1.326) is met, the heuristic suggests updating the shift s as follows:

s ← s + ( j − k) j = 1, . . . , m k = 1, . . . , m (1.327)

and moving the pattern x to the new position s + 1. Comparing the characters of x from position k + 1 to k + m − j is unnecessary, since they are already known to match. Figure 1.68 schematizes the functionality of the good suffix heuristic for the 3 possible configurations as the index k varies.

1. k = 0: since there is no copy in x of the suffix of T [s + j + 1 : s + m], move x by m characters (see Fig. 1.68a).
2. k = 0 but there is a prefix α of x: if a prefix α of x exists that coincides with a suffix β of T [s + j + 1 : s + m], the heuristic suggests a shift of m − |α| characters, so as to match the prefix α with the suffix β of T [s + j + 1 : s + m] (see Fig. 1.68b).
3. k exists: since there is a copy in x of the good suffix (starting from position k + 1), preceded by a character different from the one that caused the mismatch, move x by the minimum number of characters needed to align this copy found in x with the coincident suffix of T. In other words, there is in x another copy of the substring T [s + j + 1 : s + m], namely x[k + 1 : k + m − j] with x[k] ≠ x[j] (see Fig. 1.68c).
   This situation reduces to the previous case when the copy found in x is just a prefix of x. In fact, as shown in Fig. 1.68b, that copy (the substring α = x[1 : |α|] = “CA”) matches a prefix of x preceded by the null character ε (imagined at position k = 0, in x[0]), which satisfies the condition x[k] ≠ x[j], that is, ε ≠ x[j]. This prefix coincides with the suffix T [s + m − |α| + 1 : s + m], and the proposed shift always consists in aligning the prefix of the pattern with the suffix of the text.

In the Boyer–Moore algorithm, the good suffix heuristic is realized with the good suffix function γ[j], which defines, once the discordant character is found at position j, j < m (i.e., x[j] ≠ T [s + j]), the minimum amount by which the shift s is incremented, given as follows:

γ[j] = m − max{k : 0 ≤ k < m and x[j + 1 : m] ∼ x[1 : k] with x[k] ≠ x[j]}     (1.328)

where the symbol ∼ indicates a relationship of similarity between two strings.35 In this context, we have that x[j + 1 : m] ⊃ x[1 : k] or x[1 : k] ⊃ x[j + 1 : m]. The function γ[j] defined by (1.328) determines a minimum value by which to increase the shift s without causing any character of the good suffix T [s + j + 1 : s + m] to be discordant with the proposed new pattern alignment (see also the implication in Note 35).
The pseudo-code that implements the good suffix function is given in Algorithm 17.

Algorithm 17 Pseudo-code of the good suffix algorithm: Good_Suffix(x, m)


1: Input: pattern x, m length of x
2: Output: γ
3: π ← Func_Prefix(x)
4: x′ ← Reverse(x)
5: π′ ← Func_Prefix(x′)
6: for j ← 0 to m do
7:    γ[j] ← m − π[m]
8: end for
9: for k ← 1 to m do
10:   j ← m − π′[k]
11:   if (γ[j] > (k − π′[k])) then
12:       γ[j] ← k − π′[k]

13: end if

14: end for


15: return γ

35 Let α and β be two strings; we define a similarity relation α ∼ β (read: α is similar to β) with the meaning that α ⊃ β or β ⊃ α (where we recall that the symbol ⊃ means “is a suffix of”). It follows that, if two strings are similar, we can align them with their identical characters furthest to the right, and no pair of aligned characters will be discordant. The similarity relation ∼ is symmetric, that is, α ∼ β if and only if β ∼ α. It can also be shown that the following implication holds:

α ⊃ β and y ⊃ β ⟹ α ∼ y.

From the pseudo-code, we observe the use of the prefix function π applied to the pattern x and to its reverse, indicated with x′. This function is used in the preprocessing of the Knuth–Morris–Pratt string-matching algorithm and is formalized as follows: given a pattern x[1 : m], the prefix function for x is the function π : {1, 2, . . . , m} → {0, 1, 2, . . . , m − 1} such that

π [q] = max{k : k < q and x[1 : k] ⊃ x[1 : q]} (1.329)

In essence, (1.329) indicates that π[q] is the length of the longest prefix of the pattern x that is also a suffix of x[1 : q].
Returning to the good suffix algorithm (Algorithm 17), the first for-loop initializes the vector γ with the difference between the length of the pattern x and the value returned by the prefix function π. With the second for-loop, the vector γ, already initialized, is updated with the values derived from π′ whenever they yield shifts smaller than those computed with π in the initialization. The pseudo-code of the prefix function π given by (1.329) is reported in Algorithm 18.

Algorithm 18 Pseudo-code of the prefix function algorithm: Func_Prefix(x, m)


1: Input: pattern x, m length of x
2: Output: π
3: π[1] ← 0
4: k ← 0
5: for i ← 2 to m do
6:    while k > 0 and x[k + 1] ≠ x[i] do
7:        k ← π[k]
8:    end while
9:    if x[k + 1] = x[i] then
10:       k ← k + 1
11:   end if
12:   π[i] ← k
13: end for
14: return π
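For illustration, the two preprocessing steps can be transcribed in Python as follows; this is a sketch that mirrors Algorithms 17 and 18 (the returned tables keep 1-based positions, while the strings are indexed in the usual 0-based Python way), not the book's reference implementation.

def func_prefix(x):
    # Prefix function of Eq. (1.329): pi[q] = length of the longest prefix of x
    # that is also a suffix of x[1:q] (positions are 1-based, pi[0] is unused).
    m = len(x)
    pi = [0] * (m + 1)
    k = 0
    for i in range(2, m + 1):
        while k > 0 and x[k] != x[i - 1]:      # x[k+1] != x[i] in 1-based notation
            k = pi[k]
        if x[k] == x[i - 1]:                   # x[k+1] == x[i]
            k += 1
        pi[i] = k
    return pi

def good_suffix(x):
    # Good suffix shift table gamma[j], j = 0..m, following Algorithm 17.
    m = len(x)
    pi = func_prefix(x)
    pi_rev = func_prefix(x[::-1])              # prefix function of the reversed pattern
    gamma = [m - pi[m]] * (m + 1)              # initialization (first for-loop)
    for k in range(1, m + 1):                  # refinement (second for-loop)
        j = m - pi_rev[k]
        if gamma[j] > k - pi_rev[k]:
            gamma[j] = k - pi_rev[k]
    return gamma

print(good_suffix("CTAGCGGCT"))   # [7, 7, 7, 7, 7, 7, 7, 7, 7, 1]

For the pattern of Fig. 1.69, the table gives γ[6] = γ[7] = 7 and γ[9] = 1, which are exactly the Gs values used in steps 1, 5, and 2 of that example.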

We are now able to present the Boyer–Moore algorithm, having defined the two preprocessing functions: that of the discordant character (Last_Occurrence) and that of the good suffix (Good_Suffix). The pseudo-code is given in Algorithm 19.
Figure 1.69 shows a simple example of the Boyer–Moore algorithm that finds the first occurrence of the pattern x[1:9] = “CTAGCGGCT” in the text T[1:29] = “CTTATAGCTGATCGCGGCCTAGCGGCTAA” after 6 steps, having previously

Algorithm 19 Pseudo-code of Boyer–Moore’s string-matching algorithm


1: Input: Text T , pattern x, n length of T and m length of x, Alphabet V
2: Output: Find the positions s of the occurrences of x in T
3: λ ← Last_Occurrence(x, V)
4: γ ← Good_Suffix(x, m)
5: s ← 0
6: while s ≤ n − m do
7:    j ← m
8:    while j > 0 and x[j] = T[s + j] do
9:        j ← j − 1
10:   end while
11:   if j = 0 then
12:       print “pattern x appears in position”, s
13:       s ← s + γ[0]
14:   else
15:       s ← s + max(γ[j], j − λ[T[s + j]])
16:   end if
17: end while

pre-calculated the tables of the two heuristics, appropriate for the alphabet V =
{A, C, G, T } and the pattern x considered.
From the analysis of the Boyer–Moore algorithm, we can observe the similarity with the simple Algorithm 15, from which it differs substantially in the comparison between pattern and text, which proceeds from right to left, and in the use of the two heuristics to shift the pattern by more than 1 character. In fact, while in the simple string-matching algorithm the shift s is always of 1 character, in the Boyer–Moore algorithm, when the discordant character is found, the instruction at line 15 of Algorithm 19 is executed, increasing s by the maximum of the values suggested by the two heuristic functions.
The computational complexity [47,48] of the Boyer–Moore algorithm is O(nm). In particular, the preprocessing cost due to the Last_Occurrence function is O(m + |V|) and that due to the Good_Suffix function is O(m), while the cost of the search phase is O((n − m + 1)m). The number of character comparisons saved in the search phase depends heavily on the heuristics, which capture useful information about the internal structure of the pattern or of the text. To operate in linear time, the two heuristics are implemented through tables in which, for every symbol of the alphabet and of the pattern string, the position of its rightmost occurrence

[Figure 1.69: step-by-step alignments of the pattern x = “CTAGCGGCT” against the text T, with the successive shifts s = 0, 7, 9, 10, 11, 18 and, at each step, the shifts Gs and Lo proposed by the good suffix and discordant character heuristics; see the caption below.]

Fig. 1.69 Complete example of the Boyer–Moore algorithm. In step 1, we have a character discordance for j = 6 and a match between the suffix “CT” (of the good suffix T[7:9] = “GCT”) and the prefix α = x[1:2] = “CT”; between the two heuristics, the good suffix one (Gs = 7) wins over that of the discordant character (Lo = 3), moving the pattern by 7 characters as shown in the figure. In step 2, the discordant character heuristic wins, having found the rightmost occurrence in the pattern of the discordant character, and proposes a shift Lo = 2 greater than that of the good suffix heuristic, Gs = 1. In steps 3 and 4, both heuristics suggest moving by 1 character. Step 5 instead chooses the good suffix heuristic, Gs = 7 (a configuration identical to step 1), while the other heuristic, which proposes a negative shift Lo = −1, is ignored. In step 6, we have the first occurrence of the pattern in the text

in x is pre-calculated, together with the rightmost positions of occurrence of the suffixes in x. Experimental results show that the Boyer–Moore algorithm performs well for long patterns x and for large alphabets V. To further optimize the computational complexity, several variants of the Boyer–Moore algorithm have been developed [47,49,50].
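Putting the pieces together, a compact Python sketch of the Boyer–Moore search of Algorithm 19 may look as follows; it reuses the last_occurrence, func_prefix, and good_suffix helpers sketched earlier, and reports 0-based shifts s (so the value 18 corresponds to the alignment s = 11 + 7 = 18 of step 6 in Fig. 1.69, i.e., to the occurrence starting at position 19 in 1-based terms).

def boyer_moore(text, pattern, alphabet):
    # Report every shift s at which the pattern occurs in the text (Algorithm 19).
    n, m = len(text), len(pattern)
    lam = last_occurrence(pattern, alphabet)       # discordant (bad) character table
    gamma = good_suffix(pattern)                   # good suffix table
    matches = []
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1                                 # compare from right to left
        if j == 0:
            matches.append(s)                      # full match at shift s
            s += gamma[0]
        else:
            # shift by the larger of the two heuristic proposals
            s += max(gamma[j], j - lam.get(text[s + j - 1], 0))
    return matches

print(boyer_moore("CTTATAGCTGATCGCGGCCTAGCGGCTAA", "CTAGCGGCT", "ACGT"))   # [18]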

1.19.3 Edit Distance

In the preceding paragraphs, we have addressed the problem of exact matching


between strings. In the more general problem of pattern classification, we used the
nearest-neighbor algorithm to determine the class to which the patterns belong.
Even for a string pattern, it is useful to determine a category by evaluating a level of

similarity with strings grouped in classes according to some categorization criteria.


In the context of string recognition, we define a concept of distance known as the edit distance, also known as the Levenshtein distance [51], to evaluate how many character operations are needed to transform a string x into another string y. This measure can be used to compare strings and to express concepts of diversity, in particular in Computational Biology applications. The possible operations considered in evaluating the edit distance are

Substitution: Replace a character of x with the corresponding character of y. For example, turning “mario” into “maria” requires replacing “o” with “a”.
Insertion: A character from y is inserted in x with the consequent increment of 1
character of the length of x. For example, turning the string “mari” into “maria”
requires the insertion of “a”.
Deletion: A character is deleted in x by decreasing its length by one character.

These are the elementary operations used to calculate the edit distance. Other elementary operations can also be considered, such as the transposition, which interchanges adjacent characters of a string x. For example, transforming x = “marai” into y = “maria” requires a single transposition operation, equivalent to 2 substitution operations. The edit distance between the strings “dived” and “davide” is equal to 3 elementary operations: 2 substitutions, “dived” → “daved” (substitution of “i” with “a”) and “daved” → “david” (substitution of “e” with “i”), and 1 insertion, “david” → “davide” (insertion of the character “e”).
The edit distance is the minimum number of operations required to make two
strings identical. Edit distance can be calculated by giving different costs to each
elementary operation. For simplicity, we will consider unit costs for each edit ele-
mentary operation.
Given two strings

x = (x1 , x2 . . . , xi−1 , xi , xi+1 , . . . , xn )

y = (y1 , y2 . . . , y j−1 , y j , y j+1 , . . . , ym ),

it is possible to define the matrix D(i, j) as the edit distance of the prefix strings
(x1 ..xi ) and (y1 ..y j ) and consequently obtain as final result the edit distance D(n, m)
between the two strings x and y, as the minimum number of edit operations needed to transform the entire string x into y.
The calculation of D(i, j) can be set recursively, based on immediately shorter
prefixes, considering that you can have only three cases associated with the related
edit operations:

1. Substitution, the xi character is replaced with the y j and we will have the
following edit distance:

D(i, j) = D(i − 1, j − 1) + cs (i, j) (1.330)



where D(i − 1, j − 1) is the edit distance between the prefixes (x1 ..xi−1 ) and
(y1 ..y j−1 ), and cs (i, j) indicates the cost of the substitution operation between
the characters xi and y j , given by

cs(i, j) = { 1 if xi ≠ yj ;  0 if xi = yj }     (1.331)

2. Deletion, the character xi is deleted and we will have the following edit distance:

D(i, j) = D(i − 1, j) + cd (i, j) (1.332)

where D(i − 1, j) is the edit distance between the prefixes (x1 ..xi−1 ) and
(y1 ..y j ), and cd (i, j) indicates the cost of the deletion operation, normally set
equal to 1.
3. Insertion, the character yj is inserted and we will have the following edit distance:

D(i, j) = D(i, j − 1) + cin (i, j) (1.333)

where D(i, j − 1) is the edit distance between the prefixes (x1..xi) and (y1..yj−1), and cin(i, j) indicates the cost of the insertion operation, normally set equal to 1.

Given that there are no other cases, and that we are interested in the minimum value,
the correct edit distance recursively defined is given by

D(i, j) = min{D(i −1, j)+1, D(i, j −1)+1, D(i −1, j −1)+cs (i, j)} (1.334)

with strictly positive i and j. The edit distance D(n, m) between two strings of length n and m characters, respectively, can be calculated with a recursive procedure that implements (1.334) starting from the base conditions:

D(i, 0) = i,  i = 1, . . . , n;    D(0, j) = j,  j = 1, . . . , m;    D(0, 0) = 0     (1.335)

where D(i, 0) is the edit distance between the prefix string (x1..xi) and the null string ε, D(0, j) is the edit distance between the null string ε and the prefix string (y1..yj), and D(0, 0) represents the edit distance between null strings.
A naive recursive procedure based on (1.334) and (1.335) would be inefficient, since the same subproblems would be recomputed many times. The strategy used, instead, is based on dynamic programming, which fills a table of (n + 1) × (m + 1) entries (see Algorithm 20).
With this algorithm, a cost matrix D is used to calculate the edit distance starting from the base conditions given by (1.335) (lines 3–9); then, using these base values, the edit distance of each element D(i, j) (i.e., of longer and longer prefix pairs as i and j vary) is computed with (1.334) (minimum cost, line 17), thus filling the matrix up to the element D(n, m), which represents the edit distance of the strings x and y of length n and m, respectively. In essence, instead of directly calculating the distance D(n, m) of the two strings of interest, the strategy of Algorithm 20 is to determine the distance of all the prefixes of the two strings (reduction of the problem into subproblems), from which the distance for the entire length of the strings is derived by induction.

Algorithm 20 Pseudo-code of the edit distance algorithm


1: Input: String x, n length of x, String y and m length of y
2: Output: Matrix of edit distances D(n, m)
3: D(i, j) ← 0
4: for i ← 1 to n do

5: D(i, 0) ← i

6: end for
7: for j ← 1 to m do

8: D(0, j) ← j

9: end for
10: for i ← 1 to n do

11: for j ← 1 to m do

12: if x[i] = y[ j] then

13: c←0

14: else

15: c←1

16: end if
17: D(i, j) ← min{ D(i − 1, j) + 1 , D(i, j − 1) + 1 , D(i − 1, j − 1) + c }
              (deletion of xi)  (insertion of yj)  (subst./no subst. of xi with yj)

18: end for

19: end for


20: return D(n, m)

The symmetry property is maintained if the edit operations (insertion and deletion) have identical costs, as is the case in the algorithm reported here. Furthermore, a cost c(i, j) can be used if we want to differentiate the elementary editing costs between the character xi and the character yj. Figure 1.70a shows the schema of the matrix D, of dimensions (n + 1) × (m + 1), for the calculation of the minimum edit distance for the strings x = “Frainzisk” and y = “Francesca”. The elements D(i, 0) and D(0, j) are filled first (respectively, the first column and the first row of D), that is, the base values representing the lengths of all the prefixes of the two strings with respect

(a) The matrix D is initialized with the base values D(i, 0) = i (first column) and D(0, j) = j (first row); each element D(i, j) is then computed from its upper, left, and upper-left neighbours D(i − 1, j), D(i, j − 1), D(i − 1, j − 1). For example:

D(1, 1) = min[D(0, 0) + c(1, 1), D(0, 1) + 1, D(1, 0) + 1] = min[0 + 0, 1 + 1, 1 + 1] = 0
D(1, 2) = min[D(0, 1) + c(1, 2), D(0, 2) + 1, D(1, 1) + 1] = min[1 + 1, 2 + 1, 0 + 1] = 1

(b) Complete matrix D for x = “Frainzisk” (rows) and y = “Francesca” (columns):

        ε  F  r  a  n  c  e  s  c  a
    ε   0  1  2  3  4  5  6  7  8  9
    F   1  0  1  2  3  4  5  6  7  8
    r   2  1  0  1  2  3  4  5  6  7
    a   3  2  1  0  1  2  3  4  5  6
    i   4  3  2  1  1  2  3  4  5  6
    n   5  4  3  2  1  2  3  4  5  6
    z   6  5  4  3  2  2  3  4  5  6
    i   7  6  5  4  3  3  3  4  5  6
    s   8  7  6  5  4  4  4  3  4  5
    k   9  8  7  6  5  5  5  4  4  5

Fig. 1.70 Calculation of the edit distance to transform the string x = “Frainzisk” into y = “Francesca”. a Construction of the matrix D starting from the base values given by (1.335) and then iterating with the dynamic programming method to calculate the other elements of D using Algorithm 20. b The complete matrix D, calculated by scanning the matrix from left to right and from the first row to the last. The edit distance between the two strings is equal to D(9, 9) = 5 and the required edit operations are 1 deletion, 3 substitutions, and 1 insertion

to the null string ε. The element D(i, j) is the edit distance between the prefixes x(1..i) and y(1..j). The value D(i, j) is calculated by induction based on the last characters of the two prefixes.
If these characters are equal, D(i, j) is equal to the edit distance between the two prefixes shortened by one character (x(1..i − 1) and y(1..j − 1)), that is, D(i, j) = 0 + D(i − 1, j − 1). If the last two characters are not equal, D(i, j) is one unit greater than the minimum of the edit distances relative to the three shorter prefix pairs (the adjacent elements: upper, left, and upper left), that is, D(i, j) = 1 + min{D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)}. It follows, as shown in the figure, that for each element of D the calculation of the edit distance depends only on the values previously calculated for the prefixes shortened by one character.
The complete matrix is obtained by iterating the calculation for each element of D, operating from left to right and from the first row to the last, thus obtaining the edit distance in the last element D(n, m). Figure 1.70b shows the complete matrix D for the calculation of the edit distance to transform the string x = “Frainzisk” into y = “Francesca”. The path that leads to the final result D(9, 9) = 5, i.e., to the minimum number of required edit operations, is also reported (indicating the type of edit operation performed).
Returning to the calculation time, the algorithm reported requires a computational
load of O(nm) while in space-complexity, it requires O(n) (the space-complexity is
O(nm) if the whole of the matrix is kept for a trace-back to find an optimal alignment).
Ad hoc algorithms have been developed in the literature that reduce computational
complexity up to O(n + m).
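For reference, a compact Python version of Algorithm 20 is sketched below (function and variable names are illustrative); run on the strings of Fig. 1.70 it should return 5, the value of D(9, 9).

def edit_distance(x, y):
    # Levenshtein distance via the dynamic programming recurrence (1.334)-(1.335).
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                              # base condition D(i, 0) = i
    for j in range(1, m + 1):
        D[0][j] = j                              # base condition D(0, j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if x[i - 1] == y[j - 1] else 1     # substitution cost c_s(i, j)
            D[i][j] = min(D[i - 1][j] + 1,           # deletion of x_i
                          D[i][j - 1] + 1,           # insertion of y_j
                          D[i - 1][j - 1] + c)       # substitution / no substitution
    return D[n][m]

print(edit_distance("Frainzisk", "Francesca"))   # 5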

1.19.4 String Matching with Error

In several applications, where the information may be affected by errors or the nature of the information itself evolves, exact pattern matching is not useful. In these cases, it is very important to solve the problem of approximate pattern matching, which consists in finding in the text string an approximate version of the pattern string according to a predefined similarity level.
In formal terms, approximate pattern matching is defined as follows. Given a text string T of length n and a pattern x of length m with m ≤ n, the problem is to find the k-approximate occurrences of the pattern string x in the text T, i.e., occurrences with at most k (0 ≤ k ≤ m) different characters (or errors). A simple version of an approximate matching algorithm, shown below as Algorithm 21, is obtained with a modification of the exact matching algorithm presented in Algorithm 15.

Algorithm 21 Pseudo-code of a simple approximate string-matching algorithm


1: Input: Text T , pattern x, n length of T , m length of x and number of different characters k
2: Output: Find the positions s of the approximate occurrences of x in T
3: s←0
4: for s ← 1 to n − m + 1 do

5: count ← 0
6: for j ← 1 to m do

7: if x[j] ≠ T[s + j − 1] then

8: count ← count + 1

9: end if

10: end for


11: if count ≤ k then

12: Print “approximate pattern in position”, s

13: end if

14: end for
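A direct Python transcription of Algorithm 21 might be the following minimal sketch (0-based shifts; names are illustrative):

def approximate_matches(text, pattern, k):
    # Shifts s at which the pattern occurs in the text with at most k mismatches.
    n, m = len(text), len(pattern)
    positions = []
    for s in range(n - m + 1):
        count = sum(1 for j in range(m) if pattern[j] != text[s + j])   # mismatch counter
        if count <= k:
            positions.append(s)
    return positions

print(approximate_matches("CCTATAGCTGATC", "CTAG", 1))   # [1, 3]

Note that this simple version counts only substitutions (character mismatches), so the two occurrences reported above start at the 1-based positions 2 and 4; the edit-distance-based Algorithm 22 described next also allows insertions and deletions.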

The first for-loop (line 4) slides the pattern along the text one character at a time, while in the second for-loop (line 6) the variable count accumulates the number of different characters found between the pattern x and the text window T[s : s + m − 1], reporting the positions s in T of the approximate patterns found according to the required k differences. Putting k = 0 in Algorithm 21, we return to the exact matching algorithm. We recall the computational inefficiency of this algorithm, equal to O(nm). An efficient algorithm of approximate string matching, based on the edit distance, is reported in Algorithm 22.

Algorithm 22 Pseudo-code of an approximate string-matching algorithm based on


edit distance
1: Input: pattern x, m length of x, Text T , n length of T and number of different characters k
2: Output: Matrix D(m, n) where if D(m, j) ≤ k we have that x occurs at the position j in T
3: D(i, j) ← 0
4: for i ← 1 to m do

5: D(i, 0) ← i

6: end for
7: for j ← 1 to n do

8: D(0, j) ← 0

9: end for
10: for i ← 1 to m do

11: for j ← 1 to n do

12: if x[i] = T[j] then

13: c←0

14: else

15: c←1

16: end if
17: D(i, j) ← min{D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + c}

18: end for

19: end for


20: for j ← 1 to n do

21: if D(m, j) ≤ k then

22: Print “approximate pattern in position”, j

23: end if

24: end for


25: return D(m, n)

This algorithm differs from Algorithm 20 essentially in that the row D(0, j), j = 1, . . . , n is set to zero instead of to the value j (line 8 of Algorithm 22). In this way, the matrix D indicates that the empty prefix of the pattern x matches an empty substring of the text T at any position, at no cost. Each element D(i, j) of the matrix contains

Modified edit matrix D for x = “CTAG” and T = “CCTATAGCTGATC”:

        ε  C  C  T  A  T  A  G  C  T  G  A  T  C
    ε   0  0  0  0  0  0  0  0  0  0  0  0  0  0
    C   1  0  0  1  1  1  1  1  0  1  1  1  1  0
    T   2  1  1  0  1  1  2  2  1  0  1  2  1  1
    A   3  2  2  1  0  1  1  2  2  1  1  1  2  2
    G   4  3  3  2  1  1  2  1  2  2  1  2  2  3

The 1-approximate occurrences are found where the last row satisfies D(4, j) ≤ 1, i.e., at j = 4, 5, 7 and 10.

Fig. 1.71 Detection, using Algorithm 22, of the 1-approximate occurrences of the pattern x = “CTAG” in the text T = “CCTATAGCTGATC”. The positions of the approximate occurrences in T are found in the row D(m, ∗) of the modified edit matrix, where the value of the edit distance is at most k, i.e., D(4, j) ≤ 1. In the example, there are 4 occurrences in T, at the positions j = 4, 5, 7 and 10

the minimum value k for which there exists an approximate occurrence (with at most k different characters) of the prefix x[1 : i] ending at position j of T. It follows that the k-approximate occurrences of the entire pattern x are found in T at the positions indicated in the last row of the matrix, D(m, ∗) (line 22 of Algorithm 22). In fact, each element D(m, j), j = 1, . . . , n reports the minimum number of different characters (that is, the number of edit operations required to transform x into the corresponding occurrence in T) between the pattern and a substring of the text ending at position j. The positions j of the occurrences of x in T are therefore found where D(m, j) ≤ k.
Figure 1.71 shows the modified edit matrix D calculated with Algorithm 22 to find the k = 1 approximate occurrences of the pattern x = “CTAG” in the text T = “CCTATAGCTGATC”. The 1-approximate pattern x, as shown in the figure, occurs in the text T at the positions j = 4, 5, 7 and 10, that is, where D(4, j) ≤ k. In the literature [51], there are several other approaches based on different methods of calculating the distance between strings (Hamming, Episode, ...) and on dynamic programming, with the aim also of reducing the computational complexity in time and space.
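A Python sketch of Algorithm 22 follows (names are illustrative); on the data of Fig. 1.71 it should report the end positions j = 4, 5, 7 and 10 (1-based).

def k_approximate_positions(pattern, text, k):
    # Positions j (1-based) where an occurrence of the pattern with at most k
    # differences ends in the text, using the modified edit matrix of Algorithm 22.
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                      # first column as in the ordinary edit distance
    # D[0][j] stays 0: the empty prefix of x matches anywhere in T at no cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if pattern[i - 1] == text[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + c)
    return [j for j in range(1, n + 1) if D[m][j] <= k]

print(k_approximate_positions("CTAG", "CCTATAGCTGATC", 1))   # [4, 5, 7, 10]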

1.19.5 String Matching with Special Symbol

In several string-matching applications, it is useful to define a special don't care character (a wildcard, here written ∗) which can appear in the pattern string x and in the text T and has the meaning of equivalence (match) in the comparison with any other character of the pattern or text. For example, in the search for the occurrences of the pattern x = “CDD∗AT∗G” in the text T = “C∗D∗ATTGATTG∗G...”, the special character ∗ is not considered in the comparison and the first occurrence of the pattern is aligned at the beginning of the text. The exact string-matching algorithms, described in the previous paragraphs, can be modified to include the management of the special character, at the price of a marked degradation of the computational complexity.

Ad hoc solutions have been developed in the literature [52] for the optimal com-
parison between strings neglecting the special character.

References
1. R.B. Cattell, The description of personality: basic traits resolved into clusters. J. Abnorm. Soc.
Psychol. 38, 476–506 (1943)
2. R.C. Tryon, Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the
Isolation of Unities in Mind and Personality (Edward Brothers Inc., Ann Arbor, Michigan,
1939)
3. K. Pearson, On lines and planes of closest fit to systems of points in space. Philos. Mag. 2(11),
559–572 (1901)
4. H. Hotelling, Analysis of a complex of statistical variables into principal components. J. Educ.
Psychol. 24, 417–441 and 498–520 (1933)
5. W. Rudin, Real and Complex Analysis (Mladinska Knjiga McGraw-Hill, 1970). ISBN 0-07-
054234-1
6. R. Larsen, R.T. Warne, Estimating confidence intervals for eigenvalues in exploratory factor
analysis. Behav. Res. Methods 42, 871–876 (2010)
7. M. Friedman, A. Kandel, Introduction to Pattern Recognition: Statistical, Structural, Neural
and Fuzzy Logic Approaches (World Scientific Publishing Co Pte Ltd, 1999)
8. R.A. Fisher, The statistical utilization of multiple measurements. Ann Eugen 8, 376–386 (1938)
9. K. Fukunaga, J.M. Mantock, Nonparametric discriminant analysis. IEEE Trans. Pattern Anal.
Mach. Intell. 5(6), 671–678 (1983)
10. T. Okada, S. Tomita, An optimal orthonormal system for discriminant analysis. Pattern Recog-
nit. 18, 139–144 (1985)
11. J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-fuzzy and Soft Computing (Prentice Hall, 1997)
12. J. MacQueen, Some methods for classification and analysis of multivariate observations, in
Proceedings of the Fifth Berkeley Symposium on Mathematical statistics and Probability, vol.
1, ed. by L.M. LeCam, J. Neyman (University of California Press, 1977), pp. 282–297
13. G.H. Ball, D.J. Hall, Isodata: a method of data analysis and pattern classification. Technical
report, Stanford Research Institute, Menlo Park, United States. Office of Naval Research.
Information Sciences Branch (1965)
14. J.R. Jensen, Introductory Digital Image Processing: A Remote Sensing Perspective, 2nd edn.
(Prentice Hall, Upper Saddle River, NJ, 1996)
15. L.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum Press,
New York, 1981)
16. C.K. Chow, On optimum recognition error and reject tradeoff. IEEE Trans. Inf. Theory 16,
41–46 (1970)
17. A.R. Webb, K.D. Copsey, Statistical Pattern Recognition, 3rd edn. (Prentice Hall, Upper Saddle
River, NJ, 2011). ISBN 978-0-470-68227-2
18. R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd edn. (Wiley, 2001). ISBN
0471056693
19. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edn. (Academic Press Pro-
fessional, Inc., 1990). ISBN 978-0-470-68227-2
20. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the
EM algorithm. J. R. Stat. Soc. B 39(1), 1–38 (1977)
21. W. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull.
Math. Biophys. 5, 115–133 (1943)

22. H. Robbins, S. Monro, A stochastic approximation method. Ann. Math. Stat. 22, 400–407
(1951)
23. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58(1), 267–288
(1996)
24. J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Natl. Acad. Sci 79, 2554–2558 (1982)
25. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
26. J.R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, CA,
1993)
27. L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees (Wadsworth
Books, 1984)
28. X. Lim, W.Y. Loh, X. Shih, A comparison of prediction accuracy, complexity, and training
time of thirty-three old and new classification algorithms. Mach. Learn. 40, 203–228 (2000)
29. P.E. Utgoff, Incremental induction of decision trees. Mach. Learn. 4, 161–186 (1989)
30. J.R. Quinlan, R.L. Rivest, Inferring decision trees using the minimum description length prin-
ciple. Inf. Comput. 80, 227–248 (1989)
31. J.R. Quinlan, Simplifying decision trees. Int. J. Man-Mach. Stud. 27, 221–234 (1987)
32. L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis
(Wiley, 2009)
33. T. Zhang, R. Ramakrishnan, M. Livny, Birch: an efficient data clustering method for very large
databases, in Proceedings of SIGMOD’96 (1996)
34. S. Guha, R. Rastogi, K. Shim, Rock: a robust clustering algorithm for categorical attributes, in
Proceedings in ICDE’99 Sydney, Australia (1999), pp. 512–521
35. G. Karypis, E.-H. Han, V. Kumar, Chameleon: a hierarchical clustering algorithm using
dynamic modeling. Computer 32, 68–75 (1999)
36. K.S. Fu, Syntactic Pattern Recognition and Applications (Prentice-Hall, Englewood Cliffs, NJ,
1982)
37. N. Chomsky, Three models for the description of language. IRE Trans. Inf. Theory 2, 113–124
(1956)
38. H. J. Zimmermann, B.R. Gaines, L.A. Zadeh, Fuzzy Sets and Decision Analysis (North Holland,
Amsterdam, New York, 1984). ISBN 0444865934
39. Donald Ervin Knuth, On the translation of languages from left to right. Inf. Control 8(6),
607–639 (1965)
40. D. Marcus, Graph Theory: A Problem Oriented Approach, 1st edn. (The Mathematical Asso-
ciation of America, 2008). ISBN 0883857537
41. D.H. Ballard, C.M. Brown, Computer Vision (Prentice Hall, 1982). ISBN 978-0131653160
42. A. Barrero, Three models for the description of language. Pattern Recognit. 24(1), 1–8 (1991)
43. R.E. Woods, R.C. Gonzalez, Digital Image Processing, 2nd edn. (Prentice Hall, 2002). ISBN
0201180758
44. P.H. Winston, Artificial Intelligence (Addison-Wesley, 1984). ISBN 0201082594
45. D.E. Knuth, J.H. Morris, V.B. Pratt, Fast pattern matching in strings. SIAM J. Comput. 6(1),
323–350 (1977)
46. R.S. Boyer, J.S. Moore, A fast string searching algorithm. Commun. ACM 20(10), 762–772
(1977)
47. Hume and Sunday, Fast string searching. Softw. Pract. Exp. 21(11), 1221–1248 (1991)
48. T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms (MIT Press
and McGraw-Hill, 2001). ISBN 0-262-03293-7
49. R. Nigel Horspool, Practical fast searching in strings. Softw. Pract. Exp., 10(6), 501–506 (1980)
50. D.M. Sunday, A very fast substring search algorithm. Commun. ACM 33(8), 132–142 (1990)
51. N. Gonzalo, A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88
(2001)
52. P. Clifford, R. Clifford, Simple deterministic wildcard matching. Inf. Process. Lett. 101(2),
53–54 (2007)
2 RBF, SOM, Hopfield, and Deep Neural Networks

2.1 Introduction

We begin by describing three different neural network architectures: Radial Basis Functions (RBF), Self-Organizing Maps (SOM), and the Hopfield network. The Hopfield network has the ability to memorize information and recover it through
partial contents of the original information. As we shall see, it presents its originality
based on physical foundations that have revitalized the entire sector of neural net-
works. The network is associated with an energy function to be minimized during
its evolution with a succession of states until it reaches a final state corresponding to
the minimum of the energy function. This characteristic allows it to be used to solve
and set an optimization problem in terms of objective function to be associated with
an energy function.
The SOM network, instead, has an unsupervised learning model; its originality lies in autonomously grouping the input data on the basis of their similarity, without evaluating a convergence error against external information on the data. It is useful when we have no exact knowledge of the data with which to classify them. It is inspired by the topology of the cortex of the brain, considering the connectivity of neurons and, in particular, the behavior of an activated neuron and its influence on neighboring neurons, which reinforce their bonds, with respect to those further away, whose bonds become weaker. Extensions of the SOM bring it back to supervised versions, as in the Learning Vector Quantization versions SOM-LVQ1, LVQ2, etc., which essentially serve to label the classes and refine the decision boundaries.
The RBF network uses the same neuron model as the MLP but differs in its
architectural simplification of the network and of the activation function (based on
the radial base function) that implements the Cover theorem. In fact, RBF provides
only one hidden layer, and the output layer consists of only one neuron. The MLP network is more vulnerable in the presence of noise on the data, while the RBF network is more robust to noise thanks to the radial basis functions and to the linear combination of the hidden-layer outputs (the MLP instead uses nonlinear activation functions).

The design of a supervised neural network can be done in a variety of ways. The backpropagation algorithm for a multilayer (supervised) network, introduced in the previous chapter, can be seen as the application of a recursive technique known in statistics as stochastic approximation. RBF uses a different approach, designing the neural network as a “curve fitting” problem, that is, the resolution of an approximation problem in a very high dimensional space. Learning is thus reduced to finding a surface in a multidimensional space that provides the best “fit” for the training data, where the best fit is meant to be measured statistically. Similarly, the generalization phase is equivalent to using this multidimensional surface, found with the training data, to interpolate test data never seen before by the network.
The network is structured on three levels: input, hidden, and output. The input layer is directly connected with the environment, that is, its nodes are directly connected with the sensory units (raw data) or with the output of a feature extraction subsystem.
The hidden layer (unique in the network) is composed of neurons in which radial-
based functions are defined, hence the name of radial basis functions, and which
performs a nonlinear transformation of the input data supplied to the network. These
neurons form the basis for input data (vectors). The output layer is linear, which
provides the network response for the presented input pattern.
The reason for using a nonlinear transformation in the hidden layer followed by a linear one in the output layer is described in an article by Cover (1965), according to which a pattern classification problem cast in a higher dimensional space (i.e., through the nonlinear transformation from the input layer to the hidden one) is more likely to be linearly separable than in a lower dimensional space. From this observation derives the reason why the hidden layer is generally larger than the input one (i.e., the number of hidden neurons is much greater than the cardinality of the input signal).

2.2 Cover Theorem on Pattern Separability

A complex problem of automatic pattern recognition is solved, through the use of a radial basis function neural network, by transforming the space of the problem into a higher dimensional one in a nonlinear way. Cover's theorem on the separability of patterns is stated as follows:

Theorem 1 (Cover theorem) A complex pattern recognition problem transformed nonlinearly into a higher dimensional space is more likely to be linearly separable than in a lower dimensional space, provided that the space is not densely populated.

Let C be a set of N patterns (vectors) x1 , x2 , …, xN , to each of which one of the two


classes C 1 or C 2 is assigned. This binary partition is separable if there is a surface
such that it separates the points of class C 1 from those of class C 2 . Suppose that a
generic vector x ∈ C is m0 -dimensional, and that we define a set of real functions

{ϕi(x) | i = 1, ..., m1} which transform the m0-dimensional input space into a new m1-dimensional space, as follows:
 T
ϕ(x) = ϕ1 (x), ϕ2 (x), . . . , ϕm1 (x) (2.1)

The function ϕ, therefore, performs the nonlinear transformation from a space into a larger one (m1 > m0); it corresponds to the neurons of the hidden layer of the RBF network. A binary partition (dichotomy) [C1, C2] of C is said to be ϕ-separable if there exists an m1-dimensional vector w such that we can write:

wT ϕ(x) > 0 if x ∈ C1 (2.2)


wT ϕ(x) < 0 if x ∈ C2 (2.3)

where the equation


wT ϕ(x) = 0 (2.4)

indicates the separating hyperplane, that is, it describes the surface of separation between the two classes in the ϕ space (or hidden space). The surfaces of separation between populations of objects can be hyperplanes (first order), quadrics (second order), or hyperspheres (quadrics with some linear constraints on the coefficients).
In Fig. 2.1 the three types of separability are shown. In general, linear separability
implies spherical separability which, in turn, implies quadratic separability. The
reverse is not necessarily true. Two key points of Cover’s theorem can be summarized
as follows:

1. Nonlinear formulation of the functions of the hidden layer defined by ϕi (x) with
x the input vector and i = 1, . . . , m1 the cardinality of the layer.
2. A dimensionality of the hidden space greater than that of the input space, determined as we have seen by the value of m1 (i.e., the number of neurons in the hidden layer).

Fig. 2.1 Examples of binary partition for different sets of five points in a 2D space: a linearly separable, b spherically separable, c quadrically separable

Table 2.1 Nonlinear transformation of two-dimensional input patterns x

Input pattern x    First hidden function ϕ1    Second hidden function ϕ2
(1, 1)             1                           0.1353
(0, 1)             0.3678                      0.3678
(0, 0)             0.1353                      1
(1, 0)             0.3678                      0.3678

It should be noted that in some cases it may be sufficient to satisfy only point 1, i.e., the nonlinear transformation, without enlarging the input space through an increased number of hidden-layer neurons (point 2), in order to obtain linear separability. The XOR example illustrates this last observation. Consider 4 points in a 2D space, (0, 0), (1, 0), (0, 1), and (1, 1), on which we construct an RBF neural network that solves the XOR function. It was observed previously that the single perceptron is not able to represent this type of function, because the problem is not linearly separable. Let us see how, using the Cover theorem, it is possible to obtain linear separability following a nonlinear transformation of the four points. We define two Gaussian transformation functions as follows:

ϕ1(x) = e^(−‖x − t1‖²),   t1 = [1, 1]^T
ϕ2(x) = e^(−‖x − t2‖²),   t2 = [0, 0]^T

Table 2.1 shows the values of the nonlinear transformation of the four points considered, and Fig. 2.2 their representation in the ϕ space. We can observe how they become linearly separable after the nonlinear transformation carried out with the help of the Gaussian functions defined above.
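The transformation of Table 2.1 can be reproduced with a few lines of Python (a sketch, not taken from the text; the separating line mentioned in the final comment is only one of the many possible decision boundaries):

import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])     # centers of the two Gaussians
phi = lambda x, t: np.exp(-np.sum((x - t) ** 2))        # phi(x) = exp(-||x - t||^2)

for (a, b), lab in zip([(1, 1), (0, 1), (0, 0), (1, 0)], [0, 1, 0, 1]):
    p = np.array([a, b], dtype=float)
    z1, z2 = phi(p, t1), phi(p, t2)
    print((a, b), "->", float(round(z1, 4)), float(round(z2, 4)), "XOR =", lab)

# In the (phi1, phi2) plane the two XOR classes become linearly separable:
# for example, any line of the form phi1 + phi2 = constant, with the constant
# between about 0.736 and 1.135, separates them.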

2.3 The Problem of Interpolation

Cover's theorem shows that there is a clear benefit in applying a nonlinear transformation from the input space into a new, higher dimensional one in order to make the patterns of a recognition problem separable.
A nonlinear mapping is mainly used to transform a nonlinearly separable classification problem into a linearly separable one. Similarly, a nonlinear mapping can be used to transform a nonlinear filtering problem into one that involves linear filtering.
For simplicity, consider a feedforward network with an input layer, a hidden layer,
and an output layer, with the latter consisting of only one neuron. The network, in
this case, operates a nonlinear transformation from the input layer into the hidden
one followed by a linear one from the hidden layer into the output one. If m0 always
indicates the dimensionality of the input layer (therefore, m0 neurons in the input

[Plot: the images of the four XOR points in the (ϕ1, ϕ2) plane, with a separating line (decision boundary) between the images of (1, 1), (0, 0) and those of (0, 1), (1, 0).]

Fig. 2.2 Representation of the nonlinear transformation of the four points for the XOR problem
that become linearly separable in the space ϕ

layer), the network transforms from an m0-dimensional space into a one-dimensional one, since in output we have only one neuron, so the mapping function is expressed as

s : ℝ^m0 → ℝ^1     (2.5)

We can think of the mapping function s as a hypersurface Γ ⊂ ℝ^(m0+1), analogously to the elementary mapping function s : ℝ^1 → ℝ^1 with s(x) = x², a parabola in the space ℝ^2. The surface Γ is, therefore, a multidimensional plot of the network output as a function of the input. Generally, the surface Γ is unknown, and the training data are contaminated by noise. The two important phases of a classifier, training and test (or generalization), can be seen as follows:

(a) the training phase is the optimization of a fitting procedure for the surface Γ, starting from known examples (i.e., training data) that are presented to the network as input–output pairs (patterns).
(b) The generalization phase is synonymous with interpolation between the data, with the interpolation performed along a surface obtained by the training procedure using optimization techniques that yield a surface Γ close to the real one.

With these premises, we are in the presence of a multivariate interpolation problem in a high-dimensional space. The interpolation problem can be formalized as follows.

Given a set of N different points {xi ∈ ℝ^m0 | i = 1, . . . , N} and a corresponding set of N real numbers {di ∈ ℝ^1 | i = 1, . . . , N}, find a function F : ℝ^m0 → ℝ^1 which satisfies the following interpolation condition

F(xi) = di   i = 1, 2, . . . , N     (2.6)

For the interpolation to be exact, the interpolating surface (therefore, the function F) must pass through all the points of the training data. The RBF technique consists of choosing a function F with the following form:

F(x) = Σ_{i=1}^N wi ϕ(‖x − xi‖)     (2.7)

where {ϕ(‖x − xi‖) | i = 1, 2, . . . , N} is a set of N generally nonlinear functions, called radial basis functions, and ‖·‖ denotes the Euclidean norm. The known points of the dataset xi ∈ ℝ^m0, i = 1, 2, . . . , N are the centers of the radial functions.
By inserting the interpolation condition (2.6) in (2.7), we obtain the set of linear
equations for the expansion coefficients (or weights) {wi }
⎡ ⎤⎡ ⎤ ⎡ ⎤
ϕ11 ϕ12 · · · ϕ1N w1 d1
⎢ ϕ21 ϕ22 · · · ϕ2N ⎥ ⎢ w2 ⎥ ⎢ d2 ⎥
⎢ ⎥⎢ ⎥ ⎢ ⎥
⎢ .. .. .. .. ⎥ ⎢ .. ⎥ = ⎢ .. ⎥ (2.8)
⎣ . . . . ⎦ ⎣ . ⎦ ⎣ . ⎦
ϕN 1 ϕN 2 · · · ϕNN wN dN

with
ϕji = ϕ(‖xj − xi‖)   (j, i) = 1, 2, . . . , N     (2.9)

Let
d = [d1 , d2 , . . . , dN ]T

w = [w1 , w2 , . . . , wN ]T

be the vectors d and w of size N × 1 representing respectively the desired responses


and weights, with N the size of the training sample. Let Φ be the N × N matrix of elements ϕji

Φ = {ϕji | (j, i) = 1, 2, . . . , N}     (2.10)

which will be denoted the interpolation matrix. Rewriting Eq. (2.8) in compact form, we get

Φw = d     (2.11)

Assuming that Φ is a non-singular matrix, its inverse Φ⁻¹ exists, and therefore, the solution of Eq. (2.11) for the weights w is given by

w = Φ⁻¹ d     (2.12)
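As a numerical illustration (the data and the Gaussian kernel width are arbitrary choices, not taken from the text), the interpolation system can be built and solved with NumPy:

import numpy as np

def gaussian(r, sigma=1.0):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# N distinct sample points x_i in R^1 and desired responses d_i (toy data)
x = np.linspace(-2.0, 2.0, 7).reshape(-1, 1)
d = np.sin(x).ravel()

# Interpolation matrix Phi with elements phi_ji = phi(||x_j - x_i||)
Phi = gaussian(np.abs(x - x.T))
w = np.linalg.solve(Phi, d)                          # w = Phi^{-1} d, Eq. (2.12)

F = lambda t: gaussian(np.abs(t - x.ravel())) @ w    # F(t) = sum_i w_i phi(||t - x_i||)
print(np.allclose([F(xi) for xi in x.ravel()], d))   # True: F passes through all the data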

Now how do we make sure that Φ is non-singular?


The following theorem gives us an answer in this regard.

2.4 Micchelli’s Theorem

Theorem 2 (Micchelli's theorem) Let {xi}_{i=1}^N be a set of distinct points of ℝ^m0. Then the N × N interpolation matrix Φ, whose elements are ϕji = ϕ(‖xj − xi‖), is non-singular.

There is a vast class of radial basis functions that satisfies Micchelli's theorem; in particular:

Gaussian
    ϕ(r) = exp(−r²/(2σ²))     (2.13)
where σ > 0 and r ∈ ℝ.

Multiquadrics
    ϕ(r) = √(r² + σ²)/σ²     (2.14)

Inverse multiquadrics
    ϕ(r) = σ²/√(r² + σ²)     (2.15)

Cauchy
    ϕ(r) = σ²/(r² + σ²)     (2.16)

The four functions described above are depicted in Fig. 2.3. For the radial functions defined in Eqs. (2.13)–(2.16) to yield a non-singular interpolation matrix, all the points of the dataset {xi}, i = 1, . . . , N, must be distinct from one another, regardless of the sample size N and the dimensionality m0 of the input vectors xi.
The inverse multiquadrics (2.15), the Cauchy function (2.16), and the Gaussian function (2.13) share the same property, that is, they are localized functions in the sense that ϕ(r) → 0 for r → ∞. In all these cases, the interpolation matrix Φ is positive definite.
Fig. 2.3 Radial-based functions that satisfy Micchelli’s theorem

In contrast, the family of multiquadric functions defined in (2.14) is nonlocal, because ϕ(r) becomes unbounded for r → ∞, and the corresponding interpolation matrix Φ has N − 1 negative eigenvalues and only one positive eigenvalue, with the consequence that it is not positive definite. It can nevertheless be established that an interpolation matrix Φ based on multiquadric functions (introduced by Hardy [1]) is non-singular, and therefore, suitable for designing an RBF network. Furthermore, it can be remarked that radial basis functions that grow to infinity, such as multiquadrics, can be used to approximate a smooth input–output mapping with great accuracy, compared with those that make the interpolation matrix Φ positive definite (this result can be found in Powell [2]).
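A small NumPy sketch of the four kernels follows, using the forms of (2.13)–(2.16) as reconstructed above (σ is a free width parameter; plotting the functions for σ = 1 reproduces the qualitative behaviour of Fig. 2.3, with the multiquadric growing without bound and the other three decaying to zero):

import numpy as np

def gaussian(r, s=1.0):
    return np.exp(-r ** 2 / (2 * s ** 2))            # Eq. (2.13)

def multiquadric(r, s=1.0):
    return np.sqrt(r ** 2 + s ** 2) / s ** 2         # Eq. (2.14)

def inverse_multiquadric(r, s=1.0):
    return s ** 2 / np.sqrt(r ** 2 + s ** 2)         # Eq. (2.15)

def cauchy(r, s=1.0):
    return s ** 2 / (r ** 2 + s ** 2)                # Eq. (2.16)

r = np.linspace(-3, 3, 7)
for name, f in [("Gaussian", gaussian), ("Multiquadric", multiquadric),
                ("Inverse multiquadric", inverse_multiquadric), ("Cauchy", cauchy)]:
    print(f"{name:22s}", np.round(f(r), 3))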

2.5 Learning and Ill-Posed Problems

The interpolation procedure described so far may not give good results when the network has to generalize (i.e., respond to examples never seen before). This problem arises when the number of training samples is far greater than the degrees of freedom of the physical process we want to model; in that case, being forced to use as many radial functions as there are training data results in an oversized problem.
In this case, the network tries to fit the mapping function as closely as possible, responding precisely to data seen during the training phase but failing on data never seen before. The result is that the network generalizes poorly, giving rise to the problem of overfitting. In general, learning means finding the hypersurface (for multidimensional problems) that allows the network to respond (generate an output) to the input provided. This mapping is defined by the hypersurface equation found in the learning phase. So learning can be seen as a hypersurface reconstruction problem, given a set of examples that may be sparse.
There are two types of problems that are generally encountered: ill-posed problems and well-posed problems. Let us see what they consist of. Suppose we have a domain X and a set Y in some metric space, related to each other by an unknown function f, which is the objective of learning. The problem of reconstructing the mapping function f is said to be well-posed if it satisfies the following three conditions:

1. Existence. ∀ x ∈ X ∃ y = f (x) such that y ∈ Y


2. Uniqueness. ∀ x, t ∈ X : f (x) = f (t) if and only if x = t
3. Continuity. The mapping function f is continuous, that is,

∀ ε > 0 ∃ δ = δ(ε) such that ρX(x, t) < δ ⇒ ρY(f(x), f(t)) < ε

where ρ(•, •) represents the distance symbol between the two arguments in their
respective spaces. The continuity property is also referred to as being the property
of stability.

If any of these conditions is not met, the problem is said to be ill-posed. In ill-posed problems, even very large datasets of examples may contain little information about the problem to be solved. The physical phenomena responsible for generating the training datasets (for example, speech, radar signals, sonar signals, images, etc.) are well-posed problems. However, learning from these forms of physical signals, seen as the reconstruction of a hypersurface, is an ill-posed problem for the following reasons.
The existence criterion can be violated when a distinct output does not exist for each input. There may not be enough information in the training dataset to reconstruct the input–output mapping function uniquely; therefore, the uniqueness criterion could be violated. The noise or inaccuracies present in the training data add uncertainty to the input–output mapping surface; this violates the continuity criterion, since if there is a lot of noise in the data, it is likely that the desired output y falls outside the range Y for a specified input vector x ∈ X.
Paraphrasing Lanczos [3], we can say that there is no mathematical artifice to remedy the information missing from the training data. An important result on how to turn an ill-posed problem into a well-posed one comes from the theory of regularization.

2.6 Regularization Theory

Regularization theory was introduced by Tikhonov in 1963 for the solution of ill-posed problems. The basic idea is to stabilize the hypersurface reconstruction solution by introducing a nonnegative functional that embeds a priori information about the solution. The most common form of a priori information is the assumption that the input–output mapping function (i.e., the solution of the reconstruction problem) is smooth, in the sense that similar inputs correspond to similar outputs. Let the input and output datasets (which represent the training set) be described as follows:

Input signal:      xi ∈ ℝ^m0,   i = 1, 2, . . . , N
Desired response:  di ∈ ℝ^1,    i = 1, 2, . . . , N          (2.17)

The fact that the output is one-dimensional does not affect the generality of the extension to multidimensional output cases. Let F(x) be the mapping function to be found (the weight variable w has been dropped from the arguments of F). The Tikhonov regularization theory includes two terms:

1. Standard Error. Denoted by ξs(F), it measures the error (distance) between the desired response (target) di and the actual network response yi over the training samples i = 1, 2, . . . , N:

   ξs(F) = (1/2) Σ_{i=1}^N (di − yi)²     (2.18)
         = (1/2) Σ_{i=1}^N (di − F(xi))²     (2.19)

where 1/2 represents a scale factor.


2. Regularization. The second term, denoted by ξc(F), depends on the geometric properties of the approximating function F(x) and is defined by

   ξc(F) = (1/2) ‖DF‖²     (2.20)
where D is a linear differential operator. The a priori information of the input–
output mapping function is enclosed in D, which makes the selection of D depen-
dent problems. The D operator is also a stabilizer, since it makes the solution of
type smooth satisfying the condition of continuity.

So the quantity that must be minimized in regularization theory is the following:

   ξ(F) = ξs(F) + λ ξc(F)
        = (1/2) Σ_{i=1}^N [di − F(xi)]² + (1/2) λ ‖DF‖²

where λ is a positive real number called the regularization parameter and ξ(F) is called a Tikhonov functional. Indicating with Fλ(x) the surface that minimizes the functional ξ(F), we can see the regularization parameter λ as an indicator of the sufficiency of the training set to specify the solution Fλ(x). In particular, in the limiting case λ → 0, the problem is unconstrained and the solution Fλ(x) is completely determined by the training examples. The other limiting case, λ → ∞, implies that the continuity constraint introduced by the smoothing operator D is by itself sufficient to specify the solution Fλ(x), which is another way of saying that the examples are unreliable. In practical applications, the parameter λ is assigned a value between the two boundary conditions, so that both the training examples and the a priori information contribute together to the solution Fλ(x). After a series of steps, we arrive at the following formulation of the solution to the regularization problem:

   Fλ(x) = (1/λ) Σ_{i=1}^N [di − F(xi)] G(x, xi)     (2.21)

where G(x, xi ) is called the Green function which we will see later on as one of
the radial-based functions. The Eq. (2.21) establishes that the minimum solution
Fλ (x) to the regularization problem is the superposition of N Green functions. The
vectors of the sample xi represent the expansion centers, and the weights [di −
F(xi )]/λ represent the expansion coefficients. In other words, the solution to the
regularization problem lies in a N -dimensional subspace of the space of smoothing
functions, and the set of Green functions {G(x, xi )} centered in xi , i = 1, 2, . . . , N
form a basis for this subspace. Note that the expansion coefficients in (2.21) are:
linear in the error estimation defined as the difference between the desired response
di and the corresponding output of the network F(xi ); and inversely proportional to
the regularization parameter λ.
Let us now calculate the expansion coefficients, which are not known, defined by

   wi = (1/λ) [di − F(xi)],   i = 1, 2, . . . , N     (2.22)
We rewrite the (2.21) as follows:


N
Fλ (x) = wi G(x, xi ) (2.23)
i=1

and evaluating (2.23) at xj for j = 1, 2, . . . , N we get

   Fλ(xj) = Σ_{i=1}^N wi G(xj, xi),   j = 1, 2, . . . , N     (2.24)

We now introduce the following definitions:

Fλ = [Fλ (x1 ), Fλ (x2 ), . . . , Fλ (xN )]T (2.25)



d = [d1 , d2 , . . . , dN ]T (2.26)

⎡ ⎤
G(x1 , x1 ) G(x1 , x2 ) · · · G(x1 , xN )
⎢ G(x2 , x1 ) G(x2 , x2 ) · · · G(x2 , xN ) ⎥
⎢ ⎥
G=⎢ .. .. .. .. ⎥ (2.27)
⎣ . . . . ⎦
G(xN , x1 ) G(xN , x2 ) · · · G(xN , xN )

w = [w1 , w2 , . . . , wN ]T (2.28)

we can rewrite (2.22) and (2.24) in matrix form as follows:

   w = (1/λ)(d − Fλ)     (2.29)

and

   Fλ = Gw     (2.30)

Eliminating Fλ from (2.29) and (2.30), we get

   (G + λI)w = d     (2.31)

where I is the identity matrix N × N . The G matrix is named Green matrix. Green’s
functions are symmetric (for some classes of functions seen above), namely

G(xi , xj ) = G(xj , xi ) ∀ i, j, (2.32)

and therefore, the Green matrix is also symmetric and positive definite if all the points
of the sample are distinct between them and we have

GT = G (2.33)

We can think of having a regularization parameter λ big enough to ensure that (G +


λI) is positive definite, and therefore, invertible. This implies that the system of linear
equations defined in (2.31) has one and only one solution given by

w = (G + λI)−1 d (2.34)

This equation allows us to obtain the vector of weights w having identified the Green
function G(xj , xi ) for i = 1, 2, . . . , N ; the desired answer d; and an appropriate value
of the regularization parameter λ. In conclusion, it can be established that a solution
to the regularization problem is provided by the following expansion:


   Fλ(x) = Σ_{i=1}^N wi G(x, xi)     (2.35)

This equation establishes the following considerations:

(a) the approach based on the regularization theory is equivalent to the expansion
of the solution in terms of Green functions, characterized only by the form of
the stabilizer D and by the associated boundary conditions;
(b) the number of Green functions used in the expansion is equal to the number of
examples used in the training process.

The characterization of the Green functions G(x, xi) for a specific center xi depends only on the stabilizer D, that is, on the a priori information about the input–output mapping. If this stabilizer is invariant under translation, then the Green function centered in xi depends only on the difference between the two arguments

G(x, xi) = G(x − xi)     (2.36)

If, in addition, the stabilizer is invariant under both translation and rotation, then the Green function depends only on the Euclidean norm of the difference of its two arguments, namely

G(x, xi) = G(‖x − xi‖).     (2.37)

So under these conditions Green’s functions must be radial-based functions. Then


the solution (2.24) can be rewritten as follows:


F_\lambda(x) = \sum_{i=1}^{N} w_i\, G(\|x - x_i\|)   (2.38)

Therefore, the solution is entirely determined by the N training vectors that help to
construct the interpolating surface F(x).
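To make these steps concrete, here is a minimal NumPy sketch (our own illustrative code, not from the book; the names green_matrix, fit_regularized, and predict are assumptions) that builds the Gaussian Green matrix of Eq. (2.27) and solves the regularized linear system (2.31), i.e., Eq. (2.34):

```python
import numpy as np

def green_matrix(X, centers, sigma=1.0):
    """Gaussian Green/kernel matrix G[j, i] = G(x_j, x_i), as in (2.27)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_regularized(X, d, lam=0.1, sigma=1.0):
    """Solve (G + lambda*I) w = d, Eqs. (2.31) and (2.34)."""
    G = green_matrix(X, X, sigma)                      # centers coincide with the samples
    return np.linalg.solve(G + lam * np.eye(len(X)), d)

def predict(X_new, centers, w, sigma=1.0):
    """Evaluate F_lambda(x) = sum_i w_i G(x, x_i), Eq. (2.35)."""
    return green_matrix(X_new, centers, sigma) @ w

# Usage on a small synthetic 1D regression problem
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
d = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(20)
w = fit_regularized(X, d, lam=0.05, sigma=0.1)
y = predict(X, X, w, sigma=0.1)
```

Here λ plays exactly the role discussed above: it trades fidelity to the training data against smoothness of the interpolating surface.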

2.7 RBF Network Architecture

As previously mentioned, the network is composed of three layers: input, hidden, and output, as shown in Fig. 2.4. The first layer (input) is made up of m_0 nodes, representing the size of the input vector x. The second layer, the hidden one, consists of m_1 nonlinear radial basis functions ϕ connected to the input layer. In some cases the size m_1 = N; in others (especially when the training set is very large) it differs (m_1 ≪ N), as we will see later.

Fig. 2.4 RBF network architecture

The output layer consists of a single linear neuron (but it can also be composed of several output neurons) fully connected to the hidden layer. By linear, we mean that the output neuron computes its output value as the weighted sum of the outputs of the neurons of the hidden layer. The weights w_i of the output layer represent the unknown variables, which also depend on the Green functions G(\|x - x_i\|) and on the regularization parameter λ.
The Green functions G(\|x - x_i\|) are positive definite for each i, and one of the forms satisfying this property is the Gaussian

G(x, x_i) = \exp\left(-\frac{1}{2\sigma_i^2}\,\|x - x_i\|^2\right)   (2.39)

where x_i represents the center of the function and σ_i its width. Under the condition that the Green functions are positive definite, the solution produced by the network is an optimal interpolation in the sense that it minimizes the cost function ξ(F) seen previously. Recall that this cost function indicates how much the solution produced by the network deviates from the true data represented by the training set. Optimality is, therefore, closely related to the search for the minimum of this cost function ξ(F).
Figure 2.4 also shows the bias (a variable independent of the data) applied to the output layer. It is represented by setting one of the linear weights equal to the bias, w_0 = b, and treating the associated radial function as a constant equal to +1.
In conclusion, to solve an RBF network, knowing in advance the input data and the shape of the radial basis functions, the variables to be determined are the linear weights w_i and the centers x_i of the radial basis functions.

2.8 RBF Network Solution

Let {ϕ_i(x) | i = 1, 2, \ldots, m_1} be the family of radial functions of the hidden layer, which we assume to be linearly independent. We therefore define

\varphi_i(x) = G(\|x - t_i\|), \qquad i = 1, 2, \ldots, m_1   (2.40)

where t_i are the centers of the radial functions, to be determined. When the training data are few, or computationally tractable in number, these centers coincide with the training data, that is, t_i = x_i for i = 1, 2, \ldots, N. The new interpolating solution F^* is then given by the following equation:


m1
F ∗ (x) = wi G(x, ti )
i=1
m1
= wi G(x − ti )
i=1

which defines the new interpolating function with the new weights {w_i | i = 1, 2, \ldots, m_1} to be determined in order to minimize the new cost function

\xi(F^*) = \sum_{i=1}^{N}\left(d_i - \sum_{j=1}^{m_1} w_j\, G(\|x_i - t_j\|)\right)^2 + \lambda\,\|DF^*\|^2   (2.41)

The first term on the right-hand side of this equation can be expressed as the squared Euclidean norm \|d - Gw\|^2, where

d = [d1 , d2 , . . . , dN ]T (2.42)

G = \begin{bmatrix} G(x_1, t_1) & G(x_1, t_2) & \cdots & G(x_1, t_{m_1}) \\ G(x_2, t_1) & G(x_2, t_2) & \cdots & G(x_2, t_{m_1}) \\ \vdots & \vdots & \ddots & \vdots \\ G(x_N, t_1) & G(x_N, t_2) & \cdots & G(x_N, t_{m_1}) \end{bmatrix}   (2.43)

w = [w1 , w2 , . . . , wm1 ]T (2.44)

Now the matrix G of Green's functions is no longer symmetric but of size N × m_1, the vector of desired responses d is, as before, of size N, and the weight vector w is of size m_1 × 1. From Eq. (2.24) we note that the approximating function is a linear combination of Green's functions for a given stabilizer D. Expanding the second term of (2.41), and omitting the intermediate steps, we arrive at the following result:

\|DF^*\|^2 = w^T G_0\, w   (2.45)

where G0 is a symmetric matrix of size m1 × m1 defined as follows:


G_0 = \begin{bmatrix} G(t_1, t_1) & G(t_1, t_2) & \cdots & G(t_1, t_{m_1}) \\ G(t_2, t_1) & G(t_2, t_2) & \cdots & G(t_2, t_{m_1}) \\ \vdots & \vdots & \ddots & \vdots \\ G(t_{m_1}, t_1) & G(t_{m_1}, t_2) & \cdots & G(t_{m_1}, t_{m_1}) \end{bmatrix}   (2.46)

Minimizing (2.41) with respect to the weight vector w, we arrive at the following equation:
(GT G + λG0 )w = GT d (2.47)

For λ → 0, the weight vector w converges to the pseudo-inverse (minimum norm) solution of the least squares fitting problem for m_1 < N, so we have

w = G+ d, λ = 0 (2.48)

where G+ represents the pseudo-inverse of the matrix G, that is

G+ = (GT G)−1 GT (2.49)

Equation (2.48) represents the solution to the problem of learning the weights of an RBF network. Let us now examine the RBF learning strategies that, starting from a training set, describe different ways to obtain (in addition to the weight vector w) the centers of the radial basis functions of the hidden layer and their standard deviations.
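For illustration, here is a minimal NumPy sketch of the pseudo-inverse solution (2.48)–(2.49) with m_1 < N centers (our own code; np.linalg.pinv computes the pseudo-inverse via SVD, anticipating the next section):

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """N x m1 design matrix G of Eq. (2.43) with Gaussian radial functions."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_weights(X, d, centers, sigma):
    """Minimum-norm least squares weights w = G+ d, Eqs. (2.48)-(2.49)."""
    G = rbf_design(X, centers, sigma)
    return np.linalg.pinv(G) @ d

# Usage: m1 = 5 centers chosen among N = 50 samples
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 1))
d = np.cos(3 * X[:, 0])
centers = X[rng.choice(len(X), size=5, replace=False)]
w = fit_weights(X, d, centers, sigma=0.2)
```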

2.9 Learning Strategies

So far, the solution of the RBF network has been found in terms of the weights between the hidden and output layers, which are closely related to how the activation functions of the hidden layer are configured and possibly evolve over time. There are different approaches to initializing the radial basis functions of the hidden layer; in the following, we describe some of them.

2.9.1 Centers Fixed and Randomly Selected

The simplest approach is to fix the centers of the Gaussians (the radial basis functions of the hidden layer), choosing them randomly from the available training dataset. We can use an isotropic Gaussian function whose standard deviation σ is fixed according to the dispersion of the centers, that is, a normalized version of the radial basis function centered in t_i

 
G(\|x - t_i\|^2) = \exp\left(-\frac{m_1}{d_{max}^2}\,\|x - t_i\|^2\right), \qquad i = 1, 2, \ldots, m_1   (2.50)

where m_1 is the number of centers (i.e., of neurons of the hidden layer) and d_{max} is the maximum distance between the chosen centers. The standard deviation (width) of the Gaussian radial functions is fixed to

\sigma = \frac{d_{max}}{\sqrt{2 m_1}}.   (2.51)
This choice ensures that the resulting radial functions avoid the two possible extremes, that is, being too peaked or too flat. As an alternative to Eq. (2.51), we can use different versions of the radial functions, for example, very large standard deviations in areas where the data are very dispersed, and vice versa. This, however, presupposes a preliminary study of the distribution of the available training data. The parameters that remain to be determined are the weights of the connections going from the hidden layer to the output layer, found with the pseudo-inverse method of G described above in (2.48) and (2.49). The matrix G is defined as follows:

G = {gji } (2.52)

with

g_{ji} = \exp\left(-\frac{m_1}{d_{max}^2}\,\|x_j - t_i\|^2\right), \qquad i = 1, 2, \ldots, m_1; \; j = 1, 2, \ldots, N   (2.53)
where x_j is the jth vector of the training set. Note that if the samples are reasonably1 few, so as not to affect the computational complexity considerably, one can also fix the centers of the radial functions at the training observations, i.e., t_i = x_i.
The computation of the pseudo-inverse matrix is done by the Singular Value Decomposition (SVD) as follows. Let G be an N × M matrix of real values; then there exist two orthogonal matrices
U = [u1 , u2 , . . . , uN ] (2.54)

and
V = [v1 , v2 , . . . , vM ] (2.55)

such that
UT GV = diag(σ1 , σ2 , . . . , σK ) K = min(M , N ) (2.56)

1 Always satisfying Cover's theorem described above; by reasonably we mean a size correlated with the computational complexity of the entire architecture.



where
σ1 ≥ σ2 ≥ · · · ≥ σK > 0 (2.57)

The column vectors of the matrix U are called the left singular vectors of G, while those of V are the right singular vectors of G. The values σ_1, σ_2, \ldots, σ_K are simply called the singular values of the matrix G. Thus, according to the SVD theorem, the pseudo-inverse matrix of size M × N of a matrix G is defined as follows:

G^+ = V\,\Sigma^+ U^T   (2.58)

where \Sigma^+ is the M × N matrix defined in terms of the singular values of G as follows:

\Sigma^+ = \mathrm{diag}\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_K}, 0, \ldots, 0\right)   (2.59)

Experience with the random selection of the centers shows that this method is insensitive to the use of regularization.
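The SVD construction of Eqs. (2.56)–(2.59) can be sketched as follows (illustrative code; numpy.linalg.svd returns the factors U, the singular values, and V^T):

```python
import numpy as np

def pseudo_inverse(G, tol=1e-12):
    """Pseudo-inverse G+ = V Sigma+ U^T, Eqs. (2.58)-(2.59)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)   # G = U diag(s) V^T
    s_inv = np.where(s > tol, 1.0 / s, 0.0)            # invert only non-zero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# Sanity check against NumPy's built-in pseudo-inverse
G = np.random.default_rng(2).normal(size=(50, 5))
assert np.allclose(pseudo_inverse(G), np.linalg.pinv(G))
```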

2.9.2 Selection of Centers Using Clustering Techniques

One of the problems encountered with very large datasets is the impracticality of setting the number of centers on the basis of the size of the training dataset, whether the centers are selected randomly or coincide with the training data themselves. Training datasets with millions of examples would require millions of neurons in the pattern layer, with the consequence of raising the computational complexity of the classifier. To overcome this, one can look for a number of centers lower than the cardinality of the training dataset but still descriptive of the probability distribution of the available examples. To do this, clustering techniques such as K-means, fuzzy K-means, or self-organizing maps can be used.
Therefore, in a first phase the number of prototypes to be learned with the clustering technique is set, and subsequently the weights of the RBF network are found with the radial basis functions centered on the prototypes learned in the previous phase.
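A minimal sketch of this two-phase procedure, assuming scikit-learn's KMeans is available for the clustering step (any clustering routine would do; the function name and the small ridge term lam are our own choices), might look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf_with_kmeans(X, d, n_centers=10, lam=1e-3):
    """Phase 1: learn prototypes with K-means; Phase 2: solve for the linear weights."""
    km = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X)
    centers = km.cluster_centers_
    # Width heuristic of Eq. (2.51): sigma = d_max / sqrt(2 * m1)
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = d_max / np.sqrt(2 * n_centers)
    # Gaussian design matrix (Eq. 2.43) and regularized least squares for the weights
    G = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(2) / (2 * sigma ** 2))
    w = np.linalg.solve(G.T @ G + lam * np.eye(n_centers), G.T @ d)
    return centers, sigma, w
```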

2.10 Kohonen Neural Network

The Self-Organizing Map (SOM), also called the Kohonen map, is a computational model used to visualize and analyze high-dimensional data by projecting (mapping) them into a low-dimensional space (down to 1D), preserving as much as possible the distances between input patterns. Kohonen maps [4] establish a topological relationship between multidimensional patterns and a grid of neurons, normally 2D (the Kohonen layer), preserving the topological information, i.e., similar patterns are mapped to neighboring neurons.

In essence, this layer of neurons adapts during the learning phase, in the sense
that the positions of the individual neurons are indicators of the significant statistical
characteristics of the input stimuli. This process of spatial adaptation of input pattern
characteristics is also known as feature mapping. SOMs learn without a priori knowledge, in unsupervised mode, whence the name self-organizing networks: they are able to interact with the data, training themselves without a supervisor.
Like all neural networks, SOMs have a neuro-biological motivation, based on the
spatial organization of brain functions, as has been observed especially in the cerebral
cortex. Kohonen developed the SOM based on the studies of C. von der Malsburg [5] and on the neural field models of Amari [6]. A first emulative feature of
SOM concerns the behavior of the human brain when subjected to an input signal.
When a layer of the neural network of the human brain receives an input signal, very
close neurons are strongly excited with stronger bonds, while those at an intermediate
distance are inhibited, and distant ones are weakly excited.
Similarly in SOM, during learning, the map is partitioned into regions, each of
which represents a class of input patterns (principle of topological map formation).
Another characteristic of biological neurons, when stimulated by input signals, is that
of manifesting an activity in a coordinated way such as to differentiate themselves
from the other less excited neurons. This feature was modeled by Kohonen restricting
the adaptation of weights only to neurons in the vicinity of what will be considered
the winner (competitive learning). This last aspect is the essential characteristic of
unsupervised systems, in which the output neurons compete with each other before
being activated, with the result that only one is activated at any time. The winning
neuron is called the winner-takes-all neuron (i.e., the winning neuron takes everything).

2.10.1 Architecture of the SOM Network

Similarly to the multilayer neural networks, the SOM presents a feed-forward architecture with a layer of input neurons and a single layer of neurons, arranged on a regular 2D grid, which combine the computational and output functions (see Fig. 2.5). The computation-output layer (Kohonen layer) can also be 1D (with a single row or column of neurons) and is rarely of dimension higher than 2D. Each input neuron is connected to all the neurons of the Kohonen layer. Let x = (x_1, x_2, \ldots, x_d) be the generic d-dimensional pattern of the N input patterns to present to the network, and let PE_j (Processing Element) be the generic neuron of the 2D grid composed of M = M_r × M_c neurons arranged on M_r rows and M_c columns.
The input neurons x_i, i = 1, \ldots, d, only perform a memory function and are connected to the neurons PE_j, j = 1, \ldots, M, through the weight vectors w_j, j = 1, \ldots, M, of the same dimensionality d as the input pattern vectors. For a configuration with d input neurons and M PE neurons of the Kohonen layer, we have d · M connections in total. The activation potential y_j of the single neuron PE_j is given by the

Fig. 2.5 Kohonen network architecture. The input layer has d neurons that only have a memory
function for the input patterns x d -dimensional, while the Kohonen layer has 63 PE neurons (process
elements). The window size, centered on the winning neuron, gradually decreases with the iterations
and includes the neurons of the lateral interaction

inner product between the generic pattern vector x and the weight vector w_j:

y_j = w_j^T x = \sum_{i=1}^{d} w_{ji}\, x_i   (2.60)

The initial values of all the connection weights w_{ji}, j = 1, \ldots, M, i = 1, \ldots, d, are assigned randomly with small values. The self-organizing activity of the Kohonen network involves the following phases:

Competition: for each input pattern x, all PE neurons calculate their respective activation potentials, which provide the basis of their competition. Once the discriminant evaluation function is defined, only one PE neuron must win the competition. As a discriminant function, (2.60) can be used, and the neuron with maximum activation potential y_v is chosen as the winning neuron, as follows:

y_v = \arg\max_{j=1,\ldots,M} \left\{ y_j = \sum_{i=1}^{d} w_{ji}\, x_i \right\}   (2.61)

An alternative method uses as discriminant function the minimum Euclidean distance D_v(x) between the vector x and the weight vectors w_j to determine the winning neuron, given by

y_v = D_v(x) = \arg\min_{j=1,\ldots,M} \left\{ D_j = \|x - w_j\| = \sqrt{\sum_{i=1}^{d} (x_i - w_{ji})^2} \right\}   (2.62)

With (2.62), the neuron whose weight vector is closest to the pattern presented as input to the network is selected as the winning neuron. With the inner product, it is necessary to normalize the vectors to unit norm (|x| = |w| = 1); with the Euclidean distance, the vectors need not be normalized. The two discriminant functions are then equivalent, i.e., the weight vector with minimum Euclidean distance from the input vector is the one having the maximum inner product with the same input vector. Through the process of competition between the PE neurons, the continuous input space is transformed (mapped) into the discrete output space (the Kohonen layer).
Cooperation: this process is inspired by neuro-biological studies that demonstrate the existence of a lateral interaction, that is, a state of excitation of neurons close to the winning one. When a neuron is activated, neurons in its vicinity tend to be excited with less and less intensity as their distance from it increases.
It is shown that this lateral interaction between neurons can be modeled with a function having circular symmetry properties, such as the Gaussian or the Laplacian of Gaussian (see Sect. 1.13 Vol. II). The latter can achieve an action
of lateral reinforcement interaction for the neurons closer to the winning one and
an inhibitory action for the more distant neurons. For the SOM a similar topology
of proximity can be used to delimit the excitatory lateral interactions for a limited
neighborhood of the winning neuron. If Djv is the lateral distance between the
neuron jth and the winning one v, the Gaussian function of lateral attenuation φ
is given by
\varphi(j, v) = e^{-\frac{D_{jv}^2}{2\sigma^2}}   (2.63)

where σ indicates the circular amplitude of the lateral interaction centered on the winning neuron. The function (2.63) has its maximum value at the position of the winning neuron, has circular symmetry, decreases monotonically to zero as the distance tends to infinity, and is invariant with respect to the position of the winning neuron. Neighborhood topologies can differ (for example, 4-neighborhood, 8-neighborhood, hexagonal, etc.); what matters is the variation over time of the extension of the neighborhood σ(t), which it is useful to reduce over time until only the winning neuron remains. A method to reduce it progressively over time, that is, as the iterations of the learning process grow, is given by the following exponential form:

\sigma(t) = \sigma_0\, e^{-\frac{t}{D_{max}}}   (2.64)

where σ_0 indicates the size of the neighborhood at the initial iteration t_0 and D_{max} is the maximum extent of the lateral interaction, which decreases during the training phase. These parameters of the initial state of the network must be selected appropriately.
Adaptation: this process carries out the actual learning phase, in which the Kohonen layer self-organizes by adequately updating the weight vector of the winning neuron and the weight vectors of the neurons involved in the lateral interaction, according to the Gaussian attenuation function (2.63). In particular, for the latter the adaptation of the weights is smaller than for the winning neuron. This happens as

input pattern vectors are presented to the network. The adaptation equation (also known as Hebbian learning) for all the weights is applied immediately after determining the winning neuron v, and is given by

wji (t + 1) = wji (t) + η(t)φ(j, v)[xi (t) − wji (t)] i = 1, . . . , d (2.65)

where j indicates the jth neuron included in the lateral interaction defined by the Gaussian attenuation function, t + 1 indicates the current iteration (epoch), and η(t) controls the learning speed. The expression η(t)φ(j, v) in (2.65) represents the weight factor with which the weight vector of the winning neuron and those of the neurons included in the neighborhood of the lateral interaction φ(j, v) are modified. The latter, given by (2.63), also depends on σ(t), which, as seen in (2.64), it is useful to reduce over time (iterations). It is also useful to vary η(t) over time, starting from a maximum initial value and then reducing it exponentially as follows:

\eta(t) = \eta_0\, e^{-\frac{t}{t_{max}}}   (2.66)

where η0 indicates the maximum initial value of the learning function η and tmax
indicates the maximum number of expected iterations (learning periods).
From the geometric point of view, the effect of learning for each epoch, obtained
with the (2.65), is to adjust the weight vectors wv of the winning neuron and those
of the neighborhood of the lateral interaction and move them in the direction of
the input vector x. Repeating this process for all the patterns in the training set
realizes the self-organization of the Kohonen map, i.e., its topological ordering. In particular, we obtain a bijection between the feature space (input vectors x) and the discrete Kohonen map (winning neurons described by the weight vectors w_v). The weight vectors w can be used as pointers to identify the original vector x in the feature space (see Fig. 2.6).

The learning algorithm of the Kohonen network is summarized below in Algorithm 23.


Once the network has been initialized with the appropriate parameters, for example,

η_0 = 0.9,  η ≥ 0.1,  t_{max} ≈ 1000,  D_{max} ≈ 1000/\log σ_0,

and starting with completely random weight vectors, the initial state of the Kohonen map is totally disordered.
Presenting to the network the patterns of the training set gradually triggers the
process of self-organization of the network (see Fig. 2.8) during which the topological
ordering of the output neurons is performed with the weight vectors that map as much
as possible the input vectors (network convergence phase).
It may occur that the network converges toward a metastable state, i.e., the network
converges toward a disordered state (in the Kohonen map we have topological de-
fects). This occurs when the lateral interaction function φ(t) decreases very quickly.

Fig. 2.6 Self-organization of the SOM network. It happens through a bijection between the input
space X and the Kohonen map (in this case 2D). Presenting to the network a pattern vector x, the winning neuron PE_v, whose associated weight vector w_v is the most similar to x, is determined. Repeating this for all the training set patterns, the competitive process produces a tessellation of the input space into regions represented by the winning neurons, whose weight vectors w_{v_j}, once reprojected into the input space, realize its discretization, thus producing the Voronoi tessellation. Each Voronoi region is represented by the weight vector of the winning neuron (the prototype of the input vectors included in that region)


Fig. 2.7 Application of a 1D SOM network for classification



Algorithm 23 SOM algorithm

Initialize: t ← 0, η_0, σ_0, t_max, D_max; create a grid of M neurons by associating d-dimensional weight vectors w_j^T = (w_{j1}, \ldots, w_{jd}), j = 1, \ldots, M; N ← number of training set patterns;
Initialize: assign small random initial values to the weight vectors w_j, j = 1, \ldots, M;
repeat
  for i = 1 to N do
    t ← t + 1;
    Present to the network a pattern vector x chosen randomly from the training set;
    Compute the winning neuron w_v with (2.62);
    Update the weight vectors, including those of the neighboring neurons, with (2.65);
  end for
  Reduce η(t) and σ(t) according to (2.66) and (2.64)
until (t ≥ t_max)
end
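As a complement to Algorithm 23, the following minimal NumPy sketch (our own illustrative code, presenting one randomly chosen pattern per step on a square grid, with the schedules of Eqs. (2.63)–(2.66)) trains a small 2D SOM:

```python
import numpy as np

def train_som(X, rows=10, cols=10, t_max=1000, eta0=0.9, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=0.01, size=(rows * cols, d))       # small random initial weights
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    d_max = t_max / np.log(sigma0)                           # neighborhood time constant
    for t in range(1, t_max + 1):
        x = X[rng.integers(len(X))]                          # random training pattern
        v = np.argmin(((W - x) ** 2).sum(axis=1))            # winning neuron, Eq. (2.62)
        sigma = sigma0 * np.exp(-t / d_max)                  # shrinking neighborhood, Eq. (2.64)
        eta = eta0 * np.exp(-t / t_max)                      # decaying learning rate, Eq. (2.66)
        d2 = ((grid - grid[v]) ** 2).sum(axis=1)             # squared lateral distances on the grid
        phi = np.exp(-d2 / (2.0 * sigma ** 2))               # Gaussian attenuation, Eq. (2.63)
        W += eta * phi[:, None] * (x - W)                    # adaptation, Eq. (2.65)
    return W.reshape(rows, cols, d)

# Usage: map uniformly distributed 2D points, as in the example of Fig. 2.9
X = np.random.default_rng(1).uniform(0, 1, size=(500, 2))
W = train_som(X)
```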

Normally the convergence phase requires a number of iterations related to the number
of neurons (at least 500 times the number of neurons M ).
Like the MLP network, once the synaptic weights have been calculated with the
training phase, the SOM network is used in the test context to classify a generic pattern
vector x not presented in the training phase. Figure 2.7 shows a simple example of classification with a 1D SOM network. The number of classes is 6, each of which has ten 2D input vectors (indicated with the “+” symbol).
The network is configured with 6 neurons with associated initial weight vectors
wi = (0.5, 0.5), i = 1, . . . , 6, and the initial learning parameter η = 0.1. After the
training, the weight vectors, adequately modified by the SOM, each represent the prototype of a class. They are indicated with the symbol “◦” and are located in the
center of each cluster. The SOM network is then presented with some input vectors
(indicated with black squares), for testing the network, each correctly classified in
the class to which they belong.
The Kohonen network, by projecting a d-dimensional vector onto a discrete 2D grid, actually performs a transformation that reduces the dimensionality of the data, as happens with the principal component analysis (PCA). In essence, it realizes a nonlinear generalization of the PCA. Let us now examine some peculiar properties of the Kohonen network.

Approximation of the input space. Once the SOM algorithm converges, the resulting Kohonen map displays important statistical features of the input feature space. The SOM algorithm can be thought of as a nonlinear projection ψ which maps the continuous input space (feature space) X into the discrete output space L. This

Fig. 2.8 Example of Kohonen 1D network that groups 6 2D input vectors in 3 classes; using Matlab
the network is configured with 3 neurons. Initially, the 3 weight vectors take on small values and
are randomly oriented. As the input vectors are presented (indicated with “+”), the weight vectors
tend to move toward the most similar input vectors until they reach the final position to represent
the prototype vectors (indicated with “◦” ) of each grouping

transformation, ψ : X → L, can be seen as an abstraction that associates a large


set of input vectors {X} with a small set of prototypes {W} (the winning neurons)
which are a good approximation of the original input data (see Fig. 2.6).
This process is the basic idea of the theory of vector quantization (Vector
Quantization-VQ) which aims to reduce the dimensionality or to realize data
compression. Kohonen has developed two methods for pattern classification, a
supervised one known as Learning Vector Quantization-LVQ [7] and an unsuper-
vised one which is the SOM. The SOM algorithm behaves like a data encoder
where each neuron is connected to the input space through the relative synaptic
weights that represent a point of the input space. Through the ψ transformation,
to this neuron corresponds a region of the input space constituted by the set of
input vectors that have made the same neuron win by becoming the prototype
vector of this input region.
If we consider this in a neurobiological context, the input space can represent
the coordinates of the set of somatosensory receptors densely distributed over
the entire surface of the human body, while the output space represents the set of
neurons located in the somatosensory cortex layer where the receptors (specialized
for the perception of texture, touch, shape, etc.) are confined.
The results of the approximation of the input space depend on the adequacy of the
choice of parameters and on the initialization of the synaptic weights. A measure

of this approximation is obtained by evaluating the average quantization error E_q, which must be as small as possible, comparing the winning neurons w_v with the input vectors x. The average of \|x_i - w_v\|, computed by presenting the training set patterns again after learning, is used to estimate the error as follows:

E_q = \frac{1}{N}\sum_{i=1}^{N} \|x_i - w_v\|   (2.67)

where N is the number of input vectors x and w_v is the winning weight vector corresponding to x_i (a minimal code sketch of this measure is given after this list of properties).
Topological Ordering. The ψ transformation performed by the SOM produces a
local and global topological order, in the sense that neurons, close to each other
in the Kohonen map, represent similar patterns in the input space. In other words,
when pattern vectors are transformed (mapped) on the grid of neurons represented
by neighboring winning neurons, the latter will have associated similar patterns.
This is the direct consequence of the process of adaptation of the synaptic weights
Eq. (2.65) together with the lateral interaction function. This forces the weight vector w_v of the winning neuron to move in the direction of the input vector x, and the weight vectors w_j of the neighboring neurons to move in the direction of the winning neuron v, thus ensuring global ordering.
The topological ordering, produced by the ψ transformation with the SOM, can be
visualized thinking of this transformation as an elastic network of output neurons
placed in the input space. To illustrate this situation, consider a two-dimensional
input space. In this way, each neuron is visualized in the input space at the coordi-
nates defined by the relative weights (see Fig. 2.9). Initially, we would see a total
disorder while after the training we have the topological ordering. If a number of
neurons equal to the number of 2D input patterns are used, the neurons projected
by the respective weights in the input plane will be in the vicinity of their corre-
sponding input patterns, thus observing an image of the neuron grid ordered at
the end of the process of training.
Density Matching. The ψ transformation of SOM reflects the intrinsic variations
of the input pattern statistics. The density of the winning neurons of an ordered
map will reflect the distribution density of the patterns in the training set. Regions
of the input space with high density, from which the training patterns come, and
therefore, with a high probability of being presented to the network, will produce
winning neurons very close to each other. On the contrary, in less dense regions
of input patterns there will be scattered winning neurons. However, there will be
a better resolution for patterns with high probability than patterns presented with
low probability. SOM, therefore, tends to over-represent regions of the input with low probability and to under-represent regions with high probability. A heuristic can be used to relate the probability distribution of the input vectors p(x) to a magnification factor m(x) ∝ p^{2/3}(x) of the transformed patterns, valid for one-dimensional patterns.
Feature Selection. Given a set of input patterns with a nonlinear distribution, the
self-organizing map is able to select a set of significant features to approximate
the underlying distribution. Recall that the transform to the principal components


Fig. 2.9 Simulation of a Kohonen SOM network with 10 × 10 neurons: a uniformly distributed 2D input vectors in the range [0, 1] × [0, 1], with the overlapping weight vectors whose initial values are assigned around zero; b positions of the weight vectors, linked to each other, after 25 iterations; c weights after 300 iterations

achieves the same objective by diagonalizing the correlation matrix to obtain the
associated eigenvectors and eigenvalues. If the data do not have a linear distribu-
tion, the PCA does not work correctly while the SOM overcomes this problem
by virtue of its topological ordering property. In other words, the SOM is able to
sufficiently approximate a nonlinear distribution of data by finding the principal
surface, and can be considered as a nonlinear generalization of the PCA.
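As anticipated above, here is a minimal sketch of the average quantization error of Eq. (2.67) (our own illustrative code), applicable to a weight grid W such as the one returned by the train_som sketch given earlier:

```python
import numpy as np

def quantization_error(X, W):
    """Average distance between each input and its winning prototype, Eq. (2.67)."""
    prototypes = W.reshape(-1, W.shape[-1])
    # Distance from every input vector to every prototype, then keep the minimum
    dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```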

2.10.2 SOM Applications

Many applications have been developed with the Kohonen network. An important side benefit is that this simple network model offers plausible explanations of some neuro-biological phenomena. The Kohonen network is used in combinatorial optimization to solve the traveling salesman problem, in Economic Analysis, Data Mining, Data Compression, real-time Phoneme Recognition, and in robotics to solve the inverse kinematics problem. Several applications have been developed for signal and image
processing (segmentation, classification, texture, ...). Finally, various academic and
commercial software packages are available.
To improve the classification process, in some applications the Kohonen maps can be given as input to a supervised linear classification stage. In this case, we speak of a hybrid neural network that combines the SOM algorithm, which produces the unsupervised feature maps, with the supervised learning of a backpropagation MLP network, to achieve a more accurate and more efficient adaptive classification requiring a smaller number of iterations.

2.11 Network Learning Vector Quantization-LVQ

The concept of Vector Quantization-VQ can be understood by considering the SOM


algorithm which actually encodes a set of input vectors x generating a reduced set
of prototypes wv (associated with winning neurons) which provide a good approxi-
mation of the entire original input space. In this context, these prototypes {wv } can
be regarded as code-book vectors representative of the original vectors.
In essence, the basic idea of Vector Quantization theory is to reduce the dimensionality of the data, that is, to compress them. We have also seen with (2.67) how to estimate the error of this approximation, due to the quantization of the vectors, by evaluating the Euclidean distance between the input vectors {x} and the prototype vectors {w_v}.
More formally, the best way of thinking about vector quantization is in terms of an encoder, which encodes a given signal, and a decoder, which reconstructs the original signal as faithfully as possible, minimizing the information lost in encoding and decoding. Considering this parallelism between the encoder/decoder model and the SOM algorithm, we can regard the encoding function c(x) of the encoder as the winning neuron of the SOM associated with {w_v}, the decoding function x′(c) of the decoder as the connection weight vector {w_v}, and the probability density function of the input x (including the additive noise) as the lateral interaction function φ(t).
A vector quantizer with minimal distortion error is called a Voronoi quantizer or
nearest-neighbor quantizer. This is because the input space is partitioned into a set of
Voronoi regions or nearest-neighbor regions each of which contains the associated
reconstruction vector. The SOM algorithm provides a non-supervised method useful
for calculating the Voronoi vectors with the approximation obtained through the
weight vectors of the winning neurons in the Kohonen map.
The LVQ approach, devised by Kohonen [8], is the supervised learning version
of a vector quantizer that can be used when input data has been labeled (classified).
LVQ uses class information to slightly shift the Voronoi vectors in order to improve
the quality of the classifier’s decision regions. The procedure is divided into two
stages: the SOM-based competitive learning process followed by LVQ supervised
learning (see Fig. 2.10a). In fact, it carries out a pattern classification procedure.
With the first, unsupervised process, the SOM associates the vectors x of the training set with the Voronoi weight vectors {w_v}, obtaining a partition of the input space. With the second process, LVQ, knowing the class membership of each vector of the training set, finds the best labeling for each neuron through the associated weight vector w_v, i.e., for each Voronoi region. In general, the Voronoi
regions do not precisely delimit the boundaries of class separation. The goal is to
change the boundaries of these regions to obtain an optimal classification. LVQ
actually achieves a proper displacement of the Voronoi region boundaries starting
from the {wv } prototypes found by the SOM for the training set {x} and uses the
knowledge of the labels assigned to the x patterns to find the best label to assign to
each prototype wv . LVQ verifies the class of the input vector with the class to which

Fig. 2.10 Learning Vector Quantization Network: a Functional scheme with the SOM component
for competitive learning and the LVQ component for supervised learning; b architecture of the
LVQ network, composed of the input layer, the layer of Kohonen neurons and the computation
component to reinforce the winning neuron if the input has been correctly classified by the SOM
component

the prototype wv belongs and reinforces it appropriately if they belong to the same
class.
Figure 2.10b shows the architecture of the LVQ network. At the schematic level,
we can consider it with three linear layers of neurons: input, Kohonen, and output.
In reality, the M process neurons are only those of the Kohonen layer. The d input
neurons only have the function of storing the input vectors {x} randomly presented
individually. Each input neuron is connected with all the neurons of the Kohonen
layer. The number of neurons in the output layer is equal to the number of classes C.
The network is strongly conditioned by the number of neurons used in the Kohonen layer. Each neuron of this layer represents a prototype of a class whose values are defined by the weight vector w_v, i.e., by the synaptic connections of the neuron to all the input neurons. The number M of neurons in the middle layer is a multiple of the number of classes C. In Fig. 2.10b, the output layer shows the possible clusters of neurons that LVQ has detected as representing the same class ω_j, j = 1, \ldots, C.
We now describe the sequential procedure of the basic LVQ algorithm

1. Given the training set P = {x1 , . . . , xN }, xi ∈ Rd and the weight vectors


(Voronoi vectors) wj , j = 1, . . . , M obtained through non-supervised learn-
ing with the SOM network. These weights are the initial values for the LVQ
algorithm.
2. Use the classification labels of the input patterns to improve the classification
process by appropriately modifying each wj prototype. LVQ verifies, for each
input randomly selected from the training set D = {(xi , ωxi )}, i = 1, . . . , n
(containing C classes), the input class ωi with the class associated to the Voronoi
regions by adequately modifying the wj weight vectors.
3. Randomly select an input vector x_i from the training set D. If the selected input vector x_i and the weight vector w_v of the winning neuron (i.e., w_v is the one closest to x_i in the Euclidean distance sense) have the same class label (ω_{x_i} = ω_{w_v}), then modify the prototype w_v representing this class by reinforcing it,

as follows:
wv (t + 1) = wv (t) + η(t)[xi − wv (t)] (2.68)

where t indicates the previous iteration, and η(t) indicates the current value of
the learning parameter (variable in the range 0 < η(t) ≤ 1) analogous to that of
the SOM.
4. If the selected input vector x_i and the weight vector w_v of the winning neuron have different class labels (ω_{x_i} ≠ ω_{w_v}), then modify the prototype w_v, moving it away from the input vector, as follows:

w_v(t + 1) = w_v(t) - \eta(t)[x_i - w_v(t)]   (2.69)

5. All the other weight/prototype vectors w_j ≠ w_v, j = 1, \ldots, M, associated with the input regions are not changed.
6. It is convenient that the learning parameter η(t) (which determines the magnitude of the prototype shift with respect to the input vector) decreases monotonically with the number of iterations t, starting from an initial value η_max (η_max ≪ 1) and reaching very small values greater than zero:

\eta(t + 1) = \eta(t)\left(1 - \frac{t}{t_{max}}\right)   (2.70)

where tmax indicates the maximum number of iterations.


7. LVQ stop condition. This condition can be reached when t = t_max or by imposing a lower limit η_min on the learning parameter, stopping when η(t) ≤ η_min. If the stop condition is not reached, the iterations continue from step 3.
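A minimal NumPy sketch of the LVQ1 updates (2.68)–(2.70) (our own illustrative code; the prototypes W and their labels w_labels are assumed to come from the previous SOM/clustering phase):

```python
import numpy as np

def lvq1(X, y, W, w_labels, eta_max=0.1, t_max=1000, seed=0):
    """LVQ1: attract the winning prototype if the labels match (2.68), repel it otherwise (2.69)."""
    rng = np.random.default_rng(seed)
    W = W.copy()
    eta = eta_max
    for t in range(t_max):
        i = rng.integers(len(X))                             # randomly selected training sample
        v = np.argmin(np.linalg.norm(W - X[i], axis=1))      # winning (nearest) prototype
        sign = 1.0 if w_labels[v] == y[i] else -1.0          # same class: reinforce; else: repel
        W[v] += sign * eta * (X[i] - W[v])                   # Eqs. (2.68)/(2.69)
        eta *= (1.0 - t / t_max)                             # monotonic decay, Eq. (2.70)
    return W
```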

The described classifier (also known as LVQ1) is more efficient than using the
SOM algorithm only. It is also observed that with respect to the SOM it is no longer
necessary to model the neurons of the Kohonen layer with the function φ of lateral
interaction since the objective of LVQ is vector quantization and not the creation
of topological maps. The goal of the LVQ algorithm is to adapt the weights of the neurons to optimally represent the prototypes of the training set patterns, so as to obtain a correct partition of the training set.
This architecture allows us to classify the N input vectors of the training set into C classes, each of which is subdivided into subclasses represented by the initial M prototypes/code-book vectors. The sizing of the LVQ network is linked to the number of prototypes, which defines the number of neurons in the Kohonen layer. Undersizing produces partitions with few regions, with the consequent problem of having regions containing patterns belonging to different classes. Oversizing instead leads to the problem of overfitting.

2.11.1 LVQ2 and LVQ3 Networks

Depending on their initial values, the weight vectors of the neurons may have to move through a class that they do not represent in order to reach the region they should be associated with. Since the weight vectors of these neurons are repelled by the vectors in the regions they must cross, it is possible that this will never happen and that they will never be assigned to the correct region to which they belong.
The LVQ2 algorithm [9], which introduces a learning variant with respect to LVQ1, can solve this problem. During learning, for each input vector x, a simultaneous update is carried out considering the two prototype vectors w_{v1} and w_{v2} closest to x (always determined by minimum distance from x). One of them must belong to the correct class and the other to a wrong class. Moreover, x must fall in a window between the vectors w_{v1} and w_{v2} that delimits the decision boundary (the perpendicular bisecting plane). Under these conditions, the two weight vectors w_{v1} and w_{v2} are updated using Eqs. (2.68) and (2.69), respectively, according to the correct and incorrect class membership of x. All other weight vectors are left unchanged.
The LVQ3 algorithm [8] is the analogue of LVQ2 but has an additional update of
the weights in cases in which x, wv1 , and wv2 represent the same class

w_k(t + 1) = w_k(t) + \epsilon\,\eta(t)[x - w_k(t)], \qquad k = v_1, v_2; \quad 0.1 < \epsilon < 0.5   (2.71)

where ε is a stabilization constant that reflects the width of the window associated with the borders of the regions represented by the prototypes w_{v1} and w_{v2}. For very narrow windows, the constant ε must take very small values.
With the changes introduced by LVQ2 and LVQ3 during the learning process, it is ensured that the weight vectors (code-book vectors) continue to approximate the class distributions and are prevented from moving away from their optimal positions if learning continues.

2.12 Recurrent Neural Networks

An alternative architecture to the feedforward network, described in the preceding


paragraphs, is the recurrent architecture (also called cyclical). The topology of a
cyclic network requires that at least one neuron (normally a group of neurons) is
connected in such a way as to create a cyclic and circular data flow (loop). In the
feedforward topology only the error was propagated backward during learning. In
recurrent networks, at least one neuron propagates its output backward (feedback)
which becomes its input simultaneously. A feedforward network behaves like a static system, completely described by one or more functionals, where the current output y depends only on the current input x through the mapping function y = F(x). Recurrent networks, on the other hand, are dynamic systems where the output signal, in general, depends on the internal state of the network (due to the feedback), which exhibits

Fig. 2.11 Computational dynamics of a neural network: a Static system; b Dynamic continuous-
time system; and c Discrete-time dynamic system

a dynamic temporal behavior, and on the most recent input signal. Two classes of dynamical systems are distinguished: continuous-time and discrete-time dynamic systems.
The dynamics of continuous-time systems depend on functions whose continuous
variable is time (spatial variables are also used). This dynamics is described by differential equations. The most useful dynamic model is described by first-order differential equations y′(t) = f[x(t), y(t)], where y′(t) = dy(t)/dt, which model the output signal through its derivative with respect to time, requiring an integration operator and the feedback signal inherent to dynamic
systems (see Fig. 2.11). In many cases, a discrete-time computational system is as-
sumed. In these cases, a discrete-time system is modeled by discrete-time variable
functions (even spatial variables are considered). The dynamics of the network, in
these cases, starts from the initial state at time 0 and in the subsequent discrete steps
for t = 1, 2, 3, . . . the state of the network changes in relation to the computational
dynamics foreseen by the activation function of one or more neurons. Thus, each
neuron acquires the related inputs, i.e., the output of the neurons connected to it, and
updates its state with respect to them.
The dynamics of a discrete-time network is described by difference equations, whose first discrete difference is given by Δy(n) = y(n + 1) − y(n), where y(n + 1) and y(n) are, respectively, the future (predicted) value and the current value of y, and n indicates the discrete variable that replaces the continuous independent variable t. To model the dynamics of a discrete-time system, that is, to obtain the output signal y(n + 1) = f[x(n), y(n)], the integration operator is replaced by the operator D, which acts as a unit delay (see Fig. 2.11c).2
The state of neurons can change independently of each other or can be controlled
centrally, and in this case, we have asynchronous or synchronous neural network
models, respectively. In the first case, the neurons are updated one at a time, while in the second case all the neurons are updated at the same time. Learning with a
recurrent network can be accomplished with a procedure similar to the gradient
descent as used with the backpropagation algorithm.

2 The D operator derives from the Z transform applied to the discrete signals y(n) : n = 0, 1, 2, 3, . . .
to obtain analytical solutions to the difference equations. The delay unit is introduced simply to
delay the activation signal until the next iteration.

2.12.1 Hopfield Network

A particular recurrent network was proposed in 1982 by J. Hopfield [10]. The originality of Hopfield's network model was such as to revitalize the entire scientific community in the field of artificial neural networks. Hopfield showed how a collection of simple processing units (for example, McCulloch-Pitts perceptrons), appropriately configured, can exhibit remarkable computing power. Inspired by physical
phenomenologies,3 he demonstrated that a physical system can be used as a potential
memory device, once such a system has a dynamic of locally stable states to which it
is attracted. Such a system, with its stability and well localized attractors,4 constitutes
a model of CAM memory (Content-Addressable Memory).5
A CAM is a distributed memory that can be realized by a neural network if each content of the memory (in this context, a pattern) corresponds to a stable configuration of the neural network, reached after its evolution starting from an initial configuration. In other words, starting from an initial configuration, the neural
network reaches a stable state, that is, an attractor associated with the pattern most
similar to that of the initial configuration. Therefore, the network recognizes a pattern
when the initial stimulus corresponds to something that, although not equal to the
stored pattern, is very similar to it.
Let us now look at the structural details of the Hopfield network that differs greatly
from the two-layer network models of input and output. The Hopfield network is
realized with M neurons configured in a single layer of neurons (process unit or
process element PE) where each is connected with all the others of the network
except with itself. In fact it is a recurrent symmetric network, that is, with a matrix
of synaptic weights
wij = wji , ∀i, j (2.72)

3 Ising’s model (from the name of the physicist Ernst Ising who proposed it) is a physical-
mathematical model initially devised to describe a magnetized body starting from its elementary
constituents. The model was later used to describe a variety of phenomena, united by the presence of individual components that, interacting in pairs, produce collective effects.
4 In the context of neural networks, an attractor is the final configuration achieved by a neural

network that, starting from an initial state, reaches a stable state after a certain time. Once an
attractor is known, the set of initial states that determine evolution of the network that ends with
that attractor is called the attraction basin.
5 Normally different memory devices store and retrieve the information by referring to the memory

location addresses. Consequently, this mode of access to information often becomes a limiting factor
for systems that require quick access to information. The time required to find an item stored in
memory can be considerably reduced if the object can be identified for access through its contents
rather than by memory addresses. A memory accessed in this way is called addressable memory for
content, or CAM (Content-Addressable Memory). CAM offers a performance advantage over other search algorithms, such as binary tree or look-up table based searches, by comparing the desired information against the entire list of pre-stored memory location addresses.

Fig. 2.12 Model of the discrete-time Hopfield network with M neurons. The output of each neuron
has feedback with all the others excluded with itself. The pattern vector (x1 , x2 , . . . , xM ) forms the
entry to the network, i.e., the initial state (y1 (0), y2 (0), . . . , yM (0)) of the network

no neuron is connected with itself

wii = 0, ∀i (2.73)

and is completely connected (see Fig. 2.12).


According to Hopfield's notation, w_{ij} indicates the synaptic connection from the jth neuron to the ith neuron, and the activation level of the ith neuron is indicated with y_i, which assumes the value 1 if the neuron is activated and 0 otherwise.
In the discrete Hopfield network context, the total instantaneous state of the network is given by the M values y_i, which represent a binary vector of M bits. Neurons are assumed to operate as perceptrons with a threshold activation function that produces a binary state. The state of each neuron is given by the weighted sum of its inputs with the synaptic weights, as follows:

y_i = \begin{cases} 1 & \text{if } \sum_{j \neq i} w_{ij}\, y_j > \theta_i \\ 0 & \text{if } \sum_{j \neq i} w_{ij}\, y_j < \theta_i \end{cases}   (2.74)

where y_i indicates the output of the ith neuron and θ_i the corresponding threshold. Rewritten with the activation function σ(·) of each neuron, (2.74) becomes

y_i = \sigma\Big(\underbrace{\textstyle\sum_{j=1;\, j \neq i}^{M} w_{ij}\, y_j - \theta_i}_{z}\Big)   (2.75)

where

\sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}   (2.76)

It is observed that in Eq. (2.75) only the state of the neurons y_i is updated, and not the corresponding thresholds θ_i. Furthermore, the update dynamics of the discrete Hopfield network is asynchronous, i.e., the neurons are updated sequentially, in random order and at random times. This is the opposite of a synchronous dynamic system, where the update at time t + 1 takes place considering the state at time t and updating all neurons simultaneously. The synchronous mode requires buffer memory to maintain the state of the neurons and a synchronization signal. Normally, asynchronous updating is more common in neural network applications, since each neural unit updates its state as soon as the input information is available.
The Hopfield network associates a scalar value with the overall state of the network, defined by the energy E. Recalling the concept of kinetic energy expressed in quadratic form, the energy function for a network with M neurons is given by

E = -\frac{1}{2}\sum_{i,j=1;\, j \neq i}^{M} w_{ij}\, y_i\, y_j + \sum_{i=1}^{M} y_i\, \theta_i   (2.77)

Hopfield demonstrated the convergence of the network to a stable state, after a finite number of sequential (asynchronous) neuron updates, under the conditions expressed by (2.72) and (2.73), which guarantee the decrease of the energy function E for each updated neuron.
In fact, let us consider how the energy of the system changes when the state of a neuron changes under the activation function (for example, from 0 to 1, or from 1 to 0):

\Delta E = E_{t+1} - E_t = \Big[-\sum_{j=1;\, j \neq i}^{M} w_{ij}\, y_i(t+1)\, y_j + y_i(t+1)\,\theta_i\Big] - \Big[-\sum_{j=1;\, j \neq i}^{M} w_{ij}\, y_i(t)\, y_j + y_i(t)\,\theta_i\Big]
= [y_i(t) - y_i(t+1)]\,\Big(\underbrace{\textstyle\sum_{j=1;\, j \neq i}^{M} w_{ij}\, y_j}_{net_i} - \theta_i\Big)   (2.78)


The second factor of (2.78) is less than zero if net_i = \sum_{j \neq i} w_{ij}\, y_j < \theta_i. In this case, by (2.75) and (2.76), we would have y_i(t + 1) = 0 and y_i(t) = 1 (assuming the neuron changes state), so the first factor [•] is greater than zero, and thus \Delta E < 0. If instead net_i > \theta_i, the second factor is greater than or equal to zero; y_i(t + 1) = 1 and y_i(t) = 0, so the first factor [•] is less than zero, and thus \Delta E \leq 0.
Therefore, Hopfield showed that, for any change of the ith neuron, if the activation equations (2.75) and (2.76) are maintained, the energy variation ΔE is negative or zero. The energy function E decreases monotonically (see Fig. 2.13) if the activation rules and the symmetry of the weights are maintained. This allows the network, after repeated updates, to converge toward a stable state that is a local minimum of the energy function (which can also be regarded as a Lyapunov function6 ).

6 Lyapunov functions, named after the Russian mathematician Aleksandr Mikhailovich Lyapunov,
are scalar functions that are used to study the stability of an equilibrium point of an ordinary au-
tonomous differential equation, which normally describes a dynamic system. For dynamic physical
systems, conservation laws often provide candidate Lyapunov functions.

Fig. 2.13 1D and 2D energy map with some attractors in a Hopfield network. The locations of
the attractors indicate the stable states where the patterns are associated and stored. After having
initialized the network with the pattern to be memorized, the network converges following the
direction indicated by the arrows to reach the location of an attractor to which the pattern is associated

The generalized form of the network update equation also includes the additional bias term I_i (which can also be the direct input of a sensor) for each neuron:

y_i = \begin{cases} 1 & \text{if } \sum_{j \neq i} w_{ij}\, y_j + I_i > \theta_i \\ 0 & \text{if } \sum_{j \neq i} w_{ij}\, y_j + I_i < \theta_i \end{cases}   (2.79)

Note that, contrary to the training of the perceptron, the thresholds of the neurons are never updated. The energy function becomes

E = -\frac{1}{2}\sum_{i,j=1;\, j \neq i}^{M} w_{ij}\, y_i\, y_j - \sum_{i=1}^{M} y_i\, I_i + \sum_{i=1}^{M} y_i\, \theta_i   (2.80)

Also in this case it can be shown that, for a change of the network state due to a single neuron update, the energy change ΔE is always zero or negative.

2.12.2 Application of Hopfield Network to Discrete States

The initialization of the Hopfield network depends on the particular application of


the network. It is normally made starting from the initial values of neurons associated
with the desired pattern. After repeated updates, the network converges to a pattern
attractor. Hopfield has shown that convergence is generally assured and the attractors
of this nonlinear dynamic system are stable, nonperiodic or chaotic as often occurs
in other systems.
The most common applications are

(a) Associative memories: the network is capable of memorizing some states (local minima of the energy function) associated with patterns which it will then recall.
(b) Combinatorial optimization: assuming a well-modeled problem, the network is able to find some local minima and acceptable solutions, but it does

not always guarantee the optimal solution. The classic application example is the
problem of Traveling Salesman’s Problem.
(c) Calculation of logical functions (OR,XOR, ...).
(d) Miscellaneous Applications: pattern classification, signal processing, control,
voice analysis, image processing, artificial vision. Generally used as a black box
to calculate some output resulting from a certain self-organization caused by the
same network. Many of these applications are based on Hebbian learning.

2.12.2.1 Associative Memory-Training


The Hopfield network can be used as an associative memory. This allows the network to serve as a content addressable memory (CAM), i.e., the network will converge to remembered states even if it is stimulated with patterns slightly different from those that generated those states (the classic example is the recognition of a noisy character with respect to the noise-free one learned by the network). In this case, we want the network to be stimulated by N binary pattern vectors P = {x_1, x_2, ..., x_N}, with x_i = (x_{i1}, x_{i2}, ..., x_{iM}), and to generate N different stable states associated with these patterns. In essence, once stimulated by the set of patterns P, the network stores their footprint by determining synaptic weights adequate for all connections. The determination of the weights is accomplished through unsupervised Hebbian learning. This learning strategy is summarized as follows:

(a) If two neurons have the same state of activation their synaptic connection is
reinforced, i.e., the associated weight wij is increased.
(b) If two neurons exhibit opposite states of activation, their synaptic connection is
weakened.

Starting with the weight matrix W at zero, applying these learning rules while presenting the patterns to be memorized, the weights are modified as follows:

$$w_{ij} = w_{ij} + \varepsilon\, x_{ki} x_{kj} \qquad i,j = 1,\ldots,M;\ \ i \neq j \qquad (2.81)$$

where k indicates the pattern to memorize, i and j indicate the indices of the components of the binary pattern vector to be learned, and ε is a constant that keeps the weights from becoming too large or too small (normally ε = 1/N). The (2.81) is iterated for all N patterns to be stored; once all have been presented, the final weights result in

$$w_{ij} = \sum_{k=1}^{N} x_{ki} x_{kj} \qquad w_{ii} = 0; \quad i,j = 1,\ldots,M;\ \ i \neq j \qquad (2.82)$$

In vector form, the M × M weight matrix W for the set P of stored patterns is given by

$$W = \sum_{k=1}^{N} x_k^T x_k - N I = (x_1^T x_1 - I) + (x_2^T x_2 - I) + \cdots + (x_N^T x_N - I) \qquad (2.83)$$

where I is the M × M identity matrix; subtracting N I from the sum of outer products guarantees that all the elements of the main diagonal of W are zero. It is also assumed that the binary responses of the neurons are bipolar (+1 or −1). In the case of unipolar binary pattern vectors (that is, with values 0 or 1), the weight Eq. (2.82) is modified by introducing a change of scale and a translation:

$$w_{ij} = \sum_{k=1}^{N} (2x_{ki} - 1)(2x_{kj} - 1) \qquad w_{ii} = 0; \quad i,j = 1,\ldots,M;\ \ i \neq j \qquad (2.84)$$

2.12.2.2 Associative Memory-Recovery


Once the P patterns are stored, it is possible to activate the network to retrieve them even if the pattern given as input to the network is slightly different from the one stored. Let s = (s_1, s_2, ..., s_M) be a generic pattern (with bipolar elements −1 and +1) to be retrieved, presented to the network, which is initialized with y_i(0) = s_i, i = 1,...,M, where y_i(t) indicates the activation state of the ith neuron at time t. The state of the network at time (t + 1), during its convergence iteration, based on (2.75) and neglecting the threshold, can be expressed by the following recurrent equation:

$$y_i(t+1) = \sigma\!\left( \sum_{j=1;\, j\neq i}^{M} w_{ij}\, y_j(t) \right) \qquad (2.85)$$

where in this case the activation function σ(·) is the sign function sgn(·), given the bipolarity of the patterns (−1 or 1). Recall that the neurons are updated asynchronously and in random order. The network activity continues until the output of the neurons remains unchanged, thus representing the final result, namely the pattern x recovered from the set P, stored earlier in the learning phase, that best approximates the input pattern s. This approximation is evaluated with the Hamming distance H, which counts the number of differing bits between two binary patterns and in this case is obtained from $H = \big(M - \sum_i s_i x_i\big)/2$.
The Hopfield network can be used as a pattern classifier. The set of patterns P,
in this case, are considered the prototypes of the classes and the patterns s to be
classified are compared with these prototypes by applying a similarity function to
decide the class to which they belong.
Consider a first example with two stable states (two auto-associations), x_1 = (1, −1, 1)^T and x_2 = (−1, 1, −1)^T. The symmetric, zero-diagonal matrix of the synaptic weights W given by (2.83) results in

$$W = \begin{bmatrix}1\\-1\\1\end{bmatrix}\begin{bmatrix}1 & -1 & 1\end{bmatrix} + \begin{bmatrix}-1\\1\\-1\end{bmatrix}\begin{bmatrix}-1 & 1 & -1\end{bmatrix} - 2I = \begin{bmatrix}0 & -2 & 2\\ -2 & 0 & -2\\ 2 & -2 & 0\end{bmatrix}$$

Fig. 2.14 Structure of the Hopfield network at discrete time in the example with M = 3 neurons and N = 2^3 = 8 states, of which two are stable, representing the patterns (1, −1, 1) and (−1, 1, −1)

The network is composed of 3 neurons, so 2^3 = 8 patterns can be presented. Presenting to the network any of the 8 possible patterns, the network always converges to one of the two stable states, as shown in Fig. 2.14. For example, by initializing the network with the pattern y(0) = (1, 1, −1)^T and applying (2.85) to the 3 neurons, the network converges to the final state y(3) = (−1, 1, −1)^T, i.e., the pattern x_2 = (−1, 1, −1)^T. Initializing instead with the pattern y(0) = (1, −1, −1)^T, the network converges to the final state y(3) = (1, −1, 1)^T, i.e., the pattern x_1 = (1, −1, 1)^T. Figure 2.14 shows the structure of the network and all the possible convergences for the 8 possible patterns, which differ by at most 1 bit from the stored prototype pattern vectors.
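The example just described can be reproduced numerically. The sketch below (our own code, assuming bipolar states, the sign activation of (2.85), and keeping the current state when the activation potential is exactly zero) builds W from (2.83) for the two prototypes and runs asynchronous updates until the state no longer changes.

```python
import numpy as np

# Stored bipolar prototypes (rows), as in the example in the text
X = np.array([[ 1, -1,  1],
              [-1,  1, -1]], dtype=float)
N, M = X.shape

# Hebbian storage (2.83): W = sum_k x_k^T x_k - N*I  (symmetric, zero diagonal)
W = X.T @ X - N * np.eye(M)

def recall(s, W, max_sweeps=10, seed=0):
    """Asynchronous recall (2.85): sign activation, random update order.
    When the activation potential is exactly zero the current state is kept."""
    rng = np.random.default_rng(seed)
    y = np.array(s, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):
            net = W[i] @ y
            new = y[i] if net == 0 else (1.0 if net > 0 else -1.0)
            if new != y[i]:
                y[i], changed = new, True
        if not changed:
            break
    return y

print(W)                          # [[0,-2,2],[-2,0,-2],[2,-2,0]]
print(recall([ 1,  1, -1], W))    # converges to (-1, 1, -1), i.e., x2
print(recall([ 1, -1, -1], W))    # converges to ( 1, -1, 1), i.e., x1
```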
A second example of the Hopfield network used as a CAM [11] is now shown, in which simple binary images of size 20 × 12 representing single numeric characters are stored and then retrieved. In this case, a pattern is represented by a vector of 240 binary elements with bipolar values −1 and 1. Once the desired patterns are stored in the network (training phase), the ability of the network to recover a stored pattern is tested. Figure 2.15 shows the stored patterns and the same patterns with random noise added at 20%, presented as input to the network for their retrieval. The results of the correct (or incorrect) retrieval of the patterns are highlighted. The noise added to the patterns has greatly altered the initial graphics of the numeric characters (deleting and/or adding elements in the binary image). Hopfield's auto-associative memory performs the function of a decoder, in the sense that it retrieves the memory content most similar to the input pattern. In this example, only the character "5" is incorrectly retrieved, as the character "3".

2.12.2.3 Associative Memory-Performance


Fig. 2.15 Results of numerical pattern retrieval using the Hopfield network as a CAM memory. When non-noisy patterns are given as input, the network always retrieves the memorized characters. When noisy input patterns (25% noise) are given instead, the network fails to recover only the "5" character, which is recognized as "3"

Assume that the pattern x has been stored as one of the patterns of the set P of N patterns. If this pattern is presented to the network, the activation potential of the ith neuron for recovering the pattern x is given by

$$
y_i = net_i = \sum_{j=1;\, j\neq i}^{M} w_{ij}\, x_j
     = \sum_{j=1;\, j\neq i}^{M} \underbrace{\left(\sum_{k=1}^{N} x_{ki} x_{kj}\right)}_{=\,w_{ij}\ \text{by (2.82)}} x_j
     = \sum_{k=1}^{N} x_{ki} \sum_{j=1}^{M} x_{kj}\, x_j
     \;\approx\; x_i\, \tilde{M} \qquad (2.86)
$$

where M̃ ≤ M, being the result of the inner product between vectors with M bipolar binary elements (+1 or −1). If a stored pattern x_k and the presented pattern x are statistically independent, i.e., orthogonal, their inner product is zero. At the other extreme, when the two vectors are identical, we obtain M̃ = M; in that case the pattern x does not generate any update and the network is stable.
For this stable state, the recovery of the pattern x can be expressed, considering Eq. (2.83), through the activation potential net given by

$$y = net = W x = \left( \sum_{k=1}^{N} x_k^T x_k - N I \right) x \qquad (2.87)$$

Let us now analyze the two possible cases:

1. The stored patterns P are orthogonal (or statistically independent) and M > N, that is,

$$x_i^T x_j = \begin{cases} 0 & \text{if } i \neq j\\ M & \text{if } i = j \end{cases} \qquad (2.88)$$

then, developing (2.87), we will have the following:

$$y = net = (x_1^T x_1 + \cdots + x_N^T x_N - N I)\, x = (M - N)\, x \qquad (2.89)$$

With the assumption that M > N, it follows that x is a stable state of the Hopfield network.

2. The stored patterns P are not orthogonal (nor statistically independent):

$$
y = net = (x_1^T x_1 + \cdots + x_N^T x_N - N I)\, x
  = \underbrace{(M - N)\, x}_{\text{stable state}} \;+\; \underbrace{\sum_{k=1;\, x_k \neq x}^{N} (x_k^T x_k)\, x}_{\text{noise}} \qquad (2.90)
$$

where the activation potential of the neuron is in this case given by the stable state term determined above plus the noise term. The vector x is a stable state when M > N, the noise term is very small, and a concordant sign is maintained between the activation potential y and the vector x (sgn(y) = sgn(x)). Conversely, x will not be a stable state if the noise is dominant with respect to the stable state term, as happens when the number N of patterns to be stored increases (i.e., M − N decreases).

As is the case for all associative memories, the best results occur when the patterns to be memorized are represented by orthogonal vectors, or vectors very close to orthogonal. The Hopfield network used as a CAM is proved to have a memory capacity equal to 0.138 · M, where M is the number of neurons (the theoretical limit is 2M patterns). The network is then able to recover the patterns even if they are noisy, within a certain tolerance dependent on the application context.

2.12.3 Continuous State Hopfield Networks

In 1984 Hopfield published another important scientific contribution that proposed a new neural model with continuous states [12]. In essence, the Hopfield model with discrete states (unipolar [0,1] or bipolar [−1,1]) was generalized by considering continuous neuron states, whose responses can assume continuous (real) values in the interval between 0 and 1, as with MLP networks. In this case, the selected activation function is the sigmoid (or hyperbolic tangent) described in Sect. 1.10.2. The dynamics of the network remain asynchronous and the new state of the ith neuron is given by a generic monotonically increasing function. Choosing the sigmoid function, the state of the neuron results in

$$y_i = \sigma(net_i) = \frac{1}{1 + e^{-net_i}} \qquad (2.91)$$
where net_i indicates the activation potential of the ith neuron. It is assumed that the state of a neuron changes slowly over time; in this way, the change of state of the other neurons does not happen instantaneously but with a certain delay. The activation potential of the ith neuron changes over time according to the following:

$$\frac{d}{dt}\, net_i = \eta \left( -net_i + \sum_{j=1}^{M} w_{ij}\, y_j \right) = \eta \left( -net_i + \sum_{j=1}^{M} w_{ij}\, \sigma(net_j) \right) \qquad (2.92)$$

where η indicates the learning parameter (positive) and wij the synaptic weight be-
tween the neuron i and j. With the (2.92) a discrete approximation of the differential
d (neti ) is calculated which is added to the current value of the activation potential
neti which leads the neuron to the new state given by yi expressed with the (2.91).
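A minimal numerical sketch of this continuous dynamics, assuming a small random symmetric network with zero diagonal and a simple Euler step of (2.92) combined with (2.91), could look as follows (values and names are illustrative, not from the text).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Simple Euler integration of (2.92) for a small symmetric, zero-diagonal network.
rng = np.random.default_rng(1)
M = 4
A = rng.standard_normal((M, M))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

eta, dt, steps = 1.0, 0.05, 400
net = rng.standard_normal(M)          # initial activation potentials

for _ in range(steps):
    y = sigmoid(net)                  # continuous states (2.91)
    dnet = eta * (-net + W @ y)       # right-hand side of (2.92)
    net = net + dt * dnet             # discrete approximation of d(net_i)

print("final states:", sigmoid(net))  # the network settles toward an attractor
```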
The objective is now to demonstrate how this new, more realistic network model, with continuous-state dynamics, can converge toward fixed or cyclic attractors. To demonstrate convergence, Hopfield proposed [12] an energy functional slightly different from that of the discrete model, given by

$$E = -\frac{1}{2} \sum_{i,j=1;\, j\neq i}^{M} w_{ij}\, y_i y_j + \sum_{i=1}^{M} \int_0^{y_i} \sigma^{-1}(y)\, dy \qquad (2.93)$$

At this point, it is sufficient to demonstrate that the energy of the network decreases after each update of the state of a neuron. This energy variation is calculated with the following:

$$\frac{dE}{dt} = -\sum_{i,j=1;\, j\neq i}^{M} w_{ij}\, y_j \frac{dy_i}{dt} + \sum_{i=1}^{M} \sigma^{-1}(y_i)\, \frac{dy_i}{dt} \qquad (2.94)$$

Considering the symmetry property of the network (that is, w_ij = w_ji) and since the inverse of the sigmoid function exists, net_i = σ^{-1}(y_i), the (2.94) can be simplified as

$$\frac{dE}{dt} = -\sum_{i=1}^{M} \frac{dy_i}{dt} \left( \sum_{j=1}^{M} w_{ij}\, y_j - net_i \right) \qquad (2.95)$$

The expression in round brackets (•) is replaced considering (2.92), and we get the following:

$$\frac{dE}{dt} = -\frac{1}{\eta} \sum_{i=1}^{M} \frac{dy_i}{dt}\, \frac{d}{dt}\, net_i \qquad (2.96)$$

and expressing dy_i/dt through the derivative of the activation function σ (using the chain rule for composite functions), we obtain

$$\frac{dE}{dt} = -\frac{1}{\eta} \sum_{i=1}^{M} \sigma'(net_i) \left( \frac{d}{dt}\, net_i \right)^2 \qquad (2.97)$$

The derivative σ' is always positive, since the sigmoid is a strictly monotone increasing function.7 The parameter η is also positive, as is (d(net_i)/dt)^2. It can then be affirmed, considering the negative sign, that the right-hand side of (2.97) can never be positive and consequently the derivative of the energy function with respect to time can never be positive, i.e., the energy can never grow while the network dynamics evolves over time:

$$\frac{dE}{dt} \leq 0 \qquad (2.98)$$
The (2.98) implies that the dynamics of the network is governed by the energy E, which decreases or remains stable at each update. Furthermore, a stable state is reached when (2.98) vanishes, and it corresponds to an attractor of the state space. This happens when d(net_i)/dt ≈ 0 ∀i, i.e., the state of all neurons no longer changes significantly. Convergence can take a long time, since d(net_i)/dt gets smaller and smaller; the network is, however, expected to converge around a local minimum in a finite time.

2.12.3.1 Summary
The Hopfield network has been an important step in advancing knowledge on neural
networks and has revitalized the entire research environment in this area. Hopfield
established the connection between neural networks and physical systems of the
type considered in statistical mechanics. Other researchers had already considered
the associative memory model in the 1970s more generally. With the architecture of
the symmetrical connection network with a diagonal zero matrix, it was possible to
design recurrent networks of stable states. Furthermore, with the introduction of the

7 Recall the nonlinearity property of the sigmoid function σ(t), which ensures a bounded output range. In this context the following property is exploited:

$$\frac{d\sigma(t)}{dt} = \sigma(t)\,(1 - \sigma(t))$$

a polynomial relation between the derivative and the function itself, very simple to calculate.

concept of energy function, it was possible to analyze the convergence properties of the networks. Compared to other models, the Hopfield network has a simpler implementation, also in hardware. The strategy adopted for updating the state of the network corresponds to the physical relaxation methods by which a system is brought from a perturbed state to equilibrium (a stable state). The properties of the Hopfield network have been studied by several researchers, both from the theoretical and the implementation point of view, including the realization of optical and electronic devices.

2.12.4 Boltzmann Machine

In the previous section, we described the Hopfield network, based on the minimization of the energy function, without any guarantee of achieving global optimization, even though, once the synaptic weights have been determined, the network spontaneously converges to stable states. Taking advantage of this property of the Hopfield network, it was possible to introduce variants of this model to avoid the problem of local minima of the energy function. At the conceptual level, we can consider that the network reaches a state of minimum energy but could accidentally jump into a higher energy state. In other words, a stochastic variant of the network dynamics could help the network avoid a local minimum of the energy function. This is possible through the best-known stochastic dynamics model, the Boltzmann Machine (BM).
(BM).
A neural network based on the BM [13] can be seen as the stochastic dynamic version of the Hopfield network. In essence, the deterministic approach used for the state update in the Hopfield network (Eq. 2.74) is replaced with a stochastic approach that updates the state y_i of the ith neuron, always in asynchronous mode, according to the following rule:

$$y_i = \begin{cases} 1 & \text{with probability } p_i\\ 0 & \text{with probability } (1 - p_i) \end{cases} \qquad (2.99)$$

where

$$p_i = \frac{1}{1 + e^{-\Delta E_i / T}} = \frac{1}{1 + \exp\!\left( -\left( \sum_{j=1}^{M} w_{ij}\, y_j - \theta_i \right) / T \right)} \qquad (2.100)$$

where w_ij denotes the synaptic weights, θ_i the bias term associated with the ith neuron, and T the parameter that describes the temperature of the BM network (it simulates the temperature of a physical system). This latter parameter is motivated by statistical physics, whereby neurons normally enter a state that reduces the system energy E but can sometimes enter a wrong state, just as a physical system sometimes (but not often) visits states corresponding to higher energy values.8 In the simulated annealing approach, considering that the connections between neurons are symmetrical (w_ij = w_ji for all pairs), each neuron computes the energy gap ΔE_i as the difference between the energy of the inactive state E(y_i = 0) and that of the active one E(y_i = 1).
Returning to (2.100), we point out that for low values of T we have

$$T \to 0 \;\Longrightarrow\; \exp\!\left(-\frac{\Delta E}{T}\right) \to 0 \;\Longrightarrow\; p_i \to 1$$

having assumed that ΔE > 0. This situation brings the update rule back to the deterministic dynamics of the discrete Hopfield network. If instead T is very large, as happens at the beginning of the activation of the BM network, we have

$$T \to \infty \;\Longrightarrow\; \exp\!\left(-\frac{\Delta E}{T}\right) \to 1 \;\Longrightarrow\; p_i \to 0.5$$

which indicates the probability of accepting a change of state, regardless of whether this change leads to benefits or disadvantages. In essence, with T large the BM network explores all the states. As for the Hopfield network, the energy function of the Boltzmann machine results in

$$E = -\frac{1}{2} \sum_{i,j=1;\, j\neq i}^{M} w_{ij}\, y_i y_j + \sum_{i=1}^{M} y_i \theta_i \qquad (2.101)$$

The network update activity drives it toward a local minimum configuration of the energy function, associating (memorizing) the patterns with the various local minima. With the BM this occurs by starting with high temperature values; then, through the simulated annealing process, the network tends to stay longer in attraction basins with deeper minima, and there is a good chance of ending in a global minimum.
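A minimal sketch of the stochastic update (2.99)–(2.100) with a crude geometric cooling schedule is given below; the network, the thresholds, and the schedule are illustrative assumptions of ours, not the text's.

```python
import numpy as np

def boltzmann_sweep(y, W, theta, T, rng):
    """One asynchronous sweep of the stochastic update (2.99)-(2.100)."""
    for i in rng.permutation(len(y)):
        delta_E = W[i] @ y - theta[i]             # energy gap of neuron i
        p_i = 1.0 / (1.0 + np.exp(-delta_E / T))  # probability of the active state
        y[i] = 1.0 if rng.random() < p_i else 0.0
    return y

rng = np.random.default_rng(2)
M = 6
A = rng.standard_normal((M, M))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
theta = np.zeros(M)
y = rng.integers(0, 2, size=M).astype(float)

# Crude annealing schedule: start hot (exploration), cool down slowly.
for T in np.geomspace(5.0, 0.05, num=60):
    y = boltzmann_sweep(y, W, theta, T, rng)
print("state after annealing:", y)
```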
The BM network also differs from Hopfield's in that it divides the neurons into two groups: visible neurons and hidden neurons. The visible ones interface with the external environment (they perform the function of input/output), while the hidden ones are used only for internal representation and do not receive signals from the

8 This stochastic optimization method, which attempts to find a global minimum in the presence of local minima, is known as simulated annealing. In essence, it simulates the physical process adopted in the heat treatment of ferrous materials (heating followed by slow, rather than fast, cooling) to make them more resistant and less fragile. At high temperatures the atoms of these materials are excited, but during the slow cooling phase they have the time to assume an optimal crystalline configuration, such that the material is free of irregularities and reaches an overall minimum. This annealing treatment of the material can avoid local minima of the lattice energy because the dynamics of the particles include a temperature-dependent component. In fact, during cooling the particles lose energy but sometimes acquire energy, thus entering states of higher energy. This phenomenon prevents the system from settling in less deep minima.

outside. The state of the network is given by the states of the two types of neurons. The learning algorithm has two phases.

In the first phase the network is activated keeping the visible neurons blocked (clamped) to the value of the input pattern, and the network tries to reach thermal equilibrium toward low temperatures. Then the weights are increased between any pair of neurons that both have the active state (analogy with the Hebbian rule).

In the second phase, the network is freely activated without blocking the visible neurons, and the states of all neurons are determined through the simulated annealing process. Once the state of thermal equilibrium toward low temperatures is (possibly) reached, there are sufficient samples to obtain reliable averages of the products y_i y_j.
It is shown that the learning rule [13–15] of the BM is given by the following relation:

$$\Delta w_{ij} = \frac{\eta}{T} \left( \langle y_i y_j \rangle_{fixed} - \langle y_i y_j \rangle_{free} \right) \qquad (2.102)$$

where η is the learning parameter; ⟨y_i y_j⟩_fixed denotes the expected average value of the product of the states of neurons i and j during the training phase with blocked visible neurons; ⟨y_i y_j⟩_free denotes the expected average value of the product of the states of neurons i and j when the network is freely activated without blocking the visible neurons.
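The update (2.102) can be sketched as follows, assuming the clamped and free correlations ⟨y_i y_j⟩ have already been estimated by averaging the products of the neuron states over sampled configurations (here they are random placeholders; in practice they come from the two phases described above).

```python
import numpy as np

def bm_weight_update(corr_clamped, corr_free, eta=0.1, T=1.0):
    """Weight change of (2.102): proportional to the difference between the
    clamped and free correlations <y_i y_j>; the diagonal is kept at zero."""
    dW = (eta / T) * (corr_clamped - corr_free)
    np.fill_diagonal(dW, 0.0)
    return dW

# Toy correlations estimated by averaging y_i*y_j over sampled states
# (random placeholders here; in practice they come from the two phases).
rng = np.random.default_rng(3)
states_clamped = rng.integers(0, 2, size=(100, 5)).astype(float)
states_free    = rng.integers(0, 2, size=(100, 5)).astype(float)
corr_clamped = states_clamped.T @ states_clamped / len(states_clamped)
corr_free    = states_free.T @ states_free / len(states_free)

W = np.zeros((5, 5))
W += bm_weight_update(corr_clamped, corr_free)
print(W.round(3))
```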
In general, the Boltzmann machine has found widespread use, with good performance, in applications requiring decisions on stochastic bases. In particular, it is used in all those situations where the Hopfield network does not converge to the global minima of the energy function. The Boltzmann machine has had considerable importance from a theoretical and engineering point of view. One of its architecturally efficient versions is known as the Restricted Boltzmann Machine (RBM) [16].
An RBM network keeps the set of visible neurons distinct from the hidden ones, with the particularity that no connections exist between neurons of the same set; only connections between visible and hidden neurons are allowed. With these restrictions, the hidden neurons are independent of each other given the visible neurons, allowing the calculation, in a single parallel step, of the expected average value ⟨y_i y_j⟩_fixed of the products of the states of neurons i and j during the training phase with blocked visible neurons. The calculation of the ⟨y_i y_j⟩_free products instead requires several iterations involving parallel updates of all visible and hidden neurons.
The RBM network has been applied with good performance for speech recogni-
tion, dimensionality reduction [17], classification, and other applications.

2.13 Deep Learning

In the previous paragraphs, we described several neural network architectures applied to machine learning and, above all, to supervised and unsupervised automatic classification, based on the backpropagation method, commonly used in conjunction with a gradient descent algorithm that updates the weights of the neurons by calculating the gradient of the cost (or target) function.

In recent years, starting in 2000, with the advancement of Machine Learning research, the neural network field has vigorously recovered thanks to the further availability of low-cost multiprocessor computing systems (computer clusters, GPU graphics processors, ...), the need to process large amounts of data (big data) in various applications (industry, public administration, the social telematic sector, ...), and above all the development of new automatic learning algorithms applied to artificial vision [18,19], speech recognition [20], and text and language processing [21].
This advancement of research on neural networks has led to the development of new machine learning algorithms, also based on the architecture of the traditional neural networks developed since the 1980s. The strategic progress came with the best results achieved by the newly developed neural networks, called Deep Learning (DL) [22].
In fact, Deep Learning denotes a set of algorithms that use neural networks as a computational architecture to automatically learn significant features from the input data. In essence, with the Deep Learning approach neural networks have overcome the limitations of traditional neural networks in approximating nonlinear functions and in automatically solving feature extraction in complex application contexts with large amounts of data, exhibiting excellent adaptability (recurrent neural networks, convolutional neural networks, etc.).

The meaning of deep learning is also understood in the sense of the multiple layers involved in the architecture. In more recent years, deeper networks are based on the use of rectified linear units [22] as the activation function, in place of the traditional sigmoid function (see Sect. 1.10.2), and on regularization techniques (dropout) [20] to reduce the problem of overfitting in neural networks (already described in Sect. 1.11.5).

2.13.1 Deep Traditional Neural Network

In Sect. 1.11.1 we already described the MLP network, which can in fact be considered a deep neural network, as it can have more than two intermediate layers of neurons between input and output, even though it belongs to the traditional neural networks, the neurons being completely connected between adjacent layers. Therefore, even an MLP network using many hidden layers can be defined as deep learning. But with the learning algorithm based on backpropagation, increasing the number of hidden layers makes the process of adapting significant weights increasingly difficult, even if from a theoretical point of view it would be possible.

Unfortunately, this is found experimentally: the weights are randomly initialized and subsequently, during the training phase, with the backpropagation algorithm, the error is backpropagated from the right (from the output layer) to the left (toward the input layer) by calculating the partial derivatives with respect to each weight, moving in the opposite direction of the gradient of the objective function. This weight-adjustment process becomes more and more complex as the number of hidden layers of neurons increases, since the weight updates and the objective function may not converge toward optimal values for the given data set. Consequently, the MLP with the backpropagation process does not, in fact, optimally parametrize the traditional neural network, although it is deep in terms of the number of hidden layers.

The limits of learning with backpropagation derive from the following problems:

1. Prior knowledge about the data to be classified is not always available, especially when dealing with large volumes of data such as color images, space-time-varying image sequences, multispectral images, etc.
2. The learning process based on backpropagation tends to get stuck in local minima when using the gradient descent method, especially when there are many layers of totally connected neurons, a problem known as the vanishing gradient problem: the weight-adaptation process through the activation functions (log-sigmoid or hyperbolic tangent) tends to vanish. Even if the theory of weight adaptation with error backpropagation and gradient descent is mathematically rigorous, it fails in practice, since the gradient values (calculated with respect to the weights), which determine how much each neuron should change, get smaller and smaller as they are propagated backward from the deeper layers of the neural network.9 This means that the neurons of the earlier layers learn very slowly compared to the neurons in the deeper layers. It follows that the neurons of the first layers learn less and more slowly.
3. The computational load becomes considerable as the depth of the network and the amount of training data increase.
4. The network optimization process becomes very complex as the number of hidden layers increases significantly.

A solution for overcoming the problems described above is given by Convolutional Neural Networks (CNN) [23,24], which we will introduce in the following paragraph.

2.13.2 Convolutional Neural Networks-CNN

The strategy of a Deep Learning (DL) network is to adopt a deep learning process using intelligent algorithms that learn features through deep neural networks. The idea is to extract significant features from the input dataset, which are then useful in the learning phase of the DL network.
If the latter have intrinsically little discriminating information, the performance of

9 We recall that the adaptation of the weights in the MLP occurs by differentiating the objective function, which involves the sigmoid function whose derivative has a value less than 1. Applying the chain rule of derivation leads to multiplying many terms less than 1, with the consequent problem of considerably reducing the gradient values as one proceeds toward the layers furthest from the output.

Fig. 2.16 Example of the convolution of an input volume of size H × W × D with multiple filters of size h × w × D. Placing a filter on one location of the input volume generates a single scalar. Translating the filter over the whole image, thus implementing the convolution operator, generates a single map (activation map) of reduced size (if padding and stride are not used, as discussed below). The figure shows that by applying more filters, and therefore more convolutions, to the input volume, more activation maps are obtained

the automatic learning algorithms are negatively affected, since they operate with poorly significant extracted features.
Therefore, the goal behind a CNN is to automatically learn the significant features from input data that are normally affected by noise. In this way, significant features are provided to the deep network so that it can learn more effectively. We can think of deep learning as algorithms for the automatic detection of features, which overcome the problem of gradient descent in nonoptimal situations and facilitate learning in a network with many hidden layers.

In the context of image classification, a CNN uses the concept of ConvNet, in fact a neural convolution that uses windows (convolution masks), for example of size 5 × 5, thought of as receptive fields associated with the neurons of the adjacent layer, which slide over the image to generate a feature map, also called an activation map, which is propagated to the next layers. In essence, the convolution masks of fixed dimensions will be the object of learning, based on the data supplied as input (the sequence of image/label pairs contained in the dataset).
Let's see how a generic convolution layer is implemented. Starting from a volume of size H × W × D, where H is the height, W is the width, and D indicates the number of channels (for a color image, D = 3), we define one or more convolution filters of dimensions h × w × D. Note that the number of filter channels and that of the input volume must be the same. Figure 2.16 shows an example of convolution. The filter generates a scalar when positioned on a single location of the volume, while sliding it over the entire input volume (thus implementing the convolution operator) generates a new image called the activation map.

The activation map will have high values where the structures of the image or input volume are similar to the filter coefficients (which are the parameters to be learned), and low values where there is no similarity. If x is the image or input volume, w the filter parameters, and b the bias, the result obtained is given by $w^T x + b$. For example, for a color image with D = 3 and a filter with height and width both equal to 5, we have a number of parameters equal to 5 × 5 × 3 = 75, plus the bias b, for a total of 76 parameters to learn. The same image can be analyzed with K filters, generating several different activation maps, as shown in Fig. 2.16.
There is a certain analogy between the local functionality (defined by the receptive field) of biological neurons (simple and complex cells), described in Sect. 4.6.4 for capturing movement and elementary patterns in the visual field, and the functionality of the neurons in the convolutional layer of a CNN. Both types of neurons propagate, in relation to the area of the receptive field, very little information to the successive layers, greatly reducing the total number of connections. In addition, in both cases the spatial information of the visual field (retinal map and feature map) propagated to subsequent layers is preserved.
Let us now look at the multilayer architecture of a CNN network and how the network evolves from the convolutional layer; for clarity, we will refer to a CNN implemented for classification.

2.13.2.1 Input Layer


Normally it is the input image with dimensions W0 × H0 × D0 . It can be considered
as the input volume with D0 number of channels corresponding, for example, to an
RGB color image where D0 = 3.

2.13.2.2 Convolutional Layer


A CNN network is composed of one or more convolutional layers generating a
hierarchy of layers. The first volume of output (first convolutional layer) is obtained
with the convolution process, schematized in Fig. 2.16, directly linked to the pixels
of the input volume. Each neuron of the output volume is connected to w × w × D0
input neurons where w is the size of the convolution square mask. The entire output
volume is characterized by the following parameters:

1. Number K of convolution filters. Normally chosen as a power of 2, for example, 4, 8, 16, 32, and so on.
2. Dimension w × w of the convolution mask. It normally takes values of w = 3, 5, . . ..
3. Stride. The translation step S of the 3D mask on the input volume (in the case of color images D_0 = 3). In normal convolutions it is 1, while S = 2 is used for values of w > 3. This parameter affects the 2D reduction factor of the activation maps and consequently the number of connections.
4. Padding. Border size to be added (filled with zeros) to the input image to manage the convolution operator at the edges of the input volume. The size of the border is defined by the padding parameter P, which affects the size of the 2D base of the output volume, i.e., the size of the activation maps (also called feature maps), as follows:

$$W_1 = \frac{W_0 - w + 2P}{S} + 1 \qquad\quad H_1 = \frac{H_0 - w + 2P}{S} + 1 \qquad (2.103)$$

For P = 0 and S = 1 the size of the feature maps results in W_1 = H_1 = W_0 − w + 1.

The parameters defined above are known as hyperparameters of the CNN network.10 Having defined the hyperparameters of the convolutional layer, the feature volume has the following dimensions:

$$W_1 \times H_1 \times D_1$$

where W_1 and H_1 are given by Eq. (2.103), while D_1 = K coincides with the number of convolution masks. It should be noted that the number of weights shared for each feature map is w × w × D_0, for a total of (w × w × D_0) · K weights for the entire feature volume.
For a 32 × 32 × 3 RGB image and 5 × 5 × 3 masks, each neuron of the feature map would take a value resulting from the convolution given by the inner product of size 5 × 5 × 3 = 75 (convolution between the 3D input volume and the 3D convolution mask, neglecting the bias weight). The corresponding activation volume would be

$$W_1 \times H_1 \times D_1 = 28 \times 28 \times K$$

assuming P = 0 and S = 1. For K = 4 masks we would have, between the input layer and the convolutional layer, a total number of connections equal to (28 × 28 × 4) × (5 × 5 × 3) = 235200 and a total number of weights (5 × 5 × 3) · 4 = 300, plus the 4 biases b, so that the number of parameters to be learned is 304. In this example, for each portion of the input volume of size (5 × 5 × 3), as extended as the size of the mask, 4 different neurons observe the same portion, extracting 4 different features.
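The output size (2.103) and the parameter counts of the worked example above can be checked with a few lines of plain Python (the helper names are our own):

```python
def conv_output_size(W0, H0, w, P, S):
    """Spatial size of the activation maps, Eq. (2.103)."""
    W1 = (W0 - w + 2 * P) // S + 1
    H1 = (H0 - w + 2 * P) // S + 1
    return W1, H1

# Worked example from the text: 32x32x3 RGB input, K = 4 filters of size 5x5x3,
# padding P = 0 and stride S = 1.
W0, H0, D0 = 32, 32, 3
w, K, P, S = 5, 4, 0, 1

W1, H1 = conv_output_size(W0, H0, w, P, S)
weights_per_filter = w * w * D0                    # 75 shared weights per filter
params = K * weights_per_filter + K                # + one bias per filter -> 304
connections = (W1 * H1 * K) * weights_per_filter   # 28*28*4 * 75 = 235200

print(W1, H1, K)          # 28 28 4
print(params)             # 304
print(connections)        # 235200
```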

2.13.2.3 ReLU Layer


As seen above, the activity level of each convolutional filter is measured with $net = w^T x + b$, which is supplied as input to an activation function. We have previously talked about sigmoid neurons, in which the sigmoid, defined on $(-\infty, \infty) \rightarrow [0, 1]$,

10 In the field of machine learning there are two types of parameters: those that are learned during the learning process (for example, the weights of a logistic regression or of a synaptic connection between neurons), and the intrinsic parameters of the algorithm of a learning model, whose optimization takes place separately. The latter are known as hyperparameters, or optimization parameters associated with a model (for example a regularization parameter, the depth of a decision tree, or, in the context of deep neural networks, the number of neuronal layers and the other parameters that define the architecture of the network).


Fig. 2.17 a Rectified Linear Unit (ReLU); b Leaky ReLU; c Exponential Linear Unit (ELU)

presents some problems when the neurons go into saturation. The sigmoid annuls the gradient and does not allow the weights of the network to be modified. Furthermore, its output is not centered around 0, and the exponential is slightly more computationally expensive and slows down the gradient descent process. Subsequently, more efficient activation functions have been proposed, such as the ReLU [19], of which we also describe some variants.

ReLU. The ReLU (Rectified Linear Unit) activation function is defined by

$$f(net) = \begin{cases} net & \text{if } net \geq 0\\ 0 & \text{if } net < 0 \end{cases} \qquad (2.104)$$

It performs the same role as the sigmoid used in MLPs, but it has the characteristic of increasing the nonlinearity property and eliminates the problem of gradient cancellation highlighted previously (see Fig. 2.17a). There is no saturation for positive values; the derivative is zero for negative or null values of net and 1 for positive values. This layer is not characterized by parameters, and the convolution volume remains unchanged in size. Besides being computationally less expensive, it allows the training phase to converge much faster than with the sigmoid and is more biologically plausible.

Leaky ReLU. Proposed in [25], defined by

$$f(net) = \begin{cases} net & \text{if } net \geq 0\\ 0.01 \cdot net & \text{if } net < 0 \end{cases}$$

As shown in Fig. 2.17b, it does not vanish when net becomes negative. It has the characteristic of not saturating the neurons and does not cancel the gradient.
Parametric ReLU. Proposed in [26], defined by

$$f(net, \alpha) = \begin{cases} net & \text{if } net \geq 0\\ \alpha \cdot net & \text{if } net < 0 \end{cases}$$

It is observed that α ≤ 1 implies f(net) = max(net, α · net). Like the Leaky ReLU in Fig. 2.17b, it does not vanish when net becomes negative. It has the same characteristics as the previous function; in addition, it improves the fitting of the model at a low computational cost and reduces the risk of overfitting. The hyperparameter α can be learned, since we can backpropagate into it. In this way it gives the neurons the possibility of choosing the best slope in the negative region.
Exponential Linear Unit (ELU). Defined as follows:

$$f(net, \alpha) = \begin{cases} net & net > 0\\ \alpha\,(e^{net} - 1) & net \leq 0 \end{cases} \qquad (2.105)$$

It has been proposed to speed up convergence and to alleviate the problem of gradient zeroing [27]. In the negative part it does not vanish, as for the Leaky ReLU (see Fig. 2.17c).
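A compact NumPy sketch of the four activation functions just described, written by us for illustration, is the following:

```python
import numpy as np

def relu(net):                      # (2.104)
    return np.maximum(0.0, net)

def leaky_relu(net, slope=0.01):    # Leaky ReLU
    return np.where(net >= 0, net, slope * net)

def parametric_relu(net, alpha):    # PReLU: alpha is a learnable slope
    return np.where(net >= 0, net, alpha * net)

def elu(net, alpha=1.0):            # (2.105)
    return np.where(net > 0, net, alpha * (np.exp(net) - 1.0))

net = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(net))                 # [0.    0.     0.  0.5  3. ]
print(leaky_relu(net))           # [-0.03 -0.005 0.  0.5  3. ]
print(parametric_relu(net, 0.2))
print(elu(net))
```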

2.13.2.4 Pooling Layer


It is applied downstream of the convolution, or on the activation maps calculated with
the functions described above, and has the objective of subsampling each map (see
Fig. 2.18a). In essence, it operates on every dth map of activation independently. The
most used information aggregation operator is the maximum (Max) (see Fig. 2.18b)
or the average (Avg) applied locally on each map. These aggregation operators ex-
hibit the characteristic of being invariant with respect to small feature translations.
The pooling has no parameters, but only hyperparameters that are identified in the
size of the windows on which to apply the operators Max or Avg, and the stride.
The affected volume of size W1 × H1 × D1 is reduced in the pooling layer in the
following dimensions:
W2 × H2 × D2

where
W1 − w H1 − w
W2 = +1 H2 = +1 D2 = D1 (2.106)
S S

Fig. 2.18 a Subsampling of the activation maps produced in the convolution layer. b Operation of max pooling applied to the rectified (ReLU) map of size 4 × 4, with pooling window 2 × 2 and stride = 2. A map of dimensions 2 × 2 is obtained in output

These relations are analogous to (2.103), but in this case, for the pooling, there is no padding, and therefore the parameter P = 0. Normally the values of the parameters used are (w, S) = (2, 2) or (w, S) = (3, 2). It is observed that in this layer, as in the ReLU layer, there are no parameters to learn. An example of max pooling is shown in Fig. 2.18b.
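A minimal NumPy sketch of max pooling on a single activation map, with window and stride as hyperparameters (the numeric values below are illustrative and not those of Fig. 2.18), is the following:

```python
import numpy as np

def max_pool2d(x, w=2, S=2):
    """Max pooling on a single activation map x (H1 x W1), window w x w, stride S.
    Output size follows Eq. (2.106) with P = 0."""
    H1, W1 = x.shape
    H2, W2 = (H1 - w) // S + 1, (W1 - w) // S + 1
    out = np.empty((H2, W2), dtype=x.dtype)
    for r in range(H2):
        for c in range(W2):
            out[r, c] = x[r*S:r*S+w, c*S:c*S+w].max()
    return out

# 4x4 map, 2x2 window, stride 2 -> 2x2 output, as in the layout of Fig. 2.18b
x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [1, 2, 9, 7],
              [0, 3, 4, 8]], dtype=float)
print(max_pool2d(x))   # [[6. 5.]
                       #  [3. 9.]]
```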

2.13.2.5 Fully Connected Layer

So far we have seen how to implement convolutional filters, activation maps using activation functions, and subsampling operations using pooling, which generates small activation (or feature) maps. In a CNN, the size of the volumes is reduced at each level of depth (variables H and W), but they generally become deeper in terms of the number of channels D. Usually, the output of the last convolution layer is provided as input to the Fully Connected (FC) layers. The FC layer is identical to any layer of a traditional deep network, where the neurons of the previous layer are totally connected to those of the following layer. The goal of the FC layer is to use the features determined by the convolutional layers to classify the input image into the various classes defined by the training set images. The volume of the FC layer is 1 × 1 × K, where in this case K, if it is the last output layer, indicates the number of classes of the data set to be classified. Therefore, K is the number of neurons totally connected with all the neurons of the previous FC layer or volume. The activation of these neurons occurs with the same formalism as in traditional networks (see Sect. 1.10.3).
In summary, a CNN network, starting from the input image, extracts the significant features through the convolution layer, organizing them in the convolutional volume that includes as many feature maps as there are convolution masks. Then follows a ReLU layer, which applies an activation function generating the activation volume of the same size as the convolutional one, and a pooling layer to subsample the feature maps. These three layers, Conv, ReLU, and Pool, can be replicated in cascade, depending on the size of the input volume and the level of reduction to be achieved, ending with

Fig. 2.19 Architectural scheme of a CNN network very similar to the original LeNet network [24] (designed for automatic character recognition), used here to classify 10 types of animals. The network consists of ConvNet components for extracting the significant features of the input image, given as input to the traditional MLP network component, based on three fully connected FC layers, which performs the function of classifier

the FC layer, representing a volume of size 1 × 1 × C completely connected with all the neurons of the previous layer, where the number of neurons C represents the number of classes of the data set under examination. In some publications the sequence Conv, ReLU, and Pool is considered to be part of the same level or layer; the reader should therefore be careful not to get confused.

The Conv, ReLU, and Pool layers thus defined make the CNN network invariant to small transformations, distortions, and translations in the input image. A slight distortion of the input image will not significantly change the feature maps in the pooling layer, since the Avg and Max operators are applied on small local windows. We therefore obtain a representation that is almost invariant to the image scale (equivariant representation). It follows that we can detect objects in the image regardless of their position.
Figure 2.19 shows the architectural scheme of a CNN network (used to classify 10 types of animals) very similar to the original LeNet network (designed for automatic character recognition), composed of 2 convolutional layers and two max pooling layers, arranged alternately, and of the fully connected layers (FC) that make up a traditional MLP network (layers FC1, FC2, and the logistic regression layer). As shown in the figure, given the image of a koala as input, the network recognizes it by assigning the highest probability to the koala class among all the 10 types of animals for which it was previously trained. As will be seen below, the output vector of the output layer is characterized in such a way that the sum of all the class probabilities corresponds to 1. The network tries to assign the highest probability to the koala class for the input image presented in the figure.
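As an illustration only, a LeNet-like network of this kind can be sketched in a few lines with PyTorch; the layer sizes below are our own assumptions (3 × 32 × 32 input, 10 classes) and do not reproduce the exact network of Fig. 2.19.

```python
import torch
import torch.nn as nn

# Minimal sketch of a LeNet-like CNN for C = 10 classes on 3 x 32 x 32 inputs
# (layer sizes are illustrative assumptions, not those of the figure).
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=5),   # Conv1: 3x32x32 -> 8x28x28
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # Pool1: 8x28x28 -> 8x14x14
            nn.Conv2d(8, 16, kernel_size=5),  # Conv2: 8x14x14 -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # Pool2: 16x10x10 -> 16x5x5
        )
        self.classifier = nn.Sequential(      # the "MLP" part: FC1, FC2, output
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),       # class scores (softmax applied in the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
scores = model(torch.randn(1, 3, 32, 32))     # one random RGB "image"
print(scores.shape)                            # torch.Size([1, 10])
print(torch.softmax(scores, dim=1).sum())      # probabilities sum to 1
```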

2.13.3 Operation of a CNN Network

Having analyzed the architecture of a CNN network, we now see its operation in the
context of classification. In analogy to the operation of a traditional MLP network,
the CNN network envisages the following phases in the forward/backward training
process:

1. Dataset normalization. Images are normalized with respect to the average. There are two ways to calculate the average. Suppose we have color images, thus with three channels. In the first case, the average is calculated for each channel and is subtracted from the pixels of the corresponding channel; the average is therefore a scalar for each of the R, G, and B bands of the single image. In the second case, all the training images of the dataset are analyzed, calculating an average color image, which is then subtracted pixel by pixel from each image of the dataset. In the latter case, the average image must be remembered in order to remove it from a test image. Furthermore, there are data augmentation techniques, which serve to increase the cardinality of the dataset and to improve the generalization of the network. These techniques modify each image by flipping it along both axes, generating rotated versions of it, and inserting slight geometric and radiometric transformations.
2. Initialization. The hyperparameter values related to all the Conv, Pool, and ReLU layers are defined (size w of the convolution masks, number of filters K, stride S, padding P), and the weights are initialized according to the Xavier method. The Xavier initialization method, introduced in [28], is an optimal way to initialize the weights with respect to the activation functions, avoiding convergence problems. The weights are initialized as follows:

$$W_{i,j} = U\!\left[ -\frac{1}{\sqrt{n}},\; \frac{1}{\sqrt{n}} \right] \qquad (2.107)$$

with U a uniform distribution and n the number of connections with the previous layer (the columns of the weight matrix W). The initialization of weights is a fairly active research area; other methods are discussed in [29–31]. (A minimal sketch of this initialization is shown after this list.)
3. Forward propagation. The images of the training set are presented as input (input volume W_0 × H_0 × D_0) to the Conv layer, which generates the feature maps subsequently processed in cascade by the ReLU and Pool layers, and finally given to the FC layer that calculates the final result over the classes.
4. Learning error calculation. The FC layer evaluates whether the learning error is still above a predefined threshold. Normally the vector of the C classes contains the labels of the objects to be classified in terms of probability, and the difference between the target probabilities and the current FC output is calculated with a metric to estimate this error. Considering that in the initial process the weights are random, the error will surely be high and the error backpropagation process is triggered, as in MLPs.

5. Backward propagation. The FC layer activates the backpropagation algorithm, as happens for a traditional network, calculating the gradient of the objective function (error) with respect to the weights of all the FC connections and using gradient descent to update all the weights and minimize the objective function. From the FC layer the network error is backpropagated to the previous layers. It should be noted that all the parameters associated with learning (for example, the weights) are updated, while the parameters that characterize the CNN network, such as the number and size of the masks, are set at the beginning and cannot be modified.
6. Reiterate from phase 2.
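A minimal sketch of the Xavier initialization (2.107), with names chosen by us for illustration, is the following:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Xavier initialization (2.107): weights drawn from a uniform distribution
    in [-1/sqrt(n), 1/sqrt(n)], with n the number of incoming connections."""
    rng = np.random.default_rng(0) if rng is None else rng
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_out, n_in))

W = xavier_uniform(n_in=400, n_out=120)          # e.g., an FC layer 400 -> 120
print(W.shape, float(W.min()), float(W.max()))   # values within +-1/sqrt(400) = +-0.05
```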

With the learning process, the CNN has calculated (learned) all the optimal parameters and weights to classify, in the test phase, an image of the training set. What happens if an image never seen during the learning phase is presented to the network? As with traditional networks, an image never seen before can be recognized as one of the learned images if it presents some similarity to them; otherwise, it is hoped that the network will not recognize it. The learning phase, depending on the available computing system and the size of the image dataset, can require even days of computation. The time required to recognize an image in the test phase is a few milliseconds, but it always depends on the complexity of the model and the type of GPU available. The level of generalization of the network in recognizing images also depends on the ability to construct a CNN with an adequate number of replicated convolutional layers (Conv, ReLU, Pool) and FC layers, the MLP component of a CNN. In image recognition applications it is strategic to design convolutional layers capable of extracting as many features as possible from the training set of images, using optimal filters to extract in the first Conv layers the primal sketch information (for example, lines, corners) and, in subsequent Conv layers, other types of filters to extract higher level structures (shapes, texture).

The monitoring of the parameters during the training phase is of fundamental importance: it allows one to avoid continuing the training when there is an error in the model or in the initialization of the parameters or hyperparameters, and to take specific actions. When the dataset is very large, it is useful to divide it into three parts, leaving, for example, 60% for training, 20% of the data for the validation phase, and the remaining 20% for the test phase, to be used only at the end, once the best model has been found with the training and validation partitions.

2.13.3.1 Stochastic Gradient Descent in Deep Learning


So far, to update the weights and generally the parameters of the optimization model
we have used the method of the gradient descent-GD. In machine learning problems,
where the training set dataset is very large, the calculation of the cost function and
the gradient can be very slow, requiring considerable memory resources and compu-
tational time. Another problem with the batch optimization methods is that they do
not provide an easy way to incorporate new data. The optimization method based on
the Stochastic Gradient Descent—SGD [32] addresses these problems following the
negative gradient of the objective function after processing only a few examples of
250 2 RBF, SOM, Hopfield, and Deep Neural Networks

the training set. The use of SGD in neural network settings is motivated by the high
computational cost of the backpropagation procedure on the complete training set.
SGD can overcome this problem and still lead to rapid convergence.
The SGD technique is also known as incremental gradient descent; it uses a stochastic technique to optimize a cost function or objective function, the loss function. Consider a dataset $(x_i, y_i)_{i=1}^N$, consisting of pairs of images and membership classes, respectively. The output of the network in the final layer is given by the score generated by the jth neuron for a given input x_i, $f_j(x_i, W) = W x_i + b$. Normalizing the outputs of the output layer, we obtain values that can be interpreted as probabilities, as follows for the kth neuron:

$$P(Y = k \mid X = x_i) = \frac{e^{f_k}}{\sum_j e^{f_j}} \qquad (2.108)$$

which is known as the Softmax function, whose output for each output neuron is in the range [0,1]. We define the cost function for the ith pair of the dataset as

$$L_i = -\log\!\left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right) \qquad (2.109)$$

and the objective function over the whole training set is given by

$$L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W). \qquad (2.110)$$

The first term of the total cost function is calculated on all the pairs supplied as input during the training process and evaluates the consistency of the model contained in W up to that point. The second term is the regularization; it penalizes very high values of W and prevents the problem of overfitting, i.e., the CNN behaving too well with the training data but then having poor performance with data never seen before (generalization). The R(W) function can be one of the following:

$$R(W) = \sum_k \sum_l W_{kl}^2$$

$$R(W) = \sum_k \sum_l |W_{kl}|$$

or a combination of the two with a β hyperparameter, obtaining

$$R(W) = \sum_k \sum_l \left( \beta W_{kl}^2 + |W_{kl}| \right).$$

For very large training datasets, the stochastic variant of gradient descent is adopted. That is, the sum is approximated over a subset of the dataset, named minibatch, consisting of training pairs chosen randomly in a number that is generally a power of 2, e.g., 32, 64, 128 or 256, depending on the memory capacity available in the GPUs in use. That said, instead of calculating the full gradient ∇L(W), the stochastic gradient ∇_W L(W) over the minibatch samples is calculated as follows:

$$\nabla_W L(W) = \frac{1}{N} \sum_{i=1}^{N} \nabla_W L_i(f(x_i, W), y_i) + \lambda \nabla_W R(W). \qquad (2.111)$$

where the minibatch stochastic gradient ∇_W L(W) is an unbiased estimator of the gradient ∇L(W). The update of the weights in the output layer is given by

$$W := W - \eta\, \nabla_W L(W) \qquad (2.112)$$

with the "−" sign used to move in the direction that minimizes the cost function, and η representing the learning parameter (learning rate or step size). The gradient of the cost function is propagated backward through the use of computational-graph representations, which make the weights in each layer of the network simple to update. The discussion of computational graphs is not covered in this text, nor are the variants of SGD integrating the momentum (discussed in the previous chapter), the adaptive gradient (ADAGRAD) [33], root mean square propagation (RMSProp) [34], the adaptive moment estimator (ADAM) [35], and the Kalman-based Stochastic Gradient Descent (kSGD) [36].
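A compact NumPy sketch of (2.108)–(2.112) for a single linear layer trained on random placeholder minibatches is given below; it is an illustration under our own simplifying assumptions (no hidden layers, L2 regularization only), not the training procedure of a full CNN.

```python
import numpy as np

def softmax(scores):
    """Softmax (2.108), computed row-wise with the usual max-shift for stability."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def minibatch_sgd_step(W, b, X, y, eta=0.1, lam=1e-3):
    """One SGD step (2.111)-(2.112) for a linear classifier f(x) = Wx + b with
    the cross-entropy loss (2.109) and an L2 regularizer R(W)."""
    n = X.shape[0]
    P = softmax(X @ W.T + b)              # predicted class probabilities
    P[np.arange(n), y] -= 1.0             # gradient of -log softmax w.r.t. the scores
    gW = P.T @ X / n + lam * 2 * W        # minibatch average + regularization gradient
    gb = P.mean(axis=0)
    return W - eta * gW, b - eta * gb     # update in the negative-gradient direction

rng = np.random.default_rng(4)
C, D, B = 3, 5, 32                        # classes, input size, minibatch size
W, b = np.zeros((C, D)), np.zeros(C)
for _ in range(200):
    X = rng.standard_normal((B, D))       # random minibatch (placeholder data)
    y = (X[:, 0] > 0).astype(int)         # toy labels depending on one feature
    W, b = minibatch_sgd_step(W, b, X, y)
print(W.round(2))
```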

2.13.3.2 Dropout
We had already highlighted the problem of overfitting for traditional neural networks, and to reduce it we used heuristic methods and verified their effectiveness (see Sect. 1.11.5). For example, one approach was to monitor the learning phase and stop it early (early stopping) when the values of some parameters, checked with a certain metric and thresholds, indicated it (a simple way was to check the number of epochs). Validation datasets (not used during network training) were also used to verify the level of generalization of the network as the number of epochs changed and to assess whether the network's accuracy improved or deteriorated.

For deep neural networks, with a much larger number of parameters, the problem of overfitting becomes even more serious. A regularization method that reduces the problem of overfitting in the context of deep neural networks is known as dropout, proposed in [37] in 2014. The dropout method, together with the use of rectified linear units (ReLUs), are two fundamental ideas for improving the performance of deep networks.

Fig. 2.20 Neural network with dropout [37]. a A traditional neural network with 2 hidden layers.
b An example of a reduced network produced by applying the dropout heuristic to the network in
(a). Neural units in gray have been excluded in the learning phase

The key idea of the dropout method11 is to randomly exclude neural units (together with their connections) from the neural network during each iteration of the training phase. This prevents the neural units from co-adapting too much. In other words, the exclusion amounts to ignoring, during the training phase, a certain set of neurons chosen in a stochastic way; these neurons are therefore not considered during a particular forward and backward propagation. More precisely, at each training iteration the individual neurons are either dropped from the network with probability 1 − p or kept with probability p, so as to obtain a reduced neural network (see Fig. 2.20). It should be noted that a neuron excluded in one iteration can be active in a subsequent iteration, because the sampling of the neurons occurs in a stochastic manner.

The basic idea of excluding or keeping a neural unit is to prevent an excessive adaptation at the expense of generalization, in other words, to avoid a network highly trained on the training dataset which then shows little generalization on the test or validation dataset. To avoid this, dropout provides for stochastically excluding neurons in the training phase, causing their contribution to the activation of the neurons in the downstream layer to be temporarily zeroed in the forward propagation, and,

11 The heuristic of the dropout is better understood through the following analogy. Let's imagine that in a patent office the expert is a single employee. As often happens, if this expert were always present, all the other employees of the office would have no incentive to acquire skills on patent procedures. But if the expert decides every day, in a stochastic way (for example, by tossing a coin), whether to go to work or not, the other employees, unable to block the activities of the office, are forced, at least occasionally, to adapt by acquiring skills. Therefore, the office cannot rely only on the single experienced employee: all the other employees are forced to acquire these skills. A sort of collaboration between the various employees is thus generated, when necessary, without their number being predefined. This makes the office as a whole much more flexible, increasing the quality and competence of the employees. In the jargon of neural networks, we would say that the network generalizes better.

consequently, any weights (of the synaptic connections) are not then updated in the
backward propagation.
During the learning phase of the neural network, the weights of each neuron intrinsically
model an internal state of the network through an adaptation process (weight update) that
depends on specific features, producing a certain specialization. Neighboring neurons rely
on this specialization, which can lead to a fragile model that is too specialized for the
training data. In essence, dropout tends to prevent the network from being too dependent
on a small number of neurons and forces each neuron to be able to operate independently
(it reduces co-adaptations). If some neurons are randomly excluded during the learning
phase, the other neurons have to step in and attempt to manage the representativeness and
prediction of the missing neurons. This leads the network to learn more independent
internal representations.
learned from the network. Experiments (in the context of supervised learning in vi-
sion, speech recognition, document classification, and computational biology) have
shown that with such training the network becomes less sensitive to specific neuron
weights. This, in turn, translates into a network that is able to improve generalization
and reduces the probability of over-adaptation with training data.
The dropout functionality is implemented by inserting a dropout layer between two
layers of a CNN. Normally it is inserted between the layers that have a high number of
parameters, and in particular between the FC layers. For example, in the network of
Fig. 2.19 a dropout layer can be inserted immediately before the FC layers, with the
function of randomly selecting the neurons to be excluded with a certain probability (for
example, p = 0.25, which implies that 1 in 4 input neurons is randomly excluded) in each
forward step (zero contribution to the activation of neurons in the downstream layer) and
in the backward propagation phase (no weight update for the excluded units). The dropout
feature is not used during the network validation phase. The hyperparameter p should be
tuned starting from a low probability value (for example, p = 0.2) and then increased up
to p = 0.5, taking care to avoid very high values so as not to compromise the learning
ability of the network.
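To make the mechanism concrete, the following is a minimal NumPy sketch of the
inverted-dropout forward pass just described; the layer sizes, the name p_drop (used here
for the exclusion probability, as in the 1-in-4 example above) and the random data are
illustrative assumptions, not part of the original text.

```python
import numpy as np

def dropout_forward(activations, p_drop=0.25, training=True, rng=None):
    """Inverted dropout applied to the activations of one layer.

    During training each unit is zeroed with probability p_drop and the surviving
    activations are rescaled by 1/(1 - p_drop), so that at validation/test time the
    layer can be used unchanged (no dropout, no rescaling).
    """
    if not training or p_drop == 0.0:
        return activations                               # dropout disabled at test/validation time
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - p_drop
    mask = rng.random(activations.shape) < keep_prob     # stochastic selection of the units to keep
    return activations * mask / keep_prob                # rescale so the expected activation is unchanged

# Illustrative use between two fully connected layers (sizes are arbitrary).
x = np.random.rand(32, 256)                              # minibatch of 32 feature vectors
h = np.maximum(0.0, x @ np.random.randn(256, 128))       # FC layer + ReLU
h = dropout_forward(h, p_drop=0.25, training=True)       # roughly 1 unit in 4 is dropped in this step
```

With inverted dropout the surviving activations are rescaled during training, so at
validation or test time the layer is simply left unchanged, consistent with dropout not
being applied in that phase.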
In summary, the dropout method forces a neural network to learn more robust features
that are useful in conjunction with many different random subsets of the other neurons.
The weights of the network are learned in conditions in which part of the neurons are
temporarily excluded, whereas when the network is used in the test phase all neurons are
involved. This tends to reduce overfitting, since the network trained with the dropout
layer is in effect a sort of average of different networks, each of which could potentially
overfit, although with different characteristics; in the end, this heuristic is expected to
reduce (by averaging) the phenomenon of overfitting. In essence, it reduces the
co-adaptation of neurons: a neuron, in the absence of some of the other neurons, is forced
to adapt differently, in the hope of capturing (learning) more significant features that are
useful together with different random subgroups of other neurons. Most CNN
architectures include a dropout layer. A further widespread regularization method is
known as Batch Normalization, described in the following paragraph.

2.13.3.3 Batch Normalization


A technique recently developed by Ioffe and Szegedy [38], called Batch Normalization,
alleviates the difficulties of correctly initializing neural networks by explicitly forcing the
activations to assume a unit Gaussian distribution during the training phase. This type of
normalization is simple and differentiable. The goal is to prevent the early saturation of
nonlinear activation functions such as the sigmoid function or ReLU, ensuring that all
input data are in the same range of values. In the implementation, the BatchNorm (BN)
layer is inserted immediately after the fully connected FC layers (or convolutional layers),
and normally before the nonlinear activation functions.
Normally each neuron of the hidden layer l receives as input the output x^(l−1) of the
neurons of the previous layer l − 1, computing the activation signal net^(l)(x), which is
then transformed by the nonlinear activation function f^(l) (for example, the sigmoid or
ReLU function), giving the output

$$f^{(l)}\big(net^{(l)}(x)\big) = f^{(l)}\big(W^{(l)} x^{(l-1)} + b^{(l)}\big)$$

Inserting the normalization process instead applies the BN function before the activation
function, obtaining

$$\hat{y}^{(l)} = f^{(l)}\Big(BN\big(net^{(l)}(x)\big)\Big) = f^{(l)}\Big(BN\big(\underbrace{W^{(l)} x^{(l-1)}}_{y^{(l)}}\big)\Big)$$

where in this case the bias b is ignored because, as we shall see, its role will be played by
the parameter β of BN. In principle the normalization could be computed over the whole
training dataset but, when used together with the stochastic optimization process
described above, it is not practical to use the entire dataset. Therefore, the normalization
is limited to each minibatch B = {x_1, . . . , x_m} of m samples during the stochastic
training of the network. For a layer of the network with d-dimensional input
y = (y^(1), . . . , y^(d)), according to the previous equations, the normalization of each
kth feature ŷ^(k) is given by

$$\hat{y}^{(k)} = \frac{y^{(k)} - E[y^{(k)}]}{\sqrt{Var[y^{(k)}]}} \qquad (2.113)$$

where the expected value and the variance are calculated on the single minibatch
By = {y1 , . . . , ym } consisting of m activations of the current layer. Average μB and
variance σB2 are approximated using the data from the minibatch as follows:

$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} y_i$$

$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (y_i - \mu_B)^2$$

$$\hat{y}_i \leftarrow \frac{y_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where ŷ_i are the normalized values of the inputs of the lth layer, while ε is a small
number added to the denominator to guarantee numerical stability (it avoids division by
zero). Note that simply normalizing each input of a layer can change what the layer can
represent. Most activation functions present problems when the normalization is applied
before the activation function. For example, for the sigmoid activation function, the
normalized region is more linear than nonlinear.
Therefore, it is necessary to apply a transformation that allows the normalized
distribution to be shifted and rescaled. We introduce, for each activation, a pair of
parameters γ and β that, respectively, scale and translate each normalized value ŷ_i, as
follows:

ỹi ← γ ŷi + β (2.114)

These additional parameters are learned along with the weight parameters W of the
network during the training process through backpropagation, in order to improve
accuracy and speed up the training phase. Furthermore, normalization does not alter the
network's representative power: if the network learns γ = √(σ_B² + ε) and β = μ_B, then
ỹ_i coincides with the original input y_i, that is, BN can operate as an identity function
that returns the original values.
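As an illustration of Eqs. (2.113)–(2.114), the following is a minimal NumPy sketch of
the BN forward pass on a minibatch of pre-activations; the array shapes and the value of
ε are illustrative assumptions.

```python
import numpy as np

def batchnorm_forward(y, gamma, beta, eps=1e-5):
    """Batch normalization of the pre-activations y of one layer.

    y     : minibatch of pre-activations, shape (m, d)
    gamma : scale parameter, shape (d,)  (learned with backpropagation)
    beta  : shift parameter, shape (d,)  (learned with backpropagation)
    """
    mu = y.mean(axis=0)                       # per-feature minibatch mean (mu_B)
    var = y.var(axis=0)                       # per-feature minibatch variance (sigma_B^2)
    y_hat = (y - mu) / np.sqrt(var + eps)     # normalization, as in Eq. (2.113)
    return gamma * y_hat + beta               # scale and shift, as in Eq. (2.114)

# Illustrative use: 64 samples, 10 features; gamma/beta initialized to the identity transform.
y = np.random.randn(64, 10) * 3.0 + 5.0
out = batchnorm_forward(y, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1 per feature
```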
The use of BN has several advantages:

1. It improves the flow of the gradient through the net, in the sense that the descent
of the gradient can reduce the oscillations when it approaches the minimum point
and converge faster.
2. Reduces network convergence times by making the network layers independent
of the dynamics of the input data of the first layers.
3. It allows a higher learning rate to be used. In a traditional network, high values of the
   learning factor can scale the parameters in a way that amplifies the gradients, thus
   leading to saturation. With BN, small changes in the parameters of one layer are not
   propagated to the other layers. This allows higher learning factors to be used by the
   optimizers, which otherwise would not have been possible. Moreover, it makes the
   propagation of the gradient in the network more stable, as indicated above.
4. Reduces dependence on accurate initialization of weights.
5. It acts in some way as a form of regularization, motivated by the fact that computing
   the statistics on reduced minibatch samples introduces a small amount of noise in each
   layer of the network.
6. Thanks to the regularization effect introduced by BN, the need to use the dropout
is reduced.

2.13.3.4 Monitoring of the Training Phase


Once the network architecture to be used has been chosen, great attention must be paid
to the initialization phase, as discussed above for data and weights, and several network
parameters must be monitored. A first check of the implemented model is the verification
of the loss function. Set λ = 0 to turn off the regularization and check the value of the
objective function computed by the network. If we are solving a 10-class problem, for
example, a network that has learned nothing shows no preference for the correct class in
the output layer, that is, no output neuron of the corresponding class yields the maximum
near the correct class; this corresponds to an initial value of the loss function which, in
the case of a 10-class problem, is L = log(10) ≈ 2.3. Then the regularization term is
activated with a very high λ, and one checks that the loss function increases with respect
to the value obtained previously (when λ was zero).
The next test is to use a few dozen training samples to check whether the model can
overfit them, drastically reducing the loss function and reaching 100% accuracy.
Subsequently, the optimal training parameter must be identified by choosing a certain
number of learning rates randomly (according to a uniform distribution) in a defined
interval, for example, [10−6, 10−3], and starting several training sessions from a tabula
rasa configuration. The same can be done in parallel for the hyperparameter λ, which
controls the regularization term. In this way, the accuracy obtained for each of the
randomly set values can be recorded, and λ and η chosen accordingly for the complete
training phase. Before starting the training phase on the whole dataset, the
hyperparameter search can be repeated, narrowing the search interval around the values
identified in the previous phase (coarse-to-fine search).
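The following is a minimal sketch of the random hyperparameter search described
above; the sampling is done in log-space (a common refinement of the uniform sampling
mentioned in the text), and short_training_run is a hypothetical placeholder for the
user's own training routine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_log_uniform(low, high, size):
    """Sample values uniformly in log10-space over [low, high]."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size)

def short_training_run(lr, lam):
    """Placeholder for a short training session returning validation accuracy.
    In practice this would train the network for a few epochs from scratch
    with learning rate lr and regularization strength lam."""
    return float(rng.random())   # dummy value, only so the sketch runs

# Coarse random search over eta in [1e-6, 1e-3] and lambda in [1e-6, 1e-1].
learning_rates = sample_log_uniform(1e-6, 1e-3, 20)
lambdas = sample_log_uniform(1e-6, 1e-1, 20)
results = sorted(
    ((short_training_run(lr, lam), lr, lam) for lr, lam in zip(learning_rates, lambdas)),
    reverse=True,
)
best_accuracy, best_lr, best_lambda = results[0]
# A finer search can then be run in a narrower interval around best_lr and best_lambda.
```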

2.13.4 Main Architectures of CNN Networks

Several convolutional network architectures have been realized. In addition to the LeNet
described in the preceding paragraphs, the most common are the following.

AlexNet, described in [19], is a deeper and much wider version of LeNet. It was the
winning entry of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) of
2012, and it made convolutional networks popular in the Computer Vision community.
It turned out to be an important breakthrough compared to previous approaches, and the
spread of CNNs can be attributed to this work. It showed good performance and won by
a wide margin the difficult challenge of visual recognition on large datasets (ImageNet)
in 2012.
ZFNet, described in [39], is an evolution of AlexNet with better performance, and was
the winner of the ILSVRC 2013 competition. Compared to AlexNet, the hyperparameters
of the architecture have been modified, in particular by expanding the size of the central
convolutional layers and reducing the size of the stride and of the window on the first
layer.
GoogLeNet, described in [40], differs from the previous ones by introducing the
inception module, which heavily reduces the number of network parameters to about 4M
compared to the 60M used in AlexNet. It was the winner of ILSVRC 2014.

VGGNet, described in [41], was ranked second in ILSVRC 2014. Its main contribution
was the demonstration that the depth of the network (number of layers) is a critical
component for achieving an optimal CNN.
ResNet, described in [42], was ranked first in ILSVRC 2015. The originality of this
network lies in the absence of the FC layer, the heavy use of batch normalization and,
above all, the winning idea of the so-called identity shortcut connection, which skips the
connections of one or more layers. ResNet is currently by far the most advanced of the
CNN network models and is the default choice when using CNN networks. ResNet
makes it possible to train networks of up to hundreds or even thousands of layers and
achieves high performance. Taking advantage of its powerful representation capacity, the
performance of many artificial vision applications other than image classification, such
as object detection and facial recognition [43,44], has been enhanced.

The authors of ResNet, following the intuitions of the first version, have refined the
residual block and proposed a pre-activation variant of the residual block [45], in which
the gradients can flow without obstacles through the shortcut connections to any previous
layer. They have shown experimentally that in this way they can train a 1001-layer deep
ResNet that outperforms its shallower counterparts. Thanks to its convincing results,
ResNet has quickly become one of the strategic architectures in various artificial vision
applications.
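As an illustration, the following is a minimal sketch of a pre-activation residual block
with an identity shortcut connection, written assuming PyTorch; the channel sizes and the
layer arrangement are illustrative and simplified with respect to the full ResNet
architectures.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, twice, plus identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))    # first pre-activation + convolution
        out = self.conv2(self.relu(self.bn2(out)))  # second pre-activation + convolution
        return out + x                              # identity shortcut connection

# Illustrative use on a batch of 8 feature maps with 64 channels.
block = PreActResidualBlock(64)
y = block(torch.randn(8, 64, 32, 32))
```

The identity shortcut lets the gradient reach earlier layers unchanged, which is what
makes very deep stacks of such blocks trainable.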
A more recent evolution has modified the original architecture by renaming the
network to ResNeXt [46] by introducing a hyperparameter called cardinality, the
number of independent paths, to provide a new way to adapt the model’s capacity.
Experiments show that accuracy can be improved more efficiently by increasing
cardinality rather than going deeper or wider. The authors say that, compared to
the Inception module of GoogLeNet, this new architecture is easier to adapt to new
datasets and applications, as it has a simple paradigm and only one hyperparameter
to adjust, while the Inception module has many hyperparameters (such as the size of
the convolution layer mask of each path) to fine-tune the network.
A further evolution is described in [47,48] with a new architecture called DenseNet,
which further exploits the effects of shortcut connections by connecting all the layers
directly to each other. In this new architecture, the input of each layer consists of the
feature maps of all the previous layers, and its output is passed to each successive layer.
The feature maps are aggregated by depth concatenation.

References
1. R.L. Hardy, Multiquadric equations of topography and other irregular surfaces. J. Geophys.
Res. 76(8), 1905–1915 (1971)
2. J.D. Powell, Radial basis function approximations to polynomials, in Numerical Analysis, eds.
by D.F. Griffiths, G.A. Watson (Longman Publishing Group White Plains, New York, NY,
USA, 1987), pp. 223–241

3. C. Lanczos, A precision approximation of the gamma function. SIAM J. Numer. Anal. Ser. B 1, 86–96 (1964)
4. T. Kohonen, Selforganized formation of topologically correct feature maps. Biol. Cybern. 43,
59–69 (1982)
5. C. Von der Malsburg, Self-organization of orientation sensitive cells in the striate cortex. Ky-
bernetik 14, 85–100 (1973)
6. S. Amari, Dynamics of pattern formation in lateral inhibition type neural fields. Biol. Cybern.
27, 77–87 (1973)
7. T. Kohonen, Self-Organizing Maps, 3rd edn. ISBN 3540679219 (Springer-Verlag New York,
Inc. Secaucus, NJ, USA, 2001)
8. T. Kohonen, Learning vector quantization. Neural Netw. 1, 303 (1988)
9. T. Kohonen, Improved versions of learning vector quantization. Proc. Int. Joint Conf. Neural
Netw. (IJCNN) 1, 545–550 (1990)
10. J.J. Hopfield, Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci 79, 2554–2558 (1982)
11. R.P. Lippmann, B. Gold, M.L. Malpass, A comparison of hamming and hopfield neural nets
for pattern classification. Technical Report ESDTR-86-131,769 (MIT, Lexington, MA, 1987)
12. J.J. Hopfield, Neurons with graded response have collective computational properties like those
of two-state neurons. Proc. Nat. Acad. Sci 81, 3088–3092 (1984)
13. D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for boltzmann machines. Cogn.
Sci. 9(1), 147–169 (1985)
14. J.A. Hertz, A.S. Krogh, R. Palmer, Introduction to the Theory of Neural Computation. (Addison-
Wesley, Redwood City, CA, USA, 1991). ISBN 0-201-50395-6
15. R. Rojas, Neural Networks: A Systematic Introduction (Springer Science and Business Media, Berlin, 1996)
16. P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, eds. by D.E. Rumelhart, J.L. McClelland, vol. 1, Chapter 6 (MIT Press, Cambridge, 1986), pp. 194–281
17. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks.
Science 313(5786), 504–507 (2006)
18. D.C. Ciresan, U. Meier, J. Masci, L.M. Gambardella, J. Schmidhuber, Flexible, high performance convolutional neural networks for image classification, in International Joint Conference on Artificial Intelligence, pp. 1237–1242, 2011
19. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional
neural networks, in Advances in Neural Information Processing Systems, eds. by F. Pereira,
C.J.C. Burges, L. Bottou, K.Q. Weinberger, vol. 25 (Curran Associates, Inc., Red Hook, NY,
2012), pp. 1097–1105
20. G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P.
Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech
recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
21. T. Mikolov, Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012
22. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015). ISSN 0028-0836. https://doi.org/10.1038/nature14539
23. Y. LeCun, B.E. Boser, J.S. Denker, D. Henderson, R.E. Howard, W.E. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in Advances in Neural Information Processing Systems, ed. by D.S. Touretzky, vol. 2 (Morgan-Kaufmann, Burlington, 1990), pp. 396–404

24. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)
25. A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic
models. Int. Conf. Mach. Learn. 30, 3 (2013)
26. K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level per-
formance on imagenet classification, in International Conference on Computer Vision ICCV,
2015a
27. D.A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by expo-
nential linear units (elus), in ICLR, 2016
28. X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural net-
works. AISTATS 9, 249–256 (2010)
29. D. Mishkin, J. Matas, All you need is a good init. CoRR (2015). http://arxiv.org/abs/1511.
06422
30. P. Krähenbühl, C. Doersch, J. Donahue, T. Darrell, Data-dependent initializations of convolu-
tional neural networks, CoRR (2015). http://arxiv.org/abs/1511.06856
31. D. Sussillo, Random walks: Training very deep nonlinear feed-forward networks with smart
initialization. CoRR (2014). http://arxiv.org/abs/1412.6558
32. Q. Liao, T. Poggio, Theory of deep learning ii: Landscape of the empirical risk in deep learning.
Technical Report Memo No. 066 (Center for Brains, Minds and Machines (CBMM), 2017)
33. D. John, H. Elad, S. Yoram, Adaptive subgradient methods for online learning and stochastic
optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). ISSN 1532-4435. http://dl.acm.org/
citation.cfm?id=1953048.2021068
34. S. Ruder, An overview of gradient descent optimization algorithms. CoRR (2016). http://arxiv.
org/abs/1609.04747
35. D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR (2014). http://arxiv.
org/abs/1412.6980
36. V. Patel, Kalman-based stochastic gradient method with stop condition and insensitivity to
conditioning. SIAM J. Optim. 26(4), 2620–2648 (2016). https://doi.org/10.1137/15M1048239
37. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple
way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
http://jmlr.org/papers/v15/srivastava14a.html
38. S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing
internal covariate shift, CoRR (2015). http://arxiv.org/abs/1502.03167
39. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, CoRR (2013).
http://arxiv.org/abs/1311.2901
40. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A.
Rabinovich, Going deeper with convolutions, In Computer Vision and Pattern Recognition
(CVPR) (IEEE, Boston, MA, 2015). http://arxiv.org/abs/1409.4842
41. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recogni-
tion. CoRR (2014). http://arxiv.org/abs/1409.1556
42. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. CoRR, 2015b.
http://arxiv.org/abs/1512.03385
43. M. Del Coco, P. Carcagnì, M. Leo, P. Spagnolo, P.L. Mazzeo, C. Distante, Multi-branch CNN for multi-scale age estimation, in International Conference on Image Analysis and Processing, pp. 234–244, 2017
44. M. Del Coco, P. Carcagnì, M. Leo, P.L. Mazzeo, P. Spagnolo, C. Distante, Assessment of deep learning for gender classification on traditional datasets, in Advanced Video and Signal Based Surveillance (AVSS), pp. 271–277, 2016
45. K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, CoRR, 2016.
http://arxiv.org/abs/1603.05027

46. S. Xie, R.B. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep
neural networks. CoRR (2016). http://arxiv.org/abs/1611.05431
47. G. Huang, Z. Liu, K.Q. Weinberger, Densely connected convolutional networks, CoRR, 2016a.
http://arxiv.org/abs/1608.06993
48. G. Huang, Y. Sun, Z. Liu, D. Sedra, K.Q. Weinberger, Deep networks with stochastic depth,
CoRR, 2016b. http://arxiv.org/abs/1603.09382
3 Texture Analysis

3.1 Introduction

Texture is an important component for recognizing objects. In the field of image
processing, the term texture conventionally denotes any geometric and repetitive
arrangement of the gray levels of an image [1]. In this sector, texture becomes an
additional strategic component for solving the problems of object recognition, image
segmentation, and synthesis. Several research efforts have focused on the mechanisms of
human visual perception of texture, to be emulated for the development of systems for
the automatic analysis of the information content of an image (partitioning of the image
into regions with different textures, reconstruction and orientation of surfaces, etc.).

3.2 The Visual Perception of the Texture

The human visual system easily detects and recognizes different types of textures,
characterizing them in a subjective way. In fact, there is no general definition of texture,
nor a method of measuring texture accepted by all. Without entering into the merits of
our ability to perceive texture, our qualitative way of characterizing textures with
attributes such as coarse, granular, random, ordered, threadlike, dotted, fine-grained, etc.,
is well known. Several studies have shown [2,3] that the quantitative analysis of texture
passes through statistical and structural relationships between the basic elements (known
as texture primitives, also called texels) of what we call texture. Our visual system easily
determines the relationships between the fundamental geometric structures that
characterize a specific texture composed of macrostructures, such as the regular covering
of a building or a floor. Similarly, our

visual system is able to easily interpret a texture composed of microstructures, as is seen
by observing from a satellite image the various types of cultivation of a territory, or the
microstructures associated with artifacts or those observed in a microscope image. Can
we say that our visual system rigorously describes the microstructures of a texture?
The answer does not seem obvious, given that the characterization of a texture is often
very subjective. We also add that our subjectivity often depends on the conditions of
visibility and on the particular context. In fact, the observed texture characteristics of a
building's coating depend very much on the distance and lighting conditions, which
influence the perception and interpretation of what is observed. In other words, the
interpretation of the texture depends very much on the scale factor with which the texture
primitives are observed, on how they are illuminated, and on the size of the image
considered. From the point of view of a vision machine, we can define the texture of an
image as a particular repetitive arrangement of pixels, with a given local variation of gray
levels or colors, that identifies one or more regions (textured regions).
In Figs. 3.1 and 3.2 are shown, respectively, a remote sensing image of the territory
and images acquired by the microscope that highlight some examples of textures
related to microstructures and macrostructures, and images with natural and artificial
textures. From these examples, we can observe the different complexity of textures,
in particular in natural ones, where the distribution of texture components has an
irregular stochastic arrangement. A criterion for defining the texture of a region is
that of homogeneity which can be perceptually evaluated in terms of color or gray
levels with spatial regularity.
From the statistical point of view, homogeneity implies stationary statistics, i.e., the
statistical information of each texture element is identical. This characteristic is
correlated with the similarity of the structures, as happens when they occur at different
scales and, although not identical, present the same statistics. If we consider the criterion
of spatial homogeneity, textures can be classified as homogeneous (when the spatial
structures are regular), weakly homogeneous (when the structures repeat with some
spatial variation), and inhomogeneous (when there are no similar structures that replicate
spatially).
The visual perception of texture has been studied at an interdisciplinary level by the
cognitive science and psychophysiology community, exploring aspects of the neural
processes involved in perception to understand the mechanisms of interpretation
and segregation (seen as a perceptive characteristic sensitive to similarity) of the
texture. The computational vision community has developed research to emulate
the perceptual processes of texture to the computer, deriving appropriate physical-
mathematical models to recognize classes of textures, useful for classification and
segmentation processes.

Fig. 3.1 Remote sensing image

Fig. 3.2 Images acquired from the microscope, in the first line, natural images (the first three of
the second line) and artificial images (the last two)

3.2.1 Julesz’s Conjecture

The first computational studies of texture were conducted by Julesz [2,3]. Several
experiments demonstrated the importance of statistical analysis of the image in
perception, for various types of textures, in order to understand how low-level vision
responds to variations in the order of the statistics. Various images with different
statistics were used, with patterns (such as points, lines, symbols, ...) distributed and
positioned randomly, each corresponding to a particular order statistic. Examples

Fig. 3.3 Texture pairs with identical and different second-order statistics. In a the two textures
have pixels with different second-order statistics and are easily distinguishable from people; b the
two textures are easily distinguishable but have the same second-order statistics; c textures with
different second-order statistics that are not easy to distinguish from the human unless it closely
scrutinizes the differences; d textures share the same second-order statistics but an observer does
not immediately discriminate the two textures

of order statistics used are those of the first order (associated with the contrast), of the
second order to characterize homogeneity and those of the third order to characterize
the curvature. Textures with sufficient differences in brightness are easily separable
in different regions with the first order statistics.
Textures with differently oriented structures are also easily separable. Initially, Julesz
found that textures with similar first-order statistics (gray-level histogram) but different
second-order statistics (variance) are easily discriminable. However, no textures were
found that have identical first- and second-order statistics but different third-order
statistics (moments) and that can be discriminated. This led to Julesz's conjecture:
textures with identical second-order statistics are indistinguishable. For example, the
textures of Fig. 3.3a have different second-order statistics and are immediately
distinguishable, while the textures of Fig. 3.3d share the same statistics and are not easily
distinguishable [4].

3.2.2 Texton Statistics

Later, Caelli, Julesz, and Gilbert [5] found textures with identical first- and second-order
statistics, but with different third-order statistics, which are distinguishable (see Fig. 3.3b)
by pre-attentive visual perception, thus violating Julesz's conjecture. Further studies by
Julesz [3] showed that the mechanisms of human visual perception do not necessarily use
third-order statistics to distinguish textures with identical second-order statistics, but
instead use the statistics of patterns (structures) called textons. Consequently, the
previous conjecture was revised as follows: the human pre-attentive visual system does
not compute statistical parameters higher than second order.

It is also stated that the pre-attentive human visual system1 uses only the first-order
statistics of these textons, which are, in fact, local structural characteristics such as edges,
oriented lines, blobs, etc.

3.2.3 Spectral Models of the Texture

Psychophysiological research has shown the evidence that the human brain performs
a spectral analysis on the retinal image that can be emulated on the computer by
modeling with a filter bank [6]. This research has motivated the development of
mathematical models of texture perception based on appropriate filters. Bergen [7]
suggests that the texture present in an image can be decomposed into a series of
image sub-bands using a bank of linear filters with different scales and orientations.
Each sub-band image is associated with a particular type of texture. In essence, the
texture is characterized by the empirical distribution of the filter responses, and similarity
metrics (distances between distributions) can be used to discriminate potential differences
between textures.
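As an illustration of this filter-bank view, the following is a minimal NumPy/SciPy
sketch that builds a small bank of Gabor filters at a few scales and orientations and
summarizes each filtered sub-band with its mean absolute response; the kernel
parameterization and the synthetic test image are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(frequency, theta, sigma=3.0, size=15):
    """Real part of a Gabor kernel with given spatial frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)            # rotate coordinates
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + y_r ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * frequency * x_r)
    return envelope * carrier

def filter_bank_responses(image, frequencies=(0.1, 0.2, 0.3), n_orientations=4):
    """Convolve the image with a small bank of Gabor filters and return,
    for each filter, the mean absolute response (a simple texture descriptor)."""
    features = []
    for f in frequencies:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            response = convolve2d(image, gabor_kernel(f, theta), mode='same', boundary='symm')
            features.append(np.abs(response).mean())
    return np.array(features)

# Illustrative use on a synthetic striped texture.
img = np.sin(2 * np.pi * 0.2 * np.arange(128))[None, :] * np.ones((128, 1))
print(filter_bank_responses(img))
```

Comparing such response vectors with a distance between distributions (or simply a
Euclidean distance between descriptors) is one way to discriminate textures.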

3.3 Texture Analysis and its Applications

The research described in the previous paragraph did not lead to a rigorous defini-
tion of texture but led to a variety of approaches (physical-mathematical models) to
analyze and interpret texture. In developing a vision machine, texture analysis (texture
exhibits a myriad of properties) can be guided by the application and may be required at
various stages of the automatic vision process: for example, in the context of
segmentation to detect homogeneous regions, in the feature extraction phase, or in the
classification phase, where the texture characteristics can provide useful information for
object recognition. Given the complexity of the information content of the local
structures of a texture, which can be expressed in several perceptive terms (color, light
intensity, density, orientation, spatial frequency, linearity, random arrangement, etc.),
texture analysis methods can be grouped into four classes:

1. Statistical approach. The texture is expressed as a quantitative measure of the local
   distribution of pixel intensities. This local distribution of pixels generates statistical
   and structural relationships between the basic elements, the texture primitives (known
   as texels). The characterization of the texels can be

1 Stimulus processing does not always require the use of attentional resources. Many experiments
have shown that the elementary characteristics of a stimulus derived from the texture (as happens
for color, shape, motion) are detected without the intervention of attention. The processing of the
stimulus is therefore defined as pre-attentive. In other words, the pre-attentive information process
allows the most salient features of the texture to be detected very quickly, and only afterwards does
focused attention complete the recognition of the particular texture (or of an object in general).

   dominated by local structural properties, and in this case we obtain a symbolic
   description of the image in terms of microstructures and macrostructures modeled
   from the primitives. For images with natural textures (for example, satellite images of
   the territory), the primitives are characterized by local pixel statistics, derived from
   the gray-level co-occurrence matrix (GLCM) [8].
2. Structural Approach. The texture can be described with 2D patterns through a set
of primitives that capture local geometric structures (texton) according to certain
positioning rules [8]. If the primitives are well defined to completely character-
ize the texture (in terms of micro and macro structures), we obtain a symbolic
description of the image that can be reconstructed. This approach is more effec-
tive for synthesis applications than for classification. Other structural approaches
are based on morphological mathematical operators [9]. This approach is very
useful in the detection of defects characterized by microstructures present in the
artifacts.
3. Stochastic approach. With respect to the two previous approaches, it is more
general. The texture is assumed to derive from a realization of a stochastic pro-
cess that captures the information content of the image. The non-deterministic
properties of the spatial distribution of texture structures are determined by esti-
mating the parameters of the stochastic model. There are several methods used
to define the model that governs the stochastic process. Parameter estimation can
be done with the Monte Carlo method, maximum likelihood or with methods
that use nonparametric models.
4. Spectral Approach. Based on the classical Fourier and wavelet transforms, in whose
   domains the image is represented in terms of frequencies in order to characterize the
   textures. The wavelet transform [10] and Gabor filters are used to maintain the spatial
   information of the texture [11]. Normally, filter banks [12] are used to better capture
   the various frequencies and orientations of the texture structures.

Other approaches, based on fractal models [13] have been used to better characterize
natural textures, although they have the limit of losing local information and orien-
tation of texture structures. Another method of texture analysis is to compare the
characteristics of the texture observed with a previously synthesized texture pattern.
These texture analysis approaches are used in application contexts: extraction
of the image features described in terms of texture properties; segmentation where
each region is characterized by a homogeneous texture; classification to determine
homogeneous classes of texture; reconstruction of the surface (shape from texture) of
the objects starting from the information (such as density, dimension, and orientation)
associated with the macrostructures of the texture; and synthesis to create extensive
textures starting from small texture samples, very useful in rendering applications
(computational graphics).
For this last aspect, with respect to classification and segmentation, the synthesis
of the texture requires a greater characterization of the texture in terms of description
of the details (accurate discrimination of the structures).

The classification of texture concerns the assignment of particular texture regions to
various predefined texture classes. This type of analysis can be seen as an alternative to
the techniques of supervised classification and cluster analysis, or as a complement to
further refine the classes found with these techniques. For example, a satellite image can
be classified with the techniques associated with texture analysis, or classified with the
clustering methods (see Chap. 1) and then further investigated to search for detailed
subclasses that characterize a given texture, for example, in the case of a forest, cropland,
or other regions.
The statistical approach is particularly suitable when the texture consists of very
small and complex elementary primitives, typical of microstructures. When a texture
is composed of primitives of large dimensions (macrostructures) it is fundamental
to first identify the elementary primitives, that is, to evaluate their shape and their
spatial relationship. These last measures are often also altered by the effect of the
perspective view that modifies the shape and size of the elementary structures, and
by the lighting conditions.

3.4 Statistical Texture Methods

The statistical approach is particularly suitable for the analysis of microstructures. The
elementary primitives of microstructures are determined by analyzing the texture
characteristics associated with a few pixels in the image. Recall that for the
characterization of the texture it is important to derive the spatial dependence of the
pixels on the values of the gray levels. The distribution of gray levels in a histogram is
not, by itself, useful information for the texture because it does not contain any spatial
properties.

3.4.1 First-Order Statistics

First-order statistics measure the likelihood of observing a gray value at a random
position in the image. They can be calculated from the histogram of the gray levels of the
image pixels, and therefore depend only on the gray level of the single pixel and not on
the co-occurring interaction with the surrounding pixels. The average intensity of the
image is an example of a first-order statistic.
Recall that from the histogram we can derive the p(L) approximate probability
density2 of the occurrence of an intensity level L (see Sect. 6.9 Vol. I), given by

2 It is assumed that L is a random variable that expresses the gray level of an image deriving from
a stochastic process.

$$p(L) = \frac{H(L)}{N_{tot}} \qquad L = 0, 1, 2, \ldots, L_{max} \qquad (3.1)$$

where H (L) indicates the frequency of pixels with gray level L and Ntot is the total
number of pixels in the image (or a portion thereof). We know that from the shape of
the histogram we can get information on the characteristics of the image. A narrow
peak level distribution indicates an image with low contrast, while several isolated
peaks suggest the presence of different homogeneous regions that differ from the
background. The parameter that characterizes the first-order statistic is given by the
average μ, calculated as follows:
$$\mu = \sum_{L=0}^{L_{max}} L \cdot p(L) \qquad (3.2)$$

3.4.2 Second-Order Statistics

To obtain useful parameters (features of the image) from the histogram, one can
derive quantitative information from the statistical properties of the first order of the
image. In essence we can derive the central moments (see Sect. 8.3.2 Vol. I) from
the probability density function p(L) and characterize the texture with the following
measures:
$$\mu_2 = \sigma^2 = \sum_{L=0}^{L_{max}} (L - \mu)^2 \, p(L) \qquad (3.3)$$

where μ2 is the central moment of order 2, that is the variance, traditionally indicated
with σ 2 . The average describes the position and the variance describes the dispersion
of the distribution of the levels. The variance in this case provides a measure of the
contrast of the image and can be used to express a relative smoothness measure S, given
by
$$S = 1 - \frac{1}{1 + \sigma^2} \qquad (3.4)$$

which becomes 0 for regions with constant gray levels, while it tends to 1 in rough
areas (where the variance is very large).

3.4.3 Higher Order Statistics

The central moments of higher order (normally not greater than 6-th), associated
with the probability density function p(L), can characterize some properties of the
texture. The central moment n-th in this case is given by

$$\mu_n = \sum_{L=0}^{L_{max}} (L - \mu)^n \, p(L) \qquad (3.5)$$

For natural textures, the measurements of asymmetry (or Skewness) S and Kurtosis
K are useful, based, respectively, on the moments μ3 and μ4 , given by
$$S = \frac{\mu_3}{\sigma^3} = \frac{\sum_{L=0}^{L_{max}} (L - \mu)^3 \, p(L)}{\sigma^3} \qquad (3.6)$$

$$K = \frac{\mu_4}{\sigma^4} = \frac{\sum_{L=0}^{L_{max}} (L - \mu)^4 \, p(L)}{\sigma^4} \qquad (3.7)$$

The measure S is zero if the histogram H (L) has a symmetrical form with respect to
the average μ. Instead it assumes positive or negative values in relation to its defor-
mation (with respect to that of symmetry). This deformation occurs, respectively, on
the right (that is, shifting of the shape toward values higher than the average) or on
the left (deformation toward values lower than the average). If the histogram has a
normal distribution S is zero, and consequently any symmetric histogram will have
a value of S which tends to zero.
The K measure indicates whether the shape of the histogram is flat or has a peak
with respect to a normal distribution. In other words, if K has high values, it means
that the histogram has a peak near the average value and decays quickly at both
ends. A histogram with small K tends to have a flat top near the mean value rather
than a net peak. A uniform histogram represents the extreme case. If the histogram
has a normal distribution K takes value three. Often the measurement of kurtosis is
indicated with K e = K − 3 to have a value of zero in the case of a histogram with
normal distribution.
Other useful texture measures, known as Energy E and Entropy H, are given by the
following:

$$E = \sum_{L=0}^{L_{max}} [p(L)]^2 \qquad (3.8)$$

$$H = -\sum_{L=0}^{L_{max}} p(L) \log_2 p(L) \qquad (3.9)$$

The energy measures the homogeneity of the texture and corresponds to the second-order
moment of the probability density of the gray levels. Entropy is a

quantity that in this case measures the level of the disorder of a texture. The maximum
value is when the gray levels are all equiprobable (that is, p(L) = 1/(L max + 1)).
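The following is a minimal NumPy sketch that computes the first-order measures of
Eqs. (3.1)–(3.9) from the gray-level histogram of an image; kurtosis is normalized here
by σ⁴ (the standard definition), and the random test image is only illustrative.

```python
import numpy as np

def first_order_features(image, n_levels=256):
    """First-order texture measures computed from the normalized gray-level histogram."""
    hist, _ = np.histogram(image, bins=n_levels, range=(0, n_levels))
    p = hist / hist.sum()                       # p(L), Eq. (3.1)
    levels = np.arange(n_levels)
    mean = np.sum(levels * p)                   # Eq. (3.2)
    var = np.sum((levels - mean) ** 2 * p)      # Eq. (3.3)
    smoothness = 1.0 - 1.0 / (1.0 + var)        # Eq. (3.4)
    std = np.sqrt(var)
    skewness = np.sum((levels - mean) ** 3 * p) / (std ** 3 + 1e-12)   # Eq. (3.6)
    kurtosis = np.sum((levels - mean) ** 4 * p) / (std ** 4 + 1e-12)   # Eq. (3.7)
    energy = np.sum(p ** 2)                                            # Eq. (3.8)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))                    # Eq. (3.9)
    return dict(mean=mean, variance=var, smoothness=smoothness,
                skewness=skewness, kurtosis=kurtosis,
                energy=energy, entropy=entropy)

# Illustrative use on a random 8-bit test image.
img = np.random.randint(0, 256, size=(64, 64))
print(first_order_features(img))
```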
Characteristic measures of local texture, called module and state, based on the local
histogram [14], can be derived considering a window of N pixels centered at the pixel
(x, y). The module I_MH is defined as follows:
$$I_{MH} = \sum_{L=0}^{L_{max}} \frac{H(L) - N/N_{Liv}}{\sqrt{H(L)\,[1 - p(L)] + N/N_{Liv}\,(1 - 1/N_{Liv})}} \qquad (3.10)$$

where N Liv = L max + 1 indicates the number of levels in the image. The state of
the histogram is the gray level that corresponds to the highest frequency in the local
histogram.

3.4.4 Second-Order Statistics with Co-Occurrence Matrix

The characterization of the texture based on second-order statistics, derived in
Sect. 3.4.2, has the advantage of computational simplicity but does not include spatial
pixel information. More robust texture measurements can be derived according to
Julesz's studies (Sect. 3.2) on the human visual perception of texture. From these studies,
it emerges that two textures having identical second-order statistics are not easily
distinguishable at a first observation level.
Although this conjecture has been both refuted and reaffirmed, the importance of
second-order statistics is well established. Therefore, the main statistical method used for
texture analysis is based on the joint probability of pixel pair distributions, defined as the
likelihood of observing a pair of gray levels measured at the ends of a segment randomly
positioned in the image with a random orientation. This definition leads to the
co-occurrence matrix used to extract the texture features proposed by Haralick [1],
known as the GLCM (Gray-Level Co-occurrence Matrix).
The spatial relation of the gray levels is expressed by the co-occurrence matrix
PR(L1, L2), which represents the two-dimensional histogram of the gray levels of the
image, considered as an estimate of the joint probability that a pair of pixels has gray
levels L1 and L2 and satisfies a spatial relation R, for example, that the two pixels are at
a distance d = (dx, dy) from each other, expressed in pixels. Each element of this
two-dimensional histogram indicates the joint frequency (co-occurrence) in the image of
pairs of gray levels L1 = I(x, y) and L2 = I(x + dx, y + dy), in which the two pixels are
at a distance d defined by the displacement dx along the rows (downward, vertical) and
the displacement dy along the columns (to the right, horizontal), expressed in pixels.
Given an image I of size M × M, the co-occurrence matrix P can be formulated as
follows:

$$P_{(d_x,d_y)}(L_1, L_2) = \sum_{x=1}^{M} \sum_{y=1}^{M} \begin{cases} 1 & \text{if } I(x, y) = L_1 \text{ and } I(x + d_x, y + d_y) = L_2 \\ 0 & \text{otherwise} \end{cases} \qquad (3.11)$$

Fig. 3.4 Directions and distances of the pairs of pixels considered for the calculation of the
co-occurrence matrix, with respect to the pixel under processing: (0°,1) → [0,1], (45°,1) → [−1,1],
(90°,1) → [−1,0], (135°,1) → [−1,−1]; the notation is (θ, d) → [dx, dy]

The size of P depends on the number of levels in the image I. A binary image
generates a 2 × 2 matrix, an RGB color image would require a matrix of 224 × 224 ,
but typically grayscale images are used up to a maximum of 256 levels.
The spatial relationship between pixel pairs, defined in terms of increments (dx, dy),
generates co-occurrence matrices that are sensitive to image rotation. Except for rotations
of 180°, any other rotation would generate a different distribution of P. To obtain rotation
invariance for texture analysis, co-occurrence matrices are calculated considering the
directions 0°, 45°, 90°, 135°.
For this purpose it is useful to define the co-occurrence matrices of the type
PR (L 1 , L 2 ) where the spatial relation R = (θ, d) indicates the co-occurrence of
pixel pairs at a distance d and in the direction θ (see Fig. 3.4). These matrices can be
calculated to characterize texture microstructures considering the unit distance be-
tween pairs, that is, d = 1. The four directional matrices used to obtain rotation
invariance are: P(dx,dy) = P(0,1) ⟺ P(θ,d) = P(0°,1) (horizontal direction);
P(dx,dy) = P(−1,1) ⟺ P(θ,d) = P(45°,1) (right diagonal direction);
P(dx,dy) = P(−1,0) ⟺ P(θ,d) = P(90°,1) (upward vertical direction); and
P(dx,dy) = P(−1,−1) ⟺ P(θ,d) = P(135°,1) (left diagonal direction).
Figure 3.5 shows the GLCM matrices calculated for the 4 angles indicated above, for a
4 × 4 test image whose maximum level is Lmax = 4. Each matrix has size N × N, where
N = Lmax + 1 = 5. Consider, for example, the element P(0°,1)(2, 1), which has value 3.
This means that in the test image there are 3 horizontally adjacent pairs of pixels with
gray values, respectively, L1 = 2 for the pixel under consideration and L2 = 1 for the
adjacent co-occurring pixel. Similarly, there are adjacent pairs of pixels with values (1, 2)
with frequency 3 when examining the opposite direction. It follows that this matrix is
symmetric, like the other three calculated.
In general, the co-occurrence matrix is not always symmetric, that is, not always
P(L1, L2) = P(L2, L1). The symmetric co-occurrence matrix Sd(L1, L2) is defined as the
sum of the matrix Pd(L1, L2) associated with the distance vector d and that associated
with the vector −d:
Sd (L 1 , L 2 ) = Pd (L 1 , L 2 ) + P−d (L 1 , L 2 ) (3.12)


Fig. 3.5 Calculation of 4 co-occurrence matrices, relative to the 4 × 4 test image with 5 gray levels,
for the following directions: a 0°, b +45°, c +90°, and d +135°. The distance is d = 1 pixel. It is also
shown that the element P(0°,1)(2, 1) = 3, i.e., there exist in the image three pairs of pixels with
L1 = 2, L2 = 1 arranged spatially according to the relation (0°, 1)

By obtaining symmetric matrices one has the advantage of operating on a smaller
number of elements, equal to N(N + 1)/2 instead of N², where N is the number
of levels of the image. The texture structure can be observed directly in the co-
occurrence matrix by analyzing the frequencies of the pixel pairs calculated for a
determined spatial relationship of the pixel pairs.
For example, if the distribution of gray levels in the image is random, we will
have a co-occurrence matrix with scattered frequency values. If the pairs of pixels
in the image are instead very correlated we will have a concentration of elements
of Pd (L 1 , L 2 ) around the main diagonal with high values of the frequency. The
fine structures of the texture are identified by analyzing pairs of pixels of the image
with distance vector d small, while coarse structures are characterized with large
values of d. The quantitative texture information depends on the number of image
levels. Normally the number of gray levels of an image is N = 256 but to reduce the
computational load it is often quantized into a smaller number of levels according to
an acceptable loss of texture information.
Figure 3.6 shows some co-occurrence matrices derived from images with different types
of textures. For images with high spatial pixel correlation we observe a very compact
distribution of frequencies concentrated along the main diagonal, whereas images with
very complex textures and poor spatial correlation between pixels show frequencies
densely distributed over the entire matrix (the image of the circuit in the example).
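As an illustration of Eqs. (3.11)–(3.13), the following is a minimal NumPy sketch that
builds the co-occurrence matrix for a given displacement, optionally symmetrizes it as in
Eq. (3.12) and normalizes it as in Eq. (3.13); the toy image and the explicit double loop
(chosen for clarity rather than speed) are illustrative assumptions.

```python
import numpy as np

def glcm(image, dx, dy, n_levels, symmetric=True, normalize=True):
    """Gray-level co-occurrence matrix for the displacement (dx, dy).

    dx is the shift along the rows and dy along the columns, as in Eq. (3.11);
    with symmetric=True the matrix S_d = P_d + P_{-d} of Eq. (3.12) is returned.
    """
    P = np.zeros((n_levels, n_levels), dtype=np.float64)
    rows, cols = image.shape
    for x in range(rows):
        for y in range(cols):
            xs, ys = x + dx, y + dy
            if 0 <= xs < rows and 0 <= ys < cols:
                P[image[x, y], image[xs, ys]] += 1
    if symmetric:
        P = P + P.T                 # equivalent to adding the matrix computed for -d
    if normalize:
        P = P / P.sum()             # joint probabilities p_(theta,d), Eq. (3.13)
    return P

# The four unit-distance displacements of Fig. 3.4: 0°, 45°, 90°, 135°.
offsets = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}
img = np.random.randint(0, 5, size=(4, 4))          # toy image with 5 gray levels
matrices = {angle: glcm(img, dx, dy, n_levels=5) for angle, (dx, dy) in offsets.items()}
```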

3.4.5 Texture Parameters Based on the Co-Occurrence Matrix

The co-occurrence matrix captures some properties of a texture, but a texture is not
normally characterized by using the elements of this matrix directly. From the
co-occurrence matrix, some significant parameters Ti are derived to describe a texture
more compactly. Before describing these parameters, it is convenient to normalize the
co-occurrence matrix by dividing each of its elements Pθ,d(L1, L2) by the total sum of
the frequencies of all pairs of pixels spatially related by R(θ, d). In this way, the

Fig. 3.6 Examples of co-occurrence matrices calculated on 3 different images with different
textures: a complex texture of an electronic board, where homogeneous regions are few and of
small size; b a less complex texture with larger macrostructures and greater correlation between
the pixels; and c a more homogeneous texture with the distribution of frequencies concentrated
along the main diagonal (strongly correlated pixels)

co-occurrence matrix Pθ,d becomes the estimate pθ,d(L1, L2) of the joint probability of
the occurrence in the image of a pixel pair with levels L1 and L2 at a distance of d pixels
from each other in the direction indicated by θ. The normalized co-occurrence matrix is
defined as follows:
$$p_{\theta,d}(L_1, L_2) = \frac{P_{\theta,d}(L_1, L_2)}{\sum_{L_1=0}^{L_{max}} \sum_{L_2=0}^{L_{max}} P_{\theta,d}(L_1, L_2)} \qquad (3.13)$$

where the joint probabilities assume values between 0 and 1. From the normalized
co-occurrence matrix, given by (3.13), the following characteristic parameters Ti of the
texture are derived.

3.4.5.1 Energy or Measure of the Degree of Homogeneity of the Texture
The energy is a parameter that corresponds to the angular second moment:

$$Energy = T_1 = \sum_{L_1=0}^{L_{max}} \sum_{L_2=0}^{L_{max}} p_{\theta,d}^2(L_1, L_2) \qquad (3.14)$$

Higher energy values correspond to very homogeneous textures, i.e., the differences
in gray values are almost zero in most pixel pairs, for example, with a distance
of 1 pixel. For low energy values, there are differences that are equally spatially
distributed.

3.4.5.2 Entropy
The entropy is a parameter that measures the randomness of the distribution of the gray
levels in the image:

$$Entropy = T_2 = -\sum_{L_1=0}^{L_{max}} \sum_{L_2=0}^{L_{max}} p_{\theta,d}(L_1, L_2) \cdot \log_2[p_{\theta,d}(L_1, L_2)] \qquad (3.15)$$

It is observed that entropy is high when each element of the co-occurrence matrix
has an equal value, that is, when the p(L 1 , L 2 ) probabilities are equidistributed.
Entropy has low values if the co-occurrence matrix is diagonal, i.e., there are spatially
dominant gray level pairs for a certain direction and distance.

3.4.5.3 Maximum Probability


$$T_3 = \max_{(L_1, L_2)} p_{\theta,d}(L_1, L_2) \qquad (3.16)$$

3.4.5.4 Contrast
The contrast is a parameter that measures the local variation of the gray levels of
the image. Corresponds to the moment of inertia.
$$T_4 = \sum_{L_1=0}^{L_{max}} \sum_{L_2=0}^{L_{max}} (L_1 - L_2)^2 \, p_{\theta,d}(L_1, L_2) \qquad (3.17)$$

A low value of the contrast is obtained if the image has almost constant gray levels;
conversely, high values are obtained for images with strong local variations of intensity,
that is, with a very pronounced texture.

3.4.5.5 Moment of the Inverse Difference


$$T_5 = \sum_{L_1=0}^{L_{max}} \sum_{L_2=0}^{L_{max}} \frac{p_{\theta,d}(L_1, L_2)}{1 + (L_1 - L_2)^2} \quad \text{with } L_1 \neq L_2 \qquad (3.18)$$

3.4.5.6 Absolute Value


$$T_6 = \sum_{L_1=0}^{L_{max}} \sum_{L_2=0}^{L_{max}} |L_1 - L_2| \, p_{\theta,d}(L_1, L_2) \qquad (3.19)$$

3.4.5.7 Correlation
$$T_7 = \frac{\sum_{L_1=0}^{L_{max}} \sum_{L_2=0}^{L_{max}} (L_1 - \mu_x)(L_2 - \mu_y) \, p_{\theta,d}(L_1, L_2)}{\sigma_x \sigma_y} \qquad (3.20)$$

where the means μx and μy and the standard deviations σx and σy are related to the
marginal probabilities px(L1) and py(L2). The latter correspond, respectively, to the rows
and columns of the co-occurrence matrix pθ,d(L1, L2), and are defined as follows:

$$p_x(L_1) = \sum_{L_2=0}^{L_{max}} p_{\theta,d}(L_1, L_2) \qquad\qquad p_y(L_2) = \sum_{L_1=0}^{L_{max}} p_{\theta,d}(L_1, L_2)$$

Therefore, the averages are calculated as follows:

$$\mu_x = \sum_{L_1=0}^{L_{max}} L_1 \, p_x(L_1) \qquad\qquad \mu_y = \sum_{L_2=0}^{L_{max}} L_2 \, p_y(L_2) \qquad (3.21)$$

and the standard deviations σx, σy are calculated as follows:

$$\sigma_x = \sqrt{\sum_{L_1=0}^{L_{max}} p_x(L_1)(L_1 - \mu_x)^2} \qquad\qquad \sigma_y = \sqrt{\sum_{L_2=0}^{L_{max}} p_y(L_2)(L_2 - \mu_y)^2} \qquad (3.22)$$

The texture characteristics T = {T1, T2, . . . , T7}, calculated on the basis of the
gray-level co-occurrence matrix, can be effective in describing complex microstructures
of limited, sparse dimensions and variously oriented in the image. Figure 3.7 shows
examples of different textures characterized by contrast, energy, homogeneity, and
entropy measurements. These measurements are obtained by calculating the
co-occurrence matrices with a distance of 1 pixel and direction 0°.


Fig. 3.7 Results of texture measurements obtained by calculating co-occurrence matrices on 7 × 7
windows on 3 test images. For each image the following measurements have been calculated:
a Contrast, b Energy, c Homogeneity, and d Entropy. For all measurements the distance is 1 pixel
and the direction 0°

The resulting images related to the 4 measurements are obtained by calculating the
co-occurrence matrices on 7 × 7 windows centered on each pixel of the input image. For
reasons of space, the results obtained by varying the size of the window, which depends
on the resolution level at which the texture is to be analyzed and on the application
context, have not been reported. The set of these measures Ti can be used to model
particular textures to be considered as prototypes, to be compared with unknown ones
extracted from an image, evaluating the level of similarity with a metric (for example,
the Euclidean distance). The texture measurements Ti can be weighted differently
according to the level of correlation observed. Additional texture measurements can be
derived from co-occurrence matrices [15,16].
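The following is a minimal NumPy sketch that computes the measures T1–T7 of this
section from a normalized co-occurrence matrix (for example, one produced by the
GLCM sketch shown earlier); the small constant added to the denominators to avoid
division by zero is an implementation detail, not part of the formulas.

```python
import numpy as np

def glcm_features(p):
    """Texture measures T1..T7 of Sect. 3.4.5 from a normalized co-occurrence matrix p."""
    n = p.shape[0]
    L1, L2 = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    energy = np.sum(p ** 2)                                      # T1, Eq. (3.14)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))              # T2, Eq. (3.15)
    max_prob = p.max()                                           # T3, Eq. (3.16)
    contrast = np.sum((L1 - L2) ** 2 * p)                        # T4, Eq. (3.17)
    off_diag = L1 != L2
    inv_diff = np.sum(p[off_diag] / (1 + (L1 - L2)[off_diag] ** 2))   # T5, Eq. (3.18)
    abs_value = np.sum(np.abs(L1 - L2) * p)                      # T6, Eq. (3.19)
    px, py = p.sum(axis=1), p.sum(axis=0)                        # marginal distributions
    mux, muy = np.sum(np.arange(n) * px), np.sum(np.arange(n) * py)   # Eq. (3.21)
    sx = np.sqrt(np.sum(px * (np.arange(n) - mux) ** 2))         # Eq. (3.22)
    sy = np.sqrt(np.sum(py * (np.arange(n) - muy) ** 2))
    correlation = np.sum((L1 - mux) * (L2 - muy) * p) / (sx * sy + 1e-12)  # T7, Eq. (3.20)
    return dict(T1=energy, T2=entropy, T3=max_prob, T4=contrast,
                T5=inv_diff, T6=abs_value, T7=correlation)
```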
An important aspect in the extraction of the texture characteristics from the
co-occurrence matrix concerns the appropriate choice of the distance vector d. A possible
solution suggested in [17] is the chi-square (χ²) statistical test, which selects the values of
d that best highlight the texture structures by maximizing the value
L 
max L max
pd2 (L 1 , L 2 )
X2 (d) = −1 (3.23)
px (L 1 ) p y (L 2 )
L 1 =0 L 2 =0
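A minimal sketch of this selection criterion, reusing the hypothetical cooccurrence_matrix function from the previous example; the candidate range of distances is an illustrative choice.

```python
import numpy as np

def chi_square_d(P):
    """Chi-square statistic of Eq. (3.23) for a normalized co-occurrence matrix P."""
    px = P.sum(axis=1)                  # marginal p_x(L1)
    py = P.sum(axis=0)                  # marginal p_y(L2)
    denom = np.outer(px, py)
    mask = denom > 0
    return np.sum(P[mask] ** 2 / denom[mask]) - 1.0

# Hypothetical usage: pick the distance d (direction 0 degrees) that maximizes chi^2.
# best_d = max(range(1, 11),
#              key=lambda d: chi_square_d(cooccurrence_matrix(img, dx=d, dy=0)))
```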

From the implementation point of view, the extraction of the texture characteristics
based on the co-occurrence matrices requires considerable memory and computation.
However, in the literature there are solutions that quantize the image to a few gray
levels, thus reducing the dimensionality of the co-occurrence matrix, taking care to
limit the resulting degradation of the structures. In addition, solutions with fast ad hoc
algorithms have been proposed [18]. The complexity, in terms of memory and
computation, of the co-occurrence matrix increases further with the management of
color images.

3.5 Texture Features Based on Autocorrelation

A feature of the texture is evaluated by spatial frequency analysis. In this way, the
repetitive spatial structures of the texture are identified. Primitives of textures charac-
terized by fine structures present high values of spatial frequencies, while primitive
with larger structures result in low spatial frequencies. The autocorrelation function
of an image can be used to evaluate spatial frequencies, that is, to measure the level of
homogeneity or roughness (fineness/coarseness) of the texture present in the image.
With the autocorrelation function (see Sect. 6.10.2 Vol. I) we measure the level
of spatial correlation between neighboring pixels seen as texture primitives (gray-
level values). The spatial arrangement of the texture is described by the correlation
coefficient that measures the linear spatial relationship between pixels (primitive).
For an image f (x, y) with a size of N × N with L gray levels, the autocorrelation
function ρ f (dx , d y ) is given by:

Fig. 3.8 Autocorrelation function along the x axis for 4 different textures

\rho_f(d_x, d_y) = \frac{\sum_{r=0}^{N-1} \sum_{c=0}^{N-1} f(r, c) \cdot f(r + d_x, c + d_y)}{N^2 \cdot \sum_{r=0}^{N-1} \sum_{c=0}^{N-1} f^2(r, c)}    (3.24)

where dx , d y = −N + 1, −N + 2, . . . , 0, . . . , N − 1 are the distances, respectively,


in the direction x and y with respect to the pixel f (x, y). The size of the auto-
correlation function will be the size of (2N − 1) × (2N − 1). In essence, with the
autocorrelation function the inner product is calculated between the original image
f (x, y) and the image translated to the position (x + dx , y + d y ) and for different
displacements. The goal is the detection of repetitive pixel structures. If the primi-
tives of the texture are rather large, the value of the autocorrelation function slowly
decreases with increasing distance d = (dx , d y ) (presence of coarse texture with low
spatial frequencies) if it decreases quickly we are in the presence of fine textures with
high spatial frequencies (see Fig. 3.8). If instead the texture has primitives arranged
periodically, the function reproduces this periodicity by increasing and decreasing
(peaks and valley) according to the distance of the texture.
The autocorrelation function can be seen as a particular case of convolution,3 that is,
as the convolution of an input function with itself, but in this case without spatially
reversing the second function around the origin as in convolution. It follows that,
to obtain the autocorrelation, it is sufficient to slide f(r − d_x, c − d_y) with respect
to f(d_x, d_y) and sum their products. Furthermore, if the function f is real, the
convolution and the autocorrelation differ only in the change of sign of the argument.

3 Recall that the 2D convolution operator g(x, y) between two functions f (x, y) and h(x, y) is defined
as: g(x, y) = f(x, y) \star h(x, y) = \sum_r \sum_c f(r, c)\, h(x - r, y - c).

In this context the autocorrelation ρ_f would be given by

\rho_f(d_x, d_y) = f(d_x, d_y) \star f(-d_x, -d_y) = \sum_{r=0}^{N-1} \sum_{c=0}^{N-1} f(r, c)\, f(r + d_x, c + d_y)    (3.25)
An immediate way to calculate the autocorrelation function is obtained by virtue of
the convolution theorem (see Sect. 9.11.3 Vol. I) which states: the Fourier transform
of a convolution is the product of the Fourier transforms of each function.
In this case, applied to the (3.25) we will have

\mathcal{F}\{\rho_f(d_x, d_y)\} = \mathcal{F}\{f(d_x, d_y) \star f(-d_x, -d_y)\} = F(u, v) \cdot F(-u, -v) = F(u, v) \cdot F^*(u, v) = |F(u, v)|^2    (3.26)

where the symbol F {•} indicates the Fourier transform operation and F indicates
the Fourier transform of the image f. It is observed that taking the complex conjugate
of a real function leaves the function unchanged, hence F(−u, −v) = F*(u, v).
In essence, the Fourier transform of the autocorrelation function F {ρ f (dx , d y )},
defined by the (3.26), represents the Power Spectrum

P f (u, v) = |F(u, v)|2

of the function f (dx , d y ) where | • | is the module of a complex number. Therefore,


considering the inverse of the Fourier transform of the convolution, i.e., the equation
(3.26), we obtain the original expression of the autocorrelation function (3.25):

ρ f (dx , d y ) = F −1 {|F(u, v)|2 } (3.27)

We can then say that the autocorrelation of a function is the inverse Fourier transform
of its power spectrum. Moreover, if f is real, the autocorrelation is also real
and symmetric, ρ_f(−d_x, −d_y) = ρ_f(d_x, d_y). Figure 3.8 shows the autocorrelation
function calculated with the (3.27) for four different types of textures.
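The FFT route of Eq. (3.27) is straightforward to sketch in Python (an illustrative implementation, not the authors' code); the mean removal and the final normalization are optional additions made here only for readability of the result.

```python
import numpy as np

def autocorrelation_fft(f):
    """Autocorrelation via the power spectrum, as in Eq. (3.27)."""
    F = np.fft.fft2(f - f.mean())          # optional: remove the mean to highlight texture structure
    power = np.abs(F) ** 2                 # power spectrum |F(u,v)|^2
    rho = np.real(np.fft.ifft2(power))     # inverse transform of the power spectrum
    rho = np.fft.fftshift(rho)             # put zero displacement at the center
    return rho / (rho.max() + 1e-12)       # optional normalization so that rho(0,0) = 1

# Example: the decay of rho along the x axis distinguishes fine from coarse textures.
# f = np.random.rand(256, 256); profile = autocorrelation_fft(f)[128, 128:]
```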

3.6 Texture Spectral Method

An alternative method for measuring the spatial frequencies of the texture is based on
the Fourier transform. From the Fourier theory, it is known that many real surfaces
can be represented in terms of sinusoidal base functions. In the Fourier spectral
domain, it is possible to characterize the texture present in the image in terms of
energy distributed along the basis vectors. The analysis of the texture in the spectral
domain is effective when the texture is composed of repetitive structures, arbitrarily oriented.
This approach has already been used to improve image quality (see Chap. 9 Vol. I) and
noise removal (Chap. 4 Vol. II) using filtering techniques in the frequency domain.
In this context, the characterization of the texture in the spectral domain takes place

Fig. 3.9 Images with 4 different types of textures and relative power spectrum

by analyzing the peaks that give the orientation information of the texture and the
location of the peaks that provide the spatial periodicity information of the texture
structures. Statistical texture measurements (described above) can be derived after
filtering the periodic components.
The first methods that made use of these spectral features divide the frequency
domain into concentric rings (based on the frequency content) and into segments
(based on oriented structures). The spectral domain is, therefore, divided into regions
and the total energy of each region is taken as the feature characterizing the texture.
Let us denote with F(u, v) the Fourier transform of the image f (i, j) whose texture
is to be measured, and with |F(u, v)|² the power spectrum (the symbol | • | represents
the module of a complex number), which we know to coincide with the Fourier
transform of the autocorrelation function ρ f .
Figure 3.9 shows 4 images with different types of textures and their power spec-
trum. It can be observed how vertical and horizontal linear texture structures, and
curved ones, are arranged in the spectral domain along horizontal, vertical, and
circular patterns, respectively. The more texture information is present in the image,
the more extended is the energy distribution. This shows that it is possible to derive the texture
characteristics in relation to the energy distribution in the power spectrum.
In particular, the spectral characteristics of the texture are obtained by dividing the
Fourier domain into concentric circular regions of radius r, whose energy characterizes
the level of fineness/coarseness of the texture: high energy for large values of r (i.e.,
high frequencies) implies the presence of fine structures, while high energy for small
values of r (i.e., low frequencies) implies the presence of coarse structures.
The energy evaluated in sectors of the spectral domain identified by the angle θ
reflects the directionality characteristics of the texture. In fact, for the second and
third image of Fig. 3.9, we have a localized energy distribution in the sectors in the
range 40◦ –60◦ and in the range 130◦ –150◦ corresponding to the texture of the spaces
between inclined bricks and inclined curved strips, respectively, present in the third
image. The rest of the energy is distributed across all sectors and corresponds to the
variability of the gray levels of the bricks and streaks.
280 3 Texture Analysis

The functions that can, therefore, be extracted by ring (centered at the origin) are

t_{r_1 r_2} = \sum_{\theta=0}^{\pi} \sum_{r=r_1}^{r_2} |F(r, \theta)|^2    (3.28)

and for orientation

t_{\theta_1 \theta_2} = \sum_{\theta=\theta_1}^{\theta_2} \sum_{r=0}^{R} |F(r, \theta)|^2    (3.29)

where r and θ represent the polar coordinates of the power spectrum

r = \sqrt{u^2 + v^2} \qquad \theta = \arctan(v/u)    (3.30)

The power spectrum |F(r, θ )|2 is expressed in polar coordinates (r, θ ) and considering
its symmetrical nature with respect to the origin (u, v) = (0, 0) only the upper half
of the frequencies u axis is analyzed. It follows, that the polar coordinates r and θ
vary, respectively, for r = 0, R where R is the maximum radius of the outer ring,
and θ varies from 0◦ to 180◦ . From functions tri ,r j and tθl ,θk can be defined n a × n s
texture measures Tm,n n a ; n = 1, n s sampling the entire spectrum in n a rings and n s
radial sectors as shown in Fig. 3.10.
As an alternative to Fourier, other transforms can be used to characterize the
texture. The choice must be made in relation to the better invariance of the texture
characteristics with respect to the noise. The most appropriate choice is to consider
combined spatial and spectral characteristics to describe the texture.

Fig. 3.10 Textural features from the power spectrum. On the left the image of the spectrum is
partitioned into circular rings, each representing a frequency band (from zero to the maximum
frequency) while on the right is shown the subdivision of the spectrum in circular sectors to obtain
the direction information of the texture in terms of directional energy distribution
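The ring and wedge features of Eqs. (3.28)-(3.29) can be sketched as follows (an illustrative Python implementation; the number of rings and wedges and the uniform radial partition are assumptions, not prescriptions from the text).

```python
import numpy as np

def ring_wedge_features(f, n_rings=4, n_wedges=8):
    """Energy in concentric rings and angular wedges of the power spectrum (Eqs. 3.28-3.29)."""
    F = np.fft.fftshift(np.fft.fft2(f))
    P = np.abs(F) ** 2
    rows, cols = P.shape
    v, u = np.mgrid[-rows // 2:rows - rows // 2, -cols // 2:cols - cols // 2]
    r = np.sqrt(u ** 2 + v ** 2)                   # Eq. (3.30), radial coordinate
    theta = np.mod(np.arctan2(v, u), np.pi)        # upper half-plane only (spectrum symmetry)
    r_max = r.max() + 1e-9
    rings = [P[(r >= r_max * i / n_rings) & (r < r_max * (i + 1) / n_rings)].sum()
             for i in range(n_rings)]
    wedges = [P[(theta >= np.pi * j / n_wedges) & (theta < np.pi * (j + 1) / n_wedges)].sum()
              for j in range(n_wedges)]
    return np.array(rings), np.array(wedges)
```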

3.7 Texture Based on the Edge Metric

Texture measurements can be calculated by analyzing pixel contour elements, lines,


and edges that constitute the components of local and global texture structures of an
image. The edge metric can be calculated from the estimated gradient g(i, j) for each
pixel of the image f (i, j) by selecting an appropriate edging operator with appro-
priate kernel size W or distance d between adjacent pixels. Texture measurements
can be characterized by the module gm (i, j) and the direction gθ of the gradient, in
relation to the type of texture to be analyzed.
A texture feature can be expressed in relation to the edge density present in a
window (for example, 3 × 3). For this purpose it is necessary to apply to the input
image f (i, j) one of the already known algorithms for edge extraction (for example,
the Laplace operator, or other operators, described in Chap. 1 Vol. II), to produce a
map of edges B(i, j), with B(i, j) = 1 if there is a border, and B(i, j) = 0 in the
opposite case. Normally, the map B is binarized with very low threshold values to
define the edge pixels. An edge density measurement is given as follows:

T_D(i, j) = \frac{1}{W^2} \sum_{l=-W}^{W} \sum_{k=-W}^{W} B(i + l, j + k)    (3.31)

where W is the size of the square window of interest.


Another feature of the texture is the edge contrast calculated as the local average
of the module of the edges of the image

T_C(i, j) = \operatorname*{mean}_{(i,j) \in W} \{B(i, j)\}    (3.32)

where W indicates the image window on which the average value of the module
is calculated. A high contrast of the texture occurs at the maximum values of the
module. The contrast expressed by the (3.32) can be normalized by dividing it with
the maximum value of the pixel in the window.
The boundary density obtained with the (3.31) has the problem of finding an
adequate threshold to extract the edges. This is not always easy considering that the
threshold is applied to the entire image and is often chosen by trial and error. Instead
of extracting the edges by first calculating the module with an edging operator, an
alternative is given by calculating the gradient gd (i, j) as an approximation of the
distance function between adjacent pixels for a defined window.
The procedure involves two steps:

1. Calculation of the texture description function gd (i, j) depending on the distance


d for all pixels in the image texture. This is done by calculating directly, from the
input image f (i, j) to the variation of the distance d, the approximated gradient

g_d(i, j) = |f(i, j) − f(i + d, j)| + |f(i, j) − f(i − d, j)| + |f(i, j) − f(i, j + d)| + |f(i, j) − f(i, j − d)|    (3.33)

2. The texture measure T (d), based on the density of the edges, is given as the
mean value of the gradient gd (i, j) for a given distance d (for example, d = 1)

T(d) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} g_d(i, j)    (3.34)

where N × N is the pixel size of the image.

The micro and macro texture structures are evaluated by the edge density expressed
by the gradient T (d) related to the distance d. This implies that the dimensionality of
the feature vector depends on the number of distances d considered. It is understood
that the microstructures of the image are detected for small values of d, while the
macrostructures are determined for large values (normally d takes values from 1 to 10,
yielding from 1 to 10 edge density features) [19]. It can be verified that the function T(d)
behaves like a negated autocorrelation function, with inverted peaks: its minimum
corresponds to the maximum of the autocorrelation function, while its maximum
corresponds to the minimum of the autocorrelation function.
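A short sketch of Eqs. (3.33)-(3.34) in Python follows (illustrative only; border pixels are handled loosely here, accumulating only the differences that stay inside the image).

```python
import numpy as np

def gradient_distance(f, d=1):
    """Approximate gradient g_d of Eq. (3.33) via absolute differences at distance d."""
    f = f.astype(np.float64)
    g = np.zeros_like(f)
    g[:, :-d] += np.abs(f[:, :-d] - f[:, d:])   # |f(i,j) - f(i, j+d)|
    g[:, d:]  += np.abs(f[:, d:]  - f[:, :-d])  # |f(i,j) - f(i, j-d)|
    g[:-d, :] += np.abs(f[:-d, :] - f[d:, :])   # |f(i,j) - f(i+d, j)|
    g[d:, :]  += np.abs(f[d:, :]  - f[:-d, :])  # |f(i,j) - f(i-d, j)|
    return g

def edge_density_features(f, distances=range(1, 11)):
    """Texture vector T(d) of Eq. (3.34), one value per distance d."""
    n = f.size
    return np.array([gradient_distance(f, d).sum() / n for d in distances])
```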
A measure based on edge randomness is expressed as a measure of the Shannon
entropy of the gradient module

T_{E_r} = \sum_{i=1}^{N} \sum_{j=1}^{N} g_m(i, j) \log_2 g_m(i, j)    (3.35)

A measure based on the edge directionality is expressed as a measure of the Shannon
entropy of the direction of the gradient

T_{E_\theta} = \sum_{i=1}^{N} \sum_{j=1}^{N} g_\theta(i, j) \log_2 g_\theta(i, j)    (3.36)

Other measures on the periodicity and linearity of the edges are calculated using
the direction of the gradient, respectively, through the co-occurrence of pairs of
edges with identical orientation and the co-occurrence of pairs of collinear edges
(for example, edge 1 with direction ←− and edge 2 with direction ←−, or with the
opposite direction ←− −→).

3.8 Texture Based on the Run Length Primitives

This method characterizes texture information by detecting sequences of pixels in


the same direction with the same gray level (primitives). The length of these primitives
(run length) characterizes fine and coarse texture structures. The texture measurements
are expressed in terms of the gray level, length, and direction of the primitives


Fig. 3.11 Example of calculating GLRLM matrices for horizontal direction and at 45◦ for a test
image with gray levels between 0 and 3

which in fact represent pixels belonging to oriented segments of a certain length and
the same gray level.
In particular, this information is described by GLRLM matrices (Gray Level Run
Length Matrix) reporting how many times sequences of consecutive pixels appear
with identical gray level in a given direction [8,20]. In essence, any matrix defined for
a given direction θ of the primitives (also called runs) can be seen as a two-dimensional
histogram where each of its elements p_θ(z, r), identified by the gray level z and by
the length r of the primitives, represents the frequency of these primitives present
in the image with at most L gray levels and dimensions M × N. Therefore, a
GLRLM matrix has the dimensions of L × R where L is the number of gray levels
and R is the maximum length of the primitives.
Figure 3.11 shows an example of a GLRLM matrix calculated for an image with a
size of 5 × 5 with only 4 levels of gray. Normally, for an image, 4 GLRLM matrices
are calculated for the directions 0°, 45°, 90°, and 135°. To obtain a rotation-invariant
matrix p(z, r) the 4 GLRLM matrices can be summed. Several texture measures are
then extracted from the statistics of the primitives captured by the p(z, r ) invariant
to rotation.
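A sketch of how such a matrix can be built in Python is given below (illustrative code, not the authors' implementation; the 16-level quantization and the direction encoding as (row step, column step) pairs are assumptions).

```python
import numpy as np

def glrlm(img, direction=(0, 1), levels=16, max_run=None):
    """Gray Level Run Length Matrix p_theta(z, r) for one direction (dr, dc)."""
    q = (img.astype(np.float64) / (img.max() + 1e-12) * (levels - 1)).astype(int)
    rows, cols = q.shape
    max_run = max_run or max(rows, cols)
    P = np.zeros((levels, max_run), dtype=np.float64)
    dr, dc = direction
    for r in range(rows):
        for c in range(cols):
            pr, pc = r - dr, c - dc
            # start a run only if the previous pixel along the direction is not part of it
            if 0 <= pr < rows and 0 <= pc < cols and q[pr, pc] == q[r, c]:
                continue
            z, length, rr, cc = q[r, c], 0, r, c
            while 0 <= rr < rows and 0 <= cc < cols and q[rr, cc] == z:
                length += 1
                rr, cc = rr + dr, cc + dc
            P[z, min(length, max_run) - 1] += 1
    return P

# Rotation-invariant matrix p(z, r): sum of the 4 directional matrices (0, 45, 90, 135 degrees).
# p = sum(glrlm(img, d) for d in [(0, 1), (-1, 1), (-1, 0), (-1, -1)])
```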
The original 5 texture measurements [20] are derived from the following 5
statistics:

1. Short Run Emphasis (SRE)

   T_{SRE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} \frac{p(z, r)}{r^2}    (3.37)

   where N_r indicates the total number of primitives

   N_r = \sum_{z=1}^{L} \sum_{r=1}^{R} p(z, r)

This feature measure emphasizes short run lengths.



2. Long Run Emphasis (LRE)

   T_{LRE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} p(z, r) \cdot r^2    (3.38)

This feature measure emphasizes long run lengths.


3. Gray-Level Nonuniformity (GLN)

   T_{GLN} = \frac{1}{N_r} \sum_{z=1}^{L} \left( \sum_{r=1}^{R} p(z, r) \right)^2    (3.39)

This texture measure evaluates the distribution of runs on gray values. The value
of the feature is low when runs are evenly distributed along gray levels.
4. Run Length Nonuniformity (RLN)

   T_{RLN} = \frac{1}{N_r} \sum_{r=1}^{R} \left( \sum_{z=1}^{L} p(z, r) \right)^2    (3.40)

This texture measure evaluates the distribution of primitives in relation to their


length. The value of TR L N is low when the primitives are equally distributed
along their lengths.
5. Run Percentage (RP)

   T_{RP} = \frac{N_r}{M \cdot N}    (3.41)

This feature measure evaluates the fraction of the number of realized runs and
the maximum number of potential runs.

The above measures mostly emphasize the length of the primitives (i.e., the vector
p_r(r) = \sum_{z=1}^{L} p(z, r), which represents the sum of the distribution of the number
of primitives having length r), without considering the gray-level information
expressed by the vector p_z(z) = \sum_{r=1}^{R} p(z, r), which represents the sum of the
distribution of the number of primitives having gray level z [21]. To consider also
the gray-level information, two new measures have been proposed [22].

6. Low Gray-Level Run Emphasis (LGRE)

   T_{LGRE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} \frac{p(z, r)}{z^2}    (3.42)

   This texture measure, based on the gray level of the runs, is analogous to the SRE:
   instead of emphasizing short primitives, those with low gray levels are emphasized.

7. High Gray-Level Run Emphasis (HGRE)

   T_{HGRE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} p(z, r) \cdot z^2    (3.43)

   This texture measure, based on the gray level of the runs, is analogous to the LRE:
   instead of emphasizing long primitives, those with high gray levels are emphasized.

Subsequently, by combining together the statistics associated with the length of the
primitives and the gray level, 4 further measures have been proposed [23]

8. Short Run Low Gray-Level Emphasis (SRLGE)

   T_{SRLGE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} \frac{p(z, r)}{z^2 \cdot r^2}    (3.44)

This texture measure emphasizes the primitives shown in the upper left part of
the GLRLM matrix, where the primitives with short length and low levels are
accumulated.
9. Short Run High Gray-Level Emphasis (SRHGE)

   T_{SRHGE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} \frac{p(z, r) \cdot z^2}{r^2}    (3.45)

This texture measure emphasizes the primitives shown in the lower left part of
the GLRLM matrix, where the primitives with short length and high gray levels
are accumulated.
10. Long Run Low Gray-Level Emphasis (LRLGE)

    T_{LRLGE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} \frac{p(z, r) \cdot r^2}{z^2}    (3.46)

This texture measure emphasizes the primitives shown in the upper right part
of the GLRLM matrix, where the primitives with long and low gray levels are
found.
11. Long Run High Gray-Level Emphasis (LRHGE)

    T_{LRHGE} = \frac{1}{N_r} \sum_{z=1}^{L} \sum_{r=1}^{R} p(z, r) \cdot r^2 \cdot z^2    (3.47)

    This texture measure emphasizes the primitives shown in the lower right part of
    the GLRLM matrix, where the primitives with long length and high gray levels
    are accumulated.

Table 3.1 Texture measurements derived from the GLRLM matrices according to the 11 statistics
given by the equations from (3.37) to (3.47) calculated for the images of Fig. 3.12
Image Tex_1 Tex_2 Tex_3 Tex_4 Tex_5 Tex_6 Tex_7
SRE 0.0594 0.1145 0.0127 0.0542 0.0174 0.0348 0.03835
LRE 40.638 36.158 115.76 67.276 85.848 52.37 50.749
GLN 5464.1 5680.1 4229.2 4848.5 6096.3 6104.4 5693.6
RLN 1041.3 931.18 1194.2 841.44 950.43 1115.2 1037
RP 5.0964 5.1826 4.5885 4.8341 5.2845 5.342 5.2083
LGRE 0.8161 0.8246 0.7581 0.7905 0.8668 0.8408 0.82358
HGRE 2.261 2.156 2.9984 2.6466 1.7021 1.955 2.084
SRLGE 0.0480 0.0873 0.0102 0.0435 0.0153 0.0293 0.03109
SRHGE 0.1384 0.3616 0.0312 0.1555 0.0284 0.0681 0.08397
LRLGE 34.546 30.895 86.317 52.874 74.405 44.812 42.847
LRHGE 78.168 67.359 363.92 180.54 146.15 95.811 97.406

Fig. 3.12 Images with natural textures (Tex_1 to Tex_7) that include fine and coarse structures


Table 3.1 reports the results of the 11 texture measures described above applied to
the images in Fig. 3.12. The GLRLM matrices were calculated by scaling the images
to 16 gray levels, and the statistics were extracted from the matrix p(z, r) obtained
by summing the 4 directional matrices.
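Given the rotation-invariant matrix p(z, r) (for instance from the glrlm sketch shown earlier), a few of the statistics above can be computed as follows; this is an illustrative sketch, and the Run Percentage is omitted because it also needs the image size M × N.

```python
import numpy as np

def run_length_stats(P):
    """Some of the run-length statistics (3.37)-(3.43) from a GLRLM p(z, r)."""
    z = np.arange(1, P.shape[0] + 1)[:, None]   # gray level index (1..L)
    r = np.arange(1, P.shape[1] + 1)[None, :]   # run length index (1..R)
    Nr = P.sum()
    return {
        "SRE":  np.sum(P / r ** 2) / Nr,            # short run emphasis (3.37)
        "LRE":  np.sum(P * r ** 2) / Nr,            # long run emphasis (3.38)
        "GLN":  np.sum(P.sum(axis=1) ** 2) / Nr,    # gray-level nonuniformity (3.39)
        "RLN":  np.sum(P.sum(axis=0) ** 2) / Nr,    # run-length nonuniformity (3.40)
        "LGRE": np.sum(P / z ** 2) / Nr,            # low gray-level run emphasis (3.42)
        "HGRE": np.sum(P * z ** 2) / Nr,            # high gray-level run emphasis (3.43)
    }
```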

3.9 Texture Based on MRF, SAR, and Fractals Models

Let us now see some model-based methods, originally developed for texture synthesis.
When an analytical description of the texture is possible, the texture can be
modeled by some characteristic parameters that are subsequently used for its
analysis; the same parameters can also be used to generate a representation of the
texture (synthesis). The most widespread texture model is the discrete Markov
Random Field (MRF), which is well suited to represent the local structural
information of an image [24] and to classify the

texture [25]. These models are based on the hypothesis that the intensity in each
pixel of the image depends only on the intensity of the pixels in its neighborhood,
apart from an additive noise term. With this model, each pixel of the image f (i, j) is
modeled as a linear combination of the intensity values of neighboring pixels and an
additive noise n

f(i, j) = \sum_{(l,k) \in W} f(i + l, j + k) \cdot h(l, k) + n(i, j)    (3.48)

where W indicates the window, that is, the set of pixels in the vicinity of the current
pixel (i, j) on which the window (almost always of size 3 × 3) is centered, and n(i, j)
is normally considered a random Gaussian noise with mean zero and variance σ². In
In this MRF model, the parameters are represented by the weights h(l, k) and by
the noise n(l, k) which are calculated with the least squares approach, i.e., they are
estimated by minimizing the error E expressed by the following functional:

E = \sum_{(i,j)} \left[ f(i, j) - \sum_{(l,k) \in W} f(i + l, j + k) \cdot h(l, k) + n(i, j) \right]^2    (3.49)

The texture of the model image is completely described with these parameters, which
are subsequently compared with those estimated by the observed image to determine
the texture class.
A method similar to that of MRF is given by the Simultaneous Autoregressive-
SAR model [26] which always uses the spatial relationship between neighboring
pixels to characterize the texture and classify it. The SAR model is expressed by the
following relationship:

f(i, j) = \mu + \sum_{(l,k) \in W} f(i + l, j + k) \cdot h(l, k) + n(i, j)    (3.50)

where h and n, conditioned by W, are still the model parameters characterizing
the spatial dependence of the pixel under examination with respect to the neighboring
pixels, while in the SAR model μ is considered as a bias term, given by the average
intensity of the input image.
Also, in this case, all the parameters of the model (μ, σ, h(l, k), n(i, j), and
the window W) can be estimated for a given image window using the least squares
error (LSE) estimation approach, or the maximum likelihood estimation (MLE)
approach. For both models, MRF and SAR, the texture characteristics are expressed
by the parameters of the model (excluding μ), used in the application contexts of
segmentation and classification. A variant of the basic SAR model is reported in [27]
to make the texture features (the parameters of the model) invariant to rotation and
scale change.
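A minimal sketch of the least-squares estimation of the SAR parameters is shown below (an illustrative Python implementation under the assumption of a 3 × 3 neighborhood; it is not the estimator used in the cited works).

```python
import numpy as np

def estimate_sar(f, half=1):
    """Least-squares estimate of the SAR parameters (mu, h, sigma) of Eq. (3.50),
    using a (2*half+1) x (2*half+1) neighborhood (the central pixel is excluded)."""
    f = f.astype(np.float64)
    rows, cols = f.shape
    offsets = [(l, k) for l in range(-half, half + 1)
                      for k in range(-half, half + 1) if (l, k) != (0, 0)]
    # Regression system: f(i,j) = mu + sum_k h_k * f(i+l_k, j+k_k) + n(i,j)
    A, b = [], []
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            A.append([1.0] + [f[i + l, j + k] for (l, k) in offsets])
            b.append(f[i, j])
    A, b = np.asarray(A), np.asarray(b)
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    mu, h = params[0], params[1:]
    sigma = np.std(b - A @ params)      # standard deviation of the residual noise
    return mu, dict(zip(offsets, h)), sigma
```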
The fractal models [13] are used when some local image structures remain similar
to themselves (self-similarity) when observed at different scales.
Mandelbrot [28] proposed fractal geometry to explain the structures of the natural
world. Given a closed set A in the n-dimensional Euclidean space, it is said to have the

property of being self-similar when A is the union of N distinct (nonoverlapping)


copies of itself, each scaled down by a scale factor r . With this model a texture is
characterized by the fractal dimension D, which is given by the equation

D = \frac{\log N}{\log(1/r)}    (3.51)

The fractal dimension is useful for characterizing the texture, and D expresses a
measure of surface roughness. Intuitively, the larger the fractal dimension, the rougher
the surface. In [13] it is shown that images with various natural textures can
be modeled with spatially isotropic fractals.
Generally, the texture related to many natural surfaces cannot be modeled with
deterministic fractal models because they have statistical variations. From this, it
follows that the estimation of the fractal dimension of an image is difficult. There are
several methods for estimating the D parameter, one of which is described in [29]
as follows. Given the closed set A, we consider windows of side L_max, such as to
cover the set A. A version of A scaled down by a factor r will result in N = 1/r^D
similar sets. This new set can be enclosed by windows of side L = r L_max, and
therefore their number is related to the fractal dimension D:

N(L) = \frac{1}{r^D} = \left( \frac{L_{max}}{L} \right)^D    (3.52)

The fractal dimension is, therefore, estimated by the equation (3.52) as follows. For
a given value of L, the n-dimensional space is divided into squares of side L and
the number of squares covering A is counted. The procedure is repeated for different
values of L and therefore the value of the fractal dimension D is estimated with the
slope of the line
ln(N (L)) = −D ln(L) + D ln(L max ) (3.53)

which can be calculated using a linear least-squares fit of the available data, i.e., a
plot of ln(N(L)) versus ln(L). A method that improves on the previous one is
suggested in [30], in which the fractal dimension of the image surface A is estimated.
Let p(m, L) be the probability that there are m intensity points within a square window
of size L centered at a random position on the image A; we have

p(m, L) = \frac{n(m, L) \cdot m}{M}

where n(m, L) is the number of windows containing m points and M is the total
number of pixels in the image. When windows of size L are overlaid on the image,
the value (M/m)P(m, L) represents the expected number of windows with m
points inside. The expected number of windows covering the entire image is given
by

E[N(L)] = M \sum_{m=1}^{N} (1/m)\, P(m, L)    (3.54)

The expected value of N (L) is proportional to L −D , and therefore, can be used to


estimate the fractal dimension D. In [29] it has been shown that the fractal dimension
is not sufficient to capture all the textural properties of an image. In fact, there may
be textures that are visually different but have similar fractal dimensions. To obviate
this drawback a measure was introduced, called lacunarity4 which actually captures
textural properties so as to be in accord with human perception. The measurement is
defined by

\Lambda = E\left[ \left( \frac{M}{E(M)} - 1 \right)^2 \right]    (3.55)

where M is the mass (understood as the set of pixel entities) of the fractal set and
E(M) its expected value. This quantity measures the discrepancy between the current
mass and the expected mass. Lacunarity values are small when the texture is
fine, while large values correspond to coarse textures. The mass of the fractal set is
related to the length L in the following way

M(L) = K L^D    (3.56)

The probability distribution P(m, L) can be calculated as follows. Let
M(L) = \sum_{m=1}^{N} m\, P(m, L) and M^2(L) = \sum_{m=1}^{N} m^2\, P(m, L); the lacunarity is defined as
follows:

\Lambda(L) = \frac{M^2(L) - (M(L))^2}{(M(L))^2}    (3.57)

It is highlighted that M(L) and M 2 (L) are, respectively, the first and second moment
of the probability distribution P(m, L). This lacunarity measurement of the image
is used as a feature of the texture for segmentation and classification purposes.
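A rough sketch of the basic box-counting estimate of D described by Eqs. (3.52)-(3.53) is given below (illustrative Python; the set of box sizes is an arbitrary choice and the input is assumed to be a binary set).

```python
import numpy as np

def box_counting_dimension(A, sizes=(2, 4, 8, 16, 32)):
    """Estimate D from the slope of ln N(L) versus ln L, as in Eq. (3.53).
    A is a binary image (True where the set is present)."""
    counts = []
    for L in sizes:
        rows, cols = A.shape
        n = 0
        # count the boxes of side L that contain at least one point of the set
        for r in range(0, rows, L):
            for c in range(0, cols, L):
                if A[r:r + L, c:c + L].any():
                    n += 1
        counts.append(n)
    # linear least-squares fit: ln N(L) = -D ln L + const
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return -slope

# Example on a synthetic set: for a filled square the estimate tends towards D = 2.
# A = np.ones((128, 128), dtype=bool); print(box_counting_dimension(A))
```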

4 Lacunarity, originally introduced by Mandelbrot [28], is a term in fractal geometry


that refers to a measure of how patterns fill space. Geometric objects appear more lacunar if they
contain a wide range of empty spaces or holes (gaps). Consequently, the lacunarity can be thought
of as a measure of “gaps” present, for example, in an image. Note that high lacunarity images that
are heterogeneous at small scales can be quite homogeneous at larger scales or vice versa. In other
words, lacunarity is a scale-dependent measure of the spatial complexity of patterns. In the fractal
context the lacunarity, being also a measure of spatial heterogeneity, can be used to distinguish
between images that have similar fractal dimensions but look different from the other.

3.10 Texture by Spatial Filtering

The texture characteristics can be determined by spatial filtering (see Sect. 9.9.1
Vol. I) by choosing a filter impulse response that effectively accentuates the texture’s
microstructures. For this purpose, Laws [31] proposed texture measurements using the
convolution of the image f (i, j) with filtering masks h(i, j) of dimensions 5×5 that
represent the impulse responses of the filter to detect the different characteristics of
textures in terms of uniformity, density, granularity, disorder, directionality, linearity,
roughness, frequency, and phase.
From the results of convolutions ge = f (i, j)  h e (i, j), with various masks, the
relative texture measurements Te are calculated, which express the energetic measure
of the texture microstructures detected, such as edges, wrinkles, homogeneous, point-
like, and spot structures. This diversity of texture structures is captured with different
convolution operations using appropriate masks defined as follows. It starts with three
simple 1D masks
L3 = [ 1  2  1 ]     (L - Level)
E3 = [ −1  0  1 ]    (E - Edge)        (3.58)
S3 = [ −1  2  −1 ]   (S - Spot)

where L3 represents a local mean filter, E3 represents an edge detector filter (at the
first difference), and S3 represents a spot detector filter (at the second difference).
Through the convolution of these masks with themselves and with each other, the
following basic 1D masks of size 5 × 1 are obtained

L5 = L3 ∗ L3 = [1 2 1] ∗ [1 2 1] = [1 4 6 4 1]    (3.59)
E5 = L3 ∗ E3 = [1 2 1] ∗ [−1 0 1] = [−1 −2 0 2 1]    (3.60)
S5 = L3 ∗ S3 = [1 2 1] ∗ [−1 2 −1] = [−1 0 2 0 −1]    (3.61)
R5 = S3 ∗ S3 = [−1 2 −1] ∗ [−1 2 −1] = [1 −4 6 −4 1]    (3.62)
W5 = E3 ∗ (−S3) = [−1 0 1] ∗ [1 −2 1] = [−1 2 0 −2 1]    (3.63)

These 5 × 1 basic masks represent, respectively, a smoothing filter (e.g., Gaussian) L5,
an edge detector (e.g., gradient) E5, a spot detector (e.g., Laplacian of Gaussian, LoG)
S5, a ripple detector R5, and a wave detector W5. From these basic masks the
two-dimensional 5 × 5 masks can be derived through the outer product between
identical 1D masks and between different pairs. For example, the masks E5L5 and
L5E5 are obtained from the outer product, respectively, between E5 and L5, and
between L5 and E5, as follows:
E5L5 = E5^T \times L5 = \begin{bmatrix} -1 \\ -2 \\ 0 \\ 2 \\ 1 \end{bmatrix} \times \begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix} = \begin{bmatrix} -1 & -4 & -6 & -4 & -1 \\ -2 & -8 & -12 & -8 & -2 \\ 0 & 0 & 0 & 0 & 0 \\ 2 & 8 & 12 & 8 & 2 \\ 1 & 4 & 6 & 4 & 1 \end{bmatrix}    (3.64)

L5E5 = L5^T \times E5 = \begin{bmatrix} 1 \\ 4 \\ 6 \\ 4 \\ 1 \end{bmatrix} \times \begin{bmatrix} -1 & -2 & 0 & 2 & 1 \end{bmatrix} = \begin{bmatrix} -1 & -2 & 0 & 2 & 1 \\ -4 & -8 & 0 & 8 & 4 \\ -6 & -12 & 0 & 12 & 6 \\ -4 & -8 & 0 & 8 & 4 \\ -1 & -2 & 0 & 2 & 1 \end{bmatrix}    (3.65)

The mask E5L5 detects the horizontal edges and simultaneously executes a local
average in the same direction, while the mask L5E5 detects the vertical edges. The
number of Laws 2D masks that can be obtained is 25, useful for extracting different
texture structures present in the image. The essential steps of the Laws algorithm for
extracting texture characteristics based on local energy are the following:

1. Removing lighting variations. Optional pre-processing step of the input image


f (i, j) which removes the effects of the lighting variation. The initial value of
each pixel of the image is replaced by subtracting from it the value of the average
of the local pixels included in the window of appropriate dimensions (for natural
scenes normally the dimensions are 15 × 15) and centered in it.
2. Pre-processed image filtering. The pre-processed image f (i, j) is filtered using
the 25 convolution masks 5 × 5 previously calculated with the external product
of the one-dimensional masks L5, E5, S5, R5, W5 given by the equations from
(3.59) to (3.63). For example, considering the mask E5L5 given by the (3.64)
we get the filtered image g E5L5 (i, j) as follows:

g E5L5 (i, j) = f (i, j)  E5L5 (3.66)

In reality, of the 25 images filtered with the corresponding 25 2D-masks, those


useful for extracting the texture characteristics are 24, since the image g L5L5
filtered with the mask L5L5 (Gaussian smoothing filter) is not considered. Fur-
thermore, to simplify, the one-dimensional basic mask W 5 given by the (3.63)
can be excluded and in this case the 2D masks are reduced to 16 and consequently
there would be 16 filtered images g.
3. Calculation of texture energy images. From the filtered images g the images
of texture energy T (i, j) are calculated. An approach for calculating the value
of each pixel T (i, j) is to consider the summation of the absolute values of the
pixels close to the pixel under examination (i, j) in g belonging to the window of
dimensions (2W + 1) × (2W + 1). Therefore, a generic texture energy image
is given by

T(i, j) = \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g(m, n)|    (3.67)

With the (3.67) we will have the set of 25 images of texture energy

{T X 5X 5 (i, j)} X =L ,E,S,R,W

if 5 basic 1D-masks are used or 16 if the first 4 are used, i.e., L5, E5, S5, R5.
A second approach for the calculation of the texture energy images is to consider,
instead of the sum of absolute values, the square root of the sum of the squared
values of the neighboring pixels, as follows:

T(i, j) = \sqrt{ \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g^2(m, n) }    (3.68)

4. Normalization of texture energy images. Optionally, the set of energy images


   T (i, j) can be normalized with respect to the image g_L5L5 obtained with the
   Gaussian convolution filter L5L5, the only filter with nonzero sum (see the filter L5
   given by the (3.59)), while all the others have zero sum, thus avoiding amplifying
   or attenuating the overall energy (see, for example, the filter E5L5 given by the
   (3.64)). Therefore, the new set of texture energy images, indicated with T̂(i, j),
   is given as follows:
\hat{T}(i, j) = \frac{\sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g(m, n)|}{\sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g_{L5L5}(m, n)|} \qquad \hat{T}(i, j) = \frac{\sqrt{\sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g^2(m, n)}}{\sqrt{\sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g^2_{L5L5}(m, n)}}    (3.69)

Alternatively, energy measurements of normalized textures can also be expressed


in terms of standard deviation, calculated as

\hat{T}(i, j) = \frac{1}{(2W+1)^2} \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} |g(m, n) - \mu(m, n)|    (3.70)

\hat{T}(i, j) = \sqrt{ \frac{1}{(2W+1)^2} \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} [g(m, n) - \mu(m, n)]^2 }    (3.71)

where μ(i, j) is the local average of the texture measure g(i, j), relative to
the window (2W + 1) × (2W + 1) centered in the pixel being processed (i, j),
estimated by

\mu(i, j) = \frac{1}{(2W+1)^2} \sum_{m=i-W}^{i+W} \sum_{n=j-W}^{j+W} g(m, n)    (3.72)

5. Significant images of texture energy. From the original image f (i, j) we have:

   the set {T_{X5Y5}(i, j)}, with X, Y = L, E, S, R, W, of 25 energy images,

   or the reduced set {T_{X5Y5}(i, j)}, with X, Y = L, E, S, R, of 16 energy images.

The energy image TL5L5 (i, j) is not meaningful to characterize the texture unless
we want to consider the contrast of the texture. The remaining 24 or 15 energy
images can be further reduced by combining some symmetrical pairs replacing
   them with the average of their sum. For example, we know that T_E5L5 and T_L5E5
   represent the energy of vertical and horizontal structures (rotation-variant measures),
   respectively. If they are added, the result is the energy image T_E5L5/L5E5,
   corresponding to the module of the edges (a rotation-invariant texture measurement).
   The other energy images T_E5E5, T_S5S5, T_R5R5, and T_W5W5 are used
   directly (rotation-invariant measures). Therefore, using the 5 one-dimensional bases
L5, E5, S5, R5, W 5, after the combination we have the following 14 energy
images
TE5L5/L5E5 TS5L5/L5S5 TW 5L5/L5W 5 TR5L5/L5R5
TE5E5 TS5E5/E5S5 TW 5E5/E5W 5 TR5E5/E5R5
(3.73)
TS5S5 TW 5S5/S5W 5 TR5S5/S5R5 TW 5W 5
TR5W 5/W 5R5 TR5R5

while, using the first 4 masks, one-dimensional bases, we have the following 9
energy images
TE5L5/L5E5 TS5L5/L5S5 TR5L5/L5R5
TE5E5 TS5E5/E5S5 TR5E5/E5R5 (3.74)
TS5S5 TR5R5 TR5S5/S5R5

In summary, to characterize various types of textures with the Laws method, we


have 14 or 9 images of energy, or we have for each pixel of the input image
f (i, j) 14 or 9 texture measurements.
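A compact sketch of the pipeline described in steps 1-4, assuming SciPy is available, is shown below; the restriction to four 1D bases, the 15 × 15 illumination window, and the 7 × 7 energy window are illustrative choices, and the local mean used for the energy is a stand-in for the summation of (3.67).

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

# 1D Laws bases of Eqs. (3.59)-(3.62); W5 is omitted here for brevity.
BASES = {"L5": [1, 4, 6, 4, 1], "E5": [-1, -2, 0, 2, 1],
         "S5": [-1, 0, 2, 0, -1], "R5": [1, -4, 6, -4, 1]}

def laws_energy_images(f, window=7):
    """Texture energy images from the 16 Laws 5x5 masks (outer products of the 1D bases)."""
    f = f.astype(np.float64)
    f = f - uniform_filter(f, size=15)          # step 1: remove local illumination (optional)
    energies = {}
    for a, va in BASES.items():
        for b, vb in BASES.items():
            mask = np.outer(va, vb)             # 2D mask, e.g. E5L5 = E5^T x L5
            g = convolve(f, mask, mode="reflect")
            energies[a + b] = uniform_filter(np.abs(g), size=window)  # local average of |g|
    # combine symmetric pairs into rotation-invariant measures, e.g. E5L5/L5E5
    combined = {"E5L5/L5E5": (energies["E5L5"] + energies["L5E5"]) / 2,
                "E5E5": energies["E5E5"], "S5S5": energies["S5S5"], "R5R5": energies["R5R5"]}
    return energies, combined
```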

These texture measurements are used in different applications for image segmentation
and classification. In relation to the nature of the texture, to better characterize the
microstructures present at various scales, it is useful to verify the impact of the size
of the filtering masks on the discriminating power of the texture measurements T .
In fact, the Laws method has also been tested using the 3 × 1 one-dimensional
masks given in (3.58), from which the two-dimensional 3 × 3 masks were derived
through the outer product between identical 1D masks and between different pairs.
In this case, the one-dimensional ripple mask R3 is excluded, as it cannot be
reproduced as a 3 × 3 mask, and the derivable 2D masks are the following:

L3L3 = \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} \quad L3E3 = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} \quad L3S3 = \begin{bmatrix} -1 & 2 & -1 \\ -2 & 4 & -2 \\ -1 & 2 & -1 \end{bmatrix}

E3L3 = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \quad E3E3 = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{bmatrix} \quad E3S3 = \begin{bmatrix} -1 & 2 & -1 \\ 0 & 0 & 0 \\ 1 & -2 & 1 \end{bmatrix}

S3L3 = \begin{bmatrix} -1 & -2 & -1 \\ 2 & 4 & 2 \\ -1 & -2 & -1 \end{bmatrix} \quad S3E3 = \begin{bmatrix} -1 & 0 & 1 \\ 2 & 0 & -2 \\ -1 & 0 & 1 \end{bmatrix} \quad S3S3 = \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}    (3.75)

With the masks 3 × 3, after the combinations of the symmetrical masks we have the
following 5 images available:

TE3L3/L3E3 TS3L3/L3S3 TE3E3 TE3S3/S3E3 TS3S3 (3.76)

Laws energy masks have been applied to the images in Fig. 3.12, and Table 3.2
reports, for each image, the texture measurements derived from the 9 significant
energy images obtained by applying the process described above. The 9 energy
images reported in (3.74) were used. A window with a size of 7 × 7 was used to
estimate the local energy measurements with the (3.68), then normalized with respect
to the original image (smoothed with a mean filter). Using a larger window does not
change the results and would only increase the computational time. Laws tested the
proposed method on a sample mosaic of Brodatz texture fields, obtaining a correct
identification rate of about 90%. Laws' texture measurements have been extended for
the volumetric analysis of 3D textures [32].
In analogy with the Laws masks, Haralick proposed the masks used for edge
extraction for deriving texture measurements, starting from the following basic
masks:

Table 3.2 Texture measurements related to the images in Fig. 3.12, derived from the energy images in (3.74)
Image Tex_1 Tex_2 Tex_3 Tex_4 Tex_5 Tex_6 Tex_7
L5E5/E5L5 1.3571 2.0250 0.5919 0.9760 0.8629 1.7940 1.2034
L5R5/R5L5 0.8004 1.2761 0.2993 0.6183 0.4703 0.7778 0.5594
E5S5/S5E5 0.1768 0.2347 0.0710 0.1418 0.1302 0.1585 0.1281
S5S5 0.0660 0.0844 0.0240 0.0453 0.0455 0.0561 0.0441
R5R5 0.1530 0.2131 0.0561 0.1040 0.0778 0.1659 0.1068
L5S5/S5L5 0.8414 1.1762 0.3698 0.6824 0.6406 0.9726 0.7321
E5E5 0.4756 0.6873 0.2208 0.4366 0.3791 0.4670 0.3986
E5R5/R5E5 0.2222 0.2913 0.0686 0.1497 0.1049 0.1582 0.1212
S5R5/R5S5 0.0903 0.1178 0.0285 0.0580 0.0445 0.0713 0.0523
h_1 = \frac{1}{3}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \qquad h_2 = \frac{1}{2}\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \qquad h_3 = \frac{1}{2}\begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}

and the related two-dimensional masks are:


\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \quad \frac{1}{6}\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \quad \frac{1}{6}\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}

\frac{1}{6}\begin{bmatrix} 1 & 1 & 1 \\ -2 & -2 & -2 \\ 1 & 1 & 1 \end{bmatrix} \quad \frac{1}{4}\begin{bmatrix} -1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & -1 \end{bmatrix} \quad \frac{1}{4}\begin{bmatrix} 1 & -2 & 1 \\ 1 & -2 & 1 \\ 1 & -2 & 1 \end{bmatrix}

\frac{1}{4}\begin{bmatrix} 1 & 0 & -1 \\ -2 & 0 & 2 \\ 1 & 0 & -1 \end{bmatrix} \quad \frac{1}{4}\begin{bmatrix} -1 & 2 & -1 \\ 0 & 0 & 0 \\ 1 & -2 & 1 \end{bmatrix} \quad \frac{1}{4}\begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}    (3.77)
These masks can also be extended to dimensions 5 × 5, similarly to those previously
calculated.

3.10.1 Spatial Filtering with Gabor Filters

Another spatial filtering approach to extract texture features is based on Gabor
filters [33]. These filters are widely used for texture analysis, motivated by their
spatial localization, orientation selectivity, and frequency selectivity. Gabor filters
can be seen as precursors of wavelets (see Sect. 2.12 Vol. II), where each filter
captures energy at a particular frequency and for a specific direction. Their
widespread use is motivated by their mathematical properties and by
neurophysiological evidence. In 1946 Gabor showed that the joint specificity of a
signal in time and frequency is fundamentally constrained by a lower bound on the
product of its bandwidth and its duration. This limit is \Delta x \, \Delta \omega \geq \frac{1}{4\pi}. Furthermore,
he found that signals of the form

s(t) = \exp\left( -\frac{t^2}{\alpha^2} + j\omega t \right)

reach this theoretical limit. The Gabor functions form a complete set of basis
(non-orthogonal) functions and allow one to expand any function in terms of these
basis functions. Subsequently, Gabor's functions were generalized to the
two-dimensional space [34,35] to model the profile of the receptive fields of the
simple cells of the primary visual cortex (also known as striate cortex or V1).5

5 Psychovisual redundancy studies indicate that the human visual system processes images at dif-
ferent scales. In the early stages of vision, the brain performs a sort of analysis in different spatial

As we shall see, these functions are substantially bandpass filters that can be
treated together in the 2D spatial domain or in the Fourier 2D domain. These specific
properties of Gabor 2D functions have motivated research to describe and discrim-
inate the texture of images using the power spectrum calculated with Gabor filters
[36]. In essence, it is verified that the texture characteristics found with this method
are locally spatially invariant.
Now let’s see how it is possible to define a bank of Gabor 2D filters to capture the
energy of the image and detect texture measurements at a particular frequency and a
specified direction. In the 2D spatial domain, the canonical elementary function of
Gabor h(x, y) is a complex harmonic function (i.e., composed of the sine and cosine
functions) modulated by a Gaussian oriented function g(xo , yo ), given in the form

h(x, y) = g(xo , yo ) · exp [2π j (U x + V y)] (3.78)



where j = \sqrt{-1}, (U, V ) represents a particular 2D frequency in the frequency do-
main (u, v), and

(xo , yo ) = (x cos θ + y sin θ, −x sin θ + y cos θ )

represents the geometric transformation of the coordinates that rotates the Gaussian
g(x_o, y_o) by an angle θ with respect to the x axis. The 2D oriented Gaussian function
is given by

g(x_o, y_o) = \frac{1}{2\pi\gamma\sigma^2} \exp\left( -\frac{(x_o/\gamma)^2 + y_o^2}{2\sigma^2} \right)    (3.79)

where γ is the spatial aspect ratio and specifies the ellipticity of the 2D-Gaussian
(support of Gabor function), σ is the standard deviation of the Gaussian that char-
acterizes the extension (scale) of the filter in the spatial domain, and the band, in
the Fourier domain. If γ = 1, then the angle θ is no longer relevant because the
Gaussian (3.79) becomes circularly symmetric, simplifying the filter (3.78).
The Gabor filter h(x, y) in the Fourier domain is given by

H(u, v) = \exp\left\{ -2\pi^2\sigma^2 \left[ (u_o - U_o)^2 \gamma^2 + (v_o - V_o)^2 \right] \right\}    (3.80)

where (u o , vo ) = (u cos θ + v sin θ, −u sin θ + v cos θ ) and (Uo , Vo ) produces a


similar rotation of θ , in the frequency domain, with respect to the u axis. Furthermore,
the (3.80) indicates that H (u, v) is a Gaussian bandpass filter, rotated by an angle θ
with respect to the u axis, with aspect ratio 1/γ.

Fig. 3.13 Gabor filter in the frequency domain, with elliptical support centered at the frequency (U, V )

The complex exponential represents a complex 2D harmonic with radial central frequency

F = \sqrt{U^2 + V^2}    (3.81)

and orientation given by

\phi = \tan^{-1}\left( \frac{V}{U} \right)    (3.82)

where φ is the orientation angle of the sinusoidal harmonic, with respect to the
frequency axis u, in the Fourier domain (u, v) (see Fig. 3.13).
Figure 3.14 shows instead the 3D and 2D graphic representation of the real and
imaginary components of a Gabor function.
Although Gabor filters may have a modulating Gaussian support with arbitrary
orientation, in many applications it is useful that the modulating Gaussian
function has the same orientation as the complex sinusoidal harmonic, i.e., θ = φ.
In that case, the (3.78) and (3.80) are reduced, respectively,

h(x, y) = g(xo , yo ) · exp [2π j F xo ] (3.83)

and

H(u, v) = \exp\left\{ -2\pi^2\sigma^2 \left[ (u_o - F)^2 \gamma^2 + v_o^2 \right] \right\}    (3.84)

At this point, it is understood that the appropriate definition of Gabor’s filters is


expressed in terms of their spatial frequency and orientation bandwidth. It is observed
from the (3.78) that the Gabor function responds significantly to a limited range of
signals that form a repetitive structure in some direction (usually coinciding with
the orientation of the filter) and are associated with some frequencies (i.e., the filter
band).
For the (3.78) to be useful, the frequency domain must be covered by a bank of filters,
in terms of radial frequencies and orientation bandwidths, so that their
impulse responses characterize the texture present in the image.
Figure 3.15 shows schematically, in the 2D Fourier domain, the arrangement
of the responses of a Gabor filter bank, where each elliptic region represents the
range of frequencies and orientation so that some filters respond with a strong signal.


Fig. 3.14 Perspective representation of the real component (cosine) and of the imaginary compo-
nent (sine) of a Gabor function with a unitary aspect ratio

Fig. 3.15 Support in the frequency domain of the Gabor filter bank. Each elliptical region represents
a range of frequencies for which some filters respond strongly. Regions that are on the same ring
correspond to filters with the same radial frequency, while regions at different distances from the origin
but with identical direction correspond to different scales. In the example shown on the left, the filter
bank has 3 scales and 3 directions. The figure on the right shows the frequency responses
in the spectral domain H(u, v) of a filter bank with 5 scales and 8 directions

Regions included in a ring correspond to filters with the same radial frequency, while
regions at different distances from the origin but with the same direction correspond
to filters with different scales. The goal of defining the filter bank is to map the
different textures of an image into the appropriate region that represents the filter's
different textures of an image in the appropriate region that represents the filter’s
characteristics in terms of frequencies and direction.
Gabor’s basic 2D functions are generally spatially localized, oriented and with an
octave bandwidth.6

6 We recall that it is customary to divide the bands with constant percentage amplitudes. Each
band is characterized by a lower frequency f_i, a higher frequency f_s, and a central frequency f_c.
The most frequently used bandwidths are the octave bands, where the lower and upper extremes are
in the ratio 1 : 2, i.e., f_s = 2 f_i. The percentage bandwidth (f_s − f_i)/f_c is constant, and
f_c = \sqrt{f_i \cdot f_s}. There are also 1/3-octave bands, with f_s = \sqrt[3]{2} \cdot f_i, where the width of each
band is narrower, equal to 23.2% of the nominal central frequency of each band.

The frequency bandwidth B and the orientation bandwidth Ω (expressed, respectively,
in octaves and in radians), measured at half the peak response of the Gabor filter
given by the (3.83), are (see Fig. 3.16):

B = \log_2 \left( \frac{\pi F \gamma \sigma + \alpha}{\pi F \gamma \sigma - \alpha} \right)    (3.85)

\Omega = 2 \tan^{-1} \left( \frac{\alpha}{\pi F \sigma} \right)    (3.86)


where α = \sqrt{(\ln 2)/2}. A bank of Gabor filters of arbitrary direction and bandwidth
can be defined by varying the 4 free parameters θ, F, σ, γ (or Ω, B, σ, γ) and ex-
tending the elliptical regions of the spatial frequency domain with the major axis
passing through the origin (see Fig. 3.15). In general, we tend to cover the frequency
domain with a limited number of filters and to minimize the overlap of the filter
support regions. From the (3.83) we observe that, for the sinusoidal component, the
Gabor function h(x, y) is a complex function with a real and imaginary part. The
sinusoidal component is given by

exp(2π j F x_o) = cos(2π F x_o) + j sin(2π F x_o)    (3.87)

and the real (cosine) and imaginary (sine) components of h(x, y) are (see Fig. 3.14)

h c,F,θ (x, y) = g(xo , yo ) cos(2π F xo ) (3.88)

h s,F,θ (x, y) = g(xo , yo ) sin(2π F xo ) (3.89)

The functions h c,F,θ and h s,F,θ are, respectively, even (with symmetry with respect
to the x axis) and odd (with symmetry with respect to the origin), and symmetric
in the direction of θ . To get Gabor texture measurements TF,θ , an image I (x, y) is
filtered with Gabor’s filters (3.88) and (3.89) through the convolution operation, as
follows:

T_{c,F,θ}(x, y) = I(x, y) ∗ h_{c,F,θ}(x, y) \qquad T_{s,F,θ}(x, y) = I(x, y) ∗ h_{s,F,θ}(x, y)    (3.90)
The result of the two convolutions is almost identical, apart from a phase difference
of π/2 in the θ direction. From the obtained texture measures T_{c,F,θ} and T_{s,F,θ} it is
useful to calculate the energy E F,θ and the amplitude A F,θ


Fig. 3.16 Detail of the bandwidth B and of the orientation bandwidth Ω of the frequency domain
support of a Gabor filter, expressed by the (3.83), whose real and imaginary components in the
spatial domain are also represented

E_{F,θ}(x, y) = T^2_{c,F,θ}(x, y) + T^2_{s,F,θ}(x, y)    (3.91)

A_{F,θ}(x, y) = \sqrt{E_{F,θ}(x, y)}    (3.92)

while the average energy calculated on the entire image is given by

\hat{E}_{F,θ} = \frac{1}{N} \sum_x \sum_y E_{F,θ}(x, y)    (3.93)

where N is the number of pixels in the image.

3.10.1.1 Application of Gabor Filters


The procedure used to extract texture measurements from an image based on a bank
of Gabor filters depends on the type of application. If the textures to be described are
already known, the goal is to select the best filters with which to calculate the energy
images and extract from these the texture measurements that well discriminate such
texture models. If instead the image contains several different textures, a set of
texture measurements is extracted from the energy images, in relation to the number
of filters defined on the basis of the scales and orientations used.
This set of texture measurements is then used to segment the image with one
of the algorithms described in Chap. 1, for example, the K-means algorithm.
The discrimination of texture measurements can be evaluated by calculating the
Euclidean distance or using other metrics. The essential steps for calculating texture
measurements with a Gabor filter bank can be the following:

1. Select the free parameters θ, F, σ, γ based on which the number of filters is


also defined.
2. Design of the filter bank with the required scale and angular resolution charac-
teristics.
3. Prepare the 2D input image I(x, y), which can be a gray-level or color image, but
in this case, the RGB components are treated separately or combined into a
single significant component through, for example, the transform to principal
components (PCA).
4. Decompose the input image using the filters with the convolution operator
(3.90).
5. Extract texture measurements from the energy images (3.91) or the amplitude images (3.92).
The input image and these texture measurements can be smoothed with low
pass filters to attenuate any noise. In this case, it should be remembered that the
components of low frequencies (representing contrast and intensity) remain un-
changed while those of high frequency (details such as the edges) are attenuated,
thus obtaining a blurred image.
6. In relation to the type of application (segmentation, classification, ...), characterize
   each pixel of the input image in the space of the extracted measures (features).
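The sketch below follows the spirit of these steps (an illustrative Python implementation, not the code used for Fig. 3.17); the kernel size, the set of frequencies, and the rule sigma = 0.5/F linking scale and frequency are assumptions made here for the example.

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(F, theta, sigma, gamma=1.0, half=15):
    """Real and imaginary Gabor kernels of Eqs. (3.88)-(3.89)."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    xo = x * np.cos(theta) + y * np.sin(theta)
    yo = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-((xo / gamma) ** 2 + yo ** 2) / (2 * sigma ** 2)) / (2 * np.pi * gamma * sigma ** 2)
    return g * np.cos(2 * np.pi * F * xo), g * np.sin(2 * np.pi * F * xo)

def gabor_energy_features(I, frequencies=(0.05, 0.1, 0.2, 0.4), n_orient=8):
    """One energy image E_{F,theta} per (scale, orientation), as in Eq. (3.91)."""
    I = I.astype(np.float64)
    feats = []
    for F in frequencies:
        sigma = 0.5 / F                      # scale tied to frequency (a common heuristic)
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            hc, hs = gabor_kernel(F, theta, sigma)
            Tc = convolve(I, hc, mode="reflect")
            Ts = convolve(I, hs, mode="reflect")
            feats.append(Tc ** 2 + Ts ** 2)  # energy image E_{F,theta}(x, y)
    return np.stack(feats)                   # shape: (n_scales * n_orient, rows, cols)

# The resulting feature images can be fed to a clustering algorithm (e.g., K-means) for segmentation.
```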

Figure 3.17 shows a simple application of Gabor filters to segment the texture of an
image. The input image (a) has 5 types of natural textures, not completely uniform.
The features of the textures are extracted with a bank of 32 Gabor filters in 4
scales and 8 directions (figure (d)). The number of available features is, therefore,
32 × 51 × 51 after subsampling the image of size 204 × 204 by a factor of 4. Figure (b)
shows the result of the segmentation obtained by applying the K-means algorithm to
the feature images extracted with the Gabor filter bank, while figure (c) shows the
result of the segmentation applying the Gabor filters defined in (d) after reducing the
features to 5 × 51 × 51 through data reduction with principal components (PCA) (see
Sect. 2.10.1 Vol. II).
Another approach to texture analysis is based on the wavelet transform where the
input image is decomposed at various levels of subsampling to extract different image


Fig. 3.17 Segmentation of 5 textures not completely uniform. a Input image; b segmented by
applying K-means algorithm to the features extracted with the Gabor filter bank shown in (d); c
segmented image after reducing the feature images to 5 with the PCA; d the bank of Gabor filters
used with 4 scales and 8 directions

details [37,38]. Texture measurements are extracted from the energy and variance
of the subsampled images. The main advantage of wavelet decomposition is that it
provides a unified multiscale context analysis of the texture.

3.11 Syntactic Methods for Texture

The syntactic description of the texture is based on the analogy between the spatial
relation of texture primitives and the structure of a formal language [39]. The de-
scriptions of the various classes of textures form a language that can be represented
by a grammar, whose rules are inferred by analyzing the primitives of the sample
textures (training set) in the learning phase. The syntactic description of the texture
is based on the idea that the texture is composed of primitives repeated and arranged
in a regular manner in the image. To fully describe a texture, the syntactic methods
must essentially determine the primitives and the rules by which these primitives
are spatially arranged and repeated. A typical syntactic
solution involves using grammar with rules that generate the texture of primitives
using transformation rules for a limited number of symbols. The symbols represent in
practice various types of texture primitives while the transformation rules represent
the spatial relations between the primitives.
The syntactic approach must, however, foresee that the textures of the real world
are normally irregular with the presence of errors in the structures repeated in an
unpredictable way and with considerable distortions. This means that the rules of the
grammar may not efficiently describe real textures unless they are variable, and the
grammar must be of a different type (stochastic grammar). Let us consider a simple
grammar for the generation of the texture starting with a starting symbol S and
applying the transformation rules called shape rules. The texture is generated through
various phases:

1. Activate the texture generation process by applying some transformation rules to the start symbol S.
2. Find a part of the texture generated in step 1 that is comparable with the left-hand side (first member) of one of the available transformation rules. A correct match must be verified between the terminal and nonterminal symbols that appear in the first member of the chosen transformation rule and the corresponding terminal and nonterminal symbols of the part of the texture to which the rule is to be applied. If no such part of the texture is found, the algorithm ends.
3. Find an appropriate transformation that can be applied to the first member of
the rule chosen to make it perfectly coincide with the considered texture.
4. Apply this geometric transformation to the second member of the transformation
rule.
5. Replace the specified part of texture (the transformed portion that coincides
with the first member of the chosen rule) with that transformed by the second
member of the chosen rule.
6. Continue from step 2.
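The following toy sketch in Python illustrates only the rewriting loop of steps 1–6; it replaces the 2D shape rules and geometric transformations with a deliberately simplified one-dimensional string grammar (the RULES dictionary and the symbols used are hypothetical), so it should be read as a sketch of the control flow rather than as a real texture generator.

import random

# Hypothetical grammar: S is the start symbol, lowercase letters are terminal
# primitives, and each rule maps a first member to possible second members.
RULES = {
    "S": ["aTb", "ab"],     # expand the start symbol
    "T": ["aTb", "ab"],     # recursive rule producing nested primitives
}

def generate_texture(start="S", max_steps=20, seed=0):
    random.seed(seed)
    texture = start
    for _ in range(max_steps):
        # Step 2: find a part of the texture matching the first member of a rule.
        nonterminals = [s for s in texture if s in RULES]
        if not nonterminals:              # no match found: the algorithm ends
            break
        symbol = random.choice(nonterminals)
        # Steps 4-5: replace the matched part with the rule's second member.
        replacement = random.choice(RULES[symbol])
        texture = texture.replace(symbol, replacement, 1)
    return texture

# Prints a string of nested 'a...b' primitives (or an unfinished derivation
# if max_steps is reached before all nonterminals are rewritten).
print(generate_texture())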

Fig. 3.18 Grammar G (with terminal symbols Vt, nonterminal symbols Vn, start symbol S, and rules R, shown graphically in the figure) to generate a texture with hexagonal geometric structures
Fig. 3.19 Example of texture with hexagonal geometric structures: a recognized texture and b unrecognized texture

We explain with an example the algorithm presented for a grammar G = [Vn, Vt, R, S]. Let Vn be the set of nonterminal symbols, Vt the set of terminal symbols, R the set of rules, and S the start symbol. As an example of grammar, we consider the
one shown in Fig. 3.18. This grammar is used to generate a texture whose hexagonal geometric primitives are replicated by applying the individual rules of R several times. With this grammar it is possible to analyze the
individual rules of R several times. With this grammar it is possible to analyze the
image of Fig. 3.19 to recognize or reject the hexagonal texture represented.
The recognition involves searching the image first for the hexagonal primitives of the texture and then checking whether they are comparable with the right-hand side (second member) of some of the transformation rules in R. In essence, the recognition process takes place
by applying the rules to a given texture in the reverse direction, until the initial shape
is reproduced.

3.12 Method for Describing Oriented Textures

In different applications, we are faced with so-called oriented textures, that is, textures whose primitives exhibit a local orientation that varies from point to point in the image. In other words, the texture shows a dominant local orientation and in
this case we speak of a texture with a high degree of local anisotropy. To describe
and visualize this type of texture it is convenient to think of the gray-level image as
representing a flow map where each pixel represents a fluid element subjected to a
motion in the dominant direction of the texture, that is, in the directions of maxi-
mum variation of the levels of gray. In analogy to what happens in the study of fluid
dynamics, where each particle is subject to a velocity vector composed of its magnitude and direction, in the case of images with oriented textures we can also define a texture orientation field, simply called Oriented Texture Field (OTF), which is actually composed of two images: the image of orientation and the image of coherence. The orientation image includes the local orientation information of the texture for each pixel, while the image of coherence represents the degree of anisotropy, again

in each pixel of the image. The images of the oriented texture fields as proposed by
Rao [40] are calculated with the following five phases:

1. Gaussian filtering to attenuate the noise present in the image;


2. Calculation of the Gaussian image gradient;
3. Estimating the local orientation angle using the inverse tangent function;
4. Calculation of the average of local orientation estimates for a given window
centered on the pixel being processed;
5. Calculation of a coherence estimate (texture flow information) for each image
point.

The first two phases, as known, are realized with standard edge extraction algo-
rithms (see Chap. 1 Vol. II). Recall that the Gaussian gradient operator is an optimal
solution for edge extraction. We specify also that the Gaussian filter is characterized
by the standard deviation of the Gaussian distribution σ that defines the level of
detail with which the geometric figures of the texture are extracted. This parameter,
therefore, indicates the degree of detail (scale) of the texture to be extracted.
In the third phase the local orientation of the texture is calculated by means of the inverse tangent function, which takes a single argument as input and returns a unique result in the interval (−π/2, π/2). With edge extraction algorithms, the direction of maximum gradient is normally calculated with the two-argument arctangent function, which returns values over the full range (−π, π] and therefore does not provide a unique orientation (θ and θ + π describe the same orientation).
In the fourth phase the orientation estimates are smoothed by a Gaussian filter with standard deviation σ2. This second filter must have a greater standard deviation than the previous one (σ2 > σ1) and must produce a significant smoothing of the various orientation estimates. The value of σ2 must, however, be smaller than the distance over which the orientation of the texture shows its widest variations and, finally, it must not attenuate (blur) the details of the texture itself.
The fifth phase calculates the texture coherence with respect to the dominant local orientation estimate, i.e., the normalized sum of the projections of the directional vectors of the neighboring pixels onto that direction. If the orientations are coherent, then the normalized projections will have a value close to unity; in the contrary case, the projections tend to cancel each other, producing a result close to zero.

3.12.1 Estimation of the Dominant Local Orientation

Consider an area of the image with different segments whose orientations indicate
the local arrangement of the texture. One could calculate as the dominant direction
the one corresponding to the resulting vector sum of the single local directions.
This approach would have the disadvantage of not being able to determine a single
direction as there would be two angles θ and θ + π . Another drawback would occur if
we considered oriented segments, as some of these with opposite signs would cancel
each other out, instead of contributing to the estimation of the dominant orientation.

Fig. 3.20 Calculation of the dominant local orientation θ considering the orientation of the gradient of the neighboring pixels

Rao suggests the following solution (see Fig. 3.20). Let N be the number of local segments, and consider a line oriented at an angle θ with respect to the horizontal axis x. Consider a segment j with angle θ_j, and denote its length by R_j. The sum of the absolute values of the projections of all the segments onto this line is given by


S_1 = \sum_{j=1}^{N} \left| R_j \cos(\theta_j - \theta) \right|     (3.94)

where S1 varies with the orientation θ of the considered line. The dominant orientation
is obtained for a value of θ where S1 is maximum. In this case, θ is calculated by
setting the derivative of the function S1 with respect to θ to zero. To avoid the problem of differentiating the absolute value function (which is not differentiable everywhere), it is convenient to consider and differentiate the following sum S2:


S_2 = \sum_{j=1}^{N} R_j^2 \cos^2(\theta_j - \theta)     (3.95)

Differentiating with respect to θ, we obtain

\frac{dS_2}{d\theta} = -\sum_{j=1}^{N} 2 R_j^2 \cos(\theta_j - \theta) \sin(\theta_j - \theta)

Recalling the double-angle trigonometric formulas and then the sine addition formula, and setting the derivative equal to zero, we obtain the following equations:


-\sum_{j=1}^{N} R_j^2 \sin 2(\theta_j - \theta) = 0 \;\Longrightarrow\; \sum_{j=1}^{N} R_j^2 \sin 2\theta_j \cos 2\theta = \sum_{j=1}^{N} R_j^2 \cos 2\theta_j \sin 2\theta

from which

\tan 2\theta = \frac{\sum_{j=1}^{N} R_j^2 \sin 2\theta_j}{\sum_{j=1}^{N} R_j^2 \cos 2\theta_j}     (3.96)

If we denote by θ' the value of θ for which the maximum value of S2 is obtained, θ' coincides with the best estimate of the local dominant orientation. Now let's see how the

Fig. 3.21 Coherence calculation of texture flow fields: within a window of size W × W, the gradient vector G(x_j, y_j) is projected in the direction θ(x_0, y_0)

previous equation (3.96) is used for the calculation of the dominant orientation in each pixel of the image. Let g_x and g_y be the horizontal and vertical components of the gradient at each point of the image, and consider the complex quantity g_x + i g_y, which represents the gradient of that pixel in the complex plane. The gradient vector at a point (m, n) of the image can then be represented in polar coordinates as R_{m,n} e^{i θ_{m,n}}. At this point we can calculate the dominant local orientation angle θ for a neighborhood of (m, n) defined by an N × N pixel window as follows:
\theta = \frac{1}{2} \tan^{-1} \left( \frac{\sum_{m=1}^{N} \sum_{n=1}^{N} R_{m,n}^2 \sin 2\theta_{m,n}}{\sum_{m=1}^{N} \sum_{n=1}^{N} R_{m,n}^2 \cos 2\theta_{m,n}} \right)     (3.97)

The dominant orientation of the texture at the point (m, n) is then given by θ + π/2, because the gradient vector is perpendicular to the direction of anisotropy.
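A minimal sketch of this estimate is given below (NumPy/SciPy); the Gaussian derivative filters stand in for phases 1–2, the N × N local sums are computed with a box filter, and the values of sigma and of the window size are arbitrary illustrative choices.

import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def dominant_orientation(img, sigma=1.0, win=15):
    # Phases 1-2: Gaussian-smoothed image gradient (axis 1 = horizontal x).
    gx = gaussian_filter(img.astype(np.float64), sigma, order=(0, 1))
    gy = gaussian_filter(img.astype(np.float64), sigma, order=(1, 0))
    R2 = gx ** 2 + gy ** 2                      # squared gradient magnitude
    theta_g = np.arctan2(gy, gx)                # gradient direction per pixel
    # Local sums of R^2 sin(2*theta) and R^2 cos(2*theta) over a win x win window
    # (uniform_filter returns the local mean, which differs from the sum only by
    # a constant factor that cancels in the ratio of Eq. (3.97)).
    s = uniform_filter(R2 * np.sin(2 * theta_g), size=win)
    c = uniform_filter(R2 * np.cos(2 * theta_g), size=win)
    theta = 0.5 * np.arctan2(s, c)              # Eq. (3.97)
    # The texture orientation is perpendicular to the dominant gradient direction.
    return (theta + np.pi / 2) % np.pi

# Example: orientation = dominant_orientation(np.random.rand(256, 256))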

3.12.2 Texture Coherence

Let G(x, y) be the magnitude of the gradient calculated in phase 2 (see Sect. 3.12) at the point (x, y) of the image plane. The measure of the coherence at the point (x_0, y_0) is calculated by considering a window of size W × W centered at this point (see Fig. 3.21). For each point (x_i, y_j) of the window, the gradient vector G(x_i, y_j), oriented in the direction θ(x_i, y_j), is projected onto the unit vector in the direction θ(x_0, y_0). In other words, the projected gradient vector is given by

G(x_i, y_j) \cos[\theta(x_0, y_0) - \theta(x_i, y_j)]

The normalized sum of the absolute values of these projections of the gradient vectors included in the window is taken as an estimate κ of the coherence measure

\kappa = \frac{\sum_{(i,j)\in W} \left| G(x_i, y_j) \cos[\theta(x_0, y_0) - \theta(x_i, y_j)] \right|}{\sum_{(i,j)\in W} G(x_i, y_j)}     (3.98)


Fig. 3.22 Calculation of the orientation map and coherence measurement for two images with
vertical and circular dominant textures

This measure is correlated with the dispersion of the data directionality. A better coherence measure ρ is obtained by weighting the value of the estimate κ, given by the previous (3.98), with the magnitude of the gradient at the point (x_0, y_0):

\rho = G(x_0, y_0) \frac{\sum_{(i,j)\in W} \left| G(x_i, y_j) \cos[\theta(x_0, y_0) - \theta(x_i, y_j)] \right|}{\sum_{(i,j)\in W} G(x_i, y_j)}     (3.99)

In this way, the coherence takes on high values in correspondence with high values of the gradient, i.e., where there are strong local variations of intensity in the image (see Fig. 3.22).
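The coherence measures (3.98) and (3.99) can be sketched as follows (NumPy only); here G and theta are assumed to be the gradient magnitude and the pixel-wise gradient direction, for instance computed as in the previous sketch, and the window side W is an arbitrary odd value.

import numpy as np

def coherence(G, theta, W=15, weighted=True):
    # kappa (3.98): normalized sum, over the W x W window, of the absolute
    # projections of the gradient vectors onto the central direction.
    # rho (3.99): kappa additionally weighted by the central gradient magnitude.
    h = W // 2
    rows, cols = G.shape
    out = np.zeros_like(G, dtype=np.float64)
    for y0 in range(h, rows - h):
        for x0 in range(h, cols - h):
            Gw = G[y0 - h:y0 + h + 1, x0 - h:x0 + h + 1]
            Tw = theta[y0 - h:y0 + h + 1, x0 - h:x0 + h + 1]
            proj = np.abs(Gw * np.cos(theta[y0, x0] - Tw)).sum()
            kappa = proj / (Gw.sum() + 1e-12)
            out[y0, x0] = G[y0, x0] * kappa if weighted else kappa
    return out

# Example: rho = coherence(G, theta)  # with G, theta from the previous sketch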

3.12.3 Intrinsic Images with Oriented Texture

The images of coherence and of the dominant orientation previously calculated are
considered as intrinsic images (primal sketch) according to the paradigm of Marr
[41] which we will describe in another chapter.7 These images are obtained with
an approach independent of the applicability domain. They are also independent
of light conditions. Certain conditions can be imposed in relation to the type of
application to produce appropriate intrinsic images. These intrinsic images find a
field of use for the inspection of defects in the industrial automation sector (wood

7 Primal sketch indicates the first information that the human visual system extracts from the scene

and, in the context of image processing, corresponds to the first features extracted, such as borders, corners, homogeneous regions, etc. A primal sketch image can be thought of as equivalent to the significant strokes that an artist draws to express the scene.

defects, skins, textiles, etc.). From these intrinsic images, it is possible to model
primitives of oriented textures (spirals, ellipses, radial structures) to facilitate image
segmentation and interpretation.

3.13 Tamura’s Texture Features

Tamura et al. in [42] described an approach based on psychological experiments to


extract the texture features that correspond to human visual perception. The proposed set of perceptual texture features comprises six: coarseness, contrast, directionality, line-likeness, regularity, roughness.

Coarseness: has a direct relation to the scale and repetition frequency of the primitives (textels), i.e., it is related to the spatial distance over which the gray level of the textural structures varies strongly. Coarseness is referred to in [42] as the fundamental characteristic of the texture. The extremes of the coarseness property are coarse and fine. These properties help to identify texture macrostructures and microstructures, respectively. Basically, the measure of coarseness is calculated using local operators with windows of various sizes: a local operator with a large window can be used for coarse textures, while operators with small windows are adequate for fine textures. The measures of coarseness are calculated as follows (a minimal code sketch is given after the steps):

1. A sort of moving average is calculated on windows of variable size (see Fig. 3.23). The size of these windows is chosen as a power of two, i.e., 2^k × 2^k for k = 0, 1, ..., 5. The average is calculated by centering each window on each pixel (x, y) of the input image f(x, y) as follows:

A_k(x, y) = \frac{1}{2^{2k}} \sum_{i=x-2^{k-1}}^{x+2^{k-1}-1} \; \sum_{j=y-2^{k-1}}^{y+2^{k-1}-1} f(i, j)     (3.100)

thus obtaining for each pixel six average values as k varies over k = 0, 1, ..., 5.


2. For each pixel, we calculate the absolute differences E_k(x, y) between pairs of averages obtained from windows that do not overlap (see Fig. 3.24), both in the horizontal direction, E_{k,h}, and in the vertical direction, E_{k,v}, respectively given by

E_{k,h}(x, y) = |A_k(x + 2^{k-1}, y) - A_k(x - 2^{k-1}, y)|     (3.101)

E_{k,v}(x, y) = |A_k(x, y + 2^{k-1}) - A_k(x, y - 2^{k-1})|     (3.102)

3. For each pixel, choose the value of k that maximizes the difference E_k in either direction (horizontal or vertical), in order to select the highest difference value:

S_{best}(x, y) = 2^{k^*}, \quad \text{where} \quad k^* = \arg\max_{k=1,\ldots,5} \; \max_{d=h,v} E_{k,d}(x, y)     (3.103)

4. The final measure of coarseness T_crs is calculated by averaging S_best over the entire image:

T_{crs} = \frac{1}{M \cdot N} \sum_{i=1}^{M} \sum_{j=1}^{N} S_{best}(i, j)     (3.104)

where M × N is the size of the image.
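A minimal NumPy/SciPy sketch of the coarseness measure (3.100)–(3.104) follows; the local averages are approximated with a centered uniform filter and the horizontal/vertical differences with circular array shifts, so border handling and the exact window alignment are simplified with respect to the formulas.

import numpy as np
from scipy.ndimage import uniform_filter

def tamura_coarseness(img, kmax=5):
    img = img.astype(np.float64)
    # A_k: local averages over 2^k x 2^k windows, Eq. (3.100), approximated
    # with a centered uniform filter.
    A = [uniform_filter(img, size=2 ** k) for k in range(kmax + 1)]
    E = []
    for k in range(1, kmax + 1):
        d = 2 ** (k - 1)
        Ak = A[k]
        # Eq. (3.101)-(3.102): differences between non-overlapping windows on
        # opposite sides of each pixel (np.roll shifts the average maps).
        Ekh = np.abs(np.roll(Ak, -d, axis=1) - np.roll(Ak, d, axis=1))
        Ekv = np.abs(np.roll(Ak, -d, axis=0) - np.roll(Ak, d, axis=0))
        E.append(np.maximum(Ekh, Ekv))
    E = np.stack(E, axis=0)                    # shape (kmax, H, W)
    k_best = np.argmax(E, axis=0) + 1          # best k per pixel, Eq. (3.103)
    S_best = 2.0 ** k_best
    return S_best.mean()                       # T_crs, Eq. (3.104)

# Example: t_crs = tamura_coarseness(np.random.rand(128, 128) * 255)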

Contrast: evaluated by Tamura as another important feature according to the experimental evaluations of the perceived visual contrast. By definition, the contrast
generically indicates the visual quality of an image. More in detail, as proposed
by Tamura, the contrast is influenced by the following factors:

1. the dynamics of gray levels;


2. the accumulation of the gray-level distribution z toward the low values (black)
or toward the high values (white) of the relative histogram;
3. the sharpness of the edges;
4. the repeatability period of the structures (enlarged structures appear with low
contrast while reduced scale structures are more contrasted).

These 4 factors are considered separately to develop the estimation of the contrast measure. In particular, to evaluate the polarization of the distribution of gray levels, the statistical index of kurtosis is considered, which detects how much a distribution is flat or peaked with respect to the normal distribution.8
To consider the dynamics of gray levels (factor 1), Tamura includes the variance σ² in the contrast calculation. The contrast measure T_con is defined as follows:

T_{con} = \frac{\sigma}{\alpha_4^{m}}     (3.105)

with

\alpha_4 = \frac{\mu_4}{\sigma^4}     (3.106)

8 Normally a distribution is evaluated with respect to the normal distribution by considering two indexes, the asymmetry (or skewness) index γ_1 = μ_3 / μ_2^{3/2} and the kurtosis index γ_2 = μ_4 / μ_2^2 − 3, where μ_n indicates the central moment of order n. From the analysis of the two indexes we detect the deviation of a distribution from the normal one:
• γ_1 < 0: negative asymmetry, i.e., the left tail of the distribution is very long;
• γ_1 > 0: positive asymmetry, i.e., the right tail of the distribution is very long;
• γ_2 < 0: the distribution is platykurtic, i.e., flatter than the normal;
• γ_2 > 0: the distribution is leptokurtic, i.e., more peaked than the normal;
• γ_2 = 0: the distribution is mesokurtic, i.e., its kurtosis is similar to that of the normal distribution.

where μ_n indicates the central moment of order n (see Sect. 8.3.2 Vol. I) and m = 1/4 is the value determined experimentally by Tamura in 1978 as the one producing the best results. It is pointed out that the measure of the contrast (3.105) is based on the kurtosis index α_4 defined by Tamura with (3.106), which is different from the one currently used, as reported in footnote 8. The measure of the contrast expressed by (3.105) does not include factors 3 and 4 above.
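A direct NumPy sketch of the contrast measure (3.105)–(3.106), using m = 1/4, could be the following.

import numpy as np

def tamura_contrast(img):
    x = img.astype(np.float64).ravel()
    sigma = x.std()
    if sigma == 0:
        return 0.0
    mu4 = np.mean((x - x.mean()) ** 4)     # fourth central moment
    alpha4 = mu4 / sigma ** 4              # kurtosis index, Eq. (3.106)
    return sigma / alpha4 ** 0.25          # T_con with m = 1/4, Eq. (3.105)

# Example: t_con = tamura_contrast(np.random.rand(128, 128) * 255)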
Directionality: intended not as an orientation in itself but as the relevance of the presence of orientation in the texture. This is because it is not always easy to describe the orientation of a texture, while it is easier to assess whether two textures differ only in orientation; in that case their directionality is considered the same. The directionality measure is evaluated by taking into consideration the magnitude and the direction of the edges. Using the Prewitt operator (for edge extraction, see Sect. 1.7 Vol. II), the directionality is estimated from the horizontal and vertical derivatives Δ_x(x, y) and Δ_y(x, y), calculated by convolving the image f(x, y) with the 3 × 3 Prewitt kernels, and then evaluating, for each pixel (x, y), the magnitude |Δ| and the direction θ of the edge

|\Delta| = \sqrt{\Delta_x^2 + \Delta_y^2} \cong |\Delta_x| + |\Delta_y| \qquad \theta = \tan^{-1}\frac{\Delta_y}{\Delta_x} + \frac{\pi}{2}     (3.107)
Subsequently, a histogram H_dir(θ) is constructed from the values of the quantized directions (normally 16 directions), evaluating the frequency of the edge pixels whose magnitude exceeds a certain threshold. The histogram is relatively uniform for images without strong orientations and presents distinct peaks for images with oriented texture. The directionality measure T_dir proposed by Tamura considers the sum of the second-order moments of H_dir relative only to the values around the peaks of the histogram, between adjacent valleys, given by

T_{dir} = 1 - r \cdot n_p \sum_{p=1}^{n_p} \sum_{\theta \in w_p} (\theta - \theta_p)^2 H_{dir}(\theta)     (3.108)

where n_p indicates the number of peaks, θ_p is the position of the p-th peak, w_p indicates the range of angles included in the p-th peak (i.e., the interval between the valleys adjacent to the peak), r is a normalization factor associated with the quantized values of the angles θ, and θ is the quantized angle. Alternatively, we can take as the measure of directionality T_dir the sum of the second-order moments of all the values of the histogram, instead of considering only those around the peaks.
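The sketch below (NumPy/SciPy) illustrates how a directionality measure can be computed; it uses the Prewitt derivatives of (3.107), 16 histogram bins, and a single dominant peak instead of the full peak/valley analysis of (3.108), so it is a simplified variant with arbitrarily chosen threshold and normalization factor.

import numpy as np
from scipy.ndimage import prewitt

def tamura_directionality(img, n_bins=16, threshold=10.0, r=1.0):
    img = img.astype(np.float64)
    dx = prewitt(img, axis=1)                  # horizontal derivative
    dy = prewitt(img, axis=0)                  # vertical derivative
    mag = np.abs(dx) + np.abs(dy)              # |Delta|, approximation in Eq. (3.107)
    theta = np.arctan(dy / (dx + 1e-12)) + np.pi / 2   # direction in (0, pi)
    # Histogram of quantized directions for edge pixels with strong magnitude.
    valid = mag > threshold
    hist, edges = np.histogram(theta[valid], bins=n_bins, range=(0, np.pi))
    hist = hist / (hist.sum() + 1e-12)
    centers = (edges[:-1] + edges[1:]) / 2
    # Simplified single-peak variant of Eq. (3.108): second moment of the
    # histogram around its dominant peak.
    theta_p = centers[np.argmax(hist)]
    return 1.0 - r * np.sum(((centers - theta_p) ** 2) * hist)

# Example: t_dir = tamura_directionality(np.random.rand(128, 128) * 255)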
Line-Likeness: refers to local texture structures composed of lines: when the directions of an edge and of the nearby edges are almost equal, they are considered similar linear structures. The measure of line-likeness is calculated similarly to the co-occurrence matrix GLCM (described in Sect. 3.4.4), except that in this case what is computed is the frequency of the directional co-occurrence between edge pixels that are at a distance d and have similar direction (more precisely, if the orientation of

the relative edges is kept within an orientation interval). In the computation of the
directional co-occurrence matrix PDd edge with module higher than a predefined
threshold is considered by filtering the weak ones. The directional co-occurrence
is weighted with the cosine of the difference of the angles of the edge pair, and
in this way, the co-occurrences in the same direction are measured with +1 and
those with perpendicular directions with −1. The line-likeness measure Tlin is
given by n n
j=1 PDd (i, j)cos[(i − j) n ]

i=1
Tlin =  n  n (3.109)
i=1 j=1 PDd (i, j)

where the directional co-occurrence matrix PDd (i, j) has dimensions n × n and
has been calculated by Tamura using the distance d = 4 pixels.
Regularity: intended as a measure that captures information on the spatial regularity of texture structures. A texture without repetitive spatial variations is considered regular, unlike a texture that has strong spatial variations, which is observed as irregular. The measure of regularity proposed by Tamura is derived from the combination of the previous texture measures of coarseness, contrast, directionality, and line-likeness.
These 4 measurements are calculated by partitioning the image into regions of equal size, obtaining a vector of measures for each region. The measure of regularity is thought of as a measure of the variability of the 4 measures over the entire image (i.e., over all regions). A small variation in the first 4 measurements indicates a regular texture. Therefore, the regularity measure T_reg is defined as follows:

T_{reg} = 1 - r\,(\sigma_{crs} + \sigma_{con} + \sigma_{dir} + \sigma_{lin})     (3.110)

where σ_xxx is the standard deviation of each of the previous 4 measures and r


is a useful normalization factor to compensate for the different image sizes. The
standard deviations relative to the 4 measurements are calculated using the values
of the measurements derived from each region (sub-image).
Roughness: is a property that recalls touch rather than visual perception. Neverthe-
less, it can be a useful property to describe the visual texture. Tamura’s experiments
motivate the measure of coarseness not so much due to the visual perception of
the variation of the gray levels of the image as to the tactile imagination or to the
sensation of physically touching the texture. From Tamura’s psychological exper-
iments no physical-mathematical models emerged to derive roughness measures.
A rough approximate measure of the roughness Trgh is proposed, based on the
combination of the measure of coarseness and contrast:

T_{rgh} = T_{crs} + T_{con}     (3.111)

Figure 3.25 shows the first 3 Tamura texture measurements estimated for some
sample images. Tamura texture measurements are widely used in image retrieval applications based on the visual attributes contained in the image (known in the literature as


Fig. 3.23 Generation of the set {Ak }k=0,1,··· ,5 of average images at different scales for each pixel
of the input image


Fig. 3.24 From the set {A_k}_{k=0,1,...,5} of average images at different scales k, for each pixel P(x, y) the absolute differences E_{k,h} and E_{k,v} are calculated between the averages of non-overlapping windows on opposite sides, in the horizontal direction h and vertical direction v, respectively, as shown in the figure

T_crs = 25.6831 T_crs = 6.2593 T_crs = 13.257 T_crs = 31.076 T_crs = 14.987
T_con = 0.7868 T_con = 0.2459 T_con = 0.3923 T_con = 0.3754 T_con = 0.6981
T_dir = 0.0085 T_dir = 0.0045 T_dir = 1.5493 T_dir = 0.0068 T_dir = 1.898

Fig. 3.25 The first 3 Tamura texture measures (coarseness, contrast, and directionality) calculated
for some types of images

CBIR, Content-Based Image Retrieval) [43]. However, they have many limitations in discriminating fine textures. Often the first three Tamura measurements are used, treated as a 3D image whose three components Coarseness-coNtrast-Directionality (CND) are considered in analogy to the RGB components. Mono- and multidimensional histograms can be calculated from the CND image. More accurate measurements can
be calculated using other edge extraction operators (for example, Sobel). Tamura
measurements have been extended to deal with 3D images [44].

References
1. R.M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification. IEEE
Trans. Syst. Man Cybern. B Cybern. 3(6), 610–621 (1973)
2. B. Julesz, Visual pattern discrimination. IRE Trans. Inf. Theory 8(2), 84–92 (1962)
3. B. Julesz, Textons, the elements of texture perception, and their interactions. Nature 290, 91–97
(1981)
4. R. Rosenholtz, Texture perception, in Oxford Handbook of Perceptual Organization, ed. by J. Wagemans (Oxford University Press, 2015), pp. 167–186. ISBN 9780199686858
5. T. Caelli, B.Julesz, E.N. Gilbert, On perceptual analyzers underlying visual texture discrimi-
nation. Part II. Biol. Cybern. 29(4), 201–214 (1978)
6. J.R. Bergen, E.H. Adelson, Early vision and texture perception. Nature 333(6171), 363–364
(1988)
7. R. Rosenholtz, Computational modeling of visual texture segregation, in Computational Models
of Visual Processing, ed. by M. Landy, J.A. Movshon (MIT Press, Cambridge, MA, 1991), pp.
253–271
8. R. Haralick, Statistical and structural approaches to texture. Proc. IEEE 67(5), 786–804 (1979)
9. Y. Chen, E. Dougherty, Grey-scale morphological granulometric texture classification. Opt.
Eng. 33(8), 2713–2722 (1994)
10. C. Lu, P. Chung, C. Chen, Unsupervised texture segmentation via wavelet transform. Pattern
Recognit. 30(5), 729–742 (1997)
11. A.K. Jain, F. Farrokhnia, Unsupervised texture segmentation using gabor filters. Pattern Recog-
nit. 24(12), 1167–1186 (1991)
12. A. Bovik, M. Clark, W. Giesler. Multichannel texture analysis using localised spatial fil-
ters.IEEE Trans. Pattern Anal. Mach. Intell. 2, 55–73 (1990)
13. A. Pentland, Fractal-based description of natural scenes. IEEE Trans. Pattern Anal. Mach.
Intell. 6(6), 661–674 (1984)
14. G. Lowitz, Can a local histogram really map texture information? Pattern Recognit. 16(2),
141–147 (1983)
15. R. Lerski, K. Straughan, L. Schad, D. Boyce, S. Blüml, I. Zuna, Mr image texture analysis an
approach to tissue characterisation. Magn. Reson. Imaging 11, 873–887 (1993)
16. W.K. Pratt, Digital Image Processing, 2nd edn. (Wiley, 1991). ISBN 0-471-85766-1
17. S.W. Zucker, D. Terzopoulos, Finding structure in co-occurrence matrices for texture analysis.
Comput. Graphics Image Process. 12, 286–308 (1980)
18. L. Alparone, F. Argenti, G. Benelli, Fast calculation of co-occurrence matrix parameters for
image segmentation. Electron. Lett. 26(1), 23–24 (1990)
19. L.S. Davis, A. Mitiche, Edge detection in textures. IEEE Comput. Graphics Image Process.
12, 25–39 (1980)

20. M.M. Galloway, Texture classification using grey level runlengths. Comput. Graphics Image
Process. 4, 172–179 (1975)
21. X. Tang, Texture information in run-length matrices. IEEE Trans. Image Process. 7(11), 1602–
1609 (1998)
22. A. Chu, C.M. Sehgal, J.F. Greenleaf, Use of gray value distribution of run lengths for texture
analysis. Pattern Recogn. Lett. 11, 415–420 (1990)
23. B.R. Dasarathy, E.B. Holder, Image characterizations based on joint gray-level run-length
distributions. Pattern Recogn. Lett. 12, 497–502 (1991)
24. G.C. Cross, A.K. Jain, Markov random field texture models. IEEE Trans. Pattern Anal. Mach.
Intell. 5, 25–39 (1983)
25. R. Chellappa, S. Chatterjee, Classification of textures using gaussian markov random fields.
IEEE Trans. Acoust. Speech Signal Process. 33, 959–963 (1985)
26. J.C. Mao, A.K. Jain, Texture classification and segmentation using multiresolution simultane-
ous autoregressive models. Pattern Recognit. 25, 173–188 (1992)
27. Divyanshu Rao Sumit Sharma, Ravi Mohan, Classification of image at different resolution
using rotation invariant model. Int. J. Innovative Res. Adv. Eng. 1(4), 109–113 (2014)
28. B.B. Mandelbrot, The Fractal Geometry of Nature (Freeman, Cityplace San Francisco, 1983)
29. J.M. Keller, S. Chen, R.M. Crownover, Texture description and segmentation through fractal
geometry. Comput. Vis. Graphics Image Process. 45(2), 150–166 (1989)
30. R.F. Voss, Random fractals: Characterization and measurement, in Scaling Phenomena in
Disordered Systems, ed. by R. Pynn, A. Skjeltorp (Plenum, New York, 1985), pp. 1–11
31. K.I. Laws, Texture energy measures. in Proceedings of Image Understanding Workshop, pp.
41–51 (1979)
32. M.T. Suzuki, Y. Yaginuma, A solid texture analysis based on three dimensional convolution
kernels. in Proceedings of the SPIE, vol. 6491, pp. 1–8 (2007)
33. D. Gabor, Theory of communication. IEEE Proc. 93(26), 429–441 (1946)
34. J.G. Daugman, Uncertainty relation for resolution, spatial frequency, and orientation optimized
by 2d visual cortical filters. J. Opt. Soc. Am. A 2, 1160–1169 (1985)
35. J. Malik, P. Perona, Preattentive texture discrimination with early vision mechanism. J. Opt.
Soc. Am. A 5, 923–932 (1990)
36. I. Fogel, D. Sagi, Gabor filters as texture discriminator. Biol. Cybern. 61, 102–113 (1989)
37. T. Chang, C.C.J. Kuo, Texture analysis and classification with tree-structured wavelet trans-
form. IEEE Trans. Image Process. 2(4), 429–441 (1993)
38. J.L. Chen, A. Kundu, Rotation and gray scale transform invariant texture identification using
wavelet decomposition and hidden markov model. IEEE Trans. PAMI 16(2), 208–214 (1994)
39. M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis and Machine Vision. CL Engineer-
ing, third edition (2007). ISBN 978-0495082521
40. A.R. Rao, R.C. Jain, Computerized flow field analysis: oriented texture fields. IEEE Trans.
Pattern Anal. Mach. Intell. 14(7), 693–709 (1992)
41. D. Marr, S. Ullman, Directional selectivity and its use in early visual processing. in Proceedings
of the Royal Society of London. Series B, Biological Sciences, vol. 211, pp. 151–180 (1981)
42. H. Tamura, S. Mori, T. Yamawaki, Textural features corresponding to visual perception. IEEE
Trans. Syst. Man Cybern. B Cybern. SMC 8(6), 460–473 (1978)
43. S.H. Shirazi, A.I. Umar, S.Naz, N. ul Amin Khan, M.I. Razzak, B. AlHaqbani, Content-based
image retrieval using texture color shape and region. Int. J. Adv. Comput. Sci. Appl. 7(1),
418–426 (2016)
44. T. Majtner, D. Svoboda, Extension of tamura texture features for 3d fluorescence microscopy.
in Proceedings of 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIM-
PVT), pp. 301–307. IEEE, 2012. ISBN 978-1-4673-4470-8
4 Paradigms for 3D Vision

4.1 Introduction to 3D Vision

In the previous chapters, we analyzed 2D images and developed algorithms to rep-


resent, describe, and recognize the objects of the scene. In this context, the intrinsic
characteristics of the 3D nature of the objects were not strictly considered, and con-
sequently, the description of the objects did not include formal 3D information. In
different applications, for example, in remote sensing (image classification), in the
visual inspection of defects in the industrial sector, in the analysis of microscope
images, and in recognizing characters and shapes, while observing 3D scenes, it is
not necessarily required 3D vision systems.
In other applications, a 3D vision system is required, i.e., a system capable of
analyzing 2D images to correctly reconstruct and understand a scene typically of 3D
objects. Imagine, for example, a mobile robot that must move freely in a typically 3D
environment (industrial, domestic, etc.), its vision system must be able to recognize
the objects of the scene, identify unexpected obstacles and avoid them, and calculate
its position and orientation with respect to fixed points of the environment and visible
in the scene. Another example is the vision systems of robotized cells for industrial
automation, where a mechanical arm, guided by the vision system, can be used to pick
up and release objects from a bin (the bin-picking problem), and the 3D vision system must be able to locate the candidate object to be grasped and calculate its attitude, even in the context of overlapping objects (picking from a stack of objects).
A 3D vision system has the fundamental problem typical of inverse problems,
that is, from single 2D images, which are only a two-dimensional projection of the
3D world (partial acquisition), must be able to reconstruct the 3D structure of the
observed scene and eventually define the relationship between the objects. In other
words, regardless of the complexity of the algorithms, the 3D reconstruction must
take place starting from the 2D images that contain only partial information of the
3D world (loss of information from the projection 3D → 2D) and possibly using the
geometric and radiometric parameters of calibration of the acquisition system (for
example, a camera).

The human visual system addresses the problems of 3D vision using a binocular
visual system, a remarkable richness of elementary processors (neurons) and a model
of reconstruction based also on the a priori prediction and knowledge of the world.
In the field of artificial vision, the current trend is to develop 3D systems oriented
to specific domains but with characteristics that go in the direction of imitating
some functions of the human visual system: for example, using systems with multiple cameras, analyzing time-varying image sequences, observing the scene from multiple points of view, and making the most of prior knowledge about the specific
application. With 3D vision systems based on these features, it is possible to try
to optimize the 2D to 3D inversion process, obtaining the least ambiguous results
possible.

4.2 Toward an Optimal 3D Vision Strategy

Once the scene is reconstructed, the vision system performs the perception phase
trying to make hypotheses that are verified with the predicted model and evaluating
its validity. If a hypothesis cannot be accepted, a new hypothesis of description of
the scene is reformulated until the comparison with the model is acceptable. For the
formation of the hypothesis and the verification with the model, different processing
steps are required for the acquired data (2D images) and for the data known a priori
that represent the models of the world (the a priori knowledge for a certain domain).
A 3D vision system must be incremental in the sense that its elementary process
components (tasks) can be extended to include new descriptions to represent the
model and to extract new features from the images of the scene. In a vision system,
the 3D reconstruction of the scene and the understanding of the scene (perception),
are the highest level tasks that are based on the results achieved by the lowest level
tasks (acquisition, pre-processing, feature extraction, . . .) and the intermediate one
(segmentation, clustering, etc.). The understanding of the scene can be achieved only
through the cooperation of the various elementary calculation processes and through
an appropriate control strategy for the execution of these processes.
In fact, the biological visual systems have different control strategies including sig-
nificant parallel computing skills, a complex computational model based on learning,
remarkable adaptive capacity, and high incremental capacity in knowledge learning.
An artificial vision system to imitate some functions of the biological system should
include the following features:

Control of elementary processes: in the sequential and parallel context. It is not always possible to implement, with the available hardware, some typical algorithms of the first stages of vision (early vision) as parallel processes.
Hierarchical control strategies: Bottom-up (Data Driven). Starting from the
acquired 2D image, the vision system can perform the reconstruction and recog-
nition of the scene through the following phases:

1. Several image processing algorithms are applied to the raw data (pre-
processing and data transformation) to make them available to higher-level
processes (for example, to the segmentation process).
2. Extraction of higher-level information such as homogeneous regions corre-
sponding to parts of the object and objects of the scene.
3. Reconstruction and understanding of the scene based on the results of point
(2) and on the basis of a priori knowledge.

Hierarchical control strategies: Top-down (Model Driven). In relation to the type


of application, some assumptions and properties on the scene to be reconstructed
and recognized are defined. In the various phases of the process, in a top-down
way, these expected assumptions and properties are verified for the assumed
model, up to the elementary data of the image. The vision system essentially
verifies the internal information of the model by accepting or rejecting it. This
type of approach is a goal-oriented process. A problem is decomposed into sub-
problems that require lower-level processes that are recursively decomposed into
other subproblems until they are accepted or rejected. The top-down approach
generates hypotheses and verifies them. For a vision system based on the top-
down approach, its ability consists of updating the internal model during the
reconstruction/recognition phase, according to the results obtained in the hypoth-
esis verification phase. This last phase (verification) is based on the information
acquired by the Low-Level algorithms. The top-down approach applied for vision
systems can significantly reduce computational complexity with the additional
possibility of using appropriate parallel architectures for image processing.
Hybrid control strategies: Data and Model Driven. They offer better results than
individual strategies. All high-level information is used to simplify low-level
processes that are not sufficient to solve problems. For example, suppose you
want to recognize planes in an airport from an image taken from satellite or
airplane. Starting with the data-driven approach, it is necessary to identify the aircraft first, but at the same time high-level information can be used, since the planes will appear with a predefined shape and there is a high probability of observing them on the runways, on the connecting sections, and in parking lots.
Nonhierarchical control strategies: In these cases, different experts or processes
can compete to cooperate at the same level. The problem is broken down into sev-
eral subproblems for each of which some knowledge and experience is required.
A nonhierarchical control requires the following steps:

1. Make the best choice based on the knowledge and current status achieved.
2. Use the results achieved with the last choice to improve and increase the
available information about the problem.
3. If the goal has not been reached, return to the first step otherwise the procedure
ends.

4.3 Toward the Marr’s Paradigm

Inspired by the visual systems of nature, two aspects should be understood.


The first aspect regards the nature of the input information to the biological vision
system, that is, the varying space-time patterns associated with the light projected
on the retina and with the modalities of encoding the information of the external
environment, typically 3D.
The second aspect concerns the output of the biological visual system, i.e., how
the external environment (objects, surface of objects, and events) is represented so
that a living being can organize its activities.
Responding adequately to these two aspects means characterizing a biological
vision system. In other words, it is necessary to know what information a living
being needs and how this information is encoded in the patterns of light that is
projected onto the retina. For this purpose, it is necessary to discover a theory that
highlights how information is extracted from the retina (input) and, ultimately, how
this information is propagated to the neurons of the brain.
Marr [1] was the first to propose a paradigm that explains how the biological visual
system processes the complex input and output information for visual perception.
The perception process includes three levels:

1. The level of computational theory. The relationship between the quantities to


be calculated (data) and the observations (images) must be developed through
a rigorous physical-mathematical approach. After this computational theory is
developed, it must be understood whether or not the problem has at least one solu-
tion. Ad hoc solutions and heuristic approaches are replaced with methodologies based on a solid and rigorous (physical-mathematical) formulation of the problems.
2. The level of algorithms and data structures. After the Computational Theory is
completed, the appropriate algorithms are designed, and, when applied to the
input images, they will produce the expected results.
3. The implementation level. After the first two levels are developed, the algorithms
can be implemented with adequate hardware (using serial or parallel computing
architectures) thus obtaining an operational vision machine.

Some considerations on the Marr paradigm:

(a) The level of algorithms in the Marr model tacitly includes the level of robustness
and stability.
(b) In developing a component of the vision system, the three levels (computational
theory, algorithms, and implementation) are often considered, but when activat-
ing the vision process using real data (images) it is possible to obtain absurd
results for having neglected (or not well modeled), for example, in the compu-
tational level, the noise present in the input images.
(c) Need to introduce stability criteria for the algorithms, assuming, for example, a
noise model.

(d) Another source of noise is given by the use of uncertain intermediate data (such
as points, edges, lines, contours, etc.) and in this case, stability can be feasible
using statistical analysis.
(e) Analysis of the stability of results is a current and future theme of research that
will avoid the problem of algorithms that have an elegant mathematical basis but
do not operate properly on real data.

The bottom-up approaches are potentially more general since they operate only
considering the information extracted from the 2D images, together with the cali-
bration data of the acquisition system, to interpret the 3D objects of the scene. The
vision systems of the type bottom-up are oriented for more general applications.
The top-down approaches assume the presence of particular objects or classes of
objects that are localized in the 2D images and the problems are solved in a more
deterministic way. The vision systems of the top-down type are oriented to solving more specific applications, rather than providing a more general theory of vision.
Marr observes that the complexity of vision processes imposes a sequence of ele-
mentary processes to improve the geometric description of the visible surface. From
the pixels, it is necessary to delineate the surface and derive some of its character-
istics, for example, the orientation and the depth with respect to the observer, and
finally arrive at the complete 3D description of the object.

4.4 The Fundamentals of Marr’s Theory

The input to the biological visual system is the set of images formed on the retina, seen as a matrix of intensity values of the light reflected by the physical structures of the observed external environment.
The goal of the first stages of vision (early vision) is to create, from the 2D image, a
description of the physical structures: the shape of the surface and of the objects, their
orientation, and distance from the observer. This goal is achieved by constructing
a distinct number of representations, starting from the variations in light intensity
observed in the image. This first representation is called Primal Sketch.
This primary information describes the variations in intensity present in the image
and makes some global structures explicit. This first stage of the vision process,
locates the discontinuity of light intensity in correspondence with the edge points,
which often coincide with the geometric discontinuities of the physical structures of
the observed scene.
The primal sketches correspond to the edges and small homogeneous areas present
in the image, including their location, orientation and whatever else can be deter-
mined. From this primary information, by applying adequate algorithms based on
group theory, more complex primary structures (contours, regions, and texture) can
be derived, called full primal sketch.
The ultimate goal of early vision processes is to describe the surface and shape of
objects with respect to the observer, i.e., to produce a world representation observed in

the reference system centered with respect to the observer (viewer-centred). In other
words, the early vision process in this viewer-centered reference system produces
a representation of the world, called 2.5D sketch. This information is obtained by
analyzing the information on depth, movement, and derived shape, analyzing the
primal sketch structures. The extracted 2.5D structures describe the structures of the
world with respect to the observation point. A vision system must fully recognize
an object. In other words, it is necessary that the 2.5D viewer centered structures
are expressed in the object reference system (object-centered) and not referred to
the observer. Marr indicates this level of representation of the world as 3D model
representation.
In this process of 3D formation of the world model, all the primary informa-
tion extracted from the primary stages of vision are used, proceeding according to
a bottom-up model based on general constraints of 3D reconstruction of objects,
rather than on specific hypotheses of the object. The Marr paradigm foresees a dis-
tinct number of levels of representation of the world, each of them is a symbolic
representation of some aspects of information derived from the retinal image.
Marr’s paradigm sees the vision process based on a computational model of a set
of symbolic descriptions of the input image. The process of recognizing an object, for
example, can be considered achieved when one, among the many descriptions derived
from the image, is comparable with one of those memorized, which constitutes the
representation of a particular class of the known object. Different computational
models are developed for the recognition of objects, and their diversity is based on
how concepts are represented as distributed activities on different elementary process
units. Some algorithms have been implemented, based on the neural computational
model to solve the problem of depth perception and object recognition.

4.4.1 Primal Sketch

From the 2D image, the primary information is extracted or primal sketch which can
be any elementary structure such as edges, straight and right angle edges (corner),
texture, and other discontinuities present in the image. These elementary structures
are then grouped to represent higher-level physical structures (contours, parts of an
object) that can be used later to provide 3D information of the object, for example, the
superficial orientation with respect to the observer. These primal sketch structures
can be extracted from the image at different geometric resolutions just to verify their
physical consistency in the scene.
The primal sketch structures, derived by analyzing the image, are based on the
assumption that there is a relationship between zones in the image where the light
intensity and the spectral composition varies, and the areas of the environment where
the surface or objects are delimited (border between different surfaces or different
objects). Let us immediately point out that this relationship is not univocal, and it is
not simple. There are reasonable considerations for not assuming that any variation
in luminous or spectral intensity in the image corresponds to the boundary or edge of
an object or a surface of the scene. For example, consider an environment consisting


Fig. 4.1 Brightness fluctuation even in homogeneous areas of the image as can be seen from the
graph of the intensity profile relative to line 320 of the image

of objects with matte surfaces, i.e., that the light reflected in all directions from every
point on the surface has the same intensity and spectral composition (Lambertian
model, described in Chap. 2 Vol. I).
In these conditions, the boundaries of an object or the edges of a surface will
emerge in the image in correspondence of the variations of intensity. It is found that
in reality, these discontinuities in the image emerge also for other causes, for example,
due to the effect of the edges derived from a shadow that falls on an observed surface.
The luminous intensity varies even in the absence of geometric discontinuities of the surface, as a consequence of the fact that the intensity of the reflected light is a function of the angle of the surface with respect to the incident light direction. The intensity of
the reflected light has maximum value if the surface is perpendicular to the incident
light, and decreases as the surface rotates in other directions. The luminous intensity
changes in the image, in the presence of a curved surface and in the presence of
texture, especially in natural scenes. In the latter case, the intensity and spectral
composition of the light reflected in a particular direction by a surface with texture
varies locally and consequently generates a spatial variation in luminous intensity
within a particular region of the image corresponding to a particular surface.
To get an idea about the complex relationship between natural surface and the
reflected light intensity resulting in the image, see Fig. 4.1 where it is shown how
the brightness varies in a line of the image of a natural scene. Brightness variations
are observed in correspondence of variation of the physical structures (contours and
texture) but also in correspondence of the background and of homogeneous physical
structures. Brightness fluctuations are also due to the texture present in some areas
and to the change in orientation of the surface with respect to the direction of the
source.
To solve the problem of the complex relationship between the structures of natural
scenes and the structures present in the image, Marr proposes a two-step approach.

Fig. 4.2 Results of the LoG filter applied to the Koala image. The first line shows the extracted
contours with the increasing scale of the filter (from left to right) while in the second line the images
of the zero crossing are shown (closed contours)

In the first step the image is processed making explicit the significant variations
in brightness, thus obtaining what Marr calls a representation of the image, in terms
of raw primal sketch.
In the second step the edges are identified with particular algorithms that process
the information raw primal sketch of the previous step to describe information and
structures of a higher level, called Perceptual Chunks.
Marr’s approach has the advantage of being able to use raw primal sketch informa-
tion for other perceptual processes that operate in parallel, for example, to calculate
depth or movement information. The first stages of vision are influenced by the noise
of the image acquisition system. In Fig. 4.1 it can be observed how the brightness
variations in the image are present also in correspondence of uniform surfaces. These
variations at different scales are partly caused by noise.
In the Chap. 4 Vol. II we have seen how it is possible to reduce the noise present
in the image, applying an adequate smoothing filter that does not alter the significant
structures of the image (attenuation of the high frequencies corresponding to the
edges). Marr and Hildreth [2] proposed an original algorithm to extract raw primal
sketch information by processing images of natural scenes.
The algorithm has been described in Sect. 1.13 Vol. II, also called the Laplacian
of Gaussian (LoG) filter operator, used for edge extraction and zero crossing. In the
context of extracting the raw primal sketch information, the LoG filter is used with
different Gaussian filters to obtain raw primal sketch at different scales for the same
image as shown in Fig. 4.2. It is observed how the various maps of raw primal sketch
(zero crossing in this case) represent the physical structures at different scales of
representation.
In particular, the very narrow Gaussian filter highlights the noise together with
significant variations in brightness (small variations in brightness are due to noise

Fig. 4.3 Laplacian filter applied to the image of Fig. 4.1

and physical structures), while as a wider filter is used only zero crossing remain
corresponding to significant variations (which are not accurately localized) to be
associated with the real structures of the scene with the almost elimination of noise.
Marr and Hildreth proposed to combine the various raw primal sketch maps extracted
at different scales to obtain more robust primal sketch than the original image with
contours, edges, homogeneous areas (see Fig. 4.3).
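A minimal sketch of this multiscale procedure (NumPy/SciPy, with arbitrarily chosen values of sigma) computes the LoG response ∇²G ∗ I at several scales and marks the zero crossings as sign changes between horizontally or vertically adjacent pixels.

import numpy as np
from scipy.ndimage import gaussian_laplace

def zero_crossings(response):
    # A pixel is marked as a zero crossing if the LoG response changes sign
    # with respect to its right or bottom neighbor.
    zc = np.zeros(response.shape, dtype=bool)
    zc[:, :-1] |= (response[:, :-1] * response[:, 1:]) < 0
    zc[:-1, :] |= (response[:-1, :] * response[1:, :]) < 0
    return zc

def multiscale_raw_primal_sketch(img, sigmas=(1.0, 2.0, 4.0, 8.0)):
    img = img.astype(np.float64)
    maps = {}
    for s in sigmas:
        log = gaussian_laplace(img, sigma=s)   # LoG response at scale s
        maps[s] = zero_crossings(log)
    return maps

# Example: zc_maps = multiscale_raw_primal_sketch(np.random.rand(256, 256))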
Marr and Hildreth assert that at least in the early stages of biological vision the LoG
filter is implemented for the extraction of the zero crossing at different filtering scales.
The biological evidence of the theory of Marr and Hildreth has been demonstrated
by several researchers. In 1953 Kuffler had discovered the spatial organization of the
receptive fields of retinal ganglion cells (see Sect. 3.2 Vol. I).
In particular, Kuffler [3] discovered the effect on ganglion cells of a luminous spot
and observed concentric receptive fields with circular symmetry with a central region
of excitation (sign +) and a surrounding inhibitor (see Fig. 4.4). Some ganglion cells
instead presented receptive fields with concentric regions excited of opposite sign.
In 1966 Enroth-Cugell and Robson [4] discovered, in relation to temporal response
properties, the existence of two types of ganglion cells, called X and Y cells. The
X cells have a linear response, proportional to the difference between the intensity
of light that affects the two areas and this response is maintained over time. The
Y cells do not have a linear response and are transient. This cellular distinction is
also maintained up to the lateral geniculate nucleus of the visual cortex. Enroth-
Cugell and Robson showed that the intensity contribution for both areas is weighted
according to a Gaussian distribution, and the resulting receptive field is described
as the difference of two Gaussians (called Difference of Gaussian (DoG) filter, see
Sect. 1.14 Vol. II).
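To make the DoG/LoG relationship concrete, the following sketch (NumPy/SciPy) compares the difference of two Gaussian-smoothed images with the LoG response; the ratio of about 1.6 between the two Gaussian standard deviations is the value commonly associated with a good approximation of the LoG, and the image used here is synthetic.

import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def dog(img, sigma, k=1.6):
    # Difference of Gaussians: a wider Gaussian minus a narrower one.
    img = img.astype(np.float64)
    return gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)

# Up to a scale factor, the DoG response is expected to be close to the LoG response.
img = np.random.rand(128, 128)
d = dog(img, sigma=2.0)
l = gaussian_laplace(img, sigma=2.0)
corr = np.corrcoef(d.ravel(), l.ravel())[0, 1]
print(f"correlation DoG vs LoG: {corr:.3f}")   # typically close to 1 in absolute value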

Fig. 4.4 Receptive fields of the center-ON and middle-OFF ganglion cells. They have the charac-
teristic of having both the receptive field almost circular and divided into two parts, an internal area
called center and an external one called peripheral. Both respond well to changes in lighting between
the center and the periphery of their receptive fields. They are divided into two classes of ganglion
cells center-ON and center-OFF, based on the different responses when excited by a light beam. As
shown in the figure, the first (center-ON) respond with excitement when the light is directed to the
center of the field (with spotlight or with light that illuminates the entire center), while the latter
(center-OFF) behave in the opposite way, that is, they are very little excited. Conversely, if the
light beam affects the peripheral part of both, they are the center-OFF that respond very well (they
generate a short electrical excitable membrane signal) while the center-ON are inhibited. Ganglion
cells respond primarily to differences in brightness, making our visual system sensitive to local
spatial variations, rather than the absolute magnitude (or intensity) of the light affecting the retina

From this, it follows that the LoG operator can be seen as functionally equivalent to the DoG, and the output of the ∇²G ∗ I operator is analogous to the response of the retinal X cells and of the cells of the lateral geniculate nucleus (LGN). Positive values of ∇²G ∗ I correspond to the central zone of the X cells and the negative values to the surrounding concentric zone. In this hypothesis, the problem arises that both positive and negative values must be available for the determination of the zero crossings in the image ∇²G ∗ I. This would not be possible since the nerve cells cannot operate with negative values in their responses to calculate the zero crossings. Marr and Hildreth explained, for this reason, the existence of cells in the visual cortex that are excited with opposite signs as shown in Fig. 4.4.
This hypothesis is weakened if we consider the inadequacy of the concentric areas of
the receptive fields of the X cells for the accurate calculation of the function ∇²G ∗ I.
With the interpretation of Marr and Hildreth, cells with concentric receptive fields
cannot determine the presence of edges in the classical way, as done by all the other
edge extraction algorithms (see Chap. 1 Vol. II).
A plausible biological rule for extracting the zero crossings from the X cells would
be to find adjacent active cells that operate with positive and negative values in
the central receptive area, respectively, as shown in Fig. 4.5. The zero crossings are
determined with the logical AND of two cells, one center-ON and one center-OFF
(see Fig. 4.5a).
With this idea it is also possible to extract segments of zero crossings by organizing
the cells in two ordered columns with receptive fields of opposite sign (see Fig. 4.5b).
Two X cells of opposite sign are connected through a logical AND gate,
4.4 The Fundamentals of Marr’s Theory 325


Fig. 4.5 Functional scheme proposed by Marr and Hildreth for the detection of zero crossings by
cells of the visual cortex. a Overlapping receptive fields of two cells, center-ON and center-OFF,
of the LGN; if both are active, a zero crossing ZC is located between the two cells and is detected
when they are connected by a logical AND conjunction. b Several AND logic circuits, each
associated with a pair of center-ON and center-OFF cells operating in parallel, together detect
an oriented segment of zero crossings

producing an output signal only if both cells are active, indicating the presence of
a zero crossing between them. The biological evidence for the LoG operator explains
some features of the visual cortex and how it works, at least in the early stages of
visual perception, for the computation of edge segments. It is less easy to explain how the
nervous system combines this elementary information (zero crossings), generated by
the ganglion cells, to obtain the information called primal sketch.
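The AND scheme of Fig. 4.5 can be mimicked computationally by thresholding the filtered image into "ON" (positive) and "OFF" (negative) responses and marking a zero crossing wherever an ON pixel is adjacent to an OFF pixel. The sketch below is purely illustrative; the helper dog_kernel is the hypothetical kernel sketched earlier and the threshold eps is an assumption:

```python
import numpy as np
from scipy.signal import convolve2d

def zero_crossings(response, eps=1e-6):
    """Mark pixels where the center-surround response changes sign between
    horizontal or vertical neighbours (logical AND of ON and OFF responses)."""
    on = response > eps          # analogue of active center-ON cells
    off = response < -eps        # analogue of active center-OFF cells
    zc = np.zeros_like(on)
    zc[:, :-1] |= on[:, :-1] & off[:, 1:]    # ON to the left of OFF
    zc[:, 1:]  |= on[:, 1:]  & off[:, :-1]   # ON to the right of OFF
    zc[:-1, :] |= on[:-1, :] & off[1:, :]    # ON above OFF
    zc[1:, :]  |= on[1:, :]  & off[:-1, :]   # ON below OFF
    return zc

# usage sketch:
# response = convolve2d(image, dog_kernel(sigma=2.0), mode='same')
# edges = zero_crossings(response)
```

Repeating the computation for several values of σ and combining the resulting maps would give the multi-scale raw primal sketch mentioned at the beginning of this section.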
Another important feature of the theory of Marr and Hildreth is that it provides a
plausible explanation for the existence of cells, in the visual cortex, that operate at
different spatial frequencies, in a way similar to the filter ∇²G whose width is varied
through the parameter σ of the Gaussian. Campbell and Robson [5]
discovered in 1968, with their experiments, that the visual input is processed in multiple
independent channels, each of which analyzes a different band of spatial frequencies.

4.4.2 Toward a Perceptive Organization

Following Marr’s theory in the early stages of vision, the first elementary information
called primal sketch is extracted. In reality, the visual system uses this elementary
information and organizes it in a higher level to generate more important percep-
tual structures called chunks. This to reach a perception of the world not made of
elementary structures (borders and homogeneous areas), but of 3D objects with the
visible surface of the objects well reconstructed. How this higher-level perceptual
organization is realized, has been studied by several researchers in the nineteenth
and twentieth centuries.
We can immediately point out that, while for the retina all its characteristics have
been studied, including how the signals are transmitted through the optic nerve toward the visual
cortex, for the visual cortex itself (the primary visual area, called Area 17 or striate cortex) the
mechanisms of perception underlying the reconstruction of objects and of their motion are not yet clear.
Hubel and Wiesel [6] have shown that the signals coming from the retina through the fibers of
the optic nerve arrive in the fourth layer of Area 17, passing through the lateral geniculate
nucleus (see Fig. 3.12 of Vol. I). In this area of the visual cortex it is hypothesized that the

retinal image is reconstructed maintaining the information from the first stages of
vision (first and second derivatives of luminous intensity).
From this area of the visual cortex different information is transmitted to the
various layers of the visual cortex in relation to the tasks of each layer (motor control
of the eyes, perception of motion, perception of depth, integration of different
primal sketches to generate chunks, etc.).

4.4.3 The Gestalt Theory

The in-depth study of human perception began in the nineteenth century, when
psychology was established as a modern autonomous discipline detached from
philosophy [7]. The first psychologists (von Helmholtz [8], influenced by J. Stuart
Mill) studied perception on the basis of associationism. It was assumed that the perception
of an object can be conceived in terms of a set of sensations that emerge from
past experience, and that the sensations that make up the object have always presented
themselves associated to the perceiving subject. Helmholtz asserted that
the past perceptual experience of the observer leads to an unconscious inference
that automatically relates the perceived size of an object to its distance.
In the early twentieth century, the associationist theory was opposed by a group
of psychologists (M. Wertheimer 1923, W. Kohler 1947, and K. Koffka 1935) who
founded the school of Gestalt psychology (i.e., the psychology of form), and who gave
the most important contributions to the study of perception. The Gestaltists argued
that it is wrong to say that perception can be seen as a sum of stimuli linked by
associative laws, based on past experience (as the associationists thought). At the
base of the Gestalt theory is this assumption:

we do not perceive sums of stimuli, but forms, and the whole is much more
than the sum of the components that compose it.

In Fig. 4.6 it can be observed how each of the represented forms is perceived as a
square, regardless of the fact that its components are completely different (stars,
lines, circles).
The Gestalt idea can be stated as: the observed whole is greater than the sum of
its elementary components. The natural world is perceived as composed of discrete
objects of various sizes that appear well highlighted with respect to the background.

Fig. 4.6 We perceive three square shapes despite their being constructed from completely different graphic components
4.4 The Fundamentals of Marr’s Theory 327

Fig. 4.7 Ambiguous figures that produce different perceptions. a Figure with two possible inter-
pretations: two human face profiles or a single black vase; b perception of a young or an old woman,
drawn by William Ely Hill in 1915 and reported in a paper by the psychologist Edwin Boring in
1930

Even if the surfaces of the objects have a texture, there is no difficulty in perceiving
the contours of the objects, unless they are somehow camouflaged, nor, generally,
the homogeneous areas belonging to the objects (foreground) with respect to the
background. Some graphic and pictorial drawings made by man can present some
ambiguity when they are interpreted, and their perception can lead to errors. For
example, this happens when trying to distinguish from the background the objects present in
some famous figures (see Fig. 4.7, which highlights the perceptual ambiguity indicated above).
In fact, figure (a), conceived by the Gestalt psychologist Edgar Rubin, can be inter-
preted by perceiving the profiles of two human faces or by perceiving a single black
vase. It is impossible to perceive both the human faces and the vase simultaneously.
Figure (b), on the other hand, can be interpreted by perceiving an old woman or a
young woman.
Some artists have produced paintings or engravings with ambiguities between
the background and the figure represented, based on the principles of perceptual
reversibility. An example of reversible perception is given by the Necker cube
(1832, see Fig. 4.8), which consists of a two-dimensional representation of a three-
dimensional wire-frame cube. The intersections between two lines do not show which
line is above and which is below, so the representation is ambiguous. In
other words, it is not possible to indicate which face is facing the observer and which
is behind the cube. Looking at the cube (figure a) or at a corner (figure d) for a long time,
they appear alternately concave and convex, depending on the perceptual response of
the person.
Some first perceive the lower left face of the cube as the front face (figure (b))
facing the observer, or alternatively it is perceived as lying further back, as the lower rear
face of the cube (figure (c)). In a similar way, the same perceptive reversibility


Fig. 4.8 a The Necker cube is a wire-frame drawing of a cube with no visual cues (such as depth or
orientation). b One possible interpretation of the Necker cube, often claimed to be the most
common one because people view objects from above (seeing the lower left face as being
in front) more often than from below. c Another possible interpretation. d The same perceptive
reversibility occurs by observing a corner of the cube, which appears alternately concave and convex


Fig. 4.9 Examples of stable real figures: a hexagonal figure which also corresponds to the two-
dimensional projection of a cube seen from a corner; b predominantly stable perception of over-
lapping circular disks

occurs by observing a corner of the cube that appears alternately concave and convex
(figure (d)).
Several psychologists have tried to study the principles that determine
the ambiguities in the perception of these figures through continuous perceptual
change. In the examples shown, the perceptible data (the objects) are invariant, that
is, they remain the same, while during perception only the interpretation of what is
background and what is object varies. It would seem that the perceptual organization is
of the top-down type.
High-level structures of perceptual interpretation seem to continuously condition and guide
the low-level structures extracted from image analysis. Normally, in nat-
ural scenes and in many artificial scenes, no perceptual ambiguity arises. In
these cases, there is a stable interpretation of the components of the scene and of its
organization.
As an example of stable perception we have Fig. 4.9a, which, seen on its own
by several people, is perceived as a hexagon, while if we recall the 3D cube
of Fig. 4.8a, Fig. 4.9a also represents the figure of a cube seen by an observer
positioned in front of one of its corners. Essentially, perceptual ambiguity does not
occur because the hexagonal figure of Fig. 4.9a is a correct and real two-dimensional projection of
the 3D cube seen from one of its corners. Also Fig. 4.9b gives a stable perception of a
set of overlapping circular disks, instead of being interpreted, alternatively, as disks
interlocked with each other (for example, thinking that two disks have a circular
notch where a circular portion has been removed).
4.4 The Fundamentals of Marr’s Theory 329

Fig. 4.10 The elements of visual stimulation, close to each other, tend to be perceived as components of the same object: a columns; b lines; c lattice

Gestalt psychologists have formulated some ideas about perceptual organization
to explain why different perceptions arise when observing the same figure.
Some principles of the Gestalt theory were based mainly on the idea
of grouping elementary regions of the figures to be interpreted and other principles
were based on the segregation of the objects extracted with respect to the back-
ground. In the first case, it is fundamental to group together the elementary regions
of a figure (object) to obtain a larger region which constitutes, as a whole, the figure
to be perceived. Some of the perceptive principles of the Gestalt theory are briefly
described below.

4.4.3.1 Proximity
A basic principle of the perceptual organization of a scene is the proximity of
its elementary structures. Elementary structures of the scene that are close to
each other are perceived as grouped. In Fig. 4.10a we observe vertical linear structures
(columns), since the vertical spacing of the elementary structures (points)
is smaller than the horizontal one, while horizontal structures (lines) are observed
when the horizontal spacing is smaller than the vertical one (see
Fig. 4.10b). If the points are equally spaced horizontally and vertically (see Fig. 4.10c) the
perceived aggregation is a regular grid of points. An important example of the prin-
ciple of proximity is in the visual perception of depth as we will describe later in the
paragraph on binocular vision.
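As a rough computational analogue of the proximity principle (not a model from the book; the function names and the distance threshold are assumptions), points can be grouped by single-link clustering, joining any two elements that lie closer than a threshold:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def group_by_proximity(points, max_dist):
    """Group 2D points whose mutual distance is below max_dist (single linkage):
    a toy analogue of the Gestalt proximity principle. points: (n, 2) array."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    pairs = np.array(list(cKDTree(points).query_pairs(max_dist)))  # all close pairs
    if len(pairs) == 0:
        return np.arange(n)                  # every point is its own group
    adj = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels                            # one label per point
```

Applied to the dot patterns of Fig. 4.10, a threshold lying between the two spacings would return columns in case (a) and rows in case (b); with equal spacing, as in case (c), no preferred grouping emerges.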

4.4.3.2 Similarity
Elementary structures of the scene that appear similar tend to be grouped together
from the perceptual point of view (see Fig. 4.11). The regions appear distinct because
the visual stimulus elements that are similar are perceived as grouped, i.e., as components
of the same grouping (figure (a)). In Fig. 4.11b vertical linear structures are perceived
although, due to the proximity principle, they should be perceived as horizontal struc-
tures. This shows that the principle of similarity can prevail over information per-
ceived by the principle of proximity. There is no strict formulation of how structures
are aggregated with the principle of similarity.


Fig. 4.11 Similar elements of visual stimulation tend to be perceived as grouped components of the
same object: a two distinct regions are perceived, each consisting of groups of similar elements; b the
vertical structures are perceived because the dominant visual stimulus is the similarity between the
visual elements rather than their proximity

4.4.3.3 Common Fate


Structures that appear to move together (in the same direction) are perceived as grouped
and visible as a single region. In essence, perception associates the movement as part
of the same stimulus. In nature, many animals camouflage their apparent texture with
the background, but as soon as they move their structure emerges and they become easily
visible through the perception of motion. For example, birds can be distinguished from
the background as a single flock because they are moving in the same direction and
at the same speed, even when, viewed from a distance, each appears as a moving point (see
Fig. 4.12). The set of moving points seems to be part of a single entity.
Similarly, two flocks of birds can cross in the observer's field of view, and the observer will
continue to perceive them as separate flocks because each bird has a direction of
motion common to its own flock. The perception of motion becomes strategic in
environmental contexts where the visual stimuli linked to the color or contour of the
objects are missing (insufficient lighting). In animals, this ability has probably
developed greatly from the evolutionary need to distinguish a camouflaged predator
from its background.

Fig. 4.12 The elements of visual stimulation that move in the same direction and at the same speed are
perceived as components of the same whole: in this case, a flock of birds
4.4 The Fundamentals of Marr’s Theory 331


Fig. 4.13 Looking at figure a we have the perception of two curves that intersect at X, due to the
principle of continuity, rather than of two separate structures. The same happens for figure b, where in
this case the mechanisms of perception combine perceptual stimuli of continuity and proximity

Fig. 4.14 Closure principle. We tend to perceive a form with closed margins even if physically nonexistent

4.4.3.4 Continuity
As shown in Fig. 4.13a, an observer tends to perceive two intersecting curves at the
point X , instead of perceiving two separate irregular graphic structures that touch
at the point X . The Gestalt theory justifies this type of perception in the sense that
we tend to preserve the continuity of a curvilinear structure instead of producing
structures with strong discontinuities and graphic interruptions. Some mechanisms of
perception combine the Gestalt principles of proximity and continuity. This explains
why completely dissimilar elementary structures can be perceived as belonging to
the same graphic structure (see Fig. 4.13b).

4.4.3.5 Closure
The perceptive organization of different geometric structures, such as shapes, letters,
pictures, etc., can generate closed figures as shown in Fig. 4.14 (physically nonexis-
tent) thanks to inference with other forms. This happens even when the figures are
partially overlapping or incomplete. If the closure law did not exist, the image would
represent an assortment of different geometric structures with different lengths, rota-
tions, and curvatures, but with the law of closure, we perceptually combine the
elementary structures into whole forms.


Fig. 4.15 Principle of symmetry. a Symmetrical elements (in this case with respect to the central
vertical axis) are perceived combined and appear as a coherent object. b This does not happen in a
context of asymmetric elements where we tend to perceive them as separate

4.4.3.6 Symmetry
The principle of symmetry indicates the tendency to perceive symmetrical elementary
geometric structures in preference to asymmetric ones. Visual perception favors the
connection of symmetrical geometric structures forming a region around the point
or axis of symmetry. In Fig. 4.15a it is observed how symmetrical elements are
easily perceptively connected to favor a symmetrical form rather than considering
them separated as instead happens for the figure (b). Symmetry plays an important
role in the perceptual organization by combining visual stimuli in the most regular
and simple way possible. Therefore, the similarities between symmetrical elements
increase the probability that these elements are grouped together forming a single
symmetrical object.

4.4.3.7 Relative Dimension and Adjacency


When objects are not equal, perception tends to bring out the smaller object (small-
ness) contrasting it with the background. The object appears with a defined extension
while the background does not. When the object is easily recognizable as a known
object, the rest becomes all background. As highlighted in Fig. 4.16a, the black cross
(or helix) is smaller in size than the rest of the figure (the background, seen as a white cross
or helix, delimited by the circular contour) and is therefore what is perceived as the figure. This effect is
amplified, as shown in Fig. 4.16b, where the black area becomes even smaller with respect
to the surrounding white area (surroundedness). If we orient Fig. 4.16a in such a way that the white area
is arranged along the horizontal and vertical axes, as highlighted in Fig. 4.16c, then
it is easier to perceive it despite being larger than the black one. There seems to be
a preference to perceive figures arranged in horizontal and vertical positions over
oblique positions.
As a further illustration of the Gestalt principle of smallness, Rubin's figure is shown in
Fig. 4.16d, highlighting that, thanks to the principle of small relative dimension, the
ambiguity in perceiving the vase rather than the faces, or vice versa, can be eliminated.
We also observe how the dominant perception is that of the black region, which also emerges
from the combination of the perceptual principles of relative dimension, orientation, and
symmetry.
4.4 The Fundamentals of Marr’s Theory 333


Fig. 4.16 Principle of relative dimension and adjacency. a The dominant perception aided by the
contour circumference is that of the black cross (or seen as a helix) because it is smaller than the
possible white cross. b The perception of the black cross becomes even more evident by becoming
even smaller. c After the rotation of the figure, the situation is reversed, the perception of the white
cross is dominant because the vertical and horizontal parts are better perceived than the oblique
ones. d The principle of relative size eliminates the ambiguity of the vase/face figure shown in
Fig. 4.7, in fact, in the two vertical images on the left, the face becomes background and the vase
is easily perceived, vice versa, in the two vertical images on the right the two faces are perceived
while the vase is background

Fig. 4.17 Example of a figure perceived as an E, based on a priori knowledge, instead of being perceived as three broken lines

4.4.3.8 Past Experience


A further heuristic in perceptual organization, suggested by Wertheimer, concerns
the role of past experience or of the meaning of the object under observation. Past
experience implies that under some circumstances visual stimuli are categorized
according to prior experience. This can facilitate the visual perception of objects
with which one has greater familiarity compared to objects never seen before, all other
principles being equal. For example, for the object represented in
Fig. 4.16a, experience can facilitate the perception of the object seen as a cross or
a helix. From the point of view of meaning, knowing the letters of the alphabet, the
perception of Fig. 4.17 leads to the letter E rather than to single broken lines.

4.4.3.9 Good Gestalt or Prägnanz


All the Gestalt principles of perceptual organization lead to the principle of good
form or Prägnanz, even if there is no strict definition of it. Prägnanz is a German word
that has the meaning of pithiness and in this context implies the perceptual ideas of
salience, simplicity, and orderliness. Humans will perceive and interpret ambiguous
or complex objects in the simplest form possible. The perceived form is as
good as the prevailing conditions allow. In other words, what is perceived,
or what determines the perceptual stimuli that make a form emerge, is intrinsically the
characteristic of Prägnanz or good form that it possesses in terms of regularity, sym-
metry, coherence, homogeneity, simplicity, conciseness, and compactness. These
characteristics contribute to the greater probability of favoring various stimuli for
the perception process. Thanks to these properties the forms take on a good shape
and certain irregularities or asymmetries are attenuated or eliminated, also due to the
stimuli deriving from the spatial relations of the elements of the field of view.

4.4.3.10 Comments on the Gestalt Principles


The dominant idea behind Gestalt perceptual organization was expressed in terms
of certain force fields that were thought to operate inside the brain. The Gestalt view
embraced the doctrine of Isomorphism, according to which for every perceived
sensory experience there corresponds a brain event that is structurally similar
to that perceptual experience. For example, when a circle is perceived, a perceptual
structure with circular traces is activated in the brain. Force fields came into play to
produce results as stable as possible. Unfortunately, no rigorous demonstration was
given on the evidence of force fields and the Gestalt theory was abandoned leaving
only some perceptive principles but without highlighting a valid model of the per-
ceptual process. Today the Gestalt principles are considered vague and inadequate
to explain perceptual processes.
Subsequently, some researchers (Julesz, Olson and Attneave, Beck) sought to give
some of the Gestalt principles a quantitative measure of validity of form. They
addressed the problem of grouping of perceived elementary structures. In other
words, the problem consists in quantifying the similarity of the elementary structures
before they are grouped together.
One strategy is to evaluate which significant variables characterizing
the elementary structures determine their grouping by similarity. This can be
explored by observing how two different regions of elementary structures (patterns)
are perceptually segregated (isolated) from each other. The logic of this rea-
soning leads us to think that, if the patterns of the two regions are very coherent with
each other, then, due to the perceptual similarity between them, the two regions will
be less distinguishable and will not present a significant separation boundary.
The problem of grouping by similarity has already been highlighted in Fig. 4.11
in the Gestalt context. In [9], experimental results on human observers are reported to assess
the perceptual discrimination of homogeneous regions whose visual stimuli differ
in the type of features and in their orientation.
Figure 4.18a shows the circular field of view where three different regions are
perceived, characterized only by the different orientation of the patterns (segments
of fixed length). In figure (b) in the field of view, some curved segments are randomly
located and oriented, and one can just see parts of homogeneous regions. In figure (c)
the patterns are more complex but do not present a significantly different orientation
4.4 The Fundamentals of Marr’s Theory 335


Fig. 4.18 Perception of homogeneous and nonhomogeneous regions. a In the circular field of view
three different homogeneous regions are easily visible characterized by linear segments differently
oriented and with a fixed length. b In this field of view some curved segments are randomly located
and oriented, and one can just see parts of homogeneous regions. c Homogeneous regions are
distinguished, with greater difficulty, not by the different orientation of the patterns but by the
different types of the same patterns

between the regions. In this case, it is more difficult to perceive distinct regions and
the process of grouping by similarity becomes harder. It can be concluded that
pattern orientation is a significant parameter for the perceptual separation of
regions made of different elementary structures.
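A simple way to quantify orientation as a segregating variable (a hedged sketch, not the experimental procedure of [9]; the function name and the window size are assumptions) is to estimate the local dominant orientation from a smoothed structure tensor and then compare its distribution across candidate regions:

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def dominant_orientation(image, window=15):
    """Local dominant orientation (radians) from a smoothed structure tensor."""
    gx = sobel(image.astype(float), axis=1)          # horizontal gradient
    gy = sobel(image.astype(float), axis=0)          # vertical gradient
    jxx = uniform_filter(gx * gx, window)            # smoothed tensor components
    jyy = uniform_filter(gy * gy, window)
    jxy = uniform_filter(gx * gy, window)
    # orientation of the dominant eigenvector of the structure tensor
    return 0.5 * np.arctan2(2 * jxy, jxx - jyy)
```

The regions of Fig. 4.18a would exhibit clearly separated orientation histograms, whereas those of Fig. 4.18c would not, which is consistent with the perceptual difficulty described above.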
Julesz [10] (1965) emphasized that in the process of grouping by similarity, vision
occurs spontaneously through a perceptual process with pre-attention, which pre-
cedes the identification of elementary structures and objects in the scene. In this
context, a region with a particular texture is perceived by carefully examining and
comparing the individual elementary structures in analogy to what happens when
one wants to identify a camouflaged animal through a careful inspection of the
scene. The hypothesis is that the mechanism of pre-attentive grouping
is implemented in the early stages of the vision process, and there is biological
evidence that the brain responds when texture patterns differ significantly in
orientation between the central and the surrounding area of the receptive fields.
Julesz extended the study of grouping by similarity, including other significant
variables to characterize the patterns, such as brightness and color, which are indis-
pensable for the perception of complex natural textures. He showed that two regions are
perceived as separate if they have a significant difference in brightness or color.
These new parameters used for grouping, color and brightness, seem to influ-
ence the perceptual process by operating on mean values rather than on point-wise
differences of brightness and color.
A region in which one half is dominated by patterns with black squares and
dark gray levels, and the other half by patterns with white squares and light gray
levels, is perceived as two different regions thanks to the perception of the boundary
between its subregions.

Fig. 4.19 Perception of regions with identical brightness, but separable due to the different spatial density of the patterns between region (a) and region (b)

The perceptual process discriminates the regions based on the average value of
the brightness of the patterns in the two subregions and is not based on the details of
the composition of these average brightness values. The same occurs in separating
regions with differently colored patterns. Julesz also used parameters based on the
granularity or spatial distribution of patterns for the perception of different regions.
Regions that have an identical average brightness value, but a different spatial pattern
arrangement, have an evident separation boundary and are perceived as different
regions (see Fig. 4.19).
In analogy with what was pointed out by Beck and Attneave, Julesz also
found the orientation of the patterns to be an important parameter for the perception
of different regions. In particular, he observed the process of perception based
on grouping and evaluated in formal terms the statistical properties of the patterns to
be perceived. At first, Julesz argued that two regions cannot be perceptually separated
if their first and second-order statistics are identical.
The statistical properties were derived mathematically based on Markovian pro-
cesses in the one-dimensional case or by applying random geometry techniques
generating two-dimensional images. The differences in the first-order statistical
properties of the patterns reflect differences in the global brightness of the
patterns (for example, the difference in the average brightness of the
regions). The differences in the second-order statistical properties instead
reflect differences in the granularity and orientation of the patterns.
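To make the distinction concrete, the following sketch (illustrative only; the quantization levels, the displacement, and the function names are assumptions) compares texture patches through a first-order statistic (the gray-level histogram) and a simple second-order statistic (a gray-level co-occurrence matrix, which is sensitive to granularity and orientation):

```python
import numpy as np

def first_order_stats(patch, bins=8):
    """Normalized gray-level histogram: a first-order statistic (global brightness)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / hist.sum()

def second_order_stats(patch, dx=1, dy=0, levels=8):
    """Gray-level co-occurrence matrix for a displacement (dx, dy) with dx, dy >= 0:
    a simple second-order statistic sensitive to granularity and orientation."""
    q = (patch.astype(int) * levels) // 256          # quantize to a few gray levels
    a = q[:q.shape[0] - dy, :q.shape[1] - dx]        # reference pixels
    b = q[dy:, dx:]                                  # displaced pixels
    C = np.zeros((levels, levels))
    np.add.at(C, (a.ravel(), b.ravel()), 1)          # count co-occurring level pairs
    return C / C.sum()
```

Patches that differ only in mean brightness already differ in the first-order histogram; patches with identical histograms but different granularity or dominant orientation differ in the co-occurrence matrix, which is the spirit of Julesz's original conjecture.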
Julesz’s initial approach, which attempted to model the significant variables to
determine grouping by similarity in mathematical terms, was challenged applied to
artificial concrete examples, and Julesz himself modified his theory to the theory
of texton (structures elementary of the texture). He believed that only a difference
4.4 The Fundamentals of Marr’s Theory 337

in the textons or in their density can be detected in pre-attentive mode.1 No positional
information on neighboring textons is available without focused attention.

He stated that the pre-attentive process operates in parallel while the attentive
one (i.e., focused attention) operates in a serial way. The elementary structures of
the texture (textons) can be elementary homogeneous areas (blobs) or rectilinear
segments (characterized by parameters such as aspect ratio and orienta-
tion) that correspond to the primal sketch representation proposed by Marr
(studies on the perception of texture would influence Marr himself).

Figure 4.20 shows an image with a texture generated by two types of textons.
Although the two textons (figures (a) and (b)) are perceived as different when viewed
in isolation, they are actually structurally equivalent, having the same size, con-
sisting of the same number of segments, and each having two segment terminations. The
image in figure (c) is made from textons of type (b), representing the background, and
from textons of type (a) in the central area, representing the region of attention. Since
both textons are randomly oriented, and in spite of their being structurally equivalent,
the contours of the two textures are not perceptible pre-attentively, but only with focused
attention. It would seem that only the number of textons, together with their shape
characteristics, is important, while their spatial orientation, together with the
closure and continuity characteristics, does not seem to be important.

Julesz states that vision without focused attention does not determine the position of the
symbols but evaluates their number, or density, or first-order statistics. In other
words, Julesz's studies emphasize that the elementary structures of texture do
not necessarily have to be identical to be considered together by the processes
of grouping. These processes seem to operate between symbols of the same
brightness, orientation, color, and granularity. These properties correspond to
those that we know to be extracted from the first stages of vision, and corre-
1 Various studies have highlighted mechanisms of co-processing of perceptive stimuli, reaching the
distinction between automatic attentive processes and controlled attentive processes. In the first
case, the automatic processing (also called pre-attentive or unconscious) of the stimuli would take
place in parallel, simultaneously processing different stimuli (relating to color, shape, movement,
. . .) without the intervention of attention. In the second case, the focused attention process requires a
sequential analysis of perceptive stimuli and the combination of similar ones. It is hypothesized that
the pre-attentive process is used spontaneously as a first approach (in fact it is called pre-attentive)
in the perception of familiar and known (or undemanding) things, while the focused attention
intervenes subsequently, in situations of uncertainty or in sub-optimal conditions, sequentially
analyzing the stimuli. In other words, the perception mechanism would initially tend toward a
pre-attentive process, processing in parallel all the stimuli of the perceptual field, and, when facing
an unknown situation, would if necessary activate a focused attention process to adequately integrate
the information of the individual stimuli. The complexity of the perception mechanisms has not yet
been rigorously established.


Fig. 4.20 Example of two non-separable textures despite their being made with structurally similar
elements (textons) (figures a and b) in terms of dimensions, number of segments, and terminations. The region
of interest, located at the center of the image (figure c) and consisting of textons of type (b), is
difficult to perceive with respect to the background (distractor component), represented by textons
of type (a), when both textons are randomly oriented

spond to the properties extracted with the description in terms of raw primal
sketch proposed initially by Marr.

So far we have quantitatively examined some Gestalt principles, such as similarity
and the evaluation of form, to justify perception based on grouping. All
the studies used only artificial images. It remains to be shown whether this perceptual
organization is still valid when applied to natural scenes. Several studies have
demonstrated the applicability of the Gestalt principles to explain how some animals,
with their particular texture, are able to camouflage themselves in their habitat,
making it difficult for their predatory antagonists to perceive them. Various animals, prey and
predators, have developed particular systems of perception that differ from
each other and that are also based on the process of grouping by similarity or on the
process of grouping by proximity.

4.4.4 From the Gestalt Laws to the Marr Theory

We have analyzed how the Gestalt principles are useful descriptions to analyze
perceptual organization in the real world, but we are still far from having found
an adequate theory that explains why these principles are valid and how perceptual
organization is realized. The Gestalt psychologists themselves tried to answer these
questions by thinking of a model of the brain that provides force fields to characterize
objects.
Marr’s theory attempts to give a plausible explanation of the process of perception,
emphasizing how such principles can be incorporated into a computational model
that detects structures hidden in uncertain data, obtained from natural images. With
4.4 The Fundamentals of Marr’s Theory 339

[Fig. 4.21 schematic: input image → raw primal sketch primitives (edges, blobs, bars, segments, ...), each with attributes of position, orientation, contrast, and dimension]
Fig. 4.21 Example of a raw primal sketch map derived from zero crossings computed at different
scales by applying the LoG filter to the input image

Marr’s theory, with the first stages of vision, they are not understood as processes
that directly solve the problem of segmentation or to extract objects as is traditionally
understood (generally complex and ambiguous activity).
The goal of the early vision process seen by Marr is to describe the surface
of objects by analyzing the real image, even in the presence of noise, of complex
elementary structures (textures) and of shaded areas. We have already described
in Sect. 1.13 Vol. II the biological evidence for the extraction of zero crossings
from the early vision process proposed by Marr-Hildreth, to extract the edges of the
scene, called raw primal sketch. The primitive structures found in the raw primal
sketch maps are edges, homogeneous elementary areas (blob), bars, terminations in
general, which can be characterized with attributes such as orientation, brightness,
length, width, and position (see Fig. 4.21).
The map of raw primal sketch is normally very complex, from which it is necessary
to extract global structures of higher level such as the surface of objects and the
various textures present. This is achieved with the successive stages of the processes
of early vision by recursively assigning place tokens (elementary structures such as
bars, oriented segments, contours, points of discontinuity, etc.) to small structures of
the visible surface or aggregation of structures, generating primal sketch maps.
These place token structures are then aggregated to form larger structures
and, where possible, this is repeated cyclically. The place token structures can be
characterized (at different scales of representation) by the position of the elementary
structures represented (elementary homogeneous areas, edges, short straight segments,
or curved lines), by the termination of a longer edge, line, or elongated elementary
area, or by a small aggregation of symbols.
The process of aggregation of the place token structures can proceed by clustering
those close to each other on the basis of changes in spatial density (see Fig. 4.22)
and proximity, or by curvilinear aggregation, which produces contours joining aligned
structures that are close to each other (see Fig. 4.23), or through the aggregation


Fig. 4.22 Full primal sketch maps derived from raw primal sketch maps by organizing elementary
structures (place tokens) into larger structures, such as regions, lines, curves, etc., according to
continuity and proximity constraints


Fig. 4.23 Full primal sketch maps derived from raw primal sketch maps by aggregating elemen-
tary structures (place tokens), such as boundary elements or terminations of segments, into larger
structures (for example, contours), according to spatial alignment and continuity constraints. a The
shading, generated by the illumination in the indicated direction, creates local variations of bright-
ness, and by applying closure principles contours related to the shadows are also detected, in addition
to the edges of the object itself; b the circular contour is extracted from the curvilinear aggregation
of the terminations of elementary radial segments (place tokens); c detection of a curved edge through
curvilinear aggregation and alignment of small elementary structures

of texture oriented structures (see Fig. 4.24). This latter type of aggregation implies
a grouping of similar structures oriented in a given direction and other groups of
similar structures oriented in other directions. In this context, it is easy to extract
rectilinear or curved structures by aligning the terminations or the discontinuities as
shown in the figures.
The process of grouping the place token structures is based on local
proximity (i.e., adjacent elementary structures are combined) and on similarity
(i.e., similarly oriented elementary structures are combined), but the resulting structures
of the visible surface can be influenced by more global considerations. For example,
in the context of curvilinear aggregation, a closure principle can allow two segments,
which are edge elements, to be joined even if the change in brightness across the
segments is appreciable, due to the effect of the lighting conditions (see Fig. 4.23a).
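A minimal sketch of curvilinear aggregation (purely illustrative; the thresholds, the greedy strategy, and all names are assumptions, not Marr's algorithm) chains place tokens that are both close and nearly aligned, implementing proximity and good continuation at the same time:

```python
import numpy as np

def chain_tokens(positions, orientations, max_dist=10.0, max_dtheta=np.radians(20)):
    """Greedy curvilinear aggregation of place tokens (position + orientation):
    link tokens that are close and nearly aligned, forming contour chains.
    positions: (n, 2) array; orientations: (n,) array of angles modulo pi."""
    unused = set(range(len(positions)))
    chains = []
    while unused:
        chain = [unused.pop()]
        grown = True
        while grown:
            grown = False
            last = chain[-1]
            for j in list(unused):
                d = np.linalg.norm(positions[j] - positions[last])
                # angular difference between undirected orientations (modulo pi)
                dtheta = abs((orientations[j] - orientations[last] + np.pi / 2) % np.pi - np.pi / 2)
                if d < max_dist and dtheta < max_dtheta:   # proximity + good continuation
                    chain.append(j)
                    unused.discard(j)
                    grown = True
                    break
        chains.append(chain)
    return chains
```

Each resulting chain is a candidate contour; in Fig. 4.23b, for instance, tokens placed at the terminations of the radial segments would chain into the circular contour.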
Marr’s approach combines many of the Gestalt principles discussed above. From
the proposed grouping processes, we derive primal sketch structures at different
scales to physically locate the significant contours in the images. It is important to
derive different types of visible surface properties at different scales, as Marr claims.
The contours due to the change in the reflectance of the visible surface (for example,
due to the presence of two overlapping objects), or due to the discontinuity of the
orientation or depth of the surface can be detected in two ways.
4.4 The Fundamentals of Marr’s Theory 341


Fig. 4.24 Full primal sketch maps derived from raw primal sketch maps by organizing elemen-
tary oriented structures (place tokens) in vertical edges, according to constraints of directional
discontinuity of the oriented elementary structures

In the first way, the contours can be identified with place tokens structures. The
circular contour perceived in Fig. 4.23b can be derived through the curvilinear aggre-
gation of place tokens structures assigned to the termination of each radial segment.
In the second way, the contours can be detected by the discontinuity in the param-
eters that describe the spatial organization of the structures present in the image.
In this context, changes in the local spatial density of the place tokens structures,
their spacing or their dominant orientation could be used together to determine the
contours. The contour that separates the two regions in Fig. 4.24b is not detected
by analyzing the structures place tokens, but is determined by the discontinuity in
the dominant orientation of the elementary structures present in the image. This
example demonstrates how Marr’s theory is applicable to the problem of texture and
the separation of regions in analogy to what Julesz realized (see Sect. 3.2).

As already known, Julesz attempted to produce a general mathematical formu-


lation to explain why some texture contours are clearly perceived while others
are not easily visible without a detailed analysis. Marr instead provides a com-
putational theory with a more sophisticated explanation. Julesz’s explanation
is purely descriptive while Marr has shown how a set of descriptive principles
(for example, some Gestalt principles) can be used to recover higher-level tex-
tures and structures from images. The success of Marr’s theory is demonstrated
by being able to obtain results from the procedures of early vision to determine
the contours even in the occlusion zones. The derived geometrical structures
are based only on the procedures of initial early vision, without having any
high-level knowledge.

In fact, the procedures of early vision have no knowledge of the possible shape
of the teddy bear's head and do not find the contours of the eyes, since they do not
have any prior knowledge to guide them.

We can say that Marr’s theory, based on the processes of early vision, contrasts
strongly with the theories of perceptive vision that are based on the expectation
and hypothesis of objects that drive each stage of perceptual analysis. These

last theories are based on knowledge of the objects of the world, for which
a representation model is stored in the computer.

The procedures of early vision proposed by Marr also operate correctly in


natural images since they include some principles of grouping that reflect some
general properties of the observed world. Marr’s procedures use higher-level
knowledge only in the case of ambiguity.

In fact, Fig. 4.23a shows the edges of the object but also those caused by shadows,
which do not belong to the object itself. In this case, the segmentation procedure may fail
to separate the object's contour from the background if it does not have a priori
information about the lighting conditions. In general, ambiguities cannot be solved
by analyzing a single image, but by considering additional information such as that
deriving from stereo vision (observation of the world from two different points of
view) or from the analysis of motion (observing time-varying image sequences) in
order to extract depth and movement information present in the scene, or by knowing
the lighting conditions.

4.4.5 2.5D Sketch of Marr’s Theory

In Marr’s theory, the goal of the first stages of vision is to produce a description of the
visible surface of the objects observed together with the information indicating the
structure of the objects with respect to the observer’s reference system. In other words,
all the information extracted from the visible surface is referred to the observer.
The primal sketch data are analyzed to perform the first level of reconstruction
of the visible surface of the observed scene. These data (extracted as a bottom-up
approach) together with the information provided by some modules of early vision,
such as depth information (between scene and observer) and orientation of the visible
surface (with respect to the observer) form the basis for the first 3D reconstruction of
the scene. The result of this first reconstruction of the scene is called the 2.5D sketch, in
the sense that the result obtained, generally orientation maps (also called needle maps)
and depth maps, is something more than 2D information but cannot yet be considered
a 3D reconstruction.
The contour, texture, depth, orientation, and movement information (called
by Marr the full primal sketch information), extracted from the processes of early vision,
such as stereo vision, motion analysis, and the analysis of texture and color, all
contribute together to the production of 2.5D sketch maps, seen as intermediate information,
temporarily stored, which gives a partial solution awaiting further processing by the
perceptual process for the reconstruction of the observed visible surface.
Figure 4.25 shows an example of a 2.5D sketch map representing the visible
surface of a cylindrical object in terms of orientation information (with respect to the
observer) of elementary portions (patch) of the visible surface of the object. To render
this representation effective, the orientation information is represented with oriented
4.4 The Fundamentals of Marr’s Theory 343

Fig. 4.25 2.5D sketch map derived from the full primal sketch map and from the orientation map.
The latter adds the orientation information (with respect to the observer) for each point of the
visible surface. The orientation of each element (patch) of visible surface is represented by a vector
(seen as a little needle) whose length indicates how much it is inclined with respect to the observer
(maximum length means the patch is inclined 90° with respect to the observer, i.e., the normal is perpendicular
to the viewing direction; zero length indicates that the normal points toward the observer), while the direction of the needle coincides with the normal to the patch

needles, whose inclination indicates how the patches are oriented with respect to the
observer. These 2.5D sketch maps are called needle maps or orientation maps of the
observed visible surface.
The length of each oriented needle describes the degree of inclination of the patch
with respect to the observer. A needle with zero length indicates that the patch is
perpendicular to the vector that joins the center of the patch with the observer. An
increase in the length of the needle corresponds to an increase in the inclination of the patch
with respect to the observer. The maximum length is reached when the inclination angle of
the patch reaches 90°. In a similar way, a depth map can be represented, indicating for each
patch its distance from the observer.
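As a hedged illustration of how such a map can be produced when a depth map is available (orthographic viewing along the z axis is assumed; the function and variable names are hypothetical), the surface normal of each patch can be estimated from the depth gradients, its image-plane projection giving the needle and its z component the inclination with respect to the observer:

```python
import numpy as np

def needle_map(depth):
    """2.5D sketch from a depth map Z(x, y): per-pixel unit normal, its image-plane
    projection (the 'needle'), and the slant angle with respect to the observer."""
    p = np.gradient(depth, axis=1)                   # dZ/dx
    q = np.gradient(depth, axis=0)                   # dZ/dy
    norm = np.sqrt(p ** 2 + q ** 2 + 1.0)
    normals = np.stack((-p, -q, np.ones_like(p)), axis=-1) / norm[..., None]
    needles = normals[..., :2]                       # zero length: patch faces the viewer
    slant = np.degrees(np.arccos(np.clip(normals[..., 2], -1.0, 1.0)))
    return normals, needles, slant
```

The needle length grows from zero (normal pointing at the observer) toward its maximum as the slant approaches 90°, matching the convention of Fig. 4.25.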
It can happen that the processes of early vision do not produce information
(orientation, depth, etc.) at some points of the visual field, and this results in the
production of a 2.5D sketch map with some areas lacking information. In this case,
interpolation processes can be applied to produce the missing information on the map.
This happens with the stereo vision process in the occlusion areas, where no depth
information is produced, and with the shape from shading process2 (for the extraction of
the orientation maps) in areas of strong discontinuity of brightness (or depth), in
which the orientation information is not produced.

2 Shape-from-shading, as we will see later, is a method to reconstruct the surface of a 3D
object from a 2D image based on shading information, i.e., it uses shading and the light direction as
cues to infer the 3D surface. This method is used to obtain a 3D surface orientation
map.

4.5 Toward 3D Reconstruction of Objects

The final goal of vision is the recognition of the observed objects. Marr's vision
processes, through 2.5D sketch maps, produce a representation, with respect to the
observer, of the visible surface of 3D objects. This implies that the representation of
the observed object varies with the point of view, thus making the recognition
process more complex. To discriminate between the observed objects, the recognition
process must operate by representing the objects with respect to their center of mass
(or with respect to an absolute reference system) and not with respect to the observer.
This third level of representation is called object centered. Recall that the shape
that characterizes the object observed expresses the geometric information of the
physical surface of an object. The representation of the shape is a formal scheme
to describe a form or some aspects of the form together with rules that define how
the schema is applied to any particular form. In relation to the type of representation
chosen (i.e., the chosen scheme), a description can define a shape with a rough or
detailed approximation.
There are different models of 3D representation of objects. The representation of
the objects as viewer centered or object centered becomes fundamental in characterizing
the recognition process itself. If the vision system uses a viewer
centered representation, then to fully describe the object it will be necessary to have
a representation of it through different views of the object itself. This obviously
requires more memory to maintain an adequate representation of all the observed
views of the objects. Minsky [11] proposed to optimize the multi-view representation
of objects, choosing significant primitives (for example, 2.5D sketch information)
representing the visible surface of the observed object from appropriate different
points of view.
An alternative to the multi-view representation is given by the object centered
representation that certainly requires less memory to maintain a single description of
the spatial structures of the object (i.e., it uses a single model of 3D representation of
the object). The recognition process will have to recognize the object, whatever its
spatial arrangement. We can conclude by saying that an object centered description
presents more difficulties for the reconstruction of the object, since a single coordinate
system is used for each object and this coordinate system must be identified from the image
before the description of the shape can be built. In other words, the form is described
not with respect to the observer but relative to the object centered coordinate system
that is based on the form itself.
For the recognition process, the viewer centered description is easier to produce,
but is more complex to use than the object centered one. This depends on the fact
that the viewer centered description depends very much on the observation points
of the objects. Once the coordinate system (viewer centered or object centered) is
defined, that is, the way of organizing the representation that imposes a description of the
objects, the fundamental role of the vision system is to extract the primitives in a
stable and unique way (invariance of the information associated with the primitives); these
constitute the basic information used for the representation of the form. We have previously
mentioned the 2.5D sketch information, extracted from the process of early vision,

Fig. 4.26 A representation of objects from elementary blocks or 3D primitives

which essentially is the orientation and depth (calculated with respect to the observer)
of the visible surface observed in the field of view and calculated for each point of
the image. In essence, primitives contain 3D information of the visible surface or 3D
volumetric information. The latter include the spatial information on the form.
The complexity of primitives, considered as the first representation of objects,
is closely linked to the type of information that can be derived from the vision
system. While it is possible to define even complex primitives by choosing a
sophisticated model of representation of the world, we are nevertheless limited by the ability
of the vision processes, which may not be able to extract consistent primitives. An
adequate balance must therefore be achieved between the model of representation of the objects and
the type of primitives, which must guarantee stability and uniqueness [12].
The choice of the most appropriate representation will depend on the application
context. A limited representation of the world, made of blocks (cubic primitives), can
be adequate to represent, for example, the objects present in an industrial warehouse
of packaged products (see Fig. 4.26). A plausible representation for people
and animals can be to consider as solid primitives suitably organized cylindrical or conical figures
[12] (see Fig. 4.27), known as generalized cones.3 With this 3D
model, according to the approach of Marr and Nishihara [12], the object recogni-
tion process involves the comparison between the 3D reconstruction obtained from
the vision system and the 3D representation of the object model using generalized
cones, previously stored in memory. In the following paragraphs, we will analyze
the problems related to the recognition of objects.
In conclusion, the choice of the 3D model of the objects to be recognized, that is,
of the primitives, must be adequate with respect to the application context that characterizes

3 A generalized cone can be defined as a solid created by the motion of an arbitrarily shaped cross-
section (perpendicular to the direction of motion) of constant shape but variable in size along the
symmetry axis of the generalized cone. More precisely a generalized cone is defined by drawing
a 2D cross-section curve at a fixed angle, called the eccentricity of the generalized cone, along
a space curve, called spine of the generalized cone, expanding the cross-section according to a
sweeping rule function. Although spine, cross-section, and sweeping rule can be arbitrary analytic
functions, in reality only simple functions are chosen, in particular, a spine is a straight or circular
line, the sweeping rule is constant or linear, and the cross-section is generally rectangular or circular.
The eccentricity is always chosen with a right angle and consequently the spine is normal to the
cross-section.

Fig. 4.27 3D multiscale hierarchical representation of a human figure (body, arm, forearm, hand) by using object-centered generalized cones

the model of representation of objects, the modules of the vision system (which
must reconstruct the primitives) and the recognition process (that must compare the
reconstructed primitives and those of the model). With this approach, invariance was
achieved thanks to the coordinate system centered on the object whose 3D compo-
nents were modeled. Therefore, regardless of the point of view and most viewing
conditions, the same 3D structural description that is invariant from the point of
view would be retrieved by identifying the appropriate characteristics of the image
(for example, the skeleton or principal axes of the object) by retrieving a canonical
set of 3D components and comparing the resulting 3D representation with similar
representations obtained from the vision system. Following Marr's approach, in the
next chapter we will describe the various vision methods that can extract all pos-
sible information in terms of the 2.5D sketch from one or more 2D images, in order
to represent the shape of the visible surface of the objects of the observed scene.
Subsequently, we will describe the various models of representation of the objects,
analyzing the most adequate 2.5D sketch primitives, also in relation to the type
of recognition process.
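As a worked illustration of the definition given in footnote 3 (a sketch under simplifying assumptions: straight spine, circular cross-section, linear sweeping rule; all names are hypothetical), the surface of a simple generalized cone can be sampled as follows:

```python
import numpy as np

def generalized_cone(length=10.0, r0=1.0, r1=0.5, n_axis=50, n_theta=36):
    """Sample the surface of a simple generalized cone: straight spine along z,
    circular cross-section, linear sweeping rule from radius r0 to r1."""
    z = np.linspace(0.0, length, n_axis)              # spine (straight line)
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    radius = r0 + (r1 - r0) * z / length              # linear sweeping rule
    Z, T = np.meshgrid(z, theta, indexing='ij')
    R = radius[:, None]                               # broadcast radius along the spine
    X, Y = R * np.cos(T), R * np.sin(T)               # circular cross-section
    return np.stack((X, Y, Z), axis=-1)               # (n_axis, n_theta, 3) surface points
```

With r1 = r0 the solid degenerates into a cylinder and with r1 close to zero into an ordinary cone; a coarse human model like that of Fig. 4.27 can then be assembled as a hierarchy of such primitives, each expressed in its own object-centered frame.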
A model very similar to the theory of Marr and Nishihara is Biederman's model [13,14], known as Recognition By Components (RBC). The limited set of volumetric primitives used in this model is called geons. RBC assumes that the geons are retrieved from the images based on some key features, called non-accidental properties since they do not change when looking at the object from different angles, i.e., shape configurations that are unlikely to occur by pure chance. Examples of non-accidental properties are parallel lines, symmetry, or Y-junctions (three edges that meet at a single point).
These qualitative descriptions could be sufficient to distinguish different classes of objects, but not to discriminate within a class of objects having the same basic components. Moreover, this model is inadequate to differentiate dissimilar shapes: perceptual similarities between shapes can lead to recognizing objects as similar even though they are composed of different geons.

4.5.1 From Image to Surface: 2.5D Sketch Map Extraction

This section describes how a vision system can describe the observed surface of objects by processing the intensity values of the acquired image. According to Marr's theory, the vision system can include different early vision processes whose output gives a first description of the surface as seen by the observer.
In the previous paragraphs, we have seen that this first description extracted from the image is expressed in terms of elementary structures (primal sketch) or in terms of higher-level structures (full primal sketch). Such structures essentially describe the 2D image rather than the surface of the observed world. To describe instead the 3D surface of the world, it will be necessary to develop further vision modules that, taking as input the image and/or the structures extracted from it, extract the information associated with the physical structures of the 3D surface.
This information can be the distance of different points of the surface from the observer, the shape and inclination of the solid surface at different distances with respect to the observer, the motion of the objects relative to each other and to the observer, the texture, etc. The human vision system is considered the vision machine par excellence to draw inspiration from, studying how the nervous system, by processing the image of each retina (retinal image), correctly describes the 3D surface of the world. The goal is to study the various modules of the vision system considering:

(a) The type of information to be represented;
(b) The appropriate algorithm for obtaining this information;
(c) The type of information representation;
(d) The computational model adequate with respect to the algorithm and the chosen representation.

In other words, the modules of the vision system will be studied not on the basis
of ad hoc heuristics but according to the Marr theory, introduced in the previous
paragraphs, which provides for each module a methodological approach, known as
the Marr paradigm, divided into three levels of analysis: computational, algorithmic,
and implementation.

The first level, the computational model, deals with the physical and mathematical
modeling with which we intend to derive information from the physical structures
of the observable surface, considering the aspects of the image acquisition system
and the aspects that link the physical properties of the structures of the world
(geometry and reflectance of the surface) with the structures extracted from the
images (zero crossing, homogeneous areas, etc.).
The second level, the algorithmic level, addresses how a procedure is realized that
implements the computational model analyzed in the first level and identifies the
input and output information. The procedure chosen must be robust, i.e., it must
guarantee acceptable results even in conditions of noisy input data.

The third level, the implementation level, deals with the physical modality of how
the algorithms and information representation are actually implemented using a
hardware with an adequate architecture (for example, neural network) and soft-
ware (programs). Usually, a vision system operates in real-time conditions. The
human vision system implements vision algorithms using a neural architecture.

An artificial vision system is normally implemented on standard digital computers


or on specialized hardware, which efficiently implements vision algorithms, which
require considerable computational complexity. The three levels are not completely independent: the choice of a particular hardware may influence the algorithm, or vice versa. Given the complexity of a vision system, it is plausible that
for the complete perception of the world (or to reduce the uncertainty of perception) it
is necessary to implement different modules, each specialized to recover a particular
information such as depth, orientation of the elementary surface, texture, color, etc.
These modules cooperate together for the reconstruction of the visible 3D surface
(of which one or more 2D images are available), according to Marr’s theory inspired
by the human visual system. In the following paragraphs, we will describe some
methodologies used for extracting the shape of objects from the acquired 2D images.
The shape is intended as the orientation and geometry information of the observed 3D surface. Since there are different methods for extracting shape information, the generic methodology can be indicated with X and a given methodology can be called Shape from X. With this expression, we mean the extraction of the visible surface shape based on methodology X, using information derived from the acquired 2D image.
The following methodologies will be studied:

Shape from Shading
Shape from Stereo
Shape from Stereo photometry
Shape from Contour
Shape from Motion
Shape from Line drawing
Shape from (de)Focus
Shape from Structured light
Shape from Texture

Basically, each of the above methods produces, according to Marr's theory, a 2.5D sketch map: for example, the depth map (with the Shape from Stereo and Structured light methodologies), the surface orientation map (with the Shape from Shading methodology), and the orientation of the light source (with the Shape from Shading and Shape from Stereo photometry methodologies).
The human visual system perceives the 3D world by integrating (it is not yet known how this happens) the 2.5D sketch information, deriving the primal sketch structures from the pair of 2D images formed on the retinas. The various shading and shape cues are the same ones skillfully used, for example, by an artist when representing in a painting a two-dimensional projection of the 3D world.

Fig. 4.28 Painting by Canaletto, known for his effectiveness in creating wide-angle perspective views of Venice with particular care for lighting conditions

Figure 4.28, a painting by Canaletto, shows a remarkable perspective of a view of Venice and effectively highlights how the observer is stimulated to have a 3D perception of the world while observing its 2D projection. This 3D perception is stimulated by the shading information (i.e., by the nuances of the color levels) and by the perspective representation that the artist was able to capture in painting the work.
The Marr paradigm is a plausible theoretical infrastructure for realizing an artificial vision system but, unfortunately, it is not sufficient to realize effective vision machines for automatic object recognition. In fact, it can be shown that many of the vision system modules are based on ill-posed computational models, that is, they do not admit a unique solution. Many of the Shape from X methods are formulated as the solution of inverse problems, in this case the recovery of the 3D shape of the scene starting from limited information, i.e., the 2D image, normally affected by noise whose model is not known.
In other words, vision is the inverse problem of the process of image formation.
As we shall see in the description of some vision modules, there are methods for realizing well-posed modules, producing efficient and unambiguous results, through a process of regularization [15]. This is possible by imposing constraints (for example, the local continuity of the intensity or of the geometry of the visible surface) that transform an ill-posed inverse problem into a well-posed one.

4.6 Stereo Vision

A system based on stereo vision extracts distance information and surface geometry
of the 3D world by analyzing two or more images corresponding to two or more
different observation points. The human visual system uses binocular vision for the
perception of the 3D world. The brain processes the pair of images, 2D projections on the retinas of the 3D world, observed from two slightly different points of view lying on the same horizontal line.
The basic principle of binocular vision is as follows: objects positioned at different distances from the observer are projected into the retinal images at different locations. A direct demonstration of the principle of stereo vision can be obtained by looking at your own thumb at different distances, closing first one eye and then the other. It will be possible to observe how in the two retinal images the finger appears at two different, horizontally shifted locations. This relative difference in location on the retina is called disparity.
From Fig. 4.29 it can be understood how, in human binocular vision, the retinal images are slightly different in relation to the interocular distance of the eyes and to the distance and angle of the observed scene. It is observed that the images generated by the fixation point P are projected in the foveal areas, thanks to the convergence movements of the eyes, that is, they are projected onto corresponding points of the retinas. In principle, the distance of any object being fixated could be determined from the observation directions of the two eyes through triangulation. Obviously, with this approach, it would be essential to know precisely the orientation of each eye, which is limited in humans, while it is possible to build binocular vision machines with good precision of the fixation angles, as we will see in the following paragraphs.
Given the binocular configuration, points closer or further away from the point
of fixation project their image at a certain distance from each fovea. The distance
between the image of the fixation point and the image of each of the other points

projected on the respective retinas is called retinal disparity. Therefore, we observe that for the fixation point P, projected in each fovea of the eyes, the horizontal disparity is zero, while for the points L and V the horizontal disparity is different from zero and can be calculated as a horizontal distance with respect to the fovea of each retina, or as the relative distance between the respective projections of L and V on the two retinas. We will see that it is possible to use the relative horizontal disparity to estimate the depth of an object, instead of the horizontal retinal disparity (the distance between the projections LL PL and PL VL, and similarly for the right eye), without knowing the distance d from the fixation point P.

Fig. 4.29 Binocular disparity. The fixation point is P, whose image is projected on the two foveae, stimulating corresponding main retinal zones. The other points V and L, not fixated, stimulate other zones of the two retinas that can either be homologous, giving a binocular disparity, or not be in correspondence, in which case the point is perceived as double
The human binocular system is able to calculate this disparity and to perceive the depth information between the points in the vicinity of the fixation point included in the visual field. The perception of depth tends to decrease when the distance of the fixation point exceeds 30–40 m.
The brain uses relative disparity information to derive depth perception, i.e., the measure of the distance between object and observer (see Fig. 4.29). The disparity is characterized not only by its value but also by its sign. The point V, closer to the observer, is projected onto the retina outside the fixation point P and generates positive disparity; conversely, the farthest point L is projected within P and generates negative disparity. We point out that if the retinal distance is too large, the images of the projected objects (very close to or very far from the observer) will not be attributed to the same object, thus resulting in double vision (the problem of diplopia).
In Sect. 3.2 Vol. I we described the components of the human visual system that
in the evolutionary path developed the ability to combine the two retinal images in
the brain to produce a single 3D image of the scene. Aware of the complexity of
interaction of the various neurophysiological components of the human visual system, it is useful to know their functional mechanisms with a view to creating a binocular vision machine capable of imitating some of its functionalities.
From the schematic point of view, according to the model proposed in 1915 by
Claude Worth, human binocular vision can be articulated in three phases:

1. Simultaneous acquisition of image pairs.
2. Fusion of each pair.
3. Stereopsis, or the ability of the brain to extract 3D information about the scene and the position of objects in space starting from the horizontal visual disparity.

For a binocular vision machine, the simultaneous acquisition of image pairs is easily achieved by synchronizing the two monocular acquisition systems. The phases of fusion and stereopsis are much more complex.

4.6.1 Binocular Fusion

Figure 4.29 shows the human binocular optical system, of which the interocular distance D is known, with the ocular axes assumed coplanar. The ocular convergence allows focusing on and fixating a point P in space. Knowing the angle θ between


Fig. 4.30 The horopter curve is the set of points of space which, projected on the two retinas, stimulate corresponding areas and are perceived as single elements. The fixation point P stimulates the main corresponding zones, while simultaneously other points, such as Q, that lie on the horopter stimulate corresponding zones and are seen as single elements. The point R, outside the horopter, is perceived as a double point (diplopia) as it stimulates unmatched retinal zones. The shape of the horopter curve depends on the distance of the fixation point

the ocular axes, with a simple triangulation it would be possible to calculate the distance d = D/(2 tan(θ/2)) between observer and fixation point. Looking at the scene from two slightly different viewpoints, the points of the scene appear slightly shifted in the two retinal images.
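A minimal numeric sketch of the triangulation just mentioned (our own example; the function name and the numeric values are illustrative):

import math

def fixation_distance(D, theta):
    """Distance of the fixation point from the baseline for an
    interocular distance D and a vergence angle theta (radians)
    between the two ocular axes: d = D / (2 * tan(theta / 2))."""
    return D / (2.0 * math.tan(theta / 2.0))

# Example: eyes 65 mm apart converging with a 2-degree vergence angle
D = 0.065                      # interocular distance in metres
theta = math.radians(2.0)      # vergence angle
print(f"fixation distance = {fixation_distance(D, theta):.2f} m")  # about 1.86 m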
The problem of the fusion process is to find in the two retinas homologous points of the scene in order to produce a single image, as if the scene were observed by a cyclopean eye placed midway between the two. In human vision, this occurs at the retinal level if the visual field is spatially correlated with each fovea, where the disparity assumes a value of zero. In other words, the visual fields of the two eyes are reciprocally bound, such that a retinal zone of one eye, placed at a certain distance from the fovea, finds in the other eye a corresponding homologous zone, positioned on the same side and at the same distance from its own fovea.
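Computationally, the search for homologous points is a correspondence problem. The following sketch (our own illustration, not a model of the retinal mechanism) finds, along a single pair of scanlines, the horizontal shift that minimizes the sum of squared differences between a small left-image patch and the candidate right-image patches:

import numpy as np

def match_along_scanline(left_row, right_row, x, half_win=3, max_disp=20):
    """Return the disparity (horizontal shift) that best matches the
    patch of `left_row` centred at x against `right_row`, by minimizing
    the sum of squared differences (SSD) over candidate shifts."""
    patch = left_row[x - half_win: x + half_win + 1]
    best_d, best_cost = 0, np.inf
    for d in range(0, max_disp + 1):               # search to the left in the right image
        xr = x - d
        if xr - half_win < 0:
            break
        candidate = right_row[xr - half_win: xr + half_win + 1]
        cost = np.sum((patch - candidate) ** 2)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

# Synthetic test: the right scanline is the left one shifted by 5 pixels
rng = np.random.default_rng(0)
left = rng.random(100)
right = np.roll(left, -5)                          # scene point appears 5 px to the left
print(match_along_scanline(left, right, x=50))     # -> 5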
Spatially scattered elementary structures of the scene projected into the corre-
sponding retinal areas stimulate them, propagating signals in the brain that allow the
fusion of the retinal images and the perception of a single image. Therefore, when the eyes converge to fixate an object, it is seen as single since the two corresponding main (homologous) areas of the foveae are stimulated. Simultaneously, other elementary structures of the object, included in the visual field although not fixated, can be perceived as single because they fall on and stimulate other corresponding retinal (secondary) zones.
Points of space (seen as single, stimulating corresponding areas on the two retinas) that are at the same distance from the point of fixation form a circumference that passes through the point of fixation and through the nodal points of the two eyes (see Fig. 4.30). The set of points that are at the same distance from the point of fixation, and that therefore induce corresponding positions on the retinas (zero disparity), forms the horopter. It can be shown that these points subtend the same vergence angle and lie on the Vieth-Müller circle, which contains the fixation point and the nodal points of the eyes.
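The statement that all points of the Vieth-Müller circle subtend the same angle at the two nodal points is the inscribed-angle theorem; a small numerical check (our own sketch, with arbitrary but plausible geometry) is the following:

import numpy as np

def subtended_angle(p, n_left, n_right):
    """Angle under which the segment joining the two nodal points
    n_left, n_right is seen from the point p (the vergence angle)."""
    u, v = n_left - p, n_right - p
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

# Nodal points of the two eyes and a circle passing through them
n_left, n_right = np.array([-0.032, 0.0]), np.array([0.032, 0.0])
center, radius = np.array([0.0, 0.25]), np.hypot(0.032, 0.25)

# Sample points on the upper arc (the side of the fixation point)
for t in np.linspace(0.35 * np.pi, 0.65 * np.pi, 4):
    p = center + radius * np.array([np.cos(t), np.sin(t)])
    print(f"{subtended_angle(p, n_left, n_right):.3f} deg")  # all equal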
In reality, the shape of the horopter changes with the distance of the fixation point, the contrast, and the lighting conditions. When the fixation distance increases, the shape of the horopter tends to become a straight line, and eventually it becomes again a curve, with its convexity facing the observer.
All the points of the scene included in the visual field but located outside the horopter stimulate non-corresponding retinal areas and will therefore be perceived as blurred or tendentially double, thus giving rise to diplopia (see Fig. 4.30).
Although all the points of the scene outside the horopter curve were defined as diplopic, in 1858 Panum showed that there is an area near the horopter curve within
which these points, although stimulating retinal zones that are not perfectly corre-
spondent, are still perceived as single. This area in the vicinity of the horopter is
called Panum area (see Fig. 4.31a). It is precisely these minimal differences between
the retinal images, relative to the points of the scene located in the Panum area,
which are used in the stereoscopic fusion process to perceive depth. Points at the same distance from the observer as the fixation point produce zero disparity. Points closer to and farther from the fixation point produce disparities that are measured through a neurophysiological process.
Retinal disparity is the condition in which the lines of sight of the two eyes do not intersect at the point of fixation, but in front of or behind it (within the Panum area). When processing a closer object, the visual axes converge, and the projection of an object in front of the fixation point leads to Crossed Retinal Disparity (CRD). On the other hand, when processing a more distant object, the visual axes diverge, and the projection of an object behind the fixation point leads to Uncrossed Retinal Disparity (URD). In particular, as shown in Fig. 4.31b, the point V nearer than P induces a CRD disparity and its image VL is shifted to the left in the left


Fig. 4.31 a Panum area: the set of points of the field of view, in the vicinity of the horopter curve, which, despite stimulating retinal zones that are not perfectly corresponding, are still perceived as single elements. b Objects within the horopter (closer to the observer than the fixation point P) induce crossed retinal disparity. c Objects outside the horopter (farther from the observer than the fixation point P) induce uncrossed retinal disparity

eye and to the right VR in the right eye. For the point L, farther than the fixation point P, its projection induces a URD disparity, that is, the image LL is shifted to the right in the left eye and the image LR to the left in the right eye (see Fig. 4.31c). It is observed that the point V, being nearest to the eyes, has the greater disparity.
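A minimal sketch of this sign convention, using an idealized pair of eyes with parallel axes (a simplification of the verging geometry of the figure; names and values are ours): the disparity relative to the fixation distance is positive (crossed) for nearer points and negative (uncrossed) for farther points.

def relative_disparity(Z, Z_fix, baseline=0.065, focal=0.017):
    """Disparity of a point at depth Z relative to the fixation depth
    Z_fix, for an idealized parallel-axis two-eye model:
    d(Z) = f * B / Z, so the relative disparity is d(Z) - d(Z_fix).
    Positive -> crossed (nearer than fixation), negative -> uncrossed."""
    return focal * baseline * (1.0 / Z - 1.0 / Z_fix)

Z_fix = 2.0                                  # fixation distance in metres
for Z, label in [(1.0, "nearer (V)"), (2.0, "fixated (P)"), (4.0, "farther (L)")]:
    d = relative_disparity(Z, Z_fix)
    print(f"{label}: {d * 1e3:+.3f} mm")     # +, 0, - respectively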

4.6.2 Stereoscopic Vision

The physiological evidence of depth perception through the estimated disparity value
is demonstrated with the stereoscope invented in 1832 by the physicist C. Wheatstone
(see Fig. 4.32). This tool allows the viewing of stereoscopic images.
The first model consists of a reflection viewer composed of two mirrors positioned at an angle of 45° with respect to the respective figures placed at the ends of the viewer. The two figures represent the image of the same object (in the figure, a truncated rectangular pyramid) depicted from slightly different angles. An observer could approach the mirrors and move the supports of the two representations back and forth until the two images reflected in the mirrors overlapped, perceiving a single 3D object. The principle of operation is based on the fact that to observe an object closely, the brain usually tends to converge the visual axes of the eyes, while in the stereoscopic viewer the visual axes must point separately and simultaneously at the left and right images to merge them into a single 3D object.
The observer actually looks at the two sketch images IL and IR of the object through
the mirrors SL and SR , so that the left eye sees the sketch IL of the object, simulating
the acquisition made with the left monocular system, and the right eye similarly sees
the sketch IR of the object that simulates the acquisition of the photo made by the right
monocular system. In these conditions, the observer, looking at the images formed
in the two mirrors SL and SR, has a clear perception of depth, having the impression of being in front of a solid figure. An improved and less cumbersome version of the reflecting stereoscope was made by D. Brewster. The reflecting stereoscope is still used today for photogrammetric surveys.


Fig. 4.32 The mirror stereoscope by Sir Charles Wheatstone from 1832 that allows the viewing of
stereoscopic images


Fig. 4.33 a Diagram of a simple stereoscope that allows the 3D recomposition of two stereograms with the help of magnifying lenses (optional). The lenses make it easier for the eyes to remain parallel, as if looking at a distant object, and to focus simultaneously, at a distance of about 400 mm, the individual stereograms with the left and right eye, respectively. b Acquisition of the two stereograms

Figure 4.33 schematizes a common stereoscope, showing how it is possible to have the perception of depth (relief) AB by observing the photos of the same object acquired from slightly different points of view (with a disparity compatible with the human visual system). The perceived relief is given by the length of the object, represented by the segment AB perpendicular to the vertical plane passing through the two optical centers of the eyes. The disparity, that is, the difference in position of an observed point of the world in the two retinas, is measured in terms of pixels in the two images or as an angular discrepancy (angular disparity). The brain uses the disparity information to derive depth perception (see Fig. 4.29).
The stereoscopic approach is based on the fact that the two images that form
on the two eyes are slightly different and the perception of depth arises from the
cerebral fusion of the two images. The two representations of Fig. 4.33a must be
the stereo images that reproduce the object as if it were photographed with two
monocular systems with parallel axes (see Fig. 4.33b), distant 65−70 mm (average
distance between the two eyes).
Subsequently, alternative techniques were used for viewing stereograms. For
example, in the stereoscopic cinematic view, the films have each frame consist-
ing of two images (the stereoscopic pair) of complementary colors. The vision of
these films takes place through the use of glasses, whose lenses have complementary
colors to those of the images of the stereoscopic pair. The left or right lens has a
color complementary to that of the image more to the left, or more to the right, in the frame. In essence, the two eyes see the two images separately, as if they were real, and the brain can reconstruct the depth. This technique has not been very successful in the field of cinema.
In photographic technique, this method of viewing stereograms for the 3D perception of objects starting from two-dimensional images is called anaglyph. In 1891 Louis Arthur Ducos du Hauron produced the first anaglyphic images, starting from the negatives of a pair of stereoscopic images taken with two lenses placed side by side and printing them both, almost overlapping, in complementary colors

(red and blue or green) on a single support, to be then observed with glasses having, in place of lenses, two filters of the same complementary colors.
The effect of these filters is to show each eye only one of the two stereo images. In this way, the perception of the 3D object is only an illusion, since the physical object does not exist: what the human visual system really acquires from the anaglyph are the 2D images of the object projected on the two retinas, and the brain evaluates a disparity measure for each elementary structure present in the images (remember that the image pair makes up the stereogram).
The 3D perception of the object, realized with the stereoscope, is identical to the sensation that the brain would have when seeing the 3D physical object directly.
Today anaglyphs can be generated electronically by displaying the pair of images of a stereogram on a monitor and perceiving the 3D object by observing the monitor with glasses containing a red filter on one eye and a green filter on the other. The pair of images can also be displayed superimposed on the monitor, with shades of red for the left image and shades of green for the right image. The two images, being 2D projections of the object observed from slightly different points of view, would appear confused when observed superimposed on the monitor, but with red and green filter glasses the brain fuses the two images, perceiving the 3D source object.
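A minimal sketch of this electronic generation of a red-green anaglyph from a stereo pair, assuming two grayscale images of equal size (function and variable names are ours):

import numpy as np

def make_anaglyph(left_gray, right_gray):
    """Combine a grayscale stereo pair into a red-green anaglyph:
    the left image goes into the red channel and the right image
    into the green channel, so red/green filter glasses route one
    image to each eye."""
    h, w = left_gray.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    rgb[..., 0] = left_gray          # red channel   <- left image
    rgb[..., 1] = right_gray         # green channel <- right image
    return rgb

# Synthetic example: two random "retinal" images of the same size
rng = np.random.default_rng(1)
left = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)
right = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)
print(make_anaglyph(left, right).shape)   # (240, 320, 3)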
A better visualization is obtained by alternately displaying on the monitor the two images of the pair, at a frequency compatible with the persistence of the images on the retina.

4.6.3 Stereopsis

Julesz [16] in 1971 showed how it can be relatively simple to evaluate disparity to obtain depth perception, and provided neurophysiological evidence that neural cells in the cortex are able to select elementary structures in pairs of retinal images and measure the disparity present. It is not yet clear in biological terms how the brain fuses the two images, giving the perception of a single 3D object.
He conceived the random-dot stereograms (see Fig. 4.34) as a tool to study the working mechanisms of the binocular vision process. Random-dot stereograms are generated by a computer producing two identical images of point-like structures randomly arranged with uniform density, essentially generating a texture of black and white dots. Next, a central window is selected in each stereogram (see Fig. 4.35a and b) and shifted horizontally by the same amount D, to the right in the left stereogram and to the left in the right one. After shifting the central square to the right and to the left (homonymous or uncrossed disparity), the remaining area left without texture is filled with the same kind of background texture (black and white random dots). In this way, the two central windows are immersed and camouflaged in the background with an identical texture.
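A minimal sketch of this construction (a simplified version of Julesz's procedure; sizes, names, and the choice of binary dots are ours):

import numpy as np

def random_dot_stereogram(size=200, win=80, D=6, seed=0):
    """Build a Julesz-style random-dot stereogram pair.
    A central win x win window is shifted right by D pixels in the
    left image and left by D in the right image (uncrossed disparity);
    the uncovered strips are refilled with fresh random dots."""
    rng = np.random.default_rng(seed)
    base = rng.integers(0, 2, size=(size, size), dtype=np.uint8)
    left, right = base.copy(), base.copy()

    r0 = (size - win) // 2
    c0 = (size - win) // 2
    window = base[r0:r0 + win, c0:c0 + win].copy()

    # Left image: window shifted D pixels to the right
    left[r0:r0 + win, c0 + D:c0 + win + D] = window
    left[r0:r0 + win, c0:c0 + D] = rng.integers(0, 2, size=(win, D), dtype=np.uint8)

    # Right image: window shifted D pixels to the left
    right[r0:r0 + win, c0 - D:c0 + win - D] = window
    right[r0:r0 + win, c0 + win - D:c0 + win] = rng.integers(0, 2, size=(win, D), dtype=np.uint8)
    return left, right

L, R = random_dot_stereogram()
print(L.shape, R.shape, (L != R).any())   # (200, 200) (200, 200) True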
When each stereogram is seen individually it appears (see Fig. 4.34) as a single
texture of randomly arranged black and white dots. When the pair of stereograms
is seen instead binocularly, i.e., an image is seen from one eye, and the other from

Fig. 4.34 Random-dot stereogram


Fig. 4.35 Construction of the Julesz stereograms of Fig. 4.34. a The central window is shifted horizontally to the right by D pixels and the void left behind is covered with the same kind of texture (white and black random dots) as the background. b The same operation is repeated as in (a) but with a horizontal shift D to the left. c If the two constructed stereograms are observed simultaneously with a stereoscope, due to the disparity introduced, the central window is perceived as raised with respect to the background

the other eye, the brain performs the stereopsis process, that is, it fuses the two stereograms perceiving depth information, and the central square appears raised with respect to the background texture, toward the observer (see Fig. 4.35c).

In this way, Julesz has shown that the perception of the emerging central square
at a different distance (as a function of D) from the background, is only to be
attributed to the disparity determined by the binocular visual system that surely
performs first the correspondence activity between points of the background
(zero disparity) and between points of the central texture with disparities D. In
other words, the perception of the central square raised by the background is
due only to the disparity measure (no other information is used) that the brain computes through the process of stereogram fusion.

Fig. 4.36 Stereograms of random points built with crossed disparity, that is, with the central window moved horizontally outward in both stereograms. The stereo viewing of the stereograms, with the central windows superimposed in green and red, is done using glasses with filters of the same colors as the constructed stereograms

This demonstration makes it difficult to support any other theory of stereo vision based, for example, on a priori knowledge of what is being observed or on the fusion of particular structures of the monocular images (for example, the contours). If in the construction of the stereograms the central windows are moved in the direction opposite to that of Fig. 4.35a and b, that is, to the left in one stereogram and to the right in the other (crossed disparity), with stereoscopic viewing the central window is perceived as emerging on the opposite side of the background, that is, moving away from the observer (see Fig. 4.36). Within certain limits, the higher the disparity (homonymous or crossed), the easier the relief is perceived.

4.6.4 Neurophysiological Evidence of Stereopsis

Different biological systems use stereo vision, while others (rabbits, fish, etc.) observe the world with panoramic vision, that is, their eyes are placed so as to observe different parts of the world and, unlike stereo vision, the pairs of images do not share overlapping areas of the observed scene for depth perception.
Studies by Wheatstone and Julesz have shown that binocular disparity is the key feature for stereopsis. Let us now look at some neurophysiological mechanisms related to retinal disparity. In Sect. 3.2 Vol. I it was shown how the visual system propagates the light stimuli on the retina and how impulses propagate from it to the brain components. The stereopsis process uses information from the striate cortex and other levels of the binocular visual system to represent the 3D world.
The stimuli coming from the retina through the optic tract (containing fibers of both eyes) are transmitted up to the lateral geniculate nucleus (LGN), which functions as a thalamic relay station, subdivided into 6 laminae, for the sorting of the

different information (see Fig. 4.37). In fact, the fibers coming from each retina are composed of axons deriving from the large ganglion cells (of type M), from small ganglion cells (of type P), and from small ganglion cells of type non-M and non-P, called koniocellular or K cells. The receptive fields of the ganglion cells are circular and of center-ON and center-OFF type (see Sect. 4.4.1). The M cells
are connected with a large number of photoreceptor cells (cones and rods) through
the bipolar cells and for this reason, they are able to provide information on the
movement of an object or on rapid changes in brightness. The P cells are connected
with fewer receptors and are suitable for providing information on the shape and
color of an object.
In particular, some distinguishing features of the M and P cells should be highlighted. The former are not very sensitive to different wavelengths, are most selective at low spatial frequencies, have a fast temporal response and high conduction velocity, and wide dendritic branching. The P cells, on the other hand, are selective for different wavelengths (color) and for high spatial frequencies (useful for capturing details, having small receptive fields), and have low conduction speed and temporal resolution. The K cells are very selective for different wavelengths and do not respond to orientation. Laminae 1 and 2 receive the signals of the M cells, while the remaining 4 laminae receive the signals of the P cells. The interlaminar layers receive the signals from the koniocellular K cells. The receptive fields of the K cells are also circular and of center-ON and center-OFF type. Like the P cells, they are color-sensitive, but with the specificity that their receptive fields are opponent for red-green and blue-yellow.4
As shown in the figure, the information of the two eyes is transmitted separately to the different LGN laminae, in such a way that the nasal hemiretina covers the hemifield of view on the temporal side, while the temporal hemiretina of the opposite eye covers the hemifield of view on the nasal side. Only in the first case does the information of the two eyes cross. In particular, laminae 1, 4, and 6 receive information from the nasal retina of the opposite eye (contralateral), while laminae 2, 3, and 5 from
4 In addition to the trichromatic theory (based on three types of cones sensitive to red, green, and blue, the combination of which determines the perception of color in relation to the incident light spectrum), Hering (1834–1918) proposed the theory of opponent colors. According to this theory, we perceive colors by combining 3 pairs of opponent colors: red-green, blue-yellow, and an achromatic channel (white-black) used for brightness. This theory foresees the existence in the visual system of two classes of cells, one selective for the opponent colors (red-green and yellow-blue) and one for brightness (black-white opponency). In essence, downstream of the cones (sensitive to red, green, and blue) adequate connections with bipolar cells would make it possible to have ganglion cells with the typical properties of chromatic opponency, having a center-periphery organization. For example, if a red light affects 3 cones R, G, and B connected to two bipolar cells β1 and β2 with the following cone-cell connection configuration β1 (+R, −G) and β2 (−R, −G, +B), we would have an excitation of the bipolar cell β1 stimulated by the cone R, sending the signal +R − G to its ganglion cell. On the other hand, a green light inhibits both bipolar cells. A green or red light inhibits the bipolar cell β2, while a blue light signals to its ganglion cell the signal +B − (G + R). Hubel and Wiesel demonstrated the presence of cells of the retina and the lateral geniculate nucleus (the P and the K cells) which exhibit the chromatic opponency properties organized with center-ON and center-OFF receptive fields.

the temporal retina of the eye of the same side (ipsilateral). In this way, each lamina
contains a representation of the contralateral visual hemifield (of the opposite side).
With this organization of information, in the LGN the spatial arrangement of the
receptive fields associated with ganglion cells is maintained and in each lamina the
complete map of the field of view of each hemiretina is stored (see Fig. 4.37).

4.6.4.1 Structure of the Primary Visual Cortex


The signals of the different types of ganglion cells are propagated in parallel by the LGN, through its neurons, toward different areas of the primary visual cortex, or V1, also known as striate cortex (see Fig. 4.37).
At this point, it can be stated that the signals coming from the two retinas have already undergone a pre-processing; on exiting the LGN layers, the topographic representation of the fields of view, although separated in the various LGN layers, is maintained and will continue to be maintained as the signals propagate toward the cortex V1. The primary visual cortex is composed of
a structure of 6 horizontal layers about 2 mm thick (see Fig. 4.38) each of which
contains different types of neural cells (central body, dendrites, and axons of various
proportions) estimated at a total of 200 million. The organization of the cells in each
layer is vertical and the cells are aligned in columns perpendicular to the layers.
The layers in each column are connected through the axons, making synapses along
the way. This organization allows the cells to activate simultaneously for the same
stimulus. The flow of information to some areas of the cortex can propagate up or
down the layers.
In the cortex V1, from the structural point of view, two types of neurons are distinguished: stellate cells and pyramidal cells. The former are small, with spiny dendrites. The latter have a single large apical dendrite which, as we shall see, extends into all the layers of the cortex V1.
Layer I receives the distal dendrites of the pyramidal neurons and the axons of the koniocellular pathways. These last cells make synapses in layers II and III, where there are small stellate and pyramidal cells.
The cells of layers II and III make synapses with the other cortical areas.
The IV layer of the cortex is divided into substrates IV A, IV B, and IV C.
The IV C layer has a further subdivision into IV Cα, and IV Cβ because of the
different connectivity found between the cells of the upper and lower parts of these
substrates.
The propagation of information from the LGN to the primary visual cortex occurs
through the P (via parvocellular), M (via magnocellular) and K (via koniocellular)
channels. In particular, the axons of M cells transmit the information ending in
the substrate IV Cα, while the axons of the P cells end in the substrate IV Cβ. The
substrate IVB receives the input from the substrate IV Cα and its output is transmitted
in other parts of the cortex.


Fig. 4.37 Propagation of visual information from retinas to the Lateral Geniculate Nucleus (LGN)
through optic nerves, chiasm, and optic tracts. The information from the right field of view passes
to the left LGN and vice versa. The left field of view information is processed by the right LGN
and vice versa. The left field of view information that is seen by the right eye does not cross and is
processed by the right LGN. The opposite situation occurs for the right field of view information
seen by the left eye that is processed by the left LGN. The spatial arrangement of the field of
view is reversed and mirrored on the retina but the information toward the LGN propagates while
maintaining the topographic arrangement of the retina. The relative disposition of the hemiretinas
is mapped on each lamina (in the example, the points A, B, C, D, E, F)


Fig. 4.38 Cross-section of the primary visual cortex (V1) of the monkey. The six layers and relative
substrates with different cell density and connections with other components of the brain are shown

Layers V and VI, with a high density of cells, including pyramidal ones, transmit their output back to the superior colliculus5 and to the LGN, respectively. As shown in Fig. 4.37
the area of the left hemisphere of V 1 receives only the visual information related to
the right field of view and vice versa. Furthermore, the information that reaches the
cortex from the retina is organized in such a way as to maintain the hemiretinas of
origin, the cell type (P or M ) and the spatial position of the ganglion cells inside the
retina (see Fig. 4.42). In fact, the axons of the cells M and P transmit the information
of the retinas, respectively, in the substrates IV Cα and IV Cβ. In addition, the cells
close together in these layers receive information from the local areas of the retina,
thus maintaining the topographical structure of origin.

4.6.4.2 Neurons of the Primary Visual Cortex


Hubel and Wiesel (winners of the Nobel Prize in Physiology or Medicine in 1981) discovered some types of cells present in the cortex V1, called simple, complex, and hypercomplex cells. These cortical cells are sensitive to stimuli at different spatial orientations, with a resolution of approximately 10°.
The receptive fields of cortical cells, unlike those with a circular shape center-ON
and center-OFF of LGNs and ganglion cells, are rather long (rectangular) and more
extensive, and are classified into three categories of cells:

5 Organ that controls saccadic movements, coordinates visual and auditory information, directing
the movements of the head and eyes in the direction where the stimuli are generated. It receives
direct information from the retina and from different areas of the cortex.


Fig. 4.39 Receptive fields of cortical cells associated with the visual system and their response
to the different orientations of the light beam. a Receptive field of the simple cell of ellipsoidal
shape with respect to the circular one of ganglion cells and LGN. The diagram shows the maximum
stimulus only when the light bar is totally aligned with the ON area of the receptive field while it
remains inhibited when the light falls on the OFF zone. b Responses of the complex cell, with a
rectangular receptive field, when the inclination of a moving light bar changes. The arrows indicate
the direction of motion of the stimulus. From the diagram, we note the maximum stimulation when
the light is aligned with the axis of the receptive field and moves to the right, while the stimulus
is almost zero with motion in the opposite direction. c Hypercomplex cell responses when stimulated by a light bar that increases in length, exceeding the size of the receptive field. The behavior of these cells (called end-stopped cells) is such that the stimulus increases, reaching its maximum when the light bar completely covers the receptive field, but their activity decreases if stimulated with a longer light bar

1. Simple cells present rather narrow and elongated excitatory and inhibitory areas, with a specific orientation axis. These cells act as detectors of linear structures; in fact, they are well stimulated when a rectangular light beam is located in an area of the field of view and oriented in a particular direction (see Fig. 4.39a). The receptive fields of simple cells seem to be realized by the convergence of different receptive fields of adjacent cells of the substrate IV C. The latter, known as stellate cells, are small-sized neurons with circular receptive fields that receive signals from the cells of the geniculate body (see Fig. 4.40a) which, like retinal ganglion cells, are of center-ON and center-OFF type.
2. Complex cells have extended receptive fields, but without a clear zone of excitation or inhibition. They respond well to the motion of an edge with a specific orientation and direction of motion (good motion detectors, see Fig. 4.39b). Their receptive fields seem to be realized by the convergence of different receptive fields of more


Fig. 4.40 Receptive fields of simple and complex cortical cells, generated by multiple cells with
circular receptive fields. a Simple cell generated by the convergence of 4 stellate cells receiving the
signal from adjacent LGN neurons with circular receptive fields. The simple cell with an elliptic
receptive field responds better to the stimuli of a localized light bar oriented in the visual field. b A
complex cell generated by the convergence of several simple cells that responds better to the stimuli
of a localized and oriented bar (also in motion) in the visual field

simple cells (see Fig. 4.40b). The peculiarity of motion detection is due to two
phenomena.
The first occurs when the axons of different simple cells adjacent and with the
same orientation, but not with identical receptive fields, converge on a complex
cell which determines the motion from the difference of these different receptive
fields.
The second occurs when the complex cell can determine motion through different latency times in the responses of adjacent simple cells. Complex cells are very selective for a given direction, responding only when the stimulus moves in one direction and not in the other (see Fig. 4.39c). Compared to simple cells, complex cells are not conditioned by the position of the light beam (under stationary conditions) within the receptive field. The amount of the stimulus also depends on the length of the rectangular light beam that falls within the receptive field. They are located in layers II and III of the cortex and in the boundary areas between layers V and VI.
3. Hypercomplex cells represent a further extension of the process of visual information processing and an advancement in the knowledge of the biological visual system.
Hypercomplex cells (known as end-stopped cells) respond only if a light stimulus
has a given ratio between the illuminated surface and the dark surface, or comes
from a certain direction, or includes moving forms. Some of these hypercomplex
cells respond well only to rectangular beams of light of a certain length (com-
pletely covering the receptive field), so that if the stimulus extends beyond this
length, the response of the cells is significantly reduced (see Fig. 4.39c). Hubel
and Wiesel characterize these receptive fields as containing activating and antago-
nistic regions (similar to excitatory/inhibitory regions). For example, the left half
of a receptive field can be the activating region, while the antagonistic region is
on the right. As a result, the hypercomplex cell will respond, with the spatial sum-
mation, to stimuli on the left side (within the activation region) to the extent that
it does not extend further into the right side (antagonistic region). This receptive
field would be described as stopped at one end (i.e., the right). Similarly, hyper-
complex receptive fields can be stopped at both ends. In this case, a stimulus that
extends too far in both directions (for example, too left or too far to the right) will
begin to stimulate the antagonistic region and reduce the signal strength of the
cell. The hypercomplex cells occur when the axons of some complex cells, with
adjacent receptive fields and different in orientation, converge in a single neuron.
These cells are located in the secondary visual area (also known as V5 and MT).
Following the experiments of Hubel and Wiesel, it was discovered that even some
simple and complex cells exhibit the same property as the hypercomplex, that is,
they have end-stopping properties when the luminous stimulus exceeds a certain
length overcoming the margins of the same receptive field.

From the properties of the neural cells of the primary visual cortex, a computational model emerges with principles of self-learning that explains the ability to sense structures (for example, lines, points, bars) present in the visual field and their motion. Furthermore, we observe a hierarchical model of visual processing that starts from the lowest level, the level of the retinas that contains the scene information (in the field of view), then the LGN level that captures the position of the objects, the level of simple cells that detect the orientation of elementary structures (lines), the level of complex cells that detect their movement, and the level of hypercomplex cells that perceive the object, its edges, and their orientation. The functionality of simple cells can be modeled using Gabor filters to describe their sensitivity to the orientation of a linear light beam.
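A minimal sketch of such a Gabor model of a simple cell's receptive field (our own illustration; parameter values and the dot-product response measure are illustrative, not physiological estimates):

import numpy as np

def gabor_kernel(size=31, wavelength=8.0, theta=0.0, sigma=4.0, phase=0.0):
    """2D Gabor filter: a sinusoidal grating of given wavelength and
    orientation theta (radians), windowed by a Gaussian envelope.
    Often used as a model of the receptive field of a simple cell."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength + phase)
    return envelope * carrier

# Response of a bank of oriented "simple cells" to a vertical light bar
image = np.zeros((31, 31))
image[:, 14:17] = 1.0                               # thin vertical bar
for deg in (0, 30, 60, 90):
    k = gabor_kernel(theta=np.radians(deg))
    response = np.abs(np.sum(k * image))            # dot product with the receptive field
    # strongest response for the filter whose stripes align with the bar (theta = 0 here)
    print(f"{deg:3d} deg -> response {response:.2f}")

The printed responses exhibit the orientation tuning described above: the filter aligned with the bar responds much more strongly than the others.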
Figure 4.41 summarizes the connection scheme between the retinal photoreceptors
and the neural cells of the visual cortex. In particular, it is observed how groups of cones and rods are connected with a single bipolar cell, which in turn is connected with one of the ganglion cells from which the fibers afferent to the optic nerve originate; the exit point of the optic nerve (papilla) is devoid of photoreceptors. This architecture
suggests that stimuli from retinal areas of a certain extension (associated, for example,
with an elementary structure of the image) are conveyed into a single afferent fiber of
the optic nerve. In this architecture, a hierarchical organization of the cells emerges,
starting from the bipolar cells that feed the ganglion cells up to the hypercomplex
cells.
The fibers of the optic nerve coming from the medial half of the retina (nasal field)
intersect at the level of the optic chiasm to those coming from the temporal field and
continue laterally. From this it follows (see also Fig. 4.37) that after crossing in the
chiasm the right optic tract contains the signals coming from the left half of the
visual field and the left one the signals of the right half. The fibers of the optic tracts
reach the lateral geniculate bodies that form part of the thalamus nuclei: here there
is the synaptic junction with neurons that send their fibers to the cerebral cortex of
the occipital lobes where the primary visual cortex is located. The latter occupies the
terminal portion of the occipital lobes and extends over the medial surface of them
along the calcarine fissure (or calcarine sulcus).


Fig. 4.41 The visual pathway with the course of information flow from photoreceptors and retinal-
visual cortex cells of the brain. In particular, the ways of separation of visual information coming
from nasal and temporal hemiretinas for each eye are highlighted

4.6.4.3 Columnar Organization of the Primary Visual Cortex


In the second half of the twentieth century, several experiments analyzed the orga-
nization of the visual cortex and evaluated the functionality of the various neural
components. In fact, it has been discovered that the cortex is also radially divided
into several columns where neuronal cells that respond with the same characteristic
for stimuli arising from a given point of the field of view are aggregated.
This columnar aggregation actually forms functional units perpendicular to the surface of the cortex. It turns out that, by inserting a microelectrode perpendicularly through the various layers of cortex V1, all the neuronal cells (simple and complex) that it encounters respond to stimuli with the same preferred orientation, for example. Conversely, if the electrode is inserted parallel to the surface of the cortex, crossing several columns in the same layer, the different orientation preferences would be observed. This neural organization highlights how the responses of the cells that capture the orientation information are correlated with those of the visual field, maintaining, in fact, the topographic representation of the field of view.
Based on these results, Hubel and Wiesel demonstrated the periodic organization
of the visual cortex consisting of a set of vertical columns (each containing cells with
selective receptive fields for a given orientation) called hypercolumn. The hypercol-
umn can be thought of as a subdivision into vertical plates of the visual cortex (see
Fig. 4.42) which are repeated periodically (representing all perceived orientations).


Fig. 4.42 Columnar organization, perpendicular to the layers, of the cells in the cortex V 1. The
ocular dominance columns of the two eyes are indicated with I (the ipsilateral) and with C (the
contralateral). The orientation columns are indicated with oriented bars. Blob cells are located
between the columns of the layers II , III and V , VI

Each column crosses the 6 layers of the cortex and represents an orientation in the
visual field with an angular resolution of about 10◦ . The cells crossed by each column
respond to stimuli with the same orientation (orientation column) or to the input of the same eye (ocular dominance column, or ocular dominance plate). An adjacent column includes cells that respond to a slightly different orientation from the neighboring one, and possibly to the input of the same eye or of the other. The neurons in layer IV are an exception, as they may respond to any orientation or to just one eye.
From Fig. 4.42 it is observed how the signals of the M and P cells, relative to the
two eyes, coming from the LGN, are kept separated in the IV layer of the cortex and in
particular projected, respectively, in the substrate IVCα and IVCβ where monocular
cells are found with center-ON and center-OFF circular receptive fields. Therefore,

the signals coming from LGN are associated with one of the two eyes, never to both,
while each cell of the cortex can be associated with input from one eye or that of
the other. It follows that we have ocular dominance columns arranged alternately
associated with the two eyes (ipsi or contralateral), which extend horizontally in the
cortex, consisting of simple and complex cells.
The cells of the IVCα substrate propagate the signals to the neurons (simple cells)
of the substrate IV B. The latter responds to stimuli from both eyes (binocular cells),
unlike the cells of the IV C substrate whose receptive fields are monocular. Therefore,
the neurons of the IV B layer begin the process of integration useful for binocular
vision. These neurons are selective for detecting movement, and also for detecting direction, responding only if stimulated by a beam of light that moves in a given direction.
A further complexity of the functional architecture of V1 emerged with the discovery (in 1987), by means of a contrast medium, of another type of column alongside the ocular dominance columns, regularly spaced and localized in layers II-III and V-VI of the cortex. These columns are made up of arrays of neurons that receive input from the parvocellular and koniocellular pathways. They are called blobs, appearing (with the contrast medium) as leopard spots when viewed in tangential sections of the cortex. The characteristic of the neurons included in the blobs is that of being sensitive to color (i.e., to the different wavelengths of light, thanks to the information of the K and P channels) and to brightness (thanks to the information coming from the M channel).
Between the blobs there are regions with neurons that receive signals from the magnocellular pathways. These regions, called interblobs, contain orientation columns and ocular dominance columns whose neurons are motion-sensitive and nonselective for color. Therefore, the blobs are in fact modules in which the signals of the three channels P, M, and K converge, where it is assumed that these signals (i.e., the spectral and brightness information) are combined, and on them the perception of color and of brightness variation depends.
This organization of the cortex V1 in hypercolumns (also known as cortical modules), each of which receives input from the two eyes (orientation columns and ocular dominance columns), is able to analyze a portion of the visual field. Therefore, each module includes neurons sensitive to color, movement, and linear structures (lines or edges) for a given orientation and for an associated area of the visual field, and integrates the information of the two eyes for depth perception.
The orientation resolution in the various parallel cortical layers is 10◦ and the
whole module can cover an angle of 180◦ . It is estimated that a cortical module
that includes a region of only 2 × 2 mm of the visual cortex is able to perform a
complete analysis of a visual stimulus. The complexity of the brain is such that the
functionalities of the various modules and their total number have not been clearly
defined.

4.6.4.4 Area V1 Interaction with Other Areas of the Visual Cortex


Before going into how the information coming from the retinas in the visual cortex
is further processed and distributed, we summarize the complete pathways in the

visual system. The visual pathways begin with each retina (see Fig. 4.37), then leave
the eye by means of the optic nerve that passes through the optic chiasm (in which
there is a partial crossing of the nerve fibers coming from the two hemiretinas of each
eye), and then it becomes the optic tract (seen as a continuation of the optic nerve).
The optic tract goes toward the lateral geniculate body of the thalamus. From here
the fibers, which make up the optic radiations, reach the visual cortex in the occipital
lobes.
The primary visual cortex transmits most of the first processed information of the visual field to the adjacent secondary visual cortex V2, also known as area 18. Although most neurons in the V2 cortex have properties similar to those of neurons in the primary visual cortex, many others have the characteristic of being much more complex. From areas V1 and V2 the processed visual information continues toward the so-called associative areas, which process information at a more global level. These areas, in a progressive way, combine (associate) the first-level visual information with information deriving from other senses (hearing, touch, ...), thus creating a multisensory representation of the observed world.
Several studies have highlighted dozens of cortical areas that contribute to visual perception. Areas V1 and V2 are surrounded by several of these cortical and associative visual areas, called V3, V4, V5 (or MT), PO, TEO, etc. (see Fig. 4.43). From the visual area V1 two cortical pathways of propagation and processing of visual information branch out [17]: the ventral pathway, which extends to the temporal lobe, and the dorsal pathway, projected to the parietal lobe.


Fig. 4.43 Neuronal pathways involved in visuospatial processing. Distribution of information from
the retina to other areas of the visual cortex that interface with the primary visual cortex V 1. The
dorsal pathway, which includes the parietal cortex and its projections to the frontal cortex, is involved
in the processing of spatial information. The ventral pathway, which includes the inferior and lateral
temporal cortex and their projections to the medial temporal cortex, is involved in the processing
of recognition and semantic information

The main function of the ventral visual pathway (the channel of what is observed, i.e., the object recognition pathway) seems to be conscious perception, that is, making us recognize and identify objects by processing their intrinsic visual properties, such as shape and color, and storing such information in long-term memory. The basic function of the dorsal visual pathway (the channel of where an object is, i.e., the spatial vision pathway) seems to be associated with visual-motor control on objects, by processing their extrinsic properties, which are essential for their localization (and mobility), such as their size, position, and orientation in space, and for saccadic movements.
Figure 4.43 shows the connectivity between the various main areas of the cortex. The signals start from the ganglion cells of the retina and, through LGN and V1, branch out toward the ventral pathway (from V1 to V4, reaching the inferior temporal cortex IT) and the dorsal pathway (from V1 to V5, reaching the posterior parietal cortex), thus realizing a hierarchical connection structure. In particular, the parieto-medial temporal area integrates information from both pathways and is involved in the encoding of landmarks in spatial navigation and in the integration of objects into the structural environment.
The flow of information in the ventral visual channel for the perception of objects is summarized as follows:

Area V1-Primary Visual Cortex, has a retinotopic organization, meaning that it contains a complete map of the visual field covered by the two eyes. In this area, as we have seen, edges due to local variations of brightness in the image are detected along different orientation angles (orientation columns). Color information is kept separate, and spatial frequencies and depth perception information are also detected.
Area V2-Secondary Visual Cortex, where the detected edges are combined to develop a vocabulary of intersections and junctions, along with many other elementary visual features (for example, texture, depth, ...), fundamental for the perception of more complex shapes, together with the ability to distinguish whether a stimulus belongs to the background or to part of the object. The neural cells of V2, with properties similar to those of V1, encode these features in a wide range of positions, starting a process that ends with the neurons of the inferior temporal area (IT) in order to recognize an object regardless of where it appears in the visual field. The V2 area is organized in three modules, one of which, the thin stripes, receives the signals from the blobs. The other two modules receive the signals from the interblobs in the interstripes and in the thick stripes. From V2, and in particular from the thick stripes (depositories of movement and depth), it is believed that signals are transmitted to the V5 area. V1 and V2 basically produce the initial information related to color, motion analysis, and shape, which is then transmitted in parallel to the other areas of the extrastriate visual cortex for further specialized processing.
Area V3-Third visual cortex, receives many connections from area V2 and connects with the middle temporal (MT) areas. V3 neurons have properties identical to those of V2 (selectivity for orientation), but many other neurons have more complex properties that are not yet fully understood. Some of the latter are sensitive to color and movement, characteristics more commonly analyzed in other stages of the visual process.
Area V4, receives the information flow after processing in V1, V2, and V3 and continues processing the color information (received from the blobs and interblobs of V1) and form. In this area there are neurons with properties similar to those of other areas but with more extensive receptive fields than those of V1. This area still has to be analyzed in depth; it seems to be essential for the perception of extended and more complex contours.
Area IT-Inferior temporal cortex, receives many connections from area V4 and includes complex cells that have shown little sensitivity to the color and size of the perceived object but are very sensitive to shape. Studies have led to consider this area involved in face recognition and endowed with an important visual memory capacity.

The cortical areas of the dorsal pathway that terminate in the parietal lobe process the spatial and temporal aspects of visual perception. In addition to spatially locating the visual stimulus, these areas are also linked to aspects of movement, including eye movement. In essence, the dorsal visual pathway integrates the spatial information between the visual system and the environment for a correct interaction. The dorsal pathway includes several cortical areas, including in particular the Middle Temporal (MT) area, also called area V5, the Medial Superior Temporal (MST) area, and the lateral and ventral intraparietal areas (LIP and VIP, respectively).
The MT area is believed to contribute significantly to the perception of movement. This area receives the signals from V2, V3, and the substrate IVB of V1 (see Fig. 4.43). We know that the latter is part of the magnocellular pathways involved in the analysis of movement. Neurons of MT have properties similar to those of V1, but have more extensive receptive fields (covering up to tens of degrees of visual angle). They have the peculiarity of being activated only if the stimulus falling on their receptive field moves in a preferred direction.
The MST area is believed to contribute to the analysis of movement as well; it is sensitive to radial motion (that is, approaching or moving away from a point) and to circular motion (clockwise or counterclockwise). The neurons of MST are also selective for movements in complex configurations. The LIP area is considered to be the interface between the visual system and the oculomotor system. The neurons of the LIP and VIP areas (which receive the signals from V5 and MST) are sensitive to stimuli generated by a limited area of the field of view and are active for stimuli resulting from an ocular movement (also known as a saccade) in the direction of a given point in the field of view.
The brain can use this wealth of movement-related information, acquired through the dorsal pathway, for various purposes. It can acquire information about objects moving in the field of view, distinguish their motion from that induced by its own eye movements, and then act accordingly.
The activities of the visual cortex take place through various hierarchical levels
with the serial propagation of the signals and their processing also in parallel through
the different communication channels thus forming a highly complex network of


Fig. 4.44 Overall representation of the connections and the main functions performed by the various areas of the visual cortex. The retinal signals are propagated segregated through the magnocellular and parvocellular pathways, and from V1 they continue on the dorsal and ventral pathways. The ventral pathway specializes in the perception of form and color, while the dorsal pathway is selective for the perception of movement, position, and depth

circuits. This complexity is attributable in part to the many feedback loops that each of these cortical areas forms through its connections to receive and return information, considering all the ramifications that, for the visual system, originate at the receptors (with the ganglion neurons) and are then transmitted to the visual cortical areas through the optic nerve, chiasm and optic tract, and the lateral geniculate nucleus of the thalamus.
Figure 4.44 summarizes schematically, at the current state of knowledge, the main
connections and the activities performed by the visual areas of the cortex (of the
macaque), as the signals from the retinas propagate (segregated by the two eyes) in
such areas, through the parvocellular and magnocellular channels, and the dorsal and
ventral pathways.
From the analysis of the responses of different neuronal cells, it is possible to
summarize the main functions of the visual system realized through the cooperation
of the various areas of the visual cortex.

Color perception. The selective response to color is given by the P ganglion cells, which through the parvocellular channel of the LGN reach the cells of the substrate IVCβ of V1. From here the signal propagates to the other layers II and III of V1, in vertically organized cells that form the blobs. From there the signal propagates to the V4 area, directly and through the thin stripes of V2. V4 includes cells with larger receptive fields with selective capabilities to discriminate color even with lighting changes.
Perception of the form. As for color, the P ganglion cells of the retina, through the parvocellular channel of the LGN, transmit the signal to the cells of the substrate IVCβ of V1, but it then propagates to the interblob cells of the other layers II and III of V1. From here the signal propagates to the V4 area, directly and via the interstripes (also known as pale stripes) of V2 (see Fig. 4.44). V4 includes cells with larger receptive fields with selective capabilities to discriminate orientation as well as color.
Perception of movement. The signal from the M ganglion cells of the retina, through the magnocellular channel of the LGN, reaches the cells of the substrate IVCα of V1. From here, it propagates to the IVB layer of V1, which, as highlighted earlier, includes complex cells that are very selective for orientation, also in relation to movement. From layer IVB the signal propagates to area V5 (MT), directly and through the thick stripes of V2.
Depth perception. Signals from the LGN cells that enter the IVC substrates of the V1 cortex keep the information from the two eyes segregated. Subsequently, these signals are propagated to the other layers II and III of V1, and here appear, for the first time, cells with afferents coming from cells of both eyes (but not directly from the M and P cells), that is, binocular cells. Hubel and Wiesel classified the cells of V1 in relation to their level of excitation deriving from one eye or the other. Those deriving from the exclusive stimulation of a single eye are called ocular dominance cells, and are therefore monocular cells. The binocular cells are instead those excited by cells of the two eyes whose receptive fields simultaneously see the same area of the visual field. The contribution of the cells of a single eye can be dominant with respect to the other, or both may contribute with the same level of excitation; in the latter case we have perfectly binocular cells. With binocular cells it is possible to evaluate depth by estimating binocular disparity (see Sect. 4.6), which is at the basis of stereopsis. Although the neurophysiological basis of stereopsis is still not fully known, the functionality of binocular neurons is assumed to be guaranteed by the monocular cells of both eyes stimulated by corresponding receptive fields (possibly with different viewing angles) that are as compatible as possible in terms of orientation and position with respect to the point of fixation. With reference to Fig. 4.29, the area of the fixation point P generates identical receptive fields in the two eyes, stimulating a binocular cell (zero disparity) with the same intensity, while the stimulation of the two eyes will be differentiated (receptive fields slightly shifted with respect to the fovea) for the farthest zone (the point L) and the nearest one (the point V) with respect to the observer. In essence, the action potential of the most distant (corresponding) monocular cells is higher than that of those closer to the fixation point, and this behavior becomes a property of binocular disparity.

The current state of knowledge is based on the functional analysis of the neurons
located in the various layers of the visual areas, their interneural connectivity, and
the effects caused by the lesions in one or more components of the visual system.6

6 Biological evidence has shown that the stimulation of the nerve cells of the primary visual cortex through weak electrical impulses causes the subject to see elementary visual events, such as a colored spot or a flash of light, in the expected areas of the visual field. Given the one-to-one spatial correspondence between the retina and the primary visual area, the lesion of areas of the latter leads to blind areas (blind spots) in the visual field, even if some visual patterns are left unchanged. For example, the contours of a perceived object are spatially completed even if they overlap with the blind area. In humans, two associative or visual-psychic areas are located around the primary visual cortex, the parastriate area and the peristriate area. The electrical stimulation of the cells of these associative areas is found to generate the sensation of complex visual hallucinations corresponding to images of known objects or even sequences of significant actions. The lesion or surgical removal of areas of these visual-psychic areas does not cause blindness but prevents the retention of old visual experiences; moreover, it generates disturbances in perception in general, that is, the impossibility of combining individual impressions into complete structures and the inability to recognize complex objects or their pictorial representation. However, new visual learning is possible, at least until the temporal lobe is removed (ablation). Subjects with lesions in the visual-psychic areas can describe single parts of an object and correctly reproduce the contour of the object but are unable to recognize the object as a whole. Other subjects cannot see more than one object at a time in the visual field. The connection between the visual-psychic areas of the two hemispheres is important for comparing the retinal images received by the primary visual cortex to allow 3D reconstruction of the objects.

4.6.5 Depth Map from Binocular Vision

After having analyzed the complexity of the biological visual system of primates, let
us now see how it is possible to imitate some of its functional capabilities by creating
a binocular vision system for calculating the depth map of a 3D scene, locating an
object in the scene and calculating its attitude. These functions are very useful for
navigating an autonomous vehicle and for various other applications (automation of
robot cells, remote monitoring, etc.).
Although we do not yet have a detailed knowledge of how the human visual system
operates for the perception of the world, as highlighted in the previous paragraphs, it
is hypothesized that different modules cooperate together for the perception of color,
texture, movement, and to estimate depth. Modules of the primary visual cortex have
the task of merging images from the two eyes in order to perceive the depth (through
stereopsis) and the 3D reconstruction of the visible surface observed.

4.6.5.1 Correspondence Problem


One of these modules may be one based on binocular vision for depth perception
as demonstrated by Julesz. In fact, with Julesz’s stereograms, made with random
points, depth perception is demonstrated through binocular disparity only. It follows
the need to solve the correspondence problem, that is, the identification of identical
elementary structures (or similar features), in the two retinal images of left and right,
which correspond to the same physical part of the observed 3D object. Considering
that the pairs of images (as in the biological vision) are slightly different from each
other (observation from slightly different points of view), it is plausible to think
that the number of identical elementary structures, present in any local region of
the retina, is small, and it follows that similar features found in the corresponding


Fig. 4.45 Map of the disparity resulting from the fusion of Julesz stereograms. a and b are the
stereograms of left and right with different levels of disparity; c and d, respectively, show the pixels
with different levels of disparity (representing four depth levels) and the red-blue anaglyph image
generated by the random-dot stereo pair from which the corresponding depth levels can be observed
with red-blue glasses

regions on each retina, can be assumed homologous, i.e., corresponding to the same
physical part of the observed 3D object.
Julesz demonstrated, using random-dot stereogram images, that this matching process applied to random dots succeeds in finding a large number of matches (homologous points in the two images) even with very noisy random-dot images. Figure 4.45 shows another example of a random-dot stereogram with central squares of different disparities, which are perceived as a depth map at different heights. On the other hand, with more complex images the matching process produces false targets, i.e., the search for homologous points in the two images fails [18].
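As a concrete illustration of how such stimuli can be produced, the following minimal Python/NumPy sketch generates a Julesz-style random-dot pair by shifting a central square of the noise pattern horizontally in one image and refilling the uncovered strip with fresh dots. The image size, square size, and disparity value are arbitrary choices for illustration, not parameters taken from Julesz's experiments.

```python
import numpy as np

def random_dot_stereogram(size=128, square=48, disparity=4, seed=0):
    """Generate a Julesz-style random-dot stereo pair.

    The left image is pure binary noise; in the right image a central
    square of the same noise is shifted horizontally by `disparity`
    pixels, and the uncovered strip is refilled with fresh noise.
    """
    rng = np.random.default_rng(seed)
    left = rng.integers(0, 2, (size, size)).astype(np.uint8)
    right = left.copy()

    r0 = (size - square) // 2
    c0 = (size - square) // 2
    # Shift the central square to the left by `disparity` pixels in the
    # right image (crossed disparity: the square is perceived as nearer).
    right[r0:r0 + square, c0 - disparity:c0 + square - disparity] = \
        left[r0:r0 + square, c0:c0 + square]
    # Fill the strip uncovered by the shift with new random dots, so that
    # neither image reveals the square monocularly.
    right[r0:r0 + square, c0 + square - disparity:c0 + square] = \
        rng.integers(0, 2, (square, disparity))
    return left, right

if __name__ == "__main__":
    L, R = random_dot_stereogram()
    print(L.shape, R.shape, (L != R).mean())  # fraction of differing pixels
```

Viewed monocularly, each image is pure noise; only the fusion of the pair reveals the square floating at a different depth.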

Fig. 4.46 Correspondence problem: ambiguity in finding homologous points in two images so that each point of the pair is the unique projection of the same 3D point

In these cases the observer can find himself in the situation represented in Fig. 4.46, where both eyes see three points, but the correspondence between the two retinal projections is ambiguous. In essence, we have the correspondence problem: how do we establish the true correspondence between the three points seen by the left retina and the three seen by the right retina, which are possible projections of the nine points present in the field of view? Nine candidate matches are plausible, and the observer could see different depth planes corresponding to the perceived false targets. Only three matches are correct (colored squares), while the remaining six are generated by false targets (false counterparts), indicated with black squares.
To solve the problem of ambiguous correspondences (a basically ill-posed problem), Julesz suggested introducing global constraints into the correspondence process: for example, considering as candidates for correspondence structures more complex than simple points (such as segments of oriented contours or particular textures), exploiting some physical constraints of the 3D objects, or imposing constraints on how homologous structures are searched in the two images (for example, searching for structures only along horizontal lines).
This stereo vision process is called by Julesz global stereo vision, and is probably based on a more complex neural process that selects local structures in images composed of elements with the same disparity. For the perception of more extended depth intervals the human visual system uses the movement of the eyes.
The global stereo vision mechanism introduced by Julesz is not inspired by neurophysiology but uses the physical phenomenon associated with the magnetic dipole, consisting of two point-like magnetic masses of equal value and opposite polarity placed at a small distance from each other. This model also includes the hysteresis phenomenon. In fact, once Julesz's stereograms have been fused by the observer, the disparity can be increased up to twenty times the limit of Panum's fusion area (the range within which two stereo images can be fused, normally 6–18 minutes of arc) without losing the sensation of stereoscopic vision.

In analogy with the magnetic dipole mechanism, fusion resembles the attraction generated between opposite poles: once they come into contact, it becomes difficult to separate them. The hysteresis phenomenon has influenced various models of stereo vision, including the cooperativity among local measures to reach global stereo vision.

4.6.6 Computational Model for Binocular Vision

The estimate of the distance of an object from the observer, i.e., the perceived depth
is determined in two phases: first, the disparity value is calculated (having solved
the correspondence of the homologous points) and subsequently, this measurement
is used, together with the geometry of the stereo system, to calculate the depth
measurement.
Following the Marr paradigm, these phases will have to include three levels for
estimating disparity: the level of computational theory, the level of algorithms, and the
level of algorithm implementation. Marr, Poggio, and Grimson [19] have developed
all three levels inspired by human stereo vision. Several researchers subsequently
applied some ideas of computational stereo vision models proposed by Marr-Poggio,
to develop artificial stereo vision systems.
In the previous paragraph, we examined the elements of uncertainty in the estimate of the disparity, known as the correspondence problem. Any computational
model chosen will have to minimize this problem, i.e., correctly search for homolo-
gous points in stereo images through a similarity measure that represents an estimate
of how similar such homologous points (or structures) are. In the computational
model of Marr and Poggio [20], different constraints (based on physical considerations) are considered to reduce the correspondence problem as much as possible.
These constraints are

Compatibility. The homologous points of stereo images must have a very similar
intrinsic physical structure if they represent the 2D projection of the same point
(local area) of the visible surface of the 3D object. For example, in the case of
random-dot stereograms, homologous candidate points are either black or white.
Uniqueness. A given point on the visible surface of the 3D object has a unique
position in space at any time (static objects). It follows that a point (or structure)
in an image has only one homologous point in the other image, that is, it has only
one candidate point as comparable: the constraint of uniqueness.
Continuity. The disparity varies smoothly in almost any area of the stereo image. This constraint is motivated by the physical coherence of the visible surface, in the sense that the surface is assumed to vary continuously without abrupt discontinuities. The constraint is obviously violated in the areas of surface discontinuity of the object and in particular in the contour area of the object.
Epipolarity. Homologous points must lie on the same (epipolar) line in the two stereograms. A sketch of how these constraints translate into a simple matching procedure follows this list.
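As a rough idea of how these constraints are commonly operationalized in artificial systems, the Python sketch below implements a generic correlation-based matcher (not one of the Marr-Poggio algorithms discussed next): candidates are searched only along the same row of rectified images (epipolarity), local neighborhoods are compared (compatibility), and a single best disparity is kept per pixel (uniqueness); continuity would be enforced by a subsequent smoothing or aggregation step, omitted here. The window size and disparity range are arbitrary illustrative values.

```python
import numpy as np

def ssd_disparity(left, right, max_disp=16, win=5):
    """Naive SSD block matching on rectified images.

    Epipolarity: candidates are searched only on the same row.
    Compatibility: local windows must look alike (minimum SSD).
    Uniqueness: exactly one disparity is kept per pixel (the argmin).
    Note: the right-image candidate is taken at x - d, the usual
    convention for upright rectified images (the sign convention of the
    retinal coordinates used later in the text differs).
    """
    h, w = left.shape
    half = win // 2
    disp = np.zeros((h, w), dtype=np.int32)
    L = left.astype(np.float64)
    R = right.astype(np.float64)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = L[y - half:y + half + 1, x - half:x + half + 1]
            best, best_d = np.inf, 0
            for d in range(max_disp + 1):          # same row only
                cand = R[y - half:y + half + 1, x - d - half:x - d + half + 1]
                ssd = np.sum((patch - cand) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[y, x] = best_d
    return disp
```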

To solve the problem of correspondence in stereo vision, by using these constraints,


Marr-Poggio developed (following the Marr paradigm) two stereo vision algorithms:
one based on cooperativity and the other based on the fusion of structures from coarse
to fine (called coarse-to-fine control strategy). These algorithms have been tested with
random-dot stereograms.
The first algorithm uses a neural network to implement the three previous stereo
vision constraints. When applied to random-dot stereograms these correspondence constraints become compatibility (black points match black points, white points match white points), uniqueness (a single correspondence for each point), and continuity (the disparity is constant or varies slightly in any area of a stereogram). The functional scheme of the neural network is a
competitive one where there is only one unit for each possible disparity between
homologous points of the stereograms.
Each neural unit represents a point on the visible surface, or a small surface element, at a certain depth. Each excited neural unit can, in turn, excite or inhibit the neural activity of other units, and its own activity is in turn increased or decreased by the excitation or inhibition it receives from the other units. Marr and Poggio have shown how the constraints of correspondence are satisfied in terms of excitation and inhibition in this neural model.
The compatibility constraint implies that a neural unit will initially be active if it is excited by similar structures from both stereograms, for example, homologous points that are both black or both white. The uniqueness constraint is incorporated through the inhibition that proceeds between units falling along the same line of sight, i.e., units that represent different disparity values for the same structure (homologous structures) inhibit each other.
The continuity constraint is incorporated by having excitation proceed between units that represent different (nonhomologous) structures at the same disparity. It follows that the network operates so that unique comparisons that
maintain structures of the same disparity are favored. Figure 4.47 shows the results of
this cooperative algorithm. The stereograms given as input to the neural network are shown; the network starts from an initial state, indicated with 0, that includes all possible matches within the predefined disparity interval. The algorithm performs different
iterations producing various maps of ever more precise depths that highlight, with
different levels of gray, the structures with different disparity values. The activity of
the neural network ends when it reaches a stable state, i.e., when the neural activity of
each unit is no longer modified in the last iterations. This algorithm provides a clear
example to understand the nature of human vision even if it does not fully explain
the mechanisms of human stereo vision.
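A minimal sketch of a cooperative network of this kind is given below, in Python/NumPy. It follows the excitation/inhibition scheme described above, but the neighborhood, gains, and threshold are arbitrary illustrative values, and the inhibition is simplified to act among the disparity units of the same pixel rather than along both lines of sight; it should therefore be read as an approximation in the spirit of the Marr-Poggio model, not a faithful reimplementation.

```python
import numpy as np

def cooperative_stereo(left, right, max_disp, n_iter=10,
                       excit=2.0, inhib=1.0, theta=3.0):
    """Sketch of a Marr-Poggio-style cooperative network.

    State C[y, x, d] = 1 means "pixel (y, x) has disparity d".
    """
    h, w = left.shape
    D = max_disp + 1
    # Compatibility: a unit starts active where left and right pixels
    # have the same value (black/black or white/white).
    C0 = np.zeros((h, w, D))
    for d in range(D):
        C0[:, d:, d] = (left[:, d:] == right[:, :w - d]).astype(float)
    C = C0.copy()
    for _ in range(n_iter):
        # Continuity: excitation from 8-neighbours at the same disparity
        # (np.roll wraps around the borders; ignored in this sketch).
        exc = np.zeros_like(C)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                exc += np.roll(np.roll(C, dy, axis=0), dx, axis=1)
        # Uniqueness: inhibition from the other disparity units of the
        # same pixel (a simplification of the "line of sight" rule).
        inh = C.sum(axis=2, keepdims=True) - C
        # Threshold nonlinearity, with the initial state as a bias term.
        C = ((excit * exc - inhib * inh + C0) > theta).astype(float)
    # Read out the disparity map: at most one winner per pixel.
    return C.argmax(axis=2)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    left = rng.integers(0, 2, (64, 64))
    right = np.roll(left, -3, axis=1)      # pattern shifted by 3 pixels
    print(cooperative_stereo(left, right, max_disp=5)[32, 10:20])
```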
The second algorithm of Marr and Poggio proposes to attenuate the number of
false targets or points observed as homologous, but belonging to different physical
points of the object. The false targets vary in relation to the number of points present
in the images, considered candidates for comparison, and in relation to the interval
of disparity within which it is plausible to verify the correspondence of the points.
Another element that induces false targets is the different contrast that can exist
between stereo images. From these considerations emerges the need to reduce as

Fig. 4.47 Results of the Marr-Poggio stereo cooperative algorithm. The initial state of the network,
which includes all possible matches within a predefined disparity range, is indicated with the map
0. With the evolution of iterations, the geometric structure present in the random-dot stereograms
emerges, and the different disparity values are represented with gray levels

much as possible the number of candidate correspondences and to become invariant to the possible contrast differences between the stereograms.
The functional scheme of this second algorithm is the following:

1. Each stereo image is analyzed with different spatial resolution channels and the
comparison is made between the elementary structures present in the images
associated with the same channel and for disparity values that depend on the
resolution of the channel considered.
2. The disparity estimates calculated with the coarse resolution channels can be used to guide the vergence movements of the eyes so as to align the elementary structures, comparing the disparities of the channels with finer resolution to find the correct match.
3. Once the correspondence is determined, associated with a certain resolution, the disparity values are maintained in a disparity map (2.5D sketch) acting as a memory buffer (to the function of this memory Marr and Poggio attribute the hysteresis phenomenon).

The stereo vision process with this algorithm begins by analyzing the images
with coarse resolution channels that generate elementary structures well separated
from each other, and then the matching process is guided to corresponding channels
with finer resolution, thus improving the robustness in determining the homologous
points.
The novelty in this second algorithm consists in selecting in the two images, as elementary structures for comparison, contour points; in particular, Marr and Poggio used the zero crossings, characterized by the sign of the contrast variation and by their local orientation. Grimson [19] has implemented this algorithm using random-dot stereograms at 50% density.
The stereograms are convolved with the LoG filter (Laplacian of Gaussian), with different values of σ, to produce multi-channel stereograms with different spatial resolution. Remember the relation $W = 2 \times 3\sqrt{2}\,\sigma$, introduced in Sect. 1.13 Vol. II, which links the dimension W of the convolution mask with σ, which controls the smoothing effect of the LoG filter. In Fig. 4.48 the three convolutions of the stereograms obtained with square masks of different sizes of 35, 17 and 9 pixels, respectively, are shown.
The zero crossings obtained from the convolutions are shown and it is observed
how the structures become more and more detailed as the convolution filter is smaller.
Points of zero crossing are considered homologous in the two images, if they have
the same sign and their local orientation remains within an angular difference not
exceeding 30◦ .
The comparison activity of the zero crossing starts with the coarse channels and
the resulting disparity map is very coarse. Starting from this map of rough disparity,
the comparison process analyzes the images convolved with the medium-sized filter
and the resulting disparity map is more detailed and precise. The process continues
using this intermediate disparity map that guides the process of comparing the zero


Fig.4.48 Zero crossing obtained through the multiscale LoG filtering applied to the pair of random-
dot stereo images of Fig. 4.45 (first column). The other columns show the results of the filtering
performed at different scales by applying the convolution mask of the LOG filter, of the square
shape, respectively, of sizes 35, 17 and 9

crossing in the last channel, obtaining a final disparity map with the finest resolution
and the highest density.
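The multiscale filtering step can be sketched with SciPy's gaussian_laplace function, as below. The mapping of the mask sizes 35, 17, and 9 pixels to σ values through W = 2 × 3√2 σ is an assumption based on the relation recalled above, and the simple sign-change test used to mark zero crossings is only one of several possible definitions.

```python
import numpy as np
from scipy import ndimage

def zero_crossings(img, sigma):
    """Zero crossings of the LoG-filtered image at scale sigma.

    Returns a boolean map of zero-crossing pixels and the gradient
    orientation of the filtered image (used by the matcher to check
    that candidate homologues have compatible sign and orientation).
    """
    log = ndimage.gaussian_laplace(img.astype(np.float64), sigma)
    # Mark pixels whose horizontal or vertical neighbour has opposite sign.
    zc = np.zeros(img.shape, dtype=bool)
    zc[:, :-1] |= (log[:, :-1] * log[:, 1:]) < 0
    zc[:-1, :] |= (log[:-1, :] * log[1:, :]) < 0
    gy, gx = np.gradient(log)
    orient = np.degrees(np.arctan2(gy, gx))
    return zc, orient

if __name__ == "__main__":
    # Coarse-to-fine: W = 2*3*sqrt(2)*sigma links mask width and sigma.
    img = (np.random.default_rng(0).random((128, 128)) > 0.5).astype(float)
    for w in (35, 17, 9):
        sigma = w / (6 * np.sqrt(2))
        zc, _ = zero_crossings(img, sigma)
        print(f"mask {w:2d} px  sigma {sigma:.2f}  zero-crossing pixels: {zc.sum()}")
```

As expected from the text, the larger masks (coarse channels) produce far fewer, more widely separated zero crossings than the 9-pixel mask.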
The compatibility constraint is satisfied by considering as candidates for comparison the zero-crossing structures that have the same sign and local orientation. The larger filter produces few zero-crossing candidates, due to the increased smoothing activity of the Gaussian component of the filter, and only the structures with strong variations in intensity are maintained (coarse channel). In these conditions, the comparison concerns the zero-crossing structures that, in the stereo images, lie within a predefined disparity interval (the Panum fusion interval, which depends on the binocular system used) and within the width W of the filter used, which we know depends on the parameter σ of the LoG filter.
The constraints of the comparison, together with the quantitative relationship that links these latter parameters (filter width and default range of disparities), make it possible to optimize the number of positive comparisons between the homologous zero-crossing structures, reducing false negatives (failing to detect homologous zero-crossing structures that actually exist) and false positives (detecting homologous zero-crossing structures that should not exist; for example, a "homologous" zero crossing found in the other image is generated by noise, or the true homologous point in the other image is not visible because it is occluded and another nonhomologous zero crossing is chosen instead) (see Fig. 4.49).
Once the candidate homologous points are found in the pair of images with low
spatial resolution channels, these constitute the constraints to search for zero crossing
structures in the images filtered at higher resolution and the search for zero crossing


Fig. 4.49 Scheme of the matching process to find homologous zero-crossing points in stereo images. a One zero crossing L in the left image has a high probability of finding its homologue R with disparity d in the right image if d < W/2. b Another possible configuration is to find the homologue in the whole range W, or a false homologue F with 50% probability, but R always remains a candidate homologue. c To disambiguate the false homologues, the comparison between the zero crossings is performed first from the left image to the right one and then vice versa, obtaining that L2 can have R2 as homologue, while R1 has L1 as homologue

homologues must take place within a disparity range that is twice the width W of the current filter. Consider a zero crossing L at a given position in the left image (see Fig. 4.49a) and another zero crossing R in the right image that is homologous to L (i.e., has the same sign and orientation) with a disparity value d; F indicates a possible false match near R. From the statistical analysis, it is shown that R is the counterpart of L within a range ±W/2 with a probability of 95% if the maximum disparity is d = W/2. In other words, given a zero crossing at some position in the filtered image, it has been shown that the probability of the existence of another zero crossing in the range ±W/2 is 5%. If the correct disparity is not in the range ±W/2, the probability of a match is 40%.
For d > W/2 and d ≤ W we have the same probability of 95% that R is the only homologous candidate for L, with disparity from 0 to W, if the value of d is positive (see Fig. 4.49b). A probability of 50% is also statistically determined for a false correspondence within the 2W disparity interval between d = −W and d = W. This means that 50% of the time there is ambiguity in determining the correct match, both in the disparity interval (0, W) (convergent disparity) and in the interval (−W, 0) (divergent disparity), where only one of the two cases is correct. If d is around zero, the probability of a correct match is 90%. Therefore, from the figure we have that F is a false match candidate, with a probability of 50%, but the possible match with R also remains.
To determine the correct match, Grimson proposed a matching procedure that first compares the zero crossings from the left to the right image, and then from the right to the left image (see Fig. 4.49c). In this case, starting the comparison from the left to the right image, L1 can ambiguously correspond to R1 or to R2, but L2 has only R2 as its counterpart. From the right-hand side, the correspondence is unique for R1, which matches only L1, but is ambiguous for R2.
Combining the two situations together, the two unique matches provide the correct
solution (constraint of uniqueness). It is shown that if more than 70% of the zero crossings match in the range (−W, +W), then the disparity interval is correct

Fig. 4.50 Map of disparity resulting from the second Marr-Poggio algorithm applied to the random-dot stereograms of Fig. 4.48 with 4 levels of depth

(satisfies the continuity constraint). Figure 4.50 shows the results of this algorithm
applied to the random-dot stereograms of Fig. 4.48 with four levels of depth.
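A generic left-right consistency check in the spirit of Grimson's two-way comparison can be sketched as follows; the tolerance parameter, the use of dense disparity maps (rather than sparse zero-crossing lists), and the x − d sign convention are simplifications for illustration.

```python
import numpy as np

def left_right_consistent(disp_lr, disp_rl, tol=1):
    """Keep only disparities confirmed in both matching directions.

    disp_lr[y, x] : disparity found matching left -> right
    disp_rl[y, x] : disparity found matching right -> left
    A left pixel x with disparity d maps to right pixel x - d; the
    match is accepted only if that right pixel maps back to (about) x.
    """
    h, w = disp_lr.shape
    ok = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disp_lr[y, x]
            xr = x - d
            if 0 <= xr < w and abs(disp_rl[y, xr] - d) <= tol:
                ok[y, x] = True
    valid = np.where(ok, disp_lr, -1)   # -1 marks rejected matches
    return valid, ok

# The acceptance criterion mentioned in the text can then be applied:
# if more than ~70% of the candidates survive the check within
# (-W, +W), the disparity range being explored is considered correct.
```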
As previously indicated, the disparity values found in the intermediate steps of this coarse-to-fine comparison process are saved in a temporary memory buffer, also called the 2.5D sketch map. The function of this temporary memory of the correspondence process is considered by Marr and Poggio to be the equivalent of the hysteresis phenomenon initially proposed by Julesz to explain the biological fusion process. This algorithm does not fully account for all the psycho-biological evidence of human vision. This computational model of stereo vision has been revised by other researchers, and others have proposed different computational modules where the matching process is seen as integrated with the process of extracting the candidate elementary structures for comparison (primal sketch). These latest computational models contrast with Marr's idea of stereo vision, which sees the early vision modules for the extraction of elementary structures as separate.

4.6.7 Simple Artificial Binocular System

Figure 4.51 shows a simplified diagram of a monocular vision system; we can see how the 3D scene is projected onto the 2D image plane, essentially reducing the original information of the scene by one dimension. This loss of information is caused by the perspective nature of the projection, which makes the apparent size of a geometric structure of an object ambiguous in the image plane: it can appear of the same size whether it is near or farther away from the capture system. This ambiguity depends on the inability of the monocular system to recover the information lost with the perspective projection process. To solve this problem, in analogy with human binocular vision, the artificial binocular vision scheme shown in Fig. 4.52 is proposed, consisting of two cameras located in slightly different positions along the X-axis. The acquired images are called stereoscopic image pairs or stereograms. A stereoscopic vision system produces a depth map, that is, the distance


Fig. 4.51 Simplified scheme of a monocular vision system. Each pixel in the image plane captures
the light energy (irradiance) reflected by a surface element of the object, in relation to the orientation
of this surface element and the characteristics of the system


Fig. 4.52 Simplified diagram of a binocular vision system with parallel and coplanar optical axes.
a 3D representation with the Z axis parallel to the optical axes; b Projection in the plane X −Z of
the binocular system

between the cameras and the visible points of the scene projected in the stereo image
planes.
The gray level of each pixel of the stereo images is related to the light energy
reflected by the visible surface projected in the image plane, as shown in Fig. 4.51.
With binocular vision, part of the 3D information of the visible scene is recov-
ered through the gray level information of the pixels and through the triangulation
process that uses the disparity value for depth estimation. Before proceeding to the
formal calculation of the depth, we analyze some geometric notations of stereometry.
Figure 4.52 shows the simplest geometric model of a binocular system, consisting of two cameras arranged with parallel and coplanar optical axes separated by a value b, called the baseline, in the direction of the X axis. In this geometry, the two image planes are also coplanar, at the focal distance f with respect to the optical center of the left lens, which is the origin of the stereo system.
An element P of the visible surface is projected by the two lenses onto their respective retinas, in PL and in PR. The plane passing through the optical centers CL and CR of
the lenses and the visible surface element P is called epipolar plane. The intersection
of the epipolar plane with the plane of the retinas defines the epipolar line. The Z axis
coincides with the optical axis of the left camera. Stereo images are also vertically
aligned and this implies that each element P of the visible surface is projected onto
the two retinas maintaining the same vertical coordinate Y . The constraint of the
epipolar line implies that the stereo system does not present any vertical disparity.
Two points found in the two retinas along the same vertical coordinate are called
homologous points if they derive from the perspective projection of the same element
of the visible surface P. The disparity measure is obtained by superimposing the two
retinas and calculating the horizontal distance of the two homologous points.

4.6.7.1 Depth Calculation


With reference to Fig. 4.52a, let P be the visible 3D surface element of coordinates (X, Y, Z), considering the origin of the stereo system coinciding with the optical center of the left camera. From the comparison of the similar triangles MPCL and C′L PL CL in the XZ plane, we can observe that the ray passing through P and the center of the lens CL intersects the retinal plane Z = −f at the point PL of the left retina, whose horizontal coordinate XL is obtained from the following relation:

$$\frac{X}{Z} = \frac{-X_L}{f} \qquad (4.1)$$

from which

$$X_L = -X\,\frac{f}{Z} \qquad (4.2)$$
Similarly, considering the vertical plane YZ, from the comparison of similar triangles, the vertical coordinate YL of the same point P projected on the left retina in PL is given by

$$Y_L = -Y\,\frac{f}{Z} \qquad (4.3)$$

Similar equations are obtained by comparing the similar triangles PNCR and C′R PR CR in the XZ plane, where the ray passing through P and the center of the lens CR intersects the plane of the right retina at the point PR, whose horizontal and vertical coordinates are obtained from the following relation:

$$\frac{X - b}{Z} = \frac{b - X_R}{f} \qquad (4.4)$$

from which

$$X_R = b - f\,\frac{X - b}{Z} \qquad (4.5)$$
where we remember that b is the baseline (the separation distance of the optical axes). In the vertical plane YZ, the vertical coordinate YR is similarly calculated:

$$Y_R = -Y\,\frac{f}{Z} \qquad (4.6)$$
From the geometry of the binocular system (see Fig. 4.52b) we observe that the depth
of P coincides with the value Z, that is, the distance of P from the plane passing
through the two camera optical centers and parallel to the image plane.7 To simplify
the derivation of the equation calculating the value of Z for each point P of the
scene, it is convenient to consider the coordinates of the projection of P in the local
reference systems to the respective left and right image planes.
These new reference systems have their origin at the center of each retina, with respect to which the coordinates of the projections of P are expressed by operating a simple translation from the global coordinate X to the local ones xL and xR, respectively, for the left retina and the right retina. It follows that these local coordinates xL and xR, having their origin at the center of the respective retinas, can also assume negative values, as can the global coordinate X, while Z is always positive. In Fig. 4.52b the global Y axis and the local coordinate axes yR and yL are not shown because they are perpendicular to the page and point out of it.
Considering the new local coordinates for each retina, and the new relations derived from the same similar right-angled triangles (MPCL and C′L PL CL for the left retina, and NPCR and C′R PR CR for the right retina), the following relationships are found:

$$\frac{-x_L}{f} = \frac{X}{Z} \qquad\qquad \frac{-x_R}{f} = \frac{X - b}{Z} \qquad (4.7)$$
from which it is possible to derive (similarly for the coordinates yL and yR ), the new
equations for the calculation of the horizontal coordinates xL and xR
$$x_L = -\frac{X}{Z}\, f \qquad\qquad x_R = -\frac{X - b}{Z}\, f \qquad (4.8)$$

7 In Fig. 4.52b the depth of the point P is indicated with ZP, but in the text we will continue to indicate with Z the depth of a generic point of the object.

By eliminating X from (4.8) and solving with respect to Z, we get the following relation:

$$Z = \frac{b \cdot f}{x_R - x_L} \qquad (4.9)$$
which is the triangulation equation for calculating the perpendicular distance (depth)
for a binocular system with the geometry defined in Fig. 4.52a, that is, with the
constraints of the parallel optical axes and with the projections PL and PR lying on
the epipolar line. We can see how in (4.9) the distance Z is correlated only with the disparity value (xR − xL), induced by the observation of the point P of the scene, and is independent of the reference system of the local coordinates, that is, of the absolute values of xR and xL.
Recall that the parameter b is the baseline, that is, the separation distance of the
optical axes of the two cameras and f is the focal length, identical for the optics of
the two cameras. Furthermore, b and f have positive values. In (4.9) the value of Z must be positive, and consequently the denominator must be positive, that is, xR ≥ xL.
The geometry of the human binocular vision system is such that the numerator b·f of (4.9) assumes values in the range (390 − 1105 mm²), considering the interval (6 − 17 mm) of the focal length of the crystalline lens (corresponding, respectively, to the vision of closer objects, at about 25 cm, with contracted ciliary muscle, and to the vision of more distant objects with relaxed ciliary muscle) and the baseline b = 65 mm. Associated with the corresponding range of Z = 0.25−100 m, the interval of disparity xR − xL would be about (2 − 0.0039 mm). The denominator value of (4.9) tends to assume very small values when calculating large values of depth Z (for (xR − xL) → 0 ⇒ Z → ∞). This can determine a non-negligible uncertainty in the estimate of Z.
For a binocular vision system, the uncertainty of the estimate of Z can be limited by using cameras with a good spatial resolution (not less than 512 × 512 pixels) and by minimizing the error in the estimation of the position of the elementary structures detected in the stereo images as candidate homologous structures. These two
aspects can easily be solved by considering the availability of HD cameras (res-
olution 1920 × 1080 pixel) equipped with chips with photoreceptors of 4µm. For
example, a pair of these cameras, configured with a baseline of b = 120 mm and
optics with focal lengths of 15 mm, to detect an object at a distance of 10 m, for
the (4.9) the corresponding disparity would be 0.18 mm which in terms of pixels
would correspond to several tens (adequate resolution to evaluate the position of
homologous structures in the two HD stereo images).
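A quick numeric check of this example with Eq. (4.9), in a short Python sketch that uses only the values given in the text (b = 120 mm, f = 15 mm, Z = 10 m, 4 µm photoreceptors):

```python
# Numeric check of Eq. (4.9) for the HD-camera example above.
b = 120.0            # baseline, mm
f = 15.0             # focal length, mm
Z = 10_000.0         # object distance, mm
pixel_pitch = 0.004  # 4 micrometres, in mm

disparity_mm = b * f / Z                 # from Z = b*f / (xR - xL)
disparity_px = disparity_mm / pixel_pitch
print(f"disparity = {disparity_mm:.2f} mm = {disparity_px:.0f} pixels")
# -> disparity = 0.18 mm = 45 pixels, i.e. a few tens of pixels as stated.
```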
Let us return to Fig. 4.52 and use the same similar right-angled triangles in the 3D context. PL CL and CL P are the hypotenuses of these similar right-angled triangles. We can get the following expression:

$$\frac{D}{Z} = \frac{\overline{P_L C_L}}{f} \;\Longrightarrow\; \frac{D}{Z} = \frac{\sqrt{f^2 + x_L^2 + y_L^2}}{f} \qquad (4.10)$$

in which Z can be replaced using Eq. (4.9) (the perpendicular distance of P), obtaining

$$D = \frac{b\,\sqrt{f^2 + x_L^2 + y_L^2}}{x_R - x_L} \qquad (4.11)$$
which is the equation of the Euclidean distance D of the point P in the three-dimensional reference system, whose origin always coincides with the optical center CL of the left camera. When calibrating the binocular vision system, if it is necessary to verify the spatial resolution of the system, it may be convenient to use Eq. (4.9) or Eq. (4.11) to predict, given the known positions of points P in space and the position xL in the left retina, what the value of the disparity should be, i.e., to estimate xR, the position of the point P when projected onto the right retina.
In some applications it is important to evaluate well the constant b · f (of the
Eq. 4.9) linked to the intrinsic parameter of the focal length f of the lens and to the
extrinsic parameter b that depends on the geometry of the system.

4.6.7.2 Intrinsic and Extrinsic Parameters of a Binocular System


The intrinsic parameters of a digital acquisition system characterize the optics of
the system (for example, the focal length and optical center), the geometry of the
optical system (for example, the radial geometric distortions introduced), and the
geometric resolution of the image area that depends on the digitization process and
on the transformation of the plane-image coordinates to the pixel-image coordinates
(described in Chap. 5 Vol. I Digitization and Image Display).
The extrinsic parameters of an acquisition system are instead the parameters that
define the structure (position and orientation) of the cameras (or in general of optical
systems) with respect to a 3D external reference system. In essence, the extrinsic
parameters describe the geometric transformation (for example, translation, rotation, or roto-translation) that relates the coordinates of known points in the world to
the coordinates of the same points with respect to the acquisition system (unknown
reference). For the considered binocular system, the baseline b constitutes an extrinsic
parameter.
This activity of estimating the intrinsic and extrinsic parameters, known as calibration of the binocular vision system, consists of first identifying some known points P (in the 3D world) in the two retinas and then evaluating the disparity value xR − xL and the distance Z of these known points using other systems. Solving with respect to b or f in Eq. (4.9), it is possible to verify the correctness of these parameters. From the analysis of the calibration results of the binocular system considered, it can be observed that increasing the baseline value b improves the accuracy of the estimate of Z, with a consequent increase of the disparity values.
A good compromise must be chosen between the value of b and the width of the
visible area of the scene seen by both cameras (see Fig. 4.53a). In essence, increasing
the baseline decreases the number of observable points in the scene. Another aspect
that must be considered is the diversity of the acquired pair of stereo images, due
to the distortion introduced and the perspective projection. This difference in stereo


Fig. 4.53 Field of view in binocular vision. a In systems with parallel optical axes, the field of view decreases as the baseline increases, but a consequent increase in accuracy is obtained in determining the depth. b In systems with converging optical axes, the field of view decreases as the vergence angle and the baseline increase, but the level of depth uncertainty also decreases

images increases with the increase of b, all to the disadvantage of the stereo fusion
process which aims to search, in stereo images, for the homologous points deriving
from the same point P of the scene.
Proper calibration is strategic when the vision system has to interact with the world to reconstruct the 3D model of the scene and when it has to refer to it (for example, an autonomous vehicle or a robotic arm must self-locate with adequate accuracy).
Some calibration methods are well described in [21–24].
According to Fig. 4.52, the equations for reconstructing the 3D coordinates of each point P(X, Y, Z) visible from the binocular system (with parallel and coplanar optical axes) are summarized as follows:

$$Z = \frac{b \cdot f}{x_R - x_L} \qquad X = x_L\,\frac{Z}{f} = \frac{x_L \cdot b}{x_R - x_L} \qquad Y = y_L\,\frac{Z}{f} = \frac{y_L \cdot b}{x_R - x_L} \qquad (4.12)$$
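A minimal sketch of Eq. (4.12) in code form is given below; the sample projection values are invented for illustration, and the signs follow the convention of Eq. (4.12), with image coordinates expressed in the local retinal frames.

```python
def reconstruct_point(xL, yL, xR, b, f):
    """Eq. (4.12): 3D coordinates of P from its two image projections.

    Coordinates are expressed in the local image frames (origin at the
    centre of each retina); the world origin is the left optical centre.
    Assumes the parallel-axis geometry of Fig. 4.52, so yL = yR.
    """
    d = xR - xL                      # disparity
    if d == 0:
        raise ValueError("zero disparity: point at infinity")
    Z = b * f / d
    X = xL * Z / f                   # equivalently xL * b / d
    Y = yL * Z / f
    return X, Y, Z

# Example with b = 120 mm, f = 15 mm and a disparity of 0.18 mm,
# which gives Z = 10 m as in the HD-camera example above.
print(reconstruct_point(xL=-0.30, yL=0.15, xR=-0.12, b=120.0, f=15.0))
```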

4.6.8 General Binocular System

To mitigate the limit situations indicated above with the stereo geometry proposed in Fig. 4.52 (to be used when Z ≫ b), it is possible to use a different configuration of the cameras, arranging them with convergent optical axes, that is, inclined toward the fixation point P, which is at a finite distance from the stereo system, as shown in Fig. 4.54.
With this geometry the points of the scene projected on the two retinas lie along the
lines of intersection (the epipolar lines) between the image planes and the epipolar
plane, which includes the point P of the scene and the two optical centers CL and CR
of the two cameras, as shown in Fig. 4.54a. It is evident that with this geometry, the
epipolar lines are no longer horizontal as they were with the previous stereo geometry,

Fig. 4.54 Binocular system with converging optical axes. a Epipolar geometry: the baseline intersects each image plane at the epipoles eL and eR. Any plane containing the baseline is called an epipolar plane and intersects the image planes at the epipolar lines lL and lR. In the figure, the
epipolar plane considered is the one passing through the fixation point P. As the 3D position of P
changes, the epipolar plane rotates around the baseline and all the epipolar lines pass through the
epipoles. b The epipolarity constraint imposes the coplanarity in the epipolar plane of the point P
of the 3D space, of the projections PL and PR of P in the respective image planes, and of the two
optical centers CL and CR . It follows that a point of the image of the left PL through the center CL
is projected backward in the 3D space from the radius CL PL . The image of this ray is projected in
the image on the right and corresponds to the epipolar line lR where to search for PR , that is, the
homologue of PL

with the optical axes of the cameras arranged parallel and coplanar. Furthermore, the epipolar lines always occur in corresponding pairs within the epipolar plane. The potential homologous points, i.e., the projections PL and PR of P in the two retinas, lie on the corresponding epipolar lines lL and lR by the epipolarity constraint. The baseline b is always the line joining the optical centers, and the epipoles eL and eR of the optical systems are the intersection points of the baseline with the respective image planes. The right epipole eR is the virtual image of the left optical center CL observed in the right image, and vice versa the left epipole eL is the virtual image of the optical center CR.
Knowing the intrinsic and extrinsic parameters of the (calibrated) binocular system and the epipolarity constraints, the correspondence problem is simplified by restricting the search for the homologue of PL (supposedly known) to the associated epipolar line lR , coplanar with the epipolar plane determined by PL , CL , and the baseline (see Fig. 4.54b). Therefore, the search is restricted to the epipolar line lR rather than the entire right image.
For a binocular system with converging optical axes, the binocular triangulation method (also called binocular parallax) can be used to calculate the coordinates of P, but the previous Eq. (4.12) is no longer valid, since it essentially assumed a binocular system with the fixation point at infinity (parallel optical axes). In this geometry, instead of calculating the linear disparity, it is necessary to calculate the angular disparities θL and θR , which depend on the convergence angle ω of the optical axes of the system (see Fig. 4.55).
In analogy to the human vision, the optical axes of the two cameras intersect
at a point F of the scene (fixation point) at the perpendicular distance Z from the

Fig. 4.55 Calculation of the angular disparity

baseline. We know that with stereopsis (see Sect. 4.6.3) we get the perception of relative depth if, simultaneously, another point P is seen nearer or farther with respect to F (see Fig. 4.55a).
In particular, we know that all points located around the horopter stimulate stereopsis through retinal disparity (the local difference between the retinal images caused by the different viewpoint of each eye). The disparity at the point F is zero, while there is an angular disparity for all points outside the horopter curve, each presenting a different vergence angle β. Analyzing the geometry of the binocular system of Fig. 4.55a, it is possible to derive the binocular disparity in terms of the angular disparity δ, defined as follows:
δ = α − β = θR − θL (4.13)
where α and β are the vergence angles subtended, respectively, by the fixation point F and by the point P outside the horopter curve; θL and θR are the angles between the retinal projections, in the left and right camera, of the fixation point F and of the target point P. The functional relationship that binds the angular disparity δ (expressed in radians) and the depth is obtained by applying elementary geometry (see Fig. 4.55b). Considering the right-angled triangles with base b/2, where b is the baseline, it follows from trigonometry that b/2 = Z tan(α/2). For small angles we can approximate tan(α/2) ≈ α/2, so that α ≈ b/Z. Therefore, the angular disparity between PL and PR is obtained by applying (4.13) as follows:
$$\delta = \alpha - \beta = \frac{b}{Z} - \frac{b}{Z + \Delta Z} = \frac{b \cdot \Delta Z}{Z^2 + Z\,\Delta Z} \tag{4.14}$$
For very small distances Z of the fixation point (less than 1 meter) and for depth values ΔZ very small compared to Z, the second term in the denominator of (4.14) becomes negligible and the expression of the angular disparity simplifies to
$$\delta = \frac{b\,\Delta Z}{Z^2} \tag{4.15}$$
The same result for the angular disparity is obtained if we consider the difference between the angles θL and θR , in this case taking into account the sign of the angles.

The estimate of the depth ΔZ and of the absolute distance Z of an object with respect to a known reference object (fixation point), on which the binocular system converges with the angles ω and ψ, can be calculated considering different angular reference coordinates, as shown in Fig. 4.55c. Given the angular configuration ω and ψ with respect to the reference object F, the absolute distance Z of F with respect to the baseline b is given by8:
$$Z \approx b\,\frac{\sin\omega\,\sin\psi}{\sin(\omega + \psi)} \tag{4.16}$$
while the depth ΔZ is given by
$$\Delta Z = b\,\frac{\Delta\omega + \Delta\psi}{2} \tag{4.17}$$
where Δω and Δψ are the angular offsets of the left and right image planes needed to align the binocular system with P, starting from the initial reference configuration.
In human vision, considering the baseline b = 0.065 m and fixating an object at the distance Z = 1.2 m, for an object ΔZ = 0.1 m away from the fixated one, applying (4.15) gives an angular disparity of δ = 0.0045 rad. A person with normal vision is able to pass a thread through the eye of a needle fixated at Z = 0.35 m, working around the eyelet with a resolution of ΔZ = 0.1 mm. The visual capacity of human stereopsis is such as to perceive depth, around the point of fixation, of fractions of a millimeter, requiring, according to (4.15), a resolution of the angular disparity of δ = 0.000053 rad = 10.9 s of arc. Fixating an object at the distance of Z = 400 m, the depth around this object is no longer perceptible (the background is perceived as flattened), since the resolution of the required angular disparity would be very small, less than 1 s of arc. In systems with converging optical axes, the field of view decreases with increasing vergence angle and baseline, but the level of depth uncertainty also decreases (see Fig. 4.53b).
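The figures quoted in this example follow directly from Eq. (4.15); a short sketch (with the values taken from the text and a hypothetical helper name) makes the arithmetic explicit.

```python
import math

def angular_disparity(b, Z, dZ):
    """Angular disparity of Eq. (4.15): delta = b * dZ / Z**2, in radians."""
    return b * dZ / Z**2

b = 0.065                                          # human interocular baseline (m)
print(angular_disparity(b, Z=1.2, dZ=0.1))         # ~0.0045 rad
delta = angular_disparity(b, Z=0.35, dZ=0.0001)    # needle-threading case
print(delta, math.degrees(delta) * 3600)           # ~5.3e-5 rad, ~10.9 arcsec
```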
Active and passive vision systems have been experimented with to estimate the angular disparity together with other parameters of the system (position and orientation of the cameras). These parameters are evaluated and dynamically checked for the calculation of the depth of various points in the scene. The estimation of the position and orientation of the cameras requires their calibration (see Chap. 7, Camera Calibration and 3D Reconstruction). If the positions and orientations of the cameras are known (calculated, for example, with active systems), the reconstruction of the 3D points of the scene is realized with the roto-translation transformation of the points PL = (xL , yL , zL ) (projection of P(X, Y, Z) in the left retina), which are projected

8 From Fig. 4.55c it is observed that Z = AF · sin ω, where AF is obtained from the law of sines, for which AF/sin ψ = b/sin(π − ω − ψ), recalling that the sum of the interior angles of the triangle AFB is π. Therefore, solving for AF and substituting, we obtain
$$Z = b\,\frac{\sin\psi}{\sin[\pi - (\omega + \psi)]}\,\sin\omega = b\,\frac{\sin\omega\,\sin\psi}{\sin(\omega + \psi)}.$$

in PR = (xR , yR , zR ), in the right retina, with the transformation
$$P_R = R\,P_L + T$$
where R and T are, respectively, the 3 × 3 rotation matrix and the translation vector that map from the left to the right camera. The calibration procedure has previously calculated R and T knowing the camera-frame coordinates PL and PR that correspond to the same points of the 3D scene (at least 5 points). In Chap. 7 we will return in detail to the various methods of calibration of monocular and stereo vision systems.
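A minimal sketch of this roto-translation, assuming hypothetical calibration results R (here a small rotation about the vertical axis) and T, maps a point from the left camera frame into the right one.

```python
import numpy as np

def left_to_right(P_L, R, T):
    """Map a 3D point from the left camera frame to the right one: P_R = R @ P_L + T."""
    return R @ P_L + T

theta = np.deg2rad(5.0)                          # hypothetical vergence of 5 degrees
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])  # rotation about the y axis
T = np.array([-0.12, 0.0, 0.0])                  # hypothetical 12 cm baseline

P_L = np.array([0.3, 0.1, 2.0])                  # point expressed in the left camera frame
print(left_to_right(P_L, R, T))
```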

4.7 Stereo Vision Algorithms

In the preceding paragraphs, we have described the algorithms proposed by Julesz and Marr–Poggio, inspired by biological binocular vision. In this paragraph we will take up the basic concepts of stereo vision to create an artificial vision system capable of adequately reconstructing the 3D surface of the observed scene.
The stereo vision problem is essentially reduced to the identification of the ele-
mentary structures in the pair of stereo images and to the identification of the pairs of
homologous elementary structures, that is, points of the stereo images, which are the
projection of the same point P of the 3D scene. The identification in the stereo pair of
homologous points is also known in the literature as the problem of correspondence:
for each point in the left image, find its corresponding point in the right image.
The identification of homologous points in the pair of stereo images depends on two
strategies:

1. which elementary structures of the stereo images must be chosen as candidate homologous structures;
2. which measure of similarity (or dissimilarity) must be chosen to measure the level of dependency (or independence) between the structures identified by strategy 1.

For strategy 1, the use of contour points or elementary areas has already been proposed. More recently, the Points of Interest (POI) described in Chap. 6 Vol. II (for example, SIFT and SUSAN) are also used.
For strategy 2, two classes of algorithms are obtained for the measurement of similarity (or dissimilarity), for point-like elementary structures or for extended areas. We immediately highlight the importance of similarity (or dissimilarity) measures that discriminate well between structures that are not very different from each other (homogeneous distribution of pixel gray levels), in order to keep the number of false matches to a minimum.
The calculation of the depth is made only for the elementary structures found in the images, and in particular by choosing only the homologous structures (point-like or area-based). For all the other structures (features) for which the depth cannot be

calculated with stereo vision, interpolation techniques are used to have a more com-
plete reconstruction of the visible 3D surface. The search for homologous structures
(strategy 2) is simplified when the geometry of the binocular system is conditioned
by the constraint of the epipolarity. With this constraint, the homologous structures
are located along the corresponding epipolar lines and the search area in the left and
right image is limited.
The extent of the search area depends on the uncertainty of the intrinsic and extrinsic parameters of the binocular system (for example, uncertainty about the position and orientation of the cameras), making it necessary to search for the homologous structure in a small neighborhood of the estimated position of the structure in the right image (given the geometry of the system), slightly violating the epipolarity constraint (search in a horizontal and/or vertical neighborhood). In the simple binocular system, with parallel and coplanar optical axes, or in the case of rectified stereo images, the search for homologous structures takes place by considering corresponding lines with the same vertical coordinate.

4.7.1 Point-Like Elementary Structures

The extraction of the elementary structures present in the pair of stereo images can
be done by applying to these images some filtering operators described in Chap. 1
Vol. II Local Operations: Edging. In particular, point-like structures, contours, edge
elements, and corners can be extracted. Julesz and Marr used the random-dot images
(black or white point-like synthetic structures) and the corresponding structures of
the zero crossing extracted with the LOG filtering operator.
With the constraint of epipolar geometry, a stereo vision algorithm includes the
following essential steps:

1. Acquisition of the pair of stereo images;


2. Apply to the two images a Gaussian filter (with the appropriate parameter σ ) to
attenuate the noise;
3. Apply to the two images a point-like structure extraction operator (edges, con-
tours, points, etc.);
4. Apply strategy 2 (see previous paragraph) for the search for homologous structures
by analyzing those extracted in point 3. With the constraint of epipolar geometry,
the search for homologous points is done by analyzing the two epipolar lines in
the two images. To minimize the uncertainty in the evaluation of similarity (or
dissimilarity) of structures, these can be described through different features such
as, for example, the orientation θ (horizontal borders are excluded), the value of
the contrast M , the length l of the structure, the coordinates of the center of the
structure (xL , yL ) in the left image and (xR , yR ) in the right one. For any pair of
elementary structures, whose n features are represented by the components of the
respective vectors si = (s1i , s2i , . . . , sni ) and sj = (s1j , s2j , . . . , snj ), one of the

following similarity measures can be used


$$S_D = \langle s_i, s_j \rangle = s_i^T s_j = \|s_i\|\,\|s_j\|\cos\phi = \sum_k s_{ki}\, s_{kj} = s_{1i} s_{1j} + s_{2i} s_{2j} + \cdots + s_{ni} s_{nj} \tag{4.18}$$

$$S_E = \|s_i - s_j\| = \sqrt{\sum_k \left(s_{ki} - s_{kj}\right)^2 w_k} \tag{4.19}$$

where SD represents the inner product between the vectors si and sj , which describe two generic elementary structures characterized by n parameters, φ is the angle between the two vectors, and ‖·‖ indicates the length (norm) of a vector. SE , instead, represents the Euclidean distance weighted by the weights wk of each characteristic sk of the elementary structures.
The two similarity measures SD and SE can be normalized so as not to depend too much on the variability of the characteristics of the elementary structures. In this case, (4.18) can be normalized by dividing each term of the summation by the product of the norms of the vectors si and sj , given by
$$\|s_i\|\,\|s_j\| = \sqrt{\sum_k s_{ki}^2}\;\sqrt{\sum_k s_{kj}^2}$$
The weighted Euclidean distance measure can be normalized by dividing each addend of (4.19) by the term R²k , which represents the squared maximum range of variability of the k-th component. The Euclidean distance is used as a (dis)similarity estimate in the sense that the more different the components describing a pair of candidate homologous structures are, the greater their difference, that is, the greater the value of SE . The weights wk , relative to each characteristic sk , are calculated by analyzing a certain number of pairs of elementary structures that are guaranteed to be homologous.
An alternative normalization can be chosen on a statistical basis by subtracting from each characteristic sk its average ⟨sk ⟩ and dividing by its standard deviation. Obviously, this is possible for both measures SD and SE only if the probability distribution of the characteristics is known, which can be estimated using known pairs {(si , sj ), . . .} of elementary structures.
In conclusion, in this step, the measure of similarity of the pairs (si , sj ) is estimated to check whether they are homologous structures and then, for each pair, the measure of disparity dij = si (xR ) − sj (xL ) is calculated (a minimal numerical sketch of the two measures SD and SE is given after this list).
5. The previous steps can be repeated to have different disparity estimates by iden-
tifying and calculating the correspondence of the structures at different scales,
analogous to the coarse-to-fine approach proposed by Marr-Poggio, described
earlier in this chapter.
6. Calculation with Eq. (4.12) of 3D spatial coordinates (depth Z and coordinates X
and Y ), for each point of the visible surface represented by the pair of homologous
structures.

7. Reconstruction of the visible surface, at the points where it was not possible to
measure the depth, through an interpolation process, using the measurements of
the stereo system estimated in step 6.
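As anticipated in step 4, the following is a minimal sketch of the two measures SD (4.18) and SE (4.19); the feature vectors (orientation, contrast, length, center coordinates) and the weights are hypothetical values chosen only for illustration.

```python
import numpy as np

def inner_product_similarity(s_i, s_j, normalized=True):
    """S_D of Eq. (4.18): inner product, optionally divided by ||s_i|| ||s_j||."""
    s_d = float(np.dot(s_i, s_j))
    if normalized:
        s_d /= (np.linalg.norm(s_i) * np.linalg.norm(s_j))
    return s_d

def weighted_euclidean_distance(s_i, s_j, w):
    """S_E of Eq. (4.19): Euclidean distance with one weight per feature."""
    return float(np.sqrt(np.sum(w * (s_i - s_j) ** 2)))

# Hypothetical edge features: orientation (rad), contrast, length, x and y of the center.
s_left  = np.array([0.52, 35.0, 12.0, 101.0, 64.0])
s_right = np.array([0.49, 33.0, 11.0, 118.0, 64.0])
w = np.array([1.0, 0.1, 0.5, 0.01, 0.01])     # hypothetical per-feature weights

print(inner_product_similarity(s_left, s_right))
print(weighted_euclidean_distance(s_left, s_right, w))
```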

4.7.2 Local Elementary Structures and Correspondence Calculation Methods

The choice of local elementary structures is based on small image windows (3 × 3, 5 × 5, . . .), characterized as features; through a measure of similarity based on correlation, we can check whether there exists, in the other stereo image, a window of the same size that is a candidate to be considered a homologous structure (feature). In this case, comparing image windows, the potential homologous structures are the selected windows that give the maximum correlation, i.e., the highest value of similarity.
Figure 4.56 shows a stereo pair of images with some windows found with a high correlation value, as they represent the projection of the same portion of the 3D surface of the observed scene. In the first horizontal line of the figure, we assume that the homologous windows are found centered on the same row of the stereo images because of the epipolarity constraint. The identification by correlation of the local elementary structures eliminates the drawback of the method based on point-like structures, which generates uncertain depth measurements in the areas of strong discontinuity in the gray levels due to occlusion problems. Moreover, in areas with a visible curved surface, the point-like structures of the stereo image pairs are difficult to identify and, when they are, they do not always coincide with the same physical structure. Indeed, point-like structures (edges, contours, corners, etc.) tend to appear well defined in the images precisely in the occlusion zones.
Without the epipolarity constraint, we can look for local structures in the other image in a search area R whose size depends on the size of the local structures and on the geometry of the stereo system. Figure 4.56 shows this situation in another row of the stereo images. Suppose we have identified a local feature represented by the square window WL (k, l), k, l = −M, . . . , +M of size (2M + 1), centered in (xL , yL ) in the left image IL . We search horizontally in the right image IR for a window WR of the same size as WL , located in the search area Rm,n (i, j), i, j = −N, . . . , +N of size (2N + 1), initially at position m = xL + dminx and n = yL + dminy , where dminx and dminy are, respectively, the minimum expected horizontal and vertical disparities (dminy is zero in the hypothesis of epipolar geometry), known a priori from the geometry of the binocular system.
With the epipolarity constraint, the search area R would coincide with the window WR , with N = M (identical dimensions), and the correspondence in the right image would result in xR = xL + dx (the disparity is added because the right image is assumed to be shifted to the right of the left one, as shown in Fig. 4.56; otherwise the disparity dx would be subtracted).


Fig. 4.56 Search in stereo images of homologous points through the correlation function between
potential homologous windows, with and without epipolarity constraint

4.7.2.1 Correlation-Based Matching Function


The goal is to move the window WR within R so that a matching measure9 C reaches its maximum value when the two windows are similar, i.e., when they represent homologous local structures. This matching measure can be the cross-correlation function or other similarity/dissimilarity measures (e.g., the sum of the differences).
Without the constraint of epipolarity, similar local structures to be searched for in
stereo images are not aligned horizontally. Therefore, we must find the window WR ,
the homologue of the WL of the image on the left, in the search area R centered in
(m, n) in the image on the right. The dimensions of R must be adequate as indicated
in the previous paragraph (see Fig. 4.56).
Let (i, j) be the position in the search area R where the window WR is centered
in which the correlation C is maximum. WR in the image IR will result in the posi-

9 In the context of image processing, it is often necessary to compare the level of equality (similarity, matching) of similar objects described by multidimensional vectors or even by multidimensional images. Often an object is described by a known model and one has the problem of comparing the data of the model with those of the same object observed, even in different conditions with respect to the model. Thus the need arises to define functionals or techniques whose objective is to verify the level of similarity or drift (dissimilarity) between the data of the model and those observed. A technique known as template matching is used when comparing model data (template image) with the observed data of the same physical entity (small parts of the captured image). Alternatively, we can use similarity functions (which assess the level of similarity, or affinity) based on correlation, or dissimilarity functions based on the distance associated with a norm (which assess the level of drift or distortion between data). In this context, the objects to be compared do not have a model; rather, we want to compare small windows of stereo images, acquired simultaneously, which represent the same physical structure observed from slightly different points of view.

tion (xR , yR ) = (m + i, n + j), with resulting horizontal and vertical disparities dx = xR − xL and dy = yR − yL , respectively. The correlation function between the window WL and the search area Rm,n (i, j) is indicated for simplicity with C(i, j; m, n). Having detected WL at the position (xL , yL ) in the left image IL , for each pixel position (i, j) in Rm,n a correlation measure is calculated between the window WL (seen as a template window) and the window WR (which slides in R by varying the indices i, j), with the following relation:
$$C(i, j; m, n) = \sum_{k=-M}^{+M}\,\sum_{l=-M}^{+M} W_L(x_L + k,\, y_L + l)\cdot W_R(\underbrace{i + m}_{x_R} + k,\ \underbrace{j + n}_{y_R} + l) \tag{4.20}$$

with i, j = −(N − M − 1), . . . , +(N − M − 1). The size of the square window WL is (2M + 1), with values M = 1, 2, . . . generating windows of size 3 × 3, 5 × 5, etc., respectively. The size of the square search area Rm,n , located at (m = xL + dminx , n = yL + dminy ) in the right image IR , is (2N + 1), related to the size of the window WR by N = M + q, q = 1, 2, . . .
For each value of (i, j) where WR is centered in the search region R we have a value of C(i, j; m, n), and to move the window WR within R it is sufficient to vary the indices i, j = −(N − M − 1), . . . , +(N − M − 1) inside R, whose dimensions and position can be defined a priori in relation to the geometry of the stereo system, which allows a maximum and minimum disparity interval. The accuracy of the correlation measurement depends on the variability of the gray levels between the two stereo images. To minimize this drawback, the correlation measurements C(i, j; m, n) can be normalized using the correlation coefficient10 r(i, j; m, n) as the new correlation estimation value, given by
$$r(i, j; m, n) = \frac{\displaystyle\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} W_L(k, l; x_L, y_L)\cdot W_R(k, l; i + m, j + n)}{\sqrt{\displaystyle\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} W_L(k, l; x_L, y_L)^2}\;\sqrt{\displaystyle\sum_{k=-M}^{+M}\sum_{l=-M}^{+M} W_R(k, l; i + m, j + n)^2}} \tag{4.21}$$

where
$$\begin{aligned} W_L(k, l; x_L, y_L) &= W_L(x_L + k,\, y_L + l) - \bar{W}_L(x_L, y_L)\\ W_R(k, l; i + m, j + n) &= W_R(i + m + k,\, j + n + l) - \bar{W}_R(i + m, j + n)\end{aligned} \tag{4.22}$$

with i, j = −(N − M − 1), . . . , +(N − M − 1) and (m, n) fixed as above. W̄L and W̄R are the mean intensity values in the two windows. The correlation coefficient is also known as the Zero-mean Normalized Cross-Correlation (ZNCC). The numerator of (4.21) represents the covariance of the pixel intensities between the two windows, while the denominator is the product of the respective standard deviations. It can easily be deduced that the correlation coefficient r(i, j; m, n) takes
scalar values in the range between −1 and +1, no longer depending on the variability
of the intensity levels in the two stereo images. In particular, r = 1 corresponds to

10 The correlation coefficient has been described in Sect. 1.4.2 and in this case it is used to evaluate
the statistical dependence of the intensity of the pixels between the two windows, without knowing
the nature of this statistical dependence.

the exact equality of the elementary structures (homologous structures, up to a constant factor c, WR = cWL ; that is, the two windows are highly correlated but differ by a uniform intensity scaling, one brighter than the other). r = 0 means that they are completely different, while r = −1 indicates that they are anticorrelated (i.e., the intensities of the corresponding pixels are equal but of opposite sign).
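A direct, unoptimized sketch of the correlation coefficient of Eqs. (4.21)–(4.22), also known as ZNCC, between two equally sized windows could be the following; the example windows are hypothetical.

```python
import numpy as np

def zncc(W_L, W_R, eps=1e-12):
    """Zero-mean normalized cross-correlation (Eq. 4.21) between two windows."""
    dL = W_L.astype(float) - W_L.mean()     # subtract the window means, Eq. (4.22)
    dR = W_R.astype(float) - W_R.mean()
    num = np.sum(dL * dR)                   # covariance (numerator)
    den = np.sqrt(np.sum(dL**2)) * np.sqrt(np.sum(dR**2))  # product of std. devs
    return num / (den + eps)                # value in [-1, +1]

W_L = np.array([[79, 42, 51], [46, 36, 34], [37, 30, 28]])
print(zncc(W_L, 2 * W_L + 10))              # +1: same structure, different contrast
print(zncc(W_L, 255 - W_L))                 # -1: anticorrelated windows
```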
The previous correlation measures, for some applications, may not be adequate due to the noise present in the images, and in particular when the search regions are very homogeneous, with little variability in the intensity values. This generates very uncertain or uniform correlation values C or r, with consequent uncertainty in the estimation of the horizontal and vertical disparities (dx , dy ).
More precisely, if the windows WL and WR represent the intensities of two images obtained under different lighting conditions of a scene and the corresponding intensities are linearly correlated, a high similarity between the images will be obtained. Therefore, the correlation coefficient is suitable for determining the similarity between the windows when the intensities can be assumed to be linearly correlated. When, on the other hand, the images are acquired under different conditions (sensors and nonuniform illumination), so that the corresponding intensities are correlated in a nonlinear way, two perfectly matched windows may not produce sufficiently high correlation coefficients, causing misalignments.
Another drawback is given by the intensive calculation required especially when
the size of the windows increases. In [25] an algorithm is described that optimizes the
computational complexity for the problem of template matching between images.

4.7.2.2 Distance-Based Matching Function


An alternative way is to consider a dissimilarity measure to evaluate the diversity of
pixel intensities between two images or windows of them. This dissimilarity mea-
sure can be associated with a metric, thus producing a very high value to indicate
the difference between the two images. As highlighted above with the metrics used
for the calculation of similarity, the similarity/dissimilarity metrics are not always
functional, especially when the environmental lighting conditions change the inten-
sity of the two images to be compared in a nonlinear way. In these cases, it may be
useful to use nonmetric measures.
If the stereo images are acquired under the same environmental conditions, the
correspondence between the WL and WR windows can be evaluated with the simple
dissimilarity measure known as the sum of the squares of the intensity differences
(SSD—Sum of Squared Difference) or with the sum of absolute differences (SAD—
Sum of Absolute Difference) instead of the products WL · WR . These dissimilarity
measures are given by the following expressions:

$$C_{SSD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\bigl[W_L(x_L + k, y_L + l) - W_R(i + m + k, j + n + l)\bigr]^2 \tag{4.23}$$

$$C_{SAD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\bigl|W_L(x_L + k, y_L + l) - W_R(i + m + k, j + n + l)\bigr| \tag{4.24}$$

with i, j = −(N − M − 1), . . . , +(N − M − 1).

For these two dissimilarity measures CSSD (i, j; m, n) and CSAD (i, j; m, n) (substantially based, respectively, on the L2 and L1 norms), the minimum value is chosen as the best match between the windows WL (xL , yL ) and WR (xR , yR ) = WR (i + m, j + n), which are chosen as homologous local structures with disparity estimate (dx = xR − xL , dy = yR − yL ) (see Fig. 4.56). The SSD metric is less expensive computationally than the correlation coefficient (4.21) and, like the latter, can be normalized to obtain equivalent results. In the literature there are several methods of normalization. The SAD metric is mostly used as it requires less computational load.
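Under the epipolarity constraint, a minimal SAD-based block-matching sketch that scans a range of candidate disparities along the same image row and keeps the one with minimum cost might be written as follows; the image contents, the disparity range, and the function names are hypothetical.

```python
import numpy as np

def sad(W_L, W_R):
    """Sum of absolute differences, Eq. (4.24), between two equal-size windows."""
    return np.sum(np.abs(W_L.astype(float) - W_R.astype(float)))

def best_disparity(I_L, I_R, x_L, y_L, M=2, d_min=0, d_max=16):
    """Scan candidate disparities along the same row and keep the SAD minimum."""
    W_L = I_L[y_L - M:y_L + M + 1, x_L - M:x_L + M + 1]
    best_d, best_cost = None, np.inf
    for d in range(d_min, d_max + 1):
        x_R = x_L + d                       # right image assumed shifted as in Fig. 4.56
        if x_R + M >= I_R.shape[1]:
            break
        W_R = I_R[y_L - M:y_L + M + 1, x_R - M:x_R + M + 1]
        cost = sad(W_L, W_R)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost

# Hypothetical stereo pair: the right image is the left one shifted by 5 pixels.
rng = np.random.default_rng(0)
I_L = rng.integers(0, 256, size=(60, 80))
I_R = np.roll(I_L, 5, axis=1)
print(best_disparity(I_L, I_R, x_L=30, y_L=30))   # expected disparity of about 5
```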
All the matching measures described are sensitive to geometric deformations (skewing, rotation, occlusions, . . .) and radiometric distortions (vignetting, impulse noise, . . .). The latter can be attenuated, also for the SSD and SAD metrics, by subtracting from each pixel the mean value W̄ calculated on the windows to be compared, as already done for the correlation coefficient (4.21). The two metrics become the Zero-mean Sum of Squared Differences (ZSSD) and the Zero-mean Sum of Absolute Differences (ZSAD), and their expressions, considering Eq. (4.22), are

$$C_{ZSSD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\bigl[W_L(k, l; x_L, y_L) - W_R(k, l; i + m, j + n)\bigr]^2 \tag{4.25}$$

$$C_{ZSAD}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\bigl|W_L(k, l; x_L, y_L) - W_R(k, l; i + m, j + n)\bigr| \tag{4.26}$$

with i, j = −(N − M − 1), +(N − M − 1).


In analogy with the normalization made for the correlation coefficient, to make the measure insensitive to the image contrast, also for the ZSSD dissimilarity measure we can normalize the intensities of the two windows first with respect to their means and then with respect to their standard deviations, obtaining the Normalized Zero-mean Sum of Squared Differences (NZSSD). Therefore, similarly to the correlation coefficient, this measure is suitable for comparing images captured under different lighting conditions.
When the images are affected by impulse noise, the ZSAD and ZSSD dissimilarity measures, based on the L1 and L2 norms, respectively, produce high distance values. In this case, to reduce the effect of the impulse noise on the calculated dissimilarity measure, instead of the mean of the absolute differences or of the squared differences, the median of the absolute differences (MAD) or the median of the squared differences (MSD) can be used to measure the dissimilarity between the two windows WL and WR . The calculation of MAD involves computing the absolute intensity differences of the corresponding pixels in the two windows (for MSD, the squared intensity differences), ordering the absolute differences (for MSD, the squared differences), and then choosing the median value as the measure of dissimilarity. The median filter is described in Sect. 9.12.4 Vol. I; in this context, it is used by ordering the values of the two windows' differences in a one-dimensional vector and then discarding half of the largest values (absolute differences for MAD and squared differences for MSD). In addition to impulse noise, MAD and MSD can be effective measures

Fig. 4.57 Calculation of the depth map from a pair of stereo images by detecting homologous local elementary structures using similarity functions. The first column shows the stereo images and the real depth map of the scene. The following columns show the depth maps obtained with windows of increasing size starting from 3 × 3, where the corresponding windows in the stereo images are matched with the normalized correlation (first row), SSD (second row), and SAD (third row) similarity functions

in determining the dissimilarity between windows containing occluded parts of the


observed scene.
Returning to the size of the windows: in addition to influencing the computational load, it affects the precision with which homologous structures are located in the stereo images. A small window locates structures with greater precision but is more sensitive to noise; on the contrary, a large window is more robust to noise but reduces the localization accuracy on which the disparity value depends. Finally, a larger window tends to violate the disparity continuity constraint. Figure 4.57 shows the results of the depth maps extracted by detecting various local elementary structures (starting from 3 × 3 square windows) and using the similarity/dissimilarity functions described above to find homologous windows in the two stereo images.

4.7.2.3 Matching Function Based on the Rank Transform


An alternative, nonmetric dissimilarity measure, known as the Rank Distance (RD), is based on the Rank Transform [26]. It is useful in the presence of significant radiometric variations and occlusions in stereo images.

The rank transform is applied to the two windows of the stereo images: for each pixel, the intensity is replaced with its rank Rank(i, j). For a given window W in the image, centered in the pixel p(i, j), the rank transform RankW (i, j) is defined as the number of pixels in W whose intensity is less than the value of p(i, j). For example, if
$$W = \begin{bmatrix} 79 & 42 & 51\\ 46 & 36 & 34\\ 37 & 30 & 28 \end{bmatrix}$$
then RankW (i, j) = 3, since there are three pixels with intensity less than the central pixel of value 36. Note that the obtained values are based on the relative order of the pixel intensities rather than on the intensities themselves. The position of the pixels inside the window is also lost. Using the
preceding symbolism (see also Fig. 4.56), the dissimilarity measure of rank distance
RD(i, j) based on the rank transform Rank(i, j) is given by the following:

$$RD(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\bigl|\mathrm{Rank}_{W_L}(x_L + k, y_L + l) - \mathrm{Rank}_{W_R}(i + m + k, j + n + l)\bigr| \tag{4.27}$$

In (4.27), the value of RankW for a window centered in (i, j) in the image is calculated as follows:
$$\mathrm{Rank}_W(i, j) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} L(i + k, j + l) \tag{4.28}$$
where L(k, l) is given by
$$L(k, l) = \begin{cases} 1 & \text{if } W(k, l) < W(i, j)\\ 0 & \text{otherwise} \end{cases} \tag{4.29}$$
With (4.29) we count the number of pixels in the window W whose intensity is less than that of the central pixel (i, j). Once the rank has been calculated with (4.28) for the window WL (xL , yL ) and for the windows WR (i + m, j + n), i, j = −(N − M − 1), . . . , +(N − M − 1), in the search region R(i, j) located at (m, n) in the image IR , the comparison between the windows is evaluated using the SAD method (sum of absolute differences) with (4.27).
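A compact sketch of the rank transform of Eqs. (4.28)–(4.29), applied to a single window and to a whole image, is shown below (the function names are hypothetical); the rank-transformed images would then be compared with the SAD-based measure of Eq. (4.27).

```python
import numpy as np

def rank_of_window(W):
    """Rank of the central pixel: number of pixels strictly smaller than it (Eqs. 4.28-4.29)."""
    center = W[W.shape[0] // 2, W.shape[1] // 2]
    return int(np.sum(W < center))          # the central pixel is never < itself

def rank_transform(image, M=1):
    """Apply the rank transform with a (2M+1)x(2M+1) window to every interior pixel."""
    H, W = image.shape
    out = np.zeros_like(image, dtype=int)
    for i in range(M, H - M):
        for j in range(M, W - M):
            out[i, j] = rank_of_window(image[i - M:i + M + 1, j - M:j + M + 1])
    return out

W_example = np.array([[79, 42, 51],
                      [46, 36, 34],
                      [37, 30, 28]])
print(rank_of_window(W_example))            # 3, as in the text example
```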
The rank distance is not a metric,11 like all other measures based on ordering. The dissimilarity measure based on the rank transform actually compresses the information content of the image (the information of a window is encoded in a single value), thus reducing the potential discriminating ability of the comparison between windows. The choice of the window size becomes even more important in this method. The computational complexity of the rank distance is reduced compared to the correlation- and ordering-based methods, being in the order of n log₂ n, where n indicates the

11 The rank distance is not a metric because it does not satisfy the reflexivity property of metrics. If WL = WR , each corresponding pixel has the same intensity value and it follows that RD = 0. However, RD = 0, being RD the sum of nonnegative numbers given by Eq. (4.27), only requires |RankWL (i, j) − RankWR (i, j)| = 0 for each pair of corresponding pixels of the two windows. But the pixel intensities can differ by an offset or a scale factor and still produce RD = 0; therefore, RD = 0 does not imply WL = WR , thus violating the reflexivity property of a metric, which would require RD(WL , WR ) = 0 ⇔ WL = WR .


Fig. 4.58 Calculation of the depth map, for the same stereo images of Fig. 4.57, obtained using the rank and census transforms. a Rank transform applied to the left image; b disparity map (9 × 9 window) based on the rank distance and the SAD matching method; c census transform applied to the left image; d disparity map (5 × 5 window) based on the census transform and the Hamming distance

number of pixels of the window W. Experimental results with this method have demonstrated a reduction of ambiguities in the comparison of windows, in particular in stereo images with local intensity variations and impulse noise. Figure 4.58 shows the result of the rank transform (figure a) applied to the left image of Fig. 4.57 and the disparity map (figure b) based on the rank distance and the SAD matching method.

4.7.2.4 Matching Function Based on the Hamming Distance of the Census Transform
A variant of the rank transform is the Census Transform (CT) [27], capable of maintaining the information on the spatial distribution of the pixels when assessing the rank, by generating a sequence of binary digits. The most widespread version of the census transform uses 3 × 3 windows to transform the pixel intensities of the input image into a binary encoding by analyzing the pixels around the pixel under examination. The values of the surrounding pixels are compared with the central pixel of the current window: a 3 × 3 binary mask associated with the central pixel is produced by comparing it with its 8 neighboring pixels, thus generating the 8-bit binary encoding of the pixel under examination. The census function that generates the bit string by comparing the central pixel of the window with the neighboring ones is


$$\mathrm{Census}_W(i, j) = \mathrm{Bitstring}_W\bigl[W(k, l) < W(i, j)\bigr] = \begin{cases} 1 & \text{if } W(k, l) < W(i, j)\\ 0 & \text{otherwise} \end{cases} \tag{4.30}$$

For example, applying the census function to the same window as before, for the central pixel with value 36 the following binary coding is obtained:
$$W = \begin{bmatrix} 79 & 42 & 51\\ 46 & 36 & 34\\ 37 & 30 & 28 \end{bmatrix} \;\Leftrightarrow\; \begin{bmatrix} 0 & 0 & 0\\ 0 & (i,j) & 1\\ 0 & 1 & 1 \end{bmatrix} \;\Leftrightarrow\; (0\,0\,0\,0\,1\,0\,1\,1)_2 \;\Leftrightarrow\; (11)_{10}$$
From the result of the comparison, we obtain a binary mask whose bits, not considering the central one, are concatenated row by row, from left to right, forming the final 8-bit code that represents the central pixel of the window W. Finally, the decimal value corresponding to this 8-bit code is assigned as the central pixel value of W. The same operation is performed for the window to be compared.
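The 3 × 3 census coding just described and the Hamming distance used in Eq. (4.31) can be sketched as follows; the right-hand window is a hypothetical example.

```python
import numpy as np

def census_3x3(W):
    """8-bit census code of the central pixel of a 3x3 window (Eq. 4.30).

    Bits are taken row by row, left to right, skipping the central pixel:
    a bit is 1 where the neighbor is smaller than the center.
    """
    center = W[1, 1]
    bits = [int(W[k, l] < center)
            for k in range(3) for l in range(3) if (k, l) != (1, 1)]
    return int("".join(map(str, bits)), 2)

def hamming(a, b):
    """Number of differing bits between two census codes."""
    return bin(a ^ b).count("1")

W_L = np.array([[79, 42, 51], [46, 36, 34], [37, 30, 28]])
W_R = np.array([[80, 45, 20], [44, 35, 30], [36, 29, 27]])   # hypothetical right window
print(census_3x3(W_L))                          # 11, i.e. (00001011)_2 as in the text
print(hamming(census_3x3(W_L), census_3x3(W_R)))  # 1 differing bit
```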
The dissimilarity measure DCT , based on the CT, is evaluated by comparing, with the Hamming distance (which counts the number of differing bits), the bit strings of the windows to be compared. The dissimilarity measure DCT is calculated according to (4.30) as follows:
$$D_{CT}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M} \mathrm{Dist}_{Hamming}\bigl[\mathrm{Census}_{W_L}(x_L + k, y_L + l),\ \mathrm{Census}_{W_R}(i + m + k, j + n + l)\bigr] \tag{4.31}$$
where CensusWL represents the census bit string of left window WL (xL , yL ) on the
left stereo image and CensusWR (i + m, j + n) represents the census bit string of the
windows in the search area R in the right stereo image that must be compared (see
Fig. 4.56).
The function (4.30) generates the census bit string whose length depends, as seen in the example, on the size of the window, i.e., Ls = (2M + 1)² − 1, which is the number of pixels in the window minus the central one (in the example Ls = 8). Compared to the rank transform, there is a considerable increase in the dimensionality of the data, depending on the size of the windows to be compared, with a consequent increase in the required computation. For this method, real-time implementations based on ad hoc hardware (FPGA, Field Programmable Gate Arrays) have been developed for the treatment of the binary strings. Experimental results have shown that these last two dissimilarity measures, based on the rank and census transforms, are more efficient than correlation-based methods in obtaining disparity maps from stereo images in the presence of occlusions and radiometric distortions. Figure 4.58 shows the result of the census transform (figure c) applied to the left image of Fig. 4.57 and the disparity map (figure d) obtained on the basis of the census transform and the Hamming distance.

4.7.2.5 Gradient-Based Matching Function


More reliable matching measures can be obtained by operating on the magnitude of the gradient of the images, replacing the intensity levels, which are more sensitive to noise [28]. For the calculation of depth maps from stereo images with small disparities, it is possible to use an approach based on the optical flow, formulated through a differential equation that links the motion information to the intensity of the stereo images
(∇x I )ν + It = 0 (4.32)
where (∇x I ) indicates the horizontal component of the gradient of the image I , It is the time derivative, which in this context refers to the intensity differences between the two stereo images, and ν is the translation between the two images. The optical flow functional (4.32) assumes that the lighting conditions of the scene do not change during the acquisition of the stereo images. A dense disparity map can be iteratively estimated with the least squares method applied to the system of differential equations generated by applying (4.32) to each pixel of the window centered on the pixel being processed, imposing the constraint that the disparity varies continuously over the pixels of the window itself. This minimization process is repeated at each pixel of the image to estimate the disparity. This methodology is described in detail in Sect. 6.4.
Another way to use the gradient is to accumulate the horizontal and vertical gradient differences calculated for each pixel of the two windows WL and WR to compare. As a result, a measure of dissimilarity between the accumulated gradients of the two windows is obtained. We can consider dissimilarity measures DG SAD and DG SSD based, respectively, on the sum of absolute differences (SAD) and on the sum of squared differences (SSD) of the horizontal (∇x W ) and vertical (∇y W ) components of the gradient of the local structures to be compared. In this case, the dissimilarity measures DG , based on the components of the gradient vector accumulated according to the SSD or SAD sum, are given, respectively, by the following functions:
$$D_{G_{SSD}}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\Bigl\{\bigl[\nabla_x W_L(x_L + k, y_L + l) - \nabla_x W_R(i + m + k, j + n + l)\bigr]^2 + \bigl[\nabla_y W_L(x_L + k, y_L + l) - \nabla_y W_R(i + m + k, j + n + l)\bigr]^2\Bigr\}$$

$$D_{G_{SAD}}(i, j; m, n) = \sum_{k=-M}^{+M}\sum_{l=-M}^{+M}\Bigl\{\bigl|\nabla_x W_L(x_L + k, y_L + l) - \nabla_x W_R(i + m + k, j + n + l)\bigr| + \bigl|\nabla_y W_L(x_L + k, y_L + l) - \nabla_y W_R(i + m + k, j + n + l)\bigr|\Bigr\}$$

using the same symbols defined in the previous paragraphs.
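A rough sketch of the gradient-based SAD measure above, using simple finite differences to approximate the gradient components (the windows below are hypothetical), could be the following.

```python
import numpy as np

def gradient_sad(W_L, W_R):
    """Sum of absolute differences of the horizontal and vertical gradient components."""
    gyL, gxL = np.gradient(W_L.astype(float))   # finite-difference gradients (rows, cols)
    gyR, gxR = np.gradient(W_R.astype(float))
    return np.sum(np.abs(gxL - gxR)) + np.sum(np.abs(gyL - gyR))

rng = np.random.default_rng(1)
W_L = rng.integers(0, 256, size=(5, 5)).astype(float)
print(gradient_sad(W_L, W_L + 40.0))      # 0: insensitive to a constant intensity offset
print(gradient_sad(W_L, np.flipud(W_L)))  # large: different local structure
```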


In [29] a technique is proposed that uses the magnitudes of the gradient vectors |∇WL | and |∇WR | to be compared, deriving a similarity measure (called evidence measure) as the weighted combination of two terms: the average of the gradient magnitudes, (|∇WL | + |∇WR |)/2, and the negated (weighted) magnitude of the difference of the gradients. The evidence measure EG based on the two terms is then

$$E_G(i, j; m, n) = \frac{\bigl|\nabla W_L(x_L, y_L)\bigr| + \bigl|\nabla W_R(i + m, j + n)\bigr|}{2} - \alpha\,\bigl|\nabla W_L(x_L, y_L) - \nabla W_R(i + m, j + n)\bigr|$$
where α is a weight parameter that balances the two terms. A large value of EG implies a high similarity between the compared gradient vectors.

4.7.2.6 Global Methods for Correspondence


Methods that evaluate the correspondence on a global basis (i.e., analyzing the entire
image to assign the disparity), as an alternative to local and point-like methods,
have been proposed and are useful to further reduce the problem of local inten-
sity discontinuity, occlusion, and presence of uniform texture in stereo images. The
solution to these problems requires global constraints that involve an increase in
computational complexity and cannot be easily implemented at present in real time.
Precisely to reduce as much as possible the computational load, various methods
based on dynamic programming (assuming ordering and continuity constraints) and

on minimizing a global energy function formulated as the sum of evidence and com-
patibility have been proposed. Methods based on dynamic programming attempt to reduce the complexity of the problem by decomposing it into smaller and simpler subproblems, setting a cost functional in the various stages with appropriate constraints. The best known algorithms are: Graph Cuts [30], Belief Propagation [31], Intrinsic Curves [32], and Nonlinear Diffusion [33]. Further methods have been developed, known as hybrid or semiglobal methods, which essentially use approaches similar to the global ones but operate on parts of the image, such as the image line by line, to reduce the
considerable computational load required by global methods.

4.7.3 Sparse Elementary Structures

So far we have generically considered point-like and local elementary structures represented by windows WL and WR of size (2M + 1) in the stereo images, without specifying the criterion by which the candidate windows WL are selected for the search of the corresponding window WR in the right image. One can easily think of excluding homogeneous structures with little texture. The objective is to analyze the left image and identify a set of significant structures, point-like or local, that are probably also found in the right image. These structures are those detected with the algorithms described in Chap. 6 Vol. II, Points and descriptors of points of interest, or the Structures of Interest (SdI) (border elements, small or large areas with texture) calculated in different positions (x, y) in the stereo images, without the constraints of epipolarity and knowledge of the parameters of the binocular system. That chapter also contains some examples of searching for homologous points of interest in stereo image pairs using the SIFT algorithm.

4.7.4 PMF Stereo Vision Algorithm

The PMF stereo vision algorithm (A stereo correspondence algorithm using a disparity gradient limit), proposed in 1985 by Pollard, Mayhew, and Frisby [28], is based on a computational model in which the problem of stereo correspondence is seen as intimately integrated with the problem of identifying and describing candidate and homologous elementary structures (primal sketch). This contrasts with the computational model proposed by Marr and Poggio, which sees stereo vision with separate modules for the identification of elementary structures and the correspondence problem.
The PMF algorithm differs mainly because it includes the constraint of the continuity of the visible surface (figural continuity). Candidate homologous structures must have a disparity value contained within a certain range, and if there are several candidate homologous structures, some will have to be eliminated by verifying whether structures in the vicinity of the candidate support the same surface continuity relation (figural continuity).


Fig. 4.59 Cyclopic separation and disparity gradient

The figural continuity constraint eliminates many false pairs of homologous structures when applied to natural and simulated images. In particular, PMF exploits the findings of Burt and Julesz [34], who found that, in human binocular vision, the image fusion process tolerates homologous structures with a disparity gradient up to the value 1. The disparity gradient of two elementary structures A and B (points, contours, zero crossings, SdI, etc.) identified in the pair of stereo images is given by the ratio between the difference of their disparity values and their cyclopean separation (see Fig. 4.59 for the concept of disparity gradient).
The disparity gradient measures the relative disparity of two pairs of homologous
elementary structures. The authors of the PMF algorithm assert that the fusion process
of binocular images can tolerate homologous structures within the unit value of the disparity gradient. In this way, false homologous structures are avoided and the constraint of the continuity of the visible surface introduced previously is implicitly satisfied.
Now let us see how to calculate the disparity gradient by considering pairs of points (A, B) in the two stereo images (see Fig. 4.59). Let AL = (xAL , yAL ), BL = (xBL , yBL ), AR = (xAR , yAR ), and BR = (xBR , yBR ) be the projections in the two stereo images of the points A and B of the visible 3D surface. The disparity values for this pair of points A and B are given by
$$d_A = x_{A_R} - x_{A_L} \qquad \text{and} \qquad d_B = x_{B_R} - x_{B_L} \tag{4.33}$$
A cyclopic image (see Fig. 4.59) is obtained by projecting the considered points A
and B, respectively, in the points Ac and Bc whose coordinates are given by the
average of the coordinates of the same points A and B projected in the stereo image
pair. The coordinates of the cyclopic points Ac and Bc are

$$x_{A_c} = \frac{x_{A_L} + x_{A_R}}{2} \qquad \text{and} \qquad y_{A_c} = y_{A_L} = y_{A_R} \tag{4.34}$$

$$x_{B_c} = \frac{x_{B_L} + x_{B_R}}{2} \qquad \text{and} \qquad y_{B_c} = y_{B_L} = y_{B_R} \tag{4.35}$$

The cyclopean separation S is given by the Euclidean distance between the cyclopic
points

$$\begin{aligned} S(A, B) &= \sqrt{(x_{A_c} - x_{B_c})^2 + (y_{A_c} - y_{B_c})^2}\\ &= \sqrt{\left(\frac{x_{A_L} + x_{A_R}}{2} - \frac{x_{B_L} + x_{B_R}}{2}\right)^2 + (y_{A_c} - y_{B_c})^2}\\ &= \sqrt{\frac{1}{4}\bigl[(x_{A_L} - x_{B_L}) + (x_{A_R} - x_{B_R})\bigr]^2 + (y_{A_c} - y_{B_c})^2}\\ &= \sqrt{\frac{1}{4}\bigl[d_x(A_L, B_L) + d_x(A_R, B_R)\bigr]^2 + (y_{A_c} - y_{B_c})^2}\end{aligned} \tag{4.36}$$
where dx (AL , BL ) and dx (AR , BR ) are the horizontal distances of points A and B pro-
jected in the two stereo images. The difference in disparity between the pairs of
points (AL , AR ) and (BL , BR ) is given as follows:
$$\begin{aligned} D(A, B) &= d_A - d_B\\ &= (x_{A_R} - x_{A_L}) - (x_{B_R} - x_{B_L})\\ &= (x_{A_R} - x_{B_R}) - (x_{A_L} - x_{B_L})\\ &= d_x(A_R, B_R) - d_x(A_L, B_L)\end{aligned} \tag{4.37}$$
The disparity gradient G for the pair of homologous points (AL , AR ) and (BL , BR ) is given by the ratio of the disparity difference D(A, B) to the cyclopean separation S(A, B) given by Eq. (4.36):
$$G(A, B) = \frac{D(A, B)}{S(A, B)} = \frac{d_x(A_R, B_R) - d_x(A_L, B_L)}{\sqrt{\frac{1}{4}\bigl[d_x(A_L, B_L) + d_x(A_R, B_R)\bigr]^2 + (y_{A_c} - y_{B_c})^2}} \tag{4.38}$$

With this definition of the disparity gradient G, the disparity gradient limit constraint is immediately applicable, imposing through Eq. (4.38) that G must never exceed unity. It follows that even small disparity differences are not acceptable if the points A and B, considered in 3D space, are very close to each other. This is easy to understand and is supported by the physical evidence reported by the PMF authors.
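The disparity gradient of Eqs. (4.36)–(4.38) for a pair of candidate matches can be computed directly from the four image coordinates, as in this sketch (the coordinates are hypothetical).

```python
import math

def disparity_gradient(A_L, A_R, B_L, B_R):
    """Disparity gradient G(A, B) of Eq. (4.38) from the four image points (x, y)."""
    dx_L = A_L[0] - B_L[0]                    # horizontal separation d_x(A_L, B_L)
    dx_R = A_R[0] - B_R[0]                    # horizontal separation d_x(A_R, B_R)
    y_Ac = (A_L[1] + A_R[1]) / 2.0            # cyclopean y coordinates (Eqs. 4.34-4.35)
    y_Bc = (B_L[1] + B_R[1]) / 2.0
    D = dx_R - dx_L                           # disparity difference, Eq. (4.37)
    S = math.sqrt(0.25 * (dx_L + dx_R) ** 2 + (y_Ac - y_Bc) ** 2)   # Eq. (4.36)
    return D / S

# Hypothetical matches: A and B close in depth give a small disparity gradient.
print(disparity_gradient(A_L=(120, 80), A_R=(112, 80),
                         B_L=(150, 95), B_R=(143, 95)))
```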
The PMF algorithm includes the following steps:

1. Identification, in the pair of stereo images, of the candidate elementary structures for which to evaluate the correspondence;
2. Calculation of the homologous structures under the conditions of epipolar geometry. The search for homologous structures takes place only by analyzing the pixels of the stereo images horizontally;
3. Assumption that the correspondence is unique, i.e., a structure in one image has only one homologous structure in the other image and vice versa. Obviously, there will be situations, due to occlusion, in which some structure has more than one candidate structure in the other image. For each candidate homologous pair, a likelihood index is increased according to the number of other homologous structures found that do not violate the chosen disparity gradient limit.

4. Choice of the homologous structures with the highest likelihood index. By the uniqueness constraint, the other, incorrect homologous pairs are removed and excluded from further consideration.
5. Return to step 2, and the indices are redetermined considering the homologous points derived so far.
6. The algorithm terminates when all possible pairs of homologous points have been extracted.

From the procedure described, it is observed that the PMF algorithm assumes that a set of candidate homologous points is found in each stereo image and proposes to find the correspondence for pairs of points (A, B), i.e., for pairs of homologous points. The calculation of the correspondence of the single points is facilitated by the epipolar geometry constraint, and the uniqueness of the correspondence is used in step 4 to prevent the same point from being used more than once in the calculation of the disparity gradient.
The estimation of the likelihood index accounts for the fact that the less likely correspondences are those farther from the limit value of the disparity gradient. In fact, the candidate pairs of homologous points considered are those with a disparity gradient close to the unit value. It is reasonable to consider only pairs that fall within a circular area of radius seven, although this value depends on the geometry of the vision system and of the scene.
This means that even small values of the disparity difference D are easily detected and discarded when caused by points that are very close together in 3D space. The PMF algorithm has been successfully tested on several natural and artificial scenes. When the constraints of uniqueness, epipolarity, and the disparity gradient limit are violated, the results are not good. In these conditions, it is possible to use algorithms that calculate the correspondence for a number of points greater than two.
For example, we can organize the Structures of Interest (SdI) as a set of nodes
related to each other in topological terms. Their representation can be organized with
a graph H(V,E) where V = {A1 , A2 , . . . , An } is the set of Structures of Interest Ai
and E = {e1 , e2 , . . . , en } is the set of arcs that constitute the topological relations
between nodes (for example, based on the Euclidean distance).
In this case, the matching process is reduced to looking for the structures of interest
SdIL and SdIR in the pair of stereo images, organizing them in the form of graphs,
and subsequently performing the comparison of graphs or subgraphs (see Fig. 4.60).
In graph theory, the comparison of graphs or subgraphs is known as graph or subgraph isomorphism.
The difficulty of graph isomorphism emerges, however, considering that many graphs HL and HR can be generated from the set of potentially homologous points, and that such graphs can never be identical due to the diversity of the stereo image pair. The problem is made solvable by setting it up as an evaluation of the similarity of the graphs or subgraphs. In the literature, several algorithms are proposed for the graph comparison problem [35–38].


Fig. 4.60 Topological organization between structures of interest

References
1. D. Marr, S. Ullman, Directional selectivity and its use in early visual processing, in Proceedings
of the Royal Society of London. Series B, Biological Sciences, vol. 211 (1981), pp. 151–180
2. D. Marr, E. Hildreth, Theory of edge detection, in Proceedings of the Royal Society of London.
Series B, Biological Sciences, vol. 207 (1167) (1980), pp. 187–217
3. S.W. Kuffler, Discharge patterns and functional organization of mammalian retina. J. Neuro-
physiol. 16(1), 37–68 (1953)
4. C. Enroth-Cugell, J.G. Robson, The contrast sensitivity of retinal ganglion cells of the cat. J.
Neurophysiol. 187(3), 517–552 (1966)
5. F.W. Campbell, J.G. Robson, Application of fourier analysis to the visibility of gratings. J.
Physiol. 197, 551–566 (1968)
6. D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in
the cat’s visual cortex. J. Physiol. 160(1), 106–154 (1962)
7. V. Bruce, P. Green, Visual Perception: Physiology, Psychology, and Ecology, 4th edn. (Lawrence
Erlbaum Associates, 2003). ISBN 1841692387
8. H. von Helmholtz, Handbuch der physiologischen optik, vol. 3 (Leopold Voss, Leipzig, 1867)
9. R.K. Olson, F. Attneave, What variables produce similarity grouping? J. Physiol. 83, 1–21
(1970)
10. B. Julesz, Visual pattern discrimination. IRE Trans. Inf. Theory 8(2), 84–92 (1962)
11. M. Minsky, A framework for representing knowledge, in The Psychology of Computer Vision,
ed. by P. Winston (McGraw-Hill, New York, 1975), pp. 211–277
12. D. Marr, H. Nishihara, Representation and recognition of the spatial organization of three-
dimensional shapes. Proc. R. Soc. Lond. 200, 269–294 (1987)
13. I. Biederman, Recognition-by-components: a theory of human image understanding. Psychol.
Rev. 94, 115–147 (1987)
14. J.E. Hummel, I. Biederman, Dynamic binding in a neural network for shape recognition.
Psychol. Rev. 99(3), 480–517 (1992)

15. T. Poggio, C. Koch, Ill-posed problems in early vision: from computational theory to analogue
networks, in Proceedings of the Royal Society of London. Series B, Biological Sciences, vol.
226 (1985), pp. 303–323
16. B. Julesz, Foundations of Cyclopean Perception (The MIT Press, 1971). ISBN 9780262101134
17. L. Ungerleider, M. Mishkin, Two cortical visual systems, in ed. by D.J. Ingle, M.A.
Goodale, R.J.W. Mansfield, Analysis of Visual Behavior (MIT Press, Cambridge MA, 1982),
pp. 549–586
18. D. Marr, Vision: A Computational Investigation into the Human Representation and Processing
of Visual information, 1st edn. (The MIT Press, 2010). ISBN 978-0262514620
19. W.E.L. Grimson, From Images to Surfaces: A Computational Study of the Human Early Visual
System, 4th edn. (MIT Press, Cambridge, Massachusetts, 1981). ISBN 9780262571852
20. D. Marr, T. Poggio, A computational theory of human stereo vision, in Proceedings of the
Royal Society of London. Series B, vol. 204 (1979), pp. 301–328
21. O. Faugeras, Three-Dimensional Computer Vision: A Geometric Approach (MIT Press, Cam-
bridge, Massachusetts, 1996)
22. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edn. (Cambridge,
2003)
23. R.Y. Tsai, A versatile camera calibration technique for 3D machine vision. IEEE J. Robot.
Autom. 4, 323–344 (1987)
24. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach.
Intell. 22(11), 1330–1334 (2000)
25. A. Goshtasby, S.H. Gage, J.F. Bartholic, A two-stage cross correlation approach to template
matching. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6, 374–378 (1984)
26. W. Zhang, K. Hao, Q. Zhang, H. Li, A novel stereo matching method based on rank transfor-
mation. Int. J. Comput. Sci. Issues 2(10), 39–44 (2013)
27. R. Zabih, J. Woodfill, Non-parametric local transforms for computing visual correspondence,
in Proceedings of the 3rd European Conference Computer Vision (1994), pp. 150–158
28. S.B. Pollard, J.E.W. Mayhew, J.P. Frisby, PMF: a stereo correspondence algorithm using a
disparity gradient limit. Perception 14, 449–470 (1985)
29. D. Scharstein, Matching images by comparing their gradient fields, in Proceedings of 12th
International Conference on Pattern Recognition, vol. 1 (1994), pp. 572–575
30. Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001)
31. J. Sun, N.N. Zheng, H.Y. Shum, Stereo matching using belief propagation, in Proceedings of
the European Conference Computer Vision (2002), pp. 510–524
32. C. Tomasi, R. Manduchi, Stereo matching as a nearest-neighbor problem. IEEE Trans. Pattern
Anal. Mach. Intell. 20, 333–340 (1998)
33. D. Scharstein, R. Szeliski, Stereo matching with nonlinear diffusion. Int. J. Comput. Vis. 28(2), 155–174 (1998)
34. P. Burt, B. Julesz, A disparity gradient limit for binocular fusion. Science 208, 615–617 (1980)
35. N. Ayache, B. Faverjon, Efficient registration of stereo images by matching graph descriptions of edge segments. Int. J. Comput. Vis. 2(1), 107–131 (1987)
36. D.H. Ballard, C.M. Brown, Computer Vision (Prentice Hall, 1982). ISBN 978-0131653160
37. R. Horaud, T. Skordas, Stereo correspondence through feature grouping and maximal cliques. IEEE Trans. Pattern Anal. Mach. Intell. 11(11), 1168–1180 (1989)
38. A. Branca, E. Stella, A. Distante, Feature matching by searching maximum clique on high order
association graph, in International Conference on Image Analysis and Processing (1999), pp.
642–658
5 Shape from Shading

5.1 Introduction

With Shape from Shading, in the field of computer vision, we intend to reconstruct
the shape of the visible 3D surface using only the brightness variation information,
i.e., the gray-level shades present in the image.
It is well known that an artist is able to represent the geometric shape of the objects of the world in a painting (black/white or color) by creating shades of gray or color. Looking at the painting, the human visual system analyzes these shades of brightness and can perceive the shape of 3D objects even though they are represented in the two-dimensional painting. The artist's skill consists in projecting the 3D scene onto the 2D plane of the painting, creating for the observer, through the shades of gray (or color), the impression of viewing the scene in 3D.
The Shape from Shading approach poses essentially the analogous problem: from the variation of luminous intensity in the image we intend to reconstruct the visible surface of the scene. In other words, the inverse problem of reconstructing the shape of the visible surface from the brightness variations present in the image is known as the Shape from Shading problem.
In Chap. 2 Vol. I, the fundamental aspects of radiometry involved in the image formation process were examined, culminating in the definition of the fundamental formula of radiometry. These aspects must be considered to solve the problem of Shape from Shading, finding solutions based on reliable physical–mathematical foundations while also taking into account the complexity of the problem.
The statement reconstruction of the visible surface must not be strictly understood
as a 3D reconstruction of the surface. We know, in fact, that from a single point of
observation of the scene, a monocular vision system cannot estimate a distance


measure between observer and visible object.1 Horn [1] in 1970 was the first to
introduce the Shape from Shading paradigm by formulating a solution based on
the knowledge of the light source (direction and distribution), the scene reflectance
model, the observation point and the geometry of the visible surface, which together
contribute to the process of image formation.
In other words, Horn has derived the relations between the values of the luminous
intensity of the image and the geometry of the visible surface (in terms of the orienta-
tion of the surface point by point) under some lighting conditions and the reflectance
model. To understand the paradigm of the Shape from Shading, it is necessary to
introduce two concepts: the reflectance map and the gradient space.

5.2 The Reflectance Map

The basic concept of the reflectance map is to determine a function that calculates
the orientation of the surface from the brightness of each point of the scene. The
fundamental relation Eq. (2.34), of the image formation process, described in Chap. 2
Vol. I, allows to evaluate the brightness value of a generic point a of the image,
generated by the luminous radiance reflected from a point A of the object.2 We also
know that the process of image formation is conditioned by the characteristics of
the optical system (focal length f and diameter of the lens d) and by the model of
reflectance considered. We recall that the fundamental equation linking the image irradiance E to the scene radiance is the following:

E(a, ψ) = (π/4) (d/f)² cos⁴ψ · L(A, Θ)     (5.1)

where L is the radiance of the object's surface, and Θ is the angle formed with the
optical axis of the incident light ray coming from a generic point A of the object
(see Fig. 5.1). It should be noted that the image irradiance E is linearly related to the radiance of the surface: it is proportional to the area of the lens (defined by the diameter d), it is inversely proportional to the square of the distance between lens and image plane (dependent on the focal length f), and it decreases as the angle ψ between the optical axis and the line of sight increases.
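As a minimal numerical illustration of Eq. (5.1), the following Python sketch (the function name and the example values are ours, chosen only for illustration) computes the image irradiance E from the scene radiance L, the lens diameter d, the focal length f, and the off-axis angle ψ:

import numpy as np

def image_irradiance(L, d, f, psi):
    # Image irradiance from scene radiance, Eq. (5.1):
    # E = (pi/4) * (d/f)**2 * cos(psi)**4 * L
    return (np.pi / 4.0) * (d / f) ** 2 * np.cos(psi) ** 4 * L

# Example: L = 100 W/(m^2 sr), 25 mm aperture, 50 mm focal length, 10 degrees off-axis
E = image_irradiance(L=100.0, d=0.025, f=0.050, psi=np.radians(10.0))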

1 While with the stereo approach we have a quantitative measure of the depth of the visible surface, with shape from shading we have a nonmetric but qualitative (ordinal) reconstruction of the surface.
2 In this paragraph, we will indicate with A the generic point of the visible surface and with a its

projection in the image plane, instead of, respectively, P and p, indicated in the radiometry chapter,
to avoid confusion with the Gradient coordinates that we will indicate in the next paragraph with
( p, q).

Fig. 5.1 Relationship between the radiance of the object and the irradiance of the image


Fig. 5.2 Diagram of the components of a vision system for the shape from shading. The surface
element receives in A the radiant flow L i from the source S with direction s = (θi , φi ) and a part
of it is emitted in relation to the typology of material. The brightness of the I (x, y) pixel depends
on the reflectance properties of the surface, on its shape and orientation defined by the angle θi
between the normal n and the vector s (source direction), from the properties of the optical system
and from the exposure time (which also depends on the type of sensor)

We also rewrite (with the symbols according to Fig. 5.2), Eq. (2.16) of the image
irradiance, described in Chap. 2 Vol. I Radiometric model, given by

I(x, y) ≅ Le(X, Y, Z) = Le(A, θe, φe) = F(θi, φi; θe, φe) · Ei(θi, φi)     (5.2)

where we recall that F is the BRDF function (Bidirectional Reflectance Distribution Function), Ei(θi, φi) is the incident irradiance associated with the radiance
L i (θi , φi ) coming from the source S in the direction (θi , φi ), L e (θe , φe ) is the radi-
ance emitted by the surface from A in the direction of the observer due to the incident

irradiance Ei. It is hypothesized that the radiant energy Le(θe, φe) from A is completely projected into a in the image (under the ideal lens assumption), generating the intensity I(x, y) (see Fig. 5.2).
The reflectance model described by Eq. (5.2) tells us that the irradiance depends on the direction of the incident light coming from a source (point-like or extended), on the direction of observation, on the geometry of the visible surface (so far described by a surface element centered in A(X, Y, Z)), and on the BRDF function (diffuse or specular reflection), which in this case has been indicated with the symbol F to distinguish it from the symbol f of the focal length of the lens.
With the introduction of the reflectance map, we want to explicitly formalize
how the irradiance at the point a of the image is influenced simultaneously by three
factors3 :

1. Reflectance model. A part of the light energy is absorbed (for example, converted into heat), a part can be transmitted by the illuminated object itself (by refraction and scattering), and the rest is reflected into the environment; this last is the radiating component that reaches the observer.
2. Light source and direction.
3. Geometry of the visible surface, i.e., its orientation with respect to the observer’s
reference system (also called viewer-centered).

Figure 5.2 schematizes all the components of the image formation process highlight-
ing the orientation of the surface in A and of the lighting source. The orientation of
the surface in A(X, Y, Z ) is given by the vector n, which indicates the direction of the
normal to the tangent plane to the visible surface passing through A. Recall that the
BRDF function defines the relationship between L e (θe , φe ) the radiance reflected by
the surface in the direction of the observer and E i (θi , φi ) the irradiance incident on
the object at the point A coming from the source with known radiant flux L i (θi , φi )
(see Sect. 2.3 of Vol. I).
In the Lambertian modeling conditions, the image irradiance I L (x, y) is given by
Eq. (2.20) described in Chap. 2 Vol. I Radiometric model, which we rewrite as
IL(x, y) ≅ Ei · (ρ/π) = (ρ/π) Li cos θi     (5.3)
where 1/π is the BRDF of a Lambertian surface, θi is the angle formed by the vector n
normal to the surface in A, and the vector s = (θi , φi ) which represents the direction
of the irradiance incident E i in A generated by the source S which emits radiant
flux incident in the direction (θi , φi ). ρ indicates the albedo of the surface, seen
as reflectance coefficient which expresses the ability of the surface to reflect/absorb
incident irradiance at any point. The albedo has values in the range 0 ≤ ρ ≤ 1 and
for energy conservation, the missing part is due to absorption. Often a surface with a

3 Inthis context, the influence of the optical system, exposure time, and the characteristics of the
capture sensor are excluded.

uniform reflectance coefficient is assumed, with ρ = 1 (ideal reflector, while ρ = 0 indicates an ideal absorber), even if in physical reality this hardly happens.
From Eq. (5.3), it is possible to derive the expression of the Lambertian reflectance
map, according to the 3 points above, given by
R(A, θi) = (ρ/π) Li cos θi     (5.4)
which relates the radiance reflected by the surface element in A (whose direction
with respect to the incident radiant flux is given by θi ) with the irradiance of the
image observed in a.
Equation (5.4) specifies the brightness of a surface element taking into consider-
ation: the orientation of the normal n of a surface element (with respect to which the
incident radiant flux is evaluated), of the radiance of the source and of the model of
Lambertian reflectance. In these conditions, the fundamental radiometry Eq. (2.37),
described in Chap. 2 Vol. I, is valid, which expresses the equivalence between radi-
ance L(A) of the object and irradiance E(a) of the image (symbols according to the
note 2):
I(a) = I(x, y) = I(i, j) ≅ E(a, ψ) = L(A, Θ)     (5.5)

In this context, we can indicate with R(A, θi ) the radiance of the object expressed
by (5.4) that by replacing in the (5.5) we get

E(a) = R(A, θi ) (5.6)

which is the fundamental equation of Shape from Shading. Equation (5.6) directly
links the luminous intensity, i.e., the irradiance E(a) of the image to the orientation
of the visible surface at the point A, given by the angle θi between the surface normal
vector n and the vector s of incident light direction (see Fig. 5.2).
We summarize the properties of this simple Lambertian radiometric model:

(a) The radiance of a surface element (patch) A is independent of the point of observation.
(b) The brightness of a pixel in a is proportional to the radiance of the corresponding
surface element A of the scene.
(c) The radiance of a surface element is proportional to the cosine of the angle θi
formed between the normal to the patch and the direction of the source.
(d) It follows that the brightness of a pixel is proportional to the cosine of this angle
(Lambert’s cosine law).
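These properties can be condensed into a short sketch that evaluates the Lambertian radiance of Eq. (5.4) from the unit normal n and the unit source direction s (the names are ours; clipping negative cosines to zero, to model patches facing away from the source, is our simplifying assumption):

import numpy as np

def lambertian_radiance(n, s, rho=1.0, L_i=1.0):
    # Radiance of a Lambertian patch, Eq. (5.4): R = (rho/pi) * L_i * cos(theta_i),
    # with n and s unit vectors (surface normal and source direction).
    cos_theta = max(0.0, float(np.dot(n, s)))  # no negative brightness (assumption)
    return (rho / np.pi) * L_i * cos_theta

# Example: patch tilted 45 degrees away from a vertical source
R = lambertian_radiance(n=np.array([0.0, np.sin(np.pi/4), np.cos(np.pi/4)]),
                        s=np.array([0.0, 0.0, 1.0]))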

5.2.1 Gradient Space

It is convenient to express the fundamental equation of Shape from Shading, given by (5.6), with coordinates defined with respect to the reference system of the image. To do this, it is useful to introduce the concept of gradient space and


Fig. 5.3 Graphical representation of the gradient space. Planes parallel to the plane x − y (for
example, the plane Z = 1, whose normal has components p = q = 0) have zero gradients in both
directions in x and y. For a generic patch (not parallel to the plane x y) of the visible surface, the
orientation of the normal vector ( p, q) is given by (5.10) and the equation of the plane comprising
the patch is Z = px +qy +k. In the gradient space p −q the orientation of each patch is represented
by a point indicated by the gradient vector ( p, q). The direction of the source given by the vector s
is also reported in the gradient space and is represented by the point ( ps , qs )

to review both the reference systems of the visible surface and that of the source,
which becomes the reference system of the image plane (x, y) as shown in Figs. 5.2
and 5.3.
It can be observed how, in the hypothesis of a convex visible surface, for each
of its generic point A we have a tangent plane, and its normal outgoing from A
indicates the attitude (or orientation) of the 3D surface element in space represented
by the point A(X, Y, Z ). The projection of A in the image plane identifies the point
a(x, y) which can be obtained from the perspective equations (see Sect. 3.6 Vol. II
Perspective Transformation):
x = f X/Z     y = f Y/Z     (5.7)
remembering that f is the focal length of the optical system. If the distance of the
object from the vision system is very large, the geometric projection model can be
simplified assuming the following orthographic projection is valid:

x=X y=Y (5.8)

which means projecting the points of the surface through parallel rays and, less than
a scale factor, the horizontal and vertical coordinates of the reference system of the
image plane (x, y) and of the reference system (X, Y, Z ) of the world, coincide.
Letting f → ∞ implies that Z → ∞ as well, so that the ratio f/Z can be taken as unity, which justifies (5.8) for the orthographic geometric model.
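A toy comparison of the two projection models of Eqs. (5.7) and (5.8) (illustrative code only; the scale factor of the orthographic model is omitted):

def project_perspective(X, Y, Z, f):
    # Perspective projection, Eq. (5.7): x = f*X/Z, y = f*Y/Z
    return f * X / Z, f * Y / Z

def project_orthographic(X, Y):
    # Orthographic approximation, Eq. (5.8): x = X, y = Y (up to a scale factor)
    return X, Y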

Under these conditions, for the geometric reconstruction of the visible surface, i.e., determining the distance Z of each point A from the observer, Z can be thought of as a function of the coordinates of the same point A projected in the image plane in a(x, y). Reconstructing the visible surface then means finding the function:

z = Z (x, y) (5.9)

The fundamental equation of Shape from Shading (5.6) cannot calculate the function
of the distances Z (x, y) but, through the reflectance map R(A, θn ) the orientation of
the surface can be reconstructed, point by point, obtaining the so-called orientation
map. In other words, this involves calculating, for each point A of the visible surface,
the normal vector n, calculating the local slope of the visible surface that expresses
how the tangent plane passing through A is oriented with respect to observer.
The local slopes at each point A(X, Y, Z ) are estimated by evaluating the partial
derivatives of Z (x, y) with respect to x and y. The gradient of the surface Z (x, y)
at the point A(X, Y, Z) is given by the vector (p, q) obtained with the following
partial derivatives:

p = ∂Z(x, y)/∂x     q = ∂Z(x, y)/∂y     (5.10)
where p and q are, respectively, the components relative to the x-axis and y-axis of
the surface gradient.
Calculated with the gradient of the surface, the orientation of the tangent plane,
the gradient vector ( p, q) is bound to the normal n of the surface element centered
in A, by the following relationship:

n = ( p, q, 1)T (5.11)

which shows how the orientation of the surface in A can be expressed by the normal
vector n, of which we are interested only in the orientation and not in the magnitude. More precisely, (5.11) tells us that in correspondence with unit variations of the distance Z, the variations Δx and Δy in the image plane, around the point a(x, y), must be p and q, respectively.
To have the normal vector n unitary, it is necessary to divide the normal n to the surface by its length:

nN = n/|n| = (p, q, 1)/√(1 + p² + q²)     (5.12)

The pair ( p, q) constitutes the gradient space that represents the orientation of each
element of the surface (see Fig. 5.3). The gradient space expressed by the (5.11)
can be seen as a plane parallel to the plane X-Y placed at the distance Z = 1. The
geometric characteristics of the visible surface can be specified as a function of the
coordinates of the image plane x-y while the coordinates p and q of the gradient
space have been defined to specify the orientation of the surface. The map of the

orientation of the visible surface is given by the following equations:

p = p(x, y) q = q(x, y) (5.13)

The origin of the gradient space is given by the normal vector (0, 0, 1), which is
normal to the image plane, that is, with the visible surface parallel to the image
plane. The more the normal vectors move away from the origin of the gradient
space, the larger the inclination of the visible surface is compared to the observer.
In the gradient space is also reported the direction of the light source expressed by
the gradient components ( ps , qs ). The shape of the visible surface, as an alternative
to the gradient space ( p, q), can also be expressed by considering the angles (σ, τ )
which are, respectively, the angle between the normal n and the Z -axis (reference
system of the scene) which is the direction of the observer, and the angle between
the projection of the normal n in the image plane and the x-axis of the image plane.
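The passage from a depth map Z(x, y) to the gradient space (p, q) and to the unit normals of Eqs. (5.10)–(5.12) can be sketched as follows (illustrative code; np.gradient is used as a simple finite-difference estimate of the partial derivatives, under the assumption that Z is stored as an array indexed by rows y and columns x):

import numpy as np

def gradient_space(Z):
    # Surface gradient (p, q) of a depth map Z(x, y), Eq. (5.10).
    q, p = np.gradient(Z)                 # axis 0 varies along y, axis 1 along x
    return p, q

def unit_normals(p, q):
    # Unit normal map from the gradient space, Eq. (5.12).
    norm = np.sqrt(1.0 + p**2 + q**2)
    return np.stack([p, q, np.ones_like(p)], axis=-1) / norm[..., None]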

5.3 The Fundamental Relationship of Shape from Shading for Diffuse Reflectance
Knowing the reflectance map R(A, θi) and the orientation of the source (ps, qs), it is possible to deal with the problem of Shape from Shading, that is, to calculate (p, q) from the pixel intensity at position (x, y).
links the luminous intensity E(x, y) in the image plane with the orientation of the
surface expressed in the gradient space ( p, q). This relation must be expressed in
the reference system of the image plane (viewer centered) with respect to which the
normal vector n is given by ( p, q, 1) and the vector s indicating the direction of the
source is given by ( ps , qs , 1). Recalling that the cosine of the angle formed by two
vectors is given by the inner product of the two vectors divided by the product of their lengths, in the case of the two vectors corresponding to the normal n and
to the direction of the source s, the cosine of the angle of light incident θn , according
to (5.12), results in the following:
cos θn = n · s = [(−p, −q, 1)/√(1 + p² + q²)] · [(−ps, −qs, 1)/√(1 + ps² + qs²)] = (1 + ps p + qs q) / (√(1 + p² + q²) √(1 + ps² + qs²))     (5.14)

The reflectance map (i.e., the reflected radiance L e (A, θn ) of the Lambertian surface)
expressed by Eq. (5.4) can be rewritten with the coordinates p and q of the gradient
space and by replacing the (5.14) becomes
R(A, θn) = (ρ/π) Li (1 + ps p + qs q) / (√(1 + p² + q²) √(1 + ps² + qs²))     (5.15)

This equation represents the starting point for applying the Shape from Shading, to
the following conditions:

1. The geometric model of image formation is orthographic;
2. The reflectance model is Lambertian (diffuse reflectance);

3. The optical system has a negligible impact and the visible surface of the objects
is illuminated directly by the source with incident radiant flow L i ;
4. The optical axis coincides with the Z -axis of the vision system and the visible
surface Z (x, y) is described for each point in terms of orientation of the normal
in the gradient space ( p, q);
5. Local point source very far from the scene.

The second member of Eq. (5.15), which expresses the radiance L(A, θn ) of the
surface, is called the reflectance map R( p, q) which when rewritten becomes
R(p, q) = (ρ/π) Li (1 + ps p + qs q) / (√(1 + p² + q²) √(1 + ps² + qs²))     (5.16)

For a given type of material and a defined type of illumination, the reflectance can be calculated for all possible orientations p and q of the surface to produce the reflectance map R(p, q), whose values can be normalized (maximum value 1) to be invariant with respect to the variability of the acquisition conditions. Finally, admitting the invariance of the radiance of the visible surface,
namely that the radiance L(A), expressed by (5.15), is equal to the irradiance of
the image E(x, y), as already expressed by Eq. (5.5), we finally get the following
equation of image irradiance:

E(x, y) = Rl,s ( p(x, y), q(x, y)) (5.17)

where l and s indicate the Lambertian reflectance model and the source direction,
respectively. This equation tells us that the irradiance (or luminous intensity) in the
image plane in the location (x, y) is equal to the value of the reflectance map R( p, q)
corresponding to the orientation ( p, q) of the surface of the scene. If the reflectance
map is known (computable with the 5.16), for a given position of the source, the
reconstruction of the visible surface z = Z (x, y) is possible in terms of orientation
(p, q) of the same, for each point (x, y) of the image. Let us remember that the orientation of the surface is given by the gradient (p = ∂Z/∂x, q = ∂Z/∂y).
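A sketch of how the Lambertian reflectance map of Eq. (5.16) can be tabulated over the gradient space for a given source direction (ps, qs) (the function name, the grid sampling, and the normalization ρ Li/π = 1 are our own choices):

import numpy as np

def reflectance_map(p, q, ps, qs):
    # Normalized Lambertian reflectance map R(p, q), Eq. (5.16), for a source (ps, qs);
    # orientations that would give a negative cosine (self-shadowed) are set to 0.
    num = 1.0 + ps * p + qs * q
    den = np.sqrt(1.0 + p**2 + q**2) * np.sqrt(1.0 + ps**2 + qs**2)
    return np.clip(num / den, 0.0, None)

# Tabulate R over a grid of orientations, e.g., to draw iso-brightness curves
p, q = np.meshgrid(np.linspace(-2, 2, 401), np.linspace(-2, 2, 401))
R = reflectance_map(p, q, ps=0.7, qs=0.3)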
Figure 5.4 shows two examples of reflectance maps of a sphere, represented graphically by iso-brightness curves, under the Lambertian reflectance model, with a point source illuminating in the directions (ps = 0.7, qs = 0.3) and (ps = 0.0, qs = 0.0), respectively. In the latter condition, the incident light is adjacent to the observer, that is, source and observer see the visible surface from the same direction.
Therefore, the alignment of the normal n with the source vector s implies that θn = 0° with cos θn = 1 and consequently, by (5.4), the reflectance takes its maximum value R(p, q) = (ρ/π) Li. When the two vectors are orthogonal, the surface in A is not illuminated, having θn = π/2 and cos θn = 0, with reflectance value R(p, q) = 0.
The iso-brightness curves represent the set of points that in the gradient space
have the different orientations ( p, q) but derive from points that in the image plane
(x, y) have the same brightness. In the two previous figures, it can be observed how
these curves are different in the two reflectance maps for having simply changed the


Fig. 5.4 Examples of reflectance maps. a Lambertian spherical surface illuminated by a source
placed in the position ( ps , qs ) = (0.7, 0.3); b iso-brightness curves (according to (5.14) where
R( p, q) = constant, that is, points of the gradient space that have the same brightness) calculated
for the source of the example of figure (a) which is the general case. It is observed that for R( p, q) =
0 the corresponding curve is reduced to a line, for R( ps , qs ) = 1 the curve is reduced to a point
where there is the maximum brightness (levels of normalized brightness between 0 and 1), while in
the intermediate brightness values the curves are elliptical, parabolic, and then hyperbolic until they
become asymptotically a straight line (zero brightness). c Lambertian spherical surface illuminated
by a source placed at the position ( ps , qs ) = (0, 0) that is at the top in the same direction as the
observer. In this case, the iso-brightness curves are concentric circles with respect to the point
of maximum brightness (corresponding to the origin (0, 0) of the gradient space). Considering
the image irradiance Eq. (5.16), for the different values of the intensity Ei, the equations of the iso-brightness circumferences are given by p² + q² = (1/Ei − 1), by virtue of (5.14) for (ps, qs) = (0, 0)

direction of illumination, oblique to the observer in a case and coinciding with the
direction of the observer in the other case. In particular, two points in the gradient
space that lie on the same curve indicate two different orientations of the visible
surface that reflect the same amount of light and are, therefore, perceived with the
same brightness by the observer even if the local surface is oriented differently in the
space. The iso-brightness curves suggest the non-linearity of Eq. (5.16), which links
the radiance to the orientation of the surface.
It should be noted that using a single image I (x, y), despite knowing the direction
of the source ( ps , qs ) and the reflectance model Rl,s ( p, q), the SfS problem is not
solved because with (5.17) it is not possible to calculate the orientation ( p, q) of each
surface element in a unique way. In fact, with a single image each pixel has only
one value, the luminous intensity E(x, y) while, the orientation of the corresponding
patch is defined by the two components p and q of the gradient according to Eq. (5.17)
(we have only one equation and two unknowns p and q). Figure 5.4a and b shows
how a single pixel intensity E(x, y) corresponds to different orientations (p, q) belonging to the same iso-brightness curve in the reflectance map. In the following
paragraphs, some solutions to the problem of SfS are reported.

5.4 Shape from Shading-SfS Algorithms

Several researchers [2,3] have proposed solutions to the SfS problem inspired by
the visual perception of the 3D surface. In particular, in [3] it is reported that the
brain retrieves form information not only from shading, but also from contours,
elementary features, and visual knowledge of objects. The SfS algorithms developed
in computer vision use ad hoc solutions very different from those hypothesized by
human vision. Indeed, the proposed solutions use minimization methods [4] of an energy function, propagation methods [1] which extend the shape information from a set of surface points to the whole image, and local methods [5] which derive the shape from the luminous intensity assuming a locally spherical surface.
Let us now return to the fundamental equation of Shape from Shading, Eq. (5.17),
by which we propose to reconstruct the orientation ( p, q) of each visible surface
element (i.e., calculate the orientation map of the visible surface) known the irradi-
ance E(x, y), that is, the energy reemitted at each point of the image in relation to
the reflectance model. We are faced with the situation of an ill-conditioned problem
since we have for each pixel E(x, y) a single equation with two unknowns p and q.
This problem can be solved by imposing additional constraints, for example, the condition that the visible surface is continuous, in the sense that it has a minimal geometric variability (smooth surface). This constraint implies that, in the gradient space, p and
q vary little locally, in the sense that nearby points in the image plane presumably
represent orientations whose positions in the gradient space are also very close to
each other. From this, it follows that the conditions of continuity of the visible surface
will be violated where only great geometric variations occur, which normally occur
at the edges and contours.
A strategy to solve the image irradiance Eq. (5.17) consists in not seeking an exact solution but in defining a function to be minimized that includes a term representing the error of the image irradiance equation, eI, and a term that controls the constraint of geometric continuity, ec.
The first term e I is given by the difference between the irradiance of the image
E(x, y) and the reflectance function R(p, q):

eI = ∬ [E(x, y) − R(p, q)]² dx dy     (5.18)

The second term ec , based on the constraint of the geometric continuity of the surface
is derived from the condition that the directional gradients p and q vary very slowly
(and to a greater extent their partial derivatives), respectively, with respect to the
direction of the x and y. The error ec due to geometric continuity is then defined by
minimizing the integral of the sum of the squares of such partial derivatives, and is
given by

ec = ∬ (px² + py² + qx² + qy²) dx dy     (5.19)

The total error function eT to be minimized, which includes the two terms of previous
errors, is given by combining these errors as follows:

eT = eI + λ ec = ∬ { [E(x, y) − R(p, q)]² + λ (px² + py² + qx² + qy²) } dx dy     (5.20)

where λ is the positive parameter that weighs the influence of the geometric continuity
error ec with respect to that of the image irradiance. A possible solution to minimize
the function of the total error is given with the variational calculation which through
an iterative process determines the minimum acceptable solution of the error. It
should be noted that the function to be minimized (5.20) depends on the surface
orientations p(x, y) and q(x, y) which are dependent on the variables x and y of the
image plane.
Recall from (5.5) that the irradiance E(x, y) can be represented by the digital
image I (i, j), where i and j are the row and column coordinates, respectively, to
locate a pixel at location (i, j) containing the observed light intensity, coming from
the surface element, whose orientation is denoted in the gradient space with pi j and
qi j . Having defined these new symbols for the digital image, the procedure to solve
the problem of the Shape from Shading, based on the minimization method, is the
following:

1. Orientation initialization. For each pixel I(i, j), initialize the orientations pij⁰ and qij⁰.
2. Equivalence constraint between image irradiance and reflectance map. The
luminous intensity I (i, j) for each pixel must be very similar to that produced
by the reflectance map derived analytically in the conditions of Lambertianity
or evaluated experimentally knowing the optical properties of the surface and
the orientation of the source.
3. Constraint of geometric continuity. Calculation of partial derivatives of the
reflectance map (∂R/∂p, ∂R/∂q), analytically, when R(pij, qij) is Lambertian (by virtue of Eqs. 5.14 and 5.17), or estimated numerically from the reflectance map
obtained experimentally.
4. Calculation for iterations of the gradient estimation (p, q). Iterative process
based on the Lagrange multiplier method which minimizes the total error eT ,
defined with (5.20), through the following update rules that find pi j and qi j to
reconstruct the unknown surface z = Z (x, y):
pijⁿ⁺¹ = p̄ijⁿ + λ (I(i, j) − R(p̄ijⁿ, q̄ijⁿ)) ∂R/∂p
qijⁿ⁺¹ = q̄ijⁿ + λ (I(i, j) − R(p̄ijⁿ, q̄ijⁿ)) ∂R/∂q     (5.21)


Fig. 5.5 Examples of orientation maps calculated from images acquired with Lambertian illumi-
nation with source placed at the position ( ps , qs ) = (0, 0). a Map of the orientation of the spherical
surface of Fig. 5.4c. b Image of real objects with flat and curved surfaces. c Orientation map relative
to the image of figure b

where p̄ij and q̄ij denote the mean values of pij and qij calculated over the four pixels neighboring the location (i, j) in the image plane, with the following equations:
p̄ij = (pi+1,j + pi−1,j + pi,j+1 + pi,j−1) / 4
q̄ij = (qi+1,j + qi−1,j + qi,j+1 + qi,j−1) / 4     (5.22)

The iterative process (step 4) continuously updates p and q until the total error eT
reaches a reasonable minimum value after several iterations, or stabilizes. It should
be noted that, although in the iterative process the estimates of p and q are evaluated
locally, a global consistency of surface orientation is realized with the propagation
of constraints (2) and (3), after many iterations.
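A compact sketch of the iterative scheme of Eqs. (5.21)–(5.22) is given below, assuming the normalized Lambertian reflectance map of Eq. (5.16) with known source direction (ps, qs); the boundary handling (periodic, via np.roll), the fixed number of iterations, and the value of λ are simplifying choices of ours:

import numpy as np

def sfs_iterative(I, ps, qs, lam=0.1, n_iter=500):
    # Minimization-based Shape from Shading, update rules (5.21)-(5.22).
    # I: image normalized in [0, 1]; returns the gradient maps (p, q).
    p = np.zeros_like(I, dtype=float)
    q = np.zeros_like(I, dtype=float)
    den_s = np.sqrt(1.0 + ps**2 + qs**2)
    for _ in range(n_iter):
        # local means over the 4-neighborhood, Eq. (5.22) (periodic borders)
        p_bar = 0.25 * (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
                        np.roll(p, 1, 1) + np.roll(p, -1, 1))
        q_bar = 0.25 * (np.roll(q, 1, 0) + np.roll(q, -1, 0) +
                        np.roll(q, 1, 1) + np.roll(q, -1, 1))
        # Lambertian reflectance map R(p_bar, q_bar) and its derivatives, Eq. (5.16)
        sq = 1.0 + p_bar**2 + q_bar**2
        den = np.sqrt(sq) * den_s
        R = (1.0 + ps * p_bar + qs * q_bar) / den
        dR_dp = ps / den - p_bar * R / sq
        dR_dq = qs / den - q_bar * R / sq
        # update rules, Eq. (5.21)
        p = p_bar + lam * (I - R) * dR_dp
        q = q_bar + lam * (I - R) * dR_dq
    return p, q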
Other minimization procedures can be considered to improve the convergence of
the iterative process. The procedure described above to solve the problem of Shape
from Shading is very simple but presents many difficulties when trying to apply it
concretely in real cases. This is due to the imperfect knowledge of the reflectance
characteristics of the materials and the difficulties in controlling the lighting condi-
tions of the scene.
Figure 5.5 shows two examples of maps of the orientations calculated from
images acquired with Lambertian illumination with the source placed at the position
(ps, qs) = (0, 0). Figure 5.5a refers to the sphere of Fig. 5.4c, while Fig. 5.5c refers to a more complex real scene, constituted by overlapping opaque objects whose dominant surfaces (cylindrical and spherical) present a good geometric continuity, except for the contour zones and the shadow areas (i.e., shading that contributes to an error in reconstructing the shape of the surface) where the intensity levels vary abruptly. Even if the reconstruction of the orientation of the surface does
not appear perfect in every pixel of the image, overall, the results of the algorithm of

Shape from Shading are acceptable for the purposes of the perception of the shape
of the visible surface.
For better visual effect of the results of the algorithm, Fig. 5.5 graphically displays
the gradient information ( p, q) in terms of the orientation of the normals (represent-
ing the orientation of a surface element) with respect to the observer, seen as oriented
segments (perceived as needles oriented). In the literature, such orientation maps are
also called needle map.
The orientation map together with the depth map (when known and obtained
by other methods, for example, with stereo vision) becomes essential for the 3D
reconstruction of the visible surface, for example, by combining together, through
an interpolation process, the depth and orientation information (a problem that we
will describe in the following paragraphs).

5.4.1 Shape from Stereo Photometry with Calibration

With the stereo photometry, we want to recover the 3D surface of the objects observed
with the orientation map obtained through different images acquired from the same
point of view, but illuminated by known light sources with known direction.
We have already seen above that it is not possible to derive the 3D shape of a surface
using a single image with the SfS approach since there are an indefinite number of
orientations that can be associated with the same intensity value (see Fig. 5.4a and b).
Moreover, remembering (5.17), the intensity E(x, y) of a pixel has only one degree
of freedom while the orientation of the surface has two p and q. Therefore, additional
information is needed for calculating the orientation of a surface element.
One solution is given by the stereo photometry approach [6] which calculates the
( p, q) orientation of a patch using different images of the same scene, acquired
from the same point of view but illuminating the scene from different directions as
shown in Fig. 5.6.
The figure shows three different positions of the light source with the same obser-
vation point, subsequently acquiring images of the scene with different shading. For
each image acquisition, only one lamp is on. The different lighting directions lead to
different reflectance maps. Now let’s see how the stereo photometry approach solves
the problem of the poor conditioning of the SfS approach.
Figure 5.7a shows two superimposed reflectance maps obtained as expected by
stereo photometry. For clarity, only the iso-brightness curve 0.4 (in red) is super-
imposed relative to the second reflectance map R2 . We know from (5.14) that the
Lambertian reflectance function is not linear as shown in the figure with the iso-
brightness curves. The latter represent the different ( p, q) orientations of the surface
with the same luminous intensity I (i, j) related to each other by (5.17) that we
rewrite the following:

I (i, j) = E(x, y) = Rl,s ( p, q) (5.23)




Fig. 5.6 Acquisition system for the stereo photometry approach. In this experimental setup, three lamps are used, positioned at the same height and arranged 120° apart on the base of an inverted cone, with the base of the objects to be acquired placed at the apex of the cone

The intensity of the I (i, j) pixels, for the images acquired with stereo photometry,
varies for the different local orientation of the surface and for the different orientation
of the sources (see Fig. 5.6). Therefore, if we consider two images of the stereo pho-
tometry I1 (i, j) and I2 (i, j), according to the Lambertian reflectance model (5.14),
we will have two different reflectance maps R1 and R2 associated with the two dif-
ferent orientations s1 and s2 of the sources. Therefore, applying (5.23) to the two images, we will have

I1 (i, j) = R1l,s1 ( p, q) I2 (i, j) = R2l,s2 ( p, q) (5.24)

where l indicates the Lambertian reflectance model.


The orientation of each element of the visible surface, instead, is not modified with respect to the observer. Changing the position of the source in each acquisition
changes the angle between the vectors s and n (which, respectively, indicate the
orientation of the source and that of the normal to the patch) and this leads to a
different reflectance map according to (5.15). It follows that the acquisition system
(the observer remains stationary while acquiring the images I1 and I2 ) subsequently
sees the same surface element with orientation ( p, q) but with two different values
of luminous intensity, respectively, I1 (i, j) and I2 (i, j) due only to different lighting
conditions.
In the gradient space, the two reflectance maps R1l,s1 ( p, q) and R2l,s2 ( p, q),
given by Eq. (5.24), that establish a relationship between the pair of intensity values
of the pixels (i, j) and orientation of the corresponding surface element, can be super-
imposed. In this gradient space (see Fig. 5.7a), with the overlap of the two reflectance


Fig. 5.7 Principle of stereo photometry. The orientation of a surface element is determined through
multiple reflectance maps obtained from images acquired with different orientations of the lighting
source (assuming Lambertian reflectance). a Considering two images I1 and I2 of stereo photom-
etry, they are associated with the corresponding reflectance maps R1 ( p, q) and R2 ( p, q) which
establish a relationship between the pair of intensity values of the pixels (i, j) and orientation of the
corresponding surface element. In the gradient space where the reflectance maps are superimposed,
the orientation ( p, q) associated with the pixel (i, j) can be determined by the intersection of the two
iso-brightness curves which in the corresponding maps represent the respective intensities I1 (i, j)
and I2 (i, j). Two curves can intersect in one or two points thus generating two possible orienta-
tions. b To obtain a single value ( p(i, j), q(i, j)) of the orientation of the patch, it is necessary to
acquire at least another image I3 (i, j) of photometric stereo, superimpose in the gradient space the
corresponding reflectance map R3 ( p, q). A unique orientation ( p(i, j), q(i, j)) is obtained with
the intersection of the third curve corresponding to the value of the intensity I3 (i, j) in the map R3

maps, the orientation ( p, q) associated with the pixel (i, j) can be determined by
the intersection of the two iso-brightness curves which in the corresponding maps
represent the respective intensities I1 (i, j) and I2 (i, j).
The figure graphically shows this situation where, for simplicity, several iso-
brightness curves for the normalized intensity values between 0 and 1 have been
plotted only for the map R1l,s1 ( p, q), while for the reflectance map R2l,s2 ( p, q) is
plotted the curve associated to the luminous intensity I2 (i, j) = 0.4 relative to the
same pixel (i, j). Two curves can intersect in one or two points thus generating two
possible orientations for the same pixel (i, j) (due to the non-linearity of Eq. 5.15).
The figure shows two gradient points P( p1 , q1 ) and Q( p2 , q2 ), the intersection of
the two iso-brightness curves corresponding to the intensities I1 (i, j) = 0.9 and
I2 (i, j) = 0.4, candidates as a possible orientation of the surface corresponding to
the pixel (i, j).

To obtain a single value ( p(i, j), q(i, j)) of the orientation of the patch, it is
necessary to acquire at least another image I3 (i, j) of stereo photometry always
from the same point of observation but with a different orientation s3 of the source.
This involves the calculation of a third reflectance map R3l,s3 ( p, q) obtaining a third
image irradiance equation:

I3 (i, j) = R3l,s3 ( p, q) (5.25)

The superposition in the gradient space of the corresponding reflectance map


R3l,s3 ( p, q) leads to a unique orientation ( p(i, j), q(i, j)) with the intersection of
the third curve corresponding to the value of the intensity I3 (i, j) = 0.5 in the
map R3 . Figure 5.7b shows the gradient space that also includes the iso-brightness
curves corresponding to the third image I3 (i, j) and the gradient point solution is
highlighted, i.e., the one resulting from the intersection of the three curves of iso-
brightness associated with the set of intensity values (I1 , I2 , I3 ) detected for the pixel
(i, j). This method of stereo photometry directly estimates, as an alternative to the
previous procedure of SfS, the orientation of the surface by observing the scene
from the same position but acquiring at least three images in three different lighting
conditions.
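The intersection of iso-brightness curves illustrated in Fig. 5.7 can be mimicked numerically by a brute-force search over the gradient space, as in the sketch below (an illustration of the idea, not the method of the text; intensities are assumed normalized so that ρ Li/π = 1):

import numpy as np

def orientation_from_triplet(I_vals, sources, span=3.0, steps=601):
    # Find the (p, q) whose predicted Lambertian reflectances, Eq. (5.16),
    # best match the measured triplet I_vals = (I1, I2, I3) for the three
    # known source directions [(ps1, qs1), (ps2, qs2), (ps3, qs3)].
    grid = np.linspace(-span, span, steps)
    p, q = np.meshgrid(grid, grid)
    err = np.zeros_like(p)
    for I_k, (ps, qs) in zip(I_vals, sources):
        R = (1.0 + ps * p + qs * q) / (np.sqrt(1.0 + p**2 + q**2) *
                                       np.sqrt(1.0 + ps**2 + qs**2))
        err += (I_k - np.clip(R, 0.0, None)) ** 2
    idx = np.unravel_index(np.argmin(err), err.shape)
    return p[idx], q[idx]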
The concept of Stereo Photometry derives from the fact of using several different
positions of the light source (and only one observation point) in analogy to the stereo
vision that instead observes the scene from different points but in the same lighting
conditions. Before generalizing the procedure of stereo photometry, we remember
the irradiance equation of the image in the Lambertian context (5.17) made explicit
in terms of the reflectance map from (5.16) which also includes the factor that takes
into account the reflectivity characteristics of the various materials. The rewritten
equation is

I(i, j) = R(p, q) = (ρ(i, j)/π) Li (1 + ps p + qs q) / (√(1 + p² + q²) √(1 + ps² + qs²))     (5.26)

where the term ρ, which varies between 0 and 1, indicates the albedo, i.e., the
coefficient that controls in each pixel (i, j) the reflecting power of a Lambertian
surface for various types of materials.
In particular, the albedo takes into account, in relation to the type of material, how
much of the incident luminous energy is reflected toward the observer.4
Now, let’s always consider the Lambertian reflectance model where a source with
diffuse lighting is assumed. In these conditions, if L S is the radiance of the source,

4 From a theoretical point of view, the quantification of the albedo is simple. It would be enough to measure the reflected and the incident radiation of a body with an instrument and take the ratio of the two measurements. In physical reality, the measurement of the albedo is complex for various reasons:

1. The incident radiation does not come from a single source but normally comes from different directions;
2. The energy reflected by a body is never unidirectional but multidirectional, the reflected energy is not uniform in all directions, and a portion of the incident energy can be absorbed by the body itself;
3. The measured reflected energy is only partial due to the angular aperture limits of the detector sensor.

Therefore, the reflectance measurements are to be considered as samples of the BRDF function. The albedo is considered as a coefficient of global average reflectivity of a body. With the BRDF function, it is possible instead to model the directional distribution of the energy reflected by a body, associated with a solid angle.

the radiance received from a patch A of the surface, that is, its irradiance is given by
I A = π L S (considering the visible hemisphere).
Considering the Lambertian model of surface reflectance (BRDFl = 1/π), the brightness of the patch, i.e., its radiance LA, is given by

LA = BRDFl · IA = (1/π) π LS = LS
Therefore, a Lambertian surface emits the same radiance as the source, and its luminosity does not depend on the observation point but may vary from point to point through the multiplicative factor of the albedo ρ(i, j).
In the Lambertian reflectance conditions, (5.26) suggests that the variables become
the orientation of the surface ( p, q) and the albedo ρ. The unit vector of the normal
n to the surface is given by Eq. (5.12) while the orientation vector of a source is
known, expressed by s = (S X , SY , S Z ). The mathematical formalism of the stereo
photometry involves the application of the irradiance Eq. (5.26) for the 3 images
Ik (i, j), k = 1, 2, 3 acquired subsequently for the 3 different orientations Sk =
(Sk,X , Sk,Y , Sk,Z ), k = 1, 2, 3 of the Lambertian diffuse light sources.
Assuming the albedo ρ constant and remembering Eq. (5.14) of the reflectance
map for a Lambertian surface illuminated by a generic orientation sk , the image
irradiance equation Ik (i, j) becomes
Ik(i, j) = (Li/π) ρ(i, j) (sk · n)(i, j) = (Li/π) ρ(i, j) cos θk     k = 1, 2, 3     (5.27)
where θk represents the angle between the source vector sk and the normal vector n
of the surface. Equation (5.27) reiterates that the incident radiance Li, generated by a source in the context of diffuse (Lambertian) light, is modulated by the cosine of the angle formed between the incident light vector and the unit normal vector of the surface, i.e., the light intensity that reaches the observer is proportional to the inner product of these two unit vectors, assuming the albedo ρ constant.


Equation (5.27) of stereo photometry can be expressed in matrix terms, writing the system of the three equations as follows:

⎡ I1(i, j) ⎤                    ⎡ S11  S12  S13 ⎤ ⎡ nX(i, j) ⎤
⎢ I2(i, j) ⎥ = (Li/π) ρ(i, j) · ⎢ S21  S22  S23 ⎥ ⎢ nY(i, j) ⎥     (5.28)
⎣ I3(i, j) ⎦                    ⎣ S31  S32  S33 ⎦ ⎣ nZ(i, j) ⎦

and in compact form we have


I = (Li/π) ρ(i, j) S · n     (5.29)
So we have a system of linear equations to calculate, in each pixel (i, j) from the
images Ik (i, j) k = 1, 2, 3 (the triad of measurements of stereo photometry), the
orientation of the unit vector n(i, j) normal to the surface, knowing the matrix S
which includes the three known directions of the sources (once configured the acqui-
sition system based on stereo photometry) and assuming a constant albedo ρ(i, j) at
any point on the surface. Solving with respect to the normal n(i, j) from (5.29), we
obtain the following system of linear equations:
n = (π/(ρ Li)) S⁻¹ I     ⟹     ρ n = (π/Li) S⁻¹ I     (5.30)
Recall that the solution of the linear system (5.30) exists only if S is nonsingular, that is, the source vectors sk are linearly independent (not coplanar) and the inverse matrix S⁻¹ exists. It is also assumed that the incident radiant flux of each source has a constant intensity Li.
In summary, the stereo photometry approach, with at least three images captured
with three different source directions, can compute for each (i, j) pixel, with (5.30)
the surface normal vector n = (n x , n y , n z )T , estimated with the expression S−1 I.
Recalling (5.10), the gradient components ( p, q) of the surface are then calculated
as follows:

p = −nx/nz     q = −ny/nz     (5.31)

and the unit normal vector of the resulting surface is n = (− p, −q, 1)T . Finally, we
can also calculate the albedo ρ with Eq. (5.30) and considering that the normal n is
a unit vector, we have
ρ = (π/Li) |S⁻¹ I|     ⟹     ρ = √(nx² + ny² + nz²)     (5.32)
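A minimal sketch of the calibrated three-image case of Eqs. (5.30)–(5.32); S stacks the three known unit source directions by row, I1, I2, I3 are the three gray-level images, and all variable names are ours (pixels with zero albedo or in shadow are not handled):

import numpy as np

def photometric_stereo_3(I1, I2, I3, S, L_i=np.pi):
    # Calibrated stereo photometry with 3 images: albedo and gradient maps,
    # Eqs. (5.30)-(5.32). S is the 3x3 matrix of source directions (by row).
    I = np.stack([I1, I2, I3], axis=-1)[..., None]          # (H, W, 3, 1)
    b = (np.pi / L_i) * (np.linalg.inv(S) @ I)[..., 0]      # rho*n, Eq. (5.30)
    rho = np.linalg.norm(b, axis=-1)                        # albedo, Eq. (5.32)
    n = b / np.maximum(rho[..., None], 1e-12)               # unit normals
    p = -n[..., 0] / n[..., 2]                              # Eq. (5.31)
    q = -n[..., 1] / n[..., 2]
    return rho, p, q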
If we have more than 3 stereo photometric images, we have an overdetermined system, with the source matrix S of size m × 3, m > 3. In this case, the system of Eq. (5.28) in matrix terms becomes
I = (1/π) S · b     (5.33)

(with I of size m × 1, S of size m × 3, and b of size 3 × 1)

where, to simplify, b(i, j) = ρ(i, j)n(i, j) indicates the unit normal vector scaled by the albedo ρ(i, j) for each pixel of the image and, similarly, the source direction vectors sk are scaled (sk Li) by the intensity factor of the incident radiance Li (we assume that the sources have identical radiant intensity).
lation of normals with the overdetermined system (5.33) can be done with the least
squares method which finds a solution of b which minimizes the 2-norm squared of
the residual r :
min |r|₂² = min |I − Sb|₂²

By developing and calculating the gradient ∇b (r 2 ) and setting it to zero,5 we get

b = (SᵀS)⁻¹ Sᵀ I     (5.34)

The similar Eqs. (5.33) and (5.34) are known as normal equations which, when
solved, lead to the solution of the least squares problem if the matrix S has rank
equal to the number of columns (3 in this case) and the problem is well conditioned.6
Once the system is resolved, the normals and the albedo for each pixel are calculated
with analogous Eqs. (5.31) and (5.32) replacing (bx , b y , bz ) instead of (n x , n y , n z )
as follows:

ρ = π |(SᵀS)⁻¹ Sᵀ I|     ⟹     ρ = √(bx² + by² + bz²)     (5.35)
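For m > 3 images, the normal equations (5.34) can be solved per pixel with a standard least-squares routine, as in this sketch (assuming identical source intensities and a source matrix S of size m × 3 already scaled as in Eq. (5.33); names are illustrative):

import numpy as np

def photometric_stereo_lsq(images, S):
    # Overdetermined stereo photometry, Eq. (5.34): b = (S^T S)^-1 S^T I per pixel.
    # images: list of m gray-level arrays (H, W); S: (m, 3) source direction matrix.
    H, W = images[0].shape
    I = np.stack([im.ravel() for im in images], axis=0)      # (m, H*W)
    b, *_ = np.linalg.lstsq(S, I, rcond=None)                # (3, H*W)
    rho = np.pi * np.linalg.norm(b, axis=0).reshape(H, W)    # albedo, Eq. (5.35)
    return b.reshape(3, H, W), rho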

In the case of color images, for a given kth source there are three image irradiance equations for every pixel, one for each RGB color component, each characterized by the albedo ρc:

Ik,c = ρc (sk,c · n)(i, j) (5.36)

where the subscript c indicates the color component R, G, and B. Consequently, we would have a system of Eq. (5.34) for each color component:

bc = ρc n = (SᵀS)⁻¹ Sᵀ Ic     (5.37)

and once calculated bc we get, as before, with (5.32) the value of the albedo ρc
relative to the color component considered. Normally, the surface of the objects is

5 In fact, we have

r(b)² = |I − Sb|₂² = (I − Sb)ᵀ(I − Sb) = bᵀSᵀSb − 2bᵀSᵀI + IᵀI

and the gradient ∇b(r²) = 2SᵀSb − 2SᵀI which, set to zero and solved with respect to b, produces (5.34) as the minimum of the residual.
6 We recall that a problem is well conditioned if small perturbations of the measurements (the images, in this context) generate small variations of the same order in the quantities to be calculated (the orientations of the normals, in this context).

reconstructed using gray-level images and it is hardly interesting to work with RGB
images for the reconstruction in color of the surface.

5.4.2 Uncalibrated Stereo Photometry

Several research activities have demonstrated the use of stereo photometry to extract
the map of the normals of objects even in the absence of knowledge of the lighting
conditions (without calibration of the light sources or not knowing their orientation
sk and intensity L i ) and without knowing the real reflectance model of the same
objects. Several results based on the Lambertian reflectance model are reported in
the literature. The reconstruction of the observed surface is feasible, albeit still with
ambiguity, starting from the map of normals. It also assumes the acquisition of images
with orthographic projection and with linear response of the sensor.
In this situation, uncalibrated stereo photometry implies that in the system of
Eq. (5.33) the unknowns are the source direction matrix S and the vector of normals
given by:
b(i, j) ≡ ρ(i, j)n(i, j)

from which we recall it is possible to derive then the albedo as follows:

ρ(i, j) = |b(i, j)|₂

A solution to the problem is that of factorizing a matrix [7] which consists in its
decomposition into singular values SVD (see Sect. 2.11 Vol. II).
In this formulation, it is useful to consider the I = (I1 , I2 , . . . , Im ) matrix of
m images, of size m × k (with m ≥ 3), which organizes the pixels of the image
Ii, i = 1, ..., m, in the ith row of the global image matrix I, each row of dimension k equal to the total number of pixels per image (organization of pixels in lexicographic
order). Therefore, the matrix Eq. (5.33), to arrange the factoring of the intensity
matrix, is reformulated as follows:

I = S · B     (5.38)

(with I of size m × k, S of size m × 3, and B of size 3 × k)

where S is the unknown matrix of the sources, whose direction for the ith image is stored in the ith row (six, siy, siz), while the matrix of normals B organizes by column the scaled normal components (bjx, bjy, bjz) for the jth pixel. With this formalism, the irradiance equation under Lambertian conditions for the jth pixel of the ith image
results in the following:
Ii,j = Si · Bj     (5.39)

(where Ii,j is a scalar, Si has size 1 × 3, and Bj has size 3 × 1)

For any matrix of size m × k there always exists its decomposition into singular
values, according to the SVD theorem, given by

I = U · Σ · Vᵀ     (5.40)

(with U of size m × m, Σ of size m × k, and V of size k × k)

where U is an orthogonal unitary matrix (UᵀU = I, the identity matrix) whose columns are the orthonormal eigenvectors of IIᵀ, V is an orthogonal unitary matrix whose columns are the orthonormal eigenvectors of IᵀI, and Σ is the diagonal matrix with positive real diagonal elements (σ1 ≥ σ2 ≥ · · · ≥ σt), with t = min(m, k), known as the singular values of I. If I has rank k, all the singular values are positive; if it has rank t < k, the singular values from the (t+1)th onward are zero.
With the SVD method, it is possible to consider only the first three singular values of the image matrix I, obtaining an approximate decomposition of rank 3 of this matrix by considering only the first 3 columns of U, the first 3 rows of Vᵀ, and the first 3 × 3 submatrix of Σ. The rank-3 approximation of the image matrix Î results in the following:

Î = U′ · Σ′ · V′ᵀ     (5.41)

(with Î of size m × k, U′ of size m × 3, Σ′ of size 3 × 3, and V′ᵀ of size 3 × k)

where the submatrices are renamed with the addition of a prime to distinguish them
from the originals. With (5.41), we get the best approximation of the original image
expressed by (5.40) which uses all the singular values of the complete decomposition.
In an ideal context, with images without noise, SVD can well represent the original
image matrix with a few singular values. With SVD it is possible to evaluate, from
the analysis of singular values, the acceptability of the approximation based on the
first three singular values (in the presence of noise the rank of Î is greater than three).
Using the first three singular values of Σ, and the corresponding columns (singular vectors) of U and V from (5.41), we can define the pseudo-matrices Ŝ and B̂, respectively, of the sources and normals, as follows:

Ŝ = U′ · √Σ′     B̂ = √Σ′ · V′ᵀ     (5.42)

(with Ŝ of size m × 3 and B̂ of size 3 × k)

and (5.38) can be rewritten as


Î = ŜB̂ (5.43)
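A sketch of the rank-3 factorization of the image matrix via SVD, Eqs. (5.41)–(5.42) (illustrative; note that numpy returns Vᵀ directly):

import numpy as np

def factorize_images(I):
    # Rank-3 SVD factorization of the (m x k) image matrix, Eqs. (5.41)-(5.42):
    # returns the pseudo sources S_hat (m x 3) and pseudo scaled normals B_hat (3 x k).
    U, sigma, Vt = np.linalg.svd(I, full_matrices=False)
    sqrt_sigma3 = np.sqrt(np.diag(sigma[:3]))   # square root of the 3x3 block of Sigma'
    S_hat = U[:, :3] @ sqrt_sigma3
    B_hat = sqrt_sigma3 @ Vt[:3, :]
    return S_hat, B_hat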

The decomposition obtained with the (5.43) is not unique. In fact, if A is an arbitrary
invertible matrix with a size of 3 × 3, we have that also the matrices ŜA−1 and AB̂
are still a valid decomposition of the approximate image matrix Î such that

∀A ∈ GL(3):   Ŝ A⁻¹ A B̂ = Ŝ (A⁻¹A) B̂ = (Ŝ A⁻¹)(A B̂) = S̄ B̄     (5.44)

where A⁻¹A is the identity.

where GL(3) is the group of all invertible matrices of size 3 × 3. In essence, with (5.44) we establish an equivalence relation in the space of the solutions T = IRm×3 × IR3×k, where IRm×3 represents the space of all possible matrices Ŝ of source directions and IR3×k the space of all possible matrices B̂ of scaled normals.
The ambiguity generated by the factorization Eq. (5.44) of Î in Ŝ and B̂ can be
managed by considering the matrix A associated with a linear transformation such
that
S̄ = ŜA−1 B̄ = AB̂ (5.45)

Equation (5.45) tells us that the two solutions (Ŝ, B̂) ∈ T and (S̄, B̄) ∈ T are equivalent if there exists a matrix A ∈ GL(3). With the SVD, through Eq. (5.42), the matrix I of the stereo photometric images selects an equivalence class T(I). This class contains the matrix S of the true source directions and the matrix B of the true scaled normals, but it is not possible to distinguish (S, B) from the other members of T(I) based only on the content of the images assembled in the image matrix I.
The matrix A can be determined with at least 6 pixels with the same or known reflectance, or by considering that the intensity of at least six sources is constant or known [7]. It is shown in [8] that, by imposing the constraint of integrability (see footnote 7), the ambiguity can be reduced to the group of Generalized Bas-Relief (GBR) transformations, which satisfy the integrability constraint. A GBR transformation maps a surface z(x, y) into a new surface ẑ(x, y) by combining a flattening operation (or a scale change) along the z-axis with the addition of a plane:

ẑ(x, y) = λz(x, y) + μx + νy (5.46)

where λ ≠ 0 and μ, ν ∈ IR are the parameters that represent the group of the GBR transformations. The matrix A which solves Eq. (5.45) is given by the matrix G associated with the group of GBR transformations, given in the form:

            | 1  0  0 |
A = G =     | 0  1  0 |        (5.47)
            | μ  ν  λ |

7 The constraint of integrability requires that the normals estimated by stereo photometry correspond to a curved surface. Recall that from the orientation map the surface z(x, y) can be reconstructed by integrating the gradient information {p(x, y), q(x, y)}, or the partial derivatives of z(x, y), along any path between two points in the image plane. For a curved surface, the constraint of integrability [9] means that the chosen path does not matter in order to obtain an approximation of ẑ(x, y). Formally, this means

∂²ẑ/∂x∂y = ∂²ẑ/∂y∂x

having already estimated the normals b̂(x, y) = (b̂_x, b̂_y, b̂_z)^T with first-order partial derivatives ∂ẑ/∂x = b̂_x(x, y)/b̂_z(x, y) and ∂ẑ/∂y = b̂_y(x, y)/b̂_z(x, y).

The GBR transformation defines the pseudo-orientations of the sources S̄ and the pseudo-normals B̄ according to Eq. (5.45), replacing the matrix A with the matrix G. Solving then with respect to Ŝ and B̂, we have

                                  | λ   0   0 |
Ŝ = S̄G        B̂ = G⁻¹B̄ = (1/λ) | 0   λ   0 | B̄        (5.48)
                                  | −μ  −ν  1 |

Thus, the problem is reduced from the 9 unknowns of the matrix A of size 3 × 3 to the 3 parameters of the matrix G associated with the GBR transformation. Furthermore, the GBR transformation has the unique property of not altering the shading configuration of a surface z(x, y) (with Lambertian reflectance) illuminated by any source s, with respect to that of the surface ẑ(x, y) obtained from the GBR transformation with G and illuminated by the source whose direction is given by ŝ = G^(−T)s. In other words, when the orientations of both surface and source are transformed with the matrix G, the shading configurations are identical in the image of the original surface and in that of the transformed surface.
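As an illustration of the factorization just described, the following minimal sketch (Python with NumPy; the function and array names are illustrative, not taken from the text) computes the rank-3 SVD approximation of the image matrix and the pseudo-matrices Ŝ and B̂ of Eq. (5.42), which remain defined only up to the GBR ambiguity discussed above.

import numpy as np

def factorize_photometric_stereo(I):
    """Rank-3 SVD factorization of the image matrix I (m x k), Eqs. (5.41)-(5.42).

    Returns the pseudo-source matrix S_hat (m x 3) and the pseudo-normal
    matrix B_hat (3 x k); they are defined only up to an unknown invertible
    3x3 matrix A (the GBR ambiguity of Eq. (5.44))."""
    U, s, Vt = np.linalg.svd(I, full_matrices=False)   # I = U diag(s) V^T
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3, :]           # keep the 3 largest singular values
    S_hat = U3 * np.sqrt(s3)             # m x 3,  U' * sqrt(Sigma')
    B_hat = np.sqrt(s3)[:, None] * Vt3   # 3 x k,  sqrt(Sigma') * V'^T
    return S_hat, B_hat

# usage: I has one image per row (pixels in lexicographic order)
# S_hat, B_hat = factorize_photometric_stereo(I)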

5.4.3 Stereo Photometry with Calibration Sphere

The reconstruction of the orientation of the visible surface can be carried out experi-
mentally using a look-up table (LUT), which associates the orientation of the normal
with a triad of luminous intensity (I1 , I2 , I3 ) measured after appropriately calibrating
the acquisition system.
Consider as a stereo photometry system the one shown in Fig. 5.6, which provides three sources arranged on the base of a cone at 120° from each other, a single acquisition post located at the center of the cone base, and the work plane placed at the apex of the inverted cone, where the objects to be acquired are located. Initially,
the system is calibrated to consider the reflectivity component of the material of the
objects and to acquire all the possible intensity triads to be stored in the LUT table
to be associated with a given orientation of the surface normal. To be in Lambertian
reflectance conditions, objects made of opaque material are chosen, for example,
objects made of PVC plastic material.
Therefore, the calibration of the system is carried out by means of a sphere of
PVC material (analogous to the material of the objects), for which the three images
I1, I2, I3 provided by stereo photometry are acquired. The images are assumed to be registered (aligned with each other), that is, during the three acquisitions the object and observer remain fixed while the corresponding sources are turned on in succession. The calibration process associates, for each pixel (i, j) of the
image, a set of luminous intensity values (I1 , I2 , I3 ) (the three measurements of
stereo photometry) as measured by the camera (single observation point) in the three
successive acquisitions while a single source is operative, and the orientation value
( p, q) of the calibration surface is derived from the knowledge of the geometric
description of the sphere.

Fig. 5.8 Stereo photometry: calibration of the acquisition system using a sphere of material with identical reflectance properties of the objects (the three sources at 120°, 240°, and 360° illuminate the calibration sphere at the surface elements R and T; the measured intensity triples index a lookup table of orientations (p, q))

In essence, the calibration sphere is chosen because it is the only solid whose visible surface exhibits all the possible orientations of a surface element in the visible hemisphere. In Fig. 5.8, two generic surface elements are indicated on the calibration sphere, centered at the points R and T, visible by the camera and projected in the image plane at the pixels (i_R, j_R) and (i_T, j_T), respectively. The orientation of the normals corresponding to these surface elements of the sphere is given by n_R(p_R, q_R) and n_T(p_T, q_T), calculated analytically knowing the parametric equation of the sphere, assumed of unit radius.
Once the projections of these points of the sphere in the image plane are known, after the acquisition of the three stereo photometry images {I1, I2, I3}_sphere, it is possible to associate with the considered surface elements R and T their normal orientation (knowing the geometry of the sphere) and the triad of luminous intensity measurements recorded by the camera. These associations are stored in the LUT of dimensions (3 × 2 × m), using as pointers the triples of luminous intensity measurements, to which the values of the two orientation components (p, q) are made to correspond. The number of triples m depends on the level of discretization of the sphere or on the resolution of the images.
In the example shown in Fig. 5.8, the associations for the superficial elements R
and T are, respectively, the following:

I1 (i R , j R ), I2 (i R , j R ), I3 (i R , j R ) =⇒ n R ( p R , q R )

I1 (i T , jT ), I2 (i T , jT ), I3 (i T , jT ) =⇒ n T ( pT , qT )

Once the system has been calibrated, the type of material with Lambertian reflectance characteristics chosen, the lamps appropriately positioned, and all the associations between orientations (p, q) and measured triples (I1, I2, I3) stored with the calibration sphere, the latter is removed from the acquisition plane and replaced with the objects to be acquired.

Fig. 5.9 Stereo photometric images of a real PVC object with the final orientation map obtained by applying the calibrated stereo photometry approach with the PVC sphere

The three stereo photometry images of the objects are acquired in the same conditions as those of the calibration sphere, that is, making sure that the image Ik is acquired while keeping only the source Sk on. For each pixel (i, j) of the image, the triad (I1, I2, I3) is used as a pointer into the LUT, this time not to store but to retrieve the orientation (p, q) to be associated with the surface element corresponding to the pixel located at (i, j).
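A minimal sketch of this calibrated lookup procedure is given below; it assumes the calibration and object images are already registered and uses a nearest-neighbour search over the intensity triples as one possible realization of the LUT (the function and variable names are illustrative, and the k-d tree is an implementation choice not prescribed by the text).

import numpy as np
from scipy.spatial import cKDTree

def build_lut(I_sphere, normals_pq, mask):
    """Builds a lookup structure from the calibration-sphere images.

    I_sphere   : array (3, H, W) with the three registered images I1, I2, I3
    normals_pq : array (H, W, 2) with the analytic gradients (p, q) of the sphere
    mask       : boolean (H, W), True where the sphere is visible."""
    triples = I_sphere[:, mask].T      # (n, 3) intensity triples
    pq = normals_pq[mask]              # (n, 2) corresponding (p, q)
    return cKDTree(triples), pq

def lookup_orientation(tree, pq, I_object):
    """For each pixel of the three object images (3, H, W), returns the (p, q)
    of the calibration pixel with the closest intensity triple (I1, I2, I3)."""
    H, W = I_object.shape[1:]
    queries = I_object.reshape(3, -1).T     # (H*W, 3)
    _, idx = tree.query(queries)            # nearest calibration triple
    return pq[idx].reshape(H, W, 2)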
Figure 5.9 shows the results of the process of stereo photometry, which builds the
orientation map (needle map) of an object of the same PVC material as the calibration
sphere. Figure 5.10 instead shows another experiment based on stereo photometry, aimed at detecting the best-placed object in a stack of similar objects (identical to the one in the previous example) for automatic gripping (known as the bin picking problem) in the context of robotic cells. In this case, once the orientation map of the stack has been obtained, it is segmented to isolate the object best placed for gripping and to estimate the attitude of the isolated object (normally the one at the top of the stack).

5.4.3.1 Calculation of the Direction of the Sources with Inverse Stereo Photometry
The stereo photometric images of the calibration sphere, with Lambertian reflectance and uniform albedo, are used to derive the direction of the three sources s_i by applying the inverse stereo photometry with Eq. (5.34). In this case, the unknowns are the vectors s_i of the source directions, while the normal directions b are known from the geometry of the sphere and from its 3 stereo photometric images. The sphere normals are calculated for different pixels in the image plane included

Fig. 5.10 Stereo photometric images of a stack of objects identical to that of the previous example (the object at the top of the stack, its orientation map, and the segmented map). In this case, with the same calibration data, the calculated orientation map is used not for the purpose of reconstructing the surface but to determine the attitude of the object higher up in the stack

in the iso-brightness curves (see Fig. 5.11). Once the direction of the sources is
estimated, the value of the albedo can be estimated according to (5.32).

5.4.4 Limitations of Stereo Photometry

The traditional stereo photometry approach is strongly conditioned by the assumption of the Lambertian reflectance model. In nature, it is difficult to find ideally Lambertian materials, and most materials normally have a specular reflectance component. Another aspect not considered by Eq. (5.28) of stereo photometry is that of the shadows present in the images. One way to mitigate these problems is to use, when possible, a number M > 3 of stereo photometric images, making the system overdetermined. This involves a greater computational load, especially for large images, considering that the calculation of the albedo and the normals must be done for each pixel.
Variants of the traditional stereo photometry approach have been proposed using
multiple sources to consider non-Lambertian surfaces. In [10,11], we find a method
based on the assumption that highly luminous spots due to specular reflectance do
not overlap in the photometric stereo images. In the literature, several works on non-
calibrated stereo photometry are reported which consider other optimization models
to estimate the direction of sources and normals in the context of the Lambertian and

non-Lambertian reflectance models, considering the presence in the stereo photometric images of the effects of specular reflectance and the problem of shadows, including those generated by mutual occlusion between the observed objects.

Fig. 5.11 Stereo photometric image of the calibration sphere with iso-intensity curves

5.4.5 Surface Reconstruction from the Orientation Map

Once the map of the orientations (unit normals) of the surface has been obtained with stereo photometry for each pixel of the image, it is possible to reconstruct the surface z = Z(x, y), that is, to determine the depth map through a data integration algorithm. In essence, this requires a transition from the gradient space (p, q) to the depth map to recover the surface.
The problem of surface reconstruction, starting from the discrete gradient space with noisy data (the surface continuity constraint imposed by stereo photometry is often violated), is an ill-posed problem. In fact, the estimated surface normals do not faithfully reproduce the local curvature (slope) of the surface itself, and several surfaces with different height values can have the same gradients. A check on the acceptability of the estimated normals can be done with the integrability test (see Note 7), which evaluates at each point the value ∂p/∂y − ∂q/∂x: it should theoretically be zero, but small values are acceptable. Once this test has been passed, the surface is reconstructed up to an additive constant of the heights and with an adequate depth error.
One approach to constructing the surface is to consider the gradient information
( p(x, y), q(x, y)) which gives the height increments between adjacent points of the
surface in the direction of the x- and y-axes. Therefore, the surface is constructed by

adding these increments starting from a point and following a generic path. In the
continuous case, by imposing the integrability constraint, integration along different
paths would lead to the same value of the estimated height for a generic point (x, y)
starting from the same initial point (x0 , y0 ) with the same arbitrary height Z 0 . This
reconstruction approach is called local integration method.
A global integration method is based on a cost function C that minimizes the quadratic error between the ideal gradient (Z_x, Z_y) and the estimated one (p, q):

C = ∫∫_Ω ( |Z_x − p|² + |Z_y − q|² ) dx dy        (5.49)


where Ω represents the domain of all the points (x, y) of the normal map N(x, y), while Z_x and Z_y are the partial derivatives of the ideal surface Z(x, y) with respect to the x- and y-axes, respectively. This function is invariant when a constant value is added to the surface height function Z(x, y). The optimization
problem posed by (5.49) can be solved with the variational approach, with the direct
discretization method or with the expansion methods. The variational approach [12]
uses the Euler–Lagrange equation as the necessary condition to reach a minimum.
The numerical solution to minimize (5.49) is realized with the conversion process
from continuous to discrete. The expansion methods instead are set by expressing
the function Z (x, y) as a linear combination of a set of basic functions.

5.4.5.1 Local Method Mediating Gradients


This local surface reconstruction method is based on the average of the gradients
between adjacent normals. From the normal map, we consider a 4-point grid and
indicate the normals and the respective surface gradients as follows:

n(x, y) = [p(x, y), q(x, y)]            n(x+1, y) = [p(x + 1, y), q(x + 1, y)]
n(x, y+1) = [p(x, y + 1), q(x, y + 1)]  n(x+1, y+1) = [p(x + 1, y + 1), q(x + 1, y + 1)]

Now let’s consider the normals of the second column of the grid (along the x-axis).
The line connecting these points z[x, y +1, Z (x, y +1)] e z[x +1, y +1, Z (x +1, y +
1)] of the surface is approximately perpendicular to the normal average between these
two points. It follows that the inner product between the vector (slope) of this line
and the average normal vector is zero. This produces the following:
1
Z (x + 1, y + 1) = Z (x, y + 1) + [ p(x, y + 1) + p(x + 1, y + 1)] (5.50)
2
Similarly, considering the adjacent points of the second row of the grid (along the
y-axis), that is, the line connecting the points z[x + 1, y, Z (x + 1, y)] and z[x +
1, y + 1, Z (x + 1, y + 1)] of the surface, we obtain the relation:
Z(x + 1, y + 1) = Z(x + 1, y) + (1/2)[q(x + 1, y) + q(x + 1, y + 1)]        (5.51)

Adding the two relations member by member and dividing by 2, we have

Z(x + 1, y + 1) = (1/2)[Z(x, y + 1) + Z(x + 1, y)]
               + (1/4)[p(x, y + 1) + p(x + 1, y + 1) + q(x + 1, y) + q(x + 1, y + 1)]        (5.52)
Essentially, (5.52) estimates the value of the surface height at the point (x + 1, y + 1) by adding, to the average of the heights of the diagonal points (x, y + 1) and (x + 1, y) of the considered grid, the height increments expressed by the gradients averaged in the directions of the x- and y-axes. Now consider a surface that is discretized with a gradient map of Nr × Nc points (Nr rows and Nc columns). Let Z(1, 1) and Z(Nr, Nc) be the initial arbitrary values of the heights at the extreme points of the gradient map; then a two-scan process of the gradient map can determine the values of the heights along the x- and y-axes, discretizing the integrability constraint in terms of forward differences, as follows:

Z(x, 1) = Z(x − 1, 1) + p(x − 1, 1)        Z(1, y) = Z(1, y − 1) + q(1, y − 1)        (5.53)
where x = 2, . . . , Nr , y = 2, . . . , Nc , and the map is scanned vertically using the
local increments defined with Eq. (5.52).
The second scan starts from the other end of the gradient map, the point (Nr , Nc ),
and calculates the values of the heights, as follows:

Z(x − 1, Nc) = Z(x, Nc) − p(x, Nc)        Z(Nr, y − 1) = Z(Nr, y) − q(Nr, y)        (5.54)
and the map is scanned horizontally with the following recursive equation:
Z(x − 1, y − 1) = (1/2)[Z(x − 1, y) + Z(x, y − 1)] − (1/4)[p(x − 1, y) + p(x, y) + q(x, y − 1) + q(x, y)]        (5.55)

The map of heights thus estimated has values influenced by the choice of the arbitrary initial values. Therefore, it is useful to perform a final step that takes the average of the values of the two scans to obtain the final map of the surface heights.
Figure 5.12 shows the height map obtained starting from the map of the normals of
the visible surface acquired with the calibrated stereo photometry.

5.4.5.2 Local Method Based on Least Squares


Another method of surface reconstruction [13] starts from the normal map and considers the gradient values (p, q) of the surface given by Eq. (5.31), which can be expressed in terms of the partial derivatives (5.10) of the height map Z(x, y) along the x- and y-axes. Such derivatives can be approximated with forward differences as follows:
p(x, y) ≈ Z(x + 1, y) − Z(x, y)
q(x, y) ≈ Z(x, y + 1) − Z(x, y)        (5.56)

Fig. 5.12 Results of the reconstruction of the surface starting from the orientation map obtained
from the calibrated stereo photometry

It follows that for each pixel (x, y) of the height map, a system of equations can be defined by combining (5.56) with the derivatives of the surface Z(x, y) represented by the gradient according to (5.31), obtaining

nz(x, y)Z(x + 1, y) − nz(x, y)Z(x, y) = −nx(x, y)
nz(x, y)Z(x, y + 1) − nz(x, y)Z(x, y) = −ny(x, y)        (5.57)

where n(x, y) = (nx(x, y), ny(x, y), nz(x, y)) is the normal vector at the point (x, y) of the normal map in 3D space. If the map includes M pixels, the complete equation system (5.57) consists of 2M equations. To improve the estimate of Z(x, y), for each pixel of the map the system (5.57) can be extended by also considering the adjacent pixels, namely the one on the left (x − 1, y) and the one above (x, y − 1) with respect to the pixel (x, y) being processed. In that case, the previous system extends to 4M equations. The system (5.57) can be solved as an overdetermined linear system. It should be noted that Eqs. (5.57) are valid for points not belonging to the edges of the objects, where the component nz → 0.
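The following sketch assembles and solves the overdetermined system (5.57) with a sparse least-squares solver; it is one possible implementation under stated assumptions (normal map as an H × W × 3 array, pixels with nz ≈ 0 skipped), not the reference algorithm of [13].

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def integrate_least_squares(normals):
    """Least-squares surface integration from a normal map, Eq. (5.57)."""
    H, W, _ = normals.shape
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    idx = lambda x, y: x * W + y              # linear index of pixel (x, y)
    A = lil_matrix((2 * H * W, H * W))
    b = []
    r = 0
    for x in range(H):
        for y in range(W):
            if abs(nz[x, y]) < 1e-6:          # skip edge pixels, nz -> 0
                continue
            if x + 1 < H:                     # nz*Z(x+1,y) - nz*Z(x,y) = -nx
                A[r, idx(x + 1, y)] = nz[x, y]
                A[r, idx(x, y)] = -nz[x, y]
                b.append(-nx[x, y]); r += 1
            if y + 1 < W:                     # nz*Z(x,y+1) - nz*Z(x,y) = -ny
                A[r, idx(x, y + 1)] = nz[x, y]
                A[r, idx(x, y)] = -nz[x, y]
                b.append(-ny[x, y]); r += 1
    A = A[:r].tocsr()
    Z = lsqr(A, np.asarray(b))[0]             # solution up to an additive constant
    return Z.reshape(H, W)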

5.4.5.3 Global Method Based on the Fourier Transform


We have previously introduced the function (5.49) as a global integration method based on a cost function C(Z) that minimizes the quadratic error between the ideal gradient (Zx, Zy) and the estimated one (p, q), in order to obtain the height map Z(x, y) starting from the orientation map (expressed in the gradient space (p, q)) derived by the SfS method. To improve the accuracy of the surface height map, an additional constraint is added to the function to be minimized (5.49) that strengthens the relationship between the height values Z(x, y) to be estimated and the gradient values (p(x, y), q(x, y)).

This additional constraint imposes the equality of the second partial derivatives Zxx = px and Zyy = qy, and the cost function to be minimized becomes

C(Z) = ∫∫_Ω [ (|Zx − p|² + |Zy − q|²) + λ0 (|Zxx − px|² + |Zyy − qy|²) ] dx dy        (5.58)


where Ω represents the domain of all the points (x, y) of the normal map N(x, y) = (p(x, y), q(x, y)) and λ0 > 0 controls the trade-off between the curvature of the surface and the variability of the acquired gradient data. The integrability constraint still remains: py = qx ⇔ Zxy = Zyx.
An additional constraint can be added with a smoothing (smoothness) term, and the new function results in the following:

C(Z) = ∫∫_Ω (|Zx − p|² + |Zy − q|²) dx dy + λ0 ∫∫_Ω (|Zxx − px|² + |Zyy − qy|²) dx dy
     + λ1 ∫∫_Ω (|Zx|² + |Zy|²) dx dy + λ2 ∫∫_Ω (|Zxx|² + 2|Zxy|² + |Zyy|²) dx dy        (5.59)

where λ1 and λ2 are two additional nonnegative parameters that control the smoothing level of the surface and of its curvature, respectively. The cost function C(Z), whose minimization estimates the unknown surface Z(x, y), can be solved using two algorithms, both based on the Fourier transform.
The first algorithm [12] is formulated as a minimization problem expressed by the function (5.49) with the integrability constraint. The proposed method uses the theory of projection onto convex sets (see footnote 8). In essence, the gradient map of the normals N(x, y) = (p(x, y), q(x, y)) is projected onto the gradient space that is integrable in the least-squares sense, then the Fourier transform is used for the optimization in the frequency domain. Consider the surface Z(x, y) represented by the basis functions φ(x, y, ω) as follows:

Z(x, y) = Σ_{ω∈Ω} K(ω) φ(x, y, ω)        (5.60)

where ω is a 2D index associated with a finite set Ω, and the coefficients K(ω) that minimize the function (5.49) can be expressed as

K(ω) = [ Px(ω) K1(ω) + Py(ω) K2(ω) ] / [ Px(ω) + Py(ω) ]        (5.61)

8 In mathematical analysis, the projection theorem, also known as the projection theorem in Hilbert spaces, which descends from convex analysis, is often used in functional analysis. It establishes that for every point x in a Hilbert space H and for each closed convex set C ⊂ H there exists a single point y ∈ C such that the distance ‖x − y‖ assumes its minimum value on C. In particular, this is true for any closed subspace M of H. In this case, a necessary and sufficient condition for y is that the vector x − y be orthogonal to M.
 
where Px(ω) = Σ_{(x,y)} |φx(x, y, ω)|² and Py(ω) = Σ_{(x,y)} |φy(x, y, ω)|². The derivatives of the Fourier basis functions φ can be expressed as follows:

φx = jωx φ φ y = jω y φ (5.62)

from which it follows that Px ∝ ωx², Py ∝ ωy², and also that K1(ω) = Kx(ω)/(jωx) and K2(ω) = Ky(ω)/(jωy). Expanding the surface Z(x, y) with the Fourier basis functions, the function (5.49) is then minimized with
K(ω) = [ −jωx Kx(ω) − jωy Ky(ω) ] / ( ωx² + ωy² )        (5.63)

where Kx(ω) and Ky(ω) are the Fourier coefficients from which the heights of the reconstructed surface are obtained. These coefficients can be calculated from the gradient maps through the following relationships:

K x (ω) = F { p(x, y)} K y (ω) = F {q(x, y)} (5.64)

where F represents the Fourier transform operator applied to gradient maps.


The result of the integration, that is, the height map associated with the normal map N(x, y), is obtained by applying the inverse Fourier transform given by

Z (x, y) = F −1 {K (ω)} (5.65)

Although the method described is based on solid theoretical foundations by virtue of the projection theorem, the surface reconstruction is sensitive to the noise present in the normal map acquired with stereo photometry. In particular, the reconstruction has errors in the discontinuity areas of the map, where the integrability constraint is violated.
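A compact sketch of this Fourier-domain integration (Eqs. (5.63)–(5.65)) is shown below; the frequency grids, the sign convention, and the handling of the zero frequency follow numpy.fft and are assumptions of this illustration, not the authors' reference implementation.

import numpy as np

def integrate_fourier(p, q):
    """Fourier-domain integration of a gradient field, Eqs. (5.63)-(5.65).

    p, q : arrays (H, W) with the estimated gradients along x (columns)
    and y (rows). Returns the height map up to an additive constant."""
    H, W = p.shape
    wx = np.fft.fftfreq(W) * 2 * np.pi          # angular frequencies along x
    wy = np.fft.fftfreq(H) * 2 * np.pi          # angular frequencies along y
    WX, WY = np.meshgrid(wx, wy)
    P = np.fft.fft2(p)                          # Kx(w) = F{p}
    Q = np.fft.fft2(q)                          # Ky(w) = F{q}
    denom = WX**2 + WY**2
    denom[0, 0] = 1.0                           # avoid division by zero at (0, 0)
    K = (-1j * WX * P - 1j * WY * Q) / denom    # Eq. (5.63)
    K[0, 0] = 0.0                               # the mean height is arbitrary
    return np.real(np.fft.ifft2(K))             # Eq. (5.65)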
The second algorithm [14], in addition to the integrability constraint, adds with the second derivatives the surface smoothing constraints, minimizing the function C(Z) given by (5.59) using the Discrete Fourier Transform (DFT). The latter, applied to the surface function Z(x, y), is defined by

Z(u, v) = (1/√(Nr·Nc)) Σ_{x=0}^{Nc−1} Σ_{y=0}^{Nr−1} Z(x, y) e^{−j2π(ux/Nc + vy/Nr)}        (5.66)

and the inverse transform is given as follows:

Z(x, y) = (1/√(Nr·Nc)) Σ_{u=0}^{Nc−1} Σ_{v=0}^{Nr−1} Z(u, v) e^{+j2π(xu/Nc + yv/Nr)}        (5.67)

where the transform is calculated for each point of the normal map ((x, y) ∈ Ω), j = √−1 is the imaginary unit, and u and v represent the frequencies in the Fourier domain. We now report the derivatives of the function Z(x, y) in the spatial and

frequency domain by virtue of the properties of the Fourier transform:

Z x (x, y) ⇔ juZ(u, v)
Z y (x, y) ⇔ jvZ(u, v)
Z x x (x, y) ⇔ −u 2 Z(u, v) (5.68)
Z yy (x, y) ⇔ −v2 Z(u, v)
Z x y (x, y) ⇔ −uvZ(u, v)

We also consider the Rayleigh theorem (see footnote 9):

Σ_{(x,y)∈Ω} |Z(x, y)|² = (1/(Nr·Nc)) Σ_{(u,v)∈Ω} |Z(u, v)|²        (5.69)

which establishes the equivalence, from the energy point of view, of the two representations (spatial and frequency domain) of the function Z(x, y), useful in this case to minimize the energy of the function (5.59). Let P(u, v) and Q(u, v) be the Fourier transforms of the gradients p(x, y) and q(x, y), respectively. Applying the Fourier transform to the function (5.59) and considering the energy theorem (5.69), we obtain the following:

Σ_{(u,v)∈Ω} [ |juZ(u, v) − P(u, v)|² + |jvZ(u, v) − Q(u, v)|² ]
+ λ0 Σ_{(u,v)∈Ω} [ |−u²Z(u, v) − juP(u, v)|² + |−v²Z(u, v) − jvQ(u, v)|² ]
+ λ1 Σ_{(u,v)∈Ω} [ |juZ(u, v)|² + |jvZ(u, v)|² ]
+ λ2 Σ_{(u,v)∈Ω} [ |−u²Z(u, v)|² + 2|−uvZ(u, v)|² + |−v²Z(u, v)|² ]   =⇒ minimum

9 Infact, Rayleigh’s theorem is based on Parseval’s theorem. If x1 (t) and x2 (t) are two real signals,
X1 (u) and X2 (u) are the relative Fourier transforms, for Parseval’s theorem proves that:
 +∞  +∞
x1 (t) · x2∗ (t)dt = X1 (u) · X∗2 (u)du
−∞ −∞

If x1 (t) = x2 (t) = x(t) then we have the Rayleigh theorem or energy theorem:
 +∞  +∞
E= |x(t)|2 dt = |X(u)|2 (u)du
−∞ −∞

the asterisk indicates the complex conjugate operator. Often used to calculate the energy of a function
(or signal) in the frequency domain.

By expanding the previous expressions, we have

Σ_{(u,v)∈Ω} [ u²ZZ* − juZP* + juZ*P + PP* + v²ZZ* − jvZQ* + jvZ*Q + QQ* ]
+ λ0 Σ_{(u,v)∈Ω} [ u⁴ZZ* − ju³ZP* + ju³Z*P + u²PP* + v⁴ZZ* − jv³ZQ* + jv³Z*Q + v²QQ* ]
+ λ1 Σ_{(u,v)∈Ω} (u² + v²)ZZ*
+ λ2 Σ_{(u,v)∈Ω} (u⁴ + 2u²v² + v⁴)ZZ*

where the asterisk denotes the complex conjugate operator. By differentiating the latter expression with respect to Z* and setting the result to zero, it is possible to impose the necessary condition for a minimum of the function (5.59) as follows:

(u²Z + juP + v²Z + jvQ) + λ0(u⁴Z + ju³P + v⁴Z + jv³Q) + λ1(u² + v²)Z + λ2(u⁴ + 2u²v² + v⁴)Z = 0

and reordering this last equation we have

[ λ0(u⁴ + v⁴) + (1 + λ1)(u² + v²) + λ2(u² + v²)² ] Z(u, v) + j(u + λ0u³)P(u, v) + j(v + λ0v³)Q(u, v) = 0

Solving the above equation, except for (u, v) = (0, 0), we finally get

Z(u, v) = [ −j(u + λ0u³)P(u, v) − j(v + λ0v³)Q(u, v) ] / [ λ0(u⁴ + v⁴) + (1 + λ1)(u² + v²) + λ2(u² + v²)² ]        (5.70)
Therefore, with (5.70), we have arrived at the Fourier transform of the heights of
an unknown surface starting from the Fourier transforms P(u, v) and Q(u, v) of the
gradient maps p(x, y) and q(x, y) calculated with stereo photometry. The details of
the complete algorithm are reported in [15].
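A minimal sketch of Eq. (5.70) is given below; the values of λ0, λ1, λ2 are placeholders (not taken from the text), and the frequency grids follow the numpy.fft convention.

import numpy as np

def integrate_regularized_fft(p, q, lam0=0.5, lam1=0.0, lam2=0.0):
    """Regularized frequency-domain integration, Eq. (5.70).

    p, q : gradient maps (H, W); lam0, lam1, lam2 : the nonnegative weights
    of Eq. (5.59). Returns the height map up to an additive constant."""
    H, W = p.shape
    u = np.fft.fftfreq(W) * 2 * np.pi
    v = np.fft.fftfreq(H) * 2 * np.pi
    U, V = np.meshgrid(u, v)
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    num = -1j * (U + lam0 * U**3) * P - 1j * (V + lam0 * V**3) * Q
    den = lam0 * (U**4 + V**4) + (1 + lam1) * (U**2 + V**2) + lam2 * (U**2 + V**2)**2
    den[0, 0] = 1.0                 # (u, v) = (0, 0) is excluded in Eq. (5.70)
    Z = num / den
    Z[0, 0] = 0.0                   # arbitrary mean height
    return np.real(np.fft.ifft2(Z))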

5.5 Shape from Texture

When a scene is observed, the image captures, in addition to the information on the variation of light intensity (shading), the texture information, if present. By Shape From Texture (SFT) we mean the vision paradigm that analyzes the texture information

Fig. 5.13 Images of objects with a textured surface

present in the image and produces a qualitative or quantitative reconstruction of the 3D surface of the observed objects. For the purpose of surface reconstruction, we are interested in a particular type of texture information: micro- and macro-repetitive structures (texture primitives) with strong geometric and radiometric variations present on the surface of objects.
The ability of SFT algorithms consists precisely in automatically searching the image for the texture primitives, normally characterized by shape information (for example, ellipses, rectangles, circles, etc.), by size, by density (number of primitives present in an area of the image), and by their orientation. The goal is to reconstruct the 3D surface or to calculate its orientation with respect to the observer by analyzing the characteristic information of the texture primitives. This is in accordance with Gibson's theory of human perception, which attributes to the perception of the surfaces of the scene the extraction of depth information and the 3D perception of objects. Gibson emphasizes that the presence of regular texture in the environment is fundamental for human perception.
Figure 5.13 shows the images of a flat and cylindrical surface with the presence
in both images of the same texture primitive in the form of replicated disks. It is
observed that in relation to the configuration of the vision system (position of the
camera, source, and attitude of the visible surface) in the acquired image the texture
primitives are projected undergoing a systematic distortion (due to the perspective
projection) in relation also to the attitude of the visible surface with respect to the
observer. From the analysis of these detected geometric distortions of the primitives,
it is possible to evaluate the structure of the observed surface or perform the 3D
reconstruction of the surface itself.
From Fig. 5.13 we observe how, in the case of regular primitives (in the example
disks and ellipses), their geometric variations (shape and size) in the image depend
on their distance from the observer, and from the orientation of the surface containing
these primitives. Gibson claims that in the 2D image invariant scene information is
captured. An example of invariant information is given by the relationship between
the horizontal and vertical dimension of a replicated primitive that remains constant
regardless of the fact that, as they move away from the observer, they become smaller
in size until they disappear on the horizon. The goal is to determine this invariant
information from the acquired 2D image.
Now let’s see what are the parameters to consider to characterize the texture
primitives present in the images. These parameters are linked to the following factors:

Perspective projection. It introduces distortions to the geometry of the primitives by changing their height and width. In particular, as the texture primitives move away from the observer, they appear smaller and smaller in the image (the well-known effect of railway tracks appearing to converge on the horizon). In the hypothesis of disk-shaped primitives, these become more and more elliptical as they move away from the observer.
Surface inclination. Typically, when observing a flat surface inclined with respect to the observer, the texture primitives it contains appear foreshortened, or compressed in the inclination direction. For example, a circle that is not parallel to the image plane is seen foreshortened, that is, it is projected as an ellipse in the image plane.

Any method of Shape From Texture must then evaluate the geometric parameters of the texture primitives characterized by these two distortions, which are essential for the reconstruction of the surface and the calculation of its structure. The orientation of a plane must be estimated starting from the knowledge of the geometry of the texture, from the possibility of extracting the primitives without ambiguity, and by appropriately estimating the invariant parameters of the geometry of the primitives, such as the ratio between horizontal and vertical dimension, the variation of the areas, etc. In particular, by extracting all the primitives present, it is possible to evaluate invariant parameters such as the texture gradient, which indicates the rate of change of the density of the primitives as seen by the observer. In other words, the texture gradient in the image provides a continuous metric of the scene, obtained by analyzing the geometry of the primitives, which always appear smaller as they move away from the observer. The information measured with the texture gradient allows humans to perceive the orientation of a flat surface, the curvature of a surface, and depth. Figure 5.14 shows some images illustrating how the texture gradient information gives the perception of the depth of the primitives on a flat surface receding from the observer, and how the local appearance of the visible surface changes with the change of the texture gradient. Other information considered is the perspective gradient and the compression gradient, defined, respectively, by the change in the width and height of the projections

Fig. 5.14 Texture gradient: depth perception and surface orientation



Fig. 5.15 Geometry of the projection model between the image plane and the local orientation of the surface element, defined in terms of the angle of inclination (slant) σ and of rotation (tilt) τ of the normal n to the surface projected in the image plane

of the texture primitives in the image plane. As the distance between the observer and the points of the visible surface increases, the perspective and compression gradients decrease. This perspective and compression gradient information has been widely used in computer graphics to give a good perception of the 3D surface observed on a monitor or 2D screen.
In the context of Shape From Texture, it is usual to define the structure of the flat surface to be reconstructed, with respect to the observer, through the slant angle σ, which indicates the angle between the normal vector of the flat surface and the z-axis (coinciding with the optical axis), and through the tilt angle τ, which indicates the angle between the X-axis and the projection onto the image plane of the normal vector n (see Fig. 5.15). The figure shows a slant angle such that the textured flat surface is inclined with respect to the observer so that its upper part is further away, while the tilt angle is zero; consequently, all the texture primitives arranged horizontally are at the same distance from the observer.
A general algorithm of Shape From Texture includes the following essential steps:

1. Define the texture primitives to be considered for the given application (lines, disks, ellipses, rectangles, curved lines, etc.).
2. Choose the invariant parameters (texture, perspective, and compression gradients) appropriate for the texture primitives defined in step 1.
3. Use the invariant parameters of step 2 to calculate the attitude of the textured surface (a minimal sketch of this step is given below).
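As a minimal illustration of steps 2–3 for circular primitives, the sketch below estimates slant and tilt from the foreshortening of a projected disk (the aspect ratio of the observed ellipse approximates cos σ, and the tilt is the direction of the minor axis); it is an illustrative example under these assumptions, not a complete SFT algorithm.

import numpy as np

def slant_tilt_from_ellipse(major_axis, minor_axis, minor_axis_dir):
    """Estimates the surface orientation from one projected circular primitive.

    major_axis, minor_axis : lengths of the ellipse axes measured in the image
    minor_axis_dir         : (dx, dy) direction of the minor axis in the image
    Returns (slant, tilt) in radians, under weak-perspective assumptions."""
    slant = np.arccos(np.clip(minor_axis / major_axis, 0.0, 1.0))
    tilt = np.arctan2(minor_axis_dir[1], minor_axis_dir[0])
    return slant, tilt

# e.g. a disk observed with a 2:1 aspect ratio -> slant of 60 degrees
# slant, tilt = slant_tilt_from_ellipse(2.0, 1.0, (1.0, 0.0))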

Fig. 5.16 Illustration of the triangulation geometry between projector (laser light source), camera, and 3D object

5.6 Shape from Structured Light

A depth map can be obtained with a range imaging system (see footnote 10), where the object to be reconstructed is illuminated by so-called structured lighting, of which the geometry of the projected light structures is known.
In essence, recalling the binocular vision system, one camera is replaced by a luminous pattern projector, and the correspondence problem is solved (in a simpler way) by searching for the (known) patterns in the camera that captures the scene with the superimposed light patterns.
Figure 5.16 shows the functional scheme of a range acquisition device based on structured light. The scene is illuminated by a projector (for example, based on low-power lasers) with known patterns of light (structured light); projector and observer (camera) are separated by a distance L, and the distance measurement (range) can be calculated with a single image (the scene with the superimposed light patterns) by triangulation, in a way similar to the binocular stereo system. Normally the scene can be illuminated by a luminous spot, by a thin lamina of light (a vertical light plane perpendicular to the scene), or with more complex luminous patterns (for example, a rectangular or square luminous grid, or binary or gray luminous strips; Microsoft's Kinect is a low-cost device that projects a scattered infrared pattern with a laser and uses an infrared-sensitive camera).
The relation between the coordinates (X, Y, Z) of a point P of the scene and those (x, y) of its projection in the image plane is linked to the calibration parameters of the capture system, such as the focal length f of the camera's optical system, the separation distance L between projector and camera, the angle of inclination α of the projector with respect to the X-axis, and the projection angle β of the object point P illuminated by the light spot (see Fig. 5.16). In the hypothesis of the 2D

10 Indicates a set of techniques that are used to produce a 2D image to calculate the distance of
points in a scene from a specific point, normally associated with a particular sensory device. The
pixels of the resulting image, known as the depth image, have the information content from which
to extrapolate values of distances between points of the object and sensory device. If the sensor that
is used to produce the depth image is correctly calibrated, the pixel values are used to estimate the
distance information as in a stereo binocular device.

Fig. 5.17 Illustration of the 3D extension of the triangulation geometry between projector, camera, and 3D object

projection of a single light spot, this relation is determined to calculate the position of P by triangulation, considering the triangle OPQ and applying the law of sines:

d / sin α = L / sin γ
from which it follows:

d = L·sin α / sin γ = L·sin α / sin[π − (α + β)] = L·sin α / sin(α + β)        (5.71)
The angle β (given by β = arctan(f/x)) is determined by the projection geometry of the point P, located at p(x, y) in the image plane, considering the focal length f of the optical system and only the horizontal coordinate x. Once the angle β is determined, and with the parameters L and α known from the system configuration, the distance d is calculated with Eq. (5.71). Considering the triangle OPS, the polar coordinates (d, β) of the point P in the plane (X, Z) are converted into Cartesian coordinates (X_P, Z_P) as follows (see footnote 11):

X P = d · cos β Z P = h = d · sin β (5.72)

The extension to the 3D projection of P is immediate considering the pinhole model, so that from the similarity of the triangles generated by the projection of P in the image plane we have

f/Z = x/X = y/Y        (5.73)

11 Obtained according to the trigonometric formulas of complementary angles (their sum is a right angle), where in this case we have the complementary angle (π/2 − β).

Considering the right triangle with base (L − X) on the baseline OQ (see Fig. 5.17), we get

tan α = Z / (L − X)        (5.74)

From Eqs. (5.73) and (5.74) we can derive:

Z = fX/x = (L − X)·tan α   ⟹   X·(f/x + tan α) = L·tan α        (5.75)

Therefore, considering the equality of the ratios in (5.73) and the last expression of (5.75), we get the 3D coordinates of P given by

[X Y Z] = ( L·tan α / (f + x·tan α) ) · [x y f]        (5.76)
It should be noted that the resolution of the depth measurement Z given by (5.76) is
related to the accuracy with which α is measured and the coordinates (x, y) deter-
mined for each point P of the scene (illuminated) projected in the image plane. It is
also observed that, to calculate the distance of P, the angle γ was not considered (see
Fig. 5.17). This depends on the fact that the projected structured light is a vertical
light plane (not a ray of light) perpendicular to the X Z plane and forms an angle α
with the X -axis. To calculate the various depth points, it is necessary to project the
light spot in different areas of the scene to obtain a 2D depth map by applying (5.76)
for each point. This technique using a single mobile light spot (varying α) is very
slow and inadequate for dynamic scenes.
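A minimal sketch of the single-spot triangulation of Eq. (5.76) is given below (the image coordinates are assumed to be expressed in the same metric units as the focal length; the variable names are illustrative).

import numpy as np

def spot_triangulation(x, y, f, L, alpha):
    """3D position of a scene point lit by a single light spot, Eq. (5.76).

    x, y  : image coordinates of the spot
    f     : focal length of the calibrated camera
    L     : baseline between projector and camera
    alpha : projection angle of the light ray with respect to the X-axis
    Returns (X, Y, Z) in camera coordinates."""
    t = np.tan(alpha)
    scale = L * t / (f + x * t)
    return np.array([x, y, f]) * scale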
Normally, structured light systems are used consisting of a vertical light lamina (light plane) that scans the scene by tilting the lamina with respect to the Y-axis, as shown in Fig. 5.18. In this case, the projection angle of the laser light plane Π is gradually changed to capture the entire scene in width. As before,

Fig. 5.18 Illustration of the triangulation geometry between the laser light plane and the outgoing ray from the optical center of the calibrated camera that intersects a point of the illuminated 3D object in P (the light plane is swept along the scan direction)

the camera-projector system is geometrically calibrated in order to recover the depth map from the captured images, in each of which the projection of the light lamina is shifted and appears as a luminous silhouette that depends on the shape and orientation of the objects in the scene relative to the camera. Essentially, at the intersection of the vertical light lamina with the surface of the objects, the camera sees the light lamina as a broken curve with various orientations.
Points of the 3D objects are determined in the image plane, and their 3D reconstruction is obtained by calculating the intersection between the light plane (whose spatial position is known) and the ray that starts at the optical center of the camera and passes through the corresponding pixel p of the image (see Fig. 5.18). Scanning of the complete scene can also be done by rotating the objects and leaving the light plane source fixed. In 3D space the light plane is expressed by the equation of the plane in the form AX + BY + CZ + D = 0. From the equalities of the ratios (5.73), and considering the light plane equation, we can derive the fundamental equations of the pinhole projection model and calculate the 3D coordinates of the points of the scene illuminated by the light plane, as follows:

X = Zx/f        Y = Zy/f        Z = −D·f / (Ax + By + C·f)        (5.77)
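A minimal sketch of the ray–plane intersection of Eq. (5.77) is given below (variable names are illustrative; the plane coefficients are assumed to be expressed in the camera reference frame, with the image coordinates in the same units as the focal length).

import numpy as np

def triangulate_light_plane(x, y, f, plane):
    """Intersection of the camera ray through pixel (x, y) with the light
    plane AX + BY + CZ + D = 0, Eq. (5.77).

    x, y  : image coordinates of the illuminated point
    f     : focal length of the calibrated camera
    plane : (A, B, C, D) coefficients of the light plane
    Returns the 3D point (X, Y, Z) in camera coordinates."""
    A, B, C, D = plane
    Z = -D * f / (A * x + B * y + C * f)
    return np.array([Z * x / f, Z * y / f, Z])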
As an alternative to the projection of rays or light planes, structured light consisting of static 2D light patterns (for example, a square grid with known geometry, or stripe patterns) can be projected onto the objects of the scene, and then the image of the whole scene with the superimposed light grid is acquired by the camera (see Fig. 5.19). In this way, points of interest are detected in the image from the grid patterns deformed by the curved surface of the objects of the scene. With a single acquisition of the scene, it is possible to calculate the depth map from the determined points of interest of the deformed grid. The calculation of the depth map in this case depends on the accuracy with which the projected patterns of interest (points or light strips) necessary for triangulation are determined.

5.6.1 Shape from Structured Light with Binary Coding

The most used techniques are those based on the sequential projection of coded light patterns (binary, gray-level, or color) to eliminate the ambiguity in identifying the patterns associated with the surface of objects at different depths. It is, therefore, necessary to uniquely determine the patterns of multiple strips of light seen by the camera and projected onto the image plane, comparing them with those of the original pattern. The process that compares the projected patterns (for example, binary light strips) with the corresponding original patterns (known a priori) is known as pattern decoding, the equivalent of the correspondence search in binocular vision. In essence, the decoding of the patterns consists in locating them in the image and finding their correspondence in the plane of the projector, for which it is known how they were coded.

Fig. 5.19 Light grid, with known geometry, projected onto a 3D object whose deformation points of interest are detected to reconstruct the shape of the observed curved surface

Binary light pattern projection techniques involve projecting light planes onto
objects where each light plane is encoded with appropriate binary patterns [16].
These binary patterns are uniquely encoded by black and white strips (bands) for
each plane, so that when projected in a time sequence (the strips increase their width
over time) each point on the surface of the objects is associated with a single binary
code distinct from the other codes of different points. In other words, each point is
identified by the intensity sequence it receives. If the patterns are n (i.e., the number of planes to be projected), then 2^n strips can be coded (that is, 2^n regions are identified in the image). Each strip represents a specific angle α of the projected light plane (which can be vertical or horizontal, or in both directions, depending on the type of scan).
Figure 5.20 shows a set of luminous planes encoded with binary strips to be projected in a temporal sequence on the scene to be reconstructed. In the figure, the number of patterns is 5, and the coding of each plane represents the binary configuration of patterns 0 and 1, indicating light off and on, respectively. The figure also shows the temporal sequence of the patterns with the binary coding that uniquely associates the code (lighting code) with each of the 2^5 strips. Each image acquired for a projected pattern is in fact a bit-plane, and together they form a bit-plane block. This block contains the n-bit sequences that establish the correspondence between all the points of the scene and their projection in the image plane (see Fig. 5.20).

Fig. 5.20 3D reconstruction of the scene by projecting in time sequence 5 pattern planes with binary coding. The observed surface is partitioned into 32 regions and each pixel is encoded in the example by a unique 5-digit binary code (in the figure, the code observed for the point P is 10100)

5.6.2 Gray Code Structured Lighting

Binary coding provides two levels of light intensity, encoded with 0 and 1. Binary coding can be made more robust by using the concept of Gray code (see footnote 12), where each band is encoded in such a way that two adjacent bands differ by a single bit, which is the maximum possible error in the encoding of the bands (see Fig. 5.21). The number of images with the Gray code is the same as with the binary code, and each image is a bit-plane of the Gray code that represents the luminous pattern plane to be projected. The transformation algorithm from binary code to Gray code is a simple procedure (see Algorithm 24). The inverse procedure, which transforms a Gray code into a binary sequence, is shown in Algorithm 25.
Once the images are acquired with the patterns superimposed on the surface of the objects, through segmentation the 2^n bands are uniquely coded, and finally it is possible to calculate the relative 3D coordinates with a triangulation process and

12 Named after Frank Gray, a researcher at Bell Laboratories who patented it in 1953. Also known as Reflected Binary Code (RBC), a binary coding method where two successive values differ by only one bit (binary digit). RBC was originally designed to prevent spurious errors in various electronic devices and is today widely used in digital transmission. Basically, the Gray code is based on the Hamming distance (in this case 1), which counts the number of digit substitutions needed to make two strings of the same length equal.

Fig. 5.21 Example of a 5-bit Gray code that generates 32 bands with the characteristic that adjacent bands differ by only 1 bit. It can be compared with the structured light planes with binary coding, also at 5 bits, of Fig. 5.20

Algorithm 24 Pseudocode to convert a binary number B to Gray code G.
1: Bin2Gray(B)
2: n ← length(B)
3: G(1) ← B(1)
4: for i ← 2 to n do
5:   G(i) ← B(i − 1) xor B(i)
6: end for
7: return G
Algorithm 25 Pseudocode to convert a Gray code G to a binary number B.
1: Gray2Bin(G)
2: n ← length(G)
3: B(1) ← G(1)
4: for i ← 2 to n do
5:   B(i) ← B(i − 1) xor G(i)
6: end for
7: return B

obtain a depth map. The coordinates (X, Y, Z) of each pixel (along the 2^n horizontal bands) are calculated from the intersection of the plane passing through the vertical band and the optical center of the projector with the straight line passing through the optical center of the calibrated camera and the points of the band (see Fig. 5.20), according to Eq. (5.77). The required segmentation algorithm is simple, since the binary bars on the surface of the objects are normally well contrasted and, except in shadow areas, the projected luminous pattern plane does not optically interfere with the surface itself. However, to obtain an adequate spatial resolution, several pattern planes must be projected. For example, to have a resolution of 1024 bands, log2(1024) = 10 pattern planes must be projected, and then 10 bit-plane images processed. Overall, the method

has the advantage of producing depth maps with high resolution, accuracy on the order of µm, and good reliability when using the Gray code. The limits are related to the required static nature of the scene and to the considerable computational time when a high spatial resolution is required.
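As an illustration, the following sketch applies the Gray-to-binary rule of Algorithm 25 pixel-wise to a stack of already thresholded bit-plane images, producing the band index of each pixel (the array shapes and names are assumptions of this example).

import numpy as np

def decode_gray_bitplanes(bitplanes):
    """Decodes a stack of n thresholded Gray-code images into band indices.

    bitplanes : integer array (n, H, W) of 0/1 values, bitplanes[0] being the
    most significant pattern. Returns an integer band index in [0, 2**n - 1]."""
    n = bitplanes.shape[0]
    binary = np.empty_like(bitplanes)
    binary[0] = bitplanes[0]                      # B(1) = G(1)
    for i in range(1, n):                         # B(i) = B(i-1) xor G(i)
        binary[i] = np.bitwise_xor(binary[i - 1], bitplanes[i])
    # combine the decoded bit-planes into the band index (MSB first)
    weights = 2 ** np.arange(n - 1, -1, -1)
    return np.tensordot(weights, binary, axes=1)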

5.6.3 Pattern with Gray Level

To improve the 3D resolution of the acquired scene and, at the same time, reduce the number of pattern planes, it is useful to project luminous pattern planes at gray levels [17] (or in color [18]). In this way, the code base is increased with respect to binary coding. If m is the number of gray (or color) levels and n is the number of pattern planes (known as n-ary codes), we will have m^n bands, and each band is seen as a point in an n-dimensional space. For example, with n = 3 and using only m = 4 gray levels, we would have 4^3 = 64 unique codes to characterize the bands, against the 6 pattern planes required with binary coding.

5.6.4 Pattern with Phase Modulation

We have previously considered patterns based on binary coding, Gray code, and on
n-ary coding that have the advantage of encoding individual pixel regions without
spatially depending on neighboring pixels. A limitation of these methods is given by
the poor spatial resolution. A completely different approach is based on the Phase
Shift Modulation [19,20], which consists of projecting different modulated periodic
light patterns with a constant phase shift in each projection. In this way, we have
a high-resolution spatial analysis of the surface with the projection of sinusoidal
luminous patterns (fringe patterns) with constant phase shift (see Fig. 5.22).

Fig. 5.22 The phase-shift-based method involves the projection of 3 planes of sinusoidal light patterns modulated with phase shift (on the right, an image of one of the luminous fringe planes projected into the scene is displayed)

If we consider an ideal model of image formation, every point of the scene receives the luminous fringes perfectly in focus and not conditioned by other light sources. Therefore, the intensity at each pixel (x, y) of the images Ik, k = 1, 2, 3, acquired by projecting three planes of sinusoidal luminous fringes with a constant shift of the phase angle θ, is given by the following:

I1(x, y) = Io(x, y) + Ia(x, y) cos[φ(xp, yp) − θ]
I2(x, y) = Io(x, y) + Ia(x, y) cos[φ(xp, yp)]        (5.78)
I3(x, y) = Io(x, y) + Ia(x, y) cos[φ(xp, yp) + θ]

where Io(x, y) is an offset that includes the contribution of other light sources in the environment, Ia(x, y) is the amplitude of the modulated light signal (see footnote 13), and φ(xp, yp) is the phase of the luminous pixel of the projector which illuminates the point of the scene projected at the point (x, y) in the image plane. The phase φ(xp, yp) provides the matching information in the triangulation process. Therefore, to calculate the depth of the observed surface, it is necessary to recover the phase of each pixel (a process that yields the so-called wrapped phase) relative to the three projections of sinusoidal fringes, starting from the three images Ik. The phases φk, k = 1, ..., 3 thus recovered are then combined to obtain a unique, unambiguous phase φ through a procedure known as phase unwrapping (see footnote 14). Phase unwrapping is a trivial operation if the context of the wrapped phases is ideal. However, in real measurements various factors (e.g., presence of shadows, low-modulation fringes, nonuniform reflectivity of the object's surface, fringe discontinuities, noise) influence the phase unwrapping process. As we shall see, it is possible to use a heuristic solution to the phase unwrapping problem which attempts to use continuity data on the measured surface to shift the data when a period boundary has obviously been crossed, even though this is not an ideal solution and does not completely handle the discontinuities.

13 The value of Ia (x, y) is conditioned by the BRDF function of the point of the scene, by the
response of the camera sensor, by the arrangement of the tangent plane in that point of the scene
(as seen from foreshortening by the camera) and by the intensity of the projector.
14 It is known that the phase of a periodic signal is uniquely defined in the main interval (−π, π). As shown in the figure, fringes with sinusoidal intensity are repeated over several periods to cover the entire surface of the objects. But this creates ambiguity (for example, 20° is equal to 380° and to 740°): from the gray levels of the acquired images (5.78), the phase is computable only up to multiples of 2π, which is known as the wrapped phase. The recovery of the original phase values from the values in the main interval is a classic problem in signal processing known as phase unwrapping. Formally, phase unwrapping means that, given the wrapped phase ψ ∈ (−π, π), one needs to find the true phase φ, which is related to ψ as follows:

ψ = W(φ) = φ − 2π ⌊φ/(2π)⌉

where W is the phase wrapping operator and ⌊•⌉ rounds its argument to the nearest integer. It is shown that phase unwrapping is generally a mathematically ill-posed problem and is usually solved through algorithms based on heuristics that give acceptable solutions.

In this context, the phase unwrapping process that calculates the absolute (true) phase φ must start from the wrapped phase ψ derived from the observed intensities given by the images Ik, k = 1, 2, 3 of light fringes, that is, Eq. (5.78). It should be noted that in these equations the terms Io and Ia are not known (we will see shortly that they are removed), while the phase angle φ is the unknown. According to the algorithm proposed by Huang and Zhang [19], the wrapped phase is obtained by combining the intensities Ik as follows:
(I1(x, y) − I3(x, y)) / (2I2(x, y) − I1(x, y) − I3(x, y))
   = (cos(φ − θ) − cos(φ + θ)) / (2cos(φ) − cos(φ − θ) − cos(φ + θ))
   = (2 sin(φ) sin(θ)) / (2 cos(φ)[1 − cos(θ)])        (sum/difference trigonometric formulas)
   = (tan(φ) sin(θ)) / (1 − cos(θ))
   = tan(φ) / tan(θ/2)        (tangent half-angle formula)        (5.79)
from which the dependence on the terms Io and Ia is removed. Considering the final result of (5.79) (see footnote 15), the phase angle, expressed in terms of the observed intensities, is obtained as follows:
 
ψ = arctan[ √3 (I1(x, y) − I3(x, y)) / (2I2(x, y) − I1(x, y) − I3(x, y)) ],   ψ ∈ (0, 2π)        (5.80)

where θ = 120° is considered, for which tan(θ/2) = √3. Equation (5.80) gives the phase angle of the pixel in the local period from the intensities.
To remove the ambiguity due to the discontinuity of the arctangent function at 2π, we need to add or subtract multiples of 2π to the calculated phase angle ψ, that is, to find the unwrapped phase (see Note 14 and Fig. 5.23), given by

φ(x, y) = ψ(x, y) + 2π k(x, y) k ∈ (0, . . . , N − 1) (5.81)

where k is an integer representing the projection period while N is the number of


projected fringes. It should be noted that the phase unwrapping process applied in this way
provides only the relative phase φ and not the absolute one needed to reconstruct the depth
of each pixel. To estimate the 3D coordinates of a pixel, it is necessary to calculate
a reference phase φ_ref with respect to which the relative phase φ of each pixel is determined
by triangulation with projector and camera. Figure 5.24 shows the triangulation scheme with
the reference plane and the surface of the objects onto which the fringe patterns are projected,
with respect to which the unwrapped reference phase φ_ref and the phase φ relative to the object
are obtained. Considering the similar triangles of the figure, the following relation is obtained:
\frac{z}{Z - z} = \frac{d}{L}    from which    z = \frac{Z - z}{L}\, d    (5.82)

15 Obtained considering the tangent half-angle formula in the version tan(θ/2) = (1 − cos θ)/sin θ, valid for
θ ≠ k · 180°.


Fig. 5.23 Illustration of the phase unwrapping process. The graph on the left shows the phase
angle φ(x, y) modulo 2π, while the graph on the right shows the result of the unwrapped phase

Fig. 5.24 Calculation of the depth z by triangulation and the value of the phase difference with
respect to the reference plane

where z is the height of a pixel with respect to the reference plane, L is the separation
distance between projector and camera, Z is the perpendicular distance between the
reference plane and the segment joining the optical centers of camera and projector,
and d is the separation distance of the projection points of P (point of the object
surface) in the reference plane obtained by the optical rays (of the projector and
camera) passing through P (see Fig. 5.24). Considering Z ≫ z, Eq. (5.82) can be
simplified as follows:
z ≈ \frac{Z}{L}\, d ∝ \frac{Z}{L}\,(\phi - \phi_{ref})    (5.83)

where the unwrapped phase φ_ref is obtained by projecting and acquiring the fringe
patterns on the reference plane in the absence of the object, while φ is obtained by
repeating the scan with the object present. In essence, once the scanning system has been
calibrated (with known L and Z) and d has been determined by triangulation (a sort of
disparity of the unwrapped phase), the heights (depths) of the object's surface are
calculated with Eq. (5.83).
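The whole pipeline can be summarized with the sketch below; it assumes that the period map k(x, y), the calibration constants L and Z, and a factor converting the phase difference into the distance d are available from the calibration.

```python
import numpy as np

def unwrap_phase(psi, k):
    """Eq. (5.81): absolute phase from the wrapped phase psi and period index k."""
    return psi + 2.0 * np.pi * k

def depth_from_phase(phi_obj, phi_ref, L, Z, phase_to_d=1.0):
    """Approximate depth map via Eq. (5.83), valid when Z >> z.

    phi_obj    : unwrapped phase measured with the object in the scene
    phi_ref    : unwrapped phase of the reference plane alone
    L, Z       : baseline and reference-plane distance (from calibration)
    phase_to_d : calibration factor mapping the phase difference to d
    """
    d = phase_to_d * (phi_obj - phi_ref)   # a sort of "disparity" of the phase
    return (Z / L) * d
```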

5.6.5 Pattern with Phase Modulation and Binary Code

Previously, we have highlighted the ambiguity problem of the method based on phase shift
modulation and the need for a phase unwrapping solution, which by itself does not resolve the
absolute phase unequivocally. This ambiguity can be resolved by combining this method, which
projects periodic patterns, with the Gray code pattern projection method described above.
For example, projecting only 3 binary patterns would divide the surface of the object
into 8 regions, while projecting the periodic patterns increases the spatial resolution,
giving a more accurate reconstruction of the depth map. In fact, once the phase of a given
pixel has been calculated, the period of the sinusoid in which the pixel lies is obtained
from the region it belongs to, as encoded by the binary code.
Figure 5.25 gives an example [21] which combines the binary code method (Gray
code) and the one with phase shift modulation. There are 32 binary code sequences
to partition the surface, determining the phase interval unambiguously, while phase
shift modulation reaches a subpixel resolution beyond the number of split regions
expected by the binary code. As shown in the figure, the phase modulation is achieved
by approximating the sine function with on/off intensity of the patterns generated
by a projector. These patterns are then translated into steps of π/2 for a total of 4
translations.
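A possible implementation of this combination is sketched below, assuming four patterns shifted in steps of π/2 (I_k = Io + Ia cos(φ + kπ/2), k = 0..3) and a stack of Gray-code images already thresholded to binary; the helper names are illustrative.

```python
import numpy as np

def wrapped_phase_four_step(I0, I1, I2, I3):
    """Wrapped phase for four patterns shifted by pi/2:
    I3 - I1 = 2*Ia*sin(phi) and I0 - I2 = 2*Ia*cos(phi)."""
    return np.arctan2(I3 - I1, I0 - I2)

def gray_to_binary(gray):
    """Decode an integer Gray-code map into the plain band index."""
    binary = gray.copy()
    shift = 1
    while (binary >> shift).any():
        binary ^= binary >> shift
        shift *= 2
    return binary

def decode_gray_stack(bit_images):
    """bit_images: list of boolean images, most significant bit first."""
    gray = np.zeros(bit_images[0].shape, dtype=np.int64)
    for b in bit_images:
        gray = (gray << 1) | b.astype(np.int64)
    return gray_to_binary(gray)

def absolute_phase(psi, band_index):
    """Eq. (5.81) with the period index k given by the decoded Gray-code band."""
    return psi + 2.0 * np.pi * band_index
```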
These last two methods have the advantage of operating independently of the environmental
lighting conditions, but have the disadvantage of requiring several light projection patterns
and of not being suitable for scanning dynamic objects.

5.6.6 Methods Based on Colored Patterns

Methods based on the projection of sequential patterns have the problem of being
unsuitable for acquiring depth maps in the context of dynamic scenes (such as moving


Fig. 5.25 Method that combines the projection of 4 Gray-code binary pattern planes with phase shift
modulation, achieved by approximating the sine function with 4 angular translations with a step of
π/2, thus obtaining a sequence of 32 coded bands

people or animals). In these contexts, real-time 3D image acquisition systems are used,
based on colored pattern information or on the projection of light patterns with a unique
encoding scheme that requires a single acquisition (full frame, i.e., in each pixel of the
image the distance between the camera and the corresponding point of the scene is accurately
evaluated for the 3D reconstruction of the scene surface in terms of coordinates (X, Y, Z)).
The Rainbow 3D Camera [22] illuminates the surface to be reconstructed with
light with spatially variable wavelength and establishes a one-to-one correspondence
between the projection angle θ of a light plane and the particular spectral wavelength λ
realizing thus a simple identification of the light patterns on each point of the surface.
It basically solves the correspondence problem that we have with a binocular
system. In fact, given the baseline L and the viewing angle α, the distance value
corresponding to each pixel is calculated using the triangulation geometry, and the
full-frame image is obtained with a single acquisition at video rate (30 frames/s or higher),
also in relation to the spatial resolution of the sensor (for example, 1024×1300 pixels/frame).
To solve the problem of occlusions in the case of complex surfaces, systems have been
developed that scan the surface by projecting indexed colored strips [23]. Using RGB color
components with 8-bit channels, up to 2^24 different colors are available. With this approach,
by projecting patterns of indexed colored strips, the ambiguity that occurs with the phase shift
modulation method, or with the method that projects multiple monochromatic strip patterns, is
attenuated. Another approach is to consider strips that can be distinguished from each other
because they are made of pattern segments of various lengths [24]. This technique can be applied
to continuous, curved, and not very complex 3D surfaces; otherwise, it would be difficult to
identify the uniqueness of the pattern segments.

5.6.7 Calibration of the Camera-Projector Scanning System

As with all 3D reconstruction systems, even with structured light approaches, a cam-
era and projection system calibration phase is provided. In the literature, there are
various methods [25]. The objective is to estimate the intrinsic and extrinsic parame-
ters of the camera and projection system with appropriate calibration according to the
resolution characteristics of the camera and the projection system itself. The camera
is calibrated, assuming a perspective projection model (pinhole), by viewing from different
angles a black-and-white checkerboard reference plane (with known 3D geometry) and
establishing a nonlinear relationship between the spatial coordinates (X, Y, Z) of
3D points of the scene and the coordinates (x, y) of the same points projected in the
image plane. The calibration of the projector depends on the scanning technology
used. In the case of the projection of a pattern plane with known geometry, the calibration
is done by calculating the homography matrix (see Sect. 3.5 Vol. II), which establishes a
relationship between points of the pattern plane projected onto a plane and the same points
observed by the camera, considering known the separation distance between the projector and
the camera and the intrinsic parameters

(focal length, sensor resolution, center of the sensor plane, ...) of the camera itself. Once the
homography matrix has been estimated, a relationship can be established, for each projected point,
between the coordinates of the projector plane and those of the image plane.
The calibration of the projector is twofold: the calibration of the active light source
that projects the light patterns, and the geometric calibration of the projector seen as a
normal camera in reverse. The calibration of the light source of the projector must ensure
the stability of the contrast through the analysis of the intensity curve: light patterns
are projected and acquired by the camera, and a relationship is established between the
intensity of the projected pattern and the corresponding pixel values detected by the camera
sensor. The relationship between the pixel intensities and those of the projected patterns
determines the function used to control the linearity of the illumination intensity.
The geometric calibration of the projector consists of treating it as a reverse camera.
The optical model of the projector is the same as that of the camera (pinhole model); only
the direction changes. With this inverse geometry, it is necessary to solve the difficult
problem of detecting, in the projector plane, the point corresponding to a point of the
image plane that is the projection of a 3D point of the scene. In essence, the homographic
correspondence between points of the scene seen simultaneously by the camera and by the
projector must be established.
Normally, the camera is first calibrated with respect to a calibration plane, for which
a homography relation H is established between the coordinates of the calibration plane and
those of the image plane. Then calibration light patterns of known geometry are projected
onto the calibration plane and acquired by the camera. With the homography transformation H,
the known-geometry patterns projected onto the calibration plane are known in the reference
system of the camera, that is, they are projected homographically into the image plane. This
actually accomplishes the calibration of the projector with respect to the camera, having
established with H the geometric transformation between points of the projector pattern plane
(via the calibration plane) and, with the inverse transform H−1, the mapping between
image plane and pattern plane (see Sect. 3.5 Vol. II). The accuracy of the geometric
calibration of the projector is strictly dependent on the initial calibration of the camera
itself.
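To make the homography step concrete, here is a minimal NumPy sketch using a standard DLT estimate; it is only a stand-in for the method of Sect. 3.5 Vol. II, and it assumes that matched point pairs between the projected pattern (via the calibration plane) and the camera image are already available.

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT estimate of H such that dst ~ H [src, 1]^T.

    src, dst : (N, 2) arrays of matched points, with N >= 4 and not all collinear.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)          # null-space vector of the DLT system
    return H / H[2, 2]

def map_points(H, pts):
    """Map (N, 2) points with H (e.g., pattern plane -> image plane);
    use np.linalg.inv(H) for the inverse mapping H^-1."""
    p = np.hstack([pts, np.ones((len(pts), 1))])
    q = p @ H.T
    return q[:, :2] / q[:, 2:3]
```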
The calculated depth maps and in general the 3D surface reconstruction technolo-
gies with a shape from structured light approach are widely used in the industrial
applications of vision, where the lighting conditions are very variable and a passive
binocular vision system would be inadequate. In this case, structured light systems
can be used to have a well-controlled environment as required, for example, for
robotized cells with the movement of objects for which the measurements of 3D
shapes of the objects are to be calculated at time intervals. They are also applied for
the reconstruction of parts of the human body (for example, facial reconstruction,
dentures, 3D reconstruction in plastic surgery interventions) and generally in support
of CAD systems.

5.7 Shape from (de)Focus

This technique is based on the depth of field of optical systems that is known to be
finite. Therefore, only objects that are in a given depth interval that depends on the
distance between the object and the observer and the characteristics of the optics
used are perfectly in focus in the image. Outside this range, the object in the image
is blurred in proportion to its distance from the optical system. Recall that the convolution
process with appropriate filters (for example, Gaussian or binomial filters) has been used as
a tool for blurring an image (see Sect. 9.12.6 Vol. I), and that the image formation process
itself can be modeled as a convolution (see Chap. 1 Vol. I), which intrinsically introduces
blurring into the image.
By proceeding in the opposite direction, that is, from the estimate of blurring
observed in the image, it is possible to estimate a depth value knowing the parameters
of the acquisition system (focal length, aperture of the lens, etc.) and the transfer
function with which it is possible to model the blurring (for example, convolution
with Gaussian filter). This technique is used when one wants to obtain qualitative
information of the depth map or when one wants to integrate the depth information
with that obtained with other techniques (data fusion integrating, for example, with
depth maps obtained from stereo vision and stereo photometry).
Depth information is estimated with two possible strategies:

1. Shape from Focus (SfF): It requires the acquisition of a sequence of images


of the scene by varying the acquisition parameters (object–optical-sensor dis-
tances) thus generating images from different blurring levels up to the maximum
sharpness. The objective is to search in the sequence of images for maximum
sharpness and, taking note of the current parameters of the system, to evaluate
the depth information for each point on the scene surface.
2. Shape from Defocus (SfD): The depth information is estimated by capturing
at least two blurred images and by exploring blurring variation in the images
acquired with different settings of optical-sensor system parameters.

5.7.1 Shape from Focus (SfF)

Figure 5.26 shows the basic geometry of the image formation process on which the
shape from focus proposed in [26] is based. The light reflected from a point P of the
scene is refracted by the lens and converges at the point Q in the image plane. From
the Gaussian law of a thin lens (see Sect. 4.4.1 Vol. I), we have the relation, between
the distance p of the object from the lens, distance q of the image plane from the
lens and focal length f of the lens, given by
\frac{1}{p} + \frac{1}{q} = \frac{1}{f}    (5.84)

According to this law, points of the object plane are projected into the image plane
(where the sensor is normally placed) and appear as well-focused luminous points
thus forming in this plane the image I f (x, y) of the scene resulting perfectly in focus.
If the plane of the sensor does not coincide with the image plane but is shifted by a
distance δ (before or after the in-focus image plane; in the figure it is translated after it),
the light coming from the point P of the scene, refracted by the lens, undergoes a dispersion,
and in the sensor plane the projection of P in Q is blurred: because of the dispersion of the
light it appears as a circular luminous spot, assuming a circular aperture of the lens.16
This physical blurring process occurs at all points in
the scene, resulting in a blurred image in the sensor plane Is (x, y). Using similar
triangles (see Fig. 5.26), it is possible to derive a formula to establish the relationship
between the radius of the blurred disk r and the displacement δ of the sensor plane
from the focal plane, obtaining
\frac{r}{R} = \frac{δ}{q}    from which    r = \frac{δ R}{q}    (5.85)
where R is the radius of the lens (or aperture). From Fig. 5.26, we observe that the
displacement of the sensor plane from the image focal plane is given by:

δ = i − q

It is pointed out that the intrinsic parameters of the optical and camera system are
(i, f, and R). The dispersion function that models point blurring in the sensor plane
can be modeled in physical optics.17 The approximation of the physical model of
point blurring can be achieved with the two-dimensional Gaussian function in the
hypothesis of limited diffraction and incoherent illumination.18
Thus, the blurred image Is (x, y) can be obtained through the convolution of the
image in focus I f (x, y) with the PSF Gaussian function h(x, y), as follows:

I_s(x, y) = I_f(x, y) ∗ h(x, y)    (5.86)

16 This circular spot is also known as confusion circle or confusion disk in photography or blur
circle, blur spot in image processing.
17 Recall from Sect. 5.7 Vol. I that in the case of circular openings the light intensity distribution

occurs according to the Airy pattern, a series of concentric rings that are always less luminous due
to the diffraction phenomenon. This distribution of light intensity on the image (or sensor) plane is
known as the dispersion function of a luminous point (called PSF—Point Spread Function).
18 Normally, the formation of images takes place in conditions of illumination from natural (or arti-

ficial) incoherent radiation or from (normally extended) non-monochromatic and unrelated sources
where diffraction phenomena are limited and those of interference cancel each other out. The lumi-
nous intensity in each point is given by the sum of the single radiations that are incoherent with
each other or that do not maintain a constant phase relationship. The coherent radiations are instead
found in a constant phase relation between them (for example, the light emitted by a laser).

Fig. 5.26 Basic geometry in the image formation process with a convex lens

with
h(x, y) = \frac{1}{2\pi \sigma_h^2}\, e^{-\frac{x^2 + y^2}{2\sigma_h^2}}    (5.87)

where the symbol "∗" indicates the convolution operator, σh is the dispersion param-
eter (constant for each point P of the scene, assuming the convolution a spatially
invariant linear transformation) that controls the level of blurring corresponding to the
standard deviation of the 2D Gaussian Point Spread Function (PSF), and is assumed
to be proportional [27,28] to the radius r .
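A minimal sketch of this blur model, using SciPy's Gaussian filter as the PSF of Eq. (5.87), is shown below; the conversion of the blur radius into pixel units is assumed to be folded into the factor k.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_defocus(I_focus, R, delta, q, k=1.0):
    """Blur an in-focus image according to Eqs. (5.85)-(5.87).

    R     : radius of the lens (aperture)
    delta : displacement of the sensor plane from the focal image plane
    q     : distance of the focal image plane from the lens
    k     : factor linking sigma_h to r (includes the metric-to-pixel conversion)
    """
    r = delta * R / q            # radius of the blur circle, Eq. (5.85)
    sigma_h = k * r              # dispersion parameter of the Gaussian PSF
    return gaussian_filter(I_focus, sigma=sigma_h)   # convolution of Eq. (5.86)
```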
Blurred image formation can be analyzed in the frequency domain, where it is
observed how the Optical Transfer Function (OTF) which corresponds to the Fourier
transform of the PSF is characterized. By indicating with I f (u, v), Is (u, v) and
H(u, v), the Fourier transforms, respectively, of the image in focus, the blurred
image and the Gaussian PSF, the convolution expressed by the (5.86) in the Fourier
domain results in the following:

Is (u, v) = I f (u, v) · H(u, v) (5.88)

where
H(u, v) = e^{-\frac{u^2 + v^2}{2}\,\sigma_h^2}    (5.89)

From Eq. (5.89), which represents the optical transfer function of the blur process, its
dependence on the dispersion parameter σh is explicit; it also depends indirectly on the
intrinsic parameters of the optics and camera, considering that σh ∝ r up to a proportionality
factor k which, in turn, depends on the characteristics of the camera and can be determined by
a prior calibration of the camera itself. Considering the circular symmetry of the OTF (5.89),
which still has a Gaussian form, the blurring amounts to passing the low frequencies and cutting
the high frequencies, more strongly as σh increases, in turn driven by

the increase of δ and consequently of r according to (5.85). Therefore, the blurring


of the image is attributable to a low-pass filtering operator, where the bandwidth
decreases as the blur increases (the standard deviation of the Gaussian that model
the PSF and OTF are inversely proportional to each other, for details see Sect. 9.13.3
Vol. I).
So far we have examined the physical–optical aspects of the process of forming
an image (in focus or out of focus), together with the characteristic parameters of
the acquisition system. Now let’s see how, from one or more images, it is possible to
determine the depth map of the scene. From the image formation scheme of Fig. 5.26,
it emerges that a defocused image can be obtained in different ways:

1. Translating the sensor plane with respect to the image plane where the scene is
in perfect focus.
2. Translating the optical system.
3. By translating the objects of the scene relative to the object plane on which the
optical system is focused. Normally, for a 3D object only the points belonging to the object
plane are perfectly in focus; all the other points, before and after the object plane, are
more or less acceptably in focus, depending on the depth of field of the optical system.

The mutual translation between the optical system and the sensor plane (modes 1 and
2) introduces a scale factor (of apparent reduction or enlargement) of the scene by
varying the coordinates in the image plane of the points of the scene and a variation
of intensity in the acquired image, caused by the different distribution of irradiance
in the sensor plane. These drawbacks are avoided by acquiring images translating
only the scene (mode 3) with respect to a predetermined setting of the optical-sensor
system, thus keeping the scale factor of the acquired scene constant.
Figure 5.27 shows the functional scheme of an approach shape from focus pro-
posed in [26]. We observe the profile of the surface S of the unknown scene whose
depth is to be calculated and in particular a surface element ( patch) s is highlighted.
We distinguish a reference base, with respect to which the distance d_f of the focused
object plane is defined and, at the same time, the distance d of the translation base carrying
the object. These distances d_f and d can be measured with controlled resolution. Now consider
the patch s and the situation in which the base moves toward the focused object plane (i.e.,
toward the camera).
In the acquired images the patch s tends to be more and more in focus, reaching the maximum
sharpness when the base reaches the distance d = d_m, and then it begins to defocus as soon as
it passes the focused object plane. If, for each translation step Δd, the distance d of the base
and the blur level of the patch s are recorded, we can estimate the height (depth)
d_s = d_f − d_m at the value d = d_m where the patch has the highest level of focus.
This procedure is applied for any patch on the surface S. Once the system has been
calibrated, from the height ds , the depth of the surface can also be calculated with
respect to the sensor plane or other reference plane.

Fig. 5.27 Functional scheme of the Shape from Focus approach

Once the mode of acquisition of image sequences has been defined, to determine
the depth map it is necessary to define a measurement strategy of the level of blurring
of the points of the 3D objects, not known, placed on a mobile base. In the literature
various metrics have been proposed to evaluate in the case of Shape from Focus (SfF)
the progression of focusing of the sequence of images until the points of interest of
the scene are in sharp focus, while in the case of Shape from Defocus (SfD) the
depth map is reconstructed from the blurring information of several images. Most of
the proposed SfF metrics [26,29,30] measure the level of focus by considering local
windows (which include a surface element) instead of the single pixel.
The goal is to automatically extract the patches of interest with the dominant
presence of strong local intensity variation through ad hoc operators that evaluate
from the presence of high frequencies the level of focus of the patches. In fact,
patches with high texture, perfectly in focus, give high responses to high-frequency
components. Such patches, with maximum responses to high frequencies can be
detected by analyzing the sequence of images in the Fourier domain or the spatial
domain.
In Chap. 9 Vol. I, several local operations have been described for both domains
characterized by different high-pass filters. In this context, the linear operator of
Laplace (see Sect. 1.12 Vol. II) is used, based on the differentiation of the second
order, which accentuates the variations in intensity and is found to be isotropic.
Applied for the image I (x, y), the Laplacian ∇ 2 is given by

∇²I(x, y) = \frac{\partial^2 I(x, y)}{\partial x^2} + \frac{\partial^2 I(x, y)}{\partial y^2} = I(x, y) ∗ h_∇(x, y)    (5.90)
calculable in each pixel (x, y) of the image. Equation (5.90), in the last expres-
sion, also expresses the Laplacian operator in terms of convolution, considering the
function PSF Laplacian h ∇ (x, y) (described in detail in Sect. 1.21.3 Vol. II). In the
frequency domain, indicating with F the Fourier transform operator, the Laplacian
of image I (x, y) is given by

F {∇ 2 I (x, y)} = L∇ (u, v) = H ∇ (u, v) · I(u, v) = −4π 2 (u 2 + v2 )I(u, v) (5.91)



which is equivalent to multiplying the spectrum I(u, v) by a factor proportional to the


frequencies (u 2 + v2 ). This leads to the accentuation of the high spatial frequencies
present in the image.
Applying the Laplacian operator (5.86) to the blurred image Is (x, y) in the spatial
domain, for (5.90), we have

∇²I_s(x, y) = h_∇(x, y) ∗ I_s(x, y) = h_∇(x, y) ∗ [h(x, y) ∗ I_f(x, y)]    (5.92)

where we remember that h(x, y) is the Gaussian PSF function. For the associative
property of convolution, the previous equation can be rewritten as follows:

∇²I_s(x, y) = h(x, y) ∗ [h_∇(x, y) ∗ I_f(x, y)]    (5.93)

Equation (5.93) informs us that, instead of directly applying the Laplacian operator
to the blurred image Is with (5.92), it is also possible to apply it first to the focused
image I f and then blur the result obtained with the Gaussian PSF. In this way, with the
Laplacian only the high spatial frequencies are obtained from the I f and subsequently
attenuated with the Gaussian blurring, useful for attenuating the noise normally
present in the high-frequency components. In the Fourier domain, the application of
the Laplacian operator to the blurred image, considering also Eqs. (5.89) and (5.91),
results in the following:

I_s(u, v) = H(u, v) · H_∇(u, v) · I_f(u, v) = −4π²(u² + v²)\, e^{-\frac{u^2 + v^2}{2}\,\sigma_h^2}\, I_f(u, v)    (5.94)

We highlight how, in the Fourier domain, for each frequency (u, v), the transfer function
H(u, v) · H_∇(u, v) (the product of the Laplacian operator and the Gaussian blurring filter)
has a Gaussian distribution controlled by the blurring parameter σh.
Therefore, a sufficiently textured image of the scene will present a richness of high
frequencies emphasized by the Laplacian filter H ∇ (u, v) and attenuated by the con-
tribution of the Gaussian filter according to the value of σh . The attenuation of high
frequencies is almost nil (ignoring any blurring due to the optical system) when the
image of the scene is in focus with σh = 0.
If the image is not well and uniformly textured, the Laplacian operator does not
guarantee a good measure of image focusing as the operator would hardly select dom-
inant high frequencies. Any noise present in the image (due to the camera sensor)
would introduce high spurious frequencies altering the focusing measures regardless
of the type of operator used. Normally, noise would tend to violate the spatial invari-
ance property of the convolution operator (i.e., the PSF would vary spatially in each
pixel h σ (x, y)).
To mitigate the problems caused by noise when working with real images, the focusing
measurements obtained with the Laplacian operator are computed locally at each pixel (x, y)
by summing the significant values within a support window Ω_{x,y} of size n × n centered at
the pixel (x, y) being processed. The focus measure MF(x, y) is then:

MF(x, y) = \sum_{(i, j) \in \Omega_{x,y}} ∇²I(i, j)    for ∇²I(i, j) ≥ T    (5.95)

where T indicates a threshold value beyond which the Laplacian value at a pixel is considered
significant within the support window Ω of the Laplacian operator. The size of Ω (normally a
square window of size 3 × 3 or larger) is chosen in relation to the size of the texture in the
image. It is evident that with the Laplacian the partial second derivatives along the horizontal
and vertical directions can have equal and opposite values, i.e., I_xx = −I_yy ⟹ ∇²I = 0, thus
canceling each other out. In this case, the operator would produce incorrect answers even in the
presence of texture, as the contributions of the high frequencies associated with this texture
would cancel. To prevent the cancelation of such high frequencies, Nayar and Nakagawa [26]
proposed a modified version of the Laplacian, known as the
Modified Laplacian—ML given by
∇²_M I = \left|\frac{\partial^2 I}{\partial x^2}\right| + \left|\frac{\partial^2 I}{\partial y^2}\right|    (5.96)

Compared to the original Laplacian, the modified one is always greater than or equal to it. To
adapt to the possible size of the texture, it is also proposed to compute the partial derivatives
using a variable step s ≥ 1 between the pixels belonging to the window Ω_{x,y}. The discrete
approximation of the modified Laplacian is given by
∇²_{M_D} I(x, y) = |−I(x + s, y) + 2I(x, y) − I(x − s, y)| + |−I(x, y + s) + 2I(x, y) − I(x, y − s)|    (5.97)

The final focus measure SML(x, y), known as the Sum Modified Laplacian, is calculated
as the sum of the values of the modified Laplacian ∇²_{M_D} I, given by

SML(x, y) = \sum_{(i, j) \in \Omega_{x,y}} ∇²_{M_D} I(i, j)    for ∇²_{M_D} I(i, j) ≥ T    (5.98)
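A direct NumPy transcription of Eqs. (5.97)–(5.98) is sketched below; the window size n, the step s, and the threshold T are parameters to be tuned to the texture at hand.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def modified_laplacian(I, s=1):
    """Discrete modified Laplacian, Eq. (5.97), with step s >= 1 (zero at the borders)."""
    I = I.astype(float)
    ml = np.zeros_like(I)
    ml[s:-s, s:-s] = (
        np.abs(-I[s:-s, 2 * s:] + 2 * I[s:-s, s:-s] - I[s:-s, :-2 * s]) +
        np.abs(-I[2 * s:, s:-s] + 2 * I[s:-s, s:-s] - I[:-2 * s, s:-s])
    )
    return ml

def sum_modified_laplacian(I, n=5, s=1, T=0.0):
    """Focus measure SML(x, y), Eq. (5.98): sum of the significant ML values
    over an n x n support window centered at each pixel."""
    ml = modified_laplacian(I, s)
    ml = np.where(ml >= T, ml, 0.0)              # keep only values above the threshold T
    return uniform_filter(ml, size=n) * (n * n)  # window mean * n^2 = window sum
```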

Several other focusing operators are reported in the literature based on the gradient
(i.e., on the first derivative of the image) which in analogy to the Laplacian operator
evaluates the edges present in the image; on the coefficients of the discrete wavelet
transform by analyzing the content of the image in the frequency domain and the spa-
tial domain, and using these coefficients to measure the level of focus; on the discrete
cosine transform (DCT), on the median filter and statistical methods (local variance,
texture, etc.). In [31], the comparative evaluation of different focus measurement
operators is reported.
Once the focus measurement operator has been defined, the depth estimate of
each point (x, y) of the surface is obtained from the set of focusing measurements
related to the sequence of m images acquired according to the scheme of Fig. 5.27.
For each image of the sequence, the focusing measure is calculated with (5.98) (or
with other measurement methods) for each pixel using a support window Ω_{x,y} (of

size n × n ≥ 3 × 3) centered at the pixel (x, y) being processed. We now denote with
d_m(x, y) the depth of a point (x, y) of the surface, corresponding to the maximum value of
the focus measure among all the measures SML_i(x, y) of the corresponding pixels in the m
images of the sequence. The depth map obtained for all points on the surface is given by
the following:

d_m(x, y) = \arg\max_i \{SML_i(x, y)\}    i = 1, . . . , m    (5.99)

If the sequence of images is instead obtained by continuously varying the parameters of
the optics-sensor system, for example by varying the distance between the optics and the
sensor, the depth values for each point of the surface are calculated by selecting from the
sequence SML_i(x, y) the image that corresponds to the maximum focus measure. This determines
the optics-sensor distance and, by applying the thin-lens formula (5.84) with the known focal
length of the optics, it is possible to calculate the depth.
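The selection of Eq. (5.99) can be written compactly as below, assuming the stack of m images and the base distances d_i recorded at each translation step; the SML sketch above is reused as the focus measure.

```python
import numpy as np

def depth_from_focus(stack, distances, focus_measure):
    """Depth map via Eq. (5.99).

    stack         : array of shape (m, H, W), images acquired during the scan
    distances     : length-m sequence of base distances d_i recorded at each step
    focus_measure : callable returning a per-pixel focus map (e.g., the SML above)
    """
    measures = np.stack([focus_measure(img) for img in stack])   # (m, H, W)
    best = np.argmax(measures, axis=0)                           # index i maximizing SML_i
    return np.asarray(distances)[best]                           # distance at best focus
```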
SfF cannot be used for all applications. The S M L measure assumes that the
focused image of an object is entirely flat as it happens in microscope applications.
For more complex objects the depth map is not accurate. Subbarao and Choi [30]
have proposed a different approach to SfF trying to overcome the limitations of SML
by introducing the concept called F I S (Focused Image Surface), which approxi-
mates the surface of a 3D object. In essence, the object is represented by the surface
F I S or from the set of points of the object which in this case are flat surface ele-
ments (patches). Starting from the initial estimate of F I S, the focusing measure is
recalculated for all possible flat surface patches in the small cubic volume formed
by the sequence of images. Patches with the highest focus measurement are selected
to extract the FIS surface. The search process is based on a brute-force approach and
requires considerable computation. For the reconstruction of complex surfaces, the traditional
SfF approaches do not give good results and have the drawback of being normally slow and
requiring a considerable computational load.

5.7.2 Shape from Defocus (SfD)

In the previous paragraph, we have seen the method of focusing based on setting
the parameters of the optical-sensor–object system according to the formula of thin
lenses (5.84) to have an image of the scene in focus. We have also seen what the
parameters are and how to model the process of defocusing images. We will take
up this last aspect in order to formulate the method of the S f D which relates the
object–optical distance (depth), the sensor–optical parameters, and the parameters
that control the level of blur to derive the depth map. Pentland [27] has derived from
(5.84) an equation that relates the radius r of the blurred circular spot with the depth
p of a scene point. We analyze this relationship to extract a dense depth map with
the S f D approach.
Returning to Fig. 5.26, if the sensor plane does not coincide with the focal image
plane, a blurred image is obtained in the sensor plane Is where each bright point of

the scene is a blurred spot involving a circular pixel window known precisely as a
circle of confusion of radius r . We have seen in the previous paragraph the relation
(5.85), which links the radius r of this circle with the circular opening of the lens of
radius R, the translation δ of the sensor plane with respect to the focal image plane
and the distance i between the sensor plane P S and the lens center. The figure shows
the two situations in which the object is displaced by Δp beyond the object plane PO (on
this plane, it would be perfectly in focus in the focal plane PF) and the opposite situation
in which the object is displaced by Δp but closer to the lens. In the two situations,
according to Fig. 5.26, the translation δ is given by:

δ = i − q        δ = q − i    (5.100)

where i indicates the distance between the lens and the sensor plane and q the distance
of the focal image plane from the lens. A characteristic of the optical system is given
by the so-called f/number, here indicated with f # = f /2R which expresses the ratio
between the focal f and the diameter 2R of the lens (described in Sect. 4.5.1 Vol. I).
If we express the radius R of the lens in terms of the f/number f#, we have R = f/(2 f#),
which, substituted together with the first equation of (5.100) into (5.85), gives the
following relation for the radius r of the blur circle:

r = \frac{f \cdot i - f \cdot q}{2 f_\# \cdot q}    (5.101)
In addition, solving (5.101) for q and substituting it into the thin-lens formula (5.84),
q is eliminated and we obtain:

r = \frac{p(i - f) - f \cdot i}{2 f_\# \cdot p}    (5.102)
Solving (5.102) for the depth p, we finally get

p = \frac{f \cdot i}{i - f - 2 f_\# \cdot r}  if δ = i − q        p = \frac{f \cdot i}{i - f + 2 f_\# \cdot r}  if δ = q − i    (5.103)

It is pointed out that Eq. (5.103), valid in the context of geometric optics, relates the
depth p of a point of the scene to the corresponding radius r of the blur circle. Furthermore,
Pentland proposed to consider the spread σh of the Gaussian PSF hσ proportional to the radius
r of the blur circle up to a factor k:
σh = k · r (5.104)

with k to be determined experimentally in relation to the acquisition system used


(according to the characteristics of the optics and to the resolution of the sensor).
Figure 5.28 shows the graph that relates the theoretical radius r of the blur circle
to the depth p of an object, according to (5.103), considering an optic with focal length
f = 50 mm set to f# = 4 (recall that f# is dimensionless), with the object well

Fig. 5.28 Graph showing the dependence of the radius of the blurring circle r on the depth p
for an optic with focal f = 25 mm and f/number 4. The main distance i of the sensor plane remains
constant and the image is well focused at the distance of 1 m (axes: depth in units of 100 mm vs.
radius of the blurring circle in mm)

focused at 1 m and with the distance i of the sensor plane that remains constant while
varying depth p.
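A small numerical sketch of Eqs. (5.102)–(5.103) is given below; the parameter values are only illustrative, and the sign of r distinguishes the two cases of δ in (5.100).

```python
import numpy as np

def blur_radius(p, f, i, f_num):
    """Signed radius r of the blur circle for an object at depth p, Eq. (5.102);
    negative values correspond to the case delta = q - i (object closer than focus)."""
    return (p * (i - f) - f * i) / (2.0 * f_num * p)

def depth_from_blur(r, f, i, f_num, sensor_beyond_focus=True):
    """Depth p from the blur radius r, Eq. (5.103); the flag selects the case
    delta = i - q (True) or delta = q - i (False)."""
    sign = -1.0 if sensor_beyond_focus else 1.0
    return f * i / (i - f + sign * 2.0 * f_num * r)

# Illustrative values: f = 50 mm, f/4, sensor set so that p = 1000 mm is in focus
f, f_num = 50.0, 4.0
i = 1.0 / (1.0 / f - 1.0 / 1000.0)          # thin-lens formula (5.84)
for p in (500.0, 1000.0, 2000.0, 4000.0):
    print(p, blur_radius(p, f, i, f_num))   # r = 0 at p = 1000 mm
```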
With (5.103) and (5.104), the problem of calculating the depth p is led back to
the estimation of the blurring parameter σh and the radius of the blurring circle r
once known (through calibration) the intrinsic parameters ( f, f # , i) and the extrinsic
parameter (k) of the acquisition system. In fact, Pentland [27] has proposed a method
that requires the acquisition of at least two images with different settings of the system
parameters to detect the different levels of blurring and derive with the (5.103) a
depth estimate. To calibrate the system, an image of the perfectly focused scene
(zero blurring) is initially acquired by adequately setting the acquisition parameters
(large f # ) against which the system is calibrated.
We assume an orthographic projection of the scene (pinhole model) to derive the
relation between the function of blurring and the parameters of setting of the optical-
sensor system. The objective is to estimate the depth by evaluating the difference of the PSF
between a pair of differently blurred images; hence the name SfD. The idea of Pentland is to
emulate human vision, which is capable of evaluating the depth of the scene on similar
principles, since the focal length of the human visual system varies sinusoidally at a
frequency of about 2 Hz.
In fact, the blurring model considered is the one modeled with the convolution
(5.86) between the image of the perfectly focused scene and the Gaussian blurring
function (5.87) which in this case is indicated with h σ ( p,e) (x, y), to indicate that
the dependency of the defocusing (blurring) depends on the distance p of the scene
from the optics and from the setting parameters e = (i, f, f # ) of the optical-sensor
system. We rewrite the blurring model given by the convolution equations and the
Gaussian PSF as follows:

I_s(x, y) = I_f(x, y) ∗ h_{σ(p,e)}(x, y)    (5.105)

with
h_{σ(p,e)}(x, y) = \frac{1}{2\pi \sigma_h^2}\, e^{-\frac{x^2 + y^2}{2\sigma_h^2}}    (5.106)

Analyzing these equations, it is observed that the defocused image Is (x, y) is known
with the acquisition while the parameters e are known with the calibration of the
system. The depth p and the in-focus image I_f(x, y) are instead unknown. The idea is to
acquire at least two defocused images with different settings e1 and e2 of the system
parameters to obtain at least a theoretical estimate of the depth p. Equation (5.105) is not
linear with respect to the unknown p, and therefore cannot be used to solve the problem
directly, also because of numerical stability issues if a minimization functional were
formulated. Pentland proposed to solve for the unknown p by operating in the Fourier
domain. In fact, if we consider the two defocused images acquired, modeled by the
two spatial convolutions, we have

I_{s1}(x, y) = I_f(x, y) ∗ h_{σ1}(x, y)        I_{s2}(x, y) = I_f(x, y) ∗ h_{σ2}(x, y)    (5.107)

and taking the ratio of the corresponding Fourier transforms (recalling (5.106)), we obtain

\frac{I_{s1}(u, v)}{I_{s2}(u, v)} = \frac{I_f(u, v)\, H_{σ1}(u, v)}{I_f(u, v)\, H_{σ2}(u, v)} = \frac{H_{σ1}(u, v)}{H_{σ2}(u, v)} = \exp\left\{-\frac{1}{2}(u^2 + v^2)(\sigma_1^2 - \sigma_2^2)\right\}    (5.108)

where σ1 = σ ( p, e1 ) and σ2 = σ ( p, e2 ). Applying now the natural logarithm to


both extreme members of (5.108), we get

\ln\frac{I_{s1}(u, v)}{I_{s2}(u, v)} = \frac{1}{2}(u^2 + v^2)\,[\sigma^2(p, e_2) - \sigma^2(p, e_1)]    (5.109)

where the ideal, perfectly focused image I_f cancels out. Knowing the transforms I_{s1}
and I_{s2}, and having calibrated the functions σ(p, e1) and σ(p, e2), it is possible to
derive from (5.109) the term (σ1² − σ2²), given by
! "
1 Is1 (u, v)
σ12 − σ22 = −2 2 ln (5.110)
u +v 2 Is2 (u, v) W

where the term • denotes an average calculated over an extended area W of the
spectral domain, instead of considering single frequencies (u, v) of a point in that
If one of the images is perfectly in focus, we have σ1 = 0, σ2 is estimated by (5.110), and
the depth p is calculated with (5.103). If, on the other hand, both images are defocused
because of the different settings of the system parameters, we have σ1 > 0 and σ2 > 0 and
two different distances i1 and i2 between the lens and the sensor plane. Substituting these
values into (5.103), we have

p = \frac{f \cdot i_1}{i_1 - f - 2 r_1 f_\#} = \frac{f \cdot i_2}{i_2 - f - 2 r_2 f_\#}    (5.111)
and considering the proportionality relation σh = kr we can derive a linear relation-
ship between σ1 and σ2 , given by:

σ1 = ασ2 + β (5.112)

where:
α = \frac{i_1}{i_2}    and    β = \frac{f \cdot i_1 \cdot k}{2 f_\#}\left(\frac{1}{i_2} - \frac{1}{i_1}\right)    (5.113)

In essence, we now have two equations that establish a relationship between σ1 and σ2:
(5.112) in terms of the known parameters of the optics-sensor system, and (5.110) in terms
of the level of blur between the two defocused images derived from the convolutions. Both
are useful for determining depth. In fact, from (5.110) we have σ1² − σ2² = C; replacing in
this the value of σ1 given by (5.112), we get an equation in the single unknown σ2, given by [32]:

(α² − 1)\,σ_2² + 2αβ\,σ_2 + β² = C    (5.114)

where:
C = \frac{1}{A} \iint_W \frac{-2}{u^2 + v^2}\, \ln\frac{I_{s1}(u, v)}{I_{s2}(u, v)}\, du\, dv    (5.115)

The defocus difference C = σ1² − σ2² is measured in the Fourier domain by averaging the
frequency values over a window W centered at the point (u, v) being processed in the images,
where A is the area of the window W. With (5.114), we have a quadratic equation from which
σ2 is estimated. If the principal distances are equal, i1 = i2, we have α = 1 and a single
value of σ2 is obtained. Once the parameters of the optics-sensor system are known, the depth
p can be calculated with one of the two equations (5.103). The procedure is repeated for each
pixel of the image, thus obtaining a dense depth map, having acquired only two defocused images
with different settings of the acquisition system.
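A schematic sketch of this two-image procedure is given below; it is only an outline of Eqs. (5.114)–(5.115) and (5.103), with the frequency convention, the calibration constants α, β, k and the choice of the physically meaningful root left as assumptions of the implementation.

```python
import numpy as np

def defocus_difference(patch1, patch2, eps=1e-6):
    """Estimate C = sigma1^2 - sigma2^2 on a local window W, in the spirit of Eq. (5.115).
    Angular frequencies are used; in practice a calibration factor relates this
    discrete estimate to the sigma values of the camera model."""
    F1 = np.abs(np.fft.fft2(patch1)) + eps
    F2 = np.abs(np.fft.fft2(patch2)) + eps
    u = 2.0 * np.pi * np.fft.fftfreq(patch1.shape[0])[:, None]
    v = 2.0 * np.pi * np.fft.fftfreq(patch1.shape[1])[None, :]
    w2 = u ** 2 + v ** 2
    mask = w2 > 0                                  # exclude the DC term
    return np.mean(-2.0 / w2[mask] * np.log(F1[mask] / F2[mask]))

def sigma2_from_C(C, alpha, beta):
    """Solve (alpha^2 - 1) s^2 + 2 alpha beta s + beta^2 - C = 0, Eq. (5.114)."""
    a, b, c = alpha ** 2 - 1.0, 2.0 * alpha * beta, beta ** 2 - C
    if abs(a) < 1e-12:                             # alpha = 1: the equation is linear
        return -c / b
    disc = np.sqrt(b * b - 4.0 * a * c)            # real roots assumed
    return max((-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a))

def depth_from_sigma(sigma, k, f, i, f_num):
    """Depth from sigma via r = sigma / k and Eq. (5.103), case delta = i - q."""
    r = sigma / k
    return f * i / (i - f - 2.0 * f_num * r)
```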
The approaches of S f D described are essentially based on the measurement of
the defocusing level of multiple images with different settings of the parameters of
the acquisition system. This measure is estimated for each pixel, often also considering
the neighboring pixels included in a square window of adequate size, assuming that the
projected points of the scene have constant depth. The use of this local window also tends
to average out noise and minimize artifacts. In the literature, SfD methods have been
proposed based on global algorithms that operate simultaneously on the whole image, under
the hypothesis that image intensity and shape are spatially correlated, although the image
formation process tends to lose intensity-shape information. This leads to a typically
ill-posed problem; solutions have been proposed based on regularization [33], which
introduces minimization functionals and thus turns the ill-posed problem into a numerical
approximation or energy minimization problem, or formulations based on Markov random
fields (MRF) [34], or a diffusion process based on differential equations [35].

References
1. B.K.P. Horn, Shape from Shading: A Method for Obtaining the Shape of a Smooth Opaque
Object from One View. Ph.D. thesis (MIT, Boston-USA, 1970)
2. E. Mingolla, J.T. Todd, Perception of solid shape from shading. Biol. Cybern. 53, 137–151
(1986)
3. V.S. Ramachandran, Perceiving shape from shading. Sci. Am. 159, 76–83 (1988)
4. K. Ikeuchi, B.K.P. Horn, Numerical shape from shading and occluding boundaries. Artif. Intell.
17, 141–184 (1981)
5. A.P. Pentland, Local shading analysis. IEEE Trans. Pattern Anal. Mach. Intell. 6, 170–184
(1984)
6. R.J. Woodham, Photometric method for determining surface orientation from multiple images.
Opt. Eng. 19, 139–144 (1980)
7. H. Hayakawa, Photometric stereo under a light source with arbitrary motion. J. Opt. Soc.
Am.-Part A: Opt., Image Sci., Vis. 11(11), 3079–3089 (1994)
8. P.N. Belhumeur, D.J. Kriegman, A.L. Yuille, The bas-relief ambiguity. J. Comput. Vis. 35(1),
33–44 (1999)
9. B. Horn, M.J. Brooks, The variational approach to shape from shading. Comput. Vis., Graph.
Image Process. 33, 174–208 (1986)
10. E.N. Coleman, R. Jain, Obtaining 3-dimensional shape of textured and specular surfaces using
four-source photometry. Comput. Graph. Image Process. 18(4), 1309–1328 (1982)
11. K. Ikeuchi, Determining the surface orientations of specular surfaces by using the photometric
stereo method. IEEE Trans. Pattern Anal. Mach. Intell. 3(6), 661–669 (1981)
12. R.T. Frankot, R. Chellappa, A method for enforcing integrability in shape from shading algo-
rithms. IEEE Trans. Pattern Anal. Mach. Intell. 10, 439–451 (1988)
13. R. Basri, D.W. Jacobs, I. Kemelmacher, Photometric stereo with general, unknown lighting.
Int. J. Comput. Vis. 72(3), 239–257 (2007)
14. T. Wei, R. Klette, On depth recovery from gradient vector fields, in Algorithms, Architectures
and Information Systems Security, ed. by B.B. Bhattacharya (World Scientific Publishing,
London, 2009), pp. 75–96
15. K. Reinhard, Concise Computer Vision, 1st edn. (Springer, London, 2014)
16. J.L. Posdamer, M.D. Altschuler, Surface measurement by space-encoded projected beam sys-
tems. Comput. Graph. Image Process. 18(1), 1–17 (1982)
17. E. Horn, N. Kiryati, Toward optimal structured light patterns. Int. J. Comput. Vis. 17(2), 87–97
(1999)
18. D. Caspi, N. Kiryati, J. Shamir, Range imaging with adaptive color structured light. IEEE
Trans. PAMI 20(5), 470–480 (1998)
19. P.S. Huang, S. Zhang, A fast three-step phase shifting algorithm. Appl. Opt. 45(21), 5086–5091
(2006)
20. J. Gühring, Dense 3-D surface acquisition by structured light using off-the-shelf components.
Methods 3D Shape Meas. 4309, 220–231 (2001)
21. C. Brenner, J. Böhm, J. Gühring, Photogrammetric calibration and accuracy evaluation of a
cross-pattern stripe projector, in Videometrics VI 3641 (SPIE, 1999), pp. 164–172
22. Z.J. Geng, Rainbow three-dimensional camera: new concept of high-speed three-dimensional
vision systems. Opt. Eng. 35(2), 376–383 (1996)
23. K.L. Boyer, A.C. Kak, Color-encoded structured light for rapid active ranging. IEEE Trans.
PAMI 9(1), 14–28 (1987)
24. M. Maruyama, S. Abe, Range sensing by projecting multiple slits with random cuts. IEEE
Trans. PAMI 15(6), 647–651 (1993)
25. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach.
Intell. 22(11), 1330–1334 (2000)
26. S.K. Nayar, Y. Nakagawa, Shape from focus. IEEE Trans. PAMI 16(8), 824–831 (1994)

27. A.P. Pentland, A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4),
523–531 (1987)
28. M. Subbarao, Efficient depth recovery through inverse optics, in Machine Vision for Inspection
and Measurement (Academic press, 1989), pp. 101–126
29. E. Krotkov, Focusing. J. Comput. Vis. 1, 223–237 (1987)
30. M. Subbarao, T.S. Choi, Accurate recovery of three dimensional shape from image focus. IEEE
Trans. PAMI 17(3), 266–274 (1995)
31. S. Pertuza, D. Puiga, M.A. Garcia, Analysis of focus measure operators for shape-from-focus.
Pattern Recognit. 46, 1415–1432 (2013)
32. C. Rajagopalan, Depth recovery from defocused images, in Depth From Defocus: A Real
Aperture Imaging Approach (Springer, New York, 1999), pp. 14–27
33. V.P. Namboodiri, C. Subhasis, S. Hadap. Regularized depth from defocus, in ICIP (2008), pp.
1520–1523
34. A.N. Rajagopalan, S. Chaudhuri, An mrf model-based approach to simultaneous recovery of
depth and restoration from defocused images. IEEE Trans. PAMI 21(7), 577–589 (1999)
35. P. Favaro, S. Soatto, M. Burger, S. Osher, Shape from defocus via diffusion. IEEE Trans. PAMI
30(3), 518–531 (2008)
6 Motion Analysis

6.1 Introduction

So far we have considered the objects of the world and the observer both stationary,
that is, not in motion. We are now interested in studying a vision system capable of
perceiving the dynamics of the scene, in analogy to what happens, in the vision sys-
tems of different living beings. We are aware, that these latter vision systems require
a remarkable computing skills, instant by instant, to realize the visual perception,
through a symbolic description of the scene, deriving various information of depth
and form, with respect to the objects themselves of the scene.
For example in the human visual system, the dynamics of the scene is captured by
stereo binocular images slightly different in time, acquired simultaneously by two
eyes, and adequately combined to produce a single 3D perception of the objects of
the scene. Furthermore, by observing the scene over time, it is able to reconstruct the
scene completely, differentiating moving 3D objects from stationary ones. In
essence, it realizes the visual tracking of moving objects, deriving useful qualitative
and quantitative information on the dynamics of the scene.
This is possible given the capacity of biological systems to manage spatial and tem-
poral information through different elementary processes of visual perception, ade-
quate and fundamental, for interaction with the environment. The temporal dimension
in visual processing plays a role of primary importance for two reasons:

1. the apparent motion of the objects in the image plane is an indication to understand
the structure and 3D motion;
2. the biological visual systems use the information extracted from time-varying
image sequences to derive properties of the 3D world with a little a priori knowl-
edge of the same.

Motion analysis has long been considered a specialized field of research that had
nothing to do with image processing in general, for two reasons:


1. the techniques used to analyze movement in image sequences were quite different;
2. the large amount of memory and computing power required to process image
sequences made this analysis available only to specialized research laboratories
that could afford the necessary resources.

These two reasons no longer hold: the methods used in motion analysis do not differ from
those used for image processing in general, and image sequence analysis algorithms can also
run on ordinary personal computers. The perception of
movement in analogy to other visual processes (color, texture, contour extraction,
etc.) is an inductive visual process. Visual photoreceptors derive motion information
by evaluating the variations in light intensity of the 2D (retinal) image formed in the
observed 3D world. The human visual system adequately interprets these changes in
brightness in time-varying image sequences to realize the perception of moving 3D
objects in the scene. In this chapter, we will describe how it is possible to derive 3D
motion, almost in real time, from the analysis of time-varying 2D image sequences.
Some studies on the analysis of movement have shown that the perception of
movement derives from the information of objects by evaluating the presence of
occlusions, texture, contours, etc. Psychological studies have shown that the visual
perception of movement is based on the activation of neural structures. In some
animals, it has been shown that the lesion of some parts of the brain has led to
the inability to perceive movement. These losses of visual perception of movement
were not associated with the loss of visual perception of color and sensitivity to
perceive different patterns. This suggests that some parts of the brain are specialized
for movement perception (see Sect. 4.6.4 describing the functional structure of the
visual cortex).
We are interested in studying the perception of movement that occurs in the phys-
ical reality, and not the apparent movement. A typical example of apparent motion occurs
when observing advertising light panels, in which sequences of zones light up and switch off
at different times with respect to others that always remain on. Other examples of apparent
motion can be produced by varying the color or luminous intensity of some objects.
Figure 6.1 shows two typical examples of movement: in photo (a) the object of interest
moves sideways while the photographer remains still, whereas in photo (b) it is the observer
who moves toward the house.
The images a and b of Fig. 6.2 show two images (normally in focus) of a video
sequence acquired at the frequency of 50 Hz. Some differences between the images
are evident in the first direct comparison. If we subtract the images from each other,
the differences become immediately visible, as seen in Fig. 6.2c. In fact, the dynamics
of the scene indicates the movement of the players near the goal.
From the difference image c, it can be observed that all the parts not in motion
are in black (in the two images the intensity remained constant), while the moving
parts are well highlighted, and one can even appreciate the different speeds of the moving
parts (for example, the goalkeeper moves less than the other players). Even from this
qualitative description, it is obvious that motion analysis helps us considerably in
understanding the dynamics of a scene. All the parts not in motion of the scene are

Fig. 6.1 Qualitative motion captured from a single image. Example of lateral movement captured
in the photo (a) by a stationary observer, while in photo (b) the perceived movement is of the
observer moving toward the house

Fig. 6.2 Pair of images from a sequence acquired at the frequency of 50 Hz. The dynamics of
the scene indicates the approach of the players toward the ball. The difference between
images (a) and (b) is shown in (c)

dark, while the moving parts are enhanced by the difference values (the areas
with brighter pixels).
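The frame-differencing idea behind Fig. 6.2c can be sketched in a few lines; the threshold value is arbitrary, and, as discussed next, the result is meaningful only if the illumination stays constant between the two frames.

```python
import numpy as np

def motion_mask(frame_prev, frame_curr, threshold=15):
    """Candidate moving regions from the absolute difference of two gray-level
    frames of a sequence; static areas remain (near) zero."""
    diff = np.abs(frame_curr.astype(np.int16) - frame_prev.astype(np.int16))
    return diff > threshold
```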
We can summarize what has been said by affirming that the movement (of the
objects in the scene or of the observer) can be detected by the temporal variation of
the gray levels; unfortunately, the inverse implication is not valid, namely that all
changes in gray levels are due to movement.
This last aspect depends on the possible simultaneous change of the lighting con-
ditions while the images are acquired. In fact, in Fig. 6.2 the scene is well illuminated
by the sun (you can see the shadows projected on the playing field very well), but if
a cloud suddenly changes the lighting conditions, it would not be possible to derive
motion information from the time-varying image sequence because the gray levels
of the images also change due to the change in lighting. Thus an analysis of the dynamics
of the scene based on the difference of the space-time-varying images would produce artifacts.
In other words, we can say that from a sequence of space-time-varying images f(x, y, t) it is
possible to derive motion information by analyzing

the gray-level variations between pairs of images in the sequence. Conversely, it


cannot be said that any change in the levels of gray over time can be attributed to the
motion of the objects in the scene.

6.2 Analogy Between Motion Perception and Depth Evaluated


with Stereo Vision

Experiments have shown the analogy between motion perception and depth (distance)
of objects derived from stereo vision (see Sect. 4.6.4). It can easily be observed that,
for an observer in motion with respect to an object, some areas of the retina show local
variations of luminous intensity deriving from the motion, which contain visual information
that depends on the distance of the observer from the various points of the scene.
The variations in brightness on the retina change in a predictable manner in rela-
tion to the direction of motion of the observer and the distance between objects and
observer. In particular, objects that are farther away generally appear to move more
slowly than objects closer to the observer. Similarly, points along the direction of
motion of the observer move slowly with respect to points that lie in other direc-
tions. From the variations of luminous intensity perceived on the retina the position
information of the objects in the scene with respect to the observer is derived.
The motion field is calculated from the motion information derived from a sequence of
time-varying images. Gibson [1] defined an important algorithm that correlates the perception
of movement and the perception of distance. The algorithm calculates the flow field of the
movement and proposes procedures to extract the ego-motion information, that is, to derive
the motion of the observer and the depth of the objects from the observer by analyzing the
information of the motion flow field.
Motion and depth estimation have the same purpose and use similar types of
perceptive stimuli. The stereo vision algorithms use the information of the two retinas
to recover the depth information, based on the diversity of the images obtained by
observing the scene from slightly different points of view. Instead, motion detection
algorithms use coarser information deriving from image sequences that show slight
differences between them due to the motion of the observer or to the relative motion
of the objects with respect to the observer. As an alternative to determining the
variations of gray levels for motion estimation, in analogy to the correspondence
problem for stereo vision, some characteristic elements (features) can be identified
in the sequence images that correspond to the same objects of the observed scene,
and the spatial difference of these features in the image, caused by the movement of
the object, can then be evaluated.
While in stereo vision this spatial difference (called disparity) of a feature is due
to the different position of observation of the scene, in the case of the sequence of time-
varying images acquired by the same stationary observer, the disparity is determined
by consecutive images acquired at different times. The temporal frequency of image

acquisition, normally realized with standard cameras, is 50 or 60 fields per second
(i.e., a frame rate—fps, frames per second—of 25 or 30 complete images), that is, the
acquisition of a complete image every 1/25 or 1/30 of a second. There are nonstandard
cameras that reach over 1000 images
per second and with different spatial resolutions. The values of spatial disparity, in
the case of time-varying image sequences, are relatively smaller than those of stereo
vision.
The problem of correspondence is caused by the inability to unambiguously find
corresponding points in two consecutive images of a sequence. A first example occurs
in the case of deformable objects or in the case of small particles moving in a flow
field. Given their intrinsic relative motion, which changes continuously, it is not possible
to make any estimate of the displacements because there are no visible features that can
easily be determined and tracked.
We could assume that the correspondence problem will not exist for rigid objects
that show variations in the gray levels not due to the change in lighting. In general,
this assumption is not robust, since even for rigid objects there is ambiguity when
one wants to analyze the motion of periodic and nonperiodic structures, since a local
operator is not able to find univocally the correspondence of these structures because of
their partial view. In essence, since it is not possible to fully observe these structures, the
detection of the displacement would be ambiguous. This subproblem of correspon-
dence is known in the literature as the aperture problem which we will analyze in
the following paragraphs.
At a higher level of abstraction, we can say that physical correspondence, i.e., the
actual correspondence of real objects, may not be identical to the visual correspon-
dence of the image. This problem has two aspects:

1. we can find the visual correspondence without the existence of a physical corre-
spondence, as in the case of indistinguishable objects;
2. a physical correspondence does not generally imply a visual correspondence, as
in the case in which we are not able to recognize a visual correspondence due to
variations in lighting.

In addition to the correspondence problem, the motion presents a further subproblem,


known as reconstruction, defined as follows:

given a number of corresponding elements and, possibly, knowledge of the


intrinsic parameters of the camera, what can be said about the 3D motion and
the observed world structure?

The methods used to solve the correspondence and reconstruction problems are
based on the following assumption:

there is only one, rigid, relative motion between the camera and the observed
scene, moreover the lighting conditions do not change.

This assumption implies that the observed 3D objects cannot move according
to different motions. If the dynamics of the scene consists of multiple objects with
movements different from that of the observer, another problem will have to be considered:
the flow image of the movement will have to be segmented to select the
individual regions that correspond to the different objects with different motions (the
problem of segmentation).

6.3 Toward Motion Estimation

In Fig. 6.3, we observe a sequence of images acquired with a very small time interval
Δt such that the difference is minimal in each consecutive pair of the image
sequence. This difference in the images depends on the variation of the geometric
relationships between the observer (for example, a camera), the objects of the scene,
and the light source. It is these variations, determined in each pair of images of the
sequence, that are at the basis of the motion estimation and stereo vision algorithms.
In the example shown in the figure, the dynamic of the ball is the main objective
in the analysis of the sequence of images which, once detected in the spatial domain
of an image, is tracked in the successive images of the sequence (time domain) while
it approaches the goal, in order to detect the Goal–NoGoal event.
Let P(X, Y, Z) be a 3D point of the scene projected (with the pin-hole model) at the
time t·Δt onto a point p(x, y) of the image I(x, y, t·Δt) (see Fig. 6.4). If the 3D
motion of P is linear and with speed V = (VX, VY, VZ), in the time interval Δt it will
move by V·Δt to Q = (X + VX·Δt, Y + VY·Δt, Z + VZ·Δt). The motion of P with
speed V induces the motion of p in the image plane at the speed v(vx, vy), moving
to the point q(x + vx·Δt, y + vy·Δt) of the image I(x, y, (t+1)·Δt). The apparent
motion of the intensity variation of the pixels is called optical flow v = (vx, vy).
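To make the mapping from 3D velocity to image-plane velocity concrete, the following minimal sketch differentiates the pin-hole projection x = f·X/Z, y = f·Y/Z with respect to time; the focal length f and the numerical values are hypothetical and serve only as an illustration, not as part of the text's formulation.

import numpy as np

def image_velocity(P, V, f=1.0):
    # Apparent image-plane velocity (vx, vy) of a 3D point P = (X, Y, Z) moving
    # with velocity V = (VX, VY, VZ), obtained by differentiating the pin-hole
    # projection x = f*X/Z, y = f*Y/Z with respect to time.
    X, Y, Z = P
    VX, VY, VZ = V
    vx = f * (VX * Z - X * VZ) / Z**2
    vy = f * (VY * Z - Y * VZ) / Z**2
    return np.array([vx, vy])

# Hypothetical example: a point 10 m away moving laterally at 2 m/s
print(image_velocity(P=(1.0, 0.5, 10.0), V=(2.0, 0.0, 0.0)))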

Fig. 6.3 Goal–NoGoal detection. Sequence of space–time-variant images and motion field calcu-
lated on the last images of the sequence. The main object of the scene dynamics is only the motion
of the ball


Fig. 6.4 Graphical representation of the formation of the velocity flow field produced on the retina
(by perspective projection) generated by the motion of an object of the scene considering the
observer stationary

[Fig. 6.5 panels: moving toward the object; moving away from the object; rotation; translation from right to left]

Fig. 6.5 Different types of ideal motion fields induced by the motion of the observer toward the
object or vice versa

Figure 6.5 shows ideal examples of optical flow with different vector fields gen-
erated by various types of motion of the observer (with uniform speed) which, with
respect to the scene, approaches or moves away, or moves laterally from right to left,
or rotates the head. The flow vectors represent an estimate of the variations of points
in the image that occur in a limited space–time interval. The direction and length
of each flow vector correspond to the direction and magnitude of the local motion induced when the
observer moves with respect to the scene or vice versa, or both move.
The projection of velocity vectors in the image plane, associated with each 3D
point of the scene, defines the motion field. Ideally, motion field and optical flow
should coincide. In reality, this is not true since the motion field associated with
an optical flow can be caused by an apparent motion induced by the change of the
lighting conditions and not by a real motion or by the aperture problem mentioned
above. An effective example is given by the barber pole1 illustrated in Fig. 6.6, where
the real motion of the cylinder is circular and the perceived optical flow is a vertical

1 Panel used in the Middle Ages by barbers. On a white cylinder, a red ribbon is wrapped helically. The

cylinder continuously rotates around its vertical axis and all the points of the cylindrical surface move
horizontally. It is observed instead that this rotation produces the illusion that the red ribbon moves
vertically upwards. The motion is ambiguous because it is not possible to find corresponding points
in motion in the temporal analysis as shown in the figure. Hans Wallach, a psychologist, discovered
in 1935 that the illusion is less if the cylinder is shorter and wider, and the perceived motion is
correctly lateral. The illusion is also solved if texture is present on the tape.

[Fig. 6.6 panels: real motion vs. perceived motion; motion field, optical flow, aperture problem]

Fig. 6.6 Illusion with the barber pole. The white cylinder, with the red ribbon wrapped helically,
rotates clockwise but the stripes are perceived to move vertically upwards. The perceived optical
flow does not correspond to the real motion field, which is horizontal from right to left. This illusion
is caused by the aperture problem or by the ambiguity of finding the correct correspondence of points
on the edge of the tape (in the central area when observed at different times) since the direction of
motion of these points is not determined uniquely by the brain

motion field when the real one is horizontal. The same happens when, standing in a
stationary train and observing an adjacent train, we have the feeling that we are
moving when instead it is the other that is moving. This happens because we have
a limited opening from the window and we don’t have precise references to decide
which of the two trains is really in motion.
Returning to the sequence of space–time-variant images I(x, y, t), the information
content captured about the dynamics of the scene can be effectively analyzed in a
space–time graph. For example, with reference to Fig. 6.3, we could analyze in the
space–time diagram (t, x) the dynamics of the ball structure (see Fig. 6.7). In the
sequence of images, the dominant motion of the ball is horizontal (along the x axis)
moving toward the goalposts. If the ball is stationary, the position in the time-variant
images does not change and in the diagram this state is represented by a horizontal
line. When the ball moves at a constant speed the trace described by its center of mass
is an oblique straight line and its inclination with respect to the time axis depends on
the speed of the ball and is given by
ν = Δx/Δt = tan(θ)    (6.1)
where θ is the angle between the time axis t and the direction of movement of
the ball given by the mass centers of the ball located in the time-varying images
of the sequence or from the trajectory described by the displacement of the ball in
the images of the sequence. As shown in Fig. 6.7, a moving ball is described in the
plane (t, x) with an inclined trajectory while for a stationary ball, in the sequence


[Fig. 6.7 panels: stationary ball; motion with uniform velocity — x versus t (msec) diagrams]

Fig. 6.7 Space–time diagram of motion information. In the diagram on the left the horizontal line
indicates stationary motion while on the right the inclined line indicates motion with uniform speed
along the x axis

of images, the gray levels associated with the ball do not vary, and therefore in the
plane (t, x) we will see the trace of the motion of the ball which remains horizontal
(constant gray levels).
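As a small numeric illustration of Eq. (6.1), with purely hypothetical values, the apparent speed can be read off the slope of the space–time trace:

# Slope of the space-time trace (Eq. 6.1); the numbers are hypothetical.
dx_mm = 660.0   # horizontal displacement of the ball centroid between frames (mm)
dt_ms = 20.0    # time between frames at 50 Hz (ms)

v = dx_mm / dt_ms   # apparent speed along x: 33 mm/ms = 33 m/s (about 119 km/h)
print(f"tan(theta) = v = {v:.1f} mm/ms")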
In other words, we can say that in the space–time (x, y, t) the dynamics of the
scene is estimated directly from the orientation in continuous space–time (t, x) and
not as discrete shifts by directly analyzing two consecutive images in the space
(x, y). Therefore, the motion analysis algorithms should be formulated in the contin-
uous space–time (x, y, t) for which the level of discretization to adequately describe
motion becomes important. In this space, observing the trace of the direction of
motion and how it is oriented with respect to the time axis, an estimate of the speed
is obtained. On the other hand, by observing only the motion of some points of the
contour of the object, the orientation of the object itself would not be univocally
obtained.

6.3.1 Discretization of Motion

In physical reality, the motion of an object is described by continuous trajectories and


characterized by its relative velocity with respect to the observer. A vision machine
to adequately capture the dynamics of the object will have to acquire a sequence
of images with a temporal resolution correlated to the speed of the object itself. In
particular, the acquisition system must operate with a suitable sampling frequency,
in order to optimally approximate the continuous motion described by the object.
For example, to capture the motion of a car in a high-speed race (300 km/h), it is
necessary to use vision systems with a high temporal acquisition frequency (for example,
using cameras with acquisition speeds even higher than 1000 images per second, in
the literature also known as frame rate, to indicate the number of images acquired
in a second).
Another reason that affects the choice of sampling time frequency concerns the
need to display stable images. In the television and film industry, the most advanced
technologies are used to reproduce dynamic scenes obtaining excellent results in
the stability of television images and the dynamics of the scene reproduced with
the appearance of the continuous movement of objects. For example, the European


Fig. 6.8 Goal–NoGoal event detection. a The dynamics of the scene is taken for each goal by a
pair of cameras with high temporal resolution arranged as shown in the figure on the opposite sides
with the optical axes (aligned with the Z -axis of the central reference system (X, Y, Z )) coplanar
with the vertical plane α of the goal. The significant motion of the ball approaching toward the goal
box is in the domain time–space (t, x), detected with the acquisition of image sequences. The 3D
localization of the ball with respect to the central reference system (X, Y, Z ) is calculated by the
triangulation process carried out by the relative pair of opposite and synchronized cameras, suitably
calibrated, with respect to the known positions of the vertical plane α and the horizontal goal area.
b Local reference system (x, y) of the sequence images

television standard uses a time frequency of 25 Hz displaying 25 static images per


second. The display technology, combined with the persistence of the image on the
retina, performs two video scans to present a complete image. In the first scan, the
even horizontal lines are drawn (first field) and in the second scan the odd horizontal
lines (second field) are drawn. Altogether there is a temporal frequency of 50 fields
per second to reproduce 25 static images corresponding to a spatial resolution of 625
horizontal lines. To improve stability (i.e., to attenuate the flickering of the image on the
monitor), the reproduced video images are displayed on digital-technology monitors
that have a temporal frequency even higher than 1000 Hz.
Given this, let us now consider the criterion with which to choose the time
frequency of image acquisition. If the motion between observer and scene is null,
to reproduce the scene it is sufficient to acquire a single image. If instead, we want
to reproduce the dynamics of a sporting event, as in the game of football, to determine
whether the ball has completely crossed the vertical plane of the goalposts (in the
affirmative case we have the goal event), it is necessary to acquire a sequence of
images with a temporal frequency adequate to the speed of the ball. Figure 6.8
shows the arrangement of the cameras to acquire the sequence of images of the
Goal–NoGoal event [2–5]. In particular, the cameras are positioned at the corners of
the playing field with the optical axis coplanar to the vertical plane passing through
the inner edges of the goalposts (poles and crossbar). The pair of opposite cameras
are synchronized and simultaneously observe the dynamics of the scene. As soon as
the ball enters the field of view of the cameras, it is located and begins its tracking
in the two sequences of images to monitor the approach of the ball toward the goal
box.

The goal event occurs only when the ball (about 22 cm diameter) completely
crosses the goal box (plane α), that is, it completely exceeds the goalposts-crossbar
and goal line inside the goal as shown in the figure. The ball can reach a speed
of 120 km/h. To capture the dynamics of the goal event, it is necessary to acquire
a sequence of images by observing the scene as shown in the figure, from which it
emerges that the significant and dominant motion is the lateral one: the trajectory
of the ball, which moves toward the goal, is almost always orthogonal to the optical
axis of the cameras.
Figure 6.8 shows the dynamics of the scene being acquired by discretely sampling
the motion of the ball over time. As soon as the ball appears in the scene, it
is detected in an image I(x, y, t1) of the time-varying sequence at time t1,
and the lateral motion of the ball is tracked in the spatial domain (x, y) of the
images of the sequence acquired in real time with the frame rate defined by the
camera, which characterizes the level of temporal discretization Δt of the dynamics of
the event. In the figure, we can observe the 3D and 2D discretized trajectory of the
ball for some consecutive images of the two sequences captured by the two opposite
and synchronized cameras. It is also observed that the ball is spatially spaced (in
consecutive images of the sequence) with a value inversely proportional to the frame
rate. We will now analyze the impact of time sampling with the dynamics of the goal
event that we want to detect.
In Fig. 6.9a, the whole sequence of images (related to one camera) is represented
where it is observed that the ball moves toward the goal, with a uniform speed v,
leaving a cylindrical track of diameter equal to the real dimensions of the ball. In
essence, in this 3D space–time of the sequence I (x, y, t), the dynamics of the scene
is graphically represented by a parallelepiped where the images of the sequence
I (x, y, t) that vary over time t are stacked. Figure 6.9c, which is a section (t, x) of
the parallelepiped, shows the dominant and significant signal of the scene, that is,
the trajectory of the ball useful to detect the goal event if the ball crosses the goal
(i.e., the vertical plane α). In this context, the space–time diagram (t, y) is used to


Fig. 6.9 Parallelepiped formed by the sequence of images that capture the Goal–NoGoal event. a
The motion of the ball moving at uniform speed is represented in this 3D space–time by a slanted
cylindrical track. b A cross section of the parallelepiped, i.e., the image plane (x, y) of the sequence
at the t-th instant, shows the current position of the entities in motion. c A parallelepiped section (t, x) at a
given height y represents the space–time diagram that includes the significant information of the
motion structure

indicate the position of the cylindrical track of the ball with respect to the goal box
(see Fig. 6.9b).
Therefore, according to Fig. 6.8b, we can affirm that from the time–space diagram
(t, x) we can detect the image, at time tgoal, of the sequence in which the ball crossed the
vertical plane α of the goal box, with the x-axis indicating the horizontal position of
the ball, useful for calculating its distance from the plane α with respect to the central
reference system (X, Y, Z ). From the time–space diagram (t, y) (see Fig. 6.9b), we
obtain instead the vertical position of the ball, useful to calculate the coordinate Y
with respect to the central reference system (X, Y, Z ).
Once the centers of mass of the ball have been determined in the synchronized images

IC1 (tgoal , xgoal , ygoal ) and IC2 (tgoal , xgoal , ygoal )

relative to the opposite cameras C1 and C2, we have the information that the ball
has crossed the plane α, useful to calculate the horizontal coordinate X of the central
reference system; but, to detect the goal event, it is now necessary to determine whether the
ball is in the goal box by evaluating its position (Y, Z) in the central reference system
(see Fig. 6.8a). This is possible through the triangulation between the two cameras
having previously calibrated them with respect to the known positions of the vertical
plane α and the horizontal goal area [2].
It should be noted that in Fig. 6.9 in the 3D space–time representation, for sim-
plicity the dynamics of the scene is indicated assuming a continuous motion even
if the images of the sequence are acquired with a high frame rate, and in the plane
(t, x) the resulting trace is inclined by an angle θ with a value directly proportional
to the speed of the object.
Now let’s see how to adequately sample the motion of an object to avoid the phe-
nomenon known as time aliasing, which introduces distortions in the signal due to
an undersampling. With the technologies currently available, once the spatial reso-
lution2 is defined, the continuous motion represented by Fig. 6.9c can be discretized
with a sampling frequency that can vary from a few images to thousands of images
per second.
Figure 6.10 shows the relation between the speed of the ball and its displacement
in the acquisition time interval between two consecutive images, an interval that
remains constant during the acquisition of the entire sequence. It can be observed that,
as the frame rate grows, the displacement of the object varies from meters to
a few millimeters, i.e., the displacement decreases as the sampling time frequency
increases. For example, for a ball with a speed of 120 km/h and the acquisition of

2 We recall that we also have the phenomenon of the spatial aliasing already described in Sect. 5.10

Vol. I. According to the Shannon–Nyquist theorem, a sampled continuous function (in the time or
space domain) can be completely reconstructed if (a) the sampling frequency is equal to or greater
than twice the frequency of the maximum spectral component of the input signal (also called Nyquist
frequency) and (b) the spectrum replicas are removed in the Fourier domain, remaining only the
original spectrum. The latter process of removal is the anti-aliasing process of signal correction by
eliminating spurious space–time components.

[Fig. 6.10 plot: object displacement (mm) versus object velocity (km/h), for frame rates of 25, 50, 120, 260, 400, and 1000 fps]

Fig. 6.10 Relationship between speed and displacement of an object as the time sampling frequency
of the sequence images I (x, y, t) changes

the sequence with a frame rate of 400 fps, the movement of the ball in the direction of
the x-axis is 5 mm between consecutive images. In the opposite case (i.e., with an inadequate
frame rate), time aliasing would occur, with the ball appearing in the sequence images in the
form of an elongated ellipsoid, and one could hardly estimate the motion. Time aliasing also
generates the helix effect observed on a television video when the propeller of an airplane
seems to rotate in the inverse direction with respect to the real one.
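The relation underlying Fig. 6.10 is simply the displacement per frame given by the speed divided by the frame rate; the sketch below (illustrative values only) reproduces the world-space relation plotted in that figure.

# Displacement of the object between consecutive frames as a function of the
# frame rate (values are illustrative, in the spirit of Fig. 6.10).
speed_kmh = 150.0
speed_mm_s = speed_kmh * 1e6 / 3600.0      # km/h -> mm/s

for fps in (25, 50, 120, 260, 400, 1000):
    displacement_mm = speed_mm_s / fps     # displacement between consecutive frames
    print(f"{fps:5d} fps -> {displacement_mm:7.1f} mm per frame")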
Figure 6.11 shows how the continuous motion represented in Fig. 6.9c is well
approximated by sampling the dynamics of the scene with a very high time sampling
frequency, whereas, as the sampling frequency decreases, the discretized trajectory of
the ball departs more and more from the continuous motion. The speed of the object, in addition to


Fig. 6.11 Analysis in the Fourier domain of the effects of time sampling on the 1D motion of an
object. a Continuous motion of the object in the time–space domain (t, x) and in the corresponding
Fourier domain (u t , vx ); in b, c and d the analogous representations are shown with a time sampling
which is decreased by a factor 4

producing an increase in the displacement of the object, also produces an increase in
the inclination of the motion trajectory in the time–space domain (t, x). A horizontal
trajectory would mean that the object is stationary relative to the observer.
Let us now analyze the possibility of finding a method [6] that can quantitatively
evaluate a good compromise between spatial and temporal resolution, to choose a
suitable time sampling frequency value for which the dynamics of a scene can be
considered acceptable with respect to continuous motion and thus avoid the problem
of aliasing. A possible method is based on Fourier analysis. The spatial and temporal
frequencies described in the (t, x) domain can be projected in the Fourier frequency
domain (u t , vx ). In this domain, the input signal I (t, x) is represented in the spatial
and temporal frequency domain (u t , vx ), which best highlights the effects of sampling
at different frequencies.
In the graphical representation (t, x), the dynamics of the scene is represented by
the trace described by the ball and by its inclination with respect to the horizontal
axis of time which depends on the speed of the ball. In the Fourier domain (u t , vx ),
the motion of the ball is still a rectilinear strip (see the second line of Fig. 6.11) whose
inclination is evaluated with respect to the time frequency axis u t . From the figure, it
can be observed that if the ball is stationary, the energy represented in Fourier space
is concentrated at zero temporal frequency (no motion). As soon as the ball moves with
speed ν, each component of the spatial frequencies vx (expressed in cycles/degrees)
associated with the movement of the ball is modified at the same speed and generates
a temporal modulation at the frequency:

u t = ν · vx (6.2)

with u t (expressed in cycles/sec). The graphical representation in the Fourier domain


(ut, vx) is shown in the second line of Fig. 6.11, where it can be seen how the
spectrum changes from that of continuous motion as the sampling frequency decreases, and
how its inclination with respect to the axis ut depends on the speed of the object. The
motion of the ball with different time sampling frequencies is shown in Fig. 6.11b,
c, and d. It is observed how the replicas of the spectrum are spaced from each other
proportionally to the time sampling frequency. In correspondence with the high
temporal frequencies, there is a wide spacing of the replicas of the spectrum, vice
versa, in correspondence with the low frequencies of temporal sampling the distance
of the replicas decreases. In essence, the sampled motion generates replicas of the
original continuous motion spectrum along the direction of the frequency axis ut, and
these replicas are easily discriminable.
It is possible to define a temporal frequency Ut above which the sampling of the
scene dynamics can be considered acceptable, resulting in a good approximation of the
continuous motion of the observed scene. When the replicas fall outside this threshold
value (as in Fig. 6.11b) and are not detectable, the continuous motion and the sampled
one are indistinguishable.
The circular area of the spectrum (known as visibility area) indicated in the figure
includes the space–time frequencies to which the human visual system is sensitive; it is
able to detect temporal frequencies that are below 60 Hz (this explains the frequency

of television devices, which is 50 Hz European and 60 Hz American) and spatial


frequencies below 60 cycles/degrees. The visibility area is very useful for evaluating
the correct value of the temporal sampling frequency of image sequences. If the
replicas are inside the visibility window, they are the consequence of a very coarse motion
sampling and produce visible distortions, like flickering, when the sequence of images
is observed on the monitor.
Conversely, if the replicas fall outside the visibility area, the corresponding sam-
pling rate is adequate enough to produce a distortion-free motion. Finally, it should
be noted that it is possible to estimate from the signal (t, x), in the Fourier domain,
the speed of the object by calculating the slope of the line on which the spectrum of
the sequence is located. The slope of the spectrum is not uniquely calculated if it is
not well distributed linearly. This occurs when the gray-level structures are oriented
in a nonregular way in the spatial domain.
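The idea of reading the speed from the orientation of the spectrum can be verified on a synthetic 1D pattern translating at constant speed; the sketch below (all values hypothetical) builds the space–time signal I(t, x), computes its 2D spectrum, and recovers ν from the location of the spectral peak, in agreement with Eq. (6.2).

import numpy as np

# Synthetic translating sinusoid: its (t, x) spectrum concentrates on the line
# ut = nu * vx (Eq. 6.2), so the peak location gives back the speed.
nx, nt = 128, 128
nu = 2.0                       # speed in pixels per frame
fx = 8.0 / nx                  # spatial frequency in cycles per pixel (exact FFT bin)
x = np.arange(nx)
I = np.array([np.sin(2 * np.pi * fx * (x - nu * t)) for t in range(nt)])

F = np.abs(np.fft.fft2(I))                 # spectrum over (t, x)
ut = np.fft.fftfreq(nt)                    # temporal frequencies (cycles/frame)
vx = np.fft.fftfreq(nx)                    # spatial frequencies (cycles/pixel)
kt, kx = np.unravel_index(np.argmax(F), F.shape)
print(f"estimated speed = {-ut[kt] / vx[kx]:.2f} pixels/frame")   # expected: 2.00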
For a vision machine in general, the spatial and temporal resolution is characterized
by the resolution of the sensor, expressed in pixels each with a width of ΔW (normally
expressed in microns, μm), and by the time interval ΔT = 1/ftp between images
of the sequence defined by the frame rate ftp of the camera. According to the
Shannon–Nyquist sampling theorem, it is possible to define the maximum admissible
temporal and spatial frequencies as follows:

ut ≤ π/ΔT      vx ≤ π/ΔW    (6.3)
Combining Eq. (6.2) and Eq. (6.3), we can estimate the maximum value of the
horizontal component of the optical flow ν:
ν = ut/vx ≤ π/(vx·ΔT) = ΔW/ΔT    (6.4)
defined as a function of both the spatial and temporal resolutions of the acquisition
system. Even if the horizontal optical flow is determinable at the pixel resolution,
in reality it depends above all on the spatial resolution with which the object is
determined in each image of the sequence (i.e., by the algorithms of object detection
in turn influenced by the noise) and by the accuracy with which the spatial and
temporal gradient is determined.
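A small numeric illustration of Eqs. (6.3) and (6.4), with hypothetical sensor parameters: the maximum flow measurable without temporal aliasing corresponds to one pixel pitch per frame.

# Maximum optical-flow speed measurable without temporal aliasing (Eq. 6.4):
# nu_max = dW / dT, i.e., one pixel pitch per frame. Values are hypothetical.
pixel_pitch_um = 5.0            # dW, sensor pixel width in microns
frame_rate = 400.0              # f_tp, frames per second

dW = pixel_pitch_um * 1e-6      # meters
dT = 1.0 / frame_rate           # seconds

nu_max = dW / dT                # meters per second on the sensor plane
print(f"nu_max = {nu_max * 1e3:.1f} mm/s on the sensor (1 pixel per frame)")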

6.3.2 Motion Estimation—Continuous Approach

In the continuous approach, the motion of each pixel of the image is estimated, obtain-
ing a dense map of estimated speed measurements by evaluating the local variations
of intensity, in terms of spatial and temporal variations, between consecutive images.
These speed measurements represent the apparent motion, in the two-dimensional
image plane, of 3D points in the motion of the scene and projected in the image
plane. In this context it is assumed that the objects of the scene are rigid, that is they
all move at the same speed and that, during observation, the lighting conditions do

Fig. 6.12 2D motion field generated by the perspective projection (pin-hole model) of 3D motion
points of a rigid body

not change. With this assumption, we analyze the two terms: motion field and optical
flow.
The motion field represents the 2D speed in the image plane (observer) induced
by the 3D motion of the observed object. In other words, the motion field represents
the apparent 2D speed in the image plane of the real 3D motion of a point in the scene
projected in the image plane. The motion analysis algorithms propose to estimate
from a pair of images of the sequence and for each corresponding point of the image
the value of the 2D speed (see Fig. 6.12).
The velocity vector estimated at each point of the image plane indicates the direc-
tion of motion and the speed which also depends on the distance between the observer
and observed objects. It should be noted immediately that the 2D projections of the
3D velocities of the scene points cannot be measured (acquired) directly by the acqui-
sition systems normally constituted, for example, by a camera. Instead, information
is acquired which approximates the motion field, i.e., the optical flow is calculated
by evaluating the variation of the gray levels in the pairs of time-varying images of
the sequence.
The optical flow and the motion field can be considered coincident only if the
following conditions are satisfied:

(a) The time interval between the acquisition of two consecutive images of the sequence is
minimal;
(b) The gray-level function is continuous;
(c) The Lambertian conditions are maintained;
(d) Scene lighting conditions do not change during sequence acquisition.

In reality, these conditions are not always maintained. Horn [7] has highlighted
some remarkable cases in which the motion field and the optical flow are not equal.
In Fig. 6.13, two cases are represented.

First case: A stationary sphere with a homogeneous surface (of any material)
induces optical flow when a light source moves in the scene. In this
case, by varying the lighting conditions, the optical flow is detected by analyzing
the image sequence since the condition (d) is violated, and therefore there is a


Fig. 6.13 Special cases of noncoincidence between optical flow and motion field. a Lambertian
stationary sphere induces optical flow when a light source moves producing a change in intensity
while the motion field is zero as it should be. b The sphere rotates while the light source is stationary.
In this case, the optical flow is zero (no motion is perceived) while the motion field is produced,
as it should be

variation in the gray levels while the motion field is null having assumed the
stationary sphere.
Second case: The sphere rotates around its axis of gravity while the illumination
remains constant, i.e., the conditions indicated above are maintained. From the
analysis of the sequence, the induced optical flow is zero (no changes in the gray
levels between consecutive images are observed) while the motion field is different
from zero since the sphere is actually in motion.

6.3.3 Motion Estimation—Discrete Approach

In the discrete approach, the speed estimation is calculated only for some points in
the image, thus obtaining a sparse map of velocity estimates. The correspondence
in pairs of consecutive images is calculated only in the significant points of interest
(SPI) (closed contours of zero crossing, windows with high variance, lines, texture,
etc.). The discrete approach is used for both small and large movements of moving
objects in the scene, and when the constraints of the continuous approach cannot be
maintained.
In fact, in reality, not all the abovementioned constraints are satisfied (small shifts
between consecutive images do not always occur and the Lambertian conditions are
not always valid). The optical flow has the advantage of producing dense speed maps
and is calculated independently of the geometry of the objects of the scene unlike
the other (discrete) approaches that produce sparse maps and depend on the points
of interest present in the scene. If the analysis of the movement is also based on the
a priori knowledge of some information of the moving objects of the scene, some
assumptions are considered to better locate the objects:

Maximum speed. The position of the object in the next image, after a time Δt, can
be predicted.
Homogeneous movement. All the points of the scene are subject to the same
motion.
Mutual correspondence. Except for problems with occlusion and object rotation,
each point of an object corresponds to a point in the next image and vice versa
(non-deformable objects).

6.3.4 Motion Analysis from Image Difference

Let I1 and I2 be two consecutive images of the sequence; an estimate of the movement
is given by the binary image d(i, j) obtained as the difference between the two
consecutive images:

d(i, j) = 0  if |I1(i, j) − I2(i, j)| ≤ S;   d(i, j) = 1  otherwise    (6.5)

where S is a positive number indicating the threshold value above which to consider
the presence of movement in the observed scene. In the difference image d(i, j),
the presence of motion is estimated at the pixels with value one. It is assumed that the
images are perfectly recorded and that the dominant variations of the gray levels are
attributable to the motion of the objects in the scene (see Fig. 6.14). The difference
image d(i, j) which contains the qualitative information on the motion is very much
influenced by the noise and cannot correctly determine the motion of very slow
objects. The motion information in each point of the difference image d(i, j) is
associated with the difference in gray levels between the following:

– adjacent pixels that correspond to pixels of moving objects and pixels that belong
to the background;
– adjacent pixels that belong to different objects with different motions;
– pixels that belong to parts of the same object but with a different distance from the
observer;
– pixels with gray levels affected by nonnegligible noise.

The value of the threshold S of the gray-level difference must be chosen experimen-
tally after several attempts and possibly limited to very small regions of the scene.
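A minimal sketch of Eq. (6.5) with NumPy; the threshold S = 15 is an arbitrary illustrative value that, as noted above, should be tuned experimentally.

import numpy as np

def frame_difference(I1, I2, S=15):
    # Binary motion mask d(i, j) of Eq. (6.5): 1 where the absolute gray-level
    # difference between two consecutive frames exceeds the threshold S.
    diff = np.abs(I1.astype(np.int16) - I2.astype(np.int16))
    return (diff > S).astype(np.uint8)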

6.3.5 Motion Analysis from the Cumulative Difference of Images

The difference image d(i, j), obtained in the previous paragraph, qualitatively high-
lights objects in motion (in pixels with value 1) without indicating the direction
of motion. This can be overcome by calculating the cumulative difference image

Fig. 6.14 Motion detected with the difference of two consecutive images of the sequence and result
of the accumulated differences with Eq. (6.6)

dcum (i, j), which contains the direction of motion information in cases where the
objects are small and with limited movements. The cumulative difference dcum (i, j)
is evaluated considering a sequence of n images, whose initial image becomes the ref-
erence against which all other images in the sequence are subtracted. The cumulative
difference image is constructed as follows:


dcum(i, j) = Σ_{k=1}^{n} ak |I1(i, j) − Ik(i, j)|    (6.6)

where I1 is the first image of the sequence, against which all the other images Ik are
compared, and ak is a coefficient that takes increasingly higher values for the most
recent images of the sequence, thus highlighting the location of the pixels associated
with the current position of the moving object (see Fig. 6.14).
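A minimal sketch of the cumulative difference of Eq. (6.6); the linearly increasing weights a_k = k are only one possible (hypothetical) choice.

import numpy as np

def cumulative_difference(frames, weights=None):
    # Cumulative difference image of Eq. (6.6): every frame I_k is compared with
    # the reference frame I_1 and accumulated with a weight a_k that grows with k,
    # so the most recent positions of the moving object are emphasized.
    ref = frames[0].astype(np.float64)
    if weights is None:
        weights = np.arange(1, len(frames) + 1, dtype=np.float64)   # a_k = k
    d_cum = np.zeros_like(ref)
    for a_k, Ik in zip(weights, frames):
        d_cum += a_k * np.abs(ref - Ik.astype(np.float64))
    return d_cum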
The cumulative difference can be calculated if the reference image I1 is acquired
when the objects in the scene are stationary, but this is not always possible. In the
latter case, we try to learn experimentally the motion of objects or, based on a model
of motion prediction, we build the reference image. In reality, an image with motion
information in every pixel is not always of interest. Often, on the other hand,
it is interesting to know the trajectory in the image plane of the center of mass of the
objects moving with respect to the observer. This means that in many applications it
may be useful to first segment the initial image of the sequence, identify the regions
associated with the objects in motion and then calculate the trajectories described by
the centers of mass of the objects (i.e., the regions identified).
In other applications, it may be sufficient to identify in the first image of the
sequence some characteristic points or characteristic areas (features) and then search
for such features in each image of the sequence through the process of matching
homologous features. The matching process can be simplified by knowing or learning
the dynamics of the movement of objects. In the latter case, tracking algorithms of
the features can be used to make the matching process more robust and reduce the
level of uncertainty in the evaluation of the motion and location of the objects. The
Kalman filter [8,9] is often used as a solution to the tracking problem. In Chap. 6 Vol.
II, the algorithms and the problems related to the identification of features and their
search have been described, considering the aspects of noise present in the images,

including the aspects of computational complexity that influence the segmentation


and matching algorithms.

6.3.6 Ambiguity in Motion Analysis

The difference image, previously calculated, qualitatively estimates the presence of


moving objects through a spatial and temporal analysis of the variation of the gray
levels of the pixels for limited areas of the images (spatiotemporal local operators).
For example, imagine that we observe the motion of the edge of an object through a
circular window, as shown in Fig. 6.15a. The local operator, which calculates the gray-
level variations near the edge visible from the circular window of limited size, cannot
completely determine the motion of a point of the visible edge; it can only
qualitatively assess that the visible edge has shifted over the time Δt, but it is not
possible to have exact information on the plausible directions of motion indicated by
the arrows. Each direction of movement indicated by the arrows produces the same
effect of displacement of the visible edge in the final dashed position. This problem
is often called the aperture problem.
In the example shown, the only correctly evaluable motion is the one in the direc-
tion perpendicular to the edge of the visible object. In Fig. 6.15b, however, there is no
ambiguity about the determination of the motion because, in the region of the local
operator, the corner of the object is visible and the direction of motion is uniquely
determined by the direction of the arrows. The aperture problem can be considered as
a special case of the problem of correspondence. In Fig. 6.15a, the ambiguity induced
by the aperture problem derives from the impossibility of finding the correspondence of
homologous points of the visible edge in the two images acquired at times t and
t + Δt. The same correspondence problem would occur in the two-dimensional case
of deformable objects. The aperture problem is also the cause of the wrong
vertical motion perceived by humans with the barber’s pole described in Sect. 6.3.


Fig. 6.15 Aperture problem. a The figure shows the position of a line (edge of an object) observed
through a small aperture at time t1. At time t2 = t1 + Δt, the line has moved to a new position. The
arrows indicate the possible undeterminable line movements with a small opening because only the
component perpendicular to the line can be determined with the gradient. b In this example, again
from a small aperture, we can see in two consecutive images the displacement of the corner of an
object with the determination of the direction of motion without ambiguity

6.4 Optical Flow Estimation

In Sect. 6.3, we introduced the concepts of motion field and optical flow. Now let’s
calculate the dense map of optical flow from a sequence of images to derive useful
information on the dynamics of the objects observed in the scene. Recall that the
variations of gray levels in the images of the sequence are not necessarily induced
by the motion of the objects which, instead, is always described by the motion field.
We are interested in calculating the optical flow in the conditions in which it can be
considered a good approximation of the motion field.
The motion estimation can be carried out assuming that, in small regions of the
images of the sequence, the existence of objects in motion causes a variation of the
luminous intensity at some points without the illumination varying appreciably:
the constraint of continuity of the light intensity relative to moving points. In reality,
we know that this constraint is violated as soon as the position of the observer changes
with respect to the objects or vice versa, and as soon as the lighting conditions
are changed. In real conditions, it is known that this constraint can be considered
acceptable by acquiring sequences of images with an adequate temporal resolution
(a normal camera acquires sequences of images with a time resolution of 1/25 of a
second) and evaluating the brightness variations in the images through the constraint
of the spatial and temporal gradient, used to extract useful information on the motion.
The generic point P of a rigid body (see Fig. 6.12) that moves with speed V, with
respect to a reference system (X, Y, Z ), is projected, through the optical system, in
the image plane at the position p with respect to the coordinate system (x, y) attached
to the image plane, and moves in this plane with an apparent speed v = (vx, vy)
which is the projection of the velocity vector V. The motion field represents the set
of velocity vectors v = (vx , v y ) projected by the optical system in the image plane,
generated by all points of the visible surface of the moving rigid object. An example
of motion field is shown in Fig. 6.16.
In reality, the acquisition systems (for example, a camera) do not determine the 2D
measurement of the apparent speed in the image plane (i.e., they do not directly mea-
sure the motion field), but record, in a sequence of images, the brightness variations
of the scene in the hypothesis that they are due to the dynamics of the scene. There-
fore, it is necessary to find the physical-mathematical model that links the perceived
gray-level variations with the motion field.
We indicate with the following:

(a) I(x, y, t) the acquired sequence of images representing the gray-level informa-
tion in the image plane (x, y) in time t;
(b) (I x , I y ) and It , respectively, the spatial variations (with respect to the axes x and
y) and temporal variations of the gray levels.

Suppose further that the space–time-variant image I(x, y, t) is continuous and differ-
entiable, both spatially and temporally. In the Lambertian hypotheses of continuity
conditions, that is, that each point P of the object appears equally luminous from any
direction of observation and, in the hypothesis of small movements, we can consider

Fig. 6.16 Motion field coinciding with the optical flow idealized in 1950 by Gibson. Each arrow
represents the direction and speed (indicated by the length of the arrow) of surface elements visible
in motion with respect to the observer or vice versa. Neighboring elements move faster than those
further away. The 3D motion of the observer with respect to the scene can be estimated through the
optical flow. This is the motion field perceived on the retina of an observer, who moves toward the
house in the situation of motion of the Fig. 6.1b

the brightness constant in every point of the scene. In these conditions, the brightness
(irradiance) in the image plane I(x, y, t) remains constant in time and consequently
the total derivative of the time-variant image with respect to time becomes null [7]:
I[x(t), y(t), t] = Constant  ⇒  dI/dt = 0    (6.7)
(constant irradiance constraint)

The dynamics of the scene is represented as the function I(x, y, t), which depends
on the spatial variables (x, y) and on the time t. This implies that the value of the
function I [x(t), y(t), t] varies in time in each position (x, y) in the image plane and,
consequently, the values of the partial derivative ∂I/∂t are distinguished from
the total derivative dI/dt. Applying the definition of total derivative to the function
I[x(t), y(t), t], the expression (6.7) of the total derivative becomes

(∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0    (6.8)
that is,
I x v x + I y v y + It = 0 (6.9)

where the time derivatives dx/dt and dy/dt are the vector components of the motion field:

v(vx, vy) = (dx/dt, dy/dt)

while the spatial derivatives of the image, Ix = ∂I/∂x and Iy = ∂I/∂y, are the components of
the spatial gradient of the image ∇I = (Ix, Iy). Equation (6.8), written in vector terms,
becomes
∇I(I x , I y ) · v(vx , v y ) + It = 0 (6.10)

which is the sought brightness continuity equation of the image: it links the information
of brightness variation, i.e., the spatial gradient of the image ∇I = (Ix, Iy),
determined in the sequence of multi-temporal images I(x, y, t), with the motion field
v(vx, vy), which must be estimated once the components Ix, Iy, It are evaluated. In
such conditions, the motion field v(vx , v y ) calculated in the direction of the spatial
gradient of the image (I x , I y ) adequately approximates the optical flow. Therefore,
in these conditions, the motion field coincides with the optical flow.
The same Eq. (6.9) is achieved by the following reasoning. Consider the generic
point p(x, y) in an image of the sequence which at time t will have a luminous
intensity I (x, y, t). The apparent motion of this point is described by the velocity
components (vx, vy) with which it moves; in the next image, at time t + Δt,
it will have moved to the position (x + vxΔt, y + vyΔt), and for the constraint of
continuity of luminous intensity the following relation will be valid (irradiance
constancy constraint equation):

I(x, y, t) = I(x + vxΔt, y + vyΔt, t + Δt)    (6.11)

Taylor’s series expansion of Eq. (6.11) generates the following:


I(x, y, t) + (∂I/∂x)Δx + (∂I/∂y)Δy + (∂I/∂t)Δt + · · · = I(x, y, t)    (6.12)

Dividing by Δt, ignoring the terms above the first order, and taking the limit for Δt → 0,
the previous equation becomes Eq. (6.8), i.e., the equation of the total derivative
dI/dt. For the brightness continuity constraint of the scene (Eq. 6.11) over a very small
time, and for the constraint of spatial coherence of the scene (the points included in
the neighborhood of the point under examination (x, y) move with the same speed
during the unit time interval), we can consider equality (6.11) valid; substituted in
Eq. (6.12), it generates

Ix vx + Iy vy + It = ∇I · v + It ≅ 0    (6.13)

which represents the gradient constraint equation already derived above (see
Eqs. (6.9) and (6.10)). Equation (6.13) constitutes a linear relation between spatial
and temporal gradient of the image intensity and the apparent motion components
in the image plane.
We now summarize the conditions to which the gradient constraint of Eq. (6.13) is
subject for the calculation of the optical flow:

1. Subject to the constraint of preserving the intensity of gray levels during the time Δt
for the acquisition of at least two images of the sequence. In real applications, we

know that this constraint is not always satisfied. For example, in some regions of
the image, in areas where edges are present and when lighting conditions change.
2. Also subject to the constraint of spatial coherence, i.e., it is assumed that in areas
where the spatial and temporal gradient is evaluated, the visible surface belongs
to the same object and all points move at the same speed or vary slightly in
the image plane. Also, this constraint is violated in real applications in the image
plane regions where there are strong depth discontinuities, due to the discontinuity
between pixels belonging to the object and the background, or in the presence of
occlusions.
3. Considering, from Eq. (6.13), that

−It = I x vx + I y v y = ∇I · v (6.14)

it is observed that the variation of brightness It, in the same location of the image
plane in the time Δt of acquisition of consecutive images, is given by the scalar
product of the spatial gradient vectors of the image ∇I and of the components
of the optical flow (vx , v y ) in the direction of the gradient ∇I. It is not possible
to determine the orthogonal component in the direction of the gradient, i.e., in
the normal direction to the direction of variation of the gray levels (due to the
aperture problem).

In other words, Eq. (6.13) shows that, once the spatial and temporal gradients of the
image Ix, Iy, It have been estimated from the two consecutive images, it is possible to
calculate the motion field only in the direction of the spatial gradient of the image,
that is, we can determine the component vn of the optical flow only in the direction
normal to the edge. From Eq. (6.14) it follows that:

∇I · v = ||∇I|| · vn = −It    (6.15)

from which
vn = −It / ||∇I|| = (∇I · v) / ||∇I||    (6.16)

where vn is the measure of the optical flow component that can be calculated in the
direction of the spatial gradient, normalized with respect to the norm ||∇I|| = √(Ix² + Iy²)
of the gradient vector of the image. If the spatial gradient ∇I is null (that is, there is no
change in brightness occurring along a contour), it follows from (6.15) that It = 0 (that
is, we have no temporal variation of irradiance), and therefore there is no movement
information available at the examined point. If instead the spatial gradient
is zero and the time gradient It ≠ 0, at this point the constraint of the optical flow
is violated. This impossibility of observing the velocity components in the point in
the examination is known as the aperture problem already discussed in Sect. 6.3.6
Ambiguity in motion analysis (see Fig. 6.15).
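As a concrete illustration of Eq. (6.16), the following minimal sketch estimates the normal component of the optical flow from two consecutive frames, using simple finite differences for the gradients and a unit time step; it is only an illustrative example, not an implementation discussed in the text.

import numpy as np

def normal_flow(I_prev, I_next, eps=1e-6):
    # Normal component of the optical flow (Eq. 6.16): vn = -It / ||grad I||,
    # defined only where the spatial gradient is not null; eps avoids division
    # by zero in uniform regions.
    I_prev = I_prev.astype(np.float64)
    I_next = I_next.astype(np.float64)
    Iy, Ix = np.gradient(I_prev)           # spatial derivatives along y and x
    It = I_next - I_prev                   # temporal derivative (dt = 1 frame)
    grad_norm = np.sqrt(Ix**2 + Iy**2)
    return -It / (grad_norm + eps)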
From brightness continuity Eqs. (6.13) and (6.16), also called brightness preser-
vation, we highlight that in every pixel of the image it is not possible to determine

the optical flow (vx, vy) starting from the spatial and temporal gradients of the
image (Ix, Iy, It), since there are two unknowns vx and vy and a single linear equation. It
follows that Eq. (6.13) has multiple solutions and the only gradient constraint cannot
uniquely estimate the optical flow. Equation (6.16) instead can only calculate the
component of optical flow in the direction of the variation of intensity, that is, of
maximum variation of the spatial gradient of the image.
Equation (6.13), of brightness continuity, can be represented graphically, in the
velocity space, as a motion constraint line, as shown in Fig. 6.17a, from which it is
observed that all the possible solutions of (6.13) fall on the speed constraint straight
line. Once the spatial and temporal gradient has been calculated in a pixel of the image,
in the plane of the flow components (vx, vy) the speed constraint straight line intersects the
axes vx and vy, respectively, at the points (−It/Ix, 0) and (0, −It/Iy). It is also observed
that only the optical flow component vn can be determined.
If the real 2D motion is the diagonal one (v x , v y ) indicated by the red dot in
Fig. 6.17a and from the dashed vector in Fig. 6.17b, the estimable motion is only
the one given by its projection on the gradient vector ∇ I . In geometric terms, the
calculation of vn with Eq. (6.16) is equivalent to calculating the distance d of the
motion constraint line (I x · vx + I y · v y + It = 0) from the origin of the optical flow
(vx , v y ) (see Fig. 6.17a). This constraint means that the optical flow can be calculated
only in areas of the image in the presence of edges.
In Fig. 6.17c, there are two constraint-speed lines obtained at two points close
to each other of the image for each of which the spatial and temporal gradient is
calculated by generating the lines (1) and (2), respectively. In this way, we can
reasonably hypothesize, that in these two points close together, the local motion is
identical (according to the constraint of spatial coherence) and can be determined
geometrically as the intersection of the constraint-speed lines producing a good local
estimate of the optical flow components (vx , v y ).
In general, to calculate both optical flow components (vx, vy) at each point of
the image (for example, in the presence of edges), the knowledge of the spatial and
temporal derivatives Ix, Iy, It (estimated from the pair of consecutive images) is
not sufficient when using only Eq. (6.13). The optical flow is equivalent to the motion
field only in the particular conditions defined above.
To solve the problem in a general way we can impose the constraint of spatial
coherence on Eq. (6.13), that is, locally, in the vicinity of the point (x, y) in the
processing of the motion field, the speed does not change abruptly. The differential
approach presents limits in the estimation of the spatial gradient in the image areas,
where there are no appreciable variations in gray levels. This suggests calculating
the speed of the motion field over windows of adequate size, centered on the
point being processed, so as to satisfy the spatial coherence constraint required for the
validity of Eq. (6.15). This is also useful for mitigating errors in the estimated optical flow in
the presence of noise in the image sequence.
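One common way to exploit the spatial coherence constraint just described is to collect the gradient constraint (6.13) over a small window centered on the pixel being processed and to solve the resulting over-determined system in the least-squares sense; the sketch below follows this classical local least-squares (Lucas–Kanade-style) idea, named here only for orientation, with a hypothetical window size.

import numpy as np

def local_flow(Ix, Iy, It, i, j, half_win=2):
    # Estimate (vx, vy) at pixel (i, j) by solving, in the least-squares sense,
    # the gradient constraints Ix*vx + Iy*vy + It = 0 collected over a
    # (2*half_win+1)^2 window (spatial coherence assumption).
    win = (slice(i - half_win, i + half_win + 1),
           slice(j - half_win, j + half_win + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)   # N x 2
    b = -It[win].ravel()                                       # N
    v, *_ = np.linalg.lstsq(A, b, rcond=None)                  # solves A v = b
    return v   # (vx, vy)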


Fig. 6.17 Graphical representation of the gradient-based optical flow constraint. a Equation (6.9) is represented by the straight line that is the locus of the points (v_x, v_y) which are the possible solutions of the optical flow of an image pixel, according to the values of the spatial gradient ∇I and the temporal gradient I_t. b Of the real 2D motion (v_x, v_y) (red point in panel a), only the velocity component of the optical flow in the direction of the gradient (Eq. 6.16), perpendicular to the contour, can be estimated, while the component p parallel to the edge cannot. c To solve the aperture problem and, more generally, to obtain a reliable estimate of the optical flow, the velocities can also be computed at the pixels close to the one being processed (spatial coherence of the flow), assuming that pixels belonging to the same surface have the same motion in a small time interval. Each pixel generates, in the velocity diagram, a constraint line of the optical flow; these lines tend to intersect in a small area whose center of gravity represents the components of the optical flow velocity estimated for the set of locally processed pixels

6.4.1 Horn–Schunck Method


Horn and Schunck [7] have proposed a method to model the situations in which
Eq. (6.14) of continuity is violated (due to little variation in gray levels and noise),
imposing the constraint of spatial coherence in the form of a regularization term E 2
in the following function E(vx , v y ) to be minimized:
E(v_x, v_y) = E_1(v_x, v_y) + λ E_2(v_x, v_y)
            = \sum_{(x,y)∈Ω} (I_x v_x + I_y v_y + I_t)^2 + λ \sum_{(x,y)∈Ω} \left[ \left(\frac{∂v_x}{∂x}\right)^2 + \left(\frac{∂v_y}{∂x}\right)^2 + \left(\frac{∂v_x}{∂y}\right)^2 + \left(\frac{∂v_y}{∂y}\right)^2 \right]    (6.17)

where the minimization process involves all the pixels p(x, y) of an image I(:, :, t) of the sequence, whose domain we indicate for simplicity with Ω. The first term E_1 represents the error of the measurements (also known as data energy), based on Eq. (6.13); the second term E_2 represents the spatial coherence constraint (also known as smoothness energy or smoothness error); and λ is the regularization parameter that controls the relative importance of the intensity continuity constraint (Eq. 6.11) and of the spatial coherence constraint. The introduction of the spatial coherence constraint E_2, as an error term expressed by the squared partial derivatives of the velocity components, restricts the class of possible solutions for the flow velocity (v_x, v_y), transforming an ill-conditioned problem into a well-posed one. It should be noted that in this context the optical flow components v_x and v_y are functions of the spatial coordinates (x, y). To avoid confusion on the symbols,
let us indicate for now the horizontal and vertical components of optical flow with
u = vx and v = v y .
Using the variational approach, Horn and Schunck have derived the differential
Eq. (6.22) as follows. The objective is the estimation of the optical flow components
(u, v) which minimizes the energy function:

E(u,v) = \sum_{(x,y)∈Ω} \Big[ (u I_x + v I_y + I_t)^2 + λ (u_x^2 + v_x^2 + u_y^2 + v_y^2) \Big]    (6.18)

reformulated with the new symbolism, where u_x = ∂u/∂x, v_x = ∂v/∂x, u_y = ∂u/∂y, and v_y = ∂v/∂y represent the first derivatives of the optical flow components (now denoted by u and v) with respect to x and y. Differentiating the function (6.18) with respect to the unknown variables u and v, we obtain
\frac{∂E(u,v)}{∂u} = 2 I_x (u I_x + v I_y + I_t) + 2λ (u_{xx} + u_{yy})
\frac{∂E(u,v)}{∂v} = 2 I_y (u I_x + v I_y + I_t) + 2λ (v_{xx} + v_{yy})    (6.19)
where (u x x + u yy ) and (vx x + v yy ) are, respectively, the Laplacian of u(x, y) and
v(x, y) as shown in the notation.3 In essence, the expression corresponding to the
Laplacian controls the contribution of the smoothness term of (6.18) of the optical
flow, which when rewritten becomes
\frac{∂E(u,v)}{∂u} = 2 I_x (u I_x + v I_y + I_t) + 2λ ∇^2 u
\frac{∂E(u,v)}{∂v} = 2 I_y (u I_x + v I_y + I_t) + 2λ ∇^2 v    (6.20)
A possible way to minimize the function (6.18) is to set to zero the partial derivatives given by (6.20), approximating the Laplacians with the differences Δu and Δv of the flow components u and v with respect to their local averages ū and v̄, computed over a local window W centered on the pixel (x, y) being processed, as follows:

Δu = u − ū ≈ ∇^2 u        Δv = v − v̄ ≈ ∇^2 v    (6.21)

3 In fact, considering the smoothness term of (6.18) and differentiating with respect to u we have

\frac{∂(u_x^2 + v_x^2 + u_y^2 + v_y^2)}{∂u} = 2\,\frac{∂}{∂u}\!\left(\frac{∂u(x,y)}{∂x}\right) + 2\,\frac{∂}{∂u}\!\left(\frac{∂u(x,y)}{∂y}\right) = 2\left(\frac{∂^2 u}{∂x^2} + \frac{∂^2 u}{∂y^2}\right) = 2(u_{xx} + u_{yy}) = 2∇^2 u
which corresponds to the second-order differential operator defined as the divergence of the gradient
of the function u(x, y) in a Euclidean space. This operator is known as the Laplace operator or simply
Laplacian. Similarly, the Laplacian of the function v(x, y) is derived.
Replacing (6.21) in Eq. (6.20), setting these last equations to zero and reorganizing
them, we obtain the following equations as a possible solution to minimize the
function E(u, v), given by

(λ + I_x^2)\, u + I_x I_y\, v = λ ū − I_x I_t
(λ + I_y^2)\, v + I_x I_y\, u = λ v̄ − I_y I_t    (6.22)

Thus, we have two equations in the two unknowns u and v. Expressing v in terms of u and substituting it into the other equation, we get the following:

u = ū − I_x \, \frac{I_x ū + I_y v̄ + I_t}{λ + I_x^2 + I_y^2}
v = v̄ − I_y \, \frac{I_x ū + I_y v̄ + I_t}{λ + I_x^2 + I_y^2}    (6.23)

The calculation of the optical flow is performed by applying the iterative method
of Gauss–Seidel, using a pair of consecutive images of the sequence. The goal is to
explore the space of possible solutions of (u, v) such that, for a given value found
at the kth iteration, the function E(u, v) is minimized within a minimum acceptable
error for the type of dynamic images of the sequence in question. The iterative
procedure applied to two images would be the following:

1. From the sequence of images choose, an adjacent pair of images I1 and I2 for each
of which a two-dimensional spatial Gaussian filter is applied, with an appropriate
standard deviation σ , to attenuate the noise. Apply the Gaussian filter to the time
component by considering the other images adjacent to the pair in relation to the
size of the standard deviation σt of the time filter. The initial values of the velocity
components u and v are assumed zero for each pixel of the image.
2. Kth iterative process. Calculate the velocities u (k) and v(k) for all the pixels (i, j)
of the image by applying Eq. (6.23):

u^{(k)}(i,j) = ū^{(k−1)}(i,j) − I_x(i,j) \, \frac{I_x ū^{(k−1)} + I_y v̄^{(k−1)} + I_t}{λ + I_x^2 + I_y^2}
v^{(k)}(i,j) = v̄^{(k−1)}(i,j) − I_y(i,j) \, \frac{I_x ū^{(k−1)} + I_y v̄^{(k−1)} + I_t}{λ + I_x^2 + I_y^2}    (6.24)

   where I_x, I_y, I_t are initially calculated from the pair of consecutive images.

3. Compute the global error e at the kth iteration over the entire image by applying Eq. (6.17):

e = \sum_i \sum_j E^2(i,j)

   If the value of the error e is greater than a certain threshold e_s, proceed with the next iteration, that is, return to step 2 of the procedure; otherwise, the iterative process ends and the last values of u and v are assumed as the definitive estimates of the optical flow map, which has the same dimensions as the images. The regularization parameter λ is set experimentally at the beginning with a value between 0 and 1, choosing by trial and error the optimal value in relation to the type of dynamic images considered.

The described algorithm can be modified to use all the images in the sequence.
In essence, in the iterative process, instead of always considering the same pair of
images, the following image of the sequence is considered at the kth iteration. The
algorithm is thus modified:

1. As in the previous algorithm, apply the spatial and temporal Gaussian filters to all the images in the sequence. The initial values of u and v, instead of being set to zero, are initialized by applying Eq. (6.24) to the first two images of the sequence. The iteration begins with k = 1, which represents the initial estimate of the optical flow.
2. Calculation of the (k+1)th estimation of the speed of the optical flow based on the
current values of the kth iteration and of the next image of the sequence. Equations
are applied to all pixels in the image:

u^{(k+1)}(i,j) = ū^{(k)}(i,j) − I_x(i,j) \, \frac{I_x ū^{(k)} + I_y v̄^{(k)} + I_t}{λ + I_x^2 + I_y^2}
v^{(k+1)}(i,j) = v̄^{(k)}(i,j) − I_y(i,j) \, \frac{I_x ū^{(k)} + I_y v̄^{(k)} + I_t}{λ + I_x^2 + I_y^2}    (6.25)

3. Repeat step 2 and finish when the last image in the sequence has been processed.

The iterative process requires thousands of iterations, and only experimentally, one
can verify which are the optimal values of the regularization parameter λ and of the
threshold es that adequately minimizes the error function e.
The limits of the Horn and Schunck approach are related to the fact that in real
images the constraints of continuity of intensity and spatial coherence are violated.
In essence, the calculation of the gradient leads to two contrasting situations: on the
one hand, for the calculation of the gradient it is necessary that the intensity varies
locally in a linear way and this is generally invalid in the vicinity of the edges; on the
other hand, again in the areas of the edges that delimit an object, the smoothness
constraint is violated, since normally the surface of the object can have different
depths.
A similar problem occurs in areas, where different objects move with different
motions. In the border areas, the conditions of notable variations in intensity occur,
generating very variable values of flow velocity. However, the smoothness com-
ponent tends to propagate the flow velocity even in areas where the image does not
show significant speed changes. For example, this occurs when a single object moves
with respect to a uniform background where it becomes difficult to distinguish the
velocity vectors associated with the object from the background.
Fig. 6.18 Graphical interpretation of Horn–Schunck's iterative process for optical flow estimation. During an iteration, the new velocity (u, v) of a generic pixel (x, y) is obtained by subtracting from the local average velocity (ū, v̄) the update value according to Eq. (6.24), moving along the line perpendicular to the motion constraint line, in the direction of the spatial gradient

Fig. 6.19 Results of the optical flow calculated with the Horn–Schunck method. The first line
shows the results obtained on synthetic images [10], while in the second line the flow is calculated
on real images

Figure 6.18 shows a graphical interpretation of Horn–Schunck's iterative process for optical flow estimation. The figure shows, in the diagram of the optical flow components, how, for the pixel being processed, at each iteration a velocity is assigned corresponding to a point lying on the line perpendicular to the velocity constraint line and passing through the average velocity (ū, v̄) of the pixels in its vicinity. In essence, the current pixel velocity is updated by subtracting an update value from the current average value according to (6.25). This normally happens in the homogeneous areas of the image, where the smoothness constraint holds. In the presence of edges the situation changes, forming at least two homogeneous velocity areas that are distinct from one another.
Figure 6.19 shows the results of the Horn–Schunck method applied on synthetic
and real images with various types of simple and complex motion. The spatial and
temporal gradient was calculated with windows of size from 5 × 5 up to 11 × 11.
To mitigate the limitations highlighted above, Nagel [11] proposed an approach still based on the conservation of intensity, but reformulated the Horn–Schunck function considering second-order terms to model the gray level variations caused by motion. For the velocity calculation, at each point of the image, the following function is minimized, defined over a window W of the image centered on the pixel being processed:

e = \int_W \Big( I(x,y,t) − I(x_0,y_0,t_0) − I_x [x−u] − I_y [y−v] − \tfrac{1}{2} I_{xx} [x−u]^2 − I_{xy} [x−u][y−v] − \tfrac{1}{2} I_{yy} [y−v]^2 \Big)^2 dx\,dy    (6.26)
With the introduction of the second-order derivative in areas where there are small
variations in intensity and where, normally, the gradient is not very significant, there
are appreciable values of the second derivatives, thus eliminating the problem of
attenuating the flow velocities at the edges of the object with respect to the back-
ground. Where contours with high gradient values are present, the smoothness
constraint is forced only in the direction of the contours, while it is attenuated in
the orthogonal direction to the contours, since the values of the second derivatives
assume contained values.
The constraints introduced by Nagel are called oriented smoothness constraints.
Given the complexity of the function (6.26), Nagel applied it only to image areas
with significant regions (contours, edges, etc.) and only in these points is the func-
tional minimized. In the points (x0 , y0 ) where the contours are present only the main
curvature is zero, i.e., I x y = 0, while the gradient is different from zero, as well as the
second derivatives I x x and I yy . The maximum gradient value occurs at the contour
at point (x0 , y0 ), where at least one of the second derivatives passes through zero. If,
for example, I x x = 0, this implies that I x is maximum and the component I y = 0
(inflection points). With these assumptions the previous equation is simplified and
we have

e = \int_{(x,y)∈W} \Big( I(x,y,t) − I(x_0,y_0,t_0) − I_x [x−u] − \tfrac{1}{2} I_{yy} [y−v]^2 \Big)^2 dx\,dy    (6.27)
from which we can derive the system of two differential equations in two unknowns,
namely the velocity components u and v. Even the model set by Nagel, in practice,
has limited applications due to the difficulty in correctly calculating the second-order
derivatives for images that normally present a nonnegligible noise.

6.4.2 Discrete Least Squares Horn–Schunck Method

A simpler approach to estimate the optical flow is based on the minimization of the
function (6.17) with the least squares regression method (Least Square Error-LSE)
and approximating the derivatives of the smoothness constraint E 2 with simple
symmetrical or asymmetrical differences. With these assumptions, if I (i, j) is the
pixel of the image being processed, the smoothness error constraint E_2(i, j) is defined as follows:

E_2(i,j) = \frac{1}{4}\Big[ (u_{i+1,j} − u_{i,j})^2 + (u_{i,j+1} − u_{i,j})^2 + (v_{i+1,j} − v_{i,j})^2 + (v_{i,j+1} − v_{i,j})^2 \Big]    (6.28)
A better approximation would be obtained by calculating the symmetric differences (of the type u_{i+1,j} − u_{i−1,j}). The term E_1 based on Eq. (6.13), the optical flow error constraint, results in the following:

E_1(i,j) = \big[ I_x(i,j)\, u_{i,j} + I_y(i,j)\, v_{i,j} + I_t(i,j) \big]^2    (6.29)

The regression process involves finding the set of unknowns of the flow components {u_{i,j}, v_{i,j}} which minimizes the following function:

e = \sum_{(i,j)∈Ω} \big[ E_1(i,j) + λ E_2(i,j) \big]    (6.30)

According to the LSE method, differentiating the function e(i, j) with respect to the
unknowns u i, j and vi, j for E 1 (i, j) we have
\frac{∂E_1(i,j)}{∂u_{i,j}} = 2\big[ I_x(i,j)\, u_{i,j} + I_y(i,j)\, v_{i,j} + I_t(i,j) \big] I_x(i,j)
\frac{∂E_1(i,j)}{∂v_{i,j}} = 2\big[ I_x(i,j)\, u_{i,j} + I_y(i,j)\, v_{i,j} + I_t(i,j) \big] I_y(i,j)    (6.31)

and for E_2(i, j) we have

\frac{∂E_2(i,j)}{∂u_{i,j}} = −2\big[ (u_{i+1,j} − u_{i,j}) + (u_{i,j+1} − u_{i,j}) \big] + 2\big[ (u_{i,j} − u_{i−1,j}) + (u_{i,j} − u_{i,j−1}) \big]
                           = 2\big[ (u_{i,j} − u_{i+1,j}) + (u_{i,j} − u_{i,j+1}) + (u_{i,j} − u_{i−1,j}) + (u_{i,j} − u_{i,j−1}) \big]    (6.32)

In (6.32), we have the only unknown term u i, j and we can simplify it by putting it
in the following form:
\frac{1}{4}\frac{∂E_2(i,j)}{∂u_{i,j}} = 2\Big[ u_{i,j} − \underbrace{\tfrac{1}{4}(u_{i+1,j} + u_{i,j+1} + u_{i−1,j} + u_{i,j−1})}_{\text{local average } ū_{i,j}} \Big] = 2\,[u_{i,j} − ū_{i,j}]    (6.33)

Differentiating E 2 (i, j) (Eq. 6.28) with respect to vi, j we obtain the analogous
expression:
\frac{1}{4}\frac{∂E_2(i,j)}{∂v_{i,j}} = 2\,[v_{i,j} − v̄_{i,j}]    (6.34)
Combining together the results of the partial derivatives of E 1 (i, j) and E 2 (i, j) we
have the partial derivatives of the function e(i, j) to be minimized:

\frac{∂e(i,j)}{∂u_{i,j}} = 2\,[u_{i,j} − ū_{i,j}] + 2λ\big[ I_x(i,j)\, u_{i,j} + I_y(i,j)\, v_{i,j} + I_t(i,j) \big] I_x(i,j)
\frac{∂e(i,j)}{∂v_{i,j}} = 2\,[v_{i,j} − v̄_{i,j}] + 2λ\big[ I_x(i,j)\, u_{i,j} + I_y(i,j)\, v_{i,j} + I_t(i,j) \big] I_y(i,j)    (6.35)

Setting to zero the partial derivatives of the error function (6.35) and solving with
respect to the unknowns u i, j and vi, j , the following iterative equations are obtained:
u_{i,j}^{(k+1)} = ū_{i,j}^{(k)} − I_x(i,j) \, \frac{I_x(i,j)\, ū_{i,j}^{(k)} + I_y(i,j)\, v̄_{i,j}^{(k)} + I_t(i,j)}{1 + λ\big[ I_x^2(i,j) + I_y^2(i,j) \big]}
v_{i,j}^{(k+1)} = v̄_{i,j}^{(k)} − I_y(i,j) \, \frac{I_x(i,j)\, ū_{i,j}^{(k)} + I_y(i,j)\, v̄_{i,j}^{(k)} + I_t(i,j)}{1 + λ\big[ I_x^2(i,j) + I_y^2(i,j) \big]}    (6.36)

6.4.3 Horn–Schunck Algorithm

For the calculation of the optical flow, at least two adjacent images of the temporal sequence are used, at time t and t + 1. In the discrete case, the iterative Eq. (6.36) is used to calculate the velocities u_{i,j}^{(k)} and v_{i,j}^{(k)} at the kth iteration for each pixel (i, j) of the image of size M × N. The spatial and temporal gradient at each pixel is calculated using one of the convolution masks (Sobel, Roberts, ...) described in Chap. 1 Vol. II.
The original implementation of Horn estimated the spatial and temporal derivatives (the data of the problem) I_x, I_y, I_t by averaging the differences (horizontal, vertical, and temporal) between the pixel being processed (i, j) and its 3 spatially and temporally adjacent pixels, as given by
I_x(i,j,t) ≈ \tfrac{1}{4}\big[ I(i,j+1,t) − I(i,j,t) + I(i+1,j+1,t) − I(i+1,j,t) + I(i,j+1,t+1) − I(i,j,t+1) + I(i+1,j+1,t+1) − I(i+1,j,t+1) \big]
I_y(i,j,t) ≈ \tfrac{1}{4}\big[ I(i+1,j,t) − I(i,j,t) + I(i+1,j+1,t) − I(i,j+1,t) + I(i+1,j,t+1) − I(i,j,t+1) + I(i+1,j+1,t+1) − I(i,j+1,t+1) \big]    (6.37)
I_t(i,j,t) ≈ \tfrac{1}{4}\big[ I(i,j,t+1) − I(i,j,t) + I(i+1,j,t+1) − I(i+1,j,t) + I(i,j+1,t+1) − I(i,j+1,t) + I(i+1,j+1,t+1) − I(i+1,j+1,t) \big]
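A direct NumPy transcription of Eq. (6.37) is sketched below, assuming two gray-level frames I1, I2 given as float arrays; the edge padding used to handle the last row and column is an assumption of the sketch, not prescribed by the text.

import numpy as np

def horn_derivatives(I1, I2):
    """Estimate Ix, Iy, It with the averaged differences of Eq. (6.37) (a sketch).

    I1, I2: two consecutive gray-level frames (time t and t+1) as float arrays.
    The bottom/right borders are padded by repetition, an assumption here.
    """
    a = np.pad(I1, ((0, 1), (0, 1)), mode='edge')
    b = np.pad(I2, ((0, 1), (0, 1)), mode='edge')
    H, W = I1.shape
    i, j = np.arange(H)[:, None], np.arange(W)[None, :]

    Ix = 0.25 * (a[i, j + 1] - a[i, j] + a[i + 1, j + 1] - a[i + 1, j]
                 + b[i, j + 1] - b[i, j] + b[i + 1, j + 1] - b[i + 1, j])
    Iy = 0.25 * (a[i + 1, j] - a[i, j] + a[i + 1, j + 1] - a[i, j + 1]
                 + b[i + 1, j] - b[i, j] + b[i + 1, j + 1] - b[i, j + 1])
    It = 0.25 * (b[i, j] - a[i, j] + b[i + 1, j] - a[i + 1, j]
                 + b[i, j + 1] - a[i, j + 1] + b[i + 1, j + 1] - a[i + 1, j + 1])
    return Ix, Iy, It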

To make the calculation of the optical flow more efficient in each iteration, it is useful
to formulate Eq. (6.36) as follows:

u_{i,j}^{(k+1)} = ū_{i,j}^{(k)} − α\, I_x(i,j)
v_{i,j}^{(k+1)} = v̄_{i,j}^{(k)} − α\, I_y(i,j)    (6.38)
where
α(i,j,k) = \frac{I_x(i,j)\, ū_{i,j}^{(k)} + I_y(i,j)\, v̄_{i,j}^{(k)} + I_t(i,j)}{1 + λ\big[ I_x^2(i,j) + I_y^2(i,j) \big]}    (6.39)

Recall that the averages ū i, j and v̄i, j are calculated on the 4 adjacent pixels as
indicated in (6.33). The pseudo code of the Horn–Schunck algorithm is reported in
Algorithm 26.

Algorithm 26 Pseudo code for the calculation of the optical flow based on the discrete Horn–Schunck method.
1: Input: Maximum number of iterations Niter = 10; λ = 0.1 (adapt experimentally)
2: Output: The dense optical flow u(i, j), v(i, j), i = 1, M and j = 1, N
3: for i ← 1 to M do
4:   for j ← 1 to N do
5:     Calculate I_x(i, j, t), I_y(i, j, t), and I_t(i, j, t) with Eq. (6.37)
6:     Handle the image areas with edges
7:     u(i, j) ← 0
8:     v(i, j) ← 0
9:   end for
10: end for
11: k ← 1
12: while k ≤ Niter do
13:   for i ← 1 to M do
14:     for j ← 1 to N do
15:       Calculate, with edge handling, ū(i, j) and v̄(i, j)
16:       Calculate α(i, j, k) with Eq. (6.39)
17:       Calculate u(i, j) and v(i, j) with Eq. (6.38)
18:     end for
19:   end for
20:   k ← k + 1
21: end while
22: return u and v
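A minimal Python/NumPy sketch of the same loop is given below. It follows Eqs. (6.38)–(6.39) and the 4-neighbor averaging of Eq. (6.33), but, as an assumption of the sketch, the derivatives are taken with np.gradient and a frame difference rather than with Eq. (6.37); lam and n_iter are illustrative values.

import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, lam=0.1, n_iter=10):
    """Minimal dense Horn-Schunck flow in the spirit of Algorithm 26 (a sketch)."""
    Iy, Ix = np.gradient(0.5 * (I1 + I2))   # spatial derivatives (rows=y, cols=x)
    It = I2 - I1                            # simplified temporal derivative

    # 4-neighbor averaging kernel giving u_bar, v_bar as in Eq. (6.33).
    avg = np.array([[0.0, 0.25, 0.0],
                    [0.25, 0.0, 0.25],
                    [0.0, 0.25, 0.0]])

    u = np.zeros_like(I1, dtype=float)
    v = np.zeros_like(I1, dtype=float)
    denom = 1.0 + lam * (Ix**2 + Iy**2)
    for _ in range(n_iter):
        u_bar = convolve(u, avg, mode='nearest')
        v_bar = convolve(v, avg, mode='nearest')
        alpha = (Ix * u_bar + Iy * v_bar + It) / denom   # Eq. (6.39)
        u = u_bar - alpha * Ix                           # Eq. (6.38)
        v = v_bar - alpha * Iy
    return u, v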
6.4.4 Lucas–Kanade Method

The preceding methods for the calculation of the optical flow, being iterative, have the drawback that convergence of the error-function minimization is not guaranteed. Furthermore, they require the calculation of derivatives of order higher than the first and, in areas with gray level discontinuities, the optical flow velocities are estimated with considerable error. An alternative approach is based on the assumption that, if the optical flow velocity components (v_x, v_y) remain locally constant within windows of limited size, the brightness conservation equations yield a system of linear equations solvable with least squares approaches [12].
Figure 6.17c shows how the lines defined by each pixel of the window represent
geometrically in the domain (vx , v y ) the optical flow Eq. 6.13. Assuming that the
window pixels have the same speed, the lines intersect in a limited area whose
center of gravity represents the real 2D motion. The size of the intersection area of
the lines also depends on the error with which the spatial derivatives (I x , I y ) and
the temporal derivative It are estimated, caused by the noise of the sequence of
images. Therefore, the velocity v of the pixel being processed is estimated by the
linear regression method (line fitting) by setting a system of linear overdetermined
equations defining an energy function.
Applying the brightness continuity Eq. (6.14) to the N pixels p_i of a window W of the image (centered on the pixel being processed), we have the following system of linear equations:

∇I(p_1) · v(p_1) = −I_t(p_1)
∇I(p_2) · v(p_2) = −I_t(p_2)
  ...        ...
∇I(p_N) · v(p_N) = −I_t(p_N)    (6.40)

that in extended matrix form becomes

\begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_N) & I_y(p_N) \end{bmatrix} \cdot \begin{bmatrix} v_x \\ v_y \end{bmatrix} = − \begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_N) \end{bmatrix}    (6.41)

and indicating with A the matrix of the components I_x(p_i) and I_y(p_i) of the spatial gradient of the image, with b the vector of the components I_t(p_i) of the temporal gradient, and with v the velocity of the optical flow, we can express the previous relation in the compact matrix form:

A · 
 v = − 
b (6.42)
N ×2 2×1 N ×1
With N > 2, the linear equation system (6.42) is overdetermined and this means that it is not possible to find an exact solution, but only an approximate estimate ṽ, which minimizes the norm of the vector e derived with the least squares approach:

‖e‖ = ‖A v + b‖    (6.43)

solvable as an overdetermined inverse problem, in the sense of finding the minimum norm of the error vector which satisfies the following linear system:

\underbrace{(A^T A)}_{2×2} \cdot \underbrace{ṽ}_{2×1} = \underbrace{A^T b}_{2×1}    (6.44)

from which the following estimate of the velocity vector is obtained:

ṽ = (A^T A)^{−1} A^T b    (6.45)

for the image pixel being processed, centered in the window W. The solution (6.45) exists if the 2 × 2 matrix (A^T A),^4 computed as follows:

(A^T A) = \begin{bmatrix} \sum_{i=1}^{N} I_x(p_i) I_x(p_i) & \sum_{i=1}^{N} I_x(p_i) I_y(p_i) \\ \sum_{i=1}^{N} I_x(p_i) I_y(p_i) & \sum_{i=1}^{N} I_y(p_i) I_y(p_i) \end{bmatrix} = \begin{bmatrix} I_{αα} & I_{αβ} \\ I_{αβ} & I_{ββ} \end{bmatrix}    (6.46)

is invertible (not singular), i.e., if

\det(A^T A) = I_{αα} I_{ββ} − I_{αβ}^2 ≠ 0    (6.47)

The solvability of (6.44) can also be verified by analyzing the eigenvalues λ_1 and λ_2 of the matrix (A^T A), which must be greater than zero and not too small to avoid artifacts in the presence of noise. The eigenvector analysis of (A^T A) has already been considered in Sect. 6.3 Vol. II for the detection of points of interest. There it was seen that if λ_1/λ_2 is too large the pixel p being processed belongs to a contour, and therefore the method is conditioned by the aperture problem.
Therefore, this method operates adequately if the two eigenvalues are sufficiently large and of the same order of magnitude (i.e., the flow is estimated only in areas of nonconstant intensity, see Fig. 6.20), which is the same constraint required for the automatic detection of points of interest (corner detector) in the image. In particular, when all the spatial gradients are parallel or null, the A^T A matrix becomes singular (rank ≤ 1). In homogeneous zones, the gradients are very small and so are the eigenvalues (numerically unstable solution).

4 The matrix (A^T A) is known in the literature as the structure tensor of the image relative to a pixel p, a term that derives from the concept of tensor, which generically indicates a linear algebraic structure able to describe mathematically a physical phenomenon invariant with respect to the adopted reference system. In this case, it concerns the analysis of the local motion associated with the pixels of the window W centered in p.
Fig. 6.20 Operating conditions of the Lucas–Kanade method. In the homogeneous zones, the eigenvalues of the structure tensor A^T A are small, while in zones with texture they are large. The eigenvalues indicate the robustness of the contours along the two main directions. On a contour the matrix becomes singular (not invertible) if all the gradient vectors are oriented in the same direction along the contour (aperture problem: only the normal flow can be computed)

Returning to Eq. (6.46), it is also observed that the temporal gradient component I_t does not appear in the matrix (A^T A); consequently, the accuracy of the optical flow estimation is closely linked to the correct calculation of the spatial gradient components I_x and I_y. Now, substituting in Eq. (6.45) the inverse matrix (A^T A)^{−1} given by (6.46), and the term A^T b obtained from
A^T b = \begin{bmatrix} I_x(p_1) & I_x(p_2) & \cdots & I_x(p_N) \\ I_y(p_1) & I_y(p_2) & \cdots & I_y(p_N) \end{bmatrix} \begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_N) \end{bmatrix} = \begin{bmatrix} −\sum_{i=1}^{N} I_x(p_i) I_t(p_i) \\ −\sum_{i=1}^{N} I_y(p_i) I_t(p_i) \end{bmatrix} = \begin{bmatrix} −I_{αγ} \\ −I_{βγ} \end{bmatrix}    (6.48)

it follows that the velocity components of the optical flow are calculated from the
following relation:
\begin{bmatrix} ṽ_x \\ ṽ_y \end{bmatrix} = − \begin{bmatrix} \dfrac{I_{αγ} I_{ββ} − I_{βγ} I_{αβ}}{I_{αα} I_{ββ} − I_{αβ}^2} \\[6pt] \dfrac{I_{βγ} I_{αα} − I_{αγ} I_{αβ}}{I_{αα} I_{ββ} − I_{αβ}^2} \end{bmatrix}    (6.49)

The process is repeated for all the points of the image, thus obtaining a dense optical flow map. Window sizes are normally chosen as 3×3 or 5×5. Before calculating the optical flow, it is necessary to filter the noise of the images with a Gaussian filter with standard deviation σ in the spatial domain and σ_t for filtering in the time direction. Several other methods, described in [13], have been developed in the literature.
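A possible per-pixel implementation in Python/NumPy is sketched below, assuming the derivative maps Ix, Iy, It are already available and that the window fits inside the image; the window half-size and the eigenvalue threshold tau are illustrative choices, not values taken from the text.

import numpy as np

def lucas_kanade_point(Ix, Iy, It, i, j, half=2, tau=1e-2):
    """Lucas-Kanade flow at pixel (i, j) from a (2*half+1)^2 window (a sketch)."""
    wx = Ix[i - half:i + half + 1, j - half:j + half + 1].ravel()
    wy = Iy[i - half:i + half + 1, j - half:j + half + 1].ravel()
    wt = It[i - half:i + half + 1, j - half:j + half + 1].ravel()

    A = np.stack([wx, wy], axis=1)          # N x 2 matrix of Eq. (6.41)
    AtA = A.T @ A                           # structure tensor of Eq. (6.46)
    # Reject homogeneous/contour areas: both eigenvalues must be large enough.
    if np.min(np.linalg.eigvalsh(AtA)) < tau:
        return 0.0, 0.0
    v = np.linalg.solve(AtA, A.T @ (-wt))   # least squares solution, Eqs. (6.44)-(6.45)
    return v[0], v[1]                       # (vx, vy)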
Figure 6.21 shows the results of the Lucas–Kanade method applied on synthetic
and real images with various types of simple and complex motion. The spatial and
temporal gradient was calculated with windows of sizes from 3 × 3 up to 13 × 13.
In the case of an RGB color image, Eq. (6.42) (which assumes locally constant motion) can still be used, considering a gradient matrix of size 3N × 2 and a temporal gradient vector of size 3N × 1: in essence, each pixel is represented by the triad of color components, thus extending the dimensions of the spatial and temporal gradient matrices.

6.4.4.1 Variant of the Lucas–Kanade Method with the Introduction of Weights
The original Lucas–Kanade method assumes that each pixel of the window has
constant optical flow. In reality, this assumption is violated and it is reasonable to
weigh with less importance the influence of the pixels further away from the pixel
being processed. In essence, the speed calculation v is made by giving a greater weight
to the pixel being processed and less weight to all the other pixels of the window in
relation to the distance from the pixel being processed. If W = diag{w1 , w2 , . . . , wk }
is the diagonal matrix of the weights of size k ×k, the equalities W T W = W W = W 2
are valid.
Multiplying both the members of (6.42) with the matrix weight W we get

WAv = Wb (6.50)

where both members are weighed in the same way. To find the solution, some matrix
manipulations are needed. We multiply both members of the previous equation with
(WA)T and we get

(W A)^T W A v = (W A)^T W b
A^T W^T W A v = A^T W^T W b
A^T W W A v = A^T W W b
\underbrace{A^T W^2 A}_{2×2} \, v = A^T W^2 b    (6.51)

Fig. 6.21 Results of the optical flow calculated with the Lucas–Kanade method. The first line shows the results obtained on synthetic images [10], while in the second line the flow is calculated on real images

If the determinant of (A^T W^2 A) is different from zero, its inverse (A^T W^2 A)^{−1} exists, and we can solve the last equation of (6.51) with respect to v, obtaining:

v = (A^T W^2 A)^{−1} A^T W^2 b    (6.52)

The original algorithm is thus modified by using, for each pixel p(x, y) of the image I(:, :, t), the system of Eq. (6.50), which includes the weight matrix, and whose least squares solution is given by (6.52).
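A minimal sketch of this weighted variant for a single window is given below; the Gaussian weighting (parameter sigma) is an illustrative way of giving more importance to the central pixels, and, as in the earlier sketch, b collects −I_t following Eq. (6.42).

import numpy as np

def weighted_lucas_kanade(wx, wy, wt, sigma=1.0):
    """Weighted LK solution of Eq. (6.52) for one window (a sketch).

    wx, wy, wt: 2D windows of Ix, Iy, It centered on the pixel being processed.
    """
    h, w = wx.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    weights = np.exp(-((yy - cy)**2 + (xx - cx)**2) / (2.0 * sigma**2)).ravel()

    A = np.stack([wx.ravel(), wy.ravel()], axis=1)
    b = -wt.ravel()
    W2 = np.diag(weights**2)                # W^T W = W W = W^2 for a diagonal W
    return np.linalg.solve(A.T @ W2 @ A, A.T @ W2 @ b)   # v = (A^T W^2 A)^(-1) A^T W^2 b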

6.4.5 BBPW Method

The BBPW (Brox–Bruhn–Papenberg–Weickert) method, described in [14], is an evolution of the Horn–Schunck method, in the sense that it tries to overcome the constraint of the constancy of gray levels (imposed by Eq. (6.11): I(x, y, t) = I(x + u, y + v, t + 1)) and the limitation to small displacements of objects between adjacent images of a sequence. These assumptions led to the linear equation I_x u + I_y v + I_t = 0 (linearization with Taylor's expansion), which in reality is often violated, in particular in outdoor scenes (noticeable variation in gray levels) and by displacements spanning several pixels (presence of very fast objects). The BBPW algorithm uses a more robust constraint based on the constancy of the spatial gradient, given by:

∇x,y I (x, y, t) = ∇x,y I (x + u, y + v, t + 1) (6.53)

where ∇_{x,y} denotes the gradient restricted to the derivatives with respect to the x and y axes; it does not include the time derivative. Compared to the quadratic functions of the Horn–Schunck method, which do not allow flow discontinuities (at the edges and in the presence of gray level noise), Eq. (6.53), based on the gradient, is a more robust constraint. With this new constraint, it is possible to derive an energy function E_1(u, v) (data term) that penalizes deviations from these assumptions of constancy of gray levels and of the spatial gradient, resulting in the following:

E_1(u,v) = \sum_{(x,y)∈Ω} \Big[ \big( I(x+u, y+v, t+1) − I(x,y,t) \big)^2 + λ_1 \big\| ∇_{x,y} I(x+u, y+v, t+1) − ∇_{x,y} I(x,y,t) \big\|^2 \Big]    (6.54)

where λ_1 > 0 weighs one assumption relative to the other. Furthermore, the smoothness term E_2(u, v) must be considered, as done with the function (6.17). In this case the third component of the gradient is also considered, which relates two temporally adjacent images from t to t + 1. Therefore, indicating with ∇_3 = (∂_x, ∂_y, ∂_t) the associated space–time gradient, the smoothness term E_2 is formulated as follows:

E_2(u,v) = \sum_{(x,y)∈Ω} \big( |∇_3 u|^2 + |∇_3 v|^2 \big)    (6.55)

The total energy function E(u, v) to be minimized, which estimates the optical flow
(u, v) for each pixel of the image sequence I (:, :, t), is given by the weighted sum
of the terms data and smoothness:

E(u, v) = E 1 (u, v) + λ2 E 2 (u, v) (6.56)

where λ_2 > 0 appropriately weighs the smoothness term with respect to the data term (total variation with respect to the assumption of constancy of the gray levels and of the gradient). Since the data term E_1 is set with a quadratic expression (Eq. 6.54), outliers (due to the variation of pixel intensities and the presence of noise) heavily influence the flow estimation. Therefore, instead of the least squares approach, we define the optimization problem with a more robust energy function, based on the increasing concave function Ψ(s^2), given by:

Ψ(s^2) = \sqrt{s^2 + ε^2} ≈ |s|        ε = 0.001    (6.57)

This function is applied to the error functions E_1 and E_2, expressed, respectively, by Eqs. (6.54) and (6.55), thus setting the minimization of the energy function in terms of a regularized energy function. The regularized functions result in the following:

E_1(u,v) = \sum_{(x,y)∈Ω} Ψ\Big( \big( I(x+u, y+v, t+1) − I(x,y,t) \big)^2 + λ_1 \big\| ∇_{x,y} I(x+u, y+v, t+1) − ∇_{x,y} I(x,y,t) \big\|^2 \Big)    (6.58)

and

E_2(u,v) = \sum_{(x,y)∈Ω} Ψ\big( |∇_3 u|^2 + |∇_3 v|^2 \big)    (6.59)

It is understood that the total energy, expressed by (6.56), is the weighted sum of the terms E_1 and E_2, respectively data and smoothness, controlled with the regularization parameter λ_2. It should be noted that with this approach the influence of outliers is attenuated, since the optimization problem based on the L1 norm of the gradient (known as total variation, TV-L1) is used instead of the L2 norm.
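To make the role of Ψ concrete, the short sketch below implements the penalty (6.57) and its derivative with respect to its argument; this derivative is the robustness/diffusivity factor that reappears in the Euler–Lagrange equations of the following paragraphs. The vectorized NumPy form is an assumption of the sketch.

import numpy as np

def psi(s2, eps=1e-3):
    """Robust penalty of Eq. (6.57): Psi(s^2) = sqrt(s^2 + eps^2) ~ |s|."""
    return np.sqrt(s2 + eps**2)

def psi_prime(s2, eps=1e-3):
    """Derivative of Psi with respect to its argument s^2, i.e. 1/(2*sqrt(s^2+eps^2));
    this is the robustness (data) / diffusivity (smoothness) factor."""
    return 0.5 / np.sqrt(s2 + eps**2)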
The goal now is to find the functions u(x, y) and v(x, y) that minimize these
energy functions by trying to find a global minimum. From the theory of calculus
of variations, it is stated that a general way to minimize an energy function is that it
must satisfy the Euler–Lagrange differential equations.
Without giving details on the derivation of the differential equations based on the calculus of variations [15], the resulting equations are reported:

Ψ'\big( I_t^2 + λ_1 (I_{xt}^2 + I_{yt}^2) \big) \cdot \big( I_x I_t + λ_1 (I_{xx} I_{xt} + I_{xy} I_{yt}) \big) − λ_2 \, \mathrm{div}\Big( Ψ'\big( \|∇_3 u\|^2 + \|∇_3 v\|^2 \big) ∇_3 u \Big) = 0    (6.60)

Ψ'\big( I_t^2 + λ_1 (I_{xt}^2 + I_{yt}^2) \big) \cdot \big( I_y I_t + λ_1 (I_{yy} I_{yt} + I_{xy} I_{xt}) \big) − λ_2 \, \mathrm{div}\Big( Ψ'\big( \|∇_3 u\|^2 + \|∇_3 v\|^2 \big) ∇_3 v \Big) = 0    (6.61)

where Ψ' indicates the derivative of Ψ with respect to its argument, appearing in (6.60) for the u equation and in (6.61) for the v equation. The divergence div acts on the space–time smoothness gradients ∇_3 = (∂_x, ∂_y, ∂_t) related to u and v. Recall that I_x, I_y, and I_t are the derivatives of I(:, :, t) with respect to the spatial coordinates x and y of the pixels and with respect to the time coordinate t; also, I_{xx}, I_{yy}, I_{xy}, I_{xt}, and I_{yt} are their second derivatives. The problem data are all the derivatives calculated from two consecutive images of the sequence. The solution w = (u, v, 1), at each point p(x, y) ∈ Ω, of the nonlinear Eqs. (6.60) and (6.61) can be found with an iterative method of numerical approximation. The authors of BBPW used the one based on fixed point iterations^5 on w. The iterative formulation, with index k, of the previous nonlinear equations, starting from the initial value w^{(0)} = (0, 0, 1)^T, results in the following:

\begin{cases}
Ψ'\Big( \big(I_t^{(k+1)}\big)^2 + λ_1 \big( (I_{xt}^{(k+1)})^2 + (I_{yt}^{(k+1)})^2 \big) \Big) \cdot \Big( I_x^{(k)} I_t^{(k+1)} + λ_1 \big( I_{xx}^{(k)} I_{xt}^{(k+1)} + I_{xy}^{(k)} I_{yt}^{(k+1)} \big) \Big) − λ_2 \, \mathrm{div}\Big( Ψ'\big( \|∇_3 u^{(k+1)}\|^2 + \|∇_3 v^{(k+1)}\|^2 \big) ∇_3 u^{(k+1)} \Big) = 0 \\[4pt]
Ψ'\Big( \big(I_t^{(k+1)}\big)^2 + λ_1 \big( (I_{xt}^{(k+1)})^2 + (I_{yt}^{(k+1)})^2 \big) \Big) \cdot \Big( I_y^{(k)} I_t^{(k+1)} + λ_1 \big( I_{yy}^{(k)} I_{yt}^{(k+1)} + I_{xy}^{(k)} I_{xt}^{(k+1)} \big) \Big) − λ_2 \, \mathrm{div}\Big( Ψ'\big( \|∇_3 u^{(k+1)}\|^2 + \|∇_3 v^{(k+1)}\|^2 \big) ∇_3 v^{(k+1)} \Big) = 0
\end{cases}    (6.62)

This new system is still nonlinear due to the non-linearity of the function Ψ' and of the derivatives I_*^{(k+1)}. The non-linearity of the derivatives I_*^{(k+1)} is removed by their Taylor series expansion up to the first order:

I_t^{(k+1)} ≈ I_t^{(k)} + I_x^{(k)}\, du^{(k)} + I_y^{(k)}\, dv^{(k)}
I_{xt}^{(k+1)} ≈ I_{xt}^{(k)} + I_{xx}^{(k)}\, du^{(k)} + I_{xy}^{(k)}\, dv^{(k)}    (6.63)
I_{yt}^{(k+1)} ≈ I_{yt}^{(k)} + I_{xy}^{(k)}\, du^{(k)} + I_{yy}^{(k)}\, dv^{(k)}

5 Represents the generalization of iterative methods. In general, we want to solve a nonlinear equation

f (x) = 0 by leading back to the problem of finding a fixed point of a function y = g(x), that is, we
want to find a solution α such that f (α) = 0 ⇐⇒ α = g(α). The iteration function is in the form
x (k+1) = g(x (k) ), which iteratively produces a sequence of x for each k ≥ 0 and for a x (0) initial
assigned. Not all the iteration functions g(x) guarantee convergence at the fixed point. It is shown
that, if g(x) is continuous and the sequence x (k) converges, then this converges to a fixed point α,
that is, α = g(α) which is also a solution of f (x) = 0.
Therefore, we can split the unknowns u^{(k+1)} and v^{(k+1)} into the solutions of the previous iteration, u^{(k)} and v^{(k)}, and the unknown increments du^{(k)} and dv^{(k)}, having

u^{(k+1)} = u^{(k)} + du^{(k)}        v^{(k+1)} = v^{(k)} + dv^{(k)}    (6.64)

Substituting the expanded derivatives (6.63) in the first equation of the system (6.62)
and separating for simplicity the terms data and smoothness, we have by definition
the following expressions:

(Ψ')_{E_1}^{(k)} := Ψ'\Big( \big( I_t^{(k)} + I_x^{(k)} du^{(k)} + I_y^{(k)} dv^{(k)} \big)^2 + λ_1 \Big[ \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k)} + I_{xy}^{(k)} dv^{(k)} \big)^2 + \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k)} + I_{yy}^{(k)} dv^{(k)} \big)^2 \Big] \Big)    (6.65)

(Ψ')_{E_2}^{(k)} := Ψ'\Big( \big\| ∇_3 \big( u^{(k)} + du^{(k)} \big) \big\|^2 + \big\| ∇_3 \big( v^{(k)} + dv^{(k)} \big) \big\|^2 \Big)    (6.66)
The term (Ψ')_{E_1}^{(k)} defined by (6.65) is interpreted as a robustness factor in the data term, while the term (Ψ')_{E_2}^{(k)} defined by (6.66) is considered to be the diffusivity in the smoothness term. With these definitions, the first equation of the system (6.62) is rewritten as follows:

(Ψ')_{E_1}^{(k)} \cdot I_x^{(k)} \big( I_t^{(k)} + I_x^{(k)} du^{(k)} + I_y^{(k)} dv^{(k)} \big)
+ λ_1 (Ψ')_{E_1}^{(k)} \cdot \Big[ I_{xx}^{(k)} \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k)} + I_{xy}^{(k)} dv^{(k)} \big) + I_{xy}^{(k)} \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k)} + I_{yy}^{(k)} dv^{(k)} \big) \Big]
− λ_2 \, \mathrm{div}\Big( (Ψ')_{E_2}^{(k)} ∇_3 \big( u^{(k)} + du^{(k)} \big) \Big) = 0    (6.67)

Similarly, the second equation of the system (6.62) is redefined. For a fixed value of k, (6.67) is still nonlinear, but now, having already estimated u^{(k)} and v^{(k)} with the first fixed-point approximation, the unknowns in Eq. (6.67) are the increments du^{(k)} and dv^{(k)}. Thus, only the non-linearity due to the derivative Ψ' remains; but Ψ was chosen as a convex function,^6 and the remaining optimization problem is convex, that is, a unique minimal solution can exist.
The non-linearity in Ψ' can be removed by applying a second iterative process based on the search for the fixed point of Eq. (6.67). Now consider as the unknown variables to iterate du^{(k,l)} and dv^{(k,l)}, where l indicates the lth inner iteration. We assume as initial values du^{(k,0)} = 0 and dv^{(k,0)} = 0, and we indicate with (Ψ')_{E_1}^{(k,l)} and (Ψ')_{E_2}^{(k,l)}, respectively, the robustness factor and the diffusivity expressed by Eqs. (6.65) and (6.66) at the (k,l)th iteration. Finally, we can formulate the first

6 A real-valued function f(x) defined on an interval is called convex if the segment joining any two points of its graph lies above the graph itself. Convexity simplifies the analysis and solution of an optimization problem: it can be shown that a convex function, defined on a convex set, either has no minimum or has only global minima, and cannot have exclusively local minima.
linear system equation in iterative form, with the unknowns du^{(k,l+1)} and dv^{(k,l+1)}, given by

(Ψ')_{E_1}^{(k,l)} \cdot I_x^{(k)} \big( I_t^{(k)} + I_x^{(k)} du^{(k,l+1)} + I_y^{(k)} dv^{(k,l+1)} \big)
+ λ_1 (Ψ')_{E_1}^{(k,l)} \cdot \Big[ I_{xx}^{(k)} \big( I_{xt}^{(k)} + I_{xx}^{(k)} du^{(k,l+1)} + I_{xy}^{(k)} dv^{(k,l+1)} \big) + I_{xy}^{(k)} \big( I_{yt}^{(k)} + I_{xy}^{(k)} du^{(k,l+1)} + I_{yy}^{(k)} dv^{(k,l+1)} \big) \Big]
− λ_2 \, \mathrm{div}\Big( (Ψ')_{E_2}^{(k,l)} ∇_3 \big( u^{(k)} + du^{(k,l+1)} \big) \Big) = 0    (6.68)
This system can be solved using the normal iterative numerical methods (Gauss–
Seidel, Jacobi, Successive Over-Relaxation—SOR also called the method of over-
relaxation) for a linear system also of large size and with sparse matrices (presence
of null elements).

6.4.6 Optical Flow Estimation for Affine Motion

The motion model considered until now is that of pure translation. If we consider that a small region R at time t is subject to an affine motion model, at time t + Δt the velocity (or displacement) of its pixels is given by

\mathbf{u}(x; p) = \begin{bmatrix} u(x; p) \\ v(x; p) \end{bmatrix} = \begin{bmatrix} p_1 + p_2 x + p_3 y \\ p_4 + p_5 x + p_6 y \end{bmatrix}    (6.69)

remembering the affine transformation equations (described in Sect. 3.3 Vol. II) and assuming that the velocity is constant [12] for the pixels of the region R. By replacing (6.69) in the equation of the optical flow (6.13), it is possible to set the following error function:

e(x; p) = \sum_{x∈R} \big( ∇I^T \mathbf{u}(x; p) + I_t \big)^2 = \sum_{x∈R} \big( I_x p_1 + I_x p_2 x + I_x p_3 y + I_y p_4 + I_y p_5 x + I_y p_6 y + I_t \big)^2    (6.70)

to be minimized with the least squares method. Given that there are 6 unknown
motion parameters and each pixel provides only a linear equation, at least 6 pixels of
the region are required to set up a system of linear equations. In fact, the minimization
of the error function requires the differentiation of (6.70) with respect to the unknown
vector p, to set the result of the differentiation equal to zero and solve with respect to
the motion parameters p_i, thus obtaining the following system of linear equations (all sums taken over the pixels of the region R):

\begin{bmatrix}
Σ I_x^2 & Σ x I_x^2 & Σ y I_x^2 & Σ I_x I_y & Σ x I_x I_y & Σ y I_x I_y \\
Σ x I_x^2 & Σ x^2 I_x^2 & Σ x y I_x^2 & Σ x I_x I_y & Σ x^2 I_x I_y & Σ x y I_x I_y \\
Σ y I_x^2 & Σ x y I_x^2 & Σ y^2 I_x^2 & Σ y I_x I_y & Σ x y I_x I_y & Σ y^2 I_x I_y \\
Σ I_x I_y & Σ x I_x I_y & Σ y I_x I_y & Σ I_y^2 & Σ x I_y^2 & Σ y I_y^2 \\
Σ x I_x I_y & Σ x^2 I_x I_y & Σ x y I_x I_y & Σ x I_y^2 & Σ x^2 I_y^2 & Σ x y I_y^2 \\
Σ y I_x I_y & Σ x y I_x I_y & Σ y^2 I_x I_y & Σ y I_y^2 & Σ x y I_y^2 & Σ y^2 I_y^2
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \\ p_5 \\ p_6 \end{bmatrix}
= −
\begin{bmatrix} Σ I_x I_t \\ Σ x I_x I_t \\ Σ y I_x I_t \\ Σ I_y I_t \\ Σ x I_y I_t \\ Σ y I_y I_t \end{bmatrix}    (6.71)
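Rather than assembling the 6 × 6 normal equations explicitly, the sketch below solves the equivalent overdetermined per-pixel system with a least squares routine; the derivative maps and the region coordinates ys, xs are assumed inputs, and the use of np.linalg.lstsq is an implementation choice of the sketch.

import numpy as np

def affine_flow_params(Ix, Iy, It, ys, xs):
    """Least squares estimate of the affine parameters p1..p6 of Eqs. (6.70)-(6.71).

    Ix, Iy, It: derivative maps; ys, xs: 1D arrays of row/column coordinates
    of the pixels of the region R.
    """
    ix, iy, it = Ix[ys, xs], Iy[ys, xs], It[ys, xs]
    x, y = xs.astype(float), ys.astype(float)
    # Each row is the per-pixel linear equation of Eq. (6.70):
    # Ix*p1 + Ix*x*p2 + Ix*y*p3 + Iy*p4 + Iy*x*p5 + Iy*y*p6 = -It
    M = np.stack([ix, ix * x, ix * y, iy, iy * x, iy * y], axis=1)
    p, *_ = np.linalg.lstsq(M, -it, rcond=None)
    return p          # (p1, ..., p6); the residual it + M @ p measures the fit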
As for pure translation, also for affine motion the Taylor series approximation provides only a rough estimate of the real motion. This imprecision can be mitigated with the iterative alignment approach proposed by Lucas–Kanade [12]. In the case of affine motion with large displacements, we can use the multi-resolution approach described in Sect. 6.4.7.2. With the same method, it is possible to manage other motion models (homography, quadratic, ...).
In the case of a planar quadratic motion model, it is approximated by

u = p_1 + p_2 x + p_3 y + p_7 x^2 + p_8 x y
v = p_4 + p_5 x + p_6 y + p_7 x y + p_8 y^2    (6.72)

while for the homography model it results:

u = x' − x
v = y' − y
\quad\text{where}\quad
x' = \frac{p_1 + p_2 x + p_3 y}{p_7 + p_8 x + p_9 y}, \qquad
y' = \frac{p_4 + p_5 x + p_6 y}{p_7 + p_8 x + p_9 y}    (6.73)

6.4.6.1 Segmentation for Homogeneous Motion Regions


Once the optical flow has been estimated, assuming an affine motion model, it may be useful to partition the image into homogeneous regions (in the literature also known as layers), despite the difficulty at the edges of objects due to the local aperture problem. In the literature [15,16], various methods are used, based on the image gradient and on statistical approaches, in particular to manage the discontinuities of the flow field. A simple segmentation method is to estimate an affine motion hypothesis by partitioning the image into elementary blocks. The affine motion parameters are estimated for each block by minimizing the error function (6.70), and then only the blocks with a low residual error are considered.
To classify the various types of motion detected, it is possible to analyze how the flow associated with each pixel is distributed in the parametric space p. For example, the K-means algorithm can be applied to the significant affine motion parameters, merging clusters that are very close together or with low population, thus obtaining a quantitative description of the various types of motion present in the scene. The classification process can be iterated by assigning each pixel to the appropriate motion hypothesis. The essential steps of the segmentation procedure of the homogeneous motion areas based on K-means are listed below (a minimal sketch is given after the list):

1. Partition the image into small blocks of n × n pixels (nonoverlapping blocks);
2. Estimate the affine flow model for the pair of blocks B(t) and B(t + 1) of the consecutive images;
3. Consider the 6 parameters of the affine motion model to characterize the motion of the block in this parametric space;
4. Classify with the K-means method in this parametric space R^6;
5. Update the k clusters of affine motion models found;
6. Relabel the classes of affine motion (segmentation);
7. Refine the affine motion model for each segmented area;
8. Repeat steps 6 and 7 until convergence to significant motion classes.
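The sketch below illustrates steps 1–4 only, relying on the hypothetical helpers horn_derivatives() and affine_flow_params() sketched earlier; the block size, number of clusters k, and residual threshold are illustrative values, and scikit-learn's KMeans is one possible clustering choice.

import numpy as np
from sklearn.cluster import KMeans

def segment_affine_motion(frame1, frame2, block=16, k=4, max_residual=1.0):
    """Block-wise affine fitting followed by K-means on the parameters (a sketch)."""
    Ix, Iy, It = horn_derivatives(frame1, frame2)
    H, W = frame1.shape
    params, origins = [], []
    for r in range(0, H - block + 1, block):          # step 1: nonoverlapping blocks
        for c in range(0, W - block + 1, block):
            ys, xs = [a.ravel() for a in np.mgrid[r:r + block, c:c + block]]
            p = affine_flow_params(Ix, Iy, It, ys, xs)   # step 2: per-block fit
            u = p[0] + p[1] * xs + p[2] * ys             # predicted flow (Eq. 6.69)
            v = p[3] + p[4] * xs + p[5] * ys
            res = np.mean((Ix[ys, xs] * u + Iy[ys, xs] * v + It[ys, xs]) ** 2)
            if res < max_residual:                       # keep well-explained blocks
                params.append(p)                         # step 3: point in R^6
                origins.append((r, c))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.array(params))  # step 4
    return origins, labels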

6.4.7 Estimation of the Optical Flow for Large Displacements

The algorithms described in the previous paragraphs (Horn–Schunck, Lucas–Kanade, BBPW) are essentially based on the assumption of the constancy of gray levels (or of color) and the constraint of spatial coherence (see Sect. 6.4), i.e., it is assumed
of color) and the constraint of spatial coherence (see Sect. 6.4), i.e., it is assumed
that in areas where the spatial and temporal gradient is evaluated, the visible surface
belongs to the same object and all points move at the same speed or vary slightly
in the image plane. We know that these constraints in real applications are violated,
and therefore the equation of the optical flow I x u + I y v + It = 0 would not be valid.
Essentially, the algorithms mentioned above, based on local optimization methods,
would not work correctly for the detection of the optical flow (u, v) in the hypothesis
of objects with great movement. In fact, if the objects move very quickly the pixel
intensities change rapidly and the estimate of the space–time derivatives fails because
they are locally calculated on windows of inadequate dimensions with respect to the
motion of the objects. This last aspect could be solved by using a larger window to
estimate large movements but this would strongly violate the assumption of spatial
coherence of motion. To make these algorithms feasible in these situations it is first
of all useful to improve the accuracy of the optical flow and subsequently estimate
the optical flow with a multi-resolution approach.

6.4.7.1 Iterative Refinement of the Optical Flow


To derive an iterative flow estimation approach, let us consider the one-dimensional case, indicating with I(x, t) and I(x, t + 1) the two signals (relative to the two images) at two instants of time (see Fig. 6.22). As shown in the figure, we can imagine that the signal at time (t + 1) is a translation of that at time t. If at any point on the graph we knew the exact speed u, we could predict its position at time t + 1 as x' = x_t + u. With the above algorithms, we can estimate only an approximate velocity ũ through the optical flow, given the brightness continuity constraints assumed with the optical flow equation, having ignored the terms higher than the first order in the Taylor expansion of I(x, y, t + 1). Rewritten in the 1D case, we have
I_x u + I_t = 0 \implies u = −\frac{I_t}{I_x} ≈ −\frac{I(x, t+1) − I(x, t)}{I_x}    (6.74)
where the time derivative It has been approximated by the difference of the pixels of
the two temporal images. The iterative algorithm for the estimation of the 1D motion
of each pixel p(x) would result:
Fig. 6.22 Iterative refinement of the optical flow. Representation of the 1D signals I(x, t) and I(x, t + 1) relative to the temporal images observed at two instants of time. Starting from an initial speed, the flow is updated by imagining to translate the signal I(x, t) so as to superimpose it on I(x, t + 1). Convergence occurs in a few iterations, calculating at each iteration the space–time gradient on the window centered on the pixel being processed, which shifts less and less until the two signals overlap. The flow update takes place with (6.75); as shown in the figure, at each iteration the time derivative varies while the spatial derivative remains constant

1. Compute for the pixel p the spatial derivative I_x using the pixels close to p;
2. Set the initial speed of p. Normally we assume u ← 0;
3. Repeat until convergence:

   (a) Locate the pixel in the adjacent temporal image I(x', t + 1) assuming the current speed u, with I(x', t + 1) = I(x + u, t + 1) obtained by interpolation, considering that the values of u are not always integers.
   (b) Compute the time derivative I_t = I(x', t + 1) − I(x, t) as the difference of the interpolated pixel intensities.
   (c) Update the speed u according to Eq. (6.74).

It is observed that during the iterative process the 1D spatial derivative I_x remains constant, while the speed u is refined at each iteration starting from a very approximate initial value or from zero. With Newton's method (also known as the Newton–Raphson method), it is possible to generate a sequence of values of u starting from a plausible initial value that, after a certain number of iterations, converges to an approximation of the root of Eq. (6.74), under the hypothesis that I(x, t) is differentiable. Therefore, given the signal I(x(t), t) and knowing an initial value u^{(k)}(x), calculated, for example, with the Lucas–Kanade approach, it is possible to obtain the next value of the speed, solution of (6.74), with the general iterative formula:

u^{(k+1)}(x) ← u^{(k)}(x) − \frac{I_t(x)}{I_x(x)}    (6.75)
where k indicates the iteration number.
The iterative approach for the optical flow (u, v), extended to the 2D case, is implemented considering Eq. (6.13), which we know is a single equation in the two unknowns u and v, but solvable with the simple Lucas–Kanade method described in Sect. 6.4.4. The 2D iterative procedure is as follows:
1. Compute the speeds (u, v) in each pixel of the image considering the adjacent
images I 1 and I 2 (of a temporal sequence) using the Lucas–Kanade method,
Eq. (6.49).
2. Transform (warp) the image I 1 into the image I 2 with bilinear interpolation
(to calculate the intensities at the subpixel level) using the optical flow speeds
previously calculated.
3. Repeat the previous steps until convergence.

Convergence is reached when, applying step 2, the translation of the image at time t leads to its overlap with the image at time t + 1; it follows that the ratio I_t/I_x is null and the speed value remains unchanged.
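A minimal sketch of this warp-and-refine loop is given below. For brevity, the per-pixel update normalized by ‖∇I‖² replaces the windowed LK solve of step 1, an assumption made to keep the sketch short; the bilinear sampling uses scipy's map_coordinates.

import numpy as np
from scipy.ndimage import map_coordinates

def refine_flow(I1, I2, u, v, n_iter=5):
    """Iterative refinement of a dense flow (u, v) between frames I1 and I2 (a sketch).

    At each pass I1's grid is projected into I2 with the current flow
    (bilinear interpolation) and the residual temporal difference updates
    the flow along the spatial gradient, as in Eq. (6.75) extended to 2D.
    """
    Iy, Ix = np.gradient(I1)
    rows, cols = np.mgrid[0:I1.shape[0], 0:I1.shape[1]].astype(float)
    denom = Ix**2 + Iy**2 + 1e-6
    for _ in range(n_iter):
        # Sample I2 at the positions predicted by the current flow.
        warped = map_coordinates(I2, [rows + v, cols + u], order=1, mode='nearest')
        It = warped - I1                      # residual temporal derivative
        u = u - It * Ix / denom               # update along the gradient
        v = v - It * Iy / denom
    return u, v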

6.4.7.2 Optical Flow Estimation with a Multi-resolution Approach


The compromise of having large windows (to manage large displacements of moving objects) without violating the assumption of space–time coherence is achieved by implementing the optical flow algorithms on the image data of the temporal sequence organized with a pyramid structure [15,17]. In essence, the images are first convolved with a Gaussian filter and then subsampled by a factor of two, a technique known as coarse to fine (see Fig. 6.23). Therefore, the Gaussian pyramid (see Sect. 7.6.3 Vol. I) is first built for the images I_t and I_{t+1} of the sequence. Then, the calculation of the optical flow starts by processing the low-resolution images of the pyramids first (toward the top of the pyramid). The coarse flow result is then passed as an initial estimate to repeat the process on the next level with higher resolution images, thus producing more accurate values of the optical flow. Although in the low-resolution stages the optical flow is estimated on images sub-sampled with respect to the original, this ensures that only small displacements need to be captured and therefore the validity of the optical flow Eq. (6.9). The essential steps of Lucas–Kanade's coarse-to-fine algorithm are the following (a minimal sketch is given after the list):

1. Generate the Gaussian pyramids associated with the images I(x, y, t) and I(x, y, t + 1) (normally 3-level pyramids are used).
2. Compute the optical flow (u_0, v_0) with the simple Lucas–Kanade (LK) method at the highest level of the pyramid (coarse);
3. Reapply LK iteratively on the current images to update (correct) the flow, normally converging within 5 iterations; this becomes the initial flow.
4. Then, at each level i, perform the following steps:

   (a) Consider the velocities (u_{i−1}, v_{i−1}) of level i − 1;
   (b) Expand the flow matrices, generating a matrix (u_i^*, v_i^*) with double resolution through an interpolation (cubic, bilinear, Gaussian, ...) for level i;
   (c) Reapply LK iteratively to update the flow (u_i^*, v_i^*) with the correction (Δu_i, Δv_i), considering the images of the pyramid at level i, thus obtaining the updated flow: u_i = u_i^* + Δu_i; v_i = v_i^* + Δv_i.
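The sketch below shows one possible coarse-to-fine organization, building the Gaussian pyramid with smoothing and subsampling by two, and delegating the per-level correction to the refine_flow() sketch given earlier (a hypothetical helper); the number of levels, sigma, and the doubling of the flow at each expansion follow the list above, while the interpolation choices are assumptions of the sketch.

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=3, sigma=1.0):
    """Gaussian pyramid: smooth and subsample by 2 at each level (a sketch)."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(gaussian_filter(pyr[-1], sigma)[::2, ::2])
    return pyr                      # pyr[0] = full resolution, pyr[-1] = coarsest

def coarse_to_fine_flow(I1, I2, levels=3):
    """Coarse-to-fine flow using refine_flow() as the per-level estimator."""
    p1, p2 = gaussian_pyramid(I1, levels), gaussian_pyramid(I2, levels)
    u = np.zeros_like(p1[-1])
    v = np.zeros_like(p1[-1])
    for a, b in zip(reversed(p1), reversed(p2)):
        if u.shape != a.shape:
            # Expand the flow to the finer level and double its magnitude.
            fy, fx = a.shape[0] / u.shape[0], a.shape[1] / u.shape[1]
            u = 2.0 * zoom(u, (fy, fx), order=1)
            v = 2.0 * zoom(v, (fy, fx), order=1)
        u, v = refine_flow(a, b, u, v, n_iter=5)     # correction at this level
    return u, v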
Fig. 6.23 Estimation of the optical flow with a multi-resolution approach by generating two Gaus-
sian pyramids relative to the two temporal images. The initial estimate of the flow is made starting
from the coarse images (at the top of the pyramid) and this flow is propagated to the subsequent
levels until it reaches the original image. With the coarse-to-fine approach, it is possible to handle
large object movements without violating the assumptions of the optical flow equation

Figure 6.24 shows instead the results of the Lucas–Kanade method for the management of large movements, calculated on real images. In this case, the multi-resolution approach based on the Gaussian pyramid was used, iteratively refining the flow calculation at each of the three levels of the pyramid.

6.4.8 Motion Estimation by Alignment

In different applications, it is required to compare images and evaluate their level of similarity (for example, to compare images of a scene shot at different times or from slightly different points of view) or to verify the presence in an image of some of its parts (blobs), as happens in stereo vision when finding homologous elementary parts (patches). For example, in the 1970s, large satellite images became available, and the geometric registration of several images of the territory, acquired at different times and from slightly different observation positions, was strategic.
Fig. 6.24 Results of the optical flow calculated on real images with Lucas–Kanade’s method for
managing large movements that involves an organization with Gaussian pyramid at three levels
of the adjacent images of the temporal sequence and an iterative refinement at every level of the
pyramid

The motion of the satellite caused complex geometric deformations in the images, requiring a process of geometric transformation and resampling based on the knowledge of reference points of the territory (control points, landmarks). For these points, windows (image samples) were available, to be searched later in the images, useful for registration. The search for such sample patches of size n × n in the images to be aligned is carried out using the classic gray-level comparison methods, which minimize an error function based on the sum of squared differences (SSD), described in Sect. 5.8 Vol. II, on the sum of absolute differences (SAD), or on the normalized cross-correlation (NCC). The latter functionals, SAD and NCC, have been introduced in Sect. 4.7.2.
In the preceding paragraphs, the motion field was calculated by considering the
simple translation motion of each pixel of the image by processing (with the differ-
ential method) consecutive images of a sequence. The differential method estimates
the motion of each pixel by generating a dense motion field map (optical flow), does
not perform a local search around the pixel being processed (it only calculates the
local space–time gradient) but only estimates limited motion (works on high frame
rate image sequences). With multi-resolution implementation, large movements can
be estimated but with a high computational cost.
The search for a patch model in the sequence of images can be formulated to
estimate the motion with large displacements that are not necessarily translational.
Lucas and Kanade [12] have proposed a general method that aligns a portion of
an image known as a sample image (template image) T (x) with respect to an input
image I(x), where x = (x, y) indicates the coordinates of the pixel being processed, on which the template is centered. This method can be applied to motion field estimation by considering a generic patch of size n × n from the image at time t and searching for it in the image at time t + Δt. The goal is to find the position of the template T(x) in the successive images of the temporal sequence (template tracking process).
In this formulation, the search for the alignment of the template in the sequence
of varying space–time images takes place considering the variability of the intensity
of the pixels (due to the noise or to the changing of the acquisition conditions), and
a model of motion described by the function W (x; p), parameterizable in terms of
the parameters p = ( p1 , p2 , . . . , pm ). Essentially, the motion (u(x), v(x)) of the
template T (x) is estimated through the geometric transformation (warping trans-
formation) W (x; p) that aligns a portion of the deformed image I (W (x; p)) with
respect to T (x):
I (W (x; p)) ≈ T (x) (6.76)

Basically, the alignment (registration) of T(x) with a patch of I(x) occurs by deforming I(x) to match it with T(x). For example, for a simple translational motion, the transformation function W(x; p) is given by

W(x; p) = \begin{bmatrix} x + p_1 \\ y + p_2 \end{bmatrix} = \begin{bmatrix} x + u \\ y + v \end{bmatrix}    (6.77)

where, in this case, the vector parameter p = ( p1 , p2 ) = (u, v) represents the


optical flow. More complex motion models are described with 2D geometrical trans-
formations W (x; p), already reported in the Chap. 3 Vol. II, such as rigid Euclidean,
similarity, affine, projective (homography). In general, an affine transformation is
considered, that in homogeneous coordinates is given by
W(x; p) = \begin{pmatrix} p_1 x + p_3 y + p_5 \\ p_2 x + p_4 y + p_6 \end{pmatrix} = \begin{pmatrix} p_1 & p_3 & p_5 \\ p_2 & p_4 & p_6 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}    (6.78)
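To make this parameterization concrete, the following minimal sketch (assuming NumPy; the function name and the example points are illustrative) applies the affine warp (6.78) in homogeneous coordinates:

import numpy as np

def affine_warp(points, p):
    """Apply the affine warp W(x;p) of Eq. (6.78) to an array of (x, y) points.
    p = (p1, p2, p3, p4, p5, p6); points has shape (N, 2)."""
    p1, p2, p3, p4, p5, p6 = p
    W = np.array([[p1, p3, p5],
                  [p2, p4, p6]])                          # 2 x 3 affine matrix
    xy1 = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    return xy1 @ W.T                                      # (N, 2) warped coordinates

# Example: a pure translation (u, v) = (2, -1) corresponds to p = (1, 0, 0, 1, 2, -1)
pts = np.array([[10.0, 20.0], [11.0, 20.0]])
print(affine_warp(pts, (1, 0, 0, 1, 2, -1)))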

The Lucas–Kanade algorithm solves the alignment problem (6.76) by minimizing


the error function e SS D , based on the SSD (Sum of Squared Differences) which
compares the pixel intensity of the template T (x) and those of the transformed
image I (W (x; p)) (prediction of the warped image), given by:
e_{SSD} = \sum_{x \in T} \left[ I(W(x; p)) - T(x) \right]^2    (6.79)

The minimization of this functional occurs by looking for the unknown vector p
calculating the pixel residuals of the template T (x) in a search region D × D in
I . In the case of an affine motion model (6.78), there are six unknown parameters. The
minimization process assumes that the entire template is visible in the input image
I and the pixel intensity does not vary significantly. The error function (6.79) to
be minimized is non-linear even if the deformation function W (x; p) is linear with
respect to p. This is because the intensity of the pixels varies regardless of their spa-
tial position. The Lucas–Kanade algorithm uses the Gauss–Newton approximation
method to minimize the error function through an iterative process. It assumes that
an initial value of the estimate of p is known, which through an iterative process is
incremented by Δp such that the error function, expressed in the following
form (parametric model), is minimized:
\sum_{x \in T} \left[ I(W(x; p + \Delta p)) - T(x) \right]^2    (6.80)

In this way, the assumed known initial estimate is updated iteratively by solving the
error function with respect to Δp:

p ← p + Δp    (6.81)

until convergence, checking that the norm of Δp is below a certain threshold
value ε. Since the algorithm is based on the nonlinear optimization of a
quadratic function with the gradient descent method, (6.80) is linearized by approximating
I(W(x; p + Δp)) with its Taylor series expansion up to the first order,
obtaining
I(W(x; p + \Delta p)) \approx I(W(x; p)) + \nabla I \frac{\partial W}{\partial p} \Delta p    (6.82)

where ∇ I = (I x , I y ) is the spatial gradient of the input image I locally evaluated


according to W (x; p), that is, ∇ I is calculated in the coordinate reference system
of I and then transformed into the coordinates of the template image T using the
current estimate of the parameters p of the deformation W(x; p). The term ∂W/∂p is the
Jacobian of the function W(x; p), given by:
\frac{\partial W}{\partial p} = \begin{pmatrix} \frac{\partial W_x}{\partial p_1} & \frac{\partial W_x}{\partial p_2} & \cdots & \frac{\partial W_x}{\partial p_m} \\ \frac{\partial W_y}{\partial p_1} & \frac{\partial W_y}{\partial p_2} & \cdots & \frac{\partial W_y}{\partial p_m} \end{pmatrix}    (6.83)

having considered W (x; p) = (Wx (x; p), W y (x; p)). In the case of affine deforma-
tion (see Eq. 6.78), the Jacobian is:

\frac{\partial W}{\partial p} = \begin{pmatrix} x & 0 & y & 0 & 1 & 0 \\ 0 & x & 0 & y & 0 & 1 \end{pmatrix}    (6.84)

If instead the deformation is of pure translation, the Jacobian corresponds to the
identity matrix \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. Replacing (6.82) in the error function (6.80), we get

e_{SSD} = \sum_{x \in T} \left[ I(W(x; p)) + \nabla I \frac{\partial W}{\partial p} \Delta p - T(x) \right]^2    (6.85)
The latter function is minimized as a least squares problem. Therefore, differentiating
it with respect to the unknown vector Δp and setting the result equal to zero, we obtain

\frac{\partial e_{SSD}}{\partial \Delta p} = 2 \sum_{x \in T} \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ I(W(x; p)) + \nabla I \frac{\partial W}{\partial p} \Delta p - T(x) \right] = 0    (6.86)

where ∇I ∂W/∂p is the steepest descent term. Solving with respect to the unknown
Δp, the function (6.80) is minimized, with the following correction parameter:
\Delta p = H^{-1} \sum_{x \in T} \underbrace{\left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ T(x) - I(W(x; p)) \right]}_{\text{steepest descent term with error}}    (6.87)

where
H = \sum_{x \in T} \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ \nabla I \frac{\partial W}{\partial p} \right]    (6.88)

is the Hessian matrix (of second derivatives, of size m × m) of the deformed image
I(W(x; p)). Equation (6.88) is justified by remembering that the Hessian matrix actually
represents the Jacobian of the gradient (concisely H = J · ∇). Equation (6.87)
calculates the increment Δp used to update p and, through the predict–correct cycle,
to converge toward a minimum of the error function
(6.80). Returning to the Lucas–Kanade algorithm, the iterative procedure expects to
apply Eqs. (6.80) and (6.81) in each step. The essential steps of the Lucas–Kanade
alignment algorithm are
 
1. Compute I(W(x; p)), transforming I(x) with the warp W(x; p);
2. Compute the similarity value (error): I(W(x; p)) − T(x);
3. Compute the warped gradients ∇I = (I_x, I_y) with the transformation W(x; p);
4. Evaluate the Jacobian of the warping ∂W/∂p at (x; p);
5. Compute the steepest descent term: ∇I ∂W/∂p;
6. Compute the Hessian matrix with Eq. (6.88);
7. Multiply the steepest descent term with the error, as indicated in Eq. (6.87);
8. Compute Δp using Eq. (6.87);
9. Update the parameters of the motion model: p ← p + Δp;
10. Repeat until ||Δp|| < ε.
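The following is a minimal numerical sketch of these steps, restricted to the simplest case of a pure translation warp W(x; p) = x + p (so the Jacobian is the identity and steps 3–5 collapse to the plain image gradients); it assumes NumPy and SciPy, and the iteration count and convergence threshold are illustrative values.

import numpy as np
from scipy.ndimage import map_coordinates

def lucas_kanade_translation(I, T, p0, n_iter=50, eps=1e-3):
    """Forward additive Lucas-Kanade alignment for a pure translation warp
    W(x; p) = (x + p1, y + p2). I and T are 2D float arrays; T is the template."""
    p = np.asarray(p0, dtype=float)
    ys, xs = np.mgrid[0:T.shape[0], 0:T.shape[1]]        # template coordinates
    Iy, Ix = np.gradient(I)                              # image gradients
    for _ in range(n_iter):
        # Step 1: warp I (and its gradients) onto the template grid
        coords = [ys + p[1], xs + p[0]]
        Iw  = map_coordinates(I,  coords, order=1)
        Ixw = map_coordinates(Ix, coords, order=1)
        Iyw = map_coordinates(Iy, coords, order=1)
        # Step 2: error image T(x) - I(W(x;p))
        err = T - Iw
        # Steps 3-6: for translation the steepest descent images are (Ix, Iy)
        # and H is the 2x2 matrix of summed gradient products
        H = np.array([[np.sum(Ixw * Ixw), np.sum(Ixw * Iyw)],
                      [np.sum(Ixw * Iyw), np.sum(Iyw * Iyw)]])
        # Steps 7-8: steepest descent times error, then solve for the increment
        b = np.array([np.sum(Ixw * err), np.sum(Iyw * err)])
        dp = np.linalg.solve(H, b)
        # Step 9: additive update of the parameters
        p += dp
        # Step 10: stop when the update is small
        if np.linalg.norm(dp) < eps:
            break
    return p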

Figure 6.25 shows the functional scheme of the Lucas–Kanade algorithm. In sum-
mary, the described algorithm is based on the iterative approach prediction–
correction. The prediction consists of the calculation of the transformed (deformed)
input image I(W(x; p)) starting from an initial estimate of the parameter vec-
tor p once the parametric motion model has been defined (translation, rigid, affine,
Fig. 6.25 Functional scheme of the reported Lucas–Kanade alignment algorithm (reported by
Baker [18]). With step 1 the input image I is deformed with the current estimate of W (x; p) (thus
calculating the prediction) and the result obtained is subtracted from the template T (x) (step 2) thus
obtaining the error function T (x) − I (W (x; p)) between prediction and template. The rest of the
steps are described in the text

homography, . . .). The correction vector Δp is calculated as a function of the error
given by the difference between the prototype image T(x) and the deformed input
image I(W(x; p)). The convergence of the algorithm also depends on the magnitude
of the initial error, especially if one does not have a good initial estimate of p. The
deformation function W(x; p) must be differentiable with respect to p, considering
that the Jacobian matrix ∂W/∂p must be computed. The convergence of the algorithm
is also very much influenced by the inverse of the Hessian matrix H^{-1}, which can be
analyzed through its eigenvalues.
For example, in the case of pure translation, with the Jacobian given by the identity matrix
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, the Hessian matrix becomes:

H = \sum_{x \in T} \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ \nabla I \frac{\partial W}{\partial p} \right]
  = \sum_{x \in T} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} I_x \\ I_y \end{pmatrix} \begin{pmatrix} I_x & I_y \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
  = \sum_{x \in T} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}

Therefore, the Hessian matrix thus obtained for the pure translational motion corre-
sponds to that of the Harris corner detector already analyzed in Sect. 6.5 Vol. II. It
follows that, from the analysis of the eigenvalues (both high values) of H it can be
verified if the template is a good patch to search for in the images of the sequence in
the translation context. In applications where the sequence of images has large move-
ments the algorithm can be implemented using a multi-resolution data structure, for
example, using the coarse to fine approach with pyramidal structure of the images.
The algorithm is also sensitive to cases where the actual motion model differs greatly from the assumed one
and where the lighting conditions change. Any occlusions become a problem for
convergence. A possible mitigation of these problems can be achieved by updating
the template image.
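A minimal sketch, assuming NumPy, of this eigenvalue test on a candidate template patch (the threshold value is an illustrative assumption):

import numpy as np

def is_good_patch(patch, tau=1e3):
    """Check whether a patch is well suited for translational tracking by
    requiring both eigenvalues of the 2x2 matrix H above to be large:
    min(lambda1, lambda2) > tau."""
    Iy, Ix = np.gradient(patch.astype(float))
    H = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    lam = np.linalg.eigvalsh(H)          # eigenvalues in ascending order
    return lam[0] > tau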
From a computational point of view, the complexity is O(m 2 N + m 3 ) where m
is the number of parameters and N is the number of pixels in the template image. In
[18], the details of the computational load related to each of the first 9 steps above
of the algorithm are reported. In [19], there are some tricks (processing of the input
image in elementary blocks) to reduce the computational load due in particular to
the calculation of the Hessian matrix and the accumulation of residuals. Later, as
an alternative to the Lucas–Kanade algorithm, other equivalent methods have been
developed to minimize the error function (6.79).

6.4.8.1 Compositional Image Alignment Algorithm


In [19] a different approximation method is proposed, called compositional algo-
rithm, which modifies the error function to be minimized (6.80) in the following:
\sum_{x \in T} \left[ I(W(W(x; \Delta p); p)) - T(x) \right]^2    (6.89)

The error is first minimized with respect to Δp in each iteration, and then the deformation
estimate is updated as follows:

W(x; p) ← W(x; p) ◦ W(x; Δp) ≡ W(W(x; Δp); p)    (6.90)



In this expression, the “◦” symbol actually indicates a simple linear combination
of the parameters of W(x; p) and W(x; Δp), and the final form is rewritten as the
compositional deformation W(W(•)). The substantial difference of the compositional
algorithm with respect to the Lucas–Kanade algorithm is represented by the
iterative incremental deformation W(x; Δp) rather than the additive updating of the
parameters p.
In essence, Eqs. (6.80) and (6.81) of the original method are replaced with (6.89)
and (6.90) of the compositional method. In other words, this variant of the iterative
approximation involves updating W(x; p) through the composition of the two deformations
given in (6.90).
The compositional algorithm involves the following steps:
 
1. Compute I(W(x; p)), transforming I(x) with the warp W(x; p);
2. Compute the similarity value (error): I(W(x; p)) − T(x);
3. Compute the gradient ∇I(W) of the image I(W(x; p));
4. Evaluate the Jacobian ∂W/∂p at (x; 0). This step is performed only once at the beginning, pre-calculating the Jacobian at (x; 0), which remains constant;
5. Compute the steepest descent term: ∇I ∂W/∂p;
6. Compute the Hessian matrix with Eq. (6.88);
7. Multiply the steepest descent term with the error, as indicated in Eq. (6.87);
8. Compute Δp using Eq. (6.87);
9. Update the parameters of the motion model with Eq. (6.90);
10. Repeat until ||Δp|| < ε.

Basically, this is the same procedure as Lucas–Kanade, except for the steps shown
in bold, that is, step 3, where the gradient of the image I(W(x; p)) is calculated, step
4, which is executed once at the beginning, outside the iterative process, calculating
the Jacobian at (x; 0), and step 9, where the deformation W(x; p) is updated with
the new Eq. (6.90). This new approach is more suitable for more complex motion
models such as the homography, where the Jacobian calculation is simplified,
even if the computational load is equivalent to that of the Lucas–Kanade algorithm.
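As a concrete illustration, the following minimal sketch (assuming NumPy, with the affine warp of (6.78) represented as a 3 × 3 homogeneous matrix; names and example values are illustrative) performs the compositional update (6.90); the inverse compositional variant of the next section composes with the inverse of the incremental warp instead.

import numpy as np

def affine_to_matrix(p):
    """Affine parameters p = (p1,...,p6) of Eq. (6.78) as a 3x3 homogeneous matrix."""
    p1, p2, p3, p4, p5, p6 = p
    return np.array([[p1, p3, p5],
                     [p2, p4, p6],
                     [0.0, 0.0, 1.0]])

def compose(W_p, W_dp, inverse=False):
    """Compositional update of Eq. (6.90): W(x;p) <- W(x;p) o W(x;dp).
    With inverse=True the incremental warp is inverted first, as in the
    inverse compositional algorithm: W(x;p) <- W(x;p) o W(x;dp)^(-1)."""
    if inverse:
        W_dp = np.linalg.inv(W_dp)
    return W_p @ W_dp           # matrix form of W(W(x; dp); p)

# Example: compose the current warp with a small incremental translation
W_current = affine_to_matrix((1, 0, 0, 1, 5.0, 2.0))
W_delta   = affine_to_matrix((1, 0, 0, 1, 0.3, -0.1))
print(compose(W_current, W_delta))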

6.4.8.2 Inverse Compositional Image Alignment Algorithm


In [18], another variant of the Lucas–Kanade algorithm is described, known as the
Inverse compositional image alignment algorithm which substantially reverses the
role of the input image with that of the template. Rather than updating the additive
estimate of the parameters p of the deformation W , this algorithm solves iteratively
for the inverse incremental deformation W(x; Δp)^{-1}. Therefore, the alignment of the
image consists in moving (deforming) the template image to minimize the differ-
ence between the template and the input image. The author demonstrates the equivalence of
this algorithm with the compositional algorithm. With the inverse compositional
algorithm, the function to be minimized with respect to Δp is:


\sum_{x \in T} \left[ T(W(x; \Delta p)) - I(W(x; p)) \right]^2    (6.91)

where, as you can see, the role of the I and T images is reversed. In this case, as
suggested by the name, the minimization problem of (6.91) is solved by updating the
estimated current deformation W(x; p) with the inverted incremental deformation
W(x; Δp)^{-1}, given by

W(x; p) ← W(x; p) ◦ W(x; Δp)^{-1}    (6.92)

To better highlight the variants of the algorithms, that of Lucas–Kanade is indicated


as the forward additive algorithm. The compositional one expressed by (6.90) is
referred to as the forward compositional algorithm, from which the inverse variant differs,
by virtue of (6.92), in that the incremental deformation W(x; Δp) is inverted before being
composed with the current estimate of W(x; p). Therefore, updating the deformation
W(x; p) instead of the parameters p makes the inverse compositional algorithm more suitable for
any type of deformation. Expanding the function (6.91) in a Taylor series
up to the first order, one obtains:
\sum_{x \in T} \left[ T(W(x; 0)) + \nabla T \frac{\partial W}{\partial p} \Delta p - I(W(x; p)) \right]^2    (6.93)

If we assume W (x; 0) equals the identity deformation (without loss of generality),


that is, W (x; 0) = x, the solution to the least squares problem is given by

\Delta p = H^{-1} \sum_{x \in T} \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ I(W(x; p)) - T(x) \right]    (6.94)

where H is the Hessian matrix, but with I replaced with T , given by

H = \sum_{x \in T} \left[ \nabla T \frac{\partial W}{\partial p} \right]^T \left[ \nabla T \frac{\partial W}{\partial p} \right]    (6.95)

with the Jacobian ∂W/∂p evaluated at (x; 0). Since the Hessian matrix is independent of the
parameter vector p, it remains constant during the iterative process. Therefore, instead
of calculating the Hessian matrix in each iteration, as happens with the Lucas–Kanade
algorithm and the forward compositional, it is possible to pre-calculate the Hessian
matrix before starting the iterative process with the advantage of considerably reduc-
ing the computational load. The inverse compositional algorithm is implemented in
two phases:

Initial pre-calculation steps:

3. Compute the gradient ∇T of the template T(x);
4. Compute the Jacobian ∂W/∂p at (x; 0);
5. Compute the steepest descent term ∇T ∂W/∂p;
6. Compute the Hessian matrix using Eq. (6.95);

Steps in the iterative process:

1. Compute I(W(x; p)), transforming I(x) with the warp W(x; p);
2. Compute the similarity value (error image): I(W(x; p)) − T(x);
7–8. Compute the incremental value Δp of the motion model parameters using Eq. (6.94);
9. Update the current deformation: W(x; p) ← W(x; p) ◦ W(x; Δp)^{-1};
10. Repeat until ||Δp|| < ε.

The substantial difference between the forward compositional algorithm and the
inverse compositional algorithm concerns: the calculation of the similarity value
(step 1) having exchanged the role between input image and template; steps 3, 5, and
6 calculating the gradient of T instead of the gradient of I , with the addition of being
able to precompute them outside the iterative process; the calculation of Δp, which is done
with (6.94) instead of (6.87); and finally, step 9, where the incremental deformation
W(x; Δp) is inverted before being composed with the current estimate.
Regarding the computational load, the inverse compositional algorithm requires a
computational complexity of O(m 2 N ) for the initial pre-calculation steps (executed
only once), where m is the number of parameters and N is the number of pixels
in the template image T . For the steps of the iterative process, a computational
complexity of O(m N + m 3 ) is required for each iteration. Essentially, compared
to the Lucas–Kanade and compositional algorithms, we have a computational load
saving of O(m N + m 2 ) for each iteration. In particular, the greatest computation
times are required for the calculation of the Hessian matrix (step 6) although it is
done only once while keeping in memory the data of the matrix H and of the images
of the steepest descent ∇T ∂W/∂p.

6.4.9 Motion Estimation with Techniques Based on Interest Points

In the previous section, we estimated the optical flow considering sequences of


images with a very small time interval between two consecutive images. For example,
for images taken with a standard 50 Hz camera, with a 1/25 s interval or with a much
higher frame rate even of hundreds of images per second. In some applications, the
dynamics of the scene involves longer intervals and, in this case, the calculation of

the motion field can be done using techniques based on the identification, in the
images of the sequence in question, of some significant structures (Points Of Interest
(POI)). In other words, the motion estimation is calculated by first identifying the
homologous points of interest in the consecutive images (with the correspondence
problem analogous to that of the stereo vision) and measuring the disparity value
between the homologous points of interest. With this method, the resulting velocity
map is sparse, unlike previous methods that generated a dense velocity
map. To determine the dynamics of the scene from the sequence of time-varying
images, the following steps must be performed:

1. Define a method of identifying points of interest (see Chap. 6 Vol. II) to be
searched in consecutive images;
2. Calculate the Significant Points of Interest (SPI);
3. Find the homologous SPI points in the two consecutive images of the time
sequence;
4. Calculate the disparity values (relative displacement in the image) for each point
of interest and estimate the velocity components;
5. Track the points of interest identified in step 3 to determine the dynamics of
the scene in a more robust way.

For images with n pixels, the computational complexity to search for points of interest
in the two consecutive images is O(n^2). To simplify the calculation of these points,
we normally consider windows of at least 3 × 3 pixels with high brightness variance.
In essence, the points of interest that emerge in the two images are those with high
variance that normally are found in correspondence of corners, edges, and in general
in areas with strong discontinuity of brightness.
The search for points of interest (step 2) and the search for homologous (step 3)
between consecutive images of the time sequence is carried out using the appropriate
methods (Moravec, Harris, Tomasi, Lowe, . . .) described in Chap. 6 Vol. II. In partic-
ular, the methods for finding homologous points of interest have also been described
in Sect. 4.7.2 in the context of stereo vision for the problem of the correspondence
between stereo images. The best known algorithm in the literature for the tracking of
points of interest is the KLT (Kanade-Lucas–Tomasi), which integrates the Lucas–
Kanade method for the calculation of the optical flow, the Tomasi–Shi method for
the detection of points of interest, and the Kanade–Tomasi method for the ability to
tracking points of interest in a sequence of time-varying images. In Sect. 6.4.8, we
have already described the method of aligning a patch of the image in the temporal
sequence of images.
The essential steps of the KLT algorithm are

1. Find the POIs in the first image of the sequence with one of the methods above
that satisfy min(λ1 , λ2 ) > λ (default threshold value);
2. For each POI, apply a motion model (translation, affine, . . .) to calculate the
displacement of these points in the next image of the sequence. For example,
alignment algorithms based on the Lucas–Kanade method can be used;

3. Keep track of the motion vectors of these POIs in the sequence images;
4. Optionally, it may be useful to reactivate the POI detector (step 1) to add new points to
follow. This step is executed every m processed images of the sequence (for example,
every 10–20 images);
5. Repeat steps 2–3 and, optionally, step 4;
6. KLT returns the vectors that track the points of interest found in the image
sequence.
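A minimal sketch of these steps using OpenCV's Shi–Tomasi detector and pyramidal Lucas–Kanade tracker; the video source, corner parameters, and window size are illustrative assumptions, not part of the algorithm description above.

import cv2

cap = cv2.VideoCapture("sequence.avi")            # placeholder video source
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Step 1: Shi-Tomasi corners, i.e. points with min(lambda1, lambda2) > threshold
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                              qualityLevel=0.01, minDistance=7)
tracks = [[tuple(p.ravel())] for p in pts]

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Step 2: pyramidal Lucas-Kanade alignment of each point in the next image
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None,
                                                    winSize=(21, 21), maxLevel=3)
    # Step 3: keep the motion vectors of the points successfully tracked
    for tr, p, ok_flag in zip(tracks, new_pts, status.ravel()):
        if ok_flag:
            tr.append(tuple(p.ravel()))
    pts, prev_gray = new_pts, gray

# Step 6: 'tracks' now contains the trajectory of each point of interest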

The KLT algorithm would automatically track the points of interest in the images
of the sequence compatibly with the robustness of the detection algorithms of the
points and the reliability of the tracking, which is influenced by the variability of the contrast
of the images, by the noise, by the lighting conditions (which must vary little), and above
all by the motion model. In fact, if the motion model (for example, of translation
or affine) changes a lot with objects that move many pixels in the images or change
of scale, there will be problems of tracking with points of interest that may appear
only partially and no longer be detectable. It may also happen that, during tracking, a detected point
of interest has identical characteristics to another point but belongs to a different object.

6.4.9.1 Tracking Using the Similarity of Interest Points


In application contexts where the lighting conditions can change considerably,
together with geometric variations due not only to rotation and change of
scale but also to affine distortion, it can be useful to base the tracking on the similarity
of points of interest with robust characteristics of geometric and radiometric invari-
ance. In these cases, the KLT tracking algorithm, described above, is modified in the
first step to detect invariant points of interest with respect to the expected distortions
and lighting variations, and in the second phase it is evaluated the similarity of the
points of interest found by the first image to successive images of the sequence.
In Sect. 6.7.1 Vol. II, the characteristics of the SIFT points of interest (proposed by
Lowe [20]) are described; they are detected through a coarse to fine approach (based on a Gaussian pyramid)
for invariance to change of scale, while the DoG (Difference of Gaussian) pyramid
is used for locating the points of interest at the various scales. The orientation
of the POIs is calculated through the locally evaluated gradient and finally for each
POI there is a descriptor that captures the local radiometric information.
Once the points of interest have been detected, it is necessary to find the corre-
spondence of the points in the consecutive images of the sequence. This is achieved
by evaluating a similarity measure between the characteristics of the correspond-
ing points, taking care that it is significant with respect to the invariance
properties. This similarity measure is obtained by calculating the sum of the squared
differences (SSD) of the characteristics of the points of interest found in the consec-
utive images and considering as the corresponding candidate points those with an
SSD value lower than a predefined threshold.
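A minimal sketch, assuming OpenCV's SIFT implementation, of this descriptor matching between two consecutive frames; the L2 descriptor distance (the square root of the SSD) is used here together with Lowe's ratio test, a common variant of the fixed threshold described above, and the file names are placeholders.

import cv2

img1 = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)     # placeholder frames
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching on descriptor distance (L2), keeping the two best
# candidates per point and applying the ratio test to reject ambiguous matches
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Displacement of each matched point between the two frames
vectors = [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]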
In Fig. 6.26, the significant POIs of a sequence of 5 images are highlighted with
an asterisk, detected with the SIFT algorithm. The first line shows (with an asterisk)
all the POIs found in 5 consecutive images. The significant rotation motion toward

Fig. 6.26 Points of interest detected with Lowe’s SIFT algorithm for a sequence of 5 images
captured by a mobile vehicle


Fig. 6.27 Results of the correspondence of the points of interest in Fig. 6.26. a Correspondence of
the points of interest relative to the first two images of the sequence calculated with Harris algorithm.
b As in a but related to the correspondence of the points of interest SIFT. c Report the tracking of
the SIFT points for the entire sequence. We observe the correct correspondence of the points (the
trajectories do not intersect each other) invariant to rotation, scale, and brightness variation

the right side of the corridor is highlighted. In the second line, only the POIs found
corresponding between two consecutive images are shown, useful for motion detec-
tion. Given the real navigation context, tracking points of interest must be invariant
to the variation of lighting conditions, rotation, and, above all, scale change. In fact,
the mobile vehicle during the tracking of the corresponding points captures images
of the scene where the lighting conditions vary considerably (we observe reflections
with specular areas) and between one image and another consecutive image of the
sequence the points of interest can be rotated and with a different scale. The figure
also shows that some points of interest present in the previous image are no longer
visible in the next image, while the latter contains new points of interest not visible
in the previous image due to the dynamics of the scene [21].
Figure 6.27 shows the results of the correspondence of the points of interest SIFT
of Fig. 6.26 and the correspondence of the points of interest calculated with Har-

ris algorithm. While for SIFT points the similarity is calculated using the SIFT
descriptors, for Harris corners the similarity measure is calculated with the
SSD considering a square window centered on the position of the corresponding
corners located in the two consecutive images.
In figure (a), the correspondences found in the first two images of the sequence are
reported, detected with the corners of Harris. We observe the correct correspondence
(those relating to nonintersecting lines) for corners that are translated or slightly
rotated (invariance to translation) while for scaled corners the correspondence is
incorrect because they are not invariant to the change of scale. Figure (b) is the
analogue of (a) but reports the correspondences of the SIFT points of interest,
which, being also invariant to the change of scale, are all correct correspondences
(zero intersections). Finally, in figure (c) the tracking of the SIFT points for the
whole sequence is shown. We observe the correct correspondence of the points (the
trajectories do not intersect each other) being invariant with respect to rotation, scale,
and brightness variation.
There are different methods for finding the optimal correspondence between a
set of points of interest, considering also the possibility that for the dynamics of the
scene some points of interest in the following image may not be present. To eliminate
possible false correspondences and to reduce the computation time of the correspondence
process, constraints can be introduced, in analogy to what happens in
stereo vision, where epipolar constraints are imposed; alternatively, knowing the
kinematics of the moving objects, it is possible to predict the position of the points of
interest. For example, the correspondences of significant point pairs can be considered
to make the correspondence process more robust, placing constraints on the basis of
a priori knowledge of the dynamics of the scene.

6.4.9.2 Tracking Using the Similarity Between Graphs of Points


of Interest
Robust correspondence approaches are based on organizing with a graph the set of
points of interest P O I1 of the first image and comparing all the possible graphs
generated by the points of interest P O I2 of the consecutive image. In this case, the
correspondence problem is reduced to the problem of comparing graphs, known as
graph isomorphism, which is a difficult problem to
solve, especially when applied to noisy data. A simplification is obtained by con-
sidering graphs consisting of only two nodes that represent two points of interest in an
image where, in addition to their similarity characteristics, topological information
(for example the distance) of points of interest is also considered.
The set of potential correspondences forms a bipartite graph, and the correspon-
dence problem consists in choosing a particular coverage of this graph. Initially, each
node can be considered as the correspondent of every other node of the other partition
(joining with a straight line these correspondences intersect). Using some similarity
criteria, the goal of the correspondence problem is to remove all connections of the
entire bipartite graph except one for each node (where there would be no intersec-
tions). Considering the dynamics of the scene, for some nodes no correspondence
will be found (also because they are no longer visible), and these are excluded from motion
analysis.
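One common way to choose such a coverage of the bipartite graph is a minimum-cost one-to-one assignment; the sketch below assumes SciPy and a precomputed pairwise cost matrix (for example, descriptor SSD plus a displacement penalty), with illustrative sizes and threshold.

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[j, k]: dissimilarity between point j of POI1 and point k of POI2
cost = np.random.rand(12, 15)            # placeholder cost matrix

rows, cols = linear_sum_assignment(cost) # one-to-one assignment of minimum total cost

# Discard assignments whose cost is too high: those nodes are treated as
# points that disappeared (or appeared) because of the scene dynamics
max_cost = 0.2                           # illustrative threshold
matches = [(j, k) for j, k in zip(rows, cols) if cost[j, k] < max_cost]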

6.4.9.3 Tracking Based on the Probabilistic Correspondence of Points


of Interest
An iterative approach, to solve the correspondence problem, is the one proposed by
Barnard and Thompson [22], which is based on the estimation of the probability
of finding the correspondence of points in the two consecutive images, with the
constraint that the homologous point must be found in the other image within a distance
defined on the basis of the dynamics of the scene. Basically, assuming a maximum speed
on the motion of the points of interest, the number of possible matches decreases
and a correspondence probability value is estimated for each potential pair of corre-
sponding points. These probabilities are computed in an iterative way to obtain an
optimal global probability measure calculated for all possible pairs of points whose
correspondence is to be calculated. The process ends when the correspondences of
each point in the first image with a point of interest are found in the next image of the
sequence, with the constraint that the global probability of all the correspondences
is significantly higher than any other possible set of correspondences, or is greater
than a predefined threshold value. We now describe this algorithm.
In analogy to the stereo approach, we consider two images acquired over time
t and t + Δt, with Δt very small. The temporal acquisition distance is assumed
to be very small and we assume that we have determined a set of points of interest
A1 = {x1, x2, . . . , xm} in the first image at time t, and a set of points of interest A2 =
{y1, y2, . . . , yn} in the second image at time t + Δt. If the generic point of interest
x j is as consistent as the corresponding point yk , that is, they represent a potential
correspondence with the best similarity, a velocity vector c jk is associated to this
potential correspondence with an initial probability value based on their similarity.
With this definition, if the point of interest x j moves in the image plane with velocity
v, it will correspond in the second image to the point yk given by

y_k = x_j + v\,\Delta t = x_j + c_{jk}    (6.96)

where the vector c jk can be seen as the vector connecting the points of interest x j
and yk . This pair of homologous points has a good correspondence if the following
condition is satisfied:
|x j − yk | ≤ cmax (6.97)

where cmax indicates the maximum displacement (disparity) of x j in the time interval
Δt, found in the next image with the homologous point y_k. Two pairs of homologous
points (x j , yk ) and (x p , yq ) are declared consistent if they satisfy the following
condition:
|c jk − c pq | ≤ cost

where cost is an experimentally calculated constant based on the dynamics of the


scene known a priori. Initially, an estimate of the correspondence of potential homol-
ogous points of interest can be calculated, based on their similarity measure. Each
point of interest x j , in the image being processed, and yk , in the next image, identifies
a window W of appropriate size (3 × 3, 5 × 5, . . .).
A similarity measure, between potential homologous points, can be considered
the sum of the squares of the gray-level differences, corresponding to the window
considered. Indicating with S jk this similarity estimate between the points of interest
x j and yk , we get

S_{jk} = \sum_{(s,t) \in W} \left[ I_1(s, t) - I_2(s, t) \right]^2    (6.98)

where I1(s, t) and I2(s, t) indicate the pixels of the windows W_j and
W_k, centered, respectively, on the points of interest x_j and y_k in the
corresponding images. An initial value of the estimate of the correspondence of a
generic pair of points of interest (x_j, y_k), expressed in terms of probability P^{(0)}_{jk},

of the potential homologous points x_j and y_k, is calculated, according to the similarity
measure given by (6.98), as follows:

P^{(0)}_{jk} = \frac{1}{1 + \alpha S_{jk}}    (6.99)

where α is a positive constant. The probability P^{(0)}_{jk} is determined by considering
the fact that a certain number of POIs have a good similarity, excluding those that
are inconsistent in terms of similarity, for which a probability value equal to
1 − max(P^{(0)}_{jk}) can be assumed. The probabilities of the various possible matches
are given by

P_{jk} = \frac{P^{(0)}_{jk}}{\sum_{t=1}^{n} P^{(0)}_{jt}}    (6.100)

where P jk can be considered as the conditional probability that x j has the homolo-
gous point yk , normalized on the sum of the probabilities of all other potential points
{y1 , y2 , . . . , yn }, excluding those found with inconsistent similarity. The essential
steps of the complete algorithm for calculating the flow field between two consecutive
images of the sequence are the following:

1. Calculate the set of points of interest A1 and A2 , respectively, in the two consec-
utive images I_t and I_{t+Δt}.
2. Organize a data structure among the potential points of correspondence for each
point x j ∈ A1 with points yk ∈ A2 :

{x j ∈ A1 , (c j1 , P j1 ), (c j2 , P j2 ), . . . , (β, γ )} j = 1, 2, . . . , m

where P_{jk} is the probability of matching points x_j and y_k, while β and γ are
special symbols that indicate the absence of a potential correspondence.
3. Initialize the probabilities P^{(0)}_{jk} with Eq. (6.99), based on the similarity of the POIs
given by Eq. (6.98), once the appropriate size W of the window has been chosen
in relation to the dynamics of the scene.
4. Iteratively calculate the probability of matching a point x_j with all the potential
points (y_k, k = 1, . . . , n) as the weighted sum of all the probabilities of corre-
spondence of all consistent pairs (x_p, y_q), where the points x_p are in the vicinity
of x_j (while the points y_q are in the vicinity of y_k) and the consistency of (x_p, y_q) is eval-
uated with respect to the pair (x_j, y_k). A quality measure Q_{jk} of the matching pair
is given by

Q^{(s-1)}_{jk} = \sum_{p} \sum_{q} P^{(s-1)}_{pq}    (6.101)

where s is the iteration step, p refers to all x p points that are in the vicinity of
the point of interest x j being processed and index q refers to all the yq ∈ A2
points that form pairs (x p , yq ) consistent with pairs (x j , yk ) (points that are not
consistent or with probability below a certain threshold are excluded).
5. Update the correspondence probabilities for each pair (x_j, y_k) as follows:

\hat{P}^{(s)}_{jk} = \hat{P}^{(s-1)}_{jk} \left( a + b\, Q^{(s-1)}_{jk} \right)

with a and b default constants. The probability \hat{P}^{(s)}_{jk} is then normalized with
the following:

P^{(s)}_{jk} = \frac{\hat{P}^{(s)}_{jk}}{\sum_{t=1}^{n} \hat{P}^{(s)}_{jt}}    (6.102)

6. Iterate steps (4) and (5) until the best match (x j , yk ) is found for all the points
examined x j ∈ A1 .
7. The vectors c jk constitute the velocity fields of the analyzed motion.

The selection of the constants a and b conditions the convergence speed of the algo-
rithm, which normally converges after a few iterations. The algorithm can be sped up
by eliminating correspondences whose initial probability values P^{(0)}_{jk} are very low,
below a certain threshold.
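A minimal sketch, assuming NumPy, of the initialization described by Eqs. (6.97)–(6.100): window-based SSD similarities and the initial correspondence probabilities. The constants, the window half-size, and the assumption that the points lie away from the image borders are illustrative; the iterative relaxation of steps 4–5 is not shown.

import numpy as np

def window(img, p, h):
    """(2h+1) x (2h+1) window of img centered on point p = (row, col);
    points are assumed to lie at least h pixels away from the borders."""
    r, c = p
    return img[r - h:r + h + 1, c - h:c + h + 1].astype(float)

def initial_probabilities(I1, I2, pts1, pts2, h=2, alpha=1e-3, c_max=20.0):
    """Initial correspondence probabilities P0[j, k] of Eqs. (6.99)-(6.100).
    Candidates farther than c_max pixels (Eq. 6.97) are excluded."""
    m, n = len(pts1), len(pts2)
    P0 = np.zeros((m, n))
    for j, xj in enumerate(pts1):
        for k, yk in enumerate(pts2):
            if np.linalg.norm(np.subtract(xj, yk)) > c_max:
                continue                       # inconsistent: displacement too large
            S = np.sum((window(I1, xj, h) - window(I2, yk, h)) ** 2)   # Eq. (6.98)
            P0[j, k] = 1.0 / (1.0 + alpha * S)                          # Eq. (6.99)
        s = P0[j].sum()
        if s > 0:
            P0[j] /= s                          # normalization of Eq. (6.100)
    return P0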

6.4.10 Tracking Based on the Object Dynamics—Kalman Filter

When we know the dynamics of a moving object, instead of independently deter-


mining (in each image of the sequence) the position of the points of interest that
characterize it, it may be more effective to set the tracking of the object consider-
ing the knowledge of its dynamics. In this case, the object tracking is simplified by


Fig. 6.28 Tracking of the ball by detecting its dynamics using the information of its displacement
calculated in each image of the sequence knowing the camera’s frame rate. In the last image on the
right, the Goal event is displayed

calculating the position of its center of mass, independently in each image of the
sequence, also based on a prediction estimate of the expected position of the object.
In some real applications, the dynamics of objects is known a priori. For example
in the tracking of a pedestrian or in the tracking of entities (players, football, . . .)
in sporting events, they are normally shot with appropriate cameras having a frame
rate appropriate to the intrinsic dynamics of these entities. For example, the tracking
of the ball (in the game of football or tennis) would initially predict its location in
an image of the sequence, analyzing the image entirely, but in subsequent images of
the sequence its location can be simplified by the knowledge of the motion model
that would predict its current position.
This tracking strategy reduces the search time and improves, with the prediction
of the dynamics, the estimate of the position of the object, normally influenced by
the noise. In object tracking, the camera is stationary and the object must be visible
(otherwise the tracking procedure must be reinitialized to search for the object in
a sequence image, as we will see later), and in the sequence acquisition phase the
object–camera geometric configuration must not change significantly (normally the
object moves with lateral motion with respect to the optical axis of the stationary
camera). Figure 6.28 shows the tracking context of the ball in the game of football
to detect the Goal–NoGoal event.7 The camera continuously acquires sequences of
images and as soon as the ball appears in the scene a ball detection process locates
it in an image of the sequence and begins a phase of ball tracking in the subsequent
images.
Using the model of the expected movement, it is possible to predict where the
ball-object is located in the next image. In this context, the Kalman filter can be
used considering the dynamics of the event that presents uncertain information (the
ball can be deviated) and it is possible to predict the next state of the ball. Although
in reality, external elements interfere with the predicted movement model, with the
Kalman filter one is often able to understand what happened. The Kalman filter is
ideal for continuously changing dynamics. A Kalman filter is an optimal estima-
tor, i.e., it highlights parameters of interest from indirect, inaccurate, and uncertain

7 Goal–NoGoal event, according to FIFA regulations, occurs when the ball passes entirely the ideal
vertical plane parallel to the door, passing through the inner edge of the horizontal white line
separating the playing field from the inner area of the goal itself.

observations. It operates recursively by evaluating the next state based on the pre-
vious state and does not need to keep the historical data of the event dynamics. It
is, therefore, suitable for real-time implementations, and therefore strategic for the
tracking of high-speed objects.
It is an optimal estimator in the sense that if the noise of the data of the problem
is Gaussian, the Kalman filter minimizes the mean square error of the estimated
parameters. If the noise were not Gaussian (that is, for data noise only the average
and standard deviation are known), the Kalman filter is still the best linear estimator
but nonlinear estimators could be better. The word filter must not be associated
with its most common meaning of removing frequencies from a signal, but must
be understood as the process that finds the best estimate from noisy data, that is, that filters
(attenuates) the noise.
Now let’s see how the Kalman filter is formulated [8,9]. We need to define the
state of a deterministic discrete dynamic system, described by a vector with the
smallest possible number of components, which completely synthesizes the past of
the system. The knowledge of the state allows theoretically to predict the dynamics
and future (and previous) states of the deterministic system in the absence of noise.
In the context of the ball tracking, the state of the system could be described by the
vector x = (p, v), where p = (x, y) and v = (vx , v y ) indicate the position of
the center of mass of the ball in the images and the ball velocity, respectively. The
dynamics of the ball is simplified by assuming constant velocity during the tracking
and neglecting the effect of gravity on the motion of the ball.
This velocity is initially estimated by knowing the camera’s frame rate and evalu-
ating the displacement of the ball in a few images of the sequence, as soon as the ball
appears in the field of view of the camera (see Fig. 6.28). But nothing is known about
unforeseeable external events (such as wind and player deviations) that can change
the motion of the ball. Therefore, the next state is not determined with certainty and
the Kalman filter assumes that the state variables p and v may vary randomly with the
Gaussian distribution characterized by the mean μ and variance σ 2 which represents
the uncertainty. If the prediction is maintained, knowing the previous state, we can
estimate in the next image where the ball would be, given by:

p_t = p_{t-1} + \Delta t\, v

where t indicates the current state associated with the tth image of the sequence, t−1
indicates the previous state, and Δt indicates the time elapsed between two adjacent
images, defined by the camera’s frame rate. In this case, the two quantities, position
and velocity of the ball, are corr elated.
Therefore, in every time interval, we have that the state changes from xt−1 to
xt according to the prediction model and according to the new observed measures
zt evaluated independently from the prediction model. In this context, the observed
measure zt = (x, y) indicates the position of the center of mass of the ball in the
image tth of the sequence. We can thus evaluate a new measure of the tracking state
(i.e., estimate the new position and speed of the ball, processing the tth image of the
sequence directly). At this point, with the Kalman filter one is able to estimate the

current state x̂t by filtering out the uncertainties (generated by the measurements zt
and/or from the prediction model xt ) optimally with the following equation:

x̂t = Kt · zt + (1 − Kt )x̂t−1 (6.103)

where Kt is a coefficient called Kalman Gain and t indicates the current state of
the system. In this case, t is used as an index of the images of the sequence but in
substance it has the meaning of discretizing time, such that t = 1, 2, . . . indicates t ·
Δt ms (milliseconds), where Δt is the constant time interval between two successive
images of the sequence (defined by the frame rate of the camera).
From (6.103), we observe that with the Kalman filter the objective is to estimate,
optimally the state at time t, filtering through the coefficient Kt (which is the unknown
of the equation) the intrinsic uncertainties deriving from the prediction estimated in
the state x̂t−1 and from the new observed measures zt . In other words, the Kalman
filter behaves like a data fusion process (prediction and observation) by optimally
filtering the noise of that data. The key to this optimal process is reduced to the Kt
calculation in each process state.
Let us now analyze how the Kalman filter mechanism, useful for the tracking,
realizes this optimal fusion between the assumption of state prediction of the system,
the observed measures and the correction proposed by the Kalman Eq. (6.103). The
state of the system at time t is described by the random variable xt , which evolves
from the previous state t-1 according to the following linear equation:

xt = Ft · xt−1 + Gt ut + ε t (6.104)

where

– xt is a state vector of size n x whose components are the variables that characterize
the system (for example, position, velocity, . . .). Each variable is normally assumed
to have a Gaussian distribution N(μ, Σ), with mean μ, which is the center of the
random distribution, and covariance matrix Σ of size n_x × n_x. The correlation
information between the state variables is captured by the covariance matrix Σ
(symmetric), of which each element Σ_{ij} represents the level of correlation between
the variable ith and jth. For the example of the ball tracking, the two variables p
and v would be correlated, considering that the new position of the ball depends
on velocity assuming no external influence (gravity, wind, . . .).
– F t is the transition matrix that models the system prediction, that is, it is the state
transition matrix that applies the effect of each system state parameter at time t-1 in
the system state over time t (therefore, for the example considered, the position and
velocity, over time t-1, both influence the new ball position at the time t). It should
be noted that the components of F t are assumed to be constant in the changes of
state of the system even if in reality they can be modified (for example, the variables
of the system deviate with respect to the hypothesized Gaussian distribution). This
last situation is not a problem since we will see that the Kalman filter will converge
toward a correct estimate even if the distribution of the system variables deviate
from the Gaussian assumption.

– ut is an input vector of size n u whose components are the system control input
parameters that influence the state vector xt . For the ball tracking problem, if we
wanted to consider the effect of the gravitational field, we should consider in the
state vector xt , also the vertical component y of the position p = (x, y), which is
dependent on the acceleration −g according to the gravitational motion equation
y = \frac{1}{2} g t^2.
– G t is the control matrix associated with the input parameters ut .
– εt is the vector (normally unknown, represents the uncertainty of the system model)
that includes the terms of process noise for each parameter associated with the state
vector xt . It is assumed that the process noise has a multivariate normal distribution
with mean zero (white noise process) and with a covariance matrix Q t = E[εt ε tT ].

Equation (6.104) defines a linear stochastic process, where each value of the state
vector xt is a linear combination of its previous value xt−1 plus the value of the
control vector ut and the process noise εt .
The equation associated with the observed measurements of the system, acquired
in each state t, is given by
zt = Ht · xt + ηt (6.105)

where

– zt is an observation vector of size n z whose components are the observed measure-


ments at time t expressed in the sensor domain. Such measurements are obtained
independently from the prediction model which in this example correspond to the
coordinates of the position of the ball detected in the tth image of the sequence.
– Ht is the transformation matrix (also known as the measure matrix), of size n z ×n x ,
which maps the state vector xt in the measurement domain.
– ηt is the vector (normally unknown, models additional noise in the observation)
of size n z whose components are the noise terms associated with the observed
measures. As for the noise of the process εt , this noise is assumed as white noise,
that is, with Gaussian distribution at zero mean and with covariance matrix Rt =
E[ηt ηtT ].

It should be noted that ε t and ηt are independent variables, and therefore, the uncer-
tainty on the prediction model does not depend on the uncertainty on the observed
measures and vice versa.
Returning to the example of the ball tracking, in the hypothesis of constant veloc-
ity, the state vector xt is given by the velocity vt and by the horizontal position
x_t = x_{t-1} + \Delta t\, v_{t-1}, as follows:

\mathbf{x}_t = \begin{pmatrix} x_t \\ v_t \end{pmatrix} = \begin{pmatrix} 1 & \Delta t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x_{t-1} \\ v_{t-1} \end{pmatrix} + \varepsilon_t = F_{t-1} x_{t-1} + \varepsilon_t    (6.106)

For simplicity, the dominant dynamics of lateral motion was considered, with respect
to the optical axis of the camera, thus considering only the position p = (x, 0) along

the horizontal axis of the x-axis of the image plane (the height of the ball is neglected,
that is, the y-axis as shown in Fig. 6.28). If instead we also consider the influence of
gravity on the motion of the ball, in the state vector x two additional variables must
be added to indicate the vertical fall motion component. These additional variables
are the vertical position y of the ball and the velocity v y of free fall of the ball along
the y axis. The vertical position is given by

y_t = y_{t-1} - \Delta t\, v_{y_{t-1}} - \frac{g (\Delta t)^2}{2}
where g is the gravitational acceleration. Now let us indicate the horizontal speed of
the ball with vx which we assume constant taking into account the high acquisition
frame rate (Δt = 2 ms for a frame rate of 500 fps). In this case, the state vector
becomes x_t = (x_t, y_t, v_{x_t}, v_{y_t})^T and the linear Eq. (6.104) becomes

\mathbf{x}_t = \begin{pmatrix} x_t \\ y_t \\ v_{x_t} \\ v_{y_t} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_{t-1} \\ y_{t-1} \\ v_{x_{t-1}} \\ v_{y_{t-1}} \end{pmatrix} + \begin{pmatrix} 0 \\ -\frac{(\Delta t)^2}{2} \\ 0 \\ \Delta t \end{pmatrix} g + \varepsilon_t = F_{t-1} x_{t-1} + G_t u_t + \varepsilon_t    (6.107)

with the control variable ut = g. For the ball tracking, the observed measurements
are the (x, y) coordinates of the center of mass of the ball calculated in each image
of the sequence through an algorithm of ball detection [23,24]. Therefore, the vector
of measures represents the coordinates of the ball z = [x y]T and the equation of
the observed measures, according to Eq. (6.105), is given by

z_t = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \mathbf{x}_t + \eta_t = H_t x_t + \eta_t    (6.108)

Having defined the problem of the tracking of an object, we are now able to adapt it
to the Kalman filter model. This model involves two distinct processes (and therefore
two sets of distinct equations): update of the prediction and update of the observed
measures.
The equations for updating the prediction are

x̂t|t−1 = Ft · x̂t−1|t−1 + Gt ut (6.109)

P t|t−1 = Ft Pt−1|t−1 FtT + Qt (6.110)

Equation (6.109), derived from (6.104), computes an estimate x̂t|t−1 of the current
state t of the system on basis of the previous state values t-1 with the prediction
matrices F_t and G_t (provided with the definition of the problem), assuming that the
distributions of the state variable x̂_{t−1|t−1} and of the control u_t are known and Gaussian.
Equation (6.110) updates the system state prediction covariance matrix by know-
ing the covariance matrix Qt associated with the noise ε of the input control variables.

The variance associated with the prediction x̂t|t−1 of an unknown real value xt is
given by
P t|t−1 = E[(xt − x̂t|t−1 )(xt − x̂t|t−1 )T ],

where E(•) is the expectation value.8


The equations for updating the observed measurements are

x̂t|t = x̂t|t−1 + Kt (z t − Ht xt|t−1 ) (6.112)

P t|t = Pt|t−1 − Kt Ht Pt|t−1 = (I − K t H t ) P t|t−1 (6.113)

where Kalman Gain is

K t = Pt|t−1 HtT (Ht Pt|t−1 HtT + Rt )−1 (6.114)

8 Let’s better specify how the uncertainty of a stochastic variable is evaluated, which we know to
be its variance, to motivate Eq. (6.110). In this case, we are initially interested in evaluating the
uncertainty of the state vector prediction x̂ t−1 which is given, being multidimensional, from its
covariance matrix P t−1 = Cov( x̂ t−1 ). Similarly, the uncertainty of the next value of the prediction
vector x̂ t at time t, after the transformation Ft obtained with (6.109), is given by

P t = Cov( x̂ t ) = Cov(F t x̂ t−1 ) = Ft Cov( x̂ t−1 )FtT = Ft Pt−1 FtT (6.111)


Essentially, the linear transformation of the values x̂ t−1 (which have a covariance matrix P t−1 ) with
the prediction matrix F t modifies the covariance matrix of the vectors of the next state x̂ t according
to Eq. (6.111), where P t in fact represents the output covariance matrix of the linear prediction
transformation assuming the Gaussian state variable. In the literature, (6.111) represents the error
propagation law for a linear transformation of a random variable whose mean and covariance are
known without necessarily knowing its exact probability distribution. Its proof is easily obtained,
remembering the definition and properties of the expected value μ_x and covariance Σ_x of a random
variable x. In fact, considering a generic linear transformation y = Ax + b, the new expected value
and the covariance matrix are derived from the original distribution as follows:

μ y = E[ y] = E[ Ax + b] = AE[x] + b = Aμx + b

\Sigma_y = E[(y - E[y])(y - E[y])^T]
        = E[(Ax + b - A E[x] - b)(Ax + b - A E[x] - b)^T]
        = E[(A(x - E[x]))(A(x - E[x]))^T]
        = E[(A(x - E[x]))(x - E[x])^T A^T]
        = A\, E[(x - E[x])(x - E[x])^T]\, A^T
        = A \Sigma_x A^T
In the context of the Kalman filter, the error propagation also occurs in the propagation of the
measurement prediction uncertainty (Eq. 6.105) from the previous state xt|t−1 , whose uncertainty
is given by the covariance matrix Pt|t−1 .

Fig. 6.29 Functional scheme of the iterative process (prediction | correction) of the Kalman filter: the state update (prediction, Eqs. 6.109–6.110) and the measure update (correction, Eqs. 6.112–6.114, which uses the observed measures z_t and the Kalman gain) are repeated for t = t + 1 after initializing the state

with Rt which is the covariance matrix associated with the noise of the observed mea-
surements z t (normally known by knowing the uncertainty of the measurements of
the sensors used) and Ht Pt|t−1 HtT is the covariance matrix of the measures that cap-
tures the propagated uncertainty of the previous state of prediction (characterized by
the covariance matrix Pt|t−1 ) on the expected measures, through the transformation
matrix Ht , provided from the model of the measures, according to Eq. (6.105).
At this point, the matrices R and Q remain to be determined, starting from the
initial values of x_0 and P_0, and the iterative process can thus start, updating the state
(prediction) and updating the observed measures (state correction), as shown
in Fig. 6.29. We will now analyze the various phases of the iterative process of the
Kalman filter reported in the diagram of this figure.
In the prediction phase, step 1, an a priori estimate x̂_{t|t−1} is calculated, which is in
fact a rough estimate made before the observation of the measures z_t, that is, before
the correction phase (measure update). In step 2, the a priori error propagation covariance
matrix P_{t|t−1} is computed with respect to the previous state t − 1.
These values are then used in the update equations of the observed measurements.
In the correction phase, the system state vector xt is estimated combining the
knowledge information a priori with the measurements observed at the current time
t thus obtaining a better updated and correct estimate of x̂t and of Pt . These values
are necessary in the prediction/correction for the future estimate at time t + 1.
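A minimal numerical sketch, assuming NumPy, of this prediction/correction cycle for the constant-velocity ball model (state (x, y, v_x, v_y), measurement (x, y)); the noise covariances Q and R, the frame interval, and the initial state are illustrative values, not taken from the text.

import numpy as np

dt = 1.0 / 500.0                       # frame interval (500 fps, illustrative)

# Constant-velocity model: state (x, y, vx, vy), measurement (x, y)
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-3                   # process noise covariance (illustrative)
R = np.eye(2) * 2.0                    # measurement noise covariance (illustrative)

def predict(x, P):
    """State update (prediction), Eqs. (6.109)-(6.110), with no control input."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def correct(x, P, z):
    """Measure update (correction), Eqs. (6.112)-(6.114)."""
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Example: start from a detected position with an estimated initial velocity,
# then filter a (hypothetical) sequence of observed ball positions
x_hat = np.array([100.0, 50.0, 3000.0, 0.0])       # px, px, px/s, px/s
P_hat = np.eye(4) * 10.0
for z in [np.array([106.0, 50.2]), np.array([112.1, 50.1])]:
    x_hat, P_hat = predict(x_hat, P_hat)
    x_hat, P_hat = correct(x_hat, P_hat, z)
print(x_hat[:2])                                   # filtered ball position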
Returning to the example of the ball tracking, the Kalman filter is used to predict
the region where the ball would be in each image of the sequence, acquired in real
time with a high frame rate considering that the speed of the ball can reach 120 km/ h.
The initial state t0 starts as soon as the ball appears in an image of the sequence,
only initially searched for on the entire image (normally HD type with a resolution
of 1920 × 1080). The initial speed v0 of the ball is estimated by processing multiple
adjacent images of the sequence before triggering the tracking process. The accuracy


Fig. 6.30 Diagram of the error filtering process between two successive states. a The position of the
ball at the time t1 has an uncertain prediction (shown by the bell-shaped Gaussian pdf whose width
indicates the level of uncertainty given by the variance) since it is not known whether external
factors influenced the model of prediction. b The position of the ball is shown by the measure
observed at time t1 with a level of uncertainty due to the noise of the measurements, represented by
the Gaussian pdf of the measurements. Combining the uncertainty of the prediction model and the
measurement one, that is, multiplying the two pdfs (prediction and measurements), there is a new
filtered position measurement obtaining a more precise measurement of the position in the sense
of the Kalman filter. The uncertainty of the filtered measurement is given by the third Gaussian pdf
shown

of the position and initial speed of the ball is reasonably known and estimated in
relation to the ball detection algorithm [24–26].
The next state of the ball (epoch t = 1) is estimated by the prediction update
Eq. (6.109) which for the ball tracking is reduced to Eq. (6.106) excluding the influ-
ence of gravity on the motion of the ball. In essence, in the tracking process there is
no control variable ut to consider in the prediction equation, and the position of the
ball is based only on the knowledge of the state x0 = (x0 , v0 ) at time t0 , and therefore
with the uncertainty given by the Gaussian distribution x_t ∼ N(F_t x_{t−1}; Σ). This
uncertainty is due only to the calculation of the position of the ball which depends on
the environmental conditions of acquisition of the images (for example, the lighting
conditions vary between one state and the next). Furthermore, it is reasonable to
assume less accuracy in predicting the position of the ball at the time t1 compared to
the time t0 due to the noise that we propose to filter with the Kalman approach (see
Fig. 6.30a).
At the time t1 we have the measure observed on the position of the ball acquired
from the current image of the sequence which, for this example, according to
Eq. (6.105) of the observed measures, results:

z_t = H_t · x_t + η_t = [1  0] [x_t; v_t] + η_t    (6.115)

where we assume the Gaussian noise η_t ∼ N(0; σ²_{η_t}). With the observed measure z_t,
we have a further measure of the position of the ball whose uncertainty is given by
the distribution z_t ∼ N(μ_z; σ²_z). An optimal estimate of the position of the ball is
obtained by combining that of the prediction x̂ t|t−1 and that of the observed measure
z t . This is achieved by multiplying the two Gaussian distributions together. The
product of two Gaussians is still a Gaussian (see Fig. 6.30b). This is fundamental

because it allows us to multiply an infinite number of Gaussian functions over successive epochs,
but the resulting function does not increase in terms of complexity or number of
terms. After each epoch, the new distribution is completely represented by a Gaussian
function. This is the strategic solution according to the recursive property of the
Kalman filter.
The situation represented in the figure can be analytically described to derive the
Kalman filter updating equations shown in the functional diagram of Fig. 6.29. For
this purpose, we indicate for simplicity the Gaussian pdf related to the prediction with
N(μx ; σx2 ) and with N(μz ; σz2 ) the one relative to the pdf of the observed measures.
Remembering the one-dimensional Gaussian function in the form (1/(σ_x √(2π))) e^{−(x−μ_x)²/(2σ_x²)},
and similarly for the z, executing the product of the two Gaussians we obtain a new
Gaussian N(μc ; σc2 ) where the mean and variance of the product of two Gaussians
are given by

μ_c = (μ_x σ_z² + μ_z σ_x²)/(σ_x² + σ_z²) = μ_x + σ_x² (μ_z − μ_x)/(σ_x² + σ_z²)    (6.116)

σ_c² = (σ_x² σ_z²)/(σ_x² + σ_z²) = σ_x² − σ_x⁴/(σ_x² + σ_z²)    (6.117)

These equations represent the updating equations at the base of the process of pre-
diction/correction of the Kalman filter that was rewritten according to the symbolism
of the iterative process, and we have

μ_{x̂_{t|t}} = (μ_{x̂_{t|t−1}} σ²_{z_t} + μ_{z_t} σ²_{x_{t|t−1}})/(σ²_{x_{t|t−1}} + σ²_{z_t})
           = μ_{x̂_{t|t−1}} + [σ²_{x_{t|t−1}}/(σ²_{x_{t|t−1}} + σ²_{z_t})] (μ_{z_t} − μ_{x̂_{t|t−1}})    (6.118)

where the factor in square brackets is the Kalman Gain, and

σ²_{x̂_{t|t}} = (σ²_{x_{t|t−1}} σ²_{z_t})/(σ²_{x_{t|t−1}} + σ²_{z_t}) = σ²_{x_{t|t−1}} − [σ²_{x_{t|t−1}}/(σ²_{x_{t|t−1}} + σ²_{z_t})] σ²_{x_{t|t−1}}    (6.119)

By indicating with k the Kalman Gain, the previous equations are thus simplified:

μx̂t|t = μx̂t|t−1 + k(μz t − μx̂t|t−1 ) (6.120)

σx̂2t|t = σx2t|t−1 − kσx2t|t−1 (6.121)

The Kalman Gain and Eqs. (6.120) and (6.121) can be rewritten in matrix form to
handle the multidimensional Gaussian distributions N(μ; Σ), given by

K = Σ_{x_{t|t−1}} (Σ_{x_{t|t−1}} + Σ_{z_{t|t}})^{−1}    (6.122)
μ_{x̂_{t|t}} = μ_{x̂_{t|t−1}} + K (μ_{z_t} − μ_{x̂_{t|t−1}})    (6.123)

Σ_{x̂_{t|t}} = Σ_{x_{t|t−1}} − K Σ_{x_{t|t−1}}    (6.124)

Finally, we can derive the general equations of prediction and correction in matrix
form. This is possible considering the distribution of the prediction measures x̂_t
given by (μ_{x̂_{t|t−1}}; Σ_{x_{t|t−1}}) = (H_t x̂_{t|t−1}; H_t P_{t|t−1} H_t^T) and the distribution of the
observed measures z_t given by (μ_{z_{t|t}}; Σ_{z_{t|t}}) = (z_{t|t}; R_t). Replacing these values of
the prediction and correction distributions in (6.123), in (6.124), and in (6.122), we
get
H t x̂ t|t = H t x̂ t|t−1 + K (z t − H t x̂ t|t−1 ) (6.125)

H t P t|t H tT = H t P t|t−1 H tT − K (H t P t|t−1 H tT ) (6.126)

K = H_t P_{t|t−1} H_t^T (H_t P_{t|t−1} H_t^T + R_t)^{−1}    (6.127)

We can now cancel H_t from the front of each term of the last three equations (remembering
that one is hidden in the expression of K), and H_t^T from Eq. (6.126); we finally
get the following update equations:

x̂_{t|t} = x̂_{t|t−1} + K_t (z_t − H_t x̂_{t|t−1})    (6.128)

where (z_t − H_t x̂_{t|t−1}) is the measurement residual;

P_{t|t} = P_{t|t−1} − K_t H_t P_{t|t−1}    (6.129)

K_t = P_{t|t−1} H_t^T (H_t P_{t|t−1} H_t^T + R_t)^{−1}    (6.130)

where (H_t P_{t|t−1} H_t^T + R_t) is the residual covariance.

Equation (6.128) calculates, for the time t, the best new estimate of the state vector x̂_{t|t} of the
system, combining the prediction estimate x̂_{t|t−1} (calculated with Eq. 6.109)
with the residual (also known as innovation) given by the difference between the
observed measurements z_t and the expected measurements ẑ_{t|t} = H_t x̂_{t|t−1}.
We highlight (for Eq. 6.128) that the measurement residual is weighted by the
Kalman gain K_t, which establishes how much importance to give the residual
with respect to the predicted estimate x̂_{t|t−1}. We can also sense the importance of K
in filtering the residual. In fact, from Eq. (6.127), we observe that the value of K

depends strictly on the values of the covariance matrices R_t and P_{t|t−1}. If R_t → 0
(i.e., the observed measurements are very accurate), (6.127) tells us that K_t → H_t^{−1}
and the estimate of x̂_{t|t} depends heavily on the observed measurements. If instead the
prediction covariance matrix P_{t|t−1} → 0, this implies that H_t P_{t|t−1} H_t^T → 0; the
covariance matrix of the observed measurements R_t then dominates, and
consequently K_t → 0, so the filter mostly ignores the measurements and relies
instead on the prediction derived from the previous state (according to 6.128).
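To make the prediction/correction cycle concrete, the following minimal sketch (Python/NumPy, not taken from the text) implements one generic Kalman iteration in matrix form: the prediction step corresponding to Eqs. (6.109)-(6.110) and the correction step of Eqs. (6.128)-(6.130). The function name and the optional control arguments are illustrative assumptions.

import numpy as np

def kalman_step(x_prev, P_prev, z, F, H, Q, R, G=None, u=None):
    """One prediction/correction cycle of a linear Kalman filter.

    x_prev, P_prev : state estimate and covariance at time t-1 (posterior)
    z              : measurement observed at time t
    F, H           : state-transition and measurement matrices
    Q, R           : process-noise and measurement-noise covariances
    G, u           : optional control matrix and control input
    """
    # --- Prediction (a priori estimate), cf. Eqs. (6.109)-(6.110) ---
    x_pred = F @ x_prev
    if G is not None and u is not None:
        x_pred = x_pred + G @ u
    P_pred = F @ P_prev @ F.T + Q

    # --- Correction (measurement update), cf. Eqs. (6.128)-(6.130) ---
    S = H @ P_pred @ H.T + R                 # residual (innovation) covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    residual = z - H @ x_pred                # measurement residual (innovation)
    x_post = x_pred + K @ residual
    P_post = P_pred - K @ H @ P_pred
    return x_post, P_post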
We can now complete the example of ball tracking based on the Kalman filter to
optimally control the noise associated with the dynamics of the ball moving at high
speed. The configuration of the acquisition system is the one shown in Fig. 6.28,
where a high frame rate camera (400 fps) continuously acquires 1920 × 1080 pixel
image sequences. As shown in the figure, it is interesting to estimate the horizontal
trajectories of the ball over time, analyzing the images of the sequence in which the
temporal distance between consecutive images is Δt = 1/400 s. The acquisition system
(camera-optics) has such a field angle that it observes an area of the scene that in
the horizontal direction corresponds to an amplitude of about 12 m. Considering the
horizontal resolution of the camera, the position of the ball is estimated with the
resolution of 1 pixel which corresponds in the spatial domain to 10 mm.
The prediction equation for calculating the position x_{t|t−1} is given by (6.106) with the matrix

F = [[1, Δt], [0, 1]] = [[1, 1/400], [0, 1]]

that represents the dynamics of the system, while
the state vector of the system is given by xt = [xt vt ]T where vt is the horizontal
speed of the ball, assumed constant, considering the high frame rate of the sequence
of acquired images (this also justifies excluding the gravitational effect on the ball).
The error εt of the motion model is assumed to be constant in time t. The initial state
of the system xt0 = [xt0 vt0 ]T is estimated as follows. An algorithm of ball detection,
as soon as the ball appears in the scene, detects it (analyzing the whole image) in a
limited number of consecutive images and determines the initial state (position and
velocity) of the system.
The initial estimate of the speed may not be very accurate compared to the real
one. In this context, (6.106) defines the motion equation:

x̂_{t|t−1} = F x̂_{t−1|t−1} = [[1, Δt], [0, 1]] [x̂_{t−1|t−1}; v_{t−1}] = x̂_{t−1|t−1} + v_{t−1} Δt,

where v_{t−1} is not measured. The covariance matrix of the initial state, which represents
the uncertainty of the state variables, results in P_{t_0} = [[σ_s², 0], [0, σ_s²]], where σ_s is the
standard deviation of the motion model associated only with the horizontal position
of the ball, while assuming zero for the velocity (measure not observed). The control
variables u and its control matrix G are both null in this example, as is the covariance
matrix Q_t = [[0, 0], [0, 0]], which defines the uncertainty of the model. It is, therefore,
assumed that the ball does not suffer significant slowdowns in the time interval of a
few seconds during the entire tracking.

The equation of measures (6.105) becomes

z_t = H_t x̂_{t|t−1} + η_{t−1} = [1  0] [x̂_{t|t−1}; v_{t−1}] + η_{t−1} = x̂_{t|t−1} + η_{t−1},

where the measurement matrix has only one nonzero element since only the hor-
izontal position is measured while H (2) = 0 since the speed is not measured.
Measurements noise η is controlled by the covariance matrix R, which in this case is
a scalar r , associated only with the measure xt . The uncertainty of the measurements
is modeled as Gaussian noise controlled with r = σm2 assuming constant variance in
the update process.
The simplified update equations are

K_t = P_{t|t−1} H^T (H P_{t|t−1} H^T + r)^{−1}
P_{t|t} = P_{t|t−1} − K_t H P_{t|t−1}
x̂_{t|t} = x̂_{t|t−1} + K_t (z_t − H x̂_{t|t−1})

Figure 6.31 shows the results of the Kalman filter for the ball tracking considering
only the motion in the direction of the x-axis. The ball is assumed to have a speed of
80 km/h = 2.2222 · 10^4 mm/s. The uncertainty of the motion model (with constant
speed) is assumed to be null (with covariance matrix Q = 0), and any slowing down
or acceleration of the ball is treated as noise. The covariance matrix P_t checks
the error due to the process at each time t and indicates whether we should give
more weight to the new measurement or to the estimate of the model according to
Eq. (6.129).
In this example, assuming a zero-noise motion model, the state of the system is
controlled by the variance of the state variables reported in the terms of the main
diagonal of P_t. Previously, we indicated with σ_s² the variance (error) of the variable
x_t, and the confidence matrix of the model P_t is predicted, at each time step, based on
the previous value, through (6.110). The Kalman filter results shown in the figure are
obtained with an initial value of σ_s = 10 mm. Measurement noise is instead characterized
by a standard deviation σ_m = 20 mm (note that in this example the units of the
state variables and of the observed measurements are homogeneous,
expressed in mm).
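A hedged simulation sketch of the experiment just described (constant-velocity model at 400 fps, σ_s = 10 mm, σ_m = 20 mm, true speed of about 80 km/h ≈ 2.2222·10^4 mm/s, initial speed wrong by 50%) could be written as follows; the random-noise realization, the number of frames, and the variable names are assumptions made only to mimic the setup of Fig. 6.31, not the authors' actual code.

import numpy as np

np.random.seed(0)
fps, n_frames = 400.0, 1000
dt = 1.0 / fps
v_true = 2.2222e4               # mm/s (about 80 km/h)
sigma_s, sigma_m = 10.0, 20.0   # model and measurement std dev (mm)

F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity dynamics
H = np.array([[1.0, 0.0]])              # only the position is measured
Q = np.zeros((2, 2))                    # model noise assumed null
R = np.array([[sigma_m ** 2]])

# True trajectory and noisy position measurements
t = np.arange(n_frames) * dt
x_true = v_true * t
z = x_true + np.random.randn(n_frames) * sigma_m

# Initial state: correct position, speed wrong by 50% (assumption)
x = np.array([0.0, 0.5 * v_true])
P = np.diag([sigma_s ** 2, sigma_s ** 2])

estimates = []
for k in range(n_frames):
    # Prediction
    x = F @ x
    P = F @ P @ F.T + Q
    # Correction
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z[k] - H @ x)
    P = P - K @ H @ P
    estimates.append(x.copy())

estimates = np.array(estimates)
print("final speed estimate (mm/s):", estimates[-1, 1])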
The Kalman filter was applied with an initial speed value wrong by 50% (graphs in
the first row) and by 20% (graphs in the second row) with respect to the real one.
In the figure, it is observed that the filter nevertheless converges toward the real
values of the speed, although at different epochs depending on the initial error.
Convergence occurs through the filtering of the errors (of the model and of the
measurements), and it is significant to analyze the qualitative trend of the P matrix
and of the gain K (which asymptotically tends to a minimum value) as the variances
σ_s² and σ_m² vary. In general, if the variables are initialized with significant values,
the filter converges faster. If the model corresponds well to a real situation, the state of the system is
Fig. 6.31 Kalman filter results for the ball tracking considering only the dominant horizontal motion
(x-axis) and neglecting the effect of gravity. The first column shows the graphs of the estimated
position of the ball, while the second column shows the estimated velocities. The graphs of
the two rows refer to two initial velocity starting conditions (in the first row the initial velocity error
is very high, 50%, while in the second row it is 20%); it is observed that after a short time the
initial speed error is quickly filtered by the Kalman filter

well updated despite the presence of measures observed with considerable error (for
example, 20−30% of error).
If instead the model does not reproduce a real situation well, even with measurements
that are not very noisy, the state of the system presents a drift with respect to the
true measures. If the model is poorly defined, there will be no good estimate. In this
case, it is worth making the model weigh less by increasing its estimated
error. This allows the Kalman filter to rely more on the measurement values while
still allowing some noise removal. In essence, it is convenient to tune the measurement
error η and verify the effects on the system. Finally, it should be noted
that the gain K tends to give greater weight to the observed measures if it has high
values; on the contrary, it weighs the prediction model more if it has low values.
In real applications, the filter does not always achieve the optimality conditions
provided by the theory, but it is used anyway, giving acceptable results in
various tracking situations and, in general, to model the dynamics of systems based
on prediction/correction so as to minimize the covariance of the estimated error. In the case
of the ball tracking, optimally predicting the ball position is essential to significantly
reduce the ball searching region, and consequently to appreciably reduce the
search time of the ball detection algorithm (essential in this application context,
which requires processing several hundred images per second). For
nonlinear dynamic models or nonlinear measurement models, the Extended Kalman
Filter (EKF) [9,27] is used, which solves the problem (albeit not very well) by applying
the classical Kalman filter to the linearization of the system around the current
estimate.

6.5 Motion in Complex Scenes

In this section, we will describe some algorithms that continuously (in real time)
detect the dynamics of the scene characterized by the different entities in motion and
by the continuous change of environmental conditions. This is the typical situation
that arises for the automatic detection of complex sporting events (soccer, basketball,
tennis, . . .) where the entities in motion can reach high speeds (even hundreds of
km/h) in changing environmental conditions and the need to detect the event in real
time. For example, the automatic detection of the offside event in football would
require the simultaneous tracking of different entities, recognizing the
class to which each belongs (ball, the two goalkeepers, players of team A and B, referee
and assistants), processing the event data (for example, the player who has the ball, his
position and that of the other players at the moment he hits the ball, the player who
receives the ball), and making the decision in real time (in a few seconds) [3].
In the past, technological limits of vision and processing systems prevented the
possibility of realizing vision machines for the detection, in real time, of such complex
events under changing environmental conditions. Many of the traditional algorithms
of motion analysis and object recognition fail in these operational contexts. The goal
is to find robust and adaptive solutions (with respect to changing light conditions,
recognizing dynamic and static entities, and arbitrary complex configurations of mul-
tiple moving entities) by choosing algorithms with adequate computational complex-
ity and immediate operation. In essence, algorithms are required that automatically
learn the initial conditions of the operating context (without manual initialization)
and automatically learn how the conditions change.
Several solutions are reported in the literature (for tracking people, vehicles, . . .),
essentially based on fast and approximate methods such as background
subtraction (BS), which is the most direct way to detect and track the motion of moving
entities in a scene observed by stationary vision machines (with frame rates even
higher than the standard 25 fps) [28–30]. Basically, the BS methods label as
"dynamic" the pixels at time t whose gray level or color information changes significantly
compared to the pixels belonging to the background. This simple and fast method,
valid in the context of a stationary camera, is not always valid, especially when the

light conditions change and in all situations when the signal-to-noise ratio becomes
unacceptable, also due to the noise inherent to the acquisition system.
This has led to the development of some background models, also based on statistical
approaches [29,31], to mitigate the instability of simple BS approaches. These
new BS methods must not only robustly model the noise of the acquisition systems
but must also adapt to rapid changes in environmental conditions. Another
strategic aspect concerns the management of shadows and of temporary occlusions of moving
entities with respect to the background. There is also the need to handle the difference
between recorded video sequences (broadcast video) and video sequences acquired
in real time, which has an impact on the types of BS algorithms to be used. We now
describe the most common BS methods.

6.5.1 Simple Method of Background Subtraction

Several BS methods are proposed that are essentially based on the assumption that
the sequence of images is acquired in the context of a stationary camera that observes
a scene with stable background B with respect to which one sees objects in motion
that normally have an appearance (color distribution or gray levels) distinguishable
from B. These moving pixels represent the foreground, or regions of interest of
ellipsoidal or rectangular shape (also known as blob, bounding box, cluster, . . .). The
general strategy, which distinguishes the pixels of moving entities (vehicle, person,
. . .) from the static ones (unchangeable intensity) belonging to the background, is
shown in Fig. 6.32. This strategy involves the continuous comparison between the
current image and the background image, the latter appropriately updated through a
model that takes into account changes in the operating context. A general expression
that for each pixel (x, y) evaluates this comparison between current background B
and image I t at time t is the following:

D(x, y) = { 1  if d[I_t(x, y), B(x, y)] > τ;   0  otherwise }    (6.131)

where d[•] represents a metric to evaluate such a comparison, τ is a threshold value
beyond which the pixel (x, y) is labeled with "1" to indicate that it belongs to
an object (“0” non-object), and D(x, y) represents the image-mask of the pixels
in motion. The various BS methods are distinguished by the type of metric and its
background model used.
A very simple BS method is based on the absolute value of the difference between
the background image and the current image Dτ (x, y) = |I t (x, y)− B(x, y)|, where
we assume a significant difference in appearance (in terms of color or levels of gray)
between background and moving objects. A simple background model is obtained by
updating it with the previous image B(x, y) = I t−1 (x, y) and Eq. (6.131) becomes

D_{t,τ}(x, y) = { 1  if |I_t(x, y) − B_{t−1}(x, y)| > τ;   0  otherwise }    (6.132)

Fig. 6.32 Functional scheme of the process of detecting moving objects based on the simple method
of background subtraction. The current frame of the sequence is compared to the current model of
static objects (the background) to detect moving objects (foreground)

The background is then updated with the last acquired image, B_t(x, y) = I_t(x, y), so that
(6.132) can be reapplied to subsequent images. When a moving object is detected
and then stops, with this simple BS method the object disappears from D_{t,τ}. Furthermore,
it is difficult to detect and recognize the object when its dominant motion
is not lateral (for example, if it moves away from or toward the camera).
The results of this simple method depend very much on the threshold value
adopted, which can be chosen manually or automatically by previously analyzing the
histograms of both background and object images.
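As a reference, a minimal sketch of this simple frame-differencing BS method (Eq. 6.132) in Python/NumPy could be the following; the threshold value is an arbitrary assumption.

import numpy as np

def frame_difference_mask(frame, prev_frame, tau=25):
    """Binary motion mask by simple background subtraction, cf. Eq. (6.132).

    The background is modeled by the previous frame; pixels whose absolute
    gray-level difference exceeds the threshold tau are labeled as moving ("1").
    """
    frame = frame.astype(np.int16)
    prev_frame = prev_frame.astype(np.int16)
    return (np.abs(frame - prev_frame) > tau).astype(np.uint8)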

6.5.2 BS Method with Mean or Median

A first step to get a more robust background is to consider the average or the median
of the n previous images. In essence, an attempt is made to attenuate the noise present
in the background due to the small movement of objects (leaves of a tree, bushes,
. . .) that are not part of the objects of interest. A filtering operation based on the average
or median (see Sect. 9.12.4 Vol. I) of the n previous images is applied. With the method of the
mean, the background is modeled with the arithmetic mean of the n images kept in
memory:

B_t(x, y) = (1/n) ∑_{i=0}^{n−1} I_{t−i}(x, y)    (6.133)

where n is closely related to the acquisition frame rate and object speed.
Similarly, the background can be modeled with the median filter for each pixel of
all temporarily stored images. In this case, it is assumed that each pixel has a high

probability of remaining static. The estimation of the background model is given by

B_t(x, y) = median_{i∈{0,1,...,n−1}} {I_{t−i}(x, y)}    (6.134)

The mask image for both the mean and the median is given by:

D_{t,τ}(x, y) = { 1  if |I_t(x, y) − B_t(x, y)| > τ;   0  otherwise }    (6.135)

Appropriate values of n and of the frame rate produce a correctly updated background and a
realistic foreground mask of moving objects, with no phantom or missing objects.
These methods are among the nonrecursive adaptive background-updating techniques,
in the sense that they depend only on the images currently stored and maintained
in the system. Although easy to implement and fast, they have the drawbacks
of nonadaptive methods, meaning that they can only be used for short-term tracking
without significant changes to the scene. When an error occurs, it is necessary to
reinitialize the background; otherwise, the errors accumulate over time. They also
require an adequate memory buffer to keep the last n images acquired. Finally, the
choice of the global threshold can be problematic.
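A possible sketch of the median-based background model (Eqs. 6.134-6.135), keeping a buffer of the last n frames, is given below; the values of n and τ are illustrative assumptions.

import numpy as np
from collections import deque

class MedianBackground:
    """Background model as the per-pixel median of the last n frames, cf. Eq. (6.134)."""

    def __init__(self, n=25, tau=30):
        self.buffer = deque(maxlen=n)   # memory buffer of the last n frames
        self.tau = tau

    def apply(self, frame):
        self.buffer.append(frame.astype(np.uint8))
        background = np.median(np.stack(self.buffer, axis=0), axis=0)
        mask = np.abs(frame.astype(np.float32) - background) > self.tau
        return mask.astype(np.uint8), background.astype(np.uint8)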

6.5.2.1 BS Method Based on the Moving Average


To avoid keeping a history of the last n frames acquired, the background can be
updated parametrically with the following formula:

B t (x, y) = α I t (x, y) + (1 − α)B t−1 (x, y) (6.136)

where the parameter α, seen as a learning parameter (assumes a value between 0.01
and 0.05), models the update of the background B t (x, y) at time t, weighing the
previous value B_{t−1}(x, y) and the current value of the image I_t(x, y). In essence,
the current image is blended into the background model image via the parameter
α. If α = 0, (6.136) reduces to B_t(x, y) = B_{t−1}(x, y), the background remains
unchanged, and the mask image is calculated by the simple subtraction method
(6.133). If instead α = 1, (6.136) reduces to B_t(x, y) = I_t(x, y), producing the
simple difference between images.
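A minimal sketch of the moving-average update of Eq. (6.136), with the foreground mask of Eq. (6.135), might look as follows (α and τ are illustrative assumptions):

import numpy as np

class RunningAverageBackground:
    """Background updated as a moving average, B_t = alpha*I_t + (1-alpha)*B_{t-1}."""

    def __init__(self, alpha=0.03, tau=30):
        self.alpha = alpha      # learning rate, typically 0.01-0.05
        self.tau = tau
        self.background = None

    def apply(self, frame):
        frame = frame.astype(np.float32)
        if self.background is None:
            self.background = frame.copy()   # bootstrap with the first frame
        else:
            self.background = self.alpha * frame + (1.0 - self.alpha) * self.background
        mask = (np.abs(frame - self.background) > self.tau).astype(np.uint8)
        return mask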

6.5.3 BS Method Based on the Moving Gaussian Average

This method [29] proposes approximating the distribution of the values of each
pixel in the last n images with a Gaussian probability density function (unimodal).
Therefore, in the hypothesis of gray-level images, two maps are maintained, one for
the average and one for the standard deviation. In the initialization phase, the two
background maps μ_t(x, y) and σ_t(x, y) are created to characterize each pixel with
its own pdf with the parameters, respectively, of the average μ_t(x, y) and of the
variance σ²_t(x, y). The maps of the original background are initialized by acquiring
the images of the scene without moving objects and calculating the average and the
variance for each pixel. To manage the changes of the existent background, due to
the variations of the ambient light conditions and to the motion of the objects, the
two maps are updated in each pixel for each current image at the time t, calculating
the moving average and the corresponding moving variance, given by

μ_t(x, y) = α I_t(x, y) + (1 − α) μ_{t−1}(x, y)
σ²_t(x, y) = d² α + (1 − α) σ²_{t−1}(x, y)    (6.137)

with
d(x, y) = |I_t(x, y) − μ_t(x, y)|    (6.138)

where d indicates the Euclidean distance between the current value I_t(x, y) of the
pixel and its average, and α is the learning parameter of the background update model.
Normally α = 0.01 and, as evidenced by (6.137), the update tends to give little weight to
the value of the current pixel I_t(x, y) if this is classified as foreground, to avoid merging it into the
background. Conversely, for a pixel classified as background, the value of α should be
chosen based on the need for stability (lower value) or fast update (higher value).
Therefore, the current average in each pixel is updated based on the weighted average
of the previous values and the current value of the pixel. It is observed that with the
adaptive process given by (6.137) only the values of the average and of the variance of
each pixel are accumulated, requiring little memory and allowing high execution speed.
The pixel classification is performed by evaluating the absolute value of the difference
between the current value and the current average of the pixel with respect
to a confidence threshold value τ as follows:

I_t = { Foreground  if |I_t(x, y) − μ_t(x, y)| / σ_t(x, y) > τ;   Background  otherwise }    (6.139)

The value of the threshold depends on the context (good results can be obtained
with τ = 2.5), even if it is normally chosen as a factor k of the standard
deviation, τ = kσ. This method is easily applicable also to color images [29] or
multispectral images, maintaining two background maps for each color or spectral
channel. This method has been successfully experimented for indoor applications,
with the exception of cases with a multimodal background distribution.
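The per-pixel running Gaussian model of Eqs. (6.137)-(6.139) can be sketched as follows; the selective update of foreground pixels and the initial variance value are assumptions consistent with the discussion above, not the original implementation of [29]:

import numpy as np

class GaussianAverageBackground:
    """Per-pixel unimodal Gaussian background (running mean and variance),
    following the update rules of Eqs. (6.137)-(6.139)."""

    def __init__(self, alpha=0.01, tau=2.5, init_var=50.0):
        self.alpha, self.tau = alpha, tau
        self.mean = None
        self.var = None
        self.init_var = init_var

    def apply(self, frame):
        frame = frame.astype(np.float32)
        if self.mean is None:
            self.mean = frame.copy()
            self.var = np.full_like(frame, self.init_var)
        d = np.abs(frame - self.mean)
        # Classification: foreground if |I - mu| / sigma > tau, cf. Eq. (6.139)
        foreground = d / np.sqrt(self.var) > self.tau
        # Update mean and variance only for background pixels (selective update)
        a = np.where(foreground, 0.0, self.alpha)
        self.mean = a * frame + (1.0 - a) * self.mean
        self.var = a * d ** 2 + (1.0 - a) * self.var
        return foreground.astype(np.uint8)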

6.5.4 Selective Background Subtraction Method

In this method, a sudden change of a pixel is considered as foreground and the
background model is ignored (background not updated). This prevents the background
model from being modified by pixels that do not actually belong to the

background. The direct formula for classifying a pixel between moving object and
background is the following:


I_t = { Foreground  if |I_t(x, y) − I_{t−1}(x, y)| > τ (no background update);   Background  otherwise }    (6.140)

The methods based on the moving average, seen in the previous paragraphs, are actually
also selective.

6.5.5 BS Method Based on Gaussian Mixture Model (GMM)

So far methods have been considered where the background update model is based
on the recent pixel history. Only with the Gaussian moving average method was the
background modeled with the statistical parameters of average and variance of each
pixel of the last images with the assumption of a unimodal Gaussian distribution. No
spatial correlation was considered with the pixels in the vicinity of the one being pro-
cessed. To handle more complex application contexts where the background scene
includes structures with small movements not to be regarded as moving objects (for
example, small leaf movements, trees, temporarily generated shadows, . . .) differ-
ent methods have been proposed based on models of background with multimodal
Gaussian distribution [31].
In this case, the value of a pixel is modeled over time as a stochastic process rather
than modeling the values of all pixels with a single type of distribution. The method
determines which Gaussian a background pixel can correspond to. The values of the
pixels that do not fit the background are considered part of the objects, until they are
associated with a Gaussian that includes them in a consistent and coherent way. In
the analysis of the temporal sequence of images, it happens that the significant variations
are due to the moving objects compared to the stationary ones. The distribution
of each pixel is modeled with a mixture of K Gaussians N(μ_{i,t}(x, y), Σ_{i,t}(x, y)).
The probability P of the occurrence of an RGB pixel in the location (x, y) of the
current image t is given by


P(I_{i,t}(x, y)) = ∑_{i=1}^{K} ω_{i,t}(x, y) N(μ_{i,t}(x, y), Σ_{i,t}(x, y))    (6.141)

where ω_{i,t}(x, y) is the weight of the ith Gaussian. To simplify, as suggested by the
author, the covariance matrix Σ_{i,t}(x, y) can be assumed to be diagonal, and in this
case we have Σ_{i,t}(x, y) = σ²_{i,t}(x, y) I, where I is the 3 × 3 identity matrix in the case
of RGB images. The number K of Gaussians depends on the operational context and
the available resources (in terms of computation and memory), even if it is normally
between 3 and 5.
Now let’s see how the weights and parameters of the Gaussians are initialized
and updated as the images I t are acquired in real time. By virtue of (6.141), the
distribution of recently observed values of each pixel in the scene is characterized by

the Gaussian mixture. With the new observation, i.e., the current image I t , each pixel
will be associated with one of the Gaussian components of the mixture and must be
used to update the parameters of the model (the Gaussians). This is implemented as a
kind of classification, for example, the K-means algorithm. Each new pixel I t (x, y)
is associated with the Gaussian component for which the value of the pixel is within
2.5 standard deviations (that is, the distance is less than 2.5σi ) of its average. This 2.5
threshold value can be changed slightly, producing a slight impact on performance. If
a new It (x, y) pixel is associated with one of the Gaussian distributions, the relative
parameters of the mean μ_{i,t}(x, y) and variance σ²_{i,t}(x, y) are updated as follows:

μ_{i,t}(x, y) = (1 − ρ) μ_{i,t−1}(x, y) + ρ I_t(x, y)
σ²_{i,t}(x, y) = (1 − ρ) σ²_{i,t−1}(x, y) + ρ [I_t(x, y) − μ_{i,t}(x, y)]²    (6.142)

while the previous weights of all the Gaussians are updated as follows:

ωi,t (x, y) = (1 − α)ωi,t−1 (x, y) + α Mi,t (x, y) (6.143)

where α is the user-defined learning parameter, ρ is a second learning parameter
defined as ρ = α N(I_t | μ_{i,t−1}(x, y), σ²_{i,t−1}(x, y)), and M_{i,t}(x, y) = 1 indicates that
the I_t(x, y) pixel is associated with the Gaussian (for which the weight is increased),
while it is zero for all the others (for which the weight is decreased).
If, on the other hand, It (x, y) is not associated with any of the Gaussian mixtures,
It (x, y) is considered a foreground pixel, the least probable distribution is replaced
by a new distribution with the current value as a mean value (μt = It ), initialized with
a high variance and a low weight value. The previous weights of the K Gaussians are
updated with (6.143) while the parameters μ and σ of the same Gaussians remain
unchanged.
The author defines the least probable distribution with the following heuristic: the
Gaussians that have a greater population of pixels and the minimum variance should
correspond to the background. With this assumption, the Gaussians are ordered with
respect to the ratio ω/σ. In this way, this ratio increases when the Gaussians have
broad support and as the variance decreases. Then the first B distributions are simply
chosen as a background template:

B = argmin_b ( ∑_{i=1}^{b} ω_i > T )    (6.144)

where T indicates the minimum portion of the image that should be background
(characterized with distribution with high value of weight and low variance). Slowly
moving objects take longer to include in the background because they have more
variance than the background. Repetitive variations are also learned and a model
is maintained for the distribution of the background, which leads to faster recovery
when objects are removed from subsequent images.
The simple BS methods (difference of images, average and median filtering),
although very fast, use a global threshold to detect the change of the scene and are
inadequate in complex real scenes. A method that models the background adaptively
with a mixture of Gaussians better controls real complex situations, where often the
background is bimodal, with long-term scene changes and confused repetitive movements
(for example, caused by the temporary overlapping of moving objects).
Often better results are obtained by combining the adaptive approach with temporal
information on the dynamics of the scene or by combining local information deriving
from simple BS methods.
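In practice, a per-pixel Gaussian mixture background model of this kind is available in common libraries; assuming OpenCV is installed, a minimal usage sketch (not the original Stauffer-Grimson implementation, and with a hypothetical input file name) is:

import cv2

# Gaussian-mixture background subtractor (MOG2 variant); the history length and the
# squared-distance threshold play roles analogous to n and the 2.5*sigma test above.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)

cap = cv2.VideoCapture("sequence.avi")   # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)    # 255 = foreground, 127 = shadow, 0 = background
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) == 27:             # ESC to quit
        break
cap.release()
cv2.destroyAllWindows()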

6.5.6 Background Modeling Using the Kernel Density Estimation Statistical Method

A nonparametric statistical method for modeling the background [32] is based on
Kernel Density Estimation (KDE), where the multimodal pdf is estimated directly
from the data without any assumption on the intrinsic data distribution. With the
KDE approach it is possible to estimate the pdf by analyzing the most recent data of
a stochastic process through a kernel function (for example, the uniform or the normal
distribution), with the properties of being nonnegative and even (symmetric) and
of having integral equal to 1 over its support interval. In essence, for each data item
a kernel function is created centered on that data item, thus ensuring that the kernel
is symmetric. The pdf is then estimated by summing all the kernel functions, normalized
with respect to the number of data, so as to satisfy the properties of a pdf of
being nonnegative and with integral equal to 1 on the entire definition support.
The simplest kernel function is the uniform rectangular function (also known as
the Parzen window), where all the data have the same weight without considering
their distance from the center with zero mean. Normally, the Gaussian kernel function
K (radially symmetric and unimodal) is used instead, which leads to the following
estimate of the pd f distribution of the background, given by

P_kde(I_{i,t}(x, y)) = (1/n) ∑_{i=t−n}^{t−1} K(I_t(x, y) − I_i(x, y))    (6.145)

where n is the number of the previous images used to estimate the pd f distribution
of the background using the Gaussian kernel function K .
A pixel I_t(x, y) is labeled as background if P_kde(I_t(x, y)) > T, where T is a
predefined threshold; otherwise it is considered a foreground pixel. The threshold T is
appropriately adapted in relation to the number of false positives acceptable for the
application context. The KDE method is also extended to multivariate variables and
is immediately usable for multispectral or color images. In this case, the kernel function
is obtained from the product of one-dimensional kernel functions, and (6.145)
becomes

P_kde(I_{i,t}(x, y)) = (1/n) ∑_{i=t−n}^{t−1} ∏_{j=1}^{m} K( (I_t^{(j)}(x, y) − I_i^{(j)}(x, y)) / σ_j )    (6.146)

where I represents an image with m components (for RGB images, m = 3) and
σ_j is the standard deviation (the smoothing parameter that controls the
difference between estimated and real data) associated with each component kernel
function, all assumed to be Gaussian. This method attempts to classify the
pixels between background and foreground by analyzing only the last images of
the time sequence, forgetting the past situation and updating the pixels of the scene
considering only the recent ones of the last n observations.
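A simple sketch of the KDE background classification of Eq. (6.145), for gray-level images and with a Gaussian kernel, could be the following; the bandwidth σ, the buffer length n, and the threshold T are illustrative assumptions:

import numpy as np
from collections import deque

class KDEBackground:
    """Nonparametric background model with a Gaussian kernel, cf. Eq. (6.145).

    The pdf of each pixel is estimated from its last n samples; a pixel is
    labeled background if the estimated density of its current value exceeds T.
    """

    def __init__(self, n=50, sigma=15.0, T=1e-3):
        self.samples = deque(maxlen=n)   # last n gray-level frames
        self.sigma = sigma               # kernel bandwidth (smoothing parameter)
        self.T = T

    def apply(self, frame):
        frame = frame.astype(np.float32)
        if len(self.samples) == 0:
            self.samples.append(frame)
            return np.zeros(frame.shape, dtype=np.uint8)
        history = np.stack(self.samples, axis=0)          # shape (n, H, W)
        diff = frame[None, :, :] - history
        kernel = np.exp(-0.5 * (diff / self.sigma) ** 2) / (self.sigma * np.sqrt(2 * np.pi))
        p = kernel.mean(axis=0)                           # KDE estimate per pixel
        foreground = (p <= self.T).astype(np.uint8)
        self.samples.append(frame)
        return foreground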

6.5.7 Eigen Background Method

To model the background, it is possible to use principal component analysis
(PCA, introduced in Sect. 2.10.1 Vol. II), as proposed in [33]. Compared to the previous
methods, which process each pixel individually, with PCA the whole sequence
of n images is processed to calculate the eigenspace that adaptively models the
background. In other words, with PCA the background is modeled
globally, resulting more robust to noise and reducing the dimensionality of the
data. The eigen background method involves two phases:

Learning phase. The pixels I_j of the ith image of the sequence are organized in a
column vector I_i = {I_{1,i}, . . . , I_{j,i}, · · · , I_{N,i}} of size N × 1 which allocates all N
pixels of the image. The entire sequence of images is organized in n columns in the
matrix I of size N × n, of which the average image μ = (1/n) ∑_{i=1}^{n} I_i is calculated.
Then, the matrix X = [X_1, X_2, . . . , X_n] of size N × n is calculated, where each
of its column vectors (images) has zero mean, given by X_i = I_i − μ. Next, the
covariance matrix C = E{X_i X_i^T} ≈ (1/n) X X^T is calculated. By virtue of the PCA
transform it is possible to diagonalize the covariance matrix C by calculating the
eigenvector matrix Φ, obtaining

D = Φ C Φ^T    (6.147)

where D is the diagonal matrix of size n × n. Still according to the PCA
transform, we can consider only the first m < n principal components to model
the background in a smaller space, projecting into that space, with the first m
eigenvectors Φ_m (associated with the m largest eigenvalues), the new images I_t
acquired at time t.
Test phase. Once the eigen background and the average image μ are defined, a new column
image I_t is projected into this reduced eigenspace through the eigenvector
matrix Φ_m of size n × m, obtaining

B_t = Φ_m (I_t − μ)    (6.148)

where B_t represents the zero-mean projection of I_t in the eigenspace
described by Φ_m. The reconstruction B̃_t of I_t from the reduced eigenspace back into the
image space is given by the following inverse transform:

B̃_t = Φ_m^T B_t + μ    (6.149)

At this point, considering that the eigenspace described by Φ_m mainly models static
scenes and not dynamic objects, the image B̃_t reconstructed from the eigenspace does
not contain moving objects; these can instead be highlighted by comparing, with a metric
(for example, the Euclidean distance d_2), the input image I_t and the reconstructed
one B̃_t, as follows:

F_t(x, y) = { Foreground  if d_2(I_t, B̃_t) > T;   Background  otherwise }    (6.150)

where T is a predefined threshold value. As an alternative to the PCA principal components,
to reduce the dimensionality of the image sequence I and model the background,
the SVD decomposition of I can be used, given by

I = U Σ V^T    (6.151)

where U is an orthogonal matrix of size N × N and V^T is an orthogonal matrix of
size n × n. The singular values of the image sequence I are contained in the diagonal
matrix Σ in descending order. Considering the strong correlation of the images of
the sequence, it will be observed that the first m nonzero singular values will
be few, that is, m ≪ n.
It follows that the first m columns of U, denoted U_m, can be considered as the basis of
orthogonal vectors, equivalent to Φ_m, to model the background and create the reduced space
where the input images I_t are projected, then detecting foreground objects as done
with the previous PCA approach.
The eigen background approach based on PCA analysis or SVD decomposition
remains problematic when it is necessary to continuously update the background,
particularly when it needs to process a stream of video images, even if the SVD
decomposition is faster. Other solutions are proposed [34], for example, by adaptively
updating the eigen background and making the detection of foreground objects more
effective.
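A compact sketch of the two phases (learning and test) using the SVD formulation could be the following; note that here the eigen-images are stored as columns of size N, so the projection uses Φ_m^T, a slight change of notation with respect to the equations above, and the threshold T is an illustrative assumption:

import numpy as np

def learn_eigen_background(frames, m=10):
    """Learning phase: frames is a list of training images (no moving objects).

    Returns the mean image and the first m eigen-images (via SVD of the
    zero-mean data matrix), following the PCA/SVD scheme described above.
    """
    I = np.stack([f.astype(np.float64).ravel() for f in frames], axis=1)  # N x n
    mu = I.mean(axis=1, keepdims=True)
    X = I - mu
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return mu.ravel(), U[:, :m]          # mean image and basis Phi_m (N x m)

def detect_foreground(frame, mu, Phi_m, T=30.0):
    """Test phase: project, reconstruct and compare, cf. Eqs. (6.148)-(6.150)."""
    shape = frame.shape
    x = frame.astype(np.float64).ravel() - mu
    b = Phi_m.T @ x                      # projection into the reduced eigenspace
    recon = Phi_m @ b + mu               # reconstructed (background-only) image
    residual = np.abs(frame.astype(np.float64).ravel() - recon)
    return (residual.reshape(shape) > T).astype(np.uint8)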

6.5.8 Additional Background Models

In the parametric models, the probability density distribution (pdf) of the background
pixels is assumed to be known (for example, a Gaussian) described by its own charac-
teristic parameters (mean and variance). A semiparametric approach used to model
the variability of the background is represented for example by the Gaussian mixture
as described above. A more general method used in different applications (in particular
in Computer Vision) consists instead of trying to estimate the pdf directly
by analyzing the data without assuming a particular form of their distribution. This
approach is known as the nonparametric estimate of the distribution, for example,
the simplest one is based on the calculation of the histogram or the Parzen window
(see Sect. 1.9.4), known as kernel density estimation.
Among the nonparametric approaches is the mean-shift algorithm (see Sect. 5.8.2
Vol. II), an iterative method of ascending the gradient with good convergence prop-
erties that allows to detect the peaks (modes) of a multivariate distribution and the
related covariance matrix. The algorithm was adopted as an effective technique for
both the blob tracking and for the background modeling [35–37]. Like all nonpara-
metric models, we are able to model complex pd f , but the implementation requires
considerable computational and memory resources. A practical solution is to use the
mean-shift method only to model the initial background (the pd f of the initial image
sequence) and to use a propagation method to update the background model. This
strategy is proposed in [38], which propagates and updates the pd f with new images
in real time.

6.6 Analytical Structure of the Optical Flow of a Rigid Body

In this paragraph, we want to derive the geometric relations that link the motion
parameters of a rigid body, represented by a flat surface, and the optical flow induced
in the image plane (observed 2D displacements of intensity patterns in the image),
hypothesized corresponding to the motion field (projection of 3D velocity vectors on
the 2D image plane).
In particular, given a sequence of space-time-variant images acquired while objects
of the scene move with respect to the camera or vice versa, we want to find solutions
to estimate:

1. the 3D motion of the objects with respect to the camera by analyzing the 2D flow
field induced by the sequence of images;
2. the distance to the camera object;
3. the 3D structure of the scene.

As shown in Fig. 6.33a, the camera can be considered stationary and the object in
motion with speed V or vice versa. The optical axis of the camera is aligned with
the Z -axis of the reference system (X, Y, Z ) of the camera, with respect to which
the moving object is referenced. The image plane is represented by the plane (x, y)
perpendicular to the Z -axis at the distance f , where f is the focal point of the optics.
In reality, the optical system is simplified with the pinhole model and the focal
distance f is the distance between the image plane and the perspective projection
center located at the origin O of the reference system (X, Y, Z ). A point P =
(X, Y, Z ) of the object plane, in the context of perspective projection, is projected
in the image plane at the point p = (x, y) calculated with the perspective projection


Fig. 6.33 Geometry of the perspective projection of a 3D point of the scene with model pinhole.
a Point P in motion with velocity V with respect to the observer with reference system (X, Y, Z )
in the perspective center of projection (CoP), with the Z -axis coinciding with the optical axis and
perpendicular to the image plane (x, y); b 3D relative velocity in the plane (Y, Z ) of the point P
and the 2D velocity of its perspective projection p in the image plane (visible only y-axis)

equations (see Sect. 3.6 Vol. II), derived with the properties of similar triangles (see
Fig. 6.33b), given by
x = f X/Z,   y = f Y/Z   ⟹   p = f P/Z    (6.152)

Now let's imagine the point P(t) in motion with velocity V = dP/dt, which after a time Δt
moves to the 3D position P(t + Δt). The perspective projection of P at time t in
the image plane is p(t) = (x(t), y(t)), while in the next image, after the time interval Δt,
it is shifted to p(t + Δt). The apparent velocity of P in the image plane is indicated
with v, given by the components:

v_x = dx/dt,   v_y = dy/dt    (6.153)
These components are precisely those that generate the image motion field or repre-
sent the 2D velocity vector v = (vx , v y ) of p, perspective projection of the velocity
vector V = (Vx , Vy , Vz ) of the point P in motion.
To calculate the velocity v in the image plane, we can differentiate with respect
to t (using the quotient rule) the perspective Eq. (6.152) which, expressed
in vector terms, is p = f P/Z; we get the following:

v = dp(t)/dt = d(f P(t)/Z(t))/dt = f (Z V − V_z P)/Z²    (6.154)

whose components are

v_x = (f V_x − x V_z)/Z,   v_y = (f V_y − y V_z)/Z    (6.155)

while v_z = f V_z/Z − f V_z/Z = 0. From (6.154), it emerges that the apparent velocity is a
function of the velocity V of the 3D motion of P and of its depth Z with respect to the
image plane. We can reformulate the velocity components in terms of a perspective

projection matrix given by


⎡ ⎤
V
vx f 0 −x 1 ⎣ x ⎦
= Vy (6.156)
vy 0 f −y Z
Vz

The relative velocity of the point P with respect to the camera, in the context of a rigid
body (where all the points of the object have the same motion parameters), can
also be described in terms of an instantaneous rectilinear velocity T = (T_x, T_y, T_z)^T
and an angular velocity Ω = (Ω_x, Ω_y, Ω_z)^T (around the origin) by the following equation
[39,40]:

V = T + Ω × P    (6.157)

where the "×" symbol indicates the vector product. The components of V are

V_x = T_x + Ω_y Z − Y Ω_z
V_y = T_y − Ω_x Z + X Ω_z    (6.158)
V_z = T_z + Ω_x Y − X Ω_y

In matrix form the relative velocity V = (V_x, V_y, V_z) of P = (X, Y, Z) is given by

V = [V_x; V_y; V_z] = [[1, 0, 0, 0, Z, −Y], [0, 1, 0, −Z, 0, X], [0, 0, 1, Y, −X, 0]] [T_x; T_y; T_z; Ω_x; Ω_y; Ω_z]    (6.159)

The instantaneous velocity v = (v_x, v_y) of p = (x, y) can be computed by substituting
the relative velocity V of (6.159) into Eq. (6.156), obtaining

v = [v_x; v_y] = (1/Z) [[f, 0, −x], [0, f, −y]] [T_x; T_y; T_z] + (1/f) [[x·y, −(f² + x²), f·y], [f² + y², −x·y, −f·x]] [Ω_x; Ω_y; Ω_z]    (6.160)

(the first term is the translational component, the second the rotational component)

from which it emerges that the perspective motion of P in the image plane in p
induces a flow field v produced by the linear composition of the translational and
rotational motion. The translational flow component depends on the distance Z of
the point P which does not affect the rotational motion. For a better readability of
the flow induced in the image plane by the different possibilities of motion (6.160),

Fig. 6.34 Geometry of the perspective projection of 3D points of a moving plane

we can rewrite it in the following form:

v_x = (T_x f − T_z x)/Z − Ω_y f + Ω_z y + (Ω_x x y)/f − (Ω_y x²)/f
v_y = (T_y f − T_z y)/Z − Ω_x f + Ω_z x + (Ω_y x y)/f − (Ω_x y²)/f    (6.161)

where in each equation the first term is the translational component and the remaining terms form the rotational component.

We have, therefore, defined the model of perspective motion for a rigid body, assuming
zero optical distortions, which relates to each point of the image plane the apparent
velocity (motion field) of a 3D point of the scene at a distance Z, subject to the translational
motion T and rotation Ω. Other simpler motion models can be considered, such as
the weak perspective, orthographic, or affine models. From the analysis of the motion
field, it is possible to derive some parameters of the 3D motion of the objects.
In fact, once the optical flow (v_x, v_y) is known, with Eq. (6.161) we would have for
each point (x, y) of the image plane two bilinear equations in 7 unknowns: the depth
Z, the 3 translational velocity components of T, and the 3 angular velocity components
of Ω. The optical flow is a linear combination of T and Ω once Z is known, or it is a
linear combination of the inverse depth 1/Z and Ω once the translational
velocity T is known. Theoretically, the 3D structure of the object (the inverse of the depth 1/Z
for each image point) and the motion components (translational and rotational) can
be determined by knowing the optical flow at different points in the image plane.
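As an illustration, the motion field of Eq. (6.161) can be evaluated numerically for given motion parameters; the following sketch (Python/NumPy) simply transcribes the two equations, keeping the sign convention of (6.161) as written, and the numerical values in the example are arbitrary assumptions:

import numpy as np

def motion_field(x, y, Z, T, Omega, f):
    """Apparent image velocity (vx, vy) of points (x, y) with depth Z,
    for translation T = (Tx, Ty, Tz) and rotation Omega = (wx, wy, wz),
    following the roto-translational model of Eq. (6.161)."""
    Tx, Ty, Tz = T
    wx, wy, wz = Omega
    # Translational component (depends on the depth Z)
    vx_t = (Tx * f - Tz * x) / Z
    vy_t = (Ty * f - Tz * y) / Z
    # Rotational component (independent of Z)
    vx_r = -wy * f + wz * y + wx * x * y / f - wy * x ** 2 / f
    vy_r = -wx * f + wz * x + wy * x * y / f - wx * y ** 2 / f
    return vx_t + vx_r, vy_t + vy_r

# Example: radial flow induced by a pure translation along the optical axis
xs, ys = np.meshgrid(np.linspace(-100, 100, 5), np.linspace(-100, 100, 5))
vx, vy = motion_field(xs, ys, Z=2000.0, T=(0.0, 0.0, 500.0),
                      Omega=(0.0, 0.0, 0.0), f=50.0)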
For example, if the dominant surface of an object is a flat surface (see Fig. 6.34)
it can be described by
P · nT = d (6.162)

where d is the perpendicular distance of the plane from the origin of the reference
system, P = (X, Y, Z ) is a generic point of the plane, and n = (n x , n y , n z )T is the
normal vector to the flat surface as shown in the figure. In the hypothesis of transla-
tory and rotatory motion of the flat surface with respect to the observer (camera), the
normal n and the distance d vary in time. Using Eq. (6.152) of the perspective projec-
tion, solving with respect to the vector P, the spatial position of the point belonging
to the plane is obtained again:

P = p Z / f    (6.163)

which, substituted in the equation of the plane (6.162) and solving with respect to the
inverse of the depth 1/Z of P, gives

1/Z = (p · n)/(f d) = (n_x x + n_y y + n_z f)/(f d)    (6.164)
Substituting Eq. (6.164) in the equations of the motion field (6.161) we get

v_x = (1/(f d)) (a_1 x² + a_2 x y + a_3 f x + a_4 f y + a_5 f²)
v_y = (1/(f d)) (a_1 x y + a_2 y² + a_6 f y + a_7 f x + a_8 f²)    (6.165)

where

a_1 = −d Ω_y + T_z n_x     a_2 = d Ω_x + T_z n_y
a_3 = T_z n_z − T_x n_x    a_4 = d Ω_z − T_x n_y
a_5 = −d Ω_y − T_x n_z     a_6 = T_z n_z − T_y n_y    (6.166)
a_7 = −d Ω_z − T_y n_x     a_8 = d Ω_x − T_y n_z

Equations (6.165) define the motion field of a planar surface, represented by a
quadratic polynomial expressed in the coordinates of the image plane (x, y) at the
distance f from the projection center. With the new Eq. (6.165), having an estimate of
the optical flow (v_x, v_y) at different points of the image, it would theoretically be possible
to recover the eight independent parameters (3 of T, 3 of Ω, and 2 of n, with
known f) describing the motion and structure of the planar surface. It is sufficient to
have at least 8 points in the image plane to estimate the 8 coefficients, seen as global
parameters describing the motion of the planar surface.
Longuet-Higgins [41] highlighted the ambiguities in recovering the structure of
the plane surface from the instantaneous knowledge of the optical flow (vx , v y ) using
Eq. (6.165). In fact, different planes with different motion can produce the same flow
field. However, if the n and T vectors are parallel, these ambiguities are attenuated.
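Given estimates of the optical flow (v_x, v_y) at several image points, the eight coefficients of (6.165) can be recovered, up to the common factor 1/(fd), by linear least squares; a minimal sketch under this assumption (at least 8 well-spread flow samples and known f) is:

import numpy as np

def fit_planar_flow(x, y, vx, vy, f):
    """Least-squares estimate of the 8 planar-flow coefficients of Eq. (6.165).

    The unknowns returned are a_i / (f*d), i.e., the coefficients up to the
    common factor 1/(f*d), which cannot be separated from the flow alone.
    """
    x, y, vx, vy = map(np.asarray, (x, y, vx, vy))
    zeros = np.zeros_like(x)
    const = f ** 2 * np.ones_like(x)
    # One row per v_x equation and one per v_y equation
    A_x = np.stack([x**2, x*y, f*x, f*y, const, zeros, zeros, zeros], axis=1)
    A_y = np.stack([x*y, y**2, zeros, zeros, zeros, f*y, f*x, const], axis=1)
    A = np.vstack([A_x, A_y])
    b = np.concatenate([vx, vy])
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs    # [a1, ..., a8] / (f*d)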

6.6.1 Motion Analysis from the Optical Flow Field

Let us now analyze what information can be extracted from the knowledge of optical
flow. In a stationary environment consisting of rigid bodies, whose depth can be
known, a sequence of images is acquired by a camera moving toward such objects
or vice versa. From each pair of consecutive images, it is possible to extract the
optical flow using one of the previous methods. From the optical flow, it is possible
to extract information on the structure and motion of the scene. In fact, from the

optical flow map, it is possible, for example, to observe that regions with small
velocity variations correspond to single image surfaces including information on
the structure of the observed surface. Regions with large speed variations contain
information on possible occlusions, or they concern the areas of discontinuity of
the surfaces of objects even at different distances from the observer (camera). A
relationship, between the orientation of the surface with respect to the observer and
the small variations of velocity gradients, can be derived. Let us now look at the type
of motion field induced by the model of perspective motion (pinhole) described by
general Eq. (6.161) in the hypothesis of a rigid body subject to roto-translation.

6.6.1.1 Motion with Pure Translation


In this case, it is assumed that Ω = 0 and the induced flow field, according to (6.161),
is modeled by the following:

v_x = (T_x f − T_z x)/Z,   v_y = (T_y f − T_z y)/Z    (6.167)
We can now distinguish two particular cases related to translational motion:

1. T_z = 0, which corresponds to a translational motion at a constant distance Z from
   the observer (lateral motion). In this case the induced flow field is represented
with flow vectors parallel to each other in a horizontal motion direction. See
the fourth illustration in Fig. 6.5 which shows an example of the optical flow
induced by lateral motion from right to left. In this case, the equations of the pure
translational motion model, derived from (6.167), are
   v_x = f T_x/Z,   v_y = f T_y/Z    (6.168)
It should be noted that if Z varies the length of the parallel vectors varies inversely
proportional with Z .
2. T_z ≠ 0, which corresponds to a translational motion of approach toward the object
   (with velocity T_z > 0) or of moving away from it (with velocity T_z < 0), or vice
   versa if it is the object that moves. In this case, the induced flow field is represented by radial
vectors emerging from a common point called Focus Of Expansion—FOE (if
the observer moves toward the object), or is known as Focus Of Contraction—
FOC when the radial flow vectors converge in the same point, in which case the
observer moves away from the object (or in the opposite case if the observer is
stationary and the object moves away). The first two images in Fig. 6.5 show the
induced radial flows of expansion and contraction of this translational motion,
respectively. The length of the vectors is inversely proportional to the distance
Z and is instead directly proportional to the distance in the image plane of p
from the FOE located in p0 = (0, 0) (assuming that the Z -axis and the direction


Fig. 6.35 Example of flow generated by translational or rotary motion. a Flow field induced by
longitudinal translational motion T = (0, 0, T_z) with the Z-axis coinciding with the velocity vector
T; b Flow field induced by translational motion T = (T_x + T_z) with the FOE shifted with respect to
the origin along the x-axis in the image plane; c Flow field induced by the simple rotation (known
as roll) of the camera around the longitudinal axis (in this case, the Z-axis)

of the velocity vector T_z are coincident), where the flow velocity vanishes (see
Fig. 6.35a). In the literature, the FOE is also known as the vanishing point.
In the case of lateral translational motion seen above, the FOE can be thought
to be located at infinity where the parallel flow vectors converge (the FOE cor-
responds to a vanishing point). Returning to the FOE, it represents the point of
intersection between the direction of motion of the observer and the image plane
(see Fig. 6.35a). If the motion also has a translation component in the direction of
X , with the resulting velocity vector T = Tx + Tz , the FOE appears in the image
plane displaced horizontally with the flow vectors always radial and convergent
in the position p0 = (x0 , y0 ) where the relative velocity is zero (see Fig. 6.35b).
This means that from a sequence of images, once the optical flow has been deter-
mined, it is possible to calculate the position of the FOE in the image plane, and
therefore know the direction of motion of the observer. We will see in the follow-
ing how this will be possible considering also the uncertainty in the localization
of the FOE, due to the noise present in the sequence of images acquired in the
estimation of the optical flow. The flow fields shown in the figure refer to ideal
cases without noise. In real applications, there can also be several independent
objects in translational motion; the flow map is still radial but with
different FOEs, each associated with the motion of the corresponding
object. Therefore, before analyzing the motion of the single objects, it is necessary
to segment the flow field into regions of homogeneous motion relative to each
object.

6.6.1.2 Motion with Pure Rotation


In this case, it is assumed that T = 0 and the induced flow field, according to
(6.161), is modeled by the rotation component only. The sole rotation around the
Z-axis (called roll rotation,9 with Ω_z ≠ 0 and Ω_x = Ω_y = 0), at a constant distance

9 In the aeronautical context, the attitude of an aircraft (integral with the axes (X, Y,
Z )), in 3D space,
is indicated with the angles of rotation around the axes, indicated, respectively, as lateral, vertical
and longitudinal. The longitudinal rotation, around the Z -axis, indicates the roll, the lateral one,

from the object, induces a flow that is represented from vectors oriented along the
points tangent to concentric circles whose center is the point of rotation of the points
projected in the image plane of the object in rotation. In this case, the FOE does not
exist and the flow is characterized by the center of rotation with respect to which
the vectors of the flow orbit (see Fig. 6.35c) tangent to the concentric circles. The
pure rotation around the X -axis (called pitch rotation) or the pure rotation around
the axis of Y (called yaw rotation) induces in the image plane a center of rotation
of the vectors (no longer oriented tangents along concentric circles but positioned
according to the perspective projection) shifted towards the FOE, wherein these two
cases it exists.

6.6.2 Calculation of Collision Time and Depth

The collision time is the time required by the observer to reach the contact with the
surface of the object when the movement is of pure translation. In the context of
pure translational motion between scene and observer, the radial map of optical flow
in the image plane is analytically described by Eq. (6.167), valid both for T_z > 0
(observer moving towards the object), with the radial vectors emerging from the
FOE, and for T_z < 0 (observer moving away from the object), with the
radial vectors converging at the FOC. A generic point P = (X, Y, Z) of the scene,
with translation velocity T = (0, 0, T_z) (see Fig. 6.35a), moves radially in the image
plane with velocity v = (v_x, v_y) expressed by Eq. (6.168) and, recalling the
perspective projection Eq. (6.152), is projected in the image plane at p = (x, y). We
have seen that in this case of pure translational motion the optical flow vectors
converge at the FOE point p_0 = (x_0, y_0), where they vanish, that is, (v_x, v_y) = (0, 0).
Therefore, at the FOE point Eqs. (6.167) vanish, thus yielding the FOE coordinates

x_0 = f \frac{T_x}{T_z} \qquad y_0 = f \frac{T_y}{T_z} \qquad (6.169)
The same results are obtained if the observer moves away from the scene, and in this
case, it is referred to as a Focus Of Contraction (FOC).
We can now express the relative velocity (vx , v y ) of the points p = (x, y) projected
in the image plane with respect to their distance from p0 = (x0 , y0 ), that is, from
the FOE, combining Eq. (6.169) and the equations of the pure translational motion



Fig. 6.36 Geometry of the perspective projection of a 3D point in the image plane in two instants
of time while the observer approaches the object with pure translational motion

model (6.167) by obtaining


v_x = \frac{T_z x - T_z x_0}{Z(x, y)} = \frac{T_z}{Z(x, y)}\,(x - x_0)
\qquad
v_y = \frac{T_z y - T_z y_0}{Z(x, y)} = \frac{T_z}{Z(x, y)}\,(y - y_0) \qquad (6.170)
It is pointed out that Eqs. (6.170) geometrically represent the radial map of the optical
flow under the hypothesis of pure translational motion. Moreover, it is observed that the
length of the flow vectors is proportional to their distance (p − p_0) from the FOE
and inversely proportional to the depth Z. In other words, the apparent velocity of the
points of the scene increases as the observer approaches them. Figure 6.35a
shows the example of the radial expansion flow induced with T_z > 0 (it would instead
result in contraction with T_z < 0).
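To make Eqs. (6.170) concrete, the following minimal Python/NumPy sketch generates the ideal radial flow field of a pure translation; the focal length is implicit in the pixel coordinates, and the FOE, T_z and depth values are assumed purely for illustration (function and variable names are not from the text).

```python
import numpy as np

def radial_flow(x, y, x0, y0, Tz, Z):
    """Ideal radial flow of Eq. (6.170) for pure translation along Z.

    x, y : pixel coordinates (arrays); x0, y0 : FOE position;
    Tz   : translation velocity along the optical axis;
    Z    : depth Z(x, y) of the corresponding scene points."""
    vx = Tz * (x - x0) / Z
    vy = Tz * (y - y0) / Z
    return vx, vy

# Example: a 5x5 grid of image points at constant (assumed) depth.
xs, ys = np.meshgrid(np.arange(-2, 3), np.arange(-2, 3))
vx, vy = radial_flow(xs, ys, x0=0.0, y0=0.0, Tz=1.0, Z=10.0)
# Each vector points away from the FOE (expansion), with length
# proportional to the distance from the FOE and inversely to Z.
```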
Now let’s see how to estimate, from the knowledge of the optical flow, derived
from a sequence of images:

– The Time To Collision (TTC) of the observer with a point of the scene, without
knowing its distance or the approach velocity.
– The distance d of a point of the scene from the observer, which moves at a constant
velocity T_z parallel to the Z-axis towards it.

This is the typical situation of an autonomous vehicle that tries to approach an object
and must be able to predict the collision time assuming that it moves with constant
velocity T_z. In these applications, it is strategic that this prediction be made without
knowing, instant by instant, the speed and the distance from the object. In other
situations, instead, it is important to estimate a relative distance between vehicle and
object without knowing the translation velocity.

6.6.2.1 TTC Calculation


Let us first consider the TTC calculation (often also referred to as time to contact).
For simplicity, considering that the rotational component has no effect, we represent
this situation with Fig. 6.36 considering only the projection Y–Z of the 3D reference
system and the y-axis of the image plane. Let P = (Y P , Z P ) be a generic point in
the scene, p = (x p , y p ) its perspective projection in the image plane, f the distance
between image plane (perpendicular to the Z axis) and the Center Of Projection COP
(assumed pinhole model). If P (or the observer but with the stationary object, or vice
versa) moves with velocity T_z, its perspective projection is y_p = f\,Y_P/Z_P, while the
apparent velocity in the image plane, according to Eq. (6.170), is given by

v_y = \frac{y_p T_z}{Z_P} \qquad (6.171)
where we assume that the FOE has coordinates (x0 , y0 ) = (0, 0) in the image plane.
If the velocity vector Tz was not aligned with the Z -axis, the FOE would be shifted
and its coordinates (x0 , y0 ) can be calculated with the intersection of at least two flow
vectors, projections of 3D points of the object. We will see later the most accurate way
to determine the coordinates of the FOE considering also the possible noise derived
from the calculation of the optical flow from the sequence of images. Returning to
Eq. (6.171) for the instantaneous velocity v_y of the projection of P at p = (x_p, y_p) in
the image plane, if we divide both sides by y_p and then take the reciprocal, we get

\frac{y_p}{v_y} = \frac{Z_P}{T_z} \;\Longrightarrow\; \tau = \frac{y_p}{v_y} \qquad (6.172)

We have essentially obtained that the time to collision τ is given by the ratio between
the observer–object distance Z_P and the velocity T_z, which is the classic way to
estimate the TTC, but these are quantities that we do not know with a single camera.
Above all, we have the important result we wanted: the time to collision τ is also
given by the ratio of two measurements derived from the optical flow, y_p (the distance
y_p − y_0 of the flow vector from the FOE) and v_y (the flow velocity ∂y/∂t), which can
be estimated from the sequence of images under the hypothesis of translational motion
with constant speed. The accuracy of τ depends on the accuracy of the FOE position
and of the optical flow. Normally, the value of τ is considered acceptable if y_p exceeds
a threshold value, in terms of number of pixels, to be defined in relation to the
velocity T_z.
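As a minimal illustration of Eq. (6.172), the following Python sketch computes τ from two assumed flow measurements (the numbers and the function name are hypothetical); in practice y_p and v_y come from the tracked point and the estimated FOE.

```python
def time_to_collision(y_p, v_y):
    """TTC from Eq. (6.172): tau = y_p / v_y, with y_p the distance of the
    image point from the FOE (here along y) and v_y its flow velocity."""
    return y_p / v_y

# Example with assumed measurements from two consecutive frames:
# a point observed 40 px from the FOE, moving radially at 2 px/frame.
tau = time_to_collision(y_p=40.0, v_y=2.0)
print(tau, "frames to collision")  # divide by the frame rate to get seconds
```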
The same Eq. (6.172) for the time to collision τ is obtained by considering the
observer in motion toward the stationary object, as represented in Fig. 6.36. The
figure shows two time instants of the projection of P in the image plane while the
camera moves with velocity T_z toward the object. At the instant t, its perspective
projection p in the image plane, according to (6.152), is y_p = f\,Y_P/Z_P. As the image
plane approaches P, the projection moves radially away from the FOE, reaching p' at
time t + 1. This dynamic is described by differentiating y_p with respect to time t, obtaining

\frac{\partial y_p}{\partial t} = f\,\frac{\partial Y_P/\partial t}{Z_P} - f\,Y_P\,\frac{\partial Z_P/\partial t}{Z_P^2} \qquad (6.173)

This expression can be simplified by considering the type of motion T = (0, 0, T_z),
for which \partial Y_P/\partial t = 0, and noting that from the perspective projection (Eq. 6.152) we
have Y_P = y_p Z_P/f and that \partial Z_P/\partial t = T_z. Substituting in (6.173), we get

\frac{\partial y_p}{\partial t} = -\,y_p\,\frac{T_z}{Z_P} \qquad (6.174)

Dividing both sides, as before, by y_p and taking the reciprocal, we finally obtain the
same expression (6.172) for the time to collision τ.
It is also observed that τ does not depend on the focal length of the optical system
or the size of the object, but depends only on the observer–object distance Z P and
the translation velocity T_z. With the same principle used to estimate the TTC,
we could estimate the size of the object (useful in navigation applications where we
want to gauge the size of an obstacle), reformulating the problem in terms of
τ = h/h_t, where h is the height (seen as a scale factor) of the obstacle projected
in the image plane and h_t (in analogy to v_y) represents the time derivative of the
object's scale. The reformulation of the problem in these terms does not aim
to estimate the absolute size of the object through τ, but to estimate how its
size varies over time between one image and the next in the sequence. In this case,
it is useful to estimate τ also in the X–Z plane according to Eq. (6.170).

6.6.2.2 Depth Calculation


It is not possible to determine the absolute distance between the observer and the
object with a single camera. However, we can estimate a relative distance between
any two points of an object with respect to the observer. According to Eq. (6.172),
we consider two points of the object (which moves with constant velocity T_z) at
time t, at distances Z_1(t) and Z_2(t), and we will have

\frac{y_1}{v_y^{(1)}(t)} = \frac{Z_1(t)}{T_z} \qquad \frac{y_2}{v_y^{(2)}(t)} = \frac{Z_2(t)}{T_z} \qquad (6.175)

Dividing the two relations side by side, we get

\frac{Z_2(t)}{Z_1(t)} = \frac{y_2(t)}{y_1(t)} \cdot \frac{v_y^{(1)}(t)}{v_y^{(2)}(t)} \qquad (6.176)

from which it emerges that it is possible to calculate the relative 3D distance of any
two points of the object, in terms of the ratio Z_2(t)/Z_1(t), using the optical flow
measurements (velocity and distance from the FOE) derived between adjacent images
of the sequence. If for some point r of the object the accurate distance Z_r(t) is known,
according to (6.176) we can determine the instantaneous depth Z_i(t) of any point i
as follows:

Z_i(t) = Z_r(t)\,\frac{y_i(t)}{y_r(t)} \cdot \frac{v_y^{(r)}(t)}{v_y^{(i)}(t)} \qquad (6.177)

In essence, this is the approach underlying the reconstruction of the 3D structure
of the scene from the motion information, in this case based on the optical flow
derived from a space–time varying sequence of images. With a single camera, the
3D structure of the scene is estimated only up to a scale factor.
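A small Python sketch of Eqs. (6.176)–(6.177), under the stated pure-translation assumption; the flow measurements and the reference depth are assumed values used only for illustration.

```python
def relative_depth(y2, vy2, y1, vy1):
    """Ratio Z2/Z1 from Eq. (6.176) using distances from the FOE (y1, y2)
    and the corresponding flow velocities (vy1, vy2)."""
    return (y2 / y1) * (vy1 / vy2)

def absolute_depth(Zr, yi, vyi, yr, vyr):
    """Depth Z_i from Eq. (6.177), given a point r of known depth Zr."""
    return Zr * (yi / yr) * (vyr / vyi)

# Example with assumed flow measurements (pixels and pixels/frame):
print(relative_depth(y2=30.0, vy2=1.0, y1=60.0, vy1=4.0))          # Z2 = 2 * Z1
print(absolute_depth(Zr=5.0, yi=30.0, vyi=1.0, yr=60.0, vyr=4.0))  # = 10.0
```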

6.6.3 FOE Calculation

In real applications of an autonomous vehicle, the flow field induced in the image
plane is radial, generated by the dominant translational motion (assuming a flat floor),
with negligible roll and pitch rotations and a possible yaw rotation (that is, rotation
around the Y-axis perpendicular to the floor; see note 9 for the conventions used
for vehicle and scene orientation). With this radial flow field, the
coordinates (x0 , y0 ) of FOE (defined by Eq. 6.169) can be determined theoretically
by knowing the optical flow vector of at least two points belonging to a rigid object. In
these radial map conditions, the lines passing through two flow vectors intersect at the
point (x0 , y0 ), where all the other flow vectors converge, at least in ideal conditions
as shown in Fig. 6.35a.
In reality, the optical flow observable in the image plane is induced by the motion
of the visible points of the scene. Normally, we consider the corners and edges that are
not always easily visible and uniquely determined. This means that the projections
in the image plane of the points, and therefore the flow velocities (given by Eq. 6.170),
are not always accurately measured from the sequence of images. It follows that the
direction of the optical flow vectors is noisy and this involves the convergence of the
flow vectors not in a single point. In this case, the location of the FOE is estimated
approximately as the center of mass of the region [42], where the optical flow vectors
converge.
Calculating the location of the FOE is not only useful for obtaining information
on the structure of the motion (depth of the points) and the collision time, but it is also
useful for obtaining information relating to the direction of motion of the observer
(called heading), which does not always coincide with the optical axis. In fact, the flow field
induced by the translational motion T = (T_x, 0, T_z) produces a FOE shifted with
respect to the origin along the x-axis in the image plane (see Fig. 6.35b).
We have already shown earlier that we cannot fully determine the structure of the
scene from the flow field due to the lack of knowledge of the distance Z (x, y) of

3D points as indicated by Eq. (6.170), while the position of the FOE is independent
of Z (x, y) according to Eq. (6.169). Finally, it is observed (easily demonstrated
geometrically) that the amplitude of the velocity vectors of the flow is dependent on
the depth Z (x, y) while the direction is independent.
There are several methods for estimating the FOE. Many use calibrated systems that
separate the translational and rotational motion components, or compute an approximation
of the FOE position by defining an error function to be minimized (which imposes
constraints on the correspondence of the points of interest of the scene tracked across the
image sequence) and solving in the least-squares sense, or by approximating the visible
surface with elementary planes. After the error minimization process, an optimal FOE
position is obtained.
Other methods use the direction of velocity vectors and determine the position of
the FOE by evaluating the maximum number of intersections (for example, using the
Hough transform) or by using a multilayer neural network [43,44]. The input flow
field does not necessarily have to be dense. It is often useful to consider the velocity
vectors associated with points of interest of which there is good correspondence in
the sequence of images. Generally, they are points with high texture or corners.
A least-squares solution for the FOE calculation, using all the flow vectors in the
pure translation context, is obtained by considering the flow equations (6.170)
and imposing the constraint that eliminates the dependence on the translation velocity
T_z and on the depth Z, so we have

\frac{v_x}{v_y} = \frac{x - x_0}{y - y_0} \qquad (6.178)

This constraint is applied to each vector (v_{x_i}, v_{y_i}) of the optical flow (dense or
sparse) that contributes to the determination of the FOE position (x_0, y_0). In
fact, writing (6.178) in matrix form:

\begin{bmatrix} v_{y_i} & -v_{x_i} \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = x_i v_{y_i} - y_i v_{x_i} \qquad (6.179)

we have a highly overdetermined linear system and it is possible to estimate the FOE
position (x_0, y_0) from the flow field (v_{x_i}, v_{y_i}) with the least-squares approach:

\begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = (A^T A)^{-1} A^T b \qquad (6.180)

where

A = \begin{bmatrix} v_{y_1} & -v_{x_1} \\ \cdots & \cdots \\ v_{y_n} & -v_{x_n} \end{bmatrix} \qquad b = \begin{bmatrix} x_1 v_{y_1} - y_1 v_{x_1} \\ \cdots \\ x_n v_{y_n} - y_n v_{x_n} \end{bmatrix} \qquad (6.181)

The explicit coordinates of the FOE are

x_0 = \frac{\sum_i v_{x_i}^2 \sum_i v_{y_i} b_i - \sum_i v_{x_i} v_{y_i} \sum_i v_{x_i} b_i}{\sum_i v_{x_i}^2 \sum_i v_{y_i}^2 - \big(\sum_i v_{x_i} v_{y_i}\big)^2}
\qquad
y_0 = \frac{\sum_i v_{x_i} v_{y_i} \sum_i v_{y_i} b_i - \sum_i v_{y_i}^2 \sum_i v_{x_i} b_i}{\sum_i v_{x_i}^2 \sum_i v_{y_i}^2 - \big(\sum_i v_{x_i} v_{y_i}\big)^2} \qquad (6.182)
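A compact NumPy sketch of the least-squares FOE estimate of Eqs. (6.179)–(6.180); the synthetic flow field used to exercise it is an assumption made only for illustration, and the function name is not from the text.

```python
import numpy as np

def estimate_foe(x, y, vx, vy):
    """Least-squares FOE (x0, y0) from flow vectors (vx, vy) at pixels (x, y),
    solving A [x0, y0]^T = b with rows [vy, -vx] and b = x*vy - y*vx."""
    A = np.column_stack((vy, -vx))
    b = x * vy - y * vx
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe  # (x0, y0)

# Synthetic radial flow diverging from an assumed FOE at (12, -7):
rng = np.random.default_rng(0)
x = rng.uniform(-100, 100, 200)
y = rng.uniform(-100, 100, 200)
vx, vy = 0.05 * (x - 12.0), 0.05 * (y + 7.0)
print(estimate_foe(x, y, vx, vy))  # close to [12, -7]
```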

6.6.3.1 Calculation of TTC and Depth Knowing the FOE


Knowing the position (x_0, y_0) of the FOE (given by Eq. 6.182), computed with adequate
accuracy (for example, with the linear least-squares method above), it is possible to
estimate the collision time τ and the depth Z(x, y), always in the context of pure
translational motion, in a way alternative to the procedure of Sect. 6.6.2.
This new approach combines the FOE constraint imposed by Eq. (6.178) with the
optical flow equation (6.9), I_x v_x + I_y v_y + I_t = 0, described in Sect. 6.4. Dividing
the optical flow equation by v_x and then combining it with the FOE constraint equation
(similarly for v_y), we obtain the flow equations in the form:

v_x = -\,\frac{(x - x_0)\,I_t}{(x - x_0)I_x + (y - y_0)I_y}
\qquad
v_y = -\,\frac{(y - y_0)\,I_t}{(x - x_0)I_x + (y - y_0)I_y} \qquad (6.183)

where I_x, I_y, and I_t are the first partial space–time derivatives of adjacent images
of the sequence. Combining the flow equations (6.170) (valid in the context of
translational motion of a rigid body) with the flow equations (6.183), we obtain the
following relations:

\tau = \frac{Z}{T_z} = \frac{(x - x_0)I_x + (y - y_0)I_y}{I_t}
\qquad
Z(x, y) = \frac{T_z}{I_t}\,\big[(x - x_0)I_x + (y - y_0)I_y\big] \qquad (6.184)
These equations express the collision time τ and the depth Z(x, y), respectively, in
relation to the position of the FOE and the first derivatives of the images. The estimates
of τ and Z(x, y) obtained with (6.184) may be more robust than those calculated with
Eqs. (6.172) and (6.176), since the position of the FOE has been determined (with the
closed-form least-squares approach) considering only the direction of the optical flow
vectors and not their magnitude.
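The relations (6.184) can be evaluated pixel-wise once the FOE and the image derivatives are available. The following sketch assumes the spatio-temporal derivatives Ix, Iy, It have already been computed (for example, by finite differences between two frames) and that Tz is supplied only if the metric depth is wanted; names are illustrative.

```python
import numpy as np

def ttc_and_depth_from_foe(Ix, Iy, It, x0, y0, Tz=None, eps=1e-9):
    """Pixel-wise TTC (and optionally depth) from Eq. (6.184).

    Ix, Iy, It : spatio-temporal derivative images (same shape);
    (x0, y0)   : FOE position in pixel coordinates;
    Tz         : translation velocity (needed only for metric depth)."""
    h, w = It.shape
    x, y = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    g = (x - x0) * Ix + (y - y0) * Iy          # radial gradient term
    tau = g / (It + eps)                       # time to collision (in frames)
    Z = None if Tz is None else Tz * g / (It + eps)
    return tau, Z
```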


Fig. 6.37 Geometry of the perspective projection of a point P, in the 3D space (X, Y, Z) and in the
image plane (x, y), which moves to P' = (X', Y', Z') according to the translation (S_x, S_y, S_z)
and rotation (R_x, R_y, R_z). In the image plane, the vector (Δx, Δy) associated with the
roto-translation is indicated

6.6.4 Estimation of Motion Parameters for a Rigid Body

The richness of information present in the optical flow can be used to estimate the
motion parameters of a rigid body [45]. In the applications of autonomous vehicles
and in general, in the 3D reconstruction of the scene structure through the optical flow,
it is possible to estimate the parameters of vehicle motion and depth information.
In general, for an autonomous vehicle it is interesting to know its own motion (ego-
motion) in a static environment. In the more general case, there may be more objects
in the scene with different velocity and in this case the induced optical flow can be
segmented to distinguish the motion of the various objects. The motion parameters
are the translational and rotational velocity components associated with the points
of an object with the same motion.
We have already highlighted above that, if the depth is unknown, only the rotation
can be determined uniquely (it is invariant to depth, Eq. 6.161), while the translation
parameters can be estimated only up to a scale factor. The high dimensionality of the
problem and the nonlinearity of the equations derived from the optical flow make
the problem complex.
The accuracy of the estimation of motion parameters and scene structure is related
to the accuracy of the flow field normally determined by sequences of images with
good spatial and temporal resolution (using cameras with a high frame rate). In
particular, the time interval Δt between images must be very small in order to estimate
with good approximation the velocity Δx/Δt of a point that has moved by Δx between
two consecutive images of the sequence.
Consider the simple case of rigid motion where the object’s points move with
the same velocity and are projected prospectively in the image plane as shown in
Fig. 6.37. The direction of motion is always along the Z -axis in the positive direction
with the image plane (x, y) perpendicular to the Z -axis. The figure shows the position
of a point P = (X, Y, Z) at time t and the new position P' = (X', Y', Z') after
the movement, at time t'. The perspective projections of P in the image plane, at the
two instants of time, are, respectively, p = (x, y) and p' = (x', y').
The 3D displacement of P to the new position is modeled by the following geometric
transformation:

P' = R\,P + S \qquad (6.185)

where

R = \begin{bmatrix} 1 & -R_z & R_y \\ R_z & 1 & -R_x \\ -R_y & R_x & 1 \end{bmatrix} \qquad S = \begin{bmatrix} S_x \\ S_y \\ S_z \end{bmatrix} \qquad (6.186)

It follows that, according to (6.185), the spatial coordinates of the new position of P
depend on the coordinates of the initial position (X, Y, Z), multiplied by the matrix of
the rotation parameters (R_x, R_y, R_z) and added to the translation parameters
(S_x, S_y, S_z). Replacing R and S in (6.185), we have

X' = X - R_z Y + R_y Z + S_x \qquad \Delta X = X' - X = S_x - R_z Y + R_y Z
Y' = Y + R_z X - R_x Z + S_y \;\;\Longrightarrow\;\; \Delta Y = Y' - Y = S_y + R_z X - R_x Z \qquad (6.187)
Z' = Z - R_y X + R_x Y + S_z \qquad \Delta Z = Z' - Z = S_z - R_y X + R_x Y

The determination of the motion is closely related to the calculation of the motion
parameters (S_x, S_y, S_z, R_x, R_y, R_z), which depend on the geometric properties of the
projection of the 3D points of the scene in the image plane. With the perspective model
of image formation given by Eq. (6.152), the projections of P and P' in the image
plane are, respectively, p = (x, y) and p' = (x', y'), with the relative displacements
(Δx, Δy) given by

\Delta x = x' - x = f\,\frac{X'}{Z'} - f\,\frac{X}{Z}
\qquad
\Delta y = y' - y = f\,\frac{Y'}{Z'} - f\,\frac{Y}{Z} \qquad (6.188)

In the context of rigid motion with very small 3D angular rotations, the 3D
displacements (ΔX, ΔY, ΔZ) can be approximated by Eq. (6.187), which,
when substituted into the perspective Eq. (6.188), gives the corresponding displacements
(Δx, Δy) as follows:

\Delta x = x' - x = \frac{\dfrac{f S_x - S_z x}{Z} + f R_y - R_z y - R_x \dfrac{xy}{f} + R_y \dfrac{x^2}{f}}{1 + \dfrac{S_z}{Z} + R_x \dfrac{y}{f} - R_y \dfrac{x}{f}}
\qquad
\Delta y = y' - y = \frac{\dfrac{f S_y - S_z y}{Z} - f R_x + R_z x + R_y \dfrac{xy}{f} - R_x \dfrac{y^2}{f}}{1 + \dfrac{S_z}{Z} + R_x \dfrac{y}{f} - R_y \dfrac{x}{f}} \qquad (6.189)

The equations of the displacements (Δx, Δy) in the image plane at p = (x, y) are
thus obtained in terms of the parameters (S_x, S_y, S_z, R_x, R_y, R_z) and of
the depth Z of the point P = (X, Y, Z) of the 3D object, assuming the perspective

projection model. Furthermore, if the images of the sequence are acquired with a
high frame rate, the components of the displacement (Δx, Δy) in the time interval
between one image and the other are very small; it follows that the variable terms in
the denominator of Eq. (6.189) are small compared to unity, that is,

\frac{S_z}{Z} + R_x \frac{y}{f} - R_y \frac{x}{f} \ll 1 \qquad (6.190)
With these assumptions, it is possible to derive the equations that relate the motion in
the image plane to the motion parameters by differentiating Eqs. (6.189) with
respect to time t. In fact, dividing (6.189) by the time interval Δt and letting Δt → 0,
the displacements (Δx, Δy) approximate (in the limit, become) the
instantaneous velocities (v_x, v_y) in the image plane, known as the optical flow.
Similarly, in 3D space, the translation motion parameters (S_x, S_y, S_z)
become the translation velocity, indicated with (T_x, T_y, T_z), and likewise the rotation
parameters (R_x, R_y, R_z) become the rotation velocity, indicated with (Ω_x, Ω_y, Ω_z). The
resulting optical flow equations for (v_x, v_y) correspond precisely to Eq. (6.160) of Sect. 6.6.
Therefore, these equations of motion involve velocities both in 3D space and in the
2D image plane. However, in real vision systems, information is that acquired from
the sequence of space–time images based on the induced displacements on very small
time intervals according to the condition expressed in Eq. (6.190).
In other words, we can approximate the 3D velocity of a point P = (X, Y, Z)
with the equation V = T + Ω × P (see Sect. 6.6) of a rigid body that moves with a
translational velocity T = (T_x, T_y, T_z) and a rotational velocity Ω = (Ω_x, Ω_y, Ω_z),
while the relative velocity (v_x, v_y) of the 3D point projected at p = (x, y) in the
image plane can be approximated by the displacements (Δx, Δy), provided the
constraint ΔZ/Z ≪ 1 of Eq. (6.190) is maintained.
With these assumptions, we can now address the problem of estimating the motion
parameters (S_x, S_y, S_z, R_x, R_y, R_z) and Z starting from the measurements of the
displacement vector (Δx, Δy) given by Eq. (6.189), which we can decompose
into separate translation and rotation components, as follows:

(\Delta x, \Delta y) = (\Delta x_S, \Delta y_S) + (\Delta x_R, \Delta y_R) \qquad (6.191)

with

\Delta x_S = \frac{f S_x - S_z x}{Z} \qquad \Delta y_S = \frac{f S_y - S_z y}{Z} \qquad (6.192)

and

\Delta x_R = f R_y - R_z y - R_x \frac{xy}{f} + R_y \frac{x^2}{f}
\qquad
\Delta y_R = -f R_x + R_z x + R_y \frac{xy}{f} - R_x \frac{y^2}{f} \qquad (6.193)

As previously observed, the rotational component does not depend on the depth Z
of the 3D point of the scene.
Given that the displacement vector (Δx, Δy) is available for each projected 3D point
of the scene, the motion parameters (the unknowns) can be estimated with the least-squares
approach by defining an error function e(S, R, Z) to be minimized, given by

e(S, R, Z) = \sum_{i=1}^{n} (\Delta x_i - \Delta x_{S_i} - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{S_i} - \Delta y_{R_i})^2 \qquad (6.194)

where (Δx_i, Δy_i) is the measurable displacement vector for each point i of the image
plane, while (Δx_{R_i}, Δy_{R_i}) and (Δx_{S_i}, Δy_{S_i}) are, respectively, the rotational and
translational components of the displacement vector. With the perspective projection
model it is not possible to estimate an absolute value for the translation vector S and
for the depth Z of each 3D point: in essence, they are estimated up to a scale
factor. In fact, looking at the translation component (Δx_S, Δy_S) in Eq. (6.192),
we observe that if both S and Z are multiplied by a constant c, the equation is unchanged.
Therefore, scaling the translation vector by a constant factor and, at the same
time, increasing the depth by the same factor produces no change in the
displacement vector in the image plane. From the displacement vector it is possible
to estimate the direction of motion and the relative depth of the 3D point from the
image plane. According to the strategy proposed in [45], it is useful to set up the error
function by first eliminating S/Z through a normalization process. Let U = (U_x, U_y, U_z)
be the normalized motion direction vector and r the magnitude of the translation
vector S = (S_x, S_y, S_z). The normalization of U is given by

(U_x, U_y, U_z) = \frac{(S_x, S_y, S_z)}{r} \qquad (6.195)
Let Z̄_i be the relative depth given by

\bar{Z}_i = \frac{r}{Z_i} \qquad \forall i \qquad (6.196)

At this point we can rewrite in normalized form the translation component (6.192),
which becomes

\Delta x_U = \frac{\Delta x_S}{\bar{Z}} = U_x - U_z x
\qquad
\Delta y_U = \frac{\Delta y_S}{\bar{Z}} = U_y - U_z y \qquad (6.197)

Rewriting the error function (6.194) with respect to U (using Eq. 6.197), we get

e(U, R, \bar{Z}) = \sum_{i=1}^{n} (\Delta x_i - \Delta x_{U_i} \bar{Z}_i - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{U_i} \bar{Z}_i - \Delta y_{R_i})^2 \qquad (6.198)

We are now interested in minimizing this error function for all possible values of Z̄ i .

Differentiating Eq. (6.198) with respect to Z̄_i, setting the result to zero and solving
for Z̄_i, we obtain

\bar{Z}_i = \frac{(\Delta x_i - \Delta x_{R_i})\,\Delta x_{U_i} + (\Delta y_i - \Delta y_{R_i})\,\Delta y_{U_i}}{\Delta x_{U_i}^2 + \Delta y_{U_i}^2} \qquad \forall i \qquad (6.199)

Finally, having estimated the relative depths Z̄_i, it is possible to substitute them into the
error function (6.198) and obtain the following final formulation:

e(U, R) = \sum_{i=1}^{n} \frac{\big[(\Delta x_i - \Delta x_{R_i})\,\Delta y_{U_i} - (\Delta y_i - \Delta y_{R_i})\,\Delta x_{U_i}\big]^2}{\Delta x_{U_i}^2 + \Delta y_{U_i}^2} \qquad (6.200)

With this device the depth Z has been eliminated from the error function, which is
now formulated only in terms of U and R. Once the motion parameters have been
estimated with (6.200), it is possible to finally estimate with (6.199) the optimal depth
value associated with each ith point of the image plane.
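As a minimal sketch of Eqs. (6.197)–(6.200), the following Python code evaluates the relative depths and the residual e(U, R) for a candidate direction U and rotation R; it assumes coordinates normalized by the focal length (f = 1 unless stated) and measured displacements (dx, dy), and in practice e would be minimized over U (on the unit sphere) and R with a generic optimizer.

```python
import numpy as np

def rotational_component(x, y, R, f=1.0):
    """Rotational displacement (dxR, dyR) of Eq. (6.193) for R = (Rx, Ry, Rz)."""
    Rx, Ry, Rz = R
    dxR = f * Ry - Rz * y - Rx * x * y / f + Ry * x**2 / f
    dyR = -f * Rx + Rz * x + Ry * x * y / f - Rx * y**2 / f
    return dxR, dyR

def residual_and_depths(x, y, dx, dy, U, R, f=1.0):
    """Relative depths (Eq. 6.199) and residual e(U, R) (Eq. 6.200) for a
    candidate direction U = (Ux, Uy, Uz) and rotation R = (Rx, Ry, Rz)."""
    Ux, Uy, Uz = U
    dxR, dyR = rotational_component(x, y, R, f)
    dxU, dyU = Ux - Uz * x, Uy - Uz * y                          # Eq. (6.197)
    den = dxU**2 + dyU**2 + 1e-12
    Zbar = ((dx - dxR) * dxU + (dy - dyR) * dyU) / den           # Eq. (6.199)
    e = np.sum(((dx - dxR) * dyU - (dy - dyR) * dxU)**2 / den)   # Eq. (6.200)
    return e, Zbar
```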
In the case of motion with pure rotation, the error function (6.200) to be minimized
becomes

e(R) = \sum_{i=1}^{n} (\Delta x_i - \Delta x_{R_i})^2 + (\Delta y_i - \Delta y_{R_i})^2 \qquad (6.201)

where the rotational motion parameters to be estimated are the three components
of R. This is possible by differentiating the error function with respect to each of the
components (R_x, R_y, R_z), setting the result to zero and solving the three linear
equations (recalling the nondependence on Z), as proposed in [46]. The three linear
equations obtained are

\sum_{i=1}^{n} \big[(\Delta x_i - \Delta x_{R_i})\,xy + (\Delta y_i - \Delta y_{R_i})(y^2 + 1)\big] = 0
\sum_{i=1}^{n} \big[(\Delta x_i - \Delta x_{R_i})(x^2 + 1) + (\Delta y_i - \Delta y_{R_i})\,xy\big] = 0 \qquad (6.202)
\sum_{i=1}^{n} \big[(\Delta x_i - \Delta x_{R_i})\,y - (\Delta y_i - \Delta y_{R_i})\,x\big] = 0

where (Δx_i, Δy_i) is the ith displacement vector for the image point (x, y) and
(Δx_{R_i}, Δy_{R_i}) are the rotational components of the displacement vector. As reported
in [46], to estimate the rotation parameters (R_x, R_y, R_z) for the image point (x, y),
Eqs. (6.202) are expanded and rewritten in matrix form, obtaining

\begin{bmatrix} R_x \\ R_y \\ R_z \end{bmatrix} = \begin{bmatrix} a & d & f \\ d & b & e \\ f & e & c \end{bmatrix}^{-1} \begin{bmatrix} k \\ l \\ m \end{bmatrix} \qquad (6.203)

where

a = \sum_{i=1}^{n} \big[x^2 y^2 + (y^2 + 1)^2\big] \quad b = \sum_{i=1}^{n} \big[(x^2 + 1)^2 + x^2 y^2\big] \quad c = \sum_{i=1}^{n} (x^2 + y^2)

d = \sum_{i=1}^{n} \big[xy\,(x^2 + y^2 + 2)\big] \quad e = \sum_{i=1}^{n} y \quad f = \sum_{i=1}^{n} x \qquad (6.204)

k = \sum_{i=1}^{n} \big[u\,xy + v\,(y^2 + 1)\big] \quad l = \sum_{i=1}^{n} \big[u\,(x^2 + 1) + v\,xy\big] \quad m = \sum_{i=1}^{n} (u\,y - v\,x)

In Eq. (6.204), (u, v) indicates the flow measured at each pixel (x, y). It is proved
that the matrix in (6.203) is diagonal and singular if the image plane is symmetric
with respect to the x and y axes. Moreover, if the image plane is small in size,
the matrix could be ill conditioned, resulting in an inaccurate estimate, since errors in the
summations (k, l, m) computed from the observed flow (u, v) would be greatly
amplified. In practice this should not happen, because it is generally not required
to determine the rotational motion around the optical axis of the camera when observing
the scene with a small field of view.
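For the pure-rotation case, the following NumPy sketch solves the same least-squares problem (6.201) directly in design-matrix form, using the rotational flow model (6.193); this is equivalent to the normal equations behind (6.202)–(6.204) but avoids assembling the closed-form sums by hand. Coordinates are assumed normalized by the focal length, and the synthetic example values are hypothetical.

```python
import numpy as np

def estimate_rotation(x, y, u, v, f=1.0):
    """Least-squares (Rx, Ry, Rz) minimizing Eq. (6.201) for a purely
    rotational flow (u, v) measured at image coordinates (x, y)."""
    # Partial derivatives of the rotational flow model (6.193) w.r.t. Rx, Ry, Rz.
    Au = np.column_stack((-x * y / f, f + x**2 / f, -y))
    Av = np.column_stack((-(f + y**2 / f), x * y / f, x))
    A = np.vstack((Au, Av))
    b = np.concatenate((u, v))
    R, *_ = np.linalg.lstsq(A, b, rcond=None)
    return R  # (Rx, Ry, Rz)

# Example: synthetic rotational flow generated with assumed (Rx, Ry, Rz).
xs, ys = np.meshgrid(np.linspace(-0.5, 0.5, 11), np.linspace(-0.5, 0.5, 11))
x, y = xs.ravel(), ys.ravel()
Rtrue = np.array([0.01, -0.02, 0.03])
u = -Rtrue[0] * x * y + Rtrue[1] * (1 + x**2) - Rtrue[2] * y
v = -Rtrue[0] * (1 + y**2) + Rtrue[1] * x * y + Rtrue[2] * x
print(estimate_rotation(x, y, u, v))  # close to [0.01, -0.02, 0.03]
```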

6.7 Structure from Motion

In this section, we will describe an approach known as Structure from Motion (SfM)
to obtain information on the geometry of the 3D scene starting from a sequence of
2D images acquired by a single uncalibrated camera (the intrinsic parameters and its
position are not known). The three-dimensional perception of the world is a feature
common to many living beings. We have already described the primary mechanism
used by the human vision system, stereopsis (i.e., the lateral displacement of objects
in the two retinal images), to perceive depth and obtain 3D information about the scene
by fusing the two retinal images.
The computer vision community has developed different approaches for 3D recon-
struction of the scene starting from 2D images observed from multiple points of view.
One approach is to find the correspondence of points of interest of the 3D scene (as
happens in the stereo vision) observed in 2D multiview images and by triangulation
construct a 3D trace of these points. More formally (see Fig. 6.38), given the n points
projected as x_{ij} (i = 1, ..., n; j = 1, ..., m) in the m 2D images, the goal is to find the m
projection matrices P_j, j = 1, ..., m (associated with the motion), consistent with the
structure of the n observed 3D points X_i, i = 1, ..., n. Fundamental to the SfM approach is
the knowledge of the camera projection matrix (or camera matrix), which is linked to
the geometric model of image formation (camera model).

6.7.1 Image Projection Matrix

Normally, the simple model of perspective projection is assumed (see Fig. 6.39)
which corresponds to the ideal pinhole model where a 3D point Q = (X, Y, Z ),
whose coordinates are expressed in the reference system of the camera, it is projected
in the image plane in q whose coordinates x = (x, y) are related to each other with


Fig. 6.38 Observation of the scene from slightly different points of view obtaining a sequence of
m images. n points of interest of the 3D scene are taken from the m images to reconstruct the 3D
structure of the scene by estimating the projection matrices P j associated with the m observations


Fig. 6.39 Intrinsic and extrinsic parameters in the pinhole model. A point Q, in the camera’s 3D
reference system (X, Y, Z ), is projected into the image plane in q = (x, y), whose coordinates are
defined with respect to the principal point c = (cx , c y ) according to Eq. (6.206). The transformation
of the coordinates (x, y) of the image plane into the sensor coordinates, expressed in pixels, is defined
by the intrinsic parameters with Eq. (6.207) which takes into account the translation of the principal
point c and sensor resolution. The transformation of 3D point coordinates of a rigid body from
the world reference system (X w , Yw , Z w ) (with origin in O) to the reference system of the camera
(X, Y, Z ) with origin in the projection center C is defined by the extrinsic parameters characterized
by the roto-translation vectors R, T according to Eq. (6.210)

the known nonlinear equations x = f X/Z and y = f Y/Z, where f indicates the focal
length, which has the effect of introducing a scale change of the image.
In the figure, for simplicity, the image plane is shown in front of the optical
projection center C. In reality, the image plane is located on the opposite side with
the image of the inverted scene. Expressing in homogeneous coordinates we have
the equation of the perspective projection in the following matrix form:
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f X / Z \\ f Y / Z \\ 1 \end{bmatrix} = \begin{bmatrix} f X \\ f Y \\ Z \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (6.205)

Setting f = 1, we can rewrite (6.205), up to an arbitrary scale factor, as follows:

\begin{bmatrix} \tilde{x} \\ \tilde{y} \\ 1 \end{bmatrix} \approx \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \;\Longrightarrow\; \tilde{x} \approx A\,\tilde{X} \qquad (6.206)

where ≈ indicates that the projection x̃ is defined up to a scale factor. Furthermore,
x̃ is independent of the magnitude of X̃; that is, it depends only on the direction of the
3D point relative to the camera and not on its distance. The matrix A represents the
geometric model of the camera and is known as the canonical perspective projection
matrix.

6.7.1.1 Intrinsic Calibration Parameters


Another aspect to consider is the relation that transforms the coordinates x̃ of the
points q projected in the image plane into the coordinates ũ = [u v 1]^T of the sensor,
expressed in pixels. This transformation takes into account the discretization of the
image and the translation of the image coordinates with respect to the principal
point (u_0, v_0) (the projection in the image plane of the perspective projection center).
The principal point corresponds to the intersection of the optical axis with the
image plane of the camera (see Fig. 6.39). To take into account the translation of the
principal point with respect to the origin of the 2D reference system of the image, it is
sufficient to add the horizontal and vertical translation components defined by (u_0, v_0)
to the equation of the perspective projection (6.206). To transform the coordinates
expressed in units of length into pixels, it is necessary to know the horizontal resolution
p_x and the vertical resolution p_y of the sensor, normally expressed in pixel/mm.10

10 The physical position of a point projected in the image plane, normally expressed with a metric unit,
for example in mm, must be transformed into units of the image sensor expressed in pixels, which
typically do not correspond to a metric unit such as mm. The physical image plane is
discretized by the pixels of the sensor, characterized by its horizontal and vertical spatial resolution
expressed in pixels/mm. Therefore, the transformation of coordinates from mm to pixels introduces
a horizontal scale factor p_x = npix_x/dim_x and a vertical one p_y = npix_y/dim_y with which to multiply
the physical image coordinates (x, y). In particular, npix_x × npix_y represents the horizontal and vertical
resolution of the sensor in pixels, while dim_x × dim_y indicates the horizontal and vertical dimensions of
the sensor given in mm. Often, to define the pixel rectangles, the aspect ratio is given as the ratio of
width to height of the pixel (usually expressed in decimal form, for example 1.25, or in fractional
form such as 5/4, to avoid the problem of periodic decimal approximation). Furthermore,
the coordinates must be translated with respect to the position of the principal point expressed in
pixels, since the center of the sensor's pixel grid does not always correspond to the position of the
principal point. The pixel coordinate system of the sensor is indicated with (u, v), with the principal
point (u_0, v_0) given in pixels and assumed with axes parallel to those of the physical system
(x, y). The accuracy of the coordinates in pixels with respect to the physical ones depends on the
resolution of the sensor and on its dimensions.

Assuming the translation of the principal point (u_0, v_0) already expressed in pixels,
the coordinates in the image plane expressed in pixels are given by

u = p_x\,\frac{f X}{Z} + u_0 = \alpha_u \frac{X}{Z} + u_0 \qquad v = p_y\,\frac{f Y}{Z} + v_0 = \alpha_v \frac{Y}{Z} + v_0 \qquad (6.207)
where α_u = p_x f and α_v = p_y f are the horizontal and vertical scale factors that
transform the physical coordinates of the points projected in the image plane into
pixels, and u_0 = [u_0 v_0]^T is the principal point of the sensor, already expressed
in pixels. The parameters α_u, α_v, u_0, and v_0 represent the intrinsic (or internal)
parameters of the camera. Normally the pixels are square, for which α_u = α_v = α,
and so α is considered the focal length of the optical system expressed in pixel
units. An additional intrinsic parameter to consider is the skew parameter
s of the sensor (usually assumed to be zero), given by s = α_v tan β, which
takes into account the nonrectangular shape of the pixels. Considering (6.207), the 2D
transformation that relates the points projected in the image plane to the sensor
coordinate system is given by

\begin{bmatrix} \tilde{u} \\ \tilde{v} \\ 1 \end{bmatrix} \approx \begin{bmatrix} \alpha_u & s & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \tilde{x} \\ \tilde{y} \\ 1 \end{bmatrix} \;\Longrightarrow\; \tilde{u} = K\,\tilde{x} \qquad (6.208)

where K is a 3 × 3 upper triangular matrix, known as the camera calibration matrix, to be
computed for each optical system used.

6.7.1.2 Extrinsic Calibration Parameters


A more general projection model, which relates 3D points of the scene to the 2D
image plane, involves the geometric transformation that maps 3D points of a rigid
body, expressed in homogeneous coordinates in the world reference system
X_w = [X_w Y_w Z_w]^T, into the reference system of the camera X = [X Y Z]^T, also
expressed in homogeneous coordinates (see Fig. 6.39). In essence, the reference
system of the objects of the world (with origin in O_w) is linked to the

reference system of the camera by a geometric relation that includes the camera
orientation (through the rotation matrix R of size 3 × 3) and the translation
vector T (a 3D vector indicating the position of the origin O_w with respect to the
camera reference system). This transformation, in compact matrix form, is given
by

X = R\,X_w + T \qquad (6.209)

while in homogeneous coordinates it is written as follows:

\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \begin{bmatrix} R_{3\times3} & T_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \;\Longrightarrow\; \tilde{X} = M\,\tilde{X}_w \qquad (6.210)

where M is the 4 × 4 roto-translation matrix whose elements represent the extrinsic
(or external) parameters that characterize the transformation between the world
coordinates and those of the camera.11
In particular, the rotation matrix R, although it has 9 elements, has 3 degrees of
freedom, which correspond to the rotation angles around the three axes
of the reference system. It is convenient to express the rotations about the three axes
through a single matrix operation instead of three separate elementary matrices for
the three rotation angles (φ, θ, ψ) associated, respectively, with the axes (X, Y, Z).
The translation vector has 3 elements, so the extrinsic parameters are 6 in total,
characterizing the position and attitude of the camera.
Combining now the perspective projection Eq. (6.206), the camera calibration
Eq. (6.208), and the rigid body transformation Eq. (6.210), we can obtain a single
equation that relates a 3D point, expressed in homogeneous world coordinates
X̃_w, with its projection in the image plane, expressed in homogeneous pixel coordinates
ũ, given by

\tilde{u} \approx P\,\tilde{X}_w \qquad (6.211)

where P is the perspective projection matrix, of size 3 × 4, expressed in the most
general form:

P_{3\times4} = K_{3\times3}\,[R_{3\times3} \;\; T_{3\times1}] \qquad (6.212)

In essence, the perspective projection matrix P defined by (6.212) includes: the sim-
ple perspective transformation defined by A (Eq. 6.206), the effects of discretization

11 The roto-translation transformation expressed by (6.209) indicates that the rotation R is performed
first and then the translation T. Often it is reported with the operations inverted, that is, the
translation first and the rotation after, thus having

X = R(X_w - T) = R\,X_w + (-R\,T)

and in this case, in Eq. (6.210), the translation term T is replaced with -RT.

of the image plane associated with the sensor through the matrix K (Eq. 6.208), and
the transformation that relates the position of the camera with respect to the scene
by means of the matrix M (Eq. 6.210).
The transformation (6.211) is based only on the pinhole perspective projection
model and does not include the effects due to distortions introduced by the optical
system, normally modeled with other parameters that describe radial and tangential
distortions (see Sect. 4.5 Vol. I).
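As an illustration of Eqs. (6.208)–(6.212), the following NumPy sketch builds a projection matrix from assumed intrinsic and extrinsic parameters and projects a world point; all numerical values and names are hypothetical.

```python
import numpy as np

def projection_matrix(K, R, T):
    """P = K [R | T] of Eq. (6.212): K is 3x3, R is 3x3, T is a 3-vector."""
    return K @ np.hstack((R, T.reshape(3, 1)))

def project(P, Xw):
    """Project a 3D world point (Eq. 6.211) and dehomogenize to pixels."""
    u = P @ np.append(Xw, 1.0)
    return u[:2] / u[2]

# Assumed intrinsics (square pixels, zero skew) and extrinsics.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                    # camera aligned with the world axes
T = np.array([0.0, 0.0, 2.0])    # world origin 2 units in front of the camera
P = projection_matrix(K, R, T)
print(project(P, np.array([0.1, -0.05, 3.0])))  # pixel coordinates [336., 232.]
```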

6.7.2 Methods of Structure from Motion

Starting from Eq. (6.211), we can now analyze the proposed methods to solve the
problem of 3D scene reconstruction by capturing a sequence of images with a single
camera whose intrinsic parameters remain constant even if not known (uncalibrated
camera) together without the knowledge of the motion [47,48]. The proposed meth-
ods are part of the problem of solving an inverse problem. In fact, with Eq. (6.211) we
want to reconstruct the 3D structure of the scene (and the motion), that is, calculate
the homogeneous coordinates of n points X̃ i (for simplicity we indicate, from now
on, without a subscript w the 3D points of the scene) whose projection is known in
homogeneous coordinates ũi j detected in m images characterized by the associated
m perspective projection matrices P j unknowns.
Essentially, the problem is reduced by estimating the m projection matrices P j
and the n 3D points X̃ i , known the m · n correspondences ũi j found in the sequence
of m images (see Fig. 6.38). We observe that with (6.211) the scene is reconstructed
up to a scale factor having considered a perspective projection. In fact, if the points of
the scene are scaled by a factor λ and we simultaneously scale the projection matrix
by a factor 1/λ, the points of the scene projected in the image plane remain exactly the
same:

\tilde{u} \approx P\,\tilde{X} = \Big(\frac{1}{\lambda} P\Big)(\lambda \tilde{X}) \qquad (6.213)

Therefore, the scene cannot be reconstructed with an absolute scale value. For recognition
applications, even though the reconstructed structure of the scene resembles the real
one only up to an arbitrary scale, it still provides useful information. The methods proposed
in the literature use the algebraic approach [49] (based on the fundamental matrix described
in Chap. 7), the factorization approach (based on the singular value decomposition, SVD),
and the bundle adjustment approach [50,51], which iteratively refines the motion parameters
and the 3D structure of the scene by minimizing an appropriate cost functional. In the
following section, we will describe the factorization method.

6.7.2.1 Factorization Method


The f actori zation method proposed in [52] determines the scene structure and
motion information from a sequence of images by assuming an orthographic projec-

tion. This simplifies the geometric model of projection of 3D points in the image
plane, whose distance from the camera can be considered irrelevant (the scale factor
due to the object–camera distance is ignored). It is assumed that the depth of the
object is very small compared to the observation distance. In this context, no motion
information is detected along the optical axis (Z-axis). The orthographic projection
is a particular case of the perspective one,12 for which the orthographic projection matrix
is:
\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \;\Longrightarrow\; x = X, \;\; y = Y \qquad (6.214)

Combining Eq. (6.214) with the roto-translation matrix (6.210), we obtain an affine
projection:

\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & T_1 \\ 0 & 1 & 0 & T_2 \\ 0 & 0 & 1 & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
= \begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & T_1 \\ 0 & 1 & 0 & T_2 \\ 0 & 0 & 1 & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (6.215)

from which, simplifying (eliminating the last column in the first matrix and the last
row in the second matrix) and expressing in nonhomogeneous coordinates, the orthographic
projection combined with the extrinsic roto-translation parameters is obtained:

\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} = R\,X + T \qquad (6.216)

From (6.213), we know that we cannot determine the absolute positions of the 3D
points. To factorize, it is further worth simplifying (6.216) by assuming the origin of
the reference system of the 3D points coincident with their center of mass, namely:

\frac{1}{n} \sum_{i=1}^{n} X_i = 0 \qquad (6.217)

Now we can center the points in each image of the sequence, subtracting from the
coordinates x_{ij} the coordinates of their center of mass, and
12 The distance between the projection center and the image plane is assumed infinite with focal

length f → ∞ and parallel projection lines.



indicating with x̃_{ij} the new coordinates:

\tilde{x}_{ij} = x_{ij} - \frac{1}{n}\sum_{k=1}^{n} x_{ik} = R_j X_i + T_j - \frac{1}{n}\sum_{k=1}^{n} (R_j X_k + T_j)
= R_j \Big( X_i - \underbrace{\frac{1}{n}\sum_{k=1}^{n} X_k}_{=0} \Big) = R_j X_i \qquad (6.218)

We are now able to factorize, using (6.218), by aggregating the centered data into large
matrices. In particular, the coordinates of the centered 2D points x̃_{ij} are placed in
a single matrix W = [U; V], organized into two stacked submatrices, each of size m × n.
In the m rows of the submatrix U are placed the horizontal coordinates of the n centered
2D points relative to the m images. Similarly, in the m rows of the submatrix V
are placed the vertical coordinates of the n centered 2D points. We thus obtain the
matrix W, called the measurement matrix, of size 2m × n. In analogy to W, we can
construct the rotation matrix M = [R_1; R_2] relative to all the m images, indicating with
R_1 = [r_11 r_12 r_13] and R_2 = [r_21 r_22 r_23], respectively, the rows of the camera rotation
matrix (Eq. 6.216) representing the motion information.13 Rewriting Eq. (6.218) in
matrix form, we get
\underbrace{\begin{bmatrix} \tilde{x}_{11} & \tilde{x}_{12} & \cdots & \tilde{x}_{1n} \\ \vdots & & & \vdots \\ \tilde{x}_{m1} & \tilde{x}_{m2} & \cdots & \tilde{x}_{mn} \\ \tilde{y}_{11} & \tilde{y}_{12} & \cdots & \tilde{y}_{1n} \\ \vdots & & & \vdots \\ \tilde{y}_{m1} & \tilde{y}_{m2} & \cdots & \tilde{y}_{mn} \end{bmatrix}}_{2m \times n} = \underbrace{\begin{bmatrix} R1_1 \\ \vdots \\ R1_m \\ R2_1 \\ \vdots \\ R2_m \end{bmatrix}}_{2m \times 3} \underbrace{\begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix}}_{3 \times n} \qquad (6.219)

In addition, by aggregating all the coordinates of the n centered 3D points in the
matrix S = [X_1 X_2 ··· X_n], called the matrix of the 3D structure of the scene, we have
the factorization of the measurement matrix W as the product of the motion matrix M and
the structure matrix S, which in compact matrix form is

W = MS (6.220)

By virtue of the rank theorem, the matrix W of the observed centered measurements,
of size 2m × n, has rank at most 3. This statement is immediate
considering the properties of the rank. The rank of a matrix of size m × n is at most
the minimum between m and n. In fact, the rank of a product matrix A · B is at

13 Recall that the rows of R represent the coordinates in the original space of the unit vectors along
the coordinate axes of the rotated space, while the columns of R represent the coordinates in the
rotated space of unit vectors along the axes of the original space.

most the minimum between the rank of A and that of B. Applying this rank theorem
to the factoring matrices M · S, we immediately get that the rank of W is 3. The
importance of the rank theorem is evidenced by the fact that the 2m × n measurements
taken from the sequence of images are highly redundant for reconstructing the 3D scene.
It also tells us that, quantitatively, the 2m × 3 motion information and the
3 × n coordinates of the 3D points would be sufficient to reconstruct the 3D scene.
Unfortunately, neither of these two quantities is known; to solve the problem of
reconstructing the structure from motion, the factorization method has been
proposed, in which the problem is seen as an overdetermined system that can be solved
in the least-squares sense through the singular value decomposition (SVD). The SVD
approach involves the following decomposition of W:

\underbrace{W}_{2m \times n} = \underbrace{U}_{2m \times 2m}\;\underbrace{\Sigma}_{2m \times n}\;\underbrace{V^T}_{n \times n} \qquad (6.221)

where the matrices U and V are unitary and orthogonal (U^T U = V^T V = I, where
I is the identity matrix), and Σ is a diagonal matrix with nonnegative elements in
descending order, σ_11 ≥ σ_22 ≥ ···, uniquely determined by W. The values σ_i(W) ≥
0 are the singular values of the decomposition of W, and the columns u_i and v_i
are the orthonormal eigenvectors of the symmetric matrices WW^T and W^T W,
respectively. It should be noted that the properties of W are closely related to the
properties of these symmetric matrices. The singular values σ_i(W) coincide with
the nonnegative square roots of the eigenvalues λ_i of the symmetric matrices WW^T
and W^T W.
Returning to the factorization approach for W and considering that the rank of a
matrix is equal to the number of nonzero singular values, it is required that the first
three values σ_i, i = 1, ..., 3, be nonzero, while the rest are zero. In practice, due to
noise, the singular values different from zero are more than 3, but by virtue of the
rank theorem we can ignore all the others. Under these conditions, the SVD
decomposition, given by Eq. (6.221), can be considered only with the first three
columns of U, the first three rows of V^T, and the 3 × 3 diagonal matrix Σ', thus
obtaining the following truncated SVD decomposition:

\underbrace{\hat{W}}_{2m \times n} = \underbrace{U'}_{2m \times 3}\;\underbrace{\Sigma'}_{3 \times 3}\;\underbrace{V'^T}_{3 \times n} \qquad (6.222)

Essentially, according to the rank theorem, considering only the three greatest singular
values of W and the corresponding left and right eigenvectors, with (6.222) we get the
best rank-3 estimate of the motion and structure information. Therefore, Ŵ can be
considered a good estimate of the ideal observed measurements W, which we can define as

\hat{W} = \big(U' [\Sigma']^{1/2}\big)\,\big([\Sigma']^{1/2} V'^T\big) \;\Longrightarrow\; \hat{W} = \hat{M}\,\hat{S} \qquad (6.223)

with \hat{M} of size 2m × 3 and \hat{S} of size 3 × n.

where the matrices M̂ and Ŝ, even if different from the ideal M and S, still retain the
motion information of the camera and the structure of the scene, respectively. It is
pointed out that, except for noise, the matrices M̂ and Ŝ are, respectively, a linear
transformation of the true motion matrix M and of the true matrix of
the scene structure S. If the observed measurement matrix W is acquired with an adequate
frame rate, appropriate to the camera motion, the noise level can be low enough
that it can be ignored. This can be checked by analyzing the singular values of Ŵ,
verifying that the ratio of the third to the fourth singular value is sufficiently large.
We also point out that the decomposition obtained with (6.223) is not unique. In
fact, any invertible matrix Q of size 3 × 3 would produce an identical decomposition
of Ŵ with the matrices M̂Q and Q^{-1}Ŝ, as follows:

\hat{W} = (\hat{M} Q)(Q^{-1} \hat{S}) = \hat{M}(Q Q^{-1})\hat{S} = \hat{M}\hat{S} \qquad (6.224)

Another problem concerns the row pairs R1_i^T, R2_i^T of M̂, which may not necessarily be
orthogonal.14 To solve these problems, we can find a matrix Q such that appropriate
rows of M̂ satisfy some metric constraints. Indeed, considering the matrix M̂ as a
linear transformation of the true motion matrix M (and similarly for the matrix of the
scene structure), we can find a matrix Q such that M = M̂Q and S = Q^{-1}Ŝ. The
matrix Q is found by observing that the rows of the true motion matrix M, considered
as 3D vectors, must have unit norm and that the first m rows R1_i^T must be orthogonal
to the corresponding last m rows R2_i^T. Therefore, the solution for Q is found with
the system of equations deriving from the following metric constraints:

\hat{R1}_i^T\, Q Q^T\, \hat{R1}_i = 1
\hat{R2}_i^T\, Q Q^T\, \hat{R2}_i = 1 \qquad (6.225)
\hat{R1}_i^T\, Q Q^T\, \hat{R2}_i = 0

This system of 3 nonlinear equations (written for each of the m images) can be solved
by setting C = QQ^T and solving with respect to C (which has 6 unknowns, the 6
elements of the symmetric matrix C), then recovering Q with iterative methods, for
example the Newton method, or with factorization methods based on Cholesky or SVD.
The described factorization method is summarized in the following steps:

1. Compose the centered measurement matrix W from the n 3D points observed
through tracking over the m images of the sequence, acquired by a moving
camera with an appropriate frame rate. Better results are obtained when the motion
is such that the images of the sequence guarantee the tracking of the points with few

14 Let us recall here, from the properties of the rotation matrix R, that it is normalized, that is, the
squares of the elements in a row or in a column sum to 1, and that it is orthogonal, i.e., the
inner product of any pair of rows or any pair of columns is 0.

occlusions and the distance from the scene is much greater than the depth of the
objects of the scene to be reconstructed.
2. Organize the centered 2D measurements in the matrix W such that each pair of rows
j and j+m contains, respectively, the horizontal and vertical coordinates of the n
3D points projected in the jth image, while a column of W contains, in the first m
elements, the horizontal coordinates of a 3D point observed in the m images and,
in the last m elements, the vertical coordinates of the same point. It follows
that W has size 2m × n.
3. Calculate the decomposition W = U Σ V^T with the SVD method, which
produces the following matrices:

– U of size 2m × 2m.
– Σ of size 2m × n.
– V^T of size n × n.

4. From the original decomposition, consider the decomposition truncated to the
three largest singular values, forming the following matrices:

– U' of size 2m × 3, corresponding to the first three columns of U.
– Σ' of size 3 × 3, corresponding to the diagonal matrix with the three largest
singular values.
– V'^T of size 3 × n, corresponding to the first three rows of V^T.
5. Define the motion matrix M̂ = U' [Σ']^{1/2} and the structure matrix Ŝ = [Σ']^{1/2} V'^T.
6. Eliminate the ambiguity by solving for Q so that the rows of M̂ satisfy the metric
constraints (6.225).
7. The final solution is given by M̆ = M̂ Q and S̆ = Q^{-1} Ŝ (a code sketch of these
steps follows below).
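As a minimal sketch of the steps above (assuming a well-conditioned centered measurement matrix and a positive definite C = QQ^T, so that a Cholesky factorization can be used), the following NumPy code factorizes W and applies the metric constraints (6.225) in a least-squares sense; function and variable names are illustrative.

```python
import numpy as np

def factorize(W):
    """Sketch of the orthographic factorization method.

    W : 2m x n centered measurement matrix (rows 0..m-1: x, rows m..2m-1: y).
    Returns the motion matrix M (2m x 3) and the structure S (3 x n)."""
    m = W.shape[0] // 2

    # Steps 3-5: truncated SVD and provisional factors M_hat, S_hat.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Up, Sp, Vtp = U[:, :3], np.diag(np.sqrt(s[:3])), Vt[:3, :]
    M_hat = Up @ Sp
    S_hat = Sp @ Vtp

    # Step 6: metric constraints (6.225) on C = Q Q^T, solved in least squares.
    def row_eq(a, b):
        # coefficients of the 6 unknowns of the symmetric C in a^T C b
        return [a[0]*b[0], a[1]*b[1], a[2]*b[2],
                a[0]*b[1] + a[1]*b[0], a[0]*b[2] + a[2]*b[0],
                a[1]*b[2] + a[2]*b[1]]

    A, rhs = [], []
    for i in range(m):
        r1, r2 = M_hat[i], M_hat[m + i]
        A += [row_eq(r1, r1), row_eq(r2, r2), row_eq(r1, r2)]
        rhs += [1.0, 1.0, 0.0]
    c, *_ = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)
    C = np.array([[c[0], c[3], c[4]],
                  [c[3], c[1], c[5]],
                  [c[4], c[5], c[2]]])
    Q = np.linalg.cholesky(C)      # assumes C is positive definite

    # Step 7: final metric factors.
    return M_hat @ Q, np.linalg.inv(Q) @ S_hat
```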

Figure 6.40 shows the 3D reconstruction of the interior of an archaeological cave
called Grotta dei Cervi,15 with paintings dating back to the middle
Neolithic period, made with red ochre and bat guano. The walls have an irregular
surface characterized by cavities and prominences that give them a strong three-
dimensional structure. The sequence of images and the process of detecting the
points of interest (with the Harris method), together with the evaluation of the
correspondences, are based on the approach described in [53], while the 3D reconstruction
of the scene is realized with the factorization method. The 3D scene was reconstructed
with VRML software,16 using a triangular mesh built on the points of interest shown
on the three images and superimposing the texture of the 2D images.

15 The Grotta dei Cervi (Deer Cave) is located in Porto Badisco, near Otranto, in Apulia, Italy, at a
depth of 26 m below sea level and represents an important site: it is, in fact, the most impressive
Neolithic pictorial complex in Europe, discovered only in 1970.

16 Virtual Reality Modeling Language (VRML) is a programming language that allows the simulation
of three-dimensional virtual worlds. With VRML it is possible to describe virtual environments that
include objects, light sources, images, sounds, and movies.

Fig. 6.40 Results of the 3D scene reconstructed with the factorization method. The first row shows
the three images of the sequence with the points of interest used. The second row shows the 3D
image reconstructed as described in [53]


References
1. J.J. Gibson, The Perception of the Visual World (Sinauer Associates, 1995)
2. T. D’orazio, M. Leo, N. Mosca, M. Nitti, P. Spagnolo, A. Distante, A visual system for real
time detection of goal events during soccer matches. Comput. Vis. Image Underst. 113 (2009a),
622–632
3. T. D’Orazio, M. Leo, P. Spagnolo, P.L. Mazzeo, N. Mosca, M. Nitti, A. Distante, An investi-
gation into the feasibility of real-time soccer offside detection from a multiple camera system.
IEEE Trans. Circuits Syst. Video Surveill. 19(12), 1804–1818 (2009b)
4. A. Distante, T. D’Orazio, M. Leo, N. Mosca, M. Nitti, P. Spagnolo, E. Stella, Method
and system for the detection and the classification of events during motion actions. Patent
PCT/IB2006/051209, International Publication Number (IPN) WO/2006/111928 (2006)
5. L. Capozzo, A. Distante, T. D’Orazio, M. Ianigro, M. Leo, P.L. Mazzeo, N. Mosca, M. Nitti,
P. Spagnolo, E. Stella, Method and system for the detection and the classification of events
during motion actions. Patent PCT/IB2007/050652, International Publication Number (IPN)
WO/2007/099502 (2007)


6. B.A. Wandell, Book Rvw: Foundations of vision. By B.A. Wandell. J. Electron. Imaging 5(1),
107 (1996)
7. B.K.P. Horn, B.G. Schunck, Determining optical flow. Artif. Intell. 17, 185–203 (1981)
8. R. Faragher, Understanding the basis of the Kalman filter via a simple and intuitive
derivation. IEEE Signal Process. Mag. 29(5), 128–132 (2012)
9. P. Zarchan, H. Musoff, Fundamentals of Kalman Filtering: A Practical Approach (American
Institute of Aeronautics and Astronautics, 2000). ISBN 1563474557, 9781563474552
10. E. Meinhardt-Llopis, J.S. Pérez, D. Kondermann, Horn-schunck optical flow with a multi-scale
strategy. Image Processing On Line, 3, 151–172 (2013). https://doi.org/10.5201/ipol.2013.20
11. H.H. Nagel, Displacement vectors derived from second-order intensity variations in image
sequences. Comput. Vis., Graph. Image Process. 21, 85–117 (1983)
12. B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo
vision, in Proceedings of Imaging Understanding Workshop (1981), pp. 121–130
13. J.L. Barron, D.J. Fleet, S. Beauchemin, Performance of optical flow techniques. Int. J.
Comput. Vis. 12(1), 43–77 (1994)
14. T. Brox, A. Bruhn, N. Papenberg, J. Weickert, High accuracy optical flow estimation based on
a theory for warping. in Proceedings of the European Conference on Computer Vision, vol. 4
(2004), pp. 25–36
15. M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise
smooth flow fields. Comput. Vis. Image Underst. 63(1), 75–104 (1996)
16. J. Wang, E. Adelson, Representing moving images with layers. IEEE Trans. Image Process. 3(5), 625–638 (1994)
17. E. Mémin, P. Pérez, Hierarchical estimation and segmentation of dense motion fields. Int. J.
Comput. Vis. 46(2), 129–155 (2002)
18. S. Baker, I. Matthews, Lucas–Kanade 20 years on: a unifying framework. Int. J. Comput. Vis. 56(3), 221–255 (2004)
19. H.-Y. Shum, R. Szeliski, Construction of panoramic image mosaics with global and local
alignment. Int. J. Comput. Vis. 16(1), 63–84 (2000)
20. D.G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91–110 (2004)
21. F. Marino, E. Stella, A. Branca, N. Veneziani, A. Distante, Specialized hardware for real-time
navigation. R.-Time Imaging, Acad. Press 7, 97–108 (2001)
22. S.T. Barnard, W.B. Thompson, Disparity analysis of images. IEEE Trans. Pattern Anal. Mach. Intell. 2(4), 334–340 (1980)
23. M. Leo, T. D’Orazio, P. Spagnolo, P.L. Mazzeo, A. Distante, Sift based ball recognition in
soccer images, in Image and Signal Processing, vol. 5099, ed. by A. Elmoataz, O. Lezoray, F.
Nouboud, D. Mammass (Springer, Berlin, Heidelberg, 2008), pp. 263–272
24. M. Leo, P.L. Mazzeo, M. Nitti, P. Spagnolo, Accurate ball detection in soccer images using
probabilistic analysis of salient regions. Mach. Vis. Appl. 24(8), 1561–1574 (2013)
25. T. D’Orazio, M. Leo, C. Guaragnella, A. Distante, A new algorithm for ball recognition using circle Hough transform and neural classifier. Pattern Recognit. 37(3), 393–408 (2004)
26. T. D’Orazio, N. Ancona, G. Cicirelli, M. Nitti, A ball detection algorithm for real soccer
image sequences, in Proceedings of the 16th International Conference on Pattern Recognition
(ICPR’02), vol. 1 (2002), pp. 201–213
27. Y. Bar-Shalom, X.R. Li, T. Kirubarajan, Estimation with Applications to Tracking and Navi-
gation (Wiley, 2001). ISBN 0-471-41655-X, 0-471-22127-9
28. S.C.S. Cheung, C. Kamath, Robust techniques for background subtraction in urban traffic
video. Vis. Commun. Image Process. 5308, 881–892 (2004)
29. C.R. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: real-time tracking of the human body. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 780–785 (1997)

30. D. Makris, T. Ellis, Path detection in video surveillance. Image Vis. Comput. 20, 895–903
(2002)
31. C. Stauffer, W.E. Grimson, Adaptive background mixture models for real-time tracking, in IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2 (1999)
32. A. Elgammal, D. Harwood, L. Davis, Non-parametric model for background subtraction, in European Conference on Computer Vision (2000), pp. 751–767
33. N.M. Oliver, B. Rosario, A.P. Pentland, A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 831–843 (2000)
34. R. Li, Y. Chen, X. Zhang, Fast robust eigen-background updating for foreground detection, in
International Conference on Image Processing (2006), pp. 1833–1836
35. A. Sobral, A. Vacavant, A comprehensive review of background subtraction algorithms evalu-
ated with synthetic and real videos. Comput. Vis. Image Underst. 122(05), 4–21 (2014). https://
doi.org/10.1016/j.cviu.2013.12.005
36. Y. Benezeth, P.M. Jodoin, B. Emile, H. Laurent, C. Rosenberger, Comparative study of back-
ground subtraction algorithms. J. Electron. Imaging 19(3) (2010)
37. M. Piccardi, T. Jan, Mean-shift background image modelling. Int. Conf. Image Process. 5,
3399–3402 (2004)
38. B. Han, D. Comaniciu, L. Davis, Sequential kernel density approximation through mode prop-
agation: applications to background modeling, in Proceedings of the ACCV-Asian Conference
on Computer Vision (2004)
39. D.J. Heeger, A.D. Jepson, Subspace methods for recovering rigid motion i: algorithm and
implementation. Int. J. Comput. Vis. 7, 95–117 (1992)
40. H.C. Longuet-Higgins, K. Prazdny, The interpretation of a moving retinal image. Proc. R. Soc.
Lond. 208, 385–397 (1980)
41. H.C. Longuet-Higgins, The visual ambiguity of a moving plane. Proc. R. Soc. Lond. 223,
165–175 (1984)
42. W. Burger, B. Bhanu, Estimating 3-D egomotion from perspective image sequences. IEEE
Trans. Pattern Anal. Mach. Intell. 12(11), 1040–1058 (1990)
43. A. Branca, E. Stella, A. Distante, Passive navigation using focus of expansion, in WACV96
(1996), pp. 64–69
44. G. Convertino, A. Branca, A. Distante, Focus of expansion estimation with a neural network, in
IEEE International Conference on Neural Networks, 1996, vol. 3 (IEEE, 1996), pp. 1693–1697
45. G. Adiv, Determining three-dimensional motion and structure from optical flow generated by
several moving objects. IEEE Trans. Pattern Anal. Mach. Intell. 7(4), 384–401 (1985)
46. A.R. Bruss, B.K.P. Horn, Passive navigation. Comput. Vis., Graph., Image Process. 21(1), 3–20
(1983)
47. M. Armstrong, A. Zisserman, P. Beardsley, Euclidean structure from uncalibrated images, in British Machine Vision Conference (1994), pp. 509–518
48. T.S. Huang, A.N. Netravali, Motion and structure from feature correspondences: a review. Proc.
IEEE 82(2), 252–267 (1994)
49. S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera. Int. J. Comput.
Vis. 8(2), 123–151 (1992)
50. D.C. Brown, The bundle adjustment - progress and prospects. Int. Arch. Photogramm. 21(3)
(1976)
51. B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment – a modern synthesis, in Vision Algorithms: Theory and Practice, ed. by B. Triggs, A. Zisserman, R. Szeliski (Springer, Berlin, 2000), pp. 298–375
52. C. Tomasi, T. Kanade, Shape and motion from image streams under orthography: a factorization
method. Int. J. Comput. Vis. 9(2), 137–154 (1992)
53. T. Gramegna, L. Venturino, M. Ianigro, G. Attolico, A. Distante, Pre-historical cave fruition through robotic inspection, in IEEE International Conference on Robotics and Automation (2005)
7 Camera Calibration and 3D Reconstruction

7.1 Introduction

In Chap. 2 and Sect. 5.6 of Vol.I, we described the radiometric and geometric aspects of an imaging system, respectively. In Sect. 6.7.1, we instead introduced the geometric projection model of the 3D world in the image plane. Having defined a geometric projection model, let us now examine the aspects involved in camera calibration, required to correct all the sources of geometric distortion introduced by the optical system (radial and tangential distortions, …) and by the digitization system (sensor noise, quantization error, …), whose parameters are often not provided by vision system manufacturers.
In various applications, a camera is used to extract metric information of the scene from the image. For example, in the dimensional control of an object it is required to perform accurate control measurements, while a mobile vehicle equipped with a vision system is required to self-locate, that is, to estimate its position and orientation with respect to the scene. Therefore, a calibration procedure of the camera becomes necessary, which determines the relative intrinsic parameters (focal length, the horizontal and vertical dimensions of the single photoreceptor of the sensor or the aspect ratio, the size of the sensor matrix, the coefficients of the radial distortion model, the coordinates of the principal point or the optical center) and the extrinsic parameters. The latter define the geometric transformation to pass from the world reference system to the camera system (the 3 translation parameters and the 3 rotation parameters around the coordinate axes) described in Sect. 6.7.1.
While the intrinsic parameters define the internal characteristics of the acquisition system, independent of the position and attitude of the camera, the extrinsic parameters describe its position and attitude regardless of its internal parameters. The level of accuracy of these parameters defines the accuracy of the measurements derivable from the image. With reference to Sect. 6.7.1, the geometric model underlying the image formation process is described by Eq. (6.211), which we rewrite here:
ũ ≈ P X̃ w (7.1)


where ũ indicates the coordinates in the image plane, expressed in pixels (taking
into account the position and orientation of the sensor in the image plane), of 3D
points with coordinates X̃ w , expressed in the world reference system, and P is the
perspective projection matrix, of size 3 × 4, expressed in the most general form:

P 3×4 = K 3×3 [R3×3 T 3×1 ] (7.2)

The matrix P, defined by (7.2), represents the most general perspective projection
matrix that includes

1. The simple canonical perspective transformation defined by the matrix $A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ (Eq. 6.206) according to the pinhole model;
2. The effects of discretization of the image plane associated with the sensor through
the matrix K (Eq. 6.208);
3. The geometric transformation that relates the position and orientation of the camera with respect to the scene through the matrix $M = \begin{bmatrix} R & T \\ 0^T & 1 \end{bmatrix}$ (Eq. 6.210).

Essentially, in the camera calibration process, the matrix K is the matrix of intrinsic parameters (at most 5) that models the pinhole perspective projection of 3D points, expressed in camera coordinates, into 2D image coordinates, and the further transformation needed to take into account the displacement of the sensor in the image plane, that is, the offset of the projection of the principal point on the sensor (see Fig. 6.39). The intrinsic parameters describe the internal characteristics of the camera regardless of its location and position with respect to the scene. Therefore, with the calibration process of the camera, all its intrinsic parameters are determined, which correspond to the elements of the matrix K.
In various applications, it is useful to define the observed scene with respect to an arbitrary 3D reference system instead of the camera reference system (see Fig. 6.39). The equation that performs the transformation from one system to the other is given by Eq. (6.209), where M is the roto-translation matrix of size 4 × 4 whose elements represent the extrinsic (or external) parameters that characterize the transformation between world and camera coordinates. This matrix models a geometric relationship that includes the orientation of the camera (through the rotation matrix R of size 3 × 3) and the translation T (the 3D vector indicating the position of the origin O_w with respect to the reference system of the camera), as shown in Fig. 6.39 (under the hypothesis of rigid motion).

7.2 Influence of the Optical System

The transformation (7.1) is based only on the pinhole perspective projection model
and does not contemplate the effects due to the distortions introduced by the optical
system, normally modeled with other parameters that describe the radial and tan-

gential distortions (see Sect. 4.5 Vol.I). These distortions are very accentuated when
using optics with a large angle of view and in low-cost optical systems.
The radial distortion generates an inward or outward displacement of a 3D point projected in the image plane with respect to its ideal position. It is essentially caused by a defect in the radial curvature of the lenses of the optical system. A negative radial displacement of the image points produces barrel distortion: the outermost points are pulled more and more toward the optical axis, with a scale factor that decreases as the axial distance decreases. A positive radial displacement instead produces pincushion distortion: the outermost points are pushed outward, with a scale factor that increases with the axial distance. This type of distortion has circular symmetry around the optical axis (see Fig. 4.25 Vol.I).
The tangential or decentering distortion is instead caused by the displacement of the lens center with respect to the optical axis. This error is referred to the coordinates of the principal point (x̃0, ỹ0) in the image plane.
Experimentally, it has been observed that radial distortion is dominant. Although both optical distortions are generated by complex physical phenomena, they can be modeled with acceptable accuracy by a polynomial function D(r), where the variable r is the radial distance of the image points (x̃, ỹ) (obtained with the ideal pinhole projection, see Eq. 6.206) from the principal point (x̃0, ỹ0). In essence, the optical distortions influence the pinhole perspective projection coordinates and can be corrected before or after the transformation into sensor coordinates, depending on the camera calibration method chosen.
Therefore, having obtained the projections with distortions (x̃, ỹ) of the 3D points
of the world in the image plane with the (6.206), the corrected coordinates by radial
distortions, indicated with (x̃c , ỹc ) are obtained as follows:

x̃c = x̃0 + (x̃ − x̃0 ) · D(r, k) ỹc = ỹ0 + ( ỹ − ỹ0 ) · D(r, k) (7.3)

where (x̃0 , ỹ0 ) is the principal point and D(r ) is the function that models the nonlinear
effects of radial distortion, given by

$$D(r, k) = k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots \qquad (7.4)$$

with

$$r = \sqrt{(\tilde{x} - \tilde{x}_0)^2 + (\tilde{y} - \tilde{y}_0)^2} \qquad (7.5)$$

Terms with powers of r greater than the sixth make a negligible contribution and can be assumed to be zero. Experimental tests have shown that 2–3 coefficients k_i are sufficient to correct almost 95% of the radial distortion for a medium-quality optical system.
To include also the tangential distortion that attenuates the effects of the lens
decentralization, the tangential correction component is added to Eq. (7.3), obtaining
the following equations:
    
$$\tilde{x}_c = \tilde{x}_0 + (\tilde{x} - \tilde{x}_0)\left[k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots\right] + \left[p_1\left(r^2 + 2(\tilde{x} - \tilde{x}_0)^2\right) + 2 p_2 (\tilde{x} - \tilde{x}_0)(\tilde{y} - \tilde{y}_0)\right]\left(1 + p_3 r^2 + \cdots\right) \qquad (7.6)$$

$$\tilde{y}_c = \tilde{y}_0 + (\tilde{y} - \tilde{y}_0)\left[k_1 r^2 + k_2 r^4 + k_3 r^6 + \cdots\right] + \left[2 p_1 (\tilde{x} - \tilde{x}_0)(\tilde{y} - \tilde{y}_0) + p_2\left(r^2 + 2(\tilde{y} - \tilde{y}_0)^2\right)\right]\left(1 + p_3 r^2 + \cdots\right) \qquad (7.7)$$
As already mentioned, the radial distortion is dominant and, with this approximation, is characterized by the coefficients (k_i, i = 1, 2, 3), while the tangential distortion is characterized by the coefficients (p_i, i = 1, 2, 3). All of them can be obtained through an ad hoc calibration process, for example by projecting sample patterns on a flat surface (grids with vertical and horizontal lines, or other types of patterns) and then acquiring the distorted images. Knowing the geometry of the projected patterns and computing the coordinates of the distorted patterns with Eqs. (7.6) and (7.7), we can find the optimal distortion coefficients through a system of nonlinear equations and a nonlinear regression method.

7.3 Geometric Transformations Involved in Image Formation

Before analyzing the calibration methods of an acquisition system, we summarize,


with reference to Sect. 6.7.1, the geometric transformations involved in the whole
process of image formation (see Fig. 7.1).

1. Transformation from the world’s 3D coordinates (X w , Yw , Z w ) to the camera’s 3D


coordinates (X, Y, Z ) through the rotation matrix R and the translation vector T
of the coordinate axes according to Eq. 6.209 which, expressed in homogeneous
coordinates in compact form, results:

X̃ = M X̃ w

where M is the roto-translation matrix whose elements represent the extrinsic


parameters.


Fig. 7.1 Sequence of projections: 3D → 2D → 2D → 2D. a Pinhole perspective projection of 3D points projected in the 2D image plane (in the camera's reference system) expressed in normalized coordinates given by Eq. (6.205); b 2D transformation in the image plane of the normalized coordinates (x, y) mapped into coordinates (xc, yc), corrected by radial and tangential distortions; c Further 2D (affine) transformation according to the camera's intrinsic parameter matrix to express the coordinates, corrected by the distortions, in pixel coordinates (u, v) of the sensor

2. Pinhole perspective projection of the 3D points X = (X, Y, Z ) from the camera’s


reference system to the 2D image plane in coordinates x = (x, y), given by Eq.
6.206 which, expressed in homogeneous coordinates in compact form, results in
the following:
x̃ = A X̃

where A is the canonical perspective projection matrix. The image coordinates


are not distorted since they are ideal perspective projections (we assume focal
length f = 1).
3. Apply Eqs. 7.3 to the coordinates x̃ (obtained in the previous step with the pinhole
model) influenced by the dominant radial distortion thus obtaining the new correct
coordinates x̃c in the image plane.
4. Apply the affine transformation given by (6.208):

ũ = K x̃c

characterized by the calibration matrix K of the camera (see Sect. 6.7.1), to pass from the reference system of the image plane (coordinates x̃c = (x̃c, ỹc), corrected by radial distortions) to the sensor reference system with the new coordinates expressed in pixels and indicated ũ = (ũ, ṽ). Recall that the 5
elements of the triangular calibration matrix K are the intrinsic parameters of the
camera.

7.4 Camera Calibration Methods

Camera calibration is a strategic procedure in several applications, where a vision machine can be used for 3D reconstruction of the scene, to make measurements on observed objects, and for several other activities. For these purposes, it is essential that the calibration process estimates the intrinsic and extrinsic parameters of a vision machine with adequate accuracy. Although camera manufacturers provide the level of accuracy of the camera's internal parameters, these must be verified through the acquisition of ad hoc images using calibration platforms whose reference fiducial patterns have well-known geometry.
The various calibration procedures can capture multiple images of a flat calibration platform observed from different points of view, or capture a single image of a platform with at least two different planes of known patterns (see Fig. 7.2), or of a platform with linear structures at various inclinations. By acquiring these platforms from various points of view, with different calibration methods it is possible to estimate the intrinsic and extrinsic parameters.
Unlike the previous methods, a camera autocalibration approach avoids the use
of a calibration platform and determines the intrinsic parameters that are consistent
with the geometry of a given image set [1]. This is possible by finding a sufficient
number of pattern matches between at least three images to recover both intrinsic


Fig. 7.2 Calibration platforms. a Flat calibration platform observed from different view points; b
Platform having at least two different planes of patterns observed from a single view; c Platform
with linear structures

and extrinsic parameters. Since the autocalibration is based on the pattern matches
determined between the images, it is important that with this method the detection
of the corresponding patterns is accurate. With this approach, the distortion of the
optical system is not considered.
In the literature, calibration methods based on vanishing points are proposed using
parallelism and orthogonality between the lines in 3D space. These approaches rely
heavily on the process of detecting edges and linear structures to accurately determine
vanishing points. Intrinsic parameters and camera rotation matrix can be estimated
from three mutually orthogonal vanishing points [2].
The complexity and accuracy of the calibration algorithms also depend on the
need to know all the intrinsic and extrinsic parameters and the need to remove optical
distortions or not. For example, in some cases the estimation of the focal length f and of the location of the principal point (u0, v0) may not be required, since only the relationship between the coordinates of the world's 3D points and their 2D coordinates in the image plane is needed.
The basic approaches of calibration derive from photogrammetry which solves
the problem by minimizing a nonlinear error function to estimate the geometric and
physical parameters of a camera. Less complex solutions have been proposed for computer vision by simplifying the camera model, using linear and nonlinear systems and operating in real time. Tsai [3] proposed a two-stage algorithm, which will be described in the next section; a modified version of this algorithm [4] extends it to four stages, including the Direct Linear Transformation (DLT) method in two of the stages.
Other calibration methods [5] are based on the estimation of the perspective pro-
jection matrix P by acquiring an image of calibration platforms with noncoplanar
patterns (at least two pattern planes as shown in Fig. 7.2b) and selecting at least 6 3D
points and automatically detecting the 2D coordinates of the relative projections in
the image plane. Zhang [6] describes instead an algorithm that requires at least two
images acquired from different points of view of a flat pattern platform (see Fig. 7.2a,
c).

7.4.1 Tsai Method

The Tsai method [3], also called direct parameter calibration (i.e., it directly recovers the intrinsic and extrinsic parameters of the camera), uses as calibration platform two orthogonal planes with equally spaced black square patterns on a white background. Of these patterns, we know all the geometry (number, size, spacing, …) and
their position in the 3D reference system of the world, integral with the calibration
platform (see Fig. 7.2b). The acquisition system (normally a camera) is placed in front
of the platform and through normal image processing algorithms all the patterns of
the calibration platform are automatically detected and the position of each pattern
in the reference system of the camera is determined in the image plane. It is then
possible to find the correspondence between the 2D patterns of the image plane and
the visible 3D patterns of the platform.
The relationship between 3D world coordinates X w = [X w Yw Z w ]T and 3D
coordinates of the camera X = [X Y Z ]T in the context of rigid roto-translation
transformation (see Sect. 6.7) is given by Eq. 6.209, which, rewritten in explicit matrix form, results in the following:

$$X = R X_w + T = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix} \qquad (7.8)$$

where R is the rotation matrix of size 3 × 3 and T is the 3D translation vector indicating the position of the origin O_w with respect to the camera reference system (see Fig. 6.39, which schematizes more generally the projection of 3D points of a rigid body in the world reference system X_w = [X_w Y_w Z_w]^T and in the camera reference system X = [X Y Z]^T). In essence, (7.8) projects a generic point Q of
the scene (in this case, it would be a pattern of the calibration platform) between
the two 3D spaces, the one in the reference system of the platform and that of the
camera. From (7.8), we explicitly get the 3D coordinates in the camera reference
system given by
$$\begin{aligned} X &= r_{11} X_w + r_{12} Y_w + r_{13} Z_w + T_x \\ Y &= r_{21} X_w + r_{22} Y_w + r_{23} Z_w + T_y \\ Z &= r_{31} X_w + r_{32} Y_w + r_{33} Z_w + T_z \end{aligned} \qquad (7.9)$$

The relation that maps the 3D coordinates of the camera X = [X Y Z]^T to the 2D image coordinates, expressed in pixels u = (u, v), is given by Eq. 6.207. Substituting into it Eqs. 7.9, we obtain the following relations to pass from the world coordinates X_w = [X_w Y_w Z_w]^T to the image coordinates expressed in pixels:

$$u - u_0 = \alpha_u \frac{X}{Z} = \alpha_u \frac{r_{11} X_w + r_{12} Y_w + r_{13} Z_w + T_x}{r_{31} X_w + r_{32} Y_w + r_{33} Z_w + T_z}$$
$$v - v_0 = \alpha_v \frac{Y}{Z} = \alpha_v \frac{r_{21} X_w + r_{22} Y_w + r_{23} Z_w + T_y}{r_{31} X_w + r_{32} Y_w + r_{33} Z_w + T_z} \qquad (7.10)$$

where αu and αv are the horizontal and vertical scale factors expressed in pixels, and u0 = [u0 v0]^T is the sensor's principal point. The parameters αu, αv, u0 and v0 represent the intrinsic (or internal) parameters of the camera, which together with the extrinsic (or external) parameters R and T are the unknown parameters. Normally the pixels are square, so αu = αv = α, and α can be regarded as the focal length of the optical system expressed in units of the pixel size. The sensor skew parameter s, neglected in this case (assumed zero), is a parameter that takes into account the non-rectangularity of the pixel area (see Sect. 6.7).
This method assumes that the image coordinates (u0, v0) of the principal point are known (normally assumed to be the center of the image sensor) and considers as unknowns the parameters αu, αv, R and T. To simplify, let us assume (u0, v0) = (0, 0) and denote with (ui, vi) the ith projection in the image plane of the 3D calibration patterns. Thus the left-hand sides of the previous equations become, respectively, ui − 0 = ui and vi − 0 = vi. After these simplifications, dividing the two projection equations member by member (considering only the first and third members of (7.10)), for each ith projected pattern we get the following equation:
u i αv (r21 X w(i) + r22 Yw(i) + r23 Z w(i) + Ty ) = vi αu (r11 X w(i) + r12 Yw(i) + r13 Z w(i) + Tx ) (7.11)

If we now divide both members of (7.11) by αv, denote with α = αu/αv the aspect ratio of the pixel, and define the following symbols:

$$\begin{aligned} \nu_1 &= r_{21} & \nu_5 &= \alpha r_{11} \\ \nu_2 &= r_{22} & \nu_6 &= \alpha r_{12} \\ \nu_3 &= r_{23} & \nu_7 &= \alpha r_{13} \\ \nu_4 &= T_y & \nu_8 &= \alpha T_x \end{aligned} \qquad (7.12)$$

then (7.11) can be rewritten in the following form:


u i X w(i) ν1 + u i Yw(i) ν2 + u i Z w(i) ν3 + u i ν4 − vi X w(i) ν5 − vi Yw(i) ν6 − vi Z w(i) ν7 − vi ν8 = 0
(7.13)
For the N known projections (X_w^{(i)}, Y_w^{(i)}, Z_w^{(i)}) → (u_i, v_i), i = 1, …, N, we have N equations, obtaining a system of linear equations in the 8 unknowns ν = (ν1, ν2, …, ν8), given by
Aν = 0 (7.14)

where:

$$A = \begin{bmatrix} u_1 X_w^{(1)} & u_1 Y_w^{(1)} & u_1 Z_w^{(1)} & u_1 & -v_1 X_w^{(1)} & -v_1 Y_w^{(1)} & -v_1 Z_w^{(1)} & -v_1 \\ u_2 X_w^{(2)} & u_2 Y_w^{(2)} & u_2 Z_w^{(2)} & u_2 & -v_2 X_w^{(2)} & -v_2 Y_w^{(2)} & -v_2 Z_w^{(2)} & -v_2 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ u_N X_w^{(N)} & u_N Y_w^{(N)} & u_N Z_w^{(N)} & u_N & -v_N X_w^{(N)} & -v_N Y_w^{(N)} & -v_N Z_w^{(N)} & -v_N \end{bmatrix} \qquad (7.15)$$

It is shown [7] that if N ≥ 7 and the points are not coplanar, then the matrix A has rank 7 and there exists a nontrivial solution, which is the eigenvector corresponding to the zero eigenvalue of A^T A. In essence, the system can be solved by factorization of the matrix A with the SVD approach (Singular Value Decomposition), that is, A = UΣV^T, where we recall that the diagonal elements of Σ are the singular values of A; the nontrivial solution ν of the system is proportional to the column vector of V corresponding to the smallest singular value of A. This solution corresponds to the last column vector of V, which we can indicate with ν̄ and which, up to a proportionality factor κ, can be taken as the solution ν of the system, given by

$$\nu = \kappa \bar{\nu} \quad \text{or} \quad \bar{\nu} = \gamma \nu \qquad (7.16)$$

where γ = 1/κ.
According to the symbols reported in (7.12), using the last relation of (7.16)
assuming that the solution is given by the eigenvector ν̄, we have

(ν̄1 , ν̄2 , ν̄3 , ν̄4 , ν̄5 , ν̄6 , ν̄7 , ν̄8 ) = γ (r21 , r22 , r23 , Ty , αr11 , αr12 , αr13 , αTx ) (7.17)

Now let’s see how to evaluate the various parameters involved in the previous equa-
tion.

Calculation of α and γ. These parameters can be calculated considering that the rotation matrix is, by definition, an orthogonal matrix;1 therefore, according to (7.17), we can set the following relations:

$$\|(\bar{\nu}_1, \bar{\nu}_2, \bar{\nu}_3)\|_2 = \sqrt{\bar{\nu}_1^2 + \bar{\nu}_2^2 + \bar{\nu}_3^2} = \sqrt{\gamma^2 \underbrace{\left(r_{21}^2 + r_{22}^2 + r_{23}^2\right)}_{=1}} = |\gamma| \qquad (7.18)$$

$$\|(\bar{\nu}_5, \bar{\nu}_6, \bar{\nu}_7)\|_2 = \sqrt{\bar{\nu}_5^2 + \bar{\nu}_6^2 + \bar{\nu}_7^2} = \sqrt{\gamma^2 \alpha^2 \underbrace{\left(r_{11}^2 + r_{12}^2 + r_{13}^2\right)}_{=1}} = \alpha |\gamma| \qquad (7.19)$$

1 By definition, an orthogonal matrix R has the following properties:

1. R is normalized, that is, the sum of the squares of the elements of any row or column is 1 ($\sum_j r_{ij}^2 = 1$, i = 1, 2, 3).
2. The inner product of any pair of rows or columns is zero (r1T r3 = r2T r3 = r2T r1 = 0).
3. From the first two properties follows the orthonormality of R, or R R T = R T R = I, its
invertibility, that is, R −1 = R T (its inverse coincides with the transpose), and the determinant
of R has module 1 (det (R) = 1), where I indicates the identity matrix.
4. The rows of R represent the coordinates in the space of origin of unit vectors along the
coordinate axes of the rotated space.
5. The columns of R represent the coordinates in the rotated space of unit vectors along the
axes of the space of origin. In essence, they represent the directional cosines of the triad axes
rotated with respect to the triad of origin.
6. Each row and each column are orthonormal to each other, as they are orthonormal bases of
space. It, therefore, follows that given two vectors row or column of the matrix r 1 and r 2 it
is possible to determine the third base vector as a vector product given by r 3 = r 1 × r 2 .


The scale factor γ is determined up to the sign with (7.18), while by definition the aspect ratio α > 0 is calculated with (7.19). Explicitly, α can be determined as α = ‖(ν̄5, ν̄6, ν̄7)‖2 / ‖(ν̄1, ν̄2, ν̄3)‖2.
At this point, knowing α and the module |γ|, and knowing the solution vector ν̄, we can calculate the rest of the parameters, although without the sign, since the sign of γ is not known.
Calculation of Tx and Ty. Always considering the estimated vector ν̄ in (7.17), we have

$$T_x = \frac{\bar{\nu}_8}{\alpha |\gamma|} \qquad T_y = \frac{\bar{\nu}_4}{|\gamma|} \qquad (7.20)$$

determined, again, up to the sign.


Calculation of the rotation matrix R. The elements of the first two rows of the rotation matrix are given by

$$(r_{11}, r_{12}, r_{13}) = \frac{(\bar{\nu}_5, \bar{\nu}_6, \bar{\nu}_7)}{|\gamma|\,\alpha} \qquad (r_{21}, r_{22}, r_{23}) = \frac{(\bar{\nu}_1, \bar{\nu}_2, \bar{\nu}_3)}{|\gamma|} \qquad (7.21)$$

determined, again, up to the sign. The elements of the third row r3, which we recall are not included in system (7.14), are obtained, thanks to the orthonormality of the matrix R, as the cross product of the vectors r1 = (r11, r12, r13) and r2 = (r21, r22, r23) associated with the first two rows of R calculated with the previous equations:

$$r_3 = r_1 \times r_2 = \begin{bmatrix} r_{31} \\ r_{32} \\ r_{33} \end{bmatrix} = \begin{bmatrix} r_{12} r_{23} - r_{13} r_{22} \\ r_{13} r_{21} - r_{11} r_{23} \\ r_{11} r_{22} - r_{12} r_{21} \end{bmatrix} \qquad (7.22)$$

Verification of the orthogonality of the rotation matrix R. Recall that the elements of R have been calculated starting from the estimated eigenvector ν̄, a possible solution of system (7.14) obtained with the SVD approach. Therefore, it is necessary to verify whether the computed matrix R satisfies the properties of an orthogonal matrix, whose rows and columns constitute orthonormal bases, that is, $\hat{R}^T \hat{R} = I$, with R̂ denoting the estimate of R. The orthogonality of R̂ can be imposed again using the SVD approach, by computing $\hat{R} = U I V^T$, where the diagonal matrix Σ of the singular values is replaced with the identity matrix I.
Determine the sign of γ and calculate Tz, αu and αv. The sign of γ is determined by checking the sign of u and X, and the sign of v and Y (see Eq. 7.10), which must be consistent with the geometric configuration of the camera and calibration platform (see Fig. 6.39). Observing Eq. 7.10, both members are positive. If we substitute in these equations the coordinates (u, v) of the image plane and the coordinates (X, Y, Z) of the camera reference system of a generic point used for the calibration, we can analyze the sign of Tx, Ty and the sign of the elements of r1 and r2. In fact, we know that αu and αv are positive, as is Z (that is, the denominator of Eq. 7.10), considering the origin of the camera reference system and the position of the camera itself (see Fig. 6.39). Therefore, there must be concordance of sign between the first member and the numerator of the two equations, that is, the following conditions must hold:

$$u\,(r_{11} X_w + r_{12} Y_w + r_{13} Z_w + T_x) > 0$$
$$v\,(r_{21} X_w + r_{22} Y_w + r_{23} Z_w + T_y) > 0 \qquad (7.23)$$

If these conditions are satisfied, the sign of Tx, Ty is positive and the elements of r1 and r2 are left unchanged together with the translation values Tx, Ty. Otherwise the signs of these parameters are reversed.
Having determined the parameters R, Tx, Ty, α, it remains to estimate Tz and αu (recall from the definition of the aspect ratio α that αu = α αv). Let us reconsider the first equation of (7.10) and rewrite it in the following form:

$$u\,(r_{31} X_w + r_{32} Y_w + r_{33} Z_w + T_z) = \alpha_u\,(r_{11} X_w + r_{12} Y_w + r_{13} Z_w + T_x) \qquad (7.24)$$

where Tz and αu are the unknowns. We can then determine Tz and αu by setting up and solving a system of equations, similar to the system (7.14), obtained by considering again the N calibration points, as follows:

$$M \begin{bmatrix} T_z \\ \alpha_u \end{bmatrix} = b \qquad (7.25)$$

where

$$M = \begin{bmatrix} u_1 & -(r_{11} X_w^{(1)} + r_{12} Y_w^{(1)} + r_{13} Z_w^{(1)} + T_x) \\ u_2 & -(r_{11} X_w^{(2)} + r_{12} Y_w^{(2)} + r_{13} Z_w^{(2)} + T_x) \\ \cdots & \cdots \\ u_N & -(r_{11} X_w^{(N)} + r_{12} Y_w^{(N)} + r_{13} Z_w^{(N)} + T_x) \end{bmatrix} \qquad (7.26)$$

$$b = \begin{bmatrix} -u_1 (r_{31} X_w^{(1)} + r_{32} Y_w^{(1)} + r_{33} Z_w^{(1)}) \\ -u_2 (r_{31} X_w^{(2)} + r_{32} Y_w^{(2)} + r_{33} Z_w^{(2)}) \\ \cdots \\ -u_N (r_{31} X_w^{(N)} + r_{32} Y_w^{(N)} + r_{33} Z_w^{(N)}) \end{bmatrix} \qquad (7.27)$$

The solution of this overdetermined linear system is possible with the SVD or with the Moore–Penrose pseudo-inverse, obtaining

$$\begin{bmatrix} T_z \\ \alpha_u \end{bmatrix} = (M^T M)^{-1} M^T b \qquad (7.28)$$

Knowing αu, we can calculate αv as αv = αu/α.



Fig. 7.3 Orthocenter method for calculating the optical center through the three vanishing points generated by the intersection of the edges of the same color

The coordinates of the principal point u0 and v0 can be calculated by virtue of the orthocenter theorem.2 If in the image plane a triangle is defined by three vanishing points generated by three groups of lines that are parallel within each group and mutually orthogonal across groups in 3D space (i.e., the vanishing points correspond to 3 orthogonal directions of the 3D world), then the principal point coincides with the orthocenter of the triangle. The same calibration platform (see Fig. 7.3) can be used to generate the three vanishing points using pairs of parallel lines present in the two orthogonal planes of the platform [2]. It is shown that from 3 finite vanishing points we can estimate the focal length f and the principal point (u0, v0).
The accuracy of the u0 and v0 coordinates improves if the principal point is computed by observing the calibration platform from multiple points of view and then averaging the results. Camera calibration based on vanishing points avoids, on the one hand, the problem of finding correspondences, but has the disadvantage of the possible presence of vanishing points at infinity and of the inaccuracy of their computation.

7.4.2 Estimation of the Perspective Projection Matrix

Another method to determine the parameters (intrinsic and extrinsic) of the camera is
based on the estimation of the perspective projection matrix P (always assuming the
pinhole projection model) defined by (7.2) which transform, with Eq. (7.1), 3D points
of the calibration platform (for example, a cube of known dimensions with faces
having a checkerboard pattern also of known size), expressed in world homogeneous
coordinates X̃ w , in image homogeneous coordinates ũ, the last expressed in pixels.
In particular, the coordinates (X i , Yi , Z i ) of the corners of the board are assumed to
be known and the coordinates (u i , vi ) in pixels of the same corners projected in the
image plane are automatically determined.
The relation, which relates the coordinates of the 3D points with the coordinates
of the relative projections in the 2D image plane, is given by Eq. (7.1) that rewritten
in extended matrix form is

2 Triangle orthocenter: located at the intersection of the altitudes of the triangle.



$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \qquad (7.29)$$

From (7.29), we can get three equations, but dividing the first two by the third equation, we have two equations, given by

$$u = \frac{p_{11} X_w + p_{12} Y_w + p_{13} Z_w + p_{14}}{p_{31} X_w + p_{32} Y_w + p_{33} Z_w + p_{34}} \qquad v = \frac{p_{21} X_w + p_{22} Y_w + p_{23} Z_w + p_{24}}{p_{31} X_w + p_{32} Y_w + p_{33} Z_w + p_{34}} \qquad (7.30)$$
from which we have 12 unknowns, which are precisely the elements of the matrix P. Applying these equations to the ith 3D → 2D correspondence, relative to a corner of the cube, and placing them in the following linear form, we obtain

$$p_{11} X_w^{(i)} + p_{12} Y_w^{(i)} + p_{13} Z_w^{(i)} + p_{14} - p_{31} u_i X_w^{(i)} - p_{32} u_i Y_w^{(i)} - p_{33} u_i Z_w^{(i)} - p_{34} u_i = 0$$
$$p_{21} X_w^{(i)} + p_{22} Y_w^{(i)} + p_{23} Z_w^{(i)} + p_{24} - p_{31} v_i X_w^{(i)} - p_{32} v_i Y_w^{(i)} - p_{33} v_i Z_w^{(i)} - p_{34} v_i = 0 \qquad (7.31)$$

We can rewrite them as a compact homogeneous linear system, considering that at least N = 6 points are needed to determine the 12 unknown elements of the matrix P:
Ap = 0 (7.32)

where

$$A = \begin{bmatrix} X_w^{(1)} & Y_w^{(1)} & Z_w^{(1)} & 1 & 0 & 0 & 0 & 0 & -u_1 X_w^{(1)} & -u_1 Y_w^{(1)} & -u_1 Z_w^{(1)} & -u_1 \\ 0 & 0 & 0 & 0 & X_w^{(1)} & Y_w^{(1)} & Z_w^{(1)} & 1 & -v_1 X_w^{(1)} & -v_1 Y_w^{(1)} & -v_1 Z_w^{(1)} & -v_1 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ X_w^{(N)} & Y_w^{(N)} & Z_w^{(N)} & 1 & 0 & 0 & 0 & 0 & -u_N X_w^{(N)} & -u_N Y_w^{(N)} & -u_N Z_w^{(N)} & -u_N \\ 0 & 0 & 0 & 0 & X_w^{(N)} & Y_w^{(N)} & Z_w^{(N)} & 1 & -v_N X_w^{(N)} & -v_N Y_w^{(N)} & -v_N Z_w^{(N)} & -v_N \end{bmatrix} \quad (7.33)$$
and

$$p = \begin{bmatrix} p_1 & p_2 & p_3 & p_4 \end{bmatrix}^T \qquad (7.34)$$

where p collects the 12 unknown elements of the matrix P (the solution of system (7.32)) and 0 represents the zero vector of length 2N.
If we use only N = 6 points, we have a homogeneous linear system with 12 equations and 12 unknowns to be determined. The points (Xi, Yi, Zi) are projected in the image plane onto (ui, vi) according to the pinhole projection model. From algebra, it is known that a homogeneous linear system admits the trivial solution p = 0, that is, the vector p lying in the null space of the matrix A, and we are not interested in this trivial solution. Alternatively, it is shown that (7.32) admits infinite solutions if and only if the rank of A is less than the number of unknowns.

In this context, the unknowns can be reduced to 11 if each element of p is divided by one of the elements themselves (for example, the element p34), thus obtaining only 11 unknowns (with p34 = 1). Therefore, rank(A) = 11 is less than the 12 unknowns and the homogeneous system admits infinite solutions. One of the possible nontrivial solutions, in the space of solutions of the homogeneous system, can be determined with the SVD method, where A = UΣV^T and the solution is the column vector of V corresponding to the zero singular value of the matrix A. In fact, a solution p is the eigenvector that corresponds to the smallest eigenvalue of A^T A, up to a proportionality factor. If we denote with p̄ the last column vector of V, which up to a proportionality factor κ can be a solution p of the homogeneous system, we have

$$p = \kappa \bar{p} \quad \text{or} \quad \bar{p} = \gamma p \qquad (7.35)$$

where γ = 1/κ. Given a solution p̄ of the projection matrix P, even if only up to γ, we can now find an estimate of the intrinsic and extrinsic parameters using the elements of p̄. For this purpose, two decomposition approaches of the matrix P are presented, one based on the equation of the perspective projection matrix and the other on the QR factorization.

7.4.2.1 From the Matrix P to the Intrinsic and Extrinsic Parameters


Now let’s rewrite in an extended way Eq. (6.212) of the perspective projection matrix
by expliciting the parameters:
⎡ ⎤
αu r11 + u 0 r31 αu r12 + u 0 r32 αu r13 + u 0 r33 αu Tx + u 0 Tz
P = K [R T ] = αv r21 + v0 r31 αv r22 + v0 r32 αv r23 + v0 r33 αv Ty + v0 Tz ⎦
⎣ (7.36)
r31 r32 r33 Tz

where we recall that K is the triangular calibration matrix of the camera given by (6.208), R is the rotation matrix and T the translation vector.
Before deriving the calibration parameters, considering that the solution obtained is defined up to a scale factor, it is better to normalize p̄, also to avoid a trivial solution of the type p̄ = 0. Normalizing by setting p34 = 1 can cause a singularity if the value of p34 is close to zero. The alternative is to normalize by imposing the unit-norm constraint on the vector r3, that is, $\bar{p}_{31}^2 + \bar{p}_{32}^2 + \bar{p}_{33}^2 = 1$, which eliminates the singularity problem [5]. In analogy with what was done in the previous paragraph with the direct calibration method (see Eq. 7.18), we will use γ to normalize p̄. Observing from (7.36) that the first three elements of the third row of P correspond to the vector r3 of the rotation matrix R, we have

$$\sqrt{\bar{p}_{31}^2 + \bar{p}_{32}^2 + \bar{p}_{33}^2} = |\gamma| \qquad (7.37)$$

At this point, dividing the solution vector p̄ by |γ|, it is normalized up to the sign. For simplicity, we define the following intermediate vectors q1, q2, q3, q4 on the solution matrix found, as follows:

$$\bar{p} = \left[\begin{array}{ccc|c} \bar{p}_{11} & \bar{p}_{12} & \bar{p}_{13} & \bar{p}_{14} \\ \bar{p}_{21} & \bar{p}_{22} & \bar{p}_{23} & \bar{p}_{24} \\ \bar{p}_{31} & \bar{p}_{32} & \bar{p}_{33} & \bar{p}_{34} \end{array}\right] = \left[\begin{array}{c|c} q_1^T & \\ q_2^T & q_4 \\ q_3^T & \end{array}\right] \qquad (7.38)$$

where $q_i^T = (\bar{p}_{i1}, \bar{p}_{i2}, \bar{p}_{i3})$, i = 1, 2, 3, collect the first three elements of the rows of p̄ and $q_4 = (\bar{p}_{14}, \bar{p}_{24}, \bar{p}_{34})^T$ is its last column.

We can now determine all the intrinsic and extrinsic parameters by analyzing the elements of the projection matrix given by (7.36) and those of the estimated parameters given by (7.38), expressed through the intermediate vectors qi. The matrix p̄ is an approximation of the matrix P. Up to the sign, we can get Tz (the term p̄34 of the matrix 7.36) and the elements r3i of the third row of R, associating them, respectively, with the corresponding term p̄34 of (7.38) (i.e., the third element of q4) and with the elements p̄3i:

$$T_z = \pm \bar{p}_{34} \qquad r_{3i} = \pm \bar{p}_{3i}, \quad i = 1, 2, 3 \qquad (7.39)$$

The sign is determined by examining the equality Tz = ±p̄34 and by knowing whether the origin of the world reference system is positioned in front of or behind the camera. If it is in front, Tz > 0 and the sign must agree with that of p̄34; otherwise Tz < 0 and the sign must be opposite to that of p̄34.
According to the properties of the rotation matrix (see Note 1), the other parameters are calculated as follows:

$$u_0 = q_1^T q_3 \qquad v_0 = q_2^T q_3 \qquad (\text{inner products}) \qquad (7.40)$$

$$\alpha_u = \sqrt{q_1^T q_1 - u_0^2} \qquad \alpha_v = \sqrt{q_2^T q_2 - v_0^2} \qquad (7.41)$$

and assuming the positive sign (with Tz > 0) we have

$$r_{1i} = (\bar{p}_{1i} - u_0 \bar{p}_{3i})/\alpha_u \qquad r_{2i} = (\bar{p}_{2i} - v_0 \bar{p}_{3i})/\alpha_v$$
$$T_x = (\bar{p}_{14} - u_0 T_z)/\alpha_u \qquad T_y = (\bar{p}_{24} - v_0 T_z)/\alpha_v \qquad (7.42)$$

where i = 1, 2, 3.
It should be noted that the skew parameter s has not been determined, having assumed the rectangularity of the sensor pixel area; it can be determined with other nonlinear methods. The level of accuracy of the computed P depends heavily on the noise present in the starting data, that is, on the projections (Xi, Yi, Zi) → (ui, vi). This can be verified by checking that the rotation matrix R maintains the orthogonality constraint with det(R) = 1.
An alternative approach, to recover the intrinsic and extrinsic parameters from the
estimated projection matrix P, is based on its decomposition into two submatrices
B and b, where the first is obtained by considering the first 3 × 3 elements of P,

while the second represents the last column of P. Therefore, we have the following decomposition:

$$P \equiv \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix} = \begin{bmatrix} B & b \end{bmatrix} \qquad (7.43)$$

From (7.2), we have P decomposed in terms of the intrinsic parameters K and the extrinsic parameters R and T:

$$P = K \begin{bmatrix} R & T \end{bmatrix} = K \begin{bmatrix} R & -RC \end{bmatrix} \qquad (7.44)$$

so, by virtue of (7.43), we can write the following decompositions:

$$B = K R \qquad b = K T \qquad (7.45)$$

It is pointed out that T = −RC expresses the translation of the origin of the world reference system to the camera system, that is, the position of the origin of the world system in camera coordinates. According to the decomposition (7.43), considering the first equation of (7.45) and the intrinsic parameters defined by the matrix K (see Eq. 6.208), we have

$$B' \equiv B B^T = K R R^T K^T = K K^T = \begin{bmatrix} \alpha_u^2 + s^2 + u_0^2 & s \alpha_v + u_0 v_0 & u_0 \\ s \alpha_v + u_0 v_0 & \alpha_v^2 + v_0^2 & v_0 \\ u_0 & v_0 & 1 \end{bmatrix} \qquad (7.46)$$

where $R R^T = I$ is motivated by the orthogonality of the rotation matrix R. We also know that P is defined up to a scale factor, so its last element is normally not 1. It is therefore necessary to normalize the matrix B′ obtained with (7.46) so that the last element B′33 is 1. Next, we can derive the intrinsic parameters as follows:

$$u_0 = B'_{13} \qquad v_0 = B'_{23} \qquad (7.47)$$

$$\alpha_v = \sqrt{B'_{22} - v_0^2} \qquad s = \frac{B'_{12} - u_0 v_0}{\alpha_v} \qquad (7.48)$$

$$\alpha_u = \sqrt{B'_{11} - u_0^2 - s^2} \qquad (7.49)$$

The previous assignments are valid if αu > 0 and αv > 0.


The extrinsic parameters, i.e., the rotation matrix R and the translation vector T ,
once the intrinsic parametric parameters are known (i.e., the matrix K known with
the previous assignments), we can calculate them with Eq. (7.45), as follows:

R = K −1 B T = K −1 b (7.50)

7.4.2.2 Intrinsic and Extrinsic Parameters with the QR Decomposition of the Matrix P
From linear algebra, we define the QR decomposition or QR factorization of a square matrix A as the product A = QR, where Q is an orthogonal matrix (i.e., its columns are orthogonal unit vectors, so that Q^T Q = I) and R is an upper triangular matrix [8]. If A is invertible, then the decomposition is unique if the diagonal elements of R are positive. To avoid confusion, we immediately point out that the matrix R appearing in the QR decomposition definition is not the rotation matrix R considered so far in the text.
In this context, we want to use the QR decomposition to find K , R and T from
the projection matrix P given by Eq. (7.44). For this purpose, consider the matrix B
defined in (7.43) which in fact is the submatrix of P consisting of the 3 × 3 elements
in the upper left, which by (7.45) we know corresponds to B = KR. Therefore, executing the QR decomposition of the submatrix B^{-1}, we have

B −1 = Q L (7.51)

where by definition Q is the orthogonal matrix and L the upper triangular matrix.
From (7.51), it is immediate to derive the following:

B = L −1 Q −1 = L −1 Q T (7.52)

which shows that K = L −1 and R = Q T . After the decomposition the matrix K


might not have the last element K 33 = 1 as expected by Eq. 6.208. Therefore, it is
necessary to normalize K (and also P) by dividing all the elements by K 33 . However,
it is noted that this does not change the validity of the calibration parameters because
the scale is arbitrary. Subsequently, with the second equation of (7.45) we calculate
the translation vector T given by

T = K −1 b (7.53)

remembering that b is the last column of the matrix P.


Alternatively, the RQ decomposition could be used, which decomposes a matrix A into the product A = RQ, where R is an upper triangular matrix and Q is an orthogonal matrix. The difference from the QR decomposition consists only in the order of the two matrices.

7.4.2.3 Nonlinear Estimation of the Perspective Projection Matrix


In the previous section, we have estimated the calibration matrix P using the least
squares approach, which in fact minimizes an algebraic distance $\|Ap\|_2^2$, subject to the constraint $\|p\|_2 = 1$, where we recall that p is the column vector of the 12 elements of P. Basically, it does not take into account the physical meaning of the camera
calibration parameters. An approach closer to the physical reality of the parameters
is based on the maximum likelihood estimation that directly minimizes an error
function obtained as the sum of the distance between points in the image plane of

which the position (u i , vi ) is known and their predicted position from the perspective
projection equation (7.30). This error function is given by
$$\min_{p} \sum_{i=1}^{N} \left[ \left( u_i - \frac{p_{11} X_w^{(i)} + p_{12} Y_w^{(i)} + p_{13} Z_w^{(i)} + p_{14}}{p_{31} X_w^{(i)} + p_{32} Y_w^{(i)} + p_{33} Z_w^{(i)} + p_{34}} \right)^2 + \left( v_i - \frac{p_{21} X_w^{(i)} + p_{22} Y_w^{(i)} + p_{23} Z_w^{(i)} + p_{24}}{p_{31} X_w^{(i)} + p_{32} Y_w^{(i)} + p_{33} Z_w^{(i)} + p_{34}} \right)^2 \right] \qquad (7.54)$$
where N is the number of correspondences (X w , Yw , Z w ) → (u, v) which are
assumed to be affected by independent and identically distributed (iid) random noise.
The error function (7.54) is nonlinear and can be minimized using the Levenberg–
Marquardt minimization algorithm. This algorithm is applied starting from initial
values of p calculated with the linear least squares approach described in Sect. 7.4.2.

7.4.3 Zhang Method

This method [6,9] uses a planar calibration platform, that is, a flat chessboard (see Fig. 7.2a), observed from different points of view or, keeping the camera position fixed, by changing the position and attitude of the chessboard. The 3D points of the chessboard, whose geometry is known, are automatically localized in the image plane (with well-known corner detection algorithms, for example Harris), and the correspondences (Xw, Yw, 0) → (u, v) are detected. Without loss of generality, the world 3D reference system is chosen so that the chessboard plane lies at Zw = 0. Therefore, all the 3D points lying in the chessboard plane have the third coordinate Zw = 0. If we denote by r_i the columns of the rotation matrix R, we can rewrite the projection relation (7.1) of the correspondences in the form:
$$\tilde{u} = \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} r_1 & r_2 & r_3 & T \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ 0 \\ 1 \end{bmatrix} = \underbrace{K \begin{bmatrix} r_1 & r_2 & T \end{bmatrix}}_{homography\ H} \begin{bmatrix} X_w \\ Y_w \\ 1 \end{bmatrix} = H \tilde{X}_w \qquad (7.55)$$

from which it emerges that the third column of R (the matrix of the extrinsic parameters) drops out, and the homogeneous coordinates ũ in the image plane and the corresponding 2D coordinates X̃w = (Xw, Yw, 1) on the chessboard plane are related by the homography matrix H of size 3 × 3, up to a scale factor λ:

$$\lambda \tilde{u} = H \tilde{X}_w \qquad (7.56)$$

with

$$H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = K \begin{bmatrix} r_1 & r_2 & T \end{bmatrix} \qquad (7.57)$$

Equation (7.56) represents the homography transformation already introduced in


Sect. 3.5 Vol.II, also known as projective transformation or geometric collineation
between planes, while the homography matrix H can be considered the equivalent
of the perspective projection matrix P, but valid for planar objects.

7.4.3.1 Calculation of the Homography Matrix


Now let’s rewrite Eq. (7.56) of homography transformation in the extended matrix
form: ⎡ ⎤ ⎡ ⎤⎡ ⎤
u h 11 h 12 h 13 Xw
λ ⎣ v ⎦ = ⎣h 21 h 22 h 23 ⎦ ⎣ Yw ⎦ (7.58)
1 h 31 h 32 h 33 1

From (7.58), we can get three equations, but dividing the first two by the third equation, we have two nonlinear equations in the 9 unknowns, which are precisely the elements of the homography matrix H, given by3:

$$u = \frac{h_{11} X_w + h_{12} Y_w + h_{13}}{h_{31} X_w + h_{32} Y_w + h_{33}} \qquad v = \frac{h_{21} X_w + h_{22} Y_w + h_{23}}{h_{31} X_w + h_{32} Y_w + h_{33}} \qquad (7.59)$$
with (u, v) expressed in nonhomogeneous coordinates. By applying these last equa-
(i) (i)
tions to the ith correspondence (X w , Yw , 0) → (u i , vi ), related to the corners of
the chessboard, we can rewrite them in the linear form, as follows:

h 11 X w(i) + h 12 Yw(i) + h 13 − h 31 u i X w(i) − h 32 u i Yw(i) − h 33 u i = 0


(7.60)
h 21 X w(i) + h 22 Yw(i) + h 23 − h 31 vi X w(i) − h 32 vi Yw(i) − h 33 vi = 0

These linear constraints, applied to at least N = 4 corresponding points, generate a homogeneous linear system with 2N equations to determine the 9 unknown elements of the homography matrix H, which is defined up to a scale factor, so the independent parameters are only 8. Thus we obtain the following homogeneous linear system:

$$A H = 0 \qquad (7.61)$$

where the matrix A of size 2N × 9 is:

$$A = \begin{bmatrix} X_w^{(1)} & Y_w^{(1)} & 1 & 0 & 0 & 0 & -u_1 X_w^{(1)} & -u_1 Y_w^{(1)} & -u_1 \\ 0 & 0 & 0 & X_w^{(1)} & Y_w^{(1)} & 1 & -v_1 X_w^{(1)} & -v_1 Y_w^{(1)} & -v_1 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ X_w^{(N)} & Y_w^{(N)} & 1 & 0 & 0 & 0 & -u_N X_w^{(N)} & -u_N Y_w^{(N)} & -u_N \\ 0 & 0 & 0 & X_w^{(N)} & Y_w^{(N)} & 1 & -v_N X_w^{(N)} & -v_N Y_w^{(N)} & -v_N \end{bmatrix} \quad (7.62)$$

and

$$H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix}^T \qquad (7.63)$$

3 The coordinates (u, v) of the points in the image plane in Eq. (7.58) are expressed in homogeneous coordinates, while in the nonlinear equations (7.59) they are nonhomogeneous (u/λ, v/λ), but for simplicity they keep the same notation. Once H is calculated, from the third equation obtainable from (7.58) we can determine λ = h31 Xw + h32 Yw + h33.

with hi , i = 1, 3 representing the rows of the solution matrix H of the system (7.61)
and 0 represents the zero vector of length 2N .
The homogeneous system (7.61) can be solved with the SVD approach, which decomposes the data matrix (built from at least N known correspondences) into the product of three matrices, A = UΣV^T, and the solution is the column vector of V corresponding to the zero singular value of the matrix A. In reality, a solution H is the eigenvector that corresponds to the smallest eigenvalue of A^T A, up to a proportionality factor. Therefore, if we denote by h̄ the last column vector of V, it can be a solution of the homogeneous system up to a factor of proportionality. If the coordinates of the corresponding points are exact, the homography transformation is error-free and the singular value found is zero. Normally this does not happen because of the noise present in the data, especially in the case of overdetermined systems with N > 4; in this case, the smallest singular value is always chosen, seen as the optimal solution of the system (7.61) in the least squares sense (i.e., $\|A \cdot H\|^2 \rightarrow$ minimum).
The system (7.58) can also be solved with the constraint on the scale factor h33 = 1; in this case, the unknowns are 8 and we obtain an ordinary linear system of nonhomogeneous equations expressed in the form:

$$A H = b \qquad (7.64)$$

where the matrix A of size 2N × 8 is

$$A = \begin{bmatrix} X_w^{(1)} & Y_w^{(1)} & 1 & 0 & 0 & 0 & -u_1 X_w^{(1)} & -u_1 Y_w^{(1)} \\ 0 & 0 & 0 & X_w^{(1)} & Y_w^{(1)} & 1 & -v_1 X_w^{(1)} & -v_1 Y_w^{(1)} \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ X_w^{(N)} & Y_w^{(N)} & 1 & 0 & 0 & 0 & -u_N X_w^{(N)} & -u_N Y_w^{(N)} \\ 0 & 0 & 0 & X_w^{(N)} & Y_w^{(N)} & 1 & -v_N X_w^{(N)} & -v_N Y_w^{(N)} \end{bmatrix} \quad (7.65)$$

with the unknown vector:

$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} & h_{21} & h_{22} & h_{23} & h_{31} & h_{32} \end{bmatrix}^T \qquad (7.66)$$

and

$$b = \begin{bmatrix} u_1 & v_1 & \cdots & u_N & v_N \end{bmatrix}^T \qquad (7.67)$$

where at least 4 corresponding points are always required to determine the homography matrix H. The accuracy of the homography transformation can improve with N > 4. In the latter case, we would have an overdetermined system, solvable with the least squares approach (i.e., minimizing $\|A \cdot H - b\|^2$) or with the pseudo-inverse method.
The computation of H done with the preceding linear systems minimizes an algebraic error [10], which is therefore not associable with a physical concept of geometric distance. In the presence of errors in the coordinate measurements of the image points of the correspondences (Xw, Yw, 0) → (u, v), assumed to be affected by Gaussian noise, the

optimal estimate of H is the maximum likelihood estimation which minimizes the


distance between the measured image points and their position predicted by the
homography transformation (7.58). Therefore, the geometric error function to be
minimized is given by

$$\min_{H} \sum_{i=1}^{N} \| \tilde{u}_i - H \tilde{X}_i \|^2 \qquad (7.68)$$

where N is the number of matches, ũi are the points in the image plane affected by
noise, and X̃ i are the points on the calibration chessboard assumed accurate. This
function is nonlinear and can be minimized with an iterative method like that of
Levenberg–Marquardt.

7.4.3.2 From Homography Matrices to Intrinsic Parameters


In the previous section, we calculated the homography matrix H assuming that the N chessboard calibration points X̃w(i) lie in the plane Xw − Yw of the world reference system, at Zw = 0, with the projection model given by (7.56) and the matrix H defined by (7.57). With the assumption Zw = 0, the M observations of the chessboard are made by moving the camera. This implies that the corresponding 3D points will have projections with different coordinates in the image plane of the different views; this is not a problem for the calibration, because the homography transformation involves the relative coordinates in the image plane, that is, the homography transform is independent of the projection reference system. At this point, we assume we have calculated the homography matrices H_j, j = 1, . . . , M independently for the corresponding M observations of the calibration chessboard.
In analogy with the perspective projection matrix P, the homography matrices also capture the information associated with the camera intrinsic parameters and with the extrinsic ones that vary for each observation. Given the set of computed homography matrices, let us now see how to derive the intrinsic parameters from them. For each observed calibration image, we rewrite Eq. (7.57), which relates the homography matrix H to the intrinsic parameters K (seen as the matrix of the inner camera transformation) and the extrinsic parameters R, T (the view-related external transformation), given by

H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = \lambda K \begin{bmatrix} r_1 & r_2 & T \end{bmatrix}   (7.69)

where λ is an arbitrary nonzero scale factor. From (7.69), we can get the relations
that link the column vectors of R as follows:

h1 = λK r1 (7.70)
h2 = λK r2 (7.71)

from which ignoring the factor λ (not useful in this context) we have

r_1 = K^{-1} h_1   (7.72)
r_2 = K^{-1} h_2   (7.73)

We know that the column vectors r_1 and r_2 are orthonormal by virtue of the properties of the rotation matrix R (see Note 1); applying this to the previous equations, we get

h_1^T K^{-T} K^{-1} h_2 = 0 \qquad (r_1^T r_2 = 0)   (7.74)

h_1^T K^{-T} K^{-1} h_1 = h_2^T K^{-T} K^{-1} h_2 \qquad (r_1^T r_1 = r_2^T r_2 = 1)   (7.75)

which are the two relations to which the intrinsic unknown parameters associated
with a homography are constrained. We now observe that the matrix of the intrinsic
unknown parameters K is upper triangular and Zhang defines a new matrix B,
according to the last two constraint equations found, given by
B = K^{-T} K^{-1} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}   (7.76)

Considering the upper triangularity of K, the matrix B is symmetric, with only 6 unknowns that we group into a vector b with 6 elements:

b = \begin{bmatrix} b_{11} & b_{12} & b_{22} & b_{13} & b_{23} & b_{33} \end{bmatrix}^T   (7.77)

Substituting in (7.76) the elements of K defined by (6.208), we obtain the matrix B made explicit in terms of the intrinsic parameters, as follows:
B = K^{-T} K^{-1} =
\begin{bmatrix}
\dfrac{1}{\alpha_u^2} & -\dfrac{s}{\alpha_u^2 \alpha_v} & \dfrac{v_0 s - u_0 \alpha_v}{\alpha_u^2 \alpha_v} \\[2mm]
-\dfrac{s}{\alpha_u^2 \alpha_v} & \dfrac{s^2}{\alpha_u^2 \alpha_v^2} + \dfrac{1}{\alpha_v^2} & -\dfrac{s (v_0 s - u_0 \alpha_v)}{\alpha_u^2 \alpha_v^2} - \dfrac{v_0}{\alpha_v^2} \\[2mm]
\dfrac{v_0 s - u_0 \alpha_v}{\alpha_u^2 \alpha_v} & -\dfrac{s (v_0 s - u_0 \alpha_v)}{\alpha_u^2 \alpha_v^2} - \dfrac{v_0}{\alpha_v^2} & \dfrac{(v_0 s - u_0 \alpha_v)^2}{\alpha_u^2 \alpha_v^2} + \dfrac{v_0^2}{\alpha_v^2} + 1
\end{bmatrix}   (7.78)

Now let’s rewrite the constraint equations (7.74) and (7.75), considering the matrix
B defined by (7.76), given by

h1T Bh2 = 0 (7.79)


h1T Bh1 − h2T Bh2 = 0 (7.80)

From these constraint equations, we can derive a relation that links the column vectors h_i = (h_{i1}, h_{i2}, h_{i3})^T, i = 1, 2, 3, of the homography matrix H, given by (7.69), with the unknown vector b, given by (7.77), obtaining


h_i^T B h_j = \begin{bmatrix} h_{i1} h_{j1} \\ h_{i1} h_{j2} + h_{i2} h_{j1} \\ h_{i2} h_{j2} \\ h_{i3} h_{j1} + h_{i1} h_{j3} \\ h_{i3} h_{j2} + h_{i2} h_{j3} \\ h_{i3} h_{j3} \end{bmatrix}^T \begin{bmatrix} b_{11} \\ b_{12} \\ b_{22} \\ b_{13} \\ b_{23} \\ b_{33} \end{bmatrix} = v_{ij}^T b   (7.81)

where vi j is the vector obtained from the calculated homography H and considering
both the two constraint equations (7.74) and (7.75) we can rewrite them in the form
of a homogeneous system in two equations, as follows:
\begin{bmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{bmatrix} b = \begin{bmatrix} 0 \\ 0 \end{bmatrix}   (7.82)

where b is the unknown vector. The information on the intrinsic parameters of the camera is captured by the M observed images of the chessboard, for which we have independently estimated the relative homographies H_k, k = 1, . . . , M. Every homography generates the 2 equations (7.82); since the unknown vector b has 6 elements, at least 3 different homography projections of the chessboard are necessary (M ≥ 3). Assembling the equations (7.82) into a homogeneous linear system of 2M equations, we obtain the following system:

Vb = 0 (7.83)

where V is the matrix built from the known homographies, of size 2M × 6, and b is the unknown vector. Also in this case we have an overdetermined system with M ≥ 3, which can be solved with the SVD method, obtaining a solution b up to a scale factor. Recall that the estimated solution of the system (7.83) corresponds to the rightmost column vector of the matrix V of the decomposition V = UΣV^T, associated with the smallest singular value, which is the minimum value of the diagonal elements of Σ.
Having calculated the vector b (and therefore the matrix B), it is possible to derive in closed form [6] the intrinsic parameters, that is, the matrix K. In fact, from (7.76) we have the relation that binds the matrices B and K up to the unknown scale factor λ (B = λK^{-T}K^{-1}). The intrinsic parameters obtainable in closed form, proposed by Zhang, are
by Zhang, are
v0 = (b12 b13 − b11 b23 )/(b11 b22 − b12
2
) λ = b33 − [b13
2
+ v0 (b12 b13 − b11 b23 )]/b11
 
αu = λ/b11 αv = λb11 /(b11 b22 − b122 )

s = −b12 αu2 αv /λ u 0 = sv0 /αu − b13 αu2 /λ


(7.84)
The calculation of K from B could also be formulated with the Cholesky factorization
[8].
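The closed-form procedure of this section can be sketched in a few lines of numpy (names are illustrative; the sketch assumes well-conditioned, low-noise homographies so that the square roots in (7.84) are real):

```python
import numpy as np

def v_ij(H, i, j):
    """Build the 6-vector v_ij of (7.81) from columns i, j of homography H (0-based)."""
    hi, hj = H[:, i], H[:, j]
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[1]*hj[1],
                     hi[2]*hj[0] + hi[0]*hj[2],
                     hi[2]*hj[1] + hi[1]*hj[2],
                     hi[2]*hj[2]])

def intrinsics_from_homographies(Hs):
    """Sketch of the closed-form solution: stack the constraints (7.82) from
    M >= 3 homographies, solve V b = 0 by SVD, and recover K via (7.84)."""
    V = []
    for H in Hs:
        V.append(v_ij(H, 0, 1))                 # first constraint (7.79)
        V.append(v_ij(H, 0, 0) - v_ij(H, 1, 1)) # second constraint (7.80)
    V = np.asarray(V)                           # 2M x 6
    b = np.linalg.svd(V)[2][-1]                 # null-space solution, up to scale/sign
    b11, b12, b22, b13, b23, b33 = b
    v0 = (b12*b13 - b11*b23) / (b11*b22 - b12**2)
    lam = b33 - (b13**2 + v0*(b12*b13 - b11*b23)) / b11
    au = np.sqrt(lam / b11)                     # the result is invariant to the sign of b
    av = np.sqrt(lam * b11 / (b11*b22 - b12**2))
    s = -b12 * au**2 * av / lam
    u0 = s*v0/av - b13*au**2/lam
    return np.array([[au, s, u0], [0.0, av, v0], [0.0, 0.0, 1.0]])
```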

7.4.3.3 Estimation of Extrinsic Parameters


Having calculated the intrinsic parameter matrix K, we can estimate the extrinsic parameters R and T for each observation k of the chessboard using the relative homography H_k. In fact, according to (7.57) we get

r_1 = \lambda K^{-1} h_1 \qquad r_2 = \lambda K^{-1} h_2 \qquad T = \lambda K^{-1} h_3   (7.85)

where, remembering that the column vectors of the rotation matrix have unit norm, the scale factor is

\lambda = \frac{1}{\| K^{-1} h_1 \|} = \frac{1}{\| K^{-1} h_2 \|}   (7.86)

Finally, for the orthonormality of R we have

r3 = r1 × r2 (7.87)

The extrinsic parameters are different for each homography because the points of view of the calibration chessboard are different. The computed rotation matrix R may not numerically satisfy the orthogonality properties of a rotation matrix due to the noise of the correspondences. In [8,9], there are techniques to approximate the calculated R with a true rotation matrix. One technique is based on the SVD decomposition, imposing the orthogonality R R^T = I by forcing the matrix Σ to the identity matrix:

\bar{R} = U \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} V^T   (7.88)
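A compact numpy sketch of this step (illustrative names), combining (7.85)–(7.88), could be:

```python
import numpy as np

def extrinsics_from_homography(K, H):
    """Sketch of (7.85)-(7.88): recover R, T of one view from its homography H
    and the intrinsic matrix K, then re-orthogonalize R via SVD."""
    Kinv = np.linalg.inv(K)
    lam = 1.0 / np.linalg.norm(Kinv @ H[:, 0])    # scale factor (7.86)
    r1 = lam * Kinv @ H[:, 0]
    r2 = lam * Kinv @ H[:, 1]
    r3 = np.cross(r1, r2)                         # orthonormality (7.87)
    T = lam * Kinv @ H[:, 2]
    R = np.column_stack([r1, r2, r3])
    U, _, Vt = np.linalg.svd(R)                   # force a true rotation matrix (7.88)
    return U @ Vt, T
```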

7.4.3.4 Estimation of Radial Distortions


The homography projections considered so far are assumed according to the pinhole
projection model. We know instead that the optical system introduces radial and
tangential distortions by altering the position of the 3D points in the image plane
(see Sect. 7.2). Now let’s see how to consider the effects only of radial distortions
(which are the most relevant) on the M observed homography projections.
Having estimated in the preceding paragraphs the intrinsic parameter matrix K, with the homography transformation (7.56) we obtain the ideal projection of the points in the image plane u = (u, v); of the same points we have the observations in the image plane, indicated with ū = (ū, v̄), which we assume affected by radial distortion, so that the displacement of the real coordinates of each point in the image plane is given by (ū − u). We know this radial distortion to be modeled by Eqs. (7.6) and (7.7) which, adapted to this context and rewritten in vector form for each point of the observed images, give

(u - u_0) \cdot D(r, k) = \bar{u} - u   (7.89)

where we recall that k is the vector of the coefficients of the nonlinear radial distortion function D(r, k), and r = ‖x − x_0‖ = ‖x‖ is the distance of the point x, associated with the projected point u, from the principal point x_0 = (0, 0); this distance is not expressed in pixels. In this equation, knowing the ideal coordinates of the projected points u and the observed distorted ones ū, the unknown to be determined is the vector k = (k_1, k_2) (approximating the nonlinear distortion function with only 2 coefficients). Rewriting in matrix form, for each point of each observed image, two equations are obtained:
    
\begin{bmatrix} (u - u_0)\, r^2 & (u - u_0)\, r^4 \\ (v - v_0)\, r^2 & (v - v_0)\, r^4 \end{bmatrix} \begin{bmatrix} k_1 \\ k_2 \end{bmatrix} = \begin{bmatrix} \bar{u} - u \\ \bar{v} - v \end{bmatrix}   (7.90)

With these equations, it is possible to set up a system of linear equations by assembling all the equations associated with the N points of the M observed images, obtaining a system of 2MN equations, which in compact matrix form is the following:
Dk = d (7.91)

The estimate of the vector k = (k_1, k_2), solution of this overdetermined system, can be obtained with the least squares approach using the Moore–Penrose pseudo-inverse, for which k = (D^T D)^{-1} D^T d, or with SVD or QR factorization methods.
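A minimal numpy sketch of this least squares estimate (names and array layout are assumptions), stacking the two equations (7.90) for all points of all images as in (7.91):

```python
import numpy as np

def estimate_radial_distortion(ideal_uv, observed_uv, u0, v0, r2):
    """Sketch of the least squares estimate of k = (k1, k2) from (7.90)-(7.91).
    ideal_uv, observed_uv: (M*N) x 2 arrays of ideal and observed pixel points;
    r2: squared radial distance of each point (computed in normalized,
    non-pixel coordinates, as discussed above)."""
    du = ideal_uv[:, 0] - u0
    dv = ideal_uv[:, 1] - v0
    D = np.zeros((2 * len(ideal_uv), 2))
    D[0::2, 0], D[0::2, 1] = du * r2, du * r2**2   # u-equations: (u-u0)r^2, (u-u0)r^4
    D[1::2, 0], D[1::2, 1] = dv * r2, dv * r2**2   # v-equations: (v-v0)r^2, (v-v0)r^4
    d = (observed_uv - ideal_uv).ravel()           # stacked (u_bar - u, v_bar - v)
    k, *_ = np.linalg.lstsq(D, d, rcond=None)      # pseudo-inverse solution
    return k                                       # (k1, k2)
```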

7.4.3.5 Nonlinear Optimization of Calibration Parameters and Radial


Distortion
The approaches used to estimate the calibration parameters (intrinsic and extrinsic) and the radial distortion are based on the minimization of an algebraic distance, which, as highlighted previously, has no physical significance. The homography projections of the N chessboard 3D points in the M images, affected by noise, are defined by the homography transformation (7.56) and by the radial distortion (7.89). If we indicate with ū_{ij} the observed image points and with u_{ij}(K, k, R_i, T_i, X_j) the projection of a point X_j in the ith homography image, we can refine all the previously obtained parameters through the maximum likelihood estimation (MLE) by minimizing the geometric error function, given by

\sum_{j=1}^{N} \sum_{i=1}^{M} \| \bar{u}_{ij} - u_{ij}(K, k, R_i, T_i, X_j) \|^2   (7.92)

This nonlinear least squares minimization problem can be solved by iterative methods such as the Levenberg–Marquardt algorithm. It is useful to start the iterative process with the estimates of the intrinsic parameters obtained in Sect. 7.4.3.2 and with the extrinsic parameters obtained in Sect. 7.4.3.3. The radial distortion coefficients can be initialized to zero or with the values estimated in the previous paragraph. It should be noted that the rotation matrix R has 9 elements despite having 3 degrees of freedom (that is, the three angles of rotation around the 3D axes). The Euler–Rodrigues method [11,12] is used in [6] to express a 3D rotation with only 3 parameters.

7.4.3.6 Summary of Zhang’s Autocalibration Method


This camera calibration method uses as a calibration platform a plane object, for
example, a chessboard, whose geometry and the positions of the identified patterns
(at least four) are observed from at least two different points of view. The essential
steps of the calibration procedure are

1. Acquisition of M images of the calibration platform observed from different


points of view: moving the camera with respect to the platform or vice versa or
moving both.
2. For each of the M images, N patterns are detected whose correct correspondence with the associated points on the chessboard (3D points) is known. For each 3D point on the chessboard the coordinates X = (X, Y, 0) are known (points lying in the same plane Z = 0). Of these points, the 2D coordinates ū = (ū, v̄) (expressed in pixels) in each homography image are also known.
3. Knowing the 3D ↔ 2D correspondences, the M relative homographies are calculated, one for each of the M images. The homographies {H_1, H_2, . . . , H_M} are independent of the coordinate reference system of the 2D and 3D points.
4. Knowing the homographies {H_1, H_2, . . . , H_M}, the intrinsic parameters of the camera are estimated, that is, the 5 elements of the intrinsic parameter matrix K, using the linear closed-form solution. In this context, the radial distortion introduced by the optical system in the homography projection of the calibration points is ignored. If there are at least 3 homography images, K is determined as the unique solution up to an indeterminate scale factor. With more homography images the estimated elements of K are more accurate. The camera calibration procedure can assume a zero sensor skew parameter (s = 0, considering the good level of accuracy of current sensors), and in this case the number of homography images can be only two. Once the homographies are known, for each 3D point on the chessboard we can calculate its homography projection (according to the pinhole model) and obtain the ideal coordinates u = (u, v), generally different from the observed ones ū = (ū, v̄). In fact, the latter are affected by noise, partly due to the uncertainty of the algorithms that detect the points in the image plane (inaccurate measurements of the coordinates of the 2D points detected in each homography image) and partly caused by the optical system.
5. Once the intrinsic parameters are known, it is possible to derive the extrinsic parameters, that is, the rotation matrix R and the translation vector T related to each of the M views, thus obtaining the corresponding attitude of the camera.
6. Estimate of the vector k of the coefficients of the nonlinear function that models the radial distortion introduced by the optical system.

7. Refinement of the accuracy of the intrinsic and extrinsic parameters and of the radial distortion coefficients initially estimated with least squares methods. Basically, starting from these initial parameters and coefficients, a nonlinear optimization procedure based on maximum likelihood estimation (MLE) is applied globally to all the parameters related to the M homography images and the N observed points (see the sketch below).
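As a practical usage example, OpenCV implements a planar-target calibration of this kind (Zhang-style, with Levenberg–Marquardt refinement and a Rodrigues parameterization of the rotations). The sketch below is illustrative; the chessboard size, square size, and file names are assumptions:

```python
import glob
import cv2
import numpy as np

# Hypothetical chessboard: 9x6 inner corners, 25 mm squares; file pattern is illustrative.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = 25.0 * np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # Z = 0 plane

obj_points, img_points, size = [], [], None
for path in glob.glob("calib_*.png"):                 # the M acquired views
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)                       # 3D chessboard points
        img_points.append(corners)                    # detected 2D corners (pixels)
        size = gray.shape[::-1]

assert obj_points, "no chessboard views found"
# Intrinsics K, distortion coefficients, and per-view extrinsics (Rodrigues vectors and T)
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, size, None, None)
print("RMS reprojection error:", rms)
```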

The camera calibration results, for the different methodologies used, are mainly influenced by the level of accuracy of the 3D calibration patterns (referenced with respect to the world reference system) and of the corresponding 2D patterns (referenced in the image plane reference system). The latter depend on the automatic pattern detection algorithms applied to the acquired images of the calibration platform. Another important aspect concerns how the pattern localization error is propagated to the camera calibration parameters. In general, the various calibration methods, at least theoretically, should produce the same results, but in reality they differ in the solutions adopted to minimize pattern localization and optical system errors. The propagation of errors is highlighted in particular when the configuration of the calibration images is modified (for example, the focal length varies), while the experimental configuration remains intact. In this situation, the extrinsic parameters do not remain stable. Similarly, the instability of the intrinsic ones occurs when the experimental configuration remains the same while only the translation of the calibration patterns varies.

7.4.4 Stereo Camera Calibration

In the previous paragraphs, we have described the methods for calibrating a single camera, that is, we have defined what the characteristic parameters are and how to determine them with respect to known 3D points of the scene, assuming the pinhole projection model. In particular, the following parameters have been described:
the intrinsic parameters that characterize the optical-sensor components defining the
camera intrinsic matrix K , and the extrinsic parameters, defining the rotation matrix
R and the translation vector T with respect to an arbitrary reference system of the
world that characterize the attitude of the camera with respect to the 3D scene.
In a stereo system (with at least two cameras), always considering the pinhole projection model, a 3D light spot of the scene is seen (projected) simultaneously in the image planes of the two cameras. While in monocular vision the 2D projection of a 3D point defines only the ray that passes through the optical center and the 2D intersection point with the image plane, in stereo vision the 3D point is uniquely determined by the intersection of the homologous rays that generate its 2D projections on the corresponding image planes of the left and right cameras (see Fig. 7.4).
Fig. 7.4). Therefore, once the calibration parameters of the individual cameras are
known, it is possible to characterize and determine the calibration parameters of a
stereo system and establish a unique relationship between a 3D point and its 2D
projections on the stereo images.

Fig. 7.4 Pinhole projection model in stereo vision

According to Fig. 7.4, we can model the projections of the stereo system from the
mathematical point of view as an extension of the monocular model seen as rigid
transformations (see Sect. 6.7) between the reference systems of the cameras and
the world. The figure shows the same nomenclature as monocular vision, with the addition of the subscripts L and R to indicate the parameters (optical centers, 2D and 3D reference systems, focal length, 2D projections, …) of the left and right camera, respectively.
If T is the column vector representing the translation between the two optical centers C_L and C_R (the origins of the reference systems of the two cameras) and R is the rotation matrix that orients the left camera axes to those of the right camera (or vice versa), then the coordinates of a world point P_w = (X, Y, Z) of 3D space, denoted by P_L = (X_{L_p}, Y_{L_p}, Z_{L_p}) and P_R = (X_{R_p}, Y_{R_p}, Z_{R_p}) in the reference systems of the two cameras, are related to each other by the following equations:

P R = R( P L − T ) (7.93)

P L = RT P R + T (7.94)

where R and T characterize the relationship between the left and right camera coor-
dinate systems, which is independent of the projection model of each camera. R and
T are essentially the extrinsic parameters that characterize the stereo system in the
pinhole projection model. Now let’s see how to derive the extrinsic parameters of
the stereo system R, T knowing the extrinsic parameters of the individual cameras.
Normally the cameras are individually calibrated considering known 3D points,
defined with respect to a world reference system. We indicate with
P w = (X w , Yw , Z w ) the coordinates in the world reference system, and with R L ,
T L and R R , T R the extrinsic parameters of the two cameras, respectively, the rota-
tion matrices and the translation column vectors. The relationships that project the
point P w in the image plane of the two cameras (according to the pinhole model),
in the respective reference systems, are the following:

P L = RL Pw + T L (7.95)

P R = RR Pw + T R (7.96)

We assume that the two cameras have been independently calibrated with one of the
methods described in Sect. 7.4, and therefore their intrinsic and extrinsic parameters
are known. The extrinsic parameters of the stereo system are obtained from Eqs.
(7.95) and (7.96) as follows:
P_L = R_L P_w + T_L = R_L \left[ R_R^{-1}(P_R - T_R) \right] + T_L = \underbrace{(R_L R_R^{-1})}_{R^T} P_R \; \underbrace{-\,(R_L R_R^{-1})\, T_R + T_L}_{T}   (7.97)

where the second equality uses (7.96) to express P_w, and the grouping follows from the comparison with (7.94),

from which we have

R^T = R_L R_R^{-1} \;\Longleftrightarrow\; R = R_R R_L^T   (7.98)

T = T_L - \underbrace{R_L R_R^{-1}}_{R^T}\, T_R = T_L - R^T T_R   (7.99)

where (7.98) and (7.99) define the extrinsic parameters (the rotation matrix R and the translation vector T) of the stereo system. At this point, the stereo system is completely calibrated and can be used for 3D reconstruction of the scene starting from 2D stereo projections. This can be done, for example, by triangulation as described in Sect. 7.5.7.
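A minimal numpy sketch of (7.98) and (7.99) (the function name is illustrative):

```python
import numpy as np

def stereo_extrinsics(R_L, T_L, R_R, T_R):
    """Sketch of (7.98)-(7.99): relative pose of the stereo rig from the
    extrinsics of the two cameras, expressed w.r.t. the same world frame."""
    Rt = R_L @ R_R.T              # R^T = R_L R_R^(-1), rotations are orthogonal
    R = Rt.T                      # R = R_R R_L^T
    T = T_L - Rt @ T_R            # T = T_L - R^T T_R
    return R, T                   # with these values, P_R = R (P_L - T) as in (7.93)
```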

7.5 Stereo Vision and Epipolar Geometry

In Sect. 4.6.8, we have already introduced epipolar geometry. This section describes how to use epipolar geometry to solve the problem of matching homologous points in a stereo vision system, both with calibrated and with non-calibrated cameras. In other words, with epipolar geometry we want to simplify the search for homologous points between the two stereo images.

Let us recall, with the help of Fig. 7.5a, how a point P in 3D space is acquired by a stereo system and projected (according to the pinhole model) in P_L in the left image plane and in P_R in the right image plane. Epipolar geometry establishes a relationship between the two corresponding projections P_L and P_R in the two stereo images acquired by the cameras (having optical centers C_L and C_R), which can have different intrinsic and extrinsic parameters.

Baseline is the line that joins the two optical centers and defines the inter-optical distance.
Epipolar Plane is the plane (which we will indicate from now on with π) that contains the baseline. A family of epipolar planes is generated, rotating around the baseline and passing through the 3D points of the considered scene (see Fig. 7.5b). For each point P, an epipolar plane is generated containing the three points {C_L, P, C_R}. An alternative geometric definition is to consider the 3D epipolar plane containing the projection P_L (or the projection P_R) together with the left and right optical centers C_L and C_R.
Epipole is the intersection point of the baseline with the image plane. The epipole can also be seen as the projection of the optical center of a camera on the image plane of the other camera. Therefore, we have two epipoles, indicated with e_L and e_R, respectively, for the left and right image. If the image planes are coplanar (with parallel optical axes) the epipoles are located at infinity in opposite directions (intersection at infinity between the baseline and the image planes, since they are parallel to each other). Furthermore, the epipolar lines are parallel to an axis of each image plane (see Fig. 7.6).
Epipolar Lines, indicated with l_L and l_R, are the intersections between an epipolar plane and the image planes. All the epipolar lines intersect in the relative epipoles (see Fig. 7.5b).

From Fig. 7.5 and from the properties described above, it can be seen that given a
point P of the 3D space, its projections PL and PR in the image planes, and the optical
centers C L and C R are in the epipolar plane π generated by the triad {C L , P, C R }. It
also follows that the rays drawn backwards from the PL and PR points intersecting
in P are coplanar to each other and lie in the same epipolar plane identified. This last
property is of fundamental importance in finding the correspondence of the projected points. In fact, if we know P_L, to search for the homologous P_R in the other image we have the constraint that the plane π is identified by the triad {C_L, P, C_R} (i.e., by the baseline and by the ray defined by P_L); consequently the ray corresponding to the point P_R must also lie in the plane π, and therefore P_R itself (unknown) must be on the line of intersection between the plane π and the plane of the second image.
This intersection line is just the right epipolar line lR that can be thought of as the
projection in the second image of the backprojected ray from PL . Essentially, lR is the
searched epipolar line corresponding to PL and we can indicate this correspondence
as follows:

Fig. 7.5 Epipolar geometry. a The baseline is the line joining the optical centers C_L and C_R and intersects the image planes in the respective epipoles e_L and e_R. Each plane passing through the baseline is an epipolar plane. The epipolar lines l_L and l_R are obtained from the intersection of an epipolar plane with the stereo image planes. b To each point P of the 3D space corresponds an epipolar plane that rotates around the baseline and intersects the relative pair of epipolar lines in the image planes. In each image plane, the epipolar lines intersect in the relative epipole

Fig. 7.6 Epipolar geometry for a stereo system with parallel optical axes. In this case, the image planes are coplanar and the epipoles are at infinity in opposite directions. The epipolar lines are parallel to the horizontal axis of each image plane

PL → lR

which establishes a dual relationship between a point in an image and the associated line in the other stereo image. For a binocular stereo system, with the epipolar geometry, that is, once the epipoles and the epipolar lines are known, it is possible to restrict the possible correspondences between the points of the two images by searching the homologue of P_L only on the corresponding epipolar line l_R in the other image and not over the entire image (see Fig. 7.7). This process must be repeated for each 3D point of the scene.

7.5.1 The Essential Matrix

Let us now look at how to formalize epipolar geometry in algebraic terms, using
the Essential matrix [13], to find correspondences P L → lR [11]. We denote by

Fig. 7.7 Epipolar geometry for a converging stereo system. For the pair of stereo images, the epipolar lines are superimposed for the simplified search for homologous points according to the dual relationship point/epipolar line PL → lR and PR → lL

P L = (X L , Y L , Z L )T and P R = (X R , Y R , Z R )T , respectively, the coordinates of


the point P with respect to the systems of reference (with origin in C L and C R
respectively) of the two left and right cameras4 (see Fig. 7.4).
Knowing a projection of P in the image planes and the optical centers of the calibrated cameras (known orientations and calibration matrices), we can calculate the epipolar plane and consequently the relative epipolar lines, given by the intersection between the epipolar plane and the image planes. For the properties of epipolar geometry, the projection P_R of P in the right image plane lies on the right epipolar line. Therefore, the foundation of epipolar geometry is that it allows us to create a strong link between pairs of stereo images without knowing the 3D structure of the scene. Now let's see how to find an algebraic solution for these correspondences, point/epipolar line P_L → l_R, through stereo pairs of images.
According to the pinhole model, we consider the left camera with a perspective projection matrix5 P_L = [I | 0], whose optical center is also the origin of the stereo reference system, while the right camera is positioned with respect to the left one according to the perspective projection matrix P_R = [R | T], characterized by the extrinsic parameters R and T, respectively, the rotation matrix and the translation vector. Without loss of generality, we can assume the calibration matrix K of each camera equal to the identity matrix I. To solve the correspondence problem P_L → l_R, the homologous candidate points of P_L, which lie on the epipolar line l_R in the right image plane, need to be mapped in the reference system of C_R (the right camera).

4 We know that a 3D point, according to the pinhole model, projected in the image plane in P_L defines a ray passing through the optical center C_L (in this case the origin of the stereo system), the locus of aligned 3D points represented by λP_L. These points can be observed by the right camera and referenced in its reference system to determine the homologous points using the epipolar geometry approach. We will see that this is possible, so we will neglect the parameter λ in the following.
5 From now on, the perspective projection matrices will be indicated with P to avoid confusion with

the points of the scene indicated with P.



Fig. 7.8 Epipolar geometry. Derivation of the essential matrix from the coplanarity constraint between the vectors C_L P_L, C_R P_R and C_R C_L

In other words, to find P_R, we can map P_L in the reference system of C_R through the roto-translation parameters [R | T].6 This is possible by applying Eq. (7.96) as follows:
PR = RPL + T (7.100)

Pre-multiplying both members of (7.100) vectorially by T and then taking the scalar product with P_R^T, we get

\underbrace{P_R^T (T \times P_R)}_{=0} = P_R^T (T \times R P_L) + \underbrace{P_R^T (T \times T)}_{=0}   (7.101)

By the property of the vector product, T × T = 0, and since the scalar triple product P_R^T(T × P_R) = 0 (the vectors are coplanar), the previous relationship becomes

P_R^T (T \times R P_L) = 0   (7.102)

From the geometric point of view (see Fig. 7.8), Eq. (7.102) expresses the coplanarity
of the vectors CL PL , CR PR and CR CL representing, respectively, the projection
rays of a point P in the respective image planes and the direction of the vector
translation T .
At this point, we use the algebraic property that an antisymmetric matrix consists of only three independent elements, which can be considered as the elements of a vector with three components.7 For the translation vector T = (T_x, T_y, T_z)^T,

6 With reference to Note 1, we recall that the matrix R provides the orientation of the camera C R
with respect to the C L one. The column vectors are the direction cosines of C L axes rotated with
respect to the C R .
7 A matrix A is said to be antisymmetric when it satisfies the following properties:

A + AT = 0 AT = − A

It follows that the elements on the main diagonal are all zeroes while those outside the diagonal
satisfy the relation ai j = −a ji . This means that the number of elements is only n(n − 1)/2 and for
n = 3 we have a matrix of the size 3 × 3 of only 3 elements that can be considered as the components

we have its representation in terms of an antisymmetric matrix, indicated with [T]_\times, as follows:

[T]_\times = \begin{bmatrix} 0 & T_z & -T_y \\ -T_z & 0 & T_x \\ T_y & -T_x & 0 \end{bmatrix}   (7.103)

where conventionally [•]_× indicates the operator that transforms a 3D vector into an antisymmetric 3 × 3 matrix. It follows that we can express the vector T in terms of the antisymmetric matrix [T]_× and define the matrix expression:

E = [T]_\times R   (7.104)

which, substituted in (7.102), gives

P_R^T E P_L = 0   (7.105)

where E, defined by (7.104), is known as the essential matrix, which depends only on the rotation matrix R and the translation vector T, and is defined up to a scale factor.
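To fix ideas, a small numpy sketch that builds E from a given R and T and checks the epipolar constraint on a synthetic point (the skew function below uses the common sign convention, opposite to (7.103); since E is defined up to scale, including sign, the constraint is unaffected):

```python
import numpy as np

def skew(t):
    """Antisymmetric matrix [t]x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_pose(R, T):
    """Sketch of (7.104): E = [T]x R, defined up to scale."""
    return skew(T) @ R

if __name__ == "__main__":
    th = 0.1
    R = np.array([[np.cos(th), 0, np.sin(th)],
                  [0, 1, 0],
                  [-np.sin(th), 0, np.cos(th)]])
    T = np.array([1.0, 0.2, 0.0])
    P_L = np.array([0.3, -0.1, 2.0])
    P_R = R @ P_L + T                       # right-camera coordinates, as in (7.100)
    p_L = np.append(P_L[:2] / P_L[2], 1.0)  # normalized homogeneous coordinates
    p_R = np.append(P_R[:2] / P_R[2], 1.0)
    E = essential_from_pose(R, T)
    print(abs(p_R @ E @ p_L) < 1e-12)       # epipolar constraint (7.106) holds
```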
Equation (7.105) is still valid by scaling the coordinates from the reference system
of the cameras to those of the image planes, as follows:
p_L = (x_L, y_L)^T = \left( \frac{X_L}{Z_L}, \frac{Y_L}{Z_L} \right)^T \qquad p_R = (x_R, y_R)^T = \left( \frac{X_R}{Z_R}, \frac{Y_R}{Z_R} \right)^T

which are normalized coordinates, so we have

pTR E p L = 0 (7.106)

This equation realizes the epipolar constraint, i.e., for a 3D point projected in the stereo image planes it relates the homologous vectors p_L and p_R. It also expresses the coplanarity between any two corresponding points p_L and p_R included in the same epipolar plane for the two cameras.
In essence, (7.105) expresses the epipolar constraint between the rays, coming from the two optical centers, that intersect in the point P of the space, while Eq. (7.106) relates homologous points between the image planes. Moreover, for any projection in the left image plane p_L, through the essential matrix E, the epipolar line in the right

of a generic three-dimensional vector v. In this case, we use the symbolism [v]× or S(v) to indicate
the operator that transforms the vector v in an antisymmetric matrix as reported in (7.103). Often
this dual form of representation between vector and antisymmetric matrix is used to write the vector
product or outer product between two three-dimensional vectors with the traditional form x × y in
the form of simple product [x]× y or S(x) y.

image is given by the three-dimensional vector (see Fig. 7.8)8 :

l R = E pL (7.107)

and Eq. (7.106) becomes


pTR E p L = pTR l R = 0 (7.108)
  
lR

which shows that p R , the homologue of p L , is on the epipolar line l R (defined by


7.107) according to the Note 8.
Similarly, for any projection in the right image plane p R , through the essential
matrix E, the epipolar line in the left image is given by the three-dimensional vector:

l_L^T = p_R^T E \;\Longrightarrow\; l_L = E^T p_R   (7.109)

and Eq. (7.106) becomes

\underbrace{p_R^T E}_{l_L^T}\, p_L = l_L^T p_L = 0   (7.110)

which verifies that p L is on the epipolar line l L (defined by the 7.109) according to
the Note 8.
Epipolar geometry requires that the epipoles lie at the intersection of the epipolar lines and on the baseline defined by the translation vector T. Another property of the essential matrix is that its product with the epipoles e_L and e_R is equal to zero:

e_R^T E = 0 \qquad E e_L = 0   (7.111)

In fact, for each point p_L, except e_L, in the left image plane, Eq. (7.107) of the right epipolar line l_R = E p_L must hold, and the epipole e_R also lies on this line. Therefore, the epipole e_R must also satisfy (7.108), thus obtaining

e_R^T E p_L = (e_R^T E)\, p_L = 0 \quad \forall\, p_L \;\Rightarrow\; e_R^T E = 0 \;\text{or}\; E^T e_R = 0

The epipole e_R is thus in the left null space of E. Similarly, it is shown for the left epipole e_L that E e_L = 0, i.e., it is in the right null space of E. The equations (7.111) of the epipoles can be used to calculate their position knowing E.

8 With reference to the figure, if we denote by p̃ L = (x L , y L , 1) the projection vector of P in


the left image plane, expressed in homogeneous coordinates, and with l L = (a, b, c) the epipolar
line expressed as 3D vector whose equation in the image plane is ax L + by L + c = 0, then the
constraint that the point p̃ L is on the epipolar line l L induces l L p̃ L = 0 or l TL p̃ L = 0 or p̃TL l L = 0.
Furthermore, we recall that a line l passing through two points p1 and p2 is given by the following
vector product l = p1 × p2 . Finally, a point p as the intersection of two lines l 1 and l 2 is given by
p = l 1 × l 2.

The essential matrix has rank 2 (so it is also singular) since the antisymmetric matrix [T]_× has rank 2. It also has 5 degrees of freedom, 3 associated with the rotation angles and 2 for the vector T, defined up to a scale factor. We point out that, while the essential matrix E associates a point with a line, the homography matrix H associates a point with another point (p_L = H p_R).
An essential matrix has two equal singular values, and the third is zero. This property can be demonstrated by decomposing it with the SVD method, E = UΣV^T, and verifying that the elements of the main diagonal of Σ satisfy σ_1 = σ_2 ≠ 0 and σ_3 = 0.
Equation (7.106), in addition to solving the correspondence problem in the context of epipolar geometry, is also used for 3D reconstruction. In this case, at least 5 corresponding points are chosen in the stereo images, generating a linear system of equations based on (7.106) to determine E, from which R and T are then calculated. We will see in detail in the next paragraphs the 3D reconstruction of the scene with the triangulation procedure.

7.5.2 The Fundamental Matrix

In the previous paragraph, the coordinates of the points in relation to the epipolar lines were expressed in the reference system of the calibrated cameras and in accordance with the pinhole projection model. We now aim to obtain a relationship analogous to (7.106), but with the points in the image plane expressed directly in pixels. Suppose that for the same stereo system considered above the camera calibration matrices K_L and K_R are known, with the projection matrices P_L = K_L [I | 0] and P_R = K_R [R | T] for the left and right camera, respectively. We know from Eq. (6.208) that we can get, for a 3D point with (X, Y, Z) coordinates, the homogeneous coordinates in pixels ũ = (u, v, 1) in the stereo image planes, which for the two left and right images are given by

ũ L = K L p̃ L ũ R = K R p̃ R (7.112)

where p̃ L and p̃ R are the homogeneous coordinates, of the 3D point projected in the
stereo image planes, expressed in the reference system of the cameras. These last
coordinates can be derived from (7.112) obtaining

\tilde{p}_L = K_L^{-1} \tilde{u}_L \qquad \tilde{p}_R = K_R^{-1} \tilde{u}_R   (7.113)

which, substituted in Eq. (7.106) of the essential matrix, give:

(K_R^{-1} \tilde{u}_R)^T E (K_L^{-1} \tilde{u}_L) = 0

\tilde{u}_R^T \underbrace{K_R^{-T} E K_L^{-1}}_{F} \tilde{u}_L = 0   (7.114)

from which we can derive the following matrix:

F = K_R^{-T} E K_L^{-1}   (7.115)

where F is known as the Fundamental Matrix (proposed in [14,15]). Finally we get


the equation of the epipolar constraint based on F, given by

ũ TR F ũ L = 0 (7.116)

where the fundamental matrix F has size 3 × 3 and rank 2. As for the essential
matrix E, Eq. (7.116) is the fundamental algebraic tool based on the fundamental
matrix F for the 3D reconstruction of a point P of the scene observed from two
views. The fundamental matrix represents the constraint of the correspondence of
the homologous image points ũ L ↔ ũ R being 2D projections of the same point P
of 3D space. As done for the essential matrix we can derive the epipolar lines and
the epipoles from (7.116). For homologous points ũ L ↔ ũ R , we have for ũ R the
constraint to lie on the epipolar line l R associated with the point ũ L , which is given
by
l R = F ũ L (7.117)

such that ũ TR F ũ L = ũ TR l R = 0. This dualism, of associating a point of an image


plane to the epipolar line of the other image plane, is also valid in the opposite sense,
so a point ũ L on the left image, homologue of ũ R , must lie on the epipolar line l L
associated with the point ũ R , which is given by

l L = F T ũ R (7.118)

such that ũ TR F ũ L = (ũ TR F)ũ L = l TL ũ L = 0.


From the equations of the epipolar lines (7.117) and (7.118), subject to the constraint of the fundamental matrix equation (7.115), we can associate the epipolar lines with the respective epipoles ẽ_L and ẽ_R, since the latter must satisfy the following relations:

F \tilde{e}_L = 0 \qquad F^T \tilde{e}_R = 0   (7.119)

which express the fact that ẽ_L lies on every left epipolar line l_L = F^T ũ_R and ẽ_R lies on every right epipolar line l_R = F ũ_L. It follows that the epipole ẽ_L is in the null space of F, while the epipole ẽ_R is in the null space of F^T. It should be noted that the position of the epipoles does not necessarily fall within the domain of the image planes (see Fig. 7.9a).
A further property of the fundamental matrix concerns transposition: if F is the fundamental matrix relative to the stereo pair of cameras C_L → C_R, then the fundamental matrix \overleftarrow{F} of the stereo pair of cameras ordered in reverse, C_L ← C_R, is equal to F^T. In fact, applying (7.116) to the ordered pair C_L ← C_R we have

\tilde{u}_L^T \overleftarrow{F} \tilde{u}_R = 0 \;\Longrightarrow\; \tilde{u}_R^T \overleftarrow{F}^T \tilde{u}_L = 0 \quad \text{for which} \quad \overleftarrow{F} = F^T

Fig. 7.9 Epipolar geometry and projection of homologous points through the homography plane. a Epipoles on the baseline but outside the image planes; b Projection of homologous points by a homography plane not passing through the optical centers

Finally, we analyze a further feature of the fundamental matrix F by rewriting Eq. (7.115) with the essential matrix E expressed by (7.104), thus obtaining

F = K_R^{-T} E K_L^{-1} = K_R^{-T} [T]_\times R\, K_L^{-1}   (7.120)

We know that the determinant of the antisymmetric matrix [T]_× is zero; it follows that det(F) = 0 and the rank of F is 2. Although both matrices include the constraints of the epipolar geometry of two cameras and simplify the correspondence problem by mapping points of one image only onto the epipolar line of the other image, from Eqs. (7.114) and (7.120), which relate the two matrices F and E, it emerges that the essential matrix uses the coordinates of the cameras and depends on the relative extrinsic parameters (R and T), while the fundamental matrix operates directly with the coordinates in pixels and can be abstracted from the knowledge of the intrinsic and extrinsic parameters of the cameras. Knowing the intrinsic parameters (the matrices K), it is observed from (7.120) that the fundamental matrix reduces to the essential matrix, and therefore it operates directly in camera coordinates.
An important difference between the E and F matrices is the number of degrees of freedom: the essential matrix has 5, while the fundamental matrix has 7.
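The following numpy sketch (illustrative names) collects these relations: F from E and the calibration matrices (7.115), the epipolar line of a point (7.117), and the epipoles as null vectors (7.119):

```python
import numpy as np

def fundamental_from_essential(E, K_L, K_R):
    """Sketch of (7.115)/(7.120): F = K_R^-T E K_L^-1 (defined up to scale)."""
    return np.linalg.inv(K_R).T @ E @ np.linalg.inv(K_L)

def right_epipolar_line(F, u_L):
    """Epipolar line l_R = F u_L (7.117), with u_L in homogeneous pixel coords.
    A point u_R lies on the line when u_R . l_R = 0."""
    return F @ u_L

def epipoles(F):
    """Epipoles as the null vectors of F and F^T (7.119), via SVD.
    The final division assumes the epipoles are not at infinity."""
    e_L = np.linalg.svd(F)[2][-1]        # F e_L = 0
    e_R = np.linalg.svd(F.T)[2][-1]      # F^T e_R = 0
    return e_L / e_L[2], e_R / e_R[2]
```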

7.5.2.1 Relationship Between Fundamental and Homography Matrix


The fundamental matrix is an algebraic representation of epipolar geometry. Let us now see a geometric interpretation of the fundamental matrix that maps homologous points in two phases [10]. In the first phase, the point p_L is mapped to a point p_R in the right image that is potentially the homologue and, as we know, lies on the right epipolar line l_R. In the second phase the epipolar line l_R is calculated as the line passing through p_R and the epipole e_R according to epipolar geometry. With reference to Fig. 7.9b, we consider a point P of the 3D space lying in a plane π (not passing through the optical centers of the cameras) and projected in the left image plane in the point p_L with coordinates u_L. Then P is projected in the right image plane in

the point p_R with coordinates u_R. Basically, the projection of P in the left and right image planes can be considered as occurring through the plane π.
From epipolar geometry we know that p_R lies on the epipolar line l_R (projection of the ray P − p_L), which also passes through the right epipole e_R. Any other point in the plane π is projected in the same way in the stereo image planes, thus realizing a homographic projection H that maps each point p_{L_i} of one image plane to the corresponding point p_{R_i} in the other image plane.
Therefore, the homologous points between the stereo image planes can be considered as mapped by the 2D homography transformation:

u_R = H u_L

Then, imposing the constraint that the epipolar line l_R is the straight line passing through p_R and the epipole e_R, with reference to Note 8, we have

l_R = e_R \times u_R = [e_R]_\times u_R = \underbrace{[e_R]_\times H}_{F}\, u_L = F u_L   (7.121)

from which, considering also (7.117), we obtain the searched relationship between
homography matrix and fundamental matrix, given by

F = [e R ]× H (7.122)

where H is the homography matrix with rank 3, F is the fundamental matrix of rank 2, and [e_R]_× is the epipole vector expressed as an antisymmetric matrix with rank 2. Equation (7.121) is valid for any ith point p_{L_i} projected from the plane π and must satisfy the equation of epipolar geometry (7.116). In fact, replacing in (7.116) u_R given by the homography transformation, and considering the constraint that the homologue of each p_{L_i} must be on the epipolar line l_R, given by (7.121), we can verify that the constraint of the epipolar geometry remains valid, as follows:

u_R^T F u_L = \underbrace{(H u_L)^T}_{u_R^T} \underbrace{[e_R]_\times u_R}_{l_R} = u_R^T \underbrace{[e_R]_\times H}_{F}\, u_L = 0   (7.123)

thus confirming the relationship between fundamental and homography matrix


expressed by (7.122).
From the geometric point of view, it has been shown that the fundamental matrix maps a 2D point of one image plane onto a 1D entity, the epipolar line l_R passing through the homologous point and the epipole e_R of the other image plane (see Fig. 7.8), abstracting from the scene structure. The planar homography, on the other hand, is a one-to-one projective transformation (H of size 3 × 3 and rank 3) between 2D points, applied directly to the points of the scene (involved by the homography plane π) with the transformation u_R = H u_L, and can be considered as a special case of the fundamental matrix.

7.5.3 Estimation of the Essential and Fundamental Matrix

Both the E and F matrices can be estimated experimentally using a numerical method, knowing a set of corresponding points between the stereo images. In particular, it is possible to estimate the fundamental matrix without knowing the intrinsic and extrinsic parameters of the cameras, while for the essential matrix the attitudes of the cameras must be known.

7.5.3.1 8-Point Algorithm


A method for calculating the fundamental matrix is the one proposed in [10,13], known as the 8-point algorithm. In this approach, at least 8 correspondences u_l = (u_l, v_l, 1) ↔ u_r = (u_r, v_r, 1) between the stereo images are used. The F matrix is estimated by setting up a homogeneous system of linear equations, applying the equation of the epipolar constraint (7.116) for n ≥ 8 as follows:

u_{r_i}^T F u_{l_i} = 0 \qquad i = 1, \ldots, n   (7.124)

which in matrix form is


⎡ ⎤⎡ ⎤
f 11 f 12 f 13 u li
u ri vri 1 ⎣ f 21 f 22 f 23 ⎦ ⎣ vli ⎦ = 0 (7.125)
f 31 f 32 f 33 1

from which by making explicit we get


u li u ri f 11 + u li vri f 21 + u li f 31 + vli u ri f 12 + vli vri f 22 + vli f 32 + u ri f 13 + vri f 23 + f 33 = 0 (7.126)

If we group in the 9 × 1 vector f = ( f 11 , . . . , f 33 ) the unknown elements of F,


(7.126) can be reformulated as an inner product between vectors in the form:

(u_{r_i} u_{l_i},\; u_{r_i} v_{l_i},\; u_{r_i},\; v_{r_i} u_{l_i},\; v_{r_i} v_{l_i},\; v_{r_i},\; u_{l_i},\; v_{l_i},\; 1) \cdot f = 0   (7.127)

Therefore, we have an equation for every correspondence u_{l_i} ↔ u_{r_i}, and with n correspondences we can assemble a homogeneous system of n linear equations as follows:

A f = \begin{bmatrix}
u_{r_1} u_{l_1} & u_{r_1} v_{l_1} & u_{r_1} & v_{r_1} u_{l_1} & v_{r_1} v_{l_1} & v_{r_1} & u_{l_1} & v_{l_1} & 1 \\
\vdots & & & & \vdots & & & & \vdots \\
u_{r_n} u_{l_n} & u_{r_n} v_{l_n} & u_{r_n} & v_{r_n} u_{l_n} & v_{r_n} v_{l_n} & v_{r_n} & u_{l_n} & v_{l_n} & 1
\end{bmatrix}
\begin{bmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \\ f_{33} \end{bmatrix} = 0   (7.128)

where A is a matrix of size n × 9 derived from the n correspondences. To admit a solution, the correspondence matrix A of the homogeneous system must have at least rank 8. If the rank is exactly 8, we have a unique solution for f up to a scale factor, which can be determined with linear methods as the null space of the system. Therefore, 8 correspondences are sufficient, from which the name of the algorithm follows. In reality, the coordinates of the homologous points in stereo images are affected by noise and, to have a more accurate estimate of f, it is useful to use a number of correspondences n ≫ 8. In this case, the system is solved with the least squares method, finding a solution f that minimizes the following summation:


\sum_{i=1}^{n} (u_{r_i}^T F u_{l_i})^2   (7.129)

subject to the additional constraint ‖f‖ = 1, since the scale of F is arbitrary. The least squares solution of f corresponds to the smallest singular value of the SVD decomposition A = UΣV^T, taking the components of the last column vector of V (which corresponds to the smallest eigenvalue of A^T A).
Recalling some properties of the fundamental matrix, it is necessary to make some considerations. We know that F is a singular square matrix (det(F) = 0) of size 3 × 3 (9 elements) with rank 2. Moreover, F has 7 degrees of freedom, motivated as follows. The rank-2 constraint implies that any column is a linear combination of the other two; for example, the third is a linear combination of the first two, so two coefficients specify this combination and determine the third column. This suggests that F has eight degrees of freedom. Furthermore, operating in homogeneous coordinates, the elements of F are defined up to a scale factor without violating the epipolar constraint (7.116). It follows that the degrees of freedom are reduced to 7.
Another aspect to consider is the effect of the noise present in the correspondence data on the SVD decomposition of the matrix A. This causes the ninth singular value to be different from zero, and therefore the estimated F does not really have rank 2. This implies a violation of the epipolar constraint when this approximate F is used, and therefore the epipolar lines (given by Eqs. 7.117 and 7.118) do not exactly intersect in their epipoles. It is, therefore, advisable to correct the F matrix obtained from the decomposition of A with SVD, by applying a new SVD decomposition directly on the first estimate of F to obtain a new estimate F̂ that minimizes the Frobenius norm,9 as follows:

\min_{\hat{F}} \| F - \hat{F} \|_F \quad \text{subject to} \quad \det(\hat{F}) = 0   (7.130)

9 The Frobenius norm is an example of a matrix norm that can be interpreted as the norm of the vector of the elements of a square matrix A, given by

\| A \|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^2} = \sqrt{Tr(A^T A)} = \sqrt{\sum_{i=1}^{r} \lambda_i} = \sqrt{\sum_{i=1}^{r} \sigma_i^2}

where A is an n × n square matrix of real elements, r ≤ n is the rank of A, λ_i = σ_i^2 is the ith nonzero eigenvalue of A^T A, and σ_i = \sqrt{\lambda_i} is the ith singular value of the SVD decomposition of A. In the more general (complex) case, Tr(A^* A) with the conjugate transpose A^* should be considered.


where

\hat{F} = U \Sigma V^T   (7.131)

is the SVD decomposition of the first estimate of F. To obtain the rank-2 matrix closest to F, the third singular value of this decomposition is set to zero, that is, σ_{33} = 0. Therefore, the best approximation is obtained by recalculating F with the updated matrix Σ, as follows:
F = U \Sigma V^T = U \begin{bmatrix} \sigma_{11} & 0 & 0 \\ 0 & \sigma_{22} & 0 \\ 0 & 0 & 0 \end{bmatrix} V^T   (7.132)

This arrangement, which forces the approximation of F to rank 2, reduces as much as possible the error in mapping a point to the epipolar line between the stereo images, and ensures that all the epipolar lines converge in the relative epipoles. In reality, this trick reduces the error as much as possible but does not eliminate it completely.
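A minimal numpy sketch of the 8-point algorithm as described (illustrative names; input coordinates are assumed already normalized as in Sect. 7.5.4), including the final rank-2 enforcement:

```python
import numpy as np

def eight_point(u_l, u_r):
    """Sketch of the 8-point algorithm: u_l, u_r are n x 2 arrays (n >= 8) of
    corresponding coordinates. Returns a rank-2 fundamental matrix F (up to
    scale) such that u_r^T F u_l ~ 0."""
    ul, vl = u_l[:, 0], u_l[:, 1]
    ur, vr = u_r[:, 0], u_r[:, 1]
    ones = np.ones(len(ul))
    # One row per correspondence, ordered to match f = (f11, ..., f33) as in (7.128)
    A = np.column_stack([ur*ul, ur*vl, ur, vr*ul, vr*vl, vr, ul, vl, ones])
    f = np.linalg.svd(A)[2][-1]          # least squares solution with ||f|| = 1
    F = f.reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)          # enforce rank 2: zero the smallest singular value
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt
```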
The essential matrix E is calculated when the correspondences

p_l = (x_l, y_l, 1) \leftrightarrow p_r = (x_r, y_r, 1)

are expressed in homogeneous image coordinates of the calibrated cameras. The calculation procedure is identical to that of the fundamental matrix using the 8-point algorithm (or more points), so that the correspondences satisfy Eq. (7.106) of the epipolar constraint p_R^T E p_L = 0. Therefore, indicating with e = (e_{11}, . . . , e_{33}) the 9-dimensional vector that groups the unknown elements of E, we can get a homogeneous system of linear equations analogous to the system (7.128), written in compact form:

B e = 0   (7.133)

where B is the data matrix of the correspondences, of size n × 9, the analog of the matrix A of the system (7.128) relative to the fundamental matrix. As with the fundamental matrix, the least squares solution of e corresponds to the smallest singular value of the SVD decomposition B = UΣV^T, taking the components of the last column vector of V (which corresponds to the smallest eigenvalue). The same considerations on data noise remain, so the obtained solution may not satisfy the requirement that the essential matrix be exactly of rank 2; therefore it is convenient, also for the essential matrix, to reapply the SVD decomposition directly on the first estimate of E to get a new estimate given by Ê = UΣV^T.
The only difference in the calculation procedure concerns the different properties of the two matrices. Indeed, the essential matrix, with respect to the fundamental one, has the further constraint that its two nonzero singular values are equal. To take this into account, the diagonal matrix is modified by imposing Σ = diag(1, 1, 0), and the essential matrix becomes E = U diag(1, 1, 0) V^T, which is the best approximation of the normalized essential matrix that minimizes the Frobenius norm. It is also shown that, if from the SVD decomposition Ê = UΣV^T we have Σ = diag(a, b, c) with a ≥ b ≥ c, and we set Σ = diag((a+b)/2, (a+b)/2, 0), the essential matrix is recomputed as E = UΣV^T, which is the closest essential matrix according to the Frobenius norm.
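In code, this projection onto a valid essential matrix is a few lines of SVD arithmetic (a sketch with an illustrative name):

```python
import numpy as np

def closest_essential(E_est):
    """Sketch: project a noisy estimate onto the set of valid essential
    matrices by forcing two equal singular values and a null one."""
    U, S, Vt = np.linalg.svd(E_est)
    s = (S[0] + S[1]) / 2.0              # Sigma = diag((a+b)/2, (a+b)/2, 0)
    return U @ np.diag([s, s, 0.0]) @ Vt
```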

7.5.3.2 7-Point Algorithm


With the same approach as the 8-point algorithm, it is possible to calculate the essential and fundamental matrices considering only 7 correspondences, since the fundamental matrix has 7 degrees of freedom, as described in the previous paragraph. This means that the respective data matrices A and B are of size 7 × 9 and, in general, have rank 7. Solving, in this case, the homogeneous system A f = 0 with 7 correspondences, the system presents a two-dimensional set of solutions, generated by two basis vectors f_1 and f_2 (calculated with SVD and belonging to the null space of A), which correspond to two matrices F_1 and F_2. The solution of the system is of the form f = α f_1 + (1 − α) f_2, with α a scalar variable. The solution expressed in matrix form is

F = α F_1 + (1 − α) F_2   (7.134)

For F, we can impose the constraint det(F) = 0, for which we have

det(α F_1 + (1 − α) F_2) = 0

such that F has rank 2. This constraint leads to a nonlinear cubic equation in the unknown α, with F_1 and F_2 known. This equation has either 1 or 3 real solutions for α. In the case of 3 solutions, these must be verified by substituting them in (7.134), discarding the degenerate ones.
Recall that the essential matrix has 5 degrees of freedom and, as before, a homogeneous system of linear equations B e = 0 can be set up with the data matrix B of size 5 × 9, built with only 5 correspondences. Compared to overdetermined systems, its implementation is more complex. In [16], an algorithm is proposed for the estimation of E from just 5 correspondences.

7.5.4 Normalization of the 8-Point Algorithm

The 8-point algorithm, described above for the estimation of the essential and fundamental matrices, uses the basic least squares approach and, if the error in experimentally determining the coordinates of the correspondences is contained, the algorithm produces acceptable results. To reduce the numerical instability due to data noise and, above all, as in this case, to the coordinates of the correspondences being expressed with a large numerical amplitude (a badly conditioned data matrix, whose SVD yields singular values far from equal and others nearly zeroed), it is advisable to apply a normalization process to the data before applying the 8-point algorithm [14].

This normalization process consists in applying to the coordinates a translation that establishes a new reference system with origin in the centroid of the points in the image plane. Subsequently, the coordinates are scaled so that the quadratic mean value of the distance of the points from the centroid is of 1–2 pixels. This transformation, a combination of scaling and translation of the origin of the data, is carried out through two transformation matrices T_L and T_R for the left and right stereo images, respectively. Indicating with û = (û_i, v̂_i, 1) the normalized coordinates, expressed in pixels, of the ith point in the image plane, and with T the transformation matrix that normalizes the input coordinates u = (u_i, v_i, 1), the transformation equation is given by
\begin{bmatrix} \hat{u}_i \\ \hat{v}_i \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{u_i - \mu_u}{\mu_d} \\ \frac{v_i - \mu_v}{\mu_d} \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{\mu_d} & 0 & -\frac{\mu_u}{\mu_d} \\ 0 & \frac{1}{\mu_d} & -\frac{\mu_v}{\mu_d} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = T u_i   (7.135)

where the centroid (μu , μv ) and the average distance from the centroid μd are cal-
culated for n points as follows:
\mu_u = \frac{1}{n}\sum_{i=1}^{n} u_i \qquad \mu_v = \frac{1}{n}\sum_{i=1}^{n} v_i \qquad \mu_d = \frac{1}{n}\sum_{i=1}^{n} \sqrt{(u_i - \mu_u)^2 + (v_i - \mu_v)^2}   (7.136)

According to (7.135) and (7.136), the normalization matrices T_L and T_R relative to the two stereo cameras are computed, and the correspondence coordinates are then normalized as follows:

\hat{u}_L = T_L u_L \qquad \hat{u}_R = T_R u_R   (7.137)

After the normalization of the data, the fundamental matrix F_n is estimated with the approach indicated above; it subsequently needs to be denormalized to be used with the original coordinates. The denormalized version F is obtained from the epipolar constraint equation as follows:

u_R^T F u_L = \hat{u}_R^T \underbrace{T_R^{-T} F T_L^{-1}}_{F_n} \hat{u}_L = \hat{u}_R^T F_n \hat{u}_L = 0 \;\Longrightarrow\; F = T_R^T F_n T_L   (7.138)

7.5.5 Decomposition of the Essential Matrix

With the 8-point algorithm (see Sect. 7.5.3), we have calculated the fundamental matrix F and, knowing the matrices K of the stereo cameras, it is possible to calculate with (7.115) the essential matrix E. Alternatively, E can be calculated directly with (7.106), which we know includes the extrinsic parameters, that is, the rotation matrix R and the translation vector T.
R and T are precisely the result of the decomposition of E that we want to accomplish. Recall from (7.104) that the essential matrix E can be expressed in the following form:

E = [T]_\times R   (7.139)

which suggests that we can decompose E into two components, the vector T
expressed in terms of the antisymmetric matrix [T ]× and the rotation matrix R.
By virtue of the theorems demonstrated in [17,18] we have

Theorem 7.1 An essential matrix E of size 3 × 3 can be factored as the product of a


rotation matrix and a nonzero antisymmetric matrix, if and only if, E has two equal
nonzero singular values and a null singular value.

Theorem 7.2 Suppose that E can be factored into a product SR, where R is an orthogonal matrix and S is an antisymmetric matrix. Let the SVD of E be given by E = UΣV^T, where Σ = diag(k, k, 0). Then, up to a scale factor, the possible factorization is one of the following:

$$S = UZU^T \qquad R = UWV^T \ \text{ or } \ R = UW^T V^T \qquad E = SR \qquad (7.140)$$

where W and Z are rotation matrix and antisymmetric matrix, respectively, defined
as follows:

$$W = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad Z = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad (7.141)$$

Since the scale of the essential matrix does not matter, it has only 5 degrees of freedom. The reduction from 6 to 5 degrees of freedom produces an extra constraint on the singular values of E; moreover, det(E) = 0 and, since the scale is arbitrary, we can assume both nonzero singular values equal to 1, giving the SVD

$$E = U\,\mathrm{diag}(1,1,0)\,V^T \qquad (7.142)$$

But this decomposition is not unique. Furthermore, since U and V are orthogonal matrices, we require det(U) = det(V^T) = 1; if the SVD (7.142) gives det(U) = det(V^T) = −1, we can change the sign of the last column of V. Alternatively, we can change the sign of E, obtaining the SVD −E = U diag(1, 1, 0)(−V)^T with det(U) = det(−V^T) = 1. Note that the SVD of −E generates a different decomposition, since the decomposition is not unique.
Now let’s see with the decomposition of E according to (7.142) the possible
solutions considering that

ZW = diag([1 1 0]) ZW T = −diag([1 1 0]) (7.143)

and the possible solutions are E = S1 R1 , where

S1 = −U ZU T R1 = U W T V T (7.144)

and E = S2 R2 , where

S2 = U ZU T R2 = U W V T (7.145)

Now let’s see if these are two possible solutions for E by first checking if R1 and R2
are rotation matrices. In fact, remembering the properties (see Note 1) of the rotation
matrices must result in the following:
 T
R1T R1 = U W T V T U W T V T = V W U T U W T V T = I (7.146)

and therefore R_1 is orthogonal. It must also be shown that det(R_1) = 1:

$$\det(R_1) = \det\!\left(UW^T V^T\right) = \det(U)\det(W^T)\det(V^T) = \det(W)\det(U V^T) = 1 \qquad (7.147)$$

To check instead that S_1 is an antisymmetric matrix, it must hold that S_1 = −S_1^T. Therefore, we have

$$-S_1^T = \left(UZU^T\right)^T = U Z^T U^T = -U Z U^T = S_1 \qquad (7.148)$$

To verify that the possible decompositions are valid, that is, that the last equation of (7.140) is satisfied, we must get E = S_1 R_1 = S_2 R_2, by verifying

$$S_1 R_1 = \left(-UZU^T\right)\left(UW^T V^T\right) = -U Z W^T V^T = \underbrace{-U\left(-\mathrm{diag}(1,1,0)\right)V^T}_{\text{by Eq.\ (7.143)}} = E \qquad (7.149)$$

By virtue of (7.142), the last step of the (7.149) shows that the decomposition S1 R1
is valid. Similarly it is shown that the decomposition S2 R2 is also valid. Two possible
solutions have, therefore, been reached for each essential matrix E and it is proved
to be only two [10].
Similarly to what has been done for the possible solutions of R we have to examine
the possible solutions for the translation vector T which can assume different values.
We know that T is encapsulated in S the antisymmetric matrix, such that S = [T ]× ,
obtained from the two possible decompositions. For the definition of vector product
we have
ST = [T ]× T = U ZU T T = T × T = 0 (7.150)

Therefore, the vector T is in the null space of S which is the same as the null
space of the matrices S1 and S2 . It follows that the searched estimate of T from
this decomposition, by virtue of (7.150), corresponds to the third column of U as
follows¹⁰:

$$T = U \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = \boldsymbol{u}_3 \qquad (7.151)$$

¹⁰ For the decomposition predicted by (7.140), the solution must be T = U[0 0 1]^T, since it must satisfy (7.150), that is, ST = 0, according to the property of an antisymmetric matrix. In fact, for T = u_3 the following condition is satisfied:

$$S\boldsymbol{u}_3 = UZU^T\boldsymbol{u}_3 = U\begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}U^T\boldsymbol{u}_3 = \left[\boldsymbol{u}_2\;\; -\boldsymbol{u}_1\;\; \boldsymbol{0}\right]\left[\boldsymbol{u}_1\;\; \boldsymbol{u}_2\;\; \boldsymbol{u}_3\right]^T\boldsymbol{u}_3 = \boldsymbol{u}_2\boldsymbol{u}_1^T\boldsymbol{u}_3 - \boldsymbol{u}_1\boldsymbol{u}_2^T\boldsymbol{u}_3 = 0$$

Fig. 7.10 The 4 possible solutions of the pair R and T in the decomposition of E. A reversal of the baseline (inverted optical centers) is observed horizontally, while vertically we have a rotation of 180◦ around the baseline. Only configuration (a) correctly reconstructs the 3D point of the scene, which lies in front of both cameras

Let us now observe that if T is in the null space of S, the same holds for λT; in fact, for any nonzero value of λ we have a valid solution, since

$$[\lambda T]_\times R = \lambda [T]_\times R = \lambda E \qquad (7.152)$$

which is still a valid essential matrix, defined up to an unknown scale factor λ. We know that this decomposition is not unique given the ambiguity of the sign of E, and consequently the sign of T is also undetermined, considering that S = U(±Z)U^T.
Summing up, for a given essential matrix, there are 4 possible choices of projection
matrices P R for the right camera, since there are two choice options for both R and
T , given by the following:

$$P_R = \left[\,U W V^T \,\middle|\, \pm\boldsymbol{u}_3\,\right] \quad \text{or} \quad \left[\,U W^T V^T \,\middle|\, \pm\boldsymbol{u}_3\,\right] \qquad (7.153)$$

By obtaining 4 potential pairs (R, T ) there are 4 possible configurations of the stereo
system by rotating the camera in a certain direction or in the opposite direction with
the possibility of translating it in two opposite directions as shown in Fig. 7.10. The
choice of the appropriate pair is made for each 3D point to be reconstructed by
triangulation by selecting the one where the points are in front of the stereo system
(in the direction of the positive z axis).
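As an illustrative sketch (identifiers are mine; the sign handling follows the discussion above, and the final selection by triangulation is only indicated in a comment), the decomposition of E into the four candidate pairs of (7.153) might be implemented as:

```python
import numpy as np

def decompose_essential(E):
    """Factor E into the four candidate (R, T) pairs of Eq. (7.153)."""
    U, _, Vt = np.linalg.svd(E)
    # enforce proper rotations: det(U) = det(V^T) = +1 (sign flips do not change
    # the candidates, since E is defined up to sign/scale)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])          # Eq. (7.141)
    R1 = U @ W @ Vt                           # R = U W V^T
    R2 = U @ W.T @ Vt                         # R = U W^T V^T
    t = U[:, 2]                               # third column of U, Eq. (7.151)
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# The correct pair is the one for which the triangulated points lie in front
# of both cameras (positive depth), as discussed in the text and in Fig. 7.10.
```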

7.5.6 Rectification of Stereo Images

With epipolar geometry, the problem of searching for homologous points is reduced
to mapping a point of an image on the corresponding epipolar line in the other image.
It is possible to simplify the problem of correspondence through a one-dimensional
point-to-point search between the stereo images. For example, we can execute an
appropriate geometric transformation (e.g., projective) with resampling (see Sect. 3.9
Vol. II) on stereo images such as to make the epipolar lines parallel and thus sim-

plify the search for homologous points as a 1D correspondence problem. This also
simplifies the correlation process that evaluates the similarity of the homologous
patterns (described in Chap. 1). This image alignment process is known as recti-
fication of stereo images and several algorithms have been proposed based on the
constraints of epipolar geometry (using uncalibrated cameras where the fundamental
matrix includes intrinsic parameters) and on the knowledge of intrinsic and extrinsic
parameters of calibrated cameras.
Rectification algorithms with uncalibrated cameras [10,19] operate without explicit camera parameter information, which is implicitly included in the essential and fundamental matrices used for image rectification. The nonexplicit use of the calibration parameters makes it possible to simplify the search for homologous points by operating on the aligned homographic projections of the images; for the 3D reconstruction, however, there is the problem that objects observed at different scales or from different perspectives may appear identical in the homographic projections of the aligned images.
In the approaches with calibrated cameras, intrinsic and extrinsic parameters are
In the approaches with calibrated cameras, intrinsic and extrinsic parameters are
used to perform geometric transformations to horizontally align the cameras and
make the epipolar lines parallel to the x-axis. In essence, the images transformed
for alignment can be thought of as reacquired with a new configuration of the stereo
system where the alignment takes place by rotating the cameras around their optical centers, taking care to minimize distortion errors in the perspective reprojections.

7.5.6.1 Uncalibrated Rectification


Consider the initial configuration of stereo vision with the cameras arranged with
the parallel optical axes and therefore with the image planes coplanar and vertically
aligned (known as canonical or lateral configuration, see Fig. 7.6). We assume that
the calibration matrix K of the cameras is the same and the essential matrix E (for
example, first calculated with the SVD method described in the previous paragraphs)
is known. Since the cameras have not been rotated between them, we can assume
that R = I, where I is the identity matrix. If b is the baseline (distance between
optical centers) we have that T = (b, 0, 0) and considering (7.104) we get
$$E = [T]_\times R = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -b \\ 0 & b & 0 \end{bmatrix} \qquad (7.154)$$

and according to the equation of the epipolar constraint (7.106) we have


$$\boldsymbol{p}_R^T E \boldsymbol{p}_L = \begin{bmatrix} x_R & y_R & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -b \\ 0 & b & 0 \end{bmatrix} \begin{bmatrix} x_L \\ y_L \\ 1 \end{bmatrix} = \begin{bmatrix} x_R & y_R & 1 \end{bmatrix} \underbrace{\begin{bmatrix} 0 \\ -b \\ b\,y_L \end{bmatrix}}_{\boldsymbol{l}_R} = 0 \;\Longrightarrow\; b\,y_R = b\,y_L \qquad (7.155)$$

from which it emerges that the y vertical coordinate is the same for the homologous
points and the equation of the epipolar line l R = (0, −b, by L ) associated to the point
p L is horizontal. Similarly, we have for the epipolar line l L = E T p R = (0, b, −by R )

associated with the point p R . Therefore, a 3D point of the scene always appears on
the same line in the two stereo images.
The same result is obtained by calculating the fundamental matrix F for the paral-
lel stereo cameras. Indeed, assuming for the two cameras, the perspective projection
matrices have
$$P_L = K_L[I \mid 0] \qquad P_R = K_R[R \mid T] \qquad \text{with} \quad K_L = K_R = I, \quad R = I, \quad T = (b, 0, 0)$$

where b is the baseline. By virtue of Eq. (7.120), we get the fundamental matrix:
$$F = K_R^{-T} \underbrace{[T]_\times R}_{E}\, K_L^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -b \\ 0 & b & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} \qquad (7.156)$$

where the arbitrary scale factor b has been dropped (F being defined up to scale),

and according to the equation of the epipolar constraint (7.116) we have


$$\boldsymbol{u}_R^T F \boldsymbol{u}_L = \begin{bmatrix} u_R & v_R & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} u_L \\ v_L \\ 1 \end{bmatrix} = \begin{bmatrix} u_R & v_R & 1 \end{bmatrix} \underbrace{\begin{bmatrix} 0 \\ -1 \\ v_L \end{bmatrix}}_{\boldsymbol{l}_R} = 0 \;\Longrightarrow\; v_R = v_L \qquad (7.157)$$

We thus have that even with F the vertical coordinate v is the same for the homologous points, and the epipolar line l_R = (0, −1, v_L) associated with the point u_L is horizontal. Similarly for the epipolar line l_L = F^T u_R = (0, 1, −v_R) associated with the point u_R.
Now let’s see how to rectify the stereo images acquired in the noncanonical config-
uration, with the converging and non-calibrated cameras, of which we can estimate
the fundamental matrix (with the 8-point normalized algorithm) and consequently
calculate the epipolar lines relative to the two images for the similar points consid-
ered. Known the fundamental matrix and the epipolar lines, it is then possible to
calculate the relative epipoles.11
At this point, having known the epipoles e L and e R , we can already check if the
stereo system is in the canonical configuration or not. From the epipolar geometry,
we know (from Eq. 7.119) that the epipole is the vector in the null space of the
fundamental matrix F for which F · e = 0. Therefore, from (7.156) the fundamental
matrix of a canonical configuration is known and in this case we will have
$$F \cdot \boldsymbol{e} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = \boldsymbol{0} \qquad (7.158)$$

¹¹ According to epipolar geometry, we know that the epipolar lines intersect in the relative epipoles. Given the noise present in the correspondence coordinates, in reality the epipolar lines intersect not in a single point e but in a small area. It is therefore necessary to optimize the calculation of the position of each epipole considering the center of gravity of this area, and this is achieved with the least squares method to minimize this error. Remembering that each line is represented with a 3D vector of the type l_i = (a_i, b_i, c_i), the set of epipolar lines {l_1, l_2, ..., l_n} can be grouped in an n × 3 matrix L, forming a homogeneous linear system L · e = 0 with the epipole vector e as the unknown, solvable with the SVD (singular value decomposition) method.

Fig. 7.11 Rectification of stereo image planes. Stereo images, acquired from a noncanonical stereo configuration, are reprojected into image planes that are coplanar and parallel to the baseline. The epipolar lines correspond to the lines of the rectified images

for which

$$\boldsymbol{e} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad (7.159)$$

is the solution vector of the epipole corresponding to the configuration with parallel
cameras, parallel epipolar lines, and epipole at infinity in the horizontal direction.
If the configuration is not canonical, it is necessary to carry out an appropriate
homography transformation for each stereo image to make them coplanar with each
other (see Fig. 7.11), so as to obtain each epipole at infinity along the horizontal axis,
according to (7.159).
If we indicate with H_L and H_R the homography transforms that rectify, respectively, the original left and right images, and with û_L and û_R the homologous points in the rectified images, these are defined as follows:

û L = H L ũ L û R = H R ũ R (7.160)

where ũ L and ũ R are homologous points in the original images of the noncanonical
stereo system of which we know F. We know that the latter satisfy the constraint of
the epipolar geometry given by (7.116), so considering Eq. (7.160) we have

$$\tilde{\boldsymbol{u}}_R^T F \tilde{\boldsymbol{u}}_L = \left(H_R^{-1}\hat{\boldsymbol{u}}_R\right)^T F \left(H_L^{-1}\hat{\boldsymbol{u}}_L\right) = \hat{\boldsymbol{u}}_R^T H_R^{-T} F H_L^{-1} \hat{\boldsymbol{u}}_L = 0 \qquad (7.161)$$

from which we have that the fundamental matrix F̂ for the rectified images must correspond, according to (7.156), to the following factorization:

$$\hat{F} = H_R^{-T} F H_L^{-1} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} \qquad (7.162)$$

Therefore, once the homography transforms H_L and H_R that satisfy (7.162) are found, the images are rectified, with the epipoles mapped to infinity as required. The problem is

that these homography transformations are not unique and if chosen improperly they
generate distorted rectified images. One idea is to consider homography transforma-
tions as rigid transformations by rotating and translating the image with respect to a
point of the image (for example, the center of the image). This is equivalent to carry-
ing out the rectification with the techniques described in Chap. 3 Vol. II with linear
geometric transformations and image resampling. In [19], an approach is described
which minimizes distortions of the rectified images by decomposing the homographs
into elementary transformations:

H = H p Hr Hs

where H_p indicates a projective transformation, H_r a similarity transformation, and H_s a shearing transformation (i.e., it takes into account the deformations that skew the flat shape of an object along the coordinate axes u or v, or both).
In [10], a rectification method is proposed that first performs a homography transformation on the right image to obtain the combined effect of a rigid roto-translation around the center of the image (in homogeneous coordinates (0, 0, 1)), followed by a transformation that takes a point at (f, 0, 1) and maps it to infinity at (f, 0, 0) along the horizontal u axis. In particular, if we consider the image center in homogeneous coordinates (0, 0, 1) as the reference point, the pixel coordinates of the right image are translated with the matrix T given by
$$T = \begin{bmatrix} 1 & 0 & -L/2 \\ 0 & 1 & -H/2 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.163)$$

where H and L are the height and width of the image, respectively. After applying the translation, we apply a rotation R to position the epipole on the horizontal axis at a certain point (f, 0, 1). If the translated epipole T e_R is in position (e_{R_u}, e_{R_v}, 1), the rotation applied is

$$R = \begin{bmatrix} \alpha\dfrac{e_{R_u}}{\sqrt{e_{R_u}^2+e_{R_v}^2}} & \alpha\dfrac{e_{R_v}}{\sqrt{e_{R_u}^2+e_{R_v}^2}} & 0 \\[10pt] -\alpha\dfrac{e_{R_v}}{\sqrt{e_{R_u}^2+e_{R_v}^2}} & \alpha\dfrac{e_{R_u}}{\sqrt{e_{R_u}^2+e_{R_v}^2}} & 0 \\[10pt] 0 & 0 & 1 \end{bmatrix} \qquad (7.164)$$

where α = 1 if e_{R_u} ≥ 0 and α = −1 otherwise. After the roto-translation obtained by applying T and then R, to map a point located at (f, 0, 1) to the point at infinity (f, 0, 0) along the u axis, the following transformation G is applied:

$$G = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -\dfrac{1}{f} & 0 & 1 \end{bmatrix} \qquad (7.165)$$

Therefore, the homography transformation H R for the right image is given by the
combination of the three elementary transformations as follows:

H R = G RT (7.166)

which, to first order, acts as a rigid transformation with respect to the image center.
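A minimal NumPy sketch of the construction of H_R from the right epipole, following (7.163)–(7.166), is given below (identifiers are mine; the epipole is assumed to be a finite point given in homogeneous coordinates):

```python
import numpy as np

def rectifying_homography_right(e_R, width, height):
    """Build H_R = G @ R @ T of Eq. (7.166) from the right epipole e_R."""
    # translation bringing the image center to the origin, Eq. (7.163)
    T = np.array([[1.0, 0.0, -width / 2.0],
                  [0.0, 1.0, -height / 2.0],
                  [0.0, 0.0, 1.0]])
    e = T @ np.asarray(e_R, dtype=float)
    e = e / e[2]                                   # translated epipole (e_u, e_v, 1)
    alpha = 1.0 if e[0] >= 0 else -1.0
    d = np.hypot(e[0], e[1])
    # rotation placing the epipole on the horizontal axis, Eq. (7.164)
    R = np.array([[alpha * e[0] / d,  alpha * e[1] / d, 0.0],
                  [-alpha * e[1] / d, alpha * e[0] / d, 0.0],
                  [0.0, 0.0, 1.0]])
    f = (R @ e)[0]                                 # epipole is now at (f, 0, 1)
    # map (f, 0, 1) to the point at infinity (f, 0, 0), Eq. (7.165)
    G = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [-1.0 / f, 0.0, 1.0]])
    return G @ R @ T
```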
At this point, with the homography H_R known, we need to find an optimal solution for the homography H_L such that the images rectified with these homographies are as similar as possible, with the least possible distortion. This is done by searching for the homography H_L that minimizes the difference between the rectified images, setting up a function that minimizes the sum of the squared distances between homologous points of the two images:

$$\min_{H_L} \sum_i \left\| H_L \boldsymbol{u}_{L_i} - H_R \boldsymbol{u}_{R_i} \right\|^2 \qquad (7.167)$$

Without giving the algebraic details described in [10], it is shown that the homography
H L can be expressed in the form:

HL = HAHRM (7.168)

assuming that the fundamental matrix F of the stereo pair of input images is known,
which we express as
F = [e]× M (7.169)

while the H A matrix is given by:


$$H_A = \begin{bmatrix} a_1 & a_2 & a_3 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.170)$$

where the generic vector a = (a_1, a_2, a_3) will be defined later. H_A expressed by (7.170) represents the affine transformation component included in the compound transformation of H_L in (7.168). Now let us show what the matrix M represents. First of all, we highlight that, by the properties of an antisymmetric matrix A, we have A = A³ up to a scale factor. Since any vector e can be represented as a 3 × 3 antisymmetric matrix [e]_× (defined, like the matrix F, up to a scale factor), we can apply these properties to (7.169) and obtain

$$F = [\boldsymbol{e}]_\times M = [\boldsymbol{e}]_\times [\boldsymbol{e}]_\times \underbrace{[\boldsymbol{e}]_\times M}_{F} = [\boldsymbol{e}]_\times [\boldsymbol{e}]_\times F \qquad (7.171)$$

from which we get


M = [e]× F (7.172)

We observe that if a multiple of the vector e is added to the columns of M, then (7.171) remains valid, again up to a scale factor. Therefore, the most general form to define M is as follows:

$$M = [\boldsymbol{e}]_\times F + \boldsymbol{e}\boldsymbol{v}^T \qquad (7.173)$$

where v is a generic 3D vector. Normally, v is set equal to (1, 1, 1) with good results.
Now it remains to define H A , that is, the vector a introduced in (7.170) to estimate
H L given by (7.168). This is accomplished by considering that the initial goal was
to minimize the function (7.167) by adequately finding H L and H R . Now we know
H R (the homography matrix that maps the epipole e R to an infinite point in (1, 0, 0))
and M, so we can write transformations for homologous points of the two images in
the form:
û L i = H R Mu L i û Ri = H R u Ri

and then the minimization problem results:



$$\min_{H_A} \sum_i \left\| H_A \hat{\boldsymbol{u}}_{L_i} - \hat{\boldsymbol{u}}_{R_i} \right\|^2 \qquad (7.174)$$

If the points are expressed in homogeneous coordinates as û_{L_i} = (û_{L_i}, v̂_{L_i}, 1) and û_{R_i} = (û_{R_i}, v̂_{R_i}, 1), then the minimization function becomes

$$\min_{\boldsymbol{a}} \sum_i \left(a_1 \hat{u}_{L_i} + a_2 \hat{v}_{L_i} + a_3 - \hat{u}_{R_i}\right)^2 + \left(\hat{v}_{L_i} - \hat{v}_{R_i}\right)^2 \qquad (7.175)$$

It is observed that v̂_{L_i} − v̂_{R_i} is a constant term, so the minimization problem is further reduced to the form

$$\min_{\boldsymbol{a}} \sum_i \left(a_1 \hat{u}_{L_i} + a_2 \hat{v}_{L_i} + a_3 - \hat{u}_{R_i}\right)^2 \qquad (7.176)$$

and finally, the minimization problem can be set up as a simple least squares problem
solving a system of linear equations, where the unknowns are the components of the
vector a, given by
$$U\boldsymbol{a} = \boldsymbol{b} \;\Longleftrightarrow\; \begin{bmatrix} \hat{u}_{L_1} & \hat{v}_{L_1} & 1 \\ \vdots & \vdots & \vdots \\ \hat{u}_{L_n} & \hat{v}_{L_n} & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} \hat{u}_{R_1} \\ \vdots \\ \hat{u}_{R_n} \end{bmatrix} \qquad (7.177)$$

Once we have calculated the vector a by solving (7.177), we can build H_A with (7.170), estimate H_L with (7.168), and, with the other homography matrix H_R already calculated, we can rectify each pair of stereo images acquired, using the n correspondences.
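A sketch of this last part (building M, solving (7.177), and assembling H_L as in (7.168)) is shown below; the function and variable names are mine, e_R is the right epipole, and v = (1, 1, 1) is used as suggested in the text:

```python
import numpy as np

def matching_homography_left(F, e_R, H_R, uL, uR):
    """Estimate H_L = H_A @ H_R @ M (Eq. 7.168) by minimizing Eq. (7.176).
    uL, uR: (n, 3) arrays of homologous points in homogeneous coordinates."""
    ex = np.array([[0.0, -e_R[2], e_R[1]],
                   [e_R[2], 0.0, -e_R[0]],
                   [-e_R[1], e_R[0], 0.0]])          # [e]_x
    M = ex @ F + np.outer(e_R, np.ones(3))           # Eq. (7.173) with v = (1, 1, 1)
    xL = (H_R @ M @ uL.T).T                          # transformed left points
    xR = (H_R @ uR.T).T                              # transformed right points
    xL = xL / xL[:, 2:3]                             # back to third coordinate = 1
    xR = xR / xR[:, 2:3]
    A = np.c_[xL[:, 0], xL[:, 1], np.ones(len(xL))]  # matrix U of Eq. (7.177)
    b = xR[:, 0]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)        # least squares solution for a
    H_A = np.array([[a[0], a[1], a[2]],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])                # Eq. (7.170)
    return H_A @ H_R @ M
```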
We summarize the whole procedure of the rectification process of stereo images,
based on homography transformations, applied to a pair of images acquired by a

stereo system (in the noncanonical configuration) of which we know the epipolar
geometry (the fundamental matrix) for which the epipolar lines in the input images
are mapped horizontally in the rectified images. The essential steps are

1. Find n ≥ 7 initial correspondences u L ↔ u R in the two stereo input images. We


know that for the estimate of F it is better if n > 7.
2. Estimate the fundamental matrix F and find the epipoles e_L and e_R in the two images.
3. Calculate the homography transformation H_R which maps the epipole e_R to infinity in (1, 0, 0)^T.
4. Find the transformation matrix H L that minimizes the function (7.167), that is,
it minimizes the sum of the square of the distances of the transformed points.
5. Found the best transformations H L and H R , rectify (geometric transformation
with resampling) the respective stereo images of left and right.

7.5.6.2 Calibrated Rectification


We now describe the process of rectification (nonphysical) of stereo images in which
the intrinsic parameters (the calibration matrix K ) and extrinsic parameters (the
rotation matrix R and the translation vector T ) of each camera can be estimated
using the methods described in Sect. 7.4. In practice, if the cameras are fixed on
a mobile turret with 3 degrees of freedom, it is possible to configure the canonical
stereo system thus directly acquiring the rectified images with the accuracy dependent
on the accuracy of the cameras' attitude. In the general stereo configuration, with R and T known, it is necessary to apply geometric transformations to the stereo images to make the epipolar lines collinear and parallel to the horizontal axis of the images.
In essence, these transformations rectify the stereo images by simulating a virtual
stereo acquisition from a canonical stereo system through the rotation of the cameras
with respect to their optical center (see Fig. 7.12). The essential steps of this method,
proposed in [20], are the following:

1. Calibrate the cameras to get K , R and T and derive the calibration parameters
of the stereo system.
2. Compute the rotation matrix Rr ect with which to rotate the left camera to map
the left epipole e L to infinity along the x-axis and thus making the epipolar lines
horizontal.
3. Apply the same rotation to the right camera.
4. Calculate for each point of the left image the corresponding point in the new
canonical stereo system.
5. Repeat the previous step even for the right camera.
6. Complete the rectification of the stereo images by adjusting the scale and then
resample.

Fig. 7.12 Rectification of the stereo image planes knowing the extrinsic parameters of the cameras. The left camera is rotated so that the epipole moves to infinity along the horizontal axis. The same rotation is applied to the camera on the right, thus obtaining image planes parallel to the baseline. The horizontal alignment of the epipolar lines is completed by rotating the right camera according to R^{−1} and possibly adjusting the scale by resampling the rectified images

Step 1 calculates the parameters (intrinsic and extrinsic) of the calibration of the
individual cameras and the stereo system. Normally the cameras are calibrated con-
sidering known 3D points, defined with respect to a world reference system. We
indicate with P w (X w , Yw , Z w ) the coordinates in the world reference system, and
with R L , T L and R R , T R the extrinsic parameters of the two cameras, respectively,
the rotation matrices and the translation column vectors. The relationships that project
the point P w in the image plane of the two cameras (according to the pinhole model),
in the respective reference systems, are the following:

P L = RL Pw + T L (7.178)

P R = RR Pw + T R (7.179)

We assume that the two cameras have been independently calibrated with one of the
methods described in Sect. 7.4, and therefore their intrinsic and extrinsic parameters
are known.
If T is the column vector representing the translation between the two optical cen-
ters (the origins of each camera’s reference systems) and R is the rotation matrix that
orients the right camera axes to those of the left camera, then the relative coordinates
of a 3D point P(X, Y, Z ) in the space, indicated with P L = (X L p , Y L p , Z L p ) and
P R = (X R p , Y R p , Z R p ) in the reference system of the two cameras, are related to
each other with the following:

P L = RT P R + T (7.180)
654 7 Camera Calibration and 3D Reconstruction

The extrinsic parameters of the stereo system are computed with Eqs. (7.98) and
(7.99) (derived in Sect. 7.4.4) that we rewrite here

R = R TL R R (7.181)

T = T L − RT T R (7.182)

In the step 2, the rotation matrix Rr ect is calculated for the left camera which
has the purpose of mapping the relative epipole to infinity in the horizontal direction
(x axis) and obtain the horizontal epipolar lines. From the property of the rotation
matrix we know that the column vectors represent the orientation of the rotated axes
(see Note 1). Now let’s see how to calculate the three vectors r i of Rr ect . The new
x-axis must have the direction of the translation column vector T (the baseline vector
joining the optical centers) given by the following unit vector:
$$\boldsymbol{r}_1 = \frac{T}{\|T\|} = \frac{1}{\sqrt{T_x^2 + T_y^2 + T_z^2}} \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix} \qquad (7.183)$$

The second vector r_2 (which is the direction of the new y axis) has only the constraint of being orthogonal to r_1. Therefore, it can be calculated as the normalized cross product of the direction vector (0, 0, 1) of the old z axis (which is the direction of the old optical axis) with r_1, given by

$$\boldsymbol{r}_2 = \frac{[0\;\, 0\;\, 1]^T \times \boldsymbol{r}_1}{\left\| [0\;\, 0\;\, 1]^T \times \boldsymbol{r}_1 \right\|} = \frac{1}{\sqrt{T_x^2 + T_y^2}} \begin{bmatrix} -T_y \\ T_x \\ 0 \end{bmatrix} \qquad (7.184)$$

The third vector r_3 represents the new z axis, which must be orthogonal to the baseline (vector r_1) and to the new y axis (the vector r_2), so it is obtained as the cross product of these vectors:

$$\boldsymbol{r}_3 = \boldsymbol{r}_1 \times \boldsymbol{r}_2 = \frac{1}{\sqrt{(T_x^2 + T_y^2)(T_x^2 + T_y^2 + T_z^2)}} \begin{bmatrix} -T_x T_z \\ -T_y T_z \\ T_x^2 + T_y^2 \end{bmatrix} \qquad (7.185)$$

This results in the rotation matrix given by

$$R_{rect} = \begin{bmatrix} \boldsymbol{r}_1^T \\ \boldsymbol{r}_2^T \\ \boldsymbol{r}_3^T \end{bmatrix} \qquad (7.186)$$
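A minimal sketch of the computation of R_rect from the baseline vector T, following (7.183)–(7.186) (the function name is mine):

```python
import numpy as np

def rectifying_rotation(T):
    """Build R_rect of Eq. (7.186) from the baseline/translation vector T (length 3)."""
    T = np.asarray(T, dtype=float)
    r1 = T / np.linalg.norm(T)                                  # new x-axis, Eq. (7.183)
    r2 = np.array([-T[1], T[0], 0.0]) / np.hypot(T[0], T[1])    # new y-axis, Eq. (7.184)
    r3 = np.cross(r1, r2)                                       # new z-axis, Eq. (7.185)
    return np.vstack([r1, r2, r3])

# As a check, R_rect @ T should give (||T||, 0, 0), in agreement with Eq. (7.189).
```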

We can verify the effect of the rotation matrix Rr ect on the stereo images to be
rectified as follows. Let us now consider the relationship (7.180), which orients the

right camera axes to those of the left camera. Applying to both members Rr ect we
have

Rr ect P L = Rr ect R T P R + Rr ect T (7.187)

from which it emerges that in fact the coordinates of the points of the image of the
left and the right are rectified, by obtaining

P L r = Rr ect P L P Rr = Rr ect R T P R (7.188)

having indicated with P L r and P Rr the rectified points, respectively, in the reference
system of the left and right camera. The correction of the points, according to (7.188),
is obtained considering that
$$R_{rect}\, T = \begin{bmatrix} \boldsymbol{r}_1^T T \\ \boldsymbol{r}_2^T T \\ \boldsymbol{r}_3^T T \end{bmatrix} = \begin{bmatrix} \|T\| \\ 0 \\ 0 \end{bmatrix} \qquad (7.189)$$

hence, replacing in (7.187), we get

$$P_{L_r} = P_{R_r} + \begin{bmatrix} \|T\| \\ 0 \\ 0 \end{bmatrix} \qquad (7.190)$$

Equation (7.190) shows that the rectified points have the same Y and Z coordinates and differ only by the horizontal translation along the X axis. Thus steps 2 and 3 are accomplished.
The corresponding rectified 2D points in the left and right image planes are obtained instead from the following:

$$\boldsymbol{p}_{L_r} = \frac{f}{Z_L}\, P_{L_r} \qquad \boldsymbol{p}_{R_r} = \frac{f}{Z_R}\, P_{R_r} \qquad (7.191)$$
Thus steps 4 and 5 are accomplished. Finally, with step 6, to avoid empty areas in the rectified images, the inverse geometric transformation (see Sect. 3.2 Vol. II) is used to assign to each pixel of the rectified images the corresponding pixel value of the stereo input images, resampling if necessary when the inverse-transformed pixel position falls between 4 pixels of the input image.
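As an illustrative sketch of the inverse mapping for the left image (assuming the principal point at the image center, a focal length f in pixels, and simple nearest-neighbor resampling; names are mine), one could write:

```python
import numpy as np

def rectify_left_image(img, f, R_rect):
    """Inverse-mapping sketch: for each pixel of the rectified image, rotate its
    viewing ray back with R_rect^T, reproject with the pinhole model, and sample
    the original image (nearest neighbor).  For the right image the rotation
    R_rect @ R.T would be used instead, as in Eq. (7.188)."""
    H, W = img.shape[:2]
    out = np.zeros_like(img)
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    for v in range(H):
        for u in range(W):
            ray = R_rect.T @ np.array([u - cx, v - cy, f])   # ray in the original camera
            if ray[2] <= 0:
                continue
            x = f * ray[0] / ray[2] + cx
            y = f * ray[1] / ray[2] + cy
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < W and 0 <= yi < H:
                out[v, u] = img[yi, xi]
    return out
```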

7.5.7 3D Stereo Reconstruction by Triangulation

The 3D reconstruction of the scene can be realized in different ways, in relation to the
knowledge available to the stereo acquisition system. The 3D geometry of the scene
can be reconstructed, without ambiguity, given the 2D projections of the homolo-
gous points of the stereo images, by triangulation, known the calibration parameters
(intrinsic and extrinsic) of the stereo system. If instead only the intrinsic parame-

ters are known, the 3D geometry of the scene can be reconstructed by estimating the extrinsic parameters of the system, up to an undeterminable scale factor. If
the calibration parameters of the stereo system are not available but only the corre-
spondences between the stereo images are known, the 3D structure of the scene is
recovered through an unknown homography transformation.

7.5.7.1 3D Reconstruction Known the Intrinsic and Extrinsic


Parameters
Returning to the reconstruction by triangulation with the stereo system [20], this is direct when the following are known: the calibration parameters, the correspondences of the homologous points p_L and p_R in the image planes, and the linear equations of the homologous rays passing, respectively, through the point p_L and the optical center C_L of the left camera, and through the point p_R and the optical center C_R of the right camera.
The estimate of the coordinates of a 3D point P = (X, Y, Z ) is obtained precisely
by triangulation of the two rays which in ideal conditions intersect at the point
P. In reality, the errors on the estimation of the intrinsic parameters and on the
determination of the position of the projections of P in the stereo image planes
cause no intersection of the rays even if their minimum distance is around P as
shown in Fig. 7.13a, where P̂L and P̂R represent the ends of the segment of the
minimum distance between the rays, to be determined. Therefore, it is necessary to
obtain an estimate P̂ of P as the midpoint of the segment of a minimum distance
between homologous rays. It can be guessed that using multiple cameras having a
triangulation with more rays would improve the estimate of P by calculating the
minimum distance in the sense of least squares (the sum of the squared distances is
zero if the rays are incident in P).
Let us denote by l_L and l_R the nonideal rays, respectively, of the left and right camera, passing through their optical centers C_L and C_R and their projections p_L and

Fig. 7.13 3D reconstruction with stereo vision. a Triangulation with the non-exact intersection of the backprojected rays l_L and l_R for the 3D reconstruction of the point P; b Triangulation through the approach that minimizes the error of the observed projections p_L and p_R of the points P of the scene with respect to the projections u_L and u_R calculated with Eqs. (7.196) and (7.197) using the pinhole projection model defined by the known projection matrices of the cameras

p_R in the image plane. Furthermore, there is only one segment of minimum length, indicated with the column vector v, which is perpendicular to both rays and joins them via the intersection points indicated with P̂_L (endpoint of the segment obtained from the 3D intersection between ray l_L and the segment) and P̂_R (endpoint of the segment obtained from the 3D intersection between ray l_R and the segment), as shown in the figure. The problem is then reduced to finding the coordinates of the endpoints P̂_L and P̂_R of the segment.
We now express in vector form, as a p_L and b p_R with a, b ∈ ℝ, the equations of the two rays, in the respective reference systems, passing through the optical centers C_L and C_R. The endpoints of the segment to be found are expressed with respect to the reference system of the left camera with origin in C_L, whereby, according to (7.94), the equation of the right ray expressed with respect to the reference system of the left camera is R^T b p_R + T, remembering that R and T represent the extrinsic parameters of the stereo system defined by Eqs. (7.98) and
(7.99), respectively. The constraint that the segment, represented by the equation cv
with c ∈ R, is orthogonal to the two rays defines the vector v obtained as a vector
product of the two vectors/rays given by

v = pL × RT p R (7.192)

where also v is expressed in the reference system of the left camera. At this point, we
have that the segment represented by cv will intersect the ray a p L for a given value
of a0 thus obtaining the coordinates of P̂L , an extreme of the segment. a p L + cv
represents the equation of the plane passing through the ray l L and to be orthogonal
to the ray l R must be
a p L + cv = T + b R T p R (7.193)

for certain values of the unknown scalars a, b, and c that can be determined consid-
ering that the vector equation (7.193) can be set as a linear system of 3 equations
(for three-dimensional vectors) in 3 unknowns. In fact, replacing the vector v given
by (7.192), we can solve the following system:

$$a\,\boldsymbol{p}_L + c\left(\boldsymbol{p}_L \times R^T \boldsymbol{p}_R\right) - b\,R^T \boldsymbol{p}_R = T \qquad (7.194)$$

If a_0, b_0, and c_0 are the solution of the system, then the intersection between the ray l_L and the segment gives one endpoint of the segment, P̂_L = a_0 p_L, while the other endpoint is obtained from the intersection of the segment and the ray l_R, given by P̂_R = T + b_0 R^T p_R; the midpoint between the two endpoints finally gives the estimate P̂ reconstructed in 3D, with coordinates expressed in the reference system of the left camera.
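The midpoint method just described, Eqs. (7.192)–(7.194), can be sketched as follows (names are mine; p_L and p_R are the homogeneous image points and R, T the stereo extrinsic parameters):

```python
import numpy as np

def midpoint_triangulation(pL, pR, R, T):
    """Solve the 3x3 linear system of Eq. (7.194) in (a, b, c) and return the
    midpoint of the minimum-distance segment, in the left camera frame."""
    rR = R.T @ pR                         # right ray direction in the left frame
    v = np.cross(pL, rR)                  # segment direction, Eq. (7.192)
    A = np.column_stack([pL, -rR, v])     # coefficients of the unknowns (a, b, c)
    a, b, c = np.linalg.solve(A, T)
    P_hat_L = a * pL                      # endpoint on the left ray
    P_hat_R = T + b * rR                  # endpoint on the right ray
    return 0.5 * (P_hat_L + P_hat_R)      # estimate of the 3D point P
```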

7.5.7.2 3D Reconstruction Known Intrinsic and Extrinsic Parameters


with Linear Triangulation
An alternative method of reconstruction is based on the simple linear triangulation
that solves the problem of the nonintersecting backprojected rays by minimizing the
estimated backprojection error directly in the image planes (see Fig. 7.13b). Known
the projection matrices P L = K L [I| 0] and P R = K R [R| T ] of the two cameras,
the projections p L = (x L , y L , 1) and p R = (x R , y R , 1) of a point P of the 3D space
with coordinates X = (X, Y, Z , 1), in the respective image planes are
$$\boldsymbol{p}_L = P_L X = \begin{bmatrix} P_{L_1}^T X \\ P_{L_2}^T X \\ P_{L_3}^T X \end{bmatrix} \qquad \boldsymbol{p}_R = P_R X = \begin{bmatrix} P_{R_1}^T X \\ P_{R_2}^T X \\ P_{R_3}^T X \end{bmatrix} \qquad (7.195)$$

where P L i and P Ri indicate the rows of the two perspective projection matrices,
respectively. The perspective projections in Cartesian coordinates u L = (u L , v L )
and u R = (u R , v R ) are

$$u_L = \frac{P_{L_1}^T X}{P_{L_3}^T X} \qquad v_L = \frac{P_{L_2}^T X}{P_{L_3}^T X} \qquad (7.196)$$

$$u_R = \frac{P_{R_1}^T X}{P_{R_3}^T X} \qquad v_R = \frac{P_{R_2}^T X}{P_{R_3}^T X} \qquad (7.197)$$

From Eq. (7.196), we can derive two linear equations12 :


 
$$\begin{aligned} \left(u_L P_{L_3}^T - P_{L_1}^T\right) X &= 0 \\ \left(v_L P_{L_3}^T - P_{L_2}^T\right) X &= 0 \end{aligned} \qquad (7.198)$$

Putting in matrix form, we have


 
$$\begin{bmatrix} u_L P_{L_3}^T - P_{L_1}^T \\ v_L P_{L_3}^T - P_{L_2}^T \end{bmatrix} X = \boldsymbol{0}_{2\times 1} \qquad (7.199)$$

12 The same equations can be obtained, for each camera, considering the properties of the vector

product p × (PX) = 0, that is, by imposing the constraint of parallel direction between the vectors.
Once the vector product has been developed, three equations are obtained but only two are linearly
independent of each other.

Proceeding in the same way for the homologous point u R , from (7.197) we get
two other linear equations that we can assemble in (7.199) and we thus have a
homogeneous linear system with 4 equations, given by

$$\begin{bmatrix} u_L P_{L_3}^T - P_{L_1}^T \\ v_L P_{L_3}^T - P_{L_2}^T \\ u_R P_{R_3}^T - P_{R_1}^T \\ v_R P_{R_3}^T - P_{R_2}^T \end{bmatrix} X = \boldsymbol{0}_{4\times 1} \;\Longleftrightarrow\; A_{4\times 4}\, X_{4\times 1} = \boldsymbol{0}_{4\times 1} \qquad (7.200)$$

where it is observed that each pair of homologous points gives the point P in the
3D space of coordinates X = (X, Y, Z , W ) with the fourth unknown component.
Considering the noise present in the localization of homologous points, the solution
of the system is found with the SVD method which estimates the best solution in
the sense of least squares. With this method the 3D estimate of P can be improved
by adding further observations: with N > 2 cameras. In this case two equations of
the type (7.198) would be added to the matrix A for each camera thus obtaining a
homogeneous system with 2N equations always in 4 unknowns, with the matrix A
of size 2N × 4.
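A direct NumPy sketch of this linear (DLT) triangulation is given below (function name is mine; the SVD solution is the right singular vector associated with the smallest singular value):

```python
import numpy as np

def linear_triangulation(PL, PR, uL, uR):
    """Linear triangulation of Eq. (7.200): PL, PR are 3x4 projection matrices,
    uL = (u_L, v_L), uR = (u_R, v_R) are the observed projections."""
    A = np.vstack([uL[0] * PL[2] - PL[0],
                   uL[1] * PL[2] - PL[1],
                   uR[0] * PR[2] - PR[0],
                   uR[1] * PR[2] - PR[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                    # least squares solution of A X = 0
    return X[:3] / X[3]           # dehomogenized 3D point (X, Y, Z)
```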
Recall that the reconstruction of P based on this linear method minimizes the
algebraic error without geometric meaning. To better filter the noise, present in the
correspondences and in the perspective projection matrices, the optimal estimate
can be obtained by setting a nonlinear minimization function (in the sense of the
maximum likelihood estimation) as follows:

$$\min_{\hat{X}} \left\| P_L \hat{X} - \boldsymbol{u}_L \right\|^2 + \left\| P_R \hat{X} - \boldsymbol{u}_R \right\|^2 \qquad (7.201)$$

where X̂ represents the best estimate of the 3D coordinates of the point P. In essence,
X̂ is the best least squared estimate of the backprojection error of P in both images,
seen as the distance in the image plane between its projection (for the respective
cameras are given by Eqs. 7.196 and 7.197) and the related observed measurement
of P always in the image plane (see Fig. 7.13b). In the function (7.201) the backpro-
jection error for the point P is accumulated for both cameras and in the case of N
cameras the error is added and the function to be minimized is


$$\min_{\hat{X}} \sum_{i=1}^{N} \left\| P_i \hat{X} - \boldsymbol{u}_i \right\|^2 \qquad (7.202)$$

solvable with iterative methods (for example, Gauss–Seidel, Jacobi, ...) of nonlinear least squares approximation.

7.5.7.3 3D Reconstruction with only the Intrinsic Parameters Known


In this case, of the stereo system with projection matrices P L = K L [I| 0] and P R =
K R [R| T ], we know a set of homologous points and only the intrinsic parameters

K_L and K_R of the stereo cameras. The 3D reconstruction of the scene is obtained up to an unknown scale factor because the camera setup (camera attitude) is not known. In particular, not knowing the baseline (the translation vector T) of the stereo system, it is not possible to reconstruct the 3D scene at its real scale: the reconstruction is unique only up to an unknown scale factor.
With at least 8 corresponding points it is possible to calculate the fundamental matrix F, and once the calibration matrices K are known it is possible to calculate the essential matrix E (alternatively, E can be calculated directly with (7.106)), which we know includes the extrinsic parameters, that is, the rotation matrix R and the translation vector T. R and T are precisely the unknowns we want to calculate in order to then perform the 3D reconstruction by triangulation. The essential steps of the 3D reconstruction process, knowing the intrinsic parameters of the stereo cameras and a set of homologous points, are the following:

1. Detect a set of corresponding points (at least 8).


2. Estimate the fundamental matrix F with the normalized 8-point algorithm (see
Sect. 7.5.3).
3. Compute the essential matrix E from the fundamental matrix F known the
intrinsic parameter matrices K L and K R .
4. Estimate the extrinsic parameters, that is, the rotation matrix R and the transla-
tion vector T of the stereo system by decomposing E as described in Sect. 7.5.5.
5. Reconstruct the position of the 3D points by triangulation, appropriately selecting R and T among the possible solutions.

In this context only step 4 is analyzed, while the others are immediate since they have already been treated previously. From Sect. 7.5.5, we know that the essential matrix E = [T]_×R can be factored with the SVD method, obtaining E = UΣV^T, where by definition the essential matrix has rank 2 and must admit two equal singular values and a third equal to zero, so we have Σ = diag(1, 1, 0). We also know, from Eqs. (7.142) and (7.143), the existence of the rotation matrix W and the antisymmetric matrix Z such that their product is ZW = diag(1, 1, 0) = Σ, producing the following result:

$$E = U\{\Sigma\}V^T = U\{ZW\}V^T = \underbrace{\left(UZU^T\right)}_{[T]_\times}\underbrace{\left(UWV^T\right)}_{R} = [T]_\times R \qquad (7.203)$$

where the penultimate step is motivated by Eq. (7.140). The orthogonality charac-
teristics of the obtained rotation matrix and the definition of the essential matrix are
thus satisfied. We know, however, that the decomposition is not unique and E is
defined unless a scale factor λ and the translation vector unless the sign. In fact, the
decomposition leads to 4 possible solutions of R and T , and consequently we have
4 possible projection matrices P R = K R [R T ] of the stereo system for the right
camera given by Eq. (7.153), which we rewrite as follows:

[U W V T | λu3 ] [U W V T | − λu3 ] [U W T V T | λu3 ] [U W T V T | − λu3 ] (7.204)



where, according to Eq. (7.151), u_3 = T corresponds to the third column of U. Having obtained 4 potential pairs (R, T), there are 4 possible configurations of the stereo system, obtained by rotating the camera in one direction or in the opposite direction, with the possibility of translating it in two opposite directions, as shown in Fig. 7.10. The choice of the appropriate pair is made for each 3D point to be reconstructed by triangulation, by selecting the pair for which the points lie in front of the stereo system (in the direction of the positive z axis). In particular, we consider each correspondence pair as backprojected to identify the 3D point and determine its depth with respect to both cameras, choosing the solution for which the depth is positive for both.
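One simple way to carry out this selection, sketched below with the hypothetical helpers decompose_essential and linear_triangulation introduced in the earlier sketches, is to triangulate the correspondences with each candidate pair and keep the one that yields the largest number of points with positive depth in both cameras:

```python
import numpy as np

def select_pose(E, K_L, K_R, uL_list, uR_list):
    """Choose the (R, T) pair for which triangulated points lie in front of both cameras."""
    best, best_count = None, -1
    PL = K_L @ np.hstack([np.eye(3), np.zeros((3, 1))])
    for R, T in decompose_essential(E):
        PR = K_R @ np.hstack([R, T.reshape(3, 1)])
        count = 0
        for uL, uR in zip(uL_list, uR_list):
            X = linear_triangulation(PL, PR, uL, uR)   # point in the left camera frame
            zL = X[2]
            zR = (R @ X + T)[2]                        # depth in the right camera frame
            if zL > 0 and zR > 0:
                count += 1
        if count > best_count:
            best, best_count = (R, T), count
    return best
```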

7.5.7.4 3D Reconstruction with Known Only Intrinsic Parameters and


Normalizing the Essential Matrix
As in the previous paragraph, the essential matrix E is calculated up to an unknown scale factor. A normalization procedure [20] of E is considered to normalize the length of the translation vector T to unity.
From Eq. (7.104) of essential matrix E = [T ]× R = S R we have

E T E = (S R)T S R = ST R T RS = ST S (7.205)

where with S we have indicated the antisymmetric matrix associated with the trans-
lation vector T defined by (7.103). Expanding the antisymmetric matrix in (7.205)
we have

$$E^T E = \begin{bmatrix} T_y^2 + T_z^2 & -T_x T_y & -T_x T_z \\ -T_y T_x & T_z^2 + T_x^2 & -T_y T_z \\ -T_z T_x & -T_z T_y & T_x^2 + T_y^2 \end{bmatrix} \qquad (7.206)$$

which shows that the trace of E T E is given by

$$\mathrm{Tr}(E^T E) = 2\,\|T\|^2 \qquad (7.207)$$

To normalize the translation vector to unit length, the essential matrix is normalized as follows:

$$\hat{E} = \frac{E}{\sqrt{\mathrm{Tr}(E^T E)/2}} \qquad (7.208)$$

while the normalized translation vector is given by

$$\hat{T} = \frac{T}{\|T\|} = \frac{[T_x\;\, T_y\;\, T_z]^T}{\sqrt{T_x^2 + T_y^2 + T_z^2}} = \begin{bmatrix} \hat{T}_x & \hat{T}_y & \hat{T}_z \end{bmatrix}^T \qquad (7.209)$$

According to the normalization defined with (7.208) and (7.209) the matrix (7.206)
is rewritten as follows:
$$\hat{E}^T \hat{E} = \begin{bmatrix} 1 - \hat{T}_x^2 & -\hat{T}_x\hat{T}_y & -\hat{T}_x\hat{T}_z \\ -\hat{T}_y\hat{T}_x & 1 - \hat{T}_y^2 & -\hat{T}_y\hat{T}_z \\ -\hat{T}_z\hat{T}_x & -\hat{T}_z\hat{T}_y & 1 - \hat{T}_z^2 \end{bmatrix} \qquad (7.210)$$

At this point, the components of the vector T̂ can be derived from any row or column of the matrix Ê^T Ê given by (7.210). Indeed, indicating it for simplicity with E = Ê^T Ê, the components of the translation vector T̂ are derived from the following:

$$\hat{T}_x = \pm\sqrt{1 - E_{11}} \qquad \hat{T}_y = -\frac{E_{12}}{\hat{T}_x} \qquad \hat{T}_z = -\frac{E_{13}}{\hat{T}_x} \qquad (7.211)$$

Since the elements of E are quadratic in the components of T̂, the latter can differ from the real ones in sign. The rotation matrix R can be calculated knowing the normalized essential matrix Ê and the normalized vector T̂, albeit with the ambiguity in the sign. For this purpose the 3D vectors are defined:

wi = Ê i × T̂ (7.212)

where Ê_i indicates the i-th row of the normalized essential matrix (i = 1, 2, 3). From these vectors w_i, through simple algebraic calculations, the rows of the rotation matrix are calculated as

$$R = \begin{bmatrix} R_1^T \\ R_2^T \\ R_3^T \end{bmatrix} = \begin{bmatrix} (\boldsymbol{w}_1 + \boldsymbol{w}_2 \times \boldsymbol{w}_3)^T \\ (\boldsymbol{w}_2 + \boldsymbol{w}_3 \times \boldsymbol{w}_1)^T \\ (\boldsymbol{w}_3 + \boldsymbol{w}_1 \times \boldsymbol{w}_2)^T \end{bmatrix} \qquad (7.213)$$

Due to the double ambiguity in the sign of E and T̂ we have 4 different estimates of
pairs of possible solutions for ( T̂ , R). In analogy to what was done in the previous
paragraph, the choice of the appropriate pair is made through the 3D reconstruction
starting from the projections to solve the ambiguity. In fact, for each 3D point, the
third component is calculated in the reference system of the left camera considering
the 4 possible pairs of solutions ( T̂ , R). The relation that for a point P of the 3D
space links the coordinates P L = (X L , Y L , Z L ) and P R = (X R , Y R , Z R ) among the
reference systems of the stereo cameras is given by (7.93), that is, P R = R( P L − T )
to reference P with respect to the left camera, from which we can derive the third
component Z R :
Z R = R3T ( P L − T̂ ) (7.214)

and from the relation (6.208), which links the point P and its projection in the right image, we have

$$\boldsymbol{p}_R = \frac{f_R}{Z_R}\, P_R = \frac{f_R\, R(P_L - \hat{T})}{R_3^T (P_L - \hat{T})} \qquad (7.215)$$

from which we derive the first component of p R given by

$$x_R = \frac{f_R\, R_1^T (P_L - \hat{T})}{R_3^T (P_L - \hat{T})} \qquad (7.216)$$

In analogy to (7.215), we have the equation that links the coordinates of P in the left
image plane:
$$\boldsymbol{p}_L = \frac{f_L}{Z_L}\, P_L \qquad (7.217)$$

Replacing (7.217) in (7.216) and solving with respect to Z_L, we get

$$Z_L = f_L\, \frac{(f_R R_1 - x_R R_3)^T \hat{T}}{(f_R R_1 - x_R R_3)^T \boldsymbol{p}_L} \qquad (7.218)$$

From (7.217), we get P_L and, considering (7.218), we finally get the 3D coordinates of P in the reference systems of the two cameras:

$$P_L = \frac{(f_R R_1 - x_R R_3)^T \hat{T}}{(f_R R_1 - x_R R_3)^T \boldsymbol{p}_L}\;\boldsymbol{p}_L \qquad P_R = R(P_L - \hat{T}) \qquad (7.219)$$
Therefore, being able to calculate, for each point to be reconstructed, the depth coordinates Z_L and Z_R for both cameras, it is possible to choose the appropriate pair (R, T̂), that is, the one for which both depths are positive, because the scene to be reconstructed is in front of the stereo system. Let us summarize the essential steps of the algorithm:

1. Given the correspondences of homologous points estimate the essential matrix


E.
2. Computes the normalized translation vector T̂ with (7.211).
3. Computes the rotation matrix R with Eqs. (7.212) and (7.213).
4. Computes the depths Z L and Z R for each point P with Eqs. (7.217)–(7.219).
5. Examine the sign of the depths Z_L and Z_R of the reconstructed point:

a. If both are negative for some point, change the sign of T̂ and go back to step 4.
b. Otherwise, if one is negative and the other positive for some point, change the sign of each element of the matrix Ê and go back to step 3.
c. Otherwise, if both depths of the reconstructed points are positive, the algorithm terminates.

Recall that the 3D points of the scene are reconstructed up to an unknown scale factor.
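A compact sketch of steps 2 and 3 of this algorithm, implementing (7.208)–(7.213), is given below (names are mine; the sign ambiguities and the check of steps 4–5 are not handled here, and the + sign is chosen in (7.211), assuming T̂_x ≠ 0):

```python
import numpy as np

def pose_from_normalized_essential(E):
    """Recover T_hat (Eq. 7.211) and R (Eqs. 7.212-7.213) from the essential
    matrix, up to the sign ambiguities discussed in the text."""
    E_hat = E / np.sqrt(np.trace(E.T @ E) / 2.0)     # Eq. (7.208)
    EtE = E_hat.T @ E_hat                            # Eq. (7.210)
    Tx = np.sqrt(max(1.0 - EtE[0, 0], 0.0))          # Eq. (7.211), + sign, Tx != 0 assumed
    Ty = -EtE[0, 1] / Tx
    Tz = -EtE[0, 2] / Tx
    T_hat = np.array([Tx, Ty, Tz])
    w = [np.cross(E_hat[i], T_hat) for i in range(3)]    # Eq. (7.212)
    R = np.vstack([w[0] + np.cross(w[1], w[2]),          # Eq. (7.213)
                   w[1] + np.cross(w[2], w[0]),
                   w[2] + np.cross(w[0], w[1])])
    return R, T_hat
```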

7.5.7.5 3D Reconstruction with Known only the Correspondences of


Homologous Points
In this case, we have only N ≥ 8 correspondences and a completely uncalibrated
stereo system available without the knowledge of intrinsic and extrinsic parameters.
In 1992, three groups of researchers [21–23] independently dealt with the problem of 3D reconstruction starting from uncalibrated cameras, and all three approaches were based on projective geometry.
The proposed solutions reconstructed the scene not unambiguously but up to a projective transformation of the scene itself. The fundamental matrix F can be
estimated from the N correspondences in the stereo system together with the location
of the epipoles e L and e R . The matrix F does not depend on the choice of the 3D
point reference system in the world, while it is known that this dependency exists for
the projection matrices P L and P R of the stereo cameras. For example, if the world
coordinates are rotated the camera projection matrix varies while the fundamental
matrix remains unchanged.
In particular, if H is a projective transformation matrix in 3D space, then the
fundamental matrix associated with the pair of projection matrices of the stereo
cameras (P L , P R ) and (P L H, P R H) are the same (we recall from Sect. 7.5.2.1 that
the relation between fundamental and homography matrix is given by F = [e R ]× H).
It follows that, although a pair of projection matrices (P_L, P_R) of the cameras uniquely determines a fundamental matrix F, the converse does not hold. Therefore, the camera matrices are defined only up to a projective transformation with respect to the fundamental matrix. This ambiguity can be controlled by choosing appropriate projection matrices consistent with the fundamental matrix F, such that

P L = [I| 0] and P R = [[e R ]× F| e R ]

as described in [10]. With these matrices, the 3D points are then triangulated by backprojection of the corresponding projections.
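A sketch of the construction of such a canonical camera pair from F (the epipole e_R is obtained as the left null vector of F, i.e., the singular vector associated with the null singular value; the function name is mine):

```python
import numpy as np

def projective_camera_pair(F):
    """Canonical camera pair compatible with F: P_L = [I | 0], P_R = [[e_R]x F | e_R]."""
    U, _, _ = np.linalg.svd(F)
    e_R = U[:, -1]                      # left null vector of F (F^T e_R = 0)
    ex = np.array([[0.0, -e_R[2], e_R[1]],
                   [e_R[2], 0.0, -e_R[0]],
                   [-e_R[1], e_R[0], 0.0]])
    P_L = np.hstack([np.eye(3), np.zeros((3, 1))])
    P_R = np.hstack([ex @ F, e_R.reshape(3, 1)])
    return P_L, P_R
```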
In summary, it is shown that in the context of uncalibrated cameras the ambiguity in the reconstruction is attributable only to an arbitrary projective transformation. In particular, given a set of correspondences for a stereo system, the fundamental matrix is uniquely determined, the camera matrices are then estimated, and the scene can be reconstructed from these correspondences alone. It should be noted, however, that any two reconstructions from these correspondences are equivalent from the projective point of view, that is, the reconstruction is not unique but defined up to a projective transformation (see Fig. 7.14).
The ambiguity of 3D reconstruction from uncalibrated cameras is formalized by
the following projective reconstruction theorem [10]:

Theorem 7.3 Let p_{L_i} ↔ p_{R_i} be the correspondences of homologous points in the stereo images and F the uniquely determined fundamental matrix that satisfies the relation p_{R_i}^T F p_{L_i} = 0 ∀ i.
Let (P_L^{(1)}, P_R^{(1)}, {P_i^{(1)}}) and (P_L^{(2)}, P_R^{(2)}, {P_i^{(2)}}) be two possible reconstructions associated with the correspondences p_{L_i} ↔ p_{R_i}.

Fig. 7.14 Ambiguous 3D reconstruction from a non-calibrated stereo system with only the projections of the homologous points known. The 3D reconstruction, although the structure of the scene emerges, takes place up to an unknown projective transformation

Then, there exists a nonsingular matrix H_{4×4} such that

$$P_L^{(2)} = P_L^{(1)} H^{-1} \qquad P_R^{(2)} = P_R^{(1)} H^{-1} \qquad \text{and} \qquad P_i^{(2)} = H P_i^{(1)}$$

for all i, except for those i such that F p_{L_i} = p_{R_i}^T F = 0 (coincident with the epipoles related to the stereo images).

In essence, the 3D triangulation of the points P_i is reconstructed up to an unknown projective transformation H_{4×4}. In fact, if the reconstructed 3D points are transformed with a projective matrix H, these become

$$P_i' = H_{4\times 4}\, P_i \qquad (7.220)$$

with the associated projection matrices of the stereo cameras:

$$P_L' = P_L H^{-1} \qquad P_R' = P_R H^{-1} \qquad (7.221)$$

but the original projection points p_{L_i} ↔ p_{R_i} are the same (together with F), as verified as follows:

$$p_{L_i} = P_L P_i = P_L H^{-1} H P_i = P_L' P_i' \qquad p_{R_i} = P_R P_i = P_R H^{-1} H P_i = P_R' P_i' \qquad (7.222)$$

assuming H is an invertible matrix and the fundamental matrix is uniquely determined.
The ambiguity can be reduced if additional information on the 3D scene to be reconstructed is available, or by using additional information on the stereo system. For example, with 3D fiducial points available (at least 5 points), the ambiguity introduced by the projective transformation can be eliminated, obtaining a reconstruction associated with a real metric.
In [20,24], two 3D reconstruction approaches with uncalibrated cameras are presented. Both approaches are based on the fact that the reconstruction is not unique, so that a basic projective transformation can be chosen arbitrarily. This is defined by choosing

only 5 3D points (of which 4 must not be coplanar), of the N of the scene, used to
define a basic projective transformation. The first approach [20], starting from the
basic projective finds the projection matrices (known the epipoles) with algebraic
methods while the second approach [24] uses a geometric method based on the
epipolar geometry to select the reference points in the image plans.

References
1. S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera. Int. J. Comput.
Vis. 8(2), 123–151 (1992)
2. B. Caprile, V. Torre, Using vanishing points for camera calibration. Int. J. Comput. Vis. 4(2),
127–140 (1990)
3. R.Y. Tsai, A versatile camera calibration technique for 3d machine vision. IEEE J. Robot.
Autom. 4, 323–344 (1987)
4. J. Heikkila, O. Silvén, A four-step camera calibration procedure with implicit image correction,
in IEEE Proceedings of Computer Vision and Pattern Recognition (1997), pp 1106–1112
5. O.D. Faugeras, G. Toscani, Camera calibration for 3d computer vision, in International Work-
shop on Machine Vision and Machine Intelligence (1987), pp. 240–247
6. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
7. R.K. Lenz, R.Y. Tsai, Techniques for calibration of the scale factor and image center for high
accuracy 3-d machine vision metrology. IEEE Trans. Pattern Anal Mach Intell 10(5), 713–720
(1988)
8. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins, 1996). ISBN
978-0-8018-5414-9
9. Z. Zhang, A flexible new technique for camera calibration. Technical Report MSR- TR-98-71
(Microsoft Research, 1998)
10. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edn. (Cambridge University Press, 2003)
11. O. Faugeras, Three-Dimensional Computer Vision: A Geometric Approach (MIT Press, Cam-
bridge, Massachusetts, 1996)
12. J. Vince, Matrix Transforms for Computer Games and Animation (Springer, 2012)
13. H.C. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections.
Nature 293, 133–135 (1981)
14. R.I. Hartley, In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 19(6), 580–593 (1997)
15. Q.-T. Luong, O. Faugeras, The fundamental matrix: theory, algorithms, and stability analysis.
Int. J. Comput. Vis. 1(17), 43–76 (1996)
16. D. Nistér, An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 756–777 (2004)
17. O. Faugeras, S. Maybank, Motion from point matches : Multiplicity of solutions. Int. J. Comput.
Vis. 4, 225–246 (1990)
18. T.S. Huang, O.D. Faugeras, Some properties of the E matrix in two-view motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. 11(12), 1310–1312 (1989)
19. C. Loop, Z. Zhang, Computing rectifying homographies for stereo vision, in IEEE Conference
of Computer Vision and Pattern Recognition (1999), vol. 1, pp. 125–131
20. E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision (Prentice Hall, 1998)
21. R. Mohr, L. Quan, F. Veillon, B. Boufama, Relative 3d reconstruction using multiples uncali-
brated images. Technical Report RT 84-I-IMAG LIFIA 12, Lifia-Irimag (1992)

22. O.D. Faugeras, What can be seen in three dimensions from an uncalibrated stereo rig, in ECCV
European Conference on Computer Vision (1992), pp. 563–578
23. R. Hartley, R. Gupta, T. Chang, Stereo from uncalibrated cameras, in IEEE CVPR Computer
Vision and Pattern Recognition (1992), pp. 761–764
24. R. Mohr, L. Quan, F. Veillon, Relative 3d reconstruction using multiple uncalibrated images.
Int. J. Robot. Res. 14(6), 619–632 (1995)
Index

Symbols B
2.5D Sketch map, 342 background modeling
3D representation based on eigenspace, 564, 565
object centered, 344 based on KDE, 563, 564
viewer centered, 344 BS based on GMM, 561, 562
3D stereo reconstruction BS with mean/median background, 558, 559
by linear triangulation, 658 BS with moving average background, 559,
by triangulation, 656 560
knowing intrinsic parameters & Essential BS with moving Gaussian average, 559, 560
matrix, 661 BS-Background Subtraction, 557, 558
knowing only correspondences of non-parametric, 566, 567
homologous points, 664 parametric, 565, 566
knowing only intrinsic parameters, 659 selective BS, 560, 561
3D world coordinates, 605 backpropagation learning algorithm
batch, 119
online, 118
A stochastic, 118
active cell, 324 Bayes, 30
Airy pattern, 466 classifier, 48
albedo, 416 rules, 37, 39
aliasing, 490, 491 theorem, 38
alignment Bayesian learning, 56
edge, 340 bias, 51, 62, 91, 93
image, 533, 534, 646 bilinear interpolation, 525, 526
pattern, 180 binary coding, 455, 458, 462
ambiguous 3D reconstruction, 665 binary image, 231, 496, 497
angular disparity, 355 binocular fusion, 351
anti-aliasing, 490, 491 binocular fusion
aperture problem, 483, 484, 498, 499, 514, 515 fixation point, 352
artificial vision, 316, 348, 393 horopter, 353
aspect ratio, 599, 606, 608, 609 Vieth-Müller circumference, 353
associative area, 369 binocular vision
associative memory, 229 angular disparity calculation, 389
autocorrelation function, 276, 282 computational model, 377
© Springer Nature Switzerland AG 2020 669
A. Distante and C. Distante, Handbook of Image Processing and Computer Vision,
https://doi.org/10.1007/978-3-030-42378-0
670 Index

depth calculation with parallel axes, 385 Mixtures of Gaussian, 66


Marr-Poggio algorithm I, 378 MultiLayer Perceptrons - MLP, 110
Marr-Poggio algorithm II, 380 neural network, 87
PMF algorithm, 406 nonmetric methods, 125
triangulation equation, 387 statistical, 37
binocular disparity, 358, 373, 374, 391 Cauchy, 199
binomial distribution, 78, 142 CBIR-Content-Based Image Retrieval, 313
bipartite graph, 539, 540 center of mass, 59, 591, 592
blurred image, 466, 472 central limit theorem, 50, 58
blurring centroid, 24
circle, 473 child node, 133
filter, 470 Cholesky factorization, 594, 595, 621
Boltzmann machine, 236 Chow’s rule, 46
bounding box, 557, 558 clustering, 2
BRDF-Bidirectional Reflectance Distribution agglomerative hierarchical, 149
Function, 415 divisive hierarchical, 152
Brewster stereoscope, 354 Hierarchical, 148
brightness continuity equation, see irradiance K-means, 30
constancy constraint equation clustering methods, 4
Brodatz’s texture mosaic, 294 CND image, 313
bundle adjustment, 590, 591 CNN-Convolutional Neural Network, 240
coherence measure, 306
C collineation, 616
calibration, see camera calibration collision time estimation, 573, 574
calibration matrix, 615, 630, 652 knowing the FOE, 579, 580
calibration sphere, 436, 437 color space, 16
CAM-Content-Addressable Memory, 225 complex motion estimation
camera background, 556, 557
coordinates, 614, 626 foreground, 557, 558
camera calibration motion parameters calculation by OF, 580,
accuracy, 625 581
algorithms, 603 complex conjugate operator, 447
equation, 589, 590 Computational complexity, 147
extrinsic parameters, 589, 590 confusion circle, 466, 473
intrinsic parameters, 588, 589 contrast, 274, 309, 394
matrix, 588, 589 convolution
platform, 603, 616 filter, 241, 292, 380
radial distortions, 623 mask, 241, 243, 291, 381, 511, 512
stereo vision, 625 theorem, 278
tangential distortion, 601 convolutional layer, 242
Tsai method, 605 CoP-Center of Projection, 567, 568
Zhang method, 616 correlation matrix, 15, 219
camera projection matrix, 585, 586, see also correspondence structure detection
camera calibration local POIs, 396
category level classification, 3 point-like elementary, 394
Bayesian discriminant, 57 strategy, 393
deterministic, 17 correspondence problem, 374, 390, 483, 484,
FCM-Fuzzy C-Means , 35 536, 537, 539, 540, 636, 646
Gaussian probability density, 58 covariance matrix, 13, 50, 66, 545, 546, 564,
interactive, 17 565
ISODATA, 34 Cover’s theorem, 194
cross correlation function, 397 eigenvalue, 13, 24, 612


cross validation, 86, 103 eigenvector, 13, 514, 515, 564, 565, 606, 612
CT-Census Transform, 403 EKF-Extended Kalman Filter, 556, 557
electric-chemical signal, 91
D electrical signal, 89
data compression, 16, 217 EM-Expectation–Maximization, 32, 67, 69
decision trees algorithm, 127 epipolar constraint, 632
C4.5, 137 epipolar geometry, 394, 408, 627
CART, 143 epipolar line, 389, 628
ID3, 129 epipolar plane, 389, 628
deep learning, 238 epipole, 390, 628
CNN architectures, 256 Essential matrix, 629
dropout, 251 5-point algorithm, 641
full connected layer, 246 7-point algorithm, 641
pooling layer, 245 8-point algorithm, 638
stochastic gradient descent, 249 8-point normalization, 641
defocusing, 468, 474 decomposition, 642
delta function, 84 Euclidean distance, 28, 59, 395, 408, 560, 561
depth calculation before collision, 576, 577 Euler-Lagrange equation, 441
knowing the FOE, 579, 580 extrinsic parameter estimation
depth map, 342, 374, 375, 426, 453, 472 from perspective projection matrix P, 614
depth of field, 465, 468
DFS-Deterministic Finite State, 169 F
DFT-Discrete Fourier Transform, 445 f# number, 473
diagonalization, 13 factorization methods, 623, 648
diffuse reflectance, see Lambertian model false negative, 381
diffuse reflection, 416 false positive, 381, 563, 564
digitization, 388, 599 feature
Dirichlet tessellation, 30, see also Voronoi extraction, 194
diagram homologous, 497, 498
Discrete Cosine Transform-DCT, 471 selection, 4, 218
disparity map, 380, 403 significant, 7, 10, 241
dispersion space, 8
function, 466 vector, 4, 282
matrix, 23 filter
measure, 23 bandpass, 296
parameter, 467 binomial, 465
displacement vector, 582, 583 Gabor, 295, 365
distortion function, 623 Gaussian, 304
distortion measure, 31 Gaussian bandpass, 297
divide and conquer, 128, 147 high-pass, 469
DLT-Direct Linear Transformation, 604 Laplacian of Gaussian, 322
DoG-Difference of Gaussian, 323, 537, 538 low-pass, 468
median, 400
E smoothing, 322
early vision, 342 Fisher’s linear discriminant function, 21
eccentricity, 345 FOC-Focus Of Contraction, 571–574
edge extraction algorithms, 304, 310, 322, 324 focal length, 387, 414, 452, 587, 588, 603, 610
ego-motion, 580, 581 FOE-Focus Of Expansion, 571, 572
eigen decomposition, 60 calculation, 577, 578
eigenspace, 564, 565 Fourier descriptors, 8
Fourier transform, 278 dissimilarity measures SSD & SAD, 399


power spectrum, 278, 279 gradient-based matching, 404
spectral domain, 279, 475 non metric RD, 401
Freeman’s chain code, 155 similarity measures, 394
Frobenius norm, 639 Hopfield network, 225
Fundamental and Homography matrix: Hough transform, 578, 579
relationship, 636 human binocular vision, 350, 351
Fundamental matrix, 634 human brain, 87, 211
fundamental radiometry equation, 417 human visual system, 265, 315, 348, 350, 374,
480
G hyperplane equation, 62
Gabor filter bank, 300
Gaussian noise, 287, 554, 555 I
Gaussian probability density, 559, 560
ill conditioned, see ill-posed problems
Gaussian pyramid, 537, 538
ill-posed problems, 201
GBR-Generalized Bas-Relief transform, 435
illumination
generalized cones, 345
incoherent, 466
geometric collineation, see homography
Lambertian, 425
transformation
illusion, 356, 485, 486
geometric distortion, 388, 448, 599
image compression, 16
geometric transformation, 296, 388, 581, 582,
image filtering, 291
588, 589, 599, 645
geometric transformation in image formation, image gradient, 304, 503, 504
602 image irradiance
Gestalt theory, 326 Lambertian, 416
Gini index, 143 image irradiance fundamental equation, 414
GLCM-Gray-Level Co-occurrence Matrix, image resampling, 649
270, 310 impulse response, 290
gradient space, 417, 419 incident irradiance, 416
gradient vector, 97 Information gain, 131
graph isomorphism, 409 infrared-sensitive camera, 451
Green function, 203 inner product, 94
interpolation matrix, 198
H interpolation process, 396, 426
Hamming distance, 230 intrinsic parameter estimation
harmonic function, 296 from homography matrix H, 619
Harris corner detector, 532, 533, 616 from perspective projection matrix P, 612
Helmholtz associationism, 326 intrinsic image, 307
Hessian matrix, 530, 531 inverse geometric transformation, 655
hierarchical clustering algorithms inverse geometry, 464
agglomerative, 149 inverse problem, 315, 349, 413, 590, 591
divisive, 151 irradiance constancy constraint equation, 501,
high-speed object tracking by KF, 544, 545 502
histogram, 77, 267 iso-brightness curve, 421
homogeneous coordinates, 528, 529 isomorphism, 171, 539, 540
homography matrix, 463, 617, 637 isotropic
calculation by SVD decomposition, 617 fractals, 288
homography transformation, 616 Gaussian function, 208
homologous structures calculation Laplace operator, 469
census transform, 403 iterative numerical methods - sparse matrix,
correlation measures, 398 521, 522
J moment
Jacobian central, 268, 309
function, 529, 530 inertia, 274
matrix, 531, 532 normalized spatial, 8
momentum, 121, 273
K motion discretization
KDE-Kernel Density Estimation, 563, 564 aperture problem, 498, 499
kernel function, 81 frame rate, 487, 488
KF-Kalman filter, 544, 545 motion field, 494, 495
ball tracking example, 546, 547, 553, 554 optical flow, 494, 495
gain, 545, 546 space–time resolution, 492, 493
object tracking, 543, 544 space-time frequency, 493, 494
state correction, 549, 550 time-space domain, 492, 493
state prediction, 545, 546 visibility area, 492, 493
KLT algorithm, 536, 537 motion estimation
kurtosis, 269 by compositional alignment, 532, 533
by inverse compositional alignment, 533,
L 534
Lambertian model, 321, 416, 430 by Lucas–Kanade alignment, 526, 527
Laplace operator, 281, see also LOG-Laplacian by OF pure rotation, 572, 573
of Gaussian by OF pure translation, 571, 572
LDA-Linear Discriminant Analysis, 21 by OF-Optical Flow, 570, 571
least squares approach, 514, 515, 618 cumulative images difference, 496, 497
lens image difference, 496, 497
aperture, 466 using sparse POIs, 535, 536
crystalline, 387 motion field, 485, 486, 494, 495
Gaussian law, 465 MRF-Markov Random Field, 286, 476
line fitting, 513, 514 MSE-Mean Square Error, 124, 544, 545
local operator, 308 multispectral image, 4, 13, 16, 17
LOG-Laplacian of Gaussian, 290, 380
LSE-Least Square Error, 509, 510
LUT-Look-Up-Table, 436 N
NCC-Normalized Cross-Correlation, 527, 528
needle map, 426, see also orientation map
M
neurocomputing biological motivation
Mahalonobis distance, 59
MAP-Maximum A Posterior, 39, 67 mathematical model, 90
mapping function, 200 neurons structure, 88
Marr’s paradigm synaptic plasticity, 89
algorithms and data structures level, 318 neuron activation function, 90
computational level, 318 ELU-Exponential Linear Units, 245
implementation level, 318 for traditional neural network, 91
Maximum Likelihood Estimation, 49 Leaky ReLU, 245
for Gaussian distribution & known mean, 50 Parametric ReLU, 245
for Gaussian with unknown µ and , 50 properties of, 122
mean-shift, 566, 567 ReLU-Rectified Linear Units, 244
Micchelli’s theorem, 199 Neyman–Pearson criterion, 48
minimum risk theory, 43 nodal point, 352
MLE estimator distortion, 51 normal map, 441
MND-Multivariate Normal Distribution, 58 normal vector, 417, 419, 430
MoG-Mixtures of Gaussian, 66, see also normalized coordinate, 632
EM-Expectation–Maximization NP-complete problem, 147
O using Kalman filter, 542, 543


optical center, 385 using POIs probabilistic correspondence,
optical flow estimation 540, 541
affine motion, 521, 522 polar coordinate, 280, 452
BBPW method, 517, 518 preface, vii
brightness preservation, 502, 503 Prewitt kernel, 310
discrete least-squares, 509, 510 primary visual cortex, 360, 366
homogeneous motion, 522, 523 color perception area, 372
Horn-Schunck, 504, 505 columnar organization, 366
Horn-Schunck algorithm, 512, 513 complex cells, 364
iterative refinement, 523, 524 depth perception area, 373
large displacements, 523, 524 form perception area, 373
Lucas-Kanade, 513, 514 Hubel and Wiesel, 362
Lucas-Kanade variant, 516, 517 hypercomplex cells, 365
multi-resolution approach, 525, 526 interaction between cortex areas, 369
rigid body, 566, 567 movement perception area, 373
spatial gradient vector, 502, 503 neural pathway, 368
time gradient, 502, 503 receptive fields, 362
orientation map, 342, 348, 419, 426, 440 simple cells, 362
Oriented Texture Field, 303 visual pathway, 365
orthocenter theorem, 610 projective reconstruction theorem, 664
orthogonal matrix properties, 607 projective transformation, see homography
orthographic projection, 418, 591, 592 transformation
orthonormal bases, 607, 608 pseudo-inverse matrix, 209
OTF-Optical Transfer Function, 468 PSF-Point Spread Function, 466
outer product, 608, 632 pyramidal cells, 360
pyramidal neurons, 360
P
Parseval’s theorem, 446 Q
Parzen window, 81 QR decomposition, 615
PCA-Principal Component Analysis, 11, 13, quadratic form, 59, 61
27, 301, 564, 565 quantization error, 218
principal plane, 14
Pearson’s correlation coefficient, 15 R
perspective projection, 449 radial optical distortions
canonical matrix of, 587, 588, 603 barrel effect, 601
center, 587, 588 estimation, 622
equations, 567, 568 pincushion effect, 601
matrix, 568, 569, 589, 590, 600 radiance emitted, 416
matrix estimation, 610 radiant energy, 416
non-linear matrix estimation, 615 radiant flux, 416
pinhole model, 585, 586 random-dot stereograms, 356
physical coherence, 377 rank distance, 402
pinhole model, 452, 484, 485, 653 rank transform, 401
pixel coordinates image, 588, 589 Rayleigh theorem, 446
POD-proper orthogonal decomposition, see RBC-Reflected Binary Code, 456
PCA RBF-Radial Basis Function, 198
POI-Point of interest, 393, 454, 495, 496, 514, RD-Rank Distance, see rank transform
515, 536–541 rectification of stereo images
POIs tracking calibrated, 652
using graphs similarity, 539, 540 non calibrated, 646
recurrent neural architecture, 223, see also feature selection, 219


Hopfield network Hebbian learning, 213
reflectance coefficient, 416 topological ordering, 218
reflectance map, 414 space-time coherence, 525, 526
regularization theory, 201 space-time gradient, 517, 518
Green function, 203 sparse elementary structures, 406
parameter, 202 sparse map, 495, 496
Tikhonov functional, 202 spatial coherence, 501, 502
retinal disparity, 351 spectral band, 4, 40
crossed, 353 specular reflection, 416
uncrossed, 353, 356 SPI-Significant Points of Interest, 495, 496
rigid body transformation, 589, 590 SSD-Sum of Squared Difference, 400, 528,
529
S standardization procedure, 14
SAD-Sum of Absolute Difference, 400, 528, stereo photometry
529 calibrated, 436
SfM-Structure from Motion, 585, 586 diffuse light equation, 430
3D reconstruction, 590, 591 uncalibrated, 433
SVD decomposition methods, 590, 591 stereopsis, 356
Shannon-Nyquist sampling theorem, 493, 494 neurophysiological evidence, 358
Shape from Shading equation, 420 string recognition
Shape from X, 348 Boyer–Moore algorithm, 172
contour, 423 edit distance, 183
defocus, 465, 472 supervised learning
focus, 465, 468 artificial neuron, 90
pattern with phase modulation, 459 backpropagation, 113
shading, 413, 420, 424 Bayesian, 52, 53
stereo, 348 dynamic, 121
stereo photometry, 426 gradient-descent methods, 100
structured colored patterns, 463 Ho-Kashyap, 104
structured light, 451 perceptron, 95
structured light binary coding, 454 Widrow–Hoff, 103
structured light gray code, 456 surface reconstruction
structured light gray level, 458 by global integration, 443
texture, 448 by local integration, 442
SIFT tracking, 539, 540 from local gradient, 441
SIFT-Scale-Invariant Feature Transform, 537, from orientation map, 440
538 SVD decomposition, 622, 639
descriptor, 537, 538 SVD-Singular Value Decomposition, 209, 433
detector, 537, 538 syntactic recognition, 154
SIFT-Scale-Invariant Feature Transform, 539, ascending analysis, 166
540 descending analysis, 165
similarity function, see correlation function formal grammar, 156
simulated annealing, 237 grammars types, 161
skew parameter, 588, 589, 606 language generation, 158
skewness, 269
SML-Sum Modified Laplacian, 471 T
smoothness constraint, 509, 510 tangential optical distortions, 601
smoothness error, 504, 505 template image tracking, 528, 529
SOM-Self-Organizing Map temporal derivatives, 511, 512
competitive learning, 211 tensor
matrix, 515, 516 hierarchical clustering, 148


structure, 514, 515 K-means, 30
texture, 261 Kohonen map, 210
based on autocorrelation, 276 unwrapped phase, 459
based on co-occurrence matrix, 272
based on edge metric, 281 V
based on fractals models, 286 vanishing point, 604, 610
based on Gabor filters, 295 Vector Quantization theory, 220
based on Run Length primitives, 283 vision strategy
based on spatial filtering, 290 bottom-up hierarchical control, 316
coherence, 306 hybrid control, 317
Julesz’s conjecture of, 264 nonhierarchical control, 317
oriented field of, 303
top-down hierarchical control, 317
perceptive features of, 308
visual tracking, 479
spectral method of, 278
Voronoi diagram, 30
statistical methods of, 267
syntactic methods for, 302
texture visual perception, 261 W
thin lens, 465 warped image, 528, 529
thin lens formula, 472 warping transformation, see geometric
tracking of POIs transformation
using SIFT POIs, 537, 538 wavelet transform, 266, 295, 471
trichromatic theory, 359 white noise, 546, 547
TTC-Time To Collision, see collision time whitening transform, 60
estimation wrapped phase, 460

U Z
unsupervised learning zero crossing, 322, 324, 380, 495, 496
brain, 89 ZNCC-Zero Mean Normalized Cross-
Hebbian, 213 Correlation, 398
