
Universidad Politécnica de Madrid
Escuela Técnica Superior
de Ingenieros de Telecomunicación
Visual Object Tracking in Challenging
Situations using a Bayesian Perspective
Seguimiento visual de objetos en situaciones
complejas mediante un enfoque bayesiano
Ph.D. Thesis
Tesis Doctoral
Carlos Roberto del Blanco Adán
Ingeniero de Telecomunicación
2010
Departamento de Señales, Sistemas y
Radiocomunicaciones
Escuela Técnica Superior
de Ingenieros de Telecomunicación
Author:
Carlos Roberto del Blanco Adán
Ingeniero de Telecomunicación
Universidad Politécnica de Madrid
Supervisor:
Fernando Jaureguizar Núñez
Doctor Ingeniero de Telecomunicación
Associate Professor, Departamento de Señales, Sistemas y
Radiocomunicaciones
Universidad Politécnica de Madrid
2010
DOCTORAL THESIS
Visual Object Tracking in Challenging Situations using a Bayesian Perspective
Seguimiento visual de objetos en situaciones complejas mediante un enfoque bayesiano
Author: Carlos Roberto del Blanco Adán
Supervisor: Fernando Jaureguizar Núñez
Examination committee appointed by the Rector of the Universidad Politécnica
de Madrid on ........ 2010.
President: ........
Member: ........
Member: ........
Member: ........
Secretary: ........
Thesis defense held on ........ 2010 in ........
Grade: ........
THE PRESIDENT          THE MEMBERS
THE SECRETARY
To Vanessa, to my parents, to my siblings.
Acknowledgements
I would like to thank the many people who have shared my journey along the
path of this thesis. I will start with my wife Vanessa, who has given me so
much support and who on more than one occasion has had to put up with the
most irritable version of me. I will go on with my parents and siblings, who
have asked me so many times whether I had much left to finish the thesis.
I will continue with all the members and visitors of the GTI, with a special
mention to Fernando, Narciso and Luis, who have suffered the ravages of my
papers written in that very Spanish English. Not forgetting the effort and
migraines that reading this thesis has cost Fernando, my supervisor, thanks
to which he must now hate Mr. Bayes. I would certainly like to write a few
lines about each of my GTI colleagues, with whom I have shared so much
caffeine and who have made my working life so pleasant. However, that would
mean writing, rather than a thesis volume, the whole Salvat encyclopedia.
I am therefore forced to do something more alternative and, at the same
time, of doubtful usefulness: a table with the ages of the GTI, in which all
my colleagues can be seen (or at least that has been my intention) along
the time journey of my thesis.
This work has been partially supported by the Ministerio de Ciencia e
Innovación of the Spanish government by means of a “Formación del Personal
Investigador” fellowship and the projects TIN2004-07860 (Medusa)
and TEC2007-67764 (SmartVision).
Era          Period         Specimens                                                    Origin
Proterozoic  -              Narciso, Fernando, Luis, Francisco, Julian, Nacho            Spain
Archeozoic   -              Marcos N., Juan Carlos, Carlos R., Marcos A., Usoa, Shagniq  Spain, USA
Paleozoic    Cambrian       Carlos C., Daniel A., Sharko                                 Spain, Macedonia
             Ordovician     Raúl, Jon, Ángel                                             Spain
             Silurian       Irena, Kristina, Binu                                        Macedonia, India
             Devonian       Nerea, Pieter                                                Spain, Belgium
             Carboniferous  Pablo, Victor, Gian Luca                                     Spain, Italy
             Permian        Hui, Xioadan, Yi, Yang                                       China
Mesozoic     Triassic       Shankar, Ravi, Gogo, Antonio                                 India, Macedonia, Brazil
             Jurassic       Filippo, Maykel, Esther                                      Italy, Cuba, Spain
             Cretaceous     Sasho, César                                                 Macedonia, Spain
Cenozoic     Paleocene      Daniel B., Claire                                            Spain, France
             Eocene         Lihui, Yu, Ivana                                             China, Macedonia
             Oligocene      Toñi, Richard, Carlos G.                                     Spain, Peru
             Miocene        Manuel, Massimo, Jesús                                       Spain, Italy
             Pliocene       Rafa, Sergio                                                 Spain
             Pleistocene    Su, Wenjia, Xiang, Iviza                                     China, Macedonia
             Holocene       Samira, Abel, Pratik, Srimanta                               Iran, Spain, India
Abstract
The increasing availability of powerful computers and high quality video
cameras has allowed the proliferation of video-based systems, which perform
tasks such as vehicle navigation, traffic monitoring, surveillance, etc. A
fundamental component in these systems is the visual tracking of objects
of interest, whose main goal is to estimate the object trajectories in a video
sequence. For this purpose, two different kinds of information are used:
detections obtained by the analysis of video streams and prior knowledge
about the object dynamics. However, this information is usually corrupted
by sensor noise, varying object appearance, illumination changes,
cluttered backgrounds, object interactions, and camera ego-motion.
While there exist reliable algorithms for tracking a single object in
constrained scenarios, object tracking is still a challenge in uncontrolled
situations involving multiple interacting objects, heavily-cluttered
scenarios, moving cameras, and complex object dynamics. In this dissertation,
the aim has been to develop efficient tracking solutions for two complex
tracking situations. The first one consists of tracking a single object in
heavily-cluttered scenarios with a moving camera. To address this situation,
an advanced Bayesian framework has been designed that jointly models
the object and camera dynamics. As a result, it can satisfactorily predict
the evolution of a tracked object in situations with high uncertainty about
the object location. In addition, the algorithm is robust to background
clutter, avoiding tracking failures due to the presence of similar objects.
The other tracking situation focuses on the interactions of multiple objects
with a static camera. To tackle this problem, a novel Bayesian model has
been developed, which manages complex object interactions by means of
an advanced object dynamic model that is sensitive to object interactions.
This is achieved by inferring the occlusion events, which in turn trigger
different choices of object motion. The tracking algorithm can also handle
false and missing detections through a probabilistic data association stage.
Excellent results have been obtained using publicly available databases,
proving the efficiency of the developed Bayesian tracking models.
Summary
The increasing availability of powerful computers and high quality cameras
has allowed the proliferation of video-based systems for vehicle navigation,
traffic monitoring, video surveillance, etc. An essential part of these
systems is object tracking, whose main goal is the estimation of trajectories
in video sequences. For this purpose, two kinds of information are used: the
detections obtained from the analysis of the video and the prior knowledge of
the object dynamics. However, this information is usually distorted by sensor
noise, variations in object appearance, illumination changes, highly
structured scenes, and camera motion.
While there are reliable algorithms for tracking a single object in
controlled scenarios, tracking is still a challenge in unconstrained
situations characterized by multiple interacting objects, highly structured
scenarios, and moving cameras. In this thesis, the goal has been the
development of efficient tracking algorithms for two especially complicated
situations. The first one consists of tracking a single object in highly
structured scenes with a moving camera. To deal with this situation, a
sophisticated Bayesian framework has been designed that jointly models the
dynamics of the camera and the object. This makes it possible to
satisfactorily predict the evolution of the object position in situations of
high uncertainty. In addition, the algorithm is robust to structured
backgrounds, avoiding errors caused by the presence of similar objects.
The other situation considered focuses on the interactions of objects with a
static camera. For this purpose, a novel Bayesian model has been developed
that manages the interactions by means of an advanced dynamic model. This
model is based on the inference of occlusions between objects, which in turn
give rise to different types of object motion. The algorithm is also able to
handle missing and false detections through a probabilistic data association
stage.
Excellent results have been obtained on several databases, which proves the
efficiency of the developed Bayesian tracking models.
Contents
List of Figures
List of Tables
1 Introduction
2 Bayesian models for object tracking
  2.1 Tracking with moving cameras
  2.2 Tracking of multiple interacting objects
3 Bayesian Tracking with Moving Cameras
  3.1 Optimal Bayesian estimation for object tracking
    3.1.1 Particle filter approximation
  3.2 Bayesian tracking framework for moving cameras
  3.3 Object tracking in aerial infrared imagery
    3.3.1 Particle filter approximation
    3.3.2 Results
      3.3.2.1 Strong ego-motion situation
      3.3.2.2 High uncertainty ego-motion situation
      3.3.2.3 Global tracking results
  3.4 Object tracking in aerial and terrestrial visible imagery
    3.4.1 Particle Filter approximation
    3.4.2 Results
  3.5 Conclusions
4 Bayesian tracking of multiple interacting objects
  4.1 Description of the multiple object tracking problem
  4.2 Bayesian tracking model for multiple interacting objects
    4.2.1 Transition pdfs
    4.2.2 Likelihood
  4.3 Approximate inference based on Rao-Blackwellized particle filtering
    4.3.1 Kalman filtering of the object state
    4.3.2 Particle filtering of the data association and object occlusion
  4.4 Object detections
  4.5 Results
    4.5.1 Qualitative results
    4.5.2 Quantitative results
  4.6 Conclusions
5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work
6 Appendix
  6.1 Conditional independence and d-separation
References
List of Figures
3.1 Graphical model for the Bayesian object tracking
3.2 Consecutive frames of an aerial infrared sequence
3.3 Multimodal LoG filter response
3.4 Likelihood distribution
3.5 Initial translational transformations
3.6 Probability values for the ego-motion hypothesis
3.7 Metropolis-Hastings sampling of the likelihood distribution
3.8 Particle approximation of the posterior pdf
3.9 SIR resampling of the posterior pdf
3.10 Kernel density estimation and state estimation
3.11 Object tracking result
3.12 Intermediate results for a situation of strong ego-motion
3.13 Tracking results for the BEH algorithm under strong ego-motion
3.14 Tracking results for the DEH algorithm under strong ego-motion
3.15 Tracking results for the NEH algorithm under strong ego-motion
3.16 Intermediate results for a situation greatly affected by the aperture problem
3.17 Tracking results for the BEH algorithm in a situation greatly affected by the aperture problem
3.18 Tracking results for the DEH algorithm in a situation greatly affected by the aperture problem
3.19 Tracking results for the NEH algorithm in a situation greatly affected by the aperture problem
3.20 Example of the similarity measurement between image regions
3.21 Example of feature correspondence
3.22 Representation of the affine transformation hypothesis
3.23 Samples of the object position
3.24 Samples of ellipses enclosing the object
3.25 Weighted sampled representation of the posterior pdf
3.26 Tracking results with a camera mounted on a car
3.27 Tracking results with a camera mounted on a helicopter
4.1 Set of detections yielded by multiple detectors
4.2 Data association between detections and objects
4.3 Object dynamic model
4.4 Graphical model for multiple object tracking
4.5 Graphical model for the initial time step
4.6 Restrictions imposed on the associations between detections and objects
4.7 Restrictions imposed on the occlusions among objects
4.8 Color histograms of two object categories
4.9 Similarity maps of the color histograms
4.10 Computed detections from the red-dressed team
4.11 Computed detections from the black-and-white-dressed team
4.12 Tracking results for a simple object cross
4.13 Marginalization of the posterior pdf over one specific object
4.14 Marginalization of the posterior pdf over one specific object
4.15 Tracking results for a complex object cross
4.16 Marginalization of the posterior pdf over one specific object
4.17 Marginalization of the posterior pdf over one specific object
4.18 Marginalization of the posterior pdf over one specific object
4.19 Tracking results for an overtaking action
4.20 Marginalization of the posterior pdf over one specific object
4.21 Marginalization of the posterior pdf over one specific object
4.22 Marginalization of the posterior pdf over one specific object
6.1 Concepts of d-separation and descendants
List of Tables
2.1 Tracking problems related to the data association
3.1 Quantitative results for object tracking with a moving camera in infrared imagery
3.2 Quantitative results for object tracking with a moving camera in visible imagery
4.1 Quantitative results for interacting objects 1/2
4.2 Quantitative results for interacting objects 2/2
Chapter 1
Introduction
The evolution and spread of technology have allowed the proliferation
of video-based systems, which make use of powerful computers and high quality
video cameras to automatically perform increasingly demanding tasks such as vehicle
navigation, traffic monitoring, human-computer interaction, motion-based recognition,
security and surveillance, etc. Visual object tracking is a fundamental part of all of
the previous tasks, and also of the field of computer vision in general. This fact has
motivated a great deal of interest in object tracking algorithms. The ultimate goal of
tracking algorithms is to estimate the object trajectories in a video sequence. For this
purpose, two different kinds of information are used: the video streams acquired by
the camera sensor and the prior knowledge about the tracked objects and the
environment. The video-stream based information is used to compute object detections in
each frame, also known as observations or measurements. The detection process uses
the most distinctive appearance features, such as color, gradient, texture, and shape,
to minimize the probability of false detections and at the same time to maximize the
detection probability. However, the object appearance can undergo significant
variations that cause noisy detections and even missing detections, i.e. tracked objects that
have not been detected. The appearance variations can be produced by articulated or
deformable objects, illumination changes due to weather conditions (typical in outdoor
applications), and variations in the camera point of view. Object interactions, such as
partial and total occlusions, are another source of noisy and missing detections. On the
other hand, scene structures similar to the objects of interest can cause false detections,
thus confusing the tracking process. To alleviate these detection shortcomings,
tracking also relies on the available prior information in order to constrain the
trajectory estimation problem. This kind of information is mainly the object dynamics,
which is used to predict the evolution of the object trajectories. The modeling of the
object dynamics can be a very difficult task, especially in situations in which objects
undergo complex interactions. Moreover, object dynamic information is only
meaningful for static or quasi-static cameras, since, in the case of moving cameras, a
global motion is induced in the image, called ego-motion, that corrupts the trajectory
predictions. As a result, the camera dynamics must also be modeled, which makes
the tracking more complex and increases the uncertainty in the trajectory estimation.
While there exist reliable algorithms for the tracking of a single object in constrained
scenarios, object tracking is still a challenge in uncontrolled situations involving
multiple interacting objects, heavily-cluttered scenarios, moving cameras, objects with
varying appearance, and complex object dynamics. In this dissertation, the main aim
has been the development of efficient tracking solutions for two of these complex
tracking situations. The first one consists of tracking a single object in heavily-cluttered
scenarios with a moving camera. For this purpose, an advanced Bayesian framework
has been designed that jointly models the object and camera dynamics. This makes it
possible to satisfactorily predict the evolution of the tracked object in situations of high
uncertainty, in which several object locations are possible because of the combined
dynamics of the object and the camera. In addition, the algorithm is robust to background
clutter, avoiding tracking failures due to the presence of objects similar to the tracked
one in the background. The inference of the tracking information in the proposed
Bayesian model cannot be performed analytically, i.e. there is no closed-form
expression to directly compute the required tracking information. This situation arises
from the fact that the dynamic and observation processes involved in the Bayesian
tracking framework are non-linear and non-Gaussian. In order to deal with this
problem, a suboptimal inference method has been derived that makes use of the particle
filtering technique to compute an accurate approximation of the object trajectory.
The other unrestricted tracking situation focuses on the interactions of multiple
objects with a static camera. To successfully tackle this problem, a novel recursive
Bayesian model has been developed to explicitly manage complex object interactions.
This is accomplished by an advanced object dynamic model that is sensitive to the
object interactions involving long-term occlusions of two or more objects. For this
purpose, the proposed Bayesian tracking model uses a random variable to predict the
occlusion events, which in turn trigger different choices of object motion. The
tracking algorithm is also able to handle false and missing detections through a probabilistic
data association stage, which efficiently computes the correspondence between the
unlabeled detections and the tracked objects. Regarding the inference of the tracking
information in the proposed Bayesian model for interacting objects, two major issues
have been carefully addressed. The first one is the mathematical derivation of the
posterior distribution of the object tracking information, which has been a challenging
task due to the complexity of the tracking model. The second issue, closely related
to the first one, arises from the fact that the derived mathematical expression for the
posterior distribution has no analytical form because of the complex integrals involved.
This situation is caused by the non-linear and non-Gaussian character of the stochastic
processes involved in the Bayesian tracking model, i.e. the dynamic, observation and
occlusion processes. Consequently, the inference has to be accomplished by means of
suboptimal methods, such as particle filtering. However, the high dimensionality of
the tracking problem, proportional to the number of tracked objects and object
detections, makes the accuracy of the approximate posterior distribution very poor.
To overcome this drawback, a novel suboptimal inference method has been developed
which combines the particle filtering technique with a variance reduction technique
called Rao-Blackwellization. This makes it possible to obtain an accurate approximation
of the object trajectories in high dimensional state spaces involving multiple tracked objects.
The organization of the dissertation is as follows. In Chap. 2, a review of the state
of the art of the most remarkable object tracking techniques for multiple objects is
presented, placing special emphasis on Bayesian models, strategies for handling moving
cameras, and the management of multiple objects. The developed recursive Bayesian model
for tracking a single object in heavily-cluttered scenarios with a moving camera is
described in Chap. 3. At the end of the chapter, tracking results of the proposed
Bayesian framework are presented for two kinds of applications, one involving aerial
infrared imagery, and another dealing with both terrestrial and aerial visible imagery.
In Chap. 4, the developed Bayesian tracking solution for multiple interacting objects is
presented, along with a test bench to evaluate the efficiency of the tracking in object
interactions. Lastly, conclusions and future lines of research are set out in Chap. 5.
Chapter 2
Bayesian models for object
tracking
Visual object tracking is a fundamental task in a wide range of military and civilian
applications, such as surveillance, security and defense, autonomous vehicle navigation,
robotics, behavior analysis, traffic monitoring and management, human-computer
interfaces, video retrieval, and many more. Visual tracking can be defined as the
problem of estimating the trajectories of a set of objects of interest in a video sequence
as they move around the scene. In a typical tracking application there are one or more
object detectors that generate a set of noisy measurements or detections at discrete
time instants. The uncertainty of the detection process arises from the camera sensor
noise, changes in the scene illumination, variations in the appearance of the objects,
non-rigid and/or articulated objects, and the loss of information caused by the projection
of the 3D world onto the 2D image plane. The tracking algorithm must be able to
handle the uncertainty in the detection process to assign consistent labels to the tracked
objects in each frame of a video sequence. This process can be simplified by imposing
certain constraints on the motion of the objects. For this purpose, a dynamic model can
be used to predict the motion of the objects, restricting in this way the spatio-temporal
evolution of the trajectories. Nonetheless, the dynamic model is only an approximation
of the underlying object dynamics, which can indeed be very complex. As a result, the
tracking algorithm has to manage different sources of information (detections and
object dynamics), taking into account their respective uncertainties, to efficiently estimate
the object trajectories.
Bayesian estimation is the most commonly used framework in visual tracking and
also in other contexts such as radar and sonar. This framework models the tracking
problem and all its sources of uncertainty (sensor noise, inaccurate dynamic models,
environmental clutter, etc.) in a probabilistic way. From a Bayesian perspective, the aim
is to compute the posterior distribution over the object state, which is a vector containing
all the desired tracking information, such as position, velocity, etc. This posterior
distribution encodes all the necessary information to efficiently compute an estimate
of the object state. The computation of the posterior distribution is usually performed
in a recursive way via two cyclic stages: prediction and update. Thus, the computation
is efficient, since it only requires the previous estimate of the posterior
distribution and the set of detections at the current time step. The prediction stage
evolves the posterior distribution at the previous time step according to the object
dynamics, obtaining as a result the predicted posterior distribution at the current time
step. The update stage makes use of the available detections at the current time step
to correct the predicted posterior distribution by means of the likelihood model of the
object detector.
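As an illustration of the prediction and update stages, the following minimal Python
sketch runs one cycle of the recursion on a discrete state space. The five-cell world,
the transition matrix, and the likelihood values are hypothetical and chosen only for
this example; they are not taken from the tracking models of this dissertation.

```python
import numpy as np


def predict(prior, transition):
    """Prediction stage: propagate the previous posterior through the dynamics.

    prior[i]         = p(x_{t-1} = i | z_{1:t-1})
    transition[i, j] = p(x_t = j | x_{t-1} = i)
    """
    return prior @ transition


def update(predicted, likelihood):
    """Update stage: correct the prediction with the current detection.

    likelihood[j] = p(z_t | x_t = j)
    """
    unnormalized = predicted * likelihood
    return unnormalized / unnormalized.sum()


# Hypothetical 1D cyclic world with 5 cells (assumption for illustration only).
n = 5
transition = np.zeros((n, n))
for i in range(n):
    transition[i, i] += 0.2             # the object stays in its cell
    transition[i, (i + 1) % n] += 0.8   # the object moves one cell to the right

posterior = np.full(n, 1.0 / n)                     # uninformative initial belief
likelihood = np.array([0.1, 0.1, 0.6, 0.1, 0.1])    # detector favours cell 2

predicted = predict(posterior, transition)
posterior = update(predicted, likelihood)
print(posterior)
```

Only the previous posterior and the current likelihood enter each cycle, which is the
property that makes the recursion efficient.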
In single object tracking with static cameras, the main difficulty arises from the fact
that realistic models for the object dynamics and detection processes are often
non-linear and non-Gaussian, which leads to a posterior distribution without a closed-form
analytic expression. In fact, closed-form expressions exist only in a limited number of
cases. The best-known closed-form solution is the Kalman filter (1), which
is obtained when both the dynamic and likelihood models are linear and Gaussian.
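The linear-Gaussian case can be sketched as follows. This is a generic textbook form
of the Kalman filter, assuming a hypothetical one-dimensional constant-velocity object
whose position is observed with noise; the matrices F, Q, H, R and the measurement
values are illustrative assumptions, not parameters used in this dissertation.

```python
import numpy as np


def kalman_predict(mean, cov, F, Q):
    """Prediction: propagate mean and covariance through the linear dynamics."""
    return F @ mean, F @ cov @ F.T + Q


def kalman_update(mean, cov, z, H, R):
    """Update: correct the predicted Gaussian with the measurement z."""
    innovation = z - H @ mean
    S = H @ cov @ H.T + R                  # innovation covariance
    K = cov @ H.T @ np.linalg.inv(S)       # Kalman gain
    new_mean = mean + K @ innovation
    new_cov = (np.eye(len(mean)) - K @ H) @ cov
    return new_mean, new_cov


# Assumed model: state = [position, velocity], unit time step.
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])
Q = 0.01 * np.eye(2)           # process noise covariance (assumption)
H = np.array([[1.0, 0.0]])     # only the position is measured
R = np.array([[0.25]])         # measurement noise variance (assumption)

mean = np.array([0.0, 1.0])
cov = np.eye(2)
for z in [1.1, 1.9, 3.05]:     # made-up position measurements
    mean, cov = kalman_predict(mean, cov, F, Q)
    mean, cov = kalman_update(mean, cov, np.array([z]), H, R)
print(mean, cov)
```

The posterior remains Gaussian at every step, so it is fully described by the mean
vector and the covariance matrix.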
Grid-based approaches (2) overcome the limitations imposed on the Kalman filter by
restricting the state space to be discrete and finite. If any of the previous assumptions
does not hold, the exact computation of the posterior distribution is not possible, and it
becomes necessary to resort to approximate inference methods that compute an
approximation of the posterior distribution. The extended Kalman filter (1) linearizes
models with weak non-linearities using the first term in a Taylor expansion, so that
the Kalman filter expression can still be applied. Nonetheless, the performance of the
extended Kalman filter rapidly decreases as the non-linearities become more severe.
The unscented Kalman filter (3; 4) has proved to be more efficient in models that
are moderately non-linear. It recursively propagates a set of selected sigma points to
maintain the second order statistics of the posterior distribution. Both approximate
solutions, extended and unscented Kalman filters, assume that the underlying posterior
distribution is Gaussian. If this assumption does not hold (e.g. the distribution is
heavily skewed or multimodal), the accuracy of the estimation can be arbitrarily poor.
The Gaussian sum filter (5) was one of the first attempts to deal with non-Gaussian
models, approximating the posterior distribution by a mixture of Gaussians. The main
limitation of the Gaussian sum filter is that linear approximations are required, as in
the extended Kalman filter. Another limitation is the combinatorial growth of the
number of Gaussian components in the mixture over time. An alternative solution
for non-linear non-Gaussian models that does not need linearization is obtained by
approximate grid-based methods (2; 6). These methods approximate the continuous
state space by a finite and fixed grid, and then apply numerical integration to
compute the posterior distribution. The grid must be sufficiently dense to compute
an accurate approximation of the posterior distribution. However, the computational
cost increases dramatically with the dimensionality of the state space and becomes
impractical for dimensions larger than four. An additional disadvantage of grid-based
methods is that the state space cannot be partitioned unevenly in order to improve
the resolution in regions of high probability density. All these shortcomings are
overcome by the particle filtering technique (2; 7; 8), also known as the Sequential Monte
Carlo method (9; 10; 11), the condensation algorithm (12; 13), or bootstrap filtering (14). It is
a numerical integration technique that simulates the posterior distribution by a set of
weighted samples, known as particles, that are propagated recursively over time.
The samples are drawn from a proposal distribution, which is the key component of the
algorithm, and evaluated by means of the dynamic and likelihood models. The particle
filter has become very successful in a wide range of tracking applications due to its
efficiency, flexibility, and ease of implementation. Moreover, its computational cost is
theoretically independent of the dimension of the state space.
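A minimal sketch of a particle filter is given below, assuming the simplest choice of
proposal distribution, namely the dynamic model itself (the bootstrap filter), so that
the weights reduce to the likelihood. The one-dimensional random-walk dynamics, the
Gaussian likelihood, and all noise levels are hypothetical; the sketch is not meant to
reproduce the proposal distributions used in this dissertation.

```python
import numpy as np


def bootstrap_step(particles, z, rng, q_std=0.5, r_std=0.5):
    """One bootstrap particle filter cycle for a 1D state.

    q_std and r_std are assumed standard deviations of the dynamic
    and measurement noise, respectively.
    """
    # Propose: draw each particle from the dynamic model (random walk).
    particles = particles + rng.normal(0.0, q_std, size=particles.shape)
    # Weight: evaluate the Gaussian likelihood of the detection z.
    log_w = -0.5 * ((z - particles) / r_std) ** 2
    w = np.exp(log_w - log_w.max())      # subtract max for numerical stability
    w /= w.sum()
    # Estimate: weighted mean of the particle set.
    estimate = np.sum(w * particles)
    # Resample: multinomial resampling to fight weight degeneracy.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], estimate


rng = np.random.default_rng(0)
particles = rng.normal(0.0, 2.0, size=1000)   # samples from an assumed prior

true_x = 0.0
estimates = []
for t in range(20):
    true_x += 0.3                              # simulated object drift
    z = true_x + rng.normal(0.0, 0.5)          # simulated noisy detection
    particles, est = bootstrap_step(particles, z, rng)
    estimates.append(est)
print(true_x, estimates[-1])
```

Because the particles are redrawn at every step, they concentrate in the regions of
high posterior probability instead of being tied to a fixed grid.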
To sum up, the previous tracking approaches have proved to be efficient and reliable
solutions for single object tracking provided that:
• they fulfill the assumptions of linearity/non-linearity Gaussianity/non-Gaussinity
for which they were conceived,
• the cameras are static, i.e. with no motion, and
7
2. BAYESIAN MODELS FOR OBJECT TRACKING
• there is always a unique detection for the tracked object, which only occurs in
constrained scenarios where there is total control over the number and types of
objects that compose the scene.
In all other situations, the tracking task is still a challenge that is receiving a great
deal of research attention because of the wide range of potential applications that can
be developed. The main contribution of this dissertation is the development of efficient
and reliable algorithms for the tracking of objects in challenging situations. Specifically,
the research has focused on two situations: single object tracking with moving
cameras, and tracking of multiple interacting objects with static cameras.
In the first situation, the moving camera induces a global motion in the scene,
called ego-motion, that corrupts the spatio-temporal continuity of the video sequence.
As a consequence, the object dynamic information is not useful anymore, since the
camera motion is not considered, and the tracking performance is seriously reduced. In
Sec. 2.1, a thorough review of the main techniques that address the ego-motion problem
for single object tracking with moving cameras is presented.
In the other considered situation, the tracking algorithm has to manage several
interacting objects in an environment with static cameras. The difficulty arises from
the fact that in each time step there is a set of unlabeled detections generated from
the detectors. This means that the correspondence between objects and detections
is not known, and therefore a data association stage is required. This fact violates
the assumption that there is always a unique detection per object, since potentially
any detection can be associated with an object. In fact, the data association
can be very complex, since the number of possible associations grows combinatorially with
the number of objects and detections. Furthermore, there can be false detections and
missing detections that further increase the complexity of the data association.
The false detections arise from the noise of the camera sensor and the scene clutter
(similar structures to the tracked object in the background). On the other hand, object
occlusions and strong variations in the object appearance can cause one or more
of the objects to go undetected, the so-called missing detections. These phenomena
can also occur for single object tracking in unconstrained scenarios, in which there is
a unique object, but there can be none, one or multiple detections. For example, the
nearest neighbor Kalman filter (15) handles the data association problem by selecting
the closest detection to the predicted object trajectory, which is used to update the
posterior distribution of the object state. Unlike the previous method, which uses only a
single detection, the probabilistic data association filter (16; 17) updates the posterior
distribution utilizing all the detections that are close to the predicted object trajectory.
This is accomplished by averaging the innovation terms of the Kalman filter resulting
from the set of detections. This approach maintains the Gaussian character of the
posterior distribution. On the other hand, the data association in single object tracking
can be considered as a specific case of the data association of multiple object tracking,
where the number of tracked objects in the scene is just one. There exists a large body of
scientific literature in the field of multiple object tracking, and recently there has been
a revival of interest due to the recent developments in particle filtering and recursive
Bayesian models in general. In Sec. 2.2, the main multi-object tracking techniques are
presented, focusing on the problem of the data association.
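The difference between the two update rules can be sketched for a scalar state. The example below is a simplified illustration under assumptions that are not part of the cited methods: a one-dimensional position measured directly, known prediction and measurement variances, and association weights taken proportional to the Gaussian likelihood of each innovation. It omits the gating, clutter model, missed-detection hypothesis and covariance update of the actual probabilistic data association filter; the function names are hypothetical.

```python
import math


def kalman_gain(p_pred: float, r: float) -> float:
    """Scalar Kalman gain for a direct position measurement (H = 1)."""
    return p_pred / (p_pred + r)


def nearest_neighbor_update(x_pred, p_pred, r, detections):
    """Correct the prediction with the single detection closest to it."""
    z = min(detections, key=lambda d: abs(d - x_pred))
    return x_pred + kalman_gain(p_pred, r) * (z - x_pred)


def pda_style_update(x_pred, p_pred, r, detections):
    """Correct the prediction with a likelihood-weighted average of innovations."""
    s = p_pred + r  # innovation variance
    innovations = [z - x_pred for z in detections]
    likelihoods = [math.exp(-0.5 * nu * nu / s) for nu in innovations]
    total = sum(likelihoods)
    betas = [lik / total for lik in likelihoods]  # association weights, sum to 1
    combined = sum(b * nu for b, nu in zip(betas, innovations))
    return x_pred + kalman_gain(p_pred, r) * combined


# Two detections placed symmetrically around the prediction.
print(nearest_neighbor_update(0.0, 1.0, 1.0, [1.0, -1.0]))
print(pda_style_update(0.0, 1.0, 1.0, [1.0, -1.0]))
```

In the symmetric case, the nearest-neighbor rule commits to one detection, whereas the averaged innovation cancels and leaves the prediction unchanged.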
2.1 Tracking with moving cameras
In video based applications in which the video acquisition system is mounted on a
moving aerial platform (such as a plane, a helicopter, or an Unmanned Aerial Vehicle),
a mobile robot, a vehicle, etc., the acquired video sequences undergo a random global
motion, called ego-motion, that prevents the use of the object dynamic information to
restrict the object position in the scene. As a consequence, the tracking performance
can be dramatically reduced. The ego-motion problem has been addressed in different
ways in the scientific literature. The existing approaches can be split into two categories: those
based on the assumption of low ego-motion, and those based on ego-motion estimation.
Approaches assuming low ego-motion consider that the motion component due to
the camera is not very significant in comparison with the object motion. In this context,
some works assume that the spatio-temporal connectivity of the object is preserved
along the sequence (18; 19; 20), i.e. the image regions associated with the tracked object
are spatially overlapped in consecutive frames. Then, the tracking is performed using
morphological connected operators. In cases where the previous assumption does not
hold, the most common approach is to search for the object in a bounded area centered
on the location where the object is expected to be, according to its dynamics.
In (21; 22), an exhaustive search is performed in a fixed-size image region centered on
the previous object location. In (23), the initial search location is estimated using a
Kalman filter, and then the search is performed deterministically using the Mean Shift
algorithm (24). Other authors (25; 26) propose a stochastic search based on particle
filtering, which is able to manage multiple initial locations for the search. However,
all these methods lose effectiveness as the displacement induced by the ego-motion
increases. The reason is that the size of the search area must be enlarged to accommodate
the expected camera ego-motion, which dramatically increases the probability that the
tracker is distracted by false candidates.
The other category of approaches, based on ego-motion estimation, is able
to deal with strong ego-motion situations, in which the camera motion is at least as
significant as the object motion. These approaches aim to compute the camera ego-motion
between consecutive frames in order to compensate for it, thus recovering the
spatio-temporal correlation of the video sequence. The camera ego-motion is modeled
by a geometric transformation, typically an affine or projective one, whose parameters
are estimated by means of an image registration technique. The existing works differ
in the specific image registration technique used to compute the parameters of the
geometric transformation. Extensive reviews of image registration techniques can be
found in (27; 28), where the first one covers all kinds of vision-based applications, while
the second one focuses on aerial imagery. According to them, image registration
techniques can be classified into those based on features and those based on
area (i.e. image regions). Feature-based image registration techniques detect and
match distinctive image features between consecutive frames to estimate a geometric
transformation, which represents the camera ego-motion model. In (29), an object
detection and tracking system with a moving airborne platform is described, which
uses a feature based approach to estimate an affine camera model. In (30), the KLT
method (31) is used to infer a bilinear camera model in an application that detects
moving objects from a mobile robot. In the field of FLIR (Forward Looking InfraRed)
imagery, the works (32; 33; 34) describe a detection and tracking system of aerial
targets mounted on an airborne platform that uses a robust statistic framework to
match edge features in order to estimate an affine camera model. This system is able
to successfully handle situations in which the camera motion estimation is disturbed by
the presence of independent moving objects, provided that there is a minimum number
of detected features belonging to the background. In situations in which the detection
of distinctive features is particularly complicated, because the acquired images have little
texture and structure, an area-based image registration technique is used to estimate
the parameters of the camera model. In (35), a perspective camera model is computed
by means of an optical flow algorithm to detect moving objects in an application of
aerial visual surveillance. An optical flow algorithm is also used in (36) to estimate the
parameters of a pseudo perspective camera model, which is utilized to create panoramic
image mosaics. The same approach is followed in (37; 38) for a tracking application of
terrestrial targets in airborne FLIR imagery. In (39; 40), a target detection framework is
presented for FLIR imagery that minimizes an SSD (Sum of Squared Differences) based
error function to estimate an affine camera model. A similar framework of camera
motion compensation is used in (41) for tracking vehicles in aerial infrared imagery,
but utilizing a different minimization algorithm. In (42), the Inverse Compositional
Algorithm is used to obtain the parameters of an affine camera model for a tracking
application of vehicles in aerial imagery. Unlike the feature-based image registration
techniques, the area-based techniques are not robust to the presence of independent
moving objects, which can cause the ego-motion estimation to drift. In addition, they require
the involved images to be closely aligned to achieve satisfactory results.
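As an illustration of the feature-based estimation of an affine camera model, the following sketch fits the six affine parameters to a set of matched points by ordinary least squares. It assumes that the feature matches are already available and free of outliers, so it does not reproduce the robust estimators of the cited works; `fit_affine`, `apply_affine` and `solve3` are hypothetical helper names. At least three non-collinear matches are required.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_affine(src, dst):
    """Least-squares affine map dst ~ A @ src + t from matched 2D points."""
    rows = [(x, y, 1.0) for x, y in src]
    # Normal equations: the same 3x3 matrix serves both output coordinates.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atu = [sum(r[i] * u for r, (u, _) in zip(rows, dst)) for i in range(3)]
    atv = [sum(r[i] * v for r, (_, v) in zip(rows, dst)) for i in range(3)]
    return solve3(ata, atu), solve3(ata, atv)  # (a, b, tx), (c, d, ty)


def apply_affine(params, point):
    """Map a point from the previous frame into the current frame."""
    (a, b, tx), (c, d, ty) = params
    x, y = point
    return a * x + b * y + tx, c * x + d * y + ty


# Synthetic background matches generated from a known camera motion.
true_motion = ((1.1, 0.1, 5.0), (-0.2, 0.9, -3.0))
previous = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
current = [apply_affine(true_motion, p) for p in previous]
estimated = fit_affine(previous, current)
print(estimated)
```

Once the parameters are estimated, `apply_affine` can be used to transfer the previous object location into the current frame before applying the object dynamics.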
All the previous approaches, independently of the camera ego-motion compensation
technique used, have in common that they compute at most one parametric model to
represent the ego-motion between consecutive frames. However, in real applications,
the ego-motion computation can be quite challenging, because there can be several
feasible solutions, i.e. several camera geometric transformations, and the solution
with the lowest error is not necessarily the correct one. This situation arises as a consequence
of several phenomena, such as the aperture problem (43) (related to low structured
or textured scenes), the presence of independent moving objects, changes in the scene,
and limitations of the camera ego-motion estimation technique itself. In Chap. 3, an efficient and
reliable Bayesian framework is proposed to deal with the uncertainty in the estimation
of the camera ego-motion for tracking applications.
2.2 Tracking of multiple interacting objects
Multiple object tracking can be thought of as the generalization of single object tracking,
in the sense that the main goal is to recover the trajectories of multiple objects from
a video sequence, rather than only one trajectory from a single object. However,
techniques of multiple object tracking are fundamentally different from those of single
object tracking, due to the particular problems that arise in the presence of two
or more objects. In multiple object tracking, the object detections are unlabeled and
unordered, i.e. the true correspondence between objects and detections is unknown.
The estimation of the true correspondence, called data association, suffers from the
combinatorial explosion of the possible associations, in which the computational cost
inevitably grows exponentially with the number of objects. On the other hand, data
association is a stochastic process in which the estimation of the true detection
association can be extremely difficult due to the involved uncertainty. Furthermore, in
real situations there can be none, one, or several detections per object. As a result,
there can be false detections and missing detections, in spite of the fact that the goal
of the detector is both to minimize the probability of false alarms and to maximize the
detection probability. This fact increases the complexity of the data association problem.
The false detections arise from scene structures similar to the objects of interest,
which can confuse the tracking process. The missing detections can originate
from changes in the object appearance, which in turn are caused by articulated or
deformable objects, illumination changes due to weather conditions (typical in outdoor
applications), and variations in the camera point of view. Another source of missing
detections is the partial and total occlusions involved in the object interactions. All
of these phenomena are also responsible for the noisy character of the detection process.
Tab. 2.1 summarizes the mentioned sources of disturbances along with their effects,
and the derived data association problems.
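The size of the association space can be made concrete with a small count. The sketch below assumes, for illustration only, that each object produces at most one detection and each detection comes from at most one object, so that unmatched objects correspond to missing detections and unmatched detections to false detections; the function name is hypothetical.

```python
from math import comb, factorial


def association_hypotheses(n_objects: int, n_detections: int) -> int:
    """Count one-to-one partial matchings between objects and detections."""
    total = 0
    for k in range(min(n_objects, n_detections) + 1):
        # Choose k detected objects, k true detections, and pair them up.
        total += comb(n_objects, k) * comb(n_detections, k) * factorial(k)
    return total


for n in (1, 2, 3, 5):
    print(n, association_hypotheses(n, n))
```

Under these assumptions, 5 objects and 5 detections already admit 1546 hypotheses in a single time step.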
A great many strategies have been proposed in the scientific literature to solve
the data association problem. These can be divided into single-scan and multiple-scan
approaches. Single-scan approaches perform the data association considering only the
set of available detections in a specific time step, while the multiple-scan approaches
make use of the detections acquired in a temporal interval, comprising several time
steps.

Disturbance                                          Effect                                Data association problem
Changes in the camera point of view                  Variations in the object appearance   Missing detections, noisy detections
Articulated or deformable objects                    Variations in the object appearance   Missing detections, noisy detections
Illumination changes                                 Variations in the object appearance   Missing detections, noisy detections
Object interactions                                  Partial or total occlusions           Missing detections, noisy detections
Scene structures similar to the objects of interest  Presence of clutter                   False detections

Table 2.1: Disturbances in the detection process, their effects, and the resulting problems
in data association.

Multiple-scan approaches consider that tracks are basically a sequence of noisy
detections. Thus, multiple object tracking consists in seeking the optimal paths
in a trellis formed by the temporal sequence of detections. In this way, the data
association problem is recast as one of associating sequences of detections. Techniques
that accomplish this task are the Viterbi algorithm (44; 45; 46), multiple scan assignment
(47; 48), network theoretic algorithms (49), and the expectation-maximization
algorithm (EM) (50). The preceding approaches compute a single solution that is
considered the best one, discarding many feasible hypotheses that could be the true
solution. To alleviate this situation, some approaches (51; 52) compute the best N solutions
in order to minimize the risk of an incorrect trajectory estimation. An additional
problem is the computational cost. It is known that the multiple-scan approaches pose
NP-hard combinatorial optimization problems, whose complexity grows exponentially
with the number of objects and detections. The most popular solution to tackle this
problem is the Lagrangian relaxation (53; 54), wherein the N-dimensional assignment
problem is divided into a set of assignment problems of lower dimensionality. Another
approach (55) transforms the integer programming problem, posed by the multiple-scan
assignment, into a linear programming problem by relaxing the constraints for
an integer solution. This allows the problem to be solved efficiently in polynomial time
through well-known algorithms, such as the interior point method (56).
Within the group of single-scan approaches, the simplest one is the global nearest
neighbor algorithm (57), also known as the 2D assignment algorithm, which computes
a single association between detections and objects by minimizing a distance based
cost function. The main problem of this approach is that many feasible associations
are discarded. On the other hand, the multiple hypotheses tracker (MHT) (58; 59)
attempts to keep track of all the possible associations over time. As
with the multiple-scan approaches, the problem is NP-hard because
the number of associations grows exponentially over time, and also with the number
of objects and detections. Therefore, additional methods are required to establish a
trade-off between the computational complexity and the handling of multiple association
hypotheses. In this respect, one of the most popular methods is the joint probabilistic
data association filter (JPDAF) (60; 61), which performs a soft association
between detections and objects. This is carried out by combining all the detections
with all the objects, in such a way that the contribution of each detection to each
object depends on the statistical distance between them. This method prunes away
many infeasible hypotheses, but also restricts the data association distribution to be
Gaussian, which limits the applicability of the technique. Subsequent works (62; 63)
try to overcome this limitation by modeling the data association distribution by a mixture
of Gaussians. However, heuristic techniques are necessary to reduce the number
of components to make the algorithm computationally manageable. The probabilistic
multiple hypotheses tracker (PMHT) (64; 65) is another alternative to estimate the best
data association hypotheses at a moderate computational cost. It assumes that the
data association is an independent process to work around the problems with pruning.
Nevertheless, the performance is similar to that of the JPDAF, although the computational
cost is higher. The data association problem has also been addressed with
particle filtering, which makes it possible to deal with arbitrary data association distributions in a
natural way. Theoretically, the algorithms based on particle filtering have the ability to
manage the best data association hypotheses with a computational cost independent
of the number of objects and detections. The computed association hypotheses constitute
an approximation of the true data association distribution, and the approximation
is more accurate as the number of hypotheses increases. In practice, the performance
of the particle filtering techniques depends on the ability to correctly sample association
hypotheses from a proposal distribution called importance density. In (66; 67), a
Gibbs sampler is used to sample the data association hypotheses. In a similar way, a
Markov Chain Monte Carlo (MCMC) (68; 69; 70) scheme has been used for drawing
samples that simulate the underlying data association distribution. The main problem
with these samplers is that they are iterative methods that need an unknown number
of iterations to converge. This fact makes them inappropriate for online applications.
Some works (71; 72) overcome this limitation by designing an efficient
and non-iterative proposal distribution that depends on the specific characteristics of
the underlying dynamic and likelihood processes of the tracking system. The accuracy
of the estimation achieved by techniques based on particle filtering depends on the
dimension of the state space. For high dimensional spaces, the accuracy can
be quite low. In order to deal with this drawback, a technique of variance reduction,
called Rao-Blackwellization, has been used in (73), which improves the accuracy of
the estimated object trajectories for a given number of samples or hypotheses. An
alternative to particle filtering is the probability hypothesis density (PHD) filter,
which, like particle filtering, can also address missing and false detections. However,
the computational cost is exponential with the number of objects. In order to reduce
the complexity from exponential to linear, the full posterior distribution is simplified
by its first-order moment in (74). Nonetheless, this approach is only satisfactory for
multivariate distributions that can be reasonably approximated by their first moment,
which can be an excessive limitation for some tracking applications.
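The global nearest neighbor association mentioned above can be sketched as an exhaustive search over one-to-one assignments. The example assumes scalar object predictions, a squared-distance cost, and at least as many detections as objects; these choices and the function name are illustrative only. The enumeration is exponential and is meant only to expose the problem structure; practical trackers solve the same 2D assignment problem with polynomial-time methods such as the Hungarian algorithm.

```python
from itertools import permutations


def global_nearest_neighbor(objects, detections):
    """Return (cost, assignment); assignment[i] is the detection index for object i."""
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(detections)), len(objects)):
        cost = sum((objects[i] - detections[j]) ** 2 for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return best_cost, best


# Both objects are individually closest to detection 0 (at 0.8),
# but a joint assignment may use each detection only once.
cost, assignment = global_nearest_neighbor([0.0, 1.0], [0.8, 2.5])
print(cost, assignment)
```

The example shows why a per-object nearest-neighbor rule is not sufficient with several objects: the joint cost forces the conflict over the shared detection to be resolved globally.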
The previous works have been designed to track multiple objects with restricted
kinds of interactions among them. For instance, these works are able to handle object
interactions involving trajectory changes but without occlusions, such as a situation
with two people who stop one in front of the other. In this case the object detections are
used to efficiently correct the object trajectories. Another kind of interaction that is
successfully addressed involves object occlusions but without trajectory changes, such
as a situation with two people who cross each other maintaining their paths. In this
case, the data association stage can manage the missing detections during the occlusion,
relying on the fact that their trajectories are unchanged in order to predict their tracks. However, in
complex object interactions involving trajectory changes and occlusions, the previous
approaches are prone to fail because the occluded objects have no available detections
to correct their trajectories. This limitation arises from the fact that the main tracking
techniques for multiple objects have been developed for radar and sonar applications,
in which the dynamics of the tracked objects have physical restrictions that make
the complex interactions that arise in visual tracking impossible. Moreover, in the field
of radar and sonar, the objects are handled as point targets that cannot be occluded.
Some works have proposed strategies to deal with the specific problems that arise in
the field of visual tracking. In (75; 76), the data association hypotheses are drawn
using a sampling technique that is able to handle split object detections, i.e. groups
of detections that have been generated from the same object. Split detections
are typical of background subtraction techniques (77), which are used to detect
moving objects in video sequences. In (78), a specific approach for handling object
interactions that involve occlusions and changes in trajectories is presented. It creates
virtual detections of possible occluded objects to cope with the changes in trajectories
during the occlusions. However, since the occlusion events are not explicitly modeled,
tracking errors can appear when a virtual detection is associated with an object that is
actually not occluded. In order to improve the performance of the tracking of multiple
objects in the field of computer vision, a novel Bayesian approach that explicitly models
the occlusion phenomenon has been developed. This approach is able to track complex
interacting objects whose trajectories change during the occlusions. Chap. 4 describes
in detail the proposed visual tracking framework for multiple interacting objects.
Chapter 3
Bayesian Tracking with Moving Cameras
This chapter starts with a brief overview of the optimal Bayesian framework for general
object tracking (Sec. 3.1), also explaining the basics of particle filtering, an
approximate inference technique. Next, the developed Bayesian tracking framework for
moving cameras, which models the camera motion in a probabilistic way, is presented
in Sec. 3.2. Lastly, Secs. 3.3 and 3.4 show respectively how to apply the proposed
Bayesian model to two visual tracking applications for moving cameras: the first one
focused on aerial infrared imagery, and the second one on aerial and terrestrial visible
imagery.
3.1 Optimal Bayesian estimation for object tracking
The Bayesian approach for object tracking aims to estimate a state vector $x_t$ that
evolves over time using a sequence of noisy observations $z_{1:t} = \{z_i \mid i = 1, \ldots, t\}$ up
to time $t$. The state vector contains all the relevant information for the tracking at
time step $t$, such as the object position, velocity, size, appearance, etc. The noisy
observations $z_{1:t}$ (also called measurements or detections) are obtained by one or more
detectors, which analyze the video sequence information acquired by the camera to
either directly compute the object position, or indirectly obtain relevant features that
can be related to the object position, such as motion, color, texture, edges, corners, etc.
From a Bayesian perspective, some degree of belief in the state $x_t$ at time $t$ is
calculated, using the available prior information (about the object, the camera and
the scene), and the set of observations $z_{1:t}$. Therefore, the tracking problem can be
formulated as the estimation of the posterior probability density function (pdf) of the
state of the object, $p(x_t \mid z_{1:t})$, conditioned on the set of observations, where the initial
pdf $p(x_0 \mid z_0) \equiv p(x_0)$ is assumed to be known. This probabilistic model for the object
tracking can be represented by a graph (see Fig. 3.1), called graphical
model, in which the random variables are represented by nodes, and the probabilistic
relationships among the variables by arrows.
Figure 3.1: Graphical model for the Bayesian object tracking.
For efficiency purposes, the estimation of the posterior pdf $p(x_t \mid z_{1:t})$ is recursively
performed through two stages: the prediction of the most probable state vectors using
the prior information, and the update (or correction) of the prediction based on
the observations. The prediction stage involves computing the prior pdf of the state,
$p(x_t \mid z_{1:t-1})$, at time $t$ via the Chapman-Kolmogorov equation
$$p(x_t \mid z_{1:t-1}) = \int p(x_t, x_{t-1} \mid z_{1:t-1})\, dx_{t-1} = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}, \tag{3.1}$$
where $p(x_{t-1} \mid z_{1:t-1})$ is the posterior pdf at the previous time step, and $p(x_t \mid x_{t-1})$ is
the state transition probability, which encodes the prior information, for example the
object dynamics along with its uncertainty. The state transition probability is defined
by a possibly non-linear function of the state $x_{t-1}$ and an independent identically
distributed noise process $v_{t-1}$
$$x_t = f_t(x_{t-1}, v_{t-1}). \tag{3.2}$$
The update stage aims to reduce the uncertainty of the prediction, $p(x_t \mid z_{1:t-1})$,
using the new available observation $z_t$ (observations are available at discrete times)
through Bayes' rule
$$p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}, \tag{3.3}$$
where $p(z_t \mid x_t)$ is the likelihood distribution that models the observation process, i.e. it
assesses the degree of support of the observation $z_t$ by the prediction $x_t$. The likelihood
is given by a possibly nonlinear function of the state $x_t$ and an independent identically
distributed noise process $n_t$
$$z_t = h_t(x_t, n_t). \tag{3.4}$$
The denominator of Eq. (3.3) is simply a normalization constant given by
$$p(z_t \mid z_{1:t-1}) = \int p(z_t, x_t \mid z_{1:t-1})\, dx_t = \int p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})\, dx_t. \tag{3.5}$$
The posterior $p(x_t \mid z_{1:t})$ embodies all the available statistical information, allowing
the computation of an optimal estimate of the state vector, $\hat{x}_t$, that contains the desired
tracking information. Commonly used estimators are the Maximum A Posteriori
(MAP) and the Minimum Mean Square Error (MMSE), given respectively by
$$\text{MAP}: \quad \hat{x}_t = \arg\max_{x_t}\, p(x_t \mid z_{1:t}) \tag{3.6}$$
$$\text{MMSE}: \quad \hat{x}_t = E[x_t \mid z_{1:t}] = \int x_t\, p(x_t \mid z_{1:t})\, dx_t \tag{3.7}$$
Nevertheless, the optimal solution of the posterior probability, given by Eq. (3.3),
cannot be determined analytically in practice, due to the nonlinearities and non-Gaussianities
of the prior information and observation models. Therefore, it is necessary
to use suboptimal methods to obtain an approximate solution. In Sec. 3.1.1, a
powerful and popular suboptimal method, called Particle Filtering, is described.
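The prediction and update recursion of Eqs. (3.1), (3.3) and (3.5) can be illustrated on a small discrete state space, where the integrals reduce to sums. The following sketch is not a method proposed in this thesis: the ten-cell state space, the transition probabilities and the Gaussian-shaped likelihood are assumptions chosen only to make the recursion and the MAP and MMSE estimators concrete.

```python
import math


def predict(posterior, transition):
    """Eq. (3.1) on a grid: prior[j] = sum_i transition[i][j] * posterior[i]."""
    n = len(posterior)
    return [sum(transition[i][j] * posterior[i] for i in range(n)) for j in range(n)]


def update(prior, likelihood):
    """Eq. (3.3): weight the prior by the likelihood, normalize by the evidence (3.5)."""
    unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


def map_estimate(states, posterior):
    """State with the highest posterior probability."""
    return states[max(range(len(posterior)), key=posterior.__getitem__)]


def mmse_estimate(states, posterior):
    """Posterior mean of the state."""
    return sum(x * p for x, p in zip(states, posterior))


# Hypothetical model: the object moves one cell to the right with probability 0.8.
states = list(range(10))
n = len(states)
transition = [[0.0] * n for _ in range(n)]
for i in range(n):
    transition[i][i] += 0.2
    transition[i][min(i + 1, n - 1)] += 0.8

posterior = [1.0 / n] * n  # uniform initial pdf p(x_0)
for z in (3.0, 4.0, 5.0):  # noisy position observations
    prior = predict(posterior, transition)
    likelihood = [math.exp(-0.5 * (z - x) ** 2) for x in states]
    posterior = update(prior, likelihood)

print(map_estimate(states, posterior), mmse_estimate(states, posterior))
```

The cost of `predict` is quadratic in the number of cells, which is one reason why this exact recursion is only practical for small, discretized state spaces.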
3.1.1 Particle filter approximation
The Particle Filter is an approximate inference method based on Monte Carlo simulation
for solving Bayesian filters. In contrast to other approximate inference methods,
such as Extended Kalman Filters, Unscented Kalman Filters and Hidden Markov Models,
Particle Filtering is able to deal with continuous state spaces and nonlinear/non-Gaussian
processes (9), which arise in a natural way in real tracking situations. The
Particle Filtering technique approximates the posterior probability $p(x_t \mid z_{1:t})$ by a set
of $N_S$ weighted random samples (or particles) $\{x_t^i, i = 1, \ldots, N_S\}$ (2)
$$p(x_t \mid z_{1:t}) \approx \frac{1}{c} \sum_{i=1}^{N_S} w_t^i\, \delta(x_t - x_t^i), \tag{3.8}$$
where the function $\delta(x)$ is the Dirac delta, $\{w_t^i, i = 1, \ldots, N_S\}$ is the set of weights
related to the samples, and $c = \sum_{i=1}^{N_S} w_t^i$ is a normalization factor. As the number
of samples becomes very large, this approximation converges to the true
posterior pdf.
Samples $x_t^i$ and weights $w_t^i$ are obtained using the concept of importance sampling
(2; 79), which aims to reduce the variance of the approximation given by Eq. (3.8)
through Monte Carlo simulation. The set of samples $\{x_t^i, i = 1, \ldots, N_S\}$ is drawn from
a proposal distribution function $q(x_t \mid x_{t-1}, z_t)$, called the importance density. The optimal
$q(x_t \mid x_{t-1}, z_t)$ should be proportional to $p(x_t \mid z_{1:t})$, and should have the same
support (the support of a function is the set of points where the function is not zero),
in which case the variance would be zero. But this is only a theoretical solution, since
it would imply that $p(x_t \mid z_{1:t})$ is known. In practice, a proposal distribution
as similar as possible to the posterior pdf is chosen, but there is no standard solution, since
it depends on the specific characteristics of the tracking application. The choice of
the proposal distribution is a key component in the design of Particle Filters, since
the quality of the estimation of the posterior pdf depends on the ability to find an
appropriate proposal distribution.
The weights $w_t^i$ related to each sample $x_t^i$ are recursively computed by (2)
$$w_t^i = w_{t-1}^i\, \frac{p(z_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, z_t)}. \tag{3.9}$$
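As a worked special case (a standard choice in bootstrap filtering, not necessarily the proposal adopted later in this thesis), taking the state transition density as the importance density cancels the dynamic term in the weight recursion, so that each weight is simply rescaled by the likelihood of its sample:

```latex
q(x_t^i \mid x_{t-1}^i, z_t) = p(x_t^i \mid x_{t-1}^i)
\quad\Longrightarrow\quad
w_t^i = w_{t-1}^i \, \frac{p(z_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{p(x_t^i \mid x_{t-1}^i)}
      = w_{t-1}^i \, p(z_t \mid x_t^i).
```

This choice is simple to implement, but it ignores the current observation $z_t$ when proposing samples, which can be inefficient when the likelihood is much narrower than the transition density.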
The importance sampling principle has a serious drawback, called the degeneracy
problem (2): after a few iterations, all the weights except one have an insignificant value.
In order to overcome this problem, several resampling techniques
have been proposed in the scientific literature, which introduce an additional sampling
step that replicates the most probable samples more often. A
popular resampling strategy is the Sampling Importance Resampling (SIR) algorithm,
which makes a random selection of the samples at each time step according to their
weights. Thus, the samples with higher weights are selected more times, while the ones
with an insignificant weight are discarded. After SIR resampling, all the samples have
the same weight.
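One cycle of a particle filter with SIR resampling can be sketched as follows. The example assumes a scalar random-walk state, a Gaussian observation model, and the transition density as importance density, so that the weights of Eq. (3.9) reduce to the likelihood; the resampling is multinomial. All names and noise levels are hypothetical and chosen for illustration.

```python
import math
import random


def sir_step(particles, z, rng, process_std=1.0, obs_std=1.0):
    """One predict / weight / resample cycle for a 1-D random-walk state."""
    # Prediction: draw each sample from the transition density (the proposal).
    predicted = [x + rng.gauss(0.0, process_std) for x in particles]
    # Weighting: previous weights are equal after resampling, so w is
    # proportional to the likelihood p(z | x).
    weights = [math.exp(-0.5 * ((z - x) / obs_std) ** 2) for x in predicted]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resampling: select samples with probability equal to their weight;
    # the returned samples implicitly carry equal weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


rng = random.Random(0)
particles = [rng.uniform(-10.0, 10.0) for _ in range(2000)]
for z in [2.0] * 10:  # repeated observations of an object near position 2
    particles = sir_step(particles, z, rng)

print(sum(particles) / len(particles))  # MMSE estimate: the sample mean
```

Because all samples have equal weight after resampling, the MMSE estimate reduces to the plain average of the particles.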
3.2 Bayesian tracking framework for moving cameras
In video sequences acquired by a moving camera, the perceived motion of the objects is
composed of the object's own motion and the camera motion. Consequently, it is necessary
to estimate the camera motion in order to obtain the object position. Accordingly,
the state vector $x_t = \{d_t, g_t\}$ must contain not only the object dynamics, $d_t$
(position and velocity over the image plane), but also the camera dynamics, $g_t$, i.e. the
camera ego-motion. The posterior pdf of the state vector is recursively expressed by
the equations
$$p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})} \tag{3.10}$$
$$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}. \tag{3.11}$$
The transition probability $p(x_t \mid x_{t-1}) = p(d_t, g_t \mid d_{t-1}, g_{t-1})$ encodes the information
about the object and camera dynamics, along with their uncertainty. If the camera
motion is not considered, the object dynamics can be modeled by the linear function
$$d'_t = M \cdot d'_{t-1}, \tag{3.12}$$
where $M$ is a matrix that represents a first order linear system of constant velocity.
This object dynamic model is a reasonable approximation for a wide range of object
tracking applications, provided that the camera frame rate is high enough. The camera
dynamics is modeled by a geometric transformation $g_t$ that ideally is a projective
camera model, although, depending on the camera and scene disposition, it can be
simplified to an affine or Euclidean transformation. For example, in aerial tracking
systems, an affine geometric transformation is a satisfactory approximation of the
projective camera model, since the depth relief of the objects in the scene is small
compared to the average depth, and the field of view is also small (80). The joint
dynamic model for the camera and the object is expressed as a composition of both
individual models
$$d_t = g_t \cdot M \cdot d_{t-1}. \tag{3.13}$$
Based on this joint dynamic model, the transition probability $p(x_t|x_{t-1})$ can be
expressed as
$$\begin{aligned}
p(x_t|x_{t-1}) &= p(d_t, g_t|d_{t-1}, g_{t-1}) = p(d_t|d_{t-1}, g_{t-1:t})\, p(g_t|d_{t-1}, g_{t-1}) \\
&= p(d_t|d_{t-1}, g_t)\, p(g_t),
\end{aligned} \tag{3.14}$$
where it has been assumed that, on the one hand, the current object position is
conditionally independent of the camera motion in the previous time step (as the proposed
joint dynamic model states), and, on the other hand, the current camera motion is
conditionally independent of both the camera motion and the object position in previous
time steps. This last assumption results from the fact that the camera ego-motion is
completely random, not following any specific pattern. The probability term
$p(d_t|d_{t-1}, g_t)$ models the uncertainty of the proposed joint dynamic model as
$$p(d_t|d_{t-1}, g_t) = \mathcal{N}\left(d_t;\ g_t \cdot M \cdot d_{t-1},\ \sigma^2_{tr}\right), \tag{3.15}$$
where $\mathcal{N}(x; \mu, \sigma^2)$ is a Gaussian or Normal distribution of mean $\mu$
and variance $\sigma^2$. Thus, the term $\sigma^2_{tr}$ represents the unknown
disturbances of the joint dynamic model.
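As an illustration of Eq. 3.15, the following sketch draws one object state from the joint dynamic model. Several details are assumptions made only for this example: the state layout (x, y, vx, vy), a unit time step in M, an affine g given as a 3x3 row-major matrix whose linear part also acts on the velocity, and isotropic noise. The function name `propagate_object_state` is hypothetical.

```python
import random

def propagate_object_state(d_prev, g, sigma_tr, rng):
    """Draw d_t ~ N(g * M * d_prev, sigma_tr^2) for a state d = (x, y, vx, vy)."""
    x, y, vx, vy = d_prev
    # M: constant-velocity step, assuming one time unit between frames.
    x, y = x + vx, y + vy
    # g: 3x3 affine matrix as nested lists. Assumption: it maps the position in
    # homogeneous coordinates, and only its linear part acts on the velocity.
    (a11, a12, tx), (a21, a22, ty) = g[0], g[1]
    mean = (a11 * x + a12 * y + tx,
            a21 * x + a22 * y + ty,
            a11 * vx + a12 * vy,
            a21 * vx + a22 * vy)
    # Isotropic Gaussian disturbance; sigma_tr is the standard deviation.
    return tuple(rng.gauss(m, sigma_tr) for m in mean)

pure_translation = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
noise_free = propagate_object_state((10.0, 20.0, 1.0, -2.0),
                                    pure_translation, 0.0, random.Random(0))
```

With the noise switched off, the result is simply the constant-velocity prediction shifted by the camera translation, which makes the composition of the two models easy to check by hand.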
The other probability term in Eq. 3.14, $p(g_t)$, expresses the probability that one
specific geometric transformation represents the true camera motion between consecutive
time steps. This is typically computed by a deterministic approach using an image
registration algorithm (27), which amounts to expressing $p(g_t)$ as
$$p(g_t) = \delta(g_t - g^j_t), \tag{3.16}$$
where $g^j_t$ is the geometric transformation obtained by the image registration technique.
However, this approximation can fail in situations in which the aperture problem (43;
81) is quite significant and/or the assumption of only one global motion does not hold,
for instance, in the presence of independently moving objects. Under these circumstances
there are several putative geometric transformations that can explain the camera
ego-motion. Moreover, the best geometric transformation according to some error or cost
function is not necessarily the actual camera ego-motion, due to the noise and
non-linearities involved in the estimation process. In order to deal satisfactorily with
this situation, $g_t$ is treated as a random variable, rather than as a parameter computed
in a deterministic way. The specific computation of $p(g_t)$ depends on the tracking
application and the type of imagery. Two different methods are proposed in Secs. 3.3 and
3.4 for infrared and visible imagery, respectively. In any case, they have in common
that they compute an approximation of $p(g_t)$ as
$$p(g_t) \approx \sum_{j=1}^{N_g} w^j_t\, \delta(g_t - g^j_t), \tag{3.17}$$
where $N_g$ is the number of geometric transformations used to represent $p(g_t)$,
$\{g^j_t \,|\, j = 1, ..., N_g\}$ are the best transformation candidates to model the camera
ego-motion, and $w^j_t$ is the weight of $g^j_t$, which evaluates how well the
transformation represents the camera ego-motion.
The likelihood function $p(z_t|x_t)$ in Eq. 3.10 depends on the kind of imagery and on
the type of object being tracked. Two different models have been developed, one
based on the detection of blob regions for infrared imagery, and another based on color
histograms for visible video sequences, which are described in Secs. 3.3 and 3.4,
respectively. In general terms, the resulting likelihood will be non-Gaussian, nonlinear
and multi-modal, due to the presence of clutter and objects similar to the tracked object.
The initial pdf $p(x_0|z_0) \equiv p(x_0)$, called the prior, can be initialized as a
Gaussian distribution using the information given by an object detector algorithm, as in
(18; 19; 32; 33; 34; 39; 40). Another alternative is to use the ground truth information
(if it is available) to initialize a Kronecker's delta function $\delta(x_0)$.
3.3 Object tracking in aerial infrared imagery
This section presents the object tracking approach developed for aerial infrared imagery.
In contrast to visual-range images, infrared images have low signal-to-noise ratios,
objects with low contrast against the background, and non-repeatable object signatures.
These drawbacks, along with the competing background clutter and the illumination changes
due to weather conditions, make the tracking task extremely difficult. In addition,
the unpredictable camera ego-motion, resulting from the fact that the camera
is on board an aerial platform, distorts the spatio-temporal correlation of the video
sequence, negatively affecting the tracking performance.
All the aforementioned problems are addressed by a tracking strategy based on
the Bayesian tracking framework for moving cameras proposed in Sec. 3.2. Accordingly,
the posterior pdf of the state vector $p(x_t|z_{1:t})$ is recursively computed by
Eqs. 3.10 and 3.11.
The transition probability $p(x_t|x_{t-1})$, which encodes the joint camera and object
dynamic model, is given by Eq. 3.14, where the prior probability $p(g_t)$ of the geometric
transformation depends on the specific type of imagery. For the present tracking
application dealing with infrared imagery, the probability $p(g^j_t)$ of a specific
geometric transformation $g^j_t$ is based on the quality of the image alignment between
consecutive frames achieved by $g^j_t$. The quality of the image alignment (or the
ego-motion compensation) is computed by means of the Mean Square Error function,
mse(x, y), between the current frame $I_t$ and the previous frame $I_{t-1}$ warped by the
transformation $g^j_t$. Thus, the probability $p(g^j_t)$ is mathematically expressed as
$$p(g^j_t) = \mathcal{N}\left(\mathrm{mse}\left(I_t,\ g^j_t \cdot I_{t-1}\right);\ 0,\ \sigma^2_g\right) \tag{3.18}$$
where $\mathcal{N}(x; \mu, \sigma^2)$ is a Gaussian distribution of mean $\mu$ and variance
$\sigma^2$, and $\sigma^2_g$ is the expected variance of the image alignment process.
Notice that $I_t$ is an infrared intensity image.
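A minimal sketch of how Eq. 3.18 could be evaluated for a set of candidates is given below. It assumes that the previous frame has already been warped by each candidate transformation (the warping itself is not shown), that images are nested lists with intensities in [0, 1], and that the resulting values are normalized to act as the weights of Eq. 3.17. The function names are hypothetical.

```python
import math

def mse(image_a, image_b):
    """Mean square error between two images of equal size (nested lists)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(image_a, image_b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def candidate_weights(current, warped_previous, sigma_g2=0.03):
    """Evaluate N(mse; 0, sigma_g^2) per candidate and normalize to sum to one."""
    scores = [gaussian_pdf(mse(current, warped), 0.0, sigma_g2)
              for warped in warped_previous]
    total = sum(scores)
    return [s / total for s in scores]

current = [[0.0, 1.0], [1.0, 0.0]]
warped_candidates = [
    [[0.0, 1.0], [1.0, 0.0]],  # perfect alignment, mse = 0
    [[1.0, 0.0], [0.0, 1.0]],  # completely misaligned, mse = 1
    [[0.0, 1.0], [1.0, 1.0]],  # one wrong pixel, mse = 0.25
]
weights = candidate_weights(current, warped_candidates)
```

The better a candidate aligns the two frames, the smaller its mse and the larger its weight; the default value 0.03 is the setting of the variance reported in Sec. 3.3.2.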
Finding an observation model for the likelihood $p(z_t|x_t)$ in airborne infrared imagery
that appropriately describes the object appearance and its variations over time
is quite challenging due to the aforementioned characteristics of infrared imagery.
The most robust and reliable object property is the presence of bright regions, or at
least, regions that are brighter than their surrounding neighborhood, which typically
Figure 3.2: Two consecutive frames, (a) and (b), of an infrared sequence acquired by an
airborne camera.
correspond to the engine and exhaust pipe area of the object. Based on this fact,
the likelihood function uses an observation model that aims to detect the main bright
regions of the target. This is accomplished by a rotationally symmetric Laplacian of
Gaussian (LoG) filter, characterized by a sigma parameter that is tuned to the smallest
dimension of the object, so that the filter response is maximal in bright
regions with a size similar to that of the tracked object. The main handicap of the
observation model is its lack of distinctiveness, since any bright region of adequate
size can be the target object. As a consequence, the resulting LoG filter response is
strongly multi-modal. This fact, coupled with the camera ego-motion, dramatically
complicates a reliable estimation of the state vector. This situation is illustrated in
Figs. 3.2 and 3.3. Fig. 3.2 shows two consecutive frames, (a) and (b),
of an infrared sequence acquired by an airborne camera, in which the target object
has been enclosed by a rectangle. Fig. 3.3 shows the LoG filter response related to
Fig. 3.2(b), where the image itself has been projected over the filter response for a
better interpretation. The multi-modality is clearly observed, and in theory
any of the modes could be the right object position. Moreover, if only the object
dynamics is considered, the closest mode to the predicted object location (marked by a
vertical black line) is not the true object location, because of the effects of the camera
ego-motion.
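The bright-region detector described above can be sketched as follows. This is only an illustration: the kernel is truncated at three sigma, it is negated so that bright blobs give positive peaks, image borders are replicated, and the function names are hypothetical. A real implementation would normally rely on an optimized image processing library.

```python
import math

def negated_log_kernel(sigma):
    """Negated Laplacian of Gaussian, so bright blobs give positive responses."""
    radius = int(math.ceil(3 * sigma))
    size = 2 * radius + 1
    kernel = [[(2 * sigma**2 - (x * x + y * y)) / sigma**4
               * math.exp(-(x * x + y * y) / (2 * sigma**2))
               for x in range(-radius, radius + 1)]
              for y in range(-radius, radius + 1)]
    # Remove the residual mean so that flat regions respond with (nearly) zero.
    mean = sum(map(sum, kernel)) / size**2
    return [[v - mean for v in row] for row in kernel], radius

def blob_response(image, sigma):
    """Correlate the image with the negated LoG kernel, replicating the borders."""
    kernel, radius = negated_log_kernel(sigma)
    rows, cols = len(image), len(image[0])
    response = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for ky in range(-radius, radius + 1):
                rr = min(max(r + ky, 0), rows - 1)
                for kx in range(-radius, radius + 1):
                    cc = min(max(c + kx, 0), cols - 1)
                    acc += kernel[ky + radius][kx + radius] * image[rr][cc]
            response[r][c] = acc
    return response

# Toy frame: a 3x3 bright square on a dark background.
image = [[0.0] * 15 for _ in range(15)]
for r in range(6, 9):
    for c in range(6, 9):
        image[r][c] = 1.0
response = blob_response(image, sigma=1.5)
peak = max((response[r][c], (r, c)) for r in range(15) for c in range(15))[1]
```

In this toy frame the strongest response falls on the center of the bright square. In real infrared frames every bright region of similar size produces a comparable peak, which is the multi-modality discussed in the text.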
Figure 3.3: Multimodal LoG filter response related to Fig. 3.2(b).
Figure 3.4: Likelihood distribution related to Fig. 3.3.
The likelihood probability can be simplified as
$$p(z_t|x_t) = p(z_t|d_t, g_t) = p(z_t|d_t), \tag{3.19}$$
assuming that $z_t$ is conditionally independent of $g_t$ given $d_t$. Then, $p(z_t|d_t)$
is expressed by the Gaussian distribution
$$p(z_t|d_t) = \mathcal{N}(z_t;\ H d_t,\ \sigma^2_L), \tag{3.20}$$
where $z_t$ is the LoG filter response of the frame $I_t$, $H$ is a matrix that selects the
object positional information, and the variance $\sigma^2_L$ is set to highlight the main
modes of $z_t$, while discarding the less significant ones. This is illustrated in Fig. 3.4,
where only the most significant modes of Fig. 3.3 are highlighted.
As both the dynamic and observation models are nonlinear and non-Gaussian, the
posterior pdf cannot be determined analytically, and therefore approximate
inference methods are necessary. In the next section, a Particle Filtering strategy is
presented to obtain an approximate solution of the posterior pdf.
3.3.1 Particle filter approximation
The posterior pdf $p(x_t|z_{1:t})$ is approximated by means of a Particle Filter as
$$p(x_t|z_{1:t}) \approx \frac{1}{c} \sum_{i=1}^{N_S} w^i_t\, \delta(x_t - x^i_t), \tag{3.21}$$
where the samples $x^i_t$ are drawn from a proposal distribution based on the likelihood
and the prior probability of the camera motion
$$q(x_t|x_{t-1}, z_t) = p(z_t|d_t)\, p(g_t), \tag{3.22}$$
which is an efficient simplification of the optimal, but not tractable, importance density
function (9)
$$q(x_t|x_{t-1}, z_t) = p(x_t|x_{t-1}, z_t). \tag{3.23}$$
The samples $x^i_t = \{d^i_t, g^i_t\}$ are drawn from the proposal distribution by a
hierarchical sampling strategy, which first draws samples $g^i_t$ from $p(g_t)$ and then
draws samples $d^i_t$ from $p(z_t|d_t)$. The sampling procedure for obtaining samples
$g^i_t$ from $p(g_t)$ is based on the image registration algorithm presented in (82). This
method assumes an initial geometric transformation $t^i_t$, and then uses the whole image
intensity information to compute a global affine transformation $g^i_t$, which is a
candidate for representing the true camera motion. The method explicitly accounts for
global variations in image intensities to be robust to illumination changes. However, the
computed candidate $g^i_t$ will only be a reasonable approximation of the camera motion if
the initial geometric transformation $t^i_t$ is close to the geometric transformation that
represents the actual camera motion. This means that the image in the previous time step,
warped by the initial transformation, must be closely aligned to the current image to
achieve a satisfactory result. This limitation derives from the optimization strategy used
in the image registration algorithm, which converges to the closest mode given an initial
transformation.
As a consequence, if the two images are not closely aligned, the computed solution will
probably correspond to a local mode that does not represent the true camera motion.
By default, $t^i_t$ is a 3 × 3 identity matrix that represents the previous image without
warping. This approach is inefficient in airborne visual tracking, since the camera can
undergo strong displacements that cannot be satisfactorily compensated. To overcome
this problem, the previous image registration technique has been improved by using several
initial geometric transformations $\{t^i_t \,|\, i = 1, ..., N_S\}$, obtaining in turn a set
of camera ego-motion candidates $\{g^i_t \,|\, i = 1, ..., N_S\}$. The set of initial
transformations is computed so that at least one of them is relatively close to the actual
camera motion, and the image registration algorithm can therefore compute the correct
geometric transformation. In this context, the concept of closeness between geometric
transformations depends, on the one hand, on the magnitude of the camera motion,
and, on the other hand, on the capability of the image registration algorithm itself to
rectify misaligned images. For example, the ideal situation would be that the magnitude
of the camera motion was lower than the maximum displacement that the image
registration algorithm is able to rectify. For the purpose of measuring the magnitude
of the camera motion, a subset of video sequences belonging to the AMCOM dataset
(see Sec. 3.3.2) has been used as a training set to compute the actual camera motion.
These sequences have been acquired by different infrared cameras on board a plane.
The computation of the camera motion has been supervised by a user, who
not only guides the image alignment, but also evaluates whether the result is accurate
enough to be considered the real camera motion. As a result, a set of affine
transformations is obtained, which describes the typical camera movements. Regarding the
image registration algorithm, the following benchmark has been performed to measure
its capability to align images. An image is warped by different affine transformations
of increasing magnitude until the image registration algorithm is not able to rectify the
induced motion. In addition, since the rectification capability depends on the scene
structure, this process has been performed with a set of different images belonging to
several sequences of the AMCOM dataset. As a result, two sets of affine transformations
are obtained, one composed of the camera movements that the algorithm is able to rectify,
and another one containing the camera movements that it is not. Analyzing the affine
parameters of these sets of transformations, it can be concluded that the image
registration algorithm can deal with the scale, rotation, and shear transformations that
take place in the AMCOM video sequences. However, it cannot satisfactorily rectify
all the translation transformations. Specifically, when the translation components of
the affine transformation are larger than 7-8 pixels, the computed camera motion can
be quite inaccurate.
Taking into account the previous conclusions, the set of initial transformations $t^i_t$
is computed as
$$\left\{ t^i_t = \begin{pmatrix} 1 & 0 & (t_x)^i_t \\ 0 & 1 & (t_y)^i_t \\ 0 & 0 & 1 \end{pmatrix},\ i = 1, ..., N_S \right\}, \tag{3.24}$$
where $(t_x)^i_t$ and $(t_y)^i_t$ are the translation components, obtained by uniformly
sampling the x-y coordinate space in such a way that at least one $t^i_t$ will be close to
the actual camera motion. Thus, the corresponding $g^i_t$ obtained by the image
registration algorithm using that $t^i_t$ is an accurate representation of the camera
ego-motion. The maximum sampling step should be less than 7-8 pixels, so that at least one
$g^i_t$ accurately represents the camera motion. For the present tracking application, the
sampling step has been fixed to 5 pixels, which is a good trade-off between the accuracy
of the estimated camera motion and the computational cost related to the estimation of
$\{g^i_t \,|\, i = 1, ..., N_S\}$. Note that the rest of the parameters of the affine
transformation are equal to those of the identity matrix, meaning that there is no scale,
rotation or shear warping with respect to the previous image frame, since, as stated
before, the image registration algorithm can deal with the magnitude of these kinds of
transformations induced by the camera motion.
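The construction of the initial transformations in Eq. 3.24 can be sketched as below. The 21×21 grid size and the 5 pixel step are the values used in the example of Figs. 3.5 and 3.6; the row-major ordering of the grid and the function name `initial_translations` are assumptions made for this illustration.

```python
def initial_translations(grid_size=21, step=5.0):
    """Build a grid of pure-translation 3x3 matrices centered on the identity."""
    half = grid_size // 2
    transforms = []
    for row in range(grid_size):          # vertical translation index
        for col in range(grid_size):      # horizontal translation index
            tx = (col - half) * step
            ty = (row - half) * step
            transforms.append([[1.0, 0.0, tx],
                               [0.0, 1.0, ty],
                               [0.0, 0.0, 1.0]])
    return transforms

transforms = initial_translations()
center = transforms[220]   # the 221st transformation when counting from one
```

With these settings the grid holds 441 candidates covering translations of up to 50 pixels in each direction, and the central entry is the identity matrix, i.e. the default initialization of the original registration algorithm.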
Figs. 3.5 and 3.6 illustrate an example of the sampling procedure from $q(g_t) = p(g_t)$.
Fig. 3.5 shows a set of translational transformations, $\{t^i_t, i = 1, ..., 441\}$,
arranged in a 21×21 grid, that are used as initial transformations for the image
registration technique. The sampling step for the translational components, $(t_x)^i_t$
and $(t_y)^i_t$, is set to 5 pixels, as stated before. Fig. 3.6 shows the probability value
of each affine transformation $g^i_t$ given by the image registration algorithm using the
previous initial transformations. The probability values $p(g^i_t)$ are arranged in the
same way as the previous grid of initial transformations, and are encoded with a color
scale. Notice that the default solution obtained by the original image registration
algorithm (82) (which uses the identity matrix as initial transformation) corresponds to
$g^{221}_t$, located exactly in the middle of the grid. This affine transformation does not
represent a good approximation for the
Figure 3.5: Set of translational transformations used as initialization transformations in
the image registration method.
Figure 3.6: Probability values of $\{g^i_t, i = 1, ..., 441\}$, representing the camera
ego-motion candidates.
camera motion, since the related probability value is much lower than the one
corresponding to $g^{74}_t$, which is the global maximum and, in this case, the best
representation of the camera motion.
In a typical situation, only a few affine transformations of the whole set
$\{g^i_t \,|\, i = 1, ..., N_S\}$ will be precise enough to actually represent the camera
motion, due to the particularities of the proposed sampling procedure. This can be checked
by evaluating the probability value of each affine transformation, $p(g^i_t)$, where only
the transformations that are accurate enough have a significant value. To alleviate this
situation, a resampling stage is performed using the SIR algorithm to place more samples
$g^i_t$ over the regions of the affine space that concentrate the majority of the
probability. An alternative approach could be to replicate the sample with the highest
probability value, since it should represent the most accurate camera motion. In this
case, the sampling procedure would be equivalent to using an optimization approach based
on a stochastic search, since only the best sample is used. However, in situations with
independently moving objects, this assumption does not hold, since the moving objects bias
the camera ego-motion estimation. As a consequence, the actual camera motion could be
represented by a $g^i_t$ of lower probability. For this reason, a probabilistic
representation of the camera motion, based on discrete samples, is more efficient than a
deterministic approach based on the estimation of the best transformation that represents
the camera motion.
Once the samples of the camera dynamics $g^i_t$ are computed, the samples of the
object dynamics $d^i_t$ are drawn from the likelihood $p(z_t|d_t)$ (Eq. 3.22), to finally
obtain $x^i_t = \{d^i_t, g^i_t\}$. This is a convenient decision, since the main modes of
the posterior distribution also appear in the likelihood function. Sampling from the
likelihood function is not a trivial task, since it is a bivariate function composed of
narrow modes (see Fig. 3.4). To deal with this issue, a Markov Chain Monte Carlo (MCMC)
sampling method is proposed, which is able to efficiently represent the likelihood
function by a reduced number of samples. The MCMC approach generates a sequence of samples
$\{d^i_t, i = 1, ..., N_S\}$ by means of a Markov Chain, in such a way that the stationary
distribution is exactly the target distribution. The Metropolis-Hastings algorithm (79; 83)
is a method of this type that uses a proposal distribution for simulating such a
chain. The appropriate selection of the proposal distribution is the key to the efficient
sampling of the target distribution. For sampling the likelihood $p(z_t|d_t)$,
a Gaussian function with zero mean and a variance proportional to the smallest
dimension of the tracked object has proven to be efficient. Another fundamental issue
is the initialization of the Markov Chain. Since the likelihood function concentrates
almost all the probability in a few sparse regions of the state space (i.e. in its sparse
Figure 3.7: Metropolis-Hastings sampling of the likelihood distribution depicted in
Fig. 3.4.
narrow modes), a single Markov Chain needs a large number of samples to simulate it
correctly. A more efficient approach is to use a set of Markov Chains, with different
initialization states given by the main local maxima of the likelihood distribution. In
this way, the likelihood is efficiently simulated by a reduced number of samples located
on the main modes.
Fig. 3.7 shows the result of applying the proposed Metropolis-Hastings sampling
algorithm to simulate the likelihood distribution depicted in Fig. 3.4. The samples
have been marked with circles. Notice that the samples lie on the main modes of the
likelihood distribution, in spite of the relatively low number of samples used.
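The multi-chain strategy can be sketched as follows, assuming a random-walk Metropolis-Hastings sampler with a zero-mean Gaussian proposal and one chain started at each main mode. The toy likelihood, its mode positions and all function names are hypothetical; in the tracker the starting points would be the main local maxima of the LoG based likelihood.

```python
import math
import random

def random_walk_mh(target, start, n_samples, step_std, rng):
    """Random-walk Metropolis-Hastings in 2D with a symmetric Gaussian proposal."""
    current, p_current = start, target(start)
    chain = []
    for _ in range(n_samples):
        candidate = (current[0] + rng.gauss(0.0, step_std),
                     current[1] + rng.gauss(0.0, step_std))
        p_candidate = target(candidate)
        # Symmetric proposal: accept with probability min(1, p_candidate / p_current).
        if p_current <= 0.0 or rng.random() < p_candidate / p_current:
            current, p_current = candidate, p_candidate
        chain.append(current)
    return chain

def sample_from_modes(target, mode_starts, samples_per_chain, step_std, rng):
    """Run one short chain per starting mode and pool the samples."""
    samples = []
    for start in mode_starts:
        samples.extend(random_walk_mh(target, start, samples_per_chain, step_std, rng))
    return samples

# Toy bimodal likelihood with two narrow, well separated modes.
MODES = [(10.0, 10.0), (40.0, 25.0)]

def toy_likelihood(p):
    return sum(math.exp(-((p[0] - mx) ** 2 + (p[1] - my) ** 2) / (2 * 1.5 ** 2))
               for mx, my in MODES)

samples = sample_from_modes(toy_likelihood, MODES, 50, 1.0, random.Random(1))
```

One caveat of this sketch is that giving every chain the same number of samples does not reproduce the relative probability mass of the modes; it only guarantees that each main mode is covered with few samples.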
The weights $w^i_t$ related to the drawn samples $x^i_t = \{d^i_t, g^i_t\}$ are computed
using Eq. 3.9, which can be simplified using the expression of the proposal
distribution as
$$\begin{aligned}
w^i_t &= w^i_{t-1}\, \frac{p(z_t|x^i_t)\, p(x^i_t|x^i_{t-1})}{q(x^i_t|x^i_{t-1}, z_t)}
= w^i_{t-1}\, \frac{p(z_t|d^i_t)\, p(d^i_t|d^i_{t-1}, g^i_t)\, p(g^i_t)}{p(z_t|d^i_t)\, p(g^i_t)} \\
&= w^i_{t-1}\, p(d^i_t|d^i_{t-1}, g^i_t).
\end{aligned} \tag{3.25}$$
Thus, the samples that best fit the joint camera and object dynamic model
will have more relevance than the rest.
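A minimal sketch of the weight update in Eq. 3.25 is shown below. It assumes that the predicted mean of each sample (the camera transformation applied to the constant-velocity prediction) has been computed elsewhere, that the Gaussian of Eq. 3.15 is isotropic, and that the weights are normalized afterwards. The function names are hypothetical.

```python
import math

def isotropic_gaussian(x, mean, var):
    """Density of an isotropic Gaussian with variance var in every dimension."""
    k = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-sq_dist / (2.0 * var)) / (2.0 * math.pi * var) ** (k / 2.0)

def update_weights(prev_weights, samples, predicted_means, sigma_tr2):
    """w_t = w_{t-1} * p(d_t | d_{t-1}, g_t), followed by normalization."""
    raw = [w * isotropic_gaussian(d, m, sigma_tr2)
           for w, d, m in zip(prev_weights, samples, predicted_means)]
    total = sum(raw)
    return [r / total for r in raw]

prev_weights = [0.5, 0.5]
samples = [(16.0, 15.0), (19.0, 15.0)]          # drawn object positions
predicted_means = [(16.0, 15.0), (16.0, 15.0)]  # predictions of the joint dynamic model
weights = update_weights(prev_weights, samples, predicted_means, sigma_tr2=2.0)
```

The sample that coincides with the prediction keeps most of the weight, while the one lying 3 pixels away is strongly penalized.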
After computing the samples and weights to approximate the posterior distribution,
a resampling step is applied to reduce the degeneracy problem. This is accomplished
by means of the SIR algorithm (Sec. 3.1.1), which selects more often the samples with
Figure 3.8: Particle Filtering based approximation of the posterior probability
$p(x_t|z_{1:t})$.
Figure 3.9: SIR resampling of $p(x_t|z_{1:t})$.
higher weights, while the ones with an insignificant weight are discarded. After SIR
resampling, all the samples have the same weight.
Figs. 3.8 and 3.9 show the estimated posterior probability, $p(x_t|z_{1:t})$, and the
result of applying the SIR resampling, respectively. Notice that the samples corresponding
to modes related to background structures have a lower weight than the ones related to
the tracked object, due to the coherence with the expected camera and object
dynamics.
In general terms, the resulting posterior probability can be quasi-unimodal (if there
is only one significant mode) or multimodal. This depends on the distance between
the mode corresponding to the tracked object and the modes relative to the background
in the likelihood function. While for a quasi-unimodal posterior probability
the state estimation can be efficiently performed by means of the MMSE estimator, for
a multimodal posterior probability the MMSE estimator does not produce
a satisfactory estimation, since the background modes bias the result. To avoid such
a bias in the estimation, the MMSE estimator should only use the samples relative to
the tracked object mode, discarding the rest. This is achieved by means of a bivariate
Gaussian kernel $\mathcal{N}(x; \mu_e, \Sigma_e)$ of mean $\mu_e$ and covariance matrix
$\Sigma_e$ (79), which gives more relevance to the samples located close to the Gaussian
mean. In this way, when the Gaussian mean is centered over the tracked object mode, only
its samples will have a significant value. The proposed Gaussian-MMSE estimator is
mathematically expressed as
$$\acute{x}_t = \max\left\{ \frac{1}{N_S} \sum_{l=1}^{N_S} \sum_{i=1}^{N_S} \mathcal{N}(x^i_t;\ x^l_t,\ \Sigma_e)\, p(x^i_t|z_{1:t}) \right\}, \tag{3.26}$$
where the covariance matrix $\Sigma_e$ determines the bandwidth of the Gaussian kernel,
which must be coherent with the size of the tracked object mode. Taking into account the
relationship between the size of the tracked object mode and the bandwidth of the
LoG filter used in the object detection (Sec. 3.3), which in turn was set according to
the object size, an efficient covariance matrix can be estimated as
$$\Sigma_e = \begin{pmatrix} s_x/2 & 0 \\ 0 & s_y/2 \end{pmatrix}, \tag{3.27}$$
where $s_x$ and $s_y$ are the width and height of the object, respectively, which are the
same parameters as the ones used in the LoG based object detector.
Fig. 3.10 shows the result of applying the Gaussian kernel over $p(x_t|z_{1:t})$, along
with the maximum corresponding to the final estimation $\acute{x}_t$, which has been marked
by a black circle. Fig. 3.11 shows the tracked object accurately enclosed by a white
rectangle corresponding to the estimated $\acute{x}_t$.
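The Gaussian-MMSE idea can be illustrated with the sketch below. It is one possible reading of Eq. 3.26, not necessarily the exact procedure of the thesis: every sample is tried as the kernel center, the kernel-weighted sum of the posterior weights is computed, and the center with the highest score is returned. Only the positional components are used, the covariance is the diagonal matrix of Eq. 3.27, and the function name is hypothetical.

```python
import math

def gaussian_mmse_estimate(samples, weights, s_x, s_y):
    """Return the sample whose Gaussian neighborhood gathers the most weight."""
    var_x, var_y = s_x / 2.0, s_y / 2.0            # diagonal of Sigma_e (Eq. 3.27)
    norm = 1.0 / (2.0 * math.pi * math.sqrt(var_x * var_y))
    best_score, best_sample = -1.0, None
    for center in samples:                         # candidate kernel center
        score = 0.0
        for sample, w in zip(samples, weights):    # kernel-weighted posterior mass
            dx, dy = sample[0] - center[0], sample[1] - center[1]
            score += w * norm * math.exp(-0.5 * (dx * dx / var_x + dy * dy / var_y))
        score /= len(samples)
        if score > best_score:
            best_score, best_sample = score, center
    return best_sample

# Three moderately weighted samples on the object mode, one heavy background sample.
samples = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (50.0, 50.0)]
weights = [0.2, 0.2, 0.2, 0.4]
estimate = gaussian_mmse_estimate(samples, weights, s_x=4.0, s_y=4.0)
```

In this toy case the estimate lands on the cluster of samples rather than on the isolated sample with the largest individual weight, and it is not dragged toward the background sample as a plain weighted mean would be.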
3.3.2 Results
The proposed object tracking algorithm has been tested using the AMCOM dataset,
which consists of 40 infrared sequences acquired from cameras mounted on an airborne
platform. A variety of moving and stationary terrestrial targets can be found in two
different wavelength bands: mid-wave (3 µm - 5 µm) and long-wave (8 µm - 12 µm). In
general, the tracking task is quite challenging in this dataset due to the strong camera
ego-motion, the magnification and pose variations of the target signatures, and the
inherent characteristics of the FLIR imagery described in Sec. 3.3.
Figure 3.10: Result of applying the Gaussian kernel over $p(x_t|z_{1:t})$ (depicted in
Fig. 3.8), along with the final state estimation $\acute{x}_t$ marked by a black circle.
Figure 3.11: Tracked object accurately enclosed by a white rectangle.
The proposed object tracking algorithm has been compared with two other tracking
algorithms to demonstrate its superior performance. Both algorithms are inspired by
the existing works (25; 26), which also use a Particle Filter framework for the tracking,
making it easier and fairer to compare the performance of the three algorithms.
The three algorithms differ in the way they tackle the ego-motion: Bayesian modeling,
deterministic modeling, and no explicit modeling. The developed algorithm uses
a Bayesian model for the ego-motion, and it is referred to as tracking with Bayesian
ego-motion handling (BEH). In this model, $p(g_t)$ is approximated by a set of discrete
samples (Sec. 3.3.1). The second algorithm is based on a deterministic modeling, and
is referred to as tracking with deterministic ego-motion handling (DEH). It models the
ego-motion by only one affine transformation, which is equivalent to expressing $p(g_t)$
by a Kronecker's delta centered at $g^d_t$, an affine transformation deterministically
computed
through the image registration algorithm described in (82). The last algorithm,
referred to as tracking with no ego-motion handling (NEH), has no explicit model for
the camera ego-motion, which leads to a simplified expression of the state transition
probability
$$p(x_t|x_{t-1}) = \mathcal{N}\left(d_t;\ M \cdot d_{t-1},\ \sigma^2_{tr}\right), \tag{3.28}$$
where the value of the parameter $\sigma^2_{tr}$ should be larger than for the BEH and DEH
algorithms to try to alleviate the ego-motion effect.
With the purpose of making a fair comparison, the same number of samples has
been used for the three algorithms: $N_S = 300$. This number is enough to ensure a
satisfactory approximation of the state posterior probability given the specific
characteristics of the AMCOM dataset. In the same way, the same value $\sigma^2_{tr} = 2$
has been chosen for the BEH and DEH algorithms, while a value of $\sigma^2_{tr} = 4$ has
been chosen for the NEH algorithm, in order to alleviate its lack of an explicit ego-motion
model and make it comparable with the other algorithms. The BEH algorithm needs an
extra parameter, which has been heuristically set to $\sigma^2_g = 0.03$, offering good
results for the given AMCOM dataset, although other values that differ from it by less
than 15 percent have offered similar results.
In the two following subsections, two different tracking situations are evaluated
to demonstrate the higher performance of the BEH algorithm in complex ego-motion
situations. The last subsection presents the overall tracking results for each of the
three algorithms using the aforementioned AMCOM dataset.
3.3.2.1 Strong ego-motion situation
The BEH algorithm has been compared with the DEH and NEH algorithms in a situation
of strong ego-motion. Fig. 3.12 shows the common intermediate results for all
three algorithms: Fig. 3.12(a) and (b) show two consecutive frames that have
undergone a large displacement, in which the target object has been enclosed by a
black rectangle as a visual aid; Fig. 3.12(c) shows the multimodal likelihood function;
and Fig. 3.12(d) shows the resulting Metropolis-Hastings based sampling, where
each sample has been marked by a black circle.
Figure 3.12: Common intermediate results for all the three tracking algorithms in a
situation of strong ego-motion: (a) frame 15, (b) frame 16, (c) likelihood $p(z_t|x_t)$,
(d) sampling.
Fig. 3.13 shows the tracking results for the BEH algorithm. The probability values
of $g^i_t$ (the estimated affine transformations) are shown in Fig. 3.13(a), arranged in a
rectangular grid in a similar way to that of Fig. 3.6 and displayed using a color scale.
Notice that there is a peak in the middle left side, indicating that the camera has
undergone a strong rightward translation. Fig. 3.13(b) shows the sampled posterior
probability, where the samples $d^i_t$ with higher weights are correctly located over the
target object, thanks to the Bayesian treatment of the camera ego-motion. Fig. 3.13(c)
shows the result of applying the Gaussian kernel over the sampled posterior probability,
which is used by the Gaussian-MMSE estimator to compute the final state estimation (marked
by a black circle). Finally, Fig. 3.13(d) shows the target object satisfactorily enclosed
by a white rectangle, whose coordinates are determined by the state estimation. Observe
that the infrared image is projected over the X-Y plane of each probability distribution
as a visual aid.
Figure 3.13: Tracking results for the BEH algorithm in a situation of strong ego-motion:
(a) probability values of $g^i_t$, (b) sampled posterior probability, (c) Gaussian-MMSE
based estimation, (d) object tracking result.
Figs. 3.14 and 3.15 show the tracking results for the DEH and NEH algorithms,
respectively. Notice that the tracking fails in both cases, since the dynamic model does
not correctly represent the camera and object dynamics, and consequently the tracking
drifts to another mode of the likelihood function. In the case of the DEH algorithm, this
can be checked by observing that the estimated affine transformation corresponds
to the one located at coordinates (11, 11) of Fig. 3.13(a), which has a probability
value much lower than the one related to the true camera motion.
Figure 3.14: Tracking results for the DEH algorithm in a situation of strong ego-motion:
(a) sampled posterior probability, (b) Gaussian-MMSE based estimation, (c) object tracking
result.
3.3.2.2 High uncertainty ego-motion situation
The tracking performance of the three algorithms (BEH, DEH, and NEH) is compared in a
situation where the ego-motion estimation is especially challenging due to the aperture
problem (the frames have very little texture). Fig. 3.16 shows common intermediate
results: the first column shows five consecutive frames, where the target object has been
enclosed by a black rectangle as a visual aid, and the following two columns show the
resulting multimodal likelihood function and the Metropolis-Hastings based sampling for
each frame, respectively.
Fig. 3.17 shows the tracking results for the BEH algorithm. The probability values of
$g^i_t$ (the estimated affine transformations) are shown in the first column, which have
been arranged in a rectangular grid in a similar way to that of Fig. 3.6. The probability
values
39
3. BAYESIAN TRACKING WITH MOVING CAMERAS
(a) Sampled posterior probability (b) Gaussian-MMSE based estimation
(c) Object tracking result
Figure 3.15: Tracking results for the NEH algorithm in a situation of strong ego-motion.
are displayed using a color scale. Notice that there is not a well defined peak, unlike the
strong ego-motion situation (Fig. 3.13(a)), but there is a set of affine transformation
candidates with similar probability values, meaning that whatever of them could be
the true camera motion. The affine transformations with higher probability value are
located in the horizontal direction, indicating that the aperture problem is especially
significant in that direction. In other words, the horizontal translation of the camera
motion can not be reliably computed between consecutive frames. The second column
of Fig. 3.17 shows the sampled posterior probability related to each frame. Notice
that there are several samples with high weights that are not located over the target
object, as a consequence of the high uncertainty in the camera ego-motion estimation.
However, the majority of samples that have a high weight are located over the target
object, allowing to track it satisfactorily. This fact can be verified by observing the
last two columns, which respectively show the Gaussian-MMSE estimation, and the
tracking result, where the target object has been satisfactorily enclosed by a rectangle
(whose coordinates are determined by the state estimation).
Figs. 3.18 and 3.19 show the tracking results for the DEH and NEH algorithms,
arranged in the same way as Fig. 3.17. Observe that the tracking fails in frame
172 for the DEH algorithm, and in frame 171 for the NEH algorithm. These failures
arise from the accumulation of slight errors in the estimation of the object location,
which, in turn, are caused by the poor characterization of the camera ego-motion.
Figure 3.16: Common intermediate results for all the three tracking algorithms in a
situation where the ego-motion compensation is especially challenging due to the aperture
problem. Rows: frames 168 to 172. Columns: frame, likelihood $p(z_t|x_t)$, sampling.
Figure 3.17: Tracking results for the BEH algorithm in a situation where the ego-motion
compensation is especially challenging due to the aperture problem. Columns: probability
values of $g_t^i$, sampled posterior probability, Gaussian-MMSE estimation, object tracking result.
3.3.2.3 Global tracking results
Global performance results of the BEH, DEH, and NEH algorithms on the AMCOM
dataset are shown in Tab. 3.1. The first two columns show the sequence name and the
target name, respectively. The third, fourth, and fifth columns show the first frame,
the last frame, and the number of consecutive frames in which the target appears.
The remaining columns show the performance of the three algorithms, measured
by the number of tracking failures and the tracking accuracy. The number
of failures indicates the number of times that the target object has been lost. An
object is considered lost when the rectangle that encloses the object does not
overlap the one given by the ground truth. The tracking
accuracy is defined as the average Euclidean distance between the estimated
object locations and the locations given by the ground truth, so a lower value
means a better accuracy. It is important to note that the tracking algorithms are
not reinitialized after a tracking failure, since one of the most appealing advantages of
the Particle Filter framework is its capability of recovering from tracking failures thanks
to the handling of multiple hypotheses (or samples). This affects the tracking
accuracy measurement, since all the erroneous object locations derived from tracking
failures are taken into account in its computation. Therefore, the sequences with
many tracking failures have a much worse tracking accuracy.
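As an illustration of the two measures described above, the following minimal sketch computes them for one sequence. It assumes, hypothetically, that rectangles are axis-aligned tuples `(x_min, y_min, x_max, y_max)`, that the object location is the rectangle center, and that failures are counted per frame; none of these details is stated in the text, and the function names are illustrative only.

```python
import math

def boxes_overlap(a, b):
    """True if two axis-aligned boxes (x_min, y_min, x_max, y_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def box_center(box):
    """Center of a box, used here as the object location (an assumption)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def evaluate_track(estimated, ground_truth):
    """Return (number of failures, mean center distance) over a sequence."""
    failures = 0
    total = 0.0
    for est, gt in zip(estimated, ground_truth):
        # A frame counts as a failure when the boxes do not overlap at all.
        if not boxes_overlap(est, gt):
            failures += 1
        (ex, ey), (gx, gy) = box_center(est), box_center(gt)
        # Lost frames are kept in the average, as in the text (no reinitialization).
        total += math.hypot(ex - gx, ey - gy)
    return failures, total / len(estimated)
```

Because lost frames are not excluded from the average, a sequence with many failures accumulates large distances, which matches the remark that such sequences show a much worse accuracy.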
From the analysis of the number of tracking failures, it can be summarized that there
are 11 situations (sequences) in which the BEH algorithm outperforms the DEH one,
and 16 situations in which the BEH algorithm outperforms the NEH one. Regarding
the DEH algorithm, there are 11 situations in which it outperforms the NEH one, and
3 situations in which it outperforms the BEH algorithm. Lastly, there are 4 situations in
which the NEH algorithm outperforms the DEH one, and no situation in which it
outperforms the BEH algorithm. In the rest of the situations, the performance is similar
for the three algorithms. To sum up, the BEH algorithm is the best of all, and
the DEH algorithm is better than the NEH one, as expected. The errors obtained
by the DEH and NEH algorithms arise from the poor characterization of the camera
ego-motion, which is satisfactorily solved by the BEH algorithm.
Figure 3.18: Tracking results for the DEH algorithm in a situation where the ego-motion
compensation is especially challenging due to the aperture problem. Columns: sampled
posterior probability, Gaussian-MMSE estimation, object tracking result.
Figure 3.19: Tracking results for the NEH algorithm in a situation where the ego-motion
compensation is especially challenging due to the aperture problem. Columns: sampled
posterior probability, Gaussian-MMSE estimation, object tracking result.
The tracking accuracy results follow the same trend. An interesting fact
arises when the ego-motion is quite low: the tracking accuracy of the DEH algorithm
is slightly better than that of the BEH one. The reason is that a deterministic approach
introduces less uncertainty than a Bayesian approach in situations in which it is possible
to reliably compute the camera motion. Nonetheless, the improvement is insignificant, and in
addition, the BEH algorithm is able to cope with a wider range of situations than the
rest.
There is one situation in which none of the three algorithms can ensure a correct
tracking. This situation arises when the likelihood distribution has false modes very
close to the true one (corresponding to the tracked object), and the apparent motion
of the tracked object is very low. Under these circumstances, the tracker can become locked
on a false mode.
3.4 Object tracking in aerial and terrestrial visible imagery
This section presents an object tracking approach for visual-range imagery. The
theoretical basis is almost the same as that of the object tracking for infrared imagery (Sec. 3.3).
The main differences stem from the specific characteristics of color (visual-range)
imagery, which are exploited to build a robust appearance model of the tracked object.
Also, the estimation of the camera motion is handled in a specific way since, in
contrast to infrared imagery, it is possible to use feature-based image registration
techniques. These techniques have several advantages in comparison with the region/area
based ones, such as a lower computational cost, a higher robustness to independently
moving objects, and a greater tolerance to large camera displacements.
The object tracking is based on the same Bayesian framework as the one described
in Sec. 3.3 for airborne infrared imagery. However, the prior probability $p(g_t)$ of the camera
motion, the observation model, and the likelihood $p(z_t|x_t)$ are different, since they are
specifically designed for color imagery.
Table 3.1: Comparison of the tracking performance of the BEH, DEH and NEH algorithms
using the AMCOM dataset. See the text for a detailed description of the results.
Sequence  Target       First  Last  No. of  BEH       BEH       DEH       DEH       NEH       NEH
name                   frame  frame frames  failures  accuracy  failures  accuracy  failures  accuracy
L19NSS    M60          1      57    57      15        6.19      0         4.10      14        6.78
L19NSS    mantruck     86     100   15      0         7.33      2         12.40     2         12.30
L19NSS    mantruck     211    274   64      0         4.90      0         6.69      0         4.98
L1415S    Mantruck     1      280   280     8         14.14     10        14.84     25        16.32
L1607S    mantrk       225    409   185     0         3.49      0         3.56      0         4.78
L1608S    apc1         1      79    79      0         2.11      0         2.19      0         1.99
L1608S    M60          1      289   289     0         2.95      0         2.74      1         3.73
L1608S    mantrk       193    289   97      0         3.93      0         3.38      48        9.09
L1618S    apc1         1      290   290     0         13.73     0         18.90     0         16.57
L1618S    M60          1      100   100     0         3.51      0         1.87      0         2.35
L1701S    Bradley      1      370   370     0         6.54      0         10.89     0         8.21
L1701S    pickup(trk)  1      30    30      0         1.99      0         1.99      0         1.70
L1702S    Mantruck     113    179   67      0         3.27      8         6.18      0         4.56
L1702S    Mantruck     631    697   67      0         2.57      0         2.43      0         2.59
L1720S    target       1      34    34      0         2.91      0         2.60      4         4.99
L1720S    M60          43     777   735     15        6.59      0         5.80      0         4.39
L1805S    apc1         1      86    86      0         1.41      0         1.52      0         1.48
L1805S    M60          615    641   27      0         4.88      0         4.80      19        13.15
L1805S    tank1        614    734   125     0         2.14      0         2.17      163       65.73
L1812S    M60          72     157   86      0         2.53      0         2.52      0         2.71
L1813S    apc1         1      167   167     0         2.68      0         2.64      0         2.84
L1817S-1  M60          1      193   193     0         4.46      0         3.41      0         4.18
L1817S-2  M60          1      189   189     0         4.93      0         4.43      0         4.75
L1818S    apc1         21     112   92      32        6.01      72        10.22     72        26.05
L1818S    M60          81     202   122     60        10.55     119       17.79     118       85.25
L1818S    tank1        151    364   214     119       35.89     213       54.29     213       55.47
L1906S    Mantruck     1      203   203     0         8.47      0         5.58      0         8.21
L1910S    apc1         56     129   74      0         10.70     0         11.47     0         11.15
L1911S    apc1         1      164   164     0         8.15      0         9.19      0         8.75
L1913S    apc1         1      264   264     25        6.79      262       29.93     262       27.69
L1913S    M60          182    264   61      0         10.42     0         9.19      98        15.70
L1918S    tank1        26     259   234     15        7.03      20        8.98      144       9.46
L2018S    tank1        1      447   447     0         3.77      0         4.52      4         4.32
L2104S    bradley      1      320   320     51        5.60      319       34.39     319       43.82
L2104S    tank1        69     759   691     235       16.88     629       56.99     673       48.33
L2208S    apc1         1      379   379     0         5.88      0         5.40      0         5.53
L2312S    apc1         1      367   367     0         2.11      0         1.70      0         2.10
M1406S    Bradley      1      379   379     0         1.84      0         1.92      0         2.72
M1407S    Bradley      1      399   399     0         4.83      0         5.77      0         4.37
M1410S    tank1        1      497   497     0         7.45      0         5.25      0         3.91
M1413S    Mantruck     1      379   379     2         7.44      1         7.73      0         6.67
M1415S    Mantruck     1      10    10      0         3.14      2         6.68      0         3.09
M1415S    Mantruck     15     527   513     7         5.42      512       62.30     512       65.16
The prior probability $p(g_t)$ is based on the quality of the image alignment achieved
by $g_t$, the affine transformation that models the camera motion. This affine
transformation is obtained by means of a feature-based image registration algorithm
(see Sec. 3.3.1 for more details), which first computes the correspondence of features
between consecutive images, and then estimates the affine transformation that best fits
the set of matched features. The quality of the image alignment (i.e. the quality
of the ego-motion compensation) is based on the goodness of fit of $g_t$ with the set of
matched features. The goodness of fit of $g_t$ with one correspondence is determined by
the residual distance
$$d_j = \left\| f_t^j - (g_t \cdot f_{t-1}^j) \right\| \tag{3.29}$$
where $\|x\|$ is the Euclidean norm, $f_t^j$ is the $j$th feature detected in the color image $I_t$, and
$f_{t-1}^j$ is the corresponding feature in the previous color image $I_{t-1}$. The residual distance
is only meaningful for correct correspondences, called inliers, since the residual distance
of erroneous correspondences, called outliers, can be arbitrarily large. Consequently,
the outliers can distort the quality measure of the image alignment related to $g_t$. Therefore, to
robustly estimate the quality of the image alignment, only the inliers should be taken
into account. Accordingly, the quality is computed as
$$q_g = \sum_{d_j \in d_{in}} d_j / N_c, \tag{3.30}$$
where $d_{in}$ is the set of the residual distances related to the inliers, and $N_c$ is the number
of inliers. The set of inliers is also estimated by the image registration algorithm.
The quality of the image alignment $q_g$ is then used to compute the prior probability
$p(g_t)$, which is given by a Normal distribution of zero mean and variance $\sigma_g^2$
$$p(g_t) = N\left(q_g; 0, \sigma_g^2\right) \tag{3.31}$$
where $\sigma_g^2$ quantifies the uncertainty in the estimation of the feature correspondence.
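The chain from Eq. 3.29 to Eq. 3.31 can be sketched as follows. This is only an illustrative sketch: the representation of $g_t$ as a tuple `(a, b, tx, c, d, ty)`, the function names, and the value of `sigma_g` are assumptions introduced here, and the inlier set is taken as given (in the text it is provided by the image registration algorithm).

```python
import math

def apply_affine(g, point):
    """Apply g = (a, b, tx, c, d, ty) to a point (x, y)."""
    a, b, tx, c, d, ty = g
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def residual(g, f_prev, f_curr):
    """Eq. 3.29: distance between f_t^j and g_t applied to f_{t-1}^j."""
    px, py = apply_affine(g, f_prev)
    return math.hypot(f_curr[0] - px, f_curr[1] - py)

def alignment_quality(g, inlier_matches):
    """Eq. 3.30: mean residual over the inlier correspondences."""
    return sum(residual(g, p, c) for p, c in inlier_matches) / len(inlier_matches)

def camera_prior(q_g, sigma_g):
    """Eq. 3.31: zero-mean Normal density with std sigma_g, evaluated at q_g."""
    return math.exp(-0.5 * (q_g / sigma_g) ** 2) / (sigma_g * math.sqrt(2.0 * math.pi))
```

A transformation that aligns its inliers perfectly has $q_g = 0$ and therefore receives the maximum prior value; the prior decays as the mean residual grows relative to `sigma_g`.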
The object model uses color as the main feature to characterize the tracked
object. It describes the appearance of the object through a color
spatiogram (84), which is similar to a color histogram, but augmented with the spatial
means and covariances of the histogram bins. This captures a richer
description of the object, since a basic spatial distribution of the object color is considered. The
spatiogram of a region in an HSV color image is defined as a vector $h = [h_1, \ldots, h_{N_B}]$,
where the $b$th component is given by $h_b = [n_b, \mu_b, \Sigma_b]$. The first component $n_b$ is the
number of pixels belonging to the $b$th bin, according to their HSV triplet value. This
is similar to a standard HSV color histogram. The remaining components, $\mu_b$ and $\Sigma_b$, are
respectively the mean vector and the covariance matrix of the spatial coordinates of
the pixels contributing to the $b$th bin.
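A minimal sketch of the spatiogram computation is given below. It assumes that the HSV quantization has already been applied, so that each pixel arrives as `(x, y, bin_index)`, and it uses the population covariance (division by $n_b$); both choices, as well as the function name, are assumptions made for illustration and are not taken from the text.

```python
def spatiogram(pixels, num_bins):
    """Build [(n_b, mu_b, Sigma_b)] from pixels given as (x, y, bin_index)."""
    groups = [[] for _ in range(num_bins)]
    for x, y, b in pixels:
        groups[b].append((x, y))
    result = []
    for pts in groups:
        n = len(pts)
        if n == 0:
            # Empty bin: no spatial statistics are defined.
            result.append((0, None, None))
            continue
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        # Second-order spatial moments around the bin mean.
        sxx = sum((p[0] - mx) ** 2 for p in pts) / n
        syy = sum((p[1] - my) ** 2 for p in pts) / n
        sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
        result.append((n, (mx, my), ((sxx, sxy), (sxy, syy))))
    return result
```

Dropping `mu_b` and `Sigma_b` from each entry leaves the ordinary color histogram, which is the sense in which the spatiogram augments it.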
In order to compute the similarity between two image regions, characterized by
their spatiograms $h^{r1}$ and $h^{r2}$, a variation of the Bhattacharyya coefficient is calculated
$$\rho_w(h^{r1}, h^{r2}) = \sum_{b=1}^{N_B} w_b \, \rho(n_b^{r1}, n_b^{r2}), \tag{3.32}$$
where $\rho(n_b^{r1}, n_b^{r2}) = \sqrt{n_b^{r1} n_b^{r2}}$, and the weights are defined by
$w_b = N(\mu_b^{r1}, \mu_b^{r2}, \Sigma_b^{r2}) \cdot N(\mu_b^{r2}, \mu_b^{r1}, \Sigma_b^{r1})$.
The function $N(x, \mu, \Sigma)$ is a multivariate Gaussian function evaluated
at $x$. Fig. 3.20 contains three images that show a graphical example of the similarity
measurement between the region that encloses the tracked object and all the candidate
regions in an image. Fig. 3.20(a) shows the reference image with the object enclosed
by a bounding box, Fig. 3.20(b) shows another image of the same scene where
the target object has a different spatial position, and Fig. 3.20(c) shows the result of
computing the Bhattacharyya coefficient between the reference region of Fig. 3.20(a)
and all the possible regions of Fig. 3.20(b). Note that there is a prominent mode
corresponding to the target object.
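Eq. 3.32 can be sketched as below for spatiograms stored as lists of `(n_b, mu_b, Sigma_b)` with 2x2 covariances. Two points are assumptions of this sketch and not statements of the text: the pixel counts are normalized to sum to one before taking the Bhattacharyya term, and the Gaussian weights are used as raw densities without any further normalization. Bins with a singular covariance (for instance a single pixel) are not handled.

```python
import math

def gauss2d(x, mu, cov):
    """Bivariate Gaussian density N(x; mu, cov) for a non-singular 2x2 cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Mahalanobis term using the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))

def spatiogram_similarity(h1, h2):
    """Eq. 3.32 for spatiograms given as lists of (n_b, mu_b, Sigma_b)."""
    total1 = sum(n for n, _, _ in h1)
    total2 = sum(n for n, _, _ in h2)
    rho_w = 0.0
    for (n1, mu1, cov1), (n2, mu2, cov2) in zip(h1, h2):
        if n1 == 0 or n2 == 0:
            continue  # an empty bin contributes sqrt(0) = 0
        # Spatial weight: each mean evaluated under the other region's Gaussian.
        w_b = gauss2d(mu1, mu2, cov2) * gauss2d(mu2, mu1, cov1)
        rho_w += w_b * math.sqrt((n1 / total1) * (n2 / total2))
    return rho_w
```

The weight $w_b$ is largest when the two regions place the same color at the same relative position, so two regions with identical histograms but different spatial layouts obtain a lower similarity than with the plain Bhattacharyya coefficient.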
Once the observation model has been defined, the likelihood can be expressed as
$$p(z_t|x_t) = p(z_t|d_t, g_t) = p(z_t|d_t) = N(d_w(h^{r1}, h^{r2}); 0, \sigma_d^2), \tag{3.33}$$
where it has been assumed that $z_t$ is conditionally independent of $g_t$ given $d_t$, i.e. the
camera motion is not involved in the likelihood computation. The function $N(x; \mu, \sigma^2)$
is a Gaussian distribution of mean $\mu$ and variance $\sigma^2$. The function
$d_w(h^{r1}, h^{r2}) = \sqrt{1 - \rho_w(h^{r1}, h^{r2})}$ is the Bhattacharyya distance between $h^{r1}$, the spatiogram of the
tracked object, and $h^{r2}$, the spatiogram of a candidate image region in the current
image $I_t$, located at the coordinates given by $d_t$. The variance $\sigma_d^2$ models the expected
variation of the Bhattacharyya distance due to changes in the object appearance.
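As a small worked example of Eq. 3.33, the sketch below maps a similarity value $\rho_w$ to a likelihood. The clamping of $\rho_w$ to $[0, 1]$ is an assumption added here to keep the square root real; the text does not discuss it, and the function names and the value of `sigma_d` are illustrative only.

```python
import math

def bhattacharyya_distance(rho_w):
    """d_w = sqrt(1 - rho_w); rho_w is clamped to [0, 1] (an assumption)."""
    return math.sqrt(1.0 - min(max(rho_w, 0.0), 1.0))

def likelihood(rho_w, sigma_d):
    """Eq. 3.33: zero-mean Gaussian density with std sigma_d, evaluated at d_w."""
    d = bhattacharyya_distance(rho_w)
    return math.exp(-0.5 * (d / sigma_d) ** 2) / (sigma_d * math.sqrt(2.0 * math.pi))
```

A small `sigma_d` makes the likelihood sharply peaked around perfect matches, whereas a larger value tolerates more appearance change at the cost of admitting more background candidates.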
The resulting posterior pdf cannot be analytically determined because both the
dynamic and observation models are nonlinear and non-Gaussian. Therefore, the use of
approximate inference methods is necessary. In the next section, a Particle Filtering
strategy is presented to obtain an approximate solution of the posterior pdf, which is
different from the Particle Filtering approach presented in Sec. 3.3.1.
Figure 3.20: Graphical example of the similarity measurement between image regions
based on the Bhattacharyya coefficient. Panels: (a) image containing the target object;
(b) image containing the same object in a different spatial position; (c) computation of
the Bhattacharyya coefficient.
3.4.1 Particle Filter approximation
The posterior pdf $p(x_t|z_{1:t})$ is approximated by means of a Particle Filter as
$$p(x_t|z_{1:t}) \approx \frac{1}{c} \sum_{i=1}^{N_S} w_t^i \, \delta(x_t - x_t^i), \tag{3.34}$$
where the samples $x_t^i$ are drawn from a proposal distribution based on the transition
probability (Eq. 3.4)
$$q(x_t|x_{t-1}, z_t) = p(x_t|x_{t-1}) = p(d_t|d_{t-1}, g_t)\,p(g_t), \tag{3.35}$$
which is an efficient simplification of the optimal, but not tractable, importance density
function (9)
$$q(x_t|x_{t-1}, z_t) = p(x_t|x_{t-1}, z_t). \tag{3.36}$$
The samples $x_t^i = \{d_t^i, g_t^i\}$ are drawn from the proposal distribution using a
hierarchical sampling strategy, which first draws samples $g_t^i$ from $p(g_t)$, and then draws
samples $d_t^i$ from $p(d_t|d_{t-1}, g_t)$ conditioned on $g_t^i$.
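The two-stage drawing can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the text: the object state $d_t$ is reduced to a 2-D position, $p(g_t)$ is represented by a finite list of weighted affine hypotheses `(a, b, tx, c, d, ty)`, and the object dynamics (Eq. 3.15, defined elsewhere in the chapter) is replaced by isotropic Gaussian noise of standard deviation `sigma` around the camera-compensated previous position.

```python
import random

def sample_state(d_prev, camera_hypotheses, camera_weights, sigma, rng=random):
    """Draw x_t = (d_t, g_t): first g_t ~ p(g_t), then d_t ~ p(d_t | d_{t-1}, g_t)."""
    # Stage 1: pick one affine hypothesis with probability proportional to its weight.
    g = rng.choices(camera_hypotheses, weights=camera_weights, k=1)[0]
    a, b, tx, c, d, ty = g
    # Stage 2: compensate the previous position with g, then add Gaussian noise.
    x, y = d_prev
    mean_x = a * x + b * y + tx
    mean_y = c * x + d * y + ty
    d_t = (rng.gauss(mean_x, sigma), rng.gauss(mean_y, sigma))
    return d_t, g
```

Sampling $g_t^i$ first means that every object hypothesis is generated under one specific camera-motion hypothesis, which is how the uncertainty of the ego-motion propagates into the object position samples.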
The computation of samples $g_t^i$ from $p(g_t)$ relies on a feature-based image registration
algorithm. This algorithm can be split into two stages: the computation
of correspondences of image features, and the estimation of affine transformations
that fit the feature correspondences. The first stage, the computation of feature
correspondences, is accomplished by a fast feature-based motion estimation technique
(FFME) (85), which is robust to noise, the aperture problem, illumination changes,
and small variations of the 3D viewpoint. The main steps of the FFME algorithm are
described in the following lines. The selection of image features, also called singular
points, is accomplished in three steps. The first one selects the image regions
that have a significant gradient magnitude. The second step filters out the image
regions with low cornerness, which are typically located on straight edges (i.e. they
are still affected by the aperture problem). As a result, the output of the second step
is the set of image regions that stand out by their gradient magnitude and cornerness. The
last step performs a non-maximal suppression in the cornerness space to discard the
image points that are not very significant with respect to their neighborhood. The result
is a set of singular points that are the most reliable ones to compute the feature
correspondence, since the image points that are especially sensitive to noise and to
the aperture problem have been discarded. Each singular point is characterized by a
descriptor that is robust to illumination changes and small variations in the 3D
viewpoint. To compute the descriptor, an array of gradient phase histograms in the
neighborhood of each singular point is calculated. All the histograms are concatenated
in a vector to form the descriptor, which is then normalized to make it invariant to
brightness and contrast changes. Singular points of consecutive images are matched
by computing the minimum Euclidean distance between descriptors. Potentially
erroneous correspondences are discarded if the Euclidean distance of the best match
of a feature is too close to that of the second best match. Fig. 3.21 shows the feature
correspondence between two consecutive images that are stacked in a column. The
correspondences are represented by arrows that link a feature in the previous image
(top image) with a feature in the current image (bottom image). Note that, in spite of the
careful selection of features and the subsequent robust matching, the presence of
wrong correspondences is still possible in real images.
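The matching rule described above (nearest descriptor, rejected when the best distance is too close to the second best) can be sketched as follows. This is not the FFME implementation: descriptors are assumed to be plain numeric tuples, and the rejection threshold `max_ratio=0.8` is a hypothetical value, since the text does not give one.

```python
import math

def match_descriptors(prev_desc, curr_desc, max_ratio=0.8):
    """Return (i, j) pairs matching prev_desc[i] to its nearest curr_desc[j]."""
    matches = []
    for i, p in enumerate(prev_desc):
        dists = sorted((math.dist(p, c), j) for j, c in enumerate(curr_desc))
        if len(dists) < 2:
            continue  # the ambiguity test needs a second-best candidate
        (best, j), (second, _) = dists[0], dists[1]
        # Reject ambiguous matches: best distance too close to the second best.
        if best < max_ratio * second:
            matches.append((i, j))
    return matches
```

For instance, a descriptor that is almost equidistant from two candidates, as happens on repetitive structures, is dropped rather than matched to either of them.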
The second stage of the image registration algorithm uses the computed feature
correspondences to estimate a set of affine transformations, which are candidates for
representing the camera motion. In a hypothetical ideal situation in which all the
correspondences are correct, and all of them are related to static background
regions, the affine transformation representing the camera motion can be obtained by
solving the following equation system through the Least Mean Squares (LMS) technique
$$\begin{bmatrix} x_{t-1}^j \\ y_{t-1}^j \\ 1 \end{bmatrix} = \begin{bmatrix} a & b & t_x \\ c & d & t_y \\ 0 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_t^j \\ y_t^j \\ 1 \end{bmatrix}, \quad j \in C_{all}, \tag{3.37}$$
where $a$, $b$, $c$, $d$, $t_x$ and $t_y$ are the parameters of the affine transformation;
$x_{t-1}^j$, $y_{t-1}^j$, $x_t^j$, $y_t^j$ are the coordinates of the features related to the $j$th
correspondence; and $C_{all}$ is the set of all the correspondences. However, in a real
situation there can be wrong correspondences, as well as correct correspondences that are
related to independently moving objects (see Fig. 3.21) and therefore do not fit the camera
motion. All these correspondences are outliers that bias the estimation of the affine
transformation. To overcome this problem, there exist robust statistical techniques that can
tolerate up to 50% of outliers, such as RANSAC (86) and Least Median Squares (87).
Nevertheless, the amount of outliers can be greater than 50% in real scenarios. The main
sources of outliers are the presence of several independently moving objects, low-textured
image regions, and repetitive background structures. Under these circumstances,
the estimation of the camera motion can be so uncertain that it may not be possible
to deterministically estimate the correct camera motion.
Figure 3.21: Example of feature correspondence between consecutive images using the
FFME algorithm.
To address this situation, a probabilistic approach is proposed, which computes a set of
weighted affine transformations to approximate the underlying probability of the camera
motion $p(g_t)$. It is assumed that at least one of the affine transformation candidates is a
satisfactory approximation of the actual camera motion. The combination of these camera
motion hypotheses and the object motion hypotheses produces a robust dynamic model of the
scene, which is used by the proposed Bayesian filter to reliably estimate both the object
tracking information and the camera motion. The affine transformation samples
$g_t^i$ are obtained using the principle of RANdom SAmple Consensus (RANSAC). This
randomly selects $N_c$ correspondences from the whole set $C_{all}$, and then estimates
through Least Mean Squares (LMS) the affine transformation that best fits the
selected correspondences. In order to improve the accuracy of the estimated camera
motion, the set of inliers with regard to the computed affine transformation is estimated
by means of the Median Absolute Deviation algorithm (87). Finally, the affine
transformation sample $g_t^i$ is obtained using the LMS algorithm with the whole set of inliers.
As a result of this RANSAC-based drawing procedure, the set of affine transformation
samples $\{g_t^i \,|\, i = 1 \ldots N_S\}$ is obtained. Lastly, a resampling stage is performed using
the SIR algorithm to place more samples $g_t^i$ over the regions of the affine space that
concentrate the majority of the probability. Fig. 3.22 shows an image that represents
a set of affine transformations obtained by the proposed method. Each affine
transformation is represented by a square, which has been obtained by warping a canonical
square according to the parameters of the affine transformation. The canonical square,
represented in red, is centered at coordinates (0, 0) and has a side length
equal to 5. Observing the resulting warped squares with regard to the canonical square,
the different kinds of deformations produced by an affine transformation (translation,
rotation, scaling, and shearing) can be perceived.
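One draw of the RANSAC-based procedure (random subset, least-squares fit, MAD-based inlier selection, refit on the inliers) can be sketched as below. The sketch rests on assumptions that the text does not fix: the subset size `n_c=3`, the inlier threshold `k * 1.4826 * MAD`, the parameter layout `(a, b, tx, c, d, ty)`, and the mapping direction from previous to current points (the convention of Eq. 3.29). Degenerate subsets with collinear points are not handled, and the final SIR resampling of the hypotheses is omitted.

```python
import random
import statistics

def solve3(m, v):
    """Solve the 3x3 linear system m @ x = v by Gauss-Jordan elimination."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def fit_affine(matches):
    """Least-squares affine (a, b, tx, c, d, ty) from ((px, py), (cx, cy)) pairs."""
    rows = [(px, py, 1.0) for (px, py), _ in matches]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in (0, 1):  # x-row of the transformation, then y-row
        atb = [sum(r[i] * cur[k] for r, (_, cur) in zip(rows, matches))
               for i in range(3)]
        params += solve3(ata, atb)
    return tuple(params)

def residuals(g, matches):
    """Euclidean residual of every correspondence under g."""
    a, b, tx, c, d, ty = g
    return [((cx - (a * px + b * py + tx)) ** 2
             + (cy - (c * px + d * py + ty)) ** 2) ** 0.5
            for (px, py), (cx, cy) in matches]

def draw_affine_hypothesis(matches, n_c=3, k=2.5, rng=random):
    """One draw: random subset -> fit -> MAD-based inliers -> refit."""
    g = fit_affine(rng.sample(matches, n_c))
    res = residuals(g, matches)
    med = statistics.median(res)
    mad = statistics.median(abs(r - med) for r in res)
    # 1.4826 * MAD approximates the standard deviation under Gaussian noise.
    threshold = med + k * 1.4826 * mad + 1e-9
    inliers = [m for m, r in zip(matches, res) if r <= threshold]
    return fit_affine(inliers) if len(inliers) >= 3 else g
```

Calling `draw_affine_hypothesis` $N_S$ times yields a set of candidate transformations; subsets dominated by an independently moving object produce hypotheses that follow that object, which is why the candidates are subsequently weighted and resampled rather than trusted individually.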
Figure 3.22: Representation of a set of affine transformations obtained by the RANSAC
based drawing procedure. Each affine transformation is represented by a warped square.
See the text for more details.
Once the affine transformation samples have been obtained, the object
dynamic samples $d_t^i$ are drawn from the transition probability of the object dynamics
$p(d_t|d_{t-1}, g_t^i)$, which is a Gaussian distribution given by Eq. 3.15. The state
samples are then obtained by concatenating both kinds of samples, $x_t^i = \{d_t^i, g_t^i\}$. Figs. 3.23
and 3.24 respectively show the samples of the object position (marked by green crosses)
given by $d_t^i$, and the samples of the ellipse that encloses the object according to $x_t^i$.
Notice that the samples are correctly placed around the most probable image locations,
which contain the target object.
The weights $w_t^i$ related to the drawn samples $x_t^i$ are computed using Eq. 3.9,
which can be simplified using the expression of the proposal distribution as
$$w_t^i = w_{t-1}^i \frac{p(z_t|x_t^i)\,p(x_t^i|x_{t-1}^i)}{q(x_t^i|x_{t-1}^i, z_t)} = w_{t-1}^i \frac{p(z_t|d_t^i)\,p(x_t^i|x_{t-1}^i)}{p(x_t^i|x_{t-1}^i)} = w_{t-1}^i \, p(z_t|d_t^i), \tag{3.38}$$
where the object likelihood $p(z_t|d_t^i)$ is given by Eq. 3.20. Therefore, the samples that
best fit the observation model, based on the color appearance of the object, will
have more relevance than the rest. Fig. 3.25 shows the discrete representation of the
posterior probability of the object position given by the samples $x_t^i$ and their weights
$w_t^i$. Notice that the position samples with the highest weights are those whose associated
ellipse best encloses the target object.
Figure 3.23: Samples of the object position determined by $d_t^i$.
Figure 3.24: Samples of the ellipses that enclose the object determined by $x_t^i$.
Figure 3.25: Weighted samples of the object position representing the posterior $p(x_t|z_{1:t})$.
After computing the samples and weights that approximate the posterior distribution,
a resampling step is applied to reduce the degeneracy problem. This is accomplished
by means of the SIR algorithm (Sec. 3.1), which selects the samples with
higher weights more often, while the ones with an insignificant weight are discarded. After the SIR
resampling, all the samples have the same weight.
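The weight update of Eq. 3.38 followed by the resampling step can be sketched as follows. Multinomial resampling is used here as one common realization of SIR; the text does not specify which resampling variant is employed, so this choice and the function names are assumptions of the sketch.

```python
import random

def update_weights(prev_weights, likelihoods):
    """Eq. 3.38: w_t^i = w_{t-1}^i * p(z_t | d_t^i), then normalize to sum to one."""
    w = [pw * lk for pw, lk in zip(prev_weights, likelihoods)]
    total = sum(w)
    return [x / total for x in w]

def sir_resample(samples, weights, rng=random):
    """Draw N samples with replacement according to the weights; reset weights to 1/N."""
    n = len(samples)
    resampled = rng.choices(samples, weights=weights, k=n)
    return resampled, [1.0 / n] * n
```

The normalization in `update_weights` plays the role of the constant $1/c$ in Eq. 3.34, and after `sir_resample` the information previously carried by the weights is carried by the multiplicity of each sample.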
The state estimation $\acute{x}_t$, containing the desired tracking information, is finally
obtained using the MAP estimator 3.1.
3.4.2 Results
The proposed visual tracking strategy for moving cameras in visible imagery is tested in
several challenging situations, such as the chasing of an object at high speed. Graphical
results are shown next for two sequences, in which the camera has been mounted
on two different moving platforms: a car and a helicopter. This kind of sequence is
suitable for testing the proposed tracking algorithm because it involves strong
ego-motion and multiple independently moving objects, which can bias the camera
ego-motion estimation. In both situations the posterior pdf over the state of the object has
been approximated by 50 samples.
The first sequence has been acquired by a camera mounted on a police car that is
chasing a motorbike. The main challenges are the strong ego-motion and the reduced
set of appropriate regions to compute reliable correspondences. The image is mainly
composed of low-textured regions in which the quality of correspondences is very poor
due to the aperture problem. The proposed model for the camera dynamics, which takes
into account multiple hypotheses, can manage this situation and successfully
track the target object. This is demonstrated in Fig. 3.26. The first row
of frames corresponds to two non-consecutive time steps of the sequence, where the
tracked object, a motorbike, is enclosed by a white ellipse. This ellipse is defined
by the parameters of the state vector, which has been computed using the MMSE
estimator. The second row shows a set of ellipses that represent the state samples when
only the camera dynamics is considered. Note that the ellipses are not exactly located on
the target object, since only the camera dynamics has been taken into account. Using
the object dynamics to rectify the object locations, the majority of the ellipses are
correctly located over the target object, as shown in the third row. The combination
of the object and camera dynamics allows the posterior pdf
of the object state to be propagated satisfactorily.
The second sequence has been acquired from a camera mounted on a helicopter,
and the target object is a red car on a highway. In addition to the strong ego-motion,
there are several independently moving objects (other cars) that make the camera
motion estimation more challenging, since the correspondences related to the moving objects
can induce false camera motions. Fig. 3.27 shows three rows of frames with the same
layout and meaning as those of Fig. 3.26. Note that the ellipses of the second and third
rows are distributed along the direction of the highway because the correspondences
related to the rest of the cars induce several false candidates of camera motion in that
direction. In spite of this drawback, the tracking is successfully accomplished, as
observed in the first row, where the white ellipse satisfactorily encloses the tracked
object. Notice how this strategy allows other candidates to be discarded, such as the other
red car (second column of frames) marked with a white X, since there is neither camera nor
object motion that supports that region. This prevents the tracking process from being
distracted by similar objects.
Figure 3.26: Results using the proposed Bayesian framework for moving cameras and
visual imagery. The tracked object is a motorbike, and the camera is mounted on a car.
Rows: (1) object tracking represented by a white ellipse, which is determined by the
estimated object state; (2) ellipse samples of the posterior pdf taking into account only
the camera dynamics; (3) ellipse samples of the posterior pdf considering both the object
and the camera dynamics.
The performance of the proposed tracking algorithm has also been compared with
two tracking techniques described in (26; 88). The main differences with
the presented approach are that the camera ego-motion is not compensated in (88),
while in (26) it is compensated, but using only one affine transformation instead of
several. The three strategies use a Particle Filter framework to perform the
tracking. In order to compare the three tracking algorithms fairly, the same set of
parameters has been used to tune the Particle Filters. The dataset used for the
comparison is composed of 11 videos with challenging situations: strong ego-motion,
changes in illumination, variations of the object appearance and size, and occlusions.
The comparison is based on the following criterion: the number of times that the tracked
object has been lost. Each time the tracked object is lost, the corresponding tracking
algorithm is initialized again with a correct object detection. Tab. 3.2 shows the comparison of
the three tracking algorithms, where it can be seen that the proposed algorithm
is clearly superior, because the other approaches cannot satisfactorily handle the
ego-motion. On the other hand, it can be observed that, independently of the algorithm,
the results obtained for the videos “motorbike1-6.avi” are worse than the rest. The
reason is that the size of the tracked object is very small, less than one hundred pixels,
and therefore the object cannot be robustly represented by a spatiogram.
3.5 Conclusions
In this chapter, the problem of tracking a single object in heavily cluttered scenarios with
a moving camera has been addressed. Firstly, a common Bayesian framework has been
presented to jointly model the object and camera dynamics. This framework makes it
possible to satisfactorily predict the evolution of the tracked object in situations where several
putative candidates for the object location are possible due to the combined dynamics of the
object and the camera. This is especially problematic in situations with strong
ego-motion, high uncertainty in the camera motion because of the aperture problem, and
the presence of multiple objects with independent motion. In addition, the Bayesian
model is robust to background clutter, which avoids tracking failures due to the
presence in the background of objects similar to the tracked one.
[Figure 3.27 panels: (1) object tracking represented by a white ellipse, which is
determined by the estimated object state; (2) ellipse samples of the posterior pdf
taking into account only the camera dynamics; (3) ellipse samples of the posterior pdf
considering both the object and the camera dynamics.]
Figure 3.27: Results using the proposed Bayesian framework for moving cameras and
visual imagery. The tracked object is the red car, and the camera is mounted on a
helicopter. The white X in the second column indicates a similar object (another red
car) that could easily be mistaken for the target object.
Video             #L alg. 1   #L alg. 2   #L alg. 3
redcar1.avi           0           0           0
redcar2.avi           1           4           9
person1.avi           0           3           7
motorbike1.avi        4          13          24
motorbike2.avi        1           3           6
motorbike3.avi        2           8          19
motorbike4.avi        5          16          26
motorbike5.avi        8          23          37
bluecar1.avi          0           1           3
bluecar2.avi          0           2           5
bluecar3.avi          1           3           6
Table 3.2: Comparison between the three tracking algorithms using as criterion the
number of times that the tracked object has been lost (#L). Alg. 1, alg. 2, and alg. 3
refer to the proposed method, the Particle Filter approach with compensation, and the
Particle Filter approach without compensation, respectively.
Secondly, the Bayesian framework has been adapted to work with different kinds of
imagery, specifically infrared and visible-range imagery. Every type of imagery has its
own characteristics, which lead to the use of different techniques to characterize the
tracked objects and compute their detections. Also, the strategies to infer the multiple
hypotheses about the camera motion are different, depending on the type of imagery. In
infrared sequences, an area-based image registration technique has been used due to the
absence of distinctive image features, while a feature-based image registration method
has been adopted for the visible-range imagery. On the other hand, the inference of
the tracking information in the proposed Bayesian models has been carried out by
means of approximate suboptimal methods, since there is no closed-form expression
to directly compute the required tracking information. This situation arises from
the fact that the dynamic and observation processes are non-linear and non-Gaussian.
Accordingly, taking into account the specific characteristics of the observation and
camera dynamic models of the infrared and visible-range imagery, two different
suboptimal inference methods have been derived, which make use of the particle filtering
technique to compute an accurate approximation of the object trajectory.
Lastly, the developed Bayesian tracking models have been tested on different video
datasets, proving their efficiency and reliability in the task of tracking objects with
moving cameras in heavily cluttered scenarios.
Chapter 4
Bayesian tracking of multiple
interacting objects
This chapter presents a tracking approach for multiple interacting objects with a static
camera. Firstly, a detailed description of the tracking problem is presented in Sec. 4.1.
Next, the developed Bayesian model for multiple interacting objects is described in
Sec. 4.2. The approximate inference method to obtain a tractable solution from the
optimal Bayesian model is presented in Sec. 4.3. The computation of the object de-
tections carried out by the set of object detectors is described in Sec. 4.4. Lastly,
the tracking results using a public dataset containing football sequences are shown in
Sec. 4.5.
4.1 Description of the multiple object tracking problem
The aim is to track several interacting objects over time in a video sequence acquired
by a static camera. The object tracking information at time step $t$, also referred to
as the object state, is represented by the vector
\[
x_t = \{x_{t,i_x} \mid i_x = 1, \ldots, N_{obj}\}, \tag{4.1}
\]
where each component contains the tracking information of a specific object
\[
x_{t,i_x} = \{r_{t,i_x}, v_{t,i_x}, id_{t,i_x}\}, \tag{4.2}
\]
where $r_{t,i_x}$ contains the object position on the image plane, $v_{t,i_x}$ stores the
object velocity, and $id_{t,i_x}$ is a scalar value that uniquely identifies an object
throughout the video sequence. It is assumed that the $i_x$-th element at consecutive
time steps corresponds to the same tracked object. The number of tracked objects
$N_{obj}$ is variable, but it is assumed that the entrances and exits of the objects in
the scene are known. This makes it possible to focus on modeling the object
interactions.
The tracking information contained in the object state $x_t$ is estimated over time
using a set of noisy detections (also called observations or measurements), which
are obtained by analyzing the video stream acquired by the camera. The set of
detections at time step $t$
\[
z_t = \{z_{t,i_z} \mid i_z = 1, \ldots, N_{ms}\} \tag{4.3}
\]
contains potential detections of the tracked objects extracted from the frame $I_t$.
There can be several detections per object, although at most one of them comes from the
actual object. The rest of the detections are false alarms, known as clutter, that
arise from similar objects in the scene or from uncertainties in the detection process.
Moreover, there can be objects without any detection as a consequence of occlusions or
severe changes in the visual appearance of the tracked objects. The number of
detections $N_{ms}$ can vary at each time step. Detections are generated by a set of
detectors, each one specialized in one specific type or category of object. Fig. 4.1
illustrates a set of detections yielded by several object detectors. Each detection
\[
z_{t,i_z} = \{r_{t,i_z}, w_{t,i_z}, id_{t,i_z}\} \tag{4.4}
\]
contains the coordinates on the image plane of the detected object, $r_{t,i_z}$, a
weight related to the quality of the detection, $w_{t,i_z}$, and an identifier that
associates the detection with the object detector that generated it, $id_{t,i_z}$.
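As a purely illustrative aid, the object state of Eq. 4.2 and the detection of Eq. 4.4 can be written as two small Python records. The class names, field types, and the numeric values below are assumptions made for this sketch and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectState:
    """One component x_{t,i_x} of the object state (Eq. 4.2)."""
    r: Tuple[float, float]  # position on the image plane
    v: Tuple[float, float]  # velocity on the image plane
    id: int                 # scalar identifier of the object


@dataclass
class Detection:
    """One detection z_{t,i_z} (Eq. 4.4)."""
    r: Tuple[float, float]  # image coordinates of the detected object
    w: float                # weight related to the quality of the detection
    id: int                 # identifier of the detector that generated it


# Hypothetical frame: two tracked objects and three unordered detections,
# one of which is a false alarm (clutter).
x_t: List[ObjectState] = [
    ObjectState(r=(120.0, 80.0), v=(2.0, 0.0), id=1),
    ObjectState(r=(300.0, 95.0), v=(-1.5, 0.5), id=2),
]
z_t: List[Detection] = [
    Detection(r=(302.1, 94.0), w=0.9, id=2),
    Detection(r=(50.0, 200.0), w=0.3, id=1),  # false alarm
    Detection(r=(121.5, 79.2), w=0.8, id=1),
]
```

Note that the number of detections differs from the number of objects and that their order carries no information, which is what makes the data association step necessary.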
[Figure 4.1 labels: Detector 1; Detector 2; several measurements or detections per
object; false alarm or clutter.]
Figure 4.1: Illustration depicting a set of detections yielded by multiple detectors.
The association between detections and objects is unknown, since the set of detections
computed at each time step is unordered. In addition, some of the detections are false
alarms, which should be discarded or associated with a generic entity that symbolizes
the clutter, to prevent drift in the estimation of the object state vector.
Consequently, the tracking algorithm has to include a data association stage to
correctly compute the associations between the true detections and the objects, while
the false detections are associated with the clutter entity. The data association stage
also has to handle the situation in which some objects have no associated detections,
either because all the detections are false, or because their respective detectors have
not produced any detection. The data association is represented by the random variable
\[
a_t = \{a_{t,i_a} \mid i_a = 1, \ldots, N_{ms}\}, \tag{4.5}
\]
where the component $a_{t,i_a}$ contains the association of the $i_a$-th detection
$z_{t,i_a}$. The possible associations are with one of the objects or with the clutter.
This is expressed as
\[
a_{t,i_a} =
\begin{cases}
0 & \text{if the $i_a$-th detection is assigned to clutter,} \\
i_x & \text{if the $i_a$-th detection is assigned to the $i_x$-th object,}
\end{cases} \tag{4.6}
\]
where $i_x \in \{1, \ldots, N_{obj}\}$. Fig. 4.2 illustrates the data association
process between the sets of detections and objects.
[Figure 4.2 labels: Objects; Clutter; Measurements; Objects without measurements.]
Figure 4.2: Illustration depicting the data association between detections and objects.
To improve the estimation of the object state, and at the same time to reduce the
ambiguity in the estimation of the data association, some prior knowledge about the
object dynamics is used. Typically, it is assumed that the objects move independently
according to a constant velocity model. However, in real situations, the objects
interact with each other, changing their trajectories as they approach one another. For
example, in the context of people tracking, one person can cross paths with another,
stop in front of him, follow him, or perform even more complicated interactions. In
these cases, the assumption of constant velocity does not hold, causing errors in the
computed object trajectories. This happens especially when an object changes its
trajectory while it is occluded, since no detections are available (due to the
occlusion) that can be used to rectify its trajectory. Consequently, a more complex
dynamic model is required that takes into account the interaction between objects.
Along this line, an advanced dynamic model for interacting objects has been developed
that depends on the relative distance between objects. An object that is far away from
the others is considered to move independently with a piecewise constant velocity
model. However, an object that is close to or occluded by another one can undergo
either an independent motion according to a constant velocity model, or an interacting
motion that depends on the position of the interacting objects. The interacting motion
is modeled in such a way that the object can either follow the trajectory of an
occluding object, which means that the tracked object remains occluded behind other
objects, or follow an independent trajectory, meaning that the object interaction has
finished and the occluded object becomes visible around the previously occluding
objects. Fig. 4.3 illustrates the different kinds of situations that the proposed
dynamic model can handle.
As previously stated, the dynamic model depends on the distances between the objects
in the scene, since interactions take place when two or more objects are close enough
to involve some kind of occlusion. Therefore, inferring the occlusions between objects
makes it possible to predict when the objects are interacting with each other, and thus
to estimate their expected position using an interacting dynamic model rather than a
constant velocity model. The object occlusions are modeled by an auxiliary random
variable
\[
o_t = \{o_{t,i_o} \mid i_o = 1, \ldots, N_{obj}\}, \tag{4.7}
\]
that stores the occlusion events of each tracked object. The component $o_{t,i_o}$
encodes the occlusion information of the $i_o$-th object as
\[
o_{t,i_o} =
\begin{cases}
0 & \text{if the $i_o$-th object is visible, not occluded,} \\
i_x & \text{if the $i_o$-th object is occluded by the $i_x$-th object,}
\end{cases} \tag{4.8}
\]
where $i_x \in \{1, \ldots, N_{obj}\}$.
All the previous random variables, representing the basic elements of information
in the tracking problem, are probabilistically related through a Bayesian model, which
is described in the next section.
4.2 Bayesian tracking model for multiple interacting objects
Visual tracking of multiple interacting objects is performed using a Bayesian approach,
which recursively computes over time the probability of the object state
$x_t = \{x_{t,i_x} \mid i_x = 1, \ldots, N_{obj}\}$, given a sequence of noisy
detections $z_t = \{z_{t,i_z} \mid i_z = 1, \ldots, N_{ms}\}$. This probability contains
all the information required to compute an optimum estimate of the object trajectories
at each time step.
The developed Bayesian model for multiple object tracking can be represented by
the graphical model of Fig. 4.4, which encodes the probabilistic dependencies among the
random variables involved in the tracking task, already introduced in Sec. 4.1.
[Figure 4.3 panels: distant objects (constant velocity model); cross of objects without
interaction (constant velocity model); cross of objects with interaction (interacting
model); end of an interaction, objects move away (constant velocity model).]
Figure 4.3: Illustration depicting the object dynamic model.
Figure 4.4: Graphical model of the developed Bayesian algorithm for multiple object
tracking.
From a Bayesian perspective, the aim is to compute the posterior probability density
function (pdf) over the object state, the data association, and the object occlusion,
$p(x_t, a_t, o_t \mid z_{1:t})$, using the prior information about the object dynamics
and the sequence of detections available up to the current time step,
$z_{1:t} = \{z_1, \ldots, z_t\}$. This task can be efficiently performed by expressing
the posterior pdf in a recursive way using Bayes' theorem
\[
p(x_t, a_t, o_t \mid z_{1:t})
= \frac{p(x_t, a_t, o_t, z_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}
= \frac{p(z_t \mid z_{1:t-1}, x_t, a_t, o_t)\, p(x_t, a_t, o_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}, \tag{4.9}
\]
where $p(x_t, a_t, o_t \mid z_{1:t-1})$ is the prior pdf,
$p(z_t \mid z_{1:t-1}, x_t, a_t, o_t)$ is the likelihood, and $p(z_t \mid z_{1:t-1})$
is just a normalization constant.
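The recursion of Eq. 4.9 can be illustrated with a toy sketch in which the joint variable $\{x_t, a_t, o_t\}$ is collapsed into a small finite set of hypotheses, so that integrals become sums. The function name, the two-hypothesis example, and the assumption that the transition and the likelihood depend only on the quantities shown are illustrative choices for this sketch, not part of the thesis model.

```python
def bayes_step(posterior_prev, transition, likelihood):
    """One predict/update cycle over a finite set of hypotheses s.

    posterior_prev[j] : p(s_{t-1} = j | z_{1:t-1})
    transition[j][k]  : p(s_t = k | s_{t-1} = j)
    likelihood[k]     : p(z_t | s_t = k)
    """
    n = len(posterior_prev)
    # Prediction: discrete analogue of the prior pdf p(s_t | z_{1:t-1}).
    prior = [
        sum(posterior_prev[j] * transition[j][k] for j in range(n))
        for k in range(n)
    ]
    # Correction: numerator of Eq. 4.9 (likelihood times prior).
    unnormalized = [likelihood[k] * prior[k] for k in range(n)]
    # Normalization constant p(z_t | z_{1:t-1}).
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


# Hypothetical numbers for two hypotheses.
posterior = bayes_step(
    posterior_prev=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.2, 0.8]],
    likelihood=[0.7, 0.1],
)
```

In the actual tracking problem the state is continuous and high-dimensional, so this exhaustive summation is not available and an approximate inference method is required.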
The prior pdf predicts the evolution of $\{x_t, a_t, o_t\}$ between consecutive time
steps using the posterior pdf at the previous time step,
$p(x_{t-1}, a_{t-1}, o_{t-1} \mid z_{1:t-1})$, and the prior knowledge about the
dynamics of $\{x_t, a_t, o_t\}$. This is mathematically expressed by
\[
\begin{aligned}
p(x_t, a_t, o_t \mid z_{1:t-1})
&= \int_{x_{t-1}} \sum_{a_{t-1}} \sum_{o_{t-1}}
   p(x_t, a_t, o_t, x_{t-1}, a_{t-1}, o_{t-1} \mid z_{1:t-1})\, dx_{t-1} \\
&= \int_{x_{t-1}} \sum_{a_{t-1}} \sum_{o_{t-1}}
   p(x_t, a_t, o_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}) \\
&\qquad \cdot\, p(x_{t-1}, a_{t-1}, o_{t-1} \mid z_{1:t-1})\, dx_{t-1},
\end{aligned} \tag{4.10}
\]
where $p(x_t, a_t, o_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1})$ is the joint
transition pdf that determines the temporal evolution of $\{x_t, a_t, o_t\}$. It can be
factorized as
\[
\begin{aligned}
p(x_t, a_t, o_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1})
&= p(x_t \mid z_{1:t-1}, x_{t-1}, a_{t-1:t}, o_{t-1:t}) \\
&\quad \cdot\, p(a_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1:t})\,
   p(o_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}),
\end{aligned} \tag{4.11}
\]
where $p(x_t \mid z_{1:t-1}, x_{t-1}, a_{t-1:t}, o_{t-1:t})$ is the transition pdf over
the object state, $p(a_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1:t})$ is the transition
pdf over the data association, and $p(o_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1})$
is the transition pdf over the object occlusion.
The transition pdf over the object state can be simplified as
\[
p(x_t \mid z_{1:t-1}, x_{t-1}, a_{t-1:t}, o_{t-1:t}) = p(x_t \mid x_{t-1}, o_t), \tag{4.12}
\]
where the following conditional independence statements have been taken into account
(see Sec. 6.1 for the definition of conditional independence and d-separation):
• $z_{1:t-2}$ and $o_{t-1}$ are d-separated from $x_t$ by $x_{t-1}$, since $x_{t-1}$ is
a given variable with incoming-outgoing arrows on the path.
• $z_{t-1}$ and $a_{t-1}$ are d-separated from $x_t$ by $x_{t-1}$, since $x_{t-1}$ is a
given variable with outgoing-outgoing arrows on the path.
• $a_t$ is d-separated from $x_t$ by $z_t$, since $z_t$ is a not given variable with
incoming-incoming arrows on the path.
Therefore, the object trajectories depend only on the previous object positions and
the possible occlusions among them.
On the other hand, the transition pdf over the data association can be simplified as
\[
p(a_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1:t}) = p(a_t), \tag{4.13}
\]
where it has been taken into account that $z_{1:t-1}$, $x_{t-1}$, $a_{t-1}$, and
$o_{t-1:t}$ are d-separated from $a_t$ by $z_t$, since $z_t$ is a not given variable
with incoming-incoming arrows on the path. These conditional independence properties
derive from the fact that the detections are unordered, which makes the previous data
associations and object trajectories uninformative about the current association.
The last term in Eq. 4.11, the transition pdf over the object occlusion, can be
simplified as
\[
p(o_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}) = p(o_t \mid x_{t-1}), \tag{4.14}
\]
where the following conditional independence statements have been taken into account:
• $z_{1:t-2}$ and $o_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is
a given variable with incoming-outgoing arrows on the path.
• $z_{t-1}$ and $a_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is a
given variable with outgoing-outgoing arrows on the path.
This reflects the fact that object occlusions only depend on the previous object
positions.
Substituting Eqs. 4.12, 4.13, and 4.14 in Eq. 4.11, the joint transition pdf on
$\{x_t, a_t, o_t\}$ can then be expressed as
\[
p(x_t, a_t, o_t \mid z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1})
= p(x_t \mid x_{t-1}, o_t)\, p(a_t)\, p(o_t \mid x_{t-1}), \tag{4.15}
\]
where the mathematical expressions for each probability term are derived in Sec. 4.2.1.
Substituting Eq. 4.15 in Eq. 4.10, the prior pdf on $\{x_t, a_t, o_t\}$ can then be
expressed as
\[
\begin{aligned}
p(x_t, a_t, o_t \mid z_{1:t-1})
&= p(a_t) \int_{x_{t-1}} p(x_t \mid x_{t-1}, o_t)\, p(o_t \mid x_{t-1}) \\
&\quad \cdot \sum_{a_{t-1}} \sum_{o_{t-1}}
   p(x_{t-1}, a_{t-1}, o_{t-1} \mid z_{1:t-1})\, dx_{t-1}.
\end{aligned} \tag{4.16}
\]
The prediction on $\{x_t, a_t, o_t\}$ performed by the prior pdf
$p(x_t, a_t, o_t \mid z_{1:t-1})$ is corrected by means of the likelihood
$p(z_t \mid z_{1:t-1}, x_t, a_t, o_t)$, which uses the new set of detections available
at the current time. The likelihood can be simplified as
\[
p(z_t \mid z_{1:t-1}, x_t, a_t, o_t) = p(z_t \mid x_t, a_t), \tag{4.17}
\]
taking into account that $z_{1:t-1}$ and $o_t$ are d-separated from $z_t$ by $x_t$,
since $x_t$ is a given variable with incoming-outgoing arrows on the path. This result
reflects the fact that the data association information is necessary to evaluate the
coherence between the predicted object positions and the detections.
The last probability term in Eq. 4.9 is just a normalization constant given by
\[
p(z_t \mid z_{1:t-1}) = \int_{x_t} \sum_{a_t} \sum_{o_t}
p(z_t \mid z_{1:t-1}, x_t, a_t, o_t)\, p(x_t, a_t, o_t \mid z_{1:t-1})\, dx_t, \tag{4.18}
\]
where all the involved terms have already been introduced.
Whereas the posterior pdf for a generic time step $t$ is given by Eq. 4.9, the
posterior pdf for the initial time step is computed in a different way, since no
detection is available yet
\[
p(x_0, a_0, o_0 \mid z_0) = p(x_0) = \mathcal{N}(x_0; \mu_0, \Sigma_0), \tag{4.19}
\]
where $\mathcal{N}(x_0; \mu_0, \Sigma_0)$ is a multivariate Gaussian distribution with
mean $\mu_0$ and covariance matrix $\Sigma_0$. The Gaussian distribution models the
uncertainty of the object detection process, i.e. the expected distortion in the
location of the detected objects. Fig. 4.5 depicts the graphical model corresponding
to the initial time step.
Figure 4.5: Graphical model of the developed Bayesian framework for the initial time
step.
All the random variables are necessary to model the multiple object tracking problem,
but the goal is to obtain the tracking information of the objects, i.e. their
trajectories, in which only the object state $x_t$ is involved. Therefore, in order to
obtain the object trajectories, the marginal posterior pdf over the object state is
computed by summing out the data association and the object occlusion
\[
p(x_t \mid z_{1:t}) = \sum_{a_t} \sum_{o_t} p(x_t, a_t, o_t \mid z_{1:t}). \tag{4.20}
\]
However, the posterior pdf $p(x_t, a_t, o_t \mid z_{1:t})$ cannot be solved
analytically for the tracking application at hand, and therefore neither can the
marginal posterior pdf $p(x_t \mid z_{1:t})$. The reason is that some of the
stochastic processes involved in the multiple object tracking model are non-linear
and/or non-Gaussian, which makes the integrals involved in the posterior pdf
intractable (2). Therefore, it is necessary to resort to approximate inference
techniques to obtain a suboptimal solution. In Sec. 4.3, an efficient framework
specifically developed for the current tracking application is presented, which is able
to obtain an accurate approximate solution to both $p(x_t, a_t, o_t \mid z_{1:t})$ and
$p(x_t \mid z_{1:t})$.
Once $p(x_t \mid z_{1:t})$ has been approximated, it is used to obtain an accurate
estimate of the state vector $\bar{x}_k$ by means of the Maximum a Posteriori (MAP)
estimator, described in Sec. 3.1. The estimated $\bar{x}_k$ contains the object
positions and all the desired tracking information at the current time step.
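The marginalization of Eq. 4.20 followed by a MAP decision can be sketched on a tiny discrete example. The hypothesis labels, the probability values, and the function names below are invented for illustration; in the thesis the state is continuous and the posterior is only available through an approximation.

```python
from collections import defaultdict

# Hypothetical joint posterior p(x_t, a_t, o_t | z_{1:t}) over a small discrete
# set of hypotheses. Keys are (x, a, o) and the values sum to one.
joint = {
    ("left", 0, 0): 0.10,
    ("left", 1, 0): 0.25,
    ("right", 0, 0): 0.05,
    ("right", 1, 0): 0.20,
    ("right", 1, 1): 0.40,
}


def marginal_over_state(joint_posterior):
    """Sum out a_t and o_t (discrete analogue of Eq. 4.20)."""
    marginal = defaultdict(float)
    for (x, _a, _o), prob in joint_posterior.items():
        marginal[x] += prob
    return dict(marginal)


def map_estimate(marginal):
    """Return the state with the largest marginal posterior probability."""
    return max(marginal, key=marginal.get)


p_x = marginal_over_state(joint)
x_map = map_estimate(p_x)
```

With these numbers the marginal gives 0.35 to "left" and 0.65 to "right", so the MAP estimate is "right", even though no single joint hypothesis for "right" needs to dominate on its own for this to happen.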
4.2.1 Transition pdfs
As stated in Sec. 4.1, the object dynamics is not an independent process due to the
potential interactions among close objects, especially in partial or total occlusions.
However, if the object interactions are known, it is possible to predict the motion of
each object independently. The interaction information is inferred from the potential
occlusions among objects, which in turn are represented by the random variable $o_t$.
Therefore, the transition pdf over the object state can be factorized as a product of
independent probability terms conditioned on $o_t$
\[
p(x_t \mid x_{t-1}, o_t) = \prod_{i_x=1}^{N_{obj}} p(x_{t,i_x} \mid x_{t-1}, o_{t,i_x}), \tag{4.21}
\]
where the factor $p(x_{t,i_x} \mid x_{t-1}, o_{t,i_x})$ represents the temporal
evolution of the $i_x$-th object. The object motion depends on whether the object
itself is occluded by another one or not, which is indicated by $o_{t,i_x} = i'_x$ and
$o_{t,i_x} = 0$, respectively. In the case of occlusion, the occluded object can move
independently according to its own dynamics, indicating that no interaction exists, or
it can follow the occluding object as a result of the interaction between them. On the
other hand, if there is no occlusion, the object moves independently according to its
own dynamics.
All of this can be mathematically expressed as
\[
p(x_{t,i_x} \mid x_{t-1}, o_{t,i_x}) =
\begin{cases}
\mathcal{N}(x_{t,i_x}; A x_{t-1,i_x}, \Sigma_{vis}) & \text{if } o_{t,i_x} = 0, \\
\mathcal{N}(x_{t,i_x}; A x_{t-1,i_x}, \Sigma_{occ})\,(1 - \beta(p_b)) & \text{if } o_{t,i_x} = i'_x, \\
\mathcal{N}(x_{t,i_x}; x_{t,i'_x}, \Sigma_{occ})\,\beta(p_b) & \text{if } o_{t,i_x} = i'_x,
\end{cases} \tag{4.22}
\]
where $\mathcal{N}(x; \mu, \Sigma)$ is a multivariate Gaussian function evaluated at
$x$ with mean $\mu$ and covariance matrix $\Sigma$; $A$ is a matrix implementing a
constant velocity model
\[
A =
\begin{pmatrix}
1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}; \tag{4.23}
\]
$x_{t,i'_x} = A x_{t-1,i'_x}$ is the predicted position of the occluding object;
$\Sigma_{vis}$ is a diagonal covariance matrix that models the uncertainty of the
dynamics of non-occluded objects, i.e. visible objects; $\Sigma_{occ}$ is a diagonal
covariance matrix that models the uncertainty of the dynamics of occluded objects; and
$\beta(p_b)$ is a Bernoulli distribution with parameter $p_b$ that models the
probability that an occluded object has an interacting behavior. The two cases that
share the condition $o_{t,i_x} = i'_x$ correspond to the two possible behaviors of an
occluded object: independent motion and interacting motion. The elements of the
diagonal covariance matrix $\Sigma_{occ}$ are set to greater values than those of
$\Sigma_{vis}$, because the uncertainty related to the trajectory of an occluded object
is usually higher. On the other hand, the parameter $p_b$ should be set according to
the expected number of object interactions. Notice that, according to the above dynamic
model, an occluded object changes its trajectory to follow the occluding one in an
interaction situation.
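A minimal sketch of how a sample could be drawn from the object dynamics of Eqs. 4.22 and 4.23 is given below, using only the Python standard library. Several points are assumptions of the sketch rather than statements of the thesis: the state is laid out as [r_x, r_y, v_x, v_y, id], the diagonal covariances are passed as per-component standard deviations, β(p_b) is read as "interact with probability p_b", and the identifier of the occluded object is kept when it follows its occluder. The function names are hypothetical.

```python
import random

# Constant velocity matrix A of Eq. 4.23 for the assumed state layout
# [r_x, r_y, v_x, v_y, id].
A = [
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sample_transition(x_prev, occluder_prev, sigma_vis, sigma_occ, p_b, rng):
    """Draw x_{t,i_x} following the three branches of Eq. 4.22.

    x_prev        : previous state of the object
    occluder_prev : previous state of the occluding object, or None if visible
    sigma_vis/occ : per-component standard deviations (square roots of the
                    diagonal entries of Sigma_vis and Sigma_occ)
    p_b           : probability of an interacting behavior
    rng           : random.Random instance
    """
    if occluder_prev is None:
        # Visible object: constant velocity prediction with Sigma_vis.
        mean, sigma = matvec(A, x_prev), sigma_vis
    elif rng.random() < p_b:
        # Interacting motion: follow the predicted state of the occluder.
        mean, sigma = matvec(A, occluder_prev), sigma_occ
        mean[4] = x_prev[4]  # assumption: the object keeps its own identifier
    else:
        # Occluded but independent: constant velocity with Sigma_occ.
        mean, sigma = matvec(A, x_prev), sigma_occ
    return [rng.gauss(m, s) for m, s in zip(mean, sigma)]
```

For example, `sample_transition([10, 20, 1, -2, 7], None, [1, 1, 0.1, 0.1, 0], [3, 3, 0.5, 0.5, 0], 0.3, random.Random(0))` draws a state scattered around the constant velocity prediction (11, 18) while leaving the identifier untouched, since its standard deviation is zero.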
As happened with the object dynamics, the data association is not a conditionally
independent process, since the association of one detection with one object depends on
the previously estimated associations. This leads to expressing the transition pdf
over the data association as
\[
p(a_t) = \prod_{i_a=1}^{N_{ms}} p(a_{t,i_a} \mid a_{t,1}, \ldots, a_{t,i_a-1}), \tag{4.24}
\]
where the first factor $p(a_{t,1})$ is the only one that is not restricted by other
associations, and therefore its mathematical expression is different. The probability
that the first detection is associated with an object is
\[
p(a_{t,1} = i_x) = p_{obj}, \tag{4.25}
\]
where $i_x \in \{1, \ldots, N_{obj}\}$, and the probability that the first detection
is associated with the clutter is
\[
p(a_{t,1} = 0) = p_{clu} = 1 - p_{obj}. \tag{4.26}
\]
The other probability terms $p(a_{t,i_a} \mid a_{t,1}, \ldots, a_{t,i_a-1})$ have a
different expression that is common to all of them. The conditional dependency on the
previously estimated associations is based on three restrictions that limit the
possible associations between detections and objects. The first restriction states
that one detection can be associated with at most one object or with the clutter. This
restriction arises from the fact that each detection is generated from the analysis of
a specific image region, which can only be part of one object due to the occlusion
phenomenon. The second restriction imposes that one object can be associated with at
most one detection, although the clutter can be associated with several detections.
This restriction derives from the characteristics of the object detector itself, which
does not allow split detections. The last restriction is also related to the occlusion
problem, and prevents different detections from sharing the same image regions.
Therefore, if one detection is already associated with one object, the detections that
share image regions with it cannot be associated with any object, and thus they have to
be associated with the clutter. Fig. 4.6 depicts the three restrictions. Fig. 4.6(a)
illustrates the first restriction, where there are two partially occluded objects and
only one detection, since one of the objects is too occluded to yield a detection. The
restriction prevents the detection from being associated with both objects.
Fig. 4.6(b) shows the second restriction, where there are only one object and two
detections. This situation can arise, for example, when the image region containing
the object fits different possible poses of the object. The second restriction
enforces that only one detection can be associated with the object, whereas the others
are associated with the clutter. Fig. 4.6(c) illustrates the third restriction, where
there are two partially occluded objects and three detections. Two of these detections
arise from the combination of the image regions of both objects, but only one
detection is actually correct, since one of the objects is too occluded to generate a
detection. The third restriction ensures that only one detection is associated with
the visible object, whereas the rest of the detections that have image regions in
common with it are associated with the clutter.
The mathematical formalization of the previous restrictions is as follows. The
probability that the $i_a$-th detection is associated with an object is expressed as
\[
p(a_{t,i_a} = i_x \mid a_{t,1}, \ldots, a_{t,i_a-1}) =
\begin{cases}
p_{obj} & \text{if } cond_1 \wedge cond_2, \\
0 & \text{rest of cases,}
\end{cases} \tag{4.27}
\]
where $i_x \in \{1, \ldots, N_{obj}\}$, $cond_1$ is a logical condition that encodes
the second restriction as
\[
cond_1 = \neg\exists\, i'_a < i_a \mid a_{t,i'_a} = i_x, \tag{4.28}
\]
and $cond_2$ is another logical condition that encodes the third restriction as
\[
cond_2 = \neg\exists\, i'_a < i_a \mid a_{t,i'_a} = i'_x \wedge
\mathrm{Ov}(z_{t,i_a}, z_{t,i'_a}) \geq th_{ov}, \tag{4.29}
\]
where $\mathrm{Ov}(z_{t,i_a}, z_{t,i'_a})$ is the overlap of the image regions related
to the detections $z_{t,i_a}$ and $z_{t,i'_a}$, and $th_{ov}$ is a threshold value.
Notice that the first restriction is implicitly fulfilled, since each probability
evaluates the association event of one detection with a single object.
On the other hand, the probability that the $i_a$-th detection is associated with the
clutter is
\[
p(a_{t,i_a} = 0 \mid a_{t,1}, \ldots, a_{t,i_a-1}) =
\begin{cases}
p_{clu} & \text{if } cond_2, \\
1 & \text{rest of cases.}
\end{cases} \tag{4.30}
\]
Notice that the condition $cond_1$ is not imposed in this case, since several
detections can be associated with the clutter.
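The chain of factors in Eqs. 4.24 to 4.30 can be evaluated for a candidate association vector with a short routine such as the following sketch. The function name, the encoding of the overlap as a square matrix, and the numeric values in the usage note are assumptions made for illustration.

```python
def association_prior(a_t, overlap, p_obj, th_ov):
    """Evaluate p(a_t) as the chain product of Eq. 4.24.

    a_t     : list of associations, 0 for clutter or an object index >= 1
    overlap : overlap[i][j] is Ov(z_i, z_j) between detections i and j
    p_obj   : probability of associating a detection with an object
    th_ov   : overlap threshold
    """
    p_clu = 1.0 - p_obj
    prob = 1.0
    for i, a_i in enumerate(a_t):
        earlier = range(i)
        # cond_1 (Eq. 4.28): no earlier detection is already assigned to the
        # same object.
        cond_1 = not any(a_t[j] == a_i for j in earlier)
        # cond_2 (Eq. 4.29): no earlier detection assigned to an object
        # overlaps the current detection by th_ov or more.
        cond_2 = not any(
            a_t[j] != 0 and overlap[i][j] >= th_ov for j in earlier
        )
        if a_i != 0:
            factor = p_obj if (cond_1 and cond_2) else 0.0  # Eq. 4.27
        else:
            factor = p_clu if cond_2 else 1.0               # Eq. 4.30
        prob *= factor
    return prob
```

For the first detection both conditions hold trivially, so the routine reduces to Eqs. 4.25 and 4.26. As a usage example, with three non-overlapping detections and `p_obj = 0.6`, the vector `[1, 0, 2]` receives the value 0.6 · 0.4 · 0.6, while `[1, 1, 0]` receives 0 because it assigns two detections to the same object.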
[Figure 4.6 panels, each relating Objects, Clutter, and Measurements: (a) First
restriction, with partially occluded objects; (b) Second restriction; (c) Third
restriction, with partially occluded objects.]
Figure 4.6: Restrictions imposed on the associations between detections and objects.
The object occlusion is not an independent process either, since only one object can
be in the foreground in an occlusion, whereas the rest are totally or partially behind
it in the background. Therefore, if it is estimated that the $i_x$-th object is
occluded by the $i'_x$-th object, the latter cannot be occluded by any other object.
This means that an occluding object is always considered to be in the foreground. On
the other hand, the relative depth position within a set of occluded objects is
considered irrelevant, and thus the only important thing is that all of them are
occluded by the same object. Fig. 4.7 illustrates the aforementioned occlusion
restrictions among objects.
[Figure 4.7 panels, each contrasting a correct and an incorrect description of object
occlusions: (a) Two objects: one occluding and another occluded; (b) Three objects: one
occluding and the rest occluded.]
Figure 4.7: Restrictions imposed on the occlusions among objects.
From a mathematical perspective, the expression of the transition pdf on $o_t$ subject
to the previous restrictions is
\[
p(o_t \mid x_{t-1}) = \prod_{i_x=1}^{N_{obj}}
p(o_{t,i_x} \mid x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1}), \tag{4.31}
\]
where only the first factor $p(o_{t,1} \mid x_{t-1})$ does not depend on the other
occlusion events. The probability that the first object is occluded by another is
computed as
\[
p(o_{t,1} = i_x \mid x_{t-1}) =
\begin{cases}
\mathcal{N}(G x_{t-1,i_x}; G x_{t-1,1}, \Sigma_{occ}) \cdot \beta(p_b) & \text{if } i_x \neq 1, \\
0 & \text{if } i_x = 1,
\end{cases} \tag{4.32}
\]
where $G$ is a matrix used to select the object position information, given by
\[
G =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}; \tag{4.33}
\]
$\Sigma_{occ}$ is a covariance matrix whose elements are set according to the size and
pose of the objects involved in the potential occlusion; and $\beta(p_b)$ is a
Bernoulli distribution with parameter $p_b$. Therefore, according to the definition of
$p(o_{t,1} \mid x_{t-1})$, the probability that the first object is occluded by another
object depends both on the distance between them (modeled by the Gaussian function) and
on whether the first object is in the background, occluded by the other object (modeled
by the Bernoulli distribution). On the other hand, the probability that the first
object is not occluded is expressed as
\[
p(o_{t,1} = 0 \mid x_{t-1}) = p(o_{t,1} = 0) = d_{vis}, \tag{4.34}
\]
where $d_{vis}$ is the probability density that the object is visible, i.e. not
occluded.
The rest of the probability terms in Eq. 4.31 depend on the previously estimated
occlusion events. The probability that the $i_x$-th object is occluded by another is
\[
p(o_{t,i_x} = i'_x \mid x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1}) =
\begin{cases}
0 & \text{if } i'_x = i_x \vee cond_1, \\
\mathcal{N}(G x_{t-1,i'_x}; G x_{t-1,i_x}, \Sigma_{occ}) & \text{if } cond_2 \vee cond_3, \\
\mathcal{N}(G x_{t-1,i'_x}; G x_{t-1,i_x}, \Sigma_{occ})\,\beta(p_b) & \text{rest of cases,}
\end{cases} \tag{4.35}
\]
where $cond_1 = i'_x < i_x \wedge o_{t,i'_x} \neq 0$ is a logical condition that
expresses the restriction that an occluded object cannot occlude another one; and
$cond_2 = i'_x < i_x \wedge o_{t,i'_x} = 0$ and
$cond_3 = i'_x > i_x \wedge \exists\, i''_x < i_x \mid o_{t,i''_x} = i'_x$ are logical
conditions that reflect that an occluding object is always in the foreground. On the
other hand, the probability that the $i_x$-th object is not occluded is given by
\[
p(o_{t,i_x} = 0 \mid x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1})
= p(o_{t,i_x} = 0 \mid o_{t,1}, \ldots, o_{t,i_x-1}) =
\begin{cases}
1 & \text{if } \exists\, i'_x < i_x \mid o_{t,i'_x} = i_x, \\
d_{vis} & \text{rest of cases.}
\end{cases} \tag{4.36}
\]
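The occlusion factors of Eqs. 4.32 to 4.36 and their chain product in Eq. 4.31 can be sketched as follows. This is an illustrative reading under stated assumptions, not the author's implementation: β(p_b) is treated as the scalar factor p_b, Σ_occ is taken as diagonal and restricted to the two position coordinates selected by G, cond_1 is read as "the candidate occluder is itself occluded" (the restriction described in the text), and all function and parameter names are hypothetical.

```python
import math


def gauss2d(p, q, var):
    """Density N(p; q, diag(var)) of a 2-D Gaussian with diagonal covariance."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    norm = 2.0 * math.pi * math.sqrt(var[0] * var[1])
    return math.exp(-0.5 * (dx * dx / var[0] + dy * dy / var[1])) / norm


def occlusion_factor(i, o_i, o_prev, pos, var_occ, p_b, d_vis):
    """Factor p(o_{t,i} = o_i | x_{t-1}, o_{t,1}, ..., o_{t,i-1}).

    i      : 1-based index of the object
    o_i    : 0 if visible, otherwise the 1-based index of its occluder
    o_prev : dict {index: value} with the values fixed for objects 1..i-1
    pos    : dict {index: (x, y)} with the previous positions G x_{t-1,index}
    """
    if o_i == 0:
        # Eq. 4.36: an object that already occludes another one is visible.
        occludes_someone = any(value == i for value in o_prev.values())
        return 1.0 if occludes_someone else d_vis
    k = o_i
    # cond_1: the candidate occluder k is itself occluded.
    cond_1 = k < i and o_prev.get(k, 0) != 0
    if k == i or cond_1:
        return 0.0
    density = gauss2d(pos[k], pos[i], var_occ)
    # cond_2: the occluder has a lower index and is known to be visible.
    cond_2 = k < i and o_prev.get(k, 0) == 0
    # cond_3: the occluder has a higher index and already occludes an
    # earlier object.
    cond_3 = k > i and any(value == k for value in o_prev.values())
    return density if (cond_2 or cond_3) else density * p_b


def occlusion_prior(o_t, pos, var_occ, p_b, d_vis):
    """Chain product of Eq. 4.31 for a full occlusion vector o_t."""
    prob, o_prev = 1.0, {}
    for i, o_i in enumerate(o_t, start=1):
        prob *= occlusion_factor(i, o_i, o_prev, pos, var_occ, p_b, d_vis)
        o_prev[i] = o_i
    return prob
```

For two objects, the configuration `[2, 1]`, in which each object occludes the other, evaluates to zero, whereas `[2, 0]` and `[0, 1]` receive positive values that decrease as the two objects move apart.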
The transition pdf on $o_t$ can be reformulated in a more convenient and compact
way using the previous equations as
\[
\begin{aligned}
p(o_t \mid x_{t-1})
&= \prod_{i_x \in S_{vis}} p(o_{t,i_x} = 0 \mid o_{t,1}, \ldots, o_{t,i_x-1})
   \cdot \prod_{i_x \in S_{occ}} p(o_{t,i_x} = i'_x \mid x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1}) \\
&= \prod_{i_x \in S_{vis}} p(o_{t,i_x} = 0 \mid o_{t,1}, \ldots, o_{t,i_x-1})
   \cdot \prod_{i_x \in S_{occ}^{2,3}} \mathcal{N}(G x_{t-1,i'_x}; G x_{t-1,i_x}, \Sigma_{occ}) \\
&\quad \cdot \prod_{i_x \in S_{occ}^{oth}} \mathcal{N}(G x_{t-1,i'_x}; G x_{t-1,i_x}, \Sigma_{occ}) \cdot \beta(p_b) \\
&= \prod_{i_x \in S_{vis}} p(o_{t,i_x} = 0 \mid o_{t,1}, \ldots, o_{t,i_x-1})
   \cdot \beta(p_b)^{|S_{occ}^{oth}|} \cdot \mathcal{N}(G' x_{t-1}; G'' x_{t-1}, \Sigma'_{occ}),
\end{aligned} \tag{4.37}
\]
where $S_{vis}$ is the set of indexes related to visible objects, i.e. not occluded;
$S_{occ}^{2,3}$ is the set of indexes related to occluded objects that fulfill the
logical conditions $cond_2$ and $cond_3$; $S_{occ}^{oth}$ is the set of indexes related
to occluded objects that do not fulfill any logical condition; $|S_{occ}^{oth}|$ is the
cardinality of the set $S_{occ}^{oth}$; $G'$ and $G''$ are matrices that associate the
position information between the occluded and occluding objects, and they are formed
by the proper combination of the matrix $G$; and $\Sigma'_{occ}$ is a covariance matrix
derived from the matrix $\Sigma_{occ}$. For the sake of a clearer notation, the product
of probability terms related to visible objects is represented by $p(o_{t,S_{vis}})$,
and thus the expression of the transition pdf on $o_t$ becomes
\[
p(o_t \mid x_{t-1}) = p(o_{t,S_{vis}})\, \beta(p_b)^{|S_{occ}^{oth}|}
\cdot \mathcal{N}(G' x_{t-1}; G'' x_{t-1}, \Sigma'_{occ}). \tag{4.38}
\]
4.2.2 Likelihood
The likelihood on $\{x_t, a_t\}$ can be expressed as a product of independent factors as
\[ p(z_t \mid x_t, a_t) = \prod_{i_a=1}^{N_{ms}} p(z_{t,i_a} \mid x_t, a_{t,i_a}) \tag{4.39} \]
since the association between detections and objects is already given. Each factor computes the likelihood of a specific detection as
\[ p(z_{t,i_a} \mid x_t, a_{t,i_a}) =
\begin{cases}
\delta(J z_{t,i_a} - J' x_{t,i_x})\, \mathcal{N}(H z_{t,i_a};\, H' x_{t,i_x}, \Sigma_{lh}) & \text{if } a_{t,i_a} = i_x, \\
d_{clu} & \text{if } a_{t,i_a} = 0,
\end{cases} \tag{4.40} \]
where $i_x \in \{1, \ldots, N_{obj}\}$. The term $\delta(J z_{t,i_a} - J' x_{t,i_x})$ is a Dirac delta function that ensures that a detection can only be associated with an object that shares the same identifier, i.e. the detection and the associated object must be of the same type. The matrices $J$ and $J'$ select the identifiers of the detection and the tracked object from $z_{t,i_a}$ and $x_{t,i_x}$, respectively, and are defined as
\[ J = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \tag{4.41} \]
and
\[ J' = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{4.42} \]
The matrices $H$ and $H'$ relate the position and weight information of the detection to that of the object state, and are given by
\[ H = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \tag{4.43} \]
and
\[ H' = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}. \tag{4.44} \]
Notice that, while the position information of the detection is compared with the predicted object position, the weight information of the detection is evaluated without an explicit reference to the tracked object. This is because, in the context of the current visual tracking application, the weight information measures the similarity between an image region and the theoretical appearance of a type of object, which is assumed to be constant over time (more details are given in Sec. 4.4). In this way, a weight value of 1 means that the appearance of the image region under consideration is exactly the same as that of the object type, while a value of 0 means that they are completely different. Finally, $\Sigma_{lh}$ is a covariance matrix that models the uncertainty of the detection process.
The likelihood on $\{x_t, a_t\}$ can be reformulated in a more convenient and compact way using the previous equations as
\[ \begin{aligned}
p(z_t \mid x_t, a_t) &= \prod_{i_a \in S^{clu}} p(z_{t,i_a} \mid x_t, a_{t,i_a} = 0) \prod_{i_a \in S^{obj}} p(z_{t,i_a} \mid x_t, a_{t,i_a} = i_x) = \\
&= \prod_{i_a \in S^{clu}} p(z_{t,i_a} \mid a_{t,i_a} = 0) \prod_{i_a \in S^{obj}} p(z_{t,i_a} \mid x_{t,i_x}, a_{t,i_a} = i_x) = \\
&= \prod_{i_a \in S^{clu}} p(z_{t,i_a} \mid a_{t,i_a} = 0)\, \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh}),
\end{aligned} \tag{4.45} \]
where $S^{clu}$ is the set of indexes that associate a detection with the clutter, $S^{obj}$ is the set of indexes that associate a detection with an object, and $H''$ and $H'''$ are matrices that respectively select the position information of all the detections and objects. These matrices are formed by the proper combination of the matrices $H$ and $H'$. Lastly, $\Sigma'_{lh}$ is a covariance matrix that models the uncertainty of the detections, and it is formed by combining copies of $\Sigma_{lh}$. In order to maintain a clear notation, the product of the terms involved in clutter associations is renamed as $p(z_{t,S^{clu}})$, and thus the likelihood on $\{x_t, a_t\}$ is expressed as
\[ p(z_t \mid x_t, a_t) = p(z_{t,S^{clu}})\, \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh}). \tag{4.46} \]
4.3 Approximate inference based on Rao-Blackwellized
particle filtering
An efficient inference method based on particle filtering has been developed to accurately approximate the posterior pdf $p(x_t, a_t, o_t \mid z_{1:t})$. This method uses a variance reduction technique, called Rao-Blackwellization (89), to deal with the curse of dimensionality (90), i.e. the exponential increase in the volume of a mathematical space caused by the addition of extra dimensions. The Rao-Blackwellization technique achieves a more accurate estimation than an approximation technique based solely on particle filtering. This can be informally explained as follows. Consider a particle filter framework that simulates the posterior pdf by a set of $N_{sam}$ weighted samples, also called particles, as
\[ p(x_t, a_t, o_t \mid z_{1:t}) = \sum_{i=1}^{N_{sam}} w_t^{[i]}\, \delta\!\left(x_t - x_t^{[i]},\, a_t - a_t^{[i]},\, o_t - o_t^{[i]}\right), \tag{4.47} \]
where $\delta(x, y, z)$ is a multi-dimensional Dirac delta function, $\{x_t^{[i]} \mid i = 1, \ldots, N_{sam}\}$ are the samples of the object state, $\{a_t^{[i]} \mid i = 1, \ldots, N_{sam}\}$ are the samples of the data association, $\{o_t^{[i]} \mid i = 1, \ldots, N_{sam}\}$ are the samples of the object occlusion, and $\{w_t^{[i]} \mid i = 1, \ldots, N_{sam}\}$ are the weights of the samples. Both samples and weights are computed using a Monte Carlo strategy (see Sec. 3.1.1) that is theoretically able to obtain an approximation of $p(x_t, a_t, o_t \mid z_{1:t})$ whose accuracy depends only on the number of samples, rather than on the dimensionality of the random variables involved. In practice, however, Monte Carlo methods depend on the ability to sample from multivariate distributions, which becomes increasingly difficult as the dimension of the random variables grows. The accuracy of the sampling therefore diminishes as the number of dimensions increases, which in turn yields approximations with higher variance. For the multi-tracking application at hand, the dimension of $\{x_t, a_t, o_t\}$ is very high: $7 \times N_{obj}$. This inevitably leads to low-quality approximations, i.e. with a high variance, unless a prohibitively high number of samples is used. The Rao-Blackwellization technique overcomes this problem by assuming that the random variables have a special structure that allows some of them to be analytically marginalized out, conditioned on the remaining ones. Since the marginalized variables are computed exactly, the posterior pdf is more accurate. The remaining variables still have to be approximated by particle filtering, but the approximation is more accurate since their dimension is lower. A formal demonstration of the variance reduction achieved by the Rao-Blackwellization technique can be found in (89).
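The variance reduction argument above can be illustrated with a minimal sketch. The toy model below is hypothetical and is not the tracking model of this chapter: a binary variable `a` must be sampled, whereas `x` is Gaussian once `a` is known, so its conditional mean is available in closed form. The names `MEANS`, `plain_estimate`, `rao_blackwellized_estimate` and `estimator_variance` are assumptions introduced only for this illustration.

```python
import random
import statistics

# Hypothetical toy model (an assumption for illustration only):
#   a ~ Bernoulli(0.5) has to be sampled,
#   x | a ~ Normal(MEANS[a], 1) is conditionally Gaussian.
MEANS = (0.0, 1.0)


def plain_estimate(n_samples, rng):
    """Estimate E[x] by sampling both a and x (pure Monte Carlo)."""
    total = 0.0
    for _ in range(n_samples):
        a = rng.randrange(2)
        total += rng.gauss(MEANS[a], 1.0)
    return total / n_samples


def rao_blackwellized_estimate(n_samples, rng):
    """Estimate E[x] by sampling only a and using E[x | a] analytically."""
    total = 0.0
    for _ in range(n_samples):
        a = rng.randrange(2)
        total += MEANS[a]  # closed-form conditional mean, x is never sampled
    return total / n_samples


def estimator_variance(estimator, n_samples=50, n_trials=400, seed=0):
    """Empirical variance of an estimator over repeated independent runs."""
    rng = random.Random(seed)
    estimates = [estimator(n_samples, rng) for _ in range(n_trials)]
    return statistics.variance(estimates)
```

Comparing `estimator_variance(plain_estimate)` with `estimator_variance(rao_blackwellized_estimate)` should show a clearly smaller spread for the second estimator, because the Gaussian part no longer contributes sampling noise.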
The designed Bayesian model for multi-tracking has a special structure that allows the object state $x_t$ to be marginalized out conditioned on $\{a_t, o_t\}$. Thus, the Rao-Blackwellization technique can be used to express the posterior pdf as
\[ p(x_t, a_t, o_t \mid z_{1:t}) = p(x_t \mid z_{1:t}, a_t, o_t)\, p(a_t, o_t \mid z_{1:t}), \tag{4.48} \]
where $p(x_t \mid z_{1:t}, a_t, o_t)$ is assumed to be conditionally linear Gaussian, and therefore has an analytical expression given by the Kalman filter (see Sec. 4.3.1):
\[ p(x_t \mid z_{1:t}, a_t, o_t) = \mathcal{N}(x_t;\, \mu_{x_t}, \Sigma_{x_t}). \tag{4.49} \]
The derivation of the previous equation, along with the parameters $\mu_{x_t}$ and $\Sigma_{x_t}$, is described in Sec. 4.3.1. For the linear Gaussian assumption to hold, both the prior pdf $p(x_t \mid z_{1:t-1}, a_t, o_t)$ and the likelihood $p(z_t \mid x_t, a_t, o_t)$ must be conditionally linear Gaussian processes. The prior pdf $p(x_t \mid z_{1:t-1})$, which defines the object dynamics, is a linear non-Gaussian process due to the occlusions. However, if the occlusion information is given, as in $p(x_t \mid z_{1:t-1}, a_t, o_t)$, the linear Gaussian condition is fulfilled. Similarly, the likelihood $p(z_t \mid x_t)$ is linear non-Gaussian because of the different possible combinations between detections and objects. However, if the data association is provided, as in $p(z_t \mid x_t, a_t, o_t)$, the likelihood becomes linear Gaussian.
The other probability term in Eq. 4.48, the posterior probability on $\{a_t, o_t\}$, can be expressed using Bayes' theorem as
\[ p(a_t, o_t \mid z_{1:t}) = \frac{p(a_t, o_t, z_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})} = \frac{p(z_t \mid z_{1:t-1}, a_t, o_t)\, p(a_t, o_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}, \tag{4.50} \]
where $p(a_t, o_t \mid z_{1:t-1})$ is the prior pdf on $\{a_t, o_t\}$, $p(z_t \mid z_{1:t-1}, a_t, o_t)$ is the likelihood, and $p(z_t \mid z_{1:t-1})$ is just a normalization constant.
The prior pdf $p(a_t, o_t \mid z_{1:t-1})$ can be recursively expressed as
\[ \begin{aligned}
p(a_t, o_t \mid z_{1:t-1}) &= \sum_{a_{t-1}} \sum_{o_{t-1}} p(a_t, o_t, a_{t-1}, o_{t-1} \mid z_{1:t-1}) = \\
&= \sum_{a_{t-1}} \sum_{o_{t-1}} p(a_t, o_t \mid z_{1:t-1}, a_{t-1}, o_{t-1})\, p(a_{t-1}, o_{t-1} \mid z_{1:t-1}),
\end{aligned} \tag{4.51} \]
where $p(a_t, o_t \mid z_{1:t-1}, a_{t-1}, o_{t-1})$ is the transition pdf on $\{a_t, o_t\}$, and $p(a_{t-1}, o_{t-1} \mid z_{1:t-1})$ is the posterior pdf at the previous time step.
The transition pdf on $\{a_t, o_t\}$ can be factorized as
\[ p(a_t, o_t \mid z_{1:t-1}, a_{t-1}, o_{t-1}) = p(a_t \mid z_{1:t-1}, a_{t-1}, o_{t-1})\, p(o_t \mid z_{1:t-1}, a_{t-1:t}, o_{t-1}), \tag{4.52} \]
where $p(a_t \mid z_{1:t-1}, a_{t-1}, o_{t-1})$ is the transition pdf over the data association, and $p(o_t \mid z_{1:t-1}, a_{t-1:t}, o_{t-1})$ is the transition pdf over the object occlusion.
The transition pdf over the data association can be simplified as
\[ p(a_t \mid z_{1:t-1}, a_{t-1}, o_{t-1}) = p(a_t), \tag{4.53} \]
where it has been taken into account that $z_{1:t-1}$, $a_{t-1}$, and $o_{t-1}$ are d-separated from $a_t$ by $z_t$ (see Sec. 6.1 for the definition of conditional independence and d-separation), since $z_t$ is a variable that is not given and has incoming-incoming arrows on the path. These conditional independence properties derive from the fact that the detections are unordered, which renders the previous data associations useless. The mathematical expression of $p(a_t)$ has already been introduced in Sec. 4.2.1.
On the other hand, the transition pdf over the object occlusion can be simplified as
\[ \begin{aligned}
p(o_t \mid z_{1:t-1}, a_{t-1:t}, o_{t-1}) &= p(o_t \mid z_{1:t-1}, a_{t-1}, o_{t-1}) = \\
&= \int_{x_{t-1}} p(o_t, x_{t-1} \mid z_{1:t-1}, a_{t-1}, o_{t-1})\, dx_{t-1} = \\
&= \int_{x_{t-1}} p(o_t \mid z_{1:t-1}, a_{t-1}, o_{t-1}, x_{t-1})\, p(x_{t-1} \mid z_{1:t-1}, a_{t-1}, o_{t-1})\, dx_{t-1} = \\
&= \int_{x_{t-1}} p(o_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1}, a_{t-1}, o_{t-1})\, dx_{t-1},
\end{aligned} \tag{4.54} \]
where the following conditional independence properties have been taken into account:
• $a_t$ is d-separated from $o_t$ by $z_t$, since $z_t$ is a variable that is not given and has incoming-incoming arrows on the path.
• $z_{1:t-2}$ and $o_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with incoming-outgoing arrows on the path.
• $z_{t-1}$ and $a_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with outgoing-outgoing arrows on the path.
These properties reflect the fact that the object occlusions only depend on the previous object positions. The mathematical expression of $p(o_t \mid x_{t-1})$ has already been introduced in Sec. 4.2.1, whereas the one for the conditional posterior pdf over the object state, $p(x_{t-1} \mid z_{1:t-1}, a_{t-1}, o_{t-1})$, appears in Sec. 4.3.1. Substituting these expressions in Eq. 4.54 gives
\[ \begin{aligned}
p(o_t \mid z_{1:t-1}, a_{t-1:t}, o_{t-1}) &= p(o_{t,S^{vis}})\, \beta(p_b)^{|S^{occ}_{oth}|} \cdot \\
&\quad \cdot \int_{x_{t-1}} \mathcal{N}(G' x_{t-1};\, G'' x_{t-1}, \Sigma'_{occ})\, \mathcal{N}\!\left(x_{t-1};\, \mu_{x_{t-1}}, \Sigma_{x_{t-1}}\right) dx_{t-1}.
\end{aligned} \tag{4.55} \]
The other term in Eq. 4.50, the likelihood, can be expressed as
\[ \begin{aligned}
p(z_t \mid z_{1:t-1}, a_t, o_t) &= \int_{x_t} p(z_t, x_t \mid z_{1:t-1}, a_t, o_t)\, dx_t = \\
&= \int_{x_t} p(z_t \mid z_{1:t-1}, a_t, o_t, x_t)\, p(x_t \mid z_{1:t-1}, a_t, o_t)\, dx_t = \\
&= \int_{x_t} p(z_t \mid a_t, x_t)\, p(x_t \mid z_{1:t-1}, o_t)\, dx_t,
\end{aligned} \tag{4.56} \]
where the following conditional independence properties have been taken into account:
• $z_{1:t-1}$ and $o_t$ are d-separated from $z_t$ by $x_t$, since $x_t$ is a given variable with incoming-outgoing arrows on the path.
• $a_t$ is d-separated from $x_t$ by $z_t$, since $z_t$ is a variable that is not given and has incoming-incoming arrows on the path.
The first property states that only the data association information is necessary to evaluate the coherence between the predicted object positions and the detections. The second one states that, if the detections at the current time are not available, the data association is useless for predicting the object positions. The probability term $p(z_t \mid a_t, x_t)$ has already been introduced in Sec. 4.2.2, and the expression for $p(x_t \mid z_{1:t-1}, o_t)$, representing the conditional prior pdf over the object state, is derived in Sec. 4.3.1. Substituting the expressions of the previous probability terms in Eq. 4.56, the following expression for the likelihood is obtained
\[ p(z_t \mid z_{1:t-1}, a_t, o_t) = p(z_{t,S^{clu}}) \int_{x_t} \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right) dx_t. \tag{4.57} \]
This expression can be further simplified by deriving a solution for the integral involved. The solution is based on the fact that
\[ \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right) \tag{4.58} \]
is proportional to $p(x_t \mid z_{1:t}, a_t, o_t)$, the conditional posterior pdf over the object state, as demonstrated next. Using Bayes' theorem, the conditional posterior pdf over the object state can be factorized as
\[ p(x_t \mid z_{1:t}, a_t, o_t) = \frac{p(z_t \mid z_{1:t-1}, a_t, o_t, x_t)\, p(x_t \mid z_{1:t-1}, a_t, o_t)}{\int_{x_t} p(z_t \mid z_{1:t-1}, a_t, o_t, x_t)\, p(x_t \mid z_{1:t-1}, a_t, o_t)\, dx_t}, \tag{4.59} \]
where the likelihood $p(z_t \mid z_{1:t-1}, a_t, o_t, x_t)$ is given in Sec. 4.2.2, and the prior pdf $p(x_t \mid z_{1:t-1}, a_t, o_t)$ in Sec. 4.3.1. Substituting their expressions in the previous equation yields
\[ \mathcal{N}(x_t;\, \mu_{x_t}, \Sigma_{x_t}) = \frac{p(z_{t,S^{clu}})\, \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right)}{\int_{x_t} p(z_{t,S^{clu}})\, \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right) dx_t}. \tag{4.60} \]
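Then, rearranging some terms yields
\[ \begin{aligned}
&\mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right) = \\
&\quad = \mathcal{N}(x_t;\, \mu_{x_t}, \Sigma_{x_t})\, \frac{\int_{x_t} p(z_{t,S^{clu}})\, \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right) dx_t}{p(z_{t,S^{clu}})},
\end{aligned} \tag{4.61} \]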
where the integral is just a normalization constant. Finally, noticing that $p(z_{t,S^{clu}})$ acts as a constant, since it does not depend on $x_t$, the proportionality statement is demonstrated:
\[ \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right) = k_p\, \mathcal{N}(x_t;\, \mu_{x_t}, \Sigma_{x_t}). \tag{4.62} \]
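The value of $k_p$ can be obtained by evaluating the previous expression at a specific object state $x'_t$. Choosing the mean, $x'_t = \mu_{x_t}$, simplifies the computation to
\[ \begin{aligned}
k_p &= \frac{\mathcal{N}(H'' z_t;\, H''' \mu_{x_t}, \Sigma'_{lh})\, \mathcal{N}\!\left(\mu_{x_t};\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right)}{\mathcal{N}(\mu_{x_t};\, \mu_{x_t}, \Sigma_{x_t})} = \\
&= \sqrt{(2\pi)^n \det(\Sigma_{x_t})} \cdot \mathcal{N}(H'' z_t;\, H''' \mu_{x_t}, \Sigma'_{lh})\, \mathcal{N}\!\left(\mu_{x_t};\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right),
\end{aligned} \tag{4.63} \]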
where $n$ is the dimension of $x_t$ and $\det(\cdot)$ is the determinant operator. Using the previous results, the integral in Eq. 4.57 can be solved as
\[ \begin{aligned}
&\int_{x_t} \mathcal{N}(H'' z_t;\, H''' x_t, \Sigma'_{lh})\, \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right) dx_t = \\
&\quad = k_p \int_{x_t} \mathcal{N}(x_t;\, \mu_{x_t}, \Sigma_{x_t})\, dx_t = k_p,
\end{aligned} \tag{4.64} \]
where it has been taken into account that the integral of the posterior pdf $\mathcal{N}(x_t;\, \mu_{x_t}, \Sigma_{x_t})$ over $x_t$ is equal to one.
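As a sanity check of Eqs. 4.62 to 4.64, the following minimal sketch evaluates both sides for a scalar state. It is only an illustration under simplifying assumptions: the state and the detection are one-dimensional, the scalar `h` plays the role of $H'''$ (with $H'' = 1$), and the function names and numeric values are hypothetical. The posterior mean and variance are obtained with the scalar form of the Kalman update of Sec. 4.3.1.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def kalman_update_1d(z, h, var_lh, mean_prior, var_prior):
    """Scalar Kalman update: returns the posterior mean and variance."""
    gain = var_prior * h / (h * var_prior * h + var_lh)
    mean_post = mean_prior + gain * (z - h * mean_prior)
    var_post = var_prior - gain * h * var_prior
    return mean_post, var_post


def k_p(z, h, var_lh, mean_prior, var_prior):
    """Scalar version of Eq. 4.63 (n = 1, det(Sigma) = var_post)."""
    mean_post, var_post = kalman_update_1d(z, h, var_lh, mean_prior, var_prior)
    return (math.sqrt(2.0 * math.pi * var_post)
            * normal_pdf(z, h * mean_post, var_lh)
            * normal_pdf(mean_post, mean_prior, var_prior))


def integral_numeric(z, h, var_lh, mean_prior, var_prior,
                     half_width=12.0, steps=20000):
    """Trapezoidal approximation of the left-hand side of Eq. 4.64."""
    std = math.sqrt(var_prior)
    lower = mean_prior - half_width * std
    dx = 2.0 * half_width * std / steps
    total = 0.0
    for k in range(steps + 1):
        x = lower + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_pdf(z, h * x, var_lh) * normal_pdf(x, mean_prior, var_prior)
    return total * dx
```

In this scalar setting the integral is also the Gaussian marginal of the detection, `normal_pdf(z, h * mean_prior, h * h * var_prior + var_lh)`, which gives a second, independent check of `k_p`.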
Finally, substituting the integral in Eq. 4.57 by its solution, the following expression for the likelihood pdf on $\{a_t, o_t\}$ is obtained
\[ \begin{aligned}
p(z_t \mid z_{1:t-1}, a_t, o_t) &= p(z_{t,S^{clu}})\, k_p = \\
&= p(z_{t,S^{clu}}) \sqrt{(2\pi)^n \det(\Sigma_{x_t})} \cdot \mathcal{N}(H'' z_t;\, H''' \mu_{x_t}, \Sigma'_{lh})\, \mathcal{N}\!\left(\mu_{x_t};\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right).
\end{aligned} \tag{4.65} \]
Lastly, the probability term $p(z_t \mid z_{1:t-1})$ in Eq. 4.50, which corresponds to a normalization constant, is given by
\[ p(z_t \mid z_{1:t-1}) = \sum_{a_t} \sum_{o_t} p(z_t \mid z_{1:t-1}, a_t, o_t)\, p(a_t, o_t \mid z_{1:t-1}). \tag{4.66} \]
Since the posterior pdf on $\{x_t, a_t, o_t\}$ has no analytical form, neither does the posterior pdf on $\{a_t, o_t\}$. The Rao-Blackwellization technique just decomposes a joint probability into two parts, one with an analytical expression and another without it. Therefore, approximate inference is still necessary. For this purpose, a new particle filtering framework has been developed to approximate the posterior pdf on $\{a_t, o_t\}$. Its computational complexity is linear in the number of detections and objects, rather than exponential, which makes the inference computationally tractable. The particle filtering framework is described in detail in Sec. 4.3.2, in which the following approximation based on a weighted sum of samples is derived
\[ p(a_t, o_t \mid z_{1:t}) = \sum_{i=1}^{N_{sam}} w_t^{[i]}\, \delta\!\left(a_t - a_t^{[i]},\, o_t - o_t^{[i]}\right), \tag{4.67} \]
where the expressions for the samples and weights can also be found in Sec. 4.3.2.
Using the above approximation of $p(a_t, o_t \mid z_{1:t})$, the posterior pdf $p(x_t, a_t, o_t \mid z_{1:t})$ can be expressed as
\[ \begin{aligned}
p(x_t, a_t, o_t \mid z_{1:t}) &= p(x_t \mid z_{1:t}, a_t, o_t)\, p(a_t, o_t \mid z_{1:t}) = \\
&= \sum_{i=1}^{N_{sam}} p(x_t \mid z_{1:t}, a_t^{[i]}, o_t^{[i]})\, w_t^{[i]}\, \delta\!\left(a_t - a_t^{[i]},\, o_t - o_t^{[i]}\right) = \\
&= \sum_{i=1}^{N_{sam}} \mathcal{N}\!\left(x_t;\, \mu_{x_t^{[i]}}, \Sigma_{x_t^{[i]}}\right) w_t^{[i]}\, \delta\!\left(a_t - a_t^{[i]},\, o_t - o_t^{[i]}\right).
\end{aligned} \tag{4.68} \]
This expression represents a mixture of multivariate Gaussians, where the parameters of each Gaussian are determined by the estimated object occlusions and detection associations.
4.3.1 Kalman filtering of the object state
Assume that the conditional posterior pdf $p(x_{t-1} \mid z_{1:t-1}, a_{t-1}, o_{t-1})$ at the time step $t-1$ is a multivariate Gaussian distribution given by
\[ p(x_{t-1} \mid z_{1:t-1}, a_{t-1}, o_{t-1}) = \mathcal{N}\!\left(x_{t-1};\, \mu_{x_{t-1}}, \Sigma_{x_{t-1}}\right), \tag{4.69} \]
where $\mu_{x_{t-1}}$ is the mean and $\Sigma_{x_{t-1}}$ is the covariance matrix. The mean can be divided into as many components as there are tracked objects,
\[ \mu_{x_{t-1}} = \{\mu_{x_{t-1,i_x}} \mid i_x = 1, \ldots, N_{obj}\}, \tag{4.70} \]
in such a way that each component contains the information relative to one object. In the same way, the covariance matrix can be expressed as
\[ \Sigma_{x_{t-1}} = \begin{pmatrix} \Sigma_{x_{t-1,1}} & & 0 \\ & \ddots & \\ 0 & & \Sigma_{x_{t-1,N_{obj}}} \end{pmatrix}, \tag{4.71} \]
where the elements $\{\Sigma_{x_{t-1,i_x}} \mid i_x = 1, \ldots, N_{obj}\}$ are matrices that handle each component of the object state independently. This block-diagonal arrangement reflects the fact that object trajectories are independent once the object occlusions and the detection associations are known.
The posterior pdf $p(x_t \mid z_{1:t}, a_t, o_t)$ for the current time step can be recursively computed in two steps: prediction and update. The first step predicts the evolution of $x_{t-1}$ at the current time step by means of the following expression, called prior pdf,
\[ p(x_t \mid z_{1:t-1}, a_t, o_t) = p(x_t \mid z_{1:t-1}, o_t) = \mathcal{N}\!\left(x_t;\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right), \tag{4.72} \]
where it has been taken into account that $a_t$ is d-separated from $x_t$ by $z_t$ (see Sec. 6.1 for the definition of conditional independence and d-separation), since $z_t$ is a variable that is not given and has incoming-incoming arrows on the path. This property reflects the fact that, if the detections at the current time step are not available, data association is useless for predicting object positions. The mean parameter can be decomposed into
\[ \hat{\mu}_{x_t} = \{\hat{\mu}_{x_{t,i_x}} \mid i_x = 1, \ldots, N_{obj}\}, \tag{4.73} \]
where each component involves a different tracked object, and is mathematically given by
\[ \hat{\mu}_{x_{t,i_x}} =
\begin{cases}
A \mu_{x_{t-1,i_x}} & \text{if } o_{t,i_x} = 0, \\
A \mu_{x_{t-1,i'_x}} & \text{if } o_{t,i_x} = i'_x,
\end{cases} \tag{4.74} \]
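where the matrix $A$ has already been introduced in Sec. 4.2.1. On the other hand, the covariance matrix is defined as
\[ \hat{\Sigma}_{x_t} = \begin{pmatrix} \hat{\Sigma}_{x_{t,1}} & & 0 \\ & \ddots & \\ 0 & & \hat{\Sigma}_{x_{t,N_{obj}}} \end{pmatrix}, \tag{4.75} \]
where the matrix elements $\{\hat{\Sigma}_{x_{t,i_x}} \mid i_x = 1, \ldots, N_{obj}\}$ are given by
\[ \hat{\Sigma}_{x_{t,i_x}} =
\begin{cases}
\Sigma_{vis} + A \Sigma_{x_{t-1,i_x}} A^{\top} & \text{if } o_{t,i_x} = 0, \\
\Sigma_{occ} + A \Sigma_{x_{t-1,i'_x}} A^{\top} & \text{if } o_{t,i_x} = i'_x.
\end{cases} \tag{4.76} \]
The matrices $\Sigma_{vis}$ and $\Sigma_{occ}$ have already been introduced in Sec. 4.2.1.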
The second step uses the set of available detections at the current time step to update the previous prediction as
\[ p(x_t \mid z_{1:t}, a_t, o_t) = \mathcal{N}(x_t;\, \mu_{x_t}, \Sigma_{x_t}). \tag{4.77} \]
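The mean is expressed by
\[ \mu_{x_t} = \{\mu_{x_{t,i_x}} \mid i_x = 1, \ldots, N_{obj}\}, \tag{4.78} \]
and its components by
\[ \mu_{x_{t,i_x}} =
\begin{cases}
\hat{\mu}_{x_{t,i_x}} + K_t \left(H z_{t,i_a} - H' \hat{\mu}_{x_{t,i_x}}\right) & \text{if } a_{t,i_a} = i_x, \\
\hat{\mu}_{x_{t,i_x}} & \text{if } a_{t,i_a} = 0,
\end{cases} \tag{4.79} \]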
where the matrices $H$ and $H'$ have already been defined in Sec. 4.2.2, and the variable $K_t$ is the Kalman gain given by
\[ K_t = \hat{\Sigma}_{x_{t,i_x}} H'^{\top} \left( H' \hat{\Sigma}_{x_{t,i_x}} H'^{\top} + \Sigma_{lh} \right)^{-1}, \tag{4.80} \]
where the covariance matrix $\Sigma_{lh}$ related to the detection model has already been introduced in Sec. 4.2.2.
Lastly, the covariance matrix is given by
\[ \Sigma_{x_t} = \begin{pmatrix} \Sigma_{x_{t,1}} & & 0 \\ & \ddots & \\ 0 & & \Sigma_{x_{t,N_{obj}}} \end{pmatrix}, \tag{4.81} \]
where the matrix elements $\{\Sigma_{x_{t,i_x}} \mid i_x = 1, \ldots, N_{obj}\}$ are given by
\[ \Sigma_{x_{t,i_x}} =
\begin{cases}
\hat{\Sigma}_{x_{t,i_x}} - K_t H' \hat{\Sigma}_{x_{t,i_x}} & \text{if } a_{t,i_a} = i_x, \\
\hat{\Sigma}_{x_{t,i_x}} & \text{if } a_{t,i_a} = 0.
\end{cases} \tag{4.82} \]
Notice that the posterior pdf over the objects that have not been associated with a detection is determined by the prediction alone.
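The prediction and update steps of Eqs. 4.74, 4.76, 4.79, 4.80 and 4.82 can be sketched for scalar object states as follows. This is a simplified illustration and not the implementation used in this work: each object state is reduced to a single position, the scalars `a` and `h` stand in for the matrices $A$ and $H'$ (with $H = 1$), and the function names, the list-based encoding of $o_t$ and $a_t$, and all default values are assumptions made only for this example.

```python
def predict(mean_prev, var_prev, occluder, a=1.0, var_vis=0.5, var_occ=2.0):
    """Prediction step for scalar states (Eqs. 4.74 and 4.76).

    occluder[i] is 0 when object i+1 is visible; otherwise it is the
    1-based index of the object that occludes it.
    """
    mean_pred, var_pred = [], []
    for i, occ in enumerate(occluder):
        source = i if occ == 0 else occ - 1   # occluded objects follow their occluder
        noise = var_vis if occ == 0 else var_occ
        mean_pred.append(a * mean_prev[source])
        var_pred.append(noise + a * var_prev[source] * a)
    return mean_pred, var_pred


def update(mean_pred, var_pred, detections, association, h=1.0, var_lh=0.3):
    """Update step for scalar states (Eqs. 4.79, 4.80 and 4.82).

    association[j] is 0 when detection j is clutter; otherwise it is the
    1-based index of the object that generated detection j.
    """
    mean_post, var_post = list(mean_pred), list(var_pred)
    for z, obj in zip(detections, association):
        if obj == 0:
            continue  # clutter does not modify any object
        i = obj - 1
        gain = var_pred[i] * h / (h * var_pred[i] * h + var_lh)  # Kalman gain
        mean_post[i] = mean_pred[i] + gain * (z - h * mean_pred[i])
        var_post[i] = var_pred[i] - gain * h * var_pred[i]
    return mean_post, var_post
```

Objects without an associated detection keep their predicted mean and variance, which mirrors the second case of Eqs. 4.79 and 4.82.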
4.3.2 Particle filtering of the data association and object occlusion
The posterior pdf on $\{a_t, o_t\}$ is simulated by a set of $N_{sam}$ weighted samples, also called particles, as
\[ p(a_t, o_t \mid z_{1:t}) = \sum_{i=1}^{N_{sam}} w_t^{[i]}\, \delta\!\left(a_t - a_t^{[i]},\, o_t - o_t^{[i]}\right), \tag{4.83} \]
where $\delta(x, y)$ is a multi-dimensional Dirac delta function, $\{a_t^{[i]}, o_t^{[i]} \mid i = 1, \ldots, N_{sam}\}$ are the samples of data association and object occlusion, and $\{w_t^{[i]} \mid i = 1, \ldots, N_{sam}\}$ are the sample weights. Samples and weights are computed using the principle of importance sampling (see Sec. 3.1.1), in which a proposal distribution $q(a_t, o_t \mid z_{1:t})$, also called importance density, is used to draw the samples $\{a_t^{[i]}, o_t^{[i]}\}$. The proposal distribution has been chosen to mitigate the degeneracy problem, which is achieved by minimizing the variance of the weights $w_t^{[i]}$. Theoretically, the proposal distribution that attains this goal should be proportional to $p(a_t, o_t \mid z_{1:t})$, since in this case the variance would be zero. However, this solution is unfeasible, since it would imply that the analytical expression of $p(a_t, o_t \mid z_{1:t})$ is known. The proposed solution consists in using the approximation of the posterior pdf at the previous time step, $p(a_{t-1}, o_{t-1} \mid z_{1:t-1})$, to approximate in turn the posterior pdf at the current time step. This process is explained as follows. Consider the optimal proposal distribution
\[ \begin{aligned}
q(a_t, o_t \mid z_{1:t}) &\propto p(a_t, o_t \mid z_{1:t}) \propto p(z_t \mid z_{1:t-1}, a_t, o_t)\, p(a_t, o_t \mid z_{1:t-1}) = \\
&= p(z_t \mid z_{1:t-1}, a_t, o_t) \sum_{a_{t-1}} \sum_{o_{t-1}} p(a_t, o_t \mid z_{1:t-1}, a_{t-1}, o_{t-1})\, p(a_{t-1}, o_{t-1} \mid z_{1:t-1}),
\end{aligned} \tag{4.84} \]
where the posterior pdf at the previous time step, $p(a_{t-1}, o_{t-1} \mid z_{1:t-1})$, has already been approximated by a sum of weighted particles
\[ p(a_{t-1}, o_{t-1} \mid z_{1:t-1}) = \sum_{i=1}^{N_{sam}} w_{t-1}^{[i]}\, \delta\!\left(a_{t-1} - a_{t-1}^{[i]},\, o_{t-1} - o_{t-1}^{[i]}\right). \tag{4.85} \]
Using this approximation, the proposal distribution can be recursively expressed as
\[ \begin{aligned}
q(a_t, o_t \mid z_{1:t}) &= p(z_t \mid z_{1:t-1}, a_t, o_t) \sum_{i=1}^{N_{sam}} w_{t-1}^{[i]}\, p(a_t, o_t \mid z_{1:t-1}, a_{t-1}^{[i]}, o_{t-1}^{[i]}) \cdot \\
&\quad \cdot \delta\!\left(a_{t-1} - a_{t-1}^{[i]},\, o_{t-1} - o_{t-1}^{[i]}\right),
\end{aligned} \tag{4.86} \]
where all the probability terms have already been presented in previous sections. Substituting these terms by their expressions leads to
\[ \begin{aligned}
q(a_t, o_t \mid z_{1:t}) &= \int_{x_t} p(z_t \mid a_t, x_t)\, p(x_t \mid z_{1:t-1}, o_t)\, dx_t \cdot p(a_t) \cdot \\
&\quad \cdot \sum_{i=1}^{N_{sam}} w_{t-1}^{[i]} \int_{x_{t-1}} p(o_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1}, a_{t-1}^{[i]}, o_{t-1}^{[i]})\, dx_{t-1} \cdot \\
&\quad \cdot \delta\!\left(a_{t-1} - a_{t-1}^{[i]},\, o_{t-1} - o_{t-1}^{[i]}\right).
\end{aligned} \tag{4.87} \]
The method developed for drawing samples from $q(a_t, o_t \mid z_{1:t})$ is based on the fact that the object occlusion samples $o_t^{[i]}$ can be drawn independently of the data association samples $a_t^{[i]}$. This dramatically reduces the complexity of the sampling task. Accordingly, the proposal distribution can be factorized as
\[ q(a_t, o_t \mid z_{1:t}) = q_{a_{t-1},o_{t-1}} \cdot q_{o_t} \cdot q_{a_t}, \tag{4.88} \]
where
\[ q_{a_{t-1},o_{t-1}} = \sum_{i=1}^{N_{sam}} w_{t-1}^{[i]}\, \delta\!\left(a_{t-1} - a_{t-1}^{[i]},\, o_{t-1} - o_{t-1}^{[i]}\right), \tag{4.89} \]
\[ q_{o_t} = \int_{x_{t-1}} p(o_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1}, a_{t-1}^{[i]}, o_{t-1}^{[i]})\, dx_{t-1}, \tag{4.90} \]
and
\[ q_{a_t} = p(a_t) \int_{x_t} p(z_t \mid a_t, x_t)\, p(x_t \mid z_{1:t-1}, o_t)\, dx_t. \tag{4.91} \]
The process to obtain a new sample $\{a_t^{[i]}, o_t^{[i]}\}$ starts by drawing a sample $\{a_{t-1}^{[i]}, o_{t-1}^{[i]}\}$ from $q_{a_{t-1},o_{t-1}}$, which is straightforward since it is a discrete probability distribution. Then, conditioned on this sample, an object occlusion sample $o_t^{[i]}$ is drawn from $q_{o_t}$. For this purpose, a tractable solution for the integral involved has to be found. Firstly, the probability terms in the integral are substituted by their expressions, given in previous sections,
\[ \begin{aligned}
q_{o_t} &= p(o_{t,S^{vis}})\, \beta(p_b)^{|S^{occ}_{oth}|} \cdot \\
&\quad \cdot \int_{x_{t-1}} \mathcal{N}(G' x_{t-1};\, G'' x_{t-1}, \Sigma'_{occ})\, \mathcal{N}\!\left(x_{t-1};\, \mu_{x_{t-1}}, \Sigma_{x_{t-1}}\right) dx_{t-1}.
\end{aligned} \tag{4.92} \]
Since the integral has no analytical solution, an approximate one is computed by means of Monte Carlo simulation. This technique approximates the mean of the function $\mathcal{N}(G' x_{t-1};\, G'' x_{t-1}, \Sigma'_{occ})$ by a sum over samples drawn from $\mathcal{N}\!\left(x_{t-1};\, \mu_{x_{t-1}}, \Sigma_{x_{t-1}}\right)$. In this case, considering that the sampling function is Gaussian, the integral can be satisfactorily approximated by only one sample, the mean of the Gaussian function, which gives
\[ \begin{aligned}
&\int_{x_{t-1}} \mathcal{N}(G' x_{t-1};\, G'' x_{t-1}, \Sigma'_{occ})\, \mathcal{N}\!\left(x_{t-1};\, \mu_{x_{t-1}}, \Sigma_{x_{t-1}}\right) dx_{t-1} \approx \\
&\quad \approx \mathcal{N}(G' \mu_{x_{t-1}};\, G'' \mu_{x_{t-1}}, \Sigma'_{occ}).
\end{aligned} \tag{4.93} \]
Using this solution for the integral, the occlusion sampling function can be expressed as
\[ q_{o_t} \approx p(o_{t,S^{vis}})\, \beta(p_b)^{|S^{occ}_{oth}|}\, \mathcal{N}(G' \mu_{x_{t-1}};\, G'' \mu_{x_{t-1}}, \Sigma'_{occ}). \tag{4.94} \]
In order to efficiently perform the sampling, the previous expression is factorized as (see Eq. 4.37)
\[ \begin{aligned}
q_{o_t} &= \prod_{i_x \in S^{vis}} p(o_{t,i_x} = 0 \mid o_{t,1}, \ldots, o_{t,i_x-1}) \cdot \prod_{i_x \in S^{occ}_{2,3}} \mathcal{N}(G \mu_{x_{t-1,i'_x}};\, G \mu_{x_{t-1,i_x}}, \Sigma_{occ}) \cdot \\
&\quad \cdot \prod_{i_x \in S^{occ}_{oth}} \mathcal{N}(G \mu_{x_{t-1,i'_x}};\, G \mu_{x_{t-1,i_x}}, \Sigma_{occ}) \cdot \beta(p_b).
\end{aligned} \tag{4.95} \]
Thus, the sampling task can be easily performed using the technique of ancestral sampling (91), which sequentially samples the components of the object occlusion $\{o_{t,i_x} \mid i_x = 1, \ldots, N_{obj}\}$. The first component is drawn from the following set of discrete probabilities (see Eqs. 4.32 and 4.34)
\[ p(o_{t,1} = i_x) =
\begin{cases}
\beta(p_b)\, \mathcal{N}(G \mu_{x_{t-1,i_x}};\, G \mu_{x_{t-1,1}}, \Sigma_{occ}) & \text{if } i_x \neq 1, \\
0 & \text{if } i_x = 1,
\end{cases} \tag{4.96} \]
\[ p(o_{t,1} = 0) = d_{vis}. \tag{4.97} \]
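The remaining components are drawn from the following set of discrete probabilities (see Eqs. 4.35 and 4.36)
\[ p(o_{t,i_x} = i'_x \mid o_{t,1}, \ldots, o_{t,i_x-1}) =
\begin{cases}
0 & \text{if } i'_x = i_x \lor \mathrm{cond}_1, \\
\mathcal{N}(G \mu_{x_{t-1,i'_x}};\, G \mu_{x_{t-1,i_x}}, \Sigma_{occ}) & \text{if } \mathrm{cond}_2 \lor \mathrm{cond}_3, \\
\mathcal{N}(G \mu_{x_{t-1,i'_x}};\, G \mu_{x_{t-1,i_x}}, \Sigma_{occ})\, \beta(p_b) & \text{rest of cases},
\end{cases} \tag{4.98} \]
\[ p(o_{t,i_x} = 0 \mid o_{t,1}, \ldots, o_{t,i_x-1}) =
\begin{cases}
1 & \text{if } \exists\, i'_x < i_x \mid o_{t,i'_x} = i_x, \\
d_{vis} & \text{rest of cases}.
\end{cases} \tag{4.99} \]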
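The ancestral sampling of the occlusion components can be sketched as follows. This is a simplified reading of Eqs. 4.96 to 4.99 and rests on several assumptions that are not part of the original formulation: object positions are scalars (so that $G$ reduces to the identity), the Bernoulli term $\beta(p_b)$ is treated as a constant factor `p_b`, the candidate values are normalized before drawing, an object that already occludes another one is taken to be visible with certainty, and the function names and default parameter values are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def sample_occlusions(positions, var_occ=4.0, d_vis=0.05, p_b=0.5, rng=random):
    """Sequentially draw o_{t,1}, ..., o_{t,N_obj}.

    positions[i] is the previous mean position of object i+1.
    The result holds 0 for a visible object, or the 1-based index of
    the object that occludes it.
    """
    n_obj = len(positions)
    occ = [0] * n_obj
    for i in range(n_obj):
        # Eq. 4.99, first case: an object that already occludes another
        # one stays in the foreground (assumed here to be deterministic).
        if any(occ[k] == i + 1 for k in range(i)):
            occ[i] = 0
            continue
        labels, weights = [0], [d_vis]
        for j in range(n_obj):
            # Eq. 4.98, first case: no self-occlusion, and an already
            # occluded object cannot occlude another one (cond_1).
            if j == i or (j < i and occ[j] != 0):
                continue
            weight = normal_pdf(positions[j], positions[i], var_occ)
            # cond_2 or cond_3: the candidate is already known to be in front.
            known_front = j < i or any(occ[k] == j + 1 for k in range(i))
            if not known_front:
                weight *= p_b  # remaining cases carry the extra factor
            labels.append(j + 1)
            weights.append(weight)
        occ[i] = rng.choices(labels, weights=weights, k=1)[0]
    return occ
```

Under these assumptions every drawn configuration respects the two structural restrictions stated in Sec. 4.2.1: no object occludes itself, and every occluding object is itself visible.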
Lastly, conditioned on $\{a_{t-1}^{[i]}, o_{t-1}^{[i]}\}$ and $o_t^{[i]}$, a data association sample $a_t^{[i]}$ is drawn from $q_{a_t}$. The mathematical expression of $q_{a_t}$ is obtained by substituting the integral in Eq. 4.91 by its solution (see Eqs. 4.56 and 4.65)
\[ q_{a_t} = p(a_t)\, p(z_{t,S^{clu}}) \sqrt{(2\pi)^n \det(\Sigma_{x_t})} \cdot \mathcal{N}(H'' z_t;\, H''' \mu_{x_t}, \Sigma'_{lh})\, \mathcal{N}\!\left(\mu_{x_t};\, \hat{\mu}_{x_t}, \hat{\Sigma}_{x_t}\right). \tag{4.100} \]
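In order to efficiently draw a data association sample, $q_{a_t}$ is factorized as
\[ \begin{aligned}
q_{a_t} &= \prod_{i_a=1}^{N_{ms}} p(a_{t,i_a} \mid a_{t,1}, \ldots, a_{t,i_a-1}) \prod_{i_a \in S^{clu}} p(z_{t,i_a} \mid a_{t,i_a} = 0) \cdot \\
&\quad \cdot \prod_{i_a \in S^{obj}} \sqrt{(2\pi)^{n'} \det(\Sigma_{x_{t,i_x}})} \cdot \mathcal{N}(H z_{t,i_a};\, H' \mu_{x_{t,i_x}}, \Sigma_{lh})\, \mathcal{N}\!\left(\mu_{x_{t,i_x}};\, \hat{\mu}_{x_{t,i_x}}, \hat{\Sigma}_{x_{t,i_x}}\right),
\end{aligned} \tag{4.101} \]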
where $n'$ is the dimension of one component of the object state, and the rest of the parameters have already been introduced in previous sections. A sample can now be easily obtained by means of ancestral sampling, which sequentially draws the sample components $\{a_{t,i_a} \mid i_a = 1, \ldots, N_{ms}\}$. The first component is sampled using the following set of discrete probabilities
\[ p(a_{t,1} = i_x) = p_{obj}\, \delta(J z_{t,1} - J' x_{t,i_x})\, \mathcal{N}(H z_{t,1};\, H' x_{t,i_x}, \Sigma_{lh}), \tag{4.102} \]
\[ p(a_{t,1} = 0) = p_{clu}\, d_{clu}, \tag{4.103} \]
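and the rest of the sample components are drawn from the following set of discrete probabilities
\[ p(a_{t,i_a} = i_x \mid a_{t,1}, \ldots, a_{t,i_a-1}) =
\begin{cases}
p_{obj}\, \delta(J z_{t,i_a} - J' x_{t,i_x})\, \mathcal{N}(H z_{t,i_a};\, H' x_{t,i_x}, \Sigma_{lh}) & \text{if } \mathrm{cond}_1 \land \mathrm{cond}_2, \\
0 & \text{rest of cases},
\end{cases} \tag{4.104} \]
\[ p(a_{t,i_a} = 0 \mid a_{t,1}, \ldots, a_{t,i_a-1}) =
\begin{cases}
p_{clu}\, d_{clu} & \text{if } \mathrm{cond}_2, \\
d_{clu} & \text{rest of cases},
\end{cases} \tag{4.105} \]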
where all of the parameters and conditions have already been introduced in previous sections.
Once all the samples $\{a_t^{[i]}, o_t^{[i]} \mid i = 1, \ldots, N_{sam}\}$ have been drawn by the previous method, the corresponding unnormalized weights are computed as
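\[ \begin{aligned}
w_t^{*[i]} &= \frac{p(a_t^{[i]}, o_t^{[i]} \mid z_{1:t})}{q(a_t^{[i]}, o_t^{[i]} \mid z_{1:t})} = p(z_t \mid z_{1:t-1}) = \\
&= \sum_{i=1}^{N_{sam}} p(z_t \mid z_{1:t-1}, a_t^{[i]}, o_t^{[i]})\, p(a_t^{[i]}, o_t^{[i]} \mid z_{1:t-1}).
\end{aligned} \tag{4.106} \]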
Notice that all the weights have the same value, since it does not depend on the specific drawn sample $\{a_t^{[i]}, o_t^{[i]}\}$. This allows the computation of the normalized weights to be simplified as
\[ w_t^{[i]} = \frac{1}{N_{sam}}. \tag{4.107} \]
Notice also that an explicit resampling stage is not necessary, since it has been implicitly performed by drawing a sample $\{a_{t-1}^{[i]}, o_{t-1}^{[i]}\}$ from the distribution $q_{a_{t-1},o_{t-1}}$.
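The implicit resampling and the uniform weights of Eq. 4.107 can be sketched in a few lines. The function names are hypothetical, and the previous particles are represented only by their weights, which is a simplification made for this illustration.

```python
import random


def draw_ancestors(prev_weights, rng=random):
    """Draw, for each new particle, the index of a previous particle
    according to the discrete distribution of Eq. 4.89."""
    n_sam = len(prev_weights)
    return rng.choices(range(n_sam), weights=prev_weights, k=n_sam)


def uniform_weights(n_sam):
    """Normalized weights of Eq. 4.107."""
    return [1.0 / n_sam] * n_sam
```

Previous particles with a large weight tend to be selected several times and those with a negligible weight tend to disappear, which is the effect that an explicit resampling stage would otherwise provide.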
4.4 Object detections
Object detections $z_t$ are obtained by a set of detectors, each one focused on detecting a specific object category. The detectors describe each object category by means of a color representation based on an RGB histogram that captures its appearance. The RGB histogram is mathematically expressed as
\[ h_c(i) = n_i; \quad i = 1, \ldots, N_b, \tag{4.108} \]
i
is the number of pixels whose values belongs to that of the i
th
histogram bin,
and N
b
is the total number of bins. The RGB histogram is computed using a set of
representative image patches of the specific object category. These patches represent
the training data for the learning of the parameters of the object model, i.e. the bins of
the color histogram. Fig. 4.8 shows visually the RGB color histograms of two different
object categories related to a football video sequence. Fig. 4.8(a) contains an image
of a football match with two teams that are the two object categories. Fig. 4.8(b)
and (c) show the color histograms of the red dressed team, and the black and white
dressed team, respectively. The color histogram is represented in a three-dimensional
vector space, where each axis is a RGB color channel. The bins are represented by
spheres where the center indicates the RGB coordinates, and the radius is a magnitude
proportional to the bin count.
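A minimal sketch of the histogram of Eq. 4.108 is given below. The number of bins per channel, the 8-bit pixel range, the row-major bin indexing and the function name are assumptions made for illustration; the original text does not specify them.

```python
def rgb_histogram(pixels, bins_per_channel=8):
    """Count the pixels falling in each bin of a joint RGB histogram.

    pixels is an iterable of (r, g, b) tuples with 8-bit channel values.
    The result is a flat list of N_b = bins_per_channel ** 3 counts.
    """
    hist = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        # Map each 8-bit channel value to its bin index in [0, bins_per_channel).
        r_bin = r * bins_per_channel // 256
        g_bin = g * bins_per_channel // 256
        b_bin = b * bins_per_channel // 256
        index = (r_bin * bins_per_channel + g_bin) * bins_per_channel + b_bin
        hist[index] += 1
    return hist
```

The histogram of an object category would then be obtained by passing the pixels of all its training patches to this function.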
Figure 4.8: Example of color histograms of two object categories. (a) Image of a football match. (b) Color histogram of the red dressed team. (c) Color histogram of the black and white dressed team.
The search for object candidates in each time step, i.e. the detection process, is performed by computing the similarity between the histograms of the object categories and the histograms of the image regions of a frame. The similarity measure is computed through the Bhattacharyya coefficient
\[ \rho_B(h_c, h_r) = \sum_{i=1}^{N_b} \sqrt{n_i^{h_c} \cdot n_i^{h_r}}, \tag{4.109} \]
where h
c
is the learned color histogram of one of the object categories, and h
r
is the color
histogram of the candidate image region. The range of values of the Bhattacharyya co-
efficient is [0, 1], where the values closer to 1 indicate that a candidate image regions are
more similar to one object category. Fig. 4.9 shows two different examples of similarity
maps for two object categories that corresponds to the football teams of Fig. 4.8(a).
The X-Y coordinates of each similarity map coincide with the X-Y coordinates of the
image frame, while the Z coordinates are the values of the Bhattacharyya coefficient.
Note that the Bhattacharyya coefficient at specific coordinates $(x_1, y_1)$ of the
similarity map has been computed using a rectangular image region that is centered
at $(x_1, y_1)$, and whose size is the same as that of the rectangles used to
compute the color histograms of the different object categories. Fig. 4.9(a) shows the
similarity map related to the football team dressed in red, and Fig. 4.9(b) shows the one
related to the team dressed in black and white. The original image has been projected over
the similarity map as a visual aid to easily correlate the peaks of the similarity map with
the image regions. Notice that the main peaks are correctly located over the football players.
There are also image regions with a high similarity value due to similar objects in the
background, such as the white lines of the pitch, which are quite similar to the white
t-shirts of one of the football teams.
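The similarity measure of Eq. (4.109) can be sketched as follows. This is a minimal illustration under one stated assumption: both histograms are normalized to unit sum before the comparison, which is what keeps the coefficient in [0, 1]. The function names are hypothetical.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to 1 (assumed preprocessing)."""
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else list(hist)


def bhattacharyya(h_c, h_r):
    """Bhattacharyya coefficient between a category histogram h_c and a
    candidate-region histogram h_r, both given as lists of bin counts."""
    p, q = normalize(h_c), normalize(h_r)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


same_shape = bhattacharyya([1, 2, 3], [2, 4, 6])   # identical distributions
disjoint = bhattacharyya([1, 0], [0, 1])           # no common bins
partial = bhattacharyya([1, 1], [1, 0])            # partial overlap
```

Evaluating `bhattacharyya` for a window centered at every pixel of a frame would give one similarity map per object category, as in Fig. 4.9.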
The detections are finally obtained by selecting the image regions with an acceptable
similarity value. The selection is accomplished by first performing a non-maximum
suppression (suppression of all values that are not local maxima) in order to obtain the
peaks of the similarity map, and then discarding the peaks whose value is below a
threshold $th_{sim}$. Thus, the selected peaks correspond to the image regions that are
likely to belong to one of the object categories. Fig. 4.10(a) shows the resulting
detections for the red dressed team over the similarity map, marked by circles, and
Fig. 4.10(b) the detections over the original image, marked by crosses. Notice that there
is a missing detection as a result of a severe partial occlusion. Similarly, Fig. 4.11
shows the detections for the black and white dressed team. In this case, there are several
false detections due to the background clutter (white lines of the pitch) and other similar
moving objects (the referees). All the aforementioned problems in the acquisition of
detections are typical of a visual object detector, and they must be appropriately handled
by the tracking algorithm. In this sense, the developed multi-object tracking system
explicitly addresses these problems to correctly compute the object trajectories.
(a) Similarity map relative to the team dressed in red.
(b) Similarity map relative to the team dressed in black and white.
Figure 4.9: Similarity maps of the color histograms of object categories and candidate
image regions.
As a result, a set of detections $z_t = \{z_{t,i_z} \mid i_z = 1, \ldots, N_{ms}\}$ is obtained,
where each detection $z_{t,i_z} = \{r^z_{t,i_z}, w^z_{t,i_z}, id^z_{t,i_z}\}$ contains,
respectively, the 2D coordinates of the detection over the image plane, the Bhattacharyya
coefficient that is used as a quality value, and an identifier that associates each
detection with the object detector that generated it.
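The peak selection and the resulting detection set can be sketched as follows. This is a simplified illustration, not the thesis implementation: the 3x3 neighbourhood, the strict local-maximum test (so flat plateaus yield no peak), and the dictionary keys `r`, `w` and `id` (mirroring the position, quality value and detector identifier of each detection) are assumptions.

```python
def detect_peaks(sim_map, th_sim, detector_id):
    """Return detections at local maxima of sim_map whose value >= th_sim.

    sim_map is a list of rows of Bhattacharyya coefficients; each detection
    holds the (x, y) position, the coefficient as quality value, and the
    identifier of the detector that produced it.
    """
    rows, cols = len(sim_map), len(sim_map[0])
    detections = []
    for y in range(rows):
        for x in range(cols):
            value = sim_map[y][x]
            if value < th_sim:
                continue  # discard values below the threshold
            # Values of the (up to 8) neighbours inside the map.
            neighbours = [
                sim_map[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
                if (ny, nx) != (y, x)
            ]
            # Keep only strict local maxima (non-maximum suppression).
            if all(value > n for n in neighbours):
                detections.append({"r": (x, y), "w": value, "id": detector_id})
    return detections


sim_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.5, 0.4],
]
strong = detect_peaks(sim_map, 0.7, "red")
loose = detect_peaks(sim_map, 0.5, "red")
```

Lowering the threshold from 0.7 to 0.5 adds the weaker peak at position (3, 2), which illustrates how the choice of threshold trades missing detections against false ones.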
4.5 Results
The performance of the developed Bayesian tracking method for multiple interacting
objects has been evaluated using a public database called ‘VS-PETS 2003’, which
was officially presented for the IEEE International Workshop on Visual Surveillance
and Performance Evaluation of Tracking and Surveillance. This dataset consists of
several video sequences of a football match in which the players move around the
pitch. The dataset is very appropriate for validating the developed multi-object tracking
algorithm because of the challenging interactions between the football players, which
go beyond simple object crosses.
The goal is to compute the trajectories of the football players of both teams:
one dressed in red and the other in black and white. There are two different object
detectors, described in Sec. 4.4, each one focused on detecting the players of one team. One
of the main characteristics of the detectors with respect to the tracking task is that
they generate detections of the entire object. This rules out the split
measurements/detections that are typical of part-based object detectors and of moving object
detectors based on background subtraction techniques (77). Nonetheless, more than one
detection per object is possible, since different poses of a player can lead to several
detections. Another reason for multiple detections per object arises when the object
size is not properly estimated. Another fundamental characteristic of the detectors is that
the probability of detecting an existing object is less than one, which means that an
object may not be detected as a result of abnormal variations in its appearance, or of
partial or total occlusions. The last relevant characteristic of the detectors is that the
probability of false alarm is higher than zero, i.e. there can be false detections not
related to the objects of interest. False alarms arise as a consequence of similar objects
in the background, called clutter. There can be several types of clutter: random clutter
that has no temporal coherence; permanent static clutter, such as the pitch lines, which
can be easily confused with the black and white dressed team; and permanent moving
clutter, such as referees, coaches, and spectators. The detectors used are not very
complex with regard to the current state of the art, but they are well-known low-complexity
color detectors that can be applied to a wide range of video applications. This decision
has made it possible to focus on the tracking task, and at the same time makes the tracking
application more practical, since it has to deal with the typical problems of a standard
detector. Nonetheless, the developed tracking strategy is versatile enough to be used
with more complex object detectors (as long as they fulfill the aforementioned detector
characteristics), which would be expected to increase the tracking performance.
(a) Detections over the similarity map of Fig. 4.9(a).
(b) Detections over the original image presented in Fig. 4.8(a).
Figure 4.10: Computed detections from the red dressed team.
(a) Detections over the similarity map of Fig. 4.9(b).
(b) Detections over the original image presented in Fig. 4.8(a). Annotations in the image:
"Several measurements per object" and "Clutter".
Figure 4.11: Computed detections from the black and white dressed team.
As a result, the tracking has to face different kinds of drawbacks, such as multiple
detections per object, missing detections, and false detections, and at the same time
manage the sequence of correct detections to generate the trajectories of every player.
In this context, the ‘VS-PETS 2003’ is a challenging database because of the great
number of player interactions that take place. The interactions can be classified into:
• Overtaking action: one or more players overtake others, and their trajectories
share the same direction. The overtaking action can be so slow in some occasions
that it could also be considered as two or more players going in parallel.
• Simple cross: two or more players cross each other keeping their trajectories. Note
that, unlike the overtaking action, the players have different trajectory directions.
• Complex cross: two or more players cross each other changing their trajectories
during the cross. The cross can even last a relatively long period of time due to
the typical interactions that take place between football players.
In all the previous situations, the involved players can be partially or totally occluded.
Especially challenging are the interactions that involve a complex cross, due to the
variation in the object trajectories. In this situation, the existing multi-object tracking
approaches usually fail to compute the trajectories, since they rely on the objects
keeping their trajectories during the occlusion. The only way they can adapt to
the variations in the object trajectories is by means of the available detections. However,
some of the objects involved in a cross have no associated detection due
to the implicit occlusion. A similar problem can arise in overtaking actions when the
object occlusions last too long. In these cases, there is a significant period of time in
which some of the objects have no associated detection, and the increasing uncertainty
in the prediction of the object positions can be too high to recover the
correct trajectories after the occlusion. All of these problems are explicitly addressed
in the developed tracking algorithm, which models the behavior of interacting objects.
In the following two sections, the results about the performance of the developed
tracking algorithm are presented. The first section graphically shows the performance
of the tracking system in different kinds of interactions. The second section presents
quantitative results about the global performance of the tracking algorithm over the
‘VS-PETS 2003’ dataset, together with a comparison with a similar object tracking
algorithm that does not explicitly model the interacting behavior between objects.
4.5.1 Qualitative results
In this section, qualitative tracking results are presented for different types of object
interactions. Fig. 4.12 illustrates a situation involving a simple cross between two rival
players, who keep their trajectories before and after the cross. The first row shows
the original frames along with a blue square that encloses the image region where the
simple cross takes place. The second row shows in detail the object interaction inside
the previous squared image regions and the computed object detections, marked with
blue crosses. Note that some detections are missing due to the occlusions. The last
row shows the tracking results in which the tracked objects have been enclosed with a
blue rectangle and labeled by an identifier. Notice that the labels have been correctly
assigned to each player during the interaction, which allows the object trajectories to be
satisfactorily recovered. A simple cross involving different types of objects, in this case
players of rival teams, is the least complex interaction for two reasons. The first one is
that the trajectories do not change, and thus the object positions can be easily recovered
after the occlusion. The other reason is that the objects involved in the interaction are of
different types, which simplifies the data association stage since the detections can
only be associated with objects of the same type. A consequence of this is that the marginal
posterior pdf over the involved objects is unimodal rather than multimodal, as can
be observed in Figs. 4.13 and 4.14. These figures represent the marginal posterior pdf
over the objects of Fig. 4.12. Note that, instead of representing the marginalized
distribution by the corresponding mixture of Gaussians, only the mean of each Gaussian has
been depicted by a weighted sample. The samples overlap, giving the impression
that there is only one sample, although not all of the samples have the same weight,
since the weight depends on the global data association that involves all the objects.
An example of a complex cross is illustrated in Fig. 4.15. It involves three players:
two of them are from the same team, and the other one is from the rival team. The level
of interaction in a complex cross is higher than that of a simple cross since the object
trajectories change during the interaction, while the objects are occluded. In addition,
the duration of the interaction is usually longer than that of the simple cross, which in
turn makes the tracking more complex. This kind of situation cannot be satisfactorily
managed with the typical assumption of independent object motion made by the
majority of multi-object tracking approaches. However, the developed tracking algorithm
explicitly models this kind of interaction to satisfactorily perform the tracking, as
can be observed in Fig. 4.15. An additional source of difficulty is that there
are objects of the same type (players of the same team) involved in the interaction. In
this case, the data association stage is more complex since there are several feasible
detection associations, which makes the corresponding posterior pdf multimodal. This fact
can be observed in Fig. 4.17 and Fig. 4.18, in which the object state samples of the
players belonging to the same team are grouped in different modes. On the contrary,
the marginal posterior pdf of the rival player is unimodal, as can be observed in
Fig. 4.16.
Lastly, Fig. 4.19 illustrates an overtaking action involving three players, two of
them belonging to the same team, and another to the rival one. This situation is
easier to handle than that of the complex cross since the object trajectories do not
change, but it is more complex than that of the simple cross since the duration of the
occlusion is usually much longer, implying more missing detections. Similarly to the
example of the complex cross, the presence of players of the same team poses a greater
difficulty, which causes their marginal posterior pdfs to be multimodal, as shown
in Figs. 4.20 and 4.21. On the other hand, the marginal posterior pdf of the rival player
is unimodal, as depicted in Fig. 4.22.
4.5.2 Quantitative results
The performance of the developed tracking algorithm has been tested using the ‘VS-PETS
2003’ dataset described in Sec. 4.5. It has also been compared with a similar
algorithm that lacks an explicit model for interacting objects, and which theoretically
cannot handle situations in which the objects change their trajectories while they are
occluded, as happens in complex crosses. This algorithm is described in (73), and the
main differences with respect to the proposed one are the lack of a model for managing
occlusions and the lack of a dynamic model for objects with non-constant velocity. The
tracking results of both algorithms are shown in Tabs. 4.1 and 4.2, which indicate the
number of tracking errors in a set of interacting situations extracted from the ‘VS-PETS
2003’ dataset. A tracking error is considered to occur when the bounding box
of the ground truth and the one computed from the estimated object state do not
overlap. This situation is caused by poor object state estimations or by erroneous
detection associations that lead to the exchange of object identities. In addition, the
tables contain a description of the main characteristics of the tested interactions, such
as the interaction type, the number of players who are involved in the interaction,
the total number of players who are being tracked, and the duration in frames of the
interaction. The tracking results are obtained in conditions of continuous operation, i.e.
the tracking algorithm is started at the beginning of a sequence, and it is stopped at
the end. There is no reinitialization of the algorithm in case of tracking failure. This
procedure is more realistic for testing tracking systems, and in addition it makes it
possible to test the failure recovery capability of the particle filtering technique. The
tables exclude the tracking results for situations not involving object interactions because
there are almost no tracking errors in them. This makes it possible to properly measure the
real performance of the tracking in interacting situations, which would otherwise be
obscured by the high tracking efficiency in non-interacting situations.
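The error criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are axis-aligned tuples (x_min, y_min, x_max, y_max), boxes that only touch along an edge are treated as non-overlapping, and one ground-truth box is compared with one estimated box per frame. The function names are hypothetical.

```python
def boxes_overlap(a, b):
    """True if two axis-aligned boxes (x_min, y_min, x_max, y_max) share
    some area; boxes that only touch along an edge do not overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def count_tracking_errors(ground_truth, estimates):
    """Count the frames in which the estimated bounding box of an object
    does not overlap its ground-truth bounding box."""
    return sum(
        1 for gt, est in zip(ground_truth, estimates)
        if not boxes_overlap(gt, est)
    )


ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
estimates = [(5, 5, 15, 15), (10, 0, 20, 10), (30, 30, 40, 40)]
errors = count_tracking_errors(ground_truth, estimates)
```

In this toy sequence the first frame is tracked correctly, while the second (boxes only touching) and the third (boxes far apart) count as errors.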
Analyzing the results, it can be observed that both algorithms satisfactorily track
objects that undergo simple crosses, whereas for complex crosses the developed tracking
algorithm is much more effective than the one without an explicit model for object
interactions. Another observation is that there are more errors in interactions involving
players of the same team, especially if the interaction is long. In these situations, there
is not enough information to reliably perform the data association. A more sophisticated
object detector that provides more information would be needed to correctly
estimate the detection association. The tracking results for overtaking actions are very
similar for both algorithms since the interacting model does not offer a great advantage
in these situations. This is because the objects maintain their trajectories during the
interaction. There is only a slight improvement when the duration of the interaction
is very long. As in complex crosses, the tracking errors are due to long interactions
of players of the same team, which make the data association distribution strongly
multimodal.
To sum up, the developed tracking algorithm performs very well in complex
object interactions, improving on other approaches that lack an explicit model for
occlusions and/or a dynamic model for non-independent objects. This fact is more evident
in situations in which the objects change their trajectories while they are occluded.
Figure 4.12: Tracking results for a situation involving a simple cross with objects of
different categories. Rows: original frames showing all the players (the blue square marks
the selected object interaction); cropped frames containing the selected object interaction
along with the detections; tracking results.
Figure 4.13: Marginalization of the posterior pdf related to one of the players of
Fig. 4.12 tagged with the id 9. The samples represent the means of the resulting mixture of
Gaussians.
Figure 4.14: Marginalization of the posterior pdf related to one of the players of
Fig. 4.12 tagged with the id 15. The samples represent the means of the resulting mixture of
Gaussians.
Figure 4.15: Tracking results for a situation involving a complex cross with objects of
both categories. Rows: original frames showing all the players (the blue square marks
the selected object interaction); cropped frames containing the selected object interaction
along with the detections; tracking results.
Figure 4.16: Marginalization of the posterior pdf related to one of the players of
Fig. 4.15 tagged with the id 5. The samples represent the means of the resulting mixture of
Gaussians.
Figure 4.17: Marginalization of the posterior pdf related to one of the players of
Fig. 4.15 tagged with the id 11. The samples represent the means of the resulting mixture of
Gaussians.
Figure 4.18: Marginalization of the posterior pdf related to one of the players of
Fig. 4.15 tagged with the id 13. The samples represent the means of the resulting mixture of
Gaussians.
Figure 4.19: Tracking results for a situation involving an overtaking action with objects
of both categories. Rows: original frames showing all the players (the blue square marks
the selected object interaction); cropped frames containing the selected object interaction
along with the detections; tracking results.
Figure 4.20: Marginalization of the posterior pdf related to one of the players of
Fig. 4.19 tagged with the id 8. The samples represent the means of the resulting mixture of
Gaussians.
Figure 4.21: Marginalization of the posterior pdf related to one of the players of
Fig. 4.19 tagged with the id 9. The samples represent the means of the resulting mixture of
Gaussians.
Figure 4.22: Marginalization of the posterior pdf related to one of the players of
Fig. 4.19 tagged with the id 16. The samples represent the means of the resulting mixture of
Gaussians.
4.6 Conclusions
In this chapter a novel recursive Bayesian tracking model for multiple interacting ob-
jects has been presented, in which object dynamics and object trajectories are mutually
dependent. This problem has been successfully tackled by means of the explicit man-
agement of the object interactions. The key part has been the design of an advanced
object dynamic model that is sensitive to the object interactions, such as long-term
occlusions of two or more objects. For this purpose, the proposed Bayesian tracking
model uses a random variable to predict the occlusion events, which in turn triggers
different choices of object motion. The tracking algorithm is also able to handle false
and missing detections through a probabilistic data association stage, which efficiently
computes the correspondence between the unlabeled detections and the tracked objects.
It is worth noting that the tracking framework is independent of the specific object
detector used. The only restriction is that the object detections have to include the
positional information about the whole object, which is common to many existing
object detectors. The tracking algorithm can also use quality information about the
object detections if it is provided. Also important is that the Bayesian tracking model
is designed to work with realistic object detectors that can produce several false alarms
and even lose detections (missing detections).

Table 4.1: Quantitative results for both the developed tracking algorithm for interacting
objects, and the one for non-interacting objects.
Column groups: the first five columns are the object interaction description; the last two
columns are the tracking results.

Interaction   Interaction     Number of players   Total number   Duration of interaction   Number of errors (algorithm   Number of errors (algorithm
name          type            in interaction      of players     in frames                 for interacting objects)      for non-interacting objects)
interact-1    Simple cross    2                   17             46                        0                             0
interact-2    Simple cross    3                   17             72                        0                             0
interact-3    Simple cross    2                   18             48                        0                             0
interact-4    Simple cross    2                   18             50                        0                             0
interact-5    Simple cross    2                   17             123                       0                             0
interact-6    Simple cross    2                   16             99                        0                             0
interact-7    Simple cross    2                   5              37                        0                             0
interact-8    Simple cross    2                   5              56                        0                             0
interact-9    Simple cross    2                   18             73                        0                             0
interact-10   Complex cross   2                   14             36                        0                             0
interact-11   Complex cross   3                   14             56                        0                             0
interact-12   Complex cross   2                   13             55                        3                             36
interact-13   Complex cross   3                   17             78                        0                             45
interact-14   Complex cross   2                   15             69                        0                             0
interact-15   Complex cross   2                   18             61                        0                             0
interact-16   Complex cross   2                   17             113                       0                             87
interact-17   Complex cross   2                   16             109                       0                             74

Table 4.2: Quantitative tracking results: continuation of Tab. 4.1.

Interaction   Interaction     Number of players   Total number   Duration of interaction   Number of errors (algorithm   Number of errors (algorithm
name          type            in interaction      of players     in frames                 for interacting objects)      for non-interacting objects)
interact-18   Complex cross   2                   17             50                        0                             0
interact-19   Complex cross   2                   8              92                        0                             47
interact-20   Complex cross   2                   10             126                       0                             84
interact-21   Complex cross   3                   16             45                        6                             32
interact-22   Complex cross   2                   18             38                        0                             0
interact-23   Overtaking      2                   17             95                        0                             0
interact-24   Overtaking      2                   17             60                        0                             0
interact-25   Overtaking      3                   14             94                        0                             0
interact-26   Overtaking      2                   14             3 5 1 3 1 4 (*)
interact-27   Overtaking      3                   19             89                        0                             0
interact-28   Overtaking      2                   19             2 9 1 2 1 5 (*)
interact-29   Overtaking      2                   17             108                       0                             0
interact-30   Overtaking      2                   15             90                        0                             0
interact-31   Overtaking      2                   15             89                        0                             0
interact-32   Overtaking      2                   10             27                        1                             2
interact-33   Overtaking      2                   8              63                        0                             0
interact-35   Overtaking      2                   14             100                       0                             0
interact-36   Overtaking      2                   16             4 5 1 4 1 6 (*)
(*) The digits of the last three columns run together in the source, and their grouping
cannot be recovered unambiguously; they are reproduced as a single sequence.
Regarding the inference of the tracking information in the proposed Bayesian model
for interacting objects, two major issues have been carefully addressed. The first one
is the mathematical derivation of the posterior distribution of the object positions,
which has been a great challenge due to the high complexity of the tracking model. An
intensive use of the concepts of probabilistic conditional independence and d-separation
has been necessary to simplify the expression of the posterior distribution. The second
issue arises from the fact that the obtained expression for the posterior distribution
does not have an analytical form, due to the non-linearities and non-Gaussianities of the
dynamic, observation and occlusion processes involved in the Bayesian tracking model. As a
consequence, approximate inference methods are necessary to obtain
a suboptimal solution. Nonetheless, the accuracy of the approximate solution can be
very poor because of the high dimensionality of the tracking problem, which is proportional
to the number of tracked objects and object detections. To deal with this drawback, a
novel suboptimal inference method has been developed which uses a variance reduction
technique, called Rao-Blackwellization, to improve the accuracy of the approximation
obtained by particle filtering. As a result, an accurate approximation of the object
trajectories in high dimensional state spaces, involving multiple tracked objects, is
achieved.
Lastly, the presented Bayesian tracking model for interacting targets has been tested
with a well-known public video database, which has made it possible to demonstrate the
efficiency and reliability of the tracking in situations involving several objects that
interact with each other.
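The variance reduction obtained by Rao-Blackwellization can be illustrated with the law of total variance. In the generic sketch below, the symbols are illustrative assumptions rather than the notation of this thesis: $x^{s}$ is the part of the state that is sampled by the particle filter, $x^{a}$ is the part that is handled analytically given $x^{s}$, and $f$ is any function whose expectation is being estimated.

$$
\operatorname{Var}\left[f(x^{a}, x^{s})\right]
= \operatorname{Var}\left[\mathbb{E}\left[f \mid x^{s}\right]\right]
+ \mathbb{E}\left[\operatorname{Var}\left[f \mid x^{s}\right]\right]
\;\geq\; \operatorname{Var}\left[\mathbb{E}\left[f \mid x^{s}\right]\right]
$$

Since the second term is non-negative, an estimator that averages the conditional expectation $\mathbb{E}[f \mid x^{s}]$ over independent samples of $x^{s}$ has a variance no larger than one that averages $f$ over the same number of independent joint samples of $(x^{a}, x^{s})$.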
Chapter 5
Conclusions and future work
This dissertation closes with the conclusions about the developed tracking algorithms
for challenging situations and some suggestions for future lines of research.
5.1 Conclusions
Visual object tracking in unconstrained scenarios is a challenging task due to
interactions between multiple objects, heavily-cluttered scenarios, moving cameras,
objects with varying appearance, and complex object dynamics. In this dissertation, the
research has been focused on developing efficient and reliable algorithms that are able
to handle complex tracking situations affected by the aforementioned problems. In
particular, two different challenging situations have been addressed: tracking of a single
object in heavily-cluttered scenarios with a moving camera and tracking of multiple
interacting objects with a static camera.
In the first situation, the main drawbacks are the heavily-cluttered scenarios and
the moving camera. At the beginning of the research, deterministic and statistical
methods (32; 33; 34) were applied to compensate for the camera motion, and thus to
satisfactorily perform the tracking. However, a reliable estimation of the camera motion
is not possible in some situations because of the aperture problem, the presence of
independent moving objects, and changes in the scene. To overcome these problems,
more emphasis was placed on the camera motion estimation (85) to make it more robust
to these sources of disturbance. However, it was not until the development of the
first Bayesian models (92) that a significant improvement in performance was achieved.
Fostered by the success of the Bayesian modeling of the camera motion, new Bayesian
strategies were proposed to handle both the camera motion and the tracking, some of
them adapted for infrared imagery (93; 94), others for visible imagery (95). These
algorithms manage several hypotheses of camera and object motion to accurately
approximate the posterior distribution of the object state, which contains the desired
tracking information. The proposed Bayesian model for the camera motion is able to
handle situations with strong ego-motion due to large camera displacements, situations
with a high uncertainty in the camera motion because of low textured video sequences,
and situations with multiple independent moving objects. In contrast, approaches
based on a deterministic estimation of the camera ego-motion are prone to fail, since
the camera motion cannot be reliably computed in these situations. An important
issue related to the proposed Bayesian model is the inference strategy used to obtain
the tracking information. The inference cannot be performed analytically because the
dynamic and observation processes are non-linear and non-Gaussian. This fact gives
rise to complex integrals in the expression of the posterior distribution over the object
position that cannot be analytically solved. In order to deal with this problem, a
suboptimal inference method has been derived, based on the particle filtering technique,
which is able to compute an accurate approximation of the object trajectory.
The second challenging tracking situation has focused on the interactions of multiple
objects with a static camera. The initial works (96; 97) proposed particle filter
based approaches to efficiently handle the curse of dimensionality that arises in
multi-object tracking. More sophisticated Bayesian models (78; 98) were developed to
manage missing detections and false detections, which are inherent to
real tracking applications. These detection problems stem from variations in the object
appearance, changes in illumination, partial and total occlusions, and background
clutter. However, none of these models explicitly models the object interactions, which
are prone to cause tracking errors. Especially problematic are long-term interactions that
simultaneously involve occlusions and changes in object trajectories. To successfully tackle
this problem, a novel Bayesian algorithm was developed to explicitly manage complex
object interactions. This algorithm proposes an advanced dynamic model that is able
to simulate object interactions involving long-term occlusions of two or more objects.
A fundamental part of the model is the use of a random variable to infer the
potential occlusion events, which are indicators of the occurrence of object interactions.
As a result, the developed tracking algorithm for interacting objects achieves high
performance in complex object interactions, outperforming other approaches that lack
an explicit model for occlusions and/or a dynamic model for non-independent objects.
This is more evident in situations in which the objects change their trajectories
while they undergo an occlusion. Considerable effort has been devoted to deriving the
mathematical expression of the posterior distribution over the object tracking information,
given the high complexity of the model. An additional issue has been that the resulting
expression of the posterior distribution lacks a closed form due to the presence of
non-solvable integrals. These integrals arise from the fact that the stochastic processes
involved in the Bayesian tracking model are non-linear and non-Gaussian. Therefore,
the use of approximate inference methods is necessary. However, the high dimensionality
of the state space associated with the tracking problem, which is proportional to
the number of tracked objects and object detections, degrades the accuracy
of the approximation of the posterior distribution. To overcome these problems, a
novel and efficient suboptimal inference strategy has been developed to accurately
approximate the posterior distribution. The inference strategy uses the
particle filtering technique in combination with a variance reduction technique, called
Rao-Blackwellization, to obtain an accurate approximation of the object trajectories in
high-dimensional state spaces involving multiple tracked objects.
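The variance reduction behind Rao-Blackwellization can be seen in a small numerical sketch. It is only an illustration under assumed conditions, not the inference strategy of the thesis: the two-variable Gaussian model and the function name are hypothetical. One variable is sampled, and the expectation over the other is computed in closed form instead of being sampled as well.

```python
import random
import statistics


def compare_estimators(n_samples=5000, seed=0):
    """Estimate E[y^2] with plain Monte Carlo and with Rao-Blackwellization.

    Assumed toy model: x ~ N(0, 1) and y | x ~ N(x, 1), so E[y^2] = 2.
    Returns (crude_mean, crude_var, rb_mean, rb_var), where the variances
    are the sample variances of the individual Monte Carlo terms.
    """
    rng = random.Random(seed)
    crude_terms, rb_terms = [], []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)        # part of the state that is sampled
        y = rng.gauss(x, 1.0)          # part that admits an analytic treatment
        crude_terms.append(y * y)      # plain Monte Carlo term
        rb_terms.append(x * x + 1.0)   # E[y^2 | x] = x^2 + 1 in closed form
    return (statistics.mean(crude_terms), statistics.variance(crude_terms),
            statistics.mean(rb_terms), statistics.variance(rb_terms))
```

Both estimators target the same value, but the terms of the Rao-Blackwellized one have a smaller variance (theoretically 2 instead of 8 in this toy model), so fewer samples are needed for a given accuracy. The same principle motivates marginalizing part of a high-dimensional state analytically inside a particle filter.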
5.2 Future work
Several lines of future work have been identified as a result of the analysis of the
tracking results and the overall performance of the algorithms. Regarding the approach
for tracking a single object in heavily-cluttered scenarios with a moving camera, the
following lines of research have been considered:
• Automatic selection of the estimation method for the camera motion. Two different
tracking scenarios have been considered: the first one for infrared imagery and
the second one for visible imagery. Since each kind of imagery has its own
characteristics, different camera motion techniques have been proposed to compute
the samples that approximate the pdf of the camera motion. A more practical
approach would be to include a stage that analyzes the acquired images to select
the most appropriate motion estimation technique.
• Scale adaptation. Both the detection and tracking stages assume that the tracked
object has a fixed size. However, the object size can vary significantly due to
the camera and object motion, negatively affecting the tracking performance.
Therefore, in order to improve the tracking performance, a detector that supplies
object scale measurements should be used. Also, the Bayesian tracking model
should be adapted to handle this new source of information.
• Real time operation. In order to use the tracking algorithm in applications with
real-time operation restrictions, a detailed study of the parameters of the algorithm,
and of their impact on the efficiency and computational cost, is necessary.
With respect to the approach for tracking multiple interacting objects with a static
camera, future research would focus on:
• Improvement of the interaction model. The main tracking errors arise when
two or more objects of the same kind interact with each other. This is
because of the limited amount of information that the detector supplies about the
object. A more sophisticated object detector would be needed, one that provides more
information such as the pose or shape of the object. This information would be
integrated in the Bayesian tracking model to make the tracked objects more
distinctive, even if they are of the same type. As a result, the data association stage
would be much more efficient, and consequently the final object tracking would
be better.
• Improvement of the occlusion model. The object occlusions are modeled as temporally
independent processes; however, there is a significant correlation between
object occlusions in consecutive time steps. Therefore, the tracking model could
be improved by taking this information into account, which would avoid situations
in which the occluded and occluding objects are interchanged.
• Real time operation. As in the previous tracking situation, an analysis of
the tracking performance in terms of efficiency and computational cost would be
necessary to adapt the tracking algorithm to operate under real-time restrictions.
Finally, another line of future work would be the development of a tracking
algorithm that integrates both previous situations: tracking of multiple interacting
objects with a moving camera. This would extend the applicability of the
tracking algorithm to a wider range of vision-based systems.
Chapter 6
Appendix
6.1 Conditional independence and d-separation
Conditional independence is an important concept for probabilistic models involving
probability distributions over multiple variables, since it can simplify both the structure
of a model and the computational cost needed to infer the required information from
the model. Consider three random variables a, b, and c. Then a is said to be
conditionally independent of b given c if
p(a|b, c) = p(a|c). (6.1)
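Condition (6.1) can be checked numerically on a small discrete example. The sketch below is only illustrative: the function names and the probability values are hypothetical, and the joint distribution is stored as a plain dictionary indexed by (a, b, c).

```python
from itertools import product


def is_conditionally_independent(joint, tol=1e-12):
    """Check p(a | b, c) == p(a | c) for a joint table joint[(a, b, c)]."""
    values_a = sorted({a for a, _, _ in joint})
    values_b = sorted({b for _, b, _ in joint})
    values_c = sorted({c for _, _, c in joint})
    for a, b, c in product(values_a, values_b, values_c):
        # Marginals p(b, c) and p(c) obtained by summing the joint table.
        p_bc = sum(joint.get((a2, b, c), 0.0) for a2 in values_a)
        p_c = sum(joint.get((a2, b2, c), 0.0)
                  for a2 in values_a for b2 in values_b)
        if p_bc == 0.0 or p_c == 0.0:
            continue  # the conditional is undefined for this configuration
        p_a_given_bc = joint.get((a, b, c), 0.0) / p_bc
        p_a_given_c = sum(joint.get((a, b2, c), 0.0) for b2 in values_b) / p_c
        if abs(p_a_given_bc - p_a_given_c) > tol:
            return False
    return True


def common_cause_joint():
    """Build p(a, b, c) = p(c) p(a | c) p(b | c) with made-up binary tables."""
    p_c = {0: 0.6, 1: 0.4}
    p_a_given_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
    p_b_given_c = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}
    return {(a, b, c): p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]
            for a, b, c in product((0, 1), repeat=3)}
```

For the joint distribution returned by `common_cause_joint()`, in which a and b depend only on c, the check succeeds. It fails for a table in which a is a copy of b.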
The conditional independence statement also holds for sets of non-intersecting variables.
Given a probabilistic model, the conditional independence properties can be directly
inferred by inspection of its graphical model, a process known as d-separation (79).
Consider a directed graphical model in which A, B and C are three arbitrary
non-intersecting sets of variables. In order to verify the statement ‘A is conditionally
independent of B given C’, consider all the possible paths between any node in A and any
node in B. Any such path is said to be blocked if it includes a variable such that
• its relative arrows on the path are either incoming-outgoing (see Fig. 6.1(a)) or
outgoing-outgoing (see Fig. 6.1(b)), and the variable is in the set C, or
• its relative arrows on the path are incoming-incoming (see Fig. 6.1(c)), and neither
the variable nor any of its descendants is in the set C. A variable x is considered
a descendant of y if there is a path from y to x that follows the direction of the
arrows (see Fig. 6.1(d)).
If all paths between A and B are blocked, then A is conditionally independent of B
given C, and it is said that A is d-separated from B by C.
[Figure 6.1, panels: (a) incoming-outgoing configuration; (b) outgoing-outgoing
configuration; (c) incoming-incoming configuration; (d) z and x are descendants of y.]
Figure 6.1: Definitions of incoming and outgoing arrows, and of descendants, relative to
a node/variable.
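The blocking rules of d-separation can be turned into a short check for small graphs. This is a sketch under stated assumptions rather than a reference implementation: the graph is assumed to be a directed acyclic graph stored as a dictionary that maps each node to its children, all function names are illustrative, and every path is enumerated exhaustively, which is practical only for a handful of nodes.

```python
def descendants(graph, node):
    """Return all nodes reachable from `node` by following the arrows."""
    seen, stack = set(), list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


def undirected_paths(graph, start, goal):
    """Yield every simple path from start to goal, ignoring arrow directions."""
    neighbours = {}
    for parent, children in graph.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)

    def extend(path):
        if path[-1] == goal:
            yield list(path)
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from extend(path + [nxt])

    yield from extend([start])


def path_is_blocked(graph, path, given):
    """Apply the two blocking rules to every interior node of the path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        incoming_from_prev = node in graph.get(prev, ())
        incoming_from_next = node in graph.get(nxt, ())
        if incoming_from_prev and incoming_from_next:
            # Incoming-incoming: blocks unless the node or one of its
            # descendants belongs to the conditioning set.
            if node not in given and not (descendants(graph, node) & given):
                return True
        elif node in given:
            # Incoming-outgoing or outgoing-outgoing node in the conditioning set.
            return True
    return False


def d_separated(graph, set_a, set_b, given):
    """True if every path between set_a and set_b is blocked by `given`."""
    given = set(given)
    return all(path_is_blocked(graph, path, given)
               for a in set_a for b in set_b
               for path in undirected_paths(graph, a, b))
```

For instance, with the incoming-incoming graph `{"a": ["c"], "b": ["c"], "c": ["d"]}`, a and b are d-separated by the empty set. Conditioning on c, or on its descendant d, unblocks the path.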
In the context of object tracking, the object state x_t can in fact be a set of random
variables that store information not only about a single object but also about
the camera motion or about several tracked objects. In these cases, the
probabilistic relationships between the variables can lead to extremely complex joint
probability distributions, which can be efficiently simplified by applying the underlying
conditional independence properties of the model.
References
[1] B. D. O. Anderson and J. B. Moore, Optimal filtering. Prentice-Hall, 1979. 6
[2] S. Arulampalam, S. Maskell, and N. Gordon, “A tutorial on particle filters for on-
line nonlinear/non-gaussian bayesian tracking,” IEEE Trans. on Signal Processing,
vol. 50, pp. 174–188, 2002. 6, 7, 20, 21, 75
[3] S. Julier and J. Uhlmann, “A new extension of the kalman filter to nonlinear
systems,” in SPIE Proc. Signal Processing, Sensor Fusion, and Target Recognition
VI, vol. 3068, 1997, pp. 182–193. 6
[4] ——, “Unscented filtering and nonlinear estimation,” IEEE Proc., vol. 92, no. 3,
pp. 401–422, 2004. 6
[5] K. Ito and K. Xiong, “Gaussian filters for nonlinear filtering problems,” IEEE
Trans. on Automatic Control, vol. 45, no. 5, pp. 910–927, 2000. 7
[6] G. Kitagawa, “Non-gaussian state space modeling of time series,” in IEEE Proc.
Conference on Decision and Control, vol. 26, 1987, pp. 1700–1705. 7
[7] A. Doucet and A. M. Johansen, “A tutorial on particle filtering and smoothing:
fifteen years later,” in Handbook of Nonlinear Filtering. University Press, 2009.
7
[8] P. Djuric, J. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. Bugallo, and J. Miguez,
“Particle filtering,” IEEE Signal Processing Magazine, vol. 20, no. 5, pp. 19–38,
2003. 7
[9] A. Doucet, S. Godsill, and C. Andrieu, “On sequential monte carlo sampling meth-
ods for bayesian filtering,” Journal Statistics and Computing, vol. 10, no. 3, pp.
197–208, 2000. 7, 20, 27, 51
[10] A. Doucet, N. De Freitas, and N. Gordon, Sequential Monte Carlo methods in
practice. Springer-Verlag, 2001. 7
[11] J. S. Liu and R. Chen, “Sequential monte carlo methods for dynamic systems,”
Journal of the American Statistical Association, vol. 93, pp. 1032–1044, 1998. 7
[12] M. Isard and A. Blake, “Condensation - conditional density propagation for visual
tracking,” Journal of Computer Vision, vol. 29, pp. 5–28, 1998. 7
[13] J. MacCormick and A. Blake, “A probabilistic exclusion principle for tracking
multiple objects,” in IEEE Proc. International Conference on Computer Vision,
vol. 1, 1999, p. 572. 7
[14] N. Gordon, D. Salmond, and A. Smith, “Novel approach to nonlinear/non-gaussian
bayesian state estimation,” in IEE Proc. Radar and Signal Processing, vol. 140,
no. 2, 1993, pp. 107–113. 7
[15] G. Pulford, “Relative performance of single scan algorithms for passive sonar track-
ing,” in Proc. Innovation in Acoustics and Vibration, vol. 2, 2002, pp. 212–221.
8
[16] Y. Bar-Shalom, “Tracking in a cluttered environment with probabilistic data as-
sociation,” Journal of Automatica, vol. 11, no. 5, pp. 451–460, 1975. 9
[17] Y. Bar-Shalom, F. Daum, and J. Huang, “The probabilistic data association filter,”
IEEE Journal of Control Systems Magazine, vol. 29, no. 6, pp. 82–100, 2009. 9
[18] U. B. Neto, M. Choudhary, and J. Goutsias, “Automatic target detection and
tracking in forward-looking infrared image sequences using morphological con-
nected operators,” SPIE Journal of Electronic Imaging, vol. 13, no. 4, pp. 802–813,
2004. 9, 23
[19] H. Xin and T. Shuo, “Target detection and tracking in forward-looking infrared
image sequences using multiscale morphological filters,” in IEEE Proc. Interna-
tional Symposium on Image and Signal Processing and Analysis, 2007, pp. 25–28.
9, 23
[20] C. Wei and S. Jiang, “Automatic target detection and tracking in flir image se-
quences using morphological connected operator,” in IEEE Proc. International
Conference on Intelligent Information Hiding and Multimedia Signal Processing,
2008, pp. 414–417. 9
[21] A. Bal and M. S. Alam, “Automatic target tracking in flir image sequences,” in
SPIE Proc. Automatic Target Recognition XIV, vol. 5426, no. 1, 2004, pp. 30–36.
10
[22] ——, “Automatic target tracking in flir image sequences using intensity variation
function and template modeling,” IEEE Trans. on Instrumentation and Measure-
ment, vol. 54, no. 5, pp. 1846–1852, 2005. 10
[23] W. Yang, J. Li, D. Shi, and S. Hu, “Mean shift based target tracking in flir imagery
via adaptive prediction of initial searching points,” in IEEE Proc. Second Interna-
tional Symposium on Intelligent Information Technology Application, vol. 1, 2008,
pp. 852–855. 10
[24] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE
Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564–577,
2003. 10
[25] N. A. Mould, C. T. Nguyen, and J. P. Havlicek, “Infrared target tracking with am-
fm consistency checks,” in IEEE Proc. Southwest Symposium on Image Analysis
and Interpretation, 2008, pp. 5–8. 10, 35
[26] V. Venkataraman, G. Fan, and X. Fan, “Target tracking with online feature se-
lection in flir imagery,” in IEEE Proc. Computer Vision and Pattern Recognition,
2007, pp. 1–8. 10, 35, 60
[27] B. Zitova and J. Flusser, “Image registration methods: a survey,” Journal of Image
and Vision Computing, vol. 21, no. 11, pp. 977–1000, 2003. 10, 22
[28] R. Kumar, H. Sawhney, S. Samarasekera, S. Hsu, H. Tao, Y. Guo, K. Hanna,
A. Pope, R. Wildes, D. Hirvonen, M. Hansen, and P. Burt, “Aerial video surveil-
lance and exploitation,” in IEEE Proc., vol. 89, no. 10, 2001, pp. 1518–1539. 10
[29] I. Cohen and G. Medioni, “Detecting and tracking moving objects in video from
an airborne observer,” in Workshop in Image Understanding, 1998, pp. 217–222.
10
[30] B. Jung and G. Sukhatme, “Detecting moving objects using a single camera on
a mobile robot in an outdoor environment,” in Proc. International Conference on
Intelligent Autonomous Systems, vol. 6566, 2004, pp. 980–987. 10
[31] B. D. Lucas and T. Kanade, “An iterative image registration technique with an
application to stereo vision,” in Proc. International Joint Conference on Artificial
Intelligence, 1981, pp. 674–679. 10
[32] C. R. del Blanco, F. Jaureguizar, L. Salgado, and N. García, “Aerial moving target
detection based on motion vector field analysis,” in Proc. Advanced Concepts for
Intelligent Vision systems, vol. 4678, 2007, pp. 990–1001. 10, 23, 125
[33] ——, “Target detection through robust motion segmentation and tracking restric-
tions in aerial flir images,” in IEEE Proc. International Conference on Image
Processing, vol. 5, 2007, pp. 445–448. 10, 23, 125
[34] ——, “Automatic aerial target detection and tracking system in airborne flir im-
ages based on efficient target trajectory filtering,” in SPIE Proc. Automatic Target
Recognition XVII, vol. 6566, 2007, pp. 656 604(1–12). 10, 23, 125
[35] R. Pless, T. Brodsky, and Y. Aloimonos, “Detecting independent motion: the
statistics of temporal continuity,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 22, no. 8, pp. 768–773, 2000. 11
[36] M. Irani and P. Anandan, “Video indexing based on mosaic representations,” IEEE
Proc., vol. 86, no. 5, pp. 905–921, 1998. 11
[37] Y. Alper, K. Shafique, N. Lobo, X. Li, T. Olson, and M. A. Shah, “Target-tracking
in flir imagery using mean-shift and global motion compensation,” in IEEE Proc.
Workshop on Computer Vision Beyond the Visible Spectrum, 2001, pp. 54–58. 11
[38] A. Yilmaz, K. Shafique, and M. A. Shah, “Target tracking in airborne forward
looking infrared imagery,” Journal of Image and Vision Computing, vol. 21, no. 7,
pp. 623–635, 2003. 11
[39] A. Strehl and J. Aggarwal, “Detecting moving objects in airborne forward looking
infra-red sequences,” in IEEE Proc. Workshop on Computer Vision Beyond the
Visible Spectrum, 1999, pp. 3–12. 11, 23
[40] A. Strehl and J. K. Aggarwal, “Modeep: a motion-based object detection and
pose estimation method for airborne flir sequences,” Journal of Machine Vision
and Applications, vol. 11, no. 6, pp. 267–276, 2000. 11, 23
[41] S. Lankton and A. Tannenbaum, “Improved tracking by decoupling camera and
target motion,” in SPIE Proc. Real-Time Image Processing, vol. 6811, 2008, pp.
681 112(1–8). 11
[42] H. Zhang and F. Yuan, “Vehicle tracking based on image alignment in aerial
videos,” in Proc. Advanced Concepts for Intelligent Vision systems, vol. 4679,
2007, pp. 295–302. 11
[43] J. Domke and Y. Aloimonos, “A probabilistic framework for correspondence and
egomotion,” Journal Dynamical Vision, pp. 232–242, 2007. 11, 23
[44] G. Pulford and B. La Scala, “Multihypothesis viterbi data association: Algorithm
development and assessment,” IEEE Trans. on Aerospace and Electronic Systems,
vol. 46, no. 2, pp. 583–609, 2010. 13
[45] T. Quach and M. Farooq, “Maximum likelihood track formation with the viterbi
algorithm,” in IEEE Proc. Conference on Decision and Control, vol. 1, 1994, pp.
271–276. 13
[46] J. Wolf, A. Viterbi, and G. Dixon, “Finding the best set of k paths through a
trellis with application to multitarget tracking,” IEEE Trans. on Aerospace and
Electronic Systems, vol. 25, no. 2, pp. 287–296, 1989. 13
[47] T. Kirubarajan, Y. Bar-Shalom, and K. Pattipati, “Multiassignment for tracking
a large number of overlapping objects,” IEEE Trans. on Aerospace and Electronic
Systems, vol. 37, no. 1, pp. 2–21, 2001. 13
[48] S. Deb, M. Yeddanapudi, K. Pattipati, and Y. Bar-Shalom, “A generalized s-d
assignment algorithm for multisensor-multitarget state estimation,” IEEE Trans.
on Aerospace and Electronic Systems, vol. 33, no. 2, pp. 523–538, 1997. 13
[49] D. Castanon, “Efficient algorithms for finding the k best paths through a trellis,”
IEEE Trans. on Aerospace and Electronic Systems, vol. 26, no. 2, pp. 405–410,
1990. 13
[50] G. Pulford and A. Logothetis, “An expectation-maximisation tracker for multiple
observations of a single target in clutter,” in IEEE Proc. Conference on Decision
and Control, vol. 5, 1997, pp. 4997–5003. 13
[51] W. Brogan, “Algorithm for ranked assignments with applications to multiobject
tracking,” Journal of Guidance, vol. 12, pp. 357–364, 1989. 13
[52] M. Miller, H. Stone, and I. Cox, “Optimizing murty’s ranked assignment method,”
IEEE Trans. on Aerospace and Electronic Systems, vol. 33, no. 3, pp. 851–862,
1997. 13
[53] K. Pattipati, S. Deb, Y. Bar-Shalom, and J. Washburn, R.B., “A new relaxation
algorithm and passive sensor data association,” IEEE Trans. on Automatic Con-
trol, vol. 37, no. 2, pp. 198–213, 1992. 13
[54] A. Poore and N. Rijavec, “A lagrangian relaxation algorithm for multidimensional
assignment problems arising from multitarget tracking,” SIAM Journal on Opti-
mization, vol. 3, no. 3, pp. 544–563, 1993. 13
[55] X. Li, Z.-Q. Luo, K. Wong, and E. Bosse, “An interior point linear programming
approach to two-scan data association,” IEEE Trans. on Aerospace and Electronic
Systems, vol. 35, no. 2, pp. 474–490, 1999. 13
[56] N. Karmarkar, “A new polynomial-time algorithm for linear programming,” Com-
binatorica, vol. 4, pp. 373–395, 1984. 13
[57] S. Blackman, Multiple-target tracking with radar applications. Artech House,
Dedham, MA, 1986. 14
[58] D. Reid, “An algorithm for tracking multiple targets,” IEEE Trans. on Automatic
Control, vol. 24, no. 6, pp. 843–854, 1979. 14
[59] S. Blackman, “Multiple hypothesis tracking for multiple target tracking,” IEEE
Trans. on Aerospace and Electronic Systems Magazine, vol. 19, no. 1, pp. 5–18,
2004. 14
[60] I. J. Cox, “A review of statistical data association for motion correspondence,”
International Journal of Computer Vision, vol. 10, no. 1, pp. 53–66, 1993. 14
[61] T. Fortmann, Y. Bar-Shalom, and M. Scheffe, “Sonar tracking of multiple targets
using joint probabilistic data association,” IEEE Journal of Oceanic Engineering,
vol. 8, no. 3, pp. 173–184, 1983. 14
[62] L. Y. Pao, “Multisensor multitarget mixture reduction algorithms for tracking,”
Journal of Guidance, Control and Dynamics, vol. 17, pp. 1205–1211, 1994. 14
[63] D. Salmond, “Mixture reduction algorithms for target tracking in clutter,” in SPIE
Signal and Data Processing of Small Targets 1990, vol. 1305, no. 1, 1990, pp. 434–
445. 14
[64] H. Gauvrit and J. Le Cadre, “A formulation of multitarget tracking as an in-
complete data problem,” IEEE Trans. on Aerospace and Electronic Systems, pp.
1242–1257, 1997. 14
[65] R. Streit and T. Luginbuhl, “Maximum likelihood method for probabilistic multi-
hypothesis tracking,” in SPIE Proc. Signal and Data Processing of Small Targets,
vol. 2235, 1994, pp. 394–405. 14
[66] C. Hue, J. Le Cadre, and P. Perez, “Tracking multiple objects with particle fil-
tering,” IEEE Trans. on Aerospace and Electronic Systems, vol. 38, no. 3, pp.
791–812, 2002. 14
[67] ——, “A particle filter to track multiple objects,” in IEEE Proc. Workshop on
Multi-Object Tracking, 2001, pp. 61–68. 14
[68] Z. Khan, T. Balch, and F. Dellaert, “Mcmc-based particle filtering for tracking a
variable number of interacting targets,” IEEE Trans. Pattern Analysis and Machine
Intelligence, vol. 27, pp. 1805–1819, 2005. 15
[69] F. Dellaert, S. M. Seitz, C. E. Thorpe, and S. Thrun, “Em, mcmc, and chain flip-
ping for structure from motion with unknown correspondence,” Machine Learning,
vol. 50, pp. 45–71, 2003. 15
[70] W. Gilks and D. Spiegelhalter, Markov Chain Monte Carlo in Practice: Interdis-
ciplinary Statistics, 1st ed. Chapman and Hall/CRC, 1995. 15
[71] N. Gordon and A. Doucet, “Sequential monte carlo for maneuvering target tracking
in clutter,” in SPIE Proc. Signal and Data Processing of Small Targets, vol. 3809,
1999, pp. 493–500. 15
[72] A. Doucet, B. Vo, C. Andrieu, and M. Davy, “Particle filtering for multi-target
tracking and sensor management,” in Proc. International Conference on Informa-
tion Fusion, vol. 1, 2002, pp. 474–481. 15
[73] S. Särkkä, A. Vehtari, and J. Lampinen, “Rao-blackwellized particle filter for
multiple target tracking,” Journal of Information Fusion, vol. 8, no. 1, pp. 2–15, 2007.
15, 109
[74] E. Maggio, M. Taj, and A. Cavallaro, “Efficient multitarget visual tracking using
random finite sets,” IEEE Trans. on Circuits and Systems for Video Technology,
vol. 18, no. 8, pp. 1016–1027, 2008. 15
[75] Y. Ma, Q. Yu, and I. Cohen, “Target tracking with incomplete detection,” Com-
puter Vision and Image Understanding, vol. 113, no. 4, pp. 580–587, 2009. 16
[76] Z. Khan, T. Balch, and F. Dellaert, “Multitarget tracking with split and merged
measurements,” in IEEE Proc. Conference on Computer Vision and Pattern
Recognition, vol. 1, 2005, pp. 605–610. 16
[77] M. Piccardi, “Background subtraction techniques: a review,” in IEEE Proc. Inter-
national Conference on Systems, Man and Cybernetics, vol. 4, 2004, pp. 3099–3104.
16, 103
[78] C. R. del Blanco, F. Jaureguizar, and N. Garcia, “Visual tracking of multiple
interacting objects through rao-blackwellized data association particle filtering,”
in IEEE Proc. International Conference on Image Processing, 2010, pp. 821–824.
16, 126
[79] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science
and Statistics). Springer, 2006. 20, 31, 34, 131
[80] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed.
Cambridge University Press, 2004. 22
[81] J. Domke and Y. Aloimonos, “A probabilistic notion of correspondence and the
epipolar constraint,” in IEEE Proc. International Symposium on 3D Data Pro-
cessing Visualization and Transmission, 2006, pp. 41–48. 23
[82] S. Periaswamy and H. Farid, “Medical image registration with partial data,” Jour-
nal Medical image analysis, vol. 10, no. 3, pp. 452–464, 2006. 27, 29, 36
[83] W. Hastings, “Monte carlo sampling methods using markov chains and their
applications,” Journal of Biometrika, vol. 57, no. 1, pp. 97–109, 1970. 31
[84] S. Birchfield and S. Rangarajan, “Spatial histograms for region-based tracking,”
Journal of Electronics and Telecommunications Research Institute, vol. 29, no. 5,
pp. 697–699, 2007. 48
[85] C. R. del Blanco, F. Jaureguizar, L. Salgado, and N. García, “Motion estimation
through efficient matching of a reduced number of reliable singular points,” in
SPIE Proc. Real-Time Image Processing, vol. 6811, 2008, pp. 68 110N(1–12). 51,
125
[86] C. V. Stewart, “Robust parameter estimation in computer vision,” Reviews Society
for Industrial and Applied Mathematics, vol. 41, no. 3, pp. 513–537, 1999. 54
[87] P. Meer, C. V. Stewart, and D. E. Tyler, “Robust computer vision: an
interdisciplinary challenge,” Journal of Computer Vision and Image Understanding,
vol. 78, no. 1, pp. 1–7, 2000. 54
[88] K. Nummiaro, E. Koller-Meier, and L. Van Gool, “An adaptive color-based
particle filter,” Journal of Image and Vision Computing, vol. 21, no. 1, pp. 99–110,
2003. 60
[89] A. Doucet, N. d. Freitas, K. P. Murphy, and S. J. Russell, “Rao-blackwellised par-
ticle filtering for dynamic bayesian networks,” in Proc. Conference on Uncertainty
in Artificial Intelligence, 2000, pp. 176–183. 85, 86
[90] R. E. Bellman, Dynamic Programming. Courier Dover Publications, 2003. 85
[91] D. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, 2003. 97
[92] C. R. del Blanco, F. Jaureguizar, L. Salgado, and N. García, “Automatic
feature-based stabilization of video with intentional motion through a particle filter,” in
Proc. Advanced Concepts for Intelligent Vision Systems, vol. 5259, 2008, pp. 356–367. 125
[93] C. R. del Blanco, F. Jaureguizar, N. García, and L. Salgado, “Robust automatic
target tracking based on a bayesian ego-motion compensation framework for air-
borne flir imagery,” in Automatic Target Recognition XIX, vol. 7335, no. 1. SPIE,
2009, p. 733514. 126
[94] C. R. del Blanco, F. Jaureguizar, and N. Garcia, “Robust tracking in aerial imagery
based on an ego-motion bayesian model,” EURASIP Journal on Advances in Signal
Processing, vol. 2010, pp. 1–18, 2010. 126
[95] C. R. del Blanco, N. García, L. Salgado, and F. Jaureguizar, “Object tracking from
unstabilized platforms by particle filtering with embedded camera ego motion,” in
IEEE Proc. Advanced Video and Signal Based Surveillance, vol. 0, 2009, pp. 400–
405. 126
[96] C. R. del Blanco, R. Mohedano, N. García, L. Salgado, and F. Jaureguizar,
“Color-based 3d particle filtering for robust tracking in heterogeneous environments,” in
ACM/IEEE International Conference on Distributed Smart Cameras, 2008, pp.
1–10. 126
[97] R. Mohedano, C. R. del Blanco, F. Jaureguizar, L. Salgado, and N. Garcia, “Ro-
bust 3d people tracking and positioning system in a semi-overlapped multi-camera
environment,” in IEEE Proc. International Conference on Image Processing, 2008,
pp. 2656–2659. 126
[98] C. Cuevas, C. R. del Blanco, N. Garcia, and F. Jaureguizar, “Segmentation-
tracking feedback approach for high-performance video surveillance applications,”
in IEEE Proc. Southwest Symposium on Image Analysis Interpretation, 2010, pp.
41–44. 126

Sin duda me gustar´ n ıa escribir unas l´ ıneas de todos mis compa˜eros del GTI con los que tanta n cafe´ he compartido y que han hecho tan agradable mi vida laboral. eso supondr´ escribir. e haciendo una menci´n especial a Fernando. Seguir´ con mis padres y o a e hermanos que tantas veces me han preguntado si me quedaba mucho para terminar la tesis. m´s que un tomo de tesis. Narciso y Luis. Por ello me veo obligado a hacer algo m´s alternativo a la a par que dudosamente util: una tabla con las edades del GTI. en la que se ´ puede ver a todos mis compa˜eros (o al menos esa ha sido mi intenci´n) a n o lo largo del viaje temporal de mi tesis. Sin e n olvidar el esfuerzo y migra˜as que a Fernando. Continuar´ con todos los miembros y visitantes del GTI. gracias a la cual debe odiar al se˜or Bayes. los cuales han o sufrido los estragos de mis art´ ıculos escritos en ese ingl´s tan espa˜ol. . mi tutor. le ha costado leer n esta tesis. This work has been partially supported by the Ministerio de Ciencia e Innovaci´n of the Spanish government by means of a “Formaci´n del Pero o sonal Investigador” fellowship and the projects TIN2004-07860 (Medusa) and TEC2007-67764 (SmartVision). Sin ına embargo. toda la encicloıa a pedia Salvat. Empezar´ por mi mujer Vanessa e que tanto apoyo me ha dado y que en m´s de una ocasi´n ha tenido que a o aguantar la versi´n m´s irascible de mi mismo.Acknowledgements Me gustar´ agradecer a un gran n´mero de personas el que hayan comparıa u tido mi andadura por el camino de la tesis.

Espa˜a n Espa˜a. Julian. Xiang.. Macedonia. Cuba. Sergio Su. Victor. Iviza Samira. Jes´s u Rafa. Daniel A. Brasil Italia. Shagniq Carlos C. India n Paleozoico Mesozoico Cenozoico . Xioadan. Francisco. Espa˜a n Macedonia. Massimo. Yu. Espa˜a. B´lgica n e Espa˜a. Gogo. Ravi. n Manuel... Nacho Marcos N. Maykel. Ivana To˜i. Gian Luca Hui. Srimanta Procedencia Espa˜a n Espa˜a. Italia n Espa˜a n China. Macedonia n Espa˜a n Macedonia. Carlos R. Binu Nerea.. Claire Lihui. Macedonia Iran. Kristina. Yi. India Espa˜a. Jon.. Richard. Esther Sasho. C´sar e Daniel B. Peru n Espa˜a. Yang Shankar. Marcos A. Angel u Irena. Sharko ´ Ra´l. Wenjia. Carlos G. Juan Carlos.Era Proterozoico Arquezoico Periodo C´mbrico a Ordov´ ıcico Sil´rico u Dev´nico o Carbon´ ıfero P´rmico e Tri´sico a Jur´sico a Cret´zico a Paleoceno Eoceno Oligoceno Mioceno Plioceno Pleistoceno Holoceno Espec´ ımenes Narciso. Antonio Filippo. Macedonia Espa˜a. Italia n China India. Francia n China. Luis. Fernando. EEUU n Espa˜a. Usoa. Pieter Pablo. Abel. Pratik.

Abstract

The increasing availability of powerful computers and high quality video cameras has allowed the proliferation of video based systems, which perform tasks such as vehicle navigation, traffic monitoring, surveillance, etc. A fundamental component in these systems is the visual tracking of objects of interest, whose main goal is to estimate the object trajectories in a video sequence. For this purpose, two different kinds of information are used: detections obtained by the analysis of video streams and prior knowledge about the object dynamics. However, this information is usually corrupted by the sensor noise, the varying object appearance, illumination changes, cluttered backgrounds, object interactions, and the camera ego-motion.

While there exist reliable algorithms for tracking a single object in constrained scenarios, object tracking is still a challenge in uncontrolled situations involving multiple interacting objects, heavily-cluttered scenarios, moving cameras, and complex object dynamics. In this dissertation, the aim has been to develop efficient tracking solutions for two complex tracking situations. The first one consists in tracking a single object in heavily-cluttered scenarios with a moving camera. To address this situation, an advanced Bayesian framework has been designed that jointly models the object and camera dynamics. As a result, it can satisfactorily predict the evolution of a tracked object in situations with high uncertainty about the object location. In addition, the algorithm is robust to the background clutter, avoiding tracking failures due to the presence of similar objects.

The other tracking situation focuses on the interactions of multiple objects with a static camera. To tackle this problem, a novel Bayesian model has been developed, which manages complex object interactions by means of an advanced object dynamic model that is sensitive to object interactions. This is achieved by inferring the occlusion events, which in turn trigger different choices of object motion. The tracking algorithm can also handle false and missing detections through a probabilistic data association stage. Excellent results have been obtained using publicly available databases, proving the efficiency of the developed Bayesian tracking models.


Resumen (Summary)

The growing availability of powerful computers and high-quality cameras has enabled the proliferation of video-based systems for vehicle navigation, traffic monitoring, video surveillance, etc. An essential part of these systems is object tracking, whose main goal is to estimate the trajectories in video sequences. To this end, two types of information are used: the detections obtained from the analysis of the video and the prior knowledge of the object dynamics. However, this information is usually distorted by the sensor noise, the variation in the appearance of the objects, illumination changes, highly structured scenes, and the camera motion.

While there are reliable algorithms for tracking a single object in controlled scenarios, tracking is still a challenge in unrestricted situations characterized by multiple interacting objects, highly structured scenarios, and moving cameras. In this thesis, the goal has been the development of efficient tracking algorithms for two especially complicated situations. The first one consists in tracking a single object in highly structured scenes with a moving camera. To deal with this situation, a sophisticated Bayesian framework has been designed that jointly models the dynamics of the camera and the object. This makes it possible to satisfactorily predict the evolution of the object position in situations of great uncertainty. In addition, the algorithm is robust to structured backgrounds, avoiding errors due to the presence of similar objects.

The other situation considered focuses on object interactions with a static camera. To this end, a novel Bayesian model has been developed that manages the interactions by means of an advanced dynamic model. This model is based on the inference of occlusions between objects, which in turn give rise to different types of object motion. The algorithm is also able to handle missing detections and false detections through a probabilistic data association stage. Excellent results have been obtained on several databases, which proves the efficiency of the developed Bayesian tracking models.


Contents

List of Figures
List of Tables
1 Introduction
2 Bayesian models for object tracking
    2.1 Tracking with moving cameras
    2.2 Tracking of multiple interacting objects
3 Bayesian Tracking with Moving Cameras
    3.1 Optimal Bayesian estimation for object tracking
        3.1.1 Particle Filter approximation
    3.2 Bayesian tracking framework for moving cameras
        3.2.1 Particle filter approximation
    3.3 Object tracking in aerial infrared imagery
        3.3.1 Particle filter approximation
        3.3.2 Results
            3.3.2.1 Strong ego-motion situation
            3.3.2.2 High uncertainty ego-motion situation
            3.3.2.3 Global tracking results
    3.4 Object tracking in aerial and terrestrial visible imagery
        3.4.1 Particle filter approximation
        3.4.2 Results
    3.5 Conclusions
4 Bayesian tracking of multiple interacting objects
    4.1 Description of the multiple object tracking problem
    4.2 Bayesian tracking model for multiple interacting objects
        4.2.1 Transition pdfs
        4.2.2 Likelihood
    4.3 Approximate inference based on Rao-Blackwellized particle filtering
        4.3.1 Kalman filtering of the object state
        4.3.2 Particle filtering of the data association and object occlusion
    4.4 Object detections
    4.5 Results
        4.5.1 Qualitative results
        4.5.2 Quantitative results
    4.6 Conclusions
5 Conclusions and future work
    5.1 Conclusions
    5.2 Future work
6 Appendix
    6.1 Conditional independence and d-separation
References

List of Figures

3.1-3.9 Graphical model for the Bayesian object tracking; Particle approximation of the posterior pdf; SIR resampling of the posterior pdf; Consecutive frames of an aerial infrared sequence; Multimodal LoG filter response; Likelihood distribution; Initial translational transformations; Probability values for the ego-motion hypothesis; Metropolis-Hastings sampling of the likelihood distribution
3.10 Kernel density estimation and state estimation
3.11 Object tracking result
3.12 Intermediate results for a situation of strong ego-motion
3.13 Tracking results for the BEH algorithm under strong ego-motion
3.14 Tracking results for the DEH algorithm under strong ego-motion
3.15 Tracking results for the NEH algorithm under strong ego-motion
3.16 Intermediate results for a situation greatly affected by the aperture problem
3.17 Tracking results for the BEH algorithm in a situation greatly affected by the aperture problem
3.18 Tracking results for the DEH algorithm in a situation greatly affected by the aperture problem
3.19 Tracking results for the NEH algorithm in a situation greatly affected by the aperture problem
3.20 Example of the similarity measurement between image regions
3.21 Example of feature correspondence
3.22 Representation of the affine transformation hypothesis
3.23 Samples of the object position
3.24 Samples of ellipsis enclosing the object
3.25 Weighted sampled representation of the posterior pdf
3.26 Tracking results with a camera mounted on a car
3.27 Tracking results with a camera mounted on a helicopter
4.1-4.7 Set of detections yielded by multiple detectors; Data association between detections and objects; Restrictions imposed to the associations between detections and objects; Restrictions imposed to the occlusions among objects; Graphical model for multiple object tracking; Graphical model for the initial time step; Object dynamic model
4.8 Color histograms of two object categories
4.9 Similarity maps of the color histograms
4.10 Computed detections from the red dressed team
4.11 Computed detections from the black and white dressed team
4.12 Tracking results for a simple object cross
4.13 Marginalization of the posterior pdf over one specific object
4.14 Marginalization of the posterior pdf over one specific object
4.15 Tracking results for a complex object cross
4.16 Marginalization of the posterior pdf over one specific object
4.17 Marginalization of the posterior pdf over one specific object
4.18 Marginalization of the posterior pdf over one specific object
4.19 Tracking results for an overtaking action
4.20 Marginalization of the posterior pdf over one specific object
4.21 Marginalization of the posterior pdf over one specific object
4.22 Marginalization of the posterior pdf over one specific object
6.1 Concepts of d-separation and descendants

List of Tables

2.1 Tracking problems related to the data association
3.1 Quantitative results for object tracking with a moving camera in infrared imagery
3.2 Quantitative results for object tracking with a moving camera in visible imagery
4.1 Quantitative results for interacting objects 1/2
4.2 Quantitative results for interacting objects 2/2


Chapter 1 Introduction

The evolution and spread of technology have allowed the proliferation of video based systems, which make use of powerful computers and high quality video cameras to automatically perform increasingly demanding tasks such as vehicle navigation, traffic monitoring, human-computer interaction, motion-based recognition, security and surveillance, etc. Visual object tracking is a fundamental part in all of the previous tasks, and also in the field of computer vision in general. This fact has motivated a great deal of interest in object tracking algorithms.

The ultimate goal of tracking algorithms is to estimate the object trajectories in a video sequence. For this purpose, two different kinds of information are used: the video streams acquired by the camera sensor and the prior knowledge about the tracked objects and the environment. The video-stream based information is used to compute object detections in each frame, also known as observations or measurements. The detection process uses the most distinctive appearance features, such as color, gradient, texture, and shape, to minimize the probability of false detections and at the same time to maximize the detection probability. However, the object appearance can undergo significant variations that cause noisy detections and even missing detections, i.e. tracked objects that have not been detected. The appearance variations can be produced by articulated or deformable objects, illumination changes due to weather conditions (typical in outdoor applications), and variations in the camera point of view. Object interactions, such as partial and total occlusions, are another source of noisy and missing detections. On the other hand, scene structures similar to the objects of interest can cause false detections, and thus confuse the tracking process.

To alleviate these detection shortcomings, the tracking also relies on the available prior information in order to constrain the trajectory estimation problem. This kind of information is mainly the object dynamics, which is used to predict the evolution of the object trajectories. The modeling of the object dynamics can be a very difficult task, especially in situations in which objects undergo complex interactions. On the other hand, object dynamic information is only meaningful for static or quasi-static cameras, since, in the case of moving cameras, a global motion is induced in the image, called ego-motion, that corrupts the trajectory predictions. In order to deal with this problem, the camera dynamics must also be modeled, which makes the tracking more complex and increases the uncertainty in the trajectory estimation.

While there exist reliable algorithms for the tracking of a single object in constrained scenarios, object tracking is still a challenge in uncontrolled situations involving multiple interacting objects, heavily-cluttered scenarios, moving cameras, objects with varying appearance, and complex object dynamics. In this dissertation, the main aim has been the development of efficient tracking solutions for two of these complex tracking situations.

The first one consists in tracking a single object in heavily-cluttered scenarios with a moving camera. To successfully tackle this problem, an advanced Bayesian framework has been designed that jointly models the object and camera dynamics. This makes it possible to satisfactorily predict the evolution of the tracked object in situations of high uncertainty, in which several object locations are possible because of the combined dynamics of the object and the camera. In addition, the algorithm is robust to the background clutter, avoiding tracking failures due to the presence in the background of objects similar to the tracked one. The inference of the tracking information in the proposed Bayesian model cannot be performed analytically, i.e. there is no closed form expression to directly compute the required tracking information. This situation arises from the fact that the dynamic and observation processes involved in the Bayesian tracking framework are non-linear and non-Gaussian. As a result, a suboptimal inference method has been derived that makes use of the particle filtering technique to compute an accurate approximation of the object trajectory.

The other unrestricted tracking situation focuses on the interactions of multiple objects with a static camera. For this purpose, a novel recursive Bayesian model has been developed to explicitly manage complex object interactions. This is accomplished by an advanced object dynamic model that is sensitive to the object interactions involving long-term occlusions of two or more objects. For this purpose, the proposed Bayesian tracking model uses a random variable to predict the occlusion events, which in turn trigger different choices of object motion. The tracking algorithm is also able to handle false and missing detections through a probabilistic data association stage, which efficiently computes the correspondence between the unlabeled detections and the tracked objects.

Regarding the inference of the tracking information in the proposed Bayesian model for interacting objects, two major issues have been carefully addressed. The first one is the mathematical derivation of the posterior distribution of the object tracking information, which has been a challenging task due to the complexity of the tracking model. The second issue, closely related to the first one, arises from the fact that the derived mathematical expression for the posterior distribution has no analytical form due to the complex integrals involved. This situation is caused by the non-linear and non-Gaussian character of the stochastic processes involved in the Bayesian tracking model, i.e. the dynamic, observation and occlusion processes. Consequently, the inference has to be accomplished by means of suboptimal methods, such as particle filtering. However, the high dimensionality of the tracking problem, proportional to the number of tracked objects and object detections, makes the accuracy of the approximate posterior distribution very poor. To overcome this drawback, a novel suboptimal inference method has been developed which combines the particle filtering technique with a variance reduction technique, called Rao-Blackwellization. This makes it possible to obtain an accurate approximation of the object trajectories in high dimensional state spaces, involving multiple tracked objects.

The organization of the dissertation is as follows. In Chap. 2, a state of the art of the most remarkable object tracking techniques is presented, placing special emphasis on Bayesian models, strategies for handling moving cameras, and the management of multiple objects. The developed recursive Bayesian model for tracking a single object in heavily-cluttered scenarios with a moving camera is described in Chap. 3. At the end of the chapter, tracking results of the proposed Bayesian framework are presented for two kinds of applications, one involving aerial infrared imagery, and another dealing with both terrestrial and aerial visible imagery. Subsequently, in Chap. 4, the developed Bayesian tracking solution for multiple interacting objects is presented, along with a test bench to evaluate the efficiency of the tracking in object interactions. Lastly, conclusions and future lines of research are set out in Chap. 5.


Chapter 2 Bayesian models for object tracking

Visual object tracking is a fundamental task in a wide range of military and civilian applications, such as surveillance, security and defense, autonomous vehicle navigation, robotics, behavior analysis, traffic monitoring and management, human-computer interface, video retrieval, and many more. Visual tracking can be defined as the problem of estimating the trajectories of a set of objects of interest in a video sequence as they move around the scene. In a typical tracking application there are one or more object detectors that generate a set of noisy measurements or detections at discrete time instants. The uncertainty of the detection process arises from the noise of the camera sensor, variations in the appearance of the objects, changes in the scene illumination, non-rigid and/or articulated objects, and loss of information caused by the projection of the 3D world onto the 2D image plane. The tracking algorithm must be able to handle the uncertainty in the detection process to assign consistent labels to the tracked objects in each frame of a video sequence. This process can be simplified by imposing certain constraints on the motion of the objects, restricting in this way the spatio-temporal evolution of the trajectories. For this purpose, a dynamic model can be used to predict the motion of the objects. Nonetheless, the dynamic model is only an approximation of the underlying object dynamics, which indeed can be very complex. As a result, the tracking algorithm has to manage different sources of information (detections and object dynamics), taking into account their own uncertainties, to efficiently estimate the object trajectories.

Bayesian estimation is the most commonly used framework in visual tracking and also in other contexts such as radar and sonar. This framework models in a probabilistic way the tracking problem and all its sources of uncertainty: sensor noise, inaccurate dynamic models, environmental clutter, etc. From a Bayesian perspective, the aim is to compute the posterior distribution over the object state, which is a vector containing all the desired tracking information, such as position, velocity, etc. This posterior distribution encodes all the necessary information to efficiently compute an estimation of the object state. The computation of the posterior distribution is usually performed in a recursive way via two cyclic stages: prediction and update. The prediction stage evolves the posterior distribution at the previous time step according to the object dynamics, obtaining as a result the predicted posterior distribution at the current time step. The update stage makes use of the available detections at the current time step to correct the predicted posterior distribution by means of the likelihood model of the object detector. Thus, the computation is efficiently performed, since only the previous estimation of the posterior distribution and the set of detections at the current time step are required.

In single object tracking with static cameras, the main difficulty arises from the fact that realistic models for the object dynamics and detection processes are often non-linear and non-Gaussian, which leads to a posterior distribution without a closed-form analytic expression. In fact, closed-form expressions exist only in a limited number of cases. The most well-known closed-form expression is the Kalman filter (1), which is obtained when both the dynamic and likelihood models are linear and Gaussian. Grid based approaches (2) overcome the limitations imposed on the Kalman filter by restricting the state space to be discrete and finite. If any of the previous assumptions does not hold, the exact computation of the posterior distribution is not possible, and it becomes necessary to resort to approximate inference methods that compute an approximation of the posterior distribution. The extended Kalman filter (1) linearizes models with weak non-linearities using the first term in a Taylor expansion, so that the Kalman filter expression can still be applied. Nonetheless, the performance of the extended Kalman filter rapidly decreases as the non-linearities become more severe. The unscented Kalman filter (3, 4) has proved to be more efficient in models that are moderately non-linear. It recursively propagates a set of selected sigma points to maintain the second order statistics of the posterior distribution.
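In the linear and Gaussian case mentioned above, the prediction and update stages reduce to the Kalman filter equations. The following is a minimal sketch of one such recursion, not the implementation used in this dissertation. The constant-velocity state `[position, velocity]`, the matrices `F`, `Q`, `H`, `R` and the list of detections are all hypothetical values chosen only for illustration.

```python
import numpy as np

def kalman_predict(mean, cov, F, Q):
    """Prediction stage: evolve the previous posterior with the dynamic model."""
    mean_pred = F @ mean
    cov_pred = F @ cov @ F.T + Q
    return mean_pred, cov_pred

def kalman_update(mean_pred, cov_pred, z, H, R):
    """Update stage: correct the predicted posterior with the detection z."""
    innovation = z - H @ mean_pred
    S = H @ cov_pred @ H.T + R               # innovation covariance
    K = cov_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    mean = mean_pred + K @ innovation
    cov = (np.eye(len(mean_pred)) - K @ H) @ cov_pred
    return mean, cov

# Hypothetical 1D constant-velocity object; only the position is detected.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # assumed dynamic model
Q = 0.01 * np.eye(2)                     # assumed dynamic noise covariance
H = np.array([[1.0, 0.0]])               # assumed observation model
R = np.array([[0.25]])                   # assumed detection noise covariance

mean, cov = np.zeros(2), 10.0 * np.eye(2)   # broad Gaussian prior
for z in [1.1, 1.9, 3.2, 4.0, 5.1]:         # made-up position detections
    mean, cov = kalman_predict(mean, cov, F, Q)
    mean, cov = kalman_update(mean, cov, np.array([z]), H, R)

print("estimated [position, velocity]:", mean)
```

Only the previous Gaussian posterior (`mean`, `cov`) and the current detection are needed at each step, which is the recursive property described above.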

Both approximate solutions, extended and unscented Kalman filters, assume that the underlying posterior distribution is Gaussian. But if this assumption does not hold (e.g. the distribution is heavily skewed or multimodal), the accuracy of the estimation can be arbitrarily poor. The Gaussian sum filter (5) was one of the first attempts to deal with non-Gaussian models, approximating the posterior distribution by a mixture of Gaussians. The main limitation of the Gaussian sum filter is that linear approximations are required, as in the extended Kalman filter. Another limitation is the combinatorial growth of the number of Gaussian components in the mixture over time.

An alternative solution for non-linear non-Gaussian models that does not need linearization is obtained by approximate grid-based methods (2, 6). These methods approximate the continuous state space by a finite and fixed grid, and then they apply numerical integration for computing the posterior distribution. The grid must be sufficiently dense to compute an accurate approximation of the posterior distribution. However, the computational cost increases dramatically with the dimensionality of the state space and becomes impractical for dimensions larger than four. An additional disadvantage of grid-based methods is that the state space cannot be partitioned unevenly in order to improve the resolution in regions of high probability density.

All these shortcomings are overcome by the particle filtering technique (2, 7, 8), also known as Sequential Monte Carlo method (9, 10, 11), condensation algorithm (12, 13), or bootstrap filtering (14). It is a numerical integration technique that simulates the posterior distribution by a set of weighted samples, known as particles, that are propagated recursively over time. The samples are drawn from a proposal distribution, which is the key component of the algorithm, and evaluated by means of the dynamic and likelihood models. The particle filter has become very successful in a wide range of tracking applications due to its efficiency, flexibility, and ease of implementation. Moreover, its computational cost is theoretically independent of the dimension of the state space.

To sum up, the previous tracking approaches have proved to be efficient and reliable solutions for single object tracking provided that:
• they fulfill the assumptions of linearity/non-linearity and Gaussianity/non-Gaussianity for which they were conceived,
• the cameras are static, i.e. with no motion, and
• there is always a unique detection for the tracked object, which only occurs in constrained scenarios where there is total control over the number and types of objects that compose the scene.
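The particle filtering recursion described above can be illustrated with a bootstrap (SIR) filter, in which the proposal distribution is simply the dynamic model. This is a minimal sketch under stated assumptions, not the tracker developed in this dissertation: the scalar state, the random-walk dynamics, the Gaussian likelihood, and the parameters `dyn_std` and `obs_std` are all hypothetical.

```python
import numpy as np

def bootstrap_step(particles, z, rng, dyn_std=1.0, obs_std=1.0):
    """One sampling-importance-resampling step for a scalar state."""
    n = len(particles)
    # Prediction: draw each particle from the assumed random-walk dynamics,
    # which plays the role of the proposal distribution.
    particles = particles + rng.normal(0.0, dyn_std, size=n)
    # Update: weight each particle with the assumed Gaussian likelihood of z.
    log_w = -0.5 * ((z - particles) / obs_std) ** 2
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    w /= w.sum()
    estimate = np.sum(w * particles)     # weighted mean of the posterior samples
    # Resampling: draw n particles with probability proportional to the weights.
    particles = particles[rng.choice(n, size=n, p=w)]
    return particles, estimate

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 5.0, size=2000)   # samples from a broad prior
true_x = 0.0
for _ in range(30):
    true_x += 0.5                             # simulated object motion
    z = true_x + rng.normal(0.0, 1.0)         # simulated noisy detection
    particles, estimate = bootstrap_step(particles, z, rng)

print(f"true position {true_x:.2f}, estimate {estimate:.2f}")
```

Using the dynamic model as proposal is the simplest choice; since the proposal is the key component of the algorithm, a proposal that also takes the current detection into account generally needs fewer particles.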

In the rest of situations, the tracking task is still a challenge, which is receiving a great deal of research attention because of the wide range of potential applications that can be developed. The main contribution of this dissertation is the development of efficient and reliable algorithms for the tracking of objects in challenging situations. Specifically, the research has been focused on two situations: single object tracking with moving cameras, and multiple interacting object tracking with static cameras.

In the first situation, the moving camera induces a global motion in the scene, called ego-motion, that corrupts the spatio-temporal continuity of the video sequence. As a consequence, the object dynamic information is not useful anymore, since the camera motion is not considered, and the tracking performance is seriously reduced. In Sec. 2.1, a thorough review of the main techniques that address the ego-motion problem for single object tracking with moving cameras is presented.

In the other considered situation, the tracking algorithm has to manage several interacting objects in an environment with static cameras. The difficulty arises from the fact that in each time step there is a set of unlabeled detections generated by the detectors. This means that the correspondence between objects and detections is not known, and therefore a data association stage is required. In fact, the data association can be very complex, since potentially any detection can be associated with any object, and therefore the number of possible associations grows combinatorially with the number of objects and detections. Furthermore, there can be false detections and missing detections that increase even more the complexity of the data association. The false detections arise from the noise of the camera sensor and the scene clutter (structures in the background similar to the tracked object). On the other hand, object occlusions and strong variations in the object appearance can cause one or more objects not to be detected, the so-called missing detections. These phenomena can also occur for single object tracking in unconstrained scenarios, in which there is a unique object, but there can be none, one or multiple detections. This fact violates the assumption that there is always a unique detection per object.
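The combinatorial growth of the data association problem can be made concrete by counting the association hypotheses for a single frame. The sketch below rests on assumptions that are not taken from the text: each object produces at most one detection, each detection comes from at most one object, unassigned detections are false detections, and unassigned objects are missing detections. The function name `count_association_hypotheses` is hypothetical.

```python
from math import comb, factorial

def count_association_hypotheses(num_objects, num_detections):
    """Count the one-to-one partial assignments between objects and detections."""
    total = 0
    for k in range(min(num_objects, num_detections) + 1):
        # Choose which k objects are detected, which k detections they
        # generated, and in which of the k! ways they are matched.
        total += comb(num_objects, k) * comb(num_detections, k) * factorial(k)
    return total

for n in (1, 2, 3):
    print(n, "objects,", n, "detections:", count_association_hypotheses(n, n))
# prints 2, 7 and 34 hypotheses respectively
```

Even a single object with three detections admits four hypotheses under these assumptions (the object is missed, or it produced any one of the three detections), which illustrates why a unique detection per object cannot be taken for granted.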

For example, the nearest neighbor Kalman filter (15) handles the data association problem by selecting the closest detection to the predicted object trajectory, which is used to update the posterior distribution of the object state. On the other hand, the probabilistic data association filter (16, 17), unlike the previous method that only uses a unique detection, updates the posterior distribution utilizing all the detections that are close to the predicted object trajectory. This is accomplished by averaging the innovation terms of the Kalman filter resulting from the set of detections. This approach maintains the Gaussian character of the posterior distribution. In this context, the data association in single object tracking can be considered as a specific case of the data association of multiple object tracking, where the number of tracked objects in the scene is just one. There exists a large body of scientific literature in the field of multiple object tracking, and recently there has been a revival of interest due to the developments in particle filtering and recursive Bayesian models in general. In Sec. 2.2, the main multi-object tracking techniques are presented, focusing on the problem of the data association.

2.1 Tracking with moving cameras

In video based applications in which the video acquisition system is mounted on a moving aerial platform (such as a plane, a helicopter, or an Unmanned Aerial Vehicle), a vehicle, a mobile robot, etc., the acquired video sequences undergo a random global motion, called ego-motion, that prevents the use of the object dynamic information to restrict the object position in the scene. As a consequence, the tracking performance can be dramatically reduced. The ego-motion problem has been addressed in different manners in the scientific literature. The existing approaches can be split into two categories: those based on the assumption of low ego-motion, and those based on the ego-motion estimation.

Approaches assuming low ego-motion consider that the motion component due to the camera is not very significant in comparison with the object motion. In this context, some works assume that the spatio-temporal connectivity of the object is preserved along the sequence (18, 19, 20), i.e. the image regions associated with the tracked object are spatially overlapped in consecutive frames. Then, the tracking is performed using morphological connected operators. In cases where the previous assumption does not hold, the most common approach is to search for the object in a bounded area centered on the location where the object is expected to be found, according to its dynamics.

In (21, 22), an exhaustive search is performed in a fixed-size image region centered on the previous object location. In (23), the initial search location is estimated using a Kalman filter, and then the search is performed deterministically using the Mean Shift algorithm (24). Other authors (25, 26) propose a stochastic search based on particle filtering, which is able to manage multiple initial locations for the search. However, all these methods lose effectiveness as the displacement induced by the ego-motion increases. The reason is that the size of the search area must be enlarged to accommodate the expected camera ego-motion, which dramatically increases the probability that the tracking is distracted by false candidates.

The other category of approaches, based on the ego-motion estimation, is able to deal with strong ego-motion situations, in which the camera motion is at least as significant as the object motion. They aim to compute the camera ego-motion between consecutive frames in order to compensate for it, thus recovering the spatio-temporal correlation of the video sequence. The camera ego-motion is modeled by a geometric transformation, typically an affine or projective one, whose parameters are estimated by means of an image registration technique. The existing works differ in the specific image registration technique used to compute the parameters of the geometric transformation. Extensive reviews of image registration techniques can be found in (27, 28), where the first one tackles all kinds of vision based applications, while the second one is focused on aerial imagery. According to them, image registration techniques can be classified into those based on features and those based on area (i.e. image regions).

Feature based image registration techniques detect and match distinctive image features between consecutive frames to estimate a geometric transformation, which represents the camera ego-motion model. In (29), an object detection and tracking system with a moving airborne platform is described, which uses a feature based approach to estimate an affine camera model. In (30), the KLT method (31) is used to infer a bilinear camera model in an application that detects moving objects from a mobile robot. In the field of FLIR (Forward Looking InfraRed) imagery, the works (32, 33, 34) describe a system mounted on an airborne platform for detecting and tracking aerial targets, which uses a robust statistical framework to match edge features in order to estimate an affine camera model. This system is able to successfully handle situations in which the camera motion estimation is disturbed by the presence of independent moving objects, provided that there is a minimum number of detected features belonging to the background.

of detected features belonging to the background.

In situations in which the detection of distinctive features is particularly complicated, because the acquired images have little texture and structure, an area-based image registration technique is used to estimate the parameters of the camera model. In (35), a perspective camera model is computed by means of an optical flow algorithm to detect moving objects in an application of aerial visual surveillance. An optical flow algorithm is also used in (36) to estimate the parameters of a pseudo-perspective camera model, which is utilized to create panoramic image mosaics. In (39, 40), a target detection framework is presented for FLIR imagery that minimizes an SSD (Sum of Squared Differences) based error function to estimate an affine camera model. The same approach is followed in (37, 38) for a tracking application of terrestrial targets in airborne FLIR imagery, but utilizing a different minimization algorithm. In (42), the Inverse Compositional Algorithm is used to obtain the parameters of an affine camera model for a tracking application of vehicles in aerial imagery. A similar framework of camera motion compensation is used in (41) for tracking vehicles in aerial infrared imagery. Unlike the feature-based image registration techniques, the area-based techniques are not robust to the presence of independent moving objects, which can bias the ego-motion estimation. In addition, they require that the involved images are closely aligned to achieve satisfactory results.

All the previous approaches, independently of the camera ego-motion compensation technique used, have in common that they compute at most one parametric model to represent the ego-motion between consecutive frames. However, in real applications, the ego-motion computation can be quite challenging, because there can be several feasible solutions, i.e. several camera geometric transformations, and the solution with the lowest error is not necessarily the correct one. This situation arises as a consequence of several phenomena, such as the aperture problem (43) (related to scenes with little structure or texture), the presence of independent moving objects, changes in the scene, and limitations of the camera ego-motion technique itself. In Chap. 3, an efficient and reliable Bayesian framework is proposed to deal with the uncertainty in the estimation of the camera ego-motion for tracking applications.
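As an illustration of the area-based idea discussed above, the following is a minimal sketch that registers two frames by exhaustively testing integer translations and keeping the one with the lowest squared-difference error. It is not the method of any of the cited works: the translation-only camera model, the function names (`mean_squared_difference`, `register_translation`, `shift_image`) and the synthetic 8x8 frames are assumptions made only for this example, whereas the cited approaches estimate affine or perspective models.

```python
def mean_squared_difference(prev, curr, dx, dy):
    """SSD between curr and prev shifted by (dx, dy), averaged over the overlap."""
    h, w = len(curr), len(curr[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy  # pixel of prev that lands on (x, y)
            if 0 <= sx < w and 0 <= sy < h:
                diff = curr[y][x] - prev[sy][sx]
                total += diff * diff
                count += 1
    return total / count if count else float("inf")


def register_translation(prev, curr, max_shift=2):
    """Return the integer shift (dx, dy) with the lowest error (hypothetical helper)."""
    shifts = [(dx, dy)
              for dy in range(-max_shift, max_shift + 1)
              for dx in range(-max_shift, max_shift + 1)]
    return min(shifts, key=lambda s: mean_squared_difference(prev, curr, s[0], s[1]))


def shift_image(image, dx, dy, fill=0):
    """Simulate camera ego-motion by translating the image content by (dx, dy)."""
    h, w = len(image), len(image[0])
    return [[image[y - dy][x - dx] if 0 <= x - dx < w and 0 <= y - dy < h else fill
             for x in range(w)]
            for y in range(h)]


# Synthetic, well-textured previous frame and a translated current frame.
prev_frame = [[10 * y + x for x in range(8)] for y in range(8)]
curr_frame = shift_image(prev_frame, dx=1, dy=1)
estimated_shift = register_translation(prev_frame, curr_frame)
```

If both frames had constant intensity, every candidate shift would produce the same error, which is a toy version of the ambiguity described in the preceding paragraph.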

2.2 Tracking of multiple interacting objects

Multiple object tracking can be seen as the generalization of single object tracking, in the sense that the main goal is to recover the trajectories of multiple objects from a video sequence, rather than only one trajectory from a unique object. However, techniques of multiple object tracking are fundamentally different from those of single object tracking, due to the particular problems that arise in the presence of two or more objects. In multiple object tracking, the object detections are unlabeled and unordered, i.e. the true correspondence between objects and detections is unknown. The estimation of the true correspondence, called data association, suffers from the combinatorial explosion of the possible associations, in which the computational cost inevitably grows exponentially with the number of objects. On the other hand, in spite of the fact that the goal of the detector is both to minimize the probability of false alarms and to maximize the detection probability, in real situations there can be none, one, or several detections per object. Furthermore, there can be false detections and missing detections. This fact increases the complexity of the data association problem. The false detections arise from scene structures similar to the objects of interest, which can confuse the tracking process. The missing detections can originate from changes in the object appearance, which in turn are caused by articulated or deformable objects, illumination changes due to weather conditions (typical in outdoor applications), and variations in the camera point of view. Another source of missing detections is the partial and total occlusions involved in the object interactions. All of these phenomena are also responsible for the noisy character of the detection process. Tab. 2.1 summarizes the mentioned sources of disturbances along with their effects, and the derived data association problems. As a result, data association is a stochastic process in which the estimation of the true detection association can be extremely difficult due to the involved uncertainty.

A great number of strategies have been proposed in the scientific literature to solve the data association problem. These can be divided into single-scan and multiple-scan approaches. Single-scan approaches perform the data association considering only the set of available detections in a specific time step, while the multiple-scan approaches make use of the detections acquired in a temporal interval, comprising several time steps.

Multiple-scan approaches consider that tracks are basically a sequence of noisy

detections. In this way, the data association problem is cast as one of association of sequences of detections. Thus, multiple object tracking consists in seeking the optimal paths in a trellis formed by the temporal sequence of detections. Techniques that accomplish this task are the Viterbi algorithm (44, 45, 46), multiple scan assignment (47, 48), network theoretic algorithms (49), and the expectation-maximization algorithm (EM) (50). The preceding approaches compute a single solution that is considered the best one, discarding many feasible hypotheses that could be the true solution. To alleviate this situation, some approaches (51, 52) compute the best N solutions in order to minimize the risk of an incorrect trajectory estimation. An additional problem is the computational cost. It is known that the multiple-scan approaches lead to NP-hard problems in combinatorial optimization, so that, in practice, the cost of solving them exactly grows exponentially with the number of objects and detections. The most popular solution to tackle this problem is the Lagrangian relaxation (53, 54), wherein the N-dimensional assignment problem is divided into a set of assignment problems of lower dimensionality. Another approach (55) transforms the integer programming problem, posed by the multiple-scan assignment, into a linear programming problem by relaxing the constraints for an integer solution. This allows the problem to be solved efficiently in polynomial time through well-known algorithms, such as the interior point method (56).

Table 2.1: Disturbances in the detection process, their effects, and the resulting problems in data association.

Disturbance | Effect | Data association problem
Changes in the camera point of view | Variations in the object appearance | Missing detections, noisy detections
Articulated or deformable objects | Variations in the object appearance | Missing detections, noisy detections
Illumination changes | Variations in the object appearance | Missing detections, noisy detections
Object interactions | Partial or total occlusions | Missing detections, noisy detections
Scene structures similar to the objects of interest | Presence of clutter | False detections

Inside the group of single-scan approaches, the simplest one is the global nearest

neighbor algorithm (57), also known as the 2D assignment algorithm, which computes a single association between detections and objects by minimizing a distance-based cost function. The main problem of this approach is that many feasible associations are discarded. On the other hand, the multiple hypotheses tracker (MHT) (58, 59) attempts to keep track of all the possible associations over time. As occurs with the multiple-scan approaches, the complexity of the problem is NP-hard because the number of associations grows exponentially over time, and also with the number of objects and detections. Therefore, additional methods are required to establish a trade-off between the computational complexity and the handling of multiple association hypotheses. In this respect, one of the most popular methods is the joint probabilistic data association filter (JPDAF) (60, 61), which performs a soft association between detections and objects. This is carried out by combining all the detections with all the objects, in such a way that the contribution of each detection to each object depends on the statistical distance between them. This method prunes away many infeasible hypotheses, but also restricts the data association distribution to be Gaussian, which limits the applicability of the technique. Subsequent works (62, 63) try to overcome this limitation by modeling the data association distribution by a mixture of Gaussians. However, heuristic techniques are necessary to reduce the number of components to make the algorithm computationally manageable. The probabilistic multiple hypotheses tracker (PMHT) (64, 65) is another alternative to estimate the best data association hypotheses at a moderate computational cost. It assumes that the data association is an independent process to work around the problems with pruning. Nevertheless, the performance is similar to that of the JPDAF, although the computational cost is higher.

The data association problem has also been addressed with particle filtering, which allows arbitrary data association distributions to be handled in a natural way. The computed association hypotheses constitute an approximation of the true data association distribution, and the approximation is more accurate as the number of hypotheses increases. Theoretically, the algorithms based on particle filtering have the ability to manage the best data association hypotheses with a computational cost independent of the number of objects and detections. In practice, the performance of the particle filtering techniques depends on the ability to correctly sample association hypotheses from a proposal distribution called importance density. In (66, 67), a Gibbs sampler is used to sample the data association hypotheses. In a similar way, a

Markov Chain Monte Carlo (MCMC) (68, 69, 70) scheme has been used for drawing samples that simulate the underlying data association distribution. The main problem with these samplers is that they are iterative methods that need an unknown number of iterations to converge. This fact makes them inappropriate for online applications. Some works (71, 72) overcome this limitation by means of the design of an efficient and non-iterative proposal distribution that depends on the specific characteristics of the underlying dynamic and likelihood processes of the tracking system. The accuracy of the estimation achieved by techniques based on particle filtering depends on the dimension of the state space. For high-dimensional spaces, the accuracy can be quite low. In order to deal with this drawback, a variance reduction technique, called Rao-Blackwellization, has been used in (73), which improves the accuracy of the estimated object trajectories for a given number of samples or hypotheses.

An alternative to particle filtering is the probability hypothesis density (PHD) filter, which can also address missing and false detections. However, the computational cost is exponential with the number of objects. In order to reduce the complexity from exponential to linear, the full posterior distribution is simplified by its first-order moment in (74). Nonetheless, this approach is only satisfactory for multivariate distributions that can be reasonably approximated by their first moment, which can be an excessive limitation for some tracking applications.

The previous works have been designed to track multiple objects with restricted kinds of interactions among them. For instance, these works are able to handle object interactions involving trajectory changes but without occlusions, such as a situation with two people who stop one in front of the other. In this case the object detections are used to efficiently correct the object trajectories. Another kind of interaction that is successfully addressed involves object occlusions but without trajectory changes, such as a situation with two people who cross each other maintaining their paths. In this case, the data association stage can manage the missing detections during the occlusion, relying on the fact that the trajectories are unchanged in order to predict the tracks. However, in complex object interactions involving trajectory changes and occlusions, the previous approaches are prone to fail because the occluded objects have no available detections to correct their trajectories. This limitation arises from the fact that the main tracking techniques for multiple objects have been developed for radar and sonar applications, in which the dynamics of the tracked objects have physical restrictions that make

impossible the complex interactions that arise in visual tracking. Moreover, in the field of radar and sonar, the objects are handled as point targets that cannot be occluded.

Some works have proposed strategies to deal with the specific problems that arise in the field of visual tracking. In (75, 76), the data association hypotheses are drawn using a sampling technique that is able to handle split object detections, i.e. groups of detections that have been generated from the same object. Split detections are typical of background subtraction techniques (77), which are used to detect moving objects in video sequences. In (78), a specific approach for handling object interactions that involve occlusions and changes in trajectories is presented. It creates virtual detections of possibly occluded objects to cope with the changes in trajectories during the occlusions. However, since the occlusion events are not explicitly modeled, tracking errors can appear when a virtual detection is associated with an object that is actually not occluded.

In order to improve the performance of the tracking of multiple objects in the field of computer vision, a novel Bayesian approach that explicitly models the occlusion phenomenon has been developed. This approach is able to track complex interacting objects whose trajectories change during the occlusions. Chap. 4 describes in detail the proposed visual tracking for multiple interacting models.

Chapter 3

Bayesian Tracking with Moving Cameras

This chapter starts with a brief overview of the optimal Bayesian framework for general object tracking (Sec. 3.1), explaining also the basics of particle filtering, an approximate inference technique. Next, the developed Bayesian tracking framework for moving cameras, which models the camera motion in a probabilistic way, is presented in Sec. 3.2. Lastly, Secs. 3.3 and 3.4 show respectively how to apply the proposed Bayesian model to two visual tracking applications for moving cameras: the first one focused on aerial infrared imagery, and the second one for aerial and terrestrial visible imagery.

3.1 Optimal Bayesian estimation for object tracking

The Bayesian approach for object tracking aims to estimate a state vector $x_t$ that evolves over time using a sequence of noisy observations $z_{1:t} = \{z_i \mid i = 1, \ldots, t\}$ up to time $t$. The state vector contains all the relevant information for the tracking at time step $t$, such as the object position, velocity, size, appearance, etc. The noisy observations $z_{1:t}$ (also called measurements or detections) are obtained by one or more detectors, which analyze the video sequence information acquired by the camera to either directly compute the object position, or indirectly obtain relevant features that can be related to the object position, such as motion, color, texture, edges, corners, etc.

From a Bayesian perspective, some degree of belief in the state $x_t$ at time $t$ is

calculated, using the available prior information (about the object, the camera and the scene), and the set of observations $z_{1:t}$. Therefore, the tracking problem can be formulated as the estimation of the posterior probability density function (pdf) of the state of the object, $p(x_t|z_{1:t})$, conditioned on the set of observations. This probabilistic model for object tracking can be represented by a graph (see Fig. 3.1), called graphical model, in which the random variables are represented by nodes, and the probabilistic relationships among the variables by arrows.

Figure 3.1: Graphical model for the Bayesian object tracking.

For efficiency purposes, the estimation of the posterior pdf $p(x_t|z_{1:t})$ is recursively performed through two stages: the prediction of the most probable state vectors using the prior information, and the update (or correction) of the prediction based on the observations, where the initial pdf $p(x_0|z_0) \equiv p(x_0)$ is assumed to be known.

The prediction stage involves computing the prior pdf of the state, $p(x_t|z_{1:t-1})$, at time $t$ via the Chapman-Kolmogorov equation

$$p(x_t|z_{1:t-1}) = \int p(x_t, x_{t-1}|z_{1:t-1})\,dx_{t-1} = \int p(x_t|x_{t-1})\,p(x_{t-1}|z_{1:t-1})\,dx_{t-1}, \tag{3.1}$$

where $p(x_{t-1}|z_{1:t-1})$ is the posterior pdf at the previous time step, and $p(x_t|x_{t-1})$ is the state transition probability, which encodes the prior information, for example the object dynamics along with its uncertainty. The state transition probability is defined by a possibly non-linear function of the state $x_{t-1}$ and an independent identically distributed noise process $v_{t-1}$

$$x_t = f_t(x_{t-1}, v_{t-1}). \tag{3.2}$$
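To make the prediction stage concrete, the following sketch evaluates the Chapman-Kolmogorov equation (Eq. 3.1) on a discrete state space, where the integral reduces to a sum. The three-cell world, the transition table and the name `predict` are assumptions chosen only for illustration; the framework itself is defined over continuous states.

```python
def predict(posterior_prev, transition):
    """Discrete Chapman-Kolmogorov step.

    posterior_prev[i] approximates p(x_{t-1} = i | z_{1:t-1}) and
    transition[i][j] approximates p(x_t = j | x_{t-1} = i).
    """
    n = len(posterior_prev)
    return [sum(transition[i][j] * posterior_prev[i] for i in range(n))
            for j in range(n)]


# Hypothetical three-cell 1D world in which the object tends to move right.
transition = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
posterior_prev = [1.0, 0.0, 0.0]  # the object is known to be in cell 0
prior = predict(posterior_prev, transition)
```

Each row of `transition` sums to one, so the predicted prior remains a valid probability distribution.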

The update stage aims to reduce the uncertainty of the prediction, $p(x_t|z_{1:t-1})$, using the new available observation $z_t$ (observations are available at discrete times) through Bayes' rule

$$p(x_t|z_{1:t}) = \frac{p(z_t|x_t)\,p(x_t|z_{1:t-1})}{p(z_t|z_{1:t-1})}, \tag{3.3}$$

where $p(z_t|x_t)$ is the likelihood distribution that models the observation process, i.e. it assesses the degree of support of the observation $z_t$ by the prediction $x_t$. The likelihood is given by a possibly nonlinear function of the state $x_t$ and an independent identically distributed noise process $n_t$

$$z_t = h_t(x_t, n_t). \tag{3.4}$$

The denominator of Eq. 3.3 is simply a normalization constant given by

$$p(z_t|z_{1:t-1}) = \int p(z_t, x_t|z_{1:t-1})\,dx_t = \int p(z_t|x_t)\,p(x_t|z_{1:t-1})\,dx_t. \tag{3.5}$$

The posterior $p(x_t|z_{1:t})$ embodies all the available statistical information, allowing the computation of an optimal estimate of the state vector $x_t$, which contains the desired tracking information. Commonly used estimators are the Maximum A Posteriori (MAP) and the Minimum Mean Square Error (MMSE), given respectively by

$$\text{MAP}: \quad \hat{x}_t = \arg\max_{x_t}\, p(x_t|z_{1:t}) \tag{3.6}$$

$$\text{MMSE}: \quad \hat{x}_t = E[x_t|z_{1:t}] = \int x_t\, p(x_t|z_{1:t})\,dx_t \tag{3.7}$$

Nevertheless, the optimal solution of the posterior probability, given by Eq. 3.3, cannot be determined analytically in practice, due to the nonlinearities and non-Gaussianities of the prior information and observation models. Therefore, suboptimal methods are needed to obtain an approximate solution. In Sec. 3.1.1, a powerful and popular suboptimal method, called Particle Filtering, will be described.
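The update stage and the two estimators can be illustrated on the same kind of discrete toy problem. The sketch below applies Bayes' rule (Eqs. 3.3 and 3.5) cell by cell and then evaluates the MAP and MMSE estimates (Eqs. 3.6 and 3.7); the state values, prior and likelihood numbers, and the function names are assumptions made for this example only.

```python
def update(prior, likelihood):
    """Bayes' rule on a discrete grid: returns the posterior and the evidence."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # discrete counterpart of Eq. 3.5
    return [u / evidence for u in unnormalized], evidence


def map_estimate(states, posterior):
    """State with the highest posterior probability."""
    best = max(range(len(states)), key=lambda i: posterior[i])
    return states[best]


def mmse_estimate(states, posterior):
    """Posterior mean of the state."""
    return sum(s * p for s, p in zip(states, posterior))


states = [0.0, 1.0, 2.0]        # hypothetical 1D positions
prior = [0.2, 0.8, 0.0]         # p(x_t | z_{1:t-1}), assumed given
likelihood = [0.6, 0.3, 0.1]    # p(z_t | x_t) evaluated at each state

posterior, evidence = update(prior, likelihood)
x_map = map_estimate(states, posterior)
x_mmse = mmse_estimate(states, posterior)
```

In this example the two estimators differ: the MAP estimate picks a single state, while the MMSE estimate averages over the posterior mass.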

3.1.1 Particle filter approximation

The Particle Filter is an approximate inference method based on Monte Carlo simulation for solving Bayesian filters. In contrast to other approximate inference methods, such as Extended Kalman Filters, Unscented Kalman Filters and Hidden Markov Models, Particle Filtering is able to deal with continuous state spaces and nonlinear/non-Gaussian processes (9), which arise in a natural way in real tracking situations.

The Particle Filtering technique approximates the posterior probability $p(x_t|z_{1:t})$ by a set of $N_S$ weighted random samples (or particles) $\{x_t^i, i = 1, \ldots, N_S\}$ (2)

$$p(x_t|z_{1:t}) \approx \frac{1}{c} \sum_{i=1}^{N_S} w_t^i\, \delta(x_t - x_t^i), \tag{3.8}$$

where the function $\delta(x)$ is the Dirac delta, $\{w_t^i, i = 1, \ldots, N_S\}$ is the set of weights related to the samples, and $c = \sum_{i=1}^{N_S} w_t^i$ is a normalization factor. As the number of samples becomes very large, this approximation becomes equivalent to the true posterior pdf.

Samples $x_t^i$ and weights $w_t^i$ are obtained using the concept of importance sampling (2, 79), which aims to reduce the variance of the approximation given by Eq. 3.8 through Monte Carlo simulation. The set of samples $\{x_t^i, i = 1, \ldots, N_S\}$ is drawn from a proposal distribution function $q(x_t|x_{t-1}, z_t)$, called the importance density. The optimal $q(x_t|x_{t-1}, z_t)$ should be proportional to $p(x_t|z_{1:t})$, and should have the same support (the support of a function is the set of points where the function is not zero), in which case the variance would be zero. But this is only a theoretical solution, since it would imply that $p(x_t|z_{1:t})$ is known. In practice, a proposal distribution as similar as possible to the posterior pdf is chosen, but there is no standard solution, since it depends on the specific characteristics of the tracking application. The choice of the proposal distribution is a key component in the design of Particle Filters, since the quality of the estimation of the posterior pdf depends on the ability to find an appropriate proposal distribution.

The weights $w_t^i$ related to each sample $x_t^i$ are recursively computed by (2)

$$w_t^i = w_{t-1}^i\, \frac{p(z_t|x_t^i)\, p(x_t^i|x_{t-1}^i)}{q(x_t^i|x_{t-1}^i, z_t)}. \tag{3.9}$$
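The weight recursion of Eq. 3.9 can be sketched for a scalar state as follows. All modeling choices here are assumptions made for illustration only: a Gaussian random-walk transition, a Gaussian likelihood, and a wider Gaussian proposal centered on the previous particle that ignores the observation. The names `gauss_pdf` and `sis_step` are hypothetical.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def sis_step(particles, weights, z, rng, trans_std=1.0, obs_std=0.5, prop_std=1.5):
    """One sequential importance sampling step for a 1D state."""
    new_particles, new_weights = [], []
    for x_prev, w_prev in zip(particles, weights):
        x = rng.gauss(x_prev, prop_std)              # draw x_t^i from the proposal q
        likelihood = gauss_pdf(z, x, obs_std)        # p(z_t | x_t^i)
        transition = gauss_pdf(x, x_prev, trans_std) # p(x_t^i | x_{t-1}^i)
        proposal = gauss_pdf(x, x_prev, prop_std)    # q(x_t^i | x_{t-1}^i, z_t)
        new_particles.append(x)
        new_weights.append(w_prev * likelihood * transition / proposal)
    c = sum(new_weights)                             # normalization factor of Eq. 3.8
    return new_particles, [w / c for w in new_weights]


rng = random.Random(0)
n = 2000
particles = [0.0] * n            # all particles start at the origin
weights = [1.0 / n] * n
particles, weights = sis_step(particles, weights, z=1.0, rng=rng)
estimate = sum(w * x for w, x in zip(weights, particles))  # weighted posterior mean
```

For these Gaussian assumptions the exact posterior mean is 0.8, so the weighted estimate should land close to that value. When the proposal equals the transition density, the ratio in Eq. 3.9 reduces to the likelihood alone.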

The importance sampling principle has a serious drawback, called the degeneracy problem (2): after a few iterations, all the weights except one have an insignificant value. In order to overcome this problem, several resampling techniques have been proposed in the scientific literature, which introduce an additional sampling step that replicates more often those samples that are more probable. A popular resampling strategy is the Sampling Importance Resampling (SIR) algorithm, which makes a random selection of the samples at each time step according to their weights. Thus, the samples with higher weights are selected more times, while the ones with an insignificant weight are discarded. After SIR resampling, all the samples have the same weight.

3.2 Bayesian tracking framework for moving cameras

In video sequences acquired by a moving camera, the perceived motion of the objects is composed of the object's own motion and the camera motion. Consequently, it is necessary to estimate the camera motion in order to obtain the object position. Accordingly, the state vector $x_t = \{d_t, g_t\}$ must contain not only the object dynamics, $d_t$ (position and velocity over the image plane), but also the camera dynamics, $g_t$, i.e. the camera ego-motion. The posterior pdf of the state vector is recursively expressed by the equations

$$p(x_t|z_{1:t}) = \frac{p(z_t|x_t)\,p(x_t|z_{1:t-1})}{p(z_t|z_{1:t-1})} \tag{3.10}$$

$$p(x_t|z_{1:t-1}) = \int p(x_t|x_{t-1})\,p(x_{t-1}|z_{1:t-1})\,dx_{t-1}. \tag{3.11}$$

The transition probability $p(x_t|x_{t-1}) = p(d_t, g_t|d_{t-1}, g_{t-1})$ encodes the information about the object and camera dynamics, along with their uncertainty. If the camera motion is not considered, the object dynamics can be modeled by the linear function

$$d'_t = M \cdot d'_{t-1}, \tag{3.12}$$

where $M$ is a matrix that represents a first order linear system of constant velocity. This object dynamic model is a reasonable approximation for a wide range of object tracking applications, provided that the camera frame rate is high enough. The camera

dynamics is modeled by a geometric transformation $g_t$ that ideally is a projective camera model, although, depending on the camera and scene disposition, it can be simplified to an affine or Euclidean transformation. For example, in aerial tracking systems, an affine geometric transformation is a satisfactory approximation of the projective camera model, since the depth relief of the objects in the scene is small enough compared to the average depth, and the field of view is also small (80). The joint dynamic model for the camera and the object is expressed as a composition of both individual models

$$d_t = g_t \cdot M \cdot d_{t-1}. \tag{3.13}$$

Based on this joint dynamic model, the transition probability $p(x_t|x_{t-1})$ can be expressed as

$$p(x_t|x_{t-1}) = p(d_t, g_t|d_{t-1}, g_{t-1}) = p(d_t|d_{t-1}, g_{t-1:t})\,p(g_t|d_{t-1}, g_{t-1}) = p(d_t|d_{t-1}, g_t)\,p(g_t), \tag{3.14}$$

where it has been assumed that, on the one hand, the current object position is conditionally independent of the camera motion in the previous time step (as the proposed joint dynamic model states), and, on the other hand, the current camera motion is conditionally independent of both the camera motion and the object position in previous time steps. This last assumption results from the fact that the camera ego-motion is completely random, not following any specific pattern.

The probability term $p(d_t|d_{t-1}, g_t)$ models the uncertainty of the proposed joint dynamic model as

$$p(d_t|d_{t-1}, g_t) = N(d_t,\; g_t \cdot M \cdot d_{t-1},\; \sigma_{tr}^2), \tag{3.15}$$

where $N(x, \mu, \sigma^2)$ is a Gaussian or Normal distribution of mean $\mu$ and variance $\sigma^2$. Thus, the term $\sigma_{tr}^2$ represents the unknown disturbances of the joint dynamic model.

The other probability term in Eq. 3.14, $p(g_t)$, expresses the probability that one specific geometric transformation represents the true camera motion between consecutive time steps. This is typically computed by a deterministic approach using an image registration algorithm (27), which amounts to expressing $p(g_t)$ as

$$p(g_t) = \delta(g_t - g_t^j), \tag{3.16}$$

where $g_t^j$ is the geometric transformation obtained by the image registration technique, as in (18, 19, 32, 33, 34, 39, 40). However, this approximation can fail in situations in which the aperture problem (43, 81) is quite significant and/or the assumption of only one global motion does not hold, for instance, in the presence of independent moving objects. Under these circumstances there are several putative geometric transformations that can explain the camera ego-motion. Moreover, due to the noise and non-linearities involved in the estimation process, the best geometric transformation according to some error or cost function is not necessarily the actual camera ego-motion. In order to satisfactorily deal with this situation, $g_t$ is addressed as a random variable, rather than a parameter computed in a deterministic way. The specific computation of $p(g_t)$ depends on the tracking application and type of imagery. Two different methods are proposed in Secs. 3.3 and 3.4 for infrared and visible imagery, respectively. In any case, they have in common that they compute an approximation of $p(g_t)$ as

$$p(g_t) \approx \sum_{j=1}^{N_g} w_t^j\, \delta(g_t - g_t^j), \tag{3.17}$$

where $N_g$ is the number of geometric transformations used to represent $p(g_t)$, $\{g_t^j \mid j = 1, \ldots, N_g\}$ are the best transformation candidates to model the camera ego-motion, and $w_t^j$ is the weight of $g_t^j$, which evaluates how well the transformation represents the camera ego-motion.

The likelihood function $p(z_t|x_t)$ in Eq. 3.10 is dependent on the kind of imagery, and the object type that is being tracked. Two different models have been developed, which are described respectively in Secs. 3.3 and 3.4, one based on the detection of blob regions for infrared imagery, and another based on color histograms for visible video sequences. In general terms, due to the presence of clutter and objects similar to the tracked object, the resulting likelihood will be non-Gaussian, nonlinear and multi-modal.

The initial pdf $p(x_0|z_0) \equiv p(x_0)$, called the prior, can be initialized as a Gaussian distribution using the information given by an object detector algorithm. Another alternative is to use the ground truth information (if it is available) to initialize a Dirac delta function $\delta(x_0)$.
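The joint transition model described in this section (Eqs. 3.13 to 3.17) can be sketched as a sampling routine: a camera transformation is first drawn from the weighted set of candidates, and the object state is then propagated with the constant-velocity model, warped by that transformation, and perturbed with Gaussian noise. The state layout `(x, y, vx, vy)`, the representation of an affine transformation as a linear part plus an offset, the choice of applying only the linear part to the velocity, and all function names are assumptions made for this example.

```python
import random


def constant_velocity(d, dt=1.0):
    """Apply the constant-velocity model M to d = (x, y, vx, vy)."""
    x, y, vx, vy = d
    return (x + dt * vx, y + dt * vy, vx, vy)


def apply_affine(g, d):
    """Warp the object state with an affine camera model g = ((a11, a12, a21, a22), (bx, by)).

    Assumption: the offset acts on the position only; the velocity is
    transformed by the linear part.
    """
    (a11, a12, a21, a22), (bx, by) = g
    x, y, vx, vy = d
    return (a11 * x + a12 * y + bx, a21 * x + a22 * y + by,
            a11 * vx + a12 * vy, a21 * vx + a22 * vy)


def sample_transition(d_prev, candidates, weights, sigma_tr, rng):
    """Draw (d_t, g_t) from p(d_t | d_{t-1}, g_t) p(g_t)."""
    g = rng.choices(candidates, weights=weights, k=1)[0]   # g_t from the weighted candidates
    mean = apply_affine(g, constant_velocity(d_prev))      # g_t . M . d_{t-1}
    d = tuple(rng.gauss(m, sigma_tr) for m in mean)        # Gaussian disturbance
    return d, g


rng = random.Random(1)
pure_shift = ((1.0, 0.0, 0.0, 1.0), (5.0, -3.0))   # hypothetical camera translation
d_prev = (0.0, 0.0, 1.0, 2.0)

# With a single candidate and no noise the sample equals the model prediction.
d_t, g_t = sample_transition(d_prev, [pure_shift], [1.0], sigma_tr=0.0, rng=rng)
```

With several candidates and unequal weights, repeated calls spread the particles over the alternative camera motions in proportion to their weights, which is how the multi-hypothesis character of $p(g_t)$ enters the filter.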

3.3 Object tracking in aerial infrared imagery

This section presents the developed object tracking approach for aerial infrared imagery. In contrast to visual-range images, infrared images have low signal-to-noise ratios, objects with low contrast against the background, and non-repeatable object signatures. These drawbacks, along with the competing background clutter and illumination changes due to weather conditions, make the tracking task extremely difficult. On the other hand, the unpredictable camera ego-motion, resulting from the fact that the camera is on board an aerial platform, distorts the spatio-temporal correlation of the video sequence, negatively affecting the tracking performance.

All the aforementioned problems are addressed by a tracking strategy based on the Bayesian tracking framework for moving cameras proposed in Sec. 3.2. Accordingly, the posterior pdf of the state vector $p(x_t|z_{1:t})$ is recursively computed by Eqs. 3.10 and 3.11. The transition probability $p(x_t|x_{t-1})$, which encodes the joint camera and object dynamic model, is given by Eq. 3.14, where the prior probability $p(g_t)$ of the geometric transformation was dependent on the specific type of imagery. For the ongoing tracking application dealing with infrared imagery, the probability $p(g_t^j)$ of a specific geometric transformation $g_t^j$ is based on the quality of the image alignment between consecutive frames achieved by $g_t^j$. The quality of the image alignment (or the ego-motion compensation) is computed by means of the Mean Square Error function, $mse(x, y)$, between the current frame $I_t$, and the previous frame $I_{t-1}$ warped by the transformation $g_t^j$. Thus, the probability $p(g_t^j)$ is mathematically expressed as

$$p(g_t^j) = N\big(mse(I_t,\; g_t^j \cdot I_{t-1}),\; 0,\; \sigma_g^2\big), \tag{3.18}$$

where $N(x, \mu, \sigma^2)$ is a Gaussian distribution of mean $\mu$ and variance $\sigma^2$, and $\sigma_g^2$ is the expected variance of the image alignment process. Notice that $I_t$ is an infrared intensity image.

Finding an observation model for the likelihood $p(z_t|x_t)$ in airborne infrared imagery that appropriately describes the object appearance and its variations over time is quite challenging due to the aforementioned characteristics of the infrared imagery. The most robust and reliable object property is the presence of bright regions, or at least, regions that are brighter than their surrounding neighborhood, which typically

(a) and (b).2. the resulting LoG filter response is strongly multi-modal. The multi-modality feature is clearly observed. and in theory any of the modes could be the right object position.3 Object tracking in aerial infrared imagery (a) (b) Figure 3. in which the target object has been enclosed by a rectangle. The main handicap of the observation model is its lack of distinctiveness. coupled with the camera ego-motion. if only the object dynamics is considered. Fig.2: Two consecutive frames of an infrared sequence acquired by an airborne camera. correspond to the engine and exhaust pipe area of the object. the closest mode to the predicted object location (marked by a vertical black line) is not the true object location.2 and 3. This fact.3 shows the LoG filter response related to Fig. 3. This is accomplished by a rotationally symmetric Laplacian of Gaussian (LoG) filter. dramatically complicate a reliable estimation of the state vector. of an infrared sequence acquired by an airborne camera. the likelihood function uses an observation model that aims to detect the main bright regions of the target. Based on this fact.3. since whatever bright region with an adequate size can be the target object. Fig. so that the filter response be maximum in the bright regions with a size similar to the tracked object.2(b). 3. where the own image has been projected over the filter response for a better interpretation. 3. shows two consecutive frames. characterized by a sigma parameter that is tuned to the lowest dimension of the object size. because of the effects of the camera ego-motion. The first one. Moreover. 3. 25 . This situation is illustrated in Figs. As a consequence.3.

The likelihood probability can be simplified by

$$p(z_t \mid x_t) = p(z_t \mid d_t, g_t) = p(z_t \mid d_t), \tag{3.19}$$

assuming that $z_t$ is conditionally independent of $g_t$ given $d_t$. Then, $p(z_t \mid d_t)$ is expressed by the Gaussian distribution

$$p(z_t \mid d_t) = N(z_t;\ H d_t,\ \sigma_L^2), \tag{3.20}$$

where $z_t$ is the LoG filter response of the frame $I_t$, $H$ is a matrix that selects the object positional information, and the variance $\sigma_L^2$ is set to highlight the main modes of $z_t$, while discarding the less significant ones. This is illustrated in Fig. 3.4, where only the most significant modes of Fig. 3.3 are highlighted.

Figure 3.3: Multimodal LoG filter response related to Fig. 3.2(b).

Figure 3.4: Likelihood distribution related to Fig. 3.3.

As both dynamic and observation models are nonlinear and non-Gaussian, the posterior pdf cannot be analytically determined, and therefore, the use of approximate inference methods is necessary. In the next section, a Particle Filtering strategy is presented to obtain an approximate solution of the posterior pdf.

3.3.1 Particle filter approximation

The posterior pdf $p(x_t \mid z_{1:t})$ is approximated by means of a Particle Filter as

$$p(x_t \mid z_{1:t}) \approx \frac{1}{c} \sum_{i=1}^{N_S} w_t^i\, \delta(x_t - x_t^i), \tag{3.21}$$

where the samples $x_t^i$ are drawn from a proposal distribution based on the likelihood and the prior probability of the camera motion

$$q(x_t \mid x_{t-1}, z_t) = p(z_t \mid d_t)\, p(g_t), \tag{3.22}$$

which is an efficient simplification of the optimal, but not tractable, importance density function (9)

$$q(x_t \mid x_{t-1}, z_t) = p(x_t \mid x_{t-1}, z_t). \tag{3.23}$$

The samples $x_t^i = \{d_t^i, g_t^i\}$ are drawn from the proposal distribution by a hierarchical sampling strategy. This, firstly, draws samples $g_t^i$ from $p(g_t)$, and then, draws samples $d_t^i$ from $p(z_t \mid d_t)$.

The sampling procedure for obtaining samples $g_t^i$ from $p(g_t)$ is based on an image registration algorithm presented in (82). This method assumes an initial geometric transformation $t_t^i$, and then, uses the whole image intensity information to compute a global affine transformation $g_t^i$, which is a candidate for representing the true camera motion. This method explicitly accounts for global variations in image intensities to be robust to illumination changes. However, the computed candidate $g_t^i$ will only be a reasonable approximation of the camera motion if the initial geometric transformation $t_t^i$ is close to the geometric transformation that represents the actual camera motion. This means that the image in the previous time step, warped by the initial transformation, must be closely aligned to the current image to achieve a satisfactory result. This limitation derives from the optimization strategy used in the image registration algorithm, which tends to the closest mode given an initial transformation.
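The hierarchical sampling from the proposal of Eq. 3.22 can be sketched as follows. This is a minimal illustration under assumptions that are not part of the thesis: both $p(g_t)$ and $p(z_t \mid d_t)$ are replaced by small discrete tables, and the names `sample_hierarchical`, `p_g` and `likelihood` are hypothetical. In the thesis the camera samples come from an image registration algorithm and the object samples from an MCMC method.

```python
import random

def sample_hierarchical(p_g, likelihood, n_samples, rng):
    """Draw x^i = (d^i, g^i): first g^i ~ p(g_t), then d^i ~ p(z_t | d_t).

    p_g        -- dict: camera-motion candidate -> (unnormalised) probability
    likelihood -- dict: object position -> (unnormalised) likelihood value
    Both tables are discrete stand-ins for the distributions in the text.
    """
    g_values = list(p_g)
    d_values = list(likelihood)
    # Step 1: camera ego-motion samples from the prior p(g_t).
    g_draws = rng.choices(g_values, weights=[p_g[g] for g in g_values], k=n_samples)
    # Step 2: object samples from the likelihood p(z_t | d_t).
    d_draws = rng.choices(d_values, weights=[likelihood[d] for d in d_values], k=n_samples)
    return list(zip(d_draws, g_draws))

# Hypothetical toy tables: one camera candidate carries all the probability.
p_g = {"g_a": 0.0, "g_b": 1.0}
likelihood = {(20, 20): 0.7, (60, 40): 0.3}
samples = sample_hierarchical(p_g, likelihood, n_samples=10, rng=random.Random(0))
```

Because the proposal factorises as $p(z_t \mid d_t)\,p(g_t)$, the two draws are independent of each other; the coupling between object and camera motion only enters later, through the weights.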

As a consequence, if the two images are not closely aligned, the computed solution will probably correspond to a local mode that does not represent the true camera motion. By default, $t_t^i$ is a 3 × 3 identity matrix that represents the previous image without warping. This approach is inefficient in airborne visual tracking, since the camera can undergo strong displacements that cannot be satisfactorily compensated. To overcome this problem, the previous image registration technique has been improved using several initial geometric transformations $\{t_t^i \mid i = 1, \dots, N_S\}$, obtaining in turn a set of camera ego-motion candidates $\{g_t^i \mid i = 1, \dots, N_S\}$. The set of initial transformations is computed with the purpose that at least one of them is relatively close to the actual camera motion, so that the image registration algorithm can effectively compute the correct geometric transformation. In this context, the concept of closeness between geometric transformations depends, on the one hand, on the magnitude of the camera motion, and, on the other hand, on the capability of the image registration algorithm itself to rectify misaligned images. For example, the ideal situation would be that the magnitude of the camera motion were lower than the maximum displacement that the image registration algorithm is able to rectify.

For the purpose of measuring the magnitude of the camera motion, a subset of video sequences belonging to the AMCOM dataset (see Sec. 3.3.2) has been used as training set to compute the actual camera motion. These sequences have been acquired by different infrared cameras on board a plane. The computation process of the camera motion has been supervised by a user, who not only guides the image alignment, but also evaluates if the reached result is accurate enough to be considered the real camera motion. As a result, a set of affine transformations is obtained, which describes the typical camera movements.

Regarding the image registration algorithm, the following benchmark has been performed to measure its capability to align images. An image is warped by different affine transformations of increasing magnitude until the image registration algorithm is not able to rectify the induced motion. In addition, since the rectification capability depends on the scene structure, this process has been performed with a set of different images belonging to several sequences of the AMCOM dataset. As a result, two sets of affine transformations are obtained, one composed of the camera movements that the algorithm is able to rectify, and another one containing the camera movements that it is not able to rectify.

Analyzing the affine parameters of the previous sets of transformations, it can be concluded that the image registration algorithm can deal with the scale, rotation, and shear transformations that take place in the AMCOM video sequences. Thus, the image registration algorithm can deal with the magnitude of this kind of transformations induced by the camera motion. However, it cannot satisfactorily rectify all the translation transformations. Specifically, when the translation components of the affine transformation are higher than 7-8 pixels, the computed camera motion can be quite inaccurate.

Taking into account the previous conclusions, the set of initial transformations $t_t^i$ is computed as

$$t_t^i = \begin{pmatrix} 1 & 0 & (t_x)_t^i \\ 0 & 1 & (t_y)_t^i \\ 0 & 0 & 1 \end{pmatrix}, \quad i = 1, \dots, N_S, \tag{3.24}$$

where $(t_x)_t^i$ and $(t_y)_t^i$ are the translation components, obtained by uniformly sampling the x-y coordinate space. Note that the rest of the affine parameters are equivalent to the identity matrix, meaning that there is no scale, rotation or shear warping with respect to the previous image frame. The maximum sampling step should be less than 7-8 pixels, in such a way that at least one $t_t^i$ will be close to the actual camera motion, since, as was stated before, the corresponding $g_t^i$ obtained by the image registration algorithm using that $t_t^i$ is then an accurate representation of the camera ego-motion. For the ongoing tracking application, the sampling step has been fixed to 5 pixels, which is a good trade-off between the achieved camera motion accuracy, so that at least one $g_t^i$ accurately represents the camera motion, and the computational cost related to the estimation of $\{g_t^i \mid i = 1, \dots, N_S\}$.

Figs. 3.5 and 3.6 illustrate an example of the sampling procedure from $q(g_t) = p(g_t)$. Fig. 3.5 shows a set of translational transformations, $\{t_t^i \mid i = 1, \dots, 441\}$, arranged in a 21×21 grid, that are used as initial transformations for the image registration technique. The sampling step for the translational components, $(t_x)_t^i$ and $(t_y)_t^i$, is set to 5 pixels, as was stated before. Fig. 3.6 shows the probability value of each affine transformation $g_t^i$ given by the image registration algorithm using the previous initial transformations. The probability values $p(g_t^i)$ are arranged in the same way as the previous grid of initial transformations, and are encoded with a color scale. Notice that the default solution obtained by the original image registration algorithm (82) (that uses the identity matrix as initial transformation) corresponds to $g_t^{221}$, located just in the middle of the grid. This affine transformation does not represent a good approximation of the camera motion, since the related probability value is much lower than the one corresponding to $g_t^{74}$, which is the global maximum, and in this case is the best representation of the camera motion.

Figure 3.5: Set of translational transformations used as initialization transformations in the image registration method.

Figure 3.6: Probability values of $\{g_t^i,\ i = 1, \dots, 441\}$, representing the camera ego-motion candidates.
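The grid of initial transformations of Eq. 3.24 can be generated as in the following sketch. The 21×21 layout and the 5-pixel step come from the text; the row-major ordering, the centring of the grid on zero translation and the function name are assumptions made for illustration.

```python
def initial_transformations(grid_size=21, step=5.0):
    """Return grid_size**2 homogeneous 3x3 matrices that only translate.

    The translations are sampled uniformly with the given step and are
    centred on zero, so the middle element is the identity matrix.
    Row-major ordering is an assumption of this sketch.
    """
    half = grid_size // 2
    transforms = []
    for row in range(grid_size):
        for col in range(grid_size):
            tx = (col - half) * step
            ty = (row - half) * step
            transforms.append([[1.0, 0.0, tx],
                               [0.0, 1.0, ty],
                               [0.0, 0.0, 1.0]])
    return transforms

grid = initial_transformations()
# With 1-based indexing as in the text, the 221st transformation is the
# middle of the grid, i.e. grid[220] with Python's 0-based indexing.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
largest_shift = max(abs(t[0][2]) for t in grid)
```

Under these assumptions the grid covers translations of up to 50 pixels in each direction, and each transformation would be passed as the initial guess to the image registration algorithm to obtain one candidate $g_t^i$.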

In a typical situation, due to the particularities of the proposed sampling procedure, only a few affine transformations of the whole set $\{g_t^i \mid i = 1, \dots, N_S\}$ will have enough precision to actually represent the camera motion. This can be checked by evaluating the probability value of each affine transformation $p(g_t^i)$, where only the transformations with enough accuracy have a significant value. To alleviate this situation, a resampling stage is performed using the SIR algorithm to place more samples $g_t^i$ over the regions of the affine space that concentrate the majority of the probability. An alternative approach could be to replicate the sample with the highest probability value, since it should represent the most accurate camera motion. In this case, the sampling procedure would be equivalent to an optimization approach based on a stochastic search, since only the best sample is used. However, in situations with independent moving objects, this statement does not hold, since the moving objects bias the camera ego-motion estimation. As a consequence, the actual camera motion could be represented by one $g_t^i$ of lower probability. For this reason, a probabilistic representation of the camera motion, based on discrete samples, is more efficient than a deterministic approach based on the estimation of the best transformation that represents the camera motion.

Once the samples of the camera dynamics $g_t^i$ are computed, the samples of the object dynamics $d_t^i$ are drawn from the likelihood $p(z_t \mid d_t)$ (Eq. 3.22), to finally obtain $x_t^i = \{d_t^i, g_t^i\}$. This is a convenient decision since the main modes of the posterior distribution also appear in the likelihood function. Sampling from the likelihood function is not a trivial task, since it is a bivariate function composed of narrow modes (see Fig. 3.4). To deal with this issue, a Markov Chain Monte Carlo (MCMC) sampling method is proposed, which is able to efficiently represent the likelihood function by a reduced number of samples. The MCMC approach generates a sequence of samples $\{d_t^i \mid i = 1, \dots, N_S\}$ by means of a Markov Chain, in such a way that the stationary distribution is exactly the target distribution. The Metropolis-Hastings (79, 83) algorithm is a method of this type that uses a proposal distribution for simulating such a chain. The appropriate selection of the proposal distribution is the key for the efficient sampling of the target distribution. For the case of the likelihood $p(z_t \mid d_t)$ sampling, a Gaussian function, with mean zero and a variance proportional to the lowest size dimension of the tracked object, has proven to be efficient. Another fundamental issue is the initialization of the Markov Chain. Since the likelihood function concentrates almost all the probability in a few sparse regions of the state space (i.e., in its sparse narrow modes), the Markov Chain needs a large amount of samples to correctly simulate it. A more efficient approach is to use a set of Markov Chains, with different initialization states given by the main local maxima of the likelihood distribution. In this way, the likelihood is efficiently simulated by a reduced number of samples located on the main modes. Fig. 3.7 shows the result of applying the proposed Metropolis-Hastings sampling algorithm to simulate the likelihood distribution depicted in Fig. 3.4. The samples have been marked with circles. Notice that the samples are on the main modes of the likelihood distribution, in spite of the relatively low number of used samples.

Figure 3.7: Metropolis-Hastings sampling of the likelihood distribution depicted in Fig. 3.4.

The weights $w_t^i$ related to the drawn samples $x_t^i = \{d_t^i, g_t^i\}$ are computed using Eq. 3.9, which can be simplified using the expression of the proposed proposal distribution as

$$w_t^i = w_{t-1}^i \frac{p(z_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, z_t)} = w_{t-1}^i \frac{p(z_t \mid d_t^i)\, p(d_t^i \mid d_{t-1}^i, g_t^i)\, p(g_t^i)}{p(z_t \mid d_t^i)\, p(g_t^i)} = w_{t-1}^i\, p(d_t^i \mid d_{t-1}^i, g_t^i). \tag{3.25}$$

Thus, the samples that best fit the joint camera and object dynamic model will have more relevance than the rest.

After computing the samples and weights to approximate the posterior distribution, a resampling step is applied to reduce the degeneracy problem. This is accomplished by means of the SIR algorithm (Sec. 3.1.1), by selecting more times the samples with higher weights, while the ones with an insignificant weight are discarded. Figs. 3.8 and 3.9 show the estimated posterior probability, $p(x_t \mid z_{1:t})$, and the result of applying the SIR resampling, respectively. Notice that the samples corresponding to modes related to background structures have a lower weight than the ones related to the tracked object, due to the coherence with the expected camera and object dynamics. After SIR resampling, all the samples have the same weight.

Figure 3.8: Particle Filtering based approximation of the posterior probability $p(x_t \mid z_{1:t})$.

Figure 3.9: SIR resampling of $p(x_t \mid z_{1:t})$.

In general terms, the resulting posterior probability can be quasi unimodal (if there is only one significant mode), or multimodal. This fact depends on the distance between the mode corresponding to the tracked object, and the modes relative to the background in the likelihood function. While for the case of a quasi unimodal posterior probability, the state estimation can be efficiently performed by means of the MMSE estimator, for the case of a multimodal posterior probability, the MMSE estimator does not produce a satisfactory estimation, since the background modes bias the result. To avoid such a bias in the estimation, the MMSE estimator should only use the samples relative to the tracked object mode, discarding the rest. This is achieved by means of a bivariate Gaussian kernel $N(x; \mu_e, \Sigma_e)$ of mean $\mu_e$ and covariance matrix $\Sigma_e$ (79), which gives more relevance to the samples located close to the Gaussian mean. In this way, when the Gaussian mean is centered over the tracked object mode, only its samples will have significant value. The proposed Gaussian-MMSE estimator is mathematically expressed as

$$\hat{x}_t = \max_{x_t^l,\ l = 1, \dots, N_S}\ \frac{1}{N_S} \sum_{i=1}^{N_S} N(x_t^i;\ x_t^l,\ \Sigma_e)\, p(x_t^i \mid z_{1:t}), \tag{3.26}$$

where the covariance matrix $\Sigma_e$ determines the bandwidth of the Gaussian kernel, which must be coherent with the size of the tracked object mode. Taking into account the relationship between the size of the tracked object mode and the bandwidth of the LoG filter used in the object detection (Sec. 3.3), which in turn was set according to the object size, an efficient covariance matrix can be estimated as

$$\Sigma_e = \begin{pmatrix} s_x/2 & 0 \\ 0 & s_y/2 \end{pmatrix}, \tag{3.27}$$

where $s_x$ and $s_y$ are the width and height of the object, respectively, which are the same parameters as the ones used in the LoG based object detector. Fig. 3.10 shows the result of applying the Gaussian kernel over $p(x_t \mid z_{1:t})$, along with the maximum corresponding to the final estimation $\hat{x}_t$, that has been marked by a black circle. Fig. 3.11 shows the tracked object accurately enclosed by a white rectangle corresponding to the estimated $\hat{x}_t$.

3.3.2 Results

The proposed object tracking algorithm has been tested using the AMCOM dataset. This consists of 40 infrared sequences acquired from cameras mounted on an airborne platform. A variety of moving and stationary terrestrial targets can be found in two different wavelengths: mid-wave (3µm - 5µm) and long-wave (8µm - 12µm). In general, the tracking task is quite challenging in this dataset due to the strong camera ego-motion, the magnification and pose variations of the target signatures, and the characteristics of the FLIR imagery described in Sec. 3.3.
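As a reference for the results that follow, the Gaussian-MMSE estimator of Eqs. 3.26 and 3.27 (Sec. 3.3.1) can be sketched as below. This is an illustrative reading of the equations, not the thesis code: the posterior value $p(x_t^i \mid z_{1:t})$ of each sample is taken to be its weight, the estimate is taken to be the sample $x_t^l$ that attains the maximum, and the function names and toy data are hypothetical.

```python
import math

def gaussian_kernel(x, mean, var_x, var_y):
    """Bivariate Gaussian density with diagonal covariance diag(var_x, var_y)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    norm = 2.0 * math.pi * math.sqrt(var_x * var_y)
    return math.exp(-0.5 * (dx * dx / var_x + dy * dy / var_y)) / norm

def gaussian_mmse(samples, weights, width, height):
    """Return the sample x^l that maximises the kernel-smoothed posterior."""
    var_x, var_y = width / 2.0, height / 2.0   # entries of Sigma_e (Eq. 3.27)
    n = len(samples)
    best, best_score = None, -1.0
    for candidate in samples:                  # candidate plays the role of x^l
        score = sum(gaussian_kernel(x, candidate, var_x, var_y) * w
                    for x, w in zip(samples, weights)) / n
        if score > best_score:
            best, best_score = candidate, score
    return best

# Hypothetical toy posterior: an object mode around (50, 50) and a weaker
# background mode around (10, 10).
samples = [(49, 50), (50, 50), (51, 50), (50, 51), (10, 10), (11, 10)]
weights = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]

estimate = gaussian_mmse(samples, weights, width=8, height=8)
# Plain MMSE (weighted mean) for comparison; it is pulled towards the
# background mode.
mmse = (sum(w * s[0] for s, w in zip(samples, weights)),
        sum(w * s[1] for s, w in zip(samples, weights)))
```

In this toy case the Gaussian-MMSE estimate stays on the object mode, whereas the plain weighted mean is shifted towards the background samples, which is the bias the estimator is meant to avoid.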

Figure 3.10: Result of applying the Gaussian kernel over $p(x_t \mid z_{1:t})$ (depicted in Fig. 3.8), along with the final state estimation $\hat{x}_t$ marked by a black circle.

Figure 3.11: Tracked object accurately enclosed by a white rectangle.

The proposed object tracking algorithm has been compared with two other tracking algorithms to prove its superior performance. Both algorithms are inspired by the existing works (25, 26), which also use a Particle Filter framework for the tracking, making it easier and fairer to compare the performance of all the three algorithms. The three algorithms differ in the way they tackle the ego-motion: Bayesian modeling, deterministic modeling, and no explicit modeling. The developed algorithm uses a Bayesian model for the ego-motion, and it is referred to as tracking with Bayesian ego-motion handling (BEH). In this model, $p(g_t)$ was approximated by a set of discrete samples (Sec. 3.3.1). The second algorithm is based on a deterministic modeling, and is referred to as tracking with deterministic ego-motion handling (DEH). It models the ego-motion by only one affine transformation, which is equivalent to expressing $p(A_t)$ by a Kronecker's delta centered in $g_t^d$, an affine transformation deterministically computed through the image registration algorithm described in (82). The last algorithm, referred to as tracking with no ego-motion handling (NEH), does not have an explicit model for the camera ego-motion, which leads to a simplified expression of the state transition probability

$$p(x_t \mid x_{t-1}) = N(d_t;\ M \cdot d_{t-1},\ \sigma_{tr}^2), \tag{3.28}$$

where the value of the parameter $\sigma_{tr}^2$ should be larger than for the BEH and DEH algorithms to try to alleviate the ego-motion effect.

With the purpose of making a fair comparison, the same number of samples has been used for the three algorithms: $N_S = 300$. This number is enough to ensure a satisfactory approximation of the state posterior probability given the specific characteristics of the AMCOM dataset. In the same way, the same value $\sigma_{tr}^2 = 2$ has been chosen for the BEH and DEH algorithms, while a value of $\sigma_{tr}^2 = 4$ has been chosen for the NEH algorithm, in order to alleviate its lack of an explicit ego-motion model, and make it comparable with the other algorithms. The BEH algorithm needs an extra parameter which has been heuristically set to $\sigma_g^2 = 0.03$, offering good results for the given AMCOM dataset, although other values with a variation of less than 15 percent have also offered similar results.

In the two following subsections, two different tracking situations are evaluated to demonstrate the higher performance of the BEH algorithm in complex ego-motion situations. The last subsection presents the overall tracking results for each of the three algorithms using the aforementioned AMCOM dataset.

3.3.2.1 Strong ego-motion situation

The BEH algorithm has been compared with the DEH and NEH ones for a situation of strong ego-motion. Fig. 3.12 shows the common intermediate results for all the three algorithms, where Fig. 3.12(a) and (b) show two consecutive frames that have undergone a large displacement, in which the target object has been enclosed by a black rectangle as visual aid. Fig. 3.12(c) shows the multimodal likelihood function, and lastly, Fig. 3.12(d) shows the resulting Metropolis-Hastings based sampling, where each sample has been marked by a black circle.
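The Metropolis-Hastings based sampling of the likelihood, with several chains initialized at the main local maxima (Sec. 3.3.1), can be sketched as follows. It is a minimal random-walk variant written under assumptions: the toy likelihood, the mode positions, the proposal standard deviation and all function names are hypothetical, and the zero-mean Gaussian proposal is applied as a symmetric perturbation of the current sample.

```python
import math
import random

def metropolis_hastings(target, start, n_samples, proposal_std, rng):
    """Random-walk Metropolis-Hastings chain for an unnormalised 2-D target."""
    current = start
    current_p = target(current)
    chain = []
    for _ in range(n_samples):
        # Zero-mean Gaussian perturbation of the current state.
        candidate = (current[0] + rng.gauss(0.0, proposal_std),
                     current[1] + rng.gauss(0.0, proposal_std))
        candidate_p = target(candidate)
        # Symmetric proposal: accept with probability min(1, p(cand) / p(cur)).
        if candidate_p >= current_p or rng.random() < candidate_p / current_p:
            current, current_p = candidate, candidate_p
        chain.append(current)
    return chain

def multi_chain_sampling(target, starts, samples_per_chain, proposal_std, seed=0):
    """Run one chain per initial state (e.g. the main local maxima)."""
    rng = random.Random(seed)
    samples = []
    for start in starts:
        samples.extend(
            metropolis_hastings(target, start, samples_per_chain, proposal_std, rng))
    return samples

# Hypothetical bimodal likelihood with two narrow modes.
MODES = [(20.0, 20.0), (60.0, 40.0)]

def toy_likelihood(d):
    return sum(math.exp(-((d[0] - mx) ** 2 + (d[1] - my) ** 2) / (2.0 * 2.0 ** 2))
               for mx, my in MODES)

samples = multi_chain_sampling(toy_likelihood, MODES,
                               samples_per_chain=50, proposal_std=2.0)
```

Starting one chain at each main mode avoids the long burn-in that a single chain would need to travel between sparse, narrow modes, which is the motivation given in the text for using a set of Markov Chains.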

Figure 3.12: Common intermediate results for all the three tracking algorithms in a situation of strong ego-motion. (a) Frame 15. (b) Frame 16. (c) Likelihood $p(z_t \mid x_t)$. (d) Sampling.

Fig. 3.13 shows the tracking results for the BEH algorithm. The probability values of $g_t^i$ (the estimated affine transformations) are shown in Fig. 3.13(a), which have been arranged in a rectangular grid, in a similar way to that of Fig. 3.6. The probability values are displayed using a color scale. Notice that there is a peak in the middle left side, indicating that the camera has undergone a strong right translation motion. Fig. 3.13(b) shows the sampled posterior probability, where the samples $d_t^i$ with higher weights are correctly located over the target object, thanks to the Bayesian treatment of the camera ego-motion. Fig. 3.13(c) shows the result of applying the Gaussian kernel over the sampled posterior probability, which is used by the Gaussian-MMSE estimator to compute the final state estimation (marked as a black circle). Observe that the infrared image is projected over the X-Y plane of each probability distribution as visual aid. Finally, Fig. 3.13(d) shows the target object satisfactorily enclosed by a white rectangle, whose coordinates are determined by the state estimation.

Figure 3.13: Tracking results for the BEH algorithm in a situation of strong ego-motion. (a) Probability values of $g_t^i$. (b) Sampled posterior probability. (c) Gaussian-MMSE based estimation. (d) Object tracking result.

Figs. 3.14 and 3.15 show the tracking results for the DEH and NEH algorithms, respectively. Notice that the tracking fails in both cases, since the dynamic model does not correctly represent the camera and object dynamics, and consequently the tracking drifts to another mode of the likelihood function. In the case of the DEH algorithm, this fact can be checked by observing that the estimated affine transformation corresponds to the one located in the coordinates (11, 11) of Fig. 3.13(a), which has a probability value much lower than the one related to the true camera motion.

Figure 3.14: Tracking results for the DEH algorithm in a situation of strong ego-motion. (a) Sampled posterior probability. (b) Gaussian-MMSE based estimation. (c) Object tracking result.

Figure 3.15: Tracking results for the NEH algorithm in a situation of strong ego-motion. (a) Sampled posterior probability. (b) Gaussian-MMSE based estimation. (c) Object tracking result.

3.3.2.2 High uncertainty ego-motion situation

A comparison of the tracking performance of all the three algorithms (BEH, DEH, and NEH) is presented for a situation where the ego-motion estimation is especially challenging due to the aperture problem (the frames are very low-textured). Fig. 3.16 shows common intermediate results, in which the first column shows five consecutive frames, where the target object has been enclosed by a black rectangle as visual aid. The following two columns show the resulting multimodal likelihood function and the Metropolis-Hastings based sampling for each frame, respectively.

Fig. 3.17 shows the tracking results for the BEH algorithm. The probability values of $g_t^i$ (the estimated affine transformations) are shown in the first column, which have been arranged in a rectangular grid in a similar way to that of Fig. 3.6. The probability values are displayed using a color scale. Notice that there is not a well defined peak, unlike the strong ego-motion situation (Fig. 3.13(a)), but there is a set of affine transformation candidates with similar probability values, meaning that any of them could be the true camera motion. The affine transformations with higher probability value are located in the horizontal direction, indicating that the aperture problem is especially significant in that direction. In other words, the horizontal translation of the camera motion cannot be reliably computed between consecutive frames. The second column of Fig. 3.17 shows the sampled posterior probability related to each frame. Notice that there are several samples with high weights that are not located over the target object, as a consequence of the high uncertainty in the camera ego-motion estimation. However, the majority of samples that have a high weight are located over the target object, which allows it to be tracked satisfactorily. This fact can be verified by observing the last two columns, which respectively show the Gaussian-MMSE estimation, and the tracking result, where the target object has been satisfactorily enclosed by a rectangle (whose coordinates are determined by the state estimation). Figs. 3.18 and 3.19 show the tracking results for the DEH and NEH algorithms,

3. 169 Fr. 41 . 172 Frames 168-172 Likelihood p(zt |xt ) Sampling Figure 3.3 Object tracking in aerial infrared imagery Fr. 171 Fr. 170 Fr.16: Common intermediate results for all the three tracking algorithms in a situation where the ego-motion compensation is especially challenging due to the aperture problem. 168 Fr.

Figure 3.17: Tracking results for the BEH algorithm in a situation where the ego-motion compensation is especially challenging due to the aperture problem. Columns: probability values of $g_t^i$, sampled posterior probability, Gaussian-MMSE estimation, object tracking result.

arranged in the same way as Fig. 3.17. Observe that the tracking fails in frame 172 for the DEH algorithm, and also in frame 171 for the NEH algorithm. These failures arise from the accumulation of slight errors in the estimation of the object location, which, in turn, are caused by the poor characterization of the camera ego-motion.

3.3.2.3 Global tracking results

Global performance results of the BEH, DEH, and NEH algorithms using the AMCOM dataset are shown in Tab. 3.1. The first two columns show the sequence name and the target name, respectively. The third, fourth and fifth columns show the first frame, the last frame, and the number of consecutive frames in which the target appears. The remaining columns show the performance of all the three algorithms, measured by means of the number of tracking failures and the tracking accuracy. The number of failures indicates the number of times that the target object has been lost. An object is considered to be lost when the rectangle that encloses the object does not overlap the one corresponding to the ground truth. On the other hand, the tracking accuracy has been defined as the average Euclidean distance between the estimated object locations and the locations given by the ground truth. Therefore, the accuracy is better when its value is lower. It is important to note that the tracking algorithms are not reinitialized in case of tracking failure, since one of the more appealing advantages of the Particle Filter framework is its capability of recovering from tracking failures thanks to the handling of multiple hypotheses (or samples). This fact affects the tracking accuracy measurement, since all the erroneous object locations, derived from tracking failures, have been taken into account in its estimation. Therefore, the sequences with a lot of tracking failures will have a much worse tracking accuracy.

From the analysis of the number of tracking failures, it can be summarized that there are 11 situations (sequences) in which the BEH algorithm outperforms the DEH one, and 16 situations in which the BEH algorithm outperforms the NEH one. Regarding the DEH algorithm, there are 11 situations in which it outperforms the NEH one, and 3 situations in which it outperforms the BEH algorithm. Lastly, there are 4 situations in which the NEH algorithm outperforms the DEH one, and no situation in which it outperforms the BEH algorithm. In the rest of situations, the performance is similar for all the three algorithms. To sum up, the BEH algorithm is the best of all, and the DEH algorithm is better than the NEH one, as was expected. The errors obtained

Figure 3.18: Tracking results for the DEH algorithm in a situation where the ego-motion compensation is especially challenging due to the aperture problem. Columns: sampled posterior probability, Gaussian-MMSE estimation, object tracking result.

Figure 3.19: Tracking results for the NEH algorithm in a situation where the ego-motion compensation is especially challenging due to the aperture problem. Columns: sampled posterior probability, Gaussian-MMSE estimation, object tracking result.

by the DEH and NEH algorithms arise from the poor characterization of the camera ego-motion, which is satisfactorily solved by the BEH algorithm. The results about the tracking accuracy follow the same trend. An interesting fact happens when the ego-motion is quite low: the tracking accuracy of the DEH algorithm is slightly better than that of the BEH one. The reason is that a deterministic approach introduces less uncertainty than a Bayesian approach for situations in which it is possible to reliably compute the camera motion. Nonetheless, the improvement is insignificant, and in addition, the BEH algorithm is able to cope with a wider range of situations than the rest. There is one situation in which none of the three algorithms can ensure a correct tracking. This situation arises when the likelihood distribution has false modes very close to the true one (corresponding to the tracked object), and the apparent motion of the tracked object is very low. Under these circumstances, the tracker can be locked on a false mode.
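The two performance measures used in Tab. 3.1 can be sketched as follows. This is only an interpretation under stated assumptions: failures are counted per frame in which the estimated rectangle does not overlap the ground-truth one (the table values suggest a per-frame count, but the text does not state it explicitly), object locations are taken as rectangle centres, and the rectangle format `(x_min, y_min, x_max, y_max)` and all names are hypothetical.

```python
import math

def rectangles_overlap(a, b):
    """True if rectangles (x_min, y_min, x_max, y_max) share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def centre(rect):
    return ((rect[0] + rect[2]) / 2.0, (rect[1] + rect[3]) / 2.0)

def count_failures(estimated, ground_truth):
    """Number of frames in which the target is considered lost."""
    return sum(1 for e, g in zip(estimated, ground_truth)
               if not rectangles_overlap(e, g))

def tracking_accuracy(estimated, ground_truth):
    """Average Euclidean distance between estimated and true locations.

    All frames are included, also those counted as failures, as in the text.
    """
    distances = []
    for e, g in zip(estimated, ground_truth):
        (ex, ey), (gx, gy) = centre(e), centre(g)
        distances.append(math.hypot(ex - gx, ey - gy))
    return sum(distances) / len(distances)

# Hypothetical three-frame example: exact hit, small offset, lost target.
ground_truth = [(0, 0, 10, 10)] * 3
estimated = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]

failures = count_failures(estimated, ground_truth)
accuracy = tracking_accuracy(estimated, ground_truth)
```

The example also shows why sequences with many failures have a much worse accuracy value: the single lost frame dominates the average distance.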

3.4 Object tracking in aerial and terrestrial visible imagery

This section presents an object tracking approach for visual-range imagery. The theoretical basis is almost the same as for the object tracking for infrared imagery (Sec. 3.3). The main differences are given by the specific characteristics of the color (visual range) imagery, which are exploited to build a robust appearance model of the tracked object. Also, the estimation of the camera motion is handled in a specific way, since, in contrast to infrared imagery, it is possible to use image registration techniques based on features. These techniques have several advantages in comparison with the region/area based ones, such as a lower computational cost, a higher robustness to independently moving objects, and a greater tolerance to large camera displacements. The object tracking is based on the same Bayesian framework as the one described in Sec. 3.3 for airborne infrared imagery. However, the prior probability $p(g_t)$ of the camera motion, the observation model, and the likelihood $p(z_t \mid x_t)$ are different, specifically designed for the color imagery. The prior probability $p(g_t)$ is based on the quality of the image alignment achieved by $g_t$, the affine transformation that models the camera motion. This affine trans-


                                                    BEH                 DEH                 NEH
Sequence   Target       First   Last  Frames  Failures  Accuracy  Failures  Accuracy  Failures  Accuracy
L19NSS     M60              1     57      57        15      6.19         0      4.10        14      6.78
L19NSS     mantruck        86    100      15         0      7.33         2     12.40         2     12.30
L19NSS     mantruck       211    274      64         0      4.90         0      6.69         0      4.98
L1415S     Mantruck         1    280     280         8     14.14        10     14.84        25     16.32
L1607S     mantrk         225    409     185         0      3.49         0      3.56         0      4.78
L1608S     apc1             1     79      79         0      2.11         0      2.19         0      1.99
L1608S     M60              1    289     289         0      2.95         0      2.74         1      3.73
L1608S     mantrk         193    289      97         0      3.93         0      3.38        48      9.09
L1618S     apc1             1    290     290         0     13.73         0     18.90         0     16.57
L1618S     M60              1    100     100         0      3.51         0      1.87         0      2.35
L1701S     Bradley          1    370     370         0      6.54         0     10.89         0      8.21
L1701S     pickup(trk)      1     30      30         0      1.99         0      1.99         0      1.70
L1702S     Mantruck       113    179      67         0      3.27         8      6.18         0      4.56
L1702S     Mantruck       631    697      67         0      2.57         0      2.43         0      2.59
L1720S     target           1     34      34         0      2.91         0      2.60         4      4.99
L1720S     M60             43    777     735        15      6.59         0      5.80         0      4.39
L1805S     apc1             1     86      86         0      1.41         0      1.52         0      1.48
L1805S     M60            615    641      27         0      4.88         0      4.80        19     13.15
L1805S     tank1          614    734     125         0      2.14         0      2.17       163     65.73
L1812S     M60             72    157      86         0      2.53         0      2.52         0      2.71
L1813S     apc1             1    167     167         0      2.68         0      2.64         0      2.84
L1817S-1   M60              1    193     193         0      4.46         0      3.41         0      4.18
L1817S-2   M60              1    189     189         0      4.93         0      4.43         0      4.75
L1818S     apc1            21    112      92        32      6.01        72     10.22        72     26.05
L1818S     M60             81    202     122        60     10.55       119     17.79       118     85.25
L1818S     tank1          151    364     214       119     35.89       213     54.29       213     55.47
L1906S     Mantruck         1    203     203         0      8.47         0      5.58         0      8.21
L1910S     apc1            56    129      74         0     10.70         0     11.47         0     11.15
L1911S     apc1             1    164     164         0      8.15         0      9.19         0      8.75
L1913S     apc1             1    264     264        25      6.79       262     29.93       262     27.69
L1913S     M60            182    264      61         0     10.42         0      9.19        98     15.70
L1918S     tank1           26    259     234        15      7.03        20      8.98       144      9.46
L2018S     tank1            1    447     447         0      3.77         0      4.52         4      4.32
L2104S     bradley          1    320     320        51      5.60       319     34.39       319     43.82
L2104S     tank1           69    759     691       235     16.88       629     56.99       673     48.33
L2208S     apc1             1    379     379         0      5.88         0      5.40         0      5.53
L2312S     apc1             1    367     367         0      2.11         0      1.70         0      2.10
M1406S     Bradley          1    379     379         0      1.84         0      1.92         0      2.72
M1407S     Bradley          1    399     399         0      4.83         0      5.77         0      4.37
M1410S     tank1            1    497     497         0      7.45         0      5.25         0      3.91
M1413S     Mantruck         1    379     379         2      7.44         1      7.73         0      6.67
M1415S     Mantruck         1     10      10         0      3.14         2      6.68         0      3.09
M1415S     Mantruck        15    527     513         7      5.42       512     62.30       512     65.16

Table 3.1: Comparison of the tracking performance of the BEH, DEH and NEH algorithms using the AMCOM dataset. See the text for a detailed description of the results.

47

The spatiogram of a region in a HSV color image is defined as a vector h = [h1 . can be randomly large. hNB ].31) 2 where σg quantifies the uncertainty in the estimation of the feature correspondence. . σg (3. This allows to capture a richer description of the object. The residual distance ′ is only meaningful for correct correspondences. . Due to this fact. The goodness of fit of gt with one correspondence is determined by the residual distance j dj = ftj − (gt · ft−1 ) ′ (3. which. The first component nb is the 48 .1 for more details). According to this. . the quality of the ego-motion compensation) is based on the goodness of fit of gt with the set of matched features. since the residual distance of erroneous correspondences.e. ftj is the j th feature detected in the color image It .3. which is given by a Normal distribution of zero mean and variance σg 2 p(gt ) = N qg . to robustly estimate the quality of the image alignment. which is similar to a color histogram. The set of inliers is also estimated by the image registration algorithm. BAYESIAN TRACKING WITH MOVING CAMERAS formation is obtained by means of an image registration algorithm based on features (see Sec. The quality of the image alignment (i. computes the correspondence of features between consecutive images. firstly. µb . Σb ]. the outliers can drift the quality of the image alignment related to gt . 3. where the bth component is given by hb = [nb . only the inliers should be taken into account. called inliers.3. but augmented with the spatial means and covariances of the histogram bins. and j ft−1 is the corresponding feature in the previous color image It−1 . then. used to compute the prior probability 2 p(gt ). and Nc is the number of inliers. and then estimates the affine transformation that best fits with the set of matched features. 0. Therefore.30) where din is the set of the residual distances related to the inliers. since a basic spatial distribution of the object color is considered. 
.29) where x is the Euclidean norm. (3. The object model describes the appearance of the object through a color spatiogram (84). called outliers. The quality of the image alignment qg is. the quality is computed by qg = dj ∈d in dj /Nc . The object model uses the color as the main feature to characterize the tracked object.

number of pixels belonging to the $b$-th bin, according to their HSV triplet value. This is similar to a standard HSV color histogram. The remaining components, $\mu_b$ and $\Sigma_b$, are respectively the mean vector and the covariance matrix of the spatial coordinates of the pixels contributing to the $b$-th bin.

In order to compute the similarity between two image regions, characterized by their spatiograms $h^{r_1}$ and $h^{r_2}$, a variation of the Bhattacharyya coefficient is calculated

$$\rho_w(h^{r_1}, h^{r_2}) = \sum_{b=1}^{N_B} w_b \, \rho(n_b^{r_1}, n_b^{r_2}), \qquad (3.32)$$

where $\rho(n_b^{r_1}, n_b^{r_2}) = \sqrt{n_b^{r_1} n_b^{r_2}}$, and the weights are defined by $w_b = N(\mu_b^{r_1}; \mu_b^{r_2}, \Sigma_b^{r_2}) \cdot N(\mu_b^{r_2}; \mu_b^{r_1}, \Sigma_b^{r_1})$. The function $N(x; \mu, \Sigma)$ is a multivariate Gaussian function evaluated at $x$.

Fig. 3.20 contains three images that show a graphical example of the similarity measurement between the region that encloses the tracked object and all the candidate regions in an image. Fig. 3.20(a) shows the reference image with the object enclosed by a bounding box, Fig. 3.20(b) shows another image of the same scene where the target object has a different spatial position, and Fig. 3.20(c) shows the result of computing the Bhattacharyya coefficient between the reference region of Fig. 3.20(a) and all the possible regions of Fig. 3.20(b). Note that there is a prominent mode corresponding to the target object.

Once the observation model has been defined, the likelihood can be expressed as

$$p(z_t|x_t) = p(z_t|d_t, g_t) = p(z_t|d_t) = N(d_w(h^{r_1}, h^{r_2}); 0, \sigma_d^2), \qquad (3.33)$$

where it has been assumed that $z_t$ is conditionally independent of $g_t$ given $d_t$, i.e. the camera motion is not involved in the likelihood computation. The function $N(x; \mu, \sigma^2)$ is a Gaussian distribution of mean $\mu$ and variance $\sigma^2$. The function $d_w(h^{r_1}, h^{r_2}) = 1 - \rho_w(h^{r_1}, h^{r_2})$ is the Bhattacharyya distance between $h^{r_1}$, the spatiogram of the tracked object, and $h^{r_2}$, the spatiogram of a candidate image region in the current image $I_t$, located at the coordinates given by $d_t$. The variance $\sigma_d^2$ models the expected variation of the Bhattacharyya distance due to changes in the object appearance.

The resulting posterior pdf cannot be analytically determined because both the dynamic and observation models are nonlinear and non-Gaussian. Therefore, the use of approximate inference methods is necessary. In the next section, a Particle Filtering

Figure 3.20: Graphical example of the similarity measurement between image regions based on the Bhattacharyya coefficient. (a) Image containing the target object. (b) Image containing the same object in a different spatial position. (c) Computation of the Bhattacharyya coefficient.
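As an illustration of the similarity measure and the likelihood defined in this section (Eqs. 3.32 and 3.33), the following Python sketch builds a spatiogram from quantized pixel labels and evaluates the weighted Bhattacharyya coefficient. The function names, the normalization of the bin counts to sum to one, the regularization term `eps`, and any value chosen for `sigma_d` are assumptions made for the example; they are not specified in the text.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    diff = np.asarray(x, float) - np.asarray(mean, float)
    cov = np.asarray(cov, float)
    norm = np.sqrt(((2.0 * np.pi) ** diff.size) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def spatiogram(bin_idx, coords, n_bins, eps=1e-3):
    """Return (n_b, mu_b, Sigma_b) for every bin b.

    bin_idx: quantized color label of each pixel (e.g. from its HSV triplet).
    coords:  (N, 2) array with the spatial coordinates of each pixel.
    """
    bin_idx = np.asarray(bin_idx)
    coords = np.asarray(coords, float)
    counts = np.zeros(n_bins)
    means = np.zeros((n_bins, 2))
    covs = np.tile(np.eye(2), (n_bins, 1, 1))
    for b in range(n_bins):
        pts = coords[bin_idx == b]
        counts[b] = len(pts)
        if len(pts) > 0:
            means[b] = pts.mean(axis=0)
        if len(pts) > 1:
            # eps keeps the covariance invertible (assumption of this sketch)
            covs[b] = np.cov(pts.T) + eps * np.eye(2)
    total = counts.sum()
    if total > 0:
        counts = counts / total  # normalized counts (assumption of this sketch)
    return counts, means, covs

def weighted_bhattacharyya(s1, s2):
    """rho_w of Eq. 3.32 between two spatiograms."""
    n1, mu1, cov1 = s1
    n2, mu2, cov2 = s2
    rho = 0.0
    for b in range(len(n1)):
        if n1[b] == 0 or n2[b] == 0:
            continue  # an empty bin contributes nothing
        w_b = (gaussian_pdf(mu1[b], mu2[b], cov2[b])
               * gaussian_pdf(mu2[b], mu1[b], cov1[b]))
        rho += w_b * np.sqrt(n1[b] * n2[b])
    return rho

def likelihood(s_ref, s_cand, sigma_d):
    """p(z_t | d_t) of Eq. 3.33 with d_w = 1 - rho_w."""
    d_w = 1.0 - weighted_bhattacharyya(s_ref, s_cand)
    return float(np.exp(-0.5 * (d_w / sigma_d) ** 2)
                 / (np.sqrt(2.0 * np.pi) * sigma_d))
```

Evaluating `likelihood` for every candidate region of an image would produce a map of the kind shown in Fig. 3.20(c).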

strategy is presented to obtain an approximate solution of the posterior pdf.

3.4.1

Particle Filter approximation

The posterior pdf $p(x_t|z_{1:t})$ is approximated by means of a Particle Filter as

$$p(x_t|z_{1:t}) \approx \frac{1}{c} \sum_{i=1}^{N_S} w_t^i \, \delta(x_t - x_t^i), \qquad (3.34)$$

where the samples $x_t^i$ are drawn from a proposal distribution based on the transition probability (Eq. 3.4)

$$q(x_t|x_{t-1}, z_t) = p(x_t|x_{t-1}) = p(d_t|d_{t-1}, g_t) \, p(g_t), \qquad (3.35)$$

which is an efficient simplification of the optimal, but not tractable, importance density function (9)

$$q(x_t|x_{t-1}, z_t) = p(x_t|x_{t-1}, z_t). \qquad (3.36)$$

The samples $x_t^i = \{d_t^i, g_t^i\}$ are drawn from the proposal distribution using a hierarchical sampling strategy, which is different from the Particle Filtering approach presented in Sec. 3.3.1. This strategy first draws samples $g_t^i$ from $p(g_t)$, and then draws samples $d_t^i$ from $p(d_t|d_{t-1}^i, g_t^i)$, conditioned on $g_t^i$.

The computation of samples $g_t^i$ from $p(g_t)$ relies on a feature-based image registration algorithm. This algorithm can be split into two stages: the computation of correspondences of image features, and the estimation of affine transformations that fit the feature correspondences. The first stage, the computation of feature correspondences, is accomplished by a fast feature-based motion estimation technique (FFME) (85), which is robust to noise, the aperture problem, illumination changes, and small variations of the 3D viewpoint. The main steps of the FFME algorithm are described in the following lines.

The selection of image features, also called singular points, is accomplished by means of three steps. The first one selects the image regions that have a significant gradient magnitude. The second step filters out the image regions with low cornerness, which are typically located on straight edges (i.e. they are still affected by the aperture problem). As a result, the output of the second step is the set of image regions that stand out by their gradient magnitude and cornerness. The

last step performs a non-maximal suppression in the cornerness space to discard the image points that are not very significant with respect to their neighborhood. The result is a set of singular points that are the most reliable ones to compute the feature correspondence, since the image points that were especially sensitive to noise and to the aperture problem have been discarded.

Each singular point is characterized by a sophisticated descriptor robust to illumination changes and small variations in the 3D viewpoint. To compute the descriptor, an array of gradient phase histograms in the neighborhood of each singular point is calculated. All the histograms are concatenated in a vector to form the descriptor, which is then normalized to make it invariant to brightness and contrast changes.

Singular points of consecutive images are matched by computing the minimum Euclidean distance between descriptors. Potentially erroneous correspondences are discarded if the Euclidean distance of the best match of a feature is too close to that of the second best match. Fig. 3.21 shows the feature correspondence between two consecutive images that are stacked in a column. The correspondences are represented by arrows that link a feature in the previous image (top image) with a feature in the current image (bottom image). Note that, in spite of the careful selection of features and the subsequent robust correspondence, the presence of wrong correspondences is still possible in real images.

The second stage of the image registration algorithm uses the computed feature correspondences to estimate a set of affine transformations, which are candidates for representing the camera motion. For a hypothetical ideal situation in which all the correspondences are correct, and all of them are related to static background regions, the affine transformation representing the camera motion can be obtained by solving the following equation system through the Least Mean Squares (LMS) technique

$$\begin{bmatrix} x_t^j \\ y_t^j \\ 1 \end{bmatrix} = \begin{bmatrix} a & b & t_x \\ c & d & t_y \\ 0 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_{t-1}^j \\ y_{t-1}^j \\ 1 \end{bmatrix}, \quad j \in C_{all}, \qquad (3.37)$$

where $a$, $b$, $c$, $d$, $t_x$ and $t_y$ are the parameters of the affine transformation, $x_{t-1}^j$, $y_{t-1}^j$, $x_t^j$, $y_t^j$ are the coordinates of the features related to the $j$-th correspondence, and $C_{all}$ is the set of all the correspondences. However, in a real situation there could be wrong correspondences, as well as correct correspondences that are related to independent moving objects (see Fig. 3.21) and therefore do not fit the camera motion. All these

Figure 3.21: Example of feature correspondence between consecutive images using the FFME algorithm.

correspondences are outliers that bias the estimation of the affine transformation. To overcome this problem, there exist robust statistical techniques that can tolerate up to 50% of outliers, such as RANSAC (86) and Least Median Squares (87). Nevertheless, the amount of outliers can be greater than 50% in real scenarios. The main sources of the outliers are the presence of several independent moving objects, low textured image regions, and repetitive background structures. Under these circumstances, the estimation of the camera motion can be so uncertain that it may not be possible to deterministically estimate the correct camera motion. To address this situation, a probabilistic approach is proposed, which computes a set of weighted affine transformations to approximate the underlying probability of the camera motion $p(g_t)$. It is assumed that at least one of the affine transformation candidates is a satisfactory approximation of the actual camera motion. The combination of these camera motion hypotheses and the object motion hypotheses produces a robust dynamic model of the scene, which is used by the proposed Bayesian filter to reliably estimate both the object tracking information and the camera motion.

The affine transformation samples $g_t^i$ are obtained using the principle of RANdom SAmple Consensus (RANSAC). This procedure randomly selects $N_c$ correspondences from the whole set $C_{all}$, and then estimates through Least Mean Squares (LMS) the affine transformation that best fits the selected correspondences. In order to improve the accuracy of the estimated camera motion, the set of inliers with regard to the computed affine transformation is estimated by means of the Median Absolute Deviation algorithm (87). Lastly, the affine transformation sample $g_t^i$ is obtained using the LMS algorithm with the whole set of inliers. As a result of this RANSAC based drawing procedure, the set of affine transformation samples $\{g_t^i \mid i = 1, \ldots, N_S\}$ is obtained. Finally, a resampling stage is performed using the SIR algorithm to place more samples $g_t^i$ over the regions of the affine space that concentrate the majority of the probability.

Fig. 3.22 shows an image that represents a set of affine transformations obtained by the proposed method. Each affine transformation is represented by a square, which has been obtained by warping a canonical square according to the parameters of the affine transformation. The canonical square, represented in red, is centered at the coordinates (0, 0) and has a side length equal to 5. Observing the resulting warped squares with regard to the canonical square, the different kinds of deformations produced by an affine transformation (translation, rotation, scaling, and shearing) can be perceived.
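The RANSAC based drawing procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names, the minimal sample size `n_c=3`, the inlier rule (median plus `k` times the scaled Median Absolute Deviation, with a small numerical floor), the value of `sigma_g`, and the use of multinomial resampling as the SIR step are all choices made for the example. The least-squares fit uses `numpy.linalg.lstsq`, and the weight of each sample follows Eqs. 3.30 and 3.31.

```python
import numpy as np

def fit_affine(prev_pts, curr_pts):
    """Least-squares affine fit in the form of Eq. 3.37 (3x3 homogeneous matrix)."""
    X = np.hstack([prev_pts, np.ones((len(prev_pts), 1))])   # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(X, curr_pts, rcond=None)     # shape (3, 2)
    g = np.eye(3)
    g[:2, :] = params.T
    return g

def residuals(g, prev_pts, curr_pts):
    """Residual distances d^j = ||f_t^j - g . f_{t-1}^j|| (Eq. 3.29)."""
    X = np.hstack([prev_pts, np.ones((len(prev_pts), 1))])
    pred = (X @ g.T)[:, :2]
    return np.linalg.norm(curr_pts - pred, axis=1)

def mad_inliers(res, k=2.5, floor=1e-6):
    """Inlier mask from a MAD-based threshold (rule assumed for this sketch)."""
    med = np.median(res)
    scale = 1.4826 * np.median(np.abs(res - med))
    return res <= med + k * scale + floor

def draw_affine_samples(prev_pts, curr_pts, n_samples, n_c=3, sigma_g=1.0, rng=None):
    """Draw affine hypotheses, weight them with p(g_t), and resample them."""
    rng = np.random.default_rng() if rng is None else rng
    samples, priors = [], []
    for _ in range(n_samples):
        # 1) random minimal subset and first least-squares fit
        idx = rng.choice(len(prev_pts), size=n_c, replace=False)
        g0 = fit_affine(prev_pts[idx], curr_pts[idx])
        # 2) inliers with respect to that fit
        inl = mad_inliers(residuals(g0, prev_pts, curr_pts))
        # 3) refit using the whole set of inliers
        g = fit_affine(prev_pts[inl], curr_pts[inl])
        # 4) alignment quality q_g (Eq. 3.30) and prior N(q_g; 0, sigma_g^2) (Eq. 3.31)
        q_g = residuals(g, prev_pts[inl], curr_pts[inl]).mean()
        prior = np.exp(-0.5 * (q_g / sigma_g) ** 2) / (np.sqrt(2.0 * np.pi) * sigma_g)
        samples.append(g)
        priors.append(prior)
    samples = np.array(samples)
    priors = np.array(priors)
    # 5) resampling proportional to the prior (simple stand-in for SIR)
    picked = rng.choice(n_samples, size=n_samples, p=priors / priors.sum())
    return samples[picked], samples, priors
```

Keeping all the hypotheses, rather than only the best one, is what allows the filter to represent several plausible camera motions when the correspondences are ambiguous.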

Figure 3.22: Representation of a set of affine transformations obtained by the RANSAC based drawing procedure. Each affine transformation is represented by a warped square. See the text for more details.

Once the affine transformation samples have been obtained, the object dynamic samples $d_t^i$ are drawn from the transition probability of the object dynamics $p(d_t|d_{t-1}^i, g_t^i)$, which is a Gaussian distribution given by Eq. 3.9. The state samples are then obtained by concatenating both kinds of samples, $x_t^i = \{d_t^i, g_t^i\}$. Figs. 3.23 and 3.24 respectively show the samples of the object position (marked by green crosses) given by $d_t^i$, and the samples of the ellipse that encloses the object according to $x_t^i$. Notice that the samples are correctly placed around the most probable image locations, containing the target object.

The weights $w_t^i$ related to the drawn samples $x_t^i$ are computed using Eq. 3.15, which can be simplified using the expression of the proposal distribution as

$$w_t^i = w_{t-1}^i \frac{p(z_t|x_t^i) \, p(x_t^i|x_{t-1}^i)}{q(x_t^i|x_{t-1}^i, z_t)} = w_{t-1}^i \frac{p(z_t|d_t^i) \, p(x_t^i|x_{t-1}^i)}{p(x_t^i|x_{t-1}^i)} = w_{t-1}^i \, p(z_t|d_t^i), \qquad (3.38)$$

where the object likelihood $p(z_t|d_t^i)$ is given by Eq. 3.20, based on the color appearance of the object. Therefore, the samples that best fit with the observation model, will

Figure 3.23: Samples of the object position determined by $d_t^i$.

Figure 3.24: Samples of the ellipses that enclose the object determined by $x_t^i$.
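The sampling and weighting steps of this section can be sketched in Python as follows. The sketch assumes that the object state $d_t$ is reduced to a 2D position and that the Gaussian transition is centered on the previous position warped by the sampled affine transformation; the actual transition model is the one referenced in the text and is not reproduced here. The function names, `sigma_pos`, and the multinomial form of the resampling are likewise assumptions of the example.

```python
import numpy as np

def propagate(d_prev, g_samples, sigma_pos, rng):
    """Hierarchical sampling: particle i uses its own g_t^i, then d_t^i is drawn
    from a Gaussian centered on g_t^i applied to d_{t-1}^i (assumed form)."""
    homog = np.hstack([d_prev, np.ones((len(d_prev), 1))])        # (N, 3)
    warped = np.einsum('nij,nj->ni', g_samples, homog)[:, :2]      # (N, 2)
    return warped + rng.normal(0.0, sigma_pos, size=warped.shape)

def update_weights(w_prev, likelihoods):
    """w_t^i = w_{t-1}^i * p(z_t | d_t^i) (Eq. 3.38), normalized to sum to one."""
    w = np.asarray(w_prev, float) * np.asarray(likelihoods, float)
    return w / w.sum()

def sir_resample(particles, weights, rng):
    """Select particles proportionally to their weights; weights become uniform."""
    n = len(weights)
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```

Because the proposal equals the transition probability, the weight update only needs the likelihood of each particle, which keeps the per-frame cost proportional to the number of particles.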

have more relevance than the rest. Fig. 3.25 shows the discrete representation of the posterior probability of the object position given by the samples $x_t^i$ and their weights $w_t^i$. Notice that the position samples that have a higher weight are those whose associated ellipse better encloses the target object.

Figure 3.25: Weighted samples of the object position representing the posterior $p(x_t|z_{1:t})$.

After computing the samples and weights to approximate the posterior distribution, a resampling step is applied to reduce the degeneracy problem. This is accomplished by means of the SIR algorithm (Sec. 3.3.1), which selects the samples with higher weights more often, while the ones with an insignificant weight are discarded. After the SIR resampling, all the samples have the same weight. The state estimation $x_t$, containing the desired tracking information, is finally obtained using the MAP estimator.

3.4.2

Results

The proposed visual tracking strategy for moving cameras in visible imagery is tested in several challenging situations, such as the chasing of an object at high speed. Next, the graphical results for two sequences are shown, in which the camera has been mounted on two different moving platforms: a car and a helicopter. This kind of sequences are

suitable for testing the proposed tracking algorithm because they involve strong ego-motion and multiple independent moving objects, which can bias the camera ego-motion estimation. In both situations the posterior pdf over the state of the object has been approximated by 50 samples.

The first sequence has been acquired by a camera mounted on a police car that is chasing a motorbike. The main challenges are the strong ego-motion and the reduced set of appropriate regions to compute reliable correspondences. The image is mainly composed of low textured regions in which the quality of correspondences is very poor due to the aperture problem. The combination of the object and camera dynamics allows the posterior pdf of the object state to be satisfactorily propagated. This is demonstrated in Fig. 3.26. The first row of frames corresponds to two non-consecutive time steps of the sequence, where the tracked object, a motorbike, is enclosed by a white ellipse. This ellipse is defined by the parameters of the state vector, which has been computed using the MMSE estimator. The second row shows a set of ellipses that represent the state samples but only considering the camera dynamics. Note that the ellipses are not exactly located on the target object, since only the camera dynamics has been taken into account. Using the object dynamics to rectify the object locations, the majority of the ellipses are correctly located over the target object, as shown in the third row.

The second sequence has been acquired from a camera mounted on a helicopter, and the target object is a red car on a highway. In addition to the strong ego-motion, there are several independent moving objects (other cars) that make the camera motion estimation more challenging, since the correspondences related to the moving objects can induce false camera motions. The proposed model for the camera dynamics, which takes into account multiple hypotheses, can manage this situation, successfully tracking the target object. Fig. 3.27 shows three rows of frames with the same layout and meaning as those of Fig. 3.26. Note that the ellipses of the second and third rows are distributed along the direction of the highway because the correspondences related to the rest of the cars induce several false candidates of camera motion in that direction. In spite of this drawback, the tracking is successfully accomplished as observed in the first row, where the white ellipse satisfactorily encloses the tracked object. Notice how this strategy allows other candidates to be discarded, such as the other red car (second column of frames) marked with a white X, since there is no camera nor


Object tracking represented by a white ellipse, which is determined by the estimated object state.

Ellipse samples of the posterior pdf taking only into account the camera dynamics.

Ellipse samples of the posterior pdf considering both the object and the camera dynamics.
Figure 3.26: Results using the proposed Bayesian framework for moving cameras and visual imagery. The tracked object is a motorbike, and the camera is mounted on a car.


object motion that supports that region. This prevents the tracking process from being distracted by similar objects. The performance of the proposed tracking algorithm has also been compared with the two tracking techniques described in (26; 88). The main differences with respect to the presented approach are that the camera ego-motion is not compensated in (88), while in (26) it is compensated, but using only one affine transformation instead of several. The three strategies use a Particle Filter framework to perform the tracking. In order to compare the three tracking algorithms fairly, the same set of parameters has been used to tune the Particle Filters. The dataset used for the comparison is composed of 11 videos with challenging situations: strong ego-motion, changes in illumination, variations of the object appearance and size, and occlusions. The comparison is based on the following criterion: the number of times that the tracked object has been lost. Each time the tracked object is lost, the corresponding tracking algorithm is reinitialized with a correct object detection. Tab. 3.2 shows the comparison of the three tracking algorithms, where it can be appreciated that the proposed algorithm is clearly superior, because the other approaches cannot satisfactorily handle the ego-motion. On the other hand, it can be observed that, independently of the algorithm, the results obtained for the videos “motorbike1-6.avi” are worse than the rest. The reason is that the size of the tracked object is very small, less than one hundred pixels, and therefore the object cannot be robustly represented by a spatiogram.

3.5

Conclusions

In this chapter, the problem of tracking a single object in heavily cluttered scenarios with a moving camera has been addressed. Firstly, a common Bayesian framework has been presented to jointly model the object and camera dynamics. This framework makes it possible to satisfactorily predict the evolution of the tracked object in situations where several putative candidates for the object location are possible due to the combined dynamics of the object and the camera. This is especially problematic in situations with strong ego-motion, high uncertainty in the camera motion because of the aperture problem, and the presence of multiple objects with independent motion. In addition, the Bayesian model is robust to background clutter, which avoids tracking failures due to the presence in the background of objects similar to the tracked one.

Object tracking represented by a white ellipse, which is determined by the estimated object state.

Ellipse samples of the posterior pdf taking only into account the camera dynamics.

Ellipse samples of the posterior pdf considering both the object and the camera dynamics.
Figure 3.27: Results using the proposed Bayesian framework for moving cameras and visual imagery. The tracked object is the red car, and the camera is mounted on a helicopter. The white X in the second column indicates a similar object (other red car) that could be easily mistaken for the target object.

Videos: redcar1.avi, redcar2.avi, bluecar1.avi, bluecar2.avi, bluecar3.avi, person1.avi, motorbike1.avi, motorbike2.avi, motorbike3.avi, motorbike4.avi, motorbike5.avi.
Per-video values of #L, listed in the row order of the table:
#L alg. 1:  0  1  0  4  1  2  5  8  0  0  1
#L alg. 2:  0  4  3 13  3  8 16 23  1  2  3
#L alg. 3:  0  9  7 24  6 19 26 37  3  5  6

Table 3.2: Comparison between the three tracking algorithms using the criterion of the number of times that the tracked object has been lost (#L). Alg. 1, alg. 2, and alg. 3 refer to the proposed method, the Particle Filter approach with compensation, and the Particle Filter approach without compensation, respectively.

Secondly, the Bayesian framework has been adapted to work with different kinds of imagery, specifically infrared and visible-range imagery. Every type of imagery has its own characteristics, which lead to the use of different techniques to characterize the tracked objects and compute their detections. Also, the strategies to infer the multiple hypotheses about the camera motion are different, depending on the type of imagery. In infrared sequences, an area-based image registration technique has been used due to the absence of distinctive image features, while a feature-based image registration method has been adopted for the visible-range imagery.

On the other hand, the inference of the tracking information in the proposed Bayesian models has been carried out by means of approximate suboptimal methods, since there is no closed-form expression to directly compute the required tracking information. This situation arises from the fact that the dynamic and observation processes are non-linear and non-Gaussian. Consequently, two different suboptimal inference methods have been derived, which make use of the particle filtering technique to compute an accurate approximation of the object trajectory, taking into account the specific characteristics of the observation and the camera dynamic models of the infrared and visible-range imagery.

Lastly, the developed Bayesian tracking models have been tested on different video datasets, proving their efficiency and reliability in the task of tracking objects with moving cameras in heavily cluttered scenarios.

Chapter 4

Bayesian tracking of multiple interacting objects
This chapter presents a tracking approach for multiple interacting objects with a static camera. Firstly, a detailed description of the tracking problem is presented in Sec. 4.1. Next, the developed Bayesian model for multiple interacting objects is described in Sec. 4.2. The approximate inference method to obtain a tractable solution from the optimal Bayesian model is presented in Sec. 4.3. The computation of the object detections carried out by the set of object detectors is described in Sec. 4.4. Lastly, the tracking results using a public dataset containing football sequences are shown in Sec. 4.5.

4.1

Description of the multiple object tracking problem

The aim is to track several interacting objects over time in a video sequence acquired by a static camera. The object tracking information at the time step $t$, also referred to as the object state, is represented by the vector

$$x_t = \{x_{t,i_x} \mid i_x = 1, \ldots, N_{obj}\}, \qquad (4.1)$$

where each component contains the tracking information of a specific object

$$x_{t,i_x} = \{r_{t,i_x}, v_{t,i_x}, id_{t,i_x}\}, \qquad (4.2)$$

where $r_{t,i_x}$ contains the object position over the image plane, the variable $v_{t,i_x}$ stores the object velocity, and $id_{t,i_x}$ is a scalar value that uniquely identifies an object throughout the video sequence. It is assumed that the $i_x$-th element between consecutive time steps corresponds to the same tracked object. The number of tracked objects $N_{obj}$ is variable, but it is assumed that the entrances and exits of the objects in the scene are known. This makes it possible to focus on the modeling of the object interactions.

The tracking information contained in the object state $x_t$ is estimated over time using a set of noisy detections (also called observations or measurements), which are obtained by means of the analysis of the video stream acquired by the camera. The set of detections at time step $t$

$$z_t = \{z_{t,i_z} \mid i_z = 1, \ldots, N_{ms}\} \qquad (4.3)$$

contains potential detections of the tracked objects extracted from the frame $I_t$. There can be several detections per object, although only one or none comes from the actual object. The rest of the detections are false alarms, known as clutter, that arise as a result of similar objects in the scene, or uncertainties related to the detection process. Moreover, there can be objects without any detection as a consequence of occlusions, or severe changes in the visual appearance of the tracked objects. The number of detections $N_{ms}$ can vary at each time step. Detections are generated by a set of detectors, each one specialized in one specific type or category of object. Fig. 4.1 illustrates the situation of a set of detections yielded by several object detectors. Each detection

$$z_{t,i_z} = \{r_{t,i_z}, w_{t,i_z}, id_{t,i_z}\} \qquad (4.4)$$

contains the coordinates over the image plane of the detected object $r_{t,i_z}$, a weight value related to the quality of the detection $w_{t,i_z}$, and an identifier that associates the detection with the object detector that generated it, $id_{t,i_z}$.

The association between detections and objects is unknown, since the set of detections computed at each time step is unordered. In addition, some of the detections are false alarms, which should be discarded or associated with a generic entity that symbolizes the clutter to prevent drift in the estimation of the object state vector. Consequently, the tracking algorithm has to include a data association stage to correctly compute the associations between the true detections and the objects, while the

Figure 4.1: Illustration depicting a set of detections yielded by multiple detectors. Labels in the figure: Detector 1; Detector 2; several measurements or detections per object; false alarm or clutter.
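To make the notation of Eqs. 4.1 to 4.4 concrete, the object state and the detections could be held in containers such as the ones below. This is only an illustrative Python sketch: the class names, field types, and the helper `detections_by_detector` are hypothetical and do not come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ObjectState:
    """One component x_{t,ix} of the object state (Eq. 4.2)."""
    r: Tuple[float, float]   # position over the image plane
    v: Tuple[float, float]   # velocity
    id: int                  # identifier that is unique throughout the sequence

@dataclass
class Detection:
    """One detection z_{t,iz} (Eq. 4.4)."""
    r: Tuple[float, float]   # coordinates of the detected object
    w: float                 # weight related to the quality of the detection
    id: int                  # identifier of the detector that generated it

def detections_by_detector(z_t: List[Detection]) -> Dict[int, List[Detection]]:
    """Group an unordered set of detections by the detector that produced them."""
    groups: Dict[int, List[Detection]] = defaultdict(list)
    for z in z_t:
        groups[z.id].append(z)
    return dict(groups)
```

For example, `x_t = [ObjectState((10.0, 5.0), (1.0, 0.0), 1)]` would represent a state with a single tracked object, and a list of `Detection` instances of any length would represent $z_t$, whose size $N_{ms}$ can change at each time step.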

false detections are associated with the clutter entity. The data association stage also has to handle the situation in which some objects have no associated detections, either because all the detections are false, or because their respective detectors have not produced any detection. The data association is represented by the random variable

$$a_t = \{a_{t,i_a} \mid i_a = 1, \ldots, N_{ms}\}, \qquad (4.5)$$

where the component $a_{t,i_a}$ contains the association of the $i_a$-th detection $z_{t,i_a}$. The possible associations are with one of the objects or with the clutter. This is expressed as

$$a_{t,i_a} = \begin{cases} 0 & \text{if the } i_a\text{-th detection is assigned to clutter,} \\ i_x & \text{if the } i_a\text{-th detection is assigned to the } i_x\text{-th object,} \end{cases} \qquad (4.6)$$

where $i_x \in (1, \ldots, N_{obj})$. Fig. 4.2 illustrates the data association process between the sets of detections and objects. To improve the estimation of the object state, and at the same time to reduce the ambiguity in the estimation of the data association, some prior knowledge about the

Figure 4.2: Illustration depicting the data association between detections and objects. Labels in the figure: Measurements; Objects; Objects without measurements; Clutter (0).
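The encoding of Eq. 4.6 can be illustrated with a short Python sketch that interprets one association vector $a_t$. The function name and the returned structure are hypothetical; the flag on multiple detections per object reflects the statement in the text that at most one detection comes from the actual object, and is an interpretation made for this example.

```python
from typing import Dict, List, Tuple

def split_by_association(a_t: List[int], n_obj: int) -> Tuple[Dict[int, List[int]], List[int], List[int], bool]:
    """Interpret an association vector following Eq. 4.6.

    a_t[i_a - 1] = 0   -> detection i_a is assigned to clutter
    a_t[i_a - 1] = i_x -> detection i_a is assigned to object i_x (1..n_obj)
    """
    per_object: Dict[int, List[int]] = {i_x: [] for i_x in range(1, n_obj + 1)}
    clutter: List[int] = []
    for i_a, target in enumerate(a_t, start=1):
        if target == 0:
            clutter.append(i_a)
        elif 1 <= target <= n_obj:
            per_object[target].append(i_a)
        else:
            raise ValueError(f"invalid association value {target}")
    # objects that received no detection (e.g. occluded or missed)
    undetected = [i_x for i_x, dets in per_object.items() if not dets]
    # at most one true detection per object (interpretation of the text)
    one_per_object = all(len(dets) <= 1 for dets in per_object.values())
    return per_object, clutter, undetected, one_per_object
```

With four detections, three objects and the hypothetical vector `a_t = [2, 0, 1, 0]`, detections 1 and 3 go to objects 2 and 1, detections 2 and 4 are clutter, and object 3 is left without any detection.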

object dynamics is used. Typically, it is assumed that the objects move independently according to a constant velocity model. However, in real situations, the objects interact with each other, changing their trajectories as they approach one another. For example, in the context of people tracking, one person can cross paths with another, stop in front of him, follow him, and even perform more complicated interactions. In these cases, the assumption of constant velocity does not hold, causing errors in the computed object trajectories. This situation arises especially when an object changes its trajectory while it is occluded, since there are no available detections (due to the occlusion) that can be used to rectify its trajectory. Consequently, a more complex dynamic model that takes into account the interaction between objects is required. In this line, an advanced dynamic model for interacting objects has been developed that depends on the relative distance between objects. An object that is far away from the others is considered to move independently with a piecewise constant velocity model. However, an object that is close to or occluded by another one can undergo either an independent motion according to a constant velocity model, or an interacting motion that depends on the position of the interacting objects. The interacting motion is modeled in such a way that the object can either follow the trajectory of an occluding object, which means that the tracked object remains occluded behind other objects, or follow an independent trajectory, meaning that the object interaction has finished and the occluded object becomes visible around the previously occluding objects. Fig. 4.3 illustrates the


different kinds of situations that the proposed dynamic model can handle.

As previously stated, the dynamic model depends on the distances between the existing objects in the scene. Therefore, since the interactions take place when two or more objects are really close, involving some kind of occlusion, the inference of occlusions between objects allows predicting when the objects are interacting with each other, and thus estimating their expected position using an interacting dynamic model rather than a constant velocity model. The object occlusions are modeled by an auxiliary random variable

\[ o_t = \{ o_{t,i_o} \,|\, i_o = 1, \ldots, N_{obj} \} \tag{4.7} \]

that stores the occlusion events of each tracked object. The component $o_{t,i_o}$ encodes the occlusion information of the $i_o$-th object as

\[ o_{t,i_o} = \begin{cases} 0 & \text{if the } i_o\text{-th object is visible, i.e. not occluded,} \\ i_x & \text{if the } i_o\text{-th object is occluded by the } i_x\text{-th object,} \end{cases} \tag{4.8} \]

where $i_x \in \{1, \ldots, N_{obj}\}$.

All the previous random variables, representing the basic elements of information in the tracking problem, are probabilistically related through a Bayesian model, which is described in the next section.

4.2 Bayesian tracking model for multiple interacting objects

Visual tracking of multiple interacting objects is performed using a Bayesian approach, which recursively computes the probability of the object state $x_t = \{ x_{t,i_x} \,|\, i_x = 1, \ldots, N_{obj} \}$ along time, given a sequence of noisy detections $z_t = \{ z_{t,i_z} \,|\, i_z = 1, \ldots, N_{ms} \}$. This probability contains all the information required to compute an optimum estimate of the object trajectories at each time step. The developed Bayesian model for multiple object tracking can be represented by the graphical model of Fig. 4.4, which encodes the probabilistic dependencies among the different random variables involved in the tracking task, which were already introduced in Sec. 4.1.

[Figure 4.3 graphic. Panels: distant objects (constant velocity model); cross of objects without interaction (constant velocity model); cross of objects with interaction (interacting model); end of an interaction, objects move away (constant velocity model).]

Figure 4.3: Illustration depicting the object dynamic model.

Figure 4.4: Graphical model of the developed Bayesian algorithm for multiple object tracking.

From a Bayesian perspective, the aim is to compute the posterior probability density function (pdf) over the object state, the data association, and the object occlusion, $p(x_t, a_t, o_t | z_{1:t})$, using the prior information about the object dynamics and the sequence of available detections until the current time step, $z_{1:t} = \{z_1, \ldots, z_t\}$. This task can be efficiently performed by expressing the posterior pdf in a recursive way using Bayes' theorem

\[ p(x_t, a_t, o_t | z_{1:t}) = \frac{p(x_t, a_t, o_t, z_t | z_{1:t-1})}{p(z_t | z_{1:t-1})} = \frac{p(z_t | z_{1:t-1}, x_t, a_t, o_t)\, p(x_t, a_t, o_t | z_{1:t-1})}{p(z_t | z_{1:t-1})}, \tag{4.9} \]

where $p(x_t, a_t, o_t | z_{1:t-1})$ is the prior pdf, $p(z_t | z_{1:t-1}, x_t, a_t, o_t)$ is the likelihood, and $p(z_t | z_{1:t-1})$ is just a normalization constant.

The prior pdf predicts the evolution of $\{x_t, a_t, o_t\}$ between consecutive time steps using the posterior pdf at the previous time step, $p(x_{t-1}, a_{t-1}, o_{t-1} | z_{1:t-1})$, and the prior knowledge about the dynamics of $\{x_t, a_t, o_t\}$. This is mathematically expressed by

\[ \begin{aligned} p(x_t, a_t, o_t | z_{1:t-1}) &= \int_{x_{t-1}} \sum_{a_{t-1}} \sum_{o_{t-1}} p(x_t, a_t, o_t, x_{t-1}, a_{t-1}, o_{t-1} | z_{1:t-1})\, dx_{t-1} \\ &= \int_{x_{t-1}} \sum_{a_{t-1}} \sum_{o_{t-1}} p(x_t, a_t, o_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}) \cdot p(x_{t-1}, a_{t-1}, o_{t-1} | z_{1:t-1})\, dx_{t-1}, \end{aligned} \tag{4.10} \]

where $p(x_t, a_t, o_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1})$ is the joint transition pdf that determines the temporal evolution of $\{x_t, a_t, o_t\}$. It can be factorized as

\[ p(x_t, a_t, o_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}) = p(x_t | z_{1:t-1}, x_{t-1}, a_{t-1:t}, o_{t-1:t}) \cdot p(a_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1:t})\, p(o_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}), \tag{4.11} \]

where $p(x_t | z_{1:t-1}, x_{t-1}, a_{t-1:t}, o_{t-1:t})$ is the transition pdf over the object state, $p(a_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1:t})$ is the transition pdf over the data association, and $p(o_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1})$ is the transition pdf over the object occlusion.

The transition pdf over the object state can be simplified as

\[ p(x_t | z_{1:t-1}, x_{t-1}, a_{t-1:t}, o_{t-1:t}) = p(x_t | x_{t-1}, o_t), \tag{4.12} \]

where the following conditional independence statements have been taken into account (see Sec. 6.1 for the definition of conditional independence and d-separation):

• $z_{1:t-2}$ and $o_{t-1}$ are d-separated from $x_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with incoming-outgoing arrows on the path.

• $z_{t-1}$ and $a_{t-1}$ are d-separated from $x_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with outgoing-outgoing arrows on the path.

• $a_t$ is d-separated from $x_t$ by $z_t$, since $z_t$ is a not given variable with incoming-incoming arrows on the path.

Therefore, the object trajectories depend only on the previous object positions and the possible occlusions among them.

On the other hand, the transition pdf over the data association can be simplified as

\[ p(a_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1:t}) = p(a_t), \tag{4.13} \]

where it has been taken into account that $z_{1:t-1}$, $x_{t-1}$, $a_{t-1}$, and $o_{t-1:t}$ are d-separated from $a_t$ by $z_t$, since $z_t$ is a not given variable with incoming-incoming arrows on the path. These conditional independence properties are derived from the fact that the detections are unordered, which makes the previous data associations and object trajectories useless.
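Before continuing with the remaining term, the predict/correct structure of Eqs. 4.9 and 4.10 can be illustrated on a toy problem. The following is only a minimal sketch, assuming a single discrete state variable in place of the joint $\{x_t, a_t, o_t\}$ (so the integral becomes a sum); the function name and the numbers are hypothetical and are not part of the thesis.

```python
import numpy as np

def bayes_filter_step(posterior_prev, transition, likelihood):
    """One predict/correct step on a finite state space (toy stand-in).

    posterior_prev[j] = p(s_{t-1} = j | z_{1:t-1})
    transition[i, j]  = p(s_t = i | s_{t-1} = j)
    likelihood[i]     = p(z_t | s_t = i)
    """
    # Prediction: discrete analogue of Eq. 4.10 (sum over the previous state).
    prior = transition @ posterior_prev
    # Correction: numerator of Eq. 4.9.
    unnormalized = likelihood * prior
    # Normalization constant p(z_t | z_{1:t-1}).
    evidence = unnormalized.sum()
    return unnormalized / evidence, evidence

# Hypothetical two-state example.
posterior_prev = np.array([0.5, 0.5])
transition = np.array([[0.9, 0.2],
                       [0.1, 0.8]])   # each column sums to one
likelihood = np.array([0.7, 0.1])
posterior, evidence = bayes_filter_step(posterior_prev, transition, likelihood)
```

In the actual model the state is partly continuous and high-dimensional, so this exact enumeration is not available and the recursion has to be approximated.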

The last term in Eq. 4.11, the transition pdf over the object occlusion, can be simplified as

\[ p(o_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}) = p(o_t | x_{t-1}), \tag{4.14} \]

where the following conditional independence statements have been taken into account:

• $z_{1:t-2}$ and $o_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with incoming-outgoing arrows on the path.

• $z_{t-1}$ and $a_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with outgoing-outgoing arrows on the path.

This reflects the fact that object occlusions only depend on the previous object positions.

Substituting Eqs. 4.12, 4.13, and 4.14 in Eq. 4.11, the joint transition pdf on $\{x_t, a_t, o_t\}$ can then be expressed as

\[ p(x_t, a_t, o_t | z_{1:t-1}, x_{t-1}, a_{t-1}, o_{t-1}) = p(x_t | x_{t-1}, o_t)\, p(a_t)\, p(o_t | x_{t-1}), \tag{4.15} \]

where the mathematical expressions for each probability term are derived in Sec. 4.2.1.

Substituting Eq. 4.15 in Eq. 4.10, the prior pdf on $\{x_t, a_t, o_t\}$ can then be expressed as

\[ p(x_t, a_t, o_t | z_{1:t-1}) = p(a_t) \int_{x_{t-1}} p(x_t | x_{t-1}, o_t)\, p(o_t | x_{t-1}) \cdot \sum_{a_{t-1}} \sum_{o_{t-1}} p(x_{t-1}, a_{t-1}, o_{t-1} | z_{1:t-1})\, dx_{t-1}. \tag{4.16} \]

The prediction on $\{x_t, a_t, o_t\}$ performed by the prior pdf $p(x_t, a_t, o_t | z_{1:t-1})$ is corrected by means of the likelihood $p(z_t | z_{1:t-1}, x_t, a_t, o_t)$, which uses the new set of detections available at the current time. The likelihood can be simplified as

\[ p(z_t | z_{1:t-1}, x_t, a_t, o_t) = p(z_t | x_t, a_t), \tag{4.17} \]

taking into account that $z_{1:t-1}$ and $o_t$ are d-separated from $z_t$ by $x_t$, since $x_t$ is a given variable with incoming-outgoing arrows on the path. This result arises from the fact

that the data association information is necessary to evaluate the coherence between the predicted object positions and the detections.

The last probability term in Eq. 4.9 is just a normalization constant given by

\[ p(z_t | z_{1:t-1}) = \int_{x_t} \sum_{a_t} \sum_{o_t} p(z_t | z_{1:t-1}, x_t, a_t, o_t)\, p(x_t, a_t, o_t | z_{1:t-1})\, dx_t, \tag{4.18} \]

where all the involved terms have already been introduced.

Whereas the posterior pdf for a generic time step $t$ is given by Eq. 4.9, the posterior pdf for the initial time step is computed in a different way, since no detection is available yet:

\[ p(x_0, a_0, o_0 | z_0) = p(x_0) = \mathcal{N}(x_0, \mu_0, \Sigma_0), \tag{4.19} \]

where $\mathcal{N}(x_0, \mu_0, \Sigma_0)$ is a multivariate Gaussian distribution with mean $\mu_0$ and covariance matrix $\Sigma_0$. The Gaussian distribution models the uncertainty of the object detection process, i.e. the expected distortion in the location of the detected objects. Fig. 4.5 depicts the graphical model corresponding to the initial time step, in which only the object state $x_t$ is involved.

Figure 4.5: Graphical model of the developed Bayesian framework for the initial time step.

All the random variables are necessary to model the multi-tracking problem, but the goal is to obtain the tracking information of the objects, i.e. their trajectories. Therefore, in order to obtain the object

trajectories, the marginal posterior pdf over the object state is computed by integrating out the data association and the object occlusion

\[ p(x_t | z_{1:t}) = \sum_{a_t} \sum_{o_t} p(x_t, a_t, o_t | z_{1:t}). \tag{4.20} \]

However, the posterior pdf $p(x_t, a_t, o_t | z_{1:t})$ cannot be analytically solved for the ongoing tracking application, and therefore neither can the marginal posterior probability $p(x_t | z_{1:t})$. The reason is that some of the stochastic processes involved in the multiple object tracking model are non-linear and/or non-Gaussian, which makes the integrals involved in the posterior pdf intractable (2). Therefore, it is necessary to resort to approximate inference techniques to obtain a suboptimal solution. In Sec. 4.3, an efficient framework specifically developed for the current tracking application is presented, which is able to obtain an accurate approximate solution to both $p(x_t, a_t, o_t | z_{1:t})$ and $p(x_t | z_{1:t})$.

Once $p(x_t | z_{1:t})$ has been approximated, it is used to obtain an accurate estimation of the state vector $x_t$ by means of the Maximum a Posteriori (MAP) estimator, described in Sec. 3.1. The estimated $x_t$ contains the object positions and all the desired tracking information at the current time step.

4.2.1 Transition pdfs

As stated in Sec. 4.1, the object dynamics is not an independent process due to the potential interactions among close objects, especially in partial or total occlusions. However, if the object interactions are known, it is possible to predict the motion of each object independently. The interaction information is inferred from the potential occlusions among objects, which in turn are represented by the random variable $o_t$. Therefore, once $o_t$ is known, the transition pdf over the object state can be factorized as a product of independent probability terms

\[ p(x_t | x_{t-1}, o_t) = \prod_{i_x=1}^{N_{obj}} p(x_{t,i_x} | x_{t-1}, o_{t,i_x}), \tag{4.21} \]

where the factor $p(x_{t,i_x} | x_{t-1}, o_{t,i_x})$ represents the temporal evolution of the $i_x$-th object.

The object motion depends on whether the object itself is occluded by another one or not, which is indicated by $o_{t,i_x} = i'_x$ and $o_{t,i_x} = 0$ respectively. In the case of occlusion,

at.ia |at. ′ if ot.ix .ix . if there is no occlusion the object moves independently according to its own dynamics. . the parameter pb should be set according to the expected number of object interactions. Notice that.23) variance matrix that models the uncertainty of the dynamics of not occluded objects.ix .ix = 0. Σvis is a diagonal cox x   A=     .e.ix = ix .4. the data association is not a conditionally independent process since the association of one detection with one object depends on the previous estimated associations. µ. an occluded object changes its trajectory to follow the occluding one in an interaction situation. As happened with the object dynamics. visible objects. ′ if ot. and β(pb ) is a Bernoulli distribution with parameter pb that models the probability that one occluded object has an interacting behavior.i′ = Axt−1.ix |xt−1 .1 . BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS the occluded object can move independently according to its own dynamics indicating that no interaction exists.ix . This fact leads to express the transition pdf over the data association as Nms p(at ) = ia =1 p(at. Σocc )(1 − β(pb ))   N(xt.i′ is predicted position of the occluding object.i′ . At is a matrix simulating a constant velocity model  1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 1  xt. i. according to the above dynamic model. On the other hand. . Σocc )β(pb ) x if ot. On the other hand.ix = ix . All of this can be mathematically expressed as  N(xt. (4. Σ) is a multivariate Gaussian function evaluated at x with mean µ and covariance matrix Σ.   (4. xt. or the occluded object can follow the occluding object as a result of the interaction between them.22) where N(x. .ix . Σocc is a diagonal covariance matrix that models the uncertainty of the dynamics of occluded objects.24) 76 . (4. ot.ix ) = N(xt.ia −1 ). . 
The elements of the diagonal covariance matrix Σocc are set to have a greater value than the ones of Σvis because the uncertainty related to the trajectory of an occluded object is usually higher. Axt−1. Axt−1. Σvis )  p(xt.

where the first element $p(a_{t,1})$ is the only one that is not restricted by other associations, and therefore its mathematical expression is different. The probability that the first detection is associated to an object is

\[ p(a_{t,1} = i_x) = p_{obj}, \tag{4.25} \]

where $i_x \in \{1, \ldots, N_{obj}\}$. And the probability that the first detection is associated to the clutter is

\[ p(a_{t,1} = 0) = p_{clu} = 1 - p_{obj}. \tag{4.26} \]

The other probability terms $p(a_{t,i_a} | a_{t,1}, \ldots, a_{t,i_a-1})$ have a different expression that is common to all of them. The conditional dependency on the previously estimated associations is based on three different restrictions that limit the possible associations between detections and objects. The first restriction is that one detection can be associated at most to one object or to the clutter. This restriction arises from the fact that each detection is generated from the analysis of a specific image region, which can only be part of one object due to the occlusion phenomenon. The second restriction imposes that one object can be associated at most to one detection, although the clutter can be associated to several detections. This restriction is given by the characteristics of the object detector itself, which does not allow split detections. The last restriction is also related to the occlusion problem, and prevents different detections from sharing the same image regions. Therefore, if one detection is already associated to one object, the detections that share the same image regions with it cannot be associated to any object, and thus they have to be associated to the clutter.

Fig. 4.6 depicts the three restrictions. Fig. 4.6(a) illustrates the first restriction, where there are two partially occluded objects and only one detection, since one of the objects is too occluded to yield a detection. The restriction prevents the detection from being associated to both objects. Fig. 4.6(b) shows the second restriction, where there are only one object and two detections. This situation can arise, for example, when the image region containing the object fits different possible poses of the object. The second restriction enforces that only one detection can be associated with the object, whereas the others are associated with the clutter. Fig. 4.6(c) illustrates the third restriction, where there are two partially occluded objects and three detections. Two of

these detections arise from the combination of the image regions of both objects, but only one detection is actually correct, since one of the objects is too occluded to generate a detection. The third restriction ensures that only one detection is associated to the visible object, whereas the rest of the detections that have common image regions with it are associated to the clutter.

The mathematical formalization of the previous restrictions is as follows. The probability that the $i_a$-th detection is associated to an object is expressed as

\[ p(a_{t,i_a} = i_x | a_{t,1}, \ldots, a_{t,i_a-1}) = \begin{cases} p_{obj} & \text{if } cond_1 \wedge cond_2, \\ 0 & \text{rest of cases,} \end{cases} \tag{4.27} \]

where $i_x \in \{1, \ldots, N_{obj}\}$, $cond_1$ is a logical condition that encodes the second restriction as

\[ cond_1 = \neg \exists\, i'_a < i_a \;|\; a_{t,i'_a} = i_x, \tag{4.28} \]

and $cond_2$ is another logical condition that encodes the third restriction as

\[ cond_2 = \neg \exists\, i'_a < i_a \;|\; a_{t,i'_a} = i'_x \wedge Ov(z_{t,i_a}, z_{t,i'_a}) \geq th_{ov}, \tag{4.29} \]

where $i'_x \in \{1, \ldots, N_{obj}\}$ is the index of any object, $Ov(z_{t,i_a}, z_{t,i'_a})$ is the overlap of the image regions related to the detections $z_{t,i_a}$ and $z_{t,i'_a}$, and $th_{ov}$ is a threshold value. Notice that the first restriction is implicitly fulfilled, since each probability evaluates the association event of one detection with a single object.

On the other hand, the probability that the $i_a$-th detection is associated to the clutter is

\[ p(a_{t,i_a} = 0 | a_{t,1}, \ldots, a_{t,i_a-1}) = \begin{cases} p_{clu} & \text{if } cond_2, \\ 1 & \text{rest of cases.} \end{cases} \tag{4.30} \]

Notice that the condition $cond_1$ is not imposed in this case, since several detections can be associated to the clutter.

The object occlusion is not an independent process either, since only one object can be in the foreground in an occlusion, whereas the rest are totally or partially behind it in the background. This means that an occluding object is considered to be always in the foreground. Therefore, if it is estimated that the $i_x$-th object is occluded by the $i'_x$-th object, the latter cannot be occluded by any other object. On the other

[Figure 4.6 graphic. Panels: (a) First restriction (measurements, partially occluded objects, clutter); (b) Second restriction (measurements, objects, clutter); (c) Third restriction (measurements, partially occluded objects, clutter).]

Figure 4.6: Restrictions imposed to the associations between detections and objects.

hand, the relative depth position in a set of occluded objects is considered irrelevant, and thus the only important thing is that all of them are occluded by the same object. Fig. 4.7 illustrates the aforementioned occlusion restrictions among objects.

[Figure 4.7 graphic. Panels: (a) Two objects: one occluding and another occluded; (b) Three objects: one occluding and the rest occluded. Each panel contrasts a correct and an incorrect description of object occlusions.]

Figure 4.7: Restrictions imposed to the occlusions among objects.

From a mathematical perspective, the expression of the transition pdf on $o_t$ subject to the previous restrictions is

\[ p(o_t | x_{t-1}) = \prod_{i_x=1}^{N_{obj}} p(o_{t,i_x} | x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1}), \tag{4.31} \]

where only the first element $p(o_{t,1} | x_{t-1})$ does not depend on the other occlusion events. The probability that the first object is occluded by another is computed as

\[ p(o_{t,1} = i'_x | x_{t-1}) = \begin{cases} \mathcal{N}(G x_{t-1,1}, G x_{t-1,i'_x}, \Sigma_{occ}) \cdot \beta(p_b) & \text{if } i'_x \neq 1, \\ 0 & \text{if } i'_x = 1, \end{cases} \tag{4.32} \]

where $G$ is a matrix used to select the object position information, given by

\[ G = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \tag{4.33} \]

$\Sigma_{occ}$ is a covariance matrix whose elements are set according to the size and pose of the objects involved in the potential occlusion, and $\beta(p_b)$ is a Bernoulli distribution with parameter $p_b$. Therefore, according to the definition of $p(o_{t,1} | x_{t-1})$, the probability that the first object is occluded by another object depends both on the distance between them (modeled by the Gaussian function) and on whether the first object is in the background, occluded by another object (modeled by the Bernoulli distribution). On the other hand, the probability that the first object is not occluded is expressed as

\[ p(o_{t,1} = 0 | x_{t-1}) = p(o_{t,1} = 0) = d_{vis}, \tag{4.34} \]

where $d_{vis}$ is the probability density that the object is visible, i.e. not occluded.

The rest of the probability terms in Eq. 4.31 depend on the previously estimated occlusion events. The probability that the $i_x$-th object is occluded by another is

\[ p(o_{t,i_x} = i'_x | x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1}) = \begin{cases} 0 & \text{if } i'_x = i_x \vee cond_1, \\ \mathcal{N}(G x_{t-1,i_x}, G x_{t-1,i'_x}, \Sigma_{occ}) & \text{if } cond_2 \vee cond_3, \\ \mathcal{N}(G x_{t-1,i_x}, G x_{t-1,i'_x}, \Sigma_{occ})\,\beta(p_b) & \text{rest of cases,} \end{cases} \tag{4.35} \]

where $cond_1 = i'_x < i_x \wedge o_{t,i'_x} \neq 0$ is a logical condition that expresses the restriction that one occluded object cannot occlude another one, and $cond_2 = i'_x < i_x \wedge o_{t,i'_x} = 0$ and $cond_3 = i'_x > i_x \wedge \exists\, i''_x < i_x \,|\, o_{t,i''_x} = i'_x$ are logical conditions that reflect that one occluding object is always in the foreground. On the other hand, the probability that the $i_x$-th object is not occluded is given by

\[ p(o_{t,i_x} = 0 | x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1}) = p(o_{t,i_x} = 0 | o_{t,1}, \ldots, o_{t,i_x-1}) = \begin{cases} 1 & \text{if } \exists\, i'_x < i_x \,|\, o_{t,i'_x} = i_x, \\ d_{vis} & \text{rest of cases.} \end{cases} \tag{4.36} \]
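The occlusion prior above is built sequentially, in the same way as the data association prior of Eqs. 4.24 to 4.30. As an illustration of that sequential construction, the following minimal sketch evaluates one association term. It is an assumption-laden reading rather than the thesis implementation: the function name `association_prob` and its arguments are hypothetical, the overlap values are taken as precomputed numbers (the overlap measure itself is not specified here), and the third restriction is read as "no earlier detection associated to any object overlaps the current one".

```python
def association_prob(candidate, previous, overlaps, p_obj, th_ov):
    """Term p(a_{t,ia} = candidate | a_{t,1}, ..., a_{t,ia-1}).

    candidate: 0 for clutter, or an object index >= 1.
    previous:  earlier associations a_{t,1}, ..., a_{t,ia-1}.
    overlaps:  overlaps[k] is the overlap between the current detection
               and the k-th earlier detection (hypothetical precomputed values).
    """
    p_clu = 1.0 - p_obj
    # cond1 (second restriction): the object has not been taken already.
    cond1 = all(a != candidate for a in previous)
    # cond2 (third restriction): no earlier object-associated detection
    # shares image regions with the current detection.
    cond2 = all(not (a != 0 and ov >= th_ov)
                for a, ov in zip(previous, overlaps))
    if candidate == 0:
        # Clutter: cond1 is not imposed (Eq. 4.30).
        return p_clu if cond2 else 1.0
    # Object association (Eq. 4.27).
    return p_obj if (cond1 and cond2) else 0.0

# Hypothetical values: first detection already associated to object 1.
free_object = association_prob(2, [1], [0.1], p_obj=0.8, th_ov=0.5)
taken_object = association_prob(1, [1], [0.0], p_obj=0.8, th_ov=0.5)
overlapping = association_prob(2, [1], [0.9], p_obj=0.8, th_ov=0.5)
forced_clutter = association_prob(0, [1], [0.9], p_obj=0.8, th_ov=0.5)
```

With an empty `previous` list the function reduces to Eqs. 4.25 and 4.26, since both conditions hold trivially.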

The transition pdf on $o_t$ can be reformulated in a more convenient and compact way using the previous equations as

\[ \begin{aligned} p(o_t | x_{t-1}) &= \prod_{i_x \in S_{vis}} p(o_{t,i_x} = 0 | o_{t,1}, \ldots, o_{t,i_x-1}) \cdot \prod_{i_x \in S_{occ}} p(o_{t,i_x} = i'_x | x_{t-1}, o_{t,1}, \ldots, o_{t,i_x-1}) \\ &= \prod_{i_x \in S_{vis}} p(o_{t,i_x} = 0 | o_{t,1}, \ldots, o_{t,i_x-1}) \cdot \prod_{i_x \in S_{occ}^{2,3}} \mathcal{N}(G x_{t-1,i_x}, G x_{t-1,i'_x}, \Sigma_{occ}) \cdot \prod_{i_x \in S_{occ}^{oth}} \mathcal{N}(G x_{t-1,i_x}, G x_{t-1,i'_x}, \Sigma_{occ}) \cdot \beta(p_b) \\ &= \prod_{i_x \in S_{vis}} p(o_{t,i_x} = 0 | o_{t,1}, \ldots, o_{t,i_x-1}) \cdot \beta(p_b)^{|S_{occ}^{oth}|} \cdot \mathcal{N}(G' x_{t-1}, G'' x_{t-1}, \Sigma'_{occ}), \end{aligned} \tag{4.37} \]

where $S_{vis}$ is the set of indexes related to visible objects, i.e. not occluded; $S_{occ}^{2,3}$ is the set of indexes related to occluded objects that fulfill the logical conditions $cond_2$ and $cond_3$; $S_{occ}^{oth}$ is the set of indexes related to occluded objects that do not fulfill any logical condition; $|S_{occ}^{oth}|$ is the cardinality of the set $S_{occ}^{oth}$; $G'$ and $G''$ are matrices that associate the position information between the occluded and occluding objects, and they are formed by the proper combination of the matrix $G$; and $\Sigma'_{occ}$ is a covariance matrix derived from the matrix $\Sigma_{occ}$. For the sake of a clearer notation, the product of probability terms related to visible objects is represented by $p(o_{t,S_{vis}})$, and thus the expression of the transition pdf on $o_t$ becomes

\[ p(o_t | x_{t-1}) = p(o_{t,S_{vis}})\, \beta(p_b)^{|S_{occ}^{oth}|} \cdot \mathcal{N}(G' x_{t-1}, G'' x_{t-1}, \Sigma'_{occ}). \tag{4.38} \]

4.2.2 Likelihood

The likelihood on $\{x_t, a_t\}$ can be expressed as a product of independent factors as

\[ p(z_t | x_t, a_t) = \prod_{i_a=1}^{N_{ms}} p(z_{t,i_a} | x_t, a_{t,i_a}), \tag{4.39} \]

.ix )N(Hzt. The matrices J and J select the identifiers of the detection and the tracked object from zt. . the detection and the associated object must be of the same type. (4. Nobj }. .   (4.ia .ix ) is a Dirac delta function that imposes that a detection can be only associated with an object that shares the same identifier.ia |xt . respectively.4.ia ) = δ(Jzt.ia = 0.42) and   H=    1 0 0 0 0 (4. at.40) where ix ∈ {1.ix . i.ix . if at.   (4.43)   H =   ′ 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0   .ia − J xt. The term δ(Jzt. Σlh ) dclu ′ ′ ′ if at. H xt. .ia − J xt.41) The matrices H and H relate the position and weight information of the detection to those one of the object state and are given by  1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0       0 0 0 0 0  ′   ′ J =   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   .   0 0 0 0 1  (4. Each factor computes the likelihood of a specific detection as p(zt.ia and xt.2 Bayesian tracking model for multiple interacting objects since the association between detections and objects is already given.44) 83 . and are defined as  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1  ′ and   J=    0 0 0 0 0   .ia = ix .e.

Notice that, while the position information of the detection is compared with the predicted object position, the weight information of the detection is evaluated without an explicit reference related to the tracked object. This is because, in the context of the current visual tracking application, the weight information measures the similarity between an image region and the theoretical appearance of a type of object, which is assumed to be constant over time (more details are given in Sec. 4.4). In this way, a weight value of 1 means that the appearance of the image region under consideration is exactly the same as that of the object type, while a value of 0 means that they are completely different. Lastly, $\Sigma_{lh}$ is a covariance matrix that models the uncertainty of the detection process.

The likelihood on $\{x_t, a_t\}$ can be reformulated in a more convenient and compact way using the previous equations as

\[ \begin{aligned} p(z_t | x_t, a_t) &= \prod_{i_a \in S_{clu}} p(z_{t,i_a} | a_{t,i_a} = 0) \prod_{i_a \in S_{obj}} p(z_{t,i_a} | x_{t,i_x}, a_{t,i_a} = i_x) \\ &= \prod_{i_a \in S_{clu}} p(z_{t,i_a} | a_{t,i_a} = 0)\, \mathcal{N}(H'' z_t, H''' x_t, \Sigma'_{lh}), \end{aligned} \tag{4.45} \]

where $S_{clu}$ is the set of indexes that associate a detection with the clutter, $S_{obj}$ is the set of indexes that associate a detection with an object, and $H''$ and $H'''$ are matrices that respectively select the position information of all the detections and objects. These matrices are formed by the proper combination of the matrices $H$ and $H'$. Finally, $\Sigma'_{lh}$ is a covariance matrix that models the uncertainty of the detections, and it is formed by means of the combination of $\Sigma_{lh}$. In order to maintain a clear notation, the product of terms involved with clutter associations is renamed as $p(z_{t,S_{clu}})$, and thus the likelihood on $\{x_t, a_t\}$ is expressed as

\[ p(z_t | x_t, a_t) = p(z_{t,S_{clu}})\, \mathcal{N}(H'' z_t, H''' x_t, \Sigma'_{lh}). \tag{4.46} \]
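The factorized likelihood of Eqs. 4.39 and 4.40 can be sketched in a few lines. This is only an illustrative simplification under stated assumptions, not the thesis implementation: only the 2-D position is compared (the weight term is omitted), the Dirac delta on identifiers is replaced by an equality test that returns zero on mismatch, the covariance is taken as diagonal, and all names (`likelihood`, `gaussian_diag`, `var_lh`, `d_clu`) are hypothetical.

```python
import math

def gaussian_diag(v, mean, var):
    """Gaussian density with diagonal covariance (var holds per-axis variances)."""
    q = sum((a - b) ** 2 / s for a, b, s in zip(v, mean, var))
    norm = math.sqrt((2.0 * math.pi) ** len(v) * math.prod(var))
    return math.exp(-0.5 * q) / norm

def likelihood(detections, objects, assoc, var_lh, d_clu):
    """Product of per-detection factors, in the spirit of Eqs. 4.39 and 4.40.

    detections: list of (position, identifier) pairs.
    objects:    dict mapping object index (>= 1) to (predicted position, identifier).
    assoc:      assoc[k] is 0 for clutter, or the object index of detection k.
    """
    total = 1.0
    for (z_pos, z_id), a in zip(detections, assoc):
        if a == 0:
            total *= d_clu                 # clutter factor
            continue
        x_pos, x_id = objects[a]
        if z_id != x_id:                   # stands in for the Dirac delta on identifiers
            return 0.0
        total *= gaussian_diag(z_pos, x_pos, var_lh)
    return total

# Hypothetical scene with one tracked object.
objects = {1: ((10.0, 5.0), "person")}
as_clutter = likelihood([((3.0, 3.0), "person")], objects, [0], (1.0, 1.0), 0.01)
on_target = likelihood([((10.0, 5.0), "person")], objects, [1], (1.0, 1.0), 0.01)
wrong_type = likelihood([((10.0, 5.0), "car")], objects, [1], (1.0, 1.0), 0.01)
```

Grouping the clutter factors and the object factors of this loop is what yields the compact form of Eq. 4.46.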

4.3 Approximate inference based on Rao-Blackwellized particle filtering

An efficient inference method based on particle filtering has been developed to accurately approximate the posterior pdf $p(x_t, a_t, o_t | z_{1:t})$. This method uses a variance reduction technique, called Rao-Blackwellization (89), to deal with the curse of dimensionality (90). This problem refers to the exponential increase in volume in a mathematical space related to the addition of extra dimensions, which in turn causes approximations with higher variance. This can be informally accounted for as follows. Consider a particle filter framework which simulates the posterior pdf by a set of $N_{sam}$ weighted samples, also called particles, as

\[ p(x_t, a_t, o_t | z_{1:t}) = \sum_{i=1}^{N_{sam}} w_t^{[i]}\, \delta\left(x_t - x_t^{[i]},\, a_t - a_t^{[i]},\, o_t - o_t^{[i]}\right), \tag{4.47} \]

where $\delta(x, y, z)$ is a multi-dimensional Dirac delta function, $\{x_t^{[i]} \,|\, i = 1, \ldots, N_{sam}\}$ are the samples of the object state, $\{a_t^{[i]} \,|\, i = 1, \ldots, N_{sam}\}$ are the samples of the data association, $\{o_t^{[i]} \,|\, i = 1, \ldots, N_{sam}\}$ are the samples of the object occlusion, and $\{w_t^{[i]} \,|\, i = 1, \ldots, N_{sam}\}$ are the weights of the samples. Both samples and weights are computed using a Monte Carlo strategy (see Sec. 3.1.1) that is theoretically able to obtain an approximation of $p(x_t, a_t, o_t | z_{1:t})$ which depends only on the number of samples, rather than on the dimensionality of the involved random variables. However, in practice, Monte Carlo based methods depend on the ability to sample from multivariate distributions, which becomes more and more difficult as the dimension of the random variables increases. This means that the accuracy of the sampling diminishes as the number of dimensions increases. For the ongoing multi-tracking application, the dimension of $\{x_t, a_t, o_t\}$ is very high: $7 \times N_{obj}$. This inevitably leads to approximations of low quality, i.e. with a high variance, unless a prohibitively high number of samples is used.

The Rao-Blackwellization technique overcomes this problem by assuming that the random variables have a special structure that allows analytically marginalizing out some of the variables conditioned on the remaining ones. The Rao-Blackwellization technique achieves a more accurate estimation than an approximation technique based solely on particle filtering. Since the marginalized variables are computed exactly, the posterior pdf will be more accurate. The rest of the variables still have to be

approximated by particle filtering, but the approximation will be more accurate since their dimension is lower. A formal demonstration of the variance reduction achieved by the Rao-Blackwellization technique can be found in (89).

The designed Bayesian model for multi-tracking has a special structure that allows marginalizing out the object state $x_t$ conditioned on $\{a_t, o_t\}$. Thus, the Rao-Blackwellization technique can be used to express the posterior pdf as

\[ p(x_t, a_t, o_t | z_{1:t}) = p(x_t | z_{1:t}, a_t, o_t)\, p(a_t, o_t | z_{1:t}), \tag{4.48} \]

where $p(x_t | z_{1:t}, a_t, o_t)$ is assumed to be conditionally linear Gaussian, and therefore has an analytical expression known as the Kalman filter (see Sec. 3.1.1). In order for the linear Gaussian assumption to hold, both the prior pdf $p(x_t | z_{1:t-1}, a_t, o_t)$ and the likelihood $p(z_t | x_t, a_t, o_t)$ must be conditionally linear Gaussian processes. The prior pdf $p(x_t | z_{1:t-1})$, which defines the object dynamics, is a linear non-Gaussian process due to the occlusions. But, if the occlusion information is given, such as in $p(x_t | z_{1:t-1}, a_t, o_t)$, the linear Gaussian condition is fulfilled. Similarly, the likelihood $p(z_t | x_t)$ is linear non-Gaussian because of the different possible combinations between detections and objects. But, if the data association is provided, such as in $p(z_t | x_t, a_t, o_t)$, the likelihood becomes linear Gaussian. The conditional posterior pdf over the object state is then given by

\[ p(x_t | z_{1:t}, a_t, o_t) = \mathcal{N}(x_t, \mu_t^x, \Sigma_t^x). \tag{4.49} \]

The derivation of the previous equation, along with the parameters $\mu_t^x$ and $\Sigma_t^x$, is described in Sec. 4.3.1.

The other probability term in Eq. 4.48, the posterior probability on $\{a_t, o_t\}$, can be expressed using Bayes' theorem as

\[ p(a_t, o_t | z_{1:t}) = \frac{p(a_t, o_t, z_t | z_{1:t-1})}{p(z_t | z_{1:t-1})} = \frac{p(z_t | z_{1:t-1}, a_t, o_t)\, p(a_t, o_t | z_{1:t-1})}{p(z_t | z_{1:t-1})}, \tag{4.50} \]

where $p(a_t, o_t | z_{1:t-1})$ is the prior pdf on $\{a_t, o_t\}$, $p(z_t | z_{1:t-1}, a_t, o_t)$ is the likelihood, and $p(z_t | z_{1:t-1})$ is just a normalization constant.

The prior pdf $p(a_t, o_t | z_{1:t-1})$ can be recursively expressed as

\[ \begin{aligned} p(a_t, o_t | z_{1:t-1}) &= \sum_{a_{t-1}} \sum_{o_{t-1}} p(a_t, o_t, a_{t-1}, o_{t-1} | z_{1:t-1}) \\ &= \sum_{a_{t-1}} \sum_{o_{t-1}} p(a_t, o_t | z_{1:t-1}, a_{t-1}, o_{t-1})\, p(a_{t-1}, o_{t-1} | z_{1:t-1}), \end{aligned} \tag{4.51} \]

where $p(a_t, o_t | z_{1:t-1}, a_{t-1}, o_{t-1})$ is the transition pdf on $\{a_t, o_t\}$, and $p(a_{t-1}, o_{t-1} | z_{1:t-1})$ is the posterior pdf at the previous time step.

The transition pdf on $\{a_t, o_t\}$ can be factorized as

\[ p(a_t, o_t | z_{1:t-1}, a_{t-1}, o_{t-1}) = p(a_t | z_{1:t-1}, a_{t-1}, o_{t-1})\, p(o_t | z_{1:t-1}, a_{t-1:t}, o_{t-1}), \tag{4.52} \]

where $p(a_t | z_{1:t-1}, a_{t-1}, o_{t-1})$ is the transition pdf over the data association, and $p(o_t | z_{1:t-1}, a_{t-1:t}, o_{t-1})$ is the transition pdf over the object occlusion.

The transition pdf over the data association can be simplified as

\[ p(a_t | z_{1:t-1}, a_{t-1}, o_{t-1}) = p(a_t), \tag{4.53} \]

where it has been taken into account that $z_{1:t-1}$, $a_{t-1}$, and $o_{t-1}$ are d-separated from $a_t$ by $z_t$ (see Sec. 6.1 for the definition of conditional independence and d-separation), since $z_t$ is a not given variable with incoming-incoming arrows on the path. These conditional independence properties are derived from the fact that the detections are unordered, which makes the previous data associations useless. The mathematical expression of $p(a_t)$ has already been introduced in Sec. 4.2.1.

On the other hand, the transition pdf over the object occlusion can be simplified as

\[ \begin{aligned} p(o_t | z_{1:t-1}, a_{t-1:t}, o_{t-1}) &= \int_{x_{t-1}} p(o_t, x_{t-1} | z_{1:t-1}, a_{t-1:t}, o_{t-1})\, dx_{t-1} \\ &= \int_{x_{t-1}} p(o_t | z_{1:t-1}, a_{t-1:t}, o_{t-1}, x_{t-1})\, p(x_{t-1} | z_{1:t-1}, a_{t-1}, o_{t-1})\, dx_{t-1} \\ &= \int_{x_{t-1}} p(o_t | x_{t-1})\, p(x_{t-1} | z_{1:t-1}, a_{t-1}, o_{t-1})\, dx_{t-1}, \end{aligned} \tag{4.54} \]

where the following conditional independence properties have been taken into account:

• $a_t$ is d-separated from $o_t$ by $z_t$, since $z_t$ is a not given variable with incoming-incoming arrows on the path.

• $z_{1:t-2}$ and $o_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with incoming-outgoing arrows on the path.

• $z_{t-1}$ and $a_{t-1}$ are d-separated from $o_t$ by $x_{t-1}$, since $x_{t-1}$ is a given variable with outgoing-outgoing arrows on the path.

These properties manifest the fact that the object occlusions only depend on the previous object positions. The mathematical expression of $p(o_t | x_{t-1})$ has already been introduced in Sec. 4.2.1, whereas the one for the conditional posterior pdf over the object state, $p(x_{t-1} | z_{1:t-1}, a_{t-1}, o_{t-1})$, appears in Sec. 4.3.1. Substituting these expressions in Eq. 4.54 gives

\[ p(o_t | z_{1:t-1}, a_{t-1:t}, o_{t-1}) = p(o_{t,S_{vis}})\, \beta(p_b)^{|S_{occ}^{oth}|} \cdot \int_{x_{t-1}} \mathcal{N}(G' x_{t-1}, G'' x_{t-1}, \Sigma'_{occ})\, \mathcal{N}(x_{t-1}, \mu_{t-1}^x, \Sigma_{t-1}^x)\, dx_{t-1}. \tag{4.55} \]

The other term in Eq. 4.50, the likelihood, can be expressed as

\[ \begin{aligned} p(z_t | z_{1:t-1}, a_t, o_t) &= \int_{x_t} p(z_t, x_t | z_{1:t-1}, a_t, o_t)\, dx_t \\ &= \int_{x_t} p(z_t | z_{1:t-1}, a_t, o_t, x_t)\, p(x_t | z_{1:t-1}, a_t, o_t)\, dx_t \\ &= \int_{x_t} p(z_t | a_t, x_t)\, p(x_t | z_{1:t-1}, o_t)\, dx_t, \end{aligned} \tag{4.56} \]

where the following conditional independence properties have been taken into account:

• $z_{1:t-1}$ and $o_t$ are d-separated from $z_t$ by $x_t$, since $x_t$ is a given variable with incoming-outgoing arrows on the path.

• $a_t$ is d-separated from $x_t$ by $z_t$, since $z_t$ is a not given variable with incoming-incoming arrows on the path.

The first property states the fact that only the data association information is necessary to evaluate the coherence between the predicted object positions and the detections.

Σlh )N xt . ot ). 4. µx . H xt . Σlh )N xt . rearranging some terms is obtained ′′ ′′′ ′ N(H zt . H xt .Sclu )N(H zt .1.3. Σx ) t t = ˆt ˆ t p(zt.1. Σlh )N xt . xt )p(xt |z1:t−1 . µx . xt ) is given in Sec. ˆt ˆ t (4.Sclu ) . substituting the expressions of the previous probability terms in Eq. Σt dxt . ot ) = p(zt |z1:t−1 . at . Then.2. µt . H xt . Σx ) t t xt ′′ ′′′ ′ ˆt ˆ t p(zt. Σx dxt . xt ) has been already introduced in Sec. at .61) 89 . and the prior pdf p(xt |z1:t−1 . the following expression for the likelihood is obtained p(zt |z1:t−1 . Σx dxt p(zt.Sclu ) xt ′′ ′′′ ′ N(H zt .2. xt )p(xt |z1:t−1 . ot ) = p(zt. H xt . µx .3. 4. at . at . Substituting their expressions in the previous equation is obtained N (xt . H xt . Σx ′′ ′′′ ′ ′′ ′′′ lh′ ˆx ˆ x xt p(zt. at . ot ) .3 Approximate inference based on Rao-Blackwellized particle filtering And. representing the conditional prior pdf over the object state. at . Σlh )N xt . 4. µx .2. is derived in Sec.56. ot .57) This expression can be further simplified by deriving a solution for the involved integral. µx . µx . ot . the data association is useless for predicting the object positions. Σx ˆt ˆ t ′′ ′′′ ′ (4. ot . ot ).59) where the likelihood p(zt |z1:t−1 . if the detections at the current time are not available. xt p(zt |z1:t−1 .60) Then. the conditional posterior pdf over the object state can be factorized as p(xt |z1:t .Sclu )N(H zt . µx .4. (4. Σ )N xt .Sclu )N(H zt . Σlh )N xt . The probability term p(zt |at . 4. (4. as demonstrated as follows. Using the Bayes’ theorem. at . The solution is based on the fact that N(H zt . the second one sets out that.58) is proportional to p(xt |z1:t . Σx = ˆt ˆ t = N (xt . at . and the expression for p(xt |z1:t−1 . H xt . ot )dxt (4. the conditional posterior pdf over the object state. at . ot ) in Sec. 4.2.

noticing that p(zt. at . since it does not depend on xt . Σx t t ˆt t kp = = N (µx . 4. Σx ) dxt = k p . Σlh )N µx . µt . Σx . t (4. µx . the integral in Eq. Σx . H µx . µt . Σlh )N µx .Sclu ) ′′ ′′′ ′ ˆ (2π)n det(Σx ) · N(H zt . ot |z1:t−1 ). µx . the probability term p(zt |z1:t−1 ) in Eq.66) 90 . t t ˆt t t (4.4.50.Sclu ) acts like a constant. µx . ot ) = p(zt. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS where the integral is just a normalization constant. µx . substituting the integral in Eq. Σx ) t t t = (4. Finally. Σlh )N xt .65) Lastly. H µx . H xt .57 can be solved as ′′ ′′′ ′ N(H zt . Σx = k p N (xt . the following expression for the likelihood pdf on {at . Σlh )N µx . µx . the proportionality statement is demonstrated ′′ ′′′ ′ N(H zt . ot } is achieved p(zt |z1:t−1 . 4. the computation is simplified by t t t ′′ ′′′ ′ ˆ N(H zt .57 by its solution. Using the previous results. (4.62) The value of k p can be obtained by evaluating the previous expression on a specific object state x∗ . ˆt ˆ t t t (4. Using the mean x∗ = µx . at . which corresponds to a normalization constant. Σx ) is equal to one.Sclu )k p = = p(zt. Σx dxt = ˆt ˆ t xt = kp xt x N (xt . 4. is given by p(zt |z1:t−1 ) = at ot p(zt |z1:t−1 .64) where it has been taken into account that the marginalization of the posterior pdf N (xt . H µx .63) ′′ ′′′ ′ x ˆ (2π)n det(Σx ) · N(H zt . t t ˆ t t where n is the dimension of xt and det() is the determinant operator. t t Finally. µx . ot )p(at . Σx ) . H xt . Σlh )N xt . µx .
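The closed-form solution of the integral in Eq. 4.57 can be checked numerically in a scalar setting. The sketch below is only illustrative and rests on several assumptions that are not part of the thesis: a single one-dimensional object, identity observation matrices, a standard scalar Kalman update to obtain the posterior moments, and hypothetical function names. It evaluates the constant $k_p$ at the posterior mean, as described above, and compares it with a brute-force numerical integration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x, mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def kalman_update_1d(z, prior_mean, prior_var, lh_var):
    """Scalar Kalman update (assumption: observation matrix equal to 1)."""
    gain = prior_var / (prior_var + lh_var)
    post_mean = prior_mean + gain * (z - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


def integral_closed_form(z, prior_mean, prior_var, lh_var):
    """k_p: product of likelihood and prior evaluated at the posterior mean,
    rescaled by sqrt(2*pi*posterior variance)."""
    post_mean, post_var = kalman_update_1d(z, prior_mean, prior_var, lh_var)
    return (math.sqrt(2.0 * math.pi * post_var)
            * normal_pdf(z, post_mean, lh_var)
            * normal_pdf(post_mean, prior_mean, prior_var))


def integral_numeric(z, prior_mean, prior_var, lh_var, half_width=12.0, steps=20000):
    """Midpoint-rule integration of likelihood * prior over the object state."""
    std = math.sqrt(prior_var)
    lower = prior_mean - half_width * std
    dx = 2.0 * half_width * std / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * dx
        total += normal_pdf(z, x, lh_var) * normal_pdf(x, prior_mean, prior_var)
    return total * dx


if __name__ == "__main__":
    z, prior_mean, prior_var, lh_var = 1.3, 0.5, 2.0, 0.5
    print(integral_closed_form(z, prior_mean, prior_var, lh_var))
    print(integral_numeric(z, prior_mean, prior_var, lh_var))
```

In this scalar case both values also coincide with the Gaussian density of the detection with mean equal to the predicted mean and variance equal to the sum of the predicted and likelihood variances, which is the usual form of the marginal likelihood in a linear Gaussian model.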

Since the posterior pdf on $\{x_t, a_t, o_t\}$ does not have an analytical form, neither does the posterior pdf on $\{a_t, o_t\}$. The Rao-Blackwellization technique just decomposes a joint probability into two parts, one with an analytical expression and another without it. Therefore, the use of approximate inference is still necessary. For this purpose, a new particle filtering framework has been developed to approximate the posterior pdf on $\{a_t, o_t\}$. The particle filtering framework is described in detail in Sec. 4.3.2, in which the following approximation based on a weighted sum of samples is derived

$$p(a_t, o_t \mid z_{1:t}) = \sum_{i=1}^{N_{sam}} w_t^{[i]}\, \delta\big(a_t - a_t^{[i]},\, o_t - o_t^{[i]}\big), \tag{4.67}$$

where the expressions for the samples and weights can also be found in Sec. 4.3.2.

Using the above approximation of $p(a_t, o_t \mid z_{1:t})$, the posterior pdf $p(x_t, a_t, o_t \mid z_{1:t})$ can be expressed as

$$\begin{aligned} p(x_t, a_t, o_t \mid z_{1:t}) &= p(x_t \mid z_{1:t}, a_t, o_t)\, p(a_t, o_t \mid z_{1:t}) \\ &= \sum_{i=1}^{N_{sam}} p(x_t \mid z_{1:t}, a_t, o_t)\, w_t^{[i]}\, \delta\big(a_t - a_t^{[i]},\, o_t - o_t^{[i]}\big) \\ &= \sum_{i=1}^{N_{sam}} w_t^{[i]}\, \delta\big(a_t - a_t^{[i]},\, o_t - o_t^{[i]}\big)\, \mathcal{N}\big(x_t, \mu_t^{x[i]}, \Sigma_t^{x[i]}\big). \end{aligned} \tag{4.68}$$

This expression represents a mixture of multivariate Gaussians, where the parameters of each Gaussian are determined by the estimated object occlusions and detection associations. The computational complexity is linear in the number of detections and objects, rather than exponential, which makes the inference computationally tractable.

4.3.1 Kalman filtering of the object state

Assume that the conditional posterior pdf over the object state at the time step $t-1$ is a multivariate Gaussian distribution given by

$$p(x_{t-1} \mid z_{1:t-1}, a_{t-1}, o_{t-1}) = \mathcal{N}\big(x_{t-1}, \mu^x_{t-1}, \Sigma^x_{t-1}\big), \tag{4.69}$$

where $\mu^x_{t-1}$ is the mean, and $\Sigma^x_{t-1}$ is the covariance matrix. The mean can be divided into as many components as tracked objects

$$\mu^x_{t-1} = \{\mu^x_{t-1,i_x} \mid i_x = 1, \ldots, N_{obj}\}, \tag{4.70}$$

in such a way that each component contains the information relative to one object. In the same way, the covariance matrix can be expressed as

$$\Sigma^x_{t-1} = \begin{pmatrix} \Sigma^x_{t-1,1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \Sigma^x_{t-1,N_{obj}} \end{pmatrix}, \tag{4.71}$$

where the elements $\{\Sigma^x_{t-1,i_x} \mid i_x = 1, \ldots, N_{obj}\}$ are matrices that handle each component of the object state independently. This covariance arrangement reflects the fact that object trajectories are independent once the object occlusions and the detection associations are known.

The posterior pdf $p(x_t \mid z_{1:t}, a_t, o_t)$ for the current time step can be recursively computed in two steps: prediction and update. The first step predicts the evolution of $x_{t-1}$ at the current time step by means of the following expression, called prior pdf,

$$p(x_t \mid z_{1:t-1}, a_t, o_t) = p(x_t \mid z_{1:t-1}, o_t) = \mathcal{N}\big(x_t, \hat{\mu}^x_t, \hat{\Sigma}^x_t\big), \tag{4.72}$$

where it has been taken into account that $a_t$ is d-separated from $x_t$ by $z_t$ (see Sec. 6.1 for the definition of conditional independence and d-separation), since $z_t$ is a variable that is not given and has incoming-incoming arrows on the path. This property reflects the fact that, if the detections at the current time step are not available, the data association is useless for predicting object positions. The mean parameter can be decomposed into

$$\hat{\mu}^x_t = \{\hat{\mu}^x_{t,i_x} \mid i_x = 1, \ldots, N_{obj}\}, \tag{4.73}$$

where each component involves a different tracked object, and it is mathematically given by

$$\hat{\mu}^x_{t,i_x} = \begin{cases} A\,\mu^x_{t-1,i_x} & \text{if } o_{t,i_x} = 0, \\ A\,\mu^x_{t-1,i'_x} & \text{if } o_{t,i_x} = i'_x, \end{cases} \tag{4.74}$$

On the other hand. (4.ia − H µt.i ˆ H′ Σx x H′ ⊤ + Σlh t.3 Approximate inference based on Rao-Blackwellized particle filtering where the matrix A has been already introduced in Sec.1. 4. .76) The matrices Σvis and Σocc have been already introduced in Sec..2. 4. ′ if ot. Nobj } are given by t. .N   Σx =  t  . Lastly.2.79) where the matrices H and H have been already defined in Sec.ix = 0. at .81) 93 . (4. (4.4. .i ′ ′ if at. t t The mean is expressed by x µx = {µt. .78) and its components by µx x = t. . 4.i .i x  .1 ˆx =  Σt  0 0 .i Σvis + AΣx x A⊤ t−1. 4.1. . Nobj }. t (4. if at.75) if ot. Σx ) . ot ) = N (xt . and the variable Kt is the Kalman gain given by Kt = ′ ˆ Σx x H ⊤ t.2. .i ˆ Σx x = t. (4. µx .77) (4. ..2. the covariance matrix is defined as  ˆx Σt.i Σocc + AΣx ′ A⊤ t−1. . the covariance matrix is given by  Σx t. The second step uses the set of available detections at the current time step to update the previous prediction as p(xt |zt . Σx obj t. (4.ix |ix = 1.ia = ix .ix = ix .1 0 0 .i ˆx µx x + Kt (Hzt.2. .2.N  ˆ where the matrix elements {Σx x |ix = 1.ia = 0.i µx x ˆt. ˆ Σx obj t.80) where the covariance matrix Σlh related to the detection model has been already introduced in Sec.ix ) ˆt.

. ot−1 |z1:t−1 ). . . since it would imply that the analytical expression of p(at . ot } is simulated by a set of Nsam weighted samples.2 Particle filtering of the data association and object occlusion The posterior pdf on {at . at .83) where δ(x. . 3. The choice of the proposal distribution has been based on the criteria of diminishing the degeneracy problem.1.i x ˆ Σ t. . Nobj } are given by t. Consider the optimal proposal distribution [i] [i] [i] [i] q(at . as Nsam p(at .ia = ix . {at . (4. ot |z1:t−1 . ot |i = 1. which is achieved by means of the minimization of the variance of the weights wt . at−1 . called also importance density.i x Σt. ot − o t [i] [i] [i] [i] . . (4. The proposed solution has consisted in using the approximation of the posterior pdf at the previous time step p(at−1 . Samples and weights are computed using the principle of importance sampling (see Sec. ot |z1:t−1 ) = = p(zt |z1:t−1 . is used to draw the samples {at . ot )p(at . ot |z1:t ) = i=1 w t δ a t − a t . ot |z1:t ) is known. Theoretically.i t. . [i] (4. 4. if at.ix = ′ ˆ ˆ Σx x − Kt H Σx x t.84) 94 . Nsam } are sample weights.82) Notice that the posterior pdf over the objects that have not been associated to a detection is determined by the prediction. this is an unfeasible solution. .ix if at. . .ia = 0. the proposal distribution that attains this goal should be proportional to p(at .4. ot−1 )p(at−1 . . ot |z1:t ). in which a proposal distribution q(at . ot }. ot |z1:t ) ∝ p(at .1). Nsam } are the samples of data association and object occlusion. ot−1 |z1:t−1 ) to approximate in turn the posterior pdf at the current time step. also called particles. ot ) at−1 ot−1 p(at . ot |z1:t ) since in this case the variance would be zero. at .3. But. and {wt |i = 1. ot |z1:t ) ∝ p(zt |z1:t−1 . . This process is explained as follows. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS where the matrix elements {Σx x |ix = 1. y) is a multi-dimensional Dirac delta function.

3 Approximate inference based on Rao-Blackwellized particle filtering where the posterior pdf at the previous time step p(at−1 . ot |z1:t ) is based on the fact that the object occlusion samples ot can be drawn independently of the data association samples at .88) q at−1 . ot−1 )dxt−1 · [i] [i] [i] · δ at−1 − at−1 . the proposal distribution can be recursively expressed as Nsam q(at . This dramatically reduces the complexity of the sampling task.87) The developed method for drawing samples from q(at . ot )dxt · p(at )· wt−1 i=1 [i] · xt−1 [i] p(ot |xt−1 )p(xt−1 |z1:t−1 .89) q ot = xt−1 p(ot |xt−1 )p(xt−1 |z1:t−1 .86) where all the probability terms have been already presented in previous sections. The substitution of these terms by their expressions leads to q(at . ot |z1:t ) = p(zt |z1:t−1 . the proposal distribution can be factorized as q(at . ot−1 − ot−1 . at . ot ) i=1 wt−1 p(at . ot |z1:t ) = xt Nsam p(zt |at . xt )p(xt |z1:t−1 .90) 95 .4.ot−1 · q ot · q at where Nsam [i] [i] [i] [i] [i] (4. at−1 . ot−1 |z1:t−1 ) has been already approximated by a sum of weighed particles Nsam [i] [i] [i] p(at−1 . ot |z1:t−1 . at−1 . ot |z1:t ) = q at−1 . [i] [i] (4. ot−1 )dxt−1 . ot−1 )· − ot−1 . ot−1 − ot−1 . at−1 .ot−1 = i=1 wt−1 δ at−1 − at−1 . ot−1 |z1:t−1 ) = i=1 wt−1 δ at−1 − at−1 . According to this fact. ot−1 (4. [i] [i] [i] [i] · δ at−1 − [i] at−1 .85) Using this approximation. (4. ot−1 − ot−1 . (4. (4.

Σx dxt−1 .Svis )β(pb )|Soth | N(G µx .ot−1 . Σ occ )N xt−1 . a tractable solution for the involved integral has to be found.93) Using this solution for the integral. t−1 t−1 occ ′ ′′ ′ (4. an object occlusion sample ot is drawn from q ot . µx . G xt−1 . the integral can be satisfactorily approximated by only one sample. Σ occ )N xt−1 . the mean of the Gaussian function. t−1 t−1 ′ ′′ ′ (4. Firstly. Then. [i] q ot = p(ot. Σ occ ). ot } starts by drawing a sample {at−1 . the probability terms involved in the integral are substituted by their expressions. giving as a result ′ ′′ ′ N(G xt−1 . t−1 t−1 In this case. [i] [i] [i] (4. considering that the sampling function is Gaussian. given in previous sections. an approximate one is computed by means of Monte Carlo simulation. G xt−1 . the occlusion sampling function can be expressed as q ot ≈ p(ot. xt )p(xt |z1:t−1 . This technique approximates the mean of the function N(G xt−1 . G µx . For this purpose. µx .37) 96 .4. Σx dxt−1 ≈ t−1 t−1 xt−1 x ≈ N(G µt−1 .Svis )β(pb )|Soth | · · xt−1 occ N(G xt−1 .91) [i] The process to obtain a new sample {at . 4. G µx .94) In order to efficiently perform the sampling. µx . ot )dxt . which is straightforward since it is a discrete probability. Σ occ ). t−1 ′ ′′ ′ ′ ′′ ′ (4. the previous expression is factorized as (see Eq. conditioned on this sample. Σ occ ) by a sum of samples drawn from N xt−1 . G xt−1 . Σx . ot−1 } from q at−1 .92) Since the integral has not analytical solution. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS and q at = p(at ) xt p(zt |at .

. .ix −1 ) = [i] [i] 1 dvis [i] if ∃ix < ix |ot. if ix = 1. . the sampling task can be easily performed using the technique of ancestral sampling (91). (4. Gµx x .96) (4. .4. .56 and 4. 4.34) p(ot. Nobj }. ot.1 t−1. .i x (4. a data association sample at is drawn from q at . Σocc ) t−1.99) Lastly. Σocc )· t−1. Σ )   occ N(Gµx ′ . ot. Gµt−1. ot−1 } and ot . The mathematical expression of q at is obtained substituting the integral in Eq 4. .97) And the rest of components are drawn from the following set of discrete probabilities (see Eqs.i′x .ix = 0|ot.ix . Σx . if ix = 1.3 Approximate inference based on Rao-Blackwellized particle filtering q ot = ix ∈Svis p(ot. Σ )β(pb ) t−1.ix .i ix ∈Socc 2. 4.1 . if cond2 ∨ cond3 .91 by its solution (see Eqs. Σocc ) · β(pb ).i t−1. .35 and 4. x ′ rest of cases. .1 . µx . t ˆt t t (4.i t−1.65) q at = p(at )p(zt.ix |ix = 1. 4.Sclu ) ′′ ′′′ x ′ ˆ (2π)n det(Σx ) · N(H zt . The first component is drawn from the following set of discrete probabilities (see Eqs.32 and 4. Gµx t−1. conditioned on {at−1 .1 . t−1.36) p(ot.1 = ix ) = β(pb )N(Gµx x .ix = 0|ot. .98) p(ot. which sequentially samples the components of the object occlusion {ot.i′ = ix . Σlh )N µx .95) Thus. H µt . . ot. Gµx . rest of cases.ix = ix |ot. Gµx x .i x ′ if ix = ix ∨ cond1 .ix −1 )· N(Gµx ′ . ′ (4.i 0 p(ot. .3 x · · ix ∈Socc oth N(Gµx ′ . [i] (4.1 = 0) = dvis .ix −1 ) =  0   occ x x = N(Gµt−1. . . . .100) 97 .
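The ancestral sampling procedure used above for the object occlusion can be sketched generically: each component is drawn from a discrete distribution conditioned on the components that have already been drawn. The code below is a minimal illustration under stated assumptions. The function names, the numeric probabilities, and the toy rule (an object cannot be occluded by itself nor by an object already sampled as occluded) are hypothetical and do not reproduce the actual discrete probabilities of Eqs. 4.96 to 4.99.

```python
import random

NUM_OBJECTS = 3  # hypothetical number of tracked objects


def ancestral_sample(num_components, conditional, rng):
    """Draw the components one by one, each conditioned on the previous ones.

    `conditional(index, previous)` must return a dict that maps every
    admissible value of the component to an unnormalized probability.
    """
    sample = []
    for index in range(num_components):
        weights = conditional(index, tuple(sample))
        values = list(weights)
        probs = [weights[value] for value in values]
        sample.append(rng.choices(values, weights=probs, k=1)[0])
    return sample


def toy_occlusion_conditional(index, previous):
    """Toy conditional: value 0 means visible, value j means occluded by object j."""
    obj = index + 1
    weights = {0: 0.7}  # hypothetical weight of the 'visible' event
    for other in range(1, NUM_OBJECTS + 1):
        if other == obj:
            continue  # an object cannot occlude itself
        already_occluded = other <= len(previous) and previous[other - 1] != 0
        if not already_occluded:
            weights[other] = 0.1  # hypothetical weight of 'occluded by other'
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(ancestral_sample(NUM_OBJECTS, toy_occlusion_conditional, rng))
```

Values with zero probability are simply left out of the dictionary, so the constraints encoded in the conditional distributions are satisfied by construction in every drawn sample.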

. . at.ia = ix |at. . ot |i = 1. which sequentially draws the sample components {at. .ia −1 ) ia ′ p(zt.ia |ia = 1. . Nsam } have been drawn by the previous method. .1 = ix ) = pobj δ(Jzt. Nms }.ia |at.ia −1 ) = = pobj δ(Jzt.105) where all of the parameters and conditions have been already introduced in previous sections.ia . . and the rest of parameters have been already introduced in previous sections.102) p(at. (4.ix . Once all the samples {at . rest of cases. at.i ˆ x t. H xt.ia = 0)· ∈Sclu · ia ∈Sobj ′ x (2π)n det(Σx x ) · N(Hzt. (4. . . (4. .1 = 0) = pclu dclu .ix . H xt. .1 .ia = 0|at. H µx x .1 .i (4.1 − J xt. . .4. . the corresponding unnormalized weights are computed as [i] [i] 98 .1 . µx x .ix )N(Hzt.1 .103) and the rest of sample components are drawn from the following set of discrete probabilities p(at. .ix . .ix . The first component is sampled using the following set of discrete probabilities ′ ′ ′ p(at. q at is factorized as Nms q at = ia =1 p(at.104) p(at. rest of cases. Σt. .ix )N(Hzt. Σlh ) 0 pclu dclu dclu ′ ′ if cond1 ∧ cond2 . . BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS In order to efficiently draw a data association sample. . A sample can be now easily obtained by means of the ancestral sampling.ia |at. (4.ia .ia −1 ) = if cond2 .ia − J xt.101) where n is the size of the dimension of one component of the object state. Σlh ). . ˆt. at. Σlh )N µt.i t.

$$w_t^{*[i]} = \frac{p\big(z_t \mid z_{1:t-1}, a_t^{[i]}, o_t^{[i]}\big)\, p\big(a_t^{[i]}, o_t^{[i]} \mid z_{1:t-1}\big)}{q\big(a_t^{[i]}, o_t^{[i]} \mid z_{1:t}\big)} = p(z_t \mid z_{1:t-1}), \tag{4.106}$$

that is, as the ratio between the posterior pdf $p(a_t^{[i]}, o_t^{[i]} \mid z_{1:t})$, evaluated up to its normalization constant, and the proposal distribution. Notice that all the weights have the same value, since it does not depend on the specific drawn sample $\{a_t^{[i]}, o_t^{[i]}\}$. This fact simplifies the computation of the normalized weights to

$$w_t^{[i]} = \frac{w_t^{*[i]}}{\sum_{i=1}^{N_{sam}} w_t^{*[i]}} = \frac{1}{N_{sam}}. \tag{4.107}$$

Notice that an explicit resampling stage is not necessary, since it has been implicitly performed by drawing a sample $\{a_{t-1}^{[i]}, o_{t-1}^{[i]}\}$ from the distribution $q_{a_{t-1},o_{t-1}}$.

4.4 Object detections

Object detections $z_t$ are obtained by a set of detectors, each one focused on detecting a specific object category. The detectors describe each object category by means of a color representation based on an RGB histogram that captures its appearance. The RGB histogram is computed using a set of representative image patches of the specific object category. These patches are the training data for learning the parameters of the object model, i.e. the bins of the color histogram. The RGB histogram is mathematically expressed as

$$h_c(i) = n_i, \quad i = 1, \ldots, N_b, \tag{4.108}$$

where $n_i$ is the number of pixels whose values belong to the $i$th histogram bin, and $N_b$ is the total number of bins. Fig. 4.8 shows the RGB color histograms of two different object categories related to a football video sequence. Fig. 4.8(a) contains an image of a football match with two teams, which are the two object categories. Fig. 4.8(b) and (c) show the color histograms of the red dressed team, and the black and white dressed team, respectively. The color histogram is represented in a three-dimensional vector space, where each axis is an RGB color channel. The bins are represented by spheres, where the center indicates the RGB coordinates and the radius is proportional to the bin count.
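As an illustration of Eq. 4.108, the following sketch accumulates an RGB histogram from the pixels of the training patches. It is not the implementation used in the thesis: the uniform quantization, the number of bins per channel, the bin ordering, and the function name are assumptions made only for this example.

```python
def rgb_histogram(pixels, bins_per_channel=8):
    """Count pixels per RGB bin, i.e. h_c(i) = n_i for i = 1..N_b.

    `pixels` is an iterable of (r, g, b) tuples with 8-bit channel values,
    gathered from all the patches of one object category.
    The histogram has N_b = bins_per_channel ** 3 bins (an assumption).
    """
    counts = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Uniform quantization of each channel from [0, 255] to [0, bins - 1].
        rq = r * bins_per_channel // 256
        gq = g * bins_per_channel // 256
        bq = b * bins_per_channel // 256
        index = (rq * bins_per_channel + gq) * bins_per_channel + bq
        counts[index] += 1
    return counts


if __name__ == "__main__":
    patch = [(255, 0, 0), (250, 10, 5), (0, 0, 0), (255, 255, 255)]
    print(rgb_histogram(patch, bins_per_channel=2))
```

With two bins per channel, the two reddish pixels of the usage example fall into the same bin, while the black and white pixels fall into the first and last bins, respectively.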

Figure 4.8: Example of color histograms of two object categories. (a) Image of a football match. (b) Color histogram of the red dressed team. (c) Color histogram of the black and white dressed team.

The search for object candidates in each time step, i.e. the detection process, is performed by computing the similarity between the histograms of the object categories and the histograms of the image regions of a frame. The similarity measure is computed through the Bhattacharyya coefficient

$$\rho_B(h_c, h_r) = \sum_{i=1}^{N_b} \sqrt{h^n_c(i) \cdot h^n_r(i)}, \tag{4.109}$$

where $h_c$ is the learned color histogram of one of the object categories, and $h_r$ is the color histogram of the candidate image region (the superscript $n$ is taken here to denote histograms normalized to unit sum, which the stated range of values requires). The range of values of the Bhattacharyya coefficient is $[0, 1]$, where values closer to 1 indicate that a candidate image region is more similar to the object category.

Fig. 4.9 shows two examples of similarity maps for the two object categories that correspond to the football teams of Fig. 4.8(a). Fig. 4.9(a) shows the similarity map related to the football team dressed in red, and Fig. 4.9(b) shows the one related to the team dressed in white and black. The X-Y coordinates of each similarity map coincide with the X-Y coordinates of the image frame, while the Z coordinates are the values of the Bhattacharyya coefficient. The original image has been projected over the similarity map as a visual aid to easily correlate the peaks of the similarity map with the image regions. Note that the Bhattacharyya coefficient relative to specific coordinates $(x_1, y_1)$ of the similarity map has been computed using a rectangular image region that is centered at those coordinates, and whose size is the same as that of the rectangles used to compute the color histograms of the different object categories. Notice that the main peaks are correctly located over the football players. There are also image regions with a high similarity value due to similar objects in the background, such as the white lines of the pitch, which are quite similar to the white t-shirts of one of the football teams.

The detections are finally obtained by selecting the image regions with an acceptable similarity value. The selection is accomplished by first performing a non-local maxima suppression in order to obtain the peaks of the similarity map, and then discarding the peaks whose value is below a threshold $th_{sim}$. Thus, the selected peaks correspond to the image regions that are likely to belong to one of the object categories. Fig. 4.10(a) shows the resulting detections for the red dressed team over the similarity map marked by circles, and Fig. 4.10(b) the detections over the original image marked

Figure 4.9: Similarity maps of the color histograms of object categories and candidate image regions. (a) Similarity map relative to the team dressed in red. (b) Similarity map relative to the team dressed in black and white.

by crosses. Notice that there is a missing detection as a result of a severe partial occlusion. Similarly, Fig. 4.11 shows the detections for the black and white dressed team. In this case, there are several false detections due to the background clutter (white lines of the pitch) and other similar moving objects (the referees).

As a result, a set of detections is obtained, $z_t = \{z_{t,i_z} \mid i_z = 1, \ldots, N_{ms}\}$, where each detection $z_{t,i_z} = \{r^z_{t,i_z}, w^z_{t,i_z}, id^z_{t,i_z}\}$ contains respectively the 2D coordinates of the detection over the image plane, the Bhattacharyya coefficient that is used as a quality value, and an identifier that associates each detection with the object detector that generated it.

All the aforementioned problems in the detection acquisition are typical of a visual object detector, and they must be appropriately handled by the tracking algorithm. In this sense, the developed multi-object tracking system explicitly addresses these problems to correctly compute the object trajectories.

4.5 Results

The performance of the developed Bayesian tracking method for multiple interacting objects has been evaluated using a public database called ‘VS-PETS 2003’, which was officially presented for the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. This dataset consists of several video sequences of a football match in which football players move around the pitch. The goal is to compute the object trajectories of the football players of both teams: one dressed in red and the other in black and white. The dataset is very appropriate to validate the developed multi-object tracking algorithm because of the challenging interactions between the football players, which go beyond simple object crosses.

There are two different object detectors, described in Sec. 4.4, each one focused on detecting players of one team. One of the main characteristics of the detectors with respect to the tracking task is that they generate detections of the entire object. This fact discards the possibility of split measurements/detections, typical of part-based object detectors and of moving object detectors based on background subtraction techniques (77). Nonetheless, more than one detection per object is feasible, since different poses of a player can lead to several detections. Another reason for multiple detections per object arises when the object

Figure 4.10: Computed detections from the red dressed team. (a) Detections over the similarity map of Fig. 4.9(a). (b) Detections over the original image presented in Fig. 4.8(a).

Figure 4.11: Computed detections from the black and white dressed team. (a) Detections over the similarity map of Fig. 4.9(b). (b) Detections over the original image presented in Fig. 4.8(a). Annotations in the figure: clutter; several measurements per object.

size is not properly estimated. Another fundamental characteristic of the detector is that the probability of detecting an existing object is less than one, which means that an object may not be detected as a result of abnormal variations in its appearance, and of partial or total occlusions. The last relevant characteristic of the detector is that the probability of false alarm is higher than zero, i.e. there can be false detections not related to the objects of interest, called clutter. False alarms arise as a consequence of similar objects in the background, such as the pitch lines, which can be easily confused with the black and white dressed team. There can be several types of clutter: random clutter that has no temporal coherence, permanent static clutter, and permanent moving clutter, such as referees, coaches, and spectators.

The detectors used are not very complex with respect to the current state of the art, but they are well-known, low-complexity color detectors that can be applied to a huge range of video applications. This decision has made it possible to focus on the tracking task, while making the tracking application more practical, since it has to deal with the typical problems of a standard detector. Nonetheless, the developed tracking strategy is versatile enough to be used with more complex object detectors (as long as they fulfill the aforementioned detector characteristics), which would be expected to increase the tracking performance.

As a result, the tracking has to face different kinds of drawbacks, such as multiple detections per object, missing detections, and false detections, and at the same time manage the sequence of correct detections to generate the trajectories of every player. In this context, ‘VS-PETS 2003’ is a challenging database because of the great number of player interactions that take place. The interactions can be classified into:

• Overtaking action: one or more players overtake others, and their trajectories share the same direction. The overtaking action can be so slow on some occasions that it could also be considered as two or more players moving in parallel.

• Simple cross: two or more players cross each other keeping their trajectories. Note that, unlike the overtaking action, the players have different trajectory directions.

• Complex cross: two or more players cross each other changing their trajectories during the cross. The cross can even last a relatively long period of time due to the typical interactions that take place between football players.
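For reference, the color-based detection stage of Sec. 4.4, on which the detector characteristics discussed above rest, can be sketched as follows. This is a simplified illustration and not the implementation used in the thesis: the similarity map is assumed to be already available as a 2D list of Bhattacharyya coefficients, the non-maxima suppression is reduced to a strict comparison with the 8 neighbouring positions, and all function names are hypothetical.

```python
import math


def normalize(hist):
    """Scale a histogram of counts so that its bins sum to one."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [count / total for count in hist]


def bhattacharyya(hist_c, hist_r):
    """Bhattacharyya coefficient between two histograms (value in [0, 1])."""
    pc, pr = normalize(hist_c), normalize(hist_r)
    return sum(math.sqrt(a * b) for a, b in zip(pc, pr))


def select_peaks(similarity_map, threshold):
    """Keep strict local maxima whose similarity is not below the threshold.

    Returns a list of (x, y, similarity) tuples. Plateaus of equal values
    produce no peak because the comparison with the neighbours is strict.
    """
    rows, cols = len(similarity_map), len(similarity_map[0])
    peaks = []
    for y in range(rows):
        for x in range(cols):
            value = similarity_map[y][x]
            if value < threshold:
                continue
            neighbours = [
                similarity_map[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value > other for other in neighbours):
                peaks.append((x, y, value))
    return peaks


if __name__ == "__main__":
    toy_map = [[0.1, 0.2, 0.1],
               [0.2, 0.9, 0.3],
               [0.1, 0.3, 0.6]]
    print(select_peaks(toy_map, threshold=0.5))
```

In the toy map, the value 0.6 exceeds the threshold but is suppressed because it is adjacent to the larger value 0.9. This mirrors how a single object yields one peak, whereas separated responses (different poses, clutter) yield additional detections.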

In all the previous situations, the involved players can be partially or totally occluded. In these cases, some of the objects involved in a cross have no associated detection due to the implicit occlusion. Especially challenging are the interactions that involve a complex cross, due to the variation in the object trajectories. In this situation, the existing multi-object tracking approaches usually fail to compute the trajectories, since they rely on the objects keeping their trajectories during the occlusion. The only way that they have to adapt to the variations in the object trajectories is by means of the available detections. However, there is a significant period of time in which some of the objects have no associated detection, and the increasing uncertainty in the prediction of the object positions can be too high to recover the correct trajectories after the occlusion. A similar problem can arise in overtaking actions when the object occlusions last too long. All of these problems are explicitly addressed in the developed tracking algorithm, which models the behavior of interacting objects, making it possible to satisfactorily recover the object trajectories.

In the following two sections, the results about the performance of the developed tracking algorithm are presented. The first section graphically shows the performance of the tracking system in different kinds of interactions. The second section presents quantitative results about the global performance of the tracking algorithm over the ‘VS-PETS 2003’ dataset, and also offers a comparison with another similar object tracking algorithm, which does not explicitly model the interacting behavior between objects.

4.5.1 Qualitative results

In this section, qualitative tracking results are presented for different types of object interactions. Fig. 4.12 illustrates a situation involving a simple cross between two rival players, who keep their trajectories before and after the cross. The first row shows the original frames along with a blue square that encloses the image region where the simple cross takes place. The second row shows in detail the object interaction inside the previous squared image regions and the computed object detections, marked with blue crosses. Note that some detections are missing due to the occlusions. The last row shows the tracking results, in which the tracked objects have been enclosed with a blue rectangle and labeled by an identifier. Notice that the labels have been correctly assigned to each player during the interaction.

A simple cross involving different types of objects, in this case players of rival teams, is the least complex interaction for two reasons. The first one is because

the trajectories do not change, and thus the object positions can be easily recovered after the occlusion. The other reason is that the objects involved in the interaction are of different types, which simplifies the data association stage, since the detections can only be associated to objects of the same type. A consequence of this is that the marginal posterior pdf over the involved objects is unimodal rather than multimodal, as can be observed in Fig. 4.13 and 4.14. These figures represent the marginal posterior pdf over the objects of Fig. 4.12. Note that, instead of representing the marginalized distribution by the corresponding mixture of Gaussians, only the mean of each Gaussian has been depicted by a weighted sample. The samples overlap, giving the impression that there is only one sample, although not all of the samples have the same weight, since it depends on the global data association that involves all the objects.

An example of complex cross is illustrated in Fig. 4.15, which involves three players, two of them belonging to the same team, and another to the rival one. The level of interaction in a complex cross is higher than that of a simple cross, since the object trajectories change along the interaction. This kind of situation cannot be satisfactorily managed with the typical assumption of independent object motion, posed by the majority of multi-object tracking approaches. On the contrary, the developed tracking algorithm explicitly models this kind of interaction to satisfactorily perform the tracking, as can be observed in the previous figure. In addition, the duration of the interaction is usually longer than that of the simple cross, implying more missing detections, which in turn makes the tracking more complex. An additional source of difficulty is that there are objects of the same type (players of the same team) involved in the interaction. In this case, the data association stage is more complex, since there are several feasible detection associations, making the corresponding posterior pdf multimodal. This fact can be observed in Fig. 4.17 and Fig. 4.18, in which the object state samples of the players belonging to the same team are grouped in different modes. However, the marginal posterior pdf of the rival player is unimodal, as can be observed in Fig. 4.16.

Lastly, Fig. 4.19 illustrates an overtaking action involving three players, two of them of the same team, and the other one from the rival team. This situation is easier to handle than that of the complex cross, since the object trajectories do not change, but it is more complex than that of the simple cross, since the duration of the occlusion is usually much longer. Similarly to the

example of the complex cross, the presence of players of the same team poses a greater difficulty, which causes their marginal posterior pdf to be multimodal, as shown in Fig. 4.20 and 4.21. On the other hand, the marginal posterior pdf of the rival player is unimodal, as depicted in Fig. 4.22.

4.5.2 Quantitative results

The performance of the developed tracking algorithm has been tested using the ‘VS-PETS 2003’ dataset described in Sec. 4.5. It has also been compared with a similar algorithm that lacks an explicit model for interacting objects, and which theoretically cannot handle situations in which the objects change their trajectories while they are occluded, as happens in complex crosses. This algorithm is described in (73), and the main differences with respect to the proposed one are the lack of a model for managing occlusions and the lack of a dynamic model for objects with non-constant velocity.

The tracking results are obtained in conditions of continuous operation, i.e. the tracking algorithm is started at the beginning of a sequence, and it is stopped at the end. There is no reinitialization of the algorithm in case of tracking failure. This procedure is more realistic for testing tracking systems, and in addition it makes it possible to test the failure recovery capability of the particle filtering technique.

The tracking results of both algorithms are shown in Tabs. 4.1 and 4.2, which indicate the number of tracking errors in a set of interacting situations extracted from the ‘VS-PETS 2003’ dataset. A tracking error is considered to occur when the bounding box of the ground truth and the one computed from the estimated object state do not overlap. This situation is caused by poor object state estimations or by erroneous detection associations that lead to the exchange of object identities. In addition, the tables contain a description of the main characteristics of the tested interactions, such as the interaction type, the number of players who are involved in the interaction, the total number of players who are being tracked, and the duration in frames of the interaction. The tables exclude the tracking results for situations not involving object interactions, because there are almost no tracking errors. This makes it possible to properly measure the real performance of the tracking for interacting situations, a performance that would otherwise remain obscured by the high tracking efficiency in non-interacting situations.

Analyzing the results, it can be observed that both algorithms satisfactorily track objects that undergo simple crosses, whereas for complex crosses the developed track-

The blue square marks the selected object interaction.12: Tracking results for a situation involving a simple cross with objects of different categories. Figure 4.4. 110 . BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS Original frames showing all the players. 9 9 15 15 9 15 9 15 Tracking results. 15 9 Cropped frames containing the selected object interaction along with the detections.

13: Marginalization of the posterior pdf related to one of the players of Fig.111 4. The samples represent the means of the resulting mixture of Gaussians. 4.12 tagged with the id 9. .5 Results Figure 4.

4.14: Marginalization of the posterior pdf related to one of the players of Fig.4. The samples represent the means of the resulting mixture of Gaussians.12 tagged with the id 15. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS Figure 4. 112 .

4. .15: Tracking results for a situation involving a complex cross with objects of both categories. The blue square marks the selected object interaction.Original frames showing all the players.5 Results Figure 4. 113 11 5 13 Cropped frames containing the selected object interaction along with the detections. 5 11 13 Tracking results.

BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS Figure 4. 114 .4.16: Marginalization of the posterior pdf related to one of the players of Fig.15 tagged with the id 5. The samples represent the means of the resulting mixture of Gaussians. 4.

115 4.5 Results Figure 4.15 tagged with the id 11. The samples represent the means of the resulting mixture of Gaussians. .17: Marginalization of the posterior pdf related to one of the players of Fig. 4.

116 .18: Marginalization of the posterior pdf related to one of the players of Fig. The samples represent the means of the resulting mixture of Gaussians. 4. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS Figure 4.4.15 tagged with the id 13.

117 Cropped frames containing the selected object interaction along with the detections.5 Results Figure 4. 9 9 16 8 16 9 8 16 9 16 8 16 8 9 8 Tracking results.19: Tracking results for a situation involving an overtaking action with objects of both categories. The blue square marks the selected object interaction. 4. .Original frames showing all the players.

The samples represent the means of the resulting mixture of Gaussians. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS Figure 4.20: Marginalization of the posterior pdf related to one of the players of Fig.4.19 tagged with the id 8. 4. 118 .

. The samples represent the means of the resulting mixture of Gaussians.5 Results Figure 4.119 4.21: Marginalization of the posterior pdf related to one of the players of Fig.19 tagged with the id 9. 4.

22: Marginalization of the posterior pdf related to one of the players of Fig.19 tagged with the id 16. The samples represent the means of the resulting mixture of Gaussians. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS Figure 4. 4. 120 .4.

such as long-term occlusions of two or more objects. This fact is more evident in situations in which the objects change their trajectories while they are occluded. The key part has been the design of an advanced object dynamic model that is sensitive to the object interactions. which in turn triggers different choices of object motion. in which object dynamics and object trajectories are mutually dependent. As in complex crosses. which is common to a lot of existing object detectors.4. which efficiently computes the correspondence between the unlabeled detections and the tracked objects. A more sophisticated object detector would be needed that provide more information to correctly estimate the detection association. the tracking errors are because of long interactions of players of the same team that make strongly multimodal the data association distribution. the proposed Bayesian tracking model uses a random variable to predict the occlusion events. This fact is due to the objects maintain their trajectories during the interaction. especially if it is a long interaction. The single restriction is that the object detections have to include the positional information about the whole object. The tracking algorithm can also use quality information about the 121 . There is only a slight improvement when the duration of the interaction is too long. This problem has been successfully tackled by means of the explicit management of the object interactions. For this purpose. It is worthy to note that the tracking framework is independent of the specific object detector used. The tracking results for overtaking actions are very similar for both algorithms since the interacting model does not offer a great advantage in these situations. there is not information enough to reliably perform the data association. Another observation is that there are more errors in interactions involving players of the same team. 
the developed tracking algorithm has a great performance in complex object interactions. In these situations. The tracking algorithm is also able to handle false and missing detections through a probabilistic data association stage. improving other approaches that lack of an explicit model for occlusions. and/or a dynamic model for non independent objects.6 Conclusions In this chapter a novel recursive Bayesian tracking model for multiple interacting objects has been presented.6 Conclusions ing algorithm is much more efficient than the one without an explicit model for object interactions. 4. To sum up.

4. and the one for non-interacting objects. 122 0 0 .1: Quantitative results for both the developed tracking algorithm for interacting objects. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS Object interaction description Interaction name interact-1 interact-2 interact-3 interact-4 interact-5 interact-6 interact-7 interact-8 interact-9 interact-10 interact-11 interact-12 interact-13 interact-14 interact-15 interact-16 interact-17 Simple cross Complex cross Complex cross Complex cross Complex cross Complex cross Complex cross Complex cross Complex cross Simple cross 2 2 2 3 2 3 2 2 2 2 Simple cross 2 Simple cross 2 Simple cross 2 Simple cross 2 18 17 16 5 5 18 14 14 13 17 15 18 17 16 Simple cross 2 18 Simple cross 3 17 Simple cross 2 17 46 72 48 50 123 99 37 56 73 36 56 55 78 69 61 113 109 Interaction type Number of players in interaction Total number of players Duration of interaction in frames Tracking results Number of errors (tracking algorithm for interacting objects) Number of errors (tracking algorithm for NON-interacting objects) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 36 0 45 0 0 0 0 0 87 0 74 Table 4.

. 4.6 Conclusions Table 4.1.Object interaction description Number of players in interaction 2 2 2 3 2 2 2 3 2 3 2 2 2 2 2 2 2 2 16 14 8 63 100 45 10 27 15 89 15 90 17 108 19 29 19 89 14 35 14 94 17 60 17 95 0 0 0 13 0 12 0 0 0 1 0 0 14 18 38 0 16 45 6 10 126 0 8 92 0 17 50 0 Total number of players Duration of interaction in frames Number of errors (tracking algorithm for interacting objects) Number of errors (tracking algorithm for non-interacting objects) 0 47 84 32 0 0 0 0 14 0 15 0 0 0 2 0 0 16 Tracking results Interaction name Interaction type interact-18 Complex cross interact-19 Complex cross interact-20 Complex cross interact-21 Complex cross interact-22 Complex cross interact-23 Overtaking interact-24 Overtaking interact-25 Overtaking 123 interact-26 Overtaking interact-27 Overtaking interact-28 Overtaking interact-29 Overtaking interact-30 Overtaking interact-31 Overtaking interact-32 Overtaking interact-33 Overtaking interact-35 Overtaking interact-36 Overtaking 4.2: Quantitative tracking results: continuation of Tab.

an accurate approximation of the object trajectories in high dimensional state spaces. Lastly. which has been a great challenge due to the high complexity of the tracking model. called Rao-Blackwellization. An intensive use of the concepts of probabilistic conditional independence and d-separation has been necessary to simplify the expression of the posterior distribution. BAYESIAN TRACKING OF MULTIPLE INTERACTING OBJECTS object detections if this is provided. to improve the accuracy of the approximation obtained by particle filtering. the presented Bayesian tracking model for interacting targets has been tested with a well-known and public video database. To deal with this drawback. which has allowed to prove the efficiency and reliability of the tracking in situations involving several objects that interact each other. The second issue arises from the fact that the obtained expression for the posterior distribution has not an analytical form due to the non-linearities and non-Gaussianities of the dynamic.4. observation and occlusion processes involved in the Bayesian tracking model. the utilization of approximate inference methods is necessary to obtain a suboptimal solution. Regarding the inference of the tracking information in the proposed Bayesian model for interacting objects. the accuracy of the approximate solution can be very poor because of the high dimensionality of the tracking problem. 124 . As a result. Nonetheless. a novel suboptimal inference method has been developed which uses a variance reduction technique. The first one is the mathematical derivation of the posterior distribution of the object positions. two major issues have been carefully addressed. As a consequence. Also important is that the Bayesian tracking model is thought to work with realistic object detectors that can produce several false alarms and even lose detections (missing detections). involving multiples tracked objects. 
proportional to the number of tracked objects and object detections. is achieved.
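The variance-reduction idea behind Rao-Blackwellization can be illustrated on a toy problem that is unrelated to the tracking model itself. The sketch below is only an illustration under assumed distributions (x ~ N(0, 1), y | x ~ N(x, 1)); the function names are hypothetical. It estimates E[y^2] = 2 either by sampling both variables, or by sampling only x and using the closed-form conditional expectation E[y^2 | x] = x^2 + 1, which is the role played by the analytically tractable part of the state in a Rao-Blackwellized particle filter.

```python
import random
import statistics


def crude_estimate(n_samples, rng):
    """Plain Monte Carlo: sample both x and y, average y**2."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(x, 1.0)
        total += y * y
    return total / n_samples


def rao_blackwell_estimate(n_samples, rng):
    """Sample only x; integrate y out analytically via E[y**2 | x] = x**2 + 1."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        total += x * x + 1.0
    return total / n_samples


def compare(n_runs=200, n_samples=100, seed=0):
    """Return (variance of crude estimator, variance of Rao-Blackwellized one)."""
    rng = random.Random(seed)
    crude = [crude_estimate(n_samples, rng) for _ in range(n_runs)]
    rb = [rao_blackwell_estimate(n_samples, rng) for _ in range(n_runs)]
    return statistics.pvariance(crude), statistics.pvariance(rb)


if __name__ == "__main__":
    var_crude, var_rb = compare()
    print("crude variance:", var_crude)
    print("Rao-Blackwellized variance:", var_rb)
```

Both estimators target the same quantity, but the second one removes the sampling noise of y, so its variance across runs is expected to be lower for the same number of samples.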

Chapter 5

Conclusions and future work

This dissertation closes with the conclusions about the developed tracking algorithms for challenging situations and some suggestions for future lines of research.

5.1 Conclusions

Visual object tracking in unconstrained scenarios is a challenging task due to interactions between multiple objects, heavily-cluttered scenarios, moving cameras, objects with varying appearance, and complex object dynamics. In this dissertation, the research has focused on developing efficient and reliable algorithms that are able to handle complex tracking situations affected by the aforementioned problems. In particular, two different challenging situations have been addressed: tracking of a single object in heavily-cluttered scenarios with a moving camera, and tracking of multiple interacting objects with a static camera.

In the first situation, the main drawbacks are the heavily-cluttered scenarios and the moving camera. At the beginning of the research, deterministic and statistical methods (32, 33, 34) were applied to compensate the camera motion, and thus to satisfactorily perform the tracking. However, a reliable estimation of the camera motion is not possible in some situations because of the aperture problem, the presence of independent moving objects, and changes in the scene. To overcome these problems, more emphasis was placed on the camera motion estimation (85) in order to be more robust to the previous sources of disturbance. But it was not until the development of the first Bayesian models (92) that a significant improvement in performance was achieved.

Fostered by the success of the Bayesian modeling of the camera motion, new Bayesian strategies were proposed to handle both the camera motion and the tracking, some of them adapted for infrared imagery (93, 94), others for visible imagery (95). These algorithms manage several hypotheses of camera and object motion to accurately approximate the posterior distribution of the object state, which contains the desired tracking information. The proposed Bayesian model for the camera motion is able to handle situations with strong ego-motion due to large camera displacements, situations with a high uncertainty in the camera motion because of low-textured video sequences, and situations with multiple independent moving objects. On the contrary, approaches based on a deterministic estimation of the camera ego-motion are prone to fail, since the camera motion cannot be reliably computed in these situations. An important issue related to the proposed Bayesian model is the inference strategy used to obtain the tracking information. The inference cannot be performed analytically because the dynamic and observation processes are non-linear and non-Gaussian. This fact gives rise to complex integrals in the expression of the posterior distribution over the object position that cannot be analytically solved. In order to deal with this problem, a suboptimal inference method has been derived, based on the particle filtering technique, which is able to compute an accurate approximation of the object trajectory.

The second challenging tracking situation has focused on the interactions of multiple objects with a static camera. The initial works (96, 97) proposed particle-filter-based approaches to efficiently handle the curse of dimensionality that arises in multi-object tracking. More sophisticated Bayesian models (78, 98) were developed to manage the events of missing detections and false detections, which are inherent to real tracking applications. These detection problems stem from variations in the object appearance, changes in illumination, partial and total occlusions, and background clutter. However, none of these models explicitly models the object interactions, which are prone to tracking errors. Especially problematic are the long-term interactions that involve occlusions and changes in object trajectories at the same time. To successfully tackle this problem, a novel Bayesian algorithm was developed to explicitly manage complex object interactions. This algorithm proposes an advanced dynamic model that is able to simulate the object interactions involving long-term occlusions of two or more objects. A fundamental part of the model is the use of a random variable to infer the potential occlusion events, which are indicators of the occurrence of object interactions.
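As a reminder of how the particle filtering technique mentioned above approximates a posterior that has no closed form, the following is a minimal bootstrap particle filter for a one-dimensional toy problem. It is a generic sketch, not the inference method developed in this dissertation: the random-walk dynamics, the Gaussian observation model, the prior, and all parameter names and values are assumptions chosen only for illustration.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=1.0, obs_std=1.0, seed=0):
    """Return the posterior-mean estimate of a 1-D state after each observation.

    Assumed toy model: x_t = x_{t-1} + N(0, process_std^2),
                       z_t = x_t + N(0, obs_std^2).
    """
    rng = random.Random(seed)
    # Samples from an assumed broad Gaussian prior over the initial state.
    particles = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Prediction: propagate every hypothesis through the dynamic model.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # Update: weight every hypothesis by the observation likelihood.
        weights = [math.exp(-0.5 * ((z - x) / obs_std) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # Degenerate case: fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resampling: draw a new, equally weighted particle set.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


if __name__ == "__main__":
    print(bootstrap_particle_filter([10.0] * 20)[-1])
```

Each particle is one hypothesis about the state; keeping many of them alive is what lets this family of methods represent multimodal posteriors instead of a single deterministic estimate.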

Thanks to this explicit interaction model, the developed tracking algorithm for interacting objects achieves a high performance in complex object interactions, outperforming other approaches that lack an explicit model for occlusions and/or a dynamic model for non-independent objects. This fact is more evident in situations in which the objects vary their trajectories while they undergo an occlusion.

A great effort has been made to derive the mathematical expression of the posterior distribution over the object tracking information, considering the high complexity of the model. An additional issue has been that the resulting expression of the posterior distribution lacks a closed form due to the presence of non-solvable integrals. These integrals arise from the fact that the stochastic processes involved in the Bayesian tracking model are non-linear and non-Gaussian. Therefore, the use of approximate inference methods is necessary. However, the high dimensionality of the state space associated with the tracking problem, which is proportional to the number of tracked objects and object detections, causes problems in the accuracy of the approximation of the posterior distribution. To overcome these problems, a novel and efficient suboptimal inference strategy has been developed to accurately approximate the expression of the posterior distribution. The inference strategy uses the particle filtering technique in combination with a variance reduction technique, called Rao-Blackwellization, to obtain an accurate approximation of the object trajectories in high-dimensional state spaces, involving multiple tracked objects.

5.2 Future work

Several lines of future work have been identified as a result of the analysis of the tracking results and the overall performance of the algorithms. Regarding the approach for tracking a single object in heavily-cluttered scenarios with a moving camera, the following lines of research have been considered:

• Automatic selection of the estimation method for the camera motion. Two different tracking scenarios have been considered: the first one for infrared imagery and the second one for visible imagery. Since each kind of imagery has its own characteristics, different camera motion techniques have been proposed to compute the samples that approximate the pdf of the camera motion. A more practical approach would be to include a stage that analyzes the acquired images to select the most appropriate motion estimation technique.

• Scale adaptation. Both the detection and tracking stages consider that the tracked object has a fixed size. However, the object size can vary significantly due to the camera and object motion, negatively affecting the tracking performance. Therefore, in order to improve the tracking performance, a detector that supplies object scale measurements should be used. Also, the Bayesian tracking model should be adapted to handle this new source of information.

• Real time operation. In order to use the tracking algorithm in applications with real-time operation restrictions, a detailed study of the parameters of the algorithm, and of their impact on the efficiency and computational cost, is necessary.

With respect to the approach for tracking multiple interacting objects with a static camera, future research would be focused on:

• Improvement of the interaction model. The main tracking errors arise when two or more objects of the same kind interact with each other. This is because of the limited amount of information that the detector supplies about the object. A more sophisticated object detector would be needed that provides more information, such as the pose or shape of the object. This information would be integrated in the Bayesian tracking model to make the tracked objects more distinctive, even if they are of the same type. As a result, the data association stage would be much more efficient, and consequently the final object tracking would be better.

• Improvement of the occlusion model. The object occlusions are modeled as temporally independent processes; however, there is a significant correlation between object occlusions in consecutive time steps. Therefore, the tracking model could be improved by taking this information into account, which would avoid situations in which the occluded and occluding objects are interchanged.

• Real time operation. Similarly to the previous tracking situation, an analysis of the tracking performance in terms of efficiency and computational cost would be necessary to adapt the tracking algorithm to operate under real-time restrictions.

Finally, another line of future work would be the development of a tracking algorithm that integrates both previous situations: tracking of multiple interacting objects with a moving camera. This would extend the applicability of the tracking algorithm to a wider range of vision-based systems.
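The temporal correlation of occlusions mentioned in the occlusion-model item above could, for instance, be captured by a first-order Markov chain over a binary occlusion indicator. The sketch below is purely hypothetical and is not part of the developed tracking model: the two transition probabilities `p_enter` (visible to occluded) and `p_stay` (occluded to occluded), and their values, are assumptions. It compares this chain with a temporally independent model that has the same overall occlusion probability.

```python
import random


def stationary_occlusion_prob(p_enter, p_stay):
    """Long-run probability of being occluded for the two-state chain."""
    return p_enter / (1.0 - p_stay + p_enter)


def sample_occlusions(n_steps, p_enter, p_stay, rng):
    """Sample a binary occlusion sequence from the two-state Markov chain."""
    occluded = False
    sequence = []
    for _ in range(n_steps):
        p = p_stay if occluded else p_enter
        occluded = rng.random() < p
        sequence.append(occluded)
    return sequence


def mean_occlusion_length(sequence):
    """Average length, in time steps, of the runs of consecutive occlusions."""
    runs, current = [], 0
    for occluded in sequence:
        if occluded:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return sum(runs) / len(runs) if runs else 0.0


if __name__ == "__main__":
    rng = random.Random(1)
    pi = stationary_occlusion_prob(0.05, 0.9)
    correlated = sample_occlusions(20000, 0.05, 0.9, rng)
    # Setting p_enter == p_stay makes consecutive steps independent.
    independent = sample_occlusions(20000, pi, pi, rng)
    print(mean_occlusion_length(correlated), mean_occlusion_length(independent))
```

With the same marginal occlusion probability, the correlated chain produces much longer occlusion episodes, which is the behaviour a long-term occlusion model would need to reproduce.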


Chapter 6

Appendix

6.1 Conditional independence and d-separation

Conditional independence is an important concept for probabilistic models involving probability distributions over multiple variables, since it can simplify both the structure of a model and the computational cost needed to infer the required information from the model. Consider three random variables a, b, and c. It is said that a is conditionally independent of b given c if

p(a | b, c) = p(a | c).    (6.1)

The conditional independence statement also holds for sets of non-intersecting variables.

Given a probabilistic model, the conditional independence properties can be directly inferred by inspection of its graphical model, a process that is known as d-separation (79). Consider a directed graphical model in which A, B and C are three arbitrary non-intersecting sets of variables. In order to verify the statement 'A is conditionally independent of B given C', consider all the possible paths between any node in A and any node in B. Any such path is said to be blocked if it includes a variable such that

• its relative arrows on the path are either incoming-outgoing (see Fig. 6.1(a)) or outgoing-outgoing (see Fig. 6.1(b)), and the variable is in the set C, or

• its relative arrows on the path are incoming-incoming (see Fig. 6.1(c)), and neither the variable nor any of its descendants is in the set C. A variable x is considered a descendant of y if there is a path from y to x that follows the direction of the arrows (see Fig. 6.1(d)).

If all paths between A and B are blocked, then A is conditionally independent of B given C, and it is said that A is d-separated from B by C.

Figure 6.1: Definitions about incoming and outgoing arrows, and descendants relative to a node/variable. (a) Incoming-outgoing configuration. (b) Outgoing-outgoing configuration. (c) Incoming-incoming configuration. (d) z and x are descendants of y.

In the context of object tracking, the object state x_t can in fact be a set of random variables that store information not only about a single object but also about the camera motion or about several tracked objects. In these cases, the probabilistic relationships between the variables can lead to extremely complex joint probability distributions, which can be efficiently simplified by applying the underlying conditional independence properties of the model.
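The d-separation test described above can be checked mechanically on small graphs. The sketch below does not enumerate paths as in the definition; it uses the equivalent, well-known criterion based on the moralized ancestral graph: keep only A, B, C and their ancestors, connect parents that share a child, drop the arrow directions, remove C, and test whether A can still reach B. The graph encoding (a dictionary mapping each node to its list of parents) and the function names are assumptions made for this illustration, and the three sets are assumed to be disjoint.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(parents, A, B, C):
    """True if every path between A and B is blocked by C in the DAG."""
    A, B, C = set(A), set(B), set(C)
    keep = ancestors(parents, A | B | C)
    # Moral graph of the ancestral subgraph: undirected parent-child edges
    # plus an edge between every pair of parents that share a child.
    adjacent = {node: set() for node in keep}
    for child in keep:
        child_parents = list(parents.get(child, ()))
        for parent in child_parents:
            adjacent[child].add(parent)
            adjacent[parent].add(child)
        for p, q in combinations(child_parents, 2):
            adjacent[p].add(q)
            adjacent[q].add(p)
    # Remove the conditioning set and search for a connection from A to B.
    frontier = [a for a in A if a not in C]
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        if node in B:
            return False
        for neighbour in adjacent[node]:
            if neighbour not in C and neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return True


if __name__ == "__main__":
    # Incoming-outgoing chain a -> c -> b: blocked only when c is observed.
    chain = {"c": ["a"], "b": ["c"]}
    print(d_separated(chain, {"a"}, {"b"}, {"c"}))
    # Incoming-incoming node a -> c <- b: blocked unless c is observed.
    collider = {"c": ["a", "b"]}
    print(d_separated(collider, {"a"}, {"b"}, set()))
```

In a tracking model, the same check confirms, for example, that in a chain previous state → current state → observation the observation is independent of the previous state once the current state is given.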

D. 7. “A tutorial on particle filtering and smoothing: fifteen years later. Doucet and A. Xiong. 51 133 . no. Huang. on Signal Processing. M. vol. pp. 6 [2] S.” IEEE Proc. 6 [5] K. 3068.. “Non-gaussian state space modeling of time series. Julier and J. O. 27. pp. vol. Conference on Decision and Control. 21. “A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. J. “On sequential monte carlo sampling methods for bayesian filtering. Optimal filtering. pp. 1987. 20. 10. University Press. pp.References [1] B. 20. 7 [6] G. and C. J.” IEEE Trans. Kitagawa. Andrieu. pp. 182–193. Uhlmann. vol.” Journal Statistics and Computing.” in In Handbook of Nonlinear Filtering. 910–927. 2000.” IEEE Trans. Signal Processing. 2000. on Automatic Control. 2002. no. 92. Ito and K. 19–38. “A new extension of the kalman filter to nonlinear systems. Arulampalam. and J. 197–208. Anderson and J. 6. T. 7 [7] A. 5. and N. no. Zhang. 3.” IEEE Signal Processing Magazine. 3. “Particle filtering. 1997. Johansen. Doucet. Djuric. pp. 45. M. vol. Prentice-Hall. S. 2003.” in SPIE Proc. no. Ghirmai. 401–422. 75 [3] S. S. 6 [4] ——. 174–188. 50. Moore. 5. vol. 2009. Godsill. Kotecha. 2004. 1979. Y. Bugallo. B. 1700–1705. Gordon. pp. vol. Maskell. vol. “Unscented filtering and nonlinear estimation. 20. Miguez. 26. 7 [8] P. “Gaussian filters for nonlinear filtering problems. and Target Recognition VI. Sensor Fusion.” in IEEE Proc. 7 [9] A. 7.

2004. pp. MacCormick and A. p. Daum. D. Goutsias. Bar-Shalom. 140. vol. 2001. 7 [14] N. 11. 13. and R. Pulford. 107–113. Springer-Verlag. 9. Innovation in Acoustics and Vibration.” in IEEE Proc. 82–100. J. “Novel approach to nonlinear/non-gaussian bayesian state estimation. “Target detection and tracking in forward-looking infrared image sequences using multiscale morphological filters. pp. 212–221. 25–28. 93. 7 [15] G. 1975. and A. De Freitas.” Journal of the American Statistical Association. 572. no. Gordon. vol.” Journal of Computer Vision. Xin and T. 7 [11] L. 1. no. 9 [17] Y. 6. 1999. 1993.S. Huang. 802–813. Blake. vol. 29.conditional density propagation for visual tracking. International Symposium on Image and Signal Processing and Analysis. “Sequential monte carlo methods for dynamic systems. “Condensation . 1998. Smith. Neto. 7 [13] J. no. “Automatic target detection and tracking in forward-looking infrared image sequences using morphological connected operators. Sequential Monte Carlo methods in practice. “A probabilistic exclusion principle for tracking multiple objects. pp. 5–28. 8 [16] Y.” Journal of Automatica. 451–460. and N.” in Proc. vol. Choudhary.REFERENCES [10] A. Radar and Signal Processing. 2. B. Isard and A. 2. pp.” in IEEE Proc. Bar-Shalom. 1032–1044. N. 7 [12] M. Doucet. “Relative performance of single scan algorithms for passive sonar tracking. 2002. pp. “Tracking in a cluttered environment with probabilistic data association. Shuo. 5.” SPIE Journal of Electronic Imaging. M. Salmond.” in IEE Proc. 9. “The probabilistic data association filter. pp. vol. Chen. 2007. Blake. International Conference on Computer Vision. Gordon. 23 134 . pp. pp. 4. 1998.” IEEE Journal of Control Systems Magazine. 23 [19] H. 29. vol. 2009. no. 9 [18] U. and J. and J. F. vol. vol.

10 [23] W. 564–577. T. H. no. Comaniciu. 10. Havlicek. A. pp. V. 852–855. 5. Venkataraman.” in SPIE Proc. Bal and M. Guo. Jiang.” in IEEE Proc. Hsu. M. vol. “Aerial video surveillance and exploitation. 977–1000. Li. K. vol. Fan. pp. and S. Second International Symposium on Intelligent Information Technology Application. Burt. Zitova and J. 54. Nguyen. S. 21. 2008. International Conference on Intelligent Information Hiding and Multimedia Signal Processing. D. 1518–1539. 2007. 10. no. Mould.” in IEEE Proc. Hansen. Flusser. Computer Vision and Pattern Recognition. “Mean shift based target tracking in flir imagery via adaptive prediction of initial searching points. pp. on Instrumentation and Measurement. 414–417. 2008. Pope. 10 [25] N. Tao.” in IEEE Proc. 2003. “Automatic target tracking in flir image sequences using intensity variation function and template modeling. Hirvonen. Y. 60 [27] B. Automatic Target Recognition XIV. Kumar. “Automatic target detection and tracking in flir image sequences using morphological connected operator. P. vol. 9 [21] A. 1. pp. “Kernel-based object tracking. 35 [26] V. Hanna. pp. 2008. S. 2005. 10. 5. 5426. no. 11. Hu. 89. “Automatic target tracking in flir image sequences. A. 35. Ramesh. Shi. and X. 30–36. “Infrared target tracking with amfm consistency checks. Fan. Wei and S. vol. Alam.” in IEEE Proc. 2003. and P.” IEEE Trans. D. vol.” Journal of Image and Vision Computing. 10 135 . H. 5–8. C. 2001. pp. Wildes. Sawhney. “Target tracking with online feature selection in flir imagery. and J. 2004. pp.REFERENCES [20] C. 1. 10.” IEEE Trans. pp.. 10 [24] D. no. Samarasekera. 10 [22] ——. no. 1–8. pp. R. 1846–1852. vol. S. Meer. and P. “Image registration methods: a survey. Yang.” in IEEE Proc. J. 25. 22 [28] R. on Pattern Analysis and Machine Intelligence. G. Southwest Symposium on Image Analysis and Interpretation.

Cohen and G. pp. 217–222. 54–58. vol. 23. vol. Anandan. 5. “Target detection through robust motion segmentation and tracking restrictions in aerial flir images. 11 [36] M. R. 11 [38] A. pp. 10. International Joint Vonference on Artificial Intelligence. pp. “Target-tracking in flir imagery using mean-shift and global motion compensation.” in SPIE Proc. Jaureguizar. vol. pp. 86. 2007. “Target tracking in airborne forward looking infrared imagery.” in Proc. Yilmaz.” IEEE Proc. X. Jung and G. “Detecting independent motion: the statistics of temporal continuity. Li. Shah.” in IEEE Proc. 125 [34] ——. 2007. A. Lobo. 4678. 10 [30] B. 125 [35] R. and N. 7. 2007. 2000. pp. 656 604(1–12). 10. 905–921. International Conference on Image Processing. pp. Alper. 10 [31] B.” in Proc.” Journal of Image and Vision Computing. 23. pp. vol. N. detection based on motion vector field analysis. no. “Detecting and tracking moving objects in video from an airborne observer. 22. A. 980–987. D. 6566. no. and M. 1998.” IEEE Trans. 11 [37] Y. “Detecting moving objects using a single camera on a mobile robot in an outdoor environment. 2004. Shafique. T. pp. K. L. 21.” in Workshop in Image Understanding. 6566. and M. Shafique. “Automatic aerial target detection and tracking system in airborne flir images based on efficient target trajectory filtering. Automatic Target Recognition XVII. vol. 768–773. 125 [33] ——. on Pattern Analysis and Machine Intelligence. Sukhatme. Irani and P.” in IEEE Proc. 2001. 10 [32] C. 5. 2003. Advanced Concepts for Intelligent Vision systems. 1998. 23. del Blanco. International Conference on Intelligent Autonomous Systems. F. 674–679. Brodsky.” in Proc. 623–635. K. vol. “An iterative image registration technique with an application to stereo vision. 1981. Pless.. vol. Workshop on Computer Vision Beyond the Visible Spectrum. Aloimonos. 10. 8. Salgado. pp.REFERENCES [29] I. Kanade. T. 990–1001. pp. no. Shah. “Video indexing based on mosaic representations. 
Garc´ “Aerial moving target ıa. Medioni. 445–448. Lucas and T. 11 136 . Olson. and Y.
