Active Vision: Directing Visual Attention

R.S. Pieters
DCT 2008.145

Master’s thesis

Coach(es):

Prof. dr. ir. P.P. Jonker

Supervisor:

Prof. dr. H. Nijmeijer

Technische Universiteit Eindhoven
Department of Mechanical Engineering
Dynamics and Control Group
Eindhoven, November, 2008


Abstract
Active vision is the research topic in which the vision process is dynamic, i.e. the movement of a camera is controlled purposively. Image processing is applied to a sequence of images, and a mechanism decides which feature or object should be followed. Attentive vision is an element of active vision in which the outcome of an attention selection mechanism differentiates between the locations to be attended.
When combined, a vision system can direct attention in two ways: top-down and bottom-up tracking. Top-down tracking consists of estimating the position and characteristics of an object of interest, given its position and characteristics in the previous frame; particle filters and Kalman filters are representatives of this top-down technique for tracking. The bottom-up approach consists simply of segmenting the object of interest in each frame or Region of Interest and matching it over time.
This top-down/bottom-up distinction can also be identified in detecting visual attention. In the bottom-up approach, no prior knowledge of salience is used to direct attention selection. Top-down mechanisms, however, can be employed to direct attention to certain objects that are of more interest and should be found first. The top-down approach is more complex (recursive) and has a neural background, as can be identified in the human visual system.
In this thesis a visual attention model (the well-known 'Saliency Map') is implemented on a human-like eye-head setup. The model detects attention with three feature extraction methods (intensity, color and orientation), which contribute to a saliency map depicting the salience at every location in the field of view as a scalar quantity. A particle filter (observer) and a Kalman filter (smoother) are used for visual tracking control on this focus of attention. Human eye control methods are then used for actuation: position control mimics smooth pursuit (normal tracking), and velocity control mimics a saccade (a jump in eye velocity) when smooth pursuit cannot keep up. These control methods actuate the neck of the setup; the system thus resembles the visual system of a cat or an owl, which cannot make saccades with their eyes and instead do so with their head.
The result is a 'Search and Track' algorithm which detects and directs visual attention in a human-like manner.


Summary
In order to understand active, attentive vision, a literature report was written, divided into two parts. The first part explains the mechanisms of the human visual system, including eye movements, control and attention. The second part of the literature report describes machine vision, with an extra emphasis on the currently developed models of attention. A follow-up on this is understanding the differences and similarities between human and robot vision. From this it became clear that the bottom-up/top-down method, in both tracking and attention, is the connection between these research topics. For tracking, the bottom-up method is simply point-to-point following, while the top-down method is identified as a recursive estimator (particle filter, Kalman filter). For attention selection, the bottom-up method uses no prior knowledge (i.e. it is not recursive), whereas the top-down method employs mechanisms to direct the output.
A human-like eye-head setup was developed which can move in two dimensions (pan and tilt) and has two movable eyes. A camera is used for video input. A simple ball detection method (color extraction with a circular Hough transform), together with an observer (particle filter) and a smoother (Kalman filter), provides the input for visual tracking control.
As the main subject, a visual attention model is implemented on the setup to direct attention in a human-like manner. The model uses three feature detection tracks (intensity, color and orientation) to compute the most salient location in the field of view. Human eye control methods are then used to actuate the setup: smooth pursuit (normal tracking) is simulated with position control, and a saccade (a jump in eye velocity) is simulated with velocity control.
The end result is a 'search and track' algorithm to direct human-like visual attention in real time.


Samenvatting
To understand active, attentive vision (directing a camera setup by means of attention), a literature study was carried out, divided into two parts. The first part describes the human visual system, including the movements and control of the eye and the mechanisms that make attention possible. The second part of the literature study describes machine vision (the application of camera and image-processing technology in industry), with an emphasis on the current models of visual attention. A follow-up on this is understanding the differences and similarities between the human visual system and its robotic counterpart. From this it emerged that the bottom-up/top-down method is the connection between these two research themes. For tracking (following an object), the bottom-up method is simply following setpoints. The top-down method uses prior knowledge and can therefore be identified as a recursive estimator (such as particle filters and Kalman filters). For attention selection, the bottom-up method uses no prior knowledge (not recursive), whereas the top-down method uses mechanisms to steer the outcome of this selection toward a certain preference.
A setup resembling a human head with two movable eyes was developed, in which both head and eyes can move in two dimensions (head: pan and tilt, i.e. shaking and nodding). An industrial camera is used for video input. A simple ball detection method (color extraction with circle detection (circular Hough)) and an observer (a particle filter, in which a swarm of 'particles' represents possible current positions), combined with a smoother (Kalman filter), provide the input for visual tracking control.
A visual attention model was then implemented on the setup to direct attention in a human-like manner. The model uses three features (intensity, color and orientation) to compute the most 'interesting' location in the total field of view. Control methods that drive the human eye are then used to actuate the setup. Smooth pursuit (normal, fluent tracking) is simulated with position control; a saccade (a jump in eye velocity) is simulated with velocity control.
The result is a 'search and track' algorithm to direct human-like visual attention in real time.


Contents

1 Introduction
  1.1 Problem statement
  1.2 Research Objectives
  1.3 Thesis outline

2 Human visual system
  2.1 Introduction
  2.2 Eye anatomy
  2.3 Eye movements
  2.4 Stereo vision
  2.5 Color vision
  2.6 Perceiving motion
  2.7 Focus of attention
  2.8 Human eye control
  2.9 Visual pathway
  2.10 Vestibular system
  2.11 Summary

3 Machine vision
  3.1 Introduction
  3.2 Visual front-end/Scale-space
  3.3 Image processing for recognition
  3.4 Color space
  3.5 Motion vision
  3.6 Present and previous work in visual attention
    3.6.1 Psychophysical models of attention
    3.6.2 Machine vision models of attention
  3.7 Summary

4 Human versus Robot vision and control
  4.1 Introduction
  4.2 Differences and similarities
  4.3 Top-down/Bottom-up tracking
  4.4 Summary

5 Visual tracking model
  5.1 Introduction
  5.2 Eye-head system
  5.3 Visual attention algorithm
  5.4 Tracking algorithm
    5.4.1 Main vision loop
    5.4.2 Ball detection
    5.4.3 Lucas-Kanade Feature Tracker
    5.4.4 Bayesian filtering
    5.4.5 Kalman filter
    5.4.6 Particle filter
    5.4.7 Other tracking algorithms
  5.5 Eye-head control
    5.5.1 Position control
    5.5.2 Velocity control
    5.5.3 Combined control
  5.6 Summary

6 Results
  6.1 Introduction
  6.2 Visual attention model
    6.2.1 Implementation
    6.2.2 Experiments
  6.3 Visual tracking model
    6.3.1 Implementation
    6.3.2 Experiments
    6.3.3 Performance and delay
  6.4 Setup
    6.4.1 Implementation
    6.4.2 Experiments
    6.4.3 Scheduling
  6.5 Comparison towards human visual system
  6.6 Summary

7 Conclusion
  7.1 Conclusions
    7.1.1 Visual attention model
    7.1.2 Visual tracking model
    7.1.3 Search and Track
  7.2 Recommendations

A The Discrete Time Kalman Filter

B One particle filter step

Bibliography

Chapter 1
Introduction
1.1 Problem statement
Active vision is the research topic in which the vision process is dynamic, i.e. the movement of a camera is controlled purposively to maintain a distinctive view of the surrounding world. Attention selection, on the other hand, selects only one object in the field of view as the most salient. In order to be aware of objects that might be more interesting, human beings direct their gaze to one object while constantly updating the entire field of view ([15], [24]). When this technique is applied to an artificial system, a certain amount of complexity or awareness becomes present. Therefore, active, attentive vision can be highly relevant for various commercial and non-commercial applications. Subjects such as detection (surveillance cameras, automation) and identification (commercial video cameras, warehousing), but also the understanding of the human visual system, can benefit from this field of research. An interesting application that covers most of these subjects can be found in humanoid robotics¹. For humanoid robot soccer² in particular, a (stereo) camera is not only used to detect and identify objects (ball, goal, field lines, other robots), but also for strategic decisions and even self-localization.
Numerous models of visual attention (both psychophysical and machine vision models) have been developed and can be found in the literature ([59], [37]). These models cover the neural background of human visual attention [37] and extend to specific applications in industrial environments [44]. With respect to control, many methods for (visual) tracking control have been employed over the years, extending from simulating human-like eye control [52] to estimation and filtering in robotics ([17], [14]).
A combination of these two, however, is a fairly new research track, where the limits of modeling (social and/or emotional factors in attention tracking) are yet to be understood [24].
Having this in mind, the main problem can be formulated as: "What are the necessary means to perform robust tracking control for human-like visual attention?"
¹ www.dutchrobotics.net
² robocup.org

1.2 Research Objectives
Regarding the research question mentioned above, the main approach is as follows. Each research objective is covered separately. These objectives include a review of the mechanisms of the human visual system, including movements, control and attention. In addition, a short survey on machine vision, with an extra emphasis on models of visual attention, is presented. This includes both a human (psychophysical) approach and a machine vision approach to the existing models of attention in the literature. The connection between human and machine vision follows from this, where the link between the top-down and bottom-up approaches in both tracking and attention is clarified. As the main objective, a visual attention model has to be implemented on a human-like eye-head setup to end up with an active, visual attention system. In order to maintain fast and robust tracking control, tracking filters (Kalman and particle filters) and control methods are employed. With the coordinates obtained from the visual tracking model, two control methods are of interest: position control to imitate smooth pursuit and velocity control to imitate a saccade.

1.3 Thesis outline
The thesis is built up as follows. Chapter 2 gives an introduction to the human eye system and the mechanisms involved with it (i.e. movements, control, attention).
Chapter 3 gives a short overview of machine vision, where its usage for recognition and detection is clarified. Chapter 3 also presents an overview of the existing psychophysical and machine vision models of attention, with an extra emphasis on the 'saliency map' model by [35].
The connection between human vision and robot vision is given in chapter 4. The complete visual tracking model, including the tracking and control algorithms and the eye-head setup, is described in chapter 5.
Chapter 6 presents the results of the tests and experiments which were performed on both the algorithms and the setup. Conclusions from these experiments are given in chapter 7, together with recommendations for future research.


Chapter 2
Human visual system
2.1 Introduction
Human beings are a visually minded species. Approximately 25% of the human brain cells, which number about 10^10 in total, are involved in visual processes [27]. The human visual system is therefore one of the most important fields of research. Consequently, the topics for studies on human vision are very broad and can be set out in a few different tracks [27]:
• Eye Physics: Study on the physical limitations due to the optics or the retinal structure of
the eye.
• Psychophysics: Study on the performance of the visual system and the limits of perception.
• Neurophysiology, -anatomy: Here the system’s wetware organization and electrical activity
are regarded, including mapping of the neural pathways and connections.
• Functional/optical imaging: The topic here is measuring the functional activity of arrays or
large clusters of cells.
• Computational models: Here computer vision is mimicking the neural substrate to understand and predict its behavior. Vision algorithms (artificial) for computer vision tasks are
determined, which also includes visual attention models.
Division of the visual system
The human visual system can be divided in three different layers and can be regarded as levels of
visual processing [27]. It is set up as follows:
• Front-end level: This part takes care of the processing of shapes, motion, disparity (asymmetry) and color analysis and is done in parallel. It measures the geometrical structures by
multi-scale partial derivatives in space and time. No memory is involved in this stage and
there are no cognitive processes present.

• Intermediate level: The subsequent layers perform an analysis of contextual structure by
perceptual grouping and hierarchical topological analysis. Complex shapes, depth and motion, are processed here. Also the first association with stored information is made here.
• High level: In the high levels of the visual system, cognitive and highly associative tasks are
processed. Recognition and conscious perception is performed here.
The human visual system performs measurements; for every perceivable aspect of the stimulus it has a dedicated set of detectors. These receptive fields span the full measurement range of
the various parameters. These parameters are location, orientation, order of spatial or temporal
differentiation of the stimulus, velocity in every direction, disparity and many more [16, 33].
As mentioned, the retina does not sample with individual rods and cones (photoreceptor cells), but with well-structured assemblies of rods and cones, the so-called receptive fields. The human eye has about 150 million receptors and 1 million optic nerve fibers; a typical receptive field therefore contains about 150 receptors. The receptors in the retina are designed to extract multi-scale information by applying sampling apertures (openings) simultaneously at a wide range of sizes. This measurement of a whole stack of images is called a scale-space [27]. Scale-space theory will be treated in more detail in the computer vision section, as it turns out to be the basis of both human and machine vision.
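The idea of observing the same signal through apertures of many sizes can be sketched as a Gaussian scale-space (a minimal 1-D sketch; the retina's actual receptive fields are far richer than plain Gaussian smoothing):

```python
import numpy as np

def gaussian_kernel1d(sigma):
    # Discrete 1-D Gaussian aperture, truncated at 3 sigma and normalized.
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def scale_space(signal, sigmas):
    # Stack of the same 1-D signal observed at increasing aperture sizes.
    return [np.convolve(signal, gaussian_kernel1d(s), mode="same")
            for s in sigmas]

# A step edge observed at three scales: the edge stays in place,
# but fine detail is progressively suppressed.
signal = np.r_[np.zeros(32), np.ones(32)]
stack = scale_space(signal, sigmas=[1.0, 2.0, 4.0])
```

Each level of the stack is one "image" in the scale-space; 2-D vision applies the same construction with 2-D Gaussians.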

2.2 Eye anatomy
Only a few basic topics of the eye anatomy are treated here as these are of the most interest and
give the most important relation between human and computer vision.
The Retina
The retina is the thin layer of neural cells that lines the back of the eyeball and is part of the
central nervous system. It contains rods and cones that respond to light and can be compared to
the photoreceptive sensors used in cameras.
Rods are highly sensitive photoreceptors that contain the visual pigment rhodopsin and are sensitive to blue-green light (500 nm and up). They are used for scotopic vision: low-resolution, black-and-white vision in dim light and at night. Therefore, in dim light we see no color. Cones respond to bright light and mediate high-resolution vision and color; this is also called photopic vision. Nocturnal animals therefore have a high density of rods, while diurnal animals have mainly cones. Rods are much slower (about 1/10 sec slower) to respond to light stimulation than cones; sporting events therefore become progressively more difficult as daylight fails.
When light falls on a receptor it sends a proportional response synaptically to bipolar cells (a type
of neuron) which in turn signals the retinal ganglion cells (also a type of neuron). The receptors
are also cross-linked to amacrine (interneuron) and horizontal cells (also a type of neuron), which
modify the synaptic signal before it is sent to the ganglion cell.
Due to the ratio of receptors to optic nerve fibers (axons), about 150:1, a large amount of pre-processing is performed within the retina (local contrast adaptation, contour extraction, moving object detection) ([33], [16] and [5]).


Figure 2.1: Cross section of the human eye and retina. Figure (a) shows the cross section of the human eye; figure (b) shows a cross section of the human retina, with the rods and cones. Images taken from [16] (a) and [65] (b).

The fovea
The fovea is responsible for sharp central vision. The foveal pit, which is 1 mm in diameter, has a high concentration of cone photoreceptors. The center of the fovea is the foveola (diameter 0.2 mm), which contains only cones. Compared to the rest of the retina, the cones in the foveal pit are smaller in diameter and therefore more densely packed. Despite occupying only 0.01% (2°) of the visual field, the fovea is served by 10% of the axons in the optic nerve and takes up to 50% of the visual cortex in the brain. As said, human beings only see sharply in the central two degrees of the visual field, which is about once or twice the width of a thumbnail at arm's length. The information capacity of the fovea is estimated at 500,000 bits/sec (about 61 KB/sec) without color and 600,000 bits/sec with color [33].
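The unit conversion in this estimate can be checked directly (assuming 8 bits per byte and 1024 bytes per kilobyte):

```python
# 500,000 bits/sec converted to kilobytes per second.
bits_per_sec = 500_000
kb_per_sec = bits_per_sec / 8 / 1024  # bits -> bytes -> KB
print(round(kb_per_sec, 1))  # -> 61.0
```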

2.3 Eye movements
Eye movements can be classified in two different ways: first, by which eye (or pair of eyes) is involved, and second, by the different movements an eye, or pair of eyes, can make:
Shared movements:
• Ductions: This classifies all eye movements involving only one eye.
• Versions: Movements include both eyes and where both move in the same direction.
• Vergences: Both eyes are used but each eye moves in opposite direction.

The second classification gives clarity on the different movements eyes can make [59, 47, 22, 40,
60] and consists of three main categories.
Actual eye movements:
1. Fixational eye movements
These are small involuntary eye movements during visual fixation.
Microsaccades: imperceptible jumps of the eye when looking at a spot. They are essential for seeing and vital for the maintenance of perception over time.
Ocular microtremor: a constant, physiological, high-frequency (peak 80 Hz), low-amplitude (150 − 2500 nm) eye tremor due to the constant activity of brainstem oculomotor units. It is contentious whether ocular microtremor assists vision.
2. Gaze stabilizing mechanisms
Vestibulo-ocular reflex: a movement that stabilizes images on the retina during head movement. It places the image in the center of the visual field, with a velocity of up to 300 ◦ /sec and an overall latency in the order of 15 − 20 ms.
Nystagmus: an involuntary eye movement that can be part of the vestibulo-ocular reflex, alternating smooth pursuit in one direction with a saccadic movement in the other direction [63].
Optokinetic reflex: a combination of a saccade and smooth pursuit, best explained by an example: when looking out of the window of a moving train, the eye focuses on a 'moving' tree (smooth pursuit); when the tree leaves the field of vision, the eyes move back to the point where they first saw the tree (saccade).
3. Gaze shifting mechanisms
Saccadic movements: saccades are quick, simultaneous movements of both eyes, which occur very frequently and take only about 150 − 200 ms to plan and execute. Velocities are typically in the range of 200 − 400 ◦ /sec, with a duration of 50 − 100 ms. The acceleration can go up to 40,000 ◦ /sec² and saccades are issued at a frequency of up to 5 Hz. The latency before a saccade occurs is about 200 ms, of which about 50 ms originates from the retina and another 30 ms is due to latencies in the motor system, leaving 120 ms for computations [52, 54, 61].
Pursuit movements: smooth pursuit is following a moving object in order to keep it stably in the fovea. This requires the brain to process incoming visual information and supply feedback. The pursuit reflex, which is non-linear and dependent on target predictability, is therefore a closed loop with a latency of about 130 ms. The latency of the peripheral system is known to be about 80 ms, leaving only 50 ms for computation during smooth pursuit. When tracking, the eyes can pursue an object at about 0 − 100 ◦ /sec; saccadic jerks can be made to keep up [52, 54, 61].
Vergence movements: when looking at an object, the eyes must rotate around a vertical axis so that the projection of the image is in the center of the retina of both eyes. For an object far away the eyes diverge; for an object close by they converge. Vergence movements are slow, rarely exceeding 10 ◦ /sec.
Furthermore, there are other movements an eye makes, such as Rapid Eye Movements (REM)
when sleeping. Another one is accommodation; to see clearly, the lens will be pulled flatter or
allowed to regain its thicker form.
Another important issue is saccadic masking: the mind selectively blocks visual processing during eye movements, in such a way that neither the motion of the eye nor the gap in visual perception is noticeable to the viewer. Blurred retinal images are of little use, and the eye has a mechanism that 'cuts off' the processing of retinal images when they become blurred. Humans therefore effectively become blind during a saccade [73].

Figure 2.2: Different movements of the human eye (pan about the z-axis, tilt about the y-axis and roll about the x-axis); up-down is also known as 'tilt', right-left as 'pan'. Image adapted from [50].

2.4 Stereo vision
In humans the difference between the two eyes' images, due to their horizontal separation, is usually referred to as binocular or retinal disparity. The brain interprets this as depth (first discovered by Charles Wheatstone in 1838 [58]).
Stereopsis, the ability to distinguish the relative distance between objects, appears to be processed in the visual cortex by binocular cells which have receptive fields at different horizontal positions in the two eyes. Such a cell is only active when its preferred stimulus is in the correct position in both the left and the right eye, thus making it a disparity detector. When a person stares at an object, the eyes converge, placing the image in the center of the retina. In general a human being can perceive depth up to 10 or 15 meters [8].
In human vision there are both monocular and binocular cues for perceiving depth; depth therefore does not necessarily have to be perceived with both eyes.

monocular cues
• Relative size: Based on our experiences and familiarities with similar objects, size
gives a clue for determining relative distance (for instance; a car driving away).
• Interposition: Overlapping of objects tells us their position in depth relative to other objects.

• Linear perspective: Convergence of parallel lines with increasing distance (for instance;
railroad tracks).

• Aerial perspective: The relative color of objects gives clues to its distance; further away
appears bluer.
• Light and shade: As we assume light to come from above, shade is a good indicator
for depth and shape.
• Monocular movement parallax: When moving your head, objects at different distances
move at a different relative velocity.

binocular cues
Since stereo vision, and thus depth, is created with both eyes, binocular cues are the obvious ones [58].
• Retinal disparity: the difference between the two retinal images, which the human brain interprets as depth.
• Convergence: when focusing on an object, the extraocular muscles stretch, giving a cue for depth vision.

Eye rotations with a stationary head give no information about depth and 3-D structure, because the viewpoint does not change significantly. The image movements are then almost pure translations, without distortion or change of structure, i.e. without motion parallax. Another cue for depth is the movement of the human observer itself: if the head moves, objects close by appear to move more than objects further away. This can be perceived both monocularly and binocularly.
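The geometry behind retinal disparity can be sketched with the standard pinhole stereo relation Z = f·b/d (a minimal illustrative sketch; the pixel focal length and the 6.5 cm baseline below are assumed values, not measurements from this thesis):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    # Standard pinhole stereo relation: Z = f * b / d.
    # A larger disparity between the two images means a closer object.
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: a 6.5 cm baseline and a 1000 px focal length.
near = depth_from_disparity(1000, 0.065, 50)  # 1.3 m
far = depth_from_disparity(1000, 0.065, 5)    # 13 m
```

Note how depth resolution degrades with distance: at 13 m a whole pixel of disparity separates depths meters apart, which is consistent with useful stereo depth perception being limited to roughly 10-15 m.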

2.5 Color vision
The retina in the human eye contains three different types of color receptor cells, also known as
cones. These cones are sensitive to different wavelengths and can be divided in three classes:
• S-cones: short-wavelength sensitive (’blue’ cones, peak: 440 nm)
• M-cones: middle-wavelength sensitive (’green’ cones, peak: 545 nm)
• L-cones: long-wavelength sensitive (’red’ cones, peak: 565 nm)
The eye thus reduces light to three color components: for each location in the visual field, the cones give three signals based on the extent to which each is stimulated. The other type of light-sensitive cell in the eye, the rod, has a different response curve. In normal situations, when light is bright enough, rods play virtually no role in vision at all. In dim light, on the other hand, only the signal from the rods remains, resulting in a monochromatic response. Furthermore, the spectral sensitivities overlap, as can be seen in figure 2.3.


Figure 2.3: Retinal response of human rods and cones [73]. The dotted line depicts the response of the rods. S, M and L depict the short, middle and long wavelength sensitivity of the cones.

2.6 Perceiving motion
Velocity corresponds to orientation in space-time, and can be found from the ratio of temporal and spatial intensity gradients. In general, the visibility of a grating depends on both its spatial and temporal frequency. The temporal frequency describes the rate of luminance fluctuation at a given point as the grating moves past it, and the temporal waveform can be seen by examining a vertical cross-section of the space-time image [9].
Human spatial acuity (the acuteness of spatial vision or perception) has its highest visible spatial frequency between 40 and 50 cycles per degree. Temporal acuity, which can be seen as the fastest visible flicker rate, has its highest temporal frequency between 40 and 50 Hz under normal viewing conditions. Furthermore, human visual sensitivity is greatest at roughly 5 cycles/degree and 5 Hz [9].
The motion aftereffect is a visual illusion which is perceived after watching a moving stimulus,
for approximately one minute, and then looking at a stationary stimulus. The stationary stimulus
then appears to move slightly for about 15 seconds in the opposite direction of the physically
moving stimulus. This motion aftereffect is believed to be the result of motion adaptation and
has a neural background [9].

2.7 Focus of attention
Attention is the cognitive process of selectively concentrating on one aspect of the environment while ignoring other things. In human perception, both overt and covert attention shifts are used for vision. Overt attention is the act of directing the sense organs towards a stimulus, thus moving the eye. Covert attention, which is four times as fast as overt attention, is the act of focusing, without awareness, on one of several sensory stimuli; it is thought to be a neural process that enhances the signal from a particular part of the sensory panorama. With covert attention the eye does not move; with overt attention it does [73, 36].

Attention shifts our gaze according to two kinds of processing: task-dependent processing (top-down) and primitive selective attention processing (bottom-up). In the latter case, saliency is determined unconsciously by the visual system. It is obtained from basic information in the input image (intensity, orientation and color) and driven by the attributes of stimuli in a scene. This saliency-based attention method is thought to be biologically plausible and is therefore applied in this research; it is described in sections 3.6.2 and 5.3.
The current view is that visual covert attention is a mechanism for quickly scanning the field of
view for interesting locations. This shift in covert attention is linked to the eye movement circuitry that sets up a saccade to that location.
In human visual behavior, fixation points closer to the center of the view field are more likely to
be chosen and upward movements are preferred to downward movements. Also, in terms of 2-D
image characteristics, points close to corners and symmetry lines are more likely to be chosen.
For peripheral stimuli, which are stimuli at the edges of the visual field, temporal cues lead to
movement of the eye and head.

2.8 Human eye control
The motor system of an eye consists of muscles which individually contain very thin muscle fibers and simple spiral sensors. The muscle fibers show varying electrophysiological characteristics and are categorized into twitch (contracting) and tonic (stretching) fibers. The spiral sensors, known as spindles, are richly supplied in the extra-ocular muscles and provide feedback on the degree of muscle contraction. The muscles are attached to the outside of the eyeball and control the rotational position of the eyeball within the eye socket. The six muscles function in pairs (agonist vs. antagonist).
The muscles can be seen in Figure 2.4. The medial rectus rotates the eye in the horizontal plane
nasally and the lateral rectus rotates the eye in the horizontal plane temporally. The inferior rectus
primarily depresses the eye vertically and the superior rectus primarily elevates the eye vertically.
The inferior and superior rectus muscles also produce oblique rotation and torsion. For a given
muscle, the terms primary and secondary action are used to classify these components of movement, depending on their relative magnitude. The inferior oblique muscle produces extorsion
and the superior oblique muscle produces intorsion. Their tonus, which is the continuous and
passive partial contraction of the muscles, compensates for the oblique movement and torsion of
the inferior and superior rectus muscles. It has been shown that even simple eye movements,
such as saccades in the horizontal plane, rely on a controlled interaction of all six muscles.
During pursuit movements the eye tends to rotate at the speed of the target, reaching velocity saturation at about 100°/s. Above this saturation velocity, tracking becomes increasingly difficult and inaccurate, leading to oscillations and catch-up saccades [60].
The human vision system uses Extra-retinal Eye Position Signals (EEPS) to control the eyes. EEPS is a sluggish, temporally blurred (low-pass filtered) version of a real saccadic displacement [9]. There are two theories about the source of EEPS. The first is known as the outflow theory and can be seen as open-loop control: motor commands sent to the eye muscles are used in interpreting image movement. This can be tested by pressing one's eyeball: vision gets blurred and the two images of the field of view overlap. The second is known as the inflow theory and can be seen as closed-loop control: signals from the eye muscles are also taken into account when movement in the retinal image is interpreted [9].

Maintaining a fairly even eye position requires a control system that is capable of monitoring and rapidly adjusting to changes in the extra-oculomotor system [10, 60].

Figure 2.4: Human eye muscles. Image taken from [21].

To control the pan and tilt angles of a human eye, two descriptive control methods can be distinguished: velocity control for smooth pursuit and position control for saccadic movements. Pursuit movement control aims at adjusting the velocity of the pan and tilt axes so as to minimize the retinal velocity of the fixated feature [53]. Robinson [52, 53, 54] also states that physiological evidence indicates that saccades are controlled with a sampled-data system, while pursuit motions are continuously controlled (or at least by a sampled-data system with a much higher sampling rate than for saccades). As mentioned before, the latency for a saccade has been determined to be about 200 milliseconds. This latency is the difference in time between a change in retinal position of a feature and the moment a motor command is sent to the eye muscles. During this time period the oculomotor system is insensitive to further changes of the retinal position of the feature; human beings become effectively blind during a saccade. If the feature were to move during a saccade, this would result in a position error [17].
Robinson [53] describes a model with separate subsystems for smooth pursuit and saccadic movements. These subsystems are depicted in figures 2.5 and 2.6. Time delays (in ms) are denoted in the square boxes, and transfer functions in the boxes for the filters. The target position is denoted by T and the eye position by E.

Figure 2.5: Robinson’s model for the human oculomotor control system; Saccades.
Model taken from [17].


Figure 2.6: Robinson’s model for the human oculomotor control system; Smooth pursuit. Model taken from [17].

Robinson's model possesses two interesting features. The first is that the saccadic system has a sampled-data nature. The desired retinal position, ED, is sampled (with a pulse sampler), and
held by a first order hold (an integrator). The output of this sample/hold is then used as a setpoint
to the plant (in this case the local motor controller). Between sampled pulses no new desired
position is computed, even though the feature may be moving. Secondly, a positive feedback
loop is present which, for the smooth pursuit system, is necessary to prevent oscillations due
to delays in the negative feedback loop. This negative feedback, which is provided by the vision
system, is used to reduce the retinal velocity error. The positive feedback consists of a delayed
’efference copy’ of the current eye velocity, which is added to the computed retinal velocity error.
The delay is such that the efference copy is the one measured at the same time as when the visual
observation is made. The effect of the positive feedback is to essentially eliminate the negative
visual feedback. The saccadic system is modeled in the same way, except that position control
is being done instead of velocity control. However, in the saccadic system, the internal positive
feedback is not really necessary to ensure stability, since this is gained through the use of the
sample/hold. Nonetheless, the available evidence indicates that the human saccadic system does
use internal positive feedback to compensate for delays. A saccade trigger signal (that opens up
the sample/hold) is generated by the feature detection system when the retinal position error is
greater than a threshold value [17].
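The sampled-data character of the saccadic subsystem can be illustrated with a minimal discrete-time simulation. The function below, its parameter values and the first-order eye plant are illustrative assumptions of this sketch, not Robinson's identified model:

```python
def simulate_saccades(t_end=1.0, dt=0.001, sample_period=0.2, tau=0.05):
    """Track a ramp target (10 deg/s) with a sampled retinal error.

    The desired position is sampled every `sample_period` seconds and
    held; between samples no new setpoint is computed, mimicking the
    sample/hold of the saccadic system in figure 2.5 (sketch only).
    """
    eye = 0.0       # eye position [deg]
    setpoint = 0.0  # held output of the sample/hold
    trace = []
    sample_every = int(sample_period / dt)
    for k in range(int(t_end / dt)):
        target = 10.0 * k * dt              # target position [deg]
        if k % sample_every == 0:
            setpoint = target               # sample and hold
        eye += dt / tau * (setpoint - eye)  # first-order eye plant
        trace.append((k * dt, target, eye))
    return trace

for t, tgt, eye in simulate_saccades()[::200]:
    print(f"t={t:.2f} s  target={tgt:5.2f}  eye={eye:5.2f}")
```

The simulated eye lags the moving target and catches up in discrete jumps, which is exactly the position-error build-up between samples described above.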

2.9 Visual pathway
Large regions of the brain are devoted to visual analysis, interpretation and understanding. The
central mechanisms that control eye movement are complex and incompletely understood, but
some of the brain regions involved in these tasks have been identified as the cerebellum, the inferior
olive, the parieto-occipital junction, the vestibular system, the superior colliculi, the lateral geniculate
nuclei, the motor nuclei of the extra-ocular muscles and the reticular formation (Figure 2.7).
There is no consensus on what information and control is processed in each of these regions, but
it is assumed that cortical regions of the brain are engaged in eye movement. The frontal cortex is
apparently related to saccadic movements. Saccades are believed to be triggered in several regions
of the brain, such as the visual, parietal and prefrontal cortex as well as the superior colliculus,
the cerebellum and the reticular formation. Smooth pursuit movements are more related to the
occipital cortex. The main center for smooth pursuit tracking resides within the parieto-occipital
junction and originates either from the optokinetic system or from the stabilization system [38].
Regarding image processing within the brain, the ventral visual pathway (see figure 2.7, top;
red arrows) is involved in the recognition of an object or a scene. Early stages of visual processing
(in areas V1 to V4) analyze an image’s contours, colors and textures. Intermediate stages (the
lateral occipital area and the ventral occipito-temporal cortex, or VOT) integrate local information
to detect surfaces, objects, faces and places. Within the VOT, a region within the collateral sulcus (CoS) responds strongly to images of places, such as buildings and houses. Later stages of
recognition, in areas such as the parahippocampal cortex and rhinal cortex, are activated when
the brain interprets the stimulus in the context of stored memories [8, 73, 60].

Figure 2.7: Map of the human brain. Image taken from [6].

2.10 Vestibular system
The human vestibular system is devoted to detecting the position and motion of the head in space. In order to determine the absolute movement of a body in three-dimensional space, reliable information is required about motion in each of the 6 degrees of freedom permitted in three-dimensional space, i.e. 3 translational and 3 rotational movements. The human vestibular system
responds to movements of the head relative to space and gravity, using inertial-sensing receptors
which are activated by forces arising from the acceleration of mass in accordance with Newton’s
Laws. The reason why human beings have two vestibular systems is that the signals sent from the
vestibular systems to the brain are constantly compared with each other. From this comparison
the brain interprets and senses movement. This results in a very sensitive system. The downside,
however, is that when the signals on one side fail, the brain interprets this as a movement and a
feeling of giddiness occurs.
The two vestibular labyrinths are mirror-symmetric structures within the inner ears. Each vestibular labyrinth comprises five receptor organs that, complemented by those of the contralateral ear, can measure linear acceleration along any axis and angular velocity about any axis. Linear accelerations, including those produced by gravity and those resulting from body motions, are detected by the utricle and saccule (the otolithic organs), while angular velocities caused by rotation
of the head or the body are measured by the semicircular canals (see figure 2.8).

Figure 2.8: The human vestibular system. Image taken from [8].

The utricle and saccule consist of an ovoid (egg-shaped) sac of membranous labyrinth housing
a roughly elliptical patch called the macula. The hair cells contained in each organ are matched
with the macula. When the head undergoes linear acceleration the membranous labyrinth moves
along as well because it is fixed to the skull, while the macula is free to shift within the receptor
organ. This motion in turn deflects the hair bundles of the hair cells exciting an electrical response. In each utricle, since the macula is approximately horizontal when the head is in its
normal position, these organs are particularly sensitive to accelerations lying in the horizontal
plane. The afferent nerve fibers from each utricle therefore provide a rich and redundant representation of the horizontal plane. Because the utricles are bilateral, the brain receives additional
information from the contralateral labyrinth. In the saccules, the operation of the pair resembles
that of the utricles. Since the maculas within the saccules are oriented vertically in quasi-sagittal
planes, these two sensory organs are especially sensitive to vertical accelerations, of which gravity
is the most important.
Like the otolithic organs, the semicircular canals detect accelerations by means of their internal contents (endolymph fluid). The fluid cannot move freely around the whole of a semicircular canal because each canal is interrupted by a gelatinous diaphragm, the cupula. Around most of its perimeter the cupula is penetrated by hair cells. When the head begins to rotate, the fluid inside
the canals presses against one surface of the cupula, exciting an electrical response of the hair cells. In each labyrinth the three canals are almost precisely perpendicular to one another, so that
the canals represent accelerations about three orthogonal axes. The vestibular labyrinths on the
two sides of the head are symmetrically arranged with respect to one another. The two horizontal
canals for example lie in a common plane and hence function together.
Vestibulo-Ocular Reflex
The vestibular nerve transmits information about head accelerations and velocities to the vestibular nuclei in the medulla of the brain, which then distribute it to higher-level neural centers. This central network of vestibular connections is responsible for various sensory-motor reflexes, including the Vestibulo-Ocular Reflex (VOR, see figure 2.9). The VOR is an important mechanism by which unblurred vision is made possible during head movements generated by common body motions, such as walking and running. When the head moves, the eyes are kept still by the vestibulo-ocular reflexes of the eye muscles. The vestibular apparatus signals how and how fast the head is moving, and the oculomotor system uses this information to stabilize the eyes and keep images steady on the retina.
Each of the three semicircular canals is matched with one of the three muscle pairs of the eye, and the VOR works for all three pairs of muscles. Three different VORs arise from the main components of the vestibular system:
• The rotational vestibulo-ocular reflex compensates for head rotation and receives its input
predominately from the semicircular canals.
• The translational vestibulo-ocular reflex compensates for linear head movement.
• The ocular counter-rolling response compensates for head tilt in vertical direction.

Figure 2.9: Vestibulo-ocular reflex. Image taken from [73].

Section 2.10 is partly taken from: [45]

2.11 Summary
This chapter described the human visual system; the anatomy, movements and control of the eye,
and the connection towards visual attention and the vestibular system.
The retina is the thin layer of neural cells that lines the back of the eyeball. It contains rods and
cones that respond to light and can be compared to the photoreceptive sensors used in cameras.
The fovea is used for sharp central vision and is densely packed with only cones.
Two of the most important human eye movements are smooth pursuit and saccades. Smooth pursuit is used for normal tracking; saccades are rapid jumps, used when smooth pursuit cannot keep up or when attention is suddenly drawn to a different location.
Visual attention can be classified into overt and covert attention; overt attention directs the eye, covert attention does not. Covert attention is thought to be a neural process and is four times faster than overt attention. Attention processing can be classified into bottom-up and top-down processing. Bottom-up is a process where saliency is determined unconsciously by the visual system; the top-down method is task-dependent, where pre-defined information is used to direct attention.


Chapter 3
Machine vision
3.1 Introduction
Machine vision, or computer vision, can be divided into a few different subjects. Besides the obvious two, computer and machine vision, image processing is another important topic. The focus of computer vision is nowadays mainly on 3-D. Machine vision is the complete integration of sensing, control and image processing in one vision system. Image processing is the analysis and processing of images for the sake of measurement or control.
Machine vision can be used for several applications; a few are:
• Recognition: prespecified or learned objects (or object classes) can be recognized.
• Identification: an individual instance of an object can be recognized (fingerprint, face, etc.)
• Detection: data is checked for a specific condition (medical, etc.).

3.2 Visual front-end/Scale-space
When Gaussian derivative operators or invariants are used in the way as basic feature detectors
at multiple scales, the first stages of visual processing is referred to as the visual front-end.
This can be applied for a range of image processing techniques:
• Feature detection/classification
• Image segmentation/matching
• Motion estimation
• Computation of shape cues
• Object recognition

An example is the detection of corners, ridges or valleys; these can be expressed as local minima, maxima or zero-crossings of multi-scale differential invariants defined from Gaussian derivatives.
An important issue is scale selection; the size of real-world objects is obviously not known to a vision system, and the distance between an object and the camera can vary and also be unknown. Therefore, a useful property called scale invariance is used: automatic local scale selection is performed based on local maxima or minima over scales of normalized derivatives. Scale-adaptive and scale-invariant feature detectors can be expressed for tasks such as blob, corner, ridge and edge detection, which, consequently, can be used for determining regions of interest [27].
The pyramid representation is a predecessor of scale space, and is constructed by simultaneously smoothing and sub-sampling a given signal. In this way, computationally highly efficient algorithms can be obtained. However, it is algorithmically harder to relate structures at different scales, due to the discrete nature of the scale levels. In scale space, the existence of a continuous scale parameter makes it easier to express deep structure [73].
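The smoothing-and-subsampling construction can be sketched in a few lines of Python; the 1-2-1 binomial kernel and the factor-2 subsampling below are common choices, assumed here for illustration:

```python
import numpy as np

def build_pyramid(image, levels=4):
    """Gaussian-style pyramid: smooth with a separable 1-2-1 binomial
    kernel, then subsample by a factor of 2 at each level (sketch)."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        img = pyramid[-1]
        for axis in (0, 1):  # separable smoothing along rows and columns
            img = (np.roll(img, 1, axis) + 2 * img + np.roll(img, -1, axis)) / 4.0
        pyramid.append(img[::2, ::2])  # keep every second pixel
    return pyramid

pyr = build_pyramid(np.ones((32, 32)))
print([p.shape for p in pyr])  # [(32, 32), (16, 16), (8, 8), (4, 4)]
```

Each level halves the resolution, which is why relating structures across the discrete scale levels is harder than in a continuous scale space.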
Scale space is the formal theory for handling image structures at different scales, representing an image as a one-parameter family of smoothed images parameterized by the size of the smoothing (low-pass filter) kernel. For a given 2-D image f(x, y), its linear (Gaussian) scale-space representation is a family of derived signals L(x, y; σ), defined by the convolution of f(x, y) with the Gaussian kernel:

G(x, y; σ) = (1 / (2πσ²)) e^(−(x² + y²)/(2σ²))    (3.1)

such that: L(x, y; σ) = G(x, y; σ) ⊗ f(x, y)
Here ⊗ denotes convolution, a mathematical operation (blending) between two functions resulting in a third, modified version of the originals [73].

Figure 3.1: Gaussian kernel; 1-D (left) and 2-D (right). Images adapted from [27].

The convolution is performed on the variables x and y. The scale parameter σ indicates which scale-space level is defined; it is the standard deviation of the Gaussian filter. The scale-space level with σ equal to zero is the image itself.
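Equation (3.1) can be checked numerically. The sketch below builds the Gaussian kernel explicitly and convolves it with an image by a direct (slow) double loop; the truncation at three standard deviations is an assumption of this sketch:

```python
import numpy as np

def gaussian_kernel(sigma):
    """2-D Gaussian kernel G(x, y; sigma) of equation (3.1),
    truncated at 3 sigma and normalized to unit sum."""
    radius = max(1, int(3 * sigma))
    ax = np.arange(-radius, radius + 1)
    x, y = np.meshgrid(ax, ax)
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return g / g.sum()

def scale_space_level(image, sigma):
    """L(x, y; sigma) = G(x, y; sigma) (x) f(x, y) by direct convolution.
    The Gaussian is symmetric, so convolution equals correlation here."""
    k = gaussian_kernel(sigma)
    r = k.shape[0] // 2
    padded = np.pad(np.asarray(image, dtype=float), r, mode="edge")
    out = np.empty(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = (padded[i:i + 2 * r + 1, j:j + 2 * r + 1] * k).sum()
    return out
```

Blurring an impulse image with increasing σ reproduces the kernel itself at growing widths, i.e. successive scale-space levels.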

The scale-space framework can be seen as a theoretically well-founded paradigm for early vision. It shows a high degree of similarity with receptive field profiles recorded from the mammalian retina and the first stages of the visual cortex. It is therefore thought that human beings look at a scale space: retinal images are fed through Mexican hat filters (LoG: Laplacian of Gaussian filters) and the results are sent to the brain.
Basically, the idea behind scale space is that important features of the original image should
survive the change of scale; therefore, all of the scale space of the image should be used to detect
these features [27].

3.3 Image processing for recognition
Methods for recognizing objects or object classes:

Feature based method: Feature detection is a low-level image processing operation and focuses on detecting feature types such as edges, corners, blobs and ridges. The desired property for a feature detector is repeatability: whether or not the same feature will be detected in two or more different images of the same scene. This method is based on the correspondence of feature points between two images, which reduces the full-range pixel correspondence estimation to a sparse set of pixel correspondences [56].
Pixel based method: This method estimates the correspondence of all pixels within a specified area in two images. The result is a dense depth map. The disadvantage is that it is time consuming (only one pixel at a time) and difficult, especially when occlusions (corresponding points missing in one of the two images) are present or when an image contains uniform regions or pixel values. It is essential especially for 3-D reconstruction applications [25].
Saliency based method: From low-level visual feature extraction, three conspicuity maps (color,
intensity and orientation) are combined, which provide the input for a single saliency map.
The task of the saliency map is to compute a scalar quantity representing the salience at
every location in the visual field [35].
Segmentation: Segmentation is the partitioning of a digital image into multiple regions. The goal is to simplify or change the representation into something more meaningful and easier to analyze. Segmentation is typically used to locate objects and boundaries in images. The result is a set of regions that collectively cover the entire image, or a set of contours extracted from the image. Each of the pixels in a region is similar with respect to some characteristic, such as color, intensity or texture [25].

Each detection method has its advantages and disadvantages; which method should be implemented depends on the application.
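To make the feature based method concrete, the sketch below implements a Harris-style corner detector from the structure tensor of the image gradients; the constant k = 0.04 and the crude box smoothing are conventional choices assumed for illustration:

```python
import numpy as np

def harris_response(image, k=0.04):
    """Harris corner response R = det(S) - k * trace(S)^2, where S is
    the smoothed structure tensor of the image gradients (sketch)."""
    img = np.asarray(image, dtype=float)
    Iy, Ix = np.gradient(img)          # central-difference gradients

    def box(a):  # crude 3x3 smoothing of the tensor entries
        for axis in (0, 1):
            a = (np.roll(a, 1, axis) + a + np.roll(a, -1, axis)) / 3.0
        return a

    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    det = Sxx * Syy - Sxy**2
    trace = Sxx + Syy
    return det - k * trace**2          # large positive values -> corners

# a white square on a black background: the corners respond strongest
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
print("strongest response near:", np.unravel_index(R.argmax(), R.shape))
```

Along straight edges one gradient direction dominates, so the determinant of the structure tensor stays small; only at corners do both directions contribute, which is the repeatable response a feature detector needs.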

3.4 Color space
Color space is a system for describing color numerically. The most widely used color spaces are RGB (channels for red, green and blue) for scanners and displays, CMYK (short for cyan, magenta, yellow and key (black)) for color printing, and YUV (Y for luminance (brightness), U and V for chrominance (color)) for video and TV. Before electronic displays were used, color spaces were developed that were closer to the way people perceive color; for example, the HSB model uses hue, saturation and brightness. For machine vision, color spaces can be used to recognize objects or to differentiate between important and less important features [73].
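As a small example, an RGB triple can be converted to HSV (hue, saturation, value; closely related to the HSB model above) with Python's standard colorsys module. The sample color is an arbitrary choice:

```python
import colorsys

# RGB components in [0, 1]; an orange-red sample color
r, g, b = 1.0, 0.25, 0.0
h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(f"hue={h * 360:.0f} deg  saturation={s:.2f}  value={v:.2f}")
# hue=15 deg  saturation=1.00  value=1.00
```

Separating hue from brightness in this way is what makes such color spaces useful for recognizing objects under varying illumination.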

Figure 3.2: Horseshoe shape of the human visible color space and the triangular RGB display space (E). Image taken from [73].

3.5 Motion vision
Optical flow is a concept which approximates the motion of objects within a visual representation.
It is represented as vectors originating or terminating at pixels and is closely related to motion
estimation and motion compensation.
Different methods for determining optical flow are present [73]:
Differential based: based on partial, first and second order, spatio-temporal derivatives of the image signal and/or flow field (called differential since they are based on local Taylor series approximations of the image signal). A few methods can be distinguished:
Lucas-Kanade method: this regards image patches and an affine model for the flow field.
Horn-Schunck method: optimizing a functional based on residuals from the brightness constancy constraint, and a particular regularization term expressing the expected smoothness of the flow field.
General variational methods: these are modifications or extensions of the Horn-Schunck method, using different data and smoothness terms.

Region based: based on correlation between images, where only integer pixel motions are taken into account.
Frequency/Phase based: from the spatial and temporal Fourier transform, the magnitude and phase are examined in order to determine sub-pixel motions and the flow field.
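The differential approach can be sketched as a single Lucas-Kanade-style least-squares estimate of one flow vector for a whole patch; the synthetic test pattern below is an assumption for illustration:

```python
import numpy as np

def lucas_kanade_patch(frame1, frame2):
    """One (u, v) flow vector for a whole patch, by least squares on
    the brightness constancy constraint Ix*u + Iy*v + It = 0 (sketch)."""
    f1 = np.asarray(frame1, dtype=float)
    f2 = np.asarray(frame2, dtype=float)
    Iy, Ix = np.gradient(f1)             # spatial derivatives
    It = f2 - f1                         # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# synthetic example: a sinusoidal pattern shifted right by one pixel
x = np.arange(32)
f1 = np.tile(np.sin(2 * np.pi * x / 16), (32, 1))
f2 = np.roll(f1, 1, axis=1)
u, v = lucas_kanade_patch(f1, f2)
print(f"estimated flow: u={u:.2f}, v={v:.2f}")   # u near 1, v near 0
```

In practice the estimate is computed per local window rather than per whole patch, yielding a dense flow field; the single-window version above shows the core least-squares step.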

3.6 Present and previous work in visual attention
In the literature on visual attention, many models, both psychophysical and for machine vision, have been developed. The review given here for both is certainly not complete, but the main approaches and ideas are covered.

3.6.1 Psychophysical models of attention
Feature Integration Theory: This 'FIT' model by Treisman [66, 67] consists of a master map
which codes locations of feature discontinuities in luminance, color, depth or motion and a
separate set of feature maps for processing information about the current spatial layout of
the features. An attention window moves within a location map which selects the features
attended to and temporarily excludes others from the feature maps.
Guided Search: The Guided Search model proposed by Wolfe [75, 76] uses the well-known
"Saliency map" to account for visual search and focuses on simulating visual behavior data.
GS posits that primitive visual features are detected across the retina in parallel along dimensions such as color and orientation, yielding a set of feature activity maps. These are
passed through a differencing mechanism that enhances local contrast and texture discontinuities, yielding a bottom-up activation, which, when combined from all feature maps,
forms a saliency map.
Dynamic Routing Circuits: The Dynamic Routing Circuits model was first proposed by Anderson and Van Essen [2] and then further developed by Olshausen [43]. The key objective is
to make use of visual attention to route retinal information for translation-invariant pattern
recognition. In the model, spatial attention is taken as a "window" shifting to different
spatial locations in the retina.
SEarch via Recursive Rejection: This model, SEarch via Recursive Rejection (SERR), proposed by Humphreys and Müller [29], is built upon a hierarchical organization of networks and
uses a Boltzmann-like machine activation function to simulate grouping effects on visual
search with letter-like stimuli. It tends to repeatedly reject strongly represented distracters
until the target is detected.
Selective Attention for Identification: Heinke and Humphreys [28] presented the Selective Attention for Identification Model (SAIM). It integrates "dynamic routing circuits" to achieve
bottom-up invariant translation and a top-down knowledge network for object recognition
via visual attention. It consists of three parallel processing networks: a contents network to
map the contents of the visual field into the focus of attention; a selection network to determine the location of the mapped elements in the visual field; and a knowledge network to
store object templates which directly influence selection. This model can perform location-based selection and object-based recognition using a competitive approach to attention.
Adaptive Resonance Theory: The Adaptive Resonance Theory (ART), proposed by Grossberg
and Carpenter [26], is a theory of neural network representations. It suggests that both the
bottom-up and top-down pathways contain adaptive weights, or long term memory traces,
which may be modified by experience. The learned top-down expectations focus attention
upon information that matches them. Then they select, synchronize, and amplify the activities of cells within the attentional focus, while suppressing the activities of irrelevant
cells which could otherwise be incorporated into previously learned memories and thereby
destabilize them. This all together is called the ART Matching Rule.
VIsual Search ITeratively: This VISIT model, proposed by Ahmad [1], is another connectionist model of covert visual attention. The model consists of a gating network for suppressing all activity except for a given region, a priority network responsible for sequencing and for mediating the information flow between the gating and priority networks, and a working memory for the temporary storage of relevant information. The gating network corresponds to the pulvinar, and the output, the gated feature maps, corresponds to cortical areas V4, IT and MT.
MORSEL: The model Multiple Object Recognition and attentional SELection (MORSEL) [41]
describes a neurally inspired computational model of two-dimensional object recognition
and spatial attention that can explain many characteristics of human visual perception. It
uses one-step attentional selection but no saliency maps. MORSEL is a connectionist model
of spatial attention and object recognition. It essentially contains two modules: one for
object recognition and one for visual attention. The object recognition system operates in a
hierarchical manner, pooling visual information across increasingly large receptive fields.
The spatial attention network gates activation entering into the object recognition system,
which is then biased in favor of attended objects.

Furthermore, a few models should be mentioned:
Sandon’s model [55] is a hierarchical multi-scale connectionist network that uses feature arrays
with strong lateral inhibitory connections. It represents the first real implementation of the Koch
and Ullman [32] model. Moreover, it represents the first real implementation of any attention
model. SLAM, by Phaf [48], in contrast, is based on visual word recognition. Three levels of
processing were distinguished: a mapping level, an attribute level and a response level. It included a motor program, colour, form and position modules, together with combined modules
(e.g. colour-position, form-colour and form-position). Inhibitory competition within and between
modules results in attended items. Finally, the Feature Gate Model by Cave [13] uses a hierarchy
of spatial maps encoding features. Inhibition is applied at several levels of the hierarchy to inhibit
distractor locations and the selection is based on local differences in a bottom-up WTA (Winner
Take All) mechanism, with top-down biases. Even though some of the main ideas of the model
have been used in the implementation of an attention network for a humanoid robot, the Feature
Gate Model has not been fully implemented and tested.
machine vision

3.6.2 Machine vision models of attention
These computational models of attention are greatly inspired by the psychophysical principles
mentioned earlier.
Saliency map: The "saliency map" model, which is based on, and originally proposed by Koch
and Ullman [32] and later implemented by Itti [37, 35, 36]. The model is inspired by an architecture which has a biologically plausible background. Visual input is first decomposed into
a set of three parallel tracks, one per extracted feature: color, intensity
and orientation. These maps are pyramidically scaled down with a Gaussian, which progressively low-pass filters and sub-samples the image. From these maps center-surround
and across-scale operations compute the features which locally stand out from their surround. All feature maps contribute to the master ’saliency map’, which depicts the salience
at every location of the input image. This is used to direct selective attention, where a
winner-take-all network selects the most conspicuous location in the saliency map. An
inhibition of return ensures that no location is attended twice in a row.

Figure 3.3: General architecture of the saliency-based visual attention model [35].

An intensity image I is obtained as I = (r + g + b)/3 and is used to create a Gaussian pyramid
I(σ). The r, g and b channels are normalized by I in order to decouple hue from intensity. Four
color channels are then created: R = r − (g + b)/2 for red, G = g − (r + b)/2 for green,
B = b − (r + g)/2 for blue and Y = (r + g)/2 − |r − g|/2 − b for yellow, from which Gaussian
pyramids R(σ), G(σ), B(σ) and Y (σ) are created respectively. In visual receptive fields,
typical visual neurons are most sensitive in a small region of the visual space (the center),
while stimuli presented in a region concentric with the center (the surround) inhibit the
neural response. Therefore, each feature is computed by the center-surround differences
between a "center" fine scale c and a "surround" coarse scale s. The center is a pixel at scale
c ∈ {2, 3, 4} and the surround is the corresponding pixel at scale s = c + δ, with δ ∈ {3, 4}.
It basically means that locations that stand out from their surround in any way, should
always be detected (multi-scale feature extraction). For the intensity map this becomes:
I(c, s) = |I(c) ⊖ I(s)|    (3.2)

where ⊖ is defined as the across-scale difference between a ’center’ fine scale (c) and a
’surround’ coarser scale (s). The difference between the two maps is obtained by interpolating the coarser map to the finer scale, followed by point-by-point subtraction.
The feature maps for color are computed with a so-called "color double-opponent" system, which represents the excitation and inhibition of neurons in the receptive fields of human beings [23]. Accordingly, maps are created to simultaneously account for red/green
and green/red double opponency (RG(c,s)) and for blue/yellow and yellow/blue opponency
(BY(c,s)):
RG(c, s) = |(R(c) − G(c)) ⊖ (G(s) − R(s))|    (3.3)

BY(c, s) = |(B(c) − Y (c)) ⊖ (Y (s) − B(s))|    (3.4)

with ⊖, c and s as defined previously.

Local orientation information is obtained from I using oriented Gabor pyramids O(σ, θ),
with scale σ and orientation θ. From [35]:
"Gabor filters are the product of a cosine grating and a 2D Gaussian envelope and approximate the receptive field sensitivity profile (impulse response) of orientation-selective
neurons in primary visual cortex [39]."
A Gabor filter is basically a band-pass filter with tunable center frequency, orientation and
bandwidth. The magnitude of the output of the filter provides information about the location of the textures. It should be large when the texture exhibits the frequency and orientation characteristics to which the filter is tuned.
The complex Gabor representation:

G(x, y; σ) = (1 / (2π σx σy)) exp( −(1/2) ( x̄²/σx² + ȳ²/σy² ) ) exp(2πjW x̄)    (3.5)

with

x̄ = x cos θ + y sin θ    (3.6)
ȳ = −x sin θ + y cos θ    (3.7)

where σx and σy are the scaling parameters of the filter and determine the effective size of
the neighborhood of a pixel. θ is the preferred orientation, W is the radial frequency of the
sinusoid [70].
With the preferred orientations (θ = 0°, 45°, 90°, 135°), feature maps are computed:

O(c, s, θ) = |O(c, θ) ⊖ O(s, θ)|    (3.8)
The model uses bottom-up, low-level feature extraction methods which result in a task-independent focus of attention. Top-down guidance to shift attention is not required, but
can be implemented by means of importance weights for more important features. This
would result in a feedback where only features with high weights would survive.
The saliency map theory is partly taken from [35].
Selective Tuning: Tsotsos [68] presented a selective tuning model for visual attention that uses
a visual pyramid, a winner-take-all algorithm and an inhibition of irrelevant connections.
First the WTA algorithm generates a global winner across the visual field at the top layer
of the pyramid. From there on the process proceeds to lower levels, ignoring irrelevant
connections. The end result is that from a globally strongest response, the cause is detected
at the earliest levels.
Incremental Focus of Attention: Toyama [64] describes an Incremental Focus of Attention architecture for robust, adaptive, real-time motion tracking. The system combines
several visual search and vision-based algorithms in a layered hierarchy. When conditions
are good, tracking is accurate and precise (corresponding to a higher level tracking in the
layered pyramid). As conditions deteriorate, more robust, yet less accurate algorithms take
over (corresponding to lower level tracking in the layered pyramid).
Burt: Burt [11] acknowledges that systems for real time computer vision are confronted with
an enormous amount of visual information, yet must respond rapidly to critical events. He
identifies the need for attention mechanisms, to support efficient, responsive analysis by
focusing the system’s sensing and computing resources on selected areas of a scene. The
system must locate and analyze the information essential to the task at hand, while ignoring
the vast flow of irrelevant detail. Three elements of attention are described: foveation, to
examine selected regions of the visual world at high resolution; tracking, to stabilize the
images of moving objects within the eye; and high level interpretation, to anticipate where
salient information will occur in a scene.
Conception: Conception et al. [18] describe an approach in target recognition using active and
selective perception and perceptual learning in the analysis of time-varying imagery. Attention mechanisms are implemented through the use of three linked functional pyramids
corresponding to image representation, memory, and attention. The novel approach is
multi-resolution search, supported by saccade and zoom-lens control, allowing the focus of
attention to roam inside an image pyramid under the guidance of an attention pyramid and
analyze a field-of-view at any available resolution.
Sun: Sun [59] recently proposed a model of object-based visual attention in which the mechanisms that direct visual attention are object- and feature-driven. Competition to
gain visual attention occurs not only within an object (among the constituting features),
but also between objects. Two new mechanisms in this proposed model are described and
analyzed: one that computes the visual salience of objects and groupings; the second one
that implements the hierarchical selectivity of attentional shifts.
Finally, much research has been done using neural network approaches to modeling selective
attention. A few such approaches were presented above (Adaptive Resonance Theory [26], MORSEL [41]); further models are the SCAN model by Postma [49] and the model of Baluja [4].
A problem that has only been considered briefly in the previous work is the application of
attention strategies to dynamic environments. Some researchers have considered motion as a
cue for the attention mechanism (as in human vision), but basically none of the research projects
reported here have the ability to perform reasonably in dynamic scenes. The problem here is that
there is a considerable difference between detecting motion and incorporating the phenomenon
in the internal representation. One of the missing parts is that most attention strategies rely
on a memory of prior fixation point locations. However, in a dynamic scene, one might well
have to deal with the same location several times. Hence some kind of forgetting needs to be
incorporated in the systems, along with continuous updating of the current information, and
fusing of information from different viewpoints or directions. This is in strong contrast to
today’s systems, which typically make snapshots of the environment and fuse the data into a
larger, static frame of reference.
A topic that has been left totally untouched so far is the problem of whether attention is to be
2D or 3D. All researchers have considered attention in 2D, sometimes for directing two sensors.
In active vision one could easily think of advantages when doing 3D attention. This would for
example allow the left and right camera to fixate a point in 3D space. As it is now, one camera
needs to fixate first, and the other then has to be guided by a leading eye approach. However,
for depth and orientation in space, as we intend to use the vision system, binocular vision is a
necessary means.

3.7 Summary
This section summarizes the topics discussed in this chapter: machine vision, and in particular visual attention models. The scale-space framework can be seen as a theoretically well-founded paradigm for early vision. It shows a high degree of similarity with receptive field profiles
recorded from the mammalian retina and the first stages in the visual cortex. The idea behind
scale space is that important features of the original image should survive the change of scale;
therefore, all of the scale space of the image should be used to detect these features.
A literature study on visual attention shows that many models, both for psychophysical and machine vision, have been developed. One model is chosen to be explained in more detail due to
its biologically plausible nature. This visual attention model (the Saliency map) detects attention
from three feature extraction methods (intensity, color and orientation) which contribute to a
saliency map which depicts the salience at every location in the field of view in a scalar quantity.
These feature extraction methods compute seven maps in total: one for intensity, two for color (color opponency for blue-yellow and red-green) and four for orientation (0°, 45°, 90°, 135°).
Each map is down-scaled pyramidically 5 times and contributes to the master ’saliency map’. An
inhibition of return ensures that no location is attended twice in a row.

Chapter 4
Human versus Robot vision and control
4.1 Introduction

The key to understanding the differences and similarities of human and machine vision, is the
fact that most processes applied in machine vision are adapted from the human visual system.
Since the human visual system is becoming better understood, more information becomes
available for machine vision. Although the fundamental phenomena of human vision have been
identified by neurologists ([78] and [15]), an application towards machine vision has not yet been
made. These phenomena are an extreme sensitivity to color and continual visual motion with
hyper-acutance for static objects. Our vision system’s ability to maintain color constancy despite ambient light shifts is not yet understood. Basically, human beings don’t see a radical shift
in color even though physical measurements would detect a strong offset. The ability termed
’hyper-acutance’ (acutance being perceived ’sharpness’) is a well-recognized visual phenomenon
among ophthalmologists and enables the human visual system to be extremely sensitive to spatial
and spectral differences. The movements that accomplish this sharpness are called micro saccades. It
has been well known that we can see detail in the distance about 50 times greater than what optical theory predicts for our combined lens and retinal structure [78].
These fundamental phenomena work in conjunction, and in ways that are not accounted for in
conventional television and camera systems. The human brain is a real neural network, and it
doesn’t follow anything we yet know about computation [57].
The method of tracking in human vision and in machine vision has a similar approach. The recursive estimation process (top-down) can be identified in a number of existing algorithms. As
noted, a short outline of differences and similarities is presented, with an extra emphasis on
top-down/bottom-up tracking [19].
human versus robot vision and control

4.2 Differences and similarities
Human and robot vision
Taking a field of view of 120 ◦ (the human eye really has a larger field of view, close to 180 ◦ ),
the human eye has an equivalent resolution of about 576 megapixels [16]. This megapixel
equivalent refers to the spatial detail in an image that would be required when a human
eye views a scene. This means, with micro saccades, moving the eye around a scene and
continually updating the image in the brain.
The camera which will be used in the robotic eye unit has a resolution which is significantly
lower (0.3 megapixels @ 39° × 29°), which is hardly comparable.
The detectable refresh rate (temporal frequency) of the human eye is said to be between
40Hz and 80Hz (depending on the lighting ratio of source to ambient light). The eye can
detect flickering (temporal separation) of equal intensity light sources at intervals as low as
15-20 ms. For robot vision, the camera and processing is the limiting factor. In our case,
and highly dependent on the size of the input image, the image processing has a minimum
computation of 3 to 4 ms, which enables it to detect changes at a frequency of 200 Hz [9].
The color spectrum of the human eye exceeds any artificial colorspace, as can be seen in
figure 3.2. The camera used for the visual tracking model has a number of outputs (i.e.
RGB, MONO, YUV, etc.) from which YUV is used.
Although it might be faster, robot vision doesn’t come close to human eye specifications.
Image processing
The organization of the human visual system highlights two important aspects. First, the
processing of the visual signals starts immediately in the visual pathway, right after signal
transduction. Second, the pathway towards ’higher level’ processing areas (the lateral geniculate
body and the cortex) is itself highly parallel. Therefore, the overall architecture of the visual
system does not exhibit the tremendous mismatch between sensing and processing that is typical of artificial systems [9, 15].
As is explained in chapter 2.7, visual attention in human vision is highly complex and has a
neural background. Since the purpose of the vision system in the humanoid robot (detecting, recognizing and tracking simple objects) is not as complex, a simpler mechanism
for detection, recognition and tracking is sufficient. The camera gives a YUV signal, which
is ideal for recognizing an orange ball and distinguishing the different, important, objects
(orange ball, blue and red goal, black robots) in the FOV (field of view).
Eye-head control
Depending on the movements an eye and a head are making, different specifications regarding position, velocity, acceleration and latency can be addressed. Since the list of human
eye and head movements is very extensive, only the basics are presented in table 4.1.
As is stated in chapter 2.3, there are numerous identified eye movements. The main control
movements an eye makes are saccadic movements and smooth pursuit. Smooth pursuit
does not have to be smooth and can even include saccades if retinal velocity is large and
smooth pursuit can’t keep up. These movements can be translated to visual tracking control, as smooth pursuit is in fact a velocity control method and a saccade is in fact position
control. A small overview is given in table 4.2.

Parameter                               Value
eye pan angle                           total rotation angle ≈ 70 [deg]
eye tilt angle                          total rotation angle ≈ 70-90 [deg]
head pan angle                          total rotation angle ≈ 180 [deg]
head tilt angle                         total rotation angle ≈ 140 [deg]
eye angular velocity                    max. 600 [deg/s] or 10.5 [rad/s] (pan and tilt)
eye angular acceleration                max. 40,000 [deg/s²] or 700 [rad/s²] (pan and tilt)
head angular velocity tilt              max. about 200 [deg/s]
head angular velocity pan               max. about 200 [deg/s]
eye latency (overall)                   130-200 [ms]
eye latency (computations)              50-120 [ms]
eye latency (Vestibulo-ocular reflex)   15-20 [ms]
head latency (overall)                  250 [ms]

Table 4.1: Overview of specifications of the human eye and head. Keep in mind
that these values can differ significantly per person [62], [52], [54] and [61].

Saccade
  eye angular velocity    200 ∼ 400 [deg/s] (pan and tilt)
  eye latency             150-200 [ms]
  frequency               ∼ 5 [Hz]
Smooth pursuit
  eye angular velocity    0 ∼ 100 [deg/s] (pan and tilt)
  eye latency             ∼ 130 [ms]

Table 4.2: Overview of specifications of the saccadic and smooth pursuit movements. These two control methods of the human eye are translated as velocity (smooth
pursuit) and position (saccade) control [52, 53, 40].

4.3 Top-down/Bottom-up tracking
As mentioned in chapter 2.7 the human visual system shifts our gaze according to two kinds
of processing: task-dependent (top-down) and primitive selective attention processing (bottom-up). This way of attention selection can also be translated to machine vision and its application
towards tracking an object. The bottom-up approach consists in segmenting the (known) moving
object in each frame and trying to match it over time. The object is predefined in advance and
image processing is applied accordingly.
Top-down tracking consists of estimating the position and characteristics of the object of interest
in the current frame, given its position and characteristics in the previous frame [19]. It is thus,
by definition, a recursive method and can be modeled as a Markov process. In this top-down
approach, hypotheses of object position are generated and verified using data from the current
image. Some methods which represent this approach for tracking are the Lucas-Kanade feature tracking algorithm [7], the Mean-Shift algorithm [44] and Kalman filters and particle filters
[34, 3, 19]. The difference in these algorithms lie in the way of extracting features in the FOV as
well as the tracking method (deterministic or stochastic). In all cases a model of the object to be
tracked is assumed to be available. This can be provided manually (’click and track’) or by means
of detection in a bottom-up manner (image processing) [19].
How human beings direct attention is still not fully understood. The "saliency map" model
treated earlier (chapter 3.6.2), is inspired by the behavior and the neuronal architecture of the
early primate visual system and is therefore a good candidate to be compared with a human focus
of attention system. However, whether this match holds is not the main subject here. An attention model
is mainly implemented to give the eye-head system an attentive behavior and to gain knowledge in
the human visual attention process. The saliency map in human vision is a bottom-up approach
and can be altered with top-down hints for a more task-specific tracking or attention selection.
The visual tracking model can thus be explained by a Hidden Markov model [51] and tracking
is applied by the algorithms mentioned before. In chapter 5 the complete setup is explained and
the used tracking algorithm is described and tested.

[Figure 4.1: block diagram with the blocks ’Image’, ’New features detection/initialization’, ’Estimation of feature position’, ’Search of new position and update of feature’ and ’Tracking’.]

Figure 4.1: Architecture of the top-down approach. The figure depicts how the top-down approach is built up. The two lower squares represent the recursive method, which can be modeled as
a Markov process. Figure taken from [19].

4.4 Summary
This chapter summarizes the most important differences and similarities of human and machine vision for this research. Human vision is highly complex, can count on a massive amount
of computation power and is highly parallel. The human eye has properties (low friction, fast
acceleration, highly efficient) that cannot be matched by commercially available motors. Robot
vision can computationally be faster, when knowing what to detect and track, but doesn’t come
close to human eye specifications.
The most important similarity is the notion of top-down/bottom-up processing and tracking.
In attention selection this can be identified as task dependent processing (top-down, with prior
knowledge) and primitive selective attention processing (bottom-up, no prior knowledge). For
tracking, the bottom-up approach is simply segmenting the object in each frame and match it
over time. The top-down method consists of estimation and is therefore a recursive method
(such as particle- and Kalman filters).

Chapter 5
Visual tracking model
5.1 Introduction
The main topic of this research is to detect and direct attention. For this, a setup and different
algorithms are developed to detect, direct and control the attention models. The setup used for
this is a human-like eye-head system with two degrees of freedom for the neck and two for each
eye. A single camera, thus monocular vision, is used for image acquisition. The setup and the
algorithms are all explained in the following chapters.

5.2 Eye-head system
The camera used for the visual tracking model is a Prosilica GC640 color camera with 1/2 inch
CMOS progressive scan sensor. Maximum input image size is 659 × 493 pixels (0.3 megapixel)
and can reach 195 frames per second. The chosen Fujinon lens (HF9HA-1B) has a focal length
of 9 mm and an aperture ratio of 1:1.4-16. The angle of view is 39°09′ × 29°52′ (H × V).
The head set-up is a pan-tilt unit, consisting of two Dynamixels1, where both eyes move at the
same time, just as with a human head. Besides that, the eyes themselves can also rotate around two axes.
The camera and an inertial sensor (XSens2 ) are also mounted on the pan-tilt unit. Figure 5.1
shows the complete set-up. This eye setup was partly developed by Philips Applied Technologies
and was previously used for research in stabilizing eyes with only an inertial sensor [74].

1 www.robotis.com
2 www.xsens.com

visual tracking model

[Figure 5.1: photograph of the setup, with labels ’Artificial eye’, ’Camera’, ’Xsens’ and ’Dynamixel’.]

Figure 5.1: Human-like eye-head setup. The head (eyes, camera and XSens) can move in two
dimensions; left-right (also known as ’pan’, actuated by the lower Dynamixel) and up-down (also
known as ’tilt’, actuated by the upper Dynamixel).

An Intel Pentium Dual Core PC running Linux Ubuntu and Xenomai (a real-time Linux extension) was used to connect the hardware (camera, XSens, Dynamixels and eyes)
via USB in real time with the developed and available C/C++ algorithms and drivers. ’Real-time’
can be defined as:
"A real-time system is one in which the correctness of the computations not only depends
upon the logical correctness of the computation but also upon the time at which the result is
produced. If the timing constraints of the system are not met, system failure is said to have
occurred3 ."
To apply image processing and to develop the visual tracking model in a C/C++ environment, OpenCV4 is used. OpenCV (Open Source Computer Vision) is a library of programming
functions mainly aimed at real time computer vision.

3 www.faqs.org/faqs/realtime-computing/faq
4 opencvlibrary.sourceforge.net

[Figure 5.2: coordinate frames with the x, y and z axes and the pan (z), tilt (y) and roll (x) rotations.]

Figure 5.2: The coordinate systems of the head setup (a) and an eye (b) respectively.
Tilt is the up-down movement and pan is the left-right movement of the eye and the head. Head
setup figure taken from [74].

5.3 Visual attention algorithm
The used visual attention model is a modified version of the ’Saliency map’ model and is explained
in detail in chapter 6.2.
The 7 feature maps are computed from a single input image consisting of three color channels
(RGB). As explained in chapter 3.6.2, the intensity feature map is obtained by computing the
center-surround differences in the intensity image. The color feature maps are created by computing the
color double-opponency from the red/green, green/red and blue/yellow, yellow/blue channels.
The orientation feature map is created from 4 oriented ’Gabor’ pyramids (θ = 0 ◦ , 45 ◦ , 90 ◦ , 135 ◦ ).
Each feature map (or a combination of maps) can be assigned a weight which depicts the importance of that particular feature. From the three feature ’tracks’, three pyramidically scaled-down
conspicuity maps are computed, representing the saliency of each feature (color, intensity and orientation). These are again combined to form the saliency map, which represents the most salient
location in a scalar quantity. The maximum value is then chosen to be the attended location. An
inhibition of return mask is computed by filling a certain area in a matrix, with the same size as
the saliency map, which minimizes the attended location and enables other locations to be salient
as well. The subsequent iteration combines this matrix with the conspicuity maps.
As an extra feature, a top-down approach of attention is implemented. An extra feature map for
detecting the orange ball is added; a ’YUV’ image (’YUV’ colorspace; Y stands for luminance,
U and V for chrominance) is converted from the RGB color space, from which the ’U’ and ’V’
(chrominance, or color) channels are used. However, even without the extra feature map, fast
detection with this complete model is not possible. The algorithm can run at 1 [Hz]
maximum, mainly due to the ’Gabor’ filter, which uses most of the computation time, and the fact that
many relatively large matrix additions and multiplications have to be computed. Basically, this is
a simplified ’Saliency Map’ model [35] to find and react to salient features in the FOV.
Figure 5.3: Implemented Saliency Map model. With only one image size, 7 feature maps
are generated and combined to form the saliency map. The highest scalar quantity then represents
the next attended location and an inhibition of return ensures that no location is visited twice in
a row.

5.4 Tracking algorithm
The tracking model consists of several parts and can be subdivided into an initialization stage, a
detection stage and a filtering stage. The initialization stage sets up the camera and its different
functions, as well as the other main tasks and functions. The detection stage and filtering
stage take care of detecting and tracking the ball in a robust way.

5.4.1 Main vision loop
The main loop of the visual tracking model takes care of handling the main camera features and
calling the main functions:
• Camera (un)-setup
Grabbing and opening the camera as well as clearing the frame queue, deleting the allocated buffers and closing the camera. This (un)-initializes the camera and ensures that the
camera can be unplugged safely.
• Camera start/stop
Setting the pixel format, sensor width and height, allocating buffers of each frame and
setting acquisition mode. This is done once and can be changed when different settings
are applied in the algorithm. The pixel format can be set to RGB, YUV, MONO and Bayer
and their different formats. Image acquisition can also be set to different modes. For
example, ’Freerun’ means that a continuous trigger is set and, according to the settings, the
maximum possible frame rate is used. Other modes can be an external- or software-trigger
or a fixed frame rate. Here also the sensor’s exposure time and gain can be altered. The
exposure value should not be lower than 5 ms and preferably no gain should be used, as
this also increases the noise level.
• Input image pre-processing
For tracking the orange ball, input image pre-processing is set up as follows:
The YUV input image is downsized pyramidically from 640 × 480 (VGA) to 320 × 240
(QVGA). The original image is convolved with a 5 × 5 Gaussian, then down-sized by rejecting even rows and columns to get the reduced output image. This image is then divided
into the three different channels of the image: Y, U and V, which can then be used separately. A different approach retrieves a smaller image from the camera and down-sizing is not applied.
In this case the frame rate can be much higher (up to 80 fps with down-sizing against
300 fps without). Together with image processing, communication and control it has to be
decided in which way frames are retrieved from the camera.
The complete, developed software hierarchy is depicted in figure 5.4.

Figure 5.4: Software hierarchy of the human-like eye-head system. The colored boxes
contain the software that was developed for the visual tracking model; object detection (orange
ball, attention), tracker and smoother. The ’dynamixel’ files control the ’neck’, the communication
through USB with the dynamixel is done by the ’dxlSerialLinux’ file. Extra is shown the files to
control and use the Xsens measurement unit.

5.4.2 Ball detection
Before a salient object can be tracked (attentive vision), a head start is made by tracking a simple
orange ball. From the YUV-image stream the V image (chrominance) is used to extract the bright
color of the ball (orange). Subsequently the image is binarized by means of a threshold and dilated
(3 × 3 rectangular structuring element), and the dilated and original binary images are subtracted from each other. This leaves a binary
outline of the ball which is fed to a two-stage Circular Hough Transform. In the first stage a
transform is accumulated to find the center coordinates of the ball. This is done by taking the
gradient of the image and thus ending up with peaks in the transformed image which represent
the center of the ball. The second stage constructs a radius histogram for each candidate center
derived in the first stage [77]. This computation returns the coordinates and the size of the found
ball. For efficiency purposes at this point a region of interest (ROI) can be used and is set up as
follows: a rectangle is drawn around the found ball. This rectangle is used as new ROI for the
next frame and image processing will only take place there. If the ball is lost, i.e. not found in the
current ROI, the ROI is reset and the whole FOV (320 × 240 pixels) is used again as image input.
The used ROI is only dependent on the position of the ball in the previous frame. It has to be
mentioned that a large (640 × 480 pixels) image is still streamed from the camera every iteration.
It is possible to stream smaller sized images from the camera (real ROI-ing), as mentioned in
previous subsection.
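The ROI bookkeeping described above can be sketched as follows. This is an illustrative Python sketch, not the C/C++ implementation used on the setup; the margin value is an assumption, and the ball argument stands in for the output of the threshold/Hough pipeline.

```python
# Sketch of the ROI fallback logic. The hypothetical ball detector is assumed
# to return (x, y, radius) in full-frame coordinates, or None when lost.

FULL_FOV = (0, 0, 320, 240)  # x, y, width, height of the full field of view


def next_roi(ball, margin=20, fov=FULL_FOV):
    """Return the region of interest to search in the next frame.

    If the ball was found, a rectangle around it (clamped to the FOV)
    becomes the new ROI; if it was lost, the whole FOV is used again.
    """
    if ball is None:
        return fov
    x, y, r = ball
    half = r + margin
    left = max(fov[0], int(x - half))
    top = max(fov[1], int(y - half))
    right = min(fov[0] + fov[2], int(x + half))
    bottom = min(fov[1] + fov[3], int(y + half))
    return (left, top, right - left, bottom - top)
```

A ball found near the image border simply yields a ROI clipped to the field of view.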

5.4.3 Lucas-Kanade Feature Tracker
As a first candidate tracking algorithm, the Lucas-Kanade Feature (LKF) tracker is applied and tested. In computer vision, the Lucas-Kanade method is a two-frame differential method for computing optical
flow. Optical flow estimates the motion between two image frames by assuming that the pixel
intensity between those two frames does not change significantly. A pyramidal representation of the image frames is used to compute the optical flow efficiently, thereby making the
algorithm faster. The overall pyramidal tracking algorithm proceeds as follows: first the optical
flow is computed at the deepest pyramidal level. Then, the result of that computation is propagated to the next level in the form of an initial guess for the pixel displacement. Given that initial
guess, the refined optical flow is computed at the same level and the result is propagated to the
next level, up to level 0 (the original image). A more detailed (mathematical) description of the
algorithm can be found in [7].
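The coarse-to-fine propagation of the displacement guess can be sketched as follows. This is a simplified scalar illustration, not the full Lucas-Kanade implementation; the hypothetical refine callback stands in for the per-level optical-flow solve.

```python
def pyramidal_flow(refine, levels):
    """Coarse-to-fine flow estimation over an image pyramid.

    refine(level, guess) returns the displacement correction found at
    that pyramid level; levels is the deepest level index. The running
    guess is doubled when moving one level up, because each level has
    twice the resolution of the one below it.
    """
    guess = 0.0
    for level in range(levels, -1, -1):   # deepest level first, down to level 0
        d = refine(level, guess)          # per-level flow estimate
        guess = 2.0 * (guess + d) if level > 0 else guess + d
    return guess
```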
An integrated approach towards feature detection is presented together with the LK feature tracker.
This algorithm, 'Good Features to Track' [56], was proposed by Shi and Tomasi and is based on a
model of affine image changes. The algorithm finds corners with large eigenvalues in the image.
It first calculates the minimal eigenvalue for every pixel in the image and then performs non-maxima suppression. The next step is rejecting the corners whose minimal eigenvalue is less
than a given bound. Finally, it ensures that all the corners found are distanced enough from
one another by taking the two strongest features and checking that the distance between the points
is satisfactory; if not, the point is rejected [56]. When applied, the algorithm selects a given number of features (100) and propagates them to the LK feature tracker. However, this algorithm does
not work properly for our application since it detects and uses only corners for tracking. Since
the ball is round and has a uniform color, the positions of the tracked features move away from
the center of the ball and jump excessively. Even with a different detection method and applying
a Kalman filter, the Lucas-Kanade feature tracker did not give good results.

5.4.4 Bayesian filtering
The Bayesian theory [14] is built on mathematical probability theory that models the uncertainty
of a system and the outcome of interest by incorporating prior knowledge and observational
evidence. Recursive Bayesian estimation is a general probabilistic approach for estimating an
unknown probability density function (p) recursively over time using measurements (zk ) and a
predefined mathematical process model [14]. For this, two assumptions are made:
(i) The system can be modeled by a first-order Markov process:

p(x_k | x_{k−1}, x_{k−2}, ..., x_0) = p(x_k | x_{k−1})        (5.1)

The term p(x_k | x_{k−1}) describes the system dynamics, i.e. how the state of the system changes over
time: the future behavior of the process, given its path, depends only on its present state.

(ii) Measurements depend only on the current state:

p(z_k | x_k, x_{k−1}, ..., x_0) = p(z_k | x_k)        (5.2)

The term p(z_k | x_k) (the perceptual model) describes the likelihood of making observation z_k given
the state x_k [51].

This is depicted in figure 5.5 by a Hidden Markov model (HMM). For notational simplicity, z_k
denotes the set of measurements z_{0:k} = {z_0, ..., z_k} [14]. With these assumptions the probability
distribution over all states of the HMM can then be written as:

p(x_k | z_k) = p(z_k | x_k) p(x_k | z_{k−1}) / p(z_k | z_{k−1})        (5.3)

Figure 5.5: Architecture of a Hidden Markov Model

Sequential Bayesian filtering is the extension of the Bayesian estimation for the case when
the observed values change in time. The method is named filtering when information is extracted
at time t by using data measured up to and including t, smoothing when estimating past values
given present and past measurements (up to time t), and prediction when estimating a probable
future value by using data measured up to and including time t [14].
In this research Bayesian filtering finds its application as a Kalman filter, which is a recursive
Bayesian filter for multivariate normal distributions, and as a particle filter, a sequential
Monte Carlo (SMC) technique that models the probability density function (PDF) using
a set of discrete points. The particle filter is used for prediction; the Kalman filter for smoothing.
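For a discrete state space, the recursion in (5.3) can be written out directly. This is an illustrative Python sketch; the transition and likelihood tables passed in are toy assumptions, not part of the thesis setup.

```python
def bayes_update(prior, transition, likelihood):
    """One step of the recursion p(x_k|z_k) ∝ p(z_k|x_k) p(x_k|z_{k-1})
    over a discrete state space.

    prior: p(x_{k-1}|z_{k-1}) as a list over states;
    transition[i][j] = p(x_k = j | x_{k-1} = i) (first-order Markov model);
    likelihood[j] = p(z_k | x_k = j) for the current measurement.
    """
    n = len(prior)
    # Prediction: p(x_k|z_{k-1}) via the system dynamics.
    predicted = [sum(prior[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Correction: multiply by the perceptual model, then normalize.
    post = [likelihood[j] * predicted[j] for j in range(n)]
    total = sum(post)
    return [p / total for p in post]


# Toy example: two states, static dynamics, measurement favoring state 0.
posterior = bayes_update([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1])
```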

5.4.5 Kalman filter
Due to its simplicity, optimality, tractability and robustness, the Kalman filter is one of the most
widely used methods of Bayesian filtering. It is essentially a set of mathematical equations that
implements a predictor-corrector type estimator that is optimal in the sense that it minimizes
the estimated error covariance - when some presumed conditions are met [72, 31]. The beliefs
are approximated by unimodal Gaussian distributions, represented by their mean and variance.
The mean represents the expected location of the tracked object and the variance represents the
uncertainty in the estimate. The Kalman filter can essentially be divided into three main steps.
First, initialization constructs a model transition matrix A, the process noise covariance matrix
Q, the measurement covariance matrix R and the measurement transition matrix H. In our
case, measurement positions x and y do not depend on each other, meaning the H matrix can
be initialized as a 2 × 2 identity matrix. The Q and R matrices are both initialized as diagonal
matrices, with the values determined experimentally. The A matrix relates the evolution
(or transformation) of the state from the previous time step k − 1 to the current time step k. If it is
presumed that only the location of the tracked object changes, and velocity and acceleration are not
taken into account, the A matrix is initialized as a 2 × 2 identity matrix. When also taking velocity
and acceleration into account, the A matrix (in algebraic form, and for notational simplicity ignoring
k and k − 1) looks like:
xx = 1 ∗ xx + 0 ∗ xy + 1 ∗ vx + 0 ∗ vy + 0 ∗ ax + 0 ∗ ay
xy = 0 ∗ xx + 1 ∗ xy + 0 ∗ vx + 1 ∗ vy + 0 ∗ ax + 0 ∗ ay
vx = 0 ∗ xx + 0 ∗ xy + 1 ∗ vx + 0 ∗ vy + 0 ∗ ax + 0 ∗ ay
vy = 0 ∗ xx + 0 ∗ xy + 0 ∗ vx + 1 ∗ vy + 0 ∗ ax + 0 ∗ ay

ax = 0 ∗ xx + 0 ∗ xy + 0 ∗ vx + 0 ∗ vy + 1 ∗ ax + 0 ∗ ay
ay = 0 ∗ xx + 0 ∗ xy + 0 ∗ vx + 0 ∗ vy + 0 ∗ ax + 1 ∗ ay

meaning position (x) depends on velocity (v) and velocity depends on acceleration (a). In our
case, for smoothing, the A matrix is initialized as a 6 × 6 identity matrix; velocity and acceleration
are taken into account but do not change [46].

1. Prediction step
This ’time update’ step projects the state (5.4) and the error covariance (5.5) ahead in time.
x̂⁻_k = A x̂_{k−1} + B u_{k−1} + w_{k−1}        (5.4)

P⁻_k = A P_{k−1} Aᵀ + Q        (5.5)

The n × l matrix B relates the optional control input u ∈ ℝ^l (which is not present here) to the
state x. w_{k−1} is the process noise. Q is the process noise covariance matrix, which can be
used to 'inject' uncertainty into the process to obtain acceptable results.
2. Measurement step
First the Kalman gain is computed:

K_k = P⁻_k Hᵀ (H P⁻_k Hᵀ + R)⁻¹        (5.6)

where R is the measurement error covariance and P⁻_k is the a priori estimate error covariance.
Then a measurement is performed and propagated to the filter:

z_k = H x_k + v_k        (5.7)

The m × n matrix H relates the state xk to the measurement zk . vk is the measurement
noise.
3. Correction step
Together with the Kalman gain K_k (5.6) and the measurement z_k (5.7), an a posteriori
state estimate is made:

x̂_k = x̂⁻_k + K_k (z_k − H x̂⁻_k)        (5.8)

The final step is estimating (updating) the a posteriori error covariance:

P_k = (I − K_k H) P⁻_k        (5.9)

Figure 5.6: Diagram of Kalman filter. The ’predict’ step is actually a ’time update’ step.
The ’measurement’ and ’correct’ step can actually be seen as one ’measurement update’ step; the
measurement is used in the ’correct’ step.
The Kalman gain (5.6) is chosen in such a way that it minimizes the a posteriori error covariance (5.9). Another way of looking at the weighting by K_k is that as the measurement error
covariance R approaches zero, the actual measurement z_k is trusted more and more,
while the predicted measurement H x̂⁻_k is trusted less and less. On the other hand, as
the a priori estimate error covariance P⁻_k approaches zero, the actual measurement z_k is trusted
less and less, while the predicted measurement H x̂⁻_k is trusted more and more.
Thus, the filter uses noisy measurements as a form of feedback control to estimate the process
state. After each complete cycle, the process is repeated with the previous a posteriori estimates
used to predict the new a priori estimates. Because of this recursive nature, and the fact that
conditions can be far from perfect the filter is very appealing as a ’help’ in tracking applications.
Although the filter can be implemented even when conditions are not ideal (for instance, an inaccurate model or measurement transition matrix, or considerable noise), it has drawbacks for certain processes. As a recursive linear estimator, the Kalman filter is a special case,
applying only to Gaussian densities, of a more general probability density propagation process.
When the process or observation model is nonlinear, or when the noise is not Gaussian, the Kalman
filter no longer suffices. Therefore the Kalman filter is not used here as an estimator, but as a
smoothing function for the output of the particle filter [72].
A more mathematical approach towards the Kalman filter can be found in Appendix A.
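The predict/measure/correct cycle above can be sketched for a one-dimensional constant-position model (A = 1, H = 1, no control input). This is an illustrative Python sketch, not the C/C++ implementation used on the setup; the noise values q and r are assumptions chosen for the example.

```python
class Kalman1D:
    """Minimal predict/measure/correct cycle for a scalar state."""

    def __init__(self, x0=0.0, p0=1.0, q=0.01, r=0.5):
        self.x = x0  # state estimate
        self.p = p0  # estimate error covariance
        self.q = q   # process noise covariance Q (assumed value)
        self.r = r   # measurement noise covariance R (assumed value)

    def step(self, z):
        # 1. Prediction (time update): A = 1, so the state is unchanged,
        #    while Q 'injects' uncertainty into the covariance (eq. 5.4/5.5).
        x_prior = self.x
        p_prior = self.p + self.q
        # 2. Measurement step: compute the Kalman gain with H = 1 (eq. 5.6).
        k = p_prior / (p_prior + self.r)
        # 3. Correction step: a posteriori state and covariance (eq. 5.8/5.9).
        self.x = x_prior + k * (z - x_prior)
        self.p = (1.0 - k) * p_prior
        return self.x


# Feeding noisy measurements around 2.0 pulls the estimate toward 2.0
# while the error covariance shrinks.
kf = Kalman1D()
for z in [2.1, 1.9, 2.0, 2.2, 1.8]:
    estimate = kf.step(z)
```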

5.4.6 Particle filter
Particle filtering is a technique for estimating the position of a system in its state space. The
process works by sampling the potential state space with a number of "particles", each of which
represents a hypothesis regarding the current position. Given a previous position distribution, a
current sensor reading, and a likelihood distribution of the ball’s position on the sensor reading,
the particle filter calculates a new position distribution [14, 19, 69].
Sequential Importance Sampling (SIS) is the basic framework for most particle filter algorithms [14] and looks as follows:
1. Initialization
First create a set (i = 1, ..., N_p) of initial samples x^i_k ∼ p(x^i_k | x^i_{k−1}, z_k) and define the system
equation:

x_k = A x_{k−1} + w_{k−1}        (5.10)

The model transition matrix Aparticle takes care of velocity and acceleration estimation (see
chapter 5.4.5) and looks like:


⎡ 1  0  dt  0  dt²/2   0   ⎤
⎢ 0  1  0   dt   0   dt²/2 ⎥
⎢ 0  0  1   0    dt    0   ⎥
A_particle =  ⎢ 0  0  0   1    0     dt  ⎥        (5.11)
⎢ 0  0  0   0    1     0   ⎥
⎣ 0  0  0   0    0     1   ⎦
Here dt is the time difference (in [sec]) between k − 1 and k. The process noise w_{k−1} can
be used to obtain a better particle distribution in the state space; it determines the rate at
which the filter expands. In this state transition matrix, position depends on velocity and
acceleration, and velocity depends on acceleration. An initial position can
be set, but is left at [0, 0] so the filter starts in the center of the screen.
2. Prediction
With the particles from the previous frame xk−1 and (5.10), predict the next positions of the
particles.
3. Update weights
Compute the importance weight of each particle with the probability density function

π^i_k ∼ (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²))        (5.12)

with

∑_{n=1}^{N} π_n = 1        (5.13)
It is defined by two parameters, location and scale: the mean ("average", µ) and variance
(standard deviation squared, σ 2 ), respectively. In our case, sigma (σ) is the expected distance of the ball position between iterations (in pixels) and (x − µ) is the measured position
minus the sampled position of each particle. This gives a Gaussian distribution of the distance between the particles and the measured position. The discrete sampling probability
π is then normalized to 1.
4. State estimation
The state estimation of an object is computed each time step:
E[x_k] = ∑_i π^i x^i_k        (5.14)

Up to this point, the particle filter is known as the SIS approach. Problems associated with
this are that after a few iterations, most particles have negligible weight; most weight is
concentrated on a few particles only. Possible solutions for this are taking many particles
or a resampling scheme that deals with this degeneracy problem.
When using the resampling approach, the particle filter is known as the Sampling Importance Resampling (SIR) particle filter. The basic idea is that whenever degeneracy rises
above a certain threshold, particles with low weight are killed and particles with high weight
generate more particles. However, a new problem arises when the newly generated particles
are selected more and more resulting in mostly identical particles and a very poor particle
distribution in the state space. To avoid this, the SIR approach resamples new particles the
following way:
5. Resampling
A so-called effective sample size (N_eff) is introduced, whose estimate is calculated as:

N̂_eff = 1 / ∑_{i=1}^{N_p} (π^i_k)²        (5.15)

When N̂_eff is below a certain threshold the resampling procedure is performed:
generate N_p new particles x^i_k from the set x̃^i_k according to the importance weights π^i_k.
In a 2D tracking approach the threshold for generating new particles is determined by the distance of the particles to the actual measurement. The new particles are generated using a
Gaussian function to ensure an even distribution over the state space, again limited by the
distance threshold. The weight assigned to each new particle is proportional to its distance to the measurement; further away means less weight. The size of the 'swarm' acts as
a safety margin: a sudden larger deviation in a measurement can still be 'captured' by particles further
away.
One of the strengths of this technique is the ability to track a multi-modal distribution, where
there may be multiple symmetric hypotheses. The particle filter can maintain these distinct
populations until some non-symmetrical information is able to differentiate the true position.
However, keep in mind this is a statistical technique. Its accuracy is limited by the number of
particles being used, and is dependent on correctly estimating the noise of the models [3, 69].
In this thesis the CONDENSATION algorithm (Conditional Density Propagation) was used
[34]. It is based on factored sampling (probability distribution of possible interpretations is represented by a randomly generated set) but extended to apply iteratively to successive images in a
sequence, and also known as the Sampling Importance Resampling (SIR) particle filter algorithm.

Figure 5.7: Factored sampling (CONDENSATION). A set of points s_i, the centers of the
blobs in the figure, is sampled randomly from a prior density p(x). Each sample is assigned a
weight π_i (depicted by blob area) in proportion to the value of the observation density p(z|x = s_i).
The weighted point-set then serves as a representation of the posterior density p(x|z), suitable for
sampling [34].

Compared to the Kalman filter, the CONDENSATION algorithm is simpler, despite its generality. This is largely due to the absence of the Riccati equation, which appears in the Kalman
filter for the propagation of covariance and is computationally relatively complex. CONDENSATION deals with variability by sampling, involving the repeated computation of a relatively simple
propagation formula [34].
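The SIS/SIR loop described in this section can be sketched in one dimension as follows. This is an illustrative Python sketch, not the implementation used on the setup; the particle count, the noise levels and the N_eff threshold are assumptions chosen for the example.

```python
import math
import random


def sir_step(particles, z, sigma=5.0, q=1.0, rng=random):
    """One SIR particle filter iteration for a 1D constant-position model.

    particles: list of state hypotheses; z: measurement; sigma: expected
    measurement spread (as in eq. 5.12); q: process noise std. deviation.
    Returns (new_particles, state_estimate).
    """
    # Prediction (eq. 5.10 with A = I): diffuse particles with process noise.
    pred = [x + rng.gauss(0.0, q) for x in particles]
    # Update weights with a Gaussian likelihood (eq. 5.12), then normalize
    # so the weights sum to one (eq. 5.13).
    w = [math.exp(-((z - x) ** 2) / (2.0 * sigma ** 2)) for x in pred]
    total = sum(w)
    w = [wi / total for wi in w]
    # State estimation (eq. 5.14): weighted mean of the particles.
    estimate = sum(wi * xi for wi, xi in zip(w, pred))
    # Effective sample size (eq. 5.15); resample when it drops too low
    # (the 50% threshold here is an assumption).
    n_eff = 1.0 / sum(wi ** 2 for wi in w)
    if n_eff < 0.5 * len(pred):
        pred = rng.choices(pred, weights=w, k=len(pred))
    return pred, estimate
```

Seeding the generator and iterating a few steps toward a fixed measurement concentrates the particle cloud around it.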

5.4.7 Other tracking algorithms
• Extended Kalman filter
As mentioned before, the application of the Kalman filter to nonlinear systems can be
difficult or even impossible. The most common approach is then to use the Extended
Kalman Filter (EKF), which is simply an ad hoc state estimator that only approximates the
optimality of Bayes’ rule by linearization.
The non-linear stochastic difference equation x_k = f(x_{k−1}, u_{k−1}, w_{k−1}) and measurement
equation z_k = h(x_k, v_k) are now used to write the new governing equations [72]:

x_k ≈ x̃_k + A(x_{k−1} − x̂_{k−1}) + W w_{k−1}        (5.16)

z_k ≈ z̃_k + H(x_k − x̃_k) + V v_k        (5.17)

where
– x_k and z_k are the actual state and measurement vectors,
– x̃_k and z̃_k are the approximate state and measurement vectors,
– x̂_k is an a posteriori estimate of the state at step k,
– A is the Jacobian matrix of partial derivatives of f with respect to x, that is
  A_[i,j] = ∂f_[i]/∂x_[j] (x̂_{k−1}, u_{k−1}, 0),
– W is the Jacobian matrix of partial derivatives of f with respect to w, that is
  W_[i,j] = ∂f_[i]/∂w_[j] (x̂_{k−1}, u_{k−1}, 0),
– H is the Jacobian matrix of partial derivatives of h with respect to x, that is
  H_[i,j] = ∂h_[i]/∂x_[j] (x̃_k, 0),
– V is the Jacobian matrix of partial derivatives of h with respect to v, that is
  V_[i,j] = ∂h_[i]/∂v_[j] (x̃_k, 0).

For notation simplicity subscripts k with the Jacobians A, W , H, and V are ignored. Important to note is that these Jacobians, when evaluated at step k, explicitly depend on k.
With a new definition for the prediction error and the measurement residual respectively:
ẽ_x,k ≡ x_k − x̃_k
ẽ_z,k ≡ z_k − z̃_k
one can proceed as for the traditional linear Kalman filter [72, 31]. Although it is a
widely used filter strategy that has been around for over thirty years, the
general consensus is that it is difficult to implement, difficult to tune and only reliable for
systems that are almost linear on the time scale of the update intervals. Since our sample
rate does not exceed 100 [Hz], sudden movements during tracking do not guarantee linearity on
this time scale.
• Unscented Kalman filter
The Unscented Transformation (UT) is a method for calculating the statistics of a random
variable which undergoes a nonlinear transformation and builds on the principle that it is
easier to approximate a probability distribution than an arbitrary nonlinear function. This
deterministic sampling technique is then used by the unscented Kalman filter to pick a
minimal set of sample points (called sigma points) around the mean. These sigma points
are then propagated through the non-linear functions and the covariance of the estimate
is then recovered. The result is a filter which more accurately captures the true mean and
covariance [73].
• Mean Shift algorithm
The Mean Shift algorithm iterates to find the object center given its 2D color probability
distribution image. The iterations are made until the search window center moves by less
than the given value and/or until the function has done a maximum number of iterations.
Unfortunately, the Mean Shift algorithm is designed for static distributions. An algorithm
that combines the Mean Shift algorithm with the ability to deal with dynamically changing
distributions is CamShift (Continuously Adaptive Mean-SHIFT). This algorithm adjusts the search
window size in the course of its operation [44].
• Wiener filter
The Wiener filter separates signals based on their frequency spectra. Some frequencies
contain mostly signal, while at other frequencies mostly noise is present. It is obvious that
the ’noise’ frequencies should be blocked by the filter and the ’signal’ should be passed
through. The Wiener filter takes this a step further; the gain of the filter at each frequency
is determined by the relative amount of signal and noise in that frequency. In other words,
from an observed signal, it provides the best restored signal with respect to the squared
error averaged over the original signal and the noise among linear operators. The Wiener
Filter is a noise filter based on Fourier iteration. Its main advantage is the short computational time it takes to find a solution [73].

5.5 Eye-head control
The artificial eyes are equipped with a predefined PID-control action (by Philips rt_motion control
board), and can follow a predefined trajectory or setpoints in real-time. The head can then react
to the movements of the eyes or to the setpoints themselves. Since control based on information
from the eyes introduces extra latency, and for simplicity, it was chosen to also control the head
with the same setpoints. The Dynamixels are controlled with a local PD-control action which is
incorporated in the casing; only a position and velocity have to be sent to the motor. Two human-like eye control methods are presented and applied to the Dynamixels (neck): position and velocity
control.

5.5.1 Position control
As known from chapters 2.3 and 4, velocity control is used for smooth pursuit. However, since a
reliable velocity measurement of the tracked object is not possible (due to noisy measurements
and latency), smooth pursuit is in this research controlled with position setpoints. The motor
is controlled with a local position and velocity control action to give it a smooth behavior. When
applied, the dynamixels act as a joint. If there is no ball in the FOV, a 'search' motion is initiated, consisting of an ∞-like trajectory. When a ball is found, those coordinates are sent to the
dynamixels.

Algorithm 5.5.1: TRACK_POSITION(x, y)

if (i == 0 || No_Ball)
    then Search_Motion();
         if (Ball_Found) then break;
    else GetState(x_k);
         FindBall(e_k);
         SetPos(x_{k+1});

comment: Pseudo code for controlling head movement with position control.

The new dynamixel coordinates are computed with the following difference equation:
x_{k+1} = x_k − e_k        (5.18)

with

e_k = x_ball,k − x_0        (5.19)
This should be read as: new position = (current position − (difference in screen from ball position to centerline)). Figure 5.8 illustrates this. If the ball is lost for 100 iterations, the algorithm
goes back to its ’search’ motion.
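Equations 5.18 and 5.19 amount to the following update. This is a minimal Python sketch (not the C/C++ implementation), with the screen centerline x_0 at coordinate 0 as in figure 5.8.

```python
def position_setpoint(x_current, x_ball, x_center=0):
    """New dynamixel position setpoint per eq. 5.18/5.19:
    x_{k+1} = x_k - e_k, with e_k the ball offset from the centerline."""
    e = x_ball - x_center          # eq. 5.19: ball offset in the image
    return x_current - e           # eq. 5.18: move to cancel the offset
```

A ball already on the centerline leaves the setpoint unchanged.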
Figure 5.8: Image coordinate setup. Image capture is set up such that the center of the screen,
[x0 , y0 ], has coordinate [0, 0]. The difference in x and y can now be subtracted from the current
position to control the dynamixels and position the ball in the center of the screen. Image adapted
from techunited (www.techunited.nl)

The dynamixel has the ability to set characteristics for the output torque (figure 5.9). When
used correctly, results can be shock absorption and smooth motion. Compliance parameters and
punch can be set for both clockwise and counter clockwise motion.
The compliance margins (B and C) are the areas where the output torque is set to zero. The
compliance slopes (A and D), which can be set in seven slope 'levels', reduce the output
torque while getting closer to the goal position. The wider these areas, the smoother the motion.
The punch value (E) can be specified as an 'end torque' value; the torque is reduced (as specified by
the compliance slope) until the punch value is reached.
Figure 5.9: Position - Torque curve. Compliance margin (B and C) gives an output torque of
0. Compliance slope (A and D) reduces the torque until the Punch (E) value.

5.5.2 Velocity control
In human vision, position control is used for saccades. In this research saccades are controlled
open loop with a variable velocity to an extreme outer position. The difference in position is
smaller when the ball is closer towards [0,0]. This means that the velocity which is sent out is
also lower, which gives a smooth stop (and can be compared with the compliance parameters
for position control). This open loop method doesn’t need measurements (which cause delay,
see chapter 6.4.3), meaning the frequency of the loop is no longer dependent on the delay of
the dynamixels. The rate of retrieving coordinates from image processing is now the maximum
possible loop frequency (80 [Hz]).

Algorithm 5.5.2: TRACK_VELOCITY(ẋ, ẏ)

if (i == 0 || No_Ball)
    then Search_Motion();
         if (Ball_Found) then break;
    else FindBall(e_k);
         SetVel(ẋ_{k+1});

comment: Pseudo code for controlling head movement with velocity control.

The new dynamixel velocities are computed with the following difference equation:
ẋ_{k+1} = e_dif / Δt        (5.20)

with

e_dif = e_k − e_{k−1}        (5.21)

This should be read as: new velocity = (current difference in screen from ball position to centerline − previous difference in screen from ball position to centerline) divided by the time per
iteration in [ms]. If the ball is lost for 100 iterations, the algorithm goes back to its ’search’ motion.
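Equations 5.20 and 5.21 amount to the following update; a minimal Python sketch (not the C/C++ implementation), with the time unit following the text.

```python
def velocity_setpoint(e_k, e_prev, dt):
    """New dynamixel velocity per eq. 5.20/5.21: the change in ball
    offset between iterations, divided by the iteration time dt."""
    e_dif = e_k - e_prev           # eq. 5.21: change in ball offset
    return e_dif / dt              # eq. 5.20: velocity command
```

A ball that holds the same offset between iterations yields zero velocity, which gives the smooth stop described above.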

5.5.3 Combined control
Since the human eye also combines both methods, this can also be applied to the setup. First, it
has to be identified when to initiate a saccade and when smooth pursuit; for that, what characterizes each must be distinguished. As stated in chapter 4, smooth pursuit reaches velocities up to 50 [deg/s],
while saccades can go faster (200 ∼ 400 [deg/s]). Before a control action is applied, the velocity of the ball is determined first. This is done with equations 5.20 and 5.21 for both the 'x' and
'y' direction. The condition on 'GetVel' (threshold 50) then differentiates between position and velocity control.
This results in the combined control method:

Algorithm 5.5.3: TRACK_COMBINED(x, y, ẋ, ẏ)

if (i == 0 || No_Ball)
    then Search_Motion();
         if (Ball_Found) then break;
    else if (GetVel >= 50)
         then Saccade;
         else Smooth_Pursuit;

comment: Pseudo code for combined head movement control.

Here ’Smooth_Pursuit’ stands for the position control algorithm and ’Saccade’ for the velocity
control algorithm as mentioned earlier. Position control is the default method for controlling the
movement of the setup. Whenever a difference larger than 50 pixels between the last and present position
is detected, a velocity control action is used instead of position control. This velocity control is
open loop and therefore does not stabilize the ball in the center of the FOV: the dynamixel is actuated
with a higher velocity towards an (extreme) end position, resulting in a jump in velocity (a saccade).
Position control then stabilizes the ball in the center of the FOV (corrective saccade).
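The switching logic can be sketched as follows. This is an illustrative Python sketch; the 50-pixel threshold comes from the text, while the function name and return labels are hypothetical.

```python
SACCADE_THRESHOLD = 50  # pixels of ball displacement between iterations


def control_mode(e_k, e_prev):
    """Select the control action as in algorithm 5.5.3 / figure 5.10:
    a large jump in ball position triggers an open-loop saccade
    (velocity control), otherwise closed-loop position control
    (smooth pursuit) is used."""
    e_dif = abs(e_k - e_prev)
    return "saccade" if e_dif >= SACCADE_THRESHOLD else "smooth_pursuit"
```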

Figure 5.10: Combined control scheme. An image is first fed through a ball detection stage.
If no ball is present for 100 iterations, a 'search' mode is initiated (sine motion). If a ball is
present, the difference between the last and present position of the ball (e_dif) is determined. If this is
larger than 50, a saccade is initiated that uses this difference (e_dif) and a scaling factor (Δt) to
direct the dynamixel. If it is smaller than 50, smooth pursuit is used, which simply uses
the present position (x_k; measurement) and the ball position difference in the FOV (e_k) to send
motor commands.

5.6 Summary
This section sums up the developed and tested algorithms and the setup on which they were tested. To perform experiments, a human-like eye-head setup was built. It consists of two servo motors (Dynamixels) to actuate the neck (pan
and tilt) and two small controllable eyes. A camera is positioned beneath these eyes and is used
for image input. An inertial sensor is positioned underneath the camera. The setup is connected
via USB to a PC running real-time Linux (Xenomai) and controlled using C/C++ and OpenCV (image processing library).
A simplified version of the Saliency map is developed and tuned to simulate its more complex
original; images are used to compare the two similar algorithms. Subsequently, the simplified
visual attention model is implemented on the setup, where the camera provides images to the
algorithm to direct visual attention.
A simple ball detection algorithm (color extraction with circular Hough) is used as input for
various tracking filters. The Lucas-Kanade Feature tracker (motion estimation with optical flow
between two images) gave insufficient results due to the uniformity and shape of the ball. Furthermore, a linear Kalman filter and a particle filter (SIR) are also possible solutions as tracking
filter. Not tested were the Extended and Unscented Kalman filter and the Wiener filter.
From chapter 2 it became clear that the human eye is controlled by two main control methods;
smooth pursuit and saccades. These two methods are also applied to the setup. Smooth pursuit is in fact a velocity control action; when tracking an object the eye adjusts its velocity to keep
the object in the center of the field of view. Whenever smooth pursuit can’t keep up, a saccade
is initiated, which consists of making a jump from one spot to another (position control). In the
setup these control methods are actually executed vice versa; position control is used for smooth
pursuit and velocity control is used for saccades. This is due to the fact that the object velocity
can’t be measured very accurately (due to noisy measurements and delay).


Chapter 6
Results
6.1 Introduction
This chapter describes the results of the developed algorithms and the behavior of the entire
human-like eye-head setup (see figure 5.1). The first and second parts describe the results of the
visual attention model and the visual tracking model. The third and fourth parts describe the
results of the tracking model and the attention model on the setup. The final section is devoted
to the differences and similarities between a human eye-head system and our artificial setup.

6.2 Visual attention model
6.2.1 Implementation
Our attention model is compared with an existing toolbox available for Matlab; ’SaliencyToolbox’
(developed by [71]). The toolbox model is altered slightly to use similar parameters and to eliminate
those that were not used in our developed model. For simplicity, normalization is not applied
in either model. The number of pyramid levels used in the toolbox is 5, which also explains
for both models. The number of pyramid levels used in the toolbox is 5, which also explains
the coarse scale in the shown images. Our model uses 4 pyramid levels, but higher scales can
be plotted to show the different maps. Also, the toolbox has the ability to set a highest and
lowest surround level and a center-surround delta (resolution of center-surround differences).
For simplicity and performance reasons this is not applied in our model (See chapter 5.3 and
chapter 6.2). A comparison between the human visual attention system and our model is not
made; this would be far too extensive and the theory our model is based on has already been
proven to be biologically plausible [35]. Since the output of the developed model can be directed
by changing the weights of the different feature maps, a few examples are given to show that our
model can give the same salient output as the ’Saliency Map’ model.
63

results

6.2.2 Experiments
A first comparison between the algorithms is made with an image containing clear objects (black
bottle cap, pencil, coin, white ball and a lighter). As can be seen in figures 6.1 and 6.2, the conspicuity maps are similar (higher saliency areas at the same locations), although they are shown on
different scales. The objects are found in the same order: first the black bottle cap, then the pencil
and last the lighter. With our attention model, the same salient objects are found.

(a) conspicuity and saliency maps    (b) attended locations

Figure 6.1: Output of ’SaliencyToolbox’ for image with simple objects. Image (a) shows
the computed conspicuity and saliency maps. Top left to right: color map and intensity map.
Bottom left to right: orientation map and saliency map. Image (b) shows the attended locations.
The first attended location is the black bottle cap. (Image taken from OpenCV sample library)

(a) color map

(b) intensity map

(c) orientation map

Figure 6.2: Conspicuity maps of developed attention model of ’simple objects’ image.
(a) shows the color map, (b) the intensity map and (c) the orientation map. The range in the
images is [0-255], where ’0’ is black and ’255’ is white; a higher value corresponds to more saliency.
The higher resolution images are only used for presenting results, not to compute the saliency map.

As mentioned before, the output of the ’saliency map’ algorithm can be directed for a specific
application. Using this, the coin and the white ball could also quite easily be found as the most
salient object. When comparing the attended locations, our model finds some objects twice.
This is caused by a smaller inhibition of return mask (see also section 5.3), which is added to
the conspicuity maps after the first iteration. The input parameters, such as the weights for the
conspicuity maps, differ between the two algorithms. This is because the toolbox algorithm is
more complex, having more pyramid levels and a variable center-surround resolution and level.
For simplicity and performance reasons our model does not have these options.
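The inhibition of return mask mentioned above can be sketched as a mask that suppresses a circular region around the last attended location, so the next iteration selects a different location. This is a pure-Python sketch; the representation and radius handling are illustrative:

```python
def inhibit_return(saliency, cx, cy, radius):
    """Zero out a circular region around the last attended location
    (cx, cy). A smaller radius lets nearby objects be attended twice,
    as observed in the comparison above."""
    r2 = radius * radius
    return [[0.0 if (x - cx) ** 2 + (y - cy) ** 2 <= r2 else v
             for x, v in enumerate(row)] for y, row in enumerate(saliency)]
```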

(a) saliency map

(b) attended locations

Figure 6.3: Output of the developed attention model (’simple objects’ image). Image
(a) shows the saliency map, image (b) the attended locations. Here too, the first attended
location is the black bottle cap.

Secondly, an image is used which shows a soccer field with two humanoid robots and an
orange ball. With some tuning, both models give the orange ball as the most salient object in the
image. Next, the two robots are found, although in a slightly different order.

(a) conspicuity and saliency maps

(b) attended locations

Figure 6.4: Output of ’SaliencyToolbox’ for image with robots and ball. Image (a) shows
the computed conspicuity and saliency maps; top left to right: color and intensity map. Bottom
left to right: orientation and saliency map. Image (b) shows the attended locations. The first
attended location is the orange ball. (Image taken from www.nimbro.de)

Here too, the weights of the conspicuity maps are slightly different, for the same reasons.
A solution to obtain identical results is to add a top-down importance mechanism (or extra maps)
which decides what should be even more salient, or which location should be attended first. A suggestion
could be to convert the ’RGB’ color maps to the ’YUV’ color space, which separates chrominance
from luminance and is therefore especially convenient for detecting the orange ball.
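The suggested ’RGB’ to ’YUV’ conversion can be sketched with the common BT.601 weights (one of several ’YUV’ definitions, assumed here for illustration):

```python
def rgb_to_yuv(r, g, b):
    """Convert an RGB triple to (Y, U, V) using the common BT.601 weights.
    Orange pixels score high on the V (red-difference) channel, which is
    what makes a V-channel threshold convenient for orange-ball detection."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    u = 0.492 * (b - y)                     # blue-difference chrominance
    v = 0.877 * (r - y)                     # red-difference chrominance
    return y, u, v
```

For an orange pixel such as (255, 128, 0) the V component is large and positive, while for any gray pixel both chrominance components are (near) zero.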


(a) color map

(b) intensity map

(c) orientation map

Figure 6.5: Conspicuity maps of developed attention model of ’robots and ball’ image.
(a) shows the color map, (b) the intensity map and (c) the orientation map. The range in the
images is [0-255] meaning ’0’ black and ’255’ white. A higher value corresponds to more saliency.
The higher resolution images are only used for results and not to compute the saliency map.

(a) saliency map

(b) attended locations

Figure 6.6: Output of developed attention model (’robots and ball’ image). Image (a)
shows the saliency map, image (b) shows attended locations of the developed attention model.
When compared with the toolbox results; the same locations are attended and the first, most salient
object found for both models is the orange ball. However, the order of the found locations is not
identical.

More images (a soccer field with humanoids, scenery with people, etc.) were used to test the
differences between the algorithms. The results were similar to those above: the output
can be directed in such a way that both models give similar results. However, the algorithm is
not appropriate for robot soccer; it depends on too many variables and is too task-specific
to track a salient object. A top-down mechanism, however, could distinguish importance
for robot soccer. The computation time of both models is in the order of 100-200 [ms] per salient
object.
The algorithm was also applied to the eye-head setup: from the camera image, the algorithm
computes the most salient location every second (1 [Hz]). The algorithm can be altered such that
it uses the same input image multiple times, to find more salient objects in a single FOV. Best
results, however, are obtained when only the most salient feature is used from a single image.
In a slowly or little changing environment, the eye-head setup then ’searches’ its surroundings
in a human-like way, as if interested in finding salient objects. For a faster attentional tracking
mechanism, the algorithm has to be altered. This is treated in chapter 6.4.2.


6.3 Visual tracking model
6.3.1 Implementation
The implementation of the tracking model consists of an image acquisition and processing part
and a combination of a particle filter and a Kalman filter. For image acquisition, a ’YUV’ input
image is scaled down pyramidically to ’QVGA’ size (320 × 240 pixels). The number of particles,
the number of update steps in one Condensation iteration and the expected position error between
iterations (in pixels) can be changed while running, to improve the performance of the algorithm.
The Kalman filter is only used to smooth out the chattering behavior of the particle
filter. Furthermore, the ROI function can be turned on and off at any time. It uses a smaller
input image for processing in order to detect a ball and is therefore computationally faster. However,
it increases the chance of missing a fast moving ball in the FOV, since less information (a
smaller image) is available to detect it. If a ball being tracked disappears, the particle filter drifts
around the last coordinates and keeps the output and the particles approximately at that position.
In other words, the expected distance error is used as a range for the particles within which the ball
can reappear.

6.3.2 Experiments
The algorithms and setup are tested in a stepwise manner: first only the algorithms themselves,
and from there on the different components (camera, dynamixel, eyes) are added into the loop. A bouncing
ball and simple step and sine functions are used to test the algorithms and both the eyes and
the head. Figure 6.7 shows a noisy measurement from the camera of a bouncing ball. This gives
an idea of the performance of retrieving coordinates from a camera without a tracking filter.

Figure 6.7: Bouncing ball measurement. The noisy measurement is clearly visible between 0
and 70 iterations. The horizontal line represents the floor. Due to the frame rate, the camera
cannot capture the position very accurately, so coordinates are missed (visible where the trajectory
sometimes does not reach the floor).


• Ball detection
Ball detection is relatively simple. A Circular Hough Transform (CHT) is performed on a
(single-channel) binarized color image of the ball (the ’v’ channel of ’YUV’). Figure 6.11 (a) shows
the outline of a ball, which is fed to the CHT.
• Kalman filter
Figure 6.8 shows the response of tracking a ball with a Kalman filter which is initialized
with only position in the model transition matrix ’A’. There is no overshoot and some
delay, while the trajectory is quite smooth. Different values for Q (the process noise covariance)
and R (the measurement covariance matrix) were tested. The following gave acceptable results:
Q is an identity matrix with value 0.01 on its diagonal, and R is an identity matrix
with value 0.1 on its diagonal.
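A scalar sketch of such a position-only Kalman filter, using the same diagonal values for Q and R, may clarify the structure. This is illustrative pure Python, not the setup's actual code:

```python
def kalman_1d(measurements, q=0.01, r=0.1):
    """Scalar Kalman filter with an identity state transition (position-only
    'A' matrix): the model predicts the position stays where it was, and
    each measurement pulls the estimate towards the data. q and r mirror
    the diagonal values used on the setup."""
    x, p = measurements[0], 1.0   # initial state and error covariance
    out = []
    for z in measurements:
        p = p + q                 # predict: x stays put, uncertainty grows
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update with measurement z
        p = (1.0 - k) * p
        out.append(x)
    return out
```

On a step input the output approaches the new level monotonically, i.e. with some delay but without overshoot, matching the response in figure 6.8.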

Figure 6.8: Tracking with a Kalman filter with only position transition in ’A’ matrix.
Q (process noise covariance) = 0.01; R (measurement error covariance matrix) = 0.1.

Figure 6.9 shows the tracking response of a Kalman filter whose model transition matrix ’A’
is initialized with position, velocity and acceleration. Three trajectories are used: a
step function, a sine function and the bouncing ball. At the start a clear overshoot
is visible, which also occurs on the sine function. The bouncing ball trajectory shows considerable
delay. Overall, the result is not acceptable; overshoot could result in excessive movements,
which could lead to losing the ball in the FOV.



Figure 6.9: Tracking with a Kalman filter with pos., vel. and acc. in ’A’ matrix. Q
(process noise covariance) = 0.01; R (measurement error covariance) = 1.

Effect of measurement error covariance and process noise covariance.
As explained in chapter 5.4.5, the measurement error covariance matrix ’R’ weighs the
reliability of the measurement. When R is low (approaching 0), the measurement noise is assumed
small, so the actual measurement is trusted more and dominates the estimate. When R is high,
the measurement is trusted less and the prediction has a greater effect.
This can be seen in figure 6.10 and also by comparing figure 6.8 and figure 6.9.
For the process noise covariance ’Q’ the opposite holds: the lower it gets (approaching 0), the
more the prediction is trusted, and vice versa.
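The influence of R can be made concrete by iterating the scalar Riccati recursion to its steady-state Kalman gain (an illustrative sketch; a higher gain means the measurement dominates the estimate):

```python
def steady_state_gain(q, r, steps=200):
    """Iterate the scalar covariance recursion until the Kalman gain
    settles; a larger R (less trust in the measurement) yields a lower
    gain, i.e. the prediction dominates."""
    p = 1.0
    for _ in range(steps):
        p_pred = p + q                # predicted error covariance
        k = p_pred / (p_pred + r)     # Kalman gain
        p = (1.0 - k) * p_pred        # updated error covariance
    return k
```

For q = 0.01, raising R from 0.01 to 0.1 roughly halves the steady-state gain, which is exactly the smoother (but more delayed) behavior seen when comparing the two panels of figure 6.10.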

(a) lower R (0.01)

(b) higher R (0.1)

Figure 6.10: Comparison of different values of measurement error covariance ’R’.
Figure (a) shows a lower value for R, which means more trust in the actual measurement and
thus less smooth tracking, since the actual measurement is chattering.


• Particle filter
Figure 6.11 (b) shows the tracking window, where the settings for the particle filter can be
altered in real time. It can be used to tune the particle filter for better performance. When
the window is switched off, the algorithm is somewhat faster (up to 20%).

(a) Outline ball

(b) Tracking window

Figure 6.11: Tracking windows. Figure (a) shows a one-pixel wide edge map of the ball. The
slider adjusts the threshold value used to binarize the color image. Figure (b) shows the tracking
window with the adjustable sliders. The small and big circles represent the measured position of the
ball. The swarm of crosses (variable in size; smaller means less weight) represents the particles.
The big cross is the averaged output of the particle filter. The square represents the ROI.


Figure 6.12: Tracking with a particle filter with position, velocity and acceleration in
’A’ matrix. Q (process noise covariance) = 0.01; R (measurement covariance matrix) = 0.1.

As can be seen in figure 6.12, the particle filter has a fairly good response: no overshoot and
hardly any delay. The only disadvantage is the chattering behavior it causes at low velocities.
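One Condensation-style iteration as used here can be sketched in one dimension. This is pure Python; the parameter names `spread` and `noise` are illustrative stand-ins for the expected position error and the measurement noise:

```python
import math
import random

def particle_filter_step(particles, weights, measurement,
                         spread=5.0, noise=10.0):
    """One Condensation-style iteration in one dimension."""
    # 1. resample particles in proportion to their weights
    resampled = random.choices(particles, weights=weights, k=len(particles))
    # 2. diffuse each particle with process noise (expected position error)
    moved = [p + random.gauss(0.0, spread) for p in resampled]
    # 3. re-weight against the measurement (Gaussian likelihood)
    new_w = [math.exp(-(p - measurement) ** 2 / (2.0 * noise ** 2))
             for p in moved]
    total = sum(new_w) or 1.0
    new_w = [w / total for w in new_w]
    # 4. the output is the weighted mean of the particle cloud
    estimate = sum(p * w for p, w in zip(moved, new_w))
    return moved, new_w, estimate
```

Note that simply skipping the re-weighting step when no measurement is available leaves the particle cloud drifting around the last position, which matches the lost-ball behavior described in section 6.3.1.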


• Combined algorithm
The chattering behavior is smoothed out by a Kalman filter. The particle filter alone is compared
with the combination of a particle filter and a Kalman filter in figure 6.13. A comparison of the
combined tracking filter (particle filter plus Kalman filter) and the measurement is shown in figure
6.14. The Kalman filter has the disadvantage that it introduces an extra delay into the response.
However, compared to the particle filter alone, and in experiments on the setup, this showed
no drawbacks. This is mainly due to the robustness of the particle filter, which makes it
almost impossible to lose a ball in the FOV.

Figure 6.13: Comparison between particle plus Kalman filter and particle filter. The
chattering behavior is clearly reduced. For convenience the measurement is left out.


Figure 6.14: Comparison between particle plus Kalman filter and measurement. The
output is a smooth trajectory, not perfectly following the measurement, but acceptable to track a
ball in real-time.


6.3.3 Performance and delay
Image processing, including ball detection, the particle filter and the Kalman filter, takes about 8 [ms] to
compute. With this in mind, theoretically, the image processing loop cannot run faster than
125 [Hz]. However, ball detection works optimally when the exposure time of the camera is set no
lower than 7500 [µs], resulting in a frame rate of about 80 [fps]. Since the algorithm runs in real
time, this gives enough margin in computation time.
Although the camera could achieve a higher frame rate, this would be useless as long as the ball
detection algorithm cannot keep up; image processing is thus the limiting factor.

6.4 Setup
Two servo motors (Dynamixels) are used to actuate the neck and head. The camera is mounted
in the middle of a board on which both eyes are also fixed (see figure 5.1). The developed visual
tracking model is added to the existing software (eyes, dynamixels, communication) to end up
with a complete setup which can be actuated to mimic a human eye-head system.

6.4.1 Implementation
As mentioned before, two methods for dynamixel control are implemented. For position control,
the dynamixel control loop runs at approximately 30 [Hz], which is due to the delay in the
communication between USB and the serial port. Velocity control alone is faster (80 [Hz]), since
no measurement is needed; controlled by coordinates from image processing, the maximum
frequency could theoretically equal that of the image processing loop (80 [Hz]). To achieve
smooth human-like movement with position control, the dynamixels are tuned via compliance
parameters; velocity control takes care of this within its dynamixel control method (see chapter 5.5).
The main task of the dynamixels can be seen as a ’follower’ of the eyes, since the eyes can move and
react much faster. The eyes are controlled with the same setpoints as the head.

6.4.2 Experiments
Position control
The frequency of the position control loop is limited to 30 [Hz]. This is due to the communication
between Dynamixel and USB: the USB2Dynamixel device (USB to serial port) uses an
FTDI chip which causes latency when information is requested from the Dynamixel. Even with this
latency, a decent control loop is possible. Whenever the ball is in sight, tracking
control directs the Dynamixels towards it. When the ball is out of sight, the last coordinate (if
out of center) directs the Dynamixel in the last moved direction. Figure 6.15 shows the response
of the setup (dynamixels only) to a step-like movement of the ball in the FOV. The Dynamixel
has a smooth motion and the ball is not lost in the FOV, since the position of the ball stays within
the [-120,120] range, which is the lower and upper pixel limit of the input image. The coordinates
of the Dynamixel are fixed, while the filter has a relative coordinate output.


Figure 6.15: Response of particle filter and dynamixel position on step function. The
chattering behavior in the particle + Kalman response gives better results for detecting and
maintaining the ball in the FOV than a smooth output would. A ’proof’ of tracking can be seen in the
fact that the position of the ball stays within the [-120,120] range, which is the y-pixel range in
the FOV, and the fact that it stabilizes roughly around zero. The Dynamixel has a smooth output
and no overshoot.


Figure 6.16: Response of particle filter and dynamixel position on bouncing ball. Again,
the output of the tracking filter is not smooth, while the ball is not lost (it stays within the [-120,120]
range). The dynamixel does not follow the bounces up and down; instead, a smooth trajectory tracks
the ball to its final position.

The response of position control of the setup to a bouncing ball can be seen in figure 6.16.
From a certain height in the FOV, a ball is dropped, with the floor at a detectable height. The
ball remains in the FOV, since again the [-120,120] range is not exceeded.
Figure 6.17 shows the response after a ’lost’ ball. The ball is moved manually, upwards out of
the field of view. The Dynamixel moves to its end point while the tracking filter can drift past its
’limit’. Past the FOV, in this case the upper limit of 120 [pixels], image processing no longer
contributes to the tracking filter; the prediction and estimation process of the particle filter now
solely propagates information for the ball (and thus dynamixel) position. When the ball is not in
the FOV for 30 iterations, a trigger is set and a search motion or a fixed standby position can be
initiated.
These experiments are carried out with a ball relatively close to the camera, which leads to
relatively large motions from the Dynamixels. When the ball is further away, smaller motions
suffice to keep the ball in the FOV.


Figure 6.17: Response of particle filter and dynamixel position on lost ball. The ’limit’
line depicts the end point up to which a ball can be found. The Dynamixel is at its end point (270
encoder counts), but the filter can drift further away. After that, no ball is recognized, and after 30
iterations a search motion is initiated. The ’wave’ only depicts the data sent to the dynamixel,
so the sudden jump does not physically occur.

Velocity control
With only the velocity control loop, the loop frequency can equal that of the image processing
loop, which is about 80 [Hz]. The disadvantage of velocity control is that the position of
the ball is not taken into account; it is inevitable that the ball eventually drifts out of the FOV. A
second disadvantage is that this control is open loop: no feedback is used and no information
is requested from the dynamixels. There are two reasons for this. First, it is very difficult to obtain
an accurate velocity measurement, due to the relatively low sampling rate (equal to that of image
processing, so 80 [Hz]) and the fact that the measurements are very noisy. Secondly, as mentioned
before, reading data from the Dynamixels introduces latency. Therefore, error plots are not available to
compare or evaluate the control loop. Instead, the response of the algorithm is used to show the
behavior of velocity control.
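The velocity command described above, a pixel difference scaled into a velocity setpoint with a boundary inside which no command is sent, can be sketched as follows (the gain and deadband values are illustrative, not those of the setup):

```python
def velocity_command(prev_pos, pos, gain=2.0, deadband=3.0):
    """Map the pixel difference between the present and last ball position
    to a velocity setpoint. Inside the deadband no command is sent (0.0),
    which is why, without position control, the ball can slowly drift out
    of the FOV."""
    diff = pos - prev_pos
    if abs(diff) < deadband:
        return 0.0
    return gain * diff
```

A slow drift (differences below the deadband) thus produces no correction at all, exactly the failure mode visible past iteration 220 in figure 6.19.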


Figure 6.18: Response of tracking filter and dynamixel velocity on ball sine motion.
The difference between the present and last position of the ball in the FOV is used to compute a
velocity for the Dynamixel. When this difference is within certain boundaries, no control signal is
sent, so the position of the tracker will inevitably drift out of the FOV.

Figure 6.19 shows that with only velocity control of the ball position, stable tracking
is not possible; eventually the ball drifts outside the FOV.

Figure 6.19: Response of tracking filter and velocity control output on bouncing ball.
When the ball is moving slowly, a boundary condition prevents a velocity command from being sent.
Because of this, the slow motion can cause the ball to drift out of center and eventually out of the FOV,
since there is no position control to correct it. This can be seen past iteration 220, when the
velocities no longer have any effect and the ball passes the image limit [-120].

Combined control
The combined control scheme aims at simulating human eye control. Position control is the default
control method; when the velocity needed to track the ball gets too high, a jump in velocity
is used to keep up. Directly after that, a corrective saccade (position control) positions the ball in
the center of the FOV. This method is also present in human vision. A problem in this scheme
is that both control methods are called from one schedule. Since both methods have different
execution times, this can lead to delay in tracking control. This problem is solved by making both
control schemes equal in computation time: the velocity control loop is called 2 or 3 times (or
once with some added delay) so that both end at approximately the same time.

As can be seen in figures 6.20 and 6.21, there are two ways in which velocity control enters
the control scheme: one in which it executes when position control cannot keep up, and
one in which the tracker shows a sudden stop in position (or change of direction). In the latter case
an opposite velocity is sent to rapidly slow down or stop the Dynamixel (or send it in the opposite
direction). Compared to position control only, the combined control method has more sudden
movements; position control alone is smoother.

Figure 6.20: Response of Dynamixel with combined control on step functions. The
Dynamixel position stabilizes after each manual step down. Velocity control is both used for
slowing down/stopping the Dynamixel (peak up) and catching up the tracking filter (peak down).
Velocity control has a slightly longer execution time to let the velocity signal take effect.


Figure 6.21: Response of Dynamixel with combined control on bouncing ball. From an
initial position [150], the Dynamixel position stabilizes on the ball. From there the ball is dropped
in the FOV. Here also, velocity control is both used for slowing down/stopping the Dynamixel (peak
up) and catching up the tracking filter (peak down). The Dynamixel position shows no overshoot
and a controlled movement tracks the ball down to the floor.

It must be clear that the methods applied to control the head are actually control methods
present in the human visual system. This is done mainly to put the emphasis of this
thesis on eye control methods, but also because of the available hardware. In both cases the setup
(two Dynamixels and camera) can be seen as an eye in itself.
Eye control loop
The small movable eyes are controlled with a PD control action embedded in the motion
control boards. Coordinates from the tracking filter are sent directly to the eye controller. The
main idea was to have the eyes first fixate on the ball and have the head act as a follower. Since
the eyes react much faster than the Dynamixels, and the eye control loop runs twice as fast
as the position control loop, this is what actually happens. However, since these eyes are a
first prototype, some difficulties could not be solved: a fair amount of chattering due to friction
and due to the movements of the head itself remains in the setup. These effects can be overcome
(or already have been) with some design alterations in the eye hardware (sapphire bearings).
Direct measurements are therefore unfortunately not of much use; a measure of performance
can be found in [74].


Search and Track
The next step is to implement the visual attention model such that it tracks the most salient
object in the FOV. For this, a few alterations had to be made to make the attention algorithm fast
enough to track in ’real time’. To compute the orientation feature map, the Gabor filter is replaced
by a Sobel operator, which is significantly faster. The Sobel operator calculates the gradient of the
image intensity, which emphasizes high spatial frequency regions that correspond to edges.
The output is an edge map which, combined with Gaussian smoothing, gives the closest resemblance
to a Gabor filter. The weights of the three feature ’tracks’ can be altered while running, to
direct attention to different features in ’real time’. The loop frequency of the algorithm is again
limited, by communication delay and the many matrix calculations, to 20 [Hz], which is
nonetheless sufficient to track a salient object in the FOV. By changing the weights of the feature
tracks, different objects become more salient and are chosen as the tracked object. Figures 6.22,
6.23 and 6.24 show that by changing the feature weights, different objects become most salient.
Controlling the motion of the eye-head setup is done with position control only (smooth pursuit):
since different objects in the FOV can become most salient in no predefined order, the expression to
differentiate between a saccade and smooth pursuit can result in unpredictable and erroneous behavior.
It must be noted that these results depend heavily on a number of circumstances (lighting,
scenery in the FOV). Stable visual attention tracking was mainly possible with an extra top-down
mechanism to detect the orange ball. Other objects were more salient under certain conditions, but
further investigation is necessary to get reliable results.
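The Sobel stand-in for the Gabor filter can be sketched in pure Python on a list-of-rows intensity image (the border handling and magnitude formula are a common choice, assumed here for illustration):

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel

def sobel_magnitude(img):
    """Gradient magnitude via the Sobel operator, the faster stand-in for
    the Gabor filter in the orientation track. The one-pixel border is
    left at zero for brevity."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out
```

On a vertical step edge the magnitude is large along the edge column and zero in the flat regions, which is the edge-emphasizing behavior described above.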
From the initial idea to develop a human-like eye-head ’Search and Track’ method, a rough
version is eventually the end result. The visual attention model directs its attention towards
locations in the FOV according to the weights in the saliency map. A particle filter and a smoothing
mechanism (Kalman filter) then ensure robust tracking control with the position control method
discussed earlier. Looking at the small eyes, the amount of chattering or vibration is amplified
or damped by the movement of the head itself. Still, the eyes are faster than the neck, so the
movements can be recognized as an initial tracker (eyes) and a follower (head).

(a) Saliency map

(b) Attended location

Figure 6.22: Output ’Search and Track’ algorithm. Figure (a) shows the saliency map, figure
(b) the attended location. Most salient object in the FOV is the orange ball, which is tracked at
20 [Hz]. The sliders alter the weight of the concerned feature track.


(a) Saliency map

(b) Attended location

Figure 6.23: Output ’Search and Track’ algorithm. Figure (a) shows the saliency map, figure
(b) the attended location. Most salient object in this case is the green pen. Orientation is in this
case not of interest.

(a) Saliency map

(b) Attended location

Figure 6.24: Output ’Search and Track’ algorithm. Figure (a) shows the saliency map, figure
(b) the attended location. The most salient object in this case is a can, detected by giving a high
weight to the red color map. Therefore, with other red objects in the FOV, stable tracking
of the can alone cannot be guaranteed.

6.4.3 Scheduling
Visual tracking
The visual tracking model runs at a frequency of about 80 [Hz], which means that the main
algorithm extracts a set of coordinates from the visual tracking algorithm approximately every
12.5 [ms]. These are then sent to the eyes and the dynamixels. Since the control frequencies of the
dynamixels and the eyes differ from this rate, the main loop of the complete system is set up such
that different tasks can run at different frequencies. Two tasks (one for eye control and one for
Dynamixel control) are called periodically. For this reason, scheduling becomes a very important
issue. The main schedule is set up such that the eye control loop is called twice as often as the
Dynamixel control loop. The latter has some extra unpredictable behavior to it (velocity control),
so this schedule is explained in more detail.
Looking at combined Dynamixel control very roughly, the difference is that with position control
only, the control of the dynamixels can be a bit faster but is executed later, while with the
combined control scheme the movement can be faster and the execution can be earlier. In figure
6.25 this scheduling is visualized. The computation time of the position control task is about 30
[ms]; the combined control method can take up to 50 [ms] to execute. This is to let the velocity
signal take effect and let it catch up with the larger position difference. The figure depicts only the
order of the tasks, so the size of the control tasks can be somewhat misleading. Also, since the
acquisition of images and the image processing can be computed separately from (or in parallel with)
the control computation, their size is also not to scale.

Figure 6.25: Task scheduling. Both control methods show the order of the scheduled tasks. The
position control scheme shows an extra line in the middle; this depicts the start point of the control of
the first Dynamixel. The start point of the second Dynamixel is the end of the control task. The
scheme of combined control is one possible schedule. The worst-case scenario would be velocity
control only and a loss of the ball due to not being able to fixate the ball in the center of the FOV.

Since with the combined control scheme the schedule is no longer predictable, only best- and
worst-case scenarios can be given. A best-case (or normal) scenario could be a scheme as shown
in figure 6.25. The worst-case scenario would be velocity control only and eventually a lost-ball
situation.
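The 2:1 eye/Dynamixel calling ratio of the main schedule can be sketched as a toy loop (illustrative only; the real schedule also interleaves the velocity control behavior described above):

```python
def run_schedule(ticks, eye_task, dynamixel_task):
    """Toy main loop: the eye control task runs every tick, the Dynamixel
    task every second tick, giving the 2:1 calling ratio of the main
    schedule."""
    for t in range(ticks):
        eye_task(t)
        if t % 2 == 0:
            dynamixel_task(t)
```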
Visual attention
The simplified model of the saliency map has a computation time of approximately 10 [ms] for the
image processing part alone. This is mainly due to the number of multiplications and additions
of the numerous matrices. Together with position control, the total loop time would be 40 [ms] if
both tasks were called successively. Eventually, a loop frequency of 20 [Hz] was feasible.

6.5 Comparison towards human visual system
It might be clear that the remarkable mechanisms and control methods of the human eye are
far from understood and can hardly be matched. The human eye has low inertia and relies on
6 muscles for its actuation. These muscles have low friction, fast acceleration, high power/size
ratio, are highly efficient and work in pairs (agonist vs. antagonist). These properties cannot be
matched by commercially available motors. On the control side, since the physiological behavior
is not fully understood, a good comparison is difficult to make or simulate. For image processing,
or human vision in general, the eye can rely on a processor of phenomenal power, adapted to
human needs and activities through evolution [42].
From [35] it is clear that human beings direct their attention based on three parallel tracks
for selecting a next attended location. These tracks differentiate between feature extraction with
respect to color, intensity and orientation. This method is applied as a simplified model on the
eye-head setup, which results in a visual attention tracker which scans a room in a human-like
manner. When performing a specific task, human beings can direct their attention to a certain object; the neurological background of this is not clear. A possible explanation lies in the
weights assigned to the different feature extraction methods, since human vision can
rely on massively parallel computing power. For the eye-head setup a simple feature extraction
method gives similar results, but can be faster and does not have to deal with the more complex
extraction methods human vision is capable of.
The latency of the peripheral system can be compared with the latency of the setup. For
human beings this latency is the time between an observation made in the FOV and the execution
of muscle movement, which is typically between 150 and 200 [ms]. The delay in the eye-head
setup (where the two Dynamixels act as one eye) consists of the computation time of image
processing and a communication delay in the Dynamixels and is typically between 45 and 60 [ms]
(10 [ms] for image processing, 35−50 [ms] for communication, depending on the applied control
method). It has to be noted, however, that when the Dynamixels are connected directly to a
motherboard, the communication delay is reduced to 4 [ms].
Human eye control uses saccades for position control and smooth pursuit for velocity control.
In our setup, this is actually vice versa; smooth pursuit is mimicked with position control, while a
jump in velocity is used to mimic a saccade. The reason for this is that it is very difficult to obtain a
reliable velocity measurement (low sampling rate, noisy measurements) and that there is
a relatively large amount of delay present in the system. As described in [24], eye movement is
always accompanied by processes that can initiate a corrective saccade, to adaptively keep the
saccadic system in calibration with the visual environment. This process operates unconsciously,
but is vital in maintaining useful, active vision. Comparing this with our setup: after a saccade
is initiated, a position control action (in fact, smooth pursuit) ensures that the attended location is
centered in the FOV.
From this it can be concluded that human vision is quite slow but far more complex, resulting
in a vision system which can hardly be matched. However, a drawback in the human vision
system is that during a saccade, humans become effectively blind. Since these saccades can
take up to 200 [ms], it is quite plausible that this causes difficulties when tracking fast-moving
objects.
6.6 Summary
A short summary of the obtained results is given:
The simplified visual attention model was verified by comparing its image output to the image
output of the original Saliency map model. Analysis showed that the same objects are found and,
with some tuning, the most salient object is found first. However, the next most salient objects
can appear in a slightly different order.
For visual tracking control, the Lucas-Kanade Feature tracker was already ruled out and also a
linear Kalman filter turned out to be insufficient as observer (too much overshoot and/or delay).
A particle filter (Sampling Importance Resampling) gave good tracking results, however, an extra
mechanism was needed to smooth out the chattering output. This was covered by a linear Kalman
filter.
The two eye control methods were applied both separately and as a combined control action. The
result was that position control can be applied separately, but velocity control cannot. Velocity control
is open loop and, when the measured velocity is below a certain threshold, no velocity action is
sent out. This means that the ball will inevitably drift out of the field of view. A combination of
both methods gave good results: normal tracking is covered by position control (smooth pursuit)
and, when this cannot keep up, a velocity control action (saccade) is executed. After this saccade,
position control ensures that the tracked object is again positioned in the center of the field
of view (a so-called corrective saccade).
A ’Search and Track’ algorithm combines the visual attention model with the control methods.
The simplified attention model is adjusted to make execution in real-time possible. The Gabor
filter (which responds highly to a given orientation and turned out to be computationally too
heavy) is replaced by a Sobel operator (gradient of image intensity in combination with Gaussian
smoothing). The end result is an attention tracker which can detect and direct attention at 20
[Hz]. Feature weights can be altered and extra feature maps can be added to direct attention in
’real time’.
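The Gabor-to-Sobel substitution mentioned above can be sketched as follows. This is an illustrative NumPy version, not the thesis's C++ implementation; the 3x3 kernels and the naive valid-mode convolution are simplifications for clarity (in practice a Gaussian pre-smoothing step would precede the gradient, as described in the text).

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def convolve2d(img, kernel):
    """Naive 'valid' 2-D correlation -- adequate for 3x3 kernels."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def orientation_map(intensity):
    """Gradient-magnitude stand-in for the Gabor orientation channel."""
    gx = convolve2d(intensity, SOBEL_X)
    gy = convolve2d(intensity, SOBEL_Y)
    return np.hypot(gx, gy)
```

A Sobel map responds to all edge orientations at once, whereas a Gabor bank responds per orientation; the trade-off accepted here is exactly that loss of orientation selectivity in exchange for speed.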


Chapter 7
Conclusion
7.1 Conclusions
7.1.1 Visual attention model
• A visual attention model, developed in C++ and based on the ’saliency map’ (three feature tracks (intensity, color and orientation) contribute to a saliency map which depicts the
salience at every location in the field of view in a scalar quantity [35]), has been proven to
give similar results to its more complex original. Images (random objects, soccer field) fed
through both algorithms gave identical output; the most salient locations were equal
and chosen in a roughly identical order. Applied to an eye-head setup to direct visual attention in ’real time’, however, it gives unsatisfactory results. This already simplified model is still
computationally too heavy to run at an appropriate frame rate. This is mainly due to the
’Gabor’ filter, which is used to detect highly salient orientations in an image.
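The map combination behind this conclusion can be illustrated with a minimal sketch: normalised feature maps are weighted, summed into one saliency map, and the winner-take-all location is the argmax. The min-max normalisation and scalar weights here are simplifying assumptions, not the full normalisation operator of [35].

```python
import numpy as np

def saliency(feature_maps, weights):
    """Weighted sum of min-max normalised feature maps into one saliency map."""
    s = np.zeros_like(feature_maps[0])
    for fmap, w in zip(feature_maps, weights):
        span = fmap.max() - fmap.min()
        norm = (fmap - fmap.min()) / span if span > 0 else np.zeros_like(fmap)
        s += w * norm
    return s

def most_salient(s):
    """Winner-take-all: coordinates of the highest saliency value."""
    return np.unravel_index(np.argmax(s), s.shape)
```

Changing the weight vector is also how top-down bias enters: raising the weight of one channel makes its peaks win the argmax.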

7.1.2 Visual tracking model
• A particle filter was chosen as observer for visual tracking. This gave the best results, though
not all possible methods (extended and unscented Kalman filters) were tested. A disadvantage of
the particle filter, however, is that its chattering output has to be smoothed in order to be
useful. This smoothing was covered by a Kalman filter.
• For a higher frame rate, pyramidally scaling down the images was chosen over ROI-ing
(selection of a region of interest). Although ROI-ing turned out to be faster, a larger
field of view as input is preferred over the loss of information.
• A simple ball detecting method (pixel based color extraction with circular Hough) with a
particle filter (observer) and Kalman filter (smoother) are the necessary means to perform
robust and reliable visual tracking control. From a ’YUV’ color input image stream (Y for
luminance, U and V for chrominance, at 80 [Hz]), coordinates of the ball are extracted
and converted into a coordinate range for the human-like eye-head setup (eyes and neck).
Before a control signal is sent to the motors, a rough estimate of the velocity of the ball
is computed, which differentiates between smooth pursuit and a saccade. Position control
is used for smooth pursuit (normal tracking) while jumps in velocity mimic a saccade, to
keep up when the ball moves too fast for smooth pursuit. These two control methods are the
basis of human eye control and are applied to the neck of the eye-head setup.
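The pursuit/saccade switch described in this bullet reduces to a threshold rule on the velocity estimate. A minimal sketch, where the proportional gain `kp` and the threshold `v_threshold` are illustrative values, not the tuned parameters of the setup:

```python
def select_eye_action(error_px, est_velocity, kp=0.05, v_threshold=120.0):
    """Pick smooth pursuit or a saccade from a rough ball-velocity estimate.

    error_px: pixel offset of the ball from the image centre.
    est_velocity: rough ball velocity estimate in pixels per second.
    kp and v_threshold are hypothetical values for illustration.
    """
    if abs(est_velocity) > v_threshold:
        # Ball moves too fast for pursuit: command a velocity jump (saccade)
        return "saccade", est_velocity
    # Normal tracking: proportional position control (smooth pursuit)
    return "pursuit", kp * error_px
```

After a saccade the same rule falls back to the pursuit branch, which recentres the target in the FOV, mirroring the corrective-saccade behaviour described in the text.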

7.1.3 Search and Track
• The main difference between human and robot vision and control is that robot vision and
control can be much faster while the complexity of the human vision system cannot be
matched. A solution for a search and track algorithm could be the simplified saliency map
model, with the ’Gabor’ filter replaced by a ’Sobel’ operator (gradient of image intensity),
and a simple tracker (particle filter as observer and Kalman filter for smoothing), as shown
in this thesis. The saliency map also incorporates other objects in the FOV, but can keep
its focus on the main target. Extra maps or feature map weights differentiate between the
attended locations. A model was created (with an extra ’YUV’ color map added to the feature
maps) which detects an orange ball as most salient and tracks it at 20 [Hz].

7.2 Recommendations
Since the mechanisms of human vision are only now beginning to be understood, more and more
aspects become important as they also connect to active human vision. For future research and
the application towards humanoid robot soccer, some recommendations can be made and future
topics are addressed.
• The setup and algorithms were only tested on a fixed eye-head platform. To take into
account the behavior of the setup and algorithms under disturbance motions (for instance,
caused by the walking motion of a humanoid robot), experiments should be carried out on a
movable frame.
• A subsequent version of the eye setup, with miniature cameras in the eyes (thus having
stereo vision), should be used as a next active vision setup to incorporate tracking and the
visual attention model. This could give rise to a more complex image processing architecture.
• The present attention model tracks the most salient object in the FOV. An extension to this
could be a mechanism which identifies found objects and classifies them by salience (for robot soccer).
• The visual attention model is a bottom-up method for directing the focus of attention.
Top-down importance mechanisms should be added to direct the attention in a more task-specific way. This can be done by adding extra feature maps (for instance, a ’YUV’ map to detect a
bright orange ball) or by giving more weight to certain feature maps.
• Target selection precedes a saccadic movement but its occurrence leads to enhanced processing at and around the selected location. This phenomenon is referred to as peripheral
preview or hyper-acutance. A similar mechanism could be implemented for a more efficient algorithm.
From these conclusions and recommendations it might be clear that an imitation of a human
attention selection and tracking model is fairly easy to implement. Although the very basics of
the human visual system (detection and tracking) can be simulated, the fundamental phenomena of human vision are not yet understood. As these processes become clearer,
the boundaries of active visual behavior will also shift. Since human psychology involves interaction with a physical and social world, a rich model of vision will also need to include social and
emotional factors in ways that are only just starting to be considered and understood. Therefore,
the importance of emotional and social factors in affecting active visual behavior should not be
overlooked; they are clearly a topic for future studies.


Appendix A
The Discrete Time Kalman Filter
A discrete time system with process noise $w$ and measurement noise $v$ is defined by:

$x_k = A x_{k-1} + B u_k + w_k$   (A.1)

$z_k = H x_k + v_k$   (A.2)

The predictor equation is given by

$\hat{x}^-_k = A \hat{x}_{k-1} + B u_k$   (A.3)

The corrector equation is given by

$\hat{x}_k = \hat{x}^-_k + K_k (z_k - H \hat{x}^-_k)$   (A.4)

The a priori and a posteriori covariances are given by

$P^-_k = E\{e^-_k e^{-T}_k\} = E\{(x_k - \hat{x}^-_k)(x_k - \hat{x}^-_k)^T\}$   (A.5)

$P_k = E\{e_k e^T_k\} = E\{(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T\}$   (A.6)

The Kalman filter gain is given by

$K_k = P^-_k H^T (H P^-_k H^T + R)^{-1}$   (A.7)

The recursive form of the a priori covariance is given by

$P^-_k = A P_{k-1} A^T + Q$   (A.8)

The recursive calculation of the a posteriori covariance is given by

$P_k = (I - K_k H) P^-_k$   (A.9)

From [72].
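Equations (A.1)-(A.9) translate directly into one predict/correct cycle. A minimal NumPy sketch (the matrix names follow the appendix; the dimensions are left to the caller):

```python
import numpy as np

def kalman_step(x_hat, P, z, u, A, B, H, Q, R):
    """One predict/correct cycle of the discrete time Kalman filter."""
    # Predictor (a priori state estimate), eq. (A.3)
    x_pred = A @ x_hat + B @ u
    # A priori covariance, eq. (A.8)
    P_pred = A @ P @ A.T + Q
    # Kalman filter gain, eq. (A.7)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    # Corrector (a posteriori state estimate), eq. (A.4)
    x_new = x_pred + K @ (z - H @ x_pred)
    # A posteriori covariance, eq. (A.9)
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
    return x_new, P_new
```

Each call pulls the estimate toward the measurement by the gain $K_k$ and shrinks the covariance, which is exactly the smoothing role the filter plays behind the particle filter in this thesis.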


Appendix B
One particle filter step

Figure B.1: One step in the CONDENSATION algorithm (without resampling). Select: randomly select $N$ particles from $s^n_{k-1}$ based on weights $\pi^n_{k-1}$; a sample particle may be picked multiple times (factored sampling). Predict: move the particles according to the deterministic dynamics of the motion model (drift), then perturb them individually (diffuse). Measure: obtain a likelihood for each new sample by comparing it with an observation, i.e. based on $p(z_k \mid x_k)$, then update the weights accordingly to obtain $(s^n_k, \pi^n_k)$ [34].
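The select/predict/measure cycle of figure B.1 can be sketched as one function. The `drift` and `likelihood` arguments are caller-supplied stand-ins for the motion and observation models of [34], and the additive Gaussian diffusion is an assumption for illustration:

```python
import numpy as np

def condensation_step(particles, weights, z, drift, diffuse_std, likelihood, rng):
    """One select/predict/measure cycle of the CONDENSATION algorithm."""
    n = len(particles)
    # Select: factored sampling -- draw n particles with replacement by weight;
    # a particle may be picked multiple times
    particles = particles[rng.choice(n, size=n, p=weights)]
    # Predict: deterministic drift of the motion model, then individual diffusion
    particles = drift(particles) + rng.normal(0.0, diffuse_std, size=particles.shape)
    # Measure: weight each particle by the observation likelihood p(z_k | x_k)
    weights = likelihood(z, particles)
    return particles, weights / weights.sum()
```

The chattering this thesis smooths with a Kalman filter comes from the Select step: resampling with replacement makes the weighted mean jitter from frame to frame.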


Bibliography
[1] S. Ahmed. VISIT: A neural model of covert attention. Advances in Neural Information
Processing Systems, vol. 4, pp. 420-427, 1991. 34
[2] C. H. Andersen and D. C. Van Essen. Shifter circuits: a computational strategy for dynamic
aspects of visual processing. Proc. Natl. Acad. Sci., USA, 84, pp. 6297-6301, 1987. 33
[3] M. S. Arulampalam, S. Maskell, N. Gordon and T. Clapp. A Tutorial on Particle Filters for
Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Trans. on Signal Processing, Vol.
50, No. 2, 2002. 41, 54
[4] S. Baluja and D. Pomerleau. Dynamic relevance: vision-based focus of attention using
artificial neural networks. Artificial Intelligence, 97:381-395, 1997. 38
[5] G. Bedfer and J-F. Vibert. Image preprocessing in simulated biological retina. Proc. Ann.
Conf. IEEE, Eng. in Medicine and Biology Society, vol. 14, pp. 1570-1571, 1992. 16
[6] I. Biederman and E. A. Vessel. A novel theory explains why the brain craves information
and seeks it through the senses. American Scientist Magazine, Perceptual pleasure and the
brain, May-June, 2006. 25
[7] J. Y. Bouguet. Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description
of the algorithm. Intel Corporation, Microprocessor Research Labs, 2002. 41, 48
[8] Encyclopaedia Britannica - the online Encyclopedia. http://www.britannica.com/ 19, 25,
26
[9] V. Bruce, P. R. Green and M. A. Georgeson. Visual Perception: Physiology, Psychology, and
Ecology. Psychology Press, fourth Edition, 2003. 21, 22, 40
[10] M. Buchberger. Biomechanical Modelling of the Human Eye. Phd Thesis, Johannes Kepler
Universität Linz, 2004. 23
[11] P. J. Burt. Attention mechanisms for vision in a dynamic world. Proceedings Ninth International Conference on Pattern Recognition, Beijing, China, pp. 977-987, 1988. 37
[12] V. Cantoni. Human and Machine Vision. Analogies and divergencies. Proceedings of the
Third International Workshop on Perception, Pavia, Italy, 1993.
[13] K. Cave. The feature gate model of visual selection. Psychological Research, 62:182-194,
1999. 34
[14] Z. Chen. Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond.
Manuscript, McMaster University, Canada, 2003. 13, 49, 50, 52
[15] P. Churchland and T. J. Sejnowski. The Computational Brain. Cambridge, MA: MIT Press,
1992. 13, 39, 40
[16] R. N. Clark. http://www.clarkvision.com/imagedetail/eye-resolution.html 16, 17, 40
[17] J. J. Clark and N. J. Ferrier. Attentive Visual Servoing. Active Vision, Blake and Yuille, (MIT
Press), pp. 137-154, 1992. 13, 23, 24
[18] V. Conception and H. Wechsler. Detection and localization of objects in time-varying
imagery using attention, representation and memory pyramids. Pattern Recognition,
29(9):1543-1557, 1996. 37
[19] X. Desurmont, C. Machy, C. Mancas-Thillou, D. Severin and J.-F. Delaigle. Effects of Parameters Variations in Particle Filter Tracking. IEEE Int. Conf. on Image Processing, pp.
2789-2792, Mons, 2006. 39, 41, 42, 52
[20] A. Doucet, S. Godsill and C. Andrieu. On Sequential Monte Carlo sampling methods for
Bayesian filtering. Statistics and Computing, 10, pp. 197-208, 2000.
[21] E-sunbear, human vision resource website. http://www.e-sunbear.com 23
[22] J. D. Enderle. The Fast Eye Movement Control System. The Biomedical Engineering Handbook: Second Edition, Boca Raton: CRC Press LLC, 2000 18
[23] S. Engel, X. Zhang and B. Wandell. Colour Tuning in Human Visual Cortex Measured with
Functional Magnetic Resonance Imaging. Nature, vol. 388, no. 6,637, pp. 68-71, 1997. 36
[24] J. M. Findlay and I. D. Gilchrist. The psychology of looking and seeing. Active Vision,
Oxford University Press, 2003. 13, 81
[25] R. C. Gonzalez and R. E. Woods. Digital Image Processing. 3rd edition, Pearsson/Prentice
Hall, 2008. 31
[26] S. Grossberg. A neural theory of attentive visual search: interactions of boundary, surface,
spatial and object representations. Psychological Review, 101, pp. 470-489, 1994. 34, 38
[27] B. ter Haar Romeny. Front-End Vision and Multi-Scale Image Analysis: Multi-Scale Computer Vision Theory and Applications, written in Mathematica. Computational Imaging
and Vision, Vol. 27, ISBN: 978-1-4020-1503-8, 2003. 15, 16, 30, 31
[28] D. Heinke and G. W. Humphreys. Attention, spatial representation and visual neglect:
Simulating emergent attention and spatial memory in the Selective Attention for Identification Model (SAIM). Psychological Review, Behavioural and Brain Science Centre, School
of Psychology, University of Birmingham. 33
[29] G. W. Humphreys. SEarch via Recursive Rejection (SERR): A connectionist model of visual
search. Cognitive Psychology, 25, pp. 43-110, 1993. 33
[30] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual
attention. Elsevier, Vision Research, Comp. and Neural Systems Program, 40 (2000), pp.
1489-1506, 1999. 21, 35
[31] S. J. Julier and J. K. Uhlmann. A New Extension of the Kalman Filter to Nonlinear Systems.
Technical Report, The Robotics Research Group, Dep. of Engineering Science, University of
Oxford. 50, 56
[32] C. Koch and S. L. Ulman. Shifts in the selective visual attention: Towards the underlying
neural circuitry. Human Neurobiology, 5:219-227, 1985. 34, 35
[33] H. Kolb, E. Fernandez, R. Nelson. Webvision; The Organization of the Retina and Visual
System. http://webvision.med.utah.edu/index.html, John Moran Eye Center, University of
Utah, 2005. 16, 17
[34] M. A. Isard. Visual Motion Analysis by Probabilistic Propagation of Conditional Density.
Phd thesis, Robotics research Group, Dep. of Engineering Science, University of Oxford,
1998. 41, 54, 55, 89
[35] L. Itti, C. Koch and E. Niebur. A model of saliency-based visual attention for rapid scene
analysis. IEEE transactions on pattern analysis and machine intelligence, vol. 20, no. 11,
November 1998. 14, 31, 35, 36, 37, 45, 63, 81, 83
[36] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual
attention. Elsevier, Vision Research, vol. 40, pp. 1489-1506, 2000. 21, 35
[37] L. Itti and C. Koch. Computational modeling of visual attention. Neuroscience, Nature
reviews, volume 2, March 2001. 13, 35
[38] E. R. Kandel and J. H. Schwartz. Principles of Neural Science. 1981. Edward Amold, London.
24
[39] A. G. Leventhal. The Neural Basis of Visual Functions: Vision and Visual Dysfunction. vol. 4.
Boca Raton, Fla.: CRC Press, 1991. 36
[40] E. J. A. Manders. Design of a Human-like Robotic Eye Unit, to be used for Active, Stereo
Vision. Master’s thesis, confidential report, rep. no. DCT 2005.127, Eindhoven University of
Technology, 2005. 18, 41
[41] M. C. Mozer. The Perception of Multiple Objects. MIT Press, Cambridge, MA, 1991. 34, 38
[42] D. W. Murray, F. Du, P. F. McLauchlan et al. Design of stereo heads. Active vision, ISBN 0-262-02351-2, MIT Press, pp. 155-172, Cambridge, 1993. 81
[43] B. A. Olshausen, C. H. Andersen and D. C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information.
Journal of Neuroscience, 13(11), pp. 4700-4719, 1993. 33
[44] Intel Corporation. Open Source Computer Vision Library, Reference Manual.
http://opencvlibrary.sourceforge.net, 2000. 13, 41, 56
[45] F. Patane, P. Dario, H. Miwa. Design and development of a biologically-inspired artificial
vestibular system for robot heads. Proc. of Int. Conf. on Intelligent Robots and Systems, Japan,
pp. 1317-1322, 2004. 27
[46] T. Petrie. Tracking using Kalman Filters and Condensation. Web report,
http://www.marcad.com/cs584/Tracking.html, University of Colorado, 2008. 50
[47] A. Petrignani. Real-time eye tracking using a smart vision sensor. Master’s Thesis, Delft
University of Technology, 2000. 18
[48] R. Phaf, A. Van der Heijden and P. T. W. Hudson. SLAM: A connectionist model for
attention in visual selection tasks. Cognitive Psychology, 22:273-341, 1990. 34
[49] E. O. Postma, H. J. van den Herik and P. T. W. Hudson. SCAN: A scalable model of
attention selection. Neural Networks, 10(6):993-1015, 1997. 38
[50] www.pycomall.com/images/P/01-12.jpg 19
[51] L. R. Rabiner. A tutorial on Hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77 (2), pp. 257-286, 1989. 42, 49
[52] D. A. Robinson. The mechanics of human smooth pursuit eye movements. Journal of
Physiology, 180:569-591, 1965 13, 18, 23, 41
[53] D. A. Robinson. The Oculomotor Control System: A Review. Proc. of the IEEE, 56(6): pp.
1032-1049, 1968. 23, 41
[54] D. A. Robinson. Why visuomotor systems don’t like negative feedback and how they avoid
it. Vision, Brain and Cooperative Computation, chapter 1, pp. 89-107, 1988. 18, 23, 41
[55] P. A. Sandon. Simulating visual attention. Journal of Cognitive Neuroscience, 2(3):213-231,
1990. 34
[56] J. Shi and C. Tomasi. Good Features to Track. IEEE Conf. on Comp. Vision and Pattern
Recognition, (CVPR94) Seattle, June 1994. 31, 48
[57] R. J. Solomon. As if you were there: Matching Machine Vision to Human Vision. The
Hybrid Vigor Journal, The Hybrid Vigor Institute, Human Perception, V1.3:04, 2002. 39
[58] Stereoscopy.com, The world of 3D-Imaging. http://www.stereoscopy.com/library/wheatstonepaper1838.html 19, 20
[59] Y. Sun. Hierarchical object-based visual attention for machine vision. Phd thesis, Institute
of Perception, Action and Behavior, School of Informatics, University of Edinburgh. 13, 18,
37
[60] H. Schmidt-Cornelius. Reverse engineering an active eye. Technical report, University of
Sussex, 2002 18, 22, 23, 25
[61] M. S. Sugathadasa, W. P. Dayawansa and C. F. Martin. Control of Pursuit Eye Movement.
Proc. of 39th. Conf. on Decision and Control, pp. 1793-1798, Sidney, Australia, 2000. 18,
41
[62] T. Suzuki and N. Hirai. Reaction times of head movements occurring in association with
express saccades during human gaze shifts. Neuroscience Letters, Dep. of Physiology, Kyorin
University School of Medicine, pp. 61-64, Japan, 1998. 41
[63] A. Takanashi, S. Ishimoto and T. Matsuno. Development of an Anthropomorphic Head-Eye
System for Robot and Human Communication. IEEE Int. Workshop on Robot and Human
Communication, pp. 77-82, Tokyo, Japan, 1995. 18
[64] K. Toyama. Incremental focus of attention for robust vision-based tracking. International
Journal of Computer Vision, 35(1):45-63, 1999. 37
[65] L. T. Thompson. Sensory Systems II. Aging and Memory Research Center, University of
Texas at Dallas, 2007. 17
[66] A. Treisman and G. Gelade. A feature integration theory of attention. Cognition Psychology,
12, pp. 97-136, 1980. 33
[67] A. Treisman. Perceptual grouping and attention in visual search for features and for objects.
Journal of Exp. Psychol: Hum. Percept. Perf.,8, pp. 194-214, 1982. 33
[68] J. K. Tsotsos. The selective tuning model for visual attention. Technical Report, Department
of Computer Science, and Centre for Vision Research, York University, Canada, 1995. 37
[69] R. Velmurugan. Implementation strategies for particle filter based target tracking. Phd
Thesis, Georgia Institute of Technology, 2007. 52, 54
[70] V. S. Vyas and P. Rege. Automated Texture Analysis with Gabor filter. Journal on Graphics,
Vision and Image Processing, Vol. 6-1, 2006. 36
[71] D. Walther and C. Koch. Modeling attention to salient proto-objects. Neural Networks, 19,
1395-1407, 2006. 63
[72] G. Welch and G. Bishop. An Introduction to the Kalman Filter. Technical Report TR 95-041,
Department of Computer Science, University of North Carolina, Chapel Hill, 2006. 50, 52,
55, 56, 88
[73] Wikipedia, The Free Encyclopedia. http://wikipedia.org 19, 21, 25, 27, 30, 32, 56
[74] F. P. Wilbers. Human-like stabilisation of a robot eye. MSc Thesis, Delft Biorobotics Laboratory, 3mE, TU Delft, 2008. 43, 45, 77
[75] J. M. Wolfe. Guided Search 2.0: A revised model of visual search. Psychonomic Bulletin and
Review, pp. 202-238, 1994. 33
[76] J. M. Wolfe. Guided Search 4.0: A guided search model that does not require memory for
rejected distractors. Journal of Vision, 1(3):349, 2001. 33
[77] H. K. Yuen, J. Princen, J. Illingworth and J. Kittler. Comparative study of Hough Transform
methods for circle finding. Image and Vision Computing, vol. 8, no. 1, pp. 71-77, 1990. 48
[78] S. Zeki. A Vision of the Brain. Oxford; Boston: Blackwell Scientific Publications, 1993. 39
