Published by: Senapathi Teja on Dec 15, 2011
Copyright:Attribution Non-commercial

Msc Dissertation

NAO project :
object recognition and localization
Author : Sylvain Cherrier
Msc : Robotics and Automation
Period : 04/04/2011 → 30/09/2011
Sylvain Cherrier – Msc Robotics and Automation
Acknowledgement
To carry out my project, I used many helpful documents which simplified my work. I would mention a website by Utkarsh Sinha (SIFT, mono-calibration and other topics), the publication by David G. Lowe, « Distinctive Image Features from Scale-Invariant Keypoints » (SIFT and feature matching), and the dissertation of a previous MSc student, Suresh Kumar Pupala, « 3D-Perception from Binocular Vision » (stereovision and other topics).
I would also like to thank Dr. Bassem Alachkar for working with me on the recognition. He guided me towards several feature extraction methods (SIFT and SURF) and showed me several publications which helped me to understand feature matching.
Then, I would like to thank my project supervisor, Dr. Samia Nefti-Meziani, who allowed me to use the laboratory and in particular the NAO robot and the stereovision system.
Finally, I would like to thank all the researchers and students of the laboratory for their kindness and their help.
Table of contents
1 Introduction
2 Subject: object recognition and localization on NAO robot
2.1 Mission and goals
2.2 Recognition
2.3 Localization
2.4 NAO control
3 Implementation and testing
3.1 Languages, softwares and libraries
3.2 Tools and methods
3.3 Steps of the project
4 Conclusion
5 References
6 Appendices
6.1 SIFT: a method of feature extraction
6.2 SURF: another method adapted for real time applications
6.3 Mono-calibration
6.4 Stereo-calibration
1 Introduction
Video cameras are interesting devices to work with because they can transmit a lot of data. For human beings and animals, image data inform them about the objects they meet within their environment: a goal for the future is to understand how we recognize an object. First we learn about it, and afterwards we are able to recognize it within a different environment.
The strength of the brain is its ability to recognize an object among different classes of objects: we call this ability clustering. But before anything else, the brain is able to extract from an object the interesting features that make it belong to a class.
Thus, the difficulty of implementing a recognition ability on a robot has two levels:
– What features should be extracted from an object?
– How can a specific object be identified among many others?
The features have to be well chosen before an object can be identified, so the first question has to be answered before the second one.
Besides recognition, we often need the location and orientation of the object in order to make the robot interact with it (control). Thus, the vision system must also enable object localization, as it gives a more complete description of the object seen.
In this report, I will first present the university and where I worked, before describing the context and goals of my project.
Afterwards, I will focus on the three subjects I had to study (object recognition, object localization and NAO control) before presenting how I implemented and managed my project.
2 Subject: object recognition and localization on NAO robot
2.1 Mission and goals
2.1.1. Context
In real life, the environment can change easily and our brain adapts itself easily to match objects whatever the conditions.
In robotics, recognition problems are often tackled within a specific context; I can mention three examples:
➢ For industrial robots, we may have to make robots recognize circular objects on a moving belt: the features to extract are well defined and we can implement an effective program that matches only circular objects. In this first example, the program is designed to work when the background is the moving belt. Thus, a first problem is the background: in real applications, it can vary a lot!
➢ When we want to improve a CCTV camera to recognize vehicles, we can use the thickness of their edges to distinguish them from people walking in the street. We can also use the car blobs to match a specific size, or the optical flow to match a specific speed. The problem here can be occlusion: indeed, the car blob could be invalid if a tree hides the vehicle, and the vehicle would then not be identified.
➢ Using the HSV color space, we can recognize single-coloured objects under constant illumination. Even if the use of blobs could improve the detection, a big problem is illumination variation.
Thus, from these examples we can underline three types of variation:
➢ background
➢ occlusion
➢ illumination
But there are many others, such as all the transformations in space:
➢ 3D translation (on the image: translation (X, Y) and scale (1/Z))
➢ 3D rotation
2.1.2. State of the art
Many people have worked on finding invariant features to recognize an object whatever the conditions.
An interesting line of study is to work on invariant keypoints:
➢ The Scale-Invariant Feature Transform (SIFT) was published in 1999 by David G. Lowe, a researcher from the University of British Columbia. This algorithm detects local keypoints within an image; these features are invariant to scale, rotation and illumination. Using keypoints instead of keyregions makes it possible to deal with occlusion.
➢ Speeded Up Robust Features (SURF) was presented in 2006 by Herbert Bay et al., researchers from ETH Zurich. This algorithm is partly inspired by SIFT, but it is several times faster than SIFT and is claimed by its authors to be more robust against different image transformations.
2.1.3. My mission
My mission was to recognize an object in difficult conditions (occlusion, background, different rotations, scales and positions). Moreover, the goal was also to apply it to a robotic purpose: the control of the robot needs to be linked to the recognition. Thus, I had to study localization, which sits between recognition and control, in order to move towards robotic applications.
In the following, I will say more about SIFT and SURF and how to use them to match an object.
Afterwards, I will focus on localization, as it gives a complete description of the object in space: it enables more interaction between the robot and the object.
As I worked on the NAO robot, I will describe its hardware and software and the ways to program it.
Then, in a final part, I will link all the previous parts (recognition, localization and robot control) in order to describe my implementation and testing. I will also describe the way I managed my project.
2.2 Recognition
2.2.1. Introduction
In pattern recognition, there are two main steps before a judgement can be made about an object:
– Firstly, we have to analyze the image to find the features (for example keypoints or keyregions) likely to be interesting to work with. In the following, I will call this step feature extraction.
– Secondly, we have to compare these features with those stored in the database. I will call this step feature matching.
Concerning the database, there are two ways to create it:
– either it is created manually by the user and stays fixed (the user teaches the robot)
– or it is updated automatically by the robot (the robot learns by itself)
2.2.2. SIFT and SURF: two methods of invariant feature extraction
SIFT and SURF are two invariant-feature extraction methods which are divided into two main steps:
– Keypoint extraction: keypoints are found at specific locations on the image and an orientation is assigned to them. They are selected depending on their stability with scale and position: the invariance to scale is obtained at this step.
– Descriptor generation: a descriptor is generated for each keypoint in order to identify it accurately. In order to be flexible to transformations, it has to be invariant: the invariances to rotation, brightness and contrast are obtained at this step.
During the keypoint extraction, each method has to build a scale space in order to estimate the Laplacian of Gaussian. The Laplacian detects pixels with rapid intensity changes, but it has to be applied to the image smoothed by a Gaussian in order to obtain the invariance to scale. Moreover, the Gaussian makes the Laplacian less noisy. From the Laplacian of Gaussian applied to the image, the keypoints are found at the extrema of the function, as the intensity changes very quickly around their location. Then, the second step is to calculate their orientation using gradients from their neighborhood.
Concerning the descriptor generation, we first need to rotate the axes around each keypoint by its orientation in order to obtain invariance to rotation. Afterwards, as for the calculation of the keypoint orientation, it is necessary to calculate the gradients within a specific neighborhood (or window). This window is then divided into several subwindows in order to build a vector containing several gradient parameters. Finally, a threshold is applied to this vector and it is normalized to unit length in order to make it invariant to brightness and contrast respectively.
This vector has a specific number of values depending on the method: 128 values with SIFT and 64 with SURF.
The fact that SURF works on integral images and uses simplified masks reduces the computation and makes the algorithm quicker to execute. For these reasons, SURF is better suited to real-time applications than SIFT, even if SIFT can be very robust.
For more details about the SIFT and SURF feature extraction methods, you can refer to the appendices, where I describe each of them further.
The diagram below sums up the main common steps used by SIFT and SURF.
[Diagram: main steps common to SIFT and SURF.
Keypoint extraction: build a scale space (invariance to scale) → estimate the Laplacian of Gaussian → keypoint location → calculate the gradients around the keypoints within a window W1 → estimate the orientations → keypoint orientation.
Descriptor generation: rotate the keypoint axes by the orientation of the keypoints (invariance to rotation) → calculate the gradients around the keypoints within a window W2 → divide the window W2 into several subwindows → calculate the gradient parameters for each subwindow → threshold and normalize the vector (invariance to brightness and contrast) → descriptor.]
2.2.3. Feature matching
We have now seen two methods which locate and orientate keypoints and assign them an invariant descriptor. Feature matching consists in recognizing a model within an acquired image.
The goal of the matching is to compare two lists of keypoints and descriptors in order to:
– identify the presence of an object
– know the 2D position, 2D size, 2D orientation and scale of the object on the image
A big problem of recognition is to generalize it to 3D objects: a 3D object is complicated to capture with a camera, and its 3D position, 3D size and 3D orientation are difficult to estimate with only one camera.
Thus, we will see how matching can be done using a single model that we want to recognize within the acquired image. It can be divided into three main steps:
– compare the descriptor vectors (descriptor matching)
– compare the features of the matched keypoints (keypoint matching)
– calculate the 2D features of the model within the acquired image (calculation of 2D model features)
[Diagram: feature matching. Feature extraction (SIFT or SURF) is applied to the acquired image, giving kpList_acq and descList_acq, and to the model image, giving kpList_model and descList_model. These lists feed the feature matching, which consists of descriptor matching, keypoint matching and calculation of 2D model features.]
The diagram above shows the principle of the feature matching : features are extracted from the
image acquired and from the image model in order to match their descriptors and keypoints. The
goal is to obtain the 2D features of the model within the acquisition at the end of the matching.
2.2.3.1 descriptor matching
2.2.3.1.1 Compare distances
The descriptors we have previously calculated using SIFT or SURF allow us to have data invariant in
scale, rotation, brightness and contrast.
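Firstly, in order to compare descriptors, we can focus on their distance or on their angle. If $desc_{\mathrm{acq}} = (desc_{\mathrm{acq}}(1), desc_{\mathrm{acq}}(2), \dots, desc_{\mathrm{acq}}(size_{\mathrm{desc}}))$ is a descriptor from the acquired image and $desc_{\mathrm{model}} = (desc_{\mathrm{model}}(1), desc_{\mathrm{model}}(2), \dots, desc_{\mathrm{model}}(size_{\mathrm{desc}}))$ is one from the model:
$$d_{\mathrm{euclidean}} = \sqrt{(desc_{\mathrm{acq}}(1) - desc_{\mathrm{model}}(1))^2 + \dots + (desc_{\mathrm{acq}}(size_{\mathrm{desc}}) - desc_{\mathrm{model}}(size_{\mathrm{desc}}))^2}$$
$$\theta = \arccos\left(desc_{\mathrm{acq}}(1)\, desc_{\mathrm{model}}(1) + \dots + desc_{\mathrm{acq}}(size_{\mathrm{desc}})\, desc_{\mathrm{model}}(size_{\mathrm{desc}})\right)$$
The expression of θ relies on the descriptors being normalized to unit length, as described for the descriptor generation.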
Thus, there are two basic solutions to compare descriptors: using the Euclidean distance or the angle θ (in the following, I also refer to the angle θ as a distance). There are other types of distance that I do not mention here for the sake of simplicity.
We could calculate the distances between one descriptor from the acquired image and each one from the model image: the smallest distance could be compared to a specific threshold.
But how do we find the threshold in this case? Indeed, each descriptor could have its own valid threshold. Thus, the threshold has to be relative to a distance.
David G. Lowe had the idea of comparing the smallest distance with the next smallest distance. He showed that by rejecting matches with a distance ratio ($ratio_{\mathrm{distance}} = distance_{\mathrm{smallest}} / distance_{\mathrm{next\text{-}smallest}}$) greater than 0.8, we eliminate 90 % of the false matches while discarding less than 5 % of the correct ones.
The figure below represents the Probability Density Functions (PDF) of the ratio of distances (closest/next closest). As you can see, removing matches with a ratio above 0.8 keeps most of the correct matches and rejects most of the false ones. We can also keep only the matches with a ratio under 0.6 in order to focus on correct matches only (but we lose data).
This method quickly eliminates matches from the background of the acquisition which are completely incompatible with the model. At the end of the procedure, we have a number of keypoints of interest from the acquired image that are supposed to match the model, but we need a method more precise than distances.
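The distance-ratio test described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation used in the project: the function name `ratio_test_matches`, the parameter `max_ratio` and the toy 4-value descriptors are assumptions made for the example (real SIFT and SURF descriptors have 128 and 64 values). It uses the Euclidean distance and needs at least two model descriptors.

```python
import numpy as np

def ratio_test_matches(desc_acq, desc_model, max_ratio=0.8):
    """Return (acq_index, model_index) pairs that pass the distance-ratio test."""
    matches = []
    for i, d in enumerate(desc_acq):
        # Euclidean distance from this acquired descriptor to every model descriptor
        dists = np.linalg.norm(desc_model - d, axis=1)
        order = np.argsort(dists)
        smallest, next_smallest = dists[order[0]], dists[order[1]]
        # Keep the match only if the best distance is clearly below the runner-up
        if smallest < max_ratio * next_smallest:
            matches.append((i, int(order[0])))
    return matches

# Toy descriptors (hypothetical values, for illustration only)
desc_model = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0]])
desc_acq = np.array([[0.9, 0.1, 0.0, 0.0],   # close to model descriptor 0
                     [0.5, 0.5, 0.0, 0.0]])  # ambiguous between 0 and 1
matches = ratio_test_matches(desc_acq, desc_model)
print(matches)
```

The second acquired descriptor is equally far from two model descriptors, so its ratio is 1 and it is rejected, which is the behaviour expected for ambiguous or background keypoints.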
2.2.3.1.2 k-d tree
The goal now is to match the nearest neighbors more accurately. To do that, we can use the k-d tree method.
A k-d tree uses two lists of vectors of at least k dimensions (as it focuses only on the k dimensions specified):
– the reference (N vectors), which refers to the vectors used to build the tree
– the model (M vectors), which refers to the vectors for which a nearest neighbor is sought
Thus, when the k-d tree has finished querying the tree for each vector from the model, each one of them has a nearest reference vector: we have M matches at the end.
In order to avoid several model vectors matching the same reference vector, we need at least:
$N \geq M$
Then, depending on the result of the previous operation (compare distances), we need to assign the model and acquisition descriptor lists correctly, depending on the number of descriptors to match:
– If there are more descriptors in the acquisition than in the model, the acquisition descriptors are the reference of the k-d tree and the model descriptors are its model.
– If there are more descriptors in the model than in the acquisition, it is the opposite.
In our case, if we want accuracy, the dimension of the tree has to be the dimension of the corresponding descriptors (128 for SIFT and 64 for SURF). The drawback is that the more dimensions we have, the more computation there is and the slower the process is.
I do not give more details about the k-d tree, as I used it as a tool.
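To make the roles of the two lists concrete, here is a small pure-Python sketch of a k-d tree with an exact nearest-neighbor query. It is only an illustration under assumptions: the function names `build_kdtree` and `nearest`, the dictionary-based nodes and the random 4-dimensional vectors are hypothetical, and the k-d tree tool actually used in the project is not specified in the text. The reference vectors build the tree and the vectors called `queries` play the role of the k-d tree "model" list.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (index, vector) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][1])           # cycle through the dimensions
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2                     # median point becomes the node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, index) of the reference vector nearest to `query`."""
    if node is None:
        return best
    index, vector = node["point"]
    dist = math.dist(vector, query)
    if best is None or dist < best[0]:
        best = (dist, index)
    diff = query[node["axis"]] - vector[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the other side only if the splitting plane is closer than the best match
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

random.seed(0)
reference = [[random.random() for _ in range(4)] for _ in range(50)]  # N = 50 vectors
queries = [[random.random() for _ in range(4)] for _ in range(10)]    # M = 10 vectors
tree = build_kdtree(list(enumerate(reference)))
matches = [nearest(tree, q)[1] for q in queries]   # one reference index per query
```

With high-dimensional descriptors the pruning test succeeds less often, which is one way to see why more dimensions mean more computation.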
2.2.3.2 Keypoint matching
Once the nearest neighbors are known, the descriptors are no longer useful.
Now, we have to check whether the matches are coherent at the level of the geometry of the keypoints.
Thus, we will now use the data relative to the keypoints (2D position, scale, magnitude and orientation) in order to check whether there are still some false matches.
To do this, we can use a Generalized Hough Transform (GHT), of which there are two types:
– GHT invariant to 2D translation (classical GHT)
– GHT invariant to 2D translation, scale and rotation (invariant GHT)
The classical GHT uses only the 2D translation of the keypoints as reference. It has the advantage of already giving an idea of the scale and orientation of the model within the acquired image, but it requires more computation as it is necessary to go through several possible scales and 2D rotations. Moreover, it is necessary to know a specific range of scales and 2D rotations to work with, and at what sampling.
The invariant GHT uses both the 2D translation and the orientation of the keypoints as references. Its main advantage is to be invariant to all the 2D transformations of the object, but it gives no information about the scale and 2D rotation and, as it uses orientations, it is sensitive to their accuracy.
In the following, we will study both the classical Generalized Hough Transform and the invariant one.
2.2.3.2.1 Classical GHT
The Generalized Hough Transform relies mainly on keypoint coordinates defined relative to a reference point. Indeed, using the same reference point for all the keypoints gives data invariant to 2D translation.
As the goal of the GHT here is to check whether the keypoints are correct, it takes as input the keypoints matched between the acquired image and the model.
➢ Build the Rtable of the model
Before anything else, the model needs a corresponding Rtable which describes, for each of its keypoints, the displacement γ from the reference point:
$$\gamma(1, 0) = \begin{pmatrix} x_{\mathrm{model}} - x_r \\ y_{\mathrm{model}} - y_r \end{pmatrix}$$
where $x_{\mathrm{model}}$ and $y_{\mathrm{model}}$ are the coordinates of the keypoint from the model and $x_r$ and $y_r$ are those of the reference point. The two parameters of the displacement γ refer respectively to the scale (s = 1) and rotation (θ = 0 rad), as the model is taken as the base.
The reference point can simply be calculated as the mean position of all the keypoints:
$$x_r = \frac{\sum x_{\mathrm{model}}}{nbKp_{\mathrm{model}}} \qquad y_r = \frac{\sum y_{\mathrm{model}}}{nbKp_{\mathrm{model}}}$$
where $nbKp_{\mathrm{model}}$ refers to the number of keypoints within the model.
Thus, the Rtable for each model is structured as below:
$$T = \begin{pmatrix}
x_{\mathrm{model}}(1) - x_r & y_{\mathrm{model}}(1) - y_r \\
x_{\mathrm{model}}(2) - x_r & y_{\mathrm{model}}(2) - y_r \\
\vdots & \vdots \\
x_{\mathrm{model}}(nbKp_{\mathrm{model}}) - x_r & y_{\mathrm{model}}(nbKp_{\mathrm{model}}) - y_r
\end{pmatrix}$$
➢ Check the features of the acquisition
The second step is to estimate the position of the reference point within the acquired image and to check which keypoints give a good result.
As we already have a match between the keypoints from the acquisition and those from the model, we can directly use the Rtable in order to make each keypoint vote for a specific position of the reference point.
Indeed, if we consider a transformation without any change in scale or rotation (s = 1, θ = 0), we have:
$$x_r = x_{\mathrm{acq}} - Rtable(numKp_{\mathrm{model}}, 1) \qquad y_r = y_{\mathrm{acq}} - Rtable(numKp_{\mathrm{model}}, 2)$$
where $x_{\mathrm{acq}}$ and $y_{\mathrm{acq}}$ refer to the position of the keypoint from the acquisition and $numKp_{\mathrm{model}}$ refers to the corresponding keypoint within the model matched by the acquisition.
With a scale s and a rotation θ, we have to modify the value of the displacement given by the Rtable:
$$\gamma(s, \theta) = s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \gamma(1, 0)$$
Thus, we have our new reference point for the acquisition:
$$x_r = x_{\mathrm{acq}} - s\left(\cos\theta \; Rtable(numKp_{\mathrm{model}}, 1) - \sin\theta \; Rtable(numKp_{\mathrm{model}}, 2)\right)$$
$$y_r = y_{\mathrm{acq}} - s\left(\sin\theta \; Rtable(numKp_{\mathrm{model}}, 1) + \cos\theta \; Rtable(numKp_{\mathrm{model}}, 2)\right)$$
Thus, at the end of this step, we have an accumulator of votes of the size of the acquired image.
In the image below, you can see an example of an accumulator: the maximum shows the most probable position of the reference point within the acquired image.
One problem is to know at what scale and rotation we have the best results: personally, I chose the scale and rotation giving the maximum number of votes, but there can be several maxima for different scales and rotations. An idea would be to detect where the votes are most grouped.
With scale and rotation added, another problem is that we add more calculation, and the difficulty here is to know what range to use and how many iterations (the more we increase the number of iterations, the more accurate it will be, but the more time it will take).
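The voting step of the classical GHT can be sketched as below for one candidate pair (scale, rotation). This is an illustrative sketch with NumPy, not the project code: the function names `build_rtable` and `vote_reference_point`, the rounding of votes to the nearest pixel and the toy keypoints are assumptions. In a complete search, the same function would be called for each sampled scale and rotation, keeping the pair that gives the highest peak.

```python
import numpy as np

def build_rtable(kp_model):
    """Displacement of each model keypoint from the reference point (mean position)."""
    ref = kp_model.mean(axis=0)
    return kp_model - ref, ref

def vote_reference_point(kp_acq, rtable, matches, s, theta, shape):
    """Accumulate votes for the reference point at a given scale s and rotation theta."""
    acc = np.zeros(shape, dtype=int)            # accumulator indexed as [y, x]
    c, si = np.cos(theta), np.sin(theta)
    for i_acq, i_model in matches:
        dx, dy = rtable[i_model]
        # Rotate and scale the model displacement, then subtract it from the keypoint
        x_r = kp_acq[i_acq, 0] - s * (c * dx - si * dy)
        y_r = kp_acq[i_acq, 1] - s * (si * dx + c * dy)
        xi, yi = int(round(x_r)), int(round(y_r))
        if 0 <= yi < shape[0] and 0 <= xi < shape[1]:
            acc[yi, xi] += 1
    return acc

# Toy model: four keypoints on a square (hypothetical values)
kp_model = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
rtable, ref_model = build_rtable(kp_model)

# Acquisition: model scaled by 2, rotated by 90 degrees, translated by (40, 30)
scale, angle, t = 2.0, np.pi / 2, np.array([40.0, 30.0])
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
kp_acq = np.vstack([scale * kp_model @ R.T + t, [[60.0, 60.0]]])  # last one is a false match
matches = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 1)]

acc = vote_reference_point(kp_acq, rtable, matches, scale, angle, shape=(80, 80))
peak = np.unravel_index(acc.argmax(), acc.shape)  # (y, x) of the most voted cell
print(acc.max(), peak)
```

The four correct matches vote for the same cell, while the false match votes elsewhere and can be rejected by keeping only the votes near the maximum.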
2.2.3.2.2 Invariant GHT
Contrary to the previous method, we will not use the displacement γ this time.
We will use a relation between the speed φ' of the keypoint and its acceleration φ''.
➢ Speed φ' and acceleration φ''
It is easier to define the speed and acceleration by focusing on a template rather than on keypoints.
The studied template has a reference point $(x_0, y_0)$ as origin.
The speed (at a specific angle θ of the template) is defined as being tangent to its border. The acceleration is defined as going through the border and the reference point $(x_0, y_0)$.
We can prove that between the angle of speed $\varphi'_a$ and the angle of acceleration $\varphi''_a$, there is a value k invariant to translation, rotation and scale:
$$k = \varphi'_a - \varphi''_a$$
In our case, we will use the orientation of the keypoint as the angle of speed $\varphi'_a$; the acceleration φ'' is defined by:
$$\varphi'' = \frac{y - y_0}{x - x_0} \quad \text{and} \quad \varphi''_a = \arctan(\varphi'')$$
➢ Build the Rtable of the model
As previously, we will build an Rtable for the model, but this time we will record the value of k and not the displacement. Thus, we will use both the position and the orientation of each keypoint to build the Rtable.
➢ Check the features of the acquisition
Using the keypoints within the acquisition and their corresponding matches, we obtain an associated value of k for each of them.
Then, knowing the orientation of the keypoints ($\varphi'_a$), we obtain the angle of acceleration $\varphi''_a$.
Contrary to the previous method, each keypoint will not vote for a precise position but for a line.
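The invariance of k, and the fact that each keypoint votes for a line, can be checked numerically with the sketch below. It is an illustration under assumptions: the helper names `wrap_pi`, `invariant_k` and `rot` and all numerical values are hypothetical, and `arctan2` with a reduction modulo π is used here instead of the arctangent of the slope, which gives the same line direction while avoiding a division by zero.

```python
import numpy as np

def wrap_pi(angle):
    """Wrap an angle to [0, pi): the direction of a line is defined modulo pi."""
    return angle % np.pi

def invariant_k(position, orientation, reference):
    """k = phi'_a - phi''_a for one keypoint, given the reference point."""
    dx, dy = position - reference
    phi_acc = np.arctan2(dy, dx)      # angle of the line through keypoint and reference
    return wrap_pi(orientation - phi_acc)

def rot(a):
    """2D rotation matrix of angle a."""
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

# Model keypoint (position, orientation) and reference point: hypothetical values
ref_model = np.array([5.0, 5.0])
p_model, ori_model = np.array([12.0, 7.0]), 0.3
k_model = invariant_k(p_model, ori_model, ref_model)   # value stored in the Rtable

# Same keypoint after a scale, a rotation and a translation of the whole object
s, alpha, t = 1.7, 0.9, np.array([40.0, -3.0])
p_acq = s * rot(alpha) @ p_model + t
ref_acq = s * rot(alpha) @ ref_model + t
ori_acq = ori_model + alpha
k_acq = invariant_k(p_acq, ori_acq, ref_acq)

# Voting: from the stored k and the observed orientation, recover the line direction
phi_acc = ori_acq - k_model
direction = np.array([np.cos(phi_acc), np.sin(phi_acc)])
offset = ref_acq - p_acq
cross = direction[0] * offset[1] - direction[1] * offset[0]  # 0 if the reference is on the line
```

Here `k_acq` equals `k_model` up to rounding, and `cross` is close to zero: the unknown reference point lies on the line voted by the keypoint, so the intersection of the lines from several keypoints locates it.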
2.2.3.2.3 Conclusion
Thus, at the end of both methods, we have to look for the maximum of votes in the accumulator or for the place where the votes are most grouped.
I chose to focus on the maximum as it is simpler.
In order to be more tolerant, I added a radius of tolerance around the maximum. The false matches, which fall outside the tolerance zone, are rejected: this ensures coherence at the geometric level.
The GHT reduces errors in the calculation of the 2D model features that we will see in the following.
2.2.3.3 Calculation of 2D model features
➢ Least Squares Solution
In order to calculate the 2D features, we have to use a Least Squares Solution.
Indeed, a point of the model ($p_{\mathrm{model}}$) is expressed in the acquired image ($p_{\mathrm{model/acq}}$) using a specific translation t, a scale s (scalar) and a rotation matrix R:
$$p_{\mathrm{model/acq}} = s\,R\,p_{\mathrm{model}} + t$$
where
$$R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \quad \text{and} \quad t = \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$
θ is the angle of rotation between the model and the model within the acquired image. We consider that the model itself has a scale s of 1 and an angle θ of 0 radians.
Thus, we apply the Least Squares in order to find two matrices:
$$M = sR = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} \quad \text{and} \quad t = \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$
The idea of the Least Squares is to rearrange the equation $p_{\mathrm{model/acq}} = s\,R\,p_{\mathrm{model}} + t$ in order to group the unknowns $m_{11}$, $m_{12}$, $m_{21}$, $m_{22}$, $t_x$ and $t_y$ within a vector:
$$p_{\mathrm{model/acq}} = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} p_{\mathrm{model}} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$
with
$$p_{\mathrm{model/acq}} = \begin{pmatrix} x_{\mathrm{model/acq}} \\ y_{\mathrm{model/acq}} \end{pmatrix} \quad \text{and} \quad p_{\mathrm{model}} = \begin{pmatrix} x_{\mathrm{model}} \\ y_{\mathrm{model}} \end{pmatrix}$$
Thus, we also have:
$$A\,mt = \begin{pmatrix} x_{\mathrm{model/acq}} \\ y_{\mathrm{model/acq}} \end{pmatrix}$$
with
$$A = \begin{pmatrix} x_{\mathrm{model}} & y_{\mathrm{model}} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{\mathrm{model}} & y_{\mathrm{model}} & 0 & 1 \end{pmatrix} \quad \text{and} \quad mt = \begin{pmatrix} m_{11} & m_{12} & m_{21} & m_{22} & t_x & t_y \end{pmatrix}^T$$
Thus, the vector mt can be determined using the inverse of the matrix A; if A is not square, we have to calculate a pseudo-inverse:
$$mt = \left((A^T A)^{-1} A^T\right) \begin{pmatrix} x_{\mathrm{model/acq}} \\ y_{\mathrm{model/acq}} \end{pmatrix}$$
As we have six unknowns, in order to solve this problem we require at least three positions $p_{\mathrm{model}}$ with the three corresponding positions within the acquired image $p_{\mathrm{model/acq}}$. But, in order to increase the accuracy of the Least Squares Solution, it is better to have more positions: all the matches can be considered (provided, of course, that the previous steps, descriptor matching and keypoint matching, are correct).
Thus, we have:
$$A\,mt = v_{\mathrm{model/acq}}$$
with
$$A = \begin{pmatrix}
x_{\mathrm{model}}(1) & y_{\mathrm{model}}(1) & 0 & 0 & 1 & 0 \\
0 & 0 & x_{\mathrm{model}}(1) & y_{\mathrm{model}}(1) & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_{\mathrm{model}}(nb_{\mathrm{matches}}) & y_{\mathrm{model}}(nb_{\mathrm{matches}}) & 0 & 0 & 1 & 0 \\
0 & 0 & x_{\mathrm{model}}(nb_{\mathrm{matches}}) & y_{\mathrm{model}}(nb_{\mathrm{matches}}) & 0 & 1
\end{pmatrix}$$
and
$$v_{\mathrm{model/acq}} = \begin{pmatrix}
x_{\mathrm{model/acq}}(1) \\
y_{\mathrm{model/acq}}(1) \\
\vdots \\
x_{\mathrm{model/acq}}(nb_{\mathrm{matches}}) \\
y_{\mathrm{model/acq}}(nb_{\mathrm{matches}})
\end{pmatrix}$$
$nb_{\mathrm{matches}}$ is the number of matches detected.
Thus, as previously, the solution is the following:
$$mt = \left((A^T A)^{-1} A^T\right) v_{\mathrm{model/acq}}$$
We now have the matrix M and the translation t, but we need the scale s and the angle θ of the transformation from the model to the acquisition.
To calculate them, we can simply rely on the properties of the rotation matrix R:
$$R R^T = I \quad \text{and} \quad R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
We conclude:
$$M M^T = s^2 I \quad \text{and} \quad s = \sqrt{(M M^T)(1,1)} = \sqrt{(M M^T)(2,2)} = \sqrt{\frac{(M M^T)(1,1) + (M M^T)(2,2)}{2}}$$
We deduce the rotation matrix R and the angle θ:
$$R = \frac{M}{s} \quad \text{and} \quad \theta = \arctan\left(\frac{R(2,1)}{R(1,1)}\right)$$
Then, at the end of this step, we have the translation t, the scale s and the angle θ describing the transformation of the model within the acquisition.
In the diagram below, the red points refer to examples of matches between acquisition and model: it is necessary to have at least three pairs of positions (acquisition-model) in order to calculate the features. You can see some measures that are useful to estimate: $C_{\mathrm{model}}$ and $C_{\mathrm{model/acq}}$ refer respectively to the center of rotation within the model and within the acquisition. The Least Squares Solution gives us only an estimation of the origin of the model within the acquired image, $O_{\mathrm{model/acq}}$, but the center of rotation is easily calculated using the size of the model ($sW_0$, $sH_0$), where $W_0$ and $H_0$ refer to the model image from which the keypoints were extracted.
[Diagram: on the model side, the model image of size W_0 × H_0 with origin O_model and center C_model; on the acquisition side, the acquired image with origin O_acq, in which the model appears with origin O_model/acq at translation (t_x, t_y), rotated by θ, with size sW_0 × sH_0 and center C_model/acq.]
➢ Reprojection error
In order to have an idea of the quality of the Least Squares Solution, it is necessary to calculate the reprojection error.
The reprojection error is obtained simply by calculating the new position vector of the model within the acquisition, $v'_{\mathrm{model/acq}}$, from the position vector of the model $v_{\mathrm{model}}$ and the matrix M and translation t previously calculated.
Thus, we have an error (the reprojection error $error_{\mathrm{repro}}$) defined from the distance between the calculated position vector $v'_{\mathrm{model/acq}}$ and the real one $v_{\mathrm{model/acq}}$:
$$d_{\mathrm{error}} = \lVert v'_{\mathrm{model/acq}} - v_{\mathrm{model/acq}} \rVert \quad \text{and} \quad error_{\mathrm{repro}} = \frac{d_{\mathrm{error}}}{\lVert v_{\mathrm{model/acq}} \rVert}$$
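The whole calculation of the 2D model features (Least Squares Solution, scale, angle and reprojection error) can be sketched as follows. This is a minimal sketch with NumPy under assumptions, not the project implementation: the function name `fit_similarity` and the toy points are hypothetical, `numpy.linalg.lstsq` is used in place of the explicit pseudo-inverse $(A^T A)^{-1} A^T$ (it solves the same least-squares problem), and `arctan2` replaces the arctangent of the ratio so that the angle is recovered over the full circle.

```python
import numpy as np

def fit_similarity(p_model, p_acq):
    """Least-squares estimate of M = sR and t such that p_acq ~ M p_model + t."""
    n = len(p_model)
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = p_model       # rows (x_model, y_model, 0, 0, 1, 0)
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = p_model       # rows (0, 0, x_model, y_model, 0, 1)
    A[1::2, 5] = 1.0
    v = p_acq.reshape(-1)        # (x(1), y(1), ..., x(n), y(n))
    mt, *_ = np.linalg.lstsq(A, v, rcond=None)
    M, t = mt[:4].reshape(2, 2), mt[4:]
    MMt = M @ M.T                # equals s^2 I for an exact similarity
    s = np.sqrt((MMt[0, 0] + MMt[1, 1]) / 2)
    R = M / s
    theta = np.arctan2(R[1, 0], R[0, 0])
    # Reprojection error: relative distance between reprojected and observed positions
    error_repro = np.linalg.norm(A @ mt - v) / np.linalg.norm(v)
    return s, theta, t, error_repro

# Toy matches (hypothetical values): model points and their images for s = 2, theta = 0.5 rad
p_model = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
true_s, true_theta, true_t = 2.0, 0.5, np.array([30.0, 20.0])
R_true = np.array([[np.cos(true_theta), -np.sin(true_theta)],
                   [np.sin(true_theta), np.cos(true_theta)]])
p_acq = true_s * p_model @ R_true.T + true_t

s, theta, t, error_repro = fit_similarity(p_model, p_acq)
print(s, theta, t, error_repro)
```

With noise-free matches the reprojection error is close to zero; false matches that survive the previous steps show up as a larger error.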
2.3 Localization
2.3.1. Introduction
There are many ways to localize objects:
– ultrasonic sensors
– infrared sensors
– cameras
– …
Localization with cameras provides a lot of data, and working with pixels is more accurate. However, it needs more processing time and computation.
In the context of the project, localization improves the interaction between the robot and the detected object.
In the following, we will see how we can localize objects using one camera and several cameras; we will also see the drawbacks and advantages of these two methods.
But before anything else, we need to focus on the parameters of a video camera, which are important to define.
2.3.2. Parameters of a video camera
A video camera can be defined by two kinds of parameters:
– extrinsic
– intrinsic
The extrinsic parameters concern the geographical location of the camera, whereas the intrinsic parameters deal with the internal features of the camera. In the following, we will describe each of these parameters.
2.3.2.1 extrinsic
The extrinsic parameters of a camera refer to the position and orientation of the camera in relation to the observed object. Thus, the position relative to the camera $p_{cam}$ is defined as below:

$$p_{cam} = R_{cam}\, p_{world} + T_{cam} = \langle R_{cam} \mid T_{cam} \rangle\, p_{world}$$

where:
– $p_{world}$ refers to the position relative to the world origin
– $R_{cam}$ and $T_{cam}$ refer respectively to the rotation and translation of the camera relative to the reference point

$$R_{cam} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \qquad T_{cam} = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}$$
In our case, it is interesting to express the object position $p_{obj/world}$ in the camera frame ($p_{obj/cam}$). In order to build this relation, we consider that $R_{cam}$ and $T_{cam}$ are referenced to the object and not to the world as before. As we want to express $T_{cam}$ in the object frame, we have a translation of $-T_{cam}$ before the rotation:

$$p_{obj/cam} = R_{cam}\,(p_{obj/world} - T_{cam}) = R_{cam}\, \langle I \mid -T_{cam} \rangle\, p_{obj/world}$$
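The change of frame above can be sketched as follows. This is a minimal illustration in plain Python; the helper names (`mat_vec`, `world_to_camera`, `rotation_z`) and the numerical values are hypothetical and only serve the example.

```python
import math

def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def world_to_camera(R_cam, T_cam, p_obj_world):
    """Apply p_obj/cam = R_cam (p_obj/world - T_cam)."""
    # Translation of -T_cam first, then the rotation
    shifted = [p_obj_world[i] - T_cam[i] for i in range(3)]
    return mat_vec(R_cam, shifted)

def rotation_z(angle_rad):
    """Rotation matrix around the z axis (used only to build an example)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Example: camera shifted by 1 on x and rotated by 90 degrees around z
R_cam = rotation_z(math.pi / 2)
T_cam = [1.0, 0.0, 0.0]
p_obj_cam = world_to_camera(R_cam, T_cam, [2.0, 0.0, 5.0])  # close to (0, 1, 5)
```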
[Figure: the object expressed in the world frame (origin O_obj/world, axes X_obj/world, Y_obj/world, Z_obj/world) and in the camera frame (origin O_obj/cam, axes X_obj/cam, Y_obj/cam, Z_obj/cam), the two frames being related by (R_cam, T_cam)]
2.3.2.2 intrinsic
The intrinsic parameters can be divided into two categories:
– the projection parameters, which define the projection relation between the 2D points seen on the image and the real 3D points
– the distortions, defined in relation to an ideal pinhole camera
2.3.2.2.1 projection parameters
The projection parameters fill the projection matrix linking 3D points from the world with 2D points (or pixels) seen by the camera.
This matrix applies the relation followed by ideal pinhole cameras. A pinhole camera is a simple camera without a lens and with a single small aperture; we can assume that a video camera with a lens behaves like a pinhole camera because the rays go through the optical center of the lens.
This relation is easily determined by geometry, using a front-projection pinhole camera:

$$\frac{x'_{img}}{x_{obj/cam}} = \frac{y'_{img}}{y_{obj/cam}} = \frac{f_{cam}}{z_{obj/cam}}$$

We deduce:

$$x'_{img} = f_{cam}\, \frac{x_{obj/cam}}{z_{obj/cam}} \qquad y'_{img} = f_{cam}\, \frac{y_{obj/cam}}{z_{obj/cam}}$$

These relations allow us to build the projection matrix linking the 2D points on the image $p'_{img}$ and the 3D object points expressed relative to the video camera frame $p_{obj/cam}$:

$$p'_{img} = s\, P_{cam}\, p_{obj/cam}$$
[Figure: front-projection pinhole camera model, showing the optical center O_cam, the optical axis, the image plane (origin O_img) at the focal distance f_cam, and an object point (x_obj/cam, y_obj/cam) at depth z_obj/cam projected onto (x'_img, y'_img)]
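where

$$P_{cam} = \begin{pmatrix} f_{cam} & 0 & 0 \\ 0 & f_{cam} & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad s = \frac{1}{z_{obj/cam}}$$

$s$ is a scale factor depending on the distance between the object and the camera: the larger it is (that is, the closer the object), the bigger the object appears on the image.
The position on the image $p'_{img}$ is expressed in distance units; it is now necessary to convert it to pixel units ($p_{img}$) and apply an offset:

$$p_{img} = P_{img}\, p'_{img}$$

where

$$P_{img} = \begin{pmatrix} N_x & N_x \cot(\theta) & c_x \\ 0 & \dfrac{N_y}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{pmatrix}$$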
$N_x$ and $N_y$ refer respectively to the number of pixels per unit of distance horizontally and vertically.
$c_x$ and $c_y$ refer to the coordinates of the image center (in pixels); as the z axis of the camera frame lies on the optical axis, it is necessary to offset $p_{img}$ by half of the image width on x and by half of the image height on y (by $c_x$ and $c_y$ respectively).
$\theta$ refers to the angle between the x and y axes; it can be responsible for a skew distortion on the image. The axes are generally orthogonal, so this angle is very close to 90°.
Finally, we can deduce the final projection matrix $P_{cam-px}$, which directly gives the position on the image in pixels $p_{img}$:

$$p_{img} = s\, P_{img}\, P_{cam}\, p_{obj/cam} = s\, P_{cam-px}\, p_{obj/cam}$$

where

$$P_{cam-px} = \begin{pmatrix} N_x f_{cam} & N_x f_{cam} \cot(\theta) & c_x \\ 0 & \dfrac{N_y f_{cam}}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} f_x & f_x \cot(\theta) & c_x \\ 0 & \dfrac{f_y}{\sin(\theta)} & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

We have $f_x = N_x f_{cam}$ and $f_y = N_y f_{cam}$; in general $\sin(\theta) \approx 1$ since $\theta \approx 90°$, but the term $\cot(\theta)$, although close to zero ($\tan(\theta) \to \infty$), is often kept. Thus, a skew coefficient is defined as $\alpha_c = \cot(\theta)$ and the projection matrix can be approximated as:

$$P_{cam-px} \approx \begin{pmatrix} f_x & f_x \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \quad \text{where } \alpha_c \approx 0$$
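The projection of a 3D point onto the image in pixels can be sketched as below. It is a minimal example using the approximated matrix (sin θ ≈ 1); the function name `project_to_pixel` and the numerical values (focal lengths of 500 pixels, a 640x480 image center) are hypothetical.

```python
def project_to_pixel(p_obj_cam, fx, fy, cx, cy, alpha_c=0.0):
    """Project a 3D point (camera frame) to pixel coordinates.

    Implements p_img = s * P_cam-px * p_obj/cam with s = 1 / z_obj/cam,
    using the approximated projection matrix (sin(theta) ~ 1).
    """
    x, y, z = p_obj_cam
    if z <= 0:
        raise ValueError("the point must be in front of the camera (z > 0)")
    s = 1.0 / z                      # scale factor
    xn, yn = s * x, s * y            # normalized coordinates
    u = fx * xn + fx * alpha_c * yn + cx
    v = fy * yn + cy
    return u, v

# Hypothetical intrinsics for a 640x480 image
u, v = project_to_pixel((0.1, -0.2, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

With these values, the point lands 25 pixels to the right of the image center and 50 pixels above it.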
2.3.2.2.2 distortions
There are two different types of distortion:
– the radial distortions $dr$
– the tangential distortions on x ($dt_x$) and on y ($dt_y$)
The radial distortions come from the convexity of the lens: the further we go from the center of the lens (center of the image), the stronger the distortions. You can see below the effect of radial distortion.
The tangential distortions come from the lack of parallelism between the lens and the image plane.
Thus, we can express the distorted and normalized position $p^d_n$ relative to the normalized object position $s\, p_{obj/cam} = (x_n, y_n, 1)^T$:

$$p^d_n = \begin{pmatrix} dr\, x_n + dt_x \\ dr\, y_n + dt_y \\ 1 \end{pmatrix} = M_d\, s\, p_{obj/cam}$$

where

$$M_d = \begin{pmatrix} dr & 0 & dt_x \\ 0 & dr & dt_y \\ 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad s = \frac{1}{z_{obj/cam}}$$
Note that the object position has to be normalized before the distortions are applied, and that the distortions must be applied before the camera projection seen in the previous part. Indeed, the focal lengths, the image center and the skew coefficient are applied to the distorted and normalized position $p^d_n$ instead of directly to the object position $p_{obj/cam}$; it is then not necessary to normalize again (multiply by $s$) during the projection.
$dr$, $dt_x$ and $dt_y$ are expressed as functions of the normalized object position $s\, p_{obj/cam} = (x_n, y_n, 1)^T$:

$$x_n = \frac{x_{obj/cam}}{z_{obj/cam}} \qquad y_n = \frac{y_{obj/cam}}{z_{obj/cam}} \qquad r = \sqrt{x_n^2 + y_n^2}$$

Thus, the distortion coefficients are calculated as shown below:

$$dr = 1 + dr_1 r^2 + dr_2 r^4 + dr_3 r^6 + \dots + dr_n r^{2n}$$

$$dt_x = 2\, dt_1\, x_n y_n + dt_2\, (r^2 + 2 x_n^2) \qquad dt_y = dt_1\, (r^2 + 2 y_n^2) + 2\, dt_2\, x_n y_n$$
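The distortion model above can be sketched as a small function applied to the normalized coordinates. The function name `distort_normalized` and the coefficient values in the example are hypothetical.

```python
def distort_normalized(xn, yn, dr_coeffs, dt1, dt2):
    """Apply radial and tangential distortion to normalized coordinates.

    dr_coeffs is the list (dr_1, dr_2, ..., dr_n) of radial coefficients;
    dt1 and dt2 are the two tangential coefficients.
    """
    r2 = xn * xn + yn * yn
    # Radial term: dr = 1 + dr_1 r^2 + dr_2 r^4 + ... + dr_n r^(2n)
    dr = 1.0
    r_pow = 1.0
    for coeff in dr_coeffs:
        r_pow *= r2
        dr += coeff * r_pow
    # Tangential terms
    dtx = 2.0 * dt1 * xn * yn + dt2 * (r2 + 2.0 * xn * xn)
    dty = dt1 * (r2 + 2.0 * yn * yn) + 2.0 * dt2 * xn * yn
    # Distorted and normalized position (first two rows of M_d * (xn, yn, 1))
    return dr * xn + dtx, dr * yn + dty

# With all coefficients at zero the position is unchanged
undistorted = distort_normalized(0.3, -0.2, [0.0, 0.0], 0.0, 0.0)
# A single radial coefficient pushes the point away from the center
radial_only = distort_normalized(1.0, 0.0, [0.1], 0.0, 0.0)
```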
2.3.2.3 Conclusion
Using the equations above, we can express a real 2D pixel point on the image $p_{img}$ as a function of a 3D object point within the world $p_{obj/world}$:

$$p_{img} = s\, P_{cam-px}\, M_d\, R_{cam}\, \langle I \mid -T_{cam} \rangle\, p_{obj/world}$$

$$p_{img} = s \begin{pmatrix} f_x & f_x \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} dr & 0 & dt_x \\ 0 & dr & dt_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world} = M\, p_{obj/world}$$

$M$ will be called the intrinsic matrix in the following.
2.3.3. Localization with one camera
Using one camera is a simple way to localize objects in space, even if it is necessary to know at least one dimension in distance units.
It consists of four steps:
– calibrate the camera (mono-calibration)
– undistort the image
– find patterns
– calculate the position of the patterns
As the diagram shows, the calibration of the camera is only needed at the beginning of the process.
In the following, we will describe each one of these steps.
2.3.3.1 Calibrate the camera (mono-calibration)
First of all, we have to calibrate the camera in order to know its intrinsic parameters. Calibration can also inform us about the extrinsic parameters, but as they always vary, they are less interesting.
[Figure: processing chain. "Calibrate the camera" is done once; "Undistort the image", "Find patterns" and "Calculate the position of the patterns" form the real-time application]
➢ What are the different methods of calibration ?
There are several ways to calibrate a camera:
– photogrammetric calibration
– self-calibration
Photogrammetric calibration needs to know the exact geometry of the object in 3D space. Thus, this method isn't flexible in an unknown environment and, moreover, it is difficult to implement because it requires precisely built 3D objects.
Self-calibration doesn't need any measurements and produces more accurate results. This is the method I chose, calibrating with a chessboard, which has simple repetitive patterns.
➢ Chessboard corner detection
A simple way to calibrate is to use a chessboard: it makes it possible to know accurately the position of the corners on the image.
In the appendices, you can find more details about this type of calibration; I describe the Tsai method, which is a simple way to proceed.
Thus, the calibration informs us about the intrinsic parameters of the camera:
– the focal lengths on x ($f_x$) and on y ($f_y$)
– the position of the center ($c_x$, $c_y$)
– the skew coefficient ($\alpha_c$)
– the radial distortions ($dr$) and the tangential ones ($dt_x$ and $dt_y$)
Calibration is only needed at the beginning of a process; the intrinsic parameters can be saved in an XML file.
2.3.3.2 Undistort the image
Undistorting the image is the first step of the real-time application, which is executed at a specific rate.
Distortions can cause a bad arrangement of the pixels within the image, which has a big impact on the accuracy of the localization.
Thus, distortions have to be compensated by an undistortion process that remaps the image.
To do so, we use the distortion matrix $M_d$ and the projection matrix $P_{cam-px}$ in order to calculate the ideal pixel from the distorted one.
If we consider $p^i_{img}$ as the ideal pixel and $p^d_{img}$ as the distorted one, considering what we said previously, we have:

$$p^i_{img} = s\, P_{cam-px}\, p_{obj/cam} \quad \text{and} \quad p^d_{img} = s\, P_{cam-px}\, M_d\, p_{obj/cam}$$

The idea is to calculate the distorted pixel from the ideal one:

$$p^d_{img} = s\, P_{cam-px}\, M_d\, \frac{1}{s}\, P_{cam-px}^{-1}\, p^i_{img} = P_{cam-px}\, M_d\, P_{cam-px}^{-1}\, p^i_{img}$$

We cannot calculate the ideal pixel directly from the distorted one because the distortion matrix $M_d$ is calculated from the ideal pixels. Because of this problem, we need to use image mapping:

$$x^d_{img} = map_x(x^i_{img}, y^i_{img}) \quad \text{and} \quad y^d_{img} = map_y(x^i_{img}, y^i_{img})$$

Thus, in order to find the ideal image, we remap: each ideal pixel takes its value from the mapped position in the distorted image:

$$(x^i_{img}, y^i_{img}) \leftarrow \big(map_x(x^i_{img}, y^i_{img}),\ map_y(x^i_{img}, y^i_{img})\big)$$
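The mapping and the remapping can be sketched as follows. This is a minimal sketch under several assumptions that are not in the text above: images are lists of rows, the skew coefficient is zero, only one radial coefficient is used, and the remapping uses the nearest pixel instead of an interpolation. The function names are hypothetical; OpenCV provides comparable routines (`initUndistortRectifyMap` and `remap`).

```python
def build_undistort_map(width, height, fx, fy, cx, cy, dr1=0.0, dt1=0.0, dt2=0.0):
    """For each ideal pixel, compute where to read in the distorted image."""
    map_x = [[0.0] * width for _ in range(height)]
    map_y = [[0.0] * width for _ in range(height)]
    for yi in range(height):
        for xi in range(width):
            # Ideal pixel -> normalized coordinates (inverse projection, alpha_c = 0)
            xn = (xi - cx) / fx
            yn = (yi - cy) / fy
            r2 = xn * xn + yn * yn
            # Distortion model with a single radial coefficient
            dr = 1.0 + dr1 * r2
            dtx = 2.0 * dt1 * xn * yn + dt2 * (r2 + 2.0 * xn * xn)
            dty = dt1 * (r2 + 2.0 * yn * yn) + 2.0 * dt2 * xn * yn
            # Distorted normalized position -> distorted pixel
            map_x[yi][xi] = fx * (dr * xn + dtx) + cx
            map_y[yi][xi] = fy * (dr * yn + dty) + cy
    return map_x, map_y

def remap_nearest(distorted, map_x, map_y, fill=0):
    """Build the ideal image by reading the distorted image at the mapped positions."""
    height, width = len(map_x), len(map_x[0])
    ideal = [[fill] * width for _ in range(height)]
    for yi in range(height):
        for xi in range(width):
            xd = int(round(map_x[yi][xi]))
            yd = int(round(map_y[yi][xi]))
            # Positions falling outside the distorted image keep the fill value
            if 0 <= yd < len(distorted) and 0 <= xd < len(distorted[0]):
                ideal[yi][xi] = distorted[yd][xd]
    return ideal

# Without distortion, the map is the identity and the image is unchanged
tiny_image = [[1, 2, 3], [4, 5, 6]]
mx, my = build_undistort_map(3, 2, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
same_image = remap_nearest(tiny_image, mx, my)
```

The maps only depend on the calibration, so they can be computed once and reused for every frame of the real-time application.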
2.3.3.3 find patterns
Then, we have to find the patterns we want to focus on.
We can either:
– find keypoints (Harris, Shi-Tomasi, SUSAN, SIFT, SURF, ...)
– or find keyregions (RANSAC, Hough, …)
SIFT and SURF add robust pattern recognition that is invariant to scale, rotation, brightness and contrast, even if they require more computation than other corner detection methods.
I don't give more details about the corner detection methods, but they are divided into two types:
– Harris, Shi-Tomasi: a Harris matrix is calculated using the local gradients and, depending on the eigenvalues, it's possible to extract specific corners.
– SUSAN (Smallest Univalue Segment Assimilating Nucleus): it works with a SUSAN mask and generates a USAN (Univalue Segment Assimilating Nucleus) region area depending on the pixel intensities. It can be improved using shaped rings counting the number of intensity changes.
As my project is oriented towards pattern recognition, I used SIFT and SURF to extract keypoints.
2.3.3.4 calculate the position of the patterns
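Knowing the projection matrix $P_{cam-px}$ given by the calibration, it's possible to calculate the 3D coordinates ($x_{obj/cam}$, $y_{obj/cam}$, $z_{obj/cam}$) from the pixels ($x_{img}$, $y_{img}$):

$$\begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix} = s\, P_{cam-px} \begin{pmatrix} x_{obj/cam} \\ y_{obj/cam} \\ z_{obj/cam} \end{pmatrix} \approx \frac{1}{z_{obj/cam}} \begin{pmatrix} f_x\, x_{obj/cam} + c_x\, z_{obj/cam} \\ f_y\, y_{obj/cam} + c_y\, z_{obj/cam} \\ z_{obj/cam} \end{pmatrix} \quad \text{if } \alpha_c \approx 0$$

In the following, we consider $\alpha_c = 0$.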
➢ Knowing the distance from the object ($z_{obj/cam}$), we can calculate directly the position of the object ($x_{obj/cam}$, $y_{obj/cam}$):

$$x_{obj/cam} = \frac{x_{img} - c_x}{f_x}\, z_{obj/cam} \quad \text{and} \quad y_{obj/cam} = \frac{y_{img} - c_y}{f_y}\, z_{obj/cam}$$

➢ Knowing one dimension of the object ($d_{obj}$), we can calculate the distance from the object ($z_{obj/cam}$):

$$z_{obj/cam} = \frac{f_0}{d_{img} - c_0}\, d_{obj}$$

where $f_0$ and $c_0$ are the mean values of $f_x$, $f_y$ and of $c_x$, $c_y$ respectively. The dimension $d_{img}$ refers to the dimension $d_{obj}$ as seen on the image.
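The first case (known distance) can be sketched as below, with $\alpha_c = 0$ and distortions already removed. The function name `pixel_to_camera` and the numerical values are hypothetical.

```python
def pixel_to_camera(x_img, y_img, z_obj_cam, fx, fy, cx, cy):
    """Recover (x, y, z) in the camera frame from a pixel and a known distance z."""
    x_obj_cam = (x_img - cx) / fx * z_obj_cam
    y_obj_cam = (y_img - cy) / fy * z_obj_cam
    return x_obj_cam, y_obj_cam, z_obj_cam

# Hypothetical intrinsics for a 640x480 image, object known to be 2 m away
position = pixel_to_camera(345.0, 190.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Here the pixel located 25 pixels right of the center and 50 pixels above it corresponds to a point 0.1 m to the right and 0.2 m up (negative y) at 2 m.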
Thus, we see there are two main problems with localization using one camera:
– either the distance from the object $z_{obj/cam}$ must be known
– or at least one dimension of the object ($d_{obj}$) must be known, together with a way to extract it from the image ($d_{img}$).
In order to make the localization more autonomous, we need more than one camera, as we will see in the next part using stereovision.
2.3.4. Localization with two cameras : stereovision
Now we will focus on stereovision using two cameras (one on the left and another on the right): it enables autonomous localization even if it's more complicated to implement.
It requires the same main steps as with one camera, plus a few more:
– mono and stereo-calibrate the cameras
– rectify, reproject and undistort the image
– find patterns in the left camera
– find the correspondences in the right camera
– calculate the position of the patterns
As previously, the calibration is isolated from the real-time application.
In the following, I give more details about each one of these steps.
[Figure: processing chain. "Mono and stereo-calibrate the cameras" is done once; "Rectify, reproject and undistort the image", "Find patterns in the left camera", "Find the correspondences in the right image" and "Calculate the position of the patterns" form the real-time application]
2.3.4.1 mono and stereo-calibrate the cameras
This operation consists, as its name suggests, in mono-calibrating each one of the cameras (left and right) and stereo-calibrating the system made up of the two cameras.
A simple way to proceed is to first mono-calibrate each one of the cameras in order to have their respective intrinsic and extrinsic parameters. Then, the stereo-calibration uses their extrinsic parameters in order to know their relationship.
This relationship is defined by:
– the relative position between the left camera and the right one $P_{l \to r}$ (rotation $R_{l \to r}$ and translation $T_{l \to r}$)
– the essential and fundamental matrices $E$ and $F$
In the diagram, you can see $P_l$ and $P_r$, the positions of the object $P$ relative to the left and right cameras. $T_{l \to r}$ and $R_{l \to r}$ refer to the relative translation and rotation from the left camera to the right one.
The essential matrix $E$ links the positions $P_l$ and $P_r$ expressed relative to the left and right cameras (it considers only extrinsic parameters):

$$P_r^T\, E\, P_l = 0$$

The fundamental matrix $F$ links the image projections $p_l$ and $p_r$ on the left and right cameras (it considers both extrinsic and intrinsic parameters):
[Figure: geometry of the stereo pair, showing the object point P, the optical centers O_l and O_r, the positions P_l and P_r, and the relative translation T_l→r and rotation R_l→r between the cameras]

$$p_r^T\, F\, p_l = 0$$

where

$$p_l = M_l\, P_l \quad \text{and} \quad p_r = M_r\, P_r$$

$M_l$ and $M_r$ refer to the intrinsic matrices of the left and right cameras.
The matrix $F$ gives a clear relation between one pixel from the left camera and pixels from the right one. Indeed, one pixel on the left image corresponds to a line in the right one (called the right epipolar line $l_r$), and likewise one pixel on the right image corresponds to a line in the left one (left epipolar line $l_l$).
In the diagram, you can see a pixel $p_l$ from the left image $img_l$ with its corresponding right epiline $l_r$ in the right image $img_r$.
$O_l$ and $O_r$ refer to the optical centers of the left and right cameras respectively.
Thus, thanks to the fundamental matrix $F$, we can calculate the epilines:
– if $p_l$ is a pixel from the left camera, the corresponding right epiline $l_r$ is $F p_l$.
– if $p_r$ is a pixel from the right camera, the corresponding left epiline $l_l$ is $F^T p_r$.
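The computation of the epilines can be sketched as below. The matrix `F_RECTIFIED` is an assumption chosen for the illustration: it is the fundamental matrix (up to scale) of an ideal pair of cameras that differ only by a translation on x, not a matrix obtained from the calibration described here. An epiline is returned as coefficients (a, b, c) of the line a·x + b·y + c = 0.

```python
def mat_vec3(F, p):
    """Multiply a 3x3 matrix (list of rows) by a homogeneous pixel (x, y, 1)."""
    return [sum(F[i][j] * p[j] for j in range(3)) for i in range(3)]

def right_epiline(F, p_l):
    """Right epiline l_r = F p_l for a pixel of the left image."""
    return mat_vec3(F, p_l)

def left_epiline(F, p_r):
    """Left epiline l_l = F^T p_r for a pixel of the right image."""
    F_t = [[F[j][i] for j in range(3)] for i in range(3)]
    return mat_vec3(F_t, p_r)

def epipolar_residual(F, p_l, p_r):
    """Value of p_r^T F p_l, which is zero for a perfect correspondence."""
    l_r = right_epiline(F, p_l)
    return sum(p_r[i] * l_r[i] for i in range(3))

# Assumed example: pure translation on x between the two cameras
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

l_r = right_epiline(F_RECTIFIED, [100.0, 50.0, 1.0])   # the line y = 50
l_l = left_epiline(F_RECTIFIED, [80.0, 50.0, 1.0])     # the line y = 50
residual = epipolar_residual(F_RECTIFIED, [100.0, 50.0, 1.0], [80.0, 50.0, 1.0])
```

In this ideal case the epiline is the horizontal line at the same y as the pixel, which is what the rectification step aims for.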
[Figure: a pixel p_l in the left image img_l (left camera, optical center O_l) and its corresponding right epiline l_r in the right image img_r (right camera, optical center O_r)]
Moreover, using the relative position $P_{l \to r}$, we are able to calculate a rectification matrix and a reprojection matrix for each one of the cameras.
➢ But why rectify and reproject ?
As we're working with a stereovision system of two cameras, the goal is to have a simple relationship between them in order to:
– simplify the correspondence step (on the right image)
– but above all, calculate easily the disparity between the two images (step of position calculation)
Indeed, a translation only on x enables us to predict more easily the position of the features within the right camera: it avoids having to use the fundamental matrix $F$ and the epilines to make the correspondence.
The fact that the cameras differ only by a shift on x places the epilines at the same y as the pixel from the other image, as the diagram below shows.
➢ The rectification compensates the relative position between the cameras in order to have only a shift on x between them. A rectification matrix, which is a rotation matrix, is applied to each one of the cameras ($R^l_{rect}$ and $R^r_{rect}$).
➢ The reprojection unifies the projection matrices of the two cameras ($P^l_{cam-px}$ and $P^r_{cam-px}$) in order to build a global one $P^{global}_{cam-px}$. Indeed, the goal here is to find the global camera equivalent to the stereovision system. A reprojection matrix is also applied to each one of the cameras ($P^l_{repro}$ and $P^r_{repro}$).
[Figure: after rectification, the right epiline l_r in the right image img_r (right camera, optical center O_r) lies at the same y as the pixel p_l in the left image img_l (left camera, optical center O_l)]
In the appendices, there are more explanations about the epipolar geometry used in stereovision and about the calculation of the rectification and reprojection matrices.
2.3.4.2 rectify, reproject and undistort the image
This step is similar to the undistortion step seen in the mono-camera case: the difference here is that we also apply a rectification matrix and a reprojection matrix to each camera.
Our ideal pixel is now:

$$p^i_{img} = s_{global}\, P_{repro}\, R_{rect}\, p_{obj/cam}$$

Thus, we can express the distorted (and unrectified) pixel $p^d_{img}$ as a function of $p^i_{img}$:

$$p^d_{img} = s\, P_{cam-px}\, M_d\, \frac{1}{s_{global}}\, R_{rect}^T\, P_{repro}^{-1}\, p^i_{img}$$

Then, similarly to undistortion, we use image mapping.
2.3.4.3 find patterns in the left camera
This step is exactly the same as in the case of one camera, except that we now extract the patterns from the left camera.
2.3.4.4 find the correspondences in the right camera
The correspondences on the right image can be calculated using:
– a matching function on epipolar lines
– an optical flow (Lucas-Kanade for example)
➢ Matching function on epipolar lines
The first method takes advantage of the rectification, which makes it possible to work on a horizontal line (constant y).
Using pixels from the patterns found in the left image, it goes through the horizontal line of the right image at the same y: as we work on the right image, the corresponding points are expected to be on the left side of the position x of the pattern. A matching operation using a small window, called Sum of Absolute Differences (SAD), is calculated at each candidate pixel of the right image (this operation relies on texture).
Thus, the best correspondence is given by the optimum of the matching function, that is, the smallest sum of absolute differences.
The diagram below explains the principle.
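The search along a horizontal line can be sketched as below. It is a minimal example assuming rectified grayscale images stored as lists of rows; the function names, the window size and the maximal disparity are hypothetical, and the caller is assumed to choose a pixel far enough from the image borders.

```python
def sad(left, right, xl, xr, y, half):
    """Sum of Absolute Differences between two (2*half+1)^2 windows."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[y + dy][xl + dx] - right[y + dy][xr + dx])
    return total

def match_on_scanline(left, right, xl, y, half=1, max_disparity=16):
    """Find the x of the right image that best matches pixel (xl, y) of the left one.

    Candidates are searched at the same y, on the left side of xl.
    Returns (best x in the right image, SAD cost of that match).
    """
    best_xr, best_cost = None, None
    for d in range(max_disparity + 1):
        xr = xl - d
        if xr - half < 0:          # the window would leave the image
            break
        cost = sad(left, right, xl, xr, y, half)
        if best_cost is None or cost < best_cost:
            best_xr, best_cost = xr, cost
    return best_xr, best_cost

# Synthetic textured image; the right image is the left one shifted by 2 pixels
WIDTH, HEIGHT, SHIFT = 12, 5, 2
left = [[(7 * x * x + 13 * y) % 31 for x in range(WIDTH)] for y in range(HEIGHT)]
right = [[left[y][x + SHIFT] if x + SHIFT < WIDTH else 0 for x in range(WIDTH)]
         for y in range(HEIGHT)]
best_xr, best_cost = match_on_scanline(left, right, xl=6, y=2)
```

On this synthetic pair, the best match for the left pixel at x = 6 is found at x = 4 in the right image, which gives a disparity of 2 pixels.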
➢ Optical flow
Optical flow (or optic flow) is the pattern of apparent motion of objects, surfaces and edges in a visual scene caused by the relative motion between an observer and the scene.
Thus, you can extract an optical flow using:
– several images from one camera at different instants
– several images from different cameras with different viewpoints
In stereovision, we are in the second case.
The Lucas-Kanade method is an accurate method to find the velocity in a sequence of images, but it relies on several assumptions:
– brightness constancy: a pixel from a detected object has a constant intensity.
– temporal persistence: the image motion is small.
– spatial coherence: neighboring points in the world have the same motion, belong to the same surface and have neighboring positions on the image.
Using a Taylor series, we can express the intensity of the next image I(x+δx, y+δy, t+δt) as a function of the previous one I(x, y, t):
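
$$I(x + \delta x,\ y + \delta y,\ t + \delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x}\, \delta x + \frac{\partial I}{\partial y}\, \delta y + \frac{\partial I}{\partial t}\, \delta t$$

Using the assumption of brightness constancy and taking the relation above as an equality, we have:

$$I(x + \delta x,\ y + \delta y,\ t + \delta t) = I(x, y, t) \quad \text{and} \quad \frac{\partial I}{\partial x}\, V_x + \frac{\partial I}{\partial y}\, V_y + \frac{\partial I}{\partial t} = 0$$

where

$$V_x = \frac{\partial x}{\partial t} \quad \text{and} \quad V_y = \frac{\partial y}{\partial t}$$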
As we have two unknowns for the components of the velocity ($V_x$ and $V_y$) and only one equation, the equation can't be solved directly. We need to use a region made up of several pixels (at least two) over which we consider the velocity vector to be constant (spatial coherence assumption).
As the Lucas-Kanade method focuses on local features (small window), we see why it's important to have small movements (temporal persistence assumption).
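In the following, I will consider:

$$I_x = \frac{\partial I}{\partial x}, \quad I_y = \frac{\partial I}{\partial y} \quad \text{and} \quad I_t = \frac{\partial I}{\partial t}$$

Then, using the previous formula applied to small regions, we have for a region of n pixels:

$$\begin{pmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{xn} & I_{yn} \end{pmatrix} \begin{pmatrix} V_x \\ V_y \end{pmatrix} = \begin{pmatrix} -I_{t1} \\ -I_{t2} \\ \vdots \\ -I_{tn} \end{pmatrix}$$

In order to calculate the velocity, we need to calculate a least-squares solution:

$$\begin{pmatrix} V_x \\ V_y \end{pmatrix} = \left( (A^T A)^{-1} A^T \right) \begin{pmatrix} -I_{t1} \\ -I_{t2} \\ \vdots \\ -I_{tn} \end{pmatrix} \quad \text{where} \quad A = \begin{pmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{xn} & I_{yn} \end{pmatrix}$$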
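The least-squares solution above reduces to a 2x2 linear system, which can be sketched as below. The function name `lucas_kanade_velocity` is hypothetical, and the gradients are assumed to be already computed for the n pixels of the window.

```python
def lucas_kanade_velocity(Ix, Iy, It):
    """Solve (A^T A) V = A^T b with rows of A = (Ix_k, Iy_k) and b_k = -It_k."""
    # Entries of the 2x2 matrix A^T A
    a = sum(ix * ix for ix in Ix)
    b = sum(ix * iy for ix, iy in zip(Ix, Iy))
    c = sum(iy * iy for iy in Iy)
    # Entries of the vector A^T b
    bx = -sum(ix * it for ix, it in zip(Ix, It))
    by = -sum(iy * it for iy, it in zip(Iy, It))
    det = a * c - b * b
    if abs(det) < 1e-12:
        # Flat zone or single edge: A^T A is not invertible
        raise ValueError("the window has no sufficient texture")
    vx = (c * bx - b * by) / det
    vy = (a * by - b * bx) / det
    return vx, vy

# Gradients generated from a known velocity (1.5, -0.5): It = -(Ix*Vx + Iy*Vy)
Ix = [1.0, 0.0, 2.0, 1.0]
Iy = [0.0, 1.0, 1.0, 3.0]
It = [-1.5, 0.5, -2.5, 0.0]
vx, vy = lucas_kanade_velocity(Ix, Iy, It)
```

The determinant test shows why the method behaves poorly on flat zones: when all the gradients are zero or parallel, the matrix $A^T A$ cannot be inverted.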
The least-squares solution works well for features such as corners but less well for flat zones.
The Lucas-Kanade method can be improved to also handle large movements using the scale dimension. Indeed, in this case, we calculate the velocity ($V_x$, $V_y$) at a coarse scale and refine it several times before working on the raw image (pyramidal Lucas-Kanade). The diagram below shows the principle of this method.
Thus, using the velocity, we can calculate the displacement vector d and so the corresponding pixels within the right image.
2.3.4.5 calculate the position of the patterns
This step is the core of the use of stereovision for localization.
The diagram below illustrates the stereovision system once the two images have been undistorted, rectified and reprojected.
We can find a geometrical relation between the cameras in order to obtain the distance from the object $Z$:

$$\frac{T - (x_l - x_r)}{Z - f} = \frac{T}{Z} \quad \text{thus} \quad Z = \frac{f\, T}{x_l - x_r}$$

The disparity between the two cameras is defined as:

$$d = x_l - x_r$$

Thus, as in the mono-camera case, knowing $Z$, we can calculate the complete position of the object relative to the camera ($X$, $Y$, $Z$):

$$X = \frac{x_l - c^l_x}{f}\, Z \quad \text{and} \quad Y = \frac{y_l - c^l_y}{f}\, Z$$
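These relations can be sketched as below for one pair of corresponding pixels. The function name `triangulate` and the numerical values (focal length of 500 pixels, baseline T of 0.1 m) are hypothetical; the images are assumed to be undistorted, rectified and reprojected.

```python
def triangulate(xl, yl, xr, f, T, cx_l, cy_l):
    """Position (X, Y, Z) of a point from its left pixel and its right x coordinate.

    f is the focal length in pixels, T the distance between the two cameras,
    (cx_l, cy_l) the image center of the left camera.
    """
    d = xl - xr                      # disparity
    if d <= 0:
        raise ValueError("the disparity must be positive")
    Z = f * T / d
    X = (xl - cx_l) / f * Z
    Y = (yl - cy_l) / f * Z
    return X, Y, Z

# Disparity of 25 pixels with f = 500 px and T = 0.1 m gives Z = 2 m
X, Y, Z = triangulate(xl=370.0, yl=200.0, xr=345.0, f=500.0, T=0.1,
                      cx_l=320.0, cy_l=240.0)
```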
➢ disparity
The disparity d is an interesting parameter to measure as it is inversely proportional to the distance of the object $Z$:
– for an object far from the camera, it is near zero: the localization becomes less accurate.
– for an object close to the camera, it tends to infinity: the localization becomes very accurate.
Thus, we have to find the maximal distance up to which the object is detected with enough accuracy.
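The loss of accuracy at long range can be made explicit by differentiating the relation $Z = fT/d$. In the sketch below, $\Delta d$ denotes a hypothetical error on the measured disparity (for example one pixel); this symbol is introduced only for the illustration.

```latex
Z = \frac{f\,T}{d}
\quad\Longrightarrow\quad
\frac{\partial Z}{\partial d} = -\frac{f\,T}{d^{2}} = -\frac{Z^{2}}{f\,T}
\quad\Longrightarrow\quad
|\Delta Z| \approx \frac{Z^{2}}{f\,T}\,|\Delta d|
```

For a given disparity error, the error on the distance therefore grows with the square of the distance, and it decreases when the focal length or the distance between the cameras increases.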
2.4 NAO control
2.4.1. Introduction
The NAO is an autonomous humanoid robot built by the French company Aldebaran-Robotics. This robot is well suited to education as it is designed to be easy and intuitive to program.
It integrates several sensors and actuators which allow students to work on locomotion, grasping, audio and video signal processing, voice recognition and more.
A humanoid robot such as NAO can make courses more fun and motivate students to work on practical projects: it makes it possible to test complicated algorithms very quickly, in real time and in a real environment.
2.4.2. Hardware
2.4.2.1 General
Firstly, it's necessary to know that there are several models of NAO which enable different kinds of applications:
– NAO T2 (humanoid torso): sensing, thinking
This model makes it possible to focus on signal processing and artificial intelligence.
– NAO T14 (humanoid torso): sensing, thinking, interacting, controlling, grasping
This model adds a more practical aspect to the previous one, with the capacity to work on control.
– NAO H21 (humanoid): sensing, thinking, interacting, controlling, travelling
This first humanoid model allows NAO to move within the environment: a lot of applications are possible, mapping for example.
– NAO H25 (humanoid): sensing, thinking, interacting, controlling, travelling, being autonomous, grasping
This last model is the most advanced: it adds the ability to grasp objects to the humanoid (as the torso NAO T14 also does).
Note: besides these different models of NAO, some modules can be added to NAO, such as the laser head. The use of a laser improves the navigation of the robot in complex indoor environments.
Besides the model, a NAO robot has a specific version (V3+, V3.2, V3.3, …).
The robots available in the laboratory are NAO H25, version V3.2: this hardware is a good base for a lot of applications.
[Figure: the four NAO models: NAO T2, NAO T14, NAO H21 and NAO H25]
2.4.2.2 Features
The NAO robot (NAO H25) has the following features:
– height of 0.57 meters and weight of 4.5 kg
– sensors: 2 CMOS digital cameras, 4 microphones, 2 ultrasonic emitters and receivers (sonars), 2 infrared emitters and receivers, 1 inertial board (two 1-axis gyrometers and one 3-axis accelerometer), 9 tactile sensors (buttons), 8 pressure sensors (FSR), 36 encoders (Hall effect sensors)
– actuators: brush DC motors
– other outputs: 1 voice synthesizer, LED lights, 2 high-quality speakers
– connectivity: internet (Ethernet and WiFi) and infrared (remote control, other NAO, ...)
– a CPU (x86 AMD GEODE 500 MHz CPU + 256 MB SDRAM / 2 GB Flash Memory) located in the head, with a Linux kernel (32-bit x86 ELF) and the Aldebaran Software Development Kit (SDK) NAOqi
– a second CPU located in the torso
– a 55 watt-hour lithium-ion battery (nearly 1.5 hours of autonomy)
The NAO is well equipped, which makes it very flexible in its applications.
Note that the motors are controlled by a dsPIC microcontroller, through their encoders (Hall effect sensors).
The robot has 25 Degrees Of Freedom (DOF):
– head: 2 DOF
– arm: 4 DOF in each arm
– pelvis: 1 DOF
– leg: 5 DOF in each leg
– hand: 2 DOF in each hand
For my project, I focused on the head at the beginning in order to work on object tracking.
Afterwards, I planned to work on object grasping: this task requires more calculations to synchronise the actuators correctly.
Because of time constraints, I had to leave aside part of the control, as my main topic was image processing for object recognition and localization.
In the next part, I will give more details about the two cameras of the NAO robot, as I focused on these sensors.
2.4.2.3 Video cameras
NAO has two identical video cameras located on its face. They provide a resolution of 640x480 and can run at 30 frames per second.
Moreover, the two cameras have these features:
– camera output: YUV422
– Field Of View (FOV): 58 degrees (diagonal)
– focus range: 30 cm → infinity
– focus type: fixed focus
The upper camera points straight ahead from the head of NAO: it enables the robot to see in front of it.
The lower camera, shifted downwards, is oriented towards the ground: it enables NAO to see what lies on the ground close to it.
Note that these cameras can't be used for stereovision as there is no overlap between their fields of view.
2.4.3. Software
Several software tools make it possible to program NAO quickly and easily. They also make it easy to interact with the robot.
2.4.3.1 Choregraphe
Choregraphe is a cross-platform application for editing NAO's movements and behaviors. A behavior is a group of elementary actions linked together following an event-based and a time-based approach.
It consists of a simple GUI (Graphical User Interface) based on the NAOqi SDK, which is the framework used to control NAO: we will focus on it later.
➢ Interactions
Choregraphe displays a NAO robot on the screen which imitates the actions of the real NAO in real time. The user can use this simulated NAO to save time during the tests. The software can also un-enslave the motors, that is, impose no stiffness on each motor. The user can thus focus on the simulated NAO to test the code more quickly and safely.
Moreover, the software makes it possible to control each joint in real time from the GUI. The interface displays the new joint values in real time even if the user moves the robot manually. Thanks to these features, the user can easily test the range of each joint, for example.
➢ Boxes
In Choregraphe, each movement and detection (action) can be represented as a box; these boxes are arranged in order to create an application. As mentioned previously, behaviors, which are groups of actions, can also be represented as boxes, but with several boxes inside.
In order to connect the boxes, they have input(s) and output(s) which generally start and finish the action. For specific boxes, such as the sensor or computation boxes, there can be several outputs (the value of the detected signals, calculated variables, ...) and several inputs (variables, conditions, …).
➢ Timeline
An interesting feature of Choregraphe is the possibility to create an action or a sequence of behaviors using the timeline. Contrary to the system of boxes seen previously, the timeline gives a view of time, which is the real conductor of the application.
The screenshot above shows the timeline (marked as 1), the behavior layers (marked as 2) and the position on the frame (marked as 3). Thus, it's easier to manage several behaviors in series or in parallel.
Moreover, the timeline makes it possible to create an action from several positions of the joints of the robot. Indeed, it's only necessary to select a specific frame from the timeline, put the NAO in a specific position and save the joints. The software automatically calculates the speed required from the actuators in order to respect the constraints of the frames. It's also possible to edit the speed laws of each motor manually, although care is needed.
In the screenshot above, you can see the keyframes (squares) and several functions controlling the speed of the motors.
Editing a movement using keyframes makes it easy, for example, to implement a dance for NAO.
Choregraphe offers other possibilities, such as a video monitor panel which makes it possible to:
– display the images from the NAO cameras
– recognize an object using as model the selection of a pattern on the image
2.4.3.2 Telepathe
Telepathe is an application that provides feedback from the robot and can send basic orders. Like Choregraphe, it's a cross-platform GUI, but a customizable one (the user can load plugins in different widgets). These widgets or windows can be connected to different NAO robots.
Telepathe is a more sophisticated software than Choregraphe for the configuration of devices and the visualization of data. It is directly linked to the memory of NAO, which makes it possible to follow a lot of variables in real time.
➢ Memory viewer
The module ALMemory of NAO (which is part of NAOqi) makes it possible to write to and read from the memory of the robot. Telepathe uses this module to display variables with tables and graphs. The user can monitor variables accurately during testing.
➢ Video configuration
Moreover, Telepathe gives access to a detailed configuration of the video camera. Thus, it makes the testing of computer vision applications easier.
2.4.4. NAOqi SDK
The Software Development Kit (SDK) NAOqi is a robotic framework built especially for NAO. There are a lot of others, usable with a large range of robots: Orocos, YARP (Yet Another Robotic Platform), Urbi and others. They can be specialized in motion, in real-time operation or in AI, and they are sometimes open-source so that everyone can work on the source code.
NAOqi improves the management of events and the synchronization of tasks, as a real-time kernel can do. It works with a system of modules, around a shared memory (ALMemory module).
2.4.4.1 Languages
The framework mainly uses three languages :
– Python : the code is run directly by an interpreter, without a compilation step, so the language
makes it possible to control NAO quickly and to manage parallelism more easily.
– C++ : the language needs a compiler to build the program; it's used to program new
modules, as they work well with object-oriented languages and are expected to stay
fixed.
– URBI (UObjects and urbiScript) : URBI is a robotic framework like NAOqi and can be started
from NAOqi (it's a NAOqi module called ALUrbi). It makes it possible to create UObjects, which
refer to modules, and to use urbiScript to manage events and parallelism.
URBI is an interesting framework as it's compatible with many robots (Aibo, Mindstorms NXT,
NAO, Spykee, …), but in my internship I focused on NAOqi, which works on the same principle. Thus,
in the following, I will focus on the Python and C++ languages, which use the core of NAOqi.
NAOqi is very flexible regarding languages (cross-language) : C++ modules can be called from
Python code and vice versa, which simplifies programming and the exchange of programs
between people.
Moreover, NAOqi ships with several libraries such as Qt and OpenCV, which make it possible
respectively to create GUIs and to manipulate images.
2.4.4.2 Brokers and modules
NAOqi works using brokers and modules :
– a broker is an executable and a server that listens for remote commands on an IP address and port.
– a module is both a class and a library which deals with a specific device or action of NAO.
Modules can be called from a broker using proxies : a proxy gives access to the methods of a local
or remote module. Indeed, a broker can be executed on NAO but also on other NAO robots and on
computers, so a single program can work on several devices.
An interesting property of modules is that they can't call other modules directly : only proxies
let them use methods from different modules, as a proxy refers to the IP address and port of a
specific broker.
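The relationship between brokers, modules and proxies can be illustrated with a toy model. This is only a sketch and not the NAOqi API : the classes Broker and Proxy, the modules Motion and Tracker, and the addresses and ports are hypothetical names chosen for the illustration, and the "network" is a plain dictionary.

```python
# Toy model of brokers, modules and proxies. NOT the NAOqi API:
# every class, method, address and port below is hypothetical.

NETWORK = {}  # (ip, port) -> Broker; stands in for the real network


class Broker:
    """A server reachable at (ip, port) that hosts named modules."""

    def __init__(self, ip, port):
        self.ip, self.port = ip, port
        self.modules = {}
        NETWORK[(ip, port)] = self

    def register(self, name, module):
        self.modules[name] = module


class Proxy:
    """Forwards method calls to a module hosted by a local or remote broker."""

    def __init__(self, module_name, ip, port):
        self._target = NETWORK[(ip, port)].modules[module_name]

    def __getattr__(self, method_name):
        # Only reached for names that are not attributes of the proxy itself.
        return getattr(self._target, method_name)


class Motion:
    """Hypothetical module owning the head actuators."""

    def __init__(self):
        self.head_yaw = 0.0

    def set_head_yaw(self, angle):
        self.head_yaw = angle
        return self.head_yaw


class Tracker:
    """Hypothetical module: it reaches Motion only through a proxy."""

    def __init__(self, ip, port):
        self.motion = Proxy("Motion", ip, port)

    def look_at(self, angle):
        return self.motion.set_head_yaw(angle)


robot = Broker("192.168.0.10", 9559)       # plays the role of mainBroker
robot.register("Motion", Motion())
pc = Broker("127.0.0.1", 54000)            # a second broker, e.g. on a PC
pc.register("Tracker", Tracker("192.168.0.10", 9559))

result = Proxy("Tracker", "127.0.0.1", 54000).look_at(0.3)
```

The Tracker module never holds a reference to Motion itself, only a proxy built from an address and a port, which is the constraint described above.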
In the diagram above, you can see the architecture adopted by both NAO and its software; purple
rectangles refer to brokers and green circles to modules. The broker mainBroker is the broker of
the NAO robot : it is executed as soon as the robot is switched on and has an IP address
(Ethernet or Wi-Fi).
This architecture is a safe one, as the robot and each piece of software work with different
brokers : if Choregraphe or Telepathe crashes, the robot is less likely to fall,
as the brokers will not communicate during 100 ms.
As developers using NAOqi, we can choose to use the mainBroker (unsafe but fast) or create
our own remote broker. Personally, as I focused first on the head actuators for the
tracking, I preferred to use the mainBroker as it was more intuitive and faster.
As the diagram above shows, many modules are available for NAO (mainBroker), but we
can add more by generating modules from C++ classes.
The modules refer often to devices available on the robot :
– sensors (ALSensors) : tactile sensors
– motion (ALMotion) : actuators
– leds (ALLeds) : led outputs
– videoInput (ALVideoDevice) : video cameras
– audioIn (ALSoundDetection) : microphones
– AudioPlayer (ALAudioPlayer) : speakers
– …
Others, however, are built for a specific purpose :
– interpret Python code : pythonbridge (ALPythonBridge)
– run computational processes : faceDetection (ALVisionRecognition), landmark
(ALLandMarkDetection), …
– communicate with the lower level software controlling electrical devices : DCM (DCM)
– manage a shared memory between modules : memory (ALMemory)
– …
As expected, Telepathe has a memory module and a camera module.
For a module to make its methods available to the brokers, the methods need to be bound
to the API (Application Programming Interface) module. This module makes it possible to call the
methods both locally and remotely.
As we said previously, Python code makes it possible to control NAO quickly because the NAOqi SDK
provides many Python functions to create proxies to one or several brokers and to use these
proxies to execute methods.
NAOqi provides the same kind of methods for C++, but they are more oriented towards the
generation of modules.
2.4.4.3 Generation of modules in C++
First of all, a module can be generated using a Python script called module_generator.py. It
automatically generates several C++ templates :
– projectNamemain.cpp : it creates the module for a specific platform (NAO or PC).
– moduleName.h and moduleName.cpp : these are the files which will be customized by the
user.
projectName and moduleName are replaced by the desired names.
As NAOqi provides module libraries and other libraries (OpenCV for example) for C++, the
user can develop easily with the framework. When the module is ready, it has to be compiled.
In order to build modules, NAOqi uses CMake, which makes it possible to compile sources using a
configuration file (generally named CMakeLists) and other options such as the location of
the sources or of the binaries. The Python script module_generator.py also creates a CMakeLists
file, which facilitates the compiling process. CMake can be run from the command line, as is the
case on Linux, or through a GUI, as on Windows.
In order to compile modules, we use cross-compiling so that the code runs on a specific
platform (PC or robot). Indeed, the PC doesn't have the same modules as the robot, as it doesn't
have the same devices. Cross-compiling relies on a CMake toolchain file : toolchain-pc.cmake to
compile a module for the PC and toolchain-geode.cmake for NAO.
On Linux, the CMake command needs an additional option in order to cross-compile :
cmake -DCMAKE_TOOLCHAIN_FILE=path/toolchain_file.cmake ..
Cross-compilation from Windows for NAO is impossible as NAO runs a Linux OS; however, Cygwin
on Windows can be an alternative.
After execution, CMake provides :
– a Makefile on Linux : it describes the compiler to use (g++ in our case)
and many other options, and it has to be executed with the make command.
– a Visual Studio Solution file (.sln) on Windows : it can be opened in Visual Studio, where it
has to be built.
At the end of the process, a library file is created on both operating systems (.so on Linux and
.dll on Windows). The problem with Windows in this case is that it can't provide a library for
NAO, as the robot uses .so libraries (Linux). The module library can then be moved to the lib
folder of NAO or of the PC, and the configuration file autoload.ini has to be modified in order
to load this module.
2.4.4.4 The vision module (ALVideoDevice)
As I focused on vision, I present in this part the management of the NAO cameras through the
vision module called ALVideoDevice.
The vision on NAO (ALVideoDevice module) is controlled by three modules :
– the driver V4L2
It acquires frames from a device, which can be a real camera, a simulated camera or a file.
It uses a circular buffer which contains a specific number of image buffers (for NAO, there are four
images). These images are updated at a specific frame rate, resolution and format depending on
the features of the device.
– the Video Input Module (VIM)
This module manages the video device and the driver V4L2 : it opens and closes the video
device using the I2C bus, and it starts the device in streaming mode (circular buffer) and stops it.
It configures the driver for a specific device (real camera, simulated camera, file). Moreover, it
can extract an image from the circular buffer (unmap buffer) at a specific frame rate and convert
the image to a specific resolution and format. Several VIMs can be created at the
same time for different devices or for the same one; they are identified by a name.
– the Generic Video Module (GVM)
This module is the main interface between the video device and the user : it lets the user
choose a device, a frame rate, a resolution and a format, and the GVM configures the VIM
accordingly. Moreover, this module lets the user control the video device in local or remote
mode. The difference lies in the data received : in local mode, the image data (from
the VIM) can be accessed through a pointer, whereas in remote mode the image data has to be
duplicated from the image of the VIM and is returned as an ALImage. The GVM also has methods to
retrieve the raw images from the VIM (without any conversion) : this improves the frame rate,
but it restricts the number of VIMs, depending on the number of
buffers enabled by the driver.
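The circular buffer used by the driver can be sketched as follows. This is a minimal illustration of the idea only, not the V4L2 or NAOqi implementation : the class CircularFrameBuffer and its methods are hypothetical, and strings stand in for image buffers. The size of four follows the figure given above for NAO.

```python
from collections import deque


class CircularFrameBuffer:
    """Fixed-size ring of image buffers; the oldest frame is overwritten."""

    def __init__(self, size=4):
        self._frames = deque(maxlen=size)

    def push(self, frame):
        # Called by the (simulated) driver at each new acquisition.
        self._frames.append(frame)

    def latest(self):
        # Called by the (simulated) VIM to take the most recent frame.
        return self._frames[-1]

    def snapshot(self):
        # Frames currently held, from oldest to newest.
        return list(self._frames)

    def __len__(self):
        return len(self._frames)


frame_buffer = CircularFrameBuffer(size=4)
for index in range(6):                      # six acquisitions, four slots
    frame_buffer.push("frame-%d" % index)

held = frame_buffer.snapshot()
```

After six acquisitions only the four most recent frames remain, which is why a consumer that is too slow loses the oldest images rather than blocking the driver.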
The diagram below shows the three modules with their own threads (driver, VIM and GVM). The
GVM can access the image in two ways :
– access 1 with conversion : by pointer if local and by ALImage variable if remote
– access 2 without conversion (raw image) : by pointer if local and by ALImage
variable if remote
In my case, I used the GVM in remote and local modes with conversion. I used the front camera of
NAO in RGB and BGR modes : as the camera is initially in YUV422, a conversion is needed.
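To give an idea of what such a conversion involves, here is a minimal sketch in pure Python. It rests on assumptions which are not stated in this document : the YUV422 data is packed in YUYV order (Y0 U Y1 V for each pair of pixels) and the full-range BT.601 coefficients are used. The function names are hypothetical and the conversion actually performed on NAO may differ.

```python
def clamp(value):
    """Round and keep a channel value inside the 0..255 range."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    """Convert one full-range BT.601 YUV sample to an (R, G, B) tuple."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    return clamp(r), clamp(g), clamp(b)


def yuyv_to_rgb(data):
    """Convert packed YUYV bytes (assumed order Y0 U Y1 V) to RGB tuples."""
    pixels = []
    for i in range(0, len(data) - 3, 4):
        y0, u, y1, v = data[i], data[i + 1], data[i + 2], data[i + 3]
        pixels.append(yuv_to_rgb(y0, u, v))    # both pixels share U and V
        pixels.append(yuv_to_rgb(y1, u, v))
    return pixels


def rgb_to_bgr(pixels):
    """Swap the channel order to obtain BGR pixels."""
    return [(b, g, r) for (r, g, b) in pixels]


# Two grey pixels (U = V = 128) followed by two reddish pixels (high V).
raw = bytes([50, 128, 200, 128, 81, 90, 81, 240])
rgb = yuyv_to_rgb(raw)
bgr = rgb_to_bgr(rgb)
```

In YUV422 each pair of pixels shares one U and one V sample, so four bytes describe two pixels; this is why the conversion has a cost and why the raw access is faster.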
[Figure : the camera feeds the driver thread; the VIM thread unmaps the buffer and performs the format conversion; the GVM threads read the image through access 1 (after format conversion) or access 2 (after unmap buffer, raw image).]
3 Implementation and testing
In this part, I will focus on the hardware and software I used to accomplish my mission : object
recognition and localization on a NAO robot for control purposes.
3.1 Languages, softwares and libraries
I worked in two ways :
– simulating and testing on Matlab
Matlab is a very useful tool to test algorithms as it's very easy to manipulate data and
display it on figures. Moreover, many toolboxes are available on the web, which lets
users quickly reuse existing programs as tools.
Matlab is also cross-platform : it works on Windows, Linux and Mac. It can also work with
other languages, as the mex command compiles C, C++ or Fortran code into functions that can be
called from Matlab.
It offers many toolboxes for several fields such as aerospace, image processing and
advanced control. Matlab also offers functions to read data from a serial port (serial) or
from other hardware such as video cameras, which makes it possible to work with a large range of
devices. All these aspects make Matlab very flexible for any application needing
computation.
– real implementation in C++/Python with OpenCV for NAO
NAO can be programmed using both C++ and Python, as we saw in the NAO control part.
C++ is used to build modules for NAO, whereas Python, which is directly interpreted, makes it
quick to write applications that use modules.
In my case, as vision was the core of my project, I chose to build a vision module in
order to obtain the main features of the recognized object and to let the user choose
which object to recognize, among other parameters.
I used Python for control as I didn't focus on control during my project. If there is a
problem in the control part, it's very easy to modify the Python code as it doesn't
need to be compiled. I could have used Matlab to simulate the control, as I did for the vision
part, but I didn't have enough time.
I used the Open Source Computer Vision (OpenCV) library in order to implement the vision part
more easily.
3.2 Tools and methods
3.2.1. Recognition
For recognition, I tested on Matlab in two ways :
– Firstly, I used an image (studied image) and a transformed version of it (cropped, rotated,
scaled) as model image.
This approach makes it easy to test the number of keypoints detected and the number of
matches depending on the rotation, scale and size of the cropped image. Using this
method, we can test both the feature extraction (SIFT/SURF) and the feature matching.
I focused first of all on rotation invariance : I used several rotated images
and calculated the number of keypoints and of matches with the studied image (with SIFT).
You can see below the keypoints detected for several rotations and the number of
keypoints as a function of the rotation of the model image (green points : studied image, red
points : model image, blue points : matches).
The number of matches decreases as the image is rotated : the SIFT implementation was not
efficient and I had to improve it.
After that, I focused on the calculation of the 2D model features within the studied image : the
model image was cropped, rotated and scaled depending on the inputs of the user. The
matching estimates these values and, in order to test it, we can calculate the
translation, rotation and scale errors.
– Then, I used images acquired from the video camera of my computer and model images
of objects.
This approach tests the algorithm on real 3D objects. It shows
the influence of the 3D rotation of the object, which limits the recognition. To keep it
simple, I focused first on planar objects (books for example), as only one model image is
required. For other objects such as robots, several model images are needed. The
2D features of the model within the acquired image are always interesting to calculate.
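The estimation of the 2D model features (translation, rotation and scale) from a set of matches, and the computation of the errors, can be sketched as a least-squares fit. This is a minimal sketch and not the Matlab code used in the project : it assumes a similarity transform (uniform scale, rotation, translation) between the model and the studied image, and the function name fit_similarity and the test values are hypothetical.

```python
import math


def fit_similarity(model_pts, image_pts):
    """Least-squares scale, rotation and translation mapping model -> image.

    Fits u = a*x - b*y + tx and v = b*x + a*y + ty over all matches.
    """
    n = float(len(model_pts))
    mx = sum(p[0] for p in model_pts) / n
    my = sum(p[1] for p in model_pts) / n
    mu = sum(q[0] for q in image_pts) / n
    mv = sum(q[1] for q in image_pts) / n

    num_a = num_b = den = 0.0
    for (x, y), (u, v) in zip(model_pts, image_pts):
        xc, yc, uc, vc = x - mx, y - my, u - mu, v - mv
        num_a += xc * uc + yc * vc
        num_b += xc * vc - yc * uc
        den += xc * xc + yc * yc

    a, b = num_a / den, num_b / den
    tx = mu - (a * mx - b * my)
    ty = mv - (b * mx + a * my)
    return {
        "scale": math.hypot(a, b),
        "rotation": math.atan2(b, a),      # radians
        "translation": (tx, ty),
    }


# Synthetic check: model rotated by 30 degrees, scaled by 2, shifted by (5, -3).
true_scale, true_angle, true_shift = 2.0, math.radians(30.0), (5.0, -3.0)
model = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 2.0)]
c, s = math.cos(true_angle), math.sin(true_angle)
image = [(true_scale * (c * x - s * y) + true_shift[0],
          true_scale * (s * x + c * y) + true_shift[1]) for x, y in model]

estimate = fit_similarity(model, image)
errors = (abs(estimate["scale"] - true_scale),
          abs(estimate["rotation"] - true_angle),
          abs(estimate["translation"][0] - true_shift[0]),
          abs(estimate["translation"][1] - true_shift[1]))
```

With exact matches the errors are zero up to rounding; with real keypoint matches, wrong correspondences bias the fit, which is one reason to filter the matches before this step.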
3.2.2. Localization
For localization, I didn't spend time testing on Matlab and I focused directly on the
implementation in C++ using the Open Source Computer Vision (OpenCV) library. I used a stereovision
system available in the laboratory (a previous MSc student, Suresh Kumar Pupala, had designed it
for his dissertation). I wanted to learn more about mono-calibration, stereo-calibration and stereo
correspondence, as the OpenCV functions hide the processing. Thus, I also used Matlab to test
programs for calibration and stereo correspondence. I thought that understanding the concepts
would make me more comfortable with camera devices. The photos below show the
stereovision system and NAO wearing it : the system can be connected to a PC (USB ports); in
order to use it with NAO, it's necessary to start NAOqi both on NAO and on the PC.
You can see below the screenshots showing the stereo-calibration and the stereovision using
C++/OpenCV.
The blue points are the corners found in the left image (left camera), the green points are the
corresponding points found in the right image (right camera) by stereo correspondence, and the
red points are the 3D points deduced, shown in the left image.
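The deduction of a 3D point from a pair of corresponding points can be sketched with the standard relation for rectified cameras, where the depth is inversely proportional to the disparity. This is a minimal sketch and not the OpenCV code used in the project : the function triangulate is hypothetical and the calibration values (focal length, baseline, principal point) are made up for the illustration.

```python
def triangulate(x_left, y_left, x_right, focal_px, baseline, cx, cy):
    """3D point (X, Y, Z) in the left-camera frame from a rectified match.

    focal_px: focal length in pixels; baseline: distance between the cameras;
    (cx, cy): principal point. Returns None if the disparity is not positive.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        return None                # point at infinity or wrong match
    z = focal_px * baseline / disparity
    x = (x_left - cx) * z / focal_px
    y = (y_left - cy) * z / focal_px
    return (x, y, z)


# Hypothetical calibration values, for illustration only.
FOCAL_PX, BASELINE_M, CX, CY = 500.0, 0.10, 320.0, 240.0

near = triangulate(370.0, 240.0, 270.0, FOCAL_PX, BASELINE_M, CX, CY)
far = triangulate(330.0, 250.0, 320.0, FOCAL_PX, BASELINE_M, CX, CY)
```

A disparity of 100 pixels gives a depth of 0.5 m with these values and a disparity of 10 pixels gives 5 m, which shows why depth accuracy degrades quickly for distant objects.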
3.2.3. NAO control
My final goal was to use a vision module (enabling object recognition and localization) combined
with a Python program for NAO control. I wanted first to implement a basic control for NAO (such as
tracking the object with its head, moving to the object, saying which object it is watching) and
then, if possible, a more complicated control (grasping an object for example). In order to
practice on NAO, I wrote a small vision module to track a red ball (HSV color space) and I made NAO
focus on the barycenter of the object by rotating its head (pitch and yaw).
In the screenshot above, you can see the original image on the left and the HSV mask on the right :
tracking using HSV colorspace is not perfect as it depends a lot on the brightness but it was a
simple project to practice on NAO.
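The principle of this ball tracking can be sketched in pure Python. This is not the ballTracking module itself : the thresholds, the gain, the function names and the sign convention of the head angles (negative yaw to turn right, negative pitch to look up) are assumptions made for the illustration, and a list of RGB tuples stands in for the camera image.

```python
import colorsys


def red_mask(image, min_saturation=0.5, min_value=0.2, hue_tolerance=0.05):
    """Binary mask of 'red' pixels; image is a list of rows of (R, G, B)."""
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            # Red sits at both ends of the hue circle (h near 0 or near 1).
            is_red_hue = h <= hue_tolerance or h >= 1.0 - hue_tolerance
            mask_row.append(is_red_hue and s >= min_saturation and v >= min_value)
        mask.append(mask_row)
    return mask


def barycenter(mask):
    """Mean (x, y) of the pixels set in the mask, or None if there are none."""
    xs = [x for row in mask for x, hit in enumerate(row) if hit]
    ys = [y for y, row in enumerate(mask) for hit in row if hit]
    if not xs:
        return None
    return (sum(xs) / float(len(xs)), sum(ys) / float(len(ys)))


def head_correction(center, width, height, gain=0.5):
    """Proportional yaw/pitch steps (normalised units) toward the target."""
    error_x = (center[0] - (width - 1) / 2.0) / width
    error_y = (center[1] - (height - 1) / 2.0) / height
    return (-gain * error_x, gain * error_y)


# Synthetic 6x4 grey image with a 2x2 red patch in the top-right corner.
GREY, RED = (90, 90, 90), (220, 20, 20)
frame = [[GREY] * 6 for _ in range(4)]
for y in (0, 1):
    for x in (4, 5):
        frame[y][x] = RED

center = barycenter(red_mask(frame))
yaw_step, pitch_step = head_correction(center, width=6, height=4)
```

The saturation and value thresholds are what makes the method sensitive to brightness : under dim or strongly coloured lighting, the ball pixels can fall outside the accepted range.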
3.3 Steps of the project
My project was divided into several steps :
– As we said before, the recognition is divided into two main steps : feature extraction and
feature matching. I worked first on feature extraction with SIFT. I implemented SIFT on
Matlab using a C++ code written by an Indian student in Computer Science, Utkarsh Sinha.
Like me, he used the publication of David G. Lowe (« Distinctive Image Features from
Scale-Invariant Keypoints »). I focused on a clear program instead of a fast one because I wanted
to divide it into several distinct functions : by doing this, the program becomes more flexible.
– After that, I studied the feature matching and I focused on the descriptor matching and the
calculation of the 2D model features (Least Squares Solution).
– Then, I looked into the localization by learning about the parameters of a video camera, the
calibration and the stereovision. In order to implement it, I used a stereovision system
adaptable to NAO's head which was available in the laboratory. I used many references,
such as an MSc dissertation from a previous student in Robotics and Automation
(« 3D-Perception from Binocular Vision » - Suresh Kumar Pupala) and the publication from David G.
Lowe. In order to implement it in C++, I used the Open Source Computer Vision (OpenCV)
library. I also did some experiments on Matlab in order to better understand the
calibration and the stereo correspondence. Before linking the localization with the
recognition, I focused only on corners, which makes it possible to study the evolution of the
disparity and the values of the 3D positions.
– After that, I learned about the control of NAO and I tested a small recognition module in C++
(ballTracking) combined with a control part in Python. The goal of this small project was to
deal with the use of the NAO camera or of another camera (laptop camera), in remote or local
mode, and to build a simple control of the head of NAO. The project consisted in tracking a red
ball on the image from the camera (HSV color space), calculating its barycenter and
controlling the head of NAO in order to make it point its camera at the object.
– Later, I studied another invariant feature extraction method (SURF). Contrary to SIFT, I
didn't implement it on Matlab; I only downloaded the OpenSURF library, which is
available as either Matlab or C++ code. As it was quicker than SIFT, I used it to study the
feature matching.
– Finally, I learned about the Generalized Hough Transform, which makes it possible to do keypoint
matching based on the position and the orientation of the keypoints. I first studied the
algorithms outside the matching and then integrated them into it; I used only Matlab to
test them.
– My work is not finished as I didn't implement the feature matching in C++. The goal is then
to replace the corner detection of the localization by SURF feature extraction and to add the
feature matching before doing the stereo correspondence on the right image : the
recognition will be done on the left image only. Thus, by combining recognition and
localization, we will have keypoints localized in space, which makes it possible to estimate the
average position of the object and maybe more (3D orientation, 3D size, …).
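The descriptor matching step mentioned in the list above can be sketched as a nearest-neighbour search with the distance-ratio test described in the publication of David G. Lowe (a match is rejected when the ratio between the closest and the second-closest distance exceeds 0.8). This is a minimal sketch : the function names are hypothetical, the tiny 4-value descriptors stand in for real 128-value SIFT vectors, and it is an assumption that the matching used in this project followed exactly this test.

```python
import math


def euclidean(d1, d2):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))


def match_descriptors(model_desc, scene_desc, max_ratio=0.8):
    """Return (model_index, scene_index) pairs passing the ratio test.

    A match is kept only if the nearest scene descriptor is clearly closer
    than the second nearest (distance ratio below max_ratio).
    """
    matches = []
    for i, d in enumerate(model_desc):
        ranked = sorted((euclidean(d, s), j) for j, s in enumerate(scene_desc))
        if len(ranked) < 2:
            continue
        (best, j_best), (second, _) = ranked[0], ranked[1]
        if second > 0 and best / second < max_ratio:
            matches.append((i, j_best))
    return matches


# Tiny 4-value descriptors stand in for the 128-value SIFT vectors.
model_descriptors = [
    [1.0, 0.0, 0.0, 0.0],    # distinctive: one clear counterpart
    [0.5, 0.5, 0.5, 0.5],    # ambiguous: two near-identical candidates
]
scene_descriptors = [
    [0.9, 0.1, 0.0, 0.0],
    [0.5, 0.5, 0.4, 0.5],
    [0.5, 0.5, 0.5, 0.4],
    [0.0, 0.0, 0.0, 1.0],
]

matches = match_descriptors(model_descriptors, scene_descriptors)
```

The second model descriptor is discarded because two scene descriptors are equally close to it : the ratio test removes ambiguous matches rather than matches that are merely far away.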
[Figure : diagram of the project steps. SIFT (SIFT on Matlab, some testing); feature matching (descriptor matching, 2D features calculation); localization (parameters of a video camera, calibration, stereovision); NAO control (ball tracking project); SURF (use of the OpenSURF library, some testing); feature matching (keypoint matching); future work (feature matching in C++, SURF and feature matching combined with stereovision in a module for NAO).]
4 Conclusion
The project was built around three subjects :
– object recognition
– object localization
– NAO control
I first presented these subjects before describing what I did and what remains to be done.
I used different methods to deal with each of these subjects : I preferred to simulate
the recognition on Matlab, whereas I chose to go quickly into practice for the localization and the
control of NAO (using C++, OpenCV and Python). However, for the localization, I also used
Matlab in order to better understand what lies inside the calibration and stereo-calibration.
I can conclude that this project allowed me to learn more about the use of video cameras for
robotic purposes. Indeed, recognition is useful at the beginning to track the desired object, and
localization enables the robot to interact with it using control : the behavior of the robot
depends on these three parts (recognition, localization and control).
Concerning my project, as I said when I described its steps, it's not finished : it needs some
implementation in C++ and some testing. I hope it will be completed. Contrary to SIFT, SURF is
available as open source, with OpenSURF for example : thus, the recognition I presented is not
only academic and could be used by companies.
5 References
– « 3D-Perception from Binocular Vision » (Suresh Kumar Pupala)
– Website of Utkarsh Sinha http://www.aishack.in/
– « A simple camera calibration method based on sub-pixel corner extraction of the
chessboard image » (Yang Xingfang, Huang Yumei and Gao Feng)
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5658280&tag=1
– University of Nevada, Reno
http://www.cse.unr.edu/~bebis/CS791E/Notes/EpipolarGeonetry.pdf
– « A simple rectification method of stereo image pairs » (Huihuang Su, Bingwei He)
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5678343
– OpenCV tutorials (Noah Kuntz) http://www.pages.drexel.edu/~nk752/tutorials.html
– Learning OpenCV (Gary Bradski and Adrian Kaehler)
– OpenCV 2.1 C Reference http://opencv.willowgarage.com/documentation/c/index.html
– Structure from Stereo Vision using Optical Flow (Brendon Kelly)
http://www.cosc.canterbury.ac.nz/research/reports/HonsReps/2006/hons_0608.pdf
– « Distinctive Image Features from Scale-Invariant Keypoints » (David G. Lowe)
http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf
– Feature Extraction and Image Processing (Mark S. Nixon and Alberto S. Aguado)
– Speeded-Up Robust Features (SURF) (Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool)
ftp://ftp.vision.ee.ethz.ch/publications/articles/eth_biwi_00517.pdf
– Aldebaran official website (Aldebaran) http://www.aldebaran-robotics.com/
– Aldebaran Robotics (Aldebaran)
http://users.aldebaran-robotics.com/docs/site_en/index_doc.html
– URBI (Gostai)
http://www.gostai.com/downloads/urbi-sdk/2.x/doc/urbi-sdk.htmldir/index.html#urbi-platforms.html
– NAO tutorials (Robotics Group of the University of León)
http://robotica.unileon.es/mediawiki/index.php/Nao_tutorial_1:_First_steps
6 Appendices
6.1 SIFT: a method of feature extraction
6.1.1. Definition and goal
The goal is to find keypoints which are invariant to the scale, rotation and illumination of the
object.
In pattern recognition, SIFT has two advantages:
– it can identify complex objects in a scene using specific keypoints
– it can identify the same object at several positions and rotations of the camera
However, it has one main disadvantage : the processing time. SURF (Speeded-Up Robust Features)
relies on the same principle as SIFT but is quicker; however, in order to be invariant in rotation,
it has to use FAST keypoint detection.
6.1.2. Description
The SIFT process can be divided into two main steps that we will study in detail:
– find the location and orientation of the keypoints (keypoint extraction)
– generate a specific descriptor for each one (descriptor generation)
Indeed, we first have to know the location of the keypoints on the image, but we then have
to build a very discriminating ID for each of them, which is why we need one descriptor per
keypoint. There can be many keypoints, so we have to be very selective in our choice.
In SIFT, one descriptor is a vector of 128 values.
6.1.3. Keypoint extraction
This first process is divided into two main steps:
– find the location of the keypoints
– find their orientation
These two steps have to be done very carefully because they have a big impact on the descriptor
generation that we will see later.
Firstly, in order to find their location, we have to work on the laplacian of gaussian.
6.1.3.1 Keypoint location
6.1.3.1.1 The laplacian of gaussian (LoG) operator
The laplacian operator is defined by the following formula :
$$L(x, y) = \Delta I(x, y) = \nabla^2 I(x, y) = \frac{\partial^2 I(x, y)}{\partial x^2} + \frac{\partial^2 I(x, y)}{\partial y^2}$$
The laplacian detects pixels where the intensity changes rapidly : they correspond to extrema
of the laplacian.
However, as it's a second-derivative measurement, it's very sensitive to noise, so before applying
the laplacian, it is usual to blur the image in order to remove the high-frequency noise.
In order to do this, we use a gaussian operator which blurs the image in 2D according to the
parameter σ :
$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
We call the resulting operator (gaussian followed by laplacian) the laplacian of gaussian (LoG).
We can calculate the laplacian of gaussian :
$$\Delta G(x, y, \sigma) = \frac{\partial^2 G(x, y, \sigma)}{\partial x^2} + \frac{\partial^2 G(x, y, \sigma)}{\partial y^2}$$
$$\Delta G(x, y, \sigma) = \frac{x^2 + y^2 - 2\sigma^2}{2\pi\sigma^6} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
Actually, in the SIFT method, we don't calculate the laplacian of gaussian; we use instead the
difference of gaussian (DoG), which has several advantages that we will see in the next part.
6.1.3.1.2 The operator used in SIFT : the difference of gaussian (DoG)
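Using the heat diffusion equation, we have an estimation of the laplacian of gaussian:
$$\frac{\partial G(x, y, \sigma)}{\partial \sigma} = \sigma \nabla^2 G(x, y, \sigma) \qquad (1)$$
where $\frac{\partial G(x, y, \sigma)}{\partial \sigma}$ can be approximated by a difference of gaussians:
$$\frac{\partial G(x, y, \sigma)}{\partial \sigma} \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma} \qquad (2)$$
From equations (1) and (2), we deduce:
$$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^2 \nabla^2 G(x, y, \sigma) \qquad (3)$$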
Using the difference of gaussian instead of the laplacian has three main advantages:
– it reduces the processing time because the laplacian requires two derivatives
– it's less sensitive to noise than the laplacian
– it's already scale invariant (normalization by σ² shown in equation (3)). Note that
the factor k−1 is constant in the SIFT method, so it doesn't have an impact on the location of
the extrema.
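The approximation of the scale-normalized laplacian of gaussian by a difference of gaussians can be checked numerically. The sketch below evaluates both sides at a few points; the function names, the sample points and the values σ = 1.6 and k = 1.05 are chosen for the illustration only (k is taken close to 1 so that the first-order approximation holds well, and points near the zero crossing x² + y² = 2σ² are avoided because both sides are close to zero there).

```python
import math


def gaussian(x, y, sigma):
    """2D Gaussian G(x, y, sigma)."""
    return (math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            / (2.0 * math.pi * sigma ** 2))


def laplacian_of_gaussian(x, y, sigma):
    """Closed-form Laplacian of the 2D Gaussian."""
    r2 = x * x + y * y
    return ((r2 - 2.0 * sigma ** 2) / (2.0 * math.pi * sigma ** 6)
            * math.exp(-r2 / (2.0 * sigma ** 2)))


def difference_of_gaussian(x, y, sigma, k):
    """G(x, y, k*sigma) - G(x, y, sigma)."""
    return gaussian(x, y, k * sigma) - gaussian(x, y, sigma)


SIGMA, K = 1.6, 1.05
samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 2.0), (0.5, 3.5)]
comparison = []
for x, y in samples:
    dog = difference_of_gaussian(x, y, SIGMA, K)
    approx = (K - 1.0) * SIGMA ** 2 * laplacian_of_gaussian(x, y, SIGMA)
    comparison.append((dog, approx))
```

At each sample the two values have the same sign and differ by a few percent, and the agreement improves as k gets closer to 1.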
6.1.3.1.3 The resulting scale space
In order to generate these differences of gaussian, we have to build a scale space. Instead of
blurring the raw image more and more, we resize it several times in order to reduce the
calculation time. Indeed, the more we blur, the more high-frequency details we lose, and the
better the image can be approximated by a smaller one.
Below, you can see the structure of the scale space for gaussians.
Several constants have to be specified :
– the number of octaves (octavesNb), specifying the number of scales to work with
– the number of intervals (intervalsNb), specifying the number of local extrema images to generate
for each octave
Of course, the more octaves and intervals there are, the more keypoints are found, but the longer
the processing time.
In order to generate intervalsNb local extrema images, we have to generate intervalsNb+2
Differences of Gaussians (DoG) and intervalsNb+3 Gaussians.
Indeed, two Gaussians are subtracted to generate one DoG, and three DoG are needed to find
local extrema.
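The bookkeeping of the scale space can be sketched as follows. The sketch assumes that consecutive Gaussians of an octave are separated by the constant factor k = 2^(1/intervalsNb) and that the absolute σ is doubled from one octave to the next; the function names and the values octaves_nb = 4 and intervals_nb = 2 are hypothetical, chosen for the illustration.

```python
def scale_space_sigmas(sigma0, octaves_nb, intervals_nb):
    """Absolute sigma of every Gaussian image, grouped by octave.

    Each octave holds intervals_nb + 3 Gaussians; octave o works on an image
    downsized by 2**o, so its absolute sigmas are scaled by 2**o.
    """
    k = 2.0 ** (1.0 / intervals_nb)
    return [[sigma0 * (2.0 ** octave) * (k ** i)
             for i in range(intervals_nb + 3)]
            for octave in range(octaves_nb)]


def image_counts(intervals_nb):
    """Number of Gaussians, DoG and local-extrema images per octave."""
    gaussians = intervals_nb + 3
    dogs = gaussians - 1          # one DoG per pair of adjacent Gaussians
    extrema = dogs - 2            # each extrema image needs 3 adjacent DoG
    return gaussians, dogs, extrema


sigmas = scale_space_sigmas(sigma0=1.6, octaves_nb=4, intervals_nb=2)
counts = image_counts(intervals_nb=2)
```

With two intervals, the third Gaussian of an octave has twice the starting σ, which is also the starting σ of the next octave : this is what allows the next octave to start from a downsized image instead of blurring further.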
[Figure : structure of the scale space. In each octave o (from 0 to octaveNb−1), the image has a size of 2^(−o) × size and is blurred by intervalsNb+3 Gaussians of parameter 2^o × 2^(i/intervalsNb) × σ, with i from 0 to intervalsNb+2. Adjacent Gaussians are subtracted to give intervalsNb+2 Differences of Gaussians, whose minima and maxima give intervalsNb local extrema images.]
Comment : prior smoothing
In order to increase the number of stable keypoints, a prior smoothing can be applied to the raw
image. David G. Lowe found experimentally that a prior smoothing of σ = 1.6 gives good keypoint
stability and that doubling the size of the raw image, which keeps the highest frequencies,
increases the number of stable keypoints by almost a factor of 4. Thus, two gaussians are applied :
– one before doubling the size of the image, for anti-aliasing (σ = 0.5)
– one after, for pre-blurring (σ = 1.0)
6.1.3.1.4 Keep stable keypoints
Local extrema can be numerous at each scale, so we have to be selective in our choice.
There are several ways to be more selective :
– keep keypoints whose DoG has enough contrast
– keep keypoints whose DoG is located on a corner
In order to find whether a keypoint is located on a corner, we use the Hessian matrix :
$$H(x, y) = \begin{pmatrix} \frac{\partial^2 I(x, y)}{\partial x^2} & \frac{\partial^2 I(x, y)}{\partial x \partial y} \\ \frac{\partial^2 I(x, y)}{\partial y \partial x} & \frac{\partial^2 I(x, y)}{\partial y^2} \end{pmatrix} = \begin{pmatrix} dxx & dxy \\ dxy & dyy \end{pmatrix}$$
Using the classical definition of the derivative, we can estimate the different derivatives of the
Hessian matrix :
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} \frac{f(x + h) - f(x - h)}{2h} \qquad (8)$$
Using (8) with h = 0.5, we deduce :
$$dx_{h=0.5}(x, y) = I(x + 0.5, y) - I(x - 0.5, y)$$
So,
$$dxx \approx dx_{h=0.5}(x + 0.5, y) - dx_{h=0.5}(x - 0.5, y) = I(x + 1, y) + I(x - 1, y) - 2 I(x, y)$$
We have similarly :
$$dyy \approx I(x, y + 1) + I(x, y - 1) - 2 I(x, y)$$
And then, in order to calculate dxy, we can use h = 1 :
$$dx_{h=1}(x, y) = \frac{I(x + 1, y) - I(x - 1, y)}{2} \quad \text{and} \quad dxy \approx \frac{dx_{h=1}(x, y + 1) - dx_{h=1}(x, y - 1)}{2}$$
So,
$$dxy \approx \frac{I(x + 1, y + 1) + I(x - 1, y - 1) - I(x + 1, y - 1) - I(x - 1, y + 1)}{4}$$
Then, we can express the trace and determinant of the Hessian matrix using its eigenvalues α
and β :
$$tr(H) = dxx + dyy = \alpha + \beta \quad \text{and} \quad det(H) = dxx \, dyy - (dxy)^2 = \alpha\beta$$
Thus, we can estimate a curvature value C, supposing α = r β where r >= 1 :
$$C = \frac{tr(H)^2}{det(H)} = \frac{(\alpha + \beta)^2}{\alpha\beta} = \frac{(r + 1)^2}{r} = f(r)$$
The function f is minimal when r = 1, i.e. when the two eigenvalues are equal. The closer they
are, the more the corresponding keypoint looks like a corner; the more they differ, the more it
looks like an edge. Thus, we can be more or less selective depending on the value of r we choose
to threshold C :
$$C < f(r_{threshold})$$
David G. Lowe used r = 10 in his experiments.
Moreover, in flat zones, one or both eigenvalues are near zero, so these zones can
be identified by det(H) ≈ 0.
Then, if det(H) < 0, the eigenvalues have opposite signs; in this case also, the point is
discarded as not being an extremum.
To conclude, after this step, we know the position of the keypoints on their specific octave and the
corresponding scale where they were found. The scale is calculated from the octave and
interval numbers.
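The corner test described in this part can be sketched with the finite differences given above. This is a minimal sketch, not the code of the project : the function names are hypothetical, the image is a list of rows indexed as img[y][x], and in SIFT the image passed in would be a difference of gaussian.

```python
def hessian(img, x, y):
    """Finite-difference second derivatives at (x, y); img[y][x] indexing."""
    dxx = img[y][x + 1] + img[y][x - 1] - 2.0 * img[y][x]
    dyy = img[y + 1][x] + img[y - 1][x] - 2.0 * img[y][x]
    dxy = (img[y + 1][x + 1] + img[y - 1][x - 1]
           - img[y - 1][x + 1] - img[y + 1][x - 1]) / 4.0
    return dxx, dyy, dxy


def is_corner_like(img, x, y, r_threshold=10.0):
    """True if C = tr(H)^2 / det(H) is below f(r_threshold) = (r+1)^2 / r."""
    dxx, dyy, dxy = hessian(img, x, y)
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False              # flat zone or eigenvalues of opposite sign
    return trace * trace / det < (r_threshold + 1.0) ** 2 / r_threshold


# A blob (peak in both directions) and a ridge (peak along one direction only).
blob = [[0, 0, 0],
        [0, 9, 0],
        [0, 0, 0]]
ridge = [[0, 9, 0],
         [0, 9, 0],
         [0, 9, 0]]

blob_kept = is_corner_like(blob, 1, 1)
ridge_kept = is_corner_like(ridge, 1, 1)
```

For the blob the two curvatures are equal, so C = 4, the minimum of f, and the point is kept. For the ridge one curvature is zero, so det(H) = 0 and the point is rejected as an edge.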
6.1.3.2 Keypoint orientation
Once the location of the keypoints is known (position on the octave and scale), we have to find
their orientation.
Finding the keypoint orientation can be explained in three steps :
– gradient parameters (magnitude and orientation) are calculated for all the pixels of the
gaussians
– feature windows are generated around each keypoint (one feature window for magnitude
and another for orientation)
– the orientation of each keypoint (center of the feature windows) is estimated by averaging
the orientations within its corresponding feature windows
6.1.3.2.1 Calculation of gradient parameters within gaussians
As mentionned presiously, the gradient parameters alude to the magnitude and orientation.
gradients allow to classify contrast changes on the image :
– the magnitude shows how big changes are
– the orientation aludes to the direction of the main change
Thus, a gradients can be represented as a vector whose position, magnitude and orientation are
specified.
Comment : gradients on the image are oriented from black to white.
[Figure: a gradient vector of magnitude M and orientation θ drawn in the (x, y) image frame.]
In order to calculate the gradient parameters, we only have to calculate the first derivative at each pixel of the image. Using equation (8) shown before, we deduce with h = 1:

$$d_x = \frac{\partial I(x, y)}{\partial x} \approx \frac{I(x+1, y) - I(x-1, y)}{2} \quad\text{and}\quad d_y = \frac{\partial I(x, y)}{\partial y} \approx \frac{I(x, y+1) - I(x, y-1)}{2}$$

Knowing dx and dy, we deduce the magnitude (M) and orientation (θ):

$$M = \sqrt{d_x^2 + d_y^2} \quad\text{and}\quad \theta = \arctan\left(\frac{d_y}{d_x}\right)$$
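As an illustration, the central differences and the magnitude and orientation formulas can be sketched with NumPy. This is only a minimal sketch: the function name `gradient_parameters` is hypothetical, the image is assumed to be indexed as `image[y, x]`, border pixels are simply left at zero, and `arctan2` is used as the four-quadrant variant of arctan(dy/dx).

```python
import numpy as np

def gradient_parameters(image):
    """Return (magnitude, orientation in radians) for every pixel."""
    img = np.asarray(image, dtype=float)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    # Central differences with h = 1; rows index y, columns index x.
    dx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0
    dy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0
    magnitude = np.hypot(dx, dy)
    orientation = np.arctan2(dy, dx)
    return magnitude, orientation
```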
6.1.3.2.2 Generation of keypoint feature windows
We now describe how the feature window is created.
Two parameters have to be studied:
– the gaussian to apply to the feature windows
– the size of the feature windows
➔ Why apply a gaussian ?
It gives a better average of the orientation of the keypoint. Indeed, the keypoint (located at the centre of the feature window) must have the largest weight, whereas the pixels located near the edges of the feature window must have the smallest weight. Then, instead of using the magnitude directly, we use a weight W:

$$W = G * M$$

where G is the gaussian mask and M the magnitudes of the entire image.
➔ How to choose the σ of the gaussian and the size of the feature window ?
The σ to choose depends on the current scale. Indeed, the more blurred the image is (i.e. the higher the scale, or the higher the octave and interval numbers), the fewer details it contains and the more pixels we need to estimate the average orientation.
David G. Lowe uses:

$$\sigma_{kpFeatureWindow} = 1.5\,\sigma$$

where σ is the absolute sigma of the image studied (depending on the scale).
The size of the feature window is chosen depending on $\sigma_{kpFeatureWindow}$. Indeed, the gaussian can be considered negligible beyond a certain distance from the centre of the feature window. Thus, the size can be approximated by:

$$size_{kpFeatureWindow} = 3\,\sigma_{kpFeatureWindow}$$
6.1.3.2.3 Averaging orientations within each feature window
From the feature windows we have generated, we can now estimate an average orientation for each keypoint.
This is done with a histogram of 36 bins.
Each bin covers a range of 10 degrees, so that the entire histogram covers orientations from 0 to 360 degrees.
If an orientation falls within a specific bin, its weight is added to that bin.
The maximum of the histogram gives the average orientation of the keypoint.
Comment : Accuracy
As we work with a histogram, the raw maximum is only accurate to 10 degrees, the width of a bin, which is not accurate enough.
Thus, in order to approximate the real maximum, we use the left and right neighbouring bins of the maximal one to calculate the parabola which goes through these three points. Thanks to the parabolic equation, it is then easy to calculate the maximum.
The accuracy can also be improved by making the bin filling more accurate. Indeed, as the orientation is a real value, it is always located between the middles of two bins. Thus, we have to compare this value with the middle of each neighbouring bin and spread the weight accordingly.
[Figure: orientation histogram (weight against orientation in degrees) showing bins n-1, n and n+1, their middles m_{n-1}, m_n and m_{n+1}, and the position bin(θ) of an orientation θ.]
As we can see in the diagram above, the weight we have to add to bin n-1 and bin n can be calculated easily:

$$weight_n = weight_n + \left|\frac{bin(\theta) - m_{n-1}}{m_n - m_{n-1}}\right| W(\theta) = weight_n + \left|bin(\theta) - m_{n-1}\right| W(\theta)$$

$$weight_{n-1} = weight_{n-1} + \left|bin(\theta) - m_n\right| W(\theta)$$
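The histogram filling and the parabolic refinement described in this section can be sketched as follows. This is a minimal illustration rather than the implementation used in the project: the function names are hypothetical, orientations are assumed to be given in degrees, and positions are expressed in bin units so that the distance between two bin middles is 1, as in the simplified formulas above.

```python
import numpy as np

def orientation_histogram(orientations_deg, weights, n_bins=36):
    """Fill the histogram, spreading each weight over the two nearest bins."""
    bin_width = 360.0 / n_bins
    hist = np.zeros(n_bins)
    for theta, w in zip(orientations_deg, weights):
        # Position in bin units, measured from the middle of bin 0.
        pos = (theta % 360.0) / bin_width - 0.5
        lower = int(np.floor(pos))      # bin n-1
        frac = pos - lower              # distance to the middle of bin n-1
        hist[lower % n_bins] += (1.0 - frac) * w
        hist[(lower + 1) % n_bins] += frac * w
    return hist

def refined_peak_deg(hist):
    """Fit a parabola through the maximal bin and its two neighbours."""
    n_bins = len(hist)
    bin_width = 360.0 / n_bins
    n = int(np.argmax(hist))
    left, centre, right = hist[(n - 1) % n_bins], hist[n], hist[(n + 1) % n_bins]
    denom = left - 2.0 * centre + right
    offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
    return ((n + 0.5 + offset) * bin_width) % 360.0
```

The modulo operations make bins 0 and 35 neighbours, so an orientation close to 0 degrees spreads its weight across the wrap-around.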
6.1.4. Descriptor generation
This second process is divided into several steps:
– calculate the interpolated gradient parameters of each gaussian (interpolated magnitudes and orientations)
– generate the descriptor feature windows
– generate the descriptor of each keypoint
6.1.4.1 Calculation of interpolated gradient parameters within gaussians
➔ Why calculate the interpolated gradient parameters ?
Contrary to the previous case with keypoints, in order to generate the descriptors, we consider the keypoint as a subpoint. We are not interested in the pixel at its location but in its neighbours.
[Figure: the keypoint (Kp) treated as a subpoint, surrounded by its four neighbouring pixels (Neig).]
Thus, we have to work on the subpixels of the image instead of working on the real pixels.
There are several ways to do so. We could either:
– first generate each interpolated gaussian and then calculate the gradient parameters
– or directly estimate the interpolated gradient parameters from the gaussians
We chose the second method.
The calculation of interpolated gradients is similar to the one used for the non-interpolated ones (keypoints). The x gradient dx and the y gradient dy are determined using the same formulas, except that we now have to use averages:

$$I_{inter}(x+1, y) = I(x+0.5, y) \approx \frac{I(x, y) + I(x+1, y)}{2}$$

$$I_{inter}(x-1, y) = I(x-1.5, y) \approx \frac{I(x-2, y) + I(x-1, y)}{2}$$

$$I_{inter}(x, y+1) = I(x, y+0.5) \approx \frac{I(x, y) + I(x, y+1)}{2}$$

$$I_{inter}(x, y-1) = I(x, y-1.5) \approx \frac{I(x, y-2) + I(x, y-1)}{2}$$

Thus, we deduce dx and dy for the interpolated case:

$$dx_{inter} \approx I_{inter}(x+1, y) - I_{inter}(x-1, y) \approx \frac{I(x+1, y) + I(x, y) - I(x-1, y) - I(x-2, y)}{2}$$

$$dy_{inter} \approx I_{inter}(x, y+1) - I_{inter}(x, y-1) \approx \frac{I(x, y+1) + I(x, y) - I(x, y-1) - I(x, y-2)}{2}$$
The calculation of the magnitude and orientation is exactly the same as previously with the keypoints.
6.1.4.2 Generation of descriptor feature windows
As we said before, we are interested in the subpoints around the keypoint. In order to have a symmetric window, the size of the feature window should be even.
David G. Lowe used:

$$size_{descFeatureWindow} = 16$$

This value is convenient for the generation of the subwindows, which we will see later. Indeed, this number is easily divisible by 4.
As for the keypoint case, we need a gaussian in order to calculate the weight within the descriptor feature window. This time, the size of the window we choose can help us to find a good gaussian.
David G. Lowe used:

$$\sigma_{descFeatureWindow} = \frac{size_{descFeatureWindow}}{2}$$
6.1.4.3 Generation of the descriptor components for each keypoint
Now that we have generated the feature windows for each of our keypoints, we can turn to the most important part: the descriptor generation.
This process can be divided into four steps:
– First of all, we rotate the feature window by the orientation θ of the corresponding keypoint (invariance to rotation).
– Then, we divide the feature window into 16 subwindows of the same size.
– For each subwindow, we fill an 8-bin histogram using the relative orientation and the magnitude of the (interpolated) gradients.
– We threshold (invariance to brightness) and normalize (invariance to contrast) the vector given by the histograms.
We will focus on the filling of the 8-bin histograms.
First of all, we have to notice that, in order to have invariance to rotation, we have to fill the histograms using the interpolated orientations minus the keypoint orientation:

$$\theta_{diff} = \theta_{inter} - \theta$$

Apart from this, the filling of the histograms uses the interpolated weight as in the keypoint case; however, we could also subtract the keypoint weight from the interpolated weight:

$$weight_{diff} = weight_{inter} - M$$

We can be more accurate by spreading the weight between two adjacent bins, as in the keypoint case.
[Figure: SIFT descriptor generation. Rotate the feature window by the orientation θ of the keypoint (Kp); split it into 16 subwindows; fill one 8-bin histogram for each subwindow, giving 128 (= 16x8) descriptor components; threshold and normalize the vector.]
6.1.5. Conclusion and drawbacks
To conclude, SIFT is divided into two steps:
– keypoint extraction: the location is found using a scale space of images and calculating the difference of Gaussians. The orientation is found using a feature window of the gradients around the keypoint: a histogram of 36 bins gives its value.
– descriptor generation: descriptors are generated using a feature window of the interpolated gradients around the keypoint. This feature window is rotated by the orientation of the keypoint and divided into 16 subwindows: for each subwindow, a histogram of 8 bins is generated using the relative orientation, giving 8 values. Then, the final vector (128 values) is thresholded and normalized.
The main drawback of SIFT is its speed: it is very slow, mainly because it needs to blur the image several times.
Besides this problem, the extracted keypoints are not invariant to all 3D rotations: they are mostly invariant to rotations within the plane of the image, and less so to others.
In order to improve the speed of SIFT for use in real-time applications, Herbert Bay et al. presented Speeded Up Robust Features (SURF) in 2006.
Contrary to SIFT, SURF works on the integral of the image, on a scale space of filters and on Haar wavelets for orientation assignment and descriptor generation, which makes it faster.
In the following, we will see in more detail how SURF works.
6.2 SURF : another method adapted for real time applications
6.2.1. keypoint extraction
As for SIFT, keypoint extraction consists of two steps:
– find the location of the keypoints
– find their main orientation
6.2.1.1 Keypoint location
6.2.1.1.1 LoG approximation : Fast-Hessian
In order to calculate the Laplacian of Gaussian (LoG) within the image, SURF uses a Fast-Hessian calculation.
Fast-Hessian focuses on the Hessian matrix generalised to Gaussian operators:

$$H(x, \sigma) = \begin{pmatrix} L_{xx}(x, \sigma) & L_{xy}(x, \sigma) \\ L_{xy}(x, \sigma) & L_{yy}(x, \sigma) \end{pmatrix}$$

where $L_{xx}$ is the convolution of the Gaussian second order derivative $\frac{\partial^2}{\partial x^2}$ with the image I at point x, and similarly for $L_{xy}$ and $L_{yy}$.
In order to extract keypoints that are invariant to scale, we study the determinant of the Hessian matrix:

$$\det(H) = L_{xx} L_{yy} - L_{xy}^2$$

As the determinant is the product of the eigenvalues, we can conclude that:
➢ if it's negative, the eigenvalues have opposite signs and the keypoint doesn't correspond to a local extremum.
➢ if it's positive, both eigenvalues have the same sign and the keypoint is a local minimum or maximum.
In order to calculate the values of the Hessian matrices (known as Laplacian of Gaussian), SURF doesn't use the Difference of Gaussians as SIFT does, but a mask that approximates the Gaussian second derivative.
In the figure below, you can see:
➢ On the first row, the real Gaussian second order derivative (from left to right: $L_{xx}$, $L_{yy}$ and $L_{xy}$).
➢ On the second row, the simplified Gaussian second order derivative used in SURF (from left to right: $D_{xx}$, $D_{yy}$ and $D_{xy}$).
The simplified mask improves the speed while keeping enough accuracy. Using this mask, the determinant has to be adjusted:

$$\det(H_{approx}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2 \quad\text{where}\quad H_{approx}(x, \sigma) = \begin{pmatrix} D_{xx}(x, \sigma) & D_{xy}(x, \sigma) \\ D_{xy}(x, \sigma) & D_{yy}(x, \sigma) \end{pmatrix}$$
6.2.1.1.2 Work on an integral image
SURF applies the simplified Gaussian second order derivative D not directly to the image but to the integral image, which reduces the number of calculations.
Indeed, working with an integral image makes the cost of a calculation independent of the size of the area studied.
The integral image $I_\Sigma$ is calculated as the sum of the values between the point and the origin:

$$I_\Sigma(x, y) = \sum_{i=0}^{i<x} \sum_{j=0}^{j<y} I(i, j)$$

Working with an integral image, we need only four values to calculate the sum over any rectangular area of the original image:

$$\Sigma = A + D - (C + B)$$

where A, B, C and D are the values of the integral image at the four corners of the area.
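A small NumPy sketch can make the four-lookup rule concrete. It assumes the exclusive definition given above (the sum covers i < x and j < y), which is obtained here by padding the cumulative sums with a leading row and column of zeros; the function names are hypothetical.

```python
import numpy as np

def integral_image(image):
    """ii[y, x] holds the sum of image[:y, :x] (exclusive bounds)."""
    img = np.asarray(image, dtype=float)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of image[y0:y1, x0:x1] from four lookups: A + D - (C + B)."""
    a = ii[y0, x0]  # top-left corner
    b = ii[y0, x1]  # top-right corner
    c = ii[y1, x0]  # bottom-left corner
    d = ii[y1, x1]  # bottom-right corner
    return a + d - (c + b)
```

The cost of `box_sum` does not depend on the size of the rectangle, which is the property SURF relies on when it enlarges its masks.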
6.2.1.1.3 Scale space
As for SIFT, SURF has to use a scale space in order to apply its Fast-Hessian operation.
As we saw previously, SIFT blurs the image within the same octave and reduces the size of the image to go to the next octave: this method is time consuming!
SURF doesn't modify the image but the mask it uses (the simplified Gaussian second order derivative D). Indeed, as it works on an integral image, the number of calculations is independent of the size of the mask: an area always requires four operations.
Thus, in SURF, in order to build a scale space over several octaves, it's necessary to increase the size of the mask D: each octave corresponds to a doubling of scale.
The minimum size of D is determined using a real gaussian of σ = 1.2. The smallest D is represented in the figure above. For example, $D_{xx}$ has three lobes, each three pixels wide and five pixels high.
In order to calculate the approximate scale $\sigma_{approx}$, we can use a proportional relation between the real operator L and the approximated one D:

$$\sigma_{approx} = size_{mask}\, \frac{\sigma_{mask-base}}{size_{mask-base}} = size_{mask}\, \frac{1.2}{9}$$

In order to increase the scale of D, the minimum step is to add six pixels to the longer length (as it's necessary to have a pixel at the centre and three lobes of identical size) and four to the shorter length (to keep the structure).
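As a worked example of the proportional relation, the sketch below lists the first mask sizes obtained with the minimum step of six pixels and the scale each one approximates. The helper name `approx_scale` is hypothetical; the base values 9 and 1.2 are those given above.

```python
def approx_scale(mask_size, base_size=9, base_sigma=1.2):
    """Approximate scale of a mask, proportional to its size."""
    return mask_size * base_sigma / base_size

# Smallest mask (9 pixels), then three increases of six pixels each.
sizes = [9 + 6 * k for k in range(4)]
scales = [approx_scale(s) for s in sizes]
# sizes  -> 9, 15, 21, 27
# scales -> approximately 1.2, 2.0, 2.8, 3.6
```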
6.2.1.1.4 Thresholding and non maximal suppression
Similarly to SIFT, after applying the Fast-Hessian with a specific mask D, we reject the pixels that do not have enough intensity.
Then, it's necessary to compare each remaining pixel with its neighbours (eight in the same scale and nine in each of the previous and next scales).
6.2.1.2 Keypoint orientation : Haar wavelets
6.2.1.2.1 Haar wavelets
Haar wavelets are simple filters which can be used to find gradients in the x and y directions.
In the diagram below, you can see from left to right: a Haar wavelet filter for the x-direction gradient and another one for the y-direction gradient.
The white side has a weight of 1 whereas the black one has a weight of -1. Working with integral images is advantageous because only 6 operations are needed to convolve a Haar wavelet with the image.
6.2.1.2.2 Calculation of the orientation
To determine the orientation, Haar wavelet responses of size 4σ are calculated for a set of pixels within a radius of 6σ of the detected point (σ refers to the scale of the corresponding keypoint). The set of pixels is sampled with a step of σ.
As for SIFT, the responses are weighted by a Gaussian centred at the keypoint, chosen with $\sigma_{Gaussian} = 2.5\,\sigma$.
In order to find the main orientation, the x and y responses of the Haar wavelets are combined into a response vector. The dominant orientation is selected by rotating a circle segment covering an angle of π/3 around the origin. At each position of the segment, the response vectors inside it are summed to form a new vector: at the end of the rotation, the longest vector gives the main orientation of the keypoint.
6.2.2. Descriptor generation : Haar wavelets
As it did to calculate the main orientation of the keypoints, SURF uses Haar wavelets again to generate the descriptor.
The descriptor is calculated within a 20σ window which is rotated by the main orientation in order to make it invariant to rotation. The descriptor window is divided into 4 by 4 subregions where Haar wavelet responses of size 2σ are calculated. As for the main orientation calculation, the set of pixels used is sampled with a step of σ.
For each subregion, a vector of four values is estimated:

$$V_{subregion} = \left( \sum d_x,\; \sum d_y,\; \sum |d_x|,\; \sum |d_y| \right)$$

where dx and dy are respectively the x-response and y-response of each set of pixels within the subwindow.
Thus, we conclude that SURF has a descriptor of 4 x 4 x 4 = 64 values, which is fewer than the 128 values of SIFT.
As for SIFT, in order to make the descriptor invariant to brightness and contrast, it's necessary to threshold it and normalize it to unit length.
[Figure: SURF descriptor generation. Rotate the feature window by the orientation θ of the keypoint (Kp); split it into 16 subwindows; calculate the 4-value vector for each subwindow, giving 64 (= 16x4) descriptor components; threshold and normalize the vector.]
6.2.3. Conclusion
To conclude, like SIFT, SURF is divided into two steps:
– keypoint extraction: the location is found using a scale space of specific masks and by calculating the determinant of the Hessian matrix at each point. The orientation is generated using Haar wavelets to calculate the gradients within feature windows around the keypoints: the value is calculated by estimating average gradient vectors.
– descriptor generation: descriptors are generated using a feature window of gradients (also given by Haar wavelets). This feature window is rotated by the orientation of the keypoint and divided into 16 subwindows: 4 values are calculated for each subwindow to build the vector. The final vector (64 values) is then thresholded and normalized.
The fact that SURF works on integral images and uses simplified masks reduces the computation and makes the algorithm quicker to execute.
6.3 Mono-calibration
➢ Why mono-calibrate ?
The goal of mono-calibration is to find the extrinsic and intrinsic parameters of the camera:
– knowing the distortion coefficients, it will be possible to undistort the image
– knowing the focal length, we will be able to measure dimensions in units of distance within the image
– knowing the extrinsic parameters, it will be possible to rectify images from a stereovision system (stereo-calibration)
➢ What are the different methods of calibration ?
There are several ways to calibrate a camera:
– photogrammetric calibration
– self-calibration
Photogrammetric calibration needs to know the exact geometry of the object in 3D space. Thus, this method isn't flexible in an unknown environment and, moreover, it's difficult to implement because of the need for precise 3D objects.
Self-calibration doesn't need any measurements and produces more accurate results. It is the method chosen here, calibrating with a chessboard which has simple repetitive patterns.
➢ Chessboard corner detection
A simple way to calibrate is to use a chessboard, which allows the position of the corners in the image to be known accurately.
There are two main steps in this process:
– the generation of the 3D corner points and of the 2D corner points
– the two-step calibration
For the two-step calibration, I chose to use the Tsai method, which is a simple way to find the extrinsic and intrinsic parameters.
6.3.1. Generation of the 3D corner points and of the 2D corner points
Firstly, it's necessary to generate the 3D corner points and the 2D corner points within the image in order to use them for the two-step calibration.
6.3.1.1 3D corner points
The idea is to choose the chessboard as the reference. In this way, it's easy to generate the 3D corner points expressed in the world frame.
As you can see on the diagram above, the chessboard is supposed fixed and the camera moving, whereas in reality the camera is fixed and the user moves the chessboard.
Thus, the 3D points need to be generated only once, depending on the dimensions of the rectangles on the chessboard (dx, dy). We will see later that only the relation between dx and dy (rectangle ratio) matters: if we work with squares, the ratio will be 1 and the calibration will definitely be a self-calibration.
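Generating the 3D corner points once can be sketched as below. The sketch assumes that the world frame is attached to the chessboard with all corners in the plane Z = 0 and the first corner at the origin; the function name and the row-by-row ordering of the corners are illustrative choices, not taken from the project.

```python
import numpy as np

def chessboard_object_points(n_cols, n_rows, dx=1.0, dy=1.0):
    """Return an (n_cols * n_rows, 3) array of corner points, row by row."""
    xs, ys = np.meshgrid(np.arange(n_cols) * dx, np.arange(n_rows) * dy)
    zs = np.zeros(n_cols * n_rows)
    return np.column_stack([xs.ravel(), ys.ravel(), zs])
```

With square cells, dx and dy can both be left at 1, since only their ratio matters.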
6.3.1.2 2D corner points
The 2D corner points on the image can be detected in two ways:
– manual selection
– automatic detection
[Figure: the world frame (O_w, X_w, Y_w, Z_w) attached to the chessboard, whose rectangles have dimensions dx and dy, and the camera frames (O_cam1, X_cam1, Y_cam1, Z_cam1) and (O_cam2, X_cam2, Y_cam2, Z_cam2) for two views.]
With manual selection, the user has to select each of the corners on the image. This method can be very laborious if the chessboard is big or if the user wants to register a lot of positions of the chessboard. Moreover, this method is not accurate and often needs checking by the software. The software can, for example, use the first points given by the user to generate the others using the recurrence of the pattern; the user can then check whether the result is valid.
With automatic detection, a corner detection technique is used. I will not go into details in this report, but an interesting method is to use the SUSAN operator combined with a ring shaped operator.
6.3.2. Two steps calibration
Using the Tsai method, we use two least squares solutions in order to find the projection matrix, the extrinsic parameters and the distortions.
We start from the equation presented in the last part (describing the parameters of the camera); this relation links a 2D pixel on the image with a 3D point (in distance units) within the world:

$$p_{img} = s \begin{pmatrix} f_x & f_x \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} dr & 0 & dt_x \\ 0 & dr & dt_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world} = M\, p_{obj/world}$$

We have to separate this into two problems:
– find the projection and extrinsic parameters
– find the distortions
6.3.2.1 First least-square solution (projection matrix and extrinsic parameters)
Firstly, we neglect the distortions in order to make the equation simpler:

$$p_{img} = s \begin{pmatrix} f_x & f_x \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world}$$

$$p_{img} = s\, M\, p_{obj/world} \tag{1}$$

where

$$p_{img} = \begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix}, \quad M = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \quad\text{and}\quad p_{obj/world} = \begin{pmatrix} X_{obj/world} \\ Y_{obj/world} \\ Z_{obj/world} \\ 1 \end{pmatrix}$$

The scale s can be easily deduced from M:

$$s = \frac{1}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$

Thus, the equation (1) can be also expressed as:

$$(m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}) \begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix} = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \begin{pmatrix} X_{obj/world} \\ Y_{obj/world} \\ Z_{obj/world} \\ 1 \end{pmatrix}$$

$$m_{11} X_{obj/world} + m_{12} Y_{obj/world} + m_{13} Z_{obj/world} + m_{14} - m_{31} X_{obj/world}\, x_{img} - m_{32} Y_{obj/world}\, x_{img} - m_{33} Z_{obj/world}\, x_{img} = m_{34}\, x_{img} \tag{2}$$

$$m_{21} X_{obj/world} + m_{22} Y_{obj/world} + m_{23} Z_{obj/world} + m_{24} - m_{31} X_{obj/world}\, y_{img} - m_{32} Y_{obj/world}\, y_{img} - m_{33} Z_{obj/world}\, y_{img} = m_{34}\, y_{img} \tag{3}$$
We can express the two relations above using a matrix A and a vector m:

$$A\, m = \begin{pmatrix} m_{34}\, x_{img} \\ m_{34}\, y_{img} \end{pmatrix} = p_{img}^{0}$$

where:

$$A = \begin{pmatrix} X_{obj/world} & Y_{obj/world} & Z_{obj/world} & 1 & 0 & 0 & 0 & 0 & -X_{obj/world}\, x_{img} & -Y_{obj/world}\, x_{img} & -Z_{obj/world}\, x_{img} \\ 0 & 0 & 0 & 0 & X_{obj/world} & Y_{obj/world} & Z_{obj/world} & 1 & -X_{obj/world}\, y_{img} & -Y_{obj/world}\, y_{img} & -Z_{obj/world}\, y_{img} \end{pmatrix}$$

$$m = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} & m_{21} & m_{22} & m_{23} & m_{24} & m_{31} & m_{32} & m_{33} \end{pmatrix}^T$$

Thus, we can calculate m using the least squares method (pseudo inverse calculation):

$$m = \left( (A^T A)^{-1} A^T \right) p_{img}^{0}$$
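The first least squares solution can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the project's code: the function name `estimate_projection` is hypothetical, m34 is fixed to 1 as described in the text, and `numpy.linalg.lstsq` is used in place of the explicit pseudo inverse because it returns the same least squares solution with better numerical behaviour.

```python
import numpy as np

def estimate_projection(points_3d, points_2d):
    """Estimate the 3x4 matrix M (with m34 = 1) from 3D/2D correspondences."""
    rows, rhs = [], []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        # Two rows of A per correspondence, one for x_img and one for y_img.
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -X * x, -Y * x, -Z * x])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -X * y, -Y * y, -Z * y])
        rhs.extend([x, y])  # right-hand side m34 * (x_img, y_img) with m34 = 1
    A = np.array(rows, dtype=float)
    b = np.array(rhs, dtype=float)
    m, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(m, 1.0).reshape(3, 4)
```

Each correspondence contributes two rows, so the eleven unknowns require at least six points.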
At the beginning, we have to suppose $m_{34} = 1$, but we will calculate its real value later. When we modify $m_{34}$, it will be necessary to multiply all the other coefficients by the same factor, as they are directly proportional.
Thanks to this first solution, we have an idea of the coefficients of M, but in order to calculate a correct least squares solution, we have to know at least 6 corner features (corresponding 2D and 3D corner points); the more corner features are recorded, the better the approximation of M is.
From a global viewpoint, if we consider K as the number of chessboard views to process and N the number of corner points (pairs of 2D and 3D corner points), we have:
– 2NK constraints (2 because of 2D coordinates)
– 6K unknown extrinsic parameters (3 for rotation and 3 for translation)
– 5 unknown projection parameters ($f_x$, $f_y$, $\alpha_c$, $c_x$ and $c_y$)
Thus, in order to find a solution, we need:

$$2NK \geq 6K + 5$$

So, if we choose to work with only 1 chessboard view, we need at least 6 corner points (2N ≥ 11), as we showed before, whereas using 2 chessboard views, 5 corners are enough (4N ≥ 17).
However, in reality, 1 chessboard plane offers only 4 corner points worth of information whatever the number of corner points chosen; that is why we have to work with at least 3 chessboard views.
In practice, because of noise and numerical stability, we have to take more points and more chessboard views in order to have more accuracy: 10 views of a 7-by-8 or larger chessboard is a good choice.
The next step is to split the matrix M into two matrices: the projection matrix and the extrinsic matrix.
A simple way is to split the matrix M into $M_r$ and $M_t$ in order to use the properties of orthogonal matrices:

$$M = \langle M_r \mid M_t \rangle = \langle P_{cam-px} R_{cam} \mid -P_{cam-px} R_{cam} T_{cam} \rangle$$

As $R_{cam}$ is a rotation matrix, $R_{cam} R_{cam}^T = I$.
Thus, we deduce:

$$M_r M_r^T = P_{cam-px} R_{cam} (P_{cam-px} R_{cam})^T = P_{cam-px} R_{cam} R_{cam}^T P_{cam-px}^T = P_{cam-px} P_{cam-px}^T$$

By calculating the coefficients of $P_{cam-px} P_{cam-px}^T$, it's possible to estimate the projection parameters:

$$P_{cam-px} P_{cam-px}^T = \begin{pmatrix} (1 + \alpha_c^2) f_x^2 + c_x^2 & \alpha_c f_x f_y + c_x c_y & c_x \\ \alpha_c f_x f_y + c_x c_y & f_y^2 + c_y^2 & c_y \\ c_x & c_y & 1 \end{pmatrix}$$

We can notice a condition which allows us to find the real value of $m_{34}$:

$$P_{cam-px} P_{cam-px}^T(3,3) = M_r M_r^T(3,3) = 1 \Leftrightarrow m_{34} = \sqrt{\frac{1}{\left| M_r M_r^T(3,3) \right|}}$$

Thus, we have to normalize the matrix M by multiplying it by the coefficient $m_{34}$.
After the normalization, we can solve the system of equations:

$$c_x = M_r M_r^T(1,3) \qquad c_y = M_r M_r^T(2,3)$$

$$f_y = \sqrt{M_r M_r^T(2,2) - c_y^2} \qquad f_x = \sqrt{M_r M_r^T(1,1) - c_x^2 - k^2} \quad\text{and}\quad \alpha_c = \frac{k}{f_x} \quad\text{with}\quad k = \frac{M_r M_r^T(1,2) - c_x c_y}{f_y}$$

When the projection matrix is known, it's easy to deduce the extrinsic matrix (rotation and translation matrices of the camera):

$$R_{cam} = P_{cam-px}^{-1} M_r \quad\text{and}\quad T_{cam} = -R_{cam}^T P_{cam-px}^{-1} M_t$$
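The splitting of M can be illustrated with the following NumPy sketch, which applies the formulas above in order. It is only a sketch under assumptions: the function name `decompose_projection` is hypothetical, and the overall sign of M is assumed to be such that the recovered matrix is a proper rotation (the normalization only fixes the scale up to sign).

```python
import numpy as np

def decompose_projection(M):
    """Split M = <P R | -P R T> into projection P, rotation R and translation T."""
    M = np.asarray(M, dtype=float)
    Mr = M[:, :3]
    # Normalization so that (Mr Mr^T)(3,3) = 1.
    M = M / np.sqrt(abs((Mr @ Mr.T)[2, 2]))
    Mr, Mt = M[:, :3], M[:, 3]
    G = Mr @ Mr.T                       # equals P P^T after normalization
    cx, cy = G[0, 2], G[1, 2]
    fy = np.sqrt(G[1, 1] - cy ** 2)
    k = (G[0, 1] - cx * cy) / fy        # k = alpha_c * f_x
    fx = np.sqrt(G[0, 0] - cx ** 2 - k ** 2)
    P = np.array([[fx, k, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])
    R = np.linalg.solve(P, Mr)          # P^-1 Mr
    T = -R.T @ np.linalg.solve(P, Mt)   # -R^T P^-1 Mt
    return P, R, T
```

The skew coefficient follows as `P[0, 1] / P[0, 0]`, matching the relation between α_c, k and f_x given above.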
6.3.2.2 Second least square solution (distortions)
Thanks to the matrix M previously calculated, we can estimate the ideal 2D corner points $p_{img}^{i}$ ($x_{img}^{i}$, $y_{img}^{i}$) using the equations (2) and (3):

$$x_{img}^{i} = \frac{m_{11} X_{obj/world} + m_{12} Y_{obj/world} + m_{13} Z_{obj/world} + m_{14}}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$

$$y_{img}^{i} = \frac{m_{21} X_{obj/world} + m_{22} Y_{obj/world} + m_{23} Z_{obj/world} + m_{24}}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$

Here, we have to consider the normalized points before the projection, $p_{img0}$ and $p_{img0}^{i}$:

$$p_{img0} = \frac{P_{cam-px}^{-1}}{m_{34}}\, p_{img} \quad\text{and}\quad p_{img0}^{i} = \frac{P_{cam-px}^{-1}}{m_{34}}\, p_{img}^{i}$$

Then, using the deformation equation:

$$p_{img0} = dr\, p_{img0}^{i} + \begin{pmatrix} dt_x \\ dt_y \\ 1 \end{pmatrix} \Leftrightarrow \Delta p_{img0} = (p_{img0} - p_{img0}^{i}) = (dr_1 r^2 + dr_2 r^4 + dr_3 r^6 + \dots + dr_n r^{2n})\, p_{img0}^{i} + \begin{pmatrix} dt_x \\ dt_y \\ 1 \end{pmatrix}$$

with $r = \sqrt{(x_{img0}^{i})^2 + (y_{img0}^{i})^2}$ and

$$dt_x = 2\, dt_1\, x_{img0}^{i}\, y_{img0}^{i} + dt_2 \left( r^2 + 2 (x_{img0}^{i})^2 \right) \qquad dt_y = dt_1 \left( r^2 + 2 (y_{img0}^{i})^2 \right) + 2\, dt_2\, x_{img0}^{i}\, y_{img0}^{i}$$

In order to calculate $k_c = (dr_1, dr_2, \dots, dr_n, dt_1, dt_2)^T$, we have to express the equation above using a matrix $A_d$; the principle is the same as previously with A:

$$\Delta p_{img0} = A_d\, k_c \quad\text{where}\quad A_d = \begin{pmatrix} r^2 x_{img0}^{i} & r^4 x_{img0}^{i} & \dots & r^{2n} x_{img0}^{i} & 2 x_{img0}^{i} y_{img0}^{i} & r^2 + 2 (x_{img0}^{i})^2 \\ r^2 y_{img0}^{i} & r^4 y_{img0}^{i} & \dots & r^{2n} y_{img0}^{i} & r^2 + 2 (y_{img0}^{i})^2 & 2 x_{img0}^{i} y_{img0}^{i} \end{pmatrix}$$

As before, the least-squares solution of this equation is as follows:

$$k_c = \left( (A_d^T A_d)^{-1} A_d^T \right) \Delta p_{img0}$$

The minimal number of points required depends on the maximal degree we want for the radial distortions. Stopping at the sixth degree offers enough precision (three radial coefficients); so, in this case, we need to work with at least three 2D corner points. As before, the more points we have, the better the least-squares solution is.
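The second least squares solution can be sketched in the same way as the first one. The sketch below assumes that the ideal and observed points are already normalized (as $p_{img0}^{i}$ and $p_{img0}$ above) and given as arrays of (x, y) pairs; the function name `estimate_distortion` is hypothetical, only the x and y rows of the deformation equation are used, and `numpy.linalg.lstsq` again stands in for the explicit pseudo inverse.

```python
import numpy as np

def estimate_distortion(ideal, distorted, n_radial=3):
    """Return k_c = (dr_1, ..., dr_n, dt_1, dt_2) from normalized point pairs."""
    ideal = np.asarray(ideal, dtype=float)
    delta = np.asarray(distorted, dtype=float) - ideal
    x, y = ideal[:, 0], ideal[:, 1]
    r2 = x ** 2 + y ** 2
    radial = [r2 ** (i + 1) for i in range(n_radial)]   # r^2, r^4, ..., r^2n
    # First row of A_d for every point (x components).
    rows_x = np.column_stack([p * x for p in radial] + [2 * x * y, r2 + 2 * x ** 2])
    # Second row of A_d for every point (y components).
    rows_y = np.column_stack([p * y for p in radial] + [r2 + 2 * y ** 2, 2 * x * y])
    A_d = np.vstack([rows_x, rows_y])
    rhs = np.concatenate([delta[:, 0], delta[:, 1]])
    k_c, *_ = np.linalg.lstsq(A_d, rhs, rcond=None)
    return k_c
```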
6.3.3. A main use of mono-calibration: undistortion
As we said previously, a main use of mono-calibration is to undistort the images. To do so, we use the distortion matrix $M_d$ and the projection matrix $P_{cam-px}$ in order to calculate the ideal pixel from the distorted one.
If we consider $p_{img}^{i}$ as the ideal pixel and $p_{img}^{d}$ as the distorted one, we have:

$$p_{img}^{i} = s\, P_{cam-px}\, p_{obj/cam} \quad\text{and}\quad p_{img}^{d} = s\, P_{cam-px}\, M_d\, p_{obj/cam}$$

The idea is to calculate the distorted pixel from the ideal one:

$$p_{img}^{d} = s\, P_{cam-px}\, M_d\, \frac{1}{s}\, P_{cam-px}^{-1}\, p_{img}^{i} = P_{cam-px}\, M_d\, P_{cam-px}^{-1}\, p_{img}^{i}$$

We can't calculate the ideal pixel directly because the distortion matrix $M_d$ is calculated from the ideal pixels. Because of this problem, we need to use image mapping:

$$x_{img}^{d} = map_x(x_{img}^{i}, y_{img}^{i}) \quad\text{and}\quad y_{img}^{d} = map_y(x_{img}^{i}, y_{img}^{i})$$

Thus, in order to find the ideal image, we remap:

$$(x_{img}^{i}, y_{img}^{i}) = \left( map_x(x_{img}^{i}, y_{img}^{i}),\; map_y(x_{img}^{i}, y_{img}^{i}) \right)$$

In other words, each pixel of the ideal image takes the value of the distorted image at the position given by the two maps.
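The remapping step can be illustrated with a nearest-neighbour sketch. It assumes that `map_x` and `map_y` are arrays with the shape of the ideal image, holding for each ideal pixel the coordinates of the corresponding distorted pixel; the function name is hypothetical, and a real implementation would usually interpolate between pixels instead of rounding.

```python
import numpy as np

def remap_nearest(distorted_image, map_x, map_y):
    """Build the ideal image by reading the distorted image at (map_x, map_y)."""
    img = np.asarray(distorted_image)
    # Round to the nearest pixel and keep the coordinates inside the image.
    xs = np.clip(np.rint(map_x).astype(int), 0, img.shape[1] - 1)
    ys = np.clip(np.rint(map_y).astype(int), 0, img.shape[0] - 1)
    return img[ys, xs]
```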
6.4 Stereo-calibration
Stereo-calibration deals with the calibration of a system made up of several cameras. Thus, it focuses on the location of each camera relative to the others in order to establish relations between them.
In the following, because it's easier and because the tool was available in the laboratory, we focus only on a stereovision system made up of two cameras (one on the left and the other on the right).
6.4.1. Epipolar geometry
Epipolar geometry is the geometry used in stereovision. We first have to focus on it in order to better understand the stereovision system.
On the diagram below, the left camera parameters carry the index l and the right ones the index r.
P is an object point visible to the two cameras.
$O_i$ is the centre, $P_i$ the position relative to the object P, $\pi_i$ the image plane and $p_i$ the image projection of each camera (i = l, r).
$e_l$ and $e_r$ are respectively the epipoles of the left and right cameras. Their position is defined by the intersection between the line $O_l O_r$ and each of the image planes $\pi_l$ and $\pi_r$. If the cameras are stationary, they are always located at the same place within each image.
The epipolar lines $l_l$ and $l_r$ are defined respectively by the segments $e_l p_l$ and $e_r p_r$.
Using this geometry, we can define two interesting matrices which link the parameters of the two cameras:
– the essential matrix E links the camera positions $P_l$ and $P_r$
– the fundamental matrix F links the image projections $p_l$ and $p_r$
6.4.1.1 The essential matrix E
The essential matrix E describes the extrinsic relations between the left and right cameras (due to the location of each of them).
Indeed, each of the cameras is defined by its position relative to the object, as we saw during the mono-calibration:

$$P_l = R_l \langle I \mid -T_l \rangle P \quad\text{and}\quad P_r = R_r \langle I \mid -T_r \rangle P$$

We deduce a relative relation from the left camera to the right one, $P_{l \to r}$:

$$P_{l \to r} = P_r P_l^{-1} \quad\text{or}\quad P_{l \to r} = R_{l \to r} \langle I \mid -T_{l \to r} \rangle P$$

with $R_{l \to r} = R_r R_l^T$ and $T_{l \to r} = T_l - T_r$
Using a coplanarity relation, we can calculate the essential matrix E. The three vectors $PO_l$, $PO_r$ and $O_l O_r$ shown on the diagram below have to be coplanar; we deduce these relations:

$$PO_r \cdot (O_l O_r \wedge PO_l) = 0 \Leftrightarrow P_r^T (T_{l \to r} \wedge P_l) = 0$$
[Figure: the object point P, the camera centres O_l and O_r, the positions P_l and P_r, and the relative rotation R_{l→r} and translation T_{l→r} between the two cameras.]
Using the diagram above, we can see that, in terms of vectors, we have:
$$P_r = P_l - T_{l \to r}$$
Moreover, we can express $P_r$ as a function of $P_{l \to r}$ using the definitions of $P_l$ and $P_r$:
$$P_r = P_{l \to r} P_l = R_{l \to r} \langle I \mid -T_{l \to r} \rangle P_l = R_{l \to r} (P_l - T_{l \to r})$$
Thanks to the two equations mentioned above, we can simplify the coplanarity relation:
$$P_r^T (T_{l \to r} \wedge P_l) = 0 \Leftrightarrow (P_l - T_{l \to r})^T (T_{l \to r} \wedge P_l) = 0 \Leftrightarrow (R_{l \to r}^T P_r)^T (T_{l \to r} \wedge P_l) = 0$$
Finally, we can define the essential matrix $E$:
$$(R_{l \to r}^T P_r)^T (T_{l \to r} \wedge P_l) = P_r^T R_{l \to r} (T_{l \to r} \wedge P_l) = P_r^T R_{l \to r} S P_l = 0 \quad \text{and} \quad E = R_{l \to r} S$$
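where $S$ expresses the vector product between $T_{l \to r} = (t_x, t_y, t_z)^T$ and $P_l$:
$$S = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix} \quad \text{with } \operatorname{rank}(S) = 2$$
Thus, the positions of the left and right cameras are linked as follows:
$$P_r^T E P_l = 0$$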
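As an illustration, the short sketch below builds $S$ and $E = R_{l \to r} S$ for a made-up relative pose and checks numerically that $P_r^T E P_l$ vanishes when $P_r = R_{l \to r}(P_l - T_{l \to r})$. The numeric values and the helper names (`skew`, `matmul`, `matvec`, `rotation_y`) are assumptions chosen for the example; plain Python lists are used only to keep it dependency-free.

```python
import math

def skew(t):
    """Return S such that S v equals the vector product t ^ v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Product of a matrix with a column vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def rotation_y(angle):
    """Rotation about the y axis, used here as a hypothetical R_l->r."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

# Hypothetical relative pose of the two cameras (illustrative values only).
R_lr = rotation_y(0.1)
T_lr = [0.12, 0.01, 0.02]

# Essential matrix E = R_l->r S.
E = matmul(R_lr, skew(T_lr))

# An object point expressed in the left camera frame, then in the right one,
# using P_r = R_l->r (P_l - T_l->r).
P_l = [0.3, -0.2, 1.5]
P_r = matvec(R_lr, [p - t for p, t in zip(P_l, T_lr)])

# Coplanarity constraint: P_r^T E P_l should vanish up to rounding error.
residual = dot(P_r, matvec(E, P_l))
print("constraint satisfied:", abs(residual) < 1e-12)
```

The check works for any point $P_l$ because $T_{l \to r} \wedge P_l$ is orthogonal to both $T_{l \to r}$ and $P_l$.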
6.4.1.2 The fundamental matrix F
The fundamental matrix $F$ takes into account the intrinsic parameters of each camera in order to link their pixels.
Each camera has an intrinsic matrix ($M_l$ and $M_r$) linking its image projection and its position relative to the object:
$$p_l = M_l P_l \quad \text{and} \quad p_r = M_r P_r$$
As we saw with the mono-calibration, $M_l$ and $M_r$ allude to the projection and distortion matrices of each camera.
Thus, the previous coplanarity relation can be written as below:
$$P_r^T E P_l = 0 \Leftrightarrow (M_r^{-1} p_r)^T E (M_l^{-1} p_l) = p_r^T (M_r^{-1})^T E M_l^{-1} p_l = 0$$
$F$ is defined as:
$$F = (M_r^{-1})^T E M_l^{-1} = (M_r^{-1})^T R_{l \to r} S M_l^{-1}$$
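The following sketch shows how $F$ could be assembled from $E$ and the two intrinsic matrices, and checks that $p_r^T F p_l$ vanishes for a matching pair of pixels. It assumes distortion-free pinhole cameras (so each $M$ is a simple upper-triangular matrix with a closed-form inverse), parallel cameras ($R_{l \to r} = I$) and made-up focal lengths and image centers; all helper names are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def skew(t):
    """Matrix S of the vector product t ^ v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def intrinsic(fx, fy, cx, cy):
    """Distortion-free pinhole matrix M (an assumption of this sketch)."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]

def intrinsic_inverse(fx, fy, cx, cy):
    """Closed-form inverse of the matrix returned by intrinsic()."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]

def fundamental(E, Ml_inv, Mr_inv):
    """F = (M_r^-1)^T E M_l^-1."""
    return matmul(matmul(transpose(Mr_inv), E), Ml_inv)

# Hypothetical setup: parallel cameras, baseline along the x axis.
R_lr = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_lr = [0.12, 0.0, 0.0]
E = matmul(R_lr, skew(T_lr))

left = (500.0, 500.0, 320.0, 240.0)    # fx, fy, cx, cy (illustrative)
right = (510.0, 505.0, 315.0, 245.0)
F = fundamental(E, intrinsic_inverse(*left), intrinsic_inverse(*right))

# One object point seen by both cameras, projected to homogeneous pixels.
P_l = [0.3, -0.2, 1.5]
P_r = matvec(R_lr, [p - t for p, t in zip(P_l, T_lr)])
p_l = matvec(intrinsic(*left), P_l)
p_r = matvec(intrinsic(*right), P_r)

residual = dot(p_r, matvec(F, p_l))
print("epipolar constraint satisfied:", abs(residual) < 1e-9)
```

The pixels are left in homogeneous form (not divided by their third coordinate) because the constraint is unaffected by a scale factor on $p_l$ or $p_r$.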
Knowing the fundamental matrix, it's possible to know:
– the epipolar line of one camera corresponding to one point in the other
– the epipoles of the two cameras ($e_l$ and $e_r$)
Only one relation is needed to know all these parameters:
$$p_r^T F p_l = p_l^T F^T p_r = 0$$
Moreover, as we work with a projective representation of lines, if $l$ is a line and $p$ a point, we have:
$$p \in l \Rightarrow p \cdot l = 0$$
Thus, we conclude from these two relations:
– if $p_l$ is a pixel from the left camera, the corresponding right epiline $l_r$ will be $F p_l$.
– if $p_r$ is a pixel from the right camera, the corresponding left epiline $l_l$ will be $F^T p_r$.
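Concerning the epipoles, they are defined as the points lying on all epilines; so, we have:
$$e_l^T l_l = 0 \quad \text{and} \quad e_r^T l_r = 0$$
As these relations hold for every pixel $p_r$ (respectively $p_l$), we conclude:
$$e_l^T l_l = 0 \Leftrightarrow l_l^T e_l = 0 \Leftrightarrow p_r^T F e_l = 0 \Rightarrow F e_l = 0$$
$$e_r^T l_r = 0 \Leftrightarrow l_r^T e_r = 0 \Leftrightarrow p_l^T F^T e_r = 0 \Rightarrow F^T e_r = 0$$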
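The relations above can be sketched as follows: epilines are obtained by a matrix-vector product with $F$ or $F^T$, and each epipole is a null vector of $F$ or $F^T$. Since $F$ has rank 2, a null vector can be taken as the vector product of two independent rows; this shortcut, the helper names and the example $F$ (the matrix $S$ of a made-up translation, which corresponds to $R_{l \to r} = I$ and identity intrinsic matrices) are assumptions of this sketch.

```python
def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def transpose(a):
    return [list(col) for col in zip(*a)]

def matvec(a, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def null_vector(m):
    """Null vector of a rank-2 3x3 matrix: the vector product of two
    independent rows is orthogonal to every row of the matrix."""
    candidates = [cross(m[0], m[1]), cross(m[0], m[2]), cross(m[1], m[2])]
    return max(candidates, key=lambda v: dot(v, v))

def right_epiline(F, p_l):
    """l_r = F p_l"""
    return matvec(F, p_l)

def left_epiline(F, p_r):
    """l_l = F^T p_r"""
    return matvec(transpose(F), p_r)

def epipoles(F):
    """Return (e_l, e_r) such that F e_l = 0 and F^T e_r = 0."""
    return null_vector(F), null_vector(transpose(F))

# Hypothetical F: the matrix S of T = (tx, ty, tz), i.e. R = I and M = I.
tx, ty, tz = 0.12, 0.01, 0.02
F = [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

# A matching pair of homogeneous projections (P_r = P_l - T when R = I).
p_l = [0.3, -0.2, 1.5]
p_r = [p_l[0] - tx, p_l[1] - ty, p_l[2] - tz]

l_r = right_epiline(F, p_l)
l_l = left_epiline(F, p_r)
e_l, e_r = epipoles(F)

print("p_r lies on l_r:", abs(dot(p_r, l_r)) < 1e-12)
print("p_l lies on l_l:", abs(dot(p_l, l_l)) < 1e-12)
print("e_r lies on l_r:", abs(dot(e_r, l_r)) < 1e-12)
```

In this simple configuration both epipoles are proportional to the translation vector, which is consistent with their definition as the intersection of the line $O_l O_r$ with the image planes.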
6.4.1.3 Roles of the stereo-calibration
We deduce from the previous calculations that the main roles of the stereo-calibration are to calculate:
– the relative position between the two cameras $P_{l \to r}$ ($R_{l \to r}$ and $T_{l \to r}$)
– the essential and fundamental matrices $E$ and $F$
The stereo-calibration can also integrate the mono-calibration of each camera, but that is a complicated solution. Thus, an easy way to proceed is to mono-calibrate each camera before stereo-calibrating the stereovision system made up of them.
6.4.2. Calculation of the rectification and reprojection matrices
6.4.2.1 Rectification matrices
Thanks to the calibration, we know some key links between the cameras.
Now, we want to simplify the stereovision system in order to match the same object easily within the two cameras. Working in the epipolar plane (the plane defined by $e_l$, $e_r$ and $P$) makes this matching easier: this means aligning the cameras, so that their positions relative to the object, $P_l$ and $P_r$, are colinear. Thus, we need a rectification matrix for each camera, $R^l_{rect}$ and $R^r_{rect}$, in order to apply a correcting rotation.
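In order to know the rectification matrices, it's necessary to know the orientation of the frame defined by $O_l O_r = T_{l \to r}$, as it's this vector which defines the orientation of the epipolar plane. A simple way to know that is by calculating the unit vectors $e_1$, $e_2$ and $e_3$ defining this frame.
$e_1$ is the unit vector along $T_{l \to r} = (t_x, t_y, t_z)^T$ (this is what the expressions of $e_2$ and $e_3$ below imply):
$$e_1 = \frac{T_{l \to r}}{\lVert T_{l \to r} \rVert}$$
$e_2$ is chosen as the vector product between $e_1$ and the z axis:
$$e_2 = \frac{-e_1 \wedge (0\ 0\ 1)^T}{\lVert e_1 \wedge (0\ 0\ 1)^T \rVert} = \frac{1}{\sqrt{t_x^2 + t_y^2}} \begin{pmatrix} -t_y \\ t_x \\ 0 \end{pmatrix}$$
$e_3$ is directly deduced from $e_1$ and $e_2$:
$$e_3 = \frac{e_1 \wedge e_2}{\lVert e_1 \wedge e_2 \rVert} = \frac{1}{\sqrt{t_x^2 t_z^2 + t_y^2 t_z^2 + (t_x^2 + t_y^2)^2}} \begin{pmatrix} -t_x t_z \\ -t_y t_z \\ t_x^2 + t_y^2 \end{pmatrix}$$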
From these three unit vectors, we deduce the rectification matrix for the left camera, $R^l_{camRect}$:
$$R^l_{camRect} = (e_1 \ e_2 \ e_3)$$
As the right camera has a rotation relative to the left one defined by $R_{l \to r}$, we have $R^r_{camRect}$:
$$R^r_{camRect} = R_{l \to r} R^l_{camRect}$$
These are the rectification matrices for the cameras and not for the images; indeed, in order to simulate a real rotation of the camera in one direction, we need to rotate the image in the opposite direction. Thus, we conclude the rectification matrices for the images:
$$R^l_{rect} = (R^l_{camRect})^T = \begin{pmatrix} e_1^T \\ e_2^T \\ e_3^T \end{pmatrix} \quad \text{and} \quad R^r_{rect} = (R^l_{camRect})^T R_{l \to r}^T = R^l_{rect} R_{l \to r}^T$$
Thus, we can calculate the rectified position of each camera, $P^l_{rect}$ and $P^r_{rect}$:
$$P^l_{rect} = R^l_{rect} P_l \quad \text{and} \quad P^r_{rect} = R^r_{rect} P_r$$
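A minimal sketch of this construction is given below. It assumes that $e_1$ is the unit vector along $T_{l \to r}$, that $t_x$ and $t_y$ are not both zero (otherwise $e_2$ is undefined), and it uses made-up values for $T_{l \to r}$ and $R_{l \to r}$; the function names are hypothetical. A useful sanity check is that $R^l_{rect}$ sends $T_{l \to r}$ onto the x axis.

```python
import math

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

def rectification_left(T):
    """Image rectification matrix R_rect^l, whose rows are e1^T, e2^T, e3^T.
    Requires t_x and t_y not both zero."""
    tx, ty, tz = T
    e1 = normalize(T)                  # along the baseline (assumed)
    e2 = normalize([-ty, tx, 0.0])     # -e1 ^ z, normalized
    e3 = normalize(cross(e1, e2))
    return [e1, e2, e3]

def rectification_right(R_rect_l, R_lr):
    """R_rect^r = R_rect^l R_l->r^T."""
    return matmul(R_rect_l, transpose(R_lr))

# Hypothetical relative pose (illustrative values only).
T_lr = [0.12, 0.01, 0.02]
c, s = math.cos(0.1), math.sin(0.1)
R_lr = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

R_rect_l = rectification_left(T_lr)
R_rect_r = rectification_right(R_rect_l, R_lr)

# After rectification the baseline lies along the x axis: (|T|, 0, 0).
aligned = matvec(R_rect_l, T_lr)
print("aligned baseline:", [round(x, 6) for x in aligned])
```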
6.4.2.2 Reprojection matrices
Once the rectified positions are calculated, we can calculate the new projection matrix, or reprojection matrix.
The goal is to consider the system of multiple cameras as one global camera with only one general projection matrix which represents the whole system. Thus, we will need a reprojection matrix for each camera, $P^l_{repro}$ and $P^r_{repro}$, in order to reproject the pixels from each camera to the global camera.
The projection matrix of the global camera can be expressed as:
$$P^{global}_{cam-px} = \begin{pmatrix} f_x^{global} & f_x^{global} o_c^{global} & c_x^{global} \\ 0 & f_y^{global} & c_y^{global} \\ 0 & 0 & 1 \end{pmatrix}$$
where
$$f_x^{global} = \frac{f_x^l + f_x^r}{2}, \quad f_y^{global} = \frac{f_y^l + f_y^r}{2}, \quad o_c^{global} = \frac{o_c^l + o_c^r}{2}$$
– if this is a horizontal stereosystem, $c_y^{global} = \frac{c_y^l + c_y^r}{2}$ and $c_x^{global}$ is not common to the two cameras ($c_x^r = c_x^l + dx$, where $dx$ is the disparity along the x axis).
– if this is a vertical stereosystem, $c_x^{global} = \frac{c_x^l + c_x^r}{2}$ and $c_y^{global}$ is not common to the two cameras ($c_y^r = c_y^l + dy$, where $dy$ is the disparity along the y axis).
In order to simplify the model and to use the global projection matrix $P^{global}_{cam-px}$, we can consider the disparity as 0. Thus, the remaining image-center coordinate can also be taken as the mean of the left and right values $c^l$ and $c^r$.
Geometrically, focusing on a horizontal stereovision system, the left camera and the right one are separated by $\Delta x = f_x^{global}\, t / z^{global}$ (where $t$ is the norm of $T_{l \to r}$); thus, if we consider the left camera as the reference frame, pixel coordinates from the right one have to be offset by $\Delta x$.
Finally, in the case of a horizontal stereovision system, we deduce the reprojection matrices $P^l_{repro}$ and $P^r_{repro}$ (if the left camera is considered as the reference frame):
$$P^l_{repro} = \begin{pmatrix} f_x^{global} & f_x^{global} o_c^{global} & c_x^{global} & 0 \\ 0 & f_y^{global} & c_y^{global} & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
$$P^r_{repro} = \begin{pmatrix} f_x^{global} & f_x^{global} o_c^{global} & c_x^{global} & f_x^{global} t_x \\ 0 & f_y^{global} & c_y^{global} & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
Then, we can express the stereo-rectified pixel of each camera, $p^l_{rect}$ and $p^r_{rect}$, as a function of the positions of the object relative to the cameras, $P_l$ and $P_r$:
$$p^l_{rect} = P^l_{repro} P^l_{rect} = P^l_{repro} R^l_{rect} P_l \quad \text{and} \quad p^r_{rect} = P^r_{repro} P^r_{rect} = P^r_{repro} R^r_{rect} P_r$$
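The sketch below builds the two reprojection matrices from averaged left and right parameters and projects one rectified point with each of them. The numeric parameters, the zero skew and the function names are assumptions made for the illustration; the rectified point is written in homogeneous form $(X, Y, Z, 1)$ so that it can be multiplied by the 3x4 matrices.

```python
def mean(a, b):
    return (a + b) / 2.0

def reprojection(fx, fy, skew, cx, cy, offset=0.0):
    """3x4 reprojection matrix. `offset` is the last entry of the first row:
    0 for the reference camera, f_x t_x for the other one."""
    return [[fx, fx * skew, cx, offset],
            [0.0, fy, cy, 0.0],
            [0.0, 0.0, 1.0, 0.0]]

def project(P_repro, point):
    """Project a rectified 3D point (X, Y, Z) to a pixel (u, v)."""
    X, Y, Z = point
    h = [X, Y, Z, 1.0]
    u, v, w = (sum(row[k] * h[k] for k in range(4)) for row in P_repro)
    return u / w, v / w

# Hypothetical mono-calibration results for each camera.
fx = mean(500.0, 510.0)
fy = mean(500.0, 504.0)
skew = mean(0.0, 0.0)
cx = mean(320.0, 316.0)   # disparity treated as 0, so a mean is used
cy = mean(240.0, 244.0)
t_x = 0.12                # hypothetical horizontal baseline

P_repro_l = reprojection(fx, fy, skew, cx, cy)
P_repro_r = reprojection(fx, fy, skew, cx, cy, offset=fx * t_x)

point = (0.3, -0.2, 1.5)  # rectified position of the object
u_l, v_l = project(P_repro_l, point)
u_r, v_r = project(P_repro_r, point)

# The two pixels share the same row and differ horizontally by f_x t_x / Z.
print("row difference:", v_r - v_l)
print("column offset:", u_r - u_l)
```

With these matrices, the horizontal offset between the two projections of the same point is $f_x^{global} t_x / Z$, which is the $\Delta x$ of the geometric argument above.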
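[Diagram: horizontal stereovision system seen from above, showing the object point $P$ at depth $z^{global}$, the camera centers $O_l$ and $O_r$, the rectified positions $P^l_{rect}$ and $P^r_{rect}$, and the offset $\Delta x = f_x^{global}\, t / z^{global}$.]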
6.4.3. Rectification and reprojection of an image
In order to rectify and reproject an image from a camera within a stereovision system, we will use the same method as for undistortion, adding the rectification and reprojection matrices.
Indeed, our ideal pixel will now be:
$$p^i_{img} = s^{global} P_{repro} R_{rect}\, p_{obj/cam}$$
Thus, we can express the distorted (and unrectified) pixel $p^d_{img}$ as a function of $p^i_{img}$:
$$p^d_{img} = s\, P_{cam-px} M_d \frac{1}{s^{global}} R_{rect}^T P_{repro}^{-1}\, p^i_{img}$$
Then, similarly to undistortion, we will use image mapping.

Table of contents

1 Introduction
2 Subject : object recognition and localization on NAO robot
2.1 Mission and goals
2.2 Recognition
2.3 Localization
2.4 NAO control
3 Implementation and testing
3.1 Languages, softwares and libraries
3.2 Tools and methods
3.3 Steps of the project
4 Conclusion
5 References
6 Appendices
6.1 SIFT: a method of feature extraction
6.2 SURF : another method adapted for real time applications
6.3 Mono-calibration
6.4 Stereo-calibration

1 Introduction

Video cameras are interesting devices to work with because they can transmit a lot of data. For human beings or animals, image data inform them about the objects within the environment they meet: a goal for the future is to understand how we can recognize an object. Firstly, we learn about it and after we are able to recognize it within a different environment. The strength of the brain is to be able to recognize an object among different classes of objects: we call this ability clustering. But before all, the brain is able to extract interesting features from an object that make it belong to a class.

Thus, the difficulty of implementing a recognition ability on a robot has two levels:
– What features should be extracted from an object?
– How to identify a specific object among many others?
We can say that we first need well-chosen features before being able to identify an object, so the first question has to be answered before the second one.

Besides recognition, we often need the location and orientation of the object in order to make the robot interact with it (control). Thus, the vision system must also enable object localization, as it increases the definition of the object seen.

In this report, firstly, I will present the university and where I worked before describing the context and goals of my project. After, I will focus on the three subjects I had to study: object recognition, object localization and NAO control, before presenting how I implemented and managed my project.

2 Subject : object recognition and localization on NAO robot

2.1 Mission and goals

2.1.1. Context

In real life, the environment can change easily and our brain can easily adapt itself to match an object whatever the conditions. In Robotics, in order to tackle recognition problems, we often think about a specific context. I can mention three examples:
➢ For industrial robots, we may have to make robots recognize circular objects on a moving belt: the features to extract are well defined and we will implement an effective program to match only circular ones. In this first example, a first problem is the background: the program has been made to work if the background is the moving belt, but in real applications it can vary a lot!
➢ When we want to improve a CCTV camera to recognize vehicles, we can use the thickness of their edges to distinguish them from people walking in the street. After, we can also use the car blobs to match a specific size or the optical flow to match a specific speed. The problem there can be at the occlusion level: indeed, even if the use of blobs could improve the detection, the car blob could be invalid if a tree hides the vehicle, and it would not be identified.
➢ Using an HSV color space, we can recognize monocoloured objects under a constant illumination. Thus, a big problem is illumination variation.

Thus, from these examples we can underline three types of variations:
➢ background
➢ occlusion
➢ illumination
But there are a lot of others, such as all the transformations in space:
➢ 3D translation (on the image: translation (X, Y) and scale (1/Z))
➢ 3D rotation

2.1.2. State of the art

A lot of people have worked to find invariant features to recognize an object whatever the conditions. An interesting way of study is to work on invariant keypoints:
➢ The Scale-Invariant Feature Transform (SIFT) was published in 1999 by David G. Lowe, researcher from the University of British Columbia. This algorithm detects local keypoints within an image; these features are invariant in scale, rotation and illumination. Using keypoints instead of keyregions makes it possible to deal with occlusion.
➢ The Speeded Up Robust Feature (SURF) was presented in 2006 by Herbert Bay et al., researcher from the ETH Zurich. This algorithm is partly inspired from SIFT but it's several times faster than SIFT and claimed by its authors to be more robust against different image transformations.
In the following, I will speak more about these methods and how to use them to match an object.

2.1.3. My mission

My mission was to recognize an object in difficult conditions (occlusion, background, different rotations, scales and positions). Moreover, the goal was also to apply it for a robotic purpose: the control of the robot needs to be linked to the recognition. Thus, I had to study the localization between recognition and control in order to move towards robotic applications.

After, I will focus on the localization as it gives a complete description of the object in space: it enables more interaction between the robot and the object. As I worked on the NAO robot, I will describe its hardware and software and the ways to program it. Then, in a final part, I will link all the previous parts (recognition, localization and robot control) in order to describe my implementation and testing. I will also speak about the way I managed my project.

2.2 Recognition

2.2.1. Introduction

In pattern recognition, there are two main steps before being able to have a judgement on an object:
– Firstly, we have to analyze the image to find the features (which can be for example keypoints or keyregions) likely to be interesting to work with. In the following, I will call this step: feature extraction.
– Secondly, we have to compare these features with those stored in the database. I will call it: feature matching.
Concerning the database, there are two ways to create it:
– either it's created manually by the user and stays fixed (the user teaches the robot)
– or it's updated automatically by the robot (the robot learns by itself)

2.2.2. SIFT and SURF : two methods of invariant feature extraction

SIFT and SURF are two invariant-feature extraction methods which are divided in two main steps:
– Keypoint extraction: keypoints are found at specific locations on the image and an orientation is given to them. They are found depending on their stability with scale and position: the invariance with scale is obtained at this step.
– Descriptor generation: a descriptor is generated for each keypoint in order to identify them accurately. In order to be flexible to transformations, it has to be invariant: the invariances with rotation, brightness and contrast are obtained at this step.

During the keypoint extraction, each method has to build a scale space in order to estimate the Laplacian of Gaussian. The Laplacian detects pixels with rapid intensity changes, but it's necessary to apply it on a Gaussian in order to have the invariance in scale. Moreover, the Gaussian makes the Laplacian less noisy. From the Laplacian of Gaussian applied to the image, the keypoints are found at the extrema of the function, as their intensity changes very quickly around their location. Then, the second step is to calculate their orientation using gradients from their neighborhood.

Concerning the descriptor generation, we will first need to rotate the axis around each keypoint by its orientation in order to have invariance with rotation. After, as for the calculation of the keypoint orientation, it's necessary to calculate the gradients around a specific neighborhood (or window). This window will then be divided in several subwindows in order to build a vector containing several gradient parameters. This vector has a specific number of values depending on the method: with SIFT, it has 128 values and with SURF, 64. Finally, a threshold is applied to this vector and it's normalized to unit length in order to make it invariant with brightness and contrast respectively.

The fact that SURF works on integral images and uses simplified masks reduces the computation and makes the algorithm quicker to execute. For these reasons, SURF is better for real time applications than SIFT, even if SIFT can be very robust.

For more details about the feature extraction methods SIFT and SURF, you can refer to the appendices where I describe each one of these methods further.

The diagram below sums up the main common steps used by SIFT and SURF.

[Diagram: common steps of SIFT and SURF.
Keypoint extraction
– Keypoint location: build a scale space (invariance on scale); estimate the Laplacian of Gaussian
– Keypoint orientation: calculate the gradients around the keypoints within a window W1; estimate the orientations
Descriptor generation
– Rotate the keypoint axis by the orientation of the keypoints (invariance on rotation)
– Calculate the gradients around the keypoints within a window W2
– Divide the window W2 in several subwindows
– Calculate gradient parameters for each subwindow
– Threshold and normalize the vector (invariance on brightness and contrast)
Output: descriptor]

2.2.3. Feature matching

We have now seen two methods which locate and orientate keypoints and assign them an invariant descriptor. The goal of the matching is to compare two lists of keypoint-descriptor pairs in order to:
– identify the presence of an object
– know the 2D position, 2D size, 2D orientation and scale of the object on the image
A big problem of recognition is to generalize it to 3D objects: a 3D object is complicated to read with a camera, and the 3D position, 3D size and 3D orientation are difficult to estimate with only one camera. Thus, we will see how a matching can be done using only one model we want to recognize within the image acquired.

The feature matching consists in recognizing a model within an image acquired. It can be divided in three main steps:
– compare the descriptor vectors (descriptor matching)
– compare the features of the keypoints matched (keypoint matching)
– calculate the 2D features of the model within the image acquired (calculation of 2D model features)

[Diagram: feature matching. The image acquired and the image model each go through feature extraction (SIFT or SURF), giving descList_acq / kpList_acq and descList_model / kpList_model; these feed descriptor matching, then keypoint matching, then the calculation of 2D model features.]

The diagram above shows the principle of the feature matching: features are extracted from the image acquired and from the image model in order to match their descriptors and keypoints. The goal is to obtain the 2D features of the model within the acquisition at the end of the matching.

2.2.3.1 Descriptor matching

2.2.3.1.1 Compare distances

The descriptors we have previously calculated using SIFT or SURF give us data invariant in scale, rotation, brightness and contrast. Firstly, in order to compare descriptors, we can focus on their distance or on their angle. Thus, there are two basic solutions to compare descriptors: using the euclidian distance or the angle θ (in the following I also qualify the angle θ as a distance). There are other types of distances I don't mention here in order to keep it simple.

If $desc_{acq} = (desc_{acq}(1), desc_{acq}(2), \dots, desc_{acq}(size_{desc}))$ is a descriptor from the image acquired and $desc_{model} = (desc_{model}(1), desc_{model}(2), \dots, desc_{model}(size_{desc}))$ is one from the model:
$$d_{euclidian} = \sqrt{(desc_{acq}(1) - desc_{model}(1))^2 + \dots + (desc_{acq}(size_{desc}) - desc_{model}(size_{desc}))^2}$$
$$\theta = \arccos\left(desc_{acq}(1)\, desc_{model}(1) + \dots + desc_{acq}(size_{desc})\, desc_{model}(size_{desc})\right)$$

Thus, we could calculate the distances between one descriptor from the image acquired and each one from the image model: the smallest distance could be compared to a specific threshold. But how to find the threshold in this case? Indeed, a threshold has to be relative to a distance, and each descriptor could have its own valid threshold. David G. Lowe had the idea to compare the smallest distance with the next smallest distance, and deduced that rejecting matches with a distance ratio ($ratio_{distance} = distance_{smallest} / distance_{next\text{-}smallest}$) of more than 0.8 keeps most of the correct matches and rejects most of the false matches.

The figure below represents the Probability Density Functions (PDF) as a function of the ratio of distances (closest/next closest). As you can see, removing matches from 0.8, we eliminate 90 % of the false matches while losing just 5 % of the correct ones. We can also keep only matches under 0.6 in order to focus on correct matches only (but we lose data).
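The two distances and the ratio test can be sketched as follows. This is a brute-force illustration on made-up 3-value descriptors (real SIFT and SURF descriptors have 128 and 64 values); the function names and the default threshold argument are assumptions, and the angle formula assumes unit-length descriptors as in the text.

```python
import math

def euclidean(a, b):
    """Euclidian distance between two descriptors of the same size."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angle(a, b):
    """Angle between two unit-length descriptors (arccos of their dot product)."""
    c = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error

def ratio_test_matches(desc_acq, desc_model, max_ratio=0.8):
    """Return (acquisition index, model index) pairs that pass the ratio test."""
    matches = []
    for i, d in enumerate(desc_acq):
        dists = sorted((euclidean(d, m), j) for j, m in enumerate(desc_model))
        if len(dists) < 2:
            continue  # the ratio needs a closest and a next-closest neighbour
        (best, j), (second, _) = dists[0], dists[1]
        if second > 0.0 and best / second < max_ratio:
            matches.append((i, j))
    return matches

# Toy descriptors (illustrative values only).
desc_model = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
desc_acq = [[0.9, 0.1, 0.0],   # clearly closest to model descriptor 0
            [0.5, 0.5, 0.0]]   # ambiguous between model descriptors 0 and 1

matches = ratio_test_matches(desc_acq, desc_model)
print(matches)
```

The second acquired descriptor is equally distant from two model descriptors, so its ratio is 1 and it is rejected, which is the kind of ambiguous match the test is meant to remove.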

This method quickly eliminates matches from the background of the acquisition which are completely incompatible with the model. At the end of the procedure, we have a number of interesting keypoints from the image acquired that are supposed to match with the model, but we need a method more precise than distances.

2.2.3.1.2 k-d tree

The goal now is to match the nearest neighbours more accurately. To do that, we can use the k-d tree method. Before all, a k-d tree uses two lists of vectors of at least k dimensions (as it will focus only on the k dimensions specified):
– the reference (N vectors), which alludes to the vectors we use to build the tree
– the model (M vectors), which alludes to the vectors for which we look for a nearest neighbour
Thus, when the k-d tree has finished its querying for each vector from the model within the tree, it has for each one of them a nearest reference vector: we have M matches at the end. In order to avoid several model vectors matching the same reference vector, we need to have:
$$N \geq M$$

Then, depending on the previous operation (compare distances), we will need to assign the model and acquisition descriptor lists correctly depending on the number of their descriptors to match:
– if there are more descriptors in the acquisition than in the model, the acquisition will be the reference and the model will be the model of the k-d tree.
– if there are more descriptors in the model than in the acquisition, it will be the opposite.
In our case, if we want accuracy, the dimension of the tree has to be the dimension of the corresponding descriptors (128 for SIFT and 64 for SURF). The drawback is that the more dimensions we have, the more computation there is and the slower the process is.
I don't explain more details about the k-d tree as I used it as a tool.

2.2.3.2 Keypoint matching

After knowing the nearest neighbours, the descriptors are not useful any more. Now we will use the data relative to the keypoints (2D position, scale, magnitude and orientation) in order to check whether there are still some false matches. Thus, we have to check if the matches are coherent at the level of the geometry of the keypoints. To do this, we can use a Generalized Hough Transform (GHT), but there are two types:
– GHT invariant with 2D translation (classical GHT)
– GHT invariant with 2D translation, scale and rotation (invariant GHT)

The classical GHT uses only the 2D translation of the keypoints as reference. It has the advantage of already giving an idea of the scale and orientation of the model within the image acquired, but it requires more computation as it's necessary to go through several possible scales and 2D rotations. Moreover, it's necessary to know a specific range of scale and 2D rotation to work with, and at what sampling.

The invariant GHT uses both the 2D translation and the orientation of the keypoints as references. It has the main advantage of being invariant with all the 2D transformations of the object, but it doesn't inform about the scale and 2D rotation and, as it uses orientations, it's sensitive to their accuracy.

In the following, we will study both the classical Generalized Hough Transform and the invariant one. As the goal of the GHT is here to check whether the keypoints are correct, it has as input the keypoints matched between the image acquired and the model.

2.2.3.2.1 Classical GHT

The Generalized Hough Transform leans mainly on keypoint coordinates defined relative to a reference point. Indeed, using a reference point for all the keypoints keeps the data invariant in 2D translation.

➢ Build the Rtable of the model
Before all, the model needs to have a corresponding Rtable which describes, for each of its keypoints, its displacement γ from the reference point:
$$\gamma(1, 0) = \begin{pmatrix} x_{model} - x_r \\ y_{model} - y_r \end{pmatrix}$$
where $x_{model}$ and $y_{model}$ are the coordinates of the keypoint from the model and $x_r$ and $y_r$ are those of the reference point. The two parameters of the displacement γ allude respectively to the scale ($s = 1$) and rotation ($\theta = 0$ rad), as the model is considered as the base.
The reference point can simply be calculated as the mean position of all the keypoints:
$$x_r = \frac{\sum x_{model}}{nbKp_{model}} \qquad y_r = \frac{\sum y_{model}}{nbKp_{model}}$$

where $nbKp_{model}$ refers to the number of keypoints within the model. Thus, the Rtable for each model is structured as below:
$$T = \begin{pmatrix} x_{model}(1) - x_r & y_{model}(1) - y_r \\ x_{model}(2) - x_r & y_{model}(2) - y_r \\ \vdots & \vdots \\ x_{model}(nbKp_{model}) - x_r & y_{model}(nbKp_{model}) - y_r \end{pmatrix}$$

➢ Check the features of the acquisition
The second step is to estimate the position of the reference point within the acquired image and check the keypoints which have a good result. As we already have a match between the keypoints from the acquisition and those from the model, we can directly use the Rtable in order to make each keypoint vote for a specific position of the reference point. Indeed, we have:
$$x_r = x_{acq} - Rtable(numKp_{model}, 1)$$
$$y_r = y_{acq} - Rtable(numKp_{model}, 2)$$
where $x_{acq}$ and $y_{acq}$ allude to the position of the keypoint from the acquisition and $numKp_{model}$ refers to the corresponding keypoint within the model matched by the acquisition.
Thus, if we consider a transformation invariant both in scale and rotation, we have to modify the value of the displacement given by the Rtable:
$$\gamma(s, \theta) = s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \gamma(1, 0)$$
With scale $s$ and rotation $\theta$, we have our new reference point for the acquisition:
$$x_r = x_{acq} - s\left(\cos\theta \; Rtable(numKp_{model}, 1) - \sin\theta \; Rtable(numKp_{model}, 2)\right)$$
$$y_r = y_{acq} - s\left(\sin\theta \; Rtable(numKp_{model}, 1) + \cos\theta \; Rtable(numKp_{model}, 2)\right)$$
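The two steps above (building the Rtable, then voting) can be sketched as follows for one fixed scale and rotation. The keypoint coordinates are made up, the accumulator is a simple dictionary of binned positions instead of an image-sized array, and the bin size, the tolerance radius and the function names are assumptions of this illustration.

```python
import math
from collections import Counter

def build_rtable(model_kps):
    """Displacement of each model keypoint from the mean (reference) point."""
    n = len(model_kps)
    xr = sum(x for x, _ in model_kps) / n
    yr = sum(y for _, y in model_kps) / n
    return [(x - xr, y - yr) for x, y in model_kps]

def vote(acq_kp, displacement, s=1.0, theta=0.0):
    """Reference-point position voted by one matched acquisition keypoint."""
    dx, dy = displacement
    c, si = math.cos(theta), math.sin(theta)
    return (acq_kp[0] - s * (c * dx - si * dy),
            acq_kp[1] - s * (si * dx + c * dy))

def filter_matches(acq_kps, matches, rtable, s=1.0, theta=0.0,
                   bin_size=1.0, radius=3.0):
    """Keep the matches whose vote falls near the accumulator maximum."""
    votes = [vote(acq_kps[i], rtable[j], s, theta) for i, j in matches]
    accumulator = Counter((round(x / bin_size), round(y / bin_size))
                          for x, y in votes)
    (bx, by), _ = accumulator.most_common(1)[0]
    peak = (bx * bin_size, by * bin_size)
    kept = [m for m, (x, y) in zip(matches, votes)
            if math.hypot(x - peak[0], y - peak[1]) <= radius]
    return peak, kept

# Toy data: the model is a square, found translated by (20, 30) in the
# acquisition; the last match is a deliberately wrong one.
model_kps = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
acq_kps = [(20.0, 30.0), (30.0, 30.0), (20.0, 40.0), (30.0, 40.0),
           (100.0, 100.0)]
matches = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 0)]  # (acq index, model index)

rtable = build_rtable(model_kps)
peak, kept = filter_matches(acq_kps, matches, rtable)
print("reference point:", peak)
print("coherent matches:", kept)
```

The four correct matches vote for the same reference point, while the wrong match votes far away and falls outside the tolerance radius. To search over scale and rotation, the same voting would be repeated for each sampled pair $(s, \theta)$.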

Thus, at the end of this step, we have an accumulator of votes of the size of the image acquired. In the image below, you can see an example of accumulator: the maximum shows the most probable position of the reference point within the image acquired.

A problem is to know at what scale and rotation we have the best results: personally, I focused on the maximum of votes to choose, but there can be several maxima for different scales and rotations. An idea would be to detect where the votes are most grouped.
Adding the scale and rotation, another problem is that we add more calculation, and the difficulty here is to know what range to use and how many iterations (the more we increase the number of iterations, the more accurate it will be but the more time it will take).

2.2.3.2.2 Invariant GHT

Contrary to the previous method, we will not use the displacement γ this time. We will use a relation between the speed of the keypoint φ' and its acceleration φ''.

➢ Speed φ' and acceleration φ''
It's easier to define the speed and acceleration focusing on a template and not on keypoints. The studied template has a reference point as origin $(x_0, y_0)$. The speed (at a specific angle θ of the template) is defined as being tangent to its border. The acceleration is defined as going through the border and the reference point $(x_0, y_0)$.

We can prove that between the angle of speed $\varphi'_a$ and the angle of acceleration $\varphi''_a$, there is a value $k$ invariant with translation, rotation and scale:
$$k = \varphi'_a - \varphi''_a$$
In our case, we will use the orientation of the keypoint as the angle of speed $\varphi'_a$. Concerning the acceleration $\varphi''$, it's defined by:
$$\varphi'' = \frac{y - y_0}{x - x_0} \quad \text{and} \quad \varphi''_a = \arctan(\varphi'')$$

➢ Build the Rtable of the model
As previously, we will build a Rtable for the model, but this time we will record the value of $k$ and not the displacement. Thus we will use both the position and the orientation of each keypoint to build the Rtable.

➢ Check the features of the acquisition
Using the position of the keypoints within the acquisition and their corresponding matches, we will have an associated value of $k$. After, knowing the orientation of the keypoints ($\varphi'_a$), we will have the angle of acceleration $\varphi''_a$. Contrary to the previous method, each keypoint will not vote for a precise position but for a line.

2.2.3.2.3 Conclusion

Thus, at the end of both methods, we have to look for the maximum of votes in the accumulator or for where the votes are the most grouped. I chose to focus on the maximum as it's simpler. In order to be more tolerant, I added a radius of tolerance around the maximum. The false matches, outside the tolerance zone, will be rejected: this ensures coherence at the geometric level.
The GHT reduces errors in the calculation of the 2D model features that we will see in the following.

2.2.3.3 Calculation of 2D model features

➢ Least Squares Solution
In order to calculate the 2D features, we will have to use a Least Squares Solution. Indeed, a point of the model ($p_{model}$) is expressed in the image acquired ($p_{model/acq}$) using a specific translation $t$, a scale $s$ (scalar) and a rotation matrix $R$:
$$p_{model/acq} = s R\, p_{model} + t \quad \text{where} \quad R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \quad \text{and} \quad t = \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$

θ is the angle of rotation between the model and the model within the acquired image. We consider that the model has a scale s of 1 and an angle θ of 0 radians. Thus, in order to solve this problem, we apply the Least Squares in order to find two matrices:

$$M = s\,R = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} \quad \text{and} \quad t = \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$

The idea of the Least Squares is to arrange the equation (4.3-1) in order to group the unknowns m11, m12, m21, m22, tx and ty within a vector:

$$p_{model/acq} = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} p_{model} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \quad \text{with} \quad p_{model/acq} = \begin{pmatrix} x_{model/acq} \\ y_{model/acq} \end{pmatrix} \quad \text{and} \quad p_{model} = \begin{pmatrix} x_{model} \\ y_{model} \end{pmatrix}$$

Thus, we also have:

$$A\,m_t = \begin{pmatrix} x_{model/acq} \\ y_{model/acq} \end{pmatrix} \quad \text{with} \quad A = \begin{pmatrix} x_{model} & y_{model} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{model} & y_{model} & 0 & 1 \end{pmatrix} \quad \text{and} \quad m_t = \begin{pmatrix} m_{11} & m_{12} & m_{21} & m_{22} & t_x & t_y \end{pmatrix}^T$$

Thus the vector mt can be determined using the inverse of the matrix A. But, if A is not square, we have to calculate a pseudo-inverse:

$$m_t = (A^T A)^{-1} A^T \begin{pmatrix} x_{model/acq} \\ y_{model/acq} \end{pmatrix}$$

As we have six unknowns, we require at least three positions pmodel with their three corresponding positions pmodel/acq within the acquired image. In order to increase the accuracy of the Least Squares Solution, it is better to have more positions: all the matches can be considered (of course, the previous steps, such as the descriptor matching and the keypoint matching, have to be correct).

Thus, we have:

$$A\,m_t = v_{model/acq}$$

with

$$A = \begin{pmatrix} x_{model}^{(1)} & y_{model}^{(1)} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{model}^{(1)} & y_{model}^{(1)} & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_{model}^{(n_{bmatches})} & y_{model}^{(n_{bmatches})} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{model}^{(n_{bmatches})} & y_{model}^{(n_{bmatches})} & 0 & 1 \end{pmatrix} \quad \text{and} \quad v_{model/acq} = \begin{pmatrix} x_{model/acq}^{(1)} \\ y_{model/acq}^{(1)} \\ \vdots \\ x_{model/acq}^{(n_{bmatches})} \\ y_{model/acq}^{(n_{bmatches})} \end{pmatrix}$$

nbmatches is the number of matches detected. Thus, as previously, the solution is the following:

$$m_t = (A^T A)^{-1} A^T v_{model/acq}$$

We now have the matrix M and the translation t, but we still need the scale s and the angle θ of the transformation from the model to the acquisition. To calculate them, we can simply rely on the properties of the rotation matrix R:

$$R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \quad \text{and} \quad R\,R^T = I$$

We conclude:

$$M M^T = s^2 I \quad \text{and} \quad s = \sqrt{(M M^T)_{1,1}} = \sqrt{(M M^T)_{2,2}} = \sqrt{\frac{(M M^T)_{1,1} + (M M^T)_{2,2}}{2}}$$

We deduce the rotation matrix R and the angle θ:

$$R = \frac{M}{s} \quad \text{and} \quad \theta = \arctan\left(\frac{R_{2,1}}{R_{1,1}}\right)$$

Then, at the end of this step, we have the translation t, the size s and the angle θ describing the transformation of the model within the acquisition.

In the diagram below, the red points refer to examples of matches between acquisition and model: it is necessary to have at least three couples of positions (acquisition-model) in order to calculate the features. The diagram also shows some useful measures: Cmodel and Cmodel/acq refer respectively to the center of rotation within the model and within the acquisition. The Least Squares Solution only gives an estimation of the origin of the model within the acquired image, Omodel/acq, but the center of rotation is easily calculated using the scaled size of the model (sW0, sH0), where W0 and H0 are the dimensions of the model image from which the keypoints were extracted.

[Diagram: the model image (origin Omodel, center Cmodel, size W0 × H0) and the acquisition (origin Oacq), in which the model appears with origin Omodel/acq at translation (tx, ty), center Cmodel/acq, rotation θ and size sW0 × sH0.]
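The Least Squares step and the extraction of s and θ can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used in the project: the function names (`solve_linear`, `fit_model_transform`, `scale_and_angle`) are hypothetical, the normal equations are solved by a small Gaussian elimination, and `atan2` is used in place of arctan so that the angle is recovered in the correct quadrant.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: the matches are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - acc) / aug[r][r]
    return x

def fit_model_transform(model_pts, acq_pts):
    """Least-squares estimate of M (2x2) and t from matched positions."""
    rows, v = [], []
    for (x, y), (xa, ya) in zip(model_pts, acq_pts):
        # Two rows of A per match, as in A m_t = v_model/acq.
        rows.append([x, y, 0.0, 0.0, 1.0, 0.0])
        v.append(xa)
        rows.append([0.0, 0.0, x, y, 0.0, 1.0])
        v.append(ya)
    # Normal equations (A^T A) m_t = A^T v, equivalent to the pseudo-inverse.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    atv = [sum(r[i] * vi for r, vi in zip(rows, v)) for i in range(6)]
    m11, m12, m21, m22, tx, ty = solve_linear(ata, atv)
    return [[m11, m12], [m21, m22]], (tx, ty)

def scale_and_angle(m):
    """Scale s from the diagonal of M M^T, angle theta from R = M / s."""
    mmt11 = m[0][0] ** 2 + m[0][1] ** 2
    mmt22 = m[1][0] ** 2 + m[1][1] ** 2
    s = math.sqrt((mmt11 + mmt22) / 2.0)
    theta = math.atan2(m[1][0], m[0][0])  # same as arctan(R21 / R11), quadrant-safe
    return s, theta
```

With at least three non-collinear matches, `fit_model_transform` returns M and t, and `scale_and_angle(M)` returns the pair (s, θ). Averaging the two diagonal terms of M Mᵀ, as in the last expression for s above, makes the estimate less sensitive to noise in a single row of M.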

➢ Reprojection error

In order to have an idea of the quality of the Least Squares Solution, it is necessary to calculate the reprojection error. The reprojection error is obtained by calculating the new position vector of the model within the acquisition, v'model/acq, from the position vector of the model vmodel and the matrix M and translation t previously calculated. Thus, we have an error (the reprojection error errorrepro) defined by the distance between the calculated position vector v'model/acq and the real one vmodel/acq, relative to the norm of the real one:

$$d_{error} = \lVert v'_{model/acq} - v_{model/acq} \rVert \quad \text{and} \quad error_{repro} = \frac{d_{error}}{\lVert v_{model/acq} \rVert}$$
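As an illustration, the reprojection error can be computed as below. This is a minimal sketch: the function name and the argument layout (M as a 2x2 nested list, t as a pair, matches as two lists of points) are assumptions made for the example.

```python
import math

def reprojection_error(m, t, model_pts, acq_pts):
    """Relative reprojection error of (M, t) over all the matches."""
    num = 0.0  # squared norm of v'_model/acq - v_model/acq
    den = 0.0  # squared norm of v_model/acq
    for (x, y), (xa, ya) in zip(model_pts, acq_pts):
        # Reproject the model point into the acquisition: p' = M p + t
        xp = m[0][0] * x + m[0][1] * y + t[0]
        yp = m[1][0] * x + m[1][1] * y + t[1]
        num += (xp - xa) ** 2 + (yp - ya) ** 2
        den += xa ** 2 + ya ** 2
    if den == 0.0:
        raise ValueError("all acquired positions are at the origin")
    return math.sqrt(num) / math.sqrt(den)
```

A value close to 0 means that M and t explain the matches well; a large value suggests that false matches survived the previous filtering steps.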

2.3 Localization

2.3.1 Introduction

There are many ways to localize objects:
– ultrasonic sensors
– infrared sensors
– cameras
– …

Localization with cameras provides a lot of data, and working with pixels is more accurate. However, it needs more processing time and computing power.

In the context of the project, localization improves the interaction between the robot and the detected object. We will see in the following how we can localize objects using one and several cameras, as well as the drawbacks and advantages of these two methods. But first of all we need to focus on the parameters of a video camera, which are important to define.

2.3.2 Parameters of a video camera

A video camera can be defined by several parameters:
– extrinsic
– intrinsic

The extrinsic parameters describe the geographical location of the camera, whereas the intrinsic parameters deal with the internal features of the camera. In the following, we will describe each of these parameters.

2.3.2.1 extrinsic

The extrinsic parameters of a camera refer to the position and orientation of the camera in relation to the observed object. Thus, the position of the camera pcam is defined as below:

$$p_{cam} = R_{cam}\,p_{world} + T_{cam} = \langle R_{cam} \mid T_{cam} \rangle\,p_{world}$$

where:
– pworld refers to the position of the world origin (in homogeneous coordinates in the second form)
– Rcam and Tcam refer respectively to the rotation and translation of the camera relative to the reference point

$$R_{cam} = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \qquad T_{cam} = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}$$

In our case, it is interesting to know the object position pobj/world in relation to the camera frame, pobj/cam. In order to build this relation, we consider that Rcam and Tcam are referenced to the object and not to the world as before. As we want to express Tcam in the object frame, we have a translation of -Tcam before the rotation:

$$p_{obj/cam} = R_{cam}\,(p_{obj/world} - T_{cam}) = R_{cam}\,\langle I \mid -T_{cam} \rangle\,p_{obj/world}$$

[Diagram: the object frame expressed in the world (Oobj/world; Xobj/world, Yobj/world, Zobj/world) and in the camera frame (Oobj/cam; Xobj/cam, Yobj/cam, Zobj/cam), linked by (Rcam, Tcam).]

2.3.2.2 intrinsic

The intrinsic parameters can be divided in two categories:
– the projection parameters, which build the projection relation between 2D points seen

and real 3D points
– the distortions, defined in relation to an ideal pinhole camera

2.3.2.2.1 projection parameters

The projection parameters fill the projection matrix linking 3D points from the world with 2D points (or pixels) seen with the camera. This matrix applies the relation followed by ideal pinhole cameras. A pinhole camera is a simple camera without a lens and with a single small aperture. We can assume that a video camera with a lens behaves like a pinhole camera because the rays go through the optical center of the lens. This relation is easily determined by geometry, using a front projection pinhole camera:

[Diagram: front projection pinhole camera with optical center Ocam, optical axis z, image plane at distance fcam with center Oimg and image point (x'img, y'img), and object point (xobj/cam, yobj/cam) at distance zobj/cam.]

$$\frac{x'_{img}}{x_{obj/cam}} = \frac{y'_{img}}{y_{obj/cam}} = \frac{f_{cam}}{z_{obj/cam}}$$

We deduce:

$$x'_{img} = f_{cam}\,\frac{x_{obj/cam}}{z_{obj/cam}} \qquad y'_{img} = f_{cam}\,\frac{y_{obj/cam}}{z_{obj/cam}}$$

These relations build the projection matrix linking the 2D points on the image p'img and the 3D object points expressed relative to the video camera frame pobj/cam:

$$p'_{img} = s\,P_{cam}\,p_{obj/cam}$$

$$\text{where} \quad P_{cam} = \begin{pmatrix} f_{cam} & 0 & 0 \\ 0 & f_{cam} & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad s = \frac{1}{z_{obj/cam}}$$

s is a scale factor depending on the distance between the object and the camera: the higher it is, the bigger the object appears on the image.

The position on the image p'img is expressed in distance units; it is now necessary to convert it into pixel units (pimg) and to apply an offset:

$$p_{img} = P_{img}\,p'_{img} \quad \text{where} \quad P_{img} = \begin{pmatrix} N_x & N_x \cot\theta & c_x \\ 0 & \dfrac{N_y}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

Nx and Ny are respectively the number of pixels per unit of distance horizontally and vertically. cx and cy are the coordinates of the image center (in pixels): as the z axis of the camera frame is located on the optical axis, it is necessary to offset pimg by half of the image width on x and by half of the image height on y (cx and cy respectively). θ refers to the angle between the x and y axes; it can be responsible for a skew distortion on the image. The axes are generally orthogonal, so this angle is very close to 90°.

Finally, we can deduce the final projection matrix Pcam-px, which directly gives the position on the image in pixels pimg:

$$p_{img} = s\,P_{img}\,P_{cam}\,p_{obj/cam} = s\,P_{cam-px}\,p_{obj/cam}$$

$$\text{where} \quad P_{cam-px} = \begin{pmatrix} N_x f_{cam} & N_x f_{cam} \cot\theta & c_x \\ 0 & \dfrac{N_y f_{cam}}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} f_x & f_x \cot\theta & c_x \\ 0 & \dfrac{f_y}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

We have fx = Nx fcam and fy = Ny fcam. Generally, sin(θ) ≈ 1 as θ ≈ 90°, but the term cot(θ), which tends to 0 as tan(θ) → ∞, is often kept. Thus, a skew coefficient is defined as αc = cot(θ). Thus, the projection matrix can be approximated:

$$P_{cam-px} \approx \begin{pmatrix} f_x & f_x \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \quad \text{where} \quad \alpha_c \approx 0$$

2.3.2.2.2 distortions

There are two different types of distortion:
– the radial distortions dr
– the tangential distortions on x (dtx) and on y (dty)

The radial distortions come from the convexity of the lens: the farther we go from the center of the lens (center of the image), the more distortion we have. The tangential distortions come from the lack of parallelism between the lens and the image plane. You can see below the effect of radial distortion.

[Diagram: effect of radial distortion.]

Thus, we can express the distorted and normalized position (pdn) relative to the normalized object position s pobj/cam (xn, yn):

$$p_{dn} = \begin{pmatrix} d_r & 0 & d_{tx} \\ 0 & d_r & d_{ty} \\ 0 & 0 & 1 \end{pmatrix} s\,p_{obj/cam} = M_d\,s\,p_{obj/cam}$$

$$\text{where} \quad M_d = \begin{pmatrix} d_r & 0 & d_{tx} \\ 0 & d_r & d_{ty} \\ 0 & 0 & 1 \end{pmatrix} \quad \text{and} \quad s = \frac{1}{z_{obj/cam}}$$

Note that the object position has to be normalized before the distortions are applied, and that this must be done before the camera projection seen in the previous part. Indeed, the focal lengths, the center of the image and the skew coefficient are applied to the distorted and normalized position pdn instead of directly to the object position pobj/cam. Thus, it is not necessary to normalize again by s during the projection.

dr, dtx and dty are expressed as functions of the object position normalized along the z axis, s pobj/cam (xn, yn):

$$x_n = \frac{x_{obj/cam}}{z_{obj/cam}} \qquad y_n = \frac{y_{obj/cam}}{z_{obj/cam}} \qquad r = \sqrt{x_n^2 + y_n^2}$$

Thus, the distortion coefficients are calculated as shown below:

$$d_r = 1 + d_{r1}\,r^2 + d_{r2}\,r^4 + d_{r3}\,r^6 + \dots + d_{rn}\,r^{2n}$$

$$d_{tx} = 2\,d_{t1}\,x_n y_n + d_{t2}\,(r^2 + 2 x_n^2)$$

$$d_{ty} = d_{t1}\,(r^2 + 2 y_n^2) + 2\,d_{t2}\,x_n y_n$$

2.3.2.3 Conclusion

Using the equations from above, we can express a real 2D pixel point on the image pimg as a function of a 3D object point within the world pobj/world:

$$p_{img} = s\,P_{cam-px}\,M_d\,R_{cam}\,\langle I \mid -T_{cam} \rangle\,p_{obj/world} = s \begin{pmatrix} f_x & f_x \alpha_c & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} d_r & 0 & d_{tx} \\ 0 & d_r & d_{ty} \\ 0 & 0 & 1 \end{pmatrix} R_{cam}\,\langle I \mid -T_{cam} \rangle\,p_{obj/world} = M\,p_{obj/world}$$

M will be called the intrinsic matrix in the following.
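The whole chain from a world point to a pixel can be sketched as follows. This is an illustrative sketch only: the function names, the argument order and the representation of Rcam as a nested list are assumptions for the example, and the normalization by s = 1/zobj/cam is applied before the distortion, as explained above.

```python
def world_to_camera(r_cam, t_cam, p_world):
    """p_obj/cam = R_cam (p_obj/world - T_cam)."""
    d = [p_world[i] - t_cam[i] for i in range(3)]
    return [sum(r_cam[i][j] * d[j] for j in range(3)) for i in range(3)]

def distort(xn, yn, dr_coeffs=(), dt1=0.0, dt2=0.0):
    """Apply M_d to the normalized position (xn, yn)."""
    r2 = xn * xn + yn * yn
    dr, r_pow = 1.0, 1.0
    for k in dr_coeffs:            # dr = 1 + dr1 r^2 + dr2 r^4 + ...
        r_pow *= r2
        dr += k * r_pow
    dtx = 2.0 * dt1 * xn * yn + dt2 * (r2 + 2.0 * xn * xn)
    dty = dt1 * (r2 + 2.0 * yn * yn) + 2.0 * dt2 * xn * yn
    return dr * xn + dtx, dr * yn + dty

def project(p_world, r_cam, t_cam, fx, fy, cx, cy,
            alpha_c=0.0, dr_coeffs=(), dt1=0.0, dt2=0.0):
    """World point -> pixel, following p_img = s P_cam-px M_d R_cam <I|-T_cam> p."""
    x, y, z = world_to_camera(r_cam, t_cam, p_world)
    if z <= 0.0:
        raise ValueError("the point is not in front of the camera")
    xn, yn = x / z, y / z          # normalization by s = 1 / z
    xd, yd = distort(xn, yn, dr_coeffs, dt1, dt2)
    # P_cam-px with sin(theta) ~ 1 and skew coefficient alpha_c
    return fx * (xd + alpha_c * yd) + cx, fy * yd + cy
```

For example, with an identity rotation, a null translation, fx = fy = 500 and (cx, cy) = (320, 240), the point (0.1, -0.05, 1.0) projects close to the pixel (370, 215) when all the distortion coefficients are zero.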

2.3.3 Localization with one camera

Using one camera is a simple way to localize objects in space, even if it is necessary to know at least one dimension in distance units. It consists of four steps:
– calibrate the camera (mono-calibration)
– undistort the image
– find patterns
– calculate the position of the patterns

[Diagram: "Calibrate the camera" is done once; "Undistort the image", "Find patterns" and "Calculate the position of the patterns" form the real time application.]

As the diagram above shows, the calibration of the camera is only needed at the beginning of the process. In the following, we will describe each one of these steps.

2.3.3.1 Calibrate the camera (mono-calibration)

First of all, we have to calibrate the camera in order to know its intrinsic parameters. Calibration can also inform us about the extrinsic parameters but, as they are always variable, they are less interesting.

➢ What are the different methods of calibration?

There are several ways to calibrate a camera:
– photogrammetric calibration
– self-calibration

Photogrammetric calibration needs to know the exact geometry of the object in 3D space. Thus, this method is not flexible in an unknown environment and, moreover, it is difficult to implement because of the need for precise 3D objects. Self-calibration does not need any measurements and produces more accurate results. This is the method chosen here, calibrating with a chessboard, which has simple repetitive patterns. In the appendices, I describe the Tsai method, which is a simple way to proceed.

➢ Chessboard corner detection

A simple way to calibrate is to use a chessboard: the position of its corners on the image can be known accurately. In the appendices, you can find more details about this type of calibration.

Thus, the calibration informs us about the intrinsic parameters of the camera:
– the focal lengths on x (fx) and on y (fy)
– the position of the center (cx, cy)
– the skew coefficient (αc)
– the radial distortions (dr) and tangential ones (dtx and dty)

Calibration is only needed at the beginning of a process. Thus, the intrinsic parameters can be saved in an XML file.

2.3.3.2 Undistort the image

Undistorting the image is the first step of the real time application, which is executed at a specific speed. Distortions can be responsible for a bad arrangement of pixels within the image, which has a big impact on the accuracy of the localization. Thus, distortions have to be compensated by an undistortion process that remaps the image.

To do so, we use the distortion matrix Md and the projection matrix Pcam-px to relate the ideal pixel and the distorted one. If we consider piimg as the ideal pixel and pdimg as the distorted one, considering what we said previously, we have:

$$p^i_{img} = s\,P_{cam-px}\,p_{obj/cam} \quad \text{and} \quad p^d_{img} = s\,P_{cam-px}\,M_d\,p_{obj/cam}$$

The idea is to calculate the distorted pixel from the ideal one:

$$p^d_{img} = s\,P_{cam-px}\,M_d\,\frac{1}{s}\,P_{cam-px}^{-1}\,p^i_{img} = P_{cam-px}\,M_d\,P_{cam-px}^{-1}\,p^i_{img}$$

We cannot calculate the ideal pixel directly from the distorted one because the distortion matrix Md is calculated from the ideal pixels. Because of this problem, in order to find the ideal image, we need to use image mapping:

$$x^d_{img} = map_x(x^i_{img}, y^i_{img}) \quad \text{and} \quad y^d_{img} = map_y(x^i_{img}, y^i_{img})$$

Thus, we remap: each pixel (xiimg, yiimg) of the ideal image takes its value from the distorted image at the mapped position:

$$img^i(x^i_{img}, y^i_{img}) = img^d\big(map_x(x^i_{img}, y^i_{img}),\; map_y(x^i_{img}, y^i_{img})\big)$$

2.3.3.3 find patterns

Then, we have to find the patterns we want to focus on. We can either:
– find keypoints (Harris, Shi-Tomasi, SUSAN, SIFT, SURF, …)
– find keyregions (RANSAC, Hough, …)

As my project is oriented towards pattern recognition, I used SIFT and SURF to extract keypoints. I do not give more details about the corner detection methods, but they are divided in two types:
– Harris, Shi-Tomasi: a Harris matrix is calculated using the local gradients and, depending on the eigenvalues, it is possible to extract specific corners.
– SUSAN (Smallest Univalue Segment Assimilating Nucleus): it works with a SUSAN mask and generates a USAN (Univalue Segment Assimilating Nucleus) region area depending on the pixel intensities. It can be improved using shaped rings counting the number of intensity changes.

SIFT and SURF add robust pattern recognition that is invariant to scale, rotation, brightness and contrast, even if they require more computation than other corner detection methods.
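Coming back to the remapping of section 2.3.3.2, the construction of the maps and the remap itself can be sketched as below. This is a simplified illustration under stated assumptions: αc is taken as 0, the images are nested lists of grey levels, the nearest pixel is used instead of an interpolation, and all the function names are hypothetical. OpenCV provides `cv2.initUndistortRectifyMap` and `cv2.remap` for the same purpose.

```python
def distorted_pixel(xi, yi, fx, fy, cx, cy, dr_coeffs=(), dt1=0.0, dt2=0.0):
    """p_d = P_cam-px M_d P_cam-px^-1 p_i for one ideal pixel (alpha_c = 0)."""
    xn = (xi - cx) / fx            # P_cam-px^-1: pixel -> normalized position
    yn = (yi - cy) / fy
    r2 = xn * xn + yn * yn
    dr, r_pow = 1.0, 1.0
    for k in dr_coeffs:            # radial term dr = 1 + dr1 r^2 + dr2 r^4 + ...
        r_pow *= r2
        dr += k * r_pow
    xd = dr * xn + 2.0 * dt1 * xn * yn + dt2 * (r2 + 2.0 * xn * xn)
    yd = dr * yn + dt1 * (r2 + 2.0 * yn * yn) + 2.0 * dt2 * xn * yn
    return fx * xd + cx, fy * yd + cy   # P_cam-px: back to pixels

def build_maps(width, height, fx, fy, cx, cy, dr_coeffs=(), dt1=0.0, dt2=0.0):
    """map_x and map_y give, for each ideal pixel, its distorted position."""
    map_x = [[0.0] * width for _ in range(height)]
    map_y = [[0.0] * width for _ in range(height)]
    for yi in range(height):
        for xi in range(width):
            map_x[yi][xi], map_y[yi][xi] = distorted_pixel(
                xi, yi, fx, fy, cx, cy, dr_coeffs, dt1, dt2)
    return map_x, map_y

def remap_nearest(img_d, map_x, map_y):
    """Build the ideal image by sampling the distorted one (nearest pixel)."""
    height, width = len(img_d), len(img_d[0])
    img_i = [[0] * width for _ in range(height)]
    for yi in range(height):
        for xi in range(width):
            xd = int(round(map_x[yi][xi]))
            yd = int(round(map_y[yi][xi]))
            if 0 <= xd < width and 0 <= yd < height:
                img_i[yi][xi] = img_d[yd][xd]   # pixels mapped outside stay at 0
    return img_i
```

As the intrinsic parameters do not change, the two maps only need to be built once after the calibration; only `remap_nearest` has to run for each new image.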

2.3.3.4 calculate the position of the patterns

Knowing the projection matrix Pcam-px given by the calibration, it is possible to calculate 3D coordinates (xobj/cam, yobj/cam, zobj/cam) from pixels (ximg, yimg):

$$\begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix} = s\,P_{cam-px} \begin{pmatrix} x_{obj/cam} \\ y_{obj/cam} \\ z_{obj/cam} \end{pmatrix} \approx \frac{1}{z_{obj/cam}} \begin{pmatrix} f_x\,x_{obj/cam} + c_x\,z_{obj/cam} \\ f_y\,y_{obj/cam} + c_y\,z_{obj/cam} \\ z_{obj/cam} \end{pmatrix} \quad \text{if } \alpha_c \approx 0$$

In the following, we consider αc = 0.

➢ Knowing the distance from the object (zobj/cam), we can directly calculate the position of the object (xobj/cam, yobj/cam):

$$x_{obj/cam} = \frac{x_{img} - c_x}{f_x}\,z_{obj/cam} \quad \text{and} \quad y_{obj/cam} = \frac{y_{img} - c_y}{f_y}\,z_{obj/cam}$$

➢ Knowing one dimension of the object (dobj), we can calculate the distance from the object (zobj/cam):

$$z_{obj/cam} = \frac{f_0\,d_{obj}}{d_{img} - c_0}$$

where f0 and c0 are the mean values of fx, fy and cx, cy. The dimension dimg refers to the dimension dobj/cam seen on the image. The offset c0 applies when dimg is read as a pixel coordinate; when dimg is measured as a length between two pixels, the offsets cancel and zobj/cam = f0 dobj / dimg.

Thus, we see there are two main problems of the localization with one camera:
– either it is necessary to know the distance from the object zobj/cam
– or it is necessary to know at least one dimension of the object (dobj/cam) and to know how to extract it from the image (dimg).

In order to make the localization more autonomous, we need more than one camera, as we will see in the next part using stereovision.
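The two cases of the mono-camera localization can be illustrated as follows. It is a sketch under assumptions: αc = 0, the image is already undistorted, the function names are hypothetical, and the object dimension is measured in the image as a length in pixels between two points at the same depth, so that no offset is needed.

```python
def back_project(x_img, y_img, z, fx, fy, cx, cy):
    """Case 1: the distance z is known; recover (x, y, z) in the camera frame."""
    x = (x_img - cx) * z / fx
    y = (y_img - cy) * z / fy
    return x, y, z

def depth_from_known_size(d_obj, d_img_px, fx, fy):
    """Case 2: a real dimension d_obj is known and spans d_img_px pixels."""
    if d_img_px <= 0:
        raise ValueError("the dimension measured in the image must be positive")
    f0 = (fx + fy) / 2.0           # mean focal length, in pixels
    return f0 * d_obj / d_img_px
```

For instance, an object edge of 0.2 m that spans 100 pixels with fx = fy = 500 is about 1 m away, and the pixel (370, 215) at that distance corresponds to about (0.1, -0.05, 1.0) in the camera frame when (cx, cy) = (320, 240).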

2.3.4 Localization with two cameras: stereovision

Now we will focus on stereovision using two cameras (one on the left and another on the right): it enables autonomous localization even if it is more complicated to implement. It requires the same main steps as with one camera, with a few additions:
– mono and stereo-calibrate the cameras
– rectify, reproject and undistort the image
– find patterns in the left camera
– find the correspondences in the right camera
– calculate the position of the patterns

[Diagram: "Mono and stereo-calibrate the cameras" is done once; "Rectify, reproject and undistort the image", "Find patterns in the left camera", "Find the correspondences in the right image" and "Calculate the position of the patterns" form the real time application.]

As previously, the calibration is isolated from the real time application. In the following, I give more details about each one of these steps.

2.3.4.1 mono and stereo-calibrate the cameras

This operation consists, as its name suggests, in mono-calibrating each one of the cameras (left and right) and stereo-calibrating the system made up of the two cameras. A simple way to proceed is to first mono-calibrate each one of the cameras in order to have their respective intrinsic and extrinsic parameters. Then, the stereo-calibration uses their extrinsic parameters in order to know their relationship. This relationship is defined by:
– the relative position between the left camera and the right one Pl→r (rotation Rl→r and translation Tl→r)
– the essential and fundamental matrices E and F

[Diagram: object P seen from the optical centers Ol and Or, with positions Pl and Pr, and the relative translation Tl→r and rotation Rl→r between the cameras.]

In the diagram above, you can see Pl and Pr as the positions of the left and right cameras relative to the object P. Tl→r and Rl→r refer to the relative translation and rotation from the left camera to the right one.

The essential matrix E links the position of the left camera Pl and the right one Pr (it considers only extrinsic parameters):

$$P_r^T\,E\,P_l = 0$$

The fundamental matrix F links the image projections of the left camera pl and of the right camera pr (it considers both extrinsic and intrinsic parameters):

$$p_r^T\,F\,p_l = 0 \quad \text{where} \quad p_l = M_l\,P_l \quad \text{and} \quad p_r = M_r\,P_r$$

Ml and Mr refer to the intrinsic matrices of the left and right cameras.

The matrix F gives a clear relation between one pixel from the left camera and others from the right one. Indeed, one pixel on the left image corresponds to a line in the right one (called the right epipolar line lr), and likewise one pixel on the right image corresponds to a line in the left one (left epipolar line ll).

[Diagram: left image imgl with pixel pl and right image imgr with the corresponding right epiline lr; Ol and Or are the optical centers of the left and right cameras.]

In the diagram above, you can see a pixel pl from the left image imgl with its corresponding right epiline lr in the right image imgr. Ol and Or refer to the optical centers of the left and right camera respectively. Thus, thanks to the fundamental matrix F, we can calculate the epilines:
– if pl is a pixel from the left camera, the corresponding right epiline lr is F pl.
– if pr is a pixel from the right camera, the corresponding left epiline ll is FTpr.
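The two epiline relations can be illustrated with a few lines of Python. This is a sketch with hypothetical function names; a line is returned as (a, b, c) with a x + b y + c = 0, and the example matrix `F_RECTIFIED` is an assumption: it is the fundamental matrix, up to scale, of an ideal pair of cameras separated only by a shift on x.

```python
def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def right_epiline(f, p_left):
    """l_r = F p_l, with p_l = (x, y) extended to (x, y, 1)."""
    return mat_vec(f, [p_left[0], p_left[1], 1.0])

def left_epiline(f, p_right):
    """l_l = F^T p_r."""
    return mat_vec(transpose(f), [p_right[0], p_right[1], 1.0])

def epipolar_residual(f, p_left, p_right):
    """p_r^T F p_l: zero when p_right lies on the epiline of p_left."""
    a, b, c = right_epiline(f, p_left)
    return a * p_right[0] + b * p_right[1] + c

# Assumed example: cameras that differ only by a translation on x.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]
```

With `F_RECTIFIED`, the right epiline of the left pixel (100, 50) is (0, -1, 50), i.e. the horizontal line y = 50, which is the situation that the rectification described next aims to produce.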

Moreover, using the relative position Pl→r, we are able to calculate a rectification matrix and a reprojection matrix for each one of the cameras.

➢ The rectification compensates the relative position between the cameras in order to have only a shift on x between them. A rectification matrix, which is a rotation matrix, is applied to each one of the cameras (Rlrect and Rrrect).

➢ The reprojection unifies the projection matrices of the two cameras (Plcam-px and Prcam-px) in order to build a global one, Pglobalcam-px. Indeed, the goal here is to find the global camera equivalent to the stereovision system. A reprojection matrix is also applied to each one of the cameras (Plrepro and Prrepro).

➢ But why rectify and reproject?

As we are working with a stereovision system of two cameras, the goal is to have a simple relationship between them in order to:
– simplify the correspondence step (on the right image)
– but above all, easily calculate the disparity between the two images (step of position calculation)

Indeed, a translation only on x lets us predict more easily the position of the features within the right camera: it avoids having to use the fundamental matrix F and the epilines to make the correspondence. The fact that the cameras have just a shift on x between them places the epilines at the same y as the pixel from the other image, as the diagram below shows.

[Diagram: left image imgl with pixel pl and right image imgr with the right epiline lr at the same height; Ol and Or are the optical centers of the left and right cameras.]

In the appendices, there are more explanations about the epipolar geometry used in stereovision and about the calculation of the rectification and reprojection matrices.

2.3.4.2 rectify, reproject and undistort the image

This step is similar to the undistortion step seen for the mono-camera case: the difference here is that we also apply a rectification matrix and a reprojection matrix to each camera. Thus, our ideal pixel is now:

$$p^i_{img} = s_{global}\,P_{repro}\,R_{rect}\,p_{obj/cam}$$

Thus, we can express the distorted (and unrectified) pixel pdimg as a function of piimg:

$$p^d_{img} = s\,P_{cam-px}\,M_d\,\frac{1}{s_{global}}\,R_{rect}^T\,P_{repro}^{-1}\,p^i_{img}$$

Then, similarly to undistortion, we use image mapping.

2.3.4.3 find patterns in the left camera

This step is exactly similar to the case of one camera, but now we extract patterns from the left camera.

2.3.4.4 find the correspondences in the right camera

The correspondences on the right image can be calculated using:
– a matching function on epipolar lines
– an optical flow (Lucas-Kanade for example)

➢ Matching function on epipolar lines

The first method takes advantage of the rectification, which makes it possible to work on a horizontal line (constant y). Using pixels from the patterns found in the left image, it goes through the horizontal line of the right image at the same y: as we work on the right image, the correspondence points are expected to be on the left side of the position x of the pattern. A matching operation over a small window, the Sum of Absolute Differences (SAD), is calculated at each possible pixel of the right image (this operation is based on texture). Thus, the best correspondence is the candidate with the lowest SAD, i.e. the maximum of the matching function. The diagram below explains the principle.

[Diagram: search for the correspondence along a horizontal line of the right image.]

➢ Optical flow

Optical flow, or optic flow, is the pattern of apparent motion of objects, surfaces and edges in a visual scene caused by the relative motion between an observer and the scene. Thus, you can extract an optical flow using:
– several images from one camera at different instants
– several images from different cameras with different viewpoints

In stereovision, we are in the second case.
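For the matching function on epipolar lines described above, a minimal SAD search along one row can be sketched as below. The assumptions are that both images are rectified grey-level images stored as nested lists, that the window stays inside the images, and that the names `sad`, `match_on_row`, `half` and `max_disp` are illustrative only.

```python
def sad(left, right, xl, xr, y, half):
    """Sum of Absolute Differences between two (2*half+1)^2 windows."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            total += abs(left[y + dy][xl + dx] - right[y + dy][xr + dx])
    return total

def match_on_row(left, right, xl, y, half=1, max_disp=16):
    """Search the row y of the right image, to the left of xl, for the best match."""
    best_x, best_cost = None, None
    x_min = max(half, xl - max_disp)    # keep the window inside the image
    for xr in range(x_min, xl + 1):
        cost = sad(left, right, xl, xr, y, half)
        if best_cost is None or cost < best_cost:   # lowest SAD = best match
            best_x, best_cost = xr, cost
    return best_x, best_cost
```

The difference xl - best_x is the disparity of the pattern, which is used later to calculate its distance. Limiting the search with `max_disp` is a design choice that trades the minimal measurable distance against the computation time.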

The Lucas-Kanade method is an accurate method to find the velocity in a sequence of images, but it relies on several assumptions:
– brightness constancy: a pixel from a detected object has a constant intensity.
– temporal persistence: the image motion is small.
– spatial coherence: neighboring points in the world have the same motion, belong to the same surface and have neighboring positions on the image.

Using a Taylor series, we can express the intensity of the next image I(x+δx, y+δy, t+δt) as a function of the previous one I(x, y, t):

$$I(x + \delta x,\, y + \delta y,\, t + \delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x}\,\delta x + \frac{\partial I}{\partial y}\,\delta y + \frac{\partial I}{\partial t}\,\delta t$$

Using the assumption of brightness constancy and taking the relation above as an equality, we have:

$$I(x + \delta x,\, y + \delta y,\, t + \delta t) = I(x, y, t) \quad \text{and} \quad \frac{\partial I}{\partial x}\,V_x + \frac{\partial I}{\partial y}\,V_y + \frac{\partial I}{\partial t} = 0$$

$$\text{where} \quad V_x = \frac{\partial x}{\partial t} \quad \text{and} \quad V_y = \frac{\partial y}{\partial t}$$

As the Lucas-Kanade method focuses on local features (small window), we see why it is important to have small movements (temporal persistence assumption). As we have two unknowns for the components of the velocity (Vx and Vy), the equation cannot be solved directly. We need to use a region made up of several pixels (at least two pixels) where we consider that the velocity vector is constant (spatial coherence assumption).

In the following, I will consider:

$$I_x = \frac{\partial I}{\partial x}, \quad I_y = \frac{\partial I}{\partial y} \quad \text{and} \quad I_t = \frac{\partial I}{\partial t}$$

Then, we have for a region of n pixels:

$$\begin{pmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{xn} & I_{yn} \end{pmatrix} \begin{pmatrix} V_x \\ V_y \end{pmatrix} = \begin{pmatrix} -I_{t1} \\ -I_{t2} \\ \vdots \\ -I_{tn} \end{pmatrix}$$

In order to calculate the velocity, we need to calculate a Least Squares Solution:

$$\begin{pmatrix} V_x \\ V_y \end{pmatrix} = (A^T A)^{-1} A^T \begin{pmatrix} -I_{t1} \\ -I_{t2} \\ \vdots \\ -I_{tn} \end{pmatrix} \quad \text{where} \quad A = \begin{pmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{xn} & I_{yn} \end{pmatrix}$$

The Least Squares Solution works well for features such as corners but less well for flat zones. Thus, using the velocity, we can calculate the displacement vector d and so the corresponding pixels within the right image.

The Lucas-Kanade method can be improved to also handle big movements using the scale dimension. Indeed, in this case, using the previous formula applied to small regions, we can calculate the velocity (Vx, Vy) at a large scale and refine it several times before working on the raw image (pyramid Lucas-Kanade). The diagram below shows the principle of this method.

[Diagram: pyramid Lucas-Kanade, from the coarsest scale to the raw image.]
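For one region, the Least Squares Solution reduces to a 2x2 system, which can be solved in closed form. The sketch below assumes that the gradients Ix, Iy and It have already been computed for the n pixels of the region and are given as three lists; the function name is hypothetical.

```python
def lucas_kanade_velocity(ix, iy, it):
    """Solve (A^T A) V = A^T b with A = [Ix Iy] and b = -It for one region."""
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # A^T A is not invertible: flat zone or single edge direction
        raise ValueError("the region does not constrain the velocity")
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy
```

The determinant test makes the limitation mentioned above explicit: in a flat zone the gradients are almost null and AᵀA cannot be inverted, whereas a corner gives gradients in two directions and a well-conditioned system.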

2.3.4.5 calculate the position of the patterns

This step is the core of the use of stereovision for localization. The diagram below illustrates the stereovision system when the two images have been undistorted, rectified and reprojected.

[Diagram: rectified stereovision system with baseline T, focal length f, image positions xl and xr, and object at distance Z.]

We can find a geometrical relation between the cameras in order to have the distance from the object Z:

$$\frac{T - (x^l - x^r)}{Z - f} = \frac{T}{Z} \quad \text{thus} \quad Z = \frac{f\,T}{x^l - x^r}$$

The disparity between the two cameras is defined as:

$$d = x^l - x^r$$

Thus, knowing Z, similarly to the mono-camera case, we can calculate the complete position of the object relative to the camera (X, Y, Z):

$$X = \frac{x^l - c^l_x}{f}\,Z \quad \text{and} \quad Y = \frac{y^l - c^l_y}{f}\,Z$$

➢ disparity

The disparity d is an interesting parameter to measure as it is inversely proportional to the distance of the object Z:
– for an object far from the camera, it is near zero: the localization becomes less accurate.
– for an object close to the camera, it tends to infinity: the localization becomes very accurate.

Thus, we have to find a maximal distance up to which the object is detected with enough accuracy.
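The position calculation from the disparity can be sketched as follows, assuming undistorted, rectified and reprojected images, a single focal length f in pixels, a baseline T in distance units, and a hypothetical function name.

```python
def triangulate_rectified(xl, yl, xr, f, baseline, cx, cy):
    """Position (X, Y, Z) of a pattern from its left pixel and its disparity."""
    d = xl - xr                     # disparity d = x^l - x^r
    if d <= 0:
        raise ValueError("the disparity must be positive")
    z = f * baseline / d            # Z = f T / d
    x = (xl - cx) * z / f
    y = (yl - cy) * z / f
    return x, y, z
```

For example, with f = 500 pixels and T = 0.1 m, a disparity of 50 pixels gives Z = 1 m, and a disparity of 5 pixels gives Z = 10 m: an error of one pixel on the disparity changes Z by about 2 cm in the first case and by about 2 m in the second, which illustrates why distant objects are localized less accurately.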

4. interacting.1.Sylvain Cherrier – Msc Robotics and Automation 2. travelling.2. thinking. controlling. This robot is good for education purposes as it's done to be easy and intuitive to program.4.4 2. – NAOH25 (humanoid) : sensing. 2. interacting. A humanoid robot such as NAO can make more funny the courses motivating the students to work on practical projects : it allows to test very quickly some complicated algorithms on real time and using a real environment. being autonomous. NAO control Introduction The NAO is an autonomous humanoid robot built by the French company Aldebaran-Robotics.1 Hardware General Firstly.2. 2. It integrates several sensors and actuators which allow students to work on locomotion. audio and video signal treatment. – NAOH21 (humanoid) : sensing. it's necessary to know that there are several models of NAO which enable to do different kinds of applications : – NAOT2 (humanoid torso) : sensing. controlling. thinking. controlling. grasping This last model is the most evolved : it adds the ability to grasp objects to the humanoid (as the torso NAOT14 does also). thinking. interacting. grasping This model adds more practical aspect of the previous one with the capacity to work on control. traveling This first model of humanoid allows to make NAO move within the environment : a lot of applications are possible with for example mapping. – NAOT14 (humanoid torso) : sensing. NAO project – page 42/107 . thinking This model allows to focus on signal treatment and artificial intelligence. voice recognition and more.4. grasping.

[Figure: the four models NAOT2, NAOT14, NAOH21 and NAOH25.]

Notice: besides these different models of NAO, there are some modules which can be added to NAO, such as the laser head. The use of a laser improves the navigation of the robot in complex indoor environments.

Besides the model, the NAO robot refers to a specific version (V3+, V3.2, …). The robots available in the laboratory are NAOH25 version V3.2: this hardware is a good base for a lot of applications.

2.4.2.2 Features

The NAO robot (NAOH25) has the following features:
– height of 0.57 meters and weight of 4.5 kg
– sensors: 2 CMOS digital cameras, 4 microphones, 2 ultrasonic emitters and receivers (sonars), 2 infrared emitters and receivers, 1 inertial board (2 one-axis gyrometers and 1 three-axis accelerometer), 9 tactile sensors (buttons), 8 pressure sensors (FSR), 36 encoders (Hall effect sensors)

– actuators: brush DC motors
– other outputs: 1 voice synthesizer, LED lights, 2 high quality speakers
– connectivity: internet (Ethernet and WIFI) and infrared (remote control, other NAO, …)
– a CPU (x86 AMD GEODE 500 MHz CPU + 256 MB SDRAM / 2 GB Flash Memory) located in the head, with a Linux kernel (32bit x86 ELF) and the Aldebaran Software Development Kit (SDK) NAOqi
– a second CPU located in the torso
– a 55 Watt-hours Lithium ion battery (nearly 1.5 hours of autonomy)

The NAO is well equipped, which makes it very flexible in its applications. The motors are controlled by a dsPIC microcontroller and through their encoders (Hall effect sensors).

The robot has 25 Degrees Of Freedom (DOF):
– head: 2 DOF
– arm: 4 DOF in each arm
– pelvis: 1 DOF
– leg: 5 DOF in each leg
– hand: 2 DOF in each hand

For my project, I focused on the head at the beginning in order to work on object tracking. Afterwards, I planned to work on object grasping: this task requires more calculations to synchronise the actuators correctly. Because of the time available, I had to leave some of the control aside, as my main topic was image processing for object recognition and localization.

In the next part, I will give more details about the two cameras of the NAO robot as I focused on these sensors.

2.4.2.3 Video cameras

NAO has two identical video cameras which are located on its face. They provide a resolution of 640x480 and they can run at 30 frames per second. Moreover, the two cameras have these features:
– camera output: YUV422
– Field Of View (FOV): 58 degrees (diagonal)
– focus range: 30 cm → infinity
– focus type: fixed focus

The upper camera is oriented straight ahead through the head of NAO: it enables the robot to see in front of him. The lower camera, shifted down, is oriented towards the ground: it enables NAO to see what he meets on the ground close to him. We can notice that these cameras can't be used for stereovision as there is no overlap between their fields of view.

2.4.3 Softwares

There are several softwares which make it possible to program NAO very quickly and easily. They also allow the user to interact easily with the robot.

2.4.3.1 Choregraphe

Choregraphe is a cross-platform application that allows the user to edit NAO's movements and behaviors. A behavior is a group of elementary actions linked together following an event-based and a time-based approach.

It consists of a simple GUI (Graphical User Interface) based on the NAOqi SDK, which is the framework used to control NAO: we will focus on it later.

➢ Interactions

Choregraphe shows a NAO robot on the screen which imitates the actions of the real NAO in real time. It enables the user to use the simulated NAO to save time during the tests. This aspect allows the user to focus on the simulated NAO when the code has to be tested more quickly and safely.

Moreover, the software makes it possible to control each joint in real time from the GUI. It can also un-enslave the motors, which means imposing no stiffness on any motor. The interface displays the new joint values in real time even if the user moves the robot manually. Thanks to these features, the user can easily test the range of each joint, for example.

➢ Boxes

In Choregraphe, each movement and detection (action) can be represented as a box; these boxes are arranged in order to create an application. As mentioned previously, behaviors, which are groups of actions, can also be represented as boxes, but with several boxes inside.

In order to connect the boxes, they have input(s) and output(s) which generally allow the action to be started and finished. For specific boxes, such as the sensor or computation ones, there can be several outputs (the value of the detected signals, calculated variables, …) and several inputs (variables, conditions, …).

➢ Timeline

An interesting feature of Choregraphe is the possibility to create an action or a sequence of behaviours using the timeline. The screenshot above shows the timeline (marked as 1), the behavior layers (marked as 2) and the position on the frame (marked as 3).

Contrary to the system of boxes seen previously, the timeline gives a view of the time, which is the real conductor of the application. Thus, it's easier to manage several behaviors in series or in parallel.

Moreover, the timeline makes it possible to create an action from several positions of the joints of the robot. Indeed, it's only necessary to select a specific frame from the timeline, to put the NAO in a specific position and to save the joints. The software will automatically calculate the required speed of the actuators in order to follow the constraints of the frames. Editing a movement using keyframes makes it easy, for example, to implement a dance for NAO.

In the screenshot above, you can see the keyframes (squares) and several functions controlling the speed of the motors. It's also possible to edit the speed laws of each motor manually, even if it's necessary to be careful!

Moreover, Choregraphe offers other possibilities such as a video monitor panel which makes it possible to:
– display the images from the NAO cameras
– recognize an object, using as model the selection of a pattern on the image

2.4.3.2 Telepathe

Telepathe is an application that gives feedback from the robot, as well as sending basic orders. Like Choregraphe, it's a cross-platform GUI, but it is customizable (the user can load plugins in different widgets). These widgets or windows can be connected to different NAO robots.

Actually, Telepathe is a more sophisticated software than Choregraphe for the configuration of devices and the visualization of data. It has a direct link with the memory of NAO, which makes it possible to follow a lot of variables in real time.

➢ Memory viewer

The module ALMemory of NAO (which is part of NAOqi) makes it possible to write to and read from the memory of the robot. Telepathe uses this module to display variables with tables and graphs. The user can monitor some variables accurately during testing.

➢ Video configuration

Moreover, Telepathe makes it possible to configure the video camera in depth. Thus, it makes the testing of computer vision applications easier.

2.4.4 NAOqi SDK

The Software Development Kit (SDK) NAOqi is a robotic framework built especially for NAO. There are a lot of others for use in a large range of robots: Orocos, YARP (Yet Another Robot Platform), Urbi and others. They can be specialized in motion, in real-time or oriented towards AI, and they are sometimes open-source so that everyone can work on the source code.

NAOqi improves the management of the events and the synchronization of the tasks, as a real-time kernel can do. It works with a system of modules and around a shared memory (ALMemory module).


2.4.4.1 Languages

The framework mainly uses three languages:
– Python: as only an interpreter is needed to run the code (there is no compilation step), the language makes it possible to control NAO quickly and to manage parallelism more easily.
– C++: the language needs a compiler to build the program; it's used to program new modules, as modules work well with object-oriented languages and are expected to stay fixed.
– URBI (UObjects and urbiScript): URBI is a robotic framework like NAOqi which can be started from NAOqi (it's a NAOqi module called ALUrbi). It makes it possible to create UObjects, which refer to modules, and uses urbiScript to manage events and parallelism.

URBI is an interesting framework as it's compatible with a lot of robots (Aibo, Mindstorms NXT, NAO, Spykee, …) but in my internship, I focused on NAOqi, which has the same principle. Thus, in the following, I will focus on the Python and C++ languages, which use the core of NAOqi.

NAOqi is very flexible on its languages (cross-language) as C++ modules can be called from Python code (and vice versa): this simplifies the programming and the exchange of programs between people. Moreover, NAOqi is shipped with several libraries such as Qt or OpenCV, which make it possible respectively to create GUIs and to manipulate images.

2.4.4.2 Brokers and modules

NAOqi works using brokers and modules:
– a broker is an executable and a server that listens for remote commands on an IP address and a port.
– a module is both a class and a library which deals with a specific device or action of NAO.

Modules can be called from a broker using proxies: a proxy makes it possible to use methods from a local or remote module. Indeed, a broker can be executed on NAO but also on other NAO robots and on computers: this makes it possible to have a single program working on several devices.

An interesting property of the modules is that they can't call other modules directly: only proxies enable them to use methods from different modules, as proxies refer to the IP address and port of a specific broker.
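To make the relation between brokers, modules and proxies more concrete, here is a minimal sketch in plain Python. It is only a toy model of the idea described above and not the NAOqi API: the classes `Broker`, `Proxy` and `MotionModule`, the method `set_angle` and the address used in the usage example are hypothetical names chosen for the illustration.

```python
class MotionModule:
    """Hypothetical module: a class exposing methods for one device."""

    name = "motion"

    def __init__(self):
        self.angles = {}

    def set_angle(self, joint, value):
        # Store the requested joint angle and return it as acknowledgement.
        self.angles[joint] = value
        return value


class Broker:
    """Hypothetical broker: hosts modules and is identified by (ip, port)."""

    registry = {}  # stands in for the network: address -> broker

    def __init__(self, ip, port):
        self.address = (ip, port)
        self.modules = {}
        Broker.registry[self.address] = self

    def add_module(self, module):
        self.modules[module.name] = module

    def call(self, module_name, method, *args):
        # Dispatch a command to the method of one hosted module.
        return getattr(self.modules[module_name], method)(*args)


class Proxy:
    """Hypothetical proxy: reaches a module through its broker address."""

    def __init__(self, module_name, ip, port):
        self._broker = Broker.registry[(ip, port)]
        self._module_name = module_name

    def __getattr__(self, method):
        # Any unknown attribute is forwarded to the broker as a method call.
        def forwarded(*args):
            return self._broker.call(self._module_name, method, *args)
        return forwarded
```

A possible usage, with an arbitrary address: `main_broker = Broker("127.0.0.1", 9559)`, then `main_broker.add_module(MotionModule())`, then `Proxy("motion", "127.0.0.1", 9559).set_angle("HeadYaw", 0.5)`. The caller never touches the module object itself, only the address of the broker that hosts it.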


In the diagram above, you can see the architecture adopted by both NAO and its softwares; purple rectangles allude to brokers and green circles to modules. The broker mainBroker is the broker of the NAO robot, which is executed directly when the robot is switched on and has an IP address (Ethernet or WIFI).

This architecture is a safe one as the robot and each one of the softwares work with different brokers: if there is a crash on Choregraphe or on Telepathe, the robot is less likely to fall, as the brokers will not communicate for 100 ms.

By using NAOqi as a developer, we can choose to use the mainBroker (unsafe but fast) or to create our own remote broker. Personally, as I focused firstly on the actuators of the head for the tracking, I preferred to use the mainBroker as it was more intuitive and fast.

As the diagram above shows, there are a lot of modules available for NAO (mainBroker) but we can add more using the module generation from C++ classes. The modules often refer to devices available on the robot:
– sensors (ALSensors): tactile sensors
– motion (ALMotion): actuators
– leds (ALLeds): led outputs
– videoInput (ALVideoDevice): video cameras
– audioIn (ALSoundDetection): microphones
– AudioPlayer (ALAudioPlayer): speakers
– …


But there are others which are built for a specific purpose:
– interpret Python code: pythonbridge (ALPythonBridge)
– run computational processes: faceDetection (ALVisionRecognition), landmark (ALLandMarkDetection), …
– communicate with the lower level software controlling electrical devices: DCM (DCM)
– manage a shared memory between modules: memory (ALMemory)
– …

Concerning Telepathe, we find a memory module and a camera module, as expected.

In order for a module to make its methods available to the brokers, the methods need to be bound to the API (Application Programming Interface) of the module. This makes it possible to call the methods both locally and remotely.

As we said previously, Python code makes it possible to control NAO quickly because the NAOqi SDK provides a lot of Python functions to create proxies from one or several brokers and to use these proxies to execute methods. NAOqi provides the same kind of methods for C++, but more oriented towards the generation of modules.

2.4.4.3 Generation of modules in C++

First of all, the module can be generated using a Python script called module_generator.py. It automatically generates several C++ templates:
– projectNamemain.cpp: it makes it possible to create the module for a specific platform (NAO or PC).
– moduleName.h and moduleName.cpp: they are the files which will be customized by the user.

projectName and moduleName will be replaced by the desired names. Thus, as NAOqi provides module libraries and other libraries (OpenCV for example) for C++, the user can develop easily with the framework.

When the module is ready, it's necessary to compile it. In order to build modules, NAOqi uses CMake, which compiles sources using a configuration file (generally named CMakeLists.txt) and other options such as the location of the sources or of the binaries. The Python script module_generator.py also creates a CMakeLists.txt file which facilitates the compiling process. CMake can be executed with a command, as is the case on Linux, or it can open a GUI, as on Windows.


In order to compile modules, we use cross-compiling so as to make the code run on a specific platform (PC or robot). Indeed, the PC does not have the same modules as the robot as it does not have the same devices. Cross-compiling is done using a cmake file: toolchain-pc.cmake to compile a module for the PC and toolchain-geode.cmake for NAO.

On Linux, the CMake command has to mention another option in order to cross-compile:

cmake -DCMAKE_TOOLCHAIN_FILE=path/toolchain_file.cmake ..

After execution, CMake provides:
– a Makefile configuration file on Linux: it describes the compiler to use (g++ in our case) and a lot of other options, and it has to be executed by the make command.
– a Visual Studio Solution file (.sln) on Windows: it can be opened in Visual Studio and it's necessary to build it.

At the end of the process, both Operating Systems create a library file (.so on Linux and .dll on Windows). The problem of Windows in this case is that it can't provide a library for NAO as the robot uses .so libraries (Linux). Cross-compilation from Windows for NAO is impossible as NAO has a Linux OS: however, Cygwin on Windows can be an alternative.

The module library can then be moved into the lib folder of NAO or of the PC, and the configuration file autoload.ini has to be modified in order to load this module.

2.4.4.4 The vision module (ALVideoDevice)

As I focused on the vision, I present in this part the management of the NAO cameras through the vision module called ALVideoDevice.

The vision on NAO (ALVideoDevice module) is controlled by three modules:

– the driver V4L2
It acquires frames from a device: it can be a real camera, a simulated camera or a file. It uses a circular buffer which contains a specific number of image buffers (for NAO, there are four images). These images are updated at a specific frame rate, resolution and format depending on the features of the device.

– the Video Input Module (VIM)
This module manages the video device and the driver V4L2: it opens and closes the video device using the I2C bus, and it starts it in streaming mode (circular buffer) and stops it. It configures the driver for a specific device (real camera, simulated camera, file). Moreover, it can extract an image from the circular buffer (unmap buffer) at a specific frame rate and convert the image to a specific resolution and format. We can notice that several VIM can be created at the same time for different devices or for the same one; they are identified by a name.

– the Generic Video Module (GVM)
This module is the main interface between the video device and the user: it allows the user to choose a device, a frame rate, a resolution and a format; the GVM will configure the VIM accordingly. Moreover, this module enables the user to control the video device in local or remote mode. The difference lies in the data received: in local mode, the image data (from the VIM) can be accessed using a pointer whereas in remote mode, the image data has to be duplicated from the image of the VIM and it will be an ALImage.

The GVM also has methods to retrieve the raw images from the VIM (without any conversion): this improves the frame rate, even if it restricts the number of VIM depending on the number of buffers enabled by the driver.

The diagram below shows the three modules with their own thread (driver, VIM and GVM). The GVM can access the image in two ways:
– access 1, with conversion: by pointer if local and by ALImage variable if remote
– access 2, without conversion (raw image): by pointer if local and by ALImage variable if remote
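The circular buffer and the difference between local access (by pointer) and remote access (by copy) can be illustrated with a small Python sketch. This is a toy model under stated assumptions and not the ALVideoDevice implementation: the class `FrameRing` and its methods are hypothetical, and only the size of four images comes from the text above.

```python
import copy
from collections import deque


class FrameRing:
    """Hypothetical circular buffer holding the most recent frames."""

    def __init__(self, size=4):
        # deque with maxlen drops the oldest frame when a new one arrives.
        self._frames = deque(maxlen=size)

    def push(self, frame):
        self._frames.append(frame)

    def latest_local(self):
        # Local mode: hand out a reference to the stored frame (no copy).
        return self._frames[-1]

    def latest_remote(self):
        # Remote mode: the data has to be duplicated before it is sent.
        return copy.deepcopy(self._frames[-1])

    def __len__(self):
        return len(self._frames)
```

The local access is cheaper because nothing is copied, which is consistent with the remark that avoiding conversions and copies helps the frame rate.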

Figure: threads of the driver, the VIM and the GVM (camera, unmap buffer, format conversion, access 1 and access 2).

In my case, I used the GVM in remote and local mode with conversion. I used the front camera of NAO in RGB and BGR modes: as the camera is initially in YUV422, a conversion is needed.
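As an illustration of what such a conversion involves, here is a minimal pure-Python sketch. It assumes a packed byte order Y0 U Y1 V (two pixels sharing one U and one V value) and full-range BT.601 coefficients; both are assumptions made for the example, since the text only states that the camera output is YUV422. In practice the conversion is done by the GVM, not by user code.

```python
def clamp(value):
    """Round and limit a channel value to the 0..255 range."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    # Full-range BT.601 conversion (assumed for this sketch).
    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    return clamp(r), clamp(g), clamp(b)


def yuyv_to_rgb(data):
    """Convert packed YUV422 bytes (Y0 U Y1 V) to a list of RGB tuples."""
    pixels = []
    for i in range(0, len(data) - 3, 4):
        y0, u, y1, v = data[i:i + 4]
        # The two pixels of a group share the same chrominance.
        pixels.append(yuv_to_rgb(y0, u, v))
        pixels.append(yuv_to_rgb(y1, u, v))
    return pixels


def rgb_to_bgr(pixels):
    """BGR is the same data with the channel order reversed."""
    return [(b, g, r) for (r, g, b) in pixels]
```

Every group of four bytes produces two pixels, so a 640x480 frame in this format takes 640 × 480 × 2 bytes.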

3 Implementation and testing

In this part, I will focus on what hardware and software I used to accomplish my mission: object recognition and localization on a NAO robot for control purposes.

3.1 Languages, softwares and libraries

I worked in two ways:

– simulating and testing on Matlab
Matlab is a very useful tool to test algorithms as it's very easy to manipulate data and display it on figures. It offers a lot of toolboxes for several fields such as aerospace, advanced control, image processing, … Moreover, a lot of toolboxes are available on the web: this enables the users to quickly use programs as tools. Matlab also offers functions to read data from a serial port (serial) or from other hardware such as video cameras, which makes it possible to work with a large range of devices. Moreover, Matlab is cross-platform: it can work on Windows, Linux or Mac. It's cross-language as the mex command compiles C, C++ or Fortran code into functions that can be called from Matlab. All these aspects make Matlab very flexible for any application needing computation.

– real implementation in C++/Python with OpenCV for NAO
NAO can be programmed using both C++ and Python, as we saw in the NAO control part. C++ makes it possible to build modules for NAO whereas Python, which is directly interpreted, makes it possible to quickly write applications using modules.

In my case, I used Python for control as I didn't focus on control during my project. If there is a problem in the control part, it's very easy to modify the Python code as it's not necessary to compile it. I could have used Matlab to simulate the control as I did for the vision part but I didn't have enough time.

Moreover, as the vision was the core of my project, I chose to build a vision module in order to have the main features of the recognized object and to enable the user to choose what object to recognize, among other parameters. I used the Open source Computer Vision (OpenCV) library in order to implement the vision part more easily.

3.2 Tools and methods

3.2.1 Recognition

For recognition, I used two ways to test on Matlab:

– Firstly, I used an image (studied image) and a transformed version of it (cropped, rotated, scaled) as model image. Using this method, we can test both the feature extraction (SIFT/SURF) and the feature matching. This makes it easy to test the number of keypoints detected and the number of matches depending on the rotation, scale and size of the cropped image.

I focused first of all on the invariance in rotation for the test: I used several rotated images and calculated the number of keypoints and matches with the studied image (with SIFT). You can see below the keypoints detected for several rotations and the number of keypoints as a function of the rotation of the model image (green points: studied image, red points: model image, blue points: matches).
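Counting matches between the model image and the studied image requires a matching criterion between descriptors. The sketch below shows one common choice, nearest-neighbour matching with the distance ratio test proposed in David G. Lowe's publication (ratio of 0.8). It is an assumption that this criterion corresponds to the one used in the tests described above; the function name and the tiny two-dimensional descriptors are hypothetical (real SIFT descriptors have 128 values).

```python
import math


def match_descriptors(model, scene, ratio=0.8):
    """Return (model_index, scene_index) pairs passing the ratio test."""
    matches = []
    for i, descriptor in enumerate(model):
        # Sort scene descriptors by Euclidean distance to this descriptor.
        distances = sorted(
            (math.dist(descriptor, candidate), j)
            for j, candidate in enumerate(scene)
        )
        if len(distances) < 2:
            continue
        best, second = distances[0], distances[1]
        # Keep the match only if the best is clearly better than the second.
        if best[0] < ratio * second[0]:
            matches.append((i, best[1]))
    return matches
```

The number of matches for a given rotation is then simply `len(match_descriptors(model, scene))`, which is the quantity plotted against the rotation angle.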

The number of matches is reduced with the rotation of the image: the SIFT implementation was not efficient and I had to improve it.

After, I focused on the calculation of the 2D model features within the studied image: the model image was cropped, rotated and scaled depending on the inputs of the user. The matching could estimate these values and, in order to test it, we could calculate the translation, rotation and scale errors.
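The estimation of the 2D model features from matched keypoints can be sketched as a least squares fit of a similarity transform (translation, rotation, scale). The code below is a minimal illustration and not the Matlab code of the project: the function name is hypothetical, and the choice of a similarity model (rather than a more general affine one) is an assumption made to match the three errors mentioned above. Points are handled as complex numbers so that rotation and scale become a single complex factor.

```python
import cmath


def fit_similarity(model_pts, scene_pts):
    """Least squares fit of scene = scale * R(angle) * model + translation.

    Returns (scale, angle_in_radians, (tx, ty)).
    """
    p = [complex(x, y) for x, y in model_pts]
    q = [complex(x, y) for x, y in scene_pts]
    n = len(p)
    p_mean = sum(p) / n
    q_mean = sum(q) / n
    # Closed-form solution of min sum |q_i - (a * p_i + t)|^2 over a and t.
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den
    t = q_mean - a * p_mean
    return abs(a), cmath.phase(a), (t.real, t.imag)
```

Comparing the returned scale, angle and translation with the values chosen by the user gives the scale, rotation and translation errors.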

– Then, I used images acquired from the video camera of my computer and model images of objects. This makes it possible to test the algorithm on real 3D objects from the world. It shows the influence of the rotation of the object in 3D, which limits the recognition. The 2D features of the model within the acquisition are always interesting to calculate.

To make it simple, I focused firstly on plane objects (books for example) as only one model image is required. After, for specific objects such as robots, several model images are needed.

3.2.2 Localization

For localization, I didn't spend time testing on Matlab at first and I focused directly on the implementation in C++ using the Open source Computer Vision (OpenCV) library.

I used a stereovision system available in the laboratory (a previous MSc student, Suresh Kumar Pupala, had designed it for his dissertation). The photos below show the stereovision system and NAO wearing it: the system can be connected to a PC (USB ports). Thus, in order to use it with NAO, it's necessary to start NAOqi both on NAO and on the PC.

I also used Matlab to test programs for calibration and stereo correspondence. I wanted to learn more about mono-calibration, stereo-calibration and stereo correspondence as the OpenCV functions hide the processing. I thought that understanding the concepts would make me more comfortable with camera devices.

You can see below the screenshots showing the stereo-calibration and the stereovision using C++/OpenCV.
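For a calibrated and rectified stereo pair, the 3D position of a point follows directly from its disparity between the left and the right image. The sketch below shows this standard relation in Python; it assumes an ideal pinhole model after rectification, and the function name and the numerical values in the usage note are hypothetical (they are not the parameters of the laboratory system).

```python
def triangulate(u_left, v_left, u_right, focal_px, baseline_m, cx, cy):
    """Return (X, Y, Z) in metres in the left camera frame.

    u/v are pixel coordinates, focal_px the focal length in pixels,
    baseline_m the distance between the two cameras, (cx, cy) the
    principal point of the rectified left image.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity   # depth is inversely proportional to disparity
    x = (u_left - cx) * z / focal_px
    y = (v_left - cy) * z / focal_px
    return x, y, z
```

For example, with a focal length of 500 pixels, a baseline of 0.1 m and a principal point at (320, 240), a point seen at u = 370 in the left image and u = 320 in the right image has a disparity of 50 pixels and therefore lies 1 m in front of the cameras.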

The blue points allude to the corners found in the left image (left camera), the green points to the corresponding points in the right image found using stereo correspondence (right camera), and the red points to the 3D points deduced in the left image.

3.2.3 NAO control

My final goal was to use a vision module (enabling object recognition and localization) combined with a program in Python for NAO control. I wanted firstly to do a basic control for NAO (such as tracking the object with its head, moving to the object, saying what object he is watching) and after, if possible, a more complicated control (grasping an object for example).

In order to practice on NAO, I did a small vision module to track a red ball (HSV color space) and I made NAO focus on the barycenter of the object by rotating its head (pitch and yaw).
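The principle of this small tracking module can be sketched in pure Python as follows. This is an illustration under stated assumptions and not the C++ ballTracking module: the thresholds, the function names and the normalization of the error are hypothetical choices, and the image is a plain list of rows of (R, G, B) tuples.

```python
import colorsys


def is_red(r, g, b, hue_tol=0.05, sat_min=0.5, val_min=0.3):
    """Threshold one RGB pixel in HSV space (red hue wraps around 0)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return (h <= hue_tol or h >= 1.0 - hue_tol) and s >= sat_min and v >= val_min


def barycenter(image):
    """Return the (x, y) mean position of the red pixels, or None."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            if is_red(r, g, b):
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return sum_x / count, sum_y / count


def normalized_error(center, width, height):
    """Offset of the barycenter from the image centre, each in [-1, 1]."""
    cx, cy = center
    return 2.0 * cx / (width - 1) - 1.0, 2.0 * cy / (height - 1) - 1.0
```

The two components of the normalized error can then drive the yaw and pitch of the head, for example through a simple proportional correction, until the barycenter is at the centre of the image.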

In the screenshot above, you can see the original image on the left and the HSV mask on the right: tracking using the HSV colorspace is not perfect as it depends a lot on the brightness, but it was a simple project to practice on NAO.

3.3 Steps of the project

My project was divided in several steps:

– As we said before, the recognition is divided in two main steps: feature extraction and feature matching. I worked firstly on feature extraction with SIFT. I implemented SIFT on Matlab using a C++ code written by an Indian student in Computer Science, Utkarsh Sinha. Like me, he used the publication of David G. Lowe (« Distinctive Image Features from Scale-Invariant Keypoints »). I focused on a clear program instead of a fast one because I wanted to divide it in several distinct functions: by doing this, the program becomes more flexible.

– After, I studied the feature matching and I focused on the descriptor matching and the calculation of the 2D model features (Least Squares Solution). In order to implement it, I used a lot of references such as an MSc dissertation from a previous student in Robotics and Automation (« 3D-Perception from Binocular Vision », Suresh Kumar Pupala) or the publication from David G. Lowe.

– Then, I looked into the localization by learning the parameters of a video camera, the calibration and the stereovision. In order to implement it in C++, I used the Open source Computer Vision (OpenCV) library. I used a stereovision system adaptable to NAO's head which was available in the laboratory. I also did some experiments on Matlab in order to better understand the calibration and the stereo correspondence. Before linking the localization with the recognition, I focused only on corners: this makes it possible to study the evolution of the disparity and the values of the 3D positions.

– After, I learned about the control of NAO and I tested a small module of recognition in C++ (ballTracking) combined with a control part in Python. The project consisted in tracking a red ball on the image from the camera (HSV color space), calculating the barycenter, and controlling the head of NAO in order to make it focus its camera on the object. The goal of the small project was to deal with the use of the NAO camera or another camera (laptop camera), in remote or local mode, and to build a simple control of the head of NAO.

– Finally, I studied another invariant feature extraction method (SURF). Contrary to SIFT, I didn't implement it on Matlab; I only downloaded the OpenSURF library, which is available as either Matlab code or C++. As it was quicker than SIFT, I used it to study the feature matching.

– Later, I learned about the Generalized Hough Transform, which makes it possible to do keypoint matching focusing on the position and the orientation of the keypoints. I studied firstly the algorithms outside the matching and after I integrated them into it. I used only Matlab to test them.

My work is not finished as I didn't implement the feature matching in C++. The goal is then to replace the corner detection of the localization by SURF feature extraction and to add the feature matching before doing the stereo correspondence on the right image: the recognition will be done on the left image only. Thus, combining recognition and localization, we will have keypoints localized within the space; this makes it possible to estimate the average position of the object and maybe more (3D orientation, 3D size, …).

Figure: steps of the project.
– SIFT: SIFT on Matlab, some testing
– Feature matching: descriptor matching, 2D features calculation
– Localization: parameters of a video camera, calibration, stereovision
– NAO control: ball tracking project
– SURF: use of OpenSURF library, some testing
– Feature matching: keypoint matching
– Future work: feature matching in C++, SURF and feature matching combined with stereovision in a module for NAO

4 Conclusion

The project was built around three subjects:
– object recognition
– object localization
– NAO control

I spoke firstly about these subjects before presenting what I did and what remains to be done. I used different methods to deal with each one of these subjects: I preferred to simulate the recognition more on Matlab whereas I chose to practice the localization and the control of NAO quickly (using C++, OpenCV and Python). However, dealing with the localization, I also used Matlab in order to better understand what was inside the calibration and stereo-calibration.

I can conclude that this project allowed me to learn more about the use of video cameras for robotic purposes. Indeed, recognition is useful at the beginning to track the desired object and localization enables the robot to interact with it using control: the behavior of the robot will depend on these three parts (recognition, localization and control).

Concerning my project, as I said when I described its steps, it's not finished: it needs some implementations in C++ and some testing. I hope it will be completed. Contrary to SIFT, SURF is available in open-source implementations such as OpenSURF: thus, the recognition I presented is not only academic and could be used by companies.


5 References
– « 3D-Perception from Binocular Vision » (Suresh Kumar Pupala)
– Website of Utkarsh Sinha http://www.aishack.in/
– « A simple camera calibration method based on sub-pixel corner extraction of the chessboard image » (Yang Xingfang, Huang Yumei and Gao Feng) http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5658280&tag=1
– University of Nevada, Reno http://www.cse.unr.edu/~bebis/CS791E/Notes/EpipolarGeonetry.pdf
– « A simple rectification method of stereo image pairs » (Huihuang Su, Bingwei He) http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5678343
– OpenCV tutorials (Noah Kuntz) http://www.pages.drexel.edu/~nk752/tutorials.html
– Learning OpenCV (Gary Bradski and Adrian Kaehler)
– OpenCV 2.1 C Reference http://opencv.willowgarage.com/documentation/c/index.html
– Structure from Stereo Vision using Optical Flow (Brendon Kelly) http://www.cosc.canterbury.ac.nz/research/reports/HonsReps/2006/hons_0608.pdf
– « Distinctive Image Features from Scale-Invariant Keypoints » (David G. Lowe) http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf
– Feature Extraction and Image Processing (Mark S. Nixon and Alberto S. Aguado)
– Speeded-Up Robust Features (SURF) (Herbert Bay, Andreas Ess, Tinne Tuytelaars and Luc Van Gool) ftp://ftp.vision.ee.ethz.ch/publications/articles/eth_biwi_00517.pdf
– Aldebaran official website (Aldebaran) http://www.aldebaran-robotics.com/
– Aldebaran Robotics (Aldebaran) http://users.aldebaran-robotics.com/docs/site_en/index_doc.html
– URBI (Gostai) http://www.gostai.com/downloads/urbi-sdk/2.x/doc/urbi-sdk.htmldir/index.html#urbiplatforms.html
– NAO tutorials (Robotics Group of the University of León) http://robotica.unileon.es/mediawiki/index.php/Nao_tutorial_1:_First_steps


6 Appendices

6.1 SIFT: a method of feature extraction

6.1.1 Definition and goal

The goal is to find keypoints which are invariant to the scale, rotation and illumination of the object.

In pattern recognition, SIFT has two advantages:
– it can identify complex objects in a scene using specific keypoints
– it can identify the same object at several positions and rotations of the camera

However, it has one main disadvantage: the processing time. SURF (Speeded Up Robust Feature) relies on the same principle as SIFT and is quicker but, in order to be invariant in rotation, it has to use FAST keypoint detection.

6.1.2 Description

The SIFT process can be divided in two main steps which we will study in detail:
– find the location and orientation of the keypoints (keypoint extraction)
– generate a specific descriptor for each one (descriptor generation)

Indeed, we first of all have to know the location of the keypoints on the image but, after that, we have to build a very discriminating ID for them: that's why we need one descriptor for each keypoint. We can have a lot of keypoints, so we have to be very selective in our choice. With SIFT, one descriptor is a vector of 128 values.

6.1.3 Keypoint extraction

This first process is divided in two main steps:
– find the location of the keypoints
– find their orientation

These two steps have to be done very carefully because they have a big impact on the descriptor generation we will see later.

6.1.3.1 Keypoint location

Firstly, in order to find the location of the keypoints, we will have to work on the laplacian of gaussian.

6.1.3.1.1 The laplacian of gaussian (LoG) operator

The laplacian operator can be explained thanks to the following formula:

$$L(x, y) = \nabla^2 I(x, y) = \frac{\partial^2 I(x, y)}{\partial x^2} + \frac{\partial^2 I(x, y)}{\partial y^2}$$

The laplacian detects pixels with a rapid intensity change: they match with extrema of the laplacian. However, as it's a second derivative measurement, it's very sensitive to noise, so before applying the laplacian, it's usual to blur the image in order to remove the high frequency noise. In order to do this, we use a gaussian operator which blurs the image in 2D according to the parameter $\sigma$:

$$G(x, y, \sigma) = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2 \sigma^2}}$$

We call the resulting operator (gaussian followed by laplacian) the laplacian of gaussian (LoG). We can calculate the laplacian of gaussian:

$$\nabla^2 G(x, y, \sigma) = \frac{\partial^2 G(x, y, \sigma)}{\partial x^2} + \frac{\partial^2 G(x, y, \sigma)}{\partial y^2} = \frac{x^2 + y^2 - 2 \sigma^2}{2 \pi \sigma^6} e^{-\frac{x^2 + y^2}{2 \sigma^2}}$$

Actually, in the SIFT method, we don't calculate the laplacian of gaussian but we focus on the difference of gaussian (DoG), which has more advantages, as we will see in the next part.
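The closed form of the laplacian of gaussian can be checked numerically. The following Python sketch compares it with a finite-difference laplacian of the gaussian; the function names and the test point are arbitrary choices made for this illustration.

```python
import math


def gaussian(x, y, sigma):
    """2D gaussian G(x, y, sigma) as defined above."""
    return math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def log_closed_form(x, y, sigma):
    """Laplacian of gaussian from the closed-form expression."""
    r2 = x * x + y * y
    return (r2 - 2 * sigma ** 2) / (2 * math.pi * sigma ** 6) * math.exp(-r2 / (2 * sigma ** 2))


def log_numeric(x, y, sigma, h=1e-3):
    """Laplacian of the gaussian estimated with central differences."""
    return (
        gaussian(x + h, y, sigma) + gaussian(x - h, y, sigma)
        + gaussian(x, y + h, sigma) + gaussian(x, y - h, sigma)
        - 4 * gaussian(x, y, sigma)
    ) / (h * h)
```

The closed form is negative at the centre and changes sign where $x^2 + y^2 = 2\sigma^2$, which gives the operator its characteristic shape.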

6.1.3.1.2 The operator used in SIFT: the difference of gaussian (DoG)

Using the heat diffusion equation, we have an expression of the laplacian of gaussian:

$$\frac{\partial G(x, y, \sigma)}{\partial \sigma} = \sigma \nabla^2 G(x, y, \sigma) \quad (1)$$

where $\frac{\partial G(x, y, \sigma)}{\partial \sigma}$ can be calculated thanks to a difference of gaussians:

$$\frac{\partial G(x, y, \sigma)}{\partial \sigma} \approx \frac{G(x, y, k \sigma) - G(x, y, \sigma)}{k \sigma - \sigma} \quad (2)$$

From equations (1) and (2), we deduce:

$$G(x, y, k \sigma) - G(x, y, \sigma) \approx (k - 1) \sigma^2 \nabla^2 G(x, y, \sigma) \quad (3)$$

Using the difference of gaussian instead of the laplacian, we have three main advantages:
– it reduces the processing time because the laplacian requires two derivatives
– it's less sensitive to noise than the laplacian
– it's already scale invariant (normalization by $\sigma^2$ shown in equation (3)). We can notice that the factor $k - 1$ is constant in the SIFT method, so it doesn't have an impact on the location of the extrema.

6.1.3.1.3 The resulting scale space

In order to generate these differences of gaussian, we have to build a scale space. Instead of blurring the raw image several times, we resize it several times in order to reduce the calculation time. Indeed, the more we blur, the more high frequency details we lose, and the better we can approximate the image by reducing its size.

Several constants have to be specified:
– the number of octaves (octavesNb), specifying the number of scales to work with
– the number of intervals (intervalsNb), specifying the number of local extrema images to generate for each octave

Of course, the more octaves and intervals there are, the more keypoints are found but the longer the processing time. Below, you can see the structure of the scale space for gaussians.
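Equation (3) can be checked numerically with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, and the value of k used in the usage note is chosen close to 1 because relation (3) comes from a first-order approximation of the derivative.

```python
import math


def gaussian(x, y, sigma):
    """2D gaussian G(x, y, sigma)."""
    return math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def laplacian_of_gaussian(x, y, sigma):
    """Closed form of the laplacian of the 2D gaussian."""
    r2 = x * x + y * y
    return (r2 - 2 * sigma ** 2) / (2 * math.pi * sigma ** 6) * math.exp(-r2 / (2 * sigma ** 2))


def difference_of_gaussian(x, y, sigma, k):
    """Left-hand side of equation (3)."""
    return gaussian(x, y, k * sigma) - gaussian(x, y, sigma)


def scaled_log(x, y, sigma, k):
    """Right-hand side of equation (3)."""
    return (k - 1) * sigma ** 2 * laplacian_of_gaussian(x, y, sigma)


def relative_gap(x, y, sigma, k):
    """Relative difference between both sides of equation (3)."""
    dog = difference_of_gaussian(x, y, sigma, k)
    return abs(dog - scaled_log(x, y, sigma, k)) / abs(dog)
```

With sigma = 1.6 and k = 1.01, both sides agree to within a few percent near the centre of the kernel; the agreement degrades as k grows, since the finite difference in (2) is then a coarser estimate of the derivative.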

[Figure: structure of the gaussian scale space. Octave o (o = 0 … octavesNb − 1) works on an image whose size is 2^(−o) times the original size, blurred with 2^o · 2^(i/intervalsNb) σ for i = 0 … intervalsNb + 2.]

In order to generate intervalsNb images of local extrema, we have to generate intervalsNb + 2 differences of gaussians (DoG) and intervalsNb + 3 gaussians. Indeed, two gaussians are subtracted to generate one DoG and three DoG are needed to find local extrema.

[Figure: for each octave, intervalsNb + 3 gaussian images give intervalsNb + 2 DoG images, whose minima and maxima give intervalsNb images of local extrema.]
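The bookkeeping of the scale space can be sketched as follows. This is a minimal illustration, assuming the blur levels follow the pattern shown in the figure (2^o · 2^(i/intervalsNb) σ); the function names and the example values are hypothetical.

```python
def scale_space_sigmas(sigma, octaves_nb, intervals_nb):
    """Absolute blur of every gaussian image, listed octave by octave."""
    sigmas = []
    for octave in range(octaves_nb):
        octave_sigmas = [
            sigma * 2 ** octave * 2 ** (i / intervals_nb)
            for i in range(intervals_nb + 3)
        ]
        sigmas.append(octave_sigmas)
    return sigmas

def image_counts(intervals_nb):
    """Number of gaussian, DoG and local-extrema images per octave."""
    gaussians = intervals_nb + 3
    dogs = gaussians - 1   # one DoG per pair of adjacent gaussians
    extrema = dogs - 2     # each extrema image needs a DoG below and above
    return gaussians, dogs, extrema

# Hypothetical settings: sigma = 1.6, 4 octaves, 2 intervals per octave.
for octave_sigmas in scale_space_sigmas(1.6, 4, 2):
    print([round(s, 3) for s in octave_sigmas])
print(image_counts(2))
```

With these settings the third gaussian of an octave has the same blur as the first gaussian of the next octave, which is what allows the next octave to start from a downsized image.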

Comment: prior smoothing

In order to increase the number of stable keypoints, a prior smoothing of the raw image can be done. David G. Lowe found experimentally that a gaussian of σ = 1.6 could increase the number of stable keypoints by a factor of 4. In order to keep the highest frequencies, the size of the raw image is doubled. Thus, two gaussians are applied:
– one before doubling the size of the image, for anti-aliasing (σ = 0.5)
– one after, for pre-blurring (σ = 1.0)

6.1.3.1.4 Keep stable keypoints

Local extrema can be numerous for each scale, so we have to be selective in our choice. There are several ways to be more selective:
– keep keypoints whose DoG has enough contrast
– keep keypoints whose DoG is located on a corner

In order to find whether a keypoint is located on a corner, we use the Hessian matrix:

$$H(x, y) = \begin{pmatrix} \frac{\partial^2 I(x, y)}{\partial x^2} & \frac{\partial^2 I(x, y)}{\partial x \partial y} \\ \frac{\partial^2 I(x, y)}{\partial y \partial x} & \frac{\partial^2 I(x, y)}{\partial y^2} \end{pmatrix} = \begin{pmatrix} dxx & dxy \\ dxy & dyy \end{pmatrix}$$

Using the classical definition of the derivative, we can estimate the different terms of the Hessian matrix:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} = \lim_{h \to 0} \frac{f(x + h) - f(x - h)}{2h} \quad (8)$$

Using (8) with h = 0.5, we deduce:

$$dx_{h=0.5}(x, y) = I(x + 0.5, y) - I(x - 0.5, y)$$

So:

$$dxx \approx dx_{h=0.5}(x + 0.5, y) - dx_{h=0.5}(x - 0.5, y) = I(x + 1, y) + I(x - 1, y) - 2 I(x, y)$$

We have similarly:

$$dyy \approx I(x, y + 1) + I(x, y - 1) - 2 I(x, y)$$

And then, in order to calculate dxy, we can use h = 1:

$$dx_{h=1}(x, y) = \frac{I(x + 1, y) - I(x - 1, y)}{2} \quad \text{and} \quad dxy \approx \frac{dx_{h=1}(x, y + 1) - dx_{h=1}(x, y - 1)}{2}$$

So:

$$dxy \approx \frac{I(x + 1, y + 1) + I(x - 1, y - 1) - I(x + 1, y - 1) - I(x - 1, y + 1)}{4}$$

After that, we can express the trace and determinant of the Hessian matrix thanks to its eigenvalues α and β:

$$tr(H) = dxx + dyy = \alpha + \beta \quad \text{and} \quad det(H) = dxx \, dyy - dxy^2 = \alpha \beta$$

Thus, we can estimate a curvature value C supposing α = r β where r ≥ 1:

$$C = \frac{tr(H)^2}{det(H)} = \frac{(\alpha + \beta)^2}{\alpha \beta} = \frac{(r + 1)^2}{r} = f(r)$$

The function f has its minimal value when r = 1, i.e. when the two eigenvalues are equal. The closer they are, the more the corresponding keypoint looks like a corner; the more they differ, the more it looks like an edge. Thus, we can be more or less selective depending on the value of r we choose to threshold C:

$$C < f(r_{threshold})$$

David G. Lowe used r = 10 in his experiments.

Moreover, if det(H) < 0, the eigenvalues have opposite signs; in this case also, the point is discarded as not being an extremum. Then, in the case of flat zones, one or both eigenvalues are near zero, so these zones can be identified by det(H) ≈ 0.

To conclude, after this step, we know the position of the keypoints on their specific octave and the corresponding scale where they were found. The scale is calculated from the octave and interval numbers.
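The corner test above can be sketched in a few lines. This is only an illustration on a plain list-of-lists image (indexed as img[y][x], an assumption of the example); the function names are hypothetical, and r = 10 is the value quoted from Lowe.

```python
def hessian_terms(img, x, y):
    """Finite-difference estimates of dxx, dyy and dxy at pixel (x, y)."""
    dxx = img[y][x + 1] + img[y][x - 1] - 2 * img[y][x]
    dyy = img[y + 1][x] + img[y - 1][x] - 2 * img[y][x]
    dxy = (img[y + 1][x + 1] + img[y - 1][x - 1]
           - img[y - 1][x + 1] - img[y + 1][x - 1]) / 4
    return dxx, dyy, dxy

def is_corner_like(img, x, y, r_threshold=10.0):
    """Keep the point only if tr(H)^2 / det(H) < (r + 1)^2 / r."""
    dxx, dyy, dxy = hessian_terms(img, x, y)
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Opposite-sign eigenvalues, or a flat/edge zone with det close to 0.
        return False
    return trace * trace / det < (r_threshold + 1) ** 2 / r_threshold

blob = [[0, 0, 0], [0, 10, 0], [0, 0, 0]]      # isolated peak
ridge = [[0, 10, 0], [0, 10, 0], [0, 10, 0]]   # vertical edge-like ridge
print(is_corner_like(blob, 1, 1), is_corner_like(ridge, 1, 1))
```

In a real implementation the test would be applied to the DoG image in which the extremum was found rather than to a toy array.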

6.1.3.2 Keypoint orientation

Once the location of the keypoints is known (position on the octave and scale), we have to find their orientation. Finding the keypoint orientation can be explained using three steps:
– gradient parameters (magnitude and orientation) are calculated around all the pixels of the gaussians
– feature windows around each keypoint are generated (one feature window for magnitude and another for orientation)
– the orientation of each keypoint (center of the feature windows) is estimated by averaging the orientations within its corresponding feature window

6.1.3.2.1 Calculation of gradient parameters within gaussians

As mentioned previously, the gradient parameters refer to the magnitude and the orientation. Gradients allow us to classify contrast changes on the image:
– the magnitude shows how big the changes are
– the orientation refers to the direction of the main change

Thus, a gradient can be represented as a vector whose position, magnitude and orientation are specified.

[Figure: a gradient drawn as a vector of magnitude M and orientation θ in the (x, y) frame.]

Comment: gradients on the image are oriented from black to white.

In order to calculate the gradient parameters, we only have to calculate the first derivative for each pixel of the image. Using equation (8) shown before, we deduce with h = 1:

$$dx = \frac{\partial I(x, y)}{\partial x} \approx \frac{I(x + 1, y) - I(x - 1, y)}{2} \quad \text{and} \quad dy = \frac{\partial I(x, y)}{\partial y} \approx \frac{I(x, y + 1) - I(x, y - 1)}{2}$$

Knowing dx and dy, we deduce the magnitude (M) and orientation (θ):

$$M = \sqrt{dx^2 + dy^2} \quad \text{and} \quad \theta = \arctan\left(\frac{dy}{dx}\right)$$

6.1.3.2.2 Generation of keypoint feature windows

Now, we will describe the way we create the feature window. Two parameters have to be studied:
– the gaussian to apply to the feature windows
– the size of the feature windows

➔ Why apply a gaussian ?

It gives a better average of the orientation of the keypoint. Indeed, the keypoint (located at the center of the feature window) has to have the highest weight whereas the pixels located near the limits of the feature window have to have the smallest weight. Then, instead of using the magnitude of each orientation directly, we use a weight W:

$$W = G * M$$

with G being the gaussian mask and M the magnitudes of the entire image.
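The central-difference gradient, magnitude and orientation described above can be sketched as follows. The example assumes a list-of-lists image indexed as img[y][x] and uses atan2 instead of arctan(dy/dx), a deliberate substitution that covers the full 0 to 360 degree range and avoids a division by zero when dx = 0; the function name is hypothetical.

```python
import math

def gradient_parameters(img, x, y):
    """Magnitude and orientation (degrees in [0, 360)) of the gradient at (x, y)."""
    dx = (img[y][x + 1] - img[y][x - 1]) / 2
    dy = (img[y + 1][x] - img[y - 1][x]) / 2
    magnitude = math.hypot(dx, dy)
    orientation = math.degrees(math.atan2(dy, dx)) % 360
    return magnitude, orientation

horizontal_ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]   # brighter to the right
vertical_ramp = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]     # brighter towards larger y
print(gradient_parameters(horizontal_ramp, 1, 1))
print(gradient_parameters(vertical_ramp, 1, 1))
```

Both ramps go from dark to bright, so the returned orientations point from black to white, as stated in the comment of the previous part.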

➔ How to choose the σ of the gaussian and the size of the feature window ?

The σ to choose depends on the scale we are working at. Indeed, the more the image is blurred (i.e. higher scale, or higher octave and interval numbers), the fewer details we have on the image and the more pixels we need to estimate the average orientation. David G. Lowe uses:

$$\sigma_{kpFeatureWindow} = 1.5 \, \sigma$$

where σ is the absolute sigma of the corresponding image studied (depending on the scale).

The size of the feature window is chosen depending on σ_kpFeatureWindow. Indeed, the gaussian can be considered as negligible beyond a specific distance from the center of the feature window. Thus, the size can be approximated by:

$$size_{kpFeatureWindow} = 3 \, \sigma_{kpFeatureWindow}$$

6.1.3.2.3 Averaging orientations within each feature window

From the feature windows we have generated, we can now estimate an average of the orientation of each keypoint. This is done thanks to a histogram of 36 bins. Each bin covers a range of 10 degrees, so that the entire histogram covers orientations from 0 degrees to 360 degrees. If an orientation matches a specific bin, its weight is added to the corresponding bin.

[Figure: orientation histogram, weight against orientation (°), with bins starting at 0, 11, 21, …, 351.]

The maximum of the histogram gives the average orientation of the keypoint.

Comment: Accuracy

As we work with a histogram, we must be careful about accuracy: we can't accept an error of 10 degrees, which is the width of a bin. Thus, in order to approximate the real maximum, we use the left and right neighbouring bins of the matched one in order to calculate the parabola which goes through these three points. Thanks to the parabolic equation, it's then easy to calculate the maximum.

The accuracy can also be improved by filling the bins more accurately. Indeed, as we have a real-valued orientation at the beginning, it's always located between two bin centers. Thus, we have to compare this value with the middle of each neighbouring bin and spread the weight accordingly.

[Figure: an orientation bin(θ) falling between bins n−1, n and n+1, whose centers are m_{n−1}, m_n and m_{n+1}.]
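The parabolic refinement of the histogram maximum can be sketched as below. It is a minimal illustration, assuming that bin n covers [10n, 10n + 10) degrees with its center at 10n + 5, and that the histogram wraps around at 360 degrees; the function name and the sample histogram are hypothetical.

```python
def refine_peak(hist, bin_width=10.0):
    """Orientation (degrees) of the histogram maximum, refined with a parabola."""
    count = len(hist)
    n = max(range(count), key=hist.__getitem__)
    left = hist[(n - 1) % count]    # circular neighbours
    center = hist[n]
    right = hist[(n + 1) % count]
    # Vertex of the parabola through (-1, left), (0, center), (1, right).
    denom = left - 2 * center + right
    offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
    return ((n + 0.5 + offset) * bin_width) % (count * bin_width)

hist = [0.0] * 36
hist[3], hist[4], hist[5] = 1.0, 3.0, 2.0
print(refine_peak(hist))   # pulled towards bin 5 because it outweighs bin 3
```

When the two neighbours have the same weight, the refined value is simply the center of the winning bin.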

As we can see with the diagram above, the weight we have to add to bins n−1 and n can be calculated easily:

$$weight_n = weight_n + \frac{|bin(\theta) - m_{n-1}|}{m_n - m_{n-1}} \, W = weight_n + |bin(\theta) - m_{n-1}| \, W$$

$$weight_{n-1} = weight_{n-1} + |bin(\theta) - m_n| \, W$$

6.1.4 Descriptor generation

This second process is divided in several steps:
– calculate the interpolated gradient parameters of each gaussian (interpolated magnitudes and orientations)
– generate the descriptor feature windows
– generate the descriptor of each keypoint

6.1.4.1 Calculation of interpolated gradient parameters within gaussians

➔ Why calculate the interpolated gradient parameters ?

Contrary to the previous case with keypoints, in order to generate the descriptors, we consider the keypoint as a subpoint. The pixel located at its position doesn't interest us, but its neighbours do.

[Figure: the keypoint (Kp) located between its four neighbouring pixels (Neig).]

yI  x1.2 Generation of descriptor feature windows As we said before. y ≈ I inter  x . yI  x . y =I  x0. y  I inter  x−1. y I x . y −1−I  x . y =I  x−1. y−1=I  x . y −2I  x . David G. The x gradient dx and y one dy are determined using the same formulas except that now we will have to use averages : I  x . Lowe used : size descFeatureWindow=16 NAO project – page 80/107 . y1−I inter  x . y −I x −1. y≈ 2 I inter  x1. y−I  x .1. y1I  x . we have to work on the subpixels of the image instead of working on the real pixels. y−I inter  x−1.5 . we deduce dx and dy for the interpolated case : dx inter ≈I inter  x1. y 1 2 I  x .5≈ I  x . The calculation of interpolated gradients is similar to the calculation used for the non interpolated ones (keypoints). y −I  x−2. y  2 I  x−2.4. y−1 2 Thus.Sylvain Cherrier – Msc Robotics and Automation Thus. 6. y−2 2 dy inter ≈I inter  x . In order to have a symmetric. y0. we will interest ourself in the subpoints around the keypoint.5 . y−1≈ The calculation of the magnitude and orientation is exactly the same that previously with the keypoints. There are several ways to do so.5≈ I inter  x . we could either : – firstly generate each interpolated gaussians to calculate the gradient parameters after – or directly estimate the interpolated gradient parameters from the gaussians We chose the second method. the size of the feature window should be pair. y  2 I  x . y−1. y1=I  x . y≈ I  x1. y I  x−1.

This value is convenient for the generation of the subwindows we will see later. Indeed, this number is easily divisible by 4.

[Figure: the descriptor feature window centered on the keypoint (Kp).]

As for the keypoint case, we need a gaussian in order to calculate the weight within the descriptor feature window. This time, the size of the window we choose can help us to find a good gaussian. David G. Lowe used:

$$\sigma_{descFeatureWindow} = \frac{size_{descFeatureWindow}}{2}$$

6.1.4.3 Generation of the descriptor components for each keypoint

As we have generated the feature windows for each of our keypoints, we can now focus on the most important part: the descriptor generation. This process can be divided in four steps:
– First of all, we have to rotate the feature window by the orientation θ of the corresponding keypoint (invariance in rotation).
– Then, we divide our feature window in 16 subwindows (same size).
– For each subwindow, we fill an 8 bins histogram using the relative orientation and the magnitude of the gradients (interpolated).
– Finally, we threshold (invariance in brightness) and normalize (invariance in contrast) the vector given by the histograms.

[Figure: descriptor generation. Rotate the feature window by the orientation θ of the keypoint, split it into 16 subwindows, fill one 8 bins histogram for each subwindow, then threshold and normalize the vector to obtain the 128 (= 16 x 8) descriptor components.]

We will focus on the filling of the 8 bins histograms. First of all, we have to notice that, in order to have the invariance with rotation, we have to fill the histograms using the interpolated orientations minus the keypoint orientation:

$$\theta_{diff} = \theta_{inter} - \theta$$

Apart from this, the filling of the histograms uses the interpolated weight as for the keypoint case; however, we could also subtract the keypoint weight from the interpolated weight:

$$weight_{diff} = weight_{inter} - M$$

We can be more accurate by spreading the weight between two adjacent bins, as for the keypoint case.
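The final threshold-and-normalize step can be sketched as follows. The text above does not give the threshold value or the exact order of the operations; the sketch assumes the scheme described in Lowe's paper (normalize to unit length, clamp every component to 0.2, normalize again), so both the value 0.2 and the function names are assumptions of this example.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def finalize_descriptor(vec, clamp=0.2):
    """Normalize, clamp large components, then normalize again."""
    unit = normalize(vec)
    clipped = [min(v, clamp) for v in unit]
    return normalize(clipped)

# Toy 128-value vector: two strong components, the rest empty.
raw = [3.0, 4.0] + [0.0] * 126
descriptor = finalize_descriptor(raw)
print(descriptor[:2])
```

Because the vector is normalized first, multiplying every gradient magnitude by the same factor (a contrast change) leaves the descriptor unchanged.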

6.1.5 Conclusion and drawbacks

To conclude, SIFT is divided in two steps:
– keypoint extraction: the location is found using a scale space of images and calculating the difference of gaussian. The orientation is found using a feature window of the gradients around the keypoint: a histogram of 36 bins gives the value.
– descriptor generation: descriptors are generated using a feature window of the interpolated gradients around the keypoint. This feature window is rotated by the orientation of the keypoint and divided in 16 subwindows: for each subwindow, a histogram of 8 bins is generated using the relative orientation, giving 8 values. Then, the final vector (128 values) is thresholded and normalized.

SIFT has a main drawback, which is its speed: it's very slow, mainly because it needs to blur the image several times. Besides this problem, the keypoints extracted are not invariant to all 3D rotations: indeed, they are above all invariant to rotations within the plane of the image, and less to the others.

In order to improve the speed of SIFT for its use in real time applications, Herbert Bay et al. presented the Speeded Up Robust Feature (SURF) in 2006. Contrary to SIFT, SURF works on the integral of the image, on a scale space of filters and on Haar wavelets for orientation assignment and descriptor generation; these choices make it faster. In the following, we will see in more detail how SURF works.

6.2 SURF : another method adapted for real time applications

6.2.1 Keypoint extraction

As for SIFT, keypoint extraction consists in two steps:
– find the location of the keypoints
– find their main orientation

6.2.1.1 Keypoint location

6.2.1.1.1 LoG approximation : Fast-Hessian

In order to calculate the Laplacian of Gaussian (LoG) within the image, SURF uses a Fast-Hessian calculation. Fast-Hessian focuses on the Hessian matrix generalised to Gaussian operators:

$$H(x, \sigma) = \begin{pmatrix} L_{xx}(x, \sigma) & L_{xy}(x, \sigma) \\ L_{xy}(x, \sigma) & L_{yy}(x, \sigma) \end{pmatrix}$$

where L_xx is the convolution of the Gaussian second order derivative ∂²g(σ)/∂x² with the image I in point x, and similarly for L_xy and L_yy.

In order to extract keypoints that are invariant in scale, we study the determinant of the Hessian matrix:

$$det(H) = L_{xx} L_{yy} - L_{xy}^2$$

As the determinant is the product of the eigenvalues, we can conclude that:
➢ if it's negative, the eigenvalues have opposite signs and the keypoint doesn't correspond to a local extremum.
➢ if it's positive, both eigenvalues have the same sign and the keypoint is a local minimum or maximum.

In order to calculate the values of the Hessian matrices (known as Laplacian of Gaussian), SURF doesn't use the Difference of Gaussian as SIFT does, but a mask approximating the Gaussian second derivative. The simplified mask improves the speed while keeping enough accuracy. In the figure below, you can see:
➢ on the first row, the real Gaussian second order derivative (from the left to the right: Lxx, Lyy and Lxy).
➢ on the second row, the simplified Gaussian second order derivative used in SURF (from the left to the right: Dxx, Dyy and Dxy).

Using this mask, the determinant has to be adjusted:

$$det(H_{approx}) = D_{xx} D_{yy} - (0.9 \, D_{xy})^2$$

where

$$H_{approx}(x, \sigma) = \begin{pmatrix} D_{xx}(x, \sigma) & D_{xy}(x, \sigma) \\ D_{xy}(x, \sigma) & D_{yy}(x, \sigma) \end{pmatrix}$$

6.2.1.1.2 Work on an integral image

SURF uses the simplified Gaussian second order derivative D and applies it not directly to the image but to the integral image, which reduces the number of calculations. Indeed, working with an integral image makes the calculation independent of the size of the area studied. The integral image I_Σ is calculated by summing the values between the point and the origin:

$$I_\Sigma(x, y) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j)$$

Working with an integral image, we only need four operations to calculate the sum over any rectangular area of the original image, using the integral image values A, B, C and D at its four corners:

$$\Sigma = A + D - (C + B)$$

6.2.1.1.3 Scale space

As for SIFT, SURF has to use a scale space in order to apply its Fast-Hessian operation. As we saw previously, SIFT blurs the image within a same octave and reduces the size of the image to go to the next octave: this method is time consuming ! SURF doesn't modify the image but the mask it uses (the simplified Gaussian second order derivative D). Indeed, as it works on an integral image, the number of calculations is independent of the size of the mask: an area always requires four operations.
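The integral image and the four-lookup area sum can be sketched as below. This is a minimal pure-Python illustration, assuming a list-of-lists image indexed as img[y][x]; it pads the integral image with a leading row and column of zeros (a common convention, not something stated above) so that areas touching the border need no special case. The function names are hypothetical.

```python
def integral_image(img):
    """Integral image with one extra row and column of zeros at the top-left."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img over x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 0, 0, 3, 3), box_sum(ii, 1, 1, 3, 3))
```

The cost of box_sum does not depend on the size of the rectangle, which is why SURF can enlarge its masks instead of resizing the image.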

Thus, in order to build a scale space with several octaves, it's necessary to increase the size of the mask D: each octave corresponds to a doubling of scale. The minimum size of D is determined using a real gaussian of σ = 1.2. The smallest D is represented in the figure above: for example, Dxx has three lobes, each three pixels wide and five pixels high.

In order to calculate the approximate scale σ_approx, we can use a proportional relation between the real operator L and the approximated one D:

$$\sigma_{approx} = size_{mask} \, \frac{\sigma_{mask-base}}{size_{mask-base}} = size_{mask} \, \frac{1.2}{9}$$

In order to increase the scale of D, the minimum step is to add six pixels on the longer length (as it's necessary to keep a pixel at the center and three lobes of identical size) and four on the smaller length (to keep the structure).

6.2.1.1.4 Thresholding and non maximal suppression

Similarly to SIFT, after applying the Fast-Hessian with a specific mask D, we reject pixels with not enough intensity. Then, it's necessary to compare the pixel with its neighbours (eight in the same scale and nine in both the previous and the next scales).
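The relation between mask size and approximate scale can be worked through numerically. The sketch below uses the base mask of 9 pixels for σ = 1.2 and the step of six pixels given above; the number of masks listed (four) and the function names are assumptions made for the example.

```python
def approx_scale(mask_size, base_size=9, base_sigma=1.2):
    """Approximate gaussian scale matched by a box-filter mask of the given size."""
    return mask_size * base_sigma / base_size

def mask_sizes(count=4, base_size=9, step=6):
    """Successive mask sizes obtained by adding `step` pixels each time."""
    return [base_size + i * step for i in range(count)]

for size in mask_sizes():
    print(size, round(approx_scale(size), 2))
```

Each added step of six pixels therefore increases the approximate scale by 0.8.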

6.2.1.2 Keypoint orientation : Haar wavelets

6.2.1.2.1 Haar wavelets

Haar wavelets are simple filters which can be used to find gradients in the x and y directions. In the diagram below, you can see from the left to the right: a Haar wavelet filter for the x-direction gradient and another one for the y-direction gradient. The white side has a weight of 1 whereas the black one has a weight of -1. Working with integral images is interesting because only 6 operations are needed to convolve a Haar wavelet with the image.

6.2.1.2.2 Calculation of the orientation

To determine the orientation, Haar wavelet responses of size 4σ are calculated for a set of pixels within a radius of 6σ of the detected point (σ refers to the scale of the corresponding keypoint). The set of pixels is sampled with a step of σ. As for SIFT, the responses are weighted by a Gaussian centered at the keypoint; it is chosen to have σ_Gaussian = 2.5 σ.

In order to know the main orientation, the x and y responses of the Haar wavelets are combined into a response vector. The dominant orientation is selected by rotating a circle segment covering an angle of π/3 around the origin. At each angle of the circle segment, the response vectors it contains are summed and form a new vector: at the end of the rotation, the longest vector gives the main orientation of the keypoint.

6.2.2 Descriptor generation : Haar wavelets

As it does to calculate the main orientation of the keypoints, SURF uses Haar wavelets again to generate the descriptor. The descriptor is calculated within a 20σ window which is rotated by the main orientation in order to make it invariant to rotation. The descriptor window is divided in 4 by 4 subregions where Haar wavelet responses of size 2σ are calculated. As for the main orientation calculation, the pixels used for the calculation are sampled with a step of σ. For each subregion, a vector of four values is estimated:

$$V_{subregion} = \left( \sum dx, \; \sum dy, \; \sum |dx|, \; \sum |dy| \right)$$

where dx and dy are respectively the x-response and y-response of each set of pixels within the subwindow.

Thus, we conclude that SURF has a descriptor of 4 x 4 x 4 = 64 values, which is smaller than the SIFT one. As for SIFT, in order to make the descriptor invariant to brightness and contrast, it's necessary to threshold it and normalize it to unit length.

[Figure: SURF descriptor generation. Rotate the feature window by the orientation θ of the keypoint, split it into 16 subwindows, calculate the 4 values vector for each subwindow, then threshold and normalize the vector to obtain the 64 (= 16 x 4) descriptor components.]
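The assembly of the 64 values can be sketched as follows. The example assumes the Haar responses (dx, dy) of each subregion have already been computed and are passed in as lists of pairs; it only shows the four sums and the normalization to unit length, and leaves the thresholding step out. The function names and the toy responses are hypothetical.

```python
import math

def subregion_vector(responses):
    """Four values (sum dx, sum dy, sum |dx|, sum |dy|) for one subregion."""
    sum_dx = sum(dx for dx, _ in responses)
    sum_dy = sum(dy for _, dy in responses)
    sum_abs_dx = sum(abs(dx) for dx, _ in responses)
    sum_abs_dy = sum(abs(dy) for _, dy in responses)
    return sum_dx, sum_dy, sum_abs_dx, sum_abs_dy

def surf_descriptor(subregions):
    """Concatenate the subregion vectors and normalize to unit length."""
    vec = []
    for responses in subregions:
        vec.extend(subregion_vector(responses))
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# Toy input: 16 subregions with the same two Haar responses each.
toy_subregions = [[(1.0, -2.0), (-3.0, 4.0)] for _ in range(16)]
descriptor = surf_descriptor(toy_subregions)
print(len(descriptor), subregion_vector(toy_subregions[0]))
```

Keeping both the signed sums and the sums of absolute values lets the descriptor distinguish a uniform gradient from an alternating pattern whose signed responses cancel out.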

Conclusion To conclude.3. This feature window is rotated by the orientation of the keypoint. as SIFT.2. The orientation is generated using Haar wavelets for the calculation of the gradients within feature windows around the keypoints : the value is calculated by estmating average gradient vectors. descriptor generation : descriptors are generated using a feature window of gradients (given also by Haar wavelets). – The fact that SURF works on integral images and uses simplified masks reduces the computation and makes the algorithm quicker to execute. SURF is divided in two steps : – keypoint extraction : the location is found using a scale space of specific masks and by calculating the determinant of the Hessian matrix at each point.Sylvain Cherrier – Msc Robotics and Automation 6. The final vector (64 values) is then thresholded and normalized. NAO project – page 91/107 . divided in 16 subwindows : 4 values are calculated to build the vector.

6.3 Mono-calibration

➢ Why mono-calibrate ?

The goal of the mono-calibration is to know the extrinsic and intrinsic parameters of the camera:
– knowing the distortion coefficients, it will be possible to undistort the image
– knowing the focal length, we will be able to know dimensions in units of distance within the image
– knowing the extrinsic parameters, it will be possible to rectify images from a stereovision system (stereo-calibration)

➢ What are the different methods of calibration ?

We can notice that there are several ways to calibrate a camera:
– photogrammetric calibration
– self-calibration

Photogrammetric calibration needs to know the exact geometry of the object in 3D space. Thus, this method isn't flexible in an unknown environment and, moreover, it's difficult to implement because of the need for precise 3D objects.

Self-calibration doesn't need any measurements and produces more accurate results. This is the method chosen, calibrating with a chessboard, which has simple repetitive patterns.

➢ Chessboard corner detection

A simple way to calibrate is using a chessboard: this allows us to know accurately the position of the corners on the image. There are two main steps in this process:
– the generation of the 3D corner points and of the 2D corner points
– the two steps calibration

For the two steps calibration, I chose to use the Tsai method, which is a simple way to find the extrinsic and intrinsic parameters.

6.3.1 Generation of the 3D corner points and of the 2D corner points

Firstly, it's necessary to generate the 3D corner points and the 2D corner points within the image in order to use them for the two step calibration.

6.3.1.1 3D corner points

The idea is to choose the chessboard as reference.

[Figure: the world frame (Ow, Xw, Yw, Zw) attached to the chessboard, whose rectangles have dimensions dx and dy, seen from two camera frames (Ocam1, Xcam1, Ycam1, Zcam1) and (Ocam2, Xcam2, Ycam2, Zcam2).]

As you can see on the diagram above, the chessboard is supposed to be fixed and the camera is supposed to be moving whereas, in reality, the camera is fixed and the user moves the chessboard. This way, it's easy to generate the 3D corner points as a function of the world frame. Thus, the 3D points only need to be generated once, depending on the dimensions of the rectangles on the chessboard (dx, dy). We will see later that only the relation between dx and dy (rectangle ratio) interests us: if we work on squares, the ratio will be 1 and the calibration will definitely be a self-calibration.

6.3.1.2 2D corner points

The 2D corner points on the image can be detected in two ways:
– manual selection
– automatic detection

[Figure: the image frames (Xcam1, Ycam1) and (Xcam2, Ycam2) of two chessboard views.]

Using the manual selection, the user has to select each one of the corners on the image. This method can be very laborious if the chessboard is big or if the user wants to register a lot of positions of the chessboard. Moreover, this method is not accurate and often needs a checking by the software. The software can for example use the first points given by the user in order to generate the others using pattern recurrence.

Using the automatic detection, a corner detection technique is used. I will not go into details in this report, but an interesting method is to use the SUSAN operator combined with a ring shaped operator. The user could also check the result to see whether it's valid.

6.3.2 Two steps calibration

Using the Tsai method, we use two least squares solutions in order to know the projection matrix, the extrinsic parameters and the distortions.

We start from the equation presented in the last part (describing the parameters of the camera); this relation links a 2D pixel on the image with a 3D point (in distance unit) within the world:

$$p_{img} = s \begin{pmatrix} f_x & \alpha_c f_x & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} dr & 0 & dt_x \\ 0 & dr & dt_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world} = s \, M \, p_{obj/world}$$

We have to separate it in two problems:
– find the projection and extrinsic parameters
– find the distortions

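6.3.2.1 First least-square solution (projection matrix and extrinsic parameters)

Firstly, we neglect the distortions in order to make the equation easier:

$$p_{img} = s \begin{pmatrix} f_x & \alpha_c f_x & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix} p_{obj/world}$$

$$p_{img} = s \, M \, p_{obj/world} \quad (1)$$

where

$$p_{img} = \begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix}, \quad M = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \quad \text{and} \quad p_{obj/world} = \begin{pmatrix} X_{obj/world} \\ Y_{obj/world} \\ Z_{obj/world} \\ 1 \end{pmatrix}$$

The scale s can be easily deduced from M:

$$s = \frac{1}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$

Thus, the equation (1) can also be expressed as:

$$(m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}) \begin{pmatrix} x_{img} \\ y_{img} \\ 1 \end{pmatrix} = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \begin{pmatrix} X_{obj/world} \\ Y_{obj/world} \\ Z_{obj/world} \\ 1 \end{pmatrix}$$

which gives the two relations:

$$m_{11} X_{obj/world} + m_{12} Y_{obj/world} + m_{13} Z_{obj/world} + m_{14} - m_{31} X_{obj/world} \, x_{img} - m_{32} Y_{obj/world} \, x_{img} - m_{33} Z_{obj/world} \, x_{img} = m_{34} \, x_{img} \quad (2)$$

$$m_{21} X_{obj/world} + m_{22} Y_{obj/world} + m_{23} Z_{obj/world} + m_{24} - m_{31} X_{obj/world} \, y_{img} - m_{32} Y_{obj/world} \, y_{img} - m_{33} Z_{obj/world} \, y_{img} = m_{34} \, y_{img} \quad (3)$$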

We can express the two relations from above using a matrix A and a vector m:

$$A \, m = \begin{pmatrix} m_{34} \, x_{img} \\ m_{34} \, y_{img} \end{pmatrix} = p^0_{img}$$

where, writing X, Y, Z for X_obj/world, Y_obj/world, Z_obj/world:

$$A = \begin{pmatrix} X & Y & Z & 1 & 0 & 0 & 0 & 0 & -X x_{img} & -Y x_{img} & -Z x_{img} \\ 0 & 0 & 0 & 0 & X & Y & Z & 1 & -X y_{img} & -Y y_{img} & -Z y_{img} \end{pmatrix}$$

$$m = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} & m_{21} & m_{22} & m_{23} & m_{24} & m_{31} & m_{32} & m_{33} \end{pmatrix}^T$$

Thus, we can calculate m using the least squares method (pseudo inverse calculation):

$$m = (A^T A)^{-1} A^T p^0_{img}$$

At the beginning, we have to suppose m34 = 1, but we will calculate its real value later. When we modify m34, it will be necessary to multiply all the other coefficients by it, as they are directly proportional.

Thanks to this first solution, we can have an idea of the coefficients of M but, in order to be able to calculate a correct least squares solution, we have to know at least 6 corner features (corresponding 2D and 3D corner points): m has 11 unknowns and each corner feature gives 2 equations. However, the more corner features are recorded, the better the approximation of M is.

From a global viewpoint, if we consider K as the number of chessboard views to treat and N the number of corner points (couples of 2D and 3D corner points), we have:
– 2NK constraints (2 because of the 2D coordinates)
– 6K unknown extrinsic parameters (3 for rotation and 3 for translation)
– 5 unknown projection parameters (fx, fy, αc, cx and cy)

Thus, in order to find a solution, we have to have:

$$2NK \ge 6K + 5$$

So, if we choose to work on only 1 chessboard view, we need at least 6 corner points as we showed before, whereas using 2 chessboard views, we can use only 5 corners. However, in reality, 1 chessboard plane offers only 4 corner points worth of information whatever the number of corner points chosen: that's why we have to work on at least 3 chessboard views.

In practice, because of noise and numerical stability, we have to take more points and more chessboard views in order to have more accuracy: 10 views of a 7-by-8 or larger chessboard is a good choice.
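The counting argument can be checked with a few lines of arithmetic. The sketch below only evaluates the inequality 2NK ≥ 6K + 5 given above; the function names and the search limit are hypothetical.

```python
def is_solvable(n_corners, k_views):
    """True when the 2NK constraints cover the 6K + 5 unknowns."""
    return 2 * n_corners * k_views >= 6 * k_views + 5

def min_views(n_corners, max_views=100):
    """Smallest number of views satisfying the inequality, or None if there is none."""
    for k_views in range(1, max_views + 1):
        if is_solvable(n_corners, k_views):
            return k_views
    return None

for n_corners in (6, 5, 4, 3):
    print(n_corners, min_views(n_corners))
```

With N = 4, the amount of information one chessboard plane really provides, the inequality first holds at K = 3, which matches the minimum of 3 views stated above. With N = 3 the constraints never catch up with the unknowns.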

The next step is to split the matrix M into two matrices: the projection matrix and the extrinsic matrix. A simple way is to split the matrix M into Mr and Mt in order to use the properties of orthogonal matrices:

$$M = \langle M_r \,|\, M_t \rangle = \langle P_{cam-px} R_{cam} \,|\, -P_{cam-px} R_{cam} T_{cam} \rangle$$

As R_cam is a rotation matrix, we deduce:

$$R_{cam} R_{cam}^T = I$$

$$M_r M_r^T = P_{cam-px} R_{cam} (P_{cam-px} R_{cam})^T = P_{cam-px} R_{cam} R_{cam}^T P_{cam-px}^T = P_{cam-px} P_{cam-px}^T$$

By calculating the coefficients of P_cam-px P_cam-px^T, it's possible to estimate the projection parameters:

$$P_{cam-px} P_{cam-px}^T = \begin{pmatrix} (1 + \alpha_c^2) f_x^2 + c_x^2 & \alpha_c f_x f_y + c_x c_y & c_x \\ \alpha_c f_x f_y + c_x c_y & f_y^2 + c_y^2 & c_y \\ c_x & c_y & 1 \end{pmatrix}$$

We can notice a condition which allows us to find the real value of m34:

$$(P_{cam-px} P_{cam-px}^T)_{3,3} = 1 \Leftrightarrow m_{34} = \frac{1}{\sqrt{|(M_r M_r^T)_{3,3}|}}$$

Thus, we have to normalize the matrix M by multiplying it by the coefficient m34. After the normalization, we can solve the system of equations:

$$c_x = (M_r M_r^T)_{1,3} \qquad c_y = (M_r M_r^T)_{2,3}$$

$$f_y = \sqrt{(M_r M_r^T)_{2,2} - c_y^2}$$

$$f_x = \sqrt{(M_r M_r^T)_{1,1} - c_x^2 - k^2} \quad \text{and} \quad \alpha_c = \frac{k}{f_x} \quad \text{with} \quad k = \frac{(M_r M_r^T)_{1,2} - c_x c_y}{f_y}$$

When the projection matrix is known, it's easy to deduce the external matrix (rotation and translation matrices of the camera):

$$R_{cam} = P_{cam-px}^{-1} M_r \quad \text{and} \quad T_{cam} = -R_{cam}^T P_{cam-px}^{-1} M_t$$

6.3.2.2 Second least square solution (distortions)

Thanks to the matrix M previously calculated, we can estimate the ideal 2D corner points p^i_img (x^i_img, y^i_img) using the equations (2) and (3):

$$x^i_{img} = \frac{m_{11} X_{obj/world} + m_{12} Y_{obj/world} + m_{13} Z_{obj/world} + m_{14}}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$

$$y^i_{img} = \frac{m_{21} X_{obj/world} + m_{22} Y_{obj/world} + m_{23} Z_{obj/world} + m_{24}}{m_{31} X_{obj/world} + m_{32} Y_{obj/world} + m_{33} Z_{obj/world} + m_{34}}$$

Here, we have to consider the normalized points before the projection, p_img0 and p^i_img0:

$$p_{img0} = \frac{P_{cam-px}^{-1} \, p_{img}}{m_{34}} \quad \text{and} \quad p^i_{img0} = \frac{P_{cam-px}^{-1} \, p^i_{img}}{m_{34}}$$

Then, using the deformation equation:

$$p_{img0} = dr \; p^i_{img0} + \begin{pmatrix} dt_x \\ dt_y \end{pmatrix} \Leftrightarrow \Delta p_{img0} = p_{img0} - p^i_{img0} = (dr_1 r^2 + dr_2 r^4 + dr_3 r^6 + \dots + dr_n r^{2n}) \; p^i_{img0} + \begin{pmatrix} dt_x \\ dt_y \end{pmatrix}$$

with

$$r = \sqrt{(x^i_{img0})^2 + (y^i_{img0})^2}$$

$$dt_x = 2 \, dt_1 \, x^i_{img0} \, y^i_{img0} + dt_2 \, (r^2 + 2 (x^i_{img0})^2)$$

$$dt_y = dt_1 \, (r^2 + 2 (y^i_{img0})^2) + 2 \, dt_2 \, x^i_{img0} \, y^i_{img0}$$

In order to calculate kc = (dr1, dr2, …, drn, dt1, dt2)^T, we have to express the equation above using a matrix Ad; the principle is the same as previously with A:

$$\Delta p_{img0} = A_d \, k_c$$

where

$$A_d = \begin{pmatrix} r^2 x^i_{img0} & r^4 x^i_{img0} & \dots & r^{2n} x^i_{img0} & 2 \, x^i_{img0} \, y^i_{img0} & r^2 + 2 (x^i_{img0})^2 \\ r^2 y^i_{img0} & r^4 y^i_{img0} & \dots & r^{2n} y^i_{img0} & r^2 + 2 (y^i_{img0})^2 & 2 \, x^i_{img0} \, y^i_{img0} \end{pmatrix}$$

As before, the least-squares solution of this equation is as follows:

$$k_c = (A_d^T A_d)^{-1} A_d^T \, \Delta p_{img0}$$

The minimal number of points required depends on the maximal degree we want for the radial distortions; the more points we have, the better the least-squares solution is. Stopping at the sixth degree offers enough precision (three radial coefficients); in this case, we need to work on at least three 2D corner points.

6.3.3. A main use of mono-calibration: undistortion

As we said previously, a main use of mono-calibration is to undistort the images. To do so, we use the distortion matrix $M_d$ and the projection matrix $P_{\text{cam-px}}$ in order to calculate the ideal pixel from the distorted one. If we consider $p^i_{img}$ as the ideal pixel and $p^d_{img}$ as the distorted one, we have:

$$p^i_{img} = s \, P_{\text{cam-px}} \, p_{obj/cam} \quad \text{and} \quad p^d_{img} = s \, P_{\text{cam-px}} M_d \, p_{obj/cam}$$

The idea is to calculate the distorted pixel from the ideal one:

$$p^d_{img} = s \, P_{\text{cam-px}} M_d \, \frac{1}{s} P_{\text{cam-px}}^{-1} \, p^i_{img} = P_{\text{cam-px}} M_d \, P_{\text{cam-px}}^{-1} \, p^i_{img}$$

We can't calculate the ideal pixel directly because the distortion matrix $M_d$ is calculated thanks to the ideal pixels. Because of this problem, we will need to use image mapping:

$$x^d_{img} = map_x(x^i_{img}, y^i_{img}) \quad \text{and} \quad y^d_{img} = map_y(x^i_{img}, y^i_{img})$$

Thus, in order to find the ideal image, we remap: the ideal pixel $(x^i_{img}, y^i_{img})$ takes the value of the distorted image at $(map_x(x^i_{img}, y^i_{img}), map_y(x^i_{img}, y^i_{img}))$.
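The image mapping described above can be sketched as follows. This is a hedged illustration, not the author's code: the names `build_undistort_maps` and `remap_nearest` are hypothetical, the distortion matrix $M_d$ is assumed to be the radial and tangential model with three radial coefficients applied to normalized points, and a nearest-neighbour lookup stands in for whatever interpolation the real remapping uses.

```python
import numpy as np

def build_undistort_maps(P, kc, width, height):
    """For every ideal pixel, compute the distorted pixel where its value is read."""
    dr1, dr2, dr3, dt1, dt2 = kc
    xi, yi = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([xi.ravel(), yi.ravel(), np.ones(xi.size)])
    x, y, _ = np.linalg.inv(P) @ pix                 # normalized ideal points
    r2 = x ** 2 + y ** 2
    radial = 1 + dr1 * r2 + dr2 * r2 ** 2 + dr3 * r2 ** 3
    xd = radial * x + 2 * dt1 * x * y + dt2 * (r2 + 2 * x ** 2)
    yd = radial * y + dt1 * (r2 + 2 * y ** 2) + 2 * dt2 * x * y
    u, v, _ = P @ np.stack([xd, yd, np.ones_like(xd)])   # back to pixel coordinates
    return u.reshape(height, width), v.reshape(height, width)

def remap_nearest(img, map_x, map_y):
    """Build the ideal image by reading the distorted image at the mapped positions."""
    h, w = img.shape[:2]
    u = np.clip(np.rint(map_x).astype(int), 0, w - 1)
    v = np.clip(np.rint(map_y).astype(int), 0, h - 1)
    return img[v, u]
```

The maps depend only on the calibration, so they can be computed once and reused for every frame of the same camera.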

6.4 Stereo-calibration

Stereo-calibration deals with the calibration of a system made up of several cameras. Thus, it focuses on the location of each camera relative to the others in order to establish relations between them. In the following, we will focus only on a stereovision system made up of two cameras (one on the left and the other on the right), because it's easier and because the tool was available in the laboratory.

6.4.1. Epipolar geometry

Epipolar geometry is the geometry used in stereovision. We first have to focus on it in order to better understand the stereovision system.

On the diagram below, the left camera parameters have the indication l and the right ones the indication r. $O_i$ is the center, $\pi_i$ the image plane, $P_i$ the position relative to the object P and $p_i$ the image projection of each camera (i = l, r). P is an object point visible to the two cameras.

$e_l$ and $e_r$ are respectively the epipoles of the left and right cameras. Their position is defined by the intersection between the line $O_l O_r$ and each one of the image planes $\pi_l$ and $\pi_r$. If the cameras are stationary, they are always located at the same place within each image.

The epipolar lines $l_l$ and $l_r$ are defined respectively by the segments $e_l p_l$ and $e_r p_r$. Using this geometry, we can define two interesting matrices which allow us to link the parameters of the two cameras:
– the essential matrix E links the camera positions $P_l$ and $P_r$
– the fundamental matrix F links the image projections $p_l$ and $p_r$

6.4.1.1 The essential matrix E

Actually, the essential matrix E describes the extrinsic relations between the left and right cameras (due to the location of each of them). Indeed, each one of the cameras is defined by its position from the object, as we saw during the mono-calibration:

$$P_l = R_l \langle I \mid -T_l \rangle P \quad \text{and} \quad P_r = R_r \langle I \mid -T_r \rangle P$$

We deduce a relative relation from the left camera to the right one, $P_{l \to r}$:

$$P_{l \to r} = P_r P_l^{-1} \quad \text{or} \quad P_{l \to r} = R_{l \to r} \langle I \mid -T_{l \to r} \rangle P \quad \text{with} \quad R_{l \to r} = R_r R_l^T \quad \text{and} \quad T_{l \to r} = T_l - T_r$$

Using a coplanarity relation, we can calculate the essential matrix E. The three vectors $PO_l$, $PO_r$ and $O_l O_r$ shown on the diagram below have to be coplanar. Thus, we deduce these relations:

$$PO_r \cdot (O_l O_r \wedge PO_l) = 0 \Leftrightarrow P_r^T (T_{l \to r} \wedge P_l) = 0$$

[Figure: coplanarity of the vectors $PO_l$, $PO_r$ and $O_l O_r$; labels P, $P_l$, $P_r$, $O_l$, $O_r$, $T_{l \to r}$, $R_{l \to r}$]

Using the diagram above, we can see that, in terms of vectors, we have:

$$P_r = P_l - T_{l \to r}$$

Moreover, we can express $P_r$ as a function of $P_{l \to r}$ using the definitions of $P_l$ and $P_r$:

$$P_r = P_{l \to r} P_l = R_{l \to r} \langle I \mid -T_{l \to r} \rangle P_l = R_{l \to r} (P_l - T_{l \to r})$$

Thus, we can simplify the coplanarity relation:

$$P_r^T (T_{l \to r} \wedge P_l) = 0 \Leftrightarrow (P_l - T_{l \to r})^T (T_{l \to r} \wedge P_l) = 0 \Leftrightarrow (R_{l \to r}^T P_r)^T (T_{l \to r} \wedge P_l) = 0$$

Finally, thanks to the two equations mentioned above, we can define the essential matrix E:

$$(R_{l \to r}^T P_r)^T (T_{l \to r} \wedge P_l) = P_r^T R_{l \to r} (T_{l \to r} \wedge P_l) = P_r^T R_{l \to r} S \, P_l = 0 \quad \text{and} \quad E = R_{l \to r} S$$

where S deals with the vector product between $T_{l \to r}$ and $P_l$:

$$S = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix} \quad \text{with} \quad rank(S) = 2$$

Thus, the positions of the left and right cameras are linked as follows:

$$P_r^T E \, P_l = 0$$

6.4.1.2 The fundamental matrix F

The fundamental matrix F considers the intrinsic parameters of each camera in order to link their pixels. Each camera has an intrinsic matrix ($M_l$ and $M_r$) linking its image projection and its position relative to the object:

$$p_l = M_l P_l \quad \text{and} \quad p_r = M_r P_r$$

As we saw with the mono-calibration, $M_l$ and $M_r$ refer to the projection and distortion matrices of each camera. Thus, the previous coplanarity relation can be written as below:

$$P_r^T E \, P_l = 0 \Leftrightarrow (M_r^{-1} p_r)^T E \, M_l^{-1} p_l = p_r^T (M_r^{-1})^T E \, M_l^{-1} p_l = 0$$

F is defined as:

$$F = (M_r^{-1})^T E \, M_l^{-1} = (M_r^{-1})^T R_{l \to r} S \, M_l^{-1}$$
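The two definitions above, $E = R_{l \to r} S$ and $F = (M_r^{-1})^T E \, M_l^{-1}$, translate almost directly into code. The sketch below is an illustration only: the function names are hypothetical, and $M_l$ and $M_r$ are assumed to be plain 3x3 invertible intrinsic matrices (distortion is ignored).

```python
import numpy as np

def skew(t):
    """Matrix S such that S @ p equals the vector product t ^ p."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def essential_matrix(R_lr, T_lr):
    """E = R_{l->r} S, so that P_r^T E P_l = 0."""
    return R_lr @ skew(T_lr)

def fundamental_matrix(E, M_l, M_r):
    """F = (M_r^-1)^T E M_l^-1, so that p_r^T F p_l = 0."""
    return np.linalg.inv(M_r).T @ E @ np.linalg.inv(M_l)
```

For a point seen by both cameras, with $P_r = R_{l \to r}(P_l - T_{l \to r})$, both `P_r @ E @ P_l` and `p_r @ F @ p_l` should then vanish up to rounding errors.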

Knowing the fundamental matrix, it's possible to know:
– the epipolar line of one camera corresponding to one point in the other
– the epipoles of the two cameras ($e_l$ and $e_r$)

Only one relation is needed to know all these parameters:

$$p_r^T F \, p_l = p_l^T F^T p_r = 0$$

Moreover, as we work on a projective representation of lines, if l is a line and p a point, we have:

$$p \in l \Rightarrow p \cdot l = 0$$

Thus, we conclude from these two relations:
– if $p_l$ is a pixel from the left camera, the corresponding right epiline $l_r$ will be $F p_l$.
– if $p_r$ is a pixel from the right camera, the corresponding left epiline $l_l$ will be $F^T p_r$.

As for the epipoles, they are defined as points lying on all epilines, so we have:

$$e_l^T l_l = 0 \quad \text{and} \quad e_r^T l_r = 0$$

We conclude:

$$e_l^T l_l = 0 \Leftrightarrow l_l^T e_l = 0 \Leftrightarrow p_r^T F \, e_l = 0 \Rightarrow F e_l = 0$$

$$e_r^T l_r = 0 \Leftrightarrow l_r^T e_r = 0 \Leftrightarrow p_l^T F^T e_r = 0 \Rightarrow F^T e_r = 0$$

6.4.1.3 Roles of the stereo-calibration

We deduce from the previous calculations that the main roles of the stereo-calibration are to calculate:
– the relative position between the two cameras $P_{l \to r}$ ($R_{l \to r}$ and $T_{l \to r}$)
– the essential and fundamental matrices E and F
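As an illustration of the relations $l_r = F p_l$, $l_l = F^T p_r$, $F e_l = 0$ and $F^T e_r = 0$, the sketch below computes epilines and epipoles from a given F. The function names are hypothetical, and using the singular value decomposition to obtain the null vectors is a choice made for this example, not something stated in the text.

```python
import numpy as np

def epiline_right(F, p_l):
    """Right epiline l_r = F p_l for a left pixel p_l (homogeneous 3-vector)."""
    return F @ p_l

def epiline_left(F, p_r):
    """Left epiline l_l = F^T p_r for a right pixel p_r (homogeneous 3-vector)."""
    return F.T @ p_r

def epipoles(F):
    """Return (e_l, e_r) with F e_l = 0 and F^T e_r = 0, up to scale."""
    U, _, Vt = np.linalg.svd(F)
    e_l = Vt[-1]       # right singular vector of the smallest singular value
    e_r = U[:, -1]     # left singular vector of the smallest singular value
    return e_l, e_r
```

The epipoles are returned as homogeneous vectors of unit norm; dividing by the third component gives pixel coordinates when that component is not close to zero.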

Now. we deduce the rectification matrix for the left camera RlcamRect: RcamRect = e1 e 2 e 3  l As the right camera has a rotation relative of the left one defined by R l→r. we have RrcamRect: RcamRect =Rl  r R camRect r l NAO project – page 104/107 .4. A simple way to know that is by calculating the unit vector e1. Thus. their relative position relative to the object P l and Pr have to be colinear.2.4. Thus. 6.Sylvain Cherrier – Msc Robotics and Automation The stereo-calibration can do also integrate mono-calibration of each cameras but that's a complicated solutions. we know some key links between the cameras.2. 6. e2 and e3 defining this frame: e2 is chosen as the vector product between e1 and the z axis: −t y 1 e 2= = t ∣e1∧  0 0 1 T∣  t 2t 2y 0x x −e 1∧ 0 0 1  T  e3 is directly deduced from e1 and e2: −t x t z e 1∧e 2 1 e 3= = −t t z ∣e 1∧e 2∣  t 2 t 2t 2 t 2t2 t 2 2 2 y 2 x z y z x y t x t y   From this three unit vectors. it's necessary to know the orientation of the frame defined by OlOr = Tl→r as it's this vector which define the orientation of the epipolar plane. er and P) make easier this matching: this mean align the cameras. we need a rectification matrix for each camera R lrect and Rrrect in order to make a correcting rotation. a easy way to proceed is to mono-calibrate each one of the camera before to stereo-calibrate the stereovision system made up of them. In order to know the rectification matrices.1 Calculation of the rectification and reprojection matrices Rectification matrices Thanks to the calibration. Working in the epipolar plane (plane defined by e l. we want to make simple the stereovision system in order to match easily the same object within the two cameras.

Thus. The projection matrix of the global camera can be expressed as this: f global f global global c global x x c x global = 0 fy c global y 0 0 1 r l r P global cam− px   l r where f global x f f f f   = x x . we need to rotate the image in the opposite direction. x global y NAO project – page 105/107 . c and cxglobal is not common for the two = 2 cameras ( c r =c lx dx where dx is the disparity through the x axis). indeed. The goal is to consider the system of multiple cameras as one global camera with only one general projection matrix which could represent the whole system. we can calculate the new projection or reprojection matrix. we conclude the rectification matrices for the images: e1 l l T Rrect =R camRect  = eT 2 eT 3  T and Rr =R lcamRect T RT r =Rlrect RT r rect l l Thus. we can calcute the rectified position of each camera Plrect and Prrect: Plrect =Rlrect P l and Pr =Rr P r rect rect 6.4. in order to simulate a real rotation of the camera in one direction. f global = y y . Thus. we will need a reprojection matrix for each camera Plrepro and Prrepro in order to reproject the pixels from each camera to the global camera. that are the rectification matrices for the cameras and not for the images.2.2 Reprojection matrices After the rectified position calculated.Sylvain Cherrier – Msc Robotics and Automation As you can see. global = c c y c 2 2 2 l – c ly c ry if this is a horizontal stereosystem.

– if this is a vertical stereosystem, $c^{global}_x = \frac{c^l_x + c^r_x}{2}$ and $c^{global}_y$ is not common to the two cameras ($c^r_y = c^l_y + d_y$ where $d_y$ is the disparity along the y axis).

In order to simplify the model and to use the global projection matrix $P^{global}_{\text{cam-px}}$, we can consider the disparity as 0, so that the coordinate of the principal point along the disparity axis is also taken as the mean of the left and right values.

Geometrically, in the case of a horizontal stereovision system, the left camera and the right one are separated by $\Delta x = f^{global}_x \, t / z^{global}$ (where t is the norm of $T_{l \to r}$); thus, if we consider the left camera as being the reference frame, pixel coordinates from the right one have to be offset by $\Delta x$.

[Figure: rectified horizontal stereovision system; labels P, $P^l_{rect}$, $P^r_{rect}$, $z^{global}$, $O_l$, $O_r$ and $\Delta x = f^{global}_x \, t / z^{global}$]

Finally, focusing on a horizontal stereovision system, we deduce the reprojection matrices $P^l_{repro}$ and $P^r_{repro}$ (if the left camera is considered as the reference frame):

$$P^l_{repro} = \begin{pmatrix} f^{global}_x & c^{global} f^{global}_x & c^{global}_x & 0 \\ 0 & f^{global}_y & c^{global}_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

$$P^r_{repro} = \begin{pmatrix} f^{global}_x & c^{global} f^{global}_x & c^{global}_x & f^{global}_x t_x \\ 0 & f^{global}_y & c^{global}_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

Then, we can express the stereo-rectified pixel of each camera, $p^l_{rect}$ and $p^r_{rect}$, as a function of the positions of the object relative to the cameras, $P_l$ and $P_r$:

$$p^l_{rect} = P^l_{repro} P^l_{rect} = P^l_{repro} R^l_{rect} P_l \quad \text{and} \quad p^r_{rect} = P^r_{repro} P^r_{rect} = P^r_{repro} R^r_{rect} P_r$$
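The global projection matrix and the two reprojection matrices of a horizontal system can be assembled as in the sketch below. The function names are hypothetical; the sketch assumes 3x3 input matrices whose first row holds $f_x$, $c f_x$ and $c_x$, a zero disparity (so both principal-point coordinates are averaged), and homogeneous 4-vectors for the rectified positions.

```python
import numpy as np

def global_projection(P_l, P_r):
    """Average the parameters of the left and right projection matrices."""
    fx = (P_l[0, 0] + P_r[0, 0]) / 2.0
    fy = (P_l[1, 1] + P_r[1, 1]) / 2.0
    # Entry [0, 1] stores c * fx, so the skew c is recovered before averaging.
    c = (P_l[0, 1] / P_l[0, 0] + P_r[0, 1] / P_r[0, 0]) / 2.0
    cx = (P_l[0, 2] + P_r[0, 2]) / 2.0
    cy = (P_l[1, 2] + P_r[1, 2]) / 2.0
    return np.array([[fx, c * fx, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])

def reprojection_matrices(P_global, t_x):
    """Return (P^l_repro, P^r_repro), the left camera being the reference frame."""
    P_l_repro = np.hstack([P_global, np.zeros((3, 1))])
    P_r_repro = P_l_repro.copy()
    P_r_repro[0, 3] = P_global[0, 0] * t_x     # offset term f_x^global * t_x
    return P_l_repro, P_r_repro
```

Projecting the same homogeneous point $(X, Y, Z, 1)^T$ with both matrices and dividing by the third component gives two pixels with the same y coordinate and an x offset of $f^{global}_x t_x / Z$.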

6.4.3. Rectification and reprojection of an image

In order to rectify and reproject an image from a camera within a stereovision system, we will use the same method as for undistortion, adding the rectification and reprojection matrices. Indeed, our ideal pixel will now be:

$$p^i_{img} = s_{global} \, P_{repro} R_{rect} \, p_{obj/cam}$$

Thus, we can express the distorted (and unrectified) pixel $p^d_{img}$ as a function of $p^i_{img}$:

$$p^d_{img} = s \, P_{\text{cam-px}} M_d \, \frac{1}{s_{global}} R_{rect}^T P_{repro}^{-1} \, p^i_{img}$$

Then, similarly to undistortion, we will use image mapping.
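The mapping for rectification and reprojection can be sketched in the same way as for undistortion. This is a hedged illustration with a hypothetical function name: it assumes that only the left 3x3 part of $P_{repro}$ is inverted (a 3x4 matrix has no inverse in the strict sense), and the distortion $M_d$ is passed as an optional function acting on normalized coordinates.

```python
import numpy as np

def build_rectify_maps(P_cam, R_rect, P_repro, width, height, distort=None):
    """For every rectified pixel, compute the pixel to read in the original image."""
    xi, yi = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([xi.ravel(), yi.ravel(), np.ones(xi.size)])
    # R_rect^T P_repro^-1 p^i: back from the rectified frame to the camera frame
    rays = R_rect.T @ np.linalg.inv(P_repro[:, :3]) @ pix
    x, y = rays[0] / rays[2], rays[1] / rays[2]
    if distort is not None:
        x, y = distort(x, y)          # assumed model standing in for M_d
    u, v, _ = P_cam @ np.stack([x, y, np.ones_like(x)])
    return u.reshape(height, width), v.reshape(height, width)
```

The two returned arrays play the role of $map_x$ and $map_y$: the rectified image at (x, y) takes the value of the original image at (`map_x[y, x]`, `map_y[y, x]`).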
