
# Lie Algebra Approach for Tracking and 3D Motion Estimation Using Monocular Vision

Eduardo Bayro-Corrochano and Jaime Ortegon-Aguilar CINVESTAV Unidad Guadalajara,

Electrical Engineering and Computer Science Department, Av. Científica 1145, Col. El Bajío, Zapopan, Jalisco, Mexico. {edb,jortegon}@gdl.cinvestav.mx

Abstract

The main purpose of this paper is to estimate 2D and 3D transformation parameters. All the group transformations are represented in terms of their Lie algebra elements. The Lie algebra approach ensures that the motion follows the shortest path, or geodesic, in the Lie group involved. For the estimation of the Lie algebra parameters, we take advantage of the theory of system identification. Two experiments are presented to show the potential of the method. First, we estimate the affine or projective parameters of the transformation involved in monocular region tracking. Second, we develop a monocular method to estimate the 3D motion of an object in the visual space. In the latter, the six parameters of the rigid motion are estimated from measurements of the six parameters of the affine transformation in the image.

Keywords: Lie algebra, template tracking, monocular 3D motion estimation, least squares estimation, system identification

1. Introduction

In robotics, the estimation of 3D rigid motion is useful for many different tasks, such as navigation and the accurate positioning of parts of different shapes and sizes. This estimation is trivial when corresponding 3D points are available, and becomes a more challenging problem when less information is known. When 3D points are not available, it is common to use sensors to get information about the environment. Cameras are among the most useful sensors. In this paper, we focus on the use of just one camera, i.e., we use a monocular system.

Monocular 3D motion estimation is one of the most challenging problems in robotics and computer vision, since with just one camera it is impossible to estimate the depth of an object in motion. In order to estimate the 3D motion, it is unavoidable to impose restrictions on the configuration of the environment. One example is [8], where Daniilidis uses fixation on an environmental point.


Another approach is to learn how the 3D motion affects the projection of an object, and then measure the changes in the image transformation to estimate the 3D motion. For this purpose, authors have used neural networks [21], the extended Kalman filter [4] and Jacobian-based methods [7,9]. In [9], Cipolla and Drummond, similarly to Colombo and Allota [7], relate the 2D image space to the 3D space of the rigid motion by means of a Jacobian.

In contrast, we use standard methods of linear optimization to estimate those parameters. These methods are much more robust than the Jacobian methods because they can address non-linear problems. The estimation is used for simple visual servoing tasks. From now on, we will refer to our method as the A-method.

The work of Chiuso et al. [6] falls within the category of causal motion and structure estimation. It is similar to the work of Azarbayejani and Pentland [1], but extended to handle occlusions. Our work uses template tracking, so feature correspondences are not used and our algorithm can naturally handle occlusions. That is why we do not compare our method with this kind of algorithm [1,6].

Recently, some authors have been using Lie algebras to deal with problems in vision systems [17] and robotic systems [9,18]. One of the advantages of using Lie algebras is that they allow us to guide an object along the shortest orbit (geodesic), i.e., the shortest path between two points of the Lie group manifold. Another advantage is that they provide a natural parameterization of the transformation, whether it is in 2D or 3D space. Another contribution of this work is the use of Lie algebras to model the system's parameters.

Another application of our approach is the tracking of objects in video sequences. Tracking is a very important topic in computer vision. It is widely used in a variety of areas, such as surveillance, traffic control, visual servoing and medical applications. This problem has been addressed using very different approaches, the most popular being feature tracking [13,19,22] and template tracking [2,3,5,6,11,14-16]. The feature tracking methods are based on corners [19,22] or edges [13]. Some authors use covariance [19,22] or iterative methods to minimize the sum of squared errors [13].

Regarding template tracking, the problem has been addressed using Jacobian-related methods [2,3,5,6,11,15,16]. Nevertheless, this problem can also be addressed with techniques of linear optimization, like the A-method. A method of this kind, similar to ours, was presented recently for template tracking under the name of hyper-plane estimation by Jurie and Dhome [14]. The main difference between our method and that approach is the use of a Lie algebra parameterization.

Baker and Mathews [2] classify template tracking algorithms as forward or inverse, and as additive or compositional. The additive approach estimates an additive increment to the parameters [16], while the compositional approach estimates an increment to the parameters to be composed with the current ones [20]. The classical approach, in which the roles of the current image and the template are preserved, is called the forward approach. In the inverse approach, the roles are inverted to improve efficiency [11]. The forward additive algorithms can be applied to any type of warp. The forward compositional algorithms are restricted to warps that form semi-groups. The inverse additive algorithms can be applied to translation and 2D affine warps, and the inverse compositional algorithm is useful for sets of warps that form groups. If efficiency is the main concern, the inverse compositional algorithm is clearly the better choice, since its derivation is far simpler and it is far easier to determine when this algorithm can be applied to a new set of warps. Our algorithm estimates an additive increment to the parameters; however, it falls in the category of inverse compositional. Our method casts the problem in the mathematical framework of the Lie algebra, where an additive increment actually represents a composition with the parameters, as we will show in the following sections.


The rest of the paper is organized as follows: the second section describes the general idea of the Lie algebra parameterization. The third presents the template tracking method, together with experiments and results. In the fourth section, the estimator for the 3D motion is explained and experiments and results are presented. Section five is devoted to the conclusions.

2. Lie algebras

In this section, we briefly describe Lie groups and Lie algebras, which are the basis of the geometric approach used in our work to describe transformations in 2D and 3D. For further reading on Lie algebras, see [10, 12]. We first describe the Lie groups of 2D and 3D transformations. Next, we review the Lie generators, which are used to represent the Lie algebras. Lie algebras are the third topic of this section. The last two subsections review motion integration and motion estimation.

The principal reason for using Lie algebras to represent transformations is that straight lines through the origin in the Lie algebra represent geodesics in the Lie group. To make the concepts clear, we present an example in which a transformation is parameterized using both the traditional matrix parameterization and the Lie algebra parameterization. A standard matrix parameterization by $t$ of a transformation $T$ is given by

$$T(t) = I + (1 - t)(T - I), \tag{1}$$

where $I$ is the identity matrix, while the Lie parameterization is given by

$$T(t) = \exp\big((1 - t)\,T_{lie}\big), \tag{2}$$

where $T_{lie}$ is the representation of the transformation in the Lie algebra.

In the next example an object is rotated and translated, so that it only needs to rotate and translate to return to its original position (where the transformation is the identity). Fig. 1.a shows the transformed position of the object and Fig. 1.b shows the original position. The path from the transformed to the initial position using the matrix parameterization is presented in Fig. 1.c, where it is clear that the matrix parameterization shrinks and then grows the object to reach the final transformation. The Lie parameterization, due to its linearity, moves the object along a geodesic without deforming its shape, as can be seen in Fig. 1.d.
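The shrinking effect described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a pure rotation by a quarter turn (the translation part of the example is left out), and the helper names `matrix_path` and `lie_path` are hypothetical. It compares the area scale (determinant) of the intermediate transformation at the midpoint of both paths.

```python
import math

def rotation(theta):
    """2x2 rotation matrix as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def det2(m):
    """Determinant of a 2x2 matrix (area scale factor of the map)."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matrix_path(T, t):
    """Eq. (1): T(t) = I + (1 - t)(T - I), evaluated entry by entry."""
    I = [[1.0, 0.0], [0.0, 1.0]]
    return [[I[r][c] + (1.0 - t) * (T[r][c] - I[r][c]) for c in range(2)]
            for r in range(2)]

def lie_path(theta, t):
    """Eq. (2) for a pure rotation: the exponential of the scaled
    Lie vector is simply a rotation by (1 - t) * theta."""
    return rotation((1.0 - t) * theta)

theta = math.pi / 2                       # assumed example: a quarter turn
T = rotation(theta)
det_matrix = det2(matrix_path(T, 0.5))    # midpoint of the matrix path
det_lie = det2(lie_path(theta, 0.5))      # midpoint of the Lie path
print(det_matrix, det_lie)
```

Under these assumptions the matrix path halves the area of the object at the midpoint, while the Lie path keeps it unchanged, which is the behaviour the text attributes to Fig. 1.c and Fig. 1.d.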

2.1. Lie groups

A Lie group is a differentiable manifold, that is, a structure $G$ with a "product" that is consistent with the continuous structure of a manifold. Additionally, it has a special element $e$ (the identity). In simple words, if two points $c$ and $d$ are close to $a$ and $b$, respectively, then the product $cd$ is close to the product $ab$. One property of Lie groups is that they locally have the topology of $\mathbb{R}^n$ everywhere.

We are going to work on a subset of the $n \times n$ real matrices that are invertible. The Lie group formed with such matrices is often called the General Linear Group, $GL(n)$. In this paper, we focus especially on the 2D general affine group, $GA(2)$. This particular group is useful to describe the general affine transformations in $\mathbb{R}^2$, i.e., a non-singular linear transformation followed by a translation. However, in computer vision it is sometimes not enough to work with $GA(2)$. Thus, we sometimes use the projective linear group $P(2)$, which allows us to work with projectivities, i.e., non-singular linear transformations in homogeneous coordinates.

Figure 1. Image trajectories. a) Initial position. b) Final position. c) Matrix parameterization. d) Lie parameterization. For c) and d) the second, fifth, seventh and final tenth steps are presented.

In addition to $GA(2)$ and $P(2)$, we are going to work with the 3D Euclidean group, $SE(3)$, which is useful for handling rigid body transformations in $\mathbb{R}^3$. The dimension of a group is the number of independent infinitesimal ways in which it can transform an object. $GA(2)$ and $SE(3)$ have dimension 6, while $P(2)$ has dimension 8.

$GA(2)$ is the group of the $3 \times 3$ matrices with the last row being $[0, 0, 1]$, i.e.,

$$T_A = \begin{bmatrix} A & t \\ 0^T & 1 \end{bmatrix}, \tag{3}$$

where $A$ is a $2 \times 2$ non-singular matrix and $t$ is a column 2-vector for the translation.

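Usually, we decompose such matrices to get the 6 independent transformations in the following way: translations along both the x- and y-axes, isotropic dilation about the origin, shear at 0 degrees, rotation about the origin and shear at 45 degrees. In homogeneous coordinates, and parameterized by $a$, these transformations are given by

$$
\begin{aligned}
T_{A1} &= \begin{bmatrix} 1 & 0 & a \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, &
T_{A2} &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & a \\ 0 & 0 & 1 \end{bmatrix}, \\
T_{A3} &= \begin{bmatrix} e^{a} & 0 & 0 \\ 0 & e^{a} & 0 \\ 0 & 0 & 1 \end{bmatrix}, &
T_{A4} &= \begin{bmatrix} e^{a} & 0 & 0 \\ 0 & e^{-a} & 0 \\ 0 & 0 & 1 \end{bmatrix}, \\
T_{A5} &= \begin{bmatrix} \cos(a) & -\sin(a) & 0 \\ \sin(a) & \cos(a) & 0 \\ 0 & 0 & 1 \end{bmatrix}, &
T_{A6} &= \begin{bmatrix} \cosh(a) & \sinh(a) & 0 \\ \sinh(a) & \cosh(a) & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\end{aligned}
\tag{4}
$$

$P(2)$ is the group of the matrices with the form

$$T_P = \begin{bmatrix} A & t \\ v^T & 1 \end{bmatrix}, \tag{5}$$

where $A$ is a $2 \times 2$ non-singular matrix, $t$ is a column 2-vector for the translation, and $v^T$ is a row 2-vector for the projectivity induced by the line at infinity.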

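It is easy to see from (3) and (5) that $GA(2)$ is a subset of $P(2)$; they differ in only two independent transformations, those corresponding to the projective transformations induced by the parameters of the line at infinity. In homogeneous coordinates, and parameterized by $a$, these transformations are given by

$$
T_{P7} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ a & 0 & 1 \end{bmatrix}, \qquad
T_{P8} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & a & 1 \end{bmatrix}.
\tag{6}
$$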

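Finally, we also deal with the group of 3D rigid body transformations, $SE(3)$. Such a transformation produces rotations and translations in 3D and is given by

$$T_E = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix}, \tag{7}$$

where $R$ is a $3 \times 3$ rotation matrix, and $t$ is a column 3-vector for the translation. We can express its 6 d.o.f. in the following way: translations along the x, y and z axes, and rotations about the x, y and z axes. These transformations, in homogeneous coordinates, and parameterized by $a$, are given by

$$
\begin{aligned}
T_{E1} &= \begin{bmatrix} 1 & 0 & 0 & a \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, &
T_{E2} &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & a \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, &
T_{E3} &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & a \\ 0 & 0 & 0 & 1 \end{bmatrix}, \\
T_{E4} &= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(a) & -\sin(a) & 0 \\ 0 & \sin(a) & \cos(a) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, &
T_{E5} &= \begin{bmatrix} \cos(a) & 0 & \sin(a) & 0 \\ 0 & 1 & 0 & 0 \\ -\sin(a) & 0 & \cos(a) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, &
T_{E6} &= \begin{bmatrix} \cos(a) & -\sin(a) & 0 & 0 \\ \sin(a) & \cos(a) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.
\end{aligned}
\tag{8}
$$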

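2.2. Lie generators

Lie generators build a vector basis, which spans the tangent space to the Lie group manifold. Next, we will show how to derive a Lie generator for the 2D affine group $GA(2)$. The equations for $SE(3)$ are similar, but they are in terms of 3D points. The following equations use the homogeneous representation of one 2D point. Differentiating a 2D transformation from (4) with respect to the parameter $a$ and evaluating at $a = 0$, we obtain a vector field $l_i$:

$$
l_i = \left.\frac{d\,\big(T_{Ai}(a) * x\big)}{da}\right|_{a=0}
    = \left.\frac{d\,T_{Ai}(a)}{da}\right|_{a=0} * \begin{bmatrix} x \\ y \\ 1 \end{bmatrix},
\tag{9}
$$

where $i = 1, \ldots, 6$. For the affine case, $x$ is a 2D homogeneous point and $*$ stands for the matrix product. Taking into consideration that differentiation is linear, we can decompose (9) in a product of matrices,

$$
l_i = G_i * x, \qquad G_i = \left.\frac{d\,T_{Ai}(a)}{da}\right|_{a=0},
\tag{10}
$$

$$
\begin{aligned}
G_1 &= \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, &
G_2 &= \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \\
G_3 &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, &
G_4 &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \\
G_5 &= \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, &
G_6 &= \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.
\end{aligned}
\tag{11}
$$

In the case of the 3D rigid transformation Lie group $SE(3)$, we proceed in a similar way, using the 3D homogeneous coordinates and applying (10) to (8) to obtain the Lie generators for 3D rigid motion,

$$
\begin{aligned}
G_1 &= \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, &
G_2 &= \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \\
G_3 &= \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}, &
G_4 &= \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \\
G_5 &= \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, &
G_6 &= \begin{bmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.
\end{aligned}
\tag{12}
$$

2.3. Lie algebras

Formally, a Lie algebra, $\mathfrak{g}$, is a tangent vector space to the Lie group, $G$, at the identity $e$, with a bilinear and anti-commutative product, the Lie bracket. A Lie algebra can be related to the Lie group using the exponential map,

$$\exp : \mathfrak{g} \rightarrow G, \tag{13}$$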

this is a diffeomorphism of some neighborhood of $0 \in \mathfrak{g}$ onto a neighborhood of $e \in G$. Hence a vector $v_{lie} \in \mathfrak{g}$ is related to a vector $V \in G$ as

$$V = \exp(v_{lie}). \tag{14}$$

As mentioned before, a Lie algebra has a product, the Lie bracket. The Lie bracket must satisfy the Jacobi identity:

$$[a, [b, c]] + [b, [c, a]] + [c, [a, b]] = 0. \tag{15}$$

In the particular case of matrices, the bracket is defined by the matrix commutator as

$$[A, B] = A * B - B * A, \tag{16}$$

where $*$ stands for the standard matrix product.

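The bracket relations between the generators of the affine group are given in Table 1. It is easy to see in this table that the bracket of any two generators is a linear combination of the generators themselves. For example

$$
[G_4, G_5] = G_4 * G_5 - G_5 * G_4
= \begin{bmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
- \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} 0 & -2 & 0 \\ -2 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
= -2 G_6.
\tag{17}
$$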

Hence, the generators $G_i$ form a basis for the vector space of the Lie algebra, i.e., any affine transformation $T$ is obtained as the exponential of the weighted sum of the generators ($G_1, \ldots, G_6$ of (11)) with scalars $t_i$,

$$T = \exp\Big(\sum_{i=1}^{6} t_i G_i\Big). \tag{18}$$

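Now, given the vectors $u_{lie} = \sum_i u_i G_i$, $v_{lie} = \sum_i v_i G_i$ and $w_{lie} = \sum_i w_i G_i$ in the Lie algebra, the bracket $u_{lie} = [v_{lie}, w_{lie}]$ is computed as follows,

$$u_{lie} = [v_{lie}, w_{lie}] = \sum_i \sum_j v_i w_j\, [G_i, G_j]. \tag{19}$$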

It is common practice to use a matrix representation for the transformations, such as those presented in (4), (6) and (8). However, in this work we prefer to use a natural representation of the affine transformation, $V = \sum_i v_i G_i$, as a 6-vector of the scalar coefficients $v_i$, $v_{lie} = [v_1, \ldots, v_6]^T$. The canonical basis of this representation corresponds to the 6 independent modes of affine deformation. Note that the identity $e$ corresponds to the identity transformation. This alternative representation uses a coordinate reference system that can be attached to any point of the group manifold to represent small transformations near the identity.

We use the Lie algebra ga(2) corresponding to GA(2) and se(3) corresponding to SE(3).


Table 1. Bracket relations for the affine group generators.
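The bracket relations can be recomputed directly from the generators. The sketch below is only an illustration under stated assumptions: it takes the affine generators to be the derivatives at $a = 0$ of the six one-parameter transformations, and the helper `coefficients` (a hypothetical name) reads off the weights of a bracket on $G_1, \ldots, G_6$. It is not a reproduction of the authors' table.

```python
def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def bracket(A, B):
    """Matrix commutator [A, B] = A*B - B*A."""
    AB, BA = matmul(A, B), matmul(B, A)
    return [[AB[i][j] - BA[i][j] for j in range(3)] for i in range(3)]

# Affine generators, assumed to be the derivatives at a = 0 of the
# one-parameter transformations (translations, dilation, shears, rotation).
G = {
    1: [[0, 0, 1], [0, 0, 0], [0, 0, 0]],    # translation along x
    2: [[0, 0, 0], [0, 0, 1], [0, 0, 0]],    # translation along y
    3: [[1, 0, 0], [0, 1, 0], [0, 0, 0]],    # isotropic dilation
    4: [[1, 0, 0], [0, -1, 0], [0, 0, 0]],   # shear at 0 degrees
    5: [[0, -1, 0], [1, 0, 0], [0, 0, 0]],   # rotation
    6: [[0, 1, 0], [1, 0, 0], [0, 0, 0]],    # shear at 45 degrees
}

def coefficients(M):
    """Weights c1..c6 such that M = sum(c_i * G_i), for M with zero last row."""
    c1, c2 = M[0][2], M[1][2]
    c3 = (M[0][0] + M[1][1]) / 2
    c4 = (M[0][0] - M[1][1]) / 2
    c5 = (M[1][0] - M[0][1]) / 2
    c6 = (M[1][0] + M[0][1]) / 2
    return [c1, c2, c3, c4, c5, c6]

# Print the upper triangle of the bracket table (the bracket is anti-symmetric).
for i in range(1, 7):
    for j in range(i + 1, 7):
        print(f"[G{i}, G{j}] ->", coefficients(bracket(G[i], G[j])))
```

Every bracket comes out as a weight vector on the six generators, which is the closure property the text relies on; for instance the pair (4, 5) yields a weight of -2 on the sixth generator.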

2.4. Motion integration

Since a vector in the Lie algebra is related to a Lie group by the exponential map, we need to compute the exponential of matrices. The exponential is easily approximated with the Taylor series expansion

$$\exp(M) = I + M + \frac{M^2}{2!} + \frac{M^3}{3!} + \cdots \tag{20}$$
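As a small illustration of the series expansion of the matrix exponential, the sketch below sums a truncated Taylor series and checks that exponentiating a scaled rotation generator gives back the corresponding rotation matrix. The cutoff of 30 terms and the function name `expm_taylor` are assumptions made for this example.

```python
import math

def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_taylor(M, terms=30):
    """Truncated series I + M + M^2/2! + ... with an assumed cutoff."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]            # holds M^k / k!
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

a = 0.7                                          # arbitrary rotation angle
G5 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
R = expm_taylor([[a * v for v in row] for row in G5])
expected = [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]
max_err = max(abs(R[i][j] - expected[i][j])
              for i in range(3) for j in range(3))
print(max_err)
```

For transformations close to the identity far fewer terms are needed; the generous cutoff here only keeps the check tight.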

Because of the higher order terms in (20), it is not possible to naively add vectors to obtain the vector that represents the product of transformations; it is therefore necessary to define an operation that truly represents the product of transformations. For vectors $u_{lie}$, $v_{lie}$ and $w_{lie}$, this operation $u_{lie} = v_{lie} \circ w_{lie}$ must imply $\exp(\sum u_i G_i) = \exp(\sum v_i G_i) * \exp(\sum w_i G_i)$. Then the operation is defined as

$$u_{lie} = v_{lie} \circ w_{lie} = v_{lie} + w_{lie} + \tfrac{1}{2}\,[v_{lie}, w_{lie}], \tag{21}$$

where $u_{lie} = \sum u_i G_i$, $v_{lie} = \sum v_i G_i$ and $w_{lie} = \sum w_i G_i$. Equation (21) provides a better approximation for integrating affine transformations using the vectors in the Lie algebra than just the addition of vectors. A comparison is presented in Fig. 2, where the solid line represents the product of transformations in the Lie group and the dotted line the sum of their respective Lie algebra parameterizations. Fig. 2.a) shows the result when the Lie vectors are just added, while Fig. 2.b) shows the Lie vectors integrated according to (21). It is clear that the latter yields a better approximation.

Figure 2. Sum of transformations: a) simply added b) integrated according to (21).
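The comparison of Fig. 2 can be imitated numerically. The sketch below is an assumption-laden illustration: it takes the integration rule to be the second-order truncation of the Baker-Campbell-Hausdorff series (sum plus half the bracket), uses a small translation and a small rotation as example Lie vectors, and measures how far each candidate is from the true product of exponentials. All helper names are hypothetical.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def add(A, B, scale_b=1.0):
    """Entry-wise A + scale_b * B."""
    return [[a + scale_b * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def bracket(A, B):
    """Matrix commutator A*B - B*A."""
    return add(matmul(A, B), matmul(B, A), -1.0)

def expm_taylor(M, terms=25):
    """Truncated Taylor series of the matrix exponential (assumed cutoff)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, M)]
        result = add(result, term)
    return result

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Example Lie vectors as matrices: 0.2 of an x-translation, 0.3 of a rotation.
V = [[0.0, 0.0, 0.2], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W = [[0.0, -0.3, 0.0], [0.3, 0.0, 0.0], [0.0, 0.0, 0.0]]

product = matmul(expm_taylor(V), expm_taylor(W))        # true group product
naive = expm_taylor(add(V, W))                          # vectors simply added
second_order = expm_taylor(add(add(V, W), bracket(V, W), 0.5))

err_sum = max_abs_diff(product, naive)
err_bch = max_abs_diff(product, second_order)
print(err_sum, err_bch)
```

With these example values the bracket correction reduces the discrepancy by more than an order of magnitude, in line with the qualitative claim made for Fig. 2.b).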

2.5. Affine motion estimation

As mentioned earlier in (20), the transformations are approximated using a Taylor series expansion. If the transformation is close to the identity, it can be estimated using only the linear terms, which simplifies the motion estimation. Let $x_j(t)$ and $x_j(t + 1)$ be two sets of corresponding points; we want to estimate the transformation $T$ that moves the set $x_j(t)$ to $x_j(t + 1)$. In other words, we want to estimate the coefficients of the Lie vector, $t_{lie}$, corresponding to $T$. We use the vector fields obtained from (9), which are

$$
\begin{aligned}
l_1 &= \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, &
l_2 &= \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, &
l_3 &= \begin{bmatrix} x \\ y \\ 0 \end{bmatrix}, \\
l_4 &= \begin{bmatrix} x \\ -y \\ 0 \end{bmatrix}, &
l_5 &= \begin{bmatrix} -y \\ x \\ 0 \end{bmatrix}, &
l_6 &= \begin{bmatrix} y \\ x \\ 0 \end{bmatrix};
\end{aligned}
\tag{22}
$$

note that we are keeping the third coordinate for the required homogeneous representation.

Knowing the correspondence between the points, it is easy to compute the direction of the motion for each pair of points, as well as the magnitude of that motion. We use the following equations for each 2D point $x_j$,

$$\sum_{i=1}^{m} a_i\,(l_i \cdot n_j) = d_j, \tag{23}$$

$$\sum_{i=1}^{m} a_i\,(l_i \cdot n_{j\perp}) = 0, \tag{24}$$

$$n_j = \frac{x_j(t) - x_j(t + 1)}{\|x_j(t) - x_j(t + 1)\|}, \tag{25}$$

where $n_j$ is the unit vector of the motion of point $x_j$, $d_j$ is the magnitude of the motion along the normal, and $n_j \cdot n_{j\perp} = 0$. In this context, we want to estimate the coefficients $a_i$ such that the induced transformation corresponds to the unit vector and the magnitude of the motion for all points. Moreover, since this is the only motion, if we project the motion vector onto $n_{j\perp}$, the projection is a null vector. See Fig. 3.
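A minimal least-squares sketch of this estimation is given below. It makes a few simplifying assumptions that are not in the text: the displacement of each point is taken as the new position minus the old one and is used directly in image coordinates (two equations per point, instead of projecting onto the motion direction and its perpendicular), the data are synthetic and generated from the linearized model itself, and the small linear system is solved with a hand-written Gaussian elimination. Function names are hypothetical.

```python
def vector_fields(x, y):
    """The six affine vector fields at (x, y), homogeneous zero dropped."""
    return [(1.0, 0.0), (0.0, 1.0), (x, y), (x, -y), (-y, x), (y, x)]

def solve(N, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(N)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def estimate_coefficients(points_t, points_t1):
    """Least-squares coefficients a_1..a_6 from point correspondences."""
    rows, rhs = [], []
    for (x0, y0), (x1, y1) in zip(points_t, points_t1):
        fields = vector_fields(x0, y0)
        rows.append([f[0] for f in fields]); rhs.append(x1 - x0)
        rows.append([f[1] for f in fields]); rhs.append(y1 - y0)
    # Normal equations (B^T B) a = B^T d of the stacked linear system.
    N = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    b = [sum(r[i] * d for r, d in zip(rows, rhs)) for i in range(6)]
    return solve(N, b)

# Synthetic test data: small, arbitrary coefficients and four points.
a_true = [0.02, -0.01, 0.03, 0.01, 0.05, -0.02]
points_t = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
points_t1 = []
for x, y in points_t:
    fields = vector_fields(x, y)
    dx = sum(a * f[0] for a, f in zip(a_true, fields))
    dy = sum(a * f[1] for a, f in zip(a_true, fields))
    points_t1.append((x + dx, y + dy))

a_est = estimate_coefficients(points_t, points_t1)
print(a_est)
```

Because the synthetic displacements follow the linear model exactly, the coefficients are recovered up to rounding; with real correspondences the estimate is only valid for transformations close to the identity.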

3. Region tracking

In this section, we present the estimation model used in our template tracking. This model is used to estimate the parameters of the 2D transformation based on the pixel intensities. In the classical approach, the estimator is computed using the Jacobian. In the following, we refer to this approach as the "J-method". In contrast, we propose to use a hyper-plane optimization method based on the theory of system identification. As mentioned before, this approach was presented recently by Jurie and Dhome [14] as Hyper-plane Tracking, but we will refer to it as the "A-method". The A-method yields a better estimation of the parameters because it can take into account the hidden variables in a system, possibly of higher order, while the J-method is just linear because it truncates the non-linear components of the Taylor series expansion. We have to clarify that Jurie and Dhome [14] did not use the A-method for estimating Lie algebra parameters. They used only the matrix representation of the transformation involved, which is actually weaker than our Lie algebra approach, as we will demonstrate in the section devoted to the experimental analysis.

Figure 3. Computing the motion vector (labels: previous position $x_j(t)$, new position $x_j(t+1)$).

3.1. Region tracking model

We briefly present the classical model, i.e., the model for the J-method. We use almost the same notation as Hager and Belhumeur [11]. A vision system acquires an image at time $t$. Let $I(x, t)$ be the brightness value at the image location $x = (x, y)$. For simplicity, we group a set of image locations as $S = \{x_1, x_2, \ldots, x_n\}$. This set defines the target region. Additionally, at time $t$ this set defines a vector of the brightness values of the observed target region, $I(S, t) = \{I(x_1, t), I(x_2, t), \ldots, I(x_n, t)\}$.

If the 3D relative position of the observed object varies with time, it induces changes in the image location of the observed region. This motion can be perfectly described by a parametric motion model that defines the new location, $x' = f(x; \mu(t))$, where $x$ and $x'$ denote image locations and $\mu(t) = (\mu_1(t), \ldots, \mu_m(t))$ is the set of parameters of the model. The parameters $\mu(t)$ are the unknowns which we have to estimate. Like other authors [11], we assume that $f$ is differentiable both in $x$ and in $\mu(t)$. The set of $n$ image locations affected by $f$ will be denoted as $f(S; \mu(t))$.

Considering the image at time $t_0$ as the initial image, the observed region can be taken as the reference region $I(S', t_0)$. At this time this region has the 2D position defined by a set of initial parameters $\mu_0$, $S' = f(S; \mu_0)$.

We can now define the problem precisely as follows: tracking an object at time $t$ means estimating, by a certain numerical method, $\mu(t)$ such that

$$I(f(S; \mu(t)), t) = I(f(S; \mu_0), t_0), \tag{26}$$

i.e., the purpose of any template tracking method is to track the brightness value vector $I(S', t_0)$ through time. In order to simplify the notation, we will refer to $I(f(S; \mu(t)), t)$ simply by $I(\mu(t), t)$.

The motion parameters vector can be estimated by minimizing the following equation for a set of sample motions,

(27)

Since we are working on visual tracking in image sequences, we can suppose continuity of motion. Assuming this continuity, we reformulate the problem to work with vector offsets from time $t$ to $t + \tau$, $\delta\mu$, instead of vectors representing total motion, $\mu(t + \tau)$. In this way, we have $\mu(t + \tau) = \mu(t) + \delta\mu$ for the image acquired at time $t + \tau$. Including the last modification, (27) becomes

$$O(\delta\mu) = \|I(\mu(t) + \delta\mu, t + \tau) - I(\mu_0, t_0)\|^2. \tag{28}$$

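If the magnitudes of the components of $\delta\mu$ and $\tau$ are small, it is possible to linearize the problem by expanding $I(\mu(t) + \delta\mu, t + \tau)$ in a Taylor series about $\mu$ and $t$, and truncating the terms of order higher than 1,

$$I(\mu + \delta\mu, t + \tau) \approx I(\mu, t) + I_\mu(\mu, t)\,\delta\mu + \tau\, I_t(\mu, t), \tag{29}$$

where $I_\mu$ is the Jacobian of the pixel intensities with respect to the parameters $\mu$, and $I_t$ is the Jacobian with respect to time $t$.

To further analyze this, we use the approximation

$$\tau\, I_t(\mu, t) \approx I(\mu, t + \tau) - I(\mu, t), \tag{30}$$

then (28) becomes

$$O(\delta\mu) \approx \|I_\mu(\mu, t)\,\delta\mu + I(\mu, t + \tau) - I(\mu_0, t_0)\|^2. \tag{31}$$

In order to simplify future equations, it will be convenient to define the vector of brightness differences,

$$\delta i = I(\mu(t), t + \tau) - I(\mu_0, t_0). \tag{32}$$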

3.2. A-method estimation in template tracking

In the classical J-method, we take the first derivative of (31) and make it equal to zero to obtain the solution

(33)

where $J(\mu, t) = I_\mu(\mu, t)$. This result is nothing else but an expression in terms of the Jacobian with respect to $\mu$. Fig. 4.a represents equation (33). The geometric interpretation of this result is that the change in brightness intensity is due to the motion of the parameters vector. Thus we can compute it as the projection (inner product) of the image gradient onto the motion vector of $\mu$. In order to compute the Jacobian we stack a number of observations to get

(34)

where $Y = [\delta i_1, \delta i_2, \ldots, \delta i_q]$ is the matrix of intensity differences and $M = [\delta\mu_1(t), \ldots, \delta\mu_q(t)]$ is the matrix of parameter differences. Taking the first derivative of (34) with respect to $J$, equating the result to zero and solving for $J$, we obtain

(35)

Thus, to compute the variations of the vector $\mu$, we resort to an expression in terms of the pseudo-inverse of the Jacobian, denoted as $J(\mu, t)^+$, i.e.

(36)

Figure 4. Methods for system identification: a) excite the system with a vector of offsets in the system parameters, b) excite the system with a vector of changes in brightness.

The previous method with equation (36) requires the computation of the pseudo-inverse of the Jacobian, $J(\mu, t)^+$; it depends on the parameters $\mu$, so it needs to be computed at each frame. Hager [11] proposed to factor the Jacobian; as a result, the online computations are reduced dramatically, but he presented the factorization only for transformations up to the affine model. Later, Buenaposada [5] extended this approach to work with projective transformations. The J-method has some problems because it only takes into account the linear terms of the Taylor series expansion.

In this work we use the A-method, which is actually well known in the field of system identification. In contrast to the J-method, we consider that the system input is a vector of intensity changes instead of a vector of parameter changes. This is presented in Fig. 4.b.

In the following, it will be assumed that we have a set of $n$ points that defines the target region. Also, assume that the transformations are fully described by $m$ parameters. Let $\delta i$ and $\delta\mu$ be intensity and parameter offsets. These offsets are relative to the initial values of brightness $I(\mu_0, t_0)$ and parameters $\mu_0$. The goal is to find the model output $\delta\hat{\mu}$ that best approximates the real offset $\delta\mu$ in the least squares sense, i.e., with the minimal value of the sum-of-squared-errors cost function. In this regard, we propose the following model for the system

$$\delta\hat{\mu} = A_h\, \delta i, \tag{37}$$

where the matrix $A_h$ represents the transfer function of the system. The rank of $A_h$ is $m$, and this tells us the number of unknown parameters which fully describe the system. The new formulation of the cost function is as follows

$$O = \|\delta\mu - A_h\, \delta i\|^2. \tag{38}$$

If we consider $q$ offsets, we can stack them in a matrix equation similar to (38) as follows

$$O = \|M - A_h Y\|^2, \tag{39}$$

where $Y_{n \times q} = [\delta i_1, \ldots, \delta i_q]$ is the matrix of intensity differences and $M_{m \times q} = [\delta\mu_1(t), \ldots, \delta\mu_q(t)]$ is the matrix of parameter differences. We can find the value of $A_h$ which minimizes $O$ simply by taking its first derivative with respect to $A_h$ and equating the result to zero, i.e.

$$-2\,(M - A_h Y)\,Y^T = 0, \tag{40}$$

which leads to

$$M Y^T - A_h Y Y^T = 0 \quad\Rightarrow\quad A_h = M Y^T (Y Y^T)^{-1}. \tag{41}$$

Now, from this perspective the solution of the problem can be formulated as

$$\delta\mu = A_h\, \delta i. \tag{42}$$

The existence of $(Y Y^T)^{-1}$ guarantees the observability of the estimator for computing $A_h$. In practice, (41) is simply solved using the SVD as follows,

$$A_h = M V D^{+} U^T, \tag{43}$$

with $Y = U D V^T$ and $D^{+}$ the Moore-Penrose pseudo-inverse of the diagonal matrix $D$.

This estimator is the best estimator in the least squares sense. If the system is observable, it can always be computed. The estimator is computed numerically from the data, so it deals with noise, which makes our method more robust. Remarkably enough, an exact linear relationship between the parameters and the observations is not required. The J-method leads to a linear approximation; in contrast, the A-method also takes strong non-linearities into account. Furthermore, a recursive formulation is possible, which can be applied on-line.

Note that $A_h = M Y^T (Y Y^T)^{-1}$ is completely different from the Jacobian $J = -(M^T M)^{-1} M^T Y$. Moreover, after we have computed $J$ with eq. (35), the estimation still requires the computation of the pseudo-inverse, eq. (36). In contrast, with the A-method we do not need any further inversion.
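The closed-form estimator can be illustrated on a tiny synthetic system. The sketch below is not the authors' code: the matrices `A_true` and `Y` are invented for the example, the parameter offsets are generated from an exactly linear relation, and the inverse is computed with a plain Gauss-Jordan routine rather than the SVD mentioned in the text.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Product of two rectangular matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(A)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def learn_estimator(M, Y):
    """A_h = M Y^T (Y Y^T)^-1, the least-squares map from Y columns to M columns."""
    Yt = transpose(Y)
    return matmul(matmul(M, Yt), inverse(matmul(Y, Yt)))

# Hypothetical system: m = 2 parameters, n = 3 intensities, q = 4 offsets.
A_true = [[1.0, 0.5, -0.2], [0.0, 2.0, 0.3]]
Y = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]            # intensity differences, one per column
M = matmul(A_true, Y)                 # matching parameter differences

A_h = learn_estimator(M, Y)
delta_i = [[0.2], [-0.1], [0.4]]      # a new brightness-difference vector
delta_mu = matmul(A_h, delta_i)       # estimated parameter offset
print(A_h, delta_mu)
```

Once the estimator has been learned, each new estimate costs a single matrix-vector product, which is the practical point made above about not needing a further inversion.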

3.3. A-method template tracker

In our experiments there are two stages, the learning stage and the tracking stage. The first can be done off-line, since it may take several seconds depending on the number of offsets and location points in the region. The second is the on-line real-time tracking of the region.

The learning stage involves all the steps to compute the least squares estimator for the model described in (37). To compute the model estimator, the target region can be selected automatically or manually in the image. This target region defines a set of initial parameters $\mu_0$ that transform a reference region into the target region, and the initial image brightness $I(\mu_0, t_0)$. Notice that the parameters are defined using Lie algebras. In order to compute the estimator, some offsets in the parameters and in the intensities are needed, as pointed out in the previous section. The parameter offsets induce offsets in the intensities, see Fig. 5.a). The objective of the tracker is to estimate the differences in the parameters that induced the changes in the intensities. It is helpful to remark that the inverses of the parameter offsets are learned instead of the generated offsets; this is depicted in Fig. 5.b), which shows the regions described by the parameters. The tracker supposes that the region does not move, but the object does. From this point of view the region undergoes the inverse of the generated offset transformation, see Fig. 5.b).

We generate $q$ random perturbations or offsets on the parameters, $\delta\mu$, stacking them into a matrix $M$. The parameter offsets, added to the initial parameters using (21), define a new region with corresponding image brightness. We take the differences in the image brightness, $\delta i$,

(44)

and stack them in a matrix Y. With such matrices we compute the model estimator using (41). This process is presented in Fig. 6.

Figure 5. a) Parameters offset induces intensities changes. b) In the learning stage the inverses of the parameters offset are learned.

Figure 6. Learning Stage. Block diagram to compute the A-method estimator.

Figure 7. Tracking Stage. Block diagram to estimate the parameters.

During the tracking stage, a new image is acquired. Let $\mu'$ be the last computed parameters. Using $\mu'$, we transform the reference region and measure the corresponding image brightness. With the image brightness differences and the estimator, we compute the offset for the parameters, $\delta\mu$. The process of estimating the new parameters is presented in Fig. 7.
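The two stages can be sketched on a deliberately tiny problem. Everything in the example below is an assumption made for illustration: the "image" is a one-dimensional sine profile, the region consists of two sample locations, the only parameter is a translation, the offsets are taken on a regular grid instead of at random, and the new parameters are obtained by plain addition because translations compose additively. The learned targets are the inverses of the generated offsets, as described for the learning stage.

```python
import math

def image(x, shift=0.0):
    """Hypothetical 1D brightness profile; `shift` moves the object."""
    return math.sin(x - shift)

S = [0.0, 1.0]                          # region: two sample locations
reference = [image(s) for s in S]       # brightness at mu_0 = 0, time t_0

# Learning stage (off-line): offsets on the parameter and induced brightness
# differences. The inverse offsets are stored as the quantity to predict.
offsets = [k / 10.0 for k in range(-5, 6) if k != 0]
Y = [[image(s + d) - ref for d in offsets] for s, ref in zip(S, reference)]
M = [-d for d in offsets]

# A_h = M Y^T (Y Y^T)^-1, with the 2x2 inverse written in closed form.
yy = [[sum(a * b for a, b in zip(Y[i], Y[j])) for j in range(2)]
      for i in range(2)]
my = [sum(m * y for m, y in zip(M, Y[i])) for i in range(2)]
det = yy[0][0] * yy[1][1] - yy[0][1] * yy[1][0]
inv = [[yy[1][1] / det, -yy[0][1] / det],
       [-yy[1][0] / det, yy[0][0] / det]]
A_h = [my[0] * inv[0][0] + my[1] * inv[1][0],
       my[0] * inv[0][1] + my[1] * inv[1][1]]

# Tracking stage (on-line): the object has moved by `true_shift`.
true_shift = 0.3
mu = 0.0                                # last computed parameter
for _ in range(5):
    # Brightness of the current image measured in the region placed at mu.
    di = [image(s + mu, true_shift) - ref for s, ref in zip(S, reference)]
    mu += A_h[0] * di[0] + A_h[1] * di[1]   # apply the estimated offset
print(mu)
```

In this toy setting a handful of iterations brings the estimated translation very close to the true shift; a real tracker would repeat the same measure-and-update step on every incoming frame, composing Lie vectors instead of adding scalars.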

3.4. Experimental results for template tracking

For the experiments, we track a region through a series of images acquired by a Mega-D camera connected to a Pentium III PC. We rectify the images to eliminate the radial distortion. We use 256 points in the region and 300 parameter offsets. We compare the results of our Lie parameters model with our implementation of the matrix parameters model. The matrix model uses the following decomposition, which makes the generation of parameter offsets easier,

(45)

where $R_1$, $R_2$ are rotation matrices and $D_\lambda$ is a diagonal matrix. Such matrices are computed using the SVD of $A = U D V^T$ in the following way: $R_1 = U V^T$, $R_2 = V^T$ and $D_\lambda = D$.

As a first experiment, we track a target region that undergoes a projective distortion. The target region is presented in Fig. 8. The results with the Lie parameterization are shown in Fig. 9.a), and the results with the matrix parameterization are shown in Fig. 9.b). Although the matrix parameterization tracks the target region, it presents small errors under the projective distortion, as can be seen in Fig. 9.b). The Lie parameterization is still able to track even when the region undergoes a severe projective deformation, see Fig. 10. According to these results, we can claim that the Lie parameters model has proved to be more robust and stable than the matrix parameters model.

In a second experiment, we use our method for face tracking. The target region is shown in Fig. 11. Some of the images of the sequence are presented in Fig. 12. Notice the difference in depth of the subject.


Figure 8. Projective initial region.

4. 3D motion estimation

In this section we will present the monocular estimation of 3D rigid motion, which takes place in the 3D visual space. We estimate the six parameters of the 3D rigid transformation using the six affine parameters measured on the projected image. With the approach presented in this paper, we are able to track a 3D relative position using only the information of a 2D image (monocular vision). This is done knowing the initial position in the image, and computing the affine motion of the target in the image. Notice that a single camera is used and no 3D reconstruction algorithm is needed.

4.1. System model

In our particular case, we address the problem of visual servoing. As in [9], we relate the parameters of GA(2) to those of SE(3). The former represent the affine transformation that the image of the target may undergo, depending on its 3D position relative to the camera. The latter represent the 3D position of the target relative to the camera. The parameters of GA(2) are collected in a 6-vector $y$, while those of SE(3) are collected in the 6-vector $\mu$, see Fig. 4.b).
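To make the role of a 6-vector of Lie parameters concrete, the following sketch maps six GA(2) parameters to a 3x3 homogeneous affine matrix through the matrix exponential. The generator basis used here (two translations, rotation, dilation and two shears) is a common choice and is an assumption; the basis actually used in this paper is defined in its earlier equations, which are not reproduced here. The helper `expm_series` is a plain scaling-and-squaring Taylor approximation written for self-containment.

```python
import numpy as np

def expm_series(X, terms=20, squarings=8):
    """Matrix exponential by a scaled Taylor series followed by squaring."""
    X_scaled = X / (2 ** squarings)
    result = np.eye(X.shape[0])
    term = np.eye(X.shape[0])
    for k in range(1, terms + 1):
        term = term @ X_scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

# Assumed basis of the Lie algebra of GA(2), as 3x3 homogeneous matrices.
GA2_GENERATORS = np.array([
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],    # x translation
    [[0, 0, 0], [0, 0, 1], [0, 0, 0]],    # y translation
    [[0, -1, 0], [1, 0, 0], [0, 0, 0]],   # rotation
    [[1, 0, 0], [0, 1, 0], [0, 0, 0]],    # dilation
    [[1, 0, 0], [0, -1, 0], [0, 0, 0]],   # shear along the axes
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],    # shear along the diagonals
], dtype=float)

def ga2_from_params(y):
    """Group element exp(sum_i y_i G_i) for a 6-vector of Lie parameters."""
    algebra_element = np.tensordot(y, GA2_GENERATORS, axes=1)
    return expm_series(algebra_element)

rotation = ga2_from_params(np.array([0.0, 0.0, 0.3, 0.0, 0.0, 0.0]))

# Scaling the parameter vector moves along a one-parameter subgroup:
y = np.array([5.0, -3.0, 0.2, 0.1, 0.05, -0.02])
half_step = ga2_from_params(0.5 * y)
full_step = ga2_from_params(y)
```

Applying `half_step` twice gives `full_step`, which illustrates why scaling the Lie parameters traces a one-parameter subgroup of the group.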

We suppose that there is a linear (or approximately linear) but unknown relationship between the parameters $\mu$ and $y$. So the following system model is proposed

$$\mu = A_h\, y \qquad (46)$$

where $A_h$ is a matrix representing the linear relationship.

4.2. J-method estimation of 3D motion

In most cases, we can obtain a ground observation $y_0$ corresponding to initial parameters $\mu_0$. We want to compute a small change $\delta\mu$ in the parameters that corresponds to a new observation $y'$ close to the initial $y_0$.

In the classical approach, using the J-method, we suppose that there is a linear relationship between the parameters and the measurements. Then we can propose a model like:

$$\delta y = J_g(\mu)\,\delta\mu \qquad (47)$$

that is, for a given function $g$, we compute the Jacobian matrix $J_g(\mu)$ and use it to express the relationship between the variations of parameters and observations. Since we want to estimate the parameters, we derive from (47) the following equation:

$$\delta\mu = J_g^{+}(\mu)\,\delta y \qquad (48)$$

where $J_g^{+}(\mu)$ is the Moore-Penrose pseudo-inverse of $J_g(\mu)$.

Figure 9. Projective tracking sequence. a) Lie parameterization. b) Matrix parameterization. In both cases the same image sequence is used.

Figure 10. Projective image tracking using Lie parameterization. Three images of a sequence under a severe projective transformation, far from the fronto-parallel projection.

Figure 11. Face initial region.

Figure 12. Face tracking sequence.

One of the principal disadvantages of this linear approach is that it is sometimes very hard to compute $J_g^{+}(\mu)$ analytically (if it can be computed at all), and if we focus on a slightly different problem it must be recomputed completely. An interesting discussion of this difficulty in computing $J_g^{+}(\mu)$ was given by Hager and Belhumeur [11].
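For comparison with what follows, the J-method step can be sketched numerically: given a Jacobian and a measured change in the observations, the parameter change is obtained through the pseudo-inverse. This is a minimal sketch with a made-up, well-conditioned Jacobian; in practice the Jacobian has to be derived analytically for each problem, which is exactly the difficulty discussed above. The function name `j_method_update` is hypothetical.

```python
import numpy as np

def j_method_update(jacobian, delta_y):
    """Parameter change from an observation change via the pseudo-inverse."""
    return np.linalg.pinv(jacobian) @ delta_y

# Made-up Jacobian (assumption): identity plus a small random perturbation.
rng = np.random.default_rng(1)
J = np.eye(6) + 0.1 * rng.normal(size=(6, 6))

delta_mu_true = np.array([0.01, -0.02, 0.005, 0.0, 0.03, -0.01])
delta_y = J @ delta_mu_true            # observation change predicted by J
delta_mu_est = j_method_update(J, delta_y)
```

When the observation change is exactly linear in the parameters, as in this toy setup, the estimate matches `delta_mu_true`; any mismatch between the analytic Jacobian and the real system appears directly as estimation error.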

We have developed an alternative for estimating the parameters that does not rely on $J_g^{+}(\mu)$, inspired by ideas from system identification theory. The basic idea is presented in the following sections.

4.3. A-method estimation of 3D motion

As in the classical approach, we have initial parameters $\mu_0$ and observations $y_0$, and we want to estimate the variations in the parameters $\delta\mu$ using the measured differences $\delta y$ with respect to the initial observations. So (46) becomes

$$\delta\mu = A_h\,\delta y. \qquad (49)$$

To solve (49) we consider, similarly to eqs. (37)-(42), the quadratic error

(50)

stacking q observations in a matrix we obtain

(51)

For the A-method we set the first derivative to zero and solve for $A_h$, resulting in

(52)

As we said before, the matrix $A_h$ is completely different from the Jacobian $J = -(M^{T}M)^{-1}M^{T}y$.
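As a sketch of the least-squares step for the linear model $\delta\mu = A_h\,\delta y$, the derivation can be written as follows. The symbols $\Delta M$ and $\Delta Y$ for the stacked offsets are notation assumed here for illustration and need not match the paper's own symbols; the result also assumes that $\Delta Y \Delta Y^{T}$ is invertible, which requires at least six linearly independent observations.

```latex
% q pairs of offsets (\delta\mu_i, \delta y_i), stacked column-wise (assumed notation)
\Delta M = [\,\delta\mu_1 \;\cdots\; \delta\mu_q\,], \qquad
\Delta Y = [\,\delta y_1 \;\cdots\; \delta y_q\,]

% quadratic error of the linear model \delta\mu = A_h \delta y
E(A_h) = \sum_{i=1}^{q} \lVert \delta\mu_i - A_h\,\delta y_i \rVert^2
       = \lVert \Delta M - A_h\,\Delta Y \rVert_F^2

% setting the first derivative to zero
\frac{\partial E}{\partial A_h} = -2\,(\Delta M - A_h\,\Delta Y)\,\Delta Y^{T} = 0

% solving for A_h
A_h = \Delta M\,\Delta Y^{T}\,(\Delta Y\,\Delta Y^{T})^{-1}
```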

4.4. Robot motion estimation

In the learning stage, we generate random offsets in the 3D parameters, and the GA(2) motion of the features in the image is then estimated using eqs. (23)-(25). The $\mu_0$ parameters are subtracted from those obtained with the offsets using (21), giving the differences that are stacked to build the matrices in (52).

In the online control stage, we track the features of the target. Knowing the current position of the target in the image, the affine transformation from the initial to the current position, $\delta y$, is computed using eqs. (23)-(25). Then we multiply $A_h$ by $\delta y$, obtaining $\delta\mu$. Finally, using (21), we integrate this $\delta\mu$ with the initial parameters $\mu_0$ to obtain the current relative position of the target.
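The two stages above can be sketched with a toy linear observation model. This is a minimal sketch under stated assumptions, not the authors' implementation: the matrix `B` is a made-up stand-in for the real camera projection and affine measurement of eqs. (23)-(25), the function names are hypothetical, and the final composition with $\mu_0$ through (21) is left out because that equation is not reproduced here.

```python
import numpy as np

def learn_estimator(delta_mu_samples, delta_y_samples):
    """Least-squares fit of A_h such that delta_mu ~= A_h @ delta_y.

    Both inputs have shape (q, 6): one row per random offset.
    """
    # Solve delta_y_samples @ A_h.T ~= delta_mu_samples in the least-squares sense.
    A_h_T, *_ = np.linalg.lstsq(delta_y_samples, delta_mu_samples, rcond=None)
    return A_h_T.T

def estimate_offset(A_h, delta_y):
    """Online stage: multiply the measured affine change by the estimator."""
    return A_h @ delta_y

# Learning stage with a toy observation model delta_y = B @ delta_mu (assumption).
rng = np.random.default_rng(42)
B = np.eye(6) + 0.2 * rng.normal(size=(6, 6))
delta_mu_samples = 0.05 * rng.normal(size=(300, 6))   # 300 random 3D offsets
delta_y_samples = delta_mu_samples @ B.T               # simulated affine offsets
A_h = learn_estimator(delta_mu_samples, delta_y_samples)

# Online stage: a new, unseen motion.
new_delta_mu = np.array([0.02, -0.01, 0.015, 0.0, -0.03, 0.01])
recovered = estimate_offset(A_h, B @ new_delta_mu)
```

Because the toy model is exactly linear and noise-free, `A_h` converges to the inverse of `B`; with real measurements the fit instead averages over noise and mild nonlinearity in the sampled neighbourhood.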


|            | A-method | J-method |
|------------|----------|----------|
| sample 34  | 22. 932  | 65.5336  |
| sample 167 | 45.8761  | 56.9078  |
| sample 245 | 38.9844  | 108.073  |
| avg.       | 27.2 02  | 97.4065  |

Table 2. Errors in the 3D points for 3 different sample motions.

4.5. Experimental results for 3D motion estimation

With our approach, we are able to track a 3D relative position using only the information of a 2D image. We carried out several experiments to test our algorithm.

The first experiment was a simulation to compare with the J-method. We place a plane with the target in front of the camera at 1500 units. This plane was rotated 45 degrees about the x axis. We apply random offsets to the initial parameters. With the offsets we move the target and project it onto the image to estimate the offsets in the 2D parameters. We use the offsets to compute our model estimator using (43). The Jacobian was computed using the model estimator proposed in [9],

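$$J = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & -\dfrac{2d}{\tan\theta} \\
0 & 0 & 0 & 0 & \dfrac{2d}{\tan\theta} & 0 \\
0 & 0 & 0 & -d & 0 & 0 \\
0 & -1 & 0 & 0 & \dfrac{2}{\tan\theta} & 0 \\
1 & 0 & 0 & 0 & 0 & \dfrac{2}{\tan\theta} \\
0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix} \qquad (53)$$

where $d$ is the distance along the z axis from the center of the camera to the target plane, and $\theta$ is the angle between the target plane and the image plane.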

After the system learning, we estimated 300 different motions. We estimate the affine motion with eqs. (23)-(25). The 3D motion is estimated with (49) (A-method) and (48) (J-method). In Fig. 13 the error in the estimation of the parameters is presented. Note that the error in the angles is less than 1 degree in each axis, and the highest error in the translation is 20-35 of 1500 units, i.e., roughly 1 to 2 per cent. We must point out that we did not apply any control rule for either method. Table 2 shows the norm of the total errors for three different motions and the total average error. The reader can see in Fig. 14.a) that our A-method almost overlaps the ground truth, whereas the J-method is shifted, Fig. 14.b).

A second experiment uses a monocular pan-tilt unit. We want to estimate the relative position of the unit. The experiment is to track the target shown in Fig. 15. The estimation is computed with the readings of the pan-tilt angles and the affine motion measured from the images. We move the pan-tilt unit to several positions, estimating the corresponding relative motion. After computing the estimator, we place the pan-tilt unit in another position; our algorithm measures the affine motion and estimates the angles of the pan-tilt unit using (49). We compare our estimates with the ground truth. The plots of the angle errors are shown in Fig. 16; note that the errors are bounded by 1 degree. The average quadratic error was 0.03 degrees in the pan angle and 0.18 degrees in the tilt angle. In another experiment using the same scenario, we manually moved the target and measured the affine motion. With this measurement we estimated the relative position of the pan-tilt unit. In this manner, using the estimated transformation, we were able to relocalize the unit so that the tracked features match the old ones, as we show in Fig. 17.

The final experiment was to estimate the relative angles of a robotic arm. The scenario is shown in Fig. 18.a). We move the arm within the field of view of the camera. One view from the robot is shown in Fig. 18.b). In the learning stage, we change the angles of the arm and measure the affine motion in the images. In the control stage, we place the arm in several positions and estimate the angles. Since the 5 d.o.f. robot arm mounted on the mobile robot is very simple, it has a low positioning accuracy, with a precision of 1 cm. We compare the tip position of the arm given by our estimate against the expected tip position. The error norm is presented in Fig. 18.c). The tip of the arm for some sample positions is presented in Fig. 18.d). This experiment indicates that our method is useful for tasks such as the control of robotic arms when they become de-calibrated.

Figure 13. Errors in the 3D parameters estimation. The left column shows the x, y and z angles (+/- 8 degrees), and the right column shows the x, y and z translations (+/- 250 units), each plotted against samples. The solid line represents the A-method, the dotted line represents the J-method.

5. Conclusions

In the case of template tracking, this work shows that the Lie parameterization is better than a matrix parameterization. We present the A-method estimation, in a least-squares sense, of the Lie parameters which govern the involved group transformation. Our experiments show that the model can robustly track a template in an image sequence. We have shown that for problems that relate 2D data to 3D data, it is better to work with the A-method than with the J-method. Moreover, it is a great advantage to work with a Lie parameterization of the 2D and 3D transformations, since it allows us to move along geodesics in the space parameterized by the Lie parameters. The experiments validate the potential of our approach.


Figure 14. Errors in the 3D parameters estimation. a) (upper row) The A-method. b) (lower row) The J-method. Results of four transformations generated randomly.

Figure 15. Target for pan-tilt angle estimation.

Figure 16. Errors in pan-tilt angles, plotted against samples. a) Pan angle. b) Tilt angle. (+/- 1 degree)


Figure 17. Pan-Tilt estimation, three examples of the sequence. a) Initial position. b) Position after moving the pan-tilt unit.

Figure 18. Robotic arm estimation. a) Scenario. b) View from the camera. c) Error in position. d) Final point positions.

References

[1] A. Azarbayejani and A. Pentland. Recursive estimation of motion, structure and focal length. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(6):562-575, 1995.

[2] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221-255, March 2004.

[3] S. Benhimane and E. Malis. Real-time image-based tracking of planes using efficient second-order minimization. In Proceedings of 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 943-948, September 28 - October 2 2004.

[4] T. J. Broida, S. Chandrashekhar, and R. Chellappa. Recursive 3-d motion estimation from a monocular image sequence. IEEE Trans. Aerosp. Electron. Syst., 26(4):639-656, July 1990.

[5] J. M. Buenaposada and L. Baumela. Real-time tracking and estimation of plane pose. In Proc. Int. Conf. Pattern Recogn., Vol. II, pages 697-700, August 2002.

[6] H. Chiuso, P. Favaro, and S. Soatto. Structure from motion and causally integrated over time. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(4):523-535, 2002.

[7] C. Colombo and B. Allotta. Image-based robot task planning and control using compact visual representation. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans, 29(1):92-100, January 1999.

[8] K. Daniilidis. Fixation simplifies 3d motion estimation. Comput. Vis. Image Underst., 68(2):158-169, November 1997.

[9] T. Drummond and R. Cipolla. Application of Lie algebras to visual servoing. Intl. J. Comp. Vision, 37(1):21-41, June 2000.

[10] T. Frankel. The Geometry of Physics: an Introduction. Cambridge University Press, Cambridge, UK, 1997.

[11] G. D. Hager and P. N. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. IEEE Trans. Pattern Anal. Mach. Intell., 20(10):1025-1039, October 1998.

[12] B. C. Hall. Lie Groups, Lie Algebras, and Representations: an elementary introduction. Springer-Verlag, New York, NY, USA, 2003.

[13] M. A. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. Intl. J. Comp. Vision, 29(1):5-28, August 1998.

[14] F. Jurie and M. Dhome. Hyperplane approximation for template matching. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):996-1000, July 2002.

[15] M. La Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: an approach based on registration of texture-mapped 3d models. IEEE Trans. Pattern Anal. Mach. Intell., 22(4):322-336, April 2000.

[16] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 1981 DARPA Image Understanding Workshop, pages 121-130, April 1981.

[17] S. Mann and R. W. Picard. Video orbits of the projective group: A simple approach to featureless estimation of parameters. IEEE Trans. Image Process., 6(9):1281-1295, September 1997.

[18] F. C. Park, J. E. Bobrow, and S. Ploen. A Lie group formulation of robot dynamics. Int. J. Rob. Res., 14(6):609-618, December 1995.

[19] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Int. Conf. Comp. Vision and Pattern Recogn., pages 593-600, June 1994.

[20] H. Shum and R. Szeliski. Construction of panoramic image mosaics with global and local alignment. International Journal of Computer Vision, 16(1):63-84, 2000.

[21] T. M. Martinetz, H. J. Ritter, and K. J. Schulten. Three-dimensional neural net for learning visuomotor coordination of a robot arm. IEEE Trans. Neural Netw., 1(1):131-136, March 1990.

[22] T. Tommasini, A. Fusiello, E. Trucco, and V. Roberto. Making good features track better. In Proc. IEEE Int. Conf. Comp. Vision and Pattern Recogn., pages 178-183, June 1998.