This action might not be possible to undo. Are you sure you want to continue?

# Pyramidal Implementation of the Lucas Kanade Feature Tracker

**Description of the algorithm
**

Jean-Yves Bouguet

Intel Corporation

Microprocessor Research Labs

jean-yves.bouguet@intel.com

1 Problem Statement

Let I and J be two 2D grayscaled images. The two quantities I(x) = I(x, y) and J(x) = J(x, y) are then the

grayscale value of the two images are the location x = [x y]

T

, where x and y are the two pixel coordinates of a generic

image point x. The image I will sometimes be referenced as the ﬁrst image, and the image J as the second image.

For practical issues, the images I and J are discret function (or arrays), and the upper left corner pixel coordinate

vector is [0 0]

T

. Let n

x

and n

y

be the width and height of the two images. Then the lower right pixel coordinate

vector is [n

x

−1 n

y

−1]

T

.

Consider an image point u = [u

x

u

y

]

T

on the ﬁrst image I. The goal of feature tracking is to ﬁnd the location

v = u+d = [u

x

+d

x

u

y

+d

y

]

T

on the second image J such as I(u) and J(v) are “similar”. The vector d = [d

x

d

y

]

T

is the image velocity at x, also known as the optical ﬂow at x. Because of the aperture problem, it is essential to

deﬁne the notion of similarity in a 2D neighborhood sense. Let ω

x

and ω

y

two integers. We deﬁne the image velocity

d as being the vector that minimizes the residual function deﬁned as follows:

(d) = (d

x

, d

y

) =

u

x

+ω

x

x=u

x

−ω

x

u

y

+ω

y

y=u

y

−ω

y

(I(x, y) −J(x + d

x

, y + d

y

))

2

. (1)

Observe that following that deﬁntion, the similarity function is measured on a image neighborhood of size (2ω

x

+

1) ×(2ω

y

+1). This neighborhood will be also called integration window. Typical values for ω

x

and ω

y

are 2,3,4,5,6,7

pixels.

2 Description of the tracking algorithm

The two key components to any feature tracker are accuracy and robustness. The accuracy component relates to

the local sub-pixel accuracy attached to tracking. Intuitively, a small integration window would be preferable in order

not to “smooth out” the details contained in the images (i.e. small values of ω

x

and ω

y

). That is especially required

at occluding areas in the images where two patchs potentially move with very diﬀerent velocities.

The robustness component relates to sensitivity of tracking with respect to changes of lighting, size of image

motion,... In particular, in oder to handle large motions, it is intuively preferable to pick a large integration window.

Indeed, considering only equation 1, it is preferable to have d

x

≤ ω

x

and d

y

≤ ω

y

(unless some prior matching

information is available). There is therefore a natural tradeoﬀ between local accuracy and robustness when choosing

the integration window size. In provide to provide a solution to that problem, we propose a pyramidal implementation

of the classical Lucas-Kanade algorithm. An iterative implementation of the Lucas-Kanade optical ﬂow computation

provides suﬃcient local tracking accuracy.

2.1 Image pyramid representation

Let us deﬁne the pyramid representsation of a generic image I of size n

x

× n

y

. Let I

0

= I be the “zero

th

” level

image. This image is essentially the highest resolution image (the raw image). The image width and height at that

level are deﬁned as n

0

x

= n

x

and n

0

y

= n

y

. The pyramid representation is then built in a recursive fashion: compute

I

1

from I

0

, then compute I

2

from I

1

, and so on... Let L = 1, 2, . . . be a generic pyramidal level, and let I

L−1

be

the image at level L − 1. Denote n

L−1

x

and n

L−1

y

the width and height of I

L−1

. The image I

L−1

is then deﬁned as

follows:

I

L

(x, y) =

1

4

I

L−1

(2x, 2y) +

1

8

_

I

L−1

(2x −1, 2y) + I

L−1

(2x + 1, 2y) + I

L−1

(2x, 2y −1) + I

L−1

(2x, 2y + 1)

_

+

1

16

_

I

L−1

(2x −1, 2y −1) + I

L−1

(2x + 1, 2y + 1) + I

L−1

(2x −1, 2y + 1) + I

L−1

(2x + 1, 2y + 1)

_

.

(2)

1

For simplicity in the notation, let us deﬁne dummy image values one pixel around the image I

L−1

(for 0 ≤ x ≤ n

L−1

x

−1

and 0 ≤ y ≤ n

L−1

y

−1):

I

L−1

(−1, y)

.

= I

L−1

(0, y),

I

L−1

(x, −1)

.

= I

L−1

(x, 0),

I

L−1

(n

L−1

x

, y)

.

= I

L−1

(n

L−1

x

−1, y),

I

L−1

(x, n

L−1

y

)

.

= I

L−1

(x, n

L−1

y

−1),

I

L−1

(n

L−1

x

, n

L−1

y

)

.

= I

L−1

(n

L−1

x

−1, n

L−1

y

−1).

Then, equation 2 is only deﬁned for values of x and y such that 0 ≤ 2x ≤ n

L−1

x

−1 and 0 ≤ 2y ≤ n

L−1

y

−1. Therefore,

the width n

L

x

and height n

L

y

of I

L

are the largest integers that satisfy the two conditions:

n

L

x

≤

n

L−1

x

+ 1

2

, (3)

n

L

y

≤

n

L−1

y

+ 1

2

. (4)

Equations (2), (3) and (4) are used to construct recursively the pyramidal representations of the two images I and

J:

_

I

L

_

L=0,...,L

m

and

_

J

L

_

L=0,...,L

m

. The value L

m

is the height of the pyramid (picked heuristically). Practical

values of L

m

are 2,3,4. For typical image sizes, it makes no sense to go above a level 4. For example, for an image I

of size 640 ×480, the images I

1

, I

2

, I

3

and I

4

are of respective sizes 320 ×240, 160 ×120, 80 ×60 and 40 ×30. Going

beyond level 4 does not make much sense in most cases. The central motivation behind pyramidal representation is

to be able to handle large pixel motions (larger than the integration window sizes ω

x

and ω

y

). Therefore the pyramid

height (L

m

) should also be picked appropriately according to the maximum expected optical ﬂow in the image. The

next section describing the tracking operation in detail we let us understand that concept better. Final observation:

equation 2 suggests that the lowpass ﬁlter [1/4 1/2 1/4] ×[1/4 1/2 1/4]

T

is used for image anti-aliasing before image

subsampling. In practice however (in the C code) a larger anti-aliasing ﬁlter kernel is used for pyramid construction

[1/16 1/4 3/8 1/4 1/16] ×[1/16 1/4 3/8 1/4 1/16]

T

. The formalism remains the same.

2.2 Pyramidal Feature Tracking

Recall the goal of feature tracking: for a given point u in image I, ﬁnd its corresponding location v = u + d in

image J, or alternatively ﬁnd its pixel displacement vector d (see equation 1).

For L = 0, ..., L

m

, deﬁne u

L

= [u

L

x

u

L

y

], the corresponding coordinates of the point u on the pyramidal images

I

L

. Following our deﬁnition of the pyramid representation equations (2), (3) and (4), the vectors u

L

are computed

as follows:

u

L

=

u

2

L

. (5)

The division operation in equation 5 is applied to both coordinates independently (so will be the multiplication

operations appearing in subsequent equations). Observe that in particular, u

0

= u.

The overall pyramidal tracking algorithm proceeds as follows: ﬁrst, the optical ﬂow is comptuted at the deepest

pyramid level L

m

. Then, the result of the that computation is propagated to the upper level L

m

−1 in a form of an

initial guess for the pixel displacement (at level L

m

−1). Given that initial guess, the reﬁned optical ﬂow is computed

at level L

m

−1, and the result is propagated to level L

m

−2 and so on up to the level 0 (the original image).

Let us now describe the recursive operation between two generics levels L+1 and L in more mathematical details.

Assume that an initial guess for optical ﬂow at level L, g

L

= [g

L

x

g

L

y

]

T

is available from the computations done from

level L

m

to level L+1. Then, in order to compute the optical ﬂow at level L, it is necessary to ﬁnd the residual pixel

displacement vector d

L

= [d

L

x

d

L

y

]

T

that minimizes the new image matching error function

L

:

L

(d

L

) =

L

(d

L

x

, d

L

y

) =

u

L

x

+ω

x

x=u

L

x

−ω

x

u

L

y

+ω

y

y=u

L

y

−ω

y

_

I

L

(x, y) −J

L

(x + g

L

x

+ d

L

x

, y + g

L

y

+ d

L

y

)

_

2

. (6)

Observe that the window of integration is of constant size (2ω

x

+ 1) × (2ω

y

+ 1) for all values of L. Notice that the

initial guess ﬂow vector g

L

is used to pre-translate the image patch in the second image J. That way, the residual

ﬂow vector d

L

= [d

L

x

d

L

y

]

T

is small and therefore easy to compute through a standard Lucas Kanade step.

2

The details of computation of the residual optical ﬂow d

L

will be described in the next section 2.3. For now, let us

assume that this vector is computed (to close the main loop of the algorithm). Then, the result of this computation

is propagated to the next level L −1 by passing the new initial guess g

L−1

of expression:

g

L−1

= 2 (g

L

+d

L

). (7)

The next level optical ﬂow residual vector d

L−1

is then computed through the same procedure. This vector, computed

by optical ﬂow computation (decribed in Section 2.3), minimizes the functional

L−1

(d

L−1

) (equation 6). This

procedure goes on until the ﬁnest image resolution is reached (L = 0). The algorithm is initialized by setting the

initial guess for level L

m

to zero (no initial guess is available at the deepest level of the pyramid):

g

L

m

= [0 0]

T

. (8)

The ﬁnal optical ﬂow solution d (refer to equation 1) is then available after the ﬁnest optical ﬂow computation:

d = g

0

+d

0

. (9)

Observe that this solution may be expressed in the following extended form:

d =

L

m

L=0

2

L

d

L

. (10)

The clear advantage of a pyramidal implementation is that each residual optical ﬂow vector d

L

can be kept very small

while computing a large overall pixel displacement vector d. Assuming that each elementary optical ﬂow computation

step can handle pixel motions up to d

max

, then the overall pixel motion that the pyramidal implementation can handle

becomes d

max ﬁnal

= (2

L

m

+1

− 1) d

max

. For example, for a pyramid depth of L

m

= 3, this means a maximum pixel

displacement gain of 15! This enables large pixel motions, while keeping the size of the integration window relatively

small.

2.3 Iterative Optical Flow Computation (Iterative Lucas-Kanade)

Let us now describe the core optical ﬂow computation. At every level L in the pyramid, the goal is ﬁnding the

vector d

L

that minimizes the matching function

L

deﬁned in equation 6. Since the same type of operation is per-

formed for all levels L, let us now drop the superscripts L and deﬁne the new images A and B as follows:

∀(x, y) ∈ [p

x

−ω

x

−1, p

x

+ ω

x

+ 1] ×[p

y

−ω

y

−1, p

y

+ ω

y

+ 1],

A(x, y)

.

= I

L

(x, y), (11)

∀(x, y) ∈ [p

x

−ω

x

, p

x

+ ω

x

] ×[p

y

−ω

y

, p

y

+ ω

y

],

B(x, y)

.

= J

L

(x + g

L

x

, y + g

L

y

). (12)

Observe that the domains of deﬁnition of A(x, y) and B(x, y) are slightly diﬀerent. Indeed, A(x, y) is deﬁned over a

window of size (2ω

x

+3)×(2ω

y

+3) instead of (2ω

x

+1)×(2ω

y

+1). This diﬀerence will become clear when computing

spatial derivatives of A(x, y) using the central diﬀerence operator (eqs. 19 and 20). For clarity purposes, let us change

the name of the displacement vector ν = [ν

x

ν

y

]

T

= d

L

, as well as the image position vector p = [p

x

p

y

]

T

= u

L

.

Following that new notation, the goal is to ﬁnd the displacement vector ν = [ν

x

ν

y

]

T

that minimizes the matching

function:

ε(ν) = ε(ν

x

, ν

y

) =

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

(A(x, y) −B(x + ν

x

, y + ν

y

))

2

. (13)

A standard iterative Lucas-Kanade may be applied for that task. At the optimimum, the ﬁrst derivative of ε with

respect to ν is zero:

∂ε(ν)

∂ν

¸

¸

¸

¸

ν=ν

opt

= [0 0]. (14)

After expansion of the derivative, we obtain:

∂ε(ν)

∂ν

= −2

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

(A(x, y) −B(x + ν

x

, y + ν

y

)) .

_

∂B

∂x

∂B

∂y

_

. (15)

3

Let us now substitute B(x + ν

x

, y + ν

y

) by its ﬁrst order Taylor expansion about the point ν = [0 0]

T

(this has a

good chance to be a valid approximation since we are expecting a small displacement vector, thanks to the pyramidal

scheme):

∂ε(ν)

∂ν

≈ −2

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

A(x, y) −B(x, y) −

_

∂B

∂x

∂B

∂y

_

ν

_

.

_

∂B

∂x

∂B

∂y

_

. (16)

Observe that the quatity A(x, y) −B(x, y) can be interpreted as the temporal image derivative at the point [x y]

T

:

∀(x, y) ∈ [p

x

−ω

x

, p

x

+ ω

x

] ×[p

y

−ω

y

, p

y

+ ω

y

],

δI(x, y)

.

= A(x, y) −B(x, y). (17)

The matrix

_

∂B

∂x

∂B

∂y

_

is merely the image gradient vector. Let’s make a slight change of notation:

∇I =

_

I

x

I

y

_

.

=

_

∂B

∂x

∂B

∂y

_

T

. (18)

Observe that the image derivatives I

x

and I

y

may be computed directly from the ﬁrst image A(x, y) in the (2ω

x

+

1) × (2ω

y

+ 1) neighborhood of the point p independently from the second image B(x, y) (the importance of this

observation will become apparent later on when describing the iterative version of the ﬂow computation). If a central

diﬀerence operator is used for derivative, the two derivative images have the following expression:

∀(x, y) ∈ [p

x

−ω

x

, p

x

+ ω

x

] ×[p

y

−ω

y

, p

y

+ ω

y

],

I

x

(x, y) =

∂A(x, y)

∂x

=

A(x + 1, y) −A(x −1, y)

2

, (19)

I

y

(x, y) =

∂A(x, y)

∂y

=

A(x, y + 1) −A(x, y −1)

2

. (20)

In practice, the Sharr operator is used for computing image derivatives (very similar to the central diﬀerence operator).

Following this new notation, equation 16 may be written:

1

2

∂ε(ν)

∂ν

≈

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

∇I

T

ν −δI

_

∇I

T

, (21)

1

2

_

∂ε(ν)

∂ν

_

T

≈

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

__

I

2

x

I

x

I

y

I

x

I

y

I

2

y

_

ν −

_

δI I

x

δI I

y

__

. (22)

Denote

G

.

=

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

I

2

x

I

x

I

y

I

x

I

y

I

2

y

_

and b

.

=

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

δI I

x

δI I

y

_

. (23)

Then, equation 22 may be written:

1

2

_

∂ε(ν)

∂ν

_

T

≈ Gν −b. (24)

Therefore, following equation 14, the optimimum optical ﬂow vector is

ν

opt

= G

−1

b. (25)

This expression is valid only if the matrix G is invertible. That is equivalent to saying that the image A(x, y) contains

gradient information in both x and y directions in the neighborhood of the point p.

4

This is the standard Lucas-Kanade optical ﬂow equation, which is valid only if the pixel displacement is small (in

order ot validate the ﬁrst order Taylor expansion). In practice, to get an accurate solution, is is necessary to iterate

multiple times on this scheme (in a Newton-Raphson fashion).

Now that we have introduced the mathematical background, let us give the details of the iterative version of the

algorithm. Recall the goal: ﬁnd the vector ν that minimizes the error functional ε(ν) introduced in equation 13.

Let k be the iterative index, initialized to 1 at the very ﬁrst iteration. Let us describe the algorithm recursively:

at a generic iteration k ≥ 1, assume that the previous computations from iterations 1, 2, . . . , k −1 provide an initial

guess ν

k−1

= [ν

k−1

x

ν

k−1

x

]

T

for the pixel displacement ν. Let B

k

be the new translated image according to that initial

guess ν

k−1

:

∀(x, y) ∈ [p

x

−ω

x

, p

x

+ ω

x

] ×[p

y

−ω

y

, p

y

+ ω

y

],

B

k

(x, y) = B(x + ν

k−1

x

, y + ν

k−1

y

). (26)

The goal is then to compute the residual pixel motion vector η

k

= [η

k

x

η

k

y

] that minimizes the error function

ε

k

(η

k

) = ε(η

k

x

, η

k

y

) =

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

A(x, y) −B

k

(x + η

k

x

, y + η

k

y

)

_

2

. (27)

The solution of this minimization may be computed through a one step Lucas-Kanade optical ﬂow computation

(equation 25)

η

k

= G

−1

b

k

, (28)

where the 2 ×1 vector b

k

is deﬁned as follows (also called image mismatch vector):

b

k

=

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

δI

k

(x, y) I

x

(x, y)

δI

k

(x, y) I

y

(x, y)

_

, (29)

where the k

th

image diﬀerence δI

k

are deﬁned as follows:

∀(x, y) ∈ [p

x

−ω

x

, p

x

+ ω

x

] ×[p

y

−ω

y

, p

y

+ ω

y

],

δI

k

(x, y) = A(x, y) −B

k

(x, y). (30)

Observe that the spatial derivatives I

x

and I

y

(at all points in the neighborhood of p) are computed only once at

the beginning of the iterations following equations 19 and 20. Therefore the 2 × 2 matrix G also remains constant

throughout the iteration loop (expression given in equation 23). That constitutes a clear computational advantage.

The only quantity that needs to be recomputed at each step k is the vector b

k

that really captures the amount of

residual diﬀerence between the image patches after translation by the vector ν

k−1

. Once the residual optical ﬂow η

k

is computed through equation 28, a new pixel displacement guess ν

k

is computed for the next iteration step k + 1:

ν

k

= ν

k−1

+ η

k

. (31)

The iterative scheme goes on until the computed pixel residual η

k

is smaller than a threshold (for example 0.03

pixel), or a maximum number of iteration (for example 20) is reached. On average, 5 iterations are enough to reach

convergence. At the ﬁrst iteration (k = 1) the initial guess is initialized to zero:

ν

0

= [0 0]

T

. (32)

Assuming that K iterations were necessary to reach convergence, the ﬁnal solution for the optical ﬂow vector ν = d

L

is:

ν = d

L

= ν

K

=

K

k=1

η

k

. (33)

This vector minimizes the error functional described in equation 13 (or equation 6). This ends the description of the

iterative Lucas-Kanade optical ﬂow computation. The vector d

L

is fed to equation 7 and this overall procedure is

repeated at all subsequent levels L −1, L −2, . . . , 0 (see section 2.2).

5

2.4 Summary of the pyramidal tracking algorithm

Let is now summarize the entire tracking algorithm in a form of a pseudo-code. Find the details of the equations in

the main body of the text (especially for the domains of deﬁnition).

Goal: Let u be a point on image I. Find its corresponding location v on image J

Build pyramid representations of I and J: {I

L

}

L=0,... ,L

m

and {J

L

}

L=0,... ,L

m

(eqs. 2,3,4)

Initialization of pyramidal guess: g

L

m

= [g

L

m

x

g

L

m

x

]

T

= [0 0]

T

(eq. 8)

for L = L

m

down to 0 with step of -1

Location of point u on image I

L

: u

L

= [p

x

p

y

]

T

= u/2

L

(eq. 5)

Derivative of I

L

with respect to x: I

x

(x, y) =

I

L

(x + 1, y) −I

L

(x −1, y)

2

(eqs. 19,11)

Derivative of I

L

with respect to y: I

y

(x, y) =

I

L

(x, y + 1) −I

L

(x, y −1)

2

(eqs. 20,11)

Spatial gradient matrix: G =

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

I

2

x

(x, y) I

x

(x, y) I

y

(x, y)

I

x

(x, y) I

y

(x, y) I

2

y

(x, y)

_

(eq. 23)

Initialization of iterative L-K: ν

0

= [0 0]

T

(eq. 32)

for k = 1 to K with step of 1 (or until η

k

< accuracy threshold)

Image diﬀerence: δI

k

(x, y) = I

L

(x, y) −J

L

(x + g

L

x

+ ν

k−1

x

, y + g

L

y

+ ν

k−1

y

) (eqs. 30,26,12)

Image mismatch vector: b

k

=

p

x

+ω

x

x=p

x

−ω

x

p

y

+ω

y

y=p

y

−ω

y

_

δI

k

(x, y) I

x

(x, y)

δI

k

(x, y) I

y

(x, y)

_

(eq. 29)

Optical ﬂow (Lucas-Kanade): η

k

= G

−1

b

k

(eq. 28)

Guess for next iteration: ν

k

= ν

k−1

+ η

k

(eq. 31)

end of for-loop on k

Final optical ﬂow at level L: d

L

= ν

K

(eq. 33)

Guess for next level L −1: g

L−1

= [g

L−1

x

g

L−1

y

]

T

= 2 (g

L

+d

L

) (eq. 7)

end of for-loop on L

Final optical ﬂow vector: d = g

0

+d

0

(eq. 9)

Location of point on J: v = u +d

Solution: The corresponding point is at location v on image J

6

2.5 Subpixel Computation

It is absolutely essential to keep all computation at a subpixel accuracy level. It is therefore necessary to be able

to compute image brightness values at locations between integer pixels (refer for example to equations 11,12 and 26).

In order to compute image brightness at subpixel locations we propose to use bilinear interpolation.

Let L be a generic pyramid level. Assume that we need the image value I

L

(x, y) where x and y are not integers.

Let x

o

and y

o

be the integer parts of x and y (larger integers that are smallers than x and y). Let α

x

and α

y

be the

two reminder values (between 0 and 1) such that:

x = x

o

+ α

x

, (34)

y = y

o

+ α

y

. (35)

Then I

L

(x, y) may be computed by bilinear interpolation from the original image brightness values:

I

L

(x, y) = (1 −α

x

) (1 −α

y

) I

L

(x

o

, y

o

) + α

x

(1 −α

y

) I

L

(x

o

+ 1, y

o

) +

(1 −α

x

) α

y

I

L

(x

o

, y

o

+ 1) + α

x

α

y

I

L

(x

o

+ 1, y

o

+ 1).

Let us make a few observations that are associated to subpixel computation (important implementation issues).

Refer to the summary of the algorithm given in section 2.4. When computing the two image derivatives I

x

(x, y) and

I

y

(x, y) in the neigborhood (x, y) ∈ [p

x

−ω

x

, p

x

+ω

x

] ×[p

y

−ω

y

, p

y

+ω

y

], it is necessary to have the brightness values

of I

L

(x, y) in the neighborhood (x, y) ∈ [p

x

−ω

x

−1, p

x

+ ω

x

+ 1] ×[p

y

−ω

y

−1, p

y

+ ω

y

+ 1] (see equations 19, 20).

Of course, the coordinates of the central point p = [p

x

p

y

]

T

are not garanteed to be integers. Call p

x

o

and p

y

o

the

integer parts of p

x

and p

y

. Then we may write:

p

x

= p

x

o

+ p

x

α

, (36)

p

y

= p

y

o

+ p

y

α

, (37)

where p

x

α

and p

y

α

are the associated reminder values between 0 and 1. Therefore, in order to compute the image

patch I

L

(x, y) in the neighborhood (x, y) ∈ [p

x

− ω

x

− 1, p

x

+ ω

x

+ 1] × [p

y

− ω

y

− 1, p

y

+ ω

y

+ 1] through bilinear

interpolation, it is necessary to use the set of original brightness values I

L

(x, y) in the integer grid patch (x, y) ∈

[p

x

o

−ω

x

−1, p

x

o

+ ω

x

+ 2] ×[p

y

o

−ω

y

−1, p

y

o

+ ω

y

+ 2] (recall that ω

x

and ω

y

are integers).

A similar situation occurs when computing the image diﬀerence δI

k

(x, y) in the neigborhood (x, y) ∈ [p

x

−ω

x

, p

x

+

ω

x

] ×[p

y

−ω

y

, p

y

+ ω

y

] (refer to section 2.4). Indeed, in order to compute δI

k

(x, y), it is required to have the values

J

L

(x +g

L

x

+ν

k−1

x

, y +g

L

y

+ν

k−1

y

) for all (x, y) ∈ [p

x

−ω

x

, p

x

+ω

x

] ×[p

y

−ω

y

, p

y

+ω

y

], or, in other words, the values

of J

L

(α, β) for all (α, β) ∈ [p

x

+ g

L

x

+ ν

k−1

x

−ω

x

, p

x

+ g

L

x

+ ν

k−1

x

+ ω

x

] ×[p

y

+ g

L

y

+ ν

k−1

y

−ω

y

, p

y

+ g

L

y

+ ν

k−1

y

+ ω

y

].

Of course, p

x

+ g

L

x

+ ν

k−1

x

and p

y

+ g

L

y

+ ν

k−1

y

are not necessarily integers. Call q

x

o

and q

y

o

the integer parts of

p

x

+ g

L

x

+ ν

k−1

x

and p

y

+ g

L

y

+ ν

k−1

y

:

p

x

+ g

L

x

+ ν

k−1

x

= q

x

o

+ q

x

α

, (38)

p

y

+ g

L

y

+ ν

k−1

y

= q

y

o

+ q

y

α

, (39)

where q

x

α

and q

y

α

are the associated reminder values between 0 and 1. Then, in order to compute the image patch

J

L

(α, β) in the neighborhood (α, β) ∈ [p

x

+g

L

x

+ν

k−1

x

−ω

x

, p

x

+g

L

x

+ν

k−1

x

+ω

x

]×[p

y

+g

L

y

+ν

k−1

y

−ω

y

, p

y

+g

L

y

+ν

k−1

y

+ω

y

],

it is necessary to use the set of original brightness values J

L

(α, β) in the integer grid patch (α, β) ∈ [q

x

o

−ω

x

, q

x

o

+

ω

x

+ 1] ×[q

y

o

−ω

y

, q

y

o

+ ω

y

+ 1].

2.6 Tracking features close to the boundary of the images

It is very useful to observe that it is possible to process points that are close enough to the image boundary to

have a portion of their integration window outside the image. This observation becomes very signiﬁcant as the the

number of pyramid levels L

m

is large. Indeed, if we always enforce the complete (2ω

x

+1) ×(2ω

y

+1) window to be

within the image in order to be tracked, then there is “forbidden band” of width ω

x

(and ω

y

) all around the images

I

L

. If L

m

is the height of the pyramid, this means an eﬀective forbidden band of width 2

L

m

ω

x

(and 2

L

m

ω

y

) around

the original image I. For small values of ω

x

, ω

y

and L

m

, this might not constitute a signiﬁcant limitation, howerver,

this may become very troublesome for large integration window sizes, and more importantly, for large values of L

m

.

Some numbers: ω

x

= ω

y

= 5 pixels, and L

m

= 3 leads to a forbidden band of 40 pixels around the image!

In order to prevent this from happenning, we propose to keep tracking points whose integration windows partially

fall outside the image (at any pyramid level). In this case, the tracking procedure reamins the same, expect that the

summations appearing in all expressions (see pseudo-code of section 2.4) should be done only over the valid portion of

the image neigborhood, i.e. the portion of the neighborhood that have valid entries for I

x

(x, y), I

y

(x, y) and δI

k

(x, y)

7

(see section 2.4). Observe that doing so, the valid summation regions may vary while going through the Lucas-Kanade

iteration loop (loop over the index k in the pseudo-code - Sec. 2.4). Indeed, from iteration to iteration, the valid entry

of the the image diﬀerence δI

k

(x, y) may vary as the translation vector [g

L

x

+ ν

k−1

x

g

L

y

+ ν

k−1

y

]

T

vary. In addition,

observe that when computing the mismatch vector b

k

and the gradient matrix G, the summations regions must be

identical (for the mathematics to remain valid). Therefore, in that case, the G matrix must be recomputed within

the loop at each iteration. The diﬀerential patches I

x

(x, y) and I

y

(x, y) may however be computed once before the

iteration loop.

Of course, if the point p = [p

x

p

y

]

T

(the center of the neighborhood) falls outside of the image I

L

, or if its

corresponding tracked point [p

x

+g

L

x

+ν

k−1

x

p

y

+g

L

y

+ν

k−1

y

] falls outside of the image J

L

, it is reasonable to declare

the point “lost”, and not pursue tracking for it.

2.7 Declaring a feature “lost”

There are two cases that should give rise to a “lost” feature. The ﬁrst case is very intuitive: the point falls outside

of the image. We have discussed this issue in section 2.6. The other case of loss is when the image patch around the

tracked point varies too much between image I and image J (the point disappears due to occlusion). Observe that

this condition is a lot more challenging to quantify precisely. For example, a feature point may be declared “lost” if

its ﬁnal cost function (d) is larger than a threshold (see equation 1). A problem comes in when having decide about

a threshold. It is particularly sensitive when tracking a point over many images in a complete sequence. Indeed,

if tracking is done based on consecutive pairs of images, the tracked image patch (used as reference) is implicitely

initialized at every track. Consequently, the point may drift throughout the extended sequence even if the image

diﬀerence between two consecutive images is very small. This drift problem is a very classical issue when dealing with

long sequences. One approach is to track feature points through a sequence while keeping a ﬁxed reference for the

appearance of the feature patch (use the ﬁrst image the feature appeared). Following this technique, the quantity

(d) has a lot more meaning. While adopting this approach however another problem rises: a feature may be declared

“lost” too quickly. One direction that we envision to answer that problem is to use aﬃne image matching for deciding

for lost track (see Shi and Tomasi in [1]).

The best technique known so far is to combine a traditional tracking approach presented so far (based on image

pairs) to compute matching, with an aﬃne image matching (using a unique reference for feature patch) to decide for

false track. More information regarding this technique is presented by Shi and Tomasi in [1]. We believe that eventu-

ally, this approach should be implemented for rejection. It is worth observing that for 2D tracking itself, the standard

tracking scheme presented in this report performs a lot better (more accurate) that aﬃne tracking alone. The reason

is that aﬃne tracking requires to estimate a very large number of parameters: 6 instead of 2 for the standard scheme.

Many people have made that observation (see for example http://www.stanford.edu/ ssorkin/cs223b/final.html).

Therefore, aﬃne tracking should only be considered for building a reliable rejection scheme on top of the main 2D

tracking engine.

3 Feature Selection

So far, we have described the tracking procedure that takes care of following a point u on an image I to another

location v on another image J. However, we have not described means to select the point u on I in the ﬁrst place. This

step is called feature selection. It is very intuitive to approach the problem of feature selection once the mathematical

ground for tracking is led out. Indeed, the central step of tracking is the computation of the optical ﬂow vector η

k

(see pseudo-code of algorithm in section 2.4). At that step, the G matrix is required to be invertible, or in other

words, the minimum eigenvalue of G must be large enough (larger than a threshold). This characterizes pixels that

are “easy to track”.

Therefore, the process of selection goes as follows:

1. Compute the G matrix and its minimum eigenvalue λ

m

at every pixel in the image I.

2. Call λ

max

the maximum value of λ

m

over the whole image.

3. Retain the image pixels that have a λ

m

value larger than a percentage of λ

max

. This percentage can be 10% or

5%.

4. From those pixels, retain the local max. pixels (a pixel is kept if its λ

m

value is larger than that of any other

pixel in its 3 ×3 neighborhood).

5. Keep the subset of those pixels so that the minimum distance between any pair of pixels is larger than a given

threshold distance (e.g. 10 or 5 pixels).

8

After that process, the remaining pixels are typically “good to track”. They are the selected feature points that

are fed to the tracker.

The last step of the algorithm consisting of enforcing a minimum pairwise distance between pixels may be omited if

a very large number of points can be handled by the tracking engine. It all depends on the computational performances

of the feature tracker.

Finally, it is important to observe that it is not necessary to take a very large integration window for feature

selection (in order to compute the G matrix). In fact, a 3 × 3 window is suﬃcient ω

x

= ω

x

= 1 for selection, and

should be used. For tracking, this window size (3 ×3) is typically too small (see section 1).

References

[1] Jianbo Shi and Carlo Tomasi, “Good features to track”, Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages

593–600, 1994.

9

The central motivation behind pyramidal representation is to be able to handle large pixel motions (larger than the integration window sizes ωx and ωy ). I 3 and I 4 are of respective sizes 320 × 240. x L−1 I L−1 (x. the images I 1 . Observe that in particular.. .2 Pyramidal Feature Tracking Recall the goal of feature tracking: for a given point u in image I. For L = 0. it makes no sense to go above a level 4. x y L−1 L−1 Then. Following our deﬁnition of the pyramid representation equations (2). u0 = u. 160 × 120. = . 2. gL = [gx gy ]T is available from the computations done from level Lm to level L + 1. I L−1 (nL−1 − 1. the reﬁned optical ﬂow is computed at level Lm − 1. Then. x y 2 .. The next section describing the tracking operation in detail we let us understand that concept better. the corresponding coordinates of the point u on the pyramidal images x y I L . 2L (5) The division operation in equation 5 is applied to both coordinates independently (so will be the multiplication operations appearing in subsequent equations). ﬁnd its corresponding location v = u + d in image J.L and J L L=0. in order to compute the optical ﬂow at level L. L L Assume that an initial guess for optical ﬂow at level L.. 2 (3) (4) Equations (2).. deﬁne uL = [uL uL ]. For example. let us deﬁne dummy image values one pixel around the image I L−1 (for 0 ≤ x ≤ nL−1 −1 x and 0 ≤ y ≤ nL−1 − 1): y . 80 × 60 and 40 × 30. y + gy + dL ) x y x=uL −ωx x y=uL −ωy y 2 L L (d ) = (dL . = . the residual ﬂow vector dL = [dL dL ]T is small and therefore easy to compute through a standard Lucas Kanade step. Practical m m values of Lm are 2. That way. or alternatively ﬁnd its pixel displacement vector d (see equation 1). 0).3. In practice however (in the C code) a larger anti-aliasing ﬁlter kernel is used for pyramid construction [1/16 1/4 3/8 1/4 1/16] × [1/16 1/4 3/8 1/4 1/16]T . = . (6) Observe that the window of integration is of constant size (2ωx + 1) × (2ωy + 1) for all values of L. For typical image sizes.. The formalism remains the same. nL−1 ) = x y I L−1 (−1. (3) and (4). I L−1 (x. −1) I L−1 (nL−1 . I L−1 (nL−1 .For simplicity in the notation. Lm . Therefore the pyramid height (Lm ) should also be picked appropriately according to the maximum expected optical ﬂow in the image.L .. Given that initial guess. dL ) x y = . I L−1 (nL−1 − 1. y). y). ny − 1). and the result is propagated to level Lm − 2 and so on up to the level 0 (the original image). the optical ﬂow is comptuted at the deepest pyramid level Lm . Going beyond level 4 does not make much sense in most cases. equation 2 is only deﬁned for values of x and y such that 0 ≤ 2x ≤ nx − 1 and 0 ≤ 2y ≤ ny − 1. nL−1 − 1).. the result of the that computation is propagated to the upper level Lm − 1 in a form of an initial guess for the pixel displacement (at level Lm − 1). y) − J L (x + gx + dL . Final observation: equation 2 suggests that the lowpass ﬁlter [1/4 1/2 1/4]×[1/4 1/2 1/4]T is used for image anti-aliasing before image subsampling. y) I L−1 (x. Therefore.. 2 nL−1 + 1 y . Notice that the initial guess ﬂow vector gL is used to pre-translate the image patch in the second image J.. The value Lm is the height of the pyramid (picked heuristically). I 2 . the vectors uL are computed as follows: uL = u . (3) and (4) are used to construct recursively the pyramidal representations of the two images I and J: I L L=0... it is necessary to ﬁnd the residual pixel displacement vector dL = [dL dL ]T that minimizes the new image matching error function L : x y uL +ωx x L uL +ωy y L L I L (x.4. L L L the width nx and height ny of I are the largest integers that satisfy the two conditions: nL x nL y ≤ ≤ L−1 nx + 1 . Let us now describe the recursive operation between two generics levels L + 1 and L in more mathematical details. The overall pyramidal tracking algorithm proceeds as follows: ﬁrst. Then. = . y) x I L−1 (x. for an image I of size 640 × 480. nL−1 ) y I L−1 (0.

y) is deﬁned over a window of size (2ωx + 3) × (2ωy + 3) instead of (2ωx + 1) × (2ωy + 1). Indeed. Following that new notation. (15) 3 . Observe that this solution may be expressed in the following extended form: Lm (8) (9) d= L=0 2L dL . y) using the central diﬀerence operator (eqs. This vector. νy ) = x=px −ωx y=py −ωy (A(x. For example. y). let us change the name of the displacement vector ν = [νx νy ]T = dL . 19 and 20). The algorithm is initialized by setting the initial guess for level Lm to zero (no initial guess is available at the deepest level of the pyramid): gLm = [0 0]T . Then. we obtain: ∂ε(ν) = −2 ∂ν px +ωx py +ωy = [0 0]. The ﬁnal optical ﬂow solution d (refer to equation 1) is then available after the ﬁnest optical ﬂow computation: d = g0 + d0 . y) are slightly diﬀerent. (10) The clear advantage of a pyramidal implementation is that each residual optical ﬂow vector dL can be kept very small while computing a large overall pixel displacement vector d. (7) The next level optical ﬂow residual vector dL−1 is then computed through the same procedure. ν=ν opt (14) (A(x. L L B(x. computed by optical ﬂow computation (decribed in Section 2. the result of this computation is propagated to the next level L − 1 by passing the new initial guess gL−1 of expression: gL−1 = 2 (gL + dL ). the goal is to ﬁnd the displacement vector ν = [νx νy ]T that minimizes the matching function: px +ωx py +ωy ε(ν) = ε(νx .3 Iterative Optical Flow Computation (Iterative Lucas-Kanade) Let us now describe the core optical ﬂow computation. let us now drop the superscripts L and deﬁne the new images A and B as follows: ∀(x. At the optimimum. px + ωx ] × [py − ωy . while keeping the size of the integration window relatively small. For now. (12) (11) Observe that the domains of deﬁnition of A(x. as well as the image position vector p = [px py ]T = uL . . y) = J L (x + gx . then the overall pixel motion that the pyramidal implementation can handle becomes dmax ﬁnal = (2Lm +1 − 1) dmax . 2. let us assume that this vector is computed (to close the main loop of the algorithm). y + νy )) . the ﬁrst derivative of ε with respect to ν is zero: ∂ε(ν) ∂ν After expansion of the derivative. y) ∈ [px − ωx .The details of computation of the residual optical ﬂow dL will be described in the next section 2. 2 (13) A standard iterative Lucas-Kanade may be applied for that task. y) = I L (x. y) ∈ [px − ωx − 1. the goal is ﬁnding the vector dL that minimizes the matching function L deﬁned in equation 6. y + gy ). this means a maximum pixel displacement gain of 15! This enables large pixel motions. y) − B(x + νx . y + νy )) . For clarity purposes. .3. y) − B(x + νx . px + ωx + 1] × [py − ωy − 1. x=px −ωx y=py −ωy ∂B ∂x ∂B ∂y . ∀(x. This diﬀerence will become clear when computing spatial derivatives of A(x. A(x. py + ωy ]. for a pyramid depth of Lm = 3. y) and B(x.3). minimizes the functional L−1 (dL−1 ) (equation 6). Assuming that each elementary optical ﬂow computation step can handle pixel motions up to dmax . At every level L in the pyramid. py + ωy + 1]. A(x. Since the same type of operation is performed for all levels L. This procedure goes on until the ﬁnest image resolution is reached (L = 0).

∂y 2 (19) (20) In practice. b= px +ωx py +ωy x=px −ωx y=py −ωy δI Ix δI Iy . (23) Then. y) = ∂A(x. Ix (x.Let us now substitute B(x + νx . (18) Observe that the image derivatives Ix and Iy may be computed directly from the ﬁrst image A(x. y) − A(x − 1. px + ωx ] × [py − ωy . . y) − B(x. y) = A(x. (16) Observe that the quatity A(x. ∂x 2 ∂A(x. y) = Iy (x. y) can be interpreted as the temporal image derivative at the point [x y]T : ∀(x. y) A(x + 1. px + ωx ] × [py − ωy . y) − B(x. y − 1) = . Following this new notation. equation 22 may be written: 1 2 ∂ε(ν) ∂ν T ≈ G ν − b. y) = . (21) 1 2 Denote . That is equivalent to saying that the image A(x. (24) Therefore. y) ∈ [px − ωx . y + νy ) by its ﬁrst order Taylor expansion about the point ν = [0 0]T (this has a good chance to be a valid approximation since we are expecting a small displacement vector. ∂B ∂x ∂B ∂y . y) − x=px −ωx y=py −ωy ∂B ∂x ∂B ∂y ν . 4 . py + ωy ]. If a central diﬀerence operator is used for derivative. δI(x. = ∂B ∂x ∂B ∂y T . the optimimum optical ﬂow vector is ν opt = G−1 b. y) (the importance of this observation will become apparent later on when describing the iterative version of the ﬂow computation). (22) py +ωy x=px −ωx y=py −ωy 2 Ix Ix Iy Ix Iy 2 Iy and . following equation 14. G= px +ωx ∂ε(ν) ∂ν py +ωy ≈ x=px −ωx y=py −ωy 2 Ix Ix Iy Ix Iy 2 Iy ν− δI Ix δI Iy . the two derivative images have the following expression: ∀(x. y) contains gradient information in both x and y directions in the neighborhood of the point p. equation 16 may be written: 1 ∂ε(ν) ≈ 2 ∂ν x=p T px +ωx px +ωx x −ωx py +ωy I T ν − δI y=py −ωy IT . y). (25) This expression is valid only if the matrix G is invertible. y + 1) − A(x. y) in the (2ωx + 1) × (2ωy + 1) neighborhood of the point p independently from the second image B(x. the Sharr operator is used for computing image derivatives (very similar to the central diﬀerence operator). thanks to the pyramidal scheme): ∂ε(ν) ≈ −2 ∂ν px +ωx py +ωy A(x. py + ωy ]. y) ∈ [px − ωx . Let’s make a slight change of notation: I= Ix Iy . y) A(x. The matrix ∂B ∂x ∂B ∂y (17) is merely the image gradient vector. y) − B(x.

y) = A(x. Once the residual optical ﬂow η k is computed through equation 28. which is valid only if the pixel displacement is small (in order ot validate the ﬁrst order Taylor expansion). assume that the previous computations from iterations 1. 5 . This ends the description of the iterative Lucas-Kanade optical ﬂow computation.This is the standard Lucas-Kanade optical ﬂow equation. y) − Bk (x. Now that we have introduced the mathematical background. ηy ) = x=px −ωx y=py −ωy py +ωy k k A(x. py + ωy ]. is is necessary to iterate multiple times on this scheme (in a Newton-Raphson fashion). k k The goal is then to compute the residual pixel motion vector η k = [ηx ηy ] that minimizes the error function px +ωx k k εk (η k ) = ε(ηx . δIk (x.2). initialized to 1 at the very ﬁrst iteration. k−1 k−1 Bk (x.03 pixel). py + ωy ]. Let Bk be the new translated image according to that initial k−1 guess ν : ∀(x. (33) This vector minimizes the error functional described in equation 13 (or equation 6). The vector dL is fed to equation 7 and this overall procedure is repeated at all subsequent levels L − 1. the ﬁnal solution for the optical ﬂow vector ν = dL is: K ν = dL = ν K = k=1 ηk . (32) Assuming that K iterations were necessary to reach convergence. . . Let k be the iterative index. 2. Therefore the 2 × 2 matrix G also remains constant throughout the iteration loop (expression given in equation 23). At the ﬁrst iteration (k = 1) the initial guess is initialized to zero: ν 0 = [0 0]T . (27) The solution of this minimization may be computed through a one step Lucas-Kanade optical ﬂow computation (equation 25) η k = G−1 bk . y) δIk (x. y + ηy ) 2 (26) . . y) Ix (x. . y) ∈ [px − ωx . y) = B(x + νx . Recall the goal: ﬁnd the vector ν that minimizes the error functional ε(ν) introduced in equation 13. . (29) where the k th image diﬀerence δIk are deﬁned as follows: ∀(x. px + ωx ] × [py − ωy . px + ωx ] × [py − ωy . . (31) The iterative scheme goes on until the computed pixel residual η k is smaller than a threshold (for example 0. (30) Observe that the spatial derivatives Ix and Iy (at all points in the neighborhood of p) are computed only once at the beginning of the iterations following equations 19 and 20. . a new pixel displacement guess ν k is computed for the next iteration step k + 1: ν k = ν k−1 + η k . 5 iterations are enough to reach convergence. 0 (see section 2. L − 2. where the 2 × 1 vector bk is deﬁned as follows (also called image mismatch vector): px +ωx py +ωy (28) bk = x=px −ωx y=py −ωy δIk (x. or a maximum number of iteration (for example 20) is reached. y). let us give the details of the iterative version of the algorithm. That constitutes a clear computational advantage. . y) ∈ [px − ωx . y + νy ). y) − Bk (x + ηx . On average. k − 1 provide an initial k−1 k−1 guess ν k−1 = [νx νx ]T for the pixel displacement ν. y) Iy (x. y) . Let us describe the algorithm recursively: at a generic iteration k ≥ 1. In practice. The only quantity that needs to be recomputed at each step k is the vector bk that really captures the amount of residual diﬀerence between the image patches after translation by the vector ν k−1 . to get an accurate solution.

y) = Iy (x.26.. Goal: Let u be a point on image I.. y) − J L (x + gx + νx . 33) (eq. y) δIk (x..2. 8) (eq.. y) − I L (x − 1. 29) Optical ﬂow (Lucas-Kanade): Guess for next iteration: end of for-loop on k Final optical ﬂow at level L: Guess for next level L − 1: end of for-loop on L Final optical ﬂow vector: Location of point on J: η k = G−1 bk ν k = ν k−1 + η k (eq. y) Ix (x.4) (eq. y) = I L (x. 20. y + gy + νy ) px +ωx py +ωy (eqs. Find its corresponding location v on image J Build pyramid representations of I and J: {I L }L=0. y) (eq.Lm and {J L }L=0. Find the details of the equations in the main body of the text (especially for the domains of deﬁnition). y) Iy (x. . y) Iy (x.Lm Initialization of pyramidal guess: for L = Lm down to 0 with step of -1 Location of point u on image I L : Derivative of I L with respect to x: Derivative of I L with respect to y: uL = [px py ]T = u/2L Ix (x. y) Iy (x..12) Image mismatch vector: bk = x=px −ωx y=py −ωy δIk (x.11) (eqs. . 7) d = g0 + d0 v =u+d (eq.3. 30. y) L L gLm = [gx m gx m ]T = [0 0]T (eqs. 9) Solution: The corresponding point is at location v on image J 6 . 23) Initialization of iterative L-K: ν 0 = [0 0]T (eq. y) 2 I L (x. 19. 31) dL = ν K L−1 L−1 gL−1 = [gx gy ]T = 2 (gL + dL ) (eq.11) px +ωx Spatial gradient matrix: G= x=px −ωx y=py −ωy (eq.4 Summary of the pyramidal tracking algorithm Let is now summarize the entire tracking algorithm in a form of a pseudo-code. y) = I L (x + 1. 32) for k = 1 to K with step of 1 (or until η k < accuracy threshold) Image diﬀerence: L k−1 L k−1 δIk (x. 2. y + 1) − I L (x. y) Ix (x. 5) (eqs.. 28) (eq. y) Iy (x. y) 2 Ix (x. y − 1) 2 py +ωy 2 Ix (x.

in order to compute δIk (x. y) and Iy (x. px + ωx ] × [py − ωy . py +gy +νy +ωy ]. This observation becomes very signiﬁcant as the the number of pyramid levels Lm is large. py + gy + νy + ωy ]. px + gx + νx + ωx ] × [py + gy + νy − ωy . L it is necessary to use the set of original brightness values J (α. the portion of the neighborhood that have valid entries for Ix (x. the coordinates of the central point p = [px py ]T are not garanteed to be integers. then there is “forbidden band” of width ωx (and ωy ) all around the images I L . β) ∈ [qxo − ωx . yo ) + αx (1 − αy ) I L (xo + 1. py + ωy ]. Then I L (x. If Lm is the height of the pyramid. y) in the neighborhood (x. this may become very troublesome for large integration window sizes. it is necessary to use the set of original brightness values I L (x. and more importantly. y) ∈ [px − ωx . L k−1 L k−1 Of course. A similar situation occurs when computing the image diﬀerence δIk (x. Let xo and yo be the integer parts of x and y (larger integers that are smallers than x and y). Let L be a generic pyramid level. In this case.5 Subpixel Computation It is absolutely essential to keep all computation at a subpixel accuracy level. y). y) in the integer grid patch (x. y) and δIk (x. Some numbers: ωx = ωy = 5 pixels. = q y o + q yα . if we always enforce the complete (2ωx + 1) × (2ωy + 1) window to be within the image in order to be tracked. Call pxo and pyo the integer parts of px and py .6 Tracking features close to the boundary of the images It is very useful to observe that it is possible to process points that are close enough to the image boundary to have a portion of their integration window outside the image. y) in the neigborhood (x. px + ωx ] × [py − ωy . (36) (37) where pxα and pyα are the associated reminder values between 0 and 1.12 and 26). Then. y) ∈ [px − ωx − 1. py + ωy ]. Of course. yo + 1) + αx αy I L (xo + 1. y). this might not constitute a signiﬁcant limitation. y) 7 . y) in the neighborhood (x. in order to compute the image patch k−1 L k−1 L k−1 L k−1 L J L (α. β) in the integer grid patch (α. y) where x and y are not integers. the values k−1 L k−1 L k−1 L k−1 L L of J (α. expect that the summations appearing in all expressions (see pseudo-code of section 2. px + ωx + 1] × [py − ωy − 1. i. it is necessary to have the brightness values of I L (x. y) = (1 − αx ) (1 − αy ) I L (xo . Iy (x. howerver.4). px + ωx ] × [py − ωy . py + ωy ] (refer to section 2.4) should be done only over the valid portion of the image neigborhood. in other words. β) in the neighborhood (α. Refer to the summary of the algorithm given in section 2. and Lm = 3 leads to a forbidden band of 40 pixels around the image! In order to prevent this from happenning. y) ∈ [pxo − ωx − 1. py + ωy + 1] (see equations 19. = py o + py α . Indeed. Therefore. It is therefore necessary to be able to compute image brightness values at locations between integer pixels (refer for example to equations 11. 20). In order to compute image brightness at subpixel locations we propose to use bilinear interpolation. we propose to keep tracking points whose integration windows partially fall outside the image (at any pyramid level). Then we may write: px py = pxo + pxα . pyo + ωy + 2] (recall that ωx and ωy are integers). Let αx and αy be the two reminder values (between 0 and 1) such that: x = xo + αx . px + gx + νx and py + gy + νy are not necessarily integers. py + ωy + 1] through bilinear interpolation. it is required to have the values L k−1 L k−1 J L (x + gx + νx . y) may be computed by bilinear interpolation from the original image brightness values: I L (x. ωy and Lm . qyo + ωy + 1]. Indeed. px +gx +νx +ωx ]×[py +gy +νy −ωy . yo + 1). β) ∈ [px +gx +νx −ωx . β) ∈ [px + gx + νx − ωx . Call qxo and qyo the integer parts of k−1 L k−1 L px + gx + νx and py + gy + νy : L k−1 px + gx + νx L k−1 py + gy + νy = qxo + qxα .e. yo ) + (1 − αx ) αy I L (xo . pxo + ωx + 2] × [pyo − ωy − 1. (38) (39) where qxα and qyα are the associated reminder values between 0 and 1. qxo + ωx + 1] × [qyo − ωy . y) ∈ [px − ωx − 1. y) in the neigborhood (x. When computing the two image derivatives Ix (x. y + gy + νy ) for all (x.2. Assume that we need the image value I L (x. y = yo + αy . px + ωx + 1] × [py − ωy − 1. the tracking procedure reamins the same. y) ∈ [px − ωx . in order to compute the image patch I L (x. 2. β) for all (α. for large values of Lm . this means an eﬀective forbidden band of width 2Lm ωx (and 2Lm ωy ) around the original image I.4. y) ∈ [px − ωx . or. For small values of ωx . (34) (35) Let us make a few observations that are associated to subpixel computation (important implementation issues).

7 Declaring a feature “lost” There are two cases that should give rise to a “lost” feature. this approach should be implemented for rejection. In addition. a feature point may be declared “lost” if its ﬁnal cost function (d) is larger than a threshold (see equation 1). 3 Feature Selection So far. More information regarding this technique is presented by Shi and Tomasi in [1]. 2. Indeed.4). the tracked image patch (used as reference) is implicitely initialized at every track. from iteration to iteration. we have described the tracking procedure that takes care of following a point u on an image I to another location v on another image J.g. Of course. retain the local max. if tracking is done based on consecutive pairs of images. or in other words. While adopting this approach however another problem rises: a feature may be declared “lost” too quickly. the G matrix is required to be invertible. 10 or 5 pixels). the process of selection goes as follows: 1. For example. 2. This percentage can be 10% or 5%. The reason is that aﬃne tracking requires to estimate a very large number of parameters: 6 instead of 2 for the standard scheme. It is very intuitive to approach the problem of feature selection once the mathematical ground for tracking is led out. However. we have not described means to select the point u on I in the ﬁrst place. Observe that this condition is a lot more challenging to quantify precisely. One direction that we envision to answer that problem is to use aﬃne image matching for deciding for lost track (see Shi and Tomasi in [1]). Retain the image pixels that have a λm value larger than a percentage of λmax . aﬃne tracking should only be considered for building a reliable rejection scheme on top of the main 2D tracking engine. pixels (a pixel is kept if its λm value is larger than that of any other pixel in its 3 × 3 neighborhood). Indeed. 2.4). the central step of tracking is the computation of the optical ﬂow vector η k (see pseudo-code of algorithm in section 2. or if its L k−1 L k−1 corresponding tracked point [px + gx + νx py + gy + νy ] falls outside of the image J L . Observe that doing so. the valid summation regions may vary while going through the Lucas-Kanade iteration loop (loop over the index k in the pseudo-code . the minimum eigenvalue of G must be large enough (larger than a threshold). A problem comes in when having decide about a threshold. and not pursue tracking for it. y) may vary as the translation vector [gx + νx gy + νy ]T vary. Following this technique. This drift problem is a very classical issue when dealing with long sequences. the valid entry L k−1 L k−1 of the the image diﬀerence δIk (x. This characterizes pixels that are “easy to track”. the quantity (d) has a lot more meaning. Therefore. The ﬁrst case is very intuitive: the point falls outside of the image. The diﬀerential patches Ix (x. At that step.stanford. This step is called feature selection. the summations regions must be identical (for the mathematics to remain valid). with an aﬃne image matching (using a unique reference for feature patch) to decide for false track. 4. The best technique known so far is to combine a traditional tracking approach presented so far (based on image pairs) to compute matching. Many people have made that observation (see for example http://www. Compute the G matrix and its minimum eigenvalue λm at every pixel in the image I. We have discussed this issue in section 2. Keep the subset of those pixels so that the minimum distance between any pair of pixels is larger than a given threshold distance (e. the G matrix must be recomputed within the loop at each iteration.html). it is reasonable to declare the point “lost”. in that case. the standard tracking scheme presented in this report performs a lot better (more accurate) that aﬃne tracking alone. Therefore. It is worth observing that for 2D tracking itself. Call λmax the maximum value of λm over the whole image. It is particularly sensitive when tracking a point over many images in a complete sequence. Indeed.Sec. We believe that eventually. 8 . From those pixels. 5. y) and Iy (x.(see section 2.edu/ ssorkin/cs223b/final. y) may however be computed once before the iteration loop. 3. Consequently. The other case of loss is when the image patch around the tracked point varies too much between image I and image J (the point disappears due to occlusion). if the point p = [px py ]T (the center of the neighborhood) falls outside of the image I L .4). the point may drift throughout the extended sequence even if the image diﬀerence between two consecutive images is very small. observe that when computing the mismatch vector bk and the gradient matrix G. One approach is to track feature points through a sequence while keeping a ﬁxed reference for the appearance of the feature patch (use the ﬁrst image the feature appeared).6. Therefore.

It all depends on the computational performances of the feature tracker. it is important to observe that it is not necessary to take a very large integration window for feature selection (in order to compute the G matrix). Conf. 9 . References [1] Jianbo Shi and Carlo Tomasi. Finally. pages 593–600. IEEE Comput. They are the selected feature points that are fed to the tracker. The last step of the algorithm consisting of enforcing a minimum pairwise distance between pixels may be omited if a very large number of points can be handled by the tracking engine. Vision and Pattern Recogn. and should be used. this window size (3 × 3) is typically too small (see section 1).After that process. 1994. In fact. Soc. the remaining pixels are typically “good to track”. Proc.. For tracking. Comput. “Good features to track”. a 3 × 3 window is suﬃcient ωx = ωx = 1 for selection.