You are on page 1of 41

Stereopsis-III Motion -I

Epipolar Geometry for Parallel Cameras

Ol T

Or f el er

P Epipoles are at infinity Epipolar lines are parallel to the baseline

We can always achieve this geometry with image rectification

Image Reprojection
reproject image planes onto common plane parallel to line between optical centers
Notice, only focal point of camera really matters

(Seitz)

Stereo Correspondences

Left scanline

Right scanline

Dynamic Programming: Assume ordering constraint: a matched pair means no other matches can occur to their left on either scanline

Possibilities at each pixel pair


At any pixel all possibilities are to the right pixels on the left have already been matched Thus any global optimal match will contain a local optimal match Match pixel pair Goodness of match determined by some cost-function Or, we cannot match pixels because:
1. Occlusion in the right image 2. Occlusion in the left image

Possibilities at each match Const.

Search Over Correspondences


Occluded Pixels Left scanline

Right scanline Disoccluded Pixels

Three cases:
Sequential add cost of match (small if intensities agree) Occluded add cost of no match (large cost) Disoccluded add cost of no match (large cost)

Stereo Matching with Dynamic Programming


Occluded Pixels Left scanline Dynamic programming yields the optimal path through grid. This is the best set of matches that satisfy the ordering constraint

Start

Dis-occluded Pixels

Right scanline
End

Dynamic Programming
i =1
States: 1
12 22
32

i=2 i=3

Ct 1

Ct

Ct +1

Suppose cost can be decomposed into stages:

ij = Cost of going from state i to state j

Dynamic Programming
i =1 i=2 i=3
1
12 22 32

j=2

Ct 1

Ct

Ct +1

Principle of Optimality for an n-stage assignment problem:

Ct ( j ) = min i ( ij + Ct 1 (i ))

Algorithm
Decide which cost is minimum Start from end

Structure-from-Motion
Determining the 3-D structure of the world, and/or the motion of a camera using a sequence of images taken by a moving camera.
Equivalently, we can think of the world as moving and the camera as fixed.

Like stereo, but the position of the camera isnt known (and its more natural to use many images with little motion between them, not just two with a lot of motion).
We may or may not assume we know the parameters of the camera, such as its focal length.

Structure-from-Motion
As with stereo, we can divide problem:
Correspondence. Reconstruction.

Again, well talk about reconstruction first.


Assume that each image contains some points, we know which points match which

Cases to consider
Moving camera (moving human) Moving objects in the scene General case: Both are moving

We must relate motion in the real world with motion in images

Structure-from-Motion

Reconstruction
A lot harder than with stereo. Start with simpler case: scaled orthographic projection (weak perspective).
Recall, in this we remove the z coordinate and scale all x and y coordinates the same amount. We will consider a fixed camera and a moving scene

First: Represent motion


Well talk about a fixed camera, and moving object. Key point: Some matrix Points

x1 y1 P = z1 1

x2 xn y2 . . . yn z2 zn 1 1

r1,1 r1,2 S = r2,1 r2,2 r 3,1 r3,2

r1,3 t x r2,3 t y r3,3 t z

The image (ignoring z) X X . . . X I = Y Y Y


1 2 n 1 2 n

Then:

I = SP

Objects represented as a moving set of points These are projected into the image Projection is represented with Matrix multiplication take inner product between each row of S and each point. First row of S produces X coordinates, while second row produces Y. Projection model is scaled orthographic here from now on we ignore third row of S. Translation occurs with tx and ty; tz does not matter Scaling encoded with a scale factor in S. The rest of S accounts for rotation

Examples: S = [s, 0, 0, 0; 0, s, 0, 0]; This is just projection, with scaling by s. S = [s, 0, 0, s*tx; 0, s, 0, s*ty]; This is translation by (tx,ty,something), projection, and scaling.

Structure-from-Motion
S encodes:
Projection: only two lines Scaling, since S can have a scale factor. Translation, by tx and ty. Rotation:

I = SP

Rotation
r1,1 r2,1 r 3,1 r1,3 r2,3 P r3,3

r1, 2 r2, 2 r3, 2

Represents a 3D rotation of the points in P.

First, look at 2D rotation (easier)


Matrix R acts on points by rotating them.

cos R= sin
x2 y2 . .

sin cos
. xn yn

cos sin

sin x1 cos y1

Also, RRT = Identity. RT is also a rotation matrix, in the opposite direction to R.

Why does multiplying points by R rotate them? rows of R are basis vectors of a new coordinate system Taking inner products of each point with these expresses that point in that coordinate system. This means rows of R must be orthonormal vectors (orthogonal unit vectors). points (1,0) and (0,1) go to (cos , -sin ), and (sin , cos ). They remain orthonormal, and rotate clockwise by theta. Any other point, (a,b) can be written as a(1,0) + b(0,1). So R(a(1,0)+b(0,1) = Ra(1,0) + Ra(0,1) = aR(1,0) + bR(0,1). So its in the same position relative to the rotated coordinates that it was in before rotation relative to the x, y coordinates. That is, its rotated.

Simple 3D Rotation
cos sin 0 sin cos 0 0 x 0 y 1 z x y z . . . x y z
n n n

Rotation about z axis. Rotates x,y coordinates. Leaves z coordinates fixed.

Full 3D Rotation
cos R = sin 0 sin cos 0 0 cos 0 0 sin 1 0 sin 1 0 1 0 0 cos 0 sin 0 cos 0 sin cos

Any rotation can be expressed as combination of three rotations about three axes.

1 0 0 RR = 0 1 0 0 0 1
T

Rows (and columns) of R are orthonormal vectors. R has determinant 1 (not -1).

3D rotations can be expressed as 3 separate rotations about fixed axes. Rotations have 3 degrees of freedom; two describe an axis of rotation, and one the amount. Order of rotations matter Typically rotations described using Euler angles Rotations preserve the length of a vector, and the angle between two vectors. Axes, (1,0,0), (0,1,0), (0,0,1) must remain orthonormal after rotation. After rotation, they are the three columns of R. So these columns must be orthonormal vectors for R to be a rotation. Same reasoning as 2D tells us all other points rotate too. If R has determinant -1, then R is a rotation plus a reflection.

Putting it Together
Scale

1 0 0 t r 1 0 0 s 0 1 0 t r 0 1 0 0 0 1 t 0 Projection
x y z

3D Translation r

1,1

2 ,1

r r r

1, 2

2,2

r r r

1, 3

2,3

3 ,1

3, 2

3,3

0 0 P 0 1

3D Rotation

s s s s where
1,1 2 ,1 1,1 1, 2

1, 2

s s

1, 3

2,2

2,3

st P st
x y

(s , s , s ) (s , s , s ) = 0
1, 3 2 ,1 2,2 2,3

We can just write stx as tx and sty as ty.

(s , s , s ) = (s , s , s )
1,1 1, 2 1, 3 2 ,1 2,2 2,3

Affine Structure from Motion

s s t s P s s s t where (s , s , s ) (s , s , s ) = 0
1,1 1, 2 1, 3 x 2 ,1 2,2 2,3 y 1,1 1, 2 1, 3 2 ,1 2,2 2,3

(s , s , s ) = (s , s , s )
1,1 1, 2 1, 3 2 ,1 2,2 2,3

Affine Structure-from-Motion: Two Frames (1)

u v u v

1 1 1

u v u v

1 2 1 2 2 2 2

1 2

1 2

u s v s = s u v s
1 n 1 n 2 n 2 n

1,1 1

2 ,1 2

1,1 2

s s s s

1, 2 1

2,2 2

1, 2 2 2,2

s s s s

1, 3 1

2,3 2

1, 3 2 2,3

2 ,1

x y z t 1 t t t
1 x 1 y 2 x 2 y

x y

z 1

x y z 1
n n n

Affine Structure-from-Motion: Two Frames (2)

To make things easy, suppose:

x y z 1

y z 1

y z 1

x 0 y 0 = z 0 1 1
4 4 4

1 0 0 0 1 0 0 0 1 1 1 1

Affine Structure-from-Motion: Two Frames (3)


u v u v
1 1 1

u v u v

1 2 1 2 2 2 2

1 2

1 2

u s v s = s u v s
1 n 1 n 2 n 2 n

1,1 1

s s s s

1, 2 1 2,2 2

s s s s

1, 3 1 2,3 2

2 ,1 2

1,1 2

1, 2 2 2,2

1, 3 2 2,3

2 ,1

t x t y t z t 1
1 x 1 y 2 x 2 y

z 1

x y z 1
n n n

Looking at the first four points, we get:

u v u v

1 1

1 2

1 2

u v u v

1 2 1 2 2 2 2

u v u v

1 3 1

3 2

3 2

u v u

s s =s v s
1 4 1 4 2 4 2 4

1,1 1

2 ,1 2

1,1 2

2 ,1

s s s s

1, 2 1

2,2 2

1, 2 2

2,2

s s s s

1, 3 1

2,3 2

1, 3 2

2,3

t t t t

1 x 1 y 2 x 2 y

0 0 0 1

1 0 0 1

0 1 0 1

0 0 1 1

Affine Structure-from-Motion: Two Frames (4) u u u u s s s t 0 1 0 0 v v v v s s s t 0 0 1 0 = u u u u s s s t 0 0 0 1 1 1 1 1 v v v v s s s t


1 1 1 1 1 1 1 1 1 2 3 4 1,1 1 1, 2 1 1, 3 1 x 1 1 1 1 1 1 2 3 4 2 ,1 2 2, 2 2 2, 3 2 y 2 2 2 2 2 1 2 3 4 1,1 2 1, 2 2 1, 3 2 x 2 2 2 2 2 1 2 3 4 2 ,1 2, 2 2, 3 y

We can solve for motion by inverting matrix of points. Or, explicitly, we see that first column on left (images of first point) give the translations. After solving for these, we can solve for the each column of the s components of the motion using the images of each point, in turn.

Affine Structure-from-Motion: Two Frames (5)

u s v s = u s v s
1 k 1 k 2 k 2 k

1,1 1

1, 2 1

1, 3 1

2 ,1 2

2, 2 2

2, 3 2

1,1 2

2 ,1

s s

1, 2 2

2, 2

s s

1, 3 2

2, 3

t a t b t c t 1
1 x k 1 y k 2 x k 2 y

Once we know the motion, we can use the images of another point to solve for the structure. We have four linear equations, with three unknowns.

Affine Structure-from-Motion: Two Frames (6) Suppose we just know where the kth point is in image 1.

u s s s t a v s s s t b = ? s s s t c ? s s s t 1 Then, we can use the first two equations to write a and b as linear
1 1 1 1 1 k 1,1 1 1, 2 1 1, 3 1 x k 1 1 k 2,1 2 2, 2 2 2, 3 2 y k 2 1,1 2 1, 2 2 1, 3 2 x k 2 2,1 2, 2 2, 3 y
k k

in ck. The final two equations lead to two linear equations in the missing values and ck. If we eliminate ck we get one linear equation in the missing values. This means the unknown point lies on a known line. That is, we recover the epipolar constraint. Furthermore, these lines are all parallel.

Affine Structure-from-Motion: Two Frames (7)


But, what if the first four points arent so simple? Then we define A so that:

x y A z 1
1 1 1

x x x 0 y y y 0 = z z z 0 1 1 1 1
2 3 4 2 3 4 2 3 4

1 0 0 0 1 0 0 0 1 1 1 1

This is always possible as long as the points arent coplanar.

Affine Structure-from-Motion: Two Frames (8)


Then, given:
u v u v
1 1 1

u v u v

1 2 1 2 2 2 2

1 2

1 2

u s v s = s u s v
1 n 1 n 2 n 2 n

1,1 1

2 ,1 2

s s s s

1, 2 1

2,2 2

s s s s

1, 3 1

2,3 2

1,1 2

1, 2 2

1, 3 2

2 ,1

2,2

2,3

t x t y t z 1 t
1 x 1 y 2 x 2 y

x y

z 1

x y z 1
n n n

We have:

u v u v
u v u v
1 1 1

1 1 1

1 2

u v u v
u v u v
1 2 1 2 2 2 2 2

1 2 1 2 2 2 2

1 2

u s v s = s u v s
1 n 1 n 2 n 2 n
1 1 n 1,1 1 1

1,1 1

2 ,1 2

1,1 2

s s s s
1 1, 2 1

1, 2 1

2,2 2

1, 2 2 2,2

s s s s
1 1, 3 1

1, 3 1

2,3 2

1, 3 2 2,3

2 ,1

x y A A z 1 t t t t
1 x 1 y 1 2 x 2 y
1 x 1

x y z

x y z

1
. . .

1
n n n
n

And:

1 2

1 2

u s v s = s u v s
n 2 n 2 n

s s s s

s s s s

2 ,1 2

2,2 2

2,3 2

1,1 2

1, 2 2

1, 3 2

2 ,1

2,2

2,3

t t A t t
y 2 x 2 y

0 0 0 1

1 0 0 0 1 0 0 0 1 1 1 1

x y z 1
n n

Affine Structure-from-Motion: Two Frames (9)


u v u v
1 1 1

u v u v

1 2 1 2 2 2 2

Given:

1 2

1 2

u s v s = u s v s
1 n 1 n 2 n 2 n

1,1 1

s s s s

1, 2 1 2,2 2

s s s s

1, 3 1 2,3 2

2 ,1 2

1,1 2

1, 2 2

1, 3 2

2 ,1

2,2

2,3

t t A t t
1 x 1 y 2 x 2 y

0 0 0 1

1 0 0 0 1 0 0 0 1 1 1 1

x y z 1
n n n

Then we just pretend that:

s s s s

1,1 1

2 ,1 2

1,1 2

s s s s

1, 2 1

2,2 2

1, 2 2 2,2

s s s s

1, 3 1

2,3 2

1, 3 2 2,3

2 ,1

A t t t t
1 x 1 y 2 x 2 y

is our motion, and solve as before.

Affine Structure-from-Motion: Two Frames (10)


This means that we can never determine the exact 3D structure of the scene. We can only determine it up to some transformation, A. Since if a structure and motion explains the points:

u v u v
1

1 1

1 2

1 2

u . . . u s v v s = s u u v v s
1 1 2 n 1 1 2 n 2 2 2 n 2 2 2 n

1,1 1

s s s s
1 n

1, 2 1

s s s s
1 1,1 1

1, 3 1

2 ,1 2

2, 2 2

2, 3 2

1,1 2

1, 2 2

1, 3 2

2 ,1

2, 2

2, 3

t x t y t z t 1
1 x 1 1 y 1 2 x 1 2 y
1 1, 2 1

x . . . x y y
2 n 2 n

z 1
2

z 1
n
1

So does another image of the form:

u v u v

1 1

u v u v

1 2 1 2 2 2 2

1 2

1 2

u s v s = u s v s
1 n 2 n 2 n

2 ,1 2

s s

2,2 2

s s

1, 3 1

2,3 2

1,1 2

2 ,1

s s

1, 2 2

2,2

s s

1, 3 2

2,3

t t A t t
1 x 1 y 2 x 2 y

x y A z 1

x y

z 1

x y z 1
n n n

Affine Structure-from-Motion: Two Frames (11) u u . . . u s s s t x x . . .


v v u u v v
1 1 1 1 2 1 2 1 1 2 1 2 2 2 2

v s = s u s v
1 n 1 n 2 n 2 n

1,1 1

1, 2 1

1, 3 1

2,1 2

1,1 2

s s s

2, 2 2

1, 2 2

s s s

2, 3 2

1, 3 2

2,1

2, 2

2, 3

t y y A A t z z t 1 1
1 x 1 2 1 y 1 1 2 2 x 1 2 2 y

x y z 1
n n n

Note that A has the form: A corresponds to translation of the points, plus a linear transformation.

a a a a a a a a A= a a a a 0 0 0 1
1,1 1, 2 1, 3 1, 4 2 ,1 2, 2 2, 3 2, 4 3, 1 3, 2 3, 3 3, 4

For example, there is clearly a translational ambiguity in recovering the points. We cant tell the difference between two point sets that are identical up to a translation when we only see them after they undergo an unknown translation. Similarly, theres clearly a rotational ambiguity. The rest of the ambiguity is a stretching in an unknown direction.

You might also like