Perception II:
Fundamentals of Computer Vision
Autonomous Mobile Robots
Margarita Chli
Roland Siegwart and Nick Lawrance
www.v4rl.ethz.ch
[Figure: the see-think-act cycle — Sensing (raw data) → Information Extraction (Perception) → environment model / local map → path → Path Execution / Motion Control → actuator commands → Acting on the Real World Environment]
Today’s Topics
§ Perspective Projection
§ Camera Calibration
Vision
§ Vision is our most powerful sense in aiding our perception of the 3D world around us.
→ a large proportion of our brain power is dedicated to processing the signals from our eyes
Cameras:
§ Vision is increasingly popular as a sensing modality:
§ descriptive
§ compactness, compatibility, …
§ low cost
§ HW advances necessary to support the processing of images
[Figure: scene-understanding example — an outdoor image with regions labeled sky, roof, tower, building, window, door, people, plant, car, shade, ground]
§ Partial 3D model from overlapping images (Enqvist et al., Tech Rep 2011)
§ Dense 3D surface model from more overlapping images (Goesele et al., ICCV 2007)
§ Face recognition (www.dailymail.co.uk)
[Figure: light from an object falling onto a photoreceptive surface, without and with a pinhole barrier in between]
§ Image is inverted
§ Depth of the room (box) is the effective focal length
§ Pinhole model:
§ Captures beam of rays – all rays through a single point (note: no lens!)
§ The point is called Center of Projection or Optical Center
§ The image is formed on the Image Plane
§ We will use the pinhole camera model to describe how the image is formed
Based on slide by Steve Seitz
What can we do to reduce the blur? (www.debevec.org/Pinhole/)
§ Making the pinhole bigger (i.e. aperture) makes the image blurry
[Figure: converging Lens imaging an object — Optical Center, Optical Axis, Focal Point, Focal Length f]
[Figure: thin-lens imaging — object of height A at distance z, image of height B formed at distance e behind the lens, focal length f, Focal Point]
§ Similar triangles: $\dfrac{B}{A} = \dfrac{e}{z}$
§ What happens if z is very big, i.e. z >> f ?
§ We need to adjust the image plane s.t. objects at infinity are in focus
$\dfrac{1}{f} = \dfrac{1}{z} + \dfrac{1}{e}$
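A quick numeric check of this thin-lens relation (a minimal sketch in Python; the focal length and object distances are made-up illustrative values, not lecture data):

```python
# Thin-lens relation: 1/f = 1/z + 1/e  =>  e = 1 / (1/f - 1/z)
f = 0.05  # focal length [m] (illustrative, e.g. a 50 mm lens)

for z in [0.5, 2.0, 100.0]:          # object distances [m]
    e = 1.0 / (1.0 / f - 1.0 / z)    # image distance [m]
    print(f"z = {z:6.1f} m  ->  e = {e * 1000:.2f} mm")
# As z grows, 1/z -> 0 and e -> f: objects at infinity are in focus at e = f.
```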
[Figure: object of height h at distance z imaged to height h' at distance e behind the Optical Center (Center of Projection) C; Focal Point at distance f; Image plane]
§ What happens if z is very big, i.e. z >> f ?
§ We adjust the image plane s.t. objects at infinity are in focus
$\dfrac{1}{f} = \dfrac{1}{z} + \dfrac{1}{e} \;\;\Rightarrow\;\; \dfrac{1}{f} \approx \dfrac{1}{e} \;\;\Rightarrow\;\; f \approx e \qquad \left(\text{since } \dfrac{1}{z} \approx 0\right)$
§ The dependence of the apparent size of an object on its depth (i.e. distance from
camera) is known as perspective.
$\dfrac{h'}{h} = \dfrac{f}{z} \;\;\Rightarrow\;\; h' = \dfrac{f}{z}\,h$
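A tiny worked example of this perspective scaling (again a sketch with made-up numbers, not lecture data):

```python
# Apparent image size h' = (f / z) * h for a pinhole camera.
f = 0.05   # focal length [m] (illustrative)
h = 1.7    # object height [m], e.g. a person

for z in [2.0, 4.0, 8.0]:            # object depths [m]
    h_img = f / z * h                # image-plane height [m]
    print(f"z = {z:4.1f} m  ->  h' = {h_img * 1000:.1f} mm")
# Doubling the depth halves the apparent size: the essence of perspective.
```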
Perspective effects
[Figure: the “Ames room” perspective illusion]
[Figure: vanishing geometry in an image — Vertical vanishing point (at infinity), Vanishing line, and two Vanishing points]
Perspective projection
§ For convenience: the image plane is usually represented in front of C, such that the
image preserves the same orientation (i.e. not flipped)
§ A camera does not measure distances but angles → a camera is a “bearing sensor”
[Figure: camera model — optical axis $Z_c$, camera axes $X_c$, $Y_c$, Image plane (CCD) at distance f, principal point O, optical center C (= center of the lens), 3D point P projecting to image point p with pixel coordinates (u, v) and bearing angle ϕ]
§ The camera point $P_c = (X_c, Y_c, Z_c)^T$ projects to $p = (x, y)$ on the image plane
§ To convert p from the local image-plane coordinates (x, y) to the pixel coordinates (u, v), we need to account for:
§ the pixel coordinates of the camera optical center $O = (u_0, v_0)$
§ scale factors $k_u$, $k_v$ for the pixel size in both dimensions
§ So: $u = u_0 + k_u x \;\Rightarrow\; u = u_0 + \dfrac{k_u f X_c}{Z_c}$ and $v = v_0 + k_v y \;\Rightarrow\; v = v_0 + \dfrac{k_v f Y_c}{Z_c}$
[Figure: image plane with origin (0, 0), pixel axes u, v, principal point $O = (u_0, v_0)$, and image point p]
§ Use homogeneous coordinates for a linear mapping from 3D to 2D, by introducing an extra element (scale):
$p = \begin{pmatrix} u \\ v \end{pmatrix}, \qquad \tilde p = \begin{bmatrix} \lambda u \\ \lambda v \\ \lambda \end{bmatrix} = \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$, and similarly for the world coordinates. Note: usually $\lambda = 1$.
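Putting the pixel-coordinate mapping and the homogeneous representation together, here is a minimal NumPy sketch of projecting a camera-frame point to pixel coordinates; the intrinsic values in K are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Illustrative intrinsics: alpha_u = k_u * f, alpha_v = k_v * f.
K = np.array([[800.0,   0.0, 320.0],    # [alpha_u,    0,  u0]
              [  0.0, 800.0, 240.0],    # [      0, alpha_v, v0]
              [  0.0,   0.0,   1.0]])

P_c = np.array([0.5, -0.2, 4.0])        # 3D point in the camera frame (X_c, Y_c, Z_c)

p_tilde = K @ P_c                       # homogeneous pixel point (lambda*u, lambda*v, lambda)
u, v = p_tilde[:2] / p_tilde[2]         # dividing by lambda = Z_c gives pixel coordinates
print(u, v)                             # u = u0 + alpha_u * X_c / Z_c,  v = v0 + alpha_v * Y_c / Z_c
```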
Radial Distortion
[Figure: lens radial distortion and its parameter]
Camera Calibration
§ To estimate the 11 unknowns, we need at least 6 points to calibrate the camera → solved using linear least squares
$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$
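A minimal sketch of this linear (DLT-style) estimation of the 3×4 projection matrix, assuming n ≥ 6 world–image correspondences are already available; function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def estimate_projection_matrix(world_pts, image_pts):
    """Estimate the 3x4 projection matrix M from n >= 6 point correspondences.

    world_pts: (n, 3) array of (Xw, Yw, Zw); image_pts: (n, 2) array of (u, v).
    Each correspondence contributes two linear equations in the 12 entries of M.
    """
    rows = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Pw = [X, Y, Z, 1.0]
        rows.append([*Pw, 0.0, 0.0, 0.0, 0.0, *(-u * p for p in Pw)])
        rows.append([0.0, 0.0, 0.0, 0.0, *Pw, *(-v * p for p in Pw)])
    A = np.asarray(rows)                 # (2n, 12) homogeneous system A m = 0
    _, _, Vt = np.linalg.svd(A)          # linear least-squares solution: right singular
    M = Vt[-1].reshape(3, 4)             # vector of the smallest singular value
    return M / np.linalg.norm(M[2, :3])  # fix the arbitrary scale (11 unknowns)
```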
§ what we obtained: the 3 x 4 projection matrix,
what we need: its decomposition into the camera calibration matrix K, and the rotation R and position T of
the camera.
§ Use QR factorization (in its RQ form) to decompose the 3×3 submatrix ($m_{11}$ … $m_{33}$) into the product of an upper-triangular matrix K and a rotation matrix R (orthogonal matrix)
§ The translation T can subsequently be obtained by: $T = K^{-1} \begin{bmatrix} m_{14} \\ m_{24} \\ m_{34} \end{bmatrix}$
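A possible sketch of this decomposition step, assuming SciPy is available for the RQ factorization; the sign fix-up is one common convention rather than something prescribed by the slides:

```python
import numpy as np
from scipy.linalg import rq   # assumption: SciPy is available

def decompose_projection_matrix(M):
    """Split M = K [R | T] into intrinsics K, rotation R, and translation T."""
    K, R = rq(M[:, :3])                  # upper-triangular K times orthogonal R
    S = np.diag(np.sign(np.diag(K)))     # sign fix-up so that diag(K) > 0
    K, R = K @ S, S @ R                  # S @ S = I, so the product K @ R is unchanged
    T = np.linalg.inv(K) @ M[:, 3]       # T = K^{-1} * (m14, m24, m34)^T
    # A full implementation would also ensure det(R) = +1 (flip signs if needed).
    return K / K[2, 2], R, T
```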
§ From a single image, we can only deduce the ray along which each image point lies
[Figure: a 3D Object observed from a Left and a Right Center of projection]
§ Observe the scene from 2 different viewpoints → solve for the intersection of the rays and recover the 3D structure
§ Stereo vision: using 2 cameras with known relative position T and orientation R, recover the 3D scene information
§ Stereopsis: the brain allows us to perceive the left and right retinal images as a single 3D image
§ Observe the image disparity
§ An ideal, simplified case: assume both cameras are identical and are aligned with
the x-axis
$P = (X_P, Y_P, Z_P)$
From similar triangles:
$\dfrac{f}{Z_P} = \dfrac{u_l}{X_P}$ and $\dfrac{f}{Z_P} = \dfrac{u_r}{X_P - b}$ $\;\;\Rightarrow\;\; Z_P = \dfrac{b f}{u_l - u_r}$
[Figure: rectified stereo pair — optical centers $C_l$, $C_r$ separated by baseline b, focal length f, projections $u_l$, $u_r$ of point P]
§ Disparity: difference in image location of the projection of a 3D point in two image planes
§ Baseline: distance between the optical centers of the two cameras
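A quick numeric check of the depth-from-disparity relation above (baseline, focal length, and disparities are illustrative values):

```python
# Z_P = b * f / (u_l - u_r) for an ideal, axis-aligned stereo pair.
b = 0.12          # baseline [m] (illustrative)
f = 800.0         # focal length [pixels] (illustrative)

for disparity in [40.0, 20.0, 10.0]:     # u_l - u_r [pixels]
    Z = b * f / disparity
    print(f"disparity = {disparity:5.1f} px  ->  depth Z = {Z:.2f} m")
# Depth is inversely proportional to disparity: far points have small disparity.
```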
§ Disparity: the difference in image location of corresponding 3D points as projected to the left
& right images
$P_w = (X_w, Y_w, Z_w)$
§ To estimate the 3D position of $P_w$, we construct the system of equations of the left and right camera, setting the world frame to coincide with the left camera frame (the right camera is at relative pose (R, T)):

Left camera: $\tilde p_l = \lambda_l \begin{bmatrix} u_l \\ v_l \\ 1 \end{bmatrix} = K_l \begin{bmatrix} I & 0 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

Right camera: $\tilde p_r = \lambda_r \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix} = K_r \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

§ Triangulation: the problem of determining the 3D position of a point, given a set of corresponding image locations & known camera poses.
§ Goal: identify image regions / patches in the left & right images corresponding to the same scene structure
§ Typical similarity measures: Normalized Cross-Correlation (NCC), Sum of Squared Differences (SSD), Sum of Absolute Differences (SAD), … (a small sketch of these follows below)
§ Exhaustive image search can be computationally very expensive!
Can we search for correspondences more efficiently?
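As a side note to the similarity measures listed above, here is a minimal sketch of SSD, SAD, and (zero-mean) NCC for two equally sized image patches given as NumPy arrays; this is illustrative code, not from the lecture:

```python
import numpy as np

def ssd(a, b):
    """Sum of Squared Differences: lower means more similar."""
    return np.sum((a.astype(float) - b.astype(float)) ** 2)

def sad(a, b):
    """Sum of Absolute Differences: lower means more similar."""
    return np.sum(np.abs(a.astype(float) - b.astype(float)))

def ncc(a, b):
    """Zero-mean Normalized Cross-Correlation in [-1, 1]: higher means more similar."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    return np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```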
[Figure: epipolar geometry — camera centers $C_L$, $C_R$ and epipoles $e_L$, $e_R$]
§ Thanks to the epipolar constraint, corresponding points can be searched for along epipolar lines
⇒ computational cost reduced to 1 dimension!
Epipolar Rectification
§ Focal lengths
§ Lens Distortion
§ Translation
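In practice, epipolar rectification is usually done with an existing library. A hedged sketch using OpenCV, assuming the intrinsics K1, K2, distortion coefficients d1, d2, and the relative pose (R, T) between the cameras are already known (names are illustrative):

```python
import cv2

def rectify_pair(img_l, img_r, K1, d1, K2, d2, R, T):
    """Warp a stereo pair so that epipolar lines become horizontal scanlines."""
    size = img_l.shape[1], img_l.shape[0]          # (width, height)
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, size, R, T)
    map_l = cv2.initUndistortRectifyMap(K1, d1, R1, P1, size, cv2.CV_32FC1)
    map_r = cv2.initUndistortRectifyMap(K2, d2, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map_l[0], map_l[1], cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map_r[0], map_r[1], cv2.INTER_LINEAR)
    return rect_l, rect_r
```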
Triangulation
• Given the projections p1 and p2 of a 3D point P in two or more images (with known camera
matrices R and T), find the coordinates of the 3D point
[Figure: cameras at $C_1$ and $C_2$, related by [R | T], observing the unknown 3D point P (= ?) through its projections $p_1$ and $p_2$]
Triangulation
• We want to intersect the two visual rays corresponding to p1 and p2, but because of noise and
numerical errors, they don’t meet exactly
• Find the shortest segment connecting the two viewing rays, and let P be the midpoint of that segment (see the sketch below)
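A small sketch of this midpoint method for two 3D viewing rays, written directly from the geometry; the ray origins c1, c2 and directions d1, d2 are assumed to be given (e.g. camera centers and back-projected pixel directions):

```python
import numpy as np

def midpoint_triangulation(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Solve for the ray parameters s, t minimizing ||(c1 + s*d1) - (c2 + t*d2)||.
    A = np.array([[d1 @ d1, -d1 @ d2],
                  [d1 @ d2, -d2 @ d2]])
    b = np.array([(c2 - c1) @ d1, (c2 - c1) @ d2])
    s, t = np.linalg.solve(A, b)
    return 0.5 * ((c1 + s * d1) + (c2 + t * d2))   # midpoint of closest approach
```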
Left camera:
$\tilde p_1 = K_1 \begin{bmatrix} I & 0 \end{bmatrix} \tilde P \;\;\Rightarrow\;\; \lambda_1 \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = \underbrace{K_1 \begin{bmatrix} I & 0 \end{bmatrix}}_{M_1} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

Right camera:
$\tilde p_2 = K_2 \begin{bmatrix} R & T \end{bmatrix} \tilde P \;\;\Rightarrow\;\; \lambda_2 \begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = \underbrace{K_2 \begin{bmatrix} R & T \end{bmatrix}}_{M_2} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

[Figure: cameras $C_1$, $C_2$ related by [R | T], projections $p_1$, $p_2$ of the unknown point P (= ?)]
Left camera — cross-multiply by $\tilde p_1$:
$\tilde p_1 = M_1 \tilde P \;\Rightarrow\; M_1 \tilde P - \tilde p_1 = 0 \;\Rightarrow\; \tilde p_1 \times M_1 \tilde P - \underbrace{\tilde p_1 \times \tilde p_1}_{0} = 0 \;\Rightarrow\; [\tilde p_{1\times}]\, M_1 \tilde P = 0$

Right camera — cross-multiply by $\tilde p_2$:
$\tilde p_2 = M_2 \tilde P \;\Rightarrow\; M_2 \tilde P - \tilde p_2 = 0 \;\Rightarrow\; \tilde p_2 \times M_2 \tilde P - \underbrace{\tilde p_2 \times \tilde p_2}_{0} = 0 \;\Rightarrow\; [\tilde p_{2\times}]\, M_2 \tilde P = 0$
Cross product as matrix multiplication: $\;a \times b = \begin{bmatrix} 0 & -a_z & a_y \\ a_z & 0 & -a_x \\ -a_y & a_x & 0 \end{bmatrix} \begin{bmatrix} b_x \\ b_y \\ b_z \end{bmatrix} = [a_\times]\, b$
Stacking the left-camera constraint $[\tilde p_{1\times}]\, M_1 \tilde P = 0$ and the right-camera constraint $[\tilde p_{2\times}]\, M_2 \tilde P = 0$:
§ 3 unknowns ($P$): $X_w, Y_w, Z_w$
§ 6 equations (4 of them independent) → an over-determined linear system, solved in a least-squares sense
[Figure: two camera frames C and C′ related by (R, T)]
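A compact sketch of the linear triangulation described above: stack the two cross-product constraints and solve the over-determined homogeneous system with SVD (function and variable names are illustrative):

```python
import numpy as np

def skew(a):
    """Matrix [a_x] such that skew(a) @ b == np.cross(a, b)."""
    ax, ay, az = a
    return np.array([[0.0, -az,  ay],
                     [ az, 0.0, -ax],
                     [-ay,  ax, 0.0]])

def triangulate_linear(p1, p2, M1, M2):
    """3D point from homogeneous pixel points p1, p2 and 3x4 matrices M1, M2."""
    A = np.vstack([skew(p1) @ M1,        # [p1_x] M1 P = 0  (2 independent rows)
                   skew(p2) @ M2])       # [p2_x] M2 P = 0  (2 independent rows)
    _, _, Vt = np.linalg.svd(A)          # least-squares solution of A P = 0
    P_h = Vt[-1]
    return P_h[:3] / P_h[3]              # de-homogenize to (Xw, Yw, Zw)
```

Only two rows of each cross-product constraint are independent, which is why the stacked system has 6 rows but effectively 4 constraints on the 3 unknowns.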
Multi-view SFM
§ Results of structure from motion applied to user images from flickr.com [Seitz & Szeliski, ICCV 2009]
Event-based Cameras