
Autonomous Mobile Robots

"Position"
Localization Cognition
Global Map

Environment Model Path


Local Map

Perception Real World Motion Control


Environment

Perception: Sensors, Vision, Uncertainties, Line extraction from laser scans

Zürich Autonomous Systems Lab
© R. Siegwart, M. Chli and D. Scaramuzza, ETH Zurich - ASL


One picture, a thousand words
 Of all our senses, vision is the most powerful in aiding our perception of the 3D world around us
 The retina is ~1000 mm² and contains millions of photoreceptors (~120 million rods and ~7 million cones for colour sampling)
 Provides an enormous amount of information: a data rate of ~3 GB/s
 → a large proportion of our brain power is dedicated to processing the signals from our eyes

http://webvision.med.utah.edu/sretina.html
Human Visual Capabilities

 Our visual system is very sophisticated
 Humans can interpret images successfully under a wide range of conditions – even in the presence of very limited cues

Do we always get it right?

Count the black dots

Do we always get it right?

Which square is darker, A or B?

Vision for Robotics

 Enormous descriptive power of images
 → a lot of data to process (human vision involves 60 billion neurons!)
 It is not sensible to copy the biology, but to learn from it:
 capture light → convert it to a digital image → process it to get "salient" information
 Vision is increasingly popular as a sensing modality:
 compactness,
 compatibility,
 low cost, …
 HW advances are necessary to support the processing of images

Image: Nicholas M. Short

Computer Vision

Connection to other disciplines

Applications of Computer Vision

Today's Topics

Section 4.2 in the book

 Pinhole Camera Model

 Perspective Projection

 Stereo Vision

 Optical Flow

 Color Tracking

The camera

Sony Cybershot WX1

How do we see the world?

(Diagram: a piece of film placed in front of an object)

 Place a piece of film in front of an object
 Do we get a reasonable image?

Pinhole camera

(Diagram: a barrier with a small opening placed between object and film)

 Add a barrier to block off most of the rays
 This reduces blurring
 The opening is known as the aperture

Camera obscura
 Basic principle known to Mozi (470-390 BC) and Aristotle (384-322 BC)
 Drawing aid for artists: described by Leonardo da Vinci (1452-1519)
 The depth of the room (box) is the effective focal length

Solar eclipse
http://www.thelivingmoon.com/45jack_files/03files/Launch_Sites_Baikonur_Tour.html
Pinhole camera model

 Pinhole model:
 captures a beam of rays – all rays passing through a single point
 the point is called the Center of Projection or the Optical Center
 the image is formed on the Image Plane

 We will use the pinhole camera model to describe how the image is formed
Slide by Steve Seitz

Home-made pinhole camera

Why so blurry?
http://www.debevec.org/Pinhole/

Shrinking the aperture

 Why not make the aperture as small as possible?
 Less light gets through (must increase the exposure)
 Diffraction effects…
Images courtesy of Steve Seitz

Shrinking the aperture

Images courtesy of Steve Seitz


Why use a lens?
 Ideal pinhole: only one ray of light reaches each point
 → the image can be very dim
 Why not make the pinhole bigger? → aperture

 A lens can focus multiple rays coming from the same point

Image Formation using a converging lens

(Diagram: a lens between object and film)

 A lens focuses light onto the film
 Rays passing through the optical center are not deviated

Image formation using a converging lens

(Diagram: rays from an object pass through the lens; rays parallel to the Optical Axis converge at the Focal Point, at Focal Length f)

 A lens focuses light onto the film
 Rays passing through the center are not deviated
 All rays parallel to the Optical Axis converge at the Focal Point

Thin lens equation

(Diagram: object of height A at distance z from a lens with focal length f; its image of height B forms at distance e behind the lens)

 Similar triangles (object and image): $\frac{B}{A} = \frac{e}{z}$
 Similar triangles (through the focal point): $\frac{B}{A} = \frac{e-f}{f} = \frac{e}{f} - 1$
 Combining the two: $\frac{e}{z} = \frac{e}{f} - 1 \;\Rightarrow\; \frac{1}{f} = \frac{1}{z} + \frac{1}{e}$, the "thin lens equation"

Thin lenses

(Diagram: object at distance z, lens with focal length f, image at distance e)

 Thin lens equation: $\frac{1}{f} = \frac{1}{z} + \frac{1}{e}$
 Any object point satisfying this equation is in focus
 "Depth from Focus": use this to estimate (roughly) the distance to the object
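The thin lens equation can be solved for either unknown. A minimal Python sketch (the function names are ours; units are arbitrary but must be consistent):

```python
def image_distance(f, z):
    """Image distance e at which an object at depth z is in focus:
    1/f = 1/z + 1/e  =>  e = 1 / (1/f - 1/z)."""
    return 1.0 / (1.0 / f - 1.0 / z)


def depth_from_focus(f, e):
    """'Depth from Focus': invert the thin lens equation to estimate the
    depth z of an object that is in focus at image distance e."""
    return 1.0 / (1.0 / f - 1.0 / e)


print(image_distance(f=0.5, z=1.0))  # 1.0: an object at z = 1 focuses at e = 1
```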

“In focus”

(Diagram: object, lens with focal length f, and film; the Optical Axis passes through the Focal Point; an out-of-focus point spreads into a "Circle of Confusion", or "Blur Circle")

 There is a specific distance at which objects are "in focus"
 → other points project to a "blur circle" in the image

Blur Circle

(Diagram: lens with aperture diameter L; an object at distance z focuses on the Focal Plane at distance e; the Image Plane intersects the cone of rays in a Blur Circle of radius R)

 An object that is out of focus → its Blur Circle has radius $R = \frac{L\,\delta}{2e}$, where $\delta$ is the offset of the image plane from the focal plane
 A minimal L (pinhole) gives a minimal R
 For objects out of focus, a larger aperture gives worse blur
 Adjust the camera settings such that R remains smaller than the image resolution

Blur Circle

(Diagram as above, with the image plane at distance e′)

$R = \frac{L\,\delta}{2e}, \qquad \frac{1}{f} = \frac{1}{z} + \frac{1}{e}, \qquad \delta = e' - e$

 Let e′ = 1.2, L = 0.2 and f = 0.5. If z = 1, then e = 1, δ = 0.2 and R = 0.02
 Same setup, but now:
 z = 2 → R ≈ 0.08
 z = 10 → R ≈ 0.128
 z = 11 → R ≈ 0.129

(Plot: R against z, rising steeply for small z and flattening towards ≈ 0.14 for large z)

 Increased sensitivity to blurring when the object is close to the lens
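The worked example above is easy to reproduce; a quick numeric check in Python (blur_radius is our name, values from the slide):

```python
def blur_radius(f, e_prime, L, z):
    """Blur circle radius R = L * delta / (2e) for an object at depth z:
    e is the in-focus image distance (thin lens equation) and
    delta = e' - e is the offset of the actual image plane from it."""
    e = 1.0 / (1.0 / f - 1.0 / z)  # where the object actually focuses
    delta = e_prime - e
    return L * delta / (2.0 * e)


print(blur_radius(f=0.5, e_prime=1.2, L=0.2, z=1.0))  # 0.02
print(blur_radius(f=0.5, e_prime=1.2, L=0.2, z=2.0))  # ~0.08
```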

From Pinhole to Perspective Camera

(Diagram: a lens camera with optical center C and an object of height h at distance z forming an image of height h′, next to the equivalent pinhole camera with image plane at distance f)

C = "optical center", "center of projection"

 Adjust the image plane so that objects at infinity are in focus: as $z \to \infty$, the thin lens equation $\frac{1}{f} = \frac{1}{z} + \frac{1}{e}$ gives $e = f$
 Assuming z ≫ f, so that e ≈ f stays constant: $\frac{h'}{h} = \frac{f}{z} \;\Rightarrow\; h' = h\,\frac{f}{z}$

The dependence of the apparent size of objects on their distance from the observer is known as perspective.
Playing with Perspective

 Perspective gives us very strong depth cues → hence we can perceive a 3D scene by viewing its 2D representation (i.e. an image)
 When viewing 3D scenes, it is possible to be misled by perception

"Ames room"

(Video: a clip from "The computer that ate Hollywood" documentary, featuring Dr. Vilayanur S. Ramachandran)

Perspective Camera

(Diagram: optical center C at the center of the lens; camera axes Xc, Yc, Zc with Zc the optical axis; the image plane (CCD) at focal distance f, with principal point O; a 3D point Pc projects to the image point p = (u, v))

C = optical center = center of the lens; O = principal point; Zc = optical axis

 For convenience, the image plane is usually represented in front of C, so that the image preserves the same orientation (i.e. it is not flipped)
 Note: a camera does not measure distances but angles!
 → a camera is a "bearing sensor"
Perspective Projection: from world to pixel coords

(Diagram: world frame W with axes Xw, Yw, Zw; camera frame with axes Xc, Yc, Zc and optical center C; the two frames are related by [R|T]; the point Pw = Pc projects to p = (u, v) near the principal point O)

Find the pixel coordinates (u, v) of a point Pw given in the world frame:
0. Convert the world point Pw to a camera point Pc (using the extrinsics [R|T])
1. Convert Pc to image-plane coordinates (x, y)
2. Convert Pc to (discretised) pixel coordinates (u, v)

Perspective Projection (1)
From the Camera frame to the image plane

(Diagram: optical center C, principal point O, image plane at focal distance f)

 The 3D camera point Pc = (Xc, 0, Zc)ᵀ projects to the point p = (x, y) on the image plane
 Analysing similar triangles: $\frac{x}{f} = \frac{X_c}{Z_c} \;\Rightarrow\; x = \frac{f X_c}{Z_c}$
 For the full 3D case we likewise obtain: $\frac{y}{f} = \frac{Y_c}{Z_c} \;\Rightarrow\; y = \frac{f Y_c}{Z_c}$
Perspective Projection (2)
From the Camera frame to pixel coordinates

(Diagram: pixel frame with origin (0,0) in the image corner and axes u, v; local image-plane axes x, y centred on the principal point O = (u0, v0); image point p)

 To convert p from the local image-plane coords (x, y) to the pixel coords (u, v), we have to account for:
 the pixel coords of the camera optical center, O = (u0, v0)
 the scale factors ku, kv for the pixel size in the two dimensions

So: $u = u_0 + k_u x = u_0 + \frac{k_u f X_c}{Z_c}$ and $v = v_0 + k_v y = v_0 + \frac{k_v f Y_c}{Z_c}$

 Use Homogeneous Coordinates for a linear mapping from 3D to 2D, by introducing an extra element (scale):
$p = \begin{bmatrix} u \\ v \end{bmatrix} \;\rightarrow\; \tilde{p} = \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$, and similarly for the world coordinates. Note: usually λ = 1.
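The two conversion equations translate directly into code; a minimal sketch (camera_to_pixel is our name, the numeric values are illustrative):

```python
def camera_to_pixel(P_c, f, ku, kv, u0, v0):
    """Pinhole projection of a camera-frame point (Xc, Yc, Zc) to pixel
    coordinates (u, v), following the two equations above (no distortion)."""
    Xc, Yc, Zc = P_c
    u = u0 + ku * f * Xc / Zc
    v = v0 + kv * f * Yc / Zc
    return u, v


# e.g. an 8 mm lens with 10 um pixels (ku = kv = 1e5 px/m):
print(camera_to_pixel((0.1, -0.2, 2.0), f=0.008, ku=1e5, kv=1e5, u0=320, v0=240))
```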
Perspective Projection (3)

 So: $u = u_0 + \frac{k_u f X_c}{Z_c}, \qquad v = v_0 + \frac{k_v f Y_c}{Z_c}$

Expressed in matrix form and homogeneous coords:
$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}$

Or alternatively, with $\alpha_u = k_u f$ (the focal length in the u-direction) and $\alpha_v = k_v f$ (the focal length in the v-direction):
$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha_u & 0 & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}$

K is called the "Calibration matrix" or "Matrix of Intrinsic Parameters".
Perspective Projection (4)
From the Camera frame to the World frame

(Diagram: world frame W and camera frame related by [R|T], as before)

$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix}$

or, in homogeneous coords:
$\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} R & T \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$ (the Extrinsic Parameters)

Combining intrinsics and extrinsics yields the Projection Matrix:
$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = K\,[R\,|\,T] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$
Radial distortion

 Straight lines in the world → straight lines in the image?
 From ideal pixel coordinates (u, v) to distorted pixel coordinates (ud, vd)
 Simple distortion model with a single Radial Distortion parameter

(Figures: barrel distortion and pincushion distortion)
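The exact expression on the slide was lost in the export; a commonly used one-parameter radial model (our reconstruction, stated as an assumption rather than the slide's formula) scales image-plane coordinates by 1 + k1·r²:

```python
def distort(x, y, k1):
    """One-parameter radial distortion of ideal image-plane coordinates
    (x, y), measured from the principal point: r^2 = x^2 + y^2, and both
    coordinates are scaled by (1 + k1 * r^2).
    Commonly, k1 < 0 gives barrel distortion, k1 > 0 gives pincushion."""
    r2 = x * x + y * y
    s = 1.0 + k1 * r2
    return s * x, s * y
```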

Camera Calibration

 Use our camera model to interpret the projection from the world to the image plane
 Using known correspondences p ↔ P, we can compute the unknown parameters K, R, T by applying the perspective projection equation
 … i.e. associate known, physical distances in the world with pixel-distances in the image

Projection Matrix:
$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[R\,|\,T] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$
Camera Calibration
 We know that: $\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[R\,|\,T] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

 So there are 11 values to estimate (the overall scale does not matter, so e.g. m34 can be set to 1)

 Each observed point gives us a pair of equations:
$u_i = \frac{m_{11} X_i + m_{12} Y_i + m_{13} Z_i + m_{14}}{m_{31} X_i + m_{32} Y_i + m_{33} Z_i + m_{34}}, \qquad v_i = \frac{m_{21} X_i + m_{22} Y_i + m_{23} Z_i + m_{24}}{m_{31} X_i + m_{32} Y_i + m_{33} Z_i + m_{34}}$

 To estimate the 11 unknowns, we need at least 6 points to calibrate the camera → solved using linear least squares (see the sketch below)
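Rearranging each pair of equations into linear constraints on the m_ij gives a 2n × 11 system; a minimal numpy sketch of the least-squares step (calibrate_dlt is our name; m34 is fixed to 1 as suggested above):

```python
import numpy as np

def calibrate_dlt(P_w, p):
    """Estimate the 3x4 projection matrix M from n >= 6 correspondences.
    P_w: (n, 3) world points, p: (n, 2) pixel points.
    Multiplying out u = (m11 X + ... + m14) / (m31 X + ... + m34) with
    m34 = 1 gives two linear equations per observed point."""
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(P_w, p):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        b.append(u)
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        b.append(v)
    m, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return np.append(m, 1.0).reshape(3, 4)  # m34 fixed to 1
```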

Camera Calibration

$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = K\,[R\,|\,T] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

 What we obtained is the 3×4 projection matrix; what we need is its decomposition into the camera calibration matrix K, and the rotation R and position T of the camera
 Use QR factorization to decompose the 3×3 submatrix (m11 … m33) into the product of an upper triangular matrix K and a rotation matrix R (an orthogonal matrix)
 The translation T can subsequently be obtained by: $T = K^{-1} \begin{bmatrix} m_{14} \\ m_{24} \\ m_{34} \end{bmatrix}$
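In code the decomposition is one RQ factorization plus a linear solve; a sketch using scipy (the sign correction resolves the ambiguity of the factorization so that K has a positive diagonal):

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(M):
    """Split a 3x4 projection matrix M = K [R|T] into intrinsics K,
    rotation R and translation T."""
    K, R = rq(M[:, :3])               # upper triangular times orthogonal
    S = np.diag(np.sign(np.diag(K)))  # sign fix: S @ S = I
    K, R = K @ S, S @ R
    T = np.linalg.solve(K, M[:, 3])   # T = K^-1 (m14, m24, m34)^T
    return K / K[2, 2], R, T          # normalise K's arbitrary scale
```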

How do we measure distances with cameras?
 It is impossible to recover 3D structure from a single image: we can only deduce the ray on which each image point lies

(Diagram: a 3D Object observed in a Left Image and a Right Image)

 Observe the scene from 2 different viewpoints → solve for the intersection of the rays and recover the 3D structure
Disparity in the human retina

How do we measure distances with cameras?

 Structure from stereo (stereo vision): use two cameras with known relative position and orientation
 Structure from motion: use a single moving camera; both the 3D structure and the camera motion can be estimated, up to a scale

Stereo Vision - The simplified case

 An ideal, simplified case: assume both cameras are identical and aligned along a horizontal axis

(Diagram: point P = (X_P, Y_P, Z_P); left and right cameras with optical centers Cl and Cr separated by the baseline b, focal length f, and image coordinates ul and ur)

From similar triangles: $\frac{f}{Z_P} = \frac{u_l}{X_P}$ and $\frac{f}{Z_P} = \frac{-u_r}{b - X_P}$, hence
$Z_P = \frac{b\,f}{u_l - u_r}$

 Disparity: the difference in the image location of the projection of a 3D point in the two image planes
 Baseline: the distance between the optical centers of the two cameras
 "Triangulation": the intersection of rays to estimate the scene depth (see the sketch below)
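Once the images are aligned like this, depth really is one division; a minimal sketch with illustrative numbers:

```python
def depth_from_disparity(b, f, u_l, u_r):
    """Depth Z_P = b*f / (u_l - u_r) for a point seen at horizontal pixel
    positions u_l, u_r in an aligned stereo pair with baseline b and
    focal length f (here expressed in pixels)."""
    return b * f / (u_l - u_r)


print(depth_from_disparity(b=0.12, f=700.0, u_l=400.0, u_r=365.0))  # 2.4 m
```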

Disparity map

 Disparity = the difference in the image location of the projection of a 3D point in two image planes

(Figure: left image, right image, and the resulting disparity map)

 The disparity map holds the disparity value at every pixel:
 find the corresponding points of all image pixels of the original images
 compute the disparity for each pair of correspondences
 Usually visualised as a gray-scale image

 Close objects have bigger disparity → they appear brighter in the disparity map

Stereo Vision facts

$Z_P = \frac{b\,f}{u_l - u_r}$

 Depth is inversely proportional to the disparity $(u_l - u_r)$
 → foreground objects have bigger disparity than background objects
 Disparity is proportional to the stereo baseline b
 The smaller the baseline, the smaller the disparities, and the more uncertain our estimate of the scene depth
 Note: when increasing b, some objects may become visible from only one camera
 The projections of a single 3D point onto the left and the right stereo images are called a 'correspondence pair'

Stereo Vision – the general case

 Two identical cameras do not exist in nature!
 Aligning both cameras along a horizontal axis is very hard

(Diagram: two cameras related by a rotation and translation (R, T))

 In order to be able to use a stereo camera, we need:
 the relative pose between the cameras (rotation, translation), and
 the focal length, optical center and radial distortion of each camera
 Use a calibration method, as mentioned before
Stereo Vision – the general case

 To estimate the 3D position of a point Pw = (X_w, Y_w, Z_w), we simply construct the system of equations of the left and the right camera (see the sketch below):

Left camera (we set the world frame to coincide with the left camera frame):
$\tilde{p}_l = \lambda_l \begin{bmatrix} u_l \\ v_l \\ 1 \end{bmatrix} = K_l \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix}$

Right camera (at relative pose (R, T)):
$\tilde{p}_r = \lambda_r \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix} = K_r \left( R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + T \right)$
Correspondence Problem

 Which patch in the Left image corresponds to the projection of the same 3D scene point in the Right image?
 Correspondence search: compare the query patch against candidate patches at pixel positions in the other image
 Typical similarity measures are the Normalised Cross-Correlation (NCC) and the Sum of Squared Differences (SSD)
 Exhaustive image search can be computationally very expensive!
Can we reduce the correspondence search to 1D?

Epipolar Geometry

 The epipolar plane is defined by a 3D point P = (x, y, z) and the two optical centers C1 and C2

(Diagram: P projects to p1 = (u1, v1) on Epipolar Line 1 and to p2 = (u2, v2) on Epipolar Line 2; the baseline C1–C2 intersects the image planes at the epipoles E1 and E2)

 Impose the epipolar constraint to aid matching: search for a correspondence along the epipolar line
Correspondence Problem: Epipolar Constraint

 Thanks to the epipolar constraint, conjugate points can be searched for along epipolar lines: this reduces the computational cost to 1 dimension (see the sketch below)!
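For rectified images the epipolar lines are the image rows, and the 1D search becomes a simple loop; a brute-force SSD sketch (ssd_search_1d is our name):

```python
import numpy as np

def ssd_search_1d(left, right, row, col, half=5):
    """Return the column in `right` whose patch best matches (smallest SSD)
    the patch centred at (row, col) in `left`, searching along the same
    row only (valid once epipolar lines are horizontal)."""
    patch = left[row - half:row + half + 1, col - half:col + half + 1].astype(float)
    best_col, best_ssd = None, np.inf
    for c in range(half, right.shape[1] - half):
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(float)
        ssd = np.sum((patch - cand) ** 2)
        if ssd < best_ssd:
            best_col, best_ssd = c, ssd
    return best_col


left = np.random.default_rng(1).integers(0, 255, (100, 160), dtype=np.uint8)
right = np.roll(left, -7, axis=1)                  # synthetic 7-pixel disparity
print(ssd_search_1d(left, right, row=50, col=80))  # -> 73
```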

Epipolar Rectification

 Determine a transformation of each image plane so that pairs of conjugate epipolar lines become collinear and parallel to one of the image axes (usually the horizontal one)
 The transformation must compensate for the differences between the two cameras: rotation, focal lengths, lens distortion and translation

(Figures: images from the Left and Right cameras, stepwise corrected for rotation, focal lengths, lens distortion and translation)
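In practice the rectifying transforms are computed from the stereo calibration, e.g. with OpenCV; a sketch with illustrative, made-up calibration values:

```python
import cv2
import numpy as np

w, h = 640, 480
K_l = K_r = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
d_l = d_r = np.zeros(5)         # pretend there is no lens distortion
R = np.eye(3)                   # relative rotation between the cameras
T = np.array([0.12, 0.0, 0.0])  # 12 cm horizontal baseline

R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K_l, d_l, K_r, d_r, (w, h), R, T)

# Per-pixel map that undistorts and rectifies the left image.
map_lx, map_ly = cv2.initUndistortRectifyMap(K_l, d_l, R1, P1, (w, h), cv2.CV_32FC1)
left = np.zeros((h, w), np.uint8)  # stand-in for a real left image
rect_left = cv2.remap(left, map_lx, map_ly, cv2.INTER_LINEAR)
```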

Stereo Vision Output: 3D Reconstruction via triangulation
(Figure: a reconstructed 3D point cloud plotted against the X, Y and Z axes)
Stereo Vision - summary

(Diagram: a 3D Object observed in a Left Image and a Right Image)

1. Stereo camera calibration → compute the cameras' relative pose
2. Epipolar rectification → align the images and the epipolar lines
3. Search for correspondences
4. Output: compute the stereo triangulation or the disparity map
5. Consider how the baseline and the image resolution affect the accuracy of the depth estimates
Structure from motion

 Given image point correspondences $x_i \leftrightarrow x_i'$, determine R and T
 Rotate and translate the camera until the bundles of rays intersect
 At least 5 point correspondences are needed (for calibrated cameras)

(Diagram: camera centers C and C′ related by (R, T), with corresponding viewing rays x and x′)
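With calibrated cameras, the relative pose can be recovered from five or more correspondences, e.g. with OpenCV's five-point algorithm inside RANSAC; a sketch on synthetic data (all values are illustrative):

```python
import cv2
import numpy as np

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])

# Synthetic scene: random 3D points seen from two poses (known ground truth).
rng = np.random.default_rng(0)
P = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))
R_true, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))  # small rotation
t_true = np.array([[1.0], [0.0], [0.0]])

def project(P, R, t):
    p = (K @ (R @ P.T + t)).T
    return p[:, :2] / p[:, 2:]

pts1 = project(P, np.eye(3), np.zeros((3, 1)))
pts2 = project(P, R_true, t_true)

E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, T, _ = cv2.recoverPose(E, pts1, pts2, K)  # T is recovered up to scale
```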
Multiple-view structure from motion

Image courtesy of Nader Salman

Multiple-view structure from motion
 Results of structure from motion on user images from flickr.com [Seitz, Szeliski ICCV 2009]

Colosseum, Rome: 2,106 images, 819,242 points. San Marco square, Venice: 14,079 images, 4,515,157 points.

Optical Flow (1)

 Optical flow: a vector field representing the motion of parts of an image in subsequent frames

Optical Flow

 It computes the motion vectors of all pixels in the image (or of a subset of them, to be faster)
 Applications include collision avoidance
 Does it always work?
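As an illustration, a dense flow field (one motion vector per pixel) can be computed with e.g. OpenCV's Farnebäck method; a sketch with typical parameter values, using blank frames as stand-ins for real camera images:

```python
import cv2
import numpy as np

prev_gray = np.zeros((480, 640), np.uint8)  # stand-ins for two consecutive
next_gray = np.zeros((480, 640), np.uint8)  # grayscale camera frames

flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
print(flow.shape)  # (480, 640, 2): one (dx, dy) vector per pixel
```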

Color Tracking
 Motion estimation of the ball and the robots in robot soccer, using color tracking

Color segmentation with fixed thresholds

 The simplest way of doing this is constant thresholding: a given pixel is selected if and only if its RGB values (r, g, b) fall simultaneously within the desired ranges:
$r_{min} \le r \le r_{max}, \qquad g_{min} \le g \le g_{max}, \qquad b_{min} \le b \le b_{max}$

 Alternatively, use the YUV color space:
 while the R, G and B values each mix color and intensity, YUV separates the color measure (chrominance, captured in U and V) from the brightness measure (luminosity, captured in Y)
 → thresholding in YUV space can achieve greater stability to illumination changes than in RGB space (see the sketch below)
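A fixed-threshold segmentation takes only a few lines, e.g. with OpenCV; a sketch in which the threshold values are made up and would be tuned for the target color:

```python
import cv2
import numpy as np

bgr = np.zeros((480, 640, 3), np.uint8)  # stand-in for a camera frame
yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV)

# A pixel is selected iff its Y, U and V values all fall inside the ranges.
lower = np.array([0, 100, 140], np.uint8)    # illustrative bounds, e.g. for
upper = np.array([255, 130, 180], np.uint8)  # a coloured RoboCup ball
mask = cv2.inRange(yuv, lower, upper)        # 255 where selected, else 0
```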

Color Tracking in RoboCup
 A typical play


Image from http://www.donparrish.com/FavoriteOpticalIllusion.html

