Computer vision, in one lecture

Bill Freeman
Electrical Engineering and Computer Science Dept.
Massachusetts Institute of Technology
April 21, 2010


The Taiyuan University of Technology Computer Center staff, and me (1987)


Me and my wife, riding from the Foreigners' Cafeteria



Inside the computer center, with the image processing equipment


While in China, I read this book (to be re-issued by MIT Press this year), and got very excited about computer vision.

Studied for PhD at MIT.


Goal of computer vision

Marr: "To tell what is where by looking". Want to:
– Estimate the shapes and properties of things.
– Recognize objects
– Find and recognize people
– Find road lanes and other cars
– Help a robot walk, navigate, or fly.
– Inspect for manufacturing


Some particular goals of computer vision

• Wave a camera around, get a 3-d model out.
• Capture body pose of actor dancing.
• Detect and recognize faces.
• Recognize objects.
• Track people or objects


Let's go back in time, to the mid-1980's

What everyone looked like back then


Features

• Points
but also,
• Lines
• Conics
• Other fitted curves

Objects

"blocks world": a toy world in which to study image interpretation.

All we have to do is to convert real-world images to their blocks-world equivalents and we're all set.


Features

Yvan Leclerc and Martin Fischler, An optimization-based approach to the interpretation of single line drawings as 3-d wire frames.

Computer vision research results, 1986

Huttenlocher and Ullman, Object recognition using alignment, ICCV, 1986

Computer vision research results, 1992

6 years later: recognizing planar objects using invariants.

[figure: input image; edge points fitted with lines or conics; objects that have been recognized and verified]

From Rothwell et al., Efficient model library access by projectively invariant indexing functions, CVPR 1992.

Back to the present…


Companies and applications

• Cognex
• Poseidon
• Mobileye
• EyeToy
• Identix
• Google
• Microsoft
• Face recognition in cameras

[product example slides: Mobileye; Google; Microsoft; Microsoft]


Some particular goals of computer vision (status report)

• Wave a camera around, get a 3-d model out (almost)
• Capture body pose of actor dancing. Using multiple cameras (pretty well); using a single camera (not yet)
• Detect and recognize faces (frontal, yes)
• Recognize objects (working on it, lots of progress)
• Track people or objects (over short times)


What has allowed us to make progress?

• SIFT features
• Discriminative classifiers
• Bayesian methods
• Large databases


Building a Panorama

M. Brown and D. G. Lowe, Recognising Panoramas, ICCV 2003


How do we build a panorama?

• We need to match (align) images

http://www.wisdom.weizmann.ac.il/~deniss/vision_spring04/files/InvariantFeatures.ppt


Matching with Features

• Detect feature points in both images
• Find corresponding pairs
• Use these pairs to align images - we know this
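Those three steps map directly onto modern library calls. A minimal sketch with OpenCV (an assumption: the slides predate this API, and the filenames are hypothetical):

```python
import cv2
import numpy as np

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical filenames
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Detect feature points in both images, with descriptors
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Find corresponding pairs: 2-NN matching plus Lowe's ratio test
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Use these pairs to align the images: robust homography via RANSAC
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```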


Matching with Features

• Problem 1: detect the same point independently in both images

counter-example: no chance to match!

We need a repeatable detector.


Matching with Features

• Problem 2: for each point, correctly recognize the corresponding one

We need a reliable and distinctive descriptor.

http://www.wisdom.weizmann.ac.il/~deniss/vision_spring04/files/InvariantFeatures.ppt


Overview of feature detection for (instance) object recognition

• Detector: detect same scene points independently in both images
• Descriptor: encode local neighboring window
– Note how scale & rotation of the window are the same in both images (but computed independently)
• Correspondence: find the most similar descriptor in the other image

[figure labels: detector location; descriptor]

Note: here the viewpoint is different, not a panorama (they show off)


CVPR 2003 Tutorial: Recognition and Matching Based on Local Invariant Features

David Lowe
Computer Science Department, University of British Columbia
http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf


Invariant Local Features

• Image content is transformed into local feature coordinates that are invariant to translation, rotation, scale, and other imaging parameters

[figure: SIFT features]

Freeman et al., 1998
http://people.csail.mit.edu/billf/papers/cga1.pdf


Advantages of invariant local features

• Locality: features are local, so robust to occlusion and clutter (no prior segmentation)
• Distinctiveness: individual features can be matched to a large database of objects
• Quantity: many features can be generated for even small objects
• Efficiency: close to real-time performance
• Extensibility: can easily be extended to a wide range of differing feature types, with each adding robustness


SIFT vector formation

• Computed on a rotated and scaled version of the window, according to the computed orientation & scale
– resample a 16x16 version of the window
• Based on gradients weighted by a Gaussian of variance half the window (for smooth falloff)


SIFT vector formation

• 4x4 array of gradient orientation histograms
– not really a histogram: weighted by magnitude
• 8 orientations x 4x4 array = 128 dimensions
• Motivation: some sensitivity to spatial layout, but not too much.

(showing only 2x2 here, but it is 4x4)


SIFT vector formation

• Thresholded image gradients are sampled over a 16x16 array of locations in scale space
• Create an array of orientation histograms
• 8 orientations x 4x4 histogram array = 128 dimensions

(showing only 2x2 here, but it is 4x4)
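A toy sketch of the descriptor construction described in the last three slides, assuming the 16x16 window has already been resampled to the keypoint's orientation and scale. (Real SIFT also trilinearly interpolates each vote across neighboring bins, covered on the next slide; that is omitted here for brevity.)

```python
import numpy as np

def sift_descriptor(patch16):
    """patch16: 16x16 grayscale window in the keypoint's frame."""
    gy, gx = np.gradient(patch16.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    # Gaussian weighting, sigma about half the window width, for smooth falloff
    xs = np.arange(16) - 7.5
    g = np.exp(-(xs[:, None] ** 2 + xs[None, :] ** 2) / (2 * 8.0 ** 2))
    w = mag * g
    desc = np.zeros((4, 4, 8))
    for i in range(16):
        for j in range(16):
            b = int(ang[i, j] / (2 * np.pi) * 8) % 8   # 1 of 8 orientation bins
            desc[i // 4, j // 4, b] += w[i, j]          # magnitude-weighted vote
    return desc.ravel()                                 # 4 x 4 x 8 = 128 dims
```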


Ensure smoothness

• Gaussian weight
• Trilinear interpolation
– a given gradient contributes to 8 bins: 4 in space times 2 in orientation

Reduce effect of illumination

• 128-dim vector normalized to 1
• Threshold gradient magnitudes to avoid excessive influence of high gradients
– after normalization, clamp gradients > 0.2
– renormalize
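As a sketch, that normalization scheme is only a few lines:

```python
import numpy as np

def normalize_sift(v, clamp=0.2):
    """Unit-normalize, clamp large entries, renormalize (as described above)."""
    v = v / (np.linalg.norm(v) + 1e-12)   # 128-dim vector normalized to 1
    v = np.minimum(v, clamp)              # limit influence of high gradients
    return v / (np.linalg.norm(v) + 1e-12)
```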


Feature stability to noise

• Match features after random change in image scale & orientation, with differing levels of image noise
• Find nearest neighbor in a database of 30,000 features


Feature stability to affine change

• Match features after random change in image scale & orientation, with 2% image noise, and affine distortion
• Find nearest neighbor in a database of 30,000 features


Distinctiveness of features

• Vary the size of the database of features, with 30 degree affine change, 2% image noise
• Measure % correct for single nearest neighbor match


These feature point detectors and descriptors are the most important recent advance in computer vision and graphics.

• Feature points are also used for:
– Image alignment (homography, fundamental matrix)
– 3D reconstruction
– Motion tracking
– Object recognition
– Indexing and database retrieval
– Robot navigation
– … other


More uses for SIFT features

SIFT features have also been applied to (categorical) object recognition.
First, let's present some of the issues in object recognition.


intra-class variation

Slide from: Li Fei-Fei, Rob Fergus and Antonio Torralba, short course on object recognition, http://people.csail.mit.edu/torralba/shortCourseRLOC/


Object recognition issues

– Generative / discriminative / hybrid
– Appearance only, or location and appearance
– Invariances:
• Viewpoint
• Illumination
• Occlusion
• Scale
• Deformation
• Clutter
• etc.
– Parts, or global with sub-window
– Use a set of features, or each pixel in the image

Slide from: Li Fei-Fei, Rob Fergus and Antonio Torralba, short course on object recognition, http://people.csail.mit.edu/torralba/shortCourseRLOC/


Current approaches in object recognition

• Bag of words
• Boosting
• Label transfer


Visual words

• Vector quantize SIFT descriptors to a vocabulary of 2 or 3 thousand "visual words" (see the sketch after this list).
• Heuristic design of the descriptors makes these words somewhat invariant to:
– Lighting
– 2-d orientation
– 3-d viewpoint
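A minimal sketch of this vector quantization step, using scikit-learn's k-means as a stand-in (an assumption; the vocabulary size of 2000 follows the slide, and the training descriptors here are random placeholders):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

train_descriptors = np.random.rand(100000, 128)  # stand-in for real SIFT descriptors

# Build the vocabulary: each cluster center is one "visual word"
kmeans = MiniBatchKMeans(n_clusters=2000, random_state=0).fit(train_descriptors)

def image_histogram(descriptors):
    """Quantize an image's descriptors to words and return a normalized histogram."""
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=2000) / len(words)
```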


Object recognition using visual words

Find words → form histograms → compare with the object-class database

Many combinatorial matching problems to be solved for object recognition.

Instance recognition: with features allowed to appear or not in both the test and training examples.

Deformable object recognition: some feature clusters maintain spatial coherence, others can vary.
http://www.cs.utexas.edu/~grauman/research/projects/pmk/pmk_projectpage.htm

Category recognition: each class defined by many different training-set exemplars. Find the class that best explains the observed feature set.

Semi-supervised object recognition: observed training-set features include many background-object features.

http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/cvpr06b.pdf


Caltech 101

Caltech 101 results over time


Problem: Category-level recognition using the visual words representation.

Applications: Object recognition.

References:
Lazebnik, Schmid, and Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, Computer Vision and Pattern Recognition (CVPR 2006), http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/cvpr06b.pdf
K. Grauman and T. Darrell, Unsupervised Learning of Categories from Sets of Partially Matching Image Features, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York City, NY, June 2006, http://www.cs.utexas.edu/~grauman/papers/grauman_darrell_cvpr2006.pdf


What has allowed us to make progress?

• SIFT features
• Discriminative classifiers—SVMs and boosting
• Bayesian methods
• Large databases


Rapid Object Detection Using a Boosted Cascade of Simple Features

Paul Viola and Michael J. Jones
Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA

Most of this work was done at Compaq CRL before the authors moved to MERL.

Manuscript available on web:
http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf


Viola-Jones approach

• Large feature set (… is huge: about 16,000,000 features)
• Efficient feature selection using AdaBoost
• Cascaded classifier for rapid detection
– Hierarchy of attentional filters

The combination of these ideas yields the fastest known face detector for gray scale images.

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001

Image Features

"Rectangle filters": similar to Haar wavelets. Differences between sums of pixels in adjacent rectangles.

h_t(x) = +1 if f_t(x) > θ_t, -1 otherwise

[figure: unique rectangle features]

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001

Huge "Library" of Filters

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001

Integral Image

• Define the integral image: ii(x, y) = sum of all pixels above and to the left of (x, y), inclusive
• Any rectangular sum can be computed in constant time
• Rectangle features can be computed as differences between rectangles

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001
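A sketch of those two definitions (the paper computes the integral image in one pass with a pair of recurrences; NumPy's cumulative sums do the same job):

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img over rows <= r and cols <= c."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of the rectangle rows r0..r1, cols c0..c1: four array lookups."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

# A two-rectangle Haar-like feature is then just the difference of two
# such constant-time sums over adjacent rectangles.
```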


Constructing classifiers by combining filter outputs

• Perceptron yields a sufficiently powerful classifier
• Use AdaBoost to efficiently choose the best features

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001

AdaBoost (Freund & Schapire '95)

[figure: initial uniform weight on training examples; weak classifier 1; incorrect classifications re-weighted more heavily; weak classifier 2; weak classifier 3]

Final classifier is a weighted combination of the weak classifiers.

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


AdaBoost Tutorial

• Given a weak learning algorithm
– Learner takes a training set and returns the best classifier from a weak concept space
• required to have error < 50%
• Starting with a training set (initial weights 1/n)
– Weak learning algorithm returns a classifier
– Reweight the examples
• Weight on correct examples is decreased
• Weight on errors is increased
• Final classifier is a weighted majority of the weak classifiers
– Weak classifiers with low error get larger weight

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Review of AdaBoost (Freund & Schapire '95)

• Given examples (x_1, y_1), …, (x_N, y_N) where y_i = 0, 1 for negative and positive examples respectively.
• Initialize weights w_{1,i} = 1/N.
• For t = 1, …, T:
  • Normalize the weights: w_{t,i} ← w_{t,i} / Σ_{j=1}^{N} w_{t,j}
  • Find a weak learner, i.e. a hypothesis h_t(x), with weighted error less than 0.5.
  • Calculate the error of h_t: e_t = Σ_i w_{t,i} |h_t(x_i) − y_i|
  • Update the weights: w_{t+1,i} = w_{t,i} B_t^{(1−d_i)}, where B_t = e_t / (1 − e_t) and d_i = 0 if example x_i is classified correctly, d_i = 1 otherwise.
• The final strong classifier is
  h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ 0.5 Σ_{t=1}^{T} α_t, and 0 otherwise, where α_t = log(1/B_t).

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001
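A direct, minimal transcription of the algorithm above. The weak-learner training routine is assumed given; in Viola-Jones it would pick the single best thresholded rectangle feature under the current weights.

```python
import numpy as np

def adaboost(X, y, train_weak, T):
    """X: n x d features; y: 0/1 labels; train_weak(X, y, w) returns a
    hypothesis h with h(X) in {0, 1} and weighted error below 0.5."""
    n = len(y)
    w = np.ones(n) / n                        # initial uniform weights
    hyps, alphas = [], []
    for t in range(T):
        w = w / w.sum()                       # normalize the weights
        h = train_weak(X, y, w)
        miss = np.abs(h(X) - y)               # d_i: 0 if correct, 1 if error
        err = np.sum(w * miss)
        beta = err / (1 - err)
        w = w * beta ** (1 - miss)            # w_i *= B_t^(1 - d_i)
        hyps.append(h)
        alphas.append(np.log(1 / beta))
    def strong(Xq):
        score = sum(a * h(Xq) for a, h in zip(alphas, hyps))
        return (score >= 0.5 * sum(alphas)).astype(int)
    return strong
```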


Example Classifier for Face Detection

One stage: a classifier with 200 rectangle features was learned using AdaBoost. 95% correct detection on the test set, with 1 in 14084 false positives.

[figure: ROC curve for the 200-feature classifier]

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001

Develop fast, accurate classifier using a cascade

• Given a nested set of classifier hypothesis classes
[figure: % detection (50-99) vs. % false positives (0-50); the false-positive vs. false-negative trade-off is determined by computational risk minimization]
• Computational Risk Minimization

IMAGE SUB-WINDOW → Classifier 1 → (T) → Classifier 2 → (T) → Classifier 3 → (T) → FACE
(an F at any stage → NON-FACE)

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Experiment: Simple Cascaded Classifier

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Cascaded Classifier

IMAGE SUB-WINDOW → 1 Feature (50%) → 5 Features (20%) → 20 Features (2%) → FACE
(an F at any stage → NON-FACE)

• A 1-feature classifier achieves 100% detection rate and about 50% false positive rate.
• A 5-feature classifier achieves 100% detection rate and 40% false positive rate (20% cumulative), using data from the previous stage.
• A 20-feature classifier achieves 100% detection rate with 10% false positive rate (2% cumulative), using data from the previous stage.

A sketch of the cascade's control flow follows.

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001
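The control flow is simple enough to state directly; all the work is in training the per-stage classifiers and thresholds, which are assumed given here:

```python
def cascade_classify(window, stages):
    """stages: list of (strong_classifier, threshold) pairs, cheapest first.
    A window is declared a face only if every stage accepts it; most
    windows are rejected by the first one- or five-feature stage, which
    is what makes the detector fast."""
    for classifier, threshold in stages:
        if classifier(window) < threshold:
            return False   # NON-FACE: stop computing immediately
    return True            # FACE: survived all stages
```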





A Real-time Face Detection System

Training faces: 4916 face images (24 x 24 pixels) plus vertical flips, for a total of 9832 faces.
Training non-faces: 350 million sub-windows from 9500 non-face images.
Final detector: a 38-layer cascaded classifier. The number of features per layer was 1, 10, 25, 25, 50, 50, 50, 75, 100, …, 200, …
The final classifier contains 6061 features.

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Accuracy of Face Detector

Performance on the MIT+CMU test set containing 130 images with 507 faces and about 75 million sub-windows.

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Comparison to Other Systems

Detection rate (%) for a given number of false detections on the MIT+CMU test set:

False detections:        10    31    50    65    78    95    110   167
Viola-Jones              76.1  88.4  91.4  92.0  92.1  92.9  93.1  93.9
Viola-Jones (voting)     81.1  89.7  92.1  93.1  93.1  93.2  93.7  93.7
Rowley-Baluja-Kanade     83.2  86.0  -     -     -     89.2  -     90.1
Schneiderman-Kanade      -     -     -     94.4  -     -     -     -

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Speed of Face Detector

Speed is proportional to the average number of features computed per sub-window. On the MIT+CMU test set, an average of 9 features out of a total of 6061 are computed per sub-window. On a 700 MHz Pentium III, a 384x288 pixel image takes about 0.067 seconds to process (15 fps). Roughly 15 times faster than Rowley-Baluja-Kanade and 600 times faster than Schneiderman-Kanade.

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Output of Face Detector on Test Images

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


More Examples

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


Conclusions

• We [they] have developed the fastest known face detector for gray scale images
• Three contributions with broad applicability:
– Cascaded classifier yields rapid classification
– AdaBoost as an extremely efficient feature selector
– Rectangle features + integral image can be used for rapid image analysis

Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001


What has allowed us to make progress?

• SIFT features
• Discriminative classifiers
• Bayesian methods
• Large databases


Tracking a human in 3D

The appearance of people can vary dramatically. People can appear in arbitrary poses. Structure is unobservable—inference from visible parts. Geometrically under-constrained.

Marker-based motion capture works, but this requires that we use markers, which we don't want, and also requires multiple cameras.

http://www.vicon.com/animation/

State of the Art

• Brightness constancy cue
– Insensitive to appearance
• Full-body tracking required multiple cameras
• Single hypothesis


State of the Art

• Single camera, multiple hypotheses
• 2D templates (no drift, but view dependent)

I(x, t) = I(x + u, 0) + η

State of the Art

• Multiple hypotheses
• Multiple cameras
• Simplified clothing, lighting and background


* No special clothing
* Monocular, grayscale, sequences (archival data)
* Unknown, cluttered, environment

Task: Infer 3D human motion from 2D images.

By Bayes' rule: p(model | cues) = p(cues | model) p(model) / p(cues)

1. Need a constraining likelihood model that is also invariant to variations in human appearance.
2. Need a prior model of how people move.
3. Posterior probability: need an effective way to explore the model space (very high dimensional) and represent ambiguities.

System components for human body tracking

• Representation for probabilistic analysis.
• Models for human motion (prior term).
• Models for human appearance (likelihood term).


* Limbs are truncated cones
* Parameter vector of joint angles and angular velocities = φ

• Posterior distribution over model parameters is often multi-modal (due to ambiguities)
• Represent the whole distribution:
– sampled representation
– each sample is a pose
– predict over time using a particle filtering approach

• Isard and Blake, 1998, "Condensation Algorithm"


Posterior: given the data so far, what do I think is the set of possible states the body could be in?

Temporal dynamics: what could each of those states become at the next time step? (Uses the prior model for human motion.)

Likelihood: how much is each of those possible states supported by the visual data at the next time step?

Posterior update: update the estimate of possible states, given the visual data.
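A sketch of one predict/update cycle of that sampled representation (a generic particle filter in the spirit of Condensation; the dynamics and likelihood models, here `dynamics` and `likelihood`, are assumptions supplied by the tracker):

```python
import numpy as np

def condensation_step(particles, weights, dynamics, likelihood, obs):
    """particles: N x d array of pose samples; weights: normalized.
    dynamics(particles): stochastic prior motion model (adds process noise).
    likelihood(obs, pose): how well a pose explains the visual data."""
    # Resample according to the current weights (the posterior so far)
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    # Predict: what could each state become at the next time step?
    particles = dynamics(particles)
    # Update: how much is each state supported by the visual data?
    weights = np.array([likelihood(obs, p) for p in particles])
    return particles, weights / weights.sum()
```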


• Representation for probabilistic analysis.
• Models for human motion (prior term).
• Models for human appearance (likelihood term).


• Only handles people walking.
• Very powerful constraint on human motion.

• Action-specific model: walking
– Training data: 3D motion capture data
– From the training set, learn the mean cycle and common modes of deviation (PCA)

[figure: mean cycle; samples with small noise; samples with large noise]

Initialize to the figure, then let go…

• Representation for probabilistic analysis.
• Models for human motion (prior term).
• Models for human appearance (likelihood term).


What do people look like? Changing background, varying shadows, occlusion, deforming clothing, low-contrast limb boundaries. What do non-people look like?

(5000 samples in each example)

[figure: edge cues, ridge cues, flow cues]

2500 samples, ~10 min/frame. Walking model.

What has allowed us to make progress?

• SIFT features
• Discriminative classifiers
• Bayesian methods
• Large datasets
• Miscellaneous advances: exploiting context


Use of context for object detection

car / pedestrian: identical local image features!

Images by Antonio Torralba

Context speeds object detection: this is what the world looks like to a face detector that doesn't take advantage of context. Can you find the face?

Antonio Torralba


The best object detection algorithms combine top-down (context) with bottom-up (local features) cues. The top-down information can help suppress false detections caused by ambiguous local information.


Feature vector for an image: the "gist" of the scene

– Compute a 12 x 30 = 360-dim. feature vector
– Or use a steerable filter bank, 6 orientations, 4 scales, averaged over 4x4 regions = 384-dim. feature vector (sketched below)
– Reduce to ~80 dimensions using PCA

Oliva & Torralba, IJCV 2001
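A sketch of the second variant: filter-bank energies averaged over a 4x4 grid. Gabor-like kernels stand in for the steerable filter bank (an assumption), and the PCA step to ~80 dimensions is left to any standard routine:

```python
import numpy as np
from scipy.ndimage import convolve

def gist_descriptor(img, kernels, grid=4):
    """img: grayscale image; kernels: list of 6 orientations x 4 scales = 24
    filter kernels. Returns 24 x 16 = 384 energies, to be reduced by PCA."""
    feats = []
    for k in kernels:
        energy = convolve(img.astype(float), k) ** 2   # squared filter output
        h, w = energy.shape
        for i in range(grid):                          # average over 4x4 regions
            for j in range(grid):
                feats.append(energy[i*h//grid:(i+1)*h//grid,
                                    j*w//grid:(j+1)*w//grid].mean())
    return np.array(feats)
```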

Low-dimensional representation for image context

[figure: images, and random noise filtered to have the same 80-dimensional representation as the images above]

"gist" is useful for object priming


Examples of learned features for bottom-up detection: apply the filter shown in the top rows and average the squared output over the regions shown in the bottom rows.


The advantage of context in object detection

Object detections without context: note the false alarms. For each type of object, we plot the single most probable detection if it is above a threshold (set to give 80% detection rate).

Object detections after suppression of false detections using context: if we know we are in a street, we can prune false positives such as chair and coffee-machine (which are hard to detect, and hence must have low thresholds to get 80% hit rate).

What has allowed us to make progress?

• SIFT features
• Discriminative classifiers
• Bayesian methods
• Large, labeled datasets


A correspondence-based approach to scene parsing

Given an image:
– Find another annotated image with a similar scene
– Find correspondence between these two images
– Warp the annotation according to the correspondence

[figure: input, support image, warped annotation, user annotation; labels: window, tree, sky, road, field, car, building, unlabeled]

Dense scene alignment using SIFT Flow for object recognition. C. Liu, J. Yuen, A. Torralba. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.


System overview

[figure: query (RGB, SIFT); nearest neighbors from the database (RGB, SIFT); SIFT flow; annotation with labels tree, sky, road, field, car, unlabeled; flow visualization code]

Dense scene alignment using SIFT Flow for object recognition. C. Liu, J. Yuen, A. Torralba. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.



System overview

[figure: as above, plus the parsing result for the query, the ground truth, and the warped nearest neighbors]

Dense scene alignment using SIFT Flow for object recognition. C. Liu, J. Yuen, A. Torralba. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.



Scene parsing results (1)

[figure: query, best match, annotation of best match, warped best match to query, parsing result, ground truth]

Scene parsing results (2)

[figure: same layout as above]


Pixel-wise performance

Our system, optimized parameters: per-pixel rate 74.75%.

[figure: pixel-wise frequency count of each class]

Dense scene alignment using SIFT Flow for object recognition. C. Liu, J. Yuen, A. Torralba. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.


Comparison

(a) Our system, optimized parameters: 74.75%
(b) Our system, no Markov random field: 66.24%
(c) Shotton et al., no Markov random field: 51.67%
(d) Our system, matching color instead of SIFT: 49.68%

J. Shotton et al., Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. ECCV, 2006


Comparison for each class

• We convert our system to a binary detector for each class and compare it with [Dalal & Triggs, CVPR 2005]
• In ROC, our system (red) outperforms theirs (blue) for most of the classes

Dense scene alignment using SIFT Flow for object recognition. C. Liu, J. Yuen, A. Torralba. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.



What has allowed us to make progress?

• SIFT features
• Discriminative classifiers
• Bayesian methods
• Non-parametric methods


Algorithm

– Pick the size of the block and the size of the overlap
– Synthesize blocks in raster order
– Search the input texture for a block that satisfies the overlap constraints (above and left); a sketch of this search follows the list
• Easy to optimize using NN search [Liang et al., '01]
– Paste the new block into the resulting texture
• use dynamic programming to compute the minimal-error boundary cut
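A sketch of the overlap-constrained block search at the heart of the algorithm (brute force here; the NN-search speedup and the dynamic-programming seam cut are omitted, and the random choice among near-best matches follows the Efros-Leung style of sampling):

```python
import numpy as np

def best_block(texture, template, mask, block, tol=0.1):
    """Find a block of the input texture whose overlap region matches
    `template` (already-synthesized pixels, marked by mask == 1).
    texture: grayscale source; template, mask: block x block arrays."""
    H, W = texture.shape
    errs = np.full((H - block, W - block), np.inf)
    for r in range(H - block):
        for c in range(W - block):
            patch = texture[r:r + block, c:c + block]
            errs[r, c] = np.sum(mask * (patch - template) ** 2)
    # Pick randomly among blocks within tolerance of the minimum error
    candidates = np.argwhere(errs <= errs.min() * (1 + tol))
    r, c = candidates[np.random.randint(len(candidates))]
    return texture[r:r + block, c:c + block]
```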


Problem: How to construct and manage a non-parametric signal prior? How to select the exemplars to use, and how to quickly find nearest-neighbor matches?

Applications: Low-level vision: noise removal, super-resolution, filling-in, texture synthesis.

References:
W. T. Freeman, E. C. Pasztor, O. T. Carmichael, Learning Low-Level Vision, International Journal of Computer Vision, 40(1), pp. 25-47, 2000. http://www.merl.com/reports/docs/TR2000-05.pdf
Alexei A. Efros and Thomas K. Leung, Texture Synthesis by Non-parametric Sampling, IEEE International Conference on Computer Vision (ICCV'99), Corfu, Greece, September 1999. http://graphics.cs.cmu.edu/people/efros/research/NPS/efros-iccv99.pdf



2009 BIRS Workshop on Computer Vision and the Internet

Rob Fergus, Rick Szeliski, Lana Lazebnik


Nearest neighbor search in high dimensions

Nearest neighbors in high dimensions, for category recognition: for instance recognition, NN for individual features works fine, but for category recognition, many times the local features are not, by themselves, a close match, due to within-class variations.

Nearest neighbor search, but taking into account our particular data. Or: tell us what questions we should be asking about our data in order to do nearest neighbor search well.

On the large database side: how do we store memories, concepts, objects in very large databases?

Large database issues. Multi-dimensional: kd-tree (but only up to ~20 dims); finding similar things in very high dimensions. Parallelism: where can we exploit it? kd-tree high-d search. Does LSH work as advertised? In practice, not as well.
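For reference, the kd-tree baseline mentioned above is a few lines with SciPy (exact search; at 128 dimensions it degrades toward a linear scan, which is exactly the point of the discussion). The database size of 30,000 echoes the SIFT experiments earlier; the data here is a random placeholder:

```python
import numpy as np
from scipy.spatial import cKDTree

database = np.random.rand(30000, 128)   # stand-in for 30,000 SIFT descriptors
tree = cKDTree(database)

query = np.random.rand(128)
dist, idx = tree.query(query, k=2)      # two nearest neighbors (for a ratio test)
```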


Problem: Nearest neighbor search in high dimensions.

Applications: Non-parametric texture synthesis and super-resolution. Image filling-in. Object recognition. Scene recognition.

References:
(Many in the CS literature: LSH, etc.)
Connelly Barnes, Eli Shechtman, Adam Finkelstein, Dan B Goldman, PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing, ACM Transactions on Graphics (Proc. SIGGRAPH), August 2009. http://www.cs.princeton.edu/gfx/pubs/Barnes_2009_PAR/patchmatch.pdf


Shai Avidan

Blind vision

Problem: Develop secure multi-party techniques for vision algorithms.

Paper abstract: Alice would like to detect faces in a collection of sensitive surveillance images she owns. Bob has a face detection algorithm that he is willing to let Alice use, for a fee, as long as she learns nothing about his detector. Alice is willing to use Bob's detector provided that he will learn nothing about her images, not even the result of the face detection operation. Blind vision is about applying secure multi-party techniques to vision algorithms so that Bob will learn nothing about the images he operates on, not even the result of his own operation, and Alice will learn nothing about the detector. The proliferation of surveillance cameras raises privacy concerns that can be addressed by secure multi-party techniques and their adaptation to vision algorithms.

Applications: Secure, distributed image analysis.

References:
S. Avidan and M. Butman, Blind Vision, European Conference on Computer Vision (ECCV), Graz, Austria, 2006. http://www.merl.com/reports/docs/TR2006-006.pdf


Deva Ramanan

Evaluate easily over a powerset of all segmentations.

Deva Ramanan: wants a fast and efficient way to search over all possible segmentations of an image, scoring each one against some model.

http://www.di.ens.fr/~russell/papers/Russell06.pdf


Problem: Evaluate some segmentation-dependent function over (some approximation to) all possible segmentations. Note: this is different from bottom-up segmentation, which I would not recommend as a research project.

Applications: Image understanding.

References:
Deva's home page: http://www.ics.uci.edu/~dramanan/
Using Multiple Segmentations to Discover Objects and their Extent in Image Collections, Bryan Russell, Alexei A. Efros, Josef Sivic, Bill Freeman, Andrew Zisserman, in CVPR 2006. http://people.csail.mit.edu/brussell/research/proj/mult_seg_discovery/index.html


Alyosha Efros

Efros comments

Alyosha: non-boolean retrieval of a large dataset; i.e., it's not logical operations we want to retrieve, but real-valued numbers.

Alyosha: the needle-in-the-haystack problem. Find signal clusters/characteristics when there's lots of noise. Find the patterns, ignore the noise. See the picture of the four of us with hats and determine that hats are what's in common.

Alyosha: we need to find something new to generalize from graphical models. Those were good for toy problems where there were lots of conditional independencies, but now we don't have that. We want some other model, something that provides the abstraction, maybe, that only a few of these conditional independencies are active at any one time (like sparse coding). Sort of similar to higher-order cliques.


David Lowe

We need better features. An artist can draw the end of an elephant's trunk, and you know immediately what it is, but our features don't capture that similarity at all.

Learning of features from images: what is a natural encoding of images? As a warning for what approach not to take: don't bother learning translation invariance or rotation invariance. So a little bit of supervision is OK.


Computer vision academic culture

No more "if only" papers.
End-to-end empirical orientation.
There is a certain overhead in coming up to speed on the filters and representations.
Need dataset validation.
The competitive conferences have 20-25% acceptance rates; other conferences have little impact. The competitive conferences: CVPR, ICCV, ECCV, NIPS.
Thus: best to collaborate.


People at MIT to work with

Edward Adelson—Brain and Cognitive Sciences, material perception in humans and machines; multi-resolution image representations.
Fredo Durand—EECS, computational photography, computer graphics.
Bill Freeman—EECS, computational photography, computer vision.
John Fisher—CSAIL, machine learning, computer vision.
Polina Golland—EECS, medical applications.
Eric Grimson—EECS, surveillance, medical applications.
Berthold Horn—EECS, computed imaging.
Tommy Poggio—Brain and Cognitive Sciences, machine learning, computer vision, inspired by and modeling human vision.
Ramesh Raskar—Media Lab, computational photography.
Antonio Torralba—EECS, object recognition, scene interpretation.


A computer graphics application of nearest-neighbor finding in high dimensions

The image database

• We have collected ~6 million images from Flickr based on keyword and group searches
– typical image size is 500x375 pixels
– 720GB of disk space (jpeg compressed)


Image representation

[figure: original image; GIST (Oliva and Torralba '01); color layout]

Obtaining semantically coherent themes

We further break up the collection into themes of semantically coherent scenes: train SVM-based classifiers from 1-2k training images [Oliva and Torralba, 2001].

Basic camera motions

Starting from a single image, find a sequence of images to simulate a camera motion: forward motion, camera rotation, camera pan.

Scene matching with camera view transformations: translation.
1. Move camera. 2. View from the virtual camera. 3. Find a match to fill the missing pixels. 4. Locally align images. 5. Find a seam. 6. Blend in the gradient domain.

Scene matching with camera view transformations: camera rotation.
1. Rotate camera. 2. View from the virtual camera. 3. Find a match to fill in the missing pixels. 4. Stitched rotation. 5. Display on a cylinder.

More "infinite" images - camera translation.

Virtual space as an image graph

• Nodes represent images
• Edges represent particular motions: forward, rotate (left/right), pan (left/right)
• Edge cost is given by the cost of the image match under the particular transformation

Kaneva, Sivic, Torralba, Avidan, and Freeman, Infinite Images, to appear in Proceedings of the IEEE.


Virtual image space laid out in 3D

Kaneva, Sivic, Torralba, Avidan, and Freeman, Infinite Images, to appear in Proceedings of the IEEE.


Outline

• About me
• Computer vision applications
• Computer vision techniques and problems:
– Low-level vision: underdetermined problems
– High-level vision: combinatorial problems
– Miscellaneous problems


Problem: Inference in Markov random fields. Want to handle higher-order clique potentials, high-dimensional state variables, and real-valued state variables.

Applications: Low-level vision: noise removal, super-resolution, filling-in, texture synthesis.

References:
Pushmeet Kohli, Lubor Ladicky, Philip Torr, Robust Higher Order Potentials for Enforcing Label Consistency. In: International Journal of Computer Vision, 2009. http://research.microsoft.com/en-us/um/people/pkohli/papers/klt_IJCV09.pdf

