
Seeing the World Behind the Image

Spatial Layout for 3D Scene Understanding

Derek Hoiem
July 10, 2007
Robotics Institute
Carnegie Mellon University
Thesis Committee: Alexei A. Efros, Martial Hebert,
Rahul Sukthankar, Takeo Kanade, William Freeman
Scene Understanding
The World Behind the Image
3D Spatial Layout

[Image overlay labels: SKY, VERTICAL, SUPPORT]

• Description of 3D Surfaces
• Occlusion Relationships
• Camera Viewpoint & Objects
3D Spatial Layout


• Description of 3D Surfaces
• Occlusion Relationships
• Camera Viewpoint & Objects
3D Spatial Layout

Car Car Person

• Description of 3D Surfaces
• Occlusion Relationships
• Camera Viewpoint & Objects
Recent Work in 3D

[Oliva & Torralba 2001]

[Saxena, Chung & Ng 2005]

[Torralba, Murphy & Freeman 2003]


Our Main Challenge

• Recovering 3D geometry from a single 2D projection
• Infinite number of possible solutions!


Our World is Structured

Abstract World Our World

Image Credit (left): F. Cunin and M.J. Sailor, UCSD
Early Work in 3D Scene Understanding

[Guzman 1968] [Ohta & Kanade 1978]

• Hansen & Riseman 1978 (VISIONS)


• Barrow & Tenenbaum 1978 (Intrinsic Images)
• Brooks 1979 (ACRONYM)
• Marr 1982 (2½ D Sketch)
Learn the Structure of the World


Infer Most Likely Scene

Unlikely Likely
Description of 3D Surfaces

Goal: Label image into 7 Geometric Classes:


• Support

• Vertical
– Planar: facing Left (←), Center (↑), Right (→)
– Non-planar: Solid (X), Porous or wiry (O)

• Sky
Use All Available Cues

Color, texture, image location

Vanishing points, lines

Texture gradient
Get Good Spatial Support

50x50 Patch 50x50 Patch


Image Segmentation

• Single segmentation won’t work

• Solution: multiple segmentations
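As a rough illustration of the idea (not the thesis implementation), varying the scale of a standard graph-based segmenter yields multiple candidate segmentations; all parameter values below are placeholders:

import numpy as np
from skimage import data
from skimage.segmentation import felzenszwalb

image = data.astronaut()  # any RGB image as a stand-in

# Different scales give coarse-to-fine partitions; no single one is
# "correct", but good segments tend to appear in at least one of them.
segmentations = [felzenszwalb(image, scale=s, sigma=0.8, min_size=50)
                 for s in (50, 100, 200, 400)]
print([len(np.unique(seg)) for seg in segmentations])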


Labeling Segments

For each segment, compute:
– P(good segment | data)
– P(label | good segment, data)
Image Labeling
Labeled Segmentations

Labeled Pixels

P(label | data) ≈ Σ_segments P(good segment | data) · P(label | good segment, data)
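A minimal Python sketch of this marginalization (array layouts and names are hypothetical, chosen only for illustration):

import numpy as np

def pixel_label_confidences(segmentations, p_good, p_label):
    # P(label | data) ~ sum over segments of
    #   P(good segment | data) * P(label | good segment, data)
    H, W = segmentations[0].shape
    conf = np.zeros((H, W, 3))          # ground / vertical / sky
    norm = np.zeros((H, W))
    for seg, good, lab in zip(segmentations, p_good, p_label):
        for i in np.unique(seg):
            mask = (seg == i)
            conf[mask] += good[i] * lab[i]   # homogeneity-weighted vote
            norm[mask] += good[i]
    return conf / norm[..., None]       # normalize per pixel

Here segmentations is a list of (H, W) integer segment maps, p_good[k][i] is P(segment i of segmentation k is good), and p_label[k][i] is a length-3 confidence vector over the main geometric classes.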
Confidences from Logistic AdaBoost with Decision Trees
[Decision-tree figure: weak learners branch on cues such as "High in image?", "Gray?", "Smooth?", "Green?", "Many long lines?", "Very high vanishing point?", "Blue?"; leaves output confidences over Ground / Vertical / Sky, i.e., P(label | good segment, data)]

[Collins et al. 2002]
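The confidences are trained with Logistic AdaBoost over short decision trees [Collins et al. 2002]; as a rough stand-in (not the thesis code), scikit-learn's gradient boosting also combines shallow trees under a logistic loss, and its predict_proba plays the role of P(label | good segment, data). The feature vectors below are random placeholders for the per-segment cues (color, texture, location, vanishing-point statistics):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # per-segment cue vectors (placeholder)
y = rng.integers(0, 3, size=500)    # 0 = ground, 1 = vertical, 2 = sky

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
clf.fit(X, y)
print(clf.predict_proba(X[:5]).round(2))   # P(label | good segment, data)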


Surface Confidence Maps

Input Image Most Likely Labels

Support Vertical Sky


Surface Estimates: Outdoor

Avg. Accuracy
Main Class: 88%
Subclass: 62%

Input Image Ground Truth Our Result


Surface Estimates: Indoor

Avg. Accuracy
Main Class: 93%
Subclass: 76%

Input Image Ground Truth Our Result


Automatic Photo Popup
Labeled Image → Fit Ground-Vertical Boundary with Line Segments → Form Segments into Polylines → Cut and Fold → Final Pop-up Model

[Hoiem Efros Hebert 2005]
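The geometry behind the cut-and-fold step can be sketched as follows, assuming a camera with focal length f (pixels), height cam_h above a flat ground plane, horizon at row v0, and principal point column cx; none of these values come from the slides:

import numpy as np

def ground_point_3d(u, v, f, v0, cam_h, cx):
    # Back-project an image point (u, v) that lies on the ground plane.
    z = f * cam_h / (v - v0)          # valid for rows below the horizon (v > v0)
    x = (u - cx) * z / f
    return np.array([x, -cam_h, z])   # camera-centered coordinates

# "Fold": pixels above the fitted ground-vertical boundary in a column share
# the depth of their ground-contact point, forming a vertical plane.
print(ground_point_3d(u=320, v=420, f=500.0, v0=240.0, cam_h=1.6, cx=320.0))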


Robot Navigation

[Nabbe Hoiem Hebert Efros 2006]


Robot Navigation
Image Ground Truth
Occlusion Reasoning is Necessary

Ground Truth 3D Model


Recover Major Occlusions
Prior Work: Finding Boundaries

Input Image NCuts Segmentation Pb Boundaries

NCuts: [Cour et al. 2004] Pb: [Martin et al. 2002]


Segmentation into Physical Boundaries
Prior Work: Figure/Ground Assignment
• Line labeling approach
– Focus on junctions

Guzman 1968

also [Clowes 1971, Huffman 1971, Waltz 1975, …, Saund 2006]


Prior Work: Figure/Ground Assignment

Input Image | Pb Boundaries | Human Figure/Ground Boundaries | Goal

Figure/Ground Accuracy, Shapemes + CRF:
Pb Boundaries 68.9%, Human Boundaries 78.3%

Boundary Continuity/Junction Cues


Shape Cues
[Ren et al. 2006]
Recover Major Occlusions

Occlusion Boundaries Inferred Depth


Start with Oversegmentation

Initial Segmentation: regions R1 and R2. Occlusion boundary between them?
2D Cues for Occlusions

Region cues: Color and Texture | Boundary cues: Strength and Continuity
3D Surface Clues for Occlusions

Support Planar Porous Solid Sky


Surface Labels Geometric T-Junction


3D Depth Cues for Occlusion

Surfaces | Initial Boundaries | Depth Underestimate | Depth Overestimate
Illustration of Depth Range

SKY

SUPPORT

Image Depth (Min) Depth (Max)
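Illustrative numbers only (assumed camera model, not values from the slides): a region's depth can be bounded by where its ground contact could be. If the contact is occluded, the true contact row lies below the lowest visible row, so the object can only be nearer than the lowest-visible-row estimate:

def ground_depth(v, f=500.0, v0=240.0, cam_h=1.6):
    # Depth of a ground contact point imaged at row v (v > v0, the horizon row).
    return f * cam_h / (v - v0)

lowest_visible_row = 400     # region's contact is at or below this row
nearest_contact_row = 480    # e.g., the occluder's own ground contact
d_max = ground_depth(lowest_visible_row)    # contact exactly where visibility ends
d_min = ground_depth(nearest_contact_row)   # contact hidden, object is nearer
print(f"depth range: [{d_min:.1f}, {d_max:.1f}] m")   # [3.3, 5.0] m here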


Gradual Occlusion Inference

Initial Segmentation Final Boundaries

Initial Depth (Min) Initial Depth (Max)


Gradual Occlusion Inference

P(occlusion)

Soft Boundary Map Stage 1 Result


Gradual Occlusion Inference

P(occlusion) + CRF(continuity, closure)

Soft-Max Boundary Map Stage 2 Result


Gradual Occlusion Inference

P(occlusion) + CRF(continuity, closure, surfaces)

Soft-Max Boundary Map Stage 3 Result


Final Estimate

Boundaries, Foreground/Background, Contact | Depth (Min) | Depth (Max)


Evaluation

• Training: 50 images
• Testing: 250 images (50 quantitative)
Occlusion vs. Non-Occlusion, Foreground/Background Accuracy (ours):

            Edge/Region Cues   + 3D Cues   With CRF
Stage 1          58.7%           71.7%       n/a
Stage 2          65.4%           75.6%       77.3%
Stage 3          68.2%           77.1%       79.9%

For comparison (Ren et al. 2006, Corel images), Shapemes + CRF:
Pb Boundaries 68.9%, Human Boundaries 78.3%
Occlusion Result

Boundaries, Foreground/Background, Contact | Depth (Min) | Depth (Max)


Occlusion Result

Boundaries, Foreground/Background, Contact | Depth (Min) | Depth (Max)


3D Model with Occlusions

3D Model without Occlusion Reasoning | 3D Model with Occlusion Reasoning
Recovering Viewpoint and Objects

Objects

Viewpoint 3D Surfaces
Results of a 2D Pedestrian Detector
[Detections marked: true detections, false detections, and missed pedestrians]
Detector from [Dalal Triggs 2005]
2D Contextual Reasoning

[Kumar Hebert 2005]

[Torralba Murphy Freeman 2004]

• Winn & Shotton 2006 • Carbonetto, de Freitas & Barnard 2004
• Fink & Perona 2003 • He, Zemel & Carreira-Perpiñán 2004
Reasoning within the 3D Scene

Close vs. Not Close
Camera Viewpoint

Image

Image Coordinates World Coordinates


Object Size ↔ Camera Viewpoint

Input Image Loose Viewpoint Prior


Object Size ↔ Camera Viewpoint

Input Image Loose Viewpoint Prior


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Camera Viewpoint Object Height

Input Image 2D Object Heights

3D Object Heights
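The relation behind these slides can be written in one line (zero camera tilt assumed; all numbers are illustrative): for an object resting on the ground plane, its 3D height follows from its 2D height, its contact row, the horizon row, and the camera height:

def object_height_3d(h_image, v_bottom, v0, cam_h):
    # h3d = cam_h * h_image / (v_bottom - v0), for contact rows below the horizon
    return cam_h * h_image / (v_bottom - v0)

# An 85-pixel-tall pedestrian whose feet are 80 rows below the horizon,
# seen from a 1.6 m camera height, is about 1.7 m tall:
print(object_height_3d(h_image=85, v_bottom=320, v0=240, cam_h=1.6))

Run the same relation in reverse and detections of known-height objects constrain the horizon and camera height, which is exactly the coupling the following slides exploit.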
Viewpoint from Scene Matching

LabelMe with Viewpoint Annotations

Input Image

+

What do surfaces and viewpoint say about objects?

Image P(surfaces) P(viewpoint)

P(object) P(object | surfaces) P(object | viewpoint)


What do surfaces and viewpoint say about objects?

Image P(surfaces) P(viewpoint)

P(object | surfaces, viewpoint)


P(object)
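One simple way to realize this combination (a sketch, not necessarily the thesis formulation): if surfaces and viewpoint are conditionally independent given the object, Bayes' rule gives P(object | surfaces, viewpoint) ∝ P(object | surfaces) · P(object | viewpoint) / P(object). The probabilities below are invented:

def fuse(p_o, p_o_given_s, p_o_given_v):
    # Unnormalized posterior odds for object present vs. absent.
    yes = p_o_given_s * p_o_given_v / p_o
    no = (1 - p_o_given_s) * (1 - p_o_given_v) / (1 - p_o)
    return yes / (yes + no)

print(fuse(p_o=0.1, p_o_given_s=0.4, p_o_given_v=0.3))  # about 0.72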
Input to Our Algorithm
Object Detection (Local Car Detector, Local Ped Detector) | Surface Estimates | Initial Viewpoint
Exact Inference over Tree with Belief Propagation
[Tree-structured model: the viewpoint node connects to object nodes o1 … on, each with local object evidence; each object connects to a local surface node (s1 … sn), each with local surface evidence]
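A minimal sum-product sketch for the star-shaped core of this model (a discretized viewpoint node connected to object nodes o1 … on, each carrying local detector evidence; all distributions below are invented):

import numpy as np

rng = np.random.default_rng(1)
K, n, S = 20, 4, 2                 # viewpoint bins, objects, object states
prior = np.full(K, 1.0 / K)        # P(viewpoint)
p_obj = rng.dirichlet(np.ones(S), size=(n, K))  # P(o_i | viewpoint), shape (n, K, S)
evidence = rng.random((n, S))                   # local detector likelihoods

# Message from each object node to the viewpoint: sum_o P(o | theta) * L(o)
messages = np.einsum('nks,ns->nk', p_obj, evidence)
posterior = prior * messages.prod(axis=0)
posterior /= posterior.sum()       # exact marginal P(viewpoint | evidence)
print(posterior.argmax())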
Improved Viewpoint Estimate
Viewpoint Initial vs. Viewpoint Final
[Plots: likelihood as a function of horizon position and camera height]
Improved Object Estimate
                 Initial (Local)   Final (Global)
Car Detection      4 TP / 2 FP       4 TP / 1 FP
Ped Detection      3 TP / 2 FP       4 TP / 0 FP
Experiments on LabelMe Dataset

• Testing with LabelMe dataset


– Cars as small as 14 pixels
– Peds as small as 36 pixels
More Tasks → Better Detection

Local Detector from Murphy et al. 2003

[ROC curves for Car Detection and Pedestrian Detection: detection rate vs. false positives per image, comparing Objects Only, Objects + Geom, Objects + View, and All Information]
[Hoiem Efros Hebert 2006]


Good Detectors Become Better

Local Detector from Dalal-Triggs 2005

[ROC curves for Car Detection and Pedestrian Detection: All Information vs. Objects Only]
Better Detectors → Better Viewpoint

                Horizon Prior   Using 2003 Local Detector   Using 2005 Local Detector
Median Error:       8.5%                 3.8%                       3.0%
90% Bound:          n/a                  n/a                        n/a
More is Better

More objects → Better viewpoint estimates


Detect Cars Only 7.3% Error
Detect Peds Only 5.0% Error
Detect Both 3.8% Error

Better viewpoint → Better object detection

10% fewer false positives at the same detection rate
Results
Car: TP / FP Ped: TP / FP

Initial: 6 TP / 1 FP Final: 9 TP / 0 FP
Results
Car: TP / FP Ped: TP / FP

Initial: 3 TP / 3 FP Final: 5 TP / 1 FP
Putting Objects in Perspective

[3D visualization: detected pedestrians and a car placed in the scene, with distances in meters]
Geometrically Coherent Image Interpretation

[Diagram: Surfaces and Occlusions pass surface maps and support estimates to Viewpoint/Size Reasoning, which yields Viewpoint and Objects]

Geometrically Coherent Image Interpretation

[Diagram, expanded: Surfaces, Occlusions, and Viewpoint and Objects now exchange estimates in both directions (surface maps, support, depth and boundaries, boundary maps, horizon and object maps) via Viewpoint/Size Reasoning]


Geometrically Coherent Image Interpretation

Input Surfaces

Occlusion Boundaries Viewpoint and Objects




Next Steps

• More robust and comprehensive high-level reasoning
• Learn perceptual similarity and general appearance models
Conclusions
• One image contains much 3D information
• Learn statistical models of the structure of our world from training images
• Important aspects of approach:
  – Use all available cues
  – Delay decisions
  – Think of vision as one 3D scene understanding problem
Video
Thank you

Acknowledgements
• Committee: Alyosha, Martial, Rahul, Takeo, and Bill
• Practice Presentation: Srinivas, Tom, Alex
Vision as Scene Understanding

[Ohta & Kanade 1978]

• Guzman (SEE), 1968
• Hansen & Riseman (VISIONS), 1978
• Ohta & Kanade, 1978
• Barrow & Tenenbaum, 1978
• Brooks (ACRONYM), 1979
• Marr (2½ D sketch), 1982
Vision as Scene Understanding

[Guzman 1968] [Ohta Kanade 1978]


Results
Car: TP / FP Ped: TP / FP
Failures
Failures: Reflections, Rare Viewpoint

Input Image Ground Truth Our Result


Results
Car: TP / FP Ped: TP / FP

Initial: 1 TP / 23 FP Final: 0 TP / 10 FP

Local Detector from [Murphy-Torralba-Freeman 2003]


Results
Car: TP / FP Ped: TP / FP

Initial: 1 TP / 5 FP Final: 5 TP / 2 FP
How do we get robust scene priors?

Hill Standing on Step


How to find occluding contours?
Other slides
Overview of Our Algorithm

Input Image → Multiple Segmentations → Surface Estimates (using Learned Models) → Final Labels
Estimating surface properties

• We want to know:
  – Is a segment good?
    P(good segment | data)
  – If so, what is the surface label?
    P(label | good segment, data)

• Learn these likelihoods from training images


Results

Input Image Ground Truth Our Result


Results

Input Image Ground Truth Our Result


Average Accuracy
Main Class: 88.1%
Subclasses: 61.5%
Experiments: Input Image
Experiments: Ground Truth
Experiments: Our Result
Surface Estimates: Paintings

Input Image Our Result


Object Pasting

[Lalonde et al. 2007]


Object Pasting

Before After
Object Pasting

Before After
Are Surfaces Enough?
