
Seeing the World Behind the Image

Spatial Layout for 3D Scene Understanding

Derek Hoiem
July 10, 2007
Robotics Institute
Carnegie Mellon University
Thesis Committee: Alexei A. Efros, Martial Hebert,
Rahul Sukthankar, Takeo Kanade, William Freeman
Scene Understanding
The World Behind the Image
3D Spatial Layout

[Image overlay labels: SKY, VERTICAL, SUPPORT]

• Description of 3D Surfaces
• Occlusion Relationships
• Camera Viewpoint & Objects
3D Spatial Layout


• Description of 3D Surfaces
• Occlusion Relationships
• Camera Viewpoint & Objects
3D Spatial Layout

Car Car Person

• Description of 3D Surfaces
• Occlusion Relationships
• Camera Viewpoint & Objects
Recent Work in 3D

[Oliva & Torralba 2001]

[Saxena, Chung & Ng 2005]

[Torralba, Murphy & Freeman 2003]


Our Main Challenge

• Recovering 3D geometry from a single 2D projection
• Infinite number of possible solutions!


Our World is Structured

Abstract World Our World

Image Credit (left): F. Cunin and M.J. Sailor, UCSD
Early Work in 3D Scene Understanding

[Guzman 1968] [Ohta & Kanade 1978]

• Hansen & Riseman 1978 (VISIONS)


• Barrow & Tenenbaum 1978 (Intrinsic Images)
• Brooks 1979 (ACRONYM)
• Marr 1982 (2½ D Sketch)
Learn the Structure of the World


Infer Most Likely Scene

Unlikely Likely
Description of 3D Surfaces

Goal: Label image into 7 Geometric Classes:


• Support

• Vertical
– Planar: facing Left (←), Center (↑), Right (→)
– Non-planar: Solid (X), Porous or wiry (O)

• Sky
Use All Available Cues

Color, texture, image location

Vanishing points, lines

Texture gradient
Get Good Spatial Support

50x50 Patch 50x50 Patch


Image Segmentation

• Single segmentation won’t work

• Solution: multiple segmentations
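As a rough illustration of the idea (not the thesis implementation), varying the scale of a standard graph-based segmenter yields multiple candidate segmentations; all parameter values below are placeholders:

import numpy as np
from skimage import data
from skimage.segmentation import felzenszwalb

image = data.astronaut()  # any RGB image as a stand-in

# Different scales give coarse-to-fine partitions; no single one is
# "correct", but good segments tend to appear in at least one of them.
segmentations = [felzenszwalb(image, scale=s, sigma=0.8, min_size=50)
                 for s in (50, 100, 200, 400)]
print([len(np.unique(seg)) for seg in segmentations])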


Labeling Segments

For each segment, compute:
– P(good segment | data)
– P(label | good segment, data)
Image Labeling
Labeled Segmentations

Labeled Pixels

P(label | data) ≈ Σ_segments P(good segment | data) · P(label | good segment, data)
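A minimal Python sketch of this marginalization (array layouts and names are hypothetical, chosen only for illustration):

import numpy as np

def pixel_label_confidences(segmentations, p_good, p_label):
    # P(label | data) ~ sum over segments of
    #   P(good segment | data) * P(label | good segment, data)
    H, W = segmentations[0].shape
    conf = np.zeros((H, W, 3))          # ground / vertical / sky
    norm = np.zeros((H, W))
    for seg, good, lab in zip(segmentations, p_good, p_label):
        for i in np.unique(seg):
            mask = (seg == i)
            conf[mask] += good[i] * lab[i]   # homogeneity-weighted vote
            norm[mask] += good[i]
    return conf / norm[..., None]       # normalize per pixel

Here segmentations is a list of (H, W) integer segment maps, p_good[k][i] is P(segment i of segmentation k is good), and p_label[k][i] is a length-3 confidence vector over the main geometric classes.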
Confidences from Logistic AdaBoost with Decision Trees
[Decision-tree figure: weak learners branch on cues such as "High in image?", "Gray?", "Smooth?", "Green?", "Many long lines?", "Very high vanishing point?", "Blue?"; leaves output confidences over Ground / Vertical / Sky, i.e., P(label | good segment, data)]

[Collins et al. 2002]
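The confidences are trained with Logistic AdaBoost over short decision trees [Collins et al. 2002]; as a rough stand-in (not the thesis code), scikit-learn's gradient boosting also combines shallow trees under a logistic loss, and its predict_proba plays the role of P(label | good segment, data). The feature vectors below are random placeholders for the per-segment cues (color, texture, location, vanishing-point statistics):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # per-segment cue vectors (placeholder)
y = rng.integers(0, 3, size=500)    # 0 = ground, 1 = vertical, 2 = sky

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
clf.fit(X, y)
print(clf.predict_proba(X[:5]).round(2))   # P(label | good segment, data)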


Surface Confidence Maps

Input Image Most Likely Labels

Support Vertical Sky


Surface Estimates: Outdoor

Avg. Accuracy
Main Class: 88%
Subclass: 62%

Input Image Ground Truth Our Result


Surface Estimates: Indoor

Avg. Accuracy
Main Class: 93%
Subclass: 76%

Input Image Ground Truth Our Result


Automatic Photo Popup
Labeled Image → Fit Ground-Vertical Boundary with Line Segments → Form Segments into Polylines → Cut and Fold → Final Pop-up Model

[Hoiem Efros Hebert 2005]
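The geometry behind the cut-and-fold step can be sketched as follows, assuming a camera with focal length f (pixels), height cam_h above a flat ground plane, horizon at row v0, and principal point column cx; none of these values come from the slides:

import numpy as np

def ground_point_3d(u, v, f, v0, cam_h, cx):
    # Back-project an image point (u, v) that lies on the ground plane.
    z = f * cam_h / (v - v0)          # valid for rows below the horizon (v > v0)
    x = (u - cx) * z / f
    return np.array([x, -cam_h, z])   # camera-centered coordinates

# "Fold": pixels above the fitted ground-vertical boundary in a column share
# the depth of their ground-contact point, forming a vertical plane.
print(ground_point_3d(u=320, v=420, f=500.0, v0=240.0, cam_h=1.6, cx=320.0))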


Robot Navigation

[Nabbe Hoiem Hebert Efros 2006]


Robot Navigation
Image Ground Truth
Occlusion Reasoning is Necessary

Ground Truth 3D Model


Recover Major Occlusions
Prior Work: Finding Boundaries

Input Image NCuts Segmentation Pb Boundaries

NCuts: [Cour et al. 2004] Pb: [Martin et al. 2002]


Segmentation into Physical Boundaries
Prior Work: Figure/Ground Assignment
• Line labeling approach
– Focus on junctions

Guzman 1968

also [Clowes 1971, Huffman 1971, Waltz 1975, …, Saund 2006]


Prior Work: Figure/Ground Assignment

Input Image | Pb Boundaries | Human Figure/Ground Boundaries | Goal

Figure/Ground Accuracy, Shapemes + CRF:
Pb Boundaries 68.9%, Human Boundaries 78.3%

Boundary Continuity/Junction Cues


Shape Cues
[Ren et al. 2006]
Recover Major Occlusions

Occlusion Boundaries Inferred Depth


Start with Oversegmentation

Initial Segmentation: regions R1 and R2. Occlusion boundary between them?
2D Cues for Occlusions

Region cues: Color and Texture | Boundary cues: Strength and Continuity
3D Surface Clues for Occlusions

Support Planar Porous Solid Sky


Surface Labels Geometric T-Junction


3D Depth Cues for Occlusion

Surfaces | Initial Boundaries | Depth Underestimate | Depth Overestimate
Illustration of Depth Range

SKY

SUPPORT

Image Depth (Min) Depth (Max)
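Illustrative numbers only (assumed camera model, not values from the slides): a region's depth can be bounded by where its ground contact could be. If the contact is occluded, the true contact row lies below the lowest visible row, so the object can only be nearer than the lowest-visible-row estimate:

def ground_depth(v, f=500.0, v0=240.0, cam_h=1.6):
    # Depth of a ground contact point imaged at row v (v > v0, the horizon row).
    return f * cam_h / (v - v0)

lowest_visible_row = 400     # region's contact is at or below this row
nearest_contact_row = 480    # e.g., the occluder's own ground contact
d_max = ground_depth(lowest_visible_row)    # contact exactly where visibility ends
d_min = ground_depth(nearest_contact_row)   # contact hidden, object is nearer
print(f"depth range: [{d_min:.1f}, {d_max:.1f}] m")   # [3.3, 5.0] m here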


Gradual Occlusion Inference

Initial Segmentation Final Boundaries

Initial Depth (Min) Initial Depth (Max)


Gradual Occlusion Inference

P(occlusion)

Soft Boundary Map Stage 1 Result


Gradual Occlusion Inference

P(occlusion) + CRF(continuity, closure)

Soft-Max Boundary Map Stage 2 Result


Gradual Occlusion Inference

P(occlusion) + CRF(continuity, closure, surfaces)

Soft-Max Boundary Map Stage 3 Result


Final Estimate

Boundaries, Foreground/Background, Contact | Depth (Min) | Depth (Max)


Evaluation

• Training: 50 images
• Testing: 250 images (50 quantitative)
Occlusion vs. Non-Occlusion, Foreground/Background Accuracy (ours):

            Edge/Region Cues   + 3D Cues   With CRF
Stage 1          58.7%           71.7%       n/a
Stage 2          65.4%           75.6%       77.3%
Stage 3          68.2%           77.1%       79.9%

For comparison (Ren et al. 2006, Corel images), Shapemes + CRF:
Pb Boundaries 68.9%, Human Boundaries 78.3%
Occlusion Result

Boundaries, Foreground/Background, Contact | Depth (Min) | Depth (Max)


Occlusion Result

Boundaries, Foreground/Background, Contact | Depth (Min) | Depth (Max)


3D Model with Occlusions

3D Model without Occlusion Reasoning | 3D Model with Occlusion Reasoning
Recovering Viewpoint and Objects

Objects

Viewpoint 3D Surfaces
Results of a 2D Pedestrian Detector
[Detections marked: true detections, false detections, and missed pedestrians]
Detector from [Dalal Triggs 2005]
2D Contextual Reasoning

[Kumar Hebert 2005]

[Torralba Murphy Freeman 2004]

• Winn & Shotton 2006 • Carbonetto, de Freitas & Barnard 2004
• Fink & Perona 2003 • He, Zemel & Carreira-Perpiñán 2004
Reasoning within the 3D Scene

Close vs. Not Close
Camera Viewpoint

Image

Image Coordinates World Coordinates


Object Size ↔ Camera Viewpoint

Input Image Loose Viewpoint Prior


Object Size ↔ Camera Viewpoint

Input Image Loose Viewpoint Prior


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Object Size ↔ Camera Viewpoint

Object Position/Sizes Viewpoint


Camera Viewpoint Object Height

Input Image 2D Object Heights

3D Object Heights
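The relation behind these slides can be written in one line (zero camera tilt assumed; all numbers are illustrative): for an object resting on the ground plane, its 3D height follows from its 2D height, its contact row, the horizon row, and the camera height:

def object_height_3d(h_image, v_bottom, v0, cam_h):
    # h3d = cam_h * h_image / (v_bottom - v0), for contact rows below the horizon
    return cam_h * h_image / (v_bottom - v0)

# An 85-pixel-tall pedestrian whose feet are 80 rows below the horizon,
# seen from a 1.6 m camera height, is about 1.7 m tall:
print(object_height_3d(h_image=85, v_bottom=320, v0=240, cam_h=1.6))

Run the same relation in reverse and detections of known-height objects constrain the horizon and camera height, which is exactly the coupling the following slides exploit.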
Viewpoint from Scene Matching

LabelMe with Viewpoint Annotations

Input Image

+

What do surfaces and viewpoint say about objects?

Image P(surfaces) P(viewpoint)

P(object) P(object | surfaces) P(object | viewpoint)


What do surfaces and viewpoint say about objects?

Image P(surfaces) P(viewpoint)

P(object | surfaces, viewpoint)


P(object)
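One simple way to realize this combination (a sketch, not necessarily the thesis formulation): if surfaces and viewpoint are conditionally independent given the object, Bayes' rule gives P(object | surfaces, viewpoint) ∝ P(object | surfaces) · P(object | viewpoint) / P(object). The probabilities below are invented:

def fuse(p_o, p_o_given_s, p_o_given_v):
    # Unnormalized posterior odds for object present vs. absent.
    yes = p_o_given_s * p_o_given_v / p_o
    no = (1 - p_o_given_s) * (1 - p_o_given_v) / (1 - p_o)
    return yes / (yes + no)

print(fuse(p_o=0.1, p_o_given_s=0.4, p_o_given_v=0.3))  # about 0.72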
Input to Our Algorithm
Object Detection (Local Car Detector, Local Ped Detector) | Surface Estimates | Initial Viewpoint
Exact Inference over Tree with Belief Propagation
[Tree-structured model: the viewpoint node connects to object nodes o1 … on, each with local object evidence; each object connects to a local surface node (s1 … sn), each with local surface evidence]
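A minimal sum-product sketch for the star-shaped core of this model (a discretized viewpoint node connected to object nodes o1 … on, each carrying local detector evidence; all distributions below are invented):

import numpy as np

rng = np.random.default_rng(1)
K, n, S = 20, 4, 2                 # viewpoint bins, objects, object states
prior = np.full(K, 1.0 / K)        # P(viewpoint)
p_obj = rng.dirichlet(np.ones(S), size=(n, K))  # P(o_i | viewpoint), shape (n, K, S)
evidence = rng.random((n, S))                   # local detector likelihoods

# Message from each object node to the viewpoint: sum_o P(o | theta) * L(o)
messages = np.einsum('nks,ns->nk', p_obj, evidence)
posterior = prior * messages.prod(axis=0)
posterior /= posterior.sum()       # exact marginal P(viewpoint | evidence)
print(posterior.argmax())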
Improved Viewpoint Estimate
Viewpoint Initial vs. Viewpoint Final
[Plots: likelihood as a function of horizon position and camera height]
Improved Object Estimate
                 Initial (Local)   Final (Global)
Car Detection      4 TP / 2 FP       4 TP / 1 FP
Ped Detection      3 TP / 2 FP       4 TP / 0 FP
Experiments on LabelMe Dataset

• Testing with LabelMe dataset


– Cars as small as 14 pixels
– Peds as small as 36 pixels
More Tasks → Better Detection

Local Detector from Murphy et al. 2003

[ROC curves for Car Detection and Pedestrian Detection: detection rate vs. false positives per image, comparing Objects Only, Objects + Geom, Objects + View, and All Information]
[Hoiem Efros Hebert 2006]


Good Detectors Become Better

Local Detector from Dalal-Triggs 2005

[ROC curves for Car Detection and Pedestrian Detection: All Information vs. Objects Only]
Better Detectors → Better Viewpoint

                Horizon Prior   Using 2003 Local Detector   Using 2005 Local Detector
Median Error:       8.5%                 3.8%                       3.0%
90% Bound:          n/a                  n/a                        n/a
More is Better

More objects → Better viewpoint estimates


Detect Cars Only 7.3% Error
Detect Peds Only 5.0% Error
Detect Both 3.8% Error

Better viewpoint → Better object detection

10% fewer false positives at the same detection rate
Results
Car: TP / FP Ped: TP / FP

Initial: 6 TP / 1 FP Final: 9 TP / 0 FP
Results
Car: TP / FP Ped: TP / FP

Initial: 3 TP / 3 FP Final: 5 TP / 1 FP
Putting Objects in Perspective

[3D visualization: detected pedestrians and a car placed in the scene, with distances in meters]
Geometrically Coherent Image Interpretation

[Diagram: Surfaces and Occlusions pass surface maps and support estimates to Viewpoint/Size Reasoning, which yields Viewpoint and Objects]

Geometrically Coherent Image Interpretation

[Diagram, expanded: Surfaces, Occlusions, and Viewpoint and Objects now exchange estimates in both directions (surface maps, support, depth and boundaries, boundary maps, horizon and object maps) via Viewpoint/Size Reasoning]


Geometrically Coherent Image Interpretation

Input Surfaces

Occlusion Boundaries Viewpoint and Objects




Next Steps

• More robust and comprehensive high-level reasoning
• Learn perceptual similarity and general appearance models
Conclusions
• One image contains much 3D information
• Learn statistical models of the structure of our world from training images
• Important aspects of approach:
  – Use all available cues
  – Delay decisions
  – Think of vision as one 3D scene understanding problem
Video
Thank you

Acknowledgements
• Committee: Alyosha, Martial, Rahul, Takeo, and Bill
• Practice Presentation: Srinivas, Tom, Alex
Vision as Scene Understanding

[Ohta & Kanade 1978]

• Guzman (SEE), 1968
• Hansen & Riseman (VISIONS), 1978
• Ohta & Kanade, 1978
• Barrow & Tenenbaum, 1978
• Brooks (ACRONYM), 1979
• Marr (2½ D sketch), 1982
Vision as Scene Understanding

[Guzman 1968] [Ohta Kanade 1978]


Results
Car: TP / FP Ped: TP / FP
Failures
Failures: Reflections, Rare Viewpoint

Input Image Ground Truth Our Result


Results
Car: TP / FP Ped: TP / FP

Initial: 1 TP / 23 FP Final: 0 TP / 10 FP

Local Detector from [Murphy-Torralba-Freeman 2003]


Results
Car: TP / FP Ped: TP / FP

Initial: 1 TP / 5 FP Final: 5 TP / 2 FP
How do we get robust scene priors?

Hill Standing on Step


How to find occluding contours?
Other slides
Overview of Our Algorithm

Input Image → Multiple Segmentations → Surface Estimates (using Learned Models) → Final Labels
Estimating surface properties

• We want to know:
  – Is a segment good?
    P(good segment | data)
  – If so, what is the surface label?
    P(label | good segment, data)

• Learn these likelihoods from training images


Results

Input Image Ground Truth Our Result


Results

Input Image Ground Truth Our Result


Average Accuracy
Main Class: 88.1%
Subclasses: 61.5%
Experiments: Input Image
Experiments: Ground Truth
Experiments: Our Result
Surface Estimates: Paintings

Input Image Our Result


Object Pasting

[Lalonde et al. 2007]


Object Pasting

Before After
Object Pasting

Before After
Are Surfaces Enough?
