
Torr Vision Group, Engineering Department

Semantic Image Segmentation with Deep Learning
Sadeep Jayasumana

07/10/2015

Collaborators:
Bernardino Romera-Paredes
Shuai Zheng
Philip Torr

Live Demo - http://crfasrnn.torr.vision/



Outline

• Semantic segmentation
• Why?
• CNNs for pixel-wise prediction
• CRFs
• CRF as RNN
• Conclusion

Semantic Segmentation
• Recognizing and delineating objects in an image
• Classifying each pixel in the image

Why Semantic Segmentation?


• To help partially sighted people by highlighting
important objects in their glasses

Why Semantic Segmentation?


• To let robots segment objects so that they can grasp
them

Why Semantic Segmentation?


• Road scene understanding
• Useful for autonomous navigation of cars and drones

Image taken from the Cityscapes dataset.

Why Semantic Segmentation?


• Useful tool for editing images

Why Semantic Segmentation?


• Medical purposes: e.g. segmenting
tumours, dental cavities, ...

Image courtesy of Mauricio Reyes

ISBI Challenge 2015, dental X-ray images

But How?
• Deep convolutional neural networks are successful at learning good representations of visual inputs.

• However, semantic segmentation requires a structured output: a label for every pixel, not a single class per image.


CNN for Pixel-wise Labelling


• Usual convolutional networks

• Fully convolutional networks

Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015.
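A minimal sketch of the fully convolutional idea (our illustration in PyTorch, not the authors' implementation; all layer sizes are made up): the fully connected classifier is recast as a 1×1 convolution, so the network accepts any input size and emits a coarse score map that is then upsampled to the input resolution.

```python
# Toy fully convolutional network: conv backbone + 1x1 conv classifier
# + bilinear upsampling back to the input size. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):           # 21 = PASCAL VOC labels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 1/2 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 1/4 resolution
        )
        # a "fully connected" classifier recast as a 1x1 convolution
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        scores = self.classifier(self.features(x))     # coarse score map
        return F.interpolate(scores, size=(h, w),      # upsample to input size
                             mode="bilinear", align_corners=False)

logits = TinyFCN()(torch.randn(1, 3, 224, 224))        # -> (1, 21, 224, 224)
```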


Fully Convolutional Networks


[Long et al., CVPR 2015]

+ Significantly improved the state of the art in semantic segmentation.
- Poor object delineation: e.g., spatial consistency is neglected.

[Figure: input image, FCN result, ground truth]

Conditional Random Fields (CRFs)


• A CRF can account for contextual information in the
image

[Figure: coarse output from the pixel-wise classifier → MRF/CRF modelling → output after CRF inference]


Conditional Random Fields (CRFs)


X_i ∈ {bg, cat, tree, person, …}   (e.g. X_i = bg, X_j = cat)

• Define a discrete random variable X_i for each pixel i.
• Each X_i can take a value from the label set.
• Connect the random variables to form a random field (MRF).
• Most probable assignment given the image → segmentation.

Finding the Best Assignment


Pr(X_1 = x_1, X_2 = x_2, …, X_n = x_n | I) = Pr(X = x | I)

Pr(X = x | I) = (1/Z(I)) exp(−E(x | I))

• Maximizing Pr(X = x | I) ⇔ minimizing the energy E(x | I).
• So we have formulated the problem as an energy minimization.
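A quick numerical sanity check (our illustration) that the Gibbs form makes the two views equivalent:

```python
# Pr(X = x | I) = exp(-E(x | I)) / Z: the lowest-energy labelling is
# exactly the most probable one.
import numpy as np

energies = np.array([2.0, 0.5, 3.1])                 # E(x | I) for 3 labellings
probs = np.exp(-energies) / np.exp(-energies).sum()  # divide by Z to normalise
assert probs.argmax() == energies.argmin()           # same winner either way
print(probs)                                         # ~[0.17, 0.77, 0.06]
```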

E(x | I) = E_unary(x) + E_pairwise(x)

Unary energy
• ψ_u(X_i = x_i) = ?
• Your label doesn't agree with the initial classifier → you pay a penalty.

Pairwise energy
• ψ_p(X_i = x_i, X_j = x_j) = ?
• You assign different labels to two very similar pixels → you pay a penalty.
• How do you measure similarity? (a Gaussian kernel on position and colour is sketched below)
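A sketch of evaluating these two terms for a given labelling (our illustration; the Gaussian position-and-colour kernel and the parameters w, theta_p, theta_c are assumptions in the spirit of dense CRFs, not the exact form used in the talk):

```python
# Energy of a labelling = unary cost (disagreeing with the classifier)
# + pairwise cost (different labels on similar pixels).
import numpy as np

def crf_energy(labels, unary, pos, rgb, w=1.0, theta_p=3.0, theta_c=10.0):
    """labels: (N,) ints; unary: (N, L) per-pixel label costs;
    pos: (N, 2) coordinates; rgb: (N, 3) colours."""
    n = len(labels)
    e_unary = unary[np.arange(n), labels].sum()
    e_pair = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:        # Potts: only label disagreements pay
                # similarity: nearby pixels with similar colour score high
                sim = np.exp(-((pos[i] - pos[j]) ** 2).sum() / (2 * theta_p ** 2)
                             - ((rgb[i] - rgb[j]) ** 2).sum() / (2 * theta_c ** 2))
                e_pair += w * sim
    return e_unary + e_pair
```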



Dense CRF Formulation


[Krähenbühl & Koltun, NIPS 2011]

• Pairwise energies are defined for every pixel pair in the image:

E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j)

• Exact inference is not feasible.
• Use approximate mean-field inference: approximate
  Pr(X = x) = (1/Z) exp(−E(x)) by a factorized distribution Q(x) = Π_i Q_i(x_i).
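A naive version of that inference (our illustration; the explicit (N, N) kernel matrix makes this O(N²), whereas the paper's key trick is computing the same message-passing step with fast high-dimensional filtering):

```python
# Mean-field for a dense CRF: repeatedly filter the current marginals Q,
# apply the label compatibility, add the unaries, and renormalise.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, kernel, compat, n_iters=5):
    """unary: (N, L) costs; kernel: (N, N) similarities (zero diagonal);
    compat: (L, L) label compatibility, e.g. Potts; returns Q: (N, L)."""
    Q = softmax(-unary)                  # initialise from the unaries
    for _ in range(n_iters):
        msg = kernel @ Q                 # message passing under the kernel
        pairwise = msg @ compat.T        # compatibility transform
        Q = softmax(-unary - pairwise)   # add unaries, normalise
    return Q
```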

Fully Connected CRFs as a CNN

Q, I → Bilateral filtering → Conv → Conv → + (add unaries) → SoftMax → Q

• One mean-field update of the marginals Q, conditioned on the image I, can be built entirely from standard CNN operations.

CRF as a Recurrent Neural Network

Q, I → Bilateral filtering → Conv → Conv → + (add unaries) → SoftMax → Q

Mean-field Iteration

• Each of these blocks is differentiable → we can backprop (see the sketch below).
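A sketch of the same update in differentiable tensor ops (ours, in PyTorch; a dense kernel matrix stands in for the bilateral filtering block, and the slide's two Conv blocks are folded into one matrix product):

```python
import torch

def mf_iteration(Q, unary, kernel, compat):
    """One mean-field step. Q, unary: (N, L); kernel: (N, N); compat: (L, L)."""
    msg = kernel @ Q                        # "Bilateral": filter Q under the kernel
    pairwise = msg @ compat.T               # "Conv": compatibility transform
    return torch.softmax(-unary - pairwise, dim=1)   # "+" unaries, "SoftMax"
```

Because every operation here is a standard differentiable tensor op, autograd gives gradients for the kernel and compatibility parameters for free.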

CRF as a Recurrent Neural Network


[Figure: image → unaries from the CNN → SoftMax → CRF mean-field iteration, applied repeatedly → CRF output; the repeated block is the "CRF as RNN" layer]

• Each of these blocks is differentiable → the whole pipeline can be trained end-to-end by backpropagation (a loop sketch follows).
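Unrolling that block for a fixed number of iterations with shared parameters is exactly a recurrent network; a self-contained sketch (ours):

```python
import torch

def mf_iteration(Q, unary, kernel, compat):
    # one differentiable mean-field step (as in the previous sketch)
    return torch.softmax(-unary - (kernel @ Q) @ compat.T, dim=1)

def crf_as_rnn(unary, kernel, compat, n_iters=5):
    """Recurrent state = marginals Q; parameters shared across iterations."""
    Q = torch.softmax(-unary, dim=1)       # initial marginals from FCN scores
    for _ in range(n_iters):
        Q = mf_iteration(Q, unary, kernel, compat)
    return Q

# toy end-to-end check: gradients flow back through all iterations
N, L = 4, 3
unary = torch.randn(N, L, requires_grad=True)
kernel = torch.rand(N, N).fill_diagonal_(0.0)   # similarities, no self-loops
compat = 1.0 - torch.eye(L)                     # Potts compatibility
crf_as_rnn(unary, kernel, compat).sum().backward()
```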

Putting Things Together

[Figure: image → FCN → CRF-RNN → segmentation]

Experiments

FCN [Long et al., 2015]           68.3
FCN + CRF [Chen et al., 2015]     69.5
FCN + CRF-RNN (Ours)              72.9

Try our demo: http://crfasrnn.torr.vision


Code & model: https://github.com/torrvision/crfasrnn

Shuai Zheng

Bernardino Romera-Paredes

Philip Torr

Examples

http://pp.vk.me/c622119/v622119584/20dc3/7lS5BU2Bp_k.jpg

Examples

http://media1.fdncms.com/boiseweekly/imager/mountain-bikers-are-advised-to-dism/u/original/3446917/walk_thru_sheep_1_.jpg

Examples

http://img.rtvslo.si/_up/upload/2014/07/22/65129194_tour-3.jpg

Examples

http://www.toxel.com/wp-content/uploads/2010/11/bike05.jpg

Not-so-good examples

http://www.independent.co.uk/incoming/article10335615.ece/alternates/w620/planecat.jpg

Not-so-good examples

http://i1.wp.com/theverybesttop10.files.wordpress.com/2013/02/the-world_s-top-10-best-images-of-camouflage-cats-5.jpg?resize=375,500

Tricky examples

http://se-preparer-aux-crises.fr/wp-content/uploads/2013/10/Golum.png

Tricky examples

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRf4J7Hszkc8Wf6riVUX-cV_K-un8LJy5dYIBW1KDIn6i7UCzGHpg

Tricky examples

http://i.huffpost.com/gen/1478236/thumbs/s-DIRD6-large640.jpg


Conclusion
• CNNs yield a coarse prediction on pixel-labelling tasks.
• CRFs improve the result by accounting for the contextual information in the image.
• Learning the whole pipeline end-to-end significantly improves the results.

[Figure: CNN → CRF]
Thank You!
