Machine Learning for Vision:
Random Decision Forests and Deep Neural Networks
Kari Pulli
VP Computational Imaging
Light
Material sources
A. Criminisi & J. Shotton: Decision Forests Tutorial
http://research.microsoft.com/en-us/projects/decisionforests/
J. Shotton et al. (CVPR11): Real-Time Human Pose Recognition
in Parts from a Single Depth Image
http://research.microsoft.com/apps/pubs/?id=145347
Geoffrey Hinton: Neural Networks for Machine Learning
https://www.coursera.org/course/neuralnets
Andrew Ng: Machine Learning
https://www.coursera.org/course/ml
Rob Fergus: Deep Learning for Computer Vision
http://media.nips.cc/Conferences/2013/Video/Tutorial1A.pdf
https://www.youtube.com/watch?v=qgx57X0fBdA
Supervised Learning
[Figure: labeled training examples plotted in (x1, x2) feature space; slide credit: Andrew Ng]
Unsupervised Learning
[Figure: unlabeled data points plotted in (x1, x2) feature space; slide credit: Andrew Ng]
Clustering: the K-means algorithm
(slide credit: Andrew Ng, Machine Learning)
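As a concrete illustration of the K-means idea from these slides, here is a minimal NumPy sketch; the data `X`, the number of clusters `k`, and the iteration count are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Minimal K-means: alternate the assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```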
Decision Forests
for computer vision and medical image analysis
h"p://research.microso/.com/projects/decisionforests
Decision trees and decision forests
[Figure: a decision tree with nodes numbered 1-14. Internal (split) nodes test questions such as "Is the bottom part green?" or "Is the bottom part blue?"; each true/false answer routes the input to a child node, until a terminal (leaf) node is reached.]
A forest is an ensemble of trees. The trees are all slightly different from one another.
Decision tree testing (runtime)
An input test point is pushed through the tree; the data is split at each node until the point reaches a leaf.
[Figure: input data in feature space.]
Decision tree training (off-line)
Given the input training data, how should we split the data at a node? Candidate splits are compared by the information gain they produce, measured with Shannon's entropy.
[Figure: the data before the split and under two candidate splits, Split 1 and Split 2.]
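To make the split criterion concrete, here is a minimal sketch of information gain computed from Shannon's entropy; the label array and the boolean split mask are illustrative inputs, not from the slides.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, left_mask):
    """Entropy before the split minus the weighted entropy after it."""
    left, right = labels[left_mask], labels[~left_mask]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
```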
Node training: the weak learner model
The data arriving at node j is split by the node's weak learner.
[Figure: weak learner families - axis-aligned, oriented line, and conic section - each split ending in leaf nodes.]
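The three weak-learner families named above can be sketched as simple binary split functions on a 2-D feature vector; the exact parameterizations below are illustrative assumptions, not the tutorial's definitions.

```python
import numpy as np

def axis_aligned(x, dim, thresh):
    """Axis-aligned split: compare a single feature against a threshold."""
    return x[dim] > thresh

def oriented_line(x, w, b):
    """Oriented-line split: a general linear test, w . x + b > 0."""
    return np.dot(w, x) + b > 0

def conic_section(x, A, b, c):
    """Conic-section split: a quadratic test, x^T A x + b . x + c > 0."""
    return x @ A @ x + np.dot(b, x) + c > 0
```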
Decision forest training (off-line)
Each tree t is trained on a randomly sampled subset of the training data, which also keeps forest training efficient.
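A minimal sketch of this bagging idea, where each tree t sees its own random subset S_t of the training data; `train_tree`, the subset fraction, and T are stand-ins for the node-training procedure and parameters described elsewhere in the slides.

```python
import numpy as np

def train_forest(X, y, train_tree, T=200, subset_fraction=0.5, seed=0):
    """Train T trees, each on an independently sampled subset S_t of (X, y)."""
    rng = np.random.default_rng(seed)
    n = int(subset_fraction * len(X))
    forest = []
    for t in range(T):
        idx = rng.choice(len(X), size=n, replace=False)  # the subset S_t for tree t
        forest.append(train_tree(X[idx], y[idx]))
    return forest
```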
Decision forest model: the randomness model
2) Randomized node optimization (RNO)
T denotes the full set of all possible node test parameters. When training node j, the weak learner is optimized only over T_j, a randomly sampled subset of features/parameters, with ρ = |T_j|.
The effect of ρ: a small value of ρ gives little tree correlation; a large value of ρ gives large tree correlation.
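A sketch of randomized node optimization: at each node only ρ randomly sampled candidate parameters out of the full set are scored. The `score` callable stands in for the information-gain objective above and is an assumption about the interface, not the tutorial's API.

```python
import numpy as np

def optimize_node(X, y, all_params, rho, score, seed=0):
    """Pick the best of rho randomly sampled split parameters (RNO)."""
    rng = np.random.default_rng(seed)
    sampled = rng.choice(len(all_params), size=rho, replace=False)
    candidates = [all_params[i] for i in sampled]   # T_j, with |T_j| = rho
    return max(candidates, key=lambda theta: score(X, y, theta))
```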
Classification forest: the ensemble model
[Figure: training points and the posteriors of individual trees t = 1, 2, 3.]
Parameters: T = 200, D = 2, weak learner = axis-aligned, leaf model = probabilistic.
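The ensemble model combines the trees by averaging their leaf posteriors, p(c|x) = (1/T) Σ_t p_t(c|x). A sketch, assuming each trained tree exposes a hypothetical `posterior(x)` method returning a class distribution:

```python
import numpy as np

def forest_posterior(forest, x):
    """Average the per-tree class posteriors p_t(c | x) over all T trees."""
    return np.mean([tree.posterior(x) for tree in forest], axis=0)
```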
Classification forest: effect of the weak learner model
[Figure: training different trees in the forest on the same training points.]
Parameters: T = 200, D = 2, weak learner = linear, leaf model = probabilistic.
Classification forest: effect of the weak learner model
[Figure: training different trees in the forest on the same training points.]
Parameters: T = 200, D = 2, weak learner = conic, leaf model = probabilistic.
Classification forest: with more than 2 classes
[Figure: training different trees in the forest on the same training points.]
Parameters: T = 200, D = 3, weak learner = conic, leaf model = probabilistic.
Classification forest: effect of tree depth
Training points: 4-class mixed.
[Figure: forest posteriors for T = 200, weak learner = conic, with D = 3, D = 6, and D = 15.]
[Figure: example - localizing the right elbow.]
• No temporal information: processing is frame-by-frame.
• Local pose estimate of parts: each pixel and each body joint is treated independently.
[Figure: synthetic depth images (used for training and testing) and real depth images (used for testing).]
• Depth comparisons: very fast to compute.
Feature response: f(I, x) = d_I(x) − d_I(x + Δ), where Δ = v / d_I(x), x is an image coordinate, d_I(x) is the depth at pixel x of image I, and v is an offset normalized by the depth at x.
[Figure: input depth image with example probe offsets Δ around several pixels x.]
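A sketch of the depth-comparison feature response above. The depth image is assumed to be a 2-D array indexed (row, col); treating out-of-image probes as having a large constant depth is an assumption about boundary handling, not something stated in the slides.

```python
import numpy as np

LARGE_DEPTH = 1e6  # assumed value for probes that fall outside the image

def depth_feature(depth, x, v):
    """f(I, x) = d_I(x) - d_I(x + v / d_I(x)), with a depth-normalized offset v."""
    d_x = depth[x[0], x[1]]
    probe = np.round(np.asarray(x) + np.asarray(v) / d_x).astype(int)
    h, w = depth.shape
    if 0 <= probe[0] < h and 0 <= probe[1] < w:
        return d_x - depth[probe[0], probe[1]]
    return d_x - LARGE_DEPTH
```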
[Figure: average per-class accuracy (30%-65%) as a function of tree depth (5-20), on synthetic test data and on real test data.]
[Figure: input depth image, ground truth, and inferred body parts; average per-class accuracy (40%-55%) as a function of the number of trees (1-6).]
y = b + Σ_i x_i w_i, where y is the output, b is the bias, x_i is the i-th input, w_i is the weight on the i-th input connection, and i indexes the input connections.
Sigmoid neurons
• These give a real-valued output that is a smooth and bounded function of their total input:
z = b + Σ_i x_i w_i,   y = 1 / (1 + e^(−z))
– They have nice derivatives which make learning easy:
∂z/∂w_i = x_i,   ∂z/∂x_i = w_i,   dy/dz = y (1 − y)
[Figure: the sigmoid y as a function of z, rising smoothly from 0 toward 1 through y = 0.5 at z = 0.]
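A minimal sketch of the sigmoid neuron and its derivative exactly as written above:

```python
import numpy as np

def sigmoid_neuron(x, w, b):
    """Forward pass: z = b + sum_i x_i w_i, y = 1 / (1 + exp(-z))."""
    z = b + np.dot(x, w)
    y = 1.0 / (1.0 + np.exp(-z))
    dy_dz = y * (1.0 - y)   # the convenient derivative used during learning
    return y, dy_dz
```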
Finding weights with backpropagation
• There is a big difference between the
forward and backward passes.
[Figure: a unit i in the layer below, with output y_i, connects through weight w_ij to unit j, whose total input is z_j and output is y_j.]
y_j = 1 / (1 + e^(−z_j)),   dy_j/dz_j = y_j (1 − y_j)
z_j = Σ_i y_i w_ij,   E = ½ Σ_j (y_j − t_j)²
Backpropagating dE/dy:
∂E/∂y_j = y_j − t_j
∂E/∂z_j = (dy_j/dz_j)(∂E/∂y_j) = y_j (1 − y_j) ∂E/∂y_j   (propagate the error across the non-linearity)
∂E/∂y_i = Σ_j (dz_j/dy_i)(∂E/∂z_j) = Σ_j w_ij ∂E/∂z_j
∂E/∂w_ij = (∂z_j/∂w_ij)(∂E/∂z_j) = y_i ∂E/∂z_j   (error gradient w.r.t. the weights)
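The derivation above can be sketched directly as code for one layer of sigmoid output units; the variable names mirror the slide notation (y_i inputs, w_ij weights, y_j outputs, t_j targets), while the vectorized shapes are an implementation choice.

```python
import numpy as np

def backprop_layer(y_i, W, b, t_j):
    """One forward/backward pass through a layer of sigmoid units."""
    z_j = b + y_i @ W                      # z_j = sum_i y_i w_ij
    y_j = 1.0 / (1.0 + np.exp(-z_j))
    dE_dy_j = y_j - t_j                    # from E = 1/2 (y_j - t_j)^2
    dE_dz_j = y_j * (1.0 - y_j) * dE_dy_j  # across the non-linearity
    dE_dW = np.outer(y_i, dE_dz_j)         # dE/dw_ij = y_i * dE/dz_j
    dE_dy_i = W @ dE_dz_j                  # dE/dy_i = sum_j w_ij * dE/dz_j
    return y_j, dE_dW, dE_dy_i
```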
Converting error derivatives into a learning procedure
Gradient descent algorithm: repeat until convergence { W := W − α ∂E/∂W }
[Figure: the error surface E(θ0, θ1) over the parameters θ0 and θ1; slide credit: Andrew Ng.]
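A sketch of the gradient descent loop written out; the learning rate α, the convergence tolerance, and the `grad_E` callable that returns ∂E/∂W are illustrative assumptions.

```python
import numpy as np

def gradient_descent(W, grad_E, alpha=0.1, tol=1e-6, max_iters=10000):
    """Repeat W := W - alpha * dE/dW until the update is negligibly small."""
    for _ in range(max_iters):
        step = alpha * grad_E(W)
        W = W - step
        if np.linalg.norm(step) < tol:   # 'until convergence'
            break
    return W
```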
Overfitting: The downside of using powerful models
• The training data contains information about the regularities in the
mapping from input to output. But it also contains two types of noise.
– The target values may be unreliable
– There is sampling error:
accidental regularities just because of the particular training
cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and
which are caused by sampling error.
– So it fits both kinds of regularity.
– If the model is very flexible it can model the sampling error really
well. This is a disaster.
Example: logistic regression (g = sigmoid function)
[Figure: three decision boundaries in (x1, x2) - underfitting, a reasonable fit, and overfitting; slide credit: Andrew Ng.]
Preventing overfitting
• Approach 1: Get more data!
– almost always the best bet if you have enough compute power to train on more data
• Approach 2: Use a model that has the right capacity:
– enough to fit the true regularities.
– not enough to also fit spurious regularities (if they are weaker).
• Approach 3: Average many different models.
– use models with different forms.
• Approach 4: (Bayesian) Use a single neural network architecture, but average the predictions made by many different weight vectors.
– train the model on different subsets of the training data (this is called "bagging").
Some ways to limit the capacity of a neural net
• The capacity can be controlled in many ways:
– Architecture: Limit the number of hidden layers and the number
of units per layer.
– Early stopping: Start with small weights and stop the learning
before it overfits.
– Weight-decay: Penalize large weights using penalties or constraints on their squared values (L2 penalty) or absolute values (L1 penalty); see the sketch after this list.
– Noise: Add noise to the weights or the activities.
• Typically, a combination of several of these methods is used.
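One of the capacity controls listed above, L2 weight decay, can be sketched by adding the penalty's gradient to the weight update; the decay coefficient λ (and folding it into plain gradient descent) is an illustrative assumption.

```python
import numpy as np

def weight_decay_update(W, grad_E, alpha=0.1, lam=1e-4):
    """L2 weight decay: penalize (lam/2) * sum(W**2) on top of the error E."""
    return W - alpha * (grad_E(W) + lam * W)
```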
Small model vs. big model + regularization
[Figure: fits from a small model, a big model, and a big model with regularization.]
Cross-validation for choosing meta parameters
• Divide the total dataset into three subsets:
– Training data is used for learning the parameters of the model.
– Validation data is not used for learning but is used for deciding
what settings of the meta parameters work best.
– Test data is used to
get a final, unbiased
estimate of how well
the network works.
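A sketch of the three-way split described above; the 60/20/20 proportions are an illustrative assumption, not a recommendation from the slides.

```python
import numpy as np

def three_way_split(X, y, seed=0):
    """Shuffle the data, then carve out training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```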
Convolutional Neural Networks (currently the dominant approach for neural networks)
• Use many different copies of the same feature detector with different positions.
– Replication greatly reduces the number of free parameters to be learned.
[Figure: the similarly colored connections all have the same weight.]
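A sketch of the weight-sharing idea: the same small set of weights (one replicated feature detector) is applied at every position, so the number of free parameters does not grow with the image size. The loop form is for clarity, not efficiency.

```python
import numpy as np

def shared_weight_feature_map(image, kernel):
    """Apply one replicated feature detector (kernel) at every valid position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same weights are reused at every (i, j): weight sharing.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```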
• Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing hand-written digits:
– Hundred times as many classes (1000 vs. 10)
– Hundred times as many pixels (256 x 256 color vs. 28 x 28 gray)
– Two-dimensional image of a three-dimensional scene.
– Cluttered scenes requiring segmentation.
– Multiple objects in each image.
• Will the same type of convolutional neural network work?
The ILSVRC-2012 competition on ImageNet