Neural Networks
1
Last Time
• Classification: given inputs x, predict labels (classes) y ∈ Y
• Naïve Bayes
[Figure: Naïve Bayes graphical model with feature nodes F1, F2, …, Fn]
• Parameter estimation:
– MLE, MAP, priors
• Laplace smoothing
3
Learning highly non-linear functions
f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables
[Figure: network with an input layer, a middle (hidden) layer, and an output layer]
© Eric Xing @ CMU, 2006-2011 5
Connectionist Models
Consider humans:
• Neuron switching time ~ 0.001 second
• Number of neurons ~ 10^11 (86 billion neurons)
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ 0.1 second
• 100 serial inference steps doesn't seem like enough → much of the computation must be highly parallel

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
7
Motivation: Why is everyone talking about Deep Learning?
• Because a lot of money is invested in it…
  – DeepMind: acquired by Google for $400 million
  – DNNResearch: three-person startup (including Geoff Hinton) acquired by Google for an unknown price tag
  – Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars
• Because it made the front page of the New York Times
9
Motivation Why Deep Learning?
10
Performance (Computer Vision)
[Figure: computer vision benchmark results over time, with AlexNet highlighted]
14
Motivation Why now?
[Figure: timeline from 2006 to 2016]
15
Background: A Recipe for Machine Learning
1. Given training data  [Figure: example images labeled "Face", "Face", "Not a face"]
2. Choose each of these:
   – Decision function
   – Loss function (examples: mean-squared error, cross-entropy)
16
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
17
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

Gradients: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
18
Background: A Recipe for Machine Learning
Goals for this lecture:
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
19
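To make the recipe concrete, here is a minimal sketch of it in Python; the linear decision function, squared loss, and all names are illustrative assumptions, not the deck's own formulas.

```python
# A minimal sketch of the recipe above: decision function, loss, goal, SGD.
# Assumes a linear decision function and squared loss; all names are illustrative.
import numpy as np

def decision_function(theta, x):      # 2. decision function
    return theta @ x

def loss(y_hat, y):                   # 2. loss function (squared error)
    return 0.5 * (y_hat - y) ** 2

def train_sgd(X, Y, lr=0.01, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):           # 4. train with SGD:
        for x, y in zip(X, Y):        #    take small steps opposite the gradient
            y_hat = decision_function(theta, x)
            grad = (y_hat - y) * x    # gradient of the loss w.r.t. theta
            theta -= lr * grad        # 3. goal: minimize average loss on the data
    return theta
```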
Decision Functions Linear Regression
[Figure: linear regression as a single output unit with weights θ1, θ2, θ3, …, θM over the inputs]
20
Decision Functions Logistic Regression
[Figure: logistic regression as a single sigmoid output unit with weights θ1, θ2, θ3, …, θM over the inputs]
21
Decision Functions Logistic Regression
[Figure: training points labeled 1, 1, 0 in the (x1, x2) plane, and the logistic regression unit with weights θ1, θ2, θ3, …, θM]
23
The Perceptron: Forward Propagation
25
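A small sketch of a single unit's forward propagation: a weighted sum of the inputs plus a bias, passed through a nonlinearity. The sigmoid activation and all values here are illustrative assumptions.

```python
# Forward propagation for a single perceptron-style unit:
# activation(w . x + b). Weights, bias, and inputs are illustrative.
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def unit_forward(x, w, b, activation=sigmoid):
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.6, 0.1, 0.7])
print(unit_forward(x, w, b=0.0))
```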
Neural Network Model
[Figure, repeated across several build slides: a small feed-forward network. The inputs (independent variables) Age = 34, Gender = 2, and Stage = 4 feed through one set of weights (.6, .1, .7, .4, .2, .3, .8, …) into a hidden layer of sigmoid (Σ) units, which feed through a second set of weights into a single sigmoid output unit. The output, 0.6, is the prediction of the dependent variable: "Probability of being Alive".]
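A sketch of the forward pass for a network of this shape: three inputs, one hidden layer of sigmoid units, one sigmoid output. The weight values below are illustrative, not the ones in the figure.

```python
# Forward pass: inputs -> sigmoid hidden layer -> sigmoid output
# ("Probability of being Alive"). Weight values are illustrative.
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([34.0, 2.0, 4.0])             # Age, Gender, Stage
W_hidden = np.array([[0.06, 0.10, 0.70],   # weights into hidden unit 1
                     [0.03, 0.02, 0.02]])  # weights into hidden unit 2
w_out = np.array([0.5, 0.8])               # hidden-to-output weights

hidden = sigmoid(W_hidden @ x)             # hidden-layer activations
p_alive = sigmoid(w_out @ hidden)          # predicted "Probability of being Alive"
print(p_alive)
```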
Decision Functions Neural Network
Output
…
Hidden Layer
…
Input
33
Building a Neural Net
Output
Features …
34
Building a Neural Net
Output
Hidden Layer …
D=M
Input …
35
Building a Neural Net
Output
Hidden Layer …
D=M
Input …
36
Building a Neural Net
Output
Hidden Layer …
D=M
Input …
37
Building a Neural Net
Output
Hidden Layer …
D<M
Input …
38
Decision Boundary
• 1 hidden layer
  – Boundary of convex region (open or closed)
• 2 hidden layers
  – Combinations of convex regions
[Figure: example decision regions in the (x1, x2) plane for each case, alongside the network schematic (input, hidden layer, output)]
42
Decision Functions Deeper Networks
Next lecture: Making the neural networks deeper
Output
…
Hidden Layer 1
…
Input
43
Decision Functions Deeper Networks
Next lecture: Making the neural networks deeper
Output
…
Hidden Layer 2
…
Hidden Layer 1
…
Input
44
Decision Functions Deeper Networks
Next lecture: Making the neural networks deeper
Output
…
Hidden Layer 3
…
Hidden Layer 2
…
Hidden Layer 1
…
Input
45
Decision Functions Different Levels of Abstraction
• We don’t know the “right” levels of abstraction
• So let the model figure it out!
46
Example from Honglak Lee (NIPS 2010)
Decision Functions Different Levels of Abstraction
Face Recognition:
– Deep Network can build up increasingly higher levels of abstraction
– Lines, parts, regions
47
Example from Honglak Lee (NIPS 2010)
Decision Functions Different Levels of Abstraction
Output
…
Hidden Layer 3
…
Hidden Layer 2
…
Hidden Layer 1
…
Input
48
Example from Honglak Lee (NIPS 2010)
ARCHITECTURES
49
Neural Networks Properties
• Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function (on a compact domain) to any desired accuracy.
• Practical considerations
– Can be seen as learning the features
– Large number of neurons
• Danger for overfitting
• (hence early stopping!)
Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make (see the sketch after this list):
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
51
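A sketch of how these four decisions show up when building a network, using PyTorch (which appears later in this lecture); the specific numbers are illustrative, not recommendations.

```python
# The four design decisions as concrete choices (illustrative values).
import torch.nn as nn

depth = 2                 # 1. number of hidden layers
width = 64                # 2. number of units per hidden layer
activation = nn.Tanh      # 3. type of activation function (nonlinearity)

layers, d_in = [], 10     # assume 10 input features
for _ in range(depth):
    layers += [nn.Linear(d_in, width), activation()]
    d_in = width
layers.append(nn.Linear(d_in, 1))
model = nn.Sequential(*layers)

objective = nn.BCEWithLogitsLoss()   # 4. form of objective function
```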
Activation Functions
Neural Network with sigmoid
activation functions
Output
…
Hidden Layer
…
Input
52
Activation Functions
Neural Network with arbitrary
nonlinear activation functions
Output
…
Hidden Layer
…
Input
53
Activation Functions
Sigmoid / Logistic Function
logistic(u) ≡ 1 / (1 + e^(−u))
So far, we’ve assumed that the activation function (nonlinearity) is always the sigmoid function…
54
Activation Functions
• A new change: modifying the nonlinearity
– The logistic is not widely used in modern ANNs
Alternate 1: tanh
59
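A small sketch comparing the two activations; ReLU is added as another common modern choice and is not part of this slide.

```python
# Sigmoid vs. tanh (and ReLU, as a further common modern alternative).
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))    # output in (0, 1)

def tanh(u):
    return np.tanh(u)                  # output in (-1, 1), zero-centered

def relu(u):
    return np.maximum(0.0, u)          # output in [0, inf)

u = np.linspace(-3, 3, 7)
print(sigmoid(u))
print(tanh(u))
print(relu(u))
```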
Multi-Class Output
Output …
Hidden Layer …
Input …
60
Multi-Class Output
Softmax:
…
Output
…
Hidden Layer
…
Input
61
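In its standard form, the softmax turns a vector of output scores b into class probabilities: the probability of class k is exp(b_k) / Σ_l exp(b_l). A small numerically stable sketch (names illustrative):

```python
# Softmax: turn a vector of output scores into class probabilities.
# Subtracting the max keeps exp() from overflowing without changing the result.
import numpy as np

def softmax(b):
    z = np.exp(b - np.max(b))
    return z / np.sum(z)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))   # ~[0.66, 0.24, 0.10]; the entries sum to 1
```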
Cross-entropy vs. Quadratic loss
63
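For a single binary example with true label y and predicted probability ŷ, the two losses being compared are, in standard form, the quadratic loss ½(ŷ − y)² and the cross-entropy −[y log ŷ + (1 − y) log(1 − ŷ)]. A small sketch:

```python
# Quadratic (squared-error) loss vs. cross-entropy loss for one binary example.
import numpy as np

def quadratic_loss(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

def cross_entropy_loss(y_hat, y, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Cross-entropy punishes confident wrong predictions much more heavily:
print(quadratic_loss(0.01, 1.0))       # ~0.49
print(cross_entropy_loss(0.01, 1.0))   # ~4.61
```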
BACKPROPAGATION
64
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
65
Training Backpropagation
• Question 1:
When can we compute the gradients of the parameters of
an arbitrary neural network?
• Question 2:
When can we make the gradient computation efficient?
66
Training Chain Rule
Given:
Chain Rule:
67
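For reference, the standard chain rule these slides build on, in the scalar case and in the multivariate sum form used for backpropagation:

```latex
% Scalar case: y = f(u), u = g(x)
\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}

% Multivariate case: y = f(u_1,\dots,u_J), with each u_j = g_j(x)
\frac{dy}{dx} = \sum_{j=1}^{J} \frac{\partial y}{\partial u_j}\,\frac{du_j}{dx}
```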
Training Chain Rule
Given:
Chain Rule:
Backpropagation is just repeated application of the chain rule.
68
Training Chain Rule
Given:
Chain Rule:
…
Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node’s intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.
70
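A sketch of these four steps for a tiny network: one hidden layer of sigmoid units and a sigmoid output trained with squared loss. The forward pass stores every intermediate quantity; the backward pass visits them in reverse order and applies the chain rule at each node. All names, shapes, and values are illustrative.

```python
# Backpropagation by hand for a one-hidden-layer sigmoid network, squared loss.
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward_backward(x, y, W1, w2):
    # Forward pass (store every intermediate quantity)
    a = W1 @ x              # hidden pre-activations
    z = sigmoid(a)          # hidden activations
    b = w2 @ z              # output pre-activation
    y_hat = sigmoid(b)      # prediction
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass (reverse topological order, chain rule at each node)
    d_y_hat = y_hat - y                      # dJ/dy_hat
    d_b = d_y_hat * y_hat * (1 - y_hat)      # dJ/db  (sigmoid' = y_hat * (1 - y_hat))
    d_w2 = d_b * z                           # dJ/dw2
    d_z = d_b * w2                           # dJ/dz
    d_a = d_z * z * (1 - z)                  # dJ/da
    d_W1 = np.outer(d_a, x)                  # dJ/dW1
    return loss, d_W1, d_w2

# Example usage with a 3-input, 2-hidden-unit network:
rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0, 2.0]), 1.0
W1, w2 = rng.normal(size=(2, 3)), rng.normal(size=2)
print(forward_backward(x, y, W1, w2))
```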
Training Backpropagation
71
Training Backpropagation
Case 1: Logistic Regression
[Figure: a single sigmoid output unit with weights θ1, θ2, θ3, …, θM over the inputs]
72
Training Backpropagation
Output
…
Hidden Layer
…
Input
73
Training Backpropagation
Output
…
Hidden Layer
…
Input
74
Training Backpropagation
Case 2: Neural Network
75
Training Chain Rule
Given:
Chain Rule:
…
Backpropagation (scalar view):
1. Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node’s intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.

Backpropagation (Tensor view):
1. Instantiate the computation as a directed acyclic graph, where each node represents a Tensor.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivatives of the goal with respect to that node’s Tensor.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.
Module 4
…
Module 3
Module 2
Module 1
78
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

Gradients: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
79
Automatic Differentiation
• Automatic differentiation software
  – e.g. Theano, TensorFlow, PyTorch, Chainer
  – Only need to program the function g(x, y, w)
  – Can automatically compute all derivatives w.r.t. all entries in w
  – This is typically done by caching info during the forward computation pass and then doing a backward pass = “backpropagation”
  – Autodiff / backpropagation can often be done at a computational cost comparable to the forward pass
• You need to know that this exists
• How is this done? (see the sketch below)
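A minimal sketch with PyTorch's autograd, one of the libraries named above; the function being differentiated and all values are illustrative.

```python
# Automatic differentiation with PyTorch autograd: program only the forward
# computation; the backward pass (backpropagation) is derived automatically.
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor(1.0)
w = torch.tensor([0.1, -0.2, 0.3], requires_grad=True)

y_hat = torch.sigmoid(w @ x)   # forward pass; intermediate info is cached
loss = (y_hat - y) ** 2

loss.backward()                # backward pass = backpropagation
print(w.grad)                  # d(loss)/dw for every entry of w
```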
Libraries and Frameworks for Deep Learning
• Many deep learning libraries and frameworks have been developed (as of 2017):
– TensorFlow (Google)
– PyTorch (Facebook)
– Theano* (LISA lab, University of Montreal)
– MXNet (Amazon)
– CNTK (Microsoft)
– Caffe
– Deeplearning4j
– …
81
Summary
1. Neural Networks…
– provide a way of learning features
– are highly nonlinear prediction functions
– (can be) a highly parallel network of logistic regression classifiers
– discover useful hidden representations of the input
2. Backpropagation…
– provides an efficient way to compute gradients
– is a special case of reverse-mode automatic differentiation
82
Demo
• http://playground.tensorflow.org
83
References
• Lecture slides, Introduction to Machine Learning, CMU, 2016.
• Introduction to Deep Learning, MIT, 2017.
84