
Artificial Intelligence

Neural Networks

Nguyen Van Vinh


UET-VNU

1
Last Time
• Classification: given inputs x, predict labels (classes) y ∈ Y

• Naïve Bayes

[Figure: Naïve Bayes model with features F1, F2, …, Fn]

• Parameter estimation:
– MLE, MAP, priors

• Laplace smoothing

• Training set, held-out set, test set


Outline
• Neural Networks
• Backpropagation
• Demo

3
Learning highly non-linear functions
f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables

Examples: the XOR gate, speech recognition

© Eric Xing @ CMU, 2006-2011 4


Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
[Figure: a biological neuron (dendrites, soma, axon, synapses) next to an artificial neuron: inputs x1, x2 with weights w1, w2 feed a linear combiner, then a hard limiter with threshold θ produces output Y]
• Activation function:
  X = Σ (i = 1..n) xi·wi
  Y = +1 if X ≥ θ, −1 if X < θ

• Artificial neural networks
• supervised learning
• gradient descent

[Figure: a multilayer network with an input layer, a middle (hidden) layer, and an output layer; input signals enter on one side, output signals leave on the other]
© Eric Xing @ CMU, 2006-2011 5
Connectionist Models
• Consider humans:
  – Neuron switching time: ~0.001 second
  – Number of neurons: about 86 billion (~10^11)
  – Connections per neuron: ~10^4–10^5
  – Scene recognition time: ~0.1 second
  – 100 inference steps doesn't seem like enough
    → much parallel computation
• Properties of artificial neural nets (ANNs):
  – Many neuron-like threshold switching units
  – Many weighted interconnections among units
  – Highly parallel, distributed processing

© Eric Xing @ CMU, 2006-2011 6


Motivation: Why is everyone talking about Deep Learning?
• 2016: the year of deep learning

7
Motivation: Why is everyone talking about Deep Learning?

8
Motivation: Why is everyone talking about Deep Learning?
• Because a lot of money is invested in it…
– DeepMind: acquired by Google for $400 million
– DNNResearch: three-person startup (including Geoff Hinton)
acquired by Google for an unknown price tag
– Enlitic, Ersatz, MetaMind, Nervana, Skylab: deep learning
startups commanding millions of VC dollars
• Because it made the front page of the New York Times
9
Motivation Why Deep Learning?

10
Performance (Computer Vision)

AlexNet

graph credit: Matt Zeiler, Clarifai
Speech Recognition

graph credit: Matt Zeiler, Clarifai


The State-of-the-Art Models

• Computer Vision: CNN models → ResNet, EfficientNet


• Natural Language Processing: attention → Transformer models
(Seq2Seq, BERT, GPT-3), pre-trained language models (LMs)

14
Motivation Why now?

2006

2016

15
Background: A Recipe for Machine Learning
1. Given training data (e.g., images labeled Face / Face / Not a face)

2. Choose each of these:


– Decision function
Examples: linear regression,
logistic regression, neural network

– Loss function
Examples: mean squared error,
cross-entropy

16
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD
(take small steps opposite the gradient)
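As a concrete (and entirely illustrative) sketch of this recipe, here is a small numpy program that chooses logistic regression as the decision function, cross-entropy as the loss, and trains with SGD; the data, learning rate, and step count are assumptions, not part of the slides.

import numpy as np

# Recipe sketch: decision function = logistic regression, loss = cross-entropy,
# goal = minimize the average loss, training = SGD (small steps opposite the gradient).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # 1. training data (100 examples, 3 features)
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)                                      # 2. parameters of the decision function
lr = 0.1                                             # step size for SGD (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(1000):
    i = rng.integers(len(X))                         # pick one example at random
    p = sigmoid(X[i] @ w)                            # decision function output
    grad = (p - y[i]) * X[i]                         # gradient of the cross-entropy loss w.r.t. w
    w -= lr * grad                                   # 4. small step opposite the gradient

# 3. goal: the average cross-entropy should have decreased
p_all = sigmoid(X @ w)
loss = -np.mean(y * np.log(p_all) + (1 - y) * np.log(1 - p_all))
print("final average cross-entropy:", loss)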

17
Background: A Recipe for Machine Learning (Gradients)
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD
(take small steps opposite the gradient)

Backpropagation can compute this gradient!
And it’s a special case of a more general algorithm called
reverse-mode automatic differentiation that can compute the
gradient of any differentiable function efficiently!

18
Background: A Recipe for Machine Learning
Goals for this lecture:
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
(Recipe recap: 1. given training data; 2. choose a decision function
and a loss function; 3. define goal; 4. train with SGD, taking small
steps opposite the gradient.)

19
Decision Functions Linear Regression

Output

θ1 θ2 θ3 θM

Input …
20
Decision Functions Logistic Regression

Output

θ1 θ2 θ3 θM

Input …
21
Decision Functions Logistic Regression

Output

Face Face Not a face

θ1 θ2 θ3 θM

Input …
22
Decision Functions Logistic Regression

Output

[Figure: training points labeled 1, 1, 0 in the (x1, x2) plane with the
decision boundary y, and the network with weights θ1, θ2, θ3, …, θM]

Input …
23
Decision Functions Logistic Regression

Output

θ1 θ2 θ3 θM

Input …
24
The Perceptron: Forward Propagation

25
Neural Network Model
[Figure: inputs Age = 34, Gender = 2, Stage = 4 (independent variables)
feed, through weights (.6, .4, .2, .1, .5, .3, .2, .8, .7, .2), two hidden
sigmoid units (Σ) and then an output sigmoid unit, which outputs 0.6,
the "Probability of being alive" (dependent variable / prediction).
Independent variables → weights → hidden layer → weights → dependent variable (prediction)]

© Eric Xing @ CMU, 2006-2011 26
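A small Python sketch of the forward pass in the figure above: two hidden sigmoid units feeding one output sigmoid unit. The weight values are read off the figure, but their exact assignment to connections (and the unscaled inputs) are my assumptions.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs from the slide: Age = 34, Gender = 2, Stage = 4 (unscaled, for illustration).
x = [34, 2, 4]
w_hidden = [[0.6, 0.1, 0.7],      # assumed weights into hidden unit 1
            [0.2, 0.3, 0.2]]      # assumed weights into hidden unit 2
w_out = [0.8, 0.5]                # assumed weights from hidden units to the output unit

h = [sigmoid(sum(wi * xi for wi, xi in zip(w_row, x))) for w_row in w_hidden]
p = sigmoid(sum(wi * hi for wi, hi in zip(w_out, h)))   # "Probability of being alive"
print(h, p)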


“Combined logistic models”

[Figure: one hidden sigmoid unit of the same network highlighted, with its
incoming weights from Age = 34, Gender = 2, Stage = 4 and its weight to the
output sigmoid unit, which outputs 0.6 = "Probability of being alive"]

© Eric Xing @ CMU, 2006-2011 27


[Figure: the other hidden sigmoid unit highlighted, with its incoming
weights from Age = 34, Gender = 2, Stage = 4 and its weight to the output
sigmoid unit, which outputs 0.6 = "Probability of being alive"]

© Eric Xing @ CMU, 2006-2011 28


[Figure: the two hidden sigmoid units combined: inputs Age = 34,
Gender = 1, Stage = 4 feed both hidden units, whose outputs feed the
output sigmoid unit, which outputs 0.6 = "Probability of being alive"]

© Eric Xing @ CMU, 2006-2011 29


Not really,
no target for hidden units...

[Figure: the full network again: inputs Age = 34, Gender = 2, Stage = 4,
two hidden sigmoid units, output 0.6 = "Probability of being alive"]

© Eric Xing @ CMU, 2006-2011 30


Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”

Logistic Regression Model (the sigmoid unit)


[Figure: a single sigmoid unit. Inputs Age = 34, Gender = 1, Stage = 4,
with coefficients 5, 4, 8, feed one sigmoid unit, which outputs
0.6 = "Probability of being alive".
Independent variables x1, x2, x3; coefficients a, b, c; dependent variable (prediction) p]
© Eric Xing @ CMU, 2006-2011 31
Decision Functions Neural Network

Output

Hidden Layer …

Input …
32
Decision Functions Neural Network

Output


Hidden Layer


Input

33
Building a Neural Net

Output

Features …

34
Building a Neural Net

Output

Hidden Layer …
D=M

1 1 1

Input …
35
Building a Neural Net

Output

Hidden Layer …
D=M

Input …
36
Building a Neural Net

Output

Hidden Layer …
D=M

Input …
37
Building a Neural Net

Output

Hidden Layer …
D<M

Input …
38
Decision Boundary

• 0 hidden layers: linear classifier


– Hyperplanes

x1 x2

Example from Eric Postma via Jason Eisner 39


Decision Boundary

• 1 hidden layer
– Boundary of convex region (open or closed)

x1 x2

Example from Eric Postma via Jason Eisner 40


Decision Boundary

y
• 2 hidden layers
– Combinations of convex regions

x1 x2

Example from Eric Postma via Jason Eisner 41


Decision Functions Multi-Class Output

Output …

Hidden Layer …

Input …
42
Decision Functions Deeper Networks

Next lecture:

Output


Hidden Layer 1


Input

43
Decision Functions Deeper Networks

Next lecture:

Output


Hidden Layer 2


Hidden Layer 1


Input

44
Decision Functions Deeper Networks

Next lecture:
Making the neural networks deeper

Output

Hidden Layer 3

Hidden Layer 2

Hidden Layer 1

Input

45
Decision Functions Different Levels of Abstraction

• We don’t know
the “right”
levels of
abstraction
• So let the model
figure it out!

46
Example from Honglak Lee (NIPS 2010)
Decision Functions Different Levels of Abstraction

Face Recognition:
– Deep Network
can build up
increasingly
higher levels of
abstraction
– Lines, parts,
regions

47
Example from Honglak Lee (NIPS 2010)
Decision Functions Different Levels of Abstraction
Output


Hidden Layer 3


Hidden Layer 2


Hidden Layer 1


Input

48
Example from Honglak Lee (NIPS 2010)
ARCHITECTURES

49
Neural Networks Properties
• Theorem (Universal Function Approximators). A two-layer
neural network with a sufficient number of neurons can
approximate any continuous function to any desired accuracy.

• Practical considerations
– Can be seen as learning the features
– Large number of neurons
• Danger of overfitting
• (hence early stopping!)
Neural Network Architectures
Even for a basic Neural Network, there are many design
decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
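To make these design decisions concrete, here is a hedged PyTorch sketch (PyTorch is one of the frameworks listed later in these slides); the particular depth, width, activation, input/output sizes, and loss are illustrative choices, not recommendations from the slides.

import torch.nn as nn

# The four design decisions made explicit (all values are illustrative):
depth = 2             # 1. number of hidden layers
width = 64            # 2. number of units per hidden layer
activation = nn.ReLU  # 3. type of activation function (nonlinearity)
loss_fn = nn.CrossEntropyLoss()   # 4. form of objective function (classification)

layers, d_in, d_out = [], 10, 3   # assumed input/output dimensions
for _ in range(depth):
    layers += [nn.Linear(d_in, width), activation()]
    d_in = width
layers.append(nn.Linear(d_in, d_out))
model = nn.Sequential(*layers)
print(model)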

51
Activation Functions
Neural Network with sigmoid
activation functions

Output


Hidden Layer


Input

52
Activation Functions
Neural Network with arbitrary
nonlinear activation functions

Output


Hidden Layer


Input

53
Activation Functions
Sigmoid / Logistic Function

logistic(u) ≡ 1 / (1 + e^(−u))

So far, we’ve assumed that the activation function (nonlinearity) is
always the sigmoid function…

54
Activation Functions
• A new change: modifying the nonlinearity
– The logistic is not widely used in modern ANNs

Alternate 1: tanh

Like the logistic function, but shifted to the range [−1, +1]

Slide from William Cohen


Activation Functions
• A new change: modifying the nonlinearity
– ReLU is often used in vision tasks

Alternate 2: rectified linear unit (ReLU)

Linear with a cutoff at zero
(Implementation: clip the gradient when you pass zero)

Slide from William Cohen


Activation Functions
• A new change: modifying the nonlinearity
– ReLU is often used in vision tasks

Alternate 2: rectified linear unit (ReLU)

Soft version (softplus): log(exp(x) + 1)

Doesn’t saturate (at one end)
Sparsifies outputs
Helps with the vanishing gradient problem

Slide from William Cohen


Common Activation Functions

[source: MIT 6.S191 introtodeeplearning.com]
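A small numpy sketch of the activation functions discussed in this section (sigmoid/logistic, tanh, ReLU, and the softplus "soft version"); the test values are arbitrary.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))      # logistic(u) = 1 / (1 + e^(-u)), range (0, 1)

def tanh(u):
    return np.tanh(u)                    # like the logistic, but shifted to (-1, +1)

def relu(u):
    return np.maximum(0.0, u)            # linear with a cutoff at zero

def softplus(u):
    return np.log(np.exp(u) + 1.0)       # soft version of ReLU: log(exp(u) + 1)

u = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, softplus):
    print(f.__name__, f(u))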


Objective Functions for NNs
• Regression:
– Use the same objective as Linear Regression
– Quadratic loss (i.e. mean squared error)
• Classification:
– Use the same objective as Logistic Regression
– Cross-entropy (i.e. negative log likelihood)
– This requires probabilities, so we add an additional
“softmax” layer at the end of our network
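A minimal numpy sketch of the classification objective described above: a softmax layer turns the network's output scores into probabilities, and cross-entropy is the negative log-likelihood of the correct class. The scores and label are made up.

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(probs, true_class):
    # Negative log-likelihood of the correct class.
    return -np.log(probs[true_class])

scores = np.array([2.0, 0.5, -1.0])   # illustrative network outputs for 3 classes
probs = softmax(scores)
print(probs, cross_entropy(probs, true_class=0))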

59
Multi-Class Output

Output …

Hidden Layer …

Input …
60
Multi-Class Output
Softmax: p(y = k) = exp(z_k) / Σ_j exp(z_j)


Output


Hidden Layer


Input

61
Cross-entropy vs. Quadratic loss

Figure from Glorot & Bengio (2010)


Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD
(take small steps opposite the gradient)

63
BACKPROPAGATION

64
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD
(take small steps opposite the gradient)

65
Training Backpropagation

• Question 1:
When can we compute the gradients of the parameters of
an arbitrary neural network?

• Question 2:
When can we make the gradient computation efficient?

66
Training Chain Rule

Given:
Chain Rule:

67
Training Chain Rule

Given:
Chain Rule:

Backpropagation is just repeated
application of the chain rule.

68
Training Chain Rule
Given:
Chain Rule:

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each
intermediate quantity is a node
2. At each node, store (a) the quantity computed in the forward pass
and (b) the partial derivative of the goal with respect to that node’s
intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its
contribution to the partial derivatives of its parents

This algorithm is also called reverse-mode automatic differentiation.

69
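A toy Python sketch of these steps on a very small computation graph: a single sigmoid unit with squared error. The graph, values, and variable names are illustrative assumptions; each backward line applies the chain rule, passing a node's partial derivative on to its parents.

import math

# Tiny computation graph: z = w*x + b, a = sigmoid(z), J = (a - y)^2.
# Forward pass: compute and store each intermediate quantity (step 2a).
x, y = 2.0, 1.0
w, b = 0.5, -1.0
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))
J = (a - y) ** 2

# Backward pass: visit nodes in reverse topological order (step 4),
# adding each node's contribution to the partials of its parents.
dJ = 1.0                      # dJ/dJ
da = dJ * 2.0 * (a - y)       # dJ/da
dz = da * a * (1.0 - a)       # dJ/dz  (derivative of the sigmoid)
dw = dz * x                   # dJ/dw
db = dz * 1.0                 # dJ/db
print(J, dw, db)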
Training Backpropagation

70
Training Backpropagation

71
Training Backpropagation
Output

Case 1: Logistic Regression

θ1 θ2 θ3 θM

Input

72
Training Backpropagation

Output


Hidden Layer


Input

73
Training Backpropagation

Output


Hidden Layer


Input

74
Training Backpropagation
Case 2:
Neural
Network

75
Training Chain Rule
Given:
Chain Rule:

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each
intermediate quantity is a node
2. At each node, store (a) the quantity computed in the forward pass
and (b) the partial derivative of the goal with respect to that node’s
intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its
contribution to the partial derivatives of its parents

This algorithm is also called reverse-mode automatic differentiation.

76
Training Chain Rule
Given:
Chain Rule:

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each
node represents a Tensor.
2. At each node, store (a) the quantity computed in the forward pass
and (b) the partial derivatives of the goal with respect to that node’s
Tensor.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its
contribution to the partial derivatives of its parents

This algorithm is also called reverse-mode automatic differentiation.

77
Training Backpropagation
Case 2:
Neural
Module 5
Network

Module 4

Module 3

Module 2

Module 1

78
Background: A Recipe for Machine Learning (Gradients)
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD
(take small steps opposite the gradient)

Backpropagation can compute this gradient!
And it’s a special case of a more general algorithm called
reverse-mode automatic differentiation that can compute the
gradient of any differentiable function efficiently!

79
Automatic Differentiation
• Automatic differentiation software
– e.g. Theano, TensorFlow, PyTorch, Chainer
– Only need to program the function g(x,y,w)
– Can automatically compute all derivatives w.r.t. all entries in w
– This is typically done by caching info during forward computation pass
of f, and then doing a backward pass = “backpropagation”
– Autodiff / Backpropagation can often be done at computational cost
comparable to the forward pass
• You need to know that this exists
• How is this done?
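A minimal sketch of what such software does, using PyTorch (one of the tools named above): you only program the forward function, and the backward pass fills in the gradients with respect to the weights. The function and numbers are illustrative.

import torch

# Program only the function; autodiff computes the derivatives w.r.t. w for us.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor(1.0)
w = torch.tensor([0.1, -0.2, 0.3], requires_grad=True)

g = (torch.sigmoid(x @ w) - y) ** 2    # forward pass (intermediate info is cached)
g.backward()                           # backward pass = "backpropagation"
print(w.grad)                          # dg/dw for every entry of w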
Libraries and Frameworks for Deep Learning
• Many deep learning libraries and frameworks have been
developed (as of 2017):
– TensorFlow (Google)
– PyTorch (Facebook)
– Theano* (LISA Lab, University of Montreal)
– MXNet (Amazon)
– CNTK (Microsoft)
– Caffe
– Deeplearning4j
– …

81
Summary
1. Neural Networks…
– provide a way of learning features
– are highly nonlinear prediction functions
– (can be) a highly parallel network of logistic regression classifiers
– discover useful hidden representations of the input
2. Backpropagation…
– provides an efficient way to compute gradients
– is a special case of reverse-mode automatic differentiation

82
Demo
• http://playground.tensorflow.org

83
Reference
• Lecture slides, Introduction to Machine Learning, CMU, 2016.
• Introduction to Deep Learning, MIT, 2017.

84
