2020 - Neural Network

Artificial Intelligence
Neural Networks
Nguyen Van Vinh

UET-VNU
1
Outline
• Neural Networks
• Backpropagation
• Demo
2
NEURAL NETWORKS
3
Learning highly non-linear functions
f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables
Examples: the XOR gate, speech recognition
© Eric Xing @ CMU, 2006-2011
4

Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
[Figure: a biological neuron (dendrites, soma, axon, synapses) beside an artificial neuron in which inputs x1 and x2, weighted by w1 and w2, feed a linear combiner followed by a hard limiter with threshold θ that produces the output Y]
• Activation function

$$X = \sum_{i=1}^{n} x_i w_i, \qquad Y = \begin{cases} +1, & \text{if } X \ge \theta \\ -1, & \text{if } X < \theta \end{cases}$$

• Artificial neuron networks
– supervised learning
– gradient descent
[Figure: layered network in which input signals enter the input layer, pass through the middle layer, and leave the output layer as output signals]
5
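The artificial neuron on this slide (a linear combiner followed by a hard limiter) can be sketched in a few lines. This is only an illustration: the function name `perceptron_output` and the AND weights and threshold are assumptions chosen for the example, not values from the slides.

```python
def perceptron_output(inputs, weights, threshold):
    """Linear combiner followed by a hard limiter."""
    # Linear combiner: X = sum_i x_i * w_i
    x_sum = sum(x * w for x, w in zip(inputs, weights))
    # Hard limiter: +1 if the weighted sum reaches the threshold, else -1
    return 1 if x_sum >= threshold else -1


# Hypothetical weights and threshold that make the unit act like logical AND
and_weights = [1.0, 1.0]
and_threshold = 1.5
and_outputs = [perceptron_output([a, b], and_weights, and_threshold)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

Only the input (1, 1) pushes the weighted sum past the threshold, so a single unit like this can separate AND but not XOR.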
Connectionist Models
• Consider humans:
– Neuron switching time: ~0.001 second
– Number of neurons: ~10^10
– Connections per neuron: ~10^4 to 10^5
– Scene recognition time: ~0.1 second
– 100 inference steps (0.1 second / 0.001 second) doesn't seem like enough
→ much parallel computation
• Properties of artificial neural nets (ANN)
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed processes
6

Motivation Why is everyone talking about Deep Learning?
• 2016: year of deep learning
7
Motivation
8
Motivation
• Because a lot of money is invested in it…
– DeepMind: Acquired by Google for $400 million
– DNNResearch: Three-person startup (including Geoff Hinton) acquired by Google for unknown price tag
– Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars
• Because it made the front page of the New York Times
9
Motivation Why Deep Learning?
10
Motivation Why now?
2006
2016
11
Background A Recipe for Machine Learning
1. Given training data: [Figure: three example images labeled Face, Face, Not a face]
2. Choose each of these:
– Decision function
Examples: Linear regression, Logistic regression, Neural Network
– Loss function
Examples: Mean-squared error, Cross Entropy
12
Background A Recipe for Machine Learning
1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)
13
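Step 4 of the recipe (take small steps opposite the gradient) can be made concrete with a toy problem. The sketch below assumes a one-dimensional linear decision function and a squared-error loss. The function name, learning rate, epoch count, and data are all illustrative choices, not part of the lecture.

```python
import random


def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by SGD on the squared error of one example at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                # visit the examples in random order
        for x, y in data:
            error = (w * x + b) - y      # prediction minus target
            grad_w = 2.0 * error * x     # d(error^2)/dw
            grad_b = 2.0 * error         # d(error^2)/db
            w -= lr * grad_w             # small step opposite the gradient
            b -= lr * grad_b
    return w, b


# Toy noiseless data generated from y = 2x + 1 (an assumption for the demo)
toy_data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w_fit, b_fit = sgd_linear_regression(toy_data)
print(w_fit, b_fit)
```

Because the toy data is noiseless, the fitted parameters should approach the generating values 2 and 1.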
Background A Recipe for Machine Learning
1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)
Gradients
Backpropagation can compute this gradient! And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
14
Background A Recipe for Machine Learning
Goals for this lecture
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
15
Decision Functions Linear Regression
Output
θ1 θ2 θ3 θM
Input …
16
Decision Functions Logistic Regression
Output
θ1 θ2 θ3 θM
Input …
17
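As a rough sketch of the two decision functions on these slides: linear regression outputs the weighted sum θ·x directly, while logistic regression passes that sum through the logistic function to obtain a probability. The parameter and input values below are hypothetical, and the function names are illustrative.

```python
import math


def logistic(u):
    """Logistic (sigmoid) function 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def linear_regression(theta, x):
    """Decision function of linear regression: the weighted sum theta . x."""
    return sum(t * xi for t, xi in zip(theta, x))


def logistic_regression(theta, x):
    """Decision function of logistic regression: logistic(theta . x)."""
    return logistic(linear_regression(theta, x))


theta_example = [0.5, -0.25, 0.0]   # hypothetical parameters
x_example = [1.0, 2.0, 3.0]         # hypothetical input
score = linear_regression(theta_example, x_example)
p = logistic_regression(theta_example, x_example)
print(score, p)
```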
Output
Face Face Not a face
θ1 θ2 θ3 θM
Input …
18
Output
1 1 0
y
x2
θ1 θ2 θ3 θM
x1
Input …
19
Output
θ1 θ2 θ3 θM
Input …
20
The Perceptron: Forward Propagation
21
Neural Network Model
[Figure: inputs (independent variables) Age = 34, Gender = 2, Stage = 4 connect through weights to a hidden layer of two Σ units, which connects through weights to one output Σ unit. Output (dependent variable, prediction): 0.6, the “Probability of being Alive”.]
22
“Combined logistic models”
[Figure: the same network with a subset of the weights shown. Output: 0.6, the “Probability of being Alive”.]
23
[Figure: the same network with a different subset of the weights shown. Output: 0.6, the “Probability of being Alive”.]
24
[Figure: the same network with all weights shown and input Gender = 1. Output: 0.6, the “Probability of being Alive”.]
25
Not really, no target for hidden units...
[Figure: the full network with all weights shown, as on the Neural Network Model slide.]
26

Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
Logistic Regression Model (the sigmoid unit)
[Figure: inputs Age = 34, Gender = 1, Stage = 4 (independent variables x1, x2, x3) are weighted by coefficients a, b, c and fed to a single Σ unit. Output (dependent variable p, prediction): 0.6, the “Probability of being Alive”.]
27
Decision Functions Neural Network
Output
Hidden Layer …
Input …
28
Decision Functions Neural Network
Output
…
Hidden Layer
…
Input
29
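A forward pass through the one-hidden-layer network on these slides might look like the following sketch. It assumes sigmoid units in both layers and omits bias terms. The names `alpha` and `beta` for the two weight sets, and all numeric values, are assumptions made for illustration.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward(x, hidden_weights, output_weights):
    """Forward pass: input -> hidden layer of sigmoid units -> sigmoid output."""
    # Each hidden unit is its own logistic regression over the inputs
    z = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    # The output unit is a logistic regression over the hidden activations
    y = logistic(sum(b * zj for b, zj in zip(output_weights, z)))
    return z, y


# Hypothetical weights: 2 hidden units, 3 inputs
alpha = [[0.1, -0.2, 0.3], [0.0, 0.0, 0.0]]
beta = [1.0, -1.0]
hidden, y_hat = forward([1.0, 2.0, 1.0], alpha, beta)
print(hidden, y_hat)
```

Viewed this way, the network is a stack of logistic regressions in which the hidden units produce the features that the output unit consumes.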
Building a Neural Net
Output
Features …
30
Output
Hidden Layer …
D=M
1 1 1
Input …
31
Output
Hidden Layer …
D=M
Input …
32
Output
Hidden Layer …
D=M
Input …
33
Output
Hidden Layer …
D<M
Input …
34
Decision Boundary
• 0 hidden layers: linear classifier

– Hyperplanes
x1 x2
Example from Eric Postma via Jason Eisner
35

Decision Boundary
• 1 hidden layer
– Boundary of convex region (open or closed)
x1 x2

Decision Boundary
y
• 2 hidden layers
– Combinations of convex regions
x1 x2
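To see why a hidden layer matters for the decision boundary, consider XOR, which no single hyperplane separates. The sketch below uses hand-picked (not learned) weights and a 0/1 step activation, both of which are assumptions for illustration. Two hidden threshold units carve out the strip between two parallel lines, which is a convex region.

```python
def step(u):
    """Threshold activation: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0


def xor_net(x1, x2):
    """One hidden layer with two threshold units, weights chosen by hand."""
    h_or = step(x1 + x2 - 0.5)        # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)       # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)   # OR but not AND, i.e. XOR


xor_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(xor_table)
```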

Decision Functions Multi-Class Output
Output …
Hidden Layer …
Input …
38
Decision Functions Deeper Networks
Next lecture:
Output
…
Hidden Layer 1
…
Input
39
Next lecture:
Output
…
Hidden Layer 2
…
Hidden Layer 1
…
Input
40
Next lecture: Making the neural networks deeper
Output
…
Hidden Layer 3
…
Hidden Layer 2
…
Hidden Layer 1
…
Input
41
Decision Functions Different Levels of Abstraction
• We don’t know the “right” levels of abstraction
• So let the model figure it out!
42
Example from Honglak Lee (NIPS 2010)
Face Recognition:
– Deep Network can build up increasingly higher levels of abstraction
– Lines, parts, regions
43
Output
…
Hidden Layer 3
…
Hidden Layer 2
…
Hidden Layer 1
…
Input
44
ARCHITECTURES
45
Neural Network Architectures
Even for a basic Neural Network, there are many design
decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
46
Activation Functions
Neural Network with sigmoid
activation functions
Output
…
Hidden Layer
…
Input
47
Neural Network with arbitrary
nonlinear activation functions
Output
…
Hidden Layer
…
Input
48
Sigmoid / Logistic Function

$$\text{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$$

So far, we’ve assumed that the activation function (nonlinearity) is always the sigmoid function…
49
• A new change: modifying the nonlinearity
– The logistic is not widely used in modern ANNs
Alternate 1: tanh
Like logistic function but rescaled and shifted to range [-1, +1]
Slide from William Cohen
– reLU often used in vision tasks
Alternate 2: rectified linear unit
Linear with a cutoff at zero
(Implementation: clip the gradient when you pass zero)
Soft version: log(exp(x)+1)
Doesn’t saturate (at one end)
Sparsifies outputs
Helps with vanishing gradient
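The nonlinearities discussed on these slides are short enough to write out directly. The sketch below is illustrative only. The function names are assumptions, and `relu_grad` is one way to read "clip the gradient when you pass zero": the gradient is taken as zero for non-positive inputs.

```python
import math


def logistic(u):
    """Sigmoid: squashes to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def tanh(u):
    """Like the logistic, but with range (-1, +1)."""
    return math.tanh(u)


def relu(u):
    """Rectified linear unit: linear with a cutoff at zero."""
    return max(0.0, u)


def relu_grad(u):
    """Gradient of ReLU: 1 on the positive side, taken as 0 otherwise."""
    return 1.0 if u > 0 else 0.0


def softplus(u):
    """Soft version of ReLU: log(exp(u) + 1)."""
    return math.log(math.exp(u) + 1.0)


for u in (-2.0, 0.0, 2.0):
    print(u, logistic(u), tanh(u), relu(u), softplus(u))
```

The tanh and logistic functions are related by tanh(u) = 2·logistic(2u) − 1, which is the sense in which tanh is a rescaled, shifted logistic.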

Objective Functions for NNs
• Regression:
– Use the same objective as Linear Regression
– Quadratic loss (i.e. mean squared error)
• Classification:
– Use the same objective as Logistic Regression
– Cross-entropy (i.e. negative log likelihood)
– This requires probabilities, so we add an additional
“softmax” layer at the end of our network
53
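The two objectives named above can be sketched as follows. The softmax turns arbitrary output scores into probabilities so that cross-entropy (negative log likelihood of the true class) is defined. The function names and the example scores are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn real-valued scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log likelihood of the true class."""
    return -math.log(probs[true_class])


def quadratic_loss(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


probs = softmax([2.0, 1.0, 0.1])             # hypothetical output-layer scores
loss = cross_entropy(probs, 0)
print(probs, loss)
```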
Multi-Class Output
Output …
Hidden Layer …
Input …
54
Multi-Class Output
Softmax:
…
Output
…
Hidden Layer
…
Input
55
Cross-entropy vs. Quadratic loss
Figure from Glorot & Bengio (2010)

BACKPROPAGATION
58
Training Backpropagation
• Question 1:
When can we compute the gradients of the parameters of
an arbitrary neural network?
• Question 2:
When can we make the gradient computation efficient?
60
Training Chain Rule
Given:
Chain Rule:
61
Training Chain Rule
Given:
Chain Rule:
Backpropagation is just repeated application of the chain rule.
62
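A quick numerical check can make the chain rule concrete. The sketch below picks an arbitrary composition, y = g(u) = sin(u) with u = h(x) = x², which is an assumption for illustration and not the example from the slides. It compares the chain-rule derivative with a central finite difference.

```python
import math


def h(x):
    return x * x            # inner function u = h(x)


def g(u):
    return math.sin(u)      # outer function y = g(u)


def dy_dx_chain_rule(x):
    u = h(x)
    dy_du = math.cos(u)     # derivative of the outer function, evaluated at u
    du_dx = 2.0 * x         # derivative of the inner function, evaluated at x
    return dy_du * du_dx    # chain rule: dy/dx = dy/du * du/dx


def dy_dx_numeric(x, eps=1e-6):
    """Central finite-difference approximation of dy/dx."""
    return (g(h(x + eps)) - g(h(x - eps))) / (2.0 * eps)


x0 = 1.3
analytic = dy_dx_chain_rule(x0)
numeric = dy_dx_numeric(x0)
print(analytic, numeric)
```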
Training Chain Rule
Given:
Chain Rule:
…
Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each
intermediate quantity is a node
2. At each node, store (a) the quantity computed in the forward pass
and (b) the partial derivative of the goal with respect to that node’s
intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its
contribution to the partial derivatives of its parents
This algorithm is also called reverse-mode automatic differentiation.
63
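The four steps above can be sketched as a tiny scalar reverse-mode differentiator. This is a minimal illustration under stated assumptions, not how any particular library implements it: the `Node` class, the three operations, and the goal function x·y + sin(x) are all hypothetical. Here a node's "parents" are the nodes it was computed from.

```python
import math


class Node:
    def __init__(self, value, parents=()):
        self.value = value        # (a) quantity computed in the forward pass
        self.grad = 0.0           # (b) d(goal)/d(node), initialized to 0
        self.parents = parents    # pairs (parent_node, local_derivative)


def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def sin(a):
    return Node(math.sin(a.value), ((a, math.cos(a.value)),))


def backward(goal):
    # Build a topological order of the graph by depth-first search
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(goal)
    goal.grad = 1.0               # d(goal)/d(goal) = 1
    # Visit nodes in reverse topological order, pushing gradients to parents
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x = Node(2.0)
y = Node(3.0)
goal = add(mul(x, y), sin(x))     # goal = x*y + sin(x)
backward(goal)
print(x.grad, y.grad)
```

Because `x` feeds two nodes, its gradient accumulates two contributions, which is why step 3 initializes every partial derivative to 0 and step 4 adds rather than assigns.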
64
65
Output
Case 1: Logistic Regression
θ1 θ2 θ3 θM
Input …
66
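For the logistic regression case, the forward and backward passes might be written as below. The sketch assumes a cross-entropy objective J and a single training example. The intermediate names `a`, `y`, `J` and the numeric values are assumptions for illustration.

```python
import math


def forward(theta, x, y_star):
    """Forward pass: weighted sum, sigmoid, then cross-entropy loss."""
    a = sum(t * xi for t, xi in zip(theta, x))
    y = 1.0 / (1.0 + math.exp(-a))
    J = -(y_star * math.log(y) + (1 - y_star) * math.log(1 - y))
    return a, y, J


def backward(theta, x, y_star):
    """Backward pass: apply the chain rule from J back to each theta_j."""
    a, y, J = forward(theta, x, y_star)
    dJ_dy = -(y_star / y) + (1 - y_star) / (1 - y)
    dJ_da = dJ_dy * y * (1 - y)          # simplifies to y - y_star
    return [dJ_da * xi for xi in x]      # dJ/dtheta_j = dJ/da * x_j


theta0 = [0.2, -0.4, 0.1]                # hypothetical parameters
x0 = [1.0, 0.5, -1.5]                    # hypothetical input
grad = backward(theta0, x0, 1.0)
print(grad)
```

Each line of `backward` is one application of the chain rule, taken in the reverse of the order in which `forward` computed the quantities.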
Output
…
Hidden Layer
…
Input
67
Output
…
Hidden Layer
…
Input
68
Case 2: Neural Network
69
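For the neural network case, a sketch of backpropagation through one hidden layer follows. It assumes sigmoid units, a cross-entropy objective, no bias terms, and a single training example. The weight names `alpha` (input to hidden) and `beta` (hidden to output) and all values are hypothetical.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def nn_forward(alpha, beta, x, y_star):
    """Forward pass, keeping the hidden activations for the backward pass."""
    a = [sum(w * xi for w, xi in zip(row, x)) for row in alpha]   # hidden pre-activations
    z = [sigmoid(aj) for aj in a]                                 # hidden activations
    b = sum(bj * zj for bj, zj in zip(beta, z))                   # output pre-activation
    y = sigmoid(b)
    J = -(y_star * math.log(y) + (1 - y_star) * math.log(1 - y))
    return z, y, J


def nn_backward(alpha, beta, x, y_star):
    """Backward pass: chain rule applied layer by layer, output to input."""
    z, y, _ = nn_forward(alpha, beta, x, y_star)
    dJ_db = y - y_star                                   # sigmoid + cross-entropy
    grad_beta = [dJ_db * zj for zj in z]
    dJ_dz = [dJ_db * bj for bj in beta]
    dJ_da = [dz * zj * (1 - zj) for dz, zj in zip(dJ_dz, z)]
    grad_alpha = [[da * xi for xi in x] for da in dJ_da]
    return grad_alpha, grad_beta


alpha0 = [[0.1, -0.2], [0.3, 0.4]]   # hypothetical weights: 2 hidden units, 2 inputs
beta0 = [0.5, -0.6]
x0, y0 = [1.0, 2.0], 1.0
grad_alpha, grad_beta = nn_backward(alpha0, beta0, x0, y0)
print(grad_alpha, grad_beta)
```

The backward pass reuses the stored hidden activations `z`, so the whole gradient costs about as much as one extra forward pass.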
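Training Chain Rule
Given:
Chain Rule:
…
Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each node represents a Tensor.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivatives of the goal with respect to that node’s Tensor.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.
This algorithm is also called reverse-mode automatic differentiation.
71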
Case 2: Neural Network
[Figure: the network's computation divided into a stack of modules, Module 1 through Module 5]
72
Libraries and Frameworks for Deep Learning
• Many deep learning libraries and frameworks have been developed (notably, as of 2017):
– TensorFlow (Google)
– PyTorch (Facebook)
– Theano* (LISA lab, University of Montreal)
– MXNet (Amazon)
– CNTK (Microsoft)
– Caffe
– Deeplearning4j
– …
74
Summary
1. Neural Networks…
– provide a way of learning features
– are highly nonlinear prediction functions
– (can be) a highly parallel network of logistic regression classifiers
– discover useful hidden representations of the input
2. Backpropagation…
– provides an efficient way to compute gradients
– is a special case of reverse-mode automatic differentiation
75
Demo
• http://playground.tensorflow.org
76
Reference
• Lecture slides: Introduction to Machine Learning, CMU, 2016.
• Introduction to Deep Learning, MIT, 2017.
77
