
Machine Learning Department of Electrical Engineering

Artificial Neural Networks



Neuron



Neuron in the Brain



Artificial Neuron or Perceptron

[Figure: perceptron with inputs x1, x2, ..., xd and weights w1, w2, ..., wd feeding a summation unit Σ; the weighted sum passes through an activation to give the output]


Artificial Neuron or Perceptron



Classification with Perceptron



Learning the weights for Perceptron



Gradient Descent Perceptron
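A minimal sketch of perceptron training: the classic mistake-driven update, which coincides with stochastic gradient descent on the perceptron criterion. The learning rate, epoch count, and all names are illustrative:

```python
def train_perceptron(X, T, lr=0.1, epochs=100):
    """Learn weights w and bias b so that sign(w·x + b) matches each target t in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, T):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = 1 if s >= 0 else -1
            if y != t:  # update only on misclassified examples
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

# Learn the OR function on ±1 inputs/targets
X = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
T = [-1, 1, 1, 1]
w, b = train_perceptron(X, T)
```

Because OR is linearly separable, the perceptron convergence theorem guarantees this loop finds a separating weight vector.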



Example



Example



Perceptron & Logistic Regression
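The relationship can be stated compactly (standard formulation; the perceptron thresholds the same linear score that logistic regression squashes):

```latex
\text{Perceptron:}\quad y = \operatorname{sign}(\mathbf{w}^\top \mathbf{x}) \in \{-1,+1\}
\qquad
\text{Logistic regression:}\quad y = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1+e^{-\mathbf{w}^\top \mathbf{x}}} \in (0,1)
```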



Example



Example



Combining Perceptrons

x1  x2  OR          x1  x2  OR
 0   0   0          -1  -1  -1
 0   1   1          -1  +1  +1
 1   0   1          +1  -1  +1
 1   1   1          +1  +1  +1

x1  x2  AND         x1  x2  AND
 0   0   0          -1  -1  -1
 0   1   0          -1  +1  -1
 1   0   0          +1  -1  -1
 1   1   1          +1  +1  +1
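Both tables are realized by single perceptrons with hand-picked weights (one standard choice on ±1 inputs; the specific values are illustrative):

```python
def sign_unit(x1, x2, w1, w2, b):
    """Single perceptron on ±1 inputs with signum activation."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else -1

def OR(x1, x2):   # fires unless both inputs are -1
    return sign_unit(x1, x2, 1, 1, 1)

def AND(x1, x2):  # fires only when both inputs are +1
    return sign_unit(x1, x2, 1, 1, -1)
```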



XOR & XNOR Functions
x1  x2  XOR         x1  x2  XOR
 0   0   0          -1  -1  -1
 0   1   1          -1  +1  +1
 1   0   1          +1  -1  +1
 1   1   0          +1  +1  -1

x1  x2  XNOR        x1  x2  XNOR
 0   0   1          -1  -1  +1
 0   1   0          -1  +1  -1
 1   0   0          +1  -1  -1
 1   1   1          +1  +1  +1

• The data presented by the XOR and XNOR functions is not linearly separable.
• A single perceptron is unable to classify this data.
The Multi-Layer Perceptron for XOR

[Figure: two-layer perceptron computing XOR; inputs x1, x2 (with a bias input 1 at each layer) feed two hidden units, which feed a single output unit. Input Layer, Hidden Layer, Output Layer]
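One standard realization of such a network on ±1 inputs: one hidden unit fires for (+1, −1), the other for (−1, +1), and the output unit ORs them. The weights below are illustrative, not taken from the slides:

```python
def unit(inputs, weights, b):
    """Perceptron unit on ±1 inputs with signum activation."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + b >= 0 else -1

def XOR(x1, x2):
    h1 = unit([x1, x2], [1, -1], -1)   # fires only for (+1, -1)
    h2 = unit([x1, x2], [-1, 1], -1)   # fires only for (-1, +1)
    return unit([h1, h2], [1, 1], 1)   # OR of the two hidden units
```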



The Multi-Layer Perceptron for XNOR

[Figure: two-layer perceptron computing XNOR; inputs x1, x2 (with a bias input 1 at each layer) feed two hidden units, which feed a single output unit. Input Layer, Hidden Layer, Output Layer]



Neural Network Intuition

[Figure: network with Input Layer, Hidden Layer 1, Hidden Layer 2, and Output Layer]



Neural Network Intuition



Multi-Class Classification

[Figure: network with an input layer, Hidden Layer 1, Hidden Layer 2, and an output layer of four units, one per class; Output = Class 1, Class 2, Class 3, or Class 4]



Activation Functions
• Step Function: h(a) = 1 if a ≥ 0, else 0 (jumps from 0 to 1 at the origin)

• Signum Function: h(a) = +1 if a ≥ 0, else −1 (jumps from −1 to +1 at the origin)



Activation Functions
• Sigmoid Function: σ(a) = 1 / (1 + e^(−a)), a smooth squashing of the reals into (0, 1)

• Hyperbolic Tangent Function: tanh(a), a smooth squashing of the reals into (−1, +1)



Activation Functions

• ReLU Function: h(a) = max(0, a)

• Identity Function: h(a) = a
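The activation functions from these slides, side by side (a sketch using only Python's standard `math` module):

```python
import math

def step(a):      return 1.0 if a >= 0 else 0.0          # 0/1 threshold
def signum(a):    return 1.0 if a >= 0 else -1.0         # ±1 threshold
def sigmoid(a):   return 1.0 / (1.0 + math.exp(-a))      # smooth, range (0, 1)
def tanh(a):      return math.tanh(a)                    # smooth, range (-1, 1)
def relu(a):      return max(0.0, a)                     # rectified linear
def identity(a):  return a                               # linear output unit
```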



Renaissance of Neural Networks

• Rebranding/Renaming
• ReLU
• GPUs
• Stochastic Gradient Descent



Non-Linear Function Modelling with ANN

• Feed Forward Multi-Layer Neural Network

[Figure: feed-forward network; inputs x1, x2 (with bias inputs 1) feed hidden units Z1, Z2, which feed output units Y1, Y2]



Adaptive Non-Linear Functions

• Non-Linear Regression
• h1: Non-Linear Function
• h2: Identity

• Non-Linear Classification
• h1: Non-Linear Function
• h2: Sigmoid



Optimization

• Error Minimization
• Back Propagation
• Maximum Likelihood
• Maximum A Posteriori
• Bayesian Learning



Least Square Error

• Error Function

We are optimizing a linear combination of non-linear functions (regression).
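The error function referenced above was lost in extraction; for regression it is presumably the standard sum-of-squares error over N training pairs (x_n, t_n):

```latex
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(y(\mathbf{x}_n,\mathbf{w}) - t_n\bigr)^{2}
```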



Gradient Descent

• For each example, adjust the weights as follows:

• How can we compute the gradient efficiently, given an arbitrary network structure?
• Back Propagation
• Automatic Differentiation
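The per-example adjustment elided above is presumably the standard stochastic gradient-descent update with learning rate η:

```latex
w_{ji} \leftarrow w_{ji} - \eta\,\frac{\partial E_n}{\partial w_{ji}}
```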



Back Propagation Algorithm

• Two Phases
• Forward Phase: Compute output Zj of each unit j.

• Backward Phase: Compute δj (error) at unit j.



Forward Phase
2 Input Units, 2 Hidden Units, 2 Output Units

[Figure: inputs x1, x2 (with bias inputs 1) feed hidden units Z1, Z2, which feed output units Z3, Z4]



Backward Phase

• Use the chain rule to recursively compute the gradient.

• For each weight wji:

[Figure: same 2-2-2 network as in the forward phase; x1, x2 feed Z1, Z2, which feed Z3, Z4]
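The two phases can be sketched for the 2-2-2 network above. This sketch assumes sigmoid hidden units, linear output units, and squared error; those choices, and all names, are illustrative:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward_backward(x, t, W1, b1, W2, b2):
    """One forward + backward pass for a 2-2-2 network.
    Returns the gradients dE/dW1 and dE/dW2 for squared error E."""
    # Forward phase: compute the output z_j of each unit j
    a1 = [sum(W1[j][i] * x[i] for i in range(2)) + b1[j] for j in range(2)]
    z  = [sigmoid(a) for a in a1]                                            # hidden Z1, Z2
    y  = [sum(W2[k][j] * z[j] for j in range(2)) + b2[k] for k in range(2)]  # outputs Z3, Z4

    # Backward phase: compute the error delta_j at each unit j
    d_out = [y[k] - t[k] for k in range(2)]                  # linear output units
    d_hid = [z[j] * (1 - z[j]) * sum(W2[k][j] * d_out[k] for k in range(2))
             for j in range(2)]                              # chain rule through sigmoid

    # Gradient w.r.t. each weight w_ji is delta_j times the input feeding it
    gW2 = [[d_out[k] * z[j] for j in range(2)] for k in range(2)]
    gW1 = [[d_hid[j] * x[i] for i in range(2)] for j in range(2)]
    return gW1, gW2
```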



Backward Phase



Example with tanh(.) Activation Function

• Forward Propagation
• Hidden Units aj=
• Output Units ak=

• Backward Propagation
• Output Units δk:
• Hidden Units δj:
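The expressions elided above are presumably the standard ones for a tanh hidden layer (assuming linear output units and sum-of-squares error):

```latex
a_j = \sum_i w_{ji} x_i,\quad z_j = \tanh(a_j),\qquad
a_k = \sum_j w_{kj} z_j,\quad y_k = a_k
\\[4pt]
\delta_k = y_k - t_k,\qquad
\delta_j = \bigl(1 - z_j^{2}\bigr)\sum_k w_{kj}\,\delta_k
```

The factor 1 − z_j² is tanh′(a_j), which is why tanh keeps the backward pass cheap: the derivative reuses the forward activation.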



Deep Neural Networks

• Definition: Neural networks with many hidden layers

• Advantage: High expressivity

• Challenges:
• How do we train a deep neural network?
• How can we avoid overfitting?



Expressiveness

• Neural networks with one hidden layer of sigmoid/hyperbolic units can approximate arbitrarily closely neural networks with several layers of sigmoid/hyperbolic units.

• As we increase the number of layers, the number of units needed may decrease exponentially (with the number of layers).



Example: Parity Function

• Single layer of hidden nodes (2^(n−1) units)

[Figure: inputs x1, x2, x3, x4 each feed every one of the 2^(n−1) AND units, whose outputs feed a single OR unit]


Example: Parity Function

• 2n−2 layers of hidden nodes

[Figure: inputs x1, x2, x3, x4 combined through a chain of alternating AND and OR units (with bias inputs 1), giving 2n−2 hidden layers]



Vanishing Gradients

• Deep neural networks of sigmoid and hyperbolic units often suffer from vanishing gradients.

[Figure: gradient magnitude across layers; small gradient in the early layers, medium gradient in the middle, large gradient near the output]



Sigmoid & Hyperbolic Units

• The derivative is always less than 1 (at most 1/4 for the sigmoid, at most 1 for tanh)



Example

x → h1 → h2 → h3 → y  (weights w1, w2, w3, w4)

• Common weight initialization {-1, 1}
• The sigmoid function and its derivative are always less than 1
• This leads to vanishing gradients
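The shrinkage can be seen numerically: by the chain rule, the gradient through the chain is a product of per-layer factors w · σ′(a), and σ′(a) = σ(a)(1 − σ(a)) never exceeds 1/4. A sketch with illustrative weights:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def chain_gradient(weights, x):
    """Gradient dy/dx through a chain x -> h1 -> h2 -> ... -> y of sigmoid units.
    Each factor w * sigma'(a) has magnitude at most |w| / 4, so with
    weights of magnitude 1 the product shrinks geometrically with depth."""
    grad, h = 1.0, x
    for w in weights:
        a = w * h
        s = sigmoid(a)
        grad *= w * s * (1.0 - s)  # sigma'(a) = s * (1 - s) <= 1/4
        h = s
    return grad

g = chain_gradient([1.0, 1.0, 1.0, 1.0], 0.5)  # 4-layer chain: g is below (1/4)^4
```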



Avoiding Vanishing Gradients

• Several popular solutions

• Skip Connections
• Batch Normalization
• Rectified Linear Units (ReLU as activation function)



Rectified Linear Units

• Rectified Linear Unit: h(a) = max(0, a)

• Its gradient is either 0 or 1, never strictly in between, so active units pass gradients through undiminished

• Soft version, "softplus": h(a) = log(1 + e^a)

• Warning: softplus does not prevent gradient vanishing; its derivative is the sigmoid, which is always below 1
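The contrast is visible in the derivatives themselves (a sketch; note that the softplus derivative is exactly the sigmoid):

```python
import math

def relu_grad(a):
    """d/da of max(0, a): exactly 1 for positive inputs, 0 otherwise."""
    return 1.0 if a > 0 else 0.0

def softplus_grad(a):
    """d/da of log(1 + e^a) = sigmoid(a), always strictly less than 1."""
    return 1.0 / (1.0 + math.exp(-a))

# Through active ReLU units each gradient factor is exactly 1 and never decays;
# through softplus units each factor is below 1, so deep products can still vanish.
```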

