
Very Deep Learning

Lecture 04

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
afzal.tukl@gmail.com



Recap



Recap
◼ Logistic Regression as One Layer Neural Network
◼ Computational Graphs



Computational Graph

◼ AlexNet

[Figure: computational graph of AlexNet, mapping the input image and the weights to the loss]



Computational Graph

◼ Neural Turing Machine

[Figure: computational graph of a Neural Turing Machine, mapping the input image to the loss]



Computational Graph



Computational Graph

◼ A simple example



Computational Graph

◼ Chain Rule

◼ Multivariate Chain Rule
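The formulas on this slide did not survive extraction; for reference, the standard statements of both rules are:

```latex
% Chain rule for a composition z = f(y), y = g(x):
\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}

% Multivariate chain rule, when z depends on x through y_1, ..., y_n:
\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}
```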



Computational Graph

◼ A simple example

Forward pass

Backward pass

Backward gradients in magenta, local gradients in blue


Computational Graph

◼ A simple example

Forward pass

Backward pass
[Figure: worked forward/backward pass, e.g. the node 3x² with local gradient 6x; backward gradients shown in magenta, local gradients in blue]
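A minimal numeric sketch of such a forward/backward pass, assuming the example computes y = 3x² as a chain of a squaring node and a scale-by-3 node (the function and variable names are illustrative, not from the slide):

```python
def forward_backward(x):
    # Forward pass through y = 3 * x**2, split into two nodes: a = x**2, y = 3*a.
    a = x ** 2           # square node
    y = 3.0 * a          # scale node

    # Backward pass: multiply upstream gradients by local gradients (chain rule).
    dy = 1.0             # dL/dy, starting gradient at the output
    da = dy * 3.0        # local gradient of y = 3a w.r.t. a is 3
    dx = da * 2.0 * x    # local gradient of a = x**2 w.r.t. x is 2x
    return y, dx

print(forward_backward(2.0))   # (12.0, 12.0); analytically dy/dx = 6x = 12 at x = 2
```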


Computational Graph



Computational Graph (Fan-Out >1)

◼ A slightly complicated example



Logistic Regression



Logistic Regression

◼ As a neural network



Variants of Gradient Descent

◼ Gradient Descent
^ Updates after looking at the complete dataset
◼ Minibatch Gradient Descent
^ Updates after looking at N samples (batch size)
◼ Stochastic Gradient Descent
^ Updates after looking at every single sample (batch size 1)
◼ Related Concept
^ Epoch
• one cycle through the full training dataset
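A schematic sketch of the three variants in one loop; `grad_fn`, the data, and the hyperparameters are placeholders, not from the slides. Setting the batch size to the dataset size gives vanilla gradient descent, a batch size of 1 gives stochastic gradient descent, and each outer iteration is one epoch.

```python
import numpy as np

def train(X, y, w, grad_fn, lr=0.1, batch_size=32, epochs=10):
    """Minibatch gradient descent; batch_size=len(X) recovers vanilla
    gradient descent, batch_size=1 recovers stochastic gradient descent."""
    n = len(X)
    for _ in range(epochs):                       # one epoch = one full pass over the data
        order = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_fn(w, X[idx], y[idx])        # gradient estimated on the minibatch
            w = w - lr * g                        # parameter update
    return w
```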



Logistic Regression (Decision Boundary)

◼ The decision boundary

◼ Decide for class 1

◼ Decide for class 0
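A small sketch of this rule, assuming the usual logistic-regression form ŷ = σ(wᵀx + b): deciding for class 1 when ŷ ≥ 0.5 is the same as checking wᵀx + b ≥ 0, so the decision boundary is the hyperplane wᵀx + b = 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Decide class 1 iff sigmoid(w.x + b) >= 0.5, i.e. iff w.x + b >= 0."""
    score = np.dot(w, x) + b
    return int(score >= 0.0)   # same decision as sigmoid(score) >= 0.5
```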



Simple Examples



OR

◼ Linear Classifier
  w = (1, 1)ᵀ; output 1 iff wᵀx = x₁ + x₂ > 0.5

  x₁   x₂   OR(x₁, x₂)
   0    0    0
   0    1    1
   1    0    1
   1    1    1



AND

◼ Linear Classifier
  w = (1, 1)ᵀ; output 1 iff wᵀx = x₁ + x₂ > 1.5

  x₁   x₂   AND(x₁, x₂)
   0    0    0
   0    1    0
   1    0    0
   1    1    1



NAND

◼ Linear Classifier
  w = (−1, −1)ᵀ; output 1 iff wᵀx = −x₁ − x₂ > −1.5

  x₁   x₂   NAND(x₁, x₂)
   0    0    1
   0    1    1
   1    0    1
   1    1    0
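A compact check (as a sketch) that the weight/threshold pairs above reproduce the OR, AND and NAND truth tables; the helper name `threshold_unit` is illustrative.

```python
import itertools

def threshold_unit(w, theta):
    """Linear threshold unit: output 1 iff w1*x1 + w2*x2 > theta."""
    return lambda x1, x2: int(w[0] * x1 + w[1] * x2 > theta)

OR   = threshold_unit(( 1,  1),  0.5)
AND  = threshold_unit(( 1,  1),  1.5)
NAND = threshold_unit((-1, -1), -1.5)

for x1, x2 in itertools.product([0, 1], repeat=2):
    print(x1, x2, OR(x1, x2), AND(x1, x2), NAND(x1, x2))
```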



XOR

◼ Linear Classifier: no single weight vector and threshold reproduces this table (XOR is not linearly separable)

  x₁   x₂   XOR(x₁, x₂)
   0    0    0
   0    1    1
   1    0    1
   1    1    0



A Brief History of Deep Learning

1940 1950 1960 1970 1980 1990 2000 2010 2020

◼ 1958 Perceptron (Rosenblatt)


^ find a separating hyperplane by minimizing the distance of
misclassified points to the decision boundary
^ Code the two classes as yᵢ ∈ {+1, −1}
^ Linear threshold unit

^ It was hyped in the media


• It will solve all the problems

Rosenblatt: The perceptron -a probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
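A sketch of the classic perceptron update rule with the classes coded as yᵢ ∈ {+1, −1}: whenever a point is misclassified, the weights are moved toward classifying it correctly. The data, learning rate and bias handling here are placeholders; a bias can be absorbed by appending a constant 1 feature.

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """Rosenblatt-style learning rule for labels y_i in {+1, -1}:
    update only on misclassified points (y_i * w.x_i <= 0)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified or on the boundary
                w += lr * yi * xi         # nudge the hyperplane toward x_i
    return w
```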



A Brief History of Deep Learning

1940 1950 1960 1970 1980 1990 2000 2010 2020

◼ 1969 Minsky and Papert


^ Mathematical proof of what the perceptron is capable of
• Discouraging results
^ Simple problems can't be solved
• XOR

Minsky and Papert: Perceptrons: An introduction to computational geometry. MIT Press, 1969.



XOR (Multilayer Perceptron)

◼ A linear classifier cannot represent XOR, but a multilayer perceptron with one hidden layer can

  x₁   x₂   XOR(x₁, x₂)
   0    0    0
   0    1    1
   1    0    1
   1    1    0
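One classic construction makes this concrete by reusing the gates from the previous slides: XOR(x₁, x₂) = AND(OR(x₁, x₂), NAND(x₁, x₂)), i.e. a network with two hidden threshold units and one output unit. A small sketch:

```python
def xor(x1, x2):
    # Hidden layer: OR and NAND threshold units; output layer: AND.
    h1 = int(x1 + x2 > 0.5)        # OR(x1, x2)
    h2 = int(-x1 - x2 > -1.5)      # NAND(x1, x2)
    return int(h1 + h2 > 1.5)      # AND(h1, h2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor(x1, x2))     # prints 0, 1, 1, 0
```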



Representation Matters



Neural Network Playground

◼ https://playground.tensorflow.org/



Multilayer Perceptron



Multilayer Perceptron



Multilayer Perceptron



Multilayer Perceptron



Activation Functions



The Neuron

◼ Neuron
^ Electrically excitable cell that communicates
with other cells via specialized connections
called synapses (the human brain has roughly 100 billion neurons)
◼ Sensory neurons
^ 5 senses
◼ Motor neurons
^ Allow the brain to communicate with other parts
of the body
◼ Interneurons
^ connect neurons to other neurons within the
same region of the brain



The Neuron



The Neuron



Sigmoid

◼ Maps input to range [0, 1]


◼ Historically popular since it has a nice
interpretation as the saturating firing
rate of a neuron

◼ Problems
^ Saturates: The gradients are
killed
^ Outputs are not zero-centred
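A small sketch of the function and its gradient; since σ'(z) = σ(z)(1 − σ(z)) is at most 0.25 and approaches 0 for large |z|, gradients flowing back through saturated sigmoid units are strongly attenuated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # at most 0.25 (at z = 0)

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    # Gradient is ~0 for large |z|: the unit saturates and "kills" the gradient.
    print(z, sigmoid(z), sigmoid_grad(z))
```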



Sigmoid

◼ Maps input to range [0, 1]

◼ Problems

Non-zero-centred outputs restrict the directions of gradient updates, making optimisation inefficient (minibatches help)



Tanh

◼ Maps input to range [-1, 1]

◼ Zero centred
◼ Antisymmetric
◼ Problem
^ Saturation kills the gradient



ReLU

◼ Does not saturate (for x>0)


◼ Leads to fast convergence
◼ Computationally efficient
◼ Problem
^ Non zero-centred outputs
^ No learning for x < 0, which leads
to dead ReLUs



Leaky ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient



Parametric ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient
◼ Parameter α is learned
from data



ELU

◼ All benefits of Leaky ReLU


◼ Adds some robustness to
noise
◼ Default value α = 1
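For reference, a sketch of the piecewise definitions from the last few slides; the α defaults shown (0.01 for Leaky ReLU, 1.0 for ELU) are common choices rather than values from the slides, and in PReLU α would be learned from data.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                       # zero output and zero gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)            # small negative slope keeps gradients alive

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # smooth, saturates to -alpha

# PReLU has the same form as leaky_relu, except that alpha is learned from data.
```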



Activation Functions

◼ Choice of activation function depends on the problem


◼ Only the most common ones are discussed; there are many others
◼ The best activation function is often found by trial and error
◼ It is important to ensure good gradient flow during optimization
◼ In practice
^ Use ReLU by default with a small enough learning rate
^ Try Leaky ReLU and ELU for some additional gain



A Brief History of Deep Learning

1940 1950 1960 1970 1980 1990 2000 2010 2020

◼ 1986 Backpropagation Algorithm


^ The backpropagation algorithm made it possible to train a neural
network based on feedback from its errors
^ Allowed the efficient calculation of the gradients with
respect to the weights

Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.
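A minimal numpy sketch of what backpropagation computes, here for a one-hidden-layer network with sigmoid units and squared-error loss; the architecture and names are illustrative, not taken from the original paper.

```python
import numpy as np

def mlp_grads(x, t, W1, W2):
    """Gradients of 0.5*||y - t||^2 w.r.t. W1 and W2 via backpropagation,
    for a one-hidden-layer MLP with sigmoid activations."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

    # forward pass
    h = sigma(W1 @ x)                         # hidden activations
    y = sigma(W2 @ h)                         # network output

    # backward pass: apply the chain rule layer by layer
    delta2 = (y - t) * y * (1 - y)            # error at the output layer
    delta1 = (W2.T @ delta2) * h * (1 - h)    # error propagated to the hidden layer
    return np.outer(delta1, x), np.outer(delta2, h)   # dL/dW1, dL/dW2
```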



Summary

◼ Concepts
^ Gradient Descent
• Vanilla, Minibatch, Stochastic
◼ Simple Functions
^ OR, AND, NAND, XOR
◼ Representation Matters
◼ Neural Network Playground
◼ MLP
◼ Activation Functions



Thanks a lot for your attention

