
Very Deep Learning

Lecture 05

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
afzal.tukl@gmail.com



Recap



Variants of Gradient Descent

◼ Gradient Descent
^ Updates after looking at the complete dataset
◼ Minibatch Gradient Descent
^ Updates after looking at N samples (the batch size)
◼ Stochastic Gradient Descent
^ Updates after looking at every single sample
◼ Related concept
^ Epoch
• one cycle through the full training dataset (a minimal training-loop sketch follows below)
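The following is a minimal NumPy sketch of the three variants above, not code from the lecture; `grad_fn`, the learning rate, and the toy data are illustrative assumptions. Setting `batch_size=len(X)` gives full-batch gradient descent, `batch_size=1` gives stochastic gradient descent.

```python
import numpy as np

def sgd_epochs(X, y, w, grad_fn, lr=0.1, batch_size=32, epochs=10):
    """Minibatch SGD: batch_size=len(X) is full-batch GD, batch_size=1 is SGD."""
    n = len(X)
    for _ in range(epochs):                     # one epoch = one full pass over the data
        perm = np.random.permutation(n)         # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            w = w - lr * grad_fn(w, X[idx], y[idx])   # update after looking at one batch
    return w

# Illustrative use: least-squares gradient for a linear model y = Xw
def lsq_grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(Xb)

X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd_epochs(X, y, np.zeros(3), lsq_grad, lr=0.1, batch_size=16, epochs=50))
```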



Logistic Regression (Decision Boundary)

◼ The decision boundary: σ(wᵀx + b) = 0.5, i.e. wᵀx + b = 0

◼ Decide for class 1 if σ(wᵀx + b) ≥ 0.5

◼ Decide for class 0 if σ(wᵀx + b) < 0.5



Simple Examples



XOR

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0



XOR (Multilayer Perceptron)

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0



Representation Matters



Neural Network Playground

◼ https://playground.tensorflow.org/



Multilayer Perceptron





A Brief History of Deep Learning

(Timeline of milestones, 1940–2020)

◼ 1986: Backpropagation Algorithm


^ The backpropagation algorithm made it possible to train a neural
network based on the error feedback
^ Allowed the efficient calculation of the gradients with
respect to the weights

Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.



Activation Functions



Sigmoid

◼ σ(x) = 1 / (1 + e⁻ˣ) maps the input to the range (0, 1)


◼ Historically popular since it has a nice interpretation as the saturating firing rate of a neuron

◼ Problems
^ Saturates: the gradients are killed
^ Outputs are not zero-centred



Sigmoid

◼ Maps the input to the range (0, 1)

◼ Problems

^ Non-zero-centred outputs restrict the directions of the gradient updates and make optimisation inefficient (minibatches help)



Tanh

◼ Maps the input to the range (-1, 1)

◼ Zero-centred
◼ Antisymmetric
◼ Problem
^ Saturation kills the gradient



ReLU

◼ ReLU(x) = max(0, x) does not saturate (for x > 0)


◼ Leads to fast convergence
◼ Computationally efficient
◼ Problems
^ Not zero-centred
^ No learning for x < 0, which leads to "dead" ReLUs



Leaky ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient



Parametric ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient
◼ Parameter α is learned
from data



ELU

◼ All benefits of Leaky ReLU


◼ Adds some robustness to
noise
◼ Default value α = 1
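As a quick reference for the activations discussed above, here is a minimal NumPy sketch (the leaky slope 0.01 and α = 1 are illustrative defaults, not values prescribed by the slides). The loop at the end shows numerically why saturation kills the sigmoid gradient.

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))         # output in (0, 1), saturates
def tanh(x):               return np.tanh(x)                        # zero-centred, still saturates
def relu(x):               return np.maximum(0.0, x)                # no saturation for x > 0
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)         # small slope for x < 0
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1.0))

# Saturation: the sigmoid gradient sigma'(x) = sigma(x) * (1 - sigma(x)) vanishes for large |x|
for x in (0.0, 5.0, 10.0):
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigmoid={s:.5f}  gradient={s * (1 - s):.2e}")
```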



Maxout

◼ Maxout takes the maximum over several affine functions of the input,
e.g. max(w₁ᵀx + b₁, w₂ᵀx + b₂)
◼ Any continuous PWL function can be expressed as a
difference of two convex PWL functions
◼ Any continuous function can be approximated arbitrarily
well by a piecewise linear function
◼ Generalizes ReLU and Leaky ReLU
◼ Increases the number of parameters per neuron
Goodfellow, Ian, et al. "Maxout networks." International conference on machine learning. PMLR, 2013.



Training



Logistic Regression

◼ As a neural network



Logistic Regression

◼ We have already seen the Maximum Likelihood Estimator

◼ We now perform a binary classification


◼ How should we choose the model in this case?

◼ Answer: the Bernoulli distribution

p(y | x) = ŷ^y (1 − ŷ)^(1−y), where ŷ is predicted by the model



Logistic Regression

◼ Putting it together

◼ In machine learning we use the more general term "loss function" rather than "error
function"
◼ We minimize the dissimilarity between the empirical data distribution
(defined by the training set) and the model distribution



Logistic Regression

◼ In summary, we have assumed the Bernoulli


distribution p(y | x) = ŷ^y (1 − ŷ)^(1−y)

where ŷ is the model output
◼ The question is how to choose ŷ
◼ We are working with a discrete distribution, i.e. y ∈ {0, 1}

◼ We can choose the sigmoid

◼ The sigmoid is given as σ(z) = 1 / (1 + e⁻ᶻ)



Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation

^ "one-hot vector" with

^ only the true class set to 1, all others zero



One-Hot Representation

Class y    One-hot vector y

1          (1, 0, 0, 0)ᵀ

2          (0, 1, 0, 0)ᵀ

3          (0, 0, 1, 0)ᵀ

4          (0, 0, 0, 1)ᵀ



Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation
^ Example: one-hot class 1 = (1, 0, 0, 0) gives
p(y) = 0.5¹ × 0.1⁰ × 0.2⁰ × 0.1⁰ = 0.5
^ "one-hot vector" with

^ only the true class set to 1, all others zero



Categorical Distribution / Cross Entropy Loss



Softmax

◼ How can we ensure that the network predicts a valid categorical (discrete) distribution?


◼ We must guarantee
^ 0 ≤ ŷ_c ≤ 1 for every class c, and Σ_c ŷ_c = 1
◼ An element-wise sigmoid as the output function would ensure the first condition only
◼ Solution: the softmax function

◼ Let s denote the network output after the last affine layer (= scores). Then:
softmax(s)_c = exp(s_c) / Σ_j exp(s_j)


Putting it all together

◼ Cross entropy loss for a single training sample: L = −log softmax(s)_y, evaluated at the true class y

Class label y    Prediction scores s    Softmax(s)                      CE loss

(1, 0, 0, 0)     (3, 1, -2, -1)         (0.85, 0.11, 0.005, 0.015)      0.16

(0, 1, 0, 0)     (1, 2, -1, -1)         (0.25, 0.68, 0.033, 0.033)      0.38

(0, 0, 1, 0)     (2, 2, 1, 3)           (0.19, 0.19, 0.072, 0.534)      2.6

(0, 0, 0, 1)     (3, 2, 3, -1)          (0.41, 0.15, 0.419, 0.007)      4.9
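A quick NumPy check of the table above (a sketch, not lecture code); small differences from the printed numbers come from rounding the softmax values in the table.

```python
import numpy as np

def softmax(s):
    s = np.asarray(s, dtype=float)
    e = np.exp(s - s.max())            # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, s):
    return -np.log(softmax(s)[np.argmax(y_onehot)])   # -log(probability of the true class)

print(softmax([3, 1, -2, -1]))                        # ~ (0.86, 0.12, 0.006, 0.016)
print(cross_entropy([1, 0, 0, 0], [3, 1, -2, -1]))    # ~ 0.15  (table: 0.16, rounded softmax)
print(cross_entropy([0, 0, 0, 1], [3, 2, 3, -1]))     # ~ 4.87  (table: 4.9)
```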



Softmax

◼ Softmax is a soft/smooth approximation of the max function
◼ A differentiable approximation of a non-differentiable function
◼ This makes optimization easier

Why is softmax activate function called 'softmax'? - Quora



Loss function
◼ Simple example
^ Let's say we have a three-class classification problem (Cat, Dog, Cow)

Computed         Ground truth    Class    Correct
0.3 0.3 0.4      0 0 1           Cat      Yes
0.3 0.4 0.3      0 1 0           Dog      Yes
0.1 0.2 0.7      1 0 0           Cow      No
classification error = 1/3 = 0.33, classification accuracy = 2/3 = 0.67

Computed         Ground truth    Class    Correct
0.1 0.2 0.7      0 0 1           Cat      Yes
0.1 0.7 0.2      0 1 0           Dog      Yes
0.3 0.4 0.3      1 0 0           Cow      No
classification error = 1/3 = 0.33, classification accuracy = 2/3 = 0.67

◼ Both models have the same classification error, even though the second one assigns much
higher probability to the correct classes



Loss function (Cross Entropy)

Computed         Ground truth    Class    Correct
0.3 0.3 0.4      0 0 1           Cat      Yes
0.3 0.4 0.3      0 1 0           Dog      Yes
0.1 0.2 0.7      1 0 0           Cow      No
Cross entropy = -(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38

Computed         Ground truth    Class    Correct
0.1 0.2 0.7      0 0 1           Cat      Yes
0.1 0.7 0.2      0 1 0           Dog      Yes
0.3 0.4 0.3      1 0 0           Cow      No
Cross entropy = -(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
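A minimal check of the two averages above (sketch only; `mean_ce` is an illustrative helper name, not from the lecture):

```python
import numpy as np

def mean_ce(probs, targets):
    """Average cross-entropy given predicted probabilities and one-hot targets."""
    probs, targets = np.asarray(probs), np.asarray(targets)
    p_true = (probs * targets).sum(axis=1)     # probability assigned to the true class
    return -np.log(p_true).mean()

targets = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(mean_ce([[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]], targets))  # ~1.38
print(mean_ce([[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]], targets))  # ~0.64
```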



Image Classification (MNIST)

◼ We would like to classify into 10 different classes: the digits 0-9
◼ It is old but still used in research
◼ It is based on data from the National Institute of Standards and Technology
◼ It is comprised of handwritten digits written by census employees and school children
◼ It has a resolution of 28x28, with 60K training and 10K test samples, all with labels
◼ The training and test samples are not written by the same participants
Image Classification (MNIST)

◼ Curse of dimensionality
^ Assume binary images: there are 2^784 ≈ 10^236 different images
^ For grayscale images we have 256^784 combinations
^ Why is classification even possible with only 60K images?
^ The images are concentrated on a low-dimensional manifold
in {0, …, 255}^784
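A two-line sanity check of these counts (the exponents above follow from N·log10 of the number of values per pixel):

```python
import math

print(784 * math.log10(2))    # ~236  -> 2**784  ~ 10**236  possible binary 28x28 images
print(784 * math.log10(256))  # ~1888 -> 256**784 ~ 10**1888 possible grayscale images
```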



MLP (MNIST DEMO)

◼ Check the uploaded IPython notebook (a minimal MLP sketch is shown below)
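For orientation only, here is a minimal PyTorch MLP for 28x28 digit images; this is a sketch and not the uploaded notebook's code, and the hidden size 128 is an illustrative choice.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),              # 28x28 image -> 784-dimensional vector
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),        # 10 classes: the digits 0-9
)

x = torch.randn(4, 1, 28, 28)  # a dummy batch of 4 "images"
print(mlp(x).shape)            # torch.Size([4, 10]) -> one score per digit class
```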



Universal Representation

◼ Networks with a single hidden layer can represent any function F(x) with
arbitrary accuracy in the limit of a large hidden size
◼ However
^ Limitations of the learning algorithm
• A given learning algorithm may be unable to find an optimum with this accuracy
^ Efficiency
• A network with one hidden layer can be inefficient at representing a nonlinear function
• The required number of hidden neurons can be exponential in the input size
^ A nonlinear function F(x) can be better represented by
• deep networks with narrower layers

Kurt Hornik: Approximation capabilities of multilayer feedforward networks. Neural Networks, Volume 4, Issue 2, 1991.
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314, 1989.



Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Intensity variations, etc.
^ Pixels are a bad representation from a machine learning point of view

Image source: https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/



Edge detection

◼ Simple filters
^ Edge detection
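A small NumPy illustration of such a simple filter (a sketch with an illustrative toy image; the Sobel kernel is a standard edge-detection filter, though the slide's exact filter is not specified):

```python
import numpy as np

# A vertical-edge image: dark left half, bright right half
img = np.zeros((5, 5))
img[:, 2:] = 1.0

# Classic 3x3 Sobel filter that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# "Valid" sliding-window filtering, written out explicitly
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(img[i:i + 3, j:j + 3] * sobel_x)
print(out)   # nonzero where the window straddles the edge, zero in flat regions
```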



SIFT

https://medium.com/machine-learning-world/feature-extraction-and-similar-image-search-with-opencv-for-newbies-3c59796bf774



Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Intensity variations, etc.
^ Pixels are a bad representation from a machine learning point of view
◼ Can we find a better representation?



Motivation (Convolutional Neural Networks)

◼ Can we find a better representation?


^ We have a certain degree of locality in an image
^ We can find macro features at different locations
^ Hierarchy of features
• Edges + Corners → Eyes
• Eyes + Nose + Ears → Face
• Face + Body + Legs → human



Convolutional Neural Network

▪ Feature hierarchies



Convolutional Neural Network

(Figure: feature extraction stage followed by classification stage)

◼ Built-in invariances / equivariances (translation)


◼ Suitable for data on grid topologies
^ 1D (audio signal, time series)
^ 2D (pixelated images)
^ 3D (videos)
◼ CNNs: Consists of blocks
^ Convolutions + nonlinear activation + pooling (subsampling)
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998



Convolutions

◼ Convolution operation
(x ∗ w)(t) = ∫ x(a) w(t − a) da

◼ Discrete convolution
(x ∗ w)(t) = Σ_a x(a) w(t − a)



Convolutions
◼ Convolution is a linear operation
^ Example: let us consider x = (x1, x2, x3, x4, x5) and a kernel w = (w1, w2, w3)
^ We get one linear equation for each output element yi
◼ Convolving x with w can therefore be written as a linear operation
(a matrix product, see the sketch below)
^ We assumed that the kernel stays within the bounds of x. As a result the convolved
output y is reduced to dimension n − m + 1. To keep the same dimensions we can
pad x with zeros (zero padding)
^ Shared weights
^ Sparse connectivity
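A small NumPy sketch of this idea (illustrative values, not lecture code): the sliding window is written as a matrix whose rows all contain the same three weights, shifted by one position each time, which is exactly the weight sharing and sparse connectivity mentioned above. As the next slide explains, this sliding-window form without kernel flipping is technically cross-correlation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # x1..x5
w = np.array([0.5, -1.0, 2.0])               # w1..w3 (illustrative values)
n, m = len(x), len(w)

# Each row holds the same shared weights, shifted by one position:
# sparse connectivity (mostly zeros) and weight sharing (identical nonzero entries)
W = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    W[i, i:i + m] = w

print(W)
print(W @ x)                                  # the n - m + 1 = 3 outputs y
print(np.correlate(x, w, mode="valid"))       # same sliding-window result
```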
Convolution vs Cross-Correlation

◼ Convolution operation
(x ∗ w)(i) = Σ_j x(j) w(i − j)

◼ Cross-correlation
(x ⋆ w)(i) = Σ_j x(i + j) w(j)

^ Cross-correlation is convolution with a flipped kernel


^ In practice, cross-correlation is used
• The filters (weights) are initialized randomly
• The values are learned with backpropagation
https://en.wikipedia.org/wiki/Convolution
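A minimal NumPy check of the flipped-kernel relation (sketch with illustrative arrays):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
w = np.array([1.0, 0.0, -1.0])

conv = np.convolve(x, w, mode="valid")      # true convolution (kernel flipped internally)
corr = np.correlate(x, w, mode="valid")     # cross-correlation (no flip)

# Cross-correlation equals convolution with the flipped kernel
print(np.allclose(corr, np.convolve(x, w[::-1], mode="valid")))  # True
print(conv, corr)                           # differ only because the kernel is asymmetric
```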



Multidimensional Convolutions

◼ 2D convolution
(I ∗ K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)

◼ 2D cross-correlation
(I ⋆ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)



A closer look at spatial dimensions

◼ 7x7 input (spatially), assume a 3x3 filter
◼ Sliding the filter over all valid positions => 5x5 output

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


A closer look at spatial dimensions

◼ 7x7 input (spatially), assume a 3x3 filter applied with stride 2 => 3x3 output!

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


A closer look at spatial dimensions

◼ 7x7 input (spatially), assume a 3x3 filter applied with stride 3?
◼ Doesn't fit! A 3x3 filter cannot be applied to a 7x7 input with stride 3.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


◼ Output size: (N - F) / stride + 1 (for an NxN input and an FxF filter)
◼ e.g. N = 7, F = 3:
• stride 1 => (7 - 3)/1 + 1 = 5
• stride 2 => (7 - 3)/2 + 1 = 3
• stride 3 => (7 - 3)/3 + 1 = 2.33, not valid

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition



In practice: Common to zero pad the border

◼ e.g. input 7x7, 3x3 filter applied with stride 1,
padded with a 1-pixel border of zeros => what is the output?
◼ Recall (N - F) / stride + 1, applied to the padded 9x9 input: (9 - 3)/1 + 1 = 7, so a 7x7 output!
◼ In general, it is common to see CONV layers with
• stride 1,
• filters of size FxF,
• and zero-padding with (F - 1)/2 (this preserves the spatial size)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
(a sketch of this output-size rule follows below)

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition
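A small helper capturing the output-size rule from the last two slides, generalized with a padding term; the function name and error handling are my own illustrative choices:

```python
def conv_output_size(N, F, stride=1, pad=0):
    """Spatial output size: (N - F + 2*pad) / stride + 1 (must be an integer)."""
    span = N - F + 2 * pad
    if span % stride != 0:
        raise ValueError(f"a {F}x{F} filter with stride {stride} and pad {pad} "
                         f"does not fit an {N}x{N} input")
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))         # 5
print(conv_output_size(7, 3, stride=2))         # 3
print(conv_output_size(7, 3, stride=1, pad=1))  # 7: (F-1)/2 zero padding preserves the size
# conv_output_size(7, 3, stride=3)              # raises: stride 3 doesn't fit
```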
Multichannel Convolutions

◼ Multichannel convolutions



Convolution Layer
◼ 32x32x3 image -> preserve the spatial structure
◼ 32 height, 32 width, 3 depth



Convolution Layer

◼ 32x32x3 image
◼ 5x5x3 filter (filters always extend the full depth of the input volume)
◼ Convolve the filter with the image,
i.e. "slide over the image spatially, computing dot products"


Convolution Layer

◼ 32x32x3 image, 5x5x3 filter
◼ Each output is 1 number: the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. a 5*5*3 = 75-dimensional dot product + bias)



Convolution Layer

◼ 32x32x3 image, 5x5x3 filter
◼ Convolve (slide) the filter over all spatial locations
=> a 28x28x1 activation map



Convolution Layer

◼ Consider a second (green) 5x5x3 filter
◼ Convolving it over all spatial locations gives a second 28x28x1 activation map
Convolution Layer

◼ For example, if we had 6 5x5 filters, we'd get 6 separate activation maps,
each of size 28x28
◼ We stack these up to get a "new image" of size 28x28x6!



Preview: a ConvNet is a sequence of convolution layers, interspersed
with activation functions

◼ e.g. 32x32x3 input --CONV + ReLU (six 5x5x3 filters)--> 28x28x6



Preview: a ConvNet is a sequence of convolutional layers, interspersed
with activation functions

◼ e.g. 32x32x3 input --CONV + ReLU (six 5x5x3 filters)--> 28x28x6
--CONV + ReLU (ten 5x5x6 filters)--> 24x24x10 --> …
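A PyTorch sketch of this two-layer stack, just to confirm the spatial sizes (assumes PyTorch is installed; the random input and the layer objects are illustrative, not the lecture's code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                     # one 32x32x3 image (NCHW layout)

convnet = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(),    # six 5x5x3 filters -> 6x28x28
    nn.Conv2d(6, 10, kernel_size=5), nn.ReLU(),   # ten 5x5x6 filters -> 10x24x24
)

h = x
for layer in convnet:
    h = layer(h)
    print(type(layer).__name__, tuple(h.shape))
# Conv2d (1, 6, 28, 28) / ReLU (1, 6, 28, 28) / Conv2d (1, 10, 24, 24) / ReLU (1, 10, 24, 24)
```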



Preview [Zeiler and Fergus 2013]

(Figure: Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].)
Pooling Layer

- makes the representations smaller and more manageable
- operates over each activation map independently
MAX POOLING

◼ Single depth slice; max pool with 2x2 filters and stride 2:

1 1 2 4
5 6 7 8        =>        6 8
3 2 1 0                  3 4
1 2 3 4
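A one-line NumPy check of the pooled result above (sketch; the block-reshape trick assumes the input size is divisible by the pool size):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: split into 2x2 blocks and take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [3 4]]
```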
Typical CNN Structure

◼ Image -> convolution -> pooling -> flattening -> fully connected layer -> softmax -> loss
(a minimal sketch of this pipeline follows below)
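A minimal PyTorch sketch of this pipeline (assumes PyTorch; all layer sizes, the 28x28 input, and the 10 classes are illustrative choices, not taken from the slides):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                            # pooling: 28x28 -> 14x14
    nn.Flatten(),                               # flattening
    nn.Linear(8 * 14 * 14, 10),                 # fully connected layer -> class scores
)
criterion = nn.CrossEntropyLoss()               # softmax + loss in one module

x = torch.randn(4, 1, 28, 28)                   # e.g. a batch of MNIST-sized images
labels = torch.randint(0, 10, (4,))
scores = model(x)
print(scores.shape, criterion(scores, labels).item())   # torch.Size([4, 10]) and a scalar loss
```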



Thanks a lot for your attention

