
Very Deep Learning

Lecture 05

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker

MindGarage, University of Kaiserslautern

M. Zeshan Afzal, Very Deep Learning Ch. 5



Variants of Gradient Descent

◼ Gradient Descent
^ Updates after looking at complete dataset
◼ Minibatch Gradient Descent
^ Updates after looking at N samples (batch size)
◼ Stochastic Gradient Descent
^ Updates after looking at every single sample
◼ Related Concept
^ Epoch
• one cycle through the full training dataset
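
The three variants differ only in how many samples are seen before each update. A minimal sketch, assuming a toy least-squares problem (the function name `gradient_descent`, the learning rate and the data are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ X @ w by least squares, updating after every `batch_size` samples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):                      # one epoch = one pass over the dataset
        order = rng.permutation(n)               # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = X[idx].T @ err / len(idx)     # gradient of the mean squared error
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w                                   # noise-free toy targets

w_full = gradient_descent(X, y, batch_size=len(X))  # gradient descent: 1 update per epoch
w_mini = gradient_descent(X, y, batch_size=32)      # minibatch gradient descent
w_sgd = gradient_descent(X, y, batch_size=1)        # stochastic gradient descent
```

With the same number of epochs, the smaller the batch, the more updates are performed per pass over the data.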


Logistic Regression (Decision Boundary)

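◼ The decision boundary: $p(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = 0.5$, i.e. $\mathbf{w}^\top \mathbf{x} + b = 0$

◼ Decide for class 1: $\mathbf{w}^\top \mathbf{x} + b > 0$

◼ Decide for class 0: $\mathbf{w}^\top \mathbf{x} + b < 0$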


Simple Examples



◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0


XOR (Multilayer Perceptron)

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0
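
No single linear classifier reproduces this table, but one hidden layer is enough. A minimal sketch with hand-picked (not learned) weights and a ReLU hidden layer; the names `W`, `c` and `v` are illustrative assumptions:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # all four XOR inputs

W = np.array([[1, 1], [1, 1]])   # hidden-layer weights (hand-picked)
c = np.array([0, -1])            # hidden-layer biases
v = np.array([1, -2])            # output-layer weights

hidden = np.maximum(0, X @ W + c)   # ReLU hidden representation
xor_out = hidden @ v                # linear read-out of the hidden layer
print(xor_out)                      # [0 1 1 0]
```

The hidden layer maps the inputs (0, 1) and (1, 0) to the same point, which makes the problem linearly separable in the new representation.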


Representation Matters


Neural Network Playground



Multilayer Perceptron


A Brief History of Deep Learning

[Timeline figure: 1940 to 2020]

◼ 1986 Backpropagation Algorithm

^ The backpropagation algorithm made it possible to train a neural
network from the error fed back from its output
^ Allowed the efficient calculation of the gradients with
respect to the weights

Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.


Activation Functions

Sigmoid

◼ Maps input to range [0, 1]

◼ Historically popular since it
has a nice interpretation as the
saturating firing rate of a neuron

◼ Problems
^ Saturates: the gradients vanish in the saturated regions
^ Outputs are not zero centred

Sigmoid (continued)

◼ Maps input to range [0, 1]

◼ Problems
^ Outputs are not zero centred: the weight gradients of a neuron then all share the same sign, which restricts the gradient updates and makes optimisation inefficient (minibatches help)

Tanh

◼ Maps input to range [-1, 1]

◼ Zero centred
◼ Antisymmetric
◼ Problem
^ Saturation kills the gradient

ReLU

◼ Does not saturate (for x > 0)

◼ Leads to fast convergence
◼ Computationally efficient
◼ Problems
^ Not zero-centred
^ No learning for x < 0, which leads
to dead ReLUs

Leaky ReLU

◼ Does not saturate

◼ Closer to zero-centred
◼ Fast convergence
◼ Computationally efficient

Parametric ReLU

◼ Does not saturate

◼ Closer to zero-centred
◼ Fast convergence
◼ Computationally efficient
◼ Parameter α is learned
from data
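
The saturation and dead-unit problems listed on the activation slides can be checked numerically by evaluating the derivatives. A minimal sketch; the slope 0.01 for the leaky ReLU is an assumed, commonly used value, and in a parametric ReLU this `alpha` would be a learned parameter instead:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is fixed here; a parametric ReLU would learn it from data
    return np.where(x > 0, x, alpha * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))   # close to 0 for large |x| (saturation)
tanh_grad = 1.0 - np.tanh(x) ** 2                # also close to 0 for large |x|
relu_grad = (x > 0).astype(float)                # exactly 0 for x < 0 (dead ReLU)
leaky_grad = np.where(x > 0, 1.0, 0.01)          # small but non-zero for x < 0
```

The sigmoid gradient peaks at 0.25 for x = 0, so even in the best case it shrinks the signal that is propagated backwards.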

ELU

◼ All benefits of leaky ReLU

◼ Adds some robustness to noise
◼ Default value α = 1

Maxout

◼ Any continuous PWL (piecewise linear) function
can be expressed as a
difference of two convex PWL functions
◼ Any continuous function can
be approximated arbitrarily
well by a piecewise linear function
◼ Generalizes ReLU and leaky ReLU
◼ Increases number of
parameters per neuron
Goodfellow, Ian, et al. "Maxout networks." International conference on machine learning. PMLR, 2013.


Logistic Regression

◼ As a neural network


Logistic Regression

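◼ We have already seen the Maximum Likelihood Estimator

◼ We now perform a binary classification

◼ How should we choose the model in this case?

◼ Answer: Bernoulli distribution, $p(y \mid \mathbf{x}) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y}$ with $y \in \{0, 1\}$,

where $\hat{y}$ is predicted by the model: $\hat{y} = f_{\mathbf{w}}(\mathbf{x})$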


Logistic Regression

◼ Putting it together

◼ In machine learning we use a general term ‘loss function’ rather than the error
◼ We minimize the dissimilarity between the empirical data distribution
(defined by the training set) and the model distribution
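
For the Bernoulli model, the negative log-likelihood averaged over the training set is the binary cross entropy. A minimal sketch, assuming labels in {0, 1} and predicted probabilities in (0, 1); the function name and the clipping constant `eps` are illustrative assumptions:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean negative log-likelihood of labels y_true under Bernoulli(y_pred)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)     # avoid log(0)
    log_lik = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return -np.mean(log_lik)

y_true = np.array([1.0, 0.0, 1.0])
loss_good = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8]))  # confident and correct
loss_bad = binary_cross_entropy(y_true, np.array([0.4, 0.6, 0.3]))   # mostly wrong
```

Predictions that put more probability on the true labels give a smaller loss, which is what minimising the dissimilarity between the two distributions amounts to here.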


Logistic Regression

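◼ In summary we have assumed the Bernoulli distribution for $p(y \mid \mathbf{x})$


◼ The question is how to choose $\hat{y} = f_{\mathbf{w}}(\mathbf{x})$
◼ We are working with a discrete distribution, i.e. $\hat{y}$ must lie in $[0, 1]$

◼ We can choose the sigmoid of a linear function of the input: $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b)$

The sigmoid is given as follows: $\sigma(x) = \frac{1}{1 + e^{-x}}$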


Multinomial Distribution

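◼ Categorical distribution: $p(y = c) = \mu_c$

^ $\mu_c$ is the probability for class c

◼ Alternative notation: $p(\mathbf{y}) = \prod_{c=1}^{C} \mu_c^{y_c}$

^ $\mathbf{y}$ is a “one hot vector” with $y_c \in \{0, 1\}$

^ only the true class is 1, all others are zero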


One Hot representation

Class y    One-hot vector y
1          (1, 0, 0, 0)^T
2          (0, 1, 0, 0)^T
3          (0, 0, 1, 0)^T
4          (0, 0, 0, 1)^T

Multinomial Distribution (example)

◼ Categorical distribution

^ probability $\mu_c$ for class c

◼ Alternative notation with a “one hot vector” (only the true class is 1, all others are zero)
^ One-hot vector for class 1: y = (1, 0, 0, 0)
^ $p(\mathbf{y}) = 0.5^1 \times 0.1^0 \times 0.2^0 \times 0.1^0 = 0.5$

Categorical Distribution / Cross Entropy Loss



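◼ How can we ensure that the model predicts a valid categorical (discrete) distribution?

◼ We must guarantee
^ $\hat{y}_c \in [0, 1]$ and $\sum_{c} \hat{y}_c = 1$
◼ An element-wise sigmoid as output function would ensure the first condition only
◼ Solution: Softmax function

◼ Let s denote the network output after the last affine layer (= scores). Then: $\hat{y}_c = \mathrm{softmax}(\mathbf{s})_c = \frac{\exp(s_c)}{\sum_{k} \exp(s_k)}$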

Putting it all together

◼ Cross entropy loss for a single training sample

Class Label y    Prediction Scores s    Softmax(s)                     CE Loss
(1, 0, 0, 0)     (3, 1, -2, -1)         (0.85, 0.11, 0.005, 0.015)     0.16
(0, 1, 0, 0)     (1, 2, -1, -1)         (0.25, 0.68, 0.033, 0.033)     0.38
(0, 0, 1, 0)     (2, 2, 1, 3)           (0.19, 0.19, 0.072, 0.534)     2.6
(0, 0, 0, 1)     (3, 2, 3, -1)          (0.41, 0.15, 0.419, 0.007)     4.9
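
The rows of this table can be recomputed in a few lines. A minimal sketch for the first row (the function names are illustrative assumptions; subtracting the maximum score is a common numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))        # shift by the max score for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    # only the entry of the true class contributes, since y is one-hot
    return -np.sum(y_onehot * np.log(probs))

s = np.array([3.0, 1.0, -2.0, -1.0])   # prediction scores of the first table row
y = np.array([1.0, 0.0, 0.0, 0.0])     # one-hot class label

p = softmax(s)
loss = cross_entropy(y, p)
```

Up to the rounding used in the table, `p` and `loss` should agree with the first row; the other rows follow by changing `s` and `y`.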

Softmax

◼ It is a soft (smooth) approximation of the max function
◼ It is a differentiable approximation of a non-differentiable function
◼ Optimization is easier

Why is softmax activate function called 'softmax'? - Quora


Loss function
◼ Simple example
^ Let's say there is a three-class classification problem (Cat, Dog, Cow)

First model
Computed         Ground Truth    Class    Correct
0.3  0.3  0.4    0  0  1         Cat      Yes
0.3  0.4  0.3    0  1  0         Dog      Yes
0.1  0.2  0.7    1  0  0         Cow      No
Classification error = 1/3 = 0.33, classification accuracy = 2/3 = 0.67

Second model
Computed         Ground Truth    Class    Correct
0.1  0.2  0.7    0  0  1         Cat      Yes
0.1  0.7  0.2    0  1  0         Dog      Yes
0.3  0.4  0.3    1  0  0         Cow      No
Classification error = 1/3 = 0.33, classification accuracy = 2/3 = 0.67

Loss function (Cross Entropy)

First model
Computed         Ground Truth    Class    Correct
0.3  0.3  0.4    0  0  1         Cat      Yes
0.3  0.4  0.3    0  1  0         Dog      Yes
0.1  0.2  0.7    1  0  0         Cow      No
Cross entropy = -(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38

Second model
Computed         Ground Truth    Class    Correct
0.1  0.2  0.7    0  0  1         Cat      Yes
0.1  0.7  0.2    0  1  0         Dog      Yes
0.3  0.4  0.3    1  0  0         Cow      No
Cross entropy = -(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
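
Both models make the same number of mistakes, yet the cross entropy tells them apart. A minimal sketch that recomputes both quantities from the two tables (array and function names are illustrative assumptions):

```python
import numpy as np

truth = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])          # one-hot ground truth
first = np.array([[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])
second = np.array([[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])

def classification_error(probs, onehot):
    # fraction of samples whose most probable class is not the true class
    return np.mean(probs.argmax(axis=1) != onehot.argmax(axis=1))

def mean_cross_entropy(probs, onehot):
    # average of -ln(probability assigned to the true class)
    return -np.mean(np.log(np.sum(probs * onehot, axis=1)))

err_first, err_second = classification_error(first, truth), classification_error(second, truth)
ce_first, ce_second = mean_cross_entropy(first, truth), mean_cross_entropy(second, truth)
```

The classification error is 1/3 in both cases, while the cross entropy is clearly lower for the second model, whose correct predictions are more confident and whose mistake is less severe.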


Image Classification (MNIST)

◼ We would like to classify images into 10
different classes: the digits 0-9
◼ It is old but still used in research
◼ It is based on data from the National
Institute of Standards and Technology (NIST)
◼ It comprises digits handwritten by
census employees and school children
◼ It has a resolution of 28x28, with 60K training
and 10K testing samples, all with labels
◼ The train and test samples are not
written by the same participants

Image Classification (MNIST)

◼ Curse of dimensionality
^ Assume that they are binary images: 2^784 ≈ 10^236
different images
^ For grayscale we have 256^784 combinations
^ Why does classification work even with only 60K images?
^ Images are concentrated on a low-dimensional manifold
in {0,…,255}^784
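
The size of these image spaces is easiest to grasp as a power of ten. A minimal sketch of the arithmetic (variable names are illustrative assumptions):

```python
import math

pixels = 28 * 28                                 # 784 pixels per MNIST image

binary_exponent = pixels * math.log10(2)         # 2^784   = 10^binary_exponent
gray_exponent = pixels * math.log10(256)         # 256^784 = 10^gray_exponent

print(f"binary images:    about 10^{binary_exponent:.0f}")
print(f"grayscale images: about 10^{gray_exponent:.0f}")
```

Compared with these numbers, the 60K training images cover a vanishingly small part of the space, which is why the manifold argument matters.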



◼ Check Uploaded IPython Notebook


Universal Representation

◼ Networks with a single hidden layer can represent any continuous function F(x) with
arbitrary accuracy in the large hidden size limit
◼ However
^ Limitations of the learning algorithm
• A given learning algorithm may be unable to find an optimum with this accuracy
^ Efficiency
• A network with one hidden layer can be inefficient at representing a nonlinear function
• The required number of hidden neurons can be exponential in the input size
^ A nonlinear function F(x) can be better represented
• by deep networks with narrower layers

Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2) (1991).
Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314 (1989).


Motivation (Convolutional Neural Networks)

◼ Fully connected layers

^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Intensity variations etc
^ Pixels are bad representation from a
machine learning point of view

Edge detection

◼ Simple filters
^ Edge detection
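
A simple filter of this kind can be tried on a toy image. A minimal sketch, assuming a horizontal Sobel kernel as the edge filter (the slide does not say which filter it shows) and a hypothetical helper `cross_correlate2d`:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                    # dark left half, bright right half

sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])   # responds to horizontal intensity changes

edges = cross_correlate2d(image, sobel_x)
print(edges[0])                       # [0. 4. 4. 0.]
```

The response is non-zero only where the window straddles the vertical edge and zero in the flat regions.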


Motivation (Convolutional Neural Networks)

◼ Can we find a better representation?

^ We have a certain degree of locality in an image
^ We can find macro features at different locations
^ Hierarchy of features
• Edges + Corners → Eyes
• Eyes + Nose + Ears → Face
• Face + Body + Legs → human


Convolutional Neural Network

▪ Feature hierarchies

M. Zeshan Afzal, Very Deep Learning Ch. 5

Convolutional Neural Network

[Figure: feature extraction stage followed by classification stage]

◼ Built-in invariances / equivariances (translation)

◼ Suitable for data on grid topologies
^ 1D (audio signal, time series)
^ 2D (pixelated images)
^ 3D (videos)
◼ CNNs consist of blocks
^ Convolutions + nonlinear activation + pooling (subsampling)
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998

Convolution

◼ Convolution operation: $(x * w)(t) = \int x(\tau)\, w(t - \tau)\, d\tau$

◼ Discrete convolution: $y_i = (x * w)_i = \sum_{j} x_j\, w_{i - j}$

◼ Convolution is a linear operation
^ Example: let us consider x = (x1, x2, x3, x4, x5) and w = (w1, w2, w3)
^ We get one equation for each of the n-m+1 = 3 entries of y
◼ Convolving x with w can
be written as a linear operation (a matrix-vector product)

^ We assumed that j remains within the
bounds of x. As a result the convolved
output y will be reduced to the dimension
n-m+1. For the same dimensions we can
pad with zeros (zero padding)
^ Shared Weights
^ Sparse Connectivity
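
The linear-operation view, shared weights and sparse connectivity can all be seen by writing the convolution as a matrix. A minimal sketch for n = 5 and m = 3, assuming the flipped-kernel convention of `np.convolve`; the concrete numbers and the name `T` are illustrative assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input, n = 5
w = np.array([1.0, 0.0, -1.0])            # kernel, m = 3
n, m = len(x), len(w)

# Each row holds the same (flipped) kernel, shifted by one position:
# shared weights, and only m non-zero entries per row (sparse connectivity).
T = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    T[i, i:i + m] = w[::-1]

y_matrix = T @ x                              # convolution as a matrix-vector product
y_conv = np.convolve(x, w, mode="valid")      # reference result, length n - m + 1
```

A fully connected layer mapping 5 inputs to 3 outputs would have 15 free weights; here the matrix has only 3.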
Convolution vs Cross Correlation

◼ Convolution operation: $y_i = \sum_{j} x_j\, w_{i - j}$

◼ Cross Correlation: $y_i = \sum_{j} x_j\, w_{j - i}$

^ Cross correlation is convolution with a flipped kernel

^ In practice, cross correlation is used
• The filters (weights) are initialized randomly
• The values are learned with backpropagation
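
The flipped-kernel relation is easy to verify with NumPy's built-in 1D routines. A minimal sketch with arbitrary illustrative numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 2.0, 3.0])                 # deliberately not symmetric

conv = np.convolve(x, w, mode="valid")        # convolution (kernel is flipped)
corr = np.correlate(x, w, mode="valid")       # cross correlation (no flip)
corr_flipped = np.correlate(x, w[::-1], mode="valid")

print(conv)          # [10. 16. 22.]
print(corr)          # [14. 20. 26.]
print(corr_flipped)  # [10. 16. 22.]
```

Because the filter values are learned anyway, a network can learn the flipped kernel just as well, so the choice between the two operations does not matter for training.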


Multidimensional Convolutions

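◼ 2D convolution: $Y_{i,j} = \sum_{m} \sum_{n} X_{m,n}\, W_{i - m,\, j - n}$

◼ 2D cross correlation: $Y_{i,j} = \sum_{m} \sum_{n} X_{m,n}\, W_{m - i,\, n - j}$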


A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter

=> 5x5 output


Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


A closer look at spatial dimensions:

7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

7 doesn’t fit!
cannot apply 3x3 filter on
7x7 input with stride 3.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


Output size:
(N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
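
The output-size rule can be wrapped in a small helper. A minimal sketch; the function name is hypothetical, and the optional `pad` argument is an assumed extension for zero padding on each side, which turns the rule into (N + 2·pad - F) / stride + 1:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size for input size n and filter size f, or None if it does not fit."""
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        return None                    # the filter does not tile the input evenly
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))   # 5
print(conv_output_size(7, 3, stride=2))   # 3
print(conv_output_size(7, 3, stride=3))   # None (doesn't fit)
```

Returning `None` for the stride-3 case mirrors the "doesn't fit" situation instead of silently rounding 2.33 down.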

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


In practice: Common to zero pad the border

e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!

in general,
• common to see CONV layers with stride 1
• filters of size FxF
• and zero-padding with (F-1)/2 (will preserve size spatially)
e.g. F = 3 => zero pad with 1
     F = 5 => zero pad with 2
     F = 7 => zero pad with 3

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition

Multichannel Convolutions

◼ Multichannel convolutions


Convolution Layer
32x32x3 image -> preserve spatial structure

32 height

32 width
3 depth


Convolution Layer
◼ 32x32x3 image

◼ 5x5x3 filter

◼ Convolve the filter with the image

◼ i.e. “slide over the image spatially,
computing dot products”

Filters always extend the full depth of the input volume


Convolution Layer

32x32x3 image
5x5x3 filter

1 number:
the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)


Convolution Layer

32x32x3 image
5x5x3 filter

convolve (slide) over all spatial locations
=> one 28x28x1 activation map

consider a second, green filter
=> a second 28x28x1 activation map

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps.

We stack these up to get a “new image” of size 28x28x6!
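
The shapes can be checked with a naive implementation. A minimal sketch of one convolution layer without padding and with stride 1, assuming random filter values and a hypothetical helper `conv_layer` (written for clarity, not speed):

```python
import numpy as np

def conv_layer(image, filters, biases):
    """image: (H, W, C); filters: (K, F, F, C); returns (H-F+1, W-F+1, K) activation maps."""
    h, w, _ = image.shape
    k, fh, fw, _ = filters.shape
    out = np.zeros((h - fh + 1, w - fw + 1, k))
    for n in range(k):                            # one activation map per filter
        for i in range(h - fh + 1):
            for j in range(w - fw + 1):
                patch = image[i:i + fh, j:j + fw, :]          # 5x5x3 chunk of the image
                out[i, j, n] = np.sum(patch * filters[n]) + biases[n]   # dot product + bias
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
filters = rng.normal(size=(6, 5, 5, 3))           # six 5x5x3 filters
biases = np.zeros(6)

maps = conv_layer(image, filters, biases)
print(maps.shape)                                 # (28, 28, 6)
```

Each filter spans the full input depth of 3, so every output value is a 75-dimensional dot product plus a bias.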


Preview: ConvNet is a sequence of Convolutional Layers, interspersed
with activation functions

32x32x3 input
-> CONV + activation, e.g. 6 5x5x3 filters -> 28x28x6
-> CONV + activation, e.g. 10 5x5x6 filters -> 24x24x10


Preview [Zeiler and Fergus 2013]
Visualization of VGG-16 by Lane McIntosh. VGG-16
architecture from [Simonyan and Zisserman 2014].

Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently

Single depth slice

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 =>

6 8
3 4
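
The pooling example can be reproduced directly. A minimal sketch of max pooling on a single depth slice (the function name `max_pool2d` is an illustrative assumption):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Take the maximum over each size x size window of a 2D array."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

depth_slice = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])

pooled = max_pool2d(depth_slice)
print(pooled)   # [[6 8]
                #  [3 4]]
```

Applied to a stack of activation maps, the same function would be called on each map separately, since pooling operates over each map independently.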

Typical CNN Structure

◼ Image -> convolution -> max pooling -> output

Image -> Convolution -> Pooling -> Flattening -> Fully Connected -> Softmax -> Loss


Thanks a lot for your Attention
