
Very Deep Learning

Lecture 05

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
afzal.tukl@gmail.com



Recap



Variants of Gradient Descent

◼ Gradient Descent
^ Updates after looking at the complete dataset
◼ Minibatch Gradient Descent
^ Updates after looking at N samples (the batch size)
◼ Stochastic Gradient Descent
^ Updates after looking at every single sample
◼ Related concept
^ Epoch
• one cycle through the full training dataset (a minimal training-loop sketch follows below)
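The following is a minimal NumPy sketch of the three variants above, not code from the lecture; `grad_fn`, the learning rate, and the toy data are illustrative assumptions. Setting `batch_size=len(X)` gives full-batch gradient descent, `batch_size=1` gives stochastic gradient descent.

```python
import numpy as np

def sgd_epochs(X, y, w, grad_fn, lr=0.1, batch_size=32, epochs=10):
    """Minibatch SGD: batch_size=len(X) is full-batch GD, batch_size=1 is SGD."""
    n = len(X)
    for _ in range(epochs):                     # one epoch = one full pass over the data
        perm = np.random.permutation(n)         # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            w = w - lr * grad_fn(w, X[idx], y[idx])   # update after looking at one batch
    return w

# Illustrative use: least-squares gradient for a linear model y = Xw
def lsq_grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(Xb)

X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd_epochs(X, y, np.zeros(3), lsq_grad, lr=0.1, batch_size=16, epochs=50))
```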



Logistic Regression (Decision Boundary)

◼ The decision boundary: σ(wᵀx + b) = 0.5, i.e. wᵀx + b = 0

◼ Decide for class 1 if σ(wᵀx + b) ≥ 0.5

◼ Decide for class 0 if σ(wᵀx + b) < 0.5



Simple Examples



XOR

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0



XOR (Multilayer Perceptron)

◼ Linear Classifier

𝒙𝟏 𝒙𝟐 XOR (𝒙𝟏 , 𝒙𝟐 )

0 0 0
0 1 1
1 0 1
1 1 0



Representation Matters



Neural Network Playground

◼ https://playground.tensorflow.org/



Multilayer Perceptron





A Brief History of Deep Learning

(Timeline of milestones, 1940–2020)

◼ 1986: Backpropagation Algorithm


^ The backpropagation algorithm made it possible to train a neural
network based on the error feedback
^ Allowed the efficient calculation of the gradients with
respect to the weights

Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.



Activation Functions



Sigmoid

◼ σ(x) = 1 / (1 + e⁻ˣ) maps the input to the range (0, 1)


◼ Historically popular since it has a nice interpretation as the saturating firing rate of a neuron

◼ Problems
^ Saturates: the gradients are killed
^ Outputs are not zero-centred



Sigmoid

◼ Maps the input to the range (0, 1)

◼ Problems

^ Non-zero-centred outputs restrict the directions of the gradient updates and make optimisation inefficient (minibatches help)



Tanh

◼ Maps the input to the range (-1, 1)

◼ Zero-centred
◼ Antisymmetric
◼ Problem
^ Saturation kills the gradient



ReLU

◼ ReLU(x) = max(0, x) does not saturate (for x > 0)


◼ Leads to fast convergence
◼ Computationally efficient
◼ Problems
^ Not zero-centred
^ No learning for x < 0, which leads to "dead" ReLUs



Leaky ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient



Parametric ReLU

◼ Does not saturate


◼ Closer to zero-centred
outputs
◼ Fast convergence
◼ Computationally efficient
◼ Parameter α is learned
from data



ELU

◼ All benefits of Leaky ReLU


◼ Adds some robustness to
noise
◼ Default value α = 1
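As a quick reference for the activations discussed above, here is a minimal NumPy sketch (the leaky slope 0.01 and α = 1 are illustrative defaults, not values prescribed by the slides). The loop at the end shows numerically why saturation kills the sigmoid gradient.

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))         # output in (0, 1), saturates
def tanh(x):               return np.tanh(x)                        # zero-centred, still saturates
def relu(x):               return np.maximum(0.0, x)                # no saturation for x > 0
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)         # small slope for x < 0
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1.0))

# Saturation: the sigmoid gradient sigma'(x) = sigma(x) * (1 - sigma(x)) vanishes for large |x|
for x in (0.0, 5.0, 10.0):
    s = sigmoid(x)
    print(f"x={x:5.1f}  sigmoid={s:.5f}  gradient={s * (1 - s):.2e}")
```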



Maxout

◼ Maxout takes the maximum over several affine functions of the input,
e.g. max(w₁ᵀx + b₁, w₂ᵀx + b₂)
◼ Any continuous PWL function can be expressed as a
difference of two convex PWL functions
◼ Any continuous function can be approximated arbitrarily
well by a piecewise linear function
◼ Generalizes ReLU and Leaky ReLU
◼ Increases the number of parameters per neuron
Goodfellow, Ian, et al. "Maxout networks." International conference on machine learning. PMLR, 2013.



Training



Logistic Regression

◼ As a neural network



Logistic Regression

◼ We have already seen the Maximum Likelihood Estimator

◼ We now perform a binary classification


◼ How should we choose the model in this case?

◼ Answer: the Bernoulli distribution

p(y | x) = ŷ^y (1 − ŷ)^(1−y), where ŷ is predicted by the model



Logistic Regression

◼ Putting it together

◼ In machine learning we use the more general term "loss function" rather than "error
function"
◼ We minimize the dissimilarity between the empirical data distribution
(defined by the training set) and the model distribution



Logistic Regression

◼ In summary, we have assumed the Bernoulli


distribution p(y | x) = ŷ^y (1 − ŷ)^(1−y)

where ŷ is the model output
◼ The question is how to choose ŷ
◼ We are working with a discrete distribution, i.e. y ∈ {0, 1}

◼ We can choose the sigmoid

◼ The sigmoid is given as σ(z) = 1 / (1 + e⁻ᶻ)



Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation

^ "one-hot vector" with

^ only the true class set to 1, all others zero



One-Hot Representation

Class y    One-hot vector y

1          (1, 0, 0, 0)ᵀ

2          (0, 1, 0, 0)ᵀ

3          (0, 0, 1, 0)ᵀ

4          (0, 0, 0, 1)ᵀ



Multinomial Distribution

◼ Categorical distribution

^ probability for class c

◼ Alternative notation
^ Example: one-hot class 1 = (1, 0, 0, 0) gives
p(y) = 0.5¹ × 0.1⁰ × 0.2⁰ × 0.1⁰ = 0.5
^ "one-hot vector" with

^ only the true class set to 1, all others zero



Categorical Distribution / Cross Entropy Loss



Softmax

◼ How can we ensure that the network predicts a valid categorical (discrete) distribution?


◼ We must guarantee
^ 0 ≤ ŷ_c ≤ 1 for every class c, and Σ_c ŷ_c = 1
◼ An element-wise sigmoid as the output function would ensure the first condition only
◼ Solution: the softmax function

◼ Let s denote the network output after the last affine layer (= scores). Then:
softmax(s)_c = exp(s_c) / Σ_j exp(s_j)


Putting it all together

◼ Cross entropy loss for a single training sample: L = −log softmax(s)_y, evaluated at the true class y

Class label y    Prediction scores s    Softmax(s)                      CE loss

(1, 0, 0, 0)     (3, 1, -2, -1)         (0.85, 0.11, 0.005, 0.015)      0.16

(0, 1, 0, 0)     (1, 2, -1, -1)         (0.25, 0.68, 0.033, 0.033)      0.38

(0, 0, 1, 0)     (2, 2, 1, 3)           (0.19, 0.19, 0.072, 0.534)      2.6

(0, 0, 0, 1)     (3, 2, 3, -1)          (0.41, 0.15, 0.419, 0.007)      4.9
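A quick NumPy check of the table above (a sketch, not lecture code); small differences from the printed numbers come from rounding the softmax values in the table.

```python
import numpy as np

def softmax(s):
    s = np.asarray(s, dtype=float)
    e = np.exp(s - s.max())            # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, s):
    return -np.log(softmax(s)[np.argmax(y_onehot)])   # -log(probability of the true class)

print(softmax([3, 1, -2, -1]))                        # ~ (0.86, 0.12, 0.006, 0.016)
print(cross_entropy([1, 0, 0, 0], [3, 1, -2, -1]))    # ~ 0.15  (table: 0.16, rounded softmax)
print(cross_entropy([0, 0, 0, 1], [3, 2, 3, -1]))     # ~ 4.87  (table: 4.9)
```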



Softmax

◼ Softmax is a soft/smooth approximation of the max function
◼ A differentiable approximation of a non-differentiable function
◼ This makes optimization easier

Why is softmax activate function called 'softmax'? - Quora



Loss function
◼ Simple example
^ Let's say we have a three-class classification problem (Cat, Dog, Cow)

Computed         Ground truth    Class    Correct
0.3 0.3 0.4      0 0 1           Cat      Yes
0.3 0.4 0.3      0 1 0           Dog      Yes
0.1 0.2 0.7      1 0 0           Cow      No
classification error = 1/3 = 0.33, classification accuracy = 2/3 = 0.67

Computed         Ground truth    Class    Correct
0.1 0.2 0.7      0 0 1           Cat      Yes
0.1 0.7 0.2      0 1 0           Dog      Yes
0.3 0.4 0.3      1 0 0           Cow      No
classification error = 1/3 = 0.33, classification accuracy = 2/3 = 0.67

◼ Both models have the same classification error, even though the second one assigns much
higher probability to the correct classes



Loss function (Cross Entropy)

Computed         Ground truth    Class    Correct
0.3 0.3 0.4      0 0 1           Cat      Yes
0.3 0.4 0.3      0 1 0           Dog      Yes
0.1 0.2 0.7      1 0 0           Cow      No
Cross entropy = -(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38

Computed         Ground truth    Class    Correct
0.1 0.2 0.7      0 0 1           Cat      Yes
0.1 0.7 0.2      0 1 0           Dog      Yes
0.3 0.4 0.3      1 0 0           Cow      No
Cross entropy = -(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
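A minimal check of the two averages above (sketch only; `mean_ce` is an illustrative helper name, not from the lecture):

```python
import numpy as np

def mean_ce(probs, targets):
    """Average cross-entropy given predicted probabilities and one-hot targets."""
    probs, targets = np.asarray(probs), np.asarray(targets)
    p_true = (probs * targets).sum(axis=1)     # probability assigned to the true class
    return -np.log(p_true).mean()

targets = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(mean_ce([[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]], targets))  # ~1.38
print(mean_ce([[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]], targets))  # ~0.64
```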



Image Classification (MNIST)

◼ We would like to classify into 10 different classes: the digits 0-9
◼ It is old but still used in research
◼ It is based on data from the National Institute of Standards and Technology
◼ It is comprised of handwritten digits written by census employees and school children
◼ It has a resolution of 28x28, with 60K training and 10K test samples, all with labels
◼ The training and test samples are not written by the same participants
Image Classification (MNIST)

◼ Curse of dimensionality
^ Assume binary images: there are 2^784 ≈ 10^236 different images
^ For grayscale images we have 256^784 combinations
^ Why is classification even possible with only 60K images?
^ The images are concentrated on a low-dimensional manifold
in {0, …, 255}^784
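A two-line sanity check of these counts (the exponents above follow from N·log10 of the number of values per pixel):

```python
import math

print(784 * math.log10(2))    # ~236  -> 2**784  ~ 10**236  possible binary 28x28 images
print(784 * math.log10(256))  # ~1888 -> 256**784 ~ 10**1888 possible grayscale images
```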



MLP (MNIST DEMO)

◼ Check the uploaded IPython notebook (a minimal MLP sketch is shown below)
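For orientation only, here is a minimal PyTorch MLP for 28x28 digit images; this is a sketch and not the uploaded notebook's code, and the hidden size 128 is an illustrative choice.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),              # 28x28 image -> 784-dimensional vector
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),        # 10 classes: the digits 0-9
)

x = torch.randn(4, 1, 28, 28)  # a dummy batch of 4 "images"
print(mlp(x).shape)            # torch.Size([4, 10]) -> one score per digit class
```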



Universal Representation

◼ Networks with a single hidden layer can represent any function F(x) with
arbitrary accuracy in the limit of a large hidden size
◼ However
^ Limitations of the learning algorithm
• A given learning algorithm may be unable to find an optimum with this accuracy
^ Efficiency
• A network with one hidden layer can be inefficient at representing a nonlinear function
• The required number of hidden neurons can be exponential in the input size
^ A nonlinear function F(x) can be better represented by
• deep networks with narrower layers

Kurt Hornik: Approximation capabilities of multilayer feedforward networks. Neural Networks, Volume 4, Issue 2, 1991.
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314, 1989.



Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Intensity variations, etc.
^ Pixels are a bad representation from a machine learning point of view

Image source: https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/



Edge detection

◼ Simple filters
^ Edge detection
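A small NumPy illustration of such a simple filter (a sketch with an illustrative toy image; the Sobel kernel is a standard edge-detection filter, though the slide's exact filter is not specified):

```python
import numpy as np

# A vertical-edge image: dark left half, bright right half
img = np.zeros((5, 5))
img[:, 2:] = 1.0

# Classic 3x3 Sobel filter that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# "Valid" sliding-window filtering, written out explicitly
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(img[i:i + 3, j:j + 3] * sobel_x)
print(out)   # nonzero where the window straddles the edge, zero in flat regions
```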



SIFT

https://medium.com/machine-learning-world/feature-extraction-and-similar-image-search-with-opencv-for-newbies-3c59796bf774



Motivation (Convolutional Neural Networks)

◼ Fully connected layers


^ Each input is connected to each node
◼ Pixels are bad features
^ Highly correlated
^ Scale dependent
^ Intensity variations, etc.
^ Pixels are a bad representation from a machine learning point of view
◼ Can we find a better representation?



Motivation (Convolutional Neural Networks)

◼ Can we find a better representation?


^ We have a certain degree of locality in an image
^ We can find macro features at different locations
^ Hierarchy of features
• Edges + Corners → Eyes
• Eyes + Nose + Ears → Face
• Face + Body + Legs → human



Convolutional Neural Network

▪ Feature hierarchies



Convolutional Neural Network

(Figure: feature extraction stage followed by classification stage)

◼ Built-in invariances / equivariances (translation)


◼ Suitable for data on grid topologies
^ 1D (audio signal, time series)
^ 2D (pixelated images)
^ 3D (videos)
◼ CNNs: Consists of blocks
^ Convolutions + nonlinear activation + pooling (subsampling)
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998



Convolutions

◼ Convolution operation
(x ∗ w)(t) = ∫ x(a) w(t − a) da

◼ Discrete convolution
(x ∗ w)(t) = Σ_a x(a) w(t − a)



Convolutions
◼ Convolution is a linear operation
^ Example: let us consider x = (x1, x2, x3, x4, x5) and a kernel w = (w1, w2, w3)
^ We get one linear equation for each output element yi
◼ Convolving x with w can therefore be written as a linear operation
(a matrix product, see the sketch below)
^ We assumed that the kernel stays within the bounds of x. As a result the convolved
output y is reduced to dimension n − m + 1. To keep the same dimensions we can
pad x with zeros (zero padding)
^ Shared weights
^ Sparse connectivity
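A small NumPy sketch of this idea (illustrative values, not lecture code): the sliding window is written as a matrix whose rows all contain the same three weights, shifted by one position each time, which is exactly the weight sharing and sparse connectivity mentioned above. As the next slide explains, this sliding-window form without kernel flipping is technically cross-correlation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # x1..x5
w = np.array([0.5, -1.0, 2.0])               # w1..w3 (illustrative values)
n, m = len(x), len(w)

# Each row holds the same shared weights, shifted by one position:
# sparse connectivity (mostly zeros) and weight sharing (identical nonzero entries)
W = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    W[i, i:i + m] = w

print(W)
print(W @ x)                                  # the n - m + 1 = 3 outputs y
print(np.correlate(x, w, mode="valid"))       # same sliding-window result
```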
Convolution vs Cross-Correlation

◼ Convolution operation
(x ∗ w)(i) = Σ_j x(j) w(i − j)

◼ Cross-correlation
(x ⋆ w)(i) = Σ_j x(i + j) w(j)

^ Cross-correlation is convolution with a flipped kernel


^ In practice, cross-correlation is used
• The filters (weights) are initialized randomly
• The values are learned with backpropagation
https://en.wikipedia.org/wiki/Convolution
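A minimal NumPy check of the flipped-kernel relation (sketch with illustrative arrays):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
w = np.array([1.0, 0.0, -1.0])

conv = np.convolve(x, w, mode="valid")      # true convolution (kernel flipped internally)
corr = np.correlate(x, w, mode="valid")     # cross-correlation (no flip)

# Cross-correlation equals convolution with the flipped kernel
print(np.allclose(corr, np.convolve(x, w[::-1], mode="valid")))  # True
print(conv, corr)                           # differ only because the kernel is asymmetric
```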



Multidimensional Convolutions

◼ 2D convolution
(I ∗ K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)

◼ 2D cross-correlation
(I ⋆ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)



A closer look at spatial dimensions

◼ 7x7 input (spatially), assume a 3x3 filter
◼ Sliding the filter over all valid positions => 5x5 output

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


A closer look at spatial dimensions

◼ 7x7 input (spatially), assume a 3x3 filter applied with stride 2 => 3x3 output!

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


A closer look at spatial dimensions

◼ 7x7 input (spatially), assume a 3x3 filter applied with stride 3?
◼ Doesn't fit! A 3x3 filter cannot be applied to a 7x7 input with stride 3.

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition


◼ Output size: (N - F) / stride + 1 (for an NxN input and an FxF filter)
◼ e.g. N = 7, F = 3:
• stride 1 => (7 - 3)/1 + 1 = 5
• stride 2 => (7 - 3)/2 + 1 = 3
• stride 3 => (7 - 3)/3 + 1 = 2.33, not valid

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition



In practice: Common to zero pad the border

◼ e.g. input 7x7, 3x3 filter applied with stride 1,
padded with a 1-pixel border of zeros => what is the output?
◼ Recall (N - F) / stride + 1, applied to the padded 9x9 input: (9 - 3)/1 + 1 = 7, so a 7x7 output!
◼ In general, it is common to see CONV layers with
• stride 1,
• filters of size FxF,
• and zero-padding with (F - 1)/2 (this preserves the spatial size)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
(a sketch of this output-size rule follows below)

Stanford University CS231n: Convolutional Neural Networks for Visual Recognition
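A small helper capturing the output-size rule from the last two slides, generalized with a padding term; the function name and error handling are my own illustrative choices:

```python
def conv_output_size(N, F, stride=1, pad=0):
    """Spatial output size: (N - F + 2*pad) / stride + 1 (must be an integer)."""
    span = N - F + 2 * pad
    if span % stride != 0:
        raise ValueError(f"a {F}x{F} filter with stride {stride} and pad {pad} "
                         f"does not fit an {N}x{N} input")
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))         # 5
print(conv_output_size(7, 3, stride=2))         # 3
print(conv_output_size(7, 3, stride=1, pad=1))  # 7: (F-1)/2 zero padding preserves the size
# conv_output_size(7, 3, stride=3)              # raises: stride 3 doesn't fit
```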
Multichannel Convolutions

◼ Multichannel convolutions



Convolution Layer
◼ 32x32x3 image -> preserve the spatial structure
◼ 32 height, 32 width, 3 depth



Convolution Layer

◼ 32x32x3 image
◼ 5x5x3 filter (filters always extend the full depth of the input volume)
◼ Convolve the filter with the image,
i.e. "slide over the image spatially, computing dot products"


Convolution Layer

◼ 32x32x3 image, 5x5x3 filter
◼ Each output is 1 number: the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. a 5*5*3 = 75-dimensional dot product + bias)



Convolution Layer

◼ 32x32x3 image, 5x5x3 filter
◼ Convolve (slide) the filter over all spatial locations
=> a 28x28x1 activation map



Convolution Layer

◼ Consider a second (green) 5x5x3 filter
◼ Convolving it over all spatial locations gives a second 28x28x1 activation map
Convolution Layer

◼ For example, if we had 6 5x5 filters, we'd get 6 separate activation maps,
each of size 28x28
◼ We stack these up to get a "new image" of size 28x28x6!



Preview: a ConvNet is a sequence of convolution layers, interspersed
with activation functions

◼ e.g. 32x32x3 input --CONV + ReLU (six 5x5x3 filters)--> 28x28x6



Preview: a ConvNet is a sequence of convolutional layers, interspersed
with activation functions

◼ e.g. 32x32x3 input --CONV + ReLU (six 5x5x3 filters)--> 28x28x6
--CONV + ReLU (ten 5x5x6 filters)--> 24x24x10 --> …
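A PyTorch sketch of this two-layer stack, just to confirm the spatial sizes (assumes PyTorch is installed; the random input and the layer objects are illustrative, not the lecture's code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                     # one 32x32x3 image (NCHW layout)

convnet = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(),    # six 5x5x3 filters -> 6x28x28
    nn.Conv2d(6, 10, kernel_size=5), nn.ReLU(),   # ten 5x5x6 filters -> 10x24x24
)

h = x
for layer in convnet:
    h = layer(h)
    print(type(layer).__name__, tuple(h.shape))
# Conv2d (1, 6, 28, 28) / ReLU (1, 6, 28, 28) / Conv2d (1, 10, 24, 24) / ReLU (1, 10, 24, 24)
```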



Preview [Zeiler and Fergus 2013]

(Figure: Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].)
Pooling Layer

- makes the representations smaller and more manageable
- operates over each activation map independently
MAX POOLING

◼ Single depth slice; max pool with 2x2 filters and stride 2:

1 1 2 4
5 6 7 8        =>        6 8
3 2 1 0                  3 4
1 2 3 4
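A one-line NumPy check of the pooled result above (sketch; the block-reshape trick assumes the input size is divisible by the pool size):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: split into 2x2 blocks and take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 8]
                #  [3 4]]
```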
Typical CNN Structure

◼ Image -> convolution -> pooling -> flattening -> fully connected layer -> softmax -> loss
(a minimal sketch of this pipeline follows below)
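A minimal PyTorch sketch of this pipeline (assumes PyTorch; all layer sizes, the 28x28 input, and the 10 classes are illustrative choices, not taken from the slides):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                            # pooling: 28x28 -> 14x14
    nn.Flatten(),                               # flattening
    nn.Linear(8 * 14 * 14, 10),                 # fully connected layer -> class scores
)
criterion = nn.CrossEntropyLoss()               # softmax + loss in one module

x = torch.randn(4, 1, 28, 28)                   # e.g. a batch of MNIST-sized images
labels = torch.randint(0, 10, (4,))
scores = model(x)
print(scores.shape, criterion(scores, labels).item())   # torch.Size([4, 10]) and a scalar loss
```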



Thanks a lot for your attention

