Lecture 05
◼ Gradient Descent
^ Updates after looking at complete dataset
◼ Minibatch Gradient Descent
^ Updates after looking at N samples (batch size)
◼ Stochastic Gradient Descent
^ Updates after looking at every single sample
◼ Related Concept
^ Epoch
• one cycle through the full training dataset
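A minimal sketch contrasting the three update schemes (the parameter vector w, data X, y, and gradient function grad are generic placeholders, not lecture code):

import numpy as np

def train(X, y, w, grad, lr=0.1, batch_size=32, n_epochs=10, mode="minibatch"):
    # One epoch = one full cycle through the training set.
    n = len(X)
    for epoch in range(n_epochs):
        if mode == "full":          # Gradient Descent: one update per epoch
            w -= lr * grad(X, y, w)
        elif mode == "minibatch":   # Minibatch GD: one update per N samples
            for i in range(0, n, batch_size):
                w -= lr * grad(X[i:i+batch_size], y[i:i+batch_size], w)
        else:                       # SGD: one update per single sample
            for i in range(n):
                w -= lr * grad(X[i:i+1], y[i:i+1], w)
    return w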
◼ Linear Classifier
𝒙𝟏  𝒙𝟐  XOR(𝒙𝟏, 𝒙𝟐)
0 0 0
0 1 1
1 0 1
1 1 0
◼ https://playground.tensorflow.org/
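XOR is not linearly separable, so a linear classifier cannot fit the table above, while one hidden layer suffices. A minimal PyTorch sketch (architecture and hyperparameters are illustrative, not from the lecture):

import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# Tiny MLP: one hidden layer is enough to separate XOR.
model = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(model(X).round())  # ≈ the XOR truth table: 0, 1, 1, 0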
Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.
◼ Problems
^ Saturates: the gradients are killed
^ Outputs are not zero-centred
◼ Problems
^ Non-zero-centred outputs restrict gradient updates and make optimisation inefficient (minibatching helps)
◼ Zero-centred
◼ Antisymmetric
◼ Problem
^ Saturation kills the gradient
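A small numeric illustration of saturation (a sketch, not lecture code): for large |x| the derivatives of sigmoid and tanh vanish, so gradients flowing through saturated units are effectively killed:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    dsig = sigmoid(x) * (1 - sigmoid(x))   # sigmoid'(x), max 0.25 at x = 0
    dtanh = 1 - np.tanh(x) ** 2            # tanh'(x),   max 1.00 at x = 0
    print(f"x={x:5.1f}  sigmoid'={dsig:.6f}  tanh'={dtanh:.6f}")
# At x = 10 both derivatives are ~0: the unit saturates and the gradient dies.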
◼ As a neural network
◼ Putting it together
◼ In machine learning we use the more general term "loss function" rather than "error function"
◼ We minimize the dissimilarity between the empirical data distribution p̂_data (defined by the training set) and the model distribution p_model, e.g. measured by the KL divergence
D_KL(p̂_data ‖ p_model) = E_{x∼p̂_data}[log p̂_data(x) − log p_model(x)]
where only the second term depends on the model, so minimizing the KL divergence amounts to maximizing the likelihood E_{x∼p̂_data}[log p_model(x)]
◼ The question is how to choose the dissimilarity measure
◼ We are working with a discrete distribution, i.e. the categorical distribution
◼ Alternative notation: one-hot vectors
Class 1: (1, 0, 0, 0)ᵀ
Class 2: (0, 1, 0, 0)ᵀ
Class 3: (0, 0, 1, 0)ᵀ
Class 4: (0, 0, 0, 1)ᵀ
◼ Categorical distribution
^ With the "one-hot vector" for class 1, y = (1, 0, 0, 0)ᵀ, and model probabilities p = (0.51, 0.10, 0.20, 0.10), the categorical likelihood picks out the probability of the true class:
p(y) = Πₖ pₖ^yₖ = 0.51¹ × 0.10⁰ × 0.20⁰ × 0.10⁰ = 0.51
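A quick numeric check of the one-hot likelihood above (the array names are illustrative):

import numpy as np

p = np.array([0.51, 0.10, 0.20, 0.10])  # model probabilities for 4 classes
y = np.array([1, 0, 0, 0])              # one-hot target: class 1

likelihood = np.prod(p ** y)    # = 0.51^1 * 0.10^0 * 0.20^0 * 0.10^0
print(likelihood)               # 0.51, the probability of the true class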
◼ Let s denote the network output after the last affine layer (= scores). Then:
softmax(s)ₖ = exp(sₖ) / Σⱼ exp(sⱼ)
◼ It is a soft/smooth approximation of max
◼ A differentiable approximation of a non-differentiable function
◼ Optimization is easier
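A minimal, numerically stable softmax over the scores s (a sketch, not the lecture's code; subtracting max(s) avoids overflow and leaves the result unchanged because softmax is shift-invariant):

import numpy as np

def softmax(s):
    # Shift by the max score for numerical stability.
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

s = np.array([3.2, 5.1, -1.7])  # example scores from the last affine layer
print(softmax(s))               # smooth, differentiable "soft" version of max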
◼ Curse of dimensionality
^ Assume binary 784-pixel (28×28) images: there are 2⁷⁸⁴ ≈ 10²³⁶ different images
^ For grayscale images we have 256⁷⁸⁴ combinations
^ Why is classification possible with only 60K training images?
^ The images are concentrated on a low-dimensional manifold in {0, …, 255}⁷⁸⁴
◼ Networks with a single hidden layer can represent any function F(x) with arbitrary accuracy in the large-hidden-size limit
◼ However
^ Limitations of the learning algorithm
• A given learning algorithm may be unable to find an optimum with this accuracy
^ Efficiency
• A network with one hidden layer can be inefficient at representing a nonlinear function
• The required number of hidden neurons can be exponential in the input size
^ A nonlinear function F(x) can be better represented by
• Deep networks with narrower layers
Kurt Hornik: Approximation capabilities of multilayer feedforward networks. Neural Networks, Volume 4, Issue 2, 1991.
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems 2, 303–314, 1989.
◼ Simple filters
^ Edge detection
https://medium.com/machine-learning-world/feature-extraction-and-similar-image-search-with-opencv-for-newbies-3c59796bf774
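Edge detection as a convolution with a small hand-crafted filter, e.g. a Sobel kernel (a sketch using SciPy; the image here is random placeholder data):

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)          # placeholder grayscale image

sobel_x = np.array([[-1, 0, 1],         # classic Sobel filter:
                    [-2, 0, 2],         # responds to vertical edges
                    [-1, 0, 1]])

edges = convolve2d(image, sobel_x, mode="valid")
print(edges.shape)                      # (26, 26): edge-response map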
◼ Feature hierarchies
◼ Convolution operation
^ Discrete convolution: (f ∗ g)[n] = Σₘ f[m] g[n − m]
^ Cross-correlation: (f ⋆ g)[n] = Σₘ f[m] g[n + m]
◼ 2D convolution: S[i, j] = (I ∗ K)[i, j] = Σₘ Σₙ I[m, n] K[i − m, j − n]
◼ 2D cross-correlation: S[i, j] = Σₘ Σₙ I[i + m, j + n] K[m, n]
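A naive NumPy sketch of 2D cross-correlation (which is what deep learning frameworks actually implement under the name "convolution"); true convolution just flips the kernel first. Function names are illustrative:

import numpy as np

def cross_correlate2d(image, kernel):
    H, W = image.shape
    F = kernel.shape[0]                     # assume a square F x F kernel
    out = np.zeros((H - F + 1, W - F + 1))  # output size (N - F) + 1 at stride 1
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+F, j:j+F] * kernel)
    return out

def convolve2d(image, kernel):
    # Convolution = cross-correlation with the kernel flipped in both axes.
    return cross_correlate2d(image, kernel[::-1, ::-1])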
◼ Example: 7×7 input (spatially), 3×3 filter
^ Stride 1 ⇒ 5×5 output
^ Stride 2 ⇒ 3×3 output
^ Stride 3? Doesn't fit! Cannot apply a 3×3 filter on a 7×7 input with stride 3
◼ Output size (recall): (N − F) / stride + 1
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition
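The output-size formula checked against the 7×7 examples above (a trivial helper; the name is illustrative):

def conv_output_size(N, F, stride):
    # (N - F) / stride + 1; valid only if (N - F) is divisible by the stride.
    assert (N - F) % stride == 0, "filter does not fit"
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))  # 5 -> 5x5 output
print(conv_output_size(7, 3, 2))  # 3 -> 3x3 output
# conv_output_size(7, 3, 3) raises: cannot apply a 3x3 filter with stride 3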
◼ Multichannel convolutions
^ 32×32×3 image: width 32, height 32, depth 3
^ 5×5×3 filter: the filter depth matches the image depth
◼ Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"
◼ Each position gives 1 number: the result of taking a dot product between the filter and a small 5×5×3 chunk of the image (i.e. a 5·5·3 = 75-dimensional dot product + bias)
◼ Sliding the 5×5×3 filter over all spatial positions of the 32×32×3 image yields a 28×28×1 activation map
◼ For example, if we have six 5×5 filters, we get 6 separate activation maps
◼ Convolution layer: the activation maps stack into a 28×28×6 output volume
◼ Stacking convolution layers, each followed by an activation function:
32×32×3 → CONV + ReLU (e.g. six 5×5×3 filters) → 28×28×6 → CONV + ReLU (e.g. ten 5×5×6 filters) → 24×24×10 → …
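The same shape progression reproduced with PyTorch (a sketch; variable names are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # 32x32x3 input image (NCHW layout)

conv1 = nn.Conv2d(3, 6, kernel_size=5)   # six 5x5x3 filters
conv2 = nn.Conv2d(6, 10, kernel_size=5)  # ten 5x5x6 filters

h = torch.relu(conv1(x))
print(h.shape)                           # torch.Size([1, 6, 28, 28])
h = torch.relu(conv2(h))
print(h.shape)                           # torch.Size([1, 10, 24, 24])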
Typical CNN Structure
◼ Image → Convolution → Pooling → Flattening → Fully Connected Layer → Softmax → Loss
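A minimal PyTorch sketch of this pipeline; all layer sizes are illustrative, assuming 32×32×3 inputs and 10 classes:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # convolution layer: 32x32x3 -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling layer: 28x28x6 -> 14x14x6
    nn.Flatten(),                     # flattening: 14x14x6 -> 1176
    nn.Linear(6 * 14 * 14, 10),       # fully connected layer -> class scores
)

x = torch.randn(8, 3, 32, 32)         # a batch of 8 images
scores = model(x)                     # shape: (8, 10)
labels = torch.randint(0, 10, (8,))   # placeholder targets
loss = nn.CrossEntropyLoss()(scores, labels)  # softmax + loss in one step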