
Pattern Recognition (Pattern Classification)

Convolutional Neural Networks (CNNs or ConvNets) for Visual Recognition
Hypothesis Set and Algorithm

First Edition
Acknowledgment
• This chapter is adapted from the lecture notes of “CS231n: Convolutional
Neural Networks for Visual Recognition”, Spring 2022
• http://cs231n.stanford.edu/

• https://github.com/cs231n/cs231n.github.io

Contents
1. Convolutional Neural Network
2. Fully Connected (Multilayer Perceptron)
3. Convolution
4. Pooling
5. Training
6. CNN Optimization algorithm
7. Learning rate schedules
8. Backpropagation
9. Initialization
10. Architectures for Image recognition
11. Normalization
12. Transfer Learning
13. CNN for text classification
14. Beyond Classification
15. Visualizing CNN features
16. Failures of CNN
17. Deep learning hardware and software
1- Convolutional Neural Network
Recall from Chapter 1: Items, Data set, Feature vector, Label set
• Data set: collection of items (instances, examples) of data used for training,
validation, and evaluation (test)
• Sample set:
• Example: In email spam prediction (detection),
sample set S = collection of email messages
• Feature vector (attributes): an example is represented by a vector of features in

• : set of all possible items, and ; features can be either hand-crafted or learned

• Items in come from an unknown distribution ()

Recall from Chapter 1: Learning (label learning) - shallow learning

What the machine knows (data):
• Training set: Sample, Label
• Concept set:
• Target concept
• Probability distribution of examples (unknown):
• Teacher provides noise-free labels

What the machine learns (hypothesis):
• Hypothesis set: H
• Learning: select a hypothesis h ∈ H
• Learning algorithm: select a hypothesis h ∈ H that has a small loss

Deep learning: Learning “Features + Labels”
• Suppose a multi-label classification project in computer vision
• Deep learning results in: learning the features + learning the labels (classification)
• Learning the features: representation learning
[Figure: a 32×32 RGB input image (x ∈ ℝ^{32×32×3}) passes through feature (representation) learning and then label learning (shallow learning, e.g. SVM, AdaBoost, Perceptron) to produce a label prediction]
Convolutional Neural Networks (CNNs)
• Representation learning is done by Convolution
• Label learning is done by Perceptron (fully connected neural network)

[Figure: a 32×32×3 RGB input (x ∈ ℝ^{32×32×3}) → convolution layers (feature/representation learning) → perceptron, i.e. fully connected layers (label learning) → labeling scores]
Example
[Figure: an example CNN producing labeling scores for an input image]

Example: CIFAR 10
(https://www.cs.toronto.edu/~kriz/cifar.html)

• The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes (concepts), with 6,000 images per class.
• There are 50,000 training/validation images and 10,000 test images.
• Five training batches and one test batch, each with 10,000 images.
• Classes are completely mutually exclusive.
Example: CIFAR 100
(https://www.cs.toronto.edu/~kriz/cifar.html)

• Just like the CIFAR-10, except it has 100 concepts containing 600
images each.
• 500 training/validation images and 100 testing images per class.
• 100 classes in the CIFAR 100 are grouped into 20 super-classes. Each
image comes with a "fine" label (the class to which it belongs) and a
"coarse" label (the super-class to which it belongs).
• e.g. Super-class: vehicles (classes: bicycle, bus, motorcycle, pickup
truck, train)

ImageNet Large Scale Visual Recognition
Challenge (ILSVRC), and Kaggle
• ImageNet is a visual dataset that contains more than 15 million labeled high-resolution images covering almost 22,000 categories (concepts), such as "balloon" or "strawberry", each consisting of several hundred images.

• During 2010-2017, an annual contest, the “ImageNet Large Scale Visual Recognition Challenge (ILSVRC)”, was held on correctly classifying and detecting objects. The training set contains 1.3 million images, accompanied by 50,000 validation images and 100,000 test images.

• Conclusion of ILSVRC: the annual ImageNet competition was no longer held after 2017 and moved to Kaggle
• Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners
Kaggle.com/datasets

https://www.kaggle.com/competitions

CNN Models for Practical Applications, 2017
• Inception-v4: ResNet + Inception
• VGG-19: most parameters, most operations
• GoogLeNet: most efficient
• Top-k accuracy means that any of the k highest scores must match the real label

ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

Convolutional Neural Networks - VGG16
[Figure: the VGG16 pipeline; the convolutional stack performs representation learning and outputs a 7×7×512 = 25,088-dimensional feature volume, followed by fully connected layers and a softmax classifier for label learning]
VGG: Visual Geometry Group at Oxford, 2014


h ∈ H – VGG16, 2014
[Figure: the VGG16 computation on an input x_i as a stack of 13 CONV+ReLU layers (with max pooling, mp) followed by 3 FC layers; the first stages compute w_1 ∗ x_i and then w_2 ∗ ReLU(w_1 ∗ x_i), and so on]
– VGG16, 2014
• Softmax is a loss function
• 138 million parameters in total
• 102.76 million parameters (first FC layer)
• 16.78 million parameters (second FC layer)
• 4.096 million parameters (third FC layer)
• Label learner: 123.63 million parameters
• Size of feature vector: 25,088
[Figure: the VGG16 stack of 13 CONV+ReLU layers followed by 3 FC layers applied to an input x]

– ResNet, 2015
• Total layers: 18, 34, 50, 101, or 152
• Batch Normalization after every CONV layer
Overview of CNN architectures

Deep learning frameworks and libraries

TensorFlow is deemed the most effective and easy to use

CNN (CNN architecture)
• Functions in H consist of a series of operations on the input
• The main operations are:
• Convolution
• Non-linear transformation
• Pooling, Normalization, Dropout, …
• Fully connected (Multilayer Perceptron)

h(x) = sequence of Convolution, Nonlinear, Pooling, Normalization, Dropout, …, Perceptron

CNN learning problem
• We discussed the following optimization problem based on the empirical risk of h and a regularization term:

$\underset{h \in H}{\mathrm{argmin}} \left( \hat{R}_S(h) + \lambda \mathcal{R}(h) \right) = \underset{h \in H}{\mathrm{argmin}}\, L(W)$

λ > 0 is the regularization parameter (treated as a hyperparameter)
• Here is the CNN learning problem:

$L(W) = \frac{1}{m} \sum_{i=1}^{m} L_i\big(h(x_i, W), y_i\big) + \lambda\, \mathcal{R}(W), \qquad W = \{w, b\}$

The first term is the training error (empirical risk: model predictions should match the training data); the second term is the regularization loss with regularization strength λ (a hyperparameter). Regularization prevents the model from doing too well on the training data (it keeps the model simple so it works better on test data).
Regularization (the regularizer sums over all layers of the CNN)

• In common use (see the sketch below):
1. L2 regularization: ℛ(W) = Σ_k w_k²
2. L1 regularization: ℛ(W) = Σ_k |w_k|
3. Elastic net (L1 + L2): ℛ(W) = Σ_k (β w_k² + |w_k|)
4. Dropout
5. Batch normalization (mini-batch normalization does not do regularization)
6. Stochastic depth
7. …
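
A minimal NumPy sketch (ours, not from the slides; the function name and the value of reg are illustrative) of adding an L2 penalty to the average data loss:

    import numpy as np

    # L(W) = (1/m) * sum_i L_i + lambda * sum(W^2), with lambda as a hyperparameter
    def regularized_loss(data_losses, W, reg=1e-4):
        data_loss = np.mean(data_losses)      # empirical risk term
        reg_loss = reg * np.sum(W * W)        # L2 regularization term
        return data_loss + reg_loss

    W = 0.01 * np.random.randn(10, 3073)      # toy weight matrix
    print(regularized_loss(np.array([2.9, 0.0, 12.9]), W))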
CNN is a multi-class, mono-label, score-based classifier
• In the multi-concept setting, a hypothesis is defined based on a scoring function
• The label associated to a test example (image) is the one resulting in the largest score, which defines the mapping from examples to labels
[Figure: three test images x1, x2, x3 with scores for cat, car, frog; e.g. Score(x1, car) = h(x1, car); predicted label for all three: car]

Hinge (SVM) margin loss function in CNN
• CNN incorporates either hinge margin loss function or Log loss
function during training
• Hinge margin loss function in CNN:
• Empirical margin loss: given a sample S and a hypothesis h, the empirical margin loss is defined by

$\hat{R}_{S,\rho}(h) = \frac{1}{m}\sum_{i=1}^{m} L_i$

$L_i = \max\big(0,\ \Phi_{\rho=1}(\rho_h(x_i, y_i))\big), \qquad \Phi_{\rho=1}(\rho_h(x_i, y_i)) = 1 - \rho_h(x_i, y_i)$

$\rho_h(x, y) = h(x, y) - \max_{y' \ne y} h(x, y')$
Example: margin loss, m = 3 examples

Examples: x1, x2, x3
Concepts: k = 1 (cat), k = 2 (car), k = 3 (frog)
s_{1,1} = s(x1, 1) = h(x1, cat)
s_{2,3} = s(x2, 3) = h(x2, frog)
[Table: the score vector of each example under the three concepts]

Hinge (SVM) margin loss function, $\Phi_{\rho=1}(\rho_h(x, y))$
[Plot: $\Phi_{\rho=1}$ as a function of $\rho_h(x,y)$; the loss equals $1 - \rho_h(x,y)$ for $\rho_h(x,y) < 1$ and 0 otherwise, with ρ = 1]
Losses for x1, x2, x3: 2.9, 0, 12.9
$\rho_h(x, y) = h(x, y) - \max_{y' \ne y} h(x, y')$

Hinge loss function in CNN
[Plot: the hinge loss $\Phi_{\rho=1}(\rho_h(x,y))$ expressed in terms of the score $h(x,y)$; the loss equals $1 + h(x, y') - h(x, y)$ for $h(x,y) < 1 + h(x,y')$ and 0 otherwise, where $y' = \mathrm{argmax}_{y' \ne y}\, h(x, y')$]

$\mathrm{Score} = h(x, y) = \rho_h(x, y) + \max_{y' \ne y} h(x, y')$

Empirical margin loss
Losses for x1, x2, x3: 2.9, 0, 12.9

$\hat{R}_{S,\rho}(h) = \frac{1}{m}\sum_{i=1}^{m} L_i = \frac{1}{m}\sum_{i=1}^{m} \Phi_\rho\big(\rho_h(x_i, y_i)\big) = \frac{2.9 + 0 + 12.9}{3} = 5.27$
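
A minimal NumPy sketch of the per-example hinge loss above (ours; the score vector below is illustrative and chosen so that the loss reproduces the 2.9 of example x1):

    import numpy as np

    # L_i = sum_{j != y} max(0, s_j + 1 - s_y)
    def hinge_loss(scores, y):
        margins = np.maximum(0, scores - scores[y] + 1.0)
        margins[y] = 0.0                    # do not count the correct class
        return margins.sum()

    scores = np.array([3.2, 5.1, -1.7])     # illustrative scores for cat, car, frog
    print(hinge_loss(scores, y=0))          # 2.9 when the true label is cat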

“Softmax function” + “Log loss function” in
CNN
• We interpret the scores as probabilities
• For image x_i the classifier gives scores s_{i,k}, and each score is interpreted as a probability given by the softmax function:

likelihood of correctness: $P(y_i = k \mid x_i) = \dfrac{e^{s_{i,k}}}{\sum_{j=1}^{K} e^{s_{i,j}}}$

• The log loss for x_i is: $L_i = \log 1 - \log P(y_i = k \mid x_i) = -\log \dfrac{e^{s_{i,k}}}{\sum_{j=1}^{K} e^{s_{i,j}}}$

• It is the “negative log likelihood of correctness”

• Empirical loss: $\hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} L_i$
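
A minimal NumPy sketch of the softmax + log loss for one example (ours; the max-shift is only for numerical stability, and the illustrative scores reproduce the 2.04 of the later worked example):

    import numpy as np

    def softmax_log_loss(scores, y):
        shifted = scores - np.max(scores)
        probs = np.exp(shifted) / np.sum(np.exp(shifted))   # P(y_i = k | x_i)
        return -np.log(probs[y])                            # negative log likelihood

    scores = np.array([3.2, 5.1, -1.7])
    print(softmax_log_loss(scores, y=0))                    # ~2.04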

Log loss (logistic loss), cross-entropy
• Cross-entropy is related to, and often confused with, logistic loss, called log loss:
$L_i = -\log P(y_i = k \mid x_i)$
• Both measures calculate the same quantity and can be used interchangeably for classification training
[Plot: $L_i$ as a function of $P(y_i = k \mid x_i)$]

CNN algorithm using Log loss function
• The algorithm is “Maximum Likelihood Estimation”
• It chooses h to maximize the likelihood of observing the training set, by minimizing the empirical risk in a regularization-based framework

$\hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} L_i, \qquad L_i = \log 1 - \log P(y_i = k \mid x_i)$

$\underset{h \in H}{\mathrm{argmin}}\left( \hat{R}_S(h) + \lambda \mathcal{R}(h) \right) = \underset{h \in H}{\mathrm{argmin}}\, L(W)$

Multiclass Classifier using softmax
For example x1: $L_1 = \log 1.00 - \log 0.13 = -\log 0.13 = 2.04$

Cross-Entropy vs Hinge for $S = h(x_i, y_i = 3;\, W, b)$
[Example: for the score vector S, the two losses are $L_i = 1.58$ and $L_i = 0.452$]

Cross-Entropy vs Hinge

Cross-entropy: $L_i = -\log \dfrac{e^{s_{i,i}}}{\sum_{j=1}^{K} e^{s_{i,j}}}$

Hinge: $L_i = \sum_{j \ne i}^{K} \max\big(0,\ s_{i,j} + 1 - s_{i,i}\big)$

2- Fully Connected (Multilayer Perceptron)
A label learner

Label learning stage
• VGG16: the last maxpool layer outputs a feature vector of dimension 25,088 (7×7×512) for each input image
[Figure: the 25,088-dimensional input feature vector passes through fully connected layers with weights w14, w15, w16 to produce label prediction probabilities]

Label learner: Fully Connected (multilayer perceptron)
VGG16
• “Fully-connected” layers
[Figure: 25088 → 4096 → 4096 → 1000 (S: score); counting conventions: “2-layer Neural Net” or “1-hidden-layer Neural Net”; “3-layer Neural Net” or “2-hidden-layer Neural Net”]

Fully connected layer
• No hidden layer (1 layer)
[Figure: input layer x (25,088 units) fully connected to the output activation layer (1,000 units) through weights w and bias b]

$h(x, w) = w^T x + b \equiv w^{1000 \times 25088}\, x^{25088 \times 1} + b^{1000 \times 1}$

Fully connected layer
• One hidden layer (2-layer)

• Hypothesis: an affine map (w14), a nonlinearity Φ, and a second affine map (w15)
• Score of x: S
[Figure: input layer x → weights w14 → hidden layer with nonlinearity Φ → weights w15 → score S]
Activation functions (non-linear transforms)

ReLU is a good default choice for most problems.


Fully connected layer using ReLU
• One hidden layer (2-layer)

• Hypothesis: an affine map (w14), a ReLU nonlinearity, and a second affine map (w15); see the sketch below
• ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
• Score of x: S
[Figure: input layer x → weights w14 → hidden layer with ReLU → weights w15 → score S]
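
A minimal NumPy sketch of such a 2-layer fully connected scorer (ours; the weight names follow the figure, the biases and the small demo sizes are assumptions, while VGG16 would use 25088 → 4096 → 1000):

    import numpy as np

    def two_layer_scores(x, w14, b14, w15, b15):
        hidden = np.maximum(0, w14 @ x + b14)   # ReLU hidden layer
        return w15 @ hidden + b15               # class scores S

    x = np.random.randn(512)
    w14, b14 = 0.01 * np.random.randn(64, 512), np.zeros(64)
    w15, b15 = 0.01 * np.random.randn(10, 64), np.zeros(10)
    print(two_layer_scores(x, w14, b14, w15, b15).shape)   # (10,)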
Setting the number of layers and their sizes
• 1 hidden layer and 3, 6, and 20 neurons

• More neurons → more complex hypothesis set

Regularization

$L(W) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W), y_i\big) + \lambda\, \mathcal{R}(W)$
Larger λ: stronger regularization, simpler hypothesis
3- Convolution
Convolutional Neural Networks - VGG16 [Simonyan and Zisserman, 2014]
[Figure: the convolutional stack (representation learning) produces a 7×7×512 feature volume, followed by fully connected layers and a Softmax + log-loss classifier (label learning)]
VGG: Visual Geometry Group at Oxford, 2014


VGG16
• INPUT: [224x224x3]
• CONV3x3-64: [224x224x64]
• CONV3x3-64: [224x224x64]
• POOL2: [112x112x64]
• CONV3x3-128: [112x112x128]
• CONV3x3-128: [112x112x128]
• POOL2: [56x56x128]
• CONV3x3-256: [56x56x256]
• CONV3x3-256: [56x56x256]
• CONV3x3-256: [56x56x256]
• POOL2: [28x28x256]
• CONV3x3-512: [28x28x512]
• CONV3x3-512: [28x28x512]
• CONV3x3-512: [28x28x512]
• POOL2: [14x14x512]
• CONV3x3-512: [14x14x512]  (high-level features)
• CONV3x3-512: [14x14x512]
• CONV3x3-512: [14x14x512]
• POOL2: [7x7x512]
• FC: [1x1x4096]
• FC: [1x1x4096]
• FC: [1x1x1000]
Convolutional Neural Networks - AlexNet, 2012

Input: RGB image
• Resolution: 3 color channels and height × width
[Figure: an RGB input image split into its three channels]


Convolution
• Product of a patch of the image and a shifted filter
• We call the layer convolutional because it is related to the convolution of two functions x and w:

$x[a,b] * w[a,b] = \sum_{n_1=1}^{7} \sum_{n_2=1}^{7} x[n_1, n_2]\cdot w[a-n_1,\, b-n_2]$

[Figure: the image pixels x[a,b], the filter w[a,b], and the filtered image x[a,b] ∗ w[a,b] (stride = 1); a pixel of the filtered image is the product of a patch of the source image with the shifted filter w[a−2, b−2]]
Convolution
• Convolving 1 filter of size 5×5×3 with a 32×32×3 image results in a filtered image (called an activation map)
• The filter extends through all 3 channels of the input image
[Figure: a 32×32×3 image x and a 5×5×3 filter w; a pixel of the filtered image is computed as $w^T x + b$, a 5×5×3 = 75-dimensional dot product plus bias]
Convolution
• We call the layer convolutional because it is related to the convolution of two functions, x and w
[Figure: a filter slides over all spatial locations of the image]


Convolution → activation map (filtered image)
• One filter → one activation map
[Figure: an input of size H = 32, W = 32, C = 3; patch_i of the input maps to pixel out_i of the activation map]

$out_i = \sum_{k=1}^{3} \sum_{j=1}^{5\times 5} w_{j,k} \cdot in(patch_i)_{j,k} + b$
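
A naive NumPy sketch of the out_i computation above, sliding one 5×5×3 filter over a 32×32×3 input with stride 1 and no padding (ours; variable names are illustrative):

    import numpy as np

    def conv_single_filter(x, w, b):
        H, W, C = x.shape                    # 32, 32, 3
        F = w.shape[0]                       # 5
        out = np.zeros((H - F + 1, W - F + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i+F, j:j+F, :]   # a 5x5x3 patch of the input
                out[i, j] = np.sum(patch * w) + b
        return out

    x = np.random.randn(32, 32, 3)
    w, b = np.random.randn(5, 5, 3), 0.0
    print(conv_single_filter(x, w, b).shape)   # (28, 28) activation map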
Two filters → 2 activation maps
• Consider a second, green filter

Six filters → 6 maps
• For example, if we had six 5x5x3 filters, we’ll get 6 separate activation
maps:

• We stack these up to get a “new image” of size 28x28x6


Example
[Figure: a 74×74×3 image x convolved with 32 filters (Filter 1, …, Filter 15, …, Filter 32); each convolution x ∗ w_k produces a 70×70 activation map, so the output volume is 70×70×32]

A closer look at spatial dimensions
• 7×7 input (spatially) [Figure: the filter stepping across the spatial positions of the input]

Convolution Networks
• A ConvNet is a sequence of convolution layers, interspersed with activation functions (non-linear functions)
• A 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn’t work well
[Figure: an input image passing through successive convolution layers]
Zero pad border
• Put a zero pad on the border of the input
[Figure: a 7×7 input (N = 7) zero-padded with a 1-pixel border (2 padded pixels per dimension) to N′ = 9]

N′ = 9, F = 3, output size = (9−3)/1 + 1 = 7×7

Output size = input size when Pad = (F−1)/2 per side

Output size = (N + 2·Pad − F)/stride + 1
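
A small Python sketch of the output-size and padding formulas above (the function name is ours):

    def conv_output_size(N, F, stride=1, pad=0):
        assert (N + 2 * pad - F) % stride == 0, "filter does not fit cleanly"
        return (N + 2 * pad - F) // stride + 1

    print(conv_output_size(7, 3, stride=1, pad=1))   # 7: output size = input size
    print(conv_output_size(32, 5, stride=1, pad=0))  # 28, as in the earlier examples
    print((3 - 1) // 2)                              # Pad = (F-1)/2 = 1 for a 3x3 filter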

Number of parameters
• Number of parameters in a convolution layer?
Convolution parameters:
First conv layer (6 filters of size 5×5×3): each filter has 5*5*3+1 = 76 parameters (+1 for bias); total = 76*6 = 456
Second conv layer (10 filters of size 5×5×6): each filter has 5*5*6+1 = 151 parameters (+1 for bias); total = 151*10 = 1510
• The parameter count depends on the filter size, the number of input channels, and the number of output channels
[Figure: the input image passing through the 6-filter and then the 10-filter conv layer]

1 × 1 × m filter
• 1x1 convolutions are named bottleneck convolutions
• 1×1 convolution layers (filters) make perfect sense
[Figure: a 1×1 filter spanning all 64 input channels]

1 × 1 × m filter
• Preserving spatial information, reducing depth

What is effective receptive field of three
3x3 conv (stride 1) layers?
• Three 3x3 conv (stride 1) layers have the same effective receptive field as one 7x7 conv layer, but are
• deeper, with more non-linearities
• fewer parameters: 3·(3²C²) vs 7²C² for C = C_in = C_out channels per layer
[Figure: stacking three 3x3 convolutions over the input covers a 7x7 receptive field]

4- Pooling
Pooling
• Make a pool of pixels, and route one of them to the next layer
• Pooling makes representations smaller and more manageable
• Operates over each activation map independently (downsampling)
• We lose some valuable information
• Pooling layers reduce spatial resolution, so their outputs are invariant to small changes in inputs
MAX POOLING
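
A minimal NumPy sketch of 2x2 max pooling with stride 2 over one activation map (ours; the input matrix is illustrative):

    import numpy as np

    def max_pool_2x2(a):
        H, W = a.shape
        return a[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    a = np.array([[1, 1, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)
    print(max_pool_2x2(a))   # [[6., 8.], [3., 4.]]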

Fully connected (multilayer perceptron)
[Figure: the 7×7×512 = 25,088 feature volume is flattened to 1x1x25088 and multiplied by a 4096×25088 weight matrix; ReLU(x ∗ w) gives a 1x1x4096 output. A neuron is the result of taking a dot product between a row of w and the 1x1x25088 input; each neuron looks at the full input]

Examples
A classifier

Some applications
• Image recognition, speech recognition, text recognition

A non-application
• If the data is just as useful after swapping any of its columns with each other, then convolution does not work
• Convolution captures local “spatial” patterns in data

A simple example – Handwritten
recognition
• Input: character X

• 3 filters of size 3×3

A simple example – Handwritten recognition, 1 channel
• Convolution

A simple example – Handwritten
recognition
• Convolution

A simple example – Handwritten
recognition

[Figure: the input convolved with each of the three filters]
A simple example – Handwritten
recognition
• Activation maps

A simple example – Handwritten
recognition
• Max Pooling

A simple example – Handwritten
recognition
• Non-linear transformation

A simple example – Handwritten
recognition

A simple example – Handwritten
recognition

A simple example – Handwritten
recognition
• Fully Connected

Feature Values for X Feature Values for O

A simple example – Handwritten
recognition
• 2-hidden layer perceptron

A simple example – Handwritten
recognition
• A full view
• Output scores

[ConvNetJS demo: training on CIFAR-10]
• This demo trains a Convolutional Neural Network on the
CIFAR-10 dataset in your browser, with nothing but Javascript. The
state of the art on this dataset is about 90% accuracy and human
performance is at about 94% (not perfect as the dataset can be a bit
ambiguous).
• URL for the demo:
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.ht
ml

5- Training
Learning Problem
• Find the best W by solving the optimization problem:

$\underset{h \in H}{\mathrm{argmin}}\left( \hat{R}_S(h) + \lambda \mathcal{R}(h) \right) = \underset{h \in H}{\mathrm{argmin}}\, L(W)$

More Regularization: Dropout
• Dropout: in each forward pass, randomly set some nodes to zero
• The probability of dropping is a hyperparameter; 0.5 is common
• By dropping a node out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections
[Figure: a fully connected network in which 5 neurons are dropped out]


More Regularization: Dropout
• Q: How can this possibly be a good idea?
• A1: It forces the network to be less complex. It prevents co-adaptation of nodes (prevents different nodes from having highly correlated behavior)
• A2: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model

Dropout - test time
• At test time all neurons are always active
• We must scale the activations so that for each neuron:
output at test time = expected output at training time
• So, either
multiply the neuron activation by the dropout probability at test time,
or
divide each neuron activation by that probability during training and do nothing at test time (“inverted dropout”)
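
A minimal NumPy sketch of the second (“inverted dropout”) option (ours; note the assumption that p below denotes the keep probability, so the drop probability is 1 − p):

    import numpy as np

    p = 0.5   # keep probability (assumption for this sketch)

    def dropout_forward_train(a):
        mask = (np.random.rand(*a.shape) < p) / p   # drop and rescale during training
        return a * mask

    def dropout_forward_test(a):
        return a                                    # do nothing at test time

    a = np.random.randn(4096)
    print(dropout_forward_train(a).shape, dropout_forward_test(a).shape)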

More Regularization: Data Augmentation
• Generate more examples to increase the complexity of the target concept
• Transform the input image: flip horizontally, vertically, …
• Random crops and scaling of the complete image
• Randomize contrast and brightness (color jitter)
[Figure: an input image, its color-jittered and flipped versions, and random crops & scales, all fed to the CNN]

6- CNN Optimization algorithm
1- Stochastic Gradient Descent (SGD)
2- SGD + momentum
3- AdaGrad (Adaptive Gradient)
4- RMSProp
5- Adam (adaptive moment estimation)

Stochastic gradient descent - SGD
• The stochastic gradient descent (SGD) algorithm is used to solve the optimization problem.
• SGD is an optimization algorithm that estimates the loss gradient for the current state of the model using examples from the training dataset.
• It then updates the weights of the model. The examples that participate in an update can be referred to as support vectors, as in the Perceptron algorithm.
• The amount by which the weights are updated is referred to as the step size or the “learning rate.”
Stochastic Gradient Descent
• SGD is an algorithm that has a number of hyperparameters.
• Two integer hyperparameters are the batch size and number of epochs.
• Batch size is a hyperparameter that controls the number of training
samples to work through before the model’s internal parameters are
updated.
• The number of epochs is a hyperparameter of SGD that controls the
number of complete passes through the training dataset.
• Batch Gradient Descent. Batch Size = Size of Training Set
• Stochastic Gradient Descent. Batch Size = 1
• Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

Number of epochs
• Stop training the model when accuracy on the validation set
decreases
OR
• Train for a long time, but always keep track of the model snapshot
that worked best on validation set.

Weight updating in SGD
• The weights are updated as follows:
• In iteration t of training, a mini-batch of m′ training-set examples is fed into the network one by one and then L(W) is calculated. For the mini-batch, a size of m′ = 32/64/128 is common.

$L(W) = \frac{1}{m'} \sum_{i=1}^{m'} L_i\big(h(x_i, W), y_i\big) + \lambda\, R(W)$

• Then calculate $\nabla_W L$, and update W at iteration t:
$W_{t+1} = W_t - \eta\, \nabla_W L(W_t)$
• η is the learning rate

• Example:
Problems with weight updating in SGD - 1
• What if the loss changes quickly in one direction and slowly in another?
• What does gradient descent do?
• Answer: very slow progress along the shallow dimension, jitter along the steep direction
[Figure: loss contours over (w1, w2) with a zig-zagging SGD trajectory]
Problems with weight updating in SGD - 2
• What if the loss function has a local minimum or saddle point? Saddle points are much more common in high dimensions
[Plot: a loss curve over W with a local minimum and a saddle point]
SGD + Momentum
• Weight updating in SGD: $W_{t+1} = W_t - \eta\, \nabla L(W_t)$
• SGD + Momentum (velocity): $v_{t+1} = \rho\, v_t + \nabla L(W_t)$, $W_{t+1} = W_t - \eta\, v_{t+1}$
• Build up “velocity” as a running mean of gradients
• Typically ρ = 0.9 - 0.99; η is the learning rate
[Figure: loss contours over (w1, w2)]
SGD + Momentum
• Combine the gradient at the current point with the velocity to get the step used to update the weights; see the sketch below
• Note that the update can be written as:
Momentum update: $v_{t+1} = \rho\, v_t + \nabla L(W_t)$; weight update: $W_{t+1} = W_t - \eta\, v_{t+1}$
[Figure: at the current point, the velocity ρ·v_t and the gradient combine into the actual step v_{t+1}]
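
A minimal Python sketch comparing the two update rules on a toy quadratic loss L(w) = 0.5·wᵀAw (ours; all names and values are illustrative):

    import numpy as np

    A = np.diag([10.0, 1.0])                 # steep in w1, shallow in w2
    grad = lambda w: A @ w

    w_sgd = np.array([1.0, 1.0]); w_mom = np.array([1.0, 1.0])
    v = np.zeros(2); lr, rho = 0.05, 0.9
    for t in range(100):
        w_sgd = w_sgd - lr * grad(w_sgd)             # vanilla SGD
        v = rho * v + grad(w_mom)                    # momentum (velocity) update
        w_mom = w_mom - lr * v                       # weight update
    print(w_sgd, w_mom)                              # both approach the minimum at 0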

AdaGrad (Adaptive Subgradient Methods)
• Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension of W
• Weight updating: $W_{t+1} = W_t - \eta\, \dfrac{\nabla L(W_t)}{\sqrt{r_{t+1}} + \epsilon}$, with $r_{t+1} = r_t + \big(\nabla L(W_t)\big)^2$ (element-wise)
• r is the historical sum of gradient squares in each dimension of W
• “Per-parameter learning rates” or “adaptive learning rates”

AdaGrad (Adaptive Subgradient Methods)
• Python code (below), with the parameters W written as x and the gradient written as dx
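
A runnable sketch of the AdaGrad update in that notation (ours; the original slide's code image is not available, and the toy gradient of the loss ||x||² is an assumption made to keep the example self-contained):

    import numpy as np

    x = np.array([1.0, -2.0])
    learning_rate, eps = 0.5, 1e-7
    grad_squared = np.zeros_like(x)
    for t in range(200):
        dx = 2 * x                                        # gradient of the toy loss ||x||^2
        grad_squared += dx * dx                           # per-dimension history of squares
        x -= learning_rate * dx / (np.sqrt(grad_squared) + eps)
    print(x)                                              # moves toward [0, 0]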

Problems with AdaGrad
• 1. Progress along “steep” directions is damped; progress along “flat”
directions is accelerated.
• 2. Over a long run, the step size decays to zero

RMSProp: “Leaky AdaGrad”
• Weight updating: same as AdaGrad, but the accumulated sum of squares is replaced with an exponential moving average:
$r_{t+1} = \beta\, r_t + (1-\beta)\big(\nabla L(W_t)\big)^2$

Exponential Moving Average: β controls the decay rate
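
A sketch of the same toy problem with the “leaky” accumulator (ours; values are illustrative):

    import numpy as np

    x = np.array([1.0, -2.0])
    learning_rate, decay_rate, eps = 0.05, 0.99, 1e-7
    grad_squared = np.zeros_like(x)
    for t in range(500):
        dx = 2 * x                                            # toy gradient, as above
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= learning_rate * dx / (np.sqrt(grad_squared) + eps)
    print(x)   # moves toward the minimum at [0, 0]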

Adam (adaptive moment estimation)

• $M_{t+1} = \beta_1 M_t + (1-\beta_1)\, \nabla L(W_t)$ (first moment estimate using exponential moving averaging)
• $\hat{M}_{t+1} = M_{t+1} / (1 - \beta_1^{\,t+1})$ (first moment bias correction)
• $R_{t+1} = \beta_2 R_t + (1-\beta_2)\, \big(\nabla L(W_t)\big)^2$ (second moment estimate using exponential moving averaging)
• $\hat{R}_{t+1} = R_{t+1} / (1 - \beta_2^{\,t+1})$ (second moment bias correction)

• Return $W_{t+1} = W_t - \eta\, \dfrac{\hat{M}_{t+1}}{\sqrt{\hat{R}_{t+1}} + \epsilon}$
• η (learning rate), β1 (first moment decay rate, typically 0.9), β2 (second moment decay rate, typically 0.999), ε (numerical term, typically 10⁻⁷)
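
A minimal Python sketch of these update steps on the same toy loss (ours; learning rate and data are illustrative):

    import numpy as np

    x = np.array([1.0, -2.0])
    lr, beta1, beta2, eps = 1e-1, 0.9, 0.999, 1e-7
    M = np.zeros_like(x); R = np.zeros_like(x)
    for t in range(1, 501):
        dx = 2 * x                                   # gradient of the toy loss ||x||^2
        M = beta1 * M + (1 - beta1) * dx             # first moment (EMA of gradients)
        R = beta2 * R + (1 - beta2) * dx * dx        # second moment (EMA of squares)
        M_hat = M / (1 - beta1 ** t)                 # bias corrections
        R_hat = R / (1 - beta2 ** t)
        x -= lr * M_hat / (np.sqrt(R_hat) + eps)
    print(x)   # moves toward the minimum at [0, 0]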

Adam (adaptive moment estimation)
• Bias correction for the fact that first and second moment estimates
start at zero
• Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-
4 is a great starting point for many models!

Initialization Bias and Its Correction
• The moving averages are estimates of the 1st moment (the mean) and the 2nd
moment (the uncentered variance) of the gradient.
• However, these moving averages are initialized as (vectors of) 0’s, leading to moment
estimates that are biased towards zero, especially during the initial timesteps, and
especially when the decay rates are small (i.e. and are close to 1).
• Bias Correction for the mean:

$\mathbb{E}[M_{t+1}] = \mathbb{E}\big[\beta_1 M_t + (1-\beta_1)\,\nabla L(W_t)\big] = \mathbb{E}\Big[(1-\beta_1)\sum_{i=1}^{t+1} \beta_1^{\,t+1-i}\, \nabla L(W_i)\Big] = \mathbb{E}[\nabla L(W_t)]\times(1-\beta_1)\sum_{i=1}^{t+1}\beta_1^{\,t+1-i} + c$

$\mathbb{E}[M_{t+1}] = \mathbb{E}[\nabla L(W_t)]\,(1-\beta_1^{\,t+1}) + c \;\Longrightarrow\; \mathbb{E}[\nabla L(W_t)] \approx \frac{\mathbb{E}[M_{t+1}]}{1-\beta_1^{\,t+1}} = \mathbb{E}[\hat{M}_{t+1}]$

• An analogous argument derives the bias correction of second moment

Adam’s properties
• 1 - Loss scale-invariance:
$L(W) \rightarrow cL(W) \;\Longrightarrow\; \hat{M}_t \rightarrow c\,\hat{M}_t \ \wedge\ \hat{R}_t \rightarrow c^2\, \hat{R}_t$

$W_{t+1} = W_t - \eta\, \frac{c\,\hat{M}_{t+1}}{\sqrt{c^2 \hat{R}_{t+1}} + \epsilon} \approx W_t - \eta\, \frac{\hat{M}_{t+1}}{\sqrt{\hat{R}_{t+1}} + \epsilon}$
• 2 – Initialization Bias correction for first and second moments:

Adam’s properties
• 3 - Bounded norm of the update step

• 4 - Disabling the second moment estimation reduces Adam to SGD with momentum:
• Learning rate: with the infinity norm, the step is scaled by the largest magnitude among the elements of the gradient
• Momentum: ascending towards the running mean of the gradients

Adam’s properties
• 5 - RMSProp with momentum is the method most closely related to
Adam.
• Main differences:
• RMSProp rescales gradient and then applies momentum, Adam first applies
momentum (moving average) and then rescales (bias correction).
• RMSProp lacks bias correction, often leading to large step sizes in the early stages of a run (especially when the decay rate is close to 1).

When update parameters
• Batch Gradient Descent
• Batch Size = Size of Training Set, There is one parameter update per
epoch
• Stochastic Gradient Descent
• Batch Size = 1, There is one parameter update per example
• Mini-Batch Gradient Descent
• 1 < Mini-Batch Size < Size of Training Set, There is one parameter
update per Mini-batch

Mini-Batch Gradient Descent
• Consider a CNN under training; at each iteration the parameters are updated. In an iteration we enter a batch of examples into the network and calculate the average loss for the batch in the forward pass
• Consider a mini-batch of size m′. The average loss is
$L(W) = \frac{1}{m'}\sum_{i=1}^{m'} L_i + \lambda\, R(W)$
• We calculate $\nabla_W L$ in the backward pass using backpropagation

Mini-Batch Gradient Descent
Loop:
1. Sample a mini-batch of data
2. Forward prop it through the network, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient

ResNet: Mini-batch size 256 examples
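
A minimal NumPy sketch of this 4-step loop for a linear softmax classifier on random data (ours; all names, sizes, and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((512, 100)), rng.integers(0, 10, 512)
    W, lr, batch_size = 0.01 * rng.standard_normal((100, 10)), 1e-1, 64

    for step in range(100):
        idx = rng.choice(len(X), batch_size, replace=False)      # 1. sample a mini-batch
        xb, yb = X[idx], y[idx]
        scores = xb @ W                                           # 2. forward pass
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        loss = -np.log(probs[np.arange(batch_size), yb]).mean()
        dscores = probs.copy()                                    # 3. backprop to the gradient
        dscores[np.arange(batch_size), yb] -= 1
        dW = xb.T @ dscores / batch_size
        W -= lr * dW                                              # 4. update the parameters
    print("final mini-batch loss:", loss)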

7- Learning rate schedules
Learning rate: Weight updating
• SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have the learning rate as a hyperparameter
• All of them start with a large learning rate and decay it over time

• An advantage of SGD and other online or mini-batch update methods is that their convergence does not depend on the size of the training set, only on the number of updates

Learning rate
• Learning rate is a hyperparameter used in training
• Learning rate has a small positive value, often in the range between
0.0001 and 1.0
• Learning rate controls how quickly model is adapted to problem
• Smaller learning rates require more training epochs given smaller
changes made to weights each update, whereas larger learning rates
result in rapid changes and require fewer training epochs
• Too large learning rate can cause model to converge too quickly to a
suboptimal solution
• Adaptive Learning Rate!
Learning Rate Decay
• Start with large learning rate and decay over time

[Plot: the training loss L(W) under a decaying learning rate schedule]

Learning Rate: Cosine Decay

Learning Rate: Linear, Inverse sqrt Decay

Learning Rate Decay: Linear Warmup
• High initial learning rates can make the loss explode; linearly increasing the learning rate from 0.001 over the first ~10 epochs can prevent this

Which decay scheduling is the best?
• All of them
• Start with large learning rate and decay over time
• Try the cosine schedule:

$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi\, T_{cur}/T)\big)$

• where $\eta_{\min}$ and $\eta_{\max}$ are the ranges for the learning rate, and $T_{cur}$ accounts for how many epochs have been performed since the last period
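
A small Python sketch of that schedule (ours; the function name and the example values of eta_min, eta_max and T are assumptions):

    import math

    def cosine_lr(t_cur, T, eta_min=1e-5, eta_max=1e-1):
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T))

    for epoch in range(0, 100, 25):
        print(epoch, round(cosine_lr(epoch, T=100), 5))   # decays from eta_max toward eta_min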

In practice
• Adam is a good default choice in many cases; it often works ok even
with constant learning rate
• SGD + Momentum can outperform Adam but may require more
tuning of LR and schedule

• ResNets: multiply Learning Rate by 0.1 after epochs 30, 60, and 90

8- Backpropagation
Optimization Algorithm
How to find the best W?

$\min_{W} L(W)$

Backpropagation: a simple example, flow-graph method
• Flow graph: to prevent mathematical mistakes and make sure an implementation is computationally efficient
Example: find x, y, z that minimize f:
$\underset{x, y, z}{\mathrm{argmin}}\ f = (x + y)\, z$
We want the gradients of f with respect to x, y, and z

Back propagation: a simple example
• e.g. x = -2, y = 5, z = -4
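
Working the example by hand in Python (a sketch of the forward and backward passes on this flow graph; the intermediate name q is ours):

    x, y, z = -2.0, 5.0, -4.0
    q = x + y                 # forward: intermediate node q = 3
    f = q * z                 # forward: f = -12

    df_dz = q                 # backward: d(q*z)/dz = q = 3
    df_dq = z                 # d(q*z)/dq = z = -4
    df_dx = df_dq * 1.0       # chain rule through q = x + y: -4
    df_dy = df_dq * 1.0       # -4
    print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0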

Upstream, Downstream

Example 2:
• Find the inputs that minimize the function

Computational graph representation

Feed forward

Back propagation: Gradient flow

Computational graph variations
• Computational graph representation may not be unique. Choose one
where local gradients at each node can be easily expressed!

Patterns in gradient flow

Backprop Implementation: “Flat” code

Forward pass:
Computes loss

Backward pass:
Computes grads

9- Initialization
Xavier Initialization – Convolution Layer
• All components of W1 are initialized by a random quantity drawn from a uniform distribution with zero mean and 1/75 variance
• All components of W2 are initialized by a random quantity drawn from a uniform distribution with zero mean and 1/150 variance
• Initialize all biases b to zero
[Figure: the two conv layers of the earlier example, 6*(W1, b1) with 5*5*3+1 = 76 parameters per filter and 10*(W2, b2) with 5*5*6+1 = 151 parameters per filter]
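
A minimal NumPy sketch of this uniform initialization (ours; it uses the fact that U(−a, a) has variance a²/3, so a = √(3/fan_in) gives variance 1/fan_in):

    import numpy as np

    def xavier_uniform(shape, fan_in):
        a = np.sqrt(3.0 / fan_in)
        return np.random.uniform(-a, a, size=shape)

    W1 = xavier_uniform((6, 5, 5, 3), fan_in=5 * 5 * 3)      # variance ~ 1/75
    W2 = xavier_uniform((10, 5, 5, 6), fan_in=5 * 5 * 6)     # variance ~ 1/150
    b1, b2 = np.zeros(6), np.zeros(10)                       # biases start at zero
    print(W1.var(), W2.var())                                # close to 1/75 and 1/150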
Xavier Initialization – Fully Connected
Layer
• All components of W1 are initialized by a random quantity drawn from a uniform distribution with zero mean and 2/(25088+4096) variance
• All components of W2 are initialized by a random quantity drawn from a uniform distribution with zero mean and 2/(4096+2024) variance
• All components of W3 are initialized by a random quantity drawn from a uniform distribution with zero mean and 2/(2024+1000) variance
• Initialize all biases b to zero
[Figure: a 25088 → 4096 → 2024 → 1000 fully connected network (“3-layer Neural Net”, or “2-hidden-layer Neural Net”)]
Xavier Initialization and ReLU
• Xavier assumes a zero-centered activation function
• Activations collapse to zero when ReLU is used.

Weight Initialization: Kaiming / MSRA
Initialization when ReLU is used - ConvLayer
• All components of W1 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/75 variance
• All components of W2 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/150 variance
• Initialize all biases b to zero
[Figure: the two conv layers, 6*(W1, b1) with 5*5*3+1 = 76 parameters per filter and 10*(W2, b2) with 5*5*6+1 = 151 parameters per filter]
Weight Initialization: Kaiming / MSRA Initialization when ReLU is used – Fully Connected Layer
• All components of W1 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/25088 variance
• All components of W2 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/4096 variance
• All components of W3 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/2024 variance
• Initialize all biases b to zero
[Figure: the 25088 → 4096 → 2024 → 1000 fully connected network (“3-layer Neural Net”, or “2-hidden-layer Neural Net”)]

10- Architectures for Image
recognition
VGG16: Parameters and memory

[Table: per-layer memory and parameter counts for the VGG16 input and layers; memory ×2 for the backward pass]

GoogLeNet [Szegedy et al., 2014]
• GoogLeNet architecture tried to reduce learnable parameters, mainly
through inception module
• Inception module:
• leverages feature learning at different scales through convolutions with
different filters
• reduces number of parameters of hypothesis
• A deeper network, with computational efficiency
• 1000 labels (softmax)
• 22 layers (Don’t count auxiliary layers), 5 million parameters (12x less
than AlexNet, 27x less than VGG-16)
GoogLeNet
• All convolution blocks are appended with a ReLU
• Dropout in Fully connected, Fully connected-a1, and Fully connected-b1
• Local response normalization (LRN), normalizing over local input regions; not used any more
• Auxiliary classifiers for training only; the loss generated by the two auxiliary classifiers is added to the total loss with a weight of 0.3

[Figure: GoogLeNet layout: (Convolution, Pool-2) → stacked Inception modules (with dropout) → average pooling → fully connected (1 layer) → loss function]
Configuration details

ops: number of mathematical operations carried out within the module


GoogLeNet [Szegedy et al., 2014]
• “Inception module”: design a good local network topology (network
within a network) and then stack these modules on top of each other

• Apply parallel filter operations on input from previous layer:


• Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
• Pooling operation (3x3)
• Concatenate all filter outputs together depth-wise

Example of an “Inception” module
• Output size after filter concatenation: 28x28x(96+192+256+128) = 28x28x672
• Convolution operations:
  [1x1 conv, 128]: 28x28x128x1x1x256
  [3x3 conv, 192]: 28x28x192x3x3x256
  [5x5 conv, 96]: 28x28x96x5x5x256
• Total: 854M ops (very expensive compute)
• ops: number of mathematical operations carried out within the module
• Problem: computational complexity
[Figure: naive Inception module; the 28x28x256 input from the previous layer feeds parallel 1x1 conv (128), 3x3 conv (192), 5x5 conv (96), and 3x3 pool (stride 1) branches whose outputs (28x28x128, 28x28x192, 28x28x96, 28x28x256) are concatenated]

Example of an “Inception” module
• Convolution operations:
  [1x1 conv, 64]: 28x28x64x1x1x256
  [1x1 conv, 64]: 28x28x64x1x1x256
  [1x1 conv, 128]: 28x28x128x1x1x256
  [3x3 conv, 192]: 28x28x192x3x3x64
  [5x5 conv, 96]: 28x28x96x5x5x64
  [1x1 conv, 64]: 28x28x64x1x1x256
• Total: 358M ops (compared to 854M ops)
• Note: 1x1 convolutions are used to reduce the feature depth (256 → 64)
[Figure: Inception module with dimension reduction; the 28x28x256 input feeds 1x1 conv (64) → 3x3 conv (192), 1x1 conv (64) → 5x5 conv (96), 3x3 pool (stride 1) → 1x1 conv (128), and a direct 1x1 conv (64); the branch outputs (28x28x96, 28x28x192, 28x28x64, 28x28x128) are concatenated into 28x28x480]

ResNet [He et al., 2015]
• 152-layer model for ImageNet
• What happens when we continue stacking deeper layers on a “plain” CNN? The deeper model performs worse on both training and test error
• The deeper model performs worse, but it’s not caused by overfitting
• Fact: deep models have more representation power (higher complexity, e.g. more parameters) than shallower models
• Hypothesis: the problem is an optimization problem; deeper models are harder to optimize
[Figure: the ResNet architecture with stride-2 downsampling stages, a global average pooling layer, and an FC layer with 1000 outputs (labels)]
Deeper models are harder to optimize
• In ResNet, the depth of the convolutional stack is controlled during the optimization process.
• We build the network with a high number of convolution layers, and let the optimization process decide the number of active layers.
[Plot: training error vs. number of convolutional layers for a plain CNN and for a residual CNN]

Deeper CNN using Identity mapping
• What should the deeper model learn to be at least as good as the shallower model?
• Solution: copy the learned layers from the shallower model and set the additional layers to identity mapping
• Identity mapping: H(x) = x

ResNet architecture: Stack residual blocks
• Every residual block has two 3x3 conv layers
• Convolution operations for a 28×28×256 input:
• [3x3 conv, 256]: 28x28x256x3x3x256
• [3x3 conv, 256]: 28x28x256x3x3x256
[Figure: a residual block mapping a 28×28×256 input to a 28×28×256 output, with a skip connection adding the input to the block output]
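
A toy NumPy sketch of the residual idea H(x) = F(x) + x (ours; purely to keep it short, the 3x3 spatial convolution is replaced by a per-pixel channel-mixing stand-in, so only the skip connection is faithfully illustrated):

    import numpy as np

    def conv3x3_like(x, W):
        return np.maximum(0, x @ W)            # stand-in for conv3x3 + ReLU

    def residual_block(x, W1, W2):
        F = conv3x3_like(x, W1) @ W2           # F(x): the residual branch
        return F + x                           # skip connection: output = F(x) + x

    x = np.random.randn(28, 28, 256)
    W1 = 0.01 * np.random.randn(256, 256)
    W2 = 0.01 * np.random.randn(256, 256)
    print(residual_block(x, W1, W2).shape)     # (28, 28, 256), same as the input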
A note on efficiency
• It uses a “bottleneck” layer (1x1 conv) to improve efficiency (similar to GoogLeNet)
• Convolution operations:
• Reducing depth from 256 to 64 (1x1 conv, 64 filters, projecting to 28x28x64): [1x1 conv, 64] 28x28x64x1x1x256
• The 3x3 conv operates over only 64 feature maps: [3x3 conv, 64] 28x28x64x3x3x64
• Back to depth 256 (1x1 conv, 256 filters, projecting back to 256 feature maps): [1x1 conv, 256] 28x28x256x1x1x64

Training ResNet in practice
• Batch Normalization after every CONV layer
• Xavier initialization
• SGD + Momentum (ρ = 0.9)
• Learning rate: 0.1, divided by 10 when validation error plateaus
• Mini-batch size 256
• No dropout used

SENet (Squeeze-and-Excitation Networks )
[Hu et al. 2017]

• Improving ResNets
• Add a “feature recalibration” module that learns to adaptively
reweight feature maps
• Global information (global avg. pooling layer) together with 2 FC
layers used to determine feature map weights

SENet: Improving ResNets
• Schema of original Residual module (left) and SE-ResNet module
(right)

per-channel modulation weights

SENet
• A transformation Ftr maps the input X to feature maps U, e.g. a convolution
• The features U are passed through a squeeze operation Fsq, which produces a channel descriptor by aggregating the feature maps across their spatial dimensions (H×W)
• Aggregation is followed by an excitation operation Fex, which takes the form of a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights

SE-ResNet Module

ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

First CNN-based
winner

Comparing complexity...
Top-1 one-crop accuracy versus
amount of operations required for a
single forward pass
Size of blobs is proportional to number
of parameters

11- Normalization
Input image normalization

[Figure: data before and after normalization, with the standard deviations σx and σy shown along each axis]

Mini-Batch Normalization
• In deep networks, a too-high learning rate may result in gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues
• Batch normalization provides a normalized input for each layer of the CNN during training
• BN is usually inserted after Fully Connected or Convolutional layers and before the non-linear operation
• If we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, the optimizer would be less likely to get stuck in the saturated regime, and training convergence would accelerate

Mini-Batch Normalization
1. Input images in the batch are normalized (zero-centered) based on the mean and variance of the images in the batch
2. Layer ℓ:
   1. Feed the output of ReLU(ℓ−1) into Conv(ℓ) and calculate the layer's activation maps
   2. Normalize each activation of a channel based on the mean and variance of the activations of that channel
   3. Scale and shift the normalized value
   4. Send the resulting maps, namely y, into ReLU(ℓ); calculate the average loss for the images in the final layer
3. Start backpropagation

[Figure: x → Conv(ℓ) → BN → y → ReLU(ℓ) → Conv(ℓ+1), applied per channel]

Mini-Batch Normalization
• x(k) is an activation map (channel k) of the layer
• x(k) is normalized using the mean and variance of the activations in that channel; the
expected value and variance are calculated per channel (per activation map):

x̂(k) = (x(k) − E[x(k)]) / √(Var[x(k)] + ε)

• y(k) is calculated by scaling and shifting the normalized value:

y(k) = γ(k) · x̂(k) + β(k)

• The parameters γ(k), β(k) are learned along with the original model parameters

BN at test time
• During training, the means and variances of the activations are computed. These statistics
are used to normalize test examples at each layer as follows:
• Keep an exponentially decaying running mean of the mean and variance of each activation,
and use these averages to normalize the data at test time.
• For each activation map in all layers, at each step of training we update the running averages
of the mean and variance using an exponential decay based on the momentum parameter. So,
at the end of training, there is a 4-tuple of (running mean, running variance, γ, β) for each
activation map of the CNN, where γ and β are the learned scale and shift parameters.

running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

BN at test time
• A running mean is an average that is continually updated as each mini-batch
enters the training algorithm; it is computed incrementally, one mini-batch at a
time, rather than from all of the data at once.
• Momentum is a parameter that is usually set to 0.9

Example code for BN
• An example implementation of batch normalization is available on GitHub:
https://github.com/Erlemar/cs231n_self/blob/master/assignment2/cs231n/layers.py#L116
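For reference, here is a minimal NumPy sketch in the same spirit (the function name, the bn_param dictionary keys, and the eps/momentum defaults are assumptions, not taken from that repository):

import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    """Minimal batch-norm sketch. x: (N, D); gamma, beta: (D,).
    bn_param: dict with 'mode' ('train' or 'test'), 'eps', 'momentum',
    'running_mean', 'running_var'."""
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    running_mean = bn_param['running_mean']
    running_var = bn_param['running_var']

    if mode == 'train':
        sample_mean = x.mean(axis=0)
        sample_var = x.var(axis=0)
        x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
        # Exponentially decaying running statistics, used later at test time
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    else:  # 'test': normalize with the stored running statistics
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)

    out = gamma * x_hat + beta                  # scale and shift
    bn_param['running_mean'], bn_param['running_var'] = running_mean, running_var
    return out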

Batch Normalization
• In the batch setting, where each training step is based on the entire training
set, we would use the whole set to normalize the activations. However, this is
impractical when using stochastic optimization
• Therefore, since we use mini-batches in stochastic gradient training,
each mini-batch produces estimates of the mean and variance of each
activation
• This way, the statistics used for normalization can fully participate in
gradient backpropagation

12- Transfer Learning
Transfer Learning with CNNs

Transfer Learning with CNNs

(Figure: a pretrained CNN; earlier layers give a more generic representation, later
layers a more specific one. A fine-tuning sketch follows below.)
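A minimal PyTorch sketch of this idea, assuming a recent torchvision and a small target dataset (the ResNet-18 backbone and the 10 target classes are assumptions):

import torch.nn as nn
import torchvision

# Reuse a pretrained CNN as a generic feature extractor and retrain only the
# most task-specific layer at the top.
model = torchvision.models.resnet18(weights='IMAGENET1K_V1')   # pretrained on ImageNet

for param in model.parameters():        # freeze the generic early layers
    param.requires_grad = False

num_classes = 10                        # assumed size of the new label set
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable head

# Only model.fc.parameters() would be passed to the optimizer; with more target
# data one could also unfreeze and fine-tune the later convolutional blocks.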

Transfer learning with CNN
• Image Captioning: CNN + RNN

(Figure: a pre-trained CNN provides the image feature vector; an RNN generates the caption
using word feature vectors pre-trained with word2vec.)
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015

Visual Question Answering (VQA)
• Zhou et al, “Unified Vision-Language Pre-Training for Image Captioning and VQA” CVPR 2020

1. Train CNN on ImageNet


2. Fine-Tune for object
detection on Visual Genome
dataset (Connecting language
and vision dataset)
3. Train BERT language model
on lots of text
4. Combine (2) and (3), train for
joint image/language
modeling
5. Fine-tune (4) for image
captioning, visual question
answering, etc.
13- CNN for text classification
CNN for text classification
• Word embedding (such as word2vec) is an algorithm that accepts a text
corpus as input and outputs a vector representation for each
word
• Example: a 50-dimensional real vector represents the word "tree" (see the sketch below)
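A small sketch of such a lookup with PyTorch's nn.Embedding (the toy vocabulary and the randomly initialized vectors are assumptions; in practice the 50-dimensional vectors would come from word2vec):

import torch
import torch.nn as nn

vocab = {'quick': 0, 'brown': 1, 'fox': 2, 'tree': 3}      # hypothetical vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=50)

tree_vec = embedding(torch.tensor([vocab['tree']]))        # shape (1, 50)
print(tree_vec.shape)   # a 50-dimensional real vector representing "tree"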

A sentence representation
(Figure: a text of 7 words, "quick brown fox jumps over lazy dog", represented as a 7×50 matrix, one 50-dimensional word vector per row, i.e. a 50×7×1 input volume.)
Sentence classification
(Figure: a 7-word text as a 50×7×1 input; 16 filters spanning two-word windows (50×2×1) and 16 filters spanning three-word windows are convolved with the text, followed by ReLU, max pooling over time, concatenation of the pooled features, a fully connected layer, and a softmax. A sketch of this pipeline follows below.)
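A minimal PyTorch sketch of this pipeline (the class name, the 2- and 3-word filter widths, 16 filters each, and the 50-dimensional embedding are assumptions based on the figure):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Embed words, convolve over 2- and 3-word windows, max-pool over time,
    concatenate, classify."""
    def __init__(self, vocab_size, embed_dim=50, num_filters=16, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Each filter spans the full embedding dimension and 2 or 3 words
        self.conv2 = nn.Conv2d(1, num_filters, kernel_size=(2, embed_dim))
        self.conv3 = nn.Conv2d(1, num_filters, kernel_size=(3, embed_dim))
        self.fc = nn.Linear(2 * num_filters, num_classes)

    def forward(self, tokens):                      # tokens: (N, T) word indices
        x = self.embed(tokens).unsqueeze(1)         # (N, 1, T, embed_dim)
        f2 = F.relu(self.conv2(x)).squeeze(3)       # (N, num_filters, T-1)
        f3 = F.relu(self.conv3(x)).squeeze(3)       # (N, num_filters, T-2)
        p2 = F.max_pool1d(f2, f2.size(2)).squeeze(2)    # max over time -> (N, num_filters)
        p3 = F.max_pool1d(f3, f3.size(2)).squeeze(2)
        feats = torch.cat([p2, p3], dim=1)          # concatenated feature vector
        return self.fc(feats)                       # softmax is applied inside the loss

# Usage: logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (4, 7)))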

14- Beyond Classification
Beyond Classification
• CNN for Semantic Segmentation, Object Detection

FCNN for Semantic Segmentation
• Fully Convolutional Neural Network
• Design CNN as a bunch of convolutional layers, with downsampling
and upsampling inside

(Figure: downsampling with convolution, upsampling with transposed convolution; a minimal sketch follows below.)
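A minimal sketch of such a fully convolutional design in PyTorch (the channel counts, kernel sizes and number of classes are assumptions):

import torch.nn as nn

num_classes = 21
fcn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # downsample to H/2 x W/2
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # downsample to H/4 x W/4
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),          # up to H/2 x W/2
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, num_classes, kernel_size=4, stride=2, padding=1),  # up to H x W
)
# Output: (N, num_classes, H, W) per-pixel class scores.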

Noh et al, “Learning Deconvolution Network for Semantic Segmentation”,
ICCV 2015

Transposed convolution network

• On top of a CNN based on the 16-layer VGG network, a multilayer deconvolution network generates
the segmentation map of an input image.
• Given a feature representation obtained from the convolution network, the class prediction map is
constructed through multiple series of unpooling, deconvolution and rectification operations.
Transposed convolution
• Example: wᵀ ∗ x, a filter w transposed-convolved with an input x

(Figure: transposed convolution with a stride of 1; a numeric sketch follows below.)
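A tiny illustration with PyTorch's conv_transpose1d (the 1-D filter and input values are made up):

import torch
import torch.nn.functional as F

# Transposed convolution spreads each input value over the filter footprint
# and sums the overlaps, producing a longer output.
x = torch.tensor([[[1., 2., 3., 4.]]])     # (N=1, C=1, L=4)
w = torch.tensor([[[1., 0.5, 0.25]]])      # (C_in=1, C_out=1, K=3)
y = F.conv_transpose1d(x, w, stride=1)
print(y.shape)    # output length = 4 + 3 - 1 = 6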

In-Network upsampling: “Unpooling”
• Max Pooling/Unpooling (a short sketch follows below)
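A short sketch of this pairing in PyTorch (the 4×4 input is arbitrary): max pooling remembers where each maximum came from, and max unpooling places the pooled values back at those positions, filling the rest with zeros.

import torch
import torch.nn.functional as F

x = torch.arange(16.).reshape(1, 1, 4, 4)
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
restored = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(pooled.shape, restored.shape)    # (1, 1, 2, 2) -> (1, 1, 4, 4)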

R-CNN for Multiple Object Detection
• Region-based Convolutional Neural Networks

Region of Interest

R-CNN open sources
• TensorFlow Detection API:
• https://github.com/tensorflow/models/tree/master/research/
object_detection
• Faster RCNN, SSD, RFCN, Mask R-CNN, ...

• Detectron2 (PyTorch)
• https://github.com/facebookresearch/detectron2
• Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...

CNN and Image restoration tasks
• CNNs have a limited receptive field and limited adaptability to the input content
• Their computational complexity grows quadratically with spatial
resolution, therefore making them infeasible to apply to most image
restoration tasks involving high-resolution images

15- Visualizing CNN features
Maximally Activating Patches
• Pick a layer and a channel, run many images through the network, record values
of chosen channel, visualize image patches that correspond to maximal
activations
(Figure: image patches that maximally activate a given channel, shown next to the input
images they come from. Visualization of patterns learned by layer conv9 of a CNN
trained on ImageNet; each row corresponds to one channel.)
Which pixels matter
• Mask part of the image before feeding to CNN, check how much
predicted probabilities change.

(Figure: the trained CNN assigns softmax probability 0.95 to the original image and 0.45 once part of it is masked.)

Which pixels matter
• Slide the mask over the image and check how much the predicted
probability changes at each position (a sketch of this occlusion experiment follows below)
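A hedged sketch of this occlusion experiment (the patch size, stride and gray fill value are assumptions; model and image are assumed to exist):

import torch

def occlusion_map(model, image, label, patch=16, stride=8):
    """Slide a gray square over the image and record the softmax probability
    of the true class at each position. image: (3, H, W)."""
    model.eval()
    _, H, W = image.shape
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                masked = image.clone()
                masked[:, y:y + patch, x:x + patch] = 0.5   # gray occluder
                prob = torch.softmax(model(masked.unsqueeze(0)), dim=1)[0, label]
                heat[i, j] = prob
    return heat   # low values mark the pixels that matter most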

Gradient ascent
• Gradient ascent: generate a synthetic image that maximally activates a chosen
unit (e.g. a class score or an element of an activation map)
1. Initialize the image to zeros
Repeat:
2. Forward the image to compute the current scores
3. Backprop to get the gradient of the unit's value with respect to the image pixels
4. Make a small update to the image (a sketch of this loop follows below)
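A hedged PyTorch sketch of this loop, with a simple L2 penalty on the image as the regularizer (the image size, step size, number of steps and regularization strength are assumptions):

import torch

def gradient_ascent_image(model, target_class, steps=200, lr=1.0, l2_reg=1e-3):
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)   # 1. initialize to zeros
    for _ in range(steps):
        score = model(img)[0, target_class]                 # 2. forward pass: current score
        obj = score - l2_reg * img.norm() ** 2              # add a simple image prior
        model.zero_grad()
        if img.grad is not None:
            img.grad.zero_()
        obj.backward()                                      # 3. gradient w.r.t. image pixels
        with torch.no_grad():
            img += lr * img.grad                            # 4. small ascent step
    return img.detach()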

Visualizing CNN features: Gradient Ascent
(Figure: images generated by gradient ascent.)

Feature Inversion
• Given a CNN feature vector for an image, find a new image that:
• Matches the given feature vector
• “Looks natural” (image prior regularization)

Reconstructing from different layers of VGG-16


16- Failures of CNN
Main failure of CNNs
• The main failure of CNNs is that they do not carry any information about the
relative spatial relationships between features
• This leads to false positives for images which have the
components of a face but not in the correct arrangement
• This is simply a flaw in the core design of CNNs, since they are based on the
basic convolution operation applied to scalar values
• A CNN uses a single scalar output to summarize the activity of a local pool
of replicated feature detectors

A case
• Neurons in the final layers of a trained CNN detect, or are "activated" by,
certain features in the input image
Example:
• Train a CNN for face detection
• Some channels in a layer might be triggered by eyes, while others may be
triggered by a mouth
• If all of the components that make up a face (eyes, ears, nose, and mouth), or at least a
certain number of them, are present, then our CNN will tell us that it has detected
a face
• But that is unfortunately where the reach of CNNs ends

A case
• Let's have a look at an example
• The image of a face on the left has all of the components of a face, so a CNN handles this case
just fine
• The tricky part is that, for a CNN, the image on the right is also a face: it has all the features of a
face, just not in the correct arrangement. When a trained CNN is applied, all of the feature detectors will be activated

A case
(Figure: the individual part detectors fire with scores 0.7, 0.9, 0.9, 0.7 and 0.9, and the CNN reports a probability of 0.95 of the image being a face.)

Why CNN fails
• In a CNN, higher-level features combine lower-level features as a
weighted sum: the activations of a preceding layer are multiplied by the
following layer's weights (w) and added together with a bias (b), before being passed to a non-
linearity
• Nowhere in this information flow are the relationships between features
taken into account
• For the i-th 5×5 patch of a 3-channel input:

out_i = Σ_{k=1}^{3} Σ_{j=1}^{5×5} w_{j,k} · in(patch_i)_{j,k} + b
Adversarial robustness
• Vulnerability of neural networks to adversarial examples:
• The input image is slightly changed by an attacker to trick a neural net classifier into making a
wrong classification
• These inputs can be created in a variety of ways; a straightforward strategy is the fast
gradient sign method (FGSM) of
• Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing
adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
• Such attacks have been shown to drastically decrease the accuracy of convolutional neural networks
on image classification tasks

(Figure: "panda" (57.7% confidence) + ε · perturbation → "gibbon" (99.3% confidence).)
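A hedged sketch of FGSM in PyTorch (epsilon and the [0, 1] pixel range are assumptions; model and image are assumed to exist):

import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.007):
    """Perturb the image in the direction of the sign of the loss gradient."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image.unsqueeze(0)), torch.tensor([label]))
    loss.backward()
    adv = image + epsilon * image.grad.sign()    # small, nearly invisible change
    return adv.clamp(0, 1).detach()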

Fooling Images
1. Start from an arbitrary image
2. Pick an arbitrary class
3. Modify the image to maximize that class's score
4. Repeat until the network is fooled

17- Deep learning hardware and
software
Deep learning hardware
• CPU: Fewer cores, but each core is much faster and much more
capable; great at sequential tasks.
• GPU: More cores, but each core is much slower and “dumber”; great
for parallel tasks.

Example: matrix multiplication
• The product of an A×B and a B×C matrix consists of A×C independent vector inner products, which can be computed in parallel (see the timing sketch below)
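A hedged timing sketch with PyTorch (the matrix sizes are arbitrary; actual speedups depend on the hardware):

import time
import torch

A, B, C = 4096, 4096, 4096
x_cpu, y_cpu = torch.randn(A, B), torch.randn(B, C)

t0 = time.time()
z_cpu = x_cpu @ y_cpu                     # A x C inner products on a few CPU cores
print('CPU:', time.time() - t0, 's')

if torch.cuda.is_available():
    x_gpu, y_gpu = x_cpu.cuda(), y_cpu.cuda()
    torch.cuda.synchronize()
    t0 = time.time()
    z_gpu = x_gpu @ y_gpu                 # the same inner products, largely in parallel
    torch.cuda.synchronize()              # wait for the asynchronous kernel to finish
    print('GPU:', time.time() - t0, 's')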

CPU vs GPU in practice

Programming GPUs

Deep Learning Software
• PyTorch (Facebook), version 1.4 (January 2020)
• TensorFlow (Google), Version 2.1 (March 2020)

• Quick to develop and test new ideas


• Automatically compute gradients
• Run it all efficiently on GPU (wrap cuDNN, cuBLAS, OpenCL, etc)

PyTorch
• Tensor: Like a numpy array, but can run on GPU
• Autograd: Package for building computational graphs out of Tensors,
and automatically computing gradients
• Module: A neural network layer; may store state or learnable weights (a minimal sketch combining all three follows below)
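A minimal sketch that touches all three pieces (the layer sizes are arbitrary):

import torch
import torch.nn as nn

# Tensor: like a numpy array, optionally on the GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(64, 100, device=device)
y = torch.randn(64, 10, device=device)

# Module: layers that store learnable weights
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10)).to(device)

# Autograd: the forward pass builds a graph; backward() computes gradients
loss = ((model(x) - y) ** 2).mean()
loss.backward()
print(model[0].weight.grad.shape)   # gradient of the loss w.r.t. the first layer's weights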

