
Pattern Recognition (Pattern Classification)

Convolutional Neural Networks (CNNs or ConvNets) for Visual Recognition
Hypothesis Set and Algorithm

First Edition
Acknowledgment
• This chapter is adapted from the lecture notes of “CS231n: Convolutional
Neural Networks for Visual Recognition”, Spring 2022
• http://cs231n.stanford.edu/

• https://github.com/cs231n/cs231n.github.io

Contents
1. Convolutional Neural Network
2. Fully Connected (Multilayer Perceptron)
3. Convolution
4. Pooling
5. Training
6. CNN Optimization algorithm
7. Learning rate schedules
8. Backpropagation
9. Initialization
10. Architectures for Image recognition
11. Normalization
12. Transfer Learning
13. CNN for text classification
14. Beyond Classification
15. Visualizing CNN features
16. Failures of CNN
17. Deep learning hardware and software
1- Convolutional Neural Network
Recall from Chapter 1: Items, Data set, Feature vector, Label set
• Data set: collection of items (instances, examples) of data used for training,
validation, and evaluation (test)
• Sample set:
• Example: In email spam prediction (detection),
sample set S = collection of email messages
• Feature vector (attributes): an example is represented by a vector of features in

• : set of all possible items, and ; features can be either hand-crafted or learned

• Items in come from an unknown distribution ()

Recall from Chapter 1: Learning (label learning) - shallow learning

What the machine knows (data):
• Training set: Sample, Label
• Concept set:
• Target concept
• Probability distribution of examples (unknown):
• Teacher provides noise-free labels

What the machine learns (hypothesis):
• Hypothesis set: H
• Learning: select a hypothesis h ∈ H
• Learning algorithm: select a hypothesis h ∈ H that has a small loss

Deep learning: Learning “Features + Labels”
• Suppose a multi-label classification project in computer vision
• Deep learning results in: learning the features + learning the labels (classification)
• Learning the features: representation learning
[Figure: a 32×32 RGB input image (x ∈ ℝ^{32×32×3}) passes through feature (representation) learning and then label learning (shallow learning, e.g. SVM, AdaBoost, Perceptron) to produce a label prediction]
Convolutional Neural Networks (CNNs)
• Representation learning is done by Convolution
• Label learning is done by Perceptron (fully connected neural network)

[Figure: a 32×32×3 RGB input (x ∈ ℝ^{32×32×3}) → convolution layers (feature/representation learning) → perceptron, i.e. fully connected layers (label learning) → labeling scores]
Example
[Figure: an example CNN producing labeling scores for an input image]

Example: CIFAR 10
(https://www.cs.toronto.edu/~kriz/cifar.html)

• The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes (concepts), with 6,000 images per class.
• There are 50,000 training/validation images and 10,000 test images.
• Five training batches and one test batch, each with 10,000 images.
• Classes are completely mutually exclusive.
Example: CIFAR 100
(https://www.cs.toronto.edu/~kriz/cifar.html)

• Just like the CIFAR-10, except it has 100 concepts containing 600
images each.
• 500 training/validation images and 100 testing images per class.
• 100 classes in the CIFAR 100 are grouped into 20 super-classes. Each
image comes with a "fine" label (the class to which it belongs) and a
"coarse" label (the super-class to which it belongs).
• e.g. Super-class: vehicles (classes: bicycle, bus, motorcycle, pickup
truck, train)

ImageNet Large Scale Visual Recognition
Challenge (ILSVRC), and Kaggle
• ImageNet is a visual dataset that contains more than 15 million labeled high-resolution images covering almost 22,000 categories (concepts), such as "balloon" or "strawberry", each consisting of several hundred images.

• During 2010-2017, an annual contest, the “ImageNet Large Scale Visual Recognition Challenge (ILSVRC)”, was held on correctly classifying and detecting objects. The training set contains 1.3 million images, accompanied by 50,000 validation images and 100,000 test images.

• Conclusion of ILSVRC: the annual ImageNet competition was no longer held after 2017 and moved to Kaggle
• Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners
Kaggle.com/datasets

https://www.kaggle.com/competitions

CNN Models for Practical Applications, 2017
• Inception-v4: ResNet + Inception
• VGG-19: most parameters, most operations
• GoogLeNet: most efficient
• Top-k accuracy means that any of the k highest scores must match the real label

ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

Convolutional Neural Networks - VGG16
[Figure: the VGG16 pipeline; the convolutional stack performs representation learning and outputs a 7×7×512 = 25,088-dimensional feature volume, followed by fully connected layers and a softmax classifier for label learning]
VGG: Visual Geometry Group at Oxford, 2014


h ∈ H – VGG16, 2014
[Figure: the VGG16 computation on an input x_i as a stack of 13 CONV+ReLU layers (with max pooling, mp) followed by 3 FC layers; the first stages compute w_1 ∗ x_i and then w_2 ∗ ReLU(w_1 ∗ x_i), and so on]
– VGG16, 2014
• Softmax is a loss function
• 138 million parameters in total
• 102.76 million parameters (first FC layer)
• 16.78 million parameters (second FC layer)
• 4.096 million parameters (third FC layer)
• Label learner: 123.63 million parameters
• Size of feature vector: 25,088
[Figure: the VGG16 stack of 13 CONV+ReLU layers followed by 3 FC layers applied to an input x]

– ResNet, 2015
• Total layers: 18, 34, 50, 101, or 152
• Batch Normalization after every CONV layer
Overview of CNN architectures

Deep learning frameworks and libraries

TensorFlow is deemed the most effective and easy to use

CNN (CNN architecture)
• Functions in H consist of a series of operations on the input
• The main operations are:
• Convolution
• Non-linear transformation
• Pooling, Normalization, Dropout, …
• Fully connected (Multilayer Perceptron)

h(x) = sequence of Convolution, Nonlinear, Pooling, Normalization, Dropout, …, Perceptron

CNN learning problem
• We discussed the following optimization problem based on the empirical risk of h and a regularization term:

$\underset{h \in H}{\mathrm{argmin}} \left( \hat{R}_S(h) + \lambda \mathcal{R}(h) \right) = \underset{h \in H}{\mathrm{argmin}}\, L(W)$

λ > 0 is the regularization parameter (treated as a hyperparameter)
• Here is the CNN learning problem:

$L(W) = \frac{1}{m} \sum_{i=1}^{m} L_i\big(h(x_i, W), y_i\big) + \lambda\, \mathcal{R}(W), \qquad W = \{w, b\}$

The first term is the training error (empirical risk: model predictions should match the training data); the second term is the regularization loss with regularization strength λ (a hyperparameter). Regularization prevents the model from doing too well on the training data (it keeps the model simple so it works better on test data).
Regularization (the regularizer sums over all layers of the CNN)

• In common use (see the sketch below):
1. L2 regularization: ℛ(W) = Σ_k w_k²
2. L1 regularization: ℛ(W) = Σ_k |w_k|
3. Elastic net (L1 + L2): ℛ(W) = Σ_k (β w_k² + |w_k|)
4. Dropout
5. Batch normalization (mini-batch normalization does not do regularization)
6. Stochastic depth
7. …
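
A minimal NumPy sketch (ours, not from the slides; the function name and the value of reg are illustrative) of adding an L2 penalty to the average data loss:

    import numpy as np

    # L(W) = (1/m) * sum_i L_i + lambda * sum(W^2), with lambda as a hyperparameter
    def regularized_loss(data_losses, W, reg=1e-4):
        data_loss = np.mean(data_losses)      # empirical risk term
        reg_loss = reg * np.sum(W * W)        # L2 regularization term
        return data_loss + reg_loss

    W = 0.01 * np.random.randn(10, 3073)      # toy weight matrix
    print(regularized_loss(np.array([2.9, 0.0, 12.9]), W))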
CNN is a multi-class, mono-label, score-based classifier
• In the multi-concept setting, a hypothesis is defined based on a scoring function
• The label associated to a test example (image) is the one resulting in the largest score, which defines the mapping from examples to labels
[Figure: three test images x1, x2, x3 with scores for cat, car, frog; e.g. Score(x1, car) = h(x1, car); predicted label for all three: car]

Hinge (SVM) margin loss function in CNN
• CNN incorporates either hinge margin loss function or Log loss
function during training
• Hinge margin loss function in CNN:
• Empirical margin loss: given a sample S and a hypothesis h, the empirical margin loss is defined by

$\hat{R}_{S,\rho}(h) = \frac{1}{m}\sum_{i=1}^{m} L_i$

$L_i = \max\big(0,\ \Phi_{\rho=1}(\rho_h(x_i, y_i))\big), \qquad \Phi_{\rho=1}(\rho_h(x_i, y_i)) = 1 - \rho_h(x_i, y_i)$

$\rho_h(x, y) = h(x, y) - \max_{y' \ne y} h(x, y')$
Example: margin loss, m = 3 examples

Examples: x1, x2, x3
Concepts: k = 1 (cat), k = 2 (car), k = 3 (frog)
s_{1,1} = s(x1, 1) = h(x1, cat)
s_{2,3} = s(x2, 3) = h(x2, frog)
[Table: the score vector of each example under the three concepts]

Hinge (SVM) margin loss function, $\Phi_{\rho=1}(\rho_h(x, y))$
[Plot: $\Phi_{\rho=1}$ as a function of $\rho_h(x,y)$; the loss equals $1 - \rho_h(x,y)$ for $\rho_h(x,y) < 1$ and 0 otherwise, with ρ = 1]
Losses for x1, x2, x3: 2.9, 0, 12.9
$\rho_h(x, y) = h(x, y) - \max_{y' \ne y} h(x, y')$

Hinge loss function in CNN
[Plot: the hinge loss $\Phi_{\rho=1}(\rho_h(x,y))$ expressed in terms of the score $h(x,y)$; the loss equals $1 + h(x, y') - h(x, y)$ for $h(x,y) < 1 + h(x,y')$ and 0 otherwise, where $y' = \mathrm{argmax}_{y' \ne y}\, h(x, y')$]

$\mathrm{Score} = h(x, y) = \rho_h(x, y) + \max_{y' \ne y} h(x, y')$

Empirical margin loss
Losses for x1, x2, x3: 2.9, 0, 12.9

$\hat{R}_{S,\rho}(h) = \frac{1}{m}\sum_{i=1}^{m} L_i = \frac{1}{m}\sum_{i=1}^{m} \Phi_\rho\big(\rho_h(x_i, y_i)\big) = \frac{2.9 + 0 + 12.9}{3} = 5.27$
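
A minimal NumPy sketch of the per-example hinge loss above (ours; the score vector below is illustrative and chosen so that the loss reproduces the 2.9 of example x1):

    import numpy as np

    # L_i = sum_{j != y} max(0, s_j + 1 - s_y)
    def hinge_loss(scores, y):
        margins = np.maximum(0, scores - scores[y] + 1.0)
        margins[y] = 0.0                    # do not count the correct class
        return margins.sum()

    scores = np.array([3.2, 5.1, -1.7])     # illustrative scores for cat, car, frog
    print(hinge_loss(scores, y=0))          # 2.9 when the true label is cat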

“Softmax function” + “Log loss function” in
CNN
• We interpret the scores as probabilities
• For image x_i the classifier gives scores s_{i,k}, and each score is interpreted as a probability given by the softmax function:

likelihood of correctness: $P(y_i = k \mid x_i) = \dfrac{e^{s_{i,k}}}{\sum_{j=1}^{K} e^{s_{i,j}}}$

• The log loss for x_i is: $L_i = \log 1 - \log P(y_i = k \mid x_i) = -\log \dfrac{e^{s_{i,k}}}{\sum_{j=1}^{K} e^{s_{i,j}}}$

• It is the “negative log likelihood of correctness”

• Empirical loss: $\hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} L_i$
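
A minimal NumPy sketch of the softmax + log loss for one example (ours; the max-shift is only for numerical stability, and the illustrative scores reproduce the 2.04 of the later worked example):

    import numpy as np

    def softmax_log_loss(scores, y):
        shifted = scores - np.max(scores)
        probs = np.exp(shifted) / np.sum(np.exp(shifted))   # P(y_i = k | x_i)
        return -np.log(probs[y])                            # negative log likelihood

    scores = np.array([3.2, 5.1, -1.7])
    print(softmax_log_loss(scores, y=0))                    # ~2.04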

Log loss (logistic loss), cross-entropy
• Cross-entropy is related to, and often confused with, logistic loss, called log loss:
$L_i = -\log P(y_i = k \mid x_i)$
• Both measures calculate the same quantity and can be used interchangeably for classification training
[Plot: $L_i$ as a function of $P(y_i = k \mid x_i)$]

CNN algorithm using Log loss function
• The algorithm is “Maximum Likelihood Estimation”
• It chooses h to maximize the likelihood of observing the training set, by minimizing the empirical risk in a regularization-based framework

$\hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} L_i, \qquad L_i = \log 1 - \log P(y_i = k \mid x_i)$

$\underset{h \in H}{\mathrm{argmin}}\left( \hat{R}_S(h) + \lambda \mathcal{R}(h) \right) = \underset{h \in H}{\mathrm{argmin}}\, L(W)$

Multiclass Classifier using softmax
For example x1: $L_1 = \log 1.00 - \log 0.13 = -\log 0.13 = 2.04$

Cross-Entropy vs Hinge for $S = h(x_i, y_i = 3;\, W, b)$
[Example: for the score vector S, the two losses are $L_i = 1.58$ and $L_i = 0.452$]

Cross-Entropy vs Hinge

Cross-entropy: $L_i = -\log \dfrac{e^{s_{i,i}}}{\sum_{j=1}^{K} e^{s_{i,j}}}$

Hinge: $L_i = \sum_{j \ne i}^{K} \max\big(0,\ s_{i,j} + 1 - s_{i,i}\big)$

2- Fully Connected (Multilayer Perceptron)
A label learner

Label learning stage
• VGG16: the last maxpool layer outputs a feature vector of dimension 25,088 (7×7×512) for each input image
[Figure: the 25,088-dimensional input feature vector passes through fully connected layers with weights w14, w15, w16 to produce label prediction probabilities]

Label learner: Fully Connected (multilayer perceptron)
VGG16
• “Fully-connected” layers
[Figure: 25088 → 4096 → 4096 → 1000 (S: score); counting conventions: “2-layer Neural Net” or “1-hidden-layer Neural Net”; “3-layer Neural Net” or “2-hidden-layer Neural Net”]

Fully connected layer
• No hidden layer (1 layer)
[Figure: input layer x (25,088 units) fully connected to the output activation layer (1,000 units) through weights w and bias b]

$h(x, w) = w^T x + b \equiv w^{1000 \times 25088}\, x^{25088 \times 1} + b^{1000 \times 1}$

Fully connected layer
• One hidden layer (2-layer)

• Hypothesis: an affine map (w14), a nonlinearity Φ, and a second affine map (w15)
• Score of x: S
[Figure: input layer x → weights w14 → hidden layer with nonlinearity Φ → weights w15 → score S]
Activation functions (non-linear transforms)

ReLU is a good default choice for most problems.


Fully connected layer using ReLU
• One hidden layer (2-layer)

• Hypothesis: an affine map (w14), a ReLU nonlinearity, and a second affine map (w15); see the sketch below
• ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
• Score of x: S
[Figure: input layer x → weights w14 → hidden layer with ReLU → weights w15 → score S]
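
A minimal NumPy sketch of such a 2-layer fully connected scorer (ours; the weight names follow the figure, the biases and the small demo sizes are assumptions, while VGG16 would use 25088 → 4096 → 1000):

    import numpy as np

    def two_layer_scores(x, w14, b14, w15, b15):
        hidden = np.maximum(0, w14 @ x + b14)   # ReLU hidden layer
        return w15 @ hidden + b15               # class scores S

    x = np.random.randn(512)
    w14, b14 = 0.01 * np.random.randn(64, 512), np.zeros(64)
    w15, b15 = 0.01 * np.random.randn(10, 64), np.zeros(10)
    print(two_layer_scores(x, w14, b14, w15, b15).shape)   # (10,)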
Setting the number of layers and their sizes
• 1 hidden layer and 3, 6, and 20 neurons

• More neurons → more complex hypothesis set

Regularization

$L(W) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W), y_i\big) + \lambda\, \mathcal{R}(W)$
Larger λ: stronger regularization, simpler hypothesis
3- Convolution
Convolutional Neural Networks - VGG16 [Simonyan and Zisserman, 2014]
[Figure: the convolutional stack (representation learning) produces a 7×7×512 feature volume, followed by fully connected layers and a Softmax + log-loss classifier (label learning)]
VGG: Visual Geometry Group at Oxford, 2014


VGG16
• INPUT: [224x224x3]
• CONV3x3-64: [224x224x64]
• CONV3x3-64: [224x224x64]
• POOL2: [112x112x64]
• CONV3x3-128: [112x112x128]
• CONV3x3-128: [112x112x128]
• POOL2: [56x56x128]
• CONV3x3-256: [56x56x256]
• CONV3x3-256: [56x56x256]
• CONV3x3-256: [56x56x256]
• POOL2: [28x28x256]
• CONV3x3-512: [28x28x512]
• CONV3x3-512: [28x28x512]
• CONV3x3-512: [28x28x512]
• POOL2: [14x14x512]
• CONV3x3-512: [14x14x512]  (high-level features)
• CONV3x3-512: [14x14x512]
• CONV3x3-512: [14x14x512]
• POOL2: [7x7x512]
• FC: [1x1x4096]
• FC: [1x1x4096]
• FC: [1x1x1000]
Convolutional Neural Networks - AlexNet, 2012

Input: RGB image
• Resolution: 3 color channels and height × width
[Figure: an RGB input image split into its three channels]


Convolution
• Product of a patch of the image and a shifted filter
• We call the layer convolutional because it is related to the convolution of two functions x and w:

$x[a,b] * w[a,b] = \sum_{n_1=1}^{7} \sum_{n_2=1}^{7} x[n_1, n_2]\cdot w[a-n_1,\, b-n_2]$

[Figure: the image pixels x[a,b], the filter w[a,b], and the filtered image x[a,b] ∗ w[a,b] (stride = 1); a pixel of the filtered image is the product of a patch of the source image with the shifted filter w[a−2, b−2]]
Convolution
• Convolving 1 filter of size 5×5×3 with a 32×32×3 image results in a filtered image (called an activation map)
• The filter extends through all 3 channels of the input image
[Figure: a 32×32×3 image x and a 5×5×3 filter w; a pixel of the filtered image is computed as $w^T x + b$, a 5×5×3 = 75-dimensional dot product plus bias]
Convolution
• We call the layer convolutional because it is related to the convolution of two functions, x and w
[Figure: a filter slides over all spatial locations of the image]


Convolution → activation map (filtered image)
• One filter → one activation map
[Figure: an input of size H = 32, W = 32, C = 3; patch_i of the input maps to pixel out_i of the activation map]

$out_i = \sum_{k=1}^{3} \sum_{j=1}^{5\times 5} w_{j,k} \cdot in(patch_i)_{j,k} + b$
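
A naive NumPy sketch of the out_i computation above, sliding one 5×5×3 filter over a 32×32×3 input with stride 1 and no padding (ours; variable names are illustrative):

    import numpy as np

    def conv_single_filter(x, w, b):
        H, W, C = x.shape                    # 32, 32, 3
        F = w.shape[0]                       # 5
        out = np.zeros((H - F + 1, W - F + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i+F, j:j+F, :]   # a 5x5x3 patch of the input
                out[i, j] = np.sum(patch * w) + b
        return out

    x = np.random.randn(32, 32, 3)
    w, b = np.random.randn(5, 5, 3), 0.0
    print(conv_single_filter(x, w, b).shape)   # (28, 28) activation map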
Two filters → 2 activation maps
• Consider a second, green filter

Six filters → 6 maps
• For example, if we had six 5x5x3 filters, we’ll get 6 separate activation
maps:

• We stack these up to get a “new image” of size 28x28x6


Example
[Figure: a 74×74×3 image x convolved with 32 filters (Filter 1, …, Filter 15, …, Filter 32); each convolution x ∗ w_k produces a 70×70 activation map, so the output volume is 70×70×32]

A closer look at spatial dimensions
• 7×7 input (spatially) [Figure: the filter stepping across the spatial positions of the input]

Convolution Networks
• A ConvNet is a sequence of convolution layers, interspersed with activation functions (non-linear functions)
• A 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn’t work well
[Figure: an input image passing through successive convolution layers]
Zero pad border
• Put a zero pad on the border of the input
[Figure: a 7×7 input (N = 7) zero-padded with a 1-pixel border (2 padded pixels per dimension) to N′ = 9]

N′ = 9, F = 3, output size = (9−3)/1 + 1 = 7×7

Output size = input size when Pad = (F−1)/2 per side

Output size = (N + 2·Pad − F)/stride + 1
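
A small Python sketch of the output-size and padding formulas above (the function name is ours):

    def conv_output_size(N, F, stride=1, pad=0):
        assert (N + 2 * pad - F) % stride == 0, "filter does not fit cleanly"
        return (N + 2 * pad - F) // stride + 1

    print(conv_output_size(7, 3, stride=1, pad=1))   # 7: output size = input size
    print(conv_output_size(32, 5, stride=1, pad=0))  # 28, as in the earlier examples
    print((3 - 1) // 2)                              # Pad = (F-1)/2 = 1 for a 3x3 filter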

Number of parameters
• Number of parameters in a convolution layer?
Convolution parameters:
First conv layer (6 filters of size 5×5×3): each filter has 5*5*3+1 = 76 parameters (+1 for bias); total = 76*6 = 456
Second conv layer (10 filters of size 5×5×6): each filter has 5*5*6+1 = 151 parameters (+1 for bias); total = 151*10 = 1510
• The parameter count depends on the filter size, the number of input channels, and the number of output channels
[Figure: the input image passing through the 6-filter and then the 10-filter conv layer]

1 × 1 × m filter
• 1x1 convolutions are named bottleneck convolutions
• 1×1 convolution layers (filters) make perfect sense
[Figure: a 1×1 filter spanning all 64 input channels]

1 × 1 × m filter
• Preserving spatial information, reducing depth

What is effective receptive field of three
3x3 conv (stride 1) layers?
• Three 3x3 conv (stride 1) layers have the same effective receptive field as one 7x7 conv layer, but are
• deeper, with more non-linearities
• fewer parameters: 3·(3²C²) vs 7²C² for C = C_in = C_out channels per layer
[Figure: stacking three 3x3 convolutions over the input covers a 7x7 receptive field]

4- Pooling
Pooling
• Make a pool of pixels, and route one of them to the next layer
• Pooling makes representations smaller and more manageable
• Operates over each activation map independently (downsampling)
• We lose some valuable information
• Pooling layers reduce spatial resolution, so their outputs are invariant to small changes in inputs
MAX POOLING
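
A minimal NumPy sketch of 2x2 max pooling with stride 2 over one activation map (ours; the input matrix is illustrative):

    import numpy as np

    def max_pool_2x2(a):
        H, W = a.shape
        return a[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    a = np.array([[1, 1, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)
    print(max_pool_2x2(a))   # [[6., 8.], [3., 4.]]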

Fully connected (multilayer perceptron)
[Figure: the 7×7×512 = 25,088 feature volume is flattened to 1x1x25088 and multiplied by a 4096×25088 weight matrix; ReLU(x ∗ w) gives a 1x1x4096 output. A neuron is the result of taking a dot product between a row of w and the 1x1x25088 input; each neuron looks at the full input]

Examples
A classifier

Some applications
• Image recognition, speech recognition, text recognition

A non-application
• If the data is just as useful after swapping any of its columns with each other, then convolution does not work
• Convolution captures local “spatial” patterns in data

A simple example – Handwritten
recognition
• Input: character X

• 3 filters of size 3×3

A simple example – Handwritten recognition, 1 channel
• Convolution

A simple example – Handwritten
recognition
• Convolution

A simple example – Handwritten
recognition

[Figure: the input convolved with each of the three filters]
A simple example – Handwritten
recognition
• Activation maps

A simple example – Handwritten
recognition
• Max Pooling

A simple example – Handwritten
recognition
• Non-linear transformation

A simple example – Handwritten
recognition

A simple example – Handwritten
recognition

A simple example – Handwritten
recognition
• Fully Connected

Feature Values for X Feature Values for O

A simple example – Handwritten
recognition
• 2-hidden layer perceptron

A simple example – Handwritten
recognition
• A full view
• Output scores

[ConvNetJS demo: training on CIFAR-10]
• This demo trains a Convolutional Neural Network on the
CIFAR-10 dataset in your browser, with nothing but Javascript. The
state of the art on this dataset is about 90% accuracy and human
performance is at about 94% (not perfect as the dataset can be a bit
ambiguous).
• URL for the demo:
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.ht
ml

5- Training
Learning Problem
• Find the best W by solving the optimization problem:

$\underset{h \in H}{\mathrm{argmin}}\left( \hat{R}_S(h) + \lambda \mathcal{R}(h) \right) = \underset{h \in H}{\mathrm{argmin}}\, L(W)$

More Regularization: Dropout
• Dropout: in each forward pass, randomly set some nodes to zero
• The probability of dropping is a hyperparameter; 0.5 is common
• By dropping a node out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections
[Figure: a fully connected network in which 5 neurons are dropped out]


More Regularization: Dropout
• Q: How can this possibly be a good idea?
• A1: It forces the network to be less complex. It prevents co-adaptation of nodes (prevents different nodes from having highly correlated behavior)
• A2: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model

Dropout - test time
• At test time all neurons are always active
• We must scale the activations so that for each neuron:
output at test time = expected output at training time
• So, either
multiply the neuron activation by the dropout probability at test time,
or
divide each neuron activation by that probability during training and do nothing at test time (“inverted dropout”)
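
A minimal NumPy sketch of the second (“inverted dropout”) option (ours; note the assumption that p below denotes the keep probability, so the drop probability is 1 − p):

    import numpy as np

    p = 0.5   # keep probability (assumption for this sketch)

    def dropout_forward_train(a):
        mask = (np.random.rand(*a.shape) < p) / p   # drop and rescale during training
        return a * mask

    def dropout_forward_test(a):
        return a                                    # do nothing at test time

    a = np.random.randn(4096)
    print(dropout_forward_train(a).shape, dropout_forward_test(a).shape)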

More Regularization: Data Augmentation
• Generate more examples to increase the complexity of the target concept
• Transform the input image: flip horizontally, vertically, …
• Random crops and scaling of the complete image
• Randomize contrast and brightness (color jitter)
[Figure: an input image, its color-jittered and flipped versions, and random crops & scales, all fed to the CNN]

6- CNN Optimization algorithm
1- Stochastic Gradient Descent (SGD)
2- SGD + momentum
3- AdaGrad (Adaptive Gradient)
4- RMSProp
5- Adam (adaptive moment estimation)

Stochastic gradient descent - SGD
• The stochastic gradient descent (SGD) algorithm is used to solve the optimization problem.
• SGD is an optimization algorithm that estimates the loss gradient for the current state of the model using examples from the training dataset.
• It then updates the weights of the model. The examples that participate in an update can be referred to as support vectors, as in the Perceptron algorithm.
• The amount by which the weights are updated is referred to as the step size or the “learning rate.”
Stochastic Gradient Descent
• SGD is an algorithm that has a number of hyperparameters.
• Two integer hyperparameters are the batch size and number of epochs.
• Batch size is a hyperparameter that controls the number of training
samples to work through before the model’s internal parameters are
updated.
• The number of epochs is a hyperparameter of SGD that controls the
number of complete passes through the training dataset.
• Batch Gradient Descent. Batch Size = Size of Training Set
• Stochastic Gradient Descent. Batch Size = 1
• Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

Number of epochs
• Stop training the model when accuracy on the validation set
decreases
OR
• Train for a long time, but always keep track of the model snapshot
that worked best on validation set.

Weight updating in SGD
• The weights are updated as follows:
• In iteration t of training, a mini-batch of m′ training-set examples is fed into the network one by one and then L(W) is calculated. For the mini-batch, a size of m′ = 32/64/128 is common.

$L(W) = \frac{1}{m'} \sum_{i=1}^{m'} L_i\big(h(x_i, W), y_i\big) + \lambda\, R(W)$

• Then calculate $\nabla_W L$, and update W at iteration t:
$W_{t+1} = W_t - \eta\, \nabla_W L(W_t)$
• η is the learning rate

• Example:
Problems with weight updating in SGD - 1
• What if the loss changes quickly in one direction and slowly in another?
• What does gradient descent do?
• Answer: very slow progress along the shallow dimension, jitter along the steep direction
[Figure: loss contours over (w1, w2) with a zig-zagging SGD trajectory]
Problems with weight updating in SGD - 2
• What if the loss function has a local minimum or saddle point? Saddle points are much more common in high dimensions
[Plot: a loss curve over W with a local minimum and a saddle point]
SGD + Momentum
• Weight updating in SGD: $W_{t+1} = W_t - \eta\, \nabla L(W_t)$
• SGD + Momentum (velocity): $v_{t+1} = \rho\, v_t + \nabla L(W_t)$, $W_{t+1} = W_t - \eta\, v_{t+1}$
• Build up “velocity” as a running mean of gradients
• Typically ρ = 0.9 - 0.99; η is the learning rate
[Figure: loss contours over (w1, w2)]
SGD + Momentum
• Combine the gradient at the current point with the velocity to get the step used to update the weights; see the sketch below
• Note that the update can be written as:
Momentum update: $v_{t+1} = \rho\, v_t + \nabla L(W_t)$; weight update: $W_{t+1} = W_t - \eta\, v_{t+1}$
[Figure: at the current point, the velocity ρ·v_t and the gradient combine into the actual step v_{t+1}]
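
A minimal Python sketch comparing the two update rules on a toy quadratic loss L(w) = 0.5·wᵀAw (ours; all names and values are illustrative):

    import numpy as np

    A = np.diag([10.0, 1.0])                 # steep in w1, shallow in w2
    grad = lambda w: A @ w

    w_sgd = np.array([1.0, 1.0]); w_mom = np.array([1.0, 1.0])
    v = np.zeros(2); lr, rho = 0.05, 0.9
    for t in range(100):
        w_sgd = w_sgd - lr * grad(w_sgd)             # vanilla SGD
        v = rho * v + grad(w_mom)                    # momentum (velocity) update
        w_mom = w_mom - lr * v                       # weight update
    print(w_sgd, w_mom)                              # both approach the minimum at 0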

AdaGrad (Adaptive Subgradient Methods)
• Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension of W
• Weight updating: $W_{t+1} = W_t - \eta\, \dfrac{\nabla L(W_t)}{\sqrt{r_{t+1}} + \epsilon}$, with $r_{t+1} = r_t + \big(\nabla L(W_t)\big)^2$ (element-wise)
• r is the historical sum of gradient squares in each dimension of W
• “Per-parameter learning rates” or “adaptive learning rates”

AdaGrad (Adaptive Subgradient Methods)
• Python code (below), with the parameters W written as x and the gradient written as dx
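
A runnable sketch of the AdaGrad update in that notation (ours; the original slide's code image is not available, and the toy gradient of the loss ||x||² is an assumption made to keep the example self-contained):

    import numpy as np

    x = np.array([1.0, -2.0])
    learning_rate, eps = 0.5, 1e-7
    grad_squared = np.zeros_like(x)
    for t in range(200):
        dx = 2 * x                                        # gradient of the toy loss ||x||^2
        grad_squared += dx * dx                           # per-dimension history of squares
        x -= learning_rate * dx / (np.sqrt(grad_squared) + eps)
    print(x)                                              # moves toward [0, 0]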

Problems with AdaGrad
• 1. Progress along “steep” directions is damped; progress along “flat”
directions is accelerated.
• 2. Over a long run, the step size decays to zero

RMSProp: “Leaky AdaGrad”
• Weight updating: same as AdaGrad, but the accumulated sum of squares is replaced with an exponential moving average:
$r_{t+1} = \beta\, r_t + (1-\beta)\big(\nabla L(W_t)\big)^2$

Exponential Moving Average: β controls the decay rate
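
A sketch of the same toy problem with the “leaky” accumulator (ours; values are illustrative):

    import numpy as np

    x = np.array([1.0, -2.0])
    learning_rate, decay_rate, eps = 0.05, 0.99, 1e-7
    grad_squared = np.zeros_like(x)
    for t in range(500):
        dx = 2 * x                                            # toy gradient, as above
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= learning_rate * dx / (np.sqrt(grad_squared) + eps)
    print(x)   # moves toward the minimum at [0, 0]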

Adam (adaptive moment estimation)

• $M_{t+1} = \beta_1 M_t + (1-\beta_1)\, \nabla L(W_t)$ (first moment estimate using exponential moving averaging)
• $\hat{M}_{t+1} = M_{t+1} / (1 - \beta_1^{\,t+1})$ (first moment bias correction)
• $R_{t+1} = \beta_2 R_t + (1-\beta_2)\, \big(\nabla L(W_t)\big)^2$ (second moment estimate using exponential moving averaging)
• $\hat{R}_{t+1} = R_{t+1} / (1 - \beta_2^{\,t+1})$ (second moment bias correction)

• Return $W_{t+1} = W_t - \eta\, \dfrac{\hat{M}_{t+1}}{\sqrt{\hat{R}_{t+1}} + \epsilon}$
• η (learning rate), β1 (first moment decay rate, typically 0.9), β2 (second moment decay rate, typically 0.999), ε (numerical term, typically 10⁻⁷)
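
A minimal Python sketch of these update steps on the same toy loss (ours; learning rate and data are illustrative):

    import numpy as np

    x = np.array([1.0, -2.0])
    lr, beta1, beta2, eps = 1e-1, 0.9, 0.999, 1e-7
    M = np.zeros_like(x); R = np.zeros_like(x)
    for t in range(1, 501):
        dx = 2 * x                                   # gradient of the toy loss ||x||^2
        M = beta1 * M + (1 - beta1) * dx             # first moment (EMA of gradients)
        R = beta2 * R + (1 - beta2) * dx * dx        # second moment (EMA of squares)
        M_hat = M / (1 - beta1 ** t)                 # bias corrections
        R_hat = R / (1 - beta2 ** t)
        x -= lr * M_hat / (np.sqrt(R_hat) + eps)
    print(x)   # moves toward the minimum at [0, 0]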

Adam (adaptive moment estimation)
• Bias correction for the fact that first and second moment estimates
start at zero
• Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-
4 is a great starting point for many models!

Initialization Bias and Its Correction
• The moving averages are estimates of the 1st moment (the mean) and the 2nd
moment (the uncentered variance) of the gradient.
• However, these moving averages are initialized as (vectors of) 0’s, leading to moment
estimates that are biased towards zero, especially during the initial timesteps, and
especially when the decay rates are small (i.e. and are close to 1).
• Bias Correction for the mean:

$\mathbb{E}[M_{t+1}] = \mathbb{E}\big[\beta_1 M_t + (1-\beta_1)\,\nabla L(W_t)\big] = \mathbb{E}\Big[(1-\beta_1)\sum_{i=1}^{t+1} \beta_1^{\,t+1-i}\, \nabla L(W_i)\Big] = \mathbb{E}[\nabla L(W_t)]\times(1-\beta_1)\sum_{i=1}^{t+1}\beta_1^{\,t+1-i} + c$

$\mathbb{E}[M_{t+1}] = \mathbb{E}[\nabla L(W_t)]\,(1-\beta_1^{\,t+1}) + c \;\Longrightarrow\; \mathbb{E}[\nabla L(W_t)] \approx \frac{\mathbb{E}[M_{t+1}]}{1-\beta_1^{\,t+1}} = \mathbb{E}[\hat{M}_{t+1}]$

• An analogous argument derives the bias correction of second moment

Adam’s properties
• 1 - Loss scale-invariance:
$L(W) \rightarrow cL(W) \;\Longrightarrow\; \hat{M}_t \rightarrow c\,\hat{M}_t \ \wedge\ \hat{R}_t \rightarrow c^2\, \hat{R}_t$

$W_{t+1} = W_t - \eta\, \frac{c\,\hat{M}_{t+1}}{\sqrt{c^2 \hat{R}_{t+1}} + \epsilon} \approx W_t - \eta\, \frac{\hat{M}_{t+1}}{\sqrt{\hat{R}_{t+1}} + \epsilon}$
• 2 – Initialization Bias correction for first and second moments:

Adam’s properties
• 3 - Bounded norm of the update step

• 4 - Disabling the second moment estimation reduces Adam to SGD with momentum:
• Learning rate: with the infinity norm, the step is scaled by the largest magnitude among the elements of the gradient
• Momentum: ascending towards the running mean of the gradients

Adam’s properties
• 5 - RMSProp with momentum is the method most closely related to
Adam.
• Main differences:
• RMSProp rescales gradient and then applies momentum, Adam first applies
momentum (moving average) and then rescales (bias correction).
• RMSProp lacks bias correction, often leading to large step sizes in the early stages of a run (especially when the decay rate is close to 1).

When update parameters
• Batch Gradient Descent
• Batch Size = Size of Training Set, There is one parameter update per
epoch
• Stochastic Gradient Descent
• Batch Size = 1, There is one parameter update per example
• Mini-Batch Gradient Descent
• 1 < Mini-Batch Size < Size of Training Set, There is one parameter
update per Mini-batch

Mini-Batch Gradient Descent
• Consider a CNN under training; at each iteration the parameters are updated. In an iteration we enter a batch of examples into the network and calculate the average loss for the batch in the forward pass
• Consider a mini-batch of size m′. The average loss is
$L(W) = \frac{1}{m'}\sum_{i=1}^{m'} L_i + \lambda\, R(W)$
• We calculate $\nabla_W L$ in the backward pass using backpropagation

Mini-Batch Gradient Descent
Loop:
1. Sample a mini-batch of data
2. Forward prop it through the network, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient

ResNet: Mini-batch size 256 examples
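
A minimal NumPy sketch of this 4-step loop for a linear softmax classifier on random data (ours; all names, sizes, and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((512, 100)), rng.integers(0, 10, 512)
    W, lr, batch_size = 0.01 * rng.standard_normal((100, 10)), 1e-1, 64

    for step in range(100):
        idx = rng.choice(len(X), batch_size, replace=False)      # 1. sample a mini-batch
        xb, yb = X[idx], y[idx]
        scores = xb @ W                                           # 2. forward pass
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        loss = -np.log(probs[np.arange(batch_size), yb]).mean()
        dscores = probs.copy()                                    # 3. backprop to the gradient
        dscores[np.arange(batch_size), yb] -= 1
        dW = xb.T @ dscores / batch_size
        W -= lr * dW                                              # 4. update the parameters
    print("final mini-batch loss:", loss)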

7- Learning rate schedules
Learning rate: Weight updating
• SGD, SGD+Momentum, AdaGrad, RMSProp, and Adam all have the learning rate as a hyperparameter
• All of them start with a large learning rate and decay it over time

• An advantage of SGD and other online or mini-batch update methods is that their convergence does not depend on the size of the training set, only on the number of updates

Learning rate
• Learning rate is a hyperparameter used in training
• Learning rate has a small positive value, often in the range between
0.0001 and 1.0
• Learning rate controls how quickly model is adapted to problem
• Smaller learning rates require more training epochs given smaller
changes made to weights each update, whereas larger learning rates
result in rapid changes and require fewer training epochs
• Too large learning rate can cause model to converge too quickly to a
suboptimal solution
• Adaptive Learning Rate!
Learning Rate Decay
• Start with large learning rate and decay over time

[Plot: the training loss L(W) under a decaying learning rate schedule]

Learning Rate: Cosine Decay

Learning Rate: Linear, Inverse sqrt Decay

Learning Rate Decay: Linear Warmup
• High initial learning rates can make the loss explode; linearly increasing the learning rate from 0.001 over the first ~10 epochs can prevent this

Which decay scheduling is the best?
• All of them
• Start with large learning rate and decay over time
• Try the cosine schedule:

$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi\, T_{cur}/T)\big)$

• where $\eta_{\min}$ and $\eta_{\max}$ are the ranges for the learning rate, and $T_{cur}$ accounts for how many epochs have been performed since the last period
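
A small Python sketch of that schedule (ours; the function name and the example values of eta_min, eta_max and T are assumptions):

    import math

    def cosine_lr(t_cur, T, eta_min=1e-5, eta_max=1e-1):
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / T))

    for epoch in range(0, 100, 25):
        print(epoch, round(cosine_lr(epoch, T=100), 5))   # decays from eta_max toward eta_min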

In practice
• Adam is a good default choice in many cases; it often works ok even
with constant learning rate
• SGD + Momentum can outperform Adam but may require more
tuning of LR and schedule

• ResNets: multiply Learning Rate by 0.1 after epochs 30, 60, and 90

8- Backpropagation
Optimization Algorithm
How to find the best W?

$\min_{W} L(W)$

Backpropagation: a simple example, flow-graph method
• Flow graph: to prevent mathematical mistakes and make sure an implementation is computationally efficient
Example: find x, y, z that minimize f:
$\underset{x, y, z}{\mathrm{argmin}}\ f = (x + y)\, z$
We want the gradients of f with respect to x, y, and z

Back propagation: a simple example
• e.g. x = -2, y = 5, z = -4
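
Working the example by hand in Python (a sketch of the forward and backward passes on this flow graph; the intermediate name q is ours):

    x, y, z = -2.0, 5.0, -4.0
    q = x + y                 # forward: intermediate node q = 3
    f = q * z                 # forward: f = -12

    df_dz = q                 # backward: d(q*z)/dz = q = 3
    df_dq = z                 # d(q*z)/dq = z = -4
    df_dx = df_dq * 1.0       # chain rule through q = x + y: -4
    df_dy = df_dq * 1.0       # -4
    print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0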

Upstream, Downstream

Example 2:
• Find the inputs that minimize the function

Computational graph representation

Feed forward

Back propagation: Gradient flow

Computational graph variations
• Computational graph representation may not be unique. Choose one
where local gradients at each node can be easily expressed!

Patterns in gradient flow

Backprop Implementation: “Flat” code

Forward pass:
Computes loss

Backward pass:
Computes grads

9- Initialization
Xavier Initialization – Convolution Layer
• All components of W1 are initialized by a random quantity drawn from a uniform distribution with zero mean and 1/75 variance
• All components of W2 are initialized by a random quantity drawn from a uniform distribution with zero mean and 1/150 variance
• Initialize all biases b to zero
[Figure: the two conv layers of the earlier example, 6*(W1, b1) with 5*5*3+1 = 76 parameters per filter and 10*(W2, b2) with 5*5*6+1 = 151 parameters per filter]
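
A minimal NumPy sketch of this uniform initialization (ours; it uses the fact that U(−a, a) has variance a²/3, so a = √(3/fan_in) gives variance 1/fan_in):

    import numpy as np

    def xavier_uniform(shape, fan_in):
        a = np.sqrt(3.0 / fan_in)
        return np.random.uniform(-a, a, size=shape)

    W1 = xavier_uniform((6, 5, 5, 3), fan_in=5 * 5 * 3)      # variance ~ 1/75
    W2 = xavier_uniform((10, 5, 5, 6), fan_in=5 * 5 * 6)     # variance ~ 1/150
    b1, b2 = np.zeros(6), np.zeros(10)                       # biases start at zero
    print(W1.var(), W2.var())                                # close to 1/75 and 1/150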
Xavier Initialization – Fully Connected
Layer
• All components of W1 are initialized by a random quantity drawn from a uniform distribution with zero mean and 2/(25088+4096) variance
• All components of W2 are initialized by a random quantity drawn from a uniform distribution with zero mean and 2/(4096+2024) variance
• All components of W3 are initialized by a random quantity drawn from a uniform distribution with zero mean and 2/(2024+1000) variance
• Initialize all biases b to zero
[Figure: a 25088 → 4096 → 2024 → 1000 fully connected network (“3-layer Neural Net”, or “2-hidden-layer Neural Net”)]
Xavier Initialization and ReLU
• Xavier assumes a zero-centered activation function
• Activations collapse to zero when ReLU is used.

Weight Initialization: Kaiming / MSRA
Initialization when ReLU is used - ConvLayer
• All components of W1 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/75 variance
• All components of W2 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/150 variance
• Initialize all biases b to zero
[Figure: the two conv layers, 6*(W1, b1) with 5*5*3+1 = 76 parameters per filter and 10*(W2, b2) with 5*5*6+1 = 151 parameters per filter]
Weight Initialization: Kaiming / MSRA Initialization when ReLU is used – Fully Connected Layer
• All components of W1 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/25088 variance
• All components of W2 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/4096 variance
• All components of W3 are initialized by a random quantity drawn from a normal distribution with zero mean and 2/2024 variance
• Initialize all biases b to zero
[Figure: the 25088 → 4096 → 2024 → 1000 fully connected network (“3-layer Neural Net”, or “2-hidden-layer Neural Net”)]

10- Architectures for Image
recognition
VGG16: Parameters and memory

[Table: per-layer memory and parameter counts for the VGG16 input and layers; memory ×2 for the backward pass]

GoogLeNet [Szegedy et al., 2014]
• GoogLeNet architecture tried to reduce learnable parameters, mainly
through inception module
• Inception module:
• leverages feature learning at different scales through convolutions with
different filters
• reduces number of parameters of hypothesis
• A deeper network, with computational efficiency
• 1000 labels (softmax)
• 22 layers (Don’t count auxiliary layers), 5 million parameters (12x less
than AlexNet, 27x less than VGG-16)
GoogLeNet
• All convolution blocks are appended with a ReLU
• Dropout in Fully connected, Fully connected-a1, and Fully connected-b1
• Local response normalization (LRN), normalizing over local input regions; not used any more
• Auxiliary classifiers for training only; the loss generated by the two auxiliary classifiers is added to the total loss with a weight of 0.3

[Figure: GoogLeNet layout: (Convolution, Pool-2) → stacked Inception modules (with dropout) → average pooling → fully connected (1 layer) → loss function]
Configuration details

ops: number of mathematical operations carried out within the module


GoogLeNet [Szegedy et al., 2014]
• “Inception module”: design a good local network topology (network
within a network) and then stack these modules on top of each other

• Apply parallel filter operations on input from previous layer:


• Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
• Pooling operation (3x3)
• Concatenate all filter outputs together depth-wise

Example of an “Inception” module
• Output size after filter concatenation: 28x28x(96+192+256+128) = 28x28x672
• Convolution operations:
  [1x1 conv, 128]: 28x28x128x1x1x256
  [3x3 conv, 192]: 28x28x192x3x3x256
  [5x5 conv, 96]: 28x28x96x5x5x256
• Total: 854M ops (very expensive compute)
• ops: number of mathematical operations carried out within the module
• Problem: computational complexity
[Figure: naive Inception module; the 28x28x256 input from the previous layer feeds parallel 1x1 conv (128), 3x3 conv (192), 5x5 conv (96), and 3x3 pool (stride 1) branches whose outputs (28x28x128, 28x28x192, 28x28x96, 28x28x256) are concatenated]

Example of an “Inception” module
• Convolution operations:
  [1x1 conv, 64]: 28x28x64x1x1x256
  [1x1 conv, 64]: 28x28x64x1x1x256
  [1x1 conv, 128]: 28x28x128x1x1x256
  [3x3 conv, 192]: 28x28x192x3x3x64
  [5x5 conv, 96]: 28x28x96x5x5x64
  [1x1 conv, 64]: 28x28x64x1x1x256
• Total: 358M ops (compared to 854M ops)
• Note: 1x1 convolutions are used to reduce the feature depth (256 → 64)
[Figure: Inception module with dimension reduction; the 28x28x256 input feeds 1x1 conv (64) → 3x3 conv (192), 1x1 conv (64) → 5x5 conv (96), 3x3 pool (stride 1) → 1x1 conv (128), and a direct 1x1 conv (64); the branch outputs (28x28x96, 28x28x192, 28x28x64, 28x28x128) are concatenated into 28x28x480]

ResNet [He et al., 2015]
• 152-layer model for ImageNet
• What happens when we continue stacking deeper layers on a “plain” CNN? The deeper model performs worse on both training and test error
• The deeper model performs worse, but it’s not caused by overfitting
• Fact: deep models have more representation power (higher complexity, e.g. more parameters) than shallower models
• Hypothesis: the problem is an optimization problem; deeper models are harder to optimize
[Figure: the ResNet architecture with stride-2 downsampling stages, a global average pooling layer, and an FC layer with 1000 outputs (labels)]
Deeper models are harder to optimize
• In ResNet, the depth of the convolutional stack is controlled during the optimization process.
• We build the network with a high number of convolution layers, and let the optimization process decide the number of active layers.
[Plot: training error vs. number of convolutional layers for a plain CNN and for a residual CNN]

Deeper CNN using Identity mapping
• What should the deeper model learn to be at least as good as the shallower model?
• Solution: copy the learned layers from the shallower model and set the additional layers to identity mapping
• Identity mapping: H(x) = x

ResNet architecture: Stack residual blocks
• Every residual block has two 3x3 conv layers
• Convolution operations for a 28×28×256 input:
• [3x3 conv, 256]: 28x28x256x3x3x256
• [3x3 conv, 256]: 28x28x256x3x3x256
[Figure: a residual block mapping a 28×28×256 input to a 28×28×256 output, with a skip connection adding the input to the block output]
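
A toy NumPy sketch of the residual idea H(x) = F(x) + x (ours; purely to keep it short, the 3x3 spatial convolution is replaced by a per-pixel channel-mixing stand-in, so only the skip connection is faithfully illustrated):

    import numpy as np

    def conv3x3_like(x, W):
        return np.maximum(0, x @ W)            # stand-in for conv3x3 + ReLU

    def residual_block(x, W1, W2):
        F = conv3x3_like(x, W1) @ W2           # F(x): the residual branch
        return F + x                           # skip connection: output = F(x) + x

    x = np.random.randn(28, 28, 256)
    W1 = 0.01 * np.random.randn(256, 256)
    W2 = 0.01 * np.random.randn(256, 256)
    print(residual_block(x, W1, W2).shape)     # (28, 28, 256), same as the input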
A note on efficiency
• It uses a “bottleneck” layer (1x1 conv) to improve efficiency (similar to GoogLeNet)
• Convolution operations:
• Reducing depth from 256 to 64 (1x1 conv, 64 filters, projecting to 28x28x64): [1x1 conv, 64] 28x28x64x1x1x256
• The 3x3 conv operates over only 64 feature maps: [3x3 conv, 64] 28x28x64x3x3x64
• Back to depth 256 (1x1 conv, 256 filters, projecting back to 256 feature maps): [1x1 conv, 256] 28x28x256x1x1x64

Training ResNet in practice
• Batch Normalization after every CONV layer
• Xavier initialization
• SGD + Momentum (ρ = 0.9)
• Learning rate: 0.1, divided by 10 when validation error plateaus
• Mini-batch size 256
• No dropout used

SENet (Squeeze-and-Excitation Networks )
[Hu et al. 2017]

• Improving ResNets
• Add a “feature recalibration” module that learns to adaptively
reweight feature maps
• Global information (global avg. pooling layer) together with 2 FC
layers used to determine feature map weights

SENet: Improving ResNets
• Schema of original Residual module (left) and SE-ResNet module
(right)

per-channel modulation weights

SENet
• A transformation Ftr maps the input X to feature maps U, e.g. a convolution
• The features U are passed through a squeeze operation Fsq, which produces a channel descriptor by aggregating the feature maps across their spatial dimensions (H×W)
• Aggregation is followed by an excitation operation Fex, which takes the form of a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights

SE-ResNet Module

ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners

First CNN-based
winner

Comparing complexity...
Top-1 one-crop accuracy versus
amount of operations required for a
single forward pass
Size of blobs is proportional to number
of parameters

11- Normalization
Input image normalization

[Figure: data before and after normalization, with the standard deviations σx and σy shown along each axis]

Mini-Batch Normalization
• In deep networks, a too-high learning rate may result in gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues
• Batch normalization provides a normalized input for each layer of the CNN during training
• BN is usually inserted after Fully Connected or Convolutional layers and before the non-linear operation
• If we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, the optimizer would be less likely to get stuck in the saturated regime, and training convergence would accelerate

Mini-Batch Normalization
1. Input images in the batch are normalized (zero-centered) based on the mean and variance of the images in the batch
2. Layer ℓ:
   1. Feed the output of ReLU(ℓ−1) into Conv(ℓ) and calculate the layer's activation maps
   2. Normalize each activation of a channel based on the mean and variance of the activations of that channel
   3. Scale and shift the normalized value
   4. Send the resulting maps, namely y, into ReLU(ℓ); calculate the average loss for the images in the final layer
3. Start backpropagation

[Figure: x → Conv(ℓ) → BN → y → ReLU(ℓ) → Conv(ℓ+1), applied per channel]

Mini-Batch Normalization
• x(k) is an activation map (channel k) of the layer
• x(k) is normalized using the mean and variance of the activations in that channel; the
expected value and variance are calculated per channel (per activation map):

x̂(k) = (x(k) − E[x(k)]) / √(Var[x(k)] + ε)

• y(k) is calculated by scaling and shifting the normalized value:

y(k) = γ(k) · x̂(k) + β(k)

• The parameters γ(k), β(k) are learned along with the original model parameters

BN at test time
• During training, the means and variances of the activations are computed. These statistics
are used to normalize test examples at each layer as follows:
• Keep an exponentially decaying running mean of the mean and variance of each activation,
and use these averages to normalize the data at test time.
• For each activation map in all layers, at each step of training we update the running averages
of the mean and variance using an exponential decay based on the momentum parameter. So,
at the end of training, there is a 4-tuple of (running mean, running variance, γ, β) for each
activation map of the CNN, where γ and β are the learned scale and shift parameters.

running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

BN at test time
• A running mean is an average that is continually updated as each mini-batch
enters the training algorithm; it is computed incrementally, one mini-batch at a
time, rather than from all of the data at once.
• Momentum is a parameter that is usually set to 0.9

Example code for BN
• An example implementation of batch normalization is available on GitHub:
https://github.com/Erlemar/cs231n_self/blob/master/assignment2/cs231n/layers.py#L116
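For reference, here is a minimal NumPy sketch in the same spirit (the function name, the bn_param dictionary keys, and the eps/momentum defaults are assumptions, not taken from that repository):

import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    """Minimal batch-norm sketch. x: (N, D); gamma, beta: (D,).
    bn_param: dict with 'mode' ('train' or 'test'), 'eps', 'momentum',
    'running_mean', 'running_var'."""
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)
    running_mean = bn_param['running_mean']
    running_var = bn_param['running_var']

    if mode == 'train':
        sample_mean = x.mean(axis=0)
        sample_var = x.var(axis=0)
        x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
        # Exponentially decaying running statistics, used later at test time
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
    else:  # 'test': normalize with the stored running statistics
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)

    out = gamma * x_hat + beta                  # scale and shift
    bn_param['running_mean'], bn_param['running_var'] = running_mean, running_var
    return out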

Batch Normalization
• In the batch setting, where each training step is based on the entire training
set, we would use the whole set to normalize the activations. However, this is
impractical when using stochastic optimization
• Therefore, since we use mini-batches in stochastic gradient training,
each mini-batch produces estimates of the mean and variance of each
activation
• This way, the statistics used for normalization can fully participate in
gradient backpropagation

12- Transfer Learning
Transfer Learning with CNNs

Transfer Learning with CNNs

(Figure: a pretrained CNN; earlier layers give a more generic representation, later
layers a more specific one. A fine-tuning sketch follows below.)
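A minimal PyTorch sketch of this idea, assuming a recent torchvision and a small target dataset (the ResNet-18 backbone and the 10 target classes are assumptions):

import torch.nn as nn
import torchvision

# Reuse a pretrained CNN as a generic feature extractor and retrain only the
# most task-specific layer at the top.
model = torchvision.models.resnet18(weights='IMAGENET1K_V1')   # pretrained on ImageNet

for param in model.parameters():        # freeze the generic early layers
    param.requires_grad = False

num_classes = 10                        # assumed size of the new label set
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable head

# Only model.fc.parameters() would be passed to the optimizer; with more target
# data one could also unfreeze and fine-tune the later convolutional blocks.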

Transfer learning with CNN
• Image Captioning: CNN + RNN

(Figure: a pre-trained CNN provides the image feature vector; an RNN generates the caption
using word feature vectors pre-trained with word2vec.)
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015

Visual Question Answering (VQA)
• Zhou et al, “Unified Vision-Language Pre-Training for Image Captioning and VQA” CVPR 2020

1. Train CNN on ImageNet


2. Fine-Tune for object
detection on Visual Genome
dataset (Connecting language
and vision dataset)
3. Train BERT language model
on lots of text
4. Combine (2) and (3), train for
joint image/language
modeling
5. Fine-tune (4) for image
captioning, visual question
answering, etc.
13- CNN for text classification
CNN for text classification
• Word embedding (such as word2vec) is an algorithm that accepts a text
corpus as input and outputs a vector representation for each
word
• Example: a 50-dimensional real vector represents the word "tree" (see the sketch below)
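A small sketch of such a lookup with PyTorch's nn.Embedding (the toy vocabulary and the randomly initialized vectors are assumptions; in practice the 50-dimensional vectors would come from word2vec):

import torch
import torch.nn as nn

vocab = {'quick': 0, 'brown': 1, 'fox': 2, 'tree': 3}      # hypothetical vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=50)

tree_vec = embedding(torch.tensor([vocab['tree']]))        # shape (1, 50)
print(tree_vec.shape)   # a 50-dimensional real vector representing "tree"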

A sentence representation
(Figure: a text of 7 words, "quick brown fox jumps over lazy dog", represented as a 7×50 matrix, one 50-dimensional word vector per row, i.e. a 50×7×1 input volume.)
Sentence classification
(Figure: a 7-word text as a 50×7×1 input; 16 filters spanning two-word windows (50×2×1) and 16 filters spanning three-word windows are convolved with the text, followed by ReLU, max pooling over time, concatenation of the pooled features, a fully connected layer, and a softmax. A sketch of this pipeline follows below.)
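A minimal PyTorch sketch of this pipeline (the class name, the 2- and 3-word filter widths, 16 filters each, and the 50-dimensional embedding are assumptions based on the figure):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Embed words, convolve over 2- and 3-word windows, max-pool over time,
    concatenate, classify."""
    def __init__(self, vocab_size, embed_dim=50, num_filters=16, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Each filter spans the full embedding dimension and 2 or 3 words
        self.conv2 = nn.Conv2d(1, num_filters, kernel_size=(2, embed_dim))
        self.conv3 = nn.Conv2d(1, num_filters, kernel_size=(3, embed_dim))
        self.fc = nn.Linear(2 * num_filters, num_classes)

    def forward(self, tokens):                      # tokens: (N, T) word indices
        x = self.embed(tokens).unsqueeze(1)         # (N, 1, T, embed_dim)
        f2 = F.relu(self.conv2(x)).squeeze(3)       # (N, num_filters, T-1)
        f3 = F.relu(self.conv3(x)).squeeze(3)       # (N, num_filters, T-2)
        p2 = F.max_pool1d(f2, f2.size(2)).squeeze(2)    # max over time -> (N, num_filters)
        p3 = F.max_pool1d(f3, f3.size(2)).squeeze(2)
        feats = torch.cat([p2, p3], dim=1)          # concatenated feature vector
        return self.fc(feats)                       # softmax is applied inside the loss

# Usage: logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (4, 7)))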

14- Beyond Classification
Beyond Classification
• CNN for Semantic Segmentation, Object Detection

FCNN for Semantic Segmentation
• Fully Convolutional Neural Network
• Design CNN as a bunch of convolutional layers, with downsampling
and upsampling inside

(Figure: downsampling with convolution, upsampling with transposed convolution; a minimal sketch follows below.)
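A minimal sketch of such a fully convolutional design in PyTorch (the channel counts, kernel sizes and number of classes are assumptions):

import torch.nn as nn

num_classes = 21
fcn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # downsample to H/2 x W/2
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # downsample to H/4 x W/4
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),          # up to H/2 x W/2
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, num_classes, kernel_size=4, stride=2, padding=1),  # up to H x W
)
# Output: (N, num_classes, H, W) per-pixel class scores.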

Noh et al, “Learning Deconvolution Network for Semantic Segmentation”,
ICCV 2015

Transposed convolution network

• On top of a CNN based on the 16-layer VGG network, a multilayer deconvolution network generates
the segmentation map of an input image.
• Given a feature representation obtained from the convolution network, the class prediction map is
constructed through multiple series of unpooling, deconvolution and rectification operations.
Transposed convolution
• Example: wᵀ ∗ x, a filter w transposed-convolved with an input x

(Figure: transposed convolution with a stride of 1; a numeric sketch follows below.)
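A tiny illustration with PyTorch's conv_transpose1d (the 1-D filter and input values are made up):

import torch
import torch.nn.functional as F

# Transposed convolution spreads each input value over the filter footprint
# and sums the overlaps, producing a longer output.
x = torch.tensor([[[1., 2., 3., 4.]]])     # (N=1, C=1, L=4)
w = torch.tensor([[[1., 0.5, 0.25]]])      # (C_in=1, C_out=1, K=3)
y = F.conv_transpose1d(x, w, stride=1)
print(y.shape)    # output length = 4 + 3 - 1 = 6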

In-Network upsampling: “Unpooling”
• Max Pooling/Unpooling (a short sketch follows below)
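A short sketch of this pairing in PyTorch (the 4×4 input is arbitrary): max pooling remembers where each maximum came from, and max unpooling places the pooled values back at those positions, filling the rest with zeros.

import torch
import torch.nn.functional as F

x = torch.arange(16.).reshape(1, 1, 4, 4)
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
restored = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(pooled.shape, restored.shape)    # (1, 1, 2, 2) -> (1, 1, 4, 4)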

R-CNN for Multiple Object Detection
• Region-based Convolutional Neural Networks

Region of Interest

R-CNN open sources
• TensorFlow Detection API:
• https://github.com/tensorflow/models/tree/master/research/
object_detection
• Faster RCNN, SSD, RFCN, Mask R-CNN, ...

• Detectron2 (PyTorch)
• https://github.com/facebookresearch/detectron2
• Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...

CNN and Image restoration tasks
• CNNs have a limited receptive field and limited adaptability to the input content
• Their computational complexity grows quadratically with spatial
resolution, therefore making them infeasible to apply to most image
restoration tasks involving high-resolution images

15- Visualizing CNN features
Maximally Activating Patches
• Pick a layer and a channel, run many images through the network, record values
of chosen channel, visualize image patches that correspond to maximal
activations
(Figure: image patches that maximally activate a given channel, shown next to the input
images they come from. Visualization of patterns learned by layer conv9 of a CNN
trained on ImageNet; each row corresponds to one channel.)
Which pixels matter
• Mask part of the image before feeding to CNN, check how much
predicted probabilities change.

(Figure: the trained CNN assigns softmax probability 0.95 to the original image and 0.45 once part of it is masked.)

Which pixels matter
• Slide the mask over the image and check how much the predicted
probability changes at each position (a sketch of this occlusion experiment follows below)
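A hedged sketch of this occlusion experiment (the patch size, stride and gray fill value are assumptions; model and image are assumed to exist):

import torch

def occlusion_map(model, image, label, patch=16, stride=8):
    """Slide a gray square over the image and record the softmax probability
    of the true class at each position. image: (3, H, W)."""
    model.eval()
    _, H, W = image.shape
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    with torch.no_grad():
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                masked = image.clone()
                masked[:, y:y + patch, x:x + patch] = 0.5   # gray occluder
                prob = torch.softmax(model(masked.unsqueeze(0)), dim=1)[0, label]
                heat[i, j] = prob
    return heat   # low values mark the pixels that matter most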

Gradient ascent
• Gradient ascent: generate a synthetic image that maximally activates a chosen
unit (e.g. a class score or an element of an activation map)
1. Initialize the image to zeros
Repeat:
2. Forward the image to compute the current scores
3. Backprop to get the gradient of the unit's value with respect to the image pixels
4. Make a small update to the image (a sketch of this loop follows below)
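A hedged PyTorch sketch of this loop, with a simple L2 penalty on the image as the regularizer (the image size, step size, number of steps and regularization strength are assumptions):

import torch

def gradient_ascent_image(model, target_class, steps=200, lr=1.0, l2_reg=1e-3):
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)   # 1. initialize to zeros
    for _ in range(steps):
        score = model(img)[0, target_class]                 # 2. forward pass: current score
        obj = score - l2_reg * img.norm() ** 2              # add a simple image prior
        model.zero_grad()
        if img.grad is not None:
            img.grad.zero_()
        obj.backward()                                      # 3. gradient w.r.t. image pixels
        with torch.no_grad():
            img += lr * img.grad                            # 4. small ascent step
    return img.detach()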

Visualizing CNN features: Gradient Ascent
(Figure: images generated by gradient ascent.)

Feature Inversion
• Given a CNN feature vector for an image, find a new image that:
• Matches the given feature vector
• “Looks natural” (image prior regularization)

Reconstructing from different layers of VGG-16


16- Failures of CNN
Main failure of CNNs
• The main failure of CNNs is that they do not carry any information about the
relative spatial relationships between features
• This leads to false positives for images which have the
components of a face but not in the correct arrangement
• This is simply a flaw in the core design of CNNs, since they are based on the
basic convolution operation applied to scalar values
• A CNN uses a single scalar output to summarize the activity of a local pool
of replicated feature detectors

A case
• Neurons in the final layers of a trained CNN detect, or are "activated" by,
certain features in the input image
Example:
• Train a CNN for face detection
• Some channels in a layer might be triggered by eyes, while others may be
triggered by a mouth
• If all of the components that make up a face (eyes, ears, nose, and mouth), or at least a
certain number of them, are present, then our CNN will tell us that it has detected
a face
• But that is unfortunately where the reach of CNNs ends

A case
• Let's have a look at an example
• The image of a face on the left has all of the components of a face, so a CNN handles this case
just fine
• The tricky part is that, for a CNN, the image on the right is also a face: it has all the features of a
face, just not in the correct arrangement. When a trained CNN is applied, all of the feature detectors will be activated

A case
(Figure: the individual part detectors fire with scores 0.7, 0.9, 0.9, 0.7 and 0.9, and the CNN reports a probability of 0.95 of the image being a face.)

Why CNN fails
• In a CNN, higher-level features combine lower-level features as a
weighted sum: the activations of a preceding layer are multiplied by the
following layer's weights (w) and added together with a bias (b), before being passed to a non-
linearity
• Nowhere in this information flow are the relationships between features
taken into account
• For the i-th 5×5 patch of a 3-channel input:

out_i = Σ_{k=1}^{3} Σ_{j=1}^{5×5} w_{j,k} · in(patch_i)_{j,k} + b
Adversarial robustness
• Vulnerability of neural networks to adversarial examples:
• The input image is slightly changed by an attacker to trick a neural net classifier into making a
wrong classification
• These inputs can be created in a variety of ways; a straightforward strategy is the fast
gradient sign method (FGSM) of
• Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing
adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
• Such attacks have been shown to drastically decrease the accuracy of convolutional neural networks
on image classification tasks

(Figure: "panda" (57.7% confidence) + ε · perturbation → "gibbon" (99.3% confidence).)
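A hedged sketch of FGSM in PyTorch (epsilon and the [0, 1] pixel range are assumptions; model and image are assumed to exist):

import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.007):
    """Perturb the image in the direction of the sign of the loss gradient."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image.unsqueeze(0)), torch.tensor([label]))
    loss.backward()
    adv = image + epsilon * image.grad.sign()    # small, nearly invisible change
    return adv.clamp(0, 1).detach()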

Fooling Images
1. Start from an arbitrary image
2. Pick an arbitrary class
3. Modify the image to maximize that class's score
4. Repeat until the network is fooled

17- Deep learning hardware and
software
Deep learning hardware
• CPU: Fewer cores, but each core is much faster and much more
capable; great at sequential tasks.
• GPU: More cores, but each core is much slower and “dumber”; great
for parallel tasks.

Example: matrix multiplication
• The product of an A×B and a B×C matrix consists of A×C independent vector inner products, which can be computed in parallel (see the timing sketch below)
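A hedged timing sketch with PyTorch (the matrix sizes are arbitrary; actual speedups depend on the hardware):

import time
import torch

A, B, C = 4096, 4096, 4096
x_cpu, y_cpu = torch.randn(A, B), torch.randn(B, C)

t0 = time.time()
z_cpu = x_cpu @ y_cpu                     # A x C inner products on a few CPU cores
print('CPU:', time.time() - t0, 's')

if torch.cuda.is_available():
    x_gpu, y_gpu = x_cpu.cuda(), y_cpu.cuda()
    torch.cuda.synchronize()
    t0 = time.time()
    z_gpu = x_gpu @ y_gpu                 # the same inner products, largely in parallel
    torch.cuda.synchronize()              # wait for the asynchronous kernel to finish
    print('GPU:', time.time() - t0, 's')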

CPU vs GPU in practice

Programming GPUs

Deep Learning Software
• PyTorch (Facebook), version 1.4 (January 2020)
• TensorFlow (Google), Version 2.1 (March 2020)

• Quick to develop and test new ideas


• Automatically compute gradients
• Run it all efficiently on GPU (wrap cuDNN, cuBLAS, OpenCL, etc)

PyTorch
• Tensor: Like a numpy array, but can run on GPU
• Autograd: Package for building computational graphs out of Tensors,
and automatically computing gradients
• Module: A neural network layer; may store state or learnable weights (a minimal sketch combining all three follows below)
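A minimal sketch that touches all three pieces (the layer sizes are arbitrary):

import torch
import torch.nn as nn

# Tensor: like a numpy array, optionally on the GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(64, 100, device=device)
y = torch.randn(64, 10, device=device)

# Module: layers that store learnable weights
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10)).to(device)

# Autograd: the forward pass builds a graph; backward() computes gradients
loss = ((model(x) - y) ** 2).mean()
loss.backward()
print(model[0].weight.grad.shape)   # gradient of the loss w.r.t. the first layer's weights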

