Pattern Classification
Convolutional Neural Networks (CNNs, or ConvNets) for Visual Recognition
Hypothesis Set and Algorithm
First Edition
Acknowledgment
• This chapter is adapted from the lecture notes of “CS231n: Convolutional Neural Networks for Visual Recognition”, Spring 2022
• http://cs231n.stanford.edu/
• https://github.com/cs231n/cs231n.github.io
• X: set of all possible items; features can be either hand-crafted or learned
• Concept set: C
• Target concept: c ∈ C, a mapping c: X → Y
• Probability distribution of examples (unknown): D
• Training set: S = ((x₁, y₁), …, (x_m, y_m)); the teacher provides noise-free labels
• Learning algorithm: selects a hypothesis h ∈ H that has a small loss
Deep learning
• Input: a 32 × 32 RGB image (3 channels), x ∈ ℝ^(32×32×3)
03/18/2024 Pattern Recognition-CNNs, School of Computer Engineering, IU 7
ST, Morteza Analoui
Convolutional Neural Networks (CNNs)
• Representation learning is done by Convolution
• Label learning is done by Perceptron (fully connected neural network)
[Figure: a 32×32×3 input passes through convolution layers (feature/representation learning), then a perceptron (label learning) that outputs labeling scores.]
Example: CIFAR-100
• Just like CIFAR-10, except it has 100 concepts (classes) containing 600 images each.
• 500 training/validation images and 100 testing images per class.
• The 100 classes in CIFAR-100 are grouped into 20 super-classes. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the super-class to which it belongs).
• e.g. super-class: vehicles (classes: bicycle, bus, motorcycle, pickup truck, train)
• During 2010-2017, an annual contest, the “ImageNet Large Scale Visual Recognition Challenge (ILSVRC)”, was held on correctly classifying and detecting objects. The training set contains 1.3 million images, accompanied by 50,000 validation images and 100,000 testing images.
• Completion of ILSVRC: the annual ImageNet competition was no longer held after 2017 and moved to Kaggle.
• Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners
Kaggle.com/datasets
Classifier: label learning
[Figure: input xᵢ flows through a stack of CONV+ReLU layers (with max pooling, "mp") into three FC layers. The first layers compute w₁ ∗ xᵢ, then w₂ ∗ ReLU(w₁ ∗ xᵢ), and so on.]
VGG16, 2014
[Figure: VGG16 — 13 CONV+ReLU layers followed by 3 FC layers, applied to input x.]
• Softmax is a loss function
• 138 million parameters in total
• FC layers: 102.76 million, 16.78 million, and 4.096 million parameters
• Label learner (the three FC layers): 123.63 million parameters
• Size of feature vector: 7 × 7 × 512 = 25088
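As a sketch, the FC-layer parameter counts quoted above can be checked by plain arithmetic (weights only; biases add a small extra term):

```python
# VGG16 label learner: 25088 (= 7*7*512) -> 4096 -> 4096 -> 1000
fc6 = 25088 * 4096   # first FC layer
fc7 = 4096 * 4096    # second FC layer
fc8 = 4096 * 1000    # final score layer
total_fc = fc6 + fc7 + fc8

print(f"fc6: {fc6/1e6:.2f}M, fc7: {fc7/1e6:.2f}M, fc8: {fc8/1e6:.3f}M")
print(f"label learner total: {total_fc/1e6:.2f}M")  # ~123.63M, as on the slide
```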
Overview of CNN architectures
• In common use:
1. L2 regularization: R(W) = Σₖ Σₗ W²ₖ,ₗ
2. L1 regularization: R(W) = Σₖ Σₗ |Wₖ,ₗ|
3. Elastic net (L1 + L2): R(W) = Σₖ Σₗ (β W²ₖ,ₗ + |Wₖ,ₗ|)
4. Dropout
5. Batch normalization (mini-batch normalization does not do regularization)
6. Stochastic depth
7. …
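The first three penalties above can be sketched as small NumPy functions (W is any weight matrix; β in the elastic net, a hyperparameter here, trades off the L2 and L1 parts):

```python
import numpy as np

def l2_penalty(W):
    return np.sum(W ** 2)          # sum of squared weights

def l1_penalty(W):
    return np.sum(np.abs(W))       # sum of absolute weights

def elastic_net_penalty(W, beta=0.5):
    return beta * l2_penalty(W) + l1_penalty(W)

W = np.array([[1.0, -2.0], [0.5, 0.0]])
reg = 1e-3 * l2_penalty(W)         # lambda * R(W), added to the data loss
```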
CNN is a multi-class, mono-label, score-based classifier
• In the multi-concept setting, a hypothesis is defined based on a scoring function h: X × Y → ℝ
• The label associated to a test example (image) x is the one resulting in the largest score, which defines a mapping from X to Y: x ↦ argmax_{y∈Y} h(x, y)
• Examples: x₁, x₂, x₃
• Concepts: k = 1 cat, k = 2 car, k = 3 frog
• e.g. s₁,₁ = s(x₁, 1) = h(x₁, cat);  s₂,₃ = s(x₂, 3) = h(x₂, frog)
• Score vector of each example; margin of h at (x, y): ρ_h(x, y) = h(x, y) − max_{y′≠y} h(x, y′)
• Margin loss Φ_ρ (with ρ = 1): Φ_ρ(ρ_h(x, y)) = 1 if ρ_h(x, y) ≤ 0; 1 − ρ_h(x, y) if 0 ≤ ρ_h(x, y) ≤ ρ; 0 if ρ_h(x, y) ≥ ρ
• Hinge form per wrong class: max(0, 1 + h(x, y′) − h(x, y)) for y′ ≠ y
• Example losses on three images: 2.9, 0, 12.9
• Empirical margin risk:
  R̂_{S,ρ}(h) = (1/m) Σᵢ₌₁^m Lᵢ = (1/m) Σᵢ₌₁^m Φ_ρ(ρ_h(xᵢ, yᵢ)) = (2.9 + 0 + 12.9)/3 = 5.27
• Softmax converts scores to probabilities: P(yᵢ = k | xᵢ) = e^{sᵢ,ₖ} / Σⱼ₌₁^K e^{sᵢ,ⱼ}
• Cross-entropy loss (sᵢ,ᵢ denotes the correct-class score): Lᵢ = −log( e^{sᵢ,ᵢ} / Σⱼ₌₁^K e^{sᵢ,ⱼ} )
  (example values: Lᵢ = 1.58, Lᵢ = 0.452)
• Learning solves a regularized empirical risk minimization:
  argmin_{h∈H} ( R̂_S(h) + λ ℛ(h) ) = argmin_W L(W)
• Multiclass SVM (hinge) loss: Lᵢ = Σ_{j≠i}^K max(0, sᵢ,ⱼ + 1 − sᵢ,ᵢ)
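As a sketch, both losses can be computed in NumPy. The score values below are illustrative (the classic CS231n cat/car/frog example), chosen to reproduce the per-image hinge losses 2.9, 0, and 12.9 quoted above:

```python
import numpy as np

# Rows: images; columns: scores for classes (cat, car, frog).
scores = np.array([[3.2, 5.1, -1.7],   # image 1, true class: cat  (0)
                   [1.3, 4.9,  2.0],   # image 2, true class: car  (1)
                   [2.2, 2.5, -3.1]])  # image 3, true class: frog (2)
y = np.array([0, 1, 2])

def svm_loss(s, yi):
    """Multiclass SVM (hinge) loss: sum over j != yi of max(0, s_j - s_yi + 1)."""
    margins = np.maximum(0, s - s[yi] + 1)
    margins[yi] = 0
    return margins.sum()

def softmax_loss(s, yi):
    """Cross-entropy loss: -log(e^{s_yi} / sum_j e^{s_j})."""
    s = s - s.max()                   # shift for numerical stability
    p = np.exp(s) / np.exp(s).sum()
    return -np.log(p[yi])

hinge = [svm_loss(s, yi) for s, yi in zip(scores, y)]
print(hinge)           # ~[2.9, 0.0, 12.9]
print(np.mean(hinge))  # ~5.27, the empirical risk computed above
```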
• Linear score function (one layer): h(x, w) = wᵀx + b ≡ wx + b, with x ∈ ℝ^(25088×1), w ∈ ℝ^(1000×25088) (25088 weights per output), b ∈ ℝ^(1000×1); output: 1000 scores
• “2-layer Neural Net” (or “1-hidden-layer Neural Net”) and “3-layer Neural Net” (or “2-hidden-layer Neural Net”): input layer → hidden layer(s) of e.g. 4096 units with a nonlinear activation Φ → score layer S
• Hypothesis: linear map + nonlinearity; score of x: s = w₂ Φ(w₁ x)
[Figure: input layer → hidden layer (+ nonlinear Φ) → score S.]
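The 2-layer hypothesis above can be sketched directly (toy dimensions stand in for the 25088 → 4096 → 1000 shapes; random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; in VGG16's label learner these would be 25088 -> 4096 -> 1000.
D, H, K = 8, 4, 3
x = rng.standard_normal(D)
w1, b1 = rng.standard_normal((H, D)) * 0.01, np.zeros(H)
w2, b2 = rng.standard_normal((K, H)) * 0.01, np.zeros(K)

hidden = np.maximum(0, w1 @ x + b1)   # hidden layer, Phi = ReLU
scores = w2 @ hidden + b2             # s = w2 * ReLU(w1 * x + b1) + b2
```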
Activation functions (nonlinear transform)
• Hypothesis: linear map + nonlinearity
• ReLU: Φ(z) = max(0, z)
• Score of x: s = w₂ Φ(w₁ x)
[Figure: input layer → hidden layer (Φ: nonlinear) → score S.]
Setting the number of layers and their sizes
• 1 hidden layer and 3, 6, and 20 neurons
• L(W) = (1/m) Σᵢ₌₁^m Lᵢ(h(xᵢ, W), yᵢ) + λ ℛ(W)
• Larger λ: stronger regularization, simpler hypothesis
Contents
1. Convolutional Neural Network
2. Fully Connected (Multilayer Perceptron)
3. Convolution
4. Pooling
5. Training
6. CNN Optimization algorithm
7. Learning rate schedules
8. Backpropagation
9. Initialization
10. Architectures for Image recognition
11. Normalization
12. Transfer Learning
13. CNN for text classification
14. Beyond Classification
15. Visualizing CNN features
16. Failures of CNN
17. Deep learning hardware and software
3. Convolution
Convolutional Neural Networks — VGG16 [Simonyan and Zisserman, 2014]
[Figure: conv layers (representation learning) produce a 7×7×512 feature volume, followed by fully connected layers (classifier, label learning) trained with Softmax + log loss.]
[Figure: a filter w slides over the image x; each pixel of the filtered image is computed from a patch of image pixels x[a, b] and filter weights w[a, b].]
• For a 5 × 5 × 3 filter on a 3-channel image, one pixel of the filtered image is the dot product wᵀx + b: a 75-dimensional (5 × 5 × 3 = 75) dot product plus a bias.
Convolution
• We call the layer convolutional because it is related to the convolution of two functions: the filter w and the image x.
• For a 32 × 32 × 3 input (H = 32, W = 32, C = 3) and a 5 × 5 filter:
  outᵢ = Σ_{k=1}^{3} Σ_{j=1}^{5×5} wᵢⱼ,ₖ · in(patchᵢ)ⱼ,ₖ + b
Two filters → 2 activation maps
• Consider a second, green filter: convolving the 32 × 32 input image with each filter gives one activation map, x ∗ w₁ and x ∗ w₂.
[Figure: a 32 × 32 input image and two filters producing two activation maps.]
• Zero-pad the border (e.g. pad = 2) of the input image.
[Figure: the first FC layer of VGG16 as a matrix multiply: a 7 × 7 × 512 input flattened to 1 × 1 × 25088 is mapped to 1 × 1 × 4096 by a 4096 × 25088 weight matrix, followed by ReLU(x ∗ w).]
[Figure: a stack of three 3 × 3 filters applied to a 7 × 7 input.]
A simple example: handwritten recognition
[Figure: activation maps produced by filter w₁.]
Problems with weight updating in SGD
2. What if the loss function has a local minimum or saddle point? Saddle points are much more common in high dimensions.
[Figure: loss versus W, showing a local minimum / saddle point.]
03/18/2024 Pattern Recognition-CNNs, School of Computer Engineering, IU 109
ST, Morteza Analoui
SGD + Momentum
• Weight updating in SGD: w_{t+1} = w_t − α ∇L(w_t)
SGD + Momentum
• Combine the gradient at the current point with the velocity to get the step used to update the weights
• Note that the update can be written as — momentum update: v_{t+1} = ρ v_t + ∇L(w_t); weight update: w_{t+1} = w_t − α v_{t+1}
[Figure: at the current point, the velocity ρ v_t and the gradient combine into the step v_{t+1}.]
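The momentum update above can be sketched in a few lines (this uses the v ← ρv + ∇L, w ← w − αv convention; other equivalent formulations fold α into v):

```python
def sgd_momentum_step(w, v, grad, lr=0.05, rho=0.9):
    """Momentum update: v <- rho*v + grad; weight update: w <- w - lr*v."""
    v = rho * v + grad
    w = w - lr * v
    return w, v

# Toy problem: minimize L(w) = w^2, whose gradient is 2w, starting from w = 5.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
# w is now very close to the minimum at 0
```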
Adam
• Weight updating: W_{t+1} = W_t − α M̂_{t+1} / (√V̂_{t+1} + ε); return W_{t+1}
• Hyperparameters: α (learning rate), β₁ (first-moment decay rate, typically 0.9), β₂ (second-moment decay rate, typically 0.999), ε (numerical term, typically 10⁻⁷)
• Bias correction of the first moment:
  𝔼[M_{t+1}] = 𝔼[β₁ M_t + (1 − β₁) ∇L(W_t)] = 𝔼[(1 − β₁) Σ_{i=1}^{t+1} β₁^{t+1−i} ∇L(W_i)] = 𝔼[∇L(W_t)] (1 − β₁^{t+1}) + c
  ⟹ 𝔼[∇L(W_t)] = 𝔼[M_{t+1}] / (1 − β₁^{t+1}), so M̂_{t+1} = M_{t+1} / (1 − β₁^{t+1})
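One Adam step, with the bias-corrected first and second moments, can be sketched as (the toy problem and learning rate are illustrative):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam step; t starts at 1 so the bias corrections are well defined."""
    m = beta1 * m + (1 - beta1) * grad        # first moment M_{t+1}
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment V_{t+1}
    m_hat = m / (1 - beta1 ** t)              # bias-corrected M-hat
    v_hat = v / (1 - beta2 ** t)              # bias-corrected V-hat
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize L(w) = w^2 (gradient 2w) starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)
```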
Learning rate schedules for min_W L(W)
• Cosine schedule: α_t = α_min + ½ (α_max − α_min)(1 + cos(tπ/T)), where α_max and α_min are the ranges for the learning rate, and t, T account for how many epochs have been performed since the last period
• Step schedule — ResNets: multiply the learning rate by 0.1 after epochs 30, 60, and 90
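Both schedules can be sketched as small functions (the cosine form matches the range/period description above; the step form follows the ResNet recipe):

```python
import math

def cosine_lr(t, T, lr_max=0.1, lr_min=0.0):
    """Cosine schedule: decays from lr_max to lr_min over T epochs per period."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def step_lr(epoch, base_lr=0.1):
    """ResNet-style schedule: multiply the LR by 0.1 after epochs 30, 60, 90."""
    factor = sum(epoch >= e for e in (30, 60, 90))
    return base_lr * 0.1 ** factor
```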
• Forward pass: computes the loss
• Backward pass: computes the gradients
GoogLeNet: stacked Inception modules (with dropout)
• Naive Inception module (input: 28x28x256): parallel [5x5 conv, 96], [3x3 conv, 192], [3x3 pool, stride 1], and [1x1 conv, 128] branches, concatenated
• Total: 854M ops (very expensive compute); "ops" = number of mathematical operations carried out within the module
• Problem: computational complexity
• Inception module with dimension reduction (input: 28x28x256): 1x1 convolutions reduce feature depth (256 → 64) before the expensive 3x3 and 5x5 convs, with a 1x1 conv after the 3x3 pool:
  • [1x1 conv, 64] 28x28x64x1x1x256
  • [1x1 conv, 64] 28x28x64x1x1x256
  • [1x1 conv, 128] 28x28x128x1x1x256
  • [3x3 conv, 192] 28x28x192x3x3x64
  • [5x5 conv, 96] 28x28x96x5x5x64
  • [1x1 conv, 64] 28x28x64x1x1x256
• Total: 358M ops (compared to 854M ops)
• Note that: 1x1 convolutions reduce feature depth (256 → 64)
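The naive module's 854M figure can be checked by counting output-pixel × filter-size multiplies per branch, as sketched below (pooling adds no multiplies; branch sizes are those listed above):

```python
def conv_ops(out_hw, num_filters, k, in_depth):
    """Multiplies for one conv layer: H_out * W_out * filters * kH * kW * depth."""
    return out_hw * out_hw * num_filters * k * k * in_depth

# Naive Inception module on a 28x28x256 input: every branch sees depth 256.
naive = (conv_ops(28, 128, 1, 256)    # 1x1 conv, 128
         + conv_ops(28, 192, 3, 256)  # 3x3 conv, 192
         + conv_ops(28, 96, 5, 256))  # 5x5 conv, 96
print(naive)  # 854196224, i.e. the ~854M ops quoted above
```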
ResNet
[Figure: global average pooling layer; training error versus number of convolutional layers; a residual block on a 28 × 28 × 256 input.]
A note on efficiency
• ResNet uses a “bottleneck” layer (1x1 conv) to improve efficiency (similar to GoogLeNet):
  • Reducing depth from 256 to 64: [1x1 conv, 64] 28x28x64x1x1x256
  • The 3x3 conv then operates over only 64 feature maps
  • A final [1x1 conv, 256] projects back to 256 feature maps
• Improving ResNets — SE-ResNet module:
  • Add a “feature recalibration” module that learns to adaptively reweight feature maps
  • Global information (global avg. pooling layer) together with 2 FC layers is used to determine the feature-map weights
[Figure: ILSVRC results by year; first CNN-based winner.]
Transfer learning
[Figure: earlier CNN layers give a more generic representation; later layers give a more specific representation.]
[Figure: image captioning — an image feature vector from a pre-trained CNN is combined with word feature vectors pre-trained with word2vec.]
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
Sentence classification
[Figure: a text of 7 words is represented as a 50x7x1 input (7 words, 50-dim word vectors). Convolving with 16 filters of size 50x2x1 (Conv + ReLU) gives 6x1x16 activation maps; 1x3 max pooling downsamples them to 2x1x16; the pooled features are concatenated (2 × 16 = 32) and fed to a fully connected layer with softmax over 2 classes.]
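One conv filter over the word matrix can be sketched as follows (a width-2 filter spans full 50-dim word vectors and slides over the 7 word positions; random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

text = rng.standard_normal((50, 7))    # 7 words, 50-dim word vectors
filt = rng.standard_normal((50, 2))    # one 50x2 filter (covers a word bigram)
b = 0.0                                # bias

# Sliding over the 6 bigram positions gives a 6-long activation map ...
fmap = np.array([np.sum(filt * text[:, i:i + 2]) + b for i in range(6)])
fmap = np.maximum(0, fmap)             # ... followed by ReLU

pooled = fmap.max()                    # max pooling over time: one value per filter
```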
[Figure: downsampling via convolution (x ∗ w) and upsampling via transpose convolution (∗ wᵀ).]
Region of Interest
• Detectron2 (PyTorch)
• https://github.com/facebookresearch/detectron2
• Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...
[Figure: a trained CNN assigns softmax probability 0.95 to the original image but only 0.45 to a slightly modified one.]
• Recall the convolution computed over a 5 × 5 patch of a 3-channel input:
  Σ_{k=1}^{3} Σ_{j=1}^{5×5} wᵢⱼ,ₖ · in(patchᵢ)ⱼ,ₖ + b
Adversarial robustness
• Vulnerability of neural networks to adversarial examples:
  • An input image is slightly changed by an attacker to trick a neural net classifier into making a wrong classification
• These inputs can be created in a variety of ways, including straightforward strategies such as FGSM in:
  • Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
• It has been shown that such attacks can drastically decrease the accuracy of convolutional neural networks on image classification tasks
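The FGSM perturbation, x_adv = x + ε · sign(∇ₓL), can be sketched on a hypothetical toy logistic model (the model and numbers below are illustrative, not from the paper):

```python
import numpy as np

def fgsm(x, grad_x, eps=0.1):
    """FGSM: nudge each input dimension by eps in the loss-increasing direction."""
    return x + eps * np.sign(grad_x)

# Toy logistic "classifier": p = sigmoid(w.x); loss L = -log p for true label 1.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, -0.1, 0.3])
p = 1 / (1 + np.exp(-w @ x))
grad_x = -(1 - p) * w          # dL/dx = -(1 - sigmoid(w.x)) * w

x_adv = fgsm(x, grad_x)
p_adv = 1 / (1 + np.exp(-w @ x_adv))   # true-class confidence after the attack
```

Even with a tiny per-dimension budget ε, the perturbation is aligned with the loss gradient, so the true-class confidence p_adv drops below p.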