CNNs Explained in 12 Steps

CNNs
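Convolutional Neural Networks
What is convolution, really?
▪ There is a subtle difference between the convolution used in deep learning and the convolution
of mathematics: the deep learning version does not flip the kernel before sliding it.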
▪ If we were in a pedantic mood, we could call convolutions discrete cross-correlations.
Stevens, Eli, Luca Antiga, and Thomas Viehmann. Deep learning with PyTorch. Manning Publications Company, 2020.
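The difference between the two operations can be seen on a tiny 1D example. This is a minimal sketch in plain Python; the helper names `cross_correlate` and `convolve` are illustrative and not taken from any library.

```python
def cross_correlate(signal, kernel):
    """Slide the kernel over the signal WITHOUT flipping it.

    This is what deep learning frameworks usually call "convolution".
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def convolve(signal, kernel):
    """Mathematical convolution: flip the kernel first, then slide it."""
    return cross_correlate(signal, kernel[::-1])


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]

print(cross_correlate(signal, kernel))  # [-2, -2, -2]
print(convolve(signal, kernel))         # [2, 2, 2]
```

Because the kernel values are learned, the flip makes no practical difference to a network: it would simply learn the flipped kernel.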
Structure of convnets
▪ Convnets used for image classification comprise two parts:
▪ They start with a series of pooling and convolution layers, called the convolutional base
of the model.
▪ They end with a densely connected classifier.
Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018
Convolution: locality and translation invariance
▪ Locality: if we want to recognize patterns
corresponding to objects, like an airplane in
the sky, we will likely need to look at how
nearby pixels are arranged, and we will be
less interested in how pixels that are far from
each other appear in combination.
Essentially, it doesn’t matter if our image of a
Spitfire has a tree or cloud or kite in the
corner or not.
▪ Translation invariance: we would like these localized patterns to have
an effect on the output regardless of their location in the image: that
is, to be translation invariant.
Advantages brought by convolution
▪ A convolution is equivalent to having multiple linear operations whose weights are zero
almost everywhere except around individual pixels and that receive equal updates during
training.
▪ Summarizing, by switching to convolutions, we get:

▪ Local operations on neighbourhoods
▪ Translation invariance
▪ Models with far fewer parameters. The number of parameters no longer depends on
the size of the image but on the size of the convolution kernel and on the number of
convolution filters.
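The equivalence between a convolution and a mostly-zero linear operation with shared weights can be checked directly in 1D. The sketch below is illustrative only; `conv_as_matrix` and `matvec` are hypothetical helper names, and the input values are arbitrary.

```python
def conv1d(x, kernel):
    """Slide the kernel over x (no padding, stride 1)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def conv_as_matrix(kernel, input_len):
    """Build the dense weight matrix of the equivalent linear layer."""
    k = len(kernel)
    rows = []
    for i in range(input_len - k + 1):
        row = [0] * input_len
        row[i:i + k] = kernel  # the same weights in every row, shifted by one
        rows.append(row)
    return rows


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


x = [3, 1, 4, 1, 5, 9]
kernel = [2, -1, 1]
W = conv_as_matrix(kernel, len(x))

for row in W:
    print(row)                              # zero everywhere except a shifted copy of the kernel
print(matvec(W, x) == conv1d(x, kernel))    # True

dense_params = len(W) * len(x)  # a dense layer would learn every entry independently
conv_params = len(kernel)       # the convolution learns only the shared kernel
print(dense_params, conv_params)            # 24 3
```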

Convolution and deconvolution
▪ The word convolve means to roll or slide.
▪ The 3x3 kernel slides over a 4x4 input; at each position, the dot product between
the kernel and the patch it covers is computed, giving a 2x2 output.
▪ Convolutional NNs downsample by nature, as the convolution operator leaves the
output with smaller spatial dimensions, but there are ways to counteract this,
such as padding.
(Animated figure in the original presentation.)
https://medium.com/@marsxiang/convolutions-transposed-and-deconvolution-6430c358a5b6
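The sliding dot product for a 4x4 input and a 3x3 kernel can be sketched as follows. The function name `conv2d_valid` and the input and kernel values are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]

out = conv2d_valid(image, kernel)
print(out)  # [[18, 21], [30, 33]]
```

The kernel fits into the input at only 2x2 positions, which is why the output is smaller than the input.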
Deconvolution
▪ When we have neural networks generate images, we often have them build them up from
low resolution, high-level descriptions. This allows the network to describe the rough image
and then fill in the details.
▪ In order to do this, we need some way to go from a lower resolution image to a higher one.
We generally do this with the deconvolution operation. Roughly, deconvolution layers allow
the model to use every point in the small image to “paint” a square in the larger one.
https://distill.pub/2016/deconv-checkerboard/
CNN: wider, deeper and higher resolution
A deeper network means more convolutional layers.
A wider network means more feature maps (filters) in the convolutional layers. Essentially,
more channels per layer.
A network with higher resolution means that it processes input images with larger width
and height (spatial resolution). That way the produced feature maps will have higher
spatial dimensions.
https://theaisummer.com/cnn-architectures/
CNNs vs. fully connected NNs: #1
In a fully connected (linear) layer, which is also called a dense layer, every node
from one layer is connected to every node in the subsequent layer. You can
imagine how for images this would increase the complexity quite a lot.
A CNN leverages the spatial structure between the pixels (via filters) to reduce the number of
connections between two layers, significantly improving the speed of training while at the same
time reducing the number of model parameters.
Jibin Mathew, PyTorch Artificial Intelligence Fundamentals
CNNs vs. fully connected NNs: #2
▪ Dense layers learn global patterns in their input feature space (for example, for an MNIST
digit, patterns involving all pixels).
▪ Convolution layers learn local patterns: in the case of images, patterns found in small 2D
windows of the inputs.
Chollet, Francois. Deep learning with Python. Vol. 361. New York: Manning, 2018.
Two key characteristics of convolution nets
▪ The patterns they learn are translation invariant: After learning
a certain pattern in the lower-right corner of a picture, a
convnet can recognize it anywhere: for example, in the
upper-left corner. A densely connected network would have to
learn the pattern anew if it appeared at a new location. This
makes convnets data efficient when processing images.
▪ They can learn spatial hierarchies of patterns. A first
convolution layer will learn small local patterns such as edges, a
second convolution layer will learn larger patterns made of the
features of the first layers, and so on. This allows convnets to
efficiently learn increasingly complex and abstract visual
concepts (because the visual world is fundamentally spatially
hierarchical).
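The first characteristic can be illustrated with a small 1D experiment: the same filter responds equally strongly to a pattern wherever it appears, and only the position of the response moves. This is a sketch with made-up values, not code from the cited book.

```python
def conv1d(x, kernel):
    """Slide the kernel over x (no padding, stride 1)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


pattern = [1, 2, 1]
kernel = [1, 2, 1]  # a filter tuned to the pattern

left = pattern + [0, 0, 0, 0, 0]   # pattern at the start of the signal
right = [0, 0, 0, 0, 0] + pattern  # same pattern at the end

resp_left = conv1d(left, kernel)
resp_right = conv1d(right, kernel)

print(resp_left)   # [6, 4, 1, 0, 0, 0]
print(resp_right)  # [0, 0, 0, 1, 4, 6]

# The peak response is identical; only its location differs.
print(max(resp_left) == max(resp_right))                                   # True
print(resp_left.index(max(resp_left)), resp_right.index(max(resp_right)))  # 0 5
```

Strictly speaking, the convolution output shifts together with the input (equivariance); taking a maximum over positions afterwards is what removes the dependence on location.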
Convolution reduces overfitting
▪ A single, small set of weights can train over a much
larger set of training examples, because even though
the dataset hasn’t changed, each mini-kernel is
forward propagated multiple times on multiple
segments of data, thus changing the ratio of weights
to datapoints on which those weights are being
trained.
▪ This has a powerful impact on the network, drastically
reducing its ability to overfit to training data and
increasing its ability to generalize.
Trask, Andrew W. "Deep learning." (2019).
Convolutional Neural Network (CNNs / ConvNets): #0
▪ A densely connected network architecture does not take into account the spatial structure of the
images. For instance, it treats input pixels which are far apart and close together on exactly
the same footing. Such concepts of spatial structure must instead be inferred from the
training data.
▪ But what if, instead of starting with a network architecture which is tabula rasa, we used an
architecture which tries to take advantage of the spatial structure? The name convolutional
comes from the fact that the operation in the underlying equation is sometimes known as a
convolution.
▪ ConvNet architectures make the explicit assumption that the inputs are images.
http://neuralnetworksanddeeplearning.com/chap6.html#other_techniques_for_regularization
Convolutional Neural Network (CNNs / ConvNets): #0.1
▪ Scale up neural networks to process very large images / video sequences
▪ Replace matrix multiplication in neural nets with convolution
▪ Everything else stays the same: Maximum likelihood, back-propagation
http://www.deeplearningbook.org/slides/09_conv.pdf
▪ Convolutional Neural Networks take advantage of the fact that the input consists of images
and they constrain the architecture in a more sensible way.
▪ In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons
arranged in 3 dimensions: width, height, depth.
https://cs231n.github.io/convolutional-networks/
Regular NN
https://cs231n.github.io/convolutional-networks/
Convolutional Neural Network (CNNs / ConvNets): #3
A convolution processes a chunk of an image by multiplying it element-wise with a kernel
matrix and summing the result.
Each kernel matrix element is therefore a trainable weight of the network, modified
during training using backpropagation for the best performance of the network itself.
Mueller, John Paul, and Luca Massaron. Machine learning for dummies. John Wiley & Sons, 2016.
Convolutional Neural Network (CNNs / ConvNets): #4
▪ Another interesting aspect of this process is that each kernel specializes in finding specific
aspects of an image.
▪ Borders in an image are easily detected after a suitable 3x3-pixel kernel is applied.
Mueller, John Paul, and Luca Massaron. Machine learning for dummies. John Wiley & Sons, 2016.
Why are convolutions so good for images
when compared to fully connected layers?
▪ Parameter sharing: a feature detector, such as a vertical edge detector, that's useful in one part of
the image is probably useful in another part of the image
▪ Sparsity of connections: in each layer, each output value depends only on a small number of
inputs
What are convolutions?
▪ Convolutions are a component within CNNs. They are defined as a layer within the CNN. In
a convolution layer, we slide a filter matrix over the entire image matrix from left to right
and from top to bottom, and at each position we take the dot product of the filter with the
image patch it currently covers.
▪ If the two matrices have high values in the same positions, the dot product's output will be
high, and vice versa.
▪ The output of the dot product is a scalar value that identifies the correlation between the
pixel pattern in the image and the pixel pattern expressed by the filter. If we were in a
pedantic mood, we could call convolutions discrete cross-correlations.
Jibin Mathew, PyTorch Artificial Intelligence Fundamentals.

Convolution [intuition]
If we have a 64x64x3 image, then a fully connected layer is acceptable
from a memory standpoint. However, if we'd like to
use high-resolution images, say 1000x1000x3, with
1000 hidden units (classical NN), the number of parameters
becomes 1000x1000x3x1000 = 3 billion, which can
become unmanageable. The solution to this
problem is to use a convolutional NN.
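The arithmetic above can be reproduced in a few lines. The convolution layer used for comparison (64 filters of size 3x3) is an assumed example, not a figure from the slide, and biases are ignored throughout.

```python
def dense_params(height, width, channels, hidden_units):
    """Weights of one fully connected layer applied to a flattened image."""
    return height * width * channels * hidden_units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights of one convolution layer with square kernels."""
    return kernel_size * kernel_size * in_channels * out_channels


print(dense_params(64, 64, 3, 1000))      # 12288000
print(dense_params(1000, 1000, 3, 1000))  # 3000000000
print(conv_params(3, 3, 64))              # 1728, whatever the image size
```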
How the convolution operator works (CNNs)
Essentially an element-wise multiplication followed by the sum of all the resulting entries.
In the literature the filter is also called a kernel.
Animated picture of convolution
▪ A convolutional layer has a set of matrices that are multiplied with the previous layer's
output, in a process called convolution, in order to detect features. These features can be
basic (e.g. an edge, a colour grade or a pattern) or complex (e.g. a shape, a nose or a mouth).
Those matrices are called filters or kernels.
https://towardsdatascience.com/what-is-wrong-with-convolutional-neural-networks-75c2ba8fbd6f
Edge detection and convolution detector
Apart from the technicalities, the key message here is the following:
the convolution operator (*) offers a CONVENIENT way to detect edges.
Edge detection: options
Option #1: you can hand-pick the filter (Sobel filter, ….)
Option #2: treat the 9 numbers as weights and let the algorithm learn the values
Padding: #1
Padding is a way to increase the output dimensions. The edges of
the input are filled with 0's, which do not affect the dot product
but give the kernel more space to slide.
Padding: #2
▪ It allows you to use a CONV layer
without necessarily shrinking the
height and width of the volumes. This
is important for building deeper
networks, since otherwise the
height/width would shrink as you go
to deeper layers.
▪ An important special case is the
"same" convolution, in which the
height/width is exactly preserved
after one layer.
▪ It helps us keep more of the
information at the border of an
image.
Padding: #3
▪ Padding: This is the strategy that we apply to the
edges of an image while we convolve, depending on
whether we want to keep the dimensions of the
tensors the same after convolution or only apply
convolution where the filter fits properly within the
input image. If we want to keep the dimensions the
same, then we need to zero-pad the edges so that the
output dimensions after convolution match the
original ones. This is called same padding.
▪ But if we don't want to preserve the original
dimensions, then the places where the filter doesn't
fit completely are truncated, which is called valid
padding.
Jibin Mathew, PyTorch Artificial Intelligence Fundamentals
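The two padding modes can be compared on a 5x5 input with a 3x3 kernel. This is a minimal sketch; `zero_pad` and `conv2d_valid` are illustrative helper names, and a pad width of 1 is what "same" padding requires for a 3x3 kernel at stride 1.

```python
def zero_pad(image, p):
    """Surround the image with a border of p zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    padded += [[0] * p + list(row) + [0] * p for row in image]
    padded += [[0] * width for _ in range(p)]
    return padded


def conv2d_valid(image, kernel):
    """Slide the kernel only where it fits completely (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


image = [[1] * 5 for _ in range(5)]
kernel = [[1] * 3 for _ in range(3)]

valid = conv2d_valid(image, kernel)               # valid padding: no padding at all
same = conv2d_valid(zero_pad(image, 1), kernel)   # same padding: pad by (3 - 1) // 2 = 1

print(len(valid), len(valid[0]))  # 3 3
print(len(same), len(same[0]))    # 5 5
print(same[0])                    # [4, 6, 6, 6, 4]
```

The smaller values at the ends of `same[0]` show the padded zeros entering the sum at the border.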

Stride
Stride: This is the number of pixels that we
shift both horizontally and vertically before
applying the filter to the next patch of the
image.
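The effect of the stride on the output length can be sketched in 1D, together with the usual output-size formula floor((n + 2p - k) / s) + 1. The function names and input values are illustrative.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one axis: floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def conv1d_strided(x, kernel, stride):
    """Apply the kernel at every `stride`-th position (no padding)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]


x = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

print(conv1d_strided(x, kernel, 1))      # [6, 9, 12, 15, 18]
print(conv1d_strided(x, kernel, 2))      # [6, 12, 18]
print(conv_output_size(7, 3, stride=2))  # 3
print(conv_output_size(4, 3))            # 2, a 3x3 kernel on a 4x4 input
```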
Dilations
▪ Dilations can be used to control the output size, but
their main purpose is to expand the range of what a
kernel sees. In the animation on the right the dilation
is set to 2.
▪ This can be useful for summarizing larger regions of
the input space without an increase in the number of
parameters. Dilated convolutions have proven very
useful when convolution layers are stacked. Successive
dilated convolutions exponentially increase the size of
the “receptive field”; that is, the size of the input
space seen by the network before a prediction is
made.
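A dilated convolution and the growth of the receptive field can be sketched in 1D. The helper names are hypothetical, and the receptive-field formula assumes stacked stride-1 layers with the same kernel size.

```python
def conv1d_dilated(x, kernel, dilation):
    """Apply the kernel with gaps of `dilation` between the taps (no padding)."""
    k = len(kernel)
    span = dilation * (k - 1) + 1  # input positions covered by one application
    return [sum(x[i + j * dilation] * kernel[j] for j in range(k))
            for i in range(len(x) - span + 1)]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    field = 1
    for d in dilations:
        field += d * (kernel_size - 1)
    return field


x = list(range(10))
print(conv1d_dilated(x, [1, 1, 1], 2))  # [6, 9, 12, 15, 18, 21]

# Three stacked 3-tap layers: plain versus doubling dilations.
print(receptive_field(3, [1, 1, 1]))    # 7
print(receptive_field(3, [1, 2, 4]))    # 15
```

Both stacks use the same number of weights; only the spacing of the taps changes.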
Transpose convolution
▪ Transposed convolution layers are upsampling in nature.
They are usually used in autoencoders and GANs, or
wherever the network must reconstruct an image.
▪ The word "transpose", in the context of convolutional
NNs, means that the input and the output dimensions are
switched relative to the corresponding convolution.
Transposed convolution vs. deconvolution
▪ They are not the same thing, although the terms are often used interchangeably.
▪ A deconvolution is a mathematical operation that reverses the convolution.
▪ On the other hand, a transposed convolution layer only reconstructs the spatial dimensions
of the input. It does not recover the original input values.
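The difference can be seen in a small 1D sketch: a transposed convolution restores the length of the original input but not its values. The function names and numbers are illustrative, and no padding or output cropping is modeled.

```python
def conv1d(x, kernel, stride=1):
    """Ordinary convolution (no padding): shrinks the input."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]


def conv_transpose1d(x, kernel, stride=1):
    """Each input value 'paints' a scaled copy of the kernel into the output."""
    k = len(kernel)
    out = [0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j in range(k):
            out[i * stride + j] += value * kernel[j]
    return out


x = [1, 2, 3, 4, 5]
kernel = [1, 1, 1]

small = conv1d(x, kernel)
restored = conv_transpose1d(small, kernel)

print(small)     # [6, 9, 12]            (length 5 -> 3)
print(restored)  # [6, 15, 27, 21, 12]   (length 3 -> 5, but not equal to x)
```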
Convolution over volumes
Convolution operator and output dimensions
Biggest problem with CNN
▪ Commonly cited issues: overfitting, exploding gradients, and class imbalance are the major
challenges when training a model with a CNN.
▪ But there is more: it turns out that pooling is very bad, and the fact that it works so
well is a disaster.
https://towardsdatascience.com/what-is-wrong-with-convolutional-neural-networks-75c2ba8fbd6f
Pooling: #1
Pooling is a destructive, generalizing process intended to reduce overfitting. BUT CNNs have a
habit of overfitting, even with pooling layers. Dropout should be used, for example between fully
connected layers and perhaps after pooling layers.
Pooling: #2
▪ There is more than one type of pooling layer (max
pooling, average pooling, …). The most common these
days is max pooling, because it gives translational
invariance (poor, but good enough for some tasks)
and it reduces the dimensionality of the
network very cheaply (with no parameters).
▪ Max pooling layers are actually very simple: you
predefine a filter (a window) and sweep this
window across the input, taking the max of the
values contained in the window as the output.
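The sweeping window can be sketched in a few lines. The 2x2 window with stride 2 is a common choice assumed here for illustration, and the feature-map values are made up.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Sweep a size x size window over the input and keep the max of each window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [3, 2, 4, 6]]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```

The layer has no weights to learn, and the 4x4 map is reduced to 2x2 while the position of each maximum inside its window is discarded.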
Pooling: #3
▪ The pooling layer is used to reduce the spatial dimension of an input, preserving its depth.
▪ As we move from the initial layer to the later layers in a CNN, we want to identify more
conceptual meaning in the image compared to actual pixel by pixel information, and so we
want to identify and keep key pieces of information from the input and throw away the rest.
▪ A pooling layer helps us do that.
Jibin Mathew, PyTorch Artificial Intelligence Fundamentals.

Pooling: #4
▪ Here are the main reasons to use a pooling layer:
▪ Reduction in the number of computations: We get better computational performance
by reducing the spatial dimensions of the input without losing out on the filters, and so
we reduce the time needed to train, as well as the computational resources.
▪ Prevent overfitting: With reduced spatial dimensions, we reduce the number of
parameters the model has, which in turn reduces the model complexity and helps us
generalize better.
▪ Positional invariance: This allows the CNN to capture the features within an image,
irrespective of where the feature is located in a given image. Say that we are trying to
build a classifier to detect mangoes. It doesn't matter whether the mango is in the
center, top-left, bottom-right, or wherever in the image—it needs to be detected.
▪ ATTENTION: the last point is also a deficiency, see next slide!

Jibin Mathew, PyTorch Artificial Intelligence Fundamentals.
Why are pooling layers a big mistake?
▪ Pooling layers are a big mistake because they lose a lot of
valuable information and ignore the relation between
the part and the whole.
▪ Consider a face detector: we have to combine some
features (mouth, two eyes, face oval and a nose) to say
that something is a face. A CNN would say that if those 5
features are present with high probability, this is a
face, regardless of how they are arranged relative to one
another.
Network in network
Lin et al., 2013, Network in network

Are convnets black boxes?
▪ It’s often said that deep-learning models are “black boxes”: learning representations that are difficult to
extract and present in a human-readable form. This is not true for convnets!
▪ The representations learned by convnets are highly amenable to visualization, in large part because
they’re representations of visual concepts.
▪ We won't survey all of the visualization techniques, but we'll cover three of the most accessible and useful ones:
▪ Visualizing intermediate convnet outputs (intermediate activations): useful for understanding how
successive convnet layers transform their input, and for getting a first idea of the meaning of
individual convnet filters.
▪ Visualizing convnet filters: useful for understanding precisely what visual pattern or concept each
filter in a convnet is receptive to.
▪ Visualizing heatmaps of class activation in an image: useful for understanding which parts of an
image were identified as belonging to a given class, thus allowing you to localize objects in images.
What do hidden layers
in a CNN learn?
Some of the activations created by
the fifth convolution layer. We can
see that the early layers detect
lines and edges.
Look at the last CNN layer: the last
layers tend to learn higher-level
features (such as eyes, buildings,
trees) and are less interpretable.
Subramanian, Vishnu. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd, 2018.
Convolution network on sequence data
▪ We know how CNNs solve problems in computer vision by learning features from the
images. CNNs work by convolving across height and width. In the same way, time can be
treated as a convolution feature. One-dimensional convolutions sometimes perform better
than RNNs and are computationally cheaper.
Subramanian, Vishnu. Deep Learning with PyTorch: A practical approach to building neural network models using PyTorch. Packt Publishing Ltd, 2018.
What are Network-in-Network Connections (1x1
Convolutions)?
Network in network (NiN) connections are convolutional kernels with kernel_size=1 and have a few interesting
properties. In particular, a 1×1 convolution acts like a fully connected linear layer across the channels. This is
useful in mapping from feature maps with many channels to shallower feature maps. The figure shows a single
NiN connection being applied to an input matrix: it reduces the two channels down to a single channel. Thus,
NiN or 1×1 convolutions provide an inexpensive way to incorporate additional nonlinearity with few
parameters.
Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build intelligent language
applications using deep learning. O'Reilly Media, Inc., 2019.
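A 1x1 convolution that reduces two channels to one can be sketched as a per-pixel weighted sum across channels. The function `conv1x1`, the weights and the input values are made up for illustration, and the nonlinearity that normally follows is omitted.

```python
def conv1x1(feature_maps, weights, bias=None):
    """Mix channels independently at every pixel.

    feature_maps: [C_in][H][W], weights: [C_out][C_in], returns [C_out][H][W].
    """
    c_in = len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    bias = bias or [0] * len(weights)
    return [[[sum(weights[o][c] * feature_maps[c][r][col] for c in range(c_in)) + bias[o]
              for col in range(w)]
             for r in range(h)]
            for o in range(len(weights))]


two_channels = [[[1, 2], [3, 4]],        # channel 0
                [[10, 20], [30, 40]]]    # channel 1

weights = [[1, 2]]  # one output channel = 1 * channel 0 + 2 * channel 1

out = conv1x1(two_channels, weights)
print(len(out))  # 1
print(out[0])    # [[21, 42], [63, 84]]
```

The spatial size is unchanged; only the channel dimension shrinks, and the layer needs just C_out x C_in weights.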
Training a new model from scratch using
what little data you have
▪ Having to train an image-classification model using very little data is a common situation, which you’ll
likely encounter in practice if you ever do computer vision in a professional context.
▪ There exist 3 strategies to tackle this problem:
▪ Data augmentation
▪ Feature extraction with a pretrained network
▪ Fine-tuning a pretrained network
▪ What’s more, deep-learning models are by nature highly repurposable: you can take, say, an
image-classification or speech-to-text model trained on a large-scale dataset and reuse it on a
significantly different problem with only minor changes. Specifically, in the case of computer vision, many
pretrained models (usually trained on the ImageNet dataset) are now publicly available for download
and can be used to bootstrap powerful vision models out of very little data.
How to combat overfitting in computer vision
when you have a small dataset?
▪ Because you have relatively few training samples (2,000), overfitting will be your
number-one concern.
▪ Dropout and L2 regularization are both valid techniques to combat overfitting.
▪ However, when it comes to computer vision, data augmentation is used almost universally.
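Data augmentation generates new training samples by applying random, label-preserving transformations to existing images. The sketch below uses only a random horizontal flip and a random crop; both the choice of transformations and the helper names are assumptions for illustration, and real pipelines typically use a library for this.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, size, rng):
    """Cut a random size x size window out of the image."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, crop_size, rng):
    """Flip with probability 0.5, then take a random crop."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, crop_size, rng)


rng = random.Random(0)  # seeded so that runs are repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Several different training samples derived from one source image.
variants = [augment(image, 3, rng) for _ in range(5)]
for variant in variants:
    print(variant)
```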
1D convnets
▪ In the same way that 2D convnets perform well for processing visual patterns in 2D space,
1D convnets perform well for processing temporal patterns.
▪ They offer a faster alternative to RNNs on some problems, in particular natural-language
processing tasks.
▪ Typically, 1D convnets are structured much like their 2D equivalents from the world of
computer vision: they consist of stacks of Conv1D layers and MaxPooling1D layers, ending
in a global pooling operation or flattening operation.
▪ Because RNNs are extremely expensive for processing very long sequences, but 1D
convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step
before an RNN, shortening the sequence and extracting useful representations for the
RNN to process.
Chollet, Francois. Deep learning with Python. Simon and Schuster, 2017.
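How a convolution followed by max pooling shortens a sequence can be sketched in plain Python. This mimics one Conv1D plus MaxPooling1D stage with a single hand-written filter; in a real network the kernel would be learned, and the sequence values here are made up.

```python
def conv1d(seq, kernel):
    """Slide the kernel along the time axis (no padding, stride 1)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def relu(seq):
    """Keep positive responses, zero out the rest."""
    return [max(0, v) for v in seq]


def max_pool1d(seq, size):
    """Keep the strongest response in each non-overlapping window."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


sequence = [0, 1, 0, 2, 5, 2, 0, 1, 0, 0, 3, 0]  # 12 time steps
kernel = [-1, 2, -1]                             # responds to local peaks

features = relu(conv1d(sequence, kernel))
shortened = max_pool1d(features, 2)

print(len(sequence), len(features), len(shortened))  # 12 10 5
print(shortened)                                     # [2, 6, 0, 2, 6]
```

The shortened sequence of features is what a downstream RNN would then process.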
Scaling images
