
Unit 4

Convolutional Neural Networks


Computer Vision Applications

• Detects objects and draws a bounding box around each one to show its location
Computer Vision Applications
• Combines a content image and a style image (neural style transfer)
Deep Learning on large images
• A 1000 × 1000 × 3 image has 3,000,000 pixel values, so the input layer needs 3,000,000 neurons
• With 1000 neurons in the first hidden layer, a fully connected network needs 3,000,000 × 1000 = 3 billion weights
• Such a network has far too many weights
• Instead, use a Convolutional Neural Network
Convolutional Neural Network
• A Deep Learning algorithm that takes in an input image
and assigns importance (learnable weights and biases) to various objects in the image
• And is able to differentiate one object from another
• The pre-processing required in a ConvNet is much lower than for other classification algorithms
• ConvNets have the ability to learn their own filters
Convolutional Neural Network
• Some of the areas where CNNs are widely used:
• image recognition
• image classification
• object detection
• face recognition
How a Computer Reads an Image

• A computer sees an input image as an array of pixels

• Based on the image resolution, the size of the image matrix is
h × w × d (h = height, w = width, d = dimension/number of planes)
• For a color image
• Size of the image = h × w × 3
ANN with Fully Connected Layer
• Ex: For a grey-scale image, matrix size = 28 × 28 × 1 = 784 pixels
• If it is a color image then
size = 28 × 28 × 3 = 2352 pixels

All neurons of a layer are connected to all neurons of the previous layer in a fully connected network
ANN with Fully Connected Layer
• In real life, images are at least 200 × 200 pixels
• For a color image of size 200 × 200 × 3, the number of pixel values is 120,000
• A fully connected network would require a large number of neurons and weights
• A large number of hidden neurons may lead to overfitting
• Therefore fully connected networks cannot be used directly for images
Convolutional Neural Network
• In a CNN, each neuron in a layer is connected to only a small number of neurons of the previous layer
• Thus it requires far fewer weights
Ex: Convolutional Neural Network
• Classify X or O
Ex: Convolutional Neural Network
• Deformed images of X and O
Image representation
• A computer understands an image using the values of its pixels
• Assume a black pixel is 1 and a white pixel is -1
Image representation
• Using naive techniques, the computer compares images pixel by pixel and adds up the matches
• It is not able to recognize a deformed image as X because every pixel must match
How CNN works?
• Compares the image piece by piece
• Pieces may match roughly at the same position or at slightly different locations
Why convolutions?

• Even if an edge in an image is shifted, it can still be detected because parameter sharing is used
Filters for piecewise comparison
• Filters are applied across the larger image to identify the locations of matching pieces of the image

Common filters
How filters work?
• Choose a filter and place it on the image
• If the filter matches, the image is classified correctly
How filters work?
• Multiply each element of the filter with the element of the image it overlaps
• Add the results of the multiplications and normalize the sum
• Replace the centre of the overlapped part of the image with the result
(Figure: a diagonal filter applied to the image)
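Code sketch: one filtering step
• A minimal NumPy sketch of the multiply-add-normalize step described above, assuming the ±1 pixel encoding from the earlier slides; the patch and filter values are illustrative, not taken from the figures

import numpy as np

# Hypothetical 3x3 patch of the image (pixels encoded as +1 or -1, as in the slides)
patch = np.array([[ 1, -1, -1],
                  [-1,  1, -1],
                  [-1, -1,  1]])

# Diagonal filter (same values as a perfectly matching piece of the image)
diag_filter = np.array([[ 1, -1, -1],
                        [-1,  1, -1],
                        [-1, -1,  1]])

# Multiply element-wise, add up, and normalize by the number of elements
score = (patch * diag_filter).sum() / diag_filter.size
print(score)  # 1.0 -> perfect match; lower values mean a weaker match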
How filters work?
• Move the filter to another location and repeat
How filters work?
• Move the filter to every position and determine the output
• Where the filter matches the overlapped part of the image, the result is larger than where it does not match
How filters work?
• Apply more filters to get one output image per filter
ReLU function
• Remove negative values from the filtered image and replace them with zero
ReLU layer
• Apply this to the filtered image produced by one filter
ReLU layer
• Repeat for the outputs of all filters
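Code sketch: ReLU
• A minimal NumPy sketch of the ReLU step on a filtered image; the values are illustrative

import numpy as np

# Hypothetical filtered image containing negative values
filtered = np.array([[ 0.77, -0.11,  0.33],
                     [-0.11,  1.00, -0.11],
                     [ 0.33, -0.11,  0.77]])

# ReLU: negative values become zero, positive values pass through unchanged
relu_out = np.maximum(filtered, 0)
print(relu_out)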
Pooling Layer
• Shrink the filtered image to a smaller size
• Move a pooling window across the filtered image
• Pick the maximum value in each window (e.g., 0.77)
Pooling Layer
• Apply max pooling on the entire image
Pooling Layer for all channels
• Apply pooling on all filtered images
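Code sketch: 2×2 max pooling
• A minimal NumPy sketch of max pooling with a 2×2 window and stride 2 on one channel; the input values are illustrative

import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2 on a single-channel image (even dimensions assumed)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical 4x4 ReLU output; each 2x2 block is reduced to its maximum
relu_out = np.array([[0.77, 0.00, 0.11, 0.33],
                     [0.00, 1.00, 0.00, 0.33],
                     [0.11, 0.00, 1.00, 0.00],
                     [0.33, 0.33, 0.00, 0.77]])
print(max_pool_2x2(relu_out))  # 2x2 result, one maximum per window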
Stack up the layers

(Figure: the input image passes through stacked convolution, ReLU, and pooling layers)
Multiple Hidden Layers
• Add one more set of same layers
Complete CNN
• Convert the shrunken image into a single list (flatten it)
• Finally use a fully connected layer
• Classification is done at this layer
Complete CNN

• For ‘X’ there are some elements which are high
• The locations of these elements are 1, 4, 5, 10, 11
• For an input image, if these elements are high, the given image is ‘X’
Complete CNN

• For ‘O’, other elements are high
• The locations of these elements are 2, 3, 9, 12
• Such images can be classified as ‘O’
Complete CNN
• Consider the new list for the given image
• Compare it with the list for X
• Add the high values that appear in both lists (at positions 1, 4, 5, 10, 11)
Complete CNN
• Compare it with the list for O
• Add the high values that appear in both lists (at positions 2, 3, 9, 12)
Complete CNN
• Input image is classified as X
CNN models for images
• A CNN has multiple sets of layers
• Each set uses
• convolution with one or more filters
• pooling
• Finally, a fully connected layer
• Convolution with filters is the major component of a CNN
Edge/feature Detection
• The convolution operation is one of the building blocks of a CNN
• It is used to detect features
• Hidden Layer 1 detects edges, Hidden Layer 2 detects partial objects, and Hidden Layer 3 detects complete objects
Edge Detection
• Like diagonal filters, vertical and horizontal
edge filters are used to detect edges
Vertical Edge Detection
• Ex: a grey-scale image (has one plane)
(Figure: a 6 × 6 matrix representation of the image, a 3 × 3 vertical-edge filter/kernel placed on the top left, and the resulting filtered image)
Edge Detection
• Filtered value for the top-left position of the image:
• (3 × 1) + (1 × 1) + (2 × 1) + (0 × 0) + (5 × 0) + (7 × 0) + (1 × -1) + (8 × -1) + (2 × -1) = -5
• This value, -5, becomes the top-left entry of the filtered image
(6 × 6 image convolved with the 3 × 3 filter/kernel)
Vertical Edge Detection
• Move the filter right by one unit; the next filtered value is -4
(Filtered image so far: -5, -4)
Vertical Edge Detection
• Move the filter so it covers the entire image
• Border positions where the filter would extend outside the image produce no output, so the filtered image loses the outer rows and columns
• Convolving the 6 × 6 image with the 3 × 3 filter/kernel gives a 4 × 4 filtered image:

 -5  -4   0   8
-10  -2   2   3
  0  -2  -4  -7
 -3  -2  -3 -16
Vertical Edge Detection
• Filtered image (4 × 4):

 -5  -4   0   8
-10  -2   2   3
  0  -2  -4  -7
 -3  -2  -3 -16
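Code sketch: vertical edge detection by convolution
• A minimal NumPy sketch of "valid" convolution with the vertical edge filter above; the image values are illustrative, not the ones in the slide's figure

import numpy as np

def convolve_valid(image, kernel):
    # Naive "valid" 2D convolution (strictly, cross-correlation, as used in CNNs)
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# Hypothetical 6x6 image with a vertical edge: bright (10) on the left, dark (0) on the right
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Vertical edge filter
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]], dtype=float)

print(convolve_valid(image, vertical_filter))  # 4x4 output; large values mark the edge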
Vertical Edge Detection
• Image with a vertical edge
• The matrix representation shows the pixel values
Vertical Edge Detection
• Light-to-dark edges and dark-to-light edges produce filter outputs of opposite sign
Filters for Vertical and Horizontal Edges
• Backpropagation is used to learn the weights of the filters used in a CNN
Padding
• The convolution operation shrinks the image
• It may lose some information near the borders
• Append rows and columns of zeroes around the image before convolving
• An n × n image convolved with an f × f filter/kernel gives a filtered image of size (n − f + 1) × (n − f + 1)
• Ex: the 6 × 6 image and 3 × 3 filter above give a 4 × 4 filtered image
Padding

• If p = 1
• Before filtering, one row and one column of zeroes are appended on each side of the image
• Then the convolved image has the same size as the original
• In general, the size of the convolved image is (n + 2p − f + 1) × (n + 2p − f + 1)
• For a 6 × 6 image, n = 6
• For a 3 × 3 filter, f = 3
• For p = 1, the size of the filtered image is (6 + 2 − 3 + 1) × (6 + 2 − 3 + 1) = 6 × 6
Padding
• "Same" padding: pad such that the output size equals the original image size
(n + 2p − f + 1) = n
p = (f − 1)/2
• Since p should be an integer, f should be odd
Strided Convolution
• Stride, s = 2
• The filter moves by 2 units
Strided Convolution
• Initial location of the filter on the image
• Move the filter 2 units forward
Strided Convolution
• Determine the filtered values
Strided Convolution
• Similarly move 2 units down
Strided Convolution
• The size of the image after convolution is
(⌊(n + 2p − f)/s⌋ + 1) × (⌊(n + 2p − f)/s⌋ + 1)
• Ex: for a 7 × 7 image with f = 3, p = 0, s = 2:
(⌊(7 + 0 − 3)/2⌋ + 1) × (⌊(7 + 0 − 3)/2⌋ + 1) = 3 × 3
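Code sketch: output size of a convolution
• A small helper implementing the size formula above; the example values come from the slides

import math

def conv_output_size(n, f, p=0, s=1):
    # Output width/height of a convolution: floor((n + 2p - f) / s) + 1
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(n=7, f=3, p=0, s=2))  # 3  (7x7 image, 3x3 filter, stride 2)
print(conv_output_size(n=6, f=3, p=1, s=1))  # 6  (padding p = (f-1)/2 keeps the size)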
Convolution on volume
• Convolution on RGB images
• The number of channels (planes) in the filter must match the number of channels in the image
(Figure: image * filter = convolved image)
Convolution on volume
• Add the results of the convolution for each channel
• The output has one channel
• A 3 × 3 × 3 filter (one 3 × 3 slice per channel) has 27 parameters
Convolution on volume
• Filter slices to detect edges only in the red channel (the green and blue slices are all zeros)

1 0 -1
1 0 -1 Red
1 0 -1

0 0 0
0 0 0 Green
0 0 0

0 0 0
0 0 0 Blue
0 0 0
Convolution on volume
• Filters to detect edges on all channels

1 0 -1
1 0 -1 Red
1 0 -1

1 0 -1
1 0 -1 Green
1 0 -1

1 0 -1
1 0 -1 Blue
1 0 -1
Convolution on volume
• Multiple filters detect various types of edges
• Ex: a bank of two filters, one for vertical edges and one for horizontal edges
One Layer of Convolutional Network
• The input activation a[0] is convolved with a bank of two filters W[1] = {W1[1], W2[1]}
One Layer of Convolutional Network
• z[1] = W[1] a[0] + b[1]
• a[1] = g(z[1]), where g is ReLU
• Each filter produces a 4 × 4 output; ReLU is applied after adding the bias b1 or b2
• The layer output a[1] has size 4 × 4 × 2
• Number of parameters of the filters = (27 + 1) × 2 = 56
• Even if the size of the image is large, the number of parameters remains the same
• Therefore the network is less prone to overfitting
Pooling Layer: Max pooling
• Hyperparameters are
• Filter size, f = 2
• Stride, s = 2
• For each region, a large number means a feature (edge or point) was detected
• A feature does not exist in the top-right corner of the matrix (its maximum value is only 2)
Pooling Layer: Max pooling

• Size of the pooled output is ⌊(n + 2p − f)/s⌋ + 1

Pooling Layer: Max pooling
• Max pooling works on each slice (channel) independently
Pooling Layer: Average pooling
• Max pooling is more popular than average pooling
• Except when it is required to collapse a representation
• Ex: 7×7×1000 → 1×1×1000
CNN models for images
• A CNN has multiple sets of layers
• Each set uses
• convolution with one or more filters
• pooling
• Finally, a fully connected layer
• Convolution with filters is the major component of a CNN
Example: Convolutional Neural Network
• Input: a 39 × 39 × 3 image (nH[0] = nW[0] = 39, nC[0] = 3)
• Layer 1: f[1] = 3, s[1] = 1, p[1] = 0, 10 filters → output 37 × 37 × 10 (nH[1] = nW[1] = 37, nC[1] = 10)
• Layer 2: f[2] = 5, s[2] = 2, p[2] = 0, 20 filters → output 17 × 17 × 20 (nH[2] = nW[2] = 17, nC[2] = 20)
• Layer 3: f[3] = 5, s[3] = 2, p[3] = 0, 40 filters → output 7 × 7 × 40 (nH[3] = nW[3] = 7, nC[3] = 40)
Example: Convolutional Neural Network

• In the last step, unroll the 7 × 7 × 40 = 1960 values into a vector and apply a logistic or softmax classifier
• It is a long vector with 1960 elements
• Generally the width and height reduce with each subsequent layer, and the number of filters increases with each subsequent layer
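Code sketch: the example network in Keras
• A minimal Keras sketch of the 39×39×3 example network above, assuming TensorFlow/Keras; the ReLU activations and the single sigmoid output unit are assumptions (the slides only say "logistic or softmax classifier")

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(39, 39, 3)),
    layers.Conv2D(10, kernel_size=3, strides=1, padding="valid", activation="relu"),  # 37x37x10
    layers.Conv2D(20, kernel_size=5, strides=2, padding="valid", activation="relu"),  # 17x17x20
    layers.Conv2D(40, kernel_size=5, strides=2, padding="valid", activation="relu"),  # 7x7x40
    layers.Flatten(),                                                                 # 1960 values
    layers.Dense(1, activation="sigmoid"),
])
model.summary()  # the printed output shapes should match the sizes listed above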
Sizes for lth Convolution Layer
Hyperparameters for each layer
1. f[l] = filter size
2. p[l] = padding
3. s[l] = stride
4. nc[l] = number of filters
5. Each filter is: f[l] × f[l] × nc[l-1]
6. Activations: a[l] → nH[l] × nW[l] × nc[l]
7. Weights: f[l] × f[l] × nc[l-1] × nc[l]
8. Bias: nc[l]
9. Input: nH[l-1] × nW[l-1] × nc[l-1]
10. Output: nH[l] × nW[l] × nc[l]
11. nH[l] = ⌊(nH[l-1] + 2p[l] − f[l]) / s[l]⌋ + 1
12. nW[l] = ⌊(nW[l-1] + 2p[l] − f[l]) / s[l]⌋ + 1
13. For M examples, A[l] → M × nH[l] × nW[l] × nc[l]
LeNet architecture
• An early neural network architecture
• Developed in 1998 by the French-American computer scientist Yann LeCun, together with Léon Bottou, Yoshua Bengio, and Patrick Haffner
• The architecture was developed for the recognition of handwritten and machine-printed characters
• It is the basis of many other deep learning models
LeNet architecture
• Often called the “first architecture” for Convolutional Neural Networks
• especially when trained on the MNIST dataset, an image dataset for handwritten digit recognition
• It is small and easy to understand
• yet large enough to provide interesting results
• Versions are LeNet-1, LeNet-2, …, LeNet-5
LeNet architecture
• Consists of the following layers:
• INPUT => CONV => RELU => POOL => CONV => RELU => POOL
=> FC => RELU => FC
LeNet architecture
• Consists of a total of 7 layers
• 2 sets of Convolution layers
• 2 sets of average pooling layers which are followed by a
flattening convolution layer
• 2 dense fully connected layers
• and finally a softmax classifier
Input Layer of LeNet
• A standard MNIST image is used as the input: a 32 × 32 grayscale image
• Input pixels are normalized so that the white background and the black foreground correspond to -0.1 and 1.175 respectively
• This normalization makes the mean approximately 0 and the variance approximately 1
First Layer of LeNet
• No. of learning parameters
= (Weights + Bias )per filter * No. of filters
• Number of trainable parameters:
= (5 * 5 + 1) * 6 = 156
• No. of neurons = 28*28*6 = 4,704
Second Layer of LeNet
• The pooling operation follows immediately
after the first convolution
• Pooling is performed using 2 * 2 kernels
• Pooling layer of S2 is the average of the pixels
in the 2 * 2 area
Second Layer of LeNet
• No. of neurons = 14*14*6 = 1,176
Third Layer
• No. of neurons = 10*10*16 = 1,600
Fourth Layer
• No. of neurons = 5*5*16 = 400
Fifth Layer
• Fully connected Convolution layer
Sixth Layer
• Consists of 84 neurons
• A dot product between the input vector and the weight vector is performed
• and then a bias is added to it
• The result is then passed through a sigmoidal activation function
Output Layer
• A fully connected softmax output layer
• Gives the probability of occurrence of each output class
• Has 10 possible values, corresponding to the digits from 0 to 9
Tunable parameters for LeNet - 5
For convolutional layers
• Each element of a filter matrix is a weight
• The bias is one tunable parameter at the output of each filter
• Number of tunable parameters for a convolutional layer
= {(filter width × filter height × no. of filters in the previous layer) + 1} × number of filters in the current layer
Tunable parameters for LeNet - 5
For pooling layers
• Averaging is used
• Therefore no parameter is required
For fully connected layers (also called dense layers)
• For “n” inputs and “m” outputs, the number of
weights is n × m
• Additionally, this layer has the bias for each output
node
• Number of parameters = (n+1) × m
LeNet Parameters
Type of Layer          Filter Size   Output Shape    Number of parameters
Input                  -             32 × 32 × 3     -
Convolution Layer 1    5 × 5         28 × 28 × 8     [(5 × 5 × 3) + 1] × 8 = 608
Max Pooling 1          2 × 2         14 × 14 × 8     -
Convolution Layer 2    5 × 5         10 × 10 × 16    [(5 × 5 × 8) + 1] × 16 = 3216
Max Pooling 2          2 × 2         5 × 5 × 16      -
Flatten                -             400             -
Dense 1                -             120             (400 + 1) × 120 = 48120
Dense 2                -             84              (120 + 1) × 84 = 10164
Dense 3                -             10              (84 + 1) × 10 = 850
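Code sketch: the LeNet-style network from the table
• A minimal Keras sketch matching the parameter table above (this variant uses a 32×32×3 input and max pooling, as in the table, rather than the original LeNet-5's grayscale input and average pooling); the ReLU/softmax activations are assumptions

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, kernel_size=5, activation="relu"),   # 28x28x8,  (5*5*3+1)*8  = 608 parameters
    layers.MaxPooling2D(pool_size=2),                     # 14x14x8,  no parameters
    layers.Conv2D(16, kernel_size=5, activation="relu"),  # 10x10x16, (5*5*8+1)*16 = 3216 parameters
    layers.MaxPooling2D(pool_size=2),                     # 5x5x16,   no parameters
    layers.Flatten(),                                     # 400 values
    layers.Dense(120, activation="relu"),                 # (400+1)*120 = 48120 parameters
    layers.Dense(84, activation="relu"),                  # (120+1)*84  = 10164 parameters
    layers.Dense(10, activation="softmax"),               # (84+1)*10   = 850 parameters
])
model.summary()  # the per-layer parameter counts should match the table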
LeNet - 5
• Low level features are horizontal/vertical
edges etc
• High level features are eyes etc
Classic Convolutional Neural Networks
• LeNet-5
• AlexNet
• VGG
• ResNet
• Inception
LeNet - 5
• Uses about 60k parameters
• Today we use 10 million parameters
AlexNet (2012)
• AlexNet has 60M parameters, compared to LeNet's 60k parameters
• It uses the ReLU activation function
• It has more filters in each layer than LeNet
AlexNet (2012)

• AlexNet has the following layers


• Input: Color images of size 227x227x3.
• Conv-1: The first convolutional layer consists of 96 kernels of size 11×11 applied
with a stride of 4 and padding of 0.
• MaxPool-1: The maxpool layer following Conv-1 consists of pooling size of 3×3
and stride 2.
• Conv-2: The second conv layer consists of 256 kernels of size 5×5 applied with a
stride of 1 and padding of 2.
• MaxPool-2: The maxpool layer following Conv-2 consists of pooling size of 3×3
and a stride of 2.
AlexNet (2012)

• Conv-3: The third conv layer consists of 384 kernels of size 3×3 applied with a
stride of 1 and padding of 1.
• Conv-4: The fourth conv layer has the same structure as the third conv layer. It
consists of 384 kernels of size 3×3 applied with a stride of 1 and padding of 1.
• Conv-5: The fifth conv layer consists of 256 kernels of size 3×3 applied with a
stride of 1 and padding of 1.
• MaxPool-3: The maxpool layer following Conv-5 consists of pooling size of 3×3
and a stride of 2.
AlexNet (2012)

• FC-1: The first fully connected layer has 4096 neurons.
• FC-2: The second fully connected layer has 4096 neurons.
• FC-3: The third fully connected layer has 1000 neurons.
AlexNet (2012)
• The size of the output tensor (image) of a convolution layer is

O = ⌊(n + 2p − f)/s⌋ + 1

where
O = Size (width) of the output image.
n = Size (width) of the input image.
f = Size (width) of the kernels used in the Conv Layer.
N = Number of kernels.
s = Stride of the convolution operation.
p = Padding.

• The number of channels in the output image is equal to the number of kernels, N
AlexNet (2012)
Example: In AlexNet, the input image is of size 227x227x3. The first convolutional layer has 96 kernels of size 11x11x3. The stride is 4 and the padding is 0.
Therefore the size of the output image right after the first bank of convolutional layers is

O = (227 + 2×0 − 11)/4 + 1 = 55

• The output image is of size 55x55x96 (one channel for each kernel)
AlexNet (2012)
• Size of the output tensor (image) of a MaxPool layer:

O = (n − f)/s + 1

where
O = Size (width) of the output image.
n = Size (width) of the input image.
f = Size (width) of the kernels used in the pooling layer.
s = Stride of the pooling operation.

Example: In AlexNet, the MaxPool layer after the first bank of convolution filters has a pool size of 3 and a stride of 2. We know from the previous section that the image at this stage is of size 55x55x96. The output image after the MaxPool layer is of size

O = (55 − 3)/2 + 1 = 27

So the output image is of size 27x27x96
AlexNet (2012)
Size of the output of a Fully Connected Layer
• A fully connected layer outputs a vector of
length equal to the number of neurons in the
layer.
AlexNet (2012)

Change in the size of the tensor through AlexNet


• In AlexNet, the input is an image of size 227x227x3.
• After Conv-1, the size changes to 55x55x96, which is transformed to 27x27x96 after MaxPool-1.
• After Conv-2, the size changes to 27x27x256 and following MaxPool-2 it changes to 13x13x256.
• Conv-3 transforms it to a size of 13x13x384,
• while Conv-4 preserves the size and Conv-5 changes the size to 13x13x256.
AlexNet (2012)

Change in the size of the tensor through AlexNet


• Finally, MaxPool-3 reduces the size to 6x6x256.
• This image feeds into FC-1 which transforms it into a vector of size 4096×1.
• The size remains unchanged through FC-2, and finally, we get the output of size
1000×1 after FC-3.
AlexNet (2012)

• Number of Parameters of a Conv Layer


• In a CNN, each layer has two kinds of
parameters : weights and biases.
• The total number of parameters is just the
sum of all weights and biases
AlexNet (2012)
Wc = f² × c × N
Bc = N
Pc = Wc + Bc

Wc = Number of weights of the Conv Layer.
Bc = Number of biases of the Conv Layer.
Pc = Number of parameters of the Conv Layer.
f = Size (width) of kernels used in the Conv Layer.
N = Number of kernels.
c = Number of channels of the input image.

• In a Conv Layer, the depth of every kernel is always equal to the number of channels in the input image
• So every kernel has f² × c parameters, and there are N such kernels.

Example: In AlexNet, at the first Conv Layer, the number of channels (c) of the input image is 3, the kernel size (f) is 11, and the number of kernels (N) is 96.
So the number of parameters is Pc = 11² × 3 × 96 + 96 = 34,944.
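Code sketch: parameters of a convolution layer
• A small helper implementing Pc = f² × c × N + N; the AlexNet and LeNet values come from the slides

def conv_params(f, c, n):
    # Weights f*f*c per kernel plus one bias, times N kernels
    return (f * f * c + 1) * n

print(conv_params(f=11, c=3, n=96))  # 34944 (AlexNet Conv-1)
print(conv_params(f=5, c=1, n=6))    # 156   (LeNet first convolution layer)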
AlexNet (2012)
• Number of Parameters of a MaxPool Layer
– There are no parameters associated with a
MaxPool layer.
– The pool size, stride, and padding are hyper
parameters.
AlexNet (2012)
• Number of Parameters of a Fully Connected
(FC) Layer
– There are two kinds of fully connected layers in a
CNN.
– The first FC layer is connected to the last Conv
Layer, while later FC layers are connected to other
FC layers.
AlexNet (2012)
Case 1: Number of Parameters of a Fully Connected (FC) Layer connected to a Conv Layer
Wcf = O² × N × F
Bcf = F
Pcf = Wcf + Bcf

Wcf = Number of weights of an FC Layer which is connected to a Conv Layer.
Bcf = Number of biases of an FC Layer which is connected to a Conv Layer.
O = Size (width) of the output image of the previous Conv Layer.
N = Number of kernels in the previous Conv Layer.
F = Number of neurons in the FC Layer.

• Example: The first fully connected layer of AlexNet is connected to a Conv Layer.
For this layer, O = 6, N = 256, F = 4096, so
Pcf = 6² × 256 × 4096 + 4096 = 37,752,832
AlexNet (2012)
Case 2: Number of Parameters of a Fully Connected (FC) Layer connected to an FC Layer
Wff = F₋₁ × F
Bff = F
Pff = Wff + Bff

• Wff = Number of weights of an FC Layer which is connected to an FC Layer.
• Bff = Number of biases of an FC Layer which is connected to an FC Layer.
• Pff = Number of parameters of an FC Layer which is connected to an FC Layer.
• F = Number of neurons in the FC Layer.
• F₋₁ = Number of neurons in the previous FC Layer.

Example: The last fully connected layer of AlexNet is connected to an FC Layer.
For this layer, F₋₁ = 4096 and F = 1000, so Pff = 4096 × 1000 + 1000 = 4,097,000
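Code sketch: parameters of fully connected layers
• A small helper for both cases above, using the standard counting rule (weights plus one bias per neuron); the AlexNet values come from the slides

def fc_params_after_conv(o, n, f):
    # FC layer fed by a conv layer: O*O*N inputs per neuron, F neurons, F biases
    return o * o * n * f + f

def fc_params_after_fc(f_prev, f):
    # FC layer fed by another FC layer
    return f_prev * f + f

print(fc_params_after_conv(o=6, n=256, f=4096))  # 37752832 (AlexNet FC-1)
print(fc_params_after_fc(f_prev=4096, f=4096))   # 16781312 (AlexNet FC-2)
print(fc_params_after_fc(f_prev=4096, f=1000))   # 4097000  (AlexNet FC-3)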
AlexNet (2012)
Number of Parameters and
Tensor Sizes in AlexNet
• The total number of
parameters in AlexNet is
the sum of all
parameters in the 5 Conv
Layers + 3 FC Layers.
• It comes out to a
whopping 62,378,344!
VGG – 16 (2015)
• Instead of having many different hyperparameter choices
• it uses a much simpler, more uniform network
• All Conv layers use 3 × 3 filters with stride 1
• and "same" padding
• Max pooling layers are 2 × 2 with a stride of 2
VGG – 16 (2015)
• A deeper network than LeNet and AlexNet
• Total of about 138 million parameters
Training for CNN
• Use gradient descent to optimize the network and reduce the cost J
• Adam and RMSProp are commonly used optimizers
Training CNN
• Tunable parameters are the weights/elements of the filters, the biases, and the weights of the fully connected layers
• Before training, the tunable parameters are initialized with random values
• The prediction error (loss function / cost function) is calculated at the output (last) layer
• For training, start with the last layer
• The gradient/derivative of the error with respect to each tunable parameter of the last layer is calculated
• Based on the gradients, the parameters of the last layer are updated
• The same is repeated for each layer until we reach the first layer
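Code sketch: training a CNN with Keras
• A minimal training sketch, assuming TensorFlow/Keras; the tiny model and the random stand-in data are placeholders, not a real dataset

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# Random stand-in data; in practice these would be real images and labels
x_train = np.random.rand(64, 28, 28, 1).astype("float32")
y_train = np.random.randint(0, 10, size=(64,))

# Gradient-descent training with the Adam optimizer, as described above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=16)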
Vanishing gradients
• The backpropagation algorithm uses gradients for training an ANN
• Backpropagation finds the derivatives of the network by moving layer by layer from the final layer to the initial one
• Using the chain rule, the derivatives of each layer are multiplied (from the final layer to the initial) to compute the derivatives at the first few layers
• Each derivative may be less than 1
• For n hidden layers, n small derivatives are multiplied
• Thus the gradient (derivative) decreases exponentially as we propagate down to the initial layers
• This is called vanishing gradients
• Weights and biases are updated using these small vanishing gradients, so they barely change
• Updated weight = original weight − (learning rate × gradient)
Types of Gradients
• For training, two types of derivatives are used
1. Derivative along the path from one neuron of previous
layer to the neuron of next layer
2. Derivative of activation function
Sigmoid and its Gradient
• (Figure) The derivative (red curve) approaches zero for very large or very small input values
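Code sketch: the sigmoid derivative shrinks gradients
• A minimal NumPy sketch showing why chained sigmoid derivatives vanish; the 10-layer product is an illustrative best case

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # ~4.5e-05, nearly zero for large inputs
# Chaining 10 layers multiplies at best 0.25 ten times:
print(0.25 ** 10)          # ~9.5e-07 -> the gradient reaching the early layers vanishes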
Why vanishing gradients are significant?
• With small gradients, the weights and biases of the initial layers will not be updated effectively with each training session
• Initial layers are crucial to recognizing the attributes of the input data
• A low value of the gradient can lead to overall inaccuracy of the whole network
• In the worst case the gradient will be 0
• The weights will not change
• Training of the network will stop
• A solution is to use other activation functions, such as ReLU, which does not cause a small derivative
Exploding Gradients
• In a network of n hidden layers, n derivatives are multiplied together
• If the derivatives are large, the gradient increases exponentially as we propagate down to the initial layers
• The accumulation of large derivatives results in the model being very unstable and incapable of effective learning
• The large changes in the model's weights create a very unstable network
• At extreme values the weights become so large that they cause overflow, resulting in NaN weight values that can no longer be updated
Solution to exploding gradients
• Reduce the number of layers
• could be used in both scenarios (exploding and vanishing gradients)
• By reducing the number of layers in the network, the complexity of the model is reduced
• However, more layers make the network more capable of classifying complex data
• Gradient clipping
• Check the magnitude of the gradient and limit its size during the training phase
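Code sketch: gradient clipping in Keras
• A minimal sketch, assuming TensorFlow/Keras: optimizers accept clipnorm (rescale each gradient so its L2 norm is at most the given value) or clipvalue (clamp each gradient element)

import tensorflow as tf

clipped_opt = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
# model.compile(optimizer=clipped_opt, loss="sparse_categorical_crossentropy")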
Residual Networks (ResNet)
• Avoid vanishing and exploding gradients, as they provide residual connections straight to earlier layers
• The residual connection doesn't go through activation functions that "squash" the derivatives
• This results in a higher overall derivative of the block
Without Residual Block
With Residual Block
Residual Network
• Allows the use of a large number of hidden layers
• Avoids the problem of vanishing and exploding gradients
Residual Network
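Code sketch: a residual block in Keras
• A minimal sketch of a residual (skip-connection) block, assuming TensorFlow/Keras and an input that already has the same number of channels as the block's filters

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])       # the residual (skip) connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
block = tf.keras.Model(inputs, outputs)
block.summary()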
Inception Network
• Parts of very different sizes (e.g., a face filling most of the image or only a small region) can be extracted from the same image
Why Inception Network?
• Allows the internal layers to pick and choose which filter size to use
• Even if the size of the face in the image is different, the layer works accordingly to recognize the face

(Figure: a large face requires a large filter size; a small face requires a small filter size)
Number of Operations in convolution (revisited)
• Image filtering using convolution
• 5x5 image:

45 56 42 63 54
20 47 56 28 53
63 59 26 38 47
67 36 27 48 51
43 36 42 65 43

• 3x3 averaging filter: all elements 1, scaled by 1/9
• First filtered value: (45 + 56 + 42 + 20 + 47 + 56 + 63 + 59 + 26)/9 = 46 (nearest integer)
Number of Operations in convolution (revisited)
• Move the filter one position to the right; the second filtered value is also 46
Number of Operations in convolution (revisited)
• Filtering the entire 5x5 image with the 3x3 averaging filter gives a 3x3 filtered image:

46 46 45
44 41 42
44 42 43

• The size of the image is 5x5
• The size of the filtered image is 3x3
• To keep the size of the filtered image the same as the original, add additional rows and columns of zeros, called padding
Number of Operations in convolution (revisited)
• With one row and column of zero padding on each side, the 5x5 image becomes 7x7
• Filtering the padded image with the 3x3 averaging filter gives a 5x5 filtered image, the same size as the original; its top-left value is (0 + 0 + 0 + 0 + 45 + 56 + 0 + 20 + 47)/9 = 19
Number of Operations in convolution (revisited)
• The same 5x5 image filtered with the 3x3 averaging filter gives the 3x3 result above
• Filtered with a 5x5 averaging filter (all elements 1, scaled by 1/25), the output is a single value: 46
Number of Operations in convolution (revisited)
• Filtered with a 1x1 filter (a single value, here 2), every pixel is simply multiplied by 2 (45 → 90, 56 → 112, 42 → 84, ...)
• A filter of size 1x1 does not change the size of the image
Inception Network
• Ex: an inception module has an input of size 28×28×192
• The first filter generates a filtered image of size 28×28 for each plane
• 192 filtered planes are generated by the first filter
• The 192 filtered planes are added together
• Thus the filtered image has size 28×28
• The same process is done by 64 filters to produce an output of size 28×28×64
Inception Network
• Inception module has input with 28×28×192
• Output has 64 + 128 + 32 + 32 =256 channels
• Therefore size at output is 28×28×256
Inception Layer
• Is a combination of
• 1×1 Convolutional layer
• 3×3 Convolutional layer
• 5×5 Convolutional layer
• Output filter banks concatenated into a single output
vector forming the input of the next stage
Computational Cost of Inception Network
• For one element of one channel of input, 5×5 multiplications are used
• 192 channels for the same location would require 5×5×192 multiplications
• For 28×28 elements, 28×28×5×5×192 multiplications
• 32 filters would require 32×28×28×5×5×192
= 120,422,400 ≈ 120 million multiplications
• Number of multiplications = number of filters in next layer × width ×
height × filter size × filter size × number of channels in previous layer
Using 1×1 convolution
• Same input and output sizes
• Number of multiplications = number of filters in the next layer × width × height × filter size × filter size × number of channels in the previous layer
• Input to the first hidden layer requires 16×28×28×1×1×192 = 2,408,448 multiplications
• First hidden layer to output requires 32×28×28×5×5×16 = 10,035,200 multiplications
• Total multiplications are 2,408,448 + 10,035,200 = 12,443,648 ≈ 12.4 million
• This requires a much smaller number of computations
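Code sketch: multiplication counts with and without the 1×1 bottleneck
• A small helper applying the counting rule above; the layer sizes come from the slides

def conv_mults(n_filters, width, height, f, in_channels):
    # number of filters x output width x output height x filter size^2 x input channels
    return n_filters * width * height * f * f * in_channels

direct = conv_mults(32, 28, 28, 5, 192)                                      # direct 5x5 convolution
bottleneck = conv_mults(16, 28, 28, 1, 192) + conv_mults(32, 28, 28, 5, 16)  # 1x1 then 5x5
print(direct)      # 120422400
print(bottleneck)  # 12443648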
Bottleneck layer
Inception Layer
• There are two major add-ons in the original inception layer:
• 1×1 Convolutional layer before applying another layer,
which is mainly used for dimensionality reduction
• A parallel Max Pooling layer, which provides another option
to the inception layer
One block of Inception Network
Why Inception Network?
• Hebbian principle: “neurons that fire together, wire
together”
• when creating a subsequent layer in a deep learning
model
one should pay attention to the learnings of the
previous layer
• For example, a layer in our deep learning model has
learned to focus on individual parts of a face
• The next layer of the network would focus on the
overall face in the image
• To do this, the layer should have the appropriate
filter sizes to detect different objects
Inception Network
Inception Network
• The last few layers are fully connected layers followed by a softmax classifier
• Some intermediate layers have side branches with the same fully connected + softmax blocks (auxiliary classifiers)
Transfer Learning
• The reuse of a pre-trained model on a new problem
• Very popular in deep learning because it can train deep neural networks with comparatively little data
• The machine exploits the knowledge gained from a previous task to improve generalization on another
• For example, a classifier trained to predict whether an image contains food can use the knowledge it gained during training to help recognize drinks
Transfer Learning
• Knowledge of an already trained machine
learning model is applied to a different but
related problem
• Transfer the weights that a network has learned at "task A" to a new "task B"
• Model has learned from a task with a lot of
available labeled training data
• Use this knowledge for a new task that doesn't
have much data
• Instead of starting the learning process from
scratch, start with patterns learned from solving a
related task
Transfer Learning
• Mostly used in computer vision and natural
language processing tasks due to the huge
amount of computational power required
• In computer vision, for example, neural networks
usually try to detect edges in the earlier layers,
shapes in the middle layer and some task-specific
features in the later layers
• In transfer learning, the early and middle layers
are used
• and only retrain the latter layers
• It helps leverage the labeled data of the task it
was initially trained on
Example: Transfer Learning
• A model is trained for recognizing a backpack in an image
• The same model will be used to identify sunglasses
• In the earlier layers, the model has learned to recognize generic objects
• Retrain only the latter layers so it learns what separates sunglasses from other objects
Transfer Learning
• try to transfer as much knowledge as possible from the
previous task the model was trained on to the new task
Benefits of Transfer Learning
• Saves training time
• because it can sometimes take days or even
weeks to train a deep neural network from
scratch on a complex task
• Better performance of neural networks
• Does not need a lot of data.
• Machine learning model can be built with
comparatively little training data because the
model is already pre-trained
When to Use Transfer Learning
• There isn't enough labeled training data to train your
network from scratch
• There already exists a network that is pre-trained on a
similar task, which is usually trained on massive
amounts of data
• When task 1 and task 2 have the same input
• Features learned from the first task should be general
so that they can be useful for another related task
• Also, the input of the model needs to have the same
size as it was initially trained with
• Otherwise add a pre-processing step to resize your
input to the needed size
Using a pretrained model
• Keras, for example, provides nine pre-trained
models that can be used for transfer learning,
prediction, feature extraction and fine-tuning
• How many layers to reuse and how many to
retrain depends on the task B
Ways for Transfer Learning
• Classifier
Pre-trained model is used directly to classify new images
• Standalone Feature Extractor
The pre-trained model, or some portion of the model, is
used to pre-process images and extract relevant features
• Integrated Feature Extractor
Pre-trained model, or some portion of the model, is
integrated into a new model
layers of the pre-trained model are frozen during training
• Weight Initialization
Pre-trained model, or some portion of the model, is
integrated into a new model,
and the layers of the pre-trained model are trained in
concert with the new model
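Code sketch: transfer learning with a pre-trained Keras model
• A minimal sketch of the "Integrated Feature Extractor" approach, assuming TensorFlow/Keras, a VGG16 base pre-trained on ImageNet, and a hypothetical 2-class task; the data names are placeholders

import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),  # new task-specific classification head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(new_task_images, new_task_labels, epochs=5)  # hypothetical placeholder data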
