
Deep learning for Computer Vision

Outline

● Introduction to Computer Vision


● Problems using Fully Connected Networks on Images
● What are convolutions?
● Convolution on Images
● Stride, Padding, Pooling
● Dimension of a Convolution Layer
● Stacking Convolution Layers
● Convolutional Neural Networks for images and their applications
Computer Vision Tasks
Deep learning for Computer Vision
2018 Turing Award

The 2018 Turing Award was awarded jointly to Yoshua Bengio, Geoffrey Hinton, and Yann LeCun for their pioneering work on deep learning.

Geoffrey E. Hinton is known by many to be the godfather of deep learning.

Geoffrey Everest Hinton early in his career. His middle name comes from a relative, George Everest, who surveyed India.
Geoffrey E. Hinton

● Geoffrey Hinton's foundational contribution to the backpropagation algorithm can be traced back to his work on "Learning Representations by Back-Propagating Errors".
● Hinton's work helped to establish the theoretical foundations for training CNNs and RNNs and showed that it is possible to learn useful representations using deep neural networks.
● For example, the "backpropagation through time" (BPTT) algorithm is a variant of the backpropagation algorithm that is specifically designed for training recurrent neural networks.
Yann LeCun

● Yann LeCun is well known for his work on Convolutional Neural Networks, representation learning, and geometric deep learning.
● LeNet is a convolutional neural network (CNN) architecture designed by Yann LeCun et al. in 1998.
● LeNet is used for the recognition of handwritten digits in the MNIST dataset.
Yoshua Bengio

● Yoshua Bengio's early work "Learning Long-Term Dependencies with Gradient Descent is Difficult" uncovered a fundamental difficulty of learning in RNNs.
● Bengio has also made significant contributions to
○ Recurrent neural networks (RNNs)
○ Word embeddings from neural networks and neural language models
○ Unsupervised deep learning based on auto-encoders
○ Introducing Generative Adversarial Networks (GANs)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

● The ILSVRC challenge is designed to evaluate and advance image classification and object detection algorithms.
● ILSVRC is an annual computer vision competition that was first held in 2010.
● The competition uses the ImageNet dataset, which contains millions of images organized into thousands of categories.
● The challenge has been a driving force behind many of the recent advances
in computer vision, including self-driving cars, robotics, and augmented
reality.
Problems using Fully Connected Networks on Images
Fully Connected Neural Network (FC)

Problems using FC Layers on Images

How to process a tiny image with FC layers
Disadvantages of FC Layers for Image Data

● Dense layers require a lot of parameters, which can lead to overfitting when
the number of input features is large.
● Dense layers are not translation invariant, meaning that small shifts in the
input image can result in large changes in the output.
● Dense layers do not take advantage of the spatial structure of images, and can therefore be inefficient for processing large images (a rough parameter count is sketched below).
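As a rough illustration (the sizes below are assumed for this sketch, not taken from the slides), a single dense layer on a flattened 224x224 RGB image already needs over 150 million parameters:

# Parameter count of one dense layer on a flattened 224x224x3 image (illustrative sizes).
inputs = 224 * 224 * 3            # flattened input features
units = 1000                      # hidden units chosen for illustration
params = inputs * units + units   # weights plus biases
print(params)                     # 150,529,000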
Convolutional Neural Networks for Images

What are convolutions?
What are Convolutions?
Discrete case: box filter

Slide the filter kernel from left to right; at each position, compute a single value in the output data.
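A minimal NumPy sketch of this sliding operation (the 1D signal and the 3-tap box filter below are illustrative):

import numpy as np

signal = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)  # illustrative 1D input
box = np.ones(3) / 3.0                                  # 3-tap box filter
output = np.convolve(signal, box, mode='valid')         # slide the kernel, one value per position
print(output)  # [0.333... 0.666... 1. 0.666... 0.333...]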
Convolution on Images
Convolutions on images
● We just slide the kernel over
the input image
● Each time we slide the
kernel we get one value in
the output
● The resulting output is called
a feature map.
● We can use multiple filters to
get multiple feature maps.
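A small sketch of the same idea in NumPy (the toy 5x5 image and 2x2 kernel below are assumed values for illustration):

import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 kernel

h = image.shape[0] - kernel.shape[0] + 1           # output height (no padding, stride 1)
w = image.shape[1] - kernel.shape[1] + 1           # output width
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        patch = image[i:i + 2, j:j + 2]            # region currently under the kernel
        feature_map[i, j] = np.sum(patch * kernel) # one output value per position
print(feature_map.shape)                            # (4, 4)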

The aim is to learn useful kernels using deep learning.
Feature Map Dimension

Input
Filter
Output

black and white image


Convolution Layer

RGB image
Stride
The stride determines how much the convolutional
kernel is moved across the input image at each step.
Convolution on images

Input
Filter
Stride
Output
Feature Map Dimension

Input
Filter
Stride
Output
Feature Map Dimension

Input
Filter
Stride
Output

The filter does not fit, so this is not a valid convolution position.
Feature Map Dimension

Input
Filter
Stride
Output

A fractional output size is not valid.


Stride
The advantages of using stride include:

● Stride reduces the size of the output feature map, which can help to reduce the computational cost and memory requirements of the network.
● Striding can make the network more efficient by reducing the number of operations required to process the input data.
● Striding can help to reduce overfitting by reducing the number of parameters in the network and forcing the network to learn more abstract features.
Deep learning for Computer Vision
Problems using FC Layers on Images

How to process a tiny image with FC layers


Solution is Convolution
Parameter Sharing

● The same set of weights is used to compute the output for all neurons in the same feature map.
● Parameter sharing reduces the number of parameters needed to train.
● Parameter sharing enables the model to learn translation-invariant features, for example, a kernel for edge detection.
Edge Detection by Convolution
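The slide's figure is an image; as a stand-in, here is a small sketch with a horizontal-difference kernel that responds at a vertical edge (all values are illustrative):

import numpy as np

# Toy image: dark left half, bright right half, so there is a vertical edge in the middle.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

k = np.array([[1.0, -1.0]])  # 1x2 difference kernel (illustrative)

out_h, out_w = img.shape[0], img.shape[1] - 1
edges = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        edges[i, j] = np.sum(img[i:i + 1, j:j + 2] * k)
print(edges[0])  # nonzero only at the column where the intensity changes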
Efficiency of Convolution

Input: 320 x 280


Kernel: 2 x 1
Output: 319 x 280

A fully connected layer mapping this input to this output would require a huge number of parameters.
Features

The aim is to learn useful kernels using deep learning.

Feature Extraction using CNN
Padding
Recall Stride

Input
Filter
Stride
Output

A fractional output size is not valid.


Convolution Layers: Dimensions

Input Image

Always shrinking the spatial size may not be a good approach (information loss).
Padding (Key idea)
Convolution Layers: Padding

● Padding refers to the process of adding additional rows and columns of zeros around the edges of the input image.

Zero Padding
Convolution Layers: Padding

Why padding?

● Sizes get smaller too quickly
● Corner pixels are used only once
Convolution Layers: Padding

● Preserving the spatial dimensions


of the output feature map
● Reducing information loss at the
edges of the image
● Reducing the effect of edge pixels
on the output

Zero Padding
Convolution Layers: Padding

Most common is zero padding


Feature Map Dimension

Most common is zero padding

Output size: O = ⌊(N + 2P − F) / S⌋ + 1

where ⌊·⌋ denotes the floor operator, N is the input size, F the filter size, P the padding, and S the stride.
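A minimal helper implementing this formula (square inputs and filters assumed):

def conv_output_size(n, f, p=0, s=1):
    # floor division implements the floor operator in the formula above
    return (n + 2 * p - f) // s + 1

print(conv_output_size(32, 5))       # 28: a 5x5 filter shrinks a 32x32 input
print(conv_output_size(32, 5, p=2))  # 32: 'same' padding for a 5x5 filter
print(conv_output_size(7, 3, s=2))   # 3: stride 2 roughly halves the size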


Padding Types

● Valid Padding:
○ No padding at all.
○ The output feature map is smaller than the input feature map.

● Same Padding:
○ Adding enough padding to the input image so that the output feature map has the same size as the input image.
○ Set padding to P = (F − 1)/2 with stride S = 1.
○ Verify with the output-size formula above (a quick check follows).
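A quick numeric check of the claim (sizes assumed):

# With S = 1 and P = (F - 1) / 2, the output size equals the input size.
n, f = 32, 5
p = (f - 1) // 2                 # P = 2 for a 5x5 filter
print((n + 2 * p - f) // 1 + 1)  # 32, same as the input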
Padding

● Reflective Padding:
○ The padded pixels are not filled with zeros, but with the reflected values
of the input image.
○ This type of padding is useful when the input image contains edges or
other sharp features that would be distorted by zero padding.

Original tensor:          Reflect-padded tensor:

[[1 2 3]                  [[5 4 5 6 5]
 [4 5 6]                   [2 1 2 3 2]
 [7 8 9]]                  [5 4 5 6 5]
                           [8 7 8 9 8]
                           [5 4 5 6 5]]
Padding

● Symmetric Padding:
○ The padded pixels are filled with mirrored values of the input image, including the edge values themselves.

Original tensor:          Symmetric-padded tensor:

[[1 2 3]                  [[1 1 2 3 3]
 [4 5 6]                   [1 1 2 3 3]
 [7 8 9]]                  [4 4 5 6 6]
                           [7 7 8 9 9]
                           [7 7 8 9 9]]
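Both padded tensors above can be reproduced with np.pad:

import numpy as np

t = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.pad(t, 1, mode='reflect'))    # mirrors the values without repeating the edge row/column
print(np.pad(t, 1, mode='symmetric'))  # mirrors the values including the edge row/column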
Feature Map Dimension (padding and stride)
Convolution Example

Conv2D(filters=10, kernel_size=(5, 5), strides=1, padding='same', input_shape=(3, 32, 32))


Stacking Convolution Layers
Convolution Layer

● A basic layer is defined by
— Filter width and height (depth is implicitly given)
— Number of different filters (i.e., number of weight sets)

● Each filter captures a different image characteristic

Convolution Layer

Conv2D(filters=1, kernel_size=(5, 5), strides=1, padding='valid', input_shape=(3, 32, 32))



Conv2D(filters=6, kernel_size=(5, 5), strides=1, padding='valid', input_shape=(3, 32, 32))


Convolution Layer
#parameters

● Each filter has a size of 5x5x3 (since there are 3 input channels).
● There are 6 filters in the layer. Plus there is one bias term for each filter.
● Therefore, the total number of parameters is 6*(5x5x3+1) = 456.
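A short sketch to confirm this count with Keras (a channels-last input shape is assumed here):

from keras.layers import Input, Conv2D
from keras.models import Model

inputs = Input(shape=(32, 32, 3))   # 3 input channels (channels-last layout)
conv = Conv2D(filters=6, kernel_size=(5, 5), strides=1, padding='valid')(inputs)
print(Model(inputs, conv).count_params())  # 456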
Stacking Convolution Layers
Pooling (sub-sampling)
● Conv Layer = Feature Extraction
— Computes a feature in a given region

Pooling
● Pooling Layer = Feature Selection
— Picks the strongest activation in a region
Pooling Layer: Max Pooling

inputs = Input(shape=(4, 4, 1))

# Max pooling layer


max_pool = MaxPooling2D(pool_size=(2, 2), strides=2)(inputs)
Pooling Layer: Average Pooling

● Typically used deeper in the network

inputs = Input(shape=(4, 4, 1))

# Average pooling layer


avg_pool = AveragePooling2D(pool_size=(2, 2), strides=2)(inputs)
Pooling Layer

Common settings: F = 2, S = 2 (or F = 3, S = 2)
Pooling Layer

● The pooling layer reduces the spatial dimensions of the input, which reduces the computational cost of the network.
● Pooling layers aim to retain the most important features, which helps the network learn more robust representations of the input data.
● Pooling layers can improve the translation invariance of the network by selecting the maximum or average value in a given region.
● Pooling layers provide distortion invariance by selecting the maximum or average value in a region, which reduces the effect of small variations.
Receptive Field
Receptive Field

● Receptive field refers to the area of the input that is used


by a particular neuron or feature map in the network.
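A common way to compute the receptive field layer by layer (a sketch; the kernel sizes and strides below are illustrative):

# r_out = r_in + (kernel - 1) * jump, where 'jump' is the product of all previous strides.
def receptive_field(layers):
    r, jump = 1, 1
    for kernel, stride in layers:
        r = r + (kernel - 1) * jump
        jump *= stride
    return r

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three stacked 3x3 convolutions
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: a stride-2 layer enlarges it faster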
Receptive Field (example)
Deep learning for Computer Vision
Receptive Field

● The receptive field refers to the area of the input image that is used by a particular neuron or feature map in the network.

Receptive Field (example)
Receptive fields (Advantages)

● Capturing global features


● Improved recognition accuracy
● Improved generalization
Receptive fields (Disadvantages)

● Increased computational cost


● Reduced localization accuracy
● Limited context modeling
Sparsity

● This is what a regular feed-forward neural network looks like.

● There are many dense connections.
Sparsity

● Reduced number of parameters using convolution (compared to a fully connected layer)
● Efficient computation and low memory usage using convolution
Convolutional Neural Networks (CNNs)

Schematic of typical sequences of layers in CNNs


Advantages of Convolutional Networks

● Shared weights: reduces the number of parameters and helps to prevent overfitting.
● Translation invariance: small shifts in the input image result in small changes in the output.
● Flexibility of design: application-specific networks can be built by stacking convolution layers.
Feature Extraction using CNNs

● CNNs take advantage of the spatial structure for feature extraction.
● Learn hierarchical representations: starting with low-level features (edges) and progressing to higher-level features (shapes).
● Learn complex representations: convolutional layers can be easily stacked to create deeper networks.
Basics of CNN (summary)

● Introduction to Computer Vision


● Problems using Fully Connected Networks on Images
● Convolution on Images
● Stride, Padding, Pooling
● Stacking Convolution Layers
● Receptive Field and Sparsity
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun


in 1998.
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun


in 1998.
● LeNet uses a gradient-based learning algorithm (backpropagation) for
training.
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun


in 1998.
● LeNet uses a gradient-based learning algorithm (backpropagation) for
training.
● LeNet was designed specifically for handwritten digit recognition, and was
trained on the MNIST dataset of handwritten digits.
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun in 1998.
● LeNet uses a gradient-based learning algorithm (backpropagation) for training.
● LeNet was designed specifically for handwritten digit recognition and was trained on the MNIST dataset of handwritten digits.
● LeNet has also been adapted and extended for other image recognition tasks.
LeNet

Digit recognition: 10 classes


LeNet

Digit recognition: 10 classes

» Valid convolution: size shrinks


LeNet

Digit recognition: 10 classes

At that time, average pooling was used; now max pooling is much more common.
LeNet

● Convolutional layer 1: This layer applies a set of filters to the input image to
extract features such as edges and corners.

● Subsampling layer 1:
○ Reduces the spatial size of the feature maps, while retaining the most
important features.
○ Reduce the computational complexity.
LeNet

Digit recognition: 10 classes

● The second layer is another convolutional layer with 16 feature maps,


followed by another pooling layer.
LeNet

● Convolutional layer 2: Allowing the network to learn more complex features.

● Subsampling layer 2: Reduces the size of the feature maps, while retaining
the most important features.
LeNet

Digit recognition: 10 classes

LeNet

● Fully connected layers 1 and 2:
○ The third layer is a fully connected layer with 120 units, followed by a fourth layer with 84 units, both using a ReLU activation (updated from the original LeNet, which used tanh-like activations).
○ These layers learn non-linear combinations of the features extracted by the convolutional layers.
LeNet

Digit recognition: 10 classes

● Output layer: a fully connected layer with 10 units, using a softmax activation function to produce the final output probabilities.
from keras.models import Model
from keras.layers import Input, Conv2D, AveragePooling2D, Flatten, Dense

# Input layer
inputs = Input(shape=(32, 32, 1))

# Convolutional layer 1
conv1 = Conv2D(filters=6, kernel_size=(5, 5), strides=(1, 1), activation='relu',
padding='same')(inputs)

# Average pooling layer 1


pool1 = AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(conv1)

# Convolutional layer 2
conv2 = Conv2D(filters=16, kernel_size=(5, 5), strides=(1, 1), activation='relu',
padding='valid')(pool1)

# Average pooling layer 2


pool2 = AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(conv2)

# Flatten layer
flatten = Flatten()(pool2)

# Fully connected layer 1


fc1 = Dense(units=120, activation='relu')(flatten)
# Fully connected layer 2
fc2 = Dense(units=84, activation='relu')(fc1)

# Output layer
outputs = Dense(units=10, activation='softmax')(fc2)
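
# Assemble the LeNet-style model from the layers defined above (a sketch).
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()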
Advantages of Convolutional Networks
Special Convolution (Depth-wise Separable Convolutions)
Normal convolutions

Normal convolution acts on all channels

# Input layer for an image of size 12x12 with 3 channels


inputs = Input(shape=(12, 12, 3))

# Convolutional layer with 256 filters of size 5x5


conv = Conv2D(filters=256, kernel_size=(5, 5), padding='valid')(inputs)
Depth-wise separable convolutions

1. Apply a separate filter to each input channel to produce a set of


intermediate feature maps.
Depth-wise separable convolutions

2. Use a 1x1 convolution to combine the intermediate feature maps


into the final output.
Depth-wise separable convolutions

from keras.layers import Input, DepthwiseConv2D, Conv2D, BatchNormalization, Activation

# Input layer for an image of size 12x12 with 3 channels


inputs = Input(shape=(12, 12, 3))

# Depthwise convolutional layer


depthwise_conv = DepthwiseConv2D(kernel_size=(5, 5), padding='valid')(inputs)
depthwise_conv = Activation('relu')(depthwise_conv)

# Pointwise convolutional layer


pointwise_conv = Conv2D(filters=256, kernel_size=(1, 1))(depthwise_conv)
pointwise_conv = Activation('relu')(pointwise_conv)
But why?

● Each filter has a kernel size of 5x5x3 (there are 3


input channels). Plus one bias term for each filter.
● Total number of parameters in the Convolutional
layer is 256x(5x5x3+1) = 19456.

Simple convolution
But why?

Depthwise Convolutional layer:


● Each channel of the input tensor is convolved
separately using a 5x5 kernel.
● There are 3 input channels and a depth multiplier
of 1, so there are 3 depthwise filters in total.
● Each depthwise filter has 5x5=25 parameters.
Plus one bias term for each filter.
● Total number of parameters in the Depthwise
Convolutional layer is 3x(25+1) = 78.

Pointwise Convolutional layer:


● There are 256 output filters, and the input has 3
channels.
● Each filter has a kernel size of 1x1x3 (as 3 input
channels). Plus one bias term for each filter.
● Total number of parameters in the Pointwise
Convolutional layer is 256x(1x1x3+1) = 1024.
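A short sketch that confirms both counts with Keras (channels-last layout assumed):

from keras.layers import Input, DepthwiseConv2D, Conv2D
from keras.models import Model

inp = Input(shape=(12, 12, 3))

standard = Model(inp, Conv2D(256, (5, 5), padding='valid')(inp))
print(standard.count_params())   # 19456

x = DepthwiseConv2D((5, 5), padding='valid')(inp)   # 3 * (5*5 + 1) = 78 parameters
x = Conv2D(256, (1, 1))(x)                           # 256 * (1*1*3 + 1) = 1024 parameters
separable = Model(inp, x)
print(separable.count_params())  # 1102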
Depthwise Separable Convolution

● Fewer parameters: Compared to a standard convolution, depthwise


separable convolution requires fewer parameters to learn.
● Computationally efficient: The depthwise convolution requires fewer
computations than a standard convolution
● Improved performance: Depthwise separable convolution can achieve
similar or better performance than a standard convolution.
● Reduced overfitting: By reducing the number of parameters, depthwise
separable convolution can help reduce overfitting.
Deep learning for Computer Vision
More Special Convolutions
Transpose Convolution: 1D Example
Transposed Convolution

● Convolution outputs tensor with a smaller spatial dimension, whereas


transposed convolution outputs tensor with a larger spatial dimension.
● Like convolution, transposed convolution uses learnable weights.
● In Transposed convolution, the output is upsampled by a factor of the
stride. For example, output_size = (input_size - 1) * stride +
kernel_size
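A small shape check of that formula (the input size, kernel, and stride below are assumed for illustration):

from tensorflow.keras.layers import Input, Conv2DTranspose
from tensorflow.keras.models import Model

inp = Input(shape=(4, 4, 1))
out = Conv2DTranspose(filters=1, kernel_size=2, strides=2, padding='valid')(inp)
print(Model(inp, out).output_shape)   # (None, 8, 8, 1): (4 - 1) * 2 + 2 = 8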
Transpose Convolution: 2D Example

Stride 1

Transposed convolution with a 2x2 kernel. The shaded portions are a portion of an intermediate
tensor as well as the input and kernel tensor elements used for the computation.
Stride 2
Transposed Convolution

# Add a transposed convolutional layer with 2 filters, kernel size 2x2, stride 2, no padding
Conv2DTranspose(2, (2, 2), strides=(2, 2), padding='valid', input_shape=(height, width, channels))

● Conv2DTranspose: This is the Keras layer class for a transposed convolution.


It is added to the model using the add method.

● 2: This is the number of filters in the layer.

● (2, 2): This is the size of the convolutional kernel in the layer.

● strides=(2, 2): stride of the transposed convolution to 2 in both the vertical


and horizontal directions.

● padding='valid': no padding is added.

● input_shape=(height, width, channels): the shape of the input to the layer.


Transposed Convolution

Applications of Transposed Convolution:


● Generating high-resolution images from low-resolution input
● Encoder-Decoder networks
○ Transposed Convolution used for image denoising, inpainting,
and super-resolution
○ Transposed Convolution used for semantic segmentation
Transposed Convolution

Disadvantages of Transposed Convolution:


● Can generate low-quality or unrealistic images if the training data is
insufficient or the model architecture is not well-suited for the task
● Can lead to checkerboard artifacts in the output images, where the
same pixel values are repeated in a grid-like pattern
Transposed Convolution

Alternatives to Transposed Convolution:


● Dense upsampling layers, computationally expensive and require
more memory than convolutional layers.
● Upsampling layers, such as bilinear or nearest-neighbor interpolation
(without learning any new parameters).
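A sketch of this interpolation-based alternative (the sizes and filter counts are assumptions for illustration): upsample without learned weights, then optionally refine with a normal convolution.

from tensorflow.keras.layers import Input, UpSampling2D, Conv2D
from tensorflow.keras.models import Model

inp = Input(shape=(8, 8, 16))
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(inp)  # 8x8 -> 16x16, no new parameters
x = Conv2D(16, (3, 3), padding='same', activation='relu')(x)  # optional learned refinement
print(Model(inp, x).output_shape)  # (None, 16, 16, 16)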
Classical Architecture
Alexnet, ResNet, VGG-16, VGG-19, and Inception Net
ImageNet

AI researcher Fei-Fei Li
began working on the idea
for ImageNet in 2006.
Classical Architecture

● Alexnet
● VGG Network
● ResNet
● Inception Net
Revolution of depth (ImageNet Benchmark)

AlexNet was the first CNN to achieve strong performance on the ImageNet dataset.

Non-CNN
CNN
Alexnet
CNNs Success
Alexnet

● The architecture of AlexNet was inspired by LeNet.


● AlexNet is a convolutional neural network (CNN) architecture
designed by Alex Krizhevsky in 2012.
Alexnet

● The architecture of AlexNet was inspired by LeNet.


● AlexNet is a convolutional neural network (CNN) architecture
designed by Alex Krizhevsky in 2012.
● AlexNet was one of the first CNNs to achieve a large improvement in image classification performance on the ImageNet dataset.
● AlexNet helped to popularize deep learning and convolutional neural
networks.
AlexNet

The first convolutional layer has 96 filters


AlexNet

The first convolutional layer has 96 filters

Convolution output is passed through a ReLU activation function and


then max pooled with a window size of 3x3 and a stride of 2 pixels.
AlexNet

The second convolutional layer has 256 filters.


AlexNet

The next convolutional layers apply 384 and 256 filters.


AlexNet

AlexNet uses max pooling after the first, second, and fifth convolutional
layers to reduce the spatial dimensions of the feature maps.
AlexNet

4096 units → 4096 units → 1000 units

Finally, the network has the fully connected layers followed by a softmax output layer.
AlexNet

Let the convolutional layer be denoted by CL. AlexNet has the following layers.


● CL 1: applies 96 filters of size 11x11 to input image with a stride of 4 pixels.
● CL 2: applies 256 filters of size 5x5 to the output of the first max pooling layer.
● CL 3: applies 384 filters of size 3x3 to the output of the second max pooling layer.
● CL 4: applies 384 filters of size 3x3 to the output of the third CL.
● CL 5: applies 256 filters of size 3x3 to the output of the fourth CL.
AlexNet

Fully connected layers


● FC 1: This layer has 4096 units. The output is passed through a ReLU
activation function and then subject to dropout regularization.
● FC 2: This layer is similar to the first fully connected layer, with 4096 units, a
ReLU activation function, and dropout regularization.

Output layer:

● This layer has 1000 units (i.e., number of classes in the ImageNet dataset).
● The output is passed through a softmax activation function to produce the
final class probabilities.
AlexNet
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.layers import BatchNormalization
from keras.models import Model

input_shape = (227, 227, 3) # Input shape of the image

# Define the input layer


inputs = Input(shape=input_shape)

# First convolutional layer, 96 filters, kernel size of 11x11 and stride of 4x4, followed by ReLU
conv1 = Conv2D(filters=96, kernel_size=(11, 11), strides=(4, 4), activation='relu')(inputs)

# Max pooling layer with pool size of 3x3 and stride of 2x2
pool1 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv1)

# Batch normalization layer


bn1 = BatchNormalization()(pool1)

# Second convolutional layer with 256 filters, kernel size of 5x5 and padding of same, followed by
ReLU activation
conv2 = Conv2D(filters=256, kernel_size=(5, 5), padding='same', activation='relu')(bn1)

# Max pooling layer with pool size of 3x3 and stride of 2x2
pool2 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv2)

# Batch normalization layer


bn2 = BatchNormalization()(pool2)
AlexNet
# Third convolutional layer with 384 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv3 = Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu')(bn2)
# Fourth convolutional layer with 384 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv4 = Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu')(conv3)

# Fifth convolutional layer, 256 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv5 = Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu')(conv4)
# Max pooling layer with pool size of 3x3 and stride of 2x2
pool5 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv5)
# Batch normalization layer
bn3 = BatchNormalization()(pool5)

# Flatten layer
flatten = Flatten()(bn3)

# First fully connected layer with 4096 units, followed by ReLU activation and dropout
fc1 = Dense(units=4096, activation='relu')(flatten)
dropout1 = Dropout(0.5)(fc1)
# Second fully connected layer with 4096 units, followed by ReLU activation and dropout
fc2 = Dense(units=4096, activation='relu')(dropout1)
dropout2 = Dropout(0.5)(fc2)

# Output layer with 1000 units and softmax activation


outputs = Dense(units=1000, activation='softmax')(dropout2)
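
# Assemble the AlexNet-style model from the layers above (a sketch; Model is imported from keras.models at the top of this listing).
model = Model(inputs=inputs, outputs=outputs)
model.summary()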
LeNet and AlexNet

● AlexNet is deeper and more complex than LeNet, with more layers
and more parameters.
● AlexNet has larger filters in initial layers (e.g., 11x11 in the first layer),
while LeNet uses smaller filters (e.g., 5x5).
● AlexNet achieved state-of-the-art performance on the ImageNet
dataset, while LeNet was designed and tested on a smaller
handwritten digit recognition task.
VGG Network

VGG16 showed good performance on the ImageNet benchmark.
VGG Networks

● VGG stands for Visual Geometry Group (University of Oxford) that developed
a family of CNN architectures for image classification.
● The original VGG model, VGG16, has 16 layers including 13 convolutional
layers and 3 fully connected layers.
● VGG19 has 19 layers, including 16 convolutional layers and 3 fully connected layers.
● The VGG family also includes VGG11 and VGG13, with fewer convolutional layers than VGG16 and VGG19.
● The VGG family generalizes well to many tasks such as classification, object detection, segmentation, style transfer, and transfer learning.
VGG16 architecture
VGG16 (summary)

● VGG16 has 13 convolutional layers and 3 fully connected layers (16


total).
● Convolutional layers use 3x3 filters, stride 1, and same padding, followed by a ReLU (deeper networks with fewer parameters).
● Pooling layers use 2x2 filters with a stride of 2 pixels.
● The first two convolutional layers have 64 filters each, while the
remaining layers have 128, 256, 512, and 512 filters, respectively.
● Fully connected layers have 4096 units that use ReLU, and output layer
has 1000 units corresponding to number of classes in the ImageNet.
VGG16
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input, decode_predictions

# Load the VGG16 model


model = VGG16(weights='imagenet')

# Load the image you want to classify


img_path = 'tiger_shark.jpeg'
img = image.load_img(img_path, target_size=(224, 224))

# Convert the image to an array


x = image.img_to_array(img)
x = tf.expand_dims(x, axis=0)
x = preprocess_input(x)

# Use the model to predict the class of the image


preds = model.predict(x)

# Print the top 5 predictions


print('Predicted:', decode_predictions(preds, top=5)[0])
VGG16 Implementation

# Import necessary libraries


from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

def vgg16(input_shape=(224, 224, 3), num_classes=1000):


input_tensor = Input(shape=input_shape)

# Block 1
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(input_tensor)
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x)

# Block 2
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x)

# Block 3
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv1')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv2')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(x)
VGG16 Implementation

# Block 4
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(x)

# Block 5
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(x)

# Flatten and dense layers


x = Flatten(name='flatten')(x)
x = Dense(4096, activation='relu', name='fc1')(x)
x = Dense(4096, activation='relu', name='fc2')(x)
output_tensor = Dense(num_classes, activation='softmax', name='predictions')(x)

# Create model
model = Model(inputs=input_tensor, outputs=output_tensor, name='vgg16')

return model
Deep learning for Computer Vision
Classical Architectures

Architecture | Year | Layers | Key Innovations | Parameters | Researchers
AlexNet | 2012 | 8 | CNN architecture | 62 million | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | Karen Simonyan and Andrew Zisserman
VGG16 architecture
Classical Architectures

Architecture | Year | Layers | Key Innovations | Parameters | Researchers
AlexNet | 2012 | 8 | CNN architecture | 62 million | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | — | 4-12 million | Szegedy et al.

Going Deeper with Convolutions


Inception Net
Fine-grained visual categories

Fine-grained categories are difficult to identify for uniformly scaled-up networks.


Inception Net

● Inception Net is a deep convolutional neural network architecture developed


by Google researchers in 2014.
● Inception Net won the 2014 ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) with a top-5 error rate of 6.67%.
Inception Net
Inception Net (Key Idea)

● Inception Layer: The multi-pathway convolutional blocks that enable the


network to learn complex features using fewer parameters.
● Auxiliary classifiers: At intermediate layers of the network to encourage
intermediate feature learning.
● Inception Net uses a multi-branch architecture that allows it to learn features
at multiple scales and resolutions.
Inception Layer

Tired of choosing filter sizes? Use them all!

Use padding = 'same' so that the branch outputs can be concatenated.

Size of the output? Not sustainable!
Inception Layer (key idea): 1x1 Convolutions

Recall: Convolutions on Images

1x1 Convolution

A 1x1 kernel keeps the spatial dimensions and only rescales the number of channels of the input!
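A small shape check (the channel counts are assumed for illustration):

from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

inp = Input(shape=(28, 28, 192))
out = Conv2D(filters=16, kernel_size=(1, 1), activation='relu')(inp)
print(Model(inp, out).output_shape)  # (None, 28, 28, 16): height and width unchanged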


Inception Layer: Computational Cost

Reduction of multiplications by 1/10
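With illustrative numbers (a 28x28x192 input, 32 output channels of size 5x5, and a 1x1 bottleneck down to 16 channels; none of these values are taken from the slides), the multiplication counts look like this:

direct = 28 * 28 * 32 * (5 * 5 * 192)                           # 5x5 convolution applied directly
bottleneck = 28 * 28 * 16 * 192 + 28 * 28 * 32 * (5 * 5 * 16)   # 1x1 down to 16 channels, then 5x5
print(direct)      # 120,422,400
print(bottleneck)  # 12,443,648 -> roughly one tenth of the direct cost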


Inception Layer

# Assumes these Keras layer classes are imported:
from keras.layers import Conv2D, MaxPooling2D, Concatenate

def inception_module(x, filters):
"""
Inception module of the InceptionNet
"""
tower_1 = Conv2D(filters[0], (1, 1), padding='same', activation='relu')(x)
tower_1 = Conv2D(filters[1], (3, 3), padding='same', activation='relu')(tower_1)

tower_2 = Conv2D(filters[2], (1, 1), padding='same', activation='relu')(x)


tower_2 = Conv2D(filters[3], (5, 5), padding='same', activation='relu')(tower_2)

tower_3 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(x)


tower_3 = Conv2D(filters[4], (1, 1), padding='same', activation='relu')(tower_3)

output = Concatenate(axis=-1)([tower_1, tower_2, tower_3])


return output
InceptionNet

Input
|
Conv2D -> ReLU -> MaxPooling2D …
|
Inception module
|

|

|
Inception module
|
GlobalAveragePooling -> Dense -> Softmax
GlobalAveragePooling

● Global Average Pooling replaces the fully connected layers used in classical CNNs.
● In this layer, the average value of
each feature map is computed,
resulting in a single output value for
each feature map.
● Global Average Pooling helps reduce
the number of parameters in the
network.
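A small shape check (the 7x7x1024 input is illustrative):

from tensorflow.keras.layers import Input, GlobalAveragePooling2D
from tensorflow.keras.models import Model

inp = Input(shape=(7, 7, 1024))
out = GlobalAveragePooling2D()(inp)      # each feature map collapses to its average
print(Model(inp, out).output_shape)      # (None, 1024)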
Inception Layer in InceptionNet
def InceptionNet(input_shape, num_classes):
"""
InceptionNet architecture using functional API.
"""
input_tensor = Input(shape=input_shape)

x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(input_tensor)


x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
x = Conv2D(64, (1, 1), padding='same', activation='relu')(x)
x = Conv2D(192, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [64, 128, 32, 32, 64])


x = inception_module(x, [128, 192, 96, 64, 128])

x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)


x = inception_module(x, [192, 208, 48, 64, 96])
x = inception_module(x, [160, 224, 64, 64, 112])
x = inception_module(x, [128, 256, 64, 64, 128])
x = inception_module(x, [112, 288, 64, 64, 144])
x = inception_module(x, [256, 320, 128, 128, 160])

x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [256, 320, 128, 128, 160])


x = inception_module(x, [384, 384, 128, 128, 128])

x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
x = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=input_tensor, outputs=x, name='InceptionNet')


return model
Auxiliary classifiers

● The auxiliary classifiers in InceptionNet are additional output branches that


are inserted into the network at intermediate stages.
Auxiliary classifiers

● The auxiliary classifiers in InceptionNet are additional output branches that


are inserted into the network at intermediate stages.
● Auxiliary classifiers provide additional supervision signals during training to
improve the overall performance of the network.
Auxiliary classifiers

● The auxiliary classifiers in InceptionNet are additional output branches that


are inserted into the network at intermediate stages.
● Auxiliary classifiers provide additional supervision signals during training to
improve the overall performance of the network.
● The use of auxiliary classifiers is not limited to InceptionNet and can be
applied to other deep learning architectures as well.
Auxiliary classifiers

Main classifier
Auxiliary classifiers

● During training, the loss from the auxiliary classifiers is added to the overall
loss (main classifier) of the network with a weight factor (usually 0.3).
● During inference, the outputs of the auxiliary classifiers are discarded, and
only the output of the main classifier is used to make predictions.
● The number and placement of the auxiliary classifiers in InceptionNet can
vary depending on the specific architecture and task.
Inception Net (Main Components)
Inception Net

The input to the network is a 224x224x3 RGB image.

The network begins with a series of convolutional and pooling layers to extract low-level features from the image.

The Inception module contains multiple parallel convolutional paths with different filter sizes, including 1x1, 3x3, and 5x5 convolutions.

Pooling operations and 1x1 convolutions inside the Inception modules reduce the dimensionality of the input.

The outputs of each path are concatenated together along the channel axis and fed into the next layer.

Inception modules are stacked on top of each other to form the "stem" of the network. The stem is followed by a series of "Inception-A" and "Inception-B" modules.


Inception Net

The network also includes several "Reduction" modules, which are used to reduce the spatial
dimensions of the feature maps.
Inception Net

In addition to the main classifier, the network also includes two auxiliary classifiers at intermediate layers.
Inception Net

The final layers of the network consist of a global average pooling layer and a fully connected layer with softmax activation.
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import numpy as np

# Load the InceptionV3 model


model = InceptionV3(weights='imagenet')

# Load the image you want to classify


img_path = 'tiger_shark.jpeg'
img = image.load_img(img_path, target_size=(299, 299))

# Convert the image to an array


x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Use the model to predict the class of the image


preds = model.predict(x)

# Print the top 5 predictions


print('Predicted:', decode_predictions(preds, top=5)[0])
def InceptionNet(input_shape, num_classes):
"""
InceptionNet architecture using functional API.
"""
input_tensor = Input(shape=input_shape)

x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(input_tensor)


x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = Conv2D(64, (1, 1), padding='same', activation='relu')(x)


x = Conv2D(192, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [64, 96, 128, 16, 32])


x = inception_module(x, [128, 128, 192, 32, 96])
x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [192, 96, 208, 16, 48])


# Auxiliary Classifier 1
aux_output_1 = AveragePooling2D((5, 5), strides=(3, 3))(x)
aux_output_1 = Conv2D(128, (1, 1), padding='same', activation='relu')(aux_output_1)
aux_output_1 = Flatten()(aux_output_1)
aux_output_1 = Dense(1024, activation='relu')(aux_output_1)
aux_output_1 = Dropout(0.7)(aux_output_1)
aux_output_1 = Dense(num_classes, activation='softmax')(aux_output_1)

x = inception_module(x, [160, 112, 224, 24, 64])


x = inception_module(x, [128, 128, 256, 24, 64])
x = inception_module(x, [112, 144, 288, 32, 64])
# Auxiliary Classifier 2
aux_output_2 = AveragePooling2D((5, 5), strides=(3, 3))(x)
aux_output_2 = Conv2D(128, (1, 1), padding='same', activation='relu')(aux_output_2)
aux_output_2 = Flatten()(aux_output_2)
aux_output_2 = Dense(1024, activation='relu')(aux_output_2)
aux_output_2 = Dropout(0.7)(aux_output_2)
aux_output_2 = Dense(num_classes, activation='softmax')(aux_output_2)

x = inception_module(x, [256, 160, 320, 32, 128])


x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [256, 160, 320, 32, 128])


x = inception_module(x, [384, 192, 384, 48, 128])

x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)

output = Dense(num_classes, activation='softmax')(x)
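
# Sketch: build the three-output model and down-weight the auxiliary losses
# (the 0.3 factor follows the convention mentioned earlier; Model is assumed to be imported from keras.models).
model = Model(inputs=input_tensor, outputs=[output, aux_output_1, aux_output_2])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              loss_weights=[1.0, 0.3, 0.3])
return model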


InceptionNet variants

● Inception Net has been refined and optimized, leading to several smaller
and faster variants such as Inception-v2, Inception-v3, and
Inception-ResNet.
● Inception-ResNet incorporates residual connections into the Inception
modules to further improve training stability and performance.
InceptionNet Applications

● Image classification, Object Detection (fine-grained), and semantic


segmentation
● Image Quality Assessment for Inception Score.
Neural Style Transfer

VGG vs. InceptionNet


Neural Style Transfer

● Puzzle: VGG is a better feature extractor than InceptionNet for style transfer. The stylization quality degrades when InceptionNet is used instead of VGG.
Neural Style Transfer

Wang, Pei, Yijun Li, and Nuno Vasconcelos. "Rethinking and improving the robustness of image
style transfer." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2021.

● Puzzle: VGG is a better feature extractor than InceptionNet for style transfer.
The stylization performance degrades when InceptionNet is used instead of VGG.
Deep learning for Computer Vision
ImageNet Benchmark (Recap)
Common Performance Metrics

● Top-1 score: check if a sample's top class (i.e. the one with highest
probability) is the same as its target label
Common Performance Metrics

● Top-1 score: check if a sample's top class (i.e. the one with highest
probability) is the same as its target label
● Top-5 score: check if the true label is among the 5 predictions with the highest probabilities
Common Performance Metrics

● Top-1 score: check if a sample's top class (i.e. the one with highest
probability) is the same as its target label
● Top-5 score: check if the true label is among the 5 predictions with the highest probabilities
● Top-5 error: percentage of test samples for which the correct class was not in the top 5 predicted classes (a small computation sketch follows below)
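As a small illustration of these metrics, the sketch below computes top-1 and top-5 accuracy from a matrix of predicted class probabilities; the array names probs and labels are placeholders, not part of any library API.

import numpy as np

def top_k_accuracy(probs, labels, k=1):
    # probs: (N, C) predicted class probabilities; labels: (N,) true class ids
    top_k = np.argsort(probs, axis=1)[:, -k:]          # k highest-probability classes
    hits = np.any(top_k == labels[:, None], axis=1)    # true label among the top k?
    return hits.mean()

# top1 = top_k_accuracy(probs, labels, k=1)
# top5 = top_k_accuracy(probs, labels, k=5)
# top5_error = 1.0 - top5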
Classical Architecture (Recap)

Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers | 4-12 million | 74.8% | Szegedy et al.
Next? | 2015 | 50-152 | ? | ? | 75.3% | He et al.

Problem of Depth
Going Deeper

● There has been a general trend in recent years to design


deeper networks.
● Deeper networks are known to produce more complex features
and tend to generalise better.
Going Deeper

● There has been a general trend in recent years to design


deeper networks.
● Deeper networks are known to produce more complex features
and tend to generalise better.
● Training deep networks is however difficult.
○ Problem of vanishing gradients
○ Problem of exploding gradient
Vanishing Gradient Problem

● Gradient of the loss function with respect to the weights in the


lower layers becomes very small during backpropagation.
Vanishing Gradient Problem:

Vanishing gradients problem on a simple feed-forward network with hidden activations h1, ..., hn:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

Vanishing Gradient Problem:

Weight update issue

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If any of these factors is close to zero, then ∂L/∂w1 ≈ 0.

Vanishing Gradient Problem:

Vanishing gradients problem on a simple feed-forward network:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If any of these factors is close to zero, then ∂L/∂w1 ≈ 0.

Vanishing Gradient Problem:

● The small gradient is propagated back through the layers,


making it difficult for lower layers to learn meaningful
representations of the data.
Vanishing Gradient Problem:

● The small gradient is propagated back through the layers,


making it difficult for lower layers to learn meaningful
representations of the data.
● Very challenging to train CNNs, where the gradient can become
exponentially small.
Going Deeper

Now consider the problem of vanishing gradients on this new network:


Going Deeper

Now consider the problem of vanishing gradients on this new network:


Exploding Gradient Problem:

● Gradient of the loss function with respect to the weights in the


lower layers becomes very large during backpropagation.
Exploding Gradient Problem:

Exploding gradients problem on a simple feed-forward network:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

Exploding Gradient Problem:

Weight update issue

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If these factors are large, the product ∂L/∂w1 blows up.

Exploding Gradient Problem:

Exploding gradients problem on a simple feed-forward network:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If these factors are large, then ∂L/∂w1 ≈ ∞.

Exploding Gradient Problem:

● The large gradient is propagated back through the layers,


causing weight updates that are too large.
Exploding Gradient Problem:

● The large gradient is propagated back through the layers,


causing weight updates that are too large.
● Very challenging to train CNNs, where the gradient can become
exponentially large.
Exploding Gradient Problem:

● The large gradient is propagated back through the layers,


causing weight updates that are too large.
● Very challenging to train CNNs, where the gradient can become
exponentially large.
● Gradient clipping can be used to mitigate exploding gradients (see the sketch below).
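In Keras, gradient clipping can be requested directly on the optimizer; the values below are illustrative, not a recommendation for any particular model.

from tensorflow.keras.optimizers import SGD

# Clip the global gradient norm to 1.0 before each weight update
optimizer = SGD(learning_rate=0.01, clipnorm=1.0)

# Alternatively, clip each gradient element to the range [-0.5, 0.5]
# optimizer = SGD(learning_rate=0.01, clipvalue=0.5)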
Problem of Depth
ResNet

Solution to Problem of Depth


Classical Architecture

Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers | 4-12 million | 74.8% | Szegedy et al.
ResNet | 2015 | 50-152 | Residual connections, shortcut connections | 25.6-60 million | 75.3% | He et al.
ResNet
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.

● The key innovation of ResNet is the use of residual connections,


which allow for much deeper networks to be trained.
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.

● The key innovation of ResNet is the use of residual connections,


which allow for much deeper networks to be trained.

● ResNet comes in several variants, including ResNet-18, ResNet-34,


ResNet-50, ResNet-101, and ResNet-152.
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.

● The key innovation of ResNet is the use of residual connections,


which allow for much deeper networks to be trained.

● ResNet comes in several variants, including ResNet-18, ResNet-34,


ResNet-50, ResNet-101, and ResNet-152.

● ResNet has achieved good performance on many computer vision


tasks, including classification, object detection, and segmentation.
Residual Block
Skip connection (key idea)

● ResNet is composed of a series of residual blocks, each of which


contains one or more convolutional layers, batch normalization,
and ReLU activation.
Skip connection (key idea)

● ResNet is composed of a series of residual blocks, each of which


contains one or more convolutional layers, batch normalization,
and ReLU activation.
● In residual blocks, there is a shortcut connection that bypasses
one or more layers and allows the gradient to flow directly to
earlier layers.
Skip connection (key idea)

● ResNet is composed of a series of residual blocks, each of which


contains one or more convolutional layers, batch normalization,
and ReLU activation.
● In residual blocks, there is a shortcut connection that bypasses
one or more layers and allows the gradient to flow directly to
earlier layers.
● This shortcut connection is known as a residual connection or skip connection.
Two layers

Input → Linear → Non-linear (two stacked layers)
Residual Block (key idea)

Two layers

In each residual block, there is a skip connection (residual connection) that bypasses
one or more layers and allows the gradient to flow directly to earlier layers.
Residual Block

Two layers

In each residual block, there is a skip connection (residual connection) that bypasses
one or more layers and allows the gradient to flow directly to earlier layers.
Residual Block

Two layers

● Residual connections improve the gradient flow and enable the network
to learn deeper and more complex features.
ResNet Block
ResNet Block

The residual connection is added to the output of the convolutional layers before
the ReLU activation function is applied.
ResNet Block

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add

def resnet_block(inputs, filters, kernel_size, strides=(1, 1), padding='same'):
    # Main path: two convolutional layers, each followed by batch normalization
    x = Conv2D(filters=filters, kernel_size=kernel_size, strides=strides, padding=padding)(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)

    x = Conv2D(filters=filters, kernel_size=kernel_size, strides=(1, 1), padding=padding)(x)
    x = BatchNormalization()(x)

    # Skip connection: add the block input to the conv output before the final ReLU.
    # This identity shortcut assumes strides=(1, 1) and that `filters` matches the
    # number of input channels; otherwise a 1x1 projection of `inputs` would be needed.
    x = Add()([x, inputs])
    x = Activation('relu')(x)
    return x
ResNet
ResNet

● Top: a residual network with 34 parameter layers (3.6 billion FLOPs).


● Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).
● Bottom: the VGG-19 model (19.6 billion FLOPs) as a reference.
ResNet

● Top: a residual network with 34 parameter layers (3.6 billion FLOPs).


● Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).
● Bottom: the VGG-19 model (19.6 billion FLOPs) as a reference.
ResNet

● Top: a residual network with 34 parameter layers (3.6 billion FLOPs).


● Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).
● Bottom: the VGG-19 model (19.6 billion FLOPs) as a reference.
ResNet

ResNet Architectures for ImageNet. Building blocks are shown in brackets, with the
numbers of blocks stacked.
ResNet

Left: a building block (on 56×56 feature maps) as in ResNet-34. Right: a "bottleneck" building block for ResNet-50/101/152.
ResNet

ResNet architectures for ImageNet. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
● Each residual block contains one or more convolutional layers, batch
normalization, and ReLU activation functions.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
● Each residual block contains one or more convolutional layers, batch
normalization, and ReLU activation functions.
● The convolutional layers in each residual block typically have small filter
sizes, such as 3x3 or 1x1.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
● Each residual block contains one or more convolutional layers, batch
normalization, and ReLU activation functions.
● The convolutional layers in each residual block typically have small filter
sizes, such as 3x3 or 1x1.
● The final layers of the network are typically a global average pooling layer and a
fully connected layer with a softmax activation function (see the assembly sketch below).
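To make this structure concrete, the sketch below assembles a small ResNet-style classifier from the resnet_block function defined earlier in these slides: a stem convolution and max pooling, a few stages of identity residual blocks, global average pooling, and a softmax head. The stage sizes are illustrative and do not correspond to any official ResNet variant.

from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     GlobalAveragePooling2D, Dense)
from tensorflow.keras.models import Model

def mini_resnet(input_shape=(224, 224, 3), num_classes=1000):
    inputs = Input(shape=input_shape)

    # Stem: a single convolutional layer followed by max pooling
    x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(inputs)
    x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

    # Stages of identity residual blocks; a strided 1x1 convolution changes the
    # number of channels between stages so the identity skip connection still matches
    for filters, num_blocks in [(64, 2), (128, 2), (256, 2)]:
        if x.shape[-1] != filters:
            x = Conv2D(filters, (1, 1), strides=(2, 2), padding='same')(x)
        for _ in range(num_blocks):
            x = resnet_block(x, filters=filters, kernel_size=(3, 3))

    # Head: global average pooling and a fully connected softmax classifier
    x = GlobalAveragePooling2D()(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    return Model(inputs, outputs)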
Why do ResNets Work?
Why do ResNets Work?
Why do ResNets Work?
Why do ResNets Work?

If the added layers learn the zero mapping, the block keeps the same values and only adds a non-linearity, so the identity is easy to recover.


Why do ResNets Work?

The network can effectively choose to use fewer layers when it is not
necessary, which can improve efficiency and reduce overfitting.
Why do ResNets Work?

Residual connections also allow the network to adaptively determine how


many layers to use for a particular input.
Why do ResNets Work?

● Shortcut connections in ResNets enable the gradient to flow more directly and
efficiently through the network.
● ResNets address the problem of vanishing gradients that can occur in very
deep neural networks.
ResNet

Training on ImageNet. If we make the network deeper, at some point the performance starts to decrease.

Left: plain networks of 18 and 34 layers.

Thin curves denote training error, and bold curves denote validation error of the center crops.
ResNet

Training on ImageNet. ResNet Solution

Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers

Thin curves denote training error, and bold curves denote validation error of the center crops.
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import (ResNet50, preprocess_input,
                                                    decode_predictions)
from tensorflow.keras.preprocessing import image
import numpy as np

# Load the ResNet50 model


model = ResNet50(weights='imagenet')

# Load the image you want to classify


img_path = 'tiger_shark.jpeg'
img = image.load_img(img_path, target_size=(224, 224))

# Convert the image to an array


x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Use the model to predict the class of the image


preds = model.predict(x)

# Print the top 5 predictions


print('Predicted:', decode_predictions(preds, top=5)[0])
References

● AlexNet: "ImageNet Classification with Deep Convolutional Neural Networks"


by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
● VGGNet: "Very Deep Convolutional Networks for Large-Scale Image
Recognition" by Karen Simonyan and Andrew Zisserman (2014).
● Inception Net: "Going Deeper with Convolutions" by Christian Szegedy, Wei
Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2015).
● ResNet: "Deep Residual Learning for Image Recognition" by Kaiming He,
Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2016).
Final Note on Classical
Architecture
Classical Architecture

Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers, batch normalization | 4-12 million | 74.8% | Szegedy et al.
ResNet | 2015 | 50-152 | Residual connections, shortcut connections | 25.6-60 million | 75.3% | He et al.
Comparing Complexity

Top-1 accuracy vs. network: single-crop top-1 validation accuracies for top-scoring single-model architectures.
Comparing Complexity

Top-1 accuracy vs. operations (blob size ∝ parameters): top-1 one-crop accuracy versus the number of operations required for a single forward pass. The size of each blob is proportional to the number of network parameters.
Comparing Complexity
Comparing Complexity
Comparing Complexity
Comparing Complexity
Comparing Complexity
Accuracy per parameter vs. network

Accuracy per parameter vs. network. Information density (accuracy per parameter) is an efficiency metric that
highlights the capacity of a specific architecture to better utilise its parametric space.
References

● Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. "An analysis of


deep neural network models for practical applications." (2016).
Object Recognition and Face Recognition
Image Classification
Object Detection and Localization

Can we relate Object Detection


and Classification?
Object Detection: Task Definition
Object Detection: Challenges
Object Detection: Challenges
How to verify if the
output is correct?
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
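A minimal IoU computation for two axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption, since boxes are sometimes given as (x, y, w, h) instead.

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); compute the intersection rectangle first
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: iou((0, 0, 10, 10), (5, 5, 15, 15)) = 25 / 175 ≈ 0.14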
Detecting a single
object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting Multiple
Objects
Detecting Multiple Objects
Detecting Multiple Objects
Detecting Multiple Objects
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window

Detecting Multiple Objects: Sliding Window

Detecting Multiple Objects: Sliding Window

Need to apply the CNN to a huge number of locations and scales, which is computationally expensive!
Region Proposals
R-CNN
R-CNN
R-CNN
R-CNN
R-CNN
R-CNN
R-CNN

● Training is slow (84h), takes a


lot of disk space
● Inference (detection) is slow
R-CNN

● Training is slow (84h), takes a


lot of disk space
● Inference (detection) is slow

Idea: Pass the image through


convnet before cropping! Crop the
conv feature instead!
Fast R-CNN
Fast R-CNN

“Backbone” network:
AlexNet, VGG, ResNet, etc
Fast R-CNN
Fast R-CNN
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Align

● In practice, RoI Align is used in Fast R-CNN. It uses bilinear interpolation to compute the feature values.
● It is a more precise and accurate way to extract features from region proposals.
Fast R-CNN: Fully-connected layers
Fast R-CNN
Fast R-CNN (Training)
Fast R-CNN (Training)
R-CNN vs Fast R-CNN
R-CNN vs Fast R-CNN

Problem: Runtime dominated by


region proposals!
R-CNN vs Fast R-CNN

Problem: Runtime dominated by


region proposals!

Solution: make CNN do region


proposals.
Faster R-CNN:
Faster R-CNN:
Region Proposal Network
Region Proposal Network
Region Proposal Network
Region Proposal Network
Region Proposal Network
Region Proposal Network
Faster R-CNN:
Faster R-CNN:
Faster R-CNN:
References

● Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and
semantic segmentation." Proceedings of the IEEE conference on computer vision and
pattern recognition. 2014.
● Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE international conference on
computer vision. 2015.
● Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region
proposal networks." Advances in neural information processing systems 28 (2015).
Object Recognition and Face Recognition
Object Detection: Task Definition
R-CNN

Problem: Training is slow (84h),


takes a lot of disk space
R-CNN

Problem: Training is slow (84h),


takes a lot of disk space

Solution: Pass the image through


convnet before cropping! Crop the
conv feature instead!
Fast R-CNN
Multi-task loss

Problem: Runtime dominated by region proposals!
Fast R-CNN
Multi-task loss

Problem: Runtime dominated by region proposals!

Solution: make CNN do region


proposals.
Faster R-CNN:

Two stages: region proposal


generation and object
classification
Faster R-CNN:

Problem: not practical for real-time object detection; inference is slow.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)

At 60 miles/h (96.5 km/h), a car travels about ⅓ mile (1760 feet) during one 20 s detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)

176 feet travelled at 60 miles/h during one 2 s detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)
Faster R-CNN | 73.2 | 7 FPS (140 ms/img)

8-12 feet travelled at 60 miles/h during one 140 ms detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)
Faster R-CNN | 73.2 | 7 FPS (140 ms/img)
YOLO | 63.4 | 45 FPS (22 ms/img)

2 feet travelled at 60 miles/h during one 22 ms detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)
Faster R-CNN | 73.2 | 7 FPS (140 ms/img)
YOLO | 63.4 / 69.0 | 45 FPS (22 ms/img)

2 feet travelled at 60 miles/h during one 22 ms detection.
With YOLO, you only look once at an image to
perform detection

YOLO: You Only Look Once


We split the image into a grid
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell also predicts a class probability.
Each cell also predicts a class probability.

Bicycle Car

Dog

Dining
Table
Conditioned on object: P(Car | Object)

Bicycle Car

Dog

Dining
Table
Then we combine the box and class predictions.
Finally, apply non-maximum suppression (NMS) with a threshold to remove duplicate detections (see the sketch below).
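A greedy non-maximum suppression sketch, reusing the iou helper from the IoU slides above; the 0.5 overlap threshold is a typical but purely illustrative choice.

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Indices sorted by confidence, highest first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # keep the most confident remaining box
        keep.append(best)
        # Drop all remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep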
This parameterization fixes the output size

Each cell predicts:

- For each bounding box:


- 4 coordinates (x, y, w, h)
- 1 confidence value
- Some number of class
probabilities
This parameterization fixes the output size

Each cell predicts:

- For each bounding box:


- 4 coordinates (x, y, w, h)
- 1 confidence value
- Some number of class
probabilities

For Pascal VOC:

- 7x7 grid
- 2 bounding boxes / cell
- 20 classes

Total predictions: 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs


Thus we can train one neural network to be a whole
detection pipeline
Look at that cell’s predicted boxes
Find the best one, adjust it, increase the confidence
Find the best one, adjust it, increase the confidence
Find the best one, adjust it, increase the confidence
Decrease the confidence of other boxes
Decrease the confidence of other boxes
Some cells don’t have any ground truth detections!
Some cells don’t have any ground truth detections!
Decrease the confidence of these boxes
Decrease the confidence of these boxes
Don’t adjust the class probabilities or coordinates
We train with standard tricks:

- Pretraining on Imagenet
- Extensive data augmentation
- For details, see the paper
YOLO works across a variety of natural images
It also generalizes well to new domains (like art)
Visualizing
importance
The occlusion experiment
The occlusion experiment

Block different parts of the image and see how the classification score
changes
The occlusion experiment

Block different parts of the image and see how the classification score
changes

The face of the


dog is more
important for
correct
classification
The occlusion experiment

Create a map where each pixel represents the classification probability when an occlusion square is placed in that region (high values where occlusion barely matters, small values where the occluded region is important for classification); a minimal sketch follows below.
The occlusion experiment

Most important pixels for


classification
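A minimal sketch of the occlusion experiment for a Keras classifier: slide a square patch over the image and record the probability of the target class at each position. The model variable, patch size, and stride are illustrative assumptions.

import numpy as np

def occlusion_map(model, img, target_class, patch=32, stride=16):
    # img: a single preprocessed image of shape (H, W, C)
    h, w, _ = img.shape
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = img.copy()
            occluded[y:y + patch, x:x + patch, :] = 0.0      # blank out one square
            prob = model.predict(occluded[None, ...], verbose=0)[0, target_class]
            heatmap[i, j] = prob   # low value: the occluded region was important
    return heatmap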
Semantic Segmentation
Semantic Segmentation
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation idea: CNN

Challenge: output size is of the


order of input size
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation Limitation
Transfer Learning
Transfer Learning

● Training your own model can be difficult with limited data and other resources; for example, it is a laborious task to manually annotate your own training dataset.
● Why not reuse already pre-trained models?
Transfer Learning

Distribution Distribution

Use what has been


learned for another
setting
Transfer Learning for Images

Low-level features → Mid-level features → Top-level features
Transfer Learning
Trained on
ImageNet

Feature
extraction
Transfer Learning
Trained on
ImageNet

Decision layers

Parts of an object (wheel, window)

Simple geometrical shapes (circles, etc)

Edges
Transfer Learning
Trained on
ImageNet

New dataset with C


classes
Transfer Learning

If the dataset is big enough, train more layers with a low learning rate (see the sketch below).
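A minimal feature-extraction sketch with Keras: reuse a network pretrained on ImageNet, freeze its convolutional layers, and train only a new classification head for the new dataset. The class count and input size are placeholders; fine-tuning follows the same pattern with some top layers unfrozen and a low learning rate.

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

C = 10  # number of classes in the new dataset (placeholder)

# Pretrained feature extractor without the original ImageNet decision layers
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # feature extraction: keep the pretrained weights fixed

x = GlobalAveragePooling2D()(base.output)
outputs = Dense(C, activation='softmax')(x)
model = Model(base.input, outputs)

model.compile(optimizer=Adam(learning_rate=1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])

# If the new dataset is big enough, unfreeze the top of the base network
# and continue training with a much lower learning rate, e.g. Adam(1e-5).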
When Transfer Learning makes Sense

● When tasks T1 and T2 have the same input (e.g. an RGB image)
● When you have more data for task T1 than for task T2
● When the low-level features for T1 could be useful to learn T2
Now you are:

● Ready to perform image classification on any dataset


● Ready to design your own architecture
● Ready to deal with other problems such as semantic
segmentation (Fully Convolutional Network)
Deep Learning for Natural Language
Processing (NLP)
Natural
Language
Processing
(NLP)
Natural
Language
Processing
(NLP)

Deep Learning for Natural Language Processing (NLP)


Sequence Modelling
Sequence modeling

?
Sequence modeling
Sequence modeling
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequence Modeling Applications

Task | Input data | Output
Speech Recognition | speech audio | text
Machine Translation | "This is an apple." | "यह एक सेब है।" (Hindi), "ਇਹ ਇੱਕ ਸੇਬ ਹੈ।" (Punjabi)
Language Modeling | "Recurrent neural __?" | "Network"
Named Entity Recognition | "Mark Zuckerberg is one of the founders of Facebook, a company from the United States" | "Person": Mark Zuckerberg, "Company": Facebook, "Location": United States
Sentiment Classification | "There is nothing to like in this movie." | negative sentiment
Video Activity Recognition | video clip | "Punching"
NLP Tasks Overview
Language Translation
Query Recommendations
Spelling and Grammar Corrections
Sentiment Analysis
Topic modeling

Document-Word Matrix
NLP Tasks

(chat about a specific topic)

(chat about any topic)


ChatGPT
NLP Tasks challenge
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______.
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______. French
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______. French
• Differences in sequence order.
The food was good, not bad at all
The food was bad, not good at all.
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______. French
• Differences in sequence order.
The food was good, not bad at all
The food was bad, not good at all.
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
• Track long-term dependencies
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
• Track long-term dependencies
• Maintain information about the order
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
• Track long-term dependencies
• Maintain information about the order
Sequence to Sequence Learning
with Neural Networks
Feed-Forward Neural Network
Feed-Forward Neural Network

Feed-forward networks are difficult to use for speech recognition, question answering, machine translation, and other sequence-to-sequence problems.
Sequence to Sequence Problem

One to One

Image Classification
Sequence to Sequence Problem

One to One One to many

Image Classification Image Captioning


Sequence to Sequence Problem

One to One One to many many to one

Image Classification Image Captioning Language Recognition


Sequence to Sequence Problem

Many to Many

Machine Translation
Sequence to Sequence Problem

Many to Many Many to many

Machine Translation Video Activity Recognition


seq2seq Learning

• Seq2seq models are neural network architectures used to transform input sequences into output sequences of variable length.
seq2seq Learning

• Seq2seq models are neural network architectures used to transform input sequences into output sequences of variable length.
• Seq2seq does the sequence transformation using a simple RNN, or using an LSTM or GRU to avoid the problem of vanishing gradients (a minimal encoder-decoder sketch follows below).
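A minimal Keras sketch of an LSTM encoder-decoder (seq2seq) trained with teacher forcing; the vocabulary sizes and hidden dimension are placeholders, and inputs are assumed to be one-hot encoded.

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, latent_dim = 5000, 6000, 256   # placeholder sizes

# Encoder: read the source sequence and keep only its final states
encoder_inputs = Input(shape=(None, src_vocab))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the target sequence conditioned on the encoder states
decoder_inputs = Input(shape=(None, tgt_vocab))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=[state_h, state_c])
outputs = Dense(tgt_vocab, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')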
Basics of RNN
RNN

Recurrent Neural Networks (RNNs) are a family of neural networks that:

● Process sequence data
● Take sequential input of variable length
● Apply the same weights on each step
● Can produce output of variable length (see the cell-update sketch below)
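The core recurrence can be written in a few lines of NumPy: the same weight matrices are applied at every time step, and the hidden state carries information forward. This is a minimal sketch with made-up dimensions, not a training-ready implementation.

import numpy as np

input_dim, hidden_dim = 8, 16
W_xh = np.random.randn(hidden_dim, input_dim) * 0.01    # input-to-hidden weights
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.01   # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # One time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# The same weights are reused at every step of a length-5 input sequence
h = np.zeros(hidden_dim)
for x_t in np.random.randn(5, input_dim):
    h = rnn_step(x_t, h)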
Output Vector

Input Vector
Recurrent Neural Networks

Output Vector

Recurrent
Cell h0 h1 h2

Input Vector
Recurrent Neural Networks

Output

ht

Input
Recurrent Neural Networks

Output

ht

Input
Recurrent Neural Networks

Output

ht

Input
Recurrent Neural Networks

Output

ht

Input
RNN Parameters

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Parameters

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Parameters

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Computational Graph Across Time

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Computational
Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
Sequence to
Sequence Modeling
using RNNs
Many to Many
Many to One
One to Many
Training RNNs
RNN State Update
Output Vector
Output Vector

Recurrent
Cell
ht
RNN Cell update

Input Vector
Input Vector
Vanilla RNN
Gradient Flow

Backpropagation from ht to ht-1


Backpropagation steps

• Input feedforward in the network


• Compute the Loss
• Take the derivative of the Loss with respect to each parameter
• Update parameters to minimize the Loss
Backpropagation Through Time (BPTT)
Backpropagation Through Time (BPTT)
Backpropagation Through Time
Backpropagation Through Time
Truncated Backpropagation Through Time
Truncated Backpropagation Through Time
Truncated Backpropagation Through Time
Problems in RNN

● Slow to train (inefficient / not parallelizable)


● Suffer from exploding or vanishing gradients
● Cannot handle very long-term dependencies
Problems in RNN

● Slow to train (inefficient / not parallelizable)


● Suffer from exploding or vanishing gradients
● Cannot handle very long-term dependencies

Solution: LSTM or GRU


Deep Learning for Natural Language
Processing (NLP)
RNN

Backpropagation from ht to ht-1


Exploding/vanishing gradients issues in RNN
LSTM vs RNN: big picture
LSTMs

● LSTMs were proposed in 1997 as a solution to the vanishing gradient problem in traditional RNNs.
LSTMs

● LSTMs were proposed in 1997 as a solution to the vanishing gradient problem in traditional RNNs.
● The basic unit of an LSTM is a cell, which contains a hidden state and a cell state.
○ Cell state is used to store information over time
○ Hidden state is used to selectively output information from the cell
state at each time step.
hidden state in LSTMs

● Metaphor: The hidden state of the neural network can be considered as a


short-term memory.
hidden state in LSTMs

● Metaphor: The hidden state of the neural network can be considered as a


short-term memory.
● LSTM architecture tries to make this short-term memory last as long as
possible by preventing vanishing gradients.
hidden state in LSTMs

● Metaphor: The hidden state of the neural network can be considered as a


short-term memory.
● LSTM architecture tries to make this short-term memory last as long as
possible by preventing vanishing gradients.
● LSTMs allow the model to selectively forget or remember information
over time.
LSTM vs RNN: inside picture
Gating mechanism
Information flow in an LSTM
Observe x
Information flow in an LSTM
Input
Information flow in an LSTM
Don’t forget
Information flow in an LSTM
Don’t forget
Information flow in an LSTM
Output
LSTMs gates
● To control the flow of information into and out of the cell, LSTMs use
three types of gates:
○ Input gate: decides which information from the current input to include in the
cell state.
○ Forget gate: decides which information from the previous cell state
to forget.
○ Output gate: decides which information from the current cell state to
output as the hidden state (the gate equations are written out below).
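Written out in one common formulation (σ is the logistic sigmoid, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] denotes concatenation), the gates and state updates are:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)          (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)          (input gate)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)       (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t              (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)          (output gate)
h_t = o_t ⊙ tanh(c_t)                         (hidden state / output)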
Long short-term memory (LSTM) Key idea

Long short-term memory (LSTM) Key idea
A look inside an LSTM cell

Forget gate

Input gate

Update candidate

Memory cell update

Output gate

Output
Observation

● If the forget gate is always 1 and the input gate is always 0, the memory cell internal
state will remain constant forever.
● However, input gates and forget gates give the flexibility to learn when to keep this
value unchanged and when to perturb it in response to subsequent inputs.
What about gradient flow?
Long Short Term Memory (LSTM): Gradient Flow
Long Short Term Memory (LSTM): Gradient Flow
Long Short Term Memory (LSTM): Gradient Flow
Gated Recurrent Units (GRU)
Main solution for better RNNs: Units

● Gated recurrent units (GRUs) are a gating mechanism in recurrent neural


networks, introduced in 2014 by Kyunghyun Cho et al.
Main solution for better RNNs: Units

● Gated recurrent units (GRUs) are a gating mechanism in recurrent neural


networks, introduced in 2014 by Kyunghyun Cho et al.
● GRU is like a long short-term memory (LSTM) but has fewer parameters
than LSTM, as it lacks an output gate.
Main solution for better RNNs: Units

● Gated recurrent units (GRUs) are a gating mechanism in recurrent neural


networks, introduced in 2014 by Kyunghyun Cho et al.
● GRU is like a long short-term memory (LSTM) but has fewer parameters
than LSTM, as it lacks an output gate.
● GRU's performance on certain tasks of polyphonic music modeling,
speech signal modeling and natural language processing was found to be
similar to that of LSTM.
Main solution for better RNNs: Units

● In GRU, the LSTM’s three gates are replaced by two


● Reset gate controls how much of the previous state we might still want to
remember.
● Update gate allows us to control how much of the new state is just
a copy of the old state (one common set of update equations is shown below).
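In one common formulation (conventions for the update gate vary between papers), the GRU computations are:

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)               (reset gate)
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)               (update gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)      (candidate state)
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t             (new state)

Here an update gate close to 1 keeps the old state, matching the description above; some papers swap the roles of z_t and (1 - z_t).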
Gated Recurrent Units (GRU)
Gated Recurrent Units (GRU)

● Units with short-term dependencies will


have active reset gates.
● Units with long term dependencies have
active update gates.
Gated Recurrent Units (GRU)

Memory Content
Gated Recurrent Units (GRU)

Final Memory at current time step


Bidirectional RNNs
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation

What about the right context


Bidirectional RNNs

For classification you want to incorporate information from words both


preceding and following.
Bidirectional RNNs

For classification you want to incorporate information from words both


preceding and following.
Two types of connections:
1) One going forward in time, which helps us learn from previous
representations
2) Another going backward in time, which helps us learn from future
representations
Bidirectional RNNs

For classification you want to incorporate information from words both


preceding and following.
Two types of connections:
1) One going forward in time, which helps us learn from previous
representations
2) Another going backward in time, which helps us learn from future
representations
Bidirectional RNNs can better exploit context in both directions.
Bidirectional RNNs
Bidirectional RNNs

Data is processed in both


directions with two
separate hidden layers,
which are then fed
forward into the same
output layer.
Bidirectional RNNs
Bidirectional RNNs
Bidirectional RNNs
Bidirectional RNNs
Bidirectional RNNs: simplified diagram
Bidirectional RNNs

● Bidirectional RNNs are only applicable if you have access to the entire input sequence.
○ Bidirectional RNNs are not applicable to Language Modeling, because in LM you
only have left context available.
● Bidirectional LSTMs perform better than unidirectional ones in speech recognition.
Deep RNNs
Single-Layer RNNs
Single-Layer RNNs
Multi-layer RNNs
Multi-layer RNNs
yt

ht2

ht1

Input
Multi-layer RNNs
yt

ht2

ht1

Input
Multi-layer RNNs
yt

ht2

ht1

Input
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
● This allows the network to compute more complex representations
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
● This allows the network to compute more complex representations
● The lower RNNs should compute lower-level features and the higher
RNNs should compute higher-level features.
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
● This allows the network to compute more complex representations
● The lower RNNs should compute lower-level features and the higher
RNNs should compute higher-level features.
● Multi-layer RNNs are also called stacked RNNs (see the sketch below)
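A minimal Keras sketch of a two-layer (stacked) bidirectional LSTM for sequence classification; the vocabulary size, embedding dimension, and number of classes are placeholders.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Input(shape=(None,)),                            # variable-length sequences of token ids
    Embedding(input_dim=10000, output_dim=128),      # placeholder vocabulary size
    # The lower layer returns the full sequence so the upper layer can consume it
    Bidirectional(LSTM(64, return_sequences=True)),
    # The upper layer returns only its final output for classification
    Bidirectional(LSTM(32)),
    Dense(5, activation='softmax'),                  # placeholder number of classes
])
model.summary()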
