
Deep learning for Computer Vision

Outline

● Introduction to Computer Vision


● Problems using Fully Connected Networks on Images
● What are convolutions?
● Convolution on Images
● Stride, Padding, Pooling
● Dimension of a Convolution Layer
● Stacking Convolution Layers
● Convolutional Neural Networks for images and their applications
Computer Vision Tasks
Deep learning for Computer Vision
2018 Turing Award

The 2018 Turing Award was awarded jointly to Yoshua Bengio, Geoffrey Hinton, and Yann LeCun for their pioneering work on deep learning.

Geoffrey E. Hinton is known by many to be the godfather of deep learning.

Geoffrey Everest Hinton early in his career. His middle name comes from a relative, George Everest, who surveyed India.
Geoffrey E. Hinton

● Geoffrey Hinton's foundational contribution to the backpropagation algorithm can be traced back to his work on "Learning Representations by Back-Propagating Errors".
● Hinton's work helped to establish the theoretical foundations for training CNNs and RNNs and showed that it is possible to learn useful representations using deep neural networks.
● For example, the "backpropagation through time" (BPTT) algorithm is a variant of the backpropagation algorithm that is specifically designed for training recurrent neural networks.
Yann LeCun

● Yann LeCun is well known for his work on Convolutional Neural Networks, representation learning, and geometric deep learning.
● LeNet is a convolutional neural network (CNN) architecture designed by Yann LeCun et al. in 1998.
● LeNet is used for the recognition of handwritten digits in the MNIST dataset.
Yoshua Bengio

● Yoshua Bengio's early work "Learning Long-Term Dependencies with Gradient Descent is Difficult" uncovered a fundamental difficulty of learning in RNNs.
● Bengio has also made significant contributions to
○ Recurrent neural networks (RNNs)
○ Word embeddings from neural networks and neural language models
○ Unsupervised deep learning based on auto-encoders
○ Introducing Generative Adversarial Networks (GANs)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

● The ILSVRC challenge is designed to evaluate and advance image classification and object detection algorithms.
● ILSVRC is an annual computer vision competition that was first held in 2010.
● The competition uses the ImageNet dataset, which contains millions of images organized into thousands of categories.
● The challenge has been a driving force behind many of the recent advances
in computer vision, including self-driving cars, robotics, and augmented
reality.
Problems using Fully Connected Networks on Images
Fully Connected Neural Network (FC)

Problems using FC Layers on Images

How to process a tiny image with FC layers
Disadvantages of FC Layers for Image Data

● Dense layers require a lot of parameters, which can lead to overfitting when
the number of input features is large.
● Dense layers are not translation invariant, meaning that small shifts in the
input image can result in large changes in the output.
● Dense layers do not take advantage of the spatial structure of images, and can therefore be inefficient for processing large images (a rough parameter count is sketched below).
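As a rough illustration (the sizes below are assumed for this sketch, not taken from the slides), a single dense layer on a flattened 224x224 RGB image already needs over 150 million parameters:

# Parameter count of one dense layer on a flattened 224x224x3 image (illustrative sizes).
inputs = 224 * 224 * 3            # flattened input features
units = 1000                      # hidden units chosen for illustration
params = inputs * units + units   # weights plus biases
print(params)                     # 150,529,000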
Convolutional Neural Networks for Images

What are convolutions?
What are Convolutions?
Discrete case: box filter

Slide the filter kernel from left to right; at each position, compute a single value in the output data.
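A minimal NumPy sketch of this sliding operation (the 1D signal and the 3-tap box filter below are illustrative):

import numpy as np

signal = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)  # illustrative 1D input
box = np.ones(3) / 3.0                                  # 3-tap box filter
output = np.convolve(signal, box, mode='valid')         # slide the kernel, one value per position
print(output)  # [0.333... 0.666... 1. 0.666... 0.333...]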
Convolution on Images
Convolutions on images
● We just slide the kernel over
the input image
● Each time we slide the
kernel we get one value in
the output
● The resulting output is called
a feature map.
● We can use multiple filters to
get multiple feature maps.
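A small sketch of the same idea in NumPy (the toy 5x5 image and 2x2 kernel below are assumed values for illustration):

import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 kernel

h = image.shape[0] - kernel.shape[0] + 1           # output height (no padding, stride 1)
w = image.shape[1] - kernel.shape[1] + 1           # output width
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        patch = image[i:i + 2, j:j + 2]            # region currently under the kernel
        feature_map[i, j] = np.sum(patch * kernel) # one output value per position
print(feature_map.shape)                            # (4, 4)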

The aim is to learn useful kernels using deep learning.
Feature Map Dimension

Input
Filter
Output

black and white image


Convolution Layer

RGB image
Stride
The stride determines how much the convolutional
kernel is moved across the input image at each step.
Convolution on images

Input
Filter
Stride
Output
Feature Map Dimension

Input
Filter
Stride
Output
Feature Map Dimension

Input
Filter
Stride
Output

The filter does not fit, so this is not a valid convolution position.
Feature Map Dimension

Input
Filter
Stride
Output

A fractional output size is not valid.


Stride
The advantages of using stride include:

● Stride reduces the size of the output feature map, which can help to reduce the computational cost and memory requirements of the network.
● Striding can make the network more efficient by reducing the number of operations required to process the input data.
● Striding can help to reduce overfitting by reducing the number of parameters in the network and forcing the network to learn more abstract features.
Deep learning for Computer Vision
Problems using FC Layers on Images

How to process a tiny image with FC layers


Solution is Convolution
Parameter Sharing

● The same set of weights is used to compute the output for all neurons in the same feature map.
● Parameter sharing reduces the number of parameters needed to train.
● Parameter sharing enables the model to learn translation-invariant features, for example, a kernel for edge detection.
Edge Detection by Convolution
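The slide's figure is an image; as a stand-in, here is a small sketch with a horizontal-difference kernel that responds at a vertical edge (all values are illustrative):

import numpy as np

# Toy image: dark left half, bright right half, so there is a vertical edge in the middle.
img = np.zeros((6, 6))
img[:, 3:] = 1.0

k = np.array([[1.0, -1.0]])  # 1x2 difference kernel (illustrative)

out_h, out_w = img.shape[0], img.shape[1] - 1
edges = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        edges[i, j] = np.sum(img[i:i + 1, j:j + 2] * k)
print(edges[0])  # nonzero only at the column where the intensity changes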
Efficiency of Convolution

Input: 320 x 280


Kernel: 2 x 1
Output: 319 x 280

A fully connected layer mapping this input to this output would require a huge number of parameters.
Features

The aim is to learn useful kernels using deep learning.

Feature Extraction using CNN
Padding
Recall Stride

Input
Filter
Stride
Output

A fractional output size is not valid.


Convolution Layers: Dimensions

Input Image

Always shrinking the spatial size may not be a good approach (information loss).
Padding (Key idea)
Convolution Layers: Padding

● Padding refers to the process of adding additional rows and columns of zeros around the edges of the input image.

Zero Padding
Convolution Layers: Padding

Why padding?

● Sizes get smaller too quickly
● Corner pixels are used only once
Convolution Layers: Padding

● Preserving the spatial dimensions


of the output feature map
● Reducing information loss at the
edges of the image
● Reducing the effect of edge pixels
on the output

Zero Padding
Convolution Layers: Padding

Most common is zero padding


Feature Map Dimension

Most common is zero padding

Output size: O = ⌊(N + 2P − F) / S⌋ + 1

where ⌊·⌋ denotes the floor operator, N is the input size, F the filter size, P the padding, and S the stride.
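A minimal helper implementing this formula (square inputs and filters assumed):

def conv_output_size(n, f, p=0, s=1):
    # floor division implements the floor operator in the formula above
    return (n + 2 * p - f) // s + 1

print(conv_output_size(32, 5))       # 28: a 5x5 filter shrinks a 32x32 input
print(conv_output_size(32, 5, p=2))  # 32: 'same' padding for a 5x5 filter
print(conv_output_size(7, 3, s=2))   # 3: stride 2 roughly halves the size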


Padding Types

● Valid Padding:
○ No padding at all.
○ The output feature map is smaller than the input feature map.

● Same Padding:
○ Adding enough padding to the input image so that the output feature map has the same size as the input image.
○ Set padding to P = (F − 1)/2 with stride S = 1.
○ Verify with the output-size formula above (a quick check follows).
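A quick numeric check of the claim (sizes assumed):

# With S = 1 and P = (F - 1) / 2, the output size equals the input size.
n, f = 32, 5
p = (f - 1) // 2                 # P = 2 for a 5x5 filter
print((n + 2 * p - f) // 1 + 1)  # 32, same as the input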
Padding

● Reflective Padding:
○ The padded pixels are not filled with zeros, but with the reflected values
of the input image.
○ This type of padding is useful when the input image contains edges or
other sharp features that would be distorted by zero padding.

Original tensor:          Reflect-padded tensor:

[[1 2 3]                  [[5 4 5 6 5]
 [4 5 6]                   [2 1 2 3 2]
 [7 8 9]]                  [5 4 5 6 5]
                           [8 7 8 9 8]
                           [5 4 5 6 5]]
Padding

● Symmetric Padding:
○ The padded pixels are filled with mirrored values of the input image, including the edge values themselves.

Original tensor:          Symmetric-padded tensor:

[[1 2 3]                  [[1 1 2 3 3]
 [4 5 6]                   [1 1 2 3 3]
 [7 8 9]]                  [4 4 5 6 6]
                           [7 7 8 9 9]
                           [7 7 8 9 9]]
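Both padded tensors above can be reproduced with np.pad:

import numpy as np

t = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.pad(t, 1, mode='reflect'))    # mirrors the values without repeating the edge row/column
print(np.pad(t, 1, mode='symmetric'))  # mirrors the values including the edge row/column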
Feature Map Dimension (padding and stride)
Convolution Example

Conv2D(filters=10, kernel_size=(5, 5), strides=1, padding='same', input_shape=(3, 32, 32))


Stacking Convolution Layers
Convolution Layer

● A basic layer is defined by
— Filter width and height (depth is implicitly given)
— Number of different filters (i.e., number of weight sets)

● Each filter captures a different image characteristic

Convolution Layer

Conv2D(filters=1, kernel_size=(5, 5), strides=1, padding='valid', input_shape=(3, 32, 32))



Conv2D(filters=6, kernel_size=(5, 5), strides=1, padding='valid', input_shape=(3, 32, 32))


Convolution Layer
#parameters

● Each filter has a size of 5x5x3 (since there are 3 input channels).
● There are 6 filters in the layer. Plus there is one bias term for each filter.
● Therefore, the total number of parameters is 6*(5x5x3+1) = 456.
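A short sketch to confirm this count with Keras (a channels-last input shape is assumed here):

from keras.layers import Input, Conv2D
from keras.models import Model

inputs = Input(shape=(32, 32, 3))   # 3 input channels (channels-last layout)
conv = Conv2D(filters=6, kernel_size=(5, 5), strides=1, padding='valid')(inputs)
print(Model(inputs, conv).count_params())  # 456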
Stacking Convolution Layers
Pooling (sub-sampling)
● Conv Layer = Feature Extraction
— Computes a feature in a given region

Pooling
● Pooling Layer = Feature Selection
— Picks the strongest activation in a region
Pooling Layer: Max Pooling

inputs = Input(shape=(4, 4, 1))

# Max pooling layer


max_pool = MaxPooling2D(pool_size=(2, 2), strides=2)(inputs)
Pooling Layer: Average Pooling

● Typically used deeper in the network

inputs = Input(shape=(4, 4, 1))

# Average pooling layer


avg_pool = AveragePooling2D(pool_size=(2, 2), strides=2)(inputs)
Pooling Layer

Common settings: F = 2, S = 2 (or F = 3, S = 2)
Pooling Layer

● The pooling layer reduces the spatial dimensions of the input, which reduces the computational cost of the network.
● Pooling layers aim to retain the most important features, which helps the network learn more robust representations of the input data.
● Pooling layers can improve the translation invariance of the network by selecting the maximum or average value in a given region.
● Pooling layers provide distortion invariance by selecting the maximum or average value in a region, which reduces the effect of small variations.
Receptive Field
Receptive Field

● Receptive field refers to the area of the input that is used


by a particular neuron or feature map in the network.
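A common way to compute the receptive field layer by layer (a sketch; the kernel sizes and strides below are illustrative):

# r_out = r_in + (kernel - 1) * jump, where 'jump' is the product of all previous strides.
def receptive_field(layers):
    r, jump = 1, 1
    for kernel, stride in layers:
        r = r + (kernel - 1) * jump
        jump *= stride
    return r

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three stacked 3x3 convolutions
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: a stride-2 layer enlarges it faster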
Receptive Field (example)
Deep learning for Computer Vision
Receptive Field

● The receptive field refers to the area of the input image that is used by a particular neuron or feature map in the network.

Receptive Field (example)
Receptive fields (Advantages)

● Capturing global features


● Improved recognition accuracy
● Improved generalization
Receptive fields (Disadvantages)

● Increased computational cost


● Reduced localization accuracy
● Limited context modeling
Sparsity

● This is what a regular feed-forward neural network looks like.

● There are many dense connections.
Sparsity

● Reduced number of parameters using convolution (compared to a fully connected layer)
● Efficient computation and low memory usage using convolution
Convolutional Neural Networks (CNNs)

Schematic of typical sequences of layers in CNNs


Advantages of Convolutional Networks

● Shared weights: reduces the number of parameters and helps to prevent overfitting.
● Translation invariance: small shifts in the input image result in small changes in the output.
● Flexibility of design: application-specific networks can be built by stacking convolution layers.
Feature Extraction using CNNs

● CNNs take advantage of the spatial structure for feature extraction.
● Learn hierarchical representations: starting with low-level features (edges) and progressing to higher-level features (shapes).
● Learn complex representations: convolutional layers can be easily stacked to create deeper networks.
Basics of CNN (summary)

● Introduction to Computer Vision


● Problems using Fully Connected Networks on Images
● Convolution on Images
● Stride, Padding, Pooling
● Stacking Convolution Layers
● Receptive Field and Sparsity
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun


in 1998.
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun


in 1998.
● LeNet uses a gradient-based learning algorithm (backpropagation) for
training.
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun


in 1998.
● LeNet uses a gradient-based learning algorithm (backpropagation) for
training.
● LeNet was designed specifically for handwritten digit recognition, and was
trained on the MNIST dataset of handwritten digits.
LeNet

● LeNet is a convolutional neural network architecture designed by Yann LeCun in 1998.
● LeNet uses a gradient-based learning algorithm (backpropagation) for training.
● LeNet was designed specifically for handwritten digit recognition and was trained on the MNIST dataset of handwritten digits.
● LeNet has also been adapted and extended for other image recognition tasks.
LeNet

Digit recognition: 10 classes


LeNet

Digit recognition: 10 classes

» Valid convolution: size shrinks


LeNet

Digit recognition: 10 classes

At that time, average pooling was used; now max pooling is much more common.
LeNet

● Convolutional layer 1: This layer applies a set of filters to the input image to
extract features such as edges and corners.

● Subsampling layer 1:
○ Reduces the spatial size of the feature maps, while retaining the most
important features.
○ Reduce the computational complexity.
LeNet

Digit recognition: 10 classes

● The second layer is another convolutional layer with 16 feature maps,


followed by another pooling layer.
LeNet

● Convolutional layer 2: Allowing the network to learn more complex features.

● Subsampling layer 2: Reduces the size of the feature maps, while retaining
the most important features.
LeNet

Digit recognition: 10 classes

LeNet

● Fully connected layers 1 and 2:
○ The third layer is a fully connected layer with 120 units, followed by a fourth layer with 84 units, both using a ReLU activation (updated from the original LeNet, which used tanh-like activations).
○ These layers learn non-linear combinations of the features extracted by the convolutional layers.
LeNet

Digit recognition: 10 classes

● Output layer: a fully connected layer with 10 units, using a softmax activation function to produce the final output probabilities.
from keras.models import Model
from keras.layers import Input, Conv2D, AveragePooling2D, Flatten, Dense

# Input layer
inputs = Input(shape=(32, 32, 1))

# Convolutional layer 1
conv1 = Conv2D(filters=6, kernel_size=(5, 5), strides=(1, 1), activation='relu',
padding='same')(inputs)

# Average pooling layer 1


pool1 = AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(conv1)

# Convolutional layer 2
conv2 = Conv2D(filters=16, kernel_size=(5, 5), strides=(1, 1), activation='relu',
padding='valid')(pool1)

# Average pooling layer 2


pool2 = AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(conv2)

# Flatten layer
flatten = Flatten()(pool2)

# Fully connected layer 1


fc1 = Dense(units=120, activation='relu')(flatten)
# Fully connected layer 2
fc2 = Dense(units=84, activation='relu')(fc1)

# Output layer
outputs = Dense(units=10, activation='softmax')(fc2)
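
# Assemble the LeNet-style model from the layers defined above (a sketch).
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()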
Advantages of Convolutional Networks
Special Convolution (Depth-wise Separable Convolutions)
Normal convolutions

Normal convolution acts on all channels

# Input layer for an image of size 12x12 with 3 channels


inputs = Input(shape=(12, 12, 3))

# Convolutional layer with 256 filters of size 5x5


conv = Conv2D(filters=256, kernel_size=(5, 5), padding='valid')(inputs)
Depth-wise separable convolutions

1. Apply a separate filter to each input channel to produce a set of


intermediate feature maps.
Depth-wise separable convolutions

2. Use a 1x1 convolution to combine the intermediate feature maps


into the final output.
Depth-wise separable convolutions

from keras.layers import Input, DepthwiseConv2D, Conv2D, BatchNormalization, Activation

# Input layer for an image of size 12x12 with 3 channels


inputs = Input(shape=(12, 12, 3))

# Depthwise convolutional layer


depthwise_conv = DepthwiseConv2D(kernel_size=(5, 5), padding='valid')(inputs)
depthwise_conv = Activation('relu')(depthwise_conv)

# Pointwise convolutional layer


pointwise_conv = Conv2D(filters=256, kernel_size=(1, 1))(depthwise_conv)
pointwise_conv = Activation('relu')(pointwise_conv)
But why?

● Each filter has a kernel size of 5x5x3 (there are 3


input channels). Plus one bias term for each filter.
● Total number of parameters in the Convolutional
layer is 256x(5x5x3+1) = 19456.

Simple convolution
But why?

Depthwise Convolutional layer:


● Each channel of the input tensor is convolved
separately using a 5x5 kernel.
● There are 3 input channels and a depth multiplier
of 1, so there are 3 depthwise filters in total.
● Each depthwise filter has 5x5=25 parameters.
Plus one bias term for each filter.
● Total number of parameters in the Depthwise
Convolutional layer is 3x(25+1) = 78.

Pointwise Convolutional layer:


● There are 256 output filters, and the input has 3
channels.
● Each filter has a kernel size of 1x1x3 (as 3 input
channels). Plus one bias term for each filter.
● Total number of parameters in the Pointwise
Convolutional layer is 256x(1x1x3+1) = 1024.
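A short sketch that confirms both counts with Keras (channels-last layout assumed):

from keras.layers import Input, DepthwiseConv2D, Conv2D
from keras.models import Model

inp = Input(shape=(12, 12, 3))

standard = Model(inp, Conv2D(256, (5, 5), padding='valid')(inp))
print(standard.count_params())   # 19456

x = DepthwiseConv2D((5, 5), padding='valid')(inp)   # 3 * (5*5 + 1) = 78 parameters
x = Conv2D(256, (1, 1))(x)                           # 256 * (1*1*3 + 1) = 1024 parameters
separable = Model(inp, x)
print(separable.count_params())  # 1102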
Depthwise Separable Convolution

● Fewer parameters: Compared to a standard convolution, depthwise


separable convolution requires fewer parameters to learn.
● Computationally efficient: The depthwise convolution requires fewer
computations than a standard convolution
● Improved performance: Depthwise separable convolution can achieve
similar or better performance than a standard convolution.
● Reduced overfitting: By reducing the number of parameters, depthwise
separable convolution can help reduce overfitting.
Deep learning for Computer Vision
More Special Convolutions
Transpose Convolution: 1D Example
Transposed Convolution

● Convolution outputs tensor with a smaller spatial dimension, whereas


transposed convolution outputs tensor with a larger spatial dimension.
● Like convolution, transposed convolution uses learnable weights.
● In Transposed convolution, the output is upsampled by a factor of the
stride. For example, output_size = (input_size - 1) * stride +
kernel_size
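A small shape check of that formula (the input size, kernel, and stride below are assumed for illustration):

from tensorflow.keras.layers import Input, Conv2DTranspose
from tensorflow.keras.models import Model

inp = Input(shape=(4, 4, 1))
out = Conv2DTranspose(filters=1, kernel_size=2, strides=2, padding='valid')(inp)
print(Model(inp, out).output_shape)   # (None, 8, 8, 1): (4 - 1) * 2 + 2 = 8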
Transpose Convolution: 2D Example

Stride 1

Transposed convolution with a 2x2 kernel. The shaded portions are a portion of an intermediate
tensor as well as the input and kernel tensor elements used for the computation.
Stride 2
Transposed Convolution

# Add a transposed convolutional layer with 2 filters, kernel size 2x2, stride 2, no padding
Conv2DTranspose(2, (2, 2), strides=(2, 2), padding='valid', input_shape=(height, width, channels))

● Conv2DTranspose: This is the Keras layer class for a transposed convolution.


It is added to the model using the add method.

● 2: This is the number of filters in the layer.

● (2, 2): This is the size of the convolutional kernel in the layer.

● strides=(2, 2): stride of the transposed convolution to 2 in both the vertical


and horizontal directions.

● padding='valid': no padding is added.

● input_shape=(height, width, channels): the shape of the input to the layer.


Transposed Convolution

Applications of Transposed Convolution:


● Generating high-resolution images from low-resolution input
● Encoder-Decoder networks
○ Transposed Convolution used for image denoising, inpainting,
and super-resolution
○ Transposed Convolution used for semantic segmentation
Transposed Convolution

Disadvantages of Transposed Convolution:


● Can generate low-quality or unrealistic images if the training data is
insufficient or the model architecture is not well-suited for the task
● Can lead to checkerboard artifacts in the output images, where the
same pixel values are repeated in a grid-like pattern
Transposed Convolution

Alternatives to Transposed Convolution:


● Dense upsampling layers, computationally expensive and require
more memory than convolutional layers.
● Upsampling layers, such as bilinear or nearest-neighbor interpolation
(without learning any new parameters).
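A sketch of this interpolation-based alternative (the sizes and filter counts are assumptions for illustration): upsample without learned weights, then optionally refine with a normal convolution.

from tensorflow.keras.layers import Input, UpSampling2D, Conv2D
from tensorflow.keras.models import Model

inp = Input(shape=(8, 8, 16))
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(inp)  # 8x8 -> 16x16, no new parameters
x = Conv2D(16, (3, 3), padding='same', activation='relu')(x)  # optional learned refinement
print(Model(inp, x).output_shape)  # (None, 16, 16, 16)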
Classical Architecture
Alexnet, ResNet, VGG-16, VGG-19, and Inception Net
ImageNet

AI researcher Fei-Fei Li
began working on the idea
for ImageNet in 2006.
Classical Architecture

● Alexnet
● VGG Network
● ResNet
● Inception Net
Revolution of depth (ImageNet Benchmark)

AlexNet was the first CNN to achieve strong performance on the ImageNet dataset.

Non-CNN
CNN
Alexnet
CNNs Success
Alexnet

● The architecture of AlexNet was inspired by LeNet.


● AlexNet is a convolutional neural network (CNN) architecture
designed by Alex Krizhevsky in 2012.
Alexnet

● The architecture of AlexNet was inspired by LeNet.


● AlexNet is a convolutional neural network (CNN) architecture
designed by Alex Krizhevsky in 2012.
● AlexNet was one of the first CNNs to achieve a large improvement in image classification performance on the ImageNet dataset.
● AlexNet helped to popularize deep learning and convolutional neural
networks.
AlexNet

The first convolutional layer has 96 filters


AlexNet

The first convolutional layer has 96 filters

Convolution output is passed through a ReLU activation function and


then max pooled with a window size of 3x3 and a stride of 2 pixels.
AlexNet

The second convolutional layer has 256 filters.


AlexNet

The next convolutional layers apply 384 and 256 filters.


AlexNet

AlexNet uses max pooling after the first, second, and fifth convolutional
layers to reduce the spatial dimensions of the feature maps.
AlexNet

4096 units → 4096 units → 1000 units

Finally, the network has the fully connected layers followed by a softmax output layer.
AlexNet

Let the convolutional layer be denoted by CL. AlexNet has the following layers.


● CL 1: applies 96 filters of size 11x11 to input image with a stride of 4 pixels.
● CL 2: applies 256 filters of size 5x5 to the output of the first max pooling layer.
● CL 3: applies 384 filters of size 3x3 to the output of the second max pooling layer.
● CL 4: applies 384 filters of size 3x3 to the output of the third CL.
● CL 5: applies 256 filters of size 3x3 to the output of the fourth CL.
AlexNet

Fully connected layers


● FC 1: This layer has 4096 units. The output is passed through a ReLU
activation function and then subject to dropout regularization.
● FC 2: This layer is similar to the first fully connected layer, with 4096 units, a
ReLU activation function, and dropout regularization.

Output layer:

● This layer has 1000 units (i.e., number of classes in the ImageNet dataset).
● The output is passed through a softmax activation function to produce the
final class probabilities.
AlexNet
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.layers import BatchNormalization
from keras.models import Model

input_shape = (227, 227, 3) # Input shape of the image

# Define the input layer


inputs = Input(shape=input_shape)

# First convolutional layer, 96 filters, kernel size of 11x11 and stride of 4x4, followed by ReLU
conv1 = Conv2D(filters=96, kernel_size=(11, 11), strides=(4, 4), activation='relu')(inputs)

# Max pooling layer with pool size of 3x3 and stride of 2x2
pool1 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv1)

# Batch normalization layer


bn1 = BatchNormalization()(pool1)

# Second convolutional layer with 256 filters, kernel size of 5x5 and padding of same, followed by
ReLU activation
conv2 = Conv2D(filters=256, kernel_size=(5, 5), padding='same', activation='relu')(bn1)

# Max pooling layer with pool size of 3x3 and stride of 2x2
pool2 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv2)

# Batch normalization layer


bn2 = BatchNormalization()(pool2)
AlexNet
# Third convolutional layer with 384 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv3 = Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu')(bn2)
# Fourth convolutional layer with 384 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv4 = Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu')(conv3)

# Fifth convolutional layer, 256 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv5 = Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu')(conv4)
# Max pooling layer with pool size of 3x3 and stride of 2x2
pool5 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv5)
# Batch normalization layer
bn3 = BatchNormalization()(pool5)

# Flatten layer
flatten = Flatten()(bn3)

# First fully connected layer with 4096 units, followed by ReLU activation and dropout
fc1 = Dense(units=4096, activation='relu')(flatten)
dropout1 = Dropout(0.5)(fc1)
# Second fully connected layer with 4096 units, followed by ReLU activation and dropout
fc2 = Dense(units=4096, activation='relu')(dropout1)
dropout2 = Dropout(0.5)(fc2)

# Output layer with 1000 units and softmax activation


outputs = Dense(units=1000, activation='softmax')(dropout2)
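
# Assemble the AlexNet-style model from the layers above (a sketch; Model is imported from keras.models at the top of this listing).
model = Model(inputs=inputs, outputs=outputs)
model.summary()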
LeNet and AlexNet

● AlexNet is deeper and more complex than LeNet, with more layers
and more parameters.
● AlexNet has larger filters in initial layers (e.g., 11x11 in the first layer),
while LeNet uses smaller filters (e.g., 5x5).
● AlexNet achieved state-of-the-art performance on the ImageNet
dataset, while LeNet was designed and tested on a smaller
handwritten digit recognition task.
VGG Network

VGG16 showed good performance on the ImageNet benchmark.
VGG Networks

● VGG stands for Visual Geometry Group (University of Oxford) that developed
a family of CNN architectures for image classification.
● The original VGG model, VGG16, has 16 layers including 13 convolutional
layers and 3 fully connected layers.
● VGG19 has 19 layers, including 16 convolutional layers and 3 fully connected layers.
● The VGG family also includes VGG11 and VGG13, with fewer convolutional layers than VGG16 and VGG19.
● The VGG family generalizes well to many tasks such as classification, object detection, segmentation, style transfer, and transfer learning.
VGG16 architecture
VGG16 (summary)

● VGG16 has 13 convolutional layers and 3 fully connected layers (16


total).
● Convolutional layers use 3x3 filters, stride 1, and same padding, followed by a ReLU (deeper networks with fewer parameters).
● Pooling layers use 2x2 filters with a stride of 2 pixels.
● The first two convolutional layers have 64 filters each, while the
remaining layers have 128, 256, 512, and 512 filters, respectively.
● Fully connected layers have 4096 units that use ReLU, and output layer
has 1000 units corresponding to number of classes in the ImageNet.
VGG16
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input, decode_predictions

# Load the VGG16 model


model = VGG16(weights='imagenet')

# Load the image you want to classify


img_path = 'tiger_shark.jpeg'
img = image.load_img(img_path, target_size=(224, 224))

# Convert the image to an array


x = image.img_to_array(img)
x = tf.expand_dims(x, axis=0)
x = preprocess_input(x)

# Use the model to predict the class of the image


preds = model.predict(x)

# Print the top 5 predictions


print('Predicted:', decode_predictions(preds, top=5)[0])
VGG16 Implementation

# Import necessary libraries


from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

def vgg16(input_shape=(224, 224, 3), num_classes=1000):


input_tensor = Input(shape=input_shape)

# Block 1
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(input_tensor)
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x)

# Block 2
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x)

# Block 3
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv1')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv2')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(x)
VGG16 Implementation

# Block 4
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(x)

# Block 5
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(x)

# Flatten and dense layers


x = Flatten(name='flatten')(x)
x = Dense(4096, activation='relu', name='fc1')(x)
x = Dense(4096, activation='relu', name='fc2')(x)
output_tensor = Dense(num_classes, activation='softmax', name='predictions')(x)

# Create model
model = Model(inputs=input_tensor, outputs=output_tensor, name='vgg16')

return model
Deep learning for Computer Vision
Classical Architectures

Architecture | Year | Layers | Key Innovations | Parameters | Researchers
AlexNet | 2012 | 8 | CNN architecture | 62 million | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | Karen Simonyan and Andrew Zisserman
VGG16 architecture
Classical Architectures

Architecture | Year | Layers | Key Innovations | Parameters | Researchers
AlexNet | 2012 | 8 | CNN architecture | 62 million | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | — | 4-12 million | Szegedy et al.

Going Deeper with Convolutions


Inception Net
Fine-grained visual categories

Fine-grained categories are difficult to identify for uniformly scaled-up networks.


Inception Net

● Inception Net is a deep convolutional neural network architecture developed


by Google researchers in 2014.
● Inception Net won the 2014 ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) with a top-5 error rate of 6.67%.
Inception Net
Inception Net (Key Idea)

● Inception Layer: The multi-pathway convolutional blocks that enable the


network to learn complex features using fewer parameters.
● Auxiliary classifiers: At intermediate layers of the network to encourage
intermediate feature learning.
● Inception Net uses a multi-branch architecture that allows it to learn features
at multiple scales and resolutions.
Inception Layer

Tired of choosing filter sizes? Use them all!

Use padding = 'same' so that the branch outputs can be concatenated.

Size of the output? Not sustainable!
Inception Layer (key idea): 1x1 Convolutions

Recall: Convolutions on Images

1x1 Convolution

A 1x1 kernel keeps the spatial dimensions and only rescales the number of channels of the input!
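A small shape check (the channel counts are assumed for illustration):

from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

inp = Input(shape=(28, 28, 192))
out = Conv2D(filters=16, kernel_size=(1, 1), activation='relu')(inp)
print(Model(inp, out).output_shape)  # (None, 28, 28, 16): height and width unchanged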


Inception Layer: Computational Cost

Reduction of multiplications by 1/10
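With illustrative numbers (a 28x28x192 input, 32 output channels of size 5x5, and a 1x1 bottleneck down to 16 channels; none of these values are taken from the slides), the multiplication counts look like this:

direct = 28 * 28 * 32 * (5 * 5 * 192)                           # 5x5 convolution applied directly
bottleneck = 28 * 28 * 16 * 192 + 28 * 28 * 32 * (5 * 5 * 16)   # 1x1 down to 16 channels, then 5x5
print(direct)      # 120,422,400
print(bottleneck)  # 12,443,648 -> roughly one tenth of the direct cost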


Inception Layer

# Assumes these Keras layer classes are imported:
from keras.layers import Conv2D, MaxPooling2D, Concatenate

def inception_module(x, filters):
"""
Inception module of the InceptionNet
"""
tower_1 = Conv2D(filters[0], (1, 1), padding='same', activation='relu')(x)
tower_1 = Conv2D(filters[1], (3, 3), padding='same', activation='relu')(tower_1)

tower_2 = Conv2D(filters[2], (1, 1), padding='same', activation='relu')(x)


tower_2 = Conv2D(filters[3], (5, 5), padding='same', activation='relu')(tower_2)

tower_3 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(x)


tower_3 = Conv2D(filters[4], (1, 1), padding='same', activation='relu')(tower_3)

output = Concatenate(axis=-1)([tower_1, tower_2, tower_3])


return output
InceptionNet

Input
|
Conv2D -> ReLU -> MaxPooling2D …
|
Inception module
|

|

|
Inception module
|
GlobalAveragePooling -> Dense -> Softmax
GlobalAveragePooling

● Global Average Pooling replaces the fully connected layers used in classical CNNs.
● In this layer, the average value of
each feature map is computed,
resulting in a single output value for
each feature map.
● Global Average Pooling helps reduce
the number of parameters in the
network.
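A small shape check (the 7x7x1024 input is illustrative):

from tensorflow.keras.layers import Input, GlobalAveragePooling2D
from tensorflow.keras.models import Model

inp = Input(shape=(7, 7, 1024))
out = GlobalAveragePooling2D()(inp)      # each feature map collapses to its average
print(Model(inp, out).output_shape)      # (None, 1024)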
Inception Layer in InceptionNet
def InceptionNet(input_shape, num_classes):
"""
InceptionNet architecture using functional API.
"""
input_tensor = Input(shape=input_shape)

x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(input_tensor)


x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
x = Conv2D(64, (1, 1), padding='same', activation='relu')(x)
x = Conv2D(192, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [64, 128, 32, 32, 64])


x = inception_module(x, [128, 192, 96, 64, 128])

x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)


x = inception_module(x, [192, 208, 48, 64, 96])
x = inception_module(x, [160, 224, 64, 64, 112])
x = inception_module(x, [128, 256, 64, 64, 128])
x = inception_module(x, [112, 288, 64, 64, 144])
x = inception_module(x, [256, 320, 128, 128, 160])

x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [256, 320, 128, 128, 160])


x = inception_module(x, [384, 384, 128, 128, 128])

x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
x = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=input_tensor, outputs=x, name='InceptionNet')


return model
Auxiliary classifiers

● The auxiliary classifiers in InceptionNet are additional output branches that


are inserted into the network at intermediate stages.
Auxiliary classifiers

● The auxiliary classifiers in InceptionNet are additional output branches that


are inserted into the network at intermediate stages.
● Auxiliary classifiers provide additional supervision signals during training to
improve the overall performance of the network.
Auxiliary classifiers

● The auxiliary classifiers in InceptionNet are additional output branches that


are inserted into the network at intermediate stages.
● Auxiliary classifiers provide additional supervision signals during training to
improve the overall performance of the network.
● The use of auxiliary classifiers is not limited to InceptionNet and can be
applied to other deep learning architectures as well.
Auxiliary classifiers

Main classifier
Auxiliary classifiers

● During training, the loss from the auxiliary classifiers is added to the overall
loss (main classifier) of the network with a weight factor (usually 0.3).
● During inference, the outputs of the auxiliary classifiers are discarded, and
only the output of the main classifier is used to make predictions.
● The number and placement of the auxiliary classifiers in InceptionNet can
vary depending on the specific architecture and task.
Inception Net (Main Components)
Inception Net

The input to the network is a 224x224x3 RGB image.

The network begins with a series of convolutional and pooling layers to extract low-level features from the image.

The Inception module contains multiple parallel convolutional paths with different filter sizes, including 1x1, 3x3, and 5x5 convolutions.

Pooling operations and 1x1 convolutions inside the Inception modules reduce the dimensionality of the input.

The outputs of each path are concatenated together along the channel axis and fed into the next layer.

Inception modules are stacked on top of each other to form the "stem" of the network. The stem is followed by a series of "Inception-A" and "Inception-B" modules.


Inception Net

The network also includes several "Reduction" modules, which are used to reduce the spatial
dimensions of the feature maps.
Inception Net

In addition to the main classifier, the network also includes two auxiliary classifiers at intermediate layers.
Inception Net

The final layers of the network consist of a global average pooling layer and a fully connected layer with softmax activation.
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import numpy as np

# Load the InceptionV3 model


model = InceptionV3(weights='imagenet')

# Load the image you want to classify


img_path = 'tiger_shark.jpeg'
img = image.load_img(img_path, target_size=(299, 299))

# Convert the image to an array


x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Use the model to predict the class of the image


preds = model.predict(x)

# Print the top 5 predictions


print('Predicted:', decode_predictions(preds, top=5)[0])
def InceptionNet(input_shape, num_classes):
"""
InceptionNet architecture using functional API.
"""
input_tensor = Input(shape=input_shape)

x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(input_tensor)


x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = Conv2D(64, (1, 1), padding='same', activation='relu')(x)


x = Conv2D(192, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [64, 96, 128, 16, 32])


x = inception_module(x, [128, 128, 192, 32, 96])
x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [192, 96, 208, 16, 48])


# Auxiliary Classifier 1
aux_output_1 = AveragePooling2D((5, 5), strides=(3, 3))(x)
aux_output_1 = Conv2D(128, (1, 1), padding='same', activation='relu')(aux_output_1)
aux_output_1 = Flatten()(aux_output_1)
aux_output_1 = Dense(1024, activation='relu')(aux_output_1)
aux_output_1 = Dropout(0.7)(aux_output_1)
aux_output_1 = Dense(num_classes, activation='softmax')(aux_output_1)

x = inception_module(x, [160, 112, 224, 24, 64])


x = inception_module(x, [128, 128, 256, 24, 64])
x = inception_module(x, [112, 144, 288, 32, 64])
# Auxiliary Classifier 2
aux_output_2 = AveragePooling2D((5, 5), strides=(3, 3))(x)
aux_output_2 = Conv2D(128, (1, 1), padding='same', activation='relu')(aux_output_2)
aux_output_2 = Flatten()(aux_output_2)
aux_output_2 = Dense(1024, activation='relu')(aux_output_2)
aux_output_2 = Dropout(0.7)(aux_output_2)
aux_output_2 = Dense(num_classes, activation='softmax')(aux_output_2)

x = inception_module(x, [256, 160, 320, 32, 128])


x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

x = inception_module(x, [256, 160, 320, 32, 128])


x = inception_module(x, [384, 192, 384, 48, 128])

x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)

output = Dense(num_classes, activation='softmax')(x)
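
# Sketch: build the three-output model and down-weight the auxiliary losses
# (the 0.3 factor follows the convention mentioned earlier; Model is assumed to be imported from keras.models).
model = Model(inputs=input_tensor, outputs=[output, aux_output_1, aux_output_2])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              loss_weights=[1.0, 0.3, 0.3])
return model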


InceptionNet variants

● Inception Net has been refined and optimized, leading to several smaller
and faster variants such as Inception-v2, Inception-v3, and
Inception-ResNet.
● Inception-ResNet incorporates residual connections into the Inception
modules to further improve training stability and performance.
InceptionNet Applications

● Image classification, Object Detection (fine-grained), and semantic


segmentation
● Image Quality Assessment for Inception Score.
Neural Style Transfer

VGG vs. InceptionNet


Neural Style Transfer

● Puzzle: VGG is a better feature extractor than InceptionNet for style transfer. The stylization quality degrades when InceptionNet is used instead of VGG.
Neural Style Transfer

Wang, Pei, Yijun Li, and Nuno Vasconcelos. "Rethinking and improving the robustness of image
style transfer." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2021.

● Puzzle: VGG is a better feature extractor than InceptionNet for style transfer.
The stylization performance degrades when InceptionNet is used instead of VGG.
Deep learning for Computer Vision
ImageNet Benchmark (Recap)
Common Performance Metrics

● Top-1 score: check if a sample's top class (i.e. the one with highest
probability) is the same as its target label
Common Performance Metrics

● Top-1 score: check if a sample's top class (i.e. the one with highest
probability) is the same as its target label
● Top-5 score: check if the true label is among the 5 predictions with the highest probabilities
Common Performance Metrics

● Top-1 score: check if a sample's top class (i.e. the one with highest
probability) is the same as its target label
● Top-5 score: check if the true label is among the 5 predictions with the highest probabilities
● Top-5 error: percentage of test samples for which the correct class was not in the top 5 predicted classes (a small computation sketch follows below)
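As a small illustration of these metrics, the sketch below computes top-1 and top-5 accuracy from a matrix of predicted class probabilities; the array names probs and labels are placeholders, not part of any library API.

import numpy as np

def top_k_accuracy(probs, labels, k=1):
    # probs: (N, C) predicted class probabilities; labels: (N,) true class ids
    top_k = np.argsort(probs, axis=1)[:, -k:]          # k highest-probability classes
    hits = np.any(top_k == labels[:, None], axis=1)    # true label among the top k?
    return hits.mean()

# top1 = top_k_accuracy(probs, labels, k=1)
# top5 = top_k_accuracy(probs, labels, k=5)
# top5_error = 1.0 - top5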
Classical Architecture (Recap)

Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers | 4-12 million | 74.8% | Szegedy et al.
Next? | 2015 | 50-152 | ? | ? | 75.3% | He et al.

Problem of Depth
Going Deeper

● There has been a general trend in recent years to design


deeper networks.
● Deeper networks are known to produce more complex features
and tend to generalise better.
Going Deeper

● There has been a general trend in recent years to design


deeper networks.
● Deeper networks are known to produce more complex features
and tend to generalise better.
● Training deep networks is however difficult.
○ Problem of vanishing gradients
○ Problem of exploding gradient
Vanishing Gradient Problem

● Gradient of the loss function with respect to the weights in the


lower layers becomes very small during backpropagation.
Vanishing Gradient Problem:

Vanishing gradients problem on a simple feed-forward network with hidden activations h1, ..., hn:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

Vanishing Gradient Problem:

Weight update issue

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If any of these factors is close to zero, then ∂L/∂w1 ≈ 0.

Vanishing Gradient Problem:

Vanishing gradients problem on a simple feed-forward network:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If any of these factors is close to zero, then ∂L/∂w1 ≈ 0.

Vanishing Gradient Problem:

● The small gradient is propagated back through the layers,


making it difficult for lower layers to learn meaningful
representations of the data.
Vanishing Gradient Problem:

● The small gradient is propagated back through the layers,


making it difficult for lower layers to learn meaningful
representations of the data.
● Very challenging to train CNNs, where the gradient can become
exponentially small.
Going Deeper

Now consider the problem of vanishing gradients on this new network:


Going Deeper

Now consider the problem of vanishing gradients on this new network:


Exploding Gradient Problem:

● Gradient of the loss function with respect to the weights in the


lower layers becomes very large during backpropagation.
Exploding Gradient Problem:

Exploding gradients problem on a simple feed-forward network:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

Exploding Gradient Problem:

Weight update issue

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If these factors are large, the product ∂L/∂w1 blows up.

Exploding Gradient Problem:

Exploding gradients problem on a simple feed-forward network:

∂L/∂w1 = (∂L/∂hn) · (∂hn/∂hn−1) · ... · (∂h2/∂h1) · (∂h1/∂w1)

During gradient descent we evaluate ∂L/∂w1, which is a product of the intermediate derivatives. If these factors are large, then ∂L/∂w1 ≈ ∞.

Exploding Gradient Problem:

● The large gradient is propagated back through the layers,


causing weight updates that are too large.
Exploding Gradient Problem:

● The large gradient is propagated back through the layers,


causing weight updates that are too large.
● Very challenging to train CNNs, where the gradient can become
exponentially large.
Exploding Gradient Problem:

● The large gradient is propagated back through the layers,


causing weight updates that are too large.
● Very challenging to train CNNs, where the gradient can become
exponentially large.
● Gradient clipping can be used to mitigate exploding gradients (see the sketch below).
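In Keras, gradient clipping can be requested directly on the optimizer; the values below are illustrative, not a recommendation for any particular model.

from tensorflow.keras.optimizers import SGD

# Clip the global gradient norm to 1.0 before each weight update
optimizer = SGD(learning_rate=0.01, clipnorm=1.0)

# Alternatively, clip each gradient element to the range [-0.5, 0.5]
# optimizer = SGD(learning_rate=0.01, clipvalue=0.5)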
Problem of Depth
ResNet

Solution to Problem of Depth


Classical Architecture

Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers | 4-12 million | 74.8% | Szegedy et al.
ResNet | 2015 | 50-152 | Residual connections, shortcut connections | 25.6-60 million | 75.3% | He et al.
ResNet
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.

● The key innovation of ResNet is the use of residual connections,


which allow for much deeper networks to be trained.
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.

● The key innovation of ResNet is the use of residual connections,


which allow for much deeper networks to be trained.

● ResNet comes in several variants, including ResNet-18, ResNet-34,


ResNet-50, ResNet-101, and ResNet-152.
ResNet (Introduction)

● ResNet is a deep neural network architecture that was introduced by


researchers at Microsoft in 2015.

● The key innovation of ResNet is the use of residual connections,


which allow for much deeper networks to be trained.

● ResNet comes in several variants, including ResNet-18, ResNet-34,


ResNet-50, ResNet-101, and ResNet-152.

● ResNet has achieved good performance on many computer vision


tasks, including classification, object detection, and segmentation.
Residual Block
Skip connection (key idea)

● ResNet is composed of a series of residual blocks, each of which


contains one or more convolutional layers, batch normalization,
and ReLU activation.
Skip connection (key idea)

● ResNet is composed of a series of residual blocks, each of which


contains one or more convolutional layers, batch normalization,
and ReLU activation.
● In residual blocks, there is a shortcut connection that bypasses
one or more layers and allows the gradient to flow directly to
earlier layers.
Skip connection (key idea)

● ResNet is composed of a series of residual blocks, each of which


contains one or more convolutional layers, batch normalization,
and ReLU activation.
● In residual blocks, there is a shortcut connection that bypasses
one or more layers and allows the gradient to flow directly to
earlier layers.
● This shortcut connection is known as a residual connection or skip connection.
Two layers

Input → Linear → Non-linear (two stacked layers)
Residual Block (key idea)

Two layers

In each residual block, there is a skip connection (residual connection) that bypasses
one or more layers and allows the gradient to flow directly to earlier layers.
Residual Block

Two layers

In each residual block, there is a skip connection (residual connection) that bypasses
one or more layers and allows the gradient to flow directly to earlier layers.
Residual Block

Two layers

● Residual connections improve the gradient flow and enable the network
to learn deeper and more complex features.
ResNet Block
ResNet Block

The residual connection is added to the output of the convolutional layers before
the ReLU activation function is applied.
ResNet Block

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add

def resnet_block(inputs, filters, kernel_size, strides=(1, 1), padding='same'):
    # Main path: two convolutional layers, each followed by batch normalization
    x = Conv2D(filters=filters, kernel_size=kernel_size, strides=strides, padding=padding)(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)

    x = Conv2D(filters=filters, kernel_size=kernel_size, strides=(1, 1), padding=padding)(x)
    x = BatchNormalization()(x)

    # Skip connection: add the block input to the conv output before the final ReLU.
    # This identity shortcut assumes strides=(1, 1) and that `filters` matches the
    # number of input channels; otherwise a 1x1 projection of `inputs` would be needed.
    x = Add()([x, inputs])
    x = Activation('relu')(x)
    return x
ResNet
ResNet

● Top: a residual network with 34 parameter layers (3.6 billion FLOPs).


● Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).
● Bottom: the VGG-19 model (19.6 billion FLOPs) as a reference.
ResNet

● Top: a residual network with 34 parameter layers (3.6 billion FLOPs).


● Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).
● Bottom: the VGG-19 model (19.6 billion FLOPs) as a reference.
ResNet

● Top: a residual network with 34 parameter layers (3.6 billion FLOPs).


● Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).
● Bottom: the VGG-19 model (19.6 billion FLOPs) as a reference.
ResNet

ResNet Architectures for ImageNet. Building blocks are shown in brackets, with the
numbers of blocks stacked.
ResNet

Left: a building block (on 56×56 feature maps) as in ResNet-34. Right: a "bottleneck" building block for ResNet-50/101/152.
ResNet

ResNet architectures for ImageNet. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
● Each residual block contains one or more convolutional layers, batch
normalization, and ReLU activation functions.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
● Each residual block contains one or more convolutional layers, batch
normalization, and ReLU activation functions.
● The convolutional layers in each residual block typically have small filter
sizes, such as 3x3 or 1x1.
ResNet
● The ResNet architecture typically begins with a single convolutional layer,
followed by a max pooling layer.
● After the initial layer, there are several stages of residual blocks with
different numbers of convolutional layers.
● Each residual block contains one or more convolutional layers, batch
normalization, and ReLU activation functions.
● The convolutional layers in each residual block typically have small filter
sizes, such as 3x3 or 1x1.
● The final layers of the network are typically a global average pooling layer and a
fully connected layer with a softmax activation function (see the assembly sketch below).
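To make this structure concrete, the sketch below assembles a small ResNet-style classifier from the resnet_block function defined earlier in these slides: a stem convolution and max pooling, a few stages of identity residual blocks, global average pooling, and a softmax head. The stage sizes are illustrative and do not correspond to any official ResNet variant.

from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     GlobalAveragePooling2D, Dense)
from tensorflow.keras.models import Model

def mini_resnet(input_shape=(224, 224, 3), num_classes=1000):
    inputs = Input(shape=input_shape)

    # Stem: a single convolutional layer followed by max pooling
    x = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(inputs)
    x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

    # Stages of identity residual blocks; a strided 1x1 convolution changes the
    # number of channels between stages so the identity skip connection still matches
    for filters, num_blocks in [(64, 2), (128, 2), (256, 2)]:
        if x.shape[-1] != filters:
            x = Conv2D(filters, (1, 1), strides=(2, 2), padding='same')(x)
        for _ in range(num_blocks):
            x = resnet_block(x, filters=filters, kernel_size=(3, 3))

    # Head: global average pooling and a fully connected softmax classifier
    x = GlobalAveragePooling2D()(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    return Model(inputs, outputs)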
Why do ResNets Work?
Why do ResNets Work?
Why do ResNets Work?
Why do ResNets Work?

If the added layers learn the zero mapping, the block keeps the same values and only adds a non-linearity, so the identity is easy to recover.


Why do ResNets Work?

The network can effectively choose to use fewer layers when it is not
necessary, which can improve efficiency and reduce overfitting.
Why do ResNets Work?

Residual connections also allow the network to adaptively determine how


many layers to use for a particular input.
Why do ResNets Work?

● Shortcut connections in ResNets enable the gradient to flow more directly and
efficiently through the network.
● ResNets address the problem of vanishing gradients that can occur in very
deep neural networks.
ResNet

Training on ImageNet. If we make the network deeper, at some point the performance starts to decrease.

Left: plain networks of 18 and 34 layers.

Thin curves denote training error, and bold curves denote validation error of the center crops.
ResNet

Training on ImageNet. ResNet Solution

Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers

Thin curves denote training error, and bold curves denote validation error of the center crops.
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import (ResNet50, preprocess_input,
                                                    decode_predictions)
from tensorflow.keras.preprocessing import image
import numpy as np

# Load the ResNet50 model


model = ResNet50(weights='imagenet')

# Load the image you want to classify


img_path = 'tiger_shark.jpeg'
img = image.load_img(img_path, target_size=(224, 224))

# Convert the image to an array


x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Use the model to predict the class of the image


preds = model.predict(x)

# Print the top 5 predictions


print('Predicted:', decode_predictions(preds, top=5)[0])
References

● AlexNet: "ImageNet Classification with Deep Convolutional Neural Networks"


by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).
● VGGNet: "Very Deep Convolutional Networks for Large-Scale Image
Recognition" by Karen Simonyan and Andrew Zisserman (2014).
● Inception Net: "Going Deeper with Convolutions" by Christian Szegedy, Wei
Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2015).
● ResNet: "Deep Residual Learning for Image Recognition" by Kaiming He,
Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2016).
Final Note on Classical
Architecture
Classical Architecture

Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers, batch normalization | 4-12 million | 74.8% | Szegedy et al.
ResNet | 2015 | 50-152 | Residual connections, shortcut connections | 25.6-60 million | 75.3% | He et al.
Comparing Complexity

Top-1 accuracy vs. network: single-crop top-1 validation accuracies for top-scoring single-model architectures.
Comparing Complexity

Top-1 accuracy vs. operations (blob size ∝ parameters): top-1 one-crop accuracy versus the number of operations required for a single forward pass. The size of each blob is proportional to the number of network parameters.
Comparing Complexity
Comparing Complexity
Comparing Complexity
Comparing Complexity
Comparing Complexity
Accuracy per parameter vs. network

Accuracy per parameter vs. network. Information density (accuracy per parameter) is an efficiency metric that
highlights the capacity of a specific architecture to better utilise its parametric space.
References

● Canziani, Alfredo, Adam Paszke, and Eugenio Culurciello. "An analysis of


deep neural network models for practical applications." (2016).
Object Recognition and Face Recognition
Image Classification
Object Detection and Localization

Can we relate Object Detection


and Classification?
Object Detection: Task Definition
Object Detection: Challenges
Object Detection: Challenges
How to verify if the
output is correct?
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
Comparing Boxes: Intersection over Union (IoU)
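A minimal IoU computation for two axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption, since boxes are sometimes given as (x, y, w, h) instead.

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); compute the intersection rectangle first
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: iou((0, 0, 10, 10), (5, 5, 15, 15)) = 25 / 175 ≈ 0.14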
Detecting a single
object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting a single object
Detecting Multiple
Objects
Detecting Multiple Objects
Detecting Multiple Objects
Detecting Multiple Objects
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window
Detecting Multiple Objects: Sliding Window

Detecting Multiple Objects: Sliding Window

Detecting Multiple Objects: Sliding Window

Need to apply the CNN to a huge number of locations and scales, which is computationally expensive!
Region Proposals
R-CNN
R-CNN
R-CNN
R-CNN
R-CNN
R-CNN
R-CNN

● Training is slow (84h), takes a


lot of disk space
● Inference (detection) is slow
R-CNN

● Training is slow (84h), takes a


lot of disk space
● Inference (detection) is slow

Idea: Pass the image through


convnet before cropping! Crop the
conv feature instead!
Fast R-CNN
Fast R-CNN

“Backbone” network:
AlexNet, VGG, ResNet, etc
Fast R-CNN
Fast R-CNN
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Pool
Cropping Features: RoI Align

● In practice, RoI Align is used in Fast R-CNN. It uses bilinear interpolation to compute the feature values.
● It is a more precise and accurate way to extract features from region proposals.
Fast R-CNN: Fully-connected layers
Fast R-CNN
Fast R-CNN (Training)
Fast R-CNN (Training)
R-CNN vs Fast R-CNN
R-CNN vs Fast R-CNN

Problem: Runtime dominated by


region proposals!
R-CNN vs Fast R-CNN

Problem: Runtime dominated by


region proposals!

Solution: make CNN do region


proposals.
Faster R-CNN:
Faster R-CNN:
Region Proposal Network
Region Proposal Network
Region Proposal Network
Region Proposal Network
Region Proposal Network
Region Proposal Network
Faster R-CNN:
Faster R-CNN:
Faster R-CNN:
References

● Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and
semantic segmentation." Proceedings of the IEEE conference on computer vision and
pattern recognition. 2014.
● Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE international conference on
computer vision. 2015.
● Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region
proposal networks." Advances in neural information processing systems 28 (2015).
Object Recognition and Face Recognition
Object Detection: Task Definition
R-CNN

Problem: Training is slow (84h),


takes a lot of disk space
R-CNN

Problem: Training is slow (84h),


takes a lot of disk space

Solution: Pass the image through


convnet before cropping! Crop the
conv feature instead!
Fast R-CNN
Multi-task loss

Problem: Runtime dominated by region proposals!
Fast R-CNN
Multi-task loss

Problem: Runtime dominated by region proposals!

Solution: make CNN do region


proposals.
Faster R-CNN:

Two stages: region proposal


generation and object
classification
Faster R-CNN:

Problem: not practical for real-time object detection; inference is slow.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)

At 60 miles/h (96.5 km/h), a car travels about ⅓ mile (1760 feet) during one 20 s detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)

176 feet travelled at 60 miles/h during one 2 s detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)
Faster R-CNN | 73.2 | 7 FPS (140 ms/img)

8-12 feet travelled at 60 miles/h during one 140 ms detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)
Faster R-CNN | 73.2 | 7 FPS (140 ms/img)
YOLO | 63.4 | 45 FPS (22 ms/img)

2 feet travelled at 60 miles/h during one 22 ms detection.
Accurate object detection is slow!

Method | Pascal 2007 mAP | Speed
DPM v5 | 33.7 | 0.07 FPS (14 s/img)
R-CNN | 66.0 | 0.05 FPS (20 s/img)
Fast R-CNN | 70.0 | 0.5 FPS (2 s/img)
Faster R-CNN | 73.2 | 7 FPS (140 ms/img)
YOLO | 63.4 / 69.0 | 45 FPS (22 ms/img)

2 feet travelled at 60 miles/h during one 22 ms detection.
With YOLO, you only look once at an image to
perform detection

YOLO: You Only Look Once


We split the image into a grid
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell predicts boxes and confidences: P(Object)
Each cell also predicts a class probability.
Each cell also predicts a class probability.

Bicycle Car

Dog

Dining
Table
Conditioned on object: P(Car | Object)

Bicycle Car

Dog

Dining
Table
Then we combine the box and class predictions.
Finally, apply non-maximum suppression (NMS) with a threshold to remove duplicate detections (see the sketch below).
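A greedy non-maximum suppression sketch, reusing the iou helper from the IoU slides above; the 0.5 overlap threshold is a typical but purely illustrative choice.

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Indices sorted by confidence, highest first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)            # keep the most confident remaining box
        keep.append(best)
        # Drop all remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep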
This parameterization fixes the output size

Each cell predicts:

- For each bounding box:


- 4 coordinates (x, y, w, h)
- 1 confidence value
- Some number of class
probabilities
This parameterization fixes the output size

Each cell predicts:

- For each bounding box:


- 4 coordinates (x, y, w, h)
- 1 confidence value
- Some number of class
probabilities

For Pascal VOC:

- 7x7 grid
- 2 bounding boxes / cell
- 20 classes

Total predictions: 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs


Thus we can train one neural network to be a whole
detection pipeline
Look at that cell’s predicted boxes
Find the best one, adjust it, increase the confidence
Find the best one, adjust it, increase the confidence
Find the best one, adjust it, increase the confidence
Decrease the confidence of other boxes
Decrease the confidence of other boxes
Some cells don’t have any ground truth detections!
Some cells don’t have any ground truth detections!
Decrease the confidence of these boxes
Decrease the confidence of these boxes
Don’t adjust the class probabilities or coordinates
We train with standard tricks:

- Pretraining on Imagenet
- Extensive data augmentation
- For details, see the paper
YOLO works across a variety of natural images
It also generalizes well to new domains (like art)
Visualizing
importance
The occlusion experiment
The occlusion experiment

Block different parts of the image and see how the classification score
changes
The occlusion experiment

Block different parts of the image and see how the classification score
changes

The face of the


dog is more
important for
correct
classification
The occlusion experiment

Create a map where each pixel represents the classification probability when an occlusion square is placed in that region (high values where occlusion barely matters, small values where the occluded region is important for classification); a minimal sketch follows below.
The occlusion experiment

Most important pixels for


classification
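A minimal sketch of the occlusion experiment for a Keras classifier: slide a square patch over the image and record the probability of the target class at each position. The model variable, patch size, and stride are illustrative assumptions.

import numpy as np

def occlusion_map(model, img, target_class, patch=32, stride=16):
    # img: a single preprocessed image of shape (H, W, C)
    h, w, _ = img.shape
    heatmap = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = img.copy()
            occluded[y:y + patch, x:x + patch, :] = 0.0      # blank out one square
            prob = model.predict(occluded[None, ...], verbose=0)[0, target_class]
            heatmap[i, j] = prob   # low value: the occluded region was important
    return heatmap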
Semantic Segmentation
Semantic Segmentation
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation Idea: Sliding Window
Semantic Segmentation idea: CNN

Challenge: output size is of the


order of input size
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation idea: CNN
Semantic Segmentation Limitation
Transfer Learning
Transfer Learning

● Training your own model can be difficult with limited data and other resources; for example, it is a laborious task to manually annotate your own training dataset.
● Why not reuse already pre-trained models?
Transfer Learning

Distribution Distribution

Use what has been


learned for another
setting
Transfer Learning for Images

Low-level features → Mid-level features → Top-level features
Transfer Learning
Trained on
ImageNet

Feature
extraction
Transfer Learning
Trained on
ImageNet

Decision layers

Parts of an object (wheel, window)

Simple geometrical shapes (circles, etc)

Edges
Transfer Learning
Trained on
ImageNet

New dataset with C


classes
Transfer Learning

If the dataset is big enough, train more layers with a low learning rate (see the sketch below).
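A minimal feature-extraction sketch with Keras: reuse a network pretrained on ImageNet, freeze its convolutional layers, and train only a new classification head for the new dataset. The class count and input size are placeholders; fine-tuning follows the same pattern with some top layers unfrozen and a low learning rate.

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

C = 10  # number of classes in the new dataset (placeholder)

# Pretrained feature extractor without the original ImageNet decision layers
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # feature extraction: keep the pretrained weights fixed

x = GlobalAveragePooling2D()(base.output)
outputs = Dense(C, activation='softmax')(x)
model = Model(base.input, outputs)

model.compile(optimizer=Adam(learning_rate=1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])

# If the new dataset is big enough, unfreeze the top of the base network
# and continue training with a much lower learning rate, e.g. Adam(1e-5).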
When Transfer Learning makes Sense

● When tasks T1 and T2 have the same input (e.g. an RGB image)
● When you have more data for task T1 than for task T2
● When the low-level features for T1 could be useful to learn T2
Now you are:

● Ready to perform image classification on any dataset


● Ready to design your own architecture
● Ready to deal with other problems such as semantic
segmentation (Fully Convolutional Network)
Deep Learning for Natural Language
Processing (NLP)
Natural
Language
Processing
(NLP)
Natural
Language
Processing
(NLP)

Deep Learning for Natural Language Processing (NLP)


Sequence Modelling
Sequence modeling

?
Sequence modeling
Sequence modeling
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequences are Everywhere
Sequence Modeling Applications

Task | Input data | Output
Speech Recognition | speech audio | text
Machine Translation | "This is an apple." | "यह एक सेब है।" (Hindi), "ਇਹ ਇੱਕ ਸੇਬ ਹੈ।" (Punjabi)
Language Modeling | "Recurrent neural __?" | "Network"
Named Entity Recognition | "Mark Zuckerberg is one of the founders of Facebook, a company from the United States" | "Person": Mark Zuckerberg, "Company": Facebook, "Location": United States
Sentiment Classification | "There is nothing to like in this movie." | negative sentiment
Video Activity Recognition | video clip | "Punching"
NLP Tasks Overview
Language Translation
Query Recommendations
Spelling and Grammar Corrections
Sentiment Analysis
Topic modeling

Document-Word Matrix
NLP Tasks

(chat about a specific topic)

(chat about any topic)


ChatGPT
NLP Tasks challenge
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______.
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______. French
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______. French
• Differences in sequence order.
The food was good, not bad at all
The food was bad, not good at all.
NLP Tasks challenge
• Variable length sequence
The food is great
Jaipur city is famous as a pink city
• Long-term dependency
France is where he grew up, but now he is in India. He speaks fluent ______. French
• Differences in sequence order.
The food was good, not bad at all
The food was bad, not good at all.
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
• Track long-term dependencies
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
• Track long-term dependencies
• Maintain information about the order
Sequence Modeling Design Criteria

The sequential model needs to


• Handle variable-length sequence
• Track long-term dependencies
• Maintain information about the order
Sequence to Sequence Learning
with Neural Networks
Feed-Forward Neural Network
Feed-Forward Neural Network

Feed-forward networks are difficult to use for speech recognition, question answering, machine translation, and other sequence-to-sequence problems.
Sequence to Sequence Problem

One to One

Image Classification
Sequence to Sequence Problem

One to One One to many

Image Classification Image Captioning


Sequence to Sequence Problem

One to One One to many many to one

Image Classification Image Captioning Language Recognition


Sequence to Sequence Problem

Many to Many

Machine Translation
Sequence to Sequence Problem

Many to Many Many to many

Machine Translation Video Activity Recognition


seq2seq Learning

• Seq2seq models are neural network architectures used to transform input sequences into output sequences of variable length.
seq2seq Learning

• Seq2seq models are neural network architectures used to transform input sequences into output sequences of variable length.
• Seq2seq does the sequence transformation using a simple RNN, or using an LSTM or GRU to avoid the problem of vanishing gradients (a minimal encoder-decoder sketch follows below).
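A minimal Keras sketch of an LSTM encoder-decoder (seq2seq) trained with teacher forcing; the vocabulary sizes and hidden dimension are placeholders, and inputs are assumed to be one-hot encoded.

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, latent_dim = 5000, 6000, 256   # placeholder sizes

# Encoder: read the source sequence and keep only its final states
encoder_inputs = Input(shape=(None, src_vocab))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the target sequence conditioned on the encoder states
decoder_inputs = Input(shape=(None, tgt_vocab))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=[state_h, state_c])
outputs = Dense(tgt_vocab, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')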
Basics of RNN
RNN

Recurrent Neural Networks (RNNs) are a family of neural networks that:

● Process sequence data
● Take sequential input of variable length
● Apply the same weights on each step
● Can produce output of variable length (see the cell-update sketch below)
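The core recurrence can be written in a few lines of NumPy: the same weight matrices are applied at every time step, and the hidden state carries information forward. This is a minimal sketch with made-up dimensions, not a training-ready implementation.

import numpy as np

input_dim, hidden_dim = 8, 16
W_xh = np.random.randn(hidden_dim, input_dim) * 0.01    # input-to-hidden weights
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.01   # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # One time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# The same weights are reused at every step of a length-5 input sequence
h = np.zeros(hidden_dim)
for x_t in np.random.randn(5, input_dim):
    h = rnn_step(x_t, h)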
Output Vector

Input Vector
Recurrent Neural Networks

Output Vector

Recurrent
Cell h0 h1 h2

Input Vector
Recurrent Neural Networks

Output

ht

Input
Recurrent Neural Networks

Output

ht

Input
Recurrent Neural Networks

Output

ht

Input
Recurrent Neural Networks

Output

ht

Input
RNN Parameters

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Parameters

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Parameters

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Computational Graph Across Time

RNN RNN RNN RNN RNN


cell cell cell cell cell
RNN Computational
Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
RNN Computational Graph
Sequence to
Sequence Modeling
using RNNs
Many to Many
Many to One
One to Many
Training RNNs
RNN State Update
Output Vector
Output Vector

Recurrent
Cell
ht
RNN Cell update

Input Vector
Input Vector
Vanilla RNN
Gradient Flow

Backpropagation from ht to ht-1


Backpropagation steps

• Input feedforward in the network


• Compute the Loss
• Take the derivative of the Loss with respect to each parameter
• Update parameters to minimize the Loss
Backpropagation Through Time (BPTT)
Backpropagation Through Time (BPTT)
Backpropagation Through Time
Backpropagation Through Time
Truncated Backpropagation Through Time
Truncated Backpropagation Through Time
Truncated Backpropagation Through Time
Problems in RNN

● Slow to train (inefficient / not parallelizable)


● Suffer from exploding or vanishing gradients
● Cannot handle very long-term dependencies
Problems in RNN

● Slow to train (inefficient / not parallelizable)


● Suffer from exploding or vanishing gradients
● Cannot handle very long-term dependencies

Solution: LSTM or GRU


Deep Learning for Natural Language
Processing (NLP)
RNN

Backpropagation from ht to ht-1


Exploding/vanishing gradients issues in RNN
LSTM vs RNN: big picture
LSTMs

● LSTMs were proposed in 1997 as a solution to the vanishing gradient problem in traditional RNNs.
LSTMs

● LSTMs were proposed in 1997 as a solution to the vanishing gradient problem in traditional RNNs.
● The basic unit of an LSTM is a cell, which contains a hidden state and a cell state.
○ Cell state is used to store information over time
○ Hidden state is used to selectively output information from the cell
state at each time step.
hidden state in LSTMs

● Metaphor: The hidden state of the neural network can be considered as a


short-term memory.
hidden state in LSTMs

● Metaphor: The hidden state of the neural network can be considered as a


short-term memory.
● LSTM architecture tries to make this short-term memory last as long as
possible by preventing vanishing gradients.
hidden state in LSTMs

● Metaphor: The hidden state of the neural network can be considered as a


short-term memory.
● LSTM architecture tries to make this short-term memory last as long as
possible by preventing vanishing gradients.
● LSTMs allow the model to selectively forget or remember information
over time.
LSTM vs RNN: inside picture
Gating mechanism
Information flow in an LSTM
Observe x
Information flow in an LSTM
Input
Information flow in an LSTM
Don’t forget
Information flow in an LSTM
Don’t forget
Information flow in an LSTM
Output
LSTMs gates
● To control the flow of information into and out of the cell, LSTMs use
three types of gates:
○ Input gate: decides which information from the current input to include in the
cell state.
○ Forget gate: decides which information from the previous cell state
to forget.
○ Output gate: decides which information from the current cell state to
output as the hidden state (the gate equations are written out below).
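Written out in one common formulation (σ is the logistic sigmoid, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] denotes concatenation), the gates and state updates are:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)          (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)          (input gate)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)       (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t              (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)          (output gate)
h_t = o_t ⊙ tanh(c_t)                         (hidden state / output)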
Long short-term memory (LSTM) Key idea

Long short-term memory (LSTM) Key idea
A look inside an LSTM cell

Forget gate

Input gate

Update candidate

Memory cell update

Output gate

Output
Observation

● If the forget gate is always 1 and the input gate is always 0, the memory cell internal
state will remain constant forever.
● However, input gates and forget gates give the flexibility to learn when to keep this
value unchanged and when to perturb it in response to subsequent inputs.
What about gradient flow?
Long Short Term Memory (LSTM): Gradient Flow
Long Short Term Memory (LSTM): Gradient Flow
Long Short Term Memory (LSTM): Gradient Flow
Gated Recurrent Units (GRU)
Main solution for better RNNs: Units

● Gated recurrent units (GRUs) are a gating mechanism in recurrent neural


networks, introduced in 2014 by Kyunghyun Cho et al.
Main solution for better RNNs: Units

● Gated recurrent units (GRUs) are a gating mechanism in recurrent neural


networks, introduced in 2014 by Kyunghyun Cho et al.
● GRU is like a long short-term memory (LSTM) but has fewer parameters
than LSTM, as it lacks an output gate.
Main solution for better RNNs: Units

● Gated recurrent units (GRUs) are a gating mechanism in recurrent neural


networks, introduced in 2014 by Kyunghyun Cho et al.
● GRU is like a long short-term memory (LSTM) but has fewer parameters
than LSTM, as it lacks an output gate.
● GRU's performance on certain tasks of polyphonic music modeling,
speech signal modeling and natural language processing was found to be
similar to that of LSTM.
Main solution for better RNNs: Units

● In GRU, the LSTM’s three gates are replaced by two


● Reset gate controls how much of the previous state we might still want to
remember.
● Update gate allows us to control how much of the new state is just
a copy of the old state (one common set of update equations is shown below).
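In one common formulation (conventions for the update gate vary between papers), the GRU computations are:

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)               (reset gate)
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)               (update gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)      (candidate state)
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t             (new state)

Here an update gate close to 1 keeps the old state, matching the description above; some papers swap the roles of z_t and (1 - z_t).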
Gated Recurrent Units (GRU)
Gated Recurrent Units (GRU)

● Units with short-term dependencies will


have active reset gates.
● Units with long term dependencies have
active update gates.
Gated Recurrent Units (GRU)

Memory Content
Gated Recurrent Units (GRU)

Final Memory at current time step


Bidirectional RNNs
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation
Bidirectional and Multilayer RNN: Motivation

What about the right context


Bidirectional RNNs

For classification you want to incorporate information from words both


preceding and following.
Bidirectional RNNs

For classification you want to incorporate information from words both


preceding and following.
Two types of connections:
1) One going forward in time, which helps us learn from previous
representations
2) Another going backward in time, which helps us learn from future
representations
Bidirectional RNNs

For classification you want to incorporate information from words both


preceding and following.
Two types of connections:
1) One going forward in time, which helps us learn from previous
representations
2) Another going backward in time, which helps us learn from future
representations
Bidirectional RNNs can better exploit context in both directions.
Bidirectional RNNs
Bidirectional RNNs

Data is processed in both


directions with two
separate hidden layers,
which are then fed
forward into the same
output layer.
Bidirectional RNNs
Bidirectional RNNs
Bidirectional RNNs
Bidirectional RNNs
Bidirectional RNNs: simplified diagram
Bidirectional RNNs

● Bidirectional RNNs are only applicable if you have access to the entire input sequence.
○ Bidirectional RNNs are not applicable to Language Modeling, because in LM you
only have left context available.
● Bidirectional LSTMs perform better than unidirectional ones in speech recognition.
Deep RNNs
Single-Layer RNNs
Single-Layer RNNs
Multi-layer RNNs
Multi-layer RNNs
yt

ht2

ht1

Input
Multi-layer RNNs
yt

ht2

ht1

Input
Multi-layer RNNs
yt

ht2

ht1

Input
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
● This allows the network to compute more complex representations
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
● This allows the network to compute more complex representations
● The lower RNNs should compute lower-level features and the higher
RNNs should compute higher-level features.
Multi-layer RNNs

● RNNs are already "deep" on one dimension (they unroll over many
timesteps).
● We can also make them "deep" in another dimension by applying
multiple RNNs — this is a multi-layer RNN.
● This allows the network to compute more complex representations
● The lower RNNs should compute lower-level features and the higher
RNNs should compute higher-level features.
● Multi-layer RNNs are also called stacked RNNs (see the sketch below)
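A minimal Keras sketch of a two-layer (stacked) bidirectional LSTM for sequence classification; the vocabulary size, embedding dimension, and number of classes are placeholders.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Input(shape=(None,)),                            # variable-length sequences of token ids
    Embedding(input_dim=10000, output_dim=128),      # placeholder vocabulary size
    # The lower layer returns the full sequence so the upper layer can consume it
    Bidirectional(LSTM(64, return_sequences=True)),
    # The upper layer returns only its final output for classification
    Bidirectional(LSTM(32)),
    Dense(5, activation='softmax'),                  # placeholder number of classes
])
model.summary()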
