Outline
The 2018 Turing Award was awarded jointly to Yoshua Bengio, Geoffrey Hinton, and Yann LeCun for their pioneering work on deep learning.
Geoffrey E. Hinton is known by many to be the godfather of deep learning.
Problems using FC Layers on Images
● Dense layers require a lot of parameters, which can lead to overfitting when the number of input features is large.
● Dense layers are not translation invariant, meaning that small shifts in the input image can result in large changes in the output.
● Dense layers do not take advantage of the spatial structure of images, and can therefore be inefficient for processing large images.
Convolutional Neural Networks for Images
What are Convolutions?
Discrete case: box filter
[Figure: a filter slides over the input to produce the output feature map; successive frames step through the feature-map dimension calculation]
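To make the arithmetic concrete, here is a minimal NumPy sketch (array values are made up for illustration) of a 1D box filter; with no padding and stride 1, an input of width W and a filter of size F give an output of length W − F + 1:

import numpy as np

def box_filter_1d(x, F):
    # Valid convolution of a 1D signal with a box (averaging) filter of size F
    W = len(x)
    out = np.empty(W - F + 1)          # output length = W - F + 1 (stride 1, no padding)
    for i in range(W - F + 1):
        out[i] = x[i:i + F].mean()     # each output is the average over one window
    return out

x = np.array([0., 0., 1., 1., 1., 0., 0.])
print(box_filter_1d(x, F=3))           # 7 - 3 + 1 = 5 output values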
Convolution Layer
[Figure: a convolution layer applied to an RGB image; the filter spans all 3 input channels]
Stride
The stride determines how much the convolutional kernel is moved across the input image at each step.
Convolution on images
[Figure: input, filter, stride, and the resulting output feature map, stepped through frame by frame]
Feature Map Dimension
[Figure: feature-map dimensions for strided convolution]
● Striding reduces the size of the output feature map, which can help to reduce the computational cost and memory requirements of the network.
● Striding can make the network more efficient by reducing the number of operations required to process the input data.
● Striding can help to reduce overfitting by reducing the number of parameters in the network and forcing the network to learn more abstract features.
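As a quick illustration (layer sizes here are arbitrary), a Keras sketch showing how a larger stride shrinks the output feature map:

from keras.layers import Input, Conv2D
from keras.models import Model

inputs = Input(shape=(32, 32, 3))
# Stride 1 with 'same' padding keeps the 32x32 spatial size
s1 = Conv2D(8, (3, 3), strides=(1, 1), padding='same')(inputs)
# Stride 2 halves each spatial dimension, so only a quarter of the positions are computed
s2 = Conv2D(8, (3, 3), strides=(2, 2), padding='same')(inputs)
print(Model(inputs, s1).output_shape)  # (None, 32, 32, 8)
print(Model(inputs, s2).output_shape)  # (None, 16, 16, 8)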
Parameter Sharing
● The same set of weights is used to compute the output for all neurons in the same feature map.
● Parameter sharing reduces the number of parameters needed to train.
● Parameter sharing enables the model to learn translation-invariant features, for example, a kernel for edge detection.
Edge Detection by Convolution
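A minimal NumPy sketch of the idea (the image and kernel are toy examples): a hand-crafted vertical-edge kernel responds strongly exactly where the intensity jumps:

import numpy as np

# A vertical step edge: dark on the left, bright on the right
img = np.zeros((5, 6))
img[:, 3:] = 1.0

# Hand-crafted vertical-edge kernel (Prewitt-style)
k = np.array([[-1., 0., 1.],
              [-1., 0., 1.],
              [-1., 0., 1.]])

H, W = img.shape
out = np.zeros((H - 2, W - 2))
for i in range(H - 2):
    for j in range(W - 2):
        out[i, j] = (img[i:i + 3, j:j + 3] * k).sum()   # cross-correlation, as in CNN layers
print(out)   # strong responses only in the columns containing the edge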
Efficiency of Convolution
[Figure: mapping input features to output features with a convolution uses far fewer parameters and operations than a fully connected layer]
Convolution Layers: Padding
Why padding?
[Figure: zero padding added around the input image]
Output Size: O = (W − F + 2P)/S + 1, for input size W, filter size F, padding P, and stride S.
● Valid Padding:
○ No padding at all.
○ The output feature map is smaller than the input feature map.
● Same Padding:
○ Adding enough padding to the input image so that the output feature map has the same size as the input image.
○ Set padding to P = (F − 1)/2 with stride S = 1.
○ Verify formula: with P = (F − 1)/2 and S = 1, O = (W − F + F − 1)/1 + 1 = W.
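A quick numeric check of the formula (sizes chosen arbitrarily):

def conv_output_size(W, F, P, S):
    # O = (W - F + 2P) / S + 1
    return (W - F + 2 * P) // S + 1

W, F, S = 32, 5, 1
P = (F - 1) // 2                       # 'same' padding rule for stride 1
print(conv_output_size(W, F, P, S))    # 32: output matches the input size
print(conv_output_size(W, F, 0, S))    # 28: 'valid' padding shrinks the map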
Padding
● Reflective Padding:
○ The padded pixels are not filled with zeros, but with the reflected values of the input image.
○ This type of padding is useful when the input image contains edges or other sharp features that would be distorted by zero padding.
● Symmetric Padding:
○ Like reflective padding, but the reflection includes the edge pixels themselves.
Convolution Layer
● Each filter has a size of 5x5x3 (since there are 3 input channels).
● There are 6 filters in the layer, plus one bias term per filter.
● Therefore, the total number of parameters is 6 × (5×5×3 + 1) = 456, as the sketch below confirms.
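A short Keras check of this count (the input's spatial size is arbitrary; only the 3 channels matter):

from keras.layers import Input, Conv2D
from keras.models import Model

inputs = Input(shape=(32, 32, 3))                 # 3 input channels (RGB)
conv = Conv2D(filters=6, kernel_size=(5, 5))(inputs)
print(Model(inputs, conv).count_params())         # 6 * (5*5*3 + 1) = 456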
Stacking Convolution layers
Pooling (sub-sampling)
● Conv Layer = Feature Extraction: computes a feature in a given region
● Pooling Layer = Feature Selection: picks the strongest activation in a region
Pooling Layer: Max Pooling
Common settings: F=2, S=2 or F=3, S=2
Pooling Layer
● Pooling layers reduce the spatial dimensions of the input, which reduces the computational cost of the network.
● Pooling layers aim to retain the most important features, which helps the network learn more robust representations of the input data.
● Pooling layers can improve the translation invariance of the network by selecting the maximum or average value in a given region.
● Pooling layers provide distortion invariance by selecting the maximum or average value in a region, which reduces the effect of small variations.
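A minimal NumPy sketch of max pooling in the common F=2, S=2 setting (the input values are made up):

import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2: keep the strongest activation in each block
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1., 3., 2., 0.],
              [4., 6., 1., 1.],
              [0., 2., 9., 8.],
              [1., 1., 7., 5.]])
print(max_pool_2x2(x))   # [[6. 2.] [2. 9.]]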
Receptive Field
Receptive field refers to the area of the input image that is used by a particular neuron or feature map in the network.
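One common way to compute it is the recurrence RF = RF + (F − 1) × (product of earlier strides); a sketch with an illustrative layer stack:

def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, ordered from input to output
    rf, jump = 1, 1                  # receptive field and effective stride so far
    for F, S in layers:
        rf += (F - 1) * jump         # each layer widens the field by (F-1) input steps
        jump *= S                    # striding makes later layers take bigger steps
    return rf

# e.g. two 3x3 convs (stride 1), a 2x2 max pool (stride 2), then a 3x3 conv (stride 2)
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 2)]))   # 10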
Receptive fields (Advantages)
At the time of LeNet, average pooling was used; max pooling is now much more common.
LeNet
● Convolutional layer 1: This layer applies a set of filters to the input image to extract features such as edges and corners.
● Subsampling layer 1:
○ Reduces the spatial size of the feature maps, while retaining the most important features.
○ Reduces the computational complexity.
● Convolutional layer 2: Applies a second set of filters to the pooled feature maps to extract higher-level features.
● Subsampling layer 2: Reduces the size of the feature maps, while retaining the most important features.
● Output layer: A fully connected layer with 10 units, using a softmax activation function to produce the final output probabilities.
from keras.models import Model
from keras.layers import Input, Conv2D, AveragePooling2D, Flatten, Dense

# Input layer (32x32 grayscale images)
inputs = Input(shape=(32, 32, 1))
# Convolutional layer 1
conv1 = Conv2D(filters=6, kernel_size=(5, 5), strides=(1, 1), activation='relu', padding='same')(inputs)
# Subsampling layer 1
pool1 = AveragePooling2D(pool_size=(2, 2), strides=(2, 2))(conv1)
# Convolutional layer 2
conv2 = Conv2D(filters=16, kernel_size=(5, 5), strides=(1, 1), activation='relu', padding='valid')(pool1)
# Subsampling layer 2
pool2 = AveragePooling2D(pool_size=(2, 2), strides=(2, 2))(conv2)
# Flatten layer
flatten = Flatten()(pool2)
# Fully connected layers (120 and 84 units, as in the classic LeNet-5)
fc1 = Dense(units=120, activation='relu')(flatten)
fc2 = Dense(units=84, activation='relu')(fc1)
# Output layer
outputs = Dense(units=10, activation='softmax')(fc2)
model = Model(inputs=inputs, outputs=outputs)
Advantages of Convolutional Networks
Special Convolution (Depth-wise Separable Convolutions)
[Figure: normal convolution vs. depth-wise separable convolution]
But why? A depth-wise separable convolution factorizes a normal convolution into a per-channel (depth-wise) filter followed by a 1x1 (point-wise) convolution, which sharply cuts parameters and computation; see the sketch below.
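The parameter savings are easy to check in Keras (channel counts here are illustrative):

from keras.layers import Input, Conv2D, SeparableConv2D
from keras.models import Model

inputs = Input(shape=(32, 32, 64))
normal = Model(inputs, Conv2D(128, (3, 3), padding='same')(inputs))
separable = Model(inputs, SeparableConv2D(128, (3, 3), padding='same')(inputs))
print(normal.count_params())     # 3*3*64*128 + 128 = 73,856
print(separable.count_params())  # 3*3*64 (depth-wise) + 64*128 + 128 (point-wise) = 8,896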
Transpose Convolution: 2D Example
Stride 1: Transposed convolution with a 2x2 kernel. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation.
Stride 2: [Figure: the same operation with stride 2]
Transposed Convolution
# Add a transposed convolutional layer with 2 filters, kernel size 2x2, stride 2, no padding
Conv2DTranspose(2, (2, 2), strides=(2, 2), padding='valid', input_shape=(height, width, channels))
● (2, 2): This is the size of the convolutional kernel in the layer.
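A quick shape check (a sketch with a 1-channel 2x2 input): with stride 2 and valid padding, the output size is (I − 1) × S + F, so 2x2 becomes 4x4:

from keras.layers import Input, Conv2DTranspose
from keras.models import Model

inputs = Input(shape=(2, 2, 1))
up = Conv2DTranspose(2, (2, 2), strides=(2, 2), padding='valid')(inputs)
print(Model(inputs, up).output_shape)   # (None, 4, 4, 2): each spatial dimension doubles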
AI researcher Fei-Fei Li began working on the idea for ImageNet in 2006.
Classical Architectures
● AlexNet
● VGG Network
● ResNet
● Inception Net
Revolution of depth (ImageNet Benchmark)
[Figure: ImageNet top-5 error by year, split into non-CNN and CNN entries]
AlexNet: CNNs Success
AlexNet uses max pooling after the first, second, and fifth convolutional layers to reduce the spatial dimensions of the feature maps.
AlexNet
Output layer:
● This layer has 1000 units (i.e., number of classes in the ImageNet dataset).
● The output is passed through a softmax activation function to produce the
final class probabilities.
AlexNet
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from keras.models import Model

# Input layer (227x227 RGB images, as commonly used for AlexNet)
inputs = Input(shape=(227, 227, 3))
# First convolutional layer, 96 filters, kernel size of 11x11 and stride of 4x4, followed by ReLU
conv1 = Conv2D(filters=96, kernel_size=(11, 11), strides=(4, 4), activation='relu')(inputs)
# Max pooling layer with pool size of 3x3 and stride of 2x2
pool1 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv1)
# Batch normalization layer
bn1 = BatchNormalization()(pool1)
# Second convolutional layer with 256 filters, kernel size of 5x5 and padding of same, followed by ReLU
conv2 = Conv2D(filters=256, kernel_size=(5, 5), padding='same', activation='relu')(bn1)
# Max pooling layer with pool size of 3x3 and stride of 2x2
pool2 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv2)
# Batch normalization layer
bn2 = BatchNormalization()(pool2)
# Third convolutional layer with 384 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv3 = Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu')(bn2)
# Fourth convolutional layer with 384 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv4 = Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu')(conv3)
# Fifth convolutional layer, 256 filters, kernel size of 3x3 and padding of same, followed by ReLU
conv5 = Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu')(conv4)
# Max pooling layer with pool size of 3x3 and stride of 2x2
pool5 = MaxPooling2D(pool_size=(3, 3), strides=(2, 2))(conv5)
# Batch normalization layer
bn3 = BatchNormalization()(pool5)
# Flatten layer
flatten = Flatten()(bn3)
# First fully connected layer with 4096 units, followed by ReLU activation and dropout
fc1 = Dense(units=4096, activation='relu')(flatten)
dropout1 = Dropout(0.5)(fc1)
# Second fully connected layer with 4096 units, followed by ReLU activation and dropout
fc2 = Dense(units=4096, activation='relu')(dropout1)
dropout2 = Dropout(0.5)(fc2)
# Output layer: 1000 units with softmax (ImageNet classes)
outputs = Dense(units=1000, activation='softmax')(dropout2)
model = Model(inputs=inputs, outputs=outputs)
LeNet and AlexNet
● AlexNet is deeper and more complex than LeNet, with more layers
and more parameters.
● AlexNet has larger filters in initial layers (e.g., 11x11 in the first layer),
while LeNet uses smaller filters (e.g., 5x5).
● AlexNet achieved state-of-the-art performance on the ImageNet
dataset, while LeNet was designed and tested on a smaller
handwritten digit recognition task.
VGG Network
VGG16 showed good performance on the ImageNet benchmark.
VGG Networks
● VGG stands for Visual Geometry Group (University of Oxford), which developed a family of CNN architectures for image classification.
● The original VGG model, VGG16, has 16 layers, including 13 convolutional layers and 3 fully connected layers.
● VGG19 has 19 layers, including 16 convolutional layers and 3 fully connected layers.
● The VGG family also includes VGG11 and VGG13, with fewer convolutional layers than VGG16 and VGG19.
● VGG models generalize well to many tasks such as classification, object detection, segmentation, style transfer, and transfer learning.
VGG16 architecture
VGG16 (summary)
VGG16 Implementation
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Model

# Input tensor (224x224 RGB images, the standard VGG16 input size)
input_tensor = Input(shape=(224, 224, 3))
# Block 1
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(input_tensor)
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x)
# Block 2
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x)
# Block 3
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv1')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv2')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(x)
# Block 4
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(x)
# Block 5
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv1')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv2')(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv3')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(x)
# Classifier head: the 3 fully connected layers (4096, 4096, and 1000 units)
x = Flatten(name='flatten')(x)
x = Dense(4096, activation='relu', name='fc1')(x)
x = Dense(4096, activation='relu', name='fc2')(x)
output_tensor = Dense(1000, activation='softmax', name='predictions')(x)
# Create model
model = Model(inputs=input_tensor, outputs=output_tensor, name='vgg16')
Classical Architectures

Architecture | Year | Layers | Key Innovations | Parameters | Researchers
AlexNet | 2012 | 8 | CNN architecture | 62 million | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | Karen Simonyan and Andrew Zisserman

Next?
Not sustainable!
Inception Layer (key idea): 1x1 Convolutions
Recall: Convolutions on Images
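A 1x1 convolution mixes channels at each spatial position without touching spatial structure, which is how Inception modules cheaply reduce depth before the expensive 3x3 and 5x5 filters. A small Keras sketch (the 256-to-64 reduction is illustrative):

from keras.layers import Input, Conv2D
from keras.models import Model

inputs = Input(shape=(28, 28, 256))
reduced = Conv2D(64, (1, 1), activation='relu')(inputs)   # 256 -> 64 channels
model = Model(inputs, reduced)
print(model.output_shape)      # (None, 28, 28, 64): spatial size unchanged
print(model.count_params())    # 1*1*256*64 + 64 = 16,448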
Input
|
Conv2D -> ReLU -> MaxPooling2D …
|
Inception module
|
…
|
…
|
Inception module
|
GlobalAveragePooling -> Dense -> Softmax
GlobalAveragePooling
x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
x = Dense(num_classes, activation='softmax')(x)
Main classifier
Auxiliary classifiers
● During training, the loss from the auxiliary classifiers is added to the overall
loss (main classifier) of the network with a weight factor (usually 0.3).
● During inference, the outputs of the auxiliary classifiers are discarded, and
only the output of the main classifier is used to make predictions.
● The number and placement of the auxiliary classifiers in InceptionNet can
vary depending on the specific architecture and task.
Inception Net (Main Components)
● The network begins with a series of convolutional and pooling layers to extract low-level features from the image.
● Inception modules (3a, 3b, 4a-4e, 5a, 5b) are stacked on top of each other to form the "stem" of the network.
● The network also includes several "Reduction" modules, which are used to reduce the spatial dimensions of the feature maps.
● In addition to the main classifier, the network also includes two "Auxiliary classifiers" at intermediate layers.
x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
● Inception Net has been refined and optimized, leading to several smaller
and faster variants such as Inception-v2, Inception-v3, and
Inception-ResNet.
● Inception-ResNet incorporates residual connections into the Inception
modules to further improve training stability and performance.
InceptionNet Applications: Neural Style Transfer
● Puzzle: VGG is a better feature extractor than InceptionNet for style transfer; stylization performance degrades when InceptionNet is used instead of VGG.
Wang, Pei, Yijun Li, and Nuno Vasconcelos. "Rethinking and improving the robustness of image style transfer." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
ImageNet Benchmark (Recap)
Common Performance Metrics
● Top-1 score: check whether a sample's top class (i.e., the one with the highest probability) is the same as its target label.
● Top-5 score: check whether the target label is among the 5 predictions with the highest probabilities.
● Top-5 error: percentage of test samples for which the correct class was not in the top 5 predicted classes.
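A minimal NumPy sketch of these metrics (the probabilities are made-up toy values):

import numpy as np

def top_k_accuracy(probs, labels, k):
    # probs: (N, C) predicted probabilities; labels: (N,) target class indices
    top_k = np.argsort(probs, axis=1)[:, -k:]            # the k highest-probability classes
    return np.mean([labels[i] in top_k[i] for i in range(len(labels))])

probs = np.array([[0.1, 0.6, 0.3],     # sample 0: predicts class 1
                  [0.5, 0.2, 0.3]])    # sample 1: predicts class 0
labels = np.array([1, 2])
print(top_k_accuracy(probs, labels, k=1))   # 0.5: only sample 0 is a top-1 hit
print(top_k_accuracy(probs, labels, k=2))   # 1.0: class 2 is in sample 1's top 2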
Classical Architectures (Recap)

Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers | 4-12 million | 74.8% | Szegedy et al.
Vanishing Gradient Problem:
During gradient descent we evaluate ∂L/∂h_1 = ∂L/∂h_L · ∏_{l=2}^{L} ∂h_l/∂h_{l-1}, which is a product of many layer-to-layer Jacobians; when their norms are below 1, the gradient shrinks exponentially with depth.
Exploding Gradient Problem:
During gradient descent we evaluate the same product of Jacobians, which can instead grow exponentially when their norms exceed 1, so the gradient blows up.
Residual Block (key idea)
[Figure: the input passing through two layers (linear, then non-linear), with a skip connection around them]
In each residual block, there is a skip connection (residual connection) that bypasses one or more layers and allows the gradient to flow directly to earlier layers.
Residual Block
● Residual connections improve the gradient flow and enable the network to learn deeper and more complex features.
ResNet Block
The residual connection is added to the output of the convolutional layers before the ReLU activation function is applied:
x = Add()([x, inputs])
x = Activation('relu')(x)
return x
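Putting the fragment in context, a minimal sketch of a full residual block (filter counts and the exact conv/BatchNorm ordering are illustrative, not the precise ResNet configuration):

from keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(inputs, filters):
    # Main path: two 3x3 convolutions (assumes `inputs` already has `filters` channels)
    x = Conv2D(filters, (3, 3), padding='same')(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(filters, (3, 3), padding='same')(x)
    x = BatchNormalization()(x)
    # Skip connection: add the input before the final ReLU
    x = Add()([x, inputs])
    x = Activation('relu')(x)
    return x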
ResNet
ResNet architectures for ImageNet. Building blocks are shown in brackets, with the numbers of blocks stacked.
ResNet
The network can effectively choose to use fewer layers when it is not
necessary, which can improve efficiency and reduce overfitting.
Why do ResNets Work?
● Shortcut connections in ResNets enable the gradient to flow more directly and
efficiently through the network.
● ResNets address the problem of vanishing gradients that can occur in very
deep neural networks.
ResNet
Thin curves denote training error, and bold curves denote validation error of the center crops.
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import numpy as np
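Continuing from these imports, a minimal inference sketch ('elephant.jpg' is a placeholder path):

model = ResNet50(weights='imagenet')

img = image.load_img('elephant.jpg', target_size=(224, 224))   # ResNet50 expects 224x224 inputs
x = np.expand_dims(image.img_to_array(img), axis=0)            # add the batch dimension
x = preprocess_input(x)                                        # ImageNet-style preprocessing

preds = model.predict(x)
print(decode_predictions(preds, top=5)[0])                     # top-5 (class id, name, probability)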
Architecture | Year | Layers | Key Innovations | Parameters | Accuracy (ImageNet) | Researchers
AlexNet | 2012 | 8 | ReLU, LRN | 62 million | 57.2% | Alex Krizhevsky et al.
VGGNet | 2014 | 16-19 | 3x3 convolution filters, deep architecture | 138-144 million | 74.4% | Karen Simonyan and Andrew Zisserman
Inception Net | 2014 | 22-42 | Inception modules, auxiliary classifiers, batch normalization | 4-12 million | 74.8% | Szegedy et al.

Accuracy per parameter vs. network. Information density (accuracy per parameter) is an efficiency metric that highlights the capacity of a specific architecture to better utilise its parametric space.
Detecting Multiple Objects: Sliding Window
“Backbone” network: AlexNet, VGG, ResNet, etc.
Fast R-CNN
Cropping Features: RoI Pool
Cropping Features: RoI Align
● Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
● Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.
● Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems 28 (2015).
Object Recognition and Face Recognition
Object Detection: Task Definition
R-CNN
Accurate object detection is slow!
[Figure: frame-processing latency translated into the distance a moving car travels between detections: 176, 12, 8, and 2 feet for successively faster detectors]
With YOLO, you only look once at an image to perform detection.
[Figure: per-cell class probability map over classes such as bicycle, car, dog, and dining table]
Conditioned on object: P(Car | Object)
Then we combine the box and class predictions.
Finally, apply a threshold with non-maximum suppression to remove duplicate detections.
This parameterization fixes the output size.
For Pascal VOC (total predictions):
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes
- Pretraining on ImageNet
- Extensive data augmentation
- For details, see the paper
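A quick check of the fixed output size (each box predicts x, y, w, h, and a confidence, i.e. 5 values):

S, B, C = 7, 2, 20                 # grid size, boxes per cell, classes (Pascal VOC)
print(S * S * (B * 5 + C))         # 7 * 7 * 30 = 1470 output values, fixed in advance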
YOLO works across a variety of natural images
It also generalizes well to new domains (like art)
Visualizing Importance: The Occlusion Experiment
Block different parts of the image and see how the classification score changes.
[Figure: occlusion heatmap; high values where occlusion barely matters, small values where the blocked region was important]
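A sketch of the procedure (the `model` is assumed to expose a predict() over a (1, H, W, C) batch; the patch size and gray fill value are arbitrary choices):

import numpy as np

def occlusion_map(model, img, true_class, patch=16, stride=16):
    # Slide a gray square over the image and record the true-class score at each position
    H, W, _ = img.shape
    heatmap = np.zeros(((H - patch) // stride + 1, (W - patch) // stride + 1))
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            occluded = img.copy()
            occluded[i:i + patch, j:j + patch, :] = 0.5        # gray occluder
            score = model.predict(occluded[None])[0, true_class]
            heatmap[i // stride, j // stride] = score          # low score = important region
    return heatmap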
● Training your own model can be difficult with limited data and other resources.
● For example, it is a laborious task to manually annotate your own training dataset.
● Why not reuse already pre-trained models?
Transfer Learning
[Figure: a network trained on ImageNet learns generic features (e.g. edges) in early layers; for a new data distribution, these features are reused and only the decision layers are retrained]
Transfer learning from task T1 to task T2 works best:
● When task T1 and T2 have the same input (e.g. an RGB image)
● When you have more data for task T1 than for task T2
● When the low-level features for T1 could be useful to learn T2
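A typical Keras recipe (the backbone choice, head sizes, and num_classes are placeholders for your task): freeze the pretrained feature extractor and train only a new decision head:

from keras.applications import VGG16
from keras.layers import Flatten, Dense
from keras.models import Model

num_classes = 10                            # placeholder: your task's class count
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze the ImageNet-trained features

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)        # small new head for the target task
outputs = Dense(num_classes, activation='softmax')(x)

model = Model(base.input, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')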
Sequence Modeling
Sequences are Everywhere
Sequence Modeling Applications
Example: activity recognition. Input: video frames; output: an activity label (e.g., "Punching").
NLP Tasks Overview
● Language Translation
● Query Recommendations
● Spelling and Grammar Corrections
● Sentiment Analysis
● Topic Modeling (e.g., via a Document-Word Matrix)
NLP Tasks
● One to One: Image Classification
● Many to Many (Sequence to Sequence): Machine Translation
Recurrent Neural Networks
[Figure: a recurrent cell unrolled over timesteps h0, h1, h2, ...; at each step it consumes an input vector, updates its hidden state h_t, and emits an output vector]
RNN Parameters
[Figure: the recurrent cell reuses the same parameters at every timestep]
RNN Cell Update
Vanilla RNN: the hidden state is updated from the previous state and the current input vector by the standard recurrence h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h), as sketched below.
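A minimal NumPy sketch of this update (sizes and random weights are illustrative); note that the same weights are reused at every timestep:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                              # h_0: 4-dim hidden state
for x_t in rng.normal(size=(5, 3)):          # unroll over a length-5 sequence of 3-dim inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # same parameters at every step
print(h)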
Gradient Flow
Long Short-Term Memory (LSTM): Key Idea
[Figure: a separate memory cell c_t, updated via a candidate c̃_t, carries information across timesteps]
A look inside an LSTM cell
● Forget gate: f_t = σ(W_f [h_{t-1}, x_t] + b_f)
● Input gate: i_t = σ(W_i [h_{t-1}, x_t] + b_i)
● Update candidate: c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c)
● Memory cell update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
● Output gate: o_t = σ(W_o [h_{t-1}, x_t] + b_o)
● Output: h_t = o_t ⊙ tanh(c_t)
Observation
● If the forget gate is always 1 and the input gate is always 0, the memory cell internal
state will remain constant forever.
● However, input gates and forget gates give the flexibility to learn when to keep this
value unchanged and when to perturb it in response to subsequent inputs.
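A NumPy sketch of one LSTM step makes this concrete (the weight layout, mapping the concatenated [h; x] to four stacked gate pre-activations, is one common convention):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    # W maps the concatenated [h; x] to the 4 gate pre-activations, stacked
    z = W @ np.concatenate([h, x]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o, g = sigmoid(f), sigmoid(i), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g              # with f = 1 and i = 0, c_new == c: the state is preserved
    h_new = o * np.tanh(c_new)
    return h_new, c_new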
What about gradient flow?
Long Short-Term Memory (LSTM): Gradient Flow
[Figure: the additive cell-state path lets gradients flow across many timesteps]
Gated Recurrent Units (GRU)
Main solution for better RNNs: gated units that control the memory content
● Bidirectional RNNs are only applicable if you have access to the entire input sequence.
○ Bidirectional RNNs are not applicable to Language Modeling, because in LM you
only have left context available.
● Bidirectional LSTMs perform better than unidirectional ones in speech recognition.
Deep RNNs
Single-Layer RNNs
Multi-layer RNNs
[Figure: two stacked recurrent layers; the first layer's hidden state h_t1 feeds the second layer's state h_t2, which produces the output y_t]
Multi-layer RNNs
● RNNs are already "deep" on one dimension (they unroll over many timesteps).
● We can also make them "deep" in another dimension by applying multiple RNNs: this is a multi-layer RNN.
● This allows the network to compute more complex representations.
● The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
● Multi-layer RNNs are also called stacked RNNs.
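A minimal Keras sketch of a stacked (multi-layer) RNN; the layer widths and class count are illustrative:

from keras.layers import Input, LSTM, Dense
from keras.models import Model

inputs = Input(shape=(None, 32))                 # variable-length sequences of 32-dim features
h1 = LSTM(64, return_sequences=True)(inputs)     # lower layer: emits a state at every timestep
h2 = LSTM(64)(h1)                                # upper layer: consumes the lower layer's sequence
outputs = Dense(10, activation='softmax')(h2)
model = Model(inputs, outputs)
model.summary()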