Source: A. Rosebrock, Deep Learning for Computer Vision with Python - Starter Bundle. Faculty of Engineering Sciences, ESAT-PSI
Convolutional Layers
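As a concrete illustration of the sliding-filter idea, here is a minimal "valid" 2D convolution (strictly speaking, cross-correlation) in NumPy; the function name, the horizontal-gradient filter, and the step-edge image are all invented for illustration, not taken from the slides:

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2D cross-correlation: slide the kernel over every position
    # and take the dot product with the underlying image patch.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

edge = np.array([[-1., 0., 1.]])               # simple horizontal-gradient filter
img = np.tile([0., 0., 0., 1., 1., 1.], (3, 1))  # dark-to-bright step edge
out = conv2d(img, edge)                        # each row -> [0, 1, 1, 0]
```

The same small filter is reused at every spatial position, which is exactly the weight sharing discussed later in the deck.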
Pooling Features (“Subsampling”)
Source: CSC411: Machine Learning and Data Mining
Source: http://cs231n.github.io/convolutional-networks/
Max Pooling as Hierarchical Invariance
• Max pooling: at each level of the hierarchy, we use an “OR” over a neighborhood to get features that are invariant across a bigger range of transformations.
• Average pooling is a little bit like an “AND”: it responds strongly only when most units in the neighborhood are active.
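The OR/AND intuition can be made concrete with a minimal NumPy sketch of non-overlapping 2x2 pooling; the helper name and the example feature map are invented for illustration:

```python
import numpy as np

def pool2x2(x, mode="max"):
    # Non-overlapping 2x2 pooling over an (H, W) feature map (H, W even).
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 0, 2],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]], dtype=float)

pool2x2(fmap)          # max: fires if ANY unit in the block is large -> [[4, 2], [2, 8]]
pool2x2(fmap, "mean")  # mean: large only if MOST units are large -> [[2.5, 1.], [0.75, 6.5]]
```

Note that max pooling keeps the strongest response regardless of where in the 2x2 block it occurred, which is exactly the small-translation invariance the slide describes.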
Putting it all together
Why Convolutional Nets
• It is possible to compute the same outputs with a fully connected neural network, but:
• the network is much harder to train: more weights, more data needed, slower convergence;
• there is more danger of overfitting if we try it with a really big network.
• A convolutional network has fewer parameters due to weight sharing.*
• It makes sense to detect features first and then combine them; that is what the brain seems to be doing.
* Small fully connected networks can work very well, but they are hard to train.
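A back-of-the-envelope comparison makes the weight-sharing point concrete; the layer sizes below are assumed for illustration, not taken from the slides:

```python
# One layer mapping a 32x32x3 image to 32x32x16 feature maps.
# A conv layer shares each 3x3x3 filter across all spatial positions;
# a fully connected layer needs one weight per input-output pair.
conv_params = 16 * (3 * 3 * 3 + 1)               # 16 filters + biases -> 448
fc_params = (32 * 32 * 3 + 1) * (32 * 32 * 16)   # dense equivalent -> 50,348,032
print(conv_params, fc_params)
```

Five orders of magnitude fewer parameters for the same output shape: fewer weights to fit means less data needed and less risk of overfitting, as the bullets above argue.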
LeNet (1998): the origin of convolutional neural networks
Source: http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/JBernauer__MLIntro_Tutorial_Tuesday_02072017.pdf
Source: https://www.slideshare.net/0xdata/intro-to-machine-learning-for-gpus
ReLU Non-Linearity: A Simpler Activation
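A minimal NumPy sketch of ReLU and its gradient (the function names are illustrative); the point is that both the forward pass and the gradient are a single cheap comparison, and the gradient does not saturate for positive inputs the way sigmoid or tanh tails do:

```python
import numpy as np

def relu(x):
    # max(0, x): one comparison per element, no exponentials.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 wherever the unit is active, 0 elsewhere.
    return (x > 0).astype(float)

relu(np.array([-2.0, 0.0, 3.0]))       # -> [0., 0., 3.]
relu_grad(np.array([-2.0, 0.0, 3.0]))  # -> [0., 0., 1.]
```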
Source: https://medium.com/coinmonks/paper-review-of-vggnet-1st-runner-up-of-ilsvlc-2014-image-classification-d02355543a11
VGGNet
• Deep (16/19-layer) networks
• On par with GoogLeNet on ILSVRC
• Large (500 MB!)
• Achieves higher classification accuracy than GoogLeNet in practice
• Generalizes better
• Better suited for transfer learning and fine-tuning
• Learn multi-scale features
• Do additional max pooling (at the time, max pooling was claimed to be essential)
• Super expensive if we want a decent number of filters in each layer
Inception Module
The 1x1 convolutions at the bottom of the module reduce the number of input channels, which makes the subsequent, more expensive convolutions much cheaper.
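A quick cost estimate shows why the 1x1 reduction matters. The channel counts below (a 28x28x192 input, a bottleneck to 16 channels, 32 output channels) are assumed for illustration, in the spirit of GoogLeNet-sized layers:

```python
# Multiply-add cost of one 5x5 convolution producing 32 channels from a
# 28x28x192 input, with and without a 1x1 bottleneck reducing 192 -> 16 channels.
direct  = 28 * 28 * 32 * (5 * 5 * 192)                      # ~120M mult-adds
reduced = 28 * 28 * 16 * 192 + 28 * 28 * 32 * (5 * 5 * 16)  # ~12M mult-adds
print(direct / reduced)   # roughly a 10x saving
```

The bottleneck adds a small 1x1 layer of its own (the first term of `reduced`), but the overall cost still drops by close to an order of magnitude.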
GoogLeNet Key Ideas
Auxiliary Classifiers
Why go deeper?
• According to the universal approximation theorem, a feedforward network with a single hidden layer, given enough capacity, is sufficient to represent any continuous function.
• However, that layer might have to be massive, and the network is prone to overfitting the data.
• Hence the common trend in the research community: network architectures need to go deeper.
• AlexNet: 5 convolutional layers
• VGGNet: 19 layers
• GoogLeNet: 22 layers
Source: A. Rosebrock, Deep Learning for Computer Vision with Python - Practitioner Bundle
Nested function classes
• Adding layers does not just make the network more expressive; it can also change its behavior in ways that are not always predictable.
• Only if larger function classes contain the smaller ones are we guaranteed that enlarging them strictly increases the expressive power of the network.
• At the heart of ResNet is the idea that every additional layer should contain the identity function as one of its elements. If the newly added layer can be trained into an identity mapping f(x) = x, the new model is at least as effective as the original one. And since the new model may find a better solution on the training set, the added layer can make it easier to reduce the training error.
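The identity-containment idea can be checked with a tiny NumPy sketch of a residual block; the two-layer branch, the layer sizes, and the function names are all assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # The skip connection adds the input back to the residual branch's output:
    # out = g(x) + x, where g is a small two-layer branch.
    return relu(x @ W1) @ W2 + x

x = np.random.randn(4, 8)
# With the residual branch's weights at zero, the block is exactly the identity:
out = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

Because the zero-weight configuration reproduces f(x) = x exactly, the enlarged function class provably contains the original model, which is the nesting property the paragraph above describes.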
Transfer Learning and fine-tuning
• But there is another type of transfer learning, one that can actually outperform the feature-extraction method if you have sufficient data.
• This method is called fine-tuning.
• First, cut off the final set of fully-connected layers from a pre-trained Convolutional Neural Network.
• Replace them with a new set of fully-connected layers with random initializations.
• All layers below the new head are frozen so their weights cannot be updated while the head is trained.
• Finally, un-freeze the body and continue training with a very low learning rate.
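The freeze-then-train recipe can be sketched in a few lines of NumPy; the toy two-layer "body", the freshly initialized head, and the single least-squares gradient step are all stand-ins for illustration, not the book's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pre-trained body: two layers whose weights we keep frozen.
body = [rng.standard_normal((8, 8)) for _ in range(2)]
head = 0.01 * rng.standard_normal((8, 3))   # new FC head, random initialization

def features(x):
    for W in body:                  # frozen feature extractor: never updated here
        x = np.maximum(0.0, x @ W)
    return x

# Train only the head: one MSE gradient step with a very low learning rate.
x = rng.standard_normal((16, 8))
y = rng.standard_normal((16, 3))
f = features(x)
head -= 1e-4 * f.T @ (f @ head - y) / len(x)
```

Freezing is implemented simply by never writing to `body`; in a real framework this corresponds to excluding those parameters from the optimizer.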
Transfer Learning
• Three ways in which transfer might improve learning: a higher start, a steeper learning slope, and a higher asymptote.
Downsides
• There are many downsides to treating a neural network trained for classification as an object detector, namely:
• Sliding windows + image pyramids are incredibly slow, even when utilizing a GPU for inference.
• It can be tedious to tune the scale of the image pyramid and the step size of the sliding window.
• Because of this tedious parameter selection, we can easily miss objects in our images.
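To see why this approach is slow, here is a minimal sketch of an image pyramid (simple 2x subsampling stands in for real resizing) and a sliding-window generator; all names, sizes, and step values are illustrative:

```python
import numpy as np

def image_pyramid(image, min_size=32):
    # Repeatedly halve the image until it is smaller than the window.
    while min(image.shape[:2]) >= min_size:
        yield image
        image = image[::2, ::2]

def sliding_windows(image, step, win):
    # Every (win x win) crop at every step-spaced position must be classified.
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            yield y, x, image[y:y + win, x:x + win]

# Even a tiny 64x64 image yields dozens of windows across the pyramid:
img = np.zeros((64, 64))
n = sum(1 for scale in image_pyramid(img)
        for _ in sliding_windows(scale, step=8, win=32))
print(n)   # 26 CNN forward passes for one small image
```

For realistic image sizes, fine steps, and many pyramid levels, this count grows into the thousands of forward passes per image, which is the slowness the bullets above refer to.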
• With these negatives in mind, the question arises:
“Is there a way to build an end-to-end object detector with deep learning? And if so, why even bother studying the fundamentals of object detection?”
End-to-End DCNN Object Detection
• The answer to the first part of the question is yes: we can train end-to-end deep learning object detectors, but we need to leverage specific network architectures and frameworks to do so, namely Faster R-CNNs and SSDs.
• To answer the second question, we need the concept of sliding windows to understand how traditional methods localized objects. Deep learning-based object detectors utilize either:
• Region proposal methods, which zero in on the areas of an image that look “interesting” and therefore deserve closer attention and more computation.
• Image division, where an image is partitioned into regions, passed into a CNN, and the regions are then modified and grouped together based on the output predictions.
• It would be extremely challenging, if not impossible, to understand and appreciate these approaches to object detection without first understanding the classical approach of image pyramids and sliding windows.
Bounding/Anchor Boxes
• Bounding Boxes
• Anchor Boxes