

Deep Convolutional Neural Networks for Image Classification

CNN basic elements

2 Faculty of Engineering Sciences, ESAT-PSI


Convolutional Layers

A. Rosebrock: Deep Learning for Computer Vision with Python - Starter Bundle
Pooling Features (“Subsampling”)

• The job of complex cells


• Max Pooling
• Is there a diagonal edge somewhere in an area of the image?
• Take the maximum over the responses to the feature detector in the area
• Average Pooling
• Is there a blob pattern in an area of the image?
• Take the average over the responses to the feature detectors in the area
• Max Pooling generally works better
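
As a minimal sketch of the two pooling operations listed above, the NumPy snippet below pools non-overlapping 2x2 areas of a single feature map. The function name `pool2d` and the 4x4 example map are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size areas of a 2-D feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/cols that do not fill a full area
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # strongest detector response in each area
    return blocks.mean(axis=(1, 3))      # average detector response in each area

# Hypothetical 4x4 map of feature-detector responses.
responses = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(responses, mode="max")
avg_pooled = pool2d(responses, mode="avg")
print(max_pooled)
print(avg_pooled)
```

Both outputs are 2x2: each value summarizes one 2x2 area of the input, so the feature map shrinks by a factor of two in each direction.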

CSC411: Machine Learning and Data Mining
http://cs231n.github.io/convolutional-networks/
Max Pooling as Hierarchical Invariance
• Max Pooling: at each level of the hierarchy, we use an “OR” to get features that are invariant across a bigger range of transformations.
• Average Pooling is a little bit like an “AND”.

Putting it all together

• Different types of layers: convolution and subsampling.


• Convolution layers compute feature maps: the responses to multiple feature detectors on a grid in the lower layer
• Subsampling layers pool the features from a lower layer into a smaller feature map
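
The two layer types can be sketched together in a few lines of NumPy. This is only an illustration under stated assumptions: the 6x6 step-edge image, the two hand-written 3x3 edge detectors, and the helper names `conv2d_valid` and `subsample_max` are all made up for the example, and the "convolution" is the unflipped sliding dot product commonly used in CNNs.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Slide each k x k feature detector over the image (no padding, stride 1)."""
    n, k, _ = kernels.shape
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    maps = np.empty((n, out_h, out_w))
    for m, kernel in enumerate(kernels):
        for i in range(out_h):
            for j in range(out_w):
                maps[m, i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return maps

def subsample_max(maps, size=2):
    """Max-pool each feature map; assumes height and width are divisible by size."""
    n, h, w = maps.shape
    return maps.reshape(n, h // size, size, w // size, size).max(axis=(2, 4))

# Hypothetical input: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
horizontal_edge = vertical_edge.T
kernels = np.stack([vertical_edge, horizontal_edge])

feature_maps = conv2d_valid(image, kernels)   # shape (2, 4, 4)
pooled = subsample_max(feature_maps)          # shape (2, 2, 2)
print(feature_maps[0])
print(pooled[0])
```

The vertical-edge detector responds along the edge, the horizontal one stays silent, and the subsampling layer keeps that information in a map a quarter of the size.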

Why Convolutional Nets

• It’s possible to compute the same outputs in a fully connected neural network, but:
• The network is much harder to train: more weights, more data, slower convergence
• There is more danger of overfitting if we try it with a really big network
• A convolutional network has fewer parameters due to weight sharing *
• It makes sense to detect features and then combine them
• That’s what the brain seems to be doing

* Small fully connected networks can work very well, but are hard to train
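
To make the effect of weight sharing concrete, the sketch below counts parameters for one layer. The sizes are assumptions chosen for illustration (a 32x32x3 input, 16 filters of size 5x5, and an output grid of the same height and width as the input); they do not come from the slides.

```python
def fc_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

def conv_params(kernel_size, in_channels, n_filters):
    """Weights plus biases of a convolutional layer (shared across positions)."""
    return n_filters * (kernel_size * kernel_size * in_channels + 1)

height, width, channels = 32, 32, 3   # assumed input volume
n_filters, kernel_size = 16, 5        # assumed convolution layer

conv_count = conv_params(kernel_size, channels, n_filters)
# A fully connected layer producing the same number of output units.
fc_count = fc_params(height * width * channels, height * width * n_filters)

print(f"convolutional layer: {conv_count:,} parameters")
print(f"fully connected layer: {fc_count:,} parameters")
```

Under these assumptions the convolutional layer needs about a thousand parameters while the equivalent fully connected layer needs tens of millions.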

LeNet (1998): The origin of convolutional neural networks

S. Banerjee, SlideShare
AlexNet (2012)



Layer of Neurons: matrix multiplication



CUDA

Historically, GPUs were used for graphics processing.

But people realized that the fine-grained parallelism inherent in the GPU architecture could be exploited for general-purpose computing.

CUDA (Compute Unified Device Architecture)
• Parallel computing platform
• Programming model and API
• Enables CUDA-capable GPUs to be used for general-purpose processing



GPU acceleration

• CPU: 7th gen i7–7500U, 2.7 GHz
• GPU: NVIDIA GeForce 940MX, 2 GB (laptop)
• GPU: NVIDIA GeForce 1070, 8 GB (desktop)
• 2 x AMD Opteron 6168 1.9 GHz processors (2x12 cores total), taken from a PowerEdge R715 server



CPU vs GPU

http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/JBernauer__MLIntro_Tutorial_Tuesday_02072017.pdf
https://www.slideshare.net/0xdata/intro-to-machine-learning-for-gpus
ReLU Non-Linearity – Simpler Activation



VGGNet (2014)
• Only 3x3 convolutions

• Doubling number of filters per “layer”


• Layer size (height × width) times layer thickness is kept constant
• Pretraining
• Training smaller versions of network
• Use converged weights as initialization for larger network layers
https://medium.com/coinmonks/paper-review-of-vggnet-1st-runner-up-of-ilsvlc-2014-image-classification-d02355543a11
VGGNet
• Deep (16/19 layers) networks
• On par with GoogLeNet on ILSVRC
• Large (500MB!)
• Achieves higher classification accuracy (compared to GoogLeNet) in practice
• Generalizes better
• Better fit for transfer learning and fine-tuning



GoogLeNet or Inception (2014)



https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202
Heterogeneous Set of Convolutions

• Learn multi-scale features
• Do additional max-pooling (at the time, max-pooling was claimed to be essential)
• Super expensive if we want a decent number of filters in each layer

Inception Module

The 1x1 convolutions at the bottom of the module reduce the number of inputs
by a factor of

Decreases computation cost dramatically
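
A rough worked example of the saving, counting multiply-accumulate operations for a 5x5 branch with and without a 1x1 reduction. All sizes here (a 28x28x192 input, a reduction to 16 channels, 32 output filters) are assumptions picked for illustration, and the helper `conv_macs` is hypothetical; the output is assumed to keep the input's height and width.

```python
def conv_macs(height, width, kernel_size, in_channels, out_channels):
    """Multiply-accumulates of a stride-1 convolution with same-size output."""
    return height * width * kernel_size * kernel_size * in_channels * out_channels

height, width, in_channels = 28, 28, 192   # assumed input volume
reduced_channels, out_channels = 16, 32    # assumed 1x1 bottleneck and 5x5 filters

# 5x5 convolution applied directly to all input channels.
direct = conv_macs(height, width, 5, in_channels, out_channels)

# 1x1 convolution first shrinks the channel count, then the 5x5 runs on fewer inputs.
reduced = (conv_macs(height, width, 1, in_channels, reduced_channels)
           + conv_macs(height, width, 5, reduced_channels, out_channels))

print(f"direct 5x5: {direct:,} MACs")
print(f"1x1 then 5x5: {reduced:,} MACs")
print(f"saving: {direct / reduced:.1f}x")
```

With these assumed sizes the bottlenecked branch needs roughly a tenth of the computation of the direct one.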

GoogLeNet Key Ideas



Inception v1

Auxiliary Classifiers

• Deep network: risk of vanishing gradients
• Add auxiliary classifiers
• Softmax outputs in the middle of the network, the same as at the top
• Encourages the network to learn features that are useful for classification in the middle
• The total loss function is a weighted sum of the auxiliary loss and the real loss.
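
The weighted sum can be sketched as follows. The function names, the four-class example logits, and the weight of 0.3 on the auxiliary heads are assumptions for illustration; the slide does not state which weight is used.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def total_loss(main_logits, aux_logits_list, label, aux_weight=0.3):
    """Real loss at the top plus down-weighted losses of the auxiliary classifiers."""
    main = cross_entropy(main_logits, label)
    aux = sum(cross_entropy(aux_logits, label) for aux_logits in aux_logits_list)
    return main + aux_weight * aux

# Hypothetical untrained network: every head outputs uniform logits over 4 classes.
main_logits = np.zeros(4)
aux_logits_list = [np.zeros(4), np.zeros(4)]
loss = total_loss(main_logits, aux_logits_list, label=2)
print(loss)
```

The auxiliary heads only shape training; at inference time only the main softmax output would normally be used.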

Why go deeper?
• According to the universal approximation theorem, given enough capacity, a feedforward network with a single hidden layer is sufficient to approximate any continuous function.
• However, the layer might be massive and the network is prone to overfitting the data.
• Therefore, the common trend in the research community is to make network architectures deeper.
• AlexNet: 5 convolutional layers
• VGGNet: 19
• GoogLeNet: 22



Why go deeper?
• A wide network is good for memorization, but not so good for generalization
• Multiple layers can learn features at various levels of abstraction
• Deep layers can provide features with global semantic meaning and abstract details (relations of relations ... of relations of objects), while using only small kernels
• Small kernels keep the number of parameters low
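
A small arithmetic sketch of why stacking small kernels is cheap: three stacked 3x3 layers see the same 7x7 area as one 7x7 layer but use fewer weights. The channel count of 64 and the helper names are assumptions for illustration; biases are ignored and every layer is assumed to keep the same number of channels with stride 1.

```python
def stacked_receptive_field(kernel_size, n_layers):
    """Side length of the input area seen by one output unit (stride 1)."""
    return 1 + n_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels, n_layers=1):
    """Weight count of n_layers convolutions with equal input/output channels."""
    return n_layers * kernel_size * kernel_size * channels * channels

channels = 64  # assumed layer thickness

single_7x7 = conv_weights(7, channels)
three_3x3 = conv_weights(3, channels, n_layers=3)

print("receptive field of three 3x3 layers:", stacked_receptive_field(3, 3))
print(f"one 7x7 layer: {single_7x7:,} weights")
print(f"three 3x3 layers: {three_3x3:,} weights")
```

The deeper stack also applies a non-linearity after each layer, so it is both cheaper and more expressive than the single wide kernel.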



How to go Deep?



https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
ResNet (Residual Neural Network) (2015)
• Extremely deep networks are feasible
• Can be trained using standard SGD
• And a reasonable initialization
• Relies on a micro-architecture called the Residual Module

A. Rosebrock: Deep Learning for Computer Vision with Python - Practitioner Bundle
Nested function classes
• Adding layers doesn’t only make the network more expressive; it also changes it in ways that are sometimes hard to predict.
• Only if larger function classes contain the smaller ones are we guaranteed that increasing them strictly increases the expressive power of the network.
• At the heart of ResNet is the idea that every additional layer should contain the identity function as one of its elements. This means that if we can train the newly added layer into an identity mapping 𝑓(𝐱)=𝐱, the new model will be as effective as the original model. As the new model may find a better solution to fit the training data set, the added layer might make it easier to reduce training errors.

https://d2l.ai/chapter_convolutional-modern/resnet.html


Residual Learning
• Learn y = f(x) + x
• These residual layers start at the identity function and evolve to become more complex as the network learns.
• This type of residual learning framework allows us to train networks that are substantially deeper than previously proposed network architectures.
• Furthermore, since the input is included in every residual module, it turns out the network can learn faster and with larger learning rates.
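
A minimal sketch of y = f(x) + x, using dense layers instead of convolutions to keep it short. The two-layer form of f, the layer sizes, and the zero initialization of the last weight matrix (one simple way to make the block start as the identity) are assumptions for illustration, not the exact ResNet recipe.

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = f(x) + x, where f is a small two-layer branch with a ReLU in between."""
    hidden = np.maximum(0.0, x @ w1)
    return hidden @ w2 + x   # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # hypothetical batch of 4 feature vectors
w1 = rng.normal(size=(8, 8))

# With the branch's last layer at zero, f(x) = 0 and the block is exactly the identity.
w2_zero = np.zeros((8, 8))
y_identity = residual_block(x, w1, w2_zero)

# As training moves the weights away from zero, the block departs from the identity.
w2_small = 0.01 * rng.normal(size=(8, 8))
y_trained = residual_block(x, w1, w2_small)

print(np.array_equal(y_identity, x))
print(np.abs(y_trained - x).max())
```

Because the identity is always available through the skip connection, adding such a block cannot make the function class smaller.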
ResNet



ResNet pre-activation variant
• Deeper ResNets still outperform shallower ResNets



The deeper the better!



DCNN as feature extractors
1. Use a pre-trained Convolutional Neural Network as a feature extractor.
2. Using this feature extractor, forward propagate your dataset of images through the network and extract the activations at a given layer.
3. A standard machine learning classifier can then be trained on top of the CNN features.
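
The three steps can be sketched end to end with NumPy only. Everything concrete here is a stand-in and should be read as an assumption: a fixed non-negative projection with a ReLU plays the role of the pre-trained network, synthetic 8x8 "dark" and "bright" images play the role of the dataset, and a nearest-centroid rule plays the role of the standard classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a frozen "pre-trained" extractor (hypothetical stand-in for a real CNN).
frozen_weights = np.abs(rng.normal(size=(64, 16)))

def extract_features(images):
    """Step 2: forward propagate images and return the layer activations."""
    flat = images.reshape(len(images), -1)
    return np.maximum(0.0, flat @ frozen_weights)

def make_data(n_per_class):
    """Synthetic 8x8 images: class 0 is dark, class 1 is bright."""
    dark = rng.normal(0.0, 1.0, size=(n_per_class, 8, 8))
    bright = rng.normal(3.0, 1.0, size=(n_per_class, 8, 8))
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return np.concatenate([dark, bright]), labels

train_images, train_labels = make_data(50)
test_images, test_labels = make_data(20)

# Step 3: train a simple classifier (nearest class centroid) on the extracted features.
train_features = extract_features(train_images)
centroids = np.stack([train_features[train_labels == c].mean(axis=0) for c in (0, 1)])

def predict(images):
    features = extract_features(images)
    distances = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

accuracy = (predict(test_images) == test_labels).mean()
print(train_features.shape, accuracy)
```

The extractor's weights are never updated; only the small classifier on top is fitted to the new data.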

Transfer Learning and fine-tuning
• But there is another type of transfer learning, one that can actually outperform the feature extraction method if you have sufficient data.
• This method is called fine-tuning.
• First, cut off the final set of fully-connected layers from a pre-trained Convolutional Neural Network.
• Replace them with a new set of fully-connected layers with random initializations.
• Freeze all layers before the FC head so their weights cannot be updated, and train only the new head.
• Then unfreeze the earlier layers and continue training with a very low learning rate.

Transfer Learning
• Three ways in which transfer might improve learning.

https://machinelearningmastery.com/transfer-learning-for-deep-learning/


Generative Adversarial Networks (2014)
• GANs can be used to generate synthetic (i.e., fake) images that are perceptually near identical to their ground-truth, authentic originals.
• In order to generate synthetic images, we make use of two neural networks during training:
• A generator that accepts an input vector of randomly generated noise and produces an output “imitation” image that looks similar, if not identical, to an authentic image
• A discriminator or adversary which attempts to determine whether a given image is “authentic” or “fake”
• By training both of these networks at the same time, one giving feedback to the other, we can learn to generate synthetic images.

Image credit: Thalles Silva
A. Rosebrock: Deep Learning for Computer Vision with Python - Practitioner Bundle
Goodfellow et al. (2014) Generative Adversarial Networks
Radford et al. (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
GAN
GANs’ potential is huge, because they can learn to mimic any distribution of data. That is, GANs can be taught to create worlds eerily similar to our own in any domain: images, music, speech, prose. They are robot artists in a sense, and their output is impressive – poignant even.

In a surreal turn, Christie’s sold a portrait for $432,000 that had been generated by a GAN, based on open-source code written by Robbie Barrat of Stanford. Like most true artists, he didn’t see any of the money, which instead went to the French company, Obvious.

https://skymind.ai/wiki/generative-adversarial-network-gan


GAN
• Discriminative algorithms try to classify input data; that is, given the features (x) of an instance of data, they predict the label or category (y) to which that data belongs, i.e. the posterior p(y|x).
• Discriminative models learn the boundary between classes

• Generative algorithms attempt to predict the features (x) given a certain label (y), i.e. the likelihood p(x|y).
• Generative models model the distribution of individual classes
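
The two quantities are linked by Bayes' rule, which a tiny one-dimensional example can illustrate. The two Gaussian classes, their means and standard deviations, and the equal priors are assumptions invented for this sketch.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Generative view: one distribution p(x|y) per class, plus a prior p(y).
class_means = np.array([0.0, 4.0])   # assumed class-conditional means
class_stds = np.array([1.0, 1.0])    # assumed class-conditional spreads
priors = np.array([0.5, 0.5])        # assumed p(y)

def likelihood(x):
    """p(x|y) for each class y."""
    return gaussian_pdf(x, class_means, class_stds)

def posterior(x):
    """Discriminative quantity p(y|x), obtained here via Bayes' rule."""
    joint = likelihood(x) * priors
    return joint / joint.sum()

post_mid = posterior(2.0)      # halfway between the classes: on the decision boundary
post_near_1 = posterior(3.5)   # close to the mean of class 1
print(post_mid, post_near_1)
```

A discriminative model would learn p(y|x) directly, without ever modeling how the features of each class are distributed.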



Object Detection using CNN
• Use traditional object detection procedure
1. Sliding windows
2. Image pyramids
3. Non-maxima suppression
4. Batch processing
• Substitute the conventional classifier with a CNN classifier
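
The first two steps of the traditional procedure can be sketched with two generators. This is an illustration under assumptions: the function names, the scale factor of 1.5, the 32x32 window, the step of 16 pixels, and the crude nearest-neighbour downscaling (used only to avoid an image library) are not taken from the slides. In a full detector each window would be passed to the CNN classifier.

```python
import numpy as np

def image_pyramid(image, scale=1.5, min_size=(32, 32)):
    """Yield the image at progressively smaller scales."""
    yield image
    while True:
        h, w = image.shape[:2]
        new_h, new_w = int(h / scale), int(w / scale)
        if new_h < min_size[0] or new_w < min_size[1]:
            break
        # Nearest-neighbour downscaling: pick evenly spaced rows and columns.
        rows = (np.arange(new_h) * scale).astype(int)
        cols = (np.arange(new_w) * scale).astype(int)
        image = image[rows][:, cols]
        yield image

def sliding_window(image, step, window):
    """Yield (x, y, patch) for every window position that fits in the image."""
    win_h, win_w = window
    for y in range(0, image.shape[0] - win_h + 1, step):
        for x in range(0, image.shape[1] - win_w + 1, step):
            yield x, y, image[y:y + win_h, x:x + win_w]

image = np.zeros((128, 128))  # hypothetical grayscale input
level_shapes = [level.shape for level in image_pyramid(image)]
n_windows = [len(list(sliding_window(level, step=16, window=(32, 32))))
             for level in image_pyramid(image)]
print(level_shapes)
print(n_windows, sum(n_windows))
```

Even this small image produces dozens of windows per pyramid level, each requiring a classifier call.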

Downsides
• There are many downsides to treating a neural network trained for classification as an object detector, namely:
• Sliding windows + image pyramids are incredibly slow, even when utilizing a GPU for inference
• It can be tedious to tune the scale for the image pyramid and the step size for the sliding window
• Due to the tediousness of the parameter selection, we can easily miss objects in our images
• With these negatives in mind, it raises the question: “Is there a way to build an end-to-end object detector with deep learning? And if so, why even bother studying the fundamentals of object detection?”
End-to-End DCNN Object Detection
• The answer to the first part of the question is yes: we can train end-to-end deep learning object detectors, but we need to leverage specific network architectures and frameworks to do so, namely Faster R-CNNs and SSDs.
• To answer the second question, we need to understand the concept of the sliding window to understand how traditional methods localized objects. Deep learning-based object detectors utilize either:
• Region proposal methods to zero in on the areas of an image that look “interesting” and therefore deserve closer attention and more computation.
• Image division, where an image is partitioned into regions, passed into a CNN, and then the regions are modified and grouped together based on the output predictions.
• It would be extremely challenging, if not impossible, to understand and appreciate these approaches to object detection without first understanding the classical approach of image pyramids and sliding windows.
Bounding/Anchor Boxes
• Bounding Boxes • Anchor Boxes



Labeling training set anchor boxes
• In the training set, we consider each anchor box as a training example.
• Each training anchor box gets two types of labels:
• the category of the target contained in the anchor box (category)
• the offset of the ground-truth bounding box relative to the anchor box
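
One common way to express the offset label is sketched below, with boxes given as (center x, center y, width, height). This particular parameterization (center shifts scaled by the anchor size, log ratios for width and height) is an assumption; the slides do not say which encoding is used, and the function names and example boxes are hypothetical.

```python
import numpy as np

def encode_offsets(anchor, ground_truth):
    """Offset label of a ground-truth box relative to its anchor box."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = ground_truth
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def decode_offsets(anchor, offsets):
    """Inverse operation: shift and rescale the anchor by the given offsets."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return np.array([ax + tx * aw, ay + ty * ah,
                     aw * np.exp(tw), ah * np.exp(th)])

anchor = np.array([50.0, 50.0, 20.0, 40.0])        # hypothetical anchor box
ground_truth = np.array([55.0, 46.0, 30.0, 40.0])  # hypothetical labeled object

offsets = encode_offsets(anchor, ground_truth)
recovered = decode_offsets(anchor, offsets)
print(offsets)
print(recovered)
```

Scaling by the anchor size keeps the regression targets in a similar numeric range for small and large anchors.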



Output Bounding Boxes for prediction
• In object detection, we first generate multiple anchor boxes, predict the categories and offsets for each anchor box, adjust the anchor box positions according to the predicted offsets to obtain the predicted bounding boxes, and finally filter the predicted bounding boxes with non-maximum suppression (NMS) to obtain the ones to be output.
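
The final filtering step can be sketched as a greedy non-maximum suppression over boxes given as (x1, y1, x2, y2). The IoU threshold of 0.5, the function names, and the three example boxes are assumptions for illustration.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """Intersection over union between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Two heavily overlapping predictions of one object, plus one separate object.
boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                  [1.0, 1.0, 11.0, 11.0],
                  [20.0, 20.0, 30.0, 30.0]])
scores = np.array([0.9, 0.8, 0.7])
kept = non_max_suppression(boxes, scores)
print(kept)
```

In a multi-class detector this filtering would typically be run separately for each category.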



Single Shot Multibox Detection
• Single-Shot: localization and detection performed in the same forward inference pass
• Multibox: multiple objects at the same time
• Detector: both category and position
• Multi-scale
• Base network as feature generator (e.g. ResNet)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37). Springer, Cham.
SSMD
• We progressively reduce the volume size in deeper layers (cf. standard CNN)
• Each of the CONV layers connects to the final detection layer (varying scale detection)
• Trained on categorical cross-entropy loss (labels) and smooth L1 loss (location)
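
A simplified sketch of such a combined objective is given below. It rests on several assumptions: label 0 means background, the location term uses the smooth L1 form (a common choice for box regression), it is computed only for anchors matched to an object, the sum is normalized by the number of matched anchors, and refinements such as hard negative mining are left out. Function names and the example values are hypothetical.

```python
import numpy as np

def smooth_l1(diff):
    """Quadratic near zero, linear for large errors."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < 1.0, 0.5 * abs_diff ** 2, abs_diff - 0.5)

def softmax_cross_entropy(logits, labels):
    """Per-anchor categorical cross-entropy from raw class scores."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def detection_loss(class_logits, class_labels, box_pred, box_target, alpha=1.0):
    positive = class_labels > 0                      # anchors matched to an object
    n_positive = max(int(positive.sum()), 1)
    label_loss = softmax_cross_entropy(class_logits, class_labels).sum()
    location_loss = smooth_l1(box_pred[positive] - box_target[positive]).sum()
    return (label_loss + alpha * location_loss) / n_positive

# Two anchors, two classes (background and one object class), uniform class scores.
class_logits = np.zeros((2, 2))
class_labels = np.array([0, 1])
box_target = np.array([[0.0, 0.0, 0.0, 0.0], [0.25, -0.1, 0.4, 0.0]])
box_pred = box_target.copy()                         # perfect localization

loss = detection_loss(class_logits, class_labels, box_pred, box_target)
print(loss)
```

With perfect box predictions only the label term contributes, so the printed value reflects the classification error alone.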
