
COMPUTER VISION

Introduction to Deep Learning and CNNs

Andrew French
This week

• What is deep learning?


• Neural networks
• Convolutional Neural Networks (CNNs)
• Convolution layers
• Pooling layers
• Activation functions
• Final stage: Softmax etc.

• What’s going on in a network?


• An example use case
NEURAL NETWORKS
Why we need CNNs; what are the main components of a CNN
Machine Learning
Remember…
• A classifier operates on the feature vector:

[Figure: a feature vector (1, 2, -1, -4, 0, 5, -6, -4, 2, 1, -3, 4) is fed into a neural network, which outputs "Not a root tip". The learning algorithm/classifier goes here.]

• Ideally we want to learn which features to use, too → Deep learning
What is deep learning?
• Deep learning is a popular AI technique
• Essentially it is a kind of neural network (as we will see)

• Deep refers to the ability to have many layers in the network


• This is not possible with traditional neural networks

• So what led to the development of deep learning?


• Several concurrent factors including:
• GPU development
• Algorithm improvements (e.g. the ReLU function…)
• Availability of large training image sets… (see last week)
• …
When we are talking about deep learning for computer vision, we normally mean
CNNs….(although we will introduce Transformers in a few lectures time)
Classical Neural Nets
Neurons

Humans have ~100-1,000 trillion connections in their brains


Neurons
Modelling Neurons

Modern artificial networks tend to use 100k – 10b connections


Convolutional Neural Nets (CNNs)
• Make the assumption that the input is an image
• Neural networks that use convolution in place of general matrix multiplication in at least one of their layers
• An end-to-end learned solution to many vision tasks
• Local analysis matches the natural structure of most images
• Learn hierarchical models of image content
MLP versus CNN
Convolutional Neural Network – Overview

• Multilayer Perceptron Network
• Wide application scenario – not just images
• Neurons are fully connected – this doesn’t scale well to large inputs (e.g. images)

• Convolutional Neural Network
• Neurons are arranged in ‘3D’; each neuron is only connected to a small region of the previous layer
• Typical CNN structure: input – conv – activation – pool – fully connected – output

[Figure: Multilayer Perceptron (MLP) Network vs Convolutional Neural Network]
Locally Connected Layers
• Classical neural nets’ hidden layers are fully connected

• A 200x200 image with 40K hidden units (1 per pixel) means 1.6B weights to learn

• This is a waste of resources
• Spatial correlation is local, so most of the weights would be 0
• It would require an impractically large training set to learn this many weights
Locally Connected Layers

• CNNs’ early layers are locally connected

• e.g. a 200x200 image with 40K hidden units, each fed by a 10 x 10 filter, means 4M weights to learn – far fewer than 1.6B
Convolution Layers
• In image processing/vision we usually want to apply the same convolution mask at each location

• Each neuron has the same weights

• e.g. a 200x200 image with 40K hidden units and a 10 x 10 mask means only 100 weights to learn
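The three weight counts above follow from a little arithmetic; a quick sketch to check them:

```python
# Weight counts for a 200x200 image with 40,000 hidden units (one per pixel)
image_pixels = 200 * 200          # 40,000 input pixels
hidden_units = 40_000

fully_connected = image_pixels * hidden_units   # every unit sees every pixel
locally_connected = hidden_units * 10 * 10      # every unit sees its own 10x10 patch
shared_mask = 10 * 10                           # one 10x10 mask shared by all units

print(fully_connected, locally_connected, shared_mask)
# 1600000000 4000000 100
```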
Convolution
• We wish to convolve the input image with a set of learnable, small-size filters
• Input size (W); filter size (F); zero padding (P); stride (S)
• F, the conv filter size, is like a receptive field
• Output volume size calculation: (W - F + 2P)/S + 1

Input: 100 x 100 x 3
Learnable filters: 10 filters of 5 x 5 x 3 → (5×5×3 + 1) × 10 = 760 parameters
Output: 100 x 100 x 10, since (100 - 5 + 2×2)/1 + 1 = 100

Does this look familiar?
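The output-size formula is easy to encode directly; a minimal sketch, using the slide’s example numbers:

```python
def conv_output_size(W, F, P, S):
    """Output width of a convolution layer: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# The slide's example: 100x100x3 input, 5x5x3 filters, padding P=2, stride S=1
print(conv_output_size(100, 5, 2, 1))   # 100
# Parameters for 10 such filters: 5*5*3 weights plus 1 bias each
print((5 * 5 * 3 + 1) * 10)             # 760
```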
Convolution Layers
• We can afford to learn multiple filters
• e.g. 100 10x10 masks is only 10K parameters

• Convolutional layers are filter banks performing convolutions with the learned kernels (masks)

• Filters could be applied at all pixels, or with a small ‘stride’ to spread them out
OTHER LAYER TYPES
Pooling and activation
Pooling Layers
• Suppose one of our convolutions is an eye detector – how can we make the net robust to the exact location of the eye?
• By pooling (e.g. taking the max of) filter responses at different locations
• Don’t pool different features
Pooling Layers

• A number of pooling methods exist, including subsampling
• The effect is to reduce the resolution of the filter outputs
• Subsequent convolutional layers therefore access larger areas of the image
http://cs231n.github.io/convolutional-networks/#pool
Effect of Pooling
• Reduces the spatial size of the representation and the number of parameters
• Effectively down-samples the input to increase the receptive field size
• A max operation with a stride of 2 is a popular choice
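A 2x2 max pool with stride 2 can be sketched in a few lines; this toy example assumes an even-sized 2D input:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array (H and W assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
print(max_pool_2x2(x))
# [[6 5]
#  [8 9]]
```

Each output value is the maximum of one non-overlapping 2x2 block, so the spatial size halves in each dimension.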

Activation functions
Convolutional Neural Network – Activation

• Add nonlinearity to the network

Sigmoid
✓ Transfers the range to [0, 1]
✗ Saturates and kills gradients: small gradient in the regions near 0 and 1

tanh
✓ Zero-centred, range [-1, 1]
✗ Saturates and kills gradients

ReLU
✓ Solves vanishing/exploding gradients
✓ Simple to calculate
✗ Some neurons can be ‘dead’ with negative input

Leaky ReLU
✓ Overcomes the ‘dying neuron’ problem
✗ Performance not consistent
Convolutional Neural Network – Fully Connected Layer & Softmax
• Global feature learning: fully connected to all activations in the previous layer, the same as in an MLP
• Softmax converts the predictions to values in the range [0, 1], one per class

[Figure: Multilayer Perceptron (MLP) Network]
Softmax?
• With linear regression, we try to predict a real value y for an input value x
• Perhaps we are predicting someone’s age from a picture
• Linear regression is not good if we are predicting whether input data should be assigned class A or class B
• Perhaps we are predicting Happy or Not-happy from a picture
• To do this there are more suitable kinds of regression
• e.g. we can use logistic regression
• This can be used to “squash” a number into one of two classes
• Example: the sigmoid function →
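The logistic “squash” can be shown in a few lines; the score here is a hypothetical linear-model output, not from any real classifier:

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical linear-model score for one picture; threshold at 0.5
score = 1.7
p_happy = sigmoid(score)
print("Happy" if p_happy > 0.5 else "Not happy")   # Happy
```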
Softmax?
• So why do we need a softmax function?

• We need a different function if we have multiple (i.e. >2, non-binary) classes

• It works a bit like logistic regression, but for multiple classes
• A softmax function takes a set of numbers as input, and outputs a probability distribution spread over a set of k classes
• We want to know which class k is the most likely given a data point
• e.g. predicting whether a person is happy, sad, grumpy, sleepy, etc. from a picture
• Softmax allows us to do this
• Gives us a probability in the range [0, 1] for each class for a given input
• Probabilities sum to 1, e.g.:

Class Probability
Happy 0.05
Sad 0.32
Grumpy 0.44
Sleepy 0.19
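Softmax itself is short to write down; a minimal sketch, with hypothetical network scores chosen for illustration:

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

classes = ["Happy", "Sad", "Grumpy", "Sleepy"]
scores = np.array([0.5, 2.1, 2.4, 1.3])   # hypothetical network outputs
p = softmax(scores)
print(list(zip(classes, np.round(p, 2))))
print(round(p.sum(), 6))   # 1.0
```

Subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow for large inputs.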
CNNS
What are they doing, and what can they do?
CNN Architectures (Classification)

• Classification systems follow a common pattern

• Normalization might be grey-level or contrast normalization

• The number, size and stride of filters varies

• The classifier could be a fully connected net, SVM, softmax, etc.

• The non-linearity is optional; it decides whether given features progress any further
Training a CNN

• Back-propagation algorithm (as in classic neural nets)
• Stochastic gradient descent – the aim is to minimise the error between the net output and a training set
• Time consuming: 5 convolutional layers and 3 layers in the top neural network → 50,000,000 parameters, days(?) to train (GPUs are getting better all the time…)
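The gradient-descent idea can be sketched on a single weight vector; this is illustrative only, since a real CNN back-propagates through every layer:

```python
import numpy as np

# Repeatedly nudge the weights to reduce the squared error on one
# training example (a single-layer stand-in for full back-propagation).
w = np.zeros(3)
x, y = np.array([1.0, 2.0, -1.0]), 0.5   # one (input, target) pair
lr = 0.05                                 # learning rate

for _ in range(200):
    pred = w @ x
    grad = 2 * (pred - y) * x             # gradient of (pred - y)^2 w.r.t. w
    w -= lr * grad                        # step downhill

print(round(float(w @ x), 4))   # 0.5
```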
How this works in practice
1. Capture and annotate dataset
2. Design network architecture
3. Train network
4. Test network
5. Deploy
What is going on here?
• Many CNNs can be built from the basic components, and performance is extremely impressive. What makes them effective?
• In practice:
• Increased depth
• Smaller filters at the lower levels
• More convolutional than fully connected layers

• But why? What are they learning?

• Computer vision is science, as well as engineering

We need tools that allow us to understand, as well as just use, CNNs
Visualising Learned Filters
• Zeiler and Fergus (2014) grafted a
de-convolution net onto the side of
a CNN
• Allows filter responses to be
mapped back down the net
• To examine an activation, set rest of
layer to 0
• Deconvnet reconstructs the activity
in the layer beneath that gave rise
to the chosen activation

“Visualizing and Understanding Convolutional Networks”, Zeiler and Fergus. ECCV 2014, Part I, LNCS 8689, pp. 818–833, 2014.
Visualising Learned Filters

• The top 9 activations in a randomly selected set of 16 feature maps, projected down to the previous level
• The image patches that generated those activations
Visualising Learned Features
Visualising Learned Features
There is a feature hierarchy
Learning Hierarchical Models
• Together, the layers of a classification
CNN learn a hierarchical feature set

Biologically inspired
Beyond Classification: Segmentation
Beyond Classification: Stereo

• The CNN approach has been applied to classic binocular stereo, wide-baseline stereo and general image matching
Beyond Classification: New Tasks

• Two networks trained on depth maps
• One learns a coarse, global mapping; the other learns local refinement
• Size-depth ambiguity remains an issue

[Figure: input image, coarse map, refined map, ground truth]
An Example Problem - RootNav

• In 2014 we released a tool called RootNav to measure roots
• It aims to quickly quantify 2D root images in a variety of growth conditions
• Accuracy is a primary focus
An Example Problem – ‘RootNav’

• RootNav is only semi-automatic
• Usually users must identify root tips themselves

Q: How can we completely automate this?
A: Deep learning! (?)
The problem: Image Features

• A feature detector such as the Harris corner detector may identify root tips
• But there are many false positives and negatives
• ⇒ Insufficient for true automation
Machine Learning

• Machine learning allows us to train software to recognise patterns in data

• Does this image contain a root tip?
• Find useful information (features)
• Train a neural network to interpret these features
Machine Learning

• Traditional machine learning:

[Figure: input image → orientation features → feature vector (1, 2, -1, -4, 0, 5, -6, -4, 2, 1, -3, 4)]

• Custom features let us:
• Choose useful features like orientation
• Reduce the number of variables
Machine Learning

• A classifier operates on the feature vector:

[Figure: the feature vector (1, 2, -1, -4, 0, 5, -6, -4, 2, 1, -3, 4) is fed into a neural network (the learning algorithm/classifier goes here), which outputs "Not a root tip"]
Convolutional Neural Networks

• Early layers perform convolutions
• Final layers perform classification
• All weights & convolution masks are learned

[Figure: Convolutional Neural Network]
Root Dataset

• Deep learning usually requires more data than traditional learning approaches
• Our RootNav datasets contain thousands of fully annotated root system images
• In this work we used ~2.7k images
Root Feature Detection

• Images of 32x32 pixels were used
• Positive examples were taken from each annotated root tip
• Negative examples were taken both at random and specifically on non-tip root
• ~44k images total (35k training, 9k validation)
Version 1: Localisation

• We can scan an image, classifying at regular intervals, to perform localisation
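The scanning idea can be sketched as a sliding window; the classifier here is a hypothetical stand-in (a brightness threshold), not the trained root-tip network:

```python
import numpy as np

def sliding_window_detect(image, classify, win=32, stride=8):
    """Scan `image` with a win x win window, keeping windows whose
    classifier score exceeds 0.5. `classify` stands in for the trained
    root-tip classifier (hypothetical here)."""
    hits = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            if classify(image[y:y + win, x:x + win]) > 0.5:
                hits.append((x, y))
    return hits

# Toy check: a dummy classifier that scores patches by mean brightness
img = np.zeros((64, 64))
img[0:32, 0:32] = 1.0
print(sliding_window_detect(img, lambda patch: patch.mean()))
# [(0, 0), (8, 0), (0, 8), (8, 8)]
```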
Results
Version 2 Network architecture

Our next network was based on the hourglass architecture [1], which we
have adapted to perform simultaneous classification of wheat awns

[1] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked Hourglass Networks for Human Pose Estimation." European
Conference on Computer Vision. Springer International Publishing, 2016.
Training

Training selects a feature at random and performs data augmentation on that region of the image.

The network was trained over 500 epochs. We applied a mean squared error loss function on the heatmaps, and cross-entropy on the classification branch.
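The two loss functions are simple to write down; a minimal sketch with made-up example values (not from the actual training run):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error, as applied to the heatmap branch."""
    return float(np.mean((pred - target) ** 2))

def cross_entropy(probs, true_class):
    """Cross-entropy for one example, as applied to the classification branch."""
    return float(-np.log(probs[true_class]))

heat_pred = np.array([[0.1, 0.8], [0.0, 0.2]])
heat_true = np.array([[0.0, 1.0], [0.0, 0.0]])
print(mse(heat_pred, heat_true))                      # ≈ 0.0225
print(cross_entropy(np.array([0.1, 0.7, 0.2]), 1))    # ≈ 0.357
```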
Improved Feature Localisation

Conclusion
• CNNs are dominating current vision research
• end-to-end learned solutions
• system components (e.g. learned feature detection in tracking)
• Classification nets follow a common structure, and new architectures are emerging
• They have been applied to most areas of vision, and have encouraged the study of new problems and the re-examination of others
• Performance is impressive and should be considered in any applied project
• Tools for the scientific study of CNNs are starting to appear
Next week’s lecture: A brief history of famous network architectures;
Some related techniques: e.g. transfer learning.
Further reading
• http://cs231n.github.io/neural-networks-1/
• Nice introduction to CNNs
