
COMPUTER VISION

Introduction to Deep Learning and CNNs

Andrew French
This week

• What is deep learning?


• Neural networks
• Convolutional Neural Networks (CNNs)
• Convolution layers
• Pooling layers
• Activation functions
• Final stage: Softmax etc.

• What’s going on in a network?


• An example use case
NEURAL NETWORKS
Why we need CNNs; what are the main components of a CNN
Machine Learning
Remember…
• A classifier operates on the feature vector:

[Figure: a feature vector (1, 2, -1, -4, 0, 5, -6, -4, 2, 1, -3, 4) is fed into a neural network, which outputs "Not a root tip". The learning algorithm/classifier goes here.]

• Ideally we want to learn which features to use, too → Deep learning
What is deep learning?
• Deep learning is a popular AI technique
• Essentially it is a kind of neural network (as we will see)

• Deep refers to the ability to have many layers in the network


• This is not possible with traditional neural networks

• So what led to the development of deep learning?


• Several concurrent factors including:
• GPU development
• Algorithm improvements (e.g. the ReLU function…)
• Availability of large training image sets… (see last week)
• …
When we are talking about deep learning for computer vision, we normally mean
CNNs….(although we will introduce Transformers in a few lectures time)
Classical Neural Nets
Neurons

Humans have ~100-1,000 trillion connections in their brains


Neurons
Modelling Neurons

Modern artificial networks tend to use 100k – 10b connections


Convolutional Neural Nets (CNNs)
• Make the assumption that the input is an image
• Neural networks that use convolution in place of general matrix multiplication in at least one of their layers
• An end-to-end learned solution to many vision tasks
• Local analysis matches the natural structure of most images
• Learn hierarchical models of image content
MLP versus CNN
Convolutional Neural Network – Overview

• Multilayer Perceptron Network
• Wide application scenario – not just images
• Neurons are fully connected – this doesn’t scale well to large inputs (e.g. images)

• Convolutional Neural Network
• Neurons are arranged in ‘3D’; each neuron is only connected to a small region of the previous layer
• Typical CNN structure: input – conv – activation – pool – fully connected – output

[Figure: Multilayer Perceptron (MLP) Network vs Convolutional Neural Network]
Locally Connected Layers
• Classical neural nets’ hidden layers are fully connected

• A 200x200 image with 40K hidden units (1 per pixel) means 1.6B weights to learn

• This is a waste of resources
• Spatial correlation is local, so most of the weights would be 0
• It would require an impractically large training set to learn this many weights
Locally Connected Layers

• CNNs’ early layers are locally connected

• e.g. a 200x200 image with 40K hidden units, each fed by a 10 x 10 filter, means 4M weights to learn – far fewer than 1.6B
Convolution Layers
• In image processing/vision we usually want to apply the same convolution mask at each location

• Each neuron has the same weights

• e.g. a 200x200 image with 40K hidden units and a 10 x 10 mask means only 100 weights to learn
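The three weight counts above follow from a little arithmetic; a quick sketch to check them:

```python
# Weight counts for a 200x200 image with 40,000 hidden units (one per pixel)
image_pixels = 200 * 200          # 40,000 input pixels
hidden_units = 40_000

fully_connected = image_pixels * hidden_units   # every unit sees every pixel
locally_connected = hidden_units * 10 * 10      # every unit sees its own 10x10 patch
shared_mask = 10 * 10                           # one 10x10 mask shared by all units

print(fully_connected, locally_connected, shared_mask)
# 1600000000 4000000 100
```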
Convolution
• We wish to convolve the input image with a set of learnable, small-size filters
• Input size (W); filter size (F); zero padding (P); stride (S)
• F, the conv filter size, is like a receptive field
• Output volume size calculation: (W - F + 2P)/S + 1

Input: 100 x 100 x 3
Learnable filters: 10 filters of 5 x 5 x 3 → (5×5×3 + 1) × 10 = 760 parameters
Output: 100 x 100 x 10, since (100 - 5 + 2×2)/1 + 1 = 100

Does this look familiar?
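The output-size formula is easy to encode directly; a minimal sketch, using the slide’s example numbers:

```python
def conv_output_size(W, F, P, S):
    """Output width of a convolution layer: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# The slide's example: 100x100x3 input, 5x5x3 filters, padding P=2, stride S=1
print(conv_output_size(100, 5, 2, 1))   # 100
# Parameters for 10 such filters: 5*5*3 weights plus 1 bias each
print((5 * 5 * 3 + 1) * 10)             # 760
```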
Convolution Layers
• We can afford to learn multiple filters
• e.g. 100 10x10 masks is only 10K parameters

• Convolutional layers are filter banks performing convolutions with the learned kernels (masks)

• Filters could be applied at all pixels, or with a small ‘stride’ to spread them out
OTHER LAYER TYPES
Pooling and activation
Pooling Layers
• Suppose one of our convolutions is an eye detector – how can we make the net robust to the exact location of the eye?
• By pooling (e.g. taking the max of) filter responses at different locations
• Don’t pool different features
Pooling Layers

• A number of pooling methods exist, including subsampling
• The effect is to reduce the resolution of the filter outputs
• Subsequent convolutional layers therefore access larger areas of the image
http://cs231n.github.io/convolutional-networks/#pool
Effect of Pooling
• Reduces the spatial size of the representation and the number of parameters
• Effectively down-samples the input to increase the receptive field size
• A max operation with a stride of 2 is a popular choice
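A 2x2 max pool with stride 2 can be sketched in a few lines; this toy example assumes an even-sized 2D input:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array (H and W assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
print(max_pool_2x2(x))
# [[6 5]
#  [8 9]]
```

Each output value is the maximum of one non-overlapping 2x2 block, so the spatial size halves in each dimension.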

Activation functions
Convolutional Neural Network – Activation

• Add nonlinearity to the network

Sigmoid
✓ Transfers the range to [0, 1]
✗ Saturates and kills gradients: small gradient in the regions near 0 and 1

tanh
✓ Zero-centred, range [-1, 1]
✗ Saturates and kills gradients

ReLU
✓ Solves vanishing/exploding gradients
✓ Simple to calculate
✗ Some neurons can be ‘dead’ with negative input

Leaky ReLU
✓ Overcomes the ‘dying neuron’ problem
✗ Performance not consistent
Convolutional Neural Network – Fully Connected Layer & Softmax
• Global feature learning: fully connected to all activations in the previous layer, the same as in an MLP
• Softmax converts the predictions to values in the range [0, 1], one per class

[Figure: Multilayer Perceptron (MLP) Network]
Softmax?
• With linear regression, we try to predict a real value y for an input value x
• Perhaps we are predicting someone’s age from a picture
• Linear regression is not good if we are predicting whether input data should be assigned class A or class B
• Perhaps we are predicting Happy or Not-happy from a picture
• To do this there are more suitable kinds of regression
• e.g. we can use logistic regression
• This can be used to “squash” a number into one of two classes
• Example: the sigmoid function →
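The logistic “squash” can be shown in a few lines; the score here is a hypothetical linear-model output, not from any real classifier:

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical linear-model score for one picture; threshold at 0.5
score = 1.7
p_happy = sigmoid(score)
print("Happy" if p_happy > 0.5 else "Not happy")   # Happy
```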
Softmax?
• So why do we need a softmax function?

• We need a different function if we have multiple (i.e. >2, non-binary) classes

• It works a bit like logistic regression, but for multiple classes
• A softmax function takes a set of numbers as input, and outputs a probability distribution spread over a set of k classes
• We want to know which class k is the most likely given a data point
• e.g. predicting whether a person is happy, sad, grumpy, sleepy, etc. from a picture
• Softmax allows us to do this
• Gives us a probability in the range [0, 1] for each class for a given input
• Probabilities sum to 1, e.g.:

Class Probability
Happy 0.05
Sad 0.32
Grumpy 0.44
Sleepy 0.19
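Softmax itself is short to write down; a minimal sketch, with hypothetical network scores chosen for illustration:

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

classes = ["Happy", "Sad", "Grumpy", "Sleepy"]
scores = np.array([0.5, 2.1, 2.4, 1.3])   # hypothetical network outputs
p = softmax(scores)
print(list(zip(classes, np.round(p, 2))))
print(round(p.sum(), 6))   # 1.0
```

Subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow for large inputs.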
CNNS
What are they doing, and what can they do?
CNN Architectures (Classification)

• Classification systems follow a common pattern

• Normalization might be grey-level or contrast normalization

• The number, size and stride of filters varies

• The classifier could be a fully connected net, SVM, softmax, etc.

• The non-linearity is optional; it decides whether given features progress any further
Training a CNN

• Back-propagation algorithm (as in classic neural nets)
• Stochastic gradient descent – the aim is to minimise the error between the net output and a training set
• Time consuming: 5 convolutional layers and 3 layers in the top neural network → 50,000,000 parameters, days(?) to train (GPUs are getting better all the time…)
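The gradient-descent idea can be sketched on a single weight vector; this is illustrative only, since a real CNN back-propagates through every layer:

```python
import numpy as np

# Repeatedly nudge the weights to reduce the squared error on one
# training example (a single-layer stand-in for full back-propagation).
w = np.zeros(3)
x, y = np.array([1.0, 2.0, -1.0]), 0.5   # one (input, target) pair
lr = 0.05                                 # learning rate

for _ in range(200):
    pred = w @ x
    grad = 2 * (pred - y) * x             # gradient of (pred - y)^2 w.r.t. w
    w -= lr * grad                        # step downhill

print(round(float(w @ x), 4))   # 0.5
```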
How this works in practice
1. Capture and annotate dataset
2. Design network architecture
3. Train network
4. Test network
5. Deploy
What is going on here?
• Many CNNs can be built from the basic components, and performance is extremely impressive. What makes them effective?
• In practice:
• Increased depth
• Smaller filters at the lower levels
• More convolutional than fully connected layers

• But why? What are they learning?

• Computer vision is science, as well as engineering

We need tools that allow us to understand, as well as just use, CNNs
Visualising Learned Filters
• Zeiler and Fergus (2014) grafted a
de-convolution net onto the side of
a CNN
• Allows filter responses to be
mapped back down the net
• To examine an activation, set rest of
layer to 0
• Deconvnet reconstructs the activity
in the layer beneath that gave rise
to the chosen activation

“Visualizing and Understanding Convolutional Networks”, Zeiler and Fergus. ECCV 2014, Part I, LNCS 8689, pp. 818–833, 2014.
Visualising Learned Filters

• The top 9 activations in a randomly selected set of 16 feature maps, projected down to the previous level
• The image patches that generated those activations
Visualising Learned Features
Visualising Learned Features
There is a feature hierarchy
Learning Hierarchical Models
• Together, the layers of a classification
CNN learn a hierarchical feature set

Biologically inspired
Beyond Classification: Segmentation
Beyond Classification: Stereo

• The CNN approach has been applied to classic binocular stereo, wide-baseline stereo and general image matching
Beyond Classification: New Tasks

• Two networks trained on depth maps
• One learns a coarse, global mapping; the other learns local refinement
• Size-depth ambiguity remains an issue

[Figure: input image, coarse map, refined map, ground truth]
An Example Problem - RootNav

• In 2014 we released a tool called RootNav to measure roots
• It aims to quickly quantify 2D root images in a variety of growth conditions
• Accuracy is a primary focus
An Example Problem – ‘RootNav’

• RootNav is only semi-automatic
• Usually users must identify root tips themselves

Q: How can we completely automate this?
A: Deep learning! (?)
The problem: Image Features

• A feature detector such as the Harris corner detector may identify root tips
• But there are many false positives and negatives
• ⇒ Insufficient for true automation
Machine Learning

• Machine learning allows us to train software to recognise patterns in data

• Does this image contain a root tip?
• Find useful information (features)
• Train a neural network to interpret these features
Machine Learning

• Traditional machine learning:

[Figure: input image → orientation features → feature vector (1, 2, -1, -4, 0, 5, -6, -4, 2, 1, -3, 4)]

• Custom features let us:
• Choose useful features like orientation
• Reduce the number of variables
Machine Learning

• A classifier operates on the feature vector:

[Figure: the feature vector (1, 2, -1, -4, 0, 5, -6, -4, 2, 1, -3, 4) is fed into a neural network (the learning algorithm/classifier goes here), which outputs "Not a root tip"]
Convolutional Neural Networks

• Early layers perform convolutions
• Final layers perform classification
• All weights & convolution masks are learned

[Figure: Convolutional Neural Network]
Root Dataset

• Deep learning usually requires more data than traditional learning approaches
• Our RootNav datasets contain thousands of fully annotated root system images
• In this work we used ~2.7k images
Root Feature Detection

• Images of 32x32 pixels were used
• Positive examples were taken from each annotated root tip
• Negative examples were taken both at random and specifically on non-tip root
• ~44k images total (35k training, 9k validation)
Version 1: Localisation

• We can scan an image, classifying at regular intervals, to perform localisation
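The scanning idea can be sketched as a sliding window; the classifier here is a hypothetical stand-in (a brightness threshold), not the trained root-tip network:

```python
import numpy as np

def sliding_window_detect(image, classify, win=32, stride=8):
    """Scan `image` with a win x win window, keeping windows whose
    classifier score exceeds 0.5. `classify` stands in for the trained
    root-tip classifier (hypothetical here)."""
    hits = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            if classify(image[y:y + win, x:x + win]) > 0.5:
                hits.append((x, y))
    return hits

# Toy check: a dummy classifier that scores patches by mean brightness
img = np.zeros((64, 64))
img[0:32, 0:32] = 1.0
print(sliding_window_detect(img, lambda patch: patch.mean()))
# [(0, 0), (8, 0), (0, 8), (8, 8)]
```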
Results
Version 2 Network architecture

Our next network was based on the hourglass architecture [1], which we
have adapted to perform simultaneous classification of wheat awns

[1] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked Hourglass Networks for Human Pose Estimation." European
Conference on Computer Vision. Springer International Publishing, 2016.
Training

Training selects a feature at random and performs data augmentation on that region of the image.

The network was trained over 500 epochs. We applied a mean squared error loss function on the heatmaps, and cross-entropy on the classification branch.
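The two loss functions are simple to write down; a minimal sketch with made-up example values (not from the actual training run):

```python
import numpy as np

def mse(pred, target):
    """Mean squared error, as applied to the heatmap branch."""
    return float(np.mean((pred - target) ** 2))

def cross_entropy(probs, true_class):
    """Cross-entropy for one example, as applied to the classification branch."""
    return float(-np.log(probs[true_class]))

heat_pred = np.array([[0.1, 0.8], [0.0, 0.2]])
heat_true = np.array([[0.0, 1.0], [0.0, 0.0]])
print(mse(heat_pred, heat_true))                      # ≈ 0.0225
print(cross_entropy(np.array([0.1, 0.7, 0.2]), 1))    # ≈ 0.357
```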
Improved Feature Localisation

Conclusion
• CNNs are dominating current vision research
• end-to-end learned solutions
• system components (e.g. learned feature detection in tracking)
• Classification nets follow a common structure, and new architectures are emerging
• They have been applied to most areas of vision, and have encouraged the study of new problems and the re-examination of others
• Performance is impressive and should be considered in any applied project
• Tools for the scientific study of CNNs are starting to appear
Next week’s lecture: A brief history of famous network architectures;
Some related techniques: e.g. transfer learning.
Further reading
• http://cs231n.github.io/neural-networks-1/
• Nice introduction to CNNs
