
Convolutional Neural Networks in Computer Vision
Jochen Lang

jlang@uottawa.ca

Faculté de génie | Faculty of Engineering


Neural Networks Basics

• Multi-layer perceptron
• Feed forward networks
• Activation functions
• Loss function
• Training by back propagation

Solving Linear Classification

• Recall: There is no known closed-form solution to the
  log loss and we need to resort to (non-linear) numerical
  optimization
  – For simplicity, let's look at Linear Least Squares with
    the sum of squared errors for a line with $N$ data points $(x_i, y_i)$:
    $E(\beta_0, \beta_1) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$

• Perceptron Learning Algorithm by Rosenblatt (1958)


– Loop over the training points and after each
misclassified training point update the line estimate
– The Perceptron algorithm forms one of the foundations of
  neural networks
Hinge Loss

• In general, we minimize a cost function or loss
  function (in machine learning)
  – LLSQ uses an L2 loss
    $E(\beta_0, \boldsymbol{\beta}) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \boldsymbol{\beta}^T \mathbf{x}_i \right)^2$
  – The Perceptron uses a hinge loss (only misclassified
    points enter the loss function; for these, prediction and
    label have opposite signs)
    $D(\beta_0, \boldsymbol{\beta}) = -\sum_{i \in \mathcal{M}} y_i \left( \beta_0 + \boldsymbol{\beta}^T \mathbf{x}_i \right)$

Perceptron Algorithm

• Given the definition of the hinge loss function
  $D(\beta_0, \boldsymbol{\beta}) = -\sum_{i \in \mathcal{M}} y_i \left( \beta_0 + \boldsymbol{\beta}^T \mathbf{x}_i \right)$,
  where $\mathcal{M}$ is the set of misclassified points
  – Calculate the gradient of the loss function
    $\frac{\partial D}{\partial \boldsymbol{\beta}} = -\sum_{i \in \mathcal{M}} y_i \mathbf{x}_i, \qquad \frac{\partial D}{\partial \beta_0} = -\sum_{i \in \mathcal{M}} y_i$
  – Loop over the training points, evaluate the gradient
    after each training point, and update the line estimate
    (a sketch in code follows below)
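A minimal sketch of one such pass in C++; the 2D data layout, labels in {-1, +1},
and the learning rate are illustrative assumptions, not part of the slides:

#include <array>
#include <cstddef>
#include <vector>

// One stochastic pass over the training set: after each misclassified
// point, step along the negative gradient of the hinge loss (-y_i * x_i).
struct Line { double b0, b1, b2; };   // decision function b0 + b1*x1 + b2*x2

void perceptron_epoch(const std::vector<std::array<double, 2>>& x,
                      const std::vector<int>& y,   // labels in {-1, +1}
                      Line& w, double rate) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        double score = w.b0 + w.b1 * x[i][0] + w.b2 * x[i][1];
        if (y[i] * score <= 0) {       // misclassified (or on the boundary)
            w.b0 += rate * y[i];
            w.b1 += rate * y[i] * x[i][0];
            w.b2 += rate * y[i] * x[i][1];
        }
    }
}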

Gradient Descent

• Find a solution by taking steps down the steepest slope


– 2D Example

(Figure: 2D example of gradient descent on a loss surface; a minimal sketch in
code follows below. Image source: downhill.readthedocs.io, "Downhill 0.4.0
Documentation", Johnson et al., Google.)
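A minimal gradient descent loop in C++, assuming a made-up 2D function
f(u, v) = (u - 1)^2 + 2(v + 2)^2 (not the surface from the figure) and an
arbitrary starting point and learning rate:

#include <cstdio>

int main() {
    double u = 5.0, v = 5.0;        // assumed starting point
    const double rate = 0.1;        // learning rate (step size)
    for (int step = 0; step < 100; ++step) {
        double du = 2.0 * (u - 1.0);   // partial derivative wrt u
        double dv = 4.0 * (v + 2.0);   // partial derivative wrt v
        u -= rate * du;                // step down the steepest slope
        v -= rate * dv;
    }
    std::printf("minimum near (%.3f, %.3f)\n", u, v);  // approaches (1, -2)
    return 0;
}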

Minimizing the Loss

• Only in simple cases (e.g., linear least squares) is a direct
  solution possible
• Perceptron algorithm is a form of stochastic gradient
descent
– Limitations:
• The order of points will influence the solution
• The solution is non-unique
• If the classes overlap, the algorithm enters a limit
cycle

Feedforward Neural Networks

• Here we focus on a feed forward neural network


– No loops in the network (unlike recurrent networks)
• The classic layout consists of one input layer, a single
  hidden layer, and one output layer
(Diagram: input layer → hidden layer → output layer.)

Basic Equations

• Derived features $Z_m$ are created from the input layer
  – $Z_m = \sigma(\alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{X}), \; m = 1, \dots, M$
    (with $M$ the number of nodes in the hidden layer)
  – using the activation function $\sigma$
• Output is calculated from the derived features
  – $T_k = \beta_{0k} + \boldsymbol{\beta}_k^T \mathbf{Z}, \; k = 1, \dots, K$
  – Here $K = 1$ (number of nodes in the output layer)
• Mapping to classes with logit or softmax: $f_k(\mathbf{X}) = g_k(\mathbf{T})$
(Diagram: input layer $X_1, X_2$ → hidden layer $Z_1, Z_2, Z_3$ → output layer $T$,
followed by logit or softmax.)

Comparison to Least Squares and
Perceptron
– We have an extra hidden layer
– We have the activation function
• For classification, we use the logit or softmax
function
– Aside: The constant terms (the bias) can be
integrated into the layers (see below)


Activation Functions
• The activation function introduces non-linearities into
  the neural network
• Many different choices and still an active area of research
  – Classic: sigmoid, hyperbolic tangent
  – Modern: ReLU (and variants)
  – Other: radial basis functions, softplus, hard tanh
• Because we want to solve the fitting or training of the
  network with gradient descent
  – Functions should be differentiable (at least nearly
    everywhere)
  – Derivatives should also be non-zero everywhere in the
    region of interest
• Caution: Goodfellow et al. [2016] state: "Many unpublished
  activation functions perform just as well as the popular
  ones" and give the example of cosine on MNIST

Sigmoid Function

• Before deep learning, the sigmoid function was used most often
  – Sigmoid: $\sigma(v) = \frac{1}{1 + e^{-v}}$
  – Same as in logistic regression
  – The derivative is well defined: $\sigma'(v) = \sigma(v)\,(1 - \sigma(v))$
  – Caution: The derivative becomes very small for large
    magnitudes of $v$ because the function saturates
    (see the sketch below)
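A small C++ sketch of the sigmoid and its derivative; evaluating sigmoid_deriv
at large |v| shows the saturation mentioned above:

#include <cmath>

// Sigmoid activation and its derivative.
double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

// d/dv sigmoid(v) = sigmoid(v) * (1 - sigmoid(v));
// note how it shrinks toward 0 for large |v| (saturation).
double sigmoid_deriv(double v) {
    double s = sigmoid(v);
    return s * (1.0 - s);
}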

Hyperbolic Tangent
• Similar function shape and in fact closely related to the
  sigmoid: $\tanh(v) = 2\sigma(2v) - 1$
  – The derivative is again well defined: $\tanh'(v) = 1 - \tanh^2(v)$
  – Advantage compared to the sigmoid:
    • The hyperbolic tangent is close to the identity function in
      the neighborhood of 0
    • It does not introduce non-linearity into the optimization if
      the input is close to 0
  – Similar to the sigmoid, the function saturates for large
    magnitudes and the derivatives become very small

ReLU
• Goal: Make the function close to the identity and prevent
  saturation: $\mathrm{ReLU}(v) = \max(0, v)$
  – The derivative is trivial and defined nearly everywhere:
    $\mathrm{ReLU}'(v) = 1$ for $v > 0$ and $0$ for $v < 0$
  – In practice, implement the derivative at 0 as the left or
    right derivative
  – Advantages compared to sigmoid-like functions:
    • Identity function if the node is "active" ($v > 0$)
    • The gradient remains useful for all $v > 0$
    • Sometimes an additional affine transform is used to
      move, e.g., input values into the positive range

Generalizations of ReLU

• Absolute value rectification: $g(v) = |v|$
  – Used in special cases to enforce symmetry
• Leaky ReLU
  – Avoids derivatives of 0 by introducing a small leak scalar $\alpha$:
    $g(v) = \max(0, v) + \alpha \min(0, v)$
  – A randomized version picks $\alpha$ at random during training
    and fixes it for testing/application
• Parametric ReLU or PReLU
  – Same as leaky ReLU but the leak scalar is learned for each
    node

Exponential Linear Units (ELU)
• A further generalization of ReLU [Clevert et al. 2015]:
  $g(v) = v$ for $v > 0$ and $g(v) = \alpha \left( e^{v} - 1 \right)$ for $v \le 0$
  – Note that the derivative is $1$ for $v > 0$ and
    $\alpha e^{v} = g(v) + \alpha$ for $v \le 0$
• Advantages compared to ReLU
  – The gradient is non-zero for negative values and only
    approaches 0 for large negative values
  – The linear range of the function smoothly transitions to the
    (negative) exponential
• Disadvantages
  – Higher computation cost (see the sketch of the ReLU variants below)
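A combined C++ sketch of the ReLU variants above; the leak and scale
parameters (alpha) are illustrative defaults, not values from the slides:

#include <algorithm>
#include <cmath>

double relu(double v)       { return std::max(0.0, v); }
double relu_deriv(double v) { return v > 0.0 ? 1.0 : 0.0; }   // pick the left derivative at 0

double leaky_relu(double v, double alpha = 0.01)       { return v > 0.0 ? v : alpha * v; }
double leaky_relu_deriv(double v, double alpha = 0.01) { return v > 0.0 ? 1.0 : alpha; }

double elu(double v, double alpha = 1.0)       { return v > 0.0 ? v : alpha * (std::exp(v) - 1.0); }
double elu_deriv(double v, double alpha = 1.0) { return v > 0.0 ? 1.0 : alpha * std::exp(v); }  // = elu(v) + alpha for v <= 0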
Training Neural Networks

• Single hidden layer: We have to find the weights
  – Call them collectively $\theta = \{\alpha_{0m}, \boldsymbol{\alpha}_m;\; \beta_0, \boldsymbol{\beta}\}$
  – In our example, we have two inputs $X_1, X_2$, three
    hidden nodes $Z_1, Z_2, Z_3$, and the output node
    weights

Weights

• Notation: each weight carries a subscript identifying the
  connection from node to node, e.g., $\alpha_{m\ell}$ for the weight from
  input node $\ell$ to hidden node $m$
(Diagram: network with labelled weights on the connections between
input, hidden, and output layers.)
Matrix Notation
• In practice NN are conveniently expressed as matrix-vector
multiplies
• Our network example
– $T = \beta_0 + \boldsymbol{\beta}^T \mathbf{Z}$ is a matrix equation (here $\boldsymbol{\beta}$ is a vector because
  we have a single output):
  $T = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix} \begin{bmatrix} 1 \\ z_1 \\ z_2 \\ z_3 \end{bmatrix}$
– $\mathbf{Z} = \mathfrak{a}(\boldsymbol{\alpha}_0 + \boldsymbol{\alpha} \mathbf{X})$ with (here) $M = 3$ leads to a matrix
  equation, where $\mathfrak{a}$ is an element-wise activation function and the
  bias is absorbed via the leading 1:
  $\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \mathfrak{a}\!\left( \begin{bmatrix} \alpha_{10} & \alpha_{11} & \alpha_{12} \\ \alpha_{20} & \alpha_{21} & \alpha_{22} \\ \alpha_{30} & \alpha_{31} & \alpha_{32} \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \right)$
  (a code sketch of this forward pass follows below)
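A C++ sketch of this forward pass for the example network (2 inputs, M = 3
hidden nodes, K = 1 output); the choice of sigmoid as the activation and the
weight layout are assumptions for illustration:

#include <array>
#include <cmath>

double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

// The bias is carried as a leading 1 in each matrix-vector product.
double forward(const std::array<double, 2>& x,
               const std::array<std::array<double, 3>, 3>& alpha,  // rows: hidden nodes, cols: [bias, x1, x2]
               const std::array<double, 4>& beta) {                // [bias, z1, z2, z3]
    std::array<double, 3> z;
    for (int m = 0; m < 3; ++m)
        z[m] = sigmoid(alpha[m][0] + alpha[m][1] * x[0] + alpha[m][2] * x[1]);
    return beta[0] + beta[1] * z[0] + beta[2] * z[1] + beta[3] * z[2];  // T = beta0 + beta^T z
}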

Neural Networks Loss Function

– We need a loss function to describe the difference
  between the desired and the calculated result
  • Squared error (mostly in regression)
    $L(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left( y_{ik} - f_k(\mathbf{x}_i) \right)^2$
  • Cross entropy (deviance) (often in classification)
    $L(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(\mathbf{x}_i)$
    (both losses are sketched in code below)
– Note that $f_k(\mathbf{x})$ is non-linear because of the
  activation function.
– The generic approach uses gradient descent
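Both losses for a single sample with K outputs, as a C++ sketch; the container
layout is an assumption, and y would be one-hot for classification:

#include <cmath>
#include <cstddef>
#include <vector>

double squared_error(const std::vector<double>& y, const std::vector<double>& f) {
    double loss = 0.0;
    for (std::size_t k = 0; k < y.size(); ++k) loss += (y[k] - f[k]) * (y[k] - f[k]);
    return loss;
}

double cross_entropy(const std::vector<double>& y, const std::vector<double>& f) {
    double loss = 0.0;
    for (std::size_t k = 0; k < y.size(); ++k) loss -= y[k] * std::log(f[k]);  // assumes f[k] > 0 (e.g. softmax output)
    return loss;
}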

Batch Gradient Descent
• Goal: Minimize the loss function (softmax with cross entropy), $L(\theta)$
  – Find the partial derivatives $\frac{\partial L}{\partial \beta_{km}}$ and $\frac{\partial L}{\partial \alpha_{m\ell}}$
  – Given the derivatives, update the weights with learning rate $\gamma_r$:
    $\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \frac{\partial L}{\partial \beta_{km}^{(r)}}$  for the output layer
    $\alpha_{m\ell}^{(r+1)} = \alpha_{m\ell}^{(r)} - \gamma_r \frac{\partial L}{\partial \alpha_{m\ell}^{(r)}}$  for the hidden layer

Gradient Descent Variations
• Batch gradient descent
  – Calculate the gradient based on all training samples
  – Uses all training samples, i.e., one epoch, for one update
  – Not practical if the training data is sizable
• Stochastic gradient descent
  – Calculate the gradient based on a single, randomly
    chosen training example
  – Can lead to very noisy gradients (large changes
    between evaluations because of the selected sample)
  – The approach in the Perceptron by Rosenblatt
• Mini-batch gradient descent
  – In-between solution: randomly select a number of samples
    (the size of the mini-batch), as in the sketch below
  – Practical for large datasets but not as noisy as just a
    single sample
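A C++ sketch of selecting one random mini-batch of indices; the batch size
and random engine are assumptions:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Shuffle the sample indices and keep the first batch_size of them.
std::vector<int> sample_minibatch(int num_samples, int batch_size, std::mt19937& rng) {
    std::vector<int> idx(num_samples);
    std::iota(idx.begin(), idx.end(), 0);        // 0, 1, ..., num_samples - 1
    std::shuffle(idx.begin(), idx.end(), rng);   // random order
    idx.resize(batch_size);                      // the mini-batch
    return idx;
}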

From Gradient Descent to Back
Propagation
• Major insight:
– Each layer of the network has a simple derivative
– Use multivariable chain rule to calculate derivatives
for the network
• Approach
– Separate the input at a node and its output after application
  of the non-linearity, i.e., consider the node input
  $t_m = \alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{x}$ and its output $z_m = \sigma(t_m)$ separately

Hidden Layer

• Approach
  – Separate the input at a node and its output after application
    of the non-linearity, i.e., consider $t_m = \alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{x}$
    and $z_m = \sigma(t_m)$
  – And hence overall $z_m = \sigma(\alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{x})$, so the chain
    rule applies through the node input
Partial Differentials

• Softmax with cross-entropy loss
  $\frac{\partial L}{\partial T_k} = \sum_{j} y_j \left( f_k - \delta_{jk} \right)$, where $\delta_{jk}$ is the Kronecker delta,
  which simplifies for one-hot output to $\frac{\partial L}{\partial T_k} = f_k - y_k$
• Other differentials are:
  $\frac{\partial T_k}{\partial \beta_{km}} = z_m, \quad \frac{\partial z_m}{\partial t_m} = \sigma'(t_m), \quad \frac{\partial t_m}{\partial \alpha_{m\ell}} = x_\ell$
• We need to find the values at the nodes given an input data
  sample
  – To calculate the gradient we also need the target
    and the current weights

Back Propagation

• Two-pass algorithm
  – Given the current weights and a training sample,
    calculate the network output
  – Calculate the loss and the errors
    • This is the forward pass
  – Using back-propagation, combine
    • the above error,
    • the equations for the partial derivatives with the
      values at the nodes,
    • and the learning rate
    to calculate updates to the weights (a sketch in code
    follows below)
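A two-pass sketch in C++ for the example network (2 inputs, 3 sigmoid hidden
nodes, 1 linear output). For simplicity it uses a squared-error loss rather
than softmax with cross entropy; the layout, loss choice, and learning rate
are assumptions:

#include <array>
#include <cmath>

double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

struct Net {
    std::array<std::array<double, 3>, 3> alpha;  // hidden weights, rows = nodes, cols = [bias, x1, x2]
    std::array<double, 4> beta;                  // output weights [bias, z1, z2, z3]
};

void backprop_step(Net& net, const std::array<double, 2>& x, double y, double rate) {
    // Forward pass: node inputs t_m, node outputs z_m, network output T.
    std::array<double, 3> t, z;
    for (int m = 0; m < 3; ++m) {
        t[m] = net.alpha[m][0] + net.alpha[m][1] * x[0] + net.alpha[m][2] * x[1];
        z[m] = sigmoid(t[m]);
    }
    double T = net.beta[0] + net.beta[1] * z[0] + net.beta[2] * z[1] + net.beta[3] * z[2];

    // Backward pass: error at the output, then chain rule back to each weight.
    double dL_dT = 2.0 * (T - y);                    // squared error: L = (y - T)^2
    net.beta[0] -= rate * dL_dT;                     // dT/dbeta0 = 1
    for (int m = 0; m < 3; ++m) {
        double dL_dz = dL_dT * net.beta[m + 1];      // captured before the update below
        net.beta[m + 1] -= rate * dL_dT * z[m];      // dT/dbeta_m = z_m
        double dL_dt = dL_dz * z[m] * (1.0 - z[m]);  // sigmoid'(t_m) = z_m (1 - z_m)
        net.alpha[m][0] -= rate * dL_dt;             // dt_m/dalpha_m0 = 1
        net.alpha[m][1] -= rate * dL_dt * x[0];      // dt_m/dalpha_m1 = x1
        net.alpha[m][2] -= rate * dL_dt * x[1];      // dt_m/dalpha_m2 = x2
    }
}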

Forward Pass: Calculate the Error

– Example: a positive training sample for one class,
  requiring the corresponding target output
Backward Pass: Calculate the
Updates
– Calculate the partial derivatives for all weights

Apply the Updates to the Weights

• Apply the partial derivatives to all weights $\beta_{km}$ and $\alpha_{m\ell}$
  using the learning rate $\gamma$:
  $\beta_{km} \leftarrow \beta_{km} - \gamma \,\frac{\partial L}{\partial \beta_{km}}$  for the output layer
  $\alpha_{m\ell} \leftarrow \alpha_{m\ell} - \gamma \,\frac{\partial L}{\partial \alpha_{m\ell}}$  for the hidden layer
• Possibly use some momentum
  – Calculate the update based on the current and previous
    derivatives to stabilize the updates (see the sketch below)
• The result is a new model with new weights $\beta$ and $\alpha$
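A momentum update in C++ for a single weight; the learning rate and momentum
coefficient are assumed typical values, and a real network keeps one velocity
entry per weight:

// Blend the previous update direction with the current gradient to
// stabilize the steps.
struct MomentumUpdater {
    double rate = 0.01;       // learning rate (assumed)
    double momentum = 0.9;    // momentum coefficient (assumed)
    double velocity = 0.0;    // running, decaying average of past updates

    void update(double& weight, double gradient) {
        velocity = momentum * velocity - rate * gradient;
        weight += velocity;
    }
};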

Practical Back-Propagation

• Partial derivatives can be calculated in different ways
  – Manually or analytically (as we have just done)
  – Symbolically (as when using symbolic math packages
    such as Maple, Mathematica, etc.)
  – Numerically (by perturbing the different inputs of a
    node and measuring the change in output)
  – AutoDiff using dual numbers or complex numbers

Review: Symbolic Differentiation

• Equations can be expressed as a parse tree (and of
  course such trees are built by any compiler)
• Example: $g(x, y) = 3x^2 + 2xy$
(Parse tree: a sum of the two products $3 \cdot x \cdot x$ and $2 \cdot y \cdot x$.)
Symbolic Differentiation Example
• Assume we want to find $\partial g(x, y)/\partial x$
  – We know (code) simple rules, e.g., the sum rule, the product
    rule, the partial derivative of a variable itself, etc. (in
    total there are not many)
  – We find the derivative for each node and expand the
    graph (in this example) with the product and sum rules
(Parse tree fragment: differentiating the nodes of $3 \cdot x \cdot x + 2 \cdot y \cdot x$.)
Example Result

• The example leads to
  $\frac{\partial g}{\partial x} = \big( 0 \cdot x \cdot x + 3 \cdot (1 \cdot x + x \cdot 1) \big) + \big( 0 \cdot y \cdot x + 2 \cdot (0 \cdot x + y \cdot 1) \big)$
  (the expanded, not yet simplified, derivative graph)
Simplify
• The expression above and its graph are correct but can obviously
  be simplified to $\frac{\partial g}{\partial x} = 6x + 2y$.
  Automatic simplification is a bit harder.
• Symbolic differentiation can lead to lengthy expressions,
  even if simplification is successful.
• Not all network nodes are nicely differentiable (e.g., an
  activation layer with ReLU)

Numerical Differentiation

• We can use finite differences
  – E.g., forward differences:
    $\frac{\partial f}{\partial x} \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}$
• Accuracy in neural networks becomes an issue as we
  are often dealing with very small numbers
• Calculating the derivative with respect to many variables is
  expensive, i.e., if our function has high-dimensional
  input
• Higher order finite difference approximations become
even more expensive

Forward Differences
• Example again: $g(x, y) = 3x^2 + 2xy$
  double g(double x, double y) { return 3 * x * x + 2 * y * x; }
• Can define a helper function
  double delta_g(double x, double y, double eps_x, double eps_y) {
      return (g(x + eps_x, y + eps_y) - g(x, y))
             / (eps_x + eps_y);
  }
• Now we can calculate the two forward differences with
  these two functions
  • in x: delta_g(x, y, eps_x, 0)
  • in y: delta_g(x, y, 0, eps_y)
• Let's look at $\partial g/\partial x$ approximated with a small eps_x
  (see the usage sketch below)

AutoDiff

• AutoDiff calculates an exact value for the gradient of a


function
– AutoDiff is not new, see Wengert [1964], Kedem
[1980], Rall [1981].
– It has been implemented and used in many tools*
• AutoDiff is very efficient. It is only more expensive than
a function evaluation by a constant factor.

*E.g., Carpenter, Bob, et al. "The Stan Math Library: Reverse-mode
 automatic differentiation in C++." arXiv preprint arXiv:1509.07164 (2015).

Dual Numbers

• Dual numbers are similar to complex numbers but
  replace the imaginary part with an infinitesimal part:
  $a + b\varepsilon$, where $a$ and $b$ are real numbers and $\varepsilon$ is an
  infinitesimal number such that $\varepsilon^2 = 0$ but $\varepsilon \neq 0$
• Operations are analogous to complex numbers, e.g.:
  • Scale: $\lambda(a + b\varepsilon) = \lambda a + \lambda b\varepsilon$
  • Add: $(a + b\varepsilon) + (c + d\varepsilon) = (a + c) + (b + d)\varepsilon$
  • Multiply: $(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon$
  • Trigonometry, e.g., cosine: $\cos(a + b\varepsilon) = \cos(a) - b\sin(a)\,\varepsilon$
  (a small dual-number sketch in code follows below)
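A minimal dual-number type in C++ and its use for forward-mode differentiation
of the running example g(x, y) = 3x^2 + 2xy; the evaluation point is a made-up
value:

#include <cmath>
#include <cstdio>

// Value plus infinitesimal part; eps^2 = 0 is built into the operations.
struct Dual { double a, b; };

Dual scale(double s, Dual u) { return { s * u.a, s * u.b }; }
Dual add(Dual u, Dual v)     { return { u.a + v.a, u.b + v.b }; }
Dual mul(Dual u, Dual v)     { return { u.a * v.a, u.a * v.b + u.b * v.a }; }  // eps^2 term dropped
Dual cosine(Dual u)          { return { std::cos(u.a), -std::sin(u.a) * u.b }; }

// Forward-mode AutoDiff: seed the variable of interest with b = 1.
Dual g(Dual x, Dual y) { return add(scale(3.0, mul(x, x)), scale(2.0, mul(y, x))); }

int main() {
    Dual x = { 2.0, 1.0 };   // hypothetical point x = 2, seeded for d/dx
    Dual y = { 3.0, 0.0 };   // hypothetical point y = 3
    Dual r = g(x, y);
    std::printf("g = %g, dg/dx = %g\n", r.a, r.b);   // g = 24, dg/dx = 6x + 2y = 18
    return 0;
}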

Smooth Functions of Dual Numbers

• Consider a Taylor series of a smooth function $f$ around $a$:
  $f(a + h) = f(a) + f'(a)\,h + \frac{f''(a)}{2!}h^2 + \dots$
• Now with a dual number, $h = b\varepsilon$:
  $f(a + b\varepsilon) = f(a) + f'(a)\,b\varepsilon$
• Notice that this is exact, as $\varepsilon^2 = 0$ makes all higher-order
  terms vanish

Forward-mode AutoDiff

• Example again: $g(x, y) = 3x^2 + 2xy$ evaluated at a given point
  – Seed the variable of interest with an infinitesimal part of 1,
    e.g., $x + 1\varepsilon$, and propagate dual numbers through the parse tree
(Parse tree of $3 \cdot x \cdot x + 2 \cdot y \cdot x$ evaluated with dual numbers.)

Forward-mode vs. Reverse-mode
AutoDiff
• Forward mode AutoDiff requires one pass over the
graph for each parameter (not efficient when there are
many input parameters as in neural networks)
• Reverse-mode AutoDiff is an algorithm requiring one
forward and one backward pass for each output
– First pass through the graph evaluates the function
values at the current parameter values
– Second pass is applying the chain rule to calculate
the derivative
– Backpropagation is a special case of reverse-mode
  AutoDiff (a minimal tape sketch follows below)
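A minimal reverse-mode AutoDiff tape in C++ for the running example; this is
an illustrative sketch (not a library API), and the evaluation point is a
made-up value:

#include <cstdio>
#include <vector>

// Each tape entry records its (at most two) parents and the local partial
// derivatives of the node with respect to them.
struct Tape {
    struct Node { int p1, p2; double d1, d2; };
    std::vector<Node> nodes;
    std::vector<double> value;

    int variable(double v) { nodes.push_back({ -1, -1, 0.0, 0.0 }); value.push_back(v); return (int)value.size() - 1; }
    int add(int a, int b)  { nodes.push_back({ a, b, 1.0, 1.0 }); value.push_back(value[a] + value[b]); return (int)value.size() - 1; }
    int mul(int a, int b)  { nodes.push_back({ a, b, value[b], value[a] }); value.push_back(value[a] * value[b]); return (int)value.size() - 1; }

    // Backward pass: starting from the output, accumulate adjoints into the parents.
    std::vector<double> gradient(int output) const {
        std::vector<double> adj(value.size(), 0.0);
        adj[output] = 1.0;
        for (int i = output; i >= 0; --i) {
            if (nodes[i].p1 >= 0) adj[nodes[i].p1] += adj[i] * nodes[i].d1;
            if (nodes[i].p2 >= 0) adj[nodes[i].p2] += adj[i] * nodes[i].d2;
        }
        return adj;
    }
};

int main() {
    Tape t;
    int x = t.variable(2.0);   // hypothetical point x = 2
    int y = t.variable(3.0);   // hypothetical point y = 3
    int c3 = t.variable(3.0), c2 = t.variable(2.0);
    // Forward pass builds the graph of g = 3*x*x + 2*y*x and records all values.
    int g = t.add(t.mul(c3, t.mul(x, x)), t.mul(c2, t.mul(y, x)));
    std::vector<double> adj = t.gradient(g);   // one backward pass yields all input derivatives
    std::printf("g = %g, dg/dx = %g, dg/dy = %g\n", t.value[g], adj[x], adj[y]);  // 24, 18, 4
    return 0;
}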

Reverse-mode AutoDiff

• Example again: $g(x, y) = 3x^2 + 2xy$ evaluated at a given point
• Result of the forward pass in blue
• Setup for reverse mode
(Parse tree of $3 \cdot x \cdot x + 2 \cdot y \cdot x$ annotated with the forward-pass values.)
Differentials – Part I

(Parse tree annotated with the partial derivatives propagated backward, part I.)
Differentials – Part II

(Parse tree annotated with the partial derivatives propagated backward, part II;
the contributions from both paths through x are added.)
Observation for AutoDiff

• Based on the abstract syntax tree (AST) and either


dual numbers or the chain rule
• Does not apply just to neural network training. Used in
fluid dynamics, computer graphics, etc.
• Forward-mode AutoDiff is efficient if there are many
outputs but only a few inputs
• Reverse-mode AutoDiff is efficient if there are many
inputs but only a few outputs
• Each node only needs to be able to calculate the
  partial derivative based on its inputs
• Linear models need only the basic nodes implementing
the product rule and the sum rule

Summary

• Multi-layer perceptron
• feed forward networks
• activation functions
• loss function
• training by back propagation

