
Convolutional Neural Networks in Computer Vision
Jochen Lang

jlang@uottawa.ca

Faculté de génie | Faculty of Engineering


Neural Networks Basics

• Multi-layer perceptron
• Feed forward networks
• Activation functions
• Loss function
• Training by back propagation

Solving Linear Classification

• Recall: There is no known closed-form solution to the
  log loss and we need to resort to (non-linear) numerical
  optimization
  – For simplicity, let's look at Linear Least Squares with
    the sum of squared errors for a line with $N$ data points $(x_i, y_i)$:
    $E(\beta_0, \beta_1) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$

• Perceptron Learning Algorithm by Rosenblatt (1958)


– Loop over the training points and after each
misclassified training point update the line estimate
– The Perceptron algorithm forms one of the foundations of
  neural networks
Hinge Loss

• In general, we minimize a cost function or loss
  function (in machine learning)
  – LLSQ uses an L2 loss
    $E(\beta_0, \boldsymbol{\beta}) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \boldsymbol{\beta}^T \mathbf{x}_i \right)^2$
  – The Perceptron uses a hinge loss (only misclassified
    points enter the loss function; for these, prediction and
    label have opposite signs)
    $D(\beta_0, \boldsymbol{\beta}) = -\sum_{i \in \mathcal{M}} y_i \left( \beta_0 + \boldsymbol{\beta}^T \mathbf{x}_i \right)$

Perceptron Algorithm

• Given the definition of the hinge loss function
  $D(\beta_0, \boldsymbol{\beta}) = -\sum_{i \in \mathcal{M}} y_i \left( \beta_0 + \boldsymbol{\beta}^T \mathbf{x}_i \right)$,
  where $\mathcal{M}$ is the set of misclassified points
  – Calculate the gradient of the loss function
    $\frac{\partial D}{\partial \boldsymbol{\beta}} = -\sum_{i \in \mathcal{M}} y_i \mathbf{x}_i, \qquad \frac{\partial D}{\partial \beta_0} = -\sum_{i \in \mathcal{M}} y_i$
  – Loop over the training points, evaluate the gradient
    after each training point, and update the line estimate
    (a sketch in code follows below)
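A minimal sketch of one such pass in C++; the 2D data layout, labels in {-1, +1},
and the learning rate are illustrative assumptions, not part of the slides:

#include <array>
#include <cstddef>
#include <vector>

// One stochastic pass over the training set: after each misclassified
// point, step along the negative gradient of the hinge loss (-y_i * x_i).
struct Line { double b0, b1, b2; };   // decision function b0 + b1*x1 + b2*x2

void perceptron_epoch(const std::vector<std::array<double, 2>>& x,
                      const std::vector<int>& y,   // labels in {-1, +1}
                      Line& w, double rate) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        double score = w.b0 + w.b1 * x[i][0] + w.b2 * x[i][1];
        if (y[i] * score <= 0) {       // misclassified (or on the boundary)
            w.b0 += rate * y[i];
            w.b1 += rate * y[i] * x[i][0];
            w.b2 += rate * y[i] * x[i][1];
        }
    }
}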

Gradient Descent

• Find a solution by taking steps down the steepest slope


– 2D Example

(Figure: 2D example of gradient descent on a loss surface; a minimal sketch in
code follows below. Image source: downhill.readthedocs.io, "Downhill 0.4.0
Documentation", Johnson et al., Google.)
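A minimal gradient descent loop in C++, assuming a made-up 2D function
f(u, v) = (u - 1)^2 + 2(v + 2)^2 (not the surface from the figure) and an
arbitrary starting point and learning rate:

#include <cstdio>

int main() {
    double u = 5.0, v = 5.0;        // assumed starting point
    const double rate = 0.1;        // learning rate (step size)
    for (int step = 0; step < 100; ++step) {
        double du = 2.0 * (u - 1.0);   // partial derivative wrt u
        double dv = 4.0 * (v + 2.0);   // partial derivative wrt v
        u -= rate * du;                // step down the steepest slope
        v -= rate * dv;
    }
    std::printf("minimum near (%.3f, %.3f)\n", u, v);  // approaches (1, -2)
    return 0;
}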

Minimizing the Loss

• Only in simple cases (e.g., linear least squares) is a direct
  solution possible
• Perceptron algorithm is a form of stochastic gradient
descent
– Limitations:
• The order of points will influence the solution
• The solution is non-unique
• If the classes overlap, the algorithm enters a limit
cycle

Feedforward Neural Networks

• Here we focus on a feed forward neural network


– No loops in the network (unlike recurrent networks)
• The classic layout consists of one input layer, a single
  hidden layer, and one output layer
(Diagram: input layer → hidden layer → output layer.)

Basic Equations

• Derived features $Z_m$ are created from the input layer
  – $Z_m = \sigma(\alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{X}), \; m = 1, \dots, M$
    (with $M$ the number of nodes in the hidden layer)
  – using the activation function $\sigma$
• Output is calculated from the derived features
  – $T_k = \beta_{0k} + \boldsymbol{\beta}_k^T \mathbf{Z}, \; k = 1, \dots, K$
  – Here $K = 1$ (number of nodes in the output layer)
• Mapping to classes with logit or softmax: $f_k(\mathbf{X}) = g_k(\mathbf{T})$
(Diagram: input layer $X_1, X_2$ → hidden layer $Z_1, Z_2, Z_3$ → output layer $T$,
followed by logit or softmax.)

Comparison to Least Squares and
Perceptron
– We have an extra hidden layer
– We have the activation function
• For classification, we use the logit or softmax
function
– Aside: The constant terms (the bias) can be
integrated into the layers (see below)


Activation Functions
• The activation function introduces non-linearities into
  the neural network
• Many different choices and still an active area of research
  – Classic: sigmoid, hyperbolic tangent
  – Modern: ReLU (and variants)
  – Other: radial basis functions, softplus, hard tanh
• Because we want to solve the fitting or training of the
  network with gradient descent
  – Functions should be differentiable (at least nearly
    everywhere)
  – Derivatives should also be non-zero everywhere in the
    region of interest
• Caution: Goodfellow et al. [2016] state: "Many unpublished
  activation functions perform just as well as the popular
  ones" and give the example of cosine on MNIST

Sigmoid Function

• Before deep learning, the sigmoid function was used most often
  – Sigmoid: $\sigma(v) = \frac{1}{1 + e^{-v}}$
  – Same as in logistic regression
  – The derivative is well defined: $\sigma'(v) = \sigma(v)\,(1 - \sigma(v))$
  – Caution: The derivative becomes very small for large
    magnitudes of $v$ because the function saturates
    (see the sketch below)
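A small C++ sketch of the sigmoid and its derivative; evaluating sigmoid_deriv
at large |v| shows the saturation mentioned above:

#include <cmath>

// Sigmoid activation and its derivative.
double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

// d/dv sigmoid(v) = sigmoid(v) * (1 - sigmoid(v));
// note how it shrinks toward 0 for large |v| (saturation).
double sigmoid_deriv(double v) {
    double s = sigmoid(v);
    return s * (1.0 - s);
}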

Hyperbolic Tangent
• Similar function shape and in fact closely related to the
  sigmoid: $\tanh(v) = 2\sigma(2v) - 1$
  – The derivative is again well defined: $\tanh'(v) = 1 - \tanh^2(v)$
  – Advantage compared to the sigmoid:
    • The hyperbolic tangent is close to the identity function in
      the neighborhood of 0
    • It does not introduce non-linearity into the optimization if
      the input is close to 0
  – Similar to the sigmoid, the function saturates for large
    magnitudes and the derivatives become very small

ReLU
• Goal: Make the function close to the identity and prevent
  saturation: $\mathrm{ReLU}(v) = \max(0, v)$
  – The derivative is trivial and defined nearly everywhere:
    $\mathrm{ReLU}'(v) = 1$ for $v > 0$ and $0$ for $v < 0$
  – In practice, implement the derivative at 0 as the left or
    right derivative
  – Advantages compared to sigmoid-like functions:
    • Identity function if the node is "active" ($v > 0$)
    • The gradient remains useful for all $v > 0$
    • Sometimes an additional affine transform is used to
      move, e.g., input values into the positive range

Generalizations of ReLU

• Absolute value rectification: $g(v) = |v|$
  – Used in special cases to enforce symmetry
• Leaky ReLU
  – Avoids derivatives of 0 by introducing a small leak scalar $\alpha$:
    $g(v) = \max(0, v) + \alpha \min(0, v)$
  – A randomized version picks $\alpha$ at random during training
    and fixes it for testing/application
• Parametric ReLU or PReLU
  – Same as leaky ReLU but the leak scalar is learned for each
    node

Exponential Linear Units (ELU)
• A further generalization of ReLU [Clevert et al. 2015]:
  $g(v) = v$ for $v > 0$ and $g(v) = \alpha \left( e^{v} - 1 \right)$ for $v \le 0$
  – Note that the derivative is $1$ for $v > 0$ and
    $\alpha e^{v} = g(v) + \alpha$ for $v \le 0$
• Advantages compared to ReLU
  – The gradient is non-zero for negative values and only
    approaches 0 for large negative values
  – The linear range of the function smoothly transitions to the
    (negative) exponential
• Disadvantages
  – Higher computation cost (see the sketch of the ReLU variants below)
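A combined C++ sketch of the ReLU variants above; the leak and scale
parameters (alpha) are illustrative defaults, not values from the slides:

#include <algorithm>
#include <cmath>

double relu(double v)       { return std::max(0.0, v); }
double relu_deriv(double v) { return v > 0.0 ? 1.0 : 0.0; }   // pick the left derivative at 0

double leaky_relu(double v, double alpha = 0.01)       { return v > 0.0 ? v : alpha * v; }
double leaky_relu_deriv(double v, double alpha = 0.01) { return v > 0.0 ? 1.0 : alpha; }

double elu(double v, double alpha = 1.0)       { return v > 0.0 ? v : alpha * (std::exp(v) - 1.0); }
double elu_deriv(double v, double alpha = 1.0) { return v > 0.0 ? 1.0 : alpha * std::exp(v); }  // = elu(v) + alpha for v <= 0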
Training Neural Networks

• Single hidden layer: We have to find the weights
  – Call them collectively $\theta = \{\alpha_{0m}, \boldsymbol{\alpha}_m;\; \beta_0, \boldsymbol{\beta}\}$
  – In our example, we have two inputs $X_1, X_2$, three
    hidden nodes $Z_1, Z_2, Z_3$, and the output node
    weights

Weights

• Notation: each weight carries a subscript identifying the
  connection from node to node, e.g., $\alpha_{m\ell}$ for the weight from
  input node $\ell$ to hidden node $m$
(Diagram: network with labelled weights on the connections between
input, hidden, and output layers.)
Matrix Notation
• In practice NN are conveniently expressed as matrix-vector
multiplies
• Our network example
– $T = \beta_0 + \boldsymbol{\beta}^T \mathbf{Z}$ is a matrix equation (here $\boldsymbol{\beta}$ is a vector because
  we have a single output):
  $T = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix} \begin{bmatrix} 1 \\ z_1 \\ z_2 \\ z_3 \end{bmatrix}$
– $\mathbf{Z} = \mathfrak{a}(\boldsymbol{\alpha}_0 + \boldsymbol{\alpha} \mathbf{X})$ with (here) $M = 3$ leads to a matrix
  equation, where $\mathfrak{a}$ is an element-wise activation function and the
  bias is absorbed via the leading 1:
  $\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \mathfrak{a}\!\left( \begin{bmatrix} \alpha_{10} & \alpha_{11} & \alpha_{12} \\ \alpha_{20} & \alpha_{21} & \alpha_{22} \\ \alpha_{30} & \alpha_{31} & \alpha_{32} \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \right)$
  (a code sketch of this forward pass follows below)
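A C++ sketch of this forward pass for the example network (2 inputs, M = 3
hidden nodes, K = 1 output); the choice of sigmoid as the activation and the
weight layout are assumptions for illustration:

#include <array>
#include <cmath>

double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

// The bias is carried as a leading 1 in each matrix-vector product.
double forward(const std::array<double, 2>& x,
               const std::array<std::array<double, 3>, 3>& alpha,  // rows: hidden nodes, cols: [bias, x1, x2]
               const std::array<double, 4>& beta) {                // [bias, z1, z2, z3]
    std::array<double, 3> z;
    for (int m = 0; m < 3; ++m)
        z[m] = sigmoid(alpha[m][0] + alpha[m][1] * x[0] + alpha[m][2] * x[1]);
    return beta[0] + beta[1] * z[0] + beta[2] * z[1] + beta[3] * z[2];  // T = beta0 + beta^T z
}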

Neural Networks Loss Function

– We need a loss function to describe the difference
  between the desired and the calculated result
  • Squared error (mostly in regression)
    $L(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left( y_{ik} - f_k(\mathbf{x}_i) \right)^2$
  • Cross entropy (deviance) (often in classification)
    $L(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(\mathbf{x}_i)$
    (both losses are sketched in code below)
– Note that $f_k(\mathbf{x})$ is non-linear because of the
  activation function.
– The generic approach uses gradient descent
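Both losses for a single sample with K outputs, as a C++ sketch; the container
layout is an assumption, and y would be one-hot for classification:

#include <cmath>
#include <cstddef>
#include <vector>

double squared_error(const std::vector<double>& y, const std::vector<double>& f) {
    double loss = 0.0;
    for (std::size_t k = 0; k < y.size(); ++k) loss += (y[k] - f[k]) * (y[k] - f[k]);
    return loss;
}

double cross_entropy(const std::vector<double>& y, const std::vector<double>& f) {
    double loss = 0.0;
    for (std::size_t k = 0; k < y.size(); ++k) loss -= y[k] * std::log(f[k]);  // assumes f[k] > 0 (e.g. softmax output)
    return loss;
}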

Batch Gradient Descent
• Goal: Minimize the loss function (softmax with cross entropy), $L(\theta)$
  – Find the partial derivatives $\frac{\partial L}{\partial \beta_{km}}$ and $\frac{\partial L}{\partial \alpha_{m\ell}}$
  – Given the derivatives, update the weights with learning rate $\gamma_r$:
    $\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \frac{\partial L}{\partial \beta_{km}^{(r)}}$  for the output layer
    $\alpha_{m\ell}^{(r+1)} = \alpha_{m\ell}^{(r)} - \gamma_r \frac{\partial L}{\partial \alpha_{m\ell}^{(r)}}$  for the hidden layer

Gradient Descent Variations
• Batch gradient descent
  – Calculate the gradient based on all training samples
  – Uses all training samples, i.e., one epoch, for one update
  – Not practical if the training data is sizable
• Stochastic gradient descent
  – Calculate the gradient based on a single, randomly
    chosen training example
  – Can lead to very noisy gradients (large changes
    between evaluations because of the selected sample)
  – The approach in the Perceptron by Rosenblatt
• Mini-batch gradient descent
  – In-between solution: randomly select a number of samples
    (the size of the mini-batch), as in the sketch below
  – Practical for large datasets but not as noisy as just a
    single sample
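A C++ sketch of selecting one random mini-batch of indices; the batch size
and random engine are assumptions:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Shuffle the sample indices and keep the first batch_size of them.
std::vector<int> sample_minibatch(int num_samples, int batch_size, std::mt19937& rng) {
    std::vector<int> idx(num_samples);
    std::iota(idx.begin(), idx.end(), 0);        // 0, 1, ..., num_samples - 1
    std::shuffle(idx.begin(), idx.end(), rng);   // random order
    idx.resize(batch_size);                      // the mini-batch
    return idx;
}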

From Gradient Descent to Back
Propagation
• Major insight:
– Each layer of the network has a simple derivative
– Use multivariable chain rule to calculate derivatives
for the network
• Approach
– Separate the input at a node and its output after application
  of the non-linearity, i.e., consider the node input
  $t_m = \alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{x}$ and its output $z_m = \sigma(t_m)$ separately

Hidden Layer

• Approach
  – Separate the input at a node and its output after application
    of the non-linearity, i.e., consider $t_m = \alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{x}$
    and $z_m = \sigma(t_m)$
  – And hence overall $z_m = \sigma(\alpha_{0m} + \boldsymbol{\alpha}_m^T \mathbf{x})$, so the chain
    rule applies through the node input
Partial Differentials

• Softmax with cross-entropy loss
  $\frac{\partial L}{\partial T_k} = \sum_{j} y_j \left( f_k - \delta_{jk} \right)$, where $\delta_{jk}$ is the Kronecker delta,
  which simplifies for one-hot output to $\frac{\partial L}{\partial T_k} = f_k - y_k$
• Other differentials are:
  $\frac{\partial T_k}{\partial \beta_{km}} = z_m, \quad \frac{\partial z_m}{\partial t_m} = \sigma'(t_m), \quad \frac{\partial t_m}{\partial \alpha_{m\ell}} = x_\ell$
• We need to find the values at the nodes given an input data
  sample
  – To calculate the gradient we also need the target
    and the current weights

Back Propagation

• Two-pass algorithm
  – Given the current weights and a training sample,
    calculate the network output
  – Calculate the loss and the errors
    • This is the forward pass
  – Using back-propagation, combine
    • the above error,
    • the equations for the partial derivatives with the
      values at the nodes,
    • and the learning rate
    to calculate updates to the weights (a sketch in code
    follows below)
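A two-pass sketch in C++ for the example network (2 inputs, 3 sigmoid hidden
nodes, 1 linear output). For simplicity it uses a squared-error loss rather
than softmax with cross entropy; the layout, loss choice, and learning rate
are assumptions:

#include <array>
#include <cmath>

double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

struct Net {
    std::array<std::array<double, 3>, 3> alpha;  // hidden weights, rows = nodes, cols = [bias, x1, x2]
    std::array<double, 4> beta;                  // output weights [bias, z1, z2, z3]
};

void backprop_step(Net& net, const std::array<double, 2>& x, double y, double rate) {
    // Forward pass: node inputs t_m, node outputs z_m, network output T.
    std::array<double, 3> t, z;
    for (int m = 0; m < 3; ++m) {
        t[m] = net.alpha[m][0] + net.alpha[m][1] * x[0] + net.alpha[m][2] * x[1];
        z[m] = sigmoid(t[m]);
    }
    double T = net.beta[0] + net.beta[1] * z[0] + net.beta[2] * z[1] + net.beta[3] * z[2];

    // Backward pass: error at the output, then chain rule back to each weight.
    double dL_dT = 2.0 * (T - y);                    // squared error: L = (y - T)^2
    net.beta[0] -= rate * dL_dT;                     // dT/dbeta0 = 1
    for (int m = 0; m < 3; ++m) {
        double dL_dz = dL_dT * net.beta[m + 1];      // captured before the update below
        net.beta[m + 1] -= rate * dL_dT * z[m];      // dT/dbeta_m = z_m
        double dL_dt = dL_dz * z[m] * (1.0 - z[m]);  // sigmoid'(t_m) = z_m (1 - z_m)
        net.alpha[m][0] -= rate * dL_dt;             // dt_m/dalpha_m0 = 1
        net.alpha[m][1] -= rate * dL_dt * x[0];      // dt_m/dalpha_m1 = x1
        net.alpha[m][2] -= rate * dL_dt * x[1];      // dt_m/dalpha_m2 = x2
    }
}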

Forward Pass: Calculate the Error

– Example: a positive training sample for one class,
  requiring the corresponding target output
Backward Pass: Calculate the
Updates
– Calculate the partial derivatives for all weights

Apply the Updates to the Weights

• Apply the partial derivatives to all weights $\beta_{km}$ and $\alpha_{m\ell}$
  using the learning rate $\gamma$:
  $\beta_{km} \leftarrow \beta_{km} - \gamma \,\frac{\partial L}{\partial \beta_{km}}$  for the output layer
  $\alpha_{m\ell} \leftarrow \alpha_{m\ell} - \gamma \,\frac{\partial L}{\partial \alpha_{m\ell}}$  for the hidden layer
• Possibly use some momentum
  – Calculate the update based on the current and previous
    derivatives to stabilize the updates (see the sketch below)
• The result is a new model with new weights $\beta$ and $\alpha$
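A momentum update in C++ for a single weight; the learning rate and momentum
coefficient are assumed typical values, and a real network keeps one velocity
entry per weight:

// Blend the previous update direction with the current gradient to
// stabilize the steps.
struct MomentumUpdater {
    double rate = 0.01;       // learning rate (assumed)
    double momentum = 0.9;    // momentum coefficient (assumed)
    double velocity = 0.0;    // running, decaying average of past updates

    void update(double& weight, double gradient) {
        velocity = momentum * velocity - rate * gradient;
        weight += velocity;
    }
};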

Practical Back-Propagation

• Partial derivatives can be calculated in different ways
  – Manually or analytically (as we have just done)
  – Symbolically (as when using symbolic math packages
    such as Maple, Mathematica, etc.)
  – Numerically (by perturbing the different inputs of a
    node and measuring the change in output)
  – AutoDiff using dual numbers or complex numbers

Review: Symbolic Differentiation

• Equations can be expressed as a parse tree (and of
  course such trees are built by any compiler)
• Example: $g(x, y) = 3x^2 + 2xy$
(Parse tree: a sum of the two products $3 \cdot x \cdot x$ and $2 \cdot y \cdot x$.)
Symbolic Differentiation Example
• Assume we want to find $\partial g(x, y)/\partial x$
  – We know (code) simple rules, e.g., the sum rule, the product
    rule, the partial derivative of a variable itself, etc. (in
    total there are not many)
  – We find the derivative for each node and expand the
    graph (in this example) with the product and sum rules
(Parse tree fragment: differentiating the nodes of $3 \cdot x \cdot x + 2 \cdot y \cdot x$.)
Example Result

• The example leads to
  $\frac{\partial g}{\partial x} = \big( 0 \cdot x \cdot x + 3 \cdot (1 \cdot x + x \cdot 1) \big) + \big( 0 \cdot y \cdot x + 2 \cdot (0 \cdot x + y \cdot 1) \big)$
  (the expanded, not yet simplified, derivative graph)
Simplify
• The expression above and its graph are correct but can obviously
  be simplified to $\frac{\partial g}{\partial x} = 6x + 2y$.
  Automatic simplification is a bit harder.
• Symbolic differentiation can lead to lengthy expressions,
  even if simplification is successful.
• Not all network nodes are nicely differentiable (e.g., an
  activation layer with ReLU)

Numerical Differentiation

• We can use finite differences
  – E.g., forward differences:
    $\frac{\partial f}{\partial x} \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}$
• Accuracy in neural networks becomes an issue as we
  are often dealing with very small numbers
• Calculating the derivative with respect to many variables is
  expensive, i.e., if our function has high-dimensional
  input
• Higher order finite difference approximations become
even more expensive

Forward Differences
• Example again: $g(x, y) = 3x^2 + 2xy$
  double g(double x, double y) { return 3 * x * x + 2 * y * x; }
• Can define a helper function
  double delta_g(double x, double y, double eps_x, double eps_y) {
      return (g(x + eps_x, y + eps_y) - g(x, y))
             / (eps_x + eps_y);
  }
• Now we can calculate the two forward differences with
  these two functions
  • in x: delta_g(x, y, eps_x, 0)
  • in y: delta_g(x, y, 0, eps_y)
• Let's look at $\partial g/\partial x$ approximated with a small eps_x
  (see the usage sketch below)

AutoDiff

• AutoDiff calculates an exact value for the gradient of a


function
– AutoDiff is not new, see Wengert [1964], Kedem
[1980], Rall [1981].
– It has been implemented and used in many tools*
• AutoDiff is very efficient. It is only more expensive than
a function evaluation by a constant factor.

*E.g., Carpenter, Bob, et al. "The Stan Math Library: Reverse-mode
 automatic differentiation in C++." arXiv preprint arXiv:1509.07164 (2015).

Dual Numbers

• Dual numbers are similar to complex numbers but
  replace the imaginary part with an infinitesimal part:
  $a + b\varepsilon$, where $a$ and $b$ are real numbers and $\varepsilon$ is an
  infinitesimal number such that $\varepsilon^2 = 0$ but $\varepsilon \neq 0$
• Operations are analogous to complex numbers, e.g.:
  • Scale: $\lambda(a + b\varepsilon) = \lambda a + \lambda b\varepsilon$
  • Add: $(a + b\varepsilon) + (c + d\varepsilon) = (a + c) + (b + d)\varepsilon$
  • Multiply: $(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon$
  • Trigonometry, e.g., cosine: $\cos(a + b\varepsilon) = \cos(a) - b\sin(a)\,\varepsilon$
  (a small dual-number sketch in code follows below)
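A minimal dual-number type in C++ and its use for forward-mode differentiation
of the running example g(x, y) = 3x^2 + 2xy; the evaluation point is a made-up
value:

#include <cmath>
#include <cstdio>

// Value plus infinitesimal part; eps^2 = 0 is built into the operations.
struct Dual { double a, b; };

Dual scale(double s, Dual u) { return { s * u.a, s * u.b }; }
Dual add(Dual u, Dual v)     { return { u.a + v.a, u.b + v.b }; }
Dual mul(Dual u, Dual v)     { return { u.a * v.a, u.a * v.b + u.b * v.a }; }  // eps^2 term dropped
Dual cosine(Dual u)          { return { std::cos(u.a), -std::sin(u.a) * u.b }; }

// Forward-mode AutoDiff: seed the variable of interest with b = 1.
Dual g(Dual x, Dual y) { return add(scale(3.0, mul(x, x)), scale(2.0, mul(y, x))); }

int main() {
    Dual x = { 2.0, 1.0 };   // hypothetical point x = 2, seeded for d/dx
    Dual y = { 3.0, 0.0 };   // hypothetical point y = 3
    Dual r = g(x, y);
    std::printf("g = %g, dg/dx = %g\n", r.a, r.b);   // g = 24, dg/dx = 6x + 2y = 18
    return 0;
}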

Smooth Functions of Dual Numbers

• Consider a Taylor series of a smooth function $f$ around $a$:
  $f(a + h) = f(a) + f'(a)\,h + \frac{f''(a)}{2!}h^2 + \dots$
• Now with a dual number, $h = b\varepsilon$:
  $f(a + b\varepsilon) = f(a) + f'(a)\,b\varepsilon$
• Notice that this is exact, as $\varepsilon^2 = 0$ makes all higher-order
  terms vanish

Forward-mode AutoDiff

• Example again: $g(x, y) = 3x^2 + 2xy$ evaluated at a given point
  – Seed the variable of interest with an infinitesimal part of 1,
    e.g., $x + 1\varepsilon$, and propagate dual numbers through the parse tree
(Parse tree of $3 \cdot x \cdot x + 2 \cdot y \cdot x$ evaluated with dual numbers.)

Forward-mode vs. Reverse-mode
AutoDiff
• Forward mode AutoDiff requires one pass over the
graph for each parameter (not efficient when there are
many input parameters as in neural networks)
• Reverse-mode AutoDiff is an algorithm requiring one
forward and one backward pass for each output
– First pass through the graph evaluates the function
values at the current parameter values
– Second pass is applying the chain rule to calculate
the derivative
– Backpropagation is a special case of reverse-mode
  AutoDiff (a minimal tape sketch follows below)
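A minimal reverse-mode AutoDiff tape in C++ for the running example; this is
an illustrative sketch (not a library API), and the evaluation point is a
made-up value:

#include <cstdio>
#include <vector>

// Each tape entry records its (at most two) parents and the local partial
// derivatives of the node with respect to them.
struct Tape {
    struct Node { int p1, p2; double d1, d2; };
    std::vector<Node> nodes;
    std::vector<double> value;

    int variable(double v) { nodes.push_back({ -1, -1, 0.0, 0.0 }); value.push_back(v); return (int)value.size() - 1; }
    int add(int a, int b)  { nodes.push_back({ a, b, 1.0, 1.0 }); value.push_back(value[a] + value[b]); return (int)value.size() - 1; }
    int mul(int a, int b)  { nodes.push_back({ a, b, value[b], value[a] }); value.push_back(value[a] * value[b]); return (int)value.size() - 1; }

    // Backward pass: starting from the output, accumulate adjoints into the parents.
    std::vector<double> gradient(int output) const {
        std::vector<double> adj(value.size(), 0.0);
        adj[output] = 1.0;
        for (int i = output; i >= 0; --i) {
            if (nodes[i].p1 >= 0) adj[nodes[i].p1] += adj[i] * nodes[i].d1;
            if (nodes[i].p2 >= 0) adj[nodes[i].p2] += adj[i] * nodes[i].d2;
        }
        return adj;
    }
};

int main() {
    Tape t;
    int x = t.variable(2.0);   // hypothetical point x = 2
    int y = t.variable(3.0);   // hypothetical point y = 3
    int c3 = t.variable(3.0), c2 = t.variable(2.0);
    // Forward pass builds the graph of g = 3*x*x + 2*y*x and records all values.
    int g = t.add(t.mul(c3, t.mul(x, x)), t.mul(c2, t.mul(y, x)));
    std::vector<double> adj = t.gradient(g);   // one backward pass yields all input derivatives
    std::printf("g = %g, dg/dx = %g, dg/dy = %g\n", t.value[g], adj[x], adj[y]);  // 24, 18, 4
    return 0;
}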

Reverse-mode AutoDiff

• Example again: $g(x, y) = 3x^2 + 2xy$ evaluated at a given point
• Result of the forward pass in blue
• Setup for reverse mode
(Parse tree of $3 \cdot x \cdot x + 2 \cdot y \cdot x$ annotated with the forward-pass values.)
Differentials – Part I

(Parse tree annotated with the partial derivatives propagated backward, part I.)
Differentials – Part II

(Parse tree annotated with the partial derivatives propagated backward, part II;
the contributions from both paths through x are added.)
Observation for AutoDiff

• Based on the abstract syntax tree (AST) and either


dual numbers or the chain rule
• Does not apply just to neural network training. Used in
fluid dynamics, computer graphics, etc.
• Forward-mode AutoDiff is efficient if there are many
outputs but only a few inputs
• Reverse-mode AutoDiff is efficient if there are many
inputs but only a few outputs
• Each node only needs to be able to calculate the
  partial derivative based on its inputs
• Linear models need only the basic nodes implementing
the product rule and the sum rule

Summary

• Multi-layer perceptron
• feed forward networks
• activation functions
• loss function
• training by back propagation

