
Matrix Operation

[Figure: a single layer with inputs 1 and −1, weights 1, −2, −1, 1, biases 1 and 0, producing outputs y1 = 0.98 and y2 = 0.12]

$$\sigma\!\left(\begin{bmatrix}1 & -2\\ -1 & 1\end{bmatrix}\begin{bmatrix}1\\ -1\end{bmatrix}+\begin{bmatrix}1\\ 0\end{bmatrix}\right)=\sigma\!\left(\begin{bmatrix}4\\ -2\end{bmatrix}\right)=\begin{bmatrix}0.98\\ 0.12\end{bmatrix}$$
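As a sanity check (not part of the original slides), the numbers above can be reproduced with a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])   # weight matrix from the slide
x = np.array([1.0, -1.0])     # input vector
b = np.array([1.0, 0.0])      # bias vector

z = W @ x + b                 # pre-activation: [4, -2]
print(sigmoid(z))             # approximately [0.98, 0.12]
```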
Neural Network

[Figure: a fully connected network with inputs x1, …, xN, hidden-layer outputs a^1, a^2, …, and outputs y1, …, yM; each layer has a weight matrix W and a bias vector b]

$$a^{1} = \sigma\!\left(W^{1} x + b^{1}\right),\qquad a^{2} = \sigma\!\left(W^{2} a^{1} + b^{2}\right),\qquad \dots,\qquad y = \sigma\!\left(W^{L} a^{L-1} + b^{L}\right)$$
Neural Network

[Figure: the same fully connected network, viewed as a single function from the input x to the output y]

$$y = f(x) = \sigma\!\left(W^{L} \cdots \sigma\!\left(W^{2}\,\sigma\!\left(W^{1} x + b^{1}\right) + b^{2}\right) \cdots + b^{L}\right)$$

Using parallel computing techniques to speed up the matrix operations.
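The composed formula above is just a loop of matrix operations. Below is a minimal NumPy sketch of such a forward pass; the layer sizes and random weights are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative layer sizes: N = 4 inputs, two hidden layers, M = 3 outputs.
sizes = [4, 8, 8, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def forward(x):
    """y = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer: matrix multiply, add bias, activate
    return a

x = rng.standard_normal(sizes[0])
print(forward(x))                # an M-dimensional output vector
```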
Generalization of Deep Networks (overfitting)
Besides local minima ……
[Figure: cost plotted over the parameter space, showing three trouble spots for gradient descent: a plateau where progress is very slow ($\nabla C(\theta) \approx 0$), a saddle point where training gets stuck ($\nabla C(\theta) = 0$), and a local minimum where training gets stuck ($\nabla C(\theta) = 0$)]
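As a concrete illustration (a toy example, not from the slides), the cost $C(\theta_1, \theta_2) = \theta_1^2 - \theta_2^2$ has a zero gradient at the origin even though the origin is a saddle point, so a vanishing gradient by itself does not tell gradient descent that it has reached a minimum:

```python
import numpy as np

def C(theta):
    """Toy cost with a saddle point at the origin: C = theta1^2 - theta2^2."""
    return theta[0] ** 2 - theta[1] ** 2

def grad_C(theta):
    return np.array([2.0 * theta[0], -2.0 * theta[1]])

theta = np.zeros(2)
print(grad_C(theta))                      # [0. 0.] -> gradient vanishes here
print(C(theta + np.array([0.0, 0.1])))    # -0.01  -> yet the cost can still decrease nearby
```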
Fat + Short v.s. Thin + Tall

With the same number of parameters, which one is better?

[Figure: a shallow (fat + short) network and a deep (thin + tall) network, both taking inputs x1, x2, …, xN]
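To make "the same number of parameters" concrete, here is a hedged counting sketch; the layer widths are arbitrary choices picked so the two totals come out roughly equal, not values from the slides.

```python
def num_params(layer_sizes):
    """Count weights and biases of a fully connected network
    given its layer sizes (input, hidden..., output)."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector
    return total

# Both networks take 100 inputs and produce 10 outputs.
shallow = [100, 1286, 10]                # fat + short: one wide hidden layer
deep = [100, 200, 200, 200, 200, 10]     # thin + tall: several narrow hidden layers
print(num_params(shallow), num_params(deep))   # roughly the same totals
```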
Non-Linearity of AF
• Even if we use a very deep neural network, without non-linear activation functions we will just learn 'y' as a linear transformation of 'x'. The network can then only represent linear relations between 'x' and 'y'.
• We will be constrained to learning linear decision boundaries and cannot learn arbitrary non-linear decision boundaries (see the sketch below).
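The collapse of stacked linear layers into a single linear map can be checked numerically. This is a small illustrative sketch with random matrices (biases omitted for brevity; including them still yields a single affine map), not material from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "layers" with no activation function between them.
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((6, 5))
W3 = rng.standard_normal((3, 6))

def linear_network(x):
    return W3 @ (W2 @ (W1 @ x))    # no non-linearity anywhere

W_collapsed = W3 @ W2 @ W1         # a single equivalent linear map

x = rng.standard_normal(4)
print(np.allclose(linear_network(x), W_collapsed @ x))   # True
```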
Effect of hidden layer neurons
Activation Function properties
• 1. The function is continuous and differentiable everywhere (or almost everywhere).
• 2. The derivative of the function does not saturate (i.e., become very small, tending towards zero) over its expected input range. Very small derivatives tend to stall out the learning process.
• 3. The derivative does not explode (i.e., become very large, tending towards infinity), since this would lead to issues of numerical instability. (Properties 2 and 3 can be checked numerically, as sketched after this list.)
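One way to eyeball properties 2 and 3 for a candidate activation is to scan its derivative numerically over the expected input range. The range (−10, 10) and the activations below are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.linspace(-10, 10, 2001)      # assumed "expected input range"

derivatives = {
    "sigmoid": sigmoid(z) * (1.0 - sigmoid(z)),   # sigma'(z)
    "tanh": 1.0 - np.tanh(z) ** 2,                # tanh'(z)
    "relu": (z > 0).astype(float),                # ReLU'(z): 0 or 1
}

for name, d in derivatives.items():
    # Property 2: the derivative should not be vanishingly small over the whole range;
    # property 3: it should not blow up either.
    print(f"{name:8s} min = {d.min():.2e}   max = {d.max():.2e}")
```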
Common activation function
Activation Functions
• Range : When the range of the activation
function is finite, gradient-based training
methods tend to be more stable, because
pattern presentations significantly affect only
limited weights. When the range is infinite,
training is generally more efficient because
pattern presentations significantly affect most
of the weights.
Continuously differentiable
• This property enables gradient-based optimization methods.
• The binary step activation function is not differentiable at 0, and its derivative is 0 for all other values, so gradient-based methods can make no progress with it (see the sketch below).
• (ReLU is not continuously differentiable and has some issues with gradient-based optimization, but it is still usable in practice.)
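A minimal sketch (assumed toy values, not from the slides) of why the binary step gives gradient descent nothing to work with: its derivative is 0 wherever it is defined, so the chain rule multiplies every upstream gradient by 0.

```python
import numpy as np

def binary_step(z):
    return (z >= 0).astype(float)

def binary_step_grad(z):
    # The derivative is 0 for z != 0 and undefined at z = 0; we use 0 everywhere.
    return np.zeros_like(z)

z = np.array([-2.0, -0.5, 0.5, 3.0])
upstream = np.array([0.3, -1.2, 0.8, 0.1])    # gradient flowing back from the loss
print(binary_step_grad(z) * upstream)         # [0. 0. 0. 0.] -> no signal to update weights
```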
Sigmoid in BP learning
• In the logistic function, a small change in the input causes only a small change in the output, as opposed to the abrupt jumps of the step function. Hence the output is much smoother than the step function's output.
• During backpropagation through a network with sigmoid activations, the gradients in neurons whose output is near 0 or 1 are nearly 0. These neurons are called saturated neurons, and their weights do not update. Not only that, the weights of neurons connected to such neurons are also updated only slowly. This problem is known as the vanishing gradient problem (illustrated below).
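A small numeric sketch of saturation (the inputs are arbitrary examples, not from the slides): for a sigmoid unit the local gradient is σ(z)(1 − σ(z)), which is nearly 0 whenever the output is near 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    out = sigmoid(z)
    local_grad = out * (1.0 - out)    # d sigmoid(z) / dz
    print(f"z = {z:6.1f}   output = {out:.5f}   local gradient = {local_grad:.2e}")
```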
Vanishing gradient problem
• It describes the situation where a neural
network is unable to propagate useful
gradient information from the output end of
the model back to the layers near the input
end of the model.
• The result is a general inability of models with many layers to learn on a given dataset, or their tendency to prematurely converge to a poor solution.
The saturation problem
• A logistic neuron is said to be saturated when its output reaches an extreme value, either its maximum or its minimum.
• In the logistic function's formula, a large positive input drives the output towards 1 and a large negative input drives it towards 0.
• Tanh also saturates at large positive and negative values (worked values below).
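For concreteness, some worked values (computed here, not taken from the slides):

$$\sigma(10) = \frac{1}{1+e^{-10}} \approx 0.99995,\qquad \sigma(-10) = \frac{1}{1+e^{10}} \approx 4.5\times 10^{-5},\qquad \tanh(10) \approx 0.999999996$$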
Vanishing Gradient problem of Sigmoid
• The sigmoidal activation functions have two mostly-flat regions and only a relatively narrow region where the gradient is substantially non-zero, so using sigmoids can result in almost-zero gradient values quite easily, for instance if the parameters are initialized inappropriately, in a regime where the gradients are mostly small.
• Vanishing gradients make it hard for gradient descent to make progress. ReLUs are less susceptible to this vanishing gradient problem, as their gradient is non-zero for half the domain (a half space); see the comparison sketch below.
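A minimal comparison sketch, under the simplifying assumption of a depth-20 chain of scalar units with unit weights (not a claim about any particular network): backpropagation multiplies one local gradient per layer, so a chain of sigmoid derivatives shrinks the signal exponentially, while a ReLU chain leaves it intact on the active half of its domain.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 20
z = 2.0                                            # a positive pre-activation value
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))     # about 0.105
relu_grad = 1.0 if z > 0 else 0.0                  # 1 on the positive half of the domain

# One local gradient per layer (weights taken as 1 for simplicity).
print("sigmoid chain:", sigmoid_grad ** depth)     # about 3e-20 -> vanishes
print("ReLU chain:   ", relu_grad ** depth)        # 1.0 -> unchanged
```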
A simple change produced drastic improvements in the network's ability to learn.

Advantages of the ReLU over the sigmoid function
1. Very easy to implement
2. Range is from 0 to infinity
3. No vanishing gradient problem / Why?
ReLU
• The gradient of ReLU remains constant and never saturates for positive x, speeding up training. It has been found in practice that networks using ReLU train significantly faster than those using sigmoid activations.
• Both the function and its derivative can be computed using elementary and efficient mathematical operations (no exponentiation).
• The function is not differentiable at 0 but is differentiable everywhere else, including at points arbitrarily close to 0. In practice, we "set" the derivative at 0 to be either 0 (the left derivative) or 1 (the right derivative); see the sketch below.
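A small sketch of the ReLU and the derivative convention described above; picking 0 at z = 0 is one of the two options mentioned, chosen arbitrarily here.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # max(0, z), elementwise; no exponentiation needed

def relu_grad(z):
    # 1 for z > 0, 0 for z < 0; at z == 0 we pick 0 (the left derivative).
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]
```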
The ReLU and its variants
ReLU
• Simple and computationally efficient: ReLU is a simple thresholding operation that checks whether the input is greater than zero, so it can be computed much faster than more non-linear functions such as tanh or other sigmoids.
• The derivative of ReLU is piecewise constant: zero or one, depending again on whether the input z is greater than zero. This simplicity can provide a small constant-factor speed advantage.
• The thresholding nature of a ReLU is more similar to biological neurons, which tend to fire beyond a threshold.
