Professional Documents
Culture Documents
holding a number which comes from the ending branches (Synapses) supplied at that Neuron, what
happens is for a Layer of Neural Network we multiply the input to the Neuron with the weight held by
that synapse and sum all of those up.
For example (see D in above figure), if the weights are w1, w2, w3 …. wN and inputs being i1, i2, i3 …. iN
we get a summation of : w1*i1 + w2*i2 + w3*i3 …. wN*iN
For several layers of Neural Networks and Connections we can have varied values of wX and iX
and the summation S which varies according to whether the particular Neuron is activated or not,
so to normalize this and prevent drastically different range of values, we use what is called a
Activation Function for Neural networks that turns these values into something equivalent
between 0,1 or -1,1 to make the whole process statistically balanced.
Nonlinear — When the activation function is non-linear, then a two-layer neural network
can be proven to be a universal function approximator. The identity activation function
does not satisfy this property. When multiple layers use the identity activation function,
the entire network is equivalent to a single-layer model.
Range — When the range of the activation function is finite, gradient-based training
methods tend to be more stable, because pattern presentations significantly affect only
limited weights. When the range is infinite, training is generally more efficient because
pattern presentations significantly affect most of the weights. In the latter case, smaller
learning rates are typically necessary.
Continuously differentiable — This property is desirable (ReLU is not continuously
differentiable and has some issues with gradient-based optimization, but it is still
possible) for enabling gradient-based optimization methods. The binary step activation
function is not differentiable at 0, and it differentiates to 0 for all other values, so
gradient-based methods can make no progress with it.
Monotonic — When the activation function is monotonic, the error surface associated
with a single-layer model is guaranteed to be convex. Monotonic function: A function
which is either entirely non-increasing or non-decreasing.
Smooth functions with a monotonic derivative — These have been shown to generalize
better in some cases.
Approximates identity near the origin — When activation functions have this property,
the neural network will learn efficiently when its weights are initialized with small
random values. When the activation function does not approximate identity near the
origin, special care must be used when initializing the weights.
Equation : f(x) = x
It doesn’t help with the complexity or various parameters of usual data that is fed to the neural
networks.
It makes it easy for the model to generalize or adapt with variety of data and to differentiate
between the output.
Derivative or Differential: Change in y-axis w.r.t. change in x-axis.It is also known as slope.
The main reason why we use sigmoid function is because it exists between (0 to 1). Therefore, it
is especially used for models where we have to predict the probability as an output.Since
probability of anything exists only between the range of 0 and 1, sigmoid is the right choice.
The function is differentiable.That means, we can find the slope of the sigmoid curve at any two
points.
The logistic sigmoid function can cause a neural network to get stuck at the training time.
The softmax function is a more generalized logistic activation function which is used for
multiclass classification.
The advantage is that the negative inputs will be mapped strongly negative and the zero inputs
will be mapped near zero in the tanh graph.
Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
As you can see, the ReLU is half rectified (from bottom). f(z) is zero when z is less than zero and
f(z) is equal to z when z is above or equal to zero.
Range: [ 0 to infinity)
But the issue is that all the negative values become zero immediately which decreases the ability
of the model to fit or train from the data properly. That means any negative input given to the
ReLU activation function turns the value into zero immediately in the graph, which in turns
affects the resulting graph by not mapping the negative values appropriately.
4. Leaky ReLU
It is an attempt to solve the dying ReLU problem
Fig : ReLU v/s Leaky ReLU
The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.
Both Leaky and Randomized ReLU functions are monotonic in nature. Also, their derivatives
also monotonic in nature.