
ACTIVATION FUNCTION

Good site- https://mlfromscratch.com/activation-functions-explained/#/


What is an activation function?
An activation function is a mathematical equation attached to each hidden and output neuron in the network. Its role is to determine whether the neuron should be activated or not, typically based on whether the neuron's input is important for the output prediction. For every neuron outside the input layer, the weighted sum of the neuron's inputs, with a bias added to it, is passed through an activation function.
Why use an activation function?
Without an activation function, the relation between the weights and the neuron inputs is always linear, so the output of the network is only a linear transformation of its inputs. A model that captures only linear relationships between variables is unsuitable for complex problems. An activation function that provides non-linearity is therefore a must. It enables us to build more complex models that fit real-world problems.
Activation functions can be divided into three main categories:
□ Binary Step Function
□ Linear Activation Function
□ Non-Linear Activation functions
o There are various types of activation functions. Some of them are listed below (with some key points):
Table Synthesis:
• Range: the range of the function's outputs.
• 0-centered: whether the function is zero-centered or not.
• Saturation: whether the function suffers from saturation or not. If yes, the neuron values for which saturation takes place are mentioned.
• Vanishing Gradient: whether the activation function causes the vanishing gradient problem or not.
• Computation: whether the function is easy to compute or compute-intensive.

Why use non-linearity?


• A network that uses only linear or step activation functions is not practical for complex applications of neural networks. The main advantages of non-linear functions over a linear function are:
o The commonly used non-linear functions have derivatives that depend on the input, so backpropagation receives a useful gradient.
o Layers can be stacked meaningfully, which makes it possible to create deep neural nets.
Binary Step Function:

It is a threshold-based activation function.


It only supports binary classification. In other words, it does not allow multi-value outputs.

Pro:
• Binary classification
Cons:
• Doesn't work when multi-value outputs are needed, e.g. multiclass classification
• The derivative used for the gradient calculation is 0 everywhere (and undefined at the threshold), so the weights cannot be updated by gradient descent

Linear activation Function:


It is used for simple linear regression models.
Since the derivative is a constant that does not depend on the input, backpropagation gives no indication of which weights can provide a better prediction.
It is unable to capture complex patterns no matter the depth of the neural network. All layers collapse into one, as if the network had just one layer: the output layer is a linear function of the input layer.

Pros:
• Binary and multiclass classification
• Highly interpretable
Cons:
• The derivative is the constant slope “a”, so the gradient used to update the weights and biases during backpropagation does not depend on the input.
• Not efficient, because the gradient is always the same.
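The collapse of stacked linear layers into a single layer can be checked numerically. The following is a minimal sketch, assuming NumPy and arbitrary random weights; the names W1, b1, W2, b2 and the layer sizes are illustrative and not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with a linear (identity) activation:
# y = W2 (W1 x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_linear_layers(x):
    hidden = W1 @ x + b1      # linear activation: output equals the weighted sum
    return W2 @ hidden + b2

# The same mapping written as ONE layer
W_single = W2 @ W1
b_single = W2 @ b1 + b2

def one_linear_layer(x):
    return W_single @ x + b_single

x = rng.normal(size=3)
print(np.allclose(two_linear_layers(x), one_linear_layer(x)))  # prints True
```

Whatever the depth, repeating this substitution leaves a single weight matrix and bias, which is why a non-linear activation is needed between layers.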

Non-linear activation Function:


• Sigmoid/Logistic
It normalizes the output of the neuron to a range between 0 and 1, so the output can be read as a probability. This makes the sigmoid useful for output neurons of classification-aimed neural networks.
It is relatively compute-intensive since it requires computing an exponential, which can slow down training.
It suffers from the saturation problem. A neuron is considered saturated when its output approaches its maximum or minimum value (Ex. Sigmoid: f(x) ≈ 0 or 1), so that its derivative (Ex. Sigmoid: f'(x) = f(x)(1 - f(x))) approaches 0. In that case, there is almost no update of the weights. The gradient of the loss function with respect to the weights shrinks towards 0 as it is propagated back through the layers. This phenomenon is known as the vanishing gradient problem, and it causes poor learning in deep networks.
It is not a zero-centered function: its outputs are always positive. As a result, the gradients of all the weights connected to the same neuron are either all positive or all negative. During the update process, these weights can only move in one direction, i.e. positive or negative, at a time. This makes the optimization of the loss function harder.
• Tanh (hyperbolic tangent):
It normalizes the output of the neuron to a range between -1 and 1.
Unlike Sigmoid, it is a zero-centered function, so the optimization of the loss function becomes easier.
Like Sigmoid, Tanh is relatively compute-intensive and suffers from the saturation problem and thus from vanishing gradients. When the neuron output approaches the minimum or maximum of its range, i.e. -1 or 1, its derivative approaches 0.

Pros:
• Range between -1 and 1
• The gradient is stronger than for sigmoid (the derivative is steeper: at most 1, compared with 0.25 for sigmoid)
Cons:
• Like sigmoid, tanh also has a vanishing gradient problem
• Saturation
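Saturation of sigmoid and tanh can be illustrated by evaluating their derivatives at a few points. This is a small sketch assuming NumPy; the function names are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # f'(x) = f(x) * (1 - f(x))

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2   # d/dx tanh(x) = 1 - tanh(x)^2

for x in (-10.0, 0.0, 10.0):
    # Far from 0 both derivatives are tiny (saturation);
    # at 0 they peak at 0.25 (sigmoid) and 1.0 (tanh).
    print(x, sigmoid_derivative(x), tanh_derivative(x))
```

Because backpropagation multiplies such derivatives layer by layer, a chain of saturated neurons drives the overall gradient towards 0.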

• ReLU
It is the most widely used activation function.
It is cheap to compute, so training is fast.
Its derivative is 1 for positive inputs (f’(x) = 1 for x > 0), so ReLU does not saturate on the positive side. Saturation and vanishing gradients only occur for negative inputs, which ReLU maps to 0.
It is not a zero-centered function.
Pros:
• Easy to implement and very fast
• True 0 value
• Optimization is easier when the activation function is linear (ReLU is linear for x > 0)
• Most used in the neural networks ecosystem
Cons:
• The function is not differentiable at x = 0, so the gradient is not defined at this point, but in practice this has no influence. The positive part has slope 1 and the negative part has slope 0.
• “Dying ReLU problem”: neurons whose output is 0 are inactive. There is no gradient when a neuron is not active, so if a large share of the neurons are not activated, the model can perform poorly.
• Often considered less appropriate for RNN-class algorithms (RNN, LSTM, GRU)

• Leaky ReLU (LReLU)

In an attempt to solve the dying ReLU problem at negative values, Leaky ReLU introduces a small slope α for x < 0. Scaling the negative values by α enables the corresponding neurons to “stay alive”. The name Leaky ReLU usually refers to a small fixed slope such as α = 0.01. In Randomized ReLU (RReLU), α is instead sampled at random from a given range during training.
It is easy to compute.
It is close to a zero-centered function.

Pros:
• Corrects the “dying ReLU problem”
• Same behavior as the ReLU activation function for the part y = x

• Parametric ReLU (PReLU)

Pros:
• Generalizes the ReLU activation function
• Avoids the “dying ReLU problem”
• The parameter “a” (the slope for negative inputs) is learned by the neural network
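The difference between ReLU and its leaky variants is easiest to see in the gradients. The sketch below assumes NumPy and the usual definitions (f(x) = x for x > 0, otherwise 0 for ReLU or α·x for Leaky ReLU). The gradient value chosen at x = 0 is a convention picked here, not something stated in the notes. PReLU has the same form as `leaky_relu`, with `alpha` treated as a trainable parameter.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention assumed here: gradient 0 at x == 0
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Negative inputs keep a small, non-zero gradient
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))             # negative inputs are clipped to 0
print(relu_grad(x))        # zero gradient for x <= 0: the "dying ReLU" region
print(leaky_relu(x))       # negative inputs are scaled by alpha
print(leaky_relu_grad(x))  # gradient alpha instead of 0 for x <= 0
```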

• Exponential Linear Unit (ELU)

Pros:
• ELU decreases smoothly for negative inputs until its output approaches -α, whereas ReLU has a sharp corner at 0.
• ELU is a strong alternative to ReLU.
• Unlike ReLU, ELU can produce negative outputs.
Cons:
• For x > 0, the activation is unbounded (output range [0, inf)), so it can blow up.

• ReLU-6
Another variation of the ReLU function is ReLU-6, where 6 is an arbitrary parameter fixed by hand. The advantage is that outputs for large positive inputs are capped at the value 6.
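As a rough illustration of ELU and ReLU-6, the sketch below uses the commonly given definitions: ELU(x) = x for x > 0 and α(exp(x) − 1) otherwise, and ReLU-6(x) = min(max(0, x), 6). The ELU formula is not spelled out in these notes, so treat it as an assumption.

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) otherwise; tends to -alpha as x -> -inf.
    # np.minimum keeps exp() from overflowing in the branch that is not selected.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def relu6(x):
    # ReLU capped at 6
    return np.minimum(np.maximum(0.0, x), 6.0)

x = np.array([-20.0, -1.0, 0.0, 3.0, 10.0])
print(elu(x))    # the first entry is very close to -alpha = -1
print(relu6(x))  # the last entry is capped at 6
```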

• Softplus
The softplus activation function is an alternative to the sigmoid and tanh functions. Those functions are bounded (both above and below), whereas softplus has the range (0, +inf).
• Softsign
This activation function is a variation of tanh but is not used much in practice. The tanh and softsign functions are closely related: tanh converges to its asymptotes exponentially, whereas softsign converges polynomially.
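A minimal sketch of both functions, assuming NumPy and the usual definitions softplus(x) = log(1 + exp(x)) and softsign(x) = x / (1 + |x|). These formulas are not written out in the notes, so they are stated here as assumptions.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written with logaddexp for numerical stability
    return np.logaddexp(0.0, x)

def softsign(x):
    return x / (1.0 + np.abs(x))

x = np.array([-5.0, 0.0, 5.0])
print(softplus(x))   # always positive, unbounded above
print(softsign(x))   # approaches -1 and 1 slowly (polynomially)
print(np.tanh(x))    # approaches -1 and 1 quickly (exponentially)
```

At x = 5, tanh is already much closer to 1 than softsign, which illustrates the different convergence speeds.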

• Softmax
The softmax activation function is different from the others because it computes a probability distribution: the outputs sum to 1.
The corresponding code:
import numpy
def softmax_active_function(x):
    exps = numpy.exp(x - numpy.max(x))  # subtract the max for numerical stability
    return exps / numpy.sum(exps)

• Swish
Swish is a newer activation function, published by Google in 2017. It is reported to improve on the performance of ReLU in deeper models. This function is a variation of the sigmoid function because it can be expressed as x*sigmoid(x).
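The expression x*sigmoid(x) translates directly into code. A short sketch, assuming NumPy; the function name is illustrative.

```python
import numpy as np

def swish(x):
    # x * sigmoid(x), written as x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # slightly negative for x < 0, close to x for large positive x
```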
GRADIENT DESCENT

Optimization—

Optimization plays an important role in machine learning/deep learning. It is the task of making the
best or most effective use (finding maximum or minimum) of a function f(x) parameterized by x.

Optimization algorithms (in case of minimization) have one of the following goals:
 Find the global minimum of the objective function. This is feasible if the objective
function is convex, i.e. any local minimum is a global minimum.
 Find the lowest possible value of the objective function within its neighborhood.
That’s usually the case if the objective function is not convex as the case in most deep
learning problems.

There are three kinds of optimization algorithms:


 Optimization algorithm that is not iterative and simply solves for one point.
 Optimization algorithm that is iterative in nature and converges to acceptable
solution regardless of the parameters initialization such as gradient descent applied to
logistic regression.
 Optimization algorithm that is iterative in nature and applied to a set of problems
that have non-convex cost functions such as neural networks. Therefore, parameters’
initialization plays a critical role in speeding up convergence and achieving lower
error rates.

What is gradient descent?

- Gradient descent is an optimization algorithm (the most common one in the field of deep learning/neural networks) used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

There are many variants of gradient descent algorithm, like:


i) Batch Gradient Descent
ii) Mini-batch Gradient Descent
iii) Stochastic Gradient Descent etc.
The starting point is just an arbitrary point for us to evaluate the performance. From that
starting point, we will find the derivative (or slope), and from there, we can use a tangent
line to observe the steepness of the slope. The slope will inform the updates to the
parameters—i.e. the weights and bias. The slope at the starting point will be steeper,
but as new parameters are generated, the steepness should gradually reduce until it
reaches the lowest point on the curve, known as the point of convergence.   

Similar to finding the line of best fit in linear regression, the goal of gradient descent is to
minimize the cost function, or the error between predicted and actual y. In order to do
this, it requires two data points—a direction and a learning rate. These factors
determine the partial derivative calculations of future iterations, allowing it to gradually
arrive at the local or global minimum (i.e. point of convergence). More detail on these
components can be found below:

• Learning rate (also referred to as step size or alpha) is the size of the steps that are taken to reach the minimum. This is typically a small value, and it is evaluated and updated based on the behavior of the cost function. High learning rates result in larger steps but risk overshooting the minimum. Conversely, a low learning rate has small step sizes. While it has the advantage of more precision, the number of iterations compromises overall efficiency, as it takes more time and computation to reach the minimum.
• The cost (or loss) function measures the difference, or error, between actual y and predicted y at its current position. This improves the machine learning model's efficacy by providing feedback to the model so that it can adjust the parameters to minimize the error and find the local or global minimum. It continuously iterates, moving along the direction of steepest descent (or the negative gradient) until the cost function is close to or at zero. At this point, the model will stop learning. Additionally, while the terms cost function and loss function are often considered synonymous, there is a slight difference between them: a loss function refers to the error of one training example, while a cost function calculates the average error across an entire training set.
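The effect of the learning rate can be tried out on a one-dimensional toy problem. The sketch below minimizes f(w) = w², whose gradient is 2w. The function, the starting point and the three learning rates are illustrative choices, not values from the notes.

```python
def gradient_descent(learning_rate, start=5.0, steps=20):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w   # step against the gradient
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
# 0.01: still far from the minimum at 0 after 20 steps (too slow)
# 0.1:  close to the minimum
# 1.1:  every step overshoots and |w| grows (divergence)
```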

[Figure: a very large learning step/rate can miss the minimum value.]
[Figure: a very small learning step/rate can take a very long time to reach the minimum value.]
[Figure: plot of good learning rates.]
More in detail:
https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/
Gradient Descent algorithm:

It is an optimization algorithm to find the values of the coefficients of the variables of a function (f) so as to minimize the cost function.
It is mostly used when the coefficients cannot be calculated analytically (for example, by
using linear algebra) but need to be searched using some optimization algorithm. In
short, it is a strategy for searching through a large or infinite parameter space.

Gradient descent search determines a weight vector (w) that minimizes the error E by starting with some arbitrary initial weight vector and repeatedly modifying it in small steps. As the gradient specifies the direction that produces the steepest ascent in the error, the negative of this vector gives the direction of the steepest decrease. At each step, the weight vector (w) is altered in the direction that produces the steepest descent along the error surface. This process continues until a minimum of the error is reached (the global minimum when the error surface has a single minimum). To construct a practical algorithm for iteratively updating weights, we need an efficient way of calculating the gradient at each step. The gradient, which is the vector of partial derivatives, can be calculated by differentiating the cost function (E). For a linear unit with output o_d and target t_d on training example d, the training rule for gradient descent (with MSE as cost function, constant factors absorbed into the learning rate η) at a particular point can be given by
$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta \sum_{d} (t_d - o_d)\, x_{id}$$
where x_{id} is the i-th input of example d.
We see here that to update each ‘wi’, gradient descent uses the summation of errors of all the data points and hence is also referred to as the batch gradient descent.
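A minimal sketch of batch gradient descent for a linear unit, assuming NumPy and a squared-error cost E = ½ Σ_d (t_d − o_d)², so that each weight update sums the errors over all data points: Δw_i = η Σ_d (t_d − o_d) x_id. The toy data, the learning rate and the function name are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(X, t, learning_rate=0.05, epochs=500):
    """Fit a linear unit o = X @ w by minimizing E = 1/2 * sum((t - o)**2)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # outputs for ALL training examples
        grad = -(X.T @ (t - o))        # dE/dw, summed over the whole data set
        w = w - learning_rate * grad   # step against the gradient
    return w

# Toy data generated from t = 1 + 2*x; the first column of X is a constant 1
# so that w[0] plays the role of the bias.
x = np.linspace(-1.0, 1.0, 11)
X = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x

w = batch_gradient_descent(X, t)
print(w)  # should be close to [1, 2]
```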

Example:
Visit:
https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/optimizing-multivariable-functions/a/what-is-gradient-descent

Types:
BATCH GRADIENT DESCENT

Batch gradient descent sums the error for each point in a training set, updating the model only after all training examples have been evaluated. This process is referred to as a training epoch.

While this batching provides computational efficiency, it can still have a long processing time for large training datasets, as it needs to hold all of the data in memory. Batch gradient descent also usually produces a stable error gradient and convergence, but sometimes that convergence point isn’t the most ideal: it can settle in a local minimum rather than the global one.

STOCHASTIC GRADIENT DESCENT
Stochastic gradient descent (SGD) evaluates one training example at a time and updates the parameters after each example, so an epoch consists of as many updates as there are examples. Since you only need to hold one training example, it is easier to fit in memory. While these frequent updates can offer more detail and speed, they can result in losses in computational efficiency when compared to batch gradient descent. The frequent updates can result in noisy gradients, but this can also be helpful in escaping a local minimum and finding the global one.

MINI-BATCH GRADIENT DESCENT

Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic gradient descent. It splits the training dataset into small batches and performs an update on each of those batches. This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
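The three variants differ only in how many examples feed each update, so a single loop with a `batch_size` parameter can sketch all of them. This is an illustrative sketch assuming NumPy, a linear model with squared error, and noise-free toy data. The function name, learning rate and epoch count are assumptions made for the example.

```python
import numpy as np

def minibatch_gradient_descent(X, t, batch_size, learning_rate=0.3,
                               epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = t[idx] - X[idx] @ w
            grad = -(X[idx].T @ error) / len(idx)   # mean gradient over the batch
            w = w - learning_rate * grad
    return w

x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x                                   # noise-free toy targets

w_batch = minibatch_gradient_descent(X, t, batch_size=len(X))  # batch: 1 update per epoch
w_mini = minibatch_gradient_descent(X, t, batch_size=5)        # mini-batch: 4 updates per epoch
w_sgd = minibatch_gradient_descent(X, t, batch_size=1)         # stochastic: 20 updates per epoch
print(w_batch, w_mini, w_sgd)
```

On noisy real data the stochastic and mini-batch runs would fluctuate around the minimum rather than settle exactly, which is the noise the text describes.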
HOPFIELD NETWORK

Introduction:

A Hopfield network is a special kind of neural network whose response is computed differently from that of other neural networks: it is calculated by an iterative process that converges. It has just one layer of neurons, whose number matches the size of the input and output, which must be the same. When such a network is to recognize, for example, digits, we present a list of correctly rendered digits to the network. Subsequently, the network can transform a noisy input into the corresponding perfect output.

• In 1982, John Hopfield introduced an artificial neural network to store and retrieve memory like the human brain.
• Here, a neuron either is on (firing) or is off (not firing), a vast simplification of the real situation.
• The state of a neuron (on: +1 or off: -1) will be renewed depending on the input it receives from other neurons.
• A Hopfield network is initially trained to store a number of patterns or memories.
• It is then able to recognise any of the learned patterns by exposure to only partial or even some corrupted information about that pattern, i.e., it eventually settles down and returns the closest pattern or the best guess.
• Thus, like the human brain, the Hopfield model has stability in pattern recognition.
• A Hopfield network is a single-layered, recurrent network: the neurons are fully connected, i.e., every neuron is connected to every other neuron.
• Given two neurons i and j, there is a connectivity weight $w_{ij}$ between them which is symmetric, $w_{ij} = w_{ji}$, with zero self-connectivity, $w_{ii} = 0$.
• In a three-neuron example, neurons i = 1, 2, 3 with values $x_i = \pm 1$ have connectivity $w_{ij}$; any update has input $x_i$ and output $y_i$.

Updating Rule:
o Assume N neurons i = 1, ..., N with values $x_i = \pm 1$.
o The update rule for node i is: if $h_i \geq 0$ then $x_i \leftarrow 1$, otherwise $x_i \leftarrow -1$, where $h_i = \sum_{j=1}^{N} w_{ij} x_j + b_i$ is called the field at i, with $b_i \in \mathbb{R}$ a bias.
o Thus, $x_i \leftarrow \mathrm{sgn}(h_i)$, where sgn(r) = 1 if r ≥ 0, and sgn(r) = −1 if r < 0.
o We put $b_i = 0$ for simplicity, as it makes no difference to training the network with random patterns.
o We therefore assume $h_i = \sum_{j=1}^{N} w_{ij} x_j$.
o Updates in the Hopfield network can be performed in two different ways:
• Asynchronous: Only one unit is updated at a time. This unit can be picked at random, or a pre-defined order can be imposed from the very beginning.
• Synchronous: All units are updated at the same time. This requires a central clock to the system in order to maintain synchronization. This method is viewed by some as less realistic, based on an absence of observed global clock influencing analogous biological or physical systems of interest.
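The update rule above can be sketched as a small program. The storage step is an assumption made for this example: it uses the Hebbian outer-product rule (w_ij = Σ over patterns of x_i x_j, with w_ii = 0), which is a common choice but is not specified in this part of the notes. The two stored patterns and the function names are likewise illustrative.

```python
import numpy as np

def train_hopfield(patterns):
    """Assumed Hebbian storage: w_ij = sum over patterns of x_i * x_j, w_ii = 0."""
    patterns = np.asarray(patterns, dtype=float)
    W = patterns.T @ patterns
    np.fill_diagonal(W, 0.0)           # zero self-connectivity
    return W                           # symmetric by construction

def recall(W, x, sweeps=5):
    """Asynchronous updates x_i <- sgn(h_i), h_i = sum_j w_ij x_j (bias 0)."""
    x = np.array(x, dtype=float)
    for _ in range(sweeps):
        for i in range(len(x)):        # fixed, pre-defined update order
            h_i = W[i] @ x
            x[i] = 1.0 if h_i >= 0 else -1.0
    return x

stored = np.array([
    [1,  1, 1,  1, -1, -1, -1, -1],
    [1, -1, 1, -1,  1, -1,  1, -1],
])
W = train_hopfield(stored)

noisy = stored[0].copy()
noisy[0] = -1                          # corrupt one component of the first pattern
print(recall(W, noisy))                # settles back on the first stored pattern
```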

It is basically of 2 types: i) Discrete Hopfield Network and ii) Continuous Hopfield Network

Illustration:

http://web.cs.ucla.edu/~rosen/161/notes/hopfield.html

For the discrete and continuous formula deduction part:

https://www.tutorialspoint.com/artificial_neural_network/artificial_neural_network_hopfield.htm

PS:
There is another computational role for Hopfield nets. Instead of using the net
to store memories, we use it to construct interpretations of sensory input. The
input is represented by the visible units, the interpretation is represented by
the states of the hidden units, and the badness of the interpretation is
represented by the energy.
HEBBIAN LEARNING

The Hebb learning rule assumes that if two neighboring neurons are activated and deactivated at the same time, then the weight connecting these neurons should increase. For neurons operating in the opposite phase, the weight between them should decrease. If there is no signal correlation, the weight should not change.
When the inputs of both nodes are either positive or negative, a strong positive weight exists between the nodes. If the input of one node is positive and that of the other is negative, a strong negative weight exists between the nodes.

At the start, the values of all weights are set to zero. This learning rule can be used for both soft- and hard-activation functions. Since the desired responses of neurons are not used in the learning procedure, this is an unsupervised learning rule. The absolute values of the weights are usually proportional to the learning time, which is undesired.
Hebbian Learning Rule Algorithm:
1. Set all weights to zero, w_i = 0 for i = 1 to n, and the bias to zero.
2. For each training pair s : t (input vector : target output), repeat steps 3-5.
3. Set the activations of the input units to the input vector: x_i = s_i for i = 1 to n.
4. Set the activation of the output neuron to the target value, i.e. y = t.
5. Update the weights and bias by applying the Hebb rule for all i = 1 to n:
   w_i(new) = w_i(old) + x_i y
   b(new) = b(old) + y
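The algorithm steps can be traced on a small example. The sketch below assumes the usual form of the Hebb update, w_i(new) = w_i(old) + x_i·y and b(new) = b(old) + y, and trains a single neuron on the logical AND function with bipolar (+1/−1) inputs and targets. The data set and function name are chosen here for illustration.

```python
import numpy as np

def hebb_train(inputs, targets):
    w = np.zeros(inputs.shape[1])      # step 1: weights and bias start at zero
    b = 0.0
    for s, t in zip(inputs, targets):  # step 2: one pass over the training pairs
        x = s                          # step 3: input activations x_i = s_i
        y = t                          # step 4: output activation y = t
        w = w + x * y                  # step 5: w_i(new) = w_i(old) + x_i * y
        b = b + y                      #         b(new)   = b(old) + y
    return w, b

# Bipolar AND: the output is +1 only when both inputs are +1
inputs = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
targets = np.array([1, -1, -1, -1])

w, b = hebb_train(inputs, targets)
print(w, b)                            # [2. 2.] -2.0
print(np.sign(inputs @ w + b))         # reproduces the targets
```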
Three major points were stated as a part of this learning mechanism:
• Information is stored in the connections between neurons in neural networks, in the form of weights.
• Weight change between neurons is proportional to the product of activation values for neurons.
• As learning takes place, simultaneous or repeated activation of weakly connected neurons incrementally changes the strength and pattern of weights, leading to stronger connections.
ASSOCIATIVE MEMORY PARADIGM

Learning is the adaptation of the network to better handle a task by considering sample observations. Learning involves adjusting the weights (and optional thresholds) of the network to improve the accuracy of the result. This is done by minimizing the observed errors. Learning is complete when examining additional observations does not usefully reduce the error rate. Even after learning, the error rate typically does not reach 0. If, after learning, the error rate is too high, the network typically must be redesigned.
So, to a significant extent, learning is the process of forming associations between related patterns. Human memory connects ideas that 1) are similar, 2) are contrary, 3) occur in close proximity [spatial] or 4) occur in close succession [temporal]. The patterns we associate may be of the same type or sensory modality [a visual image with another visual image] or of different types [a fragrance or feeling with a visual image].

An associative memory network may serve as a highly simplified model of human memory that learns a set of pattern pairs (or associations). Associative memories are systems that associate input patterns with stored patterns or prototypes. Associative memory provides an approach to storing and retrieving data based on content rather than on storage address.
Information recording is the storing of a large set of patterns (a priori information that is memorized),
and
information retrieval/recall is when stored patterns are excited according to the input key pattern.
Each association is an input-output vector pair (s and f, say). For two patterns s and f, two kinds of association are possible:
1) if s=f : then the neural network is called an auto-associative memory
[input pattern: distorted square; output pattern: square]
Auto-associative networks are a special kind of network used to simulate associative processes.
• capable of retrieving a piece of data from one category upon presentation of only partial data of that piece of data
• training input and output vectors are the same
• uses Hebb’s rule/outer product rule to find the weights of an associative memory network [w_ij(new) = w_ij(old) + x_i y_j]
• n input vectors and n output vectors
• input and output are connected through weighted connections

Auto associative memory

2) if s≠f : then the neural network is called a hetero-associative memory

[input pattern: square/distorted square; output pattern: rhomboid]
Hetero-associative networks store input-output pattern pairs so that a stored output pattern can be recalled from a noisy or incomplete version of its input.
• retrieving a piece of data from one category upon presentation of data from another category
• training input and output vectors are different
• uses Hebb’s rule/Delta rule/outer product rule to find the weights of an associative memory network [w_ij(new) = w_ij(old) + x_i y_j]
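The outer product rule w_ij(new) = w_ij(old) + x_i·y_j can be sketched for a small hetero-associative memory. The example assumes NumPy, bipolar patterns and a sign threshold at recall. The two stored pairs are illustrative and were chosen with mutually orthogonal inputs so that recall is clean.

```python
import numpy as np

def store(pairs):
    """Build W by adding the outer product x y^T for every stored pair."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W = np.zeros((n, m))
    for x, y in pairs:
        W += np.outer(x, y)            # w_ij(new) = w_ij(old) + x_i * y_j
    return W

def recall(W, x):
    """One feed-forward pass followed by a sign threshold (assumed here)."""
    return np.where(np.asarray(x) @ W >= 0, 1, -1)

pairs = [
    (np.array([1,  1, -1, -1]), np.array([ 1, -1])),
    (np.array([1, -1,  1, -1]), np.array([-1,  1])),
]
W = store(pairs)

noisy = np.array([1, 1, -1, 1])        # first input with its last component flipped
print(recall(W, noisy))                # recalls the output paired with the first input
```

Setting every output equal to its input in `pairs` turns the same code into an auto-associative memory.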

Hetero associative memory


Here, M = matrix-type operator
Recording = M holds the stored prototype vectors
Retrieval = mapping x → v (an input vector x finds a desired vector v stored as a prototype); it can be linear or non-linear.
I/O relation:
Hetero: [2 sets of prototype vectors]
$x_i \to v_i,\quad v_i \neq x_i \quad \text{for } i = 1, 2, \dots, p$
Auto: [1 set of prototype vectors]
$x_i \to v_i,\quad v_i = x_i \quad \text{for } i = 1, 2, \dots, p$
Whether the memory is static or dynamic is determined by the recall principle.
Static:
• Recalls an output response after the input has been applied, in one feed-forward pass and theoretically without any delay.
• Non-recurrent (no feedback, no delay)

$v_i = M_1[x_i]$
Dynamic:
• Produces recall as a result of output/input feedback interaction, which requires time
• Recurrent, time-delayed
• Dynamically evolves and finally converges to an equilibrium state according to the recursive formula
$v_{i+1} = M_2[x_i, v_i]$

Before training an associative memory (AM) neural network, the original patterns must be converted to an appropriate representation for computation. However, not all representations of the same pattern are equally powerful or efficient. Two common training methods for single-layer nets are usually considered:
1) the Hebbian learning rule and its variations, and
2) gradient descent.
The architecture of an AM can be feed-forward or recurrent (iterative). On that basis, there are 4 types of AM in total:
i) feed-forward hetero-associative
ii) feed-forward auto-associative
iii) recurrent hetero-associative, and
iv) recurrent auto-associative
RECURRENT NEURAL NETWORK

Recurrent neural networks, also known as RNNs, are a class of neural networks
that allow previous outputs to be used as inputs while having hidden states.
Basic feed forward networks “remember” things too, but they remember things
they learnt during training. While RNNs learn similarly while training, in
addition, they remember things learnt from prior input(s) while generating
output(s).
RNNs can take one or more input vectors and produce one or more output
vectors and the output(s) are influenced not just by weights applied on inputs
like a regular NN, but also by a “hidden” state vector representing the context
based on prior input(s)/output(s). So, the same input could produce a different
output depending on previous inputs in the series.
Parameter Sharing: It uses the same parameters for each input as it performs
the same task on all the inputs or hidden layers to produce the output. This
reduces the complexity of parameters, unlike other neural networks.
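The role of the hidden state and of parameter sharing can be sketched with a minimal forward pass. The update h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) used here is the common "vanilla" RNN form and is an assumption, since the notes do not give a formula. All names, sizes and random weights are illustrative.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence; the SAME weights are used at every step."""
    h = np.zeros(W_hh.shape[0])                    # initial hidden state
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state depends on the old state
        outputs.append(W_hy @ h + b_y)
    return np.array(outputs), h

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2
params = (
    rng.normal(scale=0.5, size=(hidden_size, input_size)),   # W_xh
    rng.normal(scale=0.5, size=(hidden_size, hidden_size)),  # W_hh
    rng.normal(scale=0.5, size=(output_size, hidden_size)),  # W_hy
    np.zeros(hidden_size),                                   # b_h
    np.zeros(output_size),                                   # b_y
)

x = rng.normal(size=input_size)
outputs, _ = rnn_forward([x, x], *params)       # the same input presented twice
print(outputs[0])
print(outputs[1])   # differs from outputs[0] because the hidden state has changed
```

The parameter count does not depend on the sequence length: the same call works for a sequence of 2 or 200 inputs.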

The pros and cons of a typical RNN architecture are summed up in the table below:
Advantages | Drawbacks
• Possibility of processing input of any length | ~ Computation being slow
• Model size not increasing with size of input | ~ Difficulty of accessing information from a long time ago
• Computation takes into account historical information | ~ Cannot consider any future input for the current state
• Weights are shared across time | ~ Cannot process very long sequences when using tanh or ReLU as the activation function
RNNs have attributes that have made them very popular for tasks where data
must be handled in a sequential manner.

Types:

• Each rectangle represents a vector
• Arrows represent functions
• Red represents input vectors
• Blue represents output vectors
• Green holds the RNN's state

No. | Name | Description | Example
1 | One-to-one | Also called a plain neural network. It maps a fixed-size input to a fixed-size output, independent of previous information/output. | Image classification
2 | One-to-many | It takes a fixed-size input and gives a sequence of data as output. | Image captioning, which takes an image as input and outputs a sentence of words
3 | Many-to-one | It takes a sequence of information as input and outputs a fixed-size output. | Sentiment analysis, where any sentence is classified as expressing positive or negative sentiment
4 | Many-to-many | It takes a sequence of information as input, processes it recurrently and outputs a sequence of data. | Machine translation, where the RNN reads a sentence in English and then outputs the sentence in French
5 | Bidirectional many-to-many | Synced sequence input and output. Notice that in every case there are no pre-specified constraints on the sequence lengths, because the recurrent transformation (green) is fixed and can be applied as many times as we like. | Video classification, where we wish to label every frame of the video

RNN models are mostly used in the fields of natural language processing and speech recognition. Applications of RNNs include:
i. Machine Translation
ii. Robot control
iii. Time series prediction
iv. Speech recognition
v. Speech synthesis
vi. Time series anomaly detection
vii. Rhythm learning
viii. Music composition
ix. Grammar learning
x. Handwriting recognition
xi. Human action recognition
xii. Protein Homology Detection
xiii. Predicting subcellular localization of proteins
xiv. Several prediction tasks in the area of business process management
xv. Prediction in medical care pathways
MISCELLANEOUS

Type of learning | Type of network | Used for
Supervised learning | Artificial Neural Network | Regression & classification
Supervised learning | Convolutional Neural Network | Computer Vision
Supervised learning | Recurrent Neural Network | Time Series Analysis
Unsupervised learning | Self-organizing maps | Feature detection
Unsupervised learning | Deep Boltzmann machines | Recommendation systems
Unsupervised learning | Autoencoder | Recommendation systems
