Binary step
Pros:
Binary classification
Cons:
Does not work for multilabel classification
Its derivative is always 0, so the gradient cannot be used to update the
weights
Linear
Pros:
Binary and multiclass classification
Highly interpretable
Cons:
The derivative corresponds to "a", so the weight and bias updates during
backpropagation are constant.
Not efficient, since the gradient is always the same.
Tanh
Pros:
Range between -1 and 1
The gradient is stronger than sigmoid's (the derivatives are steeper)
Cons:
Like sigmoid, tanh also has a vanishing gradient problem
Saturation
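The contrast between the two gradients can be checked numerically. A minimal sketch (NumPy assumed; the function names are illustrative), using the closed-form derivatives sigmoid'(x) = s(1 - s) and tanh'(x) = 1 - tanh(x)^2:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # Derivative of sigmoid: s * (1 - s); its maximum is 0.25 (at x = 0).
    s = sigmoid(x)
    return s * (1.0 - s)

def dtanh(x):
    # Derivative of tanh: 1 - tanh(x)^2; equal to 1 at x = 0.
    return 1.0 - np.tanh(x) ** 2
```

At x = 0 the tanh derivative is 1.0 versus 0.25 for sigmoid (the "steeper" gradient), while far from 0 both derivatives collapse toward 0, which is the saturation / vanishing-gradient problem noted above.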
ReLU
It is the most used activation function.
It is easy to compute, so the neural network converges very quickly.
As its derivative is not 0 for positive values of the neuron (f'(x) = 1 for x ≥ 0),
ReLU does not saturate there and those neurons do not die. Saturation and
vanishing gradients occur only for negative values, which ReLU turns into 0.
It is not a zero-centered function.
Pros:
Easy to implement and very fast
True 0 value
Optimization is easy when the activation function is linear
The most used activation function in the neural-network ecosystem
Cons:
The function is not differentiable at x = 0, so the gradient cannot be computed
at that exact point, but in practice this has no influence: the linear part has
a slope of 1 and the negative part is equal to zero.
"Dying ReLU problem": neurons are inactive when their output is 0. There is no
gradient for inactive neurons, so if a large part of the neurons are inactive
the model's performance can suffer.
Not appropriate for the RNN class of algorithms (RNN, LSTM, GRU)
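ReLU and its (sub)gradient are one-liners, which is why it is so cheap to compute. A minimal sketch (NumPy assumed; the names are illustrative):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, np.asarray(x, dtype=float))

def relu_grad(x):
    # Subgradient: 1 for x > 0 and 0 for x < 0; the value used at exactly
    # x = 0 (here 0) is a convention and has no influence in practice.
    return (np.asarray(x, dtype=float) > 0).astype(float)
```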
Leaky ReLU
Pros:
Corrects the "dying ReLU problem"
Same behavior as the ReLU activation function for the part y = x
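The variant described here is commonly implemented as the leaky ReLU, which replaces the zero slope on the negative side with a small fixed slope. A minimal sketch (NumPy assumed; the slope value 0.01 is a common default, not taken from this document):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # y = x for x >= 0 (same as ReLU); y = alpha * x for x < 0, so the
    # gradient is never exactly zero and neurons cannot "die".
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0, x, alpha * x)
```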
PReLU
Pros:
Generalizes the ReLU activation function
Avoids the "dying ReLU problem"
The parameter "a" is learned by the neural network
ReLU-6
Another variation of the ReLU function is ReLU-6, where 6 is an arbitrary
parameter fixed by hand. The advantage is that it caps the output at 6 for
large positive inputs.
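The cap can be sketched as (NumPy assumed; the name is illustrative):

```python
import numpy as np

def relu6(x):
    # Like ReLU, but outputs are clipped to at most 6 (the hand-fixed cap).
    return np.minimum(np.maximum(0.0, np.asarray(x, dtype=float)), 6.0)
```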
Softplus
The softplus activation function is an alternative to the sigmoid and tanh
functions. Those functions are bounded (upper and lower limits), whereas
softplus has the range (0, +inf).
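Softplus is f(x) = log(1 + e^x). A minimal sketch (NumPy assumed), written in a numerically safer form so that the exponential does not overflow for large x:

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), rewritten as max(x, 0) + log(1 + e^-|x|) to avoid
    # overflow; the output is always in (0, +inf).
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))
```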
Softsign
This activation function is a variation of tanh but is rarely used in practice.
The tanh and softsign functions are closely related: tanh converges
exponentially whereas softsign converges polynomially.
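Softsign is f(x) = x / (1 + |x|). A minimal sketch (NumPy assumed):

```python
import numpy as np

def softsign(x):
    # Bounded in (-1, 1) like tanh, but approaches the bounds
    # polynomially rather than exponentially.
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.abs(x))
```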
Softmax
The softmax activation function differs from the others in that it computes a
probability distribution: the sum of the outputs is equal to 1.
The corresponding code:
import numpy

def softmax_active_function(x):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    e = numpy.exp(x - numpy.max(x))
    return e / numpy.sum(e)
Swish
Swish is a newer activation function, published by Google in 2017, that
improves on the performance of ReLU in deeper models. It is a variation of the
sigmoid function, since it can be expressed as x * sigmoid(x).
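Since swish is x * sigmoid(x), it can be written directly as x / (1 + e^-x). A minimal sketch (NumPy assumed; the beta parameter is a common generalization and defaults to the form given above):

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); with beta = 1 this is x * sigmoid(x),
    # written as x / (1 + e^(-beta * x)).
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-beta * x))
```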
GRADIENT DESCENT
Optimization:
Optimization plays an important role in machine learning and deep learning. It
is the task of finding the best (maximum or minimum) value of a function f(x)
with respect to its parameters x.
Optimization algorithms (in case of minimization) have one of the following goals:
Find the global minimum of the objective function. This is feasible if the objective
function is convex, i.e. any local minimum is a global minimum.
Find the lowest possible value of the objective function within its neighborhood.
That's usually the case if the objective function is not convex, as is the case
in most deep learning problems.
Similar to finding the line of best fit in linear regression, the goal of gradient descent is to
minimize the cost function, or the error between predicted and actual y. In order to do
this, it requires two data points—a direction and a learning rate. These factors
determine the partial derivative calculations of future iterations, allowing it to gradually
arrive at the local or global minimum (i.e. point of convergence). More detail on these
components can be found below:
Learning rate (also referred to as step size or the alpha) is the size of the steps
that are taken to reach the minimum. This is typically a small value, and it is
evaluated and updated based on the behavior of the cost function. High learning
rates result in larger steps but risk overshooting the minimum. Conversely, a low
learning rate has small step sizes. While it has the advantage of more precision,
the number of iterations compromises overall efficiency as this takes more time
and computations to reach the minimum.
The cost (or loss) function measures the difference, or error, between actual y
and predicted y at its current position. This improves the machine learning model's
efficacy by providing feedback to the model so that it can adjust the parameters to
minimize the error and find the local or global minimum. It continuously iterates,
moving along the direction of steepest descent (or the negative gradient) until the
cost function is close to or at zero. At this point, the model will stop learning.
Additionally, while the terms cost function and loss function are often treated
as synonymous, there is a slight difference between them: a loss
function refers to the error of one training example, while a cost function calculates
the average error across an entire training set.
https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/
Gradient Descent algorithm:
Gradient descent search determines a weight vector w that minimizes the error E
by starting from some arbitrary initial weight vector and repeatedly modifying
it in small steps. Since the gradient specifies the direction of steepest
ascent in E, its negative gives the direction of steepest descent. At each
step, the weight vector w is altered in the direction of steepest descent along
the error surface. This process continues until a minimum of the error is
reached. To construct a practical algorithm for iteratively updating weights,
we need an efficient way of calculating the gradient at each step. The
gradient, which is the vector of partial derivatives, can be obtained by
differentiating the cost function E. The training rule for gradient descent
(with MSE as the cost function) at a particular point is
wi ← wi + Δwi, with Δwi = η Σd (td − od) xid,
where η is the learning rate and the sum runs over all training examples d.
We see here that to update each ‘wi’, gradient descent uses the summation of errors of all
the data points and hence is also referred to as the batch gradient descent.
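Batch gradient descent on a one-feature linear model can be sketched as follows (NumPy assumed; the learning rate, epoch count, and toy data are illustrative choices, not taken from this document):

```python
import numpy as np

def batch_gradient_descent(x, y, lr=0.05, epochs=500):
    # Fit y ~ w*x + b by stepping against the MSE gradient.
    # Each update sums the error over ALL data points ("batch").
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        err = (w * x + b) - y          # error at the current position
        w -= lr * (2.0 / n) * np.sum(err * x)
        b -= lr * (2.0 / n) * np.sum(err)
    return w, b

# Toy data generated from y = 3x + 1; gradient descent should recover
# w close to 3 and b close to 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0
w, b = batch_gradient_descent(x, y)
```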
Example:
Visit:
https://www.khanacademy.org/math/multivariable-calculus/applications-of-
multivariable-derivatives/optimizing-multivariable-functions/a/what-is-gradient-
descent
Types:
BATCH GRADIENT DESCENT
Batch gradient descent sums the error for each point in a training set, updating the
model only after all training examples have been evaluated. This process is
referred to as a training epoch.
While this batching provides computation efficiency, it can still have a long processing
time for large training datasets as it still needs to store all of the data into memory.
Batch gradient descent also usually produces a stable error gradient and stable
convergence, but sometimes that convergence point isn't ideal: it may find a
local minimum rather than the global one.
STOCHASTIC GRADIENT DESCENT
Stochastic gradient descent (SGD) runs a training epoch for each example in the
dataset, updating the parameters one training example at a time. Since only one
training example needs to be held at a time, SGD is easier on memory. While
these frequent updates can offer more detail and speed, they can reduce
computational efficiency compared to batch gradient descent. The frequent
updates also produce noisy gradients, but this noise can be helpful for
escaping a local minimum and finding the global one.
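The per-example update that distinguishes SGD from the batch version can be sketched as (NumPy assumed; the learning rate, epoch count, and toy data are illustrative):

```python
import numpy as np

def sgd(x, y, lr=0.01, epochs=200, seed=0):
    # Update w and b after EVERY example (in a shuffled order) rather
    # than after a full pass; per-example gradients are noisy but cheap.
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            err = (w * x[i] + b) - y[i]
            w -= lr * 2.0 * err * x[i]
            b -= lr * 2.0 * err
    return w, b

# Same toy data as before: y = 3x + 1, so SGD should approach w = 3, b = 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0
w, b = sgd(x, y)
```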
MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent combines the two previous approaches: it splits the
training dataset into small batches and performs an update after each batch.
This strikes a balance between the computational efficiency of batch gradient
descent and the speed and memory efficiency of stochastic gradient descent.
HOPFIELD NETWORK
Introduction:
A Hopfield network is a special kind of neural network whose response differs
from that of other neural networks: its output is computed by a converging
iterative process. It has just one layer of neurons, and the sizes of the input
and output must be the same. When such a network is used to recognize, for
example, digits, we present a list of correctly rendered digits to the network.
Subsequently, the network can transform a noisy input into the corresponding
perfect output.
In 1982, John Hopfield introduced an artificial neural network to store and retrieve
memory like the human brain.
Here, a neuron either is on (firing) or is off (not firing), a vast simplification of the real
situation.
The state of a neuron (on: +1 or off: -1) will be renewed depending on the input it
receives from other neurons.
A Hopfield network is initially trained to store a number of patterns or memories.
It is then able to recognise any of the learned patterns by exposure to only partial or
even some corrupted information about that pattern, i.e., it eventually settles down and
returns the closest pattern or the best guess.
Thus, like the human brain, the Hopfield model has stability in
pattern recognition.
A Hopfield network is a single-layer, recurrent network:
the neurons are fully connected, i.e., every neuron is
connected to every other neuron.
Given two neurons i and j, there is a connection weight wij
between them, which is symmetric (wij = wji), with zero
self-connectivity (wii = 0).
Below three neurons i = 1, 2, 3 with values xi = ±1 have
connectivity wij; any update has input xi and output yi.
Updating Rule:
o Assume N neurons i = 1, ..., N with values xi = ±1.
o The update rule for node i is: if hi ≥ 0 then xi ← 1, otherwise xi ← −1,
where hi = Σj=1..N wij xj + bi is called the field at i, with bi ∈ R a bias.
o Thus, xi ← sgn(hi), where sgn(r) = 1 if r ≥ 0, and sgn(r) = −1 if r < 0.
o We put bi = 0 for simplicity, as it makes no difference to training the
network with random patterns.
o We therefore assume hi = Σj=1..N wij xj.
o Updates in the Hopfield network can be performed in two different ways:
o Updates in the Hopfield network can be performed in two different ways:
Asynchronous: Only one unit is updated at a time. This unit can be picked at random, or
a pre-defined order can be imposed from the very beginning.
Synchronous: All units are updated at the same time. This requires a central clock to the
system in order to maintain synchronization. This method is viewed by some as less
realistic, based on an absence of observed global clock influencing analogous biological
or physical systems of interest.
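The updating rule above can be sketched end to end (NumPy assumed; storing the pattern with the standard outer-product Hebbian rule is an assumption here, since the text has not yet specified the training rule):

```python
import numpy as np

def hebb_weights(patterns):
    # Store +/-1 patterns (rows) via the outer-product rule, then zero
    # the diagonal so that w_ii = 0 (and w_ij = w_ji by construction).
    p = np.asarray(patterns, dtype=float)
    w = p.T @ p
    np.fill_diagonal(w, 0.0)
    return w

def recall(w, x, sweeps=5):
    # Asynchronous updates in a pre-defined order: one unit at a time,
    # x_i <- sgn(h_i) with h_i = sum_j w_ij x_j (bias b_i = 0).
    x = np.asarray(x, dtype=float).copy()
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = 1.0 if w[i] @ x >= 0.0 else -1.0
    return x

pattern = np.array([1, 1, -1, -1, 1, -1, 1, -1])
w = hebb_weights([pattern])
noisy = pattern.astype(float)
noisy[0] = -noisy[0]                 # corrupt one bit
restored = recall(w, noisy)          # settles back to the stored pattern
```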
Illustration:
http://web.cs.ucla.edu/~rosen/161/notes/hopfield.html
https://www.tutorialspoint.com/artificial_neural_network/artificial_neural_network_hopfie
ld.htm
PS:
There is another computational role for Hopfield nets. Instead of using the net
to store memories, we use it to construct interpretations of sensory input. The
input is represented by the visible units, the interpretation is represented by
the states of the hidden units, and the badness of the interpretation is
represented by the energy.
HEBBIAN LEARNING
At the start, the values of all weights are set to zero. This learning rule can
be used for both soft- and hard-activation functions. Since the desired
responses of the neurons are not used in the learning procedure, this is an
unsupervised learning rule. The absolute values of the weights are usually
proportional to the learning time, which is undesirable.
Hebbian Learning Rule Algorithm :
1. Set all weights to zero, wi = 0 for i = 1 to n, and set the bias to zero.
2. For each input vector S (with target output t), repeat steps 3-5.
3. Set the activations of the input units to the input vector: xi = si for
i = 1 to n.
4. Set the output neuron to the target value: y = t.
5. Update the weights and bias by applying the Hebb rule for all i = 1 to n:
wi(new) = wi(old) + xi * y, b(new) = b(old) + y.
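Steps 1-5 can be sketched as follows (NumPy assumed; the update used is the standard Hebb rule wi <- wi + xi*y with b <- b + y, and the AND example is illustrative):

```python
import numpy as np

def hebb_train(samples, targets):
    # Step 1: weights and bias start at zero. Steps 2-5: for each
    # (input, target) pair set y = t and apply the Hebb update.
    w = np.zeros(len(samples[0]))
    b = 0.0
    for x, t in zip(samples, targets):
        x = np.asarray(x, dtype=float)
        w += x * t       # wi(new) = wi(old) + xi * y
        b += t           # b(new)  = b(old)  + y
    return w, b

# Classic example: the AND function with bipolar (+1/-1) inputs and targets.
samples = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
targets = [1, -1, -1, -1]
w, b = hebb_train(samples, targets)
```

The learned decision sgn(w·x + b) then reproduces AND on all four bipolar inputs.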
Three major points were stated as a part of this learning mechanism:
Information is stored in the connections between neurons in neural
networks, in the form of weights.
Weight change between neurons is proportional to the product of
activation values for neurons.
Static: vi = M1[xi]
Dynamic:
Produce recall as a result of output/input feedback interaction, which
requires time
Recurrent, time-delayed
Dynamically evolve and finally converge to an equilibrium state
according to the recursive formula
vi+1 = M2[xi, vi]
RECURRENT NEURAL NETWORKS (RNN)
Recurrent neural networks, also known as RNNs, are a class of neural networks
that allow previous outputs to be used as inputs while having hidden states.
Basic feedforward networks "remember" things too, but only things they learned
during training. RNNs also learn during training, but in addition they remember
things learned from prior input(s) while generating output(s).
RNNs can take one or more input vectors and produce one or more output
vectors and the output(s) are influenced not just by weights applied on inputs
like a regular NN, but also by a “hidden” state vector representing the context
based on prior input(s)/output(s). So, the same input could produce a different
output depending on previous inputs in the series.
Parameter Sharing: It uses the same parameters for each input as it performs
the same task on all the inputs or hidden layers to produce the output. This
reduces the complexity of parameters, unlike other neural networks.
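The hidden-state recurrence and parameter sharing can be sketched as a vanilla RNN cell unrolled over time (NumPy assumed; the sizes and random weights are illustrative):

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    # The SAME W_x, W_h, b are reused at every time step (parameter
    # sharing); the hidden state h carries context from prior inputs.
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return states

# Tiny illustrative sizes: 2-dim inputs, 3-dim hidden state.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 2))
W_h = rng.normal(size=(3, 3))
b = np.zeros(3)
seq = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
states = rnn_forward(seq, W_x, W_h, b)
```

Note that the first and third inputs are identical, yet their hidden states differ, because the state also depends on what came before.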
The pros and cons of a typical RNN architecture are summed up below:
Advantages:
• Possibility of processing input of any length
• Model size does not increase with the size of the input
• Computation takes into account historical information
• Weights are shared across time
Drawbacks:
• Computation is slow
• Difficulty accessing information from a long time ago
• Cannot consider any future input for the current state
• Cannot process very long sequences when using tanh or ReLU as the
activation function
RNNs have attributes that have made them very popular for tasks where data
must be handled in a sequential manner.
Applications:
RNN models are mostly used in the fields of natural language processing and
speech recognition. Apart from these, RNNs are widely used in:
i. Machine Translation
ii. Robot control
iii. Time series prediction
iv. Speech recognition
v. Speech synthesis
vi. Time series anomaly detection
vii. Rhythm learning
viii. Music composition
ix. Grammar learning
x. Handwriting recognition
xi. Human action recognition
xii. Protein Homology Detection
xiii. Predicting subcellular localization of proteins
xiv. Several prediction tasks in the area of business process management
xv. Prediction in medical care pathways
MISCELLANEOUS