Neural Networks
The essence of non-linearity
The universal approximation theorem implies that a neural network can approximate
any continuous function that maps inputs (X) to outputs (y). The ability to represent
any such function is what makes neural networks so powerful and widely used.
Neural networks are combinations of layers that contain many nodes. Thus, the
building process starts with a node. The following represents a node without an
activation function.
A neuron without an activation function (image by author)
The output y is a linear combination of inputs and a bias. We need to somehow add an
element of non-linearity. Consider the following node structure.
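Since the node figures are not reproduced here, the following is a minimal sketch of a node with and without an activation function (the weights, bias, and input values are illustrative, not taken from the figures):

```python
import numpy as np

def node(x, w, b, activation=None):
    # linear combination of inputs and a bias
    y = np.dot(w, x) + b
    # optionally apply a non-linear activation to the linear output
    return activation(y) if activation else y

x = np.array([0.5, -1.0])   # illustrative inputs
w = np.array([2.0, 1.0])    # illustrative weights
b = 0.5
print(node(x, w, b))            # linear output: 0.5
print(node(x, w, b, np.tanh))   # non-linear output
```

Without the activation, the output is just the linear combination w·x + b; passing an activation such as np.tanh introduces the non-linearity discussed below.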
In this post, we will talk about 5 commonly used activations in neural networks.
1. Sigmoid
The sigmoid function bounds a range of values between 0 and 1. It is also used in
logistic regression models.
Whatever the input values to a sigmoid function are, the output values will be between
0 and 1. Thus, the output of each neuron is normalized into the range 0–1.
The output (y) is more sensitive to changes in the input (x) for x values close to 0.
As the input values move away from zero, the output becomes less sensitive.
After some point, even a large change in the input values results in little to no change in
the output value. That is how the sigmoid function achieves non-linearity.
There is a downside associated with this non-linearity. Let’s first see the derivative of
the sigmoid function.
The derivative tends towards zero as we move away from zero. The “learning” process of
a neural network depends on the derivative because the weights are updated based on
the gradient, which basically is the derivative of a function. If the gradient is very close to
zero, weights are updated in very small increments. This results in a neural network
that learns very slowly and takes forever to converge. This is known as the vanishing
gradient problem.
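To see the vanishing gradient numerically, here is a small sketch (not taken from the post's figures) that evaluates the sigmoid derivative at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

# the derivative peaks at 0.25 for x = 0 and shrinks rapidly away from zero
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_derivative(x))
```

At x = 10 the gradient is already on the order of 1e-5, so weight updates driven by it are tiny.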
2. Tanh
It is very similar to the sigmoid except that the output values are in the range of -1 to
+1. Thus, tanh is said to be zero centered.
The difference between the sigmoid and tanh is that the gradients are not restricted to
move in one direction for tanh. Thus, tanh is likely to converge faster than the sigmoid
function.
The vanishing gradient problem also exists for the tanh activation function.
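A quick sketch comparing the two output ranges (illustrative code, not from the post):

```python
import numpy as np

x = np.linspace(-5, 5, 101)
s = 1.0 / (1.0 + np.exp(-x))  # sigmoid: outputs in (0, 1), never negative
t = np.tanh(x)                # tanh: outputs in (-1, 1), zero centered
print(s.min(), s.max())
print(t.min(), t.max())
print(np.tanh(0.0))           # exactly 0 at the origin
```

Because tanh outputs are centered on zero, gradients during training can be either positive or negative rather than all sharing the sign imposed by a strictly positive activation.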
3. ReLU
The relu function is only interested in the positive values. It keeps the input values
greater than 0 as is. All the input values less than zero become 0.
The output value of a neuron can be less than 0. If we apply the relu
function to the output of such a neuron, every negative value returned from that neuron becomes
0. Thus, relu allows cancelling out some of the neurons.
With the relu function, we are able to activate only some of the neurons, whereas all of
the neurons are activated with tanh and sigmoid, which results in intense
computations. Thus, relu converges faster than tanh and sigmoid.
The derivative of relu is 0 for input values less than 0. For those values, the weights
are never updated during back-propagation and thus the neural network cannot learn.
This issue is known as dying relu problem.
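The behavior described above can be sketched as follows (illustrative code):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # gradient is 1 for positive inputs and 0 otherwise, so neurons
    # stuck in the negative region stop receiving weight updates
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```

A neuron whose inputs always land in the negative region has a zero gradient everywhere, which is exactly the dying relu problem.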
4. Leaky ReLU
It can be considered as a solution to the dying relu problem. Leaky relu outputs a
small value for negative inputs.
Although leaky relu seems to solve the dying relu problem, some argue that there
is no significant difference in accuracy in most cases. It likely comes down to
trying both and seeing if there is any difference for a particular task.
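A minimal leaky relu sketch (using the conventional small slope α = 0.01):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # negative inputs are scaled by a small slope instead of being zeroed
    return np.where(x > 0, x, alpha * x)

x = np.array([-100.0, -1.0, 0.0, 1.0])
print(leaky_relu(x))  # [-1.   -0.01  0.    1.  ]
```

Because negative inputs still produce a small non-zero output and gradient, neurons in the negative region can keep learning.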
5. Softmax
Softmax is usually used in multi-class classification tasks and applied to the output
neurons. It normalizes the output values into a probability distribution
in which the probability values add up to 1.
The softmax function divides the exponential of each output by the sum of the
exponentials of all the outputs. The resulting values form a probability distribution
with probabilities that add up to 1.
Let’s do an example. Consider a case in which the target variable has 4 classes. The
following is the output of the neural network for 5 different data points (i.e.
observations).
Each column represents the output for an observation (image by author)
In the first line, we applied the softmax function to the values in matrix A. The second
line reduced the floating point precision to 2 decimals.
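Since the code figure is not reproduced here, the following sketch reproduces the idea with hypothetical values for matrix A (4 classes × 5 observations); the actual numbers in the original figure are not available:

```python
import numpy as np

# hypothetical network outputs: 4 classes (rows) x 5 observations (columns)
A = np.array([[1.0, 0.2, 3.1, 0.5, 0.9],
              [0.4, 2.5, 0.3, 0.5, 1.1],
              [2.2, 0.1, 0.8, 0.5, 0.2],
              [0.3, 1.0, 0.1, 0.5, 2.6]])

def softmax(a, axis=0):
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

B = softmax(A)          # apply softmax to each column (observation)
print(np.round(B, 2))   # reduce the floating point precision to 2 decimals
print(B.sum(axis=0))    # each column sums to 1
```

Each column of B is a probability distribution over the 4 classes for one observation.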
Conclusion
There is no free lunch! Activation functions also cause a burden to neural networks in
terms of computational complexity. They also have an impact on the convergence of
models.
It is important to know the properties of the activation functions and how they
behave so that we can choose the activation function that best fits a particular task.
Desirable properties include being computationally inexpensive and zero centered.
Introduction
In deep learning, a neural network without an activation function is just a linear regression model:
activation functions perform the non-linear computations on the input of a neural network that make it
capable of learning and performing more complex tasks. It is therefore essential to study the derivatives
and implementation of activation functions, and to analyze the benefits and downsides of each, in order
to choose the right type of activation function that can provide non-linearity and accuracy in a specific
neural network model.
Table of Contents
1. Introduction to Activation Functions
2. Types of Activation Functions
3. Activation Functions and their Derivatives
4. Implementation using Python
5. Pros and Cons of Activation Functions
Activation functions are functions used in a neural network that act on the weighted sum of inputs and
biases and are used to decide whether a neuron should be activated or not. An activation function
manipulates the presented data and produces an output for the neural network.
The activation functions are also referred to as transfer functions in some literature. They can be either
linear or nonlinear, depending on the function they represent, and are used to control the output of
neural networks across different domains.
For a linear model, a linear mapping from input to output is performed in the hidden layers
before the final prediction for each label is given. The transformation of the input vector x is given by
f(x) = w^T · x + b
The mappings of the above equation produce linear results, and the need for the activation
function arises here: first to convert these linear outputs into non-linear outputs for further computation,
and then to learn the patterns in the data. The output of these models is given by
y = (w1 x1 + w2 x2 + … + wn xn + b)
The outputs of each layer are fed into the next layer in multilayered networks until the
final output is obtained, but they are linear by default. The expected output determines the type
of activation function to be deployed in a given network.
However, since these outputs are linear in nature, nonlinear activation functions are required to convert
the linear inputs into non-linear outputs. These transfer functions are applied to the outputs of the linear
models to produce transformed non-linear outputs that are ready for further processing. The non-linear
output after the application of the activation function is given by
y = α (w1 x1 + w2 x2 + … + wn xn + b)
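The role of α can be illustrated by showing that, without it, stacked linear layers collapse into a single linear mapping (a sketch with random illustrative weights):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=3)
W1 = rng.normal(size=(4, 3)); b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4)); b2 = rng.normal(size=2)

# two linear layers collapse into a single linear map
two_layers = W2 @ (W1 @ x + b1) + b2
collapsed  = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, collapsed))  # True

# inserting a non-linearity (here tanh, standing in for α) breaks the collapse
with_act = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(with_act, collapsed))    # False
```

This is why depth only pays off when a non-linear α is applied between the linear layers.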
The need for these activation functions includes converting the linear input signals and models into non-
linear output signals, which aids the learning of high order polynomials for deeper networks.
Linear summation of inputs: In the above diagram, the node has two inputs x1, x2 with weights w1, w2, and
bias b, and the linear sum is z = w1 x1 + w2 x2 + b (more generally, z = w1 x1 + w2 x2 + … + wn xn + b).
Activation computation: This computation decides whether a neuron should be activated or not
by calculating the weighted sum and adding the bias to it. The purpose of the activation
function is to introduce non-linearity into the output of a neuron.
Most neural networks begin by computing the weighted sum of the inputs. Each node in a layer can
have its own unique weights. However, the activation function is the same across all nodes in the
layer. Activation functions are typically of a fixed form, whereas the weights are the learned
parameters.
A proper choice of activation function has to be made to improve the results in neural
network computing. Ideally, an activation function should be differentiable and lead to quick
convergence with respect to the weights for optimization purposes; monotonicity is also a commonly
desired property.
f(x) = ax + c
The problem with this activation is that it cannot be confined to a specific range. Applying this function
in all the nodes makes the network work like linear regression: the final layer of the neural
network will be a linear function of the first layer. Another issue concerns gradient descent:
when the function is differentiated, the result is a constant. During backpropagation the rate of
change of error is then constant, carrying no information about the input, which undermines the logic
of backpropagation.
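A small sketch illustrating the constant-gradient problem (the coefficients a and c are illustrative):

```python
def linear(x, a=2.0, c=1.0):
    # linear activation f(x) = ax + c
    return a * x + c

def numerical_derivative(f, x, h=1e-6):
    # central finite-difference estimate of the derivative
    return (f(x + h) - f(x - h)) / (2 * h)

# the gradient is the same constant a everywhere,
# so it carries no information about the input during backpropagation
for x in [-10.0, 0.0, 10.0]:
    print(round(numerical_derivative(linear, x), 6))  # 2.0 each time
```

No matter where the input lies, the weight update signal is identical, so the network cannot distinguish inputs through the gradient.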
Leaky ReLU is a variant of ReLU. Instead of being 0 when z < 0, a leaky ReLU allows a small, non-zero,
constant gradient α (normally, α = 0.01). However, the consistency of the benefit across tasks is presently
unclear. Leaky ReLUs attempt to fix the “dying ReLU” problem.
PReLU gives the neurons the ability to choose what slope is best in the negative region. They can
become ReLU or leaky ReLU with certain values of α.
d) Maxout:
The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a piecewise
linear function that returns the maximum of several linear functions of its input, designed to be used in
conjunction with the dropout regularization technique. Both ReLU and leaky ReLU are special cases of
Maxout. The Maxout neuron, therefore, enjoys all the benefits of a ReLU unit and does not have
drawbacks like dying ReLU. However, it doubles the total number of parameters for each neuron, and
hence a higher total number of parameters needs to be trained.
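A minimal Maxout sketch (the weight values are illustrative); choosing the pieces z and 0 recovers ReLU, showing it is a special case:

```python
import numpy as np

def maxout(x, W, b):
    # W has shape (k, d): k linear pieces per unit; output is their maximum
    return np.max(W @ x + b, axis=0)

x = np.array([1.0, -2.0])
# with pieces z = x1 + x2 and 0, maxout reduces to ReLU: max(0, z)
W = np.array([[1.0, 1.0],
              [0.0, 0.0]])
b = np.array([0.0, 0.0])
print(maxout(x, W, b))  # max(x1 + x2, 0) = max(-1, 0) = 0.0
```

Each extra linear piece adds a full set of weights and biases, which is where the doubling (or more) of parameters comes from.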
e) ELU
The Exponential Linear Unit or ELU is a function that tends to converge faster and produce more
accurate results. Unlike other activation functions, ELU has an extra alpha constant, which should be a
positive number. ELU is very similar to ReLU except for negative inputs: both are in the identity-function
form for non-negative inputs. For negative inputs, ELU smoothly saturates until its output
equals -α, whereas ReLU changes slope sharply at zero.
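A minimal ELU sketch (with α = 1.0, an illustrative choice):

```python
import numpy as np

def elu(x, alpha=1.0):
    # identity for non-negative inputs;
    # smooth exponential approach to -alpha for negative inputs
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 2.0])
print(elu(x))  # large negative inputs saturate near -alpha
```

For x = -10 the output is already within about 1e-4 of -α, illustrating the smooth saturation.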
The choice of activation function depends on the problem type and the value range of the expected
output. For example, to predict values that are larger than 1, tanh or sigmoid are not suitable for the
output layer; instead, ReLU can be used. On the other hand, if the output values have to be in the range
(0, 1) or (-1, 1), then ReLU is not a good choice, and sigmoid or tanh can be used. When performing a
classification task and using the neural network to predict a probability distribution over mutually
exclusive class labels, the softmax activation function should be used in the last layer. For the hidden
layers, as a rule of thumb, use ReLU.
In the case of a binary classifier, the sigmoid activation function should be used in the output layer. The
sigmoid and tanh activation functions can perform poorly in hidden layers because of saturation. For
hidden layers, ReLU or its improved variant leaky ReLU should be used. For a multiclass classifier,
softmax is the most commonly used output activation. Though more activation functions are known,
these are the most widely used.
Having learned the types and significance of each activation function, it is also essential to implement
some basic (non-linear) activation functions in Python and observe the output for a clearer
understanding of the concepts:
Sigmoid Activation Function
import matplotlib.pyplot as plt
import numpy as np
def sigmoid(x):
    s = 1 / (1 + np.exp(-x))
    ds = s * (1 - s)
    return s, ds

x = np.arange(-6, 6, 0.01)
fig, ax = plt.subplots(figsize=(9, 5))
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.plot(x,sigmoid(x)[0], color="#307EC7", linewidth=3, label="sigmoid")
ax.plot(x,sigmoid(x)[1], color="#9621E2", linewidth=3, label="derivative")
ax.legend(loc="upper right", frameon=False)
fig.show()
Observations:
The sigmoid function has values between 0 and 1.
The output is not zero-centered.
Sigmoids saturate and kill gradients.
At the saturation levels of the sigmoid function, the curve changes slowly; the derivative curve
above shows that the slope or gradient there is close to zero.
Tanh
Pros: The gradient is stronger for tanh than for sigmoid, i.e. the derivatives are steeper.
Cons: Tanh also has a vanishing gradient problem.

ReLU
Pros: It avoids and rectifies the vanishing gradient problem. ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
Cons: It should only be used within hidden layers of a neural network model. Some gradients can be fragile during training and can die: a weight update can leave a neuron that never activates on any data point again. Thus, ReLU can even result in dead neurons.

Leaky ReLU
Pros: Leaky ReLU is one attempt to fix the “dying ReLU” problem by having a small negative slope.
Cons: Because it possesses linearity, it can’t be used for complex classification, and it lags behind sigmoid and tanh for some use cases.

ELU
Pros: Unlike ReLU, ELU can produce negative outputs.
Cons: For x > 0, it can blow up the activation, with the output range of [0, ∞).
When all is said and done, the actual purpose of an activation function is to add some kind of
non-linear property to the neural network. Without activation functions, a neural network could
perform only linear mappings from the inputs to the outputs, and the mathematical operations
throughout the forward propagation would be dot products between an input vector and a weight matrix.
Since a single dot product is a linear operation, successive dot products would be nothing more than
multiple linear operations repeated one after another, and successive linear operations can be collapsed
into a single one. To be able to compute really interesting things, neural networks must be able to
approximate non-linear relations from input features to the output labels. The more complicated the
data, the more non-linear the mapping of features to the ground truth label will usually be. Without an
activation function, a neural network would not be able to realize such complicated mappings
mathematically and would not be able to solve the tasks the network is actually meant to solve.
As such, a careful choice of activation function must be made for each deep learning
neural network project.
In this tutorial, you will discover how to choose activation functions for neural network
models.
Tutorial Overview
This tutorial is divided into three parts; they are:
1. Activation Functions
2. Activation for Hidden Layers
3. Activation for Output Layers
Activation Functions
An activation function in a neural network defines how the weighted sum of the input is
transformed into an output from a node or nodes in a layer of the network.
Sometimes the activation function is called a “transfer function.” If the output range of
the activation function is limited, then it may be called a “squashing function.” Many
activation functions are nonlinear and may be referred to as the “nonlinearity” in the
layer or the network design.
The choice of activation function has a large impact on the capability and performance of
the neural network, and different activation functions may be used in different parts of
the model.
Technically, the activation function is used within or after the internal processing of each
node in the network, although networks are designed to use the same activation function
for all nodes in a layer.
A network may have three types of layers: input layers that take raw input from the
domain, hidden layers that take input from another layer and pass output to another
layer, and output layers that make a prediction.
All hidden layers typically use the same activation function. The output layer will
typically use a different activation function from the hidden layers and is dependent upon
the type of prediction required by the model.
Activation functions are also typically differentiable, meaning the first-order derivative
can be calculated for a given input value. This is required given that neural networks are
typically trained using the backpropagation of error algorithm that requires the derivative
of prediction error in order to update the weights of the model.
There are many different types of activation functions used in neural networks, although
perhaps only a small number of them are used in practice for hidden and output layers.
Let’s take a look at the activation functions used for each type of layer in turn.
A hidden layer does not directly contact input data or produce outputs for a model, at
least in general.
In order to get access to a much richer hypothesis space that would benefit from deep
representations, you need a non-linearity, or activation function.
There are perhaps three activation functions you may want to consider for use in hidden
layers; they are:
ReLU
Sigmoid
Tanh
The ReLU activation function is calculated as follows:
max(0.0, x)
This means that if the input value (x) is negative, then a value 0.0 is returned; otherwise,
the value is returned.
You can learn more about the details of the ReLU activation function in this tutorial:
# plot inputs and outputs for the rectified linear function
from matplotlib import pyplot

def rectified(x):
    return max(0.0, x)

# define input data
inputs = [x for x in range(-10, 11)]
# calculate outputs
outputs = [rectified(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.
We can see the familiar kink shape of the ReLU activation function.
Plot of Inputs vs. Outputs for the ReLU Activation Function.
When using the ReLU function for hidden layers, it is a good practice to use a “He
Normal” or “He Uniform” weight initialization and scale input data to the range 0-1
(normalize) prior to training.
Sigmoid Hidden Layer Activation Function
The sigmoid activation function is also called the logistic function.
The function takes any real value as input and outputs values in the range 0 to 1. The
larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to 0.0.
from math import exp
from matplotlib import pyplot

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 11)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.
When using the Sigmoid function for hidden layers, it is a good practice to use a “Xavier
Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization,
named for Xavier Glorot) and scale input data to the range 0-1 (e.g. the range of the
activation function) prior to training.
Tanh Hidden Layer Activation Function
The hyperbolic tangent activation function is also referred to simply as the Tanh (also
“tanh” and “TanH“) function.
It is very similar to the sigmoid activation function and even has the same S-shape.
The function takes any real value as input and outputs values in the range -1 to 1. The
larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to -1.0.
The Tanh activation function is calculated as follows:
from math import exp
from matplotlib import pyplot

def tanh(x):
    return (exp(x) - exp(-x)) / (exp(x) + exp(-x))

# define input data
inputs = [x for x in range(-10, 11)]
# calculate outputs
outputs = [tanh(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.
Traditionally, the sigmoid activation function was the default activation function in the
1990s. Perhaps through the mid to late 1990s to 2010s, the Tanh function was the default
activation function for hidden layers.
… the hyperbolic tangent activation function typically performs better than the logistic
sigmoid.
Modern neural network models with common architectures, such as MLP and CNN, will
make use of the ReLU activation function, or extensions.
In modern neural networks, the default recommendation is to use the rectified linear unit
or ReLU …
There are perhaps three activation functions you may want to consider for use in the
output layer; they are:
Linear
Logistic (Sigmoid)
Softmax
This is not an exhaustive list of activation functions used for output layers, but they are
the most commonly used.
We can get an intuition for the shape of this function with the worked example below.
# example plot for the linear activation function
from matplotlib import pyplot

def linear(x):
    return x

# define input data
inputs = [x for x in range(-10, 11)]
# calculate outputs
outputs = [linear(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.
We can see a diagonal line shape where inputs are plotted against identical outputs.
Target values used to train a model with a linear activation function in the output layer
are typically scaled prior to modeling using normalization or standardization transforms.
from math import exp
from matplotlib import pyplot

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 11)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.
e^x / sum(e^x)
Where x is a vector of outputs and e is a mathematical constant that is the base of the
natural logarithm.
You can learn more about the details of the Softmax function in this tutorial:
from numpy import exp

def softmax(x):
    return exp(x) / exp(x).sum()

# define input data (illustrative values)
inputs = [1.0, 3.0, 2.0]
# calculate outputs
outputs = softmax(inputs)
# report the probabilities
print(outputs)
# report the sum of the probabilities
print(outputs.sum())
Running the example calculates the softmax output for the input vector.
We then confirm that the sum of the outputs of the softmax indeed sums to the value 1.0.
1.0
Target labels used to train a model with the softmax activation function in the output
layer will be vectors with 1 for the target class and 0 for all other classes.
For example, you may divide prediction problems into two main groups, predicting a
categorical variable (classification) and predicting a numerical variable (regression).
If your problem is a regression problem, you should use a linear activation function.
If there are two mutually exclusive classes (binary classification), then your output layer
will have one node and a sigmoid activation function should be used. If there are more
than two mutually exclusive classes (multiclass classification), then your output layer
will have one node per class and a softmax activation should be used. If there are two or
more mutually inclusive classes (multilabel classification), then your output layer will
have one node for each class and a sigmoid activation function is used.
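The rules above can be collected into a small helper; the function name and returned keys are hypothetical, purely illustrative of the mapping:

```python
# hypothetical helper summarizing the output-layer rules above
def output_layer_config(task, n_classes=None):
    if task == 'regression':
        # predicting a numerical variable: one node, linear activation
        return {'nodes': 1, 'activation': 'linear'}
    if task == 'binary':
        # two mutually exclusive classes: one node, sigmoid
        return {'nodes': 1, 'activation': 'sigmoid'}
    if task == 'multiclass':
        # more than two mutually exclusive classes: one node per class, softmax
        return {'nodes': n_classes, 'activation': 'softmax'}
    if task == 'multilabel':
        # mutually inclusive classes: one node per class, sigmoid
        return {'nodes': n_classes, 'activation': 'sigmoid'}
    raise ValueError('unknown task: %s' % task)

print(output_layer_config('multiclass', n_classes=4))
```

The helper simply encodes the prediction-type decision tree from the text; real frameworks express the same choice through the final layer's size and activation arguments.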
In a neural network, the activation function is responsible for transforming the summed
weighted input from the node into the activation of the node or output for that input.
The rectified linear activation function or ReLU for short is a piecewise linear function
that will output the input directly if it is positive, otherwise, it will output zero. It has
become the default activation function for many types of neural networks because a
model that uses it is easier to train and often achieves better performance.
In this tutorial, you will discover the rectified linear activation function for deep learning
neural networks.
The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the
vanishing gradient problem.
The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn
faster and perform better.
The rectified linear activation is the default activation when developing multilayer Perceptron and convolutional
neural networks.
Kick-start your project with my new book Better Deep Learning, including step-by-step
tutorials and the Python source code files for all examples.
Let’s get started.
Jun/2019: Fixed error in the equation for He weight initialization (thanks Maltev).
Tutorial Overview
This tutorial is divided into six parts.
For a given node, the inputs are multiplied by the weights in a node and summed
together. This value is referred to as the summed activation of the node. The summed
activation is then transformed via an activation function and defines the specific output
or “activation” of the node.
The simplest activation function is referred to as the linear activation, where no
transform is applied at all. A network comprised of only linear activation functions is very
easy to train, but cannot learn complex mapping functions. Linear activation functions
are still used in the output layer for networks that predict a quantity (e.g. regression
problems).
Nonlinear activation functions are preferred as they allow the nodes to learn more
complex structures in the data. Traditionally, two widely used nonlinear activation
functions are the sigmoid and hyperbolic tangent activation functions.
The sigmoid activation function, also called the logistic function, is traditionally a very
popular activation function for neural networks. The input to the function is transformed
into a value between 0.0 and 1.0. Inputs that are much larger than 1.0 are transformed to
the value 1.0, similarly, values much smaller than 0.0 are snapped to 0.0. The shape of
the function for all possible inputs is an S-shape from zero up through 0.5 to 1.0. For a
long time, through the early 1990s, it was the default activation used on neural networks.
The hyperbolic tangent function, or tanh for short, is a similar shaped nonlinear activation
function that outputs values between -1.0 and 1.0. In the later 1990s and through the
2000s, the tanh function was preferred over the sigmoid activation function as models
that used it were easier to train and often had better predictive performance.
… the hyperbolic tangent activation function typically performs better than the logistic
sigmoid.
The limited sensitivity and saturation of the function happen regardless of whether the
summed activation from the node provided as input contains useful information or not.
Once saturated, it becomes challenging for the learning algorithm to continue to adapt
the weights to improve the performance of the model.
… sigmoidal units saturate across most of their domain—they saturate to a high value
when z is very positive, saturate to a low value when z is very negative, and are only
strongly sensitive to their input when z is near 0.
Layers deep in large networks using these nonlinear activation functions fail to receive
useful gradient information. Error is back propagated through the network and used to
update the weights. The amount of error decreases dramatically with each additional
layer through which it is propagated, given the derivative of the chosen activation
function. This is called the vanishing gradient problem and prevents deep (multi-layered)
networks from learning effectively.
Vanishing gradients make it difficult to know which direction the parameters should
move to improve the cost function
How to Fix Vanishing Gradients Using the Rectified Linear Activation Function
Although the use of nonlinear activation functions allows neural networks to learn
complex mapping functions, they effectively prevent the learning algorithm from working
with deep networks.
Workarounds were found in the late 2000s and early 2010s using alternate network types
such as Boltzmann machines and layer-wise training or unsupervised pre-training.
The solution had been bouncing around in the field for some time, although was not
highlighted until papers in 2009 and 2011 shone a light on it.
The solution is to use the rectified linear activation function, or ReL for short.
A node or unit that implements this activation function is referred to as a rectified linear
activation unit, or ReLU for short. Often, networks that use the rectifier function for the
hidden layers are referred to as rectified networks.
Adoption of ReLU may easily be considered one of the few milestones in the deep
learning revolution, e.g. the techniques that now permit the routine development of very
deep neural networks.
[another] major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units.
if input > 0.0:
    return input
else:
    return 0
We can describe this function g() mathematically using the max() function over the set
of 0.0 and the input z; for example:
g(z) = max{0, z}
The function is linear for values greater than zero, meaning it has a lot of the desirable
properties of a linear activation function when training a neural network using
backpropagation. Yet, it is a nonlinear function as negative values are always output as
zero.
Because rectified linear units are nearly linear, they preserve many of the properties that
make linear models easy to optimize with gradient-based methods. They also preserve
many of the properties that make linear models generalize well.
Perhaps the simplest implementation is using the max() function; for example:
# rectified linear function
def rectified(x):
    return max(0.0, x)
We expect that any positive value will be returned unchanged whereas an input value of
0.0 or a negative value will be returned as the value 0.0.
Below are a few examples of inputs and outputs of the rectified linear activation function.
def rectified(x):
    return max(0.0, x)

# demonstrate with a positive input
x = 1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = 1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a zero input
x = 0.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a negative input
x = -1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = -1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
Running the example, we can see that positive values are returned regardless of their
size, whereas negative values are snapped to the value 0.0.
rectified(1.0) is 1.0
rectified(1000.0) is 1000.0
rectified(0.0) is 0.0
rectified(-1.0) is 0.0
rectified(-1000.0) is 0.0
We can get an idea of the relationship between inputs and outputs of the function by
plotting a series of inputs and the calculated outputs.
The example below generates a series of integers from -10 to 10 and calculates the
rectified linear activation for each input, then plots the result.
from matplotlib import pyplot

def rectified(x):
    return max(0.0, x)

# define a series of inputs
series_in = [x for x in range(-10, 11)]
# calculate outputs for our inputs
series_out = [rectified(x) for x in series_in]
# line plot of raw inputs to rectified outputs
pyplot.plot(series_in, series_out)
pyplot.show()
Running the example creates a line plot showing that all negative values and zero inputs
are snapped to 0.0, whereas the positive outputs are returned as-is, resulting in a linearly
increasing slope, given that we created a linearly increasing series of positive values
(e.g. 1 to 10).
Line Plot of Rectified Linear Activation for Negative and Positive Inputs
The derivative of the rectified linear function is also easy to calculate. Recall that the
derivative of the activation function is required when updating the weights of a node as
part of the backpropagation of error.
The derivative of the function is the slope. The slope for negative values is 0.0 and the
slope for positive values is 1.0.
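We can check these slopes numerically with a finite-difference sketch (illustrative, not part of the tutorial's listings):

```python
def rectified(x):
    return max(0.0, x)

def numerical_slope(f, x, h=1e-6):
    # central finite-difference estimate of the slope
    return (f(x + h) - f(x - h)) / (2 * h)

# slope is 0.0 for negative inputs and 1.0 for positive inputs
print(round(numerical_slope(rectified, -3.0), 6))  # 0.0
print(round(numerical_slope(rectified,  3.0), 6))  # 1.0
```

The only awkward point is x = 0.0, where the two one-sided slopes disagree; as the text notes, this is not a problem in practice.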
Traditionally, the field of neural networks has avoided any activation function that was
not completely differentiable, perhaps delaying the adoption of the rectified linear
function and other piecewise-linear functions. Technically, we cannot calculate the
derivative when the input is 0.0, therefore, we can assume it is zero. This is not a
problem in practice.
For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0.
This may seem like it invalidates g for use with a gradient-based learning algorithm. In
practice, gradient descent still performs well enough for these models to be used for
machine learning tasks.
As such, it is important to take a moment to review some of the benefits of the approach, first highlighted by Xavier Glorot, et al. in their milestone 2011 paper on using ReLU titled “Deep Sparse Rectifier Neural Networks“.
1. Computational Simplicity.
The rectifier function is trivial to implement, requiring a max() function.
This is unlike the tanh and sigmoid activation functions, which require an exponential calculation.
Computations are also cheaper: there is no need for computing the exponential function
in activations
2. Representational Sparsity.
An important benefit of the rectifier function is that it is capable of outputting a true zero value. This is unlike the tanh and sigmoid activation functions, which learn to approximate a zero output, e.g. a value very close to zero, but not a true zero value.
This means that negative inputs can output true zero values allowing the activation of
hidden layers in neural networks to contain one or more true zero values. This is called a
sparse representation and is a desirable property in representational learning as it can
accelerate learning and simplify the model.
An area where efficient representations such as sparsity are studied and sought is in
autoencoders, where a network learns a compact representation of an input (called the
code layer), such as an image or series, before it is reconstructed from the compact
representation.
One way to achieve actual zeros in h for sparse (and denoising) autoencoders […] The
idea is to use rectified linear units to produce the code layer. With a prior that actually
pushes the representations to zero (like the absolute value penalty), one can thus
indirectly control the average number of zeros in the representation.
3. Linear Behavior.
In general, a neural network is easier to optimize when its behavior is linear or close to linear.
Rectified linear units […] are based on the principle that models are easier to optimize if
their behavior is closer to linear.
Because of this linearity, gradients flow well on the active paths of neurons (there is no
gradient vanishing effect due to activation non-linearities of sigmoid or tanh units).
4. Train Deep Networks.
In turn, cumbersome networks such as Boltzmann machines could be left behind, as well as cumbersome training schemes such as layer-wise training and unlabeled pre-training.
… deep rectifier networks can reach their best performance without requiring any
unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence,
these results can be seen as a new milestone in the attempts at understanding the
difficulty in training deep but purely supervised neural networks, and closing the
performance gap between neural networks learnt with and without unsupervised pre-
training.
For modern deep learning neural networks, the default activation function is the rectified
linear activation function.
Prior to the introduction of rectified linear units, most neural networks used the logistic
sigmoid activation function or the hyperbolic tangent activation function.
If in doubt, start with ReLU in your neural network, then perhaps try other piecewise
linear activation functions to see how their performance compares.
In modern neural networks, the default recommendation is to use the rectified linear unit
or ReLU
It is recommended as the default for both Multilayer Perceptron (MLP) and Convolutional
Neural Networks (CNNs).
The use of ReLU with CNNs has been investigated thoroughly, and it almost universally results in an improvement, which was initially surprising.
… how do the non-linearities that follow the filter banks influence the recognition
accuracy. The surprising answer is that using a rectifying non-linearity is the single most
important factor in improving the performance of a recognition system.
A typical layer of a convolutional network consists of three stages […] In the second
stage, each linear activation is run through a nonlinear activation function, such as the
rectified linear activation function. This stage is sometimes called the detector stage.
At first sight, ReLUs seem inappropriate for RNNs because they can have very large
outputs so they might be expected to be far more likely to explode than units that have
bounded values.
The bias has the effect of shifting the activation function and it is traditional to set the
bias input value to 1.0.
When using ReLU in your network, consider setting the bias to a small value, such as 0.1.
… it can be a good practice to set all elements of [the bias] to a small, positive value,
such as 0.1. This makes it very likely that the rectified linear units will be initially active
for most inputs in the training set and allow the derivatives to pass through.
When using ReLU in your network and initializing weights to small random values
centered on zero, then by default half of the units in the network will output a zero value.
For example, after uniform initialization of the weights, around 50% of hidden units
continuous output values are real zeros
Prior to the wide adoption of ReLU, Xavier Glorot and Yoshua Bengio proposed an
initialization scheme in their 2010 paper titled “Understanding the difficulty of training
deep feedforward neural networks” that quickly became the default when using sigmoid
and tanh activation functions, generally referred to as “Xavier initialization“. Weights are
set at random values sampled uniformly from a range proportional to the size of the
number of nodes in the previous layer (specifically +/- 1/sqrt(n) where n is the number
of nodes in the prior layer).
Kaiming He, et al. in their 2015 paper titled “Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification” suggested that Xavier
initialization and other schemes were not appropriate for ReLU and extensions.
Glorot and Bengio proposed to adopt a properly scaled uniform distribution for
initialization. This is called “Xavier” initialization […]. Its derivation is based on the
assumption that the activations are linear. This assumption is invalid for ReLU
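To make the contrast concrete, here is a small sketch (the function names are mine, not from either paper) of the two schemes; each derives its scale from the number of nodes n in the prior layer:

```python
import numpy as np

def xavier_uniform(n, size, rng):
    # Glorot and Bengio (2010): uniform samples in +/- 1/sqrt(n)
    limit = 1.0 / np.sqrt(n)
    return rng.uniform(-limit, limit, size=size)

def he_normal(n, size, rng):
    # He et al. (2015): zero-mean Gaussian with std sqrt(2/n), derived for ReLU
    return rng.normal(0.0, np.sqrt(2.0 / n), size=size)

rng = np.random.default_rng(1)
w_xavier = xavier_uniform(100, (100, 50), rng)
w_he = he_normal(100, (100, 50), rng)
```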
It is good practice to scale input data prior to training. This may involve standardizing variables to have a zero mean and unit variance or normalizing each value to the scale 0-to-1.
Without data scaling on many problems, the weights of the neural network can grow
large, making the network unstable and increasing the generalization error.
This good practice of scaling inputs applies whether using ReLU for your network or not.
By design, the output of ReLU is unbounded in the positive domain, meaning that in some cases the output can continue to grow in size. As such, it may be a good idea to use a form of weight regularization, such as an L1 or L2 vector norm.
Another problem could arise due to the unbounded behavior of the activations; one may
thus want to use a regularizer to prevent potential numerical problems. Therefore, we
use the L1 penalty on the activation values, which also promotes additional sparsity
Key among the limitations of ReLU is the case where large weight updates can mean that
the summed input to the activation function is always negative, regardless of the input to
the network.
This means that a node with this problem will forever output an activation value of 0.0.
This is referred to as a “dying ReLU“.
the gradient is 0 whenever the unit is not active. This could lead to cases where a unit
never activates as a gradient-based optimization algorithm will not adjust the weights of
a unit that never activates initially. Further, like the vanishing gradients problem, we
might expect learning to be slow when training ReL networks with constant 0 gradients.
The Leaky ReLU (LReLU or LReL) modifies the function to allow small negative values
when the input is less than zero.
The leaky rectifier allows for a small, non-zero gradient when the unit is saturated and
not active
— Rectifier Nonlinearities Improve Neural Network Acoustic Models, 2013.
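As a minimal sketch, with the commonly used small slope of 0.01 for the negative region (the exact slope is a hyperparameter, not fixed by the paper):

```python
def leaky_relu(x, alpha=0.01):
    # pass positive values through unchanged;
    # scale negative values by a small slope alpha to keep a non-zero gradient
    return x if x > 0.0 else alpha * x
```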
The Exponential Linear Unit, or ELU, is a generalization of the ReLU that uses a
parameterized exponential function to transition from the positive to small negative
values.
ELUs have negative values which pushes the mean of the activations closer to zero.
Mean activations that are closer to zero enable faster learning as they bring the gradient
closer to the natural gradient
— Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), 2016.
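A minimal sketch of the ELU, using the common default of alpha = 1.0:

```python
from math import exp

def elu(x, alpha=1.0):
    # identity for positive inputs; a smooth exponential curve that
    # saturates at -alpha for large negative inputs
    return x if x > 0.0 else alpha * (exp(x) - 1.0)
```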
The Parametric ReLU, or PReLU, learns parameters that control the shape and leaky-ness
of the function.
We define a simple new model called maxout (so named because its output is the max of
a set of inputs, and because it is a natural companion to dropout) designed to both
facilitate optimization by dropout and improve the accuracy of dropout’s fast
approximate model averaging technique.
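As a rough sketch of the idea (names and shapes here are mine), a maxout unit outputs the maximum over k affine functions of its input:

```python
import numpy as np

def maxout(x, weights, biases):
    # the output is the max over k affine functions w.x + b of the input
    return max(float(np.dot(w, x) + b) for w, b in zip(weights, biases))

# with weight vectors (1) and (-1) and zero biases, maxout recovers |x|
print(maxout(np.array([-3.0]), [np.array([1.0]), np.array([-1.0])], [0.0, 0.0]))  # → 3.0
```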
Posts
How to Fix Vanishing Gradients Using the Rectified Linear Activation Function
Books
Section 6.3.1 Rectified Linear Units and Their Generalizations, Deep Learning, 2016.
Papers
What is the best multi-stage architecture for object recognition?, 2009.
Rectified Linear Units Improve Restricted Boltzmann Machines, 2010.
Deep Sparse Rectifier Neural Networks, 2011.
Rectifier Nonlinearities Improve Neural Network Acoustic Models, 2013.
Understanding the difficulty of training deep feedforward neural networks, 2010.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015.
Maxout Networks, 2013.
API
max API
Articles
Neural Network FAQ
Activation function, Wikipedia.
Vanishing gradient problem, Wikipedia.
Rectifier (neural networks), Wikipedia.
Piecewise Linear Function, Wikipedia.
Summary
In this tutorial, you discovered the rectified linear activation function for deep learning
neural networks.
The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the
vanishing gradient problem.
The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn
faster and perform better.
The rectified linear activation is the default activation when developing multilayer Perceptron and convolutional
neural networks.
In this tutorial, you will discover the softmax activation function used in neural network
models.
Linear and Sigmoid activation functions are inappropriate for multi-class classification tasks.
Softmax can be thought of as a softened version of the argmax function that returns the index of the largest
value in a list.
How to implement the softmax function from scratch in Python and how to convert the output into a class label.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts.
Classification problems are those that involve predicting a class label for a given input. A
standard approach to modeling classification problems is to use a model to predict the
probability of class membership. That is, given an example, what is the probability of it
belonging to each of the known class labels?
For a binary classification problem, a Binomial probability distribution is used. This is achieved using a network
with a single node in the output layer that predicts the probability of an example belonging to class 1.
For a multi-class classification problem, a Multinomial probability distribution is used. This is achieved using a network with one node for each class in the output layer, where the sum of the predicted probabilities equals one.
A neural network model requires an activation function in the output layer of the model to
make the prediction.
There are different activation functions to choose from; let’s look at a few.
This function is also called the logistic function. Regardless of the input, the function
always outputs a value between 0 and 1. The form of the function is an S-shape between
0 and 1 with the vertical or middle of the “S” at 0.5.
This allows very large values given as the weighted sum of the input to be output as 1.0
and very small or negative values to be mapped to 0.0.
The sigmoid activation is an ideal activation function for a binary classification problem
where the output is interpreted as a Binomial probability distribution.
The sigmoid activation function can also be used as an activation function for multi-class
classification problems where classes are non-mutually exclusive. These are often
referred to as a multi-label classification rather than multi-class classification.
The sigmoid activation function is not appropriate for multi-class classification problems
with mutually exclusive classes where a multinomial probability distribution is required.
# example of the max of a list of numbers
# define data
data = [1, 3, 2]
# calculate the max of the list
result = max(data)
print(result)
Running the example returns the largest value “3” from the list of numbers.
3
Argmax Function
The argmax, or “arg max,” mathematical function returns the index in the list that
contains the largest value.
Think of it as the meta version of max: one level of indirection above max, pointing to the
position in the list that has the max value rather than the value itself.
We can implement this using the argmax() NumPy function; for example:
# example of the argmax of a list of numbers
from numpy import argmax
# define data
data = [1, 3, 2]
# locate the index of the largest value
result = argmax(data)
print(result)
Running the example returns the list index value “1” that points to the array index [1] that
contains the largest value in the list “3”.
1
Softmax Function
The softmax, or “soft max,” mathematical function can be thought to be a probabilistic or
“softer” version of the argmax function.
The term softmax is used because this activation function represents a smooth version
of the winner-takes-all activation model in which the unit with the largest input has
output +1 while all other units have output 0.
What if we were less sure and wanted to express the argmax probabilistically, with
likelihoods?
This can be achieved by scaling the values in the list and converting them into
probabilities such that all values in the returned list sum to 1.0.
This can be achieved by calculating the exponent of each value in the list and dividing it
by the sum of the exponent values.
# transform the values [1, 3, 2] into probabilities
from math import exp
total = exp(1) + exp(3) + exp(2)
p1, p2, p3 = exp(1) / total, exp(3) / total, exp(2) / total
# report probabilities
print(p1, p2, p3)
# report the sum of the probabilities
print(p1 + p2 + p3)
Running the example converts each value in the list into a probability and reports the
values, then confirms that all probabilities sum to the value 1.0.
We can see that most weight is put on index 1 (67 percent) with less weight on index 2
(24 percent) and even less on index 0 (9 percent).
1.0
We can implement it as a function that takes a list of numbers and returns the softmax or
multinomial probability distribution for the list.
The example below implements the function and demonstrates it on our small list of
numbers.
# example of a function for calculating softmax for a list of numbers
from numpy import exp

# calculate the softmax of a vector
def softmax(vector):
	e = exp(vector)
	return e / e.sum()

# define data
data = [1, 3, 2]
# convert the list of numbers to a list of probabilities
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))
Running the example reports roughly the same numbers with minor differences in
precision.
1.0
Finally, we can use the built-in softmax() SciPy function to calculate the softmax for an array or list of numbers, as follows:
# example of calculating the softmax for a list of numbers
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))
Running the example, again, we get very similar results with very minor differences in
precision.
0.9999999999999997
Now that we are familiar with the softmax function, let’s look at how it is used in a neural
network model.
That is, softmax is used as the activation function for multi-class classification problems
where class membership is required on more than two class labels.
Any time we wish to represent a probability distribution over a discrete variable with n
possible values, we may use the softmax function. This can be seen as a generalization
of the sigmoid function which was used to represent a probability distribution over a
binary variable.
Softmax units naturally represent a probability distribution over a discrete variable with k
possible values, so they may be used as a kind of switch.
...
model.add(Dense(3, activation='softmax'))
By definition, the softmax activation will output one value for each node in the output
layer. The output values will represent (or can be interpreted as) probabilities and the
values sum to 1.0.
When modeling a multi-class classification problem, the data must be prepared. The
target variable containing the class labels is first label encoded, meaning that an integer
is applied to each class label from 0 to N-1, where N is the number of class labels.
The label encoded (or integer encoded) target variables are then one-hot encoded. This is a probabilistic representation of the class label, much like the softmax output. A vector is created with a position for each class label: all values are marked 0 (impossible) and a 1 (certain) marks the position of the class label.
For example, three class labels will be integer encoded as 0, 1, and 2. Then encoded to
vectors as follows:
Class 0: [1, 0, 0]
Class 1: [0, 1, 0]
Class 2: [0, 0, 1]
This is called a one-hot encoding.
It represents the expected multinomial probability distribution for each class used to
correct the model under supervised learning.
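The encoding described above can be sketched in a few lines (the helper name here is hypothetical):

```python
def one_hot(label, n_classes):
    # all positions marked 0 (impossible); the class label position marked 1 (certain)
    vec = [0] * n_classes
    vec[label] = 1
    return vec

print([one_hot(c, 3) for c in (0, 1, 2)])  # → [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```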
The softmax function will output a probability of class membership for each class label
and attempt to best approximate the expected target for a given input.
For example, if the integer encoded class 1 was expected for one example, the target
vector would be:
[0, 1, 0]
The softmax output might look as follows, which puts the most weight on class 1 and less weight on the other classes.
[0.09003057 0.66524096 0.24472847]
We may want to convert the probabilities back into an integer encoded class label. This can be achieved using the argmax() function that returns the index of the list with the largest value. Given that the class labels are integer encoded from 0 to N-1, the argmax of the probabilities will always be the integer encoded class label.
class integer = argmax([0.09003057 0.66524096 0.24472847])
class integer = 1
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Neural Networks for Pattern Recognition, 1995.
Neural Networks: Tricks of the Trade: Tricks of the Trade, 2nd Edition, 2012.
Deep Learning, 2016.
APIs
numpy.argmax API.
scipy.special.softmax API.
Articles
Softmax function, Wikipedia.
Summary
In this tutorial, you discovered the softmax activation function used in neural network
models.
Linear and Sigmoid activation functions are inappropriate for multi-class classification tasks.
Softmax can be thought of as a softened version of the argmax function that returns the index of the largest
value in a list.
How to implement the softmax function from scratch in Python and how to convert the output into a class label.
If you are familiar with how Neural Networks work, one of the most important decisions you have to make is which Activation Function to use in the various layers. As you may be aware, a Neural Network is broadly built out of three layers:
1. Input Layer : This layer just takes input from the outside world, doesn't do any
computation by itself, and passes the information to the hidden layers.
2. Hidden Layers : This set of layers accepts the input information from the input layer,
does all the computation and sends the output to the Output layer. This layer is not
visible to the outside world and is part of the abstraction provided by the Neural
Networks.
3. Output Layer: This layer accepts the input from the hidden layers and provides the output
to the outside world in the desired range.
Why use an Activation Function?
Activation functions have the important task of deciding whether a neuron should be activated or not. They are used to introduce non-linearity into the output of the neuron.
If we look at how we take the weighted sum of the weights and bias, we can see that it is linear in nature: Z = WX + B, where W is the vectorised representation of the weights, X is the input features (or the outputs of the previous layer's activation functions), and B is the bias associated with each node. This is a linear relation, similar to Y = mx + c. If we were to remove the activation function from the network, the output of the network would again be a linear equation. This approach has two issues: i) the network will not be capable of understanding the intricacies of the features, and ii) the derivative of a linear function is a constant, which would lead to issues in backpropagation, the procedure that tunes the network with gradient descent. For these two reasons, we introduce non-linear activation functions in a NN.
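This collapse is easy to verify numerically: two stacked layers with no activation function compose into one linear map (the layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, B1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, B2 = rng.normal(size=(3, 2)), rng.normal(size=2)
X = rng.normal(size=4)

# two stacked layers with no activation function...
Y = (X @ W1 + B1) @ W2 + B2

# ...are equivalent to a single linear layer Z = XW + B
W, B = W1 @ W2, B1 @ W2 + B2
assert np.allclose(Y, X @ W + B)
```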
Now that we understand, why do we need Non Linear Activation functions, lets look at the various options we have
available with us.
1. Sigmoid Function
2. Tanh Function
3. Relu Function
4. Softmax Function
Let's look at each of them in detail:
1. It's called the Hyperbolic Tangent function, an extension of the sigmoid function. The
equation is (e^z - e^-z)/(e^z + e^-z), or equivalently 2 * sigmoid(2z) - 1
2. Its value range is between -1 and 1
3. It's a non-linear function mostly used in the hidden layers. Since its value lies
between -1 and 1, it helps centre the data around 0, which makes learning in the next
layer easier.
4. The derivative of the tanh function is g'(z) = 1 - (g(z))^2
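The equation and its derivative can be checked directly against the standard library's tanh (tanh_manual is my name for the explicit form):

```python
from math import exp, tanh

def tanh_manual(z):
    # (e^z - e^-z) / (e^z + e^-z)
    return (exp(z) - exp(-z)) / (exp(z) + exp(-z))

def tanh_derivative(z):
    # g'(z) = 1 - (g(z))^2
    return 1.0 - tanh(z) ** 2
```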
3. ReLU Function: The graph of the function is as below:
1. It's called the Rectified Linear Unit. It's the most widely used activation in the hidden
layers.
2. The formula is g(z) = max(0, z). That means, if the z value is negative, the
function outputs 0, and if it is non-negative, it outputs the same value.
3. ReLU is less computationally expensive compared to the Sigmoid and Tanh functions.
Fewer neurons get activated, hence the network becomes sparse and easy
to back-propagate.
4. The derivative of the ReLU function is 0 (if the value is negative) and 1 (if the value is
positive)
4. Softmax Function: The formula for the softmax function is as below:
1. The softmax function is also a type of sigmoid function but is handy when we are trying
to handle classification problems. It is used when there are more than 2 classes in the
output layer.
2. This function squeezes the output for each class into a probability between 0 and 1,
with all the probabilities adding up to one. It takes the exponent of the weighted sum
(of the previous activations, weights, and bias) for each class node and divides it by the
summation of all the class exponents.
Finally, some tips on how to use these activation functions:
1. In almost all situations, you can use the ReLU activation function in the hidden
layers. Although as a best practice you should always try a few other activation
functions, more often than not you will get better performance with ReLU.
2. For the output layer, if you are performing a classification problem, you need to check
how many classes there are in the target variable. If there are just two, use the
sigmoid function, and if more, use Softmax. For example, if you are doing image
classification (dogs vs. cats) using a CNN, you should use the sigmoid function, and if
you are doing handwritten digit recognition on the MNIST dataset, where there are 10
output classes (0, 1, 2, ..., 9), you should use Softmax. One thing to take care of: when
you choose your activation function, you need to choose your loss function
accordingly. In the case of two classes, you can use binary_crossentropy, and in the case
of more classes, categorical_crossentropy. Also, don't forget to one-hot encode your
output variable in the case of categorical_crossentropy.
3. Finally, if you are performing regression with a neural network, you should use the
linear activation function for the output layer; that is, you will output the weighted
sum as your output.
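The three output-layer choices above can be sketched side by side in plain NumPy (a simplified stand-in for full Keras models; the names and example values are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def linear(z):
    return z

# two classes (e.g. dogs vs. cats): one sigmoid node gives P(class 1)
p_class1 = sigmoid(0.7)

# ten classes (e.g. MNIST digits): ten softmax nodes give probabilities summing to 1
digit_probs = softmax(np.arange(10, dtype=float))

# regression: a linear output, i.e. the weighted sum itself
prediction = linear(3.2)
```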
Hope the above article was helpful in understanding the various activation function choices we have in a NN and the selection criteria.
Activation Functions in Neural Networks [12 Types & Use Cases]
What is a neural network activation function and how does it work? Explore twelve
different types of activation functions and learn how to pick the right one.
As it turns out—
This saying holds true both for our brains as well as machine learning.
Every single moment our brain is trying to segregate the incoming information into the “useful” and
“not-so-useful” categories.
A similar process occurs in artificial neural network architectures in deep learning.
The segregation plays a key role in helping a neural network properly function, ensuring that it
learns from the useful information rather than get stuck analyzing the not-useful part.
And this is also where activation functions come into the picture.
💡 Activation Function helps the neural network to use important information while
suppressing irrelevant data points.
The role of the Activation Function is to derive output from a set of input values fed to a node (or a
layer).
But—
Let’s take a step back and clarify: What exactly is a node?
Well, if we compare the neural network to our brain, a node is a replica of a neuron that receives a
set of input signals—external stimuli.
Depending on the nature and intensity of these input signals, the brain processes them and decides
whether the neuron should be activated (“fired”) or not.
In deep learning, this is also the role of the Activation Function, which is why it's often referred to as a Transfer Function in Artificial Neural Networks.
The primary role of the Activation Function is to transform the summed weighted input from the
node into an output value to be fed to the next hidden layer or as output.
Now, let's have a look at the Neural Networks Architecture.
If you don’t understand the concept of neural networks and how they work, diving deeper into the
topic of activation functions might be challenging.
That’s why it’s a good idea to refresh your knowledge and take a quick look at the structure of the
Neural Networks Architecture and its components. Here it is.
In the image above, you can see a neural network made of interconnected neurons. Each of them
is characterized by its weight, bias, and activation function.
The hidden layer performs all kinds of computation on the features entered through the input layer
and transfers the result to the output layer.
Output Layer
It’s the final layer of the network that brings the information learned through the hidden layer and
delivers the final value as a result.
📢 Note: All hidden layers usually use the same activation function. However, the output layer will
typically use a different activation function from the hidden layers. The choice depends on the goal
or type of prediction made by the model.
Feedforward vs. Backpropagation
When learning about neural networks, you will come across two essential terms describing the
movement of information—feedforward and backpropagation.
💡 Feedforward Propagation - the flow of information occurs in the forward direction. The
input is used to calculate some intermediate function in the hidden layer, which is then
used to calculate an output.
In the feedforward propagation, the Activation Function is a mathematical “gate” in between the
input feeding the current neuron and its output going to the next layer.
💡 Backpropagation - the weights of the network connections are repeatedly adjusted to
minimize the difference between the actual output vector of the net and the desired
output vector.
To put it simply—backpropagation aims to minimize the cost function by adjusting the network’s
weights and biases. The cost function gradients determine the level of adjustment with respect to
parameters like activation function, weights, bias, etc.
Well, the purpose of an activation function is to add non-linearity to the neural network.
Activation functions introduce an additional step at each layer during the forward propagation, but their computation is worth it. Here is why—
Let’s suppose we have a neural network working without the activation functions.
In that case, every neuron will only be performing a linear transformation on the inputs using the
weights and biases. It’s because it doesn’t matter how many hidden layers we attach in the neural
network; all layers will behave in the same way because the composition of two linear functions is a
linear function itself.
Although the neural network becomes simpler, learning any complex task is impossible, and our
model would be just a linear regression model.
Binary Step Function
The input fed to the activation function is compared to a certain threshold; if the input is greater than it, the neuron is activated; otherwise it is deactivated, meaning that its output is not passed on to the next hidden layer.
Linear Activation Function
The linear activation function doesn't do anything to the weighted sum of the input; it simply spits out the value it was given. Because of its limited power, it does not allow the model to create complex mappings between the network's inputs and outputs.
Non-linear activation functions solve the following limitations of linear activation functions:
They allow backpropagation because now the derivative function would be related to the
input, and it’s possible to go back and understand which weights in the input neurons
can provide a better prediction.
They allow the stacking of multiple layers of neurons as the output would now be a non-
linear combination of input passed through multiple layers. Any output can be
represented as a functional computation in a neural network.
Now, let’s have a look at ten different non-linear neural networks activation functions and their
characteristics.
Here’s why sigmoid/logistic activation function is one of the most widely used functions:
It is commonly used for models where we have to predict the probability as an output.
Since probability of anything exists only between the range of 0 and 1, sigmoid is the
right choice because of its range.
The function is differentiable and provides a smooth gradient, i.e., preventing jumps in
output values. This is represented by an S-shape of the sigmoid activation function.
The derivative of the sigmoid implies that for values greater than 3 or less than -3, the function will have very small gradients. As the gradient value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.
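This is easy to verify numerically: the sigmoid's derivative s(x)(1 - s(x)) peaks at 0.25 at x = 0 and shrinks rapidly as x moves away from zero (a minimal sketch):

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# peaks at 0.25 at x = 0; already below 0.01 for |x| > 5
print(sigmoid_derivative(0.0))  # → 0.25
print(sigmoid_derivative(5.0))
```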
The output of the logistic function is not symmetric around zero. So the output of all the
neurons will be of the same sign. This makes the training of the neural network more
difficult and unstable.
Have a look at the gradient of the tanh activation function to understand its limitations.
Gradient of the Tanh Activation Function
As you can see— it also faces the problem of vanishing gradients similar to the sigmoid activation
function. Plus the gradient of the tanh function is much steeper as compared to the sigmoid
function.
💡 Note: Although both sigmoid and tanh face vanishing gradient issue, tanh is zero
centered, and the gradients are not restricted to move in a certain direction. Therefore,
in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity.
ReLU Function
ReLU stands for Rectified Linear Unit.
Although it gives an impression of a linear function, ReLU has a derivative function and allows for
backpropagation while simultaneously making it computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same time.
The neurons will only be deactivated if the output of the linear transformation is less than 0.
ReLU Activation Function
Mathematically it can be represented as:
Note: For building the most reliable ML models, split your data into train, validation, and test sets.
Leaky ReLU Function
Leaky ReLU is an improved version of ReLU function to solve the Dying ReLU problem as it has a
small positive slope in the negative area.
Leaky ReLU
Mathematically it can be represented as:
The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables backpropagation even for negative input values.
Thanks to this minor modification for negative input values, the gradient on the left side of the graph is non-zero, so we no longer encounter dead neurons in that region.
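A minimal sketch of Leaky ReLU, taking 0.01 as one common choice for the fixed negative slope:

```python
def leaky_relu(x, alpha=0.01):
    # A small fixed slope alpha keeps the negative side from going dead
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # Unlike ReLU, the gradient for negative inputs is alpha, not zero
    return 1.0 if x > 0 else alpha

print(leaky_relu_grad(-3.0))   # 0.01, not 0.0 as with plain ReLU
```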
Parametric ReLU
The parameterized ReLU function is used when the Leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer. It takes the slope of the negative part of the function as an argument a, and the most appropriate value of a is learnt by backpropagation.
This function's limitation is that it may perform differently for different problems, depending upon the value of the slope parameter a.
Exponential Linear Units (ELUs) Function
Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope of the
negative part of the function.
ELU uses an exponential curve, f(x) = α(eˣ − 1) for x < 0, to define the negative values, unlike the Leaky ReLU and Parametric ReLU functions, which use a straight line.
Softmax Function
The output of the sigmoid function lies in the range 0 to 1, which can be thought of as a probability.
But there is a catch.
Suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with them? These values don't work as class probabilities, because the probabilities of all the classes/outputs should sum to 1.
This is where softmax comes in: it calculates the relative probabilities. Like the sigmoid/logistic activation function, the softmax function returns a probability for each class. It is most commonly used as the activation function of the last layer of a neural network for multi-class classification.
Softmax Function
Applying the softmax function over a set of raw output values (for example, 1.8, 0.9, and 0.68) to give a probabilistic view will result in the following outcome: [0.58, 0.23, 0.19].
The index with the largest probability then determines the prediction: here, index 0 receives the greatest weight (0.58), so the output is the class corresponding to the first neuron (index 0) out of the three. You can see now how the softmax activation function makes things easy for multi-class classification problems.
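The computation can be sketched as follows; the raw outputs 1.8, 0.9, and 0.68 are assumed example values chosen to reproduce the probabilities above:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.8, 0.9, 0.68])      # assumed example raw outputs
print([round(p, 2) for p in probs])    # [0.58, 0.23, 0.19]
print(round(sum(probs), 10))           # 1.0 (probabilities sum to one)
```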
Swish
Swish is a self-gated activation function developed by researchers at Google, defined as f(x) = x · sigmoid(x).
Swish consistently matches or outperforms ReLU activation function on deep networks applied to
various challenging domains such as image classification, machine translation etc.
Here are a few advantages of the Swish activation function over ReLU:
1. Swish is a smooth function, meaning it does not abruptly change direction like ReLU does near x = 0. Rather, it bends smoothly from 0 down towards negative values and then back up again.
2. ReLU zeroes out all negative values, yet small negative values may still be relevant for capturing patterns underlying the data. Swish preserves small negative values and zeroes out only large negative values, for the sake of sparsity, making it a win-win.
3. Being non-monotonic, the Swish function enhances the expressiveness of the input data and of the weights to be learnt.
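A minimal sketch of Swish, f(x) = x · σ(x), illustrating both properties: small negatives survive, large negatives vanish, and the function dips below zero before rising again.

```python
import math

def swish(x):
    # Self-gating: the input is scaled by its own sigmoid
    return x / (1.0 + math.exp(-x))

# Small negatives survive, large negatives are squashed toward zero
print(round(swish(-0.5), 4))   # -0.1888
print(round(swish(-10.0), 4))  # -0.0005
```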
Gaussian Error Linear Unit (GELU)
The GELU activation function combines properties of ReLU and dropout. ReLU gates a neuron's output deterministically, multiplying the input by zero or one depending on whether it is negative or positive, while dropout multiplies it by zero stochastically.
GELU merges these behaviours by multiplying the input by a zero-or-one value that is stochastically determined and depends on the input: the neuron input x is multiplied by m ∼ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x), X ∼ N(0, 1), is the cumulative distribution function of the standard normal distribution.
This distribution is chosen since neuron inputs tend to follow a normal distribution, especially with
Batch Normalization.
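Taking the expectation of this stochastic gate yields the usual deterministic form, f(x) = x · Φ(x), which can be sketched with the error function:

```python
import math

def gelu(x):
    # f(x) = x * Phi(x), where Phi is the standard normal CDF,
    # expressed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

print(round(gelu(1.0), 4))    # 0.8413
print(round(gelu(-1.0), 4))   # -0.1587
```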
Scaled Exponential Linear Unit (SELU)
SELU is defined as f(x) = λx for x > 0 and λα(eˣ − 1) for x ≤ 0, with fixed constants λ ≈ 1.0507 and α ≈ 1.6733. Because it has both positive and negative outputs, SELU can shift the mean of the activations, which is impossible for the ReLU activation function since it cannot output negative values.
Gradients can be used to adjust the variance: to be able to increase the variance, the activation function needs a region with a gradient larger than one.
SELU is a relatively new activation function and needs more research on architectures such as CNNs and RNNs, where it is still comparatively unexplored.
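A minimal sketch of SELU, using the fixed constants λ ≈ 1.0507 and α ≈ 1.6733 from the original SELU paper:

```python
import math

# Fixed constants derived in the SELU paper (Klambauer et al.)
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    # Scaled linear part for positives, scaled exponential part for negatives
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)

print(round(selu(1.0), 4))    # 1.0507
print(round(selu(-1.0), 4))   # -1.1113
```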
Vanishing Gradients
Certain activation functions, like the sigmoid, squeeze a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small change in the output, and the derivative becomes small. For shallow networks with only a few layers that use these activations, this isn't a big problem.
However, when more layers are used, the gradient can become too small for training to work effectively.
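A toy illustration of the effect: backpropagation multiplies local derivatives layer by layer, and the sigmoid's derivative is at most 0.25, so the product shrinks geometrically with depth (weights are taken as 1.0 here, an assumption made only to keep the sketch simple).

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Even in the best case (inputs at 0, where the derivative peaks at 0.25),
# 20 sigmoid layers shrink the gradient by a factor of 0.25**20
grad = 1.0
for layer in range(20):
    grad *= sigmoid_grad(0.0)
print(grad)   # 0.25**20 ≈ 9.09e-13
```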
Exploding Gradients
Exploding gradients are problems where significant error gradients accumulate and result in very
large updates to neural network model weights during training.
Exploding gradients can make the network unstable, preventing learning from completing.
The values of the weights can also become so large as to overflow, resulting in so-called NaN values.
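A toy illustration of the opposite failure mode (the per-layer gain of 2.0 is an arbitrary assumed value): when each layer multiplies the gradient by a factor larger than one, the product grows geometrically, overflows to infinity, and subsequent arithmetic produces NaN.

```python
import math

# A gradient repeatedly scaled by a factor > 1 grows geometrically
grad = 1.0
for layer in range(1200):
    grad *= 2.0                 # assumed per-layer gain

print(math.isinf(grad))         # True: overflowed past the float range
print(math.isnan(grad - grad))  # True: inf - inf is NaN
```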
As a rule of thumb, you can begin with using the ReLU activation function and then move over to
other activation functions if ReLU doesn’t provide optimum results.
Finally, a few rules for choosing the activation function for your output layer based on the type of
prediction problem that you are solving:
1. Regression - Linear Activation Function
2. Binary Classification—Sigmoid/Logistic Activation Function
3. Multiclass Classification—Softmax
4. Multilabel Classification—Sigmoid
The activation function used in hidden layers is typically chosen based on the type of neural network architecture:
1. Convolutional Neural Network (CNN): ReLU activation function.
2. Recurrent Neural Network (RNN): Tanh and/or Sigmoid activation function.
And hey, use this cheat sheet to consolidate all the knowledge on the Neural Network Activation Functions that you've just acquired :)
Neural Network Activation Functions: Cheat Sheet
You’ve made it this far ;-) Now, let’s have a quick recap of everything you’ve learnt in this tutorial:
Activation Functions are used to introduce non-linearity in the network.
A neural network will almost always have the same activation function in all hidden
layers. This activation function should be differentiable so that the parameters of the
network are learned in backpropagation.
ReLU is the most commonly used activation function for hidden layers.
While selecting an activation function, you must consider the problems it might face:
vanishing and exploding gradients.
Regarding the output layer, we must always consider the expected value range of the predictions. If it can be any numeric value (as in the case of a regression problem), you can use the linear activation function or ReLU.
Use Softmax or Sigmoid function for the classification problems.