
5 Must-Know Activation Functions Used in Neural Networks
The essence of non-linearity

Photo by Drew Patrick Miller on Unsplash

The universal approximation theorem implies that a neural network can approximate
any continuous function that maps inputs (X) to outputs (y). The ability to represent
any such function is what makes neural networks so powerful and widely used.

To be able to approximate any function, we need non-linearity. That’s where
activation functions come into play. They are used to add non-linearity to neural
networks. Without activation functions, neural networks can be considered a
collection of linear models.

Neural networks are combinations of layers that contain many nodes. Thus, the
building process starts with a node. The following represents a node without an
activation function.
A neuron without an activation function (image by author)

The output y is a linear combination of inputs and a bias. We need to somehow add an
element of non-linearity. Consider the following node structure.

A neuron with an activation function (image by author)

Non-linearity is achieved by applying an activation function to the linear
combination of inputs and the bias. The type of non-linearity added depends on the
activation function.

In this post, we will talk about 5 commonly used activations in neural networks.

1. Sigmoid
The sigmoid function squashes any input value into the range between 0 and 1. It is
also the function used in logistic regression models.

(image by author)

Whatever the input values to a sigmoid function are, the output values will be between
0 and 1. Thus, the output of each neuron is normalized into the range 0–1.

(image by author)

The output (y) is most sensitive to changes in the input (x) for x values close to 0.
As the input values move away from zero, the output becomes less sensitive.
After some point, even a large change in the input values results in little to no change
in the output value. That is how the sigmoid function achieves non-linearity.

There is a downside associated with this non-linearity. Let’s first see the derivative of
the sigmoid function.
(image by author)

The derivative tends towards zero as we move away from zero. The “learning” process of
a neural network depends on the derivative because the weights are updated based on
the gradient, which essentially is the derivative of a function. If the gradient is very close to
zero, weights are updated with very small increments. This results in a neural network
that learns very slowly and takes forever to converge. This is known as the vanishing
gradient problem.
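As a quick illustration (a minimal NumPy sketch of my own, not part of the original article), we can evaluate the sigmoid derivative at a few points and watch it shrink towards zero away from the origin:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# The derivative peaks at 0.25 at x = 0 and decays quickly as |x| grows,
# which is what starves earlier layers of gradient signal.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_derivative(x):.6f}")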

2. Tanh (Hyperbolic Tangent)

It is very similar to the sigmoid except that the output values are in the range of -1 to
+1. Thus, tanh is said to be zero centered.
(image by author)

The difference between sigmoid and tanh is that, because tanh is zero-centered, the
gradients are not restricted to move in only one direction. Thus, tanh is likely to converge
faster than the sigmoid function.

The vanishing gradient problem also exists for the tanh activation function.
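To see the zero-centered property mentioned above numerically, here is a small sketch (mine, not the author's): for inputs symmetric around zero, tanh activations average out near 0, while sigmoid activations average around 0.5.

import numpy as np

x = np.linspace(-3, 3, 101)            # inputs symmetric around zero
print(np.tanh(x).mean())               # approximately 0 (zero-centered)
print((1 / (1 + np.exp(-x))).mean())   # approximately 0.5 (not zero-centered)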

3. ReLU (Rectified Linear Unit)

The relu function is only interested in the positive values. It keeps input values
greater than 0 as they are, and all input values less than zero become 0.
(image by author)

The pre-activation output of a neuron can be negative. If we apply the relu
function to the output of that neuron, all the values returned from that neuron become
0. Thus, relu allows some of the neurons to be cancelled out (deactivated).

We are able to activate only some of the neurons with the relu function, whereas all of
the neurons produce non-zero activations with tanh and sigmoid, which results in heavier
computation. Thus, relu tends to converge faster than tanh and sigmoid.

The derivative of relu is 0 for input values less than 0. For neurons stuck in that region, the
weights are never updated during back-propagation, and thus those parts of the network cannot learn.
This issue is known as the dying relu problem.
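To make this concrete, here is a small NumPy sketch of my own showing relu and its derivative; any neuron whose input stays negative receives a zero gradient and stops learning:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 for negative inputs (undefined at exactly 0)
    return (x > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))             # [0.   0.   0.   0.5  3. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.] -> no gradient flows for z < 0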

4. Leaky ReLU

It can be considered a solution to the dying relu problem. Leaky relu outputs a
small, non-zero value for negative inputs instead of 0.
(image by author)

Although leaky relu seems to solve the dying relu problem, some argue that there
is no significant difference in accuracy in most cases. I guess it comes down to
trying both and seeing whether there is any difference for a particular task.
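For completeness, a minimal leaky relu sketch (the slope of 0.01 is a commonly used default rather than a value taken from this article):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # pass positive values through unchanged, scale negative values by a small slope
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# negative inputs become small negative values instead of 0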

5. Softmax

Softmax is usually used in multi-class classification tasks and is applied to the output
layer. It normalizes the output values into a probability distribution in which the
probability values add up to 1.

The softmax function divides the exponential of each output by the sum of the
exponentials of all the outputs. The resulting values form a probability distribution
with probabilities that add up to 1.

Let’s do an example. Consider a case in which the target variable has 4 classes. The
following is the output of the neural network for 5 different data points (i.e.
observations).
Each column represents the output for an observation (image by author)

We can apply the softmax function to these outputs as follows:

(image by author)

In the first line, we applied the softmax function to the values in matrix A. The second
line reduced the floating point precision to 2 decimals.
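The code in the original post appears as images; a NumPy sketch of the same two steps might look like this (the values in A are illustrative, since the original matrix is not reproduced here):

import numpy as np

# 4 classes (rows) x 5 observations (columns) of raw network outputs
A = np.array([[1.0, 2.0, 0.5, 3.0, 1.5],
              [0.5, 1.0, 2.5, 0.5, 2.0],
              [2.0, 0.5, 1.0, 1.5, 0.5],
              [1.5, 3.0, 0.0, 2.0, 1.0]])

# first line: apply softmax column-wise (one distribution per observation)
probs = np.exp(A) / np.exp(A).sum(axis=0)
# second line: reduce the floating point precision to 2 decimals
probs = np.round(probs, 2)

print(probs)
print(probs.sum(axis=0))  # each column sums to (approximately) 1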

Here is the output of the softmax function.

(image by author)

As you can see, the probability values in each column add up to 1.

Conclusion

We have discussed 5 different activation functions used in neural networks. Using
activation functions in neural networks is a must in order to add non-linearity.

There is no free lunch! Activation functions also add a burden to neural networks in
terms of computational complexity, and they have an impact on the convergence of
models.
It is important to know the properties of the activation functions and how they
behave so that we can choose the activation function that best fits a particular task.

In general, the desired properties of an activation function are:

 Computationally inexpensive

 Zero centered

 Differentiable. The derivative of an activation function needs to carry information about the input values because weights are updated based on the gradients.

 Not causing the vanishing gradient problem

Activation Functions and their Derivatives – A Quick & Complete Guide
Lakshmi Panneerselvam — April 14, 2021


This article was published as a part of the Data Science Blogathon.

Introduction

In deep learning, a neural network without an activation function is just a linear regression model:
activation functions perform the non-linear computations on the inputs of a neural network that make it
capable of learning and performing more complex tasks. It is therefore essential to study the derivatives
and implementation of activation functions, and to analyze the benefits and downsides of each, in order
to choose the type of activation function that provides the non-linearity and accuracy a specific neural
network model needs.
Table of Contents
1. Introduction to Activation Functions
2. Types of Activation Functions
3. Activation Functions and their Derivatives
4. Implementation using Python
5. Pros and Cons of Activation Functions

Introduction to Activation Functions


What is it actually?

Activation functions are functions used in a neural network that act on the weighted sum of inputs and
biases and are used to decide whether a neuron should be activated or not. They manipulate the
presented data and produce an output for the neural network that contains the parameters in the data.
Activation functions are also referred to as transfer functions in some literature. They can be either
linear or nonlinear, depending on the function they represent, and are used to control the output of
neural networks across different domains.

For a linear model, a linear mapping of the input to the output is performed in the hidden layers
before the final prediction for each label is given. The transformation of the input vector x is given by

f(x) = w^T · x + b

where x = input, w = weight, and b = bias.

Linear results are produced from the mappings of the above equation, and the need for the activation
function arises here: first to convert these linear outputs into non-linear outputs for further computation,
and then to learn the patterns in the data. The output of these models is given by

y = (w1 x1 + w2 x2 + … + wn xn + b)

These outputs of each layer are fed into the next subsequent layer for multilayered networks until the
final output is obtained, but they are linear by default. The expected output is said to determine the type
of activation function that has to be deployed in a given network.
However, since the outputs are linear in nature, nonlinear activation functions are required to convert
these linear inputs into non-linear outputs. These transfer functions are applied to the outputs of the linear
models to produce transformed non-linear outputs that are ready for further processing. The non-linear
output after the application of the activation function is given by

y = α (w1 x1 + w2 x2 + … + wn xn + b)

where α is the activation function.

Why Activation Functions?

The need for these activation functions includes converting the linear input signals and models into non-
linear output signals, which aids the learning of high order polynomials for deeper networks.

How to use it?

In a neural network, every neuron performs two computations:

 Linear summation of inputs: a neuron with inputs x1, x2, …, xn, weights w1, w2, …, wn, and bias b
computes the linear sum z = w1 x1 + w2 x2 + … + wn xn + b.
 Activation computation: this step decides whether a neuron should be activated or not by applying
the activation function to the weighted sum plus bias. The purpose of the activation
function is to introduce non-linearity into the output of a neuron.
Most neural networks begin by computing the weighted sum of the inputs. Each node in the layer can
have its own unique weighting. However, the activation function is the same across all nodes in the
layer. They are typically of a fixed form, whereas the weights are considered to be the learnable
parameters.
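A minimal sketch of these two computations for a single neuron (the input values, weights, bias, and the choice of sigmoid here are purely illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2])   # inputs x1, x2
w = np.array([0.8, 0.3])    # weights w1, w2
b = 0.1                     # bias

z = np.dot(w, x) + b        # 1) linear summation of inputs
a = sigmoid(z)              # 2) activation computation
print(z, a)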

What is a Good Activation Function?

A proper choice of activation function has to be made to improve the results in neural
network computing. Ideally, activation functions should be monotonic, differentiable, and quick to
converge with respect to the weights for optimization purposes.

Types of Activation Functions

The different kinds of activation functions include:

1) Linear Activation Functions


A linear function is also known as a straight-line function where the activation is proportional to the
input i.e. the weighted sum from neurons. It has a simple function with the equation:

f(x) = ax + c

The problem with this activation is that its output is not confined to any specific range. Applying this function
in all the nodes makes the activation work like linear regression: the final layer of the neural
network is simply a linear function of the first layer. Another issue is with gradient descent:
the derivative of a linear function is a constant, which is a problem because during backpropagation
the rate of change of error is constant and does not depend on the input, undermining the logic of backpropagation.

2) Non-Linear Activation Functions


Non-linear functions are the most widely used activation functions. They make it easy for a
neural network model to adapt to a variety of data and to differentiate between the outcomes.

These functions are mainly divided on the basis of their range or curves:


a) Sigmoid Activation Functions
Sigmoid takes a real value as input and outputs another value between 0 and 1. The sigmoid
activation function translates input in the range (-∞, ∞) to the range (0, 1).

b) Tanh Activation Functions


The tanh function is just another possible function that can be used as a non-linear activation function
between layers of a neural network. It shares a few things in common with the sigmoid activation
function. Unlike a sigmoid function that will map input values between 0 and 1, the Tanh will map
values between -1 and 1. Similar to the sigmoid function, one of the interesting properties of the tanh
function is that the derivative of tanh can be expressed in terms of the function itself.

c) ReLU Activation Functions


The formula is deceptively simple: max(0, z). Despite its name, Rectified Linear Unit, it is not a linear function, and it
provides the same benefits as Sigmoid but with better performance.

(i) Leaky Relu

Leaky Relu is a variant of ReLU. Instead of being 0 when z<0, a leaky ReLU allows a small, non-zero,
constant gradient α (normally, α=0.01). However, the consistency of the benefit across tasks is presently
unclear. Leaky ReLUs attempt to fix the “dying ReLU” problem.

(ii) Parametric Relu

PReLU gives the neurons the ability to choose what slope is best in the negative region. They can
become ReLU or leaky ReLU with certain values of α.

d) Maxout:
The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a piecewise
linear function that returns the maximum of several linear combinations of its inputs, designed to be used in conjunction with the dropout
regularization technique. Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron,
therefore, enjoys all the benefits of a ReLU unit and does not have any drawbacks like dying ReLU.
However, it doubles the total number of parameters for each neuron, and hence, a higher total number of
parameters need to be trained.
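A rough sketch of a Maxout unit with two linear pieces (the weights and biases are illustrative; note the doubled parameter count compared with a single linear unit):

import numpy as np

def maxout(x, W, b):
    # W has shape (k, n_inputs) and b has shape (k,): one linear piece per row.
    # The unit outputs the maximum over the k linear pieces.
    return np.max(W @ x + b, axis=0)

x = np.array([1.0, -2.0])
W = np.array([[0.5, -0.3],
              [-0.2, 0.8]])
b = np.array([0.1, 0.0])
print(maxout(x, W, b))  # max(1.2, -1.8) = 1.2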
e) ELU
The Exponential Linear Unit or ELU is a function that tends to converge faster and produce more
accurate results. Unlike other activation functions, ELU has an extra alpha constant which should be a
positive number. ELU is very similar to ReLU except for negative inputs: both take the identity form for
non-negative inputs, but for negative inputs ELU smoothly saturates towards -α, whereas ReLU is
sharply cut off at zero.
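A small sketch of ELU (α = 1.0 is a common default; the text above does not fix a value):

import numpy as np

def elu(x, alpha=1.0):
    # identity for non-negative inputs, smooth saturation towards -alpha otherwise
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# negative inputs approach -alpha, non-negative inputs pass through unchanged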

f) Softmax Activation Functions


The softmax function calculates the probability distribution of an event over ‘n’ different events. In a
general way, this function will calculate the probabilities of each target class over all possible target
classes. Later the calculated probabilities will help determine the target class for the given inputs.

When to use which Activation Function in a Neural Network?

Specifically, it depends on the problem type and the value range of the expected output. For example, to
predict values that are larger than 1, tanh or sigmoid are not suitable to be used in the output layer,
instead, ReLU can be used. On the other hand, if the output values have to be in the range (0,1) or (-1, 1)
then ReLU is not a good choice, and sigmoid or tanh can be used here. While performing a classification
task and using the neural network to predict a probability distribution over the mutually exclusive class
labels, the softmax activation function should be used in the last layer. However, regarding the hidden
layers, as a rule of thumb, use ReLU as an activation for these layers.

In the case of a binary classifier, the Sigmoid activation function should be used. The sigmoid activation
function and the tanh activation function work terribly for the hidden layer. For hidden layers, ReLU or
its better version leaky ReLU should be used. For a multiclass classifier, softmax is generally the best
choice. Though more activation functions exist, these are the most commonly used activation functions.

Activation Functions and their Derivatives
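The derivative plots from the original article are images and are not reproduced here; for reference, the standard forms of the derivatives discussed in this guide are (α is the leaky ReLU / ELU constant):

sigmoid: s'(x) = s(x) (1 - s(x)), where s(x) = 1 / (1 + e^-x)
tanh: t'(x) = 1 - tanh²(x)
ReLU: f'(x) = 1 for x > 0, 0 for x < 0
Leaky ReLU: f'(x) = 1 for x > 0, α for x < 0
ELU: f'(x) = 1 for x > 0, f(x) + α for x < 0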


Implementation using Python

Having learned the types and significance of each activation function, it is also essential to implement
some basic (non-linear) activation functions using Python code and observe the output for a clearer
understanding of the concepts:
Sigmoid Activation Function
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    s = 1 / (1 + np.exp(-x))
    ds = s * (1 - s)
    return s, ds

x = np.arange(-6, 6, 0.01)
fig, ax = plt.subplots(figsize=(9, 5))
# center the y-axis and hide the top/right spines
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.plot(x, sigmoid(x)[0], color="#307EC7", linewidth=3, label="sigmoid")
ax.plot(x, sigmoid(x)[1], color="#9621E2", linewidth=3, label="derivative")
ax.legend(loc="upper right", frameon=False)
fig.show()

Observations:
 The sigmoid function has values between 0 and 1.
 The output is not zero-centered.
 Sigmoids saturate and kill gradients.
 At the top and bottom ends of the sigmoid curve, the output changes slowly; the derivative curve
above shows that the slope or gradient there is close to zero.

Tanh Activation Function


import matplotlib.pyplot as plt
import numpy as np

def tanh(x):
    t = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
    dt = 1 - t**2
    return t, dt

z = np.arange(-4, 4, 0.01)
fig, ax = plt.subplots(figsize=(9, 5))
# center both axes and hide the top/right spines
ax.spines['left'].set_position('center')
ax.spines['bottom'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.plot(z, tanh(z)[0], color="#307EC7", linewidth=3, label="tanh")
ax.plot(z, tanh(z)[1], color="#9621E2", linewidth=3, label="derivative")
ax.legend(loc="upper right", frameon=False)
fig.show()
Observations:
 Its output is zero-centered because its range is between -1 and 1, i.e. -1 < output < 1.
 Optimization is easier with tanh, hence in practice it is usually preferred over the sigmoid
function.

Pros and Cons of Activation Functions

Linear
 Pros: It gives a range of activations, so it is not a binary activation. It can connect a few neurons together and, if more than one fires, take the max and decide based on that.
 Cons: The gradient is constant, so the descent proceeds on a constant gradient. If there is an error in prediction, the changes made by backpropagation are constant and do not depend on the change in input.

Sigmoid
 Pros: It is nonlinear in nature, and combinations of this function are also nonlinear. It gives an analog activation, unlike the step function.
 Cons: Sigmoids saturate and kill gradients, giving rise to the problem of “vanishing gradients”. The network refuses to learn further or becomes drastically slow.

Tanh
 Pros: The gradient is stronger for tanh than for sigmoid, i.e. the derivatives are steeper.
 Cons: Tanh also has the vanishing gradient problem.

ReLU
 Pros: It avoids and rectifies the vanishing gradient problem. ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
 Cons: It should only be used within the hidden layers of a neural network model. Some gradients can be fragile during training and can die: a weight update can make a unit never activate again on any data point, so ReLU can result in dead neurons.

Leaky ReLU
 Pros: Leaky ReLU is one attempt to fix the “dying ReLU” problem by having a small negative slope.
 Cons: Because of its linearity, it can struggle with complex classification tasks and can lag behind sigmoid and tanh for some use cases.

ELU
 Pros: Unlike ReLU, ELU can produce negative outputs.
 Cons: For x > 0, it can blow up the activation, with an output range of [0, ∞).

When all is said and done, the actual purpose of an activation function is to add some kind of non-linear
property to the function computed by a neural network. Without activation functions, a neural network
could perform only linear mappings from inputs to outputs, and the mathematical operations during
forward propagation would be dot products between an input vector and a weight matrix.

Since a dot product is a linear operation, a sequence of dot products is nothing more than
multiple linear operations applied one after another, and a sequence of linear operations collapses
into a single linear operation. To learn anything genuinely interesting, a neural network must be able
to approximate non-linear relations from input features to output labels.

The more complicated the data, the more non-linear the mapping from features to the ground-truth
label will usually be. Without activation functions, a neural network cannot represent such complicated
mappings and cannot solve the tasks it is actually meant to solve.

How to Choose an Activation Function for Deep Learning
by Jason Brownlee on January 18, 2021 in Deep Learning

Last Updated on January 22, 2021

Activation functions are a critical part of the design of a neural network.


The choice of activation function in the hidden layer will control how well the network
model learns the training dataset. The choice of activation function in the output layer
will define the type of predictions the model can make.

As such, a careful choice of activation function must be made for each deep learning
neural network project.
In this tutorial, you will discover how to choose activation functions for neural network
models.

After completing this tutorial, you will know:

 Activation functions are a key part of neural network design.


 The modern default activation function for hidden layers is the ReLU function.
 The activation function for output layers depends on the type of prediction problem.
Let’s get started.

Tutorial Overview
This tutorial is divided into three parts; they are:

1. Activation Functions
2. Activation for Hidden Layers
3. Activation for Output Layers

Activation Functions
An activation function in a neural network defines how the weighted sum of the input is
transformed into an output from a node or nodes in a layer of the network.
Sometimes the activation function is called a “transfer function.” If the output range of
the activation function is limited, then it may be called a “squashing function.” Many
activation functions are nonlinear and may be referred to as the “nonlinearity” in the
layer or the network design.
The choice of activation function has a large impact on the capability and performance of
the neural network, and different activation functions may be used in different parts of
the model.

Technically, the activation function is used within or after the internal processing of each
node in the network, although networks are designed to use the same activation function
for all nodes in a layer.

A network may have three types of layers: input layers that take raw input from the
domain, hidden layers that take input from another layer and pass output to another
layer, and output layers that make a prediction.
All hidden layers typically use the same activation function. The output layer will
typically use a different activation function from the hidden layers and is dependent upon
the type of prediction required by the model.

Activation functions are also typically differentiable, meaning the first-order derivative
can be calculated for a given input value. This is required given that neural networks are
typically trained using the backpropagation of error algorithm that requires the derivative
of prediction error in order to update the weights of the model.

There are many different types of activation functions used in neural networks, although
perhaps only a small number of functions are used in practice for hidden and output layers.

Let’s take a look at the activation functions used for each type of layer in turn.

Activation for Hidden Layers


A hidden layer in a neural network is a layer that receives input from another layer (such
as another hidden layer or an input layer) and provides output to another layer (such as
another hidden layer or an output layer).

A hidden layer does not directly contact input data or produce outputs for a model, at
least in general.

A neural network may have zero or more hidden layers.

Typically, a differentiable nonlinear activation function is used in the hidden layers of a
neural network. This allows the model to learn more complex functions than a network
trained using a linear activation function.

In order to get access to a much richer hypothesis space that would benefit from deep
representations, you need a non-linearity, or activation function.

— Page 72, Deep Learning with Python, 2017.

There are perhaps three activation functions you may want to consider for use in hidden
layers; they are:

 Rectified Linear Activation (ReLU)


 Logistic (Sigmoid)
 Hyperbolic Tangent (Tanh)
This is not an exhaustive list of activation functions used for hidden layers, but they are
the most commonly used.

Let’s take a closer look at each in turn.

ReLU Hidden Layer Activation Function


The rectified linear activation function, or ReLU activation function, is perhaps the most
common function used for hidden layers.
It is common because it is both simple to implement and effective at overcoming the
limitations of other previously popular activation functions, such as Sigmoid and Tanh.
Specifically, it is less susceptible to vanishing gradients that prevent deep models from
being trained, although it can suffer from other problems like saturated or “dead” units.
The ReLU function is calculated as follows:

 max(0.0, x)
This means that if the input value (x) is negative, then a value 0.0 is returned, otherwise,
the value is returned.

You can learn more about the details of the ReLU activation function in this tutorial:

 A Gentle Introduction to the Rectified Linear Unit (ReLU)


We can get an intuition for the shape of this function with the worked example below.

# example plot for the relu activation function
from matplotlib import pyplot

# rectified linear function
def rectified(x):
    return max(0.0, x)

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [rectified(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.

We can see the familiar kink shape of the ReLU activation function.
Plot of Inputs vs. Outputs for the ReLU Activation Function.

When using the ReLU function for hidden layers, it is a good practice to use a “He
Normal” or “He Uniform” weight initialization and scale input data to the range 0-1
(normalize) prior to training.
Sigmoid Hidden Layer Activation Function
The sigmoid activation function is also called the logistic function.

It is the same function used in the logistic regression classification algorithm.

The function takes any real value as input and outputs values in the range 0 to 1. The
larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to 0.0.

The sigmoid activation function is calculated as follows:

 1.0 / (1.0 + e^-x)


Where e is a mathematical constant, which is the base of the natural logarithm.
We can get an intuition for the shape of this function with the worked example below.

# example plot for the sigmoid activation function
from math import exp
from matplotlib import pyplot

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.

We can see the familiar S-shape of the sigmoid activation function.

Plot of Inputs vs. Outputs for the Sigmoid Activation Function.

When using the Sigmoid function for hidden layers, it is a good practice to use a “Xavier
Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization,
named for Xavier Glorot) and scale input data to the range 0-1 (e.g. the range of the
activation function) prior to training.
Tanh Hidden Layer Activation Function
The hyperbolic tangent activation function is also referred to simply as the Tanh (also
“tanh” and “TanH“) function.
It is very similar to the sigmoid activation function and even has the same S-shape.

The function takes any real value as input and outputs values in the range -1 to 1. The
larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to -1.0.
The Tanh activation function is calculated as follows:

 (e^x – e^-x) / (e^x + e^-x)


Where e is a mathematical constant that is the base of the natural logarithm.
We can get an intuition for the shape of this function with the worked example below.

# example plot for the tanh activation function
from math import exp
from matplotlib import pyplot

# tanh activation function
def tanh(x):
    return (exp(x) - exp(-x)) / (exp(x) + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [tanh(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.

We can see the familiar S-shape of the Tanh activation function.

Plot of Inputs vs. Outputs for the Tanh Activation Function.


When using the TanH function for hidden layers, it is a good practice to use a “Xavier
Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization,
named for Xavier Glorot) and scale input data to the range -1 to 1 (e.g. the range of the
activation function) prior to training.
How to Choose a Hidden Layer Activation Function
A neural network will almost always have the same activation function in all hidden
layers.

It is most unusual to vary the activation function through a network model.

Traditionally, the sigmoid activation function was the default activation function in the
1990s. Perhaps through the mid to late 1990s to 2010s, the Tanh function was the default
activation function for hidden layers.

… the hyperbolic tangent activation function typically performs better than the logistic
sigmoid.

— Page 195, Deep Learning, 2016.


Both the sigmoid and Tanh functions can make the model more susceptible to problems
during training, via the so-called vanishing gradients problem.

You can learn more about this problem in this tutorial:

 A Gentle Introduction to the Rectified Linear Unit (ReLU)


The activation function used in hidden layers is typically chosen based on the type of
neural network architecture.

Modern neural network models with common architectures, such as MLP and CNN, will
make use of the ReLU activation function, or extensions.

In modern neural networks, the default recommendation is to use the rectified linear unit
or ReLU …

— Page 174, Deep Learning, 2016.


Recurrent networks still commonly use Tanh or sigmoid activation functions, or even
both. For example, the LSTM commonly uses the Sigmoid activation for recurrent
connections and the Tanh activation for output.

 Multilayer Perceptron (MLP): ReLU activation function.


 Convolutional Neural Network (CNN): ReLU activation function.
 Recurrent Neural Network: Tanh and/or Sigmoid activation function.
If you’re unsure which activation function to use for your network, try a few and compare
the results.
The figure below summarizes how to choose an activation function for the hidden layers
of your neural network model.

How to Choose a Hidden Layer Activation Function

Activation for Output Layers


The output layer is the layer in a neural network model that directly outputs a prediction.

All feed-forward neural network models have an output layer.

There are perhaps three activation functions you may want to consider for use in the
output layer; they are:

 Linear
 Logistic (Sigmoid)
 Softmax
This is not an exhaustive list of activation functions used for output layers, but they are
the most commonly used.

Let’s take a closer look at each in turn.

Linear Output Activation Function


The linear activation function is also called “identity” (multiplied by 1.0) or “no
activation.”
This is because the linear activation function does not change the weighted sum of the
input in any way and instead returns the value directly.

We can get an intuition for the shape of this function with the worked example below.
# example plot for the linear activation function
from matplotlib import pyplot

# linear activation function
def linear(x):
    return x

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [linear(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.

We can see a diagonal line shape where inputs are plotted against identical outputs.

Plot of Inputs vs. Outputs for the Linear Activation Function

Target values used to train a model with a linear activation function in the output layer
are typically scaled prior to modeling using normalization or standardization transforms.

Sigmoid Output Activation Function


The sigmoid or logistic activation function was described in the previous section.
Nevertheless, to add some symmetry, we can review the shape of this function with
the worked example below.

# example plot for the sigmoid activation function
from math import exp
from matplotlib import pyplot

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of
inputs versus outputs.

We can see the familiar S-shape of the sigmoid activation function.

Plot of Inputs vs. Outputs for the Sigmoid Activation Function.


Target labels used to train a model with a sigmoid activation function in the output layer
will have the values 0 or 1.

Softmax Output Activation Function


The softmax function outputs a vector of values that sum to 1.0 that can be interpreted
as probabilities of class membership.
It is related to the argmax function that outputs a 0 for all options and 1 for the chosen
option. Softmax is a “softer” version of argmax that allows a probability-like output of a
winner-take-all function.
As such, the input to the function is a vector of real values and the output is a vector of
the same length with values that sum to 1.0 like probabilities.

The softmax function is calculated as follows:

 e^x / sum(e^x)
Where x is a vector of outputs and e is a mathematical constant that is the base of the
natural logarithm.
You can learn more about the details of the Softmax function in this tutorial:

 Softmax Activation Function with Python


We cannot plot the softmax function, but we can give an example of calculating it in
Python.

from numpy import exp

# softmax activation function
def softmax(x):
    return exp(x) / exp(x).sum()

# define input data
inputs = [1.0, 3.0, 2.0]
# calculate outputs
outputs = softmax(inputs)
# report the probabilities
print(outputs)
# report the sum of the probabilities
print(outputs.sum())

Running the example calculates the softmax output for the input vector.

We then confirm that the sum of the outputs of the softmax indeed sums to the value 1.0.

[0.09003057 0.66524096 0.24472847]
1.0

Target labels used to train a model with the softmax activation function in the output
layer will be vectors with 1 for the target class and 0 for all other classes.

How to Choose an Output Activation Function


You must choose the activation function for your output layer based on the type of
prediction problem that you are solving.

Specifically, the type of variable that is being predicted.

For example, you may divide prediction problems into two main groups, predicting a
categorical variable (classification) and predicting a numerical variable (regression).
If your problem is a regression problem, you should use a linear activation function.

 Regression: One node, linear activation.


If your problem is a classification problem, then there are three main types of
classification problems and each may use a different activation function.

Predicting a probability is not a regression problem; it is classification. In all cases of
classification, your model will predict the probability of class membership (e.g. the
probability that an example belongs to each class), which you can convert to a crisp class
label by rounding (for sigmoid) or argmax (for softmax).

If there are two mutually exclusive classes (binary classification), then your output layer
will have one node and a sigmoid activation function should be used. If there are more
than two mutually exclusive classes (multiclass classification), then your output layer
will have one node per class and a softmax activation should be used. If there are two or
more mutually inclusive classes (multilabel classification), then your output layer will
have one node for each class and a sigmoid activation function is used.

 Binary Classification: One node, sigmoid activation.


 Multiclass Classification: One node per class, softmax activation.
 Multilabel Classification: One node per class, sigmoid activation.
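As a concrete (though hedged) illustration of these choices, the output layer might be configured in Keras as follows; this snippet is mine, not part of the tutorial, and n_classes is an illustrative placeholder:

from tensorflow import keras

n_classes = 10  # illustrative number of classes

# Regression: one node, linear activation
regression_output = keras.layers.Dense(1, activation='linear')
# Binary classification: one node, sigmoid activation
binary_output = keras.layers.Dense(1, activation='sigmoid')
# Multiclass classification: one node per class, softmax activation
multiclass_output = keras.layers.Dense(n_classes, activation='softmax')
# Multilabel classification: one node per class, sigmoid activation
multilabel_output = keras.layers.Dense(n_classes, activation='sigmoid')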
The figure below summarizes how to choose an activation function for the output layer of
your neural network model.
A Gentle Introduction to the Rectified Linear Unit
(ReLU)
by Jason Brownlee on January 9, 2019 in Deep Learning Performance

Last Updated on August 20, 2020

In a neural network, the activation function is responsible for transforming the summed
weighted input from the node into the activation of the node or output for that input.

The rectified linear activation function or ReLU for short is a piecewise linear function
that will output the input directly if it is positive, otherwise, it will output zero. It has
become the default activation function for many types of neural networks because a
model that uses it is easier to train and often achieves better performance.
In this tutorial, you will discover the rectified linear activation function for deep learning
neural networks.

After completing this tutorial, you will know:

 The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the
vanishing gradient problem.
 The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn
faster and perform better.
 The rectified linear activation is the default activation when developing multilayer Perceptron and convolutional
neural networks.
Kick-start your project with my new book Better Deep Learning, including step-by-step
tutorials and the Python source code files for all examples.
Let’s get started.

 Jun/2019: Fixed error in the equation for He weight initialization (thanks Maltev).

Tutorial Overview
This tutorial is divided into six parts; they are:

1. Limitations of Sigmoid and Tanh Activation Functions


2. Rectified Linear Activation Function
3. How to Implement the Rectified Linear Activation Function
4. Advantages of the Rectified Linear Activation
5. Tips for Using the Rectified Linear Activation
6. Extensions and Alternatives to ReLU
Limitations of Sigmoid and Tanh Activation Functions
A neural network is comprised of layers of nodes and learns to map examples of inputs to
outputs.

For a given node, the inputs are multiplied by the weights in a node and summed
together. This value is referred to as the summed activation of the node. The summed
activation is then transformed via an activation function and defines the specific output
or “activation” of the node.
The simplest activation function is referred to as the linear activation, where no
transform is applied at all. A network comprised of only linear activation functions is very
easy to train, but cannot learn complex mapping functions. Linear activation functions
are still used in the output layer for networks that predict a quantity (e.g. regression
problems).

Nonlinear activation functions are preferred as they allow the nodes to learn more
complex structures in the data. Traditionally, two widely used nonlinear activation
functions are the sigmoid and hyperbolic tangent activation functions.
The sigmoid activation function, also called the logistic function, is traditionally a very
popular activation function for neural networks. The input to the function is transformed
into a value between 0.0 and 1.0. Inputs that are much larger than 1.0 are transformed to
the value 1.0, similarly, values much smaller than 0.0 are snapped to 0.0. The shape of
the function for all possible inputs is an S-shape from zero up through 0.5 to 1.0. For a
long time, through the early 1990s, it was the default activation used on neural networks.

The hyperbolic tangent function, or tanh for short, is a similar shaped nonlinear activation
function that outputs values between -1.0 and 1.0. In the later 1990s and through the
2000s, the tanh function was preferred over the sigmoid activation function as models
that used it were easier to train and often had better predictive performance.

… the hyperbolic tangent activation function typically performs better than the logistic
sigmoid.

— Page 195, Deep Learning, 2016.


A general problem with both the sigmoid and tanh functions is that they saturate. This
means that large values snap to 1.0 and small values snap to -1 or 0 for tanh and sigmoid
respectively. Further, the functions are only really sensitive to changes around their mid-
point of their input, such as 0.5 for sigmoid and 0.0 for tanh.

The limited sensitivity and saturation of the function happen regardless of whether the
summed activation from the node provided as input contains useful information or not.
Once saturated, it becomes challenging for the learning algorithm to continue to adapt
the weights to improve the performance of the model.

… sigmoidal units saturate across most of their domain—they saturate to a high value
when z is very positive, saturate to a low value when z is very negative, and are only
strongly sensitive to their input when z is near 0.

— Page 195, Deep Learning, 2016.


Finally, even as the capability of hardware increased through GPUs, very deep neural networks
using sigmoid and tanh activation functions could not easily be trained.

Layers deep in large networks using these nonlinear activation functions fail to receive
useful gradient information. Error is back propagated through the network and used to
update the weights. The amount of error decreases dramatically with each additional
layer through which it is propagated, given the derivative of the chosen activation
function. This is called the vanishing gradient problem and prevents deep (multi-layered)
networks from learning effectively.
Vanishing gradients make it difficult to know which direction the parameters should
move to improve the cost function

— Page 290, Deep Learning, 2016.


For an example of how ReLU can fix the vanishing gradients problem, see the tutorial:

 How to Fix Vanishing Gradients Using the Rectified Linear Activation Function
Although the use of nonlinear activation functions allows neural networks to learn
complex mapping functions, they effectively prevent the learning algorithm from working
with deep networks.
Workarounds were found in the late 2000s and early 2010s using alternate network types
such as Boltzmann machines and layer-wise training or unsupervised pre-training.

Rectified Linear Activation Function


In order to use stochastic gradient descent with backpropagation of errors to train deep
neural networks, an activation function is needed that looks and acts like a linear
function, but is, in fact, a nonlinear function allowing complex relationships in the data to
be learned.
The function must also provide more sensitivity to the activation sum input and avoid
easy saturation.

The solution had been bouncing around in the field for some time, although it was not
highlighted until papers in 2009 and 2011 shone a light on it.

The solution is to use the rectified linear activation function, or ReL for short.

A node or unit that implements this activation function is referred to as a rectified linear
activation unit, or ReLU for short. Often, networks that use the rectifier function for the
hidden layers are referred to as rectified networks.
Adoption of ReLU may easily be considered one of the few milestones in the deep
learning revolution, e.g. the techniques that now permit the routine development of very
deep neural networks.

[another] major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units.

— Page 226, Deep Learning, 2016.


The rectified linear activation function is a simple calculation that returns the value
provided as input directly, or the value 0.0 if the input is 0.0 or less.

We can describe this using a simple if-statement:

if input > 0:
    return input
else:
    return 0

We can describe this function g() mathematically using the max() function over the set
of 0.0 and the input z; for example:
g(z) = max{0, z}
The function is linear for values greater than zero, meaning it has a lot of the desirable
properties of a linear activation function when training a neural network using
backpropagation. Yet, it is a nonlinear function as negative values are always output as
zero.

Because rectified linear units are nearly linear, they preserve many of the properties that
make linear models easy to optimize with gradient-based methods. They also preserve
many of the properties that make linear models generalize well.

— Page 175, Deep Learning, 2016.


Because the rectified function is linear for half of the input domain and nonlinear for the
other half, it is referred to as a piecewise linear function or a hinge function.
However, the function remains very close to linear, in the sense that it is a piecewise linear
function with two linear pieces.

— Page 175, Deep Learning, 2016.


Now that we are familiar with the rectified linear activation function, let’s look at how we
can implement it in Python.

How to Code the Rectified Linear Activation Function


We can implement the rectified linear activation function easily in Python.

Perhaps the simplest implementation is using the max() function; for example:
# rectified linear function
def rectified(x):
    return max(0.0, x)

We expect that any positive value will be returned unchanged whereas an input value of
0.0 or a negative value will be returned as the value 0.0.

Below are a few examples of inputs and outputs of the rectified linear activation function.

# demonstrate the rectified linear function

# rectified linear function
def rectified(x):
    return max(0.0, x)

# demonstrate with a positive input
x = 1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = 1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a zero input
x = 0.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a negative input
x = -1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = -1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))

Running the example, we can see that positive values are returned regardless of their
size, whereas negative values are snapped to the value 0.0.

rectified(1.0) is 1.0
rectified(1000.0) is 1000.0
rectified(0.0) is 0.0
rectified(-1.0) is 0.0
rectified(-1000.0) is 0.0

We can get an idea of the relationship between inputs and outputs of the function by
plotting a series of inputs and the calculated outputs.

The example below generates a series of integers from -10 to 10 and calculates the
rectified linear activation for each input, then plots the result.

# plot inputs and outputs
from matplotlib import pyplot

# rectified linear function
def rectified(x):
    return max(0.0, x)

# define a series of inputs
series_in = [x for x in range(-10, 11)]
# calculate outputs for our inputs
series_out = [rectified(x) for x in series_in]
# line plot of raw inputs to rectified outputs
pyplot.plot(series_in, series_out)
pyplot.show()

Running the example creates a line plot showing that all negative values and zero inputs
are snapped to 0.0, whereas the positive outputs are returned as-is, resulting in a linearly
increasing slope, given that we created a linearly increasing series of positive values
(e.g. 1 to 10).

Line Plot of Rectified Linear Activation for Negative and Positive Inputs

The derivative of the rectified linear function is also easy to calculate. Recall that the
derivative of the activation function is required when updating the weights of a node as
part of the backpropagation of error.

The derivative of the function is the slope. The slope for negative values is 0.0 and the
slope for positive values is 1.0.

Traditionally, the field of neural networks has avoided any activation function that was
not completely differentiable, perhaps delaying the adoption of the rectified linear
function and other piecewise-linear functions. Technically, we cannot calculate the
derivative when the input is 0.0, therefore, we can assume it is zero. This is not a
problem in practice.

For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0.
This may seem like it invalidates g for use with a gradient-based learning algorithm. In
practice, gradient descent still performs well enough for these models to be used for
machine learning tasks.

— Page 192, Deep Learning, 2016.


Using the rectified linear activation function offers many advantages; let’s take a look at
a few in the next section.
Advantages of the Rectified Linear Activation Function
The rectified linear activation function has rapidly become the default activation function
when developing most types of neural networks.

As such, it is important to take a moment to review some of the benefits of the approach,
first highlighted by Xavier Glorot, et al. in their milestone 2011 paper on using ReLU titled
“Deep Sparse Rectifier Neural Networks“.
1. Computational Simplicity.
The rectifier function is trivial to implement, requiring a max() function.
This is unlike the tanh and sigmoid activation function that require the use of an
exponential calculation.

Computations are also cheaper: there is no need for computing the exponential function
in activations

— Deep Sparse Rectifier Neural Networks, 2011.


2. Representational Sparsity
An important benefit of the rectifier function is that it is capable of outputting a true zero
value.

This is unlike the tanh and sigmoid activation functions that learn to approximate a zero
output, e.g. a value very close to zero, but not a true zero value.

This means that negative inputs can output true zero values allowing the activation of
hidden layers in neural networks to contain one or more true zero values. This is called a
sparse representation and is a desirable property in representational learning as it can
accelerate learning and simplify the model.

An area where efficient representations such as sparsity are studied and sought is in
autoencoders, where a network learns a compact representation of an input (called the
code layer), such as an image or series, before it is reconstructed from the compact
representation.

One way to achieve actual zeros in h for sparse (and denoising) autoencoders […] The
idea is to use rectified linear units to produce the code layer. With a prior that actually
pushes the representations to zero (like the absolute value penalty), one can thus
indirectly control the average number of zeros in the representation.

— Page 507, Deep Learning, 2016.


3. Linear Behavior
The rectifier function mostly looks and acts like a linear activation function.

In general, a neural network is easier to optimize when its behavior is linear or close to
linear.

Rectified linear units […] are based on the principle that models are easier to optimize if
their behavior is closer to linear.

— Page 194, Deep Learning, 2016.


Key to this property is that networks trained with this activation function almost
completely avoid the problem of vanishing gradients, as the gradients remain
proportional to the node activations.

Because of this linearity, gradients flow well on the active paths of neurons (there is no
gradient vanishing effect due to activation non-linearities of sigmoid or tanh units).

— Deep Sparse Rectifier Neural Networks, 2011.


4. Train Deep Networks
Importantly, the (re-)discovery and adoption of the rectified linear activation function
meant that it became possible to exploit improvements in hardware and successfully
train deep multi-layered networks with a nonlinear activation function using
backpropagation.

In turn, cumbersome networks such as Boltzmann machines could be left behind as well
as cumbersome training schemes such as layer-wise training and unlabeled pre-training.

… deep rectifier networks can reach their best performance without requiring any
unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence,
these results can be seen as a new milestone in the attempts at understanding the
difficulty in training deep but purely supervised neural networks, and closing the
performance gap between neural networks learnt with and without unsupervised pre-
training.

— Deep Sparse Rectifier Neural Networks, 2011.


Tips for Using the Rectified Linear Activation
In this section, we’ll take a look at some tips when using the rectified linear activation
function in your own deep learning neural networks.
Use ReLU as the Default Activation Function
For a long time, the default activation to use was the sigmoid activation function. Later,
it was the tanh activation function.

For modern deep learning neural networks, the default activation function is the rectified
linear activation function.

Prior to the introduction of rectified linear units, most neural networks used the logistic
sigmoid activation function or the hyperbolic tangent activation function.

— Page 195, Deep Learning, 2016.


Most papers that achieve state-of-the-art results will describe a network using ReLU. For
example, in the milestone 2012 paper by Alex Krizhevsky, et al. titled “ImageNet
Classification with Deep Convolutional Neural Networks,” the authors developed a deep
convolutional neural network with ReLU activations that achieved state-of-the-art results
on the ImageNet photo classification dataset.
… we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep
convolutional neural networks with ReLUs train several times faster than their
equivalents with tanh units.

If in doubt, start with ReLU in your neural network, then perhaps try other piecewise
linear activation functions to see how their performance compares.

In modern neural networks, the default recommendation is to use the rectified linear unit
or ReLU

— Page 174, Deep Learning, 2016.


Use ReLU with MLPs, CNNs, but Probably Not RNNs
The ReLU can be used with most types of neural networks.

It is recommended as the default for both Multilayer Perceptron (MLP) and Convolutional
Neural Networks (CNNs).

The use of ReLU with CNNs has been investigated thoroughly, and almost universally
results in an improvement in results, initially, surprisingly so.

… how do the non-linearities that follow the filter banks influence the recognition
accuracy. The surprising answer is that using a rectifying non-linearity is the single most
important factor in improving the performance of a recognition system.

— What is the best multi-stage architecture for object recognition?, 2009


Work investigating ReLU with CNNs is what provoked their use with other network types.
[others] have explored various rectified nonlinearities […] in the context of convolutional
networks and have found them to improve discriminative performance.

— Rectified Linear Units Improve Restricted Boltzmann Machines, 2010.


When using ReLU with CNNs, they can be used as the activation function on the filter
maps themselves, followed then by a pooling layer.

A typical layer of a convolutional network consists of three stages […] In the second
stage, each linear activation is run through a nonlinear activation function, such as the
rectified linear activation function. This stage is sometimes called the detector stage.

— Page 339, Deep Learning, 2016.


Traditionally, LSTMs use the tanh activation function for the activation of the cell state
and the sigmoid activation function for the node output. Given this careful design, ReLUs
were thought, by default, not to be appropriate for Recurrent Neural Networks (RNNs)
such as the Long Short-Term Memory network (LSTM).

At first sight, ReLUs seem inappropriate for RNNs because they can have very large
outputs so they might be expected to be far more likely to explode than units that have
bounded values.

— A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, 2015.


Nevertheless, there has been some work on investigating the use of ReLU as the output
activation in LSTMs, the result of which is a careful initialization of network weights to
ensure that the network is stable prior to training. This is outlined in the 2015 paper titled
“A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.”
Try a Smaller Bias Input Value
The bias is the input on the node that has a fixed value.

The bias has the effect of shifting the activation function and it is traditional to set the
bias input value to 1.0.

When using ReLU in your network, consider setting the bias to a small value, such as 0.1.

… it can be a good practice to set all elements of [the bias] to a small, positive value,
such as 0.1. This makes it very likely that the rectified linear units will be initially active
for most inputs in the training set and allow the derivatives to pass through.

— Page 193, Deep Learning, 2016.


There are some conflicting reports as to whether this is required, so compare
performance to a model with a 1.0 bias input.
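
As a sketch of what setting the bias initializer might look like in Keras (the layer size here is a placeholder):

# sketch: initialize the bias of a ReLU layer to a small positive value such as 0.1
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import Constant

layer = Dense(32, activation='relu', bias_initializer=Constant(0.1))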
Use “He Weight Initialization”
Before training a neural network, the weights of the network must be initialized to small
random values.

When using ReLU in your network and initializing weights to small random values
centered on zero, then by default half of the units in the network will output a zero value.

For example, after uniform initialization of the weights, around 50% of hidden units
continuous output values are real zeros

— Deep Sparse Rectifier Neural Networks, 2011.


There are many heuristic methods to initialize the weights for a neural network, yet there
is no best weight initialization scheme and little relationship beyond general guidelines
for mapping weight initialization schemes to the choice of activation function.

Prior to the wide adoption of ReLU, Xavier Glorot and Yoshua Bengio proposed an
initialization scheme in their 2010 paper titled “Understanding the difficulty of training
deep feedforward neural networks” that quickly became the default when using sigmoid
and tanh activation functions, generally referred to as “Xavier initialization“. Weights are
set at random values sampled uniformly from a range proportional to the size of the
number of nodes in the previous layer (specifically +/- 1/sqrt(n) where n is the number
of nodes in the prior layer).
Kaiming He, et al. in their 2015 paper titled “Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification” suggested that Xavier
initialization and other schemes were not appropriate for ReLU and extensions.
Glorot and Bengio proposed to adopt a properly scaled uniform distribution for
initialization. This is called “Xavier” initialization […]. Its derivation is based on the
assumption that the activations are linear. This assumption is invalid for ReLU

— Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015.

They proposed a small modification of Xavier initialization to make it suitable for use with
ReLU, now commonly referred to as “He initialization” (specifically
+/- sqrt(2/n) where n is the number of nodes in the prior layer known as the fan-in). In
practice, both Gaussian and uniform versions of the scheme can be used.
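
A minimal NumPy sketch of the uniform version, using the +/- sqrt(2/n) range described above (the layer sizes are placeholders):

# sketch: He (uniform) weight initialization for a layer with n inputs and m outputs
from numpy import sqrt
from numpy.random import uniform

n, m = 100, 50                # fan-in and fan-out (placeholder sizes)
limit = sqrt(2.0 / n)         # the +/- sqrt(2/n) range described above
weights = uniform(-limit, limit, (n, m))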
Scale Input Data
It is good practice to scale input data prior to using a neural network.

This may involve standardizing variables to have a zero mean and unit variance or
normalizing each value to the scale 0-to-1.
Without data scaling on many problems, the weights of the neural network can grow
large, making the network unstable and increasing the generalization error.

This good practice of scaling inputs applies whether using ReLU for your network or not.
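
For example, standardization might be sketched with scikit-learn as follows (the input array here is random placeholder data):

# sketch: standardize input features to zero mean and unit variance
from numpy.random import rand
from sklearn.preprocessing import StandardScaler

X = rand(100, 5) * 10.0              # placeholder input data with an arbitrary scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fit on training data, reuse scaler.transform() on new data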

Use Weight Penalty


By design, the output from ReLU is unbounded in the positive domain.

This means that in some cases, the output can continue to grow in size. As such, it may
be a good idea to use a form of weight regularization, such as an L1 or L2 vector norm.
Another problem could arise due to the unbounded behavior of the activations; one may
thus want to use a regularizer to prevent potential numerical problems. Therefore, we
use the L1 penalty on the activation values, which also promotes additional sparsity

— Deep Sparse Rectifier Neural Networks, 2011.


This can be a good practice to both promote sparse representations (e.g. with L1
regularization) and reduced generalization error of the model.
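
In Keras, for instance, this could be sketched by attaching an L2 penalty to the weights and an L1 penalty to the activations of a layer (the penalty values are placeholders, not recommendations from the paper quoted above):

# sketch: L2 penalty on the weights and L1 penalty on the activations of a ReLU layer
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

layer = Dense(32, activation='relu',
              kernel_regularizer=l2(0.01),     # penalize large weights
              activity_regularizer=l1(0.001))  # penalize large activations, promotes sparsity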

Extensions and Alternatives to ReLU


The ReLU does have some limitations.

Key among the limitations of ReLU is the case where large weight updates can mean that
the summed input to the activation function is always negative, regardless of the input to
the network.

This means that a node with this problem will forever output an activation value of 0.0.
This is referred to as a “dying ReLU“.
the gradient is 0 whenever the unit is not active. This could lead to cases where a unit
never activates as a gradient-based optimization algorithm will not adjust the weights of
a unit that never activates initially. Further, like the vanishing gradients problem, we
might expect learning to be slow when training ReL networks with constant 0 gradients.

— Rectifier Nonlinearities Improve Neural Network Acoustic Models, 2013.


Some popular extensions to the ReLU relax the non-linear output of the function to allow
small negative values in some way.

The Leaky ReLU (LReLU or LReL) modifies the function to allow small negative values
when the input is less than zero.

The leaky rectifier allows for a small, non-zero gradient when the unit is saturated and
not active
— Rectifier Nonlinearities Improve Neural Network Acoustic Models, 2013.
The Exponential Linear Unit, or ELU, is a generalization of the ReLU that uses a
parameterized exponential function to transition from the positive to small negative
values.

ELUs have negative values which pushes the mean of the activations closer to zero.
Mean activations that are closer to zero enable faster learning as they bring the gradient
closer to the natural gradient

— Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), 2016.
The Parametric ReLU, or PReLU, learns parameters that control the shape and leaky-ness
of the function.

… we propose a new generalization of ReLU, which we call Parametric Rectified Linear
Unit (PReLU). This activation function adaptively learns the parameters of the rectifiers

— Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015.

Maxout is an alternative piecewise linear function that returns the maximum of the
inputs, designed to be used in conjunction with the dropout regularization technique.

We define a simple new model called maxout (so named because its output is the max of
a set of inputs, and because it is a natural companion to dropout) designed to both
facilitate optimization by dropout and improve the accuracy of dropout’s fast
approximate model averaging technique.

— Maxout Networks, 2013.
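
To make the shapes of some of these variants concrete, here is a minimal NumPy sketch of ReLU, Leaky ReLU, and ELU; the slope of 0.01 and alpha of 1.0 are common default choices, not values prescribed by the papers quoted above:

# sketch: ReLU and two extensions that allow small negative outputs
from numpy import array, where, exp

def relu(x):
    return where(x > 0.0, x, 0.0)

def leaky_relu(x, slope=0.01):
    return where(x > 0.0, x, slope * x)       # small non-zero slope for negative inputs

def elu(x, alpha=1.0):
    return where(x > 0.0, x, alpha * (exp(x) - 1.0))  # smooth transition to -alpha

x = array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x))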


Further Reading
This section provides more resources on the topic if you are looking to go deeper.

Posts
 How to Fix Vanishing Gradients Using the Rectified Linear Activation Function
Books
 Section 6.3.1 Rectified Linear Units and Their Generalizations, Deep Learning, 2016.
Papers
 What is the best multi-stage architecture for object recognition?, 2009.
 Rectified Linear Units Improve Restricted Boltzmann Machines, 2010.
 Deep Sparse Rectifier Neural Networks, 2011.
 Rectifier Nonlinearities Improve Neural Network Acoustic Models, 2013.
 Understanding the difficulty of training deep feedforward neural networks, 2010.
 Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015.
 Maxout Networks, 2013.
API
 max API
Articles
 Neural Network FAQ
 Activation function, Wikipedia.
 Vanishing gradient problem, Wikipedia.
 Rectifier (neural networks), Wikipedia.
 Piecewise Linear Function, Wikipedia.
Summary
In this tutorial, you discovered the rectified linear activation function for deep learning
neural networks.

Specifically, you learned:

 The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the
vanishing gradient problem.
 The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn
faster and perform better.
 The rectified linear activation is the default activation when developing multilayer Perceptron and convolutional
neural networks.

Softmax Activation Function with Python


by Jason Brownlee on October 19, 2020 in Deep Learning

Softmax is a mathematical function that converts a vector of numbers into a vector of
probabilities, where the probabilities of each value are proportional to the relative scale
of each value in the vector.
The most common use of the softmax function in applied machine learning is in its use as
an activation function in a neural network model. Specifically, the network is configured
to output N values, one for each class in the classification task, and the softmax function
is used to normalize the outputs, converting them from weighted sum values into
probabilities that sum to one. Each value in the output of the softmax function is
interpreted as the probability of membership for each class.

In this tutorial, you will discover the softmax activation function used in neural network
models.

After completing this tutorial, you will know:

 Linear and Sigmoid activation functions are inappropriate for multi-class classification tasks.
 Softmax can be thought of as a softened version of the argmax function that returns the index of the largest
value in a list.
 How to implement the softmax function from scratch in Python and how to convert the output into a class label.
Let’s get started.

Tutorial Overview
This tutorial is divided into three parts; they are:

1. Predicting Probabilities With Neural Networks
2. Max, Argmax, and Softmax
3. Softmax Activation Function

Predicting Probabilities With Neural Networks


Neural network models can be used to model classification predictive modeling
problems.

Classification problems are those that involve predicting a class label for a given input. A
standard approach to modeling classification problems is to use a model to predict the
probability of class membership. That is, given an example, what is the probability of it
belonging to each of the known class labels?

 For a binary classification problem, a Binomial probability distribution is used. This is achieved using a network
with a single node in the output layer that predicts the probability of an example belonging to class 1.
 For a multi-class classification problem, a Multinomial probability is used. This is achieved using a network with
one node for each class in the output layer and the sum of the predicted probabilities equals one.
A neural network model requires an activation function in the output layer of the model to
make the prediction.

There are different activation functions to choose from; let’s look at a few.

Linear Activation Function


One approach to predicting class membership probabilities is to use a linear activation.
A linear activation function simply outputs the weighted sum of the input to the node, the
same quantity that any activation function takes as input. As such, it is often referred to
as “no activation function,” as no additional transformation is performed.
Recall that a probability or a likelihood is a numeric value between 0 and 1.
Given that no transformation is performed on the weighted sum of the input, it is possible
for the linear activation function to output any numeric value. This makes the linear
activation function inappropriate for predicting probabilities for either the binomial or
multinomial case.

Sigmoid Activation Function


Another approach to predicting class membership probabilities is to use a sigmoid
activation function.

This function is also called the logistic function. Regardless of the input, the function
always outputs a value between 0 and 1. The form of the function is an S-shape between
0 and 1 with the vertical or middle of the “S” at 0.5.
This allows very large values given as the weighted sum of the input to be output as 1.0
and very small or negative values to be mapped to 0.0.

The sigmoid activation is an ideal activation function for a binary classification problem
where the output is interpreted as a Binomial probability distribution.

The sigmoid activation function can also be used as an activation function for multi-class
classification problems where classes are non-mutually exclusive. These are often
referred to as a multi-label classification rather than multi-class classification.

The sigmoid activation function is not appropriate for multi-class classification problems
with mutually exclusive classes where a multinomial probability distribution is required.

Instead, an alternate activation is required called the softmax function.


Max, Argmax, and Softmax
Max Function
The maximum, or “max,” mathematical function returns the largest numeric value for a
list of numeric values.
We can implement this using the max() Python function; for example:
# example of the max of a list of numbers
# define data
data = [1, 3, 2]
# calculate the max of the list
result = max(data)
print(result)

Running the example returns the largest value “3” from the list of numbers.

3

Argmax Function
The argmax, or “arg max,” mathematical function returns the index in the list that
contains the largest value.
Think of it as the meta version of max: one level of indirection above max, pointing to the
position in the list that has the max value rather than the value itself.

We can implement this using the argmax() NumPy function; for example:
# example of the argmax of a list of numbers
from numpy import argmax
# define data
data = [1, 3, 2]
# calculate the argmax of the list
result = argmax(data)
print(result)

Running the example returns the index value 1, which points to the array position data[1]
containing the largest value in the list, 3.

1

Softmax Function
The softmax, or “soft max,” mathematical function can be thought to be a probabilistic or
“softer” version of the argmax function.
The term softmax is used because this activation function represents a smooth version
of the winner-takes-all activation model in which the unit with the largest input has
output +1 while all other units have output 0.

— Page 238, Neural Networks for Pattern Recognition, 1995.


From a probabilistic perspective, the argmax() result from the previous section gives full
weight (a probability of 1) to index 1, which holds the largest value in the list [1, 3, 2],
and no weight (a probability of 0) to index 0 and index 2:
[0, 1, 0]

What if we were less sure and wanted to express the argmax probabilistically, with
likelihoods?

This can be achieved by scaling the values in the list and converting them into
probabilities such that all values in the returned list sum to 1.0.
This can be achieved by calculating the exponent of each value in the list and dividing it
by the sum of the exponent values.

 probability = exp(value) / sum(exp(v) for v in list)


For example, we can turn the first value “1” in the list [1, 3, 2] into a probability as
follows:

 probability = exp(1) / (exp(1) + exp(3) + exp(2))
 probability = 2.718281828459045 / 30.19287485057736
 probability = 0.09003057317038046
We can demonstrate this for each value in the list [1, 3, 2] in Python as follows:

# transform values into probabilities
from math import exp
# calculate each probability
p1 = exp(1) / (exp(1) + exp(3) + exp(2))
p2 = exp(3) / (exp(1) + exp(3) + exp(2))
p3 = exp(2) / (exp(1) + exp(3) + exp(2))
# report probabilities
print(p1, p2, p3)
# report sum of probabilities
print(p1 + p2 + p3)

Running the example converts each value in the list into a probability and reports the
values, then confirms that all probabilities sum to the value 1.0.

We can see that most weight is put on index 1 (67 percent) with less weight on index 2
(24 percent) and even less on index 0 (9 percent).

0.09003057317038046 0.6652409557748219 0.24472847105479767
1.0

This is the softmax function.

We can implement it as a function that takes a list of numbers and returns the softmax or
multinomial probability distribution for the list.

The example below implements the function and demonstrates it on our small list of
numbers.

# example of a function for calculating softmax for a list of numbers
from numpy import exp

# calculate the softmax of a vector
def softmax(vector):
    e = exp(vector)
    return e / e.sum()

# define data
data = [1, 3, 2]
# convert list of numbers to a list of probabilities
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))

Running the example reports roughly the same numbers with minor differences in
precision.

[0.09003057 0.66524096 0.24472847]
1.0

Finally, we can use the softmax() SciPy function to calculate the softmax for an
array or list of numbers, as follows:
# example of calculating the softmax for a list of numbers
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))

Running the example, again, we get very similar results with very minor differences in
precision.

[0.09003057 0.66524096 0.24472847]
0.9999999999999997

Now that we are familiar with the softmax function, let’s look at how it is used in a neural
network model.

Softmax Activation Function


The softmax function is used as the activation function in the output layer of neural
network models that predict a multinomial probability distribution.

That is, softmax is used as the activation function for multi-class classification problems
where class membership is required on more than two class labels.

Any time we wish to represent a probability distribution over a discrete variable with n
possible values, we may use the softmax function. This can be seen as a generalization
of the sigmoid function which was used to represent a probability distribution over a
binary variable.

— Page 184, Deep Learning, 2016.


The function can be used as an activation function for a hidden layer in a neural network,
although this is less common. It may be used when the model internally needs to choose
or weight multiple different inputs at a bottleneck or concatenation layer.

Softmax units naturally represent a probability distribution over a discrete variable with k
possible values, so they may be used as a kind of switch.

— Page 196, Deep Learning, 2016.


In the Keras deep learning library with a three-class classification task, use of softmax in
the output layer may look as follows:

...
model.add(Dense(3, activation='softmax'))

By definition, the softmax activation will output one value for each node in the output
layer. The output values will represent (or can be interpreted as) probabilities and the
values sum to 1.0.

When modeling a multi-class classification problem, the data must be prepared. The
target variable containing the class labels is first label encoded, meaning that an integer
is applied to each class label from 0 to N-1, where N is the number of class labels.

The label encoded (or integer encoded) target variables are then one-hot encoded. This is
a probabilistic representation of the class label, much like the softmax output. A vector is
created with a position for each class label: all positions are marked 0 (impossible) and a
1 (certain) marks the position of the class label.

For example, three class labels will be integer encoded as 0, 1, and 2. Then encoded to
vectors as follows:

 Class 0: [1, 0, 0]
 Class 1: [0, 1, 0]
 Class 2: [0, 0, 1]
This is called a one-hot encoding.
It represents the expected multinomial probability distribution for each class used to
correct the model under supervised learning.
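
For example, this encoding can be sketched with the to_categorical() utility in Keras, assuming integer encoded labels:

# sketch: one hot encode integer class labels 0, 1 and 2
from numpy import array
from tensorflow.keras.utils import to_categorical

labels = array([0, 1, 2, 1])             # placeholder integer encoded labels
onehot = to_categorical(labels, num_classes=3)
print(onehot)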

The softmax function will output a probability of class membership for each class label
and attempt to best approximate the expected target for a given input.

For example, if the integer encoded class 1 was expected for one example, the target
vector would be:

 [0, 1, 0]
The softmax output might look as follows, which puts the most weight on class 1 and
less weight on the other classes.

 [0.09003057 0.66524096 0.24472847]


The error between the expected and predicted multinomial probability distribution is
often calculated using cross-entropy, and this error is then used to update the model.
This is called the cross-entropy loss function.

For more on cross-entropy for calculating the difference between probability


distributions, see the tutorial:

 A Gentle Introduction to Cross-Entropy for Machine Learning


We may want to convert the probabilities back into an integer encoded class label.

This can be achieved using the argmax() function that returns the index of the list with
the largest value. Given that the class labels are integer encoded from 0 to N-1, the
argmax of the probabilities will always be the integer encoded class label.
 class integer = argmax([0.09003057 0.66524096 0.24472847])
 class integer = 1
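
A small sketch of this reverse mapping, reusing the softmax output from the earlier example:

# sketch: convert softmax probabilities back into an integer class label
from numpy import argmax

probabilities = [0.09003057, 0.66524096, 0.24472847]
class_integer = argmax(probabilities)    # index of the largest probability
print(class_integer)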

Further Reading
This section provides more resources on the topic if you are looking to go deeper.

Books
 Neural Networks for Pattern Recognition, 1995.
 Neural Networks: Tricks of the Trade: Tricks of the Trade, 2nd Edition, 2012.
 Deep Learning, 2016.
APIs
 numpy.argmax API.
 scipy.special.softmax API.
Articles
 Softmax function, Wikipedia.
Summary
In this tutorial, you discovered the softmax activation function used in neural network
models.

Specifically, you learned:

 Linear and Sigmoid activation functions are inappropriate for multi-class classification tasks.
 Softmax can be thought of as a softened version of the argmax function that returns the index of the largest
value in a list.
 How to implement the softmax function from scratch in Python and how to convert the output into a class label.

Activation Functions in Neural Networks


Rahul Jain

Senior Business Data Analyst at Intuit | MTech in Data Science


Published on June 26, 2019

If you are familiar with how neural networks work, one of the most important decisions you have to make is
which activation function to use in the various layers. As you may be aware, a neural network is broadly built
out of three layers:

1. Input Layer: This layer just takes input from the outside world, doesn't do any
computation by itself, and passes the information to the hidden layers.
2. Hidden Layers: This set of layers accepts the input information from the input layer,
does all the computation and sends the output to the output layer. These layers are not
visible to the outside world and are part of the abstraction provided by the neural
network.
3. Output Layer: This layer accepts the input from the hidden layers and provides the output
to the outside world in the desired range.
Why use an Activation Function?

Activation functions have the important task of deciding whether a neuron should be activated or not. They are
basically used to introduce non-linearity into the output of a neuron.

Why introduce non-linearity, and what does it mean?

If we look at how we take the weighted sum of the inputs and the bias, we see that it is linear in nature:
Z = WX + B, where W is the vectorised representation of the weights, X is the input features (or the outputs of
the previous layer's activation functions), and B is the bias associated with each node. This is a linear relation,
similar to Y = mx + c. If we were to remove the activation functions from the network, the output of the network
would again be a linear equation. Going ahead with this approach has two issues: i) the network will not be
capable of understanding the intricacies of the features, and ii) the derivative of a linear function is a
constant, which leads to issues in backpropagation, the procedure used to tune the network with gradient
descent. For these two reasons, we introduce non-linear activation functions in a neural network.

Now that we understand why we need non-linear activation functions, let's look at the various options available
to us.

1. Sigmoid Function
2. Tanh Function
3. Relu Function
4. Softmax Function
Lets look at each of them in detail:

1. Sigmoid Function : The graph for the sigmoid function is as below

Important points to remember about the sigmoid function are:

1. The equation for the Sigmoid function is A = 1/(1 + e^-z)


2. It's a non-linear, 'S'-shaped curve, with minimum and maximum values of 0 and 1
respectively.
3. It's used mostly in binary classification problems. It's almost always used in the output layer
rather than the hidden layers, as there are better activation functions for the hidden layers.
Whenever you have two output classes in your classification problem, you can use this
function without thinking twice.
4. The derivative of the sigmoid function is g'(z) = g(z) * (1 - g(z)); a small NumPy sketch of the function and its derivative follows below.
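
Here is a minimal NumPy sketch of the sigmoid function and its derivative, following the equations in points 1 and 4 above:

# sketch: sigmoid activation and its derivative
from numpy import array, exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def sigmoid_derivative(z):
    g = sigmoid(z)
    return g * (1.0 - g)          # g'(z) = g(z) * (1 - g(z))

z = array([-2.0, 0.0, 2.0])
print(sigmoid(z), sigmoid_derivative(z))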
2. Tanh Function : The graph of the function is as below:

Important points to remember for tanh function:

1. It's called the hyperbolic tangent function and can be seen as a scaled and shifted sigmoid. The
equation is (e^z - e^-z)/(e^z + e^-z), which equals 2 * sigmoid(2z) - 1.
2. Its value range is between -1 and 1.
3. It's a non-linear function which is mostly used in the hidden layers. Since its values lie
between -1 and 1, it helps centre the data around 0, which makes learning for the next layer
easier.
4. The derivative of the tanh function is g'(z) = 1 - (g(z))^2; a small NumPy sketch follows below.
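
And a matching NumPy sketch of tanh and its derivative:

# sketch: tanh activation and its derivative
from numpy import array, tanh

def tanh_derivative(z):
    g = tanh(z)
    return 1.0 - g ** 2           # g'(z) = 1 - (g(z))^2

z = array([-2.0, 0.0, 2.0])
print(tanh(z), tanh_derivative(z))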
3. ReLU Function : The graph of the function is as below:

1. It's called the Rectified Linear Unit. It's the most widely used activation in the hidden
layers.
2. The formula is g(z) = max(0, z). That means if the z value is negative, the function
outputs 0, and if it is non-negative, it outputs the same value.
3. ReLU is less computationally expensive compared to the sigmoid and tanh functions.
Fewer neurons get activated, so the network becomes sparse and easier to
back-propagate through.
4. The derivative of the ReLU function is 0 (if the value is negative) and 1 (if the value is
positive).
4. Softmax Function : The formula for softmax function is as below:

1. The softmax function is also a type of sigmoid function but is handy when we are trying
to handle multi-class classification problems. It is used when there are more than 2 classes in the
output layer.
2. This function squeezes the output probabilities for each class between 0 and 1, and they
all add up to one. It takes the exponent of the weighted sum (of the previous layer's
activations, weights and bias) for each class node and divides it by the sum of the
exponents over all the class nodes.
Finally some tips on how to use these activation functions:

1. In almost all situations, you can use the ReLU activation function in the hidden layers.
As a best practice you should still try a few other activation functions, but more often
than not you will get better performance with ReLU.
2. For the output layer, if you are performing a classification problem, you need to check
how many classes there are in the target variable. If there are just two, use the sigmoid
function, and if there are more, use softmax. For example, if you are doing image
classification (dogs vs. cats) using a CNN, you should use the sigmoid function, and if, say,
you are doing handwritten digit recognition on the MNIST dataset, where there are 10 output
classes (0, 1, 2, ..., 9), then you should use softmax. One thing to take care of: when you
choose your activation function, you also need to choose your loss function accordingly.
In the case of two classes you can use binary_crossentropy, and in the case of more
classes you can use categorical_crossentropy. Also, don't forget to one-hot encode
your output variable when using categorical_crossentropy.
3. Finally, if you are performing regression with a neural network, you should use the linear
activation function for the output layer, meaning you simply output the weighted sum.
Hopefully the above article was helpful in understanding the various activation function choices we have in
neural networks and the selection criteria.
Activation Functions in Neural
Networks [12 Types & Use
Cases]
What is a neural network activation function and how does it work? Explore twelve
different types of activation functions and learn how to pick the right one.

“The world is one big data problem.”

As it turns out—

This saying holds true both for our brains as well as machine learning.

Every single moment our brain is trying to segregate the incoming information into the “useful” and
“not-so-useful” categories.


A similar process occurs in artificial neural network architectures in deep learning.
The segregation plays a key role in helping a neural network properly function, ensuring that it
learns from the useful information rather than get stuck analyzing the not-useful part.

And this is also where activation functions come into the picture.

💡 Activation Function helps the neural network to use important information while
suppressing irrelevant data points.

Sounds a little confusing? Worry not!

Here’s what we’ll cover:


1. What is a Neural Network Activation Function?
2. Why do Neural Networks Need an Activation Function?
3. 3 Types of Neural Networks Activation Functions
4. Why are deep neural networks hard to train?
5. How to choose the right Activation Function?
6. Neural Networks Activation Functions in a Nutshell

Ready? Let’s get started :)

What is a Neural Network Activation Function?


An Activation Function decides whether a neuron should be activated or not. This means that it
will decide whether the neuron’s input to the network is important or not in the process of prediction
using simpler mathematical operations.

The role of the Activation Function is to derive output from a set of input values fed to a node (or a
layer).

But—
Let’s take a step back and clarify: What exactly is a node?

Well, if we compare the neural network to our brain, a node is a replica of a neuron that receives a
set of input signals—external stimuli.

Depending on the nature and intensity of these input signals, the brain processes them and decides
whether the neuron should be activated (“fired”) or not.

In deep learning, this is also the role of the Activation Function—that’s why it’s often referred to as
a Transfer Function in Artificial Neural Network.

The primary role of the Activation Function is to transform the summed weighted input from the
node into an output value to be fed to the next hidden layer or as output.
Now, let's have a look at the Neural Networks Architecture.

Elements of a Neural Networks Architecture


Here’s the thing—

If you don’t understand the concept of neural networks and how they work, diving deeper into the
topic of activation functions might be challenging.

That’s why it’s a good idea to refresh your knowledge and take a quick look at the structure of the
Neural Networks Architecture and its components. Here it is.
In the image above, you can see a neural network made of interconnected neurons. Each of them
is characterized by its weight, bias, and activation function.

Here are other elements of this network.


Input Layer
The input layer takes raw input from the domain. No computation is performed at this layer. Nodes
here just pass on the information (features) to the hidden layer.
Hidden Layer
As the name suggests, the nodes of this layer are not exposed. They provide an abstraction to the
neural network.

The hidden layer performs all kinds of computation on the features entered through the input layer
and transfers the result to the output layer.
Output Layer
It’s the final layer of the network that brings the information learned through the hidden layer and
delivers the final value as a result.

📢 Note: All hidden layers usually use the same activation function. However, the output layer will

typically use a different activation function from the hidden layers. The choice depends on the goal
or type of prediction made by the model.
Feedforward vs. Backpropagation
When learning about neural networks, you will come across two essential terms describing the
movement of information—feedforward and backpropagation.

Let’s explore them.

💡 Feedforward Propagation - the flow of information occurs in the forward direction. The
input is used to calculate some intermediate function in the hidden layer, which is then
used to calculate an output.

In the feedforward propagation, the Activation Function is a mathematical “gate” in between the
input feeding the current neuron and its output going to the next layer.


💡 Backpropagation - the weights of the network connections are repeatedly adjusted to
minimize the difference between the actual output vector of the net and the desired
output vector.

To put it simply—backpropagation aims to minimize the cost function by adjusting the network’s
weights and biases. The cost function gradients determine the level of adjustment with respect to
parameters like activation function, weights, bias, etc.

Why do Neural Networks Need an Activation Function?


So we know what Activation Function is and what it does, but—

Why do Neural Networks need it?

Well, the purpose of an activation function is to add non-linearity to the neural network.

Activation functions introduce an additional step at each layer during the forward propagation, but its
computation is worth it. Here is why—

Let’s suppose we have a neural network working without the activation functions.
In that case, every neuron will only be performing a linear transformation on the inputs using the
weights and biases. It’s because it doesn’t matter how many hidden layers we attach in the neural
network; all layers will behave in the same way because the composition of two linear functions is a
linear function itself.

Although the neural network becomes simpler, learning any complex task is impossible, and our
model would be just a linear regression model.

3 Types of Neural Networks Activation Functions


Now, as we’ve covered the essential concepts, let’s go over the most popular neural networks
activation functions.

Binary Step Function


Binary step function depends on a threshold value that decides whether a neuron should be
activated or not.

The input fed to the activation function is compared to a certain threshold; if the input is greater than
it, then the neuron is activated, else it is deactivated, meaning that its output is not passed on to the
next hidden layer.

Binary Step Function


Mathematically it can be represented as:
Here are some of the limitations of binary step function:
 It cannot provide multi-value outputs—for example, it cannot be used for multi-class
classification problems.
 The gradient of the step function is zero, which causes a hindrance in the
backpropagation process.

Linear Activation Function


The linear activation function, also known as "no activation" or the "identity function" (the input is
simply multiplied by 1.0), is one where the activation is proportional to the input.

The function doesn't do anything to the weighted sum of the input, it simply spits out the value it was
given.

Linear Activation Function

Mathematically it can be represented as:


However, a linear activation function has two major problems :
 It’s not possible to use backpropagation as the derivative of the function is a constant
and has no relation to the input x.
 All layers of the neural network will collapse into one if a linear activation function is
used. No matter the number of layers in the neural network, the last layer will still be a
linear function of the first layer. So, essentially, a linear activation function turns the
neural network into just one layer.

Non-Linear Activation Functions


The linear activation function shown above is simply a linear regression model.

Because of its limited power, this does not allow the model to create complex mappings between
the network’s inputs and outputs.

Non-linear activation functions solve the following limitations of linear activation functions:
 They allow backpropagation because now the derivative function would be related to the
input, and it’s possible to go back and understand which weights in the input neurons
can provide a better prediction.
 They allow the stacking of multiple layers of neurons as the output would now be a non-
linear combination of input passed through multiple layers. Any output can be
represented as a functional computation in a neural network.

Now, let’s have a look at ten different non-linear neural networks activation functions and their
characteristics.

10 Non-Linear Neural Networks Activation Functions


Sigmoid / Logistic Activation Function
This function takes any real value as input and outputs values in the range of 0 to 1.
The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller
the input (more negative), the closer the output will be to 0.0, as shown below.

Sigmoid/Logistic Activation Function


Mathematically it can be represented as:

Here’s why sigmoid/logistic activation function is one of the most widely used functions:
 It is commonly used for models where we have to predict the probability as an output.
Since probability of anything exists only between the range of 0 and 1, sigmoid is the
right choice because of its range.
 The function is differentiable and provides a smooth gradient, i.e., preventing jumps in
output values. This is represented by an S-shape of the sigmoid activation function.

The limitations of sigmoid function are discussed below:


 The derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x)).
The derivative of the Sigmoid Activation Function

As we can see from the above Figure, the gradient values are only significant for range -3 to 3, and
the graph gets much flatter in other regions.

It implies that for values greater than 3 or less than -3, the function will have very small gradients.
As the gradient value approaches zero, the network ceases to learn and suffers from the Vanishing
gradient problem.
 The output of the logistic function is not symmetric around zero. So the output of all the
neurons will be of the same sign. This makes the training of the neural network more
difficult and unstable.

Tanh Function (Hyperbolic Tangent)


Tanh function is very similar to the sigmoid/logistic activation function, and even has the same S-
shape with the difference in output range of -1 to 1. In Tanh, the larger the input (more positive), the
closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the
output will be to -1.0.
Tanh Function (Hyperbolic Tangent)
Mathematically it can be represented as:

Advantages of using this activation function are:


 The output of the tanh activation function is Zero centered; hence we can easily map the
output values as strongly negative, neutral, or strongly positive.
 Usually used in hidden layers of a neural network as its values lie between -1 and 1;
therefore, the mean for the hidden layer comes out to be 0 or very close to it. It helps in
centering the data and makes learning for the next layer much easier.

Have a look at the gradient of the tanh activation function to understand its limitations.
Gradient of the Tanh Activation Function
As you can see— it also faces the problem of vanishing gradients similar to the sigmoid activation
function. Plus the gradient of the tanh function is much steeper as compared to the sigmoid
function.

💡 Note: Although both sigmoid and tanh face vanishing gradient issue, tanh is zero
centered, and the gradients are not restricted to move in a certain direction. Therefore,
in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity.

ReLU Function
ReLU stands for Rectified Linear Unit.

Although it gives an impression of a linear function, ReLU has a derivative function and allows for
backpropagation while simultaneously making it computationally efficient.

The main catch here is that the ReLU function does not activate all the neurons at the same time.

The neurons will only be deactivated if the output of the linear transformation is less than 0.
ReLU Activation Function
Mathematically it can be represented as:

The advantages of using ReLU as an activation function are as follows:


 Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
 ReLU accelerates the convergence of gradient descent towards the global minimum of
the loss function due to its linear, non-saturating property.

The limitations faced by this function are:


 The Dying ReLU problem, which is explained below.
The Dying ReLU problem

The negative side of the graph makes the gradient value zero. Due to this reason, during the
backpropagation process, the weights and biases for some neurons are not updated. This can
create dead neurons which never get activated.
 All the negative input values become zero immediately, which decreases the model’s
ability to fit or train from the data properly.

Note: For building the most reliable ML models, split your data into train, validation, and test sets.
Leaky ReLU Function
Leaky ReLU is an improved version of ReLU function to solve the Dying ReLU problem as it has a
small positive slope in the negative area.
Leaky ReLU
Mathematically it can be represented as:

The advantages of Leaky ReLU are same as that of ReLU, in addition to the fact that it does enable
backpropagation, even for negative input values.

By making this minor modification for negative input values, the gradient of the left side of the graph
comes out to be a non-zero value. Therefore, we would no longer encounter dead neurons in that
region.

Here is the derivative of the Leaky ReLU function.


The derivative of the Leaky ReLU function

The limitations that this function faces include:


 The predictions may not be consistent for negative input values.
 The gradient for negative values is a small value that makes the learning of model
parameters time-consuming.

Parametric ReLU Function


Parametric ReLU is another variant of ReLU that aims to solve the problem of gradient’s becoming
zero for the left half of the axis.

This function provides the slope of the negative part of the function as an argument a. By
performing backpropagation, the most appropriate value of a is learnt.
Parametric ReLU

Mathematically it can be represented as:

Where "a" is the slope parameter for negative values.

The parameterized ReLU function is used when the leaky ReLU function still fails at solving the
problem of dead neurons, and the relevant information is not successfully passed to the next layer.

This function’s limitation is that it may perform differently for different problems depending upon the
value of slope parameter a.
Exponential Linear Units (ELUs) Function
Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope of the
negative part of the function.

ELU uses an exponential curve to define the negative values, unlike the Leaky ReLU and Parametric ReLU
functions, which use a straight line.

ELU Activation Function

Mathematically it can be represented as:


ELU is a strong alternative to ReLU because of the following advantages:
 ELU becomes smooth slowly until its output equals -α, whereas ReLU smoothes sharply.
 It avoids the dead ReLU problem by introducing an exponential curve for negative input values. It helps
the network nudge weights and biases in the right direction.

The limitations of the ELU function are as follow:


 It increases the computational time because of the exponential operation included
 No learning of the ‘a’ value takes place
 Exploding gradient problem

ELU Activation Function and its derivative


Softmax Function
Before exploring the ins and outs of the Softmax activation function, we should focus on its building
block—the sigmoid/logistic activation function that works on calculating probability values.

Probability

The output of the sigmoid function was in the range of 0 to 1, which can be thought of as
probability.

But—

This function faces certain problems.

Let’s suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we
move forward with it?

The answer is: We can’t.

The above values don’t make sense as the sum of all the classes/output probabilities should be
equal to 1.

You see, the Softmax function is described as a combination of multiple sigmoids.

It calculates the relative probabilities. Similar to the sigmoid/logistic activation function, the SoftMax
function returns the probability of each class.

It is most commonly used as an activation function for the last layer of the neural network in the
case of multi-class classification.

Mathematically it can be represented as:

Softmax Function

Let’s go over a simple example together.


Assume that you have three classes, meaning that there would be three neurons in the output
layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].

Applying the softmax function over these values to give a probabilistic view will result in the
following outcome: [0.58, 0.23, 0.19].

Taking the argmax of these probabilities returns index 0, the position with the largest probability,
giving full weight to index 0 and no weight to index 1 and index 2. So the output
would be the class corresponding to the 1st neuron (index 0) out of three.

You can see now how the softmax activation function makes things easy for multi-class classification
problems.
Swish
It is a self-gated activation function developed by researchers at Google.

Swish consistently matches or outperforms ReLU activation function on deep networks applied to
various challenging domains such as image classification, machine translation etc.

Swish Activation Function


This function is bounded below but unbounded above, i.e. Y approaches a constant value
as X approaches negative infinity, but Y approaches infinity as X approaches infinity.

Mathematically it can be represented as:

Here are a few advantages of the Swish activation function over ReLU:
 Swish is a smooth function that means that it does not abruptly change direction like
ReLU does near x = 0. Rather, it smoothly bends from 0 towards values < 0 and then
upwards again.
 In ReLU, all negative values are zeroed out. However, small negative values may still be
relevant for capturing patterns underlying the data; Swish preserves them, while large
negative values are still zeroed out for reasons of sparsity, making it a win-win situation.
 The swish function being non-monotonous enhances the expression of input data and
weight to be learnt.
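
A minimal NumPy sketch of Swish in its simplest, non-parameterized form, x * sigmoid(x):

# sketch: Swish activation in its simplest form, x * sigmoid(x)
from numpy import array, exp

def swish(x):
    return x / (1.0 + exp(-x))    # equivalent to x * sigmoid(x)

print(swish(array([-5.0, -1.0, 0.0, 1.0, 5.0])))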

Gaussian Error Linear Unit (GELU)


The Gaussian Error Linear Unit (GELU) activation function is compatible with BERT, ROBERTa,
ALBERT, and other top NLP models. This activation function is motivated by combining properties
from dropout, zoneout, and ReLUs.

ReLU and dropout together yield a neuron’s output. ReLU does this deterministically by multiplying the
input by zero or one (depending on whether the input value is negative or positive), while dropout
stochastically multiplies the input by zero.

An RNN regularizer called zoneout stochastically multiplies inputs by one.

GELU merges this functionality by multiplying the input by either zero or one, where the choice is
stochastically determined and dependent upon the input. We multiply the neuron input x by
m ∼ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x), X ∼ N(0, 1), is the cumulative distribution function of the
standard normal distribution.

This distribution is chosen since neuron inputs tend to follow a normal distribution, especially with
Batch Normalization.

Gaussian Error Linear Unit (GELU) Activation Function

Mathematically it can be represented as:


GELU nonlinearity is better than ReLU and ELU activations and finds performance improvements
across all tasks in domains of computer vision, natural language processing, and speech
recognition.
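
A minimal NumPy sketch using the commonly used tanh approximation of GELU (this is an approximation of, not the exact, Φ(x)-based formulation described above):

# sketch: GELU via the common tanh approximation
from numpy import array, pi, sqrt, tanh

def gelu(x):
    return 0.5 * x * (1.0 + tanh(sqrt(2.0 / pi) * (x + 0.044715 * x ** 3)))

print(gelu(array([-2.0, -0.5, 0.0, 0.5, 2.0])))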
Scaled Exponential Linear Unit (SELU)
SELU was defined in self-normalizing networks and takes care of internal normalization which
means each layer preserves the mean and variance from the previous layers. SELU enables this
normalization by adjusting the mean and variance.

SELU has both positive and negative values to shift the mean, which was impossible for ReLU
activation function as it cannot output negative values.

Gradients can be used to adjust the variance. The activation function needs a region with a gradient
larger than one to increase it.

SELU Activation Function

Mathematically it can be represented as:



SELU has values of alpha α and lambda λ predefined.

Here’s the main advantage of SELU over ReLU:


 Internal normalization is faster than external normalization, which means the network
converges faster.

SELU is a relatively newer activation function and needs more research on architectures such as
CNNs and RNNs, where it is comparatively less explored.
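
A minimal NumPy sketch of SELU using the fixed alpha and lambda constants from the self-normalizing networks paper (alpha ≈ 1.6733, lambda ≈ 1.0507):

# sketch: SELU with the fixed alpha and lambda constants from the SELU paper
from numpy import array, exp, where

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    return LAMBDA * where(x > 0.0, x, ALPHA * (exp(x) - 1.0))

print(selu(array([-2.0, -0.5, 0.0, 0.5, 2.0])))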

Why are deep neural networks hard to train?


There are two challenges you might encounter when training your deep neural networks.

Let's discuss them in more detail.


Vanishing Gradients
Like the sigmoid function, certain activation functions squish an ample input space into a small
output space between 0 and 1.

Therefore, a large change in the input of the sigmoid function will cause a small change in the
output. Hence, the derivative becomes small. For shallow networks with only a few layers that use
these activations, this isn’t a big problem.

However, when more layers are used, it can cause the gradient to be too small for training to work
effectively.
Exploding Gradients
Exploding gradients are problems where significant error gradients accumulate and result in very
large updates to neural network model weights during training.
An unstable network can result when there are exploding gradients, and the learning cannot be
completed.

The values of the weights can also become so large as to overflow and result in something called
NaN values.

How to choose the right Activation Function?


You need to match your activation function for your output layer based on the type of prediction
problem that you are solving—specifically, the type of predicted variable.

Here’s what you should keep in mind.

As a rule of thumb, you can begin with using the ReLU activation function and then move over to
other activation functions if ReLU doesn’t provide optimum results.

And here are a few other guidelines to help you out.


1. ReLU activation function should only be used in the hidden layers.
2. Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make
the model more susceptible to problems during training (due to vanishing gradients).
3. Swish function is used in neural networks having a depth greater than 40 layers.

Finally, a few rules for choosing the activation function for your output layer based on the type of
prediction problem that you are solving:
1. Regression - Linear Activation Function
2. Binary Classification—Sigmoid/Logistic Activation Function
3. Multiclass Classification—Softmax
4. Multilabel Classification—Sigmoid

The activation function used in hidden layers is typically chosen based on the type of neural
network architecture:
1. Convolutional Neural Network (CNN): ReLU activation function.
2. Recurrent Neural Network (RNN): Tanh and/or Sigmoid activation function.
A small code sketch of the output-layer choices above follows below.
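
As a rough sketch of how these output-layer rules might translate into code, assuming the Keras API (the layer sizes and class counts are placeholders):

# sketch: matching the output layer activation to the prediction problem (Keras assumed)
from tensorflow.keras.layers import Dense

regression_output = Dense(1, activation='linear')     # regression
binary_output = Dense(1, activation='sigmoid')        # binary classification
multiclass_output = Dense(10, activation='softmax')   # multiclass classification (10 classes)
multilabel_output = Dense(10, activation='sigmoid')   # multilabel classification (10 labels)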

And hey—use this cheatsheet to consolidate all the knowledge on the Neural Network Activation
Functions that you've just acquired :)
Neural Network Activation Functions: Cheat Sheet

Neural Networks Activation Functions in a Nutshell


Well done!

You’ve made it this far ;-) Now, let’s have a quick recap of everything you’ve learnt in this tutorial:
 Activation Functions are used to introduce non-linearity in the network.
 A neural network will almost always have the same activation function in all hidden
layers. This activation function should be differentiable so that the parameters of the
network are learned in backpropagation.
 ReLU is the most commonly used activation function for hidden layers.
 While selecting an activation function, you must consider the problems it might face:
vanishing and exploding gradients.
 Regarding the output layer, we must always consider the expected value range of the
predictions. If it can be any numeric value (as in case of the regression problem) you can
use the linear activation function or ReLU.
 Use Softmax or Sigmoid function for the classification problems.
