Types of Neural Network Activation Functions: How to Choose?
What is a neural network activation function and how does it work? Explore twelve different types of activation functions and learn how to pick the right one.
1. What is a Neural Network Activation Function?
An Activation Function decides whether a neuron should be activated or not. This means that it decides whether the neuron's input to the network is important for the prediction, using simpler mathematical operations.
The role of the Activation Function is to derive output from a set of input values fed to a node (or a layer). More specifically, it transforms the summed weighted input of the node into an output value that is fed to the next hidden layer or returned as the final output.
2. Why do Neural Networks Need an Activation Function?
The purpose of an activation function is to add non-linearity to the neural network.
3. Types of Neural Network Activation Functions
Let's go over the most popular neural network activation functions.
Binary Step Function
The binary step function depends on a threshold value that decides whether a neuron should be activated or not.
The input fed to the activation function is compared to a certain threshold; if the input is greater than the threshold, the neuron is activated; otherwise it is deactivated, meaning that its output is not passed on to the next hidden layer.
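As a quick illustration, here is a minimal NumPy sketch of a binary step function; the threshold of 0 is an assumption for the example, not something fixed by the text above:

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Activate (output 1) only when the input exceeds the threshold."""
    return np.where(x > threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.3, 4.0])))  # [0 0 1 1]
```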
Here are some of the limitations of the binary step function:
❖ It cannot provide multi-value outputs—for
example, it cannot be used for multi-class
classification problems.
❖ The gradient of the step function is zero,
which causes a hindrance in the
backpropagation process.
Linear Activation Function
The linear activation function, also known as the identity function, is one where the activation is proportional to the input.
However, a linear activation function has two major problems:
● It's not possible to use backpropagation
as the derivative of the function is a
constant and has no relation to the input
x.
● All layers of the neural network will
collapse into one if a linear activation
function is used. No matter the number
of layers in the neural network, the last
layer will still be a linear function of the
first layer. So, essentially, a linear
activation function turns the neural
network into just one layer.
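To make the second point concrete, here is a small NumPy sketch (the weight matrices are made up purely for illustration) showing that two stacked linear layers are equivalent to a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))    # toy input
W1 = rng.normal(size=(4, 8))   # weights of a first "linear" layer
W2 = rng.normal(size=(8, 3))   # weights of a second "linear" layer

two_layers = (x @ W1) @ W2     # two stacked layers with no non-linearity in between
one_layer = x @ (W1 @ W2)      # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True
```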
Non-Linear Activation Functions
Non-linear activation functions solve the following limitations of linear activation functions:
➔ They allow backpropagation because
now the derivative function would be
related to the input, and it’s possible to go
back and understand which weights in
the input neurons can provide a better
prediction.
➔ They allow the stacking of multiple layers
of neurons as the output would now be a
non-linear combination of input passed
through multiple layers. Any output can
be represented as a functional
computation in a neural network.
Sigmoid / Logistic Activation Function
This function takes any real value as input and outputs values in the range of 0 to 1.
The larger the input (more positive), the closer
the output value will be to 1.0, whereas the
smaller the input (more negative), the closer
the output will be to 0.0, as shown below.
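For reference, here is a minimal NumPy sketch of the sigmoid function using the standard definition f(x) = 1 / (1 + e^(-x)), which is not written out explicitly above:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(np.round(sigmoid(np.array([-6.0, -1.0, 0.0, 1.0, 6.0])), 4))
# [0.0025 0.2689 0.5    0.7311 0.9975]
```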
Here's why the sigmoid/logistic activation function is one of the most widely used functions:
● It is commonly used for models where we
have to predict the probability as an
output. Since the probability of anything exists only between 0 and 1, sigmoid is the right choice because of its range.
● The function is differentiable and
provides a smooth gradient, i.e.,
preventing jumps in output values. This is
represented by an S-shape of the sigmoid
activation function.
The limitations of the sigmoid function
The limitations of the sigmoid function are discussed below:
● The derivative of the function is f'(x) = sigmoid(x) * (1 - sigmoid(x)).
The gradient values are only significant for inputs roughly in the range of -3 to 3, and the curve gets much flatter in other regions.
It implies that for values greater than 3 or less than
-3, the function will have very small gradients. As
the gradient value approaches zero, the network
ceases to learn and suffers from the Vanishing
gradient problem.
● The output of the logistic function is not
symmetric around zero. So the output of all the
neurons will be of the same sign. This makes the
training of the neural network more difficult and
unstable.
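A quick numeric check of the vanishing-gradient claim, reusing the standard sigmoid definition from the sketch above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: f'(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 3.0, 6.0, 10.0]:
    print(x, sigmoid_grad(x))
# roughly 0.25, 0.045, 0.0025, 0.000045: the gradient all but vanishes beyond |x| > 3
```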
Tanh Function (Hyperbolic Tangent)
The tanh function is very similar to the sigmoid/logistic activation function, and even has the same S-shape, with the difference of an output range of -1 to 1. In tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.
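A minimal sketch using NumPy's built-in tanh:

```python
import numpy as np

x = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(np.round(np.tanh(x), 3))
# [-1.    -0.762  0.     0.762  1.   ]  (zero-centered outputs in the range (-1, 1))
```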
Advantages of using the Tanh (Hyperbolic Tangent) activation function are:
● The output of the tanh activation function is zero-centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive.
● Usually used in hidden layers of a neural network, as its values lie between -1 and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering the data and makes learning for the next layer much easier.
Limitations of Tanh
As you can see, it also faces the problem of vanishing gradients, similar to the sigmoid activation function. Plus, the gradient of the tanh function is much steeper as compared to the sigmoid function.
Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered, and the gradients are not restricted to move in a certain direction. Therefore, in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity.
ReLU Function (Rectified Linear Unit)
Although it gives an impression of a linear function, ReLU has a derivative function and allows for backpropagation while simultaneously making it computationally efficient.
The main catch here is that the ReLU function
does not activate all the neurons at the same
time.
The neurons will only be deactivated if the
output of the linear transformation is less than
0.
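A minimal sketch using the standard definition f(x) = max(0, x):

```python
import numpy as np

def relu(x):
    """ReLU: passes positive inputs through unchanged and zeroes out negative ones."""
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [0. 0. 0. 2.]
```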
The advantages of using ReLU:
● Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.
● ReLU accelerates the convergence of
gradient descent towards the global
minimum of the loss function due to
its linear, non-saturating property.
The limitations faced by the ReLU function:
● The Dying ReLU problem, which is explained below.
The negative side of the graph makes the
gradient value zero. Due to this reason,
during the backpropagation process, the
weights and biases for some neurons are not
updated. This can create dead neurons which
never get activated.
● All the negative input values become
zero immediately, which decreases the
model’s ability to fit or train from the
data properly.
Leaky ReLU Function
Leaky ReLU is an improved version of the ReLU function designed to solve the Dying ReLU problem, as it has a small positive slope in the negative area.
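A minimal sketch; the slope of 0.01 for negative inputs is a commonly used default, assumed here rather than taken from the text:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs are scaled by a small slope instead of being zeroed."""
    return np.where(x >= 0, x, negative_slope * x)

print(leaky_relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [-0.03  -0.005  0.     2.   ]
```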
Advantages of Leaky ReLU
The advantages of Leaky ReLU are the same as those of ReLU, in addition to the fact that it does enable backpropagation, even for negative input values.
By making this minor modification for
negative input values, the gradient of the left
side of the graph comes out to be a non-zero
value. Therefore, we would no longer
encounter dead neurons in that region.
Limitations of Leaky ReLU
● The predictions may not be consistent for negative input values.
● The gradient for negative values is a small value that makes the learning of model parameters time-consuming.
Parametric ReLU Function
Parametric ReLU is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis.
This function provides the slope of the negative part of the function as an argument a. By performing backpropagation, the most appropriate value of a is learnt. Mathematically, f(x) = max(ax, x), where "a" is the slope parameter for negative values.
Limitations
The parameterized ReLU function is used
when the leaky ReLU function still fails at
solving the problem of dead neurons, and the
relevant information is not successfully
passed to the next layer.
This function’s limitation is that it may
perform differently for different problems
depending upon the value of slope
parameter a.
Exponential Linear Units (ELUs) Function
Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope of the negative part of the function.
Unlike the leaky ReLU and Parametric ReLU functions, which use a straight line, ELU uses a smooth exponential curve to define the negative values.
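A minimal sketch of the standard ELU definition, f(x) = x for x > 0 and α(e^x − 1) otherwise, with α = 1.0 assumed for the example:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, smooth exponential saturation towards -alpha for negatives."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(np.round(elu(np.array([-5.0, -1.0, 0.0, 2.0])), 3))
# [-0.993 -0.632  0.     2.   ]
```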
Advantages
ELU is a strong alternative to ReLU because of the following advantages:
● ELU smoothly saturates to -α for large negative inputs, whereas ReLU changes direction sharply at zero.
● It avoids the dead ReLU problem by introducing an exponential curve for negative values of input. This helps the network nudge weights and biases in the right direction.
The limitations of the ELU function are as follows:
● It increases the computational time because of the exponential operation included.
● No learning of the α value takes place.
● It can suffer from the exploding gradient problem.
Softmax Function
The Softmax function is described as a combination of multiple sigmoids.
★ It calculates the relative probabilities. Similar to the sigmoid/logistic activation function, the Softmax function returns the probability of each class.
★ It is most commonly used as an activation
function for the last layer of the neural
network in the case of multi-class
classification.
Let’s go over a simple example together.
Assume that you have three classes, meaning that
there would be three neurons in the output layer.
Now, suppose that your output from the neurons is
[1.8, 0.9, 0.68].
Applying the softmax function over these values to
give a probabilistic view will result in the following
outcome: [0.58, 0.23, 0.19].
The predicted label is then taken to be the index with the largest probability (treated as 1), while the other two indexes are treated as 0. Here, that gives full weight to index 0 and no weight to index 1 and index 2, so the output would be the class corresponding to the 1st neuron (index 0) out of three.
You can see now how the softmax activation function makes things easy for multi-class classification problems.
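A minimal NumPy sketch that reproduces the example above:

```python
import numpy as np

def softmax(z):
    """Exponentiate each output and normalize so the results sum to 1."""
    exp_z = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exp_z / exp_z.sum()

logits = np.array([1.8, 0.9, 0.68])
probs = softmax(logits)
print(np.round(probs, 2))     # [0.58 0.23 0.19]
print(int(np.argmax(probs)))  # 0 -> the predicted class corresponds to the 1st neuron
```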
Swish
Swish is a self-gated activation function developed by researchers at Google.
It consistently matches or outperforms the ReLU activation function on deep networks applied to various challenging domains such as image classification and machine translation.
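The formula is not spelled out above; the commonly used form is f(x) = x * sigmoid(x). A minimal sketch:

```python
import numpy as np

def swish(x):
    """Self-gated activation: each input is scaled by its own sigmoid value."""
    return x / (1.0 + np.exp(-x))  # equivalent to x * sigmoid(x)

print(np.round(swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])), 3))
# [-0.072 -0.269  0.     0.731  3.928]  (note the smooth dip below zero for small negative inputs)
```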
Advantages of the Swish activation function over ReLU:
● Swish is a smooth function, which means that it does not abruptly change direction like ReLU does near x = 0. Rather, it smoothly bends from 0 towards values < 0 and then upwards again.
● Small negative values are zeroed out in the ReLU activation function. However, those negative values may still be relevant for capturing patterns underlying the data. In Swish, only large negative values are zeroed out (for reasons of sparsity), making it a win-win situation.
● The Swish function, being non-monotonic, enhances the expression of the input data and the weights to be learnt.
Gaussian Error Linear Unit (GELU)
The Gaussian Error Linear Unit (GELU) activation function is compatible with BERT, RoBERTa, ALBERT, and other top NLP models. This activation function is motivated by combining properties from dropout, zoneout, and ReLUs.
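GELU weights each input by the standard Gaussian CDF, GELU(x) = x * Φ(x). The widely used tanh approximation sketched below comes from the GELU paper rather than from the text above:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU(x) = x * Phi(x), where Phi is the Gaussian CDF."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

print(np.round(gelu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])), 2))
# roughly [-0.   -0.16  0.    0.84  3.  ]: near 0 for very negative x, near x for positive x
```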
Scaled Exponential Linear Unit (SELU)
SELU was defined in self-normalizing networks and takes care of internal normalization, which means each layer preserves the mean and variance from the previous layers. SELU enables this normalization by adjusting the mean and variance.
SELU has both positive and negative values to shift the
mean, which was impossible for ReLU activation
function as it cannot output negative values.
Gradients can be used to adjust the variance. The
activation function needs a region with a gradient larger
than one to increase it.
SELU has values of alpha α and lambda λ
predefined.
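A minimal sketch using the predefined constants from the SELU paper (λ ≈ 1.0507, α ≈ 1.6733):

```python
import numpy as np

ALPHA = 1.6732632423543772   # predefined alpha from the SELU paper
LAMBDA = 1.0507009873554805  # predefined lambda (scale) from the SELU paper

def selu(x):
    """SELU: scaled identity for positive inputs, scaled exponential curve for negatives."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(np.round(selu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])), 3))
# roughly [-1.671 -1.111  0.     1.051  3.152]
```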
Here’s the main advantage of SELU over
ReLU:
● Internal normalization is faster than
external normalization, which means the
network converges faster.
SELU is a relatively new activation function and needs more research on architectures such as CNNs and RNNs, where it is comparatively less explored.