Report Phase2 ESA v1
Master of Technology
in
Data Science and
Machine Learning
Submitted by:
Sudha BG
Professor
Great Learning
FACULTY OF ENGINEERING
CERTIFICATE
This is to certify that the dissertation entitled
In partial fulfilment for the completion of Fourth Semester Project Phase - 2 (UE20CS972)
in the Program of Study - Master of Technology in Data Science and Machine learning
under rules and regulations of PES University, Bengaluru during the period April 2023 –
June 2023. It is certified that all corrections / suggestions indicated for internal assessment
have been incorporated in the report. The dissertation has been approved as it satisfies
the 4th semester academic requirements in respect of project work.
1.
2.
DECLARATION
Optimizers are algorithms that are used to update the parameters (weights
and biases) of a neural network during training. The goal of these
algorithms is to minimize the loss function of the network by finding the
optimal values of the parameters. Some common optimizers include
Stochastic Gradient Descent (SGD), Adam, Adagrad and RMSProp.
The Esh activation function is a new activation function with the formula f(x) = x·tanh(sigmoid(x)) that has shown promising results in deep neural
networks. Compared to other activation functions like ReLU, GELU, Mish,
and Swish, the Esh activation function offers a more consistent loss
landscape. Optimizers and learning rates are important hyperparameters
that affect the performance of a neural network. This study aims to
investigate the impact of different optimizers and learning rates on the
performance of the Esh activation function in a deep neural network. The
study will compare the performance of different optimizers such as
Stochastic Gradient Descent (SGD), Adam, Adagrad and RMSProp, with
different learning rates, applied to the Esh activation function, on the MNIST,
CIFAR-10 and CIFAR-100 data sets using VGG16 and ResNet CNN
architectures. The results of this study can provide insights into the optimal
hyperparameters for the Esh activation function and can contribute to the
development of better deep neural networks.
1. INTRODUCTION
1.1 Background
2. PROBLEM STATEMENT
2.1 Objective of Optimizers and Learning Rates on a Neural Network
2.2 The working of Optimizers and Learning Rates
3. LITERATURE REVIEW
3.1 Activation Functions
3.2 Deep Neural Network Architecture
3.3 Image Classification
3.4 Optimizers
3.4.1 Recently Proposed Optimizers
3.5 Learning Rate
4. NEURAL NETWORK ARCHITECTURES
4.1 ResNet
4.2 VGG-16
5. ACTIVATION FUNCTIONS
Desired Characteristics of the Activation Functions
5.1 Sigmoid Activation Function
5.2 Swish Activation Function
5.3 Mish Activation Function
5.4 Tanh Activation Function
5.5 Esh Activation Function
5.5.1 Derivative of Esh
5.5.2 Properties of Esh
6. OPTIMIZERS
6.1 Stochastic Gradient Descent (SGD)
6.2 Root Mean Square Propagation (RMSProp)
6.3 Adaptive Gradient (Adagrad)
6.4 Adadelta
6.5 Adamax
6.6 Adaptive Moment Estimation (Adam)
7. METHODOLOGY
7.1 The Datasets
7.1.1 MNIST
7.1.2 EMNIST
7.1.3 CIFAR-10
7.2 Data Augmentation
7.3 Preprocessing
7.4 Technologies Used
REFERENCES
CHAPTER 1
INTRODUCTION
Optimizers and learning rates are essential components in the training of deep learning models.
In deep learning, the goal is to minimize the loss function, which represents the difference between
the predicted and actual values. The optimizer is the algorithm that updates the model parameters
during training to minimize the loss function.
There are various types of optimizers, including stochastic gradient descent (SGD), Adam,
Adagrad, and RMSprop. These optimizers differ in how they update the model parameters and how
they handle learning rates.
The learning rate is a hyperparameter that controls the size of the step taken during the optimization
process. A larger learning rate can lead to faster convergence, but it may also cause the model to
overshoot the optimal solution.
Conversely, a smaller learning rate can lead to slower convergence, but it may also help the model
converge to a more precise optimal solution.
Finding the optimal learning rate can be challenging, as it depends on various factors such as the
problem, the optimizer used, and the architecture of the model. There are various techniques for
selecting the optimal learning rate, such as using a learning rate schedule, applying learning rate
annealing, or using adaptive learning rates.
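As an illustration of one such technique, the following sketch implements a simple step-decay schedule; the decay factor and drop interval are arbitrary illustrative choices, not values used in this study.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# The learning rate stays at 0.1 for epochs 0-9, drops to 0.05 at
# epoch 10, to 0.025 at epoch 20, and so on.
for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```

Schedules of this form let training take large steps early on and progressively smaller, more careful steps as it approaches a minimum.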
Overall, optimizers and learning rates play a crucial role in the training of deep learning models,
and selecting the appropriate combination can significantly impact the model’s performance.
1.1 Background
The introduction of non-linearity in neural networks is essential for learning complex relationships
in the data, and activation functions play a crucial role in achieving this objective.
The Esh activation function, which has been proposed for image classification tasks, offers several
advantages over existing activation functions such as ReLU, Swish, and GELU.
The Esh activation function has multiple advantages that make it beneficial for neural networks
used in classification tasks. One of its primary benefits is that it accelerates the learning process,
leading to increased accuracy. The Esh function achieves this through a steeper gradient near zero than other smooth activation functions, which speeds up parameter updates and results in faster convergence. Additionally, the Esh function's
unbounded upper limit helps prevent saturation that can cause the training process to slow down to
almost zero gradients, while its lower limit produces a strong regularization effect.
Experimental results have demonstrated that the Esh activation function performs better than
ReLU, Swish, and GELU on widely used benchmark datasets like MNIST, CIFAR-10, CIFAR-
100, VGG16, and ResNet network architectures. Moreover, researchers have established that the
Esh activation function has a smoother loss landscape compared to other activation functions. This
feature contributes to faster and more stable convergence in neural networks.
While earlier studies have established the effectiveness of the Esh activation function in image
classification tasks, there is a need to further investigate how different hyperparameters can
influence its performance on the benchmark datasets. To address this gap, future research could
extend the comparative study of optimizers and learning rates on the Esh activation function. By doing
so, researchers could gain a deeper understanding of the ideal selection of hyperparameters when
utilizing the Esh activation function in deep learning applications.
CHAPTER 2
PROBLEM STATEMENT
Image classification[69] is a fundamental task in computer vision that involves assigning a label or
a category to an image. It plays a critical role in various real-world applications such as medical
diagnosis, autonomous driving, surveillance, and object recognition.
The ability of machines to identify and categorize objects in images accurately is essential in
enabling automation, improving efficiency, and increasing the accuracy of decision-making
processes in various industries. For instance, in healthcare, image classification is used to detect
and diagnose diseases from medical images, while in the automotive industry, it is used to identify
road signs, pedestrians, and other vehicles to facilitate autonomous driving.
Furthermore, with the proliferation of digital media, social networking platforms, and e-commerce
sites, image classification has become increasingly important for content filtering, product
recommendation, and user personalization. As such, the development of accurate and efficient
image classification models is crucial in enabling machines to interpret and understand the visual
world around them.
The choice of activation functions plays a crucial role in determining the effectiveness of deep
learning models in image classification tasks. Although activation functions such as ReLU, Swish,
and GELU have been widely studied and implemented, there is a need to investigate the
performance of the Esh activation function in the context of image classification tasks. The Esh
activation function is a novel activation function that has shown promising results in image
classification, but its potential in this area has not been thoroughly explored yet.
The convergence and accuracy of a model depend heavily on the selection of hyperparameters[31],
particularly the optimizers[10] and learning rates. This study aims to investigate the effect of
varying optimizers and learning rates on the performance of the Esh activation function in a neural
network. Additionally, the study aims to provide valuable insights into selecting the most optimal
hyperparameters for utilizing the Esh activation function in deep learning applications.
Optimizers[3] are algorithms that update the weights and biases of a neural network during training
in order to minimize the loss function. The optimizer’s goal is to find the set of weights and biases
that result in the lowest possible loss. There are several different types of optimizers available,
including Stochastic Gradient Descent (SGD), Adagrad, Adam, and RMSprop. Each optimizer has
its own unique approach to updating the weights and biases.
Learning rate, on the other hand, is a hyperparameter that determines how quickly the optimizer
adjusts the weights and biases of the neural network. A higher learning rate means that the weights
and biases are updated more quickly, while a lower learning rate means that the updates are slower.
Setting the learning rate too high can result in the optimizer overshooting the optimal set of weights
and biases, while setting it too low can result in slow convergence to the optimal solution.
Therefore, choosing an appropriate learning rate is crucial for achieving good performance in a
neural network.
During each iteration of training, the optimizer calculates the gradient of the loss function[68] with
respect to the weights and biases. This gradient tells the optimizer how to adjust the weights and
biases in order to decrease the loss function. The optimizer then uses this information to update the
weights and biases accordingly.
Different optimizers use different strategies for updating the weights and biases. For example, the
Stochastic Gradient Descent (SGD) optimizer updates the weights and biases by subtracting the
gradient of the loss function multiplied by a learning rate, while the Adam optimizer adjusts the
learning rate based on estimates of the first and second moments of the gradients.
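The two update strategies described above can be sketched in a few lines; the formulations follow the standard published rules for SGD and Adam, while the learning rate and gradient values are illustrative only.

```python
import math

def sgd_step(w, grad, lr=0.1):
    # Plain SGD: move against the gradient, scaled by the learning rate.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: maintain decaying averages of the gradient (m, first moment)
    # and its square (v, second moment), correct their initialization
    # bias, then take a normalized step.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w_sgd = sgd_step(1.0, grad=0.5)                        # 1.0 - 0.1*0.5 = 0.95
w_adam, m, v = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

Note that on the very first step Adam's bias correction makes the update size approximately equal to the learning rate regardless of the gradient's magnitude, whereas the SGD step scales directly with the gradient.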
The learning rate determines how much the weights and biases are adjusted in response to the
gradient. A higher learning rate results in larger updates to the weights and biases, while a lower
learning rate results in smaller updates. Setting the learning rate too high can lead to overshooting
the optimal set of weights and biases, while setting it too low can result in slow convergence to the
optimal solution.
The choice of optimizer and learning rate can have a significant impact on the performance of a
neural network. Different optimizers may work better for different types of problems or
architectures, and finding the optimal learning rate often requires experimentation and tuning. In
general, the goal is to find the optimizer and learning rate that allow the neural network to converge
quickly and accurately to the optimal set of weights and biases.
CHAPTER 3
LITERATURE REVIEW
These papers represent a small sample of the extensive research conducted on activation functions
in deep neural networks. The field of activation functions continues to evolve, with new variations
and modifications being proposed to enhance the performance and capabilities of deep learning
models.
3.4 Optimizers
Optimizers play a significant role in the training of deep neural networks. Here are some recent
studies on the performance of optimizers in deep learning:
1. A systematic study of the class of Adam methods for deep learning by Liu et al. (2020):
This study systematically investigates the class of Adam optimizers for deep learning. The
authors propose several modifications to the Adam algorithm and demonstrate their
effectiveness on a range of benchmark datasets.
2. On the variance of the adaptive learning rate and beyond [33] by Liu et al. (2019): This
study analyzes the variance of the adaptive learning rate in algorithms such as Adam during the
early stage of training, and proposes a new algorithm called RAdam (Rectified Adam). The authors
demonstrate that RAdam outperforms other adaptive learning rate algorithms on several benchmark datasets.
3. Decoupled weight decay regularization[34] by Loshchilov and Hutter (2019): This study
proposes a decoupled weight decay regularization method for stochastic gradient descent
(SGD)[60] and its variants. The authors show that the method improves the generalization
performance of deep neural networks on several benchmark datasets.
4. Improving generalization performance by switching from Adam to SGD[25] by Keskar and
Socher (2017): This study proposes a method that switches from the Adam[26] optimizer to
stochastic gradient descent (SGD) during the training process. The authors show that the
method improves the generalization performance of deep neural networks on several
benchmark datasets.
5. Adaptive gradient methods with dynamic bound of learning rate[35] by Luo et al.
(2019): This study proposes a new family of adaptive gradient methods with a dynamic
bound of learning rate. The authors show that the methods outperform other adaptive
gradient methods on several benchmark datasets.
Overall, these studies demonstrate that the choice of optimizer can significantly impact the
performance of deep neural networks, and highlight the importance of careful optimization in deep
learning applications.
There have been several recently proposed optimizers for neural networks. Here are a few examples:
1. SWATS (Switching from Adam to SGD)[25] : This optimizer begins training with Adam and
automatically switches to SGD once a triggering condition is met. The switch combines the fast
initial progress of adaptive methods with the better generalization often observed with SGD,
and the switchover point is determined from the optimizer's own statistics rather than being
hand-tuned.
2. AdaBound[35] : This optimizer is a modification of the Adam optimizer that uses dynamic
bounds on the learning rate to improve convergence. It achieves this by gradually decreasing
the learning rate as the optimization progresses, which helps prevent overshooting the
optimal set of weights and biases.
3. Madam[4] : This optimizer is a modification of the Adam optimizer that uses momentum
with adaptive damping to improve convergence. It achieves this by dynamically adjusting the
momentum and damping terms based on the gradient and curvature of the loss function.
These are just a few examples of the many recently proposed optimizers for neural networks. Each
optimizer has its own strengths and weaknesses, and the choice of optimizer often depends on the
specific problem and architecture being used.
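The dynamic-bound idea behind AdaBound can be illustrated with a simplified sketch: the adaptive per-parameter step size is clipped to bounds that tighten around a final learning rate as training progresses. The bound functions below are illustrative stand-ins, not the exact schedules from the paper.

```python
def clipped_lr(adaptive_lr, step, final_lr=0.1):
    # Bounds start wide and converge towards final_lr as the step count
    # grows, so early training behaves like an adaptive method and late
    # training behaves like SGD with a fixed learning rate.
    lower = final_lr * (1 - 1 / (step + 1))
    upper = final_lr * (1 + 1 / step)
    return min(max(adaptive_lr, lower), upper)

print(clipped_lr(0.5, step=1))     # early: wide bounds permit large steps
print(clipped_lr(0.5, step=1000))  # late: clipped close to final_lr
```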
Overall, these papers demonstrate the importance of the learning rate hyperparameter in deep
learning and highlight various techniques for optimizing it.
CHAPTER 4
NEURAL NETWORK ARCHITECTURES
4.1 ResNet
ResNet, short for "Residual Network,"[19] is a deep neural network architecture that was introduced
by Microsoft researchers in 2015. It won the ImageNet[46] and COCO[32] 2015 competitions in
several categories and has since become a popular and powerful architecture for a wide range of
computer vision tasks.
The basic idea behind ResNet is to use skip connections or "residual connections"[55] to allow the
network to learn residual mappings. These connections enable the network to more effectively
propagate gradients through the network during training, which can help prevent the vanishing
gradient problem that can occur in very deep networks.
The ResNet architecture consists of a series of residual blocks, each of which includes multiple
convolutional layers and a skip connection that allows the input to be added to the output of the
block. By stacking these residual blocks together, the network can learn increasingly complex
representations of the input data.
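The skip connection can be illustrated independently of convolutions; the toy block below uses small dense layers in place of convolutional layers (weights and dimensions are purely illustrative) to show that the block computes relu(x + F(x)).

```python
def relu_vec(v):
    return [max(0.0, x) for x in v]

def matvec(w, v):
    return [sum(w[i][j] * v[j] for j in range(len(v))) for i in range(len(w))]

def residual_block(x, w1, w2):
    # Two weight layers form the residual mapping F(x); the skip
    # connection then adds the input back: output = relu(x + F(x)).
    fx = matvec(w2, relu_vec(matvec(w1, x)))
    return relu_vec([xi + fi for xi, fi in zip(x, fx)])

x = [1.0, -2.0, 0.5]
zero = [[0.0] * 3 for _ in range(3)]
# With zero weights the residual branch vanishes, so the block reduces
# to relu(x): the identity mapping is trivially representable, which is
# what makes very deep stacks of such blocks easy to optimize.
assert residual_block(x, zero, zero) == [1.0, 0.0, 0.5]
```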
One of the key innovations of ResNet is the use of "bottleneck" blocks, which consist of three
convolutional layers with different filter sizes (1x1, 3x3, and 1x1). The 1x1 convolutions are used to
reduce the dimensionality of the input, while the 3x3 convolution performs the main computation.
This allows the network to learn more efficient and compact representations of the input, which can
improve performance and reduce memory usage.
ResNet comes in various sizes, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and
ResNet-152. These numbers correspond to the number of layers in the network, with ResNet-18
being the smallest and ResNet-152 being the largest.
Overall, ResNet is a powerful and widely used architecture for a variety of computer vision tasks,
including image classification, object detection, and segmentation[66]. Its use of residual
connections allows it to effectively learn very deep representations of the input data, making it an
important tool in the field of deep learning.
ResNet-20 consists of 20 layers: 19 convolutional layers and 1 fully connected layer. The first layer is a 3x3 convolutional layer with 16 filters, followed by nine residual blocks with 2 convolutional layers each. The network uses batch normalization and ReLU[2] activation after each
convolutional layer, and includes a global average pooling layer before the final fully connected
layer. ResNet-20 has approximately 0.27 million parameters and has been shown to achieve state-of-
the-art performance on CIFAR-10 and CIFAR-100 datasets.
ResNet-56 is a deeper version of ResNet-20, consisting of 56 layers: 55 convolutional layers and 1 fully connected layer. Like ResNet-20, it starts with a 3x3 convolutional layer with 16 filters and includes several residual blocks with 2 convolutional layers each. It also uses batch normalization
and ReLU activation after each convolutional layer, and includes a global average pooling layer
before the final fully connected layer. ResNet-56 has approximately 0.86 million parameters and has
been shown to outperform ResNet-20 on the CIFAR-10 and CIFAR-100 datasets.
Overall, ResNet-20 and ResNet-56 are examples of smaller ResNet architectures that can be used for
image classification tasks, particularly on datasets with limited amounts of training data. These
networks can be trained efficiently on GPUs and have been shown to achieve good performance on a
variety of benchmarks.
4.2 VGG-16
VGG16[41] is a deep convolutional neural network architecture that was introduced by researchers
at the Visual Geometry Group (VGG) at the University of Oxford in 2014. The architecture consists
of 16 layers, including 13 convolutional layers and 3 fully connected layers, and has been used for a
wide range of computer vision tasks, including image classification, object detection, and
segmentation.
The key innovation of VGG16 is the use of very small convolutional filters (3x3) throughout the
network, which allows the network to learn more complex features without increasing the number of
parameters too much. The convolutional layers are stacked one after the other, with max pooling
layers used to downsample the feature maps and reduce the spatial dimensionality of the data.
Figure 1: VGG-16
The architecture of VGG16 can be divided into five blocks. The first two blocks consist of two convolutional layers each, with 64 and 128 filters respectively, each block followed by a max pooling layer. The third and fourth blocks each consist of three convolutional layers, with 256 and 512 filters respectively, followed by a max pooling[16] layer. The fifth block consists of three convolutional layers with 512 filters, followed by
a max pooling layer. The fully connected layers are then added on top of these convolutional layers
to perform the final classification.
Overall, VGG16 is a powerful architecture that has been shown to achieve state-of-the-art
performance on a variety of computer vision tasks. However, it is a relatively large network with
over 138 million parameters, which can make it difficult to train on limited computational resources.
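The 138-million figure can be reproduced by counting weights and biases from the standard VGG16 configuration (assuming the conventional 224×224 input, which yields a 7×7×512 feature map before the fully connected layers):

```python
# VGG16 configuration: (in_channels, out_channels) per 3x3 conv layer.
convs = [(3, 64), (64, 64),                    # block 1
         (64, 128), (128, 128),                # block 2
         (128, 256), (256, 256), (256, 256),   # block 3
         (256, 512), (512, 512), (512, 512),   # block 4
         (512, 512), (512, 512), (512, 512)]   # block 5

conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in convs)

# Fully connected layers: 7x7x512 flattened -> 4096 -> 4096 -> 1000 classes.
fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(nin * nout + nout for nin, nout in fcs)

total = conv_params + fc_params
print(total)  # 138357544 weights and biases
```

Note that the vast majority of the parameters sit in the first fully connected layer, not in the convolutional blocks.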
CHAPTER 5
ACTIVATION FUNCTIONS
In the process of building a neural network, one of the choices we get to make is what Activation
Function[30] to use in the hidden layer as well as at the output layer of the network. We know, the
neural network has neurons that work in correspondence with weight, bias, and their respective
activation function. In a neural network, we would update the weights and biases of the neurons
on the basis of the error at the output. This process is known as back-propagation. Activation
functions make the back-propagation[6] possible since the gradients are supplied along with the
error to update the weights and biases.
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to introduce
non-linearity into the output of a neuron. A neural network without an activation function is
essentially just a linear regression model. The activation function does the non-linear
transformation to the input making it capable to learn and perform more complex tasks.
3. Vanishing gradient: Activation functions such as sigmoid squash the input into a smaller output space that falls between [0,1]. As a result, the back-propagation algorithm has almost no gradient to propagate backward through the network, and any residual gradients that do exist continue to dilute as they pass down from the top layers. Due to this, the initial hidden layers are left with almost no gradient information. For
hyperbolic tangent and sigmoid[39] activation functions, it has been observed that the
saturation region for large input (both positive and negative) is a major reason behind the
vanishing of gradient. One of the important remedies to this problem is the use of non-
saturating activation functions. Other non-saturating functions, such as ReLU, leaky
ReLU[62], and other variants of ReLU, have been proposed to solve this problem.
4. Finite range/boundedness: Gradient-based training approaches are more stable when the
range of the activation function is finite, because pattern presentations significantly affect
only limited weights.
5. Differentiability: The most desirable quality for using gradient-based optimization
approaches is continuously differentiable activation functions. This ensures that the back-
propagation algorithm works properly.
5.1 Sigmoid Activation Function
The sigmoid activation function is a mathematical function commonly used in artificial neural
networks. It maps any input value to a value between 0 and 1, which makes it useful for modeling
binary classification problems. The sigmoid function is defined as

f(x) = 1 / (1 + e^(-x))

where x is the input to the function and e is the mathematical constant known as Euler's number.
The sigmoid function has a distinctive S-shaped curve: small changes in the input produce relatively large changes in the output when the input is close to 0, where the slope is steepest. As the input value moves away from 0, the output changes more slowly and the curve flattens towards its asymptotes.
The sigmoid function is often used as an activation function in the output layer of a neural
network that is trained to classify input data into one of two classes. The output of the sigmoid
function can be interpreted as the probability that the input belongs to the positive class.
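These properties are easy to verify numerically; the sketch below checks the sigmoid's midpoint and its saturation behaviour for large positive and negative inputs.

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), mapping any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

assert sigmoid(0.0) == 0.5        # centred at 0.5
assert sigmoid(10.0) > 0.9999     # saturates towards 1 for large inputs
assert sigmoid(-10.0) < 0.0001    # and towards 0 for large negative inputs
```

The near-zero slope in the saturated regions is precisely what starves back-propagation of gradient in deep networks.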
However, the use of the sigmoid function has some limitations. It can suffer from the vanishing
gradient problem, which can slow down the learning process in deep neural networks.
Additionally, it is not well-suited for input values that are significantly different from 0, as the output saturates and the gradient approaches zero.

A related alternative is the Tanh Exponential Activation Function (TanhExp), which can improve performance on image classification tasks significantly. The definition of TanhExp is
f(x) = x·tanh(e^x) (5)
TanhExp outperforms its counterparts in both convergence speed and accuracy. Its behaviour also
remains stable even with noise added and dataset altered. It is shown in [61] that without
increasing the size of the network, the capacity of lightweight neural networks can be enhanced by
TanhExp with only a few training epochs and no extra parameters added.
5.5 Esh Activation Function
The Esh activation function was introduced as a novel activation function for image classification tasks.
It has shown promising results in improving model performance and generalization compared to traditional
activation functions such as ReLU, Swish, and GELU. However, its effectiveness in image segmentation
tasks remains unexplored.
f_Esh(x) = x × tanh(sigmoid(x))
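As a quick numerical check of this definition (a standalone sketch, not taken from the report), the function's global minimum can be located by a simple grid scan:

```python
import math

def esh(x):
    # f(x) = x * tanh(sigmoid(x))
    return x * math.tanh(1.0 / (1.0 + math.exp(-x)))

# Scan a fine grid over [-5, 5] for the minimum of the function.
xs = [i / 1000.0 for i in range(-5000, 5001)]
x_min = min(xs, key=esh)
print(x_min, esh(x_min))  # minimum of about -0.274 near x = -1.31
```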
Comparing the Swish, Mish, and Esh activation functions: Esh extends below zero in the negative half like the other smooth functions do, but it has a steeper gradient. Although they are not identical, the curves of Mish and Swish appear intuitively similar to that of Esh, yet Esh requires fewer calculations even where the curves lie close to one another. Esh's first derivative can be calculated as in Eq. (8).
The first and second derivatives of Esh, Swish, and Mish are shown in Fig. 3.
Figure 4: Output landscape comparison of ReLU, Swish, Mish and Esh activation functions
Esh has a minimum value of approximately -0.274, attained close to x = -1.309. Additionally, Esh inherits Swish's "self-gated" characteristic. A function of the form f(x) = x·g(x) is referred to as "self-gated". The input is multiplied by a gating function that takes the input itself as its argument; this preserves the initial distribution of the input on the positive part while simultaneously creating a buffer close to zero on the negative part.
Additionally, Esh makes sure that its output is sparse. According to a sparse activation, not all
inputs in a network with a random initialization state are active. From the definition and representation of Esh, we have

lim(x→−∞) f_Esh(x) = 0

As a result, when the input x has a significantly negative value, the neuron can roughly be considered as not being activated, satisfying the concept of sparsity. While being more likely to be
linearly separable, this sparse feature allows a model to control the actual dimensionality of the
representation for an input. Esh's likelihood of deactivating these neurons is lower than that of
ReLU, which suppresses 50% of the hidden units. We believe that less of the data is affected by
noise, and that ReLU will block more relevant features than Esh. Furthermore, because half of the
neurons in a network with ReLU activation are not active, the network may not function well.
Esh appears to be similar to other smooth activation functions, but it differs from them in a
number of ways.
First, once the input is greater than 1, Esh is nearly equivalent to a linear transformation, with the difference between the changes in its output and input values being no more than 0.01.
Second, Esh has a steeper gradient close to zero, which helps speed up the updating of network
parameters. The network modifies its parameters during backpropagation as

w_new = w_old − η∇w

where η is the current learning rate and ∇w is the backpropagation gradient. The weights of the network before and after updating are represented, respectively, by w_old and w_new.
We refer to L, the network’s loss, as the difference between the output of the network and the
label corresponding to the ground truth. The cross-entropy loss can be used as L for an image recognition task, and is calculated as in Eq. (12).
L = −(1/N) Σᵢ₌₁ᴺ [yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ)] (12)
In the loss function, N is the overall sample count, yᵢ is the ith sample's ground-truth label, and ŷᵢ is the ith sample's network prediction. By updating the network's parameters, the value of L should
be minimized in order to increase accuracy. Then, using the computed loss L, ∇w can be
calculated as
∇w = ∂L/∂w_old (13)
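The cross-entropy loss of Eq. (12) can be computed directly; the labels and predictions in the sketch below are arbitrary illustrative values.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    # Average of -[y log(yhat) + (1 - y) log(1 - yhat)] over all samples.
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

loss = binary_cross_entropy([1, 0], [0.9, 0.1])
print(loss)  # -(log 0.9 + log 0.9) / 2 = -log 0.9, about 0.105
```

Confident, correct predictions (here 0.9 for the positive class and 0.1 for the negative class) give a small loss; driving the predictions towards the wrong class would make the loss grow without bound.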
As a result, if ∇w is slightly larger, the weight's update rate would be accelerated, which would result
in quick convergence. However, since we want to get to the global minimum value, an activation
function with a large gradient can prevent the network from converging, whereas a roughly linear
function is a reasonable option. A bias shift correction of the unit natural gradient is equivalent to scaling up the bias unit and shifting the mean of the incoming units towards zero. Therefore, the steeper
gradient of Esh can also aid in lowering the function’s mean value to zero, which accelerates
learning even further.
In landscape comparison, the other three activation functions display a smoother landscape in
comparison to ReLU, indicating that they do not make abrupt shifts like ReLU does. The transition
curve of Esh is particularly seamless and fluent when compared to the other two smooth functions.
This characteristic ensures that Esh may combine the benefits of both piecewise and non-piecewise
activation functions and results in impressive performance.
CHAPTER 6
OPTIMIZERS
Optimizers are algorithms that are used to update the parameters of a deep learning model during
training to minimize the loss function. There are several types of optimizers, each with their own
advantages and disadvantages:
6.1 Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent updates each parameter in the direction opposite to the gradient of the loss function:

parameter = parameter − lr × gradient

where parameter represents a model parameter (weight or bias), lr is a hyperparameter that controls the step size of the update, and gradient denotes the gradient of the loss function with respect to the parameter.
SGD operates on a single sample or a batch of samples at a time, hence the term "stochastic". It
iteratively performs parameter updates for each sample or batch until convergence or a specified
number of iterations.
SGD has several advantages, including low memory requirements and computational efficiency, as it
only needs to store the gradients for a single batch of examples at a time. However, SGD may suffer
from slow convergence and may get stuck in local minima. To overcome these issues, several
variants of SGD have been developed, such as momentum SGD and Nesterov accelerated gradient
(NAG).
Momentum SGD takes into account the past gradients to reduce the oscillations in the optimization
path and accelerate convergence. It introduces a momentum term that accumulates the past gradients
and adds them to the current gradient. NAG is a variant of momentum SGD that uses a "look-ahead"
approach to compute the gradient at a future point in time, which can lead to faster convergence.
Overall, SGD and its variants are simple yet powerful optimization algorithms that can be effective
in many deep learning applications. However, their performance may depend on the specific task,
dataset, and model architecture, and tuning the learning rate and other hyperparameters may be
necessary to achieve optimal results.
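The momentum variant described above can be sketched on a one-dimensional quadratic loss; the curvature, learning rate, and momentum values below are illustrative choices, not settings from this study.

```python
def train(lr=0.1, momentum=0.9, steps=300):
    # Minimize L(w) = (w - 3)^2 with momentum SGD; the gradient is 2(w - 3).
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        # The velocity accumulates past gradients, smoothing oscillations
        # and carrying the iterate through flat regions.
        velocity = momentum * velocity - lr * grad
        w += velocity
    return w

print(train())  # converges towards the minimum at w = 3
```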
6.2 Root Mean Square Propagation (RMSProp)
RMSProp adapts the learning rate for each parameter using a decaying average of squared gradients:

ma = dr × ma + (1 − dr) × gradient²
parameter = parameter − lr × gradient / (√ma + ϵ) (16)

where ma represents the decaying average of squared gradients, dr controls the weighting of the past gradients, gradient denotes the gradient of the loss function, lr is the step size, and ϵ is a small value (e.g., 1e-8) added for numerical stability.
One advantage of RMSProp over Adagrad is that it uses a moving average of the squared gradients
instead of accumulating all past gradients, which can help to reduce the diminishing learning rate
problem that can occur in Adagrad.
However, like all optimizers, RMSProp has its limitations. For example, it can struggle with saddle
points in the loss landscape, and it may require careful tuning of hyperparameters, such as the learning
rate and decay rate.
Overall, RMSProp is a popular optimizer that can be effective for many deep learning applications,
especially when combined with other techniques, such as learning rate schedules or early stopping.
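The RMSProp update can likewise be sketched on a one-dimensional quadratic loss, using the ma and dr quantities described above; the hyperparameter values are illustrative.

```python
import math

def rmsprop(steps=500, lr=0.02, dr=0.9, eps=1e-8):
    # Minimize L(w) = (w - 3)^2; ma is the decaying average of squared
    # gradients that normalizes each step.
    w, ma = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        ma = dr * ma + (1 - dr) * grad ** 2
        w -= lr * grad / (math.sqrt(ma) + eps)
    return w

print(rmsprop())  # approaches the minimum at w = 3
```

Because the step is normalized by √ma, progress is roughly lr per iteration regardless of the gradient's raw scale; the iterate then hovers in a small band around the minimum rather than settling exactly on it.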
6.3 Adaptive Gradient (Adagrad)
Adagrad[15] is an optimization algorithm that adapts the learning rate for each parameter based
on the historical gradient information. It was developed to address the issue of manually tuning
the learning rate in Stochastic Gradient Descent (SGD)[45].
In Adagrad, the optimizer adapts the learning rate of each parameter based on the sum of the
squares of the gradients for that parameter. This means that the learning rate is reduced for
parameters that have large gradients, which can help to stabilize the optimization process and
prevent overshooting. The learning rate is then scaled by the inverse square root of this sum. The
update rule for Adagrad can be summarized as follows:
√
parameter = parameter − lr × gradient/( accumulator + ϵ) (18)
where accumulator represents a running sum of the squared gradients, lr is the step size, gradient
denotes the gradient of the loss function, and ϵ is a small value (e.g., 1e-8) added for numerical
stability.
Root Mean Square Propagation (RMSProp) is an optimization algorithm that adapts the learning
rate for each parameter based on the historical gradient information. It was developed to address
some of the limitations of Stochastic Gradient Descent (SGD) and Adagrad, another popular
optimizer.
One advantage of Adagrad over SGD is that it can handle sparse data well, as it effectively
prioritizes the learning rate for features that occur more frequently in the data. Another advantage
is that it requires less tuning of hyperparameters, as it adapts the learning rate for each parameter
automatically.
However, Adagrad also has some limitations. For example, it can suffer from a diminishing
learning rate over time, which can cause slow convergence. It can also require a large number of
iterations to converge, especially for large-scale problems with many parameters.
Overall, Adagrad is a useful optimizer that can be effective for many deep learning applications,
especially those with sparse data. However, its limitations may require additional techniques, such
as learning rate schedules or early stopping, to achieve optimal performance.
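The RMSProp and Adagrad update rules can be sketched in NumPy as follows (an illustrative sketch using the notation of the equations above, not any library's API):

```python
import numpy as np

def rmsprop_step(param, grad, ma, lr=0.001, dr=0.9, eps=1e-8):
    """RMSProp: a decaying average of squared gradients scales the step."""
    ma = dr * ma + (1 - dr) * grad ** 2
    return param - lr * grad / np.sqrt(ma + eps), ma

def adagrad_step(param, grad, accumulator, lr=0.1, eps=1e-8):
    """Adagrad: a running sum of squared gradients that never decays."""
    accumulator = accumulator + grad ** 2
    return param - lr * grad / np.sqrt(accumulator + eps), accumulator

# One step on f(x) = x^2 at x = 1.0, where the gradient is 2.0
p, acc = adagrad_step(1.0, 2.0, 0.0)   # -> 1.0 - 0.1 * 2 / sqrt(4) = 0.9
```

Because Adagrad's accumulator only grows, its effective step shrinks over time, whereas RMSProp's decaying average keeps the step size responsive to recent gradients.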
6.4 Adadelta
In Adadelta, the optimizer maintains a moving average of the squared gradients and a moving
average of the squared updates. The learning rate is then computed based on the ratio of these two
moving averages. This allows Adadelta to adapt to the gradient scale, which can help to stabilize
the optimization process and prevent overshooting. The update rule for a parameter θ using
Adadelta can be summarized as follows:
E[g²]t = ρ × E[g²](t−1) + (1 − ρ) × gt²   (19)
rmsδ = √(E[∆θ²](t−1) + ϵ)   (20)
∆θt = −(rmsδ / √(E[g²]t + ϵ)) × gt   (21)

where gt is the gradient of the objective function w.r.t. parameter θ at time step t, E[g²]t is the
exponentially decaying average of squared gradients, ϵ is a small constant for numerical stability,
∆θt is the update to the parameter θ at time step t, E[∆θ²]t is the exponentially decaying average of
squared parameter updates, and ρ is the decay rate controlling the exponential decay of the moving
averages.
One advantage of Adadelta over RMSProp and Adagrad is that it does not require the tuning of a
global learning rate hyperparameter, which can make it more robust and easier to use.
Additionally, Adadelta can handle non-stationary objectives, as it does not require a decaying
average of the gradients.
However, Adadelta may require more iterations to converge than other optimizers, as it typically
starts with a higher learning rate. It can also be sensitive to the choice of hyperparameters, such as
the decay rate and the initial learning rate.
Overall, Adadelta is a useful optimizer that can be effective for many deep learning applications,
especially those with non-stationary objectives or large amounts of data. However, careful tuning
of hyperparameters may be necessary to achieve optimal performance.
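Equations (20)-(21) translate directly into a short NumPy sketch (illustrative only; the ρ and ϵ values are assumptions, not the study's settings):

```python
import numpy as np

def adadelta_step(param, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """Adadelta: the ratio of two decaying averages replaces a global lr."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2            # E[g^2] at step t
    delta = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * delta ** 2         # E[delta^2] at step t
    return param + delta, eg2, edx2

# A few steps on f(x) = x^2 from x = 5.0; the early steps are tiny because
# the update average starts at zero (only eps seeds the numerator).
x, eg2, edx2 = 5.0, 0.0, 0.0
for _ in range(50):
    x, eg2, edx2 = adadelta_step(x, 2 * x, eg2, edx2)
```

Note that no learning rate appears anywhere in the update, which is the property discussed above.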
6.5 Adamax
Adamax is an optimization algorithm that is a variant of the popular Adam optimizer. Adamax is
designed to handle the "vanishing updates" problem that can occur in Adam when the gradient
values are very large or the learning rate is very small.
In Adamax, the optimizer uses the infinity norm (maximum absolute value) of the gradients
instead of the L2 norm used in Adam. This means that Adamax is less sensitive to large gradients.
Like Adam, Adamax also maintains exponential moving averages of the gradients and their
squares, as well as a bias correction term. These estimates are used to compute the weight updates
and the learning rate. The update rule for a parameter θ using Adamax can be summarized as
follows:

mt = β1 × m(t−1) + (1 − β1) × gt
ut = max(β2 × u(t−1), |gt|)
θt = θ(t−1) − (η / (1 − β1^t)) × mt / ut

where gt is the gradient of the objective function w.r.t. parameter θ at time step t, mt is the
exponentially decaying average of gradients, ut is the exponentially weighted infinity norm of the
gradients, η is the learning rate, β1 and β2 are the decay rates for the gradient and infinity norm
averages, respectively and ϵ is a small constant for numerical stability.
One advantage of Adamax over Adam is that it can be more robust to large gradients and smaller
learning rates, which can make it more suitable for certain deep learning applications. However,
Adamax may require tuning of its own set of hyperparameters, such as the beta1 and beta2
parameters that control the exponential moving averages.
Overall, Adamax is a useful optimizer that can be effective for many deep learning applications,
especially when the "vanishing updates" problem is a concern. However, careful tuning of
hyperparameters may be necessary to achieve optimal performance.
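A minimal sketch of the Adamax update described above (the formulas follow the standard Adamax of Kingma and Ba; the hyperparameter values are illustrative):

```python
def adamax_step(param, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999):
    """Adamax: the infinity norm of past gradients replaces Adam's L2 norm."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients
    u = max(beta2 * u, abs(grad))             # exponentially weighted inf-norm
    return param - (lr / (1 - beta1 ** t)) * m / u, m, u

# One step at param = 1.0 with gradient 2.0:
p, m, u = adamax_step(1.0, 2.0, 0.0, 0.0, t=1)
# u = max(0, 2) = 2, m = 0.2, step = (0.002 / 0.1) * 0.1 = 0.002 -> p = 0.998
```

Because u is a running maximum rather than a sum of squares, a single very large gradient caps the step size instead of inflating the denominator indefinitely.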
6.6 Adam
The Adam optimizer[26] maintains two moving averages of the gradient: the first moment (mean) and
the second moment (uncentered variance). These moving averages are used to compute the
adaptive learning rates for each parameter during training. The update rule for Adam can be
summarized as follows:

m = beta1 × m + (1 − beta1) × gradient
v = beta2 × v + (1 − beta2) × gradient²
m̂ = m / (1 − beta1^t),  v̂ = v / (1 − beta2^t)
parameter = parameter − lr × m̂ / (√v̂ + epsilon)

where m represents the estimate of the first moment (mean) of the gradients, v denotes the
estimate of the second moment (uncentered variance) of the gradients, beta1 and beta2 are the
exponential decay rates for the moments, lr is the step size, gradient is the gradient of the loss
function, and epsilon is a small value (e.g., 1e-8) added for numerical stability.
Adam combines the advantages of two other optimization algorithms: RMSprop and AdaGrad.
Like RMSprop, it uses the moving average of the squared gradients to scale the learning rate.
However, unlike RMSprop, it also uses the moving average of the first moment of the gradient,
which can help the algorithm handle noisy gradients and converge faster.
Adam also includes bias correction to compensate for the fact that the moving averages are
initialized at zero, which can lead to bias in the early iterations of training.
One of the main advantages of Adam is that it requires minimal tuning of hyperparameters, as it
automatically adapts the learning rate for each parameter. It has become a popular choice for
deep learning tasks, particularly in computer vision and natural language processing.
However, it’s worth noting that Adam may not always perform optimally in certain scenarios,
such as when the data is imbalanced or when the gradients are very sparse. In such cases, other
optimization algorithms may be more suitable.
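Putting the moment estimates and bias correction together, one Adam step can be sketched as follows (a NumPy illustration using the notation above, not the study's training code):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 from x = 5.0; each step moves roughly lr in magnitude
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```

The bias correction matters mainly in the first few iterations, when m and v are still close to their zero initialization.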
CHAPTER 7
METHODOLOGY
This section will outline the methodology used to examine the performance of the Esh activation
function when combined with different optimizers and learning rates across various datasets
available in the Python library Keras[11].
7.1 Datasets
7.1.1 MNIST
MNIST[29] (Modified National Institute of Standards and Technology) is one of the largest and
most well-known standard datasets of handwritten digits, and it is frequently used to train
different image processing methods. The dataset is also widely used for training and testing in the
field of deep learning. Each image is a 28×28 grayscale image that is linked to a label from one of
ten categories, with 60,000 training examples and 10,000 test examples.
7.1.2 EMNIST
The EMNIST[12] (Extended Modified National Institute of Standards and Technology) dataset is
an extension of the MNIST dataset, which is a collection of handwritten digits. However, unlike
MNIST, EMNIST includes both digits and alphabetic characters.
The EMNIST dataset consists of 62 classes in total. These classes correspond to the 10 digits (0-9)
and the 52 uppercase and lowercase alphabetic characters (A-Z, a-z).
7.1.3 CIFAR-10
The CIFAR-10[27] dataset consists of 50,000 training images and 10,000 test images. It is divided
into five training batches and one test batch, each with 10,000 images. The test batch contains
exactly 1,000 randomly selected images from each class. The training batches contain the
remaining images in random order; between them, the training batches contain exactly 5,000
images from each class. The classes are completely mutually exclusive.
7.2 Data Augmentation
Data augmentation[40] is a technique used in machine learning and computer vision to artificially
increase the size of a training dataset by applying various transformations to the existing data.
These transformations can include flipping, rotation, cropping, scaling, and adding noise to the
images. The goal of data augmentation is to increase the diversity of the training data and reduce
overfitting, allowing the model to generalize better to new, unseen data.
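The transformations listed above are easy to express directly on image arrays; a NumPy sketch on a synthetic 28×28 grayscale image (illustrative, not the study's augmentation pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))                 # stand-in for one grayscale image

flipped = np.fliplr(image)                   # horizontal flip
rotated = np.rot90(image)                    # 90-degree rotation
cropped = image[2:26, 2:26]                  # crop (usually resized back later)
noisy = image + rng.normal(0.0, 0.05, image.shape)   # additive Gaussian noise

augmented = [flipped, rotated, cropped, noisy]
```

In practice such transformations are applied on the fly during training so that every epoch sees a slightly different version of each image.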
7.3 Preprocessing
The preprocessing stage is crucial in image classification tasks, as it enhances the quality of input
images and ensures that they are in the appropriate format for training the classification model.
Below are some typical preprocessing techniques that can be utilized for the dataset:
Resizing: This technique is used to adjust the size of the images to a fixed size that can be
easily fed to the neural network.
Normalization[64]: It involves scaling the pixel values of the images to a common range,
such as [0,1] or [-1,1], to help the neural network learn more effectively.
Color space conversion: This technique involves converting the images from one color
space to another, such as RGB to grayscale or HSV, to better highlight certain features.
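The normalization and color-space steps above can be sketched in NumPy (a minimal illustration on a toy RGB array; the grayscale weights are the common ITU-R BT.601 mix, an assumption rather than something specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = (rng.random((4, 4, 3)) * 255).astype(np.uint8)   # toy RGB image

unit = rgb.astype(np.float32) / 255.0                  # scale to [0, 1]
signed = unit * 2.0 - 1.0                              # scale to [-1, 1]
gray = unit @ np.array([0.299, 0.587, 0.114], dtype=np.float32)  # RGB -> gray
```

Scaling to a common range keeps the input magnitudes comparable across pixels, which tends to make gradient-based training better behaved.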
The deep learning algorithms in this study were implemented using Keras[11] with Google
TensorFlow[1] backend, and the experiments were conducted using a computational resource
based on Google Colab[5] Pro+.
This study utilized a Google Colab environment with an NVIDIA A100 Cloud GPU having 1100
CHAPTER 8
RESULTS AND DISCUSSION
This section presents the experimental evaluation of the proposed Esh activation function. We
compared its performance to that of Mish and Swish with various optimizers (Adam, SGD,
Adagrad, and RMSProp) and learning rates (0.1, 0.01, 0.001, and 0.0001). The experiments were
carried out on Google Colab Pro+ over several hours using the well-known CNN architectures
VGG16, ResNet-20, and ResNet-56 on the EMNIST and CIFAR-10 datasets.
Table 1: LR vs Accuracy comparison of different optimizers on activation functions with VGG-16 on CIFAR-10
dataset
Table 2: LR vs Loss of different optimizers on activation functions with VGG-16 on CIFAR-10 dataset
Tables 1 and 2 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the CIFAR-10 dataset using four different optimizers (Adam, SGD,
Adagrad, and RMSProp) with learning rates of 0.1, 0.01, 0.001, and 0.0001 on the VGG16
architecture. The results indicate that the Esh activation function outperforms the other activation
functions, achieving better accuracy and more stable losses on VGG16 with the RMSProp
optimizer and a learning rate of 0.001.
Tables 3 and 4 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the CIFAR-10 dataset using the same four optimizers and learning
rates on the ResNet-20 architecture. The results indicate that the Esh activation function achieves
comparable accuracy and losses on ResNet-20 with the Adam optimizer and a learning rate of
0.001.
Table 5: LR vs Accuracy comparison of different optimizers on activation functions with ResNet-56 on CIFAR-10
dataset
Table 6: LR vs Loss of different optimizers on activation functions with ResNet-56 on CIFAR-10 dataset
Tables 5 and 6 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the CIFAR-10 dataset using the same four optimizers and learning
rates on the ResNet-56 architecture. The results indicate that the Esh activation function
outperforms the other activation functions, achieving better accuracy and more stable losses on
ResNet-56 with the RMSProp optimizer and a learning rate of 0.001.
Tables 7 and 8 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the EMNIST dataset using the same four optimizers and learning
rates on the VGG16 architecture. The results indicate that the Esh activation function outperforms
the other activation functions, achieving better accuracy and more stable losses on VGG16 with
the RMSProp optimizer and a learning rate of 0.001.
Tables 9 and 10 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the EMNIST dataset using the same four optimizers and learning
rates on the ResNet-20 architecture. The results indicate that the Esh activation function achieves
comparable accuracy and losses on ResNet-20 with the Adam optimizer and a learning rate of
0.001.
Tables 11 and 12 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the EMNIST dataset using the same four optimizers and learning
rates on the ResNet-56 architecture. The results indicate that the Esh activation function achieves
comparable accuracy and losses on ResNet-56 with the RMSProp optimizer and a learning rate of
0.001.
Our results suggest that using the RMSProp optimizer with a learning rate of 0.001 and the VGG16
architecture produces better results than the other options on the CIFAR-10 dataset. The results
also suggest that using the Adagrad optimizer with a learning rate of 0.001 and the ResNet-56
architecture produces better results than the other options on the EMNIST dataset.
The tables presented above compare the accuracy and loss achieved by various activation functions
on the CIFAR-10 and EMNIST datasets utilizing four different optimizers: Adam, SGD, Adagrad,
and RMSProp with learning rates of 0.1, 0.01, 0.001, and 0.0001 on VGG16, ResNet-20, and
ResNet-56 architectures. The results indicate that the Esh activation function outperforms other
activation functions with better accuracy and stable losses across all three architectures.
CHAPTER 9
CONCLUSION AND FUTURE WORK
This paper presents a comparison of different optimizers and learning rates on a novel activation
function, Esh, defined as f(x) = x · tanh(sigmoid(x)), for deep neural networks. Our findings show
that the Esh activation function yields better accuracy and more stable losses across all three
architectures, indicating its effectiveness in improving the performance of these models.
Considering some commonly desirable properties of activation functions, such as capturing
complex patterns and exhibiting smoothness, we can make the following analysis:
1. Capturing Complex Patterns: The Esh function involves the composition of the tanh and
sigmoid functions. This combination allows it to capture complex patterns and non-linear
relationships. The sigmoid function introduces a non-linear mapping of values to the range
(0, 1), and the hyperbolic tangent (tanh) further transforms those values non-linearly. The
interaction of these non-linear functions can help capture more intricate patterns than Mish
and Swish, which involve only the softplus and sigmoid functions, respectively.
2. Smoothness: The Esh function also has the advantage of incorporating the tanh function,
a smooth function that maps values to the range [-1, 1]. The tanh function has a symmetric
and smooth shape, allowing for a smoother overall function than the Swish function, which
uses only the sigmoid function. The sigmoid function, while also smooth, maps values to
the range (0, 1) and may not exhibit the same degree of smoothness as tanh.
Both the Esh and Mish functions involve the tanh function, which is smooth and maps values to
the range [-1, 1]; in terms of smoothness, there is no significant difference between the two.
Based on the consideration of capturing complex patterns, the composition of tanh and sigmoid in
the Esh function may provide more flexibility and expressive power than Mish and Swish.
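To make the comparison concrete, the three functions can be written out in NumPy (a minimal sketch; the Mish and Swish forms follow their standard definitions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def esh(x):
    """Esh: f(x) = x * tanh(sigmoid(x))."""
    return x * np.tanh(sigmoid(x))

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)

def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * np.tanh(np.log1p(np.exp(x)))

xs = np.linspace(-5.0, 5.0, 11)
values = esh(xs)            # smooth, non-monotonic near the origin
```

All three are smooth and non-monotonic; Esh differs from Swish only in the extra tanh wrapped around the sigmoid, and from Mish in using the sigmoid rather than the softplus inside the tanh.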
Our results also suggest that using the RMSProp optimizer with a learning rate of 0.001 produces
better results than the other options on the CIFAR-10 and EMNIST datasets.
Furthermore, the Esh activation function produces comparable results in terms of accuracy and loss
when compared to Mish and Swish, indicating that it can be a viable alternative to these popular
activation functions.
However, the choice between these functions depends on the specific requirements of the problem
and the characteristics of the dataset being analyzed.
REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A.
Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y.
Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D.
Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V.
Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous
systems, 2015. URL https://www.tensorflow.org/. Software available from
tensorflow.org.
[2] A. F. Agarap. Deep learning using rectified linear units (relu), 2019.
[4] J. Bernstein, J. Zhao, M. Meister, M.-Y. Liu, A. Anandkumar, and Y. Yue. Learning
compositional functions via multiplicative weight updates. Advances in neural information
processing systems, 33:13319–13330, 2020.
[5] E. Bisong. Google Colaboratory, pages 59–64. Apress, Berkeley, CA, 2019. ISBN
978-1-4842-4470-8. doi: 10.1007/978-1-4842-4470-8_7. URL
https://doi.org/10.1007/978-1-4842-4470-8_7.
[12] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. Emnist: an extension of mnist to
handwritten letters, 2017.
[14] S. De, A. Mukherjee, and E. Ullah. Convergence guarantees for rmsprop and adam
in non-convex optimization and an empirical comparison to nesterov acceleration, 2018.
[17] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings
of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–
15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition,
2015.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification, 2015.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks,
2016.
[22] T. He, Z. Zhang, H. Zhang, Z. Zhang, and J. Xie. Bag of tricks for image
classification with convolutional neural networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 558–567, 2018.
[23] L. Heim, A. Biri, Z. Qu, and L. Thiele. Measuring what really matters: Optimizing
neural networks for TinyML. arXiv preprint arXiv:2104.10645, 2021.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.
[27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images.
Technical report, University of Toronto, 2009.
[31] K. Lee and J. Yim. Hyperparameter optimization with neural network pruning, 2022.
[33] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the
adaptive learning rate and beyond, 2021.
[35] L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic
bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
[36] D. Misra. Mish: A self regularized non-monotonic neural activation function. arXiv
preprint arXiv:1908.08681, 2019.
[40] L. Perez and J. Wang. The effectiveness of data augmentation in image classification
using deep learning, 2017.
[42] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv
preprint arXiv:1710.05941, 2017.
[43] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions, 2017.
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition, 2015.
[49] L. N. Smith. Best practices for applying deep learning to novel applications, 2017.
[50] L. N. Smith. Cyclical learning rates for training neural networks, 2017.
[51] S. Sonoda and N. Murata. Neural network with unbounded activation functions is
universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268,
2017.
[54] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to
human-level performance in face verification. In 2014 IEEE Conference on Computer
Vision and Pattern Recognition, 2014.
[58] R. Wei, H. Yin, J. Jia, A. R. Benson, and P. Li. Understanding non-linearity in graph
neural networks from the bayesian-inference perspective, 2022.
[59] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of
adaptive gradient methods in machine learning, 2018.
[60] S. Wojtowytsch. Stochastic gradient descent with noise of machine learning type.
part i: Discrete time analysis, 2021.
[61] X. Liu. TanhExp: A smooth activation function with high convergence speed
for lightweight neural networks, 2020.
[62] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in
convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[63] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks,
2017.
[68] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for neural networks for
image processing, 2018.
[69] Ò. Lorente, I. Riera, and A. Rana. Image classification with classic and deep
learning techniques, 2021.