Report Phase2 ESA v1
Master of Technology
in
Data Science and
Machine Learning
Submitted by:
Sudha BG
Professor
Great Learning
FACULTY OF ENGINEERING
CERTIFICATE
This is to certify that the dissertation entitled
In partial fulfilment for the completion of Fourth Semester Project Phase - 2 (UE20CS972)
in the Program of Study - Master of Technology in Data Science and Machine learning
under rules and regulations of PES University, Bengaluru during the period April 2023 –
June 2023. It is certified that all corrections / suggestions indicated for internal assessment
have been incorporated in the report. The dissertation has been approved as it satisfies
the 4th semester academic requirements in respect of project work.
1.
2.
DECLARATION
Optimizers are algorithms that are used to update the parameters (weights
and biases) of a neural network during training. The goal of these
algorithms is to minimize the loss function of the network by finding the
optimal values of the parameters. Some common optimizers include
Stochastic Gradient Descent (SGD), Adam, Adagrad and RMSProp.
The Esh activation function is a new activation function with the formula f(x) = x·tanh(sigmoid(x)) that has shown promising results in deep neural
networks. Compared to other activation functions like ReLU, GELU, Mish,
and Swish, the Esh activation function offers a more consistent loss
landscape. Optimizers and learning rates are important hyperparameters
that affect the performance of a neural network. This study aims to
investigate the impact of different optimizers and learning rates on the
performance of the Esh activation function in a deep neural network. The
study will compare the performance of different optimizers such as
Stochastic Gradient Descent (SGD), Adam, Adagrad and RMSProp, with
different learning rates, applied to the Esh activation function, on the MNIST,
CIFAR-10 and CIFAR-100 data sets using VGG16 and ResNet CNN
architectures. The results of this study can provide insights into the optimal
hyperparameters for the Esh activation function and can contribute to the
development of better deep neural networks.
1. INTRODUCTION
1.1 Background
2. PROBLEM STATEMENT
2.1 Objective of Optimizers and Learning Rates on a Neural Network
2.2 The working of Optimizers and Learning Rates
3. LITERATURE REVIEW
3.1 Activation Functions
3.2 Deep Neural Network Architecture
3.3 Image Classification
3.4 Optimizers
3.4.1 Recently Proposed Optimizers
3.5 Learning Rate
4. NEURAL NETWORK ARCHITECTURES
4.1 ResNet
4.2 VGG-16
5. ACTIVATION FUNCTIONS
Desired Characteristics of the Activation Functions
5.1 Sigmoid Activation Function
5.2 Swish Activation Function
5.3 Mish Activation Function
5.4 Tanh Activation Function
5.5 Esh Activation Function
5.5.1 Derivative of Esh
5.5.2 Properties of Esh
6. OPTIMIZERS
6.1 Stochastic Gradient Descent (SGD)
6.2 Root Mean Square Propagation (RMSProp)
6.3 Adaptive Gradient (Adagrad)
6.4 Adadelta
6.5 Adamax
6.6 Adaptive Moment Estimation (Adam)
7. METHODOLOGY
7.1 The Datasets
7.1.1 MNIST
7.1.2 EMNIST
7.1.3 CIFAR-10
7.2 Data Augmentation
7.3 Preprocessing
7.4 Technologies Used
REFERENCES
CHAPTER 1
INTRODUCTION
Optimizers and learning rates are essential components in the training of deep learning models.
In deep learning, the goal is to minimize the loss function, which represents the difference between
the predicted and actual values. The optimizer is the algorithm that updates the model parameters
during training to minimize the loss function.
There are various types of optimizers, including stochastic gradient descent (SGD), Adam,
Adagrad, and RMSprop. These optimizers differ in how they update the model parameters and how
they handle learning rates.
The learning rate is a hyperparameter that controls the size of the step taken during the optimization
process. A larger learning rate can lead to faster convergence, but it may also cause the model to
overshoot the optimal solution.
Conversely, a smaller learning rate can lead to slower convergence, but it may also help the model
converge to a more precise optimal solution.
Finding the optimal learning rate can be challenging, as it depends on various factors such as the
problem, the optimizer used, and the architecture of the model. There are various techniques for
selecting the optimal learning rate, such as using a learning rate schedule, applying learning rate
annealing, or using adaptive learning rates.
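As an illustration of one such technique, the following sketch implements a simple step-decay schedule; the decay factor and drop interval are arbitrary illustrative choices, not values used in this study.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# The learning rate stays at 0.1 for epochs 0-9, drops to 0.05 at
# epoch 10, to 0.025 at epoch 20, and so on.
for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```

Schedules of this form let training take large steps early on and progressively smaller, more careful steps as it approaches a minimum.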
Overall, optimizers and learning rates play a crucial role in the training of deep learning models,
and selecting the appropriate combination can significantly impact the model’s performance.
1.1 Background
The introduction of non-linearity in neural networks is essential for learning complex relationships
in the data, and activation functions play a crucial role in achieving this objective.
The Esh activation function, which has been proposed for image classification tasks, offers several
advantages over existing activation functions such as ReLU, Swish, and GELU.
The Esh activation function has multiple advantages that make it beneficial for neural networks
used in classification tasks. One of its primary benefits is that it accelerates the learning process,
leading to increased accuracy. The Esh function achieves this through a steeper gradient near zero than other smooth activation functions, which speeds up parameter updates and results in faster convergence. Additionally, the Esh function's
unbounded upper limit helps prevent saturation that can cause the training process to slow down to
almost zero gradients, while its lower limit produces a strong regularization effect.
Experimental results have demonstrated that the Esh activation function performs better than
ReLU, Swish, and GELU on widely used benchmark datasets like MNIST, CIFAR-10, CIFAR-
100, VGG16, and ResNet network architectures. Moreover, researchers have established that the
Esh activation function has a smoother loss landscape compared to other activation functions. This
feature contributes to faster and more stable convergence in neural networks.
While earlier studies have established the effectiveness of the Esh activation function in image
classification tasks, there is a need to further investigate how different hyperparameters can
influence its performance on the benchmark datasets. To address this gap, future research could
extend the comparative study of optimizers and learning rates on the Esh activation function. By doing
so, researchers could gain a deeper understanding of the ideal selection of hyperparameters when
utilizing the Esh activation function in deep learning applications.
CHAPTER 2
PROBLEM STATEMENT
Image classification[69] is a fundamental task in computer vision that involves assigning a label or
a category to an image. It plays a critical role in various real-world applications such as medical
diagnosis, autonomous driving, surveillance, and object recognition.
The ability of machines to identify and categorize objects in images accurately is essential in
enabling automation, improving efficiency, and increasing the accuracy of decision-making
processes in various industries. For instance, in healthcare, image classification is used to detect
and diagnose diseases from medical images, while in the automotive industry, it is used to identify
road signs, pedestrians, and other vehicles to facilitate autonomous driving.
Furthermore, with the proliferation of digital media, social networking platforms, and e-commerce
sites, image classification has become increasingly important for content filtering, product
recommendation, and user personalization. As such, the development of accurate and efficient
image classification models is crucial in enabling machines to interpret and understand the visual
world around them.
The choice of activation functions plays a crucial role in determining the effectiveness of deep
learning models in image classification tasks. Although activation functions such as ReLU, Swish,
and GELU have been widely studied and implemented, there is a need to investigate the
performance of the Esh activation function in the context of image classification tasks. The Esh
activation function is a novel activation function that has shown promising results in image
classification, but its potential in this area has not been thoroughly explored yet.
The convergence and accuracy of a model depend heavily on the selection of hyperparameters[31],
particularly the optimizers[10] and learning rates. This study aims to investigate the effect of
varying optimizers and learning rates on the performance of the Esh activation function in a neural
network. Additionally, the study aims to provide valuable insights into selecting the most optimal
hyperparameters for utilizing the Esh activation function in deep learning applications.
Optimizers[3] are algorithms that update the weights and biases of a neural network during training
in order to minimize the loss function. The optimizer’s goal is to find the set of weights and biases
that result in the lowest possible loss. There are several different types of optimizers available,
including Stochastic Gradient Descent (SGD), Adagrad, Adam, and RMSprop. Each optimizer has
its own unique approach to updating the weights and biases.
Learning rate, on the other hand, is a hyperparameter that determines how quickly the optimizer
adjusts the weights and biases of the neural network. A higher learning rate means that the weights
and biases are updated more quickly, while a lower learning rate means that the updates are slower.
Setting the learning rate too high can result in the optimizer overshooting the optimal set of weights
and biases, while setting it too low can result in slow convergence to the optimal solution.
Therefore, choosing an appropriate learning rate is crucial for achieving good performance in a
neural network.
During each iteration of training, the optimizer calculates the gradient of the loss function[68] with
respect to the weights and biases. This gradient tells the optimizer how to adjust the weights and
biases in order to decrease the loss function. The optimizer then uses this information to update the
weights and biases accordingly.
Different optimizers use different strategies for updating the weights and biases. For example, the
Stochastic Gradient Descent (SGD) optimizer updates the weights and biases by subtracting the
gradient of the loss function multiplied by a learning rate, while the Adam optimizer adjusts the
learning rate based on estimates of the first and second moments of the gradients.
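The two update strategies described above can be sketched in a few lines; the formulations follow the standard published rules for SGD and Adam, while the learning rate and gradient values are illustrative only.

```python
import math

def sgd_step(w, grad, lr=0.1):
    # Plain SGD: move against the gradient, scaled by the learning rate.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: maintain decaying averages of the gradient (m, first moment)
    # and its square (v, second moment), correct their initialization
    # bias, then take a normalized step.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w_sgd = sgd_step(1.0, grad=0.5)                        # 1.0 - 0.1*0.5 = 0.95
w_adam, m, v = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

Note that on the very first step Adam's bias correction makes the update size approximately equal to the learning rate regardless of the gradient's magnitude, whereas the SGD step scales directly with the gradient.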
The learning rate determines how much the weights and biases are adjusted in response to the
gradient. A higher learning rate results in larger updates to the weights and biases, while a lower
learning rate results in smaller updates. Setting the learning rate too high can lead to overshooting
the optimal set of weights and biases, while setting it too low can result in slow convergence to the
optimal solution.
The choice of optimizer and learning rate can have a significant impact on the performance of a
neural network. Different optimizers may work better for different types of problems or
architectures, and finding the optimal learning rate often requires experimentation and tuning. In
general, the goal is to find the optimizer and learning rate that allow the neural network to converge
quickly and accurately to the optimal set of weights and biases.
CHAPTER 3
LITERATURE REVIEW
These papers represent a small sample of the extensive research conducted on activation functions
in deep neural networks. The field of activation functions continues to evolve, with new variations
and modifications being proposed to enhance the performance and capabilities of deep learning
models.
3.4 Optimizers
Optimizers play a significant role in the training of deep neural networks. Here are some recent
studies on the performance of optimizers in deep learning:
1. A systematic study of the class of Adam methods for deep learning by Liu et al. (2020):
This study systematically investigates the class of Adam optimizers for deep learning. The
authors propose several modifications to the Adam algorithm and demonstrate their
effectiveness on a range of benchmark datasets.
2. On the variance of the adaptive learning rate and beyond [33] by Liu et al. (2019): This
study analyzes the variance of the adaptive learning rate in algorithms such as Adam during the
early stage of training, and proposes a new algorithm called RAdam (Rectified Adam). The authors
demonstrate that RAdam outperforms other adaptive learning rate algorithms on several benchmark datasets.
3. Decoupled weight decay regularization[34] by Loshchilov and Hutter (2019): This study
proposes a decoupled weight decay regularization method for stochastic gradient descent
(SGD)[60] and its variants. The authors show that the method improves the generalization
performance of deep neural networks on several benchmark datasets.
4. Improving generalization performance by switching from Adam to SGD[25] by Keskar and
Socher (2017): This study proposes a method that switches from the Adam[26] optimizer to
stochastic gradient descent (SGD) during the training process. The authors show that the
method improves the generalization performance of deep neural networks on several
benchmark datasets.
5. Adaptive gradient methods with dynamic bound of learning rate[35] by Luo et al.
(2019): This study proposes a new family of adaptive gradient methods with a dynamic
bound of learning rate. The authors show that the methods outperform other adaptive
gradient methods on several benchmark datasets.
Overall, these studies demonstrate that the choice of optimizer can significantly impact the
performance of deep neural networks, and highlight the importance of careful optimization in deep
learning applications.
There have been several recently proposed optimizers for neural networks. Here are a few examples:
1. SWATS (Switching from Adam to SGD)[25] : This optimizer begins training with Adam and
automatically switches to SGD once a triggering condition is met. The switch combines the fast
initial progress of adaptive methods with the better generalization often observed with SGD,
and the switchover point is determined from the optimizer's own statistics rather than being
hand-tuned.
2. AdaBound[35] : This optimizer is a modification of the Adam optimizer that uses dynamic
bounds on the learning rate to improve convergence. It achieves this by gradually decreasing
the learning rate as the optimization progresses, which helps prevent overshooting the
optimal set of weights and biases.
3. Madam[4] : This optimizer is a modification of the Adam optimizer that uses momentum
with adaptive damping to improve convergence. It achieves this by dynamically adjusting the
momentum and damping terms based on the gradient and curvature of the loss function.
These are just a few examples of the many recently proposed optimizers for neural networks. Each
optimizer has its own strengths and weaknesses, and the choice of optimizer often depends on the
specific problem and architecture being used.
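The dynamic-bound idea behind AdaBound can be illustrated with a simplified sketch: the adaptive per-parameter step size is clipped to bounds that tighten around a final learning rate as training progresses. The bound functions below are illustrative stand-ins, not the exact schedules from the paper.

```python
def clipped_lr(adaptive_lr, step, final_lr=0.1):
    # Bounds start wide and converge towards final_lr as the step count
    # grows, so early training behaves like an adaptive method and late
    # training behaves like SGD with a fixed learning rate.
    lower = final_lr * (1 - 1 / (step + 1))
    upper = final_lr * (1 + 1 / step)
    return min(max(adaptive_lr, lower), upper)

print(clipped_lr(0.5, step=1))     # early: wide bounds permit large steps
print(clipped_lr(0.5, step=1000))  # late: clipped close to final_lr
```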
Overall, these papers demonstrate the importance of the learning rate hyperparameter in deep
learning and highlight various techniques for optimizing it.
CHAPTER 4
NEURAL NETWORK ARCHITECTURES
4.1 ResNet
ResNet, short for "Residual Network,"[19] is a deep neural network architecture that was introduced
by Microsoft researchers in 2015. It won the ImageNet[46] and COCO[32] 2015 competitions in
several categories and has since become a popular and powerful architecture for a wide range of
computer vision tasks.
The basic idea behind ResNet is to use skip connections or "residual connections"[55] to allow the
network to learn residual mappings. These connections enable the network to more effectively
propagate gradients through the network during training, which can help prevent the vanishing
gradient problem that can occur in very deep networks.
The ResNet architecture consists of a series of residual blocks, each of which includes multiple
convolutional layers and a skip connection that allows the input to be added to the output of the
block. By stacking these residual blocks together, the network can learn increasingly complex
representations of the input data.
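The skip connection can be illustrated independently of convolutions; the toy block below uses small dense layers in place of convolutional layers (weights and dimensions are purely illustrative) to show that the block computes relu(x + F(x)).

```python
def relu_vec(v):
    return [max(0.0, x) for x in v]

def matvec(w, v):
    return [sum(w[i][j] * v[j] for j in range(len(v))) for i in range(len(w))]

def residual_block(x, w1, w2):
    # Two weight layers form the residual mapping F(x); the skip
    # connection then adds the input back: output = relu(x + F(x)).
    fx = matvec(w2, relu_vec(matvec(w1, x)))
    return relu_vec([xi + fi for xi, fi in zip(x, fx)])

x = [1.0, -2.0, 0.5]
zero = [[0.0] * 3 for _ in range(3)]
# With zero weights the residual branch vanishes, so the block reduces
# to relu(x): the identity mapping is trivially representable, which is
# what makes very deep stacks of such blocks easy to optimize.
assert residual_block(x, zero, zero) == [1.0, 0.0, 0.5]
```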
One of the key innovations of ResNet is the use of "bottleneck" blocks, which consist of three
convolutional layers with different filter sizes (1x1, 3x3, and 1x1). The 1x1 convolutions are used to
reduce the dimensionality of the input, while the 3x3 convolution performs the main computation.
This allows the network to learn more efficient and compact representations of the input, which can
improve performance and reduce memory usage.
ResNet comes in various sizes, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and
ResNet-152. These numbers correspond to the number of layers in the network, with ResNet-18
being the smallest and ResNet-152 being the largest.
Overall, ResNet is a powerful and widely used architecture for a variety of computer vision tasks,
including image classification, object detection, and segmentation[66]. Its use of residual
connections allows it to effectively learn very deep representations of the input data, making it an
important tool in the field of deep learning.
ResNet-20 consists of 20 layers: 19 convolutional layers and 1 fully connected layer. The first layer is a 3x3 convolutional layer with 16 filters, followed by nine residual blocks with 2 convolutional layers each. The network uses batch normalization and ReLU[2] activation after each
convolutional layer, and includes a global average pooling layer before the final fully connected
layer. ResNet-20 has approximately 0.27 million parameters and has been shown to achieve state-of-
the-art performance on CIFAR-10 and CIFAR-100 datasets.
ResNet-56 is a deeper version of ResNet-20, consisting of 56 layers: 55 convolutional layers and 1 fully connected layer. Like ResNet-20, it starts with a 3x3 convolutional layer with 16 filters and includes several residual blocks with 2 convolutional layers each. It also uses batch normalization
and ReLU activation after each convolutional layer, and includes a global average pooling layer
before the final fully connected layer. ResNet-56 has approximately 0.86 million parameters and has
been shown to outperform ResNet-20 on the CIFAR-10 and CIFAR-100 datasets.
Overall, ResNet-20 and ResNet-56 are examples of smaller ResNet architectures that can be used for
image classification tasks, particularly on datasets with limited amounts of training data. These
networks can be trained efficiently on GPUs and have been shown to achieve good performance on a
variety of benchmarks.
4.2 VGG-16
VGG16[41] is a deep convolutional neural network architecture that was introduced by researchers
at the Visual Geometry Group (VGG) at the University of Oxford in 2014. The architecture consists
of 16 layers, including 13 convolutional layers and 3 fully connected layers, and has been used for a
wide range of computer vision tasks, including image classification, object detection, and
segmentation.
The key innovation of VGG16 is the use of very small convolutional filters (3x3) throughout the
network, which allows the network to learn more complex features without increasing the number of
parameters too much. The convolutional layers are stacked one after the other, with max pooling
layers used to downsample the feature maps and reduce the spatial dimensionality of the data.
Figure 1: VGG-16
The architecture of VGG16 can be divided into five blocks. The first two blocks consist of two convolutional layers each, with 64 and 128 filters respectively, each block followed by a max pooling layer. The third and fourth blocks each consist of three convolutional layers, with 256 and 512 filters respectively, followed by a max pooling[16] layer. The fifth block consists of three convolutional layers with 512 filters, followed by
a max pooling layer. The fully connected layers are then added on top of these convolutional layers
to perform the final classification.
Overall, VGG16 is a powerful architecture that has been shown to achieve state-of-the-art
performance on a variety of computer vision tasks. However, it is a relatively large network with
over 138 million parameters, which can make it difficult to train on limited computational resources.
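The 138-million figure can be reproduced by counting weights and biases from the standard VGG16 configuration (assuming the conventional 224×224 input, which yields a 7×7×512 feature map before the fully connected layers):

```python
# VGG16 configuration: (in_channels, out_channels) per 3x3 conv layer.
convs = [(3, 64), (64, 64),                    # block 1
         (64, 128), (128, 128),                # block 2
         (128, 256), (256, 256), (256, 256),   # block 3
         (256, 512), (512, 512), (512, 512),   # block 4
         (512, 512), (512, 512), (512, 512)]   # block 5

conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in convs)

# Fully connected layers: 7x7x512 flattened -> 4096 -> 4096 -> 1000 classes.
fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(nin * nout + nout for nin, nout in fcs)

total = conv_params + fc_params
print(total)  # 138357544 weights and biases
```

Note that the vast majority of the parameters sit in the first fully connected layer, not in the convolutional blocks.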
CHAPTER 5
ACTIVATION FUNCTIONS
In the process of building a neural network, one of the choices we get to make is what Activation
Function[30] to use in the hidden layer as well as at the output layer of the network. We know, the
neural network has neurons that work in correspondence with weight, bias, and their respective
activation function. In a neural network, we would update the weights and biases of the neurons
on the basis of the error at the output. This process is known as back-propagation. Activation
functions make the back-propagation[6] possible since the gradients are supplied along with the
error to update the weights and biases.
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to introduce
non-linearity into the output of a neuron. A neural network without an activation function is
essentially just a linear regression model. The activation function does the non-linear
transformation to the input making it capable to learn and perform more complex tasks.
3. Vanishing gradient: Activation functions such as sigmoid squash the input into a smaller output space that falls between [0,1]. As a result, the back-propagation algorithm has almost no gradient to propagate backward through the network, and any residual gradients that do exist continue to dilute as they pass down from the top layers. Due to this, the initial hidden layers are left with almost no gradient information. For
hyperbolic tangent and sigmoid[39] activation functions, it has been observed that the
saturation region for large input (both positive and negative) is a major reason behind the
vanishing of gradient. One of the important remedies to this problem is the use of non-
saturating activation functions. Other non-saturating functions, such as ReLU, leaky
ReLU[62], and other variants of ReLU, have been proposed to solve this problem.
4. Finite range/boundedness: Gradient-based training approaches are more stable when the
range of the activation function is finite, because pattern presentations significantly affect
only limited weights.
5. Differentiability: The most desirable quality for using gradient-based optimization
approaches is continuously differentiable activation functions. This ensures that the back-
propagation algorithm works properly.
5.1 Sigmoid Activation Function
The sigmoid activation function is a mathematical function commonly used in artificial neural
networks. It maps any input value to a value between 0 and 1, which makes it useful for modeling
binary classification problems. The sigmoid function is defined as

f(x) = 1 / (1 + e^(-x))

where x is the input to the function and e is the mathematical constant known as Euler's number.
The sigmoid function has a distinctive S-shaped curve: small changes in the input produce relatively large changes in the output when the input is close to 0, where the slope is steepest. As the input value moves away from 0, the output changes more slowly and the curve flattens towards its asymptotes.
The sigmoid function is often used as an activation function in the output layer of a neural
network that is trained to classify input data into one of two classes. The output of the sigmoid
function can be interpreted as the probability that the input belongs to the positive class.
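These properties are easy to verify numerically; the sketch below checks the sigmoid's midpoint and its saturation behaviour for large positive and negative inputs.

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), mapping any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

assert sigmoid(0.0) == 0.5        # centred at 0.5
assert sigmoid(10.0) > 0.9999     # saturates towards 1 for large inputs
assert sigmoid(-10.0) < 0.0001    # and towards 0 for large negative inputs
```

The near-zero slope in the saturated regions is precisely what starves back-propagation of gradient in deep networks.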
However, the use of the sigmoid function has some limitations. It can suffer from the vanishing
gradient problem, which can slow down the learning process in deep neural networks.
Additionally, it is not well-suited for input values that are significantly different from 0, as the output saturates and the gradient approaches zero.

A related alternative is the Tanh Exponential Activation Function (TanhExp), which can improve performance on image classification tasks significantly. The definition of TanhExp is
f(x) = x·tanh(e^x) (5)
TanhExp outperforms its counterparts in both convergence speed and accuracy. Its behaviour also
remains stable even with noise added and dataset altered. It is shown in [61] that without
increasing the size of the network, the capacity of lightweight neural networks can be enhanced by
TanhExp with only a few training epochs and no extra parameters added.
5.5 Esh Activation Function
The Esh activation function was introduced as a novel activation function for image classification tasks.
It has shown promising results in improving model performance and generalization compared to traditional
activation functions such as ReLU, Swish, and GELU. However, its effectiveness in image segmentation
tasks remains unexplored.
f_Esh(x) = x × tanh(sigmoid(x))
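As a quick numerical check of this definition (a standalone sketch, not taken from the report), the function's global minimum can be located by a simple grid scan:

```python
import math

def esh(x):
    # f(x) = x * tanh(sigmoid(x))
    return x * math.tanh(1.0 / (1.0 + math.exp(-x)))

# Scan a fine grid over [-5, 5] for the minimum of the function.
xs = [i / 1000.0 for i in range(-5000, 5001)]
x_min = min(xs, key=esh)
print(x_min, esh(x_min))  # minimum of about -0.274 near x = -1.31
```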
Comparing the Swish, Mish, and Esh activation functions: Esh extends below zero in the negative half like the other smooth functions do, but it has a steeper gradient. Although they are not identical, the curves of Mish and Swish appear intuitively similar to that of Esh, yet Esh requires fewer calculations even where the curves lie close to one another. Esh's first derivative can be calculated as in Eq. (8).
The first and second derivatives of Esh, Swish, and Mish are shown in Fig. 3.
Figure 4: Output landscape comparison of ReLU, Swish, Mish and Esh activation functions
Esh has a minimum value of approximately -0.274, attained close to x = -1.309. Additionally, Esh inherits Swish's "self-gated" characteristic. A function of the form f(x) = x·g(x) is referred to as "self-gated". The input is multiplied by a gating function that takes the input itself as its argument; this preserves the initial distribution of the input on the positive part while simultaneously creating a buffer close to zero on the negative part.
Additionally, Esh makes sure that its output is sparse. According to a sparse activation, not all
inputs in a network with a random initialization state are active. From the definition and representation of Esh, we have

lim(x→−∞) f_Esh(x) = 0

As a result, when the input x has a significantly negative value, the neuron can roughly be considered as not being activated, satisfying the concept of sparsity. While being more likely to be
linearly separable, this sparse feature allows a model to control the actual dimensionality of the
representation for an input. Esh's likelihood of deactivating these neurons is lower than that of
ReLU, which suppresses 50% of the hidden units. We believe that less of the data is affected by
noise, and that ReLU will block more relevant features than Esh. Furthermore, because half of the
neurons in a network with ReLU activation are not active, the network may not function well.
Esh appears to be similar to other smooth activation functions, but it differs from them in a
number of ways.
First, once the input is greater than 1, Esh is nearly equivalent to a linear transformation, with the difference between the changes in its output and input values being no more than 0.01.
Second, Esh has a steeper gradient close to zero, which helps speed up the updating of network
parameters. The network modifies its parameters during backpropagation as

w_new = w_old − η∇w

where η is the current learning rate and ∇w is the backpropagation gradient. The weights of the network before and after updating are represented, respectively, by w_old and w_new.
We refer to L, the network’s loss, as the difference between the output of the network and the
label corresponding to the ground truth. The cross-entropy loss can be used as L for an image recognition task, and is calculated as in Eq. (12).
L = −(1/N) Σᵢ₌₁ᴺ [yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ)] (12)
In the loss function, N is the overall sample count, yᵢ is the ith sample's ground-truth label, and ŷᵢ is the ith sample's network prediction. By updating the network's parameters, the value of L should
be minimized in order to increase accuracy. Then, using the computed loss L, ∇w can be
calculated as
∇w = ∂L/∂w_old (13)
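The cross-entropy loss of Eq. (12) can be computed directly; the labels and predictions in the sketch below are arbitrary illustrative values.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    # Average of -[y log(yhat) + (1 - y) log(1 - yhat)] over all samples.
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

loss = binary_cross_entropy([1, 0], [0.9, 0.1])
print(loss)  # -(log 0.9 + log 0.9) / 2 = -log 0.9, about 0.105
```

Confident, correct predictions (here 0.9 for the positive class and 0.1 for the negative class) give a small loss; driving the predictions towards the wrong class would make the loss grow without bound.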
As a result, if ∇w is slightly larger, the weight's update rate would be accelerated, which would result
in quick convergence. However, since we want to get to the global minimum value, an activation
function with a large gradient can prevent the network from converging, whereas a roughly linear
function is a reasonable option. A bias shift correction of the unit natural gradient is equivalent to scaling up the bias unit and shifting the mean of the incoming units towards zero. Therefore, the steeper
gradient of Esh can also aid in lowering the function’s mean value to zero, which accelerates
learning even further.
In landscape comparison, the other three activation functions display a smoother landscape in
comparison to ReLU, indicating that they do not make abrupt shifts like ReLU does. The transition
curve of Esh is particularly seamless and fluent when compared to the other two smooth functions.
This characteristic ensures that Esh may combine the benefits of both piecewise and non-piecewise
activation functions and results in impressive performance.
CHAPTER 6
OPTIMIZERS
Optimizers are algorithms that are used to update the parameters of a deep learning model during
training to minimize the loss function. There are several types of optimizers, each with their own
advantages and disadvantages:
6.1 Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent updates each parameter in the direction opposite to the gradient of the loss function:

parameter = parameter − lr × gradient

where parameter represents a model parameter (weight or bias), lr is a hyperparameter that controls the step size of the update, and gradient denotes the gradient of the loss function with respect to the parameter.
SGD operates on a single sample or a batch of samples at a time, hence the term "stochastic". It
iteratively performs parameter updates for each sample or batch until convergence or a specified
number of iterations.
SGD has several advantages, including low memory requirements and computational efficiency, as it
only needs to store the gradients for a single batch of examples at a time. However, SGD may suffer
from slow convergence and may get stuck in local minima. To overcome these issues, several
variants of SGD have been developed, such as momentum SGD and Nesterov accelerated gradient
(NAG).
Momentum SGD takes into account the past gradients to reduce the oscillations in the optimization
path and accelerate convergence. It introduces a momentum term that accumulates the past gradients
and adds them to the current gradient. NAG is a variant of momentum SGD that uses a "look-ahead"
approach to compute the gradient at a future point in time, which can lead to faster convergence.
Overall, SGD and its variants are simple yet powerful optimization algorithms that can be effective
in many deep learning applications. However, their performance may depend on the specific task,
dataset, and model architecture, and tuning the learning rate and other hyperparameters may be
necessary to achieve optimal results.
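The momentum variant described above can be sketched on a one-dimensional quadratic loss; the curvature, learning rate, and momentum values below are illustrative choices, not settings from this study.

```python
def train(lr=0.1, momentum=0.9, steps=300):
    # Minimize L(w) = (w - 3)^2 with momentum SGD; the gradient is 2(w - 3).
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        # The velocity accumulates past gradients, smoothing oscillations
        # and carrying the iterate through flat regions.
        velocity = momentum * velocity - lr * grad
        w += velocity
    return w

print(train())  # converges towards the minimum at w = 3
```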
6.2 Root Mean Square Propagation (RMSProp)
RMSProp adapts the learning rate for each parameter using a decaying average of squared gradients:

ma = dr × ma + (1 − dr) × gradient²
parameter = parameter − lr × gradient / (√ma + ϵ) (16)

where ma represents the decaying average of squared gradients, dr controls the weighting of the past gradients, gradient denotes the gradient of the loss function, lr is the step size, and ϵ is a small value (e.g., 1e-8) added for numerical stability.
One advantage of RMSProp over Adagrad is that it uses a moving average of the squared gradients
instead of accumulating all past gradients, which can help to reduce the diminishing learning rate
problem that can occur in Adagrad.
However, like all optimizers, RMSProp has its limitations. For example, it can struggle with saddle
points in the loss landscape, and it may require careful tuning of hyperparameters, such as the learning
rate and decay rate.
Overall, RMSProp is a popular optimizer that can be effective for many deep learning applications,
especially when combined with other techniques, such as learning rate schedules or early stopping.
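The RMSProp update can likewise be sketched on a one-dimensional quadratic loss, using the ma and dr quantities described above; the hyperparameter values are illustrative.

```python
import math

def rmsprop(steps=500, lr=0.02, dr=0.9, eps=1e-8):
    # Minimize L(w) = (w - 3)^2; ma is the decaying average of squared
    # gradients that normalizes each step.
    w, ma = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        ma = dr * ma + (1 - dr) * grad ** 2
        w -= lr * grad / (math.sqrt(ma) + eps)
    return w

print(rmsprop())  # approaches the minimum at w = 3
```

Because the step is normalized by √ma, progress is roughly lr per iteration regardless of the gradient's raw scale; the iterate then hovers in a small band around the minimum rather than settling exactly on it.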
6.3 Adaptive Gradient (Adagrad)
Adagrad[15] is an optimization algorithm that adapts the learning rate for each parameter based
on the historical gradient information. It was developed to address the issue of manually tuning
the learning rate in Stochastic Gradient Descent (SGD)[45].
In Adagrad, the optimizer adapts the learning rate of each parameter based on the sum of the
squares of the gradients for that parameter. This means that the learning rate is reduced for
parameters that have large gradients, which can help to stabilize the optimization process and
prevent overshooting. The learning rate is then scaled by the inverse square root of this sum. The
update rule for Adagrad can be summarized as follows:
√
parameter = parameter − lr × gradient/( accumulator + ϵ) (18)
where accumulator represents a running sum of the squared gradients, lr is the step size, gradient
denotes the gradient of the loss function, and ϵ is a small value (e.g., 1e-8) added for numerical
stability.
Root Mean Square Propagation (RMSProp) is an optimization algorithm that adapts the learning
rate for each parameter based on the historical gradient information. It was developed to address
some of the limitations of Stochastic Gradient Descent (SGD) and Adagrad, another popular
optimizer.
One advantage of Adagrad over SGD is that it can handle sparse data well, as it effectively
prioritizes the learning rate for features that occur more frequently in the data. Another advantage
is that it requires less tuning of hyperparameters, as it adapts the learning rate for each parameter
automatically.
However, Adagrad also has some limitations. For example, it can suffer from a diminishing
learning rate over time, which can cause slow convergence. It can also require a large number of
iterations to converge, especially for large-scale problems with many parameters.
Overall, Adagrad is a useful optimizer that can be effective for many deep learning applications,
especially those with sparse data. However, its limitations may require additional techniques, such
as learning rate schedules or early stopping, to achieve optimal performance.
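The RMSProp and Adagrad update rules can be sketched in NumPy as follows (an illustrative sketch using the notation of the equations above, not any library's API):

```python
import numpy as np

def rmsprop_step(param, grad, ma, lr=0.001, dr=0.9, eps=1e-8):
    """RMSProp: a decaying average of squared gradients scales the step."""
    ma = dr * ma + (1 - dr) * grad ** 2
    return param - lr * grad / np.sqrt(ma + eps), ma

def adagrad_step(param, grad, accumulator, lr=0.1, eps=1e-8):
    """Adagrad: a running sum of squared gradients that never decays."""
    accumulator = accumulator + grad ** 2
    return param - lr * grad / np.sqrt(accumulator + eps), accumulator

# One step on f(x) = x^2 at x = 1.0, where the gradient is 2.0
p, acc = adagrad_step(1.0, 2.0, 0.0)   # -> 1.0 - 0.1 * 2 / sqrt(4) = 0.9
```

Because Adagrad's accumulator only grows, its effective step shrinks over time, whereas RMSProp's decaying average keeps the step size responsive to recent gradients.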
6.4 Adadelta
In Adadelta, the optimizer maintains a moving average of the squared gradients and a moving
average of the squared updates. The learning rate is then computed based on the ratio of these two
moving averages. This allows Adadelta to adapt to the gradient scale, which can help to stabilize
the optimization process and prevent overshooting. The update rule for a parameter θ using
Adadelta can be summarized as follows:
E[g²]t = ρ × E[g²](t−1) + (1 − ρ) × gt²   (19)
rmsδ = √(E[∆θ²](t−1) + ϵ)   (20)
∆θt = −(rmsδ / √(E[g²]t + ϵ)) × gt   (21)

where gt is the gradient of the objective function w.r.t. parameter θ at time step t, E[g²]t is the
exponentially decaying average of squared gradients, ϵ is a small constant for numerical stability,
∆θt is the update to the parameter θ at time step t, E[∆θ²]t is the exponentially decaying average of
squared parameter updates, and ρ is the decay rate controlling the exponential decay of the moving
averages.
One advantage of Adadelta over RMSProp and Adagrad is that it does not require the tuning of a
global learning rate hyperparameter, which can make it more robust and easier to use.
Additionally, Adadelta can handle non-stationary objectives, as it does not require a decaying
average of the gradients.
However, Adadelta may require more iterations to converge than other optimizers, as it typically
starts with a higher learning rate. It can also be sensitive to the choice of hyperparameters, such as
the decay rate and the initial learning rate.
Overall, Adadelta is a useful optimizer that can be effective for many deep learning applications,
especially those with non-stationary objectives or large amounts of data. However, careful tuning
of hyperparameters may be necessary to achieve optimal performance.
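Equations (20)-(21) translate directly into a short NumPy sketch (illustrative only; the ρ and ϵ values are assumptions, not the study's settings):

```python
import numpy as np

def adadelta_step(param, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """Adadelta: the ratio of two decaying averages replaces a global lr."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2            # E[g^2] at step t
    delta = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * delta ** 2         # E[delta^2] at step t
    return param + delta, eg2, edx2

# A few steps on f(x) = x^2 from x = 5.0; the early steps are tiny because
# the update average starts at zero (only eps seeds the numerator).
x, eg2, edx2 = 5.0, 0.0, 0.0
for _ in range(50):
    x, eg2, edx2 = adadelta_step(x, 2 * x, eg2, edx2)
```

Note that no learning rate appears anywhere in the update, which is the property discussed above.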
6.5 Adamax
Adamax is an optimization algorithm that is a variant of the popular Adam optimizer. Adamax is
designed to handle the "vanishing updates" problem that can occur in Adam when the gradient
values are very large or the learning rate is very small.
In Adamax, the optimizer uses the infinity norm (maximum absolute value) of the gradients
instead of the L2 norm used in Adam. This means that Adamax is less sensitive to large gradients.
Like Adam, Adamax also maintains exponential moving averages of the gradients and their
squares, as well as a bias correction term. These estimates are used to compute the weight updates
and the learning rate. The update rule for a parameter θ using Adamax can be summarized as
follows:

mt = β1 × m(t−1) + (1 − β1) × gt
ut = max(β2 × u(t−1), |gt|)
θt = θ(t−1) − (η / (1 − β1^t)) × mt / ut

where gt is the gradient of the objective function w.r.t. parameter θ at time step t, mt is the
exponentially decaying average of gradients, ut is the exponentially weighted infinity norm of the
gradients, η is the learning rate, β1 and β2 are the decay rates for the gradient and infinity norm
averages, respectively and ϵ is a small constant for numerical stability.
One advantage of Adamax over Adam is that it can be more robust to large gradients and smaller
learning rates, which can make it more suitable for certain deep learning applications. However,
Adamax may require tuning of its own set of hyperparameters, such as the beta1 and beta2
parameters that control the exponential moving averages.
Overall, Adamax is a useful optimizer that can be effective for many deep learning applications,
especially when the "vanishing updates" problem is a concern. However, careful tuning of
hyperparameters may be necessary to achieve optimal performance.
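A minimal sketch of the Adamax update described above (the formulas follow the standard Adamax of Kingma and Ba; the hyperparameter values are illustrative):

```python
def adamax_step(param, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999):
    """Adamax: the infinity norm of past gradients replaces Adam's L2 norm."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients
    u = max(beta2 * u, abs(grad))             # exponentially weighted inf-norm
    return param - (lr / (1 - beta1 ** t)) * m / u, m, u

# One step at param = 1.0 with gradient 2.0:
p, m, u = adamax_step(1.0, 2.0, 0.0, 0.0, t=1)
# u = max(0, 2) = 2, m = 0.2, step = (0.002 / 0.1) * 0.1 = 0.002 -> p = 0.998
```

Because u is a running maximum rather than a sum of squares, a single very large gradient caps the step size instead of inflating the denominator indefinitely.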
6.6 Adam
The Adam optimizer[26] maintains two moving averages of the gradient: the first moment (mean) and
the second moment (uncentered variance). These moving averages are used to compute the
adaptive learning rates for each parameter during training. The update rule for Adam can be
summarized as follows:

m = beta1 × m + (1 − beta1) × gradient
v = beta2 × v + (1 − beta2) × gradient²
m̂ = m / (1 − beta1^t),  v̂ = v / (1 − beta2^t)
parameter = parameter − lr × m̂ / (√v̂ + epsilon)

where m represents the estimate of the first moment (mean) of the gradients, v denotes the
estimate of the second moment (uncentered variance) of the gradients, beta1 and beta2 are the
exponential decay rates for the moments, lr is the step size, gradient is the gradient of the loss
function, and epsilon is a small value (e.g., 1e-8) added for numerical stability.
Adam combines the advantages of two other optimization algorithms: RMSprop and AdaGrad.
Like RMSprop, it uses the moving average of the squared gradients to scale the learning rate.
However, unlike RMSprop, it also uses the moving average of the first moment of the gradient,
which can help the algorithm handle noisy gradients and converge faster.
Adam also includes bias correction to compensate for the fact that the moving averages are
initialized at zero, which can lead to bias in the early iterations of training.
One of the main advantages of Adam is that it requires minimal tuning of hyperparameters, as it
automatically adapts the learning rate for each parameter. It has become a popular choice for
deep learning tasks, particularly in computer vision and natural language processing.
However, it’s worth noting that Adam may not always perform optimally in certain scenarios,
such as when the data is imbalanced or when the gradients are very sparse. In such cases, other
optimization algorithms may be more suitable.
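Putting the moment estimates and bias correction together, one Adam step can be sketched as follows (a NumPy illustration using the notation above, not the study's training code):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 from x = 5.0; each step moves roughly lr in magnitude
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```

The bias correction matters mainly in the first few iterations, when m and v are still close to their zero initialization.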
CHAPTER 7
METHODOLOGY
This section will outline the methodology used to examine the performance of the Esh activation
function when combined with different optimizers and learning rates across various datasets
available in the Python library Keras[11].
7.1 Datasets
7.1.1 MNIST
MNIST[29] (Modified National Institute of Standards and Technology) is one of the largest and
most well-known standard datasets of handwritten digits, and it is frequently used to train
different image processing methods. The dataset is also widely used for training and testing in the
field of deep learning. Each image is a 28×28 grayscale image that is linked to a label from one of
ten categories, with 60,000 training examples and 10,000 test examples.
7.1.2 EMNIST
The EMNIST[12] (Extended Modified National Institute of Standards and Technology) dataset is
an extension of the MNIST dataset, which is a collection of handwritten digits. However, unlike
MNIST, EMNIST includes both digits and alphabetic characters.
The EMNIST dataset consists of 62 classes in total. These classes correspond to the 10 digits (0-9)
and the 52 uppercase and lowercase alphabetic characters (A-Z, a-z).
7.1.3 CIFAR-10
The CIFAR-10[27] dataset consists of 50,000 training images and 10,000 test images. It is divided
into five training batches and one test batch, each with 10,000 images. The test batch contains
exactly 1,000 randomly selected images from each class. The training batches contain the
remaining images in random order; between them, the training batches contain exactly 5,000
images from each class. The classes are completely mutually exclusive.
7.2 Data Augmentation
Data augmentation[40] is a technique used in machine learning and computer vision to artificially
increase the size of a training dataset by applying various transformations to the existing data.
These transformations can include flipping, rotation, cropping, scaling, and adding noise to the
images. The goal of data augmentation is to increase the diversity of the training data and reduce
overfitting, allowing the model to generalize better to new, unseen data.
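The transformations listed above are easy to express directly on image arrays; a NumPy sketch on a synthetic 28×28 grayscale image (illustrative, not the study's augmentation pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))                 # stand-in for one grayscale image

flipped = np.fliplr(image)                   # horizontal flip
rotated = np.rot90(image)                    # 90-degree rotation
cropped = image[2:26, 2:26]                  # crop (usually resized back later)
noisy = image + rng.normal(0.0, 0.05, image.shape)   # additive Gaussian noise

augmented = [flipped, rotated, cropped, noisy]
```

In practice such transformations are applied on the fly during training so that every epoch sees a slightly different version of each image.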
7.3 Preprocessing
The preprocessing stage is crucial in image classification tasks, as it enhances the quality of input
images and ensures that they are in the appropriate format for training the classification model.
Below are some typical preprocessing techniques that can be utilized for the dataset:
Resizing: This technique is used to adjust the size of the images to a fixed size that can be
easily fed to the neural network.
Normalization[64]: It involves scaling the pixel values of the images to a common range,
such as [0,1] or [-1,1], to help the neural network learn more effectively.
Color space conversion: This technique involves converting the images from one color
space to another, such as RGB to grayscale or HSV, to better highlight certain features.
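The normalization and color-space steps above can be sketched in NumPy (a minimal illustration on a toy RGB array; the grayscale weights are the common ITU-R BT.601 mix, an assumption rather than something specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = (rng.random((4, 4, 3)) * 255).astype(np.uint8)   # toy RGB image

unit = rgb.astype(np.float32) / 255.0                  # scale to [0, 1]
signed = unit * 2.0 - 1.0                              # scale to [-1, 1]
gray = unit @ np.array([0.299, 0.587, 0.114], dtype=np.float32)  # RGB -> gray
```

Scaling to a common range keeps the input magnitudes comparable across pixels, which tends to make gradient-based training better behaved.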
The deep learning algorithms in this study were implemented using Keras[11] with Google
TensorFlow[1] backend, and the experiments were conducted using a computational resource
based on Google Colab[5] Pro+.
This study utilized a Google Colab environment with an NVIDIA A100 Cloud GPU having 1100
CHAPTER 8
RESULTS AND DISCUSSION
This section presents the experimental evaluation of the proposed Esh activation function. We
compared its performance to that of Mish and Swish with various optimizers (Adam, SGD,
Adagrad, and RMSProp) and learning rates (0.1, 0.01, 0.001, and 0.0001). The experiments were
carried out on Google Colab Pro+ over several hours using the well-known CNN architectures
VGG16, ResNet-20, and ResNet-56 on the EMNIST and CIFAR-10 datasets.
Table 1: LR vs Accuracy comparison of different optimizers on activation functions with VGG-16 on CIFAR-10
dataset
Table 2: LR vs Loss of different optimizers on activation functions with VGG-16 on CIFAR-10 dataset
Tables 1 and 2 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the CIFAR-10 dataset using four different optimizers (Adam, SGD,
Adagrad, and RMSProp) with learning rates of 0.1, 0.01, 0.001, and 0.0001 on the VGG16
architecture. The results indicate that the Esh activation function outperforms the other activation
functions, achieving better accuracy and more stable losses on VGG16 with the RMSProp
optimizer and a learning rate of 0.001.
Tables 3 and 4 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the CIFAR-10 dataset using the same four optimizers and learning
rates on the ResNet-20 architecture. The results indicate that the Esh activation function achieves
comparable accuracy and losses on ResNet-20 with the Adam optimizer and a learning rate of
0.001.
Table 5: LR vs Accuracy comparison of different optimizers on activation functions with ResNet-56 on CIFAR-10
dataset
Table 6: LR vs Loss of different optimizers on activation functions with ResNet-56 on CIFAR-10 dataset
Tables 5 and 6 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the CIFAR-10 dataset using the same four optimizers and learning
rates on the ResNet-56 architecture. The results indicate that the Esh activation function
outperforms the other activation functions, achieving better accuracy and more stable losses on
ResNet-56 with the RMSProp optimizer and a learning rate of 0.001.
Tables 7 and 8 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the EMNIST dataset using the same four optimizers and learning
rates on the VGG16 architecture. The results indicate that the Esh activation function outperforms
the other activation functions, achieving better accuracy and more stable losses on VGG16 with
the RMSProp optimizer and a learning rate of 0.001.
Tables 9 and 10 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the EMNIST dataset using the same four optimizers and learning
rates on the ResNet-20 architecture. The results indicate that the Esh activation function achieves
comparable accuracy and losses on ResNet-20 with the Adam optimizer and a learning rate of
0.001.
Tables 11 and 12 presented above compare the accuracy and loss achieved by the Esh, Mish, and
Swish activation functions on the EMNIST dataset using the same four optimizers and learning
rates on the ResNet-56 architecture. The results indicate that the Esh activation function achieves
comparable accuracy and losses on ResNet-56 with the RMSProp optimizer and a learning rate of
0.001.
Our results suggest that using the RMSProp optimizer with a learning rate of 0.001 and the VGG16
architecture produces better results than the other options on the CIFAR-10 dataset. The results
also suggest that using the Adagrad optimizer with a learning rate of 0.001 and the ResNet-56
architecture produces better results than the other options on the EMNIST dataset.
The tables presented above compare the accuracy and loss achieved by various activation functions
on the CIFAR-10 and EMNIST datasets utilizing four different optimizers: Adam, SGD, Adagrad,
and RMSProp with learning rates of 0.1, 0.01, 0.001, and 0.0001 on VGG16, ResNet-20, and
ResNet-56 architectures. The results indicate that the Esh activation function outperforms other
activation functions with better accuracy and stable losses across all three architectures.
CHAPTER 9
CONCLUSION AND FUTURE WORK
This paper presents a comparison of different optimizers and learning rates on a novel activation
function, Esh, defined as f(x) = x · tanh(sigmoid(x)), for deep neural networks. Our findings show
that the Esh activation function yields better accuracy and more stable losses across all three
architectures, indicating its effectiveness in improving the performance of these models.
Considering some commonly desirable properties of activation functions, such as capturing
complex patterns and exhibiting smoothness, we can make the following analysis:
1. Capturing Complex Patterns: The Esh function involves the composition of the tanh and
sigmoid functions. This combination allows it to capture complex patterns and non-linear
relationships. The sigmoid function introduces a non-linear mapping of values to the range
(0, 1), and the hyperbolic tangent (tanh) further transforms those values non-linearly. The
interaction of these non-linear functions can help capture more intricate patterns than Mish
and Swish, which involve only the softplus and sigmoid functions, respectively.
2. Smoothness: The Esh function also has the advantage of incorporating the tanh function,
a smooth function that maps values to the range [-1, 1]. The tanh function has a symmetric
and smooth shape, allowing for a smoother overall function than the Swish function, which
uses only the sigmoid function. The sigmoid function, while also smooth, maps values to
the range (0, 1) and may not exhibit the same degree of smoothness as tanh.
Both the Esh and Mish functions involve the tanh function, which is smooth and maps values to
the range [-1, 1]; in terms of smoothness, there is no significant difference between the two.
Based on the consideration of capturing complex patterns, the composition of tanh and sigmoid in
the Esh function may provide more flexibility and expressive power than Mish and Swish.
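To make the comparison concrete, the three functions can be written out in NumPy (a minimal sketch; the Mish and Swish forms follow their standard definitions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def esh(x):
    """Esh: f(x) = x * tanh(sigmoid(x))."""
    return x * np.tanh(sigmoid(x))

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)

def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * np.tanh(np.log1p(np.exp(x)))

xs = np.linspace(-5.0, 5.0, 11)
values = esh(xs)            # smooth, non-monotonic near the origin
```

All three are smooth and non-monotonic; Esh differs from Swish only in the extra tanh wrapped around the sigmoid, and from Mish in using the sigmoid rather than the softplus inside the tanh.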
Our results also suggest that using the RMSProp optimizer with a learning rate of 0.001 produces
better results than the other options on the CIFAR-10 and EMNIST datasets.
Furthermore, the Esh activation function produces comparable results in terms of accuracy and loss
when compared to Mish and Swish, indicating that it can be a viable alternative to these popular
activation functions.
However, the choice between these functions depends on the specific requirements of the problem
and the characteristics of the dataset being analyzed.
REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A.
Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y.
Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D.
Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V.
Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous
systems, 2015. URL https://www.tensorflow.org/. Software available from
tensorflow.org.
[2] A. F. Agarap. Deep learning using rectified linear units (relu), 2019.
[4] J. Bernstein, J. Zhao, M. Meister, M.-Y. Liu, A. Anandkumar, and Y. Yue. Learning
compositional functions via multiplicative weight updates. Advances in neural information
processing systems, 33:13319–13330, 2020.
[5] E. Bisong. Google Colaboratory, pages 59–64. Apress, Berkeley, CA, 2019. ISBN
978-1-4842-4470-8. doi: 10.1007/978-1-4842-4470-8_7. URL
https://doi.org/10.1007/978-1-4842-4470-8_7.
[12] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. Emnist: an extension of mnist to
handwritten letters, 2017.
[14] S. De, A. Mukherjee, and E. Ullah. Convergence guarantees for rmsprop and adam
in non-convex optimization and an empirical comparison to nesterov acceleration, 2018.
[17] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings
of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–
15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition,
2015.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification, 2015.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks,
2016.
[22] T. He, Z. Zhang, H. Zhang, Z. Zhang, and J. Xie. Bag of tricks for image
classification with convolutional neural networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 558–567, 2018.
[23] L. Heim, A. Biri, Z. Qu, and L. Thiele. Measuring what really matters: Optimizing
neural networks for TinyML. arXiv preprint arXiv:2104.10645, 2021.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.
[27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images.
Technical report, University of Toronto, 2009.
[31] K. Lee and J. Yim. Hyperparameter optimization with neural network pruning, 2022.
[33] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the
adaptive learning rate and beyond, 2021.
[35] L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic
bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
[36] D. Misra. Mish: A self regularized non-monotonic neural activation function. arXiv
preprint arXiv:1908.08681, 2019.
[40] L. Perez and J. Wang. The effectiveness of data augmentation in image classification
using deep learning, 2017.
[42] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv
preprint arXiv:1710.05941, 2017.
[43] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions, 2017.
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition, 2015.
[49] L. N. Smith. Best practices for applying deep learning to novel applications, 2017.
[50] L. N. Smith. Cyclical learning rates for training neural networks, 2017.
[51] S. Sonoda and N. Murata. Neural network with unbounded activation functions is
universal approximator. Applied and Computational Harmonic Analysis, 43(2):233–268,
2017.
[54] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to
human-level performance in face verification. In 2014 IEEE Conference on Computer
Vision and Pattern Recognition, 2014.
[58] R. Wei, H. Yin, J. Jia, A. R. Benson, and P. Li. Understanding non-linearity in graph
neural networks from the bayesian-inference perspective, 2022.
[59] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of
adaptive gradient methods in machine learning, 2018.
[60] S. Wojtowytsch. Stochastic gradient descent with noise of machine learning type.
part i: Discrete time analysis, 2021.
[61] X. Liu. TanhExp: A smooth activation function with high convergence speed
for lightweight neural networks, 2020.
[62] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in
convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[63] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks,
2017.
[68] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for neural networks for
image processing, 2018.
[69] Ò. Lorente, I. Riera, and A. Rana. Image classification with classic and deep
learning techniques, 2021.