
Loss Functions

• A loss function compares the target and predicted output values; it measures how well the neural network models the training data.

• In the training process, we aim to minimize this loss between the predicted and target outputs.
Loss Functions
Goal:

Find weights, w, and biases, b, that minimize the value of J (the average loss).

Parameters (learned during training): weights, biases
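Written out explicitly (assuming a training set of N samples with targets yᵢ and predicted outputs ŷᵢ, notation not given on the slides), the average loss is

J(w, b) = (1/N) · Σᵢ L(ŷᵢ, yᵢ)

where L is the per-sample loss, for example the squared error (ŷᵢ − yᵢ)²; training searches for the weights and biases that make J as small as possible.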


Types of Loss Functions
• Supervised Learning
• Regression
• Compare the predicted value and target, which are numerical values
• Ex: Mean Absolute Error, Mean Squared Error

• Classification
• Compare the predicted value and target, which are labels
• Ex: Binary Cross-Entropy (Binary classification problems)
• Categorical Cross-Entropy (Multi-class Classification problems)
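As a minimal NumPy sketch of how these losses could be computed on a small batch (the array values below are made-up illustrations, not from the slides):

```python
import numpy as np

# Made-up regression targets and predictions (illustrative values only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)      # Mean Squared Error

# Made-up binary classification labels and predicted probabilities
labels = np.array([1, 0, 1, 1])
probs  = np.array([0.9, 0.2, 0.7, 0.6])

eps = 1e-12  # small constant to avoid log(0)
bce = -np.mean(labels * np.log(probs + eps)
               + (1 - labels) * np.log(1 - probs + eps))  # Binary Cross-Entropy

print(mae, mse, bce)
```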
Optimizers
• An optimizer is an algorithm used to update the parameters (weights
and biases) of the model during training, with the goal of minimizing
the error between the predicted output and the actual output.
• The optimizer works by adjusting the values of the parameters in the
direction of steepest descent of the loss function.
• The loss function is a measure of how well the model is performing
on the training data, and the goal of the optimizer is to find the set
of parameters that minimize this loss.
Optimizers Examples
• There are many different optimizers that can be used in neural
networks, each with its own advantages and disadvantages.
• Some of the most popular optimizers include:
• Stochastic Gradient Descent (SGD)
• Adam
• RMSProp
• Adagrad
Optimizers – In detail
• Stochastic Gradient Descent (SGD): This is a simple and widely used optimizer
that updates the parameters based on the gradient of the loss function with
respect to the parameters.

• Adam: This is a popular optimizer that uses a combination of adaptive learning rates and momentum to update the parameters.

• RMSProp: This is another optimizer that uses adaptive learning rates, but it also
includes a moving average of the squared gradients to normalize the updates.

• Adagrad: This optimizer adapts the learning rate of each parameter based on the
historical gradient information.
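A minimal sketch of the basic parameter update these optimizers build on, i.e. one vanilla SGD step. The gradient function below is a hypothetical stand-in (a simple quadratic loss), not something from the slides:

```python
import numpy as np

def grad_loss(w):
    # Hypothetical gradient of a simple quadratic loss J(w) = sum(w**2)
    return 2 * w

w = np.array([1.0, -2.0, 0.5])   # current parameters
learning_rate = 0.1

# One SGD step: move the parameters against the gradient of the loss
w = w - learning_rate * grad_loss(w)
print(w)
```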
How to Select a Suitable Optimizer?
• Choosing the right optimizer for a particular task depends on several
factors, including
• the complexity of the model
• the size of the dataset
• the desired speed and accuracy of the training process.

Momentum optimization accumulates the gradients over previous iterations and uses this accumulated gradient to update the model parameters.

For example, the Adam optimizer is generally considered to be faster than the vanilla SGD (Stochastic Gradient Descent) optimizer in most cases. This is because Adam combines the advantages of both Adagrad (adaptive learning rates) and momentum optimization, and it uses adaptive learning rates to adjust the step size of each parameter during training. This means that the learning rate is adjusted automatically for each parameter, which can speed up the convergence of the training process and make it more robust to changes in the learning rate.
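A minimal NumPy sketch of the momentum idea described above, accumulating gradients across iterations, assuming the same hypothetical quadratic loss as before (the learning rate and momentum coefficient are illustrative choices, not from the slides):

```python
import numpy as np

def grad_loss(w):
    # Hypothetical gradient of a simple quadratic loss J(w) = sum(w**2)
    return 2 * w

w = np.array([1.0, -2.0, 0.5])
velocity = np.zeros_like(w)       # accumulated gradient from previous iterations
learning_rate, beta = 0.1, 0.9    # beta controls how strongly past gradients persist

for _ in range(50):
    velocity = beta * velocity + grad_loss(w)   # accumulate gradients over iterations
    w = w - learning_rate * velocity            # update using the accumulated gradient

print(w)  # should be close to the minimum at zero
```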
Batch Processing
• In an ANN, batch processing refers to the practice of training the network
on a batch of input data samples at a time, instead of training on each
individual sample separately.
• When training a neural network, batch processing involves dividing
the training data into smaller subsets (batches) and feeding these
batches through the network to update the model's parameters.
• Batch size:
• The size of a batch must be greater than or equal to one and less than
or equal to the number of samples in the training dataset.
Batch Processing
• During batch processing, the gradients of the loss function with
respect to the model parameters are computed for each sample in
the batch, and then these gradients are averaged across the batch.
• The averaged gradients are then used to update the model
parameters. This process is repeated for each batch in the training
data until the model converges to an acceptable level of accuracy.
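A minimal NumPy sketch of one such batch update, assuming a hypothetical per-sample gradient function for a linear model with squared error (the data, batch size, and learning rate below are illustrative, not from the slides):

```python
import numpy as np

def grad_for_sample(w, x, y):
    # Hypothetical per-sample gradient: J_i = (w.x - y)^2  ->  dJ_i/dw = 2*(w.x - y)*x
    return 2 * (w @ x - y) * x

w = np.zeros(3)                    # model parameters
X_batch = np.random.randn(32, 3)   # one batch of 32 input samples (illustrative)
y_batch = np.random.randn(32)      # matching targets (illustrative)
learning_rate = 0.01

# Gradients are computed per sample, then averaged across the batch
grads = np.array([grad_for_sample(w, x, y) for x, y in zip(X_batch, y_batch)])
avg_grad = grads.mean(axis=0)

# The averaged gradient is used to update the parameters
w = w - learning_rate * avg_grad
```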
Why Batch Processing?
• Can be used to reduce the impact of noisy samples on the model's
training by averaging the gradients across multiple samples.
• Can help speed up the training process by allowing for parallel
processing of batches.
• Can help stabilize the training process by providing a more consistent
estimate of the gradient across multiple samples.
Batch size used during training is an important hyperparameter that
needs to be optimized. Smaller batch sizes typically lead to faster
convergence and better generalization performance, but larger batch
sizes may lead to more stable convergence and better hardware
utilization.
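In a framework such as Keras, the batch size is simply an argument to the training call; a hedged sketch (the model, data shapes, and the value 32 are illustrative assumptions, not from the slides):

```python
import numpy as np
import tensorflow as tf

# Illustrative toy data: 1000 samples with 8 features, binary targets
X = np.random.randn(1000, 8).astype("float32")
y = (np.random.rand(1000) > 0.5).astype("float32")

# Illustrative model, not from the slides
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# batch_size is the hyperparameter discussed above; values such as 16, 32, 64 are typically tried
model.fit(X, y, batch_size=32, epochs=5)
```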
How to Measure the Performance of the Final Model
• Regression Tasks:
• Mean Squared Error (MSE)
• R-squared (R2)
• Classification Tasks:
• Accuracy
• Confusion Matrix
• Precision and Recall
• F1-score
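One common way to compute these metrics on a held-out test set is scikit-learn; a minimal sketch (the toy label and prediction arrays below are illustrative assumptions, not results from the slides):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, mean_squared_error, r2_score)

# Classification: toy true labels vs. model predictions (illustrative values)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))

# Regression: toy targets vs. predictions (illustrative values)
t_true = [3.0, -0.5, 2.0, 7.0]
t_pred = [2.5,  0.0, 2.0, 8.0]

print(mean_squared_error(t_true, t_pred), r2_score(t_true, t_pred))
```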
