Loss Functions Explained


Harsha Bommana
Sep 30 · 8 min read

In any deep learning project, configuring the loss function is one of the most
important steps to ensure the model will work in the intended manner. The
loss function gives a lot of practical flexibility to your neural networks, and
it defines exactly how the output of the network is connected to the rest of
the training process.

There are several tasks neural networks can perform, from predicting
continuous values like monthly expenditure to classifying discrete classes
like cats and dogs. Each different task would require a different type of loss
since the output format will be different. For very specialized tasks, it’s up
to us how we want to define the loss.

From a very simplified perspective, the loss function (J) can be defined as a
function which takes in two parameters:

1. Predicted Output

2. True Output

Neural Network Loss Visualization

This function will essentially calculate how poorly our model is performing
by comparing what the model is predicting with the actual value it is
supposed to output. If Y_pred is very far off from Y, the Loss value will be
very high. However, if both values are almost the same, the Loss value will be
very low. Hence we need a loss function that can penalize a model effectively
while it is training on a dataset.
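
As a minimal sketch in Python (using squared error as a stand-in; any loss
follows the same pattern of comparing Y_pred with Y):

def squared_error(y_pred, y_true):
    # A loss takes the prediction and the true value and returns a
    # non-negative number: the larger it is, the worse the prediction.
    return (y_pred - y_true) ** 2

print(squared_error(2.5, 3.0))  # close prediction -> small loss (0.25)
print(squared_error(9.0, 3.0))  # far-off prediction -> large loss (36.0)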

If the loss is very high, this huge value will propagate through the network
during training and the weights will be changed more than usual. If the loss
is small, the weights won't change that much, since the network is already
doing a good job.

This scenario is somewhat analogous to studying for exams. If someone does
poorly in an exam, we can say their loss is very high, and they will have to
change a lot of things within themselves in order to get a better grade next
time. However, if the exam went well, they wouldn't do anything very
different from what they are already doing for the next exam.

Now let’s look at classification as a task and understand how the loss
functions work in this case.

. . .

Classification Losses
When a neural network is trying to predict a discrete value, we can
consider it to be a classification model. This could be a network trying to
predict what kind of animal is present in an image, or whether an email is
spam or not. First let’s look at how the output is represented for a
classification neural network.

Classification Neural Network Output Format

The number of nodes in the output layer depends on the number of
classes present in the data. Each node represents a single class, and the
value of each output node essentially represents the probability of that
class being the correct class.

Pr(Class 1) = Probability of Class 1 being the correct class

Once we get the probabilities of all the different classes, we will consider
the class having the highest probability to be the predicted class for that
instance. First let’s explore how binary classification is done.

Binary Classification
In binary classification, there will be only one node in the output layer, even
though we will be predicting between two classes. In order to get the output
in a probability format, we need to apply an activation function. Since a
probability requires a value between 0 and 1, we will use the sigmoid
function, which can squish any real value to a value between 0 and 1.
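
For reference, the sigmoid function has the standard form:

Sigmoid(x) = 1 / (1 + e^(-x))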

Sigmoid Function Graph Visualization

As the input to the sigmoid becomes larger and tends to plus infinity, the
output of the sigmoid will tend to 1. And as the input becomes smaller and
tends to negative infinity, the output will tend to 0. Now we are guaranteed
to always get a value between 0 and 1, which is exactly how we need it to be
since we require probabilities.
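
A minimal sketch of this in Python (using NumPy; the function name is my
own):

import numpy as np

def sigmoid(x):
    # Squishes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(10.0))   # ~0.99995, tends to 1 for large inputs
print(sigmoid(-10.0))  # ~0.00005, tends to 0 for very negative inputs
print(sigmoid(0.0))    # exactly 0.5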

If the output is above 0.5 (50% probability), we consider it to fall under the
positive class, and if it is below 0.5, we consider it to fall under the negative
class. For example, if we are training a network to classify between cats and
dogs, we can assign dogs the positive class, so the target value for dogs in
the dataset will be 1; similarly, cats will be assigned the negative class, and
the target value for cats will be 0.

The loss function we use for binary classification is called binary cross
entropy (BCE). This function effectively penalizes the neural network for a
binary classification task. Let's look at what this function looks like.

Binary Cross Entropy Loss Graphs

As you can see, there are two separate functions, one for each value of Y.
When we need to predict the positive class (Y = 1), we will use

Loss = -log(Y_pred)

And when we need to predict the negative class (Y = 0), we will use

Loss = -log(1-Y_pred)

As you can see in the graphs, for the first function, when Y_pred is equal to
1, the Loss is equal to 0, which makes sense because Y_pred is exactly the
same as Y. As Y_pred gets closer to 0, we can observe the Loss value
increasing at a very high rate, and as Y_pred approaches 0 the Loss tends to
infinity. This is because, from a classification perspective, 0 and 1 have to be
polar opposites, since they each represent completely different classes. So
when Y_pred is 0 while Y is 1, the loss has to be very high in order for the
network to learn its mistakes more effectively.
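
A minimal sketch in Python (NumPy; the name is my own, and the small
epsilon clip is a common guard against log(0), not something covered above):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so that log(0) never occurs.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_cross_entropy(1.0, 0.99))  # ~0.01, confident and correct
print(binary_cross_entropy(1.0, 0.01))  # ~4.61, confident and wrong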

Binary Classification Loss Comparisons

We can mathematically represent the entire loss function in a single
equation as follows:

Binary Cross Entropy Full Equation
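
Written out over N training examples, the standard form is:

Loss = -(1/N) * Σ [ Y_i * log(Y_pred_i) + (1 - Y_i) * log(1 - Y_pred_i) ], summed over i = 1 to N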

This loss function is also called Log Loss. This is how the loss function is
designed for a binary classification neural network. Now let's move on to
see how the loss is defined for a multiclass classification network.

Multiclass Classification
Multiclass classification is appropriate when we need our model to predict
exactly one class out of several for each input. Now, since we are still
dealing with probabilities, it might make sense to just apply sigmoid to all
the output nodes so that we get values between 0 and 1 for all the outputs,
but there is an issue with this. When we are considering probabilities for
multiple classes, we need to ensure that the sum of all the individual
probabilities equals one, since that is how a probability distribution is
defined. Applying sigmoid does not ensure that the sum is always equal to
one, hence we need to use another activation function.

The activation function we use in this case is softmax. This function ensures
that all the output nodes have values between 0 and 1 and that the sum of
all output node values always equals 1. The formula for softmax is as
follows:

Softmax Formula
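
For output node i, given raw output values z, the standard form is:

Softmax(z_i) = e^(z_i) / Σ_j e^(z_j)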

Let’s visualize this with an example:

Softmax Example Visualization
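
For instance, take illustrative raw outputs of [2.0, 1.0, 0.1]: the exponentials
are roughly [7.39, 2.72, 1.11], their sum is about 11.21, and dividing each
exponential by that sum gives probabilities of roughly [0.66, 0.24, 0.10],
which add up to 1.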

So as you can see, we are simply passing all the values into an exponential
function. After that, to make sure they are all in the range of 0 to 1 and that
the sum of all the output values equals 1, we just divide each exponential by
the sum of all the exponentials.

So why do we have to pass each value through an exponential before
normalizing them? Why can't we just normalize the values themselves? This
is because the goal of softmax is to make sure one value is very high (close
to 1) and all other values are very low (close to 0). Softmax uses the
exponential to make sure this happens, and we then normalize because we
need probabilities.
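
A minimal sketch in Python (NumPy; subtracting the max is a standard
numerical-stability trick, not part of the formula above):

import numpy as np

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # ~[0.66, 0.24, 0.10]
print(probs.sum())  # 1.0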

Now that our outputs are in a proper format, let's go ahead and look at how
we configure the loss function for this. The good thing is that the loss
function is essentially the same as that of binary classification. We just
apply log loss to each output node with respect to its target value, and then
we sum this across all output nodes.
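
A sketch of this in Python (assuming one-hot target vectors and softmax
outputs; the names are my own):

import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    # y_true is one-hot; y_pred holds softmax probabilities.
    # Only the term for the correct class contributes to the sum.
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.66, 0.24, 0.10])
print(categorical_cross_entropy(y_true, y_pred))  # ~0.42, i.e. -log(0.66)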

Categorical Cross Entropy Visualization

This loss is called Categorical Cross Entropy. Now let's move on to a
special case of classification called multilabel classification.

Multilabel Classification
Multilabel classification is done when your model needs to predict multiple
classes as the output. For example, let’s say you are training a neural
network to predict the ingredients present in a picture of some food. There
will be multiple ingredients we need to predict so there will be multiple 1’s
in Y.

For this we can't use softmax, because softmax drives one class toward 1
and forces all the other classes toward 0. So instead we can simply keep
sigmoid on all the output node values, since we are trying to predict each
class's individual probability.

As for the loss, we can directly use log loss on each node and sum it, similar
to what we did in multiclass classification.
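
A sketch of this in Python (sigmoid probabilities per node, log loss summed
over nodes; the names and numbers are illustrative):

import numpy as np

def multilabel_log_loss(y_true, y_pred):
    # Per-node binary cross entropy, summed across all output nodes.
    # y_pred holds independent sigmoid probabilities; y_true holds 0s and 1s.
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])  # e.g. two ingredients present, one absent
y_pred = np.array([0.9, 0.2, 0.7])
print(multilabel_log_loss(y_true, y_pred))  # ~0.69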

Now that we have covered classification, let’s now move on to regression.

Regression Loss
In regression, our model is trying to predict a continuous value. Some
examples of regression models are:

House price prediction

Person age prediction

In regression models, our neural network will have one output node for
every continuous value we are trying to predict. Regression losses are
calculated by performing direct comparisons between the output value and
the true value.

The most popular loss function we use for regression models is the mean
squared error loss function. Here we simply calculate the square of the
difference between Y and Y_pred and average this over all the data.
Suppose there are n data points:

Mean Squared Error Loss Function
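
Its standard form is:

MSE = (1/n) * Σ (Y_i - Y_pred_i)^2, summed over i = 1 to n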

Here Y_i and Y_pred_i refer to the i-th Y value in the dataset and the
corresponding Y_pred from the neural network for the same data point.
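
A minimal sketch in Python (NumPy; the data values are illustrative):

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences across all n data points.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 3.5])
print(mean_squared_error(y_true, y_pred))  # ~0.417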

That concludes this article. Hopefully now you have a deeper understanding
of how loss functions are configured for various tasks in deep learning.
Thank you for reading!