
Cost function for logistic regression

Why doesn’t MSE work with logistic regression?
One of the main reasons MSE doesn’t work with logistic regression is that when the MSE loss is plotted with respect to the weights of the logistic regression model, the resulting curve is not convex, which makes it very difficult to find the global minimum.
This non-convexity arises because the sigmoid function introduces non-linearity into the model, making the relationship between the weight parameters and the error very complex.
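Written out with weights B and bias b (the same parameters used in the optimization section below; this is the standard MSE form), the cost over m training samples is:

J(B, b) = (1/m) · Σᵢ (ŷᵢ − yᵢ)², where ŷᵢ = σ(Bᵀxᵢ + b)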
Using the sigmoid prediction inside the MSE equation above gives a non-convex cost surface with many local minima.
Cost Function in Logistic Regression
• A linear equation (z) is given to a sigmoid activation function (σ) to predict the output (ŷ); see the equations after this list.
• To evaluate the performance of the model,
we calculate the loss.
• The most commonly used loss function is the
mean squared error.
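In standard notation, the two steps in the first bullet are:

z = Bᵀx + b,  ŷ = σ(z) = 1 / (1 + e^(−z))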
But in logistic regression, as the output is a probability value between 0 and 1, mean squared error wouldn’t be the right choice.
So, instead, we use the cross-entropy loss
function.
The cross-entropy loss function is used to
measure the performance of a classification
model whose output is a probability value.
Hence, we need a different cost function for our new model. Here log loss comes into the picture: a cost function for logistic regression that can also be derived from the maximum likelihood estimation method.
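For a single training sample with actual class y and predicted probability ŷ, the log loss can be written in its standard binary cross-entropy form as:

L(y, ŷ) = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]

The bullets below refer to the left term (y · log(ŷ)) and the right term ((1 − y) · log(1 − ŷ)) of this equation.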
• In the first case, when the actual class is 1, the left term of the equation is active and the right term vanishes; the cost stays low while the predicted probability is close to 1.
• You will notice that as the predicted probability moves towards 0, the cost increases sharply.
• Similarly, when the actual class is 0, the right term becomes active and the left term vanishes; the cost now increases sharply as the predicted probability moves towards 1.
• Both cases therefore increase the cost of wrong predictions. The two terms are added together into a single expression that covers both classes.
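A small numeric sketch of these two cases in plain Python (the probabilities 0.9 and 0.1 are made-up values for illustration):

import math

def log_loss(y, y_hat):
    # Per-sample log loss: -[y*log(y_hat) + (1-y)*log(1-y_hat)]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(log_loss(1, 0.9))  # actual class 1, prediction near 1 -> low cost (~0.105)
print(log_loss(1, 0.1))  # actual class 1, prediction near 0 -> high cost (~2.303)
print(log_loss(0, 0.1))  # actual class 0, prediction near 0 -> low cost (~0.105)
print(log_loss(0, 0.9))  # actual class 0, prediction near 1 -> high cost (~2.303)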
Optimize the model
• Once we have our model and the appropriate cost function handy, we can use the Gradient Descent algorithm to optimize our model parameters, just as we do in the case of linear regression.
Sigmoid function: σ(z) = 1 / (1 + e^(−z))
• We have the cost function J and the parameters B and b.
• We differentiate the cost function J with respect to the parameters B and b.
• Once we have the gradient values G_B and G_b, we can update our parameters: B := B − α·G_B and b := b − α·G_b, where α is the learning rate.
• The whole process repeats iteratively until we reach our best parameters.
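A minimal sketch of this loop, assuming NumPy and the averaged log loss as the cost J; the function name, learning rate alpha, and iteration count are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    # X: (m, n) feature matrix; y: (m,) labels in {0, 1}
    m, n = X.shape
    B = np.zeros(n)   # weight vector (the text's B)
    b = 0.0           # bias (the text's b)
    for _ in range(n_iters):
        y_hat = sigmoid(X @ B + b)      # predicted probabilities
        G_B = X.T @ (y_hat - y) / m     # gradient of J with respect to B
        G_b = np.sum(y_hat - y) / m     # gradient of J with respect to b
        B -= alpha * G_B                # gradient descent update
        b -= alpha * G_b
    return B, b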
Since probabilities lie between 0 and 1, their log values are negative. To deal with the negative sign, we take the negative average of these values, maintaining the common convention that lower loss scores are better.
Why “Log Loss Function”?
It is important to first understand the log function before jumping into log loss. If we plot y = log(x), the graph in quadrant IV looks like this:
y = log(x) graph
• We’re only concerned with the region 0–1 on the x-axis. In the above graph:

x = 1 → y = 0
x → 0 → y → −∞

Observe that as we move towards x = 0, the magnitude of y increases sharply, almost like an exponential curve.
This characteristic makes the log graph a good candidate for a loss function, because it satisfies a key requirement of a loss function: heavily penalizing predictions that are far from the desired value.
As we always prefer positive values, we plot the function with a slight modification (y = -log(x)) so that our region of interest moves into quadrant I.
y = -log(x) graph
Axes in the above graph are interpreted as:

X-axis: probability of the input sample being the true output value
Y-axis: penalty for the corresponding x-axis value
By default, the output of the logistic regression model is the probability of the input sample being positive (indicated by 1).
Now let’s see how the above log function works in the two use cases of logistic regression, i.e., when the actual output value is 1 and when it is 0.
If a logistic regression model is trained to classify mail as spam or not spam, where spam (= positive) is indicated as 1 and not-spam (= negative) as 0, then the output of the model, p, is the probability of the mail being spam (= positive). If we want the probability of the mail being not spam (= negative), it is represented as 1 − p.
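In this notation, the penalty from the y = -log(x) plot above applies to whichever probability corresponds to the actual class:

loss = -log(p)      if the mail is actually spam (y = 1)
loss = -log(1 - p)  if the mail is actually not spam (y = 0)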
