
www.justintodata.com/logistic-regression-for-machine-learning-tutorial/

Logistic Regression for Machine Learning: Complete Tutorial

Understand this popular supervised classification algorithm step-by-step

5/6/2020

In this tutorial, we’ll help you understand the logistic regression algorithm in machine learning.

Logistic Regression is a popular algorithm for supervised learning – classification problems. It’s relatively
simple and easy to interpret, which makes it one of the first predictive algorithms that a data scientist learns
and applies.

Following this beginner-friendly tutorial, you’ll learn step-by-step:

What is logistic regression in machine learning (ML).


What odds and the logistic function are.
How to optimize with Maximum Likelihood Estimation and the cross entropy cost function.

How to predict with the logistic model.
And more.

Even if you’ve already learned logistic regression, this tutorial also serves as a helpful review.

Let’s get started!

Before beginning our logistic regression tutorial, if you are not familiar with ML algorithms, please take a look
at Machine Learning for Beginners: Overview of Algorithm Types.

Understanding linear regression is critical to studying logistic regression as well. Check out Linear
Regression in Machine Learning: Practical Python Tutorial.

It’s also helpful to understand basic statistics such as probability theory, but we’ll explain the key concepts with references or examples along the way.

What is Logistic Regression in Machine Learning?


Logistic Regression is a machine learning (ML) algorithm for supervised learning – classification analysis.

Within classification problems, we have a labeled training dataset consisting of input variables (X) and a
categorical output variable (y). The logistic regression algorithm helps us to find the best fit logistic function
to describe the relationship between X and y.

For the classic logistic regression, y is a binary variable with two possible values, such as win/loss,
good/bad. Since y is binary, we often label classes as either 1 or 0, with 1 being the desired class of
prediction.

When a new observation comes in, we can use its input variables and the logistic relationship to predict the probability that the new case belongs to class y = 1. The formula for this probability given the input variables X is written below. Let’s denote it as p for simplicity.

P( y = 1 | X ) = p

How does this probability link to a classification problem?

This probability, ranging from 0 to 1, can be used as a criterion to classify the new observation. The higher
the value of p, the more likely the new observation belongs to class y = 1, instead of y = 0.
For example, we can choose a cutoff threshold of 0.5. When p > 0.5, the new observation will be classified
as y = 1 , otherwise as y = 0.

Note that logistic regression generally means binary logistic regression with the binary target. This can be
extended to model outputs with multiple classes such as win/tie/loss, dog/cat/fox/rabbit. In this tutorial, we
only cover the binary logistic regression.

Some common applications for logistic regression include:

fraud detection.
customer churn prediction.
cancer diagnosis.

Great, now we are ready to dive into the details of logistic regression.

Follow along, and you’ll get pieces of how logistic regression works explained!

Let’s start with the basics.

Odds and Log Odds: the Prerequisites


What is the definition of odds in statistics?

Odds are the ratio of the probability of something happening to the probability of it not happening. It’s also a
metric representing the likelihood of the event occurring.

For example, the odds of the observation belonging to class y = 1 is p/(1-p). When the odds are between 0
and 1, the odds are against the observation belonging to y = 1. When the odds are greater than 1, the odds
are for the observation belonging to y = 1.

Or, it might be easier to think of odds in terms of gambling, where we bet money on an event occurring.
For example, let’s bet that a six will come up for a toss of a fair six-sided die. The probability of it happening
is 1/6. So the odds in favor of us winning are (1/6) / (5/6) = 1/5 or 1:5. Or the odds of us losing are (5/6) /
(1/6) = 5:1. The odds are clearly against us winning.
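As a quick numeric check of this example, here is a minimal Python sketch (the helper name odds is just for illustration):

```python
def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

p_win = 1 / 6            # probability of rolling a six with a fair die
print(odds(p_win))       # 0.2, i.e. odds of 1:5 in favor (or 5:1 against)
```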

The log odds, log(p/(1-p)), is simply the logarithm of the odds, with the natural logarithm being most
commonly used. The log(odds) is also called the logit function.

Why do we want to take log of odds?

Because we want to adapt the well-studied linear regression algorithm to classification problems. As
mentioned, classification tasks output a probability p ranging from 0 to 1. If we can transform p to a range from -infinity to +infinity, we can apply linear regression to the transformed value.

The log odds is the popular mapping function for this transformation.

Since p ranges from 0 to 1, the odds p/(1-p) range from 0 to +infinity. After the log transformation, log(odds)
ranges from -infinity to +infinity!
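A small sketch of this mapping, assuming NumPy is available; note how probabilities near 0 and 1 stretch out toward -infinity and +infinity:

```python
import numpy as np

p = np.array([0.001, 0.25, 0.5, 0.75, 0.999])
odds = p / (1 - p)             # ranges over (0, +infinity)
log_odds = np.log(odds)        # ranges over (-infinity, +infinity)
print(np.round(log_odds, 3))   # roughly [-6.907, -1.099, 0, 1.099, 6.907]
```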

Logistic Function: the Logistic Regression Model/Equation


Now with the transformation, we can model the log(odds) as a linear equation. Assuming we have multiple
explanatory variables x1, …, xm, and coefficients w0, …, wm, the relationship can be written as below:

logit(p) = log(odds) = log(p/(1-p)) = w0 + w1*x1 + w2*x2 + … + wm*xm

If you understand linear regression, the logistic regression equation should look very familiar.

Now, how is this linked to the “logistic” function?

The standard logistic function is simply the inverse of the logit equation above. If we solve for p from the
logit equation, the formula of the logistic function is below:

p = 1/(1 + e^(-(w0 + w1*x1 + w2*x2 + … + wm*xm)))


where e is the base of the natural logarithms

The logistic function is a type of sigmoid function.

sigmoid(h) = 1/(1 + e^(-h))


where h = w0 + w1*x1 + w2*x2 + … + wm*xm for logistic function.

The logistic or sigmoid function has an S-shaped curve (also called a sigmoid curve), with the y-axis ranging from 0 to 1,
as below.
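Here is a minimal sketch of the sigmoid curve, assuming NumPy and matplotlib are installed:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

h = np.linspace(-6, 6, 200)        # h = w0 + w1*x1 + ... + wm*xm
plt.plot(h, sigmoid(h))            # S-shaped curve, bounded between 0 and 1
plt.xlabel("h")
plt.ylabel("p = sigmoid(h)")
plt.show()
```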

Great!

Now that we know the logistic regression formula we are trying to solve, let’s see how to find the best fit equation.

Maximum Likelihood Estimation: the Best Model Fit


Like linear regression, the logistic regression algorithm finds the best values of coefficients (w0, w1, …, wm)
to fit the training dataset.

How do we find the best fit model?

Can we use the same estimation method, (Ordinary) Least Squares (OLS), as linear regression?

The answer is NO. We have to use a different method.

OLS minimizes the sum of squared residuals, which involves the difference between the predicted
output and the actual output. But the actual output in the logistic linear equation is log(p/(1-p)), and we can’t
calculate its value since we don’t know the value of p. The only output we observe is the class, either y = 0 or
y = 1. So we have to use another estimation method.

What is Maximum Likelihood Estimation?

The standard way to determine the best fit for logistic regression is maximum likelihood estimation (MLE).

In this estimation method, we use a likelihood function that measures how well a set of parameters fits a
sample of data. The parameter values that maximize the likelihood function are the maximum likelihood
estimates. In other words, the goal is to make inferences about the population that is most likely to have
generated the training dataset.

In the logistic regression case, we want to find the estimates for the parameters w0, …, wm.

Let’s see a simple example with the following dataset:

Observation #    Input x1    Binary Output y
0                0.5         0
1                1.0         0
2                0.65        0
3                0.75        1
4                1.2         1

With one input variable x1, the logistic regression formula becomes:

log(p/(1-p)) = w0 + w1*x1
or
p = 1/(1 + e^(-(w0 + w1*x1)))

Since y is binary with values 0 or 1, a Bernoulli random variable can be used to model its probability:

P(y=1) = p
P(y=0) = 1 – p

Or:

P(y) = (p^y)*(1-p)^(1-y)
with y being either 0 or 1
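A tiny sketch of this single-observation formula (the function name bernoulli_pmf is just for illustration):

```python
def bernoulli_pmf(y, p):
    """P(y) for a single binary observation: p if y = 1, (1 - p) if y = 0."""
    return (p ** y) * ((1 - p) ** (1 - y))

print(bernoulli_pmf(1, 0.7))   # 0.7
print(bernoulli_pmf(0, 0.7))   # 0.3 (up to floating point rounding)
```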

This distribution formula is only for a single observation. How do we model the distribution of multiple
observations like P(y0, y1, y2, y3, y4)?

Let’s assume these observations are mutually independent. Then we can write the joint
distribution of the training dataset as:

P(y0, y1, y2, y3, y4) = P(y0) * P(y1) * P(y2) * P(y3) * P(y4)

To make it more specific, each observed y has a different probability of being 1. Let’s assume P(yi = 1) = pi
for i = 0,1,2,3,4. Then we can rewrite the formula as below:

P(y0) * P(y1) * P(y2) * P(y3) * P(y4) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) * … * p4^(y4)*(1-p4)^(1-y4)

We can calculate the p estimate for each observation based on the logistic function formula:

p0 = 1/(1 + e^(-(w0 + w1*0.5)))


p1 = 1/(1 + e^(-(w0 + w1*1.0)))
p2 = 1/(1 + e^(-(w0 + w1*0.65)))

p3 = 1/(1 + e^(-(w0 + w1*0.75)))
p4 = 1/(1 + e^(-(w0 + w1*1.2)))

We also have the values of the output variable y:

y0 = 0
y1 = 0
y2 = 0
y3 = 1
y4 = 1

Log Likelihood Function in statistics

So we have the y0 – y4 values from the training dataset, and expressions for p0 – p4 in terms of the parameters. Our likelihood becomes a function
of the parameters w0 and w1:

L(w0, w1) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) * … * p4^(y4)*(1-p4)^(1-y4)

The goal is to choose the values of w0 and w1 that result in the maximum likelihood based on the training
dataset.

Note that it’s computationally more convenient to optimize the log-likelihood function. Since the natural
logarithm is a strictly increasing function, the same w0 and w1 values that maximize L would also maximize l
= log(L).

So in statistics, we often try to maximize the function below:

l(w0, w1) = log(L(w0, w1)) = y0*log(p0) + (1-y0)*log(1-p0) + y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-


p4)
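Putting the pieces together, here is a minimal self-contained sketch of this log-likelihood for the example dataset; it simply mirrors the formula above:

```python
import numpy as np

x1 = np.array([0.5, 1.0, 0.65, 0.75, 1.2])   # inputs from the example table
y = np.array([0, 0, 0, 1, 1])                # binary outputs

def log_likelihood(w0, w1):
    p = 1 / (1 + np.exp(-(w0 + w1 * x1)))    # p0 ... p4 from the logistic function
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Try a couple of candidate parameter pairs; a larger value (closer to 0) is a better fit.
print(log_likelihood(0.0, 0.0))        # about -3.466 (every p is 0.5)
print(log_likelihood(-4.411, 4.759))   # about -2.68, a much better fit
```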

Cost Function (Cross Entropy Loss) in machine learning

In machine learning, we prefer to minimize cost/loss functions, so we often define the cost
function as the negative of the average log-likelihood.

cost function = – avg(l(w0, w1)) = – 1/5 * l(w0, w1) = – 1/5 * (y0*log(p0) + (1-y0)*log(1-p0) + y1*log(p1) + (1-y1)*log(1-p1) + … + y4*log(p4) + (1-y4)*log(1-p4))

This is also called the average of the cross entropy loss. Take a look at cross entropy’s general definition for
more details.

Maximizing the (log) likelihood is the same as minimizing the cross entropy loss function.

Optimization Methods

Unlike OLS estimation for the linear regression, we don’t have a closed-form solution for the MLE. But we do
know that the cost function is convex, which means a local minimum is also the global minimum.

To minimize this cost function, Python libraries such as scikit-learn (sklearn) use numerical methods similar
to Gradient Descent. And since sklearn uses gradients to minimize the cost function, it’s better to scale the
input variables and/or use regularization to make the algorithm more stable.
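As a rough illustration only (sklearn's actual solvers, such as lbfgs, are more sophisticated), a plain gradient descent on this cost function might look like the sketch below; the learning rate and iteration count are arbitrary choices for this tiny dataset:

```python
import numpy as np

x1 = np.array([0.5, 1.0, 0.65, 0.75, 1.2])
y = np.array([0, 0, 0, 1, 1])

w0, w1 = 0.0, 0.0                      # arbitrary starting point
learning_rate = 0.5
for _ in range(200_000):
    p = 1 / (1 + np.exp(-(w0 + w1 * x1)))
    grad_w0 = np.mean(p - y)           # derivative of the average cross entropy w.r.t. w0
    grad_w1 = np.mean((p - y) * x1)    # derivative w.r.t. w1
    w0 -= learning_rate * grad_w0
    w1 -= learning_rate * grad_w1

print(w0, w1)                          # approaches roughly -4.41 and 4.76
```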

Model Interpretations

In this logistic regression tutorial, we won’t walk through a full coding project. But by using the LogisticRegression
algorithm from Python sklearn on our example dataset, we can find that the best estimates are w0 = -4.411 and w1 = 4.759.
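Here is a minimal sketch of that fit with sklearn. Regularization is on by default in LogisticRegression, so we turn it off to approximate the plain MLE solution (on sklearn versions older than 1.2, use penalty='none' instead of penalty=None):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([0.5, 1.0, 0.65, 0.75, 1.2]).reshape(-1, 1)   # one input column
y = np.array([0, 0, 0, 1, 1])

model = LogisticRegression(penalty=None, max_iter=1000)    # unregularized fit, default lbfgs solver
model.fit(X, y)
print(model.intercept_, model.coef_)                        # roughly [-4.411] and [[4.759]]
```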

We can plot the logistic regression with the sample dataset. As you can see, the output y only has two
values of 0 and 1, while the logistic function has an S shape.

We can also make some interpretations with the parameter w1.

Recall that we have:

log(odds of y=1) = log(p/(1-p)) = w0 + w1*x1


where p = P(y = 1)

Since w1 = 4.759, a one-unit increase in x1 is expected to increase the log odds by 4.759.
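A quick check of this interpretation with the fitted coefficients (values taken from the fit above):

```python
w0, w1 = -4.411, 4.759

def log_odds(x1):
    return w0 + w1 * x1

# A one-unit increase in x1 raises the log odds by exactly w1 = 4.759,
# which multiplies the odds themselves by e**4.759 (about 117).
print(log_odds(1.5) - log_odds(0.5))   # 4.759
```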

How to use Logistic Regression Models to Predict?


As mentioned earlier, we often use logistic regression models for predictions.

Given a new observation, how would we predict which class y = 0 or 1 it belongs to?

For example, say a new observation has input variable x1 = 0.9. By using the logistic regression equation
estimated from MLE, we can calculate the probability p of it belonging to class y = 1.

p = 1/(1 + e^(-(-4.411 + 4.759*0.9))) = 46.8%

If we use 50% as the threshold, we would predict that this observation is in class 0, since p < 50%.
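The same calculation in a short sketch:

```python
import math

w0, w1 = -4.411, 4.759              # estimates from the example above
x1_new = 0.9

p = 1 / (1 + math.exp(-(w0 + w1 * x1_new)))
print(round(p, 3))                  # 0.468
print(1 if p > 0.5 else 0)          # predicted class: 0
```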

Since the logistic function has an S shape, the larger x1 is, the more likely the observation belongs to class y = 1.
What’s the threshold of x1 for us to classify the observation as y = 1?

At the threshold of probability p=50%, the odds are p/(1-p) = 50%/50% = 1. So the log(odds) = log(1) = 0.

Since log(odds) fits the linear equation, we have:

log(odds) = 0 = -4.411 + 4.759*x1

Solving for x1, we get 0.927. That’s the threshold of x1 for prediction, i.e., when x1 > 0.927, the observation
will be classified as y = 1.
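Verifying this threshold numerically:

```python
w0, w1 = -4.411, 4.759
threshold_x1 = -w0 / w1             # x1 where log(odds) = 0, i.e. p = 0.5
print(round(threshold_x1, 3))       # 0.927
```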

More on Logistic Regression Implementation


Similar to linear regression, logistic regression does have certain assumptions. But most of them can be
relaxed with transformations, as long as the model provides reliable predictions. Some assumptions
are listed below:

Binary output/target: as mentioned at the beginning, logistic regression is for classification problems.
We need to make sure the target is binary and encode it as 0 or 1.
Linear relationship: the logistic algorithm makes use of the linear equation, so the same assumptions
apply here.
Independent inputs: highly correlated input variables (multicollinearity) can prevent the model from
converging.

You’ve made it!

In this tutorial, you’ve learned a lot about logistic regression, a critical machine learning classification
algorithm.

We may cover an application of logistic regression in Python in another tutorial. Stay tuned!

Leave a comment for any questions you may have or anything else.

Related Resources:

Python crash course: Break into Data Science – FREE

A FREE Python online course, beginner-friendly tutorial. Start your successful data science career journey:
learn Python for data science, machine learning.

How to GroupBy with Python Pandas Like a Boss

Read this pandas tutorial to learn Group by in pandas. It is an essential operation on datasets (DataFrame)
when doing data manipulation or analysis.

How to Learn Data Science Online: ALL You Need to Know

Check out this for a detailed review of resources online, including courses, books, free tutorials, portfolios
building, and more.
