In this tutorial, we’ll help you understand the logistic regression algorithm in machine learning.
Logistic Regression is a popular algorithm for supervised learning – classification problems. It’s relatively
simple and easy to interpret, which makes it one of the first predictive algorithms that a data scientist learns
and applies.
You'll learn how the model works, how to predict with the logistic model, and more.
Even if you've already learned logistic regression, this tutorial can serve as a helpful review.
Before beginning our logistic regression tutorial, if you are not familiar with ML algorithms, please take a look
at Machine Learning for Beginners: Overview of Algorithm Types.
Understanding linear regression is critical to studying logistic regression as well. Check out Linear
Regression in Machine Learning: Practical Python Tutorial.
It's also helpful to understand basic statistics such as probability theory, but we'll explain the key concepts with references and examples along the way.
Within classification problems, we have a labeled training dataset consisting of input variables (X) and a
categorical output variable (y). The logistic regression algorithm helps us to find the best fit logistic function
to describe the relationship between X and y.
For the classic logistic regression, y is a binary variable with two possible values, such as win/loss,
good/bad. Since y is binary, we often label classes as either 1 or 0, with 1 being the desired class of
prediction.
When a new observation comes in, we can use its input variables and the fitted logistic relationship to predict the probability that it belongs to class y = 1. The formula for this probability given the input variables X is written below; let's denote it as p for simplicity.
P( y = 1 | X ) = p
This probability, ranging from 0 to 1, can be used as a criterion to classify the new observation. The higher
the value of p, the more likely the new observation belongs to class y = 1, instead of y = 0.
For example, we can choose a cutoff threshold of 0.5. When p > 0.5, the new observation will be classified
as y = 1 , otherwise as y = 0.
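As a quick illustration, here's how that cutoff rule might look in Python (the probability values are made up for the example):

```python
def classify(p, threshold=0.5):
    """Return class 1 if the predicted probability p exceeds the threshold, else 0."""
    return 1 if p > threshold else 0

# Hypothetical predicted probabilities for two new observations
print(classify(0.73))  # 0.73 > 0.5, so class 1
print(classify(0.21))  # 0.21 <= 0.5, so class 0
```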
Note that logistic regression generally means binary logistic regression, with a binary target. The method can be extended to model outputs with multiple classes, such as win/tie/loss or dog/cat/fox/rabbit, but in this tutorial we only cover binary logistic regression.
Some common applications for logistic regression include:
fraud detection.
customer churn prediction.
cancer diagnosis.
Great, now we are ready to dive into the details of logistic regression.
Follow along, and you'll see how each piece of logistic regression works!
Odds are the ratio of the probability of something happening to the probability of it not happening. It's a metric representing the likelihood of an event occurring.
For example, the odds of the observation belonging to class y = 1 is p/(1-p). When the odds are between 0
and 1, the odds are against the observation belonging to y = 1. When the odds are greater than 1, the odds
are for the observation belonging to y = 1.
Or, it might be easier to think of odds in terms of gambling, where we bet money on an event occurring.
For example, let’s bet that a six will come up for a toss of a fair six-sided die. The probability of it happening
is 1/6. So the odds in favor of us winning are (1/6) / (5/6) = 1/5 or 1:5. Or the odds of us losing are (5/6) /
(1/6) = 5:1. The odds are clearly against us winning.
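The die example can be checked in a few lines, using Python's fractions module for exact arithmetic:

```python
from fractions import Fraction

# Betting that a fair six-sided die shows a six
p_win = Fraction(1, 6)

odds_for = p_win / (1 - p_win)      # odds in favor of winning: 1/5, i.e. 1:5
odds_against = (1 - p_win) / p_win  # odds of losing: 5, i.e. 5:1
print(odds_for, odds_against)
```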
The log odds, log(p/(1-p)), is simply the logarithm of the odds, with the natural logarithm being the most commonly used. The log(odds) is also called the logit function.
Why Log Odds?
Because we want to adapt the well-studied linear regression algorithm to classification problems. As mentioned, classification tasks have an output probability p ranging from 0 to 1. If we can transform p to a range from -infinity to +infinity, we can apply linear regression to the transformed target.
Since p ranges from 0 to 1, the odds p/(1-p) range from 0 to +infinity. After the log transformation, log(odds)
ranges from -infinity to +infinity!
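A small sketch of the logit transformation, showing how probabilities map onto the whole real line:

```python
import math

def logit(p):
    """Log odds: maps a probability in (0, 1) onto (-inf, +inf)."""
    return math.log(p / (1 - p))

# Probabilities below 0.5 give negative log odds, above 0.5 positive,
# and exactly 0.5 maps to 0 -- so the full real line is covered.
print(logit(0.001), logit(0.5), logit(0.999))
```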
If you understand linear regression, the logistic regression equation should look very familiar:
log(p/(1-p)) = w0 + w1*x1 + … + wm*xm
The standard logistic function is simply the inverse of the logit equation above. Solving the logit equation for p, the formula of the logistic function is:
p = 1/(1 + e^(-(w0 + w1*x1 + … + wm*xm)))
The logistic, or sigmoid, function has an S-shaped (sigmoid) curve, with output values ranging between 0 and 1.
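As a quick check, here is a minimal sketch confirming that the sigmoid is the inverse of the logit and always outputs values strictly between 0 and 1:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: S-shaped, output strictly in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Log odds of probability p."""
    return math.log(p / (1 - p))

print(sigmoid(0))           # 0.5, the center of the S curve
print(sigmoid(logit(0.8)))  # recovers 0.8: sigmoid is the inverse of logit
```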
Great!
Now that we know the logistic regression formula we are trying to fit, let's see how to find the best fit equation.
Can we use the same estimation method, (Ordinary) Least Squares (OLS), as linear regression?
OLS minimizes the sum of squared residuals, which involves the difference between the predicted output and the actual output. But the actual output in the logistic linear equation is log(p/(1-p)), and we can't calculate its value since we don't know p; the only output we observe is the class, either y = 0 or y = 1. So we have to use another estimation method.
The standard way to determine the best fit for logistic regression is maximum likelihood estimation (MLE).
In this estimation method, we use a likelihood function that measures how well a set of parameters fit a
sample of data. The parameter values that maximize the likelihood function are the maximum likelihood
estimates. In other words, the goal is to make inferences about the population that is most likely to have
generated the training dataset.
In the logistic regression case, we want to find the estimates for the parameters w0, …, wm.
With one input variable x1, the logistic regression formula becomes:
log(p/(1-p)) = w0 + w1*x1
or
p = 1/(1 + e^(-(w0 + w1*x1)))
Since y is binary, with values 0 or 1, a Bernoulli random variable can be used to model its probability distribution:
P(y = 1) = p
P(y = 0) = 1 - p
Or, combined into one formula:
P(y) = (p^y)*(1-p)^(1-y), with y being either 0 or 1
This distribution formula is only for a single observation. How do we model the distribution of multiple
observations like P(y0, y1, y2, y3, y4)?
Let’s assume these observations are mutually independent from each other. Then we can write the joint
distribution of the training dataset as:
P(y0, y1, y2, y3, y4) = P(y0) * P(y1) * P(y2) * P(y3) * P(y4)
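As a sketch, we can compute such a joint probability for five independent observations; the per-observation probabilities here are made up for illustration:

```python
import math

# Joint probability of 5 independent observations:
# P(y0, ..., y4) = product of the individual Bernoulli probabilities.
ps = [0.1, 0.2, 0.4, 0.6, 0.8]  # assumed P(yi = 1) for each observation
ys = [0, 0, 0, 1, 1]            # observed classes

likelihood = math.prod(p**y * (1 - p)**(1 - y) for p, y in zip(ps, ys))
print(likelihood)
```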
To make it more specific, each observed y has a different probability of being 1. Let's assume P(yi = 1) = pi for i = 0, 1, 2, 3, 4. Then we can rewrite the joint distribution as below:
P(y0, y1, y2, y3, y4) = (p0^y0)*(1-p0)^(1-y0) * (p1^y1)*(1-p1)^(1-y1) * … * (p4^y4)*(1-p4)^(1-y4)
We can calculate the p estimate for each observation from the logistic function formula, plugging in its x1 value. For example, for the last two observations:
p3 = 1/(1 + e^(-(w0 + w1*0.75)))
p4 = 1/(1 + e^(-(w0 + w1*1.2)))
The observed classes in the training dataset are:
y0 = 0
y1 = 0
y2 = 0
y3 = 1
y4 = 1
So we have the p0 through p4 expressions and the y0 through y4 values from the training dataset. The likelihood becomes a function of the parameters w0 and w1:
L(w0, w1) = (p0^y0)*(1-p0)^(1-y0) * (p1^y1)*(1-p1)^(1-y1) * … * (p4^y4)*(1-p4)^(1-y4)
The goal is to choose the values of w0 and w1 that result in the maximum likelihood based on the training
dataset.
Note that it’s computationally more convenient to optimize the log-likelihood function. Since the natural
logarithm is a strictly increasing function, the same w0 and w1 values that maximize L would also maximize l
= log(L).
In machine learning, though, we prefer the idea of minimizing cost/loss functions, so we often define the cost function as the negative of the average log-likelihood:
cost function = -avg(l(w0, w1)) = -1/5 * l(w0, w1) = -1/5 * (y0*log(p0) + (1-y0)*log(1-p0) + y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-p4))
This is also called the average cross-entropy loss. Take a look at the general definition of cross entropy for more details.
Maximizing the (log) likelihood is the same as minimizing the cross entropy loss function.
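As an illustration with made-up probabilities and the observed labels, the average cross-entropy cost can be computed as:

```python
import math

# Average cross-entropy loss = negative average log-likelihood,
# for hypothetical predicted probabilities ps and observed labels ys.
ps = [0.1, 0.2, 0.4, 0.6, 0.8]
ys = [0, 0, 0, 1, 1]

log_likelihood = sum(
    y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(ps, ys)
)
cost = -log_likelihood / len(ys)
print(cost)  # a lower cost corresponds to a higher likelihood
```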
Optimization Methods
Unlike OLS estimation for the linear regression, we don’t have a closed-form solution for the MLE. But we do
know that the cost function is convex, which means a local minimum is also the global minimum.
To minimize this cost function, Python libraries such as scikit-learn (sklearn) use numerical methods similar
to Gradient Descent. And since sklearn uses gradients to minimize the cost function, it’s better to scale the
input variables and/or use regularization to make the algorithm more stable.
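To make the numerical idea concrete, here is a minimal batch gradient-descent sketch for a single input variable. The x1 values 0.75 and 1.2 come from the example above; the first three x1 values are assumed, and real solvers (such as sklearn's) are more sophisticated:

```python
import math

# Toy dataset: the x1 values for the first three observations are assumed
xs = [0.1, 0.4, 0.6, 0.75, 1.2]
ys = [0, 0, 0, 1, 1]

w0, w1 = 0.0, 0.0  # start from zero parameters
lr = 0.5           # learning rate
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w0 + w1 * x)))  # predicted probability
        g0 += p - y        # gradient of the cross-entropy loss w.r.t. w0
        g1 += (p - y) * x  # gradient w.r.t. w1
    w0 -= lr * g0 / len(xs)
    w1 -= lr * g1 / len(xs)

print(w0, w1)  # fitted intercept and slope
```

Because the cost function is convex, this simple descent converges toward the global minimum for this toy dataset.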
Model Interpretations
In this logistic regression tutorial, we are focusing on concepts rather than a full coding walkthrough. But by fitting the Logistic Regression algorithm from Python's sklearn to our example dataset, we find the best estimates to be w0 = -4.411 and w1 = 4.759.
We can plot the logistic regression with the sample dataset. As you can see, the output y only has two
values of 0 and 1, while the logistic function has an S shape.
Since w1 = 4.759, with a one-unit increase of x1, the log odds is expected to increase by 4.759 as well.
Given a new observation, how would we predict which class y = 0 or 1 it belongs to?
For example, say a new observation has input variable x1 = 0.9. Using the logistic regression equation estimated from MLE, we can calculate the probability p of it belonging to class y = 1.
p = 1/(1 + e^(-(-4.411 + 4.759*0.9))) = 46.8%
If we use 50% as the threshold, we would predict that this observation is in class 0, since p < 50%.
Since the logistic regression has an S shape, the larger x1, the more likely the observation has class y = 1.
What’s the threshold of x1 for us to classify the observation as y = 1?
At the threshold probability p = 50%, the odds are p/(1-p) = 50%/50% = 1, so log(odds) = log(1) = 0. Setting w0 + w1*x1 = 0 and solving for x1, we get x1 = 4.411/4.759 = 0.927. That's the threshold of x1 for prediction: when x1 > 0.927, the observation will be classified as y = 1.
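These numbers are easy to verify with a short snippet, using the estimates reported earlier:

```python
import math

w0, w1 = -4.411, 4.759  # estimates from the example dataset

def predict_proba(x1):
    """Probability that an observation with this x1 belongs to class y = 1."""
    return 1 / (1 + math.exp(-(w0 + w1 * x1)))

print(round(predict_proba(0.9), 3))  # ~0.468 < 0.5, so predict class 0
print(round(-w0 / w1, 3))            # decision boundary: x1 ~ 0.927
```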
Model Assumptions
Binary output/target: as mentioned at the beginning, logistic regression is for classification problems. We need to make sure the target is binary and transform it to values of 0 or 1.
Linear relationship: the logistic algorithm makes use of the linear equation, so the same linear-relationship assumption applies here, between the log odds and the input variables.
Independent inputs: highly correlated input variables (multicollinearity) can prevent the model from converging.
In this tutorial, you’ve learned a lot about logistic regression, a critical machine learning classification
algorithm.
We may cover an application of logistic regression in Python in another tutorial. Stay tuned!
Leave a comment if you have any questions or other feedback.