You are on page 1of 13

DATA SCIENCE : PYTHON

LOGISTIC REGRESSION
January 2020
Definition

• Logistic Regression is a classification algorithm.


• Logistic regression, also called a logit model, is used to predict a binary outcome (1 / 0, Yes
/ No, True / False) given a set of independent variables.
• In the logit model the log odds of the outcome is modeled as a linear combination of the
predictor variables.
• Think of logistic regression as a special case of linear regression when the outcome variable is
categorical, where we are using log of odds as dependent variable.
Examples
• Spam Detection : Predicting if an email is Spam or not
• Credit Card Fraud : Predicting if a given credit card transaction is fraud or not
• Health : Predicting if a given mass of tissue is benign or malignant
• Marketing : Predicting if a given user will buy an insurance product or not
• Banking : Predicting if a customer will default on a loan
Linear vs Logistic

• In many situations, the response variable is qualitative or, in other words, categorical. For
example, gender is qualitative, taking on values male or female.
• Predicting a qualitative response for an observation can be referred to as classifying that
observation, since it involves assigning the observation to a category, or class. On the other
hand, the methods that are often used for classification first predict the probability of each of
the categories of a qualitative variable, as the basis for making the classification.
• Linear regression is not capable of predicting probability. If you use linear regression to model
a binary response variable, for example, the resulting model may not restrict the predicted Y
values within 0 and 1. Here's where logistic regression comes into play, where you get a
probability score that reflects the probability of the occurrence at the event.
Logistic Regression : statistical model equation

• Logistic regression is an instance of classification technique that can be used to predict a


qualitative response. More specifically, logistic regression models the probability
that gender belongs to a particular category.
• For example, for gender classification, where the response gender falls into one of the two
categories, male or female, use logistic regression models to estimate the probability
that gender belongs to a particular category.

The probability of gender given longhair can be written as:


P(gender=female|longhair)

The values of P(gender=female|longhair) will range between 0 and 1.


Then, for any given value of longhair, a prediction can be made for gender.
Logistic Regression : statistical model equation

• Logistic regression is defined as the log of odds of the event(logit function), ln(P/1−P), where,
P is the probability of event.
• P always lies between 0 and 1.

Where,
• β: log-odds ratio associated with predictors
• e β: odds ratio
Logistic regression is based on Maximum Likelihood Estimation which says coefficients
should be chosen in such a way that it maximizes the Probability of Y given X (likelihood).
Assumptions of Logistic Regression

• First, binary logistic regression requires the dependent variable to be binary.


• Second, logistic regression requires the observations to be independent of each other, i.e., the
observations should not come from repeated measurements or matched data.
• Third, logistic regression requires there should be little or no multicollinearity among the
independent variables, i.e., the independent variables should not be too highly correlated with
each other.
• Fourth, logistic regression assumes linearity of independent variables and log odds.
Steps for Model Building

• Identification of the dependent variable


• Data Import
• Outlier Detection and Treatment
• Missing Value Detection and Imputation
• Dummy variable Creation
• Creation of train and test samples
• Checking Multicollinearity
• Model Development
– Checking variable significance & Removal of insignificant variables
• Validation of model parameters on Test data
Evaluate Logistic Regression Model Fit
and Accuracy
• Confusion Matrix
• ROC Curve
• AUC
Confusion matrix

Confusion Matrix
• A confusion matrix is formed from the four outcomes produced as a result of binary
classification.
Four outcomes of classification
• True positive (TP): correct positive prediction
• False positive (FP): incorrect positive prediction
• True negative (TN): correct negative prediction
• False negative (FN): incorrect negative prediction
Evaluate Logistic Regression Model Fit
and Accuracy
Accuracy
• Accuracy (ACC) is calculated as the number of all correct predictions divided by the total number of
the dataset. The best accuracy is 1.0, whereas the worst is 0.0. It can also be calculated by 1 – ERR.
Error rate
• Error rate (ERR) is calculated as the number of all incorrect predictions divided by the total number of
the dataset. The best error rate is 0.0, whereas the worst is 1.0.
• specificity, is more informative than accuracy and error rate.
Sensitivity (Recall or True positive rate)
• Sensitivity is calculated as the number of correct positive predictions (TP) divided by the total number
of positives (P).
• It is also called recall (REC) or true positive rate (TPR). The best sensitivity is 1.0, whereas the worst
is 0.0.
Specificity (True negative rate)
• Specificity is calculated as the number of correct negative predictions (TN) divided by the total
number of negatives (N).
• It is also called true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.
Precision (Positive predictive value)
• Precision is calculated as the number of correct positive predictions (TP) divided by the total number
of positive predictions (TP + FP).
• It is also called positive predictive value (PPV). The best precision is 1.0, whereas the worst is 0.0.
Evaluate Logistic Regression Model Fit
and Accuracy
False positive rate
• False positive rate (FPR) is calculated as the number of incorrect positive predictions divided
by the total number of negatives. The best false positive rate is 0.0 whereas the worst is 1.0. It
can also be calculated as 1 – specificity.
F-score
• F-score is a harmonic mean of precision and recall.

Note:
In general we are concerned with one of the above defined metric. For instance, in a
pharmaceutical company, they will be more concerned with minimal wrong positive diagnosis.
Hence, they will be more concerned about high Specificity. On the other hand an attrition model
will be more concerned with Senstivity. Confusion matrix are generally used only with class
output models.
Evaluate Logistic Regression Model Fit
and Accuracy
ROC CURVE
• The ROC curve is the plot between sensitivity and (1- specificity).
• (1- specificity) is also known as false positive rate and sensitivity is also known as True
Positive rate.

Area under the curve(AUC)


• Overall measure of model performance
• In classification,
• AUC = Concordance+0.5*Ties
• A random classifier has an area under the curve of 0.5, while AUC for a perfect classifier is equal
to 1. In practice, most of the classification models have an AUC between 0.5 and 1.
Questions

You might also like