You are on page 1of 18

Logistic Regression

Classification techniques are an essential part of machine learning and data mining applications.
Approximately 70% of problems in Data Science are classification problems. There are lots of
classification problems that are available, but the logistics regression is common and is a useful
regression method for solving the binary classification problem. Another category of classification
is Multinomial classification, which handles the issues where multiple classes are present in the
target variable.

Logistic Regression can be used for various classification problems such as

1. Spam detection
2. Diabetes prediction,
3. if a given customer will purchase a particular product or will they churn another
competitor,
4. whether the user will click on a given advertisement link or not,

and many more examples are in the bucket.

Logistic Regression is one of the most simple and commonly used Machine Learning algorithms
for two-class classification. It is easy to implement and can be used as the baseline for any
binary classification problem. Its basic fundamental concepts are also constructive in deep
learning. Logistic regression describes and estimates the relationship between one dependent
binary variable and independent variables.

Definition of Logistic Regression

Logistic regression is a statistical method for predicting binary classes. The outcome or target

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
variable has only two possible classes. For example, it can be used for cancer detection
problems. It computes the probability of an event occurrence.

It is a special case of linear regression where the target variable is categorical in nature. It uses a
log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence
of a binary event utilizing a logit function.

Linear Regression Equation:

y = m1X1 + m2X2 + m3X3 + ......... mnXn

Where, y is dependent variable and X1, X2 ... and Xn are explanatory variables.

Sigmoid Function:
p = 1/[1+(e**(-y))]

Apply Sigmoid function on linear regression:

p = 1/[1+(e**(-(m1X1 + m2X2 + m3X3 + ......... mnXn))]

In [1]: import pandas as pd


import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
plt.figure(figsize=(9,9))
def sigmoid(t): # Define the sigmoid function
return (1/(1 + np.e**(-t)))
plot_range = np.arange(-6, 6, 0.1)
y_values = sigmoid(plot_range)
# Plot curve

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
plt.plot(plot_range, # X-axis range
y_values, # Predicted values
color="red")

Out[1]: [<matplotlib.lines.Line2D at 0x18bcb22da90>]

The sigmoid function, also called logistic function gives an ‘S’ shaped curve that can take any

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity,
y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0. If
the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO. The outputcannotFor example: If the
output is 0.75, we can say in terms of probability as: There is a 75 percent chance that patient
will suffer from cancer.

The most important thing to know is that logistic regression outputs probabilities that we can use
to classify observations.

Linear Regression Vs. Logistic Regression

1. Linear regression gives you a continuous output, but logistic regression provides a
constant output.
2. An example of the continuous output is house price and stock price. Example's of the
discrete output is predicting whether a patient has cancer or not OR predicting whether
the customer will churn.
3. Linear regression is estimated using Ordinary Least Squares (OLS) while logistic
regression is estimated using Maximum Likelihood Estimation (MLE) approach

Maximum Likelihood Estimation Vs. Least Square


Method

The MLE is a "likelihood" maximization method, while OLS is a distance-minimizing


approximation method. Maximizing the likelihood function determines the parameters that are
most likely to produce the observed data. From a statistical point of view, MLE sets the mean
and variance as parameters in determining the specific parametric values for a given model

Ordinary Least squares estimates are computed by fitting a regression line on given data points
that has the minimum sum of the squared deviations (least square error). Both are used to
estimate the parameters of a linear regression model.

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Types of Logistic Regression

1. Binary Logistic Regression: The target variable has only two possible outcomes such as
Spam or Not Spam, Cancer or No Cancer.

1. Multinomial Logistic Regression: The target variable has three or more nominal
categories such as predicting the type of Wine.

1. Ordinal Logistic Regression: The target variable has three or more ordinal categories
such as restaurant or product rating from 1 to 5.

Model building in Scikit-learn

Let's build the diabetes prediction model.

Let's first load the required Pima Indian Diabetes dataset using the pandas' read CSV function

In [8]: #import pandas


import pandas as pd
col_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness'
, 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

# load dataset
diab = pd.read_csv('diabetes.csv', names=col_names, header=0)

In [9]: diab.head()

Out[9]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeF

0 6 148 72 35 0 33.6 0.627

1 1 85 66 29 0 26.6 0.351

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeF

2 8 183 64 0 0 23.3 0.672

3 1 89 66 23 94 28.1 0.167

4 0 137 40 35 168 43.1 2.288

In [10]: diab.dtypes

Out[10]: Pregnancies int64


Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object

Selecting Feature
Here, you need to divide the given columns into two types of variables dependent(or target
variable) and independent variable(or feature variables).

In [13]: #split dataset in features and target variable


feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']

X = diab[feature_cols] # Features (Independent Variable)

y = diab.Outcome # Target variable

Splitting Data

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
To understand model performance, dividing the dataset into a training set and a test set is a good
strategy.

Let's split dataset by using function train_test_split(). You need to pass 3 parameters features,
target, and test_set size. Additionally, you can use random_state to select records randomly.

In [14]: # split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,ran
dom_state=0)

Here, the Dataset is broken into two parts in a ratio of 75:25. It means 75% data will be used for
model training and 25% for model testing.

Model Development and Prediction

First, import the Logistic Regression module and create a Logistic Regression classifier object
using LogisticRegression() function.

Then, fit your model on the train set using fit() and perform prediction on the test set using
predict().

In [15]: # import the class


from sklearn.linear_model import LogisticRegression

In [16]: # instantiate the model (using the default parameters)


logreg = LogisticRegression()

In [17]: # fit the model with data


logreg.fit(X_train,y_train)

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Out[17]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=
True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=
1,
penalty='l2', random_state=None, solver='liblinear', tol=0.00
01,
verbose=0, warm_start=False)

In [18]: #
y_pred=logreg.predict(X_test)

In [19]: len(y_pred)

Out[19]: 192

Model Evaluation using Confusion Matrix

A confusion matrix is a table that is used to evaluate the performance of a classification model.
You can also visualize the performance of an algorithm. The fundamental of a confusion matrix is
the number of correct and incorrect predictions are summed up class-wise.

In [20]: # import the metrics class


from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

Out[20]: array([[119, 11],


[ 26, 36]], dtype=int64)

Here, you can see the confusion matrix in the form of the array object. The dimension of this
matrix is 2*2 because this model is binary classification. You have two classes 0 and 1. Diagonal
values represent accurate predictions, while non-diagonal elements are inaccurate predictions.

In the output, 119 and 36 are accurate predictions, and 26 and 11 are incorrect predictions.

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Visualizing Confusion Matrix using Heatmap

Let's visualize the results of the model in the form of a confusion matrix using matplotlib and
seaborn.

Here, you will visualize the confusion matrix using Heatmap.

In [21]: # import required modules


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [22]: class_names=[0,1] # name of classes


fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt=
'g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Out[22]: Text(0.5,257.44,'Predicted label')

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Confusion Matrix Evaluation Metrics

Let's evaluate the model using model evaluation metrics such as accuracy, precision, and recall.

In [23]: metrics.accuracy_score(y_test, y_pred)

Out[23]: 0.8072916666666666

In [24]: from sklearn.metrics import classification_report


print(classification_report(y_test,y_pred))

precision recall f1-score support

0 0.82 0.92 0.87 130


1 0.77 0.58 0.66 62

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
avg / total 0.80 0.81 0.80 192

Well, you got a classification rate of 80%, considered as good accuracy.

Precision Score

“When it predicts the positive result, how often is it correct?”


TP – True Positives

FP – False Positives

Precision – Accuracy of positive predictions.

Precision = TP/(TP + FP)


Precision: Precision is about being precise, i.e., how accurate your model is. In other words, you
can say, when a model makes a prediction, how often it is correct. In your prediction case, when
your Logistic Regression model predicted patients are going to suffer from diabetes, that patients
have 76% of the time.

Recall Score

“When it is actually the positive result, how often does it predict


correctly?”
FN – False Negatives

Recall (aka sensitivity or true positive rate): Fraction of positives That were correctly identified.

Recall = TP/(TP+FN)

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Recall: If there are patients who have diabetes in the test set and your Logistic Regression
model can identify it 58% of the time.

F1 Score
F1 Score (aka F-Score or F-Measure) – A helpful metric for comparing two classifiers. F1 Score
takes into account precision and the recall. It is created by finding the the harmonic mean of
precision and recall.

F1 = 2 x (precision x recall)/(precision + recall)

ROC CURVE

A more visual way to measure the performance of a binary classifier is the receiver operating
characteristic (ROC) curve. It is created by plotting the true positive rate (TPR) (or recall) against
the false positive rate (FPR), which we haven’t defined explicitly yet:

In [25]: from sklearn.metrics import roc_auc_score, roc_curve, auc


from sklearn.metrics import precision_recall_curve, average_precision_s
core

In [26]: y_pred_proba = logreg.predict_proba(X_test)[::,1]

In [27]: y_pred_proba

Out[27]: array([0.88459582, 0.20986589, 0.15635047, 0.59809835, 0.1819723 ,


0.07669943, 0.68248388, 0.74300535, 0.45160224, 0.36999983,
0.54747218, 0.89912238, 0.3127755 , 0.25260265, 0.17094379,
0.2011401 , 0.78633552, 0.06930978, 0.36315988, 0.31124388,
0.55119242, 0.39579856, 0.34043061, 0.09150209, 0.10820372,
0.3624859 , 0.0906731 , 0.83068212, 0.16823406, 0.21233896,
0.4598109 , 0.26575779, 0.13757528, 0.46147237, 0.17515309,
0.65413557, 0.45357887, 0.13171106, 0.36104806, 0.67555452,

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
0.31877839, 0.24869867, 0.23348612, 0.73208462, 0.71025713,
0.02982383, 0.14406944, 0.26678717, 0.36912682, 0.32653216,
0.44657314, 0.25476223, 0.817448 , 0.42571693, 0.18927724,
0.01264126, 0.10312738, 0.45791007, 0.32138061, 0.22763647,
0.61345042, 0.48605364, 0.17008395, 0.72728988, 0.62693749,
0.84366782, 0.63849841, 0.19445794, 0.41795759, 0.15724276,
0.18907796, 0.53389831, 0.13514101, 0.87540158, 0.76290278,
0.33379141, 0.13972236, 0.59125757, 0.10854175, 0.20593514,
0.34837321, 0.38289872, 0.24982253, 0.07716112, 0.2445919 ,
0.24214284, 0.36014128, 0.43174265, 0.79190052, 0.22365636,
0.19701887, 0.22085911, 0.31272608, 0.08446738, 0.53039015,
0.2481466 , 0.40826489, 0.47257778, 0.53278713, 0.28229219,
0.28015691, 0.1707862 , 0.23832661, 0.08279293, 0.54927883,
0.39299867, 0.1787912 , 0.30319795, 0.08668146, 0.71932275,
0.16511955, 0.34024947, 0.5848725 , 0.33979684, 0.54753691,
0.55575702, 0.20536801, 0.68627949, 0.13716406, 0.66253399,
0.33868304, 0.27879737, 0.31788304, 0.44165912, 0.26128697,
0.09934927, 0.34078975, 0.35316982, 0.46098238, 0.37910186,
0.40787941, 0.14601379, 0.10663505, 0.72197467, 0.33678755,
0.37987528, 0.1999669 , 0.38170901, 0.70257082, 0.27320585,
0.14906075, 0.50805189, 0.13227697, 0.12578012, 0.37712546,
0.14894167, 0.1300954 , 0.14165793, 0.18261958, 0.2262175 ,
0.14340148, 0.57446483, 0.16620863, 0.21960711, 0.67390156,
0.18778124, 0.6619777 , 0.21289901, 0.62697869, 0.93770321,
0.63264962, 0.69136452, 0.06134282, 0.31691602, 0.79648062,
0.44420579, 0.28581201, 0.19030059, 0.31561807, 0.13289843,
0.21073731, 0.24787568, 0.24689633, 0.26406978, 0.45514708,
0.18129071, 0.38958252, 0.13822798, 0.17549839, 0.11509271,
0.20239073, 0.7167396 , 0.19107857, 0.8723455 , 0.49741011,
0.06390097, 0.69149168, 0.30815069, 0.44162597, 0.21045327,
0.23071042, 0.13339437])

In [29]: fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)

In [30]: fpr, tpr, _

Out[30]: (array([0. , 0. , 0.00769231, 0.00769231, 0.01538462,


0.01538462, 0.02307692, 0.02307692, 0.03076923, 0.03076923,
0.03846154, 0.03846154, 0.04615385, 0.04615385, 0.05384615,

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
0.05384615, 0.06153846, 0.06153846, 0.06923077, 0.06923077,
0.08461538, 0.08461538, 0.12307692, 0.12307692, 0.13076923,
0.13076923, 0.14615385, 0.14615385, 0.15384615, 0.15384615,
0.17692308, 0.17692308, 0.18461538, 0.18461538, 0.2 ,
0.2 , 0.20769231, 0.20769231, 0.21538462, 0.21538462,
0.23846154, 0.23846154, 0.26153846, 0.26153846, 0.28461538,
0.28461538, 0.33846154, 0.33846154, 0.35384615, 0.35384615,
0.37692308, 0.37692308, 0.38461538, 0.38461538, 0.43846154,
0.43846154, 0.47692308, 0.47692308, 0.55384615, 0.55384615,
1. ]),
array([0.01612903, 0.0483871 , 0.0483871 , 0.08064516, 0.08064516,
0.09677419, 0.09677419, 0.30645161, 0.30645161, 0.32258065,
0.32258065, 0.35483871, 0.35483871, 0.46774194, 0.46774194,
0.53225806, 0.53225806, 0.56451613, 0.56451613, 0.58064516,
0.58064516, 0.59677419, 0.59677419, 0.61290323, 0.61290323,
0.62903226, 0.62903226, 0.64516129, 0.64516129, 0.66129032,
0.66129032, 0.67741935, 0.67741935, 0.70967742, 0.70967742,
0.74193548, 0.74193548, 0.75806452, 0.75806452, 0.77419355,
0.77419355, 0.80645161, 0.80645161, 0.83870968, 0.83870968,
0.87096774, 0.87096774, 0.88709677, 0.88709677, 0.90322581,
0.90322581, 0.91935484, 0.91935484, 0.93548387, 0.93548387,
0.9516129 , 0.9516129 , 0.98387097, 0.98387097, 1. ,
1. ]),
array([0.93770321, 0.88459582, 0.87540158, 0.84366782, 0.83068212,
0.817448 , 0.79648062, 0.69136452, 0.68627949, 0.68248388,
0.67555452, 0.66253399, 0.6619777 , 0.59809835, 0.59125757,
0.55119242, 0.54927883, 0.54747218, 0.53389831, 0.53278713,
0.50805189, 0.49741011, 0.4598109 , 0.45791007, 0.45514708,
0.45357887, 0.44657314, 0.44420579, 0.44165912, 0.44162597,
0.41795759, 0.40826489, 0.40787941, 0.39299867, 0.38289872,
0.37987528, 0.37910186, 0.37712546, 0.36999983, 0.36912682,
0.36104806, 0.35316982, 0.34043061, 0.33979684, 0.33379141,
0.32138061, 0.31124388, 0.30815069, 0.28581201, 0.28229219,
0.27320585, 0.26678717, 0.26575779, 0.26406978, 0.24787568,
0.24689633, 0.23071042, 0.2262175 , 0.20536801, 0.20239073,
0.01264126]))

In [31]: y_pred

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Out[31]: array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1,
1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1,
1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0,
1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int64)

In [32]: y_test

Out[32]: 661 1
122 0
113 0
14 1
529 0
103 0
338 1
588 1
395 0
204 0
31 1
546 1
278 0
593 0
737 0
202 0
175 1
55 0
479 0

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
365 0
417 1
577 1
172 0
352 0
27 0
605 0
239 0
744 0
79 0
496 0
..
97 0
530 0
327 0
619 1
518 0
632 0
524 0
536 0
597 0
462 0
17 1
739 1
263 0
241 0
344 0
302 0
704 0
240 0
170 1
691 1
490 0
45 1
750 1
62 0
78 1
366 1
301 1

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
382 0
140 0
463 0
Name: Outcome, Length: 192, dtype: int64

In [35]: auc=roc_auc_score(y_test,y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

AUC score for the case is 0.86. AUC score 1 represents perfect classifier, and 0.5 represents a
worthless classifier.

Advantages of Logistic Regression

Because of its efficient and straightforward nature, doesn't require high computation power, easy
to implement, easily interpretable, used widely by data analyst and scientist. Also, it doesn't
require scaling of features. Logistic regression provides a probability score for observations.

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Disadvantages of Logistic Regression

Logistic regression is not able to handle a large number of categorical features/variables. It is


vulnerable to overfitting. Also, can't solve the non-linear problem with the logistic regression that
is why it requires a transformation of non-linear features. Logistic regression will not perform well
with independent variables that are not correlated to the target variable and are very similar or
correlated to each other.

Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD

You might also like