14th LESSON (ANKUR - PROSCHOOL) - LOGISTIC REGRESSION CASE S

Logistic Regression
Classification techniques are an essential part of machine learning and data mining applications.
Approximately 70% of problems in Data Science are classification problems. There are lots of
classification problems that are available, but the logistics regression is common and is a useful
regression method for solving the binary classification problem. Another category of classification
is Multinomial classification, which handles the issues where multiple classes are present in the
target variable.
Logistic Regression can be used for various classification problems such as
1. Spam detection
2. Diabetes prediction,
3. if a given customer will purchase a particular product or will they churn another
competitor,
4. whether the user will click on a given advertisement link or not,
and many more examples are in the bucket.
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms
for two-class classification. It is easy to implement and can be used as the baseline for any
binary classification problem. Its basic fundamental concepts are also constructive in deep
learning. Logistic regression describes and estimates the relationship between one dependent
binary variable and independent variables.
Definition of Logistic Regression
Logistic regression is a statistical method for predicting binary classes. The outcome or target
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
variable has only two possible classes. For example, it can be used for cancer detection
problems. It computes the probability of an event occurrence.
It is a special case of linear regression where the target variable is categorical in nature. It uses a
log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence
of a binary event utilizing a logit function.
Linear Regression Equation:
y = m1X1 + m2X2 + m3X3 + ......... mnXn
Where, y is dependent variable and X1, X2 ... and Xn are explanatory variables.
Sigmoid Function:
p = 1/[1+(e**(-y))]
Apply Sigmoid function on linear regression:
p = 1/[1+(e**(-(m1X1 + m2X2 + m3X3 + ......... mnXn))]
In [1]: import pandas as pd

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
plt.figure(figsize=(9,9))
def sigmoid(t): # Define the sigmoid function
return (1/(1 + np.e**(-t)))
plot_range = np.arange(-6, 6, 0.1)
y_values = sigmoid(plot_range)
# Plot curve
plt.plot(plot_range, # X-axis range
y_values, # Predicted values
color="red")
Out[1]: [<matplotlib.lines.Line2D at 0x18bcb22da90>]
The sigmoid function, also called logistic function gives an ‘S’ shaped curve that can take any
real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity,
y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0. If
the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO. The outputcannotFor example: If the
output is 0.75, we can say in terms of probability as: There is a 75 percent chance that patient
will suffer from cancer.
The most important thing to know is that logistic regression outputs probabilities that we can use
to classify observations.
Linear Regression Vs. Logistic Regression
1. Linear regression gives you a continuous output, but logistic regression provides a
constant output.
2. An example of the continuous output is house price and stock price. Example's of the
discrete output is predicting whether a patient has cancer or not OR predicting whether
the customer will churn.
3. Linear regression is estimated using Ordinary Least Squares (OLS) while logistic
regression is estimated using Maximum Likelihood Estimation (MLE) approach
Maximum Likelihood Estimation Vs. Least Square

Method
The MLE is a "likelihood" maximization method, while OLS is a distance-minimizing

approximation method. Maximizing the likelihood function determines the parameters that are
most likely to produce the observed data. From a statistical point of view, MLE sets the mean
and variance as parameters in determining the specific parametric values for a given model
Ordinary Least squares estimates are computed by fitting a regression line on given data points
that has the minimum sum of the squared deviations (least square error). Both are used to
estimate the parameters of a linear regression model.
Types of Logistic Regression
1. Binary Logistic Regression: The target variable has only two possible outcomes such as
Spam or Not Spam, Cancer or No Cancer.
1. Multinomial Logistic Regression: The target variable has three or more nominal
categories such as predicting the type of Wine.
1. Ordinal Logistic Regression: The target variable has three or more ordinal categories
such as restaurant or product rating from 1 to 5.
Model building in Scikit-learn
Let's build the diabetes prediction model.
Let's first load the required Pima Indian Diabetes dataset using the pandas' read CSV function
In [8]: #import pandas

import pandas as pd
col_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness'
, 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
# load dataset
diab = pd.read_csv('diabetes.csv', names=col_names, header=0)
In [9]: diab.head()
Out[9]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeF
0 6 148 72 35 0 33.6 0.627
1 1 85 66 29 0 26.6 0.351
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeF
2 8 183 64 0 0 23.3 0.672
3 1 89 66 23 94 28.1 0.167
4 0 137 40 35 168 43.1 2.288
In [10]: diab.dtypes
Out[10]: Pregnancies int64

Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
Selecting Feature
Here, you need to divide the given columns into two types of variables dependent(or target
variable) and independent variable(or feature variables).
In [13]: #split dataset in features and target variable

feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diab[feature_cols] # Features (Independent Variable)
y = diab.Outcome # Target variable
Splitting Data
To understand model performance, dividing the dataset into a training set and a test set is a good
strategy.
Let's split dataset by using function train_test_split(). You need to pass 3 parameters features,
target, and test_set size. Additionally, you can use random_state to select records randomly.
In [14]: # split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,ran
dom_state=0)
Here, the Dataset is broken into two parts in a ratio of 75:25. It means 75% data will be used for
model training and 25% for model testing.
Model Development and Prediction
First, import the Logistic Regression module and create a Logistic Regression classifier object
using LogisticRegression() function.
Then, fit your model on the train set using fit() and perform prediction on the test set using
predict().
In [15]: # import the class

from sklearn.linear_model import LogisticRegression
In [16]: # instantiate the model (using the default parameters)

logreg = LogisticRegression()
In [17]: # fit the model with data

logreg.fit(X_train,y_train)
Out[17]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=
True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=
1,
penalty='l2', random_state=None, solver='liblinear', tol=0.00
01,
verbose=0, warm_start=False)
In [18]: #
y_pred=logreg.predict(X_test)
In [19]: len(y_pred)
Out[19]: 192
Model Evaluation using Confusion Matrix
A confusion matrix is a table that is used to evaluate the performance of a classification model.
You can also visualize the performance of an algorithm. The fundamental of a confusion matrix is
the number of correct and incorrect predictions are summed up class-wise.
In [20]: # import the metrics class

from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
Out[20]: array([[119, 11],

[ 26, 36]], dtype=int64)
Here, you can see the confusion matrix in the form of the array object. The dimension of this
matrix is 2*2 because this model is binary classification. You have two classes 0 and 1. Diagonal
values represent accurate predictions, while non-diagonal elements are inaccurate predictions.
In the output, 119 and 36 are accurate predictions, and 26 and 11 are incorrect predictions.
Visualizing Confusion Matrix using Heatmap
Let's visualize the results of the model in the form of a confusion matrix using matplotlib and
seaborn.
Here, you will visualize the confusion matrix using Heatmap.
In [21]: # import required modules

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [22]: class_names=[0,1] # name of classes

fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt=
'g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[22]: Text(0.5,257.44,'Predicted label')
Confusion Matrix Evaluation Metrics
Let's evaluate the model using model evaluation metrics such as accuracy, precision, and recall.
In [23]: metrics.accuracy_score(y_test, y_pred)
Out[23]: 0.8072916666666666
In [24]: from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.82 0.92 0.87 130

1 0.77 0.58 0.66 62
avg / total 0.80 0.81 0.80 192
Well, you got a classification rate of 80%, considered as good accuracy.
Precision Score
“When it predicts the positive result, how often is it correct?”

TP – True Positives
FP – False Positives
Precision – Accuracy of positive predictions.
Precision = TP/(TP + FP)

Precision: Precision is about being precise, i.e., how accurate your model is. In other words, you
can say, when a model makes a prediction, how often it is correct. In your prediction case, when
your Logistic Regression model predicted patients are going to suffer from diabetes, that patients
have 76% of the time.
Recall Score
“When it is actually the positive result, how often does it predict

correctly?”
FN – False Negatives
Recall (aka sensitivity or true positive rate): Fraction of positives That were correctly identified.
Recall = TP/(TP+FN)
Recall: If there are patients who have diabetes in the test set and your Logistic Regression
model can identify it 58% of the time.
F1 Score
F1 Score (aka F-Score or F-Measure) – A helpful metric for comparing two classifiers. F1 Score
takes into account precision and the recall. It is created by finding the the harmonic mean of
precision and recall.
F1 = 2 x (precision x recall)/(precision + recall)
ROC CURVE
A more visual way to measure the performance of a binary classifier is the receiver operating
characteristic (ROC) curve. It is created by plotting the true positive rate (TPR) (or recall) against
the false positive rate (FPR), which we haven’t defined explicitly yet:
In [25]: from sklearn.metrics import roc_auc_score, roc_curve, auc

from sklearn.metrics import precision_recall_curve, average_precision_s
core
In [26]: y_pred_proba = logreg.predict_proba(X_test)[::,1]
In [27]: y_pred_proba
Out[27]: array([0.88459582, 0.20986589, 0.15635047, 0.59809835, 0.1819723 ,

0.07669943, 0.68248388, 0.74300535, 0.45160224, 0.36999983,
0.54747218, 0.89912238, 0.3127755 , 0.25260265, 0.17094379,
0.2011401 , 0.78633552, 0.06930978, 0.36315988, 0.31124388,
0.55119242, 0.39579856, 0.34043061, 0.09150209, 0.10820372,
0.3624859 , 0.0906731 , 0.83068212, 0.16823406, 0.21233896,
0.4598109 , 0.26575779, 0.13757528, 0.46147237, 0.17515309,
0.65413557, 0.45357887, 0.13171106, 0.36104806, 0.67555452,
0.31877839, 0.24869867, 0.23348612, 0.73208462, 0.71025713,
0.02982383, 0.14406944, 0.26678717, 0.36912682, 0.32653216,
0.44657314, 0.25476223, 0.817448 , 0.42571693, 0.18927724,
0.01264126, 0.10312738, 0.45791007, 0.32138061, 0.22763647,
0.61345042, 0.48605364, 0.17008395, 0.72728988, 0.62693749,
0.84366782, 0.63849841, 0.19445794, 0.41795759, 0.15724276,
0.18907796, 0.53389831, 0.13514101, 0.87540158, 0.76290278,
0.33379141, 0.13972236, 0.59125757, 0.10854175, 0.20593514,
0.34837321, 0.38289872, 0.24982253, 0.07716112, 0.2445919 ,
0.24214284, 0.36014128, 0.43174265, 0.79190052, 0.22365636,
0.19701887, 0.22085911, 0.31272608, 0.08446738, 0.53039015,
0.2481466 , 0.40826489, 0.47257778, 0.53278713, 0.28229219,
0.28015691, 0.1707862 , 0.23832661, 0.08279293, 0.54927883,
0.39299867, 0.1787912 , 0.30319795, 0.08668146, 0.71932275,
0.16511955, 0.34024947, 0.5848725 , 0.33979684, 0.54753691,
0.55575702, 0.20536801, 0.68627949, 0.13716406, 0.66253399,
0.33868304, 0.27879737, 0.31788304, 0.44165912, 0.26128697,
0.09934927, 0.34078975, 0.35316982, 0.46098238, 0.37910186,
0.40787941, 0.14601379, 0.10663505, 0.72197467, 0.33678755,
0.37987528, 0.1999669 , 0.38170901, 0.70257082, 0.27320585,
0.14906075, 0.50805189, 0.13227697, 0.12578012, 0.37712546,
0.14894167, 0.1300954 , 0.14165793, 0.18261958, 0.2262175 ,
0.14340148, 0.57446483, 0.16620863, 0.21960711, 0.67390156,
0.18778124, 0.6619777 , 0.21289901, 0.62697869, 0.93770321,
0.63264962, 0.69136452, 0.06134282, 0.31691602, 0.79648062,
0.44420579, 0.28581201, 0.19030059, 0.31561807, 0.13289843,
0.21073731, 0.24787568, 0.24689633, 0.26406978, 0.45514708,
0.18129071, 0.38958252, 0.13822798, 0.17549839, 0.11509271,
0.20239073, 0.7167396 , 0.19107857, 0.8723455 , 0.49741011,
0.06390097, 0.69149168, 0.30815069, 0.44162597, 0.21045327,
0.23071042, 0.13339437])
In [29]: fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
In [30]: fpr, tpr, _
Out[30]: (array([0. , 0. , 0.00769231, 0.00769231, 0.01538462,

0.01538462, 0.02307692, 0.02307692, 0.03076923, 0.03076923,
0.03846154, 0.03846154, 0.04615385, 0.04615385, 0.05384615,
0.05384615, 0.06153846, 0.06153846, 0.06923077, 0.06923077,
0.08461538, 0.08461538, 0.12307692, 0.12307692, 0.13076923,
0.13076923, 0.14615385, 0.14615385, 0.15384615, 0.15384615,
0.17692308, 0.17692308, 0.18461538, 0.18461538, 0.2 ,
0.2 , 0.20769231, 0.20769231, 0.21538462, 0.21538462,
0.23846154, 0.23846154, 0.26153846, 0.26153846, 0.28461538,
0.28461538, 0.33846154, 0.33846154, 0.35384615, 0.35384615,
0.37692308, 0.37692308, 0.38461538, 0.38461538, 0.43846154,
0.43846154, 0.47692308, 0.47692308, 0.55384615, 0.55384615,
1. ]),
array([0.01612903, 0.0483871 , 0.0483871 , 0.08064516, 0.08064516,
0.09677419, 0.09677419, 0.30645161, 0.30645161, 0.32258065,
0.32258065, 0.35483871, 0.35483871, 0.46774194, 0.46774194,
0.53225806, 0.53225806, 0.56451613, 0.56451613, 0.58064516,
0.58064516, 0.59677419, 0.59677419, 0.61290323, 0.61290323,
0.62903226, 0.62903226, 0.64516129, 0.64516129, 0.66129032,
0.66129032, 0.67741935, 0.67741935, 0.70967742, 0.70967742,
0.74193548, 0.74193548, 0.75806452, 0.75806452, 0.77419355,
0.77419355, 0.80645161, 0.80645161, 0.83870968, 0.83870968,
0.87096774, 0.87096774, 0.88709677, 0.88709677, 0.90322581,
0.90322581, 0.91935484, 0.91935484, 0.93548387, 0.93548387,
0.9516129 , 0.9516129 , 0.98387097, 0.98387097, 1. ,
1. ]),
array([0.93770321, 0.88459582, 0.87540158, 0.84366782, 0.83068212,
0.817448 , 0.79648062, 0.69136452, 0.68627949, 0.68248388,
0.67555452, 0.66253399, 0.6619777 , 0.59809835, 0.59125757,
0.55119242, 0.54927883, 0.54747218, 0.53389831, 0.53278713,
0.50805189, 0.49741011, 0.4598109 , 0.45791007, 0.45514708,
0.45357887, 0.44657314, 0.44420579, 0.44165912, 0.44162597,
0.41795759, 0.40826489, 0.40787941, 0.39299867, 0.38289872,
0.37987528, 0.37910186, 0.37712546, 0.36999983, 0.36912682,
0.36104806, 0.35316982, 0.34043061, 0.33979684, 0.33379141,
0.32138061, 0.31124388, 0.30815069, 0.28581201, 0.28229219,
0.27320585, 0.26678717, 0.26575779, 0.26406978, 0.24787568,
0.24689633, 0.23071042, 0.2262175 , 0.20536801, 0.20239073,
0.01264126]))
In [31]: y_pred
Out[31]: array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1,
1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1,
1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0,
1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int64)
In [32]: y_test
Out[32]: 661 1
122 0
113 0
14 1
529 0
103 0
338 1
588 1
395 0
204 0
31 1
546 1
278 0
593 0
737 0
202 0
175 1
55 0
479 0
365 0
417 1
577 1
172 0
352 0
27 0
605 0
239 0
744 0
79 0
496 0
..
97 0
530 0
327 0
619 1
518 0
632 0
524 0
536 0
597 0
462 0
17 1
739 1
263 0
241 0
344 0
302 0
704 0
240 0
170 1
691 1
490 0
45 1
750 1
62 0
78 1
366 1
301 1
382 0
140 0
463 0
Name: Outcome, Length: 192, dtype: int64
In [35]: auc=roc_auc_score(y_test,y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
AUC score for the case is 0.86. AUC score 1 represents perfect classifier, and 0.5 represents a
worthless classifier.
Advantages of Logistic Regression
Because of its efficient and straightforward nature, doesn't require high computation power, easy
to implement, easily interpretable, used widely by data analyst and scientist. Also, it doesn't
require scaling of features. Logistic regression provides a probability score for observations.
Disadvantages of Logistic Regression
Logistic regression is not able to handle a large number of categorical features/variables. It is

vulnerable to overfitting. Also, can't solve the non-linear problem with the logistic regression that
is why it requires a transformation of non-linear features. Logistic regression will not perform well
with independent variables that are not correlated to the target variable and are very similar or
correlated to each other.

14th LESSON (ANKUR - PROSCHOOL) - LOGISTIC REGRESSION CASE S

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

14th LESSON (ANKUR - PROSCHOOL) - LOGISTIC REGRESSION CASE S

Uploaded by

Copyright:

Available Formats

Logistic Regression

Logistic Regression can be used for various classification problems such as

and many more examples are in the bucket.

Definition of Logistic Regression

Linear Regression Equation:

y = m1X1 + m2X2 + m3X3 + ......... mnXn

Apply Sigmoid function on linear regression:

p = 1/[1+(e**(-(m1X1 + m2X2 + m3X3 + ......... mnXn))]

In [1]: import pandas as pd

Out[1]: [<matplotlib.lines.Line2D at 0x18bcb22da90>]

Linear Regression Vs. Logistic Regression

Maximum Likelihood Estimation Vs. Least Square

The MLE is a "likelihood" maximization method, while OLS is a distance-minimizing

Model building in Scikit-learn

Let's build the diabetes prediction model.

In [8]: #import pandas

0 6 148 72 35 0 33.6 0.627

2 8 183 64 0 0 23.3 0.672

4 0 137 40 35 168 43.1 2.288

Out[10]: Pregnancies int64

In [13]: #split dataset in features and target variable

X = diab[feature_cols] # Features (Independent Variable)

y = diab.Outcome # Target variable

In [14]: # split X and y into training and testing sets

from sklearn.model_selection import train_test_split

Model Development and Prediction

In [15]: # import the class

In [16]: # instantiate the model (using the default parameters)

In [17]: # fit the model with data

Model Evaluation using Confusion Matrix

In [20]: # import the metrics class

Out[20]: array([[119, 11],

Here, you will visualize the confusion matrix using Heatmap.

In [21]: # import required modules

In [22]: class_names=[0,1] # name of classes

Out[22]: Text(0.5,257.44,'Predicted label')

In [23]: metrics.accuracy_score(y_test, y_pred)

In [24]: from sklearn.metrics import classification_report

precision recall f1-score support

0 0.82 0.92 0.87 130

Well, you got a classification rate of 80%, considered as good accuracy.

“When it predicts the positive result, how often is it correct?”

Precision – Accuracy of positive predictions.

Precision = TP/(TP + FP)

“When it is actually the positive result, how often does it predict

F1 = 2 x (precision x recall)/(precision + recall)

In [25]: from sklearn.metrics import roc_auc_score, roc_curve, auc

In [26]: y_pred_proba = logreg.predict_proba(X_test)[::,1]

Out[27]: array([0.88459582, 0.20986589, 0.15635047, 0.59809835, 0.1819723 ,

In [29]: fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)

In [30]: fpr, tpr, _

Out[30]: (array([0. , 0. , 0.00769231, 0.00769231, 0.01538462,

Advantages of Logistic Regression

Logistic regression is not able to handle a large number of categorical features/variables. It is

You might also like