Professional Documents
Culture Documents
14th LESSON (ANKUR - PROSCHOOL) - LOGISTIC REGRESSION CASE S
14th LESSON (ANKUR - PROSCHOOL) - LOGISTIC REGRESSION CASE S
Classification techniques are an essential part of machine learning and data mining applications.
Approximately 70% of problems in Data Science are classification problems. There are lots of
classification problems that are available, but the logistics regression is common and is a useful
regression method for solving the binary classification problem. Another category of classification
is Multinomial classification, which handles the issues where multiple classes are present in the
target variable.
1. Spam detection
2. Diabetes prediction,
3. if a given customer will purchase a particular product or will they churn another
competitor,
4. whether the user will click on a given advertisement link or not,
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms
for two-class classification. It is easy to implement and can be used as the baseline for any
binary classification problem. Its basic fundamental concepts are also constructive in deep
learning. Logistic regression describes and estimates the relationship between one dependent
binary variable and independent variables.
Logistic regression is a statistical method for predicting binary classes. The outcome or target
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
variable has only two possible classes. For example, it can be used for cancer detection
problems. It computes the probability of an event occurrence.
It is a special case of linear regression where the target variable is categorical in nature. It uses a
log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence
of a binary event utilizing a logit function.
Where, y is dependent variable and X1, X2 ... and Xn are explanatory variables.
Sigmoid Function:
p = 1/[1+(e**(-y))]
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
plt.plot(plot_range, # X-axis range
y_values, # Predicted values
color="red")
The sigmoid function, also called logistic function gives an ‘S’ shaped curve that can take any
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity,
y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0. If
the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO. The outputcannotFor example: If the
output is 0.75, we can say in terms of probability as: There is a 75 percent chance that patient
will suffer from cancer.
The most important thing to know is that logistic regression outputs probabilities that we can use
to classify observations.
1. Linear regression gives you a continuous output, but logistic regression provides a
constant output.
2. An example of the continuous output is house price and stock price. Example's of the
discrete output is predicting whether a patient has cancer or not OR predicting whether
the customer will churn.
3. Linear regression is estimated using Ordinary Least Squares (OLS) while logistic
regression is estimated using Maximum Likelihood Estimation (MLE) approach
Ordinary Least squares estimates are computed by fitting a regression line on given data points
that has the minimum sum of the squared deviations (least square error). Both are used to
estimate the parameters of a linear regression model.
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Types of Logistic Regression
1. Binary Logistic Regression: The target variable has only two possible outcomes such as
Spam or Not Spam, Cancer or No Cancer.
1. Multinomial Logistic Regression: The target variable has three or more nominal
categories such as predicting the type of Wine.
1. Ordinal Logistic Regression: The target variable has three or more ordinal categories
such as restaurant or product rating from 1 to 5.
Let's first load the required Pima Indian Diabetes dataset using the pandas' read CSV function
# load dataset
diab = pd.read_csv('diabetes.csv', names=col_names, header=0)
In [9]: diab.head()
Out[9]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeF
1 1 85 66 29 0 26.6 0.351
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeF
3 1 89 66 23 94 28.1 0.167
In [10]: diab.dtypes
Selecting Feature
Here, you need to divide the given columns into two types of variables dependent(or target
variable) and independent variable(or feature variables).
Splitting Data
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
To understand model performance, dividing the dataset into a training set and a test set is a good
strategy.
Let's split dataset by using function train_test_split(). You need to pass 3 parameters features,
target, and test_set size. Additionally, you can use random_state to select records randomly.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,ran
dom_state=0)
Here, the Dataset is broken into two parts in a ratio of 75:25. It means 75% data will be used for
model training and 25% for model testing.
First, import the Logistic Regression module and create a Logistic Regression classifier object
using LogisticRegression() function.
Then, fit your model on the train set using fit() and perform prediction on the test set using
predict().
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Out[17]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=
True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=
1,
penalty='l2', random_state=None, solver='liblinear', tol=0.00
01,
verbose=0, warm_start=False)
In [18]: #
y_pred=logreg.predict(X_test)
In [19]: len(y_pred)
Out[19]: 192
A confusion matrix is a table that is used to evaluate the performance of a classification model.
You can also visualize the performance of an algorithm. The fundamental of a confusion matrix is
the number of correct and incorrect predictions are summed up class-wise.
Here, you can see the confusion matrix in the form of the array object. The dimension of this
matrix is 2*2 because this model is binary classification. You have two classes 0 and 1. Diagonal
values represent accurate predictions, while non-diagonal elements are inaccurate predictions.
In the output, 119 and 36 are accurate predictions, and 26 and 11 are incorrect predictions.
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Visualizing Confusion Matrix using Heatmap
Let's visualize the results of the model in the form of a confusion matrix using matplotlib and
seaborn.
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt=
'g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Confusion Matrix Evaluation Metrics
Let's evaluate the model using model evaluation metrics such as accuracy, precision, and recall.
Out[23]: 0.8072916666666666
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
avg / total 0.80 0.81 0.80 192
Precision Score
FP – False Positives
Recall Score
Recall (aka sensitivity or true positive rate): Fraction of positives That were correctly identified.
Recall = TP/(TP+FN)
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Recall: If there are patients who have diabetes in the test set and your Logistic Regression
model can identify it 58% of the time.
F1 Score
F1 Score (aka F-Score or F-Measure) – A helpful metric for comparing two classifiers. F1 Score
takes into account precision and the recall. It is created by finding the the harmonic mean of
precision and recall.
ROC CURVE
A more visual way to measure the performance of a binary classifier is the receiver operating
characteristic (ROC) curve. It is created by plotting the true positive rate (TPR) (or recall) against
the false positive rate (FPR), which we haven’t defined explicitly yet:
In [27]: y_pred_proba
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
0.31877839, 0.24869867, 0.23348612, 0.73208462, 0.71025713,
0.02982383, 0.14406944, 0.26678717, 0.36912682, 0.32653216,
0.44657314, 0.25476223, 0.817448 , 0.42571693, 0.18927724,
0.01264126, 0.10312738, 0.45791007, 0.32138061, 0.22763647,
0.61345042, 0.48605364, 0.17008395, 0.72728988, 0.62693749,
0.84366782, 0.63849841, 0.19445794, 0.41795759, 0.15724276,
0.18907796, 0.53389831, 0.13514101, 0.87540158, 0.76290278,
0.33379141, 0.13972236, 0.59125757, 0.10854175, 0.20593514,
0.34837321, 0.38289872, 0.24982253, 0.07716112, 0.2445919 ,
0.24214284, 0.36014128, 0.43174265, 0.79190052, 0.22365636,
0.19701887, 0.22085911, 0.31272608, 0.08446738, 0.53039015,
0.2481466 , 0.40826489, 0.47257778, 0.53278713, 0.28229219,
0.28015691, 0.1707862 , 0.23832661, 0.08279293, 0.54927883,
0.39299867, 0.1787912 , 0.30319795, 0.08668146, 0.71932275,
0.16511955, 0.34024947, 0.5848725 , 0.33979684, 0.54753691,
0.55575702, 0.20536801, 0.68627949, 0.13716406, 0.66253399,
0.33868304, 0.27879737, 0.31788304, 0.44165912, 0.26128697,
0.09934927, 0.34078975, 0.35316982, 0.46098238, 0.37910186,
0.40787941, 0.14601379, 0.10663505, 0.72197467, 0.33678755,
0.37987528, 0.1999669 , 0.38170901, 0.70257082, 0.27320585,
0.14906075, 0.50805189, 0.13227697, 0.12578012, 0.37712546,
0.14894167, 0.1300954 , 0.14165793, 0.18261958, 0.2262175 ,
0.14340148, 0.57446483, 0.16620863, 0.21960711, 0.67390156,
0.18778124, 0.6619777 , 0.21289901, 0.62697869, 0.93770321,
0.63264962, 0.69136452, 0.06134282, 0.31691602, 0.79648062,
0.44420579, 0.28581201, 0.19030059, 0.31561807, 0.13289843,
0.21073731, 0.24787568, 0.24689633, 0.26406978, 0.45514708,
0.18129071, 0.38958252, 0.13822798, 0.17549839, 0.11509271,
0.20239073, 0.7167396 , 0.19107857, 0.8723455 , 0.49741011,
0.06390097, 0.69149168, 0.30815069, 0.44162597, 0.21045327,
0.23071042, 0.13339437])
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
0.05384615, 0.06153846, 0.06153846, 0.06923077, 0.06923077,
0.08461538, 0.08461538, 0.12307692, 0.12307692, 0.13076923,
0.13076923, 0.14615385, 0.14615385, 0.15384615, 0.15384615,
0.17692308, 0.17692308, 0.18461538, 0.18461538, 0.2 ,
0.2 , 0.20769231, 0.20769231, 0.21538462, 0.21538462,
0.23846154, 0.23846154, 0.26153846, 0.26153846, 0.28461538,
0.28461538, 0.33846154, 0.33846154, 0.35384615, 0.35384615,
0.37692308, 0.37692308, 0.38461538, 0.38461538, 0.43846154,
0.43846154, 0.47692308, 0.47692308, 0.55384615, 0.55384615,
1. ]),
array([0.01612903, 0.0483871 , 0.0483871 , 0.08064516, 0.08064516,
0.09677419, 0.09677419, 0.30645161, 0.30645161, 0.32258065,
0.32258065, 0.35483871, 0.35483871, 0.46774194, 0.46774194,
0.53225806, 0.53225806, 0.56451613, 0.56451613, 0.58064516,
0.58064516, 0.59677419, 0.59677419, 0.61290323, 0.61290323,
0.62903226, 0.62903226, 0.64516129, 0.64516129, 0.66129032,
0.66129032, 0.67741935, 0.67741935, 0.70967742, 0.70967742,
0.74193548, 0.74193548, 0.75806452, 0.75806452, 0.77419355,
0.77419355, 0.80645161, 0.80645161, 0.83870968, 0.83870968,
0.87096774, 0.87096774, 0.88709677, 0.88709677, 0.90322581,
0.90322581, 0.91935484, 0.91935484, 0.93548387, 0.93548387,
0.9516129 , 0.9516129 , 0.98387097, 0.98387097, 1. ,
1. ]),
array([0.93770321, 0.88459582, 0.87540158, 0.84366782, 0.83068212,
0.817448 , 0.79648062, 0.69136452, 0.68627949, 0.68248388,
0.67555452, 0.66253399, 0.6619777 , 0.59809835, 0.59125757,
0.55119242, 0.54927883, 0.54747218, 0.53389831, 0.53278713,
0.50805189, 0.49741011, 0.4598109 , 0.45791007, 0.45514708,
0.45357887, 0.44657314, 0.44420579, 0.44165912, 0.44162597,
0.41795759, 0.40826489, 0.40787941, 0.39299867, 0.38289872,
0.37987528, 0.37910186, 0.37712546, 0.36999983, 0.36912682,
0.36104806, 0.35316982, 0.34043061, 0.33979684, 0.33379141,
0.32138061, 0.31124388, 0.30815069, 0.28581201, 0.28229219,
0.27320585, 0.26678717, 0.26575779, 0.26406978, 0.24787568,
0.24689633, 0.23071042, 0.2262175 , 0.20536801, 0.20239073,
0.01264126]))
In [31]: y_pred
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Out[31]: array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1,
1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1,
1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1,
0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0,
1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int64)
In [32]: y_test
Out[32]: 661 1
122 0
113 0
14 1
529 0
103 0
338 1
588 1
395 0
204 0
31 1
546 1
278 0
593 0
737 0
202 0
175 1
55 0
479 0
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
365 0
417 1
577 1
172 0
352 0
27 0
605 0
239 0
744 0
79 0
496 0
..
97 0
530 0
327 0
619 1
518 0
632 0
524 0
536 0
597 0
462 0
17 1
739 1
263 0
241 0
344 0
302 0
704 0
240 0
170 1
691 1
490 0
45 1
750 1
62 0
78 1
366 1
301 1
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
382 0
140 0
463 0
Name: Outcome, Length: 192, dtype: int64
In [35]: auc=roc_auc_score(y_test,y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
AUC score for the case is 0.86. AUC score 1 represents perfect classifier, and 0.5 represents a
worthless classifier.
Because of its efficient and straightforward nature, doesn't require high computation power, easy
to implement, easily interpretable, used widely by data analyst and scientist. Also, it doesn't
require scaling of features. Logistic regression provides a probability score for observations.
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD
Disadvantages of Logistic Regression
Create PDF in your applications with the Pdfcrowd HTML to PDF API PDFCROWD