Logistic+regression Data

Logistic Regression
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Logistic Regression
• Logistic regression is a supervised learning algorithm used to predict the probability of a target variable.
• The target (dependent) variable is dichotomous, which means it has only two possible classes.
• In simple words, the dependent variable is binary, with data coded as either 1 or 0.
• Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be
used for various classification problems such as spam detection, diabetes prediction, and cancer detection.
Assumptions
Before diving into the implementation of logistic regression, we must be aware of the following assumptions:
• The response variable is binary.
• The observations are independent of one another.
• The independent variables have little or no multicollinearity.
• The data contain no extreme outliers.

• The logit of the outcome and each independent variable should have a linear relationship.
• To predict accurately, a large sample size is usually required.
Logistic regression model representation
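$h_\theta(x) = g(\theta^{T} x)$, where $0 \le h_\theta(x) \le 1$ and $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid (logistic) function.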
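
As a minimal sketch of this representation, assuming g is the standard sigmoid function, the hypothesis can be computed as shown here. The parameter values and the input are made up for illustration, and x[0] = 1.0 is assumed to be the intercept term.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    # Branch on the sign so math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """Return h_theta(x) = g(theta^T x), read as the predicted P(Y=1 | x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and input (not from the slides).
theta = [-1.0, 2.0]
x = [1.0, 0.5]           # x[0] = 1.0 is the intercept term
p = hypothesis(theta, x)  # theta^T x = 0 here, so p = 0.5
```

A probability of 0.5 or more is usually mapped to class 1, although the cut-off is a choice rather than part of the model.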

Performance metrics in Classification
Confusion Matrix
• The confusion matrix is one of the most intuitive and easiest tools for assessing the correctness and accuracy of a model.
• It is used for classification problems where the output can belong to two or more classes.
• True Positives (TP): True positives are the cases when the actual class of the data point was True and predicted is also True.
• True Negatives (TN): True negatives are the cases when the actual class of the data point was False and the predicted is also False.
• False Positives (FP): False positives are the cases when the actual class of the data point was False and the predicted is True.
• False Negatives (FN): False negatives are the cases when the actual class of the data point was True and the predicted is False.
• Accuracy: Accuracy is the number of correct predictions made by the model divided by the total number of predictions: (TP + TN) / (TP + TN + FP + FN).
• Precision: Precision is the fraction of positive predictions that are actually positive: TP / (TP + FP).
• Recall: Recall is the fraction of actual positives that the model correctly predicts as positive: TP / (TP + FN).
• F1-Score: The F1 score is the harmonic mean of Precision and Recall: 2 × Precision × Recall / (Precision + Recall).
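
The four confusion-matrix counts and the metrics built on them can be sketched in a few lines. This is an illustrative example with made-up labels, assuming classes are coded as 1 (positive) and 0 (negative); returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)   # (TP, TN, FP, FN) = (3, 5, 1, 1)
metrics = classification_metrics(*counts)
```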
AUC-ROC:
• The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability
curve that plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold values.
○ True Positive Rate (TPR): The true positive rate (TPR, also called sensitivity) is calculated as TP / (TP + FN).
○ False Positive Rate (FPR): The false positive rate is calculated as FP / (FP + TN).
• The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a
summary of the ROC curve.
• The ROC curve helps us choose a threshold value for assigning class labels and also helps in model optimization.
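
To make the construction concrete, the sketch below sweeps a threshold over predicted scores, records one (FPR, TPR) point per threshold, and sums the area with the trapezoid rule. The labels and scores are made up, the rule "score >= threshold means positive" is an assumption, and the code assumes both classes are present in the data.

```python
def roc_points(y_true, scores):
    """Return (threshold, FPR, TPR) triples, from the highest threshold down."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for thr in sorted(set(scores), reverse=True):
        # Assumed rule: a score at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((thr, fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve using the trapezoid rule."""
    # Start at (0, 0): the point reached when nothing is predicted positive.
    xs = [0.0] + [fpr for _, fpr, _ in points]
    ys = [0.0] + [tpr for _, _, tpr in points]
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
               for i in range(1, len(xs)))

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)   # 0.75 for this toy data
```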
Effect of Threshold in AUC-ROC Curve:
• When AUC = 1, the classifier distinguishes perfectly between all the Positive and the Negative class points.
If, however, the AUC is 0, the classifier is predicting all Negatives as Positives and all Positives
as Negatives.
• When 0.5 < AUC < 1, there is a high chance that the classifier will be able to distinguish the positive class values from the
negative class values.
• When AUC = 0.5, the classifier is not able to distinguish between Positive and Negative class points. This means the
classifier is predicting either a random class or a constant class for all the data points.
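
These three cases can be checked numerically. The sketch below uses the pairwise-ranking formulation of AUC (the fraction of positive/negative pairs in which the positive gets the higher score, with ties counted as half), which gives the same value as the area under the ROC curve. The labels and scores are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]

perfect = pairwise_auc(y_true, [0.1, 0.2, 0.8, 0.9])    # 1.0: classes fully separated
inverted = pairwise_auc(y_true, [0.9, 0.8, 0.2, 0.1])   # 0.0: every pair ranked the wrong way
constant = pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5])   # 0.5: same score for every point
```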
Explanation of Multi-Class Classification
Advantages and Disadvantages
Advantages
• Logistic Regression is less prone to overfitting than more complex models.
• It is much easier to implement than many other algorithms.
• It requires little hyperparameter tuning.
Disadvantages
• It may not perform well on large datasets.
• The algorithm works well only on linearly separable data.
• It is not flexible with continuous data.
1. What is the Impact of Outliers on Logistic Regression?
The estimates of Logistic Regression are sensitive to unusual observations such as outliers, high-leverage points, and influential
observations. The sigmoid function bounds every prediction between 0 and 1, which limits the effect of extreme values compared with linear regression, but extreme outliers should still be checked and treated before fitting.
2. How do we handle categorical variables in Logistic Regression?
The inputs given to a Logistic Regression model need to be numeric. The algorithm cannot handle categorical variables directly.
So, we need to convert the categorical data into a numerical format that is suitable for the algorithm to process.
Each level of the categorical variable is represented by its own 0/1 indicator column, known as a dummy variable. These dummy
variables are handled by the Logistic Regression model in the same manner as any other numeric value.
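
One common way to build dummy variables is one-hot encoding, with one 0/1 indicator column per level. The sketch below is illustrative only: the `is_` column naming and the `drop_first` option are assumptions made for this example. Dropping one level is a common way to avoid perfectly collinear columns when the model also has an intercept.

```python
def one_hot_encode(values, drop_first=False):
    """Turn a list of category labels into 0/1 dummy columns."""
    levels = sorted(set(values))
    if drop_first:
        # The dropped level becomes the baseline (all dummies equal to 0).
        levels = levels[1:]
    columns = ["is_" + str(level) for level in levels]  # hypothetical naming
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return columns, rows

# Made-up categorical feature.
colors = ["red", "green", "blue", "green"]

columns, rows = one_hot_encode(colors)
# columns: ['is_blue', 'is_green', 'is_red']
# rows:    [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

reduced_columns, reduced_rows = one_hot_encode(colors, drop_first=True)
# 'blue' is now the baseline level, encoded as [0, 0].
```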
3. Can we solve multiclass classification problems using Logistic Regression? If yes, then how?
Yes. The best-known method for multiclass classification with Logistic Regression is the one-vs-all
approach. In this approach, one binary model is trained per class, so the number of models equals the number of classes.
For example, the first model classifies a data point as either class 1 or some other class (not class 1);
the second model classifies it as either class 2 or some other class (not class 2), and so on for all other classes.
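
A minimal sketch of one-vs-all follows. It assumes each binary model is fitted by plain batch gradient descent and that the final class is the one whose model gives the highest probability; the learning rate, step count, and toy data are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_binary(X, y, lr=0.1, steps=2000):
    """Fit a binary logistic regression; returns [bias, w1, ..., wd]."""
    n, d = len(X), len(X[0])
    theta = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            row = [1.0] + list(xi)              # prepend the intercept term
            err = sigmoid(sum(t * v for t, v in zip(theta, row))) - yi
            for j, v in enumerate(row):
                grad[j] += err * v / n          # mean gradient of the log loss
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def fit_one_vs_all(X, labels):
    """Train one 'class c vs. not class c' model per distinct class."""
    models = {}
    for c in sorted(set(labels)):
        binary_y = [1 if lab == c else 0 for lab in labels]
        models[c] = fit_binary(X, binary_y)
    return models

def predict_one_vs_all(models, x):
    """Pick the class whose binary model assigns the highest probability."""
    row = [1.0] + list(x)
    scores = {c: sigmoid(sum(t * v for t, v in zip(theta, row)))
              for c, theta in models.items()}
    return max(scores, key=scores.get)

# Toy data: three well-separated clusters in 2D (made up for illustration).
X = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 1), (5, 1), (0, 5), (1, 6), (1, 5)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = fit_one_vs_all(X, labels)
predicted = [predict_one_vs_all(models, x) for x in X]
```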
4. Which metric to use in the case of class imbalanced data?
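Most real-life classification problems have an imbalanced class distribution, and accuracy can be misleading in that case because a model can score well by always predicting the majority class. The F1-score is therefore a better metric for evaluating our
model.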
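
A small made-up example shows why accuracy alone can mislead on imbalanced data. With 5 positives out of 100 points, a model that always predicts the negative class has high accuracy but an F1-score of 0. The F1 formula is written here in its count form, 2TP / (2TP + FP + FN), and is reported as 0.0 when that denominator is zero (an assumed convention).

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0   # assumed convention for 0/0
    return accuracy, f1

# Imbalanced toy data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A always predicts the majority (negative) class.
always_negative = [0] * 100
acc_a, f1_a = accuracy_and_f1(y_true, always_negative)   # 0.95, 0.0

# Model B finds 4 of the 5 positives at the cost of 6 false alarms.
model_b = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89
acc_b, f1_b = accuracy_and_f1(y_true, model_b)           # 0.93, 8/15
```

Model B has the lower accuracy but the higher F1-score, which better reflects how useful it is for finding the rare positive class.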
5. What is ROC curve threshold?
The ROC curve helps us find the threshold where the TPR is high and the FPR is low, i.e. where misclassifications are low.
Therefore, ROC curves can be used to determine a suitable probability threshold for a classification model.
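
The slides do not name a specific selection rule. One common choice, used here as an assumption, is Youden's J statistic: pick the threshold that maximizes TPR - FPR. The labels and scores below are made up, and a score at or above the threshold is treated as a positive prediction.

```python
def best_threshold(y_true, scores):
    """Return (threshold, J, TPR, FPR) for the threshold maximizing J = TPR - FPR."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = None
    for thr in sorted(set(scores)):
        # Assumed rule: a score at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        tpr = tp / positives
        fpr = fp / negatives
        j = tpr - fpr
        if best is None or j > best[1]:
            best = (thr, j, tpr, fpr)
    return best

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.4, 0.55, 0.6, 0.7, 0.9]

threshold, j, tpr, fpr = best_threshold(y_true, scores)
# For this toy data: threshold 0.6, TPR 0.75, FPR 0.0
```

Other rules are equally valid. If false negatives are more costly than false positives, for instance, a lower threshold may be preferable.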
6. What is the difference between linear and logarithmic scale?

On a linear scale, a change between two values is perceived on the basis of the difference between the values.
E.g., a change from 2 to 3 is perceived as the same increase as a change from 1 to 2.
On a logarithmic scale, a change between two values is perceived on the basis of the ratio of the two values.
E.g., a change from 1 to 2 is perceived as the same increase as a change from 4 to 8.
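
The two examples can be verified with a short calculation. On a linear scale the step is a difference; on a logarithmic scale (base 2 is used here as an arbitrary choice) the step is the difference of the logs, which depends only on the ratio.

```python
import math

def linear_step(a, b):
    """Size of the change from a to b on a linear scale."""
    return b - a

def log_step(a, b):
    """Size of the change from a to b on a base-2 logarithmic scale."""
    return math.log2(b) - math.log2(a)   # equals log2(b / a)

# Linear scale: 2 -> 3 and 1 -> 2 are the same step (difference of 1).
same_linear = linear_step(2, 3) == linear_step(1, 2)

# Log scale: 1 -> 2 and 4 -> 8 are the same step (ratio of 2).
same_log = log_step(1, 2) == log_step(4, 8)

# On a linear scale, 4 -> 8 is a larger step than 1 -> 2.
bigger_on_linear = linear_step(4, 8) > linear_step(1, 2)
```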
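7. What are the hyperparameters of logistic regression?

Some of the hyperparameters used in logistic regression are:
i. Solver: the optimization algorithm used to fit the coefficients
ii. Penalty: the type of regularization applied, such as L1 or L2
iii. Dataset balancing: class weights used to compensate for imbalanced classes
iv. Multi-class: the strategy used when there are more than two classes, such as one-vs-all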
