
ADVANCED HR ANALYTICS
Session 7-8
Delivered by Dr. Pratyush Banerjee
Session objectives
• To learn how to clean data
• To perform data mining with logistic regression
• To interpret the classification table output
• To explore applications in data mining
DATA CLEANING

Data cleaning with Orange – Impute Widget

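Conceptually, the Impute widget fills in missing values before modelling. A minimal Python sketch of the same idea using scikit-learn's SimpleImputer (the column names and values below are made up for illustration, not taken from the session's datasets):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical HR data with missing values (NaN)
df = pd.DataFrame({"age": [25, 30, np.nan, 45],
                   "salary": [40000, np.nan, 55000, 60000]})

# Replace each missing value with the column mean, mirroring the
# mean/average imputation option in Orange's Impute widget
imputer = SimpleImputer(strategy="mean")
df[["age", "salary"]] = imputer.fit_transform(df[["age", "salary"]])
print(df)
```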
Data cleaning with Orange – Outlier Widget

Outlier detection with Orange
Orange provides the following outlier detection and removal techniques:
• One-class SVM with a non-linear kernel (RBF) – this method performs well with non-Gaussian distributions.
• Covariance estimator – works only for data with a normal / Gaussian distribution; here the Mahalanobis distance is used.
• Local Outlier Factor (LOF) algorithm – computes a score reflecting the degree of abnormality of each observation by measuring the local density deviation of a given data point with respect to its neighbors; usually the Euclidean distance is used.
• Isolation Forest – another efficient way of performing outlier detection in high-dimensional datasets, based on random forests.
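All four detectors listed above are available in scikit-learn, the library ecosystem Orange builds on; a minimal sketch follows. A label of -1 marks an outlier; the toy data and the parameter values (nu, contamination, n_neighbors) are illustrative assumptions, not settings from the slides:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # Gaussian inliers
X = np.vstack([X, [[6, 6], [-7, 5]]])  # two obvious outliers

detectors = {
    "One-class SVM (RBF)":  OneClassSVM(kernel="rbf", nu=0.05),
    "Covariance estimator": EllipticEnvelope(contamination=0.05),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20),
    "Isolation Forest":     IsolationForest(contamination=0.05, random_state=0),
}

for name, det in detectors.items():
    labels = det.fit_predict(X)  # +1 = inlier, -1 = outlier
    print(f"{name}: {np.sum(labels == -1)} outliers flagged")
```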
Graphical representation of different outlier detection algorithms
Orange Workflow for Imputation and Outlier Analysis

LOGISTIC REGRESSION

Logistic Regression
• Logistic regression is a very commonly used supervised machine learning algorithm which finds extensive application in classification-type problems.
• Examples of applications include identifying which employees are more likely to quit or stay, which customers may churn, or whether a client is a potential loan defaulter.
• There are, broadly speaking, three types of logistic regression:
A) Binary / binomial logistic regression – when the target variable has only two outcomes (e.g., Yes / No)
B) Multinomial logistic regression – when the target variable has multiple possible outcomes with no inherent order (e.g., types of stop signs)
C) Ordinal logistic regression – when the target variable has multiple possible outcomes with a necessary ordering of levels, such as high, medium and low performers
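As a concrete illustration of case (A), here is a minimal binary logistic regression in Python with scikit-learn. The feature names (tenure, overtime) and toy values are invented for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: [tenure_years, overtime_hours]; 1 = quit, 0 = stayed
X = np.array([[1, 60], [2, 55], [8, 10], [10, 5], [1, 50], [9, 8]])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Predict the class and the class probabilities for a new employee
new_emp = [[2, 45]]
print(model.predict(new_emp))        # predicted label, 0 or 1
print(model.predict_proba(new_emp))  # [P(stay), P(quit)]
```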
White-box Supervised Learning
• Unlike more complex supervised learning algorithms, LR is known as a white-box model.
• White-box models are those which are comparatively easy to understand and whose mechanism of operation is at least partially, if not completely, observable.
• In this type of supervised learning, historical data is available to the analyst.
• This historical dataset, also known as training data, contains an outcome variable with defined labels (discrete / continuous values). For example, we may label the output 0 or 1, where 1 means a positive outcome and 0 a negative outcome.
• The goal is to predict discrete values belonging to a particular class and to evaluate the model on the basis of accuracy.
• The task can be either binary or multi-class classification.
• In binary classification, the model predicts either 0 or 1 (yes or no); in multi-class classification, the model predicts one of more than two classes.
• For example, Google's Gmail classifies e-mails into more than two classes, such as Social, Promotions, Updates and Forums.
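A small sketch of the multi-class case, using the three-class Iris dataset as a stand-in (scikit-learn's LogisticRegression handles the multinomial case automatically):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has three classes, so the fitted model is multinomial:
# it predicts one of several labels rather than just 0/1
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:3]))        # one of three class labels per row
print(clf.predict_proba(X[:1]))  # one probability per class, summing to 1
```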

Conducting LR analysis in Orange
The workflow (see figure) connects the following widgets:
• File – imports the data from the source
• Select Columns – helps in specifying the features, target and meta-attributes
• Logistic Regression – the LR algorithm
• Test & Score – helps in computing the model performance (AUC, CA, Precision, Recall, F1 Score)
• Rank – feature importance analysis
LR OUTPUT (MODEL ACCURACY)
Here the data has been partitioned using the k-fold cross-validation technique, where k = number of folds = 10.
If the data set is not very large, it is better to do the conventional 70:20:10 split. In Orange, two splits are available by default: the training set size is 66% and the test set size is 34%. We can make it 70:30 if we choose to, and can also go for more than two splits.

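Both partitioning options described above, sketched in Python (the dataset and random_state are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Option 1: k-fold cross validation with k = 10
print(cross_val_score(model, X, y, cv=10).mean())

# Option 2: a single holdout split, e.g. 70% train / 30% test
# (Orange's default split is roughly 66% / 34%)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=42)
print(model.fit(X_tr, y_tr).score(X_te, y_te))
```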
Cross-validation (tackling the issue of imbalanced data)
[Figure: 5-fold cross validation]
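The usual way cross-validation is adapted to imbalanced data is stratification: every fold keeps the full dataset's class ratio. A small illustrative sketch (the 90:10 imbalance is invented):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 90 zeros, 10 ones
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Each of the 5 folds keeps the same 9:1 class ratio as the full data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: positives in test fold = {y[test_idx].sum()}")
```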
A Typical Confusion Matrix Output in Orange: Calculate Sensitivity & Specificity

How to assess classification accuracy?
• True Positive cases (TP): % of cases (with respect to total predictions) correctly predicted as belonging to the class labeled 1
• True Negative cases (TN): % of cases (with respect to total predictions) correctly predicted as belonging to the class labeled 0
• False Positive cases (FP): % of cases incorrectly predicted as 1 when the actual class label is 0
• False Negative cases (FN): % of cases incorrectly predicted as 0 when the actual class label is 1
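These four counts can be read straight off a confusion matrix; a minimal sketch with made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual class labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# With labels ordered [0, 1], ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```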

Prediction Accuracy Metrics
• Sensitivity (correctly predicted positive cases / total actual positive cases): the percentage of correct predictions of the outcome happening. Also known as recall.
• Specificity (correctly predicted negative cases / total actual negative cases): the percentage of correct predictions of the outcome not happening.
• Precision (correctly predicted positive cases / total predicted positive cases): the proportion of predicted positive cases that were actually positive.
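Plugging the TP/TN/FP/FN counts from the confusion matrix sketch above into these three formulas:

```python
# Counts from the earlier confusion matrix example
tp, tn, fp, fn = 3, 3, 1, 1

sensitivity = tp / (tp + fn)   # recall: correct positives / actual positives
specificity = tn / (tn + fp)   # correct negatives / actual negatives
precision   = tp / (tp + fp)   # correct positives / predicted positives
print(sensitivity, specificity, precision)  # 0.75 0.75 0.75
```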

A diagrammatic representation of Sensitivity and Specificity

ROC Curve
• ROC stands for receiver operating characteristic.
• The term gets its origin from military RADAR operations.
• It is a graph showing the performance of a classification model at various prediction thresholds.
ROC Curve
Note:
• The higher the TP rate relative to the FP rate, the steeper the curve with respect to the Y-axis.
• The higher the TP rate relative to the FP rate, the better the accuracy.
• Hence, the steeper the curve, the better the accuracy of the model.

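A minimal sketch of how the ROC curve and its AUC are computed from a model's predicted scores (the labels and scores below are invented):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = np.array([0, 0, 1, 1, 0, 1, 1, 0])        # actual labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8,
                     0.2, 0.7, 0.9, 0.5])            # predicted P(class 1)

# fpr/tpr pairs trace the ROC curve as the threshold varies
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))
```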
Which independent variables are predicting the outcome more?
• This question cannot be directly addressed through LR as an ML algorithm.
• That means there is no concept of an odds ratio when LR is used as an ML algorithm.
• So is there no way to find which IVs were the main features behind the prediction accuracy of the LR model?
• Yes, there is. This analysis is known as independent variable importance analysis / sensitivity analysis.
• In Orange, we can conduct this analysis with the help of the RANK widget.

Computing Rank of Variables
a) Information Gain: the expected amount of information (reduction of entropy)
b) Information Gain Ratio: a ratio of the information gain and the attribute’s intrinsic information,
which reduces the bias towards multivalued features that occurs in information gain
c) Gini decrease: the inequality among values of a frequency distribution
d) ANOVA: the difference between average values of the feature in different classes
e) Chi Square: dependence between the feature and the class as measured by the chi-square statistic
f) ReliefF: the ability of an attribute to distinguish between classes on similar data instances
g) FCBF (Fast Correlation Based Filter): entropy-based measure, which also identifies redundancy
due to pairwise correlations between features
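Two of these scorers have close scikit-learn analogues, which gives a rough, non-Orange sketch of what the Rank widget computes. Mutual information is used here as a stand-in for information gain, on a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, mutual_info_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Mutual information approximates the "Information Gain" score;
# chi2 corresponds to the "Chi Square" score (features must be non-negative)
mi = mutual_info_classif(X, y, random_state=0)
chi_scores, _ = chi2(X, y)

# Top five features by information gain, as the Rank widget displays
top5 = sorted(zip(data.feature_names, mi),
              key=lambda t: t[1], reverse=True)[:5]
for name, score in top5:
    print(f"{name}: {score:.3f}")
```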
Rank – independent variable importance analysis
Based on Information Gain and Gini decrease, this widget shows the top five variables predicting the outcome.

Workflow for Predicting future outcomes
Predicted outcome table
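A sketch of that workflow in Python: train on the labelled historical data, then produce a predicted outcome table for new, unlabelled records. A built-in dataset stands in for both files, and the 500-row split is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Stand-ins for the historical (labelled) and current (unlabelled) data
X, y = load_breast_cancer(return_X_y=True)
X_hist, y_hist, X_new = X[:500], y[:500], X[500:]

model = LogisticRegression(max_iter=5000).fit(X_hist, y_hist)

# "Predicted outcome table": predicted label plus class probabilities
for label, proba in zip(model.predict(X_new)[:5],
                        model.predict_proba(X_new)[:5]):
    print(label, proba.round(3))
```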
Exercise 1: Machine Failure Prediction
• The operations manager at the firm wished to create a predictive model to forecast future machine failure situations. Such a predictive mechanism can be extremely helpful for preventive maintenance, saving substantial cost in terms of time lost to downtime.
• The manager shared a historical dataset consisting of 7905 data points containing various input features pertaining to the machine operation, as highlighted in Table 1.
• There are essentially ten columns of data, with the target variable specified as Failure Status. The input features were chosen based on inputs from plant engineers and front-line operators.
• The actual dataset is available as Exercise 1a and Exercise 1b.
• Conduct the analysis using Logistic Regression.
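A possible starting point in Python, assuming the dataset loads into a DataFrame. The file name and extension are guesses (adjust to the actual Exercise 1a file), and "Failure Status" is the target named in the brief:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# File name/extension is an assumption; adjust to the actual dataset file
df = pd.read_excel("Exercise 1a.xlsx")

# "Failure Status" is the target variable named in the exercise
y = df["Failure Status"]
X = pd.get_dummies(df.drop(columns=["Failure Status"]))  # encode categoricals

model = LogisticRegression(max_iter=5000)
print(cross_val_score(model, X, y, cv=10).mean())  # 10-fold CV accuracy
```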
Exercise 2 – Attrition Prediction
• The manager shared a historical dataset containing various input features as highlighted in Table 2.
There are essentially 28 columns of data, with the target variable being specified as Attrition with
two levels (Yes and No).
• In total there are 1300 data points. The input features were retrieved from existing data records
from the Human Resource Information System (HRIS) of the organization.
• The actual dataset is available as Exercise 2a (historical employee data) and Exercise 2b (current
employee data).
• Conduct the analysis using Logistic Regression

THANK YOU
