
Road Map to Predictive Analytics

Dr. P.K.Viswanathan
Professor(Analytics)
The present competitive environment is witnessing a cornucopia of data that is increasing at an astonishing rate, beyond human imagination.
AI is the New Electricity

"AI will transform every industry just like electricity transformed them 100 years back."
- Andrew Ng
Connection between Analytics and AI/ML/DL

▪ Artificial Intelligence (AI) is the major field
▪ Machine Learning (ML) is a subfield of AI
▪ Deep Learning (DL) is a subfield of ML
Pillars of Analytics

▪ Descriptive Analytics: What has happened?
▪ Diagnostic Analytics: Why has it happened?
▪ Predictive Analytics: What will happen?
▪ Prescriptive Analytics: What should be done?


What is Predictive Analytics?

▪ Predictive analytics involves the use of data and quantitative modeling to predict future trends and events. Predictive analytics generates potential future scenarios that can help drive strategic decisions.

▪ In today's internet and information technology world, predictive analytics uses machine-learning algorithms to automate strategic decisions.
Predictive Analytics: Examples
▪ I have a large amount of data on various customer characteristics. Can you segment the market appropriately and then predict, within each segment, whether a customer will buy my new product? Which segment has the highest probability of buying?

▪ Can you predict when customer churn will take place so that my company can take appropriate action and save a lot of money?

▪ What is the chance that a customer will default on a loan if I choose to grant it?
▪ What is the market demand for the new product that I would like to launch?
Why the term “Predictive Analytics”?

A data set is split into training data and test data. The training data, together with the algorithms, produces the predictive model; the model is then applied to the test data to generate predictions.


Supervised Learning

Nature of Y      Nature of X    Model to Use?
1) Continuous    Continuous     Multiple Regression
2) Continuous    Categorical    Dummy Regression
3) Continuous    Mixed          Multiple Regression (Dummy Coding for Categorical)
Supervised Learning

Nature of Y         Nature of X    Model to Use?
4) Binary (0/1)     Continuous     Logistic Regression / Discriminant Analysis
5) Binary (0/1)     Categorical    Logistic Regression
6) MultiClass (>2)  Continuous     Multiple Discriminant Analysis


Supervised Learning
Modern Classifiers

• CART

• Neural Nets

• Random Forest

• Support Vector Machines(SVM)

• Naive Bayes
Unsupervised Learning

Nature of X     Model to Use
1) Continuous   If the variables are highly correlated, collapse them into dimensions using Principal Component Analysis.
2) Continuous   If the aim is to reduce the number of objects, use Cluster Analysis for segmenting into groups.
3) Categorical  Use Correspondence Analysis for dimension derivation and clustering.
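As a minimal illustration of rows 1) and 2) above, here is a hedged Python sketch using scikit-learn; the matrix X, the number of components, and the number of clusters are placeholder choices, not part of the original material.

```python
# Sketch only: PCA for dimension reduction and k-means for segmentation
# on a continuous predictor matrix X. All sizes below are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(100, 6)                      # stand-in for the real data

X_std = StandardScaler().fit_transform(X)       # both methods are scale sensitive

pca = PCA(n_components=2)                       # collapse correlated variables into 2 dimensions
scores = pca.fit_transform(X_std)
print("Variance explained:", pca.explained_variance_ratio_)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)   # segment objects into 3 groups
segments = kmeans.fit_predict(X_std)
print("First ten segment labels:", segments[:10])
```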
Quick-Review Test

1) In an analytic study to understand consumer behavior toward buying, the response variable rating is continuous (a scale of 1 to 7 was used) and it depends on advertisement with three levels (low budget, medium budget, and high budget) and price with two levels (high and low). The model to predict consumer rating is

Dummy Regression (Preference Decomposition and Conjoint Analysis are also correct)
Quick-Review Test

2) If the objective is to classify consumers into low risk takers, medium risk takers, and high risk takers based on key characteristics, the models that can be used are

Discriminant Analysis, Logistic Regression, and Neural Networks.
BABI-Review Test

3) In a predictive modeling study to predict loan default, two independent variables were used, namely Income and Current Loans in Credit Card. The logistic regression gave the odds ratio (Exp(Beta)) corresponding to Current Loans in Credit Card as 2.78. Interpret this number.
[0 represents Not a Defaulter and 1 represents a Defaulter]

For every unit increase in Current Loans, the odds of being a defaulter (versus a non-defaulter) are multiplied by 2.78.
Quick-Review Test

4) When a very large number of variables are involved in a study to understand selling behavior and the objective is to collapse these variables into manageable dimensions, the appropriate technique is

Principal Component Analysis
Quick-Review Test

5) When we want to understand interaction between factors and the relationship between variables, we use

ANOVA and Correlation (Correlation and Regression is also correct)
Logistic Regression-A Conceptual Framework

Presentation
Dr. P.K.Viswanathan
Professor(Analytics)
Logistic Regression-Examples

▪ Banking: What is the likelihood that someone will default on a loan or prepay her mortgage?

▪ Marketing: What is the likelihood of someone responding to a mail campaign?

▪ Medicine: What is the likelihood a patient will get well or die?

▪ Fraud Detection: What is the likelihood a transaction/claim is fraudulent?
Why Logistic Regression?

▪ No matter how hard we try, there is no guarantee in OLS regression that the predicted probability of the dependent variable will lie in the range 0 to 1.

▪ In all likelihood, a few observations will fall outside 0 to 1, which makes no sense for a probability.

▪ Hence Logistic Regression is used by analytics professionals.


Odds and Probabilities

Probability = Odds / (Odds + 1)
Odds = Probability / (1 - Probability)
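A quick sketch of these two conversions in Python (nothing assumed beyond the formulas above):

```python
def odds_from_probability(p):
    """Odds = Probability / (1 - Probability); p must be strictly between 0 and 1."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Probability = Odds / (Odds + 1); odds must be non-negative."""
    return odds / (odds + 1.0)

print(odds_from_probability(0.8))   # 4.0  (odds of 4 to 1)
print(probability_from_odds(4.0))   # 0.8
```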
Why Odds Anyway?

▪ Odds are used to counteract the fact that linear regression produces probability values outside the range of 0 and 1.

▪ Working with the odds removes the upper bound of 1, since odds run from 0 to infinity; taking the natural log of the odds removes the lower bound of 0, so the log-odds can be modeled as a linear function of the predictors.
Visual of Logit Curve
Logistic Regression

◼ Logistic Regression Equation

The relationship between probability P and X1, X2, . . . , Xk is described by the following equation:

P = e^Z / (1 + e^Z)

Z = b0 + b1X1 + b2X2 + ... + bkXk
X1, X2, . . . , Xk are the predictor variables
P represents the probability that Y = 1
1 - P represents the probability that Y = 0
Logistic Regression
Maximum Likelihood Estimation (MLE)

When Y = 1, L = P
When Y = 0, L = 1 - P
This is for a single data point; the likelihood for the whole sample is the product over all data points.
Maximizing L is the same as maximizing Log L (base e):

Log L = Σ Y log(P) + Σ (1 - Y) log(1 - P)
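A small NumPy sketch of this log-likelihood for a given set of coefficients; the arrays y and X and the coefficients b0, b are placeholders for whatever data and estimates are at hand.

```python
# Sketch: Log L = sum[ Y*log(P) + (1-Y)*log(1-P) ] with P = e^Z / (1 + e^Z).
import numpy as np

def log_likelihood(b0, b, X, y):
    z = b0 + X @ b                       # Z = b0 + b1*X1 + ... + bk*Xk
    p = 1.0 / (1.0 + np.exp(-z))         # algebraically equal to e^Z / (1 + e^Z)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Maximum likelihood estimation chooses b0 and b to make this value as large as possible.
```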
Walk the Talk
Simmons Catalogue¹

Simmons' catalogs are expensive, and Simmons would like to send them only to those customers who have the highest probability of making a $200 purchase using the discount coupon included in the catalog. Simmons' management thinks that annual spending at Simmons Stores and whether a customer has a Simmons credit card are two variables that might be helpful in predicting whether a customer who receives the catalog will use the coupon to make a $200 purchase.

1. Adapted from Anderson, Sweeney, and Williams purely for classroom discussion
Simmons Catalogue-Continues
Simmons conducted a study by sending out 100 catalogs: 50 to customers who have a Simmons credit card and 50 to customers who do not have the card. At the end of the test period, Simmons noted for each of the 100 customers:
1) the amount the customer spent last year at Simmons,
2) whether the customer had a Simmons credit card, and
3) whether the customer made a $200 purchase.
The data file that contains the information is Logit-Simmons.csv.

Develop a logistic regression model, obtain the output, and interpret the results.
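A hedged sketch of how the model might be fitted in Python with statsmodels; the column names Spending, Card, and Purchase are assumptions about the headers in Logit-Simmons.csv, not confirmed by the file.

```python
# Sketch only: logistic regression for the Simmons study.
# Column names Spending, Card, Purchase are assumed, not confirmed.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("Logit-Simmons.csv")

X = sm.add_constant(df[["Spending", "Card"]])   # annual spending, credit-card indicator
y = df["Purchase"]                              # 1 = made the $200 purchase, 0 = did not

model = sm.Logit(y, X).fit()
print(model.summary())                          # coefficients and p-values
print(np.exp(model.params))                     # odds ratios, Exp(Beta)
```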
Example Problem-Books By Mail from Paul Green purely for
Classroom Discussion

• The Books By Mail company is interested in offering a new title called The Art History of Florence and sent a test mailing to 1,000 existing customers. Of these, 83 actually purchased the book, a response rate of 8.3 percent. The company also sent out an identical mailing to another 1,000 customers to serve as a holdout sample. The scope of the study is primarily confined to predicting whether a customer will buy the new book or not, based on two input variables, namely months since last purchase and number of art books purchased. The data files of the existing customers and the holdout sample are given in "PaulBooks1.csv" and "Paulbooks2.csv" respectively.
Any Practical Value for Books By Mail?

We can assess the operational significance of the model by using it to determine a mailing strategy for the 1,000 customers in the holdout sample and then assessing the profitability of the strategy. The cost of mailing an offer to purchase The Art History of Florence is $1; if the customer responds and purchases the book, then the net profit (after the cost of mailing) is $6. What should be the mailing strategy?
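One way to reason about it: mail a customer only when the expected profit p × $6 - (1 - p) × $1 is positive, i.e. when the predicted purchase probability p exceeds 1/7 ≈ 0.143. The sketch below assumes the column names Months, ArtBooks, and Buy for the two predictors and the response; they are placeholders, not the actual headers.

```python
# Sketch: expected-profit mailing rule. Mail only if p*6 - (1-p)*1 > 0, i.e. p > 1/7.
# Column names Months, ArtBooks, Buy are assumptions about the CSV headers.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("PaulBooks1.csv")
holdout = pd.read_csv("Paulbooks2.csv")

features = ["Months", "ArtBooks"]        # months since last purchase, art books purchased
clf = LogisticRegression().fit(train[features], train["Buy"])

p = clf.predict_proba(holdout[features])[:, 1]
mail = p > 1.0 / 7.0                     # break-even purchase probability

expected_profit = (p[mail] * 6 - (1 - p[mail]) * 1).sum()
print(f"Mail {mail.sum()} of {len(holdout)} customers; expected profit about ${expected_profit:.2f}")
```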
Fisher’s Linear Discriminant Analysis
A Multidimensional Perspective

Presentation
Dr. P.K.Viswanathan
Professor(Analytics)
What is Discriminant Analysis?

▪ Do heavy users and light users of our product differ in some other characteristics?
▪ Is income a good predictor of this user status?
▪ Does amount of formal education discriminate between viewers and non-viewers?

The objective of discriminant analysis is to use the information from the predictor variables to achieve the clearest possible separation or discrimination among groups.
The Three Key Goals of Discriminant Analysis

1. Profiling

2. Differentiation

3. Classification
Applications of LDA
▪ In a textile mill, cotton quality depends on its chemical characteristics. LDA can create the required score. If the score is more than a threshold value (cut-off point), cotton quality is Good; else Bad.

▪ In sanctioning a loan for a customer, a bank uses a number of financial indicators. The LDA score developed based on these indicators can be used as a classifier. If the score is more than a threshold value, classify the customer as a defaulter; else a non-defaulter.

▪ For MBA admission in a business school, whether to admit a candidate or not can be based on a discriminant score built from past scholastic record (Grade Point Average), GMAT score, and performance score in the interview.

▪ The discussion here will be confined to two groups only, as most of the applications involve a dichotomous situation. However, LDA can easily handle multiple classes.
Math Behind LDA

Z = a1x1 + a2x2 + ... + akxk
Z1 = a1x1(I) + a2x2(I) + ... + akxk(I)
Z2 = a1x1(II) + a2x2(II) + ... + akxk(II)
Z1 - Z2 = |a1D1 + a2D2 + a3D3 + ... + akDk| = |aD|

where D1, D2, ..., Dk are the differences in means between the two groups for the predictor variables x1, x2, ..., xk respectively. The values of a1, a2, ..., ak will be chosen so as to

Maximize Z1 - Z2
subject to the constraint Var(Z) = 1
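A minimal NumPy sketch of this maximization, using the standard closed-form result that the maximizing weights are proportional to S⁻¹D, where S is the pooled within-group covariance matrix; the arrays X1 and X2 (one row per observation in each group) are placeholders.

```python
# Sketch: Fisher discriminant weights a proportional to S_pooled^{-1} D,
# rescaled so that Var(Z) = a' S a = 1.
import numpy as np

def fisher_weights(X1, X2):
    D = X1.mean(axis=0) - X2.mean(axis=0)                  # differences in group means
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    n1, n2 = len(X1), len(X2)
    S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)    # pooled within-group covariance
    a = np.linalg.solve(S, D)                              # direction maximizing Z1 - Z2
    a /= np.sqrt(a @ S @ a)                                # enforce Var(Z) = 1
    return a

# Z scores for new observations: Z = X_new @ a
```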
Data Set
The seminal paper of Altman[1] classified and predicted corporate bankruptcy
based on a set of financial ratios. Z score of Fisher’s linear discriminant analysis
was employed to classify the firm into either “Bankrupt” or “Solvent”. The data
used in the study were from manufacturing corporations. The data set has 33
bankrupt firms and 33 solvent firms. The central goal was whether the bankrupt
firms and solvent firms could be sharply differentiated (separated) in terms of five
financial ratios. They are Working Capital/Total Assets(WCTA), Retained
Earnings/Total Assets(RETA), Earnings Before Interest and Taxes/Total
Assets(EBITTA), Market Value of Equity/Book Value of Total Debt(MVEBVTD), and
Sales/Total Assets (SATA). The original data set has been obtained from Morrison's book
on “Multivariate Statistical Analysis”.

[The abbreviations within brackets are made for ease of identifying the ratios].
Brief Description of the Ratios

▪ WCTA: The Working Capital/Total Assets ratio, frequently found in studies of corporate problems, is a measure of the net liquid assets of the firm relative to the total capitalization. Ordinarily, a firm experiencing consistent operating losses will have shrinking current assets in relation to total assets.

▪ RETA: Retained Earnings/Total Assets is a measure of cumulative profitability over time. The age of a firm is implicitly considered in this ratio. For example, a relatively young firm will probably show a low RETA ratio and is more vulnerable to becoming "Bankrupt" compared with well-established older firms that would have accumulated substantial earnings.
Brief Description of the Ratios

▪ EBITTA: This ratio is calculated by dividing a firm's earnings before interest and taxes by its total assets. Since a firm's ultimate existence is based on the earning power of its assets, this ratio appears to be particularly appropriate for studies dealing with corporate failure.

▪ MVEBVTD: This ratio measure shows how much the firm's assets can decline in
value (measured by market value of equity plus debt) before the liabilities exceed
the assets and the firm becomes insolvent. It also appears to be a more effective
predictor of bankruptcy than the more commonly used ratio: Net worth/Total debt
Brief Description of the Ratios

▪ SATA: The capital-turnover ratio is a standard financial ratio illustrating the sales
generating ability of the firm's assets. It is one measure of management's capability
in dealing with competitive conditions. This final ratio is quite important because of
its unique relationship to other variables in the model. Statistically speaking,
perhaps, this ratio would appear to be least significant in discriminating power.
Profiling-Descriptive

Group Means
WCTA RETA EBITTA MVEBVTD SATA
Bankrupt -6.05 -62.51 -31.78 40.05 1.50
Solvent 41.38 35.25 15.32 254.67 1.94
Differentiation-Visual-WCTA
Differentiation-Visual-RETA
Differentiation-Visual-EBITTA
Differentiation-Visual-MVEBVTD
Differentiation-Visual-SATA
LDA -Z score Equation

Hyperplane Equation(Z score)

Z=0.0153WCTA+0.0183RETA+0.0418EBITTA+0.0077MVEBVTD+1.2543SATA

Cut off Point Score=2.9714

If the Score is >= 2.9714, Predict "Solvent"
If the Score is < 2.9714, Predict "Bankrupt"
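A hedged sketch applying this rule in Python; it assumes the ratio columns in the data frame are named exactly WCTA, RETA, EBITTA, MVEBVTD, and SATA.

```python
# Sketch: apply the fitted Z-score equation and the 2.9714 cut-off.
import pandas as pd

def classify_firms(df):
    z = (0.0153 * df["WCTA"] + 0.0183 * df["RETA"] + 0.0418 * df["EBITTA"]
         + 0.0077 * df["MVEBVTD"] + 1.2543 * df["SATA"])
    return pd.Series(["Solvent" if s >= 2.9714 else "Bankrupt" for s in z], index=df.index)

# Usage: predictions = classify_firms(pd.read_csv("firms.csv"))  # file name is a placeholder
```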
Confusion Matrix

Accuracy: 95.45%
Correlation between DA (Z Scores) and Input Variables (in absolute terms)

Input Variable   Correlation with DA   Rank
WCTA             0.7304                3
RETA             0.8702                1
EBITTA           0.6809                4
MVEBVTD          0.7352                2
SATA             0.2589                5
ROC Curve
Discriminant Analysis

• LDA was discovered by Ronald Fisher
• The three goals of LDA are Profiling, Differentiation, and Classification
• It has wide applications and is easy to implement
• It is capable of handling multiclass problems
• t = sqrt(r²/(1 - r²)) × sqrt(n - 2) follows a t distribution with n - 2 df (used to test the significance of a correlation r)
Support Vector Machines-A Conceptual
Framework
Dr. P.K.Viswanathan
Professor Analytics
What is SVM?
▪ Support Vector Machines (SVM), originally invented in 1963 by Vladimir N. Vapnik, have the goal of achieving the largest separation between classes.
▪ Support Vector Machines are based on the concept of
hyperplanes that define decision boundaries.
▪ A hyperplane is one that separates the objects of one
class from another.
Recruitment Example

[Scatter plot of candidates: each point is marked either "Hire the Candidate" or "Do not Hire the Candidate"]

X1 is the Aptitude Test Score
X2 is the Interview Performance Score
Linearly separable data points

• Data that can be separated by a line (or, in general, a hyperplane) is known as linearly separable data.
• The hyperplane acts as a linear classifier.
Good vs Bad Separator?
SVM Scenario
▪ Find lines that correctly classify the training data.
▪ Among all such lines, pick the one that has the greatest distance to the points closest to it.
▪ The closest points that identify this line are known as support vectors.
▪ The region they define around the line is known as the margin.

Math Behind SVM

• Hyperplane: wᵀx + b = 0
• The margin hyperplanes through the closest points xa and xb:
  wᵀxa + b = 1 and wᵀxb + b = -1
• Extra scale constraint: min over i = 1, ..., n of |wᵀxi + b| = 1
• This implies:
  wᵀ(xa - xb) = 2
  margin width d = ||xa - xb|| = 2/||w||
Math Behind SVM
• We can formulate the quadratic optimization problem:

Find w and b such that
d = 2/||w|| is maximized; and for all {(xi, yi)}:
wᵀxi + b ≥ 1 if yi = 1; wᵀxi + b ≤ -1 if yi = -1

• A better formulation:

Find w and b such that
Φ(w) = ½ wᵀw is minimized;
and for all {(xi, yi)}: yi(wᵀxi + b) ≥ 1

Math Behind SVM Simplified

Minimize (W1² + W2² + W3² + ... + Wk²)/2

Subject to the constraints (one for each of the n data points, the X values in each constraint being those of that data point):

Y1(W1X1 + W2X2 + W3X3 + ... + WkXk + b) >= 1
Y2(W1X1 + W2X2 + W3X3 + ... + WkXk + b) >= 1
...
Yn(W1X1 + W2X2 + W3X3 + ... + WkXk + b) >= 1
Learning SVM–Classroom Exercise-Walk the Talk
*The file DiscriWinstonFR.csv contains information on the following items about 24 companies: EBITASS (Earnings Before Interest and Taxes, divided by Total Assets), ROTC (Return on Total Capital), and Group (1 for "Most Admired" and 2 for "Least Admired" companies).

1) Apply SVM to classify a company as a most admired or least admired company.
2) Draw the hyperplane separating the data points with (Wx+b=0, Wx+b=-1, Wx+b=+1).

*Problem adapted from "Management Science Modeling" by Winston and Albright purely for Classroom Discussion
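A hedged scikit-learn sketch for step 1; the column names EBITASS, ROTC, and Group come from the description above, and the choice of a linear kernel with a large C (to approximate a hard margin) is a modeling assumption, not part of the original exercise.

```python
# Sketch: linear SVM on the two financial ratios; a large C approximates a hard margin.
import pandas as pd
from sklearn.svm import SVC

df = pd.read_csv("DiscriWinstonFR.csv")
X = df[["EBITASS", "ROTC"]]
y = df["Group"]                           # 1 = Most Admired, 2 = Least Admired

svm = SVC(kernel="linear", C=1e6).fit(X, y)

w = svm.coef_[0]                          # hyperplane weights W
b = svm.intercept_[0]                     # offset b, so the boundary is Wx + b = 0
print(f"Hyperplane: {w[0]:.2f}*EBITASS + {w[1]:.2f}*ROTC + {b:.2f} = 0")
print("Training accuracy:", svm.score(X, y))
```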
Hyperplane Separating Data Points

[Scatter plot of ROTC (vertical axis) versus EBITASS (horizontal axis) for the 24 companies, with the separating hyperplane Wx + b = 0]
Hyperplane Separating the Data Points With Decision Boundaries

[Scatter plot of ROTC versus EBITASS showing the lines Wx + b = 0, Wx + b = -1, and Wx + b = +1. Hyperplane equation: 24.97X1 + 91.98X2 - 14.49 = 0]
Confusion Matrix

                 Predicted
Actual           Most Admired   Least Admired
Most Admired     12             0
Least Admired    0              12

Accuracy: 100.00%
Dataset with noise

◼ Hard Margin: So far we require all data points to be classified correctly
  - No training error
◼ What if the training set is noisy?
  - Solution 1: use very powerful kernels
    OVERFITTING!
Hard Margin vs. Soft Margin
◼ The old formulation:

Find w and b such that
Φ(w) = ½ wᵀw is minimized and for all {(xi, yi)}:
yi(wᵀxi + b) ≥ 1

◼ The new formulation incorporating slack variables:

Find w and b such that
Φ(w) = ½ wᵀw + CΣξi is minimized and for all {(xi, yi)}:
yi(wᵀxi + b) ≥ 1 - ξi and ξi ≥ 0 for all i

◼ Parameter C can be viewed as a way to control overfitting.
Non-linear SVMs: Feature spaces
◼ General idea: the original input space can always be mapped
to some higher-dimensional feature space where the training
set is separable:

Φ: x → φ(x)
The “Kernel Trick”
◼ The linear classifier relies on the dot product between vectors: K(xi, xj) = xiᵀxj
◼ If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
K(xi, xj) = φ(xi)ᵀφ(xj)
◼ A kernel function is some function that corresponds to an inner product in some expanded feature space.
◼ Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
= 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
Examples of Kernel Functions
◼ Linear: K(xi, xj) = xiᵀxj

◼ Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p

◼ Gaussian (radial-basis function network): K(xi, xj) = exp(-‖xi - xj‖² / (2σ²))

◼ Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
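For reference, these kernels correspond to the `kernel` argument of scikit-learn's SVC; the parameter values below are illustrative only.

```python
# Sketch: the kernels above expressed as SVC configurations (illustrative values).
from sklearn.svm import SVC

linear_svm  = SVC(kernel="linear", C=1.0)
poly_svm    = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)   # (1 + xi.xj)^p with p = 2
rbf_svm     = SVC(kernel="rbf", gamma=0.5, C=1.0)                         # gamma = 1 / (2*sigma^2)
sigmoid_svm = SVC(kernel="sigmoid", gamma=0.5, coef0=0.0, C=1.0)          # tanh(gamma*xi.xj + coef0)
```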


Properties of SVM
• Flexibility in choosing a similarity function
• Sparseness of solution when dealing with large data sets
• Ability to handle large feature spaces
• Nice math property: a simple convex optimization problem
which is guaranteed to converge to a single global solution
• Feature Selection
SVM Applications

▪ Text (and hypertext) categorization
▪ Image classification
▪ Cancer classification
▪ Hand-written character recognition
Decision Tree and Random Forest

Presentation
Dr. P.K.Viswanathan
Professor(Analytics)
Components of a decision tree

▪ Root node
▪ Internal nodes / decision nodes
▪ Leaf nodes / terminal nodes
Generator Example

• An electric generator manufacturer would like to find a way of classifying businesses in a city into those likely to purchase a generator and those not likely to buy one.

• Data: 12 owners and 12 nonowners of generators in the city, their electricity consumption and their business income.
Dataset

Classify those likely to purchase using the Income and electricity Consumption.

Income   Consumption   Ownership
60       18.4          Owner
85.5     16.8          Owner
64.8     21.6          Owner
61.5     20.8          Owner
87       23.6          Owner
110.1    19.2          Owner
108      17.6          Owner
82.8     22.4          Owner
69       20            Owner
93       20.8          Owner
51       22            Owner
81       20            Owner
75       19.6          Nonowner
52.8     20.8          Nonowner
64.8     17.2          Nonowner
43.2     20.4          Nonowner
84       17.6          Nonowner
49.2     17.6          Nonowner
59.4     16            Nonowner
66       18.4          Nonowner
47.4     16.4          Nonowner
33       18.8          Nonowner
51       14            Nonowner
63       14.8          Nonowner
How decision tree works – step 1

Scatterplot of Income vs Consumption: owners vs non-owners. The procedure will choose Income for the first split, with a splitting value of 60. The split creates 2 rectangles, each more homogeneous than the rectangle before the split. The left rectangle contains points that are mostly nonowners (seven nonowners and one owner); the right rectangle contains mostly owners (11 owners and five nonowners).

Basis of split
Split points are ranked according to how much they reduce impurity (heterogeneity) in the resulting rectangles. A pure rectangle is one that is composed of a single class (e.g., owners).
How decision tree works – step 2

Splitting the 24 records first by the Income value of 60, the next split is on Consumption of 21. This yields a pure node of 1 owner, a pure node of 7 owners, and a pure node of 7 nonowners. The final stage of recursive partitioning leaves each rectangle consisting of a single class (owners or nonowners).
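A small scikit-learn sketch of the same exercise, using the 24 records from the Dataset slide; the column names are chosen for this example, and the printed split values may differ slightly from 60 and 21 because scikit-learn splits midway between observed values.

```python
# Sketch: fit a classification tree on the 24 generator records and print its splits.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

income = [60, 85.5, 64.8, 61.5, 87, 110.1, 108, 82.8, 69, 93, 51, 81,
          75, 52.8, 64.8, 43.2, 84, 49.2, 59.4, 66, 47.4, 33, 51, 63]
consumption = [18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20, 20.8, 22, 20,
               19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16, 18.4, 16.4, 18.8, 14, 14.8]
ownership = ["Owner"] * 12 + ["Nonowner"] * 12

df = pd.DataFrame({"Income": income, "Consumption": consumption, "Ownership": ownership})

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(df[["Income", "Consumption"]], df["Ownership"])
print(export_text(tree, feature_names=["Income", "Consumption"]))
```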
Measures of impurity

• Gini Index
• Entropy measure

1. Calculate GINI for the overall rectangle

1 - (12/24)² - (12/24)² = 0.5
2. Calculation of GINI Index for the left and right rectangles

GINI index for the left rectangle: 1 - (7/8)² - (1/8)² ≈ 0.219
GINI index for the right rectangle: 1 - (11/16)² - (5/16)² ≈ 0.43
3. Weighted average of impurity measures

(8/24) × 0.219 + (16/24) × 0.43 = 0.359
GINI Index before & after the split

Gini index of the original rectangle: 0.5
Gini index after the split: 0.359
Steps in calculating GINI Index

Calculate GINI for the overall rectangle → calculate GINI for the left rectangle → calculate GINI for the right rectangle → combine the impurity of left and right as the weighted average of the impurity measures.
1. Calculate Entropy for the overall rectangle

-(12/24) × log2(12/24) - (12/24) × log2(12/24) = 1
2. Calculation of entropy for the left and right rectangles

Entropy for the left rectangle: -(7/8) × log2(7/8) - (1/8) × log2(1/8) ≈ 0.54
Entropy for the right rectangle: -(11/16) × log2(11/16) - (5/16) × log2(5/16) ≈ 0.89
3. Weighted average of entropy

(8/24) × 0.54 + (16/24) × 0.89 ≈ 0.779
Entropy before & after the split

Entropy of the original rectangle: 1
Entropy after the split: 0.779
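The numbers above can be verified with a few lines of Python; the counts are taken from the Income = 60 split shown earlier (left rectangle: 1 owner and 7 nonowners; right rectangle: 11 owners and 5 nonowners).

```python
# Sketch: verify the Gini and entropy figures for the split at Income = 60.
import numpy as np

def gini(counts):
    p = np.array(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

def entropy(counts):
    p = np.array(counts) / sum(counts)
    return -np.sum(p * np.log2(p))

overall, left, right = [12, 12], [1, 7], [11, 5]      # (owners, nonowners)

for name, impurity in [("Gini", gini), ("Entropy", entropy)]:
    before = impurity(overall)
    after = (8 / 24) * impurity(left) + (16 / 24) * impurity(right)
    print(f"{name}: before = {before:.3f}, after = {after:.3f}, gain = {before - after:.3f}")
```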
Information Gain

The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets. Information gain is a measure of this change in entropy. In the generator example, the information gain from the first split is 1 - 0.779 = 0.221.
Parameters in decision tree learning

• Choosing the splitting criterion: impurity-based criteria, information gain
• Binary or multiway splits: multiway split, binary split
• Finding the right sized tree: pre-pruning, post-pruning
Random Forests
Ensemble methods

• A single decision tree does not perform well

• But, it is super fast

• What if we learn multiple trees?

We need to make sure they do not all just learn the same thing.
Random Forest
• This is a widely used ensemble technique in view of its superior performance and scalability.

• It is an ensemble of decision trees, where each decision tree is built from bootstrap samples (K out of N records, drawn with replacement) and a randomly selected subset of features (m out of p, without replacement). The decision trees are normally grown deep (without pruning).

• The hyperparameters that can be tuned to increase the model accuracy in a Random Forest model are (see the sketch below):
1. Number of decision trees.
2. Number of records and features to be sampled.
3. Depth and split criterion (Gini impurity index or entropy).
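A hedged scikit-learn sketch showing where each of the three hyperparameter groups appears; the values are illustrative placeholders, not tuned recommendations.

```python
# Sketch: the Random Forest hyperparameters above as scikit-learn arguments.
# X and y stand for the training predictors and labels.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,        # 1. number of decision trees
    max_samples=0.8,         # 2. fraction of records bootstrapped for each tree
    max_features="sqrt",     #    number of features (m out of p) tried at each split
    max_depth=None,          # 3. grow trees deep (no pruning), as described above
    criterion="gini",        #    split criterion: "gini" or "entropy"
    random_state=0,
)
# rf.fit(X, y); rf.predict(X_new)
```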
