
Business Analytics

Chapter - 1

Business Analytics

Data Mining: An Introduction (Classification)


Analytics Lifecycle

The lifecycle runs as a cycle of six phases:
1. Discovery
   • Learn the business domain
   • Frame the business problems
2. Data Preparation
   • Transform the data to analyze it
3. Model Planning
   • Determine the methods, techniques and workflows
   • Explore the data to learn the relationships
4. Model Building
   • Partition the data if required
   • Execute different analytical models
5. Communicate Results
   • Communicate the results of failure or success to the stakeholders
6. Operationalize
   • Run a pilot project and implement your good models in a production environment
Business Usage of Data Mining
• From a large list of prospective customers, which are most likely
to respond? We can use classification techniques like logistic
regression, classification trees, etc.
• To find which customers are most likely to commit fraud (or
might already have committed it)?
• To find which loan applicants are likely to default?
• To find which customers are most likely to abandon a
subscription service (telephone, magazine, etc.)?
What Is Data Mining? Where Is Data Mining Used?
• The process of exploration and analysis by automatic or semi-
automatic means
• Involves large quantities of data in order to discover meaningful
patterns and rules

• The military uses data mining to learn what role various factors
play in the accuracy of bombs
• Intelligence agencies might use it to determine which of a huge
quantity of intercepted communications are of interest
• Medical researchers might use it to predict the likelihood of a
cancer relapse
Types of Data Mining Techniques
• Supervised learning algorithms
• Unsupervised learning algorithms
Supervised & Unsupervised Learning
• Supervised learning algorithms are those used in classification
and prediction
• We must have data available in which the value of the outcome
of interest (e.g., purchase or no purchase) is known

• Unsupervised learning algorithms are those used where there is
no outcome variable to predict or classify
• Association rules, data reduction methods, and clustering
techniques are all unsupervised learning methods
Supervised Learning Process

Data → Partitioning → Training Data + Testing Data
• The training data, with its existing response (Y), is used for model building
• The model is then applied to the testing data to produce a predicted response (Y)
• The predicted response is compared against the existing response (Y) of the testing data
Use and Creation of Partitions
We typically deal with two or three partitions:
1. A Training set:
a. Typically the largest partition
b. Contains the data used to build the various models we
are examining
c. Generally used to develop multiple models
2. A Validation set:
Used to assess the performance of each model so that you
can compare models and pick the best one
3. A Test set:
a. Used if we need to assess the performance of the
chosen model with new data
b. Used to overcome the over-fitting problem
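As a concrete illustration, here is a minimal R sketch of a random training/validation partition; the 60/40 split ratio and the use of R's built-in iris data are assumptions for the example, not from the slides:

set.seed(1)                                # for a reproducible split
n     <- nrow(iris)
idx   <- sample(n, size = round(0.6 * n))  # 60% of rows for training (assumed ratio)
train <- iris[idx, ]                       # training set
valid <- iris[-idx, ]                      # validation set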
Business Analytics

Predictive Classification Models


Need for Predictive Models
• Identify strong links between variables of a data table (columns). Such
a link will translate into, for example, an equation between one
variable y (the so-called "dependent" or "response" variable) and a
group of other variables {xi} (the so-called "independent variables", or
"predictors"): y = f(x1, x2, ..., xn) + small random noise
The discovery of such a link is an important piece of information by
itself, especially if the discovered link turns out to be causal

• Then use this equation for predicting the value of y for new individuals
whose value of y was not measured (and therefore that was not in the
original data table)
Need for Predictive Models
• When the response variable is numerical, predictive modeling is called
Regression

• When the response variable is nominal, predictive modeling is called
Classification. The values of the response variable are then modalities,
which can be considered "class labels"
Classification Models
• THE NAIVE RULE
• NAIVE BAYES
• K-NEAREST NEIGHBORS
Example 1: Predicting Response
• A telecom firm has many customers. Each customer either
talks for more than 100 minutes a week or for less. The firm has
launched a scheme, aimed specially at the customers who talk
more, to optimize the amount they spend on bills. Each
customer is a record, and the response of interest, Y = {Bought,
Not Bought}, has two classes that a customer can be classified
into: C1 = Bought and C2 = Not Bought.
• The telecom firm has data on 150 customers that it
investigated in the past.
• For each customer they have information on whether a
customer has bought the scheme or not bought it and
whether he/she talks for more than 100 minutes in a week or
not.
Example 1: Predicting Response
• After partitioning the data into a training set (106 customers)
and a validation set (44 customers), the counts obtained from
the training set are shown

                  Talk Time >= 100 min   Talk Time < 100 min   Total
Bought (C1)               49                      6              55
Not Bought (C2)           11                     40              51
Total                     60                     46             106

How can this information be used to classify a certain customer as
responding or not responding?
The Naive Rule
• Classify the record as a member of the majority class
• The naive rule would classify all customers as being responsive,
because 51.89% of the customers investigated in the training set
were found to be buying
• The naive rule is used mainly as a baseline for evaluating the
performance of more complicated classifiers
• A classifier that uses external predictor information should
outperform the naive rule
Naive Bayes
• The probability of a record belonging to a certain class is now
evaluated not only based on the prevalence of that class but also
on the additional information
• Naive Bayes works only with predictors that are categorical
• Numerical predictors must be binned and converted to
categorical variables before the naive Bayes classifier can use
them
Example Of Naive Bayes
• Web search companies such as Google use naive Bayes classifiers
to correct misspellings that users type in.
• When you type into Google a phrase that includes a misspelled
word, a spelling correction is suggested.
• The suggestions are based not only on the frequencies of
similarly spelled words typed by millions of other users, but also
on the other words in your phrase.
Conditional Probabilities
• A conditional probability of event A given event B [denoted by
P(A|B)] represents the chances of event A occurring only under
the scenario that event B occurs.
• In the response example, we are interested in P(bought| Talk
Time >=100).
Naive Bayes Method
• For a response with m classes, C1, C2, . . ., Cm, and the predictors
X1,X2, . . . , Xp, we want to compute: P(Ci|X1, . . .,Xp) for i = 1, 2,
. . .,m.
• To classify a record, we compute its chance of belonging to each
of the classes by computing P(Ci|X1, . . .,Xp) for each class i.
• We then classify the record to the class that has the highest
probability.
Bayes Formula
• The Bayes theorem gives us the following formula
to compute the probability that the record
belongs to class Ci:
Example

Talks for more than 100 min? (TT >= 100)   Gender   Response
                 y                          male    not bought
                 n                          male    not bought
                 n                          female  not bought
                 n                          female  not bought
                 n                          male    not bought
                 n                          male    not bought
                 y                          male    bought
                 y                          female  bought
                 n                          female  bought
                 y                          female  bought
Naive Bayes Probabilities
• For the conditional probability of bought given (TT >= 100) = y,
gender = male, the numerator is a multiplication of:
– the proportion of (TT >= 100) = y instances among the bought
customers,
– times the proportion of gender = male instances among the
bought customers,
– times the proportion of bought customers: (3/4)(1/4)(4/10) =
0.075
• To get the actual probabilities, we must also compute the
numerator for the conditional probability of not bought given
(TT >= 100) = y, gender = male: (1/6)(4/6)(6/10) = 0.067
Naive Bayes Probabilities
• The denominator is then the sum of these two numerators
(0.075 + 0.067 = 0.142)
• The conditional probability of bought given (TT >= 100) = y,
gender = male is therefore 0.075/0.142 = 0.53

Similarly,
P(bought | (TT >= 100) = y, gender = female) = 0.87
P(bought | (TT >= 100) = n, gender = male) = 0.07
P(bought | (TT >= 100) = n, gender = female) = 0.31
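A minimal R sketch that reproduces the hand calculation above from the ten training records (the variable names are ours, chosen for the example):

tt     <- c("y","n","n","n","n","n","y","y","n","y")
gender <- c("male","male","female","female","male","male",
            "male","female","female","female")
resp   <- c(rep("not bought", 6), rep("bought", 4))

p_class  <- prop.table(table(resp))                      # priors P(Ci)
p_tt     <- prop.table(table(resp, tt), margin = 1)      # P(TT | Ci)
p_gender <- prop.table(table(resp, gender), margin = 1)  # P(gender | Ci)

# Numerators for TT = y, gender = male
num_b  <- p_tt["bought", "y"]     * p_gender["bought", "male"]     * p_class["bought"]      # 0.075
num_nb <- p_tt["not bought", "y"] * p_gender["not bought", "male"] * p_class["not bought"]  # 0.067

num_b / (num_b + num_nb)  # ~0.53, matching P(bought | TT = y, male) above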
K-Nearest Neighbors
• The idea in k-nearest neighbors methods is to identify k
observations in the training dataset that are similar to a new
record that we wish to classify.
• We look for records in our training data that are similar, or "near",
the record to be classified in the predictor space (i.e., records
that have values close to x1, x2, . . . , xp).
• Then, based on the classes to which those proximate records
belong, we assign a class to the record that we want to classify.
• The k-nearest neighbors algorithm is a classification method that
does not make assumptions about the form of the relationship
between the class membership (Y) and the predictors X1, X2, . . ., Xp.
• This is a nonparametric method because it does not involve
estimation of parameters.
Distance Formula

• The Euclidean distance between two records (x1, x2, . . . , xp)
and (u1, u2, . . . , up) is

d = sqrt( (x1 - u1)^2 + (x2 - u2)^2 + . . . + (xp - up)^2 )
Example: Riding Mowers

• A riding-mower manufacturer
would like to find a way of
classifying families in a city into
those likely to purchase a riding
mower and those not likely to
buy one. A pilot random sample
is undertaken of owners and
non-owners in the city.
• Data: RidingMowers.csv
Example: Riding Mowers
• Consider a new household with $60,000 income and lot size
20,000 sq. ft
• Among the households in the training set, the one closest to the
new household (in Euclidean distance after normalizing income
and lot size) is household 4, with $61,500 income and lot size
20,800 sq. ft.
• If we use a 1-NN classifier, we would classify the new household
as an owner, like household 4
• If we use k = 3, the three nearest households are 4, 9, and 14
• The first two are owners of riding mowers, and the last is a
non-owner
• The majority vote is therefore owner, and the new household
would be classified as an owner
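A hedged R sketch of the 3-NN classification described above, using the class package. The column names Income, Lot_Size and Ownership, and the units ($000s and 000s of sq. ft.), are assumptions about RidingMowers.csv, not confirmed by the slides:

library(class)

mowers  <- read.csv("RidingMowers.csv")              # columns assumed: Income, Lot_Size, Ownership
train_x <- scale(mowers[, c("Income", "Lot_Size")])  # normalize income and lot size

# New household: $60,000 income, 20,000 sq. ft. lot (assumed units: $000s / 000s sq. ft.)
new_hh <- scale(matrix(c(60, 20), nrow = 1),
                center = attr(train_x, "scaled:center"),
                scale  = attr(train_x, "scaled:scale"))

knn(train = train_x, test = new_hh,
    cl = factor(mowers$Ownership), k = 3)            # majority vote of the 3 nearest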
Model Evaluation
• Classification Confusion Matrix (counts from a model scored on 3,000 records):

                     Predicted Positive   Predicted Negative
Actual Positive             201                   25
Actual Negative              85                 2689

% Error = (25 + 85) / (201 + 25 + 85 + 2689) = 110 / 3000 = 3.67%

% Accuracy = (201 + 2689) / (201 + 25 + 85 + 2689) = 2890 / 3000 = 96.33%
Business Analytics

Classification Evaluation Metrics


Basic Quantitative Quality Indicators

TP – True Positive: observations correctly assigned to the positive class.

TN – True Negative: observations correctly assigned to the negative class.

FP – False Positive: observations wrongly assigned to the positive class
(which actually belong to the negative class).

FN – False Negative: observations wrongly assigned to the negative class
(which actually belong to the positive class).
Classification Confusion Matrix

                       Predicted Positive Class   Predicted Negative Class
Real Positive Class              TP                         FN
Real Negative Class              FP                         TN
Derived Quality Indicators

TPR – True Positive Rate: TPR = TP / (TP + FN)
Ability to detect members of the positive class

TNR – True Negative Rate: TNR = TN / (TN + FP)
Ability to detect members of the negative class

FPR – False Positive Rate: FPR = FP / (FP + TN)
How often negative observations are mistakenly classified as positive

FNR – False Negative Rate: FNR = FN / (FN + TP)
How often positive observations are mistakenly classified as negative
Classification Confusion Matrix

                       Predicted Positive Class   Predicted Negative Class
Real Positive Class       TP   (TPR = TP/POS)        FN   (FNR = FN/POS)
Real Negative Class       FP   (FPR = FP/NEG)        TN   (TNR = TN/NEG)

where POS = TP + FN (all real positives) and NEG = FP + TN (all real negatives)
Derived Quality Indicators
• SE – Sensitivity: SE = TP / (TP + FN)
Ability to detect members of the positive class

• SP – Specificity: SP = TN / (TN + FP)
Ability to detect members of the negative class

From the above equations we get:

SE = TPR    1 - SE = FNR
SP = TNR    1 - SP = FPR
Overall Prediction Correctness

ACC (Total Accuracy) = P(correct prediction)

ACC = (TP + TN) / (TP + TN + FP + FN)
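A short R sketch computing these indicators from raw counts; the numbers reuse the 3,000-record confusion matrix shown earlier:

TP <- 201; FN <- 25; FP <- 85; TN <- 2689

TPR <- TP / (TP + FN)                    # sensitivity (SE)
TNR <- TN / (TN + FP)                    # specificity (SP)
FPR <- FP / (FP + TN)                    # 1 - SP
FNR <- FN / (FN + TP)                    # 1 - SE
ACC <- (TP + TN) / (TP + TN + FP + FN)   # 0.9633, matching the 96.33% above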
Business Analytics

Decision Trees
What Is a Decision Tree?
• A structure that can be used to divide up a large collection of
records into successively smaller sets of records by applying a
sequence of simple decision rules.
• A decision tree model consists of a set of rules for dividing a
large heterogeneous population into smaller, more
homogeneous groups with respect to a particular target variable.
Types of Decision Trees
There are many specific decision-tree algorithms. Notable ones
include:
– ID3 (Iterative Dichotomiser 3)
– C4.5 (successor of ID3)
– CART (Classification And Regression Tree)
– CHAID (CHi-squared Automatic Interaction Detector). Performs
multi-level splits when computing classification trees.
– MARS: extends decision trees to handle numerical data better.
– Conditional Inference Trees. Statistics-based approach that uses
non-parametric tests as splitting criteria, corrected for multiple
testing to avoid over-fitting. This approach results in unbiased
predictor selection and does not require pruning.
Entropy
• The ID3 algorithm uses entropy to calculate the homogeneity of a
sample.
• If the sample is completely homogeneous, the entropy is zero; if
the sample is equally divided between classes, the entropy is one.
• We aim to decrease the entropy of the dataset until we reach
leaf nodes, at which point the subset we are left with is pure
(zero entropy) and represents instances all of one class.

Entropy(S) = - Σ (i = 1 to C) pi log2(pi)

where pi is the proportion of instances
in the dataset that take the ith value of
the target attribute, which has C
values.
Classification Trees
• Two ideas underlie classification trees:
– Recursive Partitioning
– Pruning using validation data
Recursive Partitioning

• Recursive partitioning divides up the p-dimensional space of the
X-variables into non-overlapping multidimensional rectangles
• X-variables can be continuous, binary or ordinal
Gini’s Impurity Index
• Gini impurity can be computed by summing the probability of
each item being chosen times the probability of a mistake in
categorizing that item
• It reaches its minimum (zero) when all cases in the node fall
into a single target category
• It is used by CART Algorithm
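Written out, that description corresponds to the standard Gini impurity formula (not printed on the slide):

G = Σ pi (1 - pi) = 1 - Σ pi^2, summing over the C target classes

For example, a node split evenly between two classes has G = 1 - (0.5^2 + 0.5^2) = 0.5, while a pure node has G = 0.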
Steps in Recursive Partitioning
1. One of the variables, say xi, is selected, and a value of xi, say si, is
chosen to split the p-dimensional space into two parts: one
containing all the points with xi ≤ si, and the other all the points
with xi > si
2. One of the two parts is further divided in a similar manner by
choosing a variable again
3. This process is continued so that we obtain smaller and smaller
rectangular regions
Pruning the Tree
• Pruning is done to avoid overfitting the data
• Unstable splits are eliminated by merging smaller leaves, a
process called pruning
Tree Representations
rpart()

The CART algorithm is implemented by the rpart() function in the rpart package:

rpart(formula, data=, method=, control=)

where

formula  is in the format outcome ~ predictor1 + predictor2 + predictor3 + ...
data=    specifies the data frame
method=  "class" for a classification tree, "anova" for a regression tree
control= optional parameters for controlling tree growth. For example,
         control=rpart.control(minsplit=20, cp=0.01) requires that the minimum
         number of observations in a node be 20 before attempting a split and
         that a split must decrease the overall lack of fit by a factor of 0.01
         (cost complexity factor) before being attempted.
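Putting the pieces together, a minimal sketch of growing and evaluating a classification tree; the use of R's built-in iris data is an assumption for illustration, not from the slides:

library(rpart)

# Grow a classification tree with the control settings described above
fit <- rpart(Species ~ ., data = iris,
             method  = "class",
             control = rpart.control(minsplit = 20, cp = 0.01))

print(fit)                                      # text representation of the tree
pred <- predict(fit, iris, type = "class")      # predicted class labels
table(Actual = iris$Species, Predicted = pred)  # confusion matrix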
Party Package

ctree(formula, data=)

• The type of tree created will depend on the outcome variable
(nominal factor, ordered factor, numeric, etc.)
• ctree() builds a Conditional Inference Tree
• The tree generated by the party package does not need to be
pruned, because tree growth is based on statistical stopping rules
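A matching sketch with ctree(), again on the assumed iris data:

library(party)

cfit <- ctree(Species ~ ., data = iris)  # statistical stopping rules; no pruning step
print(cfit)
plot(cfit)                               # draws the fitted conditional inference tree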
Regression Trees

• In the case of regression trees, the difference is that the leaf
nodes hold the means of the response variable values

• Evaluation metric: RMSE
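For reference, RMSE here is the usual root-mean-squared error (a standard definition, not spelled out on the slide):

RMSE = sqrt( ( (y1 - ŷ1)^2 + ... + (yn - ŷn)^2 ) / n )

where the yi are the observed responses and the ŷi the predictions over n records.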


Applicability of Algorithms Learnt

Algorithm            Independent Variable Type   Dependent Variable Type   Precision Criteria
Naive Bayes          Categorical                 Categorical               Confusion Matrix
k-NN                 Numerical                   Binary                    Confusion Matrix
MLR                  Numerical / Categorical     Numerical                 RMSE
Logistic Regression  Numerical / Categorical     Binary                    Confusion Matrix
Classification Tree  Numerical / Categorical     Binary                    Confusion Matrix
Regression Tree      Numerical / Categorical     Numerical                 RMSE
