Chapter - 1
Business Analytics
• The military uses data mining to learn what roles various factors
play in the accuracy of bombs
• Intelligence agencies might use it to determine which of a huge
quantity of intercepted communications are of interest
• Medical researchers might use it to predict the likelihood of a
cancer relapse
Types of Data Mining Techniques
• Supervised learning algorithms
• Unsupervised learning algorithms
Supervised & Unsupervised Learning
• Supervised learning algorithms are those used in classification
and prediction
• We must have data available in which the value of the outcome
of interest (e.g., purchase or no purchase) is known
[Figure: Data-partitioning workflow — existing data with a known response (Y) is partitioned into training data and testing data; a model is built on the training data and applied to the testing data, and the predicted response (Y) is compared with the actual response (Y).]
Use and Creation of Partitions
We typically deal with two or three partitions:
1. A Training set:
a. Typically the largest partition
b. Contains the data used to build the various models we
are examining
c. Generally used to develop multiple models
2. A Validation set:
Used to assess the performance of each model so that you
can compare models and pick the best one
3. A Test set:
a. Used if we need to assess the performance of the
chosen model with new data
b. Used to overcome the over-fitting problem
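The partitioning above can be sketched in code. This is an illustrative sketch only: the 60/20/20 proportions, the fixed seed, and the `partition` helper name are assumptions, not something the text prescribes.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle records and split them into training, validation, and
    test sets. The 60/20/20 proportions are an illustrative assumption."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    training = shuffled[:n_train]                      # build the models
    validation = shuffled[n_train:n_train + n_valid]   # compare models, pick the best
    test = shuffled[n_train + n_valid:]                # assess the chosen model on new data
    return training, validation, test

train, valid, test = partition(list(range(10)))
print(len(train), len(valid), len(test))  # 6 2 2
```

Keeping the test set untouched until the very end is what lets it detect over-fitting: the chosen model has never seen those records.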
Need for Predictive Models
• Fit an equation relating the response y to the predictors, using data
in which the value of y is known
• Then use this equation for predicting the value of y for new individuals
whose value of y was not measured (and therefore was not in the
original data table)
• When the response variable is numerical, predictive modeling is called
regression
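A minimal sketch of that fit-then-predict workflow, using a straight-line equation and least squares; the data values here are hypothetical, chosen only to illustrate the mechanics.

```python
# Hypothetical training data where the response y is known.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares fit of the equation y ≈ slope * x + intercept.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Use the fitted equation to predict y for a new individual (x = 5)
# whose response was not measured.
y_new = slope * 5.0 + intercept
print(round(y_new, 2))  # 9.85
```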
Similarly,
P(bought | (TT >= 100) = y, gender = female) = 0.87
P(bought | (TT >= 100) = n, gender = male) = 0.07
P(bought | (TT >= 100) = n, gender = female) = 0.31
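The conditional probabilities above can be read as a lookup-table classifier. A sketch under stated assumptions: the 0.5 decision threshold and the `tt_ge_100` naming (whether total time ≥ 100) are mine, and only the three probability values given in the text are used.

```python
# Conditional purchase probabilities taken from the text:
# key = (TT >= 100?, gender), value = P(bought | key)
p_bought = {
    ("y", "female"): 0.87,
    ("n", "male"):   0.07,
    ("n", "female"): 0.31,
}

def classify(tt_ge_100, gender, threshold=0.5):
    """Predict 'bought' when the estimated probability reaches the
    (assumed) 0.5 decision threshold."""
    p = p_bought[(tt_ge_100, gender)]
    return "bought" if p >= threshold else "not bought"

print(classify("y", "female"))  # bought
print(classify("n", "male"))    # not bought
```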
K-Nearest Neighbors
• The idea in k-nearest neighbors methods is to identify k
observations in the training dataset that are similar to a new
record that we wish to classify.
• We look for records in our training data that are similar or near
the record to be classified in the predictor space (i.e., records
that have values close to x1, x2, . . . , xp).
• Then, based on the classes to which those proximate records
belong, we assign a class to the record that we want to classify.
• The k-nearest neighbors algorithm is a classification method that
does not make assumptions about the form of the relationship
between the class membership (Y) and the predictors X1, X2, . . . , Xp.
• This is a nonparametric method because it does not involve
estimation of parameters.
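The steps above can be sketched as follows. The training points are hypothetical and assumed to be already normalized; the algorithm itself (Euclidean distance, then a majority vote among the k nearest records) is exactly what the text describes.

```python
import math
from collections import Counter

def knn_classify(new_record, training, k=3):
    """Classify new_record by majority vote among the k nearest
    training records (Euclidean distance in predictor space)."""
    neighbors = sorted(training, key=lambda rec: math.dist(rec[0], new_record))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalized training data: ((x1, x2), class)
training = [
    ((0.1, 0.2), "owner"),
    ((0.2, 0.1), "owner"),
    ((0.9, 0.8), "nonowner"),
    ((0.8, 0.9), "nonowner"),
    ((0.2, 0.3), "owner"),
]

print(knn_classify((0.15, 0.15), training, k=3))  # owner
```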
Distance Formula
• The distance between a new record (x1, x2, . . . , xp) and a training
record (u1, u2, . . . , up) is most commonly measured by Euclidean
distance:
d = sqrt( (x1 − u1)^2 + (x2 − u2)^2 + · · · + (xp − up)^2 )
Example: Riding Mowers
• A riding-mower manufacturer would like to find a way of
classifying families in a city into those likely to purchase a riding
mower and those not likely to buy one. A pilot random sample
is undertaken of owners and non-owners in the city.
• Data: RidingMowers.csv
• Consider a new household with $60,000 income and lot size
20,000 sq. ft
• Among the households in the training set, the one closest to the
new household (in Euclidean distance after normalizing income
and lot size) is household 4, with $61,500 income and lot size
20,800 sq. ft.
• If we use a 1-NN classifier, we would classify the new household
as an owner, like household 4
• If we use k = 3, the three nearest households are 4, 9, and 14
• The first two are owners of riding mowers, and the last is a
nonowner.
• The majority vote is therefore owner, and the new household
would be classified as an owner
Model Evaluation
• Classification Confusion Matrix (example, n = 3000):

                   Predicted Positive   Predicted Negative
Actual Positive          201                   85
Actual Negative           25                 2689

% Error    = (25 + 85) / 3000   = 110 / 3000  = 3.67%
% Accuracy = (201 + 2689) / 3000 = 2890 / 3000 = 96.33%
                     Predicted
                 Positive Class   Negative Class
Real
Positive Class        TP               FN
Negative Class        FP               TN

TPR = TP / POS    TNR = TN / NEG
FPR = FP / NEG    FNR = FN / POS
Derived Quality Indicators
• SE – Sensitivity: SE = TP / (TP + FN)
Ability to detect members of the positive class
SE = TPR    1 − SE = FNR
• SP – Specificity: SP = TN / (TN + FP)
Ability to detect members of the negative class
SP = TNR    1 − SP = FPR
• Overall Prediction Correctness
ACC = (TP + TN) / (TP + TN + FP + FN)
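Applying these indicators to the example counts that appear with the confusion matrix above (TP = 201, FN = 85, FP = 25, TN = 2689; assigning the counts to these cells is an assumption, but it is the one consistent with the stated 3.67% error and 96.33% accuracy):

```python
TP, FN, FP, TN = 201, 85, 25, 2689   # assumed cell assignment for the example

SE  = TP / (TP + FN)                   # sensitivity (true positive rate)
SP  = TN / (TN + FP)                   # specificity (true negative rate)
ACC = (TP + TN) / (TP + TN + FP + FN)  # overall prediction correctness

print(f"SE  = {SE:.4f}")   # SE  = 0.7028
print(f"SP  = {SP:.4f}")   # SP  = 0.9908
print(f"ACC = {ACC:.4f}")  # ACC = 0.9633
```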
Decision Trees
What is Decision Tree ?
• A structure that can be used to divide up a large collection of
records into successively smaller sets of records by applying a
sequence of simple decision rules.
• A decision tree model consists of a set of rules for dividing a
large heterogeneous population into smaller, more
homogeneous groups with respect to a particular target variable.
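Such a set of rules can be written down directly as nested conditionals. A sketch only: the two splits on income and lot size echo the riding-mowers example earlier in the chapter, but the split values 60 and 20 are illustrative assumptions, not fitted from data.

```python
def classify_household(income, lot_size):
    """A hand-written two-rule decision tree. The split values 60
    (income in $1000s) and 20 (lot size in 1000s of sq. ft) are
    illustrative assumptions."""
    if income >= 60:           # rule 1: split on income
        if lot_size >= 20:     # rule 2: split on lot size
            return "owner"
        return "nonowner"
    return "nonowner"

print(classify_household(income=61.5, lot_size=20.8))  # owner
print(classify_household(income=40.0, lot_size=25.0))  # nonowner
```

Each rule sends a record down one branch, so every record ends up in exactly one leaf — one of the smaller, more homogeneous groups.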
Types of Decision Trees
There are many specific decision-tree algorithms. Notable ones
include:
– ID3 (Iterative Dichotomiser 3)
– C4.5 (successor of ID3)
– CART (Classification And Regression Tree)
– CHAID (CHi-squared Automatic Interaction Detector). Performs
multi-level splits when computing classification trees.
– MARS: extends decision trees to handle numerical data better.
– Conditional Inference Trees. Statistics-based approach that uses
non-parametric tests as splitting criteria, corrected for multiple
testing to avoid over-fitting. This approach results in unbiased
predictor selection and does not require pruning.
Entropy
• The ID3 algorithm uses entropy to calculate the homogeneity of a
sample.
• If the sample is completely homogeneous, the entropy is zero; if
the sample is equally divided, it has an entropy of one.
• We aim to decrease the entropy of the dataset until we reach
leaf nodes, at which point the subset we are left with is pure, or
has zero entropy, and represents instances all of one class.
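The entropy measure described above is the Shannon entropy, −Σ p·log2(p) over the class proportions p. A small sketch confirming the two boundary cases the text states:

```python
import math

def entropy(class_counts):
    """Shannon entropy of a sample, given counts per class.
    Zero for a pure sample; one for a two-class 50/50 split."""
    total = sum(class_counts)
    probs = (c / total for c in class_counts if c > 0)
    return sum(-p * math.log2(p) for p in probs)

print(entropy([8, 0]))  # 0.0  (completely homogeneous)
print(entropy([4, 4]))  # 1.0  (equally divided)
```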
ctree(formula, data=)
• ctree() builds a Conditional Inference Tree (R party package); the
type of tree created will depend on the outcome variable (nominal
factor, ordered factor, numeric, etc.).
• The tree generated by the party package does not need to be
pruned, because tree growth is based on statistical stopping rules.
Regression Trees