
eMBA933

Data Mining
Tools & Techniques
Lecture 12

Dr. Faiz Hamid


Associate Professor
Department of IME
IIT Kanpur
fhamid@iitk.ac.in
Classification: Basic Concepts
Classification
• Task of assigning objects to one of several predefined
categories or classes
• A form of data analysis that extracts a model (classifier) to
predict class labels
– find a model for class label attribute as a function of the values of other
attributes
– classifies data based on training set and values in a classifying attribute, and
uses it in classifying new data
– class labels are generally categorical (discrete or nominal)
– less effective for ordinal categories since implicit order among the categories
is not considered
• Numeric Prediction
– models continuous‐valued functions, i.e., predicts unknown or missing values
• Typical applications
– Credit/loan approval: loan application is “safe” or “risky”
– Medical diagnosis: tumor is “cancerous” or “benign”
– Fraud detection: transaction is “fraudulent”
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: Training data is
accompanied by labels indicating the
class of the observations
– New data is classified based on the
training set

• Unsupervised learning (clustering)


– Class labels of training data are
unknown
– Given a set of observations, the aim is
to establish existence of classes or
clusters in the data
Classification— Two‐Step Process
• Model construction: Describe a set of predetermined classes
– Each tuple is assumed to belong to a predefined class, as determined by
the class label attribute
– The model is represented as classification rules, decision trees, or
mathematical formulae

• Model usage: Classify future or unknown objects


– Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy = percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
Phase 1: Model Construction
Training Data → Classification Algorithm → Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Phase 2: Model Usage
Classifier:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
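As a sketch of how the learned rule would be applied in code (the function name and data layout are illustrative, not from the slides), using the testing data from the table above:

```python
def predict_tenured(rank, years):
    # Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return 'yes' if rank == 'Professor' or years > 6 else 'no'

# Testing data: (name, rank, years, known label)
testing_data = [
    ('Tom',     'Assistant Prof', 2, 'no'),
    ('Merlisa', 'Associate Prof', 7, 'no'),
    ('George',  'Professor',      5, 'yes'),
    ('Joseph',  'Assistant Prof', 7, 'yes'),
]

# Accuracy = fraction of test samples whose prediction matches the known label
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in testing_data)
accuracy = correct / len(testing_data)   # rule misclassifies Merlisa: 3/4 = 0.75

# Unseen data: (Jeff, Professor, 4)
jeff = predict_tenured('Professor', 4)   # 'yes'
```

Note that the rule's accuracy on the independent test set (75%), not its fit to the training data, is what decides whether it is acceptable for classifying new data such as Jeff.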
Classification
[Figure: linearly separable vs. not linearly separable class regions]
Decision Trees
Decision Trees
• Divides the feature space by axes
aligned decision boundaries 1
• Each rectangular region is labeled 2
with one label/class
• Idea is to divide the entire X‐space
into rectangles such that each
rectangle is as homogeneous or
“pure” as possible 3
• Pure = containing records that
belong to just one class
• Recursive partitioning of p‐
dimensional space of the predictor
variables into non‐overlapping
multidimensional rectangles

Not Linearly Separable


Decision Trees
Decision tree: a flowchart‐like tree structure
– internal nodes: test on attributes
– branches: outcome of the test
– leaf nodes: class labels

Width > 6.5 cm?
├─ Yes → Height > 9.5 cm?
│        ├─ Yes → Lemon
│        └─ No → Orange
└─ No → Height > 6.0 cm?
         ├─ Yes → Lemon
         └─ No → Orange
Decision Trees
• If‐Then Rules
– If Width > 6.5 cm AND Height > 9.5 cm THEN Lemon
– If Width > 6.5 cm AND Height ≤ 9.5 cm THEN Orange
– If Width ≤ 6.5 cm AND Height > 6.0 cm THEN Lemon
– If Width ≤ 6.5 cm AND Height ≤ 6.0 cm THEN Orange
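The four rules translate directly into code; a minimal Python sketch (function name is illustrative):

```python
def classify_fruit(width_cm, height_cm):
    # Walk the tree: the root tests width, each branch then tests height
    if width_cm > 6.5:
        return 'Lemon' if height_cm > 9.5 else 'Orange'
    return 'Lemon' if height_cm > 6.0 else 'Orange'
```

For example, a fruit 7 cm wide and 10 cm tall falls down the Yes/Yes path and is classified as Lemon.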
Example
• Whether a customer will wait for a table at a restaurant?
• Attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. Wait Estimate: estimated waiting time (0‐10 min, 10‐30, 30‐60, >60)
Example
[Table: 12 restaurant examples; the last column is the class label attribute]
Which Tree is Better?
[Figure: two candidate trees to decide whether to wait (T) or not (F)]
What Makes a Good Tree?
• Not too big:
– computational efficiency (avoid redundant, spurious attributes)
– avoid overfitting training examples
– generalise well to new/unseen observations
– easy to understand and interpret
• Not too small:
– need to handle important but possibly subtle distinctions in data
• Occam's Razor: "the simplest explanation is most likely
the right one"
– find the smallest tree that fits the observations
Learning Decision Trees
• In principle there are exponentially many decision trees that can
be constructed from a given set of attributes to fit the same data
• Learning the simplest (smallest) decision tree is an NP‐complete
problem (Hyafil & Rivest, 1976)
• Resort to heuristics: efficient algorithms that induce reasonably
accurate, albeit suboptimal, DT in a reasonable amount of time
• Greedy strategy: series of locally optimal decisions
– Start from an empty decision tree
– Split on next best attribute
– Recurse
• What is best attribute?
• We use information theory to guide us
– ID3 (Iterative Dichotomiser) – Information Gain
– C4.5 – Gain Ratio
– Classification and Regression Trees (CART) – Gini index
Decision Tree Learning Algorithm
• Simple, greedy, recursive approach, builds up tree
node‐by‐node
1. pick an attribute to split at a non‐terminal node
2. split examples into groups based on attribute
value
3. for each group:
– if no examples ‐ return majority from parent
– else if all examples in same class ‐ return class
– else loop to Step 1
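The three steps above can be sketched as a short recursive function; a minimal Python version (names are illustrative; it picks the attribute whose split minimises weighted entropy, i.e. maximises information gain, which anticipates the measures discussed below):

```python
import math
from collections import Counter

def entropy_of(labels):
    # Uncertainty (in bits) of the class distribution in this group
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes, parent_majority=None):
    if not rows:                      # no examples: return majority from parent
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority               # all examples in same class (or no tests left)

    def split_entropy(attr):          # weighted entropy after splitting on attr
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy_of(g) for g in groups.values())

    best = min(attributes, key=split_entropy)   # highest information gain
    remaining = [a for a in attributes if a != best]
    node = {'attr': best, 'branches': {}}
    for value in {row[best] for row in rows}:   # Step 2: group by attribute value
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node['branches'][value] = build_tree([r for r, _ in sub],
                                             [l for _, l in sub],
                                             remaining, majority)
    return node
```

On the restaurant data this procedure chooses Patrons at the root, since that split yields the purest groups.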
Choosing a Good Attribute
• Which attribute is better to split on, X1 or X2?

[Figure: candidate splits on X1 and X2; a pure node contains examples of a single class]

Idea:
1. use counts at leaves to define probability distributions, so we
can measure uncertainty
2. a good attribute splits the examples into subsets that are
(ideally) pure
Choosing a Good Attribute
• Restaurant example
• Test Patrons or Type first?

• A good attribute splits samples into groups that are (ideally)


all positive or negative
• Patrons is a better attribute than Type
• Testing on good attributes early allows us to minimise the tree
depth
Quantifying Uncertainty
• We flip two different coins
[Figure: outcome counts for a fair coin and a biased coin]
Quantifying Uncertainty
• Entropy H(X) of a random variable X:

  H(X) = − Σx p(x) log2 p(x)

• Measures the level of impurity in a group of examples
[Figure: entropy of a binary variable is maximal at p = 0.5 and zero at p = 0 or 1]
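A short Python sketch of the entropy measure (function name is illustrative), applied to the two-coins intuition above:

```python
import math

def entropy(probabilities):
    # H(X) = -sum p(x) * log2 p(x); zero-probability terms contribute nothing
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

entropy([0.5, 0.5])    # fair coin: 1 bit, maximum uncertainty
entropy([0.99, 0.01])  # heavily biased coin: about 0.08 bits
entropy([1.0])         # pure group, a single class: 0 bits, no uncertainty
```

The fair coin is maximally uncertain (1 bit per flip), while the biased coin's outcome is nearly predictable, so its entropy is close to zero.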
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the
splitting criterion that “best” separates a given data partition

• Pure partition: A partition is pure if all the tuples in it belong


to the same class
– split up the tuples according to mutually exclusive outcomes of the
splitting criterion

• Popular measures
– information gain, gain ratio, Gini index
Information Gain
• ID3 uses information gain as its attribute selection measure
• Node N represents tuples of partition D
• Attribute with the highest information gain is chosen as the
splitting attribute for node N
• Objective: to partition on an attribute that would do the “best
classification,” so that the amount of information still required
to finish classifying the tuples is minimal
• Minimize expected number of tests needed to classify a given
tuple and guarantee that a simple (but not necessarily the simplest) tree is found
Notations
• D: data partition, a training set of class‐labeled tuples
• m: distinct values of class label attribute defining m
distinct classes, Ci (i = 1,…,m)
• Ci,D: set of tuples of class Ci in D
• |Dj|: number of tuples in Dj, the jth partition of D produced by a split
• |Ci,D |: number of tuples in Ci,D
Information Gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σi=1..m pi log2(pi)

• Measures the average amount of information needed to identify the class of
a tuple in D
– Original information required based on just the proportion of classes
• How much more information would we still need (after the partitioning) to
arrive at an exact classification?
• Information needed (after using A to split D into v partitions) to classify D:

  InfoA(D) = Σj=1..v (|Dj| / |D|) × Info(Dj)
Information Gain
• InfoA(D) is the expected information required to classify a tuple from D
based on the partitioning by A
• Smaller the expected information (still) required, the greater the purity of
the partitions (reduces the entropy in the partitions)
• To determine how well a test condition performs, compare the degree of
impurity of the parent node (before splitting) and the child node (after
splitting)
– larger the difference, the better the test condition
• Information gained by branching on attribute A:

  Gain(A) = Info(D) − InfoA(D)

– expected reduction in the information requirement caused by knowing the
value of A
– determines the goodness of the split
Example
[Figure: the 12 restaurant examples split by Type and by Patrons]

• InfoType(D) = 2/12 × 1 + 2/12 × 1 + 4/12 × 1 + 4/12 × 1 = 1
• Gain(Type) = Info(D) − InfoType(D) = 1 − 1 = 0

• InfoPatrons(D) = 2/12 × 0 + 4/12 × 0 + 6/12 × 0.918 = 0.459
• Gain(Patrons) = Info(D) − InfoPatrons(D) = 1 − 0.459 = 0.541

• Patrons is a better attribute than Type


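These calculations can be checked with a short Python sketch (function names are illustrative; the per-group class counts are those of the restaurant example, with 6 positive and 6 negative tuples overall):

```python
import math

def info(counts):
    # Info(D): entropy of a partition given its class counts, e.g. [pos, neg]
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_after_split(groups):
    # InfoA(D): weighted average entropy over the partitions produced by A
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * info(g) for g in groups)

info_d = info([6, 6])   # 6 positive, 6 negative tuples: Info(D) = 1 bit

# Type: French [1,1], Italian [1,1], Thai [2,2], Burger [2,2]
gain_type = info_d - info_after_split([[1, 1], [1, 1], [2, 2], [2, 2]])
# Patrons: None [0,2], Some [4,0], Full [2,4]
gain_patrons = info_d - info_after_split([[0, 2], [4, 0], [2, 4]])
```

Every Type group is a 50/50 mix, so splitting on Type gains nothing, while two of the three Patrons groups are pure, giving Gain(Patrons) ≈ 0.541.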
