
eMBA933

Data Mining
Tools & Techniques
Lecture 12

Dr. Faiz Hamid


Associate Professor
Department of IME
IIT Kanpur
fhamid@iitk.ac.in
Classification: Basic Concepts
Classification
• Task of assigning objects to one of several predefined
categories or classes
• A form of data analysis that extracts a model (classifier) to
predict class labels
– find a model for class label attribute as a function of the values of other
attributes
– classifies data based on training set and values in a classifying attribute, and
uses it in classifying new data
– class labels are generally categorical (discrete or nominal)
– less effective for ordinal categories since implicit order among the categories
is not considered
• Numeric Prediction
– models continuous‐valued functions, i.e., predicts unknown or missing values
• Typical applications
– Credit/loan approval: loan application is “safe” or “risky”
– Medical diagnosis: tumor is “cancerous” or “benign”
– Fraud detection: transaction is “fraudulent”
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: Training data is
accompanied by labels indicating the
class of the observations
– New data is classified based on the
training set

• Unsupervised learning (clustering)


– Class labels of training data are
unknown
– Given a set of observations, the aim is
to establish existence of classes or
clusters in the data
Classification— Two‐Step Process
• Model construction: Describe a set of predetermined classes
– Each tuple is assumed to belong to a predefined class, as determined by
the class label attribute
– The model is represented as classification rules, decision trees, or
mathematical formulae

• Model usage: Classify future or unknown objects


– Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy = percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
Phase 1: Model Construction
Training Data → Classification Algorithm → Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Phase 2: Model Usage
Classifier:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
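As a sketch of how the learned rule would be applied in code (the function name and data layout are illustrative, not from the slides), using the testing data from the table above:

```python
def predict_tenured(rank, years):
    # Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return 'yes' if rank == 'Professor' or years > 6 else 'no'

# Testing data: (name, rank, years, known label)
testing_data = [
    ('Tom',     'Assistant Prof', 2, 'no'),
    ('Merlisa', 'Associate Prof', 7, 'no'),
    ('George',  'Professor',      5, 'yes'),
    ('Joseph',  'Assistant Prof', 7, 'yes'),
]

# Accuracy = fraction of test samples whose prediction matches the known label
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in testing_data)
accuracy = correct / len(testing_data)   # rule misclassifies Merlisa: 3/4 = 0.75

# Unseen data: (Jeff, Professor, 4)
jeff = predict_tenured('Professor', 4)   # 'yes'
```

Note that the rule's accuracy on the independent test set (75%), not its fit to the training data, is what decides whether it is acceptable for classifying new data such as Jeff.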
Classification
[Figure: linearly separable vs. not linearly separable class regions]
Decision Trees
Decision Trees
• Divides the feature space by axes
aligned decision boundaries 1
• Each rectangular region is labeled 2
with one label/class
• Idea is to divide the entire X‐space
into rectangles such that each
rectangle is as homogeneous or
“pure” as possible 3
• Pure = containing records that
belong to just one class
• Recursive partitioning of p‐
dimensional space of the predictor
variables into non‐overlapping
multidimensional rectangles

Not Linearly Separable


Decision Trees
Decision tree: a flowchart‐like tree structure
– internal nodes: test on attributes
– branches: outcome of the test
– leaf nodes: class labels

Width > 6.5 cm?
├─ Yes → Height > 9.5 cm?
│        ├─ Yes → Lemon
│        └─ No → Orange
└─ No → Height > 6.0 cm?
         ├─ Yes → Lemon
         └─ No → Orange
Decision Trees
• If‐Then Rules
– If Width > 6.5 cm AND Height > 9.5 cm THEN Lemon
– If Width > 6.5 cm AND Height ≤ 9.5 cm THEN Orange
– If Width ≤ 6.5 cm AND Height > 6.0 cm THEN Lemon
– If Width ≤ 6.5 cm AND Height ≤ 6.0 cm THEN Orange
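The four rules translate directly into code; a minimal Python sketch (function name is illustrative):

```python
def classify_fruit(width_cm, height_cm):
    # Walk the tree: the root tests width, each branch then tests height
    if width_cm > 6.5:
        return 'Lemon' if height_cm > 9.5 else 'Orange'
    return 'Lemon' if height_cm > 6.0 else 'Orange'
```

For example, a fruit 7 cm wide and 10 cm tall falls down the Yes/Yes path and is classified as Lemon.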
Example
• Whether a customer will wait for a table at a restaurant?
• Attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. Wait Estimate: estimated waiting time (0‐10 min, 10‐30, 30‐60, >60)
Example
[Table: 12 restaurant examples; the last column is the class label attribute]
Which Tree is Better?
[Figure: two candidate trees to decide whether to wait (T) or not (F)]
What Makes a Good Tree?
• Not too big:
– computational efficiency (avoid redundant, spurious attributes)
– avoid overfitting training examples
– generalise well to new/unseen observations
– easy to understand and interpret
• Not too small:
– need to handle important but possibly subtle distinctions in data
• Occam's Razor: "the simplest explanation is most likely
the right one"
– find the smallest tree that fits the observations
Learning Decision Trees
• In principle there are exponentially many decision trees that can
be constructed from a given set of attributes to fit the same data
• Learning the simplest (smallest) decision tree is an NP‐complete
problem (Hyafil & Rivest, 1976)
• Resort to heuristics: efficient algorithms that induce reasonably
accurate, albeit suboptimal, DT in a reasonable amount of time
• Greedy strategy: series of locally optimal decisions
– Start from an empty decision tree
– Split on next best attribute
– Recurse
• What is best attribute?
• We use information theory to guide us
– ID3 (Iterative Dichotomiser) – Information Gain
– C4.5 – Gain Ratio
– Classification and Regression Trees (CART) – Gini index
Decision Tree Learning Algorithm
• Simple, greedy, recursive approach, builds up tree
node‐by‐node
1. pick an attribute to split at a non‐terminal node
2. split examples into groups based on attribute
value
3. for each group:
– if no examples ‐ return majority from parent
– else if all examples in same class ‐ return class
– else loop to Step 1
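The three steps above can be sketched as a short recursive function; a minimal Python version (names are illustrative; it picks the attribute whose split minimises weighted entropy, i.e. maximises information gain, which anticipates the measures discussed below):

```python
import math
from collections import Counter

def entropy_of(labels):
    # Uncertainty (in bits) of the class distribution in this group
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes, parent_majority=None):
    if not rows:                      # no examples: return majority from parent
        return parent_majority
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority               # all examples in same class (or no tests left)

    def split_entropy(attr):          # weighted entropy after splitting on attr
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy_of(g) for g in groups.values())

    best = min(attributes, key=split_entropy)   # highest information gain
    remaining = [a for a in attributes if a != best]
    node = {'attr': best, 'branches': {}}
    for value in {row[best] for row in rows}:   # Step 2: group by attribute value
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node['branches'][value] = build_tree([r for r, _ in sub],
                                             [l for _, l in sub],
                                             remaining, majority)
    return node
```

On the restaurant data this procedure chooses Patrons at the root, since that split yields the purest groups.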
Choosing a Good Attribute
• Which attribute is better to split on, X1 or X2?

[Figure: candidate splits on X1 and X2; a pure node contains examples of a single class]

Idea:
1. use counts at leaves to define probability distributions, so we
can measure uncertainty
2. a good attribute splits the examples into subsets that are
(ideally) pure
Choosing a Good Attribute
• Restaurant example
• Test Patrons or Type first?

• A good attribute splits samples into groups that are (ideally)


all positive or negative
• Patrons is a better attribute than Type
• Testing on good attributes early allows us to minimise the tree
depth
Quantifying Uncertainty
• We flip two different coins
[Figure: outcome counts for a fair coin and a biased coin]
Quantifying Uncertainty
• Entropy H(X) of a random variable X:

  H(X) = − Σx p(x) log2 p(x)

• Measures the level of impurity in a group of examples
[Figure: entropy of a binary variable is maximal at p = 0.5 and zero at p = 0 or 1]
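A short Python sketch of the entropy measure (function name is illustrative), applied to the two-coins intuition above:

```python
import math

def entropy(probabilities):
    # H(X) = -sum p(x) * log2 p(x); zero-probability terms contribute nothing
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

entropy([0.5, 0.5])    # fair coin: 1 bit, maximum uncertainty
entropy([0.99, 0.01])  # heavily biased coin: about 0.08 bits
entropy([1.0])         # pure group, a single class: 0 bits, no uncertainty
```

The fair coin is maximally uncertain (1 bit per flip), while the biased coin's outcome is nearly predictable, so its entropy is close to zero.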
Attribute Selection Measures
• An attribute selection measure is a heuristic for selecting the
splitting criterion that “best” separates a given data partition

• Pure partition: A partition is pure if all the tuples in it belong


to the same class
– split up the tuples according to mutually exclusive outcomes of the
splitting criterion

• Popular measures
– information gain, gain ratio, Gini index
Information Gain
• ID3 uses information gain as its attribute selection measure
• Node N represents tuples of partition D
• Attribute with the highest information gain is chosen as the
splitting attribute for node N
• Objective: to partition on an attribute that would do the “best
classification,” so that the amount of information still required
to finish classifying the tuples is minimal
• Minimize expected number of tests needed to classify a given
tuple and guarantee that a simple (but not necessarily the simplest) tree is found
Notations
• D: data partition, a training set of class‐labeled tuples
• m: distinct values of class label attribute defining m
distinct classes, Ci (i = 1,…,m)
• Ci,D: set of tuples of class Ci in D
• |Dj|: number of tuples in Dj, the jth partition of D produced by a split
• |Ci,D |: number of tuples in Ci,D
Information Gain
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σi=1..m pi log2(pi)

• Measures the average amount of information needed to identify the class of
a tuple in D
– Original information required based on just the proportion of classes
• How much more information would we still need (after the partitioning) to
arrive at an exact classification?
• Information needed (after using A to split D into v partitions) to classify D:

  InfoA(D) = Σj=1..v (|Dj| / |D|) × Info(Dj)
Information Gain
• InfoA(D) is the expected information required to classify a tuple from D
based on the partitioning by A
• Smaller the expected information (still) required, the greater the purity of
the partitions (reduces the entropy in the partitions)
• To determine how well a test condition performs, compare the degree of
impurity of the parent node (before splitting) and the child node (after
splitting)
– larger the difference, the better the test condition
• Information gained by branching on attribute A:

  Gain(A) = Info(D) − InfoA(D)

– expected reduction in the information requirement caused by knowing the
value of A
– determines the goodness of the split
Example
[Figure: the 12 restaurant examples split by Type and by Patrons]

• InfoType(D) = 2/12 × 1 + 2/12 × 1 + 4/12 × 1 + 4/12 × 1 = 1
• Gain(Type) = Info(D) − InfoType(D) = 1 − 1 = 0

• InfoPatrons(D) = 2/12 × 0 + 4/12 × 0 + 6/12 × 0.918 = 0.459
• Gain(Patrons) = Info(D) − InfoPatrons(D) = 1 − 0.459 = 0.541

• Patrons is a better attribute than Type


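These calculations can be checked with a short Python sketch (function names are illustrative; the per-group class counts are those of the restaurant example, with 6 positive and 6 negative tuples overall):

```python
import math

def info(counts):
    # Info(D): entropy of a partition given its class counts, e.g. [pos, neg]
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def info_after_split(groups):
    # InfoA(D): weighted average entropy over the partitions produced by A
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * info(g) for g in groups)

info_d = info([6, 6])   # 6 positive, 6 negative tuples: Info(D) = 1 bit

# Type: French [1,1], Italian [1,1], Thai [2,2], Burger [2,2]
gain_type = info_d - info_after_split([[1, 1], [1, 1], [2, 2], [2, 2]])
# Patrons: None [0,2], Some [4,0], Full [2,4]
gain_patrons = info_d - info_after_split([[0, 2], [4, 0], [2, 4]])
```

Every Type group is a 50/50 mix, so splitting on Type gains nothing, while two of the three Patrons groups are pure, giving Gain(Patrons) ≈ 0.541.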
