Introduction
• Data classification is the process of organizing data into
categories for its most effective and efficient use.
• Information-rich databases contain hidden information that
can be used for intelligent decision making.
• Classification and prediction are data analysis techniques that
can be used to extract rules describing important data
classes or to predict future data trends.
Examples
• We can build classification models
– to categorize weather conditions as suitable or
unsuitable for playing a game.
– to categorize bank loan applications as either safe
or risky.
• We can build a prediction model to predict the expenditures
of potential customers on electronic items, given
their income and occupation.
Classification Process
[Figure: classification process diagram]
Prediction Process
[Figure: prediction process diagram]
Classification Task
[Figure: classification task diagram]
Organization of the Lecture
• Classification by Decision Tree Induction
• Decision Tree Pruning
• Bayesian Classification
• Rule-Based Classification
• k-Nearest-Neighbor Classifiers
Classification by Decision Tree Induction
• Decision tree
– A flowchart-like tree structure
– Each internal node denotes a test on an attribute
– Each branch represents an outcome of the test
– Each leaf node represents a class label or class distribution
• Popular decision tree algorithms such as ID3, C4.5, and CART adopt a greedy
(i.e., non-backtracking) approach.
• Decision trees are constructed in a top-down, recursive, divide-and-conquer
manner.
Decision Tree Induction
• Decision tree generation consists of two phases
Phase-1: Tree construction (TRAINING phase)
• At start, all the training examples are at the root
• Partition the examples recursively based on selected
attributes
Phase-2: Tree pruning
• Identify and remove branches that reflect noise or
outliers
• After the decision tree is formed, one can classify an
unknown sample
– Test the attribute values of the sample against the decision tree (a sketch follows below)
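A rough Python sketch of this two-phase idea, training only (the helper `best_attribute` is a stand-in here; ID3's information-gain version appears later in the lecture, and pruning is omitted):

```python
from collections import Counter

def best_attribute(rows, attributes, target):
    """Placeholder selection rule; ID3 instead picks the attribute
    with the highest information gain (defined later)."""
    return attributes[0]

def build_tree(rows, attributes, target):
    """Phase 1: top-down recursive divide-and-conquer construction."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:            # pure partition: emit a leaf
        return labels[0]
    if not attributes:                   # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    rest = [a for a in attributes if a != attr]
    return {attr: {v: build_tree([r for r in rows if r[attr] == v], rest, target)
                   for v in {r[attr] for r in rows}}}

def classify(tree, sample):
    """Test the sample's attribute values against the tree, root to leaf."""
    while isinstance(tree, dict):
        attr = next(iter(tree))          # attribute tested at this node
        tree = tree[attr][sample[attr]]
    return tree
```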
Example of a Decision Tree
Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Tree 1 (split on Refund first):
• Refund = Yes → NO
• Refund = No → Marital Status
– Married → NO
– Single, Divorced → Taxable Income (< 80K → NO; > 80K → YES)

Tree 2 (split on Marital Status first):
• Marital Status = Married → NO
• Marital Status = Single, Divorced → Refund
– Refund = Yes → NO
– Refund = No → Taxable Income (< 80K → NO; > 80K → YES)

There could be more than one tree that fits the same data.
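A minimal sketch of the first tree as plain conditionals (the function name and value encodings are illustrative, not from the lecture):

```python
def tree_1(refund, marital_status, taxable_income):
    """Tree 1: Refund, then Marital Status, then Taxable Income."""
    if refund == "Yes":
        return "No"                       # predicted Cheat = No
    if marital_status == "Married":
        return "No"
    # Single or Divorced: decide on the income test
    return "Yes" if taxable_income > 80_000 else "No"

# e.g., Tid 5 (No, Divorced, 95K) is classified as Cheat = Yes
assert tree_1("No", "Divorced", 95_000) == "Yes"
```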
ID3 Algorithm: Information Gain
• ID3 uses information gain as its attribute selection measure. This
measure is based on pioneering work by Claude Shannon on
information theory, which studied the value or "information content"
of messages.
• The idea is to select the attribute that partitions the learning set
into subsets that are as "pure" as possible.
• This approach minimizes the expected number of tests needed to
classify a given tuple and guarantees that a simple (but not
necessarily the simplest) tree is found.
• The attribute with the highest information gain is chosen as the
splitting attribute at each level.
Information Gain Approach
• To classify an object, a certain amount of information is needed
– Info(D), the information needed to classify a tuple from the whole data set D (also written I)
• After we have learned the value of attribute A, we only need some
remaining amount of information to classify the object
– Info_A(D), the information still needed after partitioning on attribute A (also written I_res)
• Gain
– Gain(A) = Info(D) − Info_A(D), or I − I_res(A)
• The most 'informative' attribute is the one that minimizes Info_A(D), i.e.,
maximizes Gain(A)
Mathematical Equations
• Info(D) = −Σ_{i=1..r} p_i log2(p_i), where p_i is the probability that an
arbitrary tuple in D belongs to class C_i, i = 1, 2, …, r.
• Info_A(D) = Σ_{j=1..m} (|D_j| / |D|) × Info(D_j) is the expected information
required to classify a tuple from D based on the partitioning of the data
by attribute A into m distinct parts (D_1, D_2, …, D_m).
• Gain(A) = Info(D) − Info_A(D); the attribute A with the highest
information gain, Gain(A), is chosen as the splitting attribute at node N.
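A minimal Python sketch of these three quantities, assuming categorical attributes and rows stored as dictionaries (the function names are mine):

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_a(rows, attr, target):
    """Info_A(D): expected information after partitioning on attribute A."""
    total = len(rows)
    sizes = Counter(row[attr] for row in rows)
    return sum((size / total) * info([r[target] for r in rows if r[attr] == v])
               for v, size in sizes.items())

def gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([row[target] for row in rows]) - info_a(rows, attr, target)
```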
Entropy
• The average amount of information I needed to classify an object is
given by the entropy measure:
Entropy = −Σ_i p(c_i) log2 p(c_i)
[Figure: plot of entropy as a function of the class probability p(c1); maximum at p(c1) = 0.5]
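A quick numeric check of the curve in the figure, using a small `binary_entropy` helper (a sketch, not from the lecture):

```python
import math

def binary_entropy(p):
    """Entropy of a two-class distribution (p, 1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0                        # a pure node carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(binary_entropy(p), 3))
# prints 0.0, 0.811, 1.0 (the maximum), 0.811, 0.0
```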
Example: Triangles and Squares
Data set: a set of classified objects. Class labels: triangle, square.

#   Color   Outline  Dot  Shape
1   green   dashed   no   triangle
2   green   dashed   yes  triangle
3   yellow  dashed   no   square
4   red     dashed   no   square
5   red     solid    no   square
6   red     solid    yes  triangle
7   green   solid    no   square
8   green   dashed   no   triangle
9   yellow  solid    yes  square
10  red     solid    no   square
11  green   solid    yes  square
12  yellow  dashed   yes  square
13  yellow  solid    no   square
14  red     dashed   yes  triangle
Info(D): Calculation
• 5 triangles, 9 squares
• Class probabilities: p(triangle) = 5/14, p(square) = 9/14
• Info(D) = −(5/14) log2(5/14) − (9/14) log2(9/14) ≈ 0.940 bits
Partitioning the data set by Color produces three subsets:
• red: 2 triangles, 3 squares → Info(D_red) = 0.971 bits
• green: 3 triangles, 2 squares → Info(D_green) = 0.971 bits
• yellow: 0 triangles, 4 squares → Info(D_yellow) = 0 bits

Information Gain:
• Info_Color(D) = (5/14)(0.971) + (5/14)(0.971) + (4/14)(0) ≈ 0.694 bits
• Gain(Color) = Info(D) − Info_Color(D) = 0.940 − 0.694 = 0.246 bits
Information Gain of All Attributes
• Attributes
– Gain(Color) = 0.246
– Gain(Outline) = 0.151
– Gain(Dot) = 0.048
• Heuristic: the attribute with the highest gain (i.e., Color) is
chosen as the splitting attribute at the first level.
• This heuristic is local (local minimization of impurity); a code check follows below.
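Running the `gain` sketch from the equations section on the triangles-and-squares table reproduces these values (exact results are about 0.247, 0.152, and 0.048; the slide truncates the first two):

```python
COLUMNS = ("Color", "Outline", "Dot", "Shape")
DATA = [
    ("green", "dashed", "no", "triangle"), ("green", "dashed", "yes", "triangle"),
    ("yellow", "dashed", "no", "square"),  ("red", "dashed", "no", "square"),
    ("red", "solid", "no", "square"),      ("red", "solid", "yes", "triangle"),
    ("green", "solid", "no", "square"),    ("green", "dashed", "no", "triangle"),
    ("yellow", "solid", "yes", "square"),  ("red", "solid", "no", "square"),
    ("green", "solid", "yes", "square"),   ("yellow", "dashed", "yes", "square"),
    ("yellow", "solid", "no", "square"),   ("red", "dashed", "yes", "triangle"),
]
rows = [dict(zip(COLUMNS, r)) for r in DATA]

for attr in ("Color", "Outline", "Dot"):
    print(attr, round(gain(rows, attr, "Shape"), 3))
# Color has the highest gain, so it is the first split
```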
Splitting the green subset (3 triangles, 2 squares; Info = 0.971 bits):
• Gain(Outline) = 0.971 − 0 = 0.971 bits
• Gain(Dot) = 0.971 − 0.951 = 0.020 bits
• Outline separates the green subset perfectly (dashed → triangle,
solid → square), so it is chosen as the splitting attribute for this branch.
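Reusing the same `gain` sketch on just the green rows reproduces these numbers:

```python
green = [r for r in rows if r["Color"] == "green"]
print(round(gain(green, "Outline", "Shape"), 3))   # 0.971
print(round(gain(green, "Dot", "Shape"), 3))       # about 0.020
```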
[Figure: partially built tree. Color? at the root splits into red, green, and yellow branches; the green branch is split on Outline (dashed, solid)]
[Figure: the red branch is split on Dot (yes, no), completing the construction: Color? at the root, Dot? under red, Outline? under green]
Decision Tree
The final decision tree:
• Color = red → Dot (yes → triangle; no → square)
• Color = green → Outline (dashed → triangle; solid → square)
• Color = yellow → square
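As a closing check, a sketch of the finished tree as plain conditionals, verified against all 14 training rows (`rows` comes from the earlier gain computation):

```python
def predict(row):
    """Final tree: Color at the root, Dot under red, Outline under green."""
    if row["Color"] == "red":
        return "triangle" if row["Dot"] == "yes" else "square"
    if row["Color"] == "green":
        return "triangle" if row["Outline"] == "dashed" else "square"
    return "square"                       # the yellow subset is pure

assert all(predict(r) == r["Shape"] for r in rows)   # fits the training set
```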