
Lecture-10

Data Classification and Prediction


(Part-1)

Dr. J. Dhar
Introduction
• Data classification is the process of organizing data into
categories for its most effective and efficient use.
• Information-rich databases contain hidden information that can be used for intelligent decision making.
• Classification and prediction are forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.

Examples
• We can build classification models
 – to categorize weather conditions according to whether or not they are suitable for playing a game.
 – to categorize bank loan applications as either safe or risky.
• We can build a prediction model to predict the expenditures of potential customers on electronics items, given their income and occupation.
Classification Process

Prediction Process

Classification Task

Organization of the Lecture
• Classification by Decision Tree Induction
• Decision Tree Pruning
• Bayesian Classification
• Rule-Based Classification
• k-Nearest-Neighbor Classifiers

Classification by Decision Tree Induction
• Decision tree
 – A flowchart-like tree structure
 – Each internal node denotes a test on an attribute
 – Each branch represents an outcome of the test
 – Each leaf node represents a class label or a class distribution

• Popular decision tree algorithms such as ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach.
• Decision trees are constructed in a top-down, recursive, divide-and-conquer manner (a rough sketch of this procedure follows below).

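A rough Python sketch of this top-down, greedy construction, assuming a generic attribute-selection helper choose_split; the function names and the simplified stopping rules are illustrative, not the exact ID3/C4.5/CART pseudocode.

```python
# Simplified sketch of greedy, top-down, divide-and-conquer tree construction.
# choose_split is any attribute-selection measure (e.g., information gain).
from collections import Counter

def build_tree(rows, labels, attributes, choose_split):
    # Stop when the node is pure or no attributes remain:
    # return a leaf labelled with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    attr = choose_split(rows, labels, attributes)   # greedy: never backtracks
    node = {"attribute": attr, "branches": {}}

    # Partition the tuples on each value of the chosen attribute and recurse.
    partitions = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = partitions.setdefault(row[attr], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)

    remaining = [a for a in attributes if a != attr]
    for value, (sub_rows, sub_labels) in partitions.items():
        node["branches"][value] = build_tree(sub_rows, sub_labels,
                                             remaining, choose_split)
    return node
```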
Decision Tree Induction
• Decision tree generation consists of two phases:
 Phase 1: Tree construction (TRAINING phase)
  • At the start, all the training examples are at the root.
  • Partition the examples recursively based on selected attributes.
 Phase 2: Tree pruning
  • Identify and remove branches that reflect noise or outliers.
• After the decision tree is formed, one can classify an unknown sample:
 – Test the attribute values of the sample against the decision tree.

Example of a Decision Tree

Training Tuples:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Decision Tree Model (root node: Refund):

Refund?
  Yes: NO
  No: Marital Status?
    Married: NO
    Single, Divorced: Taxable Income?
      < 80K: NO
      > 80K: YES


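To make "test the attribute values of the sample against the decision tree" concrete, here is a minimal Python sketch that hard-codes the tree model above; the function name classify and the dictionary keys are illustrative, not part of the lecture.

```python
# Minimal sketch: walk the decision tree model above for one tuple.
def classify(tup):
    """Return the predicted Cheat value for a record with the attributes above."""
    if tup["Refund"] == "Yes":
        return "No"                          # Refund = Yes -> NO
    if tup["Marital Status"] == "Married":
        return "No"                          # Married      -> NO
    # Single or Divorced: compare Taxable Income with the 80K threshold.
    return "No" if tup["Taxable Income"] < 80_000 else "Yes"

# An unseen tuple, classified by following one root-to-leaf path.
sample = {"Refund": "No", "Marital Status": "Single", "Taxable Income": 90_000}
print(classify(sample))                      # -> "Yes"
```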
Example of Decision Tree

The same training tuples admit a different tree, rooted at Marital Status:

Marital Status?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: Taxable Income?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!


Decision Tree Algorithms

ID3 Algorithm: Information Gain
• ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by Claude Shannon on information theory, which studied the value, or "information content", of messages.
• The idea is to select the attribute that partitions the learning set into subsets that are as "pure" as possible.
• The attribute with the highest information gain is chosen as the splitting attribute at each level.
• This approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

Information Gain Approach
• To classify an object, a certain amount of information is needed
 – Info(D): the information required from the whole set of data tuples (other notation: I)
• After we have learned the value of attribute A, we only need some remaining amount of information to classify the object
 – Info_A(D): the information required after partitioning on attribute A (other notation: I_res)

• Gain
 – Gain(A) = Info(D) – Info_A(D), or I – I_res(A)
• The most "informative" attribute is the one that minimizes Info_A(D), i.e., maximizes Gain(A)

Mathematical Equations
• The expected information needed to classify a tuple in D is
  Info(D) = -\sum_{i=1}^{r} p_i \log_2(p_i),
  where p_i is the probability that an arbitrary tuple in D belongs to class C_i, i = 1, 2, …, r (the i-th class).
• The expected information required to classify a tuple from D after partitioning on attribute A into m distinct parts (D_1, D_2, …, D_m) is
  Info_A(D) = \sum_{j=1}^{m} \frac{|D_j|}{|D|} Info(D_j).
• The information gain is Gain(A) = Info(D) - Info_A(D); the attribute A with the highest information gain is chosen as the splitting attribute at node N (a small code sketch of these quantities follows below).

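A small Python sketch of the three quantities above, assuming the tuples are given as dictionaries keyed by attribute name; the function names info, info_a, and gain are mine, not the lecture's.

```python
# Info(D), Info_A(D) and Gain(A) from class labels and attribute values.
from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_a(rows, labels, attr):
    """Info_A(D): weighted entropy after partitioning D on attribute attr."""
    n = len(labels)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    return sum(len(part) / n * info(part) for part in parts.values())

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_a(rows, labels, attr)
```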
Entropy
• The average amount of information I needed to classify an object is given by the entropy measure:
  Info(D) = -\sum_{i=1}^{r} p_i \log_2(p_i)
• For a two-class problem:
  Info(D) = -p(c_1)\log_2 p(c_1) - p(c_2)\log_2 p(c_2)

[Figure: entropy plotted against p(c_1); it is 0 for a pure set and maximal at p(c_1) = 0.5.]
Example: Triangles and Squares
Data set: a set of classified objects
Class labels: triangle, square

#   Color   Outline  Dot  Shape
1   green   dashed   no   triangle
2   green   dashed   yes  triangle
3   yellow  dashed   no   square
4   red     dashed   no   square
5   red     solid    no   square
6   red     solid    yes  triangle
7   green   solid    no   square
8   green   dashed   no   triangle
9   yellow  solid    yes  square
10  red     solid    no   square
11  green   solid    yes  square
12  yellow  dashed   yes  square
13  yellow  solid    no   square
14  red     dashed   yes  triangle
Info(D): Calculation

• 5 triangles, 9 squares
• Class probabilities: p(triangle) = 5/14, p(square) = 9/14
• Info(D) = -(5/14)\log_2(5/14) - (9/14)\log_2(9/14) ≈ 0.940 bits

[Figure: data set partitioning — the training objects are split by Color into red, green, and yellow subsets.]
Information Gain

[Figure: the partition by Color from the previous slide, used to compute the gain of Color.]
• Info_Color(D) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.694 bits
• Gain(Color) = Info(D) - Info_Color(D) = 0.940 - 0.694 = 0.246 bits
Information Gain of All Attributes

• Attributes:
 – Gain(Color) = 0.246
 – Gain(Outline) = 0.151
 – Gain(Dot) = 0.048
• Heuristic: the attribute with the highest gain (here, Color) is chosen as the splitting attribute at the first level.
• This heuristic is local (a local minimization of impurity); the gains can be checked with the sketch below.

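These values can be reproduced mechanically; the sketch below is self-contained (it repeats compact versions of the info/gain helpers from the earlier sketch) and computes the gains directly from the 14 training objects.

```python
# Check Gain(Color), Gain(Outline) and Gain(Dot) for the triangle/square data.
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    info_a = sum(len(p) / len(labels) * info(p) for p in parts.values())
    return info(labels) - info_a

# The 14 training objects: (Color, Outline, Dot, Shape).
data = [
    ("green",  "dashed", "no",  "triangle"), ("green",  "dashed", "yes", "triangle"),
    ("yellow", "dashed", "no",  "square"),   ("red",    "dashed", "no",  "square"),
    ("red",    "solid",  "no",  "square"),   ("red",    "solid",  "yes", "triangle"),
    ("green",  "solid",  "no",  "square"),   ("green",  "dashed", "no",  "triangle"),
    ("yellow", "solid",  "yes", "square"),   ("red",    "solid",  "no",  "square"),
    ("green",  "solid",  "yes", "square"),   ("yellow", "dashed", "yes", "square"),
    ("yellow", "solid",  "no",  "square"),   ("red",    "dashed", "yes", "triangle"),
]
rows = [dict(zip(("Color", "Outline", "Dot"), d[:3])) for d in data]
labels = [d[3] for d in data]

for attr in ("Color", "Outline", "Dot"):
    print(attr, round(gain(rows, labels, attr), 3))
# Prints values matching the gains above up to rounding (~0.247, 0.152, 0.048).
```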
[Figure: the tree after splitting on Color. For the green subset (3 triangles, 2 squares):]
Gain(Outline) = 0.971 – 0 = 0.971 bits
Gain(Dot) = 0.971 – 0.951 = 0.020 bits
[Figure: the green branch is split on Outline (dashed → triangles, solid → squares). For the red subset (2 triangles, 3 squares):]
Gain(Outline) = 0.971 – 0.951 = 0.020 bits
Gain(Dot) = 0.971 – 0 = 0.971 bits
[Figure: the red branch is split on Dot (yes/no); the yellow branch is already pure, which completes the tree construction.]
Decision Tree
Color?
  red: Dot?
    yes: triangle
    no: square
  yellow: square
  green: Outline?
    dashed: triangle
    solid: square


A Defect of Information Gain
• It favors attributes with many values.
• Such an attribute splits the tuples at node N into many subsets, and if these are small, they will tend to be pure anyway.
• One way to rectify this is through a corrected measure, the information gain ratio.
• C4.5, a successor of ID3, uses an extension of information gain known as the gain ratio, which attempts to overcome this bias.
• It applies a kind of normalization to information gain using a "split information" value defined analogously to Info(D) (the standard formulas are sketched below).
Thank You
