
Data Mining

Techniques & Applications

Classification Techniques
Topics
 The Problem of Classification
 Decision Tree Approach and ID3 Algorithm
 Nearest Neighbour Approach and PEBLS Algorithm
 Bayesian Classifier Approach and Naïve Bayes
 Rule-based Classification Approach
 Principle of Artificial Neural Network
 The Problem of Model Overfitting and Solutions
 Evaluating Classification Models
 Comparison of Classification Solutions
 Classification in Practice
Problem Description
 Two-Stage Description of Classification
◼ Given a data set of examples, use a classification
technique, known as a classifier, to construct a classification
model.
[Diagram: Example Data Set + Model Construction Method → Model]
◼ Given a constructed model, classify a data record with
unknown class into one of the pre-defined classes.
[Diagram: Unseen Data + Model → Class Tag]
Problem Description
 Two-Stage Description of Classification
Example Data Set (weather conditions):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      FALSE  N
sunny     hot          high      TRUE   N
overcast  hot          high      FALSE  P
rain      mild         high      FALSE  P
rain      cool         normal    FALSE  P
rain      cool         normal    TRUE   N
overcast  cool         normal    TRUE   P
sunny     mild         high      FALSE  N
sunny     cool         normal    FALSE  P
rain      mild         normal    FALSE  P
sunny     mild         normal    TRUE   P
overcast  mild         high      TRUE   P
overcast  hot          normal    FALSE  P
rain      mild         high      TRUE   N

Model Construction Method: ID3
Resulting Model (decision tree):
outlook = sunny    → test humidity: high → N, normal → P
outlook = overcast → P
outlook = rain     → test windy: true → N, false → P

Unseen data record: (sunny, cool, high, true) → traversing the tree gives Class = N.
Problem Description
 Training, Testing and Evaluation Examples
◼ At the construction stage, the example data set is split into three subsets:
 Training set is for constructing the initial model.
 Test set is for tuning the initial model to overcome overfitting.
 Evaluation set is used to measure the model’s accuracy.
Model Construction:   Training Examples + Method → Initial Model
Model Refinement:     Testing Examples + Initial Model + Method → Final Model
Accuracy Measurement: Evaluation Examples + Final Model → Final Model with accuracy rate A
Problem Description
 Factors Influencing Successful
Classification
◼ Classification Accuracy (how accurate the
classification model is)
◼ Classification Performance (time taken for model
construction and time taken for model use)
◼ Comprehensibility of the classification model
(whether the model explains why the decision is
made)
Approaches to Classification
 Decision tree induction
◼ Constructing a decision tree as the classification model
◼ Traversing the tree from the root to a leaf to determine the class
◼ The main issue for decision tree construction methods is deciding which
attribute serves as the root of the current tree
 Nearest neighbour approach
◼ The model consists of a set of exemplars (historical records)
◼ The nearest exemplar or k-nearest exemplars are found for the
given record
◼ The class of the record is determined by the class of the
neighbour(s)
Approaches to Classification
 Rule-Based Approach
◼ The model consists of a set of IF..THEN rules in some order
◼ Rules are searched in an order until a suitable rule is found
◼ The suitable rule is fired to determine the class of a record
 Artificial Neural Network
◼ The model is a neural network with a suitable number of nodes,
trained with appropriate weights on the links.
◼ Attribute values of a record are given to the nodes on the input
layer. The neural network calculates and predicts the class of the
record through a collective reasoning of all network nodes.
Approaches to Classification
 Bayesian Approach
◼ The Bayes theorem is the model
◼ The posterior probability is estimated according
to the prior probability
◼ The likelihood of a record belonging to a class is
estimated according to the posterior probability
 Others (Support Vector Machine, Mixture
Models (GMM), Regression)
Decision Tree Induction
Overview
 Overview
◼ Input: a training set, i.e. a table with descriptive attributes and a class
attribute.
◼ Output: a decision tree of nodes and links.
◼ Process: take the input table and induce the decision tree.
 Procedure for Decision Tree Induction
1. Take the entire training set or part of it as the input
2. If all examples are of the same class, create a leaf node, and mark it
with the class name
3. If examples in the training set are of a mixture of different classes,
determine which attribute should be the root of the current tree
4. Partition the input examples into subsets according to the values of the
selected root attribute
5. Construct a decision tree recursively for each subset, repeating from step 2.
Decision Tree Induction:
ID3 Tree Induction Algorithm
Algorithm ID3_TreeConstruct (C: Training Set) : Decision Tree
begin
  Tree := Ø;
  if C is not empty then
    if all examples in C are of one class then
      Tree := a leaf node labeled by the class tag
    else begin
      select attribute T with maximal Information Gain as the root;
      partition C into C1, C2, ..., Cw by values of T;
      for each Ci (1 ≤ i ≤ w) do ti := ID3_TreeConstruct(Ci);
      Tree := a tree with T as root and t1, t2, ..., tw as subtrees;
      label the links from T to the subtrees with values of T
    end;
  return(Tree)
end;
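A minimal Python sketch of this induction loop (an illustration under the assumption that examples arrive as dictionaries; it is not the original ID3 implementation):

import math
from collections import Counter

def entropy(examples, target="Class"):
    # H(S) = -sum p(c) * log2 p(c) over the class values in the examples
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr, target="Class"):
    # information gain of the class given attribute attr
    total = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attrs, target="Class"):
    # returns a nested dict {attr: {value: subtree-or-leaf}} or a class label
    classes = {e[target] for e in examples}
    if len(classes) == 1:              # all examples of one class -> leaf
        return classes.pop()
    if not attrs:                      # no attribute left -> majority leaf
        return Counter(e[target] for e in examples).most_common(1)[0][0]
    root = max(attrs, key=lambda a: gain(examples, a, target))
    tree = {root: {}}
    for value in {e[root] for e in examples}:
        subset = [e for e in examples if e[root] == value]
        tree[root][value] = id3(subset, [a for a in attrs if a != root], target)
    return tree

Applied to the 14-example weather table, this selects Outlook at the root and reproduces the tree shown earlier.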
Decision Tree Induction:
ID3 Tree Induction Algorithm
Training Set: the 14-example weather table (Outlook, Temperature, Humidity, Windy, Class) shown earlier.

Gain(Outlook) = 0.246 bits
Gain(Temperature) = 0.029 bits
Gain(Humidity) = 0.151 bits
Gain(Windy) = 0.048 bits
⇒ Outlook is chosen as the root, with branches sunny, overcast and rain. The
overcast subset is pure, so that branch becomes a leaf P.

Training Set(outlook=sunny):
Temperature  Humidity  Windy  Class
hot          high      FALSE  N
hot          high      TRUE   N
mild         high      FALSE  N
cool         normal    FALSE  P
mild         normal    TRUE   P

Gain(Temperature) = 0.571 bits
Gain(Humidity) = 0.971 bits
Gain(Windy) = 0.020 bits
⇒ Humidity is chosen as the root of the sunny subtree: high → N, normal → P.
Decision Tree Induction:
ID3 Tree Induction Algorithm
 Different attributes may give the same amount of information gain. In this case, other
factors (such as random choice) decide which attribute serves as the root.
 Partitioning the training set into subsets may result in an empty subset, which means no
examples in the training set have the combination of attribute values labeled on the
branches along the path. In this case, a leaf node with label “Null” is created. This
means that the class of the examples cannot be decided.
 ID3 sometimes uses a part of the training set to build a tree, and then repeats the
process so that a number of candidate trees are created. Then the best-fit tree is chosen
at the evaluation stage.
 For a numeric attribute, ID3 finds the best split point and creates a node with
two branches signified by ≤ and >.
 The complexity of a tree is measured in terms of the number of nodes. A tree is not a
good tree if it contains many branches, many paths from the root to a leaf, and
very few examples at each leaf.
Decision Tree Induction:
Other Algorithms
 ID3 Family algorithms: C4.5 and C5 (See5)
◼ Similar to ID3
◼ Use information gain ratio as attribute selection measure
 CART algorithm
◼ Produce binary decision tree
◼ Nodes represent a Boolean test
◼ Use Gini Index of Impurity as attribute selection measure
 CHAID algorithm
◼ Use Chi-square test (χ²) as attribute selection measure
 Studies show that there are only marginal differences among
the attribute selection measures. Currently no measure is
significantly better than the others.
Decision Tree Induction:
Attribute Selection Measure
 Random selection
 Information Gain/Gain Ratio
 Gini Index of Impurity
 Chi-square Test
Decision Tree Induction:
Random Selection
[Figure: a decision tree built by random attribute selection — the root is Temperature
(branches cool, mild, hot); its subtrees test Outlook and Windy, which in turn test
Windy, Humidity and Outlook before reaching the P and N leaves. The tree is larger
than the one induced with information gain.]
Decision Tree Induction:
Information Gain Measure
 An information system S consists of a set of n possible events E1,
E2,…, En. Each event may occur with a probability P(Ek) (1≤k≤n) and
P(E1) + P(E2) + … + P(En) = 1.
 In a table, each attribute can be considered as an information
system (e.g. Outlook)
 In classification, classifying tuples with attribute A of v values in the
data set C into two classes (Positive, Negative) involves two
information systems at a time:
◼ Attribute A:
 Events (values): a1, a2, …, av
 Probabilities of the events: p(a1), p(a2), …, p(av)
◼ Attribute Class:
 Events (values): Positive, Negative
 Probabilities of the events: p(Positive), p(Negative)
Decision Tree Induction:
Information Gain Measure
 The self-information of an event E represents the amount
of information being conveyed when E occurs. It is defined
as:

I(E) = log(1/p(E)) = −log p(E)
 Rationale:
◼ The smaller the chance for E to happen, the more information it
conveys once it really happens
◼ The logarithm measures the degree of magnitude of chances
◼ The base indicates the unit of measure (e.g. 2 => bits, 10 =>
digits)
e.g. In the weather condition table:

I(Class = P) = −log2 p(Class = P) = −log2(9/14) = 0.63743 bits
Decision Tree Induction:
Information Gain Measure
 The average of the self-information of all events in
an information system S is called entropy, and is
defined as:

H(S) = Σk=1..n p(Ek) · I(Ek) = −Σk=1..n p(Ek) · log p(Ek)

 Property: H(S) = 0 when one event always
happens and the other events never happen. H(S)
takes the maximum value log2 n if every event has
the same probability (1/n). Therefore, entropy H(S)
represents a degree of uncertainty – lower entropy,
more certainty; higher entropy, more uncertainty.
Decision Tree Induction:
Information Gain Measure
 e.g. In the weather condition table (14 examples, 9 of class P and 5 of class N):

H(Class) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94 bits
Decision Tree Induction:
Information Gain Measure
 The conditional self-information of event E of system S1, given
that event F of system S2 has occurred, is defined as

I(E | F) = log(1/p(E | F)) = −log p(E | F) = −log( p(E ∩ F) / p(F) )

 The conditional entropy of system S1, in the presence of system
S2, is defined as

H(S1 | S2) = Σi=1..n Σj=1..m p(Ei ∩ Fj) · I(Ei | Fj)
           = −Σi=1..n Σj=1..m p(Ei ∩ Fj) · log( p(Ei ∩ Fj) / p(Fj) )

 The mutual information between two systems S1 and S2 is also
known as the information gain of S1 given S2. It can be calculated as

Gain(S1) = I(S1, S2) = H(S1) − H(S1 | S2)
Decision Tree Induction:
Information Gain in ID3
 Suppose there are p positive and n negative examples,
and there are pi positive and ni negative examples whose
attribute A has the value ai. Then we have a contingency
table:

Attribute A  Positive  Negative
a1           p1        n1
a2           p2        n2
……
av           pv        nv
Totals       p         n

H(Class) = −(p/(p+n)) log(p/(p+n)) − (n/(p+n)) log(n/(p+n))

H(Class | A) = Σi=1..v ((pi+ni)/(p+n)) · H(Class | A = ai)
where H(Class | A = ai) = −(pi/(pi+ni)) log(pi/(pi+ni)) − (ni/(pi+ni)) log(ni/(pi+ni))

gain(A) = H(Class) − H(Class | A)
Decision Tree Induction:
Information Gain Measure
 In the weather condition table, p = 9 and n = 5:

H(Class) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94 bits

For Outlook:
psunny = 2, nsunny = 3, H(Class | Outlook = sunny) = 0.971 bits
povercast = 4, novercast = 0, H(Class | Outlook = overcast) = 0.0 bits
prain = 3, nrain = 2, H(Class | Outlook = rain) = 0.971 bits

H(Class | Outlook) = (5/14)(0.971) + (4/14)(0.0) + (5/14)(0.971) = 0.694 bits

gain(Outlook) = 0.94 − 0.694 = 0.246 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.151 bits
gain(Windy) = 0.048 bits
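A short Python check of the worked numbers above (a sketch; H(p, n) is a helper defined for this illustration):

import math

def H(p, n):
    # entropy of a split with p positive and n negative examples
    total = p + n
    return sum(-x / total * math.log2(x / total) for x in (p, n) if x)

h_class = H(9, 5)
h_given_outlook = (5/14) * H(2, 3) + (4/14) * H(4, 0) + (5/14) * H(3, 2)
print(h_class)                    # 0.940 bits
print(h_given_outlook)            # 0.694 bits
print(h_class - h_given_outlook)  # 0.247 bits ~ gain(Outlook), 0.246 after truncation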
Decision Tree Induction:
Information Gain Measure
 What Does Information Gain Actually Tell Us?
◼ Since entropy H(Class) represents the degree of
uncertainty over class values, the information gain
represents the reduction of uncertainty given the choice of
attribute A.
◼ Choosing the attribute with the maximum information gain
reduces the uncertainty over the class the most.
◼ The practical implication is that most examples are classified
as early as possible in a decision tree.
◼ Information gain favours attributes with many values. A
modification made in the algorithm C4.5 is the use of the
information gain ratio:

gainRatio(A) = gain(A) / H(A) = (H(Class) − H(Class | A)) / H(A)
Decision Tree Induction
 Handling Noise Data in Decision Trees
◼ Noise data records in the training set may cause the following:
 Same description but different classes. In other words, attributes are
not adequate to classify an object.
 A decision tree of spurious complexity
◼ How to solve the problems:
 For the inadequate attribute problem: In addition to labeling a leaf
with class ID, a real number within [0,1] is also kept at the leaf. The
number represents the probability of the class.
 For the spurious complexity problem: use a user-defined minimum
information gain threshold or measure the independence between
class and the attribute to decide whether to continue in constructing
the tree.
Nearest Neighbour Approach
Overview
 A lazy learning approach: delaying the classification decision to the time
of the actual classification. In other words, there is NO training!
 Rationale: “If it walks like a duck, quacks like a duck, and looks like a
duck, it probably is a duck”.
 The classification model is a memory space containing all the training
examples. Each training example is considered as a point in the N-
dimensional space.
 There may be a selection process for training examples.
 To classify an unseen record, compute its proximity to all training
examples and locate either 1 or k nearest neighbour examples. The
classes of the neighbours determine the class of the record.
 In the case where there is more than one neighbour, the class of the
unseen record is determined by voting.
Nearest Neighbour Approach
KNN Algorithm
Algorithm kNN (Tr: trainingSet; k: integer; r: dataRecord) : Class
begin
for each training example t in Tr do
calculate distance d(t, r) upon descriptive attributes
end for;
select k nearest neighbours into D according to the distance vector
Class = majority class in D
return Class
end;
Questions:
◼ How to measure distance?
◼ How to deal with nominal attributes?
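A minimal kNN sketch in Python addressing both questions in one common way — squared differences for numeric attributes and a 0/1 mismatch for nominal ones (an assumption of this sketch; the PEBLS algorithm below treats nominal attributes more carefully):

import math
from collections import Counter

def distance(a, b):
    # mixed distance: squared difference for numbers, 0/1 mismatch otherwise
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0 if x == y else 1
    return math.sqrt(total)

def knn(training, k, record):
    # training: list of (attribute_tuple, class) pairs
    neighbours = sorted(training, key=lambda t: distance(t[0], record))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]

# e.g. knn([((1.0, "red"), "P"), ((4.0, "blue"), "N"), ((1.5, "red"), "P")], 3, (1.2, "red"))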
Nearest Neighbour Approach
The PEBLS Algorithm
 Minimum training
 Weights attached to the training
examples reflecting performance
 A proximity measure based on class
distribution rather than “likelihood”
 Suitable for symbolic attributes
(categorical and discrete)
Nearest Neighbour Approach
The PEBLS Algorithm
Algorithm PEBLS(Tr: training set) : Exemplar Space
begin
Space := Ø;
for each attribute ai of Tr do
construct a value difference table on all possible pairs of values;
end for;
for each instance e in Tr do
add e into Space as an exemplar with weight = 1;
calculate distances between e and each exemplar in Space;
find the nearest neighbour exemplar e’;
adjust the weight of e’ according to its prediction
end for;
…..
end;
Nearest Neighbour Approach
The PEBLS Algorithm
 PEBLS: Value Difference Table for an Attribute
 For a training set of examples, an attribute A with values A1, A2, …, Am
and classes C1, C2, …, Ck forms a contingency table, where Nij is the
number of examples in the data set that have value Ai and belong to class Cj.
 The difference between two values V1 and V2 of the attribute is defined as:

d(V1, V2) = Σi=1..k | C1i/C1 − C2i/C2 |^r

where:
◼ r: constant that is normally set to 1
◼ C1: number of examples with value V1
◼ C2: number of examples with value V2
◼ C1i: number of examples with value V1 and class i
◼ C2i: number of examples with value V2 and class i
Nearest Neighbour Approach
The PEBLS Algorithm
 PEBLS: Value Difference Table for an Attribute
 The value difference table contains value differences
between all possible pairs of values for the attribute:

attribute A  A1         A2         …  Am
A1           d(A1, A1)  d(A1, A2)     d(A1, Am)
A2           d(A2, A1)  d(A2, A2)     d(A2, Am)
…
Am           d(Am, A1)  d(Am, A2)     d(Am, Am)

e.g. in the weather condition table, for Outlook (5 sunny examples: 2 P, 3 N;
4 overcast examples: 4 P, 0 N):

d(sunny, overcast) = |2/5 − 4/4| + |3/5 − 0/4| = 3/5 + 3/5 = 1.2

Value Difference Table for Outlook:
          sunny  overcast  rain
sunny     0      1.2       0.4
overcast  1.2    0         0.8
rain      0.4    0.8       0
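A small Python sketch of the value-difference metric with r = 1, reproducing the Outlook figures (names are illustrative; only the Outlook and Class columns are needed):

from collections import Counter

# (Outlook, Class) pairs from the 14-row weather table
rows = [("sunny","N"),("sunny","N"),("overcast","P"),("rain","P"),("rain","P"),
        ("rain","N"),("overcast","P"),("sunny","N"),("sunny","P"),("rain","P"),
        ("sunny","P"),("overcast","P"),("overcast","P"),("rain","N")]

def value_diff(v1, v2, rows, classes=("P", "N"), r=1):
    # d(V1,V2) = sum over classes of |C1i/C1 - C2i/C2|^r
    c1 = [cls for v, cls in rows if v == v1]   # examples with value V1
    c2 = [cls for v, cls in rows if v == v2]   # examples with value V2
    n1, n2 = Counter(c1), Counter(c2)
    return sum(abs(n1[c]/len(c1) - n2[c]/len(c2)) ** r for c in classes)

print(value_diff("sunny", "overcast", rows))   # 1.2
print(value_diff("sunny", "rain", rows))       # 0.4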
Nearest Neighbour Approach
The PEBLS Algorithm
 PEBLS: Distance Function
◼ The distance between any two data objects X and Y is measured by

Δ(X, Y) = wX · wY · Σi=1..M d(xi, yi)^r

where wX and wY are weights for X and Y, and M is the number of attributes. xi and
yi are respectively the values of the ith attribute for X and Y, and r is a constant set to 2
(Euclidean distance).
◼ The weight is the ratio of the total number of uses of the data object to the
total number of correct uses of the data object. Initially, all training examples have a weight of 1.
e.g.
Δ(row1, row2) = d(row1outlook, row2outlook)² +
d(row1temp, row2temp)² + d(row1humidity, row2humidity)² +
d(row1windy, row2windy)² = d(sunny,sunny)² + d(hot,hot)² +
d(high,high)² + d(false,true)² = 0 + 0 + 0 + (1/2)² = 1/4
Nearest Neighbour Approach
The PEBLS Example
Value difference tables for all four attributes:

Outlook   sunny  overcast  rain
sunny     0      1.2       0.4
overcast  1.2    0         0.8
rain      0.4    0.8       0

Temperature  hot   mild  cool
hot          0     0.33  0.5
mild         0.33  0     0.33
cool         0.5   0.33  0

Humidity  high   normal
high      0      0.857
normal    0.857  0

Windy  true   false
true   0      0.418
false  0.418  0

Exemplar Space:
E1: sunny hot high false N (weight 1)
Nearest Neighbour Approach
The PEBLS Example
Adding E2 (sunny hot high true N): E1 is used and predicts correctly, so
E1's weight stays 2/2 = 1.

E1: sunny hot high false N (weight 1)
E2: sunny hot high true N (weight 1)

Adding E3 (overcast hot high false P):
Δ(E1, E3) = 1.44 and Δ(E2, E3) = 1.615, so E1 is the nearest neighbour.
E1 predicts class N but E3 is P, so E1's weight is increased to 3/2 = 1.5.

E1: sunny hot high false N (weight 1.5)
E2: sunny hot high true N (weight 1)
E3: overcast hot high false P (weight 1)
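A quick Python check of these two distances, plugging in the printed value-difference tables (weights are 1 at this point and r = 2):

# value differences read off the printed tables (keys stored sorted)
d_outlook = {("overcast", "sunny"): 1.2, ("overcast", "overcast"): 0.0,
             ("sunny", "sunny"): 0.0}
d_windy   = {("false", "true"): 0.418, ("false", "false"): 0.0,
             ("true", "true"): 0.0}

def delta(x, y, w_x=1.0, w_y=1.0, r=2):
    # PEBLS distance: wX * wY * sum over attributes of d(xi, yi)^r;
    # Temperature and Humidity are equal in E1..E3, so their terms are 0
    d_o = d_outlook[tuple(sorted((x[0], y[0])))]
    d_w = d_windy[tuple(sorted((x[3], y[3])))]
    return w_x * w_y * (d_o ** r + d_w ** r)

E1 = ("sunny", "hot", "high", "false")
E2 = ("sunny", "hot", "high", "true")
E3 = ("overcast", "hot", "high", "false")
print(delta(E1, E3))  # 1.44
print(delta(E2, E3))  # ~1.615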
Nearest Neighbour Approach
The PEBLS Example
Current exemplar space:
E1: sunny hot high false N (weight 1.5)
E2: sunny hot high true N (weight 1)
E3: overcast hot high false P (weight 1)

Adding E4 (rain mild high false P): calculate Δ(E4, E1), Δ(E4, E2) and Δ(E4, E3),
find the nearest neighbour and adjust its weight, and so on, until all 14
training examples (E1 … E14, the last being rain mild high true N) have been
added to the exemplar space.
Bayesian Classifiers
 Bayes Theorem
◼ Bayes Theorem Equation:

P(Y | X) = P(X | Y) · P(Y) / P(X)

◼ P(Y|X) is called the posterior probability of Y
in the presence of X, whereas P(Y) is called
the prior probability of Y.
◼ Bayes Theorem suggests that the likelihood of Y
happening in the presence of circumstance X
depends on the prior probability of Y.
Bayesian Classifiers
 e.g. For two rival football teams (FCS and željo), FCS wins 65%
of the time and željo wins the remaining matches. Among the games
won by FCS, 30% are played at Grbavica, željo's home ground. 75%
of željo's wins are obtained at home. If a match is
planned at Grbavica, which team is more likely to win?
Let X represent hosting team, and Y represent winning team. We
know:
 P(WIN=FCS)=0.65, P(WIN=željo)=1–0.65=0.35,
P(HOME=željo|WIN=FCS)=0.3,
P(HOME=željo|WIN=željo)=0.75
Using Bayes theorem (the common denominator P(HOME=željo) can be
ignored when comparing the two posteriors):
P(WIN=željo|HOME=željo) ∝ 0.75 × 0.35 = 0.2625
P(WIN=FCS|HOME=željo) ∝ 0.3 × 0.65 = 0.195
Therefore, željo is more likely to win.
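The same comparison in a few lines of Python (only the arithmetic; names are illustrative):

p_win_fcs, p_win_zeljo = 0.65, 0.35
p_home_given_fcs, p_home_given_zeljo = 0.30, 0.75

# numerators of Bayes theorem; the shared denominator P(HOME=zeljo) cancels
score_fcs   = p_home_given_fcs * p_win_fcs       # 0.195
score_zeljo = p_home_given_zeljo * p_win_zeljo   # 0.2625
print("zeljo" if score_zeljo > score_fcs else "FCS")   # zeljo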
Bayesian Classifiers
 Using Bayes Theorem for Classification
◼ Treat Y as the class attribute. Use Bayes theorem to predict the probabilities
of P(y1|X), P(y2|X), …, P(yk|X) for a given attribute value combination X.
e.g. X = (HomeOwner=no, Married=yes, Income=120k)
Calculate P(DefaultPay=yes|X) and P(DefaultPay=no|X) and
determine which probability is higher for X, hence indicating the class for
X.
◼ Among the terms for Bayesian probability, P(Y) can be calculated from the
training data, and P(X) is the given evidence. In fact P(X) can be ignored when we
compare Bayesian results, because the same evidence is used for all probability
calculations. The class-conditional probability P(X|Y) needs to be estimated
from training data.
◼ Different ways of estimating P(X|Y):
 Naïve Bayes
 Bayesian Belief Network
Bayesian Classifiers
Naïve Bayes Classifier
 Take the assumption that attributes are conditionally independent of each other given
that the class label is known.
 Therefore the class-conditional probability P(X|Y=y) is estimated as the product of all
the conditional probabilities P(X1|Y=y), P(X2|Y=y), …, P(Xd|Y=y), i.e.

P(X | Y=y) = Πi=1..d P(Xi | Y=y)     P(Y | X) = P(Y) · Πi=1..d P(Xi | Y=y) / P(X)

 For a categorical attribute, P(Xi=xi|Y=y) is estimated according to the fraction of training
examples of class y that take the attribute value xi.
 For a continuous attribute, it is assumed that the attribute values follow a normal
distribution; the probability P(Xi=xi|Y=y) is therefore estimated from the
Gaussian probability density function, using the mean and variance of the attribute
values of the training examples for the class:

P(Xi = xi | Y = y) = (1 / (√(2π) · σiy)) · exp( −(xi − μiy)² / (2σiy²) )
Bayesian Classifiers
Naïve Bayes Classifier
Example: classify X = (Outlook=sunny, Temperature=80, Humidity=normal, Windy=false)

Training Set (Temperature is numeric):
Outlook   Temperature  Humidity  Windy  Class
sunny     85           high      FALSE  N
sunny     80           high      TRUE   N
overcast  83           high      FALSE  P
rain      70           high      FALSE  P
rain      68           normal    FALSE  P
rain      65           normal    TRUE   N
overcast  64           normal    TRUE   P
sunny     72           high      FALSE  N
sunny     69           normal    FALSE  P
rain      75           normal    FALSE  P
sunny     75           normal    TRUE   P
overcast  72           high      TRUE   P
overcast  81           normal    FALSE  P
rain      71           high      TRUE   N

Estimated probabilities:
P(Class=N) = 5/14 = 0.357; P(Class=P) = 9/14 = 0.643
P(Outlook=sunny|Class=N) = 3/5; P(Outlook=overcast|Class=N) = 0; P(Outlook=rain|Class=N) = 2/5
P(Outlook=sunny|Class=P) = 2/9; P(Outlook=overcast|Class=P) = 4/9; P(Outlook=rain|Class=P) = 3/9
P(Humidity=high|Class=N) = 4/5; P(Humidity=normal|Class=N) = 1/5
P(Humidity=high|Class=P) = 3/9; P(Humidity=normal|Class=P) = 6/9
P(Windy=false|Class=N) = 2/5; P(Windy=true|Class=N) = 3/5
P(Windy=false|Class=P) = 6/9; P(Windy=true|Class=P) = 3/9
For Temperature:
if Class=N: sample mean = 74.6, sample variance = 62.3
if Class=P: sample mean = 73, sample variance = 38
Bayesian Classifiers
Naïve Bayes Classifier
 Example
To classify X = (Outlook=sunny, Temperature=80, Humidity=normal, Windy=false):

P(X|Class=N) = P(Outlook=sunny|Class=N) × P(Temperature=80|Class=N) ×
P(Humidity=normal|Class=N) × P(Windy=false|Class=N)
= 3/5 × 0.040007 × 1/5 × 2/5 = 0.00192

P(X|Class=P) = P(Outlook=sunny|Class=P) × P(Temperature=80|Class=P) ×
P(Humidity=normal|Class=P) × P(Windy=false|Class=P)
= 2/9 × 0.03397 × 6/9 × 6/9 = 0.00335

P(Class=N|X) = (0.357 × 0.00192)/P(X)
P(Class=P|X) = (0.643 × 0.00335)/P(X)

P(Class=P|X) > P(Class=N|X), therefore the class for X is P.
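A compact Python sketch of this Naïve Bayes calculation over the training set above (names are illustrative; the Gaussian density handles the numeric Temperature attribute):

import math

# rows: (Outlook, Temperature, Humidity, Windy, Class) from the training set above
rows = [("sunny",85,"high","FALSE","N"), ("sunny",80,"high","TRUE","N"),
        ("overcast",83,"high","FALSE","P"), ("rain",70,"high","FALSE","P"),
        ("rain",68,"normal","FALSE","P"), ("rain",65,"normal","TRUE","N"),
        ("overcast",64,"normal","TRUE","P"), ("sunny",72,"high","FALSE","N"),
        ("sunny",69,"normal","FALSE","P"), ("rain",75,"normal","FALSE","P"),
        ("sunny",75,"normal","TRUE","P"), ("overcast",72,"high","TRUE","P"),
        ("overcast",81,"normal","FALSE","P"), ("rain",71,"high","TRUE","N")]

def gaussian(x, values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)  # sample variance
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(cls, outlook, temp, humidity, windy):
    sub = [r for r in rows if r[4] == cls]
    prior = len(sub) / len(rows)
    p_outlook  = sum(1 for r in sub if r[0] == outlook) / len(sub)
    p_temp     = gaussian(temp, [r[1] for r in sub])
    p_humidity = sum(1 for r in sub if r[2] == humidity) / len(sub)
    p_windy    = sum(1 for r in sub if r[3] == windy) / len(sub)
    return prior * p_outlook * p_temp * p_humidity * p_windy

for cls in ("N", "P"):   # numerators of P(Class|X); P(X) cancels when comparing
    print(cls, score(cls, "sunny", 80, "normal", "FALSE"))
# P wins (0.643 * 0.00335 > 0.357 * 0.00192), so X is classified as P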
Rule-based Classification
Classification Rule
 Rule form:
◼ (A1 op1 v1) ∧ (A2 op2 v2) ∧ … ∧ (Am opm vm) → Class = yi
where Aj is an attribute name, vj is an attribute value, opj is a
comparison operator (<, >, =, ≠), and yi is a class label.
e.g. (GiveBirth = no) ∧ (AerialCreature = yes) → Class = Bird
 Rules are derived from a given training set
 A rule covers a data record if the attribute values of the record
match the antecedent part of the rule.
 Rule quality is measured by coverage and accuracy
◼ Given a data set D, let |D| represent the number of records in D,
and |A| represent the number of records that are covered by the
rule. Let |Ay| represent the number of records covered by the rule
and having class label y.

coverage(r) = |A| / |D|        accuracy(r) = |Ay| / |A|
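The two measures in Python (a sketch; rules are modelled as predicate functions over record dictionaries, which is an assumption of this illustration):

def coverage(rule, records):
    # fraction of records whose attribute values match the rule antecedent
    return sum(1 for r in records if rule(r)) / len(records)

def accuracy(rule, records, label):
    # of the covered records, the fraction actually carrying the rule's class
    covered = [r for r in records if rule(r)]
    return sum(1 for r in covered if r["Class"] == label) / len(covered)

# e.g. rule r1 from the next slide: GiveBirth = no and Creature = Aerial -> Bird
r1 = lambda rec: rec["GiveBirth"] == "no" and rec["Creature"] == "Aerial"
data = [{"GiveBirth": "no", "Creature": "Aerial", "Class": "Bird"},
        {"GiveBirth": "no", "Creature": "Aquatic", "Class": "Fish"},
        {"GiveBirth": "yes", "Creature": "Land", "Class": "Mammal"}]
print(coverage(r1, data), accuracy(r1, data, "Bird"))   # 0.333..., 1.0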
Rule-based Classification
Models
 The classification model consists of a set of rules
e.g.
r1: (GiveBirth = no) ∧ (Creature = Aerial) → Class = Bird
r2: (GiveBirth = no) ∧ (Creature = Aquatic) → Class = Fish
r3: (GiveBirth = yes) ∧ (BodyTemperature = warm) → Class = Mammal
 Given a record, rules are tested in a sequence from the beginning
 A rule is triggered or fired when the rule covers a record, and the class
for the record is then determined
 Rules are mutually exclusive if no more than one rule is triggered by the
same record. Rules are exhaustive if each combination of attribute
values is covered by a rule.
 Default rule: () → yj as the final rule where class label yj normally refers
to the majority class of training records not yet covered by the existing
rules.
Rule-based Classification
Models
 Rules in the classifier are either ordered or unordered.
◼ Ordered: rules are listed in decreasing order of their priority (such as accuracy,
coverage, total description length, etc.). Records are classified according to the
same order.
◼ Unordered: rules are not ordered, and more than one rule may be triggered upon
the same record. In this case, the class label of the record is determined by voting
of the rules fired. Normally, the class in majority is used to determine the class of
the record.
 Rule ordering schemes
◼ Rule-based ordering: ordering by a quality measure
 Advantage: best rule is applied first
 Disadvantage: later rules are difficult to understand after all previous rules are
negated.
◼ Class-based ordering: ordering by class
 Advantage: simpler rules and easier to understand
 Disadvantage: may not apply the “best” rule
◼ Combined
Rule-based Classification
Sequential Covering Algorithm
algorithm SeqCover (Tr: training set; Y : Classes): set of rules
Begin
R = Ø;
A = the set of all attribute value pairs {(Ai, vj)};
for each class y in Y – {yk} do
while a stopping condition not met do
r = learn_one_rule(Tr, A, y);
remove training examples covered by r from Tr;
add r to the bottom of the rule list R;
end while;
end for;
insert default rule {}→yk to the bottom of the rule list R;
return(R);
End;
Rule-based Classification
Sequential Covering Algorithm
[Figure: the original training set, with four regions of examples successively
covered by rules R1, R2, R3 and R4.]
Rule-based Classification
Sequential Covering Algorithm
 During the rule extraction process, all examples of one class are
considered positive and examples of other classes are negative. A rule is
desirable if it covers most positive examples and no or very few negative
examples.
 Once a rule is discovered, examples covered by the rule are removed,
leaving negative examples and positive examples not covered yet
behind.
 The learn_one_rule operation finds the optimal rule using a greedy
approach: an initial rule is generated and continuously refined until a
certain evaluation criterion is satisfied.
◼ Rule Growing: start with {}→y, then select the best possible (Ai, vj) pair and add it
to the antecedent. Repeat the process until the rule quality no longer improves (see
the code sketch below).
◼ Rule Refining: start with a positive example, then remove one of its conjuncts so
that it covers more positive examples. Repeat the process until the rule starts to
cover negative examples.
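A minimal Python sketch of greedy, general-to-specific rule growing, scoring candidates by accuracy on the covered records (one possible choice of quality measure; the function names are illustrative):

def grow_rule(records, target, attrs):
    # greedily add (attribute, value) conjuncts while accuracy improves
    rule = {}                                   # empty antecedent {} -> target

    def covered(r):
        return [x for x in records if all(x[a] == v for a, v in r.items())]

    def acc(r):
        c = covered(r)
        return sum(1 for x in c if x["Class"] == target) / len(c) if c else 0.0

    while True:
        best, best_acc = None, acc(rule)
        for a in attrs:
            if a in rule:
                continue
            for v in {x[a] for x in covered(rule)}:
                cand = dict(rule, **{a: v})
                if acc(cand) > best_acc:
                    best, best_acc = cand, acc(cand)
        if best is None:                        # no conjunct improves accuracy
            return rule
        rule = best

On a table like the Refund/Status/Income example on the next slide, this grows {} → Class=Yes towards Refund=No ∧ Status=Single → Class=Yes.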
Rule-based Classification
Sequential Covering Algorithm
Example (a loan data set with attributes Refund, Status and Income; 3 Yes and 4 No records):

(a) Rule growing (general-to-specific): start with {} → (Class=Yes), covering
Yes: 3, No: 4. Candidate conjuncts and their coverage:
Refund=No (Yes: 3, No: 4), Status=Single (Yes: 2, No: 1),
Status=Divorced (Yes: 1, No: 0), Status=Married (Yes: 0, No: 3),
Income>80K (Yes: 3, No: 1), …
Growing continues towards e.g. Refund=No, Status=Single → (Class=Yes).

(b) Rule refining (specific-to-general): start with a rule built from one positive
example, e.g. Refund=No, Status=Single, Income=85K → (Class=Yes), and drop
conjuncts so that the rule covers more positive examples.
Artificial Neural Network
 Artificial Neuron (unit)
[Figure: inputs i1, i2, i3 with weights w1, w2, w3 feed a summing unit Σ, whose
output passes through the sigmoid function to produce y.]
where:
◼ sum function:
x = w1*i1 + w2*i2 + w3*i3
◼ transformation fn:
y = sigmoid(x) = 1/(1 + e^-x)
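The unit as a few lines of Python (a direct sketch of the two functions above):

import math

def unit(inputs, weights):
    # one artificial neuron: weighted sum followed by the sigmoid transformation
    x = sum(w * i for w, i in zip(weights, inputs))   # x = w1*i1 + w2*i2 + w3*i3
    return 1.0 / (1.0 + math.exp(-x))                 # y = sigmoid(x)

print(unit([0.5, 1.0, -1.0], [0.2, 0.4, 0.1]))        # output strictly between 0 and 1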
Artificial Neural Network
 The Network
[Figure: a feed-forward network with an input layer (x1 … x5), one hidden layer,
and an output layer producing y.]
◼ A neural network can have many
hidden layers, but one layer is
normally sufficient
◼ The wider a hidden layer is (i.e. the
more units it has), the more capacity for
pattern recognition, but the less general
beyond the training examples
◼ The output layer can have more than
one unit, predicting the likelihood of a
number of classes
◼ Constant inputs can be fed into
the units in the hidden and output
layers as inputs.
Artificial Neural Network
 A Learning Algorithm to Train an ANN
algorithm trainNetwork (Tr: training set): Network
Begin
R = initial network with a particular topology;
initialise the weight vector with random values w(0);
repeat
for each training example t=<xi, yi> in Tr do
compute the predicted class output ŷ(k);
for each weight wj in the weight vector do
update the weight wj: wj(k+1) = wj(k) + λ(yi − ŷ(k))xij
end for;
end for;
until stopping criterion is met;
return R;
end;
(λ is the learning rate)
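A runnable sketch of this update loop for a single sigmoid unit (assumptions: learning rate λ = 0.1 and a fixed number of epochs as the stopping criterion):

import math, random

def train(examples, n_inputs, rate=0.1, epochs=100):
    # examples: list of (x_vector, y) with y in {0, 1}; returns trained weights
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
            # w_j(k+1) = w_j(k) + lambda * (y - y_hat) * x_j
            w = [wj + rate * (y - y_hat) * xj for wj, xj in zip(w, x)]
    return w

# e.g. learn a noisy OR of two inputs (third input is a constant bias of 1)
data = [((0, 0, 1), 0), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
weights = train(data, 3)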
Problem of Overfitting
 All algorithms that build a classification model based on a finite set of training
examples tend to have a problem known as model overfitting: the model
induced fits the training examples too well. It reflects the features of the
training examples that may not be the features of the actual data population.
 Consequence of overfitting: less accuracy when the model is used in practice.
 Reasons: presence of noise and lack of representative examples
 Problem appearance in different classification approaches
◼ Decision Tree:
 Fuller tree with more branches and nodes
 Most leaf nodes only suit a small number of examples
◼ Nearest Neighbour:
 Bad influence of “not-so-appropriate” training examples
◼ Rule-based Approach:
 Poor quality rules, complex rules
◼ Artificial Neural Network:
 Inappropriate weights for network links
Problem of Overfitting
Decision Trees: Tree Pruning
 Obtain an independently sampled set of test examples. Use the tree to
classify the test examples, and use a tree refinement method to prune
the tree.
 Tree pruning means to substitute certain sub-trees by leaf nodes,
making the whole tree more robust with fewer branches and nodes.
 The pruned tree should make fewer errors in practice when features of
both training and testing examples are considered.
 Overview of Tree Pruning Methods:
◼ Reduced Error Pruning
◼ Cost Complexity Pruning
◼ Pessimistic Pruning
◼ Production Rule Simplification Pruning
◼ Path Length Pruning
◼ Cross-validation Pruning
Problem of Overfitting
Reduced Error Pruning
1. Classify all examples in the test set using the tree. Note down, at
each non-leaf node, the nature and the number of errors.
2. For every non-leaf node, count the number of errors if the sub-
tree of which the node is the root is replaced by a leaf node with
the best possible class label.
3. Choose the sub-tree with the largest reduction in the number of
errors to prune, provided none of its own sub-trees reduces the
number of errors.
4. Repeat steps 2 and 3 until any further pruning would increase the
number of errors.
◼ a) Prune the tree even when there is a zero error reduction.
◼ b) There may be a number of sub-trees with the same error reduction.
In this case, choose to prune the largest sub-tree.
Problem of Overfitting
Reduced Error Pruning
Test Set:
AgeGroup  Gender  Married  YearsOfLicence  Class
teen      male    no       2               P
teen      male    no       3               P
teen      female  no       2               P
teen      female  yes      2               P
adult     female  no       3               N
adult     female  yes      3               N
adult     female  no       2               N
teen      female  yes      1               P
senior    male    yes      1               N
senior    female  yes      1               N
senior    female  yes      3               N
adult     male    yes      2               N

[Figure: the tree to be pruned — root YearsOfLicence (branches 1, 2, 3), with
sub-trees testing AgeGroup, Gender and Married, and leaves labelled P, N and NULL.]
[Figure: the pruning walkthrough. Each node of the tree is annotated with the
errors made on the test set and with the errors that would remain if its
sub-tree were replaced by the best possible leaf.]

Step 1: the biggest reductions are at YearsOfLicence and AgeGroup – 3 errors
reduced each. YearsOfLicence has sub-trees of its own that reduce errors,
while AgeGroup does not, so AgeGroup is pruned and the errors are counted again.

Step 2: the biggest reduction is now at the remaining AgeGroup node, which
reduces 1 error and whose own sub-tree does not reduce errors. It is replaced
with the leaf N and the errors are counted again.

Step 3: only Gender now gives a favourable reduction, so it is replaced with
the leaf P and the errors are counted again.

No more favourable reductions exist, so pruning stops here.
Problem of Overfitting
Cost Complexity Pruning
 Definition of Cost Complexity of a tree:
Let T: a decision tree,
S: a subtree of T,
L(T): the number of leaf nodes in T,
N: the number of training examples,
E: the number of misclassifications.
The cost complexity of T is defined as: E/N + α·L(T)
 About the Cost Complexity Factor α:
Suppose we replace subtree S of T with the best possible leaf. The new tree would
contain L(S) – 1 fewer leaves and make M more errors on the training set.
Therefore, the cost complexity of the new tree is: (E+M)/N + α·(L(T) − (L(S) − 1))
It is intended that the cost complexities of the tree before pruning and after pruning
should remain the same, i.e. E/N + α·L(T) = (E+M)/N + α·(L(T) − (L(S) − 1))
Hence, α = M / (N · (L(S) − 1))
Problem of Overfitting
Cost Complexity Pruning
 For every non-leaf node of the decision tree, calculate the cost complexity factor α.
Replace the sub-tree with the minimum α with a leaf node. The leaf takes the class tag of
most examples in the sub-tree. Record the resulting tree. Repeat this step until the
entire tree is replaced by a single leaf.
 Use each tree Ti to classify the N' examples of the test set. Calculate the number of
errors Ei for each tree. Find the minimum number of errors E'. Select the tree Tj with
the smallest number of nodes such that

Ej ≤ E' + se(E'),   where se(E') = √( E'·(N' − E') / N' )
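Both formulas as Python helpers (a sketch; variable names follow the definitions above):

import math

def alpha(M, N, leaves_S):
    # cost complexity factor for pruning subtree S: M extra errors,
    # N training examples, leaves_S leaf nodes in S
    return M / (N * (leaves_S - 1))

def se(E, N_test):
    # standard error of the test-set error count E' on N' test examples
    return math.sqrt(E * (N_test - E) / N_test)

# a tree Tj is acceptable if its error count Ej <= E_min + se(E_min)
E_min, N_test = 6, 100
print(E_min + se(E_min, N_test))   # threshold on Ej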
Problem of Overfitting
PEBLS: Exemplar Weight Tuning
 Use the exemplars in the memory space to classify the test examples
 The weights of poorly performing exemplars are increased, while the weights of well
performing exemplars remain close to 1.
 The Modified PEBLS Algorithm outline:
algorithm PEBLS(Tr: training set; Ts : test set) : Exemplar Space
Begin
Space := Ø;
for each attribute ai of Tr do
construct a value difference table on all possible pairs of values;
end for;
for each instance e in Tr do
add e into Space with weight = 1;
calculate distances between e and each previous exemplars in Space;
find the nearest neighbour exemplar e’;
if the e’.Class ≠ e.Class then adjust the weight of e’
end for;
for each instance e in Ts do
classify e and adjust the weights for exemplars in Space;
return (Space)
End;
Problem of Overfitting
Handling in Other Approaches
 Rule-based approach: rules that misclassify test
examples.
◼ Pruning away those rules
◼ Modifying those rules
 Artificial neural networks: the network misclassifies
test examples.
◼ Using the test examples in the same way as the training
examples to further tune the weights on links.
 Naïve Bayes: the theorem misclassifies test examples
◼ Retraining with test examples
◼ Advanced Bayesian methods (BNN)
Evaluating Classification Methods:
Accuracy Rate
 Accuracy is a main factor in evaluating a classification model
 Accuracy can be reflected by the error rate in a classification matrix known as
the confusion matrix:

Confusion Matrix   Predicted Positive  Predicted Negative  Total
Actual Positive    15                  5                   20
Actual Negative    3                   7                   10
Total              18                  12                  30

Errors: 8          Error Rate: 27%
 Error rate upon what data?
◼ Training data? No, training data are used to build the model. Even with model refinement, the
accuracy is still in favour of training examples.
◼ Predicting error rate on unseen new data? We do not have such data.
◼ Testing data? Yes, this is the best we can do.
 We use an independently sampled test set (i.e. the evaluation set) to estimate the
accuracy of the model, provided the set is representative.
 How close is the estimate to the true accuracy? The bigger the test set, the better
the estimate. (e.g. 75% out of 100 is different from 75% out of 1000)
 In statistics, given a confidence level c%, the true accuracy rate p is estimated within a
range of the test set estimation t ± a, where a is related to the size of the test set.
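One standard way to obtain such a range is the normal-approximation interval for a proportion. A sketch (assuming a 95% confidence level, i.e. z = 1.96):

import math

def accuracy_interval(t, n, z=1.96):
    # approximate confidence interval for the true accuracy,
    # given the test-set estimate t on n test examples
    a = z * math.sqrt(t * (1 - t) / n)
    return t - a, t + a

print(accuracy_interval(0.75, 100))    # roughly (0.665, 0.835)
print(accuracy_interval(0.75, 1000))   # roughly (0.723, 0.777) - narrower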
Evaluating Classification Methods:
Holdout Method
 Divide the available data set to two partitions, one as training set and the
other as the test set, by following a sampling policy
 Normal sampling policies include 50%(training)-50%(testing) and
67%(training)-33%(testing)
 Problems:
◼ Fewer examples for training, higher risk of overfitting
◼ Model depends on the sizes of training and testing sets. If the training set is too
small, the model is less reliable. If the testing set is too small, the accuracy
estimation is unreliable.
◼ Model depends on the contents of training and testing sets. The problem of over-
representation of a class in one and under-representation of the class in the other
is difficult to avoid.
 Random Sub-sampling: use the holdout method k times and take the
average of the accuracy. Most problems mentioned above still exist.
Evaluating Classification Methods:
Cross-Validation
 Partition the whole data set available for model development into
k folds.
 Repeat the model construction k times. At each iteration, one
fold is used for testing and the rest are used for training. In this
way, we can ensure that each example is used for testing exactly
once, and for training for the same number of times.
 Special versions:
◼ 2-fold cross validation
◼ N-fold cross validation (leave-one-out)
◼ Normally used: 10-fold cross validation
 The error is calculated by summing the number of errors for each
iteration or taking the average of all estimates.
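A sketch of the k-fold partitioning in Python (shuffling before splitting is an assumption; model construction and error counting are left as placeholders):

import random

def cross_validation_folds(examples, k=10, seed=0):
    # yields (training, testing) pairs; each example is tested exactly once
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]       # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        yield train, test

# total_errors = sum(count_errors(build_model(tr), ts)
#                    for tr, ts in cross_validation_folds(data, 10))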
Evaluating Classification Methods:
Bootstrap
 Use a sampling-with-replacement scheme to
obtain a training set of N examples from the N
original examples. When N is not small, about
63.2% of the original examples are selected, since
1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 0.632. The remaining
examples are used as testing examples.
 The overall error rate combines the estimate on the
test set with the estimate on the training set (the
.632 bootstrap weighs them 0.632 and 0.368 respectively).
 Repeat the process b times. The resulting error
rates are averaged.
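One bootstrap round in Python, which also illustrates the 63.2% figure (a sketch; examples are assumed hashable):

import random

def bootstrap_split(examples, seed=None):
    # sample N examples with replacement for training;
    # the unselected examples form the test set
    rng = random.Random(seed)
    train = [rng.choice(examples) for _ in range(len(examples))]
    chosen = set(train)
    test = [e for e in examples if e not in chosen]
    return train, test

data = list(range(10000))
train, test = bootstrap_split(data, seed=1)
print(1 - len(test) / len(data))   # close to 0.632 when N is not small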
Classification Methods
The Comparison
 Decision Tree Methods
✓ Ability to generate understandable rules
✓ Efficiency in classifying unseen data
✓ Ability to indicate the most important attributes for classification
✗ The classifier is computationally expensive to build
 Nearest Neighbour Methods
✓ Short training time
✓ Works for various types of attributes
✗ The classification time is long
 Rule-based Methods
✓ Good for class imbalance situations
✓ Easy to understand
✗ Complex rules can be difficult to interpret
 Naïve Bayes Classifier
✓ Robust to noise and irrelevant attributes
✓ Can deal with missing values by ignoring them
✗ May not perform well when attributes are correlated.
Classification in Practice
 Data Mining Using Classification Methods
◼ Locate data
◼ Data preparation
◼ Choosing a classification method
◼ Construct the model and tune the model
◼ Measure its accuracy and go back to step 3 or 4 until the accuracy is
satisfactory
◼ Further evaluate the model with other criteria such as complexity,
comprehensibility, etc.
◼ Deliver the model and test run it in the real environment. Further modify the
model if necessary

Note: Steps 3 to 5 may be repeated in order to obtain a good quality
classification model.
Classification in Practice
 Data Mining Using Classification Methods
◼ Data Preparation
 Identify the descriptive features (input attributes)
 Identify or define the class
 Determine the sizes of the training, test and evaluation set
 Select examples
◼ Spread and coverage of classes
◼ Spread and coverage of attribute values
◼ Null values
◼ Noisy data
 Prepare the input values (categorical to continuous, continuous to categorical)
◼ Selecting classification methods
 For the purpose of investigation
 For better accuracy
Classification in Practice
 How to Select Models by Statistical Significance
◼ We use the same classification method to obtain two models, M1 and M2. We evaluate M1 on test set D1
with error rate e1, and M2 on test set D2 with error rate e2.
◼ If the sizes of the test sets are sufficiently large, the error rates can be approximated by normal
distributions. Therefore, d = e1 – e2 follows a normal distribution with the true difference dt as mean and
σd as the standard deviation, where

σd = √( e1(1 − e1)/|D1| + e2(1 − e2)/|D2| )

◼ The confidence interval for the true difference dt at a confidence level c is calculated as

dt = d ± zc · σd

confidence level  0.99  0.98  0.95  0.9   0.8
zc                2.58  2.33  1.96  1.65  1.28

◼ The difference between e1 and e2 is significant if the interval does not span 0; otherwise, the difference is
not significant.
Ex: |D1| = 30, e1 = 0.15, |D2| = 5000, e2 = 0.25, d = 0.1, σd = 0.0655
dt = 0.1 ± 1.96 × 0.0655 = 0.1 ± 0.128
Since the interval spans 0, the difference between e1 and e2 is not significant.
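The same check in Python (a sketch of the formulas above; the example numbers are those on this slide):

import math

def significance_interval(e1, n1, e2, n2, z=1.96):
    # confidence interval for the true difference between two error rates
    d = abs(e1 - e2)
    sd = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d - z * sd, d + z * sd

low, high = significance_interval(0.15, 30, 0.25, 5000)
print(low, high)           # about -0.028 .. 0.228
print(low <= 0 <= high)    # True: the interval spans 0 -> not significant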
Summary
 Classification is a process of classifying a record into one of the pre-defined classes.
 Depending on the classification model constructed, there are a number of approaches
for developing classification models. Different methods suit different application data
sets.
 Various measures of relevance between attribute values and the target class outcomes
are used during the model construction process.
 The problem of overfitting exists in any solution that builds a model out of a training set of
examples. To overcome this problem, the constructed model is refined with the
assistance of test/validation examples.
 Evaluation of models requires the use of an independently sampled test set.
 To estimate the true accuracy rate of the model, a number of evaluation methods that
make use of existing examples can be adopted. These methods are more useful when
quality example data are limited.
 Accuracy of a classifier is an important factor of performance. Efficiencies in model
construction and model use are also important.
Further Reading
 Hongbo Du, "Data Mining Techniques and
Applications", Chapters 6 & 7
 Tan, P., Steinbach, M. & Kumar, V.
“Introduction to Data Mining”, Chapter 4 &
5, Addison-Wesley, 2006
 Berry & Linoff, “Data Mining Techniques for
Marketing, Sales and Customer Support”,
Chapter 12, Wiley, 1997