
Data Mining

Techniques & Applications

Classification Techniques
Topics
 The Problem of Classification
 Decision Tree Approach and ID3 Algorithm
 Nearest Neighbour Approach and PEBLS Algorithm
 Bayesian Classifier Approach and Naïve Bayes
 Rule-based Classification Approach
 Principle of Artificial Neural Network
 The Problem of Model Overfitting and Solutions
 Evaluating Classification Models
 Comparison of Classification Solutions
 Classification in Practice
Problem Description
 Two-Stage Description of Classification
◼ Given a data set of examples, use a classification
technique, known as a classifier, to construct a classification
model.
[Diagram: Example Data Set + Model Construction Method → Model]
◼ Given a constructed model, classify a data record with
unknown class into one of the pre-defined classes.
[Diagram: Unseen Data + Model → Class Tag]
Problem Description
 Two-Stage Description of Classification
Example Data Set (weather conditions):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      FALSE  N
sunny     hot          high      TRUE   N
overcast  hot          high      FALSE  P
rain      mild         high      FALSE  P
rain      cool         normal    FALSE  P
rain      cool         normal    TRUE   N
overcast  cool         normal    TRUE   P
sunny     mild         high      FALSE  N
sunny     cool         normal    FALSE  P
rain      mild         normal    FALSE  P
sunny     mild         normal    TRUE   P
overcast  mild         high      TRUE   P
overcast  hot          normal    FALSE  P
rain      mild         high      TRUE   N

Model Construction Method: ID3
Resulting Model (decision tree):
outlook = sunny    → test humidity: high → N, normal → P
outlook = overcast → P
outlook = rain     → test windy: true → N, false → P

Unseen data record: (sunny, cool, high, true) → traversing the tree gives Class = N.
Problem Description
 Training, Testing and Evaluation Examples
◼ At the construction stage, the example data set is split into three subsets:
 Training set is for constructing the initial model.
 Test set is for tuning the initial model to overcome overfitting.
 Evaluation set is used to measure the model’s accuracy.
Model Construction:   Training Examples + Method → Initial Model
Model Refinement:     Testing Examples + Initial Model + Method → Final Model
Accuracy Measurement: Evaluation Examples + Final Model → Final Model with accuracy rate A
Problem Description
 Factors Influencing Successful
Classification
◼ Classification Accuracy (how accurate the
classification model is)
◼ Classification Performance (time taken for model
construction and time taken for model use)
◼ Comprehensibility of the classification model
(whether the model explains why the decision is
made)
Approaches to Classification
 Decision tree induction
◼ Constructing a decision tree as the classification model
◼ Traversing the tree from the root to a leaf to determine the class
◼ The main issue for decision tree construction methods is deciding which
attribute serves as the root of the current tree
 Nearest neighbour approach
◼ The model consists of a set of exemplars (historical records)
◼ The nearest exemplar or k-nearest exemplars are found for the
given record
◼ The class of the record is determined by the class of the
neighbour(s)
Approaches to Classification
 Rule-Based Approach
◼ The model consists of a set of IF..THEN rules in some order
◼ Rules are searched in an order until a suitable rule is found
◼ The suitable rule is fired to determine the class of a record
 Artificial Neural Network
◼ The model is a neural network with a suitable number of nodes,
trained with appropriate weights on the links.
◼ Attribute values of a record are given to the nodes on the input
layer. The neural network calculates and predicts the class of the
record through a collective reasoning of all network nodes.
Approaches to Classification
 Bayesian Approach
◼ The Bayes theorem is the model
◼ The posterior probability is estimated according
to the prior probability
◼ The likelihood of a record belonging to a class is
estimated according to the posterior probability
 Others (Support Vector Machine, Mixture
Models (GMM), Regression)
Decision Tree Induction
Overview
 Overview
◼ Input: a training set, i.e. a table with descriptive attributes and a class
attribute.
◼ Output: a decision tree of nodes and links.
◼ Process: take the input table and induce the decision tree.
 Procedure for Decision Tree Induction
1. Take the entire training set or part of it as the input
2. If all examples are of the same class, create a leaf node, and mark it
with the class name
3. If examples in the training set are of a mixture of different classes,
determine which attribute should be the root of the current tree
4. Partition the input examples into subsets according to the values of the
selected root attribute
5. Construct a decision tree recursively for each subset, repeating from step 2.
Decision Tree Induction:
ID3 Tree Induction Algorithm
Algorithm ID3_TreeConstruct (C: Training Set) : Decision Tree
begin
  Tree := Ø;
  if C is not empty then
    if all examples in C are of one class then
      Tree := a leaf node labeled by the class tag
    else begin
      select attribute T with maximal Information Gain as the root;
      partition C into C1, C2, ..., Cw by values of T;
      for each Ci (1 ≤ i ≤ w) do ti := ID3_TreeConstruct(Ci);
      Tree := a tree with T as root and t1, t2, ..., tw as subtrees;
      label the links from T to the subtrees with values of T
    end;
  return(Tree)
end;
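A minimal Python sketch of this induction loop (an illustration under the assumption that examples arrive as dictionaries; it is not the original ID3 implementation):

import math
from collections import Counter

def entropy(examples, target="Class"):
    # H(S) = -sum p(c) * log2 p(c) over the class values in the examples
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr, target="Class"):
    # information gain of the class given attribute attr
    total = len(examples)
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attrs, target="Class"):
    # returns a nested dict {attr: {value: subtree-or-leaf}} or a class label
    classes = {e[target] for e in examples}
    if len(classes) == 1:              # all examples of one class -> leaf
        return classes.pop()
    if not attrs:                      # no attribute left -> majority leaf
        return Counter(e[target] for e in examples).most_common(1)[0][0]
    root = max(attrs, key=lambda a: gain(examples, a, target))
    tree = {root: {}}
    for value in {e[root] for e in examples}:
        subset = [e for e in examples if e[root] == value]
        tree[root][value] = id3(subset, [a for a in attrs if a != root], target)
    return tree

Applied to the 14-example weather table, this selects Outlook at the root and reproduces the tree shown earlier.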
Decision Tree Induction:
ID3 Tree Induction Algorithm
Training Set: the 14-example weather table (Outlook, Temperature, Humidity, Windy, Class) shown earlier.

Gain(Outlook) = 0.246 bits
Gain(Temperature) = 0.029 bits
Gain(Humidity) = 0.151 bits
Gain(Windy) = 0.048 bits
⇒ Outlook is chosen as the root, with branches sunny, overcast and rain. The
overcast subset is pure, so that branch becomes a leaf P.

Training Set(outlook=sunny):
Temperature  Humidity  Windy  Class
hot          high      FALSE  N
hot          high      TRUE   N
mild         high      FALSE  N
cool         normal    FALSE  P
mild         normal    TRUE   P

Gain(Temperature) = 0.571 bits
Gain(Humidity) = 0.971 bits
Gain(Windy) = 0.020 bits
⇒ Humidity is chosen as the root of the sunny subtree: high → N, normal → P.
Decision Tree Induction:
ID3 Tree Induction Algorithm
 Different attributes may give the same amount of information gain. In this case, other
factors (such as random choice) decide which attribute serves as the root.
 Partitioning the training set into subsets may result in an empty subset, which means no
examples in the training set have the combination of attribute values labeled on the
branches along the path. In this case, a leaf node with label “Null” is created. This
means that the class of the examples cannot be decided.
 ID3 sometimes uses a part of the training set to build a tree, and then repeats the
process so that a number of candidate trees are created. Then the best-fit tree is chosen
at the evaluation stage.
 For a numeric attribute, ID3 finds the best split point and creates a node with
two branches signified by ≤ and >.
 The complexity of a tree is measured in terms of the number of nodes. A tree is not a
good tree if it contains many branches, many paths from the root to a leaf, and
very few examples at each leaf.
Decision Tree Induction:
Other Algorithms
 ID3 Family algorithms: C4.5 and C5 (See5)
◼ Similar to ID3
◼ Use information gain ratio as attribute selection measure
 CART algorithm
◼ Produce binary decision tree
◼ Nodes represent a Boolean test
◼ Use Gini Index of Impurity as attribute selection measure
 CHAID algorithm
◼ Use Chi-square test (χ²) as attribute selection measure
 Studies show that there are only marginal differences among
the attribute selection measures. Currently no measure is
significantly better than the others.
Decision Tree Induction:
Attribute Selection Measure
 Random selection
 Information Gain/Gain Ratio
 Gini Index of Impurity
 Chi-square Test
Decision Tree Induction:
Random Selection
[Figure: a decision tree built by random attribute selection — the root is Temperature
(branches cool, mild, hot); its subtrees test Outlook and Windy, which in turn test
Windy, Humidity and Outlook before reaching the P and N leaves. The tree is larger
than the one induced with information gain.]
Decision Tree Induction:
Information Gain Measure
 An information system S consists of a set of n possible events E1,
E2,…, En. Each event may occur with a probability P(Ek) (1≤k≤n) and
P(E1) + P(E2) + … + P(En) = 1.
 In a table, each attribute can be considered as an information
system (e.g. Outlook)
 In classification, classifying tuples with attribute A of v values in the
data set C into two classes (Positive, Negative) involves two
information systems at a time:
◼ Attribute A:
 Events (values): a1, a2, …, av
 Probabilities of the events: p(a1), p(a2), …, p(av)
◼ Attribute Class:
 Events (values): Positive, Negative
 Probabilities of the events: p(Positive), p(Negative)
Decision Tree Induction:
Information Gain Measure
 The self-information of an event E represents the amount
of information being conveyed when E occurs. It is defined
as:

I(E) = log(1/p(E)) = −log p(E)
 Rationale:
◼ The smaller the chance for E to happen, the more information it
conveys once it really happens
◼ The logarithm measures the degree of magnitude of chances
◼ The base indicates the unit of measure (e.g. 2 => bits, 10 =>
digits)
e.g. In the weather condition table:

I(Class = P) = −log2 p(Class = P) = −log2(9/14) = 0.63743 bits
Decision Tree Induction:
Information Gain Measure
 The average of the self-information of all events in
an information system S is called entropy, and is
defined as:

H(S) = Σk=1..n p(Ek) · I(Ek) = −Σk=1..n p(Ek) · log p(Ek)

 Property: H(S) = 0 when one event always
happens and the other events never happen. H(S)
takes the maximum value log2 n if every event has
the same probability (1/n). Therefore, entropy H(S)
represents a degree of uncertainty – lower entropy,
more certainty; higher entropy, more uncertainty.
Decision Tree Induction:
Information Gain Measure
 e.g. In the weather condition table (14 examples, 9 of class P and 5 of class N):

H(Class) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94 bits
Decision Tree Induction:
Information Gain Measure
 The conditional self-information of event E of system S1, given
that event F of system S2 has occurred, is defined as

I(E | F) = log(1/p(E | F)) = −log p(E | F) = −log( p(E ∩ F) / p(F) )

 The conditional entropy of system S1, in the presence of system
S2, is defined as

H(S1 | S2) = Σi=1..n Σj=1..m p(Ei ∩ Fj) · I(Ei | Fj)
           = −Σi=1..n Σj=1..m p(Ei ∩ Fj) · log( p(Ei ∩ Fj) / p(Fj) )

 The mutual information between two systems S1 and S2 is also
known as the information gain of S1 given S2. It can be calculated as

Gain(S1) = I(S1, S2) = H(S1) − H(S1 | S2)
Decision Tree Induction:
Information Gain in ID3
 Suppose there are p positive and n negative examples,
and there are pi positive and ni negative examples whose
attribute A has the value ai. Then we have a contingency
table:

Attribute A  Positive  Negative
a1           p1        n1
a2           p2        n2
……
av           pv        nv
Totals       p         n

H(Class) = −(p/(p+n)) log(p/(p+n)) − (n/(p+n)) log(n/(p+n))

H(Class | A) = Σi=1..v ((pi+ni)/(p+n)) · H(Class | A = ai)
where H(Class | A = ai) = −(pi/(pi+ni)) log(pi/(pi+ni)) − (ni/(pi+ni)) log(ni/(pi+ni))

gain(A) = H(Class) − H(Class | A)
Decision Tree Induction:
Information Gain Measure
 In the weather condition table, p = 9 and n = 5:

H(Class) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94 bits

For Outlook:
psunny = 2, nsunny = 3, H(Class | Outlook = sunny) = 0.971 bits
povercast = 4, novercast = 0, H(Class | Outlook = overcast) = 0.0 bits
prain = 3, nrain = 2, H(Class | Outlook = rain) = 0.971 bits

H(Class | Outlook) = (5/14)(0.971) + (4/14)(0.0) + (5/14)(0.971) = 0.694 bits

gain(Outlook) = 0.94 − 0.694 = 0.246 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.151 bits
gain(Windy) = 0.048 bits
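A short Python check of the worked numbers above (a sketch; H(p, n) is a helper defined for this illustration):

import math

def H(p, n):
    # entropy of a split with p positive and n negative examples
    total = p + n
    return sum(-x / total * math.log2(x / total) for x in (p, n) if x)

h_class = H(9, 5)
h_given_outlook = (5/14) * H(2, 3) + (4/14) * H(4, 0) + (5/14) * H(3, 2)
print(h_class)                    # 0.940 bits
print(h_given_outlook)            # 0.694 bits
print(h_class - h_given_outlook)  # 0.247 bits ~ gain(Outlook), 0.246 after truncation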
Decision Tree Induction:
Information Gain Measure
 What Does Information Gain Actually Tell Us?
◼ Since entropy H(Class) represents the degree of
uncertainty over class values, the information gain
represents the reduction of uncertainty given the choice of
attribute A.
◼ Choosing the attribute with the maximum information gain
reduces the uncertainty over the class the most.
◼ The practical implication is that most examples are classified
as early as possible in a decision tree.
◼ Information gain favours attributes with many values. A
modification made in the algorithm C4.5 is the use of the
information gain ratio:

gainRatio(A) = gain(A) / H(A) = (H(Class) − H(Class | A)) / H(A)
Decision Tree Induction
 Handling Noise Data in Decision Trees
◼ Noise data records in the training set may cause the following:
 Same description but different classes. In other words, attributes are
not adequate to classify an object.
 A decision tree of spurious complexity
◼ How to solve the problems:
 For the inadequate attribute problem: In addition to labeling a leaf
with class ID, a real number within [0,1] is also kept at the leaf. The
number represents the probability of the class.
 For the spurious complexity problem: use a user-defined minimum
information gain threshold or measure the independence between
class and the attribute to decide whether to continue in constructing
the tree.
Nearest Neighbour Approach
Overview
 A lazy learning approach: delaying the classification decision to the time
of the actual classification. In other words, there is NO training!
 Rationale: “If it walks like a duck, quacks like a duck, and looks like a
duck, it probably is a duck”.
 The classification model is a memory space containing all the training
examples. Each training example is considered as a point in the N-
dimensional space.
 There may be a selection process for training examples.
 To classify an unseen record, compute its proximity to all training
examples and locate either 1 or k nearest neighbour examples. The
classes of the neighbours determine the class of the record.
 In the case where there is more than one neighbour, the class of the
unseen record is determined by voting.
Nearest Neighbour Approach
KNN Algorithm
Algorithm kNN (Tr: trainingSet; k: integer; r: dataRecord) : Class
begin
for each training example t in Tr do
calculate distance d(t, r) upon descriptive attributes
end for;
select k nearest neighbours into D according to the distance vector
Class = majority class in D
return Class
end;
Questions:
◼ How to measure distance?
◼ How to deal with nominal attributes?
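A minimal kNN sketch in Python addressing both questions in one common way — squared differences for numeric attributes and a 0/1 mismatch for nominal ones (an assumption of this sketch; the PEBLS algorithm below treats nominal attributes more carefully):

import math
from collections import Counter

def distance(a, b):
    # mixed distance: squared difference for numbers, 0/1 mismatch otherwise
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0 if x == y else 1
    return math.sqrt(total)

def knn(training, k, record):
    # training: list of (attribute_tuple, class) pairs
    neighbours = sorted(training, key=lambda t: distance(t[0], record))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]

# e.g. knn([((1.0, "red"), "P"), ((4.0, "blue"), "N"), ((1.5, "red"), "P")], 3, (1.2, "red"))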
Nearest Neighbour Approach
The PEBLS Algorithm
 Minimum training
 Weights attached to the training
examples reflecting performance
 A proximity measure based on class
distribution rather than “likelihood”
 Suitable for symbolic attributes
(categorical and discrete)
Nearest Neighbour Approach
The PEBLS Algorithm
Algorithm PEBLS(Tr: training set) : Exemplar Space
begin
Space := Ø;
for each attribute ai of Tr do
construct a value difference table on all possible pairs of values;
end for;
for each instance e in Tr do
add e into Space as an exemplar with weight = 1;
calculate distances between e and each exemplar in Space;
find the nearest neighbour exemplar e’;
adjust the weight of e’ according to its prediction
end for;
…..
end;
Nearest Neighbour Approach
The PEBLS Algorithm
 PEBLS: Value Difference Table for an Attribute
 For a training set of examples, an attribute A with values A1, A2, …, Am
and classes C1, C2, …, Ck forms a contingency table, where Nij is the
number of examples in the data set that have value Ai and belong to class Cj.
 The difference between two values V1 and V2 of the attribute is defined as:

d(V1, V2) = Σi=1..k | C1i/C1 − C2i/C2 |^r

where:
◼ r: constant that is normally set to 1
◼ C1: number of examples with value V1
◼ C2: number of examples with value V2
◼ C1i: number of examples with value V1 and class i
◼ C2i: number of examples with value V2 and class i
Nearest Neighbour Approach
The PEBLS Algorithm
 PEBLS: Value Difference Table for an Attribute
 The value difference table contains value differences
between all possible pairs of values for the attribute:

attribute A  A1         A2         …  Am
A1           d(A1, A1)  d(A1, A2)     d(A1, Am)
A2           d(A2, A1)  d(A2, A2)     d(A2, Am)
…
Am           d(Am, A1)  d(Am, A2)     d(Am, Am)

e.g. in the weather condition table, for Outlook (5 sunny examples: 2 P, 3 N;
4 overcast examples: 4 P, 0 N):

d(sunny, overcast) = |2/5 − 4/4| + |3/5 − 0/4| = 3/5 + 3/5 = 1.2

Value Difference Table for Outlook:
          sunny  overcast  rain
sunny     0      1.2       0.4
overcast  1.2    0         0.8
rain      0.4    0.8       0
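A small Python sketch of the value-difference metric with r = 1, reproducing the Outlook figures (names are illustrative; only the Outlook and Class columns are needed):

from collections import Counter

# (Outlook, Class) pairs from the 14-row weather table
rows = [("sunny","N"),("sunny","N"),("overcast","P"),("rain","P"),("rain","P"),
        ("rain","N"),("overcast","P"),("sunny","N"),("sunny","P"),("rain","P"),
        ("sunny","P"),("overcast","P"),("overcast","P"),("rain","N")]

def value_diff(v1, v2, rows, classes=("P", "N"), r=1):
    # d(V1,V2) = sum over classes of |C1i/C1 - C2i/C2|^r
    c1 = [cls for v, cls in rows if v == v1]   # examples with value V1
    c2 = [cls for v, cls in rows if v == v2]   # examples with value V2
    n1, n2 = Counter(c1), Counter(c2)
    return sum(abs(n1[c]/len(c1) - n2[c]/len(c2)) ** r for c in classes)

print(value_diff("sunny", "overcast", rows))   # 1.2
print(value_diff("sunny", "rain", rows))       # 0.4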
Nearest Neighbour Approach
The PEBLS Algorithm
 PEBLS: Distance Function
◼ The distance between any two data objects X and Y is measured by

Δ(X, Y) = wX · wY · Σi=1..M d(xi, yi)^r

where wX and wY are weights for X and Y, and M is the number of attributes. xi and
yi are respectively the values of the ith attribute for X and Y, and r is a constant set to 2
(Euclidean distance).
◼ The weight is the ratio of the total number of uses of the data object to the
total number of correct uses of the data object. Initially, all training examples have a weight of 1.
e.g.
Δ(row1, row2) = d(row1outlook, row2outlook)² +
d(row1temp, row2temp)² + d(row1humidity, row2humidity)² +
d(row1windy, row2windy)² = d(sunny,sunny)² + d(hot,hot)² +
d(high,high)² + d(false,true)² = 0 + 0 + 0 + (1/2)² = 1/4
Nearest Neighbour Approach
The PEBLS Example
Value difference tables for all four attributes:

Outlook   sunny  overcast  rain
sunny     0      1.2       0.4
overcast  1.2    0         0.8
rain      0.4    0.8       0

Temperature  hot   mild  cool
hot          0     0.33  0.5
mild         0.33  0     0.33
cool         0.5   0.33  0

Humidity  high   normal
high      0      0.857
normal    0.857  0

Windy  true   false
true   0      0.418
false  0.418  0

Exemplar Space:
E1: sunny hot high false N (weight 1)
Nearest Neighbour Approach
The PEBLS Example
Adding E2 (sunny hot high true N): E1 is used and predicts correctly, so
E1's weight stays 2/2 = 1.

E1: sunny hot high false N (weight 1)
E2: sunny hot high true N (weight 1)

Adding E3 (overcast hot high false P):
Δ(E1, E3) = 1.44 and Δ(E2, E3) = 1.615, so E1 is the nearest neighbour.
E1 predicts class N but E3 is P, so E1's weight is increased to 3/2 = 1.5.

E1: sunny hot high false N (weight 1.5)
E2: sunny hot high true N (weight 1)
E3: overcast hot high false P (weight 1)
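A quick Python check of these two distances, plugging in the printed value-difference tables (weights are 1 at this point and r = 2):

# value differences read off the printed tables (keys stored sorted)
d_outlook = {("overcast", "sunny"): 1.2, ("overcast", "overcast"): 0.0,
             ("sunny", "sunny"): 0.0}
d_windy   = {("false", "true"): 0.418, ("false", "false"): 0.0,
             ("true", "true"): 0.0}

def delta(x, y, w_x=1.0, w_y=1.0, r=2):
    # PEBLS distance: wX * wY * sum over attributes of d(xi, yi)^r;
    # Temperature and Humidity are equal in E1..E3, so their terms are 0
    d_o = d_outlook[tuple(sorted((x[0], y[0])))]
    d_w = d_windy[tuple(sorted((x[3], y[3])))]
    return w_x * w_y * (d_o ** r + d_w ** r)

E1 = ("sunny", "hot", "high", "false")
E2 = ("sunny", "hot", "high", "true")
E3 = ("overcast", "hot", "high", "false")
print(delta(E1, E3))  # 1.44
print(delta(E2, E3))  # ~1.615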
Nearest Neighbour Approach
The PEBLS Example
Current exemplar space:
E1: sunny hot high false N (weight 1.5)
E2: sunny hot high true N (weight 1)
E3: overcast hot high false P (weight 1)

Adding E4 (rain mild high false P): calculate Δ(E4, E1), Δ(E4, E2) and Δ(E4, E3),
find the nearest neighbour and adjust its weight, and so on, until all 14
training examples (E1 … E14, the last being rain mild high true N) have been
added to the exemplar space.
Bayesian Classifiers
 Bayes Theorem
◼ Bayes Theorem Equation:

P(Y | X) = P(X | Y) · P(Y) / P(X)

◼ P(Y|X) is called the posterior probability of Y
in the presence of X, whereas P(Y) is called
the prior probability of Y.
◼ Bayes Theorem suggests that the likelihood of Y
happening in the presence of circumstance X
depends on the prior probability of Y.
Bayesian Classifiers
 e.g. For two rival football teams (FCS and željo), FCS wins 65%
of the time and željo wins the remaining matches. Among the games
won by FCS, 30% are played at Grbavica, željo's home ground. 75%
of željo's wins are obtained at home. If a match is
planned at Grbavica, which team is more likely to win?
Let X represent hosting team, and Y represent winning team. We
know:
 P(WIN=FCS)=0.65, P(WIN=željo)=1–0.65=0.35,
P(HOME=željo|WIN=FCS)=0.3,
P(HOME=željo|WIN=željo)=0.75
Using Bayes theorem (the common denominator P(HOME=željo) can be
ignored when comparing the two posteriors):
P(WIN=željo|HOME=željo) ∝ 0.75 × 0.35 = 0.2625
P(WIN=FCS|HOME=željo) ∝ 0.3 × 0.65 = 0.195
Therefore, željo is more likely to win.
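The same comparison in a few lines of Python (only the arithmetic; names are illustrative):

p_win_fcs, p_win_zeljo = 0.65, 0.35
p_home_given_fcs, p_home_given_zeljo = 0.30, 0.75

# numerators of Bayes theorem; the shared denominator P(HOME=zeljo) cancels
score_fcs   = p_home_given_fcs * p_win_fcs       # 0.195
score_zeljo = p_home_given_zeljo * p_win_zeljo   # 0.2625
print("zeljo" if score_zeljo > score_fcs else "FCS")   # zeljo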
Bayesian Classifiers
 Using Bayes Theorem for Classification
◼ Treat Y as the class attribute. Use Bayes theorem to predict the probabilities
of P(y1|X), P(y2|X), …, P(yk|X) for a given attribute value combination X.
e.g. X = (HomeOwner=no, Married=yes, Income=120k)
Calculate P(DefaultPay=yes|X) and P(DefaultPay=no|X) and
determine which probability is higher for X, hence indicating the class for
X.
◼ Among the terms for Bayesian probability, P(Y) can be calculated from the
training data, and P(X) is the given evidence. In fact P(X) can be ignored when we
compare Bayesian results, because the same evidence is used for all probability
calculations. The class-conditional probability P(X|Y) needs to be estimated
from training data.
◼ Different ways of estimating P(X|Y):
 Naïve Bayes
 Bayesian Belief Network
Bayesian Classifiers
Naïve Bayes Classifier
 Take the assumption that attributes are conditionally independent of each other given
that the class label is known.
 Therefore the class-conditional probability P(X|Y=y) is estimated as the product of all
the conditional probabilities P(X1|Y=y), P(X2|Y=y), …, P(Xd|Y=y), i.e.

P(X | Y=y) = Πi=1..d P(Xi | Y=y)     P(Y | X) = P(Y) · Πi=1..d P(Xi | Y=y) / P(X)

 For a categorical attribute, P(Xi=xi|Y=y) is estimated according to the fraction of training
examples of class y that take the attribute value xi.
 For a continuous attribute, it is assumed that the attribute values follow a normal
distribution; the probability P(Xi=xi|Y=y) is therefore estimated from the
Gaussian probability density function, using the mean and variance of the attribute
values of the training examples for the class:

P(Xi = xi | Y = y) = (1 / (√(2π) · σiy)) · exp( −(xi − μiy)² / (2σiy²) )
Bayesian Classifiers
Naïve Bayes Classifier
Example: classify X = (Outlook=sunny, Temperature=80, Humidity=normal, Windy=false)

Training Set (Temperature is numeric):
Outlook   Temperature  Humidity  Windy  Class
sunny     85           high      FALSE  N
sunny     80           high      TRUE   N
overcast  83           high      FALSE  P
rain      70           high      FALSE  P
rain      68           normal    FALSE  P
rain      65           normal    TRUE   N
overcast  64           normal    TRUE   P
sunny     72           high      FALSE  N
sunny     69           normal    FALSE  P
rain      75           normal    FALSE  P
sunny     75           normal    TRUE   P
overcast  72           high      TRUE   P
overcast  81           normal    FALSE  P
rain      71           high      TRUE   N

Estimated probabilities:
P(Class=N) = 5/14 = 0.357; P(Class=P) = 9/14 = 0.643
P(Outlook=sunny|Class=N) = 3/5; P(Outlook=overcast|Class=N) = 0; P(Outlook=rain|Class=N) = 2/5
P(Outlook=sunny|Class=P) = 2/9; P(Outlook=overcast|Class=P) = 4/9; P(Outlook=rain|Class=P) = 3/9
P(Humidity=high|Class=N) = 4/5; P(Humidity=normal|Class=N) = 1/5
P(Humidity=high|Class=P) = 3/9; P(Humidity=normal|Class=P) = 6/9
P(Windy=false|Class=N) = 2/5; P(Windy=true|Class=N) = 3/5
P(Windy=false|Class=P) = 6/9; P(Windy=true|Class=P) = 3/9
For Temperature:
if Class=N: sample mean = 74.6, sample variance = 62.3
if Class=P: sample mean = 73, sample variance = 38
Bayesian Classifiers
Naïve Bayes Classifier
 Example
To classify X = (Outlook=sunny, Temperature=80, Humidity=normal, Windy=false):

P(X|Class=N) = P(Outlook=sunny|Class=N) × P(Temperature=80|Class=N) ×
P(Humidity=normal|Class=N) × P(Windy=false|Class=N)
= 3/5 × 0.040007 × 1/5 × 2/5 = 0.00192

P(X|Class=P) = P(Outlook=sunny|Class=P) × P(Temperature=80|Class=P) ×
P(Humidity=normal|Class=P) × P(Windy=false|Class=P)
= 2/9 × 0.03397 × 6/9 × 6/9 = 0.00335

P(Class=N|X) = (0.357 × 0.00192)/P(X)
P(Class=P|X) = (0.643 × 0.00335)/P(X)

P(Class=P|X) > P(Class=N|X), therefore the class for X is P.
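A compact Python sketch of this Naïve Bayes calculation over the training set above (names are illustrative; the Gaussian density handles the numeric Temperature attribute):

import math

# rows: (Outlook, Temperature, Humidity, Windy, Class) from the training set above
rows = [("sunny",85,"high","FALSE","N"), ("sunny",80,"high","TRUE","N"),
        ("overcast",83,"high","FALSE","P"), ("rain",70,"high","FALSE","P"),
        ("rain",68,"normal","FALSE","P"), ("rain",65,"normal","TRUE","N"),
        ("overcast",64,"normal","TRUE","P"), ("sunny",72,"high","FALSE","N"),
        ("sunny",69,"normal","FALSE","P"), ("rain",75,"normal","FALSE","P"),
        ("sunny",75,"normal","TRUE","P"), ("overcast",72,"high","TRUE","P"),
        ("overcast",81,"normal","FALSE","P"), ("rain",71,"high","TRUE","N")]

def gaussian(x, values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)  # sample variance
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(cls, outlook, temp, humidity, windy):
    sub = [r for r in rows if r[4] == cls]
    prior = len(sub) / len(rows)
    p_outlook  = sum(1 for r in sub if r[0] == outlook) / len(sub)
    p_temp     = gaussian(temp, [r[1] for r in sub])
    p_humidity = sum(1 for r in sub if r[2] == humidity) / len(sub)
    p_windy    = sum(1 for r in sub if r[3] == windy) / len(sub)
    return prior * p_outlook * p_temp * p_humidity * p_windy

for cls in ("N", "P"):   # numerators of P(Class|X); P(X) cancels when comparing
    print(cls, score(cls, "sunny", 80, "normal", "FALSE"))
# P wins (0.643 * 0.00335 > 0.357 * 0.00192), so X is classified as P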
Rule-based Classification
Classification Rule
 Rule form:
◼ (A1 op1 v1) ∧ (A2 op2 v2) ∧ … ∧ (Am opm vm) → Class = yi
where Aj is an attribute name, vj is an attribute value, opj is a
comparison operator (<, >, =, ≠), and yi is a class label.
e.g. (GiveBirth = no) ∧ (AerialCreature = yes) → Class = Bird
 Rules are derived from a given training set
 A rule covers a data record if the attribute values of the record
match the antecedent part of the rule.
 Rule quality is measured by coverage and accuracy
◼ Given a data set D, let |D| represent the number of records in D,
and |A| represent the number of records that are covered by the
rule. Let |Ay| represent the number of records covered by the rule
and having class label y.

coverage(r) = |A| / |D|        accuracy(r) = |Ay| / |A|
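The two measures in Python (a sketch; rules are modelled as predicate functions over record dictionaries, which is an assumption of this illustration):

def coverage(rule, records):
    # fraction of records whose attribute values match the rule antecedent
    return sum(1 for r in records if rule(r)) / len(records)

def accuracy(rule, records, label):
    # of the covered records, the fraction actually carrying the rule's class
    covered = [r for r in records if rule(r)]
    return sum(1 for r in covered if r["Class"] == label) / len(covered)

# e.g. rule r1 from the next slide: GiveBirth = no and Creature = Aerial -> Bird
r1 = lambda rec: rec["GiveBirth"] == "no" and rec["Creature"] == "Aerial"
data = [{"GiveBirth": "no", "Creature": "Aerial", "Class": "Bird"},
        {"GiveBirth": "no", "Creature": "Aquatic", "Class": "Fish"},
        {"GiveBirth": "yes", "Creature": "Land", "Class": "Mammal"}]
print(coverage(r1, data), accuracy(r1, data, "Bird"))   # 0.333..., 1.0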
Rule-based Classification
Models
 The classification model consists of a set of rules
e.g.
r1: (GiveBirth = no) ∧ (Creature = Aerial) → Class = Bird
r2: (GiveBirth = no) ∧ (Creature = Aquatic) → Class = Fish
r3: (GiveBirth = yes) ∧ (BodyTemperature = warm) → Class = Mammal
 Given a record, rules are tested in a sequence from the beginning
 A rule is triggered or fired when the rule covers a record, and the class
for the record is then determined
 Rules are mutually exclusive if no more than one rule is triggered by the
same record. Rules are exhaustive if each combination of attribute
values is covered by a rule.
 Default rule: () → yj as the final rule where class label yj normally refers
to the majority class of training records not yet covered by the existing
rules.
Rule-based Classification
Models
 Rules in the classifier are either ordered or unordered.
◼ Ordered: rules are listed in decreasing order of their priority (such as accuracy,
coverage, total description length, etc.). Records are classified according to the
same order.
◼ Unordered: rules are not ordered, and more than one rule may be triggered upon
the same record. In this case, the class label of the record is determined by voting
of the rules fired. Normally, the class in majority is used to determine the class of
the record.
 Rule ordering schemes
◼ Rule-based ordering: ordering by a quality measure
 Advantage: best rule is applied first
 Disadvantage: later rules are difficult to understand after all previous rules are
negated.
◼ Class-based ordering: ordering by class
 Advantage: simpler rules and easier to understand
 Disadvantage: may not apply the “best” rule
◼ Combined
Rule-based Classification
Sequential Covering Algorithm
algorithm SeqCover (Tr: training set; Y : Classes): set of rules
Begin
R = Ø;
A = the set of all attribute value pairs {(Ai, vj)};
for each class y in Y – {yk} do
while a stopping condition not met do
r = learn_one_rule(Tr, A, y);
remove training examples covered by r from Tr;
add r to the bottom of the rule list R;
end while;
end for;
insert default rule {}→yk to the bottom of the rule list R;
return(R);
End;
Rule-based Classification
Sequential Covering Algorithm
[Figure: the original training set, with four regions of examples successively
covered by rules R1, R2, R3 and R4.]
Rule-based Classification
Sequential Covering Algorithm
 During the rule extraction process, all examples of one class are
considered positive and examples of other classes are negative. A rule is
desirable if it covers most positive examples and no or very few negative
examples.
 Once a rule is discovered, examples covered by the rule are removed,
leaving negative examples and positive examples not covered yet
behind.
 The learn_one_rule operation finds the optimal rule using a greedy
approach: an initial rule is generated and continuously refined until a
certain evaluation criterion is satisfied.
◼ Rule Growing: start with {}→y, then select the best possible (Ai, vj) pair and add it
to the antecedent. Repeat the process until the rule quality no longer improves (see
the code sketch below).
◼ Rule Refining: start with a positive example, then remove one of its conjuncts so
that it covers more positive examples. Repeat the process until the rule starts to
cover negative examples.
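A minimal Python sketch of greedy, general-to-specific rule growing, scoring candidates by accuracy on the covered records (one possible choice of quality measure; the function names are illustrative):

def grow_rule(records, target, attrs):
    # greedily add (attribute, value) conjuncts while accuracy improves
    rule = {}                                   # empty antecedent {} -> target

    def covered(r):
        return [x for x in records if all(x[a] == v for a, v in r.items())]

    def acc(r):
        c = covered(r)
        return sum(1 for x in c if x["Class"] == target) / len(c) if c else 0.0

    while True:
        best, best_acc = None, acc(rule)
        for a in attrs:
            if a in rule:
                continue
            for v in {x[a] for x in covered(rule)}:
                cand = dict(rule, **{a: v})
                if acc(cand) > best_acc:
                    best, best_acc = cand, acc(cand)
        if best is None:                        # no conjunct improves accuracy
            return rule
        rule = best

On a table like the Refund/Status/Income example on the next slide, this grows {} → Class=Yes towards Refund=No ∧ Status=Single → Class=Yes.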
Rule-based Classification
Sequential Covering Algorithm
Example (a loan data set with attributes Refund, Status and Income; 3 Yes and 4 No records):

(a) Rule growing (general-to-specific): start with {} → (Class=Yes), covering
Yes: 3, No: 4. Candidate conjuncts and their coverage:
Refund=No (Yes: 3, No: 4), Status=Single (Yes: 2, No: 1),
Status=Divorced (Yes: 1, No: 0), Status=Married (Yes: 0, No: 3),
Income>80K (Yes: 3, No: 1), …
Growing continues towards e.g. Refund=No, Status=Single → (Class=Yes).

(b) Rule refining (specific-to-general): start with a rule built from one positive
example, e.g. Refund=No, Status=Single, Income=85K → (Class=Yes), and drop
conjuncts so that the rule covers more positive examples.
Artificial Neural Network
 Artificial Neuron (unit)
[Figure: inputs i1, i2, i3 with weights w1, w2, w3 feed a summing unit Σ, whose
output passes through the sigmoid function to produce y.]
where:
◼ sum function:
x = w1*i1 + w2*i2 + w3*i3
◼ transformation fn:
y = sigmoid(x) = 1/(1 + e^-x)
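The unit as a few lines of Python (a direct sketch of the two functions above):

import math

def unit(inputs, weights):
    # one artificial neuron: weighted sum followed by the sigmoid transformation
    x = sum(w * i for w, i in zip(weights, inputs))   # x = w1*i1 + w2*i2 + w3*i3
    return 1.0 / (1.0 + math.exp(-x))                 # y = sigmoid(x)

print(unit([0.5, 1.0, -1.0], [0.2, 0.4, 0.1]))        # output strictly between 0 and 1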
Artificial Neural Network
 The Network
[Figure: a feed-forward network with an input layer (x1 … x5), one hidden layer,
and an output layer producing y.]
◼ A neural network can have many
hidden layers, but one layer is
normally sufficient
◼ The wider a hidden layer is (i.e. the
more units it has), the more capacity for
pattern recognition, but the less general
beyond the training examples
◼ The output layer can have more than
one unit, predicting the likelihood of a
number of classes
◼ Constant inputs can be fed into
the units in the hidden and output
layers as inputs.
Artificial Neural Network
 A Learning Algorithm to Train an ANN
algorithm trainNetwork (Tr: training set): Network
Begin
R = initial network with a particular topology;
initialise the weight vector with random values w(0);
repeat
for each training example t=<xi, yi> in Tr do
compute the predicted class output ŷ(k);
for each weight wj in the weight vector do
update the weight wj: wj(k+1) = wj(k) + λ(yi − ŷ(k))xij
end for;
end for;
until stopping criterion is met;
return R;
end;
(λ is the learning rate)
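A runnable sketch of this update loop for a single sigmoid unit (assumptions: learning rate λ = 0.1 and a fixed number of epochs as the stopping criterion):

import math, random

def train(examples, n_inputs, rate=0.1, epochs=100):
    # examples: list of (x_vector, y) with y in {0, 1}; returns trained weights
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
            # w_j(k+1) = w_j(k) + lambda * (y - y_hat) * x_j
            w = [wj + rate * (y - y_hat) * xj for wj, xj in zip(w, x)]
    return w

# e.g. learn a noisy OR of two inputs (third input is a constant bias of 1)
data = [((0, 0, 1), 0), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
weights = train(data, 3)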
Problem of Overfitting
 All algorithms that build a classification model based on a finite set of training
examples tend to have a problem known as model overfitting: the model
induced fits the training examples too well. It reflects the features of the
training examples that may not be the features of the actual data population.
 Consequence of overfitting: less accuracy when the model is used in practice.
 Reasons: presence of noise and lack of representative examples
 Problem appearance in different classification approaches
◼ Decision Tree:
 Fuller tree with more branches and nodes
 Most leaf nodes only suit a small number of examples
◼ Nearest Neighbour:
 Bad influence of “not-so-appropriate” training examples
◼ Rule-based Approach:
 Poor quality rules, complex rules
◼ Artificial Neural Network:
 Inappropriate weights for network links
Problem of Overfitting
Decision Trees: Tree Pruning
 Obtain an independently sampled set of test examples. Use the tree to
classify the test examples, and use a tree refinement method to prune
the tree.
 Tree pruning means to substitute certain sub-trees by leaf nodes,
making the whole tree more robust with fewer branches and nodes.
 The pruned tree should make fewer errors in practice when features of
both training and testing examples are considered.
 Overview of Tree Pruning Methods:
◼ Reduced Error Pruning
◼ Cost Complexity Pruning
◼ Pessimistic Pruning
◼ Production Rule Simplification Pruning
◼ Path Length Pruning
◼ Cross-validation Pruning
Problem of Overfitting
Reduced Error Pruning
1. Classify all examples in the test set using the tree. Note down, at
each non-leaf node, the nature and the number of errors.
2. For every non-leaf node, count the number of errors if the sub-
tree of which the node is the root is replaced by a leaf node with
the best possible class label.
3. Choose the sub-tree with the largest reduction in the number of
errors to prune, provided none of its own sub-trees reduces the
number of errors.
4. Repeat steps 2 and 3 until any further pruning would increase the
number of errors.
◼ a) Prune the tree even when there is a zero error reduction.
◼ b) There may be a number of sub-trees with the same error reduction.
In this case, choose to prune the largest sub-tree.
Problem of Overfitting
Reduced Error Pruning
Test Set:
AgeGroup  Gender  Married  YearsOfLicence  Class
teen      male    no       2               P
teen      male    no       3               P
teen      female  no       2               P
teen      female  yes      2               P
adult     female  no       3               N
adult     female  yes      3               N
adult     female  no       2               N
teen      female  yes      1               P
senior    male    yes      1               N
senior    female  yes      1               N
senior    female  yes      3               N
adult     male    yes      2               N

[Figure: the tree to be pruned — root YearsOfLicence (branches 1, 2, 3), with
sub-trees testing AgeGroup, Gender and Married, and leaves labelled P, N and NULL.]
[Figure: the pruning walkthrough. Each node of the tree is annotated with the
errors made on the test set and with the errors that would remain if its
sub-tree were replaced by the best possible leaf.]

Step 1: the biggest reductions are at YearsOfLicence and AgeGroup – 3 errors
reduced each. YearsOfLicence has sub-trees of its own that reduce errors,
while AgeGroup does not, so AgeGroup is pruned and the errors are counted again.

Step 2: the biggest reduction is now at the remaining AgeGroup node, which
reduces 1 error and whose own sub-tree does not reduce errors. It is replaced
with the leaf N and the errors are counted again.

Step 3: only Gender now gives a favourable reduction, so it is replaced with
the leaf P and the errors are counted again.

No more favourable reductions exist, so pruning stops here.
Problem of Overfitting
Cost Complexity Pruning
 Definition of Cost Complexity of a tree:
Let T: a decision tree,
S: a subtree of T,
L(T): the number of leaf nodes in T,
N: the number of training examples,
E: the number of misclassifications.
The cost complexity of T is defined as: E/N + α·L(T)
 About the Cost Complexity Factor α:
Suppose we replace subtree S of T with the best possible leaf. The new tree would
contain L(S) – 1 fewer leaves and make M more errors on the training set.
Therefore, the cost complexity of the new tree is: (E+M)/N + α·(L(T) − (L(S) − 1))
It is intended that the cost complexities of the tree before pruning and after pruning
should remain the same, i.e. E/N + α·L(T) = (E+M)/N + α·(L(T) − (L(S) − 1))
Hence, α = M / (N · (L(S) − 1))
Problem of Overfitting
Cost Complexity Pruning
 For every non-leaf node of the decision tree, calculate the cost complexity factor α.
Replace the sub-tree with the minimum α with a leaf node. The leaf takes the class tag of
most examples in the sub-tree. Record the resulting tree. Repeat this step until the
entire tree is replaced by a single leaf.
 Use each tree Ti to classify the N' examples of the test set. Calculate the number of
errors Ei for each tree. Find the minimum number of errors E'. Select the tree Tj with
the smallest number of nodes such that

Ej ≤ E' + se(E'),   where se(E') = √( E'·(N' − E') / N' )
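Both formulas as Python helpers (a sketch; variable names follow the definitions above):

import math

def alpha(M, N, leaves_S):
    # cost complexity factor for pruning subtree S: M extra errors,
    # N training examples, leaves_S leaf nodes in S
    return M / (N * (leaves_S - 1))

def se(E, N_test):
    # standard error of the test-set error count E' on N' test examples
    return math.sqrt(E * (N_test - E) / N_test)

# a tree Tj is acceptable if its error count Ej <= E_min + se(E_min)
E_min, N_test = 6, 100
print(E_min + se(E_min, N_test))   # threshold on Ej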
Problem of Overfitting
PEBLS: Exemplar Weight Tuning
 Use the exemplars in the memory space to classify the test examples
 The weights of poorly performing exemplars are increased, while the weights of well
performing exemplars remain close to 1.
 The Modified PEBLS Algorithm outline:
algorithm PEBLS(Tr: training set; Ts : test set) : Exemplar Space
Begin
Space := Ø;
for each attribute ai of Tr do
construct a value difference table on all possible pairs of values;
end for;
for each instance e in Tr do
add e into Space with weight = 1;
calculate distances between e and each previous exemplars in Space;
find the nearest neighbour exemplar e’;
if the e’.Class ≠ e.Class then adjust the weight of e’
end for;
for each instance e in Ts do
classify e and adjust the weights for exemplars in Space;
return (Space)
End;
Problem of Overfitting
Handling in Other Approaches
 Rule-based approach: rules that misclassify test
examples.
◼ Pruning away those rules
◼ Modifying those rules
 Artificial neural networks: the network misclassifies
test examples.
◼ Using the test examples in the same way as the training
examples to further tune the weights on links.
 Naïve Bayes: the theorem misclassifies test examples
◼ Retraining with test examples
◼ Advanced Bayesian methods (BNN)
Evaluating Classification Methods:
Accuracy Rate
 Accuracy is a main factor in evaluating a classification model
 Accuracy can be reflected by the error rate in a classification matrix known as
the confusion matrix:

Confusion Matrix   Predicted Positive  Predicted Negative  Total
Actual Positive    15                  5                   20
Actual Negative    3                   7                   10
Total              18                  12                  30

Errors: 8          Error Rate: 27%
 Error rate upon what data?
◼ Training data? No, training data are used to build the model. Even with model refinement, the
accuracy is still in favour of training examples.
◼ Predicting error rate on unseen new data? We do not have such data.
◼ Testing data? Yes, this is the best we can do.
 We use an independently sampled test set (i.e. the evaluation set) to estimate the
accuracy of the model, provided the set is representative.
 How close is the estimate to the true accuracy? The bigger the test set, the better
the estimate. (e.g. 75% out of 100 is different from 75% out of 1000)
 In statistics, given a confidence level c%, the true accuracy rate p is estimated within a
range of the test set estimation t ± a, where a is related to the size of the test set.
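One standard way to obtain such a range is the normal-approximation interval for a proportion. A sketch (assuming a 95% confidence level, i.e. z = 1.96):

import math

def accuracy_interval(t, n, z=1.96):
    # approximate confidence interval for the true accuracy,
    # given the test-set estimate t on n test examples
    a = z * math.sqrt(t * (1 - t) / n)
    return t - a, t + a

print(accuracy_interval(0.75, 100))    # roughly (0.665, 0.835)
print(accuracy_interval(0.75, 1000))   # roughly (0.723, 0.777) - narrower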
Evaluating Classification Methods:
Holdout Method
 Divide the available data set to two partitions, one as training set and the
other as the test set, by following a sampling policy
 Normal sampling policies include 50%(training)-50%(testing) and
67%(training)-33%(testing)
 Problems:
◼ Fewer examples for training, higher risk of overfitting
◼ Model depends on the sizes of training and testing sets. If the training set is too
small, the model is less reliable. If the testing set is too small, the accuracy
estimation is unreliable.
◼ Model depends on the contents of training and testing sets. The problem of over-
representation of a class in one and under-representation of the class in the other
is difficult to avoid.
 Random Sub-sampling: use the holdout method k times and take the
average of the accuracy. Most problems mentioned above still exist.
Evaluating Classification Methods:
Cross-Validation
 Partition the whole data set available for model development into
k folds.
 Repeat the model construction k times. At each iteration, one
fold is used for testing and the rest are used for training. In this
way, we can ensure that each example is used for testing exactly
once, and for training for the same number of times.
 Special versions:
◼ 2-fold cross validation
◼ N-fold cross validation (leave-one-out)
◼ Normally used: 10-fold cross validation
 The error is calculated by summing the number of errors for each
iteration or taking the average of all estimates.
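A sketch of the k-fold partitioning in Python (shuffling before splitting is an assumption; model construction and error counting are left as placeholders):

import random

def cross_validation_folds(examples, k=10, seed=0):
    # yields (training, testing) pairs; each example is tested exactly once
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]       # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        yield train, test

# total_errors = sum(count_errors(build_model(tr), ts)
#                    for tr, ts in cross_validation_folds(data, 10))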
Evaluating Classification Methods:
Bootstrap
 Use a sampling-with-replacement scheme to
obtain a training set of N examples from the N
original examples. When N is not small, about
63.2% of the original examples are selected, since
1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 0.632. The remaining
examples are used as testing examples.
 The overall error rate combines the estimate on the
test set with the estimate on the training set (the
.632 bootstrap weighs them 0.632 and 0.368 respectively).
 Repeat the process b times. The resulting error
rates are averaged.
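One bootstrap round in Python, which also illustrates the 63.2% figure (a sketch; examples are assumed hashable):

import random

def bootstrap_split(examples, seed=None):
    # sample N examples with replacement for training;
    # the unselected examples form the test set
    rng = random.Random(seed)
    train = [rng.choice(examples) for _ in range(len(examples))]
    chosen = set(train)
    test = [e for e in examples if e not in chosen]
    return train, test

data = list(range(10000))
train, test = bootstrap_split(data, seed=1)
print(1 - len(test) / len(data))   # close to 0.632 when N is not small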
Classification Methods
The Comparison
 Decision Tree Methods
✓ Ability to generate understandable rules
✓ Efficiency in classifying unseen data
✓ Ability to indicate the most important attributes for classification
✗ The classifier is computationally expensive to build
 Nearest Neighbour Methods
✓ Short training time
✓ Works for various types of attributes
✗ The classification time is long
 Rule-based Methods
✓ Good for class imbalance situations
✓ Easy to understand
✗ Complex rules can be difficult to interpret
 Naïve Bayes Classifier
✓ Robust to noise and irrelevant attributes
✓ Can deal with missing values by ignoring them
✗ May not perform well when attributes are correlated.
Classification in Practice
 Data Mining Using Classification Methods
◼ Locate data
◼ Data preparation
◼ Choosing a classification method
◼ Construct the model and tune the model
◼ Measure its accuracy and go back to step 3 or 4 until the accuracy is
satisfactory
◼ Further evaluate the model with other criteria such as complexity,
comprehensibility, etc.
◼ Deliver the model and test run it in the real environment. Further modify the
model if necessary

Note: Steps 3 to 5 may be repeated in order to obtain a good quality
classification model.
Classification in Practice
 Data Mining Using Classification Methods
◼ Data Preparation
 Identify the descriptive features (input attributes)
 Identify or define the class
 Determine the sizes of the training, test and evaluation set
 Select examples
◼ Spread and coverage of classes
◼ Spread and coverage of attribute values
◼ Null values
◼ Noisy data
 Prepare the input values (categorical to continuous, continuous to categorical)
◼ Selecting classification methods
 For the purpose of investigation
 For better accuracy
Classification in Practice
 How to Select Models by Statistical Significance
◼ We use the same classification method to obtain two models, M1 and M2. We evaluate M1 on test set D1
with error rate e1, and M2 on test set D2 with error rate e2.
◼ If the sizes of the test sets are sufficiently large, the error rates can be approximated by normal
distributions. Therefore, d = e1 – e2 follows a normal distribution with the true difference dt as mean and
σd as the standard deviation, where

σd = √( e1(1 − e1)/|D1| + e2(1 − e2)/|D2| )

◼ The confidence interval for the true difference dt at a confidence level c is calculated as

dt = d ± zc · σd

confidence level  0.99  0.98  0.95  0.9   0.8
zc                2.58  2.33  1.96  1.65  1.28

◼ The difference between e1 and e2 is significant if the interval does not span 0; otherwise, the difference is
not significant.
Ex: |D1| = 30, e1 = 0.15, |D2| = 5000, e2 = 0.25, d = 0.1, σd = 0.0655
dt = 0.1 ± 1.96 × 0.0655 = 0.1 ± 0.128
Since the interval spans 0, the difference between e1 and e2 is not significant.
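The same check in Python (a sketch of the formulas above; the example numbers are those on this slide):

import math

def significance_interval(e1, n1, e2, n2, z=1.96):
    # confidence interval for the true difference between two error rates
    d = abs(e1 - e2)
    sd = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d - z * sd, d + z * sd

low, high = significance_interval(0.15, 30, 0.25, 5000)
print(low, high)           # about -0.028 .. 0.228
print(low <= 0 <= high)    # True: the interval spans 0 -> not significant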
Summary
 Classification is a process of classifying a record into one of the pre-defined classes.
 Depending on the classification model constructed, there are a number of approaches
for developing classification models. Different methods suit different application data
sets.
 Various measures of relevance between attribute values and the target class outcomes
are used during the model construction process.
 The problem of overfitting exists in any solution that builds a model out of a training set of
examples. To overcome this problem, the constructed model is refined with the
assistance of test/validation examples.
 Evaluation of models requires the use of an independently sampled test set.
 To estimate the true accuracy rate of the model, a number of evaluation methods that
make use of existing examples can be adopted. These methods are more useful when
quality example data are limited.
 Accuracy of a classifier is an important factor of performance. Efficiencies in model
construction and model use are also important.
Further Reading
 Hongbo Du, "Data Mining Techniques and
Applications", Chapters 6 & 7
 Tan, P., Steinbach, M. & Kumar, V.
“Introduction to Data Mining”, Chapter 4 &
5, Addison-Wesley, 2006
 Berry & Linoff, “Data Mining Techniques for
Marketing, Sales and Customer Support”,
Chapter 12, Wiley, 1997