
Classification

Classification: Definition
• Classification is the process of finding a model, which
describes and distinguishes data classes or concepts,
for the purpose of being able to use the model to
predict the class of objects whose class label is
unknown
• Generally, classification is a data mining technique
used to predict group membership for data instances.
• Given a collection of records (the training set), where each
record contains a set of attributes, one of which is the class
– Find a model for the class attribute as a function of the values
of the other attributes
Classification: Definition
• The goal of classification is to accurately
predict the target class for each case in the data
– For example, a classification model could be used
to identify loan applicants as low, medium, or high
credit risks
• A classification task begins with a data set in
which the class assignments are known.
– For example, a classification model that predicts
credit risk could be developed based on observed
data for many loan applicants over a period of time

Classification: Definition
– In addition to the historical credit rating, the data
might track employment history, home ownership
or rental, years of residence, number and type of
investments, and so on
– Credit rating would be the target, the other
attributes would be the predictors, and the data for
each customer would constitute a case
• The simplest type of classification problem is
binary classification
– In binary classification, the target attribute has
only two possible values: for example, high credit
rating or low credit rating
Classification: Definition
• In the model building (training) process, a classification
algorithm finds relationships between the values of the
predictors and the values of the target
– Different classification algorithms use different techniques for
finding relationships
– These relationships are summarized in a model, which can then be
applied to a different data set in which the class assignments are
unknown
• Classification models are tested by comparing the
predicted values to known target values in a set of test
data
• The historical data for a classification project is typically
divided into two data sets: one for building the model (the
training data set), the other for testing the model (the test
data set)
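As an illustration of this workflow (not part of the original slides), the sketch below uses scikit-learn with the Iris data as a stand-in for the historical data; the classifier and the 70/30 split ratio are arbitrary choices.

# Minimal sketch of the train/test workflow, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # stand-in for the historical data

# Divide the historical data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model construction: learn relationships between predictor values and the target
model = DecisionTreeClassifier().fit(X_train, y_train)

# Model testing: compare predicted values to the known target values of the test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))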
Illustrating Classification Task
A learning algorithm is applied to the training set to learn a model (induction); the model
is then applied to the test set (deduction) to predict the unknown class labels.

Training Set (Tid, Attrib1, Attrib2, Attrib3, Class):
 1   Yes   Large    125K   No
 2   No    Medium   100K   No
 3   No    Small     70K   No
 4   Yes   Medium   120K   No
 5   No    Large     95K   Yes
 6   No    Medium    60K   No
 7   Yes   Large    220K   No
 8   No    Small     85K   Yes
 9   No    Medium    75K   No
10   No    Small     90K   Yes

Test Set (Tid, Attrib1, Attrib2, Attrib3, Class):
11   No    Small     55K   ?
12   Yes   Medium    80K   ?
13   Yes   Large    110K   ?
14   No    Small     95K   ?
15   No    Large     67K   ?
Classification
• There are different classification algorithms, such as:
– Decision tree,
– Naïve Bayes method,
– Bayesian Belief Network,
– Artificial Neural Network,
– Support Vector Machine, etc.
• Most classification algorithms involve two steps:
1. Model construction
2. Model usage
• Some other classification algorithms, such as the K-
Nearest Neighbor approach, do not require any model
Classification
1. Model construction
• refers to describing a set of predetermined classes
using training data set
• The training data is a set of tuples, where each
tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
• The model is represented as classification rules,
decision trees, or mathematical formulae
2. Model usage:
• Refers to using the model for classifying future or
unknown objects



Classification Process (1): Model
Construction

Training Data (NAME, RANK, YEARS, TENURED):
Mike   Assistant Prof   3   no
Mary   Assistant Prof   7   yes
Bill   Professor        2   yes
Jim    Associate Prof   7   yes
Dave   Assistant Prof   6   no
Anne   Associate Prof   3   no

A classification algorithm is run over the training data to produce the classifier (model), e.g.:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use
the Model in Prediction

The classifier (model) is applied to the testing data and then to unseen data.

Testing Data (NAME, RANK, YEARS, TENURED):
Tom       Assistant Prof   2   no
Merlisa   Associate Prof   7   no
George    Professor        5   yes
Joseph    Assistant Prof   7   yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
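A minimal sketch (illustrative code, not from the slides) of using the learned rule as a classifier on the testing data and on the unseen tuple (Jeff, Professor, 4):

# The classifier learned in step (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),   # the model predicts "yes" here: a test error
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

# Use the testing data (known labels) to check the model
for name, rank, years, actual in testing_data:
    print(name, predict_tenured(rank, years), "(actual:", actual + ")")

# Classify the unseen tuple
print("Jeff:", predict_tenured("Professor", 4))   # -> 'yes'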
Metrics for Performance Evaluation…
                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL     Class=Yes      a (TP)       b (FN)
CLASS      Class=No       c (FP)       d (TN)

• Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)   (× 100 for a percentage)
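A tiny sketch of the accuracy computation from confusion-matrix counts (the counts below are made-up illustrative values):

# a = TP, b = FN, c = FP, d = TN (hypothetical counts)
tp, fn, fp, tn = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
print(f"Accuracy = {accuracy:.1f}%")    # (50 + 35) / 100 * 100 = 85.0%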
Classification by Decision Tree
Induction
• Decision tree induction is the learning of decision
trees from class-labeled training tuples
• A decision tree is a flow-chart-like tree structure,
where
– each internal node (nonleaf node) denotes a test
on an attribute,
– each branch represents an outcome of the test, and
– each leaf node (or terminal node) holds a class
label
• The top most node in a tree is the root node
• Instances are classified starting at the root node
and sorted based on their feature values
How are decision trees used for classification?
• Given a tuple, X, for which the associated
class label is unknown,
– the attribute values of the tuple are tested against
the decision tree
– a path is traced from the root to a leaf node, which
holds the class prediction for that tuple
• Decision trees can easily be converted to
classification rules

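The root-to-leaf traversal can be sketched with a nested-dictionary tree (an illustrative representation; it happens to encode the play_football tree derived later in these slides):

# Internal nodes: {attribute: {value: subtree}}; leaves: class labels
tree = {"Outlook": {
    "sunny":    {"Humidity": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rainy":    {"Windy": {"false": "yes", "true": "no"}},
}}

def classify(node, x):
    # Trace a path from the root to a leaf by testing attribute values
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[x[attribute]]
    return node            # the leaf holds the class prediction

print(classify(tree, {"Outlook": "sunny", "Humidity": "high", "Windy": "false"}))  # -> 'no'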
Decision Tree
(example decision tree figure omitted)
Decision tree classifier
• Decision tree performs classification by
constructing a tree based on training instances
with leaves having class labels
– The tree is traversed for each test instance to find a
leaf, and the class of the leaf is the predicted class
• Widely used learning method
• It has been applied to:
– classify medical patients based on their disease
– classify equipment malfunctions by cause
– classify loan applicants by likelihood of payment
Decision Trees
• Tree where internal nodes are simple decision rules on one or more
attributes and leaf nodes are predicted class labels; i.e. a Boolean
classifier for the input instance
Given an instance of an object or situation, which is specified by a
set of properties, the tree returns a "yes" or "no" decision about that
instance

Attribute_1
  value-1 → Attribute_2
              value-5 → Class2
              value-4 → Class3
  value-2 → Class1
  value-3 → Attribute_2
              value-6 → Class4
              value-7 → Class5
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy, i.e. non-backtracking, algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples/tuples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Optimal attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)

• Conditions for stopping partitioning


– All samples (tuples) for a given node belong to the same class
– There are no remaining attributes on which the tuples may be
further partitioned
– There are no samples (tuples) left for a given branch
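The basic algorithm can be summarized with a short ID3-style sketch (illustrative Python, not taken from the slides; rows are dictionaries mapping attribute names to categorical values, and target names the class attribute):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                      # all tuples belong to the same class
        return labels[0]
    if not attributes:                             # no remaining attributes: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                                   # information gain of splitting on attribute a
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [r[target] for r in rows if r[a] == v]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)               # greedy choice, no backtracking
    node = {best: {}}
    for v in set(r[best] for r in rows):           # partition the examples recursively
        subset = [r for r in rows if r[best] == v]
        node[best][v] = build_tree(subset, [a for a in attributes if a != best], target)
    return node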
Attribute Selection Measure
• An attribute selection measure is a heuristic for selecting the splitting
criterion that “best” separates a given data partition, D, of class-
labeled training tuples into individual classes.
• are also known as splitting rules because they determine how the
tuples at a given node are to be split
• provides a ranking for each attribute describing the given training
tuples
• Decision tree induction employs an attribute selection measure such as
• Information gain
– Select the attribute with the highest information gain
• First, compute the disorder using Entropy; the expected information needed to
classify objects into classes
• Second, measure the Information Gain; to calculate by how much the disorder
of a set would reduce by knowing the value of a particular attribute
• GINI index
– An alternative to information gain that measures the impurity of attributes in the
classification task
– Select the attribute with the smallest GINI value (a short sketch of the GINI
computation follows this list)
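A brief sketch of the GINI computation, using the standard definition Gini(D) = 1 - Σ p_i² (illustrative code, not from the slides):

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum of squared class probabilities; 0 means a pure partition
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 9 + ["no"] * 5))   # ~0.459 for the 9 yes / 5 no weather data used later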
Entropy
• The Entropy measures the disorder of a set S containing a total of n
examples of which n+ are positive and n- are negative. The expected
information needed to classify a tuple in D is given by:
n n n n
D(n , n )   log 2  log 2  Entropy ( S )
n n n n
OR,

• where pi is the probability that an arbitrary tuple in D belongs to


class Ci and is estimated by |Ci, D|/D
• A log function to the base 2 is used, because the information is
encoded in bits
19
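A small sketch of D(n+, n-) as a Python function (illustrative; it also confirms the properties listed on the next slide):

import math

def D(n_pos, n_neg):
    # Entropy of a set with n_pos positive and n_neg negative examples
    n = n_pos + n_neg
    total = 0.0
    for count in (n_pos, n_neg):
        if count:                       # 0 * log2(0) is taken as 0
            p = count / n
            total -= p * math.log2(p)
    return total

print(D(9, 5))   # ~0.940, the value used in the weather example below
print(D(0, 4))   # 0.0 -> all examples belong to the same class
print(D(7, 7))   # 1.0 -> half one class, half the other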
Entropy
• How much more information would we still need (after the
partitioning) in order to arrive at an exact classification? This
amount is measured by

  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

• Info_A(D) is the expected information required to classify a
tuple from D based on the partitioning by A.
• Some useful properties of the Entropy:
• D(n, m) = D(m, n)
• D(0,m) = 0
• D(S)=0 means that all the examples in S have the same class
• D(m, m) = 1
• D(S) = 1 means that half the examples in S are of one class and half
are of the opposite class
Information Gain
• Information gain is defined as the difference between the
original information requirement (i.e., based on just the
proportion of classes) and the new requirement (i.e.,
obtained after partitioning on A)
• The Information Gain measures the expected reduction in
entropy due to splitting on an attribute A:

  Gain(A) = Entropy(D) - Σ_{i=1..v} (|D_i| / |D|) × Entropy(D_i)

OR, equivalently, Gain(A) = Info(D) - Info_A(D)

where the parent node D is split into v partitions and
D_i is the set of records in partition i
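A sketch of Gain(A) computed from a list of attribute values and the corresponding class labels (illustrative helper names; the info helper repeats the entropy definition above):

import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Gain(A) = Info(D) - sum_i |Di|/|D| * Info(Di), where Di groups tuples by the value of A
    partitions = {}
    for v, label in zip(values, labels):
        partitions.setdefault(v, []).append(label)
    info_a = sum(len(part) / len(labels) * info(part) for part in partitions.values())
    return info(labels) - info_a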
Example: Decision Tree for “play football or not”. Use the
weather training dataset given below to construct the decision tree
Outlook Temperature Humidity Windy Play_football
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
The Process of Constructing a Decision Tree
• Select an attribute to place at the root of the decision
tree and make one branch for every possible value
• Repeat the process recursively for each branch
• Which attribute should be placed at a certain node is
based on the information gained by placing a certain
attribute at this node
• In the weather data example, there are 9 instances for
which the decision to play_football is “yes” and 5 instances
for which the decision to play_football is “no”. Then, the
expected information required by knowing the result of the
decision is

  Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
Information Further Required If
“Outlook” Is Placed at the Root
Outlook
  sunny:    2 yes, 3 no
  overcast: 4 yes, 0 no
  rainy:    3 yes, 2 no

Information further required (outlook) = Info_outlook(D)
  = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Attribute Selection by Information Gain
• Class P: play_football = “yes”
• Class N: play_football = “no”
• E(P, N) = E(9, 5) = 0.940
• Compute the entropy for Outlook:

  outlook    p_i  n_i  E(p_i, n_i)
  sunny      2    3    0.971
  overcast   4    0    0
  rainy      3    2    0.971

  E(outlook) = (5/14) E(2,3) + (4/14) E(4,0) + (5/14) E(3,2) = 0.69

• Hence Gain(outlook) = E(P, N) - E(outlook) ≈ 0.25
• Similarly:
  Gain(temperature) = 0.029
  Gain(humidity)    = 0.151
  Gain(windy)       = 0.048
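The gains above can be reproduced with a short script over the weather table (illustrative code; the computed values come out as roughly 0.247, 0.029, 0.152 and 0.048, matching the rounded figures on the slide):

import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, target="Play_football"):
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * info(subset)
    return info(labels) - remainder

cols = ["Outlook", "Temperature", "Humidity", "Windy", "Play_football"]
data = [
    ["Sunny", "Hot", "High", "False", "No"],      ["Sunny", "Hot", "High", "True", "No"],
    ["Overcast", "Hot", "High", "False", "Yes"],  ["Rainy", "Mild", "High", "False", "Yes"],
    ["Rainy", "Cool", "Normal", "False", "Yes"],  ["Rainy", "Cool", "Normal", "True", "No"],
    ["Overcast", "Cool", "Normal", "True", "Yes"],["Sunny", "Mild", "High", "False", "No"],
    ["Sunny", "Cool", "Normal", "False", "Yes"],  ["Rainy", "Mild", "Normal", "False", "Yes"],
    ["Sunny", "Mild", "Normal", "True", "Yes"],   ["Overcast", "Mild", "High", "True", "Yes"],
    ["Overcast", "Hot", "Normal", "False", "Yes"],["Rainy", "Mild", "High", "True", "No"],
]
rows = [dict(zip(cols, r)) for r in data]

for a in ["Outlook", "Temperature", "Humidity", "Windy"]:
    print(a, round(gain(rows, a), 3))
# Outlook ~0.247, Temperature ~0.029, Humidity ~0.152, Windy ~0.048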
The Strategy for Selecting an Attribute
to Place at a Node
• Select the attribute that gives us the largest
information gain
• In this example, it is the attribute “Outlook”

Outlook

sunny overcast rainy

4 “yes” 3 “yes”
2 “yes”
3 “no” 2 “no”
The Recursive Procedure for Constructing a
Decision Tree
• The operation discussed above is applied to each
branch recursively to construct the decision tree
• For example, for the branch “Outlook = Sunny”, we
evaluate the information gained by applying each of
the remaining 3 attributes
– Gain(Outlook=sunny;Temperature) = 0.971 – 0.4 = 0.571
– Gain(Outlook=sunny;Humidity) = 0.971 – 0 = 0.971
– Gain(Outlook=sunny;Windy) = 0.971 – 0.951 = 0.02
The Recursive Procedure for Constructing a
Decision Tree
• Similarly, we also evaluate the information
gained by applying each of the remaining 3
attributes for the branch “Outlook = rainy”.
– Gain(Outlook=rainy;Temperature) = 0.971 – 0.951
= 0.02
– Gain(Outlook=rainy;Humidity) = 0.971 – 0.951 =
0.02
– Gain(Outlook=rainy;Windy) =0.971 – 0 = 0.971
Output: A Decision Tree for “play_football”
Outlook
  sunny    → Humidity
               high   → no
               normal → yes
  overcast → yes
  rainy    → Windy
               false → yes
               true  → no
Classification Rules
IF outlook= “sunny” & humidity= “high” THEN play_football = “no”
IF outlook= “sunny” & humidity= “normal” THEN play_football = “yes”
IF outlook= “overcast” THEN play_football = “yes”
IF outlook= “rainy” & windy= “false” THEN play_football = “yes”
IF outlook= “rainy” & windy= “true” THEN play_football = “no”
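The five rules can be written directly as a small function (an illustrative sketch):

def play_football(outlook, humidity, windy):
    # Each branch mirrors one classification rule read off the tree
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "yes" if windy == "false" else "no"

print(play_football("sunny", "normal", "true"))   # -> 'yes'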
Pros and Cons of decision trees
Pros:
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features

Cons:
• Cannot handle complicated relationships between features
• Problems with lots of missing data

Why decision tree induction in data mining?
• Relatively faster learning speed (than other classification methods)
• Convertible to simple and easy-to-understand classification if-then-else rules
• Comparable classification accuracy with other methods
• Does not require any prior knowledge of data distribution; works well on noisy data
Logarithmic Table
(reference table omitted)
