CLASSIFICATION
Dr. Aso Mohammad Darwesh
Chapter Four - 1
aso.darwesh@yahoo.fr
Outline
- Classification and prediction
- Decision Tree Induction
- Bayesian classification
- Nearest Neighbor Classification
- Rule-Based Classification
- Artificial Neural Network
- Support Vector Machines
Data Mining - 4th class UHD, Aso M. Darwesh
Example
- Medical diagnosis: classifying a tumour as malignant or benign
Classification
- Given dataset: a collection of instances (records)
- Model building: find a model that describes the classes
- Goal: classify unseen instances
Classification illustration

Learning set:
Age | Gender | Specialty   | Sportive
19  | F      | IT          | Yes
21  | F      | IT          | Yes
20  | M      | Medicine    | No
35  | M      | Engineering | No
34  | M      | Medicine    | Yes
28  | M      | Sociology   | No
35  | F      | IT          | Yes
40  | F      | Medicine    | No
35  | M      | IT          | Yes
23  | M      | IT          | No
24  | F      | Engineering | No
23  | F      | Medicine    | No
24  | F      | Sociology   | Yes

Test set:
Age | Gender | Specialty   | Sportive
23  | F      | IT          | ?
30  | M      | IT          | ?
28  | F      | Medicine    | ?
27  | M      | Engineering | ?
29  | F      | Sociology   | ?
A model can be represented as
- Classification rules (e.g., IF x THEN y)
- A decision tree (automatically generating classification rules)
- Mathematical formulae (e.g., f(attributes) = class label)

Model building
- Training set: used to build the model
- Test set: used to validate it and find the accuracy of the model
- Partition the dataset based on the value of an attribute
- Create a branch for each of the attribute's possible values
- For continuous attributes the test is normally of the form "less than or equal to" or "greater than"
- The splitting process continues until each branch can be labeled with just one classification
[Decision-tree figure: the root splits on Specialty into Engineering, Medicine, Sociology and IT branches; some branches end in "No" leaves, others split further on Gender (F → Yes, M → No or Yes), and one branch adds an Age < 30 → No test.]
Compression
- The tree representation is equivalent to the dataset, in the sense that any combination of attribute values leads to an identical classification in both

Prediction
- The tree can be used to predict the classification of unseen instances
Top-Down Induction of Decision Trees (TDIDT)
- Has no preconditions
- The same attribute cannot be used twice in the same branch
- Produces decision rules in the implicit form of a decision tree
- At each non-leaf node an attribute is chosen for splitting
TDIDT algorithms have two main distinguishing aspects
- Impurity measure
- Selection method (underspecified): the algorithm says "select an attribute A to split on", but no method is given for doing this

The algorithm
- If all instances in S belong to the same class, stop
- Otherwise, select the most informative attribute A, partition S according to A's values, and recursively construct subtrees T1, T2, ... for the subsets of S
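The recursive construction above can be sketched in Python. This is a minimal sketch with hypothetical names (`tdidt`, `entropy`); it uses entropy to pick the most informative attribute, as developed later in the chapter:

```python
import math
from collections import Counter

def entropy(labels):
    """E = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def tdidt(rows, labels, attributes):
    """Build a decision tree as nested dicts: {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:            # pure subset: label the leaf
        return labels[0]
    if not attributes:                   # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]

    def weighted_entropy(attr):          # E_new for a split on attr
        subsets = {}
        for row, lab in zip(rows, labels):
            subsets.setdefault(row[attr], []).append(lab)
        return sum(len(s) / len(labels) * entropy(s) for s in subsets.values())

    best = min(attributes, key=weighted_entropy)       # max information gain
    remaining = [a for a in attributes if a != best]   # never reuse in a branch
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = tdidt([r for r, _ in pairs],
                                  [l for _, l in pairs], remaining)
    return tree
```

Note that the "never reuse" step enforces the rule above that the same attribute cannot appear twice in one branch.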
[Figure: a node splits on A's values v1, v2, ..., vn, producing subtrees T1, T2, ..., Tn.]
Choosing the splitting attribute
- Select the attribute which partitions the learning set into subsets that are as pure as possible
- Impurity measures: Entropy, Gini index, frequency tables, G2
Entropy

The average amount of information needed to classify an object; n is the number of classes in the dataset:

E = - Σ_{i=1..n} p_i log2(p_i)
log2 x = y means 2^y = x (x > 0); e.g., log2 8 = 3, because 2^3 = 8
Properties: the value of log2 x is
- positive when x > 1
- negative when x < 1
- zero when x = 1
log2(a·b) = log2 a + log2 b
log2(a/b) = log2 a - log2 b
log2(a^n) = n·log2 a
log2(1/a) = -log2 a
- The value of -x·log2 x is in [0, 1] when x is in [0, 1]
- The maximum value of -x·log2 x occurs at x = 1/e (e ≈ 2.71828)
- The initial minus sign is included to make the value of the function positive (or zero)
Natural logarithm: loge x, usually written ln x
Entropy: Example

E = - Σ_{i=1..n} p_i log2(p_i)

C1 = 0, C2 = 6:  p1 = 0/6 = 0, p2 = 6/6 = 1  →  E = 0
C1 = 1, C2 = 5:  p1 = 1/6, p2 = 5/6          →  E ≈ 0.650
C1 = 2, C2 = 4:  p1 = 2/6, p2 = 4/6          →  E ≈ 0.918
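The three distributions above can be checked numerically. A minimal sketch (the `entropy` helper is not from the slides), using the convention 0·log2 0 = 0:

```python
import math

def entropy(counts):
    """E = -sum(p_i * log2(p_i)), with the convention 0 * log2(0) = 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([0, 6]))   # 0.0: all instances in one class
print(entropy([1, 5]))   # ≈ 0.650
print(entropy([2, 4]))   # ≈ 0.918
```

Note the guard `if c > 0`: it implements the 0·log2 0 = 0 convention, since `math.log2(0)` would raise an error.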
Entropy: Example

Age | Gender | Specialty   | Sportive
19  | F      | IT          | Yes
21  | F      | IT          | Yes
20  | M      | Medicine    | No
35  | M      | Engineering | No
34  | M      | Medicine    | Yes
28  | M      | Sociology   | No
35  | F      | IT          | Yes
40  | F      | Medicine    | No
35  | M      | IT          | Yes
23  | M      | IT          | No
24  | F      | Engineering | No
23  | F      | Medicine    | No
24  | F      | Sociology   | Yes

The learning set contains 6 Yes and 7 No instances:
EStart = -(6/13)·log2(6/13) - (7/13)·log2(7/13) ≈ 0.99572745
Entropy: Example

IT subset (5 instances: 4 Yes, 1 No):
EIT = -(4/5)·log2(4/5) - (1/5)·log2(1/5) ≈ 0.72192809
Example contd

Medicine subset (1 Yes, 3 No):
Age | Gender | Sportive
20  | M      | No
34  | M      | Yes
40  | F      | No
23  | F      | No

EMed = -(1/4)·log2(1/4) - (3/4)·log2(3/4) ≈ 0.81127812
Example contd

Engineering subset:
Age | Gender | Sportive
35  | M      | No
24  | F      | No

All instances belong to the same class, so EEng = 0
Example contd

Sociology subset:
Age | Gender | Sportive
28  | M      | No
24  | F      | Yes

The two classes are equally represented, so ESoc = 1
Example contd

Now we calculate ENew
- ENew is the weighted mean of the Ei values; the weights are the proportions of the original instances in each subset
- ENew = (5/13)·EIT + (4/13)·EMed + (2/13)·EEng + (2/13)·ESoc
       = (5/13)·0.72192809 + (4/13)·0.81127812 + (2/13)·0 + (2/13)·1
       = 0.68113484
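The weighted mean above can be reproduced in a few lines. A minimal sketch using the (Yes, No) counts of each Specialty subset from the learning set:

```python
import math

def entropy(counts):
    """E = -sum(p_i * log2(p_i)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (Yes, No) counts per Specialty subset of the 13-instance learning set
subsets = {"IT": (4, 1), "Medicine": (1, 3),
           "Engineering": (0, 2), "Sociology": (1, 1)}
total = sum(sum(c) for c in subsets.values())          # 13 instances

# E_new: entropy of each subset, weighted by its share of the instances
e_new = sum(sum(c) / total * entropy(c) for c in subsets.values())
print(round(e_new, 8))   # 0.68113484
```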
Example contd

We define Information Gain = EStart - ENew
IG = 0.99572745 - 0.68113484 = 0.31459261

ENew must also be calculated for all the other attributes of the original dataset (homework)
Example contd

We define Information Gain = EStart - ENew
- The entropy method of attribute selection is to split on the attribute that maximizes the value of the Information Gain
- This is equivalent to minimizing the value of ENew, since EStart is fixed
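The selection rule can be sketched as follows. The `information_gain` helper and the toy weather rows are hypothetical illustrations, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """IG = E_start - E_new for a split on a categorical attribute."""
    subsets = {}
    for row, lab in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(lab)
    e_new = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - e_new

# toy data: "outlook" separates the classes perfectly, "windy" not at all
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},
        {"outlook": "rain",  "windy": "yes"}]
labels = ["play", "play", "stay", "stay"]

best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)   # outlook (IG = 1.0, versus 0.0 for windy)
```

Because EStart is the same for every candidate attribute, maximizing IG and minimizing ENew pick the same attribute.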
References
- Max Bramer, Principles of Data Mining, Springer-Verlag London Limited, 2006. 342 pages, ISSN 1863-7310
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006. 772 pages, ISBN 1-55860-489-8
- Seyed R. Mousavi and Krysia Broda, "Impact of Binary Coding on Multiway-split TDIDT Algorithms", International Journal of Electrical and Electronics Engineering 2:3, 2008, pp. 150-159
University of Human Development, College of Science and Technology, Computer Department, 4th class
CLASSIFICATION
Dr. Aso Mohammad Darwesh
Chapter Four - 2
aso.darwesh@yahoo.fr
Gini index
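The detail of this slide is not preserved here. As a hedged sketch: the Gini index of a subset with class proportions p_i is Gini = 1 - Σ p_i², which, like entropy, is 0 for a pure subset:

```python
def gini(counts):
    """Gini index: 1 - sum(p_i^2); 0 for a pure subset."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([0, 6]))   # 0.0: pure subset
print(gini([3, 3]))   # 0.5: maximally mixed two-class subset
```

As with entropy, a TDIDT algorithm would split on the attribute giving the smallest weighted Gini index over the resulting subsets.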
TDIDT algorithm
Bayesian classification

Naive Bayes classifiers
- Do not use rules, decision trees, etc.
- Use probability theory to find the most likely of the possible classifications
- The sum of the probabilities of a set of mutually exclusive and exhaustive events must always be 1
- The outcome of each trial is recorded in one row of a table; each row must have one and only one classification
Bayesian classification

- Consider a training set in which each instance records attribute values (e.g., season) together with a classification (e.g., whether a train is on time)
- The probability of an event occurring, given that an attribute has a particular value (or that several variables have particular values), is called the conditional probability of the event
- e.g., p(class = on time | season = winter)
Bayesian classification

Named after Thomas Bayes (1702-1761)
- Combines the prior and conditional probabilities in a single formula
- Naive: the effect of the value of one attribute on the probability of a given classification is assumed to be independent of the values of the other attributes
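The combination of prior and conditional probabilities can be sketched as follows. `naive_bayes` and the toy train data are hypothetical illustrations, with conditional probabilities estimated by relative frequencies:

```python
from collections import Counter

def naive_bayes(rows, labels, instance):
    """Score each class by p(class) * prod(p(attr=value | class)),
    assuming the attributes are independent given the class."""
    n = len(labels)
    scores = {}
    for cls, cls_count in Counter(labels).items():
        score = cls_count / n                    # prior probability
        for attr, value in instance.items():     # conditional probabilities
            match = sum(1 for row, lab in zip(rows, labels)
                        if lab == cls and row[attr] == value)
            score *= match / cls_count
        scores[cls] = score
    return max(scores, key=scores.get)

rows = [{"season": "winter"}, {"season": "winter"}, {"season": "summer"}]
labels = ["late", "late", "on time"]
print(naive_bayes(rows, labels, {"season": "winter"}))   # late
```

This sketch also exposes the weakness noted on the next slide: with few matching instances, a relative frequency of zero wipes out a whole class's score.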
Bayesian classification

Problems
- It relies on all attributes being categorical
- Estimating probabilities by relative frequencies can give a poor estimate if the number of instances with a given attribute/value combination is small
- A test set is used to determine the accuracy of the model
- Usually, the given dataset is divided into
  - Training set: used to build the model
  - Test set: used to test the accuracy of the model
Nearest Neighbor Classification

- Used when all attribute values are continuous
- The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it, in some sense that we need to define
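One common choice of "closest" is Euclidean distance; the idea can then be sketched as a 1-nearest-neighbour classifier (the helper names and sample points are hypothetical):

```python
import math

def nearest_neighbour(train, instance):
    """Return the label of the training point closest to `instance`
    (Euclidean distance); assumes all attribute values are continuous."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    _, label = min(train, key=lambda t: dist(t[0], instance))
    return label

train = [((1.0, 1.0), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbour(train, (1.5, 1.2)))   # A
```

A k-nearest-neighbour variant would take the majority label among the k closest points instead of just the single closest one.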
Rule-Based Classification