
IME 672
Data Mining & Knowledge Discovery

Chapter 4
Classification: Basic Concepts
Classification
• A form of data analysis that extracts a model, or classifier, to
predict class labels
– class labels are categorical (discrete or nominal)
– a classifier is built from a training set and the values of a
classifying attribute, and is then used to classify new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical applications
– Credit/loan approval: loan application is “safe” or “risky”
– Medical diagnosis: tumor is “cancerous” or “benign”
– Fraud detection: transaction is “fraudulent”
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data is accompanied by labels
indicating the class of the observations
– New data is classified based on the training set

• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of observations, the aim is to establish the
existence of classes or clusters in the data
Classification— Two-Step Process
• Model construction: Describe a set of predetermined classes
– Each tuple is assumed to belong to a predefined class, as determined by
the class label attribute
– The model is represented as classification rules, decision trees, or
mathematical formulae

• Model usage: Classify future or unknown objects
– Estimate accuracy of the model
• The known label of each test sample is compared with the
classified result from the model
• Accuracy = percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
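The accuracy estimate in the model-usage step can be sketched in a few lines of Python; the tiny stand-in classifier and test labels below are hypothetical, purely to make the computation concrete:

```python
# A minimal sketch of the accuracy estimate on an independent test set.
# The "model" here is a hypothetical stand-in that always predicts 'yes'.

def classify(attrs):
    return "yes"               # hypothetical classifier

test_set = [                   # (attributes, known class label)
    ({"years": 7}, "yes"),
    ({"years": 2}, "no"),
    ({"years": 8}, "yes"),
    ({"years": 5}, "yes"),
]

# accuracy = percentage of test samples the model classifies correctly
correct = sum(1 for attrs, label in test_set if classify(attrs) == label)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")   # 3 of 4 labels match -> 75%
```

Because the test set is independent of the training set, this estimate is not inflated by overfitting.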
Phase 1: Model Construction

Training Data → Classification Algorithm → Classifier (Model)

NAME    RANK            YEARS  TENURED
Mike    Assistant Prof  3      no
Mary    Assistant Prof  7      yes
Bill    Professor       2      yes
Jim     Associate Prof  7      yes
Dave    Assistant Prof  6      no
Anne    Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
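The construction step can be mimicked in Python. The rule below is the one shown on the slide, hand-coded rather than learned, just to check that it fits the training tuples:

```python
# Training tuples from the Phase 1 slide: (name, rank, years, tenured)
training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def model(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# the rule reproduces the label of every training tuple
fits = all(model(r, y) == t for _, r, y, t in training_data)
print(f"rule fits all training tuples: {fits}")   # True
```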
Phase 2: Model Usage

Classifier:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
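A sketch of the usage step, assuming the rule from Phase 1: first estimate accuracy on the independent test set, then classify the unseen tuple:

```python
def model(rank, years):
    # the classifier learned in Phase 1
    return "yes" if rank == "Professor" or years > 6 else "no"

# test tuples from the Phase 2 slide: (name, rank, years, tenured)
test_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(model(r, y) == t for _, r, y, t in test_data)
print(f"test accuracy: {correct}/{len(test_data)}")   # Merlisa is misclassified -> 3/4

# if the accuracy is acceptable, classify new data
print(model("Professor", 4))   # Jeff -> yes
```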
Decision Tree Induction
• Decision tree: a flowchart-like tree structure where
– internal node (nonleaf node): a test on an attribute,
– branch: an outcome of the test, and
– leaf node (or terminal node): a class label
• Decision tree induction is the learning of decision
trees from class-labeled training tuples
• Well-known algorithms: ID3 (Iterative Dichotomiser),
C4.5, and Classification and Regression Trees (CART)
Decision Tree: Example
Algorithm
• Generate_decision_tree: Generate a decision tree from
the training tuples of data partition, D
• Input:
– Data partition, D, which is a set of training tuples and their
associated class labels;
– attribute_list, the set of candidate attributes;
– Attribute_selection_method( ): a procedure to determine the
splitting criterion that “best” partitions the data tuples into
individual classes. The criterion consists of a splitting attribute
and, possibly, either a split-point or a splitting subset
• Output: A decision tree
Method
1. create a node N;
2. if tuples in D are all of the same class, C, then
3. return N as a leaf node labeled with the class C;
4. if attribute_list is empty then
5. return N as a leaf node labeled with the majority class in D;
6. apply Attribute_selection_method(D, attribute_list) to find
the “best” splitting criterion;
7. label node N with splitting_criterion;
8. if splitting_attribute is discrete-valued and multiway splits
allowed then // not restricted to binary trees
9. attribute_list = attribute_list – splitting_attribute; // remove
splitting attribute
Method
10. for each outcome j of splitting_criterion
// partition the tuples and grow subtrees for each partition
11. let Dj be the set of data tuples in D satisfying outcome j;
12. if Dj is empty then
13. attach a leaf labeled with the majority class in D
to node N;
14. else attach the node returned by
Generate_decision_tree(Dj, attribute_list) to node N;
endfor
15. return N;
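The pseudocode above can be rendered as a short Python sketch. Two assumptions for illustration: information gain is used as the Attribute_selection_method (the slides leave it pluggable), and the years attribute is discretized into a yes/no test “years > 6” so all splits are multiway over discrete values:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def majority_class(D):
    return Counter(t["class"] for t in D).most_common(1)[0][0]

def select_attribute(D, attribute_list):
    """Information-gain selection (ID3-style) as the pluggable Attribute_selection_method."""
    base = entropy([t["class"] for t in D])
    def gain(a):
        remainder = sum(
            (cnt / len(D)) * entropy([t["class"] for t in D if t[a] == v])
            for v, cnt in Counter(t[a] for t in D).items())
        return base - remainder
    return max(attribute_list, key=gain)

def generate_decision_tree(D, attribute_list):
    classes = {t["class"] for t in D}
    if len(classes) == 1:                        # steps 2-3: all tuples in one class
        return classes.pop()
    if not attribute_list:                       # steps 4-5: label with majority class
        return majority_class(D)
    a = select_attribute(D, attribute_list)      # steps 6-7: best splitting criterion
    remaining = [x for x in attribute_list if x != a]   # steps 8-9: multiway split
    node = {a: {}}
    for v in {t[a] for t in D}:                  # steps 10-14: grow one subtree
        Dj = [t for t in D if t[a] == v]         # per outcome; Dj is never empty
        node[a][v] = generate_decision_tree(Dj, remaining)  # here, so steps 12-13
    return node                                  # step 15     cannot trigger

def classify(tree, x):
    while isinstance(tree, dict):                # descend until a leaf (class label)
        attr = next(iter(tree))
        tree = tree[attr][x[attr]]
    return tree

# The Phase 1 training tuples, with years discretized to the "years > 6" test
D = [
    {"rank": "Assistant Prof", "years>6": "no",  "class": "no"},
    {"rank": "Assistant Prof", "years>6": "yes", "class": "yes"},
    {"rank": "Professor",      "years>6": "no",  "class": "yes"},
    {"rank": "Associate Prof", "years>6": "yes", "class": "yes"},
    {"rank": "Assistant Prof", "years>6": "no",  "class": "no"},
    {"rank": "Associate Prof", "years>6": "no",  "class": "no"},
]
tree = generate_decision_tree(D, ["rank", "years>6"])
print(classify(tree, {"rank": "Professor", "years>6": "no"}))   # Jeff-like tuple -> yes
```

On this data the induced tree splits on "years > 6" at the root (higher information gain than rank) and on rank in the "no" branch, which agrees with the rule shown in Phase 1.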
