Typical applications:
Target marketing
Medical diagnosis
Fraud detection
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate the accuracy of the model: the known label of each test sample is compared with the model's classification; the test set must be independent of the training set, otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify new data
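The accuracy-estimation step above can be sketched in a few lines of Python; the labels and predictions below are hypothetical stand-ins for a real test set and a real classifier.

```python
# Estimate model accuracy on a held-out test set (hypothetical labels).
# Accuracy = fraction of test samples whose known label matches the
# model's prediction; the test set must be disjoint from the training set.

true_labels = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
predictions = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)
print(f"accuracy = {accuracy:.2f}")  # 6 of 8 correct -> 0.75
```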
[Figure: the classification process. Step 1, model construction: training data is fed to classification algorithms to produce a classifier. Step 2, model usage: the classifier is evaluated on testing data and then applied to unseen data.]
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

[Figure: the trained classifier is applied to the unseen record (Jeff, Professor, 4) to answer the query Tenured?]
December 16, 2022 Data Mining: Concepts and Techniques 6
Supervised vs. Unsupervised Learning
Data cleaning: preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection): remove the irrelevant or redundant attributes
Data transformation: generalize and/or normalize data attributes
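Min-max normalization is one common way to carry out the normalization step; the attribute values below are illustrative.

```python
# Min-max normalization: rescale attribute values into [0, 1] via
# v' = (v - min) / (max - min).

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [60, 70, 85, 125, 220]  # illustrative attribute values (in K)
print(min_max_normalize(incomes))
```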
Speed: time to construct the model (training time)
Splitting attributes: Refund, MarSt, TaxInc

[Figure: Model: Decision Tree induced from the training data. Root: Refund (Yes → NO; No → MarSt); MarSt (Single, Divorced → TaxInc; Married → NO); TaxInc (< 80K → NO, >= 80K → YES). E.g., training record 10: (Refund = No, Single, 90K, Cheat = Yes).]

Training Data
Another Example
[Figure: an alternative decision tree that fits the same training data. Root: MarSt (Married → NO; Single, Divorced → Refund); Refund (Yes → NO; No → TaxInc); TaxInc (< 80K → NO, > 80K → YES). There could be more than one tree that fits the same data!]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Training Set
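The remark that more than one tree can fit the same data is easy to check directly: the two trees shown in the slides (one rooted at Refund, one at MarSt) are encoded below as plain functions and compared on all ten training records.

```python
# Two different decision trees that both fit the training data above.

def tree_refund_root(refund, marst, taxinc):
    # Root: Refund; then MarSt; then TaxInc (< 80K -> No, otherwise Yes).
    if refund == "Yes":
        return "No"
    if marst == "Married":
        return "No"
    return "No" if taxinc < 80 else "Yes"

def tree_marst_root(refund, marst, taxinc):
    # Root: MarSt; Married -> No; otherwise Refund, then TaxInc.
    if marst == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if taxinc < 80 else "Yes"

training = [  # (Tid, Refund, Marital Status, Taxable Income in K, Cheat)
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

for tid, refund, marst, taxinc, cheat in training:
    assert tree_refund_root(refund, marst, taxinc) == cheat
    assert tree_marst_root(refund, marst, taxinc) == cheat
print("both trees fit all 10 training records")
```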
Apply
Tid Attrib1 Attrib2 Attrib3 Class
Model
11 No Small 55K ?
12 Yes Medium 80K ?
15 No Large 67K ?
10
Test Set
Test Data
Apply Model to Test Data
Start from the root of the tree.

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

[Figure: the record is routed down the tree. Refund = No, so take the No branch to MarSt; MarSt = Married, so take the Married branch, which ends in the leaf NO. Assign Cheat to "No".]
[Figure: Deduction: the learned Decision Tree model is applied to the test set.]

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Test Set
Decision tree induction algorithms:
CART
ID3, C4.5
SLIQ, SPRINT
[Figure: a decision tree with root test age? and branches <=30, 31..40, >40, ending in leaves labeled no/yes.]
but Gini{medium,high} is 0.30 and thus the best split, since it is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
Example: binary split on attribute B

Parent node: C1 = 6, C2 = 6, Gini = 0.500
Split on B? (Yes → Node N1, No → Node N2):
      N1  N2
  C1   5   1
  C2   2   4

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
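The computation above can be reproduced with a small helper; the class counts are taken directly from the example.

```python
# Gini index of a node from its class counts, and the weighted Gini
# of the children of the binary split on B shown above.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [6, 6]          # C1 = 6, C2 = 6
n1, n2 = [5, 2], [1, 4]  # class counts in the two child nodes

g_parent = gini(parent)                 # 0.500
g1, g2 = gini(n1), gini(n2)             # 0.408, 0.320
g_children = 7 / 12 * g1 + 5 / 12 * g2  # 0.371

print(f"Gini(Parent) = {g_parent:.3f}")
print(f"Gini(N1) = {g1:.3f}, Gini(N2) = {g2:.3f}")
print(f"Gini(Children) = {g_children:.3f}")
```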
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions
Continuous Attributes: Computing Gini Index
Number of possible splitting values = number of distinct values
Each splitting value v has a count matrix associated with it:
class counts in each of the partitions, A < v and A >= v

[Figure: candidate splits on Taxable Income (> 80K?). Sorting the values and tallying the Yes/No class counts on each side of every candidate split gives Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420; the best split is the one with the minimum Gini, 0.300.]
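The best-split search for a continuous attribute can be sketched as follows, using the Taxable Income values from the training table; the candidate split points are taken as midpoints between consecutive sorted values.

```python
# Find the best binary split of a continuous attribute by Gini index:
# sort the values, try the midpoint between each pair of consecutive
# values, and keep the split with the lowest weighted Gini.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def class_counts(labels):
    return [labels.count("Yes"), labels.count("No")]

# (Taxable Income in K, Cheat) pairs from the training table.
data = [(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
        (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")]
data.sort()

best = None
for (v1, _), (v2, _) in zip(data, data[1:]):
    split = (v1 + v2) / 2
    left = [c for v, c in data if v < split]
    right = [c for v, c in data if v >= split]
    weighted = (len(left) * gini(class_counts(left)) +
                len(right) * gini(class_counts(right))) / len(data)
    if best is None or weighted < best[1]:
        best = (split, weighted)

print(f"best split at {best[0]} with Gini = {best[1]:.3f}")
# best split at 97.5 with Gini = 0.300
```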
Comparing Attribute Selection Measures
relatively faster learning speed (than other classification methods)
convertible to simple and easy-to-understand classification rules
can use SQL queries for accessing databases
[Figure: sequential covering: the example space contains positive examples; the examples covered by Rule 1, Rule 2, and Rule 3 are shown as overlapping regions.]

[Figure: rule growing from general to specific: start with A3=1, specialize to A3=1 && A1=2, then to A3=1 && A1=2 && A8=5, progressively separating positive from negative examples.]
Instance-based learning:
Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
Typical approaches:
k-nearest neighbor approach: instances represented as points in a Euclidean space
Locally weighted regression: constructs a local approximation
Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: query point xq surrounded by positive (+) and negative (-) training examples, and the Voronoi decision surface induced by 1-NN.]
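A minimal k-NN classifier over points in the plane, assuming Euclidean distance and majority vote; the sample points and labels are illustrative.

```python
# k-nearest neighbor classification: find the k training points closest
# to the query (Euclidean distance) and return their most common label.
import math
from collections import Counter

def knn_classify(training, xq, k=3):
    neighbors = sorted(training, key=lambda p: math.dist(p[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
            ((6, 6), "-"), ((6, 7), "-"), ((7, 6), "-")]

print(knn_classify(training, (1.5, 1.5)))  # nearest 3 are all '+' -> '+'
print(knn_classify(training, (6.5, 6.0)))  # nearest 3 are all '-' -> '-'
```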
Discussion on the k-NN Algorithm