
Data Mining:

Concepts and Techniques


— Chapter 6 —

Jiawei Han and Micheline Kamber


Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.



Chapter 6. Classification and Prediction

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by back propagation
 Support Vector Machines (SVM)
 Associative classification
 Lazy learners (or learning from your neighbors)
 Other classification methods
 Prediction
 Accuracy and error measures
 Ensemble methods
 Model selection
 Summary
Classification vs. Prediction

 Classification
   predicts categorical class labels (discrete or nominal)
   constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
 Prediction
   models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
   Credit approval
   Target marketing
   Medical diagnosis
   Fraud detection


Classification—A Two-Step Process

 Model construction: describing a set of predetermined classes
   Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
   The set of tuples used for model construction is the training set
   The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: for classifying future or unknown objects
   Estimate the accuracy of the model
     The known label of each test sample is compared with the classified result from the model
     Accuracy rate is the percentage of test set samples that are correctly classified by the model
     Test set is independent of the training set, otherwise over-fitting will occur
   If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction

Training data is fed to a classification algorithm, which produces the classifier (model).

Training Data:

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no

Resulting classifier (model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
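
To make this concrete, here is a minimal Python sketch (not from the book) that encodes the extracted rule as a function and checks it against the training tuples above; the data layout and function name are illustrative choices.

# A minimal sketch (not from the book): applying the extracted rule
# "IF rank = 'professor' OR years > 6 THEN tenured = 'yes'" to the training data.
training_data = [
    # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """Rule-based model learned from the training data."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Resubstitution check: how many training tuples does the rule classify correctly?
correct = sum(predict_tenured(r, y) == t for _, r, y, t in training_data)
print(f"{correct}/{len(training_data)} training tuples correctly classified")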
Process (2): Using the Model in Prediction

The classifier is first evaluated on testing data, then applied to unseen data.

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning

 Supervised learning (classification)


 Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues: Data Preparation

 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data



Issues: Evaluating Classification Methods

 Accuracy
   classifier accuracy: predicting the class label
   predictor accuracy: estimating the value of the predicted attribute
 Speed
   time to construct the model (training time)
   time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
   understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
Another Example

The same training data, a different tree:

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!


Decision Tree Classification

Induction: a tree induction algorithm learns a model (decision tree) from the training set.
Deduction: the learned model is then applied to the test set.

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?


Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the test tuple:

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO

Refund = No → MarSt = Married → leaf NO: assign Cheat = “No”


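
The tree walk above can be written directly as code. Below is a minimal Python sketch (not from the book); the nested-dict tree encoding and attribute names are illustrative assumptions.

# A minimal sketch (not from the book): the decision tree above written as a
# nested dict and applied to the test tuple (Refund = No, Married, 80K).
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "No",                       # leaf: Cheat = No
        "No": {
            "attr": "MarSt",
            "branches": {
                "Married": "No",           # leaf: Cheat = No
                "Single": {"attr": "TaxInc<80K", "branches": {True: "No", False: "Yes"}},
                "Divorced": {"attr": "TaxInc<80K", "branches": {True: "No", False: "Yes"}},
            },
        },
    },
}

def classify(node, tuple_):
    """Walk the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        attr = node["attr"]
        key = tuple_["TaxInc"] < 80 if attr == "TaxInc<80K" else tuple_[attr]
        node = node["branches"][key]
    return node

test_tuple = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, test_tuple))   # -> "No", i.e., assign Cheat = "No"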




Decision Tree Induction
 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)

 CART

 ID3, C4.5

 SLIQ, SPRINT



Decision Tree Induction: Training Dataset

This follows an example of Quinlan’s ID3 (Playing Tennis):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no


Output: A Decision Tree for “buys_computer”

age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit rating?
             excellent → no
             fair      → yes


Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:

      Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

      Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

 Information gained by branching on attribute A:

      Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

 Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)

      Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

 Splitting on age (using the training dataset above):

      age     p_i  n_i  I(p_i, n_i)
      <=30    2    3    0.971
      31…40   4    0    0
      >40     3    2    0.971

      Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

      (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

      Gain(age) = Info(D) - Info_age(D) = 0.246

 Similarly,

      Gain(income) = 0.029
      Gain(student) = 0.151
      Gain(credit_rating) = 0.048
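
As a sanity check, these numbers can be reproduced with a few lines of Python; this is a sketch of my own, and the function names are not from the book.

from math import log2

def info(counts):
    """Expected information (entropy) of a class distribution, e.g. I(9,5)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution of the whole dataset: 9 "yes", 5 "no"
info_D = info([9, 5])

# Partitions induced by age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
age_partitions = [(2, 3), (4, 0), (3, 2)]
info_age = sum(sum(p) / 14 * info(p) for p in age_partitions)

print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# Info(D) ≈ 0.940, Info_age(D) ≈ 0.694, Gain(age) ≈ 0.247 (0.246 after the slide's rounding)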
Computing Information-Gain for
Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
   Sort the values of A in increasing order
   Typically, the midpoint between each pair of adjacent values is considered as a possible split point
     (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
   The point with the minimum expected information requirement for A is selected as the split-point for A
 Split:
   D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)

 Information gain measure is biased towards attributes with a large number of values
 C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)

      SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)

      GainRatio(A) = Gain(A) / SplitInfo_A(D)

 Ex. income splits D into 4 “low”, 6 “medium”, and 4 “high” tuples:

      SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

      gain_ratio(income) = 0.029 / 1.557 = 0.019

 The attribute with the maximum gain ratio is selected as the splitting attribute
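
A quick check of the SplitInfo arithmetic, as an illustrative sketch (not the book's code); the partition sizes come from the income column of the training dataset above.

from math import log2

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes if s > 0)

si = split_info([4, 6, 4])                  # 4 low, 6 medium, 4 high
print(round(si, 3), round(0.029 / si, 3))   # 1.557 0.019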
Gini index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index gini(D) is defined as

      gini(D) = 1 - Σ_{j=1}^{n} p_j^2

   where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

      gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

 Reduction in impurity:

      Δgini(A) = gini(D) - gini_A(D)

 The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Gini index (CART, IBM IntelligentMiner)

 Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:

      gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

 Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

      gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

   but gini_{{medium,high}} is 0.30 and thus the best since it is the lowest
 All attributes are assumed continuous-valued
 May need other tools, e.g., clustering, to get the possible split values
 Can be modified for categorical attributes


Examples for computing GINI

      GINI(t) = 1 - Σ_j [p(j | t)]^2

      C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                       Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

      C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                       Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

      C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                       Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

      C1 = 3, C2 = 3:  P(C1) = 3/6,  P(C2) = 3/6
                       Gini = 1 - (3/6)^2 - (3/6)^2 = 0.5
Splitting Based on GINI

 When a node p is split into k partitions (children), the quality of the split is computed as

      GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)

   where n_i = number of records at child i, and n = number of records at node p


Binary Attributes: Computing GINI Index

 Splits into two partitions
 Effect of weighing partitions: larger and purer partitions are sought for

      Parent: C1 = 6, C2 = 6, Gini = 0.500

      Split on B:   N1 (B = Yes): C1 = 5, C2 = 2
                    N2 (B = No):  C1 = 1, C2 = 4

      Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
      Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.32

      Gini(Children) = 7/12 × 0.408 + 5/12 × 0.32 = 0.371
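
The weighted Gini computation above is easy to verify with a short Python sketch (mine, not the book's); the function names are illustrative.

def gini(counts):
    """Gini index of a node given per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split, with children given as lists of per-class counts."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

n1, n2 = [5, 2], [1, 4]               # class counts (C1, C2) in the two children
print(round(gini(n1), 3))             # 0.408
print(round(gini(n2), 3))             # 0.32
print(round(gini_split([n1, n2]), 3)) # 0.371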
Categorical Attributes: Computing Gini Index

 For each distinct value, gather counts for each class in the dataset
 Use the count matrix to make decisions

      Multi-way split (CarType):
                Family  Sports  Luxury
          C1    1       2       1
          C2    4       1       1
          Gini = 0.393

      Two-way split (find best partition of values):
                {Sports, Luxury}  {Family}
          C1    3                 1
          C2    2                 4
          Gini = 0.400

                {Sports}  {Family, Luxury}
          C1    2         2
          C2    1         5
          Gini = 0.419


Continuous Attributes: Computing Gini Index

 Use binary decisions based on one value (e.g., Taxable Income > 80K? Yes / No)
 Several choices for the splitting value
   Number of possible splitting values = number of distinct values
 Each splitting value v has a count matrix associated with it
   Class counts in each of the partitions, A < v and A ≥ v


Continuous Attributes: Computing Gini Index (cont.)

 For each continuous attribute,
   Sort the attribute values
   Choose a split position midway between any two adjacent values
   Linearly scan these values, each time updating the count matrix and computing the gini index
   Choose the split position that has the least gini index

      Sorted Taxable Income values (with Cheat labels):
          60(No) 70(No) 75(No) 85(Yes) 90(Yes) 95(Yes) 100(No) 120(No) 125(No) 220(No)

      Candidate split positions and resulting Gini (counts are Yes and No on the <= and > sides):

          Split   Yes(<= | >)   No(<= | >)   Gini
          55      0 | 3         0 | 7        0.420
          65      0 | 3         1 | 6        0.400
          72      0 | 3         2 | 5        0.375
          80      0 | 3         3 | 4        0.343
          87      1 | 2         3 | 4        0.417
          92      2 | 1         3 | 4        0.400
          97      3 | 0         3 | 4        0.300   ← best (lowest) split
          110     3 | 0         4 | 3        0.343
          122     3 | 0         5 | 2        0.375
          172     3 | 0         6 | 1        0.400
          230     3 | 0         7 | 0        0.420
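
Below is a minimal Python sketch (not from the book) of this linear scan over candidate split points; the list names are illustrative.

# Try each candidate split on Taxable Income and keep the lowest weighted Gini.
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

def gini(labels_subset):
    n = len(labels_subset)
    if n == 0:
        return 0.0
    p_yes = labels_subset.count("Yes") / n
    return 1.0 - p_yes ** 2 - (1.0 - p_yes) ** 2

def weighted_gini(split):
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Candidate splits: midpoints between adjacent sorted values
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
best = min(candidates, key=weighted_gini)
print(best, round(weighted_gini(best), 3))   # 97.5 0.3  (the slide rounds the midpoint to 97)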
Comparing Attribute Selection Measures

 The three measures, in general, return good results but


 Information gain:
 biased towards multivalued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one
partition is much smaller than the others
 Gini index:
   biased toward multivalued attributes
   has difficulty when the number of classes is large
   tends to favor tests that result in equal-sized partitions and purity in both partitions
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due to noise or
outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
 Use a set of data different from the training data to decide
which is the “best pruned tree”
Enhancements to Basic Decision Tree Induction

 Allow for continuous-valued attributes


 Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
 Handle missing attribute values
 Assign the most common value of the attribute
 Assign probability to each of the possible values
 Attribute construction
 Create new attributes based on existing ones that are
sparsely represented
 This reduces fragmentation, repetition, and replication
Classification in Large Databases

 Classification—a classical problem extensively studied by


statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification

methods)
 convertible to simple and easy to understand

classification rules
 can use SQL queries for accessing databases

 comparable classification accuracy with other methods



Scalable Decision Tree Induction Methods

 SLIQ (EDBT’96 — Mehta et al.)
   Builds an index for each attribute; only the class list and the current attribute list reside in memory
 SPRINT (VLDB’96 — J. Shafer et al.)
   Constructs an attribute list data structure
 PUBLIC (VLDB’98 — Rastogi & Shim)
   Integrates tree splitting and tree pruning: stop growing the tree earlier
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
   Builds an AVC-list (attribute, value, class label)
 BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
   Uses bootstrapping to create several small samples


Using IF-THEN Rules for Classification

 Represent the knowledge in the form of IF-THEN rules
      R: IF age = youth AND student = yes THEN buys_computer = yes
   Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
   ncovers = # of tuples covered by R
   ncorrect = # of tuples correctly classified by R
      coverage(R) = ncovers / |D|      /* D: training data set */
      accuracy(R) = ncorrect / ncovers      (a short sketch of these two measures follows this slide)
 If more than one rule is triggered, conflict resolution is needed
   Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
   Class-based ordering: decreasing order of prevalence or misclassification cost per class
   Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
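
For concreteness, a small Python sketch (not from the book) of rule coverage and accuracy; the tuples below are illustrative, not the book's dataset.

# Coverage and accuracy of R: IF age = youth AND student = yes THEN buys_computer = yes
D = [
    # (age, student, buys_computer) — a few made-up tuples for illustration
    ("youth",  "yes", "yes"),
    ("youth",  "yes", "no"),
    ("youth",  "no",  "no"),
    ("senior", "yes", "yes"),
]

def rule_fires(age, student):
    return age == "youth" and student == "yes"

covered = [t for t in D if rule_fires(t[0], t[1])]
correct = [t for t in covered if t[2] == "yes"]

coverage = len(covered) / len(D)         # ncovers / |D|
accuracy = len(correct) / len(covered)   # ncorrect / ncovers
print(coverage, accuracy)                # 0.5 0.5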
Rule Extraction from a Decision Tree

 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
 Rules are mutually exclusive and exhaustive
 Example: rule extraction from our buys_computer decision tree (age at the root; student and credit_rating below it)
      IF age = young AND student = no THEN buys_computer = no
      IF age = young AND student = yes THEN buys_computer = yes
      IF age = mid-age THEN buys_computer = yes
      IF age = old AND credit_rating = excellent THEN buys_computer = no
      IF age = old AND credit_rating = fair THEN buys_computer = yes


Rule Extraction from the Training Data

 Sequential covering algorithm: extracts rules directly from training data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
   Rules are learned one at a time
   Each time a rule is learned, the tuples covered by the rule are removed
   The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
 Contrast with decision-tree induction, which learns a set of rules simultaneously
Sequential Covering Algorithm

      while (enough target tuples left)
          generate a rule
          remove positive target tuples satisfying this rule

 Each learned rule covers a different region of the positive examples (Rules 1, 2, 3 cover successive subsets); a sketch of the loop follows below.
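
Below is a hedged Python sketch of the covering loop; the single-test rule representation and the quality threshold are simplifying assumptions of mine, not the book's algorithm.

# A minimal sketch: sequential covering for one target class, with rules
# restricted to a single attribute = value test purely for illustration.
def learn_one_rule(tuples, target_class, attributes):
    """Greedily pick the single test with the best accuracy on the remaining tuples."""
    best = None
    for attr in attributes:
        for value in {t[attr] for t in tuples}:
            covered = [t for t in tuples if t[attr] == value]
            correct = [t for t in covered if t["class"] == target_class]
            if covered and (best is None or len(correct) / len(covered) > best[0]):
                best = (len(correct) / len(covered), attr, value)
    return best  # (accuracy, attribute, value)

def sequential_covering(tuples, target_class, attributes, min_accuracy=0.7):
    rules, remaining = [], list(tuples)
    while any(t["class"] == target_class for t in remaining):       # enough target tuples left
        rule = learn_one_rule(remaining, target_class, attributes)  # generate a rule
        if rule is None or rule[0] < min_accuracy:
            break
        rules.append(rule)
        # remove positive target tuples satisfying this rule
        remaining = [t for t in remaining
                     if not (t[rule[1]] == rule[2] and t["class"] == target_class)]
    return rules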


How to Learn-One-Rule?

 Start with the most general rule possible: condition = empty
 Add new attribute tests by adopting a greedy depth-first strategy
   Pick the one that most improves the rule quality
 Rule-quality measures: consider both coverage and accuracy
   Foil-gain (in FOIL & RIPPER): assesses info_gain from extending the condition

      FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) - log2( pos / (pos + neg) ) )

   It favors rules that have high accuracy and cover many positive tuples
 Rule pruning based on an independent set of test tuples

      FOIL_Prune(R) = (pos - neg) / (pos + neg)

   pos/neg are the numbers of positive/negative tuples covered by R.
   If FOIL_Prune is higher for the pruned version of R, prune R
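
A small Python sketch (mine, not the book's) of the two measures above; the example counts are made up for illustration.

from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """Gain from extending a rule: (pos, neg) covered before, (pos_new, neg_new) after."""
    if pos_new == 0:
        return 0.0
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

# E.g., a rule covering 30 pos / 20 neg, extended so it covers 20 pos / 2 neg:
print(round(foil_gain(30, 20, 20, 2), 2))   # gain of the extension
print(round(foil_prune(20, 2), 2))          # 0.82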


Rule Generation

 To generate a rule
      while (true)
          find the best predicate p
          if foil-gain(p) > threshold then add p to current rule
          else break

 Example of a rule being grown: A3 = 1, then A3 = 1 && A1 = 2, then A3 = 1 && A1 = 2 && A8 = 5; each added conjunct narrows the coverage toward the positive examples and away from the negative examples.


Lazy vs. Eager Learning

 Lazy vs. eager learning
   Lazy learning (e.g., instance-based learning): simply stores training data (or does only minor processing) and waits until it is given a test tuple
   Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
   Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
   Eager: must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods

 Instance-based learning:
   Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
 Typical approaches
   k-nearest neighbor approach
     Instances represented as points in a Euclidean space
   Locally weighted regression
     Constructs local approximation
   Case-based reasoning
     Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm

 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
 The target function could be discrete- or real-valued
 For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
 Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples (figure of the query point xq among “+” and “−” training points omitted)
Discussion on the k-NN Algorithm

 k-NN for real-valued prediction for a given unknown tuple
   Returns the mean value of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
   Weight the contribution of each of the k neighbors according to their distance to the query xq:

      w = 1 / d(xq, xi)^2

   Give greater weight to closer neighbors
 Robust to noisy data by averaging over the k nearest neighbors
 Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
   To overcome it, stretch the axes or eliminate the least relevant attributes
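
A minimal Python sketch (not from the book) of distance-weighted k-NN for a discrete-valued target, using the weight w = 1 / d(xq, xi)^2 above; the toy training points are illustrative.

from collections import defaultdict
from math import dist   # Euclidean distance (Python 3.8+)

def knn_predict(training, xq, k=3):
    """training: list of (point, label) pairs; xq: query point (tuple of floats)."""
    neighbors = sorted(training, key=lambda t: dist(t[0], xq))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, xq)
        votes[label] += 1.0 / (d * d) if d > 0 else float("inf")   # exact match dominates
    return max(votes, key=votes.get)

training = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((3.1, 2.9), "-")]
print(knn_predict(training, (1.1, 1.0), k=3))   # "+"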
Case-Based Reasoning (CBR)
 CBR: Uses a database of problem solutions to solve new problems
 Store symbolic description (tuples or cases)—not points in a Euclidean
space
 Applications: Customer-service (product-related diagnosis), legal ruling
 Methodology
 Instances represented by rich symbolic descriptions (e.g., function
graphs)
 Search for similar cases, multiple retrieved cases may be combined
 Tight coupling between case retrieval, knowledge-based reasoning, and
problem solving
 Challenges
 Find a good similarity metric
  Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases
