
Data Mining

Spring 2007

• Noisy data
• Data Discretization using Entropy-based and ChiMerge methods
Noisy Data
• Noise: random error; data is present but not correct.
– Data Transmission error
– Data Entry problem

• Removing noise
– Data Smoothing (rounding, averaging within a window).
– Clustering/merging and Detecting outliers.

• Data Smoothing
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using Smooth by Bin Means,
Smooth by Bin Medians, Smooth by Bin Boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
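The binning steps above can be written out directly. Below is a minimal Python sketch that reproduces the three equi-depth bins and both smoothing variants for the price data; the function names (equi_depth_bins, smooth_by_means, smooth_by_boundaries) are illustrative, not from any particular library.

```python
# Minimal sketch of equi-depth binning and smoothing for the price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Sort the values and split them into bins of (roughly) equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min or max boundary."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```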
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.

• Values that fall outside the set of clusters may be considered
outliers, as sketched below.
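A minimal sketch of this idea on one-dimensional data: cluster the values with a tiny k-means and treat members of very small clusters as outliers. The kmeans_1d helper, the choice k = 3, and the "clusters smaller than 2 are outliers" rule are illustrative assumptions, not part of the slides.

```python
# Cluster 1-D values and flag members of tiny clusters as outliers.
from statistics import mean

def kmeans_1d(values, k, iters=20):
    """Very small 1-D k-means; returns the list of clusters (lists of values)."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            clusters[i].append(v)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]  # 120 is a planted outlier
clusters = kmeans_1d(values, k=3)
outliers = [v for c in clusters if len(c) < 2 for v in c]  # tiny clusters -> outliers
print(outliers)  # expected: [120]
```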
Data Discretization
• The task of attribute (feature)-discretization techniques is
to discretize the values of continuous features into a small
number of intervals, where each interval is mapped to a
discrete symbol.
• Advantages:
– Simplified data description and easy-to-understand data and final
data-mining results.
– Only a small number of interesting rules is mined.
– Processing time for producing the end results is decreased.
– Accuracy of the end results is improved.
Effect of Continuous Data on Results Accuracy

Training data (discover only those rules whose support (frequency) is >= 1):

  age     income    age    buys_computer
  <=30    medium     9     no
  <=30    medium    10     no
  <=30    medium    11     no
  <=30    medium    12     no

• If ‘age <= 30’ and income = ‘medium’ and age = ‘9’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘10’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘11’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘12’ then buys_computer = ‘no’
Test data:

  age     income    age    buys_computer
  <=30    medium     9     ?
  <=30    medium    11     ?
  <=30    medium    13     ?

Because age = 13 never appears in the training data, no rule covers that
record, so the prediction accuracy decreases to 66.7%.
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy after
partitioning is

  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2),
  where Ent(S1) = − Σi pi · log2(pi)

• Where pi is the probability of class i in S1, determined by
dividing the number of samples of class i in S1 by the total
number of samples in S1.
Entropy-Based Discretization
• The boundary that minimizes the entropy function over all
possible boundaries is selected as a binary discretization.

• The process is recursively applied to the partitions obtained,
until some stopping criterion is met, e.g., until the information
gain Ent(S) − E(S, T) falls below a small threshold δ (a sketch of
the recursive procedure is shown below).
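A hedged Python sketch of the recursive procedure, assuming the samples are given as (value, class) pairs. The helper names (entropy, best_boundary, discretize) and the min_gain stopping threshold are illustrative assumptions rather than part of the slides.

```python
# Recursive entropy-based discretization on (value, class) pairs.
from math import log2
from collections import Counter

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_boundary(samples):
    """Return (boundary T, E(S, T)) minimizing the post-partition entropy."""
    values = sorted({v for v, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        left = [c for v, c in samples if v <= t]
        right = [c for v, c in samples if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        if best is None or e < best[1]:
            best = (t, e)
    return best

def discretize(samples, min_gain=0.01):
    """Recursively split until the information gain drops below min_gain."""
    split = best_boundary(samples)
    if split is None:
        return []
    t, e = split
    if entropy([c for _, c in samples]) - e < min_gain:
        return []
    left = [s for s in samples if s[0] <= t]
    right = [s for s in samples if s[0] > t]
    return discretize(left, min_gain) + [t] + discretize(right, min_gain)

ages = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]
print(discretize(list(zip(ages, grades))))  # cut points chosen for Age (Example 1 data)
```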
Example 1
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
• Let Grade be the class attribute. Use entropy-based
discretization to divide the range of ages into different
discrete intervals.

• There are 6 possible boundaries, taken as the midpoints between
adjacent distinct ages, e.g., (21 + 22) / 2 = 21.5 and
(22 + 24) / 2 = 23. They are 21.5, 23, 24.5, 26, 31, and 38.

• Let us consider the boundary at T = 21.5.
  Let S1 = {21}
  Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}
Example 1 (cont’)
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P

• The number of elements in S1 and S2 are:
  |S1| = 1
  |S2| = 8
• The entropy of S1 is
  Ent(S1) = −P(Grade = F) · log2 P(Grade = F) − P(Grade = P) · log2 P(Grade = P)
          = −(1/1) · log2(1/1) − (0/1) · log2(0/1)
          = 0    (taking 0 · log2 0 = 0)

• The entropy of S2 is
  Ent(S2) = −P(Grade = F) · log2 P(Grade = F) − P(Grade = P) · log2 P(Grade = P)
          = −(2/8) · log2(2/8) − (6/8) · log2(6/8)
          ≈ 0.811

Example 1 (cont’)
• Hence, the entropy after partitioning at T = 21.5 is

  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
          = (1/9) · Ent(S1) + (8/9) · Ent(S2)
          = (1/9) · 0 + (8/9) · 0.811
          ≈ 0.721
Example 1 (cont’)
• The entropies after partitioning are computed in the same way for
all the boundaries: E(S, 21.5), E(S, 23), ..., E(S, 38). A short
numeric check is sketched after the table below.

• Select the boundary with the smallest entropy. Suppose the best
boundary is T = 23.

• Now recursively apply entropy-based discretization to both of the
resulting partitions.

ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
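As a numeric check, the snippet below recomputes E(S, T) for each of the six candidate boundaries on the Age/Grade data above; the helper names are illustrative only.

```python
# Recompute the post-partition entropy E(S, T) for every candidate boundary.
from math import log2
from collections import Counter

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]

def ent(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def e_after_split(t):
    """E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2) for boundary T."""
    s1 = [g for a, g in zip(ages, grades) if a <= t]
    s2 = [g for a, g in zip(ages, grades) if a > t]
    return (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(ages)

for t in [21.5, 23, 24.5, 26, 31, 38]:
    print(f"E(S, {t}) = {e_after_split(t):.3f}")
# E(S, 21.5) comes out to about 0.721, matching the hand calculation above.
```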
ChiMerge (Kerber92)
• This discretization method uses a merging approach.
• ChiMerge’s view:
– First sort the data on the basis of the attribute which is
being discretized.
– List all possible boundaries or intervals. In the last example
the boundaries 0, 21.5, 23, 24.5, 26, 31, and 38 define the
initial intervals.
– For every pair of adjacent intervals, calculate the χ² test of
class independence, e.g. for
• {0, 21.5} and {21.5, 23}
• {21.5, 23} and {23, 24.5}
• ...
– Pick the pair of adjacent intervals with the lowest χ² value
and merge them.
ChiMerge -- The Algorithm
1. Compute the χ² value for each pair of adjacent intervals.
2. Merge the pair of adjacent intervals with the lowest χ² value.
3. Repeat steps 1 and 2 until the χ² values of all adjacent pairs
exceed a threshold (a runnable sketch of this loop follows the
example data set below).
Chi-Square Test

           Class 1   Class 2   Σ
  Int 1    o11       o12       R1
  Int 2    o21       o22       R2
  Σ        C1        C2        N

  χ² = Σ (i = 1..r) Σ (j = 1..c) (oij − eij)² / eij

  oij = observed frequency of interval i for class j
  eij = expected frequency = (Ri · Cj) / N
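A minimal sketch of this computation for two adjacent intervals, given their observed class counts. The 0.1 floor on expected counts mirrors the convention used in the worked example that follows and is an assumption, not part of the formula itself.

```python
# Chi-square statistic for the 2x2 table formed by two adjacent intervals.
def chi2_2x2(row1, row2):
    """row1, row2: observed counts [class 1, class 2] for two adjacent intervals."""
    rows = [row1, row2]
    r_totals = [sum(r) for r in rows]
    c_totals = [row1[0] + row2[0], row1[1] + row2[1]]
    n = sum(r_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = max(r_totals[i] * c_totals[j] / n, 0.1)  # 0.1 floor avoids /0
            chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2

# Intervals [0, 7.5] and [7.5, 10] from the example below:
print(chi2_2x2([2, 1], [2, 0]))  # about 0.833: below the 2.706 threshold, so merge
```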
ChiMerge Example
Data Set
Sample: F K

1 1 1
2 3 2
3 7 1
4 8 1
5 9 1
6 11 2
7 23 2
8 37 1
9 39 2
10 45 1
11 46 1
12 59 1

• Interval points for feature F are: 0, 2, 5, 7.5, 8.5, 10, etc.
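Putting the pieces together, here is a hedged sketch of the whole ChiMerge loop on this data set. The outer cut points 0 and 60, the 2.706 threshold (d = 1), and the 0.1 floor on zero expected counts follow the worked example below; the function names and the tie-breaking order among equal χ² values are illustrative choices, so intermediate merges may differ from the example even though the final intervals agree.

```python
# ChiMerge merging loop on the F/K data set above.
F = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
K = [1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1]
points = [0] + [(a + b) / 2 for a, b in zip(F, F[1:])] + [60]  # 0, 2, 5, 7.5, 8.5, 10, ...
THRESHOLD = 2.706  # chi-square threshold for 1 degree of freedom

def counts(lo, hi):
    """Observed class counts [K=1, K=2] for the interval (lo, hi]."""
    return [sum(1 for f, k in zip(F, K) if lo < f <= hi and k == c) for c in (1, 2)]

def chi2(row1, row2):
    """Chi-square statistic for the 2x2 table of two adjacent intervals."""
    rows = [row1, row2]
    n = sum(row1) + sum(row2)
    cols = [row1[0] + row2[0], row1[1] + row2[1]]
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = max(sum(rows[i]) * cols[j] / n, 0.1)  # 0.1 floor avoids division by zero
            total += (rows[i][j] - e) ** 2 / e
    return total

while True:
    pairs = [(chi2(counts(points[i], points[i + 1]),
                   counts(points[i + 1], points[i + 2])), i + 1)
             for i in range(len(points) - 2)]
    best, cut = min(pairs)           # lowest chi2; ties broken by position
    if best > THRESHOLD:
        break                        # every adjacent pair differs significantly
    points.pop(cut)                  # merge the two intervals with the lowest chi2

print(points)  # final cut points 0, 10, 42, 60, as in the final discretization below
```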


ChiMerge Example (cont.)
χ² was minimum for the adjacent intervals [7.5, 8.5] and [8.5, 10]:

                        K=1      K=2      Σ
  Interval [7.5, 8.5]   A11=1    A12=0    R1=1
  Interval [8.5, 10]    A21=1    A22=0    R2=1
  Σ                     C1=2     C2=0     N=2

Based on the table’s values, we can calculate the expected values
eij = (Ri · Cj) / N, using 0.1 in place of an expected value of 0:

  E11 = 2/2 = 1
  E12 = 0/2 = 0 → 0.1
  E21 = 2/2 = 1
  E22 = 0/2 = 0 → 0.1

and the corresponding χ² test:

  χ² = (1 − 1)² / 1 + (0 − 0.1)² / 0.1 + (1 − 1)² / 1 + (0 − 0.1)² / 0.1 = 0.2

For degree of freedom d = 1, χ² = 0.2 < 2.706 (MERGE!)
ChiMerge Example (cont.)
Additional Iterations:

                        K=1      K=2      Σ
  Interval [0, 7.5]     A11=2    A12=1    R1=3
  Interval [7.5, 10]    A21=2    A22=0    R2=2
  Σ                     C1=4     C2=1     N=5

  E11 = 12/5 = 2.4
  E12 = 3/5 = 0.6
  E21 = 8/5 = 1.6
  E22 = 2/5 = 0.4

  χ² = (2 − 2.4)² / 2.4 + (1 − 0.6)² / 0.6 + (2 − 1.6)² / 1.6 + (0 − 0.4)² / 0.4
     = 0.833

For degree of freedom d = 1, χ² = 0.833 < 2.706 (MERGE!)


ChiMerge Example (cont.)
                          K=1      K=2      Σ
  Interval [0, 10.0]      A11=4    A12=1    R1=5
  Interval [10.0, 42.0]   A21=1    A22=3    R2=4
  Σ                       C1=5     C2=4     N=9

  E11 = 2.78
  E12 = 2.22
  E21 = 2.22
  E22 = 1.78

and χ² = 2.72 > 2.706 (NO MERGE!)

Final discretization: [0, 10], [10, 42], and [42, 60]
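As a small illustration of mapping each final interval to a discrete symbol (as described on the Data Discretization slide), the snippet below applies the final cut points to feature F; the symbol names low/medium/high are an assumption for illustration.

```python
# Map raw F values to the symbols of the final ChiMerge intervals.
cuts = [0, 10, 42, 60]
symbols = ["low", "medium", "high"]     # one symbol per interval

def discretize_f(value):
    """Return the symbol of the interval (cuts[i], cuts[i+1]] containing value."""
    for lo, hi, sym in zip(cuts, cuts[1:], symbols):
        if lo < value <= hi:
            return sym
    return None  # outside the discretized range

F = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
print([discretize_f(v) for v in F])
# ['low', 'low', 'low', 'low', 'low', 'medium', 'medium', 'medium', 'medium',
#  'high', 'high', 'high']
```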


References
– Jiawei Han and Micheline Kamber, “Data Mining: Concepts and
Techniques”, Morgan Kaufmann Publishers, August 2000 (Chapter 3).

– Mehmed Kantardzic, “Data Mining: Concepts, Models, Methods, and
Algorithms”, John Wiley & Sons, 2003 (Chapter 3).
