
Data Mining

Spring 2007

• Noisy data
• Data Discretization using Entropy-based and ChiMerge methods
Noisy Data
• Noise: random error; data is present but not correct.
– Data Transmission error
– Data Entry problem

• Removing noise
– Data Smoothing (rounding, averaging within a window).
– Clustering/merging and Detecting outliers.

• Data Smoothing
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using Smooth by Bin Means,
Smooth by Bin Medians, Smooth by Bin Boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
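The binning steps above can be written out directly. Below is a minimal Python sketch that reproduces the three equi-depth bins and both smoothing variants for the price data; the function names (equi_depth_bins, smooth_by_means, smooth_by_boundaries) are illustrative, not from any particular library.

```python
# Minimal sketch of equi-depth binning and smoothing for the price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Sort the values and split them into bins of (roughly) equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min or max boundary."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```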
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.

• Values that fall outside the set of clusters may be considered
outliers, as sketched below.
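A minimal sketch of this idea on one-dimensional data: cluster the values with a tiny k-means and treat members of very small clusters as outliers. The kmeans_1d helper, the choice k = 3, and the "clusters smaller than 2 are outliers" rule are illustrative assumptions, not part of the slides.

```python
# Cluster 1-D values and flag members of tiny clusters as outliers.
from statistics import mean

def kmeans_1d(values, k, iters=20):
    """Very small 1-D k-means; returns the list of clusters (lists of values)."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            clusters[i].append(v)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]  # 120 is a planted outlier
clusters = kmeans_1d(values, k=3)
outliers = [v for c in clusters if len(c) < 2 for v in c]  # tiny clusters -> outliers
print(outliers)  # expected: [120]
```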
Data Discretization
• The task of attribute (feature)-discretization techniques is
to discretize the values of continuous features into a small
number of intervals, where each interval is mapped to a
discrete symbol.
• Advantages:
– Simplified data description and easy-to-understand data and final
data-mining results.
– Only a small number of interesting rules is mined.
– Processing time for producing the end results is decreased.
– Accuracy of the end results is improved.
Effect of Continuous Data on Results Accuracy

Training data (discover only those rules whose support (frequency) is >= 1):

  age     income    age    buys_computer
  <=30    medium     9     no
  <=30    medium    10     no
  <=30    medium    11     no
  <=30    medium    12     no

• If ‘age <= 30’ and income = ‘medium’ and age = ‘9’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘10’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘11’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘12’ then buys_computer = ‘no’
Test data:

  age     income    age    buys_computer
  <=30    medium     9     ?
  <=30    medium    11     ?
  <=30    medium    13     ?

Because age = 13 never appears in the training data, no rule covers that
record, so the prediction accuracy decreases to 66.7%.
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy after
partitioning is

  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2),
  where Ent(S1) = − Σi pi · log2(pi)

• Where pi is the probability of class i in S1, determined by
dividing the number of samples of class i in S1 by the total
number of samples in S1.
Entropy-Based Discretization
• The boundary that minimizes the entropy function over all
possible boundaries is selected as a binary discretization.

• The process is recursively applied to the partitions obtained,
until some stopping criterion is met, e.g., until the information
gain Ent(S) − E(S, T) falls below a small threshold δ (a sketch of
the recursive procedure is shown below).
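A hedged Python sketch of the recursive procedure, assuming the samples are given as (value, class) pairs. The helper names (entropy, best_boundary, discretize) and the min_gain stopping threshold are illustrative assumptions rather than part of the slides.

```python
# Recursive entropy-based discretization on (value, class) pairs.
from math import log2
from collections import Counter

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_boundary(samples):
    """Return (boundary T, E(S, T)) minimizing the post-partition entropy."""
    values = sorted({v for v, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        left = [c for v, c in samples if v <= t]
        right = [c for v, c in samples if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        if best is None or e < best[1]:
            best = (t, e)
    return best

def discretize(samples, min_gain=0.01):
    """Recursively split until the information gain drops below min_gain."""
    split = best_boundary(samples)
    if split is None:
        return []
    t, e = split
    if entropy([c for _, c in samples]) - e < min_gain:
        return []
    left = [s for s in samples if s[0] <= t]
    right = [s for s in samples if s[0] > t]
    return discretize(left, min_gain) + [t] + discretize(right, min_gain)

ages = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]
print(discretize(list(zip(ages, grades))))  # cut points chosen for Age (Example 1 data)
```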
Example 1
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
• Let Grade be the class attribute. Use entropy-based
discretization to divide the range of ages into different
discrete intervals.

• There are 6 possible boundaries, taken as the midpoints between
adjacent distinct ages, e.g., (21 + 22) / 2 = 21.5 and
(22 + 24) / 2 = 23. They are 21.5, 23, 24.5, 26, 31, and 38.

• Let us consider the boundary at T = 21.5.
  Let S1 = {21}
  Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}
Example 1 (cont’)
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P

• The number of elements in S1 and S2 are:
  |S1| = 1
  |S2| = 8
• The entropy of S1 is
  Ent(S1) = −P(Grade = F) · log2 P(Grade = F) − P(Grade = P) · log2 P(Grade = P)
          = −(1/1) · log2(1/1) − (0/1) · log2(0/1)
          = 0    (taking 0 · log2 0 = 0)

• The entropy of S2 is
  Ent(S2) = −P(Grade = F) · log2 P(Grade = F) − P(Grade = P) · log2 P(Grade = P)
          = −(2/8) · log2(2/8) − (6/8) · log2(6/8)
          ≈ 0.811

Example 1 (cont’)
• Hence, the entropy after partitioning at T = 21.5 is

  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
          = (1/9) · Ent(S1) + (8/9) · Ent(S2)
          = (1/9) · 0 + (8/9) · 0.811
          ≈ 0.721
Example 1 (cont’)
• The entropies after partitioning are computed in the same way for
all the boundaries: E(S, 21.5), E(S, 23), ..., E(S, 38). A short
numeric check is sketched after the table below.

• Select the boundary with the smallest entropy. Suppose the best
boundary is T = 23.

• Now recursively apply entropy-based discretization to both of the
resulting partitions.

ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
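As a numeric check, the snippet below recomputes E(S, T) for each of the six candidate boundaries on the Age/Grade data above; the helper names are illustrative only.

```python
# Recompute the post-partition entropy E(S, T) for every candidate boundary.
from math import log2
from collections import Counter

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]

def ent(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def e_after_split(t):
    """E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2) for boundary T."""
    s1 = [g for a, g in zip(ages, grades) if a <= t]
    s2 = [g for a, g in zip(ages, grades) if a > t]
    return (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(ages)

for t in [21.5, 23, 24.5, 26, 31, 38]:
    print(f"E(S, {t}) = {e_after_split(t):.3f}")
# E(S, 21.5) comes out to about 0.721, matching the hand calculation above.
```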
ChiMerge (Kerber92)
• This discretization method uses a merging approach.
• ChiMerge’s view:
– First sort the data on the basis of the attribute which is
being discretized.
– List all possible boundaries or intervals. In the last example
the boundaries 0, 21.5, 23, 24.5, 26, 31, and 38 define the
initial intervals.
– For every pair of adjacent intervals, calculate the χ² test of
class independence, e.g. for
• {0, 21.5} and {21.5, 23}
• {21.5, 23} and {23, 24.5}
• ...
– Pick the pair of adjacent intervals with the lowest χ² value
and merge them.
ChiMerge -- The Algorithm
1. Compute the χ² value for each pair of adjacent intervals.
2. Merge the pair of adjacent intervals with the lowest χ² value.
3. Repeat steps 1 and 2 until the χ² values of all adjacent pairs
exceed a threshold (a runnable sketch of this loop follows the
example data set below).
Chi-Square Test

           Class 1   Class 2   Σ
  Int 1    o11       o12       R1
  Int 2    o21       o22       R2
  Σ        C1        C2        N

  χ² = Σ (i = 1..r) Σ (j = 1..c) (oij − eij)² / eij

  oij = observed frequency of interval i for class j
  eij = expected frequency = (Ri · Cj) / N
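A minimal sketch of this computation for two adjacent intervals, given their observed class counts. The 0.1 floor on expected counts mirrors the convention used in the worked example that follows and is an assumption, not part of the formula itself.

```python
# Chi-square statistic for the 2x2 table formed by two adjacent intervals.
def chi2_2x2(row1, row2):
    """row1, row2: observed counts [class 1, class 2] for two adjacent intervals."""
    rows = [row1, row2]
    r_totals = [sum(r) for r in rows]
    c_totals = [row1[0] + row2[0], row1[1] + row2[1]]
    n = sum(r_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = max(r_totals[i] * c_totals[j] / n, 0.1)  # 0.1 floor avoids /0
            chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2

# Intervals [0, 7.5] and [7.5, 10] from the example below:
print(chi2_2x2([2, 1], [2, 0]))  # about 0.833: below the 2.706 threshold, so merge
```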
ChiMerge Example
Data Set
Sample: F K

1 1 1
2 3 2
3 7 1
4 8 1
5 9 1
6 11 2
7 23 2
8 37 1
9 39 2
10 45 1
11 46 1
12 59 1

• Interval points for feature F are: 0, 2, 5, 7.5, 8.5, 10, etc.
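Putting the pieces together, here is a hedged sketch of the whole ChiMerge loop on this data set. The outer cut points 0 and 60, the 2.706 threshold (d = 1), and the 0.1 floor on zero expected counts follow the worked example below; the function names and the tie-breaking order among equal χ² values are illustrative choices, so intermediate merges may differ from the example even though the final intervals agree.

```python
# ChiMerge merging loop on the F/K data set above.
F = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
K = [1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1]
points = [0] + [(a + b) / 2 for a, b in zip(F, F[1:])] + [60]  # 0, 2, 5, 7.5, 8.5, 10, ...
THRESHOLD = 2.706  # chi-square threshold for 1 degree of freedom

def counts(lo, hi):
    """Observed class counts [K=1, K=2] for the interval (lo, hi]."""
    return [sum(1 for f, k in zip(F, K) if lo < f <= hi and k == c) for c in (1, 2)]

def chi2(row1, row2):
    """Chi-square statistic for the 2x2 table of two adjacent intervals."""
    rows = [row1, row2]
    n = sum(row1) + sum(row2)
    cols = [row1[0] + row2[0], row1[1] + row2[1]]
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = max(sum(rows[i]) * cols[j] / n, 0.1)  # 0.1 floor avoids division by zero
            total += (rows[i][j] - e) ** 2 / e
    return total

while True:
    pairs = [(chi2(counts(points[i], points[i + 1]),
                   counts(points[i + 1], points[i + 2])), i + 1)
             for i in range(len(points) - 2)]
    best, cut = min(pairs)           # lowest chi2; ties broken by position
    if best > THRESHOLD:
        break                        # every adjacent pair differs significantly
    points.pop(cut)                  # merge the two intervals with the lowest chi2

print(points)  # final cut points 0, 10, 42, 60, as in the final discretization below
```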


ChiMerge Example (cont.)
χ² was minimum for the adjacent intervals [7.5, 8.5] and [8.5, 10]:

                        K=1      K=2      Σ
  Interval [7.5, 8.5]   A11=1    A12=0    R1=1
  Interval [8.5, 10]    A21=1    A22=0    R2=1
  Σ                     C1=2     C2=0     N=2

Based on the table’s values, we can calculate the expected values
eij = (Ri · Cj) / N, using 0.1 in place of an expected value of 0:

  E11 = 2/2 = 1
  E12 = 0/2 = 0 → 0.1
  E21 = 2/2 = 1
  E22 = 0/2 = 0 → 0.1

and the corresponding χ² test:

  χ² = (1 − 1)² / 1 + (0 − 0.1)² / 0.1 + (1 − 1)² / 1 + (0 − 0.1)² / 0.1 = 0.2

For degree of freedom d = 1, χ² = 0.2 < 2.706 (MERGE!)
ChiMerge Example (cont.)
Additional Iterations:

                        K=1      K=2      Σ
  Interval [0, 7.5]     A11=2    A12=1    R1=3
  Interval [7.5, 10]    A21=2    A22=0    R2=2
  Σ                     C1=4     C2=1     N=5

  E11 = 12/5 = 2.4
  E12 = 3/5 = 0.6
  E21 = 8/5 = 1.6
  E22 = 2/5 = 0.4

  χ² = (2 − 2.4)² / 2.4 + (1 − 0.6)² / 0.6 + (2 − 1.6)² / 1.6 + (0 − 0.4)² / 0.4
     = 0.833

For degree of freedom d = 1, χ² = 0.833 < 2.706 (MERGE!)


ChiMerge Example (cont.)
                          K=1      K=2      Σ
  Interval [0, 10.0]      A11=4    A12=1    R1=5
  Interval [10.0, 42.0]   A21=1    A22=3    R2=4
  Σ                       C1=5     C2=4     N=9

  E11 = 2.78
  E12 = 2.22
  E21 = 2.22
  E22 = 1.78

and χ² = 2.72 > 2.706 (NO MERGE!)

Final discretization: [0, 10], [10, 42], and [42, 60]
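As a small illustration of mapping each final interval to a discrete symbol (as described on the Data Discretization slide), the snippet below applies the final cut points to feature F; the symbol names low/medium/high are an assumption for illustration.

```python
# Map raw F values to the symbols of the final ChiMerge intervals.
cuts = [0, 10, 42, 60]
symbols = ["low", "medium", "high"]     # one symbol per interval

def discretize_f(value):
    """Return the symbol of the interval (cuts[i], cuts[i+1]] containing value."""
    for lo, hi, sym in zip(cuts, cuts[1:], symbols):
        if lo < value <= hi:
            return sym
    return None  # outside the discretized range

F = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
print([discretize_f(v) for v in F])
# ['low', 'low', 'low', 'low', 'low', 'medium', 'medium', 'medium', 'medium',
#  'high', 'high', 'high']
```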


References
– Jiawei Han and Micheline Kamber, “Data Mining: Concepts and
Techniques”, Morgan Kaufmann Publishers, August 2000 (Chapter 3).

– Mehmed Kantardzic, “Data Mining: Concepts, Models, Methods, and
Algorithms”, John Wiley & Sons, 2003 (Chapter 3).
