Spring 2007
• Noisy data
• Data discretization using entropy-based and ChiMerge methods
Noisy Data
• Noise: random error; the data is present but not correct.
– Data Transmission error
– Data Entry problem
• Removing noise
– Data Smoothing (rounding, averaging within a window).
– Clustering/merging and Detecting outliers.
• Data Smoothing
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using smoothing by bin means,
smoothing by bin medians, smoothing by bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
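The binning example above can be reproduced with a short Python sketch. The function names are illustrative and not from the slides. Means are rounded to the nearest dollar, as in the example, and a value equally close to both boundaries goes to the lower one, which is an assumption.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    data = sorted(values)
    depth = len(data) // n_bins  # assumes len(values) is divisible by n_bins
    return [data[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
means = smooth_by_means(bins)
boundaries = smooth_by_boundaries(bins)
print(bins)
print(means)
print(boundaries)
```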
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.
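As a rough illustration of this idea, the sketch below groups sorted one-dimensional values whenever consecutive values lie within a gap threshold, then flags values that end up in very small groups. This gap-based grouping is only a simple stand-in for a real clustering algorithm. The function names, the `max_gap` and `min_size` parameters, and the sample values are all hypothetical.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; a gap larger than max_gap starts a new cluster.

    Assumes values is non-empty.
    """
    data = sorted(values)
    clusters = [[data[0]]]
    for v in data[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def outliers(values, max_gap, min_size=2):
    """Return the values that fall in clusters smaller than min_size."""
    return [v for c in gap_clusters(values, max_gap) if len(c) < min_size for v in c]

# Hypothetical values: two dense groups and one isolated point.
sample = [21, 22, 24, 25, 27, 60, 62, 63, 150]
clusters = gap_clusters(sample, max_gap=5)
flagged = outliers(sample, max_gap=5)
print(clusters)
print(flagged)
```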
• There are 6 possible boundaries, each the midpoint between two
adjacent distinct values, e.g. (21+22) / 2 = 21.5. They are 21.5, 23, 24.5,
26, 31, and 38.
• The entropy of S2 is
$$\mathrm{Ent}(S_2) = -P(\text{Grade F}) \log_2 P(\text{Grade F}) - P(\text{Grade P}) \log_2 P(\text{Grade P})$$
$$= -\tfrac{2}{8} \log_2 \tfrac{2}{8} - \tfrac{6}{8} \log_2 \tfrac{6}{8}$$
Example 1 (cont.)
• Hence, the entropy after partitioning at T = 21.5 is
$$E(S, T) = \frac{|S_1|}{|S|} \mathrm{Ent}(S_1) + \frac{|S_2|}{|S|} \mathrm{Ent}(S_2)$$
$$= \frac{1}{9} \mathrm{Ent}(S_1) + \frac{8}{9} \mathrm{Ent}(S_2)$$
...
Example 1 (cont.)
• The entropies after partitioning for all the boundaries are:
T = 21.5: E(S, 21.5)
T = 23: E(S, 23)
.
.
T = 38: E(S, 38)
• Choose the boundary with the lowest E(S, T) as the split, then
recursively apply entropy discretization to both partitions.
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
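The steps of Example 1 can be checked with a small Python sketch over the Age/Grade table above. The function names are illustrative. Treating values less than or equal to T as S1 and the rest as S2 is an assumption about how the split is applied.

```python
from math import log2

ages = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]

def entropy(labels):
    """Ent(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def candidate_boundaries(values):
    """Midpoints between adjacent distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def partition_entropy(values, labels, t):
    """E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
    s1 = [l for v, l in zip(values, labels) if v <= t]
    s2 = [l for v, l in zip(values, labels) if v > t]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

boundaries = candidate_boundaries(ages)
scores = {t: partition_entropy(ages, grades, t) for t in boundaries}
best = min(scores, key=scores.get)
for t in boundaries:
    print(f"T = {t}: E(S, T) = {scores[t]:.3f}")
print("lowest-entropy boundary:", best)
```

On this table the sketch gives Ent(S2) of about 0.811 and E(S, 21.5) of about 0.721, and it reports T = 26 as the boundary with the lowest E(S, T), about 0.361.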
ChiMerge (Kerber92)
• This discretization method uses a merging approach.
• ChiMerge’s view:
– First sort the data on the attribute that is being
discretized.
– List all possible boundaries or intervals. In the last
example the boundaries were 0, 21.5, 23, 24.5, 26, 31,
and 38 (the 6 midpoints plus 0 as the lower bound).
– For every pair of adjacent intervals, compute the χ² test
of class independence.
• {0,21.5} and {21.5, 23}
• {21.5, 23} and {23, 24.5}
• .
• .
– Pick the pair of adjacent intervals with the lowest χ² value and merge them.
ChiMerge -- The Algorithm
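Compute the χ² value for each pair of adjacent intervals:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where o_ij is the observed count of class j in interval i, and
e_ij = R_i × C_j / N is the corresponding expected count.
        Class 1   Class 2   Σ
Int 1   o11       o12       R1
Int 2   o21       o22       R2
Σ       C1        C2        N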
Sample   Value   Class K
1        1       1
2        3       2
3        7       1
4        8       1
5        9       1
6        11      2
7        23      2
8        37      1
9        39      2
10       45      1
11       46      1
12       59      1
For d = 1 degree of freedom, χ² = 0.0 < 2.706 (MERGE!)
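The merging loop can be sketched in Python as follows. This is a minimal illustration, not Kerber's reference implementation. It assumes that the second column of the table above is the attribute value and the third is the class K, that each distinct value starts as its own interval, that cells with an expected count of 0 are skipped (which yields χ² = 0.0 for two intervals of the same class), and that ties are broken by taking the first lowest pair. The function names are hypothetical.

```python
def chi2(rows):
    """Chi-square statistic for a table of class counts (one row per interval)."""
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    n = sum(row_tot)
    total = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # assumption: skip cells whose expected count is 0
                total += (observed - expected) ** 2 / expected
    return total

def chimerge(values, classes, threshold):
    """Merge adjacent intervals until every adjacent pair has chi2 >= threshold."""
    labels = sorted(set(classes))
    counts = {}
    for v, c in zip(values, classes):
        counts.setdefault(v, [0] * len(labels))[labels.index(c)] += 1
    # One initial interval per distinct value: [lower bound, class counts].
    intervals = [[v, counts[v]] for v in sorted(counts)]
    while len(intervals) > 1:
        scores = [chi2([a[1], b[1]]) for a, b in zip(intervals, intervals[1:])]
        i = min(range(len(scores)), key=scores.__getitem__)  # first lowest pair
        if scores[i] >= threshold:
            break
        merged = [x + y for x, y in zip(intervals[i][1], intervals[i + 1][1])]
        intervals[i:i + 2] = [[intervals[i][0], merged]]
    return intervals

# Assumed reading of the table: second column = value, third column = class K.
values = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
classes = [1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1]
result = chimerge(values, classes, threshold=2.706)
for lower, class_counts in result:
    print(f"interval starting at {lower}: class counts {class_counts}")
```

Under these assumptions the loop stops with three intervals, starting at the values 1, 11, and 45.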
ChiMerge Example (cont.)
Additional Iterations:
K=1 K=2
E11 = 2.78
E12 =2.22
E21 = 2.22
E22 = 1.78
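The expected counts listed above are consistent with e_ij = R_i × C_j / N for a 2×2 table whose observed counts are 4 and 1 in the first interval and 1 and 3 in the second. That observed table is an assumption made for illustration, since it is not shown in the text. The sketch below recomputes the expected counts and the χ² value under it.

```python
# Assumed observed counts (rows = adjacent intervals, columns = classes K=1, K=2).
observed = [[4, 1],
            [1, 3]]

row_totals = [sum(row) for row in observed]        # R1, R2
col_totals = [sum(col) for col in zip(*observed)]  # C1, C2
n = sum(row_totals)                                # N

# e_ij = R_i * C_j / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2_value = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

for i, row in enumerate(expected, start=1):
    for j, e in enumerate(row, start=1):
        print(f"E{i}{j} = {e:.2f}")
print(f"chi2 = {chi2_value:.2f}")
```

Under this assumption χ² is about 2.72, just above the 2.706 threshold, so these two intervals would not be merged.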