Data Preprocessing
Data Quality
Data quality is a major concern for data mining tasks
Why: almost all data mining algorithms induce knowledge strictly from data
The quality of the knowledge extracted depends heavily on the quality of the data
There are two main problems in data quality:
Missing data: the data is not present
Noisy data: the data is present but not correct
Hardware failure
Data transmission error
Data entry problem
Refusal of respondents to answer certain questions
Data Mining
Training data
For {attribute#2, attribute#3}, the value {cool, high} appears in only 2 records
Probability of imputing value (20) = 100%
Figure labels: missing value record; other dataset records
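One way to read the imputation example above is sketched below in Python: find the training records that agree with the incomplete record on the known attributes, then estimate the probability of each candidate value from those matches. The function name `imputation_probabilities`, the attribute names, and the sample records are all hypothetical and are not taken from the slides.

```python
from collections import Counter

def imputation_probabilities(records, target, match_on, incomplete):
    """Return {value: probability} for `target`, estimated from the records
    that agree with `incomplete` on every attribute in `match_on`."""
    matches = [
        r for r in records
        if r.get(target) is not None
        and all(r.get(a) == incomplete.get(a) for a in match_on)
    ]
    counts = Counter(r[target] for r in matches)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

# Hypothetical training data; names and values are illustrative only.
training = [
    {"attribute1": 20, "attribute2": "cool", "attribute3": "high"},
    {"attribute1": 20, "attribute2": "cool", "attribute3": "high"},
    {"attribute1": 35, "attribute2": "hot", "attribute3": "high"},
    {"attribute1": 28, "attribute2": "mild", "attribute3": "normal"},
]
record = {"attribute1": None, "attribute2": "cool", "attribute3": "high"}

probs = imputation_probabilities(
    training, "attribute1", ["attribute2", "attribute3"], record
)
# Impute the most probable value, or leave the gap if nothing matched.
imputed = max(probs, key=probs.get) if probs else None
```

In this toy data both matching records carry the value 20, so it is the only candidate and receives probability 1.0.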
Noisy Data
Noise: random error; the data is present but not correct
Data Transmission error
Data Entry problem
Removing noise
Data smoothing (rounding, averaging within a window).
Clustering/merging and detecting outliers.
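The two smoothing options named above, rounding and averaging within a window, can be sketched in Python as follows. The centered window, the edge handling (the window is truncated at the ends), and the sample readings are assumptions made for illustration.

```python
def moving_average(values, window=3):
    """Smooth `values` by replacing each point with the mean of the points
    inside a centered window (the window is truncated at the edges)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighbourhood = values[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed

# Hypothetical readings with one noisy spike (50).
readings = [10, 12, 11, 50, 12, 11, 10]

# Averaging within a window of 3 spreads the spike over its neighbours.
smoothed = moving_average(readings, window=3)

# Rounding to the nearest 10 removes small fluctuations but keeps the spike.
rounded = [round(v, -1) for v in readings]
```

Neither option removes the spike outright, which is why outlier detection is listed as a separate step.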
buys_computer: no, no, no, no
age  buys_computer
31   ?
33   ?
32   ?
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
$$E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$
The boundary that minimizes the entropy function
over all possible boundaries is selected as a binary
discretization.
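A minimal Python sketch of the boundary search follows. It assumes the usual size-weighted entropy of the two intervals, places S1 at values less than or equal to T, and tries the midpoints between consecutive distinct values as candidate boundaries. These choices, the function names, and the toy data are assumptions and are not stated in the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels. Classes with zero
    count never appear in the Counter, matching the 0*log2(0) = 0 convention."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition_entropy(values, labels, boundary):
    """Size-weighted entropy after splitting into S1 (value <= boundary)
    and S2 (value > boundary)."""
    s1 = [l for v, l in zip(values, labels) if v <= boundary]
    s2 = [l for v, l in zip(values, labels) if v > boundary]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

def best_boundary(values, labels):
    """Return (boundary, entropy) for the candidate with minimum entropy,
    or None when all values are equal and no split is possible."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        return None
    scored = ((t, partition_entropy(values, labels, t)) for t in candidates)
    return min(scored, key=lambda pair: pair[1])

# Hypothetical data: two well-separated groups with different labels.
result = best_boundary([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

Applying `best_boundary` again to each resulting interval gives the recursive form of the procedure.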
The process is recursively applied to partitions
obtained until some stopping criterion is met, e.g.,
Example (cont)
Columns: ID, Age, Grade
Age values: 21, 22, 24, 25, 27, 27, 27, 35, 41
The entropy of S1 is
$$\mathrm{Ent}(S_1) = -P(\text{Grade}=F)\log_2 P(\text{Grade}=F) - P(\text{Grade}=P)\log_2 P(\text{Grade}=P)$$
$$= -(1)\log_2(1) - (0)\log_2(0)$$
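The expression above can be evaluated if one assumes the usual convention that $0 \log_2 0 = 0$, which the slide does not state explicitly. Since $\log_2(1) = 0$, both terms vanish:

$$\mathrm{Ent}(S_1) = -(1)(0) - 0 = 0$$

An entropy of zero means every sample in S1 has the same grade.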
The entropy of S2 is