Data Cleaning: How to Catch Noisy Data
• Manually check all data: tedious and usually infeasible
• Sort data by frequency
– ‘green’ is more frequent than ‘rgeen’
– Works well for categorical data
• Use numerical constraints to catch corrupt data (see the sketch after this list)
– Weight can’t be negative
– People can’t have more than 2 parents
– Salary can’t be less than Birr 300
• Use statistical techniques to catch corrupt data
– Check for outliers (the case of the 8-meter man)
– Check for correlated outliers using n-grams (“pregnant males”)
• People can be male
• People can be pregnant
• People can’t be male AND pregnant
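To make these checks concrete, here is a minimal Python sketch of rule-based and statistical screening; the record fields, thresholds, and height values are illustrative assumptions, not data from these slides.

import statistics

records = [
    {"name": "A", "weight_kg": 72, "salary_birr": 4500},
    {"name": "B", "weight_kg": -5, "salary_birr": 250},   # corrupt record
]

def constraint_violations(rec):
    # Rule-based checks mirroring the constraints listed above
    problems = []
    if rec["weight_kg"] < 0:
        problems.append("weight cannot be negative")
    if rec["salary_birr"] < 300:
        problems.append("salary cannot be less than Birr 300")
    return problems

for rec in records:
    print(rec["name"], constraint_violations(rec))

# Statistical check: flag heights far from the mean (the "8 meters man")
heights = [1.65, 1.72, 1.80, 1.68, 1.75, 1.70, 1.62, 1.78, 8.0]
mu, sigma = statistics.mean(heights), statistics.stdev(heights)
outliers = [h for h in heights if abs(h - mu) > 2 * sigma]
print("height outliers:", outliers)   # [8.0]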
Data Cleaning: Redundancy
• Duplicate or redundant data are data problems that require data cleaning
• What’s wrong here?
[Figure: sampling a raw data set with SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement)]
Sampling: Cluster or Stratified Sampling
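As a small illustration of the sampling schemes above, the following plain-Python sketch draws SRSWOR, SRSWR, and a stratified sample; the data values and strata are made up for the example.

import random

raw_data = list(range(1, 101))   # stand-in for the raw data set
random.seed(42)                  # reproducible for the example

# SRSWOR: simple random sampling WITHOUT replacement (no repeats)
srswor = random.sample(raw_data, k=10)

# SRSWR: simple random sampling WITH replacement (items may repeat)
srswr = [random.choice(raw_data) for _ in range(10)]

# Stratified sampling: draw about 10% from each stratum
strata = {
    "youth": list(range(1, 41)),
    "adult": list(range(41, 81)),
    "senior": list(range(81, 101)),
}
stratified = {name: random.sample(group, k=max(1, len(group) // 10))
              for name, group in strata.items()}

print(srswor)
print(srswr)
print(stratified)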
Data Reduction: Data Compression
[Figure: Original Data → lossless compression → Compressed Data; Original Data → lossy compression → Approximated Data]
• Lossless and lossy compression have become part of our
everyday vocabulary largely due to the popularity of
– MP3 music files.
• Compare it with WAV files
– JPEG image files
• Compare it with GIF files
– MP4 or MPEG video files
• Compare it with AVI files
– Ex. (z-score normalization): let μ = 54,000 and σ = 16,000. Then v = 73,600 is transformed to (73,600 − 54,000) / 16,000 = 1.225
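Written out as a tiny helper, the same z-score transformation looks like this (the values are the ones from the example above).

def z_score(v, mu, sigma):
    # Zero-mean (z-score) normalization: v' = (v - mu) / sigma
    return (v - mu) / sigma

# Example above: mean 54,000, standard deviation 16,000
print(z_score(73_600, 54_000, 16_000))   # 1.225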
Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
– This is the most straightforward approach, but outliers may dominate the presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Cont...
Equal-width (distance) partitioning:
▶ It divides the range into N intervals of equal size: uniform grid
▶ if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
▶ Given the data set (say 24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25)
▶ Determine the number of bins: N (say 3)
▶ Determine the range R = Max (B) − Min (A)
▶ For the above data, R = 34 − 4 = 30
▶ W = R/3 = 30/3 = 10, so the bin boundaries are X1 = 14, X2 = 24, and X3 = 34
▶ First sort the data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
▶ Therefore, in our case, Bin 1 = 4, 8, 9; Bin 2 = 15, 21, 21; Bin 3 = 24, 25, 26, 28, 29, 34
• Outliers may dominate the presentation
• Skewed data may not be handled well.
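A minimal plain-Python sketch of equal-width binning on the same data set; the function name is just an illustrative choice.

def equal_width_bins(values, n_bins):
    # Partition values into n_bins intervals of equal width W = (max - min) / n_bins
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Index of the interval containing v; clamp the maximum into the last bin
        idx = min(int((v - lo) // width), n_bins - 1)
        bins[idx].append(v)
    return bins

data = [24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25]
print(equal_width_bins(data, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]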
Cont...
Equal-depth (frequency) partitioning:
▶ It divides the range into N intervals, each containing approximately the same number of samples
▶ Given the data set (say 24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25)
▶ Determine the number of bins: N (say 3)
▶ Determine the total number of values F in the data set (here F = 12)
▶ Determine the number of samples per bin: F/N (12/3 = 4)
▶ Sort the data as 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 and place F/N elements, in order, into each bin
▶ Therefore, in our case, Bin 1 = 4, 8, 9, 15; Bin 2 = 21, 21, 24, 25; Bin 3 = 26, 28, 29, 34
▶ Good data scaling
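A matching sketch for equal-depth (frequency) binning of the same data; it assumes, as in the example, that the number of values divides evenly by the number of bins.

def equal_depth_bins(values, n_bins):
    # Sort the values and place len(values) / n_bins elements into each bin
    ordered = sorted(values)
    per_bin = len(ordered) // n_bins
    return [ordered[i * per_bin:(i + 1) * per_bin] for i in range(n_bins)]

data = [24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25]
print(equal_depth_bins(data, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]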
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
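The two smoothing rules above can be sketched as follows; rounding bin means to the nearest integer mirrors the slide's values, and sending ties to the lower boundary is an assumption.

def smooth_by_means(bins):
    # Replace every value in a bin by the bin mean (rounded to the nearest integer)
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer bin boundary; ties go to the lower boundary
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]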
Binning
• Attribute values (for one attribute e.g., age):
– 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning – for a bin width of, e.g., n = 10:
– Bin 1: 0, 4 [-,10) bin
– Bin 2: 12, 16, 16, 18 [10,20) bin
– Bin 3: 24, 26, 28 [20,+) bin
– (− denotes negative infinity, + positive infinity)
• Equi-frequency binning – for a bin density of, e.g., n = 3:
– Bin 1: 0, 4, 12 [-, 14) bin
– Bin 2: 16, 16, 18 [14, 21) bin
– Bin 3: 24, 26, 28 [21, +) bin
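A short sketch of the same age example; note that, for simplicity, the open-ended first and last intervals on the slide are written here as finite [lower, upper) labels.

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# Equi-width: fixed bin width of 10, labelled [lower, upper)
width = 10
equi_width = {}
for a in ages:
    lower = (a // width) * width
    equi_width.setdefault(f"[{lower},{lower + width})", []).append(a)

# Equi-frequency: fixed bin density of 3 values per bin
density = 3
ordered = sorted(ages)
equi_freq = [ordered[i:i + density] for i in range(0, len(ordered), density)]

print(equi_width)   # {'[0,10)': [0, 4], '[10,20)': [12, 16, 16, 18], '[20,30)': [24, 26, 28]}
print(equi_freq)    # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]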
Concept Hierarchy Generation
• A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
– Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
[Figure: location hierarchy – street < sub city < city < region or state < country]
• Concept hierarchy can be automatically formed by the analysis of
the number of distinct values. E.g., for a set of attributes: {street,
city, state, country}
For numeric data, use discretization methods.
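A small sketch of that automatic analysis: attributes are ordered by their number of distinct values, and the attribute with the most distinct values is placed at the bottom (most specific level) of the hierarchy. The counts below are illustrative assumptions, not figures from this document.

# Hypothetical distinct-value counts for each attribute
distinct_counts = {"street": 674_339, "city": 3_567, "state": 365, "country": 15}

# Fewer distinct values -> higher (more general) level of the hierarchy
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))   # street < city < state < country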