Data cleaning
Data integration and transformation
Data reduction
Concept Hierarchy Generation
Discretization
January 21, 2024 Data Mining: Data Preprocessing 1
Data Reduction
Data reduction
Obtains a reduced representation of the data
set that is much smaller in volume yet
produces the same (or almost the same)
analytical results
Data reduction strategies
Dimensionality reduction
Numerosity reduction
Concept hierarchy generation
Discretization
Dimensionality Reduction
[Figure: attribute-selection tree with test nodes A4?, A1?, A6?]
[Figure: lossy compression, Original Data vs. Approximated]
Parametric methods
Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
Example: Regression
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
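The parametric idea above (fit a model, keep only its parameters, discard the data) can be sketched with simple linear regression. This is a minimal illustration, not the slides' own method: the function names `fit_line` and `predict` and the sample points are hypothetical, and ordinary least squares is assumed as the fitting procedure.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Approximate a value from the stored parameters only."""
    slope, intercept = params
    return slope * x + intercept


# Hypothetical data: five (x, y) points are reduced to two numbers.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
params = fit_line(xs, ys)
print(params, predict(params, 10))
```

After fitting, only `params` needs to be stored; points that deviate strongly from the line (possible outliers) would have to be kept separately.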
Histograms
A popular data reduction technique
Divide data into buckets and store average or sum for each bucket / bin.
[Figure: histogram; x-axis ticks 10000 to 90000, y-axis ticks 0 to 40]
Example:
The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted.
1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15,
15, 15, 18, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 21, 21, 21,
21, 22, 22, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
How are the buckets determined and the attribute values partitioned?
By:
a. Equi-width (Equal Width) Histogram, Number of buckets = 5
b. Equi-depth (Equal Depth) Histogram, Number of buckets = 5
c. MaxDiff Histogram, Number of buckets = 3
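One possible way to work the three cases above on the price list is sketched below. Several choices are assumptions that the slides do not fix: the equi-width bucket width is taken as (max - min) / number of buckets, the equi-depth version simply cuts the sorted list into near-equal chunks (so equal values may land in adjacent buckets), and the MaxDiff version places boundaries at the largest gaps between adjacent distinct values, breaking ties in favor of the earliest gap. The function names are hypothetical.

```python
prices = [1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 19, 19, 19,
          20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]


def equi_width(values, k):
    """k buckets, each covering (max - min) / k of the value range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [list(values)]
    buckets = [[] for _ in range(k)]
    for v in values:
        # The maximum value would index bucket k, so clamp it to k - 1.
        idx = min(int((v - lo) * k // (hi - lo)), k - 1)
        buckets[idx].append(v)
    return buckets


def equi_depth(values, k):
    """k buckets holding (nearly) the same number of values."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    buckets, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        buckets.append(ordered[start:start + size])
        start += size
    return buckets


def max_diff(values, k):
    """Boundaries at the k - 1 largest gaps between adjacent distinct values."""
    distinct = sorted(set(values))
    gaps = [(distinct[i + 1] - distinct[i], i) for i in range(len(distinct) - 1)]
    # Assumption: ties between equal gaps go to the earliest position.
    chosen = sorted(gaps, key=lambda g: (-g[0], g[1]))[:k - 1]
    cuts = sorted(distinct[i] for _, i in chosen)  # last value of each bucket
    buckets = [[] for _ in range(k)]
    for v in sorted(values):
        buckets[sum(v > c for c in cuts)].append(v)
    return buckets


for name, result in [("equi-width", equi_width(prices, 5)),
                     ("equi-depth", equi_depth(prices, 5)),
                     ("MaxDiff", max_diff(prices, 3))]:
    # Store only a summary (range, count, sum) per bucket.
    print(name, [(b[0], b[-1], len(b), sum(b)) for b in result if b])
```

Because several adjacent gaps in this list are tied at 3, the MaxDiff result depends entirely on the tie-breaking choice; a different rule would give different buckets.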
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
Stratified Sampling
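The three sampling schemes named here can be sketched with Python's standard `random` module. This is an illustration under assumptions: the helper names, the record layout `(stratum, id)`, and the proportional allocation used for the stratified sample are all hypothetical choices, not taken from the slides.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample with replacement: items may repeat."""
    return rng.choices(data, k=n)


def stratified(records, key, fraction, rng):
    """Draw the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, size))
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(100))
print(srswor(data, 5, rng))
print(srswr(data, 5, rng))

# Hypothetical skewed data: 80 "young" and 20 "senior" records.
records = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
print(len(stratified(records, key=lambda r: r[0], fraction=0.1, rng=rng)))
```

Sampling each stratum separately keeps small groups represented even when the data are skewed, which a plain random sample does not guarantee.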
Discretization
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Multi-way split: use as many partitions as distinct values
Binary split: divides values into two subsets
Need to find optimal partitioning
Preserve order property among attribute values
Jan 21, 2024 Data Mining: Classification 33
Splitting Based on Continuous Attributes
Different ways of handling
Discretization to form an ordinal categorical attribute
Static – discretize once at the beginning
Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), clustering, etc.
Binary Decision: (A < v) or (A ≥ v)
Considers all possible splits and finds the best cut
Can be more compute-intensive
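The binary decision above can be sketched as an exhaustive search over candidate cut points. The slides do not say here how "best" is measured; this sketch assumes weighted class entropy as the criterion and uses midpoints between adjacent distinct values as candidates. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Try every midpoint v and score the partition (A < v) vs (A >= v)."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_cut, best_score = None, float("inf")
    for left_v, right_v in zip(distinct, distinct[1:]):
        cut = (left_v + right_v) / 2
        left = [lab for v, lab in pairs if v < cut]
        right = [lab for v, lab in pairs if v >= cut]
        # Weighted average entropy of the two sides; lower is better.
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# Hypothetical attribute values with two cleanly separated classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_binary_split(values, labels))
```

Each candidate cut rescans the data, which is why this approach costs more than discretizing once up front.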
[Figure: 3-4-5 rule example; Step 3 interval (-$1,000 - $2,000], Step 4 interval (-$400 - $5,000]]
Step 3: Since this interval ranges over 3 distinct values at the most significant
digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is
partitioned into 3 equi-width subsegments according to the 3-4-5 rule:
(-$1,000,000…$0], ($0…$1,000,000], and ($1,000,000…$2,000,000].
This represents the top tier of the hierarchy.
[Figure: 3-4-5 rule hierarchy, with labels Low, High, and msd; intervals (-$1,000 - $2,000] and (-$400 - $5,000]]
Step 4: (-$400 - $5,000] is partitioned into (-$400 - $0], ($0 - $1,000], ($1,000 - $2,000], ($2,000 - $5,000]
Step 5: each interval is partitioned further:
(-$400 - $0]: (-$400 - -$300], (-$300 - -$200], (-$200 - -$100], (-$100 - $0]
($0 - $1,000]: ($0 - $200], ($200 - $400], ($400 - $600], ($600 - $800], ($800 - $1,000]
($1,000 - $2,000]: ($1,000 - $1,200], ($1,200 - $1,400], ($1,400 - $1,600], ($1,600 - $1,800], ($1,800 - $2,000]
($2,000 - $5,000]: ($2,000 - $3,000], ($3,000 - $4,000], ($4,000 - $5,000]
Similarly, the 3-4-5 rule can be carried on iteratively at deeper levels, as necessary.
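The Step 3 computation can be sketched as a single-level partition function. Only the case of 3 distinct values appears in the text above; the remaining cases in the code follow the rule as it is commonly stated (3, 6, 7, or 9 distinct values give 3 intervals, with 7 grouped 2-3-2; 2, 4, or 8 give 4 intervals; 1, 5, or 10 give 5 intervals) and should be read as an assumption. The function names and the Low/High inputs are illustrative and not given in the slides.

```python
import math


def msd_unit(low, high):
    """Place value of the most significant digit of the larger magnitude."""
    largest = int(max(abs(low), abs(high)))
    if largest < 1:
        raise ValueError("range too small for this sketch")
    return 10 ** (len(str(largest)) - 1)


def partition_345(low, high):
    """One level of the 3-4-5 rule; returns (lower, upper] interval pairs."""
    unit = msd_unit(low, high)
    lo = math.floor(low / unit) * unit   # round Low down at the msd
    hi = math.ceil(high / unit) * unit   # round High up at the msd
    distinct = (hi - lo) // unit         # distinct values at the msd
    if distinct == 7:
        # Assumed 2-3-2 grouping for 7 distinct values.
        edges = [lo, lo + 2 * unit, lo + 5 * unit, hi]
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"no rule for {distinct} distinct values")
        edges = [lo + (hi - lo) * i / parts for i in range(parts + 1)]
    return list(zip(edges, edges[1:]))


# Illustrative inputs that round to (-1,000,000, 2,000,000] at the msd.
for lower, upper in partition_345(-159_876, 1_838_761):
    print(f"({lower:,.0f} ... {upper:,.0f}]")
```

Applying the same function to each resulting interval would produce the deeper levels; the boundary adjustment of Step 4 against the actual minimum and maximum is not covered by this sketch.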