Data Preprocessing: Data Cleaning, Data Integration and Transformation
■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization
August 26, 2020 Data Mining: Data Preprocessing 1
Data Reduction
■ Data reduction
⚪ Obtains a reduced representation of the data
set that is much smaller in volume yet
produces the same (or almost the same)
analytical results
■ Data reduction strategies
⚪ Dimensionality reduction
⚪ Numerosity reduction
⚪ Concept hierarchy generation
⚪ Discretization
Dimensionality Reduction
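[Figure: decision tree with test nodes A4?, A1?, and A6?]
[Figure: data compression. Original Data is reduced to Compressed Data; lossless compression recovers the Original Data, lossy compression recovers only an approximation of it.]
Numerosity Reduction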
■ Parametric methods
⚪ Assume the data fits some model, estimate
model parameters, store only the parameters,
and discard the data (except possible outliers)
⚪ Example: Regression
■ Non-parametric methods
⚪ Do not assume models
⚪ Major families: histograms, clustering, sampling
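To make the parametric idea concrete, the sketch below fits a straight line by ordinary least squares and keeps only the two fitted parameters instead of the data points. The function name `fit_line` and the sample values are illustrative assumptions, not part of the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that happens to lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

# Only these two numbers need to be stored; the points can be discarded.
slope, intercept = fit_line(xs, ys)

# Any original value can later be approximated from the parameters.
approx = [slope * x + intercept for x in xs]
```

With real data the reconstruction is only approximate, which is why points that the model fits poorly (possible outliers) may be stored separately.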
Histograms
Example:
The following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted.
1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22,
22, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
How are the buckets determined and the attribute values partitioned?
By:
a. Equi-width (Equal Width) Histogram, Number of buckets = 5
b. Equi-depth (Equal Depth) Histogram, Number of buckets = 5
c. MaxDiff Histogram, β = 3
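As a rough sketch, the three bucketing schemes could be computed for the price list above as follows. The function names are illustrative. Two simplifications are assumed: the equi-depth version splits the sorted list by position (so equal values may straddle a boundary), and the MaxDiff version places boundaries at the β − 1 largest gaps between adjacent distinct values, breaking ties in favour of the earliest gap.

```python
prices = [1, 1, 4, 4, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 19, 19, 19,
          20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]


def equi_width(values, k):
    """Count values in k buckets of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value would land in bucket k, so clamp it to k - 1.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    return counts


def equi_depth(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    buckets, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        buckets.append(ordered[start:start + size])
        start += size
    return buckets


def max_diff(values, beta):
    """Cut at the beta - 1 largest gaps between adjacent distinct values."""
    distinct = sorted(set(values))
    gaps = [(distinct[i + 1] - distinct[i], i)
            for i in range(len(distinct) - 1)]
    # Largest gap first; ties are broken by position (an assumption).
    chosen = sorted(gaps, key=lambda g: (-g[0], g[1]))[:beta - 1]
    cut_after = sorted(distinct[i] for _, i in chosen)
    buckets = [[] for _ in range(beta)]
    for v in sorted(values):
        buckets[sum(v > c for c in cut_after)].append(v)
    return buckets


width_counts = equi_width(prices, 5)                       # [7, 7, 14, 14, 10]
depth_sizes = [len(b) for b in equi_depth(prices, 5)]      # [11, 11, 10, 10, 10]
maxdiff_sizes = [len(b) for b in max_diff(prices, 3)]      # [2, 5, 45]
```

In this list five adjacent gaps tie at the largest difference of 3, so the MaxDiff buckets depend entirely on the tie-breaking rule; a different convention would give different boundaries.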
[Figure: sampling from Raw Data by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
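The two simple random sampling schemes can be sketched with Python's standard `random` module. The helper names `srswor` and `srswr` and the toy data are assumptions made for illustration.

```python
import random


def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: no tuple is drawn twice."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return rng.choices(data, k=s)


rng = random.Random(0)          # fixed seed so the run is repeatable
raw_data = list(range(100))     # hypothetical tuples T0 .. T99

without_repl = srswor(raw_data, 10, rng)
with_repl = srswr(raw_data, 10, rng)
```

Without replacement the sample size cannot exceed the size of the data set; with replacement it can, because every draw is made from the full data set.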
Stratified Sampling
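Stratified sampling partitions the data into disjoint groups (strata) and draws a simple random sample from each one, so that small groups are still represented. A minimal sketch, assuming a proportional allocation with at least one tuple per stratum; the function name, the `key` callback, and the age-group records are hypothetical:

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction from every stratum, without replacement."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Keep at least one tuple so that no stratum disappears.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical customers: 80 in one age group, 20 in another.
records = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1,
                           rng=random.Random(0))
```

Here the sample keeps the 80/20 proportion of the strata, which a plain random sample of the same size would only match on average.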
■ Data cleaning
■ Data integration and transformation
■ Data reduction
■ Concept Hierarchy Generation
■ Discretization
■ Histogram analysis
■ Clustering analysis
■ Entropy-based discretization
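As an illustration of the entropy-based approach, the sketch below evaluates every candidate boundary of a numeric attribute and keeps the one whose two partitions have the lowest weighted class entropy. Applying the same step recursively to each partition yields a full discretization. The function names and the toy data are assumptions, and no stopping criterion is included.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (boundary, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        # A boundary is only possible between two different values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or info < best[1]:
            best = (boundary, info)
    return best


# Hypothetical attribute values with two perfectly separable classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
boundary, info = best_split(values, labels)
```

For this toy data the chosen boundary lies midway between 3 and 10, where both partitions are pure.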
Step 3: Since this interval ranges over 3 distinct values at the most significant
digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is
partitioned into 3 equi-width subsegments according to the 3-4-5 rule:
(-$1,000,000…$0], ($0…$1,000,000], and ($1,000,000…$2,000,000].
This represents the top tier of the hierarchy.
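The partitioning in Step 3 can be sketched as a small function. It assumes the usual statement of the 3-4-5 rule: 3, 6, 7, or 9 distinct values at the most significant digit give 3 intervals (7 is grouped 2-3-2), 2, 4, or 8 give 4 intervals, and 1, 5, or 10 give 5 intervals. The name `partition_345` is illustrative, and `low` and `high` are assumed to be already rounded to the most significant digit `msd`.

```python
def partition_345(low, high, msd):
    """Split (low, high] into sub-intervals according to the 3-4-5 rule."""
    distinct = round((high - low) / msd)
    if distinct == 7:
        # Special case: three intervals grouped as 2-3-2 units of msd.
        cuts = [low, low + 2 * msd, low + 5 * msd, high]
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"3-4-5 rule does not cover {distinct} distinct values")
        width = (high - low) / parts
        cuts = [low + i * width for i in range(parts + 1)]
    # Each pair (a, b) stands for the half-open interval (a, b].
    return list(zip(cuts[:-1], cuts[1:]))


# Step 3 of the example: 3 distinct values, so 3 equi-width intervals.
top_tier = partition_345(-1_000_000, 2_000_000, 1_000_000)
# (-1,000,000, 0], (0, 1,000,000], (1,000,000, 2,000,000]
```

Calling the same function on each resulting interval, with the most significant digit of that interval, produces the deeper tiers of the hierarchy.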
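[Figure: 3-4-5 rule hierarchy. Amounts are as printed in the figure, which are scaled down by a factor of 1,000 relative to the Step 3 text.]
Step 3: (-$1,000 - $2,000]
Step 4: (-$400 - $5,000]
Step 5: (-$400 - $0] ($0 - $1,000] ($1,000 - $2,000] ($2,000 - $5,000]
(-$400 - $0]: (-$400 - -$300], (-$300 - -$200], (-$200 - -$100], (-$100 - $0]
($0 - $1,000]: ($0 - $200], ($200 - $400], ($400 - $600], ($600 - $800], ($800 - $1,000]
($1,000 - $2,000]: ($1,000 - $1,200], ($1,200 - $1,400], ($1,400 - $1,600], ($1,600 - $1,800], ($1,800 - $2,000]
($2,000 - $5,000]: ($2,000 - $3,000], ($3,000 - $4,000], ($4,000 - $5,000]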
Similarly, the 3-4-5 rule can be carried on iteratively at deeper levels, as necessary.