— Chapter 2 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
January 16, 2022 Data Mining: Concepts and Techniques 1
Chapter 2: Data Preprocessing
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and accessibility
Motivation
To better understand the data: central tendency, variation
and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
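The three measures of central tendency can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `trim_count` parameter (how many values to chop from each end), and the sample data are all assumptions made for the example.

```python
def sample_mean(values):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def trimmed_mean(values, trim_count):
    """Mean after chopping trim_count values from each end (hypothetical parameter)."""
    if trim_count < 0 or 2 * trim_count >= len(values):
        raise ValueError("trim_count must leave at least one value")
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [2, 4, 4, 5, 100]
print(sample_mean(data))       # pulled upward by the extreme value 100
print(trimmed_mean(data, 1))   # drops 2 and 100 before averaging
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

The trimmed mean is less sensitive to the extreme value than the plain mean, which is the reason for chopping.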
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
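The population variance can be computed either as the mean squared deviation from the mean or as the mean of the squares minus the squared mean. The sketch below checks that both forms agree on a small hypothetical data set; the function names and data are assumptions for illustration.

```python
def population_variance(values):
    """Mean squared deviation from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Equivalent form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))
print(population_variance_shortcut(data))
```

The shortcut form needs only running sums of the values and their squares, but it can lose precision through cancellation when the mean is large relative to the spread.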
warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
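As a small illustration of the first task, the sketch below fills missing values with the mean of the observed values. This is only one possible strategy, chosen here as an assumption; the function name and the use of `None` to mark a missing value are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot fill: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical attribute with two missing entries.
income = [30.0, None, 50.0, 70.0, None]
print(fill_missing_with_mean(income))
```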
Missing Data
technology limitation
incomplete data
inconsistent data
(regression line)
Clustering
detect and remove outliers
[Figure: regression line y = x + 1, with observed value Y1 and fitted value Y1' at x = X1]
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
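The correlation coefficient can be computed directly from its definition. The following is a minimal sketch: it assumes the standard deviations are sample standard deviations (divisor n - 1), which makes the (n - 1) factor in the denominator consistent, and the function name and data are hypothetical.

```python
import math


def correlation_coefficient(a, b):
    """Pearson correlation: sum((A - mean_A)(B - mean_B)) / ((n - 1) * std_A * std_B)."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divisor n - 1), an assumption of this sketch.
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    if std_a == 0 or std_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * std_a * std_b)


# Hypothetical attributes: B rises with A, C falls as A rises.
A = [1, 2, 3, 4]
B = [2, 4, 6, 8]
C = [8, 6, 4, 2]
print(correlation_coefficient(A, B))
print(correlation_coefficient(A, C))
```

A result near +1 or -1 suggests that one of the two attributes may be redundant after integration; a result near 0 suggests no linear relationship.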
χ² (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
The larger the χ² value, the more likely the variables are
related
The cells that contribute the most to the χ² value are
those whose actual count is very different from the
expected count
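A sketch of the chi-square statistic for a contingency table follows. It assumes the usual expected count for a cell, (row total × column total) / grand total, which holds under independence; the function name and the counts are illustrative, not taken from the slides.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a nonzero total")
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    return statistic


# Illustrative 2x2 table of counts.
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))
```

Each cell adds its own squared deviation divided by its expected count, so a cell whose actual count is far from the expected count dominates the total.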
Correlation does not imply causality
The number of hospitals and the number of car thefts in a city are correlated
Both are causally linked to a third variable: population
Ex. Let μ = 54,000, σ = 16,000. Then the value 73,600 is normalized to $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
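The arithmetic in this example is the z-score transformation, which subtracts the mean and divides by the standard deviation. A minimal sketch, with a hypothetical function name:

```python
def z_score(value, mean, std):
    """Normalize a value as (value - mean) / std."""
    if std == 0:
        raise ValueError("standard deviation must be nonzero")
    return (value - mean) / std


# Numbers from the example: mean 54,000, standard deviation 16,000.
print(z_score(73_600, 54_000, 16_000))
```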
Normalization by decimal scaling
$v' = \frac{v}{10^{j}}$, where j is the smallest integer such that Max(|v'|) < 1
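Decimal scaling can be sketched by searching for the smallest exponent j that brings every scaled magnitude below 1. The sketch below restricts j to non-negative integers, which is an assumption made for simplicity; the function name and data are hypothetical.

```python
def decimal_scaling(values):
    """Return (j, scaled values) where every |v / 10**j| is below 1.

    j is restricted to j >= 0 in this sketch.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


# Hypothetical attribute ranging from -986 to 917.
j, scaled = decimal_scaling([-986, 917])
print(j, scaled)
```

For this data the largest magnitude is 986, so the values must be divided by 1,000 before all of them fall strictly inside (-1, 1).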
Chapter 2: Data Preprocessing
smaller in volume yet produces the same (or almost the
same) analytical results, much like RAR and ZIP archives
Data reduction strategies
Data cube aggregation:
Data Compression
understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Decision-tree induction
[Figure: decision tree with root test A4? and child tests A1? and A6?]
Typically lossless
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
[Figure: data compression; labels: lossy, Original Data, Approximated]
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
[Figure: principal components Y1 and Y2 plotted against the original axes X1 and X2]
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line
above
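One common way to estimate the two coefficients of Y = w X + b from data is ordinary least squares. The sketch below assumes that criterion; the function name and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the line Y = w * X + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical points lying exactly on y = x + 1.
w, b = fit_line([0, 1, 2, 3], [1, 2, 3, 4])
print(w, b)
```

Once w and b are stored, the fitted line can stand in for the original points, which is what makes regression usable as a parametric data reduction method.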
Log-linear models:
The multi-way table of joint probabilities is
[Figure: histogram with axis ticks from 20,000 to 100,000]
the β–1 largest differences
Data Reduction Method (3): Clustering
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is “smeared”
Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering
algorithms
Cluster analysis will be studied in depth in Chapter 7
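To illustrate storing only a cluster representation, the sketch below reduces a cluster of points to its centroid and diameter. It assumes the cluster assignment is already known and defines the diameter as the maximum pairwise Euclidean distance, which is one of several definitions in use; the function name and points are hypothetical.

```python
import math
from itertools import combinations


def summarize_cluster(points):
    """Return (centroid, diameter) for a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    # Diameter taken as the largest distance between any two members (an assumption).
    diameter = max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,
    )
    return centroid, diameter


# Hypothetical cluster: the four corners of a 2 x 2 square.
cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(summarize_cluster(cluster))
```

Four points are replaced by one centroid and one number, and the saving grows with cluster size as long as the data really is clustered.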
[Figure: sampling from Raw Data: SRSWOR (simple random sample without replacement) and SRSWR]
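The two simple random sampling schemes differ only in whether a drawn tuple can be drawn again. A minimal sketch using Python's standard `random` module follows; the function names, the seed, and the data are assumptions for illustration.

```python
import random


def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: each tuple appears at most once."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn repeatedly."""
    return rng.choices(data, k=s)


# Hypothetical raw data of eight tuples, with a fixed seed for repeatability.
raw_data = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
rng = random.Random(0)
print(srswor(raw_data, 4, rng))
print(srswr(raw_data, 4, rng))
```

Without replacement the sample size cannot exceed the size of the data set; with replacement it can.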
Sampling: Cluster or Stratified Sampling
(-$1,000 - $2,000)
Step 3:
(-$400 -$5,000)
Step 4: