Data Mining: Concepts and Techniques (3rd ed.)
Chapter 3: Data Preprocessing
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
© 2011 Han, Kamber & Pei. All rights reserved.
Chapter outline:
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Major tasks in data preprocessing:
- Data cleaning
- Data integration
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
Data Cleaning
Data in the real world is dirty: incomplete, noisy, or inconsistent, e.g.,
- inconsistent: Age = "42", Birthday = "03/07/2010"
- noisy (containing errors), e.g., due to equipment malfunction
Noisy Data: How to Handle
- Binning: first sort data and partition into (equal-frequency) bins; then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
- Regression: smooth by fitting the data into regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and check by human (e.g., deal with possible outliers)
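The binning-based smoothing above can be sketched in a few lines. This is a minimal illustration; the function names and the sorted price list are assumptions, not from the slides:

```python
def equal_frequency_bins(values, n_bins):
    """Sort values and partition them into n_bins bins of (roughly) equal size."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins - 1)] \
        + [data[(n_bins - 1) * size:]]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min/max boundary."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))            # bin means 9.0, 22.0, 29.0
print(smooth_by_boundaries(bins))       # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```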
Data Integration
Data integration: combines data from multiple sources into a coherent store.
Correlation analysis for nominal data uses the χ² (chi-square) test:

    χ² = Σ (Observed − Expected)² / Expected

The larger the χ² value, the more likely the variables are related.
Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
  Like science fiction        250 (90)       200 (360)        450
  Not like science fiction     50 (210)     1000 (840)       1050
  Sum (col.)                  300           1200             1500

Expected counts (in parentheses) are computed from the marginals, e.g., 90 = 450 × 300 / 1500.

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
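The calculation can be checked with a short script; only the observed 2×2 counts come from the example, everything else is the standard expected-count formula:

```python
# Chi-square test of independence for a 2x2 contingency table.
# Expected count: E_ij = (row_i total * col_j total) / grand total.
observed = [[250, 200],    # like science fiction:     play chess / not
            [50, 1000]]    # not like science fiction: play chess / not

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = sum((o - row_totals[i] * col_totals[j] / grand) ** 2
           / (row_totals[i] * col_totals[j] / grand)
           for i, row in enumerate(observed)
           for j, o in enumerate(row))
print(chi2)  # ~507.9, matching the slide's value
```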
Correlation coefficient (Pearson's product-moment coefficient):

    r_{A,B} = Σᵢ₌₁ⁿ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B)
            = (Σᵢ₌₁ⁿ aᵢbᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective mean (expected) values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

[Figure: scatter plots showing similarity ranging from −1 to 1.]
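A direct transcription of the correlation formula, using (n − 1)-denominator sample standard deviations as in the definition above (function name and test data are illustrative):

```python
import math

def correlation(a, b):
    """Pearson r_{A,B} = sum((a_i - mean_A)(b_i - mean_B)) / ((n-1) * sd_A * sd_B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return sum((x - mean_a) * (y - mean_b)
               for x, y in zip(a, b)) / ((n - 1) * sd_a * sd_b)

a = [1, 2, 3, 4, 5]
print(correlation(a, [2 * x for x in a]))  # perfectly correlated: ~1.0
print(correlation(a, [-x for x in a]))     # perfectly anti-correlated: ~-1.0
```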
Some pairs of random variables may have a covariance of 0 but not be independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.

Covariance: An Example
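As a small worked sketch, covariance can be computed as Cov(A, B) = E(A·B) − Ā·B̄; the two stock-price series below are illustrative, not taken from the slides:

```python
def covariance(a, b):
    """Cov(A, B) = E(A*B) - mean(A) * mean(B)."""
    n = len(a)
    mean_ab = sum(x * y for x, y in zip(a, b)) / n
    return mean_ab - (sum(a) / n) * (sum(b) / n)

# Illustrative: prices of two stocks observed at five time points.
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
cov = covariance(stock_a, stock_b)
print(cov)  # ~4.0 > 0: the two stocks tend to rise together
```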
Data Reduction
- Curse of dimensionality: when dimensionality increases, data become increasingly sparse and distance measures become less meaningful
- Dimensionality reduction: helps avoid the curse of dimensionality; techniques include wavelet transforms and principal components analysis
- Wavelet transform: related to the discrete Fourier transform, but localized in both space and frequency
Wavelet Transformation
Discrete wavelet transform (DWT) families include Haar-2 and Daubechies-4.
Method: recursively apply a pairwise averaging (smoothing) function and a pairwise differencing function to the data.

Wavelet Decomposition
[Figure: hierarchical decomposition structure (a.k.a. "error tree"); e.g., S = [2, 2, 0, 2, 3, 5, 4, 4] decomposes into wavelet coefficients [2.75, −1.25, 0.5, 0, 0, −1, −1, 0].]
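The averaging-and-differencing decomposition can be sketched as follows; it reproduces the coefficients shown in the error tree (the function name is an assumption):

```python
def haar_dwt(signal):
    """One-dimensional Haar wavelet decomposition: pairwise averaging and
    differencing, applied recursively. Input length must be a power of 2."""
    s = list(signal)
    coeffs = []
    while len(s) > 1:
        avgs = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        diffs = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        coeffs = diffs + coeffs  # finer-level detail coefficients go last
        s = avgs
    return s + coeffs            # [overall average, detail coefficients...]

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```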
Principal Component Analysis (PCA)
The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.

[Figure: data in the (x1, x2) plane with principal component vector e.]

Steps:
- Normalize input data: each attribute falls within the same range
- Compute the principal components (eigenvectors of the covariance matrix) and sort them in decreasing order of variance
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
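A minimal PCA sketch along these lines, assuming NumPy is available; the synthetic, nearly one-dimensional data set is an illustration:

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the k eigenvectors of the covariance matrix
    with the largest eigenvalues (the strongest principal components)."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
    return Xc @ eigvecs[:, order[:k]]

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
# Two attributes that are almost perfectly linearly related:
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=200)])
Z = pca_project(X, 1)  # reduced from 2 dimensions to 1
print(Z.shape)         # (200, 1)
```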
Attribute Subset Selection
- Redundant attributes: duplicate much or all of the information contained in one or more other attributes
- Irrelevant attributes: contain no information useful for the data mining task at hand

Attribute Creation (Feature Generation)
- Domain-specific attribute construction
- Mapping data to a new space (see: data reduction)
- Data discretization
- Linear regression: data modeled to fit a straight line; often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis

[Figure: data points (x, y) with fitted line y = x + 1.]

Used in prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.
- Linear regression: Y = w X + b
- Multiple regression: Y = b0 + b1 X1 + b2 X2
- Log-linear models: estimate the probability of each point (tuple) in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
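For a single predictor, the least-squares fit has a closed form; a minimal sketch, with the data chosen to match the y = x + 1 line in the figure:

```python
# Least-squares fit of Y = w*X + b:
#   w = sum((x - mx)(y - my)) / sum((x - mx)^2),  b = my - w*mx
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return w, my - w * mx

xs = [0, 1, 2, 3]
ys = [1, 2, 3, 4]        # exactly y = x + 1
w, b = fit_line(xs, ys)
print(w, b)              # 1.0 1.0
```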
Histogram Analysis
Divide data into buckets and store a summary (e.g., average or sum) per bucket.
Partitioning rules:
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth): each bucket contains roughly the same number of samples

[Figure: equal-width histogram of values from 10,000 to 100,000.]
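The two partitioning rules can be sketched as bucket-edge computations (function names and data are illustrative assumptions):

```python
def equal_width_edges(values, n):
    """Edges of n buckets of equal range between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + i * width for i in range(n + 1)]

def equal_frequency_edges(values, n):
    """Edges so that each bucket holds roughly the same number of values."""
    data = sorted(values)
    size = len(data) // n
    inner = [data[i * size] for i in range(1, n)]
    return [data[0]] + inner + [data[-1]]

skewed = [1, 1, 1, 1, 2, 2, 3, 8, 9, 10]
print(equal_width_edges(skewed, 3))      # [1.0, 4.0, 7.0, 10.0]
print(equal_frequency_edges(skewed, 2))  # [1, 2, 10]: half the values are <= 2
```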
Clustering
Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter).

Sampling
Obtain a small sample s to represent the whole data set N.
Types of Sampling
- Simple random sampling: there is an equal probability of selecting any particular item
- Sampling without replacement (SRSWOR: simple random sample without replacement): once an object is selected, it is removed from the population
- Sampling with replacement (SRSWR): a selected object is not removed from the population, so it may be drawn again

[Figure: SRSWOR and SRSWR samples drawn from the raw data.]
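In Python's standard library the two schemes correspond to random.sample (without replacement) and random.choices (with replacement); a minimal sketch over illustrative data:

```python
import random

data = list(range(100))             # stand-in for the raw data
random.seed(42)                     # for reproducibility

srswor = random.sample(data, 10)    # without replacement: no duplicates possible
srswr = random.choices(data, k=10)  # with replacement: items may repeat

print(len(set(srswor)) == len(srswor))  # True: all SRSWOR items are distinct
```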
Cluster/Stratified Sample
[Figure: raw data partitioned into clusters/strata, with samples drawn from each.]
- String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression: typically lossy compression, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio): typically short and vary slowly with time
- Dimensionality and numerosity reduction may also be considered as forms of data compression
Data Compression
[Figure: original data mapped to compressed data losslessly, or to an approximation of the original data via lossy compression.]
Data Transformation and Data Discretization
Data Transformation
A function maps the entire set of values of a given attribute to a new set of replacement values.
Methods:
- Attribute/feature construction
- Normalization:
  - Min-max normalization: v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
  - Z-score normalization: v' = (v − μ_A) / σ_A
    E.g., with mean μ_A = 54,000 and standard deviation σ_A = 16,000, v = 73,600 maps to (73,600 − 54,000) / 16,000 = 1.225
  - Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
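A direct transcription of the min-max and z-score formulas; the min/max values 12,000 and 98,000 in the usage lines are illustrative assumptions, while the z-score numbers reproduce the 1.225 example above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization of v."""
    return (v - mean_a) / std_a

# Income example with mean 54,000 and standard deviation 16,000:
print(z_score(73_600, 54_000, 16_000))                # 1.225
# Min-max with assumed min 12,000 and max 98,000, mapped to [0, 1]:
print(round(min_max(73_600, 12_000, 98_000), 3))      # 0.716
```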
Discretization
- Binning: if A and B are the lowest and highest values of the attribute, the width of intervals for N equal-width bins is W = (B − A)/N
- Histogram analysis
Class Labels (Binning vs. Clustering)
[Figure: the same data discretized by equal-width binning vs. by clustering.]
Merge is performed recursively, until a predefined stopping condition is met.
A concept hierarchy can be generated by ordering attributes from fewest to most distinct values, e.g.:
country (15 distinct values)
  province_or_state
    city
      street
Summary