Data Mining: Concepts and Techniques
(3rd ed.)
Chapter 3: Data Preprocessing
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
© 2013 Han, Kamber & Pei. All rights reserved.
- Data Quality
- Data Cleaning
- Data Integration
- Data Reduction
- Summary
Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data reduction:
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization:
  - Normalization
  - Concept hierarchy generation
Data Cleaning
Data may be incomplete, noisy, or inconsistent for many reasons, e.g., equipment malfunction.
Noisy Data
How to handle noisy data:
- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc. (see the sketch after this list)
- Regression: smooth by fitting the data to regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them (e.g., to deal with possible outliers)
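A minimal sketch of smoothing by (equal-frequency) bin means in Python; the price values below are illustrative, not from the slides:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning, then replace each value by its bin's mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)                    # first sort the data
    smoothed = values.copy()
    for idx in np.array_split(order, n_bins):     # partition into equal-frequency bins
        smoothed[idx] = values[idx].mean()        # smooth by bin means
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 3))
# bins {4,8,9,15} -> 9.0, {21,21,24,25} -> 22.75, {26,28,29,34} -> 29.25
```

Smoothing by bin medians or bin boundaries follows the same pattern, swapping the mean for the median or for the nearer of the two bin endpoints.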
Data Integration
Data integration: combining data from multiple sources into a coherent store.
\chi^2 (chi-square) test:

    \chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}

The larger the \chi^2 value, the more likely the variables are related.
Chi-Square Calculation: An Example

                             Play chess   Not play chess   Sum (row)
  Like science fiction        250 (90)       200 (360)        450
  Not like science fiction     50 (210)     1000 (840)       1050
  Sum (col.)                  300           1200             1500

(Numbers in parentheses are expected counts, computed from the row and column totals, e.g., 90 = 450 × 300 / 1500.)

    \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93
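The same test can be run with SciPy (a minimal sketch; SciPy applies the Yates continuity correction to 2×2 tables by default, so it is disabled here to match the hand computation above):

```python
from scipy.stats import chi2_contingency

# Observed counts from the table above
# (rows: like / don't like science fiction; columns: play chess / not play chess).
observed = [[250, 200],
            [50, 1000]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(expected)  # [[ 90. 360.] [210. 840.]]
```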
Correlation coefficient:

    r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, and \sigma_A and \sigma_B are the respective standard deviations of A and B.

[Figure: scatter plots showing similarity ranging from -1 to 1.]
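A quick numerical check of the formula with NumPy; the attribute values are illustrative:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # attribute A
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # attribute B

# Pearson correlation via NumPy's correlation matrix.
r = np.corrcoef(a, b)[0, 1]

# The same result from the formula above, using the sample std. dev. (ddof=1).
n = len(a)
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))
assert np.isclose(r, r_manual)
```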
Some pairs of random variables may have a covariance of 0 but still not be independent; only under additional assumptions (e.g., that the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Covariance: An Example
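As a minimal sketch, computing the (population) covariance Cov(A, B) = E[(A - \bar{A})(B - \bar{B})] on illustrative values (not the original slide's data):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # e.g., stock A prices over a week
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # e.g., stock B prices over a week

# Population covariance: mean of the products of the deviations.
cov_ab = ((a - a.mean()) * (b - b.mean())).mean()

# np.cov defaults to the sample form (ddof=1); ddof=0 gives the population form.
assert np.isclose(cov_ab, np.cov(a, b, ddof=0)[0, 1])
print(cov_ab)  # 4.0 > 0: the two series tend to rise together
```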
Data Reduction
Curse of dimensionality:
- When dimensionality increases, data becomes increasingly sparse
- Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
- The number of possible subspace combinations grows exponentially

Dimensionality reduction:
- Avoids the curse of dimensionality
- Helps eliminate irrelevant features and reduce noise
- Reduces the time and space required for data mining
- Allows easier visualization

Dimensionality reduction techniques:
- Wavelet transforms
- Principal Component Analysis
- Supervised and nonlinear techniques (e.g., feature selection)
[Figure: Fourier transform vs. wavelet transform of a signal, plotted against frequency.]
Wavelet Transformation

[Figure: Haar-2 and Daubechies-4 wavelet families.]

Method: wavelet decomposition, applied recursively to the data vector (a sketch follows).
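A minimal sketch of wavelet decomposition with the PyWavelets library ('haar' and 'db4' name the Haar-2 and Daubechies-4 families); the data and threshold are illustrative assumptions:

```python
import numpy as np
import pywt

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# Multilevel discrete wavelet transform with the Haar wavelet;
# passing 'db4' would select the Daubechies-4 family instead.
coeffs = pywt.wavedec(data, 'haar', level=2)

# Keep only the strong coefficients (the compressed representation),
# zeroing the rest, then reconstruct an approximation of the data.
thresholded = [pywt.threshold(c, value=1.0, mode='hard') for c in coeffs]
approx = pywt.waverec(thresholded, 'haar')
```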
Principal Component Analysis (PCA)

The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.

[Figure: data points in the x1-x2 plane with principal eigenvector e.]
- Normalize the input data, so that each attribute falls within the same range
- Since the components are sorted in decreasing order of variance, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance; using the strongest principal components, it is possible to reconstruct a good approximation of the original data (see the sketch below)
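A minimal sketch of PCA as described above, via eigendecomposition of the covariance matrix (the function name and data are ours, not the slides'):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d attributes) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)          # normalize each attribute to zero mean
    cov = np.cov(X_centered, rowvar=False)   # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]        # sort components by decreasing variance
    components = eigvecs[:, order[:k]]       # keep the k strongest components
    return X_centered @ components           # the reduced-dimensionality data

X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = pca(X, k=2)                      # 100 samples x 2 components
```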
Attribute subset selection removes:
- Redundant attributes
- Irrelevant attributes

Attribute creation:
- Domain-specific attribute construction
- Mapping data to new space (see: data reduction)
- Data discretization
Linear regression:
- Data are modeled to fit a straight line
- Often uses the least-squares method to fit the line

Multiple regression:
- Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Log-linear model:
- Approximates discrete multidimensional probability distributions
Regression Analysis

[Figure: data points in the X1-Y1 plane fitted by the line y = x + 1.]

- The parameters are estimated so as to give a "best fit" of the data; most commonly the best fit is evaluated using the least-squares method
- Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
- Linear regression: Y = wX + b
- Multiple regression: Y = b0 + b1 X1 + b2 X2 (see the sketch below)
- Log-linear models: estimate the probability of each point (tuple) in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
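A minimal least-squares sketch with NumPy; the data points are illustrative:

```python
import numpy as np

# Linear regression Y = w*X + b, fit by least squares.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

w, b = np.polyfit(X, Y, deg=1)   # minimizes the sum of squared residuals
Y_hat = w * X + b                # fitted values on the regression line

# Multiple regression Y = b0 + b1*X1 + b2*X2 via a least-squares solve
# (here X2 is a second illustrative feature).
X1, X2 = X, X ** 2
design = np.column_stack([np.ones_like(X), X1, X2])
b0, b1, b2 = np.linalg.lstsq(design, Y, rcond=None)[0]
```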
Histogram Analysis

Partitioning rules:
- Equal-width: equal bucket range
- Equal-frequency (or equal-depth): each bucket holds roughly the same number of values

[Figure: example histogram over values 10,000-100,000 with equal-width buckets.]
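A sketch of both partitioning rules with NumPy; the value range mirrors the figure, but the data are randomly generated:

```python
import numpy as np

data = np.random.default_rng(1).integers(10_000, 100_000, size=1_000)

# Equal-width: every bucket spans the same value range.
eq_width_counts, eq_width_edges = np.histogram(data, bins=9)

# Equal-frequency (equal-depth): bucket edges at quantiles, so every
# bucket holds roughly the same number of values.
eq_depth_edges = np.quantile(data, np.linspace(0, 1, 10))
eq_depth_counts, _ = np.histogram(data, bins=eq_depth_edges)
```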
Clustering
Sampling
Types of Sampling
- SRSWOR: simple random sample without replacement
- SRSWR: simple random sample with replacement

[Figure: raw data reduced by SRSWOR and SRSWR.]
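A minimal sketch of both sampling schemes with NumPy; the data set and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
N = np.arange(10_000)  # the full data set

srswor = rng.choice(N, size=100, replace=False)  # without replacement: no duplicates
srswr = rng.choice(N, size=100, replace=True)    # with replacement: items may repeat
```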
Cluster/Stratified Sample

[Figure: raw data partitioned into clusters/strata and then sampled.]
String compression:
- There are extensive theories and well-tuned algorithms
- Typically lossless, but only limited manipulation is possible without expansion

Audio/video compression:
- Typically lossy compression, with progressive refinement
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Time sequences (not audio):
- Typically short, and vary slowly with time

Dimensionality and numerosity reduction may also be considered forms of data compression.
Data Compression

[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression reconstructs only an approximation of the original data.]
Data Transformation Methods
- Attribute/feature construction
- Normalization:
  - Min-max normalization: v' = \frac{v - \min_A}{\max_A - \min_A}(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A
  - Z-score normalization: v' = \frac{v - \mu_A}{\sigma_A}
    Ex.: with \mu = 54{,}000 and \sigma = 16{,}000, v = 73{,}600 maps to v' = \frac{73600 - 54000}{16000} = 1.225
  - Normalization by decimal scaling: v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
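A sketch of the three normalization methods with NumPy; apart from the slide's \mu and \sigma, the values are illustrative:

```python
import numpy as np

v = np.array([54_000.0, 60_000.0, 73_600.0, 98_000.0])

# Min-max normalization to the new range [0, 1].
v_minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization with the slide's parameters (mu = 54,000, sigma = 16,000).
v_zscore = (v - 54_000.0) / 16_000.0   # 73,600 maps to 1.225

# Decimal scaling: divide by 10**j so that all |values| fall below 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / 10 ** j                # here j = 5, so 98,000 -> 0.98
```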
Discretization methods:
- Binning: with equal-width partitioning, if A and B are the lowest and highest values of the attribute, the width of the N intervals will be W = (B - A)/N (see the sketch below)
- Histogram analysis
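A minimal sketch of equal-width partitioning with NumPy; the data values are illustrative:

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning: N intervals of width W = (B - A) / N.
N = 3
A, B = data.min(), data.max()
W = (B - A) / N                       # (34 - 4) / 3 = 10
edges = A + W * np.arange(1, N)       # interior cut points: 14, 24
bin_index = np.digitize(data, edges)  # interval 0, 1, or 2 for each value
```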
Discretization Without Class Labels (Binning vs. Clustering)

[Figure: the same data discretized by binning and by clustering.]
Merging is performed recursively, until a predefined stopping condition is met.
Concept hierarchy, e.g., for a location attribute:
  country (15 distinct values)
    province_or_state
      city
        street
Summary

This chapter covered data quality, data cleaning, data integration, data reduction, and data transformation.