Data Preprocessing
• Data cleaning
• Data reduction
Covariance for numeric data:

Cov(A, B) = (1/n) Σ_{i=1..n} (a_i − Ā)(b_i − B̄)   ---eqn (A)

where n is the number of tuples and Ā and B̄ are the respective mean values of A and B.
Correlation Coefficient for numeric data
• Evaluate the correlation between two attributes, A
and B, by computing the correlation coefficient
(also known as Pearson's product-moment
coefficient, named after its inventor, Karl Pearson).
This is

r_{A,B} = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / (n σ_A σ_B)   ---eqn (B)

where n is the number of tuples, Ā and B̄ are the mean values, and σ_A and σ_B are the standard deviations of A and B.
Continue…
• Compare eqn (B) for r_{A,B} (the correlation coefficient)
with eqn (A) for covariance: the correlation coefficient is
simply the covariance divided by the product of the two
standard deviations, σ_A σ_B.
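The relationship between the two equations can be sketched in Python (the sample attributes A and B are assumed for illustration):

```python
def covariance(a, b):
    """eqn (A): Cov(A, B) = (1/n) * sum((a_i - mean(A)) * (b_i - mean(B)))"""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

def correlation(a, b):
    """eqn (B): r_{A,B} = Cov(A, B) / (sigma_A * sigma_B)"""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = (sum((x - ma) ** 2 for x in a) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / n) ** 0.5
    return covariance(a, b) / (sa * sb)

# assumed sample data: B is a linear function of A, so r should be 1.0
A = [2, 4, 6, 8]
B = [1, 3, 5, 7]
print(correlation(A, B))  # → 1.0
```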
[Figure: original data and its wavelet-approximated version]
Continue…
Wavelet Transforms
• Discrete wavelet transform (DWT): a linear signal
processing technique
• Compressed approximation: store only a small fraction
of the strongest wavelet coefficients
• Similar to the discrete Fourier transform (DFT), but gives
better lossy compression and is localized in space
Examples of wavelet families: Haar-2, Daubechies-4
Continue…
The general algorithm for the DWT is as follows:
1. The length, L, of the input data vector must be an
integer power of 2 (pad with 0s when necessary).
2. Each transform involves applying two functions.
The first applies some data smoothing, such as a
sum or weighted average. The second performs a
weighted difference, which acts to bring out the
detailed features of the data.
3. The two functions are applied to pairs of the input
data, resulting in two data sets of length L/2.
4. The two functions are recursively applied to the
data sets obtained in the previous loop, until the
resulting data sets are of length 2.
5. Selected values from the data sets obtained in the
above iterations are designated the wavelet
coefficients of the transformed data.
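The steps above can be sketched for the Haar wavelet, where the smoothing function is the pairwise average and the differencing function is the pairwise half-difference. This sketch recurses all the way down to a single overall average for simplicity; the input data are assumed:

```python
def haar_dwt(data):
    """Haar DWT sketch: pairwise averages (smoothing) and
    pairwise half-differences (detail), applied recursively."""
    x = list(data)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    details = []
    while n > 1:
        avg  = [(x[2*i] + x[2*i+1]) / 2 for i in range(n // 2)]   # smoothing
        diff = [(x[2*i] - x[2*i+1]) / 2 for i in range(n // 2)]   # detail
        details = diff + details      # collect detail coefficients, coarsest first
        x, n = avg, n // 2
    return x + details                # [overall average, detail coefficients...]

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# → [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Keeping only the few largest-magnitude coefficients (and zeroing the rest) gives the compressed approximation described above.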
Continue…
Principal Component Analysis
Also called the Karhunen-Loeve, or K-L, method.
Procedure:
1. The input data are normalized, so that each attribute
falls within the same range. This step helps ensure that
attributes with large domains will not dominate
attributes with smaller domains.
2. PCA computes k orthonormal vectors that provide a
basis for the normalized input data. These are unit
vectors that each point in a direction perpendicular to
the others. These vectors are referred to as the principal
components. The input data are a linear combination of
the principal components.
3. The principal components are sorted in order of
decreasing "significance" or strength. The principal
components essentially serve as a new set of axes for
the data, providing important information about
variance.
Continue…
4. Because the components are sorted according to
decreasing order of "significance," the size of the data
can be reduced by eliminating the weaker components,
that is, those with low variance. Using the strongest
principal components, it should be possible to
reconstruct a good approximation of the original data.
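The four steps above can be sketched with NumPy, taking the eigenvectors of the covariance matrix as the orthonormal principal components (the random input data are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))          # assumed data: 100 tuples, 3 attributes

# Step 1: normalize so each attribute falls within the same range
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: orthonormal basis = eigenvectors of the covariance matrix
cov = np.cov(Xn, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by decreasing "significance" (variance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep only the k strongest components, then reconstruct
k = 2
Z = Xn @ eigvecs[:, :k]                # reduced representation
X_approx = Z @ eigvecs[:, :k].T        # approximation of the normalized data

print(Z.shape)  # → (100, 2)
```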
[Figure: data plotted on original axes X1, X2 with the principal components Y1, Y2 shown as a new set of axes]
2. Histogram analysis
3. Clustering analysis
4. Entropy-based discretization
Eg: [Figure: count histograms over price intervals, refined step by step — e.g., (-$1,000 - $2,000) at Step 3 and (-$4,000 - $5,000) at Step 4]
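A minimal sketch of the histogram-analysis approach listed above, using equal-width binning (the price values and bin count are assumed for illustration):

```python
def equal_width_bins(values, k):
    """Partition values into k equal-width intervals and count each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        i = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(k)]

# assumed sample prices
prices = [1100, 1200, 1500, 1800, 2100, 2500, 3000, 3500, 4000, 4500]
for low, high, count in equal_width_bins(prices, 4):
    print(f"[{low:.0f}, {high:.0f}): {count}")
```

Each resulting interval can then be treated as one discrete value of the attribute, reducing the data.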