Class8-9 DataPreprocessing DataReduction 30Sept-05Oct2020
Data Preprocessing
Data Reduction
Data Reduction
• Data reduction techniques are applied to obtain a
reduced representation of the dataset that is much
smaller in volume, yet closely maintains the integrity of
the original data
• Mining on the reduced dataset should produce
the same or almost the same analytical results
• Different strategies:
– Attribute subset selection (feature selection):
• Irrelevant, weakly relevant or redundant attributes
(dimensions) are detected and removed
– Dimensionality reduction:
• Encoding mechanisms are used to reduce the dataset size
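As a rough illustration of attribute subset selection, the sketch below drops attributes that are nearly duplicates of ones already kept. The correlation-threshold rule, the function name `drop_redundant_attributes` and the synthetic data are assumptions made for this example; the slides do not prescribe a specific selection method.

```python
import numpy as np

def drop_redundant_attributes(X, threshold=0.95):
    """Return the column indices kept after removing near-duplicate attributes.

    Assumption of this sketch: an attribute is redundant when its absolute
    Pearson correlation with an already-kept attribute exceeds `threshold`.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # d x d correlation matrix
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 3))            # three unrelated synthetic attributes
# The fourth attribute is a scaled copy of the first, so it adds no information.
X = np.column_stack([base, 2.0 * base[:, 0]])
kept = drop_redundant_attributes(X)
X_reduced = X[:, kept]
```

Here the fourth column is removed and the first three are kept, so `X_reduced` has fewer attributes but the same number of tuples.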
30-09-2020
Dimensionality Reduction
• Data encoding or transformations are applied so as to
obtain a reduced or compressed representation of the
original data
[Figure: Data → Feature Extraction → x (d-dimensional) → Dimension Reduction → a (l-dimensional reduced representation) → Pattern Analysis Task]
$a_{ni} = \mathbf{q}_i^{T} \mathbf{x}_n, \quad i = 1, 2, \ldots, l$
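A minimal sketch of this projection, assuming each reduced component is the inner product of a direction q_i with the data vector x_n. The vector `x_n` and the two directions in `Q` are hypothetical values chosen for illustration (d = 3, l = 2).

```python
import numpy as np

# Hypothetical data vector with d = 3 attributes.
x_n = np.array([2.0, -1.0, 0.5])

# Columns of Q are the l = 2 directions q_1 and q_2 (unit length, orthogonal).
q_1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
q_2 = np.array([0.0, 0.0, 1.0])
Q = np.column_stack([q_1, q_2])          # shape (d, l)

# a_n[i] = q_i^T x_n for i = 1, ..., l, computed for all i at once.
a_n = Q.T @ x_n                          # shape (l,)
```

Stacking the directions as columns lets one matrix product compute all l components of the reduced representation.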
Illustration: PCA
• Atmospheric Data:
– N = Number of tuples (data vectors) = 20
– d = Number of attributes (dimension) = 5
• Mean of each dimension:
Illustration: PCA
• Step 1: Subtract the mean of each attribute
from the values of that attribute
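Step 1 can be sketched as follows. The atmospheric data values are not reproduced in this text, so the matrix `X` below is synthetic stand-in data with the same shape (N = 20, d = 5); the attribute scales are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
N, d = 20, 5

# Synthetic stand-in for the data matrix: one row per tuple, one column per attribute.
X = rng.normal(loc=[25.0, 60.0, 1010.0, 5.0, 2.0],
               scale=[3.0, 10.0, 5.0, 1.5, 1.0],
               size=(N, d))

mean = X.mean(axis=0)        # mean of each attribute (length d)
X_centered = X - mean        # broadcasting subtracts the mean from every row
```

After this step every column of `X_centered` has zero mean, up to floating-point error.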
Illustration: PCA
• Step 2: Compute the correlation matrix from the
mean-subtracted data matrix
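A sketch of Step 2 under one stated assumption: the formula is not shown in this text, so "correlation matrix" is taken here to mean C = (1/N) X̃ᵀX̃ computed from the mean-subtracted data matrix X̃ (which equals the covariance matrix of the original data). The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))           # synthetic data: N = 20 tuples, d = 5 attributes
X_centered = X - X.mean(axis=0)        # Step 1: subtract the mean of each attribute

N = X_centered.shape[0]
# Assumed definition: C = (1/N) * X_centered^T X_centered, a symmetric d x d matrix.
C = X_centered.T @ X_centered / N
```

If a normalized correlation matrix (unit diagonal) is intended instead, `np.corrcoef(X, rowvar=False)` would be the corresponding call.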
Illustration: PCA
• Step 4: Perform eigen analysis on the correlation matrix
– Get eigenvalues and eigenvectors
• Step 5: Sort the eigenvalues in descending order
• Step 6: Arrange the eigenvectors in the descending order of
their corresponding eigenvalues
[Figure: eigenvalues and eigenvectors of the correlation matrix]
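Steps 4 to 6 might look like the sketch below. It uses synthetic data and assumes the matrix being analysed is C = (1/N) X̃ᵀX̃ of the mean-subtracted data; `np.linalg.eigh` is chosen because that matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # synthetic data: N = 20, d = 5
X_centered = X - X.mean(axis=0)
C = X_centered.T @ X_centered / X_centered.shape[0]

# Step 4: eigen analysis. eigh returns eigenvalues in ascending order and
# the matching eigenvectors as the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(C)

# Step 5: sort the eigenvalues in descending order.
order = np.argsort(eigvals)[::-1]
eigvals_sorted = eigvals[order]

# Step 6: reorder the eigenvector columns to match.
eigvecs_sorted = eigvecs[:, order]
```

Reordering both arrays with the same index array keeps each eigenvector paired with its eigenvalue.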
Illustration: PCA
• Step 7: Consider the two leading
(significant) eigenvalues and their
corresponding eigenvectors
• Step 8: Project the mean-subtracted
data matrix onto the two selected
eigenvectors corresponding to the leading
eigenvalues
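Steps 7 and 8 can be sketched as a single matrix product. The data is synthetic, and the matrix C = (1/N) X̃ᵀX̃ is an assumed definition of the correlation matrix of the mean-subtracted data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # synthetic data: N = 20, d = 5
X_centered = X - X.mean(axis=0)
C = X_centered.T @ X_centered / X_centered.shape[0]

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]             # indices of eigenvalues, largest first

# Step 7: keep the eigenvectors of the two leading eigenvalues.
l = 2
Q_l = eigvecs[:, order[:l]]                   # shape (d, l)

# Step 8: project the mean-subtracted data onto those eigenvectors.
A = X_centered @ Q_l                          # shape (N, l), reduced representation
```

Each row of `A` is the 2-dimensional representation of one tuple, and the variance of each column equals the corresponding leading eigenvalue.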
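• A vector z can be written as a linear combination of the vectors q_1 and q_2:
$\mathbf{z} = \lambda_1 \mathbf{q}_1 + \lambda_2 \mathbf{q}_2$
$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \lambda_1 \begin{bmatrix} 4 \\ 2 \end{bmatrix} + \lambda_2 \begin{bmatrix} 5 \\ 7 \end{bmatrix}$
$z_1 = 4\lambda_1 + 5\lambda_2$
$z_2 = 2\lambda_1 + 7\lambda_2$
• We can find λ1 and λ2 by solving a system of linear
equations
[Figure: vectors q_1, q_2 and z plotted in the plane]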
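A small sketch of solving that system numerically, with q_1 = (4, 2) and q_2 = (5, 7) as on the slide. The slide gives no numerical value for z, so z = (13, 11) is a hypothetical choice made for this example.

```python
import numpy as np

q_1 = np.array([4.0, 2.0])
q_2 = np.array([5.0, 7.0])
Q = np.column_stack([q_1, q_2])    # columns are the vectors q_1 and q_2

z = np.array([13.0, 11.0])         # hypothetical vector to be expressed

# Solve  4*lam_1 + 5*lam_2 = z_1  and  2*lam_1 + 7*lam_2 = z_2.
lam = np.linalg.solve(Q, z)        # lam[0] = lambda_1, lam[1] = lambda_2
```

For this choice of z the solution is λ1 = 2 and λ2 = 1, since 2·(4, 2) + 1·(5, 7) = (13, 11).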
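• In d dimensions, z is expressed as a linear combination of the vectors q_1, q_2, ..., q_d:
$\mathbf{z} = \lambda_1 \mathbf{q}_1 + \lambda_2 \mathbf{q}_2 + \ldots + \lambda_d \mathbf{q}_d$
$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_d \end{bmatrix} = \lambda_1 \begin{bmatrix} q_{11} \\ q_{12} \\ \vdots \\ q_{1d} \end{bmatrix} + \lambda_2 \begin{bmatrix} q_{21} \\ q_{22} \\ \vdots \\ q_{2d} \end{bmatrix} + \ldots + \lambda_d \begin{bmatrix} q_{d1} \\ q_{d2} \\ \vdots \\ q_{dd} \end{bmatrix}$
$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_d \end{bmatrix} = \begin{bmatrix} q_{11} & q_{21} & \ldots & q_{d1} \\ q_{12} & q_{22} & \ldots & q_{d2} \\ \vdots & \vdots & \ddots & \vdots \\ q_{1d} & q_{2d} & \ldots & q_{dd} \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_d \end{bmatrix}$
$\mathbf{z} = \mathbf{Q} \boldsymbol{\lambda}$
[Figure: vectors q_1, q_2 and z plotted in the plane]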
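The matrix form z = Qλ can be checked numerically. The sketch below adds one assumption that is not stated on this slide: the columns of Q are orthonormal (as the eigenvectors of a symmetric matrix can be chosen to be), in which case the coefficients are simply λ = Qᵀz. The matrix `Q` and vector `z` are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Build a d x d matrix with orthonormal columns via a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
z = rng.normal(size=d)

# General case: solve the linear system Q @ lam = z.
lam_general = np.linalg.solve(Q, z)

# Orthonormal case (assumption of this sketch): Q^{-1} = Q^T, so lam = Q^T z.
lam = Q.T @ z

z_rebuilt = Q @ lam                 # linear combination of the columns of Q
```

Both routes give the same coefficients here, and `z_rebuilt` recovers `z`; the `solve` route also works when the columns are merely linearly independent.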
Illustration: PCA
• Handwritten Digit Image [1]:
– Size of each image: 28 x 28
– Dimension after linearizing: 784
– Total number of training examples: 5000 (500 per class)
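Linearizing the images can be sketched with a reshape. The pixel values below are random placeholders, not the dataset cited as [1]; only the shapes (5000 images of 28 x 28) follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stack of 5000 grayscale images, each 28 x 28 pixels.
images = rng.integers(0, 256, size=(5000, 28, 28), dtype=np.uint8)

# Linearize: concatenate the 28 rows of each image into one 784-dimensional vector.
X = images.reshape(images.shape[0], -1)      # shape (5000, 784)
```

With row-major flattening, pixel (r, c) of an image lands at position 28·r + c of its vector, so `X` is an N x d data matrix with N = 5000 and d = 784.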
Illustration: PCA
• Handwritten Digit Image:
– All 784 Eigenvalues
Illustration: PCA
• Handwritten Digit Image:
– Leading 100 Eigenvalues
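One common way to read such an eigenvalue plot is to ask what fraction of the total variance the leading eigenvalues account for. The sketch below illustrates that calculation; the decaying spectrum `eigvals` is a made-up stand-in, not the eigenvalues of the digit data, and the 95% target is an arbitrary choice for this example.

```python
import numpy as np

d = 784
# Hypothetical eigenvalue spectrum, already sorted in descending order.
eigvals = 1.0 / np.arange(1, d + 1) ** 2

# Fraction of total variance captured by the first k eigenvalues, for every k.
cumulative = np.cumsum(eigvals) / eigvals.sum()

retained_100 = cumulative[99]                        # fraction kept by the leading 100
# Smallest number of leading eigenvalues reaching the (arbitrary) 95% target.
l_needed = int(np.searchsorted(cumulative, 0.95)) + 1
```

When the spectrum decays quickly, a small number of leading eigenvalues captures most of the variance, which is what justifies keeping only the corresponding eigenvectors.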