Class 8-9: Data Preprocessing and Data Reduction (30 Sept - 05 Oct 2020)


Data Preprocessing
Data Reduction

Data Reduction
• Data reduction techniques are applied to obtain a
reduced representation of the dataset that is much
smaller in volume, yet closely maintains the integrity of
the original data
• The mining on the reduced dataset should produce
the same or almost same analytical results
• Different strategies:
– Attribute subset selection (feature selection):
• Irrelevant, weakly relevant or redundant attributes
(dimensions) are detected and removed
– Dimensionality reduction:
• Encoding mechanisms are used to reduce the dataset size


Attribute (Feature) Subset Selection


• In the context of machine learning, it is termed
feature subset selection
• Irrelevant or redundant features are detected using
correlation analysis
• Two strategies:
– First strategy:
• Perform the correlation analysis between every pair of
attributes
• Drop one of the two attributes when they are highly
correlated
– Second strategy:
• Perform the correlation analysis between each attribute
and the target attribute
• Drop the attributes that are less correlated with the
target attribute

Attribute (Feature) Subset Selection


• Second strategy:
– Perform the correlation
analysis between each
attribute and the target attribute
– Drop the attributes that are
less correlated with the target
attribute
• Example:
– Predicting Rain (target
attribute) based on
Temperature, Humidity and
Pressure
– Rain depends on
Temperature, Humidity and
Pressure
– Correlation analysis of
Temperature, Humidity,
Pressure with Rain (a small
sketch follows below)
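
A minimal sketch of the second strategy using pandas. The column values and the 0.3 threshold are illustrative assumptions, not something given on the slides; only the attribute names follow the Rain example above.

```python
import pandas as pd

# Hypothetical weather dataset; column names follow the slide's example
df = pd.DataFrame({
    "Temperature": [30.1, 28.4, 25.0, 27.3, 31.2, 24.8],
    "Humidity":    [60.0, 72.5, 88.0, 75.0, 55.0, 90.0],
    "Pressure":    [1008, 1004, 998, 1002, 1010, 996],
    "Rain":        [0.0, 1.2, 5.4, 2.1, 0.0, 6.3],
})

# Correlation of every attribute with the target attribute "Rain"
corr_with_target = df.corr()["Rain"].drop("Rain")
print(corr_with_target)

# Keep only the attributes whose absolute correlation with the target
# exceeds an (assumed) threshold; drop the weakly correlated ones
threshold = 0.3
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print("Selected attributes:", selected)
```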


Dimensionality Reduction

Dimensionality Reduction
• Data encoding or transformations are applied so as to
obtain a reduced or compressed representation of the
original data
[Block diagram: Data → Feature Extraction → x (d-dimensional representation) → Dimension Reduction → a (l-dimensional reduced representation) → Pattern Analysis Task]

• If the original data can be reconstructed from the
compressed data without any loss of information, the
data reduction is called lossless
• If only an approximation of the original data can be
reconstructed from compressed data, then the data
reduction is called lossy
• One of the popular and effective methods of lossy
dimensionality reduction is principal component
analysis (PCA)


Tuple (Data Vector) – Attribute (Dimension)


• A tuple (one row) is
referred to as a vector
• An attribute is referred to as
a dimension
• In this example:
– Number of vectors =
number of rows = 20
– Dimension of a vector
= number of
attributes = 5
– Size of data matrix is
20x5

[Figure: example data matrix with 20 rows (tuples / data vectors) and 5 columns (attributes)]

Principal Component Analysis (PCA)


• Suppose the data to be reduced consists of N tuples (or
data vectors), each described by d attributes (d
dimensions):
D = \{x_n\}_{n=1}^{N}, \quad x_n \in \mathbb{R}^d
x_n = [x_{n1} \; x_{n2} \; \dots \; x_{nd}]^T
• Let q_i, i = 1, 2, \dots, d, be d orthonormal vectors in the
d-dimensional space, q_i \in \mathbb{R}^d

– These are unit vectors that each point in a direction
perpendicular to the others:
q_i^T q_j = 0, \; i \neq j
q_i^T q_i = 1
• PCA searches for l orthonormal vectors that can best
be used to represent the data, where l < d


Principal Component Analysis (PCA)


• These orthonormal vectors are also called directions
of projection
• The original data (each of the tuples (data vectors),
x_n) is then projected onto each of the l orthonormal
vectors to get the principal components:
a_{ni} = q_i^T x_n, \quad i = 1, 2, \dots, l
– a_{ni} is the i-th principal component of x_n
[Figure: projection of a vector x_n onto the direction q_i in the (x_1, x_2) plane, giving the component a_{ni}]
• This transforms each of the d-dimensional vectors
(i.e. tuples) into l-dimensional vectors (a NumPy sketch
of this projection follows below):
x_n = [x_{n1} \; x_{n2} \; \dots \; x_{nd}]^T \;\longrightarrow\; a_n = [a_{n1} \; a_{n2} \; \dots \; a_{nl}]^T
• Task:
– How to obtain the orthonormal vectors?
– Which l orthonormal vectors to choose?
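
A minimal NumPy sketch of this projection step, assuming the l orthonormal directions are already available; the vectors below are hypothetical hard-coded values, not taken from the slides.

```python
import numpy as np

# A single d-dimensional tuple (d = 3 here, values are illustrative)
x_n = np.array([2.0, -1.0, 0.5])

# l = 2 assumed orthonormal directions of projection (one per row): q_1 and q_2
Q_l = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

# a_{ni} = q_i^T x_n for i = 1, ..., l  ->  the l principal components of x_n
a_n = Q_l @ x_n
print(a_n)  # 2-dimensional reduced representation of the 3-dimensional x_n
```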

Principal Component Analysis (PCA)


• Thus the original data is projected onto a much smaller
space, resulting in dimensionality reduction
• It combines the essence of the attributes by creating an
alternative, smaller set of variables (attributes)
• It is possible to reconstruct a good approximation of the
original data, x_n, as a linear combination of the directions of
projection, q_i, and the principal components, a_{ni}:
\hat{x}_n = \sum_{i=1}^{l} a_{ni} q_i
• The Euclidean distance between the original and the
approximated tuple gives the reconstruction error
(see the sketch below):
Error = \| x_n - \hat{x}_n \| = \sqrt{\sum_{i=1}^{d} (x_{ni} - \hat{x}_{ni})^2}
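
A small sketch, continuing the NumPy example above, of reconstructing x_n from its l principal components and measuring the reconstruction error; all values are illustrative.

```python
import numpy as np

x_n = np.array([2.0, -1.0, 0.5])          # original d-dimensional tuple
Q_l = np.array([[1.0, 0.0, 0.0],          # l = 2 assumed orthonormal directions
                [0.0, 1.0, 0.0]])

a_n = Q_l @ x_n                           # principal components a_{ni}
x_hat = Q_l.T @ a_n                       # reconstruction: sum_i a_{ni} q_i

error = np.linalg.norm(x_n - x_hat)       # Euclidean reconstruction error
print(x_hat, error)                       # the component along the dropped
                                          # third direction (0.5) is lost
```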


PCA for Dimension Reduction


• Given: Data with N samples, D = \{x_n\}_{n=1}^{N}, \; x_n \in \mathbb{R}^d
• Remove the mean of each attribute (dimension) from the
data samples (tuples)
• Then construct a data matrix X using the mean-subtracted
samples, X \in \mathbb{R}^{N \times d}
– Each row of the matrix X corresponds to one sample (tuple
or data vector)
• Compute the correlation matrix C = X^T X
• Perform the eigen analysis of the correlation matrix C:
C q_i = \lambda_i q_i, \quad i = 1, 2, \dots, d
– As the correlation (covariance) matrix is symmetric
and positive semidefinite,
• each eigenvalue λ_i is real and non-negative,
• the eigenvectors q_i corresponding to the eigenvalues form an
orthonormal set, and
• the eigenvalues indicate the variance or strength along the
corresponding eigenvectors (a NumPy sketch follows below)
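
A minimal sketch of this eigen analysis step with NumPy; the data matrix here is random and purely illustrative, standing in for a 20 x 5 dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 5))          # N = 20 samples, d = 5 attributes (random stand-in)

X = X_raw - X_raw.mean(axis=0)            # remove the mean of each attribute
C = X.T @ X                               # correlation matrix C = X^T X  (d x d)

# Eigen analysis of the symmetric matrix C; np.linalg.eigh returns the
# eigenvalues in ascending order and orthonormal eigenvectors as columns
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)                            # real and non-negative
print(eigvecs.T @ eigvecs)                # ~ identity matrix: the eigenvectors are orthonormal
```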

PCA for Dimension Reduction


• Project x_n onto each of the directions
(eigenvectors) to get the principal components:
a_{ni} = q_i^T x_n, \quad i = 1, 2, \dots, d
– a_{ni} is the i-th principal component of x_n
• Thus, each training example x_n is transformed to a new
representation a_n by projecting it onto the d orthonormal
basis vectors (eigenvectors):
x_n = [x_{n1} \; x_{n2} \; \dots \; x_{nd}]^T \;\longrightarrow\; a_n = [a_{n1} \; a_{n2} \; \dots \; a_{nd}]^T
• It is possible to reconstruct the original data, x_n, without
error as a linear combination of the directions of projection, q_i,
and the principal components, a_{ni}:
x_n = \sum_{i=1}^{d} a_{ni} q_i


PCA for Dimension Reduction


• In general, we are interested in representing the data
using fewer dimensions such that the data has high
variance along these dimensions
• Idea: Select l out of d orthonormal basis vectors
(eigenvectors) that contain high variance of data (i.e.
more information content)
• Rank order the eigenvalues (λ_i's) such that
\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d
• Based on Definition 1, consider the l (l << d)
eigenvectors corresponding to the l significant eigenvalues
(see the sketch below)
– Definition 1: Let λ_1, λ_2, . . . , λ_d be the eigenvalues of a
d × d matrix A. λ_1 is called the dominant (significant)
eigenvalue of A if |λ_1| ≥ |λ_i|, i = 1, 2, …, d
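
A short sketch of one common way to choose l: keep enough leading eigenvalues to explain a target fraction of the total variance. The 95% target and the random data are illustrative assumptions, not something the slides prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)

# Sort the eigenvalues (and their eigenvectors) in descending order
order = np.argsort(eigvals)[::-1]
eigvals_sorted = eigvals[order]
eigvecs_sorted = eigvecs[:, order]

# Choose the smallest l whose leading eigenvalues explain >= 95% of the total variance
explained = np.cumsum(eigvals_sorted) / eigvals_sorted.sum()
l = int(np.searchsorted(explained, 0.95) + 1)
Q_l = eigvecs_sorted[:, :l]               # d x l matrix of significant eigenvectors
print(l, explained[l - 1])
```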


PCA for Dimension Reduction


• Project x_n onto each of the l directions
(eigenvectors) to get the reduced-dimensional
representation:
a_{ni} = q_i^T x_n, \quad i = 1, 2, \dots, l
• Thus, each training example x_n is transformed to a new
reduced-dimensional representation a_n by projecting it
onto the l orthonormal basis vectors (eigenvectors):
x_n = [x_{n1} \; x_{n2} \; \dots \; x_{nd}]^T \;\longrightarrow\; a_n = [a_{n1} \; a_{n2} \; \dots \; a_{nl}]^T
• The eigenvalue λ_i corresponds to the variance of the
data projected onto q_i


PCA for Dimension Reduction


• Since the strongest l directions are considered for
obtaining the reduced-dimensional representation, it
should be possible to reconstruct a good
approximation of the original data
• An approximation of the original data, x_n, is obtained as a
linear combination of the directions of projection (the strongest
eigenvectors), q_i, and the principal components, a_{ni}:
\hat{x}_n = \sum_{i=1}^{l} a_{ni} q_i


PCA: Basic Procedure


• Given: Data with N samples, D = \{x_n\}_{n=1}^{N}, \; x_n \in \mathbb{R}^d
1. Remove the mean of each attribute (dimension) from the
data samples (tuples)
2. Then construct a data matrix X using the mean-subtracted
samples, X \in \mathbb{R}^{N \times d}
– Each row of the matrix X corresponds to one sample
(tuple)
3. Compute the correlation matrix C = X^T X
4. Perform the eigen analysis of the correlation matrix C:
C q_i = \lambda_i q_i, \quad i = 1, 2, \dots, d
– As the correlation matrix is symmetric and positive
semidefinite,
• each eigenvalue λ_i is real and non-negative,
• the eigenvectors q_i corresponding to the eigenvalues form an
orthonormal set, and
• the eigenvalues indicate the variance or strength along the
corresponding eigenvectors


PCA for Dimension Reduction


• In general, we are interested in representing the data
using fewer dimensions such that the data has high
variance along these dimensions
5. Rank order the eigenvalues (λ_i's) in sorted order such
that
\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d
6. Consider the l (l << d) eigenvectors corresponding to
the l significant eigenvalues
7. Project x_n onto each of the l directions
(eigenvectors) to get the reduced-dimensional
representation:
a_{ni} = q_i^T x_n, \quad i = 1, 2, \dots, l

PCA for Dimension Reduction


8. Thus, each training example x_n is transformed to a
new reduced-dimensional representation a_n by
projecting it onto the l orthonormal basis vectors:
x_n = [x_{n1} \; x_{n2} \; \dots \; x_{nd}]^T \;\longrightarrow\; a_n = [a_{n1} \; a_{n2} \; \dots \; a_{nl}]^T
• The components of the new reduced representation a_n
are uncorrelated
• The eigenvalue λ_i corresponds to the variance of the
data projected onto q_i
(A complete NumPy sketch of steps 1-8 follows below.)
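
A compact, self-contained NumPy sketch of steps 1-8 as described on these slides; the random data and the choice l = 2 are illustrative assumptions.

```python
import numpy as np

def pca_reduce(X_raw, l):
    """Reduce N x d data to N x l using the eigen analysis of X^T X."""
    # Steps 1-2: remove the per-attribute mean and form the data matrix X
    mean = X_raw.mean(axis=0)
    X = X_raw - mean
    # Step 3: correlation matrix C = X^T X  (d x d)
    C = X.T @ X
    # Step 4: eigen analysis of the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    # Steps 5-6: sort eigenvalues in descending order, keep the l leading eigenvectors
    order = np.argsort(eigvals)[::-1]
    Q_l = eigvecs[:, order[:l]]                  # d x l
    # Steps 7-8: project every sample onto the l directions: a_n = Q_l^T x_n
    A = X @ Q_l                                  # N x l reduced representation
    return A, Q_l, mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_raw = rng.normal(size=(20, 5))             # N = 20 tuples, d = 5 attributes
    A, Q_l, mean = pca_reduce(X_raw, l=2)
    # Approximate reconstruction: x_hat_n = sum_i a_ni q_i (with the mean added back)
    X_hat = A @ Q_l.T + mean
    print(A.shape, np.linalg.norm(X_raw - X_hat))
```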


Illustration: PCA
• Atmospheric Data:
– N = number of tuples
(data vectors) = 20
– d = number of
attributes (dimensions)
= 5
• Mean of each
dimension:
23.42  93.63  1003.55  448.88  14.4

Illustration: PCA
• Step 1: Subtract the mean
from each attribute


Illustration: PCA
• Step 2: Compute the correlation matrix from the
mean-subtracted data matrix

Illustration: PCA
[Figure: eigenvalues and eigenvectors of the 5 × 5 correlation matrix]
• Step 4: Perform eigen
analysis on the
correlation matrix
– Get eigenvalues and
eigenvectors
• Step 5: Sort the
eigenvalues in
descending order
• Step 6: Arrange the
eigenvectors in the
descending order of
their corresponding
eigenvalues


Illustration: PCA
• Step 7: Consider the two leading
(significant) eigenvalues and their
corresponding eigenvectors
• Step 8: Project the mean-subtracted
data matrix onto the two selected
eigenvectors corresponding to the leading
eigenvalues (see the sketch below)
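
A usage-style sketch of these steps on a 20 × 5 matrix shaped like the atmospheric example. The individual data values are synthetic; only the per-column means loosely follow the slide's table, and l = 2 matches the two leading eigenvectors chosen above.

```python
import numpy as np

# Hypothetical 20 x 5 atmospheric-style data (the slide's actual table is not reproduced here)
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=[23.4, 93.6, 1003.6, 448.9, 14.4],
                   scale=[3.0, 5.0, 4.0, 50.0, 2.0],
                   size=(20, 5))

X = X_raw - X_raw.mean(axis=0)                 # Step 1: subtract the mean
C = X.T @ X                                    # Step 2: correlation matrix
eigvals, eigvecs = np.linalg.eigh(C)           # Step 4: eigen analysis
order = np.argsort(eigvals)[::-1]              # Steps 5-6: sort in descending order
Q_2 = eigvecs[:, order[:2]]                    # Step 7: two leading eigenvectors
A = X @ Q_2                                    # Step 8: 20 x 2 reduced representation
print(A.shape)
```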

Eigenvalues and Eigenvectors


• What happens when a
vector is multiplied by a
matrix?
A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}, \quad q = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \quad Aq = \begin{bmatrix} 7 \\ 5 \end{bmatrix}
• The vector gets transformed
into a new vector
– Its direction changes
• The vector may also get
scaled (elongated or
shortened) in the process
[Figure: the vector q = [1, 3]^T and the transformed vector Aq = [7, 5]^T plotted in the plane]


Eigenvalues and Eigenvectors


• For a given square symmetric
matrix A, there exist special
vectors which do not change
direction when multiplied by A
• These vectors are called
eigenvectors
• More formally, Aq = \lambda q
Aq = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \end{bmatrix} = 3 \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 3q
– λ is the eigenvalue
– The eigenvalue indicates the
factor by which the eigenvector
is scaled
• The vector only gets scaled
but does not change its
direction
• So what is so special about
eigenvalues and eigenvectors?
(A quick NumPy check of this example follows below.)
[Figure: the eigenvector q = [1, 1]^T and Aq = [3, 3]^T lie along the same direction in the plane]
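
A tiny NumPy check of this example; the matrix A = [[1, 2], [2, 1]] is the one on the slide.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

# Eigen analysis of the symmetric matrix A
eigvals, eigvecs = np.linalg.eigh(A)
print(eigvals)                    # [-1.  3.]  -> eigenvalues -1 and 3

q = np.array([1.0, 1.0])
print(A @ q)                      # [3. 3.] = 3 * q: q is an eigenvector with eigenvalue 3
```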

Linear Algebra: Basic Definitions


• Basis: A set of vectors ⊂ R^d is called a basis if
– those vectors are linearly independent, and
– every vector in R^d can be expressed as a linear
combination of these basis vectors
• Linearly independent vectors:
– A set of d vectors q_1, q_2, . . . , q_d is linearly independent if
no vector in the set can be expressed as a linear
combination of the remaining d - 1 vectors
– In other words, the only solution to
c_1 q_1 + c_2 q_2 + \dots + c_d q_d = 0 \;\text{ is }\; c_1 = c_2 = \dots = c_d = 0
• Here the c_i are scalars


Linear Algebra: Basic Definitions


• For example, consider the
space R^2
• Consider the vectors:
q_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad q_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
• Any vector [z_1 \; z_2]^T can be
expressed as a linear
combination of these two
vectors:
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = z_1 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + z_2 \begin{bmatrix} 0 \\ 1 \end{bmatrix}
• Further, q_1 and q_2 are linearly
independent
– The only solution to
c_1 q_1 + c_2 q_2 = 0 \;\text{ is }\; c_1 = c_2 = 0
[Figure: the unit vectors q_1 and q_2 along the coordinate axes and an arbitrary vector z = [z_1 \; z_2]^T]

Linear Algebra: Basic Definitions


• It turns out that q_1 and q_2 are
unit vectors in the directions of
the coordinate axes
• And indeed we are used to
representing all vectors in R^2 as
a linear combination of these
two vectors
[Figure: the same plot, with z = [z_1 \; z_2]^T expressed in terms of q_1 = [1 \; 0]^T and q_2 = [0 \; 1]^T]


Linear Algebra: Basic Definitions


• We could have chosen any 2
linearly independent vectors in
R^2 as the basis vectors
• For example, consider the
linearly independent vectors
q_1 = [4 \; 2]^T and q_2 = [5 \; 7]^T
• Any vector z = [z_1 \; z_2]^T can be
expressed as a linear
combination of these two
vectors:
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \lambda_1 \begin{bmatrix} 4 \\ 2 \end{bmatrix} + \lambda_2 \begin{bmatrix} 5 \\ 7 \end{bmatrix}, \quad z = \lambda_1 q_1 + \lambda_2 q_2
z_1 = 4\lambda_1 + 5\lambda_2, \quad z_2 = 2\lambda_1 + 7\lambda_2
• We can find λ_1 and λ_2 by
solving this system of linear
equations (see the sketch below)
[Figure: the basis vectors q_1 = [4 \; 2]^T and q_2 = [5 \; 7]^T plotted in the plane]
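
A quick NumPy sketch of solving this system for λ_1 and λ_2; the target vector z = [3, 5]^T is an arbitrary illustrative choice.

```python
import numpy as np

# Columns of Q are the basis vectors q_1 = [4, 2]^T and q_2 = [5, 7]^T
Q = np.array([[4.0, 5.0],
              [2.0, 7.0]])
z = np.array([3.0, 5.0])          # an arbitrary vector to express in this basis

lam = np.linalg.solve(Q, z)       # solve Q @ lam = z for [lambda_1, lambda_2]
print(lam)
print(Q @ lam)                    # reproduces z
```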

Linear Algebra: Basic Definitions


• In general, given a set of
linearly independent vectors
q_1, q_2, . . . , q_d ∈ R^d
– we can express any vector z ∈ R^d
as a linear combination of
these vectors:
z = \lambda_1 q_1 + \lambda_2 q_2 + \dots + \lambda_d q_d
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_d \end{bmatrix} = \lambda_1 \begin{bmatrix} q_{11} \\ q_{12} \\ \vdots \\ q_{1d} \end{bmatrix} + \lambda_2 \begin{bmatrix} q_{21} \\ q_{22} \\ \vdots \\ q_{2d} \end{bmatrix} + \dots + \lambda_d \begin{bmatrix} q_{d1} \\ q_{d2} \\ \vdots \\ q_{dd} \end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_d \end{bmatrix} = \begin{bmatrix} q_{11} & q_{21} & \dots & q_{d1} \\ q_{12} & q_{22} & \dots & q_{d2} \\ \vdots & \vdots & & \vdots \\ q_{1d} & q_{2d} & \dots & q_{dd} \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_d \end{bmatrix}
z = Q\lambda
[Figure: the example basis q_1 = [4 \; 2]^T and q_2 = [5 \; 7]^T in the plane]


Linear Algebra: Basic Definitions


• Let us see what happens if we
have an orthonormal basis:
q_i^T q_i = 1 \quad \text{and} \quad q_i^T q_j = 0, \; i \neq j
• We can express any vector z ∈ R^d
as a linear combination of
these vectors:
z = \lambda_1 q_1 + \lambda_2 q_2 + \dots + \lambda_d q_d
– Multiply both sides by q_1^T:
q_1^T z = \lambda_1 q_1^T q_1 + \lambda_2 q_1^T q_2 + \dots + \lambda_d q_1^T q_d = \lambda_1
• Similarly, \lambda_2 = q_2^T z, \; \dots, \; \lambda_d = q_d^T z
• An orthogonal basis is the
most convenient basis
that one can hope for (see the sketch below)
[Figure: an orthonormal basis q_1, q_2 rotated relative to the coordinate axes, with a vector z]
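
A small NumPy sketch of this shortcut: with an orthonormal basis the coefficients are just dot products, and no linear system needs to be solved. The basis and the vector z below are illustrative.

```python
import numpy as np

# An orthonormal basis of R^2 (the coordinate axes rotated by 45 degrees)
q1 = np.array([1.0, 1.0]) / np.sqrt(2)
q2 = np.array([-1.0, 1.0]) / np.sqrt(2)
z = np.array([3.0, 5.0])

lam1 = q1 @ z                     # lambda_1 = q_1^T z
lam2 = q2 @ z                     # lambda_2 = q_2^T z
print(lam1 * q1 + lam2 * q2)      # reconstructs z exactly
```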

Eigenvalues and Eigenvectors


• What does any of this have to do with eigenvectors?
• Eigenvectors can form a basis
• Theorem 1: The eigenvectors of a matrix A ∈ R^{d × d}
having distinct eigenvalues are linearly independent
• Theorem 2: The eigenvectors of a square symmetric
matrix are orthogonal
• Definition 1: Let λ_1, λ_2, . . . , λ_d be the eigenvalues of
a d × d matrix A. λ_1 is called the dominant
(significant) eigenvalue of A if |λ_1| ≥ |λ_i|, i = 1, 2, …, d
• We will put all of this to use for principal component
analysis



Principal Component Analysis (PCA)


• Each point (vector) here is
represented using a linear
combination of the x_1 and x_2
axes
• In other words, we are using p_1
and p_2 as the basis
[Figure: a scatter of data points in the (x_1, x_2) plane with the standard basis directions p_1 and p_2]

Principal Component Analysis (PCA)


• Let us consider orthonormal
vectors q_1 and q_2 as the basis
instead of p_1 and p_2
• We observe that all the points
have a very small component in
the direction of q_2 (almost
noise)
[Figure: the same scatter with rotated orthonormal directions q_1 (along the elongated spread of the data) and q_2 (perpendicular to it)]


Principal Component Analysis (PCA)


• Let us consider orthonormal
vectors q_1 and q_2 as the basis
instead of p_1 and p_2
• We observe that all the points
have a very small component in
the direction of q_2 (almost
noise)
[Figure: the same scatter with the rotated basis q_1, q_2]
• Now the same data can be represented in 1 dimension,
in the direction of q_1, by making a smarter choice for
the basis
• Why do we not care about q_2?
– The variance of the data in this direction is very small
– All data points have almost the same value in the q_2
direction

Principal Component Analysis (PCA)


• If we were to build a classifier
on top of this data, then q_2
would not contribute to the
classifier
– The points are not
distinguishable along this
direction
[Figure: the same scatter with the rotated basis q_1, q_2]
• In general, we are interested in representing the data
using fewer dimensions such that
– the data has high variance along these dimensions
– the dimensions are linearly independent (uncorrelated)
• PCA preserves the geometrical locality of the
transformed data with respect to the original data


PCA: Basic Procedure


• Given: Data with N samples, D = \{x_n\}_{n=1}^{N}, \; x_n \in \mathbb{R}^d
1. Remove the mean of each attribute (dimension) from the
data samples (tuples)
2. Then construct a data matrix X using the mean-subtracted
samples, X \in \mathbb{R}^{N \times d}
– Each row of the matrix X corresponds to one sample
(tuple)
3. Compute the correlation matrix C = X^T X
4. Perform the eigen analysis of the correlation matrix C:
C q_i = \lambda_i q_i, \quad i = 1, 2, \dots, d
– As the correlation matrix is symmetric and positive
semidefinite,
• each eigenvalue λ_i is real and non-negative,
• the eigenvectors q_i corresponding to the eigenvalues form an
orthonormal set, and
• the eigenvalues indicate the variance or strength along the
corresponding eigenvectors

PCA for Dimension Reduction


• In general, we are interested in representing the data
using fewer dimensions such that the data has high
variance along these dimensions
5. Rank order the eigenvalues (λ_i's) in sorted order such
that
\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d
6. Consider the l (l << d) eigenvectors corresponding to
the l significant eigenvalues
7. Project x_n onto each of the l directions
(eigenvectors) to get the reduced-dimensional
representation:
a_{ni} = q_i^T x_n, \quad i = 1, 2, \dots, l


PCA for Dimension Reduction


8. Thus, each training example x_n is transformed to a
new reduced-dimensional representation a_n by
projecting it onto the l orthonormal basis vectors:
x_n = [x_{n1} \; x_{n2} \; \dots \; x_{nd}]^T \;\longrightarrow\; a_n = [a_{n1} \; a_{n2} \; \dots \; a_{nl}]^T
• The components of the new reduced representation a_n
are uncorrelated
• The eigenvalue λ_i corresponds to the variance of the
data projected onto q_i

Illustration: PCA
• Handwritten Digit Image [1]:
– Size of each image: 28 x 28
– Dimension after linearizing: 784
– Total number of training examples: 5000 (500 per class)

[1] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning
Applied to Document Recognition," Intelligent Signal Processing, 306-351,
IEEE Press, 2001.


Illustration: PCA
• Handwritten Digit Image:
– All 784 Eigenvalues


Illustration: PCA
• Handwritten Digit Image:
– Leading 100 Eigenvalues



Illustration: PCA-Reconstructed Images


[Figure: an original digit image and its PCA reconstructions using l = 1, l = 20 and l = 100 principal components]
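
A hedged sketch of how such reconstructions could be produced with NumPy. The image array here is random noise standing in for the 5000 x 784 digit data, since the actual dataset is not included in these slides.

```python
import numpy as np

# Stand-in for the digit data: 5000 images of 28 x 28 = 784 pixels (random here)
rng = np.random.default_rng(0)
X_raw = rng.random((5000, 784))

mean = X_raw.mean(axis=0)
X = X_raw - mean
eigvals, eigvecs = np.linalg.eigh(X.T @ X)       # eigen analysis of the 784 x 784 matrix
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]  # leading eigenvectors first

for l in (1, 20, 100):
    Q_l = eigvecs[:, :l]                         # 784 x l leading eigenvectors
    A = X @ Q_l                                  # 5000 x l reduced representation
    X_hat = A @ Q_l.T                            # reconstruction (mean-subtracted)
    err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
    images = (X_hat + mean).reshape(-1, 28, 28)  # add the mean back and reshape for display
    print(l, round(err, 3))                      # relative error shrinks as l grows
```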


