
Dimensionality Reduction

Curse of Dimensionality

• Increasing the number of features will not always improve classification accuracy.

• In practice, the inclusion of more features might actually lead to worse performance.

• The number of training examples required increases exponentially with the dimensionality D (i.e., k^D, where k is the number of bins per feature).

[Figure: partitioning each feature into k=3 bins gives 3^1, 3^2, and 3^3 total bins for D = 1, 2, and 3 features.]
Dimensionality Reduction

• What is the objective?
− Choose an optimum set of features d* of lower dimensionality to improve classification accuracy.

• Different methods can be used to reduce dimensionality:
− Feature extraction
− Feature selection
Dimensionality Reduction (cont’d)

• Feature extraction: computes a new set of features from the original features through some transformation f(); f() could be linear or non-linear:

  x = [x1, x2, ..., xD]^T  →  y = f(x) = [y1, y2, ..., yK]^T,  K << D

• Feature selection: chooses a subset of the original features:

  x = [x1, x2, ..., xD]^T  →  [x_i1, x_i2, ..., x_iK]^T,  K << D
Feature Extraction

• Linear transformations are particularly attractive because they are simpler to compute and analytically tractable.

• Given x ∈ R^D, find a K x D matrix T such that:

  y = Tx ∈ R^K,  where K << D

  This is a projection transformation from D dimensions to K dimensions; each new feature yi is a linear combination of the original features xi.
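A minimal numpy sketch of such a linear projection (the matrix T here is random, purely to illustrate the shapes involved; PCA and LDA below choose T in principled ways):

import numpy as np

D, K = 10, 3                      # original and reduced dimensionality
rng = np.random.default_rng(0)

x = rng.normal(size=D)            # a single sample x in R^D
T = rng.normal(size=(K, D))       # a K x D projection matrix (placeholder, not optimized)

y = T @ x                         # y = Tx in R^K; each y_i is a linear combination of the x_i
print(y.shape)                    # (3,)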
Feature Extraction (cont’d)

• From a mathematical point of view, finding an optimum mapping y = f(x) can be formulated as an optimization problem (i.e., minimize or maximize an objective criterion).

• Commonly used objective criteria:

− Minimize Information Loss: the projection onto the lower-dimensional space preserves as much information in the data as possible.

− Maximize Discriminatory Information: the projection onto the lower-dimensional space increases class separability.
Feature Extraction (cont’d)

• Popular linear feature extraction methods:
− Principal Components Analysis (PCA): seeks a projection that minimizes information loss.
− Linear Discriminant Analysis (LDA): seeks a projection that maximizes discriminatory information.

• Many other methods:
− Making features as independent as possible (Independent Component Analysis).
− Retaining interesting directions (Projection Pursuit).
− Embedding to lower dimensional manifolds (Isomap, Locally Linear Embedding).
Vector Representation

• A vector x ∈ R^D can be represented by D components:

  x = [x1, x2, ..., xD]^T

• Assuming the standard basis <v1, v2, ..., vD> (i.e., unit vectors in each dimension), xi can be obtained by projecting x along the direction of vi:

  xi = (x^T vi) / (vi^T vi) = x^T vi

  The xi are called projection coefficients.

• x can be “reconstructed” from its D projection coefficients:

  x = Σ_{i=1}^{D} xi vi = x1 v1 + x2 v2 + ... + xD vD

  The xi are called expansion coefficients in this context.

• In summary, the components of a vector are associated with a specific basis; if the basis is changed, the components change too (essentially a change of coordinate systems).
Vector Representation (cont’d)

• Example assuming D=2:  x = [x1, x2]^T = [3, 4]^T

• Assuming the standard basis <v1 = i, v2 = j>, xi can be obtained by projecting x along the direction of vi:

  x1 = x^T i = [3 4] [1 0]^T = 3
  x2 = x^T j = [3 4] [0 1]^T = 4

• x can be “reconstructed” from its projection coefficients as follows:

  x = 3i + 4j
PCA – Main Idea

• Any x ∈ R^D can be written as a linear combination of an orthonormal set of D basis vectors <v1, v2, ..., vD>, vi ∈ R^D (e.g., using the standard basis):

  x = Σ_{i=1}^{D} xi vi = x1 v1 + x2 v2 + ... + xD vD

  where vi^T vj = 1 if i = j and 0 otherwise, and xi = (x^T vi)/(vi^T vi) = x^T vi

• PCA seeks to approximate x in a subspace of R^D using a new set of K<<D basis vectors <u1, u2, ..., uK>, ui ∈ R^D:

  x̂ = Σ_{i=1}^{K} yi ui = y1 u1 + y2 u2 + ... + yK uK   (reconstruction)

  where ui^T uj = 1 if i = j and 0 otherwise, and yi = (x^T ui)/(ui^T ui) = x^T ui

  such that ||x − x̂|| is minimized (i.e., information loss is minimized).
Principal Component Analysis (PCA)

• The “optimal” set of basis vectors <u1, u2, ..., uK> can be found as follows (we’ll explain the details later):

  (1) Find the eigenvectors ui of the covariance matrix Σx of the (training) data (i.e., typically, D distinct eigenvectors):

      Σx ui = λi ui   (reminder: the ui form an orthogonal basis)

  (2) Choose the K “largest” eigenvectors ui (i.e., those corresponding to the K “largest” eigenvalues λi).

  <u1, u2, ..., uK> form the “optimal” basis!

  We refer to the “largest” eigenvectors ui as principal components.

  http://people.whitman.edu/~hundledr/courses/M350F14/M350/BestBasisRedone.pdf
PCA - Steps

• Suppose we are given M vectors x1, x2, ..., xM, each of size D x 1
  (D: number of features, M: number of data points)

Step 1: compute the sample mean

  x̄ = (1/M) Σ_{i=1}^{M} xi

Step 2: subtract the sample mean (i.e., center the data at zero)

  Φi = xi − x̄

Step 3: compute the sample covariance matrix Σx

  Σx = (1/M) Σ_{i=1}^{M} (xi − x̄)(xi − x̄)^T = (1/M) Σ_{i=1}^{M} Φi Φi^T = (1/M) A A^T

  where A = [Φ1 Φ2 ... ΦM] is a D x M matrix (i.e., the columns of A are the Φi).
PCA - Steps

Step 4: compute the eigenvalues/eigenvectors of Σx

  Σx ui = λi ui,  where we assume λ1 ≥ λ2 ≥ ... ≥ λD

  Note: most software packages return the eigenvalues (and corresponding eigenvectors) in decreasing order – if not, you should explicitly put them in this order.

  Since Σx is symmetric, <u1, u2, ..., uD> form an orthogonal basis in R^D; therefore, we can represent any x ∈ R^D as:

  x − x̄ = Σ_{i=1}^{D} yi ui = y1 u1 + y2 u2 + ... + yD uD

  where yi = ((x − x̄)^T ui)/(ui^T ui) = (x − x̄)^T ui if ||ui|| = 1

  The yi are called eigen-coefficients; this is just a “change” of basis!

  Note: most software packages normalize ui to unit length to simplify calculations; if not, you should explicitly normalize them.
PCA - Steps

Step 5: dimensionality reduction step – approximate x by x̂ using only the first K eigenvectors (K<<D), i.e., those corresponding to the K largest eigenvalues (K is a parameter):

  x − x̄ = Σ_{i=1}^{D} yi ui = y1 u1 + y2 u2 + ... + yD uD   (change of basis)

  approximate x by x̂ using the first K eigenvectors:

  x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1 u1 + y2 u2 + ... + yK uK   (reconstruction)

  The representation [y1, y2, ..., yD]^T is replaced by [y1, y2, ..., yK]^T (dimensionality reduction).

  K << D; note that if K = D, then x̂ = x (i.e., zero reconstruction error).
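Steps 1-5 can be sketched in a few lines of numpy (a sketch, not a reference implementation; np.linalg.eigh is used because Σx is symmetric, and the example data are synthetic):

import numpy as np

def pca(X, K):
    """X: M x D data matrix (one sample per row); returns mean, top-K eigenvectors, eigenvalues, coefficients."""
    M, D = X.shape
    x_bar = X.mean(axis=0)                     # Step 1: sample mean
    Phi = X - x_bar                            # Step 2: center the data
    Sigma_x = (Phi.T @ Phi) / M                # Step 3: covariance (1/M) A A^T, with A = Phi^T
    lam, U = np.linalg.eigh(Sigma_x)           # Step 4: eigenvalues/eigenvectors (returned in ascending order)
    order = np.argsort(lam)[::-1]              # sort in decreasing order of eigenvalue
    lam, U = lam[order], U[:, order]
    Y = Phi @ U[:, :K]                         # Step 5: eigen-coefficients y_i = u_i^T (x - x_bar)
    return x_bar, U[:, :K], lam, Y

# usage: reconstruct x_hat = x_bar + sum_i y_i u_i
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # 100 synthetic samples, D = 5
x_bar, U_K, lam, Y = pca(X, K=2)
X_hat = x_bar + Y @ U_K.T                      # reconstruction from the K coefficients
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))           # average squared reconstruction error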
What is the Linear Transformation implied by PCA?

• The linear transformation y = Tx associated with PCA can be found as follows:

  x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1 u1 + y2 u2 + ... + yK uK

  In matrix form:  x̂ − x̄ = U [y1, y2, ..., yK]^T,  where U = [u1 u2 ... uK] is a D x K matrix
  (i.e., the columns of U are the first K eigenvectors of Σx)

  Therefore:  [y1, y2, ..., yK]^T = U^T (x̂ − x̄),  i.e., T = U^T is a K x D matrix
  (i.e., the rows of T are the first K eigenvectors of Σx)
What is the form of Σy ?

  Σx = (1/M) Σ_{i=1}^{M} (xi − x̄)(xi − x̄)^T = (1/M) Σ_{i=1}^{M} Φi Φi^T

• Using diagonalization:  Σx = P Λ P^T
  (the columns of P are the eigenvectors of Σx; the diagonal elements of Λ are the eigenvalues of Σx, i.e., the variances)

• The projected data are yi = U^T (xi − x̄) = P^T Φi (taking K = D, so that U = P), and ȳ = 0; therefore:

  Σy = (1/M) Σ_{i=1}^{M} (yi − ȳ)(yi − ȳ)^T
     = (1/M) Σ_{i=1}^{M} yi yi^T
     = (1/M) Σ_{i=1}^{M} (P^T Φi)(P^T Φi)^T
     = P^T ( (1/M) Σ_{i=1}^{M} Φi Φi^T ) P
     = P^T Σx P = P^T (P Λ P^T) P = Λ

  Σy = Λ  →  PCA de-correlates the data and preserves the original variances!
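A quick numerical check of this result (a self-contained sketch on synthetic data, projecting onto all D eigenvectors so that U = P):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data, M = 500, D = 4

Phi = X - X.mean(axis=0)
Sigma_x = Phi.T @ Phi / len(X)
lam, P = np.linalg.eigh(Sigma_x)             # Sigma_x = P Lambda P^T

Y = Phi @ P                                  # project onto all D eigenvectors (K = D)
Sigma_y = Y.T @ Y / len(Y)                   # covariance of the projected data

# Sigma_y is (numerically) diagonal, with the eigenvalues of Sigma_x on its diagonal:
print(np.allclose(Sigma_y, np.diag(lam), atol=1e-10))      # True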
Interpretation of PCA

• PCA chooses the eigenvectors corresponding to the largest eigenvalues.
• The eigenvalues correspond to the variance of the data along the eigenvector directions.
• Therefore, PCA projects the data along the directions where the data varies most.
• PCA preserves as much information as possible in the data by preserving as much variance as possible.

[Figure: u1 is the direction of maximum variance; u2 is orthogonal to u1.]
Example

• Compute the PCA of the following dataset:

  (1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)

• Compute the sample covariance matrix:

  Σ̂ = (1/n) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)^T

• The eigenvalues can be computed by finding the roots of the characteristic polynomial:

  det(Σ̂ − λI) = 0
Example (cont’d)

• The eigenvectors are the solutions of:

  Σx ui = λi ui

  Note: if ui is a solution, then c·ui is also a solution, for any c ≠ 0.

• Eigenvectors are typically normalized to have unit length:

  ûi = ui / ||ui||
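For the small 2-D dataset above, the covariance matrix, eigenvalues, and unit-length eigenvectors can be checked with a short numpy sketch (np.linalg.eigh already returns unit-length eigenvectors):

import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

mu = X.mean(axis=0)                          # sample mean (5, 5)
Phi = X - mu
Sigma = Phi.T @ Phi / len(X)                 # sample covariance (1/n convention, as above)

lam, U = np.linalg.eigh(Sigma)               # eigenvalues (ascending) and unit-length eigenvectors
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

print("covariance:\n", Sigma)
print("eigenvalues:", lam)
print("principal direction u1:", U[:, 0])    # direction of maximum variance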
How should we choose K ?

• K is typically chosen based on how much information (variance) we want to preserve in the data:

  Choose the smallest K that satisfies the following inequality:

  ( Σ_{i=1}^{K} λi ) / ( Σ_{i=1}^{D} λi ) > T,  where T is a threshold (e.g., 0.9)

• If T=0.9, for example, K is chosen to “preserve” 90% of the information (variance) in the data.

• If K=D, then we “preserve” 100% of the information in the data (i.e., just a “change” of basis, and x̂ = x).
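A sketch of this rule in numpy (it assumes the eigenvalues are already sorted in decreasing order, as in the PCA steps above; the example eigenvalues are made up):

import numpy as np

def choose_k(eigvals, T=0.9):
    """Smallest K whose leading eigenvalues capture more than a fraction T of the total variance."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.argmax(ratios > T) + 1)        # first K with cumulative ratio above T

# example: with these (made-up) eigenvalues, preserving 90% of the variance needs K = 2 components
print(choose_k(np.array([6.0, 3.5, 0.3, 0.2]), T=0.9))     # 2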
Approximation or Reconstruction Error

• The approximation error (or reconstruction error) can be computed by:

  ||x − x̂||,  where x̂ = x̄ + Σ_{i=1}^{K} yi ui   (reconstruction)

• It can also be computed from the discarded eigenvalues:

  ||x − x̂|| = (1/2) Σ_{i=K+1}^{D} λi
Data Normalization

• The principal components depend both on the units used to measure the original variables (i.e., features) and on the range of values they assume.

• If different units and/or ranges are involved, features should always be normalized prior to applying PCA.

• A common normalization method is to transform all the features to have zero mean and unit standard deviation:

  (xi − μ) / σ,  where μ and σ are the mean and standard deviation of the i-th feature xi
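A sketch of this normalization in numpy (the small epsilon guarding against constant features is an implementation detail, not part of the slide):

import numpy as np

def zscore(X, eps=1e-12):
    """Normalize each feature (column) of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

X = np.array([[180.0, 70000.0], [165.0, 52000.0], [172.0, 61000.0]])   # e.g., height (cm) and income ($)
print(zscore(X).std(axis=0))       # each column now has (approximately) unit standard deviation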
Application to Images

• Goal: represent images in a space of lower dimensionality using PCA.
  − Useful for various applications, e.g., face recognition, image compression, etc.

• Given M images of size N x N, first represent each image as an N² x 1 vector (i.e., by stacking its rows together).

  The number of features is then D = N².
Application to Images (cont’d)

• The key challenge is that the covariance matrix Σx is now very large – here is Step 3 again:

  Step 3: compute the covariance matrix Σx

    Σx = (1/M) Σ_{i=1}^{M} Φi Φi^T = (1/M) A A^T,  where A = [Φ1 Φ2 ... ΦM] is an N² x M matrix

• Σx is now an N² x N² matrix – it is computationally expensive to compute its eigenvalues/eigenvectors λi, ui:

  (A A^T) ui = λi ui
Application to Images (cont’d)

• We will use a simple “trick” to get around this by relating the eigenvalues/eigenvectors of A A^T to those of A^T A.

• A^T A is an M x M matrix (i.e., typically much smaller):

  − Suppose its eigenvalues/eigenvectors are μi, vi:  (A^T A) vi = μi vi

  − Multiply both sides by A:  A (A^T A) vi = A μi vi,  or  (A A^T)(A vi) = μi (A vi)

  − Since (A A^T) ui = λi ui, it follows that λi = μi and ui = A vi

  (recall A = [Φ1 Φ2 ... ΦM] is an N² x M matrix)
Application to Images (cont’d)

• Do A A^T and A^T A have the same number of eigenvalues and eigenvectors?

  − A A^T can have up to N² eigenvalues/eigenvectors.
  − A^T A can have up to M eigenvalues/eigenvectors.
  − It turns out that the M eigenvalues/eigenvectors of A^T A correspond to the M largest eigenvalues/eigenvectors of A A^T.

• Steps 3-5 of PCA need to be updated as follows:
Application to Images (cont’d)

Step 3: compute A^T A (i.e., instead of A A^T)

Step 4a: compute the eigenvalues/eigenvectors μi, vi of A^T A

Step 4b: obtain the λi, ui of A A^T using λi = μi and ui = A vi; then normalize ui to unit length.

Step 5: dimensionality reduction step – approximate x using only the “largest” K eigenvectors (K << M):

  x − x̄ = Σ_{i=1}^{M} yi ui = y1 u1 + y2 u2 + ... + yM uM

  approximate x by x̂ using the K largest eigenvectors:

  x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1 u1 + y2 u2 + ... + yK uK

  The representation [y1, ..., yM]^T is replaced by [y1, ..., yK]^T.

  K << M; note that if K = M, then x̂ ≈ x (i.e., close to zero reconstruction error).
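A sketch of the updated steps in numpy (the A^T A “trick”): the eigen-decomposition is done on the small M x M matrix, and the eigenvectors of A A^T are recovered as ui = A vi and normalized. The image size, number of images, and random data are made up for illustration:

import numpy as np

rng = np.random.default_rng(3)
N, M, K = 32, 20, 8                              # image size N x N, M training images, K << M
X = rng.random((M, N * N))                       # each row: one image flattened to N^2 values

x_bar = X.mean(axis=0)
A = (X - x_bar).T                                # A = [Phi_1 ... Phi_M], an N^2 x M matrix

# Step 3: work with the small M x M matrix A^T A instead of the N^2 x N^2 matrix A A^T
mu, V = np.linalg.eigh(A.T @ A)                  # Step 4a: eigenvalues/eigenvectors of A^T A
order = np.argsort(mu)[::-1]
mu, V = mu[order], V[:, order]

U = A @ V[:, :K]                                 # Step 4b: u_i = A v_i (eigenvectors of A A^T)
U /= np.linalg.norm(U, axis=0)                   # normalize each u_i to unit length

Y = (X - x_bar) @ U                              # Step 5: K eigen-coefficients per image
X_hat = x_bar + Y @ U.T                          # reconstruction from the K "eigenfaces"
print(U.shape, Y.shape)                          # (1024, 8) (20, 8)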
Example

[Figure: the training dataset of face images.]
Example (cont’d)

The K largest eigenvectors u1, ..., uK, visualized as images, are called “eigenfaces”.

[Figure: the eigenfaces u1, u2, u3 and the mean face x̄.]
Example (cont’d)

• How can you visualize an eigenvector v = [x1, x2, ..., xD]^T as a PGM image?

  − Need to map its values xi to integer values yi in the interval [0, 255] (i.e., as required by the PGM format).

  − Suppose fmin and fmax are the min/max values of v (note that they could be negative).

  − The following transformation achieves the desired mapping, i.e., [fmin, fmax] → [0, 255]:

    yi = (int) 255 (xi − fmin) / (fmax − fmin)
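A sketch of this mapping in Python (it writes a binary “P5” PGM; the image side length N and the file name are assumptions):

import numpy as np

def eigenvector_to_pgm(v, N, path):
    """Map an eigenvector v (length N*N, possibly with negative values) to [0, 255] and save it as a PGM image."""
    f_min, f_max = v.min(), v.max()
    y = (255 * (v - f_min) / (f_max - f_min)).astype(np.uint8)    # [f_min, f_max] -> [0, 255]
    with open(path, "wb") as f:
        f.write(f"P5\n{N} {N}\n255\n".encode("ascii"))            # PGM header
        f.write(y.reshape(N, N).tobytes())                        # pixel data, row by row

# usage (hypothetical 32 x 32 eigenvector):
eigenvector_to_pgm(np.random.default_rng(0).normal(size=32 * 32), 32, "eigenface.pgm")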
Application to Images (cont’d)

• Interpretation: approximate a face image using eigenfaces.

  The K largest eigenvectors u1, ..., uK act as basis vectors (eigenfaces):

  x̂ = Σ_{i=1}^{K} yi ui + x̄ = y1 u1 + y2 u2 + ... + yK uK + x̄

  where y1, y2, ..., yK are the eigen-coefficients, i.e., x̂ − x̄ is represented by [y1, y2, ..., yK]^T.

  [Figure: a face image approximated as a weighted sum of eigenfaces plus the mean face x̄.]
Case Study: Eigenfaces for Face Detection/Recognition

− M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

• Face Recognition
  − The simplest approach is to think of it as a template matching problem.
  − Problems arise when performing recognition in a high-dimensional space.
  − Use PCA for dimensionality reduction!
Training Phase

• Given a set of face images from a group of people in an image database (each person could have one or more images), perform the following steps:

  − Compute the PCA space (the K largest eigenvectors, or eigenspace) using the image database (i.e., the training data).

  − Represent each image i in the database with a vector Ωi of K eigen-coefficients:

    Ωi = [y1, y2, ..., yK]^T

  Note: faces must be centered and of the same size.
Recognition Phase

Given an unknown face x, apply the following steps:

Step 1: Subtract the mean face x̄ (computed from the training data):  Φ = x − x̄

Step 2: Project the unknown face onto the eigenspace to obtain its eigen-coefficients Ω = [y1, y2, ..., yK]^T:

  yi = Φ^T ui,  i = 1, 2, ..., K

Step 3: Find the closest match Ωp from the training set:

  p = arg min_i ||Ω − Ωi||,  i = 1, 2, ..., M

  where er = min_i ||Ω − Ωi|| is computed using either the Euclidean distance
  Σ_{j=1}^{K} (yj − yj^i)²  or the Mahalanobis distance  Σ_{j=1}^{K} (1/λj)(yj − yj^i)².

  The distance er is called distance in face space (difs).

Step 4: Recognize x as the person associated with the ID of Ωp.

  Note: for intruder detection, we also impose er < Tr, for some threshold Tr.
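A sketch of the recognition phase in numpy. It assumes the training phase produced the mean face x_bar, the eigenvector matrix U (N² x K), the stored eigen-coefficient vectors Omegas (M x K) with their person ids, and optionally the eigenvalues lam for the Mahalanobis-style distance; all of these names are illustrative:

import numpy as np

def recognize(x, x_bar, U, Omegas, ids, lam=None, T_r=None):
    """Project an unknown face x and return the ID of the closest training face (or None for an intruder)."""
    Omega = U.T @ (x - x_bar)                       # Steps 1-2: eigen-coefficients of the unknown face
    diff = Omegas - Omega                           # Step 3: compare against all M stored faces
    if lam is None:
        d2 = np.sum(diff ** 2, axis=1)              # squared Euclidean distance (difs)
    else:
        d2 = np.sum(diff ** 2 / lam, axis=1)        # Mahalanobis-style distance, weighting by 1/lambda_j
    p = int(np.argmin(d2))
    e_r = np.sqrt(d2[p])
    if T_r is not None and e_r >= T_r:              # optional intruder test: reject if too far
        return None, e_r
    return ids[p], e_r                              # Step 4: ID of the closest match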
Face Detection vs. Recognition

[Figure: detection finds where the faces are; recognition identifies whose face it is (e.g., “Sally”).]
Face Detection Using Eigenfaces

Given an unknown image x, follow these steps:

Step 1: Subtract the mean face x̄ (computed from the training data):  Φ = x − x̄

Step 2: Project the unknown face onto the eigenspace and reconstruct it using the K projections:

  yi = Φ^T ui,  i = 1, 2, ..., K        Φ̂ = Σ_{i=1}^{K} yi ui   (reconstruction)

Step 3: Compute ed = ||Φ − Φ̂||

  The distance ed is called distance from face space (dffs).

Step 4: If ed < Td, then x is a face.
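A sketch of the detection test in numpy (x_bar and U as produced by a training step like the one sketched earlier; T_d is a user-chosen threshold):

import numpy as np

def dffs(x, x_bar, U):
    """Distance from face space: reconstruction error of x using the eigenface basis U (N^2 x K)."""
    Phi = x - x_bar                       # Step 1: subtract the mean face
    y = U.T @ Phi                         # Step 2: project onto the K eigenfaces...
    Phi_hat = U @ y                       # ...and reconstruct
    return np.linalg.norm(Phi - Phi_hat)  # Step 3: e_d = ||Phi - Phi_hat||

def is_face(x, x_bar, U, T_d):
    return dffs(x, x_bar, U) < T_d        # Step 4: accept as a face if e_d < T_d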
Why does this work?

[Figure: several input images Φ and their reconstructions Φ̂.]

• The reconstruction Φ̂ of any input looks like a face, since it is a linear combination of eigenfaces.
• For a non-face input, however, the reconstruction differs substantially from the input, i.e., the dffs ed = ||Φ − Φ̂|| is large.
Face Detection Using Eigenfaces

We can use the dffs, ed = ||Φ − Φ̂||, to find faces in an image:

1) Compute ed at each image location.
2) Pick the location with the lowest distance (red circle).

[Figure: visualization of the dffs as an image – dark: small ed value; bright: large ed value.]
Limitations (cont’d)

• PCA is not always an optimal dimensionality-reduction technique for classification purposes.
Linear Discriminant Analysis (LDA)

• What is the goal of LDA?
  − Seeks to find directions along which the classes are best separated (i.e., to find more discriminatory features).
  − It takes into consideration the scatter (i.e., variance) within classes and between classes.

[Figure: two candidate projection directions – one gives bad class separability, the other gives good separability.]
Linear Discriminant Analysis (LDA) (cont’d)

• Let us assume C classes, with the i-th class containing Mi samples, i=1,2,...,C (each of dimensionality D), and with M being the total number of samples:

  M = Σ_{i=1}^{C} Mi

• Let μi be the mean of the i-th class, i=1,2,...,C, and μ be the mean of the whole dataset:

  μ = (1/C) Σ_{i=1}^{C} μi

• Within-class scatter matrix:

  Sw = Σ_{i=1}^{C} Σ_{j=1}^{Mi} (xj − μi)(xj − μi)^T   (the inner sum is over the samples xj of class i)

• Between-class scatter matrix:

  Sb = Σ_{i=1}^{C} (μi − μ)(μi − μ)^T
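A sketch of these two scatter matrices in numpy, following the definitions above (X is an M x D data matrix, labels holds the class index of each row):

import numpy as np

def scatter_matrices(X, labels):
    """Within-class (Sw) and between-class (Sb) scatter matrices, as defined on this slide."""
    classes = np.unique(labels)
    D = X.shape[1]
    class_means = np.array([X[labels == c].mean(axis=0) for c in classes])
    mu = class_means.mean(axis=0)                      # mean of the class means, as above

    Sw = np.zeros((D, D))
    for c, mu_c in zip(classes, class_means):
        Xi = X[labels == c] - mu_c
        Sw += Xi.T @ Xi                                # sum_j (x_j - mu_i)(x_j - mu_i)^T

    diffs = class_means - mu
    Sb = diffs.T @ diffs                               # sum_i (mu_i - mu)(mu_i - mu)^T
    return Sw, Sb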
Linear Discriminant Analysis (LDA) (cont’d)

• Suppose the desired projection transformation is:

  y = U^T x

• Let S̃b and S̃w be the scatter matrices of the projected data y.

• LDA seeks a transformation U that maximizes the between-class scatter and minimizes the within-class scatter:

  max |S̃b| / |S̃w|,  or equivalently,  max |U^T Sb U| / |U^T Sw U|

• What is the solution U to the above optimization problem?
Linear Discriminant Analysis (LDA) (cont’d)

• It turns out that the columns of U are the eigenvectors (called Fisherfaces) corresponding to the largest eigenvalues of the following generalized eigen-problem:

  Sb uk = λk Sw uk

• It can be shown that Sb has rank at most C−1; therefore, the maximum number of eigenvectors with non-zero eigenvalues is C−1, which implies that:

  the maximum dimensionality of the LDA sub-space is C−1

  e.g., when C=2, we always end up with one LDA feature, no matter what the original number of features D was!
Example

[Figure: LDA projection example.]
Linear Discriminant Analysis (LDA) (cont’d)

• If Sw is non-singular, we can instead solve a conventional eigenvalue problem:

  Sb uk = λk Sw uk   →   Sw⁻¹ Sb uk = λk uk

• In practice, Sw is singular due to the high dimensionality of the data (e.g., images) and the relatively low number of samples.
Linear Discriminant Analysis (LDA) (cont’d)

• To alleviate this problem, PCA could be applied first (see the sketch below):

  1) First, apply PCA to reduce the data dimensionality:

     x = [x1, ..., xD]^T  --PCA-->  y = [y1, ..., yD′]^T,  D′ < D

  2) Then, apply LDA to find the most discriminative directions:

     y = [y1, ..., yD′]^T  --LDA-->  z = [z1, ..., zK]^T,  K < D′
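A sketch of this PCA-then-LDA pipeline using scikit-learn (assuming it is available; the Iris data and the intermediate dimensionality D′ = 3 are only illustrative stand-ins for image data):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, labels = load_iris(return_X_y=True)        # stand-in dataset: D = 4 features, C = 3 classes

# 1) PCA: reduce dimensionality from D to D' (here D' = 3), so that Sw becomes better conditioned
X_pca = PCA(n_components=3).fit_transform(X)

# 2) LDA: find the most discriminative directions; at most C - 1 = 2 components are possible
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_pca, labels)

print(X_pca.shape, X_lda.shape)               # (150, 3) (150, 2)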
Case Study I

− D. Swets, J. Weng, "Using Discriminant Eigenfeatures for Image Retrieval", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 831-836, 1996.

• Content-based image retrieval:
  − Application: query-by-example content-based image retrieval.
  − Question: how should we select a good set of features?
Case Study I (cont’d)

• Assumptions
  − Well-framed images are required as input for training and query-by-example test probes.
  − Only a small variation in the size, position, and orientation of the objects in the images is allowed.
Case Study I (cont’d)

• Terminology
  − Most Expressive Features (MEF): obtained using PCA.
  − Most Discriminating Features (MDF): obtained using LDA.

• Numerical instabilities
  − Computing the eigenvalues/eigenvectors of Sw⁻¹ Sb uk = λk uk could lead to unstable computations since Sw⁻¹ Sb is not always symmetric.
  − Check the paper for more details about how to deal with this issue.
Case Study I (cont’d)

• Comparing projection directions between MEF and MDF:
  − PCA eigenvectors show the tendency of PCA to capture major variations in the training set, such as lighting direction.
  − LDA eigenvectors discount those factors unrelated to classification.
Case Study I (cont’d)

• Clustering effect

[Figure: the same data plotted in the PCA space vs. the LDA space.]
Case Study I (cont’d)

• Methodology
  1) Represent each training image in terms of MDFs (or MEFs for comparison).
  2) Represent a query image in terms of MDFs (or MEFs for comparison).
  3) Find the r closest neighbors (e.g., using Euclidean distance).
Case Study I (cont’d)

• Experiments and results
  − A set of face images was used, with 2 expressions and 3 lighting conditions.
  − Testing was performed using a disjoint set of images.

[Figure: sample face images.]
Case Study I (cont’d)

[Figure: top match (r=1) results.]
Case Study I (cont’d)

− Examples of correct search probes.
Case Study I (cont’d)

− Example of a failed search probe.
Case Study II

− A. Martinez, A. Kak, "PCA versus LDA", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.

• Is LDA always better than PCA?
  − There has been a tendency in the computer vision community to prefer LDA over PCA.
  − This is mainly because LDA deals directly with discrimination between classes, while PCA does not pay attention to the underlying class structure.
Case Study II (cont’d)

[Figure: sample images from the AR database.]
Case Study II (cont’d)

• LDA is not always better when the training set is small.

  PCA w/o 3: PCA without the first three principal components, which seem to encode mostly variations due to lighting.

[Figure: recognition results for PCA, PCA w/o 3, and LDA with a small training set.]
Case Study II (cont’d)

• LDA outperforms PCA when the training set is large.

  PCA w/o 3: PCA without the first three principal components, which seem to encode mostly variations due to lighting.

[Figure: recognition results for PCA, PCA w/o 3, and LDA with a large training set.]
