
Dimensionality Reduction

Section 3.8 (Duda et al.)

CS479/679 Pattern Recognition


Dr. George Bebis
Curse of Dimensionality

• Increasing the number of features will not always improve
classification accuracy.

• In practice, the inclusion of more features might actually lead to
worse performance.

• The number of training examples required increases exponentially
with dimensionality d (i.e., k^d, where k is the number of bins per feature).

[Figure: partitioning each feature into k=3 bins yields 3^1 bins in 1D,
3^2 bins in 2D, and 3^3 bins in 3D.]
Dimensionality Reduction
• What is the objective?
− Choose an optimum set of features of lower
dimensionality to improve classification accuracy.

• Different methods can be used to reduce dimensionality:
− Feature extraction
− Feature selection

3
Dimensionality Reduction (cont’d)

Feature extraction: finds a set of K << N new features (i.e., through
some mapping f()) from the N existing features. The mapping f() could be
linear or non-linear.

Feature selection: chooses a subset of K << N of the original features.
4
Feature Extraction
• Linear combinations are particularly attractive because
they are simpler to compute and analytically tractable.

• Given x ∈ R^N, find a K x N matrix T such that:

y = Tx ∈ R^K where K << N

This is a projection from the N-dimensional space to
a K-dimensional space.

5
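A minimal NumPy sketch of such a linear projection (illustrative only; the random T below is just a placeholder for the specific matrix a method such as PCA or LDA would produce):

```python
import numpy as np

N, K = 10, 3                      # original and reduced dimensionality
rng = np.random.default_rng(0)

x = rng.normal(size=N)            # a sample x in R^N
T = rng.normal(size=(K, N))       # placeholder K x N matrix (hypothetical;
                                  # a real method supplies a specific T)
y = T @ x                         # projected sample y in R^K
print(y.shape)                    # -> (3,)
```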
Feature Extraction (cont’d)
• From a mathematical point of view, finding an optimum
mapping y=𝑓(x) is equivalent to optimizing an objective
criterion.

• Different methods use different objective criteria, e.g.,

− Minimize Information Loss: represent the data as accurately as
possible in the lower-dimensional space.

− Maximize Discriminatory Information: enhance the class-
discriminatory information in the lower-dimensional space.

6
Feature Extraction (cont’d)
• Popular linear feature extraction methods:
− Principal Components Analysis (PCA): Seeks a projection that
preserves as much information in the data as possible.
− Linear Discriminant Analysis (LDA): Seeks a projection that best
discriminates the data.

• Many other methods:


− Making features as independent as possible (Independent
Component Analysis or ICA).
− Retaining interesting directions (Projection Pursuit).
− Embedding to lower dimensional manifolds (Isomap, Locally Linear
Embedding or LLE).

7
Vector Representation
• A vector x ∈ R^n can be
represented by n components:

• Assuming the standard basis <v1, v2, …, vn> (i.e., unit vectors
in each dimension), xi can be obtained by projecting x along
the direction of vi:

• x can be “reconstructed” from its projections as follows:

• Since the basis vectors are the same for all x ∈ R^n
(standard basis), we typically represent x as an
n-component vector.
8
Vector Representation (cont’d)

• Example assuming n=2:

• Assuming the standard basis <v1=i, v2=j>, xi can be obtained by
projecting x along the direction of vi:

• x can be “reconstructed” from its projections as follows:

9
Principal Component Analysis (PCA)

• If x ∈ R^N, then it can be written as a linear combination of an
orthonormal set of N basis vectors <v1, v2, …, vN> in R^N (e.g.,
using the standard basis):

• PCA seeks to approximate x in a subspace of R^N using a
new set of K << N basis vectors <u1, u2, …, uK> in R^N
(reconstruction):

such that the reconstruction error ||x − x̂|| is minimized
(i.e., minimize information loss)!

10
Principal Component Analysis (PCA)

• The “optimal” set of basis vectors <u1, u2, …, uK> can be
found as follows (we will see why):

(1) Find the eigenvectors ui of the covariance matrix Σx of the
(training) data:

Σx ui = λi ui

(2) Choose the K “largest” eigenvectors ui (i.e., corresponding
to the K “largest” eigenvalues λi).

<u1, u2, …, uK> correspond to the “optimal” basis!

We refer to the “largest” eigenvectors ui as principal components.
11
PCA - Steps
• Suppose we are given x1, x2, ..., xM (N x 1) vectors
(N: # of features, M: # of data)

Step 1: compute the sample mean: x̄ = (1/M) Σi xi

Step 2: subtract the sample mean (i.e., center the data at zero): Φi = xi − x̄

Step 3: compute the sample covariance matrix Σx:

Σx = (1/M) Σi Φi Φi^T = (1/M) A A^T

where A = [Φ1 Φ2 ... ΦM], i.e., the columns of A are the Φi
(N x M matrix)
12
PCA - Steps
Step 4: compute the eigenvalues λi and eigenvectors ui of Σx:

Σx ui = λi ui, where we assume λ1 ≥ λ2 ≥ … ≥ λN

Note: most software packages return the eigenvalues (and corresponding eigenvectors)
in decreasing order – if not, you can explicitly put them in this order.

Since Σx is symmetric, <u1, u2, …, uN> form an orthogonal basis
in R^N and we can represent any x ∈ R^N as:

x − x̄ = Σi yi ui, where yi = ui^T (x − x̄)

i.e., this is just a “change” of basis!

Note: most software packages normalize ui to unit length to simplify calculations; if
not, you can explicitly normalize them.
13
PCA - Steps
Step 5: dimensionality reduction step – approximate x using
only the first K eigenvectors (K << N), i.e., those corresponding to the
K largest eigenvalues (where K is a parameter):

x̂ = x̄ + Σ(i=1..K) yi ui   (reconstruction: approximate x using the first K eigenvectors only)

Note that if K = N, then x̂ = x
(i.e., zero reconstruction error).

14
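A compact NumPy sketch of Steps 1–5 above (a sketch only, using the 1/M covariance convention; function and variable names are illustrative):

```python
import numpy as np

def pca(X, K):
    """X: M x N data matrix (one sample per row); returns mean, eigenvalues, top-K eigenvectors, projections."""
    M, N = X.shape
    x_bar = X.mean(axis=0)                      # Step 1: sample mean
    Phi = X - x_bar                             # Step 2: center the data (rows are the Φi; the slides' A is Phi.T)
    Sigma = (Phi.T @ Phi) / M                   # Step 3: sample covariance (N x N)
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # Step 4: eigenvalues/eigenvectors of Σx
    order = np.argsort(eigvals)[::-1]           # put them in decreasing order
    eigvals, U = eigvals[order], eigvecs[:, order]
    U_K = U[:, :K]                              # Step 5: keep the K "largest" eigenvectors
    Y = Phi @ U_K                               # K-dimensional representation of each sample
    return x_bar, eigvals, U_K, Y

# reconstruction of one sample from its K coefficients y: x_hat = x_bar + U_K @ y
# (equals x exactly when K = N)
```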
What is the Linear Transformation
implied by PCA?
• The linear transformation y = Tx which performs the
dimensionality reduction in PCA is:

T = U^T (a K x N matrix), where the columns of U are the
first K eigenvectors of Σx,

i.e., the rows of T are the first K
eigenvectors of Σx.

15
What is the form of Σy?

Using diagonalization, Σx = P Λ P^T, where:
− the columns of P are the eigenvectors of Σx;
− the diagonal elements of Λ are the eigenvalues of Σx, i.e., the variances.

Since the rows of T are eigenvectors of Σx, Σy = T Σx T^T is diagonal, with the
retained eigenvalues (variances) on its diagonal.

PCA de-correlates the data!
PCA preserves the original variances!
16
Interpretation of PCA
• PCA chooses the eigenvectors of the
covariance matrix corresponding to
the largest eigenvalues.
• The eigenvalues correspond to the
variance of the data along the
eigenvector directions.
• Therefore, PCA projects the data
along the directions where the data
varies most.
• PCA preserves as much information in
the data as possible by preserving as
much variance in the data as possible.

[Figure: u1 is the direction of maximum variance; u2 is orthogonal to u1.]

17
Example
• Compute the PCA of the following dataset:

(1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)

• Compute the sample covariance matrix:

• The eigenvalues can be computed by finding the roots of the
characteristic polynomial:

18
Example (cont’d)
• The eigenvectors are the solutions of the systems:

(Σx − λi I) ui = 0

Note: if ui is a solution, then c·ui is also a solution, where c ≠ 0.

Eigenvectors can be normalized to unit length using: ui ← ui / ||ui||
19
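To check the example numerically, the covariance matrix and its eigenvalues/eigenvectors can be computed with NumPy. This sketch assumes the 1/M convention from Step 3; with a 1/(M−1) convention the entries differ by a constant factor:

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)
Phi = X - X.mean(axis=0)                  # center the data (the mean is (5, 5))
Sigma = (Phi.T @ Phi) / len(X)            # 2 x 2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns unit-length eigenvectors
order = np.argsort(eigvals)[::-1]
print(Sigma)                              # [[6.25, 4.25], [4.25, 3.5]]
print(eigvals[order])                     # approx. [9.34, 0.41]
print(eigvecs[:, order])                  # columns are the principal directions
```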
How do we choose K ?

• K is typically chosen based on how much information
(variance) we want to preserve:

Choose the smallest K that satisfies the following inequality:

(λ1 + λ2 + … + λK) / (λ1 + λ2 + … + λN) > T

• If T = 0.9, for example, we “preserve” 90% of the information
(variance) in the data.

• If K = N, then we “preserve” 100% of the information in the
data (i.e., it is just a “change” of basis and x̂ = x).

20
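A small sketch of this criterion, assuming the eigenvalues are already sorted in decreasing order (e.g., as returned by the pca sketch earlier):

```python
import numpy as np

def choose_k(eigvals, T=0.9):
    """Smallest K whose top-K eigenvalues preserve at least a fraction T of the total variance."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, T) + 1)

# e.g., choose_k(np.array([9.34, 0.41])) -> 1 (the first component already exceeds 90%)
```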
Approximation Error

• The approximation error (or reconstruction error) can be
computed by:

||x − x̂||, where x̂ = x̄ + Σ(i=1..K) yi ui   (reconstruction)

• It can also be shown that the mean squared approximation error equals
the sum of the eigenvalues corresponding to the components that were left out:

Σ(i=K+1..N) λi

21
Data Normalization

• The principal components are dependent on the units used
to measure the original variables as well as on the range of
values they assume.

• Data should always be normalized prior to using PCA.

• A common normalization method is to transform all the data
to have zero mean and unit standard deviation:

zi = (xi − μi) / σi

where μi and σi are the mean and standard
deviation of the i-th feature xi.

22
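A one-function NumPy version of this zero-mean, unit-variance normalization (a sketch; the per-feature statistics should be estimated on the training data and reused on test data):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit standard deviation."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma    # keep mu, sigma to apply to new data
```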
Application to Images
• The goal is to represent images in a space of lower
dimensionality using PCA.
− Useful for various applications, e.g., face recognition, image
compression, etc.
• Given M images of size N x N, first represent each image
as a 1D vector (i.e., by stacking the rows together).
− Note that for face recognition, faces must be centered and of the
same size.

23
Application to Images (cont’d)

• The key challenge is that the covariance matrix Σx is now very
large (i.e., N^2 x N^2) – see Step 3:

Step 3: compute the covariance matrix Σx = (1/M) A A^T

where A = [Φ1 Φ2 ... ΦM]
(N^2 x M matrix)

• Σx is now an N^2 x N^2 matrix – it is computationally expensive to
compute its eigenvalues/eigenvectors λi, ui:

(A A^T) ui = λi ui

24
Application to Images (cont’d)
• We will use a simple “trick” to get around this by relating the
eigenvalues/eigenvectors of A A^T to those of A^T A.

• Let us consider the matrix A^T A instead (i.e., an M x M matrix):
− Suppose its eigenvalues/eigenvectors are μi, vi:
(A^T A) vi = μi vi
− Multiply both sides by A:
A (A^T A) vi = A μi vi, or (A A^T)(A vi) = μi (A vi)
− Assuming (A A^T) ui = λi ui, it follows that
λi = μi and ui = A vi

A = [Φ1 Φ2 ... ΦM]
(N^2 x M matrix)

25
Application to Images (cont’d)

• But do A A^T and A^T A have the same number of
eigenvalues/eigenvectors?
− A A^T can have up to N^2 eigenvalues/eigenvectors.
− A^T A can have up to M eigenvalues/eigenvectors.
− It can be shown that the M eigenvalues/eigenvectors of A^T A are also
the M largest eigenvalues/eigenvectors of A A^T.

• Steps 3-5 of PCA need to be updated as follows:

26
Application to Images (cont’d)
Step 3: compute A^T A (i.e., instead of A A^T)

Step 4: compute the eigenvalues/eigenvectors μi, vi of A^T A

Step 4b: compute the eigenvalues/eigenvectors λi, ui of A A^T using λi = μi and
ui = A vi, then normalize ui to unit length.

Step 5: dimensionality reduction step – approximate x using
only the first K eigenvectors (K < M); each image can then be
represented by a K-dimensional vector.

27
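A sketch of the modified Steps 3–4b using the A^T A trick; Phi is assumed to be the N^2 x M matrix of centered image vectors (one column per image), and all names are illustrative:

```python
import numpy as np

def eigenfaces(Phi):
    """Phi: N^2 x M matrix of centered image vectors; returns eigenvalues and unit-length eigenfaces."""
    A = Phi
    L = A.T @ A                         # Step 3: the small M x M matrix A^T A
    mu, V = np.linalg.eigh(L)           # Step 4: eigenvalues/eigenvectors of A^T A
    order = np.argsort(mu)[::-1]
    mu, V = mu[order], V[:, order]
    U = A @ V                           # Step 4b: u_i = A v_i  (and lambda_i = mu_i)
    U /= np.linalg.norm(U, axis=0)      # normalize each u_i to unit length
    return mu, U                        # columns of U are the eigenfaces
```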
Example

Dataset

28
Example (cont’d)
[Figure: the mean face and the top eigenvectors u1, u2, u3, visualized as images (eigenfaces).]
29
Example (cont’d)

• How can you visualize the eigenvectors (eigenfaces) u1, u2, u3 as
images?

− Their values must first be mapped to integer values in the
interval [0, 255] (required by the PGM format).
− Suppose fmin and fmax are the min/max values of a given
eigenface (they could be negative).
− If x ∈ [fmin, fmax] is the original value, then the new value
y ∈ [0, 255] can be computed as follows:

y = (int) 255(x − fmin)/(fmax − fmin)

30
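The same rescaling as a short NumPy helper (a sketch, assuming an 8-bit grayscale output format such as PGM):

```python
import numpy as np

def to_uint8(eigenface):
    """Linearly map eigenface values from [f_min, f_max] to integers in [0, 255]."""
    f_min, f_max = eigenface.min(), eigenface.max()
    return (255 * (eigenface - f_min) / (f_max - f_min)).astype(np.uint8)
```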
Application to Images (cont’d)
• Interpretation: represent a face in terms of eigenfaces.

[Figure: a face approximated as a weighted combination of eigenfaces u1, u2, u3 with coefficients y1, y2, y3.]
31
Case Study: Eigenfaces for Face
Detection/Recognition

− M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of
Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

• Face Recognition
− The simplest approach is to think of it as a template matching
problem.

− Problems arise when performing recognition in a high-dimensional
space.

− Use dimensionality reduction!

32
Face Recognition Using Eigenfaces
• Process the image database (i.e., the set of images with labels)
– typically referred to as the “training” phase:

− Compute the PCA space using the image database (i.e., the training data).

− Represent each image in the database by its K coefficients Ωi.
Face Recognition Using Eigenfaces
Given an unknown face x, follow these steps:

Step 1: Subtract the mean face (computed from the training data): Φ = x − x̄

Step 2: Project the unknown face onto the eigenspace: Ω = [y1 … yK]^T, where yi = ui^T Φ

Step 3: Find the closest match Ωi in the training set using the
Euclidean distance er = mini ||Ω − Ωi|| (the Mahalanobis distance can also be used).

The distance er is called the distance in face space (difs).

Step 4: Recognize x as person “k”, where k is the ID linked to Ωi.

Note: for intruder rejection, we need er < Tr, for some threshold Tr.


34
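A sketch of these recognition steps, assuming the mean face x_bar, the eigenface matrix U (one eigenface per column), the training coefficients Omega_train (one row per training image), their labels, and the threshold T_r have already been computed; all names are illustrative:

```python
import numpy as np

def recognize(x, x_bar, U, Omega_train, labels, T_r):
    Phi = x - x_bar                                       # Step 1: subtract the mean face
    Omega = U.T @ Phi                                     # Step 2: project onto the eigenspace
    dists = np.linalg.norm(Omega_train - Omega, axis=1)   # Euclidean difs to each training face
    i = int(np.argmin(dists))                             # Step 3: closest match
    e_r = dists[i]
    return labels[i] if e_r < T_r else None               # Step 4 (None = intruder rejected)
```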
Face detection vs recognition

[Figure: detection finds where a face is in the image; recognition identifies whose face it is (e.g., “Sally”).]
35
Face Detection Using Eigenfaces
Given an unknown image x, follow these steps:

Step 1: Subtract the mean face (computed from the training data): Φ = x − x̄

Step 2: Project the unknown image onto the eigenspace: Ω = [y1 … yK]^T, where yi = ui^T Φ

Step 3: Compute the reconstruction Φ̂ = Σ(i=1..K) yi ui and the error ed = ||Φ − Φ̂||

The distance ed is called the distance from face space (dffs).

Step 4: If ed < Td, then x is a face.

36
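And the corresponding detection test using the distance from face space (dffs); a sketch with the same assumed inputs as above, T_d being an assumed threshold:

```python
import numpy as np

def is_face(x, x_bar, U, T_d):
    Phi = x - x_bar                         # Step 1: subtract the mean face
    Omega = U.T @ Phi                       # Step 2: project onto the eigenspace
    Phi_hat = U @ Omega                     # reconstruction within face space
    e_d = np.linalg.norm(Phi - Phi_hat)     # Step 3: distance from face space
    return e_d < T_d                        # Step 4
```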
Eigenfaces
[Figure: input images and their reconstructions.]

− The reconstructed image looks like a face.
− The reconstructed image looks like a face.
− The reconstructed image looks like a face again!
37
Reconstruction from partial information

• Robust to partial face occlusion.

Input Reconstructed

38
Eigenfaces

• Can be used for face detection, tracking, and recognition!

Visualize dffs as an image:

Dark: small distance
Bright: large distance

39
Limitations
• Background changes cause problems
− De-emphasize the outside of the face (e.g., by multiplying the input
image by a 2D Gaussian window centered on the face).
• Light changes degrade performance
− Light normalization might help but this is a challenging issue.
• Performance decreases quickly with changes to face size
− Scale input image to multiple sizes.
− Multi-scale eigenspaces.
• Performance decreases with changes to face orientation (but
not as fast as with scale changes)
− Out-of-plane rotations are more difficult to handle.
− Multi-orientation eigenspaces.

40
Limitations (cont’d)
• Not robust to misalignment.

41
Limitations (cont’d)
• PCA is not always an optimal dimensionality-reduction
technique for classification purposes.

42
Linear Discriminant Analysis (LDA)

• What is the goal of LDA?
− Seeks to find directions along which the classes are best
separated (i.e., increase discriminatory information).
− It takes into consideration the scatter (i.e., variance) within
classes and between classes.

[Figure: two candidate projection directions; one yields bad class separability, the other good separability.]


43
Linear Discriminant Analysis (LDA) (cont’d)
• Let us assume C classes, with the i-th class containing Mi samples, i=1,2,..,C,
and M the total number of samples: M = Σ(i=1..C) Mi

• Let μi be the mean of the i-th class, i=1,2,…,C, and μ the mean of the
whole dataset: μ = (1/M) Σx x

Within-class scatter matrix: Sw = Σ(i=1..C) Σ(x in class i) (x − μi)(x − μi)^T

Between-class scatter matrix: Sb = Σ(i=1..C) Mi (μi − μ)(μi − μ)^T
44
Linear Discriminant Analysis (LDA) (cont’d)
• Suppose the desired projection transformation is: y = U^T x

• Suppose the scatter matrices of the projected data y are S̃w = U^T Sw U and S̃b = U^T Sb U.

• LDA seeks the transformation that maximizes the between-class
scatter and minimizes the within-class scatter, e.g., by maximizing

|S̃b| / |S̃w| = |U^T Sb U| / |U^T Sw U|   or   tr(S̃w^-1 S̃b)

45
Linear Discriminant Analysis (LDA) (cont’d)

• It can be shown that the columns of the matrix U are the
eigenvectors (i.e., the so-called Fisherfaces) corresponding to the
largest eigenvalues of the following generalized eigenproblem:

Sb uk = λk Sw uk

• It can be shown that Sb has at most rank C−1; therefore,
the maximum number of eigenvectors with non-zero
eigenvalues is C−1, that is:

the maximum dimensionality of the LDA sub-space is C−1

e.g., when C=2, we always end up with one LDA feature,
no matter what the original number of features was!
46
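A sketch of LDA via the scatter matrices and the generalized eigenproblem Sb u = λ Sw u, using scipy.linalg.eigh (which accepts a second matrix for generalized symmetric problems); it assumes Sw is non-singular, and all names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, K):
    """X: M x N data, y: class labels; returns the top-K discriminant directions (K <= C-1)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    N = X.shape[1]
    Sw = np.zeros((N, N))
    Sb = np.zeros((N, N))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter
        d = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)                  # between-class scatter
    evals, evecs = eigh(Sb, Sw)                    # generalized eigenproblem Sb u = lambda Sw u
    return evecs[:, np.argsort(evals)[::-1][:K]]   # columns = Fisher directions
```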
Example

47
Linear Discriminant Analysis (LDA) (cont’d)

• If Sw is non-singular, we can solve a conventional eigenvalue
problem as follows:

Sw^-1 Sb uk = λk uk

• In practice, Sw is singular due to the high dimensionality of
the data (e.g., images) compared to the much lower number of data
points (M << N).

48
Linear Discriminant Analysis (LDA) (cont’d)
• To alleviate this problem, PCA can be applied first:

1) First, apply PCA to reduce the data dimensionality (so that Sw becomes non-singular).

2) Then, apply LDA to find the most discriminative directions, as sketched below.

49
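In practice this two-stage PCA-then-LDA projection is easy to express with scikit-learn; a sketch on a toy dataset, with placeholder n_components values:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)          # 10 classes, 64 features
pipe = Pipeline([
    ("pca", PCA(n_components=40)),                         # 1) PCA keeps Sw non-singular
    ("lda", LinearDiscriminantAnalysis(n_components=9)),   # 2) at most C-1 = 9 directions
])
Z = pipe.fit_transform(X, y)                 # M x 9 discriminant features
```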
Case Study I

− D. Swets, J. Weng, "Using Discriminant Eigenfeatures for Image
Retrieval", IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 18, no. 8, pp. 831-836, 1996.

• Content-based image retrieval:

− Application: query-by-example content-based image retrieval

− Question: how to select a good set of image features?

50
Case Study I (cont’d)
• Assumptions
− Well-framed images are required as input for training and query-by-
example test probes.
− Only a small variation in the size, position, and orientation of the
objects in the images is allowed.

51
Case Study I (cont’d)
• Terminology
− Most Expressive Features (MEF): features obtained using PCA.
− Most Discriminating Features (MDF): features obtained using LDA.

• Numerical instabilities
− Computing the eigenvalues/eigenvectors of Sw^-1 SB uk = λk uk could
lead to unstable computations since Sw^-1 SB is not always symmetric.

− Check the paper for more details about how to deal with this issue.

52
Case Study I (cont’d)
• Comparing projection directions between MEF with MDF:
− PCA eigenvectors show the tendency of PCA to capture major
variations in the training set such as lighting direction.
− LDA eigenvectors discount those factors unrelated to classification.

53
Case Study I (cont’d)

• Clustering effect

PCA space LDA space

54
Case Study I (cont’d)

• Methodology
1) Represent each training image in terms of MDFs (or MEFs for
comparison).

2) Represent a query image in terms of MDFs (or MEFs for
comparison).

3) Find the k closest neighbors (e.g., using the Euclidean distance).

55
Case Study I (cont’d)
• Experiments and results
Face images
− A set of face images was used with 2 expressions, 3 lighting conditions.
− Testing was performed using a disjoint set of images.

56
Case Study I (cont’d)

Top match (k=1)

57
Case Study I (cont’d)
− Examples of correct search probes

58
Case Study I (cont’d)

− Example of a failed search probe

59
Case Study II

− A. Martinez, A. Kak, "PCA versus LDA", IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.

• Is LDA always better than PCA?

− There has been a tendency in the computer vision community to
prefer LDA over PCA.
− This is mainly because LDA deals directly with discrimination
between classes, while PCA does not pay attention to the underlying
class structure.

60
Case Study II (cont’d)
AR database

61
Case Study II (cont’d)

LDA is not always better when the training set is small.

PCA w/o 3: not using the first three principal components,
which seem to encode mostly variations due to lighting.

62
Case Study II (cont’d)

LDA outperforms PCA when the training set is large.

PCA w/o 3: not using the first three principal components,
which seem to encode mostly variations due to lighting.

63
