
Dimensionality Reduction in Machine Learning

Dr. Shaikh Fattah


Professor, Department of EEE, BUET
Chair, IEEE Signal Processing Society Bangladesh Chapter

Brief Biography
▸ Prof., Dept. of EEE, BUET
▸ Director, Inst. of Robotics & Automation, BUET
▸ Ex-Director, Inst. of Nuclear Power Eng., BUET
▸ Editorial Board Member, IEEE Access
▸ Chair, IEEE R-10 Ad Hoc on Climate Change
▸ Chair, IEEE PES-Humanitarian Activity Comm.
▸ Chair, IEEE SSIT Chapters
▸ Education Chair, IEEE HAC (2018-20)
▸ Member, IEEE PES LRP Committee
▸ Chair, IEEE SPS, SSIT Vice-Chair, IEEE RAS BDC
▸ Chair ('15-16), N&A Chair ('17-20), Bangladesh Section
▸ General Chair, ICAICT 20, RAAICON, R10-HTC 17
▸ TPC Chairs: TENSYMP 20, WIECON, ICIVPR, ICAEE



Major Awards
▸Concordia University's Distinguished
Doctoral Dissertation Prize (2009),
▸2007 URSI Canadian Young Scientist Award
▸Dr. Rashid Gold Medal (in MSc)
▸BAS-TWAS Young Scientists Prize (2014)
▸2018 R10 Outstanding Volunteer Award
▸2016 IEEE MGA Achievement Award
▸2017 IEEE R10 HTA Outstanding Vol. Award
▸2016 IEEE R10 HTA Outstanding Activities
Award for Bangladesh Section
▸ Best paper awards: IEEE BECITHCON 2019, Biomedical Track at IEEE TENCON 2017, WIECON-ECE 2016
▸ Faculty Advisor: IEEE SP Cup 2022 Champion, 2023 1st Runner-Up; IEEE VIP Cup 2020 1st Runner-Up



Why Reduce Dimensionality?

1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the features
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Why Component Analysis?
• We have too many observations and dimensions
– To reason about or obtain insights from
– To visualize
– Too much noise in the data
– Need to “reduce” them to a smaller set of factors
– Better representation of data without losing much
information
– Can build more effective data analyses on the reduced-
dimensional space: classification, clustering, pattern
recognition

• Combinations of observed variables may be more effective bases for insights, even if their physical meaning is obscure
Basic Concept of Dimensionality Reduction
▸Areas of variance in data are where items can be best
discriminated and key underlying phenomena observed
- Areas of greatest “signal” in the data
▸If two items or dimensions are highly correlated or dependent
- They are likely to represent highly related phenomena
- If they tell us about the same underlying variance in the data,
combining them to form a single measure is reasonable
• Parsimony
• Reduction in Error
▸So we want to combine related variables, and focus on
uncorrelated or independent ones, especially those along which
the observations have high variance
▸We want a smaller set of variables that explain most of the
variance in the original data, in more compact and insightful form



Basic Concept of Dimensionality Reduction
▸What if the dependences and correlations are not so strong or
direct?
▸And suppose you have 3 variables, or 4, or 5, or 10000?
▸Look for the phenomena underlying the observed covariance/co-
dependence in a set of variables
- Once again, phenomena that are uncorrelated or independent, and
especially those along which the data show high variance

▸These phenomena are called “factors,” “principal components,” or “independent components,” depending on the method used
- Factor analysis: based on variance/covariance/correlation
- Independent Component Analysis: based on independence



Feature Selection vs Extraction

▸Feature selection: Choose k < d important features, ignoring the remaining d − k
Subset selection algorithms
▸Feature extraction: Project the original x_i, i = 1, ..., d dimensions to new k < d dimensions, z_j, j = 1, ..., k
Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)

Subset Selection Method
Forward search:
Add the best feature at each step
- The set of features F is initially empty (Ø).
- At each iteration, find the best new feature: j = argmin_i E(F ∪ {x_i})
- Add x_j to F if E(F ∪ {x_j}) < E(F)
Backward search:
▸Start with all features and remove one at a time, if possible.

Principal Component Analysis

▸Most common form of factor analysis


▸The new variables/dimensions
- Are linear combinations of the original ones
- Are uncorrelated with one another
• Orthogonal in original dimension space
- Capture as much of the original variance in the data as possible
- Are called Principal Components



What are the new axes?

[Figure: data scatter over original variables A and B, with orthogonal axes PC 1 and PC 2 overlaid along the directions of greatest variance.]

• Orthogonal directions of greatest variance in data
• Projections along PC 1 discriminate the data most along any one axis



Principal Components

▸First principal component is the direction of greatest variability (covariance) in the data
▸Second is the next orthogonal (uncorrelated) direction of
greatest variability
- So first remove all the variability along the first
component, and then find the next direction of greatest
variability
▸And so on …



Principal Components Analysis (PCA)
- Linear projection method to reduce the number of parameters
- Transform a set of correlated variables into a new set of uncorrelated variables
- Map the data into a space of lower dimensionality
- Form of unsupervised learning

Mathematically:
- It can be viewed as a rotation of the existing axes to new
positions in the space defined by original variables
- New axes are orthogonal and represent the directions with
maximum variability



Computing the Components
▸Data points are vectors in a multidimensional space
▸Projection of a vector x onto an axis (dimension) u is uᵀx
▸The direction of greatest variability is the one in which the average square of the projection is greatest
- i.e., u such that E((uᵀx)²) over all x is maximized
- (we subtract the mean along each dimension and center the original axis system at the centroid of all data points, for simplicity)
- This u is the direction of the first principal component



Computing the Components
▸E((uᵀx)²) = E((uᵀx)(uᵀx)ᵀ) = E(uᵀx xᵀu)
▸The matrix C = x xᵀ contains the correlations (similarities) of the original axes, based on how the data values project onto them
▸So we are looking for u that maximizes uᵀCu, subject to u being unit-length
▸It is maximized when u is the principal eigenvector of the matrix C, in which case
- uᵀCu = uᵀλu = λ if u is unit-length, where λ is the principal eigenvalue of the correlation matrix C
- The eigenvalue denotes the amount of variability captured along that dimension



Why the Eigenvectors?

Maximize uᵀx xᵀu subject to uᵀu = 1.
Construct the Lagrangian: uᵀx xᵀu − λ(uᵀu − 1).
Set the vector of partial derivatives to zero:
x xᵀu − λu = (x xᵀ − λI)u = 0
Since u ≠ 0, u must be an eigenvector of x xᵀ with eigenvalue λ.



Singular Value Decomposition

The first root is called the principal eigenvalue, which has an associated orthonormal (uᵀu = 1) eigenvector u.
Subsequent roots are ordered such that λ1 > λ2 > … > λM, with rank(D) non-zero values.
The eigenvectors form an orthonormal basis, i.e., uiᵀuj = δij.
The eigenvalue decomposition of x xᵀ is x xᵀ = U Σ Uᵀ, where U = [u1, u2, …, uM] and Σ = diag[λ1, λ2, …, λM].
Similarly, the eigenvalue decomposition of xᵀx is xᵀx = V Σ Vᵀ.
The SVD is closely related to the above: x = U Σ^(1/2) Vᵀ, with left eigenvectors U, right eigenvectors V, and singular values equal to the square roots of the eigenvalues.
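As a quick sanity check of this relationship, the NumPy sketch below (our own illustration, not from the slides) compares the eigendecompositions of xxᵀ and xᵀx with the SVD of a small random data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                 # 5 dimensions, 8 samples (columns)

eig_left, U = np.linalg.eigh(X @ X.T)           # eigendecomposition of X X^T
eig_right, V = np.linalg.eigh(X.T @ X)          # eigendecomposition of X^T X
U_s, s, Vt_s = np.linalg.svd(X, full_matrices=False)   # SVD: X = U_s diag(s) V_s^T

# Singular values are the square roots of the (non-zero) eigenvalues of both products
print(np.allclose(np.sort(s**2), np.sort(eig_left)[-len(s):]))    # True
print(np.allclose(np.sort(s**2), np.sort(eig_right)[-len(s):]))   # True
```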



Computing the Components
▸Similarly for the next axis, etc.
▸So, the new axes are the eigenvectors of the matrix of
correlations of the original variables, which captures the
similarities of the original variables based on how data
samples project to them

• Geometrically: centering followed by rotation – a linear transformation
PCs, Variance and Least-Squares
▸The first PC retains the greatest amount of variation in the
sample
▸The kth PC retains the kth greatest fraction of the variation in
the sample
▸The kth largest eigenvalue of the correlation matrix C is the
variance in the sample along the kth PC

▸The least-squares view: PCs are a series of linear least-squares fits to a sample, each orthogonal to all previous ones



Principal Components Analysis (PCA)

▸Find a low-dimensional space such that when x is projected there, information loss is minimized.
▸The projection of x on the direction of w is: z = wᵀx
▸Find w such that Var(z) is maximized:
Var(z) = Var(wᵀx) = E[(wᵀx − wᵀμ)²]
       = E[(wᵀx − wᵀμ)(wᵀx − wᵀμ)]
       = E[wᵀ(x − μ)(x − μ)ᵀw]
       = wᵀ E[(x − μ)(x − μ)ᵀ] w = wᵀ Σ w
where Var(x) = E[(x − μ)(x − μ)ᵀ] = Σ

▸Maximize Var(z) subject to ||w|| = 1:
max_{w1} w1ᵀ Σ w1 − α(w1ᵀw1 − 1)
Σ w1 = α w1, that is, w1 is an eigenvector of Σ.
Choose the one with the largest eigenvalue for Var(z) to be maximal.
▸Second principal component: maximize Var(z2) subject to ||w2|| = 1 and w2 orthogonal to w1:
max_{w2} w2ᵀ Σ w2 − α(w2ᵀw2 − 1) − β(w2ᵀw1 − 0)
Σ w2 = α w2, that is, w2 is another eigenvector of Σ, and so on.
What PCA does
z = Wᵀ(x − m)
where the columns of W are the eigenvectors of Σ and m is the sample mean.
This centers the data at the origin and rotates the axes.
How to choose k ?

▸Proportion of Variance (PoV) explained:
PoV = (λ1 + λ2 + ⋯ + λk) / (λ1 + λ2 + ⋯ + λk + ⋯ + λd)
where the λi are sorted in descending order
▸Typically, stop at PoV > 0.9
▸A scree graph plots PoV vs. k; stop at the “elbow”
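A small NumPy helper for this rule is sketched below; the eigenvalues passed in are made-up illustrative values, and plotting the returned PoV values against k gives the scree/PoV curve mentioned above.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose proportion of variance (PoV) exceeds the threshold."""
    lam = np.sort(np.asarray(eigenvalues, float))[::-1]   # lambda_i in descending order
    pov = np.cumsum(lam) / lam.sum()                      # PoV for k = 1, 2, ..., d
    return int(np.argmax(pov > threshold)) + 1, pov

k, pov = choose_k([4.2, 2.8, 1.1, 0.5, 0.2, 0.1])   # made-up eigenvalues
print(k, pov)                                       # k = 3 here (PoV first exceeds 0.9)
```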


How Many PCs?

▸For n original dimensions, the correlation matrix is n x n and has up to n eigenvectors. So n PCs.
▸Where does dimensionality reduction come from?



Dimensionality Reduction

Can ignore the components of lesser significance.

[Figure: bar chart of the variance (%) explained by each of PC1 through PC10.]

You do lose some information, but if the eigenvalues are small, you don’t lose much:
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
Computing PCA: Example

[Worked example omitted: the original slides step through a numerical PCA computation (mean, covariance, eigenvalues/eigenvectors, projection) as a sequence of images.]


Curse of Feature Dimensionality
▸Increasing the number of features will not always improve classification
accuracy. It might actually lead to worse performance.
▸The number of training examples required increases exponentially with
dimensionality.
▸Choose an optimum set of features of lower dimensionality to improve
classification accuracy.

[Figure: partitioning the feature space with k = 3 bins per feature – one feature needs 3 bins, two features need 3² bins, three features need 3³ bins, so the number of cells (and the training data needed to fill them) grows exponentially with dimensionality.]
Feature Dimensionality Reduction
Feature selection: chooses a subset of the original features.
x = [x1, x2, ..., xN]ᵀ → y = [x_{i1}, x_{i2}, ..., x_{iK}]ᵀ, with K << N

Feature extraction: finds a set of new features (via some mapping f()) from the existing features.
x = [x1, x2, ..., xN]ᵀ → y = f(x) = [y1, y2, ..., yK]ᵀ, with K << N

The mapping f() could be linear or non-linear.
Feature Selection vs Feature Extraction

▸Feature Extraction
- New features are combinations (linear for PCA/LDA) of the
original features (difficult to interpret).
- When classifying novel patterns, all features need to be
computed.

▸Feature Selection
- New features represent a subset of the original features.
- When classifying novel patterns, only a small number of features
need to be computed (i.e., faster classification).
Feature Selection
Feature Selection: Main Steps
▸Feature selection is an
optimization problem.
- Step 1: Search the space of
possible feature subsets.

- Step 2: Pick the subset that is optimal or near-optimal with respect to some objective function.
Feature Selection: Main Steps
(cont’d)
Search methods
- Exhaustive
- Heuristic
- Randomized

Evaluation methods
- Filter
- Wrapper
Search Methods
▸Assuming n features, an exhaustive search would require examining all C(n, d) = n!/(d!(n−d)!) possible subsets of size d.

▸The number of subsets grows combinatorially, making exhaustive search impractical.
- e.g., exhaustive evaluation of 10 out of 20 features involves 184,756 feature subsets.
- e.g., exhaustive evaluation of 10 out of 100 involves more than 10¹³ feature subsets.

▸In practice, heuristics are used to speed up the search, but they cannot guarantee optimality.

Evaluation Methods
▸Filter
- Evaluation is independent of
the classification algorithm.
- The objective function is based on the information content (the “goodness”) of the feature subsets, e.g.:
• interclass distance
• statistical dependence
• information-theoretic measures
(e.g., mutual information).
Evaluation Methods (cont’d)
▸Wrapper
- Evaluation uses criteria
related to the classification
algorithm.
- The objective function (“goodness”) is based on the predictive accuracy of the classifier, e.g.:
• recognition accuracy on a
“validation” data set.
Filter vs Wrapper Methods
▸Wrapper Methods
- Advantages
• Achieve higher recognition accuracy compared to filter methods
since they use the classifier itself in choosing the optimum set of
features.
- Disadvantages
• Much slower compared to filter methods since the classifier must
be trained and tested for each candidate feature subset.
• The optimum feature subset might not work well for other
classifiers.
Filter vs Wrapper Methods
(cont’d)
▸Filter Methods
- Advantages
• Much faster than wrapper methods since the objective function
has lower computational requirements.
• The optimum feature set might work well with various classifiers
as it is not tied to a specific classifier.
- Disadvantages
• Achieve lower recognition accuracy compared to wrapper
methods.
• Have a tendency to select more features compared to wrapper
methods.
Naïve Search (heuristic search)
▸Given n features:
1. Sort them in decreasing order of their “goodness” (i.e., based
on some objective function).
2. Choose the top d features from this sorted list.

▸Disadvantages
- Correlation among features is not considered.
- The best pair of features may not even contain the best
individual feature.
Sequential forward selection (SFS)
(heuristic search)

▸First, the best single feature is selected (i.e., using some objective function).
▸Then, pairs of features are formed using one of
the remaining features and this best feature, and
the best pair is selected.
▸Next, triplets of features are formed using one of
the remaining features and these two best
features, and the best triplet is selected.
▸This procedure continues until a predefined
number of features are selected.

Note: SFS performs best when the optimal subset is small.
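A minimal Python sketch of sequential forward selection is shown below. It uses scikit-learn with 3-fold cross-validated accuracy of a logistic-regression classifier as the (assumed) objective function; any other wrapper criterion could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, n_selected):
    """Greedy SFS: repeatedly add the feature that most improves the objective."""
    clf = LogisticRegression(max_iter=1000)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_selected and remaining:
        scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)      # best feature to add at this step
        selected.append(best)
        remaining.remove(best)
    return selected
```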
Example

[Figure: sequential forward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%); the y-axis shows the features added at each iteration (the first iteration is at the bottom). The subset reaching the highest accuracy is marked.]
Sequential backward selection (SBS)
(heuristic search)

▸First, the objective function is computed for all n features.
▸Then, each feature is deleted one at a time, the
objective function is computed for all subsets with
n-1 features, and the worst feature is discarded.
▸Next, each feature among the remaining n-1 is
deleted one at a time, and the worst feature is
discarded to form a subset with n-2 features.
▸This procedure continues until a predefined
number of features are left.

Note: SBS performs best when the optimal subset is large.
Example

[Figure: sequential backward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%); the y-axis shows the features removed at each iteration (the first iteration is at the top). The subset reaching the highest accuracy is marked.]
Bidirectional Search (BDS)
▸BDS applies SFS and SBS
simultaneously:
- SFS is performed from the
empty set.
- SBS is performed from the
full set.
▸To guarantee that SFS and
SBS converge to the same
solution:
- Features already selected by SFS cannot be removed by SBS.
- Features already removed by SBS cannot be added by SFS.
Limitations of SFS and SBS
▸The main limitation of SFS is that it is unable to remove features that stop being useful once other features have been added.
▸The main limitation of SBS is its inability to
reevaluate the usefulness of a feature after it has
been discarded.
▸Enhancements of SFS and SBS:
- “Plus-L, minus-R” selection (LRS)
- Sequential floating forward (or backward) selection
(i.e., SFFS and SFBS)



Feature selection
What is feature selection?

▸Consider our training data as a matrix where each row is a vector and each column is a dimension.
▸For example, consider the matrix for the data x1 = (1, 10, 2), x2 = (2, 8, 0), and x3 = (1, 9, 1).
▸We call each dimension a feature or a column in our
matrix.



Feature selection
▸Useful for high dimensional data such as genomic DNA and text
documents.
▸Methods
- Univariate (looks at each feature independently of others)
• Pearson correlation coefficient
• F-score
• Chi-square
• Signal to noise ratio
• And more such as mutual information, relief
- Multivariate (considers all features simultaneously)
• Dimensionality reduction algorithms
• Linear classifiers such as support vector machine
• Recursive feature elimination



Feature selection

▸Methods are used to rank features by importance
▸The ranking cut-off is determined by the user
▸Univariate methods measure some type of correlation
between two random variables. We apply them to machine
learning by setting one variable to be the label (yi) and the
other to be a fixed feature (xij for fixed j)



Pearson correlation coefficient
▸Measures the correlation between two variables
▸Formulas:
- Covariance(X, Y) = E((X − μX)(Y − μY))
- Correlation(X, Y) = Covariance(X, Y) / (σX σY)
- Sample Pearson correlation: r = Σi (xi − x̄)(yi − ȳ) / sqrt( Σi (xi − x̄)² · Σi (yi − ȳ)² )

▸The correlation r is between −1 and 1. A value of 1 means perfect positive correlation and −1 perfect negative correlation.
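A minimal NumPy implementation of the sample formula above, applied to the tiny 3-sample matrix used earlier in this lecture (the label vector y is made up for illustration):

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation between a feature column x and the labels y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

X = np.array([[1.0, 10, 2], [2, 8, 0], [1, 9, 1]])   # rows x1, x2, x3 from the earlier slide
y = np.array([0.0, 1.0, 0.0])                        # hypothetical labels
ranking = sorted(range(X.shape[1]), key=lambda j: -abs(pearson_r(X[:, j], y)))
print(ranking)                                       # features ranked by |r|
```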



[Figure: scatter plots illustrating different values of the Pearson correlation coefficient (from Wikipedia).]
Signal to noise ratio

▸Difference in class means divided by the sum of the class standard deviations
▸S2N(X, Y) = (μX − μY) / (σX + σY), where X and Y denote the two classes
▸Large (absolute) values indicate a feature strongly correlated with the class labels
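A short sketch of this score for a binary problem, assuming 0/1 labels and the sum-of-standard-deviations (Golub et al.) convention used above:

```python
import numpy as np

def s2n(feature, labels):
    """Signal-to-noise ratio of one feature for a binary (0/1) classification problem."""
    a, b = feature[labels == 1], feature[labels == 0]
    return (a.mean() - b.mean()) / (a.std(ddof=1) + b.std(ddof=1))
```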



Multivariate feature selection

▸Consider the weight vector w of any linear classifier.
▸Classification of a point x is given by wᵀx + w0.
▸Small entries of w will have little effect on the dot product
and therefore those features are less relevant.
▸For example if w = (10, .01, -9) then features 0 and 2 are
contributing more to the dot product than feature 1. A
ranking of features given by this w is 0, 2, 1.



Multivariate feature selection

▸The w can be obtained from any of the linear classifiers we have seen in class so far.
▸A variant of this approach is called recursive feature elimination (RFE):
1. Compute w on all features
2. Remove the feature with the smallest |wi|
3. Recompute w on the reduced data
4. If the stopping criterion is not met, go to step 2
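A minimal sketch of recursive feature elimination with scikit-learn, assuming a binary problem and a linear SVM as the classifier supplying w:

```python
import numpy as np
from sklearn.svm import LinearSVC

def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly drop the feature whose weight has the smallest magnitude."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        w = LinearSVC(max_iter=10000).fit(X[:, keep], y).coef_.ravel()  # binary case
        keep.pop(int(np.argmin(np.abs(w))))      # remove feature with smallest |w_i|
    return keep
```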



Limitations

▸It is unclear how to tell in advance whether feature selection will work
- The only known way is to check, but for very high-dimensional data (at least half a million features) it helps most of the time
▸How many features to select?
- Perform cross-validation



Creating Features
▸“Good” features are the key to accurate generalization

▸Domain knowledge can be used to generate a feature set


- Medical Example: results of blood tests, age, smoking history
- Game Playing example: number of pieces on the board, control of the
center of the board

▸Data might not be in vector form


- Example: spam classification
• “Bag of words”: throw out order, keep count of how many times each
word appears.
• Sequence: one feature for first letter in the email, one for second letter,
etc.
• N-grams: one feature for every unique string of n characters or words



What is feature selection?

▸Reducing the feature space by throwing out some of the features



Reasons for Feature Selection
▸Want to find which features are relevant
- Domain specialist not sure which factors are predictive of disease
- Common practice: throw in every feature you can think of, let feature selection
get rid of useless ones
▸Want to maximize accuracy, by removing irrelevant and noisy features
- For Spam, create a feature for each of ~10⁵ English words
- Training with all features computationally expensive
- Irrelevant features hurt generalization
▸Features have associated costs, want to optimize accuracy with least
expensive features
- Embedded systems with limited resources
• Voice recognition on a cell phone
• Branch prediction in a CPU (4K code limit)



Terminology

▸Univariate method: considers one variable (feature) at a time
▸Multivariate method: considers subsets of variables (features) together
▸Filter method: ranks features or feature subsets independently
of the predictor (classifier)
▸Wrapper method: uses a classifier to assess features or feature
subsets



Filtering
▸Basic idea: assign score to each feature x indicating how
“related” x and the class y are

- Intuition: if x=y for all instances, then x is great no matter what our model
is; x contains all information needed to predict y

▸Pick the n highest scoring features to keep



Individual Feature Irrelevance

[Figure: class-conditional densities of a feature xi for Y = 1 and Y = −1 that coincide. A feature is individually irrelevant when P(Xi, Y) = P(Xi) P(Y), i.e., P(Xi | Y) = P(Xi), so that P(Xi | Y = 1) = P(Xi | Y = −1).]

Slide from I. Guyon, PASCAL Bootcamp in ML
Individual Feature Relevance

[Figure: overlapping class-conditional densities of a feature xi, with class means μ− and μ+ and standard deviations σ− and σ+; the separation between the two densities reflects the feature’s relevance.]

Figure from I. Guyon, PASCAL Bootcamp in ML
Univariate Dependence
▸Independence:
P(X, Y) = P(X) P(Y)

▸Measures of dependence:
- Mutual Information (see notes from board)
- Correlation (see notes from board)

Slide from I. Guyon, PASCAL Bootcamp in ML
Correlation and MI

[Figure: two examples where the linear correlation is essentially zero (R = 0.02 and R = 0.0002) while the mutual information is substantial (MI = 1.03 nat and MI = 1.65 nat), showing that MI can capture non-linear dependence that correlation misses.]

Slide from I. Guyon, PASCAL Bootcamp in ML
Gaussian Distribution

[Figure: jointly Gaussian X and Y; in this case dependence is fully captured by the correlation R.]

MI(X, Y) = −(1/2) log(1 − R²)

Slide from I. Guyon, PASCAL Bootcamp in ML
T-test

[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = −1) with means μ+ and μ− and standard deviations σ+ and σ−.]

• Normally distributed classes, equal variance σ² unknown; estimated from the data as σ²within.
• Null hypothesis H0: μ+ = μ−
• T statistic: if H0 is true, t = (μ+ − μ−) / (σwithin sqrt(1/m+ + 1/m−)) ~ Student(m+ + m− − 2 d.f.), where m+ and m− are the numbers of examples in the two classes.

Slide from I. Guyon, PASCAL Bootcamp in ML
Considering each feature alone may fail

[Figure: example from Guyon & Elisseeff (JMLR 2004; Springer 2006) in which features that look uninformative individually are informative when taken together.]


Filtering
▸Advantages:
- Fast, simple to apply

▸Disadvantages:
- Doesn’t take into account which learning algorithm will be used
- Doesn’t take into account correlations between features, just correlation of
each feature to the class label



Feature Selection Methods

Filter:
All Features → Filter → Selected Features → Supervised Learning Algorithm → Classifier

Wrapper:
All Features → [Feature Subset Search ⇄ Feature Evaluation Criterion (returns a criterion value)] → Selected Features → Supervised Learning Algorithm → Classifier




Search Method: sequential forward search (within the wrapper framework above)

[Diagram: starting from the single features A, B, C, D, the best one (B) is extended to the pairs {A,B}, {B,C}, {B,D}; the best pair is then extended to the triplets {A,B,C}, {B,C,D}; and so on.]


Search Method: sequential backward elimination (within the wrapper framework above)

[Diagram: starting from the full set {A,B,C,D}, the subsets {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D} are evaluated and C is dropped; then {A,B}, {A,D}, {B,D} are evaluated and B is dropped; then {A}, {D}; and so on.]


▸ Forward or backward selection? Of the three variables of this example, the third one
separates the two classes best by itself (bottom right histogram). It is therefore the best
candidate in a forward selection process. Still, the two other variables are better taken
together than any subset of two including it. A backward selection method may perform
better in this case.



Model search
▸More sophisticated search strategies exist
- Best-first search
- Stochastic search
- See “Wrappers for Feature Subset Selection”, Kohavi and
John 1997
▸Other objective functions exist which add a model-complexity
penalty to the training error
- AIC, BIC



Regularization

▸In certain cases, we can move model selection into the induction
algorithm
- Only have to fit one model; more efficient
▸This is sometimes called an embedded feature selection
algorithm



Regularization
▸Regularization: add a model complexity penalty to the training error and minimize the combined objective,
training error(w) + C · penalty(w), for some constant C
▸Regularization forces weights to be small, but does it force weights to be exactly zero?
- Setting a weight wf to exactly zero is equivalent to removing feature f from the model


Kernel Methods (Quick Review)
▸Expanding feature space gives us new potentially useful features
▸Kernel methods let us work implicitly in a high-dimensional
feature space
- All calculations performed quickly in low-dimensional space



Feature Engineering
▸Linear models: convenient,
fairly broad, but limited
▸We can increase the
expressiveness of linear
models by expanding the
feature space.
- E.g., map x = (x1, x2) to the quadratic features φ(x) = (1, x1, x2, x1², x2², x1x2)
- Now the feature space is R⁶ rather than R²
- An example linear predictor in these features: w·φ(x) = w0 + w1x1 + w2x2 + w3x1² + w4x2² + w5x1x2


Individual Feature Relevance

[Figure: ROC curve (sensitivity vs. 1 − specificity) for thresholding a single feature xi; the area under the curve (AUC) measures the feature’s individual relevance.]

Slide from I. Guyon, PASCAL Bootcamp in ML
L1 versus L2 Regularization

[Figure: comparison of the L1 and L2 penalties; the L1 penalty tends to drive some weights exactly to zero (sparse solutions), while the L2 penalty only shrinks them.]
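To see the practical difference, the scikit-learn sketch below fits L1- (Lasso) and L2- (Ridge) penalized linear models to synthetic data in which only 3 of 20 features matter; the penalty strengths are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                        # only the first 3 features are relevant
y = X @ w_true + 0.1 * rng.standard_normal(100)

w_l1 = Lasso(alpha=0.1).fit(X, y).coef_
w_l2 = Ridge(alpha=1.0).fit(X, y).coef_
print((np.abs(w_l1) < 1e-8).sum())   # L1: many weights exactly zero (embedded selection)
print((np.abs(w_l2) < 1e-8).sum())   # L2: weights shrink, but typically none are exactly zero
```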


Feature Extraction
▸Linear combinations are particularly attractive because they
are simpler to compute and analytically tractable.

▸Given x ∈ Rᴺ, find a K x N matrix T such that:
y = Tx ∈ Rᴷ, where K << N

x = [x1, x2, ..., xN]ᵀ →(T)→ y = [y1, y2, ..., yK]ᵀ

This is a projection from the N-dimensional space to a K-dimensional space.
Feature Extraction (cont’d)
▸From a mathematical point of view, finding an optimum
mapping y=𝑓(x) is equivalent to optimizing an objective
criterion.

▸Different methods use different objective criteria, e.g.,

- Minimize Information Loss: represent the data as accurately as possible in the lower-dimensional space.

- Maximize Discriminatory Information: enhance the class-discriminatory information in the lower-dimensional space.
Feature Extraction (cont’d)

▸Popular linear feature extraction methods:


- Principal Components Analysis (PCA): Seeks a projection
that preserves as much information in the data as possible.
- Linear Discriminant Analysis (LDA): Seeks a projection that
best discriminates the data.

▸Many other methods:


- Making features as independent as possible (Independent
Component Analysis or ICA).
- Retaining interesting directions (Projection Pursuit).
- Embedding to lower dimensional manifolds (Isomap, Locally
Linear Embedding or LLE).

Vector Representation

▸A vector x ∈ Rᴺ can be represented by N components: x = [x1, x2, ..., xN]ᵀ

▸Assuming the standard basis <v1, v2, …, vN> (i.e., unit vectors in each dimension), xi can be obtained by projecting x along the direction of vi:
xi = xᵀvi / (viᵀvi) = xᵀvi

▸x can be “reconstructed” from its N projections as follows:
x = Σ_{i=1}^{N} xi vi = x1v1 + x2v2 + ... + xNvN

• Since the basis vectors are the same for all x ∈ Rᴺ (standard basis), we typically represent each vector simply as its N-component coordinate vector.
Vector Representation (cont’d)

▸Example assuming n = 2: x = [x1, x2]ᵀ = [3, 4]ᵀ

▸Assuming the standard basis <v1 = i, v2 = j>, xi can be obtained by projecting x along the direction of vi:
x1 = xᵀi = [3 4][1 0]ᵀ = 3
x2 = xᵀj = [3 4][0 1]ᵀ = 4

▸x can be “reconstructed” from its projections as follows: x = 3i + 4j
Principal Component Analysis (PCA)

• If x ∈ Rᴺ, then it can be written as a linear combination of an orthonormal set of N basis vectors <v1, v2, …, vN> in Rᴺ (e.g., using the standard basis):
x = Σ_{i=1}^{N} xi vi = x1v1 + x2v2 + ... + xNvN
where viᵀvj = 1 if i = j and 0 otherwise, and xi = xᵀvi / (viᵀvi) = xᵀvi

• PCA seeks to approximate x in a subspace of Rᴺ using a new set of K << N basis vectors <u1, u2, …, uK> in Rᴺ:
x̂ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK   (reconstruction)
where yi = xᵀui / (uiᵀui) = xᵀui
such that ||x − x̂|| is minimized (i.e., minimize the information loss).
Principal Component Analysis (PCA)
• The “optimal” set of basis vectors <u1, u2, …, uK> can be found as follows (we will see why):

(1) Find the eigenvectors ui of the covariance matrix Σx of the (training) data:
Σx ui = λi ui

(2) Choose the K “largest” eigenvectors ui (i.e., those corresponding to the K “largest” eigenvalues λi)

<u1, u2, …, uK> correspond to the “optimal” basis!

We refer to the “largest” eigenvectors ui as principal components.
PCA - Steps

▸Suppose we are given M vectors x1, x2, ..., xM, each of size N x 1 (N: # of features, M: # of data points)

Step 1: compute the sample mean
x̄ = (1/M) Σ_{i=1}^{M} xi

Step 2: subtract the sample mean (i.e., center the data at zero)
Φi = xi − x̄

Step 3: compute the sample covariance matrix Σx
Σx = (1/M) Σ_{i=1}^{M} (xi − x̄)(xi − x̄)ᵀ = (1/M) Σ_{i=1}^{M} Φi Φiᵀ = (1/M) A Aᵀ
where A = [Φ1 Φ2 ... ΦM] is the N x M matrix whose columns are the Φi
PCA - Steps

Step 4: compute the eigenvalues/eigenvectors of Σx
Σx ui = λi ui, where we assume λ1 ≥ λ2 ≥ ... ≥ λN
(Note: most software packages return the eigenvalues, and corresponding eigenvectors, in decreasing order; if not, you can explicitly put them in this order.)

Since Σx is symmetric, <u1, u2, …, uN> form an orthogonal basis in Rᴺ and we can represent any x ∈ Rᴺ as:
x − x̄ = Σ_{i=1}^{N} yi ui = y1u1 + y2u2 + ... + yNuN
where yi = (x − x̄)ᵀui / (uiᵀui) = (x − x̄)ᵀui if ||ui|| = 1

i.e., this is just a “change” of basis: [x1, x2, ..., xN]ᵀ → [y1, y2, ..., yN]ᵀ

(Note: most software packages normalize ui to unit length to simplify calculations; if not, you can explicitly normalize them.)
PCA - Steps

Step 5: dimensionality reduction step – approximate x using only the first K eigenvectors (K << N), i.e., those corresponding to the K largest eigenvalues, where K is a parameter:

x − x̄ = Σ_{i=1}^{N} yi ui = y1u1 + y2u2 + ... + yNuN
is approximated, using the first K eigenvectors only, by
x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK   (reconstruction)

In coordinates: x − x̄ : [x1, ..., xN]ᵀ → [y1, ..., yN]ᵀ → x̂ − x̄ : [y1, ..., yK]ᵀ

Note that if K = N, then x̂ = x (i.e., zero reconstruction error).
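The five steps above can be written directly in NumPy. A minimal sketch follows; it assumes the data matrix X stores one sample per row (so the roles of A and Aᵀ are swapped relative to the slides), and the function names are ours, not part of any particular library.

```python
import numpy as np

def pca_fit(X, K):
    """PCA Steps 1-5 for a data matrix X of shape (M, N): M samples, N features."""
    mean = X.mean(axis=0)                        # Step 1: sample mean
    Phi = X - mean                               # Step 2: center the data
    Sigma = (Phi.T @ Phi) / X.shape[0]           # Step 3: covariance matrix (N x N)
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # Step 4: eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]            # sort in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :K]                           # Step 5: keep the first K eigenvectors
    return mean, U, eigvals

def pca_transform(X, mean, U):
    return (X - mean) @ U                        # y = U^T (x - mean), one row per sample

def pca_reconstruct(Y, mean, U):
    return Y @ U.T + mean                        # x_hat = sum_i y_i u_i + mean
```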
What is the Linear Transformation implied by PCA?

• The linear transformation y = Tx which performs the dimensionality reduction in PCA is:
x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK

(x̂ − x̄) = U [y1, y2, ..., yK]ᵀ, where U = [u1 u2 ... uK] is the N x K matrix whose columns are the first K eigenvectors of Σx

[y1, y2, ..., yK]ᵀ = Uᵀ(x̂ − x̄), i.e., T = Uᵀ is the K x N matrix whose rows are the first K eigenvectors of Σx
What is the form of Σy?

Σx = (1/M) Σ_{i=1}^{M} (xi − x̄)(xi − x̄)ᵀ = (1/M) Σ_{i=1}^{M} Φi Φiᵀ

Using diagonalization: Σx = P Λ Pᵀ, where the columns of P are the eigenvectors of Σx and the diagonal elements of Λ are the eigenvalues of Σx, i.e., the variances.

With U = P (taking all N eigenvectors): yi = Uᵀ(xi − x̄) = Pᵀ Φi

Σy = (1/M) Σ_{i=1}^{M} (yi − ȳ)(yi − ȳ)ᵀ = (1/M) Σ_{i=1}^{M} yi yiᵀ
   = (1/M) Σ_{i=1}^{M} (Pᵀ Φi)(Pᵀ Φi)ᵀ = Pᵀ ( (1/M) Σ_{i=1}^{M} Φi Φiᵀ ) P = Pᵀ Σx P = Pᵀ (P Λ Pᵀ) P = Λ

Σy = Λ: PCA de-correlates the data and preserves the original variances!
Interpretation of PCA

• PCA chooses the eigenvectors of the covariance matrix corresponding to the largest eigenvalues.
• The eigenvalues correspond to the variance of the data along the eigenvector directions.
• Therefore, PCA projects the data along the directions where the data varies most.
• PCA preserves as much information in the data as possible by preserving as much of its variance as possible.

[Figure: 2D data cloud with u1, the direction of maximum variance, and u2, orthogonal to u1.]
Example
▸Compute the PCA of the following dataset:

(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)

▸Compute the sample covariance matrix:
Σ̂ = (1/n) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)ᵀ

▸The eigenvalues can be computed by finding the roots of the characteristic polynomial det(Σ̂ − λI) = 0.
Example (cont’d)
▸The eigenvectors are the solutions of the systems: Σx ui = λi ui

Note: if ui is a solution, then c·ui is also a solution for any c ≠ 0.

Eigenvectors can be normalized to unit length using: v̂i = vi / ||vi||
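For reference, here is a short NumPy check of the worked example above, under the same 1/n covariance convention; the printed eigenvalues and unit-length eigenvectors are what the hand computation should produce.

```python
import numpy as np

# The eight 2-D points from the example
X = np.array([[1, 2], [3, 3], [3, 5], [5, 4], [5, 6], [6, 5], [8, 7], [9, 8]], float)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)        # sample covariance matrix (1/n convention)
eigvals, eigvecs = np.linalg.eigh(Sigma)      # roots of det(Sigma - lambda*I) = 0

print(mu)        # sample mean
print(Sigma)     # 2 x 2 covariance matrix
print(eigvals)   # eigenvalues (ascending order from eigh)
print(eigvecs)   # unit-length eigenvectors (columns) = principal directions
```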
How do we choose K?

• K is typically chosen based on how much information (variance) we want to preserve.

• Choose the smallest K that satisfies the following inequality:
(Σ_{i=1}^{K} λi) / (Σ_{i=1}^{N} λi) > T, where T is a threshold (e.g., 0.9)

• If T = 0.9, for example, we “preserve” 90% of the information (variance) in the data.

• If K = N, then we “preserve” 100% of the information in the data (i.e., it is just a “change” of basis and x̂ = x).
Approximation Error

• The approximation error (or reconstruction error) can be computed by ||x − x̂||, where
x̂ = Σ_{i=1}^{K} yi ui + x̄ = y1u1 + y2u2 + ... + yKuK + x̄   (reconstruction)

• It can also be shown that the approximation error can be computed as follows:
||x − x̂|| = (1/2) Σ_{i=K+1}^{N} λi
Data Normalization

• The principal components are dependent on the units used to measure the original variables, as well as on the range of values they assume.

• Data should always be normalized prior to using PCA.

• A common normalization method is to transform all the data to have zero mean and unit standard deviation:
(xi − μ) / σ, where μ and σ are the mean and standard deviation of the i-th feature xi
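A minimal NumPy sketch of this zero-mean, unit-variance normalization, applied column-wise before running PCA; the guard against zero variance is our own addition.

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard: leave constant features unscaled
    return (X - mu) / sigma, mu, sigma

# Usage sketch: standardize first, then run PCA on the scaled data
# X_scaled, mu, sigma = standardize(X)
# mean, U, eigvals = pca_fit(X_scaled, K)   # pca_fit from the earlier PCA sketch
```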
Application to Images
• The goal is to represent images in a space of lower
dimensionality using PCA.
− Useful for various applications, e.g., face recognition, image
compression, etc.
• Given M images of size N x N, first represent each image
as a 1D vector (i.e., by stacking the rows together).
− Note that for face recognition, faces must be centered and of the
same size.

Application to Images (cont’d)

▸The key challenge is that the covariance matrix Σx is now very large (i.e., N² x N²) – see Step 3:

Step 3: compute the covariance matrix Σx
Σx = (1/M) Σ_{i=1}^{M} Φi Φiᵀ = (1/M) A Aᵀ, where A = [Φ1 Φ2 ... ΦM] is now an N² x M matrix

▸Σx is now an N² x N² matrix – it is computationally expensive to compute its eigenvalues/eigenvectors λi, ui:
(AAᵀ)ui = λi ui
Application to Images (cont’d)

▸We will use a simple “trick” to get around this by relating the eigenvalues/eigenvectors of AAᵀ to those of AᵀA.

▸Let us consider the matrix AᵀA instead (i.e., an M x M matrix):
- Suppose its eigenvalues/eigenvectors are μi, vi: (AᵀA)vi = μi vi
- Multiply both sides by A: A(AᵀA)vi = A μi vi, or (AAᵀ)(Avi) = μi (Avi)
- Assuming (AAᵀ)ui = λi ui, where A = [Φ1 Φ2 ... ΦM] is the N² x M matrix:
λi = μi and ui = A vi
Application to Images (cont’d)

▸But do AAᵀ and AᵀA have the same number of eigenvalues/eigenvectors?
- AAT can have up to N2 eigenvalues/eigenvectors.
- ATA can have up to M eigenvalues/eigenvectors.
- It can be shown that the M eigenvalues/eigenvectors of ATA
are also the M largest eigenvalues/eigenvectors of AAT

▸Steps 3-5 of PCA need to be updated as follows:

Application to Images (cont’d)
Step 3: compute AᵀA (i.e., instead of AAᵀ)

Step 4: compute the eigenvalues/eigenvectors μi, vi of AᵀA

Step 4b: compute the λi, ui of AAᵀ using λi = μi and ui = A vi, then normalize ui to unit length.

Step 5: dimensionality reduction step – approximate x using only the first K eigenvectors (K < M):
x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK
so each image can be represented by a K-dimensional vector [y1, y2, ..., yK]ᵀ.
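The AᵀA trick above is easy to express in NumPy. The sketch below assumes A holds one centered, vectorized image per column (shape N² x M with N² >> M); the function name is ours.

```python
import numpy as np

def pca_highdim(A, K):
    """Eigenfaces trick: eigenvectors of A A^T via the small M x M matrix A^T A."""
    M = A.shape[1]
    small = (A.T @ A) / M                       # M x M instead of N^2 x N^2
    mu, V = np.linalg.eigh(small)               # eigenpairs (mu_i, v_i) of the small matrix
    order = np.argsort(mu)[::-1][:K]            # keep the K largest
    mu, V = mu[order], V[:, order]
    U = A @ V                                   # u_i = A v_i are eigenvectors of (1/M) A A^T
    U /= np.linalg.norm(U, axis=0)              # Step 4b: normalize u_i to unit length
    return U, mu                                # lambda_i = mu_i
```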
Example

[Figure: the training dataset of face images.]
Example (cont’d)

[Figure: the top eigenvectors u1, u2, u3, ..., uK visualized as images (eigenfaces), together with the mean face x̄.]
Example (cont’d)
• How can you visualize the eigenvectors u1, u2, u3, ... (eigenfaces) as an image?
− Their values must first be mapped to integer values in the interval [0, 255] (required by the PGM format).
− Suppose fmin and fmax are the min/max values of a given eigenface (they could be negative).
− If x ∈ [fmin, fmax] is the original value, then the new value y ∈ [0, 255] can be computed as follows:
y = (int) 255 (x − fmin)/(fmax − fmin)
Application to Images (cont’d)
• Interpretation: represent a face in terms of eigenfaces:
x̂ = Σ_{i=1}^{K} yi ui + x̄ = y1u1 + y2u2 + ... + yKuK + x̄

[Figure: a face approximated as y1u1 + y2u2 + y3u3 + ... + x̄, a weighted sum of eigenfaces plus the mean face; the weights [y1, ..., yK]ᵀ represent the face.]
Case Study: Eigenfaces for Face
Detection/Recognition

− M. Turk and A. Pentland, “Eigenfaces for Recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

• Face Recognition
− The simplest approach is to think of it as a template matching
problem.

− Problems arise when performing recognition in a high-dimensional


space.

− Use dimensionality reduction!

Face Recognition Using Eigenfaces

▸Process the image database (i.e., a set of images with labels) – typically referred to as the “training” phase:

- Compute the PCA space using the image database (i.e., the training data)

- Represent each image in the database with K coefficients: Ωi = [y1, y2, ..., yK]ᵀ


Face Recognition Using Eigenfaces
Given an unknown face x, follow these steps:

Step 1: Subtract the mean face (computed from the training data): Φ = x − x̄

Step 2: Project the unknown face onto the eigenspace:
Φ̂ = Σ_{i=1}^{K} yi ui, where yi = Φᵀui, giving Ω = [y1, y2, ..., yK]ᵀ

Step 3: Find the closest match Ωi from the training set using:
e_r = min_i ||Ω − Ωi|| = min_i sqrt( Σ_{j=1}^{K} (y_j − y_j^i)² )   (Euclidean distance)
or min_i sqrt( Σ_{j=1}^{K} (1/λj)(y_j − y_j^i)² )   (Mahalanobis distance)
The distance e_r is called the distance in face space (difs).

Step 4: Recognize x as person “k”, where k is the ID linked to Ωi.

Note: for intruder rejection, we need e_r < Tr, for some threshold Tr.
Face detection vs recognition

[Figure: detection locates the faces in an image; recognition identifies each detected face (e.g., “Sally”).]
Face Detection Using Eigenfaces

Given an unknown image x, follow these steps:

Step 1: Subtract the mean face (computed from the training data): Φ = x − x̄

Step 2: Project the unknown image onto the eigenspace:
Φ̂ = Σ_{i=1}^{K} yi ui, where yi = Φᵀui

Step 3: Compute e_d = ||Φ − Φ̂||
The distance e_d is called the distance from face space (dffs).

Step 4: If e_d < Td, then x is a face.
Eigenfaces

[Figure: input images and their reconstructions from the eigenface space; in each case the reconstructed image looks like a face.]
Reconstruction from partial information
• Robust to partial face occlusion.

[Figure: occluded input faces and their reconstructions from the eigenface space.]
Eigenfaces
• Can be used for face detection, tracking, and recognition!
• Visualize the dffs as an image: e_d = ||Φ − Φ̂||

[Figure: dffs map over an input image; dark pixels indicate small distance, bright pixels large distance.]
Limitations
• Background changes cause problems
− De-emphasize the outside of the face (e.g., by multiplying the input
image by a 2D Gaussian window centered on the face).
• Light changes degrade performance
− Light normalization might help but this is a challenging issue.
• Performance decreases quickly with changes to face size
− Scale input image to multiple sizes.
− Multi-scale eigenspaces.
• Performance decreases with changes to face orientation
(but not as fast as with scale changes)
− Out-of-plane rotations are more difficult to handle.
− Multi-orientation eigenspaces.

Limitations (cont’d)
▸Not robust to misalignment.

Limitations (cont’d)
• PCA is not always an optimal dimensionality-reduction
technique for classification purposes.

Linear Discriminant Analysis (LDA)

• What is the goal of LDA?


− Seeks to find directions along which the classes are best
separated (i.e., increase discriminatory information).
− It takes into consideration the scatter (i.e., variance) within-
classes and between-classes.

[Figure: two candidate projection directions for two-class data – one gives bad separability (the projected classes overlap), the other gives good separability.]
Linear Discriminant Analysis (LDA)
▸ Let us assume C classes, with class i containing Mi samples, i = 1, 2, ..., C, and M the total number of samples:
M = Σ_{i=1}^{C} Mi

▸ Let μi be the mean of the i-th class, i = 1, 2, ..., C, and μ the mean of the whole dataset:
μ = (1/C) Σ_{i=1}^{C} μi

▸ Within-class scatter matrix:
Sw = Σ_{i=1}^{C} Σ_{j=1}^{Mi} (xj − μi)(xj − μi)ᵀ

▸ Between-class scatter matrix:
Sb = Σ_{i=1}^{C} (μi − μ)(μi − μ)ᵀ
Linear Discriminant Analysis (LDA)
• Suppose the desired projection transformation is: y = Uᵀx

• Suppose the scatter matrices of the projected data y are S̃b and S̃w.

• LDA seeks transformations that maximize the between-class scatter and minimize the within-class scatter:
max |S̃b| / |S̃w|, or equivalently max |UᵀSbU| / |UᵀSwU|
Linear Discriminant Analysis (LDA)

• It can be shown that the columns of the matrix U are the eigenvectors (called Fisherfaces) corresponding to the largest eigenvalues of the following generalized eigen-problem:
Sb uk = λk Sw uk

• It can be shown that Sb has rank at most C − 1; therefore, the maximum number of eigenvectors with non-zero eigenvalues is C − 1, that is:

the max dimensionality of the LDA sub-space is C − 1

e.g., when C = 2, we always end up with one LDA feature, no matter what the original number of features was!
Example

[Figure: LDA projection example.]
Linear Discriminant Analysis (LDA)

• If Sw is non-singular, we can solve a conventional eigenvalue problem as follows:
Sb uk = λk Sw uk   →   Sw⁻¹ Sb uk = λk uk

• In practice, Sw is singular due to the high dimensionality of the data (e.g., images) and the much lower number of data points (M << N).
Linear Discriminant Analysis (LDA)
• To alleviate this problem, PCA could be applied first:

1) First, apply PCA to reduce the data dimensionality:
x = [x1, x2, ..., xN]ᵀ →(PCA)→ y = [y1, y2, ..., yM]ᵀ

2) Then, apply LDA to find the most discriminative directions:
y = [y1, y2, ..., yM]ᵀ →(LDA)→ z = [z1, z2, ..., zK]ᵀ
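One way to realize this two-stage PCA-then-LDA projection is with scikit-learn, sketched below on the bundled digits dataset as a stand-in for face images; the component counts (40 PCA components, C − 1 = 9 LDA components for 10 classes) are illustrative choices, not values from the slides.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)          # 10 classes, 64 features per image

# PCA first (keeps S_w well-conditioned), then LDA (at most C - 1 = 9 components)
pca_lda = make_pipeline(PCA(n_components=40),
                        LinearDiscriminantAnalysis(n_components=9))
Z = pca_lda.fit_transform(X, y)
print(Z.shape)                               # (n_samples, 9)
```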
