
Dimensionality Reduction in Machine Learning

Dr. Shaikh Fattah


Professor, Department of EEE, BUET
Chair, IEEE Signal Processing Society Bangladesh Chapter

Brief Biography
▸ Prof., Dept. of EEE, BUET
▸ Director, Inst. of Robotics & Automation, BUET
▸ Ex-Director, Inst. of Nuclear Power Eng., BUET
▸ Editorial Board Member, IEEE Access
▸ Chair, IEEE R-10 Ad Hoc on Climate Change
▸ Chair, IEEE PES-Humanitarian Activity Comm.
▸ Chair, IEEE SSIT Chapters
▸ Education Chair, IEEE HAC (2018-20)
▸ Member, IEEE PES LRP Committee
▸ Chair, IEEE SPS, SSIT Vice-Chair, IEEE RAS BDC
▸ Chair ('15-16), N&A Chair ('17-20), Bangladesh Section
▸ General Chair, ICAICT 20, RAAICON, R10-HTC 17
▸ TPC Chairs: TENSYMP 20, WIECON, ICIVPR, ICAEE



Major Awards
▸Concordia University's Distinguished
Doctoral Dissertation Prize (2009),
▸2007 URSI Canadian Young Scientist Award
▸Dr. Rashid Gold Medal (in MSc)
▸BAS-TWAS Young Scientists Prize (2014)
▸2018 R10 Outstanding Volunteer Award
▸2016 IEEE MGA Achievement Award
▸2017 IEEE R10 HTA Outstanding Vol. Award
▸2016 IEEE R10 HTA Outstanding Activities
Award for Bangladesh Section
▸ Best paper awards: IEEE BECITHCON 2019, Biomedical Track at IEEE TENCON 2017, WIECON-ECE 2016
▸ Faculty Advisor: IEEE SP Cup 2022 Champion, 2023 1st Runner-Up; IEEE VIP Cup 2020 1st Runner-Up



Why Reduce Dimensionality?

1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the features
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Why Component Analysis?
• We have too many observations and dimensions
– To reason about or obtain insights from
– To visualize
– Too much noise in the data
– Need to “reduce” them to a smaller set of factors
– Better representation of data without losing much
information
– Can build more effective data analyses on the reduced-
dimensional space: classification, clustering, pattern
recognition

• Combinations of observed variables may be more effective bases for insights, even if their physical meaning is obscure
Basic Concept of Dimensionality Reduction
▸Areas of variance in data are where items can be best
discriminated and key underlying phenomena observed
- Areas of greatest “signal” in the data
▸If two items or dimensions are highly correlated or dependent
- They are likely to represent highly related phenomena
- If they tell us about the same underlying variance in the data,
combining them to form a single measure is reasonable
• Parsimony
• Reduction in Error
▸So we want to combine related variables, and focus on
uncorrelated or independent ones, especially those along which
the observations have high variance
▸We want a smaller set of variables that explain most of the
variance in the original data, in more compact and insightful form



Basic Concept of Dimensionality Reduction
▸What if the dependences and correlations are not so strong or
direct?
▸And suppose you have 3 variables, or 4, or 5, or 10000?
▸Look for the phenomena underlying the observed covariance/co-
dependence in a set of variables
- Once again, phenomena that are uncorrelated or independent, and
especially those along which the data show high variance

▸These phenomena are called “factors,” “principal components,” or “independent components,” depending on the method used
- Factor analysis: based on variance/covariance/correlation
- Independent Component Analysis: based on independence



Feature Selection vs Extraction

▸Feature selection: Choose k < d important features, ignoring the remaining d − k
Subset selection algorithms
▸Feature extraction: Project the original x_i, i = 1, ..., d dimensions to new k < d dimensions, z_j, j = 1, ..., k
Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)

Subset Selection Method
Forward search:
Add the best feature at each step
- The set of features F is initially empty (Ø).
- At each iteration, find the best new feature: j = argmin_i E(F ∪ {x_i})
- Add x_j to F if E(F ∪ {x_j}) < E(F)
Backward search:
▸Start with all features and remove one at a time, if possible.

Principal Component Analysis

▸Most common form of factor analysis


▸The new variables/dimensions
- Are linear combinations of the original ones
- Are uncorrelated with one another
• Orthogonal in original dimension space
- Capture as much of the original variance in the data as possible
- Are called Principal Components



What are the new axes?

[Figure: data scatter over original variables A and B, with orthogonal axes PC 1 and PC 2 overlaid along the directions of greatest variance.]

• Orthogonal directions of greatest variance in data
• Projections along PC 1 discriminate the data most along any one axis



Principal Components

▸First principal component is the direction of greatest variability (covariance) in the data
▸Second is the next orthogonal (uncorrelated) direction of
greatest variability
- So first remove all the variability along the first
component, and then find the next direction of greatest
variability
▸And so on …



Principal Components Analysis (PCA)
- Linear projection method to reduce the number of parameters
- Transform a set of correlated variables into a new set of uncorrelated variables
- Map the data into a space of lower dimensionality
- Form of unsupervised learning

Mathematically:
- It can be viewed as a rotation of the existing axes to new
positions in the space defined by original variables
- New axes are orthogonal and represent the directions with
maximum variability



Computing the Components
▸Data points are vectors in a multidimensional space
▸Projection of a vector x onto an axis (dimension) u is uᵀx
▸The direction of greatest variability is the one in which the average square of the projection is greatest
- i.e., u such that E((uᵀx)²) over all x is maximized
- (we subtract the mean along each dimension and center the original axis system at the centroid of all data points, for simplicity)
- This u is the direction of the first principal component



Computing the Components
▸E((uᵀx)²) = E((uᵀx)(uᵀx)ᵀ) = E(uᵀx xᵀu)
▸The matrix C = x xᵀ contains the correlations (similarities) of the original axes, based on how the data values project onto them
▸So we are looking for u that maximizes uᵀCu, subject to u being unit-length
▸It is maximized when u is the principal eigenvector of the matrix C, in which case
- uᵀCu = uᵀλu = λ if u is unit-length, where λ is the principal eigenvalue of the correlation matrix C
- The eigenvalue denotes the amount of variability captured along that dimension



Why the Eigenvectors?

Maximize uᵀx xᵀu subject to uᵀu = 1.
Construct the Lagrangian: uᵀx xᵀu − λ(uᵀu − 1).
Set the vector of partial derivatives to zero:
x xᵀu − λu = (x xᵀ − λI)u = 0
Since u ≠ 0, u must be an eigenvector of x xᵀ with eigenvalue λ.



Singular Value Decomposition

The first root is called the principal eigenvalue, which has an associated orthonormal (uᵀu = 1) eigenvector u.
Subsequent roots are ordered such that λ1 > λ2 > … > λM, with rank(D) non-zero values.
The eigenvectors form an orthonormal basis, i.e., uiᵀuj = δij.
The eigenvalue decomposition of x xᵀ is x xᵀ = U Σ Uᵀ, where U = [u1, u2, …, uM] and Σ = diag[λ1, λ2, …, λM].
Similarly, the eigenvalue decomposition of xᵀx is xᵀx = V Σ Vᵀ.
The SVD is closely related to the above: x = U Σ^(1/2) Vᵀ, with left eigenvectors U, right eigenvectors V, and singular values equal to the square roots of the eigenvalues.
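As a quick sanity check of this relationship, the NumPy sketch below (our own illustration, not from the slides) compares the eigendecompositions of xxᵀ and xᵀx with the SVD of a small random data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                 # 5 dimensions, 8 samples (columns)

eig_left, U = np.linalg.eigh(X @ X.T)           # eigendecomposition of X X^T
eig_right, V = np.linalg.eigh(X.T @ X)          # eigendecomposition of X^T X
U_s, s, Vt_s = np.linalg.svd(X, full_matrices=False)   # SVD: X = U_s diag(s) V_s^T

# Singular values are the square roots of the (non-zero) eigenvalues of both products
print(np.allclose(np.sort(s**2), np.sort(eig_left)[-len(s):]))    # True
print(np.allclose(np.sort(s**2), np.sort(eig_right)[-len(s):]))   # True
```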



Computing the Components
▸Similarly for the next axis, etc.
▸So, the new axes are the eigenvectors of the matrix of
correlations of the original variables, which captures the
similarities of the original variables based on how data
samples project to them

• Geometrically: centering followed by rotation – a linear transformation
PCs, Variance and Least-Squares
▸The first PC retains the greatest amount of variation in the
sample
▸The kth PC retains the kth greatest fraction of the variation in
the sample
▸The kth largest eigenvalue of the correlation matrix C is the
variance in the sample along the kth PC

▸The least-squares view: PCs are a series of linear least-squares fits to a sample, each orthogonal to all previous ones



Principal Components Analysis (PCA)

▸Find a low-dimensional space such that when x is projected there, information loss is minimized.
▸The projection of x on the direction of w is: z = wᵀx
▸Find w such that Var(z) is maximized:
Var(z) = Var(wᵀx) = E[(wᵀx − wᵀμ)²]
       = E[(wᵀx − wᵀμ)(wᵀx − wᵀμ)]
       = E[wᵀ(x − μ)(x − μ)ᵀw]
       = wᵀ E[(x − μ)(x − μ)ᵀ] w = wᵀ Σ w
where Var(x) = E[(x − μ)(x − μ)ᵀ] = Σ

▸Maximize Var(z) subject to ||w|| = 1:
max_{w1} w1ᵀ Σ w1 − α(w1ᵀw1 − 1)
Σ w1 = α w1, that is, w1 is an eigenvector of Σ.
Choose the one with the largest eigenvalue for Var(z) to be maximal.
▸Second principal component: maximize Var(z2) subject to ||w2|| = 1 and w2 orthogonal to w1:
max_{w2} w2ᵀ Σ w2 − α(w2ᵀw2 − 1) − β(w2ᵀw1 − 0)
Σ w2 = α w2, that is, w2 is another eigenvector of Σ, and so on.
What PCA does
z = Wᵀ(x − m)
where the columns of W are the eigenvectors of Σ and m is the sample mean.
This centers the data at the origin and rotates the axes.
How to choose k ?

▸Proportion of Variance (PoV) explained:
PoV = (λ1 + λ2 + ⋯ + λk) / (λ1 + λ2 + ⋯ + λk + ⋯ + λd)
where the λi are sorted in descending order
▸Typically, stop at PoV > 0.9
▸A scree graph plots PoV vs. k; stop at the “elbow”
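A small NumPy helper for this rule is sketched below; the eigenvalues passed in are made-up illustrative values, and plotting the returned PoV values against k gives the scree/PoV curve mentioned above.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose proportion of variance (PoV) exceeds the threshold."""
    lam = np.sort(np.asarray(eigenvalues, float))[::-1]   # lambda_i in descending order
    pov = np.cumsum(lam) / lam.sum()                      # PoV for k = 1, 2, ..., d
    return int(np.argmax(pov > threshold)) + 1, pov

k, pov = choose_k([4.2, 2.8, 1.1, 0.5, 0.2, 0.1])   # made-up eigenvalues
print(k, pov)                                       # k = 3 here (PoV first exceeds 0.9)
```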


How Many PCs?

▸For n original dimensions, the correlation matrix is n x n and has up to n eigenvectors. So n PCs.
▸Where does dimensionality reduction come from?



Dimensionality Reduction

Can ignore the components of lesser significance.

[Figure: bar chart of the variance (%) explained by each of PC1 through PC10.]

You do lose some information, but if the eigenvalues are small, you don’t lose much:
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
Computing PCA: Example

[Worked example omitted: the original slides step through a numerical PCA computation (mean, covariance, eigenvalues/eigenvectors, projection) as a sequence of images.]


Curse of Feature Dimensionality
▸Increasing the number of features will not always improve classification
accuracy. It might actually lead to worse performance.
▸The number of training examples required increases exponentially with
dimensionality.
▸Choose an optimum set of features of lower dimensionality to improve
classification accuracy.

[Figure: partitioning the feature space with k = 3 bins per feature – one feature needs 3 bins, two features need 3² bins, three features need 3³ bins, so the number of cells (and the training data needed to fill them) grows exponentially with dimensionality.]
Feature Dimensionality Reduction
Feature selection: chooses a subset of the original features.
x = [x1, x2, ..., xN]ᵀ → y = [x_{i1}, x_{i2}, ..., x_{iK}]ᵀ, with K << N

Feature extraction: finds a set of new features (via some mapping f()) from the existing features.
x = [x1, x2, ..., xN]ᵀ → y = f(x) = [y1, y2, ..., yK]ᵀ, with K << N

The mapping f() could be linear or non-linear.
Feature Selection vs Feature Extraction

▸Feature Extraction
- New features are combinations (linear for PCA/LDA) of the
original features (difficult to interpret).
- When classifying novel patterns, all features need to be
computed.

▸Feature Selection
- New features represent a subset of the original features.
- When classifying novel patterns, only a small number of features
need to be computed (i.e., faster classification).
Feature Selection
Feature Selection: Main Steps
▸Feature selection is an
optimization problem.
- Step 1: Search the space of
possible feature subsets.

- Step 2: Pick the subset that is optimal or near-optimal with respect to some objective function.
Feature Selection: Main Steps
(cont’d)
Search methods
- Exhaustive
- Heuristic
- Randomized

Evaluation methods
- Filter
- Wrapper
Search Methods
▸Assuming n features, an exhaustive search would require examining all C(n, d) = n!/(d!(n−d)!) possible subsets of size d.

▸The number of subsets grows combinatorially, making exhaustive search impractical.
- e.g., exhaustive evaluation of 10 out of 20 features involves 184,756 feature subsets.
- e.g., exhaustive evaluation of 10 out of 100 involves more than 10¹³ feature subsets.

▸In practice, heuristics are used to speed up the search, but they cannot guarantee optimality.

Evaluation Methods
▸Filter
- Evaluation is independent of
the classification algorithm.
- The objective function is based on the information content (the “goodness”) of the feature subsets, e.g.:
• interclass distance
• statistical dependence
• information-theoretic measures
(e.g., mutual information).
Evaluation Methods (cont’d)
▸Wrapper
- Evaluation uses criteria
related to the classification
algorithm.
- The objective function (“goodness”) is based on the predictive accuracy of the classifier, e.g.:
• recognition accuracy on a
“validation” data set.
Filter vs Wrapper Methods
▸Wrapper Methods
- Advantages
• Achieve higher recognition accuracy compared to filter methods
since they use the classifier itself in choosing the optimum set of
features.
- Disadvantages
• Much slower compared to filter methods since the classifier must
be trained and tested for each candidate feature subset.
• The optimum feature subset might not work well for other
classifiers.
Filter vs Wrapper Methods
(cont’d)
▸Filter Methods
- Advantages
• Much faster than wrapper methods since the objective function
has lower computational requirements.
• The optimum feature set might work well with various classifiers
as it is not tied to a specific classifier.
- Disadvantages
• Achieve lower recognition accuracy compared to wrapper
methods.
• Have a tendency to select more features compared to wrapper
methods.
Naïve Search (heuristic search)
▸Given n features:
1. Sort them in decreasing order of their “goodness” (i.e., based
on some objective function).
2. Choose the top d features from this sorted list.

▸Disadvantages
- Correlation among features is not considered.
- The best pair of features may not even contain the best
individual feature.
Sequential forward selection (SFS)
(heuristic search)

▸First, the best single feature is selected (i.e., using some objective function).
▸Then, pairs of features are formed using one of
the remaining features and this best feature, and
the best pair is selected.
▸Next, triplets of features are formed using one of
the remaining features and these two best
features, and the best triplet is selected.
▸This procedure continues until a predefined
number of features are selected.

Note: SFS performs best when the optimal subset is small.
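A minimal Python sketch of sequential forward selection is shown below. It uses scikit-learn with 3-fold cross-validated accuracy of a logistic-regression classifier as the (assumed) objective function; any other wrapper criterion could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, n_selected):
    """Greedy SFS: repeatedly add the feature that most improves the objective."""
    clf = LogisticRegression(max_iter=1000)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_selected and remaining:
        scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)      # best feature to add at this step
        selected.append(best)
        remaining.remove(best)
    return selected
```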
Example

[Figure: sequential forward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%); the y-axis shows the features added at each iteration (the first iteration is at the bottom). The subset reaching the highest accuracy is marked.]
Sequential backward selection (SBS)
(heuristic search)

▸First, the objective function is computed for all n features.
▸Then, each feature is deleted one at a time, the
objective function is computed for all subsets with
n-1 features, and the worst feature is discarded.
▸Next, each feature among the remaining n-1 is
deleted one at a time, and the worst feature is
discarded to form a subset with n-2 features.
▸This procedure continues until a predefined
number of features are left.

Note: SBS performs best when the optimal subset is large.
Example

[Figure: sequential backward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%); the y-axis shows the features removed at each iteration (the first iteration is at the top). The subset reaching the highest accuracy is marked.]
Bidirectional Search (BDS)
▸BDS applies SFS and SBS
simultaneously:
- SFS is performed from the
empty set.
- SBS is performed from the
full set.
▸To guarantee that SFS and
SBS converge to the same
solution:
- Features already selected by SFS cannot be removed by SBS.
- Features already removed by SBS cannot be added by SFS.
Limitations of SFS and SBS
▸The main limitation of SFS is that it is unable to remove features that stop being useful once other features have been added.
▸The main limitation of SBS is its inability to
reevaluate the usefulness of a feature after it has
been discarded.
▸Enhancements of SFS and SBS:
- “Plus-L, minus-R” selection (LRS)
- Sequential floating forward (or backward) selection
(i.e., SFFS and SFBS)



Feature selection
What is feature selection?

▸Consider our training data as a matrix where each row is a vector and each column is a dimension.
▸For example, consider the matrix for the data x1 = (1, 10, 2), x2 = (2, 8, 0), and x3 = (1, 9, 1).
▸We call each dimension a feature or a column in our
matrix.



Feature selection
▸Useful for high dimensional data such as genomic DNA and text
documents.
▸Methods
- Univariate (looks at each feature independently of others)
• Pearson correlation coefficient
• F-score
• Chi-square
• Signal to noise ratio
• And more such as mutual information, relief
- Multivariate (considers all features simultaneously)
• Dimensionality reduction algorithms
• Linear classifiers such as support vector machine
• Recursive feature elimination



Feature selection

▸Methods are used to rank features by importance
▸The ranking cut-off is determined by the user
▸Univariate methods measure some type of correlation
between two random variables. We apply them to machine
learning by setting one variable to be the label (yi) and the
other to be a fixed feature (xij for fixed j)



Pearson correlation coefficient
▸Measures the correlation between two variables
▸Formulas:
- Covariance(X, Y) = E((X − μX)(Y − μY))
- Correlation(X, Y) = Covariance(X, Y) / (σX σY)
- Sample Pearson correlation: r = Σi (xi − x̄)(yi − ȳ) / sqrt( Σi (xi − x̄)² · Σi (yi − ȳ)² )

▸The correlation r is between −1 and 1. A value of 1 means perfect positive correlation and −1 perfect negative correlation.
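A minimal NumPy implementation of the sample formula above, applied to the tiny 3-sample matrix used earlier in this lecture (the label vector y is made up for illustration):

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation between a feature column x and the labels y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

X = np.array([[1.0, 10, 2], [2, 8, 0], [1, 9, 1]])   # rows x1, x2, x3 from the earlier slide
y = np.array([0.0, 1.0, 0.0])                        # hypothetical labels
ranking = sorted(range(X.shape[1]), key=lambda j: -abs(pearson_r(X[:, j], y)))
print(ranking)                                       # features ranked by |r|
```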



[Figure: scatter plots illustrating different values of the Pearson correlation coefficient (from Wikipedia).]
Signal to noise ratio

▸Difference in class means divided by the sum of the class standard deviations
▸S2N(X, Y) = (μX − μY) / (σX + σY), where X and Y denote the two classes
▸Large (absolute) values indicate a feature strongly correlated with the class labels
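A short sketch of this score for a binary problem, assuming 0/1 labels and the sum-of-standard-deviations (Golub et al.) convention used above:

```python
import numpy as np

def s2n(feature, labels):
    """Signal-to-noise ratio of one feature for a binary (0/1) classification problem."""
    a, b = feature[labels == 1], feature[labels == 0]
    return (a.mean() - b.mean()) / (a.std(ddof=1) + b.std(ddof=1))
```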



Multivariate feature selection

▸Consider the weight vector w of any linear classifier.
▸Classification of a point x is given by wᵀx + w0.
▸Small entries of w will have little effect on the dot product
and therefore those features are less relevant.
▸For example if w = (10, .01, -9) then features 0 and 2 are
contributing more to the dot product than feature 1. A
ranking of features given by this w is 0, 2, 1.



Multivariate feature selection

▸The w can be obtained from any of the linear classifiers we have seen in class so far.
▸A variant of this approach is called recursive feature elimination (RFE):
1. Compute w on all features
2. Remove the feature with the smallest |wi|
3. Recompute w on the reduced data
4. If the stopping criterion is not met, go to step 2
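A minimal sketch of recursive feature elimination with scikit-learn, assuming a binary problem and a linear SVM as the classifier supplying w:

```python
import numpy as np
from sklearn.svm import LinearSVC

def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly drop the feature whose weight has the smallest magnitude."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        w = LinearSVC(max_iter=10000).fit(X[:, keep], y).coef_.ravel()  # binary case
        keep.pop(int(np.argmin(np.abs(w))))      # remove feature with smallest |w_i|
    return keep
```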



Limitations

▸It is unclear how to tell in advance whether feature selection will work
- The only known way is to check, but for very high-dimensional data (at least half a million features) it helps most of the time
▸How many features to select?
- Perform cross-validation



Creating Features
▸“Good” features are the key to accurate generalization

▸Domain knowledge can be used to generate a feature set


- Medical Example: results of blood tests, age, smoking history
- Game Playing example: number of pieces on the board, control of the
center of the board

▸Data might not be in vector form


- Example: spam classification
• “Bag of words”: throw out order, keep count of how many times each
word appears.
• Sequence: one feature for first letter in the email, one for second letter,
etc.
• N-grams: one feature for every unique string of n characters or words



What is feature selection?

▸Reducing the feature space by throwing out some of the features



Reasons for Feature Selection
▸Want to find which features are relevant
- Domain specialist not sure which factors are predictive of disease
- Common practice: throw in every feature you can think of, let feature selection
get rid of useless ones
▸Want to maximize accuracy, by removing irrelevant and noisy features
- For Spam, create a feature for each of ~10⁵ English words
- Training with all features computationally expensive
- Irrelevant features hurt generalization
▸Features have associated costs, want to optimize accuracy with least
expensive features
- Embedded systems with limited resources
• Voice recognition on a cell phone
• Branch prediction in a CPU (4K code limit)



Terminology

▸Univariate method: considers one variable (feature) at a time
▸Multivariate method: considers subsets of variables (features) together
▸Filter method: ranks features or feature subsets independently
of the predictor (classifier)
▸Wrapper method: uses a classifier to assess features or feature
subsets



Filtering
▸Basic idea: assign score to each feature x indicating how
“related” x and the class y are

- Intuition: if x=y for all instances, then x is great no matter what our model
is; x contains all information needed to predict y

▸Pick the n highest scoring features to keep



Individual Feature Irrelevance

[Figure: class-conditional densities of a feature xi for Y = 1 and Y = −1 that coincide. A feature is individually irrelevant when P(Xi, Y) = P(Xi) P(Y), i.e., P(Xi | Y) = P(Xi), so that P(Xi | Y = 1) = P(Xi | Y = −1).]

Slide from I. Guyon, PASCAL Bootcamp in ML
Individual Feature Relevance

[Figure: overlapping class-conditional densities of a feature xi, with class means μ− and μ+ and standard deviations σ− and σ+; the separation between the two densities reflects the feature’s relevance.]

Figure from I. Guyon, PASCAL Bootcamp in ML
Univariate Dependence
▸Independence:
P(X, Y) = P(X) P(Y)

▸Measures of dependence:
- Mutual Information (see notes from board)
- Correlation (see notes from board)

Slide from I. Guyon, PASCAL Bootcamp in ML
Correlation and MI

[Figure: two examples where the linear correlation is essentially zero (R = 0.02 and R = 0.0002) while the mutual information is substantial (MI = 1.03 nat and MI = 1.65 nat), showing that MI can capture non-linear dependence that correlation misses.]

Slide from I. Guyon, PASCAL Bootcamp in ML
Gaussian Distribution

[Figure: jointly Gaussian X and Y; in this case dependence is fully captured by the correlation R.]

MI(X, Y) = −(1/2) log(1 − R²)

Slide from I. Guyon, PASCAL Bootcamp in ML
T-test

[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = −1) with means μ+ and μ− and standard deviations σ+ and σ−.]

• Normally distributed classes, equal variance σ² unknown; estimated from the data as σ²within.
• Null hypothesis H0: μ+ = μ−
• T statistic: if H0 is true, t = (μ+ − μ−) / (σwithin sqrt(1/m+ + 1/m−)) ~ Student(m+ + m− − 2 d.f.), where m+ and m− are the numbers of examples in the two classes.

Slide from I. Guyon, PASCAL Bootcamp in ML
Considering each feature alone may fail

[Figure: example from Guyon & Elisseeff (JMLR 2004; Springer 2006) in which features that look uninformative individually are informative when taken together.]


Filtering
▸Advantages:
- Fast, simple to apply

▸Disadvantages:
- Doesn’t take into account which learning algorithm will be used
- Doesn’t take into account correlations between features, just correlation of
each feature to the class label



Feature Selection Methods

Filter:
All Features → Filter → Selected Features → Supervised Learning Algorithm → Classifier

Wrapper:
All Features → [Feature Subset Search ⇄ Feature Evaluation Criterion (returns a criterion value)] → Selected Features → Supervised Learning Algorithm → Classifier




Search Method: sequential forward search (within the wrapper framework above)

[Diagram: starting from the single features A, B, C, D, the best one (B) is extended to the pairs {A,B}, {B,C}, {B,D}; the best pair is then extended to the triplets {A,B,C}, {B,C,D}; and so on.]


Search Method: sequential backward elimination (within the wrapper framework above)

[Diagram: starting from the full set {A,B,C,D}, the subsets {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D} are evaluated and C is dropped; then {A,B}, {A,D}, {B,D} are evaluated and B is dropped; then {A}, {D}; and so on.]


▸ Forward or backward selection? Of the three variables of this example, the third one
separates the two classes best by itself (bottom right histogram). It is therefore the best
candidate in a forward selection process. Still, the two other variables are better taken
together than any subset of two including it. A backward selection method may perform
better in this case.



Model search
▸More sophisticated search strategies exist
- Best-first search
- Stochastic search
- See “Wrappers for Feature Subset Selection”, Kohavi and
John 1997
▸Other objective functions exist which add a model-complexity
penalty to the training error
- AIC, BIC



Regularization

▸In certain cases, we can move model selection into the induction
algorithm
- Only have to fit one model; more efficient
▸This is sometimes called an embedded feature selection
algorithm



Regularization
▸Regularization: add a model complexity penalty to the training error and minimize the combined objective,
training error(w) + C · penalty(w), for some constant C
▸Regularization forces weights to be small, but does it force weights to be exactly zero?
- Setting a weight wf to exactly zero is equivalent to removing feature f from the model


Kernel Methods (Quick Review)
▸Expanding feature space gives us new potentially useful features
▸Kernel methods let us work implicitly in a high-dimensional
feature space
- All calculations performed quickly in low-dimensional space



Feature Engineering
▸Linear models: convenient,
fairly broad, but limited
▸We can increase the
expressiveness of linear
models by expanding the
feature space.
- E.g., map x = (x1, x2) to the quadratic features φ(x) = (1, x1, x2, x1², x2², x1x2)
- Now the feature space is R⁶ rather than R²
- An example linear predictor in these features: w·φ(x) = w0 + w1x1 + w2x2 + w3x1² + w4x2² + w5x1x2


Individual Feature Relevance

[Figure: ROC curve (sensitivity vs. 1 − specificity) for thresholding a single feature xi; the area under the curve (AUC) measures the feature’s individual relevance.]

Slide from I. Guyon, PASCAL Bootcamp in ML
L1 versus L2 Regularization

[Figure: comparison of the L1 and L2 penalties; the L1 penalty tends to drive some weights exactly to zero (sparse solutions), while the L2 penalty only shrinks them.]
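To see the practical difference, the scikit-learn sketch below fits L1- (Lasso) and L2- (Ridge) penalized linear models to synthetic data in which only 3 of 20 features matter; the penalty strengths are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                        # only the first 3 features are relevant
y = X @ w_true + 0.1 * rng.standard_normal(100)

w_l1 = Lasso(alpha=0.1).fit(X, y).coef_
w_l2 = Ridge(alpha=1.0).fit(X, y).coef_
print((np.abs(w_l1) < 1e-8).sum())   # L1: many weights exactly zero (embedded selection)
print((np.abs(w_l2) < 1e-8).sum())   # L2: weights shrink, but typically none are exactly zero
```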


Feature Extraction
▸Linear combinations are particularly attractive because they
are simpler to compute and analytically tractable.

▸Given x ∈ Rᴺ, find a K x N matrix T such that:
y = Tx ∈ Rᴷ, where K << N

x = [x1, x2, ..., xN]ᵀ →(T)→ y = [y1, y2, ..., yK]ᵀ

This is a projection from the N-dimensional space to a K-dimensional space.
Feature Extraction (cont’d)
▸From a mathematical point of view, finding an optimum
mapping y=𝑓(x) is equivalent to optimizing an objective
criterion.

▸Different methods use different objective criteria, e.g.,

- Minimize Information Loss: represent the data as accurately as possible in the lower-dimensional space.

- Maximize Discriminatory Information: enhance the class-discriminatory information in the lower-dimensional space.
Feature Extraction (cont’d)

▸Popular linear feature extraction methods:


- Principal Components Analysis (PCA): Seeks a projection
that preserves as much information in the data as possible.
- Linear Discriminant Analysis (LDA): Seeks a projection that
best discriminates the data.

▸Many other methods:


- Making features as independent as possible (Independent
Component Analysis or ICA).
- Retaining interesting directions (Projection Pursuit).
- Embedding to lower dimensional manifolds (Isomap, Locally
Linear Embedding or LLE).

Vector Representation

▸A vector x ∈ Rᴺ can be represented by N components: x = [x1, x2, ..., xN]ᵀ

▸Assuming the standard basis <v1, v2, …, vN> (i.e., unit vectors in each dimension), xi can be obtained by projecting x along the direction of vi:
xi = xᵀvi / (viᵀvi) = xᵀvi

▸x can be “reconstructed” from its N projections as follows:
x = Σ_{i=1}^{N} xi vi = x1v1 + x2v2 + ... + xNvN

• Since the basis vectors are the same for all x ∈ Rᴺ (standard basis), we typically represent each vector simply as its N-component coordinate vector.
Vector Representation (cont’d)

▸Example assuming n = 2: x = [x1, x2]ᵀ = [3, 4]ᵀ

▸Assuming the standard basis <v1 = i, v2 = j>, xi can be obtained by projecting x along the direction of vi:
x1 = xᵀi = [3 4][1 0]ᵀ = 3
x2 = xᵀj = [3 4][0 1]ᵀ = 4

▸x can be “reconstructed” from its projections as follows: x = 3i + 4j
Principal Component Analysis (PCA)

• If x ∈ Rᴺ, then it can be written as a linear combination of an orthonormal set of N basis vectors <v1, v2, …, vN> in Rᴺ (e.g., using the standard basis):
x = Σ_{i=1}^{N} xi vi = x1v1 + x2v2 + ... + xNvN
where viᵀvj = 1 if i = j and 0 otherwise, and xi = xᵀvi / (viᵀvi) = xᵀvi

• PCA seeks to approximate x in a subspace of Rᴺ using a new set of K << N basis vectors <u1, u2, …, uK> in Rᴺ:
x̂ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK   (reconstruction)
where yi = xᵀui / (uiᵀui) = xᵀui
such that ||x − x̂|| is minimized (i.e., minimize the information loss).
Principal Component Analysis (PCA)
• The “optimal” set of basis vectors <u1, u2, …, uK> can be found as follows (we will see why):

(1) Find the eigenvectors ui of the covariance matrix Σx of the (training) data:
Σx ui = λi ui

(2) Choose the K “largest” eigenvectors ui (i.e., those corresponding to the K “largest” eigenvalues λi)

<u1, u2, …, uK> correspond to the “optimal” basis!

We refer to the “largest” eigenvectors ui as principal components.
PCA - Steps

▸Suppose we are given M vectors x1, x2, ..., xM, each of size N x 1 (N: # of features, M: # of data points)

Step 1: compute the sample mean
x̄ = (1/M) Σ_{i=1}^{M} xi

Step 2: subtract the sample mean (i.e., center the data at zero)
Φi = xi − x̄

Step 3: compute the sample covariance matrix Σx
Σx = (1/M) Σ_{i=1}^{M} (xi − x̄)(xi − x̄)ᵀ = (1/M) Σ_{i=1}^{M} Φi Φiᵀ = (1/M) A Aᵀ
where A = [Φ1 Φ2 ... ΦM] is the N x M matrix whose columns are the Φi
PCA - Steps

Step 4: compute the eigenvalues/eigenvectors of Σx
Σx ui = λi ui, where we assume λ1 ≥ λ2 ≥ ... ≥ λN
(Note: most software packages return the eigenvalues, and corresponding eigenvectors, in decreasing order; if not, you can explicitly put them in this order.)

Since Σx is symmetric, <u1, u2, …, uN> form an orthogonal basis in Rᴺ and we can represent any x ∈ Rᴺ as:
x − x̄ = Σ_{i=1}^{N} yi ui = y1u1 + y2u2 + ... + yNuN
where yi = (x − x̄)ᵀui / (uiᵀui) = (x − x̄)ᵀui if ||ui|| = 1

i.e., this is just a “change” of basis: [x1, x2, ..., xN]ᵀ → [y1, y2, ..., yN]ᵀ

(Note: most software packages normalize ui to unit length to simplify calculations; if not, you can explicitly normalize them.)
PCA - Steps

Step 5: dimensionality reduction step – approximate x using only the first K eigenvectors (K << N), i.e., those corresponding to the K largest eigenvalues, where K is a parameter:

x − x̄ = Σ_{i=1}^{N} yi ui = y1u1 + y2u2 + ... + yNuN
is approximated, using the first K eigenvectors only, by
x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK   (reconstruction)

In coordinates: x − x̄ : [x1, ..., xN]ᵀ → [y1, ..., yN]ᵀ → x̂ − x̄ : [y1, ..., yK]ᵀ

Note that if K = N, then x̂ = x (i.e., zero reconstruction error).
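The five steps above can be written directly in NumPy. A minimal sketch follows; it assumes the data matrix X stores one sample per row (so the roles of A and Aᵀ are swapped relative to the slides), and the function names are ours, not part of any particular library.

```python
import numpy as np

def pca_fit(X, K):
    """PCA Steps 1-5 for a data matrix X of shape (M, N): M samples, N features."""
    mean = X.mean(axis=0)                        # Step 1: sample mean
    Phi = X - mean                               # Step 2: center the data
    Sigma = (Phi.T @ Phi) / X.shape[0]           # Step 3: covariance matrix (N x N)
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # Step 4: eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]            # sort in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :K]                           # Step 5: keep the first K eigenvectors
    return mean, U, eigvals

def pca_transform(X, mean, U):
    return (X - mean) @ U                        # y = U^T (x - mean), one row per sample

def pca_reconstruct(Y, mean, U):
    return Y @ U.T + mean                        # x_hat = sum_i y_i u_i + mean
```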
What is the Linear Transformation implied by PCA?

• The linear transformation y = Tx which performs the dimensionality reduction in PCA is:
x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK

(x̂ − x̄) = U [y1, y2, ..., yK]ᵀ, where U = [u1 u2 ... uK] is the N x K matrix whose columns are the first K eigenvectors of Σx

[y1, y2, ..., yK]ᵀ = Uᵀ(x̂ − x̄), i.e., T = Uᵀ is the K x N matrix whose rows are the first K eigenvectors of Σx
What is the form of Σy?

Σx = (1/M) Σ_{i=1}^{M} (xi − x̄)(xi − x̄)ᵀ = (1/M) Σ_{i=1}^{M} Φi Φiᵀ

Using diagonalization: Σx = P Λ Pᵀ, where the columns of P are the eigenvectors of Σx and the diagonal elements of Λ are the eigenvalues of Σx, i.e., the variances.

With U = P (taking all N eigenvectors): yi = Uᵀ(xi − x̄) = Pᵀ Φi

Σy = (1/M) Σ_{i=1}^{M} (yi − ȳ)(yi − ȳ)ᵀ = (1/M) Σ_{i=1}^{M} yi yiᵀ
   = (1/M) Σ_{i=1}^{M} (Pᵀ Φi)(Pᵀ Φi)ᵀ = Pᵀ ( (1/M) Σ_{i=1}^{M} Φi Φiᵀ ) P = Pᵀ Σx P = Pᵀ (P Λ Pᵀ) P = Λ

Σy = Λ: PCA de-correlates the data and preserves the original variances!
Interpretation of PCA

• PCA chooses the eigenvectors of the covariance matrix corresponding to the largest eigenvalues.
• The eigenvalues correspond to the variance of the data along the eigenvector directions.
• Therefore, PCA projects the data along the directions where the data varies most.
• PCA preserves as much information in the data as possible by preserving as much of its variance as possible.

[Figure: 2D data cloud with u1, the direction of maximum variance, and u2, orthogonal to u1.]
Example
▸Compute the PCA of the following dataset:

(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)

▸Compute the sample covariance matrix:
Σ̂ = (1/n) Σ_{k=1}^{n} (xk − μ̂)(xk − μ̂)ᵀ

▸The eigenvalues can be computed by finding the roots of the characteristic polynomial det(Σ̂ − λI) = 0.
Example (cont’d)
▸The eigenvectors are the solutions of the systems: Σx ui = λi ui

Note: if ui is a solution, then c·ui is also a solution for any c ≠ 0.

Eigenvectors can be normalized to unit length using: v̂i = vi / ||vi||
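For reference, here is a short NumPy check of the worked example above, under the same 1/n covariance convention; the printed eigenvalues and unit-length eigenvectors are what the hand computation should produce.

```python
import numpy as np

# The eight 2-D points from the example
X = np.array([[1, 2], [3, 3], [3, 5], [5, 4], [5, 6], [6, 5], [8, 7], [9, 8]], float)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)        # sample covariance matrix (1/n convention)
eigvals, eigvecs = np.linalg.eigh(Sigma)      # roots of det(Sigma - lambda*I) = 0

print(mu)        # sample mean
print(Sigma)     # 2 x 2 covariance matrix
print(eigvals)   # eigenvalues (ascending order from eigh)
print(eigvecs)   # unit-length eigenvectors (columns) = principal directions
```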
How do we choose K?

• K is typically chosen based on how much information (variance) we want to preserve.

• Choose the smallest K that satisfies the following inequality:
(Σ_{i=1}^{K} λi) / (Σ_{i=1}^{N} λi) > T, where T is a threshold (e.g., 0.9)

• If T = 0.9, for example, we “preserve” 90% of the information (variance) in the data.

• If K = N, then we “preserve” 100% of the information in the data (i.e., it is just a “change” of basis and x̂ = x).
Approximation Error

• The approximation error (or reconstruction error) can be computed by ||x − x̂||, where
x̂ = Σ_{i=1}^{K} yi ui + x̄ = y1u1 + y2u2 + ... + yKuK + x̄   (reconstruction)

• It can also be shown that the approximation error can be computed as follows:
||x − x̂|| = (1/2) Σ_{i=K+1}^{N} λi
Data Normalization

• The principal components are dependent on the units used to measure the original variables, as well as on the range of values they assume.

• Data should always be normalized prior to using PCA.

• A common normalization method is to transform all the data to have zero mean and unit standard deviation:
(xi − μ) / σ, where μ and σ are the mean and standard deviation of the i-th feature xi
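A minimal NumPy sketch of this zero-mean, unit-variance normalization, applied column-wise before running PCA; the guard against zero variance is our own addition.

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) of X to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard: leave constant features unscaled
    return (X - mu) / sigma, mu, sigma

# Usage sketch: standardize first, then run PCA on the scaled data
# X_scaled, mu, sigma = standardize(X)
# mean, U, eigvals = pca_fit(X_scaled, K)   # pca_fit from the earlier PCA sketch
```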
Application to Images
• The goal is to represent images in a space of lower
dimensionality using PCA.
− Useful for various applications, e.g., face recognition, image
compression, etc.
• Given M images of size N x N, first represent each image
as a 1D vector (i.e., by stacking the rows together).
− Note that for face recognition, faces must be centered and of the
same size.

Application to Images (cont’d)

▸The key challenge is that the covariance matrix Σx is now very large (i.e., N² x N²) – see Step 3:

Step 3: compute the covariance matrix Σx
Σx = (1/M) Σ_{i=1}^{M} Φi Φiᵀ = (1/M) A Aᵀ, where A = [Φ1 Φ2 ... ΦM] is now an N² x M matrix

▸Σx is now an N² x N² matrix – it is computationally expensive to compute its eigenvalues/eigenvectors λi, ui:
(AAᵀ)ui = λi ui
Application to Images (cont’d)

▸We will use a simple “trick” to get around this by relating the eigenvalues/eigenvectors of AAᵀ to those of AᵀA.

▸Let us consider the matrix AᵀA instead (i.e., an M x M matrix):
- Suppose its eigenvalues/eigenvectors are μi, vi: (AᵀA)vi = μi vi
- Multiply both sides by A: A(AᵀA)vi = A μi vi, or (AAᵀ)(Avi) = μi (Avi)
- Assuming (AAᵀ)ui = λi ui, where A = [Φ1 Φ2 ... ΦM] is the N² x M matrix:
λi = μi and ui = A vi
Application to Images (cont’d)

▸But do AAᵀ and AᵀA have the same number of eigenvalues/eigenvectors?
- AAT can have up to N2 eigenvalues/eigenvectors.
- ATA can have up to M eigenvalues/eigenvectors.
- It can be shown that the M eigenvalues/eigenvectors of ATA
are also the M largest eigenvalues/eigenvectors of AAT

▸Steps 3-5 of PCA need to be updated as follows:

Application to Images (cont’d)
Step 3: compute AᵀA (i.e., instead of AAᵀ)

Step 4: compute the eigenvalues/eigenvectors μi, vi of AᵀA

Step 4b: compute the λi, ui of AAᵀ using λi = μi and ui = A vi, then normalize ui to unit length.

Step 5: dimensionality reduction step – approximate x using only the first K eigenvectors (K < M):
x̂ − x̄ = Σ_{i=1}^{K} yi ui = y1u1 + y2u2 + ... + yKuK
so each image can be represented by a K-dimensional vector [y1, y2, ..., yK]ᵀ.
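The AᵀA trick above is easy to express in NumPy. The sketch below assumes A holds one centered, vectorized image per column (shape N² x M with N² >> M); the function name is ours.

```python
import numpy as np

def pca_highdim(A, K):
    """Eigenfaces trick: eigenvectors of A A^T via the small M x M matrix A^T A."""
    M = A.shape[1]
    small = (A.T @ A) / M                       # M x M instead of N^2 x N^2
    mu, V = np.linalg.eigh(small)               # eigenpairs (mu_i, v_i) of the small matrix
    order = np.argsort(mu)[::-1][:K]            # keep the K largest
    mu, V = mu[order], V[:, order]
    U = A @ V                                   # u_i = A v_i are eigenvectors of (1/M) A A^T
    U /= np.linalg.norm(U, axis=0)              # Step 4b: normalize u_i to unit length
    return U, mu                                # lambda_i = mu_i
```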
Example

[Figure: the training dataset of face images.]
Example (cont’d)

[Figure: the top eigenvectors u1, u2, u3, ..., uK visualized as images (eigenfaces), together with the mean face x̄.]
Example (cont’d)
• How can you visualize the eigenvectors u1, u2, u3, ... (eigenfaces) as an image?
− Their values must first be mapped to integer values in the interval [0, 255] (required by the PGM format).
− Suppose fmin and fmax are the min/max values of a given eigenface (they could be negative).
− If x ∈ [fmin, fmax] is the original value, then the new value y ∈ [0, 255] can be computed as follows:
y = (int) 255 (x − fmin)/(fmax − fmin)
Application to Images (cont’d)
• Interpretation: represent a face in terms of eigenfaces:
x̂ = Σ_{i=1}^{K} yi ui + x̄ = y1u1 + y2u2 + ... + yKuK + x̄

[Figure: a face approximated as y1u1 + y2u2 + y3u3 + ... + x̄, a weighted sum of eigenfaces plus the mean face; the weights [y1, ..., yK]ᵀ represent the face.]
Case Study: Eigenfaces for Face
Detection/Recognition

− M. Turk and A. Pentland, “Eigenfaces for Recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

• Face Recognition
− The simplest approach is to think of it as a template matching
problem.

− Problems arise when performing recognition in a high-dimensional


space.

− Use dimensionality reduction!

Face Recognition Using Eigenfaces

▸Process the image database (i.e., a set of images with labels) – typically referred to as the “training” phase:

- Compute the PCA space using the image database (i.e., the training data)

- Represent each image in the database with K coefficients: Ωi = [y1, y2, ..., yK]ᵀ


Face Recognition Using Eigenfaces
Given an unknown face x, follow these steps:

Step 1: Subtract the mean face (computed from the training data): Φ = x − x̄

Step 2: Project the unknown face onto the eigenspace:
Φ̂ = Σ_{i=1}^{K} yi ui, where yi = Φᵀui, giving Ω = [y1, y2, ..., yK]ᵀ

Step 3: Find the closest match Ωi from the training set using:
e_r = min_i ||Ω − Ωi|| = min_i sqrt( Σ_{j=1}^{K} (y_j − y_j^i)² )   (Euclidean distance)
or min_i sqrt( Σ_{j=1}^{K} (1/λj)(y_j − y_j^i)² )   (Mahalanobis distance)
The distance e_r is called the distance in face space (difs).

Step 4: Recognize x as person “k”, where k is the ID linked to Ωi.

Note: for intruder rejection, we need e_r < Tr, for some threshold Tr.
Face detection vs recognition

[Figure: detection locates the faces in an image; recognition identifies each detected face (e.g., “Sally”).]
Face Detection Using Eigenfaces

Given an unknown image x, follow these steps:

Step 1: Subtract the mean face (computed from the training data): Φ = x − x̄

Step 2: Project the unknown image onto the eigenspace:
Φ̂ = Σ_{i=1}^{K} yi ui, where yi = Φᵀui

Step 3: Compute e_d = ||Φ − Φ̂||
The distance e_d is called the distance from face space (dffs).

Step 4: If e_d < Td, then x is a face.
Eigenfaces

[Figure: input images and their reconstructions from the eigenface space; in each case the reconstructed image looks like a face.]
Reconstruction from partial information
• Robust to partial face occlusion.

[Figure: occluded input faces and their reconstructions from the eigenface space.]
Eigenfaces
• Can be used for face detection, tracking, and recognition!
• Visualize the dffs as an image: e_d = ||Φ − Φ̂||

[Figure: dffs map over an input image; dark pixels indicate small distance, bright pixels large distance.]
Limitations
• Background changes cause problems
− De-emphasize the outside of the face (e.g., by multiplying the input
image by a 2D Gaussian window centered on the face).
• Light changes degrade performance
− Light normalization might help but this is a challenging issue.
• Performance decreases quickly with changes to face size
− Scale input image to multiple sizes.
− Multi-scale eigenspaces.
• Performance decreases with changes to face orientation
(but not as fast as with scale changes)
− Out-of-plane rotations are more difficult to handle.
− Multi-orientation eigenspaces.

Limitations (cont’d)
▸Not robust to misalignment.

Limitations (cont’d)
• PCA is not always an optimal dimensionality-reduction
technique for classification purposes.

Linear Discriminant Analysis (LDA)

• What is the goal of LDA?


− Seeks to find directions along which the classes are best
separated (i.e., increase discriminatory information).
− It takes into consideration the scatter (i.e., variance) within-
classes and between-classes.

[Figure: two candidate projection directions for two-class data – one gives bad separability (the projected classes overlap), the other gives good separability.]
Linear Discriminant Analysis (LDA)
▸ Let us assume C classes, with class i containing Mi samples, i = 1, 2, ..., C, and M the total number of samples:
M = Σ_{i=1}^{C} Mi

▸ Let μi be the mean of the i-th class, i = 1, 2, ..., C, and μ the mean of the whole dataset:
μ = (1/C) Σ_{i=1}^{C} μi

▸ Within-class scatter matrix:
Sw = Σ_{i=1}^{C} Σ_{j=1}^{Mi} (xj − μi)(xj − μi)ᵀ

▸ Between-class scatter matrix:
Sb = Σ_{i=1}^{C} (μi − μ)(μi − μ)ᵀ
Linear Discriminant Analysis (LDA)
• Suppose the desired projection transformation is: y = Uᵀx

• Suppose the scatter matrices of the projected data y are S̃b and S̃w.

• LDA seeks transformations that maximize the between-class scatter and minimize the within-class scatter:
max |S̃b| / |S̃w|, or equivalently max |UᵀSbU| / |UᵀSwU|
Linear Discriminant Analysis (LDA)

• It can be shown that the columns of the matrix U are the eigenvectors (called Fisherfaces) corresponding to the largest eigenvalues of the following generalized eigen-problem:
Sb uk = λk Sw uk

• It can be shown that Sb has rank at most C − 1; therefore, the maximum number of eigenvectors with non-zero eigenvalues is C − 1, that is:

the max dimensionality of the LDA sub-space is C − 1

e.g., when C = 2, we always end up with one LDA feature, no matter what the original number of features was!
Example

[Figure: LDA projection example.]
Linear Discriminant Analysis (LDA)

• If Sw is non-singular, we can solve a conventional eigenvalue problem as follows:
Sb uk = λk Sw uk   →   Sw⁻¹ Sb uk = λk uk

• In practice, Sw is singular due to the high dimensionality of the data (e.g., images) and the much lower number of data points (M << N).
Linear Discriminant Analysis (LDA)
• To alleviate this problem, PCA could be applied first:

1) First, apply PCA to reduce the data dimensionality:
x = [x1, x2, ..., xN]ᵀ →(PCA)→ y = [y1, y2, ..., yM]ᵀ

2) Then, apply LDA to find the most discriminative directions:
y = [y1, y2, ..., yM]ᵀ →(LDA)→ z = [z1, z2, ..., zK]ᵀ
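One way to realize this two-stage PCA-then-LDA projection is with scikit-learn, sketched below on the bundled digits dataset as a stand-in for face images; the component counts (40 PCA components, C − 1 = 9 LDA components for 10 classes) are illustrative choices, not values from the slides.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)          # 10 classes, 64 features per image

# PCA first (keeps S_w well-conditioned), then LDA (at most C - 1 = 9 components)
pca_lda = make_pipeline(PCA(n_components=40),
                        LinearDiscriminantAnalysis(n_components=9))
Z = pca_lda.fit_transform(X, y)
print(Z.shape)                               # (n_samples, 9)
```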
