Fattah DimensionReduction 02june2022
Brief Biography
▸ Professor, Dept. of EEE, BUET
▸ Director, Institute of Robotics & Automation, BUET
▸ Ex-Director, Institute of Nuclear Power Engineering, BUET
▸ Editorial Board Member, IEEE Access
▸ Chair, IEEE R-10 Ad Hoc Committee on Climate Change
▸ Chair, IEEE PES Humanitarian Activities Committee
▸ Chair, IEEE SSIT Chapters
▸ Education Chair, IEEE HAC (2018-20)
▸ Member, IEEE PES LRP Committee
▸ Chair, IEEE SPS and SSIT; Vice-Chair, IEEE RAS Bangladesh Chapter
▸ Chair ('15-16) and N&A Chair ('17-20), IEEE Bangladesh Section
▸ General Chair: ICAICT 2020, RAAICON, R10-HTC 2017
▸ TPC Chair: TENSYMP 2020, WIECON, ICIVPR, ICAEE
Dr. Shaikh Fattah, Professor, Department of EEE, BUET
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Why Component Analysis?
• We have too many observations and dimensions
- to reason about or obtain insights from
- to visualize
- too much noise in the data
- need to "reduce" them to a smaller set of factors
- better representation of the data without losing much information
- can build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition
Subset Selection Method
Forward search: add the best feature at each step.
- The set of features F is initially Ø.
- At each iteration, find the best new feature: $j = \arg\min_i E(F \cup \{x_i\})$
- Add $x_j$ to F if $E(F \cup \{x_j\}) < E(F)$.
Backward search:
▸ Start with all features and remove one at a time, if possible.
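A minimal Python sketch of the forward search above; the interface is illustrative (error_fn stands in for E, e.g., a cross-validated classifier error):

```python
import numpy as np

def forward_search(X, y, error_fn, max_features=None):
    """Greedy forward selection: add the best feature at each step."""
    n_features = X.shape[1]
    selected = []                      # the set F, initially empty
    best_err = np.inf                  # E(F) for the current F
    limit = max_features or n_features
    while len(selected) < limit:
        candidates = [j for j in range(n_features) if j not in selected]
        # j = argmin_i E(F ∪ {x_i})
        errs = [error_fn(X[:, selected + [j]], y) for j in candidates]
        j_best = candidates[int(np.argmin(errs))]
        if min(errs) >= best_err:      # add x_j only if E(F ∪ {x_j}) < E(F)
            break
        selected.append(j_best)
        best_err = min(errs)
    return selected
```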
Principal Component Analysis
[Figure: data scattered over original variables A and B, with new orthogonal axes PC 1 and PC 2 along the directions of maximum variability.]
Mathematically:
- PCA can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables.
- The new axes are orthogonal and represent the directions of maximum variability.
▸ Maximize Var(z) subject to ||w1|| = 1. Using a Lagrange multiplier α:
$\max_{\mathbf{w}_1} \; \mathbf{w}_1^T \boldsymbol{\Sigma} \mathbf{w}_1 - \alpha\,(\mathbf{w}_1^T \mathbf{w}_1 - 1)$
Setting the gradient to zero gives $\boldsymbol{\Sigma}\mathbf{w}_1 = \alpha \mathbf{w}_1$, that is, w1 is an eigenvector of Σ.
Choose the eigenvector with the largest eigenvalue for Var(z) to be maximum.
▸ Second principal component: maximize Var(z2), s.t. ||w2|| = 1 and w2 orthogonal to w1.
What PCA does
$\mathbf{z} = W^T(\mathbf{x} - \mathbf{m})$
where the columns of W are the eigenvectors of Σ and m is the sample mean.
PCA centers the data at the origin and rotates the axes.
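A minimal NumPy sketch of this computation (function name and interface are illustrative):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the sample covariance.

    X: (n_samples, n_features) data matrix.
    Returns z = W^T (x - m) for each sample, plus W and the eigenvalues.
    """
    m = X.mean(axis=0)                      # sample mean
    Xc = X - m                              # center the data at the origin
    cov = Xc.T @ Xc / len(X)                # covariance matrix Σ
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: Σ is symmetric
    order = np.argsort(eigvals)[::-1]       # sort λ in descending order
    W = eigvecs[:, order[:k]]               # top-k eigenvectors as columns
    Z = Xc @ W                              # z = W^T (x - m), row-wise
    return Z, W, eigvals[order]
```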
How to choose k?
Proportion of variance (PoV) explained:
$\mathrm{PoV} = \dfrac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d}$
where the λi are sorted in descending order.
▸ Typically, stop at PoV > 0.9.
▸ A scree graph plots PoV vs. k; stop at the "elbow".
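A small helper for this rule, assuming the eigenvalues returned by the sketch above:

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest k with PoV = (λ1+...+λk)/(λ1+...+λd) > threshold."""
    lam = np.sort(eigvals)[::-1]        # λ_i in descending order
    pov = np.cumsum(lam) / lam.sum()    # PoV for k = 1..d
    return int(np.argmax(pov > threshold)) + 1
```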
How Many PCs?
[Figure: scree plot of variance explained (%) by each of PC1 through PC10; the first few components capture most of the variance.]
Feature Dimensionality Reduction
▸ Feature selection: chooses a subset of the original features:
$\mathbf{x} = (x_1, x_2, \ldots, x_N)^T \;\rightarrow\; \mathbf{y} = (x_{i_1}, x_{i_2}, \ldots, x_{i_K})^T, \quad K \ll N$
▸ Feature extraction: finds a set of new features via some mapping f() from the existing features:
$\mathbf{x} = (x_1, x_2, \ldots, x_N)^T \;\xrightarrow{f(\cdot)}\; \mathbf{y} = (y_1, y_2, \ldots, y_K)^T, \quad K \ll N$
The mapping f() could be linear or non-linear.
Feature Selection vs Feature Extraction
▸Feature Extraction
- New features are combinations (linear for PCA/LDA) of the
original features (difficult to interpret).
- When classifying novel patterns, all features need to be
computed.
▸Feature Selection
- New features represent a subset of the original features.
- When classifying novel patterns, only a small number of features
need to be computed (i.e., faster classification).
Feature Selection
Feature Selection: Main Steps
▸ Feature selection is an optimization problem.
- Step 1: search the space of possible feature subsets.
- Step 2: pick the subset that is optimal or near-optimal with respect to some objective function.
Evaluation methods
- Filter
- Wrapper
Search Methods
▸ Assuming n features, an exhaustive search would require examining $\binom{n}{d}$ possible subsets of size d.
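A quick check of how fast $\binom{n}{d}$ grows:

```python
from math import comb

# Exhaustive search over subsets of size d grows combinatorially:
print(comb(20, 5))    # 15,504 subsets
print(comb(50, 10))   # 10,272,278,170 subsets: clearly infeasible
```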
Evaluation Methods
▸ Filter
- Evaluation is independent of the classification algorithm.
- The objective function is based on the information content ("goodness") of the feature subsets, e.g.:
• interclass distance
• statistical dependence
• information-theoretic measures (e.g., mutual information)
Evaluation Methods (cont'd)
▸ Wrapper
- Evaluation uses criteria related to the classification algorithm.
- The objective function is based on the predictive accuracy ("goodness") of the classifier, e.g.:
• recognition accuracy on a "validation" data set
Filter vs Wrapper Methods
▸Wrapper Methods
- Advantages
• Achieve higher recognition accuracy compared to filter methods
since they use the classifier itself in choosing the optimum set of
features.
- Disadvantages
• Much slower compared to filter methods since the classifier must
be trained and tested for each candidate feature subset.
• The optimum feature subset might not work well for other
classifiers.
Filter vs Wrapper Methods (cont'd)
▸Filter Methods
- Advantages
• Much faster than wrapper methods since the objective function
has lower computational requirements.
• The optimum feature set might work well with various classifiers
as it is not tied to a specific classifier.
- Disadvantages
• Achieve lower recognition accuracy compared to wrapper
methods.
• Have a tendency to select more features compared to wrapper
methods.
Naïve Search (heuristic search)
▸Given n features:
1. Sort them in decreasing order of their “goodness” (i.e., based
on some objective function).
2. Choose the top d features from this sorted list.
▸Disadvantages
- Correlation among features is not considered.
- The best pair of features may not even contain the best
individual feature.
Sequential forward selection (SFS) (heuristic search)
SFS performs best when the optimal subset is small.
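For reference, scikit-learn (0.24+) ships a wrapper-style implementation of both SFS and SBS; a short example (the dataset and classifier choices are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Wrapper-style SFS: grow the subset one best feature at a time;
# direction="backward" gives SBS instead.
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the selected features
```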
Example
[Figure: sequential forward feature selection for classification of a satellite image using 28 features. The x-axis shows classification accuracy (%); the y-axis shows the features added at each iteration (the first iteration is at the bottom). The highest-accuracy subset is marked.]
Sequential backward selection (SBS) (heuristic search)
SBS performs best when the optimal subset is large.
Example
[Figure: sequential backward feature selection for classification of a satellite image using 28 features. The x-axis shows classification accuracy (%); the y-axis shows the features removed at each iteration (the first iteration is at the top). The highest-accuracy subset is marked.]
Bidirectional Search (BDS)
▸ BDS applies SFS and SBS simultaneously:
- SFS is performed from the empty set.
- SBS is performed from the full set.
▸ To guarantee that SFS and SBS converge to the same solution:
- Features already selected by SFS cannot be removed by SBS.
- Features already removed by SBS cannot be added by SFS.
Limitations of SFS and SBS
▸ The main limitation of SFS is that it is unable to remove features that become non-useful after the addition of other features.
▸ The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded.
▸Enhancements of SFS and SBS:
- “Plus-L, minus-R” selection (LRS)
- Sequential floating forward (or backward) selection
(i.e., SFFS and SFBS)
From Wikipedia
Signal-to-Noise Ratio
- Intuition: if x = y for all instances, then x is great no matter what our model is; x contains all the information needed to predict y.
[Figure: class-conditional distributions of feature x_i for the two classes (labels −1 and +1), with spreads s− and s+.]
Figure from I. Guyon, PASCAL Bootcamp in ML
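The figure suggests the usual signal-to-noise criterion $|\mu_+ - \mu_-| / (s_+ + s_-)$; a minimal sketch, assuming binary labels in {−1, +1} (the exact formula is an assumption, not stated on the slide):

```python
import numpy as np

def s2n(x, y):
    """Signal-to-noise score for one feature x and labels y ∈ {-1, +1}."""
    pos, neg = x[y == 1], x[y == -1]
    # |μ+ − μ−| / (s+ + s−): mean separation relative to the spreads
    return abs(pos.mean() - neg.mean()) / (pos.std() + neg.std())
```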
Univariate Dependence
▸ Independence: P(X, Y) = P(X) P(Y)
▸ Measures of dependence:
- Mutual information (see notes from board)
- Correlation (see notes from board)
[Figure: scatter plot of X vs. Y with marginals P(X) and P(Y); here R = 0.0002 yet MI = 1.65 nat, i.e., X and Y are uncorrelated but strongly dependent.]
Slide from I. Guyon, PASCAL Bootcamp in ML
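A small experiment reproducing this effect (the quadratic relationship and estimator choices are illustrative; scikit-learn's kNN-based MI estimator reports values in nats):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = x ** 2 + 0.05 * rng.normal(size=2000)   # dependent but uncorrelated

r, _ = pearsonr(x, y)
mi = mutual_info_regression(x.reshape(-1, 1), y)[0]
print(f"R = {r:.4f}, MI = {mi:.2f} nat")     # R ≈ 0, MI clearly > 0
```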
Gaussian Distribution
[Figure: scatter plot of jointly Gaussian X and Y with marginals P(X) and P(Y); in the Gaussian case, correlation captures the dependence (zero correlation implies independence).]
T-test
[Figure: class-conditional densities P(x_i | Y = 1) and P(x_i | Y = −1) with spreads s− and s+; the t-test asks whether the two class means differ significantly.]
▸ Disadvantages:
- Doesn't take into account which learning algorithm will be used.
- Doesn't take into account correlations between features, just the correlation of each feature with the class label.
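A minimal per-feature t-statistic filter, assuming labels in {−1, +1}:

```python
import numpy as np
from scipy.stats import ttest_ind

def t_scores(X, y):
    """Per-feature |t| statistic between the two classes (filter score)."""
    t, _ = ttest_ind(X[y == 1], X[y == -1], axis=0)
    return np.abs(t)
```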
[Diagrams: a filter evaluates feature subsets independently of the learner (all features → filter → selected features → supervised learning algorithm), while a wrapper scores candidate subsets by the learning algorithm's own criterion value. Below, a search tree over subsets of the features {A, B, C, D}, expanding the best-scoring subset at each level.]
▸ In certain cases, we can move model selection into the induction algorithm:
- only one model has to be fit, which is more efficient.
▸ This is sometimes called an embedded feature selection algorithm (see the sketch below).
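A sketch of embedded selection via an L1-penalized model (the dataset is synthetic and illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=30,
                       n_informative=5, noise=1.0, random_state=0)
# L1 regularization drives some coefficients exactly to zero, so a
# single fit both trains the model and selects features (embedded).
lasso = LassoCV(cv=5).fit(X, y)
print(np.flatnonzero(lasso.coef_))   # indices of the surviving features
```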
AUC
[Figure: ROC curve (sensitivity vs. 1 − specificity); the area under the curve (AUC) for a single feature x_i can serve as a univariate ranking criterion.]
Slide from I. Guyon, PASCAL Bootcamp in ML
L1 versus L2 Regularization
▸ An L1 (lasso) penalty tends to drive weights exactly to zero, yielding sparse models and embedded feature selection; an L2 (ridge) penalty shrinks weights but rarely zeroes them.
Feature Extraction
$\mathbf{y} = T\,\mathbf{x} \in R^K$, where T is a K × N matrix and K ≪ N.
This is a projection from the N-dimensional space to a K-dimensional space.
Feature Extraction (cont’d)
▸From a mathematical point of view, finding an optimum
mapping y=𝑓(x) is equivalent to optimizing an objective
criterion.
Vector Representation
▸ A vector x ∈ R^N can be represented by N components: $\mathbf{x} = (x_1, x_2, \ldots, x_N)^T$
▸ Assuming the standard basis <v1, v2, ..., vN> (i.e., unit vectors along each dimension), x_i can be obtained by projecting x along the direction of v_i:
$x_i = \dfrac{\mathbf{x}^T \mathbf{v}_i}{\mathbf{v}_i^T \mathbf{v}_i} = \mathbf{x}^T \mathbf{v}_i$
Principal Component Analysis (PCA)
• If x ∈ R^N, then it can be written as a linear combination of an orthonormal set of N basis vectors <v1, v2, ..., vN> in R^N (e.g., the standard basis):
$\mathbf{x} = \sum_{i=1}^{N} x_i \mathbf{v}_i = x_1 \mathbf{v}_1 + x_2 \mathbf{v}_2 + \cdots + x_N \mathbf{v}_N$
where $\mathbf{v}_i^T \mathbf{v}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$ and $x_i = \dfrac{\mathbf{x}^T \mathbf{v}_i}{\mathbf{v}_i^T \mathbf{v}_i} = \mathbf{x}^T \mathbf{v}_i$
• PCA seeks to approximate x in a subspace of R^N using a new set of K ≪ N basis vectors <u1, u2, ..., uK> in R^N:
$\hat{\mathbf{x}} = \sum_{i=1}^{K} y_i \mathbf{u}_i = y_1 \mathbf{u}_1 + y_2 \mathbf{u}_2 + \cdots + y_K \mathbf{u}_K$ (reconstruction), where $y_i = \dfrac{\mathbf{x}^T \mathbf{u}_i}{\mathbf{u}_i^T \mathbf{u}_i} = \mathbf{x}^T \mathbf{u}_i$
such that $\|\mathbf{x} - \hat{\mathbf{x}}\|$ is minimized (i.e., information loss is minimized).
Principal Component Analysis (PCA)
• The "optimal" set of basis vectors <u1, u2, ..., uK> can be found as follows (we will see why): first, center the data by subtracting the sample mean:
$\Phi_i = \mathbf{x}_i - \bar{\mathbf{x}}$
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σx:
$\Sigma_x \mathbf{u}_i = \lambda_i \mathbf{u}_i$, where we assume $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$
(Note: most software packages return the eigenvalues, and corresponding eigenvectors, in decreasing order; if not, you can explicitly put them in this order.)
Each centered sample is then represented in the eigenvector basis:
$\hat{\mathbf{x}} - \bar{\mathbf{x}} = U \mathbf{y}$, where $U = [\mathbf{u}_1\ \mathbf{u}_2 \cdots \mathbf{u}_K]$ is the N × K matrix whose columns are the first K eigenvectors of Σx;
$\mathbf{y} = T(\mathbf{x} - \bar{\mathbf{x}})$ with $T = U^T$, the K × N matrix whose rows are the first K eigenvectors of Σx.
What is the form of Σy?
$\Sigma_x = \frac{1}{M} \sum_{i=1}^{M} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T = \frac{1}{M} \sum_{i=1}^{M} \Phi_i \Phi_i^T$
Using diagonalization: $\Sigma_x = P \Lambda P^T$, where the columns of P are the eigenvectors of Σx and the diagonal elements of Λ are the eigenvalues of Σx, i.e., the variances.
Projecting with U = P (i.e., K = N): $\mathbf{y}_i = U^T(\mathbf{x}_i - \bar{\mathbf{x}}) = P^T \Phi_i$
$\Sigma_y = \frac{1}{M} \sum_{i=1}^{M} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T = \frac{1}{M} \sum_{i=1}^{M} \mathbf{y}_i \mathbf{y}_i^T = \frac{1}{M} \sum_{i=1}^{M} (P^T \Phi_i)(P^T \Phi_i)^T$
$= P^T \Big( \frac{1}{M} \sum_{i=1}^{M} \Phi_i \Phi_i^T \Big) P = P^T \Sigma_x P = P^T (P \Lambda P^T) P = \Lambda$
So the projected features are uncorrelated, with variances given by the eigenvalues.
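A quick numerical check that Σy comes out diagonal (the covariance values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]],
    size=5000)

Phi = X - X.mean(axis=0)
Sx = Phi.T @ Phi / len(X)
lam, P = np.linalg.eigh(Sx)     # Σx = P Λ Pᵀ
Y = Phi @ P                     # y_i = Pᵀ Φ_i (as rows)
Sy = Y.T @ Y / len(Y)
print(np.round(Sy, 3))          # ≈ diag(λ): off-diagonal terms vanish
```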
Example
▸Compute the PCA of the following dataset:
(1,2),(3,3),(3,5),(5,4),(5,6),(6,5),(8,7),(9,8)
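A numerical solution sketch for this dataset, using the 1/M covariance convention defined above:

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4),
              (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)
Phi = X - X.mean(axis=0)
Sx = Phi.T @ Phi / len(X)      # sample covariance (1/M form)
lam, P = np.linalg.eigh(Sx)
print(lam[::-1])               # eigenvalues, largest first
print(P[:, ::-1])              # corresponding eigenvectors (columns)
```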
Example (cont'd)
▸ The eigenvectors are the solutions of the systems: $\Sigma_x \mathbf{u}_i = \lambda_i \mathbf{u}_i$
▸ Choose the smallest K that satisfies the following inequality:
$\dfrac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} > T$, where T is a threshold (e.g., 0.9).
Application to Images (cont'd)
Step 3: compute the covariance matrix Σx:
$\Sigma_x = \frac{1}{M} \sum_{i=1}^{M} \Phi_i \Phi_i^T = \frac{1}{M} A A^T$, where $A = [\Phi_1\ \Phi_2 \cdots \Phi_M]$ (an N² × M matrix)
Since A Aᵀ is N² × N² (huge for images), consider the much smaller M × M matrix AᵀA instead:
$(A^T A)\mathbf{v}_i = \mu_i \mathbf{v}_i$
- Multiply both sides by A: $A(A^T A)\mathbf{v}_i = \mu_i A \mathbf{v}_i$, or $(A A^T)(A \mathbf{v}_i) = \mu_i (A \mathbf{v}_i)$
- Comparing with $(A A^T)\mathbf{u}_i = \lambda_i \mathbf{u}_i$: the eigenvectors of A Aᵀ are $\mathbf{u}_i = A \mathbf{v}_i$ with eigenvalues $\lambda_i = \mu_i$.
Application to Images (cont'd)
Step 3: compute AᵀA (i.e., instead of A Aᵀ) and its eigenvectors vi.
Step 4b: compute the λi, ui of A Aᵀ using λi = μi and ui = A vi, then normalize ui to unit length.
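A sketch of this trick (often called the "snapshot" method; the function name is illustrative, and the leading eigenvalues are assumed positive):

```python
import numpy as np

def pca_snapshot(A):
    """Top eigenpairs of (1/M) A Aᵀ via the small M×M matrix AᵀA.

    A: (N², M) matrix whose columns are the centered images Φ_i.
    """
    M = A.shape[1]
    mu, V = np.linalg.eigh(A.T @ A / M)    # (AᵀA/M) v_i = μ_i v_i
    U = A @ V                              # u_i = A v_i, eigenvectors of AAᵀ/M
    U /= np.linalg.norm(U, axis=0)         # normalize each u_i to unit length
    return mu[::-1], U[:, ::-1]            # λ_i = μ_i, largest first
```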
Example
[Figure: training dataset of face images.]
Example (cont'd)
[Figure: the mean face x̄ and the top eigenvectors u1, u2, u3 visualized as images ("eigenfaces").]
Example (cont'd)
• How can you visualize the eigenvectors (eigenfaces) as an image? Reshape each N²-dimensional eigenvector ui back to an N × N pixel grid and rescale its values to the display range.
Application to Images (cont'd)
• Interpretation: represent a face in terms of eigenfaces:
$\hat{\mathbf{x}} = \sum_{i=1}^{K} y_i \mathbf{u}_i + \bar{\mathbf{x}} = y_1 \mathbf{u}_1 + y_2 \mathbf{u}_2 + \cdots + y_K \mathbf{u}_K + \bar{\mathbf{x}}$
i.e., $\hat{\mathbf{x}} - \bar{\mathbf{x}}$ is a weighted combination of the eigenfaces u1, u2, u3, ... with coefficients y1, y2, y3, ...
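A minimal sketch of this reconstruction, assuming U holds orthonormal eigenface columns:

```python
import numpy as np

def reconstruct(x, U, mean, K):
    """Approximate image x from its first K eigenface coefficients."""
    y = U[:, :K].T @ (x - mean)    # y_i = u_iᵀ (x − x̄)
    return U[:, :K] @ y + mean     # x̂ = Σ y_i u_i + x̄
```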
Case Study: Eigenfaces for Face Detection/Recognition
• Face Recognition
− The simplest approach is to think of it as a template matching problem.
Face Recognition Using Eigenfaces
Face Detection Using Eigenfaces
Step 2: project the unknown image Φ onto the eigenspace:
$\hat{\Phi} = \sum_{i=1}^{K} y_i \mathbf{u}_i$, where $y_i = \Phi^T \mathbf{u}_i$
Step 3: compute $e_d = \|\Phi - \hat{\Phi}\|$
The distance ed is called the distance from face space (dffs).
[Figure: even for a non-face input, the reconstructed image looks like a face again, so a large dffs signals "not a face".]
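A minimal sketch of the dffs computation, under the same orthonormal-U assumption:

```python
import numpy as np

def dffs(x, U, mean):
    """Distance from face space: large values suggest a non-face input."""
    phi = x - mean
    phi_hat = U @ (U.T @ phi)      # Φ̂: projection onto the eigenspace
    return np.linalg.norm(phi - phi_hat)
```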
Reconstruction from partial information
• Robust to partial face occlusion.
[Figure: occluded input faces and their eigenspace reconstructions.]
Eigenfaces
• Can be used for face detection, tracking, and recognition!
Limitations
• Background changes cause problems
− De-emphasize the outside of the face (e.g., by multiplying the input
image by a 2D Gaussian window centered on the face).
• Light changes degrade performance
− Light normalization might help but this is a challenging issue.
• Performance decreases quickly with changes to face size
− Scale input image to multiple sizes.
− Multi-scale eigenspaces.
• Performance decreases with changes to face orientation
(but not as fast as with scale changes)
− Out-of-plane rotations are more difficult to handle.
− Multi-orientation eigenspaces.
Limitations (cont’d)
▸Not robust to misalignment.
Limitations (cont’d)
• PCA is not always an optimal dimensionality-reduction
technique for classification purposes.
Linear Discriminant Analysis (LDA)
[Figure: two candidate projection directions for two-class data; a good direction separates the class means relative to the within-class spread.]
▸ Let $M = \sum_{i=1}^{C} M_i$ be the total number of samples, μi the mean of the i-th class, i = 1, 2, ..., C, and μ the mean of the whole dataset:
$\boldsymbol{\mu} = \frac{1}{C} \sum_{i=1}^{C} \boldsymbol{\mu}_i$
• Project the data: $\mathbf{y} = U^T \mathbf{x}$, and suppose the scatter matrices of the projected data y are $\tilde{S}_b$, $\tilde{S}_w$.
• LDA seeks the projection maximizing
$\max \dfrac{|\tilde{S}_b|}{|\tilde{S}_w|} = \max \dfrac{|U^T S_b U|}{|U^T S_w U|}$
Linear Discriminant Analysis (LDA)
• The columns of U are the solutions of the generalized eigenvalue problem:
$S_b \mathbf{u}_k = \lambda_k S_w \mathbf{u}_k$, or equivalently $S_w^{-1} S_b \mathbf{u}_k = \lambda_k \mathbf{u}_k$
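A sketch using SciPy's generalized symmetric eigensolver (the function name is illustrative, and Sw is assumed positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(Sb, Sw, n_components):
    """Solve S_b u = λ S_w u and return the top discriminant directions."""
    lam, U = eigh(Sb, Sw)              # generalized symmetric eigenproblem
    order = np.argsort(lam)[::-1]      # keep the largest λ_k first
    return U[:, order[:n_components]]
```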
Linear Discriminant Analysis (LDA)
• When there are few samples, Sw can become singular; to alleviate this problem, PCA could be applied first:
$\mathbf{x} = (x_1, x_2, \ldots, x_N)^T \xrightarrow{\text{PCA}} \mathbf{y} = (y_1, y_2, \ldots, y_M)^T$, with M < N, and LDA is then performed on y.