
ELEN 6820

Speech and audio signal processing


Instructor: Nima Mesgarani (nm2764)
3 credits
TA: Yi Luo (yl3364)
Office hours: TBD
Week  Topic                               HW
 1    Introduction and history            -
 2    Discrete signal processing          DSP (W)
 3    Machine learning 1                  Neural network and VAD (P)
 4    Machine learning 2                  -
 5    Speech signal production            Speech production (W)
 6    Speech signal representation        Speech enhancement (P)
 7    Speech enhancement and separation   -
 8    Human speech perception             Acoustic event detection (P)
 9    Acoustic modeling                   -
10    Sequence modeling and HMMs          Phoneme recognition and ASR (P)
11    Language modeling                   -
12    Automatic speech recognition        Project
13    Music signal processing             -
Homework #2

• Neural network basics and Voice Activity Detection (VAD)

• No prior knowledge required

• Python and PyTorch are the default platforms. Make sure they are properly installed

• No GPU or additional computational resources are required, just your laptop

• Due in two weeks
Typical speech recognition system

Stage                Maps                  Typical methods
Feature extraction   waveform to feature   DSP, DNN
Acoustic modeling    feature to phoneme    GMM-HMM, DNN-HMM
Lexicon              phoneme to word       Pronunciation dictionary
Language modeling    word to text          n-gram, RNN

Example: the waveform of “two” → features → phonemes /T/ /UW/ → word “two” → output text “two”
Example: acoustic modeling

[Figure: a spectrogram patch of 21 frequency bands × 11 time samples (110 ms) fed to a linear acoustic model]

• Input x: the patch flattened into 231 (time × frequency) values
• Weight matrix W: 40 × 231
• Output s: scores for 40 phonemes
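To make the sizes concrete, here is a minimal PyTorch sketch (PyTorch being the course's default platform); the random input and the final softmax are my own additions for illustration, not part of the slide.

```python
# A minimal sketch: a single linear acoustic-model layer mapping a flattened
# 21-band x 11-frame patch (231 values) to 40 phoneme scores.
import torch
import torch.nn as nn

n_bands, n_frames, n_phonemes = 21, 11, 40           # sizes from the slide
acoustic_layer = nn.Linear(n_bands * n_frames, n_phonemes)  # W is 40 x 231 (plus a bias)

x = torch.randn(1, n_bands * n_frames)                # one flattened spectrogram patch
logits = acoustic_layer(x)                            # shape (1, 40)
phoneme_probs = torch.softmax(logits, dim=-1)         # per-phoneme probabilities
print(phoneme_probs.shape)
```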
Pattern classification

• Goal: to classify objects (or patterns) into categories (or classes)

    Observation s → Feature Extraction → Feature Vector x → Classifier → Class ωi

• Types of problems:
  1. Supervised: classes are known beforehand, and data samples of each class are available
  2. Unsupervised: classes (and/or the number of classes) are not known beforehand, and must be inferred from the data

What is machine learning?

• Can you learn any one-to-one mapping, given enough training data?
• Supervised vs. Unsupervised
• Deterministic vs. Stochastic
• Generative vs. Discriminative
Building a classifier

• Define classes/attributes
  • could be based on explicit rules (rule based), or defined through training examples (data driven)
• Define the feature space
• Define the decision algorithm
• Measure performance
Feature extraction

• What features do we use to assign samples to different classes?
  • waveform vs. formants vs. cepstra
  • invariance under irrelevant modifications
• Theoretically equivalent features may act very differently under a particular classifier:
  • representations make important aspects explicit
  • remove irrelevant information
• Feature design incorporates “domain knowledge”
  • although more data -> less need for “cleverness”
• Smaller feature space (fewer dimensions)
  • simpler models (fewer parameters)
  • less training data needed
  • faster training
Generative vs. Discriminative

• Generative models
  • Model the class-conditional PDF and prior probabilities
  • “Generative” since sampling from them can generate synthetic data points
  • Popular models: Gaussians, Bayes, mixtures of Gaussians, Hidden Markov Models, neural networks

• Discriminative models
  • Directly estimate posterior probabilities
  • No attempt to model the underlying probability distributions
  • Popular models: logistic regression, SVMs, neural networks
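As a rough illustration of the two viewpoints (my own toy example, not from the lecture), the sketch below solves the same one-dimensional, two-class problem generatively, by fitting a Gaussian per class and applying Bayes' rule, and discriminatively, by letting logistic regression estimate the posterior directly; all numbers are made up.

```python
# Generative vs. discriminative on a toy 1-D, two-class problem.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x1 = rng.normal(-1.0, 1.0, 200)          # class ω1 samples
x2 = rng.normal(+1.5, 1.0, 200)          # class ω2 samples
x = np.concatenate([x1, x2])[:, None]
y = np.concatenate([np.zeros(200), np.ones(200)])

# Generative: model p(x|ωi) and P(ωi), then apply Bayes' rule for the posterior.
mu1, s1, mu2, s2 = x1.mean(), x1.std(), x2.mean(), x2.std()
prior = 0.5                               # equal class sizes here

def posterior_w2(x_new):
    l1 = norm.pdf(x_new, mu1, s1) * prior
    l2 = norm.pdf(x_new, mu2, s2) * prior
    return l2 / (l1 + l2)

# Discriminative: estimate P(ω2|x) directly, with no density model.
clf = LogisticRegression().fit(x, y)

x_new = 0.3
print(posterior_w2(x_new), clf.predict_proba([[x_new]])[0, 1])
```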
Example: Language Identification

• Generative approach:
  • learn each language, then determine which language the speech belongs to

• Discriminative approach:
  • determine the linguistic differences without learning any language, a much easier task
Topics

• Bayes theorem
• Dimensionality reduction with PCA and LDA
• K-means unsupervised clustering
• Gaussian density models
• Neural network models: linear neuron, logistic neuron, backpropagation
Probability?

• P(A) = number of outcomes in A, divided by the total number of possible outcomes
• Conditional probability: the probability of a particular outcome, given a related event
• Why is P(A|B) = P(A,B)/P(B)?
  • What about A changes if we know B happened?
Conditional Probability

• Probability of a person going to Columbia, if they live in Manhattan
• Not everyone who goes to Columbia lives in Manhattan
Conditional Probability

• Probability of a person going to Columbia University?
• Probability of a person living in Manhattan?
• Probability of a person living in Manhattan and going to Columbia?
• Probability of a person going to Columbia, if they live in Manhattan?
• Probability of a person living in Manhattan, if they go to Columbia?
Conditional Probability

• P(A) = number of outcomes in A, divided by the total number of possible outcomes
• Conditional probability: the probability of a particular outcome, given a related event
• Why is P(A|B) = P(A,B)/P(B)?
  • What about A changes if we know B happened? Both the number of possibilities in A, and the sample size
• P(A,…) = P(A|…) P(…)
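A quick numerical check of the definition, using made-up numbers purely for illustration:

```python
# Hypothetical numbers, only to illustrate P(A|B) = P(A,B) / P(B).
p_manhattan = 0.10               # P(B): lives in Manhattan
p_columbia_and_manhattan = 0.02  # P(A,B): lives in Manhattan AND goes to Columbia

p_columbia_given_manhattan = p_columbia_and_manhattan / p_manhattan
print(p_columbia_given_manhattan)   # 0.2

# Consistency check of the product rule P(A,B) = P(A|B) P(B):
assert abs(p_columbia_given_manhattan * p_manhattan - p_columbia_and_manhattan) < 1e-12
```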
Bayes Theorem

[Figure: two class-conditional PDFs, p(x|ω1) and p(x|ω2), plotted over x]

Define:  {ωi}      a set of M mutually exclusive classes
         P(ωi)     a priori probability for class ωi
         p(x|ωi)   PDF for feature vector x in class ωi
         P(ωi|x)   a posteriori probability of ωi given x

From Bayes' rule:

    P(ωi|x) = p(x|ωi) P(ωi) / p(x)

where

    p(x) = Σ_{i=1}^{M} p(x|ωi) P(ωi)
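A small numerical sketch of this rule (my own illustration, with assumed one-dimensional Gaussian class-conditional PDFs and assumed priors):

```python
# Bayes' rule with assumed 1-D Gaussian class-conditional PDFs and assumed priors.
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])                      # P(ω1), P(ω2)

def likelihoods(x):
    return np.array([norm.pdf(x, loc=0.0, scale=1.0),    # p(x|ω1)
                     norm.pdf(x, loc=2.0, scale=1.0)])   # p(x|ω2)

x = 1.2
evidence = np.sum(likelihoods(x) * priors)         # p(x) = Σ_i p(x|ωi) P(ωi)
posteriors = likelihoods(x) * priors / evidence    # P(ωi|x)
print(posteriors, posteriors.sum())                # posteriors sum to 1
```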


Bayes decision theory

• The probability of making an error given x is:

    P(error|x) = 1 − P(ωi|x)   if we decide class ωi

• To minimize P(error|x) (and P(error)):

    Choose ωi if P(ωi|x) > P(ωj|x)   ∀ j ≠ i

• For a two-class problem this decision rule means:

    Choose ω1 if  p(x|ω1) P(ω1) / p(x)  >  p(x|ω2) P(ω2) / p(x) ;   else ω2

• This rule can be expressed as a likelihood ratio:

    Choose ω1 if  p(x|ω1) / p(x|ω2)  >  P(ω2) / P(ω1) ;   else choose ω2
Sources of error

[Figure: the weighted class-conditional densities p(x|ω1)·P(ω1) and p(x|ω2)·P(ω2) plotted over x, with decision regions X̂ω1 and X̂ω2; the tail of each density falling in the other region gives P(ω2|X̂ω1) = P(err|X̂ω1) and P(ω1|X̂ω2) = P(err|X̂ω2)]

• Suboptimal threshold / regions (bias error)
  • use a Bayes classifier
• Incorrect distributions (model error)
  • better distribution models / more training data
• Misleading features (“Bayes error”)
  • irreducible for a given feature set, regardless of the classification scheme
Discriminant functions

• Alternative formulation of the Bayes decision rule
• Define a discriminant function, gi(x), for each class ωi:

    Choose ωi if gi(x) > gj(x)   ∀ j ≠ i

• Functions yielding identical classification results:

    gi(x) = P(ωi|x)
          = p(x|ωi) P(ωi)
          = log p(x|ωi) + log P(ωi)

• Discriminant functions partition the feature space into decision regions, separated by decision boundaries
• The choice of function impacts computation costs
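A minimal sketch of classification with the log-domain discriminant gi(x) = log p(x|ωi) + log P(ωi); the one-dimensional Gaussian class models and priors below are made-up numbers for illustration:

```python
# Classification with discriminant functions g_i(x) = log p(x|ω_i) + log P(ω_i),
# assuming 1-D Gaussian class-conditional models (illustrative parameters).
import numpy as np
from scipy.stats import norm

means = np.array([0.0, 2.0, 5.0])     # class-conditional Gaussian means
stds = np.array([1.0, 1.0, 2.0])      # class-conditional Gaussian standard deviations
priors = np.array([0.5, 0.3, 0.2])    # P(ω_i)

def classify(x):
    g = norm.logpdf(x, means, stds) + np.log(priors)   # g_i(x) for every class
    return int(np.argmax(g))                           # choose ω_i with the largest g_i(x)

print(classify(1.1), classify(4.0))
```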
Density estimation

• Used to estimate the underlying PDF, p(x|ω)
• Parametric models: assume a specific functional form for the PDF, and optimize its parameters to fit the data
• Non-parametric methods: determine the form of the PDF from the data; the number of parameters grows with the amount of data
Gaussian distributions (normal distribution)

• Gaussian PDFs are reasonable when a feature vector can be viewed as a perturbation around a reference

[Figure: one-dimensional Gaussian probability density plotted against (x − µ)/σ]

• Simple estimation procedures for model parameters
• Classification is often reduced to simple distance metrics
• Gaussian distributions are also called Normal distributions
Gaussian distribution

• One dimension:

    p(x) = (1 / (√(2π) σ)) · e^( −(x − µ)² / (2σ²) )

• Multiple dimensions (multivariate):

    p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) · e^( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

  where the mean vector and covariance matrix are

    µ = [µ(x1), …, µ(xn)]ᵀ        Σ = [cov(xi, xj)]

  e.g. in two dimensions, µ = [µ1, µ2]ᵀ
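A short numerical sketch (with a made-up mean and covariance) checking the closed-form multivariate density above against scipy's implementation:

```python
# Evaluate the d-dimensional Gaussian PDF from the closed-form expression and with scipy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])
d = len(mu)

diff = x - mu
norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
p_manual = norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(p_manual, p_scipy)    # the two values should agree
```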
Properties of multidimensional Gaussians

• If the i-th and j-th dimensions are statistically or linearly independent, then E(xi xj) = E(xi) E(xj) and σij = 0
• If all dimensions are statistically or linearly independent, then σij = 0 ∀ i ≠ j, and Σ has non-zero elements only on the diagonal
• So the off-diagonal elements of the covariance Σ model the correlation between features
• If the underlying density is Gaussian and Σ is a diagonal matrix, then the dimensions are statistically independent and

    p(x) = Π_{i=1}^{d} p(xi),   p(xi) ~ N(µi, σii),   σii = σi²
Diagonal covariance matrix: Σ = σ²I

    Σ = | 2  0 |
        | 0  2 |

[Figure: 3-dimensional PDF and its contour plot; circular, axis-aligned contours]
Diagonal covariance matrix: σij = 0 ∀ i ≠ j

    Σ = | 2  0 |
        | 0  1 |

[Figure: 3-dimensional PDF and its contour plot; elliptical, axis-aligned contours]
General covariance matrix: σij ≠ 0

    Σ = | 2  1 |
        | 1  1 |

[Figure: 3-dimensional PDF and its contour plot; elliptical contours tilted relative to the axes]
Gaussian mixture models

• The PDF is composed of a mixture of m component densities {ω1, …, ωm}:

    p(x) = Σ_{j=1}^{m} p(x|ωj) P(ωj)

• Component PDF parameters and mixture weights P(ωj) are typically unknown, making parameter estimation a form of unsupervised learning
• Gaussian mixtures assume Normal components:

    p(x|ωk) ~ N(µk, Σk)
Gaussian mixture: one-dimensional example

[Figure: probability density of a two-component mixture plotted against x/σ]

    p(x) = 0.6 p1(x) + 0.4 p2(x)
    p1(x) ~ N(−σ, σ²)        p2(x) ~ N(1.5σ, σ²)

• Expectation-Maximization (EM) is used to find the parameters (we will come back to this)
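A short sketch of this mixture, taking σ = 1 for concreteness (the slide leaves σ symbolic); it evaluates p(x) directly and then re-estimates the parameters from samples with scikit-learn's EM-based GaussianMixture:

```python
# The two-component mixture above with σ = 1, plus EM-based re-estimation.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

sigma = 1.0
weights = [0.6, 0.4]
means = [-sigma, 1.5 * sigma]

def p(x):
    # p(x) = 0.6 p1(x) + 0.4 p2(x)
    return sum(w * norm.pdf(x, m, sigma) for w, m in zip(weights, means))

# Draw samples from the mixture and re-estimate its parameters with EM.
rng = np.random.default_rng(0)
comps = rng.choice(2, size=5000, p=weights)
samples = rng.normal(np.array(means)[comps], sigma)[:, None]

gmm = GaussianMixture(n_components=2, random_state=0).fit(samples)
print(gmm.weights_, gmm.means_.ravel())   # should be close to [0.6, 0.4] and [-1, 1.5]
```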
Principal Component Analysis

• A method to reduce the dimension of the data by eliminating redundancy
• Takes advantage of the correlation between features
• Unsupervised
PCA goal: maximize the explained variance

A statistical view of PCA [Hotelling, 1933]: estimate the principal components of a multivariate random variable x ∈ R^D from given sample points {xi}.

For d < D, the d “principal components” are defined to be d uncorrelated linear components of x:

    yi = uiᵀ x ∈ R,   i = 1, …, d                                        (2.7)

for some ui ∈ R^D such that the variance of yi is maximized subject to

    uiᵀ ui = 1,   Var(y1) ≥ Var(y2) ≥ ⋯ ≥ Var(yd).

For example, to find the first principal component, we seek a vector u1* ∈ R^D such that

    u1* = arg max_{u1 ∈ R^D} Var(u1ᵀ x),   s.t.  u1ᵀ u1 = 1.              (2.8)

Without loss of generality, we will assume in what follows that x has zero mean.

Theorem 2.2 (Principal Components of a Random Variable). The first d principal components of a multivariate random variable x are given by the d leading eigenvectors of its covariance matrix Σx = E[xxᵀ].

Proof sketch: notice that for any u ∈ R^D,

    Var(uᵀ x) = E[(uᵀ x)²] = E[uᵀ x xᵀ u] = uᵀ Σx u.

Then finding the first principal component is equivalent to

    max_{u1 ∈ R^D} u1ᵀ Σx u1,   s.t.  u1ᵀ u1 = 1.                         (2.9)

Solving this constrained maximization with a Lagrange multiplier gives the necessary condition for u1 to be an extremum:

    Σx u1 = λ u1,

i.e. u1 is an eigenvector of the covariance matrix Σx.
Another interpretation

• Find the projection Ud that minimizes the reconstruction error

We consider the case in which the data points have zero mean; if not, simply subtract the mean from each point, and the solution for Ud remains the same. The following theorem gives a constructive solution for the optimal Ûd.

Theorem 2.1 (PCA via SVD). Let X = [x1, x2, …, xN] ∈ R^{D×N} be the matrix formed by stacking the (zero-mean) data points as its column vectors. Let X = UΣVᵀ be the singular value decomposition (SVD) of the matrix X. Then for any given d < D, a solution to PCA, Ûd, is exactly the first d columns of U; and ŷi is the i-th column of the top d × N submatrix Σd Vdᵀ of the matrix ΣVᵀ.

Proof sketch: note that the problem

    min_{Ud} Σ_{i=1}^{N} ‖xi − Ud Udᵀ xi‖²                                (2.6)

is equivalent to

    min_{Ud} Σ_{i=1}^{N} trace( (xi − Ud Udᵀ xi)(xi − Ud Udᵀ xi)ᵀ )
    ⇔  min_{Ud} trace( (I − Ud Udᵀ) X Xᵀ ),

where, for the second equivalence, we use the facts trace(AB) = trace(BA), Ud Udᵀ Ud Udᵀ = Ud Udᵀ, and X Xᵀ = Σ_{i=1}^{N} xi xiᵀ to simplify the expression. Substituting X = UΣVᵀ into the above expression, the problem becomes …
PCA steps

• Subtract the mean of the data
• Form the correlation matrix
• Perform Singular Value Decomposition
• Choose the first d components to keep a desired % of the variance
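A compact NumPy sketch of these steps (my own illustration on random data; the 90% variance target is an arbitrary choice):

```python
# PCA via the SVD of the zero-mean data matrix, following the steps above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10-dimensional features

# 1) Subtract the mean of the data
Xc = X - X.mean(axis=0)

# 2-3) The SVD of the centered data encodes the correlation structure
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 4) Choose the first d components that keep, e.g., 90% of the variance
explained = (s ** 2) / np.sum(s ** 2)
d = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1

components = Vt[:d]                     # principal directions (d x 10)
Y = Xc @ components.T                   # projected data (100 x d)
print(d, Y.shape)
```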
Discriminant projections

• Also a dimensionality reduction through projection
• Different from PCA: maximize class discriminability
Linear Discriminant Analysis (LDA) … two classes

• In order to find a good projection vector, we need to define a measure of separation between the projections.
• The mean vector of each class in x-space and in the projected y-space is:

    µi = (1/Ni) Σ_{x ∈ ωi} x        µ̃i = (1/Ni) Σ_{y ∈ ωi} y = (1/Ni) Σ_{x ∈ ωi} wᵀ x = wᵀ µi

• We could then choose the distance between the projected means as our objective function:

    J(w) = |µ̃1 − µ̃2| = |wᵀ (µ1 − µ2)|

• Is this a good objective function? The distance between the projected means is not a very good measure, since it does not take into account the standard deviation within the classes.

[Figure: two candidate projection axes; one axis has a larger distance between the projected means, but the other axis yields better class separability]

• To measure the within-class variability, for each class we define the scatter, an equivalent of the variance, as:

    s̃i² = Σ_{y ∈ ωi} (y − µ̃i)²

• s̃i² measures the variability within class ωi after projecting it onto the y-axis.
• Thus s̃1² + s̃2² measures the variability within the two classes at hand after projection, and is called the within-class scatter of the projected samples.
Fisher’s solution

• The Fisher linear discriminant is defined as the linear function wᵀx that maximizes the criterion function:

    J(w) = (µ̃1 − µ̃2)² / (s̃1² + s̃2²)

  i.e. between-class scatter in the numerator, within-class scatter in the denominator.

• Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.
LDA … two classes (within-class scatter)

• We will define a measure of the scatter in the multivariate feature space x, denoted as scatter matrices:

    Si = Σ_{x ∈ ωi} (x − µi)(x − µi)ᵀ        SW = S1 + S2

  where Si is the covariance matrix of class ωi, and SW is called the within-class scatter matrix.

• Now, the scatter of the projection y can be expressed as a function of the scatter matrix in feature space x:

    s̃i² = Σ_{y ∈ ωi} (y − µ̃i)² = Σ_{x ∈ ωi} (wᵀx − wᵀµi)² = Σ_{x ∈ ωi} wᵀ (x − µi)(x − µi)ᵀ w = wᵀ Si w

    s̃1² + s̃2² = wᵀ S1 w + wᵀ S2 w = wᵀ (S1 + S2) w = wᵀ SW w = S̃W

  where S̃W is the within-class scatter of the projected samples y.
LDA … two classes (between-class scatter)

• Similarly, the difference between the projected means (in y-space) can be expressed in terms of the means in the original feature space (x-space):

    (µ̃1 − µ̃2)² = (wᵀµ1 − wᵀµ2)² = wᵀ (µ1 − µ2)(µ1 − µ2)ᵀ w = wᵀ SB w = S̃B

• The matrix SB is called the between-class scatter of the original samples/feature vectors, while S̃B is the between-class scatter of the projected samples y.
• Since SB is the outer product of two vectors, its rank is at most one.
• We can finally express the Fisher criterion in terms of SW and SB as:

    J(w) = (µ̃1 − µ̃2)² / (s̃1² + s̃2²) = (wᵀ SB w) / (wᵀ SW w)

• Hence J(w) is a measure of the difference between the class means (encoded in the between-class scatter matrix), normalized by the within-class scatter.
LDA derivation

• To find the maximum of J(w), we differentiate and equate to zero:

    d/dw [ (wᵀ SB w) / (wᵀ SW w) ] = 0

    (wᵀ SW w) d/dw(wᵀ SB w) − (wᵀ SB w) d/dw(wᵀ SW w) = 0

    (wᵀ SW w) 2 SB w − (wᵀ SB w) 2 SW w = 0

• Dividing by 2 wᵀ SW w:

    ((wᵀ SW w)/(wᵀ SW w)) SB w − ((wᵀ SB w)/(wᵀ SW w)) SW w = 0

    SB w − J(w) SW w = 0

    SW⁻¹ SB w − J(w) w = 0
LDA solution

• Solving the generalized eigenvalue problem

    SW⁻¹ SB w = J(w) w,   where J(w) is a scalar,

  yields

    w* = arg max_w [ (wᵀ SB w) / (wᵀ SW w) ] = SW⁻¹ (µ1 − µ2)

  i.e. the eigenvector corresponding to the largest eigenvalue.

• This is known as Fisher’s Linear Discriminant, although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension.
Steps:

• Find the mean and covariance of each class
• Create the within-class and between-class scatter matrices
• Invert the within-class scatter matrix and multiply by the between-class scatter matrix
• Find the largest eigenvector and project the data onto it
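A NumPy sketch of these steps for the two-class case, on made-up toy data:

```python
# Two-class Fisher LDA following the steps above (toy data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(100, 2))    # class 1 samples
X2 = rng.normal([3, 1], 1.0, size=(100, 2))    # class 2 samples

# 1) Mean and scatter (covariance up to scale) of each class
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)

# 2) Within-class and between-class scatter matrices
Sw = S1 + S2
diff = (mu1 - mu2)[:, None]
Sb = diff @ diff.T

# 3) Invert the within-class scatter and multiply by the between-class scatter,
# 4) take the leading eigenvector and project the data onto it.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

y1, y2 = X1 @ w, X2 @ w                         # 1-D projections of each class
print(w, y1.mean(), y2.mean())
```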
K-Means Clustering

• An example of unsupervised learning
  • the number and form of the classes are unknown
  • the available data samples are unlabeled
  • useful for discovery of data structure before classification
• The result depends strongly on the clustering algorithm
Clustering issues

• What defines a cluster?
  • is there a prototype representing each cluster?
• What defines membership in a cluster?
  • what is the distance metric, d(x, y)?
• How many clusters are there?
  • is the number of clusters picked before clustering?
• How well do the clusters represent unseen data?
  • how is a new point assigned to a cluster?
K-Means

• Used to group data into K clusters
• Each cluster is represented by the mean of its assigned data
• The iterative algorithm converges to a local optimum
  • select K initial cluster means
  • iterate until a stopping criterion is satisfied
K-Means algorithm

• Used to group data into K clusters, {C1, …, CK}
• Each cluster is represented by the mean of its assigned data
• The iterative algorithm converges to a local optimum:
  • Select K initial cluster means, {µ1, …, µK}
  • Iterate until the stopping criterion is satisfied:
    1. Assign each data sample to the closest cluster:

         x ∈ Ci,   d(x, µi) ≤ d(x, µj),   ∀ j ≠ i

    2. Update the K means from the assigned samples:

         µi = E(x),   x ∈ Ci,   1 ≤ i ≤ K

• A nearest-neighbor quantizer is used for unseen data
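A minimal NumPy sketch of this loop (the Euclidean distance, the distortion-based stopping rule, and the toy data are my own choices):

```python
# Minimal K-means following the algorithm above.
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [3, 0], [0, 3])])
K = 3

mu = X[rng.choice(len(X), K, replace=False)]              # select K initial means at random
prev_D = np.inf
for _ in range(100):
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # squared distance to each mean
    assign = d2.argmin(axis=1)                            # 1) assign samples to the closest cluster
    D = d2[np.arange(len(X)), assign].sum()               # total distortion of this assignment
    mu = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else mu[i]
                   for i in range(K)])                    # 2) update each mean from its samples
    if prev_D - D < 1e-6:                                 # stop when the distortion stops decreasing
        break
    prev_D = D

print(mu)
```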


K-Means example: K = 3

• Random selection of 3 data samples for the initial means
• Euclidean distance metric between means and samples

[Figure: six snapshots (iterations 0-5) showing the cluster means and sample assignments converging]
K-means properties

• Usually used with a Euclidean distance metric:

    d(x, µi) = ‖x − µi‖² = (x − µi)ᵀ (x − µi)

• The total distortion, D, is the sum of the squared errors:

    D = Σ_{i=1}^{K} Σ_{x ∈ Ci} ‖x − µi‖²

• D decreases between the n-th and (n+1)-st iteration:  D(n+1) ≤ D(n)
• Many criteria can be used to terminate K-means:
  • no changes in sample assignments
  • maximum number of iterations exceeded
  • the change in total distortion, D, falls below a threshold:  1 − D(n+1)/D(n) < T
• Also known as Isodata, or the generalized Lloyd algorithm
• Similarities with the Expectation-Maximization (EM) algorithm for Gaussian mixtures
Some issues

• K-means converges to a local optimum
  • the global optimum is not guaranteed
  • the initial choice can influence the final result
• Initial K means can be chosen randomly
  • clustering can be repeated multiple times

[Figure: three different clusterings of the same data with K = 3, obtained from different initializations]
Importance of feature normalization

• Scaling the feature vector dimensions can significantly impact the clustering results
• Scaling can be used to normalize the dimensions so that a simple distance metric is a reasonable criterion for similarity
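A small sketch (my own illustration) of z-scoring each dimension before clustering so that a plain Euclidean distance treats all dimensions comparably:

```python
# Standardize each feature dimension to zero mean and unit variance before clustering.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 500),        # feature in "small" units
                     rng.normal(0, 1000, 500)])    # feature in "large" units

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)    # zero mean, unit variance per dimension
print(X.std(axis=0), X_scaled.std(axis=0))         # raw stds differ by ~1000x; scaled are ~1
```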
Summary

[Pipeline: Feature extraction → Acoustic modeling → Lexicon → Language modeling, mapping the waveform of “two” through /T/ /UW/ to the text “two”]

• What is machine learning?
• Supervised vs. Unsupervised (e.g., K-means)
• Deterministic vs. Stochastic
• Generative (e.g., Bayes) vs. Discriminative
• Dimensionality reduction (PCA, LDA)
