4  Machine learning 2
9  Acoustic modeling
11 Language modeling
• Python and PyTorch are the default platforms. Make sure
  they are properly installed.
Example: classifying the word “two”
[Figure: the input is a 231-dimensional (time × frequency) feature vector x;
a weight matrix W of size 40 × 231 maps it to an output over 40 phonemes.]
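A minimal sketch of this input→output mapping in PyTorch (the 231→40 dimensions follow the figure; treating the output as a softmax over phonemes is an assumption made here for illustration):

```python
import torch
import torch.nn as nn

# 231-dimensional (time x frequency) input frame -> scores over 40 phonemes.
# The single linear layer plays the role of the weight matrix W (40 x 231) in
# the figure; the softmax that turns scores into probabilities is an assumption.
classifier = nn.Sequential(
    nn.Linear(231, 40),   # W: 40 x 231 (plus a bias term)
    nn.Softmax(dim=-1),
)

x = torch.randn(1, 231)           # one input feature vector
phoneme_probs = classifier(x)     # shape: (1, 40)
print(phoneme_probs.shape)
```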
Pattern Classification

Goal: To classify objects (or patterns) into categories (or classes)

[Block diagram: input → Feature Extraction → Classifier → class]
What is machine learning?
• Can you learn any one-to-one mapping given enough training data?

Types of Problems:
1. Supervised: Classes are known beforehand, and data samples of each class
   are available
2. Unsupervised: Classes (and/or number of classes) are not known beforehand,
   and must be inferred from data

Useful distinctions:
• Supervised vs. Unsupervised
• Deterministic vs. Stochastic
• Generative vs. Discriminative
Building a classifier
• Define classes/attributes
• Define feature space
• Define decision algorithm
• Measure performance
Feature extraction
• What features do we use to assign samples to different classes?
  • waveform vs. formants vs. cepstra (see the sketch below)
  • invariance under irrelevant modifications
• Theoretically equivalent features may act very differently under a
  particular classifier:
  • representations make important aspects explicit
  • remove irrelevant information
• Feature design incorporates “domain knowledge”
  • although more data -> less need for “cleverness”
• Smaller feature space (fewer dimensions)
  • simpler models (fewer parameters)
  • less training data needed
  • faster training
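As one concrete example of cepstral features, the sketch below computes MFCCs for a synthetic signal with librosa (the library choice and all parameter values are assumptions, not part of the lecture):

```python
import numpy as np
import librosa

# Synthetic 1-second "signal": a 200 Hz tone. In practice this would be speech.
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 200 * t).astype(np.float32)

# 13 mel-frequency cepstral coefficients per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)   # (13, number_of_frames)
```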
Generative vs. Discriminative
• Generative models
  • Model class-conditional PDF and prior probabilities
  • “generative” since sampling can generate synthetic data points
  • Popular models: Gaussians, Bayes, mixtures of Gaussians,
    Hidden Markov Models, neural networks
• Discriminative models
  • Directly estimate posterior probabilities
  • No attempt to model underlying probability distributions
  • Popular models: logistic regression, SVMs, neural networks
Example: Language Identification
• Generative approach:
• Discriminative approach:

Define: {ω_i}     a set of M mutually exclusive classes
        P(ω_i)    a priori probability for class ω_i
        p(x|ω_i)  PDF for feature vector x in class ω_i
        P(ω_i|x)  a posteriori probability of ω_i given x

From Bayes' rule:

    P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x),   where   p(x) = Σ_{i=1}^{M} p(x|ω_i) P(ω_i)
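A tiny numerical sketch of this rule (the priors and likelihood values below are made-up illustrations):

```python
import numpy as np

# Made-up numbers: three classes with priors P(w_i) and class-conditional
# likelihoods p(x | w_i) already evaluated at one feature vector x.
priors = np.array([0.5, 0.3, 0.2])           # P(w_i), sums to 1
likelihoods = np.array([0.02, 0.05, 0.01])   # p(x | w_i)

evidence = np.sum(likelihoods * priors)        # p(x) = sum_i p(x | w_i) P(w_i)
posteriors = likelihoods * priors / evidence   # Bayes' rule: P(w_i | x)

print(posteriors, posteriors.sum())            # posteriors sum to 1
print("decision:", np.argmax(posteriors))      # choose the most probable class
```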
[Figure: two overlapping class-conditional densities plotted against (x − µ)/σ;
in the overlap regions, Pr(ω_2 | x ∈ ω̂_1) = Pr(err | x ∈ ω̂_1) and
Pr(ω_1 | x ∈ ω̂_2) = Pr(err | x ∈ ω̂_2).]
Parametric Classifiers
• Appropriate when a feature vector can be seen as a perturbation around a reference
• Simple estimation procedures for model parameters
• Classification often reduced to simple distance metrics
• Gaussian distributions are also called Normal
One dimension:

    p(x) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )

[Figure: one-dimensional Gaussian probability density plotted against (x − µ)/σ]

Multiple dimensions (d-dimensional feature vector x):

    p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) )

where µ = [µ(x_1) … µ(x_n)]ᵀ is the mean vector and Σ = [cov(x_i, x_j)] is the
covariance matrix; for example, in two dimensions µ = [µ_1, µ_2]ᵀ.
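A short sketch of evaluating this density with NumPy (not from the lecture; it covers both the one-dimensional and multivariate cases):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian density p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))."""
    x, mu = np.atleast_1d(x).astype(float), np.atleast_1d(mu).astype(float)
    sigma = np.atleast_2d(sigma).astype(float)
    d = x.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)        # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

# One dimension: sigma is a 1x1 "covariance matrix" holding the variance.
print(gaussian_pdf(0.0, 0.0, [[1.0]]))        # ~0.3989, peak of the standard normal
# Two dimensions with correlated components.
print(gaussian_pdf([1.0, -0.5], [0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]]))
```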
Properties of multidimensional Gaussians
• If the i-th and j-th dimensions are statistically or linearly independent,
  then cov(x_i, x_j) = 0
[Figure: three surface plots of two-dimensional Gaussian densities with
different covariance structures]
• Each class is modeled by its own Gaussian: p(x|ω_k) ∼ N(µ_k, Σ_k)
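Putting the pieces together, here is a small sketch (an illustration with assumed toy data, not the lecture's code) of a Bayes classifier with one Gaussian per class:

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate a class-conditional Gaussian N(mu_k, Sigma_k) and a prior for each label in y."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False), len(Xk) / len(X))
    return params

def classify(x, params):
    """Pick the class maximizing log p(x | w_k) + log P(w_k) (Bayes decision rule)."""
    best, best_score = None, -np.inf
    for k, (mu, sigma, prior) in params.items():
        diff = x - mu
        log_lik = -0.5 * (diff @ np.linalg.solve(sigma, diff)
                          + np.log(np.linalg.det(sigma))
                          + len(x) * np.log(2 * np.pi))
        score = log_lik + np.log(prior)
        if score > best_score:
            best, best_score = k, score
    return best

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([3, 3], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
params = fit_gaussian_classes(X, y)
print(classify(np.array([2.5, 2.5]), params))   # expected: class 1
```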
Gaussian Mixture Example: One Dimension
[Figure: one-dimensional probability density formed by a mixture of Gaussians,
plotted over roughly −4 ≤ x ≤ 4]
• Unsupervised
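A minimal sketch of evaluating such a one-dimensional mixture density (the weights, means, and standard deviations below are illustrative assumptions, not the values behind the plot):

```python
import numpy as np

# Hypothetical one-dimensional mixture parameters, chosen only for illustration.
weights = np.array([0.4, 0.35, 0.25])
means   = np.array([-2.0, 0.5, 2.5])
stds    = np.array([0.6, 1.0, 0.4])

def gmm_density(x):
    """p(x) = sum_m w_m N(x; mu_m, sigma_m^2) for a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)[..., None]          # broadcast over components
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return (weights * comp).sum(axis=-1)

xs = np.linspace(-4, 4, 9)
print(gmm_density(xs))   # mixture density evaluated on a small grid
```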
PCA goal: maximize the explained variance

2.1.2 A Statistical View of PCA

Historically, PCA was first formulated in a statistical setting: to estimate the
principal components of a multivariate random variable x [Hotelling, 1933]. For
a multivariate random variable x ∈ R^D and any d < D, the d “principal
components” are defined to be d uncorrelated linear components of x:

    y_i = u_iᵀ x ∈ R,   i = 1, …, d,                                     (2.7)

for some u_i ∈ R^D such that the variance of y_i is maximized subject to

    u_iᵀ u_i = 1,   Var(y_1) ≥ Var(y_2) ≥ ··· ≥ Var(y_d).

For example, to find the first principal component, we seek a vector u_1* ∈ R^D
such that

    u_1* = arg max_{u_1 ∈ R^D} Var(u_1ᵀ x),   s.t.  u_1ᵀ u_1 = 1.        (2.8)

Without loss of generality, we will, in what follows, assume x has zero mean.

Theorem 2.2 (Principal Components of a Random Variable). The first d principal
components of a multivariate random variable x are given by the d leading
eigenvectors of its covariance matrix Σ_x = E[xxᵀ].

Proof. Notice that for any u ∈ R^D,

    Var(uᵀx) = E[(uᵀx)²] = E[uᵀx xᵀu] = uᵀ Σ_x u.

Then, to find the first principal component, the maximization (2.8) is
equivalent to

    max_{u_1 ∈ R^D} u_1ᵀ Σ_x u_1,   s.t.  u_1ᵀ u_1 = 1.                  (2.9)

Solving this constrained optimization problem with the Lagrange multiplier
method, we obtain the necessary condition for u_1 to be an extremum:

    Σ_x u_1 = λ u_1.
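A compact sketch of Theorem 2.2 in code: estimate Σ_x from samples and take its leading eigenvectors (NumPy; the toy data is an assumption made here):

```python
import numpy as np

def principal_components(X, d):
    """Return the d leading eigenvectors of the sample covariance of X (rows = samples)."""
    Xc = X - X.mean(axis=0)                    # work with zero-mean data
    cov = Xc.T @ Xc / len(Xc)                  # estimate of Sigma_x = E[x x^T]
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]      # indices of the d largest eigenvalues
    return eigvecs[:, order]                   # columns are u_1, ..., u_d

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated toy data
U = principal_components(X, d=1)
y = (X - X.mean(axis=0)) @ U                   # projections y_i = u_1^T x_i
print(U.ravel(), y.var())                      # direction of maximum variance and its variance
```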
Another interpretation
• Find the projections that minimize the reconstruction error

This formulation assumes the data points have zero mean; if not, simply subtract
the mean from each point, and the solution for U_d remains the same. The
following theorem gives a constructive solution for the optimal Û_d.

Theorem 2.1 (PCA via SVD). Let X = [x_1, x_2, …, x_N] ∈ R^(D×N) be the matrix
formed by stacking the (zero-mean) data points as its column vectors, and let
X = U Σ Vᵀ be the singular value decomposition (SVD) of the matrix X. Then, for
any given d < D, a solution to PCA, Û_d, is exactly the first d columns of U;
and ŷ_i is the i-th column of the top d × N submatrix Σ_d V_dᵀ of the matrix
Σ Vᵀ.

Proof. Note that the problem

    min_{U_d} Σ_{i=1}^{N} ‖x_i − U_d U_dᵀ x_i‖²                          (2.6)

is equivalent to

    min_{U_d} Σ_{i=1}^{N} trace[ (x_i − U_d U_dᵀ x_i)(x_i − U_d U_dᵀ x_i)ᵀ ]
            ⇔ min_{U_d} trace[ (I − U_d U_dᵀ) X Xᵀ ],

where, for the second equivalence, we use the facts trace(AB) = trace(BA),
U_d U_dᵀ U_d U_dᵀ = U_d U_dᵀ, and X Xᵀ = Σ_{i=1}^{N} x_i x_iᵀ to simplify the
expression. Substituting X = U Σ Vᵀ into the above expression, the problem
becomes …
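A minimal sketch of Theorem 2.1 with NumPy's SVD (the toy data is assumed):

```python
import numpy as np

def pca_svd(X, d):
    """PCA via SVD: X is D x N with zero-mean columns; returns U_d (D x d) and the d x N projections."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(S) V^T
    U_d = U[:, :d]                                     # first d columns of U
    Y = np.diag(S[:d]) @ Vt[:d, :]                     # top d x N submatrix of Sigma V^T
    return U_d, Y

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
X = X - X.mean(axis=1, keepdims=True)                  # zero-mean columns, as the theorem assumes
U_d, Y = pca_svd(X, d=2)
print(np.allclose(Y, U_d.T @ X))                       # True: the projections equal U_d^T X
```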
PCA steps

LDA … Two Classes
• We could then choose the distance between the projected means as our
  objective function:

      J(w) = |µ̃_1 − µ̃_2| = |wᵀµ_1 − wᵀµ_2|

• The scatter s̃_i² measures the variability within class ω_i after projection;
  hence it is called the within-class scatter of the projected samples.
[Figure: this axis has a larger distance between the projected means]
Fisher’s solution
• The Fisher linear discriminant is defined as the linear function wᵀx that
  maximizes the criterion function:

      J(w) = (µ̃_1 − µ̃_2)² / (s̃_1² + s̃_2²)

  (between-class scatter in the numerator, within-class scatter in the
  denominator)
• Therefore, we will be looking for a projection where examples from the same
  class are projected very close to each other and, at the same time, the
  projected means are as far apart as possible.
• We will define a measure of the scatter in multivariate feature space x;
  these are denoted as scatter matrices:

      S_i = Σ_{x∈ω_i} (x − µ_i)(x − µ_i)ᵀ      (within-class scatter of class ω_i)

  (see the code sketch below)
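A compact sketch of the two-class Fisher solution, using the standard closed form w ∝ S_W⁻¹(µ_1 − µ_2) (the closed form and the toy data are assumptions filled in from the usual derivation, not from these notes):

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher discriminant: w maximizing J(w) = (m1~ - m2~)^2 / (s1~^2 + s2~^2).

    The maximizer is proportional to S_W^-1 (mu_1 - mu_2), with S_W = S_1 + S_2.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # within-class scatter of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # within-class scatter of class 2
    w = np.linalg.solve(S1 + S2, mu1 - mu2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 3.0], size=(200, 2))   # class 1 samples (toy data)
X2 = rng.normal([3.0, 1.0], [1.0, 3.0], size=(200, 2))   # class 2 samples (toy data)
w = fisher_lda_direction(X1, X2)
print(w)                                   # projection direction
print((X1 @ w).mean(), (X2 @ w).mean())    # projected class means are well separated
```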
K-means clustering
• The total distortion, D, is the sum of the squared error:

      D = Σ_{i=1}^{K} Σ_{x∈C_i} ‖x − µ_i‖²

• D decreases between the n-th and (n+1)-st iteration: D(n+1) ≤ D(n)
• Many criteria can be used to terminate K-means (see the sketch below):
  • No changes in sample assignments
  • Maximum number of iterations exceeded
  • Change in total distortion, D, falls below a threshold:

        1 − D(n+1)/D(n) < T

• Also known as Isodata, or the generalized Lloyd algorithm
• Similarities with the Expectation-Maximization (EM) algorithm
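A minimal K-means sketch that uses the distortion-based stopping rule above (the random-sample initialization and the toy data are assumptions):

```python
import numpy as np

def kmeans(X, K, n_iters=100, tol=1e-6, seed=0):
    """Basic K-means: alternate nearest-mean assignment and mean update; stop when
    the relative drop in total distortion, 1 - D(n+1)/D(n), falls below tol."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)   # initial means: K random samples
    D_prev = np.inf
    for _ in range(n_iters):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distance to each mean
        assign = dists.argmin(axis=1)                                  # nearest-mean assignment
        D = dists[np.arange(len(X)), assign].sum()                     # total distortion D
        if 1 - D / D_prev < tol:                                       # stopping criterion from above
            break
        D_prev = D
        for k in range(K):                                             # update each mean
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign, D

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])
mu, assign, D = kmeans(X, K=3)
print(mu, D)
```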
Some issues
• K-means converges to a local optimum
  • The global optimum is not guaranteed
  • Initial choices can influence the final result
[Figure: K-Means Clustering: Initialization — example with K = 3]