
Lecture Slides for Introduction to Machine Learning 2e

ETHEM ALPAYDIN
© The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Outline
Last class: Chapter 13 Kernel Machines
- Non-separable case: Soft Margin Hyperplane
- Kernel Trick
- Vectorial Kernels
- Multiple Kernel Learning
- Multiclass Kernel Machines
Today: Finish Chapter 13 Kernel Machines
Chapter 16 Hidden Markov Models

SVM for Regression
Use a linear model (possibly kernelized)
f(x) = w^T x + w_0
Use the ε-sensitive error function:

e_ε(r^t, f(x^t)) = 0                      if |r^t − f(x^t)| < ε
                 = |r^t − f(x^t)| − ε     otherwise

The primal problem is

min (1/2) ||w||^2 + C Σ_t (ξ_t^+ + ξ_t^-)

subject to

r^t − (w^T x^t + w_0) ≤ ε + ξ_t^+
(w^T x^t + w_0) − r^t ≤ ε + ξ_t^-
ξ_t^+, ξ_t^- ≥ 0
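A minimal sketch of ε-sensitive support vector regression using scikit-learn's SVR; the toy data and the values of C, epsilon, and gamma are illustrative assumptions, not part of the slides.

```python
# Kernelized epsilon-SVR on a small synthetic regression problem.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
r = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)   # noisy targets r^t

# C trades off flatness against violations of the epsilon-tube; epsilon sets the tube width.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
svr.fit(X, r)
print(svr.predict([[2.5]]))   # prediction f(x), a kernel expansion over the support vectors
```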
Kernel Regression
[Figure: kernel regression fits using a polynomial kernel and a Gaussian kernel]
One-Class Kernel Machines
Consider a sphere with center a and radius R
min R^2 + C Σ_t ξ_t

subject to

||x^t − a||^2 ≤ R^2 + ξ_t,   ξ_t ≥ 0

The dual is

max L_d = Σ_t α_t (x^t)^T x^t − Σ_t Σ_s α_t α_s (x^t)^T x^s

subject to

0 ≤ α_t ≤ C,   Σ_t α_t = 1
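As a rough illustration (not from the slides), scikit-learn's OneClassSVM gives a kernelized one-class learner; it solves the ν-formulation of Schölkopf et al. rather than the sphere (SVDD) primal above, but with an RBF kernel (constant K(x, x)) the two give the same solution. The data, gamma, and nu values are assumptions.

```python
# One-class novelty detection with an RBF kernel.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "normal" data only

# nu upper-bounds the fraction of training points left outside the boundary,
# playing a role similar to C in the sphere formulation (illustrative value).
oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

X_test = np.array([[0.1, -0.2], [4.0, 4.0]])
print(oc.predict(X_test))   # +1 = inside the learned region, -1 = outlier
```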
Kernel Dimensionality Reduction
Kernel PCA does PCA on the kernel matrix (equal to canonical PCA with a linear kernel).
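A minimal sketch of kernel PCA with scikit-learn's KernelPCA; the data, kernel choice, gamma, and number of components are illustrative assumptions. With kernel="linear" it reduces to ordinary PCA.

```python
# Kernel PCA: project data onto the leading kernel principal components.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)    # coordinates in the 2-dimensional kernel PCA space
print(Z.shape)               # (100, 2)
```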
Introduction
 Assumption
   Modeling dependencies in the input; examples are no longer iid (independent and identically distributed).
 Sequences
   Temporal:
    In speech: phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language).
    In handwriting: pen movements.
   Spatial:
    In a DNA sequence: base pairs.
    Base pairs in a DNA sequence cannot be modeled as a simple probability distribution.
Discrete Markov Process
 N states: S1, S2, ..., SN
 State at “time” t, qt = Si
 First-order Markov
P(qt+1=Sj | qt=Si, qt-1=Sk ,...) = P(qt+1=Sj | qt=Si)

 Transition probabilities
aij ≡ P(qt+1=Sj | qt=Si) aij ≥ 0 and Σj=1N aij=1

 Initial probabilities
πi ≡ P(q1=Si) and Σi=1N πi=1
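As an illustrative sketch (not on the slides), the following samples a state sequence from a first-order observable Markov model; the Π and A values are the ones used in the balls-and-urns example below.

```python
# Sampling a state sequence q_1, ..., q_T from (Pi, A).
import numpy as np

Pi = np.array([0.5, 0.2, 0.3])            # pi_i = P(q1 = S_i)
A = np.array([[0.4, 0.3, 0.3],            # a_ij = P(q_{t+1} = S_j | q_t = S_i)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

rng = np.random.default_rng(0)
T = 10
q = [rng.choice(3, p=Pi)]                 # draw q1 from Pi
for t in range(1, T):
    q.append(rng.choice(3, p=A[q[-1]]))   # draw q_{t+1} from row q_t of A
print(q)                                  # 0-indexed states
```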
Stochastic Automaton

P(O = Q | A, Π) = P(q1) ∏_{t=2}^T P(qt | qt−1) = π_{q1} a_{q1q2} ... a_{qT−1 qT}

For example, for Q = {3, 1, 2, 2, 3, 2, 1, ...}:
P(O = Q | A, Π) = π_3 a_31 a_12 a_22 a_23 a_32 a_21 ...
Example: Balls and Urns
Three urns each full of balls of one color
S1: red, S2: blue, S3: green

Π = [0.5, 0.2, 0.3]

      0.4  0.3  0.3
A =   0.2  0.6  0.2
      0.1  0.1  0.8

O = {S1, S1, S3, S3} = {red, red, green, green}
P(O | A, Π) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3)
            = π_1 · a_11 · a_13 · a_33
            = 0.5 · 0.4 · 0.3 · 0.8 = 0.048
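A minimal sketch (my own illustration) of this computation in code: for an observable Markov model, P(O | A, Π) is the initial probability of the first state times the product of transition probabilities along the observed sequence (states are 0-indexed below).

```python
# Evaluate P(O | A, Pi) when the state sequence itself is observed.
import numpy as np

Pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def sequence_prob(q, A, Pi):
    """P(O | A, Pi) = pi_{q1} * prod_t a_{q_{t-1} q_t}."""
    p = Pi[q[0]]
    for prev, cur in zip(q[:-1], q[1:]):
        p *= A[prev, cur]
    return p

O = [0, 0, 2, 2]                   # S1, S1, S3, S3 = red, red, green, green
print(sequence_prob(O, A, Pi))     # 0.5 * 0.4 * 0.3 * 0.8 = 0.048
```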
Balls and Urns: Learning
Observable Markov Model
Given K example sequences of length T
How to estimate the parameters?

Balls and Urns: Learning
Given K example sequences of length T:

π̂_i = #{sequences starting with Si} / #{sequences}
     = Σ_k 1(q_1^k = Si) / K

â_ij = #{transitions from Si to Sj} / #{transitions from Si}
     = Σ_k Σ_{t=1}^{T−1} 1(q_t^k = Si and q_{t+1}^k = Sj) / Σ_k Σ_{t=1}^{T−1} 1(q_t^k = Si)
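A minimal sketch of these counting estimators; states are assumed to be 0-indexed integers, and the toy sequences are made-up values for illustration.

```python
# Estimate Pi and A from K observed state sequences by counting.
import numpy as np

def estimate_markov(sequences, N):
    Pi = np.zeros(N)
    counts = np.zeros((N, N))
    for q in sequences:
        Pi[q[0]] += 1                        # sequences starting in S_i
        for prev, cur in zip(q[:-1], q[1:]):
            counts[prev, cur] += 1           # transitions S_i -> S_j
    Pi /= len(sequences)
    A = counts / counts.sum(axis=1, keepdims=True)   # normalize each row (assumes every state occurs)
    return Pi, A

seqs = [[0, 0, 2, 2], [1, 1, 2, 2], [0, 2, 2, 2]]    # toy data (assumption)
Pi_hat, A_hat = estimate_markov(seqs, N=3)
print(Pi_hat, A_hat, sep="\n")
```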
Hidden Markov Models
 States are not observable
 Discrete observations {v1, v2, ..., vM} are recorded; a probabilistic function of the state
 Emission probabilities
   bj(m) ≡ P(Ot=vm | qt=Sj)
 Example
   In each urn, there are balls of different colors, with different color probabilities in each urn.
   For each observation sequence, there are multiple possible state sequences that could have generated it.
Another Example
A colored ball choosing example:

             Urn 1   Urn 2   Urn 3
# of Red       30      10      60
# of Green     50      40      10
# of Blue      20      50      30

Probability of transition to another urn after picking a ball:

      U1    U2    U3
U1    0.1   0.4   0.5
U2    0.6   0.2   0.2
U3    0.3   0.4   0.3
Example (contd.)
Given:
      A (transition)            B (emission)
      U1    U2    U3            R     G     B
U1    0.1   0.4   0.5     U1    0.3   0.5   0.2
U2    0.6   0.2   0.2     U2    0.1   0.4   0.5
U3    0.3   0.4   0.3     U3    0.6   0.1   0.3

Observation: RRGGBRGR

State sequence: ??

Not so easily computable.
Example (contd.)

Here:
S = {U1, U2, U3}
V = {R, G, B}
For an observation sequence O = {o1 ... on} and state sequence Q = {q1 ... qn}:

      0.1   0.4   0.5
A =   0.6   0.2   0.2
      0.3   0.4   0.3

      0.3   0.5   0.2
B =   0.1   0.4   0.5
      0.6   0.1   0.3

πi ≡ P(q1 = Ui)
Elements of an HMM
 N: number of states
   S = {S1, S2, ..., SN}
 M: number of observation symbols
   V = {v1, v2, ..., vM}
 A = [aij]: N by N state transition probability matrix
   aij ≡ P(qt+1=Sj | qt=Si)
 B = [bj(m)]: N by M observation probability matrix
   bj(m) ≡ P(Ot=vm | qt=Sj)
 Π = [πi]: N by 1 initial state probability vector
   πi ≡ P(q1=Si)
λ = (A, B, Π): parameter set of the HMM
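A minimal sketch of λ = (A, B, Π) as arrays, using the three-urn example above (N = 3 states, M = 3 symbols R, G, B); the uniform initial distribution Π is an assumption, since the slides do not specify it.

```python
# HMM parameter set lambda = (A, B, Pi) for the three-urn example.
import numpy as np

A = np.array([[0.1, 0.4, 0.5],    # a_ij = P(q_{t+1} = U_j | q_t = U_i)
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2],    # b_j(m) = P(O_t = v_m | q_t = U_j); columns are R, G, B
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])
Pi = np.array([1/3, 1/3, 1/3])    # initial probabilities (assumed uniform; not given on the slide)
```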
Examples
• Gene regulation: O = {A, C, G, T}, S = {gene, transcription factor binding site, junk DNA, ...}
• Speech processing: O = speech signal, S = word or phoneme being uttered
• Text understanding: O = words, S = topic (e.g., sports, weather, etc.)
• Robot localization: O = sensor readings, S = discretized position of the robot
Three Basic Problems of HMMs
1. Evaluation:
   Given λ and O, calculate P(O | λ)
2. State sequence:
   Given λ and O, find Q* such that
   P(Q* | O, λ) = maxQ P(Q | O, λ)
3. Learning:
   Given X = {Ok}k, find λ* such that
   P(X | λ*) = maxλ P(X | λ)
(Rabiner, 1989)
Evaluation: Naïve solution
State sequence Q = {q1,…qT}
Assume independent observations:

P(O | Q, λ) = ∏_{t=1}^T P(Ot | qt, λ) = b_{q1}(O1) b_{q2}(O2) ... b_{qT}(OT)

Observations are mutually independent, given the hidden states.
Evaluation: Naïve solution
Observe that:

P(Q | λ) = π_{q1} a_{q1q2} a_{q2q3} ... a_{qT−1 qT}

And that:

P(O | λ) = Σ_Q P(O | Q, λ) P(Q | λ)
Evaluation: Naïve solution
Finally we get:

P(O | λ) = Σ_Q P(O | Q, λ) P(Q | λ)

- The above sum is over all state paths.
- There are N^T state paths, each 'costing' O(T) calculations, leading to O(T·N^T) time complexity.
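A minimal sketch (my own illustration) of this brute-force evaluation; it enumerates all N^T state paths, so it is only usable for tiny T and N. The observation indices in the usage comment and the A, B, Pi arrays from the earlier sketch are assumptions.

```python
# Naive O(T * N^T) evaluation of P(O | lambda) by summing over every state path.
from itertools import product

def naive_evaluate(O, A, B, Pi):
    """Sum P(O | Q, lambda) P(Q | lambda) over all N^T state paths."""
    N, T = len(Pi), len(O)
    total = 0.0
    for Q in product(range(N), repeat=T):           # every possible state path
        p = Pi[Q[0]] * B[Q[0], O[0]]                # pi_{q1} b_{q1}(O_1)
        for t in range(1, T):
            p *= A[Q[t-1], Q[t]] * B[Q[t], O[t]]    # a_{q_{t-1} q_t} b_{q_t}(O_t)
        total += p
    return total

# Usage with the urn-example arrays defined in the earlier sketch (symbols 0=R, 1=G, 2=B):
# naive_evaluate([0, 0, 1, 1, 2], A, B, Pi)
```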
Evaluation
 Forward variable:
   αt(i) ≡ P(O1 ... Ot, qt = Si | λ)
   The probability of observing the partial sequence {O1, ..., Ot} until time t and being in Si at time t, given the model λ
 Initialization:
   α1(i) = πi bi(O1)
 Recursion:
   αt+1(j) = [ Σ_{i=1}^N αt(i) aij ] bj(Ot+1)
 Evaluation result:
   P(O | λ) = Σ_{i=1}^N αT(i)
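A minimal sketch of the forward recursion as code (my own illustration); alpha is stored as a (T, N) array with 0-indexed time, and the urn-example arrays from the earlier sketch are assumed in the usage comment.

```python
# Forward algorithm: O(N^2 T) evaluation of P(O | lambda).
import numpy as np

def forward(O, A, B, Pi):
    """Return alpha (T x N) and P(O | lambda) = sum_i alpha_T(i)."""
    T, N = len(O), len(Pi)
    alpha = np.zeros((T, N))
    alpha[0] = Pi * B[:, O[0]]                      # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]    # [sum_i alpha_t(i) a_ij] b_j(O_{t+1})
    return alpha, alpha[-1].sum()

# Usage with the urn-example A, B, Pi from the earlier sketch (symbols 0=R, 1=G, 2=B):
# alpha, prob = forward([0, 0, 1, 1, 2], A, B, Pi)   # prob matches naive_evaluate(...)
```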
Evaluation
 Backward variable:
   βt(i) ≡ P(Ot+1 ... OT | qt = Si, λ)
   The probability of being in Si at time t and observing the partial sequence {Ot+1, ..., OT}
 Initialization:
   βT(i) = 1   (= P(OT+1 | qT = Si, λ))
 Recursion:
   βt(i) = Σ_{j=1}^N aij bj(Ot+1) βt+1(j)
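A matching sketch of the backward recursion (again my own illustration, with 0-indexed time); as a check, P(O | λ) can also be recovered from beta.

```python
# Backward algorithm: beta[t, i] with beta_T(i) = 1.
import numpy as np

def backward(O, A, B):
    """Return beta (T x N) computed by the backward recursion above."""
    T, N = len(O), A.shape[0]
    beta = np.ones((T, N))                          # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])    # sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
    return beta

# Usage with the urn-example A, B, Pi from the earlier sketch:
# beta = backward([0, 0, 1, 1, 2], A, B)
# P(O | lambda) also equals (Pi * B[:, 0] * beta[0]).sum(), matching forward().
```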
