
MDP

MDP


.: 2008019031, Email: andreadis.paul@yahoo.com

:
. , , ,
()
. , , ,

. , ,
,

c 2010 .

c 2010
: Pi,

MDP


This thesis studies reinforcement learning (Reinforcement Learning, RL), i.e. discrete-time optimal control, from the viewpoint of probabilistic inference. The policy is optimized with the Expectation-Maximization (EM) algorithm applied to a probabilistic mixture model. In addition, a variant based on the most probable trajectory, in the spirit of max-product belief propagation, is proposed and evaluated experimentally.
Keywords: Reinforcement Learning (RL), Probabilistic Inference, Expectation-Maximization (EM), Viterbi, Bayes, Dynamic Bayesian Networks (DBN), Markov, Markov Decision Process (MDP), Mixture models


, , , .
2010
, .
.
.
, ,
.
.
.
, ,
,
.





Contents

2   ............................................. 3
    2.1 ......................................... 3
    2.2 ......................................... 5
    2.3 ......................................... 6
    2.4 ......................................... 12
    2.5 ......................................... 15
3   - Viterbi ................................... 17
    3.1 ......................................... 17
    3.2 ......................................... 18
    3.3 ......................................... 20
    3.4 Markov .................................. 24
    3.5 ......................................... 31
4   Markov ...................................... 33
    4.1 ......................................... 33
    4.2 Markov (MDPs) ........................... 33
    4.3 MDPs .................................... 34
    4.4 EM ...................................... 36
    4.5 Policy Iteration ........................ 39
    4.6 Viterbi-EM .............................. 40
    4.7 ......................................... 42
5   ............................................. 43
    5.1 ......................................... 43
    5.2 A ....................................... 44
    5.3 B ....................................... 46
    5.4 ......................................... 46
    5.5 ......................................... 48
6   ............................................. 51
Bibliography .................................... 53

This thesis builds on the paper "Probabilistic Inference for solving discrete and continuous state Markov decision processes" (Toussaint and Storkey (2006)), in which Toussaint and Storkey show that computing an optimal policy for a Markov decision process can be treated as a problem of probabilistic inference. The time-unlimited process is expressed as a probabilistic mixture of finite-time processes, and the policy is optimized with an Expectation-Maximization (EM) algorithm on this mixture model. In this thesis a variant of that EM algorithm is studied, in which the expectation step is replaced by the computation of the most probable state sequence, the Viterbi path, and the two algorithms are compared experimentally.

Reinforcement learning (Reinforcement Learning, RL) is the area of machine learning that studies how an agent should act in an environment so as to maximize the reward it accumulates over time; the agent is not told which actions to take, but must discover them through interaction with the environment. The problems considered here are finite-state Markov decision processes (Markov decision process, MDP). Two inference-based algorithms for computing a policy are examined and compared: EM and the Viterbi variant. EM improves the policy using expectations taken over all possible trajectories, whereas the Viterbi variant uses only the most probable trajectory - the Viterbi path - to guide the policy update.

The rest of the thesis is organized as follows:
Chapter 2 introduces reinforcement learning (RL), its basic elements and the classical solution methods.
Chapter 3 presents the Expectation-Maximization (EM) algorithm and the Viterbi algorithm, which are used in the chapters that follow.
Chapter 4 describes the formulation of a Markov decision process as a probabilistic inference problem, the EM algorithm of Toussaint and Storkey for MDPs, and the proposed Viterbi-EM variant.
Chapter 5 presents the experiments, in which EM and Viterbi-EM are compared with Policy and Value Iteration.
Finally, Chapter 6 summarizes the conclusions and discusses directions for future work.


2.1

, . (Reinforcement Learning, RL),


, .
,
,
. , ,
. ,
, , .
, - (trial-and-error) , RL.
,
.
,
RL.
,
, . ,

.
,
.
, ,
, .
(supervised
learning), . .
3

2.
,
.


. ,
,
.


. , RL

.
. ,
.
. .
,
.


, .
. , . RL
, ,
. ,

.
(planning),
,
.

2.1.1


.

.
RL 1980.

. , .
, ,
, (temporal difference).
4

2.2.
1980
.
1950

.
1950 Richard Bellman
Hamilton Jacobi.
,
,
Bellman.
(Bellman (1957a)). (Bellman (1957b))
, Markov (Markovian decision processes, MDPs),
Ron Howard (Howard (1960)) policy iteration MDPs.

.
1960 (reinforcement) (, Waltz and Fu (1965),
Mendel (1966), Fu (1970), Mendel and McLaren (1970)).
Minsky (Minsky (1961)),
RL,
, :
;
RL.

2.2

,
: ,
, , , .
. ,
.
, , . RL,
. ,
.
RL.
( -)
, , ( ). RL
. , ,
5

2.

. ,
,
. ,
.
,
.
, . ,
, , ,
. , , , . ,
.
,
. ,

. ,

.
RL
. , , .
,
,
. RL
. RL - (trial-and-error).
. , RL
, ,
(state-space planning
methods).

2.3

.
.

2.3.1


. . , , . ,
6

2.3.

Figure 2.1: The agent-environment interaction in a Markov decision process (MDP): at time t the agent observes state s_t, selects action a_t, and receives reward r_{t+1} while moving to the next state s_{t+1}.

The agent and the environment interact at a sequence of discrete time steps t = 0, 1, 2, 3, .... At each step t the agent receives a representation of the environment's state s_t \in S, where S is the set of possible states, and on that basis selects an action a_t \in A(s_t), where A(s_t) is the set of actions available in state s_t. One time step later, partly as a consequence of its action, the agent receives a numerical reward r_{t+1} and finds itself in a new state s_{t+1}. Figure (2.1) depicts this agent-environment interaction. At each step the agent implements a mapping from states to probabilities of selecting each available action, called the agent's policy \pi_t, where \pi_t(s,a) is the probability that a_t = a when s_t = s. RL methods specify how the agent changes its policy as a result of its experience, with the goal of maximizing the total amount of reward it receives in the long run.

2.3.2

The agent's goal is formalized in terms of the reward signal: the agent must maximize not the immediate reward but the cumulative reward it receives in the long run. If the sequence of rewards received after time step t is r_{t+1}, r_{t+2}, r_{t+3}, ..., we seek to maximize the expected return R_t. In the simplest case the return is the sum of the rewards,

    R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T,    (2.1)

where T is the final time step. This formulation is appropriate for episodic tasks, in which the interaction naturally breaks into subsequences (episodes) ending in a terminal state. For continuing tasks, where the interaction does not break into episodes and T = \infty, the sum above could itself be infinite, and the notion of discounting is introduced. The discounted return is

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},    (2.2)

where the parameter \gamma, 0 \le \gamma \le 1, is called the discount rate.

2.3.3

Markov

RL, , . , Markov.
. ,

. . ,
,
.
,
. Markov,
Markov. ,
Markov
.
Formally, a state signal has the Markov property if and only if, for all s', r, and all possible histories s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0,

    \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0 \} = \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}.    (2.3)

2.3.
,
Markov.
Markov,

.
,
,
.

2.3.4

Markov

A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP). If the state and action spaces are finite, it is called a finite MDP. A finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given a state s and an action a, the transition probabilities are

    P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \},    (2.4)

and the expected value of the next reward, given s, a and the next state s', is

    R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}.    (2.5)

2.3.5

RL
. , -,
, .
. ,
.
s , V (s),
s
. MDP

V (s) = E {Rt |st = s} = E

X
k =0




rt +k +1 st = s ,
k

(2.6)

E {} , t .
, , . V
.
9

2.
a, s
, Q (s, a ), s, a,
:

Q (s, a ) = E {Rt |st = s, at = a } = E

X
k =0




rt +k +1 st = s, at = a .
k

(2.7)

Q .

A fundamental property of value functions is that they satisfy recursive relationships. For any policy \pi and any state s, the following consistency condition holds between the value of s and the values of its possible successor states:

    V^\pi(s) = E_\pi\{ R_t \mid s_t = s \}
             = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \}
             = E_\pi\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s \}
             = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s' \} ]
             = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ],    (2.8)

where the action a is taken from the set A(s) and the next state s' from the set S. Equation (2.8) is the Bellman equation for V^\pi. It expresses a relationship between the value of a state and the values of its successor states: the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. The value function V^\pi is the unique solution of its Bellman equation.

2.3.6

Solving an RL task means, roughly, finding a policy that achieves a lot of reward in the long run. A policy \pi is defined to be better than or equal to a policy \pi' if its expected return is greater than or equal to that of \pi' for all states, i.e. if V^\pi(s) \ge V^{\pi'}(s) for all s \in S. There is always at least one policy that is better than or equal to all other policies; this is an optimal policy, denoted \pi^*. Optimal policies share the same optimal state-value function V^*, defined as

    V^*(s) = \max_\pi V^\pi(s),    (2.9)

for all s \in S. Optimal policies also share the same optimal action-value function Q^*, defined as

    Q^*(s,a) = \max_\pi Q^\pi(s,a),    (2.10)

for all s \in S and a \in A(s). This function gives the expected return for taking action a in state s and thereafter following an optimal policy. We can write Q^* in terms of V^* as

    Q^*(s,a) = E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}.    (2.11)

Because V^* is the value function of a policy, it satisfies the consistency condition of the Bellman equation (2.8). Because it is the optimal value function, however, its consistency condition can be written in a special form, without reference to any particular policy. This is the Bellman optimality equation:

    V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a)
           = \max_a E\{ R_t \mid s_t = s, a_t = a \}
           = \max_a E\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \}
           = \max_a E\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s, a_t = a \}
           = \max_a E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}    (2.12)
           = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^*(s') ].    (2.13)

The last two forms are the Bellman optimality equation for V^*. The Bellman optimality equation for Q^* is

    Q^*(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}
             = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') ].    (2.14)

For finite MDPs, the Bellman optimality equation (2.13) has a unique solution independent of the policy. It is actually a system of equations, one for each state: if there are N states, there are N equations in N unknowns. If the dynamics of the environment (R^a_{ss'} and P^a_{ss'}) are known, one can in principle solve this system for V^*, and similarly solve the corresponding system for Q^*. Once V^* is available, it is relatively easy to determine an optimal policy: any policy that is greedy with respect to V^*, i.e. that in each state selects an action maximizing the right-hand side of the Bellman optimality equation by a one-step-ahead search, is optimal. With Q^* the agent does not even need to perform a one-step-ahead search: it simply picks, in each state s, any action that maximizes Q^*(s,a). Explicitly solving the Bellman optimality equation, however, is rarely practical. This solution relies on at least three assumptions that are seldom all true:
1. we accurately know the dynamics of the environment;
2. we have enough computational resources to complete the computation of the solution;
3. the states have the Markov property.
For the kinds of tasks of interest, one or more of these assumptions is usually violated, so in practice one has to settle for approximate solutions.

2.4


.
.
, (Schmidhuber (1997)).
. ,
, ,
RL.

2.4.1

(dynamic programming, DP)


MDP. DP
, ,
. ,
.

, (2.3.5).
,
Bellman (2.13, 2.14). DP
Bellman , ,
.
finite MDP,
(2.3.4).

First we consider policy evaluation: computing the state-value function V^\pi for an arbitrary policy \pi. The existence and uniqueness of V^\pi are guaranteed as long as either \gamma < 1 or eventual termination is guaranteed from all states under \pi. The Bellman equation (2.8) is a system of |S| simultaneous linear equations in |S| unknowns, which could in principle be solved directly, but an iterative solution is more suitable. The Bellman equation (2.8) is turned into an update rule:

    V_{k+1}(s) = E_\pi\{ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s \}
               = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V_k(s') ],    (2.15)

for all s \in S. The fixed point V_k = V^\pi is guaranteed by the Bellman equation, and the sequence \{V_k\} converges to V^\pi as k grows. This algorithm is called iterative policy evaluation. To produce each successive approximation V_{k+1} from V_k, iterative policy evaluation applies the same operation to every state s: it replaces the old value of s with a new value obtained from the old values of the successor states of s and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated. This kind of operation is called a full backup: each iteration backs up the value of every state once, to produce the new approximate value function V_{k+1}.
The reason for computing the value function of a policy is to help find better policies. The policy improvement theorem states that if \pi and \pi' are any pair of deterministic policies such that, for all s \in S,

    Q^\pi(s, \pi'(s)) \ge V^\pi(s),    (2.16)

then the policy \pi' must be as good as, or better than, \pi; that is, it must obtain greater or equal expected return from all states s \in S:

    V^{\pi'}(s) \ge V^\pi(s).    (2.17)

Moreover, if there is strict inequality in (2.16) at any state, then there must be strict inequality in (2.17) at at least one state. The idea extends naturally to changes at all states, selecting in each state the action that appears best according to Q^\pi(s,a).
In particular, consider the greedy policy \pi' given by

    \pi'(s) = \arg\max_a Q^\pi(s,a)
            = \arg\max_a E\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}
            = \arg\max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ],    (2.18)

where \arg\max_a denotes the action a at which the expression that follows is maximized. The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to V^\pi. By construction, the greedy policy satisfies the conditions of the policy improvement theorem (2.16), so it is as good as, or better than, the original policy. The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement.
Policy Iteration
Once a policy \pi has been improved using V^\pi to yield a better policy \pi', we can compute V^{\pi'} and improve it again to yield an even better \pi''. We thus obtain a sequence of monotonically improving policies and value functions:

    \pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*,    (2.19)

where \xrightarrow{E} denotes a policy evaluation and \xrightarrow{I} a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one, unless it is already optimal. Because a finite MDP has only a finite number of policies, this process converges to an optimal policy and the optimal value function in a finite number of iterations. This way of finding an optimal policy is called policy iteration.
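A minimal sketch of the policy-iteration loop (2.19), reusing the policy_evaluation routine from the sketch above and performing greedy improvement as in (2.18). The uniform initial policy and the numpy layout are illustrative assumptions.

def policy_iteration(P, R, gamma=0.9):
    n_actions, n_states, _ = P.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from a uniform policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)              # E: evaluate the current policy
        # I: greedy improvement with respect to V, equation (2.18)
        Q = np.array([[np.sum(P[a, s] * (R[a, s] + gamma * V))
                       for a in range(n_actions)] for s in range(n_states)])
        new_pi = np.zeros_like(pi)
        new_pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
        if np.allclose(new_pi, pi):
            return pi, V
        pi = new_pi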
Value Iteration
One drawback of policy iteration is that each of its iterations involves a full policy evaluation, which is itself an iterative computation requiring multiple sweeps through the state set, with exact convergence of V^\pi occurring only in the limit. The policy evaluation step can in fact be truncated without losing the convergence guarantees of policy iteration. An important special case is when policy evaluation is stopped after just one sweep (one backup of each state). This algorithm is called value iteration. It can be written as a particularly simple backup operation that combines the policy improvement and the truncated policy evaluation steps:

    V_{k+1}(s) = \max_a E\{ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a \}
               = \max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V_k(s') ],    (2.20)

for all s \in S. For an arbitrary initial V_0, the sequence \{V_k\} converges to the optimal value function V^*.
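A corresponding sketch of value iteration (2.20); the stopping threshold and the extraction of a greedy policy at the end are illustrative choices.

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # combined improvement / truncated-evaluation backup, equation (2.20)
            q = [np.sum(P[a, s] * (R[a, s] + gamma * V)) for a in range(n_actions)]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # greedy policy with respect to the converged value function
    policy = np.array([int(np.argmax([np.sum(P[a, s] * (R[a, s] + gamma * V))
                                      for a in range(n_actions)]))
                       for s in range(n_states)])
    return V, policy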

2.4.2

Monte Carlo

Monte Carlo ,
.
Monte Carlo .
,
(on-line)
. on-line
, . ,
, ,
.

2.4.3

(temporal-difference (TD) learning) Monte Carlo . Monte Carlo, TD


, . , TD
, . , Monte Carlo
.

2.5

, ,
. , : ,
(gradient descent in policy
space) (Baird and Moore (1998), Hansen (1998)),
Markov (Partially
observable Markov decision process, POMDP) (Vlassis and Toussaint (2009),
Spaan and Vlassis (2005), Hsu (Lee), Cassandra (Littman and Zhang)), (Modular, hierarchical reinforcement
learning) (Barto and Mahadevan (2003)).
(Multiagent / Distributed reinforcement learning) (Claus
and Boutilier (1998)) .

15

-
Viterbi
(maximum-likelihood parameter estimation problem)
- (Expectation-Maximization algorithm, EM)
.
EM . EM :
1. Gauss (mixture of Gaussian densities)
2. Markov (hidden Markov
model, HMM), Baum-Welch,
Gauss .
Viterbi, ,
,
.

3.1


Assume we have a density function p(x|\Theta) governed by a set of parameters \Theta (for example, p might be a Gaussian and \Theta its mean and covariance). We also have a data set of size N, drawn independently from this distribution, X = \{x_1, \dots, x_N\}. The resulting likelihood of the parameters given the data is

    p(X|\Theta) = \prod_{i=1}^{N} p(x_i|\Theta) = L(\Theta|X).    (3.1)

3. - Viterbi
The function L(\Theta|X) is called the likelihood of the parameters given the data; it is viewed as a function of \Theta for fixed X. In the maximum-likelihood problem the goal is to find the \Theta that maximizes L, i.e.

    \Theta^* = \arg\max_{\Theta} L(\Theta|X).    (3.2)

Often \log L(\Theta|X) is maximized instead, because it is analytically easier to handle. Depending on the form of p(x|\Theta) the problem can be easy or hard. For example, if p(x|\Theta) is a single Gaussian with \Theta = (\mu, \sigma^2), we can set the derivatives of \log L(\Theta|X) to zero and solve directly for \mu and \sigma^2. For many problems, however, no closed-form solution exists, and more elaborate, iterative techniques are required.

3.2

The EM algorithm (Dempster, Laird and Rubin (1977), Rabiner and Juang (1993)) is a general method for finding maximum-likelihood parameter estimates when the data are incomplete or have missing values. There are two main settings in which EM is used: when the data really do have missing values, owing to limitations of the observation process, and when direct optimization of the likelihood is analytically intractable but the likelihood can be simplified by assuming the existence of additional, hidden variables.

We assume that a data set X is observed, generated by some distribution; we call X the incomplete data. We assume further that a complete data set Z = (X, Y) exists, with joint density

    p(z|\Theta) = p(x, y|\Theta) = p(y|x, \Theta)\, p(x|\Theta).    (3.3)

Often this joint density arises from the marginal density p(x|\Theta) together with the assumption of hidden variables (as, for example, in the Baum-Welch algorithm for hidden Markov models). With this density we define the complete-data likelihood L(\Theta|Z) = L(\Theta|X, Y) = p(X, Y|\Theta). Note that this quantity is in fact a random variable, since the missing information Y is unknown, random, and governed by some underlying distribution: we can write L(\Theta|X, Y) = h_{X,\Theta}(Y), where X and \Theta are constants and Y is the random variable. The original likelihood L(\Theta|X) is then referred to as the incomplete-data likelihood function.
The EM algorithm first finds the expected value of the complete-data log-likelihood \log p(X, Y|\Theta) with respect to the unknown data Y, given the observed data X and the current parameter estimate. We define

    Q(\Theta, \Theta^{(i-1)}) = E\{ \log p(X, Y|\Theta) \mid X, \Theta^{(i-1)} \},    (3.4)

where \Theta^{(i-1)} is the current parameter estimate used to evaluate the expectation, and \Theta is the free variable of the function Q that will eventually be optimized. The observed data X and the current estimate \Theta^{(i-1)} are constants, while the missing data Y is a random variable governed by the distribution f(y|X, \Theta^{(i-1)}). The right-hand side of (3.4) can therefore be written as

    E\{ \log p(X, Y|\Theta) \mid X, \Theta^{(i-1)} \} = \int_{y \in \Upsilon} \log p(X, y|\Theta)\, f(y|X, \Theta^{(i-1)})\, dy,    (3.5)

where f(y|X, \Theta^{(i-1)}) is the marginal distribution of the unobserved data, which depends on the observed data X and on the current parameters, and \Upsilon is the space of values y can take. In the best cases this marginal is a simple analytical expression of the assumed parameters \Theta^{(i-1)} and the data; in the worst cases it is hard to obtain. Sometimes what is actually used is f(y, X|\Theta^{(i-1)}) = f(y|X, \Theta^{(i-1)})\, f(X|\Theta^{(i-1)}), which does not affect the subsequent derivations since the extra factor f(X|\Theta^{(i-1)}) does not depend on \Theta. As an analogy, consider a function h(\theta, Y) of a constant \theta and a random variable Y with marginal f_Y(y); then q(\theta) = E_Y[h(\theta, Y)] = \int_y h(\theta, y) f_Y(y)\, dy is a deterministic function of \theta.

The evaluation of this expectation is called the E-step of the algorithm. Note the two arguments of Q(\Theta, \Theta^{(i-1)}): the second is the current, fixed estimate, while the first is the free parameter that is adjusted to increase Q. The second step, the M-step, maximizes the expectation computed in the E-step, i.e. it sets

    \Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)}).    (3.6)

The two steps, E-step and M-step, are repeated as long as necessary. Each iteration is guaranteed not to decrease the log-likelihood, and the algorithm converges to a local maximum of the likelihood function. A modified M-step that merely finds some \Theta^{(i)} with Q(\Theta^{(i)}, \Theta^{(i-1)}) > Q(\Theta^{(i-1)}, \Theta^{(i-1)}), instead of the maximizer, yields the Generalized EM (GEM) algorithm, which is also guaranteed to converge.

3.3


One of the most widely used applications of EM is the estimation of the parameters of a mixture density. We assume the data are generated by a mixture of M component densities:

    p(x|\Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(x|\theta_i),    (3.7)

where the parameters are \Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M) with \sum_{i=1}^{M} \alpha_i = 1, and each p_i is a density parameterized by \theta_i. In other words, there are M component densities mixed together with M mixing coefficients \alpha_i. The incomplete-data log-likelihood for the data X is then

    \log L(\Theta|X) = \log \prod_{i=1}^{N} p(x_i|\Theta) = \sum_{i=1}^{N} \log ( \sum_{j=1}^{M} \alpha_j\, p_j(x_i|\theta_j) ),    (3.8)

which is difficult to optimize directly because it contains the logarithm of a sum.
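As a small illustration of (3.7)-(3.8), the following snippet evaluates the incomplete-data log-likelihood of a Gaussian mixture; the Gaussian parameterization of the components and the use of scipy are assumptions made for the example only.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, alphas, mus, Sigmas):
    """log L(Theta|X) of equation (3.8) for a Gaussian mixture."""
    # densities[i, j] = p_j(x_i | theta_j)
    densities = np.column_stack([multivariate_normal.pdf(X, mean=mu, cov=Sig)
                                 for mu, Sig in zip(mus, Sigmas)])
    return float(np.sum(np.log(densities @ np.asarray(alphas))))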


. X , ,
Y = {yi }N
i =1
,
. yi 1, . . . , M i, yi = k
i k.
Y , :

log L (|X, Y ) = logP (X, Y |) =

N
X

log(P (xi |yi )P (y)) =

N
X

i =1

log(yi pyi (xi |yi )) (3.9)

i =1

, ,
.
, , Y .
Y .
.
g
g
g
g
,, g = (1 , . . . , M , 1 , . . . , M )
g
L ( |X, Y ). g ,
g
pj (xi |j ) i j. , , aj (prior)
, , j = p(j). , Bayes
:
g

p(yi |xi , ) =
20

yi pyi (xi |yi )


p(xi |g )

yi pyi (xi |yi )

= PM

k =1

k pk (xi |k )

(3.10)

3.3.

p(y|X, ) =

N
Y

p ( yi |x i , g )

(3.11)

i =1

y = (y1 , . . . , yN )
. (3.5),


.
, (3.5) :

Q (, g ) =

log L (|X, y)p(y|X, g )

y Y
N
XX

log(yi pyi (xi |yi ))

y Y i =1
M
X

M X
M
X

...

M
X

N
X

log(yi pyi (xi |yi ))

N
Y

y N =1 i =1

...

y 1 =1 y 2 =1
M X
N
X

p(yj |xj , g )

j =1

M
X

y 1 =1 y 2 =1

N
Y

p(yj |xj , g )

j =1

M X
N X
M
X

`,yi log(` p` (xi |` ))

N
Y

yN =1 i =1 `=1

log(` p` (xi |` ))

l =1 i =1

p(yj |xj , g )

j =1

M
X

M
X

...

y1 =1 y2 =1

M
X

`,yi

N
Y

y N =1

p(yj |xj , g ).

(3.12)

j =1

.
` 1, . . . , M, :
M X
M
X
y 1 =1 y 2 =1
M
X

M
X

...

...

y 1 =1

y N =1
M
X

N
Y

p(yj |xj , g )

j =1
M
X

M
N
X
Y

...

y i 1 =1 y i + 1 =1

N
M
Y
X

`,yi

!
p(yj |xj , ) p(`|xi , g )
g

yN =1 j=1,j,i

p(yj |xj , g ) p(`|xi , g ) = p(`|xi , g ),

(3.13)

j=1,j,i yj =1
g
M
i =1 p(i |xj , ) = 1. (3.13), (3.12)
:

Q (, g ) =

M X
N
X

log(` p` (xi |` ))p(`|xi , g )

` =1 i =1

M X
N
X
` =1 i =1

log ` p(`|xi , ) +

M X
N
X

log p` (xi |` )p(`|xi , g ).

(3.14)

`=1 i =1

,
` ` , .
21

To find the expression for the \alpha_\ell we introduce a Lagrange multiplier \lambda for the constraint \sum_\ell \alpha_\ell = 1 and solve

    \frac{\partial}{\partial \alpha_\ell} [ \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log \alpha_\ell\, p(\ell|x_i, \Theta^g) + \lambda ( \sum_{\ell} \alpha_\ell - 1 ) ] = 0,    (3.15)

which gives

    \sum_{i=1}^{N} \frac{1}{\alpha_\ell}\, p(\ell|x_i, \Theta^g) + \lambda = 0.    (3.16)

Summing both sides over \ell yields \lambda = -N, so that

    \alpha_\ell = \frac{1}{N} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g).    (3.17)

For some distributions it is also possible to obtain analytical expressions for the \theta_\ell. For a d-dimensional Gaussian with mean \mu and covariance matrix \Sigma, i.e. \theta = (\mu, \Sigma),

    p_\ell(x|\mu_\ell, \Sigma_\ell) = \frac{1}{(2\pi)^{d/2} |\Sigma_\ell|^{1/2}} \exp( -\tfrac{1}{2} (x - \mu_\ell)^T \Sigma_\ell^{-1} (x - \mu_\ell) ).    (3.18)

To derive the update equations for this distribution we need a few results from matrix algebra. Let tr(A) denote the trace of a matrix A. We use the facts that tr(A + B) = tr(A) + tr(B), tr(AB) = tr(BA), and \sum_i x_i^T A x_i = tr(AB) with B = \sum_i x_i x_i^T; also |A^{-1}| = 1/|A|. We denote by \partial f(A)/\partial A the matrix of partial derivatives [\partial f(A)/\partial a_{i,j}], where a_{i,j} is the (i,j)-th entry of A. For the determinant of a symmetric matrix A,

    \frac{\partial |A|}{\partial a_{i,j}} = A_{i,j} if i = j, and 2 A_{i,j} if i \ne j,    (3.19)

where A_{i,j} is the (i,j)-th cofactor of A; moreover \partial (x^T A x)/\partial x = (A + A^T) x. From (3.19) it follows that

    \frac{\partial \log |A|}{\partial A} = [ A_{i,j}/|A| if i = j, 2 A_{i,j}/|A| if i \ne j ] = 2 A^{-1} - diag(A^{-1}).    (3.20)

Finally, for a symmetric matrix B,

    \frac{\partial\, tr(AB)}{\partial A} = B + B^T - diag(B).    (3.21)

3.3.
Substituting the Gaussian (3.18) into the second term of (3.14) and ignoring constant terms gives

    \sum_{\ell=1}^{M} \sum_{i=1}^{N} \log( p_\ell(x_i|\mu_\ell, \Sigma_\ell) )\, p(\ell|x_i, \Theta^g)
       = \sum_{\ell=1}^{M} \sum_{i=1}^{N} ( -\tfrac{1}{2} \log |\Sigma_\ell| - \tfrac{1}{2} (x_i - \mu_\ell)^T \Sigma_\ell^{-1} (x_i - \mu_\ell) )\, p(\ell|x_i, \Theta^g).    (3.22)

Taking the derivative of (3.22) with respect to \mu_\ell and setting it to zero,

    \sum_{i=1}^{N} \Sigma_\ell^{-1} (x_i - \mu_\ell)\, p(\ell|x_i, \Theta^g) = 0,    (3.23)

which can be solved for \mu_\ell:

    \mu_\ell = \frac{\sum_{i=1}^{N} x_i\, p(\ell|x_i, \Theta^g)}{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)}.    (3.24)

To find \Sigma_\ell, we rewrite (3.22) in terms of \Sigma_\ell^{-1} and the trace:

    \sum_{\ell=1}^{M} [ \tfrac{1}{2} \log |\Sigma_\ell^{-1}| \sum_{i=1}^{N} p(\ell|x_i, \Theta^g) - \tfrac{1}{2} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g)\, tr( \Sigma_\ell^{-1} (x_i - \mu_\ell)(x_i - \mu_\ell)^T ) ]
       = \sum_{\ell=1}^{M} [ \tfrac{1}{2} \log |\Sigma_\ell^{-1}| \sum_{i=1}^{N} p(\ell|x_i, \Theta^g) - \tfrac{1}{2} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g)\, tr( \Sigma_\ell^{-1} N_{\ell,i} ) ],    (3.25)

where N_{\ell,i} = (x_i - \mu_\ell)(x_i - \mu_\ell)^T. Taking the derivative with respect to \Sigma_\ell^{-1}, using (3.20) and (3.21), gives

    \tfrac{1}{2} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g)( 2\Sigma_\ell - diag(\Sigma_\ell) ) - \tfrac{1}{2} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g)( 2 N_{\ell,i} - diag(N_{\ell,i}) )
       = \tfrac{1}{2} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g)( 2 M_{\ell,i} - diag(M_{\ell,i}) ) = 2S - diag(S),    (3.26)

where M_{\ell,i} = \Sigma_\ell - N_{\ell,i} and S = \tfrac{1}{2} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g)\, M_{\ell,i}. Setting the derivative to zero, 2S - diag(S) = 0, implies S = 0, i.e.

    \sum_{i=1}^{N} p(\ell|x_i, \Theta^g)\, (\Sigma_\ell - N_{\ell,i}) = 0,    (3.27)

so that

    \Sigma_\ell = \frac{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)\, N_{\ell,i}}{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)} = \frac{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)\, (x_i - \mu_\ell)(x_i - \mu_\ell)^T}{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)}.    (3.28)
Summarizing, the update (re-estimation) equations of the EM algorithm for a mixture of Gaussians are

    \alpha_\ell^{new} = \frac{1}{N} \sum_{i=1}^{N} p(\ell|x_i, \Theta^g),    (3.29)

    \mu_\ell^{new} = \frac{\sum_{i=1}^{N} x_i\, p(\ell|x_i, \Theta^g)}{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)},    (3.30)

    \Sigma_\ell^{new} = \frac{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)\, (x_i - \mu_\ell^{new})(x_i - \mu_\ell^{new})^T}{\sum_{i=1}^{N} p(\ell|x_i, \Theta^g)}.    (3.31)

These equations are iterated: each new parameter set becomes the current guess \Theta^g for the next iteration, until the likelihood converges.
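The following sketch shows one way to iterate the updates (3.29)-(3.31); the responsibilities correspond to p(\ell|x_i, \Theta^g) in (3.10). Initialization, the fixed number of iterations, the small regularization term and the use of scipy's multivariate_normal are illustrative choices, not prescribed by the text.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, M, n_iter=100, seed=0):
    """EM for a mixture of M Gaussians, iterating (3.29)-(3.31)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alphas = np.full(M, 1.0 / M)
    mus = X[rng.choice(N, M, replace=False)]           # random data points as initial means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    for _ in range(n_iter):
        # E-step: responsibilities p(l | x_i, Theta^g), equation (3.10)
        dens = np.column_stack([multivariate_normal.pdf(X, mean=mus[l], cov=Sigmas[l])
                                for l in range(M)])
        resp = dens * alphas
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: updates (3.29)-(3.31)
        Nk = resp.sum(axis=0)
        alphas = Nk / N
        mus = (resp.T @ X) / Nk[:, None]
        for l in range(M):
            diff = X - mus[l]
            # 1e-6*I keeps the covariance well conditioned (an added choice)
            Sigmas[l] = (resp[:, l, None] * diff).T @ diff / Nk[l] + 1e-6 * np.eye(d)
    return alphas, mus, Sigmas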

3.4


Markov

A hidden Markov model (Hidden Markov Model, HMM) describes a sequence of observable variables together with a sequence of hidden (unobservable) state variables \{O_1, \dots, O_T, Q_1, \dots, Q_T\}, where O_t is the observation at time t and Q_t the corresponding hidden state. Two conditional-independence assumptions about these variables make the associated algorithms tractable:

1. the t-th hidden state, given the (t-1)-th hidden state, is independent of all earlier variables:

    P(Q_t | Q_{t-1}, O_{t-1}, \dots, Q_1, O_1) = P(Q_t | Q_{t-1});    (3.32)

2. the t-th observation, given the t-th hidden state, is independent of all other variables:

    P(O_t | Q_T, O_T, Q_{T-1}, O_{T-1}, \dots, Q_{t+1}, O_{t+1}, Q_t, Q_{t-1}, O_{t-1}, \dots, Q_1, O_1) = P(O_t | Q_t).    (3.33)
In this section we describe the EM algorithm for estimating the parameters of a hidden Markov model from observation data; this instance of EM is known as the Baum-Welch algorithm.

We assume the hidden state Q_t takes one of N discrete values \{1, \dots, N\} and that the Markov chain defined by P(Q_t | Q_{t-1}) is time-homogeneous (independent of t). The transition probabilities are collected in the matrix A = \{a_{i,j}\}, with a_{i,j} = p(Q_t = j | Q_{t-1} = i). The special case t = 1 is described by the initial distribution \pi_i = p(Q_1 = i). We say the model is in state j at time t if Q_t = j. A particular state sequence is written q = (q_1, \dots, q_T), with q_t \in \{1, \dots, N\} the state at time t. A particular observation sequence is written O = (O_1 = o_1, \dots, O_T = o_T). The probability of a particular observation at time t in state j is b_j(o_t) = p(O_t = o_t | Q_t = j); these emission probabilities are collected in B = \{b_j(\cdot)\}. The observations can be discrete or continuous. If there are L possible discrete observation values V = \{v_1, \dots, v_L\}, then for o_t = v_k we have b_j(o_t) = p(O_t = v_k | q_t = j). For continuous observations a common choice is a mixture of M Gaussians per state, b_j(o_t) = \sum_{\ell=1}^{M} c_{j\ell} N(o_t | \mu_{j\ell}, \Sigma_{j\ell}) = \sum_{\ell=1}^{M} c_{j\ell} b_{j\ell}(o_t). The complete set of HMM parameters is \lambda = (A, B, \pi).

There are three basic problems associated with HMMs:
1. Compute p(O|\lambda) for an observation sequence O = (o_1, \dots, o_T); this is solved by the forward (or backward) procedure.
2. Given O, find the most likely sequence of hidden states q = (q_1, \dots, q_T) that produced O; this is solved by the Viterbi algorithm.
3. Find the parameters \lambda^* = \arg\max_\lambda p(O|\lambda); this is solved by the Baum-Welch algorithm (the instance of EM for HMMs).
These algorithms are described in the following subsections.

3.4.1

The re-estimation formulas of the Baum-Welch algorithm can be derived formally through the EM Q-function, as in the previous section; that derivation is given in subsection 3.4.2. Here we first present the algorithm in terms of expected counts. We define the forward variable

    \alpha_i(t) = p(O_1 = o_1, \dots, O_t = o_t, Q_t = i \mid \lambda),    (3.34)

the probability of seeing the partial observation sequence o_1, \dots, o_t and being in state i at time t. The \alpha_i(t) can be computed recursively:

1.  \alpha_i(1) = \pi_i\, b_i(o_1),    (3.35)
2.  \alpha_j(t+1) = [ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} ]\, b_j(o_{t+1}),    (3.36)
3.  p(O|\lambda) = \sum_{i=1}^{N} \alpha_i(T).    (3.37)

Similarly we define the backward variable

    \beta_i(t) = p(O_{t+1} = o_{t+1}, \dots, O_T = o_T \mid Q_t = i, \lambda),    (3.38)

the probability of the ending partial observation sequence o_{t+1}, \dots, o_T given that the model is in state i at time t. The \beta_i(t) can also be computed recursively:

1.  \beta_i(T) = 1,    (3.39)
2.  \beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1),    (3.40)
3.  p(O|\lambda) = \sum_{i=1}^{N} \beta_i(1)\, \pi_i\, b_i(o_1).    (3.41)

We next define

    \gamma_i(t) = p(Q_t = i \mid O, \lambda),    (3.42)

the probability of being in state i at time t given the whole observation sequence O and the model. It can be expressed as

    \gamma_i(t) = p(Q_t = i \mid O, \lambda) = \frac{p(O, Q_t = i \mid \lambda)}{p(O \mid \lambda)} = \frac{p(O, Q_t = i \mid \lambda)}{\sum_{j=1}^{N} p(O, Q_t = j \mid \lambda)}.    (3.43)

Because of the Markov conditional-independence assumptions,

    \alpha_i(t)\, \beta_i(t) = p(o_1, \dots, o_t, Q_t = i \mid \lambda)\, p(o_{t+1}, \dots, o_T \mid Q_t = i, \lambda) = p(O, Q_t = i \mid \lambda),    (3.44)

so \gamma_i(t) can be written in terms of the forward and backward variables as

    \gamma_i(t) = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}.    (3.45)

We also define

    \xi_{ij}(t) = p(Q_t = i, Q_{t+1} = j \mid O, \lambda),    (3.46)

the probability of being in state i at time t and in state j at time t+1, given the observations and the model. It can be expressed in terms of the forward and backward variables as

    \xi_{ij}(t) = \frac{p(Q_t = i, Q_{t+1} = j, O \mid \lambda)}{p(O \mid \lambda)} = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)},    (3.47)

or, equivalently,

    \xi_{ij}(t) = \frac{p(Q_t = i \mid O, \lambda)\, p(o_{t+1} \dots o_T, Q_{t+1} = j \mid Q_t = i, \lambda)}{p(o_{t+1} \dots o_T \mid Q_t = i, \lambda)} = \frac{\gamma_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\beta_i(t)}.    (3.48)

The above quantities have natural interpretations as expected counts. Summing \gamma_i(t) over time,

    \sum_{t=1}^{T} \gamma_i(t)    (3.49)

is the expected number of times state i is visited, or equivalently (summing only to T-1) the expected number of transitions made out of state i, given the observation sequence O. Similarly,

    \sum_{t=1}^{T-1} \xi_{ij}(t)    (3.50)

is the expected number of transitions from state i to state j given O. These interpretations follow from

    \sum_t \gamma_i(t) = E\{ \sum_t I_t(i) \},    (3.51)
    \sum_t \xi_{ij}(t) = E\{ \sum_t I_t(i, j) \},    (3.52)

where I_t(i) is an indicator variable equal to 1 when the model is in state i at time t, and I_t(i, j) an indicator equal to 1 when the model moves from state i at time t to state j at time t+1.

The EM (Baum-Welch) re-estimation formulas for an HMM then follow by interpreting ratios of these expected counts as probabilities. For the initial distribution,

    \bar{\pi}_i = \gamma_i(1),    (3.53)

the expected relative frequency of state i at time 1. For the transition matrix,

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)},    (3.54)

the expected number of transitions from state i to state j divided by the expected number of transitions out of state i. For discrete observation distributions,

    \bar{b}_i(k) = \frac{\sum_{t=1}^{T} \delta_{o_t, v_k}\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)},    (3.55)

the expected number of times the model is in state i observing symbol v_k, divided by the expected number of times it is in state i.
Gauss, ` i
ot
i` (t ) = i (t )

ci` bi` (ot )


bi (ot )

= p(Qt = i, Xit = `|O, )

(3.56)
27

3. - Viterbi
Xit t
i.
:

PT

t =1

i` (t )

t =1

i (t )

ci` = PT

PT

t =1

i` = PT

PT
i` =

i` (t )ot
i` (t )

t =1

t =1

i` (t )(ot i` )(ot i` )T

PT

i` (t )

t =1

(3.57)
(3.58)
(3.59)

E e Te ,
:

PE

e =1

i =

PE

PTe

ci` = P
E

PTe

e =1

i` =

ie (1)
t =1

i`e (t )

e
e =1
t =1 i ( t )
PE PTe e
e
e =1
t =1 i` (t )ot
i` = P
PTe e
E
e =1
t =1 i` (t )
PE PTe e
e
e
T
e =1
t =1 i` (t )(ot ii` )(ot ii` )

PE

PTe

PE

PTe

e =1

t =1

i`e (t )

e
t =1 ij (t )
P Te e
e =1
t =1 i ( t )

e =1

aij = P
E

(3.60)
(3.61)
(3.62)
(3.63)

(3.64)

EM (Baum-Welch)
.
EM .
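To make the forward-backward quantities (3.35)-(3.48) and the discrete-observation updates (3.53)-(3.55) concrete, here is a small sketch of one Baum-Welch iteration for a single observation sequence. Scaling of the forward and backward variables (needed for long sequences) is omitted for brevity, and all variable names are illustrative.

import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM (Baum-Welch) iteration for a discrete-observation HMM.

    obs : sequence of observation indices o_1..o_T
    A   : N x N transition matrix a_ij
    B   : N x L emission matrix  b_i(k)
    pi  : initial distribution
    """
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                         # (3.35)
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]       # (3.36)
    beta[T-1] = 1.0                                      # (3.39)
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])       # (3.40)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)            # (3.45)
    xi = np.zeros((T-1, N, N))
    for t in range(T-1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :]
        xi[t] /= xi[t].sum()                             # (3.47)
    # re-estimation, equations (3.53)-(3.55)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.asarray(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi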

3.4.2

O = (o1 , . . . , oT )
q = (q1, . . . , qT ) . P (O, )
P (O, q|). Q :
Q (, 0 ) =

log P (O, q|)P (O, q|0 )

(3.65)

qQ

0 Q
T .
q, P (O, q|0 )

P (O, q|) = q0
28

T
Y
t =1

aqt 1 qt bqt (ot )

(3.66)

3.4. Markov
t = 0
t = 1 . , ,
,
.
Q :

Q (, 0 ) =

log q0 P (O, q|0 ) +

T
XX

qQ

T
XX

qQ t =1


log aqt 1 qt p(O, q|0 )

(3.67)

log bqt (ot ) P (O, q|0 ).

qQ t +1


.
(3.67)

log q0 P (O, q|0 ) =

qQ

N
X
i =1

log i p(O, q0 = i |0 )

(3.68)

q Q
q0 ,
P
t = 0. Lagrange, i i = 1,
, :

N
X
i

i =1

log i p(O, q0 = i | ) +

N
X



i 1

=0

(3.69)

i =1

, i ,
i :
P (O, q0 = i |0 )
i =
(3.70)
P ( O | 0 )
(3.67) :
T
XX
qQ t =1

N X
N X
T

X
log aqt 1 qt p(O, q|0 ) =
log aij P (O, qt 1 = i, qt = j|0 )

(3.71)

i =1 j =1 t =1

, t
i j .
t 1 t. ,
Lagrange
PN
j=1 aij = 1 :

PT
aij =

t =1

P (O, qt 1 = i, qt = j|0 )

PT

t =1

P (O, qt 1 = i |0 )

(3.72)

(3.67) :
T
XX

N X
T

X
log bqt (ot ) P (O, q|0 ) =
log bi (ot )p(O, qt = i |0 )

qQ t +1

i =1 t =1

(3.73)
29

3. - Viterbi
, t,
.
t.
, , P
Lagrange Lj=1 bi (j) = 1.
vk k, :

PT

bi (k ) =

t =1 P (O, qt =
PT
t =1 P (O, qt

i |0 )ot ,vk

= i | 0 )

(3.74)

Gauss, Q , ,

,
. , Q :
Q (, 0 ) =

X X

log P (O, q, m |)P (O, q, m |0 )

(3.75)

qQ m (M )

m m = {mq1 1 , mq2 2 , . . . , mqT T }


.
(3.67), ,
m .
(3.67) :
T
X X X


log bqt (ot , mqt t ) P (O, q, m |0 ) =

qQ m M t =1
N X
M X
T
X

log(ci` bi` (ot ))p(O, qt = i, mqt t = `|0 )

(3.76)

i =1 `=1 t =1

(3.14),
. (3.3), :

PT

t =1 P (qt = i, mqt t = l |O, )


PM
0
t =1
`=1 P (qt = i, mqt t = `|O, )
PT
0
t =1 ot P (qt = i, mqt t = l |O, )

cil = PT

il = PT

t =1

PT
il =

t =1 (ot

P (qt = i, mqt t = l |O, 0 )

il )(ot il )T P (qt = i, mqt t = l |O, 0 )


PT
0
t =1 P (qt = i, mqt t = l |O, )

(3.77)
(3.78)

(3.79)

(3.4.1).
Markov
(Rabiner and Juang (1993)).
30

3.5.

3.5


There are several possible ways to define the "optimal" state sequence associated with a given observation sequence; the difficulty lies in the definition of optimality. One criterion is to choose the states q_t that are individually most likely at each time t. To implement it, define

    \gamma_t(i) = P(q_t = i \mid O, \lambda),    (3.80)

the probability of being in state i at time t given the observation sequence O and the model \lambda. As before, \gamma_t(i) can be expressed in terms of the forward and backward variables:

    \gamma_t(i) = P(q_t = i \mid O, \lambda) = \frac{P(O, q_t = i \mid \lambda)}{P(O \mid \lambda)} = \frac{P(O, q_t = i \mid \lambda)}{\sum_{j=1}^{N} P(O, q_t = j \mid \lambda)},    (3.81)

and since P(O, q_t = i \mid \lambda) = \alpha_t(i)\, \beta_t(i),

    \gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)},    (3.82)

where \alpha_t(i) accounts for the partial observation sequence up to time t and \beta_t(i) for the remainder from t+1 to T, given q_t = i at time t. Using \gamma_t(i), the individually most likely state at time t is

    q_t^* = \arg\max_{1 \le i \le N} \gamma_t(i), \qquad 1 \le t \le T.    (3.83)

Although (3.83) maximizes the expected number of individually correct states, the resulting sequence can be problematic: when the HMM has transitions of zero probability (a_{ij} = 0 for some i, j), the "optimal" sequence may not even be a valid state sequence. This happens because (3.83) determines the most likely state at each instant without regard to the probability of whole sequences of states. One possible remedy is to optimize over pairs of states (q_t, q_{t+1}), triples (q_t, q_{t+1}, q_{t+2}), and so on. The most widely used criterion, however, is to find the single best state sequence (path), i.e. to maximize P(q \mid O, \lambda), which is equivalent to maximizing P(q, O \mid \lambda). The dynamic-programming algorithm that does this is the Viterbi algorithm.
31

3. - Viterbi

3.5.1

To find the single best state sequence q = (q_1 q_2 \dots q_T) for a given observation sequence O = (o_1 o_2 \dots o_T), we define the quantity

    \delta_t(i) = \max_{q_1, q_2, \dots, q_{t-1}} P(q_1 q_2 \dots q_{t-1}, q_t = i, o_1 o_2 \dots o_t \mid \lambda),    (3.84)

the highest probability along a single path, at time t, that accounts for the first t observations and ends in state i. By induction,

    \delta_{t+1}(j) = [ \max_i \delta_t(i)\, a_{ij} ]\, b_j(o_{t+1}).    (3.85)

To retrieve the state sequence itself, we also keep, for each t and j, the argument that maximizes (3.85), in an array \psi_t(j). The complete procedure is:

1. Initialization:
    \delta_1(i) = \pi_i\, b_i(o_1), \qquad 1 \le i \le N,    (3.86)
    \psi_1(i) = 0.    (3.87)

2. Recursion:
    \delta_t(j) = \max_{1 \le i \le N} [ \delta_{t-1}(i)\, a_{ij} ]\, b_j(o_t), \qquad 2 \le t \le T,\ 1 \le j \le N,    (3.88)
    \psi_t(j) = \arg\max_{1 \le i \le N} [ \delta_{t-1}(i)\, a_{ij} ], \qquad 2 \le t \le T,\ 1 \le j \le N.    (3.89)

3. Termination:
    P^* = \max_{1 \le i \le N} \delta_T(i),    (3.90)
    q_T^* = \arg\max_{1 \le i \le N} \delta_T(i).    (3.91)

4. Path (state-sequence) backtracking:
    q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T-1, T-2, \dots, 1.    (3.92)

The Viterbi algorithm is similar (apart from the backtracking step) to the forward computation (3.35)-(3.37); the main difference is that the summation over previous states in (3.36) is replaced by a maximization in (3.88). This is the Viterbi algorithm.
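A direct transcription of steps (3.86)-(3.92) into code; log-probabilities would normally be used to avoid numerical underflow, but the plain form is kept here to mirror the equations, and all names are illustrative.

import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely state sequence of a discrete-observation HMM, equations (3.86)-(3.92)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # (3.86), with psi_1(i) = 0  (3.87)
    for t in range(1, T):
        cand = delta[t-1][:, None] * A                # delta_{t-1}(i) * a_ij
        psi[t] = cand.argmax(axis=0)                  # (3.89)
        delta[t] = cand.max(axis=0) * B[:, obs[t]]    # (3.88)
    path = np.zeros(T, dtype=int)
    path[T-1] = int(delta[T-1].argmax())              # (3.91); P* = delta[T-1].max()  (3.90)
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1][path[t+1]]                 # backtracking (3.92)
    return path, float(delta[T-1].max())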

32


Markov

4.1

In 2006, Toussaint and Storkey published the paper "Probabilistic Inference for Solving Markov Decision Processes" (Toussaint and Storkey (2006)), in which the computation of an optimal policy for a Markov decision process (Markov Decision Process, MDP) is formulated as a problem of probabilistic inference. The key idea is to express the time-unlimited MDP as a probabilistic mixture of finite-time MDPs, each of which emits a single reward at its final time step. Policy optimization then becomes maximum-likelihood estimation of the policy parameters in this mixture model, which can be carried out with an EM algorithm, described in the following sections.
4.2

Markov (MDPs)

Figure (4.1) shows the dynamic Bayesian network (Dynamic Bayesian Network, DBN) of a time-unlimited MDP, defined by the transition probabilities P(x_{t+1}|a_t, x_t), the policy P(a_t|x_t; \pi) and the reward probabilities P(r_t|a_t, x_t). Here x denotes the (discrete) states and a the actions, and the rewards are treated as binary random variables, r_t \in \{0, 1\}; the probabilities P(r_t|a_t, x_t) take values in [0, 1] and play the role of a bounded, normalized reward function.

Figure 4.1: Dynamic Bayesian network (DBN) of the time-unlimited Markov decision process (MDP), where x denotes the states, a the actions and r the rewards.
The policy is parameterized directly as P(a_t = a \mid x_t = i; \pi) = \pi_{ai}, with \pi_{ai} \in [0, 1] and \sum_a \pi_{ai} = 1 for every state i. The goal in the MDP of Figure (4.1) is, as usual, to find a policy \pi that maximizes the expected discounted future return

    V^\pi(i) = E\{ \sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = i; \pi \},

where \gamma \in [0, 1) is the discount factor. The optimal value function of an MDP satisfies the Bellman equation and can be computed with the classical methods presented in Chapter 2, Section 2.4.

4.3

MDPs

The central idea is to express the maximization of the expected future return in an MDP as a likelihood-maximization problem in a related probabilistic model, so that inference and learning techniques can be transferred to the MDP setting. We restrict ourselves to discrete (finite-state) MDPs. Consider first a finite-time MDP of fixed total time T that emits a single reward r at its final time step. Its joint distribution is

    P(r, x_{0:T}, a_{0:T} \mid T; \pi) = P(r \mid a_T, x_T)\, P(a_0 \mid x_0; \pi)\, P(x_0) \prod_{t=1}^{T} P(a_t \mid x_t; \pi)\, P(x_t \mid a_{t-1}, x_{t-1}).    (4.1)

The full model is then defined as the mixture of such finite-time MDPs obtained by treating the total time T as a random variable:

    P(r, x_{0:T}, a_{0:T}, T; \pi) = P(r, x_{0:T}, a_{0:T} \mid T; \pi)\, P(T).    (4.2)

Figure 4.2: The mixture of finite-time MDPs: for T = 0, 1, 2, ..., each component is a finite-time MDP over states x and actions a that emits a single reward r at its final time step.

Here P(T) is a prior over the total time, which we choose to be the geometric distribution P(T) = \gamma^T (1-\gamma), with the discount factor \gamma playing the role of its parameter. Each finite-time MDP in the mixture emits a single binary reward r at its final time step. The expected reward of the T-th mixture component for start state i is

    L^T(i) = P(r = 1 \mid x_0 = i, T; \pi) = E\{ r \mid x_0 = i, T; \pi \},    (4.3)

and the likelihood in the full mixture model is

    L(i) = P(r = 1 \mid x_0 = i; \pi) = \sum_{T} P(T)\, E\{ r \mid x_0 = i, T; \pi \}.    (4.4)

With the geometric prior P(T) = \gamma^T (1-\gamma), this likelihood is, up to a constant factor, exactly the expected discounted return of the original MDP:

    L(i) = (1-\gamma)\, V^\pi(i).    (4.5)

This holds because E\{ r \mid x_0 = i, T; \pi \} in the T-th finite-time MDP equals E\{ r_t \mid x_0 = i; \pi \} at t = T in the original, time-unlimited MDP: the T-th component of the mixture reproduces the reward that the original MDP emits at time T. The mixture of finite-time MDPs therefore captures, component by component, the reward structure of the original MDP.

Remark 4.1. Maximizing the likelihood (4.4) in the mixture of finite-time MDPs (4.2) is equivalent to maximizing the expected discounted future return in the original MDP.

This observation allows inference and learning techniques to be transferred to the MDP setting: the E-step of the EM algorithm below computes exactly the posteriors needed for this purpose.
4.4

EM


The EM algorithm treats the policy parameters \pi as the unknowns and the state and action sequences, together with the total time T, as latent variables. For simplicity we condition on a fixed start state, x_0 = A. The E-step computes, for the current policy, the posteriors over the latent variables given the observed events x_0 = A and r = 1. The M-step then updates the policy parameters so as to maximize the expected complete-data log-likelihood. As shown below, the E-step amounts to standard forward and backward propagations in the underlying MDP.

4.4.1 E-step: MDP.


To simplify notation, define p(j|a, i) := P(x_{t+1} = j \mid a_t = a, x_t = i) and, for a given policy, p(j|i; \pi) := P(x_{t+1} = j \mid x_t = i; \pi) = \sum_a p(j|a, i)\, \pi_{ai}. Similarly, define the expected final reward in state i,

    \rho(i) := P(r = 1 \mid x_T = i; \pi) = \sum_a P(r = 1 \mid a_T = a, x_T = i)\, \pi_{ai}.    (4.6)

In the E-step, for the current policy, we compute forward and backward quantities in each finite-time MDP. For the finite-time MDP of total time T,

    \alpha_0(i) = \delta_{i=A}, \qquad \alpha_t(i) = P(x_t = i \mid x_0 = A; \pi) = \sum_j p(i|j; \pi)\, \alpha_{t-1}(j),    (4.7)

    \beta_T(i) = \rho(i), \qquad \beta_t(i) = P(r = 1 \mid x_t = i; \pi) = \sum_j p(j|i; \pi)\, \beta_{t+1}(j).    (4.8)

As defined above, the backward quantities depend on the total time T, so they would have to be recomputed for every mixture component. This can be avoided by indexing them with the time-to-go rather than the absolute time. Defining \beta_\tau(i) with \tau counting backwards from the final time step,

    \beta_0(i) = \rho(i), \qquad \beta_\tau(i) = P(r = 1 \mid x_{T-\tau} = i; \pi) = \sum_j p(j|i; \pi)\, \beta_{\tau-1}(j),    (4.9)

which no longer depends on T: in the finite-time MDP of total time T, the backward quantity at time t is simply \beta_\tau with \tau = T - t.

The forward quantities \alpha_t and the backward quantities \beta_\tau can therefore be computed once and reused across all finite-time MDPs of the mixture. The E-step of this model is thus completely analogous to the forward-backward procedure in hidden Markov models (Hidden Markov Models, HMM), with the start state (x_0 = A) and the reward (r = 1) playing the role of the observations; the prior over the total time T couples the components of the mixture of finite-time MDPs (Figure 4.2) into the single time-unlimited MDP (Figure 4.1).
,
, T .

t (i ) = P (xt = i |x0 = A, r = 1, T = t + ; )

Z (t, )

(i )t (i ),

(4.10)
37

4. Markov

Z (t, ) =

t (i ) (i ) =

P (x0 = A, r = 1|T = t + ; )
P (x0 = A )

4.4.2

= Z (t + ).

(4.11)

Z t + , Z (t + ) =
Z (t, ). , Z (t + ) P (x0 = A, r = 1|T =
t + ; ), , , A
, MDP T .
Bayes T
( ),
P ( T |x0 , r ; ) =

P ( x0 , r |T ; )
P (x0 , r ; )

P (T ) =


P (r |x0 ; ) =

Z (T )P (T )
P (r |x0 ; )

(4.12)

(4.13)

P (T )Z (T ).

4.4.3


. ,
qt (a, i ) = P (r = 1|at = a, xt = i, T = t + ; )

j p(j|i, a ) 1 (j)
=

P (r = 1|aT = a, xT = i )

> 1
= 0

(4.14)

A t xt
t Markov.
q (a, i ) = qt (a, i ).
prior

P (r = 1|at = a, xt = i ; ) =

1X

C =0

P (T = t + )p (a, i ),

(4.15)

C = 0 P (T = t + 0 ), prior t P (t + ) = t P ( ), .
, Bayes,

ai X
P (T = t + )p (a, i ),
(4.16)
P (at = a |xt = i, r = 1; ) = 0
C =0

C0 = P (r = 1|xt = i ; ) 0 P (T = t + 0 )
, t. ,
i

P (i x0:T |x0 , r = 1; ) =

38

P (T )Z (T )

P (r = 1|x0 ; )

T
Y
t =0


[1 t,T t (i )] .

(4.17)

4.5. Policy Iteration

4.4.4 M-step: updating the policy parameters.

The standard M-step of EM maximizes the expected complete-data log-likelihood

    Q(\pi^*, \pi) = \sum_{T} \sum_{x_{0:T}, a_{0:T}} P(x_{0:T}, a_{0:T}, T \mid r = 1; \pi)\, \log P(r = 1, x_{0:T}, a_{0:T}, T; \pi^*)

with respect to the new parameters \pi^*, where (T, x_{0:T}, a_{0:T}) are the latent variables and \pi is the current policy. Exactly as for HMMs, this maximization yields the posterior (4.16) as the new policy,

    \pi^*_{ai} = P(a_t = a \mid x_t = i, r = 1; \pi).    (4.18)

Alternatively, note that in the time-unlimited MDP the likelihood can be written as

    P(r = 1 \mid x_0 = i; \pi) = \sum_{a, j} P(r = 1 \mid a_t = a, x_t = j; \pi)\, \pi_{aj}\, P(x_t = j \mid x_0 = i; \pi),    (4.19)

in which the policy parameters \pi_{aj} appear weighted by P(x_t = j \mid x_0 = i; \pi) and by P(r = 1 \mid a_t = a, x_t = j; \pi), the quantity (4.15) already computed in the E-step. With respect to the action probabilities at a single state, the likelihood is therefore maximized by putting all probability mass on the best action:

    \pi^*_{ai} = \delta_{a = a^*(i)}, \qquad a^*(i) = \arg\max_a P(r = 1 \mid a_t = a, x_t = i; \pi),    (4.20)

where P(r = 1 \mid a_t = a, x_t = i; \pi) is given by (4.15). Compared to the standard update based on (4.15) and (4.16), this greedy M-step corresponds to a greedy policy-improvement step in the original MDP.
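A compact sketch of one iteration of this EM procedure for a tabular MDP, using the time-to-go backward propagation (4.9), the geometric prior P(T) = \gamma^T(1-\gamma) truncated at a finite horizon, and the greedy M-step (4.20). Only the quantities needed for the greedy update are computed; the truncation horizon and the array layout are illustrative assumptions.

import numpy as np

def em_mdp_iteration(P, R1, pi, gamma=0.95, horizon=200):
    """One EM iteration for the MDP mixture of Toussaint & Storkey (2006), sketched.

    P[a, i, j] : transition probabilities P(x_{t+1}=j | a_t=a, x_t=i)
    R1[a, i]   : reward probabilities    P(r=1 | a_t=a, x_t=i)
    pi[i, a]   : current policy, pi_{ai}
    """
    n_actions, n_states, _ = P.shape
    # policy-averaged model, p(j|i; pi) and rho(i) as in (4.6)
    P_pi = np.einsum('aij,ia->ij', P, pi)
    rho = np.einsum('ai,ia->i', R1, pi)
    # backward propagation indexed by time-to-go, equation (4.9)
    betas = np.zeros((horizon + 1, n_states))
    betas[0] = rho
    for tau in range(1, horizon + 1):
        betas[tau] = P_pi @ betas[tau - 1]
    # P(r=1 | a_t=a, x_t=i; pi): the q_tau(a,i) of (4.14) weighted by the geometric prior, cf. (4.15)
    weights = (1 - gamma) * gamma ** np.arange(horizon + 1)
    q = weights[0] * R1                                   # tau = 0 term
    for tau in range(1, horizon + 1):
        q = q + weights[tau] * np.einsum('aij,j->ai', P, betas[tau - 1])
    # greedy M-step, equation (4.20)
    new_pi = np.zeros_like(pi)
    new_pi[np.arange(n_states), q.argmax(axis=0)] = 1.0
    return new_pi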

4.5

Policy Iteration

There is a close relationship between the quantities computed in the E-step and the classical value functions of the MDP. By (4.3), the backward quantities \beta_\tau(i) of (4.9) are exactly the expected reward of the finite-time MDP of length T = \tau starting from state i, i.e. the value of state i in that component. Consequently, for the geometric prior, V^\pi(i) = \frac{1}{1-\gamma} \sum_T P(T)\, \beta_T(i). Similarly, the quantities q_\tau(a, i) of (4.14) are related to the Q-function: Q^\pi(a, i) = \frac{1}{1-\gamma} \sum_T P(T)\, q_T(a, i) for the geometric prior. The quantity (4.15), and hence the posterior (4.16) over actions, is therefore proportional to \pi_{ai}\, Q^\pi(a, i). In other words, the E-step effectively performs a policy evaluation, computing quantities equivalent to the value and Q-functions of the current policy, and the M-step (in particular the greedy M-step (4.20)) corresponds to the policy-improvement step of Policy Iteration (Sutton and Barto (1998)), which selects in each state i an action maximizing Q^\pi(a, i). Thus, for discrete MDPs, the EM algorithm with the greedy M-step behaves essentially like (a variant of) Policy Iteration; the interest of the probabilistic formulation lies in the fact that general inference techniques can be brought to bear on the planning problem.

4.6

Viterbi-EM

In this section we describe the Viterbi-EM algorithm, the variant proposed in this thesis. The idea is to replace the full expectation of the E-step by the single most probable trajectory, computed with a Viterbi-type procedure (Section 3.5.1) as the arg max over trajectories of P(\xi; \pi). Instead of maximizing the likelihood L(i) of (4.4) through expectations over all state sequences, Viterbi-EM concentrates on the most probable rewarded trajectory and improves the policy along it, which makes each iteration cheaper at the price of ignoring all other trajectories. Like EM, Viterbi-EM alternates between two steps: an E-step and an M-step.

4.6.1 E-step: finding the most probable rewarded trajectory in the MDP.

In the E-step, instead of computing full posteriors, we search for the most probable trajectory of the model,

    \xi^{max} = \arg\max_{\xi} P(\xi; \pi),    (4.21)

that is,

    \xi^{max} = \arg\max_{T,\, x_{0:T}} [ P(x_0) \prod_{t=1}^{T} P(x_t \mid x_{t-1}; \pi) ]\, P(R \mid x_T),    (4.22)

where the maximization in (4.22) is over both the state sequence and its length. A trajectory consists of T+1 states x_0, \dots, x_T, and its probability is the product of the start-state probability P(x_0), the reward probability P(R \mid x_T) at the final state, and the policy-averaged transition probabilities P(x_t \mid x_{t-1}; \pi) for t = 1, \dots, T. The maximization can be carried out with a Viterbi-like dynamic-programming recursion, as described below.

4.6. Viterbi-EM
, (4.22),
Viterbi, . ,
, t, , ,
t 1, .
. ,
, .

T . ,

. ,
P (R |xT ), .
P (R |xT ), ,
, ,
P (xt |xt 1 ; )
. , t, i, j k t 1,
k, i j.
,
. k
t ,
,
.
P (R |xT ), 1
0 ,
.

.

.
max .
.
The E-step also maintains a value function: V(x) for the states x on the most probable trajectory and V(x') for the remaining states x'. Two kinds of updates are used:

1. The values V(x) of the states on the trajectory \xi^{max} are backed up along the trajectory, from x_T back towards x_0:

    V(x) = R(x; \pi) + \gamma \sum_{x' \in X} P(x, x'; \pi)\, V(x'), \qquad x \in \xi^{max},    (4.23)

where P(x, x'; \pi) is the policy-averaged transition probability from x to x' and R(x; \pi) the corresponding expected reward.

2. The same backup is then applied to all states, as in Value Iteration:

    V(x) = R(x; \pi) + \gamma \sum_{x'} P(x, x'; \pi)\, V(x'), \qquad x \in X,    (4.24)

so that the value estimates of states that do not lie on \xi^{max} are also refreshed. In step 1 the values along \xi^{max} are updated first, using the most recent information obtained in the E-step; in step 2 the values V(x') computed in step 1 propagate to the rest of the state space, and states off the most probable trajectory are updated as well. At each E-step, once \xi^{max} has been found, V(x) and V(x') are updated according to (4.23) and (4.24). Since the sums in (4.23) range over the whole state set X and (4.24) sweeps over all states, the overall scheme is an asynchronous sequence of value backups, whose convergence follows from standard results on asynchronous iterative algorithms (Bertsekas and Tsitsiklis (1989)).
4.6.2 M-step: updating the policy parameters.

The M-step of Viterbi-EM is analogous to that of EM, except that it uses the value function V maintained in the E-step. The policy parameters are updated, for every state x, as

    \pi^{new}_{ax} \propto \pi^{old}_{ax} [ P(R \mid a, x) + \gamma \sum_{x'} V(x')\, P(x' \mid a, x) ].    (4.25)

As with EM, a deterministic policy for the MDP is obtained by replacing (4.25) with the greedy update, for every state x:

    \pi^{new}_{ax} = \delta(a, a^*(x)), \qquad a^*(x) = \arg\max_a [ P(R \mid a, x) + \gamma \sum_{x'} V(x')\, P(x' \mid a, x) ].    (4.26)

4.7

The formulation of planning in Markov decision processes as probabilistic inference has attracted considerable attention, since it allows, among other things, inference techniques for Bayesian networks to be applied to planning and control problems. Related recent work includes (Vlassis and Toussaint (2009), Barber and Furmston (2009)).


In this chapter the Viterbi-EM algorithm is evaluated experimentally on discrete MDPs and compared with Policy and Value Iteration and with the EM algorithm of the previous chapter; all algorithms are run on the same MDPs.

5.1

In the environments used in the experiments the agent moves between discrete states by choosing actions, and with probability 20% an action has a different outcome than the intended one, so the transitions are stochastic. All transitions yield reward r = 0 except for reaching the goal state, which yields reward r = 1. The algorithms compared are Policy and Value Iteration, described in Chapter 2, and the EM and Viterbi-EM algorithms of Chapter 4. The discount factor is set to \gamma = 0.99 in all experiments.

(a) Environment A    (b) Environment B
Figure 5.1: The two environments, A and B, used in the experiments.

The plots (Figures 5.4, 5.6, 5.9) show the value P(R; \pi) of the policy over the iterations of each algorithm, computed using the known transition model P(x'|a, x). The curves of EM and Viterbi-EM are shown together with those of Policy and Value Iteration for comparison.

5.2

(5.1 ) ,
. ,
,
,
.
EM (5.2 , 5.2 )
(5.3 ), 0.22. Viterbi-EM
(5.3 ),
, 0.11.
(5.4) .
EM
. Viterbi-EM , ,
.
, Viterbi-EM .
44

(a) EM 1 - P(R; \pi) = 0.0576    (b) EM 2 - P(R; \pi) = 0.1758
Figure 5.2: Policies produced in the first iterations of the EM algorithm in environment A.

(a) EM 3 - P(R; \pi) = 0.2178    (b) Viterbi-EM - P(R; \pi) = 0.1118
Figure 5.3: (a) The policy to which EM converges in environment A and (b) the policy found by Viterbi-EM.

Figure 5.4: Evolution of P(R; \pi) in environment A for the algorithms compared.

5.3

B (5.1 ) , ,
. ,
.
, EM (5.5 ), Viterbi-EM (5.5 ),
. Viterbi-EM
, 0.31,
0.36 EM.
(5.6) Viterbi-EM
.

5.4

(5.7) Viterbi-EM .
. , , ,
(10% ). ,
46

(a) EM - P(R; \pi) = 0.36    (b) Viterbi-EM - P(R; \pi) = 0.31
Figure 5.5: The policies to which EM and Viterbi-EM converge in environment B.

Figure 5.6: Evolution of P(R; \pi) in environment B for the algorithms compared.

Figure 5.7: The environment used in the experiment of Section 5.4.

(a) EM - P(R; \pi) = 0.69    (b) Viterbi-EM - P(R; \pi) = 0.28
Figure 5.8: The policies to which EM and Viterbi-EM converge in this environment.

, ,
.
, , Viterbi-EM
, (5.8 ). ,
EM (5.8 ), , 0.69, 0.28 Viterbi-EM. , A (5.1 ), ,
,
.
, .
,
Viterbi-EM .
(5.9)

5.5

, , Viterbi-EM
. , .

Viterbi Viterbi-EM
48

Figure 5.9: Evolution of P(R; \pi) for the algorithms compared in the experiment of Section 5.4.

. EM
Toussaint and Storkey (2006) Toussaint (Storkey
and Harmeling). ,
. Policy Iteration,
Value Iteration
.

49

,
Viterbi-EM MDP
, Viterbi. ,

Toussaint and Storkey (2006).

Markov,
Viterbi-EM.
,
Viterbi-EM
.
,
.
:
1. Viterbi-EM.
2. .
3. ViterbiEM On-line .
4. Viterbi-EM POMDP,
model-free .

51


Baird, L., Moore, A. (1998). Gradient Descent for General Reinforcement Learning. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural
Information Processing Systems 11,MIT Press, Cambridge, MA, 1999.
Barber, D. and Furmston, T. (2009). Solving deterministic policy (PO)MDPs
using Expectation-Maximisation and Antifreeze. In European Conference on
Machine Learning (LEMIR workshop), pp. 50-64, 2009.
Barto, A. G. and Mahadevan, S. (2003). Recent Advances in Hierarchical Reinforcement Learning. In Discrete Event Dynamic Systems 13, 4 (2003), 341379.
Bellman, R. E. (1957a) Dynamic Programming. Princeton University Press,
Princeton, NJ.
Bellman, R. E. (1957b) A Markov decision process. In Journal of Mathematical
Mech., 6:679-684.
Bertsekas, D. P. and Tsitsiklis, J. N. (1989) Convergence rate and termination
of asynchronous iterative algorithms. In Proc. of the 3rd international conference on Supercomputing (Crete, Greece, June 05 - 09, 1989). ICS 89. ACM,
New York, NY, 461-470.
Cassandra, A., Littman, M. L. and Zhang, N. L. (1997). Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision
Processes. In Proc. of the Thirteenth Conference on Uncertainty in Artificial
Intelligence, Morgan Kaufmann Publishers, 54-61 (1997).
Claus, C. and Boutilier, C. (1998). The Dynamics of Reinforcement Learning
in Cooperative Multiagent Systems. In Proc. of the Fifteenth
National Conference on Artificial Intelligence, AAAI Press, 746-752 (1998).
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum Likelihood
from Incomplete Data via the EM Algorithm. In Journal of the Royal Statistical
Society. Series B (Methodological), Vol. 39, No. 1., pp. 1-38, 1977.
53


Fu, K. S. (1970). Learning control systems: Review and outlook. In IEEE Transactions on Automatic Control, pages 210-221.
Hansen, E. A. (1998). Solving POMDPs by Searching in Policy Space. In Proc.
of the 14th International Conference on Uncertainty in Artificial Intelligence
(1998).
Howard, R. (1960). Dynamic Programming and Markov Processes. MIT Press,
Cambridge, MA.
Hsu, D., Lee, W. S. and Rong, N. (2007). What Makes Some POMDP Problems
Easy to Approximate? In John C. Platt, Daphne Koller, Yoram Singer and
Sam T. Roweis, ed., NIPS , MIT Press.
Littman, M. L. (1996). Algorithms for Sequential Decision Making. Ph.D., Brown
University, May 1996.
Mendel, J. M. (1966). Applications of artificial intelligence techniques to a
spacecraft control problem. Technical Report NASA CR-755, National Aeronautics and Space Administration.
Mendel, J. M. and McLaren, R. W. (1970). In Mendel, J. M. and Fu, K. S.,
editors, Adaptive, Learning and Pattern Recognition Systems: Theory and
Applications, pages 287-318. Academic Press, New York.
Minsky, M. L. (1961). Steps toward artificial intelligence. In Proceedings of
the Institute of Radio Engineers, 49:8-30. Reprinted in E. A. Feigenbaum
and J. Feldman, editors, Computers and Thought. McGraw- Hill, New York,
406-450, 1963.
Murphy, K. P. (2002). Dynamic Bayesian Networks: Representation, Inference
and Learning.
Rabiner, L. and Juang, B. (1993). Fundamentals of speech recognition. Prentice-Hall, Inc.
Schmidhuber, J. (1997). A Computer Scientist's View of Life, the Universe,
and Everything. In Foundations of Computer Science: Potential - Theory - Cognition, To Wilfried Brauer on the Occasion of His Sixtieth Birthday. C.
Freksa, M. Jantzen, and R. Valk, Eds. Lecture Notes in Computer Science,
vol. 1337. Springer-Verlag, London, 201-208.
Spaan, M. T. J. and Vlassis, N. (2005). Perseus: Randomized point-based
value iteration for POMDPs. In Journal of Artificial Intelligence Research,
vol.24, 195-220 (2005).
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. In IEEE Transactions on Neural Networks, 9(5): 1054-1054 (1998)
54


Toussaint, M. and Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In Proc. of the 23rd
international Conference on Machine Learning (ICML 2006). vol. 148. ACM,
New York, NY, 945-952
Toussaint, M., Storkey, A. and Harmeling, S. (2010). Expectation-Maximization
methods for solving (PO)MDPs and optimal control problems. In Silvia Chiappa, David Barber (Eds.): Inference and Learning in Dynamic Models, Cambridge University Press, print expected in 2010.
Vlassis, N. and Toussaint, M. (2009). Model-Free Reinforcement Learning as
Mixture Learning. In Proc. of the 26th International Conference on Machine
Learning (ICML), 2009.
Waltz, M. D. and Fu, K. S. (1965). A heuristic approach to reinforcement
learning control systems. In IEEE Transactions on Automatic Control, 10:390398.

55