
Machine Learning for Physicists

University of Erlangen-Nuremberg
& Max Planck Institute for the Science of Light
Florian Marquardt
Florian.Marquardt@fau.de
http://machine-learning-for-physicists.org

Part Two
(Title image generated by a net with 20 hidden layers)


Optimized gradient descent algorithms

How to speed up stochastic gradient descent?

- accelerate ("momentum") towards the minimum
- automatically choose the learning rate
- ...even different rates for different weights

Summary: a few gradient update methods
(see the overview by S. Ruder, https://arxiv.org/abs/1609.04747)

adagrad: divide by the RMS of all previous gradients

RMSprop: divide by the RMS of the last few previous gradients

adadelta: same, but also multiply by the RMS of the last few parameter updates

adam: divide a running average of the last few gradients by the RMS of the last few gradients (*with corrections during the earliest steps)

adam often works best


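As a concrete illustration of the last rule, here is a minimal numpy sketch of a single Adam update step (an illustration following the description above, not code from the lecture; beta1, beta2, eps denote the usual Adam hyperparameters, m and v start at zero, and t counts the steps starting from 1):

import numpy as np

def adam_step(w, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of the parameters w, given the current gradient."""
    m = beta1 * m + (1 - beta1) * grad       # running average of the gradients
    v = beta2 * v + (1 - beta2) * grad**2    # running average of the squared gradients
    m_hat = m / (1 - beta1**t)               # corrections during the earliest steps
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # divide by the RMS of the last few gradients
    return w, m, v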
Recurrent neural networks

Networks “with memory”

Useful for analyzing time evolution (time series of data), for analyzing and translating sentences, for control/feedback (e.g. robotics or action games), and many other things.
Could use a convolutional network!

[Figure: an input time series processed by a convolutional network into an output time series; the filter size sets the memory time]
Long memories with convolutional nets are challenging:
• would need large filter sizes
• even then, would need to know the required memory time beforehand
• can expand the memory time efficiently by a multi-layer network with subsampling (pooling), but this is still problematic for precise long-term memory

[Figure: a time axis with an early "signal", a long stretch with no important signals, and a late "recall signal!"]

But: this may be OK for some physics applications!
(problems local in time, with short memory)
Memory

[Figure: the same time axis, emphasizing the long gap between the early "signal" and the late "recall signal!"]
Solution: Recurrent Neural Networks (RNN)

[Figure: at each time step, the input feeds into a hidden layer that keeps memory (passes its state on to the next time step) and produces an output]

Advantage: in principle this could give arbitrarily long memory!

Note: each circle may represent multiple neurons (i.e. a layer). Each arrow then represents all possible connections between those neurons.
Note: the weights are not time-dependent, i.e. we need to store only one set of weights (similar to a convolutional net).
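To make the shared weights explicit, here is a minimal numpy sketch (an illustration, not the lecture's code) of a plain recurrent layer unrolled in time; the same matrices W_in, W_rec, W_out are reused at every time step:

import numpy as np

def run_rnn(x, W_in, W_rec, W_out, b, b_out):
    """x has shape (timesteps, input_dim); returns the outputs for all time steps."""
    h = np.zeros(W_rec.shape[0])                  # the memory ("hidden state"), starts at zero
    outputs = []
    for x_t in x:                                 # loop over time steps
        h = np.tanh(W_in @ x_t + W_rec @ h + b)   # keep memory: the new h depends on the old h
        outputs.append(W_out @ h + b_out)
    return np.array(outputs)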
[Figure: the recurrent network unrolled in time, with input, hidden and output neurons at each time step; the "correct answer" is known at certain time steps]

"Backpropagation through time"
Long memories with recurrent networks are challenging, due to a feature of backpropagation:

"Exploding gradients" / "Vanishing gradients"

Backpropagation through many layers (in a deep network) or through many time steps (in a recurrent network) involves repeated matrix multiplications, something like
δ_{t-1} = M_t δ_t   (for the recurrent network case)

Depending on the typical eigenvalues of the matrices M, the backpropagated gradient grows exponentially with the number of backpropagation steps ("exploding") or decays exponentially ("vanishing").
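A quick numerical illustration (a toy example added here, not from the slides): repeatedly multiplying a deviation vector by a matrix M makes its norm shrink or grow exponentially, depending on the typical magnitude of the eigenvalues of M:

import numpy as np

rng = np.random.default_rng(0)
for scale in [0.8, 1.2]:                       # typical eigenvalue magnitude below or above 1
    delta = rng.standard_normal(10)
    M = scale * np.linalg.qr(rng.standard_normal((10, 10)))[0]   # scaled orthogonal matrix
    for step in range(50):                     # 50 backpropagation steps
        delta = M @ delta
    print(f"scale={scale}: norm after 50 steps = {np.linalg.norm(delta):.2e}")
# prints a vanishingly small norm for scale=0.8 and a huge norm for scale=1.2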


Long short-term memory (LSTM)
Why this name? "Long-term memory" would be the weights that are adapted during training and then stored forever. "Short-term memory" is the input-dependent memory we are talking about here. "Long short-term memory" tries to have long memory times, in a robust way, for this short-term memory.
Sepp Hochreiter and Jürgen Schmidhuber, 1997

Main idea: determine the read/write/delete operations of a memory cell via the network (through other neurons).
Most of the time, a memory neuron just sits there and is not used/changed!
[Figure: again the time axis with the early "signal", no important signals in between, and the late "recall signal!"]
LSTM: Forget gate (delete)

[Figure: the memory cell content c_{t-1} passes through a multiplication * to become c_t]

keep:   c_t = 1 * c_{t-1}
delete: c_t = 0 * c_{t-1}
LSTM: Forget gate (delete)

Calculate the "forget gate":
f = σ( W^(f) x_t + b^(f) )
(σ: sigmoid; usually x, b, f are vectors and W^(f) is the weight matrix)

Obtain the new memory content:
c_t = f * c_{t-1}
(*: elementwise product; the gate f is computed from the input x_t)

NEW: for the first time, we are multiplying neuron values!


LSTM: Forget gate (delete)

Backpropagation: the multiplication * splits the error backpropagation into two branches (one through the forget gate f and the input x_t, one through the old memory content c_{t-1}).

Product rule:
∂(f_j c_{t-1,j})/∂w_* = (∂f_j/∂w_*) c_{t-1,j} + f_j (∂c_{t-1,j}/∂w_*)

(Note: if the time is not specified, we are referring to t)
LSTM: Forget gate (delete)

[Figure: at each time step t-1, t, t+1, the memory cell content is multiplied by a forget gate that is controlled by the input]
LSTM: Write new memory value

i = σ( W^(i) x_t + b^(i) )          ("input gate i")
c̃_t = tanh( W^(c) x_t + b^(c) )     (new value)

Both delete and write together:
c_t = f * c_{t-1} + i * c̃_t
(first term: forget; second term: write the new value)
LSTM: Read (output) memory value

o = σ( W^(o) x_t + b^(o) )          ("output gate o")
h_t = o * tanh(c_t)

[Figure: the input x_t sets the output gate o, which reads the memory content c_t out into h_t]
LSTM: exploit the previous memory output ‘h’

Make f, i, o etc. at time t depend on the output ‘h’ calculated in the previous time step!
(otherwise ‘h’ could only be used in higher layers, but not to control the memory access in the present layer)

f = σ( W^(f) x_t + U^(f) h_{t-1} + b^(f) )
...and likewise for every other quantity!

Thus, the result of a readout can actually influence subsequent operations (e.g. the readout of some selected other memory cell!)

Sometimes, o is even made to depend on c_t.
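Putting the above equations together, a single LSTM time step can be sketched as follows (a numpy illustration of the formulas, not the Keras implementation; the dictionary-based weight layout is an assumption made for readability):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b are dicts with keys 'f', 'i', 'c', 'o' holding the weight matrices/biases."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # proposed new value
    c = f * c_prev + i * c_tilde                                # delete and write
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # output gate
    h = o * np.tanh(c)                                          # read (output) the memory value
    return h, c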


LSTM: backpropagation through time is OK

As long as the memory content is not read or written, the backpropagation gradient is trivial:
c_t = c_{t-1} = c_{t-2} = ...
∂c_t/∂w_* = ∂c_{t-1}/∂w_* = ∂c_{t-2}/∂w_* = ...
(the deviation vector is just multiplied by 1)

During those 'silent' time intervals: no exploding or vanishing gradients!
Adding an LSTM layer with 10 memory cells:

Each of those cells has the full structure, with the f, i, o gates, the memory content c, and the output h.

rnn.add(LSTM(10, return_sequences=True))

return_sequences: whether to return the full time sequence of outputs, or only the output at the final time
Two LSTM layers (input > LSTM > LSTM = output), taking an input of 3 neuron values for each time step and producing a time sequence with 2 neuron values for each time step:

from keras.models import Sequential
from keras.layers import LSTM

def init_memory_net():
    global rnn, batchsize, timesteps
    rnn = Sequential()
    # note: batch_input_shape is (batchsize, timesteps, data_dim)
    rnn.add(LSTM(5, batch_input_shape=(None, timesteps, 3), return_sequences=True))
    rnn.add(LSTM(2, return_sequences=True))
    rnn.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
Example: A network for recall
(see code on website)

[Figure: an input time sequence with three channels; a "tell" signal marks a value (e.g. 0.4) early on, and a "recall!" signal arrives later. Desired output time sequence: reproduce the value (0.4) at the recall time.]
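One possible way to generate training data for this recall task (an assumption about the data layout, with a single output channel for simplicity; the actual version is in the code on the website). The three input channels are [tell flag, value, recall flag], and the target reproduces the value at the recall time:

import numpy as np

def make_recall_batch(batchsize, timesteps):
    x = np.zeros((batchsize, timesteps, 3))
    y = np.zeros((batchsize, timesteps, 1))
    for j in range(batchsize):
        t_tell = np.random.randint(0, timesteps // 2)
        t_recall = np.random.randint(timesteps // 2, timesteps)
        value = np.random.uniform(0, 1)
        x[j, t_tell, 0] = 1.0        # "tell" signal
        x[j, t_tell, 1] = value      # the value to remember
        x[j, t_recall, 2] = 1.0      # "recall!" signal
        y[j, t_recall, 0] = value    # desired output: the remembered value
    return x, y

x, y = make_recall_batch(batchsize=20, timesteps=40)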


Example: A network that counts down
(see code on website)

[Figure: an input time sequence where a "tell" signal carries a number (e.g. 7); desired output time sequence: a "signal!" exactly 7 steps later.]
[Figure: output of the recall network, evolving during training, for a fixed input sequence; the TELL and RECALL times are marked (one learning episode = batch of 20)]

[Figure: output of the countdown network, evolving during training, for a fixed input sequence; TELL and SIGNAL (delay 5) are marked (one learning episode = batch of 20)]


Character generation

input sequence (characters in one-hot encoding):
T H E _ T H E O R Y _ O F _ G E

desired output: predict the next character
H E _ T H E O R Y _ O F _ G E N

The network will output a probability for each possible character (ABCDEFGHIJKLMNOPQRSTUVWXYZ_), at each time step.
(example: the probability distribution at the second time step)
Character generation

Example by Andrej Karpathy, trained on MBs of text:

tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e


plia tklrgd t o idoe ns,smtt h ne etie h,hregtrs nigtike,aoaenns lng

"Tmont thithey" fomesscerliund


Keushey. Thom here
sheulke, anmerenith ol sivh I lalterthend Bleipile shuwy fil on
aseterlome
coaniogennc Phe lism thond hon at. MeiDimorotion in ther thize."

we counter. He stutn co des. His stanted out one ofler that concossions and was
to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn

Aftair fall unsuch that the hall for Prince Velzonski's that me of
her hearly, and behs to so arwage fiving were to it beloge, pavu say falling misfort
how, and Gogition is so overelical and ofter.

"Why do what that day," replied Natasha, and wishing to himself the fact the
princess, Princess Mary was easier, fed in had oftened him.
Pierre aking his soul came to the packs and drove up his father-in-law women.
Homework

Train a network that is eventually able to carry out sums or differences:

input:   3+5=??
output:  ....08
input:   7-5=??
output:  ....02

How do you encode the input/output sequences? What happens when the result has two digits? etc.
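One possible encoding, as a hint (just one option, not prescribed by the homework): treat each expression as a sequence of characters and one-hot encode it over a small alphabet, padding the output with a filler character:

import numpy as np

alphabet = "0123456789+-=?."                      # an assumed character set
char_to_index = {c: k for k, c in enumerate(alphabet)}

def one_hot(text):
    """Return an array of shape (len(text), len(alphabet))."""
    x = np.zeros((len(text), len(alphabet)))
    for t, c in enumerate(text):
        x[t, char_to_index[c]] = 1.0
    return x

x = one_hot("3+5=??")   # input time sequence
y = one_hot("....08")   # desired output time sequence (two digits, padded with '.')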
Word vectors

A simple one-hot encoding of words needs large vectors (and they do not carry any special meaning):

"warm"   0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...

[Figure: every dictionary word (warm, three, apple, desk, cold, tree, nice, bird, that, two, was, one, sun, hot, full, ...) gets its own dimension]

dimension: number of words in the dictionary

word2vec: reduction to vectors in a much lower dimension, where similar words lie closer together:

"warm"   0.3  0  0.1  0  2    1.2  0
"hot"    0.2  0  0.2  0  2.5  1.3  0
"cold"   0.1  0  0    2  0.4  0.2  0.3

[Figure: in the embedding space, "warm" and "hot" lie close together, away from "cold" and "tree"]
Word vectors: recurrent net for training
Mikolov, Yih, Zweig 2013

[Figure 1 of that paper: Recurrent Neural Network Language Model, with input w(t), hidden state s(t), output y(t)]

input word at time t: w(t) (one-hot; dimension D = size of the dictionary)
hidden state: s(t) (dimension N)
output: y(t) = predicted next word (probabilities for each word in the vocabulary; dimension D)

The hidden and output layers are computed as follows:
s(t) = f( U w(t) + W s(t-1) )        (U: N x D, W: N x N)
y(t) = g( V s(t) )                   (V: D x N)
f(z) = 1/(1 + e^(-z))   (sigmoid),   g(z_m) = e^(z_m) / Σ_k e^(z_k)   (softmax)

The U matrix (N x D) contains the word vectors! In this framework, the word representations are found in the columns of U, with each column representing a word. The RNN is trained with backpropagation.
Word vectors: how to train them

Predicting the probability of every word in the dictionary, given the context words (most recent word): very expensive!

Alternative:
Noise-contrastive estimation: provide a few noisy (wrong) examples, and train the model to predict that they are fake (but that the true one is correct)!
Word vectors: how to train them

Two approaches:
"continuous bag of words": context words → word
"skip-gram": word → context words

Example dataset:
the quick brown fox jumped over the lazy dog

word → context words (here: just the surrounding words)
quick → the, brown
over → jumped, the
lazy → the, dog
...
Word vectors: how to train them

The model tries to predict:
P_θ(w, h) = probability that w is the correct word, given the context word h
θ: the parameters of the model, i.e. the weights, biases, and the entries of the embedding vectors

P_θ(w, h) = σ( W_jk e_k(h) + b_j )

j: index of the word w in the dictionary
k: index in the embedding vector [Einstein summation]
e(h): embedding vector for the word h
W, b: weights, biases

At each time step: go down the gradient of
C = -( ln P_θ(w_t, h) + Σ_{w̃} ln(1 - P_θ(w̃, h)) )
where the sum runs over the noisy examples w̃.
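A small sketch of this noise-contrastive cost (an illustration with assumed array shapes, not library code): e holds the embedding vectors, W and b the output weights and biases.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_cost(e, W, b, h, w_true, w_noise):
    """h: index of the context word; w_true: index of the correct word;
    w_noise: list of indices of noisy (wrong) example words."""
    P = lambda j: sigmoid(W[j] @ e[h] + b[j])        # P_theta(w=j, h)
    cost = -np.log(P(w_true))                        # the true word should be predicted as correct
    for j in w_noise:
        cost -= np.log(1.0 - P(j))                   # the noise words should be predicted as fake
    return cost

vocab_size, dim = 1000, 50
e = 0.1 * np.random.randn(vocab_size, dim)
W = 0.1 * np.random.randn(vocab_size, dim)
b = np.zeros(vocab_size)
print(nce_cost(e, W, b, h=3, w_true=17, w_noise=[5, 99, 512]))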
Word vectors encode meaning
Mikolov, Yih, Zweig 2013

car - cars ≈ tree - trees
(subtracting the word vectors on each side yields approximately identical vectors)

[Figure: the difference vectors "car"→"cars" and "tree"→"trees" are nearly parallel]
Word vectors encode meaning
Mikolov, Yih, Zweig 2013

[Figure: "man"→"woman", "king"→"queen", "uncle"→"aunt" are related by nearly the same difference vector]
Word vectors encode meaning
Mikolov et al. 2013, "Distributed Representations of Words and Phrases and their Compositionality"

[Figure 2 of that paper: two-dimensional PCA projection of the 1000-dimensional skip-gram vectors of countries (China, Russia, Japan, Turkey, Poland, Germany, France, Italy, Greece, Spain, Portugal) and their capital cities (Beijing, Moscow, Tokyo, Ankara, Warsaw, Berlin, Paris, Rome, Athens, Madrid, Lisbon). The figure illustrates the ability of the model to automatically organize concepts and learn implicitly the relationships between them.]
Word vectors in keras

Layer for mapping word indices (integer numbers representing the position in a dictionary) to word vectors (of length EMBEDDING_DIM), for input sequences of some given length:

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

Helper routines for converting actual text into a sequence of word indices; see especially:

function/class     Keras documentation section
Tokenizer          Text Preprocessing
pad_sequences      Sequence Preprocessing
(and others)

Search for "GloVe word embeddings": an 800 MB database pre-trained on a 2014 dump of the English Wikipedia, encoding 400k words as 100-dimensional vectors.
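A short usage sketch for these helper routines (standard Keras preprocessing calls; the example texts are made up):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ["the quick brown fox", "the lazy dog"]
MAX_SEQUENCE_LENGTH = 10

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                      # build the dictionary
sequences = tokenizer.texts_to_sequences(texts)    # text -> lists of word indices
word_index = tokenizer.word_index                  # dict: word -> integer index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)   # pad/truncate to a fixed length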
[Course overview diagram - topics covered so far: Function/Image representation, Image classification (Handwriting recognition), Convolutional nets, Autoencoders, Visualization by dimensional reduction, Recurrent networks, Word vectors, Reinforcement learning]
Image: Wikipedia
Reinforcement Learning

"Supervised learning"
(most neural network applications)
[Figure: a teacher (smart) instructs a student (who imitates the teacher)]
final level: limited by the teacher

"Reinforcement learning"
[Figure: a student/scientist who tries out things]
final level: unlimited (?)


Reinforcement learning

[Figure: the "agent" receives an observation of the "state" of the "environment" (fully observed vs. partially observed) and sends back an action]

Self-driving cars, robotics:
  observe the immediate environment & move
Games:
  observe the board & place a stone
  observe the video screen & move the player

Challenge: the "correct" action is not known!
Therefore: no supervised learning!
The reward will be rare (or decided only at the end).
Reinforcement learning

Use reinforcement learning:
Training a network to produce actions based on rare rewards (instead of being told the 'correct' action!)

Challenge: We could use the final reward to define a cost function, but we cannot know how the environment reacts to a proposed change of the actions that were taken!
(unless we have a model of the environment)
Reinforcement Learning: Basic setting

[Figure: the RL-agent applies its "policy" to the observed state and sends an action to the RL-environment; the environment returns a new observation]

Example: state = position x,y; action = move (in some direction); a reward is given for picking up a box.
Policy Gradient
= REINFORCE (Williams 1992): the simplest model-free general reinforcement learning technique

Basic idea: Use a probabilistic action choice. If the reward at the end turns out to be high, make all the actions in this sequence more likely (otherwise do the opposite).

This will also sometimes reinforce 'bad' actions, but since they occur more often in trajectories with low reward, the net effect will still be to suppress them!

Policy: π_θ(a_t|s_t) = probability to pick action a_t, given the observed state s_t at time t
[Figure: the policy is implemented by a neural network that maps the observation to the action]
Policy Gradient

Probabilistic policy:
π_θ(a|s) = probability to take action a, given the current state s
θ: the parameters of the network

Example (for one state s):
  a       π
  down    0.1
  up      0.6
  left    0.2
  right   0.1

The environment then makes a (possibly stochastic) transition to a new state s', and possibly gives a reward r.
Transition function: P(s'|s, a)
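A tiny illustration of the probabilistic action choice for the table above (added here for illustration, not from the slides):

import numpy as np

actions = ["down", "up", "left", "right"]
pi = [0.1, 0.6, 0.2, 0.1]                  # pi_theta(a|s) for the current state s
a = np.random.choice(actions, p=pi)        # probabilistic action choice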
Policy Gradient

Probability for having a certain trajectory of actions and states (product over time steps):

P_θ(τ) = Π_t P(s_{t+1}|s_t, a_t) π_θ(a_t|s_t)

trajectory: τ = (a, s)
a = a_0, a_1, a_2, ...
s = s_1, s_2, ...   (state 0 is fixed)

Expected overall reward (= 'return'):

R̄ = E[R] = Σ_τ P_θ(τ) R(τ)

R(τ): return for this sequence (sum over the individual rewards r for all times)
Σ_τ ... = Σ_{a_0, a_1, a_2, ..., s_1, s_2, ...} ...  (sum over all actions at all times and over all states at all times)

Try to maximize the expected return by changing the parameters of the policy:
∂R̄/∂θ = ?
Policy Gradient

∂R̄/∂θ = Σ_τ R(τ) Σ_t [ ∂π_θ(a_t|s_t)/∂θ ] (1/π_θ(a_t|s_t)) Π_{t'} P(s_{t'+1}|s_{t'}, a_{t'}) π_θ(a_{t'}|s_{t'})

where [ ∂π_θ(a_t|s_t)/∂θ ] / π_θ(a_t|s_t) = ∂ ln π_θ(a_t|s_t)/∂θ.

Main formula of the policy gradient method:

∂R̄/∂θ = E[ R Σ_t ∂ ln π_θ(a_t|s_t)/∂θ ]

Stochastic gradient descent:
δθ = η ∂R̄/∂θ,  where E[...] is approximated via the value for one trajectory (or a batch)
Policy Gradient

∂R̄/∂θ = E[ R Σ_t ∂ ln π_θ(a_t|s_t)/∂θ ]

Increase the probability of all action choices in the given sequence, depending on the size of the return R.
Even if R>0 always, due to the normalization of the probabilities this will tend to suppress the action choices in sequences with lower-than-average returns.

Abbreviation:
G_k = ∂ ln P_θ(τ)/∂θ_k = Σ_t ∂ ln π_θ(a_t|s_t)/∂θ_k

∂R̄/∂θ_k = E[R G_k]
Policy Gradient: reward baseline

Challenge: the fluctuations of the estimate for the return gradient can be huge. Things improve if one subtracts a constant baseline from the return:

∂R̄/∂θ = E[ (R - b) Σ_t ∂ ln π_θ(a_t|s_t)/∂θ ] = E[(R - b) G]

This is the same as before. Proof:
E[G_k] = Σ_τ P_θ(τ) ∂ ln P_θ(τ)/∂θ_k = ∂/∂θ_k Σ_τ P_θ(τ) = 0

However, the variance of the fluctuating random variable (R-b)G is different, and can be smaller (depending on the value of b)!
Optimal baseline

X_k = (R - b_k) G_k
Var[X_k] = E[X_k²] - E[X_k]² = min
∂Var[X_k]/∂b_k = 0
⟹ b_k = E[G_k² R] / E[G_k²]

with G_k = ∂ ln P_θ(τ)/∂θ_k and the update δθ_k = η E[G_k (R - b_k)]

For a more in-depth treatment, see David Silver's course on reinforcement learning (University College London):
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
The simplest RL example ever

A random walk, where the probability to go "up" is determined by the policy, and where the return is given by the final position (ideal strategy: always go up!)
(Note: this policy does not even depend on the current state)

[Figure: position vs. time for a random-walk trajectory; the reward is the final position]
The simplest RL example ever

policy:  π_θ(up) = 1/(1 + e^(-θ))
return:  R = x(T)

RL update:  δθ = η ⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩,   a_t = up or down

∂ ln π_θ(a_t)/∂θ = ±e^(∓θ) π_θ(a_t) = ±(1 - π_θ(a_t))
  = 1 - π_θ(up)   for up
  = -π_θ(up)      for down

Σ_t ∂ ln π_θ(a_t)/∂θ = N_up - N π_θ(up)
(N_up = number of 'up' steps, N = number of time steps)
The simplest RL example ever

return: R = x(T) = N_up - N_down = 2 N_up - N

RL update:  δθ = η ⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩,   a_t = up or down

⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩ = 2 ⟨ (N_up - N/2)(N_up - N̄_up) ⟩
(a general analytical expression for the average update; rare)

Initially, when π_θ(up) = 1/2:
δθ = 2η ⟨ (N_up - N/2)² ⟩ = 2η Var(N_up) = η N/2 > 0
(binomial distribution!)
The simplest RL example ever

In general:
⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩ = 2 ⟨ (N_up - N/2)(N_up - N̄_up) ⟩
  = 2 ⟨ [ (N_up - N̄_up) + (N̄_up - N/2) ] (N_up - N̄_up) ⟩
  = 2 Var(N_up) + 2 (N̄_up - N/2) ⟨ N_up - N̄_up ⟩
  = 2 Var(N_up) = 2 N π_θ(up) (1 - π_θ(up))

(a general analytical expression for the average update, fully simplified; extremely rare)
The simplest RL example ever

[Figure: the probability π_θ(up) vs. the trajectory (= training episode), for 3 learning attempts: strong fluctuations! This plot for N=100 time steps in a trajectory and eta=0.001.]
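A minimal simulation that reproduces such a learning attempt, without a baseline (a sketch assuming N=100 and eta=0.001 as in the plot; the number of episodes is arbitrary). The homework below adds the optimal baseline:

import numpy as np

N, eta, episodes = 100, 0.001, 500
theta = 0.0
for episode in range(episodes):
    p_up = 1.0 / (1.0 + np.exp(-theta))        # pi_theta(up)
    N_up = np.random.binomial(N, p_up)         # number of 'up' steps in this trajectory
    R = 2 * N_up - N                           # return = final position x(T)
    grad_log = N_up - N * p_up                 # sum_t d ln pi_theta(a_t) / d theta
    theta += eta * R * grad_log                # policy-gradient update
print("final pi(up) =", 1.0 / (1.0 + np.exp(-theta)))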
Spread of the update step

Y = N_up - N̄_up,   c = N̄_up - N/2,   X = (Y + c) Y
X = the update (except for the prefactor of 2)
(Note: to get Var X, we need the central moments of the binomial distribution up to the 4th moment)

[Figure: √Var(X) ~ N^(3/2) while ⟨X⟩ ~ N, plotted vs. π_θ(up); this plot for N=100]


Optimal baseline suppresses the spread!

Y = N_up - N̄_up,   c = N̄_up - N/2,   X = (Y + c) Y

With the optimal baseline:
X' = (Y + c - b) Y,   b = ⟨ Y² (Y + c) ⟩ / ⟨ Y² ⟩

[Figure: √Var(X) ~ N^(3/2), while √Var(X') is strongly reduced; ⟨X⟩ ~ N; plotted vs. π_θ(up); this plot for N=100]


Note: Many update steps reduce the relative spread

M = number of update steps

ΔX = Σ_{j=1}^{M} X_j
⟨ΔX⟩ = M ⟨X⟩
√Var(ΔX) = √M √Var(X)

relative spread:  √Var(ΔX) / ⟨ΔX⟩ ~ 1/√M
Homework

Implement the RL update including the optimal baseline and run some stochastic learning attempts. Can you observe the improvement over the no-baseline results shown here?
Note: You do not need to simulate the individual random-walk trajectories, just exploit the binomial distribution.
The second-simplest RL example

[Figure: a "walker" moves along a line of positions (actions: move or stay); one position is the "target site". Return = number of time steps spent on the target.]

See the code on the website: "SimpleRL_WalkerTarget"

The network:
input = s = "are we on target?" (0/1)
output = action probabilities (softmax), i.e. the policy π_θ(a|s), with a=0 ("stay") and a=1 ("move")


Policy gradient: all the steps

Obtain one "trajectory" (a repeated cycle):
- apply the neural network to the state, thus obtaining the action probabilities
- from the probabilities, obtain the action for the next step
- execute the action, record the new state

Then:
- do one trajectory (in reality: a batch of trajectories)
- obtain the overall sum of rewards (= return) for each trajectory
- apply policy gradient training (enhance the probabilities of all actions in a high-return trajectory)
RL in keras: categorical cross-entropy trick

The network output gives the action probabilities (softmax): π_θ(a|s), for input = state.

Categorical cross-entropy:
C = - Σ_a P(a) ln π_θ(a|s)
(π_θ(a|s): distribution from the net; P(a): "desired" distribution)

Set
P(a) = R   for a = the action that was taken
P(a) = 0   for all other actions a

Then δθ = -η ∂C/∂θ implements the policy gradient.
RL in keras: categorical cross-entropy trick

Suppose N states were encountered (during repeated runs). After setting categorical cross-entropy as the cost function, just use the following simple line to implement the policy gradient:

net.train_on_batch(observed_inputs, desired_outputs)

observed_inputs: array of shape N x state-size
desired_outputs: array of shape N x number-of-actions, with desired_outputs[j,a] = R for the state numbered j if action a was taken during a run that gave the overall return R (and 0 otherwise)
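A hedged sketch of the full trick (the network architecture, the shapes and the placeholder experience arrays are assumptions chosen for illustration; compare the course code for the actual version):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

state_size, num_actions = 1, 2                 # e.g. the walker-target example above
net = Sequential()
net.add(Dense(32, input_shape=(state_size,), activation='relu'))
net.add(Dense(num_actions, activation='softmax'))     # outputs pi_theta(a|s)
net.compile(loss='categorical_crossentropy', optimizer='adam')

# placeholders standing for one batch of collected experience:
observed_inputs = np.zeros((5, state_size))           # N=5 visited states
taken_actions = np.array([0, 1, 1, 0, 1])             # action taken in each of those states
returns = np.array([3., 3., 3., 1., 1.])              # return R of the run each state belongs to

desired_outputs = np.zeros((5, num_actions))
desired_outputs[np.arange(5), taken_actions] = returns   # P(a)=R for the taken action, 0 otherwise
net.train_on_batch(observed_inputs, desired_outputs)     # implements the policy gradient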
AlphaGo

Among the major board games, Go was not yet played at a superhuman level by any program (very large state space on a 19x19 board!)

AlphaGo beat the world's best player in 2017.
AlphaGo

First: try to learn from human expert players.

Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016
AlphaGo

Second: use policy gradient RL on games played against previous versions of the program.

Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016
AlphaGo

*Note: beyond policy-gradient type methods, this also includes another algorithm, called Monte Carlo Tree Search.

Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016
AlphaGoZero
No training on human expert knowledge
– eventually becomes even better!

Silver et al, Nature 2017


AlphaGoZero

Ke Jie stated that "After humanity spent thousands of


years improving our tactics, computers tell us that humans
are completely wrong... I would go as far as to say not a
single human has touched the edge of the truth of Go."
Q-learning

An alternative to the policy gradient approach.

Introduce a quality function Q that predicts the future reward for a given state s and a given action a. Deterministic policy: just select the action with the largest Q!

[Figure: a grid world, with the "value" of each state shown as a color, and the "quality" of the action "going up" shown as a color]

Q(s_t, a_t) = E[ R_t | s_t, a_t ]   (assuming the future steps to follow the policy!)

"Discounted" future reward:
R_t = Σ_{t'=t}^{T} γ^(t'-t) r_{t'}

r_t: reward at time step t (depends on the state and the action at time t)
γ: discount factor, 0 < γ ≤ 1 (learning is somewhat easier for a smaller factor: short memory times)

Note: The 'value' of a state is V(s) = max_a Q(s, a)

How do we obtain Q?
Q-learning: Update rule

Bellman equation (from optimal control theory):
Q(s_t, a_t) = E[ r_t + γ max_a Q(s_{t+1}, a) | s_t, a_t ]

In practice, we do not know the Q function yet, so we cannot directly use the Bellman equation. However, the following update rule has the correct Q function as a fixed point:

Q_new(s_t, a_t) = Q_old(s_t, a_t) + α ( r_t + γ max_a Q_old(s_{t+1}, a) - Q_old(s_t, a_t) )

α: a small (<1) update factor; the term in brackets will be zero once we have converged to the correct Q.

If we use a neural network to calculate Q, it will be trained to yield the "new" value in each step.

[Figure: snapshots of Q(a=up, s) during training]
Q-learning: Exploration

Initially, Q is arbitrary. It would be bad to follow this Q all the time. Therefore, introduce a probability ε of taking a random action ("exploration")!

Follow Q: "exploitation"
Do something random (new): "exploration"
("ε-greedy")

Reduce this randomness later!
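A minimal tabular Q-learning sketch with ε-greedy exploration (the toy environment and its reward are placeholders invented for illustration; alpha, gamma and epsilon are the quantities defined above):

import numpy as np

num_states, num_actions = 10, 2
Q = np.zeros((num_states, num_actions))
gamma, alpha, epsilon = 0.9, 0.1, 0.1

def step(s, a):
    """Placeholder environment: returns (new_state, reward)."""
    return (s + 1) % num_states, 1.0 if a == 1 else 0.0

s = 0
for t in range(1000):
    if np.random.rand() < epsilon:            # exploration: random action
        a = np.random.randint(num_actions)
    else:                                     # exploitation: follow Q
        a = np.argmax(Q[s])
    s_new, r = step(s, a)
    # update rule with the correct Q function as a fixed point:
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_new]) - Q[s, a])
    s = s_new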


Example: Learning to play Atari Video Games
"Human-level control through deep reinforcement learning", Mnih et al., Nature, February 2015

The last four 84x84-pixel images are the input [= state]; the motion is the output [= action].

[Figure: t-SNE visualization of the last hidden layer]
Advanced Exercise

Apply RL to solve the challenge of finding, as fast as possible, a "treasure" in:
- a fixed given labyrinth
- an arbitrary labyrinth (in each run, the player finds itself in another labyrinth)

Use the labyrinth generator from Wikipedia: "Maze Generation Algorithm" (see the Python code example there).
[Course overview diagram - topics covered so far: Function/Image representation, Image classification (Handwriting recognition), Convolutional nets, Autoencoders, Visualization by dimensional reduction, Recurrent networks, Word vectors, Reinforcement learning, Connections to physics]
Neural networks and spin models

[Figure: an artificial neuron (value 0/1), a bit (0/1), and a spin (up/down) are analogous]

Neural networks with stochastic transitions, and with some energy functional similar to spin models in physics; e.g. as described by Hopfield and others starting from the 80s.
Modeling probability distributions

Goal: Use a neural network to generate previously unseen examples, according to the probability distribution of the training samples.

One example already mentioned in these lectures: generating new random (but kind-of reasonable) text after seeing lots of it.

Another example: generate new images after looking at many, or generate handwritten text.

The solution will exploit the connection between neural networks and the statistical physics of spin models!
Boltzmann-Gibbs distribution

What are the probabilities of the states of a physical system in thermal equilibrium?
(energy high: less likely; energy low: more likely)

P(s) = (1/Z) e^( -E(s) / k_B T ),        Z = Σ_{s'} e^( -E(s') / k_B T )

P(s): probability for state s in thermal equilibrium; Z: normalization, the "partition function"

Problem: for a many-body system there are exponentially many states (for example 2^N spin states). We cannot go through all of them!
Monte Carlo approach

[Figure: states s, s', ... connected by transition probabilities P(s'←s) and P(s←s')]

Place the system in some state, then make stochastic transitions to other states (with prescribed transition probabilities).
Monte Carlo approach

Time evolution of the ensemble?

ΔP(s) = Σ_{s'} [ P(s←s') P(s')  -  P(s'←s) P(s) ]
              ("IN")              ("OUT")

ΔP(s): change of P(s) in one time step; P(s) = probability to find the system in state s (or: fraction of the ensemble in this state)
Monte Carlo approach

At long times: a stable steady-state distribution.

If we have "detailed balance", i.e. if there exists a distribution P(s) such that for any pair of states

P(s'←s) / P(s←s') = P(s') / P(s)

then P(s) is the long-time distribution!
Monte Carlo approach

Monte Carlo for thermal equilibrium: choose the transition probabilities such that P(s) will be the Boltzmann distribution!

P(s'←s) / P(s←s') = e^( -[E(s') - E(s)] / k_B T )

Example, the Metropolis algorithm: pick a random spin and calculate the energy change ΔE for flipping it. Do the flip if it lowers the energy. If the energy increases, only flip with probability exp(-ΔE/k_B T).
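For concreteness, a sketch of the Metropolis algorithm for a small Ising-type spin chain (the nearest-neighbour energy function is an assumed example; units with k_B = 1):

import numpy as np

N, T, J = 20, 1.0, 1.0
spins = np.random.choice([-1, 1], size=N)

def energy(s):
    return -J * np.sum(s * np.roll(s, 1))          # nearest-neighbour coupling, periodic chain

for step in range(10000):
    i = np.random.randint(N)                       # pick a random spin
    trial = spins.copy()
    trial[i] *= -1                                 # propose a flip
    dE = energy(trial) - energy(spins)
    if dE <= 0 or np.random.rand() < np.exp(-dE / T):   # Metropolis acceptance rule
        spins = trial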
Markov chain

The sequence of visited states s_0, s_1, s_2, s_3, ... forms a so-called "Markov chain".

Markov = transitions without memory


Restricted Boltzmann Machine

[Figure: a layer of "hidden" units h, each connected to every "visible" unit v; no connections within a layer]

Each "unit" is like a spin (or a bit) that can be 0 or 1.

Define an "energy" (we will set k_B T = 1); with the couplings (weights) w and the biases a, b it reads
E(v, h) = - Σ_i a_i v_i - Σ_j b_j h_j - Σ_{ij} v_i h_j w_{ij}
"restricted": no couplings v-v or h-h.

see G. Hinton's guide: http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf


Restricted Boltzmann Machine

P(v, h) = e^(-E(v,h)) / Z,    Z = Σ_{v,h} e^(-E(v,h))

P(v) = Σ_h P(v, h)

Goal: adapt the weights (and biases) such that the probability distribution of a set of training examples is approximately reproduced by P(v):
P(v) ≈ P_0(v) from the training samples
Restricted Boltzmann Machine

Interpretation: the 'hidden units' represent categories of the data (e.g. "dog + white + big")
Building a Markov chain

Instead of the full state s = (v, h): consider alternating transitions between v states and h states.

Set:
P(h←v) = P(h|v) = P(v, h) / P(v)
P(v←h) = P(v|h) = P(v, h) / P(h)

Chain:  v → h → v' ("reconstruction") → h' → v'' → ...

These transition probabilities fulfill detailed balance:
P(h←v) / P(v←h) = P(h) / P(v)
Thus P(v) [and P(h)] are the steady-state distributions!
Building a Markov chain

Z P(v) = Σ_h e^(-E(v,h)) = Σ_h e^( Σ_i a_i v_i + Σ_j b_j h_j + Σ_{ij} v_i h_j w_{ij} )
       = e^( Σ_i a_i v_i ) Π_j ( 1 + e^(z_j) )

with: z_j = b_j + Σ_i v_i w_{ij}

where we used e^(Σ_j X_j) = Π_j e^(X_j) and Σ_h ... = Σ_{h_0=0,1} Σ_{h_1=0,1} Σ_{h_2=0,1} ...

Therefore:
P(h|v) = e^(-E(v,h)) / (Z P(v)) = Π_j e^(z_j h_j) / (1 + e^(z_j))

A product of probabilities! All the h_j are independently distributed, with probabilities:
P(h_j = 1|v) = e^(z_j) / (1 + e^(z_j)) = σ(z_j)     (σ: sigmoid)
P(h_j = 0|v) = 1 - σ(z_j)
Building a Markov chain

Given some visible-units state vector v, calculate the probabilities
P(h_j = 1|v) = e^(z_j) / (1 + e^(z_j)) = σ(z_j)     (σ: sigmoid)
and then assign 1 or 0, according to these probabilities, to obtain the new hidden state vector h.

Similarly, go from h to a new v', using:
P(v'_i = 1|h) = σ(z'_i),   with   z'_i = a_i + Σ_j w_{ij} h_j
Updating the weights

Goal: adapt the weights (and biases) such that the probability distribution of a set of training examples is approximately reproduced by P(v):
P(v) ≈ P_0(v) from the training samples

Minimize the categorical cross-entropy
C = - Σ_v P_0(v) ln P(v)

But now (unlike in earlier examples) there are exponentially many values of v, so we cannot simply have a network output P(v) for all v. Still, let us take the derivative of C with respect to the weights w!
Updating the weights

C = - Σ_v P_0(v) ln P(v)

(∂/∂w_{ij}) ln P(v) = [ (∂/∂w_{ij}) Σ_h P(v, h) ] / [ Σ_h P(v, h) ]

With P(v, h) = e^(-E(v,h))/Z, Z = Σ_{v',h'} e^(-E(v',h')), and ∂E/∂w_{ij} = -v_i h_j, differentiating the numerator and the factor 1/Z gives two terms:

(∂/∂w_{ij}) ln P(v) = [ Σ_h v_i h_j e^(-E(v,h)) ] / [ Σ_h e^(-E(v,h)) ]  -  [ Σ_{v',h'} v'_i h'_j e^(-E(v',h')) ] / Z

Overall:
Σ_v P_0(v) (∂/∂w_{ij}) ln P(v) = Σ_{v,h} v_i h_j P(h|v) P_0(v)  -  Σ_{v',h'} v'_i h'_j P(v', h')
Updating the weights

Σ_v P_0(v) (∂/∂w_{ij}) ln P(v) = Σ_{v,h} v_i h_j P(h|v) P_0(v)  -  Σ_{v',h'} v'_i h'_j P(h'|v') P(v')

First term (easy): draw one training sample v, then do one Markov chain step from v to h; average over all samples v.
Second term (hard): we need to average over the correct distribution P(v) belonging to the Boltzmann machine!
Updating the weights

We could obtain P(v) by running the Markov chain for really long times. Very expensive!

v (training sample) → h → v' → h' → v'' → ...

Rough approximation, used in practice: just take v', h' from the second pair of the chain! [For a better approximation, one can take a pair further down the chain.]

δw_{ij} = η ( ⟨v_i h_j⟩ - ⟨v'_i h'_j⟩ )
(averaged over a batch of training samples v starting the chain)
Updating the weights

δw_{ij} = η ( ⟨v_i h_j⟩ - ⟨v'_i h'_j⟩ )
(averaged over a batch of training samples v starting the chain)

"Contrastive Divergence" (CD) algorithm by G. Hinton

Note: At least we can claim that P_0(v) = P(v) would be a fixed point of this update rule, since then the two averages on the right-hand side yield identical results. Of course, usually the restricted Boltzmann machine will not be able to reach this point, since it cannot represent arbitrary P(v).

The biases are updated in the same way:
δa_i = η ( ⟨v_i⟩ - ⟨v'_i⟩ )
δb_j = η ( ⟨h_j⟩ - ⟨h'_j⟩ )
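A minimal numpy sketch of one CD-1 update following the formulas above (the array shapes, the learning rate, and the use of probabilities in the last half-step are assumptions; v_batch holds a batch of binary training samples, w, a, b are the weights and biases):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v_batch, w, a, b, eta=0.01):
    # v -> h
    p_h = sigmoid(b + v_batch @ w)                       # P(h_j=1|v)
    h = (np.random.rand(*p_h.shape) < p_h) * 1.0
    # h -> v' ("reconstruction") -> h'
    p_v1 = sigmoid(a + h @ w.T)                          # P(v'_i=1|h)
    v1 = (np.random.rand(*p_v1.shape) < p_v1) * 1.0
    p_h1 = sigmoid(b + v1 @ w)
    # contrastive-divergence updates, averaged over the batch
    w += eta * (v_batch.T @ p_h - v1.T @ p_h1) / len(v_batch)
    a += eta * np.mean(v_batch - v1, axis=0)
    b += eta * np.mean(p_h - p_h1, axis=0)
    return w, a, b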
Restricted Boltzmann Machine for MNIST
example from http://deeplearning.net/tutorial/rbm.html
Markov chain steps (1000 steps between each row!)

Each column: a different, independent Markov chain


Restricted Boltzmann Machine for MNIST
example from http://deeplearning.net/tutorial/rbm.html

The learned weights for the 100 hidden units


RBM as a starting point

First train the RBM, then connect the hidden layer to some output layer for supervised learning of a classification task.
Idea: the RBM provides unsupervised learning of important features in the training set ("pre-training").

[Figure: "visible" units v → "hidden" units h → output layer (e.g. softmax)]
Deep belief networks

Stack RBMs: first train a simple RBM, then use its hidden units as input to another RBM, and so on.

visible units v → (first RBM) → hidden units h1 → (second RBM) → hidden units h2 → (third RBM) → hidden units h3

Afterwards, fine-tune the weights, e.g. by supervised learning.
Application to Quantum Physics

[Figure: the "visible" units v are the spins of the quantum model, coupled to a layer of "hidden" units h]

Try to solve a quantum many-body problem (a quantum spin model) using the following variational ansatz for the wave-function amplitudes:

Ψ(S) = Σ_h e^( Σ_j a_j σ_j^z + Σ_i b_i h_i + Σ_{ij} h_i σ_j^z W_{ij} )

Carleo & Troyer, Science 2017

S = (σ_1^z, σ_2^z, ..., σ_N^z): one basis state in the many-body Hilbert space
σ_j^z = ±1, h_i = ±1 (in general, a, b, W may be complex)

This is exactly (proportional to) the RBM representation for P(v) [with v = S]!
Application to Quantum Physics

Minimize the energy ⟨Ψ|Ĥ|Ψ⟩ / ⟨Ψ|Ψ⟩ by adapting the weights W and the biases a and b!
[this requires an additional Monte Carlo simulation, to obtain a stochastic sampling of the gradient with respect to these parameters]

For example: sample the probabilities by using the Metropolis algorithm, with transition probabilities
P(S'←S) = min( 1, |Ψ(S')/Ψ(S)|² )
Application to Quantum Physics
Carleo & Troyer, Science 2017

[Figure: the error decreases with the number of 'filters']

Exploit translational invariance (like in convolutional nets); the weights are "filters" (convolutional kernels).
Find updates on

http://machine-learning-for-physicists.org
