
Machine Learning for Physicists

University of Erlangen-Nuremberg
& Max Planck Institute for the Science of Light
Florian Marquardt
Florian.Marquardt@fau.de
http://machine-learning-for-physicists.org

Part Two
(Title image generated by a net with 20 hidden layers)


Optimized gradient descent algorithms

How to speed up stochastic gradient descent?

- accelerate ("momentum") towards the minimum
- automatically choose the learning rate
- ...even different rates for different weights

Summary: a few gradient update methods
(see the overview by S. Ruder, https://arxiv.org/abs/1609.04747)

adagrad: divide by the RMS of all previous gradients

RMSprop: divide by the RMS of the last few previous gradients

adadelta: same, but also multiply by the RMS of the last few parameter updates

adam: divide a running average of the last few gradients by the RMS of the last few gradients (*with corrections during the earliest steps)

adam often works best


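As a concrete illustration of the last rule, here is a minimal numpy sketch of a single Adam update step (an illustration following the description above, not code from the lecture; beta1, beta2, eps denote the usual Adam hyperparameters, m and v start at zero, and t counts the steps starting from 1):

import numpy as np

def adam_step(w, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of the parameters w, given the current gradient."""
    m = beta1 * m + (1 - beta1) * grad       # running average of the gradients
    v = beta2 * v + (1 - beta2) * grad**2    # running average of the squared gradients
    m_hat = m / (1 - beta1**t)               # corrections during the earliest steps
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # divide by the RMS of the last few gradients
    return w, m, v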
Recurrent neural networks

Networks “with memory”

Useful for analyzing time evolution (time series of data), for analyzing and translating sentences, for control/feedback (e.g. robotics or action games), and many other things.
Could use a convolutional network!

[Figure: an input time series processed by a convolutional network into an output time series; the filter size sets the memory time]
Long memories with convolutional nets are challenging:
• would need large filter sizes
• even then, would need to know the required memory time beforehand
• can expand the memory time efficiently by a multi-layer network with subsampling (pooling), but this is still problematic for precise long-term memory

[Figure: a time axis with an early "signal", a long stretch with no important signals, and a late "recall signal!"]

But: this may be OK for some physics applications!
(problems local in time, with short memory)
Memory

[Figure: the same time axis, emphasizing the long gap between the early "signal" and the late "recall signal!"]
Solution: Recurrent Neural Networks (RNN)

[Figure: at each time step, the input feeds into a hidden layer that keeps memory (passes its state on to the next time step) and produces an output]

Advantage: in principle this could give arbitrarily long memory!

Note: each circle may represent multiple neurons (i.e. a layer). Each arrow then represents all possible connections between those neurons.
Note: the weights are not time-dependent, i.e. we need to store only one set of weights (similar to a convolutional net).
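To make the shared weights explicit, here is a minimal numpy sketch (an illustration, not the lecture's code) of a plain recurrent layer unrolled in time; the same matrices W_in, W_rec, W_out are reused at every time step:

import numpy as np

def run_rnn(x, W_in, W_rec, W_out, b, b_out):
    """x has shape (timesteps, input_dim); returns the outputs for all time steps."""
    h = np.zeros(W_rec.shape[0])                  # the memory ("hidden state"), starts at zero
    outputs = []
    for x_t in x:                                 # loop over time steps
        h = np.tanh(W_in @ x_t + W_rec @ h + b)   # keep memory: the new h depends on the old h
        outputs.append(W_out @ h + b_out)
    return np.array(outputs)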
[Figure: the recurrent network unrolled in time, with input, hidden and output neurons at each time step; the "correct answer" is known at certain time steps]

"Backpropagation through time"
Long memories with recurrent networks are challenging, due to a feature of backpropagation:

"Exploding gradients" / "Vanishing gradients"

Backpropagation through many layers (in a deep network) or through many time steps (in a recurrent network) involves repeated matrix multiplications, something like
δ_{t-1} = M_t δ_t   (for the recurrent network case)

Depending on the typical eigenvalues of the matrices M, the backpropagated gradient grows exponentially with the number of backpropagation steps ("exploding") or decays exponentially ("vanishing").
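A quick numerical illustration (a toy example added here, not from the slides): repeatedly multiplying a deviation vector by a matrix M makes its norm shrink or grow exponentially, depending on the typical magnitude of the eigenvalues of M:

import numpy as np

rng = np.random.default_rng(0)
for scale in [0.8, 1.2]:                       # typical eigenvalue magnitude below or above 1
    delta = rng.standard_normal(10)
    M = scale * np.linalg.qr(rng.standard_normal((10, 10)))[0]   # scaled orthogonal matrix
    for step in range(50):                     # 50 backpropagation steps
        delta = M @ delta
    print(f"scale={scale}: norm after 50 steps = {np.linalg.norm(delta):.2e}")
# prints a vanishingly small norm for scale=0.8 and a huge norm for scale=1.2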


Long short-term memory (LSTM)
Why this name? "Long-term memory" would be the weights that are adapted during training and then stored forever. "Short-term memory" is the input-dependent memory we are talking about here. "Long short-term memory" tries to have long memory times, in a robust way, for this short-term memory.
Sepp Hochreiter and Jürgen Schmidhuber, 1997

Main idea: determine the read/write/delete operations of a memory cell via the network (through other neurons).
Most of the time, a memory neuron just sits there and is not used/changed!
[Figure: again the time axis with the early "signal", no important signals in between, and the late "recall signal!"]
LSTM: Forget gate (delete)

[Figure: the memory cell content c_{t-1} passes through a multiplication * to become c_t]

keep:   c_t = 1 * c_{t-1}
delete: c_t = 0 * c_{t-1}
LSTM: Forget gate (delete)

Calculate the "forget gate":
f = σ( W^(f) x_t + b^(f) )
(σ: sigmoid; usually x, b, f are vectors and W^(f) is the weight matrix)

Obtain the new memory content:
c_t = f * c_{t-1}
(*: elementwise product; the gate f is computed from the input x_t)

NEW: for the first time, we are multiplying neuron values!


LSTM: Forget gate (delete)

Backpropagation: the multiplication * splits the error backpropagation into two branches (one through the forget gate f and the input x_t, one through the old memory content c_{t-1}).

Product rule:
∂(f_j c_{t-1,j})/∂w_* = (∂f_j/∂w_*) c_{t-1,j} + f_j (∂c_{t-1,j}/∂w_*)

(Note: if the time is not specified, we are referring to t)
LSTM: Forget gate (delete)

[Figure: at each time step t-1, t, t+1, the memory cell content is multiplied by a forget gate that is controlled by the input]
LSTM: Write new memory value

i = σ( W^(i) x_t + b^(i) )          ("input gate i")
c̃_t = tanh( W^(c) x_t + b^(c) )     (new value)

Both delete and write together:
c_t = f * c_{t-1} + i * c̃_t
(first term: forget; second term: write the new value)
LSTM: Read (output) memory value

o = σ( W^(o) x_t + b^(o) )          ("output gate o")
h_t = o * tanh(c_t)

[Figure: the input x_t sets the output gate o, which reads the memory content c_t out into h_t]
LSTM: exploit the previous memory output ‘h’

Make f, i, o etc. at time t depend on the output ‘h’ calculated in the previous time step!
(otherwise ‘h’ could only be used in higher layers, but not to control the memory access in the present layer)

f = σ( W^(f) x_t + U^(f) h_{t-1} + b^(f) )
...and likewise for every other quantity!

Thus, the result of a readout can actually influence subsequent operations (e.g. the readout of some selected other memory cell!)

Sometimes, o is even made to depend on c_t.
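Putting the above equations together, a single LSTM time step can be sketched as follows (a numpy illustration of the formulas, not the Keras implementation; the dictionary-based weight layout is an assumption made for readability):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b are dicts with keys 'f', 'i', 'c', 'o' holding the weight matrices/biases."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # proposed new value
    c = f * c_prev + i * c_tilde                                # delete and write
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # output gate
    h = o * np.tanh(c)                                          # read (output) the memory value
    return h, c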


LSTM: backpropagation through time is OK

As long as the memory content is not read or written, the backpropagation gradient is trivial:
c_t = c_{t-1} = c_{t-2} = ...
∂c_t/∂w_* = ∂c_{t-1}/∂w_* = ∂c_{t-2}/∂w_* = ...
(the deviation vector is just multiplied by 1)

During those 'silent' time intervals: no exploding or vanishing gradients!
Adding an LSTM layer with 10 memory cells:

Each of those cells has the full structure, with the f, i, o gates, the memory content c, and the output h.

rnn.add(LSTM(10, return_sequences=True))

return_sequences: whether to return the full time sequence of outputs, or only the output at the final time
Two LSTM layers (input > LSTM > LSTM = output), taking an input of 3 neuron values for each time step and producing a time sequence with 2 neuron values for each time step:

from keras.models import Sequential
from keras.layers import LSTM

def init_memory_net():
    global rnn, batchsize, timesteps
    rnn = Sequential()
    # note: batch_input_shape is (batchsize, timesteps, data_dim)
    rnn.add(LSTM(5, batch_input_shape=(None, timesteps, 3), return_sequences=True))
    rnn.add(LSTM(2, return_sequences=True))
    rnn.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
Example: A network for recall
(see code on website)

[Figure: an input time sequence with three channels; a "tell" signal marks a value (e.g. 0.4) early on, and a "recall!" signal arrives later. Desired output time sequence: reproduce the value (0.4) at the recall time.]
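One possible way to generate training data for this recall task (an assumption about the data layout, with a single output channel for simplicity; the actual version is in the code on the website). The three input channels are [tell flag, value, recall flag], and the target reproduces the value at the recall time:

import numpy as np

def make_recall_batch(batchsize, timesteps):
    x = np.zeros((batchsize, timesteps, 3))
    y = np.zeros((batchsize, timesteps, 1))
    for j in range(batchsize):
        t_tell = np.random.randint(0, timesteps // 2)
        t_recall = np.random.randint(timesteps // 2, timesteps)
        value = np.random.uniform(0, 1)
        x[j, t_tell, 0] = 1.0        # "tell" signal
        x[j, t_tell, 1] = value      # the value to remember
        x[j, t_recall, 2] = 1.0      # "recall!" signal
        y[j, t_recall, 0] = value    # desired output: the remembered value
    return x, y

x, y = make_recall_batch(batchsize=20, timesteps=40)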


Example: A network that counts down
(see code on website)

[Figure: an input time sequence where a "tell" signal carries a number (e.g. 7); desired output time sequence: a "signal!" exactly 7 steps later.]
[Figure: output of the recall network, evolving during training, for a fixed input sequence; the TELL and RECALL times are marked (one learning episode = batch of 20)]

[Figure: output of the countdown network, evolving during training, for a fixed input sequence; TELL and SIGNAL (delay 5) are marked (one learning episode = batch of 20)]


Character generation

input sequence (characters in one-hot encoding):
T H E _ T H E O R Y _ O F _ G E

desired output: predict the next character
H E _ T H E O R Y _ O F _ G E N

The network will output a probability for each possible character (ABCDEFGHIJKLMNOPQRSTUVWXYZ_), at each time step.
(example: the probability distribution at the second time step)
Character generation

Example by Andrej Karpathy, trained on MBs of text:

tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e


plia tklrgd t o idoe ns,smtt h ne etie h,hregtrs nigtike,aoaenns lng

"Tmont thithey" fomesscerliund


Keushey. Thom here
sheulke, anmerenith ol sivh I lalterthend Bleipile shuwy fil on
aseterlome
coaniogennc Phe lism thond hon at. MeiDimorotion in ther thize."

we counter. He stutn co des. His stanted out one ofler that concossions and was
to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn

Aftair fall unsuch that the hall for Prince Velzonski's that me of
her hearly, and behs to so arwage fiving were to it beloge, pavu say falling misfort
how, and Gogition is so overelical and ofter.

"Why do what that day," replied Natasha, and wishing to himself the fact the
princess, Princess Mary was easier, fed in had oftened him.
Pierre aking his soul came to the packs and drove up his father-in-law women.
Homework

Train a network that is eventually able to carry out sums or differences:

input:   3+5=??
output:  ....08
input:   7-5=??
output:  ....02

How do you encode the input/output sequences? What happens when the result has two digits? etc.
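One possible encoding, as a hint (just one option, not prescribed by the homework): treat each expression as a sequence of characters and one-hot encode it over a small alphabet, padding the output with a filler character:

import numpy as np

alphabet = "0123456789+-=?."                      # an assumed character set
char_to_index = {c: k for k, c in enumerate(alphabet)}

def one_hot(text):
    """Return an array of shape (len(text), len(alphabet))."""
    x = np.zeros((len(text), len(alphabet)))
    for t, c in enumerate(text):
        x[t, char_to_index[c]] = 1.0
    return x

x = one_hot("3+5=??")   # input time sequence
y = one_hot("....08")   # desired output time sequence (two digits, padded with '.')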
Word vectors

A simple one-hot encoding of words needs large vectors (and they do not carry any special meaning):

"warm"   0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...

[Figure: every dictionary word (warm, three, apple, desk, cold, tree, nice, bird, that, two, was, one, sun, hot, full, ...) gets its own dimension]

dimension: number of words in the dictionary

word2vec: reduction to vectors in a much lower dimension, where similar words lie closer together:

"warm"   0.3  0  0.1  0  2    1.2  0
"hot"    0.2  0  0.2  0  2.5  1.3  0
"cold"   0.1  0  0    2  0.4  0.2  0.3

[Figure: in the embedding space, "warm" and "hot" lie close together, away from "cold" and "tree"]
Word vectors: recurrent net for training
Mikolov, Yih, Zweig 2013

[Figure 1 of that paper: Recurrent Neural Network Language Model, with input w(t), hidden state s(t), output y(t)]

input word at time t: w(t) (one-hot; dimension D = size of the dictionary)
hidden state: s(t) (dimension N)
output: y(t) = predicted next word (probabilities for each word in the vocabulary; dimension D)

The hidden and output layers are computed as follows:
s(t) = f( U w(t) + W s(t-1) )        (U: N x D, W: N x N)
y(t) = g( V s(t) )                   (V: D x N)
f(z) = 1/(1 + e^(-z))   (sigmoid),   g(z_m) = e^(z_m) / Σ_k e^(z_k)   (softmax)

The U matrix (N x D) contains the word vectors! In this framework, the word representations are found in the columns of U, with each column representing a word. The RNN is trained with backpropagation.
Word vectors: how to train them

Predicting the probability of every word in the dictionary, given the context words (most recent word): very expensive!

Alternative:
Noise-contrastive estimation: provide a few noisy (wrong) examples, and train the model to predict that they are fake (but that the true one is correct)!
Word vectors: how to train them

Two approaches:
"continuous bag of words": context words → word
"skip-gram": word → context words

Example dataset:
the quick brown fox jumped over the lazy dog

word → context words (here: just the surrounding words)
quick → the, brown
over → jumped, the
lazy → the, dog
...
Word vectors: how to train them

The model tries to predict:
P_θ(w, h) = probability that w is the correct word, given the context word h
θ: the parameters of the model, i.e. the weights, biases, and the entries of the embedding vectors

P_θ(w, h) = σ( W_jk e_k(h) + b_j )

j: index of the word w in the dictionary
k: index in the embedding vector [Einstein summation]
e(h): embedding vector for the word h
W, b: weights, biases

At each time step: go down the gradient of
C = -( ln P_θ(w_t, h) + Σ_{w̃} ln(1 - P_θ(w̃, h)) )
where the sum runs over the noisy examples w̃.
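A small sketch of this noise-contrastive cost (an illustration with assumed array shapes, not library code): e holds the embedding vectors, W and b the output weights and biases.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_cost(e, W, b, h, w_true, w_noise):
    """h: index of the context word; w_true: index of the correct word;
    w_noise: list of indices of noisy (wrong) example words."""
    P = lambda j: sigmoid(W[j] @ e[h] + b[j])        # P_theta(w=j, h)
    cost = -np.log(P(w_true))                        # the true word should be predicted as correct
    for j in w_noise:
        cost -= np.log(1.0 - P(j))                   # the noise words should be predicted as fake
    return cost

vocab_size, dim = 1000, 50
e = 0.1 * np.random.randn(vocab_size, dim)
W = 0.1 * np.random.randn(vocab_size, dim)
b = np.zeros(vocab_size)
print(nce_cost(e, W, b, h=3, w_true=17, w_noise=[5, 99, 512]))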
Word vectors encode meaning
Mikolov, Yih, Zweig 2013

car - cars ≈ tree - trees
(subtracting the word vectors on each side yields approximately identical vectors)

[Figure: the difference vectors "car"→"cars" and "tree"→"trees" are nearly parallel]
Word vectors encode meaning
Mikolov, Yih, Zweig 2013

[Figure: "man"→"woman", "king"→"queen", "uncle"→"aunt" are related by nearly the same difference vector]
Word vectors encode meaning
Mikolov et al. 2013, "Distributed Representations of Words and Phrases and their Compositionality"

[Figure 2 of that paper: two-dimensional PCA projection of the 1000-dimensional skip-gram vectors of countries (China, Russia, Japan, Turkey, Poland, Germany, France, Italy, Greece, Spain, Portugal) and their capital cities (Beijing, Moscow, Tokyo, Ankara, Warsaw, Berlin, Paris, Rome, Athens, Madrid, Lisbon). The figure illustrates the ability of the model to automatically organize concepts and learn implicitly the relationships between them.]
Word vectors in keras

Layer for mapping word indices (integer numbers representing the position in a dictionary) to word vectors (of length EMBEDDING_DIM), for input sequences of some given length:

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

Helper routines for converting actual text into a sequence of word indices; see especially:

function/class     Keras documentation section
Tokenizer          Text Preprocessing
pad_sequences      Sequence Preprocessing
(and others)

Search for "GloVe word embeddings": an 800 MB database pre-trained on a 2014 dump of the English Wikipedia, encoding 400k words as 100-dimensional vectors.
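A short usage sketch for these helper routines (standard Keras preprocessing calls; the example texts are made up):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ["the quick brown fox", "the lazy dog"]
MAX_SEQUENCE_LENGTH = 10

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                      # build the dictionary
sequences = tokenizer.texts_to_sequences(texts)    # text -> lists of word indices
word_index = tokenizer.word_index                  # dict: word -> integer index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)   # pad/truncate to a fixed length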
[Course overview diagram - topics covered so far: Function/Image representation, Image classification (Handwriting recognition), Convolutional nets, Autoencoders, Visualization by dimensional reduction, Recurrent networks, Word vectors, Reinforcement learning]
Image: Wikipedia
Reinforcement Learning

"Supervised learning"
(most neural network applications)
[Figure: a teacher (smart) instructs a student (who imitates the teacher)]
final level: limited by the teacher

"Reinforcement learning"
[Figure: a student/scientist who tries out things]
final level: unlimited (?)


Reinforcement learning

[Figure: the "agent" receives an observation of the "state" of the "environment" (fully observed vs. partially observed) and sends back an action]

Self-driving cars, robotics:
  observe the immediate environment & move
Games:
  observe the board & place a stone
  observe the video screen & move the player

Challenge: the "correct" action is not known!
Therefore: no supervised learning!
The reward will be rare (or decided only at the end).
Reinforcement learning

Use reinforcement learning:
Training a network to produce actions based on rare rewards (instead of being told the 'correct' action!)

Challenge: We could use the final reward to define a cost function, but we cannot know how the environment reacts to a proposed change of the actions that were taken!
(unless we have a model of the environment)
Reinforcement Learning: Basic setting

[Figure: the RL-agent applies its "policy" to the observed state and sends an action to the RL-environment; the environment returns a new observation]

Example: state = position x,y; action = move (in some direction); a reward is given for picking up a box.
Policy Gradient
= REINFORCE (Williams 1992): the simplest model-free general reinforcement learning technique

Basic idea: Use a probabilistic action choice. If the reward at the end turns out to be high, make all the actions in this sequence more likely (otherwise do the opposite).

This will also sometimes reinforce 'bad' actions, but since they occur more often in trajectories with low reward, the net effect will still be to suppress them!

Policy: π_θ(a_t|s_t) = probability to pick action a_t, given the observed state s_t at time t
[Figure: the policy is implemented by a neural network that maps the observation to the action]
Policy Gradient

Probabilistic policy:
π_θ(a|s) = probability to take action a, given the current state s
θ: the parameters of the network

Example (for one state s):
  a       π
  down    0.1
  up      0.6
  left    0.2
  right   0.1

The environment then makes a (possibly stochastic) transition to a new state s', and possibly gives a reward r.
Transition function: P(s'|s, a)
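A tiny illustration of the probabilistic action choice for the table above (added here for illustration, not from the slides):

import numpy as np

actions = ["down", "up", "left", "right"]
pi = [0.1, 0.6, 0.2, 0.1]                  # pi_theta(a|s) for the current state s
a = np.random.choice(actions, p=pi)        # probabilistic action choice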
Policy Gradient

Probability for having a certain trajectory of actions and states (product over time steps):

P_θ(τ) = Π_t P(s_{t+1}|s_t, a_t) π_θ(a_t|s_t)

trajectory: τ = (a, s)
a = a_0, a_1, a_2, ...
s = s_1, s_2, ...   (state 0 is fixed)

Expected overall reward (= 'return'):

R̄ = E[R] = Σ_τ P_θ(τ) R(τ)

R(τ): return for this sequence (sum over the individual rewards r for all times)
Σ_τ ... = Σ_{a_0, a_1, a_2, ..., s_1, s_2, ...} ...  (sum over all actions at all times and over all states at all times)

Try to maximize the expected return by changing the parameters of the policy:
∂R̄/∂θ = ?
Policy Gradient

∂R̄/∂θ = Σ_τ R(τ) Σ_t [ ∂π_θ(a_t|s_t)/∂θ ] (1/π_θ(a_t|s_t)) Π_{t'} P(s_{t'+1}|s_{t'}, a_{t'}) π_θ(a_{t'}|s_{t'})

where [ ∂π_θ(a_t|s_t)/∂θ ] / π_θ(a_t|s_t) = ∂ ln π_θ(a_t|s_t)/∂θ.

Main formula of the policy gradient method:

∂R̄/∂θ = E[ R Σ_t ∂ ln π_θ(a_t|s_t)/∂θ ]

Stochastic gradient descent:
δθ = η ∂R̄/∂θ,  where E[...] is approximated via the value for one trajectory (or a batch)
Policy Gradient

∂R̄/∂θ = E[ R Σ_t ∂ ln π_θ(a_t|s_t)/∂θ ]

Increase the probability of all action choices in the given sequence, depending on the size of the return R.
Even if R>0 always, due to the normalization of the probabilities this will tend to suppress the action choices in sequences with lower-than-average returns.

Abbreviation:
G_k = ∂ ln P_θ(τ)/∂θ_k = Σ_t ∂ ln π_θ(a_t|s_t)/∂θ_k

∂R̄/∂θ_k = E[R G_k]
Policy Gradient: reward baseline

Challenge: the fluctuations of the estimate for the return gradient can be huge. Things improve if one subtracts a constant baseline from the return:

∂R̄/∂θ = E[ (R - b) Σ_t ∂ ln π_θ(a_t|s_t)/∂θ ] = E[(R - b) G]

This is the same as before. Proof:
E[G_k] = Σ_τ P_θ(τ) ∂ ln P_θ(τ)/∂θ_k = ∂/∂θ_k Σ_τ P_θ(τ) = 0

However, the variance of the fluctuating random variable (R-b)G is different, and can be smaller (depending on the value of b)!
Optimal baseline

X_k = (R - b_k) G_k
Var[X_k] = E[X_k²] - E[X_k]² = min
∂Var[X_k]/∂b_k = 0
⟹ b_k = E[G_k² R] / E[G_k²]

with G_k = ∂ ln P_θ(τ)/∂θ_k and the update δθ_k = η E[G_k (R - b_k)]

For a more in-depth treatment, see David Silver's course on reinforcement learning (University College London):
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
The simplest RL example ever

A random walk, where the probability to go "up" is determined by the policy, and where the return is given by the final position (ideal strategy: always go up!)
(Note: this policy does not even depend on the current state)

[Figure: position vs. time for a random-walk trajectory; the reward is the final position]
The simplest RL example ever

policy:  π_θ(up) = 1/(1 + e^(-θ))
return:  R = x(T)

RL update:  δθ = η ⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩,   a_t = up or down

∂ ln π_θ(a_t)/∂θ = ±e^(∓θ) π_θ(a_t) = ±(1 - π_θ(a_t))
  = 1 - π_θ(up)   for up
  = -π_θ(up)      for down

Σ_t ∂ ln π_θ(a_t)/∂θ = N_up - N π_θ(up)
(N_up = number of 'up' steps, N = number of time steps)
The simplest RL example ever

return: R = x(T) = N_up - N_down = 2 N_up - N

RL update:  δθ = η ⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩,   a_t = up or down

⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩ = 2 ⟨ (N_up - N/2)(N_up - N̄_up) ⟩
(a general analytical expression for the average update; rare)

Initially, when π_θ(up) = 1/2:
δθ = 2η ⟨ (N_up - N/2)² ⟩ = 2η Var(N_up) = η N/2 > 0
(binomial distribution!)
The simplest RL example ever

In general:
⟨ R Σ_t ∂ ln π_θ(a_t)/∂θ ⟩ = 2 ⟨ (N_up - N/2)(N_up - N̄_up) ⟩
  = 2 ⟨ [ (N_up - N̄_up) + (N̄_up - N/2) ] (N_up - N̄_up) ⟩
  = 2 Var(N_up) + 2 (N̄_up - N/2) ⟨ N_up - N̄_up ⟩
  = 2 Var(N_up) = 2 N π_θ(up) (1 - π_θ(up))

(a general analytical expression for the average update, fully simplified; extremely rare)
The simplest RL example ever

[Figure: the probability π_θ(up) vs. the trajectory (= training episode), for 3 learning attempts: strong fluctuations! This plot for N=100 time steps in a trajectory and eta=0.001.]
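A minimal simulation that reproduces such a learning attempt, without a baseline (a sketch assuming N=100 and eta=0.001 as in the plot; the number of episodes is arbitrary). The homework below adds the optimal baseline:

import numpy as np

N, eta, episodes = 100, 0.001, 500
theta = 0.0
for episode in range(episodes):
    p_up = 1.0 / (1.0 + np.exp(-theta))        # pi_theta(up)
    N_up = np.random.binomial(N, p_up)         # number of 'up' steps in this trajectory
    R = 2 * N_up - N                           # return = final position x(T)
    grad_log = N_up - N * p_up                 # sum_t d ln pi_theta(a_t) / d theta
    theta += eta * R * grad_log                # policy-gradient update
print("final pi(up) =", 1.0 / (1.0 + np.exp(-theta)))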
Spread of the update step

Y = N_up - N̄_up,   c = N̄_up - N/2,   X = (Y + c) Y
X = the update (except for the prefactor of 2)
(Note: to get Var X, we need the central moments of the binomial distribution up to the 4th moment)

[Figure: √Var(X) ~ N^(3/2) while ⟨X⟩ ~ N, plotted vs. π_θ(up); this plot for N=100]


Optimal baseline suppresses the spread!

Y = N_up - N̄_up,   c = N̄_up - N/2,   X = (Y + c) Y

With the optimal baseline:
X' = (Y + c - b) Y,   b = ⟨ Y² (Y + c) ⟩ / ⟨ Y² ⟩

[Figure: √Var(X) ~ N^(3/2), while √Var(X') is strongly reduced; ⟨X⟩ ~ N; plotted vs. π_θ(up); this plot for N=100]


Note: Many update steps reduce the relative spread

M = number of update steps

ΔX = Σ_{j=1}^{M} X_j
⟨ΔX⟩ = M ⟨X⟩
√Var(ΔX) = √M √Var(X)

relative spread:  √Var(ΔX) / ⟨ΔX⟩ ~ 1/√M
Homework

Implement the RL update including the optimal baseline and run some stochastic learning attempts. Can you observe the improvement over the no-baseline results shown here?
Note: You do not need to simulate the individual random-walk trajectories, just exploit the binomial distribution.
The second-simplest RL example

[Figure: a "walker" moves along a line of positions (actions: move or stay); one position is the "target site". Return = number of time steps spent on the target.]

See the code on the website: "SimpleRL_WalkerTarget"

The network:
input = s = "are we on target?" (0/1)
output = action probabilities (softmax), i.e. the policy π_θ(a|s), with a=0 ("stay") and a=1 ("move")


Policy gradient: all the steps

Obtain one "trajectory" (a repeated cycle):
- apply the neural network to the state, thus obtaining the action probabilities
- from the probabilities, obtain the action for the next step
- execute the action, record the new state

Then:
- do one trajectory (in reality: a batch of trajectories)
- obtain the overall sum of rewards (= return) for each trajectory
- apply policy gradient training (enhance the probabilities of all actions in a high-return trajectory)
RL in keras: categorical cross-entropy trick

The network output gives the action probabilities (softmax): π_θ(a|s), for input = state.

Categorical cross-entropy:
C = - Σ_a P(a) ln π_θ(a|s)
(π_θ(a|s): distribution from the net; P(a): "desired" distribution)

Set
P(a) = R   for a = the action that was taken
P(a) = 0   for all other actions a

Then δθ = -η ∂C/∂θ implements the policy gradient.
RL in keras: categorical cross-entropy trick

Suppose N states were encountered (during repeated runs). After setting categorical cross-entropy as the cost function, just use the following simple line to implement the policy gradient:

net.train_on_batch(observed_inputs, desired_outputs)

observed_inputs: array of shape N x state-size
desired_outputs: array of shape N x number-of-actions, with desired_outputs[j,a] = R for the state numbered j if action a was taken during a run that gave the overall return R (and 0 otherwise)
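A hedged sketch of the full trick (the network architecture, the shapes and the placeholder experience arrays are assumptions chosen for illustration; compare the course code for the actual version):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

state_size, num_actions = 1, 2                 # e.g. the walker-target example above
net = Sequential()
net.add(Dense(32, input_shape=(state_size,), activation='relu'))
net.add(Dense(num_actions, activation='softmax'))     # outputs pi_theta(a|s)
net.compile(loss='categorical_crossentropy', optimizer='adam')

# placeholders standing for one batch of collected experience:
observed_inputs = np.zeros((5, state_size))           # N=5 visited states
taken_actions = np.array([0, 1, 1, 0, 1])             # action taken in each of those states
returns = np.array([3., 3., 3., 1., 1.])              # return R of the run each state belongs to

desired_outputs = np.zeros((5, num_actions))
desired_outputs[np.arange(5), taken_actions] = returns   # P(a)=R for the taken action, 0 otherwise
net.train_on_batch(observed_inputs, desired_outputs)     # implements the policy gradient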
AlphaGo

Among the major board games, Go was not yet played at a superhuman level by any program (very large state space on a 19x19 board!)

AlphaGo beat the world's best player in 2017.
AlphaGo

First: try to learn from human expert players.

Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016
AlphaGo

Second: use policy gradient RL on games played against previous versions of the program.

Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016
AlphaGo

*Note: beyond policy-gradient type methods, this also includes another algorithm, called Monte Carlo Tree Search.

Silver et al., "Mastering the game of Go with deep neural networks and tree search" (Google DeepMind team), Nature, January 2016
AlphaGoZero
No training on human expert knowledge
– eventually becomes even better!

Silver et al, Nature 2017


AlphaGoZero

Ke Jie stated that "After humanity spent thousands of


years improving our tactics, computers tell us that humans
are completely wrong... I would go as far as to say not a
single human has touched the edge of the truth of Go."
Q-learning

An alternative to the policy gradient approach.

Introduce a quality function Q that predicts the future reward for a given state s and a given action a. Deterministic policy: just select the action with the largest Q!

[Figure: a grid world, with the "value" of each state shown as a color, and the "quality" of the action "going up" shown as a color]

Q(s_t, a_t) = E[ R_t | s_t, a_t ]   (assuming the future steps to follow the policy!)

"Discounted" future reward:
R_t = Σ_{t'=t}^{T} γ^(t'-t) r_{t'}

r_t: reward at time step t (depends on the state and the action at time t)
γ: discount factor, 0 < γ ≤ 1 (learning is somewhat easier for a smaller factor: short memory times)

Note: The 'value' of a state is V(s) = max_a Q(s, a)

How do we obtain Q?
Q-learning: Update rule

Bellman equation (from optimal control theory):
Q(s_t, a_t) = E[ r_t + γ max_a Q(s_{t+1}, a) | s_t, a_t ]

In practice, we do not know the Q function yet, so we cannot directly use the Bellman equation. However, the following update rule has the correct Q function as a fixed point:

Q_new(s_t, a_t) = Q_old(s_t, a_t) + α ( r_t + γ max_a Q_old(s_{t+1}, a) - Q_old(s_t, a_t) )

α: a small (<1) update factor; the term in brackets will be zero once we have converged to the correct Q.

If we use a neural network to calculate Q, it will be trained to yield the "new" value in each step.

[Figure: snapshots of Q(a=up, s) during training]
Q-learning: Exploration

Initially, Q is arbitrary. It would be bad to follow this Q all the time. Therefore, introduce a probability ε of taking a random action ("exploration")!

Follow Q: "exploitation"
Do something random (new): "exploration"
("ε-greedy")

Reduce this randomness later!
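A minimal tabular Q-learning sketch with ε-greedy exploration (the toy environment and its reward are placeholders invented for illustration; alpha, gamma and epsilon are the quantities defined above):

import numpy as np

num_states, num_actions = 10, 2
Q = np.zeros((num_states, num_actions))
gamma, alpha, epsilon = 0.9, 0.1, 0.1

def step(s, a):
    """Placeholder environment: returns (new_state, reward)."""
    return (s + 1) % num_states, 1.0 if a == 1 else 0.0

s = 0
for t in range(1000):
    if np.random.rand() < epsilon:            # exploration: random action
        a = np.random.randint(num_actions)
    else:                                     # exploitation: follow Q
        a = np.argmax(Q[s])
    s_new, r = step(s, a)
    # update rule with the correct Q function as a fixed point:
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_new]) - Q[s, a])
    s = s_new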


Example: Learning to play Atari Video Games
"Human-level control through deep reinforcement learning", Mnih et al., Nature, February 2015

The last four 84x84-pixel images are the input [= state]; the motion is the output [= action].

[Figure: t-SNE visualization of the last hidden layer]
Advanced Exercise

Apply RL to solve the challenge of finding, as fast as possible, a "treasure" in:
- a fixed given labyrinth
- an arbitrary labyrinth (in each run, the player finds itself in another labyrinth)

Use the labyrinth generator from Wikipedia: "Maze Generation Algorithm" (see the Python code example there).
[Course overview diagram - topics covered so far: Function/Image representation, Image classification (Handwriting recognition), Convolutional nets, Autoencoders, Visualization by dimensional reduction, Recurrent networks, Word vectors, Reinforcement learning, Connections to physics]
Neural networks and spin models

[Figure: an artificial neuron (value 0/1), a bit (0/1), and a spin (up/down) are analogous]

Neural networks with stochastic transitions, and with some energy functional similar to spin models in physics; e.g. as described by Hopfield and others starting from the 80s.
Modeling probability distributions

Goal: Use a neural network to generate previously unseen examples, according to the probability distribution of the training samples.

One example already mentioned in these lectures: generating new random (but kind-of reasonable) text after seeing lots of it.

Another example: generate new images after looking at many, or generate handwritten text.

The solution will exploit the connection between neural networks and the statistical physics of spin models!
Boltzmann-Gibbs distribution

What are the probabilities of the states of a physical system in thermal equilibrium?
(energy high: less likely; energy low: more likely)

P(s) = (1/Z) e^( -E(s) / k_B T ),        Z = Σ_{s'} e^( -E(s') / k_B T )

P(s): probability for state s in thermal equilibrium; Z: normalization, the "partition function"

Problem: for a many-body system there are exponentially many states (for example 2^N spin states). We cannot go through all of them!
Monte Carlo approach

[Figure: states s, s', ... connected by transition probabilities P(s'←s) and P(s←s')]

Place the system in some state, then make stochastic transitions to other states (with prescribed transition probabilities).
Monte Carlo approach

Time evolution of the ensemble?

ΔP(s) = Σ_{s'} [ P(s←s') P(s')  -  P(s'←s) P(s) ]
              ("IN")              ("OUT")

ΔP(s): change of P(s) in one time step; P(s) = probability to find the system in state s (or: fraction of the ensemble in this state)
Monte Carlo approach

At long times: a stable steady-state distribution.

If we have "detailed balance", i.e. if there exists a distribution P(s) such that for any pair of states

P(s'←s) / P(s←s') = P(s') / P(s)

then P(s) is the long-time distribution!
Monte Carlo approach

Monte Carlo for thermal equilibrium: choose the transition probabilities such that P(s) will be the Boltzmann distribution!

P(s'←s) / P(s←s') = e^( -[E(s') - E(s)] / k_B T )

Example, the Metropolis algorithm: pick a random spin and calculate the energy change ΔE for flipping it. Do the flip if it lowers the energy. If the energy increases, only flip with probability exp(-ΔE/k_B T).
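For concreteness, a sketch of the Metropolis algorithm for a small Ising-type spin chain (the nearest-neighbour energy function is an assumed example; units with k_B = 1):

import numpy as np

N, T, J = 20, 1.0, 1.0
spins = np.random.choice([-1, 1], size=N)

def energy(s):
    return -J * np.sum(s * np.roll(s, 1))          # nearest-neighbour coupling, periodic chain

for step in range(10000):
    i = np.random.randint(N)                       # pick a random spin
    trial = spins.copy()
    trial[i] *= -1                                 # propose a flip
    dE = energy(trial) - energy(spins)
    if dE <= 0 or np.random.rand() < np.exp(-dE / T):   # Metropolis acceptance rule
        spins = trial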
Markov chain

The sequence of visited states s_0, s_1, s_2, s_3, ... forms a so-called "Markov chain".

Markov = transitions without memory


Restricted Boltzmann Machine

[Figure: a layer of "hidden" units h, each connected to every "visible" unit v; no connections within a layer]

Each "unit" is like a spin (or a bit) that can be 0 or 1.

Define an "energy" (we will set k_B T = 1); with the couplings (weights) w and the biases a, b it reads
E(v, h) = - Σ_i a_i v_i - Σ_j b_j h_j - Σ_{ij} v_i h_j w_{ij}
"restricted": no couplings v-v or h-h.

see G. Hinton's guide: http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf


Restricted Boltzmann Machine

P(v, h) = e^(-E(v,h)) / Z,    Z = Σ_{v,h} e^(-E(v,h))

P(v) = Σ_h P(v, h)

Goal: adapt the weights (and biases) such that the probability distribution of a set of training examples is approximately reproduced by P(v):
P(v) ≈ P_0(v) from the training samples
Restricted Boltzmann Machine

Interpretation: the 'hidden units' represent categories of the data (e.g. "dog + white + big")
Building a Markov chain

Instead of the full state s = (v, h): consider alternating transitions between v states and h states.

Set:
P(h←v) = P(h|v) = P(v, h) / P(v)
P(v←h) = P(v|h) = P(v, h) / P(h)

Chain:  v → h → v' ("reconstruction") → h' → v'' → ...

These transition probabilities fulfill detailed balance:
P(h←v) / P(v←h) = P(h) / P(v)
Thus P(v) [and P(h)] are the steady-state distributions!
Building a Markov chain

Z P(v) = Σ_h e^(-E(v,h)) = Σ_h e^( Σ_i a_i v_i + Σ_j b_j h_j + Σ_{ij} v_i h_j w_{ij} )
       = e^( Σ_i a_i v_i ) Π_j ( 1 + e^(z_j) )

with: z_j = b_j + Σ_i v_i w_{ij}

where we used e^(Σ_j X_j) = Π_j e^(X_j) and Σ_h ... = Σ_{h_0=0,1} Σ_{h_1=0,1} Σ_{h_2=0,1} ...

Therefore:
P(h|v) = e^(-E(v,h)) / (Z P(v)) = Π_j e^(z_j h_j) / (1 + e^(z_j))

A product of probabilities! All the h_j are independently distributed, with probabilities:
P(h_j = 1|v) = e^(z_j) / (1 + e^(z_j)) = σ(z_j)     (σ: sigmoid)
P(h_j = 0|v) = 1 - σ(z_j)
Building a Markov chain

Given some visible-units state vector v, calculate the probabilities
P(h_j = 1|v) = e^(z_j) / (1 + e^(z_j)) = σ(z_j)     (σ: sigmoid)
and then assign 1 or 0, according to these probabilities, to obtain the new hidden state vector h.

Similarly, go from h to a new v', using:
P(v'_i = 1|h) = σ(z'_i),   with   z'_i = a_i + Σ_j w_{ij} h_j
Updating the weights

Goal: adapt the weights (and biases) such that the probability distribution of a set of training examples is approximately reproduced by P(v):
P(v) ≈ P_0(v) from the training samples

Minimize the categorical cross-entropy
C = - Σ_v P_0(v) ln P(v)

But now (unlike in earlier examples) there are exponentially many values of v, so we cannot simply have a network output P(v) for all v. Still, let us take the derivative of C with respect to the weights w!
Updating the weights

C = - Σ_v P_0(v) ln P(v)

(∂/∂w_{ij}) ln P(v) = [ (∂/∂w_{ij}) Σ_h P(v, h) ] / [ Σ_h P(v, h) ]

With P(v, h) = e^(-E(v,h))/Z, Z = Σ_{v',h'} e^(-E(v',h')), and ∂E/∂w_{ij} = -v_i h_j, differentiating the numerator and the factor 1/Z gives two terms:

(∂/∂w_{ij}) ln P(v) = [ Σ_h v_i h_j e^(-E(v,h)) ] / [ Σ_h e^(-E(v,h)) ]  -  [ Σ_{v',h'} v'_i h'_j e^(-E(v',h')) ] / Z

Overall:
Σ_v P_0(v) (∂/∂w_{ij}) ln P(v) = Σ_{v,h} v_i h_j P(h|v) P_0(v)  -  Σ_{v',h'} v'_i h'_j P(v', h')
Updating the weights

Σ_v P_0(v) (∂/∂w_{ij}) ln P(v) = Σ_{v,h} v_i h_j P(h|v) P_0(v)  -  Σ_{v',h'} v'_i h'_j P(h'|v') P(v')

First term (easy): draw one training sample v, then do one Markov chain step from v to h; average over all samples v.
Second term (hard): we need to average over the correct distribution P(v) belonging to the Boltzmann machine!
Updating the weights

We could obtain P(v) by running the Markov chain for really long times. Very expensive!

v (training sample) → h → v' → h' → v'' → ...

Rough approximation, used in practice: just take v', h' from the second pair of the chain! [For a better approximation, one can take a pair further down the chain.]

δw_{ij} = η ( ⟨v_i h_j⟩ - ⟨v'_i h'_j⟩ )
(averaged over a batch of training samples v starting the chain)
Updating the weights

δw_{ij} = η ( ⟨v_i h_j⟩ - ⟨v'_i h'_j⟩ )
(averaged over a batch of training samples v starting the chain)

"Contrastive Divergence" (CD) algorithm by G. Hinton

Note: At least we can claim that P_0(v) = P(v) would be a fixed point of this update rule, since then the two averages on the right-hand side yield identical results. Of course, usually the restricted Boltzmann machine will not be able to reach this point, since it cannot represent arbitrary P(v).

The biases are updated in the same way:
δa_i = η ( ⟨v_i⟩ - ⟨v'_i⟩ )
δb_j = η ( ⟨h_j⟩ - ⟨h'_j⟩ )
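A minimal numpy sketch of one CD-1 update following the formulas above (the array shapes, the learning rate, and the use of probabilities in the last half-step are assumptions; v_batch holds a batch of binary training samples, w, a, b are the weights and biases):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v_batch, w, a, b, eta=0.01):
    # v -> h
    p_h = sigmoid(b + v_batch @ w)                       # P(h_j=1|v)
    h = (np.random.rand(*p_h.shape) < p_h) * 1.0
    # h -> v' ("reconstruction") -> h'
    p_v1 = sigmoid(a + h @ w.T)                          # P(v'_i=1|h)
    v1 = (np.random.rand(*p_v1.shape) < p_v1) * 1.0
    p_h1 = sigmoid(b + v1 @ w)
    # contrastive-divergence updates, averaged over the batch
    w += eta * (v_batch.T @ p_h - v1.T @ p_h1) / len(v_batch)
    a += eta * np.mean(v_batch - v1, axis=0)
    b += eta * np.mean(p_h - p_h1, axis=0)
    return w, a, b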
Restricted Boltzmann Machine for MNIST
example from http://deeplearning.net/tutorial/rbm.html
Markov chain steps (1000 steps between each row!)

Each column: a different, independent Markov chain


Restricted Boltzmann Machine for MNIST
example from http://deeplearning.net/tutorial/rbm.html

The learned weights for the 100 hidden units


RBM as a starting point

First train the RBM, then connect the hidden layer to some output layer for supervised learning of a classification task.
Idea: the RBM provides unsupervised learning of important features in the training set ("pre-training").

[Figure: "visible" units v → "hidden" units h → output layer (e.g. softmax)]
Deep belief networks

Stack RBMs: first train a simple RBM, then use its hidden units as input to another RBM, and so on.

visible units v → (first RBM) → hidden units h1 → (second RBM) → hidden units h2 → (third RBM) → hidden units h3

Afterwards, fine-tune the weights, e.g. by supervised learning.
Application to Quantum Physics

[Figure: the "visible" units v are the spins of the quantum model, coupled to a layer of "hidden" units h]

Try to solve a quantum many-body problem (a quantum spin model) using the following variational ansatz for the wave-function amplitudes:

Ψ(S) = Σ_h e^( Σ_j a_j σ_j^z + Σ_i b_i h_i + Σ_{ij} h_i σ_j^z W_{ij} )

Carleo & Troyer, Science 2017

S = (σ_1^z, σ_2^z, ..., σ_N^z): one basis state in the many-body Hilbert space
σ_j^z = ±1, h_i = ±1 (in general, a, b, W may be complex)

This is exactly (proportional to) the RBM representation for P(v) [with v = S]!
Application to Quantum Physics

Minimize the energy ⟨Ψ|Ĥ|Ψ⟩ / ⟨Ψ|Ψ⟩ by adapting the weights W and the biases a and b!
[this requires an additional Monte Carlo simulation, to obtain a stochastic sampling of the gradient with respect to these parameters]

For example: sample the probabilities by using the Metropolis algorithm, with transition probabilities
P(S'←S) = min( 1, |Ψ(S')/Ψ(S)|² )
Application to Quantum Physics
Carleo & Troyer, Science 2017

[Figure: the error decreases with the number of 'filters']

Exploit translational invariance (like in convolutional nets); the weights are "filters" (convolutional kernels).
Find updates on

http://machine-learning-for-physicists.org
