
Machine Learning and Quantum Devices

Florian Marquardt
April 22, 2020

Max Planck Institute for the Science of Light and Friedrich-Alexander Universität Erlangen-Nürnberg, Erlangen, Germany
(Florian.Marquardt@mpl.mpg.de)

Abstract

In these lectures, you will learn the basics of neural networks in a condensed version that should be accessible to an advanced physics student. In the first part, we will cover their basic structure, training using backpropagation, the use of software libraries, image classification, convolutional networks and autoencoders. The second part will be about advanced techniques like reinforcement learning (for discovering control strategies), recurrent neural networks (for analyzing time traces), and Boltzmann machines (for learning probability distributions). In the third lecture, we will discuss first applications to quantum physics that have arisen mainly during the past three years, with an emphasis on applications to quantum machines. Finally, the fourth lecture will be devoted to the promise of using quantum effects to accelerate machine learning. Note: This draft (version from April 22, 2020) is not yet fully finished – some discussions in the part on quantum machine learning are still missing. If you find these notes useful, please cite them as: F. Marquardt, “Machine Learning and Quantum Devices”, to appear in “Quantum Information Machines; Lecture Notes of the Les Houches Summer School 2019”, eds. M. Devoret, B. Huard, and I. Pop (Oxford University Press).

Contents

1 General remarks
2 A Practical Introduction to Neural Networks for Physicists
  2.1 What are artificial neural networks good for?
  2.2 Neural networks as function approximators
  2.3 The layout of a neural network
  2.4 Training: cost function and stochastic gradient descent
  2.5 Backpropagation
  2.6 First examples: function approximation, image labeling, state reconstruction
    2.6.1 Approximating a function
    2.6.2 Image classification
    2.6.3 A first quantum physics example: State reconstruction
  2.7 Making your life easy: modern libraries for neural networks
  2.8 Exploiting translational invariance: convolutional neural networks
  2.9 Unsupervised learning: autoencoders
  2.10 Some warnings for the enthusiastic beginner
3 Advanced Concepts: Reinforcement Learning, Networks with Memory, Boltzmann Machines
  3.1 Discovering strategies: reinforcement learning
    3.1.1 Introduction
    3.1.2 Policy gradient approach
    3.1.3 Extremely simple RL example: training a random walker
    3.1.4 Simple RL example: Walker reaching a target
    3.1.5 Quantum physics RL example
    3.1.6 Q learning
  3.2 Mimicking observed probability distributions: Boltzmann machines
  3.3 Analyzing time traces: recurrent neural networks
4 Applications of Neural Networks and Machine Learning for Quantum Devices
  4.1 Interpreting measurement outcomes
  4.2 Choosing the smartest measurement
  4.3 Discovering better control sequences and designing experimental setups
  4.4 Discovering better quantum feedback strategies
5 Towards Quantum-Enhanced Machine Learning
  5.1 The curse of loading classical data into a quantum machine
  5.2 Quantum Neural Networks
  5.3 The quantum Boltzmann machine
  5.4 The quantum principal component analysis
  5.5 Quantum reinforcement learning

1 General remarks

These lecture notes cover the material of four lectures delivered in Les Houches in the summer of 2019. The emphasis of the first two sections is on teaching the basics and some more advanced concepts of classical machine learning – sometimes illustrated in examples drawn from physics. This part relies on an earlier lecture series that I delivered in the summers of 2017 and 2019, “Machine Learning for Physicists”, at the university in Erlangen, Germany. That course ran a full semester and covered more material, although some specific examples are new to the present notes. The videos and slides for that earlier lecture series are available on the website machine-learning-for-physicists.org. This includes links to python code for some examples.

The third and fourth sections are specifically devoted to applications of machine learning to quantum devices and to quantum machine learning, respectively.

I thank the organizers of this Les Houches school as well as the enthusiastic students. In particular, however, I want to thank my graduate student Thomas Fösel, who helped me set up the first course on this topic in 2017 and whose expertise in machine learning has been of great help.

2 A Practical Introduction to Neural Networks for Physicists

2.1 What are artificial neural networks good for?

During the past few years, artificial neural networks have revolutionised science and technology [20, 13]. They are being used to classify images, to describe those images in full sentences, to translate between languages, to answer questions about a text, to control robots and self-driving cars, and to play complex games at a superhuman level. In science (and specifically in physics), they are being used to predict the properties of materials, to interpret astronomical pictures, to classify phases of matter, to represent quantum wave functions, and to control quantum devices. Many of these developments, especially in physics, have taken place only in the last few years, since about 2016. In the context of machine learning in physics, several good reviews [28, 4, 8, 9, 24, 7] are by now available, documenting the rapidly growing field, both with respect to applications of classical machine learning methods to physics, as well as with respect to the promise of using quantum physics to accelerate machine learning.

The reasons for the recent string of successful applications are not so much conceptual developments (although they are also happening at a rapid pace), but
rather the availability of large amounts of data and of unprecedented computing power (including the use of graphical processing units).

2.2 Neural networks as function approximators

Essentially, neural networks are very powerful general-purpose function approximators that can be trained using many (i.e. at least thousands of) examples.

Let us consider a whole class of functions that has been parametrized:

    y = F_θ(x)    (1)

Below we will see what F_θ looks like specifically for a neural network. Suppose, in addition, we are handed some particular smooth function,

    y = F(x)    (2)

The goal will be to approximate F as well as possible by choosing suitable parameters in F_θ. In the context of neural networks, we are talking about many parameters (hundreds or thousands), θ = (θ_1, θ_2, ...), and typically also of high-dimensional input x and output y.

In a general sense, one can view the training of an artificial neural network as a more advanced example of curve fitting, albeit with thousands of parameters. However, it would be wrong to reduce it only to that description. After all, quantum many-body physics is in principle “only” about a Schrödinger equation in high-dimensional space – but in practice it brings in many new phenomena and requires new solution techniques. The same can be said about neural networks.

In many applications to empirical data, no underlying function F(x) is actually known – the relation between input x and output y is merely specified for a large number of samples, where each sample is given by an input/output combination (x, y).

In principle, the function F_θ in Eq. (1) could be constructed arbitrarily. However, we want to make sure that this representation is (i) scalable and (ii) efficient. Scalability means we need a structure that can easily be scaled up to more parameters (or higher input or output dimensions), if needed. Efficiency relates not only to the evaluation of F_θ(x), but also to the computation of derivatives with respect to the parameters, since that is needed for training (as we will see). Neural networks fulfill both requirements, with their pairwise connections between simple units arranged in a layered structure.

2.3 The layout of a neural network

The basic unit of an artificial neural network is the neuron, which holds a scalar value (a real number). The operation of this neuron is simple (Fig. 1a). Its value y is obtained starting from the values y_k of some other neurons that feed into it, in the following manner: We first calculate a linear function of those values, z = Σ_k w_k y_k + b. The coefficients w_k are called the “weights”, and the offset b is called the “bias”. Afterwards, a nonlinear function f is applied, to yield the neuron’s value, y = f(z). The points in the input space for which z > 0 or z < 0 are separated by a hyperplane z = 0 (Fig. 1c). This arrangement itself already constitutes an elementary neural network, a so-called (single-layer) “perceptron” that was widely investigated in the 1960s for classification tasks before its limitations were fully recognized.

To obtain a nonlinear function F_θ(x) that can truly represent arbitrary functions F(x), multiple layers of neurons are needed. Each neuron receives the values of all the neurons in the preceding layer, with suitable weights.

To keep the notation precise for the multi-layer case, we will now have to introduce extra indices. We denote as y_k^(n) the value of neuron k in layer n. Then the “weight” w_jk^(n+1) tells us how neuron k in layer n will affect neuron j in layer n+1. For any neuron, we thus have the following two equations:

    z_j^(n+1) = Σ_k w_jk^(n+1) y_k^(n) + b_j^(n+1)

    y_j^(n+1) = f(z_j^(n+1))

The constant offset values b_j^(n+1) are called the “biases”. The output of the network is obtained by going through these equations layer by layer, starting at the input layer n = 0, whose neuron values are provided by the user.
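To make these two equations concrete, here is a minimal numpy sketch of the forward pass through such a network. The layer sizes, the random weights and the choice of a sigmoid activation are arbitrary assumptions made purely for this illustration:

    import numpy as np

    def sigmoid(z):
        # activation function f(z) = 1/(1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def layer_step(y_in, w, b):
        # z_j = sum_k w_jk y_k + b_j, followed by y_j = f(z_j)
        z = w @ y_in + b
        return sigmoid(z)

    # example layout: 2 input neurons -> 30 hidden neurons -> 1 output neuron
    rng = np.random.default_rng(0)
    w1, b1 = rng.normal(size=(30, 2)), np.zeros(30)
    w2, b2 = rng.normal(size=(1, 30)), np.zeros(1)

    x = np.array([0.3, -1.2])             # input neuron values y^(0)
    y_hidden = layer_step(x, w1, b1)      # layer n = 1
    y_out = layer_step(y_hidden, w2, b2)  # layer n = 2: the network output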
The computational effort (and memory consumption) scales quadratically with the typical number of neurons in the layer, since there are N^(n+1) N^(n) weights connecting two layers with N^(n+1) and N^(n) neurons. Big neural networks can quickly become memory-intensive.

It is all the weights w and biases b that together form the parameters of the network, and we will collectively call them θ (so θ would be a vector that contains all weights and biases). They will be updated during training.

On the other hand, the nonlinear function f (“activation function”) is usually kept fixed. Popular activation functions are: (i) the “sigmoid”, a smoothened step-function (or inverted Fermi-Dirac distribution), f(x) = 1/(1 + e^(−x)); and (ii) the “ReLU”, an even simpler function that is piecewise linear, f(x) = 0 for x < 0 and f(x) = x for x ≥ 0 (Fig. 1b). In recent times, the ReLU has been used predominantly, since its gradient can be calculated very efficiently and training seems to get stuck less frequently.

Neural networks are very powerful function approximators. It turns out that a single hidden layer with sufficiently many neurons can approximate an arbitrary (smooth) function of several variables to arbitrary precision (a result due to George Cybenko in 1989). Interestingly, practically any nonlinear activation function will do the job, although some may be better for training than others. However, a representation by multiple hidden layers may be more efficient (defining a so-called “deep network”). Fig. 1e illustrates the complex output that can be obtained from a multilayered network whose parameters have been chosen randomly.

Exercise: “approximating an arbitrary 1D function” – Given a smooth function F(x) of one variable, show that it can be approximated to arbitrary accuracy using only a single hidden layer and smooth step functions (sigmoids) for the neurons in this layer. Hint: Think of a piecewise constant approximation to F. How do you have to choose the weights (for the input–hidden and hidden–output connections) and biases in terms of this piecewise approximation? How can you make the sigmoid steps arbitrarily sharp?

Exercise: “the XOR function” – The XOR function F(x_1, x_2) should yield 1 for the cases x_1 = 1, x_2 = 0 and x_1 = 0, x_2 = 1, but 0 for x_1 = x_2 = 0 and x_1 = x_2 = 1 (we do not care about other input values). How can you approximate it using a network with only one hidden layer? This was an important example which could not be solved without any hidden layer (triggering a crisis in the early development of neural networks).

2.4 Training: cost function and stochastic gradient descent

We would like to consider some measure of the deviation between the network output and the function it is trying to approximate. To this end, we will introduce the cost function C. In the simplest case, we might just measure the quadratic deviation between the network’s output F_θ(x) and the true answer F(x). We first define the sample-specific cost function (depending on the specific input x) as

    C_x(θ) = |F_θ(x) − F(x)|²    (3)

Subsequently, we average over all points x – according to a distribution that reflects the likelihood of encountering some x in real data. The averaging will automatically take place during training on the set of given training examples (see below). This yields the cost function itself:

    C(θ) = ⟨C_x(θ)⟩_x    (4)

Throughout, we have made it explicitly clear that the cost function depends on the network’s parameters θ.

In the scientific literature, you will often see machine learning tasks defined in terms of this high-level description, where one writes down a cost function (maybe more complicated than the one given here). For the field of machine learning, specifying a cost function is as essential as it is, for physics, to specify a Hamiltonian or a Lagrangian. It defines the problem to be solved.
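As a small illustration (not part of the original text), the cost of Eqs. (3, 4) can be estimated by averaging over a finite set of samples; the placeholder “network” and target function below are invented purely for the example:

    import numpy as np

    def quadratic_cost(net, x_batch, y_target_batch):
        # C = < |F_theta(x) - F(x)|^2 >, averaged over the samples of the batch
        y_pred = np.array([net(x) for x in x_batch])
        deviations = np.sum((y_pred - y_target_batch) ** 2, axis=-1)  # C_x per sample
        return np.mean(deviations)                                    # batch average

    net = lambda x: 0.5 * x                       # stands in for F_theta(x)
    x_batch = np.random.uniform(-1, 1, size=(100, 2))
    y_target_batch = np.sin(x_batch)              # stands in for the true F(x)
    print(quadratic_cost(net, x_batch, y_target_batch))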
Figure 1: Structure and training of an artificial neural network. (a) Operation of a single neuron. (b) Popular nonlinear activation functions (sigmoid and ReLU). (c) The linear weighted sum of input values, z = Σ_k w_k y_k + b. Applying a sigmoid to z will set to 1 the output for all points in the half-space where z > 0, and yield 0 for all other points (with a smooth transition). (d) Structure of a neural network. (e) Output of a deep neural network with two input neurons (coordinates in the picture) and one output neuron (whose value determines the color), for randomly chosen weights. (f) Learning a scalar function of 2 variables, in this case the color values of a 2D picture, with a deep network. Panels 1, 2, 3 illustrate the training progress. (g) Analyzing the network’s operation: each panel represents the output of a modified network where all but one of the neurons in the last hidden layer have been switched off artificially.
Once the cost function has been defined, the basic idea is to try finding its minimum by gradient descent, in the high-dimensional space of the parameters θ. It is a most remarkable fact that this simple approach to such a complex high-dimensional problem actually works very well in many cases. One of the immediate questions that spring to mind is whether one has to fear getting stuck in a local minimum. For now, let us just say that the problem exists but is not as bad as one might assume. We will come back to this issue further below, at the end of this section.

Let us now discuss how to implement the gradient descent. When you read a research paper, the steps to be explained in the following paragraphs will usually not be mentioned, because they are assumed known. Also, using modern software libraries, you may not even have to implement them yourself. However, it is very important to understand what is going on in practice, “under the hood”, while training a neural network. That is because the great success of artificial neural networks depends crucially on the fact that these steps can be carried out efficiently.

In principle, gradient descent is simple. We just move along the negative gradient of the cost function (which thereby plays the same role as a potential):

    δθ_k = −η ∂C(θ)/∂θ_k    (5)

The parameter η is called the learning rate. If it is too small, learning will proceed slowly, but if it is too large, one may overshoot the optimum. In the limit of small η, it is easy to show that the step of Eq. (5) reduces the value of the cost function: δC = −η (∂C/∂θ)² + O(η²).

There are two immediate challenges connected with this approach: (i) In principle, the cost function is defined as an average over all possible inputs, which is much too expensive to calculate at each step. (ii) The cost function depends on many parameters, and we have to find a way to calculate the gradient efficiently.

The first problem is solved by averaging only over a small number of randomly selected training samples (called a “batch”, or sometimes more precisely a “mini batch”):

    C(θ) ≈ (1/N) Σ_{j=1}^{N} C_{x_j}(θ) ≡ ⟨C_x(θ)⟩_batch

This defines the stochastic gradient descent method:

    δθ_k = −η ⟨∂C_x(θ)/∂θ_k⟩_batch = −η ∂C(θ)/∂θ_k + noise

The basic idea is that the noise averages out after sufficiently many small steps. That works only if η is small enough.

2.5 Backpropagation

We still face the task of calculating the gradient of the cost function with respect to its parameters. Numerical differentiation (which was actually used in the early days of neural network training!) is extremely inefficient due to the large number of parameters. Fortunately, the structure of an artificial neural network allows for a much more efficient approach. The basic idea is very simple: just apply the chain rule! For the quadratic cost function, we have:

    ∂C_x(θ)/∂θ_k = 2 Σ_l ([F_θ(x)]_l − [F(x)]_l) ∂[F_θ(x)]_l/∂θ_k    (6)

Here [F_θ(x)]_l = y_l^(N) is the value of neuron l in the output layer N. The real task is therefore to calculate the gradient of a neuron value with respect to any of the parameters:

    ∂y_l^(n)/∂θ_k = f′(z_l^(n)) ∂z_l^(n)/∂θ_k    (7)

where we have (in the case that θ_k is not among the weights or biases for this layer):

    ∂z_l^(n)/∂θ_k = Σ_m w_lm^(n,n−1) ∂y_m^(n−1)/∂θ_k    (8)

Here we see two things: First, there is obviously a recursive structure. Second, this equation can be viewed as a matrix-vector product.
We now define the following matrix:

    M_lm^(n,n−1) = w_lm^(n,n−1) f′(z_m^(n−1))    (9)

Using the above equations, it is then easy to show that the following relation holds (if θ_k is not among the weights and biases between the layers n and n′):

    ∂z_l^(n)/∂θ_k = [ M^(n,n−1) M^(n−1,n−2) ... M^(n′+1,n′) ∂z^(n′)/∂θ_k ]_l

(a product of matrices applied to a vector).

This leads to the so-called backpropagation algorithm: (a) Initialise the following “deviation vector” at the output layer N: Δ_j = (y_j^(N) − [F(x)]_j) f′(z_j^(N)). (b) For each layer, starting at n = N, store the derivatives with respect to the weights (and biases) at that layer: ∂C_x(θ)/∂θ_k = Δ_j ∂z_j^(n)/∂θ_k, for all θ_k explicitly occurring in z_j^(n). (c) Step down to the next lower layer by setting Δ_j^(new) = Σ_k Δ_k M_kj^(n,n−1). At the end, all of the derivatives will be known.

It is crucial that this algorithm is computationally no more demanding than the so-called forward pass (i.e. the evaluation of the network for a given input)! It also reuses the values obtained for the neuron values during the forward pass. Without the backpropagation algorithm, none of the modern applications of neural networks would have become possible. Implementing it e.g. in python takes no more than a page of code.

We can have the following useful qualitative picture for why the backpropagation algorithm works. Taking the gradients of C amounts to asking for the influence that a small perturbation in one of the weights will have on the cost function. It is therefore similar to calculating a Green’s function (or response function) in physics. We already know that a Green’s function that measures the response of some point f to a perturbation in point i can be decomposed as a sum over all products of Green’s functions that first connect i to some intermediate point j and then j to f – roughly G_fi = Σ_j G_fj G_ji. The same principle is used here, where we multiply the “Green’s functions” of a neural network layer by layer.

2.6 First examples: function approximation, image labeling, state reconstruction

2.6.1 Approximating a function

In a first example, we try to approximate a scalar function of two variables. We choose a network with two input neurons, a set of hidden layers with more neurons each (in this case 150, 150, 100 neurons), and one single output neuron. We use the quadratic cost function.

To make matters more interesting, this “function of two variables” will actually be defined by an image: F(x_1, x_2) will be the gray-scale value of the pixel at location (x_1, x_2). After a sufficient amount of training, we can see how the original image is nicely reproduced to a good degree of approximation (Fig. 1f). If the number of parameters were small enough, the trained network could be viewed as a compressed version of the image (in practice, this is not an efficient algorithm for image compression).

One of the important questions for neural networks is “how does it work”? Sometimes the analysis of the inner workings of a network is referred to as “opening the box”. A simple approach is to artificially modify the network, e.g. by switching off neurons. In Fig. 1g, we illustrate what happens if we switch off all the neurons but one in the last hidden layer. The resulting output reveals that different neurons have learned to encode different parts of the image (e.g. only the outline of the head, or only the eyes, or sometimes everything) – in this example, we also see there is a lot of redundancy. We could probably have reduced the number of neurons and layers significantly.
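The following self-contained numpy sketch ties together the forward pass, the backpropagation of the deviation vector Δ and the stochastic gradient descent update for a small network approximating a scalar function of two variables. The layer sizes, target function, learning rate and batch size are arbitrary choices for the illustration (and the factor 2 of Eq. (6) is absorbed into the learning rate), so treat it as a sketch rather than a reference implementation:

    import numpy as np

    rng = np.random.default_rng(1)

    # network layout (arbitrary example sizes): 2 -> 30 -> 20 -> 1
    sizes = [2, 30, 20, 1]
    weights = [rng.normal(0, 1/np.sqrt(m), size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros((n, 1)) for n in sizes[1:]]

    def f(z):                       # sigmoid activation
        return 1/(1 + np.exp(-z))

    def f_prime(z):
        s = f(z)
        return s*(1 - s)

    def forward(x):
        # x has shape (2, batchsize); returns z and y values of all layers
        zs, ys = [], [x]
        for w, b in zip(weights, biases):
            z = w @ ys[-1] + b
            zs.append(z)
            ys.append(f(z))
        return zs, ys

    def backprop(x, y_target):
        zs, ys = forward(x)
        batchsize = x.shape[1]
        # (a) deviation vector at the output layer: Delta = (y - F(x)) f'(z)
        delta = (ys[-1] - y_target) * f_prime(zs[-1])
        grads_w, grads_b = [], []
        for n in reversed(range(len(weights))):
            # (b) derivatives with respect to weights and biases of this layer
            grads_w.insert(0, (delta @ ys[n].T) / batchsize)
            grads_b.insert(0, np.mean(delta, axis=1, keepdims=True))
            if n > 0:
                # (c) step down one layer: Delta_new = (w^T Delta) f'(z)
                delta = (weights[n].T @ delta) * f_prime(zs[n-1])
        return grads_w, grads_b

    def target(x):                  # the function to be approximated (arbitrary example)
        return np.sin(3*x[0:1]) * np.cos(2*x[1:2])

    eta = 0.5
    for step in range(2000):                    # stochastic gradient descent
        x = rng.uniform(-1, 1, size=(2, 64))    # always a fresh random batch
        gw, gb = backprop(x, target(x))
        weights = [w - eta*g for w, g in zip(weights, gw)]
        biases = [b - eta*g for b, g in zip(biases, gb)]

The loop over layers reuses the Δ of the layer above, exactly as in steps (a) to (c) described in the text, so the cost of one gradient evaluation is comparable to one forward pass.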
2.6.2 Image classification

One of the main applications of neural networks is image classification. Given an image, the network is asked to label it (e.g. as a “giraffe” or a “dolphin” etc.). In 2012, a deep neural network was able to beat
all other algorithms in the so-called “ImageNet” competition, and since then such networks have surpassed even humans in their accuracy to properly recognize and label images.

The input layer contains as many neurons as there are pixels in the image, with the neurons set to the pixels’ brightness values. In the output layer of an image classification network, each neuron is responsible for a different category (label). It is supposed to represent the likelihood that the input image falls into that category (Fig. 2a). To obtain a suitably normalized distribution of output values, one uses the so-called “softmax” activation function. Suppose we have already calculated the values z_j for the output layer from the linear superpositions of the previous layer’s values. Then the new value y_j of output neuron j is defined in the following manner (which depends on all the other neurons, in contrast to what we encountered before):

    y_j = f_j(z_1, ..., z_M) = e^(z_j) / Σ_{k=1}^{M} e^(z_k)    (10)

This is automatically normalized and non-negative. It can be viewed as a multi-dimensional generalization of the sigmoid.

The “correct” output distribution has a value of 1 in a single spot, for the neuron that corresponds to the correct label for the given image. All other values are zero. This is also known as a “one-hot encoding”.

How should we define the cost function? In principle, we could take the quadratic deviation between the one-hot encoding and the network output. However, we are essentially comparing two probability distributions (the “true”, one-hot distribution, and the network output). For that purpose, there exists an alternative, the so-called categorical cross-entropy

    C = − Σ_j P_j^target ln P_j    (11)

You can show that minimizing this with respect to P_j will yield P_j = P_j^target. In our case, P_j^target is the one-hot correct label (= 1 for exactly one of the j), and P_j = y_j^(N) are the output neuron values. This cost function is preferable, since its gradients are less likely to become small.

A well-known test-case for image labeling is the MNIST dataset, where more than 50000 images (28 × 28 pixels) of handwritten digits have been labeled according to the digit they represent. More complex datasets are also available freely for download, such as the ImageNet dataset.

Training on the MNIST set with a modest neural network (with 28² = 784 input neurons, 30 hidden neurons, and 10 output neurons) already can yield a nice performance, with an error rate of about 3%. However, one has to take care: in a naive approach, the accuracy on the training data is getting ever better while one is repeatedly going through the same set of training samples. At the same time, the accuracy on test data (that the network has never seen) actually decreases again after some point. The reason is an important and well-known problem: so-called “overfitting”. The network essentially memorizes the training examples so well that it becomes very good on them, paying attention to the slightest details. However, for examples it has never seen, this reduces its performance, because it can no longer properly generalize. There are several solutions to this. First, one may artificially augment the training data, generating new samples e.g. by rotating and scaling images (since the labels typically should not change under these operations). Another solution is to keep a small set of samples separate as so-called “validation data”, and to constantly monitor the network’s performance on these samples during training. Once the performance starts to decrease again, one has to stop (“early stopping”). Another, very powerful solution is to introduce noise. The noise prevents the network from overfitting if it is sufficiently strong. In practice, this means randomly switching off a small fraction of neurons during training (or multiplying them with random Gaussian values). That strategy, which was invented only rather recently, is called “drop-out”.

Importantly, if you generate your data always fresh (i.e. the network never sees a sample twice), then there is no danger of overfitting. This will be feasible e.g. for random training samples generated through simulations or experiments, provided that each simulation or experimental run is not too costly.
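A minimal numpy sketch of the softmax of Eq. (10) and the categorical cross-entropy of Eq. (11) for a batch of one-hot targets may be helpful; the batch size and the number of categories below are arbitrary assumptions:

    import numpy as np

    def softmax(z):
        # z has shape (batchsize, M); subtracting the maximum avoids overflow
        expz = np.exp(z - np.max(z, axis=1, keepdims=True))
        return expz / np.sum(expz, axis=1, keepdims=True)

    def categorical_crossentropy(p_target, p):
        # C = - sum_j P_j^target ln P_j, averaged over the batch
        return -np.mean(np.sum(p_target * np.log(p + 1e-12), axis=1))

    # example: 5 samples, 10 categories (as for the MNIST digits)
    z = np.random.normal(size=(5, 10))      # output-layer z values
    p = softmax(z)
    labels = np.array([3, 1, 4, 1, 5])      # correct digits
    p_target = np.eye(10)[labels]           # one-hot encoding
    print(categorical_crossentropy(p_target, p))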
Figure 2: (a) Image classification, where the output neurons signify different labels. (b) A one-dimensional convolutional neural network, where the value of a neuron only depends on some ’nearby’ neurons in a lower layer, in a translationally invariant way. (c) For 2D images, application of such filters can extract features like contours (the filter is shown as an inset). (d) A full-fledged CNN, with convolutional steps, several channels (indicated as multiple images overlaid over each other), subsampling, and finally a transition to densely connected layers. (e) An autoencoder tries to reconstruct the input after having compressed the information into a few latent variables inside a bottleneck layer. (f) After training, the ’encoder’ part of an autoencoder can be repurposed for a specific classification task. (g) A fully linear autoencoder will find a projection onto the most important principal components of the set of input vectors. (h) The six most important principal components of the MNIST handwritten digits images.
2.6.3 A first quantum physics example: State reconstruction

Let us try to come up with a relatively simple but useful example in quantum physics. We could, for example, teach a neural network to time-evolve a quantum state, where the input would be a quantum state at time zero (e.g. for a discrete basis: Ψ_j(t = 0)) and the output during training would be set to the time-evolved state Ψ_j(t) at some later time t. If training is carried out on many randomly chosen states Ψ, this would teach the network effectively to implement the unitary time-evolution operator e^(−iĤt). However, since that operator is linear, the network itself would not need any nonlinearity and the example is therefore maybe a bit too trivial. A slightly more interesting variant would be to provide not the state but various expectation values of observables and to ask for their time evolution. However, since these are linear in the density matrix and that also evolves linearly, it is still not a challenging problem.

Consider another problem, that of quantum state reconstruction. Given several identical copies of a quantum state |Ψ⟩, and a set of projective measurements on those copies, try to figure out the state from the measurement results.

Let us turn this into a simple challenge for a neural network, where it will be the network’s task to provide us with an estimate of the quantum state based on the measurement outcomes. To make things concrete, we imagine measuring copies of a qubit state in several basis directions (denoted by projectors P̂_1, P̂_2, ..., P̂_M). These directions have been fixed (by us) beforehand, e.g. we might measure a few times along the z-axis and a few times along the x-axis etc. For any given quantum state and any given “experimental run”, that procedure yields a string of M random measurement results x_1, x_2, ..., where x_j = 1 with probability p_j = ⟨Ψ|P̂_j|Ψ⟩ and x_j = 0 otherwise. These will be the input to the network. We will then ask the network to provide us with an “estimate of the state”. This is best done by asking it to output the density matrix, which for a qubit can be represented as a three-dimensional real-valued Bloch vector, y⃗ = ⟨Ψ|σ̂⃗|Ψ⟩.

So far, so good. Now we have to make an important choice: how do we set up the cost function, i.e. how do we punish the network if it deviates from the real state? After all, even the best network will not be able to guess the state perfectly, since it has to rely only on a limited set of binary measurement outcomes (whereas the space of all states is continuous). It makes sense to demand the network’s output to be as close as possible to the real Bloch vector, since that ensures the predictions for all the three qubit observables are as correct as possible. Let us here choose the simple quadratic deviation. You should be aware, however, that the network might sometimes output unphysical states, i.e. Bloch vectors of magnitude larger than 1. If we want to avoid this at all costs, we should correspondingly modify the cost function, to yield very large (infinite) values outside the physical range. In this simple example, we will not go that far.

How do we choose the quantum states during training? This choice is very important, since it will determine the network’s responses. If we only ever were to show states during training that are either pointing up or down in the z-direction, the network will learn this and also assume any other state will be of this kind. Let us therefore choose states with Bloch vectors uniformly distributed on the Bloch sphere.

In fact, the network’s task is related to Bayes reasoning. Given the a-priori probability of having certain states (here defined by the distribution of training samples!), and given the observed measurement outcomes, what is the distribution of likely states after the extra information produced by the measurement has been taken into account? Since we are, however, only asking for a single state (Bloch vector), the network should try to pick the state that minimizes the cost function under the new probability distribution that has been obtained using the Bayes rule. This distribution over states Ψ, given the measurement outcomes x, is:

    P_new(Ψ) = P(x|Ψ) P_prior(Ψ) / ∫ P(x|Ψ′) P_prior(Ψ′) dΨ′    (12)

The denominator involves averaging over all states Ψ′ according to the prior distribution. Evaluating this and then finding the optimal choice for the predicted Bloch vector is not a trivial task, and the network (in order to be optimal) would have to discover all of this just from the training samples.
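As a sketch of how training data for this task could be generated by simulation, one might draw Bloch vectors uniformly from the Bloch sphere and sample the binary outcomes; the number and orientation of the measurement axes and the batch size below are arbitrary assumptions, and the outcome probability assumes a projector onto spin-up along axis n̂_j, for which ⟨P̂_j⟩ = (1 + n̂_j · r⃗)/2:

    import numpy as np

    rng = np.random.default_rng(0)

    def random_bloch_vectors(batchsize):
        # uniform distribution on the Bloch sphere
        v = rng.normal(size=(batchsize, 3))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    # measurement axes (here: a few repetitions each of z and x), fixed beforehand
    axes = np.array([[0, 0, 1]]*4 + [[1, 0, 0]]*4, dtype=float)   # shape (M, 3)

    def simulate_outcomes(bloch):
        # x_j = 1 with probability p_j = (1 + axis_j . bloch)/2
        p = 0.5 * (1 + bloch @ axes.T)            # shape (batchsize, M)
        return (rng.uniform(size=p.shape) < p).astype(float)

    bloch = random_bloch_vectors(256)    # targets for the network: true Bloch vectors
    x_in = simulate_outcomes(bloch)      # binary measurement strings: network input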
2.7 Making your life easy: modern libraries for neural networks

Even just a few years ago, one might have implemented the neural network code oneself. While this is fine for the basics, it can get cumbersome when trying to implement the latest advances. Nowadays, the situation has changed dramatically. There are all kinds of libraries that simplify the implementation considerably. These include libraries like tensorflow, PyTorch, and others. Here we will illustrate the power of these approaches using keras. This is a widely used high-level python library that interfaces with lower-level libraries like tensorflow (and is actually automatically included in any installation of tensorflow nowadays; you can import commands from tensorflow.keras).

This is all it takes in keras to produce a network with two hidden layers (layer sizes, from input to output, are 2, 30, 20, 1):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    net = Sequential()
    net.add(Dense(30, input_shape=(2,), activation='relu'))
    net.add(Dense(20, activation='relu'))
    net.add(Dense(1, activation='linear'))

Nothing more is needed (apart from the two import lines)! The first line after the imports creates a new network called ’net’, to which layers are then added one by one. “Sequential” refers to the standard layered network layout we have been discussing without exception, though keras also can be used to produce more advanced designs, where the network forks into branches or there are connections between layers further apart. “Dense” means densely connected layers, in contrast e.g. to convolutional layers that we will discuss below. We have set the activation functions to ReLU for the two hidden layers, but linear (no activation function) for the output layer. Any combination is possible, including also ’sigmoid’. The softmax function mentioned above would be indicated as ’softmax’.

We should still specify the cost function (called “loss” in keras). This is done in a step labeled ’compilation’. It is here that the gradients are produced using symbolic differentiation, to be used during training. Moreover, in this step the network’s weights are initialized randomly. Let us just set up the simplest kind of cost function, the usual quadratic deviation between training samples and network output:

    net.compile(loss='mean_squared_error',
                optimizer='adam')

One alternative loss would have been ’categorical_crossentropy’ (see above). In the second line, we have also selected a so-called ’optimizer’. This refers to the method used during gradient descent. Instead of the simple stochastic gradient descent introduced above, we have chosen a more modern advanced technique called ’adam’. This is an adaptive scheme, where essentially the learning rate for each parameter is chosen automatically to provide for faster convergence. It is one of the most popular choices right now, though there may be occasional cases where it is less stable than the standard stochastic gradient descent.

Training data will be provided in the form of an array of inputs x (of dimension batchsize × M_in, where M_in is the number of input neurons) and outputs y (of dimension batchsize × M_out). A single line runs through the whole batch and updates the network’s parameters:

    net.train_on_batch(x, y)

This returns the current value of the cost function. Repeated application (preferably to randomly selected fresh data) will be needed to train the network. Finally, to evaluate the network on any given data x, we would write

    y = net.predict_on_batch(x)

Now y[j, :] will contain the output vector obtained for the j-th sample in the batch, x[j, :]. If you think of 2D data (like pixels in an image), use numpy’s flatten and reshape commands to convert into (and back from) 1D vectors.
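Putting the snippets above together, a complete minimal keras training script for the function-approximation task of section 2.6.1 could look as follows; the target function, batch size and number of training steps are arbitrary choices made for the illustration:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    def F(x):
        # the "true" function to be approximated (arbitrary example)
        return np.sin(3*x[:, 0:1]) * np.cos(2*x[:, 1:2])

    net = Sequential()
    net.add(Dense(30, input_shape=(2,), activation='relu'))
    net.add(Dense(20, activation='relu'))
    net.add(Dense(1, activation='linear'))
    net.compile(loss='mean_squared_error', optimizer='adam')

    for step in range(2000):
        x = np.random.uniform(-1, 1, size=(64, 2))   # always fresh random samples
        cost = net.train_on_batch(x, F(x))

    x_test = np.random.uniform(-1, 1, size=(10, 2))
    print(net.predict_on_batch(x_test))

Because every batch is freshly generated, there is no danger of overfitting in this setup.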
2.8 Exploiting translational invariance: convolutional neural networks

Often, the meaning of an image does not depend on translations – e.g. when a handwritten letter is shifted a bit. In other words, we are facing a problem with translational invariance, similar to what is often the case in physics. In a physics scenario of this kind, the response A(x) at point x to a perturbation at x′ only depends on the displacement from that point. Mathematically, this is represented by a convolution: A(x) = ∫ G(x − x′) F(x′) dx′.

Convolutional neural networks (CNNs) exploit translational invariance by restricting significantly the structure of the neural network weights. The weights now only depend on the distance. Let us first consider the 1D situation (Fig. 2b):

    w_ij^(n+1,n) = w^(n+1,n)(i − j)    (13)

The function w^(n+1,n)(i − j) would be called the ’kernel’ or the ’filter’. It is cut off beyond a certain distance, i.e. set to zero for |i − j| > d. Most importantly, you should note a tremendous reduction in the amount of weights that have to be stored and updated during training. We switched from M² weights (if both layers have M neurons) to a fixed number 2d + 1 that does not even depend on the size of the layer! The standard network layout we have discussed above is referred to as ’densely connected’ layers, in contrast to CNNs, which have a sparse weight matrix.

In 2D, each neuron sits at a specific pixel in an image, so we could label it by two discrete coordinates, i = (i_x, i_y). Then the kernel depends on i_x − j_x and i_y − j_y, i.e. it is given as a small 2D array of dimensions (2d + 1) × (2d + 1).

The linear part of such a network’s operation corresponds exactly to what happens in a photo-editing program when applying linear filters: z_i^(n+1) = Σ_j w^(n+1,n)(i − j) y_j^(n) + b^(n+1). These can be used to smoothen an image or to highlight contours (Fig. 2c). The difference here though is not only the subsequent application of a nonlinear activation function, but more importantly the fact that the CNN filters will be learned automatically during training.

One of the amazing insights that CNNs have provided is the connection to the visual cortex in the brain. It turns out that in our brain the lowest layers, right after the retina, effectively implement filters that detect edges and the orientation of those edges. The same functionality arises in deep CNNs during training, practically irrespective of the task for which they have been trained (task-specific details emerge in higher layers).

It is deep CNNs that underlie the success of neural networks at image classification tasks. In such applications, two additional features are implemented: channels and subsampling. Often, an image already has several color channels. In addition, in higher layers, it becomes useful to store different features in separate channels (Fig. 2d). This requires introduction of an extra channel index c, such that each neuron is now labeled in the form (i, c). Then, we would have:

    z_(i,c)^(n+1) = Σ_{j,c′} w_cc′^(n+1,n)(i − j) y_(j,c′)^(n) + b_c^(n+1)    (14)

If you think about this, it is a hybrid between the operation of densely connected layers (with respect to the channel indices) and single-channel CNNs.

Often, it is useful to reduce the resolution when passing towards higher layers. For example, one may just subdivide an image into 3 × 3 patches and replace each of those with a single pixel whose value is the average (similar to block decimation in the real-space renormalization group). This is called subsampling. Finally, to carry out the actual classification of an image, at some late stage one may switch from CNN back to densely connected layers. This can simply be done by taking all the neurons in all channels and arranging them back into one big vector (a so-called “flattening” operation).

In a framework such as keras, setting up a full-fledged CNN requires only a few lines (try Conv2D instead of Dense, AveragePooling2D for subsampling, and Flatten for the transition to dense layers).
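As an illustration of that remark, a small convolutional network might be set up in keras as sketched below; the image size, channel numbers, filter sizes and number of output categories are invented for the example:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

    net = Sequential()
    # convolutional layers: translationally invariant filters, several channels
    net.add(Conv2D(8, kernel_size=5, activation='relu',
                   padding='same', input_shape=(28, 28, 1)))
    net.add(AveragePooling2D(pool_size=2))      # subsampling
    net.add(Conv2D(16, kernel_size=5, activation='relu', padding='same'))
    net.add(AveragePooling2D(pool_size=2))
    net.add(Flatten())                          # transition to dense layers
    net.add(Dense(10, activation='softmax'))    # e.g. one output per digit class
    net.compile(loss='categorical_crossentropy', optimizer='adam')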
2.9 Unsupervised learning: autoencoders

Up to now, we have dealt with what is called “supervised learning”: when providing the training samples, both the input and the desired correct output have to be specified.

However, suppose you just have a large amount of data and you do not yet know whether there are any specific patterns to be discovered in this data. One example may be a large set of unlabeled images. Can a network discover on its own that these images represent different categories (e.g. “cats”, “dogs”, and “birds”)?

It turns out that there is a surprisingly simple way to force a network to develop an understanding of the most important features of a set of unlabeled training data. The basic idea is to require the network’s output to (approximately) reproduce its input: y = F_θ(x) ≈ x. This seems a trivial task, until you learn about an extra requirement. One of the layers in the middle of the network has very few neurons, much less than the number of input and output neurons. This layer is sometimes referred to as a “bottleneck”, through which the information has to pass. In order to succeed at this task, the network has to compress or encode the relevant features of the input, pass it through the bottleneck, and then reconstruct the input based on that limited amount of information. This can only work well if the whole set of all inputs is highly structured (it could never work, e.g., if the inputs are completely random vectors).

Such a network is called an “autoencoder” (Fig. 2e). The layers below the bottleneck form an “encoder”, while the layers afterwards form a “decoder”. The neurons in the bottleneck layer are called “latent variables”. They represent the main features that the network has learned in this unsupervised fashion.

Note that the autoencoder structure can be used for different tasks than unsupervised feature extraction. For example, it can be trained to denoise images: One can feed as input an image that has had noise added to it artificially, while the output is still the clean image. The network will learn (as well as possible) to get rid of the noise, and this will work eventually for noisy images it has never seen before during training (provided the noise is similar in structure to what it has encountered before). The same works for partially occluded images. Another nice example is colorization of images: For training, a large number of color images are obtained, but the input is always taken to be the gray-scale version of the image. The network will learn to automatically fill in the colors. In this way, black-and-white movies can be turned into realistically colored movies (although there is no guarantee that the colors are indeed correct, because sometimes there are simply several equally plausible options).

Once an autoencoder has been trained, it can easily be used as the basis for solving another task, e.g. classification. The idea is to re-use the already trained encoder layers, and then add one or more layers on top of it. During subsequent training, only these additional layers need be trained, and they will much more quickly converge to a good solution, since the basic features have already been extracted in the encoding stage (Fig. 2f).

Example: Linear autoencoder (principal component analysis) – We now turn briefly to the simplest possible autoencoder: A single hidden bottleneck layer and no activation functions – i.e. a purely linear network! What are the weights that it will find? As you will see, this is a highly instructive example.

To keep the following discussion simple, let us also assume the input vectors have zero mean, ⟨x⟩ = 0, in which case we will not need biases in the network. Then the value of neuron j in the hidden layer is

    y_j^(1) = Σ_k w_jk^(1,0) x_k    (15)

This can be interpreted as a projection of the input vectors onto some other set of vectors |v_j⟩, modulo normalization, of the kind w_jk^(1,0) = ⟨v_j|k⟩. In the output layer, we have y_l^(2) = Σ_j w_lj^(2,1) y_j^(1). This should be approximately equal to x. In summary, we want to minimize:

    C = ⟨|x − w^(2,1) w^(1,0) x|²⟩ = ⟨|x − w̃ w x|²⟩    (16)
Here the matrix w̃ ≡ w^(2,1) is of size M_out × M_hidden, and w ≡ w^(1,0) is of size M_hidden × M_out. In other words, A = w̃w must be “as close as possible to the identity”, even though it is a matrix of rank at most M_hidden.

Minimizing C with respect to A is a well-defined problem. The essential quantity in carrying out the average will be the correlation matrix of the input vectors:

    ρ_lj = ⟨x_l x_j⟩    (17)

(in a physics analogy, think of the density matrix, ρ_lj = ⟨Ψ_l Ψ_j*⟩_Ψ). Using this definition, we can rewrite Eq. (16) as

    C = tr[ρ − 2Aρ + AᵗAρ]    (18)

We now choose to write the trace in the basis of eigenvectors of the symmetric matrix ρ:

    C = Σ_j (ρ_jj − 2A_jj ρ_jj) + Σ_{j,k} A_kj² ρ_jj    (19)

To minimize this over an arbitrary A, we should choose A_kj = 0 for all k ≠ j, and make Σ_j A_jj (2 − A_jj) ρ_jj as large as possible. Each term in this sum would have its maximum at A_jj = 1 (since ρ_jj > 0). However, since A has at most rank M_hidden, we can only choose that many diagonal elements A_jj to be nonzero (and equal to 1). We obviously have to choose those for which the eigenvalues ρ_jj are maximum.

In other words, this linear autoencoder has to implement the projector onto the subspace that contains the eigenvectors with the largest eigenvalues of the correlation matrix ρ of the inputs. Calling these (orthonormal) eigenvectors |v_j⟩, we have for the output of this network:

    |y⟩ = Σ_{j=1}^{M_hidden} |v_j⟩⟨v_j|x⟩    (20)

where the eigenvectors have been ordered, with the largest eigenvalues first.

In data science, the decomposition of the correlation matrix of inputs into its eigenvectors is known as “principal component analysis”. Strictly speaking, the latent variable neurons of this linear autoencoder need not correspond to the projections onto individual eigenvectors (the overall operation of the network only needs to implement the subspace projector). However, if desired, this can be “fixed” by demanding that there are no correlations between latent variables, ⟨y_j^(1) y_k^(1)⟩ = 0 for j ≠ k. Such a constraint can be added to the cost function, for example in the form C_new = C_old + Σ_{j≠k} ⟨y_j^(1) y_k^(1)⟩²_batch. This kind of requirement can also be useful in the context of arbitrary (nonlinear) autoencoders.

Exercise: Denoising autoencoder – Train a neural network to get rid of noise in images that show a randomly placed circle of random size. Use convolutional layers with downsampling for the encoder, and convolutional layers with upsampling for the decoder (use UpSampling2D). Generate random training images and feed a noisy version of each image into the autoencoder as input, while defining the original image as the target. Vary the challenge by producing other sorts of training images, with more complicated shapes! Instead of simple noise, try to obscure the original image by deleting pieces (e.g. setting all pixels to zero in randomly chosen small squares).
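A possible starting point for this exercise is sketched below (image size, network depth, channel numbers and noise level are arbitrary assumptions); it follows the suggested encoder/decoder layout with convolutions plus AveragePooling2D for the encoder and convolutions plus UpSampling2D for the decoder:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, AveragePooling2D, UpSampling2D

    net = Sequential()
    # encoder: reduce resolution while extracting features
    net.add(Conv2D(8, kernel_size=3, activation='relu', padding='same',
                   input_shape=(32, 32, 1)))
    net.add(AveragePooling2D(pool_size=2))
    net.add(Conv2D(16, kernel_size=3, activation='relu', padding='same'))
    net.add(AveragePooling2D(pool_size=2))
    # decoder: increase the resolution again
    net.add(UpSampling2D(size=2))
    net.add(Conv2D(8, kernel_size=3, activation='relu', padding='same'))
    net.add(UpSampling2D(size=2))
    net.add(Conv2D(1, kernel_size=3, activation='sigmoid', padding='same'))
    net.compile(loss='mean_squared_error', optimizer='adam')

    def random_circle_images(batchsize, size=32):
        # images showing a randomly placed circle of random size
        x, y = np.meshgrid(np.arange(size), np.arange(size))
        imgs = np.zeros((batchsize, size, size, 1))
        for j in range(batchsize):
            cx, cy = np.random.uniform(5, size-5, size=2)
            r = np.random.uniform(3, 8)
            imgs[j, :, :, 0] = ((x-cx)**2 + (y-cy)**2 < r**2).astype(float)
        return imgs

    for step in range(1000):
        clean = random_circle_images(32)
        noisy = clean + 0.3*np.random.normal(size=clean.shape)
        net.train_on_batch(noisy, clean)     # target: the original clean image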
2.10 Some warnings for the enthusiastic beginner

After witnessing the impressive success of artificial neural networks, it is tempting to become a bit too enthusiastic. You should realize that there are several challenges:

• Training a neural network is a highly nonlinear and stochastic process (not well-understood theoretically). Training several times from scratch on the same training data (starting from random weights) will usually result in networks with different weights, even though their performance may be similar.

• Results depend strongly on the quantity and quality of training data.
• Applying a neural network (or more generally machine learning techniques) to data is no substitute for basic understanding.

• Interpretation of the results requires care. A neural network is like a black box, and extra effort is needed to understand its inner workings.

3 Advanced Concepts: Reinforcement Learning, Networks with Memory, Boltzmann Machines

3.1 Discovering strategies: reinforcement learning

3.1.1 Introduction

So far, we have dealt with a simple scenario: a neural network is shown many training examples, where the correct answer (e.g. the correct label for an image) is already known. That scenario is known as “supervised learning”.

In a sense, this describes a knowledgeable teacher training a student, but in a simplistic way. The student essentially learns to imitate the teacher’s answers. In the best case, the student may be able to extrapolate from these examples in a modest way, but it will likely never surpass its teacher in any substantial aspect. The power of the approach comes from the student being infinitely diligent and patient, but not from any creativity.

In contrast, let us consider what we expect from a really talented student or from a scientist. We would hope that they are able to discover good novel solutions to problems on their own, without having been provided with answers to a large range of rather similar training problems. In real life, this creative approach to problem-solving requires the following. First, there is a lot of trial and error. Second, once we stumble on a good solution, we have to be able to recognize it as such. That means there should be a criterion to decide whether one solution is better than another. Third, if we have discovered a set of good solutions, we may want to recombine them in novel ways, to find even better solutions.

In the field of machine learning, this approach is known as “reinforcement learning” (RL, for short). It represents the most promising approach to future general artificial intelligence [27], especially when combined with deep neural networks.

The general RL setting is a control problem. Imagine a robot that interacts with the world around it. In RL language, the robot is an “agent”, and the world around it is the “environment”. The robot can manipulate objects in the world and move around. It can also observe its environment and choose its subsequent actions depending on the observations – this is an example of feedback control. The situation is displayed in Fig. 3.

The mapping from the observed state of the environment to the next action is called “policy”. The policy effectively defines the strategy that the robot implements.

3.1.2 Policy gradient approach

To make things concrete, we will now describe one of the oldest RL approaches, the so-called “policy gradient” method – even nowadays this is one of the most powerful techniques. Here, the policy is probabilistic. Let us imagine time t is discrete, and in each time step an observation is taken and a next action is chosen. If the observed state of the environment at time t is s_t, then the policy is a probability distribution over all possible actions a_t given that state:

    π_θ(a_t|s_t)

The actual next action will be chosen randomly according to this distribution. We typically have in mind a discrete set of actions a_t. For example, if the robot can move around on a grid, a_t = N, S, W, E might indicate motion by one “step” into the corresponding direction.

The subscript θ for the policy indicates that the policy depends on a set of parameters θ. These will be updated during training. In the advanced cases we are interested in, the policy will be represented by a neural network, and θ will be its parameters (weights and biases).
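In the simplest neural-network representation of such a policy, the observed state is fed into a network whose softmax output provides the probabilities π_θ(a|s) over a discrete set of actions. The sketch below (with an invented observation dimension and number of actions) shows this layout and how an action would be sampled from it:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    num_actions = 4                     # e.g. N, S, W, E on a grid
    policy = Sequential()
    policy.add(Dense(32, input_shape=(5,), activation='relu'))   # 5 observed quantities (example)
    policy.add(Dense(num_actions, activation='softmax'))         # pi_theta(a|s)

    def choose_action(state):
        probs = np.asarray(policy.predict_on_batch(state[None, :]))[0]
        probs = probs / probs.sum()     # guard against rounding errors
        return np.random.choice(num_actions, p=probs)

    state = np.random.uniform(size=5)   # placeholder observation
    action = choose_action(state)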
Figure 3: Reinforcement learning. (a) A robot roaming around a grid world and trying to pick up boxes is one simple example of a reinforcement learning problem. (b) The general scheme: In each time step, the observed state s_t of the environment is used to choose the next action a_t, according to the agent’s policy, represented by the probability π_θ(a_t|s_t). (c) Basic principle of policy gradient. Given a certain action sequence, the action probabilities for all the actions involved in this particular sequence will be increased if the reward turns out to be high.

We still have to define the goal of this game. This is done via “rewards”. At each time step, a reward r_t is provided, depending on the state s_t and the action a_t that was taken. For the robot, we might want it to pick up boxes, and therefore assign a reward r_t = +1 for each time it picks up a box. The sum of all rewards in a given time interval then will be equal to the total number of boxes that have been picked up. This sum of rewards is called the “return” R (used in the sense of “return on investment”):

    R = Σ_{t=1}^{T} r_t    (21)

Since the policy is probabilistic, and also the environment may have stochastic dynamics, the return R will fluctuate from run to run, even if the policy is kept fixed. We are interested in the expectation value of the return, averaged over all runs (or “trajectories”):

    E[R] = Σ_τ p_θ(τ) R(τ)    (22)

Here we have introduced τ as a label for a trajectory: τ = (s_1, a_1, s_2, a_2, ..., s_T, a_T). The probability p_θ(τ) of observing this trajectory depends on the policy. We will now assume the environment can be modeled as a Markov process, where the next state s_{t+1} only depends on the current state and the current action, and there is a transition probability P(s_{t+1}|a_t, s_t). Then, the trajectory’s probability can be factorized in the following manner:

    p_θ(τ) = Π_{t=1}^{T} P(s_{t+1}|a_t, s_t) π_θ(a_t|s_t)    (23)

This is a string of conditional probabilities, alternating between the action choices of the agent and the transitions of the environment (for the purposes of this expression we may set P(s_{T+1}|a_T, s_T) = 1). Note that the assumption of a Markovian environment is much less restrictive than it may sound at first: We can have arbitrarily complicated dynamics if the total number of degrees of freedom in the state space is large enough. The policy only depends on the observable degrees of freedom (i.e. π_θ(a_t|s_t) = π_θ(a_t|s′_t) if the states s_t and s′_t coincide in the observed quantities). These may be only a small subset and their dynamics can be non-Markovian, since it is driven by the unobserved parts of the environment –
the usual reason for having non-Markovian dynamics in nature.

The basic idea of the policy gradient approach is to optimize the expected return via gradient ascent with respect to the policy parameters θ:

    δθ = +η ∂E[R]/∂θ    (24)

This is symbolic notation: More precisely, θ is a whole vector containing all parameters, and this equation should be read as δθ_j = +η ∂E[R]/∂θ_j, for all j. One of the most important features of Eq. (23) is that the dependence on the policy θ does not enter the environment’s transition probabilities P. This will enable us to take the gradient with respect to θ without actually having any explicit knowledge of P – it would be very hard for most real environments to construct a detailed model for their dynamics. Since the RL approach is independent of having such a model, it is called “model-free”. That sets it apart from other numerical methods for optimizing control, such as GRAPE (used for quantum control, with an explicit model for the Hamiltonian). This is the reason we cannot easily use a deterministic policy, because there the effect of any change in the policy at time t would affect the subsequent environment dynamics, and to understand the consequences for the return, we would have to differentiate through the unknown environment dynamics.

We now evaluate the gradient. First, we note that the gradient of p_θ(τ) can be written in the following way:

    ∂p_θ(τ)/∂θ = [ Π_{t′} P(s_{t′+1}|a_{t′}, s_{t′}) π_θ(a_{t′}|s_{t′}) ] Σ_t ∂_θ π_θ(a_t|s_t) / π_θ(a_t|s_t)    (25)

This comes about because taking the gradient of a product means differentiating each factor separately and then adding up the results (take a moment to understand it). We have already re-arranged terms such that it becomes obvious this can be further simplified to

    ∂p_θ(τ)/∂θ = p_θ(τ) ∂/∂θ Σ_t ln π_θ(a_t|s_t).    (26)

Overall, after inserting back into Eq. (22), we obtain a surprisingly simple expression:

    δθ = η ∂E[R]/∂θ = η E[ R(τ) Σ_t ∂/∂θ ln π_θ(a_t|s_t) ]    (27)

This is the main result for the policy gradient approach. What it means in practice is the following. We run through a trajectory and note all the actions we took. In the end, we calculate the return. We change the policy parameters according to the logarithmic gradient of the policy, evaluated for these actions, multiplied by the return. All the actions that have been taken are made more likely, but more so if the return is larger (the norm is conserved, you can try to show this yourself). Averaged over many trajectories, this has the effect of reinforcing the “good” actions, i.e. actions that have been taken primarily in high-return trajectories.

3.1.3 Extremely simple RL example: training a random walker

Let us look at what must be the simplest possible RL example, a biased random walker, whose probability π_θ(a_t = +1) to go “up” can be varied during training (Fig. 4a). Note that this policy does not depend on any observed state, so there is no feedback yet in this example. The goal of the walker is to reach as far as possible from the origin, i.e. make R = x(T) as large as possible. Of course we know that the optimal strategy is simply to always go up. However, it is very instructive to see how this strategy is reached.

Let us use a sigmoid for π_θ(+1) = (1 + e^(−θ))^(−1). During training, we need the policy gradients, to evaluate the central equation (27). Do the math to show that they are:

    ∂_θ ln π_θ(+1) = 1 − π_θ(+1)    (28)

    ∂_θ ln π_θ(−1) = −π_θ(+1)    (29)

As a consequence, we find:

    ∂_θ Σ_t ln π_θ(a_t) = N_+ − T π_θ(+1)    (30)
Here N_+ is a fluctuating quantity, namely the number of "up" steps: N_+ = Σ_{t=1}^{T} δ_{a_t,+1}. Its expectation value is N̄_+ = T π_θ(+1). In other words, Eq. (30) measures by how much this number, for a particular trajectory, exceeds the average. To get the update δθ, according to Eq. (27), we only have to multiply by the return and take the expectation value. This will yield a positive update for θ if trajectories with more "up" steps than average yield an enhanced return. For our scenario, this should be the case. Let's see whether the math bears out this expectation!

For the return, we obtain R = x(T) = Σ_t a_t = N_+ − N_− = 2N_+ − T. Now we see that this example is so simple that the update equation can be obtained analytically (a very rare case):

δθ = η E[R Σ_t ∂_θ ln π_θ(a_t)] = η E[(2N_+ − T)(N_+ − N̄_+)]    (31)

Rewriting slightly and using E[N_+ − N̄_+] = 0, we find that the update just depends on the variance of N_+:

δθ = η E[2(N_+ − N̄_+)^2] = 2ηT π_θ(+1)(1 − π_θ(+1)).    (32)

In the last step we used the formula for the variance of a binomial distribution.

Does this update equation make sense? First, we note that it is always positive. And an increase in θ also increases the probability π_θ(+1) to go up. This is exactly what is needed to increase the return! Second, we find that the update vanishes in the extreme cases. When the walker always goes up already, no further increase of the probability is necessary or possible, so this is fine. On the other hand, when the walker always goes down, nothing happens either (which is bad). The walker is stuck with the worst possible strategy. The reason for this is that then there is not even a single trajectory that deviates from the expected behaviour, and thus the walker never even gets a chance to see larger returns. In RL jargon, this is called a lack of "exploration". Whenever that is a problem, a typical solution is to introduce random actions once in a while (i.e. not follow the policy all the time).

The largest update steps are obtained at π_θ(+1) = 1/2, i.e. when the walker is unbiased. Then the fluctuations of N_+ are largest, and the walker efficiently explores all possibilities. The resulting training progress is shown in Fig. 4d.

Exercise: Training a walker – Implement the stochastic training update numerically, by drawing the random N_+ according to a binomial distribution, and using the update equation in the form (31) – but without taking the expectation value E[. . .] (just evaluate for a particular trajectory, i.e. a particular value N_+). Plot the evolution of π_θ(+1) during training, and repeat several times to observe the stochastic nature of training. Show numerically that for a sufficiently small learning rate η, we obtain the behaviour expected from the averaged equation (plot the curve expected from this average equation for comparison)! Empirically, for which values of π_θ(+1) are the fluctuations in the update the largest?
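A minimal numpy sketch of this exercise might look as follows; the learning rate, the number of time steps, and the number of updates are arbitrary placeholder choices:

import numpy as np

np.random.seed(0)
T = 100        # time steps per trajectory
eta = 0.001    # learning rate
epochs = 500   # number of training updates
theta = 0.0    # policy parameter, pi_theta(+1) = sigmoid(theta)

def pi_up(theta):
    # probability to go "up"
    return 1.0 / (1.0 + np.exp(-theta))

history = []
for epoch in range(epochs):
    p = pi_up(theta)
    # one stochastic trajectory is fully characterized by N_+:
    N_plus = np.random.binomial(T, p)
    R = 2 * N_plus - T                 # return: final position x(T)
    grad_log_pi = N_plus - T * p       # Eq. (30)
    theta += eta * R * grad_log_pi     # single-trajectory version of Eq. (31)
    history.append(pi_up(theta))

print(history[::50])   # pi_theta(+1) should drift towards 1

Averaging the update over a batch of trajectories before applying it produces the smoother learning curves of Fig. 4d; the walker/target example of the next subsection can be implemented in exactly the same spirit, with two parameters θ_0 and θ_1.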

Figure 4: Simple illustrations of reinforcement learning. (a) A random walker moving either left or right
in each time step, where the reward will be determined according to the distance covered to the right. (b)
Dependence of the probability for moving right on the policy parameter θ. (c) The walker/target example,
where the walker has to learn to stop when on target. (d) Learning progress for the random walker, using
policy gradient. Increasing batch sizes lead to smoother learning behaviour. (e) Walker/target example,
with a sample of trajectories, displayed for increasing learning progress (from left to right). Eventually, the
walker learns to stop on target. In this plot, the target was always placed at the same location, though it has
random locations during training. (f) Training evolution of the probabilities to move when not on target,
πθ (1|0), and to stop when on target, πθ (0|1). The fixed point is indicated, representing the optimal policy.

3.1.4 Simple RL example: Walker reaching a target

We now change the scenario to include feedback: a walker that wants to find a target site and stay there (Fig. 4c). The observed state is either 0 (most of the time) or 1 (on the target site). The goal of the game is now to have the walker spend as much time as possible on the target. For simplicity, we slightly revise the walker's actions: it can now either stay (a_t = 0) or move up (a_t = +1). We will assume the target is somewhere at a positive position x*, so that it can be reached at all. This position will be chosen at random before the start of each trajectory. Again, it is clear to any human neural network after a few seconds of thinking what is the best strategy: Move up as fast as possible in the beginning, but stop once the target site is reached. In terms of the policy, this means: π_θ(a_t = 1|s_t = 0) = 1, π_θ(0|1) = 1, and zero for the two other policy probabilities. Let us see how this policy is reached!

One can probably once more obtain an analytical solution (I have not tried it). However, it is also fun to implement this example numerically. We still do not really need a neural network: there are only two independent policy probabilities (due to normalization), so it is enough to introduce sigmoids with parameters θ_0 for π_θ(1|0) = (1 + e^{−θ_0})^{−1} and θ_1 for π_θ(0|1) = (1 + e^{−θ_1})^{−1}.

Implementing this example numerically means: (i) simulate a trajectory stochastically; (ii) for this trajectory, evaluate the quantity Σ_t ∂_{θ_0} ln π_θ(a_t|s_t) [and likewise for θ_1], where the derivative has been calculated analytically beforehand; (iii) record the return R (the number of timesteps during which the walker was on target); (iv) apply the policy gradient update rule to both θ_0 and θ_1.

The progress during training is shown in Fig. 4e. The walker becomes ever better at moving quickly at first and then staying on target. These figures illustrate that the procedure works as expected. During training, one observes an upward drift of both the probability to "stay on target" and to "move when not on target". This flow reaches the ideal policy that we have identified before (Fig. 4f).

3.1.5 Quantum physics RL example

We now want to apply RL techniques to a realistic scenario from quantum physics. The purpose is to illustrate that we can already obtain valuable results based on nothing more than the policy gradient approach introduced above. In order to make best use of the capabilities of RL, we will naturally choose a situation involving feedback. We consider a quantum system that is controlled by a neural network based on the results of some measurements that have been performed on that system. That description covers a rather wide class of problems, many of them extremely interesting for applications in modern quantum technologies.

For the purposes of this example, we will keep the quantum system itself simple. It is a single mode of a cavity, i.e. a harmonic oscillator. However, this mode decays, and it can be measured, so we will need to describe its quantum dissipative dynamics. In addition, the cavity mode can be acted upon, e.g. via an external drive. The situation is displayed in Fig. 5a. This example is more valuable than one might think at first sight, given the important role that cavities play in scenarios like qubit coupling and readout as well as a potential quantum memory and even as a qubit.

In RL, it is standard to think of a discrete time, which we can implement here by choosing a small time step ∆t (this may be subdivided further into even smaller time steps for the physics simulation, if needed).

The actions that the neural network can take in this situation are in principle described by continuous values. It might want to adjust the drive amplitude α_in of a beam entering the cavity, or it might be allowed to control something else, like the frequency ω_L of the drive beam, or even the strength of some additional nonlinear term in the cavity Hamiltonian. All of these are easily accessible to the RL approach. In fact, the network does not need to be changed for any of these choices, it is only the interpretation of the actions that changes. The appropriate RL variant to apply here would be continuous RL (where the network outputs continuous values which are interpreted as the center of some Gaussian distribution). However, to keep things simple and in line with the preceding discussion, we will merely discretize the continuous values. For example, if there is only a drive amplitude to care about, we will pre-define a number of discrete amplitudes and label them by an integer: α_in = α_a, a = 1 . . . N_α. These are then the action choices whose probabilities π_θ(a|s) are output by the network. If we have more than one continuous control parameter, we would have to let a label a discrete set involving all possible combinations of values, which quickly becomes cumbersome (and would be a good reason to switch to continuous RL), see Fig. 5b.

The input to the network (i.e. the observed state s) will be the measurement trace. In principle, the most sophisticated approach at this point would be to use a network with memory (recurrent network, see below), and to feed in one measurement result per time step. However, to keep things simple, we will not use a recurrent network. Instead, we will always present the most recent values of the measurement signal as input to the network (T_msmt data points, which thus defines the number of input neurons). In this way, the network can react at least to a finite time interval of the fluctuating signal. That may allow it to average the signal if needed or perform some more sophisticated interpretation of the time trace.
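As an illustration of this structure (not the exact architecture behind Fig. 5), such a policy network can be set up in a few lines of keras; the values of T_msmt, N_α and the hidden-layer sizes below are placeholder choices:

import numpy as np
from tensorflow import keras

T_msmt = 20     # number of recent measurement samples fed to the network (assumption)
N_alpha = 9     # number of discretized drive amplitudes (assumption)

# policy network: measurement trace -> action probabilities pi_theta(a|s)
policy = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(T_msmt,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(N_alpha, activation='softmax'),
])

# sampling one action per trajectory in a small batch of measurement traces:
traces = np.random.randn(5, T_msmt)                  # placeholder for X(t) samples
probs = policy(traces).numpy().astype('float64')
probs /= probs.sum(axis=1, keepdims=True)            # renormalize against float32 rounding
actions = [np.random.choice(N_alpha, p=p) for p in probs]

For the policy gradient update, one would then minimize −R(τ) Σ_t ln π_θ(a_t|s_t) over a batch of recorded trajectories, which can be implemented, for instance, as a categorical cross-entropy between the chosen actions and the network output, weighted by the return.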

Figure 5: Reinforcement learning in quantum physics. (a) The quantum feedback setting in our example,
a cavity that is observed and driven. (b) The network converts a measurement trace into probabilities for
all the available actions. In this picture, continuous control amplitudes are assumed to be discretized to
yield discrete action choices. For two control parameters, this results in an array of possible parameter
combinations, each of which represents one action. (c) The training progress, illustrated via the drive
amplitude (red means higher amplitudes), and via the resulting probability for Fock state 1 in the cavity.
Obviously the network becomes better at stabilizing this Fock state as the training progresses.

Finally, we have to choose the reward function. Again, there are many possible choices, all of which can be selected without any change in the underlying algorithm. Here, we will aim for quantum state stabilization, i.e. the reward is the overlap between the cavity's state ρ̂_t at time t and a fixed given state: r_t = ⟨Ψ|ρ̂_t|Ψ⟩. Another interesting choice would be the overlap with a particular subspace of states. The return R = Σ_{t=1}^{T} r_t will favor a policy that goes to the target state rather quickly.

In principle, all of this could be applied in an experiment. The prerequisites would be sufficiently fast control hardware, where a neural network is able to access quickly the measurement results and produce the feedback signal. However, for the purpose of these lecture notes, which had to be prepared without the benefit of a laser or a microwave generator, we will simulate the dynamics on a computer.

The Hamiltonian of a cavity mode driven at resonance is most suitably described in a frame rotating at the cavity frequency. In this frame, it only contains the drive: Ĥ = i√κ (α_in ↠− α_in* â), which would lead to a Heisenberg equation of motion dâ/dt = √κ α_in. (Here |α_in|^2 would be the number of drive photons per unit time impinging on the cavity.) The unitary dynamics of the cavity's quantum state ρ̂ is then determined by i dρ̂_unitary/dt = [Ĥ, ρ̂]. Moreover, the decay of photons at the cavity decay rate κ is described by a Lindblad term, dρ̂_decay/dt = κ D[â]ρ̂, where we adopt the usual definition D[R̂]ρ̂ = R̂ρ̂R̂† − (1/2)(R̂†R̂ρ̂ + ρ̂R̂†R̂). Finally, we have to treat the stochastic measurement signal. We can do this using the quantum jump trajectories approach. Given a measurement operator Â, we find a noisy classical measurement trace:

X(t) = √κ' ⟨Â + Â†⟩ + ξ(t),    (33)

where ⟨Â + Â†⟩ = tr[ρ̂(Â + Â†)] and ξ(t) is a stationary Gaussian white noise stochastic process, ⟨ξ(t)ξ(0)⟩ = δ(t). The induced stochastic dynamics of the state ρ̂ is:

dρ̂_msmt/dt = κ' D[Â]ρ̂ + √κ' (Âρ̂ + ρ̂† − ⟨Â + Â†⟩ ρ̂) ξ(t)    (34)

The measurement operator has to be selected according to the physical situation. Suppose we do a homodyne measurement of the linear amplitude of the field leaking out of the cavity. For clarity, assume the left mirror is described by κ, while we measure the field leaking out of the right mirror at a rate set by κ'. Then we would choose  = â. On the other hand, if we had available a more sophisticated setup where a QND measurement of the photon number inside the cavity can be performed, we would have  = ↠â. That could be achieved by a Kerr coupling between cavity modes [Helmer et al. paper on QND photon msmt in a cavity].

Using these equations, we can implement a physics simulation of our driven, dissipative cavity. This simulation will evolve the system's state forward by an amount ∆t, based on the current value of the drive amplitude. After this short step, the neural network is queried again. Given the measurement trace, which has been updated according to Eq. (33), the network will decide on the next action probabilities. One of these actions is selected, and the next physics simulation step will be executed. This procedure is performed until the fixed end T of the trajectory, before the network's parameters are updated according to the policy gradient approach. For efficiency, all of this is done in a parallelized fashion, on a batch of trajectories that are processed simultaneously (so there is always a set ρ̂_j(t) of states to keep track of, where j = 1 . . . N_batch).
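To make this recipe concrete, here is a minimal numpy sketch of a single simulation time step for one trajectory, using a simple Euler–Maruyama discretization of Eqs. (33) and (34); the Hilbert-space truncation, the rates and the time step are placeholder values, and a serious implementation would use a better integrator and operate on the whole batch of states at once:

import numpy as np

n_max = 10                      # Fock-space truncation (placeholder)
kappa, kappa_p = 1.0, 0.5       # decay rate and measurement rate (placeholders)
dt = 0.001                      # integration time step

# annihilation operator and measurement operator (QND photon-number readout)
a = np.diag(np.sqrt(np.arange(1, n_max)), k=1)
A = a.conj().T @ a              # A = a^dagger a

def D(L, rho):
    # Lindblad dissipator D[L]rho
    return L @ rho @ L.conj().T - 0.5 * (L.conj().T @ L @ rho + rho @ L.conj().T @ L)

def step(rho, alpha_in):
    """One Euler-Maruyama step of the drive + decay + measurement dynamics."""
    H = 1j * np.sqrt(kappa) * (alpha_in * a.conj().T - np.conj(alpha_in) * a)
    expect = np.real(np.trace(rho @ (A + A.conj().T)))
    dW = np.sqrt(dt) * np.random.randn()             # Wiener increment
    drho = (-1j * (H @ rho - rho @ H)
            + kappa * D(a, rho)
            + kappa_p * D(A, rho)) * dt
    drho += np.sqrt(kappa_p) * (A @ rho + rho @ A.conj().T - expect * rho) * dW
    rho = rho + drho
    rho = 0.5 * (rho + rho.conj().T)                 # keep Hermitian
    rho /= np.real(np.trace(rho))                    # keep normalized
    X = np.sqrt(kappa_p) * expect + dW / dt          # measured signal, Eq. (33)
    return rho, X

# example: start in vacuum and integrate a few steps with a fixed drive
rho = np.zeros((n_max, n_max), dtype=complex); rho[0, 0] = 1.0
for _ in range(100):
    rho, X = step(rho, alpha_in=1.0)
print(np.real(np.diag(rho))[:3])   # photon-number populations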
Fig. 5 shows some results obtained using this approach. As our goal, we have chosen to stabilize the Fock state with one photon in the cavity, |Ψ⟩ = |1⟩. To facilitate this, we assume a weak QND measurement of the photon number inside the cavity. The control simply consists in a linear drive (as explained above). One clearly observes the improvement of the policy during training, as the Fock state probability increases. The observed values of the Fock state probability in this example are already beyond what could be obtained simply from a coherent state, even if its displacement were chosen optimally. Many extensions of this example are possible, by choosing different goals (i.e. rewards), control knobs (e.g. controllable Kerr terms inside the Hamiltonian), or readout approaches. In any case, it is not quite trivial to analyze the performance of the RL approach e.g. with respect to analytically constructed feedback control strategies.

3.1.6 Q learning

We now briefly describe an alternative RL approach, different conceptually from the policy gradient method. All the other present-day RL techniques can essentially be traced back to either one of those two techniques or are hybrids between the two concepts.

The idea of Q learning is to introduce a so-called quality function Q(s_t, a_t) that is the expected future return if one takes action a_t in state s_t:

Q(s_t, a_t) = E[R_t]    (35)

Here R_t = Σ_{t'=t}^{T} r_{t'} γ^{t'−t} is the so-called discounted future return, with a discounting factor γ < 1. This means immediate rewards are considered more important (γ = 0 would result in a greedy strategy that always tries to maximize the next reward, without concern for the long-term consequences). Eq. (35) is nontrivial: The expectation on the right-hand side is taken over all trajectories (beginning at the present time t) that follow the current policy. The policy in Q learning depends on Q itself. It simply consists in always choosing the action a that maximizes Q(s, a). In this sense, Eq. (35) is a recursive definition of (or implicit equation for) Q. We can make this more obvious by rewriting it in the form of "Bellman's equation":

Q(s_t, a_t) = r_t + γ E[R_{t+1}] = r_t + γ max_a Q(s_{t+1}, a),    (36)

where s_{t+1} is the state reached from s_t by executing action a_t. This is obtained by inserting the definition of R_t, collecting all terms t' > t, and noting that their expectation value is exactly the Q function evaluated at the optimal action a for the new state s_{t+1} (up to an extra factor γ).

This is still not tractable. However, we can iteratively improve an estimate for the Q function by using the following Q learning update rule, which ensures we come ever closer to a solution of Eq. (36):

Q_new(s_t, a_t) = Q_old(s_t, a_t) + α (RHS|_{Q=Q_old} − LHS|_{Q=Q_old})    (37)

Here RHS and LHS refer to the right-hand and left-hand side of Eq. (36), and α ≪ 1 is a small positive number, which determines the speed of the iterative improvement.

What happens in practice is the following: At first, the Q function becomes large directly at states s that give a large immediate reward. In subsequent steps of the update rule, this also affects nearby states s', since one can reach s from any of those. In this way, large values of the Q function tend to spread through state space, in a diffusion-like process.

In advanced applications, Q(s, a) is represented by a neural network, and the update rule is implemented by training the network to approximate the new value. Q learning has been used successfully in many cases. One recent impressive example was training a network to play Atari video games, where the state s consisted in a combination of the last few video frames from the game and the actions a were the simple discrete controls of the form "move left" etc.

We finally mention a related concept, the so-called value function, that measures the expected future return only depending on a state s. The defining equation looks superficially the same as before, but we now assume that the next action will be chosen according to the current policy, instead of being prescribed:

V(s_t) = E[R_t]    (38)

This concept, of a value function, has been combined with the policy gradient approach to yield so-called "actor-critic" RL methods. The basic idea there is to always compare the true overall return with the expected return given the current state – obviously, a sequence of action choices that yielded only a moderate return despite starting from a high-value state cannot have been very good.
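As a minimal illustration of the update rule (37), the following tabular sketch applies Q learning to the walker/target example of Sec. 3.1.4 (the state is only whether the walker is on the target, the actions are "stay" and "move up"); the ε-greedy random actions provide the exploration discussed earlier, and all numerical values are arbitrary choices:

import numpy as np

np.random.seed(1)
T, gamma, alpha, eps = 50, 0.9, 0.1, 0.1
Q = np.zeros((2, 2))         # Q[state, action]; state: 0 = off target, 1 = on target

for episode in range(2000):
    x, x_target = 0, np.random.randint(1, 10)
    s = int(x == x_target)
    for t in range(T):
        # epsilon-greedy action choice: 0 = stay, 1 = move up
        a = np.random.randint(2) if np.random.rand() < eps else int(np.argmax(Q[s]))
        x += a
        s_new = int(x == x_target)
        r = 1.0 if s_new == 1 else 0.0
        # Eq. (37): Q <- Q + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_new]) - Q[s, a])
        s = s_new

print(np.round(Q, 2))   # argmax: "move" when off target, "stay" when on target

After training, the greedy policy derived from Q reproduces the strategy found before with the policy gradient: move when off target, stay when on target.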
3.2 Mimicking observed probability distributions: Boltzmann machines

The Boltzmann machine is an example of a machine learning tool that is very directly linked to the statistical physics of spin systems. It is also important because it can be generalized to the quantum case.

The basic task of the Boltzmann machine is to mimic an observed probability distribution P_0(v) of values v. In the applications of interest, the values v are high-dimensional (e.g. images or measurement results obtained from many measurements on a quantum many-body system). The key words here are "mimic" and "observed". We are not given access to the functional form of P_0(v). Rather, we can only observe many samples v drawn from this distribution. On the other hand, we also do not want to produce an approximation to this high-dimensional function P_0(v) either. Rather, we want to be able to sample from the same distribution efficiently.

The basic idea is to set up a statistical model whose Boltzmann distribution can be adapted to approximate P_0(v). The model energy E_θ depends on parameters θ, which will be changed during training of the Boltzmann machine.

One crucial ingredient of Boltzmann machines is the existence of "hidden" variables h. The full configuration of the model at any time is described by specifying both v and h together, and the Boltzmann distribution we are talking about is a joint distribution:

P(v, h) = e^{−E_θ(v,h)} / Z,    (39)

where Z = Σ_{v,h} e^{−E_θ(v,h)} is the partition sum, needed for normalization. Obviously we have set k_B T = 1, which just amounts to a rescaling of the energy. In a physical implementation of a Boltzmann machine, E would be a dimensionless energy, rescaled by the thermal energy. In the end, we want to tune θ such that

P(v) = Σ_h P(v, h) ≈ P_0(v).    (40)

For brevity, we have denoted as P(v) the distribution over visible unit configurations v, and correspondingly we will write P(h) for the marginal distribution P(h) = Σ_v P(v, h). A more precise notation would be P_v(v) and P_h(h), but it will always be clear from the argument which distribution we refer to.

This arrangement, with hidden variables, makes it possible to evaluate the statistics of the model efficiently, as we will see below – provided one chooses a particular architecture, the so-called restricted Boltzmann machine. This has an energy given by

E(v, h) = −Σ_i a_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i w_{ij} h_j.    (41)

Note the absence of v−v or h−h coupling terms, which makes this a restricted model. In the typical approach, the values v_i and h_j are binary (0 or 1). In other words, we are dealing with an Ising model of a particular restricted form, but with arbitrary v−h couplings. It is these couplings w (as well as the 'magnetic fields' a and b) that form the parameters θ of the model and which have to be trained.

In the end, we want to minimize the deviation between the target distribution P_0(v) and the Boltzmann machine thermal distribution. Let us measure this deviation by the categorical cross entropy,

C = −Σ_v P_0(v) ln P(v)    (42)

introduced above in the context of image recognition. Since v is high-dimensional, there is no hope of actually carrying out the sum (if v consists of N units, the sum has 2^N terms). However, we can formally take the derivative with respect to the parameters. A lengthy calculation yields a comparatively simple result. For example, the derivative of the energy E_θ(v, h) with respect to the weight w_{ij} generates the combination v_i h_j. As a result, we find:

−∂C/∂w_{ij} = Σ_v P_0(v) ∂ ln P(v)/∂w_{ij} = ⟨v_i h_j⟩_{P_0} − ⟨v_i h_j⟩_P.    (43)

Figure 6: (a) The goal of a Boltzmann machine is to learn to sample from an approximation P to an observed
probability distribution P0 (v). (b) A restricted Boltzmann machine, with connections between visible and
hidden units. (c) During training, starting from an observed sample v, a Monte Carlo Markov chain is
produced according to the current statistics of the Boltzmann machine. (d) Structure of a recurrent neural
network, with feedforward connections in time. (e) A “forget gate” neuron as part of a long short-term
memory (LSTM) network. (f) A typical challenge requiring long memory times. (g) A signal triggers the
recall of a number presented earlier to the network. The training progress is shown here. (h) A network
learns to count down from an arbitrary number (input at “start”). (i) A typical application of recurrent
networks in physics: analyzing fluctuating measurement time traces.

The two terms on the right-hand side are defined as:

⟨v_i h_j⟩_{P_0} ≡ Σ_{v,h} v_i h_j P(h|v) P_0(v)    (44)

and

⟨v_i h_j⟩_P ≡ Σ_{v,h} v_i h_j P(h|v) P(v)    (45)

Here we have introduced the conditional probability,

P(h|v) = P(v, h) / P(v).    (46)

To evaluate these expressions, we need a way to sample from the distribution P(v), as well as to sample h given v according to the conditional probability P(h|v). This is the challenge we address below. On the other hand, sampling over the observed empirical distribution P_0(v) is easy, because we are being provided samples v accordingly (that was the starting point of the whole task).

In general, sampling from a given distribution P(s) (for any model with some configurations s) can be performed using a Monte Carlo algorithm. Any Monte Carlo algorithm is constructed as a stochastic Markov process, where the transitions between states s have been chosen to fulfill detailed balance: P(s → s')/P(s' → s) = P(s')/P(s) for all pairs of states s, s'. In particular, if the target is to obtain a Boltzmann distribution, we will have P(s')/P(s) = exp(E(s) − E(s')), where again we have used an energy rescaled by k_B T, like above.

In the present situation, we will slightly modify the standard Monte Carlo approach, by exploiting the special structure of our problem: the distinction between visible units and hidden units. Consider a Markov chain that starts from some visible unit configuration v, then jumps to some hidden unit configuration h, goes back to some other v', etc. It keeps alternating between visible and hidden configurations. We define the transition probabilities as the conditional probabilities P(h|v) and P(v|h) that can be obtained from the underlying Boltzmann distribution P(v, h). Then it is easy to check that detailed balance holds:

P(h|v) / P(v|h) = P(h) / P(v).    (47)

As a consequence, this Markov chain converges to a steady-state distribution that, for both visible units and hidden units, is equal to the respective marginal distribution P(v) and P(h). As an aside we note that the full Boltzmann distribution P(v, h) is realized as the distribution of pairs (v, h) composed of a visible configuration and the hidden configuration that it reaches in one Monte Carlo update [since P(v, h) = P(h|v)P(v)].

To actually implement the Monte Carlo step, we need to calculate the conditional probabilities. A brief calculation reveals

P(h|v) = e^{−E(v,h)} / (Z P(v)) = Π_j e^{z_j h_j} / (1 + e^{z_j}),    (48)

with

z_j = b_j + Σ_i v_i w_{ij}.    (49)

The most important fact about Eq. (48) is that it is a product of probabilities, one for each hidden unit. In other words, we can sample the new values h_j independently, and the probability for h_j = 1 is simply σ(z_j), where σ is the sigmoid activation function. Monte Carlo sampling of a Boltzmann machine thus consists in two steps: calculate probabilities the same way you would calculate the new neuron values for a densely connected pair of layers (with sigmoid activation), and then sample binary values h_j = 0/1 according to those probabilities. The step back, from h to v, proceeds analogously (with a z'_i = a_i + Σ_j w_{ij} h_j).

Finally, let us return to the task of evaluating the weight update for the Boltzmann machine, Eq. (43). There are two terms, one involves sampling from the target distribution P_0(v), the other requires sampling from P(v). The first task is easy, by definition, since we are provided with samples from P_0. The second task seems hard, since we have to run a lot of Monte Carlo steps to converge to the steady state distribution. However, there is a trick we can use. If the Boltzmann machine is already close to the target distribution, P_0(v) ≈ P(v), then a sample v from P_0 will be almost as good as a sample from P. We can then get even closer to the P distribution by doing a few Monte Carlo steps starting from this sample. In practice, the simplest approach is to take a single extra pair of steps: v → h → v' → h'. Then v', obtained in this way, can serve as a good approximation to having a sample from P. In this way, the right-hand side of the update equation can be approximated as:

⟨v_i h_j⟩_{P_0} − ⟨v'_i h'_j⟩_{P_0},    (50)

where the second term, written out explicitly, is:

⟨v'_i h'_j⟩_{P_0} = Σ_{v,h,v',h'} v'_i h'_j P(h'|v') P(v'|h) P(h|v) P_0(v)    (51)

This approach is called "contrastive divergence". As emphasized above, the approximations involved become better when P finally approaches P_0.

In this way, training a Boltzmann machine has been reduced to sampling from the target distribution P_0 and executing a few Monte Carlo steps for any given sample v.
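The whole training loop fits into a few lines of numpy; the following CD-1 sketch uses random binary data as a stand-in for samples from P_0, and the network sizes and learning rate are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
N_v, N_h, eta = 20, 10, 0.01
w = 0.01 * rng.standard_normal((N_v, N_h))   # couplings w_ij
a = np.zeros(N_v)                            # visible 'magnetic fields' a_i
b = np.zeros(N_h)                            # hidden 'magnetic fields' b_j

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h(v):
    p = sigmoid(b + v @ w)        # Eqs. (48), (49): P(h_j = 1 | v) = sigma(z_j)
    return (rng.random(p.shape) < p).astype(float)

def sample_v(h):
    p = sigmoid(a + h @ w.T)      # analogous step back from h to v
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(v_data):
    """One contrastive-divergence step for a batch of observed samples v."""
    global w, a, b
    h = sample_h(v_data)                      # v -> h
    v1 = sample_v(h)                          # h -> v'
    h1 = sample_h(v1)                         # v' -> h'
    # Eq. (43) with the approximation (50): <v h>_P0 - <v' h'>
    w += eta * (v_data.T @ h - v1.T @ h1) / len(v_data)
    a += eta * (v_data - v1).mean(axis=0)
    b += eta * (h - h1).mean(axis=0)

# placeholder training data: random binary samples standing in for P_0(v)
data = (rng.random((500, N_v)) < 0.3).astype(float)
for epoch in range(100):
    cd1_update(data)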
3.3 Analyzing time traces: recurrent neural networks

The study of dynamics defines much of physics. Observing the dynamics of a system results in time traces, often with fluctuations (e.g. due to measurement imprecision). To analyze them with a neural network, one may hand the whole time series (with T time steps) as input to the network. However, that usually implies fixing the time interval T in advance, because it is connected to the network structure. One simple way around this would be to use convolutional neural networks, where the translational invariance (now with respect to time) is exploited. But this comes with a catch: the size of the filters (kernels) in such a network will determine the time-scale over which the network's memory works.

The alternative is so-called "recurrent neural networks", i.e. networks that have built-in memory. Basically, at each time the network not only receives fresh external input (as was the case in all the settings we discussed so far), but it also receives internal input, from the values that the neurons had at the previous time step. External and internal inputs are processed together to calculate the new, updated neuron values. In this way, in principle, recurrent networks can keep memory over arbitrarily long time spans. Training proceeds by presenting both an input time-series and the corresponding correct output time-series. Importantly, the weights are not themselves time-dependent, so the number of training parameters does not grow with the time interval T that is considered. In this way a given trained recurrent network can be applied to arbitrarily long time series (in the same way that a convolutional network can be applied to arbitrarily sized images).

When training such a network, taking the gradient of the cost function will not only step down layer by layer (as in usual backpropagation) but also back in time (Fig. 6d). This can involve a lot of steps back in time, only limited by the total time interval. It was realized already in the 90s that backpropagation involving many layers or time steps can result in problematic behaviour. In each step, the deviation vector is multiplied by a matrix, so e.g. ∆_{t−1} = M^{(t−1,t)} ∆_t for backpropagation in time. Since that matrix can have eigenvalues larger or smaller than unity, this can lead to exponential growth or vanishing of the gradient vector ∆_t. What this means is that the influence of a weight change at some early time on the network's response at some late time is either vanishingly small or exponentially large. This leads to problems in learning, especially for situations which require memory to be preserved over long times.

It was recognized by Hochreiter and Schmidhuber in 1997 [16] that this problem can be circumvented. They pointed out that a typical application scenario often looks like this (Fig. 6f): a memory is created but then remains irrelevant for a long time (during which time it need not be accessed). Only much later a certain external signal triggers recall of the memory. If that is the case, it is a smart idea to not touch the memory most of the time, i.e. to make read-out or write-in depend on external stimuli. As we will show below, this then avoids the exponential growth or vanishing of gradients.
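This multiplicative accumulation of Jacobians is easy to see in a toy numpy experiment; for simplicity the same random matrix (a stand-in for M^{(t−1,t)}) is used at every step, and the scale factors are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 100
delta = rng.standard_normal(n)            # deviation vector at the final time

for scale in [0.9, 1.0, 1.1]:
    d = delta.copy()
    M = scale * rng.standard_normal((n, n)) / np.sqrt(n)   # random "Jacobian"
    for t in range(T):
        d = M @ d                         # one backpropagation step in time
    print(scale, np.linalg.norm(d))       # shrinks or blows up exponentially with T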
To implement this in practice, so-called "gating neurons" are introduced. Their purpose is to calculate, based on the current external input, whether the memory needs to be accessed or not. Let us discuss this first for the simplest case, that of a "forget gate" neuron, which determines whether the memory should be erased. Assume a neuron carries a value c_{t−1} at time step t−1, and we want to decide whether to keep that value. We can do this by writing c_t = f · c_{t−1}, where f ∈ [0, 1] is the value of the gating neuron. That value, in turn, has been calculated from the external input x_t (or from some lower layer), e.g. as f_t = σ(w x_t + b). In the simplest case, c, f, w, b would be scalars, but in practice we will be talking about whole layers of neurons. Then we would introduce suitable indices, to have c_{j,t}, f_{j,t}, w_{jk}, b_j, and x_{k,t}. Note that, for the first time, we are multiplying neuron values!

When backpropagation is applied in such a case, the product rule splits the gradient into two branches, one of which steps down to the lower layer (through the forget gate neurons), and the other goes back further in time (Fig. 6e):

∂c_t/∂θ = (∂f/∂θ) c_{t−1} + f ∂c_{t−1}/∂θ.    (52)

Gated read and write operations are implemented in a similar way. In that context, one distinguishes the memory content of a neuron (the c_t above) and its output value (that is fed into higher layers or as output to the user). Furthermore, one can make the forget/read/write gate neuron's values also depend on the output values of neurons in the same layer, taken from the earlier time step t−1 (instead of just the lower layer inputs, denoted x_t in the example above). All of this results in a structure that is called "long short-term memory" (LSTM). The label "short-term memory" is to set this apart from the true long-term memory that would be encoded in the network's weights w that have been modified during training. Short-term memory, by contrast, is the memory retained for the duration of a specific task (a single run, with T time steps).

We do not list the detailed formulas required for implementing the various LSTM gates here, since frameworks like tensorflow and keras will automatically provide you with LSTM implementations ready to use. Just use an LSTM layer instead of Dense. This layer keeps track of its internal memory state, passing that state forward in time. The input to such a network now has to be of dimension batchsize × timesteps × M_in. The output is either of dimension batchsize × M_out, yielding only the output at the final time step (if the option return_sequences is set to False), or batchsize × timesteps × M_out (if True), yielding the full sequence.

Exercise: Training an LSTM to add numbers – Train an LSTM to perform addition, turning a sequence of the type "23+18=???" into "23+18= 41". Hint: convert each digit (as well as the special characters '+', '=', '?') into a one-hot-encoded binary string, for example "3" yields "0001000000000", and this becomes the input to the network at that particular time step.
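For instance, a model for this addition exercise could be set up along the following lines; the vocabulary size corresponds to the one-hot encoding in the hint above, and the fixed sequence length is an additional simplifying assumption:

from tensorflow import keras

vocab = 13        # one-hot length from the hint above
maxlen = 9        # length of a string like "23+18=???" (assumption)

model = keras.Sequential([
    # input: batchsize x timesteps x M_in, here M_in = vocab (one-hot characters)
    keras.layers.LSTM(128, return_sequences=True, input_shape=(maxlen, vocab)),
    # one prediction per time step: batchsize x timesteps x M_out
    keras.layers.Dense(vocab, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
# model.fit(x, y, ...) with x, y of shape (num_samples, maxlen, vocab):
# x holds the one-hot input strings, y the same strings with "???" replaced by the sum.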
4 Applications of Neural Networks and Machine Learning for Quantum Devices

In this section we will review a few characteristic examples of how machine learning might be applied to improve the performance of quantum devices. Some of these examples have already been realized in first proof-of-principle experiments, while others represent theoretical studies pointing the way to future experiments. We do not pretend the following to be a comprehensive review. Rather, it is our goal to give a representative sample of current research in this field.

4.1 Interpreting measurement outcomes

The readout of quantum states in modern quantum devices is a fertile challenge for neural networks. For example, in weak continuous measurements, given a noisy measurement trace, the network can help to extract the maximum amount of information possible. In projective measurements of quantum many-body systems, a network can help to represent the underlying quantum state.

Machine learning approaches are particularly helpful in the presence of non-idealities like extraneous noise and nonlinearities. Using suitable training samples, a network can learn to overcome these challenges. In that way, it can become better than the default approach to any given measurement problem, which relies on idealized assumptions.

Weak qubit measurements – For the case of weak measurements, a nice experimental example has been realized recently. It illustrates for the first time the application of neural networks to the weak measurement of a driven superconducting qubit. The standard dispersive readout of a qubit works by coupling it to a microwave cavity. Sending a microwave beam through the cavity, one can detect the phase shift that depends on the qubit state. In the experiment of the Berkeley group [11], a network was trained to analyze the resulting weak measurement trace (voltage vs. time). After preparing the qubit, it is continuously driven and simultaneously weakly monitored. Finally, a strong projective measurement is applied. The task of the network is to predict the probability for obtaining a certain projective (strong) measurement outcome y_t ∈ {0, 1} at a time t, P(y_t|y_0, a, b, V_0, V_1, . . . , V_t), given the prior observed fluctuating trace V_0, V_1, . . . , V_t of the weak continuous measurement, and given the projective measurement basis b (as well as the initial state determined by y_0 ∈ {0, 1} and a subsequent qubit preparation pulse a). A recurrent neural network (LSTM) is able to properly learn the dissipative quantum dynamics of a continuously measured qubit from a million experimental measurement traces. It is particularly noteworthy that the network has no notion of quantum mechanics to begin with, i.e. it learns all of the dynamics purely by example.

Interpreting error syndromes – In quantum error correction, an essential step is to employ collective qubit measurements in order to check whether an error has occurred – without projecting the state of the logical qubit. This is known as error syndrome detection. The most important category of quantum error correction approaches are stabilizer codes. Among those, the surface code represents an easily scalable variant, where the probability of having an irreversible error decreases exponentially in the size of the qubit array. However, given a syndrome (i.e. a pattern of unexpected measurement outcomes in the surface code array), the challenge is to deduce the most likely underlying error and, consequently, the correct way to undo this error. The syndrome is essentially a 2D image, as is the underlying error (the locations of the qubits that have been flipped by the noise). Thus, this is a task well suited for neural networks, and this insight has been exploited in a series of works (see [30, 19, 3] for early examples).

Extracting entanglement – The logarithmic negativity probably represents the practically most useful quantity for describing the entanglement between two subsystems A and B. However, measuring it experimentally is a challenge, since it does not correspond to any simple observable. Rather, arbitrarily high moments of the (partially transposed) density matrix are needed, which translates experimentally into a large number of copies of the system that have to be measured after applying controlled-SWAP operations. It would be desirable to deduce the logarithmic negativity approximately after measuring only a few moments of the density matrix. In [14], a neural network was trained to map low-order moments (e.g. only up to the third moment) to the logarithmic negativity. In such a setting, the choice of training samples (in this case, quantum many-body states) is crucial. The authors trained on random states of two different varieties (area-law and volume-law states). The resulting network was able to perform surprisingly well on numerically simulated quantum many-body states arising in realistic dynamics.

4.2 Choosing the smartest measurement

Rather than merely interpreting a given set of measurement outcomes, one can strive to choose the most informative measurements possible. Given a sequence of prior observations, the observable for the next measurement can be selected so as to maximize the information.

In other words, we are looking for an adaptive measurement strategy (or "policy"). This represents a high-dimensional optimization problem. Machine learning tools can help discover such policies.
Adaptive phase estimation – The following illustrative and important pioneering example [15] was already investigated before the recent surge of interest in applications of machine learning. Consider the task of estimating an unknown phase shift inside an interferometer. For N independent photons, the phase uncertainty scales as 1/√N, the so-called "standard quantum limit" of phase estimation. However, by injecting an entangled state into the interferometer, one can improve on that bound, down to the Heisenberg limit 1/N. An adaptive scheme consists in measuring one photon at a time and using all the previous results in order to select the next measurement basis. In the interferometer case, the measurement basis is imposed via an additional, controllable, deterministic phase shift. The adaptive policy is thus described by a "decision tree", in which each branch (corresponding to a sequence of measurement outcomes) leads to a different choice of measurement basis. A search for the best policy is a search over all such trees – a formidable problem: the tree itself already contains a number of leaves that scales exponentially in the number of photons, and for each leaf a different value of the measurement basis (phase shift) can be selected. In [15], this high-dimensional optimization problem was tackled by the use of "particle swarm optimization", which is an efficient technique comparable to genetic algorithms and simulated annealing. The resulting strategies, for photon numbers up to N = 14, were able to beat the best previously known adaptive scheme.

Recently, an experimental implementation of these ideas was presented in [23], although restricted to N independent photons (instead of an entangled multi-photon state). The authors of [23] compared particle-swarm optimization to Bayesian approaches. The latter are somewhat simpler, in that they update the probability distribution over the unknown phase shift after each new incoming measurement result according to the Bayes rule. The adaptive Bayes approach then seeks to select a measurement basis that would minimize the expected variance. In essence, this represents a "greedy" strategy, where each individual step is optimized (which need not necessarily lead to the best overall result, generally speaking).

Experimental device characterization – The challenge becomes more pronounced if the relationship between the observations and the underlying parameter(s) is more complex, possibly even not accessible analytically (in contrast to the situation in phase estimation, where the relation between phase shift and measurement probability is simple).

In general, we might be able to control a few parameters V_1, V_2, . . . (which correspond to the choice of measurement basis in the example above). Furthermore, the measurement result I also depends on some hidden but fixed underlying model parameters λ_1, λ_2, . . ., which we might want to extract as well as possible (the unknown phase shift in the example above). The mapping from the controllable parameters and the model parameters to the measurement result may be simulated numerically but is complex and not easily inverted. One important question in this setting, just as before, is "where to measure next".

This problem was studied recently experimentally for the first time using advanced machine learning techniques [21]. The setting was semiconductor quantum dots, although the principles illustrated there are fairly general. The controllable parameters are the gate and bias voltages applied to the device. The underlying model itself is specified by the actual physical device, with its detailed (unknown) potential landscape that electrons inside the quantum dot experience, together with a set of further voltages that are kept fixed during an experiment. The current I(V_1, V_2) through the quantum dot depends on all of these aspects.

The authors of [21] adopt a measure of predicted "information gain" to select the "best" voltages V_1, V_2 for the next measurement. The information gain at any selected location in voltage space is defined as the Kullback-Leibler divergence between the probability distributions before and after the new measurement, averaged over all possible underlying "ground truths" (averaged according to the present distribution). Since we are talking about probability distributions over the space of all possible current maps I(V_1, V_2), it is impossible to handle them explicitly. In order to make progress anyway, a technique is needed to sample from the correct distribution that Bayes predicts given all previous measurements. The general technique exploited in [21] is known as "conditional variational autoencoder".
A variational autoencoder is similar to the autoencoder discussed previously in these lecture notes, except that it learns to produce neuron values in the bottleneck layer that are distributed according to a fixed Gaussian distribution. The benefit of this is that feeding the decoder with random Gaussian-distributed values will generate outputs that are distributed according to the correct underlying distribution. A conditional variational autoencoder takes this one step further by allowing for the specification of extra features. In [21], training was undertaken using both simulated and measured current maps.

4.3 Discovering better control sequences and designing experimental setups

In quantum control, the goal is typically to implement a desired unitary as well as possible. Standard numerical techniques exist for this purpose, with one of the most well-known being GRAPE (gradient ascent pulse engineering; [17]). However, recently reinforcement-learning (RL) type techniques have been applied to this challenge. Even though in this context they do not yet use the full power of RL, in that they do not search for feedback-based strategies, they turn out to be a very useful addition to the control toolbox. In contrast to techniques like GRAPE, modern RL techniques are "model-free", i.e. the algorithm itself applies independently of the underlying dynamical model. In addition, adopting RL-based control makes it very easy to benefit from the most recent advances in the field of machine learning.

State preparation – One important task is to drive a quantum system from an initial state to a target state. In [6], the authors studied how RL (Q learning) finds protocols for this purpose, in non-dissipative systems, focussing on "bang-bang" type protocols (which are also often used to combat slow noise acting on qubits). One particular specialty of their work is the study of the "landscape" of learning: how likely is it that the RL algorithm gets stuck? They discover a spin-glass type phase transition as a function of the prescribed duration of the protocol, where it becomes hard to find an optimal protocol. Moreover, going beyond control of a single qubit, they show that the same techniques very successfully also apply to spin chains where the Hilbert space is exponentially large, yet the number of control parameters remains small.

Control of dissipative qubits – More advanced modern RL techniques have also been applied recently to find continuous control sequences for dissipative few-qubit systems; for first examples see [2, 26].

Discovering experimental setups – In a setting like quantum optics, the sequence of time-dependent control pulses is replaced by a sequence of optical devices through which photons will pass. In [25], RL has been successfully used to search for "good" setups composed of beam-splitters, prisms, holograms, and mirrors, placed on an optical table, where the input state is an entangled state generated by parametric down-conversion, and the final state is produced by measurement and post-selection. In contrast to the tasks mentioned in the other examples above, there is not simply a pre-assigned target unitary to reach. Rather, a high reward is assigned to setups producing photonic states with a large degree of high-dimensional multi-partite entanglement. Remarkably, the algorithm discovers useful novel building blocks that can be inspected and analyzed afterwards. The RL technique used in [25] is called "projective simulation" [5], representing a lesser known recent alternative to the approaches discussed in these lecture notes.

4.4 Discovering better quantum feedback strategies

As we have explained in the previous section on reinforcement learning (Sec. 3.1), the basic paradigm involves an "agent" interacting with an "environment". This interaction goes both ways. Not only is the agent permitted to act on (control) the environment, but it can also observe the consequences and adapt its future moves accordingly. This second aspect represents feedback, applied directly during the interaction with the environment. It is to be distinguished from the other type of feedback which is used only during
training, when the reward controls the update of the agent's strategy.

The examples of RL mentioned in the previous section still do not incorporate (direct) feedback, with their goal rather being to find an optimal control sequence that does not require any adaptation to the unpredictable behaviour of the environment. In other words, the optimal sequence does not contain any conditional branches.

This changes as soon as the agent is allowed to observe the quantum environment. The stochastic measurement outcome must then be considered by the agent in deciding on its subsequent actions. We will now summarize the first work applying neural-network-based RL to quantum physics including feedback [12].

Quantum feedback is an important technique for quantum technologies. If both the quantum system and the measurement are linear, many analytical results exist to help find the optimal feedback protocol. However, for nonlinear quantum systems, feedback protocols must become more complex and this is where RL can be useful. We also note that the feedback aspect represents one of the most important conceptual differences between RL and other numerical methods like GRAPE.

Among the possible applications of quantum feedback to nonlinear quantum systems, quantum error correction (QEC) is particularly important. The typical idea in QEC, established in the 90s by Shor and others, is to encode a "logical" qubit state into a complicated entangled multi-qubit state. This multi-qubit state is then effectively more robust to noise, in that an error can be detected and corrected. Importantly, the error detection can be performed without detecting the state of the logical qubit. While textbooks provide the useful encodings and associated error syndromes for such stabilizer codes, it is not clear for any given actual hardware what might be the most efficient way to achieve this abstract task. In addition, given a certain hardware layout for a quantum memory device, it may turn out that some low-level, hardware-centric approaches are more efficient. For example, if the noise is spatially or temporally correlated, techniques like decoherence-free subspaces or dynamical decoupling can be very helpful.

This provides a suitable challenge for RL: Start by providing the layout of a few-qubit device, specify the qubits' connectivity and the available native quantum gates and possible measurements. Then, RL can help to find the best strategy to protect a logical qubit state from decoherence, given the noise processes acting on the device.

In our work [12], we showed how RL (natural policy gradient) can discover from scratch such quantum error correction strategies involving feedback (Fig. 7). The network finds concepts such as entangled multi-qubit states for encoding, collective qubit measurements, adaptive noise estimation, and others. None of these ideas had been provided to the network in advance.

For example, given a system of four qubits, RL automatically figures out that it is beneficial to encode a logical quantum state (first present in one of the four qubits) into a 3-qubit state, effectively re-inventing Shor's repetition code for this example. It then understands that direct measurements on any of the three code qubits are destructive, but the fourth qubit can be treated as an ancilla, such that a sequence of two CNOTs and a measurement then implements parity detection, which helps signal errors. Finally, the network develops an adaptive strategy, where after detection of an error it learns to quickly pinpoint where exactly the error occurred and how to correct it.

Depending on the layout of the qubit device (e.g. which qubits can be connected via a CNOT), and depending on the available gates and the properties of the noise, the network will vary the strategies. However, the range of applicability is far wider than stabilizer codes. As RL is completely general and works without any human-provided assumptions, the same neural network can also discover completely different strategies. For example, as we show [12], in a scenario where several qubits are subject to a fluctuating field that is spatially homogeneous, the network finds a strategy where some of the qubits are observed repeatedly to gain information about the noisy field – which can then be used to correct the qubit where the quantum information is stored. The observation strategy is even adaptive, in that the network chooses a measurement basis that depends on
Figure 7: Discovering quantum error correction strategies from scratch [12]. (a) The setting: A neural-network-based agent controls a few qubits, applying quantum gates and measurements, with the aim of protecting the quantum information against noise. (b) RL is used for training a powerful first network that receives the quantum state ρ̂ as input at every time step. This is then used for supervised training of a second network that only obtains the measurement results and which can be deployed in an experiment. (c,d) Quantum circuits (i.e. action sequences) for two different qubit setups. The network learns to encode the quantum information, apply periodic collective measurements (parity detection), and eventually also to correct any errors. (e) Visualization of the network activation patterns. Each point corresponds to one quantum state (reached at a particular time, in one of many trajectories). Its location is a 2D nonlinear projection (using the t-SNE method [31]) of the high-dimensional vector of neuron activations obtained in the network for that quantum state. Different clusters (colored according to the action suggested by the network) belong to quantum states that are considered qualitatively different by the network. Using t-SNE with a higher 'perplexity' parameter (here: 30) would result in more clearly separated clusters [12].
Despite the power of RL, we found that this challenge cannot be solved without any extra insights. In our case, we invented a new quantity, "recoverable quantum information", that measures the amount of quantum information that survives in a complicated entangled multi-qubit state and could, in principle, be extracted. This then serves as an immediate reward function for RL, and it is much more powerful than only calculating the overlap between the initial state and the final state after the full sequence of 200 time steps. In addition, we devised a "two-stage learning" scheme. In a first step, RL is used to train a network that is made more powerful by allowing it to see the full quantum state at any given time step. In a second step, the first RL-trained network is used to train a second network in a supervised manner, which then learns to mimic the strategy. However, this second network only receives the measurement results as input. Thus, it could be realistically deployed in an experiment, where the full quantum state is of course not available. These two key insights represent domain-specific human input. Making use of such knowledge for RL is permissible, as long as the resulting algorithm does not become restricted to special use cases but still covers a wide range of possible scenarios. Here, it covers quantum error correction for all possible settings of few-qubit quantum memories.
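To illustrate the second, supervised stage, the following is a minimal tf.keras sketch (all layer sizes, data shapes and variable names are invented for illustration and are not those of Ref. [12]): a 'student' network that sees only a measurement record is trained to reproduce the action probabilities produced by the RL-trained 'teacher'. In practice a recurrent architecture that processes the record step by step would be the natural choice; a small feedforward net acting on a fixed-length record keeps the sketch short.

import numpy as np
import tensorflow as tf

num_actions = 8       # hypothetical number of available gates/measurements
record_len = 20       # hypothetical length of a measurement record
batch = 1000

# Stand-ins for data generated by running the RL-trained teacher on many trajectories:
# measurement records (the student's input) and the teacher's action probabilities (targets).
measurement_records = np.random.randint(0, 2, size=(batch, record_len)).astype(np.float32)
teacher_action_probs = np.random.dirichlet(np.ones(num_actions), size=batch).astype(np.float32)

student = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(record_len,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_actions, activation='softmax'),
])

# Supervised "mimicking": minimize the cross-entropy between teacher and student policies.
student.compile(optimizer='adam', loss='categorical_crossentropy')
student.fit(measurement_records, teacher_action_probs, epochs=5, batch_size=64)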
In the future, similar RL approaches could be applied to other physical systems with specific requirements (e.g. ion trap chips, where one may shuffle the ions between different registers; this would represent another RL action), for finding fault-tolerant unitaries, and to treat quantum information storage in hybrid systems, where qubits are, e.g., coupled to cavities. Implementing the RL scheme experimentally will likely require dedicated hardware, like FPGAs, in order to be sufficiently fast in deciding on the next action.

5 Towards Quantum-Enhanced Machine Learning

Quantum algorithms promise spectacular speedups for certain tasks like factoring and search. It is therefore natural to ask whether they can also help with machine learning. We want to stress right away that even on the theoretical level there is, at the time of writing, not yet any completely compelling example of evident practical relevance for quantum-accelerated machine learning. Nevertheless, there are first insights and proposals, e.g. for quantum-accelerated linear algebra subroutines that may help with machine learning tasks [4], as well as for possible speed-ups in reinforcement learning (via Grover search), for modeling the statistics of quantum states via quantum Boltzmann machines, and for various other tasks. We can only scratch the surface of the rapidly developing literature here, and we refer the reader to a number of excellent reviews for a more complete overview [28, 4, 9].

We will start by mentioning one of the main roadblocks for a naive approach to quantum-accelerated machine learning.

5.1 The curse of loading classical data into a quantum machine

We think of machine learning as a way to learn from, and discover patterns in, large amounts of data. Typically, we would have in mind classical data, obtained from databases or by measurements. This immediately gives rise to a severe challenge that affects many potential quantum-accelerated algorithms if their purpose is to act on large amounts of classical data. If the quantum algorithm's complexity scales better than linear in the size N of data, then this advantage will be destroyed by the need to load all the N data points into the quantum machine.

That challenge can be illustrated in many examples, but let us just consider briefly the quantum Fourier transform, because we will need it later on anyway. This is a unitary operation that implements the Fourier transform on the set of N = 2^d amplitudes in a d-qubit wavefunction. It is most well-known for its use in Shor's algorithm. To write it down, we label the basis states |x⟩ by integer numbers x = 0 ... 2^d − 1. These numbers can be decomposed into binary representation x = x_0 + 2 x_1 + 4 x_2 + 8 x_3 + ..., and x_m = 0, 1 is interpreted to be the state of qubit m in the basis state |x⟩. Then the quantum Fourier transform is the unitary given by

\frac{1}{\sqrt{N}} \sum_{k,x} e^{-ikx} \, |k\rangle \langle x| .   (53)

This means the coefficient of basis state |k⟩ after application of the qFT is indeed the Fourier transform of the original coefficients, (1/√N) Σ_x e^{-ikx} ⟨x|Ψ⟩.

In its original implementation, the qFT needed O((log N)^2) CPHASE gates, but this complexity has been improved by now to O(log N · log log N). In any case, that is exponentially faster than the classical fast Fourier transform, which needs O(N log N) operations.

Unfortunately, there is no way to build a quantum sub-processor that takes, say, 2^30 ∼ 10^9 complex numbers, stores them into a 30-qubit wave function, executes the qFT in only O(30 log 30) steps, and then returns those numbers. The whole operation would be dominated by the need to load 10^9 numbers into the sub-processor (and that is not even speaking of the challenge of reading them out, which is impossible by measurement on a single copy of the state!).

On the other hand, Shor's algorithm does exploit the qFT to obtain a real exponential advantage over a classical computer. The trick is, of course, that the amount of data to be loaded is very modest (one single big number to be factorized), and the exponentially large set of complex amplitudes that the qFT operates on are generated internally. This points a way towards applications of quantum machine learning that are not subject to the curse of loading classical data: try to find quantum data on which to operate!
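As a quick numerical check of Eq. (53): with the standard convention where k and x both run over 0 ... N−1, the matrix element between |k⟩ and |x⟩ is e^{-2πi kx/N}/√N (up to the sign convention of the exponent), which is the unitary discrete Fourier transform. The small numpy sketch below builds this N×N matrix and checks it against numpy's FFT (classically, of course, with no claim to any speed-up).

import numpy as np

d = 5                      # number of qubits
N = 2**d                   # N = 2^d amplitudes

# Matrix of the quantum Fourier transform, F[k, x] = exp(-2*pi*1j*k*x/N) / sqrt(N)
k = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(k, k) / N) / np.sqrt(N)

psi = np.random.randn(N) + 1j * np.random.randn(N)
psi /= np.linalg.norm(psi)          # a random normalized d-qubit "wavefunction"

print(np.allclose(F @ psi, np.fft.fft(psi, norm='ortho')))   # True: same transform
print(np.allclose(F.conj().T @ F, np.eye(N)))                # True: F is unitary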
5.2 Quantum Neural Networks

When talking about quantum-accelerated machine learning, the obvious first question is what would be the quantum generalization of an artificial neural network. Several aspects arise.

On the positive side, there is an obvious conceptual link, in that the simplest version of a neuron would be a binary neuron, which can be either in its 'resting' state (0) or in an activated state (1). This could be directly translated into a qubit, which now can also be in a superposition of such states – and multiple qubit-neurons could be in an entangled multi-qubit superposition state.

However, beyond that, there are essential problematic differences. First, the typical artificial neural network is an irreversible information processing device: multiple inputs may lead to one and the same output (that is obvious for image labeling, where many different images will obtain the label 'cat'). On the other hand, if one tries to exploit the power of quantum mechanics, a quantum neural network should presumably be isolated and coherent. In that case, the dynamics is reversible (unitary). Thus, we cannot immediately build a quantum version of the usual feedforward neural network. Nevertheless, one may try to create a quantum version of reversible classical neural networks (which are constructed such that the mapping between input and output is bijective). Alternatively, the dynamics could be made partially dissipative (e.g. by intermediate measurements or coupling to a bath) – which brings up the challenge to demonstrate that under these conditions there is still a quantum advantage.

Second, while the linear steps in a classical neural network (matrix-vector multiplication for each layer) seem to correspond straightforwardly to unitary, linear quantum dynamics, the nonlinear character of classical neural networks is essential. This essential nonlinearity can be built into the quantum neural network via multi-qubit interactions, so that the network has sufficient power (notwithstanding the fact that the unitary dynamics in the multi-qubit Hilbert space is still linear, just as in a quantum computer). Alternatively, nonlinearity could be generated by measurements and feedback into the quantum device based on those measurements, though this disrupts the quantum coherence.

Third, a naive application of a hypothetical fully coherent quantum artificial neural network to a quantum superposition of inputs would just result in a quantum superposition of outputs. The final measurement of the output would then collapse this superposition to a single output state. Therefore, the device would simply act like a classical neural network that spits out the answer to a randomly selected input state. No speed-up would ensue. This, of course, is the general reason why it is so hard to come up with good quantum algorithms – it is not sufficient to naively rely on quantum parallelism.
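This point can be made concrete in a few lines: embed an arbitrary classical map f reversibly as |x, y⟩ → |x, y ⊕ f(x)⟩, apply the resulting permutation unitary to an equal superposition of all inputs, and measure. The minimal numpy sketch below (with an arbitrary toy map f chosen purely for illustration) returns a single randomly selected input together with its label, which is no more than a classical evaluation of f on one random input would give.

import numpy as np

n_in = 3                              # input register: 3 qubits, 8 possible inputs
N = 2**n_in

def f(x):
    # some classical "network" mapping inputs to labels 0..3 (arbitrary toy example)
    return (3 * x + 1) % 4

# Unitary on |x>|y>:  |x, y> -> |x, y XOR f(x)>  (the standard reversible embedding)
dim = N * 4
U = np.zeros((dim, dim))
for x in range(N):
    for y in range(4):
        U[x * 4 + (y ^ f(x)), x * 4 + y] = 1.0

# Apply it to an equal superposition of all inputs, with the output register in |0>:
psi = np.zeros(dim)
psi[[x * 4 for x in range(N)]] = 1 / np.sqrt(N)
psi = U @ psi

# Measuring both registers just samples one random input together with its label:
probs = np.abs(psi)**2
outcome = np.random.choice(dim, p=probs)
x, y = divmod(outcome, 4)
print(x, y, f(x) == y)    # a single, randomly selected input-output pair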
5.3 The quantum Boltzmann machine

Earlier, we discussed the (classical) Boltzmann machine which can learn to reproduce the statistical properties of a large training data set and produce new samples according to this distribution. Can one do the same for "quantum data"?

In quantum mechanics, one way to deal with an ensemble of statistically sampled quantum states is to represent it by a density matrix. We could, therefore, ask for a "quantum Boltzmann machine" (QBM) which is able to "learn" the density matrix ρ̂ of a given quantum state [18, 1, 4]. Presumably, for the challenge to be interesting, we are trying to learn the state of a quantum many-body system. To make this happen, we assume the QBM is in a thermal state σ̂, and it obeys a Hamiltonian that has sufficiently many tuneable parameters, such that

\hat{\sigma} = \frac{e^{-\sum_j w_j \hat{H}_j}}{\mathrm{tr}\,[e^{-\sum_j w_j \hat{H}_j}]} .   (54)

Here, the inverse temperature has been absorbed into the definition of the weights w_j (meaning w_j = β w_j^{physical}). We need some measure of the deviation between target ρ̂ and QBM state σ̂, to serve as our cost function. One option [18] is the relative entropy,

S(\hat{\rho} \| \hat{\sigma}) = \mathrm{tr}[\hat{\rho} \ln \hat{\rho}] - \mathrm{tr}[\hat{\rho} \ln \hat{\sigma}] .   (55)

We will apply gradient descent. The derivative with respect to the weights is

\frac{\partial}{\partial w_j} S(\hat{\rho} \| \hat{\sigma}) = \mathrm{tr}[\hat{\rho} \hat{H}_j] - \mathrm{tr}[\hat{\sigma} \hat{H}_j] .   (56)

Thus, to update the weights, we need to measure the expectation values of the observables Ĥ_j in the target state (once), as well as in the QBM state (repeatedly, since the weights are evolving during training). If we think of the QBM as a quantum spin system, then some of the Ĥ_j might be single-spin operators (with the corresponding w_j as effective magnetic fields) and others might be two-spin operators (with w_j denoting a coupling constant). Longer-range interactions will allow more expressive freedom for the QBM. Provided that we do not need exponentially many tuneable parameters to achieve a good approximation, there will be the usual quantum speed-up of a quantum simulator: the QBM will yield the expectation values exponentially faster than a classical computer would be able to compute them.

Implementing a QBM in this way, to approximate an interesting quantum many-body state, is still a formidable challenge. For example, if we implement longer-range couplings in order to make the QBM more powerful, then we also need to be able to measure the corresponding two-point correlators in the target state. In addition, it may not even be clear in the beginning how to most effectively establish a correspondence between the degrees of freedom in the target system Hilbert space and the degrees of freedom of the QBM. Such a correspondence has been assumed implicitly in writing down the expression for the cost function above, since Ĥ_j must be able to act on both Hilbert spaces. In practice, we will have to set up a 'translation table' that determines, e.g., which spin operator in the QBM relates to which operator in the target system.
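The training loop of Eqs. (54)-(56) is easy to simulate classically on a small example. Below is a minimal numpy/scipy sketch for two spins with single-spin Z and X terms plus a ZZ coupling (the operator choice, the "true" weights and the learning rate are arbitrary illustrations; the target is itself taken to be a QBM state so that exact learning is possible, whereas in the intended application the expectation values tr[ρ̂Ĥ_j] would come from measurements on the target system):

import numpy as np
from scipy.linalg import expm, logm

# Operators H_j of the QBM "Hamiltonian": single-spin Z and X terms plus a ZZ coupling
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
H_terms = [np.kron(Z, I2), np.kron(I2, Z), np.kron(X, I2), np.kron(I2, X), np.kron(Z, Z)]

def qbm_state(w):
    """sigma(w) = exp(-sum_j w_j H_j) / tr[...], i.e. Eq. (54) with beta absorbed into w."""
    M = expm(-sum(wj * Hj for wj, Hj in zip(w, H_terms)))
    return M / np.trace(M)

# Target state: here itself a QBM state with some "true" weights, so that exact
# learning is possible.
w_true = np.array([0.7, -0.3, 0.5, 0.2, -0.8])
rho = qbm_state(w_true)

w = np.zeros(len(H_terms))
eta = 0.1                                    # learning rate (arbitrary choice)
for step in range(5000):
    sigma = qbm_state(w)
    # Eq. (56): gradient of the relative entropy S(rho || sigma) w.r.t. the weights
    grad = np.array([np.trace(rho @ Hj).real - np.trace(sigma @ Hj).real
                     for Hj in H_terms])
    w -= eta * grad                          # gradient-descent update

S = np.trace(rho @ (logm(rho) - logm(qbm_state(w)))).real
print(S)     # relative entropy after training: close to zero
print(w)     # should approach w_true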
5.4 The quantum principal component analysis

One example of quantum data that is typically hard to analyze is a quantum many-body state, expressed via its density matrix ρ̂, which is exponentially large in the number of degrees of freedom. Is there a way to analyze it with the help of quantum subroutines? For example, can we decompose it into its eigenvectors and study the most important ones, i.e. those with the largest eigenvalues?

The answer is yes, and the tool invented for this task is called the quantum principal component analysis (qPCA) [22]. It is a nice example that illustrates the power of quantum-accelerated data processing – and also the range of tricks from the quantum computation toolbox that go into the construction of such an algorithm. We will now indicate the main steps.

One way of obtaining the eigenvalues and -vectors of a (Hermitian) matrix ρ̂ on a classical computer would be to consider the exponential e^{-iρ̂t} and Fourier-transform it with respect to time. Since ρ̂ = Σ_l p_l |v_l⟩⟨v_l| in its eigenbasis, we have e^{-iρ̂t} = Σ_l e^{-i p_l t} |v_l⟩⟨v_l|, and the Fourier transform would be (1/2π) ∫_{-∞}^{+∞} dt e^{iωt} e^{-iρ̂t} = Σ_l δ(ω − p_l) |v_l⟩⟨v_l|. The eigenvalues can then be read off from the resonance peaks in this Fourier transform, and their "weight" is the projector onto the eigenvector. Even a Fourier transform over a finite time-range t will be able to resolve eigenvalues that are further apart than 1/t. Of course, this algorithm is nowhere near as efficient as the best classical algorithms for diagonalization of an N × N matrix, but it is a feasible method.
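This classical procedure is easy to try out numerically. The sketch below (with a 4×4 density matrix of arbitrarily chosen spectrum in a random eigenbasis, and an arbitrary time window) records tr e^{-iρ̂t} over a finite time range and Fourier-transforms it; taking the trace just gives a scalar signal, while the full matrix-valued transform would additionally carry the projectors as weights. The peaks of the spectrum sit at the eigenvalues of ρ̂, with the finite-time resolution mentioned above (eigenvalues closer than about 2π/T merge).

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)

# A 4x4 density matrix with a chosen spectrum and a random eigenbasis
p_exact = np.array([0.4, 0.3, 0.2, 0.1])
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
rho = Q @ np.diag(p_exact) @ Q.conj().T

# Record tr[exp(-1j*rho*t)] = sum_l exp(-1j*p_l*t) over a finite time window T
T, steps = 400.0, 4096
dt = T / steps
times = dt * np.arange(steps)
signal = np.array([np.trace(expm(-1j * rho * t)) for t in times])

# Fourier transform with e^{+i*omega*t} (numpy's ifft): peaks sit at the eigenvalues p_l
spectrum = np.abs(np.fft.ifft(signal))
omegas = 2 * np.pi * np.fft.fftfreq(steps, d=dt)
peaks = [k for k in range(1, steps - 1)
         if spectrum[k] > spectrum[k - 1] and spectrum[k] > spectrum[k + 1]
         and spectrum[k] > 0.5 * spectrum.max()]

print(np.sort(omegas[peaks]))            # estimated eigenvalues from the peak positions
print(np.sort(p_exact))                  # exact eigenvalues, for comparison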
The basic idea of qPCA is to take this method and accelerate it via the quantum Fourier transform. However, this first requires us to produce e^{-iρ̂t}, the exponential of a density matrix, on a quantum machine!

One elementary but important observation is that the eigenvalues (and -vectors) of any matrix ρ̂ are nonlinear functions of the elements of that matrix. In the present context this means there is no way to apply a (fixed) unitary to an arbitrary ρ̂ and end up with its eigenvalues and -vectors, since that would be a linear operation. Indeed, the exponential e^{-iρ̂t} mentioned above is nonlinear in ρ̂. This means our quantum machine will have to operate on states that are already themselves nonlinear in ρ̂, i.e. of the form ρ̂ ⊗ ρ̂ ⊗ ρ̂ ⊗ ... ⊗ ρ̂ – we will therefore necessarily need multiple identically prepared copies of the state. If ρ̂ is the state of a quantum many-body system, multiple copies of this system (with identical parameters) will have to be prepared, which requires some experimental effort.

Consider the unitary e^{-iρ̂t} acting on some state σ̂, i.e. try to compute e^{-iρ̂t} σ̂ e^{+iρ̂t} ≈ σ̂ − it[ρ̂, σ̂] + .... The crucial trick introduced in Ref. [22] is the realization that this can be obtained to leading order by performing an exponential SWAP operation on a product state of ρ̂ and σ̂:

\mathrm{tr}_1 \!\left[ e^{-i\hat{S}\Delta t} \, \hat{\rho} \otimes \hat{\sigma} \, e^{+i\hat{S}\Delta t} \right] = \hat{\sigma} - i \Delta t \, [\hat{\rho}, \hat{\sigma}] + O(\Delta t^2) .   (57)

Here Ŝ is the SWAP which operates on two subspaces by Ŝ |i⟩⊗|j⟩ = |j⟩⊗|i⟩, and tr_1 is the partial trace over the first subsystem (i.e. we discard this system after the operation and will never measure it). The density matrix ρ̂ describes the quantum many-body system and σ̂ is the state of the quantum computer. We need to be able to do partial swaps e^{-iŜΔt} on corresponding pairs of qubits of both these systems. Again, this is not trivial experimentally, since it presumes, e.g., that these operations can be carried out fast enough that the many-body system does not evolve (or its dynamics has to be frozen, e.g. by setting couplings to zero). Repeated application of the trick in Eq. (57) to a state σ̂ ⊗ ρ̂ ⊗ ρ̂ ⊗ ρ̂ ⊗ ... will result in a higher-order approximation to e^{-iρ̂t}, where we have to apply the exponential SWAP (and the partial trace) separately to each of the multiple copies of ρ̂.
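Eq. (57) is also easy to check numerically for the smallest case of two single-qubit subsystems. The plain numpy/scipy sketch below (with the 4×4 SWAP written out explicitly and randomly generated ρ̂ and σ̂) compares the left- and right-hand sides for decreasing Δt; the deviation shrinks like Δt², as expected from the O(Δt²) remainder.

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

def random_density_matrix(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    M = A @ A.conj().T
    return M / np.trace(M)

def partial_trace_first(M, d):
    """Trace out the first factor of a (d*d x d*d) matrix acting on H1 (x) H2."""
    return np.einsum('ikil->kl', M.reshape(d, d, d, d))

d = 2
rho, sigma = random_density_matrix(d), random_density_matrix(d)
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

for dt in [0.1, 0.05, 0.025]:
    U = expm(-1j * SWAP * dt)
    lhs = partial_trace_first(U @ np.kron(rho, sigma) @ U.conj().T, d)
    rhs = sigma - 1j * dt * (rho @ sigma - sigma @ rho)
    print(dt, np.linalg.norm(lhs - rhs))   # error shrinks proportional to dt^2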
We now want to exploit the qFT. To this end, we do not just need e^{-iρ̂t} for one particular value of the time t, but for a whole time interval. Moreover, for the qFT, the whole time trace has to be in the quantum memory simultaneously (as opposed to repeatedly running the quantum computer for different values of time t). The way this is done is to set up some auxiliary degrees of freedom that are in a superposition of states |nΔt⟩ which label time (the n would be an integer, and the encoding would be done in the way we discussed for the qFT above). Afterwards, one would apply the exponential of ρ̂ in the manner discussed above, but conditioned on the auxiliary state. This results in:

\sum_n |n\Delta t\rangle \otimes e^{-i\hat{\rho}\, n\Delta t} |\chi\rangle ,   (58)

where |χ⟩ was the original state of the quantum computer. This is an entangled state, entangling the "time-label states" with the corresponding time-evolved states. To do this, one has to perform conditional SWAP gates.

Finally, one can apply the qFT to this state. Note that instead of simple complex amplitudes we now have a quantum state attached to each time bin. The qFT will result in the spectrum, which is peaked near the eigenfrequencies, with the corresponding eigenvectors attached. Considering the special case where the initial state is ρ̂ itself, one obtains [22] for the final state of the quantum computer (now again writing everything as a mixed state):

\sum_j p_j \, |\tilde{p}_j\rangle \langle \tilde{p}_j| \otimes |v_j\rangle \langle v_j| .   (59)

Here the p_j are the eigenvalues of ρ̂, |v_j⟩ are its eigenvectors, and |p̃_j⟩ represent vectors peaked around the eigenvalues p_j in the Hilbert space that represents the frequencies after application of the qFT.

One can now sample by measurements from this state, to obtain properties of the eigenvalues and -vectors of the density matrix. A measurement of the first Hilbert space (where the |p̃_j⟩⟨p̃_j| live) will project the overall state down to a random eigenvector, but with larger probability for the more important ones (where p_j is larger). Afterwards, arbitrary properties of the eigenvector |v_j⟩ can be measured.

As shown in [22], the overall running time of qPCA grows polynomially in the number of particles, rather than the exponential growth of effort that "brute-force" normal quantum state tomography would require. There are of course multiple challenges for the qPCA: one needs multiple copies of the many-body quantum state, to produce, via controlled SWAP operations, a single copy of one (randomly selected) eigenstate, on which one can then perform a few measurements (of commuting observables). Afterwards, the whole procedure has to be repeated again on fresh copies, to find out more about some other randomly selected eigenstate. And if the original data is presented in classical form (rather than a quantum many-body state), one runs into the bottleneck mentioned above. In this context, it is important to mention that qPCA was one of the first quantum algorithms for which a quantum-inspired classical counterpart was found recently [29]. The scenario assumed there is to have "sampling" access to classical data, i.e. to be able to obtain a particular component of a vector, at random with a certain probability prescribed by the vector. This then leads to a stochastic algorithm that does not require exponential effort.

5.5 Quantum reinforcement learning

If both the agent and the environment are quantum, one can imagine a fully quantum-mechanical version of reinforcement learning. In the simplest, direct translation from the classical domain, we would have an agent+environment quantum device that proceeds through all training trajectories simultaneously (i.e. all sequences of training epochs, with all possible evolutions of the reward). However, once we measure the agent, we would be left in only one branch, and overall there would be no gain in time needed to reach this reward level.

The question is, therefore, how to exploit some quantum algorithm for RL. The most famous quantum algorithm, Shor's factoring, with its exponential acceleration due to the quantum Fourier transform, is relatively specialized. By contrast, Grover's algorithm for search among N items has been a very useful starting point for diverse applications. Luckily, RL in its simplest incarnations can be viewed as a search problem that may benefit from Grover's scheme, which accelerates search from the classically expected O(N) steps to O(√N) steps. While this may not seem much, compared to exponential acceleration, it can still be substantial if the number N of database entries is large (e.g. imagine N ∼ 10^{12}, leading to a millionfold acceleration!).

Imagine a simplified toy RL problem, where only precisely one "good" sequence of actions yields a reward of 1, while all other sequences yield reward 0. This is directly a search problem, where the action sequences can be taken as the entries in the database, and the reward is the function used to label the "good" entry in Grover's algorithm. If we have a quantum environment that can yield the reward given an arbitrary action sequence, then it can be used as the quantum oracle in Grover's algorithm, accelerating the RL search. This is the basic idea exploited in [10].
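The toy problem just described can be simulated classically in a few lines, which makes the square-root speed-up explicit. The minimal numpy sketch below (with an arbitrary number of binary action choices and an arbitrarily chosen rewarded sequence) implements Grover's iteration, with the environment/oracle reduced to a phase flip on the single rewarded bit string:

import numpy as np

n_bits = 10                      # an action sequence = 10 binary choices
N = 2**n_bits
good = 731                       # the single rewarded action sequence (known only to the oracle)

# Start from the uniform superposition over all action sequences
psi = np.ones(N) / np.sqrt(N)

iterations = int(np.floor(np.pi / 4 * np.sqrt(N)))
for _ in range(iterations):
    psi[good] *= -1                            # oracle: phase flip on the rewarded sequence
    psi = 2 * psi.mean() - psi                 # "inversion about the mean" (diffusion step)

print(iterations, abs(psi[good])**2)

About π/4 · √N ≈ 25 oracle calls suffice here to make the success probability close to 1, whereas classically one would expect to query the environment roughly N/2 = 512 times before hitting the rewarded sequence.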
References

[1] Mohammad H. Amin, Evgeny Andriyash, Jason Rolfe, Bohdan Kulchytskyy, and Roger Melko. Quantum Boltzmann Machine. Physical Review X, 8(2), May 2018.

[2] Moritz August and José Miguel Hernández-Lobato. Taking gradients through experiments: LSTMs and memory proximal policy optimization for black-box quantum control. arXiv:1802.04063 [quant-ph], April 2018.

[3] Paul Baireuther, Thomas E. O'Brien, Brian Tarasinski, and Carlo W. J. Beenakker. Machine-learning-assisted correction of correlated qubit errors in a topological code. Quantum, 2:48, January 2018.

[4] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195–202, September 2017.

[5] Hans J. Briegel and Gemma De las Cuevas. Projective simulation for artificial intelligence. Scientific Reports, 2(1), December 2012.

[6] Marin Bukov, Alexandre G. R. Day, Dries Sels, Phillip Weinberg, Anatoli Polkovnikov, and Pankaj Mehta. Reinforcement Learning in Different Phases of Quantum Control. Physical Review X, 8(3), September 2018.

[7] Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. Reviews of Modern Physics, 91(4), December 2019.

[8] Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: a classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, January 2018.

[9] Vedran Dunjko and Hans J. Briegel. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Reports on Progress in Physics, 81(7):074001, July 2018.

[10] Vedran Dunjko, Jacob M. Taylor, and Hans J. Briegel. Quantum-Enhanced Machine Learning. Physical Review Letters, 117(13), September 2016.

[11] Emmanuel Flurin, Leigh S. Martin, Shay Hacohen-Gourgy, and Irfan Siddiqi. Using a Recurrent Neural Network to Reconstruct Quantum Dynamics of a Superconducting Qubit from Physical Observations. arXiv:1811.12420 [quant-ph], December 2018.

[12] Thomas Fösel, Petru Tighineanu, Talitha Weiss, and Florian Marquardt. Reinforcement Learning with Neural Networks for Quantum Feedback. Physical Review X, 8(3), September 2018.

[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Adaptive computation and machine learning. The MIT Press, Cambridge, Massachusetts, 2016.

[14] Johnnie Gray, Leonardo Banchi, Abolfazl Bayat, and Sougato Bose. Machine-Learning-Assisted Many-Body Entanglement Measurement. Physical Review Letters, 121(15), October 2018.

[15] Alexander Hentschel and Barry C. Sanders. Machine Learning for Precise Quantum Measurement. Physical Review Letters, 104(6), February 2010.

[16] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, November 1997.

[17] Navin Khaneja, Timo Reiss, Cindie Kehlet, Thomas Schulte-Herbrüggen, and Steffen J. Glaser. Optimal control of coupled spin dynamics: design of NMR pulse sequences by gradient ascent algorithms. Journal of Magnetic Resonance, 172(2):296–305, February 2005.
[18] Mária Kieferová and Nathan Wiebe. Tomography and generative training with quantum Boltzmann machines. Physical Review A, 96(6), December 2017.

[19] Stefan Krastanov and Liang Jiang. Deep Neural Network Probabilistic Decoder for Stabilizer Codes. Scientific Reports, 7(1), December 2017.

[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[21] D. T. Lennon, H. Moon, L. C. Camenzind, Liuqi Yu, D. M. Zumbühl, G. A. D. Briggs, M. A. Osborne, E. A. Laird, and N. Ares. Efficiently measuring a quantum device using machine learning. npj Quantum Information, 5(1), December 2019.

[22] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10(9):631–633, September 2014.

[23] Alessandro Lumino, Emanuele Polino, Adil S. Rab, Giorgio Milani, Nicolò Spagnolo, Nathan Wiebe, and Fabio Sciarrino. Experimental Phase Estimation Enhanced by Machine Learning. Physical Review Applied, 10(4), October 2018.

[24] Pankaj Mehta, Marin Bukov, Ching-Hao Wang, Alexandre G. R. Day, Clint Richardson, Charles K. Fisher, and David J. Schwab. A high-bias, low-variance introduction to Machine Learning for physicists. Physics Reports, 810:1–124, May 2019.

[25] Alexey A. Melnikov, Hendrik Poulsen Nautrup, Mario Krenn, Vedran Dunjko, Markus Tiersch, Anton Zeilinger, and Hans J. Briegel. Active learning machine learns to create new quantum experiments. Proceedings of the National Academy of Sciences, 115(6):1221–1226, February 2018.

[26] Murphy Yuezhen Niu, Sergio Boixo, Vadim N. Smelyanskiy, and Hartmut Neven. Universal quantum control through deep reinforcement learning. npj Quantum Information, 5(1), December 2019.

[27] Stuart J. Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson India Education Services Pvt. Ltd., Noida, India, 2018.

[28] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. An introduction to quantum machine learning. Contemporary Physics, 56(2):172–185, April 2015.

[29] Ewin Tang. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. arXiv:1811.00414 [quant-ph], October 2018.

[30] Giacomo Torlai and Roger G. Melko. Neural Decoder for Topological Codes. Physical Review Letters, 119(3), July 2017.

[31] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579, 2008.