Florian Marquardt
Max Planck Institute for the Science of Light, Staudtstraße 2, 91058 Erlangen, Germany
and Physics Department, University of Erlangen-Nuremberg, Staudtstraße 5, 91058 Erlangen, Germany
(Received 23 April 2018; revised manuscript received 12 June 2018; published 27 September 2018)
Machine learning with artificial neural networks is revolutionizing science. The most advanced
challenges require discovering answers autonomously. In the domain of reinforcement learning, control
strategies are improved according to a reward function. The power of neural-network-based reinforcement
learning has been highlighted by spectacular recent successes such as playing Go, but its benefits for
physics are yet to be demonstrated. Here, we show how a network-based “agent” can discover complete
quantum-error-correction strategies, protecting a collection of qubits against noise. These strategies require
feedback adapted to measurement outcomes. Finding them from scratch without human guidance and
tailored to different hardware resources is a formidable challenge due to the combinatorially large search
space. To solve this challenge, we develop two ideas: two-stage learning with teacher and student networks
and a reward quantifying the capability to recover the quantum information stored in a multiqubit system.
Beyond its immediate impact on quantum computation, our work more generally demonstrates the promise
of neural-network-based reinforcement learning in physics.
DOI: 10.1103/PhysRevX.8.031084        Subject Areas: Computational Physics, Interdisciplinary Physics, Quantum Information
FIG. 1. (a) The general setting of this work: A few-qubit quantum device with a neural-network-based controller whose task is to protect the quantum memory residing in this device against noise. Reinforcement learning lets the controller ("RL agent") discover on its own how to best choose gate sequences, perform measurements, and react to measurement results by interacting with the quantum device ("RL environment"). (b) visualizes the flexibility of our approach (schematic). Depending on the type of noise and hardware setting, different approaches are optimal (DD, dynamical decoupling; DFS, decoherence-free subspace). By contrast, the RL approach is designed to automatically discover the best strategy adapted to the situation. In (c), we show the conventional procedure to select some QEC algorithm and then produce hardware-adapted device instructions (possibly reiterating until an optimal choice is found). We compare this to our approach (d) that takes care of all these steps at once and provides QEC strategies fully adapted to the concrete specifications of the quantum device.

These modules could be used as stand-alone quantum memory or be part of the modular approach to quantum computation, which has been suggested for several leading hardware platforms [19,20].

Given a collection of qubits and a set of available quantum gates, the agent is asked to preserve an arbitrary quantum state α|0⟩ + β|1⟩ initially stored in one of the qubits. It finds complex sequences including projective measurements and entangling gates, thereby protecting the quantum information stored in such a few-qubit system against decoherence. This challenge is so complex that both brute-force searches and even the most straightforward RL approaches fail. The success of our approach is due to a combination of two key ideas: (i) two-stage learning, with an RL-trained network receiving maximum input acting as a teacher for a second network, and (ii) a measure of the recoverable quantum information hidden inside a collection of qubits being used as a reward. It can discover strategies for such a wide range of situations with minimal domain-specific input. We illustrate this flexibility in examples from two different domains: In one set of examples (uncorrelated bit-flip noise), the network is able to go beyond rediscovering the textbook repetition code. It finds an adaptive response to unexpected measurement results that allows it to increase the coherence time, performing better than any straightforward nonadaptive implementation. Simultaneously, it automatically discovers suitable gate sequences for various types of hardware settings. In another, very different example, the agent learns to counter spatially correlated noise by finding nontrivial adaptive phase-estimation strategies that quickly become intractable by conventional numerical approaches such as brute-force search. Crucially, all these examples can be treated by exactly the same approach with no fine-tuning. The only input consists in the problem specification (hardware and noise model).

In a nutshell, our goal is to have a neural network that can be employed in an experiment, receiving measurement results and selecting suitable subsequent gate operations conditioned on these results. However, in our two-stage learning approach, we do not directly train this neural network from scratch. Rather, we first employ reinforcement learning to train an auxiliary network that has full knowledge of the simulated quantum evolution. Later on, the experimentally applicable network is trained in a supervised way to mimic the behavior of this auxiliary network.

We emphasize that feedback requires reacting to the observations, going beyond optimal-control-type challenges (like pulse-shape optimization or dynamical decoupling), and RL has been designed for exactly this purpose. Specifically, in this work, we consider discrete-time digital feedback of the type that is now starting to be implemented experimentally [31–35], e.g., for error correction in superconducting quantum computers. Other widespread optimization techniques for quantum control, like gradient-ascent pulse engineering (GRAPE), often vary evolution operators with respect to continuous parameters [36,37] but do not easily include feedback and are most suited for optimizing the pulse shapes of individual gates (rather than complex gate sequences acting on many qubits). Another recent approach [38] to quantum error correction uses the optimization of control parameters in a preconfigured gate sequence. By contrast, RL directly explores the space of discrete gate sequences. Moreover, it is a "model-free" approach [2]; i.e., it does not rely on access to the underlying dynamics. What is optimized is the network
REINFORCEMENT LEARNING WITH NEURAL NETWORKS … PHYS. REV. X 8, 031084 (2018)
agent. Neural-network-based RL promises to complement other successful machine-learning techniques applied to quantum control [39–42].

Conceptually, our approach aims to control a quantum system using a classical neural network. To avoid confusion, we emphasize that our approach is distinct from future "quantum machine learning" devices, where even the network will be quantum [8,43,44].

II. REINFORCEMENT LEARNING

The purpose of RL [Fig. 1(a)] is to find an optimal set of actions (in our case, quantum gates and measurements) that an agent can perform in response to the changing state of an environment (here, the quantum memory). The objective is to maximize the expected return $R$, i.e., a sum of rewards. To find optimal gate sequences, we employ a widespread version of reinforcement learning [45,46] where discrete actions are selected at each time step $t$ according to a probabilistic policy $\pi_\theta$. Here, $\pi_\theta(a_t|s_t)$ is the probability to apply action $a_t$, given the state $s_t$ of the RL environment. As we use a neural network to compute $\pi_\theta$, the multidimensional parameter $\theta$ stands for all the network's weights and biases. The network is fed $s_t$ as an input vector and outputs the probabilities $\pi_\theta(a_t|s_t)$. The expected return can then be maximized by applying the policy gradient RL update rule [45]

$$\delta\theta_j = \eta\,\frac{\partial E[R]}{\partial \theta_j} = \eta\, E\!\left[ R \sum_t \frac{\partial}{\partial \theta_j} \ln \pi_\theta(a_t|s_t) \right], \tag{1}$$

with $\eta$ the learning rate parameter and $E$ the expectation value over all gate sequences and measurement outcomes. These ingredients summarize the basic policy gradient approach. In practice, improvements of Eq. (1) are used; for example, we employ a baseline, natural policy gradient, and entropy regularization (see Appendix H). Even so, several further conceptual steps are essential to have any chance of success (see below).

Equation (1) provides the standard recipe for a fully observed environment. This approach can be extended to a partially observed environment, where the policy will then be a function of the observations only instead of the state. The observations contain partial information on the actual state of the environment. In the present manuscript, we encounter both cases.

III. REINFORCEMENT LEARNING APPROACH TO QUANTUM MEMORY

In this work, we seek to train a neural network to develop strategies to protect the quantum information stored in a quantum memory from decoherence. This involves both variants of stabilizer-code-based QEC [47–49] as well as other more specialized (but, in their respective domain, more resource-efficient) approaches like decoherence-free subspaces or phase estimation. We remind the reader that for the particular case of stabilizer-code-based QEC, the typical steps are (i) the encoding, in which the logical state initially stored in one qubit is distributed over several physical qubits, (ii) the detection of errors via measurement of suitable multiqubit operators (syndromes), (iii) the subsequent correction, and (iv) the decoding procedure that transfers the encoded state back into one physical qubit. We stress that no such specialized knowledge is provided a priori to our network, thus retaining maximum flexibility in the tasks it might be applied to and in the strategies it can encompass [Fig. 1(b)].

We start by storing an arbitrary quantum state α|0⟩ + β|1⟩ inside one physical qubit. The goal is to be able to retrieve this state with optimum fidelity after a given time span. Given hardware constraints such as the connectivity between qubits, the network agent must develop an efficient QEC strategy from scratch solely by interacting with the quantum memory at every time step via a set of unitary gates (such as controlled-NOT gates and bit flips) and measurements. They are chosen according to the available hardware and define the action set of the agent. Importantly, the network must react and adapt its strategy to the binary measurement results, providing real-time quantum feedback.

This particular task seems practically unsolvable for the present reinforcement-learning techniques if no extra precautions are taken. The basic challenge is also encountered in other difficult RL applications: The first sequence leading to an increased return is rather long. In our scenarios, the probability to randomly select a good sequence is much less than $10^{-12}$. Moreover, any subsequence may be worse than the trivial (idle) strategy: For example, performing an incomplete encoding sequence (ending up in a fragile entangled state) can accelerate decay. Adopting the straightforward return, namely, the overlap of the final and initial states, both the trivial strategy and the error-correction strategy are fixed points. These are separated by a wide barrier, namely, all the intermediate-length sequences with lower return. In our numerical experiments, naive RL was not successful, except for some tasks with very few qubits and gates.

We introduce two key concepts to solve this challenge: a two-stage learning approach with one network acting as teacher of another and a measure of the "recoverable quantum information" retained in any quantum memory. Before we address these concepts, we mention that from a machine-learning point of view there is another unconventional aspect: Instead of sampling initial states of the RL environment stochastically, we consider the evolution under the influence of the agent's actions for all possible states simultaneously. This is required because the quantum memory has to preserve arbitrary input states. Our reward is based on the completely positive map describing the dissipative quantum evolution of arbitrary states. The only
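As an illustration of the update rule of Eq. (1), the following self-contained sketch applies a plain REINFORCE-style policy-gradient step to a toy problem. The linear softmax policy, the reward (count of how often action 0 is chosen), and all sizes and learning parameters are invented for illustration; none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a softmax policy pi_theta(a|s) over n_actions for one-hot states.
n_states, n_actions, eta = 4, 3, 0.1
theta = np.zeros((n_states, n_actions))      # all "weights" of a minimal policy

def policy(s):
    """Softmax policy pi_theta(.|s) for state index s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def run_episode(T=5):
    """Sample a trajectory; toy reward: +1 whenever action 0 is chosen."""
    states = rng.integers(0, n_states, size=T)
    actions, R = [], 0.0
    for s in states:
        a = rng.choice(n_actions, p=policy(s))
        actions.append(a)
        R += 1.0 if a == 0 else 0.0
    return states, actions, R

# Eq. (1): delta theta = eta * E[ R * sum_t d/dtheta ln pi_theta(a_t|s_t) ],
# estimated here from a single sampled trajectory per update.
for _ in range(2000):
    states, actions, R = run_episode()
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        p = policy(s)
        grad[s] -= p             # d ln pi(a|s) / d logits = onehot(a) - p
        grad[s, a] += 1.0
    theta += eta * R * grad

# After training, action 0 should dominate the policy in every state.
print(all(policy(s).argmax() == 0 for s in range(n_states)))
```

In the paper's setting the policy network is far larger, the actions are quantum gates and measurements, and the gradient estimate is refined by a baseline, natural gradient, and entropy regularization; the sketch only shows the bare Eq. (1) mechanics.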
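The second stage of the two-stage scheme, supervised training of a directly applicable "student" network to imitate the teacher, can be sketched as follows. Both networks are stood in for by linear softmax models, and all dimensions, learning rates, and data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# A frozen "teacher" (stand-in for the RL-trained, fully informed network)
# labels observations with action probabilities; a "student" is then fit by
# cross-entropy to imitate it, mirroring the supervised second stage.
n_obs, n_actions = 6, 4
W_teacher = rng.normal(size=(n_obs, n_actions))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(500, n_obs))           # observed inputs
targets = softmax(X @ W_teacher)            # teacher's pi(a|obs)

W_student = np.zeros((n_obs, n_actions))
for _ in range(3000):                       # gradient descent on cross-entropy
    probs = softmax(X @ W_student)
    W_student += 0.1 * X.T @ (targets - probs) / len(X)

# The student now selects the teacher's most likely action on nearly all inputs.
agreement = (softmax(X @ W_student).argmax(1) == targets.argmax(1)).mean()
print(agreement > 0.9)
```

In the paper the student only receives experimentally available observations (measurement results and past gate choices) rather than the teacher's full state information, which is the point of the distillation; the sketch compresses that to a shared input for brevity.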
FÖSEL, TIGHINEANU, WEISS, and MARQUARDT PHYS. REV. X 8, 031084 (2018)
[Figure: (a) recoverable quantum information vs training epoch (ticks 0, 500, 2500, 22500; curves for nonadaptive detection, only encoding, and idle); (b) gate sequences: training progress (after 60 epochs, after 160 epochs, (mostly) converged); (c) recoverable quantum information for single trajectories vs time step (0 to 200); (d) stochastic measurement results (z measurement with result 0 or 1).]

by optimal measurements is given by the trace distance $\tfrac{1}{2}\|\hat\rho_1 - \hat\rho_2\|_1$. Let $\hat\rho_{\vec n}(t)$ be the quantum state into which the multiqubit system has evolved, given the initial logical qubit state of Bloch vector $\vec n$. We now consider the distinguishability of two initially orthogonal states, $\tfrac{1}{2}\|\hat\rho_{\vec n}(t) - \hat\rho_{-\vec n}(t)\|_1$. In general, this quantity may display a nontrivial, nonanalytic dependence on $\vec n$. We introduce the "recoverable quantum information" as

$$R_Q(t) = \min_{\vec n} \tfrac{1}{2}\, \|\hat\rho_{\vec n}(t) - \hat\rho_{-\vec n}(t)\|_1 . \tag{2}$$

The minimum over the full Bloch sphere is taken because the logical qubit state is unknown to the agent, so the success of
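Equation (2) can be evaluated numerically by minimizing the trace distance over (a sample of) the Bloch sphere. The sketch below does this for a single qubit sent through a toy bit-flip channel; the flip probability and sampling scheme are my own illustrative choices, not the paper's multiqubit simulation.

```python
import numpy as np

# Pauli matrices
I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def trace_distance(r1, r2):
    """(1/2)||r1 - r2||_1 for Hermitian matrices, via eigenvalues."""
    return 0.5 * np.abs(np.linalg.eigvalsh(r1 - r2)).sum()

def rho_bloch(n, p_flip=0.0):
    """Single-qubit state of Bloch vector n after a toy bit-flip channel."""
    rho = 0.5 * (I2 + n[0] * sx + n[1] * sy + n[2] * sz)
    return (1 - p_flip) * rho + p_flip * (sx @ rho @ sx)

def recoverable_q_info(p_flip, n_samples=2000):
    """R_Q of Eq. (2): minimize distinguishability of rho_n and rho_{-n}
    over a random sample of Bloch vectors n."""
    rng = np.random.default_rng(1)
    best = 1.0
    for _ in range(n_samples):
        v = rng.normal(size=3)
        v /= np.linalg.norm(v)
        best = min(best, trace_distance(rho_bloch(v, p_flip),
                                        rho_bloch(-v, p_flip)))
    return best

print(recoverable_q_info(0.0))        # noiseless: orthogonal states stay fully distinguishable
print(recoverable_q_info(0.25) < 1.0) # bit-flip noise reduces R_Q (worst axis is z here)
```

For the bit-flip channel the minimum sits on the z axis (the x axis is untouched by sigma_x errors), consistent with the paper's later remark that for CHZ circuits the minimum lies along a coordinate axis.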
FIG. 4. Visualizing the operation of the state-aware network. (Same scenario as in Fig. 3.) (a) Sequence of quantum states visited during a standard repetitive detection cycle (after encoding), displayed for an initial logical qubit state $|{+}x\rangle = (|0\rangle + |1\rangle)/\sqrt{2}$ in the absence of unexpected measurement outcomes (no errors). Each state $\hat\rho$ is represented by its decomposition into eigenstates, where, e.g., ∘••• + •∘•∘ ≡ |0111⟩ + |1010⟩. The eigenstates are sorted according to decreasing eigenvalues (probabilities). Eigenstates of less than 5% weight are displayed semitransparently. In this particular example, each eigenstate is a superposition of two basis states in the z basis. (b) Gate sequence triggered upon encountering an unexpected measurement. Again, we indicate the states $\hat\rho$, with bars now showing the probabilities. By further measurements, this sequence tries to disambiguate the error. (c) Visualization of neuron activations (300 neurons in the last hidden layer) in response to the quantum states (more precisely, maps Φ) encountered in many runs. These activations are projected down to 2D using the t-SNE technique [50], which is a nonlinear mapping that tries to preserve neighborhood relations while reducing the dimensionality. Each of the $2 \times 10^5$ points corresponds to one activation pattern and is colored according to the action taken. The sequences of (a) and (b) are indicated by arrows, with the sequence (b) clearly proceeding outside the dominant clusters (which belong to the more typically encountered states). Qubits are numbered 1, 2, 3, 4, and gate CNOT13 has control qubit 1 and target 3. (d) The enlargement shows a set of states in a single cluster which is revisited periodically during the standard detection cycle; this means we stroboscopically observe the time evolution. The shading indicates the time progressing during the gate sequence, with a slow drift of the state due to decoherence. (See the Supplemental Material [51] for an extended discussion.)
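The dimensionality-reduction idea behind Fig. 4(c) can be illustrated with entirely synthetic data. Below, PCA (via SVD) is used as a simpler, linear stand-in for t-SNE: activation patterns that cluster by action remain visibly clustered after projection to 2D. Cluster count, sizes, and spreads are all made up; only the 300-neuron dimensionality echoes the caption.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic "activation patterns": three clusters (one per action) in 300-D,
# mimicking the last hidden layer's activations described in the caption.
n_per, dim = 100, 300
centers = rng.normal(scale=5.0, size=(3, dim))
acts = np.concatenate([c + rng.normal(size=(n_per, dim)) for c in centers])
labels = np.repeat([0, 1, 2], n_per)

# PCA projection to 2D (t-SNE would preserve neighborhoods nonlinearly;
# for well-separated clusters a linear projection already shows them).
X = acts - acts.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
proj = X @ Vt[:2].T

# Each point lies closer to its own cluster's 2D centroid than to the others.
cents = np.array([proj[labels == k].mean(axis=0) for k in range(3)])
d = np.linalg.norm(proj[:, None, :] - cents[None], axis=2)
print((d.argmin(axis=1) == labels).mean() > 0.95)
```

For real activation data one would use an actual t-SNE implementation (e.g., scikit-learn's `TSNE`), since nonlinear structure is generally not captured by two principal components.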
Can we understand better how the network operates? To this end, we visualize the responses of the network to the input states [Fig. 4(c)], projecting the high-dimensional neuron activation patterns into the 2D plane using the t-distributed stochastic neighbor embedding (t-SNE) technique [50]. Similar activation patterns are mapped close to each other, forming clearly visible clusters, each of which results in one type of action. During a gate sequence, the network visits states in different clusters. The sequence becomes complex if unexpected measurement results are encountered [Fig. 4(b)]. In the example shown here, the outcome of the first parity measurement is compatible with three possibilities (one of two qubits is flipped, or the ancilla state is erroneous). The network learns to resolve the ambiguity through two further measurements, returning to the usual detection cycle. It is remarkable that RL finds these nontrivial sequences (which would be complicated to construct ab initio), picking out reward differences of a few percent.

The flexibility of the approach is demonstrated by training on different setups, where the network discovers from scratch other feedback strategies [Fig. 5(a)] adapted to the available resources. For example, we consider a chain of qubits where CNOT gates are available only between nearest neighbours and, in addition, we fix a single measurement location. Then, the network learns that it may use the available CNOT gates to swap through the chain. However, if every qubit can be measured, the net discovers a better strategy with fewer gates, where the middle two qubits of the chain alternate in playing the role of the ancilla. We also show, specifically, the complex recovery sequences triggered by unexpected measurements. They are a priori unknown, and RL permits their discovery from scratch without extra input. Generally, additional resources (such as enhanced connectivity) are exploited to yield better improvement of the decoherence time [Fig. 5(b)]. In another scenario [Fig. 5(c)], we find that the network successfully learns to adapt to unreliable measurements by redundancy.

In a separate class of scenarios, we consider dephasing of a qubit by a fluctuating field (Fig. 6). If the field is spatially homogeneous and also couples to nearby ancilla qubits, then the dephasing is collective: $\hat H(t) = B(t) \sum_j \mu_j \hat\sigma_z^{(j)}$, where $B(t)$ is white noise and $\mu_j$ are the coupling strengths (to qubit and ancillas). Note that in this situation, one can use neither dynamical decoupling (since the noise is uncorrelated in time) nor decoherence-free subspaces (since the $\mu_j$ can be arbitrary, in general). However, the
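The collective-dephasing model just introduced can be sketched with a toy Monte Carlo discretization. The step size, couplings, and ensemble size below are my own choices (the 4:1 coupling ratio echoes Fig. 6(c)); the point is that because every qubit accumulates phase from the *same* random walk, an ancilla's phase record determines the data qubit's phase.

```python
import numpy as np

# H(t) = B(t) * sum_j mu_j * sigma_z^(j), B(t) white noise. In this toy
# discretization each qubit j accumulates phase phi_j(t) = mu_j * W(t),
# with W(t) one shared random walk (the integrated field).
rng = np.random.default_rng(2)
mu_data, mu_anc = 1.0, 4.0           # coupling strengths, data qubit vs ancilla
n_steps, dt = 1000, 1e-3

W = np.cumsum(rng.normal(0.0, np.sqrt(dt), n_steps))   # integrated field B(t)
phi_data = mu_data * W                                  # data-qubit phase walk
phi_anc = mu_anc * W                                    # ancilla phase walk

# Perfect spatial correlation: a (mu_data/mu_anc)-rescaled correction based on
# the ancilla phase removes the data qubit's dephasing exactly in this model.
residual = phi_data - (mu_data / mu_anc) * phi_anc
print(np.allclose(residual, 0.0))

# Without using the ancilla record, the ensemble coherence |<exp(i*phi)>| decays.
phis = mu_data * np.cumsum(rng.normal(0.0, np.sqrt(dt), (5000, n_steps)), axis=1)
coherence = np.abs(np.exp(1j * phis[:, -1]).mean())
print(coherence < 1.0)
```

In the actual problem the ancilla phase is only accessible through discrete projective measurements along x or y, which is why the network must discover an adaptive phase-estimation strategy rather than the exact subtraction used here.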
[Figure: panels (a)-(c); qubit layouts, with labels including "Fraction of meas.", "Chain", "Ring".]

in such a brute-force approach is analyzed in detail in the Supplemental Material [51].

Up to now, the network only encodes and keeps track of errors by suitable collective measurements. By revising the reward structure, we can force it to correct errors and finally decode the quantum information back into a single physical qubit. Our objective is to maximize the overlap between the initial and final states for any logical qubit state (see Appendix E). Moreover, we find that learning the decoding during the final time steps is reinforced by punishing states where the logical qubit information is still distributed over multiple physical qubits. The corresponding rewards are added to the previous reward based on the recoverable quantum information. The network now indeed learns to decode properly [Fig. 7(a)]. In addition, it corrects errors. It does so typically soon after detecting an error, instead of at the
FIG. 6. Countering dephasing by measurements (adaptive phase estimation). (a) The setting: A data qubit whose phase undergoes a random walk due to a fluctuating field. Since the field is spatially correlated, its fluctuations can be detected by measuring the evolution of nearby ancilla qubits, which can then be exploited for correction. In this example, the allowed actions are measurements along x, y (MX, MY), and the idle operation. (b) For one ancilla, the network performs measurements periodically, finding the optimal measurement interval. This can be seen by comparing the coherence time enhancement as a function of the interval (in units of the single-qubit decoherence time $T_{\rm single}$) for the network (circles) with the analytical predictions (curves; see the Supplemental Material [51]). The remaining differences are due to the discretization of the interval (not considered in the analytics). The coupling between noise and ancilla ($\mu_2$) is different from that between noise and data qubit ($\mu_1$), and the strategy depends on the ratio indicated here. (c) Coherence time enhancement for different numbers of ancillas (here, $\mu_2 = \mu_3 = \mu_4 = 4\mu_1$). (d) Gate sequences. For two ancillas, the network discovers an adaptive strategy, where measurements on qubit 2 are rare, and the measurement basis is decided based on previous measurements of qubit 3. The arrows show the measurement result for qubit 3 in the equator plane of the Bloch sphere. (e) The two-ancilla adaptive strategy (overview; see, also, the Supplemental Material [51]). Here, a brute-force search for strategies is still (barely) possible, becoming infeasible for higher ancilla numbers due to the exponentially large search space of adaptive strategies [$\mu_2 = 3.8\mu_1$, $\mu_3 = 4.1\mu_1$ in (d),(e)].

FIG. 7. (a) Training the state-aware network to implement decoding back into the original data qubit near the end of a 200-step gate sequence. Blue: recoverable quantum information $R_Q$. Orange: (rescaled, shifted) overlap $O_Q$ between final and initial states of the data qubit. Inset displays the decoding gate sequence. (b) Training the recurrent network in a supervised way to imitate the state-aware network. Again, we show $R_Q$ and $O_Q$ evolving during training. Dashed blue or orange lines depict the performance of the state-aware network. (c) To investigate the workings of the recurrent network, we display some of the LSTM neuron activations in the second-to-last layer. Quantum states are illustrated even though the network is unaware of them. The network's input consists of the observed measurement result (if any) and of the actual gate choice (made in the previous time step, here aligned on top of the current neuron activations). The example gate sequence displayed here first shows the repetitive standard error-detection pattern of repeated parity measurements that is interrupted once an unexpected measurement is encountered, indicating an error. During error recovery (boxed region), the network first tries to pinpoint the error and then applies corrections. The neurons whose behavior changes markedly during recovery are depicted in gray. Neuron 1r obviously keeps track of whether recovery is ongoing, while 3r acts like a counter keeping time during the standard periodic detection pattern. (See the Supplemental Material [51] for more information.)

evolution governing the quantum state) and the particular action set (i.e., the available hardware instructions related to the setup and its connectivity). Importantly, fine-tuning the hyperparameters (like learning rate, network architecture, etc.) is not required; in the Supplemental Material [51], we demonstrate that a common set of hyperparameters can be used for all the scenarios.

V. POSSIBLE FUTURE APPLICATIONS

The physical setups considered in today's quantum-computing platforms contain many components and features that go beyond the simplest scenario of short-range coupled qubits. Conceptually, the approach developed in the present work is general enough to find future application in any of the following experimentally relevant domains.

An important example is cavities, which can be used as long-lived quantum memory, especially in the microwave
domain. When they are coupled to qubits, nonlinear operations can be performed that may aid in error correction of the cavity state, as the Yale group has demonstrated ("kitten" and "cat" codes [53,54]). Our approach allows us to cover such situations without any changes to the reinforcement-learning method. Only the description of the physical scenario via the set of available actions and, of course, the physics simulation must be updated. Cavities also give access to unconventional controls, e.g., naturally occurring long-distance multiqubit entangling gates provided by the common coupling of the qubits to the cavity. In addition, they permit direct collective readout that is sensitive to the joint state of multiple qubits, which may be used to speed up error-detection operations. Again, RL-based quantum feedback of the type proposed here can naturally make use of these ingredients.

Novel hardware setups, like cross-bar-type geometries [55,56], give rise to the challenge of exploiting the unconventional connectivity, for which our approach is well suited. In the future, it may even become possible to cooptimize the hardware layout (taking into account physical constraints) and the strategies adapted to the layout. In the simplest case, this means discovering strategies for automatically generated alternative layouts and comparing their performance.

The actions considered by the agent need not refer to unitary operations. They might also perform other functions, like restructuring the connectivity itself in real time. This is the case for the proposed 2D ion-trap architecture where the ions are shuffled around using electrodes [19]. Similar ideas have been proposed for spins in quantum dots, which can be moved around using electrodes or surface-acoustic waves. Again, no changes to our approach are needed. The modifications are confined to the physics simulation. Depending on the present state of the connectivity, the set of effective qubit gates will change.

Like any numerical approach, our method is invariably limited to modest qubit numbers (of course, these will increase with further optimizations, possibly up to about ten). It is important, therefore, to recall that even an improvement of the decoherence rate in an isolated few-qubit module can have useful applications (as a quantum memory, e.g., in a quantum repeater). More generally, it is clear that classical simulation of a full-scale quantum computer in the domain of quantum supremacy is out of the question, by definition. This is a challenge widely acknowledged by the entire community, affecting not only optimization but also design, testing, and verification of a quantum computer. One promising way to address this challenge, at least partially, advocated by a growing number of experimental groups, is the so-called modular approach to quantum computation and quantum devices. This consists of connecting small few-qubit quantum modules together via quantum network links [19,20]. The main advantage of this approach is the ability to control and debug small quantum modules as opposed to an entire large monolithic quantum computer. Our approach is very well suited to this strategy. In principle, one can even envision a hierarchical application of the quantum-module concept (with error-correction strategies applied to multiple modules coupled together), but for that case, our approach needs to be extended (e.g., by using RL to find one- and two-qubit gates acting on the logical qubits stored inside the modules).

VI. CONCLUSIONS

We see how a network can discover quantum-error-correction techniques from scratch. It finds a priori unknown nontrivial detection or recovery sequences for diverse settings without any input beyond the available gate set. The trained neural networks can, in principle, be used to control experimental quantum devices. The present approach is flexible enough to be applied directly to a range of further qualitatively different physical situations, like non-Markovian noise, weak measurements, qubit-cavity systems, and error-corrected transport of quantum information through networks. An obvious challenge for the future is to successfully discover strategies on even more qubits, where eventually full protection against all noise sources and multiple logical qubits can be realized. There is still considerable leeway in improving the speed of the physics simulation and of GPU-based training (for further details on the current computational effort, see Appendix L).

On the machine-learning side, other RL schemes can be substituted for the natural policy gradient adopted here, like Q learning or advantage-actor-critic techniques, or RL with continuous controls. Recurrent networks might be employed to discover useful subsequences. The two-stage learning approach introduced here can also be applied in other RL scenarios, where one first trains based on expanded state information. In general, we show that neural-network-based RL promises to be a flexible and general tool of wide-ranging applicability for exploring feedback-based control of quantum and classical systems in physics.

The data that support the plots within this paper and other findings of this study are available from the authors on request (florian.marquardt@mpl.mpg.de).

ACKNOWLEDGMENTS

We thank Hugo Ribeiro and Vittorio Peano for fruitful comments on the manuscript.

All authors contributed to the ideas, their implementation, and the writing of the manuscript. The numerics was performed by T. F., P. T., and T. W.

Note added.—Recently, a preprint [57] appeared exploring RL with recurrent networks for optimal quantum control (without feedback).
APPENDIX A: PHYSICAL TIME EVOLUTION

To track the time evolution for an arbitrary initial logical qubit state (identified by its Bloch vector $\vec n$), we start from $\hat\rho_{\vec n}(0) = \tfrac{1}{2}(1 + \vec n \cdot \hat{\vec\sigma}) \otimes \hat\rho_{\rm rest}$, factorizing out the (fixed) state of all the other qubits. Now, consider the four quantities

$$\hat\rho_0(0) = \tfrac{1}{2}\,[\hat\rho_{\vec e_j}(0) + \hat\rho_{-\vec e_j}(0)], \tag{A1a}$$

$$\delta\hat\rho_j(0) = \tfrac{1}{2}\,[\hat\rho_{\vec e_j}(0) - \hat\rho_{-\vec e_j}(0)], \tag{A1b}$$

where $j \in \{x, y, z\}$ and $\vec e_j$ are the basis vectors; note that the right-hand side of Eq. (A1a) is independent of $j$. $\hat\rho_0$ and the $\delta\hat\rho_j$ are evolved stepwise for each time interval $[t_i, t_f]$ according to the update rule

of measurements, it is an unnormalized version (see below). We explicitly renormalize such that always $\mathrm{tr}[\hat\rho_0(t)] = 1$. $\hat\rho_0$ and the $\delta\hat\rho_j$ give us access to the density matrix for every logical qubit state, at any time $t$:

$$\hat\rho_{\vec n}(t) = \frac{\hat\rho_0(t) + \sum_j n_j\, \delta\hat\rho_j(t)}{1 + \mathrm{tr}\big[\sum_j n_j\, \delta\hat\rho_j(t)\big]}. \tag{A3}$$

APPENDIX B: PHYSICAL SCENARIOS

We always start from the initial condition that the logical qubit is stored in one physical qubit, and the others are prepared in the down state ($|1\rangle$). If explicit recovery is desired, we use this original qubit also as the target qubit for final decoding. The time evolution is divided into discrete time steps of uniform length $\Delta t$ (set to 1 in the main text). At the start of each of these time slices, we perform the measurement or gate operation (which is assumed to be quasi-instantaneous) chosen by the agent; afterwards, the system is subject to the dissipative dynamics. Thus, the map $\phi$ for the time interval $[t, t + \Delta t]$ is of the form $\phi[\hat\rho] = e^{\Delta t \mathcal{D}}(\hat U \hat\rho\, \hat U^\dagger)$ for unitary operations $\hat U$ and $\phi[\hat\rho] = e^{\Delta t \mathcal{D}}(\hat P_m \hat\rho\, \hat P_m^\dagger)$ for projection operators $\hat P_m$ ($m$ indicates the measurement results), where $\mathcal{D}$ is the dissipative part of the Liouvillian (we consider only Markovian noise). Note that the measurement results are chosen stochastically according to their respective probability $\mathrm{tr}(\hat P_m \hat\rho_0)$. In the examples discussed in the figures, we use two different error models, the bit-flip error (BF) and the correlated-noise error (CN),

$$\mathcal{D}_{\rm BF}\,\hat\rho = T_{\rm dec}^{-1} \sum_q \left( \hat\sigma_x^{(q)} \hat\rho\, \hat\sigma_x^{(q)} - \hat\rho \right), \tag{B1a}$$

$$\mathcal{D}_{\rm CN}\,\hat\rho = T_{\rm dec}^{-1} \left( \hat L_{\rm CN}\, \hat\rho\, \hat L_{\rm CN}^\dagger - \tfrac{1}{2}\{\hat L_{\rm CN}^\dagger \hat L_{\rm CN},\, \hat\rho\} \right), \tag{B1b}$$

where $\hat\sigma_{x,y,z}^{(q)}$ applies the corresponding Pauli operator to the $q$th qubit and $\hat L_{\rm CN} = \big(1/\sqrt{\sum_q \mu_q^2}\big) \sum_q \mu_q\, \hat\sigma_z^{(q)}$. Here, $\mu_q$ denotes the coupling of qubit $q$ to the noise. Note that in the bit-flip scenario, the single-qubit decay time $T_{\rm single} = T_{\rm dec}$, whereas in the presence of correlated noise, it is $T_{\rm single} = T_{\rm dec}\big(\sum_q \mu_q^2\big)/\mu_1^2$.

in the (for us relevant) case that $\mathrm{tr}[\delta\hat\rho_j(t)] = 0$ for all $j$. The trace distance $\tfrac{1}{2}\|\sum_j n_j\, \delta\hat\rho_j(t)\|_1$ has often a nontrivial dependence on the logical qubit state $\vec n$, and finding its minimum can become nontrivial. However, the location of the minimum can sometimes be "guessed" in advance. For any circuit that can be decomposed into CNOT, Hadamard, and $R_Z(\pi/2)$ gates (CHZ) [48], i.e., for all the bit-flip examples considered here, the anticommutator relation $\{\delta\hat\rho_j, \delta\hat\rho_k\} = 0$ is satisfied for all distinct $j \neq k$; it can be shown that this circumstance restricts the minimum to lie along one of the coordinate axes: $\min_{\vec n} \|\sum_j n_j\, \delta\hat\rho_j(t)\|_1 = \min_{j\in\{x,y,z\}} \|\delta\hat\rho_j(t)\|_1$. For the correlated noise, the trace distance $\tfrac{1}{2}\|\sum_j n_j\, \delta\hat\rho_j(t)\|_1$ is symmetric around the $z$ axis and takes its minimal value at the equator.

After a measurement, the updated value of $R_Q$ may vary between the different measurement results. To obtain a measure that does not depend on this random factor, we introduce $\bar R_Q$ as the average over all possible values of $R_Q$ (after a single time step) weighted by the probability to end up in the corresponding branch. If the action is not a measurement, there is only one option and, thus, $\bar R_Q = R_Q$.

APPENDIX D: PROTECTION REWARD

The goal of the "protection reward" is to maximize $R_Q$ at the end of the simulation, i.e., the ability to, in principle, recover the target state. A suitable (immediate) reward is given by
REINFORCEMENT LEARNING WITH NEURAL NETWORKS … PHYS. REV. X 8, 031084 (2018)
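The recoverable quantum information can be evaluated numerically. The following NumPy sketch is our own illustration, not the authors' code; it assumes the normalization $R_Q = \min_{\vec n}\|\sum_j n_j\,\delta\hat\rho_j\|_1$ (equivalently $\frac{1}{2}\|\hat\rho_{\vec n}-\hat\rho_{-\vec n}\|_1$ in the traceless case), so that an intact qubit gives $R_Q = 1$, and it minimizes over the Bloch sphere by random sampling rather than the axis shortcut:

```python
import numpy as np

def trace_norm(a):
    # ||A||_1 = sum of singular values; for Hermitian A, sum of |eigenvalues|
    return np.abs(np.linalg.eigvalsh(a)).sum()

def recoverable_quantum_information(delta_rhos, n_samples=500, seed=0):
    """R_Q = min over Bloch vectors n of || sum_j n_j delta_rho_j ||_1.

    Since rho_{+n} - rho_{-n} = 2 sum_j n_j delta_rho_j when the
    delta_rho_j are traceless, this equals the minimal trace distance
    between opposite logical qubit states. The minimization over the
    Bloch sphere is approximated here by random sampling.
    """
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_samples):
        n = rng.normal(size=3)
        n /= np.linalg.norm(n)
        d = sum(nj * dr for nj, dr in zip(n, delta_rhos))
        best = min(best, trace_norm(d))
    return best

# Example: one perfectly stored qubit, where delta_rho_j = sigma_j / 2
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)
rq = recoverable_quantum_information([sx / 2, sy / 2, sz / 2])
print(rq)  # ~1.0: the qubit is fully recoverable
```

For the CHZ circuits discussed above, the sampling loop can be replaced by evaluating the three coordinate axes only, i.e., `min(trace_norm(d) for d in delta_rhos)`.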
APPENDIX D: PROTECTION REWARD

The goal of the "protection reward" is to maximize $R_Q$ at the end of the simulation, i.e., the ability to, in principle, recover the target state. A suitable (immediate) reward is given by

$$r_t = \begin{cases} 1 + \dfrac{\bar R_Q(t+\Delta t) - R_Q(t)}{2\Delta t/T_{\rm single}} + 0 & \text{if } \bar R_Q(t+\Delta t) > 0, \\[1.5ex] 0 - P & \text{if } R_Q(t) \neq 0 \,\wedge\, R_Q(t+\Delta t) = 0, \\ 0 + 0 & \text{if } R_Q(t) = 0, \end{cases} \tag{D1}$$

where in each case the first term defines the contribution $r_t^{(1)}$ and the last term defines $r_t^{(2)}$ (so that $r_t = r_t^{(1)} + r_t^{(2)}$), with $R_Q$ and $\bar R_Q$ as defined above, $T_{\rm single}$ the decay time for encoding in one physical qubit only ("trivial" encoding), $\Delta t$ the discrete time step of the simulation, and $P$ a punishment for measurements which reveal the logical qubit state. Based on this reward, we choose the return (the function of the reward sequence used to compute the policy gradient) as

$$R_t = (1-\gamma) \sum_{k=0}^{T-t-1} \gamma^k\, r^{(1)}_{t+k} + r^{(2)}_t, \tag{D2}$$

where $\gamma$ is the return discount rate; for more information on the (discounted) return, see, e.g., Ref. [58].

APPENDIX E: RECOVERY REWARD

The protection reward does not encourage the network to finally decode the quantum state. If this behavior is desired, we add suitable terms to the reward (only employed for Fig. 7):

$$r_t^{(\rm recov)} = \beta_{\rm dec}\,(D_{t+1} - D_t) + \beta_{\rm corr}\, C_T\, \delta_{t,T-1}, \tag{E1}$$

where $D_t = \frac{1}{2}\big(I_t^{(q_0)} - \sum_{q\neq q_0} I_t^{(q)}\big)$, unless $t \le T_{\rm signal}$, where we set $D_t = 0$. This means decoding is rewarded only after $T_{\rm signal}$. We set $I_t^{(q)} = 1$ if $\operatorname{tr}_{\bar q}[\delta\hat\rho_j(t)] \neq 0$ for any $j$, and otherwise $I_t^{(q)} = -1$. $\operatorname{tr}_{\bar q}$ denotes the partial trace over all qubits except $q$, and $q_0$ labels the target qubit. The condition $I_t^{(q)} = 1$ implies that the logical qubit state is encoded in the specific qubit $q$ (this is not a necessary criterion). $C_T$ is 1 if (at the final time $T$) the logical qubit state is encoded in the target qubit only, and this qubit has the prescribed polarization (i.e., not flipped), and otherwise 0.

As the return, we use

$$R_t^{(\rm recov)} = (1-\gamma) \sum_{k=0}^{T-t-1} \gamma^k\, r^{(\rm recov)}_{t+k} \tag{E2}$$

with the same return discount rate $\gamma$ as for the protection reward.

With this reward, we aim to optimize the minimum overlap $O_Q = \min_{\vec n} \langle\phi_{\vec n}|\operatorname{tr}_{\bar q_0}(\hat\rho_{\vec n})|\phi_{\vec n}\rangle$ between the (pure) target state $|\phi_{\vec n}\rangle$ and the actual final state $\hat\rho_{\vec n}$ reduced to the target qubit, given by the partial trace $\operatorname{tr}_{\bar q_0}(\hat\rho_{\vec n})$ over all other qubits.

APPENDIX F: INPUT OF THE STATE-AWARE NETWORK

The core of the input to the state-aware network is a representation of the density matrices $\hat\rho_0$, $\hat\rho_1 \coloneqq \hat\rho_0 + \delta\hat\rho_x$, $\hat\rho_2 \coloneqq \hat\rho_0 + \delta\hat\rho_y$, and $\hat\rho_3 \coloneqq \hat\rho_0 + \delta\hat\rho_z$. Together, they represent the completely positive map of the evolution (for arbitrary logical qubit states). To reduce the input size (especially in view of higher qubit numbers), we compress them via principal component analysis (PCA); i.e., we perform an eigendecomposition $\hat\rho_j = \sum_k p_k |\phi_k\rangle\langle\phi_k|$ and select the eigenstates $|\phi_k\rangle$ with the largest eigenvalues $p_k$. To include also the eigenvalues in the input, we feed all components of the scaled states $|\tilde\phi_k\rangle \coloneqq \sqrt{p_k}\,|\phi_k\rangle$ (which yield $\hat\rho_j = \sum_k |\tilde\phi_k\rangle\langle\tilde\phi_k|$) into the network, where the states are in addition sorted by their eigenvalue. For our simulations, we select the six largest components, so we need $768 = 4 \times 6 \times 16 \times 2$ input neurons (four density matrices, six components each, 16 is the dimension of the Hilbert space, and two for the real and imaginary parts).

In addition, at each time step, we indicate to the network whether a potential measurement would destroy the quantum state by revealing the quantum information. Explicitly, we compute for each measurement whether $\operatorname{tr}(\hat P\,\delta\hat\rho_j) = 0$ for all $j \in \{x,y,z\}$ and every possible projector $\hat P$ (i.e., every possible measurement result) and feed these Boolean values into the network. Note that this information can be deduced from the density matrix (so, in principle, the network can learn this deduction on its own, but providing it directly speeds up training).

Because all relevant information for the decision about the next action is contained in the current density matrix, knowledge about the previous actions is not needed. However, we find that providing this extra information helps to accelerate learning. Therefore, we also provide the last action (in a one-hot encoding). We note that it is not necessary to feed the latest measurement result to the network since the updated density matrix is conditioned on the measurement outcome and, therefore, contains all relevant information for a future decision.

To train the state-aware network to restore the original state at the end of a trajectory, it becomes necessary to add the time to the input. It is fully sufficient to indicate the last few time steps where $t > T_{\rm signal}$ (when decoding should be performed) in a one-hot encoding.

APPENDIX G: LAYOUT OF THE STATE-AWARE NETWORK

Our state-aware networks have a feedforward architecture. Between the input layer and the output layer (one neuron per action), there are two or three hidden layers (the specific numbers are summarized in Appendix K). All neighboring layers are densely connected, and the activation function is the rectified linear unit.
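The feedforward policy network described in Appendixes F and G can be sketched as follows. This is an illustration with placeholder hidden-layer widths (not the values of Table II), written framework-free in NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # subtract the maximum for numerical stability, then exp / sum(exp)
    e = np.exp(x - x.max())
    return e / e.sum()

def init_layer(n_in, n_out):
    # small random weights, zero biases (illustrative initialization only)
    return rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out)

# Illustrative sizes: 768 input neurons (Appendix F), two hidden layers,
# one output neuron per action; the widths 300 and 16 are placeholders.
layers = [init_layer(768, 300), init_layer(300, 300), init_layer(300, 16)]

def policy(state):
    """Forward pass: dense layers with ReLU, softmax at the output."""
    h = state
    for i, (w, b) in enumerate(layers):
        h = w @ h + b
        if i < len(layers) - 1:
            h = relu(h)
    return softmax(h)

p = policy(rng.normal(size=768))
print(p.shape, p.sum())  # 16 action probabilities summing to 1
```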
FÖSEL, TIGHINEANU, WEISS, and MARQUARDT PHYS. REV. X 8, 031084 (2018)
At the output layer, the softmax function $x_j \mapsto e^{x_j}/\sum_k e^{x_k}$ is applied such that the result can be interpreted as a probability distribution.

APPENDIX H: REINFORCEMENT LEARNING OF THE STATE-AWARE NETWORK

Our learning scheme is based on the policy gradient algorithm [45]. The full expression for our learning gradient (indicating the change in $\theta$) reads

$$g = \lambda_{\rm pol}\, F^{-1}\, \mathbb{E}\Big[\sum_t (R_t - b_t)\,\frac{\partial}{\partial\theta}\ln\pi_\theta(a_t|s_t)\Big] - \lambda_{\rm entr}\, \mathbb{E}\Big[\frac{\partial}{\partial\theta}\sum_a \pi_\theta(a|s)\ln\pi_\theta(a|s)\Big], \tag{H1}$$

where $R_t$ is the (discounted) return [compare Eqs. (D2) and (E2)]. This return is corrected by an (explicitly time-dependent) baseline $b_t$, which we choose as the exponentially decaying average of $R_t$; i.e., for the training update in epoch $N$, we use $b_t = (1-\kappa)\sum_{n=0}^{N-1}\kappa^{N-1-n}\bar R_t^{(n)}$, where $\kappa$ is the baseline discount rate, and $\bar R_t^{(n)}$ is the mean return at time step $t$ in epoch $n$. We compute the natural gradient [59–61] by multiplying with $F^{-1}$, the (Moore-Penrose) inverse of the Fisher information matrix $F = \mathbb{E}\{[(\partial/\partial\theta)\ln\pi_\theta(a|s)][(\partial/\partial\theta)\ln\pi_\theta(a|s)]^T\}$. The second term is entropy regularization [62]; we use it only to train the state-aware network shown in Fig. 7. As the update rule, we use adaptive moment estimation (Adam) (see Ref. [63]) without bias correction.

APPENDIX I: LAYOUT OF THE RECURRENT NETWORK

The recurrent network is designed such that it can, in principle, operate in a real-world experiment. This means, in particular, that (in contrast to the state-aware network) its input must not directly contain the quantum state (or the evolution map); instead, measurements are its only way to obtain information about the quantum system. Hence, the input to the recurrent network contains the present measurement result (and, additionally, the previous action). Explicitly, we choose the input as a one-hot encoding of the action in the last time step, and in the case of measurements, we additionally distinguish between the different results. In addition, there is an extra input neuron to indicate the beginning of time (where no previous action has been performed). Since this input contains only the most recent event, the network requires a memory to perform reasonable strategies; i.e., we need a recurrent network. Therefore, the input and output layers are connected by two successive LSTM layers [52] with tanh interlayer activations. After the output layer, the softmax function is applied (as for the state-aware network).

APPENDIX J: SUPERVISED LEARNING OF THE RECURRENT NETWORK

The training data are generated from inference of a state-aware network which has been trained to sufficiently good strategies (via reinforcement learning); for every time step in each trajectory, we save the network input and the policy, i.e., the probabilities for all the actions, and we train on these data. It is possible to generate enough data that overfitting is not a concern (for the example in Fig. 7, each trajectory is reused only five times during the full training process). For the actual training of the recurrent network, we use supervised learning with the categorical cross-entropy as the cost function ($q$ is the actual policy of the recurrent network to train, and $p$ the desired policy from the state-aware network):

$$C(q,p) = -\sum_a p(a)\ln q(a). \tag{J1}$$

Because of the LSTM layers, it is necessary to train on full trajectories (in the true time sequence) instead of individual actions. Dropout [64] is used for regularization. The training update rule is adaptive moment estimation (Adam) (see Ref. [63]).

APPENDIX K: PHYSICAL PARAMETERS AND HYPERPARAMETERS

The physical parameters used throughout the main text are summarized in Table I. Times are always given in units of the time step (gate time).

We use a few separately trained neural networks throughout this work which differ slightly in hyperparameters (e.g., in the number of hidden layers and neurons per layer). This is not due to fine-tuning, and in the Supplemental Material [51], we demonstrate that we can successfully train the neural networks in all scenarios with one common set of hyperparameters. The strategies found by the neural networks are not influenced by using different sets of hyperparameters; different hyperparameters may influence the training time, etc. In Table II, we summarize the architecture of the networks; i.e., we list the number of neurons in each layer. Each output neuron represents one action that can be performed by the agent, and, thus, the output layer size is given by the number of available actions.

TABLE I. Physical parameters.

Figures 3–5, 7: decoherence time $T_{\rm dec}$ = 1200
Figure 6: single-qubit decoherence time $T_{\rm single} = T_{\rm dec}/\mu_1^2$ = 500
Figures 3–5, 7: number of time steps $T$ = 200
Figure 6: number of time steps $T$ = 100
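The policy-gradient update of Eq. (H1) can be illustrated with a toy REINFORCE implementation for a linear softmax policy. This sketch uses the discounted return of Eq. (D2) (first contribution only), a simple mean baseline instead of the exponentially decaying average, and the ordinary rather than the natural gradient; all sizes are made up, and the code is not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam_entr, lr = 0.95, 0.01, 0.01
n_states, n_actions, T = 8, 4, 20
theta = np.zeros((n_actions, n_states))  # parameters of a linear softmax policy

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def discounted_returns(rewards):
    # R_t = (1 - gamma) * sum_k gamma^k r_{t+k}, cf. Eq. (D2), first term
    acc, out = 0.0, []
    for r in reversed(rewards):
        acc = r + gamma * acc
        out.append((1.0 - gamma) * acc)
    return np.array(out[::-1])

# One toy trajectory: random one-hot "states", sampled actions, toy rewards
states = np.eye(n_states)[rng.integers(n_states, size=T)]
steps, rewards = [], []
for s in states:
    pi = softmax(theta @ s)
    a = rng.choice(n_actions, p=pi)
    steps.append((s, a, pi))
    rewards.append(rng.normal())

R = discounted_returns(rewards)
b = R.mean()  # simple baseline (the paper uses a decaying average over epochs)

grad = np.zeros_like(theta)
for (s, a, pi), Rt in zip(steps, R):
    # gradient of ln pi(a|s) w.r.t. theta for a linear softmax policy
    glog = (np.eye(n_actions)[a] - pi)[:, None] * s[None, :]
    # exact gradient of the policy entropy H = -sum_a pi ln pi w.r.t. logits
    dH_dz = -pi * np.log(pi) + pi * (pi @ np.log(pi))
    grad += (Rt - b) * glog + lam_entr * dH_dz[:, None] * s[None, :]

theta += lr * grad  # plain gradient ascent (no natural gradient, no Adam)
print(np.isfinite(theta).all())  # -> True
```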
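The supervised distillation of Appendix J minimizes the cross-entropy $C(q,p)$ of Eq. (J1) between the student's policy $q$ and the teacher's policy $p$. A framework-free sketch of this loss and its gradient with respect to the student's output logits (the LSTM layers and Adam are omitted; plain gradient descent is used instead, and the teacher policy here is invented for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(q, p):
    # C(q, p) = -sum_a p(a) ln q(a), cf. Eq. (J1)
    return -(p * np.log(q)).sum()

def grad_logits(z, p):
    # for q = softmax(z), dC/dz = q - p: the standard softmax
    # cross-entropy gradient, pulling the student toward the teacher
    return softmax(z) - p

p = np.array([0.7, 0.2, 0.1])  # teacher policy (from the state-aware network)
z = np.zeros(3)                # student logits, uniform policy at the start
for _ in range(500):
    z -= 0.5 * grad_logits(z, p)  # plain gradient descent (the paper uses Adam)
q = softmax(z)
print(np.round(q, 3))  # converges to the teacher policy
```

Because the softmax cross-entropy is convex in the logits, this toy student recovers the teacher distribution exactly; in the paper, the same objective is instead driven through full trajectories of the recurrent network.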
[10] S. Krastanov and L. Jiang, Deep Neural Network Probabilistic Decoder for Stabilizer Codes, Sci. Rep. 7, 11003 (2017).
[11] P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K. Fisher, and D. J. Schwab, A High-Bias, Low-Variance Introduction to Machine Learning for Physicists, arXiv:1803.08823.
[12] R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, A. Sparkes, K. E. Whelan, and A. Clare, The Automation of Science, Science 324, 85 (2009).
[13] M. Schmidt and H. Lipson, Distilling Free-Form Natural Laws from Experimental Data, Science 324, 81 (2009).
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Human-Level Control through Deep Reinforcement Learning, Nature (London) 518, 529 (2015).
[15] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the Game of Go without Human Knowledge, Nature (London) 550, 354 (2017).
[16] C. Chen, D. Dong, H. X. Li, J. Chu, and T. J. Tarn, Fidelity-Based Probabilistic Q-Learning for Control of Quantum Systems, IEEE Trans. Neural Netw. Learn. Syst. 25, 920 (2014).
[17] M. Bukov, A. G. R. Day, D. Sels, P. Weinberg, A. Polkovnikov, and P. Mehta, Reinforcement Learning in Different Phases of Quantum Control, arXiv:1705.00565.
[18] A. A. Melnikov, H. P. Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, Active Learning Machine Learns to Create New Quantum Experiments, Proc. Natl. Acad. Sci. U.S.A. 115, 1221 (2018).
[19] C. Monroe and J. Kim, Scaling the Ion Trap Quantum Processor, Science 339, 1164 (2013).
[20] M. H. Devoret and R. J. Schoelkopf, Superconducting Circuits for Quantum Information: An Outlook, Science 339, 1169 (2013).
[21] J. Chiaverini, D. Leibfried, T. Schaetz, M. D. Barrett, R. B. Blakestad, J. Britton, W. M. Itano, J. D. Jost, E. Knill, C. Langer, R. Ozeri, and D. J. Wineland, Realization of Quantum Error Correction, Nature (London) 432, 602 (2004).
[22] P. Schindler, J. T. Barreiro, T. Monz, V. Nebendahl, D. Nigg, M. Chwalla, M. Hennrich, and R. Blatt, Experimental Repetitive Quantum Error Correction, Science 332, 1059 (2011).
[23] G. Waldherr, Y. Wang, S. Zaiser, M. Jamali, T. Schulte-Herbrüggen, H. Abe, T. Ohshima, J. Isoya, J. F. Du, P. Neumann, and J. Wrachtrup, Quantum Error Correction in a Solid-State Hybrid Spin Register, Nature (London) 506, 204 (2014).
[24] J. Kelly et al., State Preservation by Repetitive Error Detection in a Superconducting Quantum Circuit, Nature (London) 519, 66 (2015).
[25] M. Veldhorst, C. H. Yang, J. C. C. Hwang, W. Huang, J. P. Dehollain, J. T. Muhonen, S. Simmons, A. Laucht, F. E. Hudson, K. M. Itoh, A. Morello, and A. S. Dzurak, A Two-Qubit Logic Gate in Silicon, Nature (London) 526, 410 (2015).
[26] N. Ofek, A. Petrenko, R. Heeres, P. Reinhold, Z. Leghtas, B. Vlastakis, Y. Liu, L. Frunzio, S. M. Girvin, L. Jiang, M. Mirrahimi, M. H. Devoret, and R. J. Schoelkopf, Extending the Lifetime of a Quantum Bit with Error Correction in Superconducting Circuits, Nature (London) 536, 441 (2016).
[27] A. Potocnik, A. Bargerbos, F. A. Y. N. Schröder, S. A. Khan, M. C. Collodo, S. Gasparinetti, Y. Salathé, C. Creatore, C. Eichler, H. E. Türeci, A. W. Chin, and A. Wallraff, Studying Light-Harvesting Models with Superconducting Circuits, Nat. Commun. 9, 904 (2018).
[28] S. Debnath, N. M. Linke, C. Figgatt, K. A. Landsman, K. Wright, and C. Monroe, Demonstration of a Small Programmable Quantum Computer with Atomic Qubits, Nature (London) 536, 63 (2016).
[29] M. Takita, A. W. Cross, A. D. Córcoles, J. M. Chow, and J. M. Gambetta, Experimental Demonstration of Fault-Tolerant State Preparation with Superconducting Qubits, Phys. Rev. Lett. 119, 180501 (2017).
[30] T. F. Watson, S. G. J. Philips, E. Kawakami, D. R. Ward, P. Scarlino, M. Veldhorst, D. E. Savage, M. G. Lagally, M. Friesen, S. N. Coppersmith, M. A. Eriksson, and L. M. K. Vandersypen, A Programmable Two-Qubit Quantum Processor in Silicon, Nature (London) 555, 633 (2018).
[31] D. Ristè, C. C. Bultink, K. W. Lehnert, and L. DiCarlo, Feedback Control of a Solid-State Qubit Using High-Fidelity Projective Measurement, Phys. Rev. Lett. 109, 240502 (2012).
[32] P. Campagne-Ibarcq, E. Flurin, N. Roch, D. Darson, P. Morfin, M. Mirrahimi, M. H. Devoret, F. Mallet, and B. Huard, Persistent Control of a Superconducting Qubit by Stroboscopic Measurement Feedback, Phys. Rev. X 3, 021008 (2013).
[33] L. Steffen, Y. Salathe, M. Oppliger, P. Kurpiers, M. Baur, C. Lang, C. Eichler, G. Puebla-Hellmann, A. Fedorov, and A. Wallraff, Deterministic Quantum Teleportation with Feed-Forward in a Solid State System, Nature (London) 500, 319 (2013).
[34] W. Pfaff, B. J. Hensen, H. Bernien, S. B. van Dam, M. S. Blok, T. H. Taminiau, M. J. Tiggelman, R. N. Schouten, M. Markham, D. J. Twitchen, and R. Hanson, Unconditional Quantum Teleportation between Distant Solid-State Quantum Bits, Science 345, 532 (2014).
[35] D. Ristè and L. DiCarlo, in Superconducting Devices in Quantum Optics, edited by R. H. Hadfield and G. Johansson (Springer, New York, 2016).
[36] N. Khaneja, T. Reiss, C. Kehlet, T. Schulte-Herbrüggen, and S. J. Glaser, Optimal Control of Coupled Spin Dynamics: Design of NMR Pulse Sequences by Gradient Ascent Algorithms, J. Magn. Reson. 172, 296 (2005).
[37] S. Machnes, U. Sander, S. J. Glaser, P. de Fouquières, A. Gruslys, S. Schirmer, and T. Schulte-Herbrüggen, Comparing, Optimizing, and Benchmarking Quantum-Control Algorithms in a Unifying Programming Framework, Phys. Rev. A 84, 022305 (2011).
[38] P. D. Johnson, J. Romero, J. Olson, Y. Cao, and A. Aspuru-Guzik, QVECTOR: An Algorithm for Device-Tailored Quantum Error Correction, arXiv:1711.02249.
[39] A. Hentschel and B. C. Sanders, Machine Learning for Precise Quantum Measurement, Phys. Rev. Lett. 104, 063603 (2010).
[40] A. Hentschel and B. C. Sanders, Efficient Algorithm for Optimizing Adaptive Quantum Metrology Processes, Phys. Rev. Lett. 107, 233601 (2011).
[41] P. Palittapongarnpim, P. Wittek, E. Zahedinejad, S. Vedaie, and B. C. Sanders, Learning in Quantum Control: High-Dimensional Global Optimization for Noisy Quantum Dynamics, Neurocomputing 268, 116 (2017).
[42] M. Tiersch, E. J. Ganahl, and H. J. Briegel, Adaptive Quantum Computation in Changing Environments Using Projective Simulation, Sci. Rep. 5, 12874 (2015).
[43] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Quantum Machine Learning, Nature (London) 549, 195 (2017).
[44] J. Romero, J. P. Olson, and A. Aspuru-Guzik, Quantum Autoencoders for Efficient Compression of Quantum Data, Quantum Sci. Technol. 2, 045001 (2017).
[45] R. J. Williams, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Mach. Learn. 8, 229 (1992).
[46] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, MA, 2016).
[47] P. W. Shor, Scheme for Reducing Decoherence in Quantum Computer Memory, Phys. Rev. A 52, R2493 (1995).
[48] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information, 10th anniv. ed. (Cambridge University Press, Cambridge, England, 2011).
[49] B. M. Terhal, Quantum Error Correction for Quantum Memories, Rev. Mod. Phys. 87, 307 (2015).
[50] L. van der Maaten and G. Hinton, Visualizing Data Using t-SNE, J. Mach. Learn. Res. 9, 2579 (2008).
[51] See Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevX.8.031084 for further information on the physical time evolution, the recoverable quantum information, conception and training of the state-aware and the recurrent network, the t-SNE analysis, and the numerical implementation, plus background information on the figures.
[52] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Comput. 9, 1735 (1997).
[53] M. Mirrahimi, Z. Leghtas, V. V. Albert, S. Touzard, R. J. Schoelkopf, L. Jiang, and M. H. Devoret, Dynamically Protected Cat-Qubits: A New Paradigm for Universal Quantum Computation, New J. Phys. 16, 045014 (2014).
[54] R. W. Heeres, P. Reinhold, N. Ofek, L. Frunzio, L. Jiang, M. H. Devoret, and R. J. Schoelkopf, Implementing a Universal Gate Set on a Logical Qubit Encoded in an Oscillator, Nat. Commun. 8, 94 (2017).
[55] F. Helmer, M. Mariantoni, A. G. Fowler, J. von Delft, E. Solano, and F. Marquardt, Cavity Grid for Scalable Quantum Computation with Superconducting Circuits, Europhys. Lett. 85, 50007 (2009).
[56] R. Li, L. Petit, D. P. Franke, J. P. Dehollain, J. Helsen, M. Steudtner, N. K. Thomas, Z. R. Yoscovits, K. J. Singh, S. Wehner, L. M. K. Vandersypen, J. S. Clarke, and M. Veldhorst, A Crossbar Network for Silicon Quantum Dot Qubits, arXiv:1711.03807.
[57] M. August and J. M. Hernández-Lobato, Taking Gradients through Experiments: LSTMs and Memory Proximal Policy Optimization for Black-Box Quantum Control, arXiv:1802.04063.
[58] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998), Vol. 1.
[59] S.-i. Amari, Natural Gradient Works Efficiently in Learning, Neural Comput. 10, 251 (1998).
[60] S. M. Kakade, A Natural Policy Gradient, in Advances in Neural Information Processing Systems (2001), pp. 1531–1538, https://papers.nips.cc/paper/2073-a-natural-policy-gradient.
[61] J. Peters and S. Schaal, Natural Actor-Critic, Neurocomputing 71, 1180 (2008).
[62] R. J. Williams and J. Peng, Function Optimization Using Connectionist Reinforcement Learning Algorithms, Connection Science 3, 241 (1991).
[63] D. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, in Proceedings of the International Conference for Learning Representations, 2015, Vol. 6 [arXiv:1412.6980].
[64] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors, arXiv:1207.0580.
[65] R. Al-Rfou et al., THEANO: A PYTHON Framework for Fast Computation of Mathematical Expressions, arXiv:1605.02688.