Professional Documents
Culture Documents
DOI:10.1145/ 3571725
There is a maze of hard problems
What are the promising applications that have been suggested to profit
from quantum acceleration: from
to realize quantum advantage? cryptanalysis, chemistry and materi-
als science, to optimization, big data,
BY TORSTEN HOEFLER, THOMAS HÄNER, AND MATTHIAS TROYER machine learning, database search,
drug design and protein folding, fluid
Disentangling Hype
dynamics and weather prediction. But
which of these applications realisti-
from Practicality:
cally offer a potential quantum advan-
tage in practice? For this, we cannot
only rely on asymptotic speedups but
On Realistically
must consider the constants involved.
Being optimistic in our outlook for
quantum computers, we identify clear
Achieving
guidelines for quantum practicality
and use them to classify which of the
many proposed applications for quan-
tum computing show promise and
Quantum
which ones would require significant
algorithmic improvements to become
practical and relevant.
Advantage
To establish reliable guidelines, or
lower bounds for the required speed-
up of a quantum computer, we err on
the side of being optimistic for quan-
tum and overly pessimistic for clas-
sical computing. Despite our overly
optimistic assumptions, our analysis
shows a wide range of often-cited ap-
plications is unlikely to result in a
practical quantum advantage without
OPERATING ON FUNDAMENTALLY different principles significant algorithmic improvements.
We compare the performance of only
than conventional computers, quantum computers a single classical chip fabricated like
promise to solve a variety of important problems that the one used in the NVIDIA A100 GPU
seemed forever intractable on classical computers. that fits around 54 billion transistors15
with an optimistic assumption for a which determines bounds for data in- computers. Similarly, a potentially ex-
hypothetical quantum computer that put and output bandwidths. Scalable ponential quantum speedup in linear
may be available in the next decades implementations of quantum random algebra problems12 vanishes when the
with 10,000 error-corrected logical access memory (QRAM8,9) demand a matrix must be loaded from classical
qubits, 10ms gate time for logical op- fault-tolerant error corrected imple- data, or when the full solution vector
erations, the ability to simultaneously mentation and the bandwidth is then should be read out. Generally, quan-
perform gate operations on all qubits fundamentally limited by the number tum computers will be practical for
and all-to-all connectivity for fault tol- of quantum gate operations or mea- “big compute” problems on small data,
erant two-qubit gates.a surements that can be performed per not big data problems.
I/O bandwidth. We first consider unit time. We assume only a single gate Crossover scale. With quantum
the fundamental I/O bottleneck that operation per input bit. For our opti- speedup, asymptotically fewer opera-
limits quantum computers in their mistic future quantum computer, the tions will be needed on a quantum
interaction with the classical world, resulting rate is 10,000-times smaller computer than on a classical computer.
than for an existing classical chip (see Due to the high operational complexity
Table 1). We immediately see that any and slower gate operations, however,
a Note that no quantum error correction scheme
exists today that allows simultaneous execu-
problem limited by accessing classical each operation on a quantum comput-
tion of gates and all-to-all connectivity without data, such as search problems in data- er will be slower than a correspond-
at least a O (√N ) slowdown for N qubits. bases, will be solved faster by classical ing classical one. As sketched in the
accompanying figure, classical com-
Quantum speedup. puters will always be faster for small
problems and quantum advantage
The time needed to solve certain problems with quantum algorithms increases
more slowly than that of any known classical algorithm as the problem size N
is realized beyond a problem-depen-
increases. To be practical, however, we need more than an asymptotic speedup: dent crossover scale where the gain
the crossover time where quantum advantage gets realized needs to be reasonably due to quantum speedup overcomes
short and the crossover problem size not too large. (For illustrative purposes, the the constant slowdown of the quan-
time axis is scaled such that the quantum algorithm is a straight line.)
tum computer. To have real practical
impact, the crossover time must be
short, not more than weeks. Constants
Time
hardware architecture with limited Table 2. Crossover operation counts for quantum algorithms with quadratic, cubic, and
qubit connectivity. quartic speedups.
Crossover times for classical and
quantum computation. To estimate We determine the number of operations that can be afforded per function call (see the
accompanying figure) for a quantum computer to show an advantage over a classical
lower bounds for the crossover times,
computer using a quantum algorithm with quadratic, cubic, and quartic quantum
we consider that while both classical speedup. The number of oracle calls required to reach the crossover point with a
and quantum computers must evalu- quadratic, cubic, and quartic speedup is computed using the relative runtimes of a single
ate the same functions (usually called oracle evaluation, and the total runtime of 106 seconds is then used to compute how
oracles) that describe a problem, many basic operations can be afforded in each oracle call. Since we make optimistic
assumptions for a future quantum computer, we ignore overheads of reversible arithmetic
quantum computers require fewer for quantum computing and limit the classical computer to a single chip that can be
evaluations thereof due to quantum manufactured today. The actual crossover operation counts will be significantly smaller.
speedup. At the root of many quan- A similar analysis for quantum algorithms with exponential speedups yields promising
tum acceleration proposals lies a qua- operation budgets for all datatypes.
more reliable and realistic (less opti- cation circuits would be significantly
mistic) estimates in cases where our cheaper. If we assume a transistor-to-
rough guidelines indicate a potential gate ratio of 1013 and that 50% of the
practical quantum advantage. total chip area is used for control logic
improvements.
throughputs. In Table 1, we provide performance ranges approximately
concrete examples using three types of between 1013 and 1016 op/s for binary,
operations: logical operations, 16-bit int32, and fp16 types.
floating point, and 32-bit integer arith- Hypothetical future quantum computer.
metic operations for numerical model- To determine the costs of N-bit multi-
ing. Other datatypes could be modeled plication on a quantum computer, we
using our methodology as well. choose the controlled adder from Gid-
Classical NVIDIA A100. According ney6 and implement the multiplication
to its datasheet, NVIDIA’s A100 GPU, a using N single-bit controlled adders,
SIMT-style von Neumann load store ar- each requiring 2N CCZ magic states.
chitecture, delivers 312 tera-operations These states are produced in so called
per second (Top/s) with half precision “magic state factories” that are imple-
floating point (fp16) through tensor mented on the physical chip. While
cores and 78Top/s through the normal the resulting multiplier is entirely se-
processing pipeline. NVIDIA assumes quential, we found that this construc-
a 50/50 mix of addition and multiplica- tion allows for more units to be placed
tion operations and thus, we divide the on one chip than for a low-depth adder
number by two, yielding 195Top/s fp16 and/or for a tree-like reduction of par-
performance. The datasheet states tial products since the number of CCZ
19.5Top/s for 32-bit integer operations, states is lower (and thus fewer magic
again assuming a 50/50 mix of addition state factories are required), and the
and multiplication, leading to an effec- number of work-qubits is lower. The
tive 9.75Top/s. The binary tensor core resulting multiplier has a CCZ-depth
performance is listed as 4,992Top/s and count of 2N2 using 5N − 1 qubits
with a limited set of instructions. (2N input, 2N − 1 output, N ancilla for
Classical special-purpose ASIC. Our the addition).
main analysis assumes that we build To compute the space overhead
a special-purpose ASIC using a similar due to CCZ factories, we first use the
technology. If we were to fill the equiv- analysis of Gidney and Fowler7 to com-
alent chip-space of an A100 with a spe- pute the number of physical qubits per
cialized circuit, we would use existing factory when aiming for circuits (pro-
execution units, for which the size is grams) using ≈ 108 CCZ magic states
typically measured in gate equivalents with physical gate errors of 10−3. We
(GE). A 16-bit floating point unit (FPU) approximate the overhead in terms of
with addition and multiplication func- logical qubits by dividing the physi-
tions requires approximately 7kGE, a cal space overhead by 2d2, where we
32-bit integer unit requires 18kGE,14 choose the error-correcting code dis-
and we assume 50GE for a simple bi- tance d = 2 • 312 to be the same as the
nary operation. All units include op- distance used for the second level of
erand buffer registers and support a distillation.7 Thus we divide Gidney
set of programmable instructions. We and Fowler’s 147,904 physical qubits
note that simple addition or multipli- per factory (for details consult the an-
cillary spreadsheet (field B40) of Gid- and 77.4 bin Zop. Similarly, assuming References
ney and Fowler) by 2d2 = 2 • 312 and get the rates from Table 1, for a quantum 1. Aaronson, S., Ben-David, S., Kothari, R., Rao, S. and
Tal, A. Degree vs. approximate degree and quantum
an equivalent space of 77 logical qubits chip: 7, 4, 2, and 350 Gop, respectively. implications of Huang’s sensitivity theorem. In
per factory. We now assume that all calcula- Proceedings of the 53rd Annual ACM SIGACT Symp.
Theory of Computing, 2021, 1330–1342, ACM, New
For the multiplier of the 10-bit man- tions are used in oracle calls on the York, NY, USA.
tissa of an fp16 floating point number, quantum computer and we ignore 2. Arute, F. et al. Quantum supremacy using a
programmable superconducting processor. Nature
we need 2 • 102 = 200 CCZ states and 5 • all further costs on the quantum ma- 574 (2019), 505–510.
10 = 50 qubits. Since each factory takes chine. We start by modeling algo- 3. Babbush, R., McClean, J.R., Newman, M., Gidney,
C., Boixo, S., and Neven, H. Focus beyond quadratic
5.5 cycles7 and we can pipeline the pro- rithms that provide polynomial Xk speedups for error-corrected quantum advantage.
duction of CCZ states, we assume 5.5 speedup, for small constants k. For PRX Quantum, 2:010103, Mar 2021.
4. Beverland, M.E. et al. Assessing requirements to scale
factories per multiplication unit such example, for Grover’s algorithms,11 to practical quantum advantage, 2022.
that multipliers do not wait for magic k + 2. It is clear quantum computers are 5. Feynman, R.P. Simulating physics with computers.
Intern. J. Theoretical Physics 21, 6/7 (1981).
state production on average. Thus, each asymptotically faster (in the number 6. Gidney, C. Halving the cost of quantum addition.
Quantum 2, 74 ( June 2018).
multiplier requires 200 cycles and 5N of oracle queries) for any k >1. Howev- 7. Gidney, C. and Fowler, A.G. Efficient magic state
+ 5.5 • 77 = 50 + 5.5 • 77 = 473.5 qubits. er, we are interested to find the oracle factories with a catalyzed CCZ to 2 T transformation.
Quantum 3, 135 (Apr. 2019).
With a total of 10,000 logical qubits, we complexity (that is, the number of oper- 8. Giovannetti, V., Lloyd, S., and Maccone, L. Quantum
can implement 21 10-bit multipliers on ations required to evaluate it) for which random access memory. Physical Review Letters 100,
16 (Apr. 2008).
our hypothetical quantum chip. With a quantum computer is faster than a 9. Giovannetti, V., Lloyd, S., and Maccone, L.
10ms cycle time, the 200-cycle latency, classical computer within the time- Architectures for a quantum random access memory.
Physical Review A 78, 5 (Nov. 2008).
we get the final rate of less than 105 cy- window of 106 seconds. 10. Grover, L. K. Quantum mechanics helps in searching
cle/ s / (200 cycle/op) • 21= 10.5kop/s. For Let the number of operations re- for a needle in a haystack. Physical Review Letters 79,
2 (1997).
int32 (N=32), the calculation is equiva- quired to evaluate a single oracle call 11. Grover, L. K. A fast quantum mechanical algorithm for
lent. For binary, we assume two input be M and let the number of required database search. In Proceedings of the 28th Annual
ACM Symp. Theory of Computing, 1996. ACM, New
and one output qubit for the (binary) invocations be N. It takes a classical York, NY, 212–219.
adder (Toffoli gate) which does not computer time Tc = Nk • M • tc , whereas 12. Harrow, A.W., Hassidim, A. and Lloyd, S. Quantum
algorithm for linear systems of equations. Phys. Rev.
need ancillas. The results are summa- a quantum computer solves the same Lett. 103, 150502 (Oct. 2009).
13. Jones, S. GlobalFoundries: 7nm, 5nm and 3nm Logic,
rized in Table 1. problem in time Tq = Nk • M • tq where current and projected processes, 2018; https://bit.
A note on parallelism. We assumed tc and tq denote the time to evaluate ly/3XE3ucJ.
14. Mach, S., Schuiki, F., Zaruba, F. and Benini, L. Fpnew:
massively parallel execution of the ora- an operation on a classical and on a An open-source multiformat floating-point unit
cle on both the classical and quantum quantum computer, respectively. By de- architecture for energy-proportional transprecision
computing. IEEE Trans. Very Large-Scale Integration
computer (that is, oracles with a depth manding that the quantum computer 29, 4 (2021), 774–787.
of one). If the oracle does not admit should solve the problem faster than 15. NVIDIA Corp. NVIDIA A100 Tensor Core GPU
Architecture, 2020; https://bit.ly/3WjVc91.
such parallelization, for example, if the classical computer and within 106 16. Shor, P.W. Polynomial-time algorithms for prime
depth = work in the worst-case scenar- seconds, we find factorization and discrete logarithms on a quantum
computer. SIAM J. Comput. 26, 5 (Oct. 1997),
io, then the comparison becomes more 1484–1509.
favorable towards the quantum com-
puter. One could model this scenario by Torsten Hoefler is Consulting Researcher at Microsoft
Corporation, Redmond, WA, USA, and a professor at ETH
allowing the classical computer to only which allows us to compute the maxi- Zurich, Switzerland.
perform one operation per cycle. With a mal number of basic operations per Thomas Häner is Research Scientist at Amazon Web
2GHz clock frequency, this would mean oracle evaluation such that the quan- Services, Zurich, Switzerland. This work was done when
he worked at Microsoft, Zurich, prior to his joining AWS.
a slowdown of about 100,000 times for tum computer still achieves a practical
Matthias Troyer is Technical Fellow and Corporate
fp16 on the GPU. In this extremely un- speedup: Vice President at Microsoft, Redmond, WA, USA.
realistic algorithmic worst case, the
oracle would still have to consist of only
Copyright held by authors.
several thousands of fp16 operations Publication rights licensed to ACM.
with a quadratic speedup. However, we Determining I/O bandwidth. We use
note that in practice, most oracles have the I/O bandwidth specified in NVID-
low depth and parallelization across a IA’s A100 datasheet for our classical
single chip is achievable, which is what chips or the quantum computer, we as-
we assumed. sume that one quantum gate is re-
Determining maximum operation quired per bit of I/O. Using all 10,000
counts per oracle call. In Table 2, we list qubits for reading/writing, this yields
the maximum number of operations of an estimate of the I/O bandwidth B ≈
10,000
a certain type that can be run to achieve 10
–5 = 1Gbit/s.
a quantum speedup within a runtime Acknowledgments. We thank L. Be-
of 106 seconds (a little more than two nini for helpful discussions about ASIC
weeks). The maximum number of clas- and processor design and related over- Watch the authors discuss
sical operations that can be performed heads and W. van Dam and anonymous this work in the exclusive
Communications video.
with a single classical chip in 106 sec- reviewers for comments that improved https://cacm.acm.org/videos/
onds would be: 0.55 fp16, 0.22 int32, an earlier draft. quantum-advantage