You are on page 1of 6

review articles

DOI:10.1145/ 3571725
There is a maze of hard problems
What are the promising applications that have been suggested to profit
from quantum acceleration: from
to realize quantum advantage? cryptanalysis, chemistry and materi-
als science, to optimization, big data,
BY TORSTEN HOEFLER, THOMAS HÄNER, AND MATTHIAS TROYER machine learning, database search,
drug design and protein folding, fluid

Disentangling Hype
dynamics and weather prediction. But
which of these applications realisti-

from Practicality:
cally offer a potential quantum advan-
tage in practice? For this, we cannot
only rely on asymptotic speedups but

On Realistically
must consider the constants involved.
Being optimistic in our outlook for
quantum computers, we identify clear

Achieving
guidelines for quantum practicality
and use them to classify which of the
many proposed applications for quan-
tum computing show promise and

Quantum
which ones would require significant
algorithmic improvements to become
practical and relevant.

Advantage
To establish reliable guidelines, or
lower bounds for the required speed-
up of a quantum computer, we err on
the side of being optimistic for quan-
tum and overly pessimistic for clas-
sical computing. Despite our overly
optimistic assumptions, our analysis
shows a wide range of often-cited ap-
plications is unlikely to result in a
practical quantum advantage without
OPERATING ON FUNDAMENTALLY different principles significant algorithmic improvements.
We compare the performance of only
than conventional computers, quantum computers a single classical chip fabricated like
promise to solve a variety of important problems that the one used in the NVIDIA A100 GPU
seemed forever intractable on classical computers. that fits around 54 billion transistors15

Leveraging the quantum foundations of nature, the


time to solve certain problems on quantum computers key insights
grows more slowly with the size of the problem than on ˽ Most of today’s quantum algorithms may
not achieve practical speedups. Material
classical computers—this is called quantum speedup.
IMAGE BY AND RIJ BORYS ASSOCIAT ES, USING SH UTT ERSTOC K

science and chemistry have a huge


potential and we hope more practical
Going beyond quantum supremacy,2 which was the algorithms will be invented based on our
guidelines
demonstration of a quantum computer outperforming
˽ Due to limitations of input and output
a classical one for an artificial problem, an important bandwidth, quantum computers will be
question is finding meaningful applications (of practical for “big compute” problems on
small data, not big data problems.
academic or commercial interest) that can realistically ˽ Quadratic speedups delivered by
be solved faster on a quantum computer than on algorithms such as Grover’s search
are insufficient for practical quantum
a classical one. We call this a practical quantum advantage without significant
improvements across the entire software/
advantage, or quantum practicality for short. hardware stack.

82 COMM UNICATIO NS O F THE ACM | M AY 2023 | VO L . 66 | NO. 5


MAY 2 0 2 3 | VO L. 6 6 | N O. 5 | C OM M U N IC AT ION S OF T HE ACM 83
review articles

with an optimistic assumption for a which determines bounds for data in- computers. Similarly, a potentially ex-
hypothetical quantum computer that put and output bandwidths. Scalable ponential quantum speedup in linear
may be available in the next decades implementations of quantum random algebra problems12 vanishes when the
with 10,000 error-corrected logical access memory (QRAM8,9) demand a matrix must be loaded from classical
qubits, 10ms gate time for logical op- fault-tolerant error corrected imple- data, or when the full solution vector
erations, the ability to simultaneously mentation and the bandwidth is then should be read out. Generally, quan-
perform gate operations on all qubits fundamentally limited by the number tum computers will be practical for
and all-to-all connectivity for fault tol- of quantum gate operations or mea- “big compute” problems on small data,
erant two-qubit gates.a surements that can be performed per not big data problems.
I/O bandwidth. We first consider unit time. We assume only a single gate Crossover scale. With quantum
the fundamental I/O bottleneck that operation per input bit. For our opti- speedup, asymptotically fewer opera-
limits quantum computers in their mistic future quantum computer, the tions will be needed on a quantum
interaction with the classical world, resulting rate is 10,000-times smaller computer than on a classical computer.
than for an existing classical chip (see Due to the high operational complexity
Table 1). We immediately see that any and slower gate operations, however,
a Note that no quantum error correction scheme
exists today that allows simultaneous execu-
problem limited by accessing classical each operation on a quantum comput-
tion of gates and all-to-all connectivity without data, such as search problems in data- er will be slower than a correspond-
at least a O (√N ) slowdown for N qubits. bases, will be solved faster by classical ing classical one. As sketched in the
accompanying figure, classical com-
Quantum speedup. puters will always be faster for small
problems and quantum advantage
The time needed to solve certain problems with quantum algorithms increases
more slowly than that of any known classical algorithm as the problem size N
is realized beyond a problem-depen-
increases. To be practical, however, we need more than an asymptotic speedup: dent crossover scale where the gain
the crossover time where quantum advantage gets realized needs to be reasonably due to quantum speedup overcomes
short and the crossover problem size not too large. (For illustrative purposes, the the constant slowdown of the quan-
time axis is scaled such that the quantum algorithm is a straight line.)
tum computer. To have real practical
impact, the crossover time must be
short, not more than weeks. Constants
Time

Classical matter in determining the utility for


Computer
goal ≤ 2 weeks applications, as with any runtime esti-
years
mate in computing.
Compute performance. To model
crossover
weeks performance, we employ the well-
time
Quantum
known work-depth model from clas-
Computer sical parallel computing to determine
crossover upper bounds of classical silicon-
size based computations and an exten-
minutes
sion for quantum computations. In
this model, the work is the total num-
Problem Size (N) ber of operations and applies to both
classical and quantum executions. In
Table 1, we provide concrete examples
using three types of operations: logi-
Table 1. Performance comparison. cal operations, 16-bit floating point,
and 32-bit integer or fixed-point
We compare the peak performance of a single classical chip that can be
manufactured today (like an NVIDIA A100 GPU, or an ASIC with a similar number
arithmetic operations for numerical
of transistors) with a future quantum computer with 10,000 error-corrected logical modeling. For the quantum costs,
qubits, 10ms gate time for logical operations and all-to-all connectivity. We consider we consider only the most expensive
an estimate of the I/O bandwidth (namely the number of operations per second) parts in our estimates, again benefit-
and three types of operations: logical binary operations, 16-bit floating point, 32-bit
integer or fixed-point arithmetic multiply add operations.
ing quantum computers; for arithme-
tic, we count just the dominant cost
GPU ASIC Future Quantum of multiplications, assuming addi-
I/O Bandwidth 10,000 Gbit/s 10,000 G/s 1 Gbit/s
tions are free. Furthermore, for float-
ing point multiplication, we consider
Operation throughput only the cost of the multiplication of
16-bit floating point 195 Top/s 550 Top/s 10.5 kop/s the mantissa (10 bits in fp16). We ig-
32-bit integer 9.75 Top/s 215 Top/s 0.83 kop/s nore all further overheads incurred by
binary (Boolean logical) 4,992 Top/s 77,000 Top/s 235 kop/s
the quantum algorithm due to revers-
ible computations, as well as the sig-
nificant cost of mapping to a specific

84 COMM UNICATIO NS O F THE ACM | M AY 2023 | VO L . 66 | NO. 5


review articles

hardware architecture with limited Table 2. Crossover operation counts for quantum algorithms with quadratic, cubic, and
qubit connectivity. quartic speedups.
Crossover times for classical and
quantum computation. To estimate We determine the number of operations that can be afforded per function call (see the
accompanying figure) for a quantum computer to show an advantage over a classical
lower bounds for the crossover times,
computer using a quantum algorithm with quadratic, cubic, and quartic quantum
we consider that while both classical speedup. The number of oracle calls required to reach the crossover point with a
and quantum computers must evalu- quadratic, cubic, and quartic speedup is computed using the relative runtimes of a single
ate the same functions (usually called oracle evaluation, and the total runtime of 106 seconds is then used to compute how
oracles) that describe a problem, many basic operations can be afforded in each oracle call. Since we make optimistic
assumptions for a future quantum computer, we ignore overheads of reversible arithmetic
quantum computers require fewer for quantum computing and limit the classical computer to a single chip that can be
evaluations thereof due to quantum manufactured today. The actual crossover operation counts will be significantly smaller.
speedup. At the root of many quan- A similar analysis for quantum algorithms with exponential speedups yields promising
tum acceleration proposals lies a qua- operation budgets for all datatypes.

dratic quantum speedup, including Maximum number of operations for practical


the well-known Grover algorithm.10,11
Operation type quadratic speedup cubic speedup quartic speedup
For such an algorithm, a problem
16-bit floating point 0.2 45,800 2,800,000
that needs X function calls on a quan-
tum computer requires quadratically 32-bit integer 0.003 1,630 130,000

more, namely on the order of X2 calls Binary (logical) 68 12,500,000 712,000,000


on a classical computer. To overcome
the large constant performance differ-
ence between a quantum computer in quantum technology of multiple or- ear systems of equations, such as fluid
and a classical computer, which Table ders of magnitude. dynamics in the turbulent regime,
1 shows to be more than a factor of Practical and impractical applica- weather, and climate simulations will
1010, many function calls X ≫ 1010 is tions. We can now use these consid- not achieve quantum advantage with
needed for the quantum speedup to erations to discuss several classes of current quantum algorithms in the
deliver a practical advantage. In Table applications where our fundamen- foreseeable future. We also conclude
2, we estimate upper bounds for the tal bounds draw a line for quantum that the identified I/O limits constrain
complexity of the function that will practicality. The most likely problems the performance of quantum comput-
lead to a cross-over time of 106 sec- to allow for a practical quantum ad- ing for big data problems, unstruc-
onds, or approximately two weeks. vantage are those with exponential tured linear systems, and database
We see that with quadratic speedup quantum speedup. This includes the search based on Grover’s algorithm
even a single floating point or integer simulation of quantum systems for such that a speedup is unlikely in
operation leads to crossover times of problems in chemistry, materials sci- those cases. Furthermore, Aaronson
several months. Furthermore, at most ence, and quantum physics, as well et al.1 show the achievable quantum
68 binary logical operations can be as cryptanalysis using Shor’s algo- speedup of unstructured black-box
afforded to stay within our desired rithm.16 The solution of linear systems algorithms is limited to O(N4). This
crossover time of two weeks, which is of equations for highly structured implies that any algorithm achieving
too low for any non-trivial application. problems12 also has an exponential higher speedup must exploit structure
Keeping in mind that these estimates speedup, but the I/O limitations dis- in the problem it solves.
are pessimistic for classical compu- cussed above will limit the practicality These considerations help with
tation (a single of today’s classical and undo this advantage if the matrix separating hype from practicality in
chips) and overly optimistic for quan- has to be loaded from memory instead the search for quantum applications
tum computing (only considering the of being computed based on limited and can guide algorithmic develop-
multiplication of the mantissa and as- data or knowledge of the full solution ments. Specifically, our analysis shows
suming all-to-all qubit connectivity), is required (as opposed to just some it is necessary for the community to
we come to the clear conclusion that limited information obtained by sam- focus on super-quadratic speedups,
quadratic speedups are insufficient pling the solution). ideally exponential speedups, and one
for practical quantum advantage. Equally important, we identify like- needs to carefully consider I/O bottle-
The numbers look better for cubic or ly dead ends in the maze of applica- necks when deriving algorithms to
quartic speedups where thousands tions. A large range of problem areas exploit quantum computation best.
or millions of operations may be fea- with quadratic quantum speedups, Therefore, the most promising can-
sible, and we conclude, similarly to such as many current machine learn- didates for quantum practicality are
Babbush et al.,3 that at least cubic or ing training approaches, accelerating small-data problems with exponential
quartic speedups are required for a drug design and protein folding with speedup. Specific examples where this
practical quantum advantage. Grover’s algorithm, speeding up Mon- is the case are quantum problems in
As a result of our overly optimistic te Carlo simulations through quan- chemistry and materials science,5
assumptions in favor of quantum com- tum walks, as well as more traditional which we identify as the most promis-
puting, these conclusions will remain scientific computing simulations in- ing application. We recommend using
valid even with significant advances cluding the solution of many non-lin- precise requirements models4 to get

MAY 2 0 2 3 | VO L. 6 6 | N O. 5 | C OM M U N IC AT ION S OF T HE ACM 85


review articles

more reliable and realistic (less opti- cation circuits would be significantly
mistic) estimates in cases where our cheaper. If we assume a transistor-to-
rough guidelines indicate a potential gate ratio of 1013 and that 50% of the
practical quantum advantage. total chip area is used for control logic

Methods Our analysis shows of a dataflow ASIC with the required


buffering, we can fit 54.2B/(7k • 10 • 2)
Here, we provide more details for how
we obtained the numbers mentioned
a wide range = 387k fp16 units. Similarly, we can
fit 54.2B(18k • 10 • 2) = 151k int32, or
earlier. We compare our quantum of often-cited 54.2B/(50 • 10 • 2) = 54.2M bin2 units
computer with a single microprocessor
chip like the one used in the NVIDIA
applications on our hypothetical chip. Assuming a
cycle time of 0.7ns, this leads to a total
A100 GPU.15 The A100 chip is around is unlikely to result operation rate of 0.55 fp16, 0.22 int32,
850mm2 in size and manufactured in
TSMC’s 7nm N7 silicon process. A100
in a practical and 77.4 bin Pop/s for an application-
specific ASIC with the A100’s technol-
shows that such a chip fits around 54.2 quantum advantage ogy and budget. The ASIC thus leads to
billion transistors and can operator at
a cycle time of around 0.7ns. without significant a raw speedup between approximately
2x and 15x over a programmable cir-
Determining peak operation algorithmic cuit. Thus, on classical silicon, the

improvements.
throughputs. In Table 1, we provide performance ranges approximately
concrete examples using three types of between 1013 and 1016 op/s for binary,
operations: logical operations, 16-bit int32, and fp16 types.
floating point, and 32-bit integer arith- Hypothetical future quantum computer.
metic operations for numerical model- To determine the costs of N-bit multi-
ing. Other datatypes could be modeled plication on a quantum computer, we
using our methodology as well. choose the controlled adder from Gid-
Classical NVIDIA A100. According ney6 and implement the multiplication
to its datasheet, NVIDIA’s A100 GPU, a using N single-bit controlled adders,
SIMT-style von Neumann load store ar- each requiring 2N CCZ magic states.
chitecture, delivers 312 tera-operations These states are produced in so called
per second (Top/s) with half precision “magic state factories” that are imple-
floating point (fp16) through tensor mented on the physical chip. While
cores and 78Top/s through the normal the resulting multiplier is entirely se-
processing pipeline. NVIDIA assumes quential, we found that this construc-
a 50/50 mix of addition and multiplica- tion allows for more units to be placed
tion operations and thus, we divide the on one chip than for a low-depth adder
number by two, yielding 195Top/s fp16 and/or for a tree-like reduction of par-
performance. The datasheet states tial products since the number of CCZ
19.5Top/s for 32-bit integer operations, states is lower (and thus fewer magic
again assuming a 50/50 mix of addition state factories are required), and the
and multiplication, leading to an effec- number of work-qubits is lower. The
tive 9.75Top/s. The binary tensor core resulting multiplier has a CCZ-depth
performance is listed as 4,992Top/s and count of 2N2 using 5N − 1 qubits
with a limited set of instructions. (2N input, 2N − 1 output, N ancilla for
Classical special-purpose ASIC. Our the addition).
main analysis assumes that we build To compute the space overhead
a special-purpose ASIC using a similar due to CCZ factories, we first use the
technology. If we were to fill the equiv- analysis of Gidney and Fowler7 to com-
alent chip-space of an A100 with a spe- pute the number of physical qubits per
cialized circuit, we would use existing factory when aiming for circuits (pro-
execution units, for which the size is grams) using ≈ 108 CCZ magic states
typically measured in gate equivalents with physical gate errors of 10−3. We
(GE). A 16-bit floating point unit (FPU) approximate the overhead in terms of
with addition and multiplication func- logical qubits by dividing the physi-
tions requires approximately 7kGE, a cal space overhead by 2d2, where we
32-bit integer unit requires 18kGE,14 choose the error-correcting code dis-
and we assume 50GE for a simple bi- tance d = 2 • 312 to be the same as the
nary operation. All units include op- distance used for the second level of
erand buffer registers and support a distillation.7 Thus we divide Gidney
set of programmable instructions. We and Fowler’s 147,904 physical qubits
note that simple addition or multipli- per factory (for details consult the an-

86 COMMUNICATIO NS O F TH E AC M | M AY 2023 | VO L . 66 | NO. 5


review articles

cillary spreadsheet (field B40) of Gid- and 77.4 bin Zop. Similarly, assuming References
ney and Fowler) by 2d2 = 2 • 312 and get the rates from Table 1, for a quantum 1. Aaronson, S., Ben-David, S., Kothari, R., Rao, S. and
Tal, A. Degree vs. approximate degree and quantum
an equivalent space of 77 logical qubits chip: 7, 4, 2, and 350 Gop, respectively. implications of Huang’s sensitivity theorem. In
per factory. We now assume that all calcula- Proceedings of the 53rd Annual ACM SIGACT Symp.
Theory of Computing, 2021, 1330–1342, ACM, New
For the multiplier of the 10-bit man- tions are used in oracle calls on the York, NY, USA.
tissa of an fp16 floating point number, quantum computer and we ignore 2. Arute, F. et al. Quantum supremacy using a
programmable superconducting processor. Nature
we need 2 • 102 = 200 CCZ states and 5 • all further costs on the quantum ma- 574 (2019), 505–510.
10 = 50 qubits. Since each factory takes chine. We start by modeling algo- 3. Babbush, R., McClean, J.R., Newman, M., Gidney,
C., Boixo, S., and Neven, H. Focus beyond quadratic
5.5 cycles7 and we can pipeline the pro- rithms that provide polynomial Xk speedups for error-corrected quantum advantage.
duction of CCZ states, we assume 5.5 speedup, for small constants k. For PRX Quantum, 2:010103, Mar 2021.
4. Beverland, M.E. et al. Assessing requirements to scale
factories per multiplication unit such example, for Grover’s algorithms,11 to practical quantum advantage, 2022.
that multipliers do not wait for magic k + 2. It is clear quantum computers are 5. Feynman, R.P. Simulating physics with computers.
Intern. J. Theoretical Physics 21, 6/7 (1981).
state production on average. Thus, each asymptotically faster (in the number 6. Gidney, C. Halving the cost of quantum addition.
Quantum 2, 74 ( June 2018).
multiplier requires 200 cycles and 5N of oracle queries) for any k >1. Howev- 7. Gidney, C. and Fowler, A.G. Efficient magic state
+ 5.5 • 77 = 50 + 5.5 • 77 = 473.5 qubits. er, we are interested to find the oracle factories with a catalyzed CCZ to 2 T transformation.
Quantum 3, 135 (Apr. 2019).
With a total of 10,000 logical qubits, we complexity (that is, the number of oper- 8. Giovannetti, V., Lloyd, S., and Maccone, L. Quantum
can implement 21 10-bit multipliers on ations required to evaluate it) for which random access memory. Physical Review Letters 100,
16 (Apr. 2008).
our hypothetical quantum chip. With a quantum computer is faster than a 9. Giovannetti, V., Lloyd, S., and Maccone, L.
10ms cycle time, the 200-cycle latency, classical computer within the time- Architectures for a quantum random access memory.
Physical Review A 78, 5 (Nov. 2008).
we get the final rate of less than 105 cy- window of 106 seconds. 10. Grover, L. K. Quantum mechanics helps in searching
cle/ s / (200 cycle/op) • 21= 10.5kop/s. For Let the number of operations re- for a needle in a haystack. Physical Review Letters 79,
2 (1997).
int32 (N=32), the calculation is equiva- quired to evaluate a single oracle call 11. Grover, L. K. A fast quantum mechanical algorithm for
lent. For binary, we assume two input be M and let the number of required database search. In Proceedings of the 28th Annual
ACM Symp. Theory of Computing, 1996. ACM, New
and one output qubit for the (binary) invocations be N. It takes a classical York, NY, 212–219.
adder (Toffoli gate) which does not computer time Tc = Nk • M • tc , whereas 12. Harrow, A.W., Hassidim, A. and Lloyd, S. Quantum
algorithm for linear systems of equations. Phys. Rev.
need ancillas. The results are summa- a quantum computer solves the same Lett. 103, 150502 (Oct. 2009).
13. Jones, S. GlobalFoundries: 7nm, 5nm and 3nm Logic,
rized in Table 1. problem in time Tq = Nk • M • tq where current and projected processes, 2018; https://bit.
A note on parallelism. We assumed tc and tq denote the time to evaluate ly/3XE3ucJ.
14. Mach, S., Schuiki, F., Zaruba, F. and Benini, L. Fpnew:
massively parallel execution of the ora- an operation on a classical and on a An open-source multiformat floating-point unit
cle on both the classical and quantum quantum computer, respectively. By de- architecture for energy-proportional transprecision
computing. IEEE Trans. Very Large-Scale Integration
computer (that is, oracles with a depth manding that the quantum computer 29, 4 (2021), 774–787.
of one). If the oracle does not admit should solve the problem faster than 15. NVIDIA Corp. NVIDIA A100 Tensor Core GPU
Architecture, 2020; https://bit.ly/3WjVc91.
such parallelization, for example, if the classical computer and within 106 16. Shor, P.W. Polynomial-time algorithms for prime
depth = work in the worst-case scenar- seconds, we find factorization and discrete logarithms on a quantum
computer. SIAM J. Comput. 26, 5 (Oct. 1997),
io, then the comparison becomes more 1484–1509.
favorable towards the quantum com-
puter. One could model this scenario by Torsten Hoefler is Consulting Researcher at Microsoft
Corporation, Redmond, WA, USA, and a professor at ETH
allowing the classical computer to only which allows us to compute the maxi- Zurich, Switzerland.
perform one operation per cycle. With a mal number of basic operations per Thomas Häner is Research Scientist at Amazon Web
2GHz clock frequency, this would mean oracle evaluation such that the quan- Services, Zurich, Switzerland. This work was done when
he worked at Microsoft, Zurich, prior to his joining AWS.
a slowdown of about 100,000 times for tum computer still achieves a practical
Matthias Troyer is Technical Fellow and Corporate
fp16 on the GPU. In this extremely un- speedup: Vice President at Microsoft, Redmond, WA, USA.
realistic algorithmic worst case, the
oracle would still have to consist of only
Copyright held by authors.
several thousands of fp16 operations Publication rights licensed to ACM.
with a quadratic speedup. However, we Determining I/O bandwidth. We use
note that in practice, most oracles have the I/O bandwidth specified in NVID-
low depth and parallelization across a IA’s A100 datasheet for our classical
single chip is achievable, which is what chips or the quantum computer, we as-
we assumed. sume that one quantum gate is re-
Determining maximum operation quired per bit of I/O. Using all 10,000
counts per oracle call. In Table 2, we list qubits for reading/writing, this yields
the maximum number of operations of an estimate of the I/O bandwidth B ≈
10,000
a certain type that can be run to achieve 10
–5 = 1Gbit/s.
a quantum speedup within a runtime Acknowledgments. We thank L. Be-
of 106 seconds (a little more than two nini for helpful discussions about ASIC
weeks). The maximum number of clas- and processor design and related over- Watch the authors discuss
sical operations that can be performed heads and W. van Dam and anonymous this work in the exclusive
Communications video.
with a single classical chip in 106 sec- reviewers for comments that improved https://cacm.acm.org/videos/
onds would be: 0.55 fp16, 0.22 int32, an earlier draft. quantum-advantage

MAY 2 0 2 3 | VO L. 6 6 | N O. 5 | C OM M U N IC AT ION S OF T HE ACM 87

You might also like