
PAC Learning of Quantum Measurement Classes: Sample Complexity Bounds and Universal Consistency

Arun Padakandla (University of Tennessee, Knoxville)    Abram Magner (University at Albany, SUNY)

Abstract

We formulate a quantum analogue of the fundamental classical PAC learning problem. As on a quantum computer, we model data to be encoded by modifying specific attributes - the spin axis of an electron, the plane of polarization of a photon - of sub-atomic particles. Any interaction, including reading off, extracting or learning from such data, is via quantum measurements, thus leading us to a problem of PAC learning quantum measurement classes. We propose and analyze the sample complexity of a new ERM algorithm that respects quantum non-commutativity. Our study entails defining the VC dimension of positive operator-valued measurement (POVM) concept classes. Our sample complexity bounds involve optimizing over partitions of jointly measurable classes. Finally, we identify universally consistent sequences of POVM classes. Technical components of this work include computations involving tensor products, trace, and uniform convergence bounds.

Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Valencia, Spain. PMLR: Volume 151. Copyright 2022 by the author(s).

1 INTRODUCTION

The blueprint of today's learning algorithms can be traced back to the foundational ideas developed in the early works of statistical learning theory (SLT). Questions such as: When is a concept class learnable? What parameters quantify its complexity with regard to learnability? How does the sample complexity depend on these parameters? were formulated four decades ago with Turing's model of computation in mind. Answers to these questions continue to influence today's learning algorithms. On an alternate front, considerable efforts are underway to design and build a fundamentally new computing device based on the axioms of quantum mechanics. How do the above questions manifest in this new quantum computing framework? Do they lead to interesting formulations requiring new physical and mathematical ideas? These questions motivate the current work.

Learning from physical realizations of quantum states: In the classical model of computation, data stored in registers is modeled as elements taking values in sets, and computations or algorithms are modeled as functions or sequences of functions operating on these sets. In order to classify, say, an image stored in a collection of registers, a learning algorithm has to identify an optimal function from a concept class – a library of deterministic functions. Classically, the learning algorithm is able to observe the exact contents of registers arbitrarily many times.

In contrast, on a quantum computer, data is encoded on sub-atomic particles by modifying their attributes. For example, 8 bits can be stored on an electron by preparing the axis of its spin to be one among 256 outward radial directions. As we will discuss in more detail below, data stored in such a manner cannot be directly observed. Moreover, learning information about the value stored in the attributes of a particle necessarily results in a change in that value/attribute.

Our framework focuses on learning in a supervised setting, where the data is encoded, as described above, in attributes of sub-atomic particles – i.e., as a physical realization (not an analytical description) of a quantum state – and the associated labels, which we interpret as inferences about the data, are stored in classical registers. The combination of the quantum data with the classical label is referred to as a classical-quantum (CQ) state [Wilde, 2017, Sec. 4.3.4]. Any interaction with a sub-atomic particle is governed by the laws of quantum mechanics, and in particular, ascertaining its state or properties thereof is via a measurement. Therefore, the prediction of the label of a quantum state must proceed via a measurement that is applied to the particle and results in a classical outcome that is interpreted as the predicted label.

The task of a quantum learning algorithm in this formulation, given a set of physical realizations of quantum states along with their associated labels, is to identify an "optimal" measurement from a fixed concept class, which now consists of a library of measurements. Measurements are formalized via the generic framework of positive operator-valued measurements (POVMs) [Holevo, 2019, Sec. 2.3]. A concept class thus consists of a set of POVMs.

In this work, we answer the above three questions in the context of POVM concept classes. We then initiate a study of universal consistency [Stephane Boucheron and Massart, 2013] and identify universally consistent sequences of POVM concept classes.

Challenges inherent in our setting: This natural, deceptively simple formulation hides involved challenges. The axioms of quantum mechanics, which govern all our interactions with quantum data, manifest in three unique properties - non-commutativity, quantum indeterminism and entanglement. Non-commutativity implies that measuring a quantum state causes it to collapse. Post-measurement collapse results in a significant blowup of the sample complexity of the empirical risk minimization (ERM) rule, thus leading us to modify this foundational algorithm of SLT. Leveraging the elegant notion of joint measurability, also referred to herein as compatibility (Sec. 3.1), we propose a modified ERM rule (Alg. 1 in Sec. 3.2) that prevents this blowup. Next, we quantify the effect of quantum indeterminism in the characterization of the VC dimension of a POVM concept class (Defn. 5). Our analysis (Sec. 4) reveals the functional dependence of the sample complexity (Thm. 1) on this measure of complexity. Our formulation generalizes (Rem. 1) the original PAC formulation [Vapnik and Chervonenkis, 1971, Vapnik and Chervonenkis, 1974, Vapnik, 1982, Valiant, 1984, Vapnik, 1995, Vapnik, 1998] and we recover all corresponding results (Rem. 4). We also generalize the results of [Heidari et al., 2021], which deals only with finite concept classes and a simpler notion of compatibility. In Sec. 5, we use optimal state discrimination to identify sequences of POVM concept classes that are universally consistent (Defn. 7).

Prior Work: Related prior work can be classified broadly into two sets. The first set, which includes this work, revolves around learning a quantum state [Arunachalam and de Wolf, 2017, Arunachalam et al., 2020], or its properties [Anshu et al., 2021], by measuring multiple identically prepared states. This set includes works on state tomography [O'Donnell and Wright, 2016, O'Donnell and Wright, 2017, Haah et al., 2016, Aaronson, 2006, Rehacek and Paris, 2004, Waseem et al., 2020, Altepeter et al., 2004, Aaronson, 2018, Aaronson et al., 2018], state discrimination [Bae and Kwek, 2015] and quantum property testing [Montanaro and de Wolf, 2018, Bubeck et al., 2020] that investigate the number of prepared states needed to accomplish the learning task. We highlight the work of Cheng et al. [Cheng et al., 2016], which studies the related problem in which the training samples form the collection (ρi, tr(M ρi)) : 1 ≤ i ≤ n, where ρi denotes the ith quantum state, and the algorithm must find the unknown optimal POVM M.¹ See [Kearns and Schapire, 1990] for a closely related classical study. The related paper [Heidari et al., 2021] defines and studies a framework and learning rule similar to our own but only proves a sample complexity bound for finite POVM classes. Furthermore, it makes much stronger assumptions regarding the mutual compatibility of POVMs in the classes under study. The extension in the present work of statistical learning theory machinery to the quantum setting, necessary for proving finite upper bounds on the sample complexity for learning infinite-cardinality POVM classes, is a key contribution of our work.

¹ Our work differs considerably since we are provided the actual labels, not the associated probabilities tr(M ρi) : 1 ≤ i ≤ n. As discussed in [Kearns and Schapire, 1990], this makes the problem substantially more challenging.

The second set comprises works [Gortler and Servedio, 2001, Servedio, 2001, Lloyd et al., 2014, Schuld et al., 2014, Atici and Servedio, 2005, Anguita et al., 2003, Wiebe et al., 2012, Aïmeur et al., 2012] that explore the use of a quantum computer to speed up classical learning tasks. Surveys [Arunachalam and de Wolf, 2017, Khan and Robles-Kelly, 2020] provide a detailed account of works in this set.

Notation: For an integer n ∈ N, [n] ≜ {1, · · · , n}. We reserve H with appropriate subscripts to denote a finite-dimensional Hilbert space. The symbols L(H), R(H), P(H), D(H) denote the collections of linear, Hermitian, positive and density operators acting on H, respectively. A POVM is a collection M ≜ {My ∈ P(H) : y ∈ Y} of indexed positive operators that sum to the identity IH, i.e., Σ_{y∈Y} My = IH. For clarity, a POVM M = {My ∈ P(H) : y ∈ Y} with outcomes in Y is often referred to as a Y−POVM or a Y−POVM on H. In this work, we are required to perform the same measurement on multiple quantum states. For a Y−POVM M = {My : y ∈ Y}, we let M^n ≜ {M_{y^n} ≜ M_{y1} ⊗ · · · ⊗ M_{yn} : y^n ∈ Y^n} with a slight abuse of notation. For a Hilbert space HX and a set Y, we let MX→Y denote the set of all Y−POVMs on HX.
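The POVM conditions stated in the Notation paragraph — positivity of each My and Σy My = IH — and the n-fold product POVM M^n are easy to check numerically. The following is a minimal sketch in Python/NumPy; the unsharp qubit POVM below (a noisy Z-basis measurement with an assumed noise parameter eta) is a hypothetical example, not taken from the paper.

```python
import numpy as np
from functools import reduce
from itertools import product

# An unsharp qubit Y-POVM (noisy Z measurement); eta is an assumed parameter.
eta = 0.8
P = {0: np.diag([1.0, 0.0]), 1: np.diag([0.0, 1.0])}
M = {y: eta * P[y] + (1 - eta) / 2 * np.eye(2) for y in (0, 1)}

# POVM conditions: each M_y is positive semidefinite and sum_y M_y = I_H.
assert all(np.linalg.eigvalsh(My).min() >= -1e-12 for My in M.values())
assert np.allclose(sum(M.values()), np.eye(2))

# The n-fold product M^n = {M_{y1} ⊗ ... ⊗ M_{yn} : y^n in Y^n}.
n = 3
Mn = {yn: reduce(np.kron, [M[y] for y in yn]) for yn in product(M, repeat=n)}
assert len(Mn) == 2 ** n
assert np.allclose(sum(Mn.values()), np.eye(2 ** n))  # M^n is again a POVM
```

The last assertion reflects that the Kronecker product distributes over the sums, so M^n sums to the identity on the n-fold tensor product space.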

See [Holevo, 2019, Sec. 2.3] for its structure. We let M_{X^n→Y} denote the set of all Y−POVMs on HX^⊗n. We let SX→Y denote the subcollection of sharp (projective) measurements on HX with outcomes in Y. We let Z ≜ X × Y and HZ ≜ HX ⊗ HY. We abbreviate random variables, probability mass function and empirical risk as RVs, PMF and ER, respectively. We use PMF and distribution interchangeably. We write X ∼ pX if RV X is distributed with PMF pX.

2 MODEL FORMULATION, PROBLEM STATEMENT

Let X represent an (arbitrary) set of features, Y a finite set of labels and P(X × Y) the set of all distributions/densities/PMFs on X × Y. In the classical model of computation, data and hence features are stored in registers and operated on by functions. Therefore, random variables (RVs) (X, Y) ∈ X × Y with an unknown distribution pXY model feature-label pairs occurring in nature. A predictor – a function f ∈ FX→Y – is only provided X. Its performance is measured through its true risk lp(f) ≜ lpXY(f) ≜ EpXY{l(f(X), Y)} = Ep{l(f(X), Y)} with respect to a loss function l : Y² → [A, B] bounded between A and B, where 0 ≤ A ≤ B. A learning algorithm's (LA's) task is to identify an optimal predictor – a function – from within its concept class C ⊆ FX→Y – a library of functions. We focus on supervised learning, wherein the LA is provided with n training samples modelled as n independent and identically distributed (IID) pairs (Xi, Yi) ∈ X × Y : i ∈ [n] with unknown density pXY. A C−LA is therefore again a sequence of functions An : (X × Y)^n → C : n ≥ 1, of which the ERM rule [Shalev-Shwartz and Ben-David, 2014, Sec. 2.2] is a canonical example.

Before we describe a quantum formulation, we briefly highlight two consequences of the classical model that are often taken for granted. First, stored in registers and modeled as elements of a set, features can be duplicated (indiscriminately), thereby enabling one to simultaneously retain an "original copy" x and the outcomes (f(x) : f ∈ C) of evaluating it through any arbitrary collection of functions (Note 1). Secondly, if one ignores circuit reliability issues, no uncertainty is involved in evaluating features through functions (Note 2). In other words, the outcome of evaluating a function on a feature is deterministic.

In our proposed quantum learning model, a feature x ∈ X is encoded by modifying specific characteristics of a sub-atomic particle. Let ρx ∈ D(H) : x ∈ X model the behaviour of a particle representing feature x ∈ X. While X × Y indexes (all) possible feature-label pairs, the feature-label pair (x, y) is encoded in a physically realized quantum system modeled via the density operator ρxy ≜ ρx ⊗ |y⟩⟨y| ∈ D(HZ), where HZ = HX ⊗ HY and HY = span{|y⟩ : y ∈ Y} with ⟨y|ŷ⟩ = δyŷ ensuring label distinguishability. Note that we have modeled labels, as before, to be stored in classical registers and have thus adopted the well-established notion of a classical-quantum (CQ) state [Wilde, 2017, Sec. 4.3.4] to model a feature-label pair.²

² A CQ state [Wilde, 2017, Sec. 4.3.4] models a scenario wherein some components of the data are stored in classical registers, while others are encoded in quantum states that obey the axioms of quantum mechanics.

Labeled features are generated with respect to an unknown distribution pXY. Specifically, the density operator of a feature-label pair is

ρXY ≜ ∫_{X×Y} ρx ⊗ |y⟩⟨y| dpXY ∈ D(HZ),   (1)

which reduces to ρXY = Σ_{(x,y)∈X×Y} pXY(x, y) ρx ⊗ |y⟩⟨y| if |X| < ∞. A predictor is only provided the quantum system corresponding to the feature, i.e., the X−system. Mathematically, the density operator of the quantum state provided to a predictor is Σ_{x∈X} pX(x) ρx. We exemplify this below.

Ex. 1. Consider electrons with spins. An electron can be prepared with the axis of its spin pointing in a direction represented by a 3−dimensional unit vector. Let the finite set X = {(θi, φj) = (iπ/8, j2π/8) : 0 ≤ i, j ≤ 7} index unit vectors on the Bloch sphere representing the possible spin axis directions. We have two labels in Y = {blue, red}. Nature decides to label an electron 'red' if the axis of its spin is orthonormal to a specific orthant; otherwise the electron is labeled 'blue'. For this she chooses a specific orthant O. This establishes a relationship - pY|X - between the elements (x, y) ∈ X × Y. Going further, she chooses a distribution pX, samples X with respect to this distribution, endows an electron with the corresponding spin and hands only the electron to us. Our predictor is aware of X, its association with the spin directions, i.e., the mapping x → ρx, and Y. Oblivious to both nature's decision and the orthant, but possessing the prepared electron, a predictor's task is to determine the label.³

³ For example, we may know that 000 is encoded as a horizontal spin axis, 111 as a vertical spin axis and so on. However, we are unaware of which spin axis is labeled red or blue, i.e., we are unaware of the rule by which a spin axis is labeled.

Previously, with features stored in registers, they were labeled via functions. In contrast, on a quantum computing device, features encoded in quantum states can be labeled only through measurements. A predictor M = {My ∈ P(HX) : y ∈ Y} is therefore a Y−POVM on HX.
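The finite-|X| form of Eq. (1) and the spin-direction encoding of Ex. 1 can be made concrete numerically. In the sketch below, the PMF pXY and the pure-state encoding x → ρx via Bloch-sphere angles are assumed toy choices, not data from the paper.

```python
import numpy as np

def rho_from_angles(theta, phi):
    # Pure qubit state with Bloch vector (sin θ cos φ, sin θ sin φ, cos θ).
    psi = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
    return np.outer(psi, psi.conj())

# The 64 spin directions of Ex. 1 and two labels; p_XY is an assumed toy PMF.
X = [(i * np.pi / 8, j * 2 * np.pi / 8) for i in range(8) for j in range(8)]
Y = [0, 1]
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(len(X) * len(Y))).reshape(len(X), len(Y))

ket = {y: np.eye(2)[:, [y]] for y in Y}  # orthonormal label register states |y>
# Finite-X form of Eq. (1): rho_XY = sum_{x,y} p(x,y) rho_x ⊗ |y><y|.
rho_XY = sum(p[ix, y] * np.kron(rho_from_angles(*x), ket[y] @ ket[y].T)
             for ix, x in enumerate(X) for y in Y)
# The predictor only sees the X-system: rho_X = sum_x p_X(x) rho_x.
rho_X = sum(p[ix].sum() * rho_from_angles(*x) for ix, x in enumerate(X))

assert np.isclose(np.trace(rho_XY).real, 1.0)   # valid density operator
assert np.allclose(rho_XY, rho_XY.conj().T)     # Hermitian
assert np.isclose(np.trace(rho_X).real, 1.0)
```

The assertions confirm that the CQ state is a bona fide density operator and that the feature marginal ρX is again a state, mirroring the reduction of Eq. (1) to the finite-alphabet case.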

Since our labels are stored in registers, a conventional bounded loss function l : Y × Y → [A, B] quantifies a predictor's performance. The true risk of a predictor M = {My : y ∈ Y} when the underlying distribution is pXY is

lp(M) ≜ Σ_{(x,y,ŷ)∈Z} pXY(x, y) l(y, ŷ) tr((ρx ⊗ |y⟩⟨y|)(Mŷ ⊗ IY)),   (2)

where Z ≜ X × Y². A concept class C ⊆ MX→Y is a subcollection of measurements, i.e., POVMs with outcomes in Y. Henceforth, a concept class, a POVM concept class and a measurement concept class all refer to the same object and are used interchangeably. For an underlying distribution pXY, we let l*(pXY, C) ≜ inf_{M∈C} lp(M) denote the risk of an optimal predictor in C. The goal of a C−LA is to choose a predictor M ∈ C – a POVM – whose true risk lp(M) is close to l*(pXY, C), while being oblivious to the underlying distribution pXY. To accomplish this, a C−LA is provided n training samples - labeled features - chosen IID from pXY. The density operator of the received training samples is therefore

ρXY^⊗n ≜ ∫_{X^n×Y^n} ρ_{x^n} ⊗ |y^n⟩⟨y^n| dp^n_XY ∈ D(HZ^⊗n),   (3)

where ρ_{x^n} ≜ ⊗_{i=1}^n ρ_{xi} and |y^n⟩⟨y^n| ≜ ⊗_{i=1}^n |yi⟩⟨yi|.

Mathematically, any operation on a quantum state with classical outcomes is a measurement. A C−LA must therefore measure the n training samples modeled via the density operator ρXY^⊗n and output an index in C.

Defn 1. A POVM concept class for labeling features in D(HX) with labels in Y is a subcollection C ⊆ MX→Y of POVMs. A C−learning algorithm (C−LA) is a sequence An ≜ {A^n_M ∈ P(HZ^⊗n) : M ∈ C} ∈ M_{Z^n→C} : n ≥ 1 of C−POVMs on HZ^⊗n.

We elaborate on Defn. 1 for clarity. A C−LA is a sequence An : n ≥ 1 of measurements. Each measurement in this sequence is a C−POVM. When measurement An is performed on n training samples, the outcome obtained is the index of the chosen predictor in C. Having clarified this, we now ask: how many samples does a C−LA need to identify an ε−optimal predictor with confidence at least 1 − δ? We are thus led to the notion of PAC learnability of measurement spaces.

Defn 2. An algorithm An ≜ {A^n_M ∈ P(HZ^⊗n) : M ∈ C} ∈ M_{Z^n→C} : n ≥ 1 (PAC) learns C if for every ε > 0, δ > 0, there exists an N(ε, δ) ∈ N such that for all n ≥ N(ε, δ) and every distribution pXY, the probability of An choosing a predictor M ∈ C for which the true risk lp(M) > l*(pXY, C) + ε is at most δ. In this case, the function N : [A, B] × (0, 1) → N is the sample complexity of the algorithm An : n ≥ 1 in learning C. We say that the C−LA An : n ≥ 1 efficiently PAC learns C if N(ε, δ) is polynomial in 1/ε, 1/δ. A concept class C ⊆ MX→Y is (efficiently) PAC learnable if there exists a C−LA that (efficiently) PAC learns C.

Remark 1. Suppose we are given a classical PAC learning problem with feature set X and a function concept class C ⊆ FX→Y. We can recover this classical formulation via the following substitutions in the problem formulated herein. Choose HX = span{|x⟩ : x ∈ X} with ⟨x|x̂⟩ = δxx̂ and ρx = |x⟩⟨x|. We then encode the concept class of functions in terms of POVMs as M^f_y = Σ_{x:f(x)=y} |x⟩⟨x| for each f ∈ C.

While the classical problem is well understood [Shalev-Shwartz and Ben-David, 2014, Devroye et al., 1996], the unique behaviour of quantum systems and the complexity of the mathematical objects involved throw up challenges that leave the above problem, which we tackle here, unresolved.

The need to modify the ERM rule: The following example illustrates the need to modify the classical ERM algorithm; we do so in Sec. 3.2 and analyze the resulting sample complexity in Sec. 4.

Ex. 2. Suppose we are provided n = 20 training samples, i.e., 20 electrons with corresponding labels in registers, and our library C consists of 2 predictors M1 = {M1b, M1r}, M2 = {M2b, M2r}. An ERM rule would attempt to measure all 20 electrons with both measurements and choose the measurement whose outcomes disagree least with the provided labels. However, every measurement alters the spin, rendering it unusable for the next measurement. Moreover, electron spins cannot be replicated (No Cloning Theorem [Wootters and Zurek, 1982]). This indicates that every training sample can be used to evaluate the empirical risk (ER) of just one measurement.

Unlike in Remark 1, the non-commutativity of quantum measurements and the No Cloning theorem [Wootters and Zurek, 1982] suggest that each training sample can be used to evaluate the ER of just one measurement. Moreover, the outcomes of measurements are random (Remark 2), resulting in random empirical risks. How should a C−LA optimally utilize training samples to identify the best predictor from within C? The discussion thus far, exemplified through Ex. 1 - 2, suggests that each training sample can yield the ER of just one measurement. This results in a blow-up of the sample complexity of any concept class.
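For a finite toy instance, the true risk of Eq. (2) and the Remark 1 embedding M^f_y = Σ_{x:f(x)=y} |x⟩⟨x| can be sketched as follows. The concept f, the PMF pXY and the 0-1 loss below are hypothetical choices made only for illustration.

```python
import numpy as np

X, Y = [0, 1, 2], [0, 1]
basis = np.eye(len(X))
rho = {x: np.outer(basis[x], basis[x]) for x in X}   # rho_x = |x><x|

def embed(f):
    # Remark 1 embedding: M^f_y = sum_{x: f(x)=y} |x><x|.
    return {y: sum((rho[x] for x in X if f(x) == y), np.zeros((3, 3)))
            for y in Y}

p = {(0, 0): 0.3, (1, 1): 0.4, (2, 1): 0.3}          # assumed p_XY
loss = lambda y, yhat: float(y != yhat)               # 0-1 loss, [A, B] = [0, 1]

def true_risk(M):
    # Eq. (2) with a classical label register reduces to
    # sum_{x,y,yhat} p(x,y) l(y,yhat) tr(rho_x M_yhat).
    return sum(pxy * loss(y, yh) * np.trace(rho[x] @ M[yh]).real
               for (x, y), pxy in p.items() for yh in Y)

Mf = embed(lambda x: int(x >= 1))   # a concept matching the assumed labels
Mg = embed(lambda x: 0)             # a constant concept
assert np.allclose(sum(Mf.values()), np.eye(3))      # valid POVM
assert np.isclose(true_risk(Mf), 0.0)                # perfect predictor
assert np.isclose(true_risk(Mg), 0.7)                # misses the mass on label 1
```

Because the embedded POVMs are projective and diagonal in the same basis, the computed risks coincide with the classical risks of f and g, as Remark 1 asserts.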

Our first, simple step (Sec. 3.1) is to leverage compatible measurements [Holevo, 2019, Sec. 2.3.2] to 're-use' samples effectively. We build on this to derive an upper bound on the sample complexity. In contrast with the previous work [Heidari et al., 2021], our work necessitates a substantial extension to cover the ubiquitous case of infinite-cardinality concept classes and to generalize beyond compatible partitions defined in terms of commuting POVMs.

3 STOCHASTICALLY COMPATIBLE POVMS AND A NEW ERM RULE

While classical register contents may be duplicated, thereby permitting simultaneous evaluations of all functions in C ⊆ FX→Y, the No Cloning theorem [Wootters and Zurek, 1982] precludes reproducing quantum states. Moreover, any measurement results in the collapse of a quantum state: performing POVM M1 = {M1y ⊗ IY : y ∈ Y} on each training sample results in the modified training samples described via ρ̃XY^⊗n, where ρ̃XY = Σ_{z∈Y} ∫ √M1z ρx √M1z ⊗ |y⟩⟨y| dpXY. Performing subsequent POVMs on the collapsed training samples does not enable us to estimate the true risk of those predictors.

3.1 Stochastically Compatible Measurements

The ERM rule we propose is based on a divide-and-conquer approach. While POVMs are generally incompatible, certain subsets are jointly measurable [Holevo, 2019, Sec. 2.1.2], leading us to the following.⁴

⁴ Note that our notion of compatible is referred to as stochastically compatible by Holevo [Holevo, 2019, Sec. 2.1.2, Pg. 13]. Compatible in [Holevo, 2019, Sec. 2.1.2] corresponds to the case when the stochastic matrices are 0 − 1 valued.

Defn 3. A subcollection B ⊆ MX→Y of Y−POVMs on HX is compatible if there exists a finite outcome set W, a W−POVM G = {Gw ∈ P(HX) : w ∈ W} on HX and a stochastic matrix α^B_{Y|W} : W → Y for each B = {By ∈ P(HX) : y ∈ Y} ∈ B such that

By = Σ_{w∈W} Gw α^B_{Y|W}(y|w) for all y ∈ Y and all B ∈ B.   (4)

In this case, we say B is compatible with fine-graining (W, G, αY|W) ≜ (W, G, α^B_{Y|W} : B ∈ B).

In other words, a compatible collection of POVMs can be applied by first applying a single POVM to a quantum state, then applying a noisy classical channel to the output of this POVM. The classical channel is unique to each individual POVM in the compatible collection.

From (4), it is evident that to obtain the joint statistics of all POVMs in a compatible subcollection B, one can perform the POVM G and pass the outcome through the stochastic matrix Π_{B∈B} α^B_{Y|W}. This suggests that we partition the POVM concept class C into compatible subcollections and 're-use' training samples amongst POVMs within a compatible subcollection. Before we state this, we identify an important example of a compatible subcollection.

Remark 2. A subcollection B ⊆ MX→Y of Y−POVMs on HX is compatible if for any pair M = {My : y ∈ Y} ∈ B, L = {Ly : y ∈ Y} ∈ B, we have My Lŷ = Lŷ My for all (y, ŷ) ∈ Y × Y.

While commuting POVMs are compatible, the converse is not necessarily true. In fact, the problem of characterizing compatible POVMs remains active [Jae et al., 2019], [Guerini and Terra Cunha, 2018], [Kunjwal, 2014], [Karthik et al., 2015]. Notwithstanding this, the notion of compatibility provides us with the appropriate construct to modify the naive ERM rule that resulted in a blowup of the sample complexity. We specify the new ERM rule in the following.

3.2 ERM Rule for Learning Quantum Measurement Concept Classes

The idea of the new ERM rule is to partition the POVM concept class C into compatible subcollections. The set of training samples is also partitioned analogously, so that there is a 1 : 1 correspondence between the two partitions. The number ni of samples allotted to a given partition element with index i in the compatible partition is chosen in accordance with our sample complexity upper bound in Theorem 1. That is, we choose ni = (8(B − A)² log(1/δ) + 8B² VC(Bi))/ε², where the VC dimension VC(·) is defined in the subsequent discussion. Each subset of the training samples is employed to evaluate the ER of all POVMs in the corresponding compatible subset. Finally, the POVM with the least ER among all POVMs is the chosen predictor. The following definition enables us to specify the proposed ERM rule precisely.

Defn 4. Let C ⊆ MX→Y. We say (Bi : i ∈ I) is a compatible partition of C if (i) Bi ⊆ C is compatible for each i ∈ I, (ii) Bi ∩ Bî = ∅ whenever î ≠ i, and (iii) ∪_{i∈I} Bi = C. In this case, supposing Bi is compatible with fine-graining (Wi, Gi, α^i_{Y|Wi}) ≜ (Wi, Gi, α^{Bi}_{Y|Wi}) for each i ∈ I, we say BI ≜ (Bi : i ∈ I) is a compatible partition of C with fine-graining (Wi, Gi, α^i_{Y|Wi}) : i ∈ I. Furthermore, we let B denote the set of all compatible partitions of the POVM concept class C ⊆ MX→Y.
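Eq. (4) can be verified numerically for a small compatible family. The sketch below builds two qubit POVMs as fine-grainings of a single sharp Z-basis measurement G; the stochastic matrices alpha are assumed values chosen only for illustration.

```python
import numpy as np

# Root sharp measurement G in the Z basis, with outcome set W = {0, 1}.
G = {0: np.diag([1.0, 0.0]), 1: np.diag([0.0, 1.0])}

# Two members of a compatible subcollection, each specified by an assumed
# stochastic matrix alpha^B(y|w); alpha[w][y] is the probability of label y
# given root outcome w, so each row sums to one.
alpha_M = {0: [0.9, 0.1], 1: [0.2, 0.8]}
alpha_L = {0: [0.6, 0.4], 1: [0.5, 0.5]}

def fine_grained(alpha):
    # Eq. (4): B_y = sum_w G_w alpha(y|w).
    return {y: sum(alpha[w][y] * G[w] for w in G) for y in (0, 1)}

M, L = fine_grained(alpha_M), fine_grained(alpha_L)
for B in (M, L):
    assert np.allclose(sum(B.values()), np.eye(2))   # each is a valid POVM
# These particular POVMs are diagonal, hence commuting (cf. Remark 2).
assert np.allclose(M[0] @ L[1], L[1] @ M[0])
```

Sampling w once from G and pushing it through both alpha matrices yields jointly distributed outcomes for M and L from a single measurement, which is exactly the sample 're-use' the modified ERM rule exploits.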

We give a finite example below.

Ex. 3. Suppose Y = {0, 1} and C = {M1, · · · , M5} contains 5 POVMs. Suppose B1 = {M1, M2} and B2 = {M3, M4, M5} form a compatible partition of C (Defn. 4). Specifically, suppose B1 is compatible with fine-graining (W1, G1, α^{M1}_{Y|W1}, α^{M2}_{Y|W1}) and B2 is compatible with fine-graining (W2, G2, α^{M3}_{Y|W2}, α^{M4}_{Y|W2}, α^{M5}_{Y|W2}). In order to obtain outcomes that are statistically indistinguishable from the outcomes of the POVMs in B1, one can perform the W1−POVM G1 on the quantum state and post-process the outcome w1 ∈ W1 by passing it through the product stochastic matrix

α¹_{Y|W1}(y1, y2|w1) ≜ α^{B1}_{Y|W1}(y1, y2|w1)   (5)
 ≜ α^{M1}_{Y|W1}(y1|w1) α^{M2}_{Y|W1}(y2|w1),   (6)

where (w1, y1, y2) ∈ W1 × Y².

Algorithm 1: ERM-Quantum (ERM-Q) Algorithm
Input: POVM concept class C and n training samples, with (y1, · · · , yn) denoting the labels.
Output: Index of the selected predictor in C.
1: Identify a compatible partition BI ≜ (Bi : i ∈ I) of C with fine-graining (Wi, Gi, α^i_{Y|Wi}) : i ∈ I and let t = |I|. Henceforth, let α^i_{Y|Wi} = Π_{B∈Bi} α^B_{Y|Wi} be the product of the stochastic matrices.
2: Partition the n training samples into t = |I| subsets, with the i−th subset containing ni training samples. Let (y^(i)_j : 1 ≤ j ≤ ni) denote the labels of the samples in the i−th subset. ni is determined in relation to VC(Bi) appearing in (9).
3: for i = 1 to t do
4:   Perform the Wi−POVM Gi on each of the ni samples. For j = 1, · · · , ni, let w^(i)_j ∈ Wi denote the outcome of the POVM on the j−th quantum state.
5:   Post-process (w^(i)_j : 1 ≤ j ≤ ni) by passing each component through the stochastic matrix α^i_{Y|Wi}. For POVM B ∈ Bi ⊆ C, let ŷ^B_j denote the post-processed outcome corresponding to sample j.
6:   Compute the ER ER(B, Bi) ≜ (1/ni) Σ_{j=1}^{ni} l(ŷ^B_j, y^(i)_j) of each POVM B ∈ Bi ⊆ C.
7:   Let B*_i = arg min_{B∈Bi} ER(B, Bi).
8: return the index of B*_{i*}, where i* = arg min_{i∈I} ER(B*_i, Bi) globally minimizes the ER.

Ex. 3 identifies a compatible partition with two elements, i.e., I = {1, 2} and t = 2. The product stochastic matrix for i = 1 is explicitly stated in the example. The outcomes (w^(i)_j : 1 ≤ j ≤ ni) are the ni outcomes one obtains on performing POVM Gi on the ni quantum states. Each of these is passed through the product stochastic matrix α^i_{Y|Wi} = α^{Bi}_{Y|Wi}. We have denoted the resulting collection of outcomes (ŷ^B_j : B ∈ Bi) : 1 ≤ j ≤ ni to distinguish it from the labels provided as part of the training data. These were denoted y1, y2 in Ex. 3. The remaining steps facilitate identifying the ERM POVM in each partition element, followed by identifying the global ERM POVM, which is returned.

Remark 3. If B = {Mj : 1 ≤ j ≤ J} consists of commuting Y−POVMs Mj = {Mjy ∈ P(H) : y ∈ Y} : 1 ≤ j ≤ J, then B is compatible with fine-graining (Y^J, Π_{j=1}^J Mj, 1_{Yj|Y}), where Y^J = ×_{j=1}^J Yj is the Cartesian product, Π_{j=1}^J Mj ≜ {M1y1 M2y2 · · · MJyJ ∈ P(H) : (y1, · · · , yJ) ∈ Y^J} is the product POVM and 1_{Yj|Y}(y|y1, · · · , yJ) = 1{y=yj} is the coordinate function.

Remark 4. For a set X, consider the Hilbert space HX = span{|x⟩ : x ∈ X} with ⟨x̂|x⟩ = δx̂x. The set of all stochastic matrices on X (which subsumes functions) forms a commuting set of operators on HX. Hence the classical PAC learning problem reduces to a POVM concept class in which all operators commute. ERM-Q therefore reduces to the classical ERM with one commuting subset of POVMs.

Remark 5. While the ERM-Q algorithm 're-uses' samples to obtain the ER of POVMs within a compatible subset, it makes no attempt to 're-use' samples across subsets of a compatible partition. Using {ρx : x ∈ X} and the obtained outcomes one can, in theory, characterize all possible collapsed states and 're-use' samples across partition elements to glean partial information on the ER of incompatible POVMs. Since POVMs obfuscate phase information of the collapsed state, we have not attempted to explore this direction in this first step. Our sample complexity bound for ERM-Q is therefore only an upper bound.

4 AN UPPER BOUND ON THE SAMPLE COMPLEXITY OF POVM CONCEPT CLASSES

We derive an upper bound on the sample complexity of the ERM-Q algorithm. As noted before, in contrast to classical PAC, there are two points of departure - non-commutativity and random outcomes. The former led us to partition C. The latter forces us to deal with randomness in outcome labels.
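The measure-once-then-post-process loop of Algorithm 1 can be sketched for a single compatible subset as follows. The root POVM G, the channels alpha and the training states below are all assumed toy values (one subset, two candidate POVMs, 0-1 loss); this is an illustration of the sampling mechanics, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
G = {0: np.diag([1.0, 0.0]), 1: np.diag([0.0, 1.0])}  # root W-POVM, W = {0, 1}
# Candidate POVMs in the subset, given by their channels alpha^B[w] = p(y|w);
# "B1" is informative, "B2" is an uninformative coin flip.
alphas = {"B1": {0: [0.95, 0.05], 1: [0.1, 0.9]},
          "B2": {0: [0.5, 0.5], 1: [0.5, 0.5]}}

# Hypothetical training samples: (rho_x, label) pairs.
samples = [(np.diag([1.0, 0.0]), 0), (np.diag([0.0, 1.0]), 1)] * 50

emp_risk = {name: 0.0 for name in alphas}
for rho, y in samples:
    # Step 4: measure the sample once with G (Born rule probabilities).
    probs = [np.trace(rho @ G[w]).real for w in G]
    w = rng.choice(len(probs), p=probs)
    # Step 5-6: re-use the single outcome w for every POVM in the subset.
    for name, alpha in alphas.items():
        yhat = rng.choice(2, p=alpha[w])
        emp_risk[name] += float(yhat != y) / len(samples)

# Steps 7-8: select the empirical risk minimizer within the subset.
best = min(emp_risk, key=emp_risk.get)
assert best == "B1"   # the informative POVM wins with high probability
```

Note that each sample is measured exactly once with the root POVM; the per-POVM predictions come entirely from classical post-processing of w, which is the sample 're-use' that avoids post-measurement collapse.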

deal with randomness in outcome labels. We provide Remark 6. From the definition of reachability and
several steps of the proof in Sec. 4.2. To aid clarity θr (B, x), it is evident that the VC dimension of POVM
and readability, probability of events are expressed as classes is, owing to its indeterminsim, quite high. In-
expectations of corresponding indicator RVs, which is deed, any set of outcome labels that get a non-zero
further expressed as a summation. E.g., if X, Y, Z has probability of occurrence, no matter how small, is
a distribution
P pXY pZ|X , then P (l(Y, Z) > a) is written reachable. This increases the sample complexity of
as x,y,z pXY (x, y)pZ|X (z|x)1{l(y,z)>a} . This is done POVM classes. Quantum indeterminism is unfortu-
only to permit an uncluttered integration of quantum nately unavoidable. Additionally, our bound is finite
and classical probability. We emphasize that none of the arguments rely on summations confined to finite ranges or suprema being maxima.

4.1 VC Dimension of Compatible POVMs and Sample Complexity of ERM-Q

The divide-and-conquer approach of ERM-Q is naturally reflected in its sample complexity. Below, we define a notion of VC dimension of a compatible concept class and use it to upper bound the sample complexity of ERM-Q.

Defn 5. Let B = {Bk : k ∈ K} be a compatible partition with fine-graining (W, G ≜ GW, αk ≜ α^k_{Y|W} : k ∈ K), where G = {Gw ∈ P(H) : w ∈ W}. Let

    βk(yk|x) = Σ_{w∈W} tr(ρx Gw) αk(yk|w)    (7)

and

    βK(yk : k ∈ K | x) = Σ_{w∈W} tr(ρx Gw) Π_{k∈K} αk(yk|w).    (8)

For x ∈ X, we say (yk : k ∈ K) is reachable from x if βK(yk : k ∈ K | x) > 0. Let r ∈ N and let y ≜ (y^1, · · · , y^r) ∈ Y^{|K|} × · · · × Y^{|K|}, where y^i = (y^i_k : k ∈ K) for i ∈ [r]. We say y is reachable from x = (x1, · · · , xr) ∈ X^r if Π_{i=1}^{r} βK(y^i | xi) > 0. Let θr(B, x) ≜ {y ∈ Y^{|K|r} : y is reachable from x}. We say B shatters x ∈ X^r if |θr(B, x)| = |Y|^r. The VC dimension of B, denoted VC(B), is the maximal d ∈ N for which there exists a set x ∈ X^d that is shattered by B. Finally, we let Sr(B) ≜ max_{x∈X^r} |θr(B, x)|.

This leads to the following theorem.

Theorem 1. The sample complexity N : [A, B] × (0, 1) → N of the ERM-Q algorithm in learning C is bounded by

    N(ε, δ) ≤ min_{(B1,··· ,B|I|) ∈ B}  8( (B − A)² log(|I|/δ) + B² Σ_{i=1}^{|I|} VC(Bi) ) / ε²,    (9)

where the minimum is over the collection B of all compatible partitions of C as defined in Defn. 4. Note that the bound above is finite only when a finite-cardinality compatible partition of the hypothesis class exists. This condition naturally holds in certain use cases: e.g., where an experimenter has access to finitely many "root" measurement devices (POVMs), and they form an infinite-cardinality concept class by passing the results of these POVMs through infinitely many classical channels. In this scenario, the natural partition is immediate. It is likely that our results can be further extended to the case where we only require a looser version of compatibility among POVMs in a partition element. Since the space of POVMs of a given dimension is compact, finitude of the partition may then be guaranteed. However, the problem of actually specifying a partition for which our bound is minimized can likely only be solved under further assumptions on the hypothesis class.

Our sample complexity bound is a nontrivial extension of the techniques of classical learning theory to the quantum framework, an extension that is necessary for showing finite sample complexity bounds for infinite-cardinality learnable hypothesis classes. Our upper bound is likely not tight for a wide variety of POVM classes. Thus, it is of interest to further improve our upper bound and to provide sample complexity lower bounds that match upper bounds. Ultimately, the goal is to give necessary and sufficient conditions for learnability of POVM classes, and the present work is a first step in this direction.

4.2 Proof of Theorem 1: Outline

To indicate the general direction of the proof, we identify here two main terms T1 and T2 that need to be bounded from above in order to derive sample complexity. Upper bounds on the two main terms are derived in the supplementary material.

Preliminaries and Notation: Let B1, · · · , B|I| be a compatible partition of C. Throughout, we focus on proving the uniform convergence property [Shalev-Shwartz and Ben-David, 2014, Defn. 4.3] of one partition element B1, henceforth denoted B. Let B = {M1, · · · , MK} ⊆ C be compatible, with fine graining (W, GW, α_{Yk|W} : 1 ≤ k ≤ K). Henceforth, we let αk = α_{Yk|W} : k ∈ [K] and G = GW. To prevent overloading, we let Z = Y and use z, Zk, zk to refer to the outcome of the POVMs in B on
the X−system of the training sample. Recall that ERM-Q performs the POVM G^n = {Gw1 ⊗ · · · ⊗ Gwn : (w1, · · · , wn) ∈ W^n} on ρX^{⊗n}, where ρX = ∫ ρx dpX = Σ_{x∈X} pX(x) ρx with pX unknown. Each of the n outcomes Wi : i ∈ [n] (RVs) is passed through the stochastic matrix (α[K](z|w) = α[K](z1, · · · , zK|w) : (z, w) = (z1, · · · , zK, w) ∈ Z^K × W), where^5

    α[K](z1, · · · , zK | w) = Π_{k=1}^{K} αk(zk|w)    (10)

for (z1, · · · , zK, w) ∈ Z^K × W. Provided with quantum states corresponding to features (x1, · · · , xn), it can be verified that the K random labels Z^i ≜ (Z1i, Z2i, · · · , ZKi) output by ERM-Q corresponding to state ρxi have distribution

    β[K](z^i | xi) ≜ β[K](z1i, · · · , zKi | xi) = Σ_{w∈W} tr(ρxi Gw) α[K](z1i, · · · , zKi | w),    (11)

where (w, z^i) ∈ W × Z^K. Moreover, the n vectors Z ≜ (Z^i : i ∈ [n]) are mutually independent, and we let β[K]^n(z|x) ≜ Π_{i=1}^{n} β[K](z^i | xi) specify the distribution of the outcome label RVs Z. Moreover, from (10), (11), the marginal of Zki conditioned on feature xi is βk(zki|xi) ≜ Σ_{w∈W} tr(ρxi Gw) αk(zki|w). Using all of this, it can be verified that the uniform convergence property [Shalev-Shwartz and Ben-David, 2014, Defn. 4.3] demands that we characterize N(ε, δ) ∈ N for which

    sup_{pXY} Σ_{x∈X^n} Σ_{y∈Y^n} Σ_{z^1∈Z^K} · · · Σ_{z^n∈Z^K} pXY^n(x, y) β[K]^n(z|x) 1{φ(x,y,z) > ε} ≤ δ,    (12)

where φ(·) is the maximal deviation of the empirical risk from the true risk, over all partition element indices k. Since 1{φ(x,y,z) > ε} ≤ 1{φ(x,y,z) − E{φ(X,Y,Z)} > ε/2} + 1{E{φ(X,Y,Z)} > ε/2}, it suffices to derive conditions under which (i) T1 ≤ δ and (ii) T2 ≜ E{φ(X, Y, Z)} ≤ ε/2, where we define

    T1 = Σ_{x,y,z} pXY^n(x, y) β[K]^n(z|x) 1{φ(x,y,z) − E{φ(X,Y,Z)} > ε/2},    (13)

have suppressed the summation ranges, and let z = (z^1, · · · , z^n) ∈ Z^{Kn}.

Outline of Key Steps: The crux of the proof lies in deriving upper bounds on T1 and T2. The main tool in bounding T1 is the Bounded Difference Inequality (BDI) [Stephane Boucheron and Massart, 2013, Thm. 6.2]. The bound on T2 is derived using the VC inequality and Massart's lemma. Specifically, the ghost sample technique followed by the symmetrization technique leads us to the Rademacher complexity of the POVM concept class. Finally, using Massart's lemma, we derive a bound on T2. A complete proof is given in the supplementary material.

5 UNIVERSAL CONSISTENCY OF POVM CLASSES

We now identify specific sequences of POVM concept classes that are universally consistent [Devroye et al., 1996, Chap. 6]. We identify natural universally consistent sequences of POVM concept classes and also discuss techniques for identifying other universally consistent sequences. Ours being the first work in this theoretical line of study, we do not stress the challenges of designing such classes. We begin by defining universal consistency and introducing a sequence of POVM concept classes of interest.

Defn 6. Consider a distribution pXY ∈ P(X × Y), a collection (ρx ∈ D(H) : x ∈ X) of density operators, the density operator of the feature-label quantum state ρXY ≜ ∫_{X×Y} ρx ⊗ |y⟩⟨y| dpXY ∈ D(HZ) and a loss function l : Y² → [A, B]. We let lp* ≜ inf_{M∈MX→Y} lp(M) denote the Bayes' risk.^6 A sequence Ck ⊆ MX→Y : k ≥ 1 of POVM concept classes is universally consistent if lim_{k→∞} lp(Ck) = lp* for every pXY ∈ P(X × Y), where for any B ⊆ MX→Y, we define lp(B) ≜ inf_{M∈B} lp(M).

Universally consistent POVM concept classes enable us to squeeze the approximation error^7 to 0. Specifically, suppose a sequence Mk ⊆ MX→Y : k ≥ 1 is universally consistent. If we can find a sequence k(n) ∈ N for n ∈ N such that lim_{n→∞} k(n) = ∞, then the loss of this sequence achieves the Bayes' risk: lim_{n→∞} lp(Mk(n)) = lp*. It is therefore of interest to find a sequence k(n) : n ≥ 1 that grows slowly enough that the estimation error also goes to zero. Next, we design a sequence of POVM concept classes that are universally consistent.

5 The reader may recall Ex. 3.
6 lp(M) is defined in (2). We have done away with the subscripts XY in pXY while denoting lp* and lp(M) to reduce notational clutter.
7 See [Shalev-Shwartz and Ben-David, 2014, Sec. 5.2] for definitions of approximation and estimation errors.
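As a concrete illustration of the distributions in (10) and (11): the root POVM G fixes an outcome law tr(ρx Gw) over W, and the classical channels αk act on that law. The sketch below is a hypothetical qubit example of ours (the state, effects, and channel are illustrative assumptions, not objects from the paper).

```python
import numpy as np

def induced_label_distribution(rho, G, alpha):
    """beta(z|x) = sum_w tr(rho_x G_w) alpha(z|w), as in (7) and (11).
    rho: (d, d) density matrix; G: list of (d, d) POVM effects indexed by w;
    alpha: (|W|, |Z|) row-stochastic matrix (classical post-processing)."""
    p_w = np.array([np.trace(rho @ G_w).real for G_w in G])  # outcome law of the root POVM
    return p_w @ alpha  # mix the channel rows by the quantum outcome law

# Hypothetical example: a qubit state measured in the computational basis,
# post-processed by a binary symmetric channel with flip probability 0.1.
rho = np.array([[0.75, 0.25], [0.25, 0.25]])
G = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
alpha = np.array([[0.9, 0.1], [0.1, 0.9]])
beta = induced_label_distribution(rho, G, alpha)  # array([0.7, 0.3])
```

The output is a valid probability distribution for any density matrix, POVM, and row-stochastic channel, mirroring the claim that (11) specifies the label distribution of ERM-Q's outcomes.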
5.1 Optimal state discrimination yields universal consistency

We recall that the learning algorithm is ignorant of the collection (ρx ∈ D(H) : x ∈ X) of density operators and is only provided the prepared quantum states and the corresponding labels. Our approach to designing universally consistent POVM concept classes is based on optimal state discrimination. Suppose the collection (ρx ∈ D(H) : x ∈ X) were perfectly distinguishable via a measurement G ∈ MHX→X. Here MHX→X denotes the set of all X−POVMs on HX, and G = {Gx : x ∈ X} has |X| positive operators that sum to the identity on HX and satisfy tr(Gx ρx̂) = δxx̂. Then any universally consistent sequence of function concept classes can be adapted to form a universally consistent sequence of POVM concept classes. Our approach when the collection (ρx ∈ D(H) : x ∈ X) is not perfectly distinguishable is to employ an optimal state discriminator G ∈ MHX→X for the collection (ρx ∈ D(H) : x ∈ X).

To find the optimal state discriminator G ∈ MHX→X for the collection (ρx ∈ D(H) : x ∈ X), we turn to [Holevo, 2019, Sec. 2.4], wherein it is proven that the problem of identifying an optimal state discriminator is one of maximizing an affine function over a convex set of observables. We next state the necessary facts.

Fact 1. Given a Hilbert space HX of finite dimension and a collection (ρx ∈ D(H) : x ∈ X) of density operators, there exists an optimal state discriminator G = {Gx ∈ P(H) : x ∈ X} ∈ MHX→X such that the probability of error in identification is minimized. Moreover, identifying an optimal state discriminator G is equivalent to maximizing an affine function over a convex set of observables.

Fact 1 is stated in [Holevo, 2019, Sec. 2.4, Thm. 2.22]. Additionally, [Holevo, 2019, Ex. 2.27] proves that optimal quantum state discrimination is facilitated via unsharp observables - a phenomenon not observed in classical discrimination. While an analytical characterization of an optimal state discriminator G might be available only for specific structured collections (ρx ∈ D(H) : x ∈ X) of density operators, and not in general, the fact that G can be identified computationally efficiently via convex programming leads us to use this tool in our identification of a universally consistent sequence of POVM concept classes.

The second fact gives a universally consistent sequence of function concept classes.

Fact 2. Let X be an arbitrary domain set with a distance metric d(·, ·) and Y be a finite set of labels. Let l : Y² → [A, B] be a loss function and FX→Y = {f : X → Y} be the set of all functions with domain X and range Y. For a distribution pXY ∈ P(X × Y), a function f ∈ FX→Y and any subcollection H ⊆ FX→Y, we let

    lp(f) ≜ ∫_{X×Y} l(f(x), y) dpXY(x, y),    lp(H) ≜ inf_{f∈H} lp(f),    (14)

and let lp* ≜ inf_{f∈FX→Y} lp(f) denote the Bayes' risk. For k ∈ N, let {Ak,j ⊆ X : j ∈ N} be a partition of X such that lim_{k→∞} sup_{j∈N} diam(Ak,j) = 0, where diam(B) ≜ sup_{c,d∈B} d(c, d) denotes the diameter of a set B ⊆ X. For k ∈ N, let

    Hk = {h ∈ FX→Y : h(a) = h(b) ⟺ ∃j ∈ N, a, b ∈ Ak,j}.    (15)

Then Hk : k ∈ N is a universally consistent sequence of function concept classes, i.e., lim_{k→∞} lp(Hk) = lp* for each distribution p ∈ P(X × Y).

Fact 2 is a direct consequence of [Stephane Boucheron and Massart, 2013, Thm. 6.1]. We are now set to define a sequence of POVM concept classes that are universally consistent.

Defn 7. Let X be a domain set, Y be a set of labels, HX be a Hilbert space, (ρx ∈ D(H) : x ∈ X) be density operators of the quantum states encoding the features, G = {Gx : x ∈ X} ∈ MHX→X be an optimal state discriminator for the collection (ρx ∈ D(H) : x ∈ X) as defined in Fact 1, and Hk ⊆ FX→Y : k ≥ 1 be a universally consistent sequence of function concept classes as stated in Fact 2. For k ∈ N, h ∈ Hk, we let

    Mk_h = { Mk_{h,y} = Σ_{x∈X : h(x)=y} Gx : y ∈ Y }    (17)

and Mk = {Mk_h : h ∈ Hk}.

It is straightforward to verify Mk_h ∈ MX→Y and hence Mk ⊆ MX→Y is a POVM concept class. Indeed, Σ_{y∈Y} Mk_{h,y} = Σ_{y∈Y} Σ_{x∈X : h(x)=y} Gx = Σ_{x∈X} Gx = I_{HX}, the identity on HX. We now assert that Mk : k ≥ 1 is a sequence of POVM concept classes that are universally consistent.

Theorem 2. The sequence Mk : k ≥ 1 of POVM concept classes is universally consistent.

6 CONCLUSION, FUTURE WORK

We studied the problem of learning from quantum data, entailing a graduation from function classes to measurement concept classes. Next, the challenges posed by quantum effects forced us to modify the foundational algorithm (ERM) of statistical learning theory and unravel its sample complexity. We conclude by identifying the related combinatorial optimization problem of identifying an optimal stochastic compatible partition of the concept class as a problem of independent interest.
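Returning to the construction in (17): it sums the effects of the state discriminator G over the level sets of a hypothesis h, and the completeness identity Σy Mk_{h,y} = I_{HX} verified above Theorem 2 is a direct consequence. A minimal sketch of this step (the qutrit basis measurement and the two-label hypothesis are hypothetical examples of ours, not from the paper):

```python
import numpy as np

def povm_from_hypothesis(G, labels, h):
    """M_{h,y} = sum over {x : h(x) = y} of G_x, as in (17)."""
    d = G[0].shape[0]
    M = {y: np.zeros((d, d)) for y in labels}
    for x, G_x in enumerate(G):
        M[h(x)] += G_x
    return M

# Hypothetical root POVM on a qutrit: the computational-basis measurement.
G = [np.diag([1.0, 0.0, 0.0]), np.diag([0.0, 1.0, 0.0]), np.diag([0.0, 0.0, 1.0])]
# A hypothesis that merges features 0 and 1 into label 0 and maps 2 to label 1.
M = povm_from_hypothesis(G, labels=[0, 1], h=lambda x: 0 if x < 2 else 1)
completeness = sum(M.values())  # should equal the identity on H_X
```

Since the effects of G already sum to the identity, regrouping them by the level sets of any h preserves completeness, which is why each Mk_h is again a valid POVM.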
References

[Aaronson, 2006] Aaronson, S. (2006). The learnability of quantum states. Proceedings of The Royal Society A: Mathematical, Physical and Engineering Sciences, 463.

[Aaronson, 2018] Aaronson, S. (2018). Shadow tomography of quantum states. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 325–338, New York, NY, USA. Association for Computing Machinery.

[Aaronson et al., 2018] Aaronson, S., Chen, X., Hazan, E., and Kale, S. (2018). Online learning of quantum states. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 8976–8986, Red Hook, NY, USA. Curran Associates Inc.

[Altepeter et al., 2004] Altepeter, J., James, D., and Kwiat, P. (2004). Qubit quantum state tomography. Lecture Notes in Physics, 649:113–145.

[Anguita et al., 2003] Anguita, D., Ridella, S., Rivieccio, F., and Zunino, R. (2003). Quantum optimization for training support vector machines. Neural Networks, 16(5):763–770. Advances in Neural Networks Research: IJCNN '03.

[Anshu et al., 2021] Anshu, A., Arunachalam, S., Kuwahara, T., and Soleimanifar, M. (2021). Sample-efficient learning of interacting quantum systems. Nature Physics, pages 1745–2481.

[Arunachalam and de Wolf, 2017] Arunachalam, S. and de Wolf, R. (2017). A survey of quantum learning theory. arXiv:1701.06806.

[Arunachalam et al., 2020] Arunachalam, S., Grilo, A. B., and Yuen, H. (2020). Quantum statistical query learning.

[Atici and Servedio, 2005] Atici, A. and Servedio, R. A. (2005). Improved bounds on quantum learning algorithms. Quantum Information Processing, 4(5):355–386.

[Aïmeur et al., 2012] Aïmeur, E., Brassard, G., and Gambs, S. (2012). Quantum speed-up for unsupervised learning. Machine Learning, 90(2):261–287.

[Bae and Kwek, 2015] Bae, J. and Kwek, L.-C. (2015). Quantum state discrimination and its applications. Journal of Physics A: Mathematical and Theoretical, 48(8):083001.

[Bubeck et al., 2020] Bubeck, S., Chen, S., and Li, J. (2020). Entanglement is necessary for optimal quantum property testing.

[Cheng et al., 2016] Cheng, H.-C., Hsieh, M.-H., and Yeh, P.-C. (2016). The learnability of unknown quantum measurements. Quantum Info. Comput., 16(7–8):615–656.

[Devroye et al., 1996] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer.

[Gortler and Servedio, 2001] Gortler, S. J. and Servedio, R. A. (2001). Quantum versus classical learnability. In Proceedings of the Sixteenth Annual IEEE Conference on Computational Complexity, page 0138, Los Alamitos, CA, USA. IEEE Computer Society.

[Guerini and Terra Cunha, 2018] Guerini, L. and Terra Cunha, M. (2018). Uniqueness of the joint measurement and the structure of the set of compatible quantum measurements. Journal of Mathematical Physics, 59(4):042106.

[Haah et al., 2016] Haah, J., Harrow, A. W., Ji, Z., Wu, X., and Yu, N. (2016). Sample-optimal tomography of quantum states. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pages 913–925, New York, NY, USA. Association for Computing Machinery.

[Heidari et al., 2021] Heidari, M., Padakandla, A., and Szpankowski, W. (2021). A theoretical framework for learning from quantum data. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1469–1474.

[Holevo, 2019] Holevo, A. S. (2019). Quantum Systems, Channels, Information. De Gruyter, 2nd edition.

[Jae et al., 2019] Jae, J., Baek, K., Ryu, J., and Lee, J. (2019). Necessary and sufficient condition for joint measurability. Phys. Rev. A, 100:032113.

[Karthik et al., 2015] Karthik, H. S., Devi, A. R. U., and Rajagopal, A. K. (2015). Unsharp measurements and joint measurability. Current Science, 109(11):2061–2068.

[Kearns and Schapire, 1990] Kearns, M. J. and Schapire, R. E. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science, SFCS '90, pages 382–391 vol. 1, USA. IEEE Computer Society.

[Khan and Robles-Kelly, 2020] Khan, T. M. and Robles-Kelly, A. (2020). Machine learning: Quantum vs classical. IEEE Access, 8:219275–219294.
[Kunjwal, 2014] Kunjwal, R. (2014). A note on the joint measurability of POVMs and its implications for contextuality.

[Lloyd et al., 2014] Lloyd, S., Mohseni, M., and Rebentrost, P. (2014). Quantum principal component analysis. Nature Physics, 10(9):631–633.

[Montanaro and de Wolf, 2018] Montanaro, A. and de Wolf, R. (2018). A survey of quantum property testing.

[O'Donnell and Wright, 2016] O'Donnell, R. and Wright, J. (2016). Efficient quantum tomography. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pages 899–912, New York, NY, USA. Association for Computing Machinery.

[O'Donnell and Wright, 2017] O'Donnell, R. and Wright, J. (2017). Efficient quantum tomography II. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 962–974, New York, NY, USA. Association for Computing Machinery.

[Rehacek and Paris, 2004] Rehacek, J. and Paris, M. (2004). Quantum State Estimation. Lecture Notes in Physics. Springer, Berlin.

[Schuld et al., 2014] Schuld, M., Sinayskiy, I., and Petruccione, F. (2014). Quantum computing for pattern classification. In Pham, D.-N. and Park, S.-B., editors, PRICAI 2014: Trends in Artificial Intelligence, pages 208–220, Cham. Springer International Publishing.

[Servedio, 2001] Servedio, R. A. (2001). Separating quantum and classical learning. In Proceedings of the 28th International Colloquium on Automata, Languages and Programming, ICALP '01, pages 1065–1080, Berlin, Heidelberg. Springer-Verlag.

[Shalev-Shwartz and Ben-David, 2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA.

[Stephane Boucheron and Massart, 2013] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities. Oxford University Press, Oxford, United Kingdom.

[Valiant, 1984] Valiant, L. G. (1984). A theory of the learnable. Commun. ACM, 27(11):1134–1142.

[Vapnik, 1982] Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg.

[Vapnik and Chervonenkis, 1974] Vapnik, V. and Chervonenkis, A. (1974). Theory of Pattern Recognition [in Russian]. Nauka, Moscow. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).

[Vapnik, 1995] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg.

[Vapnik, 1998] Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.

[Vapnik and Chervonenkis, 1971] Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280.

[Waseem et al., 2020] Waseem, M. H., e Ilahi, F., and Anwar, M. S. (2020). Quantum state tomography. In Quantum Mechanics in the Single Photon Laboratory, 2053-2563, pages 6-1 to 6-20. IOP Publishing.

[Wiebe et al., 2012] Wiebe, N., Braun, D., and Lloyd, S. (2012). Quantum algorithm for data fitting. Phys. Rev. Lett., 109:050505.

[Wilde, 2017] Wilde, M. (2017). Quantum Information Theory. Cambridge University Press, Cambridge, UK.

[Wootters and Zurek, 1982] Wootters, W. K. and Zurek, W. H. (1982). A single quantum cannot be cloned. Nature, 299(5886):802–803.
Supplementary Material:
PAC Learning of Quantum Measurement Classes: Sample
Complexity Bounds and Universal Consistency

A A COMPLETE PROOF OF THEOREM 1

In addition to deriving upper bounds on T1 and T2, we fill in missing steps and thereby provide a complete proof of Thm. 1. We refer the reader to Sec. 4.2 to recall the notation and tools introduced therein. The reader may note that we had stated in Sec. 4.2 the sufficiency of the uniform convergence property [Shalev-Shwartz and Ben-David, 2014, Definition 4.3]. Here, we provide a proof of this sufficiency tailored to the current scenario involving compatible partitions.

A.1 Uniform Convergence Property Suffices

As stated in Sec. 4.2, let B1, · · · , B|I| be a compatible partition of C. Suppose Bj = {Mj1, · · · , MjK} is compatible with fine graining (Wj, Gj = GWj, αk^j = α^{j,k}_{Y|Wj}), where Gj = {Gw^j ∈ P(H) : w ∈ Wj} ∈ MX→Wj.^8 Suppose nj of the n training samples are used to evaluate the ER of POVMs in Bj. Let Z = Y. The ERM-Q algorithm generates RVs Zki^j ∈ Z : 1 ≤ k ≤ K, 1 ≤ i ≤ nj with distribution

    ( Σ_{w∈Wj} tr(Gw^j ρxi) αk^j(z|w) : z ∈ Z = Y ).

For j = 1, · · · , |I|, ERM-Q identifies

    kj* ≜ arg min_{1≤k≤K} (1/nj) Σ_{i=1}^{nj} l(yi, Zki^j),

the best POVM in compatible partition element Bj, and declares the index of arg min_{1≤j≤|I|} kj* as the chosen predictor. Suppose, for every j = 1, · · · , |I| and every 1 ≤ k ≤ K, we have

    | (1/nj) Σ_{i=1}^{nj} l(yi, Zki^j) − Σ_{(x,y,z)∈X×Y×Z} pXY(x, y) Σ_{w∈Wj} tr(Gw^j ρx) αk^j(z|w) l(y, z) | ≤ ε/4    (18)

whenever (xi, yi) : 1 ≤ i ≤ nj are drawn according to pXY; note that the second term within the modulus above is lp(Mj,k). Then, by the definition of the infimum, we can choose a POVM whose true risk is within ε/4 of the infimum and thereby guarantee that the true risk of the index arg min_{1≤j≤|I|} kj* is indeed within ε/2 of the true risk l*(pXY, C). We have thus argued that proving that each compatible partition element B1, · · · , B|I| possesses the uniform convergence property is sufficient for proving PAC learnability of the ERM-Q algorithm. In the following two sections we derive bounds on T1 and T2 defined in (13).

A.2 An upper bound on the first Term T1 in (13)

Recall

    T1 = Σ_{x,y,z} pXY^n(x, y) β[K]^n(z|x) 1{φ(x,y,z) − E{φ(X,Y,Z)} > ε/2}.    (22)

Henceforth, we include a subscript, focus on the one-sided term and let

    φ+_{pβ}(x, y, z) ≜ sup_{1≤k≤K} [ (1/n) Σ_{i=1}^{n} l(yi, zki) − Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) ].    (24)

8 We could just choose K = max_{1≤j≤|I|} |Bj| and populate the smaller compatible partitions with trivial (Identity) POVMs.
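The selection rule kj* above is an ordinary empirical-risk argmin over the K POVMs of one partition element, applied to the sampled labels Zki^j. A schematic version (the loss realizations below are hypothetical, generated only for illustration):

```python
import numpy as np

def erm_select(losses):
    """losses[k, i] = l(y_i, Z_ki): loss of POVM k on sample i.
    Returns k* minimizing the empirical risk (1/n_j) sum_i l(y_i, Z_ki)."""
    risks = losses.mean(axis=1)
    return int(risks.argmin()), risks

rng = np.random.default_rng(0)
# Hypothetical 0/1 losses for K = 3 POVMs on n_j = 20 samples,
# with per-POVM error rates 0.4, 0.1 and 0.5.
rates = np.array([[0.4], [0.1], [0.5]])
losses = (rng.random((3, 20)) < rates).astype(float)
k_star, risks = erm_select(losses)
```

ERM-Q runs this argmin once per partition element Bj and then compares the |I| winners, which is where the union bound of Sec. A.6 enters.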
With this, the first term T1 is

    T1 = P ◦ β( φ+_{pβ}(X, Y, Z) − E{φ+_{pβ}(X, Y, Z)} > ε/4 ).    (26)

In order to upper bound T1, we use the Bounded Difference Inequality stated in Theorem 3. Choose B = X × Y × Z^K and Bi = (Xi, Yi, Z^i) = (Xi, Yi, Z1i, · · · , ZKi) and f(x, y, z) = φ+_{pβ}(x, y, z). Let x, x̂ ∈ X^n, y, ŷ ∈ Y^n and z, ẑ ∈ Z^{Kn} be such that xt = x̂t, yt = ŷt and z_t = ẑ_t for t ∈ [n] \ {i}. For such a choice, we have

    φ+_{pβ}(x, y, z) − φ+_{pβ}(x̂, ŷ, ẑ)    (28)
      = sup_{1≤k≤K} [ (1/n) Σ_{i=1}^{n} l(yi, zki) − Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) ]
        − sup_{1≤k≤K} [ (1/n) Σ_{i=1}^{n} l(ŷi, ẑki) − Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) ]    (29)
      ≤ sup_{1≤k≤K} { [ (1/n) Σ_{i=1}^{n} l(yi, zki) − Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) ]
        − [ (1/n) Σ_{i=1}^{n} l(ŷi, ẑki) − Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) ] }    (30)
      = sup_{1≤k≤K} (1/n) Σ_{i=1}^{n} [ l(yi, zki) − l(ŷi, ẑki) ] ≤ (B − A)/n,    (31)

where (30) follows from sup_a f1(a) − sup_a f2(a) ≤ sup_a (f1(a) − f2(a)) and (31) follows from the bounds on the loss function l(·, ·). Similarly, we can prove φ+_{pβ}(x̂, ŷ, ẑ) − φ+_{pβ}(x, y, z) ≤ (B − A)/n. This proves that φ+_{pβ}(x, y, z) possesses the bounded difference property, and we have

    P ◦ β( φ+_{pβ}(X, Y, Z) − E{φ+_{pβ}(X, Y, Z)} ≥ ε/2 ) ≤ exp{ −nε² / (2(B − A)²) }.

Analogously defining

    φ−_{pβ}(x, y, z) ≜ sup_{1≤k≤K} [ Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) − (1/n) Σ_{i=1}^{n} l(yi, zki) ]    (32)

and following a similar sequence of steps, one can prove P ◦ β( φ−_{pβ}(X, Y, Z) − E{φ−_{pβ}(X, Y, Z)} ≥ ε/2 ) ≤ exp{ −nε² / (2(B − A)²) }, and we can conclude that

    P ◦ β( φ_{pβ}(X, Y, Z) − E{φ_{pβ}(X, Y, Z)} ≥ ε/2 ) ≤ 2 exp{ −nε² / (4(B − A)²) }.

A.3 Bounding the Second Term T2 in (13) via Rademacher Complexity

We recall T2 ≜ E{φ(X, Y, Z)}, where

    φ(x1, · · · , xn, y1, · · · , yn, z^1, · · · , z^n) ≜ sup_{1≤k≤K} | (1/n) Σ_{i=1}^{n} l(yi, zki) − Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) |.    (33)

In bounding E{φ(X, Y, Z)}, we employ the ghost sample + symmetrization trick [Devroye et al., 1996, Thm. 12.4]. Observe that

    E{φ(X, Y, Z)}
      = Σ_{x,y,z} pXY^n(x, y) β[K]^n(z|x) sup_{k∈[K]} | (1/n) Σ_{i=1}^{n} l(yi, zki) − Σ_{(a,b,c)∈X×Y×Z} l(b, c) pXY(a, b) βk(c|a) |    (37)
      = Σ_{x,y,z} pXY^n(x, y) β[K]^n(z|x) sup_{k∈[K]} | (1/n) Σ_{i=1}^{n} l(yi, zki) − Σ_{a,b,c} pXY^n(a, b) β[K]^n(c|a) (1/n) Σ_{j=1}^{n} l(bj, ckj) |    (38)
      ≤ Σ_{x,y,z} Σ_{a,b,c} pXY^n(a, b) β[K]^n(c|a) pXY^n(x, y) β[K]^n(z|x) sup_{k∈[K]} | (1/n) Σ_{i=1}^{n} l(yi, zki) − (1/n) Σ_{j=1}^{n} l(bj, ckj) |    (41)
      = Σ_{(σ1,···,σn)∈{−1,+1}^n} Σ_{x,y,z} Σ_{a,b,c} [ pXY^n(a, b) β[K]^n(c|a) pXY^n(x, y) β[K]^n(z|x) / 2^n ] sup_{k∈[K]} | (1/n) Σ_{i=1}^{n} σi l(yi, zki) − (1/n) Σ_{j=1}^{n} σj l(bj, ckj) |    (43)
      ≤ 2 Rn(B).    (45)
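For intuition about the quantity that bounds T2, the inner Rademacher average Eσ sup_k (1/n) Σi σi l(yi, zki) can be estimated for one fixed outcome realization by Monte Carlo. The outer averaging over (X, Y, Z) and the POVM randomness β[K](·|·) is omitted, and the loss values below are hypothetical stand-ins of ours:

```python
import numpy as np

def rademacher_average(loss_matrix, trials=2000, seed=1):
    """Estimate E_sigma sup_k (1/n) sum_i sigma_i * loss_matrix[k, i]
    for i.i.d. uniform signs sigma_i in {-1, +1}."""
    rng = np.random.default_rng(seed)
    K, n = loss_matrix.shape
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += (loss_matrix @ sigma / n).max()  # sup over the K POVMs
    return total / trials

# Hypothetical losses l(y_i, z_ki) in [0, 1] for K = 4 POVMs and n = 50 samples.
losses = np.random.default_rng(0).uniform(0.0, 1.0, size=(4, 50))
R_hat = rademacher_average(losses)
```

Consistent with the Massart step that follows, the estimate stays below B·sqrt(2 log K / n), which is roughly 0.24 here with B = 1.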
In the above,

    Rn(B) ≜ Σ_{σ∈{±1}^n} Σ_{x,y,z} [ pXY^n(x, y) β[K]^n(z|x) / 2^n ] sup_{k∈[K]} (1/n) Σ_{i=1}^{n} σi l(yi, zki),    (46)

where (38) follows by introducing the ghost sample in the second term [Devroye et al., 1996, Step 1 in Proof of Thm. 12.7], (41) follows from the chain sup |E{·}| ≤ sup E{|·|} ≤ E{sup |·|} of inequalities, and the next step follows by symmetrization with random signs [Devroye et al., 1996, Step 2 in Proof of Thm. 12.4], finally resulting in the familiar average Rademacher complexity. While these steps are fairly standard, we notice the crucial difference in (45): the averaging includes the POVM randomness β[K](·|·).

A.4 Bounding Rn(B) using Massart's Lemma

Once again, we follow a well-established sequence of steps in statistical learning theory to bound Rn(B). Recognizing that l(y, z) ≤ B, implying ||(l(yi, zi) : i ∈ [n])||₂ ≤ B√n, and that the cardinality of the set {(l(yi, zki) : i ∈ [n]) : k ∈ [K]} is at most |θn(B, (x1, · · · , xn))|, we have

    Rn(B) = E_{XYZ}{ Eσ{ sup_{k∈[K]} (1/n) Σ_{i=1}^{n} σi l(Yi, Zki) } }    (50)
          ≤ E_{XYZ}{ B √(2n log |θn(B, X)|) / n }    (51)
          ≤(i) B √(2n E{log |θn(B, X)|}) / n  ≤(ii) √( 2B² log Sr(B) / n ),    (52)

where the inequality in (51) follows from Massart's lemma provided in Theorem 4, (52(i)) follows from Jensen's inequality applied to the concave √·−function, and (52(ii)) follows from the definition of Sr(B).

A.5 Collating Bounds

We now collate bounds from the first term and (52) and substitute in (13). Recollecting our compatible partition B = {Mk : 1 ≤ k ≤ K} and definitions (10) through (33), our uniform convergence bound for B is

    sup_{pXY} Σ_{x∈X^n} Σ_{y∈Y^n} Σ_{z^1∈Z^K} · · · Σ_{z^n∈Z^K} pXY^n(x, y) β[K]^n(z|x) 1{φ(x,y,z) > η(ε,δ)} ≤ δ,    (53)

where

    η(ε, δ) ≜ √( ( 2(B − A)² log(1/δ) + 8B² log Sr(B) ) / n ).    (56)

A.6 Union Bound on the Compatible Partition

Having derived uniform convergence for one compatible subset B of POVMs, we employ a union bound over the compatible partition B1, · · · , B|I|. Squeezing δ to δ/|I|, we obtain the sample complexity stated in the theorem statement.

B PROOF OF THEOREM 2: SKETCH AND OUTLINE

We first sketch the idea of the proof before fleshing out the details. Let us first understand the case when ρx = |x⟩⟨x| for x ∈ X with ⟨x|x̂⟩ = δxx̂. In essence, all the quantum states are distinguishable, reducing this to the classical problem. For a given pXY ∈ P(X × Y), let R(x → y) ≜ Σ_{ŷ∈Y} pY|X(ŷ|x) l(y, ŷ) denote the risk, i.e., the conditional expectation of the loss, of assigning label y to x. It can be shown that the Bayes' rule for this setting is given by the function f* : X → Y defined as f*(x) = arg min_{y∈Y} R(x → y). Essentially, f* chooses the label that minimizes the expected loss for every x. This implies that lp* = lp(f*). It is straightforward to recognize that, in this simplified case, the sequence Mk : k ≥ 1 defined in (17) is indeed the sequence Hk : k ≥ 1 as defined in (15). Specifically, Mk = Hk for k ∈ N. The proof that Mk : k ≥ 1 forms a universally consistent sequence of POVM concept classes is now a direct consequence of [Stephane Boucheron and Massart, 2013, Thm. 6.1]. We alert the reader here that our claim of universal consistency implies that for every distribution pXY ∈ P(X × Y), we have lim_{k→∞} lp(Mk) = lp*.

What if the quantum states are not distinguishable, i.e., the operators ρx : x ∈ X have overlapping support spaces? We leverage the approach stated in the above
to provide a sketch of the arguments. In this submission, we provide only a sketch of the arguments and follow it up with all the details fleshed out in a subsequent version available publicly. Having said that, we should mention that the sketch provided here has sufficient detail for a reader informed in SLT and quantum theory. For a distribution pXY and a collection (ρx ∈ D(H) : x ∈ X), we let Mp = {My^p : y ∈ Y} ∈ MX→Y denote a Y−POVM for which

    Σ_{ŷ∈Y} tr( My^p ρx ) pY|X(ŷ|x) l(y, ŷ) ≤ Σ_{ŷ∈Y} tr( Mỹ^p ρx ) pY|X(ŷ|x) l(ỹ, ŷ)    (57)

for every (x, y, ỹ) ∈ X × Y × Y. The proof is built on two facts that can be proved based on the arguments presented in the proof of [Stephane Boucheron and Massart, 2013, Thm. 6.1] and the properties of a POVM. Firstly, (57) is a necessary and sufficient condition for a POVM Mp = {My^p : y ∈ Y} to achieve the Bayes' loss for the distribution pXY and a collection (ρx ∈ D(H) : x ∈ X). Secondly, given any pXY ∈ P(X × Y) and any collection (ρx ∈ D(H) : x ∈ X) of density operators, we can identify a sequence Gk ∈ Mk : k ≥ 1 that satisfies the condition in (57) in the limit. This is the approach we take to prove that the sequence Mk : k ≥ 1 of POVM concept classes is universally consistent.

C TOOLS AND PREREQUISITES

C.1 Bounded Difference Inequality

The following can be found in [Stephane Boucheron and Massart, 2013, Section 6.1].

Defn 8. A function f : B^n → R has the bounded difference property if for some non-negative constants c1, · · · , cn, we have

    sup_{b1,··· ,bn, b̂i} | f(b1, · · · , bn) − f(b1, · · · , bi−1, b̂i, bi+1, · · · , bn) | ≤ ci    (60)

for 1 ≤ i ≤ n.

Theorem 3. Suppose f : B^n → R possesses the bounded difference property with constants c1, · · · , cn and let v = (1/4)(c1² + · · · + cn²). Suppose B1, · · · , Bn are independent and A = f(B1, · · · , Bn). Then

    P( A − E{A} > t ) ≤ exp( −t² / (2v) ),    (62)
    P( A − E{A} < −t ) ≤ exp( −t² / (2v) ).

C.2 Massart's Lemma

The facts presented below and a proof of the same can be found in [Shalev-Shwartz and Ben-David, 2014, Chapter 26].

Theorem 4. Let A = {a1, · · · , ar} ⊆ R^m with ai = (a1i, · · · , ami). Define b = (1/r) Σ_{i=1}^{r} ai and let B1, · · · , Bm ∈ {−1, +1} be independent and uniformly distributed. Then

    E_{B1,··· ,Bm}{ sup_{1≤i≤r} (1/m) Σ_{j=1}^{m} Bj aji } ≤ max_{1≤i≤r} ||b − ai||₂ √(2 log r) / m.    (63)
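Theorem 4 can be sanity-checked exactly for small m by enumerating all 2^m sign vectors. The check below is our own (the random vectors are hypothetical inputs); it computes the left-hand side exactly and compares it with the stated bound:

```python
import itertools
import numpy as np

def massart_check(A_rows):
    """Exact E sup_i (1/m) <B, a_i> for uniform B in {-1, +1}^m, versus
    the Theorem 4 bound max_i ||b - a_i||_2 * sqrt(2 log r) / m."""
    A_rows = np.asarray(A_rows, dtype=float)
    r, m = A_rows.shape
    b = A_rows.mean(axis=0)  # the centroid b of the finite class
    lhs = np.mean([(A_rows @ np.array(s) / m).max()
                   for s in itertools.product([-1.0, 1.0], repeat=m)])
    rhs = np.max(np.linalg.norm(A_rows - b, axis=1)) * np.sqrt(2.0 * np.log(r)) / m
    return lhs, rhs

# r = 5 hypothetical vectors in R^8: the exact expectation never exceeds the bound.
lhs, rhs = massart_check(np.random.default_rng(3).uniform(-1.0, 1.0, size=(5, 8)))
```

Because the expectation is computed exactly rather than sampled, the inequality lhs ≤ rhs must hold for any input, matching the statement of Theorem 4.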