state must proceed via a measurement that is applied to the particle and results in a classical outcome that is interpreted as the predicted label.

The task of a quantum learning algorithm in this formulation, given a set of physical realizations of quantum states along with their associated labels, is to identify an "optimal" measurement from a fixed concept class, which now consists of a library of measurements. Measurements are formalized via the generic framework of positive operator-valued measurements (POVMs) [Holevo, 2019, Sec. 2.3]. A concept class thus consists of a set of POVMs.

In this work, we answer the above three questions in the context of POVM concept classes. We then initiate a study of universal consistency [Stephane Boucheron and Massart, 2013] and identify universally consistent sequences of POVM concept classes.

Challenges inherent in our setting: This natural, deceptively simple formulation hides the involved challenges. The axioms of quantum mechanics, which govern all our interaction with quantum data, manifest in three unique properties: non-commutativity, quantum indeterminism and entanglement. Non-commutativity implies that measuring a quantum state causes it to collapse. Post-measurement collapse results in a significant blowup of the sample complexity of the empirical risk minimization (ERM) rule, thus leading us to modify this foundational algorithm of SLT. Leveraging the elegant notion of joint measurability, also referred to as compatibility (Sec. 3.1) herein, we propose a modified ERM rule (Alg. 1 in Sec. 3.2) that prevents this blowup. Next, we quantify the effect of quantum indeterminism in the characterization of the VC dimension of a POVM concept class (Defn. 5). Our analysis (Sec. 4) reveals the functional dependence of the sample complexity (Thm. 1) on this measure of complexity. Our formulation generalizes (Rem. 1) the original PAC formulation [Vapnik and Chervonenkis, 1971, Vapnik and Chervonenkis, 1974, Vapnik, 1982, Valiant, 1984, Vapnik, 1995, Vapnik, 1998] and we recover all corresponding results (Rem. 4). We also generalize the results of [Heidari et al., 2021], which only deals with finite concept classes and a simpler notion of compatibility. In Sec. 5, we use optimal state discrimination to identify sequences of POVM concept classes that are universally consistent (Defn. 7).

Prior Work: Related prior work can be classified broadly into two sets. The first set, which includes this work, revolves around learning a quantum state [Arunachalam and de Wolf, 2017, Arunachalam et al., 2020], or its properties [Anshu et al., 2021], by measuring multiple identically prepared states. This set includes works on state tomography [O'Donnell and Wright, 2016, O'Donnell and Wright, 2017, Haah et al., 2016, Aaronson, 2006, Rehacek and Paris, 2004, Waseem et al., 2020, Altepeter et al., 2004, Aaronson, 2018, Aaronson et al., 2018], state discrimination [Bae and Kwek, 2015] and quantum property testing [Montanaro and de Wolf, 2018, Bubeck et al., 2020] that investigate the number of prepared states needed to accomplish the learning task. We highlight the work of Cheng et al. [Cheng et al., 2016], which studies the related problem when the training samples form the collection (ρ_i, tr(M ρ_i)) : 1 ≤ i ≤ n, where ρ_i denotes the ith quantum state, and the algorithm must find the unknown optimal POVM M.¹ See [Kearns and Schapire, 1990] for a closely related classical study. The related paper [Heidari et al., 2021] defines and studies a similar framework and learning rule to our own but only proves a sample complexity bound for finite POVM classes. Furthermore, it makes much stronger assumptions regarding the mutual compatibility of POVMs in the classes under study. The extension in the present work of statistical learning theory machinery to the quantum setting, necessary for proving finite upper bounds on the sample complexity for learning infinite-cardinality POVM classes, is a key contribution of our work.

¹ Our work differs considerably since we are provided the actual labels, not the associated probabilities tr(M ρ_i) : 1 ≤ i ≤ n. As discussed in [Kearns and Schapire, 1990], this makes the problem substantially more challenging.

The second set is comprised of works [Gortler and Servedio, 2001, Servedio, 2001, Lloyd et al., 2014, Schuld et al., 2014, Atici and Servedio, 2005, Anguita et al., 2003, Wiebe et al., 2012, Aïmeur et al., 2012] that explore the use of a quantum computer to speed up classical learning tasks. Surveys [Arunachalam and de Wolf, 2017, Khan and Robles-Kelly, 2020] provide a detailed account of works in this set.

Notation: For integer n ∈ N, [n] ≜ {1, · · · , n}. We reserve H with appropriate subscripts to denote a finite-dimensional Hilbert space. The symbols L(H), R(H), P(H), D(H) denote the collection of linear, Hermitian, positive and density operators acting on H respectively. A POVM is a subcollection M ≜ {M_y ∈ P(H) : y ∈ Y} of indexed positive operators that sum to the identity I_H, i.e., Σ_{y∈Y} M_y = I_H. For clarity, a POVM M = {M_y ∈ P(H) : y ∈ Y} with outcomes in Y is often referred to as a Y-POVM or Y-POVM on H. In this work, we are required to perform the same measurement on multiple quantum states. For a Y-POVM M = {M_y : y ∈ Y}, we let M^n ≜ {M_{y^n} ≜ M_{y_1} ⊗ · · · ⊗ M_{y_n} : y^n ∈ Y^n} with a slight abuse of notation. For a Hilbert space H_X and a set Y, we let M_{X→Y} denote the set of all Y-POVMs on H_X.
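To ground the POVM formalism concretely, the following toy sketch (our own illustration, not from the paper; NumPy assumed, with a hypothetical "sharpness" parameter) builds a two-outcome qubit POVM, checks the two defining conditions, i.e., positivity of each element and summation to the identity, and computes outcome probabilities via the Born rule p(y) = tr(ρ M_y).

```python
import numpy as np

# A two-outcome ("unsharp") qubit POVM: a noisy version of the
# computational-basis measurement. Elements must be positive
# semidefinite and must sum to the identity.
eta = 0.8  # hypothetical sharpness parameter for illustration
M0 = eta * np.array([[1, 0], [0, 0]], dtype=complex) + (1 - eta) / 2 * np.eye(2)
M1 = eta * np.array([[0, 0], [0, 1]], dtype=complex) + (1 - eta) / 2 * np.eye(2)
povm = [M0, M1]

# Check the defining conditions of a POVM.
for M in povm:
    assert np.all(np.linalg.eigvalsh(M) >= -1e-12)  # positivity
assert np.allclose(sum(povm), np.eye(2))            # completeness

# Born rule: outcome probabilities for the state |+><+|.
plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(plus, plus.conj())
probs = [np.trace(rho @ M).real for M in povm]
assert np.isclose(sum(probs), 1.0)
print(probs)  # both outcomes are equally likely for this symmetric state
```

The same two conditions extend verbatim to any finite outcome set Y, which is all a concept class element must satisfy.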
Arun Padakandla, Abram Magner
See [Holevo, 2019, Sec. 2.3] for its structure. We let M_{X^n→Y} denote the set of all Y-POVMs on H_X^⊗n. We let S_{X→Y} denote the subcollection of sharp (projective) measurements on H_X with outcomes in Y. We let Z ≜ X × Y and H_Z ≜ H_X ⊗ H_Y. We abbreviate random variables, probability mass function, empirical risk as RVs, PMF, ER respectively. We use PMF and distribution interchangeably. We write X ∼ p_X if RV X is distributed with PMF p_X.

2 MODEL FORMULATION, PROBLEM STATEMENT

Let X represent an (arbitrary) set of features, Y a finite set of labels and P(X × Y) the set of all distributions/densities/PMFs on X × Y. In the classical model of computation, data and hence features are stored in registers, and operated on by functions. Therefore, random variables (RVs) (X, Y) ∈ X × Y with an unknown distribution p_XY model feature-label pairs occurring in nature. A predictor - a function f ∈ F_{X→Y} - is only provided X. Its performance is measured through its true risk l_p(f) ≜ l_{p_XY}(f) ≜ E_{p_XY}{l(f(X), Y)} = E_p{l(f(X), Y)} with respect to a loss function l : Y² → [A, B] bounded between A, B ≥ 0. A learning algorithm's (LA's) task is to identify an optimal predictor - a function - from within its concept class C ⊆ F_{X→Y} - a library of functions. We focus on supervised learning, wherein the LA is provided with n training samples modelled as n independent and identically distributed (IID) pairs (X_i, Y_i) ∈ X × Y : i ∈ [n] with unknown density p_XY. A C-LA is therefore a sequence of functions A_n : (X × Y)^n → C : n ≥ 1, of which the ERM rule [Shalev-Shwartz and Ben-David, 2014, Sec. 2.2] is a canonical example.

Before we describe a quantum formulation, we briefly highlight two consequences of the classical model that are often taken for granted. First, stored in registers and modeled as elements in a set, features can be duplicated (indiscriminately), thereby enabling one to simultaneously retain an "original copy" x and the outcomes (f(x) : f ∈ C) of evaluating it through any arbitrary collection of functions (Note 1). Secondly, if one ignores circuit reliability issues, no uncertainty is involved in evaluating features through functions (Note 2). In other words, the outcome of evaluating a function on a feature is deterministic.

In our proposed quantum learning model, feature x ∈ X is encoded by modifying specific characteristics of a sub-atomic particle. Let ρ_x ∈ D(H) : x ∈ X model the behaviour of a particle representing feature x ∈ X. While X × Y indexes (all) possible feature-label pairs, the feature-label pair (x, y) is encoded in a physically realized quantum system modeled via density operators ρ_{xy} ≜ ρ_x ⊗ |y⟩⟨y| ∈ D(H_Z), where H_Z = H_X ⊗ H_Y, H_Y = span{|y⟩ : y ∈ Y} with ⟨y|ŷ⟩ = δ_{yŷ} ensuring label distinguishability. Note that we have modeled labels, as before, to be stored in classical registers and have thus adopted the well-established notion of a classical-quantum (CQ) state [Wilde, 2017, Sec. 4.3.4] to model a feature-label pair.²

Labeled features are generated with respect to an unknown distribution p_XY. Specifically, the density operator of a feature-label pair is

    ρ_XY ≜ ∫_{X×Y} ρ_x ⊗ |y⟩⟨y| dp_XY ∈ D(H_Z)    (1)

which reduces to ρ_XY = Σ_{(x,y)∈X×Y} p_XY(x, y) ρ_x ⊗ |y⟩⟨y| if |X| < ∞. A predictor is only provided the quantum system corresponding to the feature, i.e., the X-system. Mathematically, the density operator of the quantum state provided to a predictor is Σ_{x∈X} p_X(x) ρ_x. We exemplify this below.

Ex. 1. Consider electrons with spins. An electron can be prepared with the axis of its spin pointing in a direction, represented by a 3-dimensional unit vector. Let the finite set X = {(θ_i, φ_j) = (iπ/8, j2π/8) : 0 ≤ i, j ≤ 7} index unit vectors in the Bloch sphere representing the possible spin axis directions. We have two labels in Y = {blue, red}. Nature decides to label an electron 'red' if the axis of its spin is orthonormal to a specific orthant; otherwise the electron is labeled 'blue'. For this she chooses a specific orthant O. This establishes a relationship - p_{Y|X} - between the elements (x, y) ∈ X × Y. Going further, she chooses a distribution p_X, samples X with respect to this distribution, endows an electron with the corresponding spin and hands only the electron to us. Our predictor is aware of X, its association with the spin directions, i.e., the mapping x → ρ_x, and Y. Oblivious to both nature's decision and the orthant, but possessing the prepared electron, a predictor's task is to determine the label.³

Previously stored in registers, features were labeled via functions. In contrast, on a quantum computing device, features encoded in quantum states can be labeled only through measurements. A predictor M = {M_y ∈ P(H_X) : y ∈ Y} is therefore a Y-POVM on H_X.

² A CQ state [Wilde, 2017, Sec. 4.3.4] models a scenario wherein some components of data are stored in classical registers, while others are encoded in quantum states that obey the axioms of quantum mechanics.
³ For example, we may know that 000 is encoded as horizontal spin axis, 111 as vertical spin axis and so on. However, we are unaware of which spin axis is labeled as red or blue, i.e., we are unaware of the rule based on which a spin axis is labeled.
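The finite-|X| form of the classical-quantum state above is easy to realize numerically. The sketch below is our own illustration (not from the paper), assuming NumPy and a hypothetical two-feature, two-label distribution p_XY: it assembles ρ_XY = Σ p_XY(x, y) ρ_x ⊗ |y⟩⟨y| and recovers the feature marginal Σ_x p_X(x) ρ_x - the only state a predictor receives - by tracing out the label register.

```python
import numpy as np

def ket(i, d):
    v = np.zeros(d, dtype=complex)
    v[i] = 1.0
    return v

# Hypothetical feature states and a hypothetical joint PMF p_XY.
rho_x = {0: np.outer(ket(0, 2), ket(0, 2).conj()),  # pure "spin up"
         1: 0.5 * np.eye(2)}                        # maximally mixed
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# CQ state rho_XY = sum_{x,y} p(x,y) rho_x (x) |y><y| on H_X (x) H_Y.
rho_XY = sum(p * np.kron(rho_x[x], np.outer(ket(y, 2), ket(y, 2).conj()))
             for (x, y), p in p_xy.items())
assert np.isclose(np.trace(rho_XY).real, 1.0)  # a valid density operator

# Tracing out the label register yields the marginal sum_x p_X(x) rho_x.
rho_X = rho_XY.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)
p_X = {0: 0.5, 1: 0.5}  # marginal of p_xy above
assert np.allclose(rho_X, sum(p_X[x] * rho_x[x] for x in p_X))
```

The final assertion makes the point of the passage above explicit: everything a predictor can access is the feature marginal; the labels live in a separate classical register.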
PAC Learning of Quantum Measurement Classes: Sample Complexity Bounds and Universal Consistency
Since our labels are stored in registers, a conventional bounded loss function l : Y × Y → [A, B] quantifies a predictor's performance. The true risk of predictor M = {M_y : y ∈ Y} if the underlying distribution is p_XY is

    l_p(M) ≜ Σ_{(x,y,ŷ)∈Z} p_XY(x, y) l(y, ŷ) tr((ρ_x ⊗ |y⟩⟨y|)(M_ŷ ⊗ I_Y)),    (2)

where Z ≜ X × Y². A concept class C ⊆ M_{X→Y} is a subcollection of measurements, i.e., POVMs with outcomes in Y. Henceforth, a concept class, a POVM concept class and a measurement concept class all refer to the same object and are used interchangeably. For an underlying distribution p_XY, we let l*(p_XY, C) ≜ inf_{M∈C} l_p(M) denote the risk of an optimal predictor in C. The goal of a C-LA is to choose a predictor M ∈ C - a POVM - whose true risk l_p(M) is close to l*(p_XY, C), while being oblivious to the underlying distribution p_XY. To accomplish this, a C-LA is provided n training samples - labeled features - randomly chosen IID from p_XY. The density operator of the received training samples is therefore

    ρ_XY^⊗n ≜ ∫_{X^n×Y^n} ρ_{x^n} ⊗ |y^n⟩⟨y^n| dp^n_XY ∈ D(H_Z^⊗n)    (3)

where ρ_{x^n} ≜ ⊗_{i=1}^n ρ_{x_i} and |y^n⟩⟨y^n| ≜ ⊗_{i=1}^n |y_i⟩⟨y_i|.

Mathematically, any operation on a quantum state with classical outcomes is a measurement. A C-LA must therefore measure the n training samples modeled via density operator ρ_XY^⊗n and output an index in C.

Defn 1. A POVM concept class for labeling features in D(H_X) with labels in Y is a subcollection C ⊆ M_{X→Y} of POVMs. A C-learning algorithm (C-LA) is a sequence A_n ≜ {A_{nM} ∈ P(H_Z^⊗n) : M ∈ C} ∈ M_{Z^n→C} : n ≥ 1 of C-POVMs on H_Z^⊗n.

We elaborate on Defn. 1 for clarity. A C-LA is a sequence A_n : n ≥ 1 of measurements. Each measurement in this sequence is a C-POVM. When measurement A_n is performed on n training samples, the outcome obtained is the index of the chosen predictor in C. Having clarified this, we now enquire: how many samples does a C-LA need to identify an ε-optimal predictor with confidence at least 1 − δ? We are thus led to the notion of PAC learnability of measurement spaces.

Defn 2. An algorithm A_n ≜ {A_{nM} ∈ P(H_Z^⊗n) : M ∈ C} ∈ M_{Z^n→C} : n ≥ 1 (PAC) learns C if for every ε > 0, δ > 0, there exists an N(ε, δ) ∈ N such that for all n ≥ N(ε, δ) and every distribution p_XY, the probability of A_n choosing a predictor M ∈ C for which the true risk l_p(M) > l*(p_XY, C) + ε is at most δ. In this case, the function N : [A, B] × (0, 1) → N is the sample complexity of algorithm A_n : n ≥ 1 in learning C. We say that the C-LA A_n : n ≥ 1 efficiently PAC learns C if N(ε, δ) is polynomial in 1/ε, 1/δ. A concept class C ⊆ M_{X→Y} is (efficiently) PAC learnable if there exists a C-LA that (efficiently) PAC learns C.

Remark 1. Suppose we are given a classical PAC learning problem with feature set X and a function concept class C ⊆ F_{X→Y}. We can recover this classical formulation via the following substitutions in the problem formulated herein. Choose H_X = span{|x⟩ : x ∈ X} with ⟨x|x̂⟩ = δ_{xx̂} and ρ_x = |x⟩⟨x|. We then encode the concept class of functions in terms of POVMs as M^f_y = Σ_{x:f(x)=y} |x⟩⟨x| for each f ∈ C.

While the classical problem is well understood [Shalev-Shwartz and Ben-David, 2014, Devroye et al., 1996], the unique behaviour of quantum systems and the complexity of the involved mathematical objects throw up challenges leaving the above problem, which we tackle here, unresolved.

The need to modify the ERM rule: The following example illustrates the need to modify the classical ERM algorithm; we do so in Sec. 3.2 and analyze the resulting sample complexity in Sec. 4.

Ex. 2. Suppose we are provided n = 20 training samples, i.e., 20 electrons with corresponding labels in registers, and our library C consists of 2 predictors M_1 = {M_{1b}, M_{1r}}, M_2 = {M_{2b}, M_{2r}}. An ERM rule would attempt to measure all 20 electrons with both measurements and choose the measurement whose outcomes disagree least with the provided labels. However, every measurement alters the spin, rendering it unusable for the next measurement. Moreover, electron spins cannot be replicated (No Cloning Theorem [Wootters and Zurek, 1982]). This indicates that every training sample can be used to evaluate the empirical risk (ER) of just one measurement.

Unlike in Remark 1, non-commutativity of quantum measurements and the No Cloning theorem [Wootters and Zurek, 1982] suggest that each training sample can be used to evaluate the ER of just one measurement. Moreover, the outcomes of measurements are random (Remark 2), resulting in random empirical risks. How should a C-LA optimally utilize training samples to identify the best predictor from within C? The discussion thus far, exemplified through Ex. 1 - 2, suggests that each training sample can yield the ER of just one measurement. This results in a blow-up of
the sample complexity of any concept class. Our first simple step (Sec. 3.1) is to leverage compatible measurements [Holevo, 2019, Sec. 2.3.2] to 're-use' samples effectively. We build on this to derive an upper bound on the sample complexity. In contrast with the previous work [Heidari et al., 2021], our work necessitates a substantial extension to cover the ubiquitous case of infinite-cardinality concept classes and to generalize beyond compatible partitions defined in terms of commuting POVMs.

3 STOCHASTICALLY COMPATIBLE POVMS AND A NEW ERM RULE

While classical register contents may be duplicated, thereby permitting simultaneous evaluations of all functions in C ⊆ F_{X→Y}, the No Cloning theorem [Wootters and Zurek, 1982] precludes reproducing quantum states. Moreover, any measurement results in the collapse of a quantum state: performing POVM M_1 = {M_{1y} ⊗ I_Y : y ∈ Y} on each training sample results in the modified training samples described via ρ̃_XY^⊗n, where ρ̃_XY = Σ_{z∈Y} ∫ √M_{1z} ρ_x √M_{1z} ⊗ |y⟩⟨y| dp_XY. Performing subsequent POVMs on the collapsed training samples does not enable us to estimate the true risk of these predictors.

3.1 Stochastically Compatible Measurements

The ERM rule we propose is based on a divide-and-conquer approach. While POVMs are generally incompatible, certain subsets are jointly measurable [Holevo, 2019, Sec. 2.1.2], leading us to the following.⁴

Defn 3. A subcollection B ⊆ M_{X→Y} of Y-POVMs on H_X is compatible if there exists a finite outcome set W, a W-POVM G = {G_w ∈ P(H_X) : w ∈ W} on H_X and a stochastic matrix α^B_{Y|W} : W → Y for each B = {B_y ∈ P(H_X) : y ∈ Y} ∈ B such that

    B_y = Σ_{w∈W} G_w α^B_{Y|W}(y|w)  for all y ∈ Y and all B ∈ B.    (4)

In this case, we say B is compatible with fine-graining (W, G, α_{Y|W}) ≜ (W, G, α^B_{Y|W} : B ∈ B).

In other words, a compatible collection of POVMs can be applied by first applying a single POVM to a quantum state, then applying a noisy classical channel to the output of this POVM. The classical channel is unique to each individual POVM in the compatible collection.

From (4), it is evident that to obtain the joint statistics of all POVMs in a compatible subcollection B, one can perform POVM G and pass the outcome through the stochastic matrix Π_{B∈B} α^B_{Y|W}. This suggests that we partition the POVM concept class C into compatible subcollections and 're-use' training samples amongst POVMs within a compatible subcollection. Before we state this, we identify an important example of a compatible subcollection.

Remark 2. A subcollection B ⊆ M_{X→Y} of Y-POVMs on H_X is compatible if for any pair M = {M_y : y ∈ Y} ∈ B, L = {L_y : y ∈ Y} ∈ B, we have M_y L_ŷ = L_ŷ M_y for all (y, ŷ) ∈ Y × Y.

While commutative POVMs are compatible, the converse is not necessarily true. In fact, the problem of characterizing compatible POVMs remains active [Jae et al., 2019, Guerini and Terra Cunha, 2018, Kunjwal, 2014, Karthik et al., 2015]. Notwithstanding this, the notion of compatibility provides us with the appropriate construct to modify a naive ERM rule that resulted in a blowup of sample complexity. We specify the new ERM rule in the following.

3.2 ERM Rule for Learning Quantum Measurement Concept Classes

The idea of the new ERM rule is to partition the POVM concept class C into compatible subcollections. The set of training samples is partitioned analogously so that there is a 1 : 1 correspondence between the two partitions. The number n_i of samples allotted to the partition element with index i in the compatible partition is chosen in accordance with our sample complexity upper bound in Theorem 1. That is, we choose n_i = (8(B − A)² log(1/δ) + 8B² VC(B_i)) / ε², where the VC dimension VC(·) is defined in subsequent discussion. Each subset of the training samples is employed to evaluate the ER of all POVMs in the corresponding compatible subset. Finally, the POVM with the least ER among all POVMs is the chosen predictor. The following definition enables us to specify the proposed ERM rule precisely.

Defn 4. Let C ⊆ M_{X→Y}. We say (B_i : i ∈ I) is a compatible partition of C if (i) B_i ⊆ C is compatible for each i ∈ I, (ii) B_i ∩ B_î = ∅ whenever î ≠ i and (iii) ∪_{i∈I} B_i = C. In this case, suppose B_i is compatible with fine-graining (W_i, G_i, α^i_{Y|W_i}) ≜ (W_i, G_i, α^{B_i}_{Y|W_i}) for each i ∈ I; we say B_I ≜ (B_i : i ∈ I) is a compatible partition of C with fine-graining (W_i, G_i, α^i_{Y|W_i}) : i ∈ I. Furthermore, we let B denote the set of all compatible partitions of POVM concept class C ⊆ M_{X→Y}.

⁴ Note that our notion of compatible is referred to as stochastically compatible by Holevo [Holevo, 2019, Sec. 2.1.2, Pg. 13]. Compatible in [Holevo, 2019, Sec. 2.1.2] corresponds to the case when the stochastic matrices are 0-1 valued.
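As a numerical sanity check of Defn. 3 (our own sketch, not from the paper; NumPy assumed), the snippet below takes two commuting qubit POVMs, uses their product POVM as the fine-graining G with coordinate-projection channels, and verifies that Eq. (4) holds for every element of both POVMs.

```python
import numpy as np

# Two commuting (here, diagonal) two-outcome qubit POVMs.
M = [np.diag([0.9, 0.1]).astype(complex), np.diag([0.1, 0.9]).astype(complex)]
L = [np.diag([0.7, 0.4]).astype(complex), np.diag([0.3, 0.6]).astype(complex)]

# Fine-graining for commuting POVMs: W = Y x Y, with
# G_{(y1,y2)} = M_{y1} L_{y2}, and coordinate channels
# alpha^M(y|(y1,y2)) = 1{y = y1}, alpha^L(y|(y1,y2)) = 1{y = y2}.
W = [(y1, y2) for y1 in range(2) for y2 in range(2)]
G = {w: M[w[0]] @ L[w[1]] for w in W}
assert np.allclose(sum(G.values()), np.eye(2))  # G is itself a valid POVM

# Eq. (4): B_y = sum_w G_w alpha^B(y|w), for B in {M, L}.
for y in range(2):
    assert np.allclose(sum(G[w] for w in W if w[0] == y), M[y])
    assert np.allclose(sum(G[w] for w in W if w[1] == y), L[y])
```

The check succeeds because, with commuting elements, summing G over the second coordinate reproduces M_{y1}(L_0 + L_1) = M_{y1}, and symmetrically for L; for non-commuting but still compatible POVMs, G and the channels must be found by other means.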
Algorithm 1: ERM-Quantum (ERM-Q) Algorithm
Input: POVM concept class C and n training samples with (y_1, · · · , y_n) denoting the labels.
Output: Index of the selected predictor in C.
1 Identify a compatible partition B_I ≜ (B_i : i ∈ I) of C with fine-graining (W_i, G_i, α^i_{Y|W_i}) : i ∈ I and let t = |I|. Henceforth, let α^i_{Y|W_i} = Π_{B∈B_i} α^B_{Y|W_i} be the product of the stochastic matrices.
2 Partition the n training samples into t = |I| subsets, with the i-th subset having n_i training samples. Let (y_j^{(i)} : 1 ≤ j ≤ n_i) denote the training sample labels of samples provided in the i-th subset. n_i is determined in relation to VC(B_i) appearing in (9).
3 for i = 1 to t do
4     Perform W_i-POVM G_i on each of the n_i samples. For j = 1, · · · , n_i, let w_j^{(i)} ∈ W_i denote the outcome of the POVM on the j-th quantum state.
5     Postprocess (w_j^{(i)} : 1 ≤ j ≤ n_i) by passing each component through the stochastic matrix α^i_{Y|W_i}. For POVM B ∈ B_i ⊆ C, let ŷ_j^B denote the post-processed outcome corresponding to sample j.
6     Compute the ER, ER(B, B_i) ≜ (1/n_i) Σ_{j=1}^{n_i} l(ŷ_j^B, y_j^{(i)}), of each POVM B ∈ B_i ⊆ C.
7     Let B_i* = arg min_{B∈B_i} ER(B, B_i).
8 return the index of the predictor among (B_i* : i ∈ I) that globally minimizes the ER.

We give a finite example below.

Ex. 3. Suppose Y = {0, 1} and C = {M_1, · · · , M_5} contains 5 POVMs. Suppose B_1 = {M_1, M_2} and B_2 = {M_3, M_4, M_5} form a compatible partition of C (Defn. 4). Specifically, suppose B_1 is compatible with fine-graining (W_1, G_1, α^1_{Y|W_1}, α^2_{Y|W_1}) and B_2 is compatible with fine-graining (W_2, G_2, α^3_{Y|W_2}, α^4_{Y|W_2}, α^5_{Y|W_2}). In order to obtain outcomes that are statistically indistinguishable from the outcomes of the POVMs in B_1, one can perform W_1-POVM G_1 on the quantum state and post-process the outcome w_1 ∈ W_1 by passing it through the product stochastic matrix

    α^1_{Y|W_1}(y_1, y_2|w_1) ≜ α^{B_1}_{Y|W_1}(y_1, y_2|w_1)    (5)
                             ≜ α^{M_1}_{Y|W_1}(y_1|w_1) α^{M_2}_{Y|W_1}(y_2|w_1)    (6)

where (w_1, y_1, y_2) ∈ W_1 × Y².

Ex. 3 identifies a compatible partition with two elements, i.e., I = {1, 2} with t = 2. The product stochastic matrix for i = 1 is explicitly stated in the example. The outcomes (w_j^{(i)} : 1 ≤ j ≤ n_i) are the n_i outcomes one obtains on performing POVM G_i on the n_i quantum states. Each of these is passed through the product stochastic matrix α^i_{Y|W_i} = α^{B_i}_{Y|W_i}. We have denoted the resulting collection of outcomes ((ŷ_j^B : B ∈ B_i) : 1 ≤ j ≤ n_i) to distinguish it from the labels provided as part of the training data; these were denoted as y_1, y_2 in Ex. 3. The remaining steps facilitate identifying the ERM POVM in each partition element, followed by identifying the global ERM POVM that is returned.

Remark 3. If B = {M_j : 1 ≤ j ≤ J} consists of commuting Y-POVMs M_j = {M_{jy} ∈ P(H) : y ∈ Y} : 1 ≤ j ≤ J, then B is compatible with fine-graining (Y^J, Π_{j=1}^J M_j, 1_{Y_j|Y}), where Y^J ≜ ×_{j=1}^J Y_j is the Cartesian product, Π_{j=1}^J M_j ≜ {M_{1y_1} M_{2y_2} · · · M_{Jy_J} ∈ P(H) : (y_1, · · · , y_J) ∈ Y^J} is the product POVM and 1_{Y_j|Y}(y|y_1, · · · , y_J) = 1_{{y=y_j}} is the co-ordinate function.

Remark 4. For a set X, consider the Hilbert space H_X ≜ span{|x⟩ : x ∈ X} with ⟨x̂|x⟩ = δ_{x̂x}. The set of all stochastic matrices on X (which subsumes functions) forms a commuting set of operators on H_X. Hence the classical PAC learning problem reduces to a POVM concept class in which all operators commute. ERM-Q therefore reduces to the classical ERM with one commuting subset of POVMs.

Remark 5. While the ERM-Q algorithm 're-uses' samples to obtain the ER of POVMs within a compatible subset, it makes no attempt to 're-use' samples across subsets of a compatible partition. Using {ρ_x : x ∈ X} and the obtained outcomes one can, in theory, characterize all possible collapsed states and 're-use' samples across partitions to glean partial information on the ER of incompatible POVMs. Since POVMs obfuscate phase information of the collapsed state, we have not attempted to explore this direction in this first step. Our sample complexity of ERM-Q is therefore only an upper bound.

4 AN UPPER BOUND ON THE SAMPLE COMPLEXITY OF POVM CONCEPT CLASSES

We derive an upper bound on the sample complexity of the ERM-Q algorithm. As noted before, in contrast to classical PAC, there are two points of departure - non-commutativity and random outcomes. The former led us to partitioning C. The latter forces us to
deal with randomness in outcome labels. We provide several steps of the proof in Sec. 4.2. To aid clarity and readability, probabilities of events are expressed as expectations of corresponding indicator RVs, which are further expressed as summations. E.g., if (X, Y, Z) has a distribution p_XY p_Z|X, then P(l(Y, Z) > a) is written as Σ_{x,y,z} p_XY(x, y) p_Z|X(z|x) 1_{{l(y,z)>a}}. This is done only to permit an uncluttered integration of quantum and classical probability. We emphasize that none of the arguments rely on summations confined to finite ranges or suprema being maxima.

4.1 VC Dimension of Compatible POVMs and Sample Complexity of ERM-Q

The divide-and-conquer approach of ERM-Q is naturally reflected in its sample complexity. Below, we define a notion of VC dimension of a compatible concept class and use it to upper bound the sample complexity of ERM-Q.

Defn 5. Let B = {B_k : k ∈ K} be a compatible partition with fine-graining (W, G ≜ G_W, α_k ≜ α^k_{Y|W} : k ∈ K), where G = {G_w ∈ P(H) : w ∈ W}. Let

    β_k(y_k|x) ≜ Σ_{w∈W} tr(ρ_x G_w) α_k(y_k|w)    (7)

and

    β_K(y_k : k ∈ K|x) ≜ Σ_{w∈W} tr(ρ_x G_w) Π_{k∈K} α_k(y_k|w).    (8)

For x ∈ X, we say (y_k : k ∈ K) is reachable from x if β_K(y_k : k ∈ K|x) > 0. Let r ∈ N and let y ≜ (y^1, · · · , y^r) ∈ Y^{|K|} × · · · × Y^{|K|}, where y^i = (y^i_k : k ∈ K) for i ∈ [r]. We say y is reachable from x = (x_1, · · · , x_r) ∈ X^r if Π_{i=1}^r β_K(y^i|x_i) > 0. Let θ_r(B, x) ≜ {y ∈ Y^{|K|r} : y is reachable from x}. We say B shatters x ∈ X^r if |θ_r(B, x)| = |Y|^r. The VC dimension of B, denoted VC(B), is the maximal d ∈ N for which there exists a set x ∈ X^d that is shattered by B. Finally, we let S_r(B) ≜ max_{x∈X^r} |θ_r(B, x)|.

This leads to the following theorem.

Theorem 1. The sample complexity N : [A, B] × (0, 1) → N of the ERM-Q algorithm in learning C is bounded by

    N(ε, δ) ≤ min_{(B_1,··· ,B_|I|)∈B} [ 8( (B − A)² log(|I|/δ) + B² Σ_{i=1}^{|I|} VC(B_i) ) ] / ε²    (9)

where the minimum is over the collection B of all compatible partitions of C as defined in Defn. 4.

Remark 6. From the definition of reachability and θ_r(B, x), it is evident that the VC dimension of POVM classes is, owing to quantum indeterminism, quite high. Indeed, any set of outcome labels that gets a non-zero probability of occurrence, no matter how small, is reachable. This increases the sample complexity of POVM classes. Quantum indeterminism is unfortunately unavoidable. Additionally, our bound is finite only when a finite-cardinality compatible partition of the hypothesis class exists. This condition naturally holds in certain use cases: e.g., where an experimenter has access to finitely many "root" measurement devices (POVMs), and they form an infinite-cardinality concept class by passing the results of these POVMs through infinitely many classical channels. In this scenario, the natural partition is immediate. It is likely that our results can be further extended to the case where we only require a looser version of compatibility among POVMs in a partition element. Since the space of POVMs of a given dimension is compact, finitude of the partition may then be guaranteed. However, the problem of actually specifying a partition for which our bound is minimized can likely only be solved under further assumptions on the hypothesis class.

Our sample complexity bound is a nontrivial extension of the techniques of classical learning theory to the quantum framework, which is necessary for showing finite sample complexity bounds for infinite-cardinality learnable hypothesis classes. Our upper bound is likely not tight for a wide variety of POVM classes. Thus, it is of interest to further improve our upper bound and to provide sample complexity lower bounds that match upper bounds. Ultimately, the goal is to give necessary and sufficient conditions for learnability of POVM classes, and the present work is a first step in this direction.

4.2 Proof of Theorem 1: Outline

To indicate the general direction of the proof, we identify here two main terms T_1 and T_2 that need to be bounded from above in order to derive the sample complexity. Upper bounds on the two main terms are derived in the supplementary material.

Preliminaries and Notation: Let B_1, · · · , B_|I| be a compatible partition of C. Throughout, we focus on proving the uniform convergence property [Shalev-Shwartz and Ben-David, 2014, Defn. 4.3] of one partition element B_1, henceforth denoted B. Let B = {M_1, · · · , M_K} ⊆ C be compatible, with fine-graining (W, G_W, α_{Y_k|W} : 1 ≤ k ≤ K). Henceforth, we let α_k = α_{Y_k|W} : k ∈ [K] and G = G_W. To prevent overloading, we let Z = Y and use z, Z_k, z_k to refer to the outcome of the POVMs in B on
the X-system of the training sample. Recall that ERM-Q performs POVM G^n = {G_{w_1} ⊗ · · · ⊗ G_{w_n} : (w_1, · · · , w_n) ∈ W^n} on ρ_X^⊗n, where ρ_X = ∫ ρ_x dp_X = Σ_{x∈X} p_X(x) ρ_x with p_X unknown. Each of the n outcomes W_i : i ∈ [n] (RVs) is passed through the stochastic matrix (α_{[K]}(z|w) = α_{[K]}(z_1, · · · , z_K|w) : (z, w) = (z_1, · · · , z_K, w) ∈ Z^K × W), where⁵

    α_{[K]}(z_1, · · · , z_K|w) ≜ Π_{k=1}^K α_k(z_k|w).    (10)

tool in bounding T_1 is the Bounded Difference Inequality (BDI) [Stephane Boucheron and Massart, 2013, Thm. 6.2]. The bound on T_2 is derived using the VC inequality and Massart's lemma. Specifically, the ghost sample technique followed by the symmetrization technique leads us to the Rademacher complexity of the POVM concept class. Finally, using Massart's lemma, we derive a bound on T_2. A complete proof is given in the supplementary material.
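To make the sample-reuse pipeline concrete, here is a toy end-to-end run of the measurement side of the ERM-Q rule (our own sketch, not the authors' code; NumPy assumed, with a hypothetical two-state source and two predictors that are both post-processings of one root POVM G).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source: two orthogonal qubit "feature" states; the true
# label happens to equal the feature index, and samples are drawn IID.
states = {0: np.diag([1.0, 0.0]), 1: np.diag([0.0, 1.0])}  # rho_x
n = 200
xs = rng.integers(0, 2, size=n)  # sampled features
ys = xs.copy()                   # their labels

def g_probs(x):
    # Outcome distribution tr(rho_x G_w) of the root POVM G, taken here
    # as the computational-basis measurement with outcomes w in {0, 1}.
    return np.diag(states[x]).real

# One compatible subcollection: each predictor is a classical channel
# alpha_k(y|w) applied to the outcome of the same root POVM G.
alphas = {
    "identity": np.eye(2),                       # predict y = w
    "flip": np.array([[0.0, 1.0], [1.0, 0.0]]),  # predict y = 1 - w
}

# Step 4 of Alg. 1: measure G once per sample (each sample touched once).
ws = np.array([rng.choice(2, p=g_probs(x)) for x in xs])

# Steps 5-6: pass every outcome through each channel; 0-1 empirical risk.
er = {k: float(np.mean([rng.choice(2, p=A[:, w]) != y
                        for w, y in zip(ws, ys)]))
      for k, A in alphas.items()}

# Step 7: empirical risk minimizer within the subcollection.
best = min(er, key=er.get)
print(best, er)  # the "identity" post-processing attains zero ER here
```

Because both predictors share the root POVM G, every sample contributes to the ER of both, which is precisely the reuse that compatibility buys; across incompatible subcollections the samples would have to be split.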
[Bubeck et al., 2020] Bubeck, S., Chen, S., and Li, J. (2020). Entanglement is necessary for optimal quantum property testing.

[Khan and Robles-Kelly, 2020] Khan, T. M. and Robles-Kelly, A. (2020). Machine learning: Quantum vs classical. IEEE Access, 8:219275–219294.
[Kunjwal, 2014] Kunjwal, R. (2014). A note on the joint measurability of POVMs and its implications for contextuality.

[Lloyd et al., 2014] Lloyd, S., Mohseni, M., and Rebentrost, P. (2014). Quantum principal component analysis. Nature Physics, 10(9):631–633.

[Montanaro and de Wolf, 2018] Montanaro, A. and de Wolf, R. (2018). A survey of quantum property testing.

[O'Donnell and Wright, 2016] O'Donnell, R. and Wright, J. (2016). Efficient quantum tomography. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pages 899–912, New York, NY, USA. Association for Computing Machinery.

[O'Donnell and Wright, 2017] O'Donnell, R. and Wright, J. (2017). Efficient quantum tomography II. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 962–974, New York, NY, USA. Association for Computing Machinery.

[Rehacek and Paris, 2004] Rehacek, J. and Paris, M. (2004). Quantum State Estimation. Lecture Notes in Physics. Springer, Berlin.

[Schuld et al., 2014] Schuld, M., Sinayskiy, I., and Petruccione, F. (2014). Quantum computing for pattern classification. In Pham, D.-N. and Park, S.-B., editors, PRICAI 2014: Trends in Artificial Intelligence, pages 208–220, Cham. Springer International Publishing.

[Servedio, 2001] Servedio, R. A. (2001). Separating quantum and classical learning. In Proceedings of the 28th International Colloquium on Automata, Languages and Programming, ICALP '01, pages 1065–1080, Berlin, Heidelberg. Springer-Verlag.

[Shalev-Shwartz and Ben-David, 2014] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA.

[Stephane Boucheron and Massart, 2013] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities. Oxford University Press, Oxford, United Kingdom.

[Valiant, 1984] Valiant, L. G. (1984). A theory of the learnable. Commun. ACM, 27(11):1134–1142.

[Vapnik, 1982] Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg.

[Vapnik, 1995] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg.

[Vapnik, 1998] Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.

[Vapnik and Chervonenkis, 1971] Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280.

[Vapnik and Chervonenkis, 1974] Vapnik, V. and Chervonenkis, A. (1974). Theory of Pattern Recognition [in Russian]. Nauka, Moscow. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).

[Waseem et al., 2020] Waseem, M. H., e Ilahi, F., and Anwar, M. S. (2020). Quantum state tomography. In Quantum Mechanics in the Single Photon Laboratory, 2053-2563, pages 6-1 to 6-20. IOP Publishing.

[Wiebe et al., 2012] Wiebe, N., Braun, D., and Lloyd, S. (2012). Quantum algorithm for data fitting. Phys. Rev. Lett., 109:050505.

[Wilde, 2017] Wilde, M. (2017). Quantum Information Theory. Cambridge University Press, Cambridge, UK.

[Wootters and Zurek, 1982] Wootters, W. K. and Zurek, W. H. (1982). A single quantum cannot be cloned. Nature, 299(5886):802–803.
Supplementary Material:
PAC Learning of Quantum Measurement Classes: Sample
Complexity Bounds and Universal Consistency
$$T_1 \triangleq \sum_{x,y,z} p_{XY}^{n}(x,y)\,\beta_{[K]}^{n}(z|x)\,\mathbb{1}\big\{\varphi(x,y,z)-\mathbb{E}\{\varphi(X,Y,Z)\}>\tfrac{\epsilon}{2}\big\}. \qquad (22)$$

For $j = 1,\cdots,|I|$, ERM-Q identifies

$$k_j^{*} \triangleq \mathop{\arg\min}_{1\le k\le K} \frac{1}{n_j}\sum_{i=1}^{n_j} l(y_i, Z_{ki}^{j}). \qquad (23)$$

With this, the first term $T_1$ is

$$T_1 = P\circ\beta\big(\varphi_{p\beta}(X,Y,Z)-\mathbb{E}\{\varphi_{p\beta}(X,Y,Z)\}\ge\tfrac{\epsilon}{2}\big) \qquad (26)$$

$$\le P\circ\beta\big(\varphi_{p\beta}^{+}(X,Y,Z)-\mathbb{E}\{\varphi_{p\beta}^{+}(X,Y,Z)\}>\tfrac{\epsilon}{4}\big). \qquad (27)$$

In order to upper bound $T_1$, we use the Bounded Difference Inequality stated in Theorem 3. Choose $\mathcal{B} = \mathcal{X}\times\mathcal{Y}\times\mathcal{Z}^{K}$ and $B_i = (X_i, Y_i, \underline{Z}_i) = (X_i, Y_i, Z_{1i},\cdots,Z_{Ki})$ and $f(x,y,z) = \varphi_{p\beta}(x,y,z)$. Let $x,\hat{x}\in\mathcal{X}^n$, $y,\hat{y}\in\mathcal{Y}^n$ and $z,\hat{z}\in\mathcal{Z}^{Kn}$ be such that $x_t = \hat{x}_t$, $y_t = \hat{y}_t$ and $\underline{z}_t = \hat{\underline{z}}_t$ for $t\in[n]\setminus i$. For such a choice, we have

$$\varphi_{p\beta}^{+}(x,y,z) - \varphi_{p\beta}^{+}(\hat{x},\hat{y},\hat{z}) \qquad (28)$$

$$= \sup_{1\le k\le K}\Big[\sum_{i=1}^{n} \frac{l(y_i,z_{ki})}{n} - \sum_{(a,b,c)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}} l(b,c)\,p_{XY}(a,b)\,\beta_k(c|a)\Big] - \sup_{1\le k\le K}\Big[\sum_{i=1}^{n} \frac{l(\hat{y}_i,\hat{z}_{ki})}{n} - \sum_{(a,b,c)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}} l(b,c)\,p_{XY}(a,b)\,\beta_k(c|a)\Big] \qquad (29)$$

$$\le \sup_{1\le k\le K}\Big[\Big(\sum_{i=1}^{n} \frac{l(y_i,z_{ki})}{n} - \sum_{(a,b,c)} l(b,c)\,p_{XY}(a,b)\,\beta_k(c|a)\Big) - \Big(\sum_{i=1}^{n} \frac{l(\hat{y}_i,\hat{z}_{ki})}{n} - \sum_{(a,b,c)} l(b,c)\,p_{XY}(a,b)\,\beta_k(c|a)\Big)\Big] \qquad (30)$$

$$= \sup_{1\le k\le K}\sum_{i=1}^{n}\Big(\frac{l(y_i,z_{ki})}{n} - \frac{l(\hat{y}_i,\hat{z}_{ki})}{n}\Big) \le \frac{B-A}{n}, \qquad (31)$$

where (30) follows from $\sup_a f_1(a) - \sup_a f_2(a) \le \sup_a (f_1(a)-f_2(a))$ and (31) follows from the bounds on the loss. The Bounded Difference Inequality then yields $P\circ\beta\big(\varphi_{p\beta}^{+}(X,Y,Z)-\mathbb{E}\{\varphi_{p\beta}^{+}(X,Y,Z)\}>\tfrac{\epsilon}{4}\big) \le \exp\big\{-\tfrac{n\epsilon^2}{2(B-A)^2}\big\}$, and we can conclude

$$T_1 \le 2\exp\Big\{-\frac{n\epsilon^2}{4(B-A)^2}\Big\}.$$

A.3 Bounding the Second Term $T_2$ in (13) via Rademacher Complexity

We recall $T_2 \triangleq \mathbb{E}\{\varphi(X,Y,Z)\}$, where

$$\varphi(x_1,\cdots,x_n, y_1,\cdots,y_n, \underline{z}_1,\cdots,\underline{z}_n) \triangleq \sup_{1\le k\le K}\Big|\sum_{i=1}^{n} \frac{l(y_i,z_{ki})}{n} - \sum_{(a,b,c)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}} l(b,c)\,p_{XY}(a,b)\,\beta_k(c|a)\Big|. \qquad (33)\text{--}(35)$$

In bounding $\mathbb{E}\{\varphi(X,Y,Z)\}$, we employ the ghost sample + symmetrization trick [Devroye et al., 1996, Thm. 12.4]. Observe that

$$\mathbb{E}\{\varphi(X,Y,Z)\} \qquad (36)$$

$$= \sum_{x,y,z} p_{XY}^{n}(x,y)\,\beta_{[K]}^{n}(z|x)\sup_{k\in[K]}\Big|\sum_{i=1}^{n} \frac{l(y_i,z_{ki})}{n} - \sum_{(a,b,c)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}} l(b,c)\,p_{XY}(a,b)\,\beta_k(c|a)\Big| \qquad (37)$$

$$= \sum_{x,y,z} p_{XY}^{n}(x,y)\,\beta_{[K]}^{n}(z|x)\sup_{k\in[K]}\Big|\sum_{i=1}^{n} \frac{l(y_i,z_{ki})}{n} - \sum_{a,b,c}\sum_{j=1}^{n} \frac{l(b_j,c_{kj})}{n}\,p_{XY}^{n}(a,b)\,\beta_{[K]}^{n}(c|a)\Big| \qquad (38)\text{--}(40)$$

$$\le \sum_{x,y,z}\sum_{a,b,c} p_{XY}^{n}(a,b)\,\beta_{[K]}^{n}(c|a)\,p_{XY}^{n}(x,y)\,\beta_{[K]}^{n}(z|x)\sup_{k\in[K]}\Big|\sum_{i=1}^{n} \frac{l(y_i,z_{ki})}{n} - \sum_{j=1}^{n} \frac{l(b_j,c_{kj})}{n}\Big| \qquad (41)$$
where (38) follows by introducing the ghost sample to the second term [Devroye et al., 1996, Step 1 in Proof of Thm. 12.7], (41) follows from the chain $\sup|\mathbb{E}\{\cdot\}| \le \sup\mathbb{E}\{|\cdot|\} \le \mathbb{E}\{\sup|\cdot|\}$ of inequalities, and the next step follows by symmetrization by random signs [Devroye et al., 1996, Step 2 in Proof of Thm. 12.4], finally resulting in the familiar average Rademacher complexity. While these steps are fairly standard, we notice the crucial difference (45), where the averaging includes the POVM randomness $\beta_{[K]}(\cdot|\cdot)$.

A.4 Bounding $R_n(\mathcal{B})$ using Massart's Lemma

We now collate bounds from the first term and (52) and substitute in (13). Recollecting our compatible partition $\mathcal{B} = \{M_k : 1 \le k \le K\}$, definitions (10)

where

$$\eta(\epsilon,\delta) \triangleq \sqrt{\frac{2(B-A)\log\frac{1}{\delta} + 8B^2\log S_r(\mathcal{B})}{n}}. \qquad (56)$$

A.6 Union Bound on the Compatible Partition

Having derived uniform convergence for one compatible subset $\mathcal{B}$ of POVMs, we employ a union bound for the compatible partition $\mathcal{B}_1,\cdots,\mathcal{B}_{|I|}$. Squeezing $\delta$ to $\frac{\delta}{|I|}$, we obtain the sample complexity stated in the theorem statement.

What if the quantum states are not distinguishable, i.e., the operators $\rho_x : x \in \mathcal{X}$ have overlapping support spaces? We leverage the approach stated in the above
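The ghost-sample and symmetrization route of Section A.3 can be illustrated numerically. The sketch below (Python with NumPy) is a hypothetical stand-in, not the paper's construction: it models a finite class of $K$ loss distributions by $K$ Bernoulli means, Monte-Carlo-estimates $\mathbb{E}\sup_k |(1/n)\sum_i l_i - \mathbb{E}\,l|$ as in (33)-(37), and checks it against a Massart-style rate of order $\sqrt{\log K / n}$, the kind of bound that symmetrization plus Massart's lemma yields for bounded losses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, trials = 400, 8, 200  # sample size, class size, Monte Carlo trials (illustrative)

# A finite "concept class": K fixed Bernoulli means, standing in for the loss
# distributions l(Y, Z_k) induced by the K measurements in a compatible subset.
means = rng.uniform(0.2, 0.8, size=K)

sup_devs = []
for _ in range(trials):
    samples = (rng.random((n, K)) < means)        # {0,1}-valued losses, column k ~ Bernoulli(means[k])
    emp = samples.mean(axis=0)                    # empirical means (1/n) sum_i l(y_i, z_ki)
    sup_devs.append(np.max(np.abs(emp - means)))  # sup_k |empirical - true|, cf. (33)-(35)

estimate = float(np.mean(sup_devs))               # Monte Carlo estimate of E sup_k |...|

# Massart-style rate for a finite class of K bounded sequences: O(sqrt(log K / n)).
rate = 2.0 * np.sqrt(2.0 * np.log(2 * K) / n)
assert estimate <= rate
```

The observed supremum deviation sits well below the $\sqrt{\log K/n}$ envelope, consistent with the uniform convergence derived above for a single compatible subset; the union bound of Section A.6 then extends this across the partition at the price of replacing $\delta$ by $\delta/|I|$.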