
Quantum machine learning in feature Hilbert spaces

Maria Schuld∗ and Nathan Killoran


Xanadu, 372 Richmond St W, Toronto, M5V 2L7, Canada
(Dated: March 21, 2018)
arXiv:1803.07128v1 [quant-ph] 19 Mar 2018

The basic idea of quantum computing is surprisingly similar to that of kernel methods in machine
learning, namely to efficiently perform computations in an intractably large Hilbert space. In this
paper we explore some theoretical foundations of this link and show how it opens up a new avenue
for the design of quantum machine learning algorithms. We interpret the process of encoding inputs
in a quantum state as a nonlinear feature map that maps data to quantum Hilbert space. A
quantum computer can now analyse the input data in this feature space. Based on this link, we
discuss two approaches for building a quantum model for classification. In the first approach, the
quantum device estimates inner products of quantum states to compute a classically intractable
kernel. This kernel can be fed into any classical kernel method such as a support vector machine. In
the second approach, we can use a variational quantum circuit as a linear model that classifies data
explicitly in Hilbert space. We illustrate these ideas with a feature map based on squeezing in a
continuous-variable system, and visualise the working principle with 2-dimensional mini-benchmark
datasets.

∗ maria@xanadu.ai

I. INTRODUCTION

The goal of many quantum algorithms is to perform efficient computations in a Hilbert space that grows rapidly with the size of a quantum system. ‘Efficient’ means that the number of operations applied to the system grows at most polynomially with the system size. An illustration is the famous quantum Fourier transform applied to an $n$-qubit system, which uses $O(\mathrm{poly}(n))$ operations to perform a discrete Fourier transform on $2^n$ amplitudes. In continuous-variable systems this is pushed to the extreme, as a single operation (for example, squeezing) applied to a mode formally manipulates a quantum state in an infinite-dimensional Hilbert space. In this sense, quantum computing can be understood as a technique to perform “implicit” computations in an intractably large Hilbert space through the efficient manipulation of a quantum system.

In machine learning, so-called kernel methods are a well-established field with a surprisingly similar logic. In a nutshell, the idea of kernel methods is to formally embed data into a higher- (and sometimes infinite-) dimensional feature space in which it becomes easier to analyse (see Figure 1). A popular example is a support vector machine that draws a decision boundary between two classes of datapoints by mapping the data into a feature space where it becomes linearly separable. The trick is that the algorithm never explicitly performs computations with vectors in feature space, but uses a so-called kernel function that is defined on the domain of the original input data. Just like quantum computing, kernel methods therefore perform implicit computations in a possibly intractably large Hilbert space through the efficient manipulation of data inputs.

Despite this apparent link, kernel methods have hardly been studied in the quantum machine learning literature, a field that (in the definition we employ here) investigates the use of quantum computing as a resource for machine learning. Across the approaches in this young field, which vary from sampling [1–5] to quantum optimisation [6, 7], linear algebra solvers [8–10] and using quantum circuits as trainable models for inference [11, 12], a lot of attention has been paid to recent trends in machine learning such as deep learning and neural networks. Kernel methods, which were most successful in the 1990s, are only mentioned in a few references [9, 13]. Besides a single study on the connection between coherent states and Gaussian kernels [14], their potential for quantum computing remains largely unexplored.

The aim of this paper is to investigate the relationship between feature maps, kernel methods and quantum computing. We interpret the process of encoding inputs into a quantum state as a feature map which maps data into a potentially vastly higher-dimensional feature space, the Hilbert space of the quantum system. Data can now be analysed in this ‘feature Hilbert space’, where simple classifiers such as linear models may gain enormous power. Furthermore, it is well known that the inner product of two data inputs that have been mapped into feature space gives rise to a kernel function that measures the distance between the data points. Kernel methods use these kernel functions to create models that have been very successful in pattern recognition. By switching between kernels one effectively switches between different models, which is known as the kernel trick. In the quantum case, the kernel trick corresponds to changing the data encoding strategy.

These two perspectives, namely of kernels on the one hand and feature spaces on the other hand, naturally lead to two ways of building quantum classifiers for supervised learning. The implicit approach takes a classical model that depends on a kernel function, but uses the quantum device to evaluate the kernel, which is computed as the inner products of quantum states in ‘feature Hilbert space’. The explicit approach uses the quantum device to directly learn a linear decision boundary in feature space by optimising a variational quantum circuit.

[Figure 1: two panels, ‘original space’ (left) and ‘feature space’ (right).]
FIG. 1. While in the original space of training inputs, data from the two classes ‘blue squares’ and ‘red circles’ are not separable by a simple linear model (left), we can map them to a higher dimensional feature space where a linear model is indeed sufficient to define a separating hyperplane that acts as a decision boundary (right).

A central result of this paper is that the idea of embedding data into a quantum Hilbert space opens up a promising avenue to quantum machine learning, in which we can generically use quantum devices for pattern recognition. The implicit and explicit approaches are not only hardware-independent, but also suitable for intermediate-term quantum technologies, which allows us to test them with the generation of quantum computers that is currently being developed. Nonlinear feature maps also circumvent the need to implement nonlinear transformations on amplitude-encoded data, and thereby solve an outstanding problem in quantum machine learning which we will come back to in the conclusion.

II. FEATURE MAPS, KERNELS AND QUANTUM COMPUTING

In machine learning we are typically given a dataset of inputs $\mathcal{D} = \{x^1, ..., x^M\}$ from a certain input set $\mathcal{X}$, and have to recognise patterns to evaluate or produce previously unseen data. Kernel methods use a distance measure $\kappa(x, x')$ between any two inputs $x, x' \in \mathcal{X}$ in order to construct models that capture the properties of a data distribution. This distance measure is connected to inner products in a certain space, the feature space. Besides many practical applications, the most famous being the support vector machine, these methods have a rich theoretical foundation [15] from which we want to highlight some relevant points.

A. Feature maps and kernels

Let us start with the definition of a feature map.

Definition 1. Let $\mathcal{F}$ be a Hilbert space, called the feature space, $\mathcal{X}$ an input set and $x$ a sample from the input set. A feature map is a map $\phi : \mathcal{X} \to \mathcal{F}$ from inputs to vectors in the Hilbert space. The vectors $\phi(x) \in \mathcal{F}$ are called feature vectors.

Feature maps play an important role in machine learning, since they map any type of input data into a space with a well-defined metric. This space is usually of much higher dimension. If the feature map is a nonlinear function it changes the relative position between data points (as in the example of Figure 1), and a dataset can become a lot easier to classify in feature space. Feature maps are intimately connected to kernels [16].

Definition 2. Let $\mathcal{X}$ be a nonempty set, called the input set. A function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is called a kernel if the Gram matrix $K$ with entries $K_{m,m'} = \kappa(x^m, x^{m'})$ is positive semidefinite, in other words, if for any finite subset $\{x^1, ..., x^M\} \subseteq \mathcal{X}$ with $M \geq 2$ and $c_1, ..., c_M \in \mathbb{C}$,

$$\sum_{m,m'=1}^{M} c_m c_{m'}^{*}\, \kappa(x^m, x^{m'}) \geq 0.$$

By definition of the inner product, every feature map gives rise to a kernel.

Theorem 1. Let $\phi : \mathcal{X} \to \mathcal{F}$ be a feature map. The inner product of two inputs mapped to feature space defines a kernel via

$$\kappa(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}, \tag{1}$$

where $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ is the inner product defined on $\mathcal{F}$.

Proof. We must show that the Gram matrix of this kernel is positive semidefinite. For arbitrary $c_m, c_{m'} \in \mathbb{C}$ and any $\{x^1, ..., x^M\} \subseteq \mathcal{X}$ with $M \geq 2$, we find that

$$\sum_{m,m'=1}^{M} c_m c_{m'}^{*}\, \kappa(x^m, x^{m'}) = \Big\langle \sum_{m} c_m \phi(x^m), \sum_{m'} c_{m'} \phi(x^{m'}) \Big\rangle = \Big\| \sum_{m} c_m \phi(x^m) \Big\|^2 \geq 0.$$

The connection between feature maps and kernels means that every feature map corresponds to a distance measure in input space by means of the inner product of feature vectors. It also means that we can compute inner products of vectors mapped to much higher dimensional spaces by computing a kernel function, which may be computationally a lot easier.
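As a small numerical illustration of Theorem 1, the following sketch compares a Gram matrix built from explicit feature vectors with one built from kernel evaluations alone. The degree-2 feature map, the function names and the random toy inputs are assumptions chosen for illustration and do not appear in the paper.

```python
import numpy as np

def feature_map(x):
    """Hypothetical degree-2 feature map phi: R^2 -> R^3 (illustrative choice)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def kernel(x, xp):
    """The matching kernel, evaluated directly in input space: (x . x')^2."""
    return float(np.dot(x, xp)) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # six toy inputs x^1, ..., x^6

# Gram matrix from explicit inner products of feature vectors, as in Eq. (1)
Phi = np.array([feature_map(x) for x in X])
gram_explicit = Phi @ Phi.T

# The same Gram matrix from kernel evaluations only (no feature vectors needed)
gram_kernel = np.array([[kernel(a, b) for b in X] for a in X])

# Positive semidefiniteness: all eigenvalues are non-negative up to round-off
min_eigenvalue = np.linalg.eigvalsh(gram_kernel).min()

print(np.allclose(gram_explicit, gram_kernel))
print(min_eigenvalue >= -1e-10)
```

The two matrices agree because $(x \cdot x')^2$ expands exactly into the inner product of the three-dimensional feature vectors, which is the sense in which a kernel evaluation replaces a computation in feature space.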

B. Reproducing kernel Hilbert spaces

Kernel theory goes further and defines a unique Hilbert space associated with each kernel, the reproducing kernel Hilbert space or RKHS [17, 18]. Although rather abstract, this concept is useful in order to understand the significance of kernels for machine learning, as well as their connection to linear models in feature space.

Definition 3. Let $\mathcal{X}$ be a non-empty input set and $\mathcal{R}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{C}$ that map inputs to the complex numbers. Let $\langle \cdot, \cdot \rangle$ be an inner product defined on $\mathcal{R}$ (which gives rise to a norm via $\|f\| = \sqrt{\langle f, f \rangle}$). $\mathcal{R}$ is a reproducing kernel Hilbert space if every point evaluation is a continuous functional $F : f \to f(x)$ for all $x \in \mathcal{X}$. This is equivalent to the condition that there exists a function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ for which

$$\langle f, \kappa(x, \cdot) \rangle = f(x) \tag{2}$$

with $\kappa(x, \cdot) \in \mathcal{R}$ and for all $f \in \mathcal{R}$, $x \in \mathcal{X}$.

The function $\kappa$ is the unique reproducing kernel of $\mathcal{R}$, and Eq. (2) is the reproducing property. Note that a different, but isometrically isomorphic Hilbert space can be derived for a so-called Mercer kernel [19].

Since a feature map gives rise to a kernel and a kernel gives rise to a reproducing kernel Hilbert space, we can construct a unique reproducing kernel Hilbert space for any given feature map (see Figure 2).

[Figure 2: diagram with the nodes ‘feature map’, ‘kernel’ and ‘RKHS’; the links are labelled Thm 1 (feature map and kernel), Def 3 (kernel and RKHS) and Thm 2 (feature map and RKHS).]
FIG. 2. Relationships between the concepts of a feature map, kernel and reproducing kernel Hilbert space.

Theorem 2. Let $\phi : \mathcal{X} \to \mathcal{F}$ be a feature map over an input set $\mathcal{X}$, giving rise to a complex kernel $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. The corresponding reproducing kernel Hilbert space has the form

$$\mathcal{R}_\kappa = \{ f : \mathcal{X} \to \mathbb{C} \;|\; f(x) = \langle w, \phi(x) \rangle_{\mathcal{F}},\ \forall x \in \mathcal{X},\ w \in \mathcal{F} \}. \tag{3}$$

The functions $\langle w, \cdot \rangle$ in the RKHS associated with feature map $\phi$ can be interpreted as linear models, for which $w \in \mathcal{F}$ defines a hyperplane in feature space.

In machine learning these rather formal concepts gain relevance because of the (no less formal) representer theorem [20]:

Theorem 3. Let $\mathcal{X}$ be an input set, $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a kernel, $\mathcal{D}$ a data set consisting of data pairs $(x^m, y^m) \in \mathcal{X} \times \mathbb{R}$ and $f : \mathcal{X} \to \mathbb{R}$ a class of model functions that live in the reproducing kernel Hilbert space $\mathcal{R}_\kappa$ of $\kappa$. Furthermore, assume we have a cost function $C$ that quantifies the quality of a model by comparing predicted outputs $f(x^m)$ with targets $y^m$, and which has a regularisation term of the form $g(\|f\|)$ where $g : [0, \infty) \to \mathbb{R}$ is a strictly monotonically increasing function. Then any function $f^* \in \mathcal{R}_\kappa$ that minimises the cost function $C$ can be written as

$$f^*(x) = \sum_{m=1}^{M} \alpha_m \kappa(x, x^m), \tag{4}$$

for some parameters $\alpha_m \in \mathbb{R}$.

The representer theorem implies that for a common family of machine learning optimisation problems over functions in an RKHS $\mathcal{R}$, the solution can be represented as an expansion of kernel functions as in Eq. (4). Consequently, instead of explicitly optimising over an infinite-dimensional RKHS we can directly start with the implicit ansatz of Eq. (4) and solve the convex optimisation problem of finding the parameters $\alpha_m$. The combination of Theorem 2 and Theorem 3 shows another facet of the link between kernels and feature maps. A model that defines a hyperplane in feature space can often be written as a model that depends on kernel evaluations. In Section III we will translate these two viewpoints into two ways of designing quantum machine learning algorithms.

C. Input encoding as a feature map

The immediate approach to combine quantum mechanics and the theory of kernels is to associate the Hilbert space of a quantum system with a reproducing kernel Hilbert space and find the reproducing kernel of the system. We show in Appendix A that for Hilbert spaces with discrete bases, as well as for the special ‘continuous-basis’ case of the Hilbert space of coherent states, the reproducing kernel is given by inner products of basis vectors. This insight can lead to interesting results. For example, Chatterjee et al. [14] show that the inner product of optical coherent states can be turned into a Gaussian kernel (also called radial basis function kernel) which is widely used in machine learning. However, to widen the framework we choose another route here. Instead of asking what kernel is associated with a quantum Hilbert space, we associate a quantum Hilbert space with a feature space and derive a kernel that is given by the inner product of quantum states. As seen in the previous section, this will automatically give rise to an RKHS, and the entire apparatus of kernel theory can be applied.

Assume we want to encode some input $x$ from an input set $\mathcal{X}$ into a quantum state that is described by a vector $|\phi(x)\rangle$ and which lives in Hilbert space $\mathcal{F}$. This procedure of ‘input encoding’ fulfills the definition of a feature map $\phi : \mathcal{X} \to \mathcal{F}$, which we call a quantum feature map here. According to Theorem 1 we can derive a kernel $\kappa$ from this feature map via Eq. (1). By virtue of Theorem 2, the kernel is the reproducing kernel of an RKHS $\mathcal{R}_\kappa$ as defined in Eq. (3). The functions in $\mathcal{R}_\kappa$ are the inner products of the ‘feature-mapped’ input data and a vector $|w\rangle \in \mathcal{F}$, which defines a linear model

$$f(x; w) = \langle w | \phi(x) \rangle. \tag{5}$$

Note that we use Dirac brackets $\langle \cdot | \cdot \rangle$ instead of the inner product $\langle \cdot, \cdot \rangle$ to signify that we are calculating inner products in a quantum Hilbert space. Finally, the representer theorem (Theorem 3) guarantees that the minimiser $\min_w C(w, \mathcal{D})$ of the empirical risk

$$C(w, \mathcal{D}) = \sum_{m=1}^{M} |f(x^m; w) - y^m|^2 + \|f\|_{\mathcal{R}_\kappa}$$

can be expressed by Equation (4). The simple idea of interpreting $x \to |\phi(x)\rangle$ as a feature map therefore allows us to make use of the rich theory of kernel methods and gives rise to machine learning models whose trained candidates can be expressed by inner products of quantum states. Note that if the state $|\phi(x)\rangle$ has complex amplitudes, we can always construct a real kernel by taking the absolute square of the inner product.

III. QUANTUM MACHINE LEARNING IN FEATURE HILBERT SPACE

Now let us enter the realm of quantum computing and quantum machine learning. We show how to use the ideas of Section II C to design two types of quantum machine learning algorithms and illustrate both approaches with an example from continuous-variable systems.

A. Feature-encoding circuits

From the perspective of quantum computing, a quantum feature map $x \to |\phi(x)\rangle$ corresponds to a state preparation circuit $U_\phi(x)$ that acts on a ground or vacuum state $|0...0\rangle$ of a Hilbert space $\mathcal{F}$ as $U_\phi(x)|0...0\rangle = |\phi(x)\rangle$. We will call $U_\phi(x)$ the feature-embedding circuit. The models from Eq. (5) in the reproducing kernel Hilbert space from Definition 3 are inner products between $|\phi(x)\rangle$ and a general quantum state $|w\rangle \in \mathcal{F}$. We therefore consider a second circuit $W$ with $W|0...0\rangle = |w\rangle$, which we call the model circuit. The model circuit specifies the hyperplane of a linear model in feature Hilbert space. If the feature state $|\phi(x)\rangle$ is orthogonal to $|w\rangle$, then $x$ lies on the decision boundary, whereas states with a positive [negative] inner product lie on the left [right] side of the hyperplane.

To show some examples of feature-embedding circuits and their associated kernels, let us have a look at popular input encoding techniques in quantum machine learning.

a. Basis encoding. Many quantum machine learning algorithms assume that the inputs $x$ to the computation are encoded as binary strings represented by a computational basis state of the qubits [12, 21]. For example, $x = 01001$ is represented by the 5-qubit basis state $|01001\rangle$. The computational basis state corresponds to a standard basis vector $|i\rangle$ (with $i$ being the integer representation of the bitstring) in a $2^n$-dimensional Hilbert space $\mathcal{F}$, and the effect of the feature-embedding circuit is given by

$$U_\phi : x \in \{0, 1\}^n \to |i\rangle.$$

This feature map maps each data input to a state from an orthonormal basis and is equivalent to the generic finite-dimensional case discussed in Appendix A. As shown there, the generic kernel is the Kronecker delta

$$\kappa(x, x') = \langle i | j \rangle = \delta_{ij},$$

which is a binary similarity measure that is only nonzero for two identical inputs.

b. Amplitude encoding. Another approach to information encoding is to associate normalised input vectors $x = (x_0, ..., x_{N-1})^T \in \mathbb{R}^N$ of dimension $N = 2^n$ with the amplitudes of an $n$-qubit state $|\psi_x\rangle$ [8, 13],

$$U_\phi : x \in \mathbb{R}^N \to |\psi_x\rangle = \sum_{i=0}^{N-1} x_i |i\rangle.$$

As above, $|i\rangle$ denotes the $i$'th computational basis state. This choice corresponds to the linear kernel,

$$\kappa(x, x') = \langle \psi_x | \psi_{x'} \rangle = x^T x'.$$

c. Copies of quantum states. With a slight variation of amplitude encoding we can implement polynomial kernels [9]. Taking $d$ copies of an amplitude encoded quantum state,

$$U_\phi : x \in \mathbb{R}^N \to |\psi_x\rangle \otimes \cdots \otimes |\psi_x\rangle,$$

corresponds to the kernel

$$\kappa(x, x') = \langle \psi_x | \psi_{x'} \rangle \cdots \langle \psi_x | \psi_{x'} \rangle = (x^T x')^d.$$

d. Product encoding. One can also use a (tensor) product encoding, in which each feature of the input $x = (x_1, ..., x_N)^T \in \mathbb{R}^N$ is encoded in the amplitudes of one separate qubit. An example is to encode $x_i$ as $|\phi(x_i)\rangle = \cos(x_i)|0\rangle + \sin(x_i)|1\rangle$ for $i = 1, ..., N$ [22, 23]. This corresponds to a feature-embedding circuit with the effect

$$U_\phi : x \in \mathbb{R}^N \to \begin{pmatrix} \cos x_1 \\ \sin x_1 \end{pmatrix} \otimes \cdots \otimes \begin{pmatrix} \cos x_N \\ \sin x_N \end{pmatrix} \in \mathbb{R}^{2^N},$$

and implies a cosine kernel,

$$\kappa(x, x') = \prod_{i=1}^{N} \cos(x_i - x'_i).$$
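The kernels associated with amplitude encoding, copies of states and product encoding can be checked numerically by building the encoded state vectors classically. The sketch below is only a small simulation under stated assumptions: the helper names, the two example inputs and the choice of $d = 3$ copies are hypothetical, and a quantum device would estimate the overlaps rather than compute them from full state vectors.

```python
import numpy as np
from functools import reduce

def amplitude_state(x):
    """Amplitude encoding: the normalised real vector is used as the state vector."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def copies_state(x, d):
    """d tensor copies of the amplitude-encoded state."""
    psi = amplitude_state(x)
    return reduce(np.kron, [psi] * d)

def product_state(x):
    """Product encoding: feature x_i -> cos(x_i)|0> + sin(x_i)|1> on its own qubit."""
    qubits = [np.array([np.cos(xi), np.sin(xi)]) for xi in x]
    return reduce(np.kron, qubits)

# Two toy inputs with N = 4 features (assumed values, for illustration only)
x = np.array([0.3, -1.2, 0.7, 0.5])
xp = np.array([1.1, 0.4, -0.6, 0.2])
xn, xpn = amplitude_state(x), amplitude_state(xp)

# Kernels obtained as overlaps of the encoded states
linear = amplitude_state(x) @ amplitude_state(xp)
poly = copies_state(x, 3) @ copies_state(xp, 3)
cosine = product_state(x) @ product_state(xp)

# Compare with the closed-form kernels stated in the text
print(np.isclose(linear, xn @ xpn))                  # linear kernel x^T x'
print(np.isclose(poly, (xn @ xpn) ** 3))             # polynomial kernel (x^T x')^d
print(np.isclose(cosine, np.prod(np.cos(x - xp))))   # cosine kernel
```

The checks rely on the fact that the inner product of two tensor-product states factorises into the product of the single-factor inner products, which is why copies yield powers of the linear kernel and the product encoding yields a product of cosines.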

prediction
1.0

model

quantum
device kernel( , ) 0.0
c = 1.0 c = 1.5 c = 2.0
implicit approach
FIG. 4. Shape of the squeezing kernel function κsq (x, x0 ) from
Equation (7) for different squeezing strength hyperparameters
quantum
device
prediction c. The input x is fixed at (0, 0) and x0 is varied. The plots
show the interval [−1, 1] on both horizontal axes.
explicit approach

new input training inputs the model circuit’s architecture defines the space of
possible models and can act as regularisation (see also
FIG. 3. Illustration of the two approaches to use quantum fea- [22]). Below we will follow a slightly more general
ture maps for supervised learning. The implicit approach uses strategy and compute a state W (θ)Uφ |0...0i, from which
the quantum device to evaluate the kernel function as part of measurements determine the output of the model.
a hybrid or quantum-assisted model which can be trained by Depending on the measurement, this is not necessarily a
classical methods. In the explicit approach, the model is solely linear model in feature Hilbert space. We could even go
computed by the quantum device, which consists of a varia- further and include postselection in the model circuit,
tional circuit trained by hybrid quantum-classical methods. which might give the classifier in feature Hilbert space
even more power.
B. Building a quantum classifier
Using quantum computers for learning tasks with
these two approaches is desirable in various settings.
Having formulated the ideas from Section II C in the For example, the implicit approach may be interesting
language of quantum computing, we can identify two in cases where the quantum device evaluates kernels
different strategies of designing a quantum machine or models faster in terms of absolute runtime speed.
learning algorithm (see Figure 3). On the one hand, we Another interesting example is a setting in which the
can use the quantum computer to estimate the inner kernel one wants to use is classically intractable because
products κ(x, x0 ) = hφ(x)|φ(x0 )i from a kernel-dependent the runtime grows exponentially or even faster with
model as in Eq. (4), which we call the implicit approach, the input dimension. The explicit approach may be
since we use the quantum system to estimate distance useful when we want to leave the limits of the RKHS
measures on input space. This strategy requires a quan- framework and construct classifiers directly on Hilbert
tum computer that can do two things: to implement space.
Uφ (x) for any x ∈ X and to estimate inner products
between quantum states (for example using a SWAP In the remainder of this work we want to explore these
test routine). The computation of the model from those two approaches with several examples. We use squeez-
kernel estimates, as well as the training algorithm is left ing in continuous-variable quantum systems as a fea-
to a classical device. This is an excellent strategy in the ture map, for which the Hilbert space F is an infinite-
context of intermediate-term quantum technologies [24], dimensional Fock space. This constructs a squeezing-
where we are interested in using a quantum computer based quantum machine learning classifier which can for
only for small routines of limited gate count, and example be implemented by optical quantum computers.
compute as much as possible on the classical hardware.
Note that in the long term, quantum computers could
also be used to learn the parameters αm by computing C. Squeezing as a feature map
the inverse of the kernel Gram matrix, which has been
investigated in Refs. [9, 25].
A squeezed vacuum state is defined as
On the other hand, and as motivated in the in- ∞ p
1 X (2n)!
troduction, one can bypass the representer theorem |zi = p (−eiϕ tanh(r))n |2ni,
and explicitly perform the classification in the ‘feature cosh(r) n=0 2n n!
Hilbert space’ of the quantum system. We call this
the explicit approach. For example, this can mean to where {|ni} denotes the Fock basis and z = reiϕ is the
find a |wi that defines a model 5. To do so, we can complex squeezing factor with absolute value r and phase
make the model circuit trainable, W = W (θ), so that ϕ. It will be useful to introduce the notation |zi =
quantum-classical hybrid training [23, 26] of θ can learn |(r, ϕ)i. We can interpret x → |φ(x)i = |(c, x)i as a fea-
the optimal model |w(θ)i = W (θ)|0i. The ansatz for ture map from a one-dimensional real input space x ∈ R
6

c = 1.5 c = 1.5 c = 1.5 epoch 1 epoch 500 epoch 5000

c = 1.0 c = 1.5 c = 2.0


train 0 test 0 train 1 test 1

FIG. 6. Decision boundary of a perceptron classifier in Fock


space after mapping the 2-dimensional data points via the
squeezing feature map with phase encoding from Eq. (6)
(with c = 1.5). The perceptron only acts on the real sub-
space and without regularisation. The ‘blobs’ dataset has
train 0 test 0 train 1 test 1
now only 70 training and 20 test samples. The perceptron
achieves a training accuracy of 1 after less than 5000 epochs,
FIG. 5. Decision boundary of a support vector machine with
which means that the data is linearly separable in Fock space.
the custom kernel from Eq. (7). The shaded areas show the
Interestingly, in this example the test performance remains
decision regions for Class 0 (blue) and Class 1 (red), and each
exactly the same. The simulations were performed with the
plot shows the rate of correct classifications on the training
Strawberry Fields simulator as well as a scikit-learn out-of-
set/test set. The first row plots three standard 2-dimensional
the-box perceptron classifier.
datasets: ‘circles’, ‘moons’ and ‘blobs’, each with 150 test and
50 training samples. The second row illustrates that increas-
ing the squeezing hyperparameter c changes the classification
performance. Here we use a dataset of 500 training and 100 features in the absolute value of the squeezing and de-
test samples. Training was performed with python’s scikit- fine a squeezing feature map with absolute value encoding,
learn SVC classifier using a custom kernel which implements x → |φ(x)i = |(x, c)i. However, in this version we cannot
the overlap of Eq. (8). vary the variance of the kernel function, which is why we
use the phase encoding in the following investiagtions.

to the Hilbert space of Fock states, in short, the Fock


space. Here, c is a constant hyperparameter that deter-
D. An implicit quantum-assisted classifier
mines the strength of the squeezing, and x is associated
with the phase. Moreover, when given multi-dimensional
inputs in a dataset of vectors x = (x1 , ..., xN )T ∈ RN , we In the implicit approach, we evaluate the kernel in Eq.
can define the joint state of N squeezed vacuum modes, (7) with a quantum computer and feed it into a classical
kernel method. Instead of using a real quantum device,
φ : x → |(c, x)i, (6) we exploit the fact that, in the case of squeezing, the
kernel can be efficiently computed classically, and use it
with as a custom kernel in a support vector machine. Figure
5 shows that such a model easily learns the decision
|(c, x)i = |(c, x1 )i ⊗ . . . ⊗ |(c, xN )i ∈ F, boundary of 2-dimensional mini-benchmark datasets.

as a feature map, where F is now a multimode Fock Since the idea of a support vector machine is to find
space. We call this feature map the squeezing feature the maximum-margin hyperplane in feature space, we
map with phase encoding. want to know whether we can always find a hyperplane
for which the training accuracy is 1. In other words,
The kernel we ask if the data becomes linearly separable in Fock
N
space by the squeezing feature map. An easy way to
Y do this is to apply a perceptron classifier to the data in
κ(x, x0 ; c) = h(c, xi )|(c, x0i )i (7)
feature space. The perceptron is guaranteed to find such
i=1
a separating hyperplane if it exists. Figure 6 shows the
with performance of a perceptron classifier in the Fock space
r for the ‘blobs’ data. The data was mapped to this space
sech c sech c by the squeezing feature map with phase encoding. As
h(c, xi )|(c, x0i )i = 0 , (8) one can see, after 5000 epochs (runs through the dataset)
1 − ei(xi −xi ) tanh c tanh c
the decision boundary perfectly fits the training data,
derived from this feature map [27] is easy to compute on achieving an accuracy of 1. The number of iterations to
a classical computer. It is plotted in Figure 4, where we train the perceptron is known to increase with O(1/γ 2 )
see that the hyperparameter c determines the variance where γ is the margin between the two classes [28], and
of the kernel function. Note that we can also encode indeed we find in other simulations that the ‘moons’ and
7

‘circles’ data only take a few epochs until reaching full a.)
accuracy. Although the perfect fit to the training data is
feature map model
of course not useful for machine learning (as can be seen circuit W (θ)
circuit
by the non-increasing accuracy on the test set) these
results are a clue to the fact that the squeezing feature
map makes data linearly separable in feature space, a x1 o0 p(y = 0)
fact that we prove in Appendix B.
x2 o1 p(y = 1)
While the results of the simulations are promising,
a goal is to find more sophisticated kernels. Although X Y
quantum computers could offer constant speed advan- ..
.
tages, they become indispensable if the feature map cir- F
cuit is classically intractable. However, squeezed states
are an example of so-called Gaussian states, and it is b.)
well known that Gaussian states (although living in an
infinite-dimensional Hilbert space) can be efficiently sim- |(c, x1 )i p(n1 )
W (θ)
ulated by a classical computer [29], which we used in the |(c, x2 )i p(n2 )
simulations. In order to do something more interesting,
one needs non-Gaussian elements to the circuit. For ex- c.)
ample, one can extend a standard linear optical network
of beamsplitters by a cubic phase gate [30, 31] or use pho- D(θ3 ) P (θ5 ) V (θ7 )
ton number measurements [32]. To this end, let Vφ (x) BS(θ1 , θ2 )
be a non-Gaussian feature map circuit, i.e. a quantum D(θ4 ) P (θ6 ) V (θ8 )
algorithm that takes a vacuum state and prepares an x-
dependent non-Gaussian state. The kernel
FIG. 7. a.) Representation of the Fock-space-classifier in
the graphical language of quantum neural networks. A vector
0
κ(x, x ) = h0...0|Vφ† (x)Vφ (x0 )|0...0i (x1 , x2 )T from the input space X gets mapped into the feature
space F which is the infinite-dimensional 2-mode Fock space
can in general not be simulated by a classical computer of the quantum system. The model circuit, including photon
any more. It is therefore an interesting open question detection measurement, implements a linear model in feature
what type of feature map circuits Vφ are classically in- space and reduces the “infinite hidden layer” to two outputs.
tractable, but at the same time lead to powerful kernels b.) The model circuit of the explicit classifier described in the
for classical models such as support vector machines. text uses only 2 modes to instantiate this infinite-dimensional
hidden layer. The variational circuit W (θ) consists of repe-
titions of a gate block. We use the gate block shown in c.)
E. An explicit quantum classifier with the beamsplitter (BS), displacement (D), quadratic (P)
and cubic phase gates (C) described in the text.
In the explicit approach defined above, we use a
parametrised continuous-variable circuit W (θ) to build
a “Fock-space” classifier. For our squeezing example this higher probability. We can interpret this circuit in the
can be done as follows. We start with two vacuum modes graphical representation of neural networks as shown at
|0i ⊗ |0i. To classify a data input x, first map the input the top in Figure 7.
to a quantum state |c, xi = |c, x1 i ⊗ |c, x2 i by performing
a squeezing operation on each of the modes. Second, ap- Let us assume we could represent any possible quan-
ply the model circuit W (θ) to |c, xi. Third, interpret the tum circuit in the feature Hilbert space with the circuit
probability p(n1 , n2 ) of measuring a certain Fock state W (θ). Since the data in F is linearly separable, there is
|n1 , n2 i as the output of the machine learning model. a W for which we obtain 100% accuracy on the training
Since this probability depends on the displacement and set, as we saw in Figure 6. However, the goal of machine
squeezing intensity, it is better to define two probabili- learning is not to perfectly fit data, but to generalise from
ties, say p(n1 = 2, n2 = 0) and p(n1 = 0, n2 = 2), as it. It is therefore not desirable to find the optimal deci-
a one-hot encoded output vector (o0 , o1 ). This output sion boundary for the training data in F, but to find a
vector can be normalised [33] to a new vector good candidate from a class of decision boundaries that
    captures the structure in the data well. Such a restricted
1 o0 p(y = 0) class of decision boundaries can be defined by using an
= , ansatz for the model circuit W (θ) which cannot represent
o 0 + o 1 o1 p(y = 1)
any circuit, yet still flexible enough to reach interesting
where p(y = 0), p(y = 1) can now be interpreted as the candidates. Figure 7 c.) shows such a model circuit for
probability for the model to predict class y = 0 and the 2 input modes in our continuous-variable example.
y = 1, respectively. The final label is the class with the The architecture consists of repetitions of a general gate
block. We denote by $\hat{a}_{1,2}$, $\hat{a}^{\dagger}_{1,2}$ the annihilation and creation operators of mode 1 and 2, and with $\hat{x}_{1,2}$, $\hat{p}_{1,2}$ the corresponding quadrature operators (see [34]). After an entangling beam splitter gate,

\[
BS(u, v) = e^{u \left( e^{iv} \hat{a}_1^{\dagger} \hat{a}_2 - e^{-iv} \hat{a}_1 \hat{a}_2^{\dagger} \right)},
\]

with $u, v \in \mathbb{R}$, the circuit consists of single-mode gates that are first, second and third order in the quadratures. The first-order gate is implemented by a displacement gate

\[
D(z) = e^{\sqrt{2} i \left( \mathrm{Im}(z) \hat{x} - \mathrm{Re}(z) \hat{p} \right)},
\]

with the complex displacement factor $z$. We use a quadratic phase gate for the second order,

\[
P(u) = e^{i \frac{u}{2} \hat{x}^2},
\]

and a cubic phase gate for the third order operator,

\[
V(u) = e^{i \frac{u}{3} \hat{x}^3}.
\]

We can in principle construct any continuous-variable quantum circuit from this gate set. This basic circuit block can easily be generalised to circuits of more modes by replacing the single beam splitter by a full optical network of beam splitters [35].

To show that the Fock space classifier works, we plot the decision boundary for the ‘moons’ data in Figure 8, using 4 repetitions of the gate block from Figure 7 c.) and 32 parameters in total. The training loss shows that after about 200 iterations of a stochastic gradient descent algorithm, the loss converges to almost zero.

FIG. 8. Fock space classifier presented in Figure 7 and the text for the ‘moons’ dataset. The shaded areas show the probability $p(y = 1)$ of predicting class 1. The dataset consists of 150 training and 50 test samples, and the model has been trained for 5000 steps with stochastic gradient descent of batch-size 5, an adaptive learning rate and a square-loss cost function with a gentle $l_2$ regularisation applied to all weights. The loss drops predominantly in the first 200 steps (left). [Figure panels: loss versus iterations; decision regions with a ‘detail’ inset and legend entries ‘train 0’, ‘train 1’, ‘test 0’, ‘test 1’.]

IV. CONCLUSION

In this paper we introduced a number of new ideas for the area of quantum machine learning based on the theory of feature spaces and kernels. Interpreting the encoding of inputs into quantum states as a feature map, we associate a quantum Hilbert space with a feature space. Inner products of quantum states in this feature space can be used to evaluate a kernel function. We can alternatively train a variational quantum circuit as an explicit classifier in feature space to learn a decision boundary. We introduced a squeezing feature map as an example and motivated with small-scale simulations that these two approaches can lead to interesting results.

From this work there are many further avenues of research. For example, we raised the question whether there are interesting kernel functions that can be computed by estimating the inner products of quantum states, for which state preparation is classically intractable. Further open questions are the details in the design and training of variational circuits, and how learning algorithms can be tailor-made for the use in hybrid training schemes. This is a topic that has just begun to be investigated by the quantum machine learning community [1, 12].

Last but not least, we want to come back to a point we raised in the introduction. In quantum machine learning, a lot of models use amplitude encoding, which means that a data vector is represented by the amplitudes of a quantum state. Especially when trying to reproduce neural network-like dynamics one would like to perform nonlinear transformations on the data. But while linear transformations are natural for quantum theory, nonlinearities are difficult to design in this context. Interesting workarounds based on postselection or repeat-until-success circuits were proposed in [23, 36], but at the considerable cost of making the circuit non-deterministic, and with a probability of failure that grows with the size of the architecture. The feature map approach ‘outsources’ the nonlinearity into the procedure of encoding inputs into a quantum state and therefore offers an elegant solution to the problem of nonlinearities in amplitude encoding.
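As a small numerical illustration of evaluating a kernel from inner products of squeezed states, the following Python sketch computes the overlap of two single-mode squeezed vacuum states with phase encoding. It is only a sketch under stated assumptions: it uses the Fock-basis amplitudes written out in Appendix D, truncated at a cutoff dimension, and the function names, the value $c = 1$ and the cutoff are illustrative choices rather than settings taken from the simulations in this paper. The closed form used for comparison follows from the binomial series $\sum_{n} \binom{2n}{n} (w/4)^n = (1 - w)^{-1/2}$ for $|w| < 1$.

```python
import numpy as np
from math import lgamma, log


def squeezed_amplitudes(c, phi, cutoff=60):
    """Amplitudes of the squeezed vacuum |(c, phi)> on the even Fock states |2n>,
    n = 0, ..., cutoff - 1, following the expression for M_jn in Appendix D.
    (Hypothetical helper; the truncation is an assumption of this sketch.)"""
    n = np.arange(cutoff)
    # log of sqrt((2n)!) / (2^n n!), evaluated via lgamma to avoid overflow
    log_coeff = np.array(
        [0.5 * lgamma(2 * k + 1) - k * log(2.0) - lgamma(k + 1) for k in range(cutoff)]
    )
    return np.exp(log_coeff) * (-np.exp(1j * phi) * np.tanh(c)) ** n / np.sqrt(np.cosh(c))


def squeezing_kernel(x, x_prime, c=1.0, cutoff=60):
    """Inner product <(c, x)|(c, x')> from the truncated Fock amplitudes."""
    a = squeezed_amplitudes(c, x, cutoff)
    b = squeezed_amplitudes(c, x_prime, cutoff)
    return np.vdot(a, b)  # np.vdot conjugates its first argument


def squeezing_kernel_closed_form(x, x_prime, c=1.0):
    """Closed form of the same overlap, obtained from the binomial series."""
    w = np.tanh(c) ** 2 * np.exp(1j * (x_prime - x))
    return 1.0 / (np.cosh(c) * np.sqrt(1.0 - w))


if __name__ == "__main__":
    # The kernel depends nonlinearly on the difference of the two inputs.
    for delta in (0.0, 0.5, 1.0, 2.0):
        print(delta, abs(squeezing_kernel(0.0, delta)))
```

The nonlinearity in the inputs enters entirely through the state preparation; the inner product itself is an ordinary linear-algebra operation on the amplitudes.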

[1] G. Verdon, M. Broughton, and J. Biamonte, arXiv preprint arXiv:1712.05304 (2017).
[2] M. H. Amin, Physical Review A 92, 1 (2015).
[3] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, arXiv preprint arXiv:1609.02542 (2016).
[4] G. H. Low, T. J. Yoder, and I. L. Chuang, Physical Review A 89, 062315 (2014).
[5] P. Wittek and C. Gogolin, Scientific Reports 7 (2017).
[6] V. Denchev, N. Ding, H. Neven, and S. Vishwanathan, in Proceedings of the 29th International Conference on Machine Learning (ICML-12) (2012) pp. 863–870.
[7] B. O'Gorman, R. Babbush, A. Perdomo-Ortiz, A. Aspuru-Guzik, and V. Smelyanskiy, The European Physical Journal Special Topics 224, 163 (2015).
[8] N. Wiebe, D. Braun, and S. Lloyd, Physical Review Letters 109, 050505 (2012).
[9] P. Rebentrost, M. Mohseni, and S. Lloyd, Physical Review Letters 113, 130503 (2014).
[10] M. Schuld, I. Sinayskiy, and F. Petruccione, Physical Review A 94, 022342 (2016).
[11] K. H. Wan, O. Dahlsten, H. Kristjánsson, R. Gardner, and M. Kim, npj Quantum Information 3, 36 (2017).
[12] E. Farhi and H. Neven, arXiv preprint arXiv:1802.06002 (2018).
[13] M. Schuld, M. Fingerhuth, and F. Petruccione, Europhysics Letters 119, 60002 (2017).
[14] R. Chatterjee and T. Yu, Quantum Information and Communication 17, 1292 (2017).
[15] B. Schölkopf and A. J. Smola, Learning with kernels: Support vector machines, regularization, optimization, and beyond (MIT Press, 2002).
[16] C. Berg, J. P. R. Christensen, and P. Ressel, Harmonic analysis on semigroups (Springer-Verlag, 1984).
[17] T. Hofmann, B. Schölkopf, and A. J. Smola, The Annals of Statistics, 1171 (2008).
[18] N. Aronszajn, Transactions of the American Mathematical Society 68, 337 (1950).
[19] B. J Mercer, Phil. Trans. R. Soc. Lond. A 209, 415 (1909).
[20] B. Schölkopf, R. Herbrich, and A. Smola, in Computational learning theory (Springer, 2001) pp. 416–426.
[21] S. Wang, Journal of Mathematics Research 7, 175 (2015).
[22] E. Stoudenmire and D. J. Schwab, in Advances In Neural Information Processing Systems (2016) pp. 4799–4807.
[23] G. G. Guerreschi and M. Smelyanskiy, arXiv preprint arXiv:1701.01450 (2017).
[24] J. Preskill, arXiv preprint arXiv:1801.00862 (2018).
[25] M. Schuld, I. Sinayskiy, and F. Petruccione, Physical Review A 94, 022342 (2016).
[26] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New Journal of Physics 18, 023023 (2016).
[27] S. M. Barnett and P. M. Radmore, Methods in theoretical quantum optics, Vol. 15 (Oxford University Press, 2002).
[28] A. B. Novikoff, On convergence proofs for perceptrons, Tech. Rep. (Stanford Research Institute, 1963).
[29] S. D. Bartlett, B. C. Sanders, S. L. Braunstein, and K. Nemoto, in Quantum Information with Continuous Variables (Springer, 2002) pp. 47–55.
[30] D. Gottesman, A. Kitaev, and J. Preskill, Physical Review A 64, 012310 (2001).
[31] S. Lloyd and S. L. Braunstein, in Quantum Information with Continuous Variables (Springer, 1999) pp. 9–17.
[32] S. D. Bartlett and B. C. Sanders, Physical Review A 65, 042304 (2002).
[33] In contrast to a standard technique in machine learning, it is not advisable to use a softmax layer for this purpose, since $o_0$, $o_1$ can be very small, which leads to almost uniform probabilities.
[34] C. Weedbrook, S. Pirandola, R. García-Patrón, N. J. Cerf, T. C. Ralph, J. H. Shapiro, and S. Lloyd, Reviews of Modern Physics 84, 621 (2012).
[35] F. Flamini, N. Spagnolo, N. Viggianiello, A. Crespi, R. Osellame, and F. Sciarrino, Scientific Reports 7, 15133 (2017).
[36] N. Wiebe and C. Granade, arXiv preprint arXiv:1512.03145 (2015).
[37] A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics (Springer Science & Business Media, 2011).
[38] T. Griffiths and A. Yuille, The probabilistic mind: Prospects for Bayesian cognitive science, 33 (2008).
[39] R. D. la Madrid, European Journal of Physics 26, 287 (2005).
[40] J. R. Klauder and B.-S. Skagerstam, Coherent states: applications in physics and mathematical physics (World Scientific, 1985).
[41] L. Hogben, Handbook of linear algebra (CRC Press, 2006).

Appendix A: Reproducing kernels of quantum systems

In this section of the appendix we will try to find an answer to the question of which reproducing kernels the Hilbert space of generic quantum systems gives rise to. Quantum theory prescribes that the state of a quantum system is modelled by a vector in a Hilbert space $\mathcal{H}_s$. In a typical setting, the Hilbert space is constructed from a complete basis of eigenvectors $\{|s\rangle\}$ of a complete set of commuting Hermitian operators which correspond to physical observables. Due to the hermiticity of the observables, the basis is orthogonal, and it can be continuous (e.g., if the observable is the position operator describing the location of a particle), countably infinite (e.g., observing the number of photons in an electric field), or finite (e.g., observing the spin of an electron). Vectors in the Hilbert space are abstractly referred to as $|\psi\rangle \in \mathcal{H}$ in Dirac notation. However, every such Hilbert space has a functional representation. In the case of a discrete basis of dimension $N \in \mathbb{N} \cup \{\infty\}$, the functional representation $\mathcal{H}_s^f$ of $\mathcal{H}_s$ is given by the (Hilbert) space $l^2$ of square summable sequences $\{\psi(s_i) = \langle s_i|\psi\rangle\}_{i=1}^{N}$ with the inner product $\langle \psi, \varphi \rangle = \sum_{s_i} \psi(s_i)^* \varphi(s_i)$. In the continuous case this is the space $L^2$ of square integrable (equivalence classes of) functions $\psi(s) = \langle s|\psi\rangle$ with the inner product $\langle \psi, \varphi \rangle = \int ds\, \psi(s)^* \varphi(s)$. The preceding formulation of quantum theory therefore associates every quantum system with a Hilbert space of functions mapping from a set $S = \{s\}$ to the complex numbers. The question is if these Hilbert spaces give rise to a reproducing kernel that makes them an RKHS with respect to the input set $S$.

With the resolution of identity $\mathbb{1} = \int ds\, |s\rangle\langle s|$ for the continuous and $\mathbb{1} = \sum_i |s_i\rangle\langle s_i|$ for the discrete case, we can immediately “create” the reproducing property from
Eq. (2). Consider first the discrete case:

\[
\psi(s_i) = \langle s_i|\psi\rangle = \sum_{s_j} \langle s_i|s_j\rangle \langle s_j|\psi\rangle = \langle \langle s_i|\cdot\rangle, \psi(\cdot) \rangle.
\]

We can identify $\langle s_i|s_j\rangle$ with the reproducing kernel. Since the basis is orthonormal, we have $\kappa(s_i, s_j) = \delta_{i,j}$. The continuous case is more subtle. Inserting the identity, we get

\[
\psi(s) = \int ds'\, \langle s|s'\rangle \langle s'|\psi\rangle = \langle \langle s|\cdot\rangle, \psi(\cdot) \rangle,
\]

which is the reproducing kernel property with the reproducing kernel $\kappa(s, s') = \langle s|s'\rangle$. However, the “function” $s'(s) = \delta(s - s')$ is not square integrable, which means it is itself not part of $\mathcal{H}_s^f$, and the properties of Definition 3 are not fulfilled. This is no surprise, as the space of square integrable functions $L^2$ is a frequent example of a Hilbert space that is not an RKHS [37]. The inconsistency between Dirac’s formalism and functional analysis is also a well-known issue in quantum theory, but usually glossed over in physical contexts [38]. If mathematical rigour is needed, physicists usually refer to the theory of rigged Hilbert spaces [39].

There are quantum systems with an infinite basis which naturally give rise to a reproducing kernel that is not the delta function. These systems are described by so-called generalised coherent states [40]. In the context of quantum machine learning, this has been discussed in Ref. [14]. Generalised coherent states are vectors $|l\rangle$ in a Hilbert space $\mathcal{H}_c$ of finite or countably infinite dimension, where the index $l$ is from some topological space $\mathcal{L}$ (allowing us to define a norm $\| |l\rangle \| = \sqrt{\langle l|l\rangle}$). They have two fundamental properties. First, $|l\rangle$ is a strongly continuous function of $l$,

\[
\lim_{l \to l'} \| |l'\rangle - |l\rangle \| = 0, \qquad |l\rangle \neq 0.
\]

Note that this excludes for example the discrete Fock basis $\{|n\rangle\}$, but also any orthonormal set of states $\{|z\rangle\}$ with a continuous label $z \in \mathbb{C}$, since $\frac{1}{2} \| |z'\rangle - |z\rangle \|^2 = 1$ for $z' \neq z$. Second, there exists a measure $\mu$ on $\mathcal{L}$ so that we have a resolution of identity $\mathbb{1} = \int_{\mathcal{L}} |l\rangle\langle l| \, d\mu(l)$. This leads to a functional representation of the Hilbert space where a vector $|\psi\rangle \in \mathcal{H}_c$ is expressed via $|\psi\rangle = \sum_l \psi(l) |l\rangle$ with $\psi(l) = \langle l|\psi\rangle$. Inserting the resolution of identity to the right hand side of this expression yields

\[
\psi(l) = \int_{\mathcal{L}} \langle l|l'\rangle \langle l'|\psi\rangle \, d\mu(l'),
\]

which is exactly the reproducing property in Definition 3 with the reproducing kernel $\kappa(l, l') = \langle l|l'\rangle$. Since there is a finite overlap between any two states from the basis, the kernel is not the Dirac delta function, and we do not run into the same problem as for continuous orthogonal bases. Hence, the Hilbert space of coherent states is an RKHS for the input set $\{l\}$.

The most well-known type of coherent state are optical coherent states

\[
|\alpha\rangle = e^{-\frac{|\alpha|^2}{2}} \sum_{n=0}^{\infty} \frac{\alpha^n}{\sqrt{n!}} |n\rangle,
\]

which are the eigenstates of the non-Hermitian bosonic annihilation operator $\hat{a}$, with the associated kernel

\[
\kappa(\alpha, \beta) = \langle \alpha|\beta\rangle = e^{-\left( \frac{|\alpha|^2}{2} + \frac{|\beta|^2}{2} - \alpha^* \beta \right)}, \tag{A1}
\]

whose squared modulus is a radial basis function or Gaussian kernel as remarked in [14].

Appendix B: Linear separability in Fock space

If we map the inputs of a dataset $\mathcal{D}$ to a new dataset

\[
\mathcal{D}' = \{ |(c, x^1)\rangle, \ldots, |(c, x^M)\rangle \},
\]

using the squeezing feature map with phase encoding from Eq. (6), the feature mapped data vectors in $\mathcal{D}'$ are always linearly separable, which means any assignment of two classes of labels to the data can be separated by a hyperplane in $\mathcal{F}$ (see Figure 1). To show this, first consider the following:

Proposition 1. A set of $M$ vectors in $\mathbb{R}^N$ is linearly separable if $M - 1$ of them are linearly independent.

The proof can be found in Appendix C. Proposition 1 tells us that if our data is linearly independent, it is linearly separable. This result is in fact known from statistical learning theory: The VC dimension (a measure of flexibility or expressive power) of linear models in $K$ dimensions is $K + 1$, which means that a linear model can separate or “shatter” $K + 1$ data points if we can choose the strategy of how to arrange them, but not the strategy of how they are labelled.

If we can show that the squeezing feature map maps vectors to linearly independent states in Fock space, we know that any dataset becomes linearly separable in Fock space. To simplify, let us first look at the squeezing map of a single mode.

Proposition 2. Given a set of squeezing phases $\{\varphi^1, \ldots, \varphi^M\}$ with $\varphi^m \neq \varphi^{m'}$ for $m, m' = 1, \ldots, M$, $m \neq m'$, and a hyperparameter $c \in \mathbb{R}$, the squeezed vacuum Fock states $|(c, \varphi^1)\rangle, \ldots, |(c, \varphi^M)\rangle$ are linearly independent.

The proof is found in Appendix D. A very similar proof confirms that the proposition also holds true for the squeezing map with absolute value encoding described in Section III C. Symbolic computation of the rank of the design matrix in feature space in Mathematica confirms this result for randomly selected squeezing factors up
to $M = 10$ and a cutoff dimension that truncates Fock space to 40 dimensions.

For the multimode feature map dealing with input data of dimension higher than one, consider the states

\[
|(c, \varphi^m)\rangle = |(c, \varphi^m_1)\rangle \otimes \ldots \otimes |(c, \varphi^m_N)\rangle,
\]

and

\[
|(c, \varphi^{m'})\rangle = |(c, \varphi^{m'}_1)\rangle \otimes \ldots \otimes |(c, \varphi^{m'}_N)\rangle.
\]

We have

\[
\langle (c, \varphi^m)|(c, \varphi^{m'})\rangle = \prod_{i=1}^{N} \langle (c, \varphi^m_i)|(c, \varphi^{m'}_i)\rangle,
\]

which is 1 if $\varphi^m_i = \varphi^{m'}_i$ for all $i = 1, \ldots, N$ and a value other than zero else. The linear independence therefore carries over to multi-dimensional feature maps.

Appendix C: Proof of Proposition 1

Let

\[
\mathcal{D} = \{ (x^1, y^1), \ldots, (x^M, y^M) \}
\]

be a dataset of $M$ vectors with $x^m \in \mathbb{R}^N$ for all $m = 1, \ldots, M$, and $y^m \in \{-1, 1\}$. The vectors are guaranteed to be linearly separable if for any assignment of classes $\{-1, 1\}$ to labels $y^1, \ldots, y^M$ there is a hyperplane defined by parameters $w_1, \ldots, w_N, b$ so that

\[
\mathrm{sgn}\left( \sum_{i=1}^{N} w_i x^m_i + b \right) = y^m \quad \forall m = 1, \ldots, M. \tag{C1}
\]

The sign function is a bit tricky, but if we can instead show that the stronger condition

\[
\sum_{i=1}^{N} w_i x^m_i + b = y^m \quad \forall m = 1, \ldots, M \tag{C2}
\]

holds for some parameters, Eq. C1 must automatically be satisfied.

Equation C2 defines a system of $M$ linear equations with $N + 1$ unknowns (namely the variables $w_1, \ldots, w_N, b$). From the theory of linear algebra we know [41] that there is at least one solution if and only if the rank of the ‘coefficient matrix’

\[
[X|1] = \begin{pmatrix} x^1_1 & \cdots & x^1_N & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x^M_1 & \cdots & x^M_N & 1 \end{pmatrix}
\]

is equal to the rank of its augmented matrix

\[
[X|1|y] = \begin{pmatrix} x^1_1 & \cdots & x^1_N & 1 & y^1 \\ \vdots & \ddots & \vdots & \vdots & \vdots \\ x^M_1 & \cdots & x^M_N & 1 & y^M \end{pmatrix}.
\]

Remember that the rank of a matrix is the number of linearly independent row (and column) vectors.

If the data vectors are all linearly independent we have that $N \geq M$ (if $N < M$ there would be some vectors that depend on others, because we have more vectors than dimensions), and the rank of $X$ is $\min(M, N) = M$. Augmenting $X$ by stacking any number of column vectors simply increases $N$, which means that it does not change the rank of the matrix. It follows that for $M$ linearly independent data points embedded in an $N$ dimensional space the system has a solution. The data is therefore linearly separable.

With this argument we can add more vectors that are linearly dependent until $M = N$. After this, we can in fact add one (but only one) more data point that linearly depends on the others, and still guarantee linear separability. That is because adding one data point makes the row number equal to the column number in $[X|1]$, and adding more columns does not change the rank. In contrast, adding two data points means that we have more rows than columns in $[X|1]$, and adding the column for $[X|1|y]$ can indeed change the rank.

Appendix D: Proof of Proposition 2

Let’s consider a matrix $M$ where the squeezed states in Fock basis form the rows:

\[
M_{jn} := \frac{1}{\sqrt{\cosh(r_j)}} \frac{\sqrt{(2n)!}}{2^n n!} \left( -e^{i\varphi_j} \tanh(r_j) \right)^n.
\]

We introduce two auxiliary diagonal matrices:

\[
D_1 := \mathrm{diag}\left\{ \sqrt{\cosh(r_j)} \right\}, \qquad D_2 := \mathrm{diag}\left\{ \frac{n!}{\sqrt{(2n)!}} \right\}.
\]

Multiplying, we find that the matrix $V := D_1 M D_2$ has matrix elements

\[
V_{jn} = \left( -\frac{1}{2} e^{i\varphi_j} \tanh(r_j) \right)^n.
\]

Importantly, $V$ has the structure of a Vandermonde matrix. In particular, its determinant is, up to an overall sign,

\[
\det(V) = \prod_{1 \leq i < j \leq n} \frac{1}{2} \left( -e^{i\varphi_i} \tanh(r_i) + e^{i\varphi_j} \tanh(r_j) \right).
\]

The only way that $\det(V) = 0$ is if

\[
e^{i(\varphi_i - \varphi_j)} \tanh(r_i) = \tanh(r_j)
\]

for some $i \neq j$. The squeezing feature map with phase encoding prescribes that $r_i = r_j = c$ (and we assume that
$c > 0$). Thus, the only solution to the above equation is when $\varphi_i = \varphi_j$, which can only be true if the two feature vectors describe the same datapoint, which we excluded in Proposition 2. Thus, $\det(V) \neq 0$, which means that $\det(M) \neq 0$, and hence $M$ is full rank. This means that the rows of $M$, which are our feature vectors, are linearly independent. Note that the same proof also shows that squeezing feature maps with absolute value encoding make distinct data points linearly independent in Fock space.
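The statements of Appendices B to D can also be checked numerically, in the spirit of the rank computation mentioned in Appendix B. The Python sketch below is an illustration only and not the computation used for this paper: it builds the truncated matrix of Fock amplitudes $M_{jn}$ for ten distinct phases, checks that its rank equals the number of data points, and then solves the linear system of Eq. (C2) for an arbitrary labelling. The helper name, the chosen phases, the value $c = 1$, the random seed, the use of complex weights and the least-squares solver are all assumptions made for this example.

```python
import numpy as np
from math import lgamma, log


def fock_feature_matrix(phases, c=1.0, cutoff=40):
    """Rows are truncated squeezed vacuum states |(c, phi_j)> on the even Fock
    states |2n>, i.e. M_jn from Appendix D with r_j = c.
    (Hypothetical helper; cutoff = 40 mirrors the truncation quoted in Appendix B.)"""
    n = np.arange(cutoff)
    # log of sqrt((2n)!) / (2^n n!), evaluated via lgamma to avoid overflow
    log_coeff = np.array(
        [0.5 * lgamma(2 * k + 1) - k * log(2.0) - lgamma(k + 1) for k in range(cutoff)]
    )
    base = -np.exp(1j * np.asarray(phases)) * np.tanh(c)  # shape (M,)
    return np.exp(log_coeff) * base[:, None] ** n / np.sqrt(np.cosh(c))


# Ten distinct phases (illustrative inputs); other distinct values can be substituted.
phases = np.linspace(0.2, 6.0, 10)
M_mat = fock_feature_matrix(phases)
rank = np.linalg.matrix_rank(M_mat)

# Arbitrary labelling y in {-1, 1}; solve Eq. (C2), [X|1] (w, b)^T = y.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=len(phases))
X1 = np.hstack([M_mat, np.ones((len(phases), 1))])
params, *_ = np.linalg.lstsq(X1, y.astype(complex), rcond=None)
residual = np.max(np.abs(X1 @ params - y))
predictions = np.sign((X1 @ params).real)

print("rank of design matrix:", rank)
print("max residual of Eq. (C2):", residual)
print("all labels reproduced:", bool(np.all(predictions == y)))
```

Because the rows are linearly independent, the system of Eq. (C2) is consistent for any choice of labels, so changing the seed or the labelling should leave the residual at the level of numerical round-off.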
