FIG. 4. The overlap of the weight tensor W^ℓ with a specific input vector Φ(x) defines the decision function f^ℓ(x). The label ℓ for which f^ℓ(x) has maximum magnitude is the predicted label for x.

FIG. 5. Approximation of the weight tensor W^ℓ by a matrix product state. The label index ℓ is placed arbitrarily on one of the MPS tensors but can be moved to other locations.
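Fig. 4 depicts f^ℓ(x) as the contraction of the weight tensor W^ℓ with the product-state feature map Φ(x). The following is a minimal numerical sketch of that contraction for a toy number of pixels, assuming NumPy and the d = 2 local feature map whose endpoint values appear in Eqs. (15)-(16); it is not the ITensor implementation used later in the paper.

```python
import numpy as np

# Sketch only: a full (un-factorized) weight tensor for N = 3 pixels,
# local dimension d = 2, and N_L = 10 labels.
N, d, NL = 3, 2, 10

def local_feature(x):
    # d = 2 local feature map phi(x) = [cos(pi x / 2), sin(pi x / 2)],
    # consistent with phi(0) = [1, 0] and phi(1) = [0, 1] (Eqs. (15)-(16)).
    return np.array([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)])

rng = np.random.default_rng(0)
W = rng.normal(size=(NL,) + (d,) * N)     # W^l_{s1 s2 s3}

x = np.array([0.2, 0.7, 0.5])             # grayscale values in [0, 1]
phis = [local_feature(xj) for xj in x]    # the factors of Phi(x)

# f^l(x) = sum_{s1 s2 s3} W^l_{s1 s2 s3} phi^{s1}(x1) phi^{s2}(x2) phi^{s3}(x3)
f = np.einsum('labc,a,b,c->l', W, *phis)
predicted_label = int(np.argmax(np.abs(f)))   # label with maximum magnitude
```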
f^ℓ(x) is a function mapping inputs to the space of labels. The tensor diagram for evaluating f^ℓ(x) for a particular input is depicted in Fig. 4.

IV. MPS APPROXIMATION

Because the weight tensor W^ℓ_{s1 s2 ⋯ sN} has N_L · d^N components, where N_L is the number of labels, we need a way to regularize and optimize this tensor efficiently. The strategy we will use is to represent this high-order tensor as a tensor network, that is, as the contracted product of lower-order tensors.

A tensor network approximates the exponentially large set of components of a high-order tensor in terms of a much smaller set of parameters whose number grows only polynomially in the size of the input space. Various tensor network approximations impose different assumptions, or implicit priors, about the pattern of correlation of the local indices when viewing the original tensor as a distribution. For example, a MERA network can explicitly model power-law decaying correlations while a matrix product state (MPS) has exponentially decaying correlations [27]. Yet an MPS can still approximate power-law decays over quite long distances.

For the rest of this work, we will approximate W^ℓ as an MPS, which have the key advantage that methods for manipulating and optimizing them are well understood and highly efficient. Although MPS are best suited for approximating tensors with a one-dimensional pattern of correlations, they can also be a powerful approach for decomposing tensors with two-dimensional correlations as well [40].

An MPS decomposition of the weight tensor W^ℓ has the form

W^\ell_{s_1 s_2 \cdots s_N} = \sum_{\{\alpha\}} A^{\alpha_1}_{s_1} A^{\alpha_1 \alpha_2}_{s_2} \cdots A^{\ell;\, \alpha_j \alpha_{j+1}}_{s_j} \cdots A^{\alpha_{N-1}}_{s_N}    (5)

and is illustrated in Fig. 5. Each "virtual" index α_j has a dimension m which is known as the bond dimension and is the (hyper)parameter controlling the MPS approximation. For sufficiently large m an MPS can represent any tensor [41]. The name matrix product state refers to the fact that any specific component of the full tensor W^ℓ_{s1 s2 ⋯ sN} can be recovered efficiently by summing over the {α_j} indices from left to right via a sequence of matrix products.

In the above decomposition, the label index ℓ was arbitrarily placed on the jth tensor, but this index can be moved to any other tensor of the MPS without changing the overall W^ℓ tensor it represents. To do this, one contracts the jth tensor with one of its neighbors, then decomposes this larger tensor using a singular value decomposition such that ℓ now belongs to the neighboring tensor—see Fig. 7(b).

In typical physics applications the MPS bond dimension m can range from 10 to 10,000 or even more; for the most challenging physics systems one wants to allow as large a bond dimension as possible since a larger dimension means more accuracy. However, when using MPS in a machine learning setting, the bond dimension controls the number of parameters of the model. So in contrast to physics, taking too large a bond dimension might not be desirable as it could lead to overfitting.

V. SWEEPING ALGORITHM FOR OPTIMIZING WEIGHTS

Inspired by the very successful density matrix renormalization group (DMRG) algorithm developed in physics, here we propose a similar algorithm which "sweeps" back and forth along an MPS, iteratively minimizing the cost function defining the classification task. For concreteness, let us choose to optimize the quadratic cost

C = \frac{1}{2} \sum_{n=1}^{N_T} \sum_{\ell} \left( f^\ell(x_n) - \delta^\ell_{L_n} \right)^2    (6)

where n runs over the N_T training inputs and L_n is the known correct label for training input n. (The symbol δ^ℓ_{L_n} = 1 if ℓ equals L_n and zero otherwise, and represents the ideal output of the function f^ℓ for the input x_n.)

Our strategy for reducing this cost function will be to vary only two neighboring MPS tensors at a time within the approximation Eq. (5). We could conceivably just vary one at a time but as will become clear, varying two tensors leads to a straightforward method for adaptively changing the MPS bond dimension.

Say we want to improve the tensors at sites j and j+1 which share the jth bond. Assume we have moved the label index ℓ to the jth MPS tensor. First we combine the MPS tensors A^ℓ_{s_j} and A_{s_{j+1}} into a single "bond tensor" B^{α_{j−1} ℓ α_{j+1}}_{s_j s_{j+1}} by contracting over the index α_j as shown in Fig. 6(a).
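As a minimal sketch of this step (in NumPy, with an assumed index ordering rather than the paper's ITensor implementation), forming the bond tensor is a single contraction over the shared bond index α_j:

```python
import numpy as np

# Assumed array layouts for this sketch:
#   A_j   has indices (alpha_{j-1}, l, s_j, alpha_j)   -- the site carrying the label index
#   A_j1  has indices (alpha_j, s_{j+1}, alpha_{j+1})
m, d, NL = 8, 2, 10
rng = np.random.default_rng(1)
A_j = rng.normal(size=(m, NL, d, m))
A_j1 = rng.normal(size=(m, d, m))

# Bond tensor B^{alpha_{j-1} l alpha_{j+1}}_{s_j s_{j+1}}, as in Fig. 6(a):
B = np.einsum('ilsa,atk->ilstk', A_j, A_j1)   # shape (m, NL, d, d, m)
```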
FIG. 6. Steps leading to computing the gradient of the bond tensor B^ℓ at bond j: (a) forming the bond tensor; (b) projecting a training input into the "MPS basis" at bond j; (c) computing the decision function in terms of a projected input; (d) the gradient correction to B^ℓ. The dark shaded circular tensors in step (b) are "effective features" formed from m different linear combinations of many original features.

FIG. 7. Update (a) of bond tensor B^ℓ, (b) restoration of MPS form, and (c) advancing a projected training input before optimizing the tensors at the next bond.

Next we compute the derivative of the cost function C with respect to the bond tensor B^ℓ in order to update it using a gradient descent step. Because the rest of the MPS tensors are kept fixed, let us show that to compute the gradient it suffices to feed, or project, each input x_n through the fixed "wings" of the MPS as shown on the left-hand side of Fig. 6(b). Doing so produces the projected, four-index version of the input Φ̃_n shown on the right-hand side of Fig. 6(b). The current decision function can be efficiently computed from this projected input Φ̃_n and the current bond tensor B^ℓ as

f^\ell(x_n) = \sum_{\alpha_{j-1} \alpha_{j+1}} \sum_{s_j s_{j+1}} B^{\alpha_{j-1} \ell \alpha_{j+1}}_{s_j s_{j+1}}\, (\tilde{\Phi}_n)^{s_j s_{j+1}}_{\alpha_{j-1} \alpha_{j+1}}    (7)

or as illustrated in Fig. 6(c). Thus the leading-order update to the tensor B^ℓ can be computed as

\Delta B^\ell \overset{\mathrm{def}}{=} -\frac{\partial C}{\partial B^\ell}    (8)
= \sum_{n=1}^{N_T} \sum_{\ell'} \left( \delta^{\ell'}_{L_n} - f^{\ell'}(x_n) \right) \frac{\partial f^{\ell'}(x_n)}{\partial B^\ell}    (9)
= \sum_{n=1}^{N_T} \left( \delta^{\ell}_{L_n} - f^{\ell}(x_n) \right) \tilde{\Phi}_n .    (10)

Note that the last expression above is a tensor with the same index structure as B^ℓ, as shown in Fig. 6(d).

Assuming we have computed the gradient, we use it to compute a small update to B^ℓ, replacing it with B^ℓ + α ΔB^ℓ as shown in Fig. 7(a), where α is a small empirical parameter used to control convergence. (Note that for this step one could also use the conjugate gradient method to improve the performance.) Having obtained our improved B^ℓ, we must decompose it back into separate MPS tensors to maintain efficiency and apply our algorithm to the next bond. Assume the next bond we want to optimize is the one to the right (bond j+1). Then we can compute a singular value decomposition (SVD) of B^ℓ, treating it as a matrix with a collective row index (α_{j−1}, s_j) and collective column index (ℓ, α_{j+1}, s_{j+1}) as shown in Fig. 7(b). Computing the SVD this way restores the MPS form, but with the ℓ index moved to the tensor on site j+1. If the SVD of B^ℓ is given by

B^{\alpha_{j-1} \ell \alpha_{j+1}}_{s_j s_{j+1}} = \sum_{\alpha'_j \alpha_j} U^{\alpha_{j-1}}_{s_j \alpha'_j}\, S^{\alpha'_j}_{\alpha_j}\, V^{\alpha_j \ell \alpha_{j+1}}_{s_{j+1}} ,    (11)

then to proceed to the next step we define the new MPS tensor at site j to be A'_{s_j} = U_{s_j} and the new tensor at site j+1 to be A'^{ℓ}_{s_{j+1}} = S V^{ℓ}_{s_{j+1}}, where a matrix multiplication over the suppressed α indices is implied. Crucially at this point, only the m largest singular values in S are kept and the rest are truncated (along with the corresponding columns of U and V†) in order to control the computational cost of the algorithm. Such a truncation is guaranteed to produce an optimal approximation of the tensor B^ℓ; furthermore if all of the MPS tensors to the left and right of B^ℓ are formed from (possibly truncated) unitary matrices similar to the definition of A'_{s_j} above, then the optimality of the truncation of B^ℓ applies globally to the entire MPS as well. For further background reading on these technical aspects of MPS, see Refs. 26 and 42.

Finally, when proceeding to the next bond, it would be inefficient to fully project each training input over again into the configuration in Fig. 6(b). Instead it is only necessary to advance the projection by one site, using the newly updated MPS tensor at site j, as illustrated in Fig. 7(c).
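The following NumPy sketch puts Eqs. (7)-(11) together for a single bond update. It is a simplified stand-in for the ITensor implementation used in the paper: it assumes the projected inputs Φ̃_n have already been formed and stacked into one dense array, and it uses plain gradient descent with a fixed step size.

```python
import numpy as np

def two_site_update(B, Phi, y_onehot, lr, m_max):
    """One gradient step on the bond tensor followed by an SVD split.

    B        : bond tensor, indices (alpha_l, label, s_j, s_j+1, alpha_r)
    Phi      : projected training inputs Phi~_n of Fig. 6(b), stacked over n,
               shape (N_T, alpha_l, s_j, s_j+1, alpha_r)
    y_onehot : ideal outputs delta^l_{L_n}, shape (N_T, N_labels)
    lr       : gradient step size (the small parameter alpha in the text)
    m_max    : maximum number of singular values kept on the new bond
    """
    # Eq. (7): f^l(x_n) for every training input, by contracting B with Phi~_n.
    f = np.einsum('alstb,nastb->nl', B, Phi)

    # Eq. (10): dB^l = sum_n (delta^l_{L_n} - f^l(x_n)) Phi~_n
    dB = np.einsum('nl,nastb->alstb', y_onehot - f, Phi)

    # Gradient step, Fig. 7(a).
    B = B + lr * dB

    # Eq. (11): SVD with row index (alpha_l, s_j) and column index (label, s_j+1, alpha_r),
    # which moves the label index to the tensor on site j+1 (Fig. 7(b)).
    aL, nl, d1, d2, aR = B.shape
    mat = B.transpose(0, 2, 1, 3, 4).reshape(aL * d1, nl * d2 * aR)
    U, S, Vh = np.linalg.svd(mat, full_matrices=False)

    k = min(m_max, len(S))                      # keep only the m largest singular values
    A_j = U[:, :k].reshape(aL, d1, k)           # new (left-orthogonal) tensor at site j
    A_j1 = (S[:k, None] * Vh[:k, :]).reshape(k, nl, d2, aR)  # new tensor at site j+1, carrying the label
    return A_j, A_j1
```

Keeping at most m_max singular values at this step is what lets the bond dimension adapt during the sweep, as described in the text.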
To test the tensor network approach on a realistic task, we used the MNIST data set, which consists of grayscale images of the digits zero through nine [44]. The calculations were implemented using the ITensor library [37]. Each image was originally 28 × 28 pixels, which we scaled down to 14 × 14 by averaging clusters of four pixels; otherwise we performed no further modifications to the training or test sets. Working with smaller images reduced the time needed for training, with the tradeoff being that less information was available for learning.

To approximate the classifier tensors as MPS, one must choose a one-dimensional ordering of the local indices s_1, s_2, . . . , s_N. We chose a "zig-zag" ordering shown in Fig. 8, which on average keeps spatially neighboring pixels as close to each other as possible along the one-dimensional MPS path. We then mapped each grayscale image x to a tensor Φ(x) using the local map Eq. (3).

\Phi(x_1, x_2) = \phi^{s_1}(x_1) \otimes \phi^{s_2}(x_2)    (12)

and the full weight tensors W^ℓ_{s_1 s_2} for ℓ ∈ {A, B} are optimized directly using gradient descent.

When selecting a model, our main control parameter is the dimension d of the local indices s_1 and s_2. For the case d = 2, the local feature map is chosen as in Eq. (3). For d > 2 we generalize φ^{s_j}(x_j) to be a normalized d-component vector as described in Appendix B.

A. Regularizing By Local Dimension d

To understand how the flexibility of the model grows with increasing d, consider the case where P_A and P_B are overlapping distributions. Specifically, we take each to be a multivariate Gaussian centered respectively in
FIG. 10. Toy models learned from the overlapping data set of Fig. 9. The results shown are for local dimension (a) d = 2, (b) d = 3, and (c) d = 6. Background colors show how every spatial point would be classified. Misclassified data points are colored white.
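A minimal sketch of such a two-pixel toy model in NumPy follows. The class centers, spread, learning rate, and number of steps are illustrative assumptions (the excerpt does not give the values used for Figs. 9 and 10); the local feature map follows Eq. (B4), the feature tensor follows Eq. (12), and the full weight tensors are trained by plain gradient descent on the quadratic cost of Eq. (6).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def phi(x, d):
    # Local feature map of Eq. (B4); for d = 2 it reduces to [cos(pi x/2), sin(pi x/2)].
    return np.array([np.sqrt(comb(d - 1, s - 1))
                     * np.cos(np.pi * x / 2) ** (d - s)
                     * np.sin(np.pi * x / 2) ** (s - 1)
                     for s in range(1, d + 1)])

# Two overlapping classes in the unit square (centers and spread are illustrative only).
nA = nB = 100
xA = np.clip(rng.normal([0.3, 0.3], 0.15, size=(nA, 2)), 0.0, 1.0)
xB = np.clip(rng.normal([0.7, 0.7], 0.15, size=(nB, 2)), 0.0, 1.0)
X = np.vstack([xA, xB])
y = np.vstack([np.tile([1.0, 0.0], (nA, 1)), np.tile([0.0, 1.0], (nB, 1))])  # one-hot labels A, B

d = 3
P1 = np.array([phi(x1, d) for x1 in X[:, 0]])   # phi(x1) for every sample
P2 = np.array([phi(x2, d) for x2 in X[:, 1]])   # phi(x2) for every sample
Phi = np.einsum('ni,nj->nij', P1, P2)           # Phi(x1, x2) of Eq. (12)

W = 0.01 * rng.normal(size=(2, d, d))           # full weight tensors W^l_{s1 s2}, l in {A, B}

lr = 0.5
for step in range(1000):
    f = np.einsum('lij,nij->nl', W, Phi)        # decision functions f^l(x_n)
    grad = np.einsum('nl,nij->lij', f - y, Phi) # gradient of the quadratic cost, Eq. (6)
    W -= lr * grad / len(X)

pred = np.argmax(np.einsum('lij,nij->nl', W, Phi), axis=1)
accuracy = np.mean(pred == np.argmax(y, axis=1))
```

Increasing d in this sketch enlarges the set of decision boundaries the model can represent, which is the effect compared across panels (a)-(c) of Fig. 10.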
form Eq. (2), let us fix ℓ to a single label and omit it from the notation. We will also consider W to be a completely general order-N tensor with no tensor network constraint. Then f(x) is a function of the form

f(x) = \sum_{\{s\}} W_{s_1 s_2 \cdots s_N}\, \phi^{s_1}(x_1) \otimes \phi^{s_2}(x_2) \otimes \cdots \otimes \phi^{s_N}(x_N) .    (13)

If the functions {φ^s(x)}, s = 1, 2, . . . , d form a basis for a Hilbert space of functions over x ∈ [0, 1], then the tensor product basis

\phi^{s_1}(x_1) \otimes \phi^{s_2}(x_2) \otimes \cdots \otimes \phi^{s_N}(x_N)    (14)

forms a basis for a Hilbert space of functions over x ∈ [0, 1]^{×N}. Moreover, if the basis {φ^s(x)} is complete, then the tensor product basis is also complete and f(x) can be any square integrable function.

Next, consider the effect of restricting the local dimension to d = 2 as in the local feature map of Eq. (3) which was used to classify grayscale images in our MNIST benchmark in Section VI. Recall that for this choice of φ(x),

\phi(0) = [1, 0]    (15)
\phi(1) = [0, 1] .    (16)

Consider an MPS decomposition of W^ℓ in which the label index is carried by a central tensor C^ℓ, flanked by tensors U on its left and V on its right,

W^\ell_{s_1 s_2 \cdots s_N} = \sum_{\{\alpha\}} U^{\alpha_1}_{s_1} \cdots U^{\alpha_{i-1} \alpha_i}_{s_i}\, C^{\ell}_{\alpha_i \alpha_{i+1}}\, V^{\alpha_{i+1} \alpha_{i+2}}_{s_{i+1}} \cdots V^{\alpha_{N-1}}_{s_N}    (17)

as illustrated in Fig. 12(a). To say the U and V tensors are left or right orthogonal means when viewed as matrices U_{α_{j−1} s_j α_j} and V_{α_{j−1} s_j α_j} these tensors have the property U†U = I and VV† = I where I is the identity; these orthogonality conditions can be understood more clearly in terms of the diagrams in Fig. 12(b). Any MPS can be brought into the form Eq. (17) through an efficient sequence of tensor contractions and SVD operations similar to the steps in Fig. 7(b).

The form in Eq. (17) suggests an interpretation where the decision function f^ℓ(x) acts in three stages. First, an input x is mapped into the exponentially larger feature space defined by Φ(x). Next, the d^N dimensional feature vector Φ is mapped into a much smaller m² dimensional space by contraction with all the U and V site tensors of the MPS. This second step defines a new feature map Φ̃(x) with m² components as illustrated in Fig. 12(c). Finally, f^ℓ(x) is computed by contracting Φ̃(x) with C^ℓ.

To justify calling Φ̃(x) a feature map, it follows from the left- and right-orthogonality conditions of the U and V tensors of the MPS Eq. (17) that the indices α_i and α_{i+1} of the core tensor C label an orthonormal basis for
defining a reduced feature map Φ̃(x).

of shallow neural network models [45]. A similar interpretation of an MPS as learning features was first proposed in Ref. 23, though with quite a different scheme for representing data than what is used here. It is also interesting to note that an interpretation of the U and V tensors as combining and projecting features into only the m most important combinations can be applied at any bond of the MPS. For example, the tensor U^{α_{j+1}}_{α_j s_j} at site j can be viewed as defining a vector of m features labeled by α_{j+1} by forming linear combinations of products of the features φ^{s_j}(x_j) and the features α_j defined by the previous U tensor, similar to the contraction in Fig. 7(c).

C. Generative Interpretation

Because MPS were originally introduced to represent wavefunctions of quantum systems [30], it is tempting to interpret a decision function f^ℓ(x) with an MPS weight vector as a wavefunction. This would mean interpreting |f^ℓ(x)|² for each fixed ℓ as a probability distribution function over the set of inputs x belonging to class ℓ. A major motivation for this interpretation would be that many insights from physics could be applied to machine learned models. For example, tensor networks in the same family as MPS, when viewed as defining a probability distribution, can be used to efficiently perform perfect sampling of the distribution they represent [46].

Let us investigate the properties of W^ℓ and Φ(x) required for a consistent interpretation of |f^ℓ(x)|² as a probability distribution.

\int_x \bar{\phi}^{s}(x)\, \phi^{s'}(x)\, d\mu(x) = \delta_{s s'} ,    (20)

\sum_{s} \bar{\phi}^{s}(x)\, \phi^{s}(x) = 1    (21)

for all x ∈ [0, 1].

Unfortunately neither the local feature map Eq. (3) nor its generalizations in Appendix B meet the first criterion Eq. (20). A different choice that satisfies both the orthogonality condition Eq. (20) and normalization condition Eq. (21) could be

\phi(x) = \left[ \cos(\pi x),\ \sin(\pi x) \right] .    (22)

However, this map is not suitable for inputs like grayscale pixels since it is anti-periodic over the interval x ∈ [0, 1] and would lead to a periodic probability distribution. An example of an orthogonal, normalized map which is not periodic on x ∈ [0, 1] is

\phi(x) = \left[ e^{i(3\pi/2)x} \cos\!\left(\tfrac{\pi}{2} x\right),\ e^{-i(3\pi/2)x} \sin\!\left(\tfrac{\pi}{2} x\right) \right] .    (23)

This local feature map meets the criteria Eqs. (20) and (21) if the integration measure is chosen to be dμ(x) = 2 dx.

As a basic consistency check of the above generative interpretation, we performed an experiment on our toy model of Section VII, using the local feature map Eq. (23). Recall that our toy data can have two possible labels A and B. To test the generative interpretation, we first generated a single, random "hidden" weight tensor W. From this weight tensor we sampled N_s data points in a two step process:

1. Sample a label ℓ = A or ℓ = B according to the probabilities P_A = \int_x |f^A(x)|^2 = \sum_{s_1 s_2} |W^A_{s_1 s_2}|^2
where p(ℓ, x) = |f^ℓ(x)|² = |W^ℓ · Φ(x)|² and p̃(ℓ, x) has a similar definition in terms of W̃. The resulting average D_KL as a function of the number of sampled training points N_s is shown in Fig. 13, along with a fit of the form σ/√N_s where σ is a fitting parameter. The results indicate that given enough training data, the learning process can eventually recapture the original probabilistic model that generated the data.

IX. DISCUSSION

We have introduced a framework for applying quantum-inspired tensor networks to multi-class supervised learning tasks. While using an MPS ansatz for the model parameters worked well even for the two-dimensional data in our MNIST experiment, other tensor networks such as PEPS, which are explicitly designed for

Note: while preparing our final manuscript, Novikov et al. [49] published a related framework for parameterizing supervised learning models with MPS (tensor trains).

Acknowledgments

We would like to acknowledge helpful discussions with Juan Carrasquilla, Josh Combes, Glen Evenbly, Bohdan Kulchytskyy, Li Li, Roger Melko, Pankaj Mehta, U. N. Niranjan, Giacomo Torlai, and Steven R. White. This research was supported in part by the Perimeter Institute for Theoretical Physics. Research at Perimeter Institute is supported by the Government of Canada through Industry Canada and by the Province of Ontario through the Ministry of Economic Development & Innovation. This research was also supported in part by the Simons Foundation Many-Electron Collaboration.
Though matrix product states (MPS) have a relatively simple structure, more powerful tensor networks, such as PEPS and MERA, have such complex structure that traditional tensor notation becomes unwieldy. For these networks, and even for MPS, it is helpful to use a graphical notation. For some more complete reviews of this notation and its uses in various tensor networks see Refs. 25 and 42.

The basic graphical notation for a tensor is to represent it as a closed shape. Typically this shape is a circle, though other shapes can be used to distinguish types of tensors (there is no standard convention for the choice of shapes). Each index of the tensor is represented by a line emanating from it; an order-N tensor has N such lines. Figure 14 shows examples of diagrams for tensors of order one, two, and three.

To indicate that a certain pair of tensor indices are contracted, they are joined together by a line. For example, Fig. 15(a) shows the contraction of an order-1 tensor with an order-2 tensor; this is the usual matrix-vector multiplication. Figure 15(b) shows a more general contraction of an order-4 tensor with an order-3 tensor.

Graphical tensor notation offers many advantages over traditional notation. In graphical form, indices do not usually require names or labels since they can be distinguished by their location in the diagram. Operations such as the outer product, tensor trace, and tensor contraction can be expressed without additional notation; for example, the outer product is just the placement of one tensor next to another. For a network of contracted

(\cos^2(\theta_j) + \sin^2(\theta_j))^{d-1} = 1 .    (B2)

Expand the above identity using the binomial coefficients \binom{n}{k} = n!/(k!(n-k)!):

(\cos^2(\theta_j) + \sin^2(\theta_j))^{d-1} = 1 = \sum_{p=0}^{d-1} \binom{d-1}{p} (\cos \theta_j)^{2(d-1-p)} (\sin \theta_j)^{2p} .    (B3)

This motivates defining φ^{s_j}(x_j) to be

\phi^{s_j}(x_j) = \sqrt{\binom{d-1}{s_j - 1}}\, \left(\cos\tfrac{\pi}{2} x_j\right)^{d - s_j} \left(\sin\tfrac{\pi}{2} x_j\right)^{s_j - 1}    (B4)

where recall that s_j runs from 1 to d. The above definition reduces to the d = 2 case Eq. (B1) and guarantees that \sum_{s_j} |\phi^{s_j}|^2 = 1 for larger d. (These functions are actually a special case of what are known as spin coherent states.)

Using the above mapping φ^{s_j}(x_j) for larger d allows the product W^ℓ · Φ(x) to realize a richer class of functions. One reason is that a larger local dimension allows the weight tensor to have many more components. Also, for larger d the components of φ^{s_j}(x_j) contain ever higher frequency terms of the form \cos(\tfrac{\pi}{2} x_j), \cos(\tfrac{3\pi}{2} x_j), \ldots, \cos(\tfrac{(d-1)\pi}{2} x_j) and similar terms replacing cos with sin. The net result is that the decision functions realizable for larger d are composed from a more complete basis of functions and can respond to smaller variations in x.
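A quick numerical check of this normalization property, as a NumPy sketch assuming nothing beyond Eq. (B4) itself:

```python
import numpy as np
from math import comb

def phi(x, d):
    # Local feature map of Eq. (B4): phi^{s}(x) for s = 1, ..., d.
    return np.array([np.sqrt(comb(d - 1, s - 1))
                     * np.cos(np.pi * x / 2) ** (d - s)
                     * np.sin(np.pi * x / 2) ** (s - 1)
                     for s in range(1, d + 1)])

# The binomial identity of Eq. (B3) implies sum_s |phi^s(x)|^2 = 1 for every x and d.
for d in (2, 3, 6):
    for x in np.linspace(0.0, 1.0, 11):
        assert np.isclose(np.sum(phi(x, d) ** 2), 1.0)
```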
[1] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," PNAS 79, 2554–2558 (1982).
[2] Daniel J. Amit, Hanoch Gutfreund, and H. Sompolinsky, "Spin-glass models of neural networks," Phys. Rev. A 32, 1007–1018 (1985).
[3] Bernard Derrida, Elizabeth Gardner, and Anne Zippelius, "An exactly solvable asymmetric neural network model," EPL (Europhysics Letters) 4, 167 (1987).
[4] Daniel J. Amit, Hanoch Gutfreund, and H. Sompolinsky, "Information storage in neural networks with low levels of activity," Phys. Rev. A 35, 2293–2303 (1987).
[5] H. S. Seung, Haim Sompolinsky, and N. Tishby, "Statistical mechanics of learning from examples," Phys. Rev. A 45, 6056 (1992).
[6] Andreas Engel and Christian Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001).
[7] Dörthe Malzahn and Manfred Opper, "A statistical physics approach for the analysis of machine learning algorithms on real data," Journal of Statistical Mechanics: Theory and Experiment 2005, P11001 (2005).
[8] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, "A fast learning algorithm for deep belief nets," Neural Computation 18, 1527–1554 (2006).
[9] Marc Mezard and Andrea Montanari, Information, Physics, and Computation (Oxford University Press, 2009).
[10] Pankaj Mehta and David Schwab, "An exact mapping between the variational renormalization group and deep learning," arxiv:1410.3831 (2014).
[11] Christopher C. Fischer, Kevin J. Tibbetts, Dane Morgan, and Gerbrand Ceder, "Predicting crystal structure by merging data mining with quantum mechanics," Nat. Mater. 5, 641–646 (2006).
[12] Geoffroy Hautier, Christopher C. Fischer, Anubhav Jain, Tim Mueller, and Gerbrand Ceder, "Finding nature's missing ternary oxide compounds using machine learning and density functional theory," Chemistry of Materials 22, 3762–3767 (2010), http://dx.doi.org/10.1021/cm100795d.
[13] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld, "Fast and accurate modeling of molecular atomization energies with machine learning," Phys. Rev. Lett. 108, 058301 (2012).
[14] Yousef Saad, Da Gao, Thanh Ngo, Scotty Bobbitt, James R. Chelikowsky, and Wanda Andreoni, "Data mining for materials: Computational experiments with AB compounds," Phys. Rev. B 85, 104104 (2012).
[15] John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke, "Finding density functionals with machine learning," Phys. Rev. Lett. 108, 253002 (2012).
[16] Ghanshyam Pilania, Chenchen Wang, Xun Jiang, Sanguthevar Rajasekaran, and Ramamurthy Ramprasad, "Accelerating materials property predictions using machine learning," Scientific Reports 3, 2810 (2013).
[17] Louis-François Arsenault, Alejandro Lopez-Bezanilla, O. Anatole von Lilienfeld, and Andrew J. Millis, "Machine learning for many-body physics: The case of the Anderson impurity model," Phys. Rev. B 90, 155136 (2014).
[18] Mohammad H. Amin, Evgeny Andriyash, Jason Rolfe, Bohdan Kulchytskyy, and Roger Melko, "Quantum Boltzmann machine," arxiv:1601.02036 (2016).
[19] Juan Carrasquilla and Roger G. Melko, "Machine learning phases of matter," Nat. Phys. 13, 431–434 (2017).
[20] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky, "Tensor decompositions for learning latent variable models," Journal of Machine Learning Research 15, 2773–2832 (2014).
[21] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade, "A tensor approach to learning mixed membership community models," J. Mach. Learn. Res. 15, 2239–2312 (2014).
[22] Anh Huy Phan and Andrzej Cichocki, "Tensor decompositions for feature extraction and classification of high dimensional datasets," Nonlinear Theory and Its Applications, IEICE 1, 37–68 (2010).
[23] J. A. Bengua, H. N. Phien, and H. D. Tuan, "Optimal feature extraction and classification of tensors via matrix product state decomposition," in 2015 IEEE Intl. Congress on Big Data (BigData Congress) (2015) pp. 669–672.
[24] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov, "Tensorizing neural networks," arxiv:1509.06569 (2015).
[25] Jacob C. Bridgeman and Christopher T. Chubb, "Hand-waving and interpretive dance: An introductory course on tensor networks," arxiv:1603.03039 (2016).
[26] U. Schollwöck, "The density-matrix renormalization group in the age of matrix product states," Annals of Physics 326, 96–192 (2011).
[27] G. Evenbly and G. Vidal, "Tensor network states and geometry," Journal of Statistical Physics 145, 891–918 (2011).
[28] K. R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf, "An introduction to kernel-based learning algorithms," IEEE Transactions on Neural Networks 12, 181–201 (2001).
[29] E. M. Stoudenmire and Steven R. White, "Real-space parallel density matrix renormalization group," Phys. Rev. B 87, 155137 (2013).
[30] Stellan Östlund and Stefan Rommer, "Thermodynamic limit of density matrix renormalization," Phys. Rev. Lett. 75, 3537–3540 (1995).
[31] I. Oseledets, "Tensor-train decomposition," SIAM Journal on Scientific Computing 33, 2295–2317 (2011).
[32] F. Verstraete and J. I. Cirac, "Renormalization algorithms for quantum-many body systems in two and higher dimensions," cond-mat/0407066 (2004).
[33] G. Vidal, "Entanglement renormalization," Phys. Rev. Lett. 99, 220405 (2007).
[34] G. Evenbly and G. Vidal, "Algorithms for entanglement renormalization," Phys. Rev. B 79, 144108 (2009).
[35] Steven R. White, "Density matrix formulation for quantum renormalization groups," Phys. Rev. Lett. 69, 2863–2866 (1992).
[36] Sebastian Holtz, Thorsten Rohwedder, and Reinhold Schneider, "The alternating linear scheme for tensor optimization in the tensor train format," SIAM Journal on Scientific Computing 34, A683–A713 (2012).
[37] ITensor Library (version 2.0.7), http://itensor.org.
[38] Vladimir Vapnik, The Nature of Statistical Learning Theory (Springer-Verlag New York, 2000).
[39] W. Waegeman, T. Pahikkala, A. Airola, T. Salakoski, M. Stock, and B. De Baets, "A kernel-based framework for learning graded relations from data," IEEE Transactions on Fuzzy Systems 20, 1090–1101 (2012).
[40] E. M. Stoudenmire and Steven R. White, "Studying two-dimensional systems with the density matrix renormalization group," Annual Review of Condensed Matter Physics 3, 111–128 (2012).
[41] F. Verstraete, D. Porras, and J. I. Cirac, "Density matrix renormalization group and periodic boundary conditions: A quantum information perspective," Phys. Rev. Lett. 93, 227205 (2004).
[42] Andrzej Cichocki, "Tensor networks for big data analytics and large-scale optimization problems," arxiv:1407.3124 (2014).
[43] Jason T. Rolfe and Matthew Cook, "Multifactor expectation maximization for factor graphs," (Springer Berlin Heidelberg, Berlin, Heidelberg, 2010) pp. 267–276.
[44] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges, "MNIST handwritten digit database," http://yann.lecun.com/exdb/mnist/.
[45] Michael Nielsen, Neural Networks and Deep Learning (Determination Press, 2015).
[46] Andrew J. Ferris and Guifre Vidal, "Perfect sampling with unitary tensor networks," Phys. Rev. B 85, 165146 (2012).
[47] Efi Efrati, Zhe Wang, Amy Kolan, and Leo P. Kadanoff, "Real-space renormalization in statistical mechanics," Reviews of Modern Physics 86, 647 (2014).
[48] G. Evenbly and G. Vidal, "Tensor network renormalization," Phys. Rev. Lett. 115, 180405 (2015).
[49] Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets, "Exponential machines," arxiv:1605.03795 (2016).
[50] Matthew T. Fishman and Steven R. White, "Compression of correlation matrices and an efficient method for forming matrix product states of fermionic gaussian states," Phys. Rev. B 92, 075132 (2015).