
# Uni10

The Universal Tensor Network Library
Yun-Da Hsieh, Ying-Jer Kao
Center for Advanced Study in Theoretical Science
and Department of Physics
National Taiwan University
http://www.uni10.org
Tensor Network Workshop: Algorithms and Applications
IOP, ACS, 2014-12-01~05

Developers
- Yun-Da Hsieh (National Taiwan University): implementation of Uni10 (C++, CUDA, Python)
- Pochung Chen (National Tsing-Hua University): design and testing
- Tama Ma (National University of Singapore): cross-platform build
- Sukhbinder Singh (Macquarie University): Matlab wrapper

Tensor
A tensor is a multidimensional array A(i,j,k), with indices i = 1, 2, 3, …, I; j = 1, 2, 3, …, J; k = 1, 2, 3, …, K. Fixing one or more indices gives slices, e.g. A(i,:,:), A(:,j,:), A(:,:,k), A(:,5,1), A(1,:,3), A(1,3,:).

Tensors in Science
- Data analysis
- Signal and image processing
- Quantum physics/information
- Quantum loop gravity
- Quantum chemistry
- Bioinformatics
- Computational neuroscience


Graphical Representation
A tensor is drawn as a node with one leg per index:
- scalar S: no legs
- vector A_α = (A_1, A_2, A_3, …, A_N): one leg α
- matrix B_{αβ}, with entries B_11 … B_mn: two legs α, β
- rank-3 tensor C_{αβγ}: three legs
- rank-n tensor T_{α1 α2 α3 … αn}: n legs

Graphical Representation: product of tensors (matrices)
Joining legs multiplies tensors, e.g. Q_{αβ} = Σ_γ R_{αγ} S_{γβ}.
Internal lines are summed over; external lines are external indices.
A closed loop of matrices is a trace: Tr(ABCD).
Three matrices sharing one internal line α form a rank-3 tensor:
T_{ijk} = Σ_α A_{αi} B_{αj} C_{αk}
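These diagrammatic rules are easy to check numerically. Below is a minimal plain-C++ sketch (not Uni10; `Mat2`, `eye2`, and `traceABCD` are illustrative names) that evaluates the closed loop Tr(ABCD) by explicitly summing every internal line:

```cpp
#include <array>
#include <cstddef>

// A 2x2 matrix; each index of the diagram runs over {0, 1}.
using Mat2 = std::array<std::array<double, 2>, 2>;

// 2x2 identity matrix, for testing.
Mat2 eye2() {
    Mat2 m{};
    m[0][0] = 1.0;
    m[1][1] = 1.0;
    return m;
}

// Closed loop of four matrices: every line is internal, so every index
// is summed and the result is a scalar, Tr(ABCD).
double traceABCD(const Mat2& A, const Mat2& B, const Mat2& C, const Mat2& D) {
    double t = 0.0;
    for (std::size_t a = 0; a < 2; ++a)
        for (std::size_t b = 0; b < 2; ++b)
            for (std::size_t c = 0; c < 2; ++c)
                for (std::size_t d = 0; d < 2; ++d)
                    t += A[a][b] * B[b][c] * C[c][d] * D[d][a];
    return t;
}
```

For four 2×2 identity matrices the loop reduces to the trace of the identity, i.e. 2.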

Tensor Networks for Quantum Many-body Problems
Examples: PEPS, 1D MERA (isometries w and disentanglers u acting on a Hamiltonian term H), 2D MERA (isometries w1–w4 and disentanglers ux, uy on the upper and lower halves).
Things get messy when fermions are involved.

Tensor Contraction
- Bookkeeping tensor indices is tedious and error-prone.
- Programming fermionic systems can be complicated.
- Minimize computation time and memory usage.
- Speed up computation with compute accelerators (GPU, MIC, etc.).

Tensor Network Applications
- Applications: DMRG, MERA, TRG, iTEBD, iPEPS
- Languages: C++, Fortran, Python, Matlab, CUDA
- Low-level libraries: BLAS, LAPACK, cuBLAS, Magma
- Hardware: CPU, GPU
Uni10 sits between the applications and the low-level libraries.
Other tensor libraries: iTensor, TNT, NCON, SCON.

Uni10
- Fully implemented in object-oriented C++
- Aimed toward applications in tensor network algorithms
- Provides basic tensor operations with an easy-to-use interface
- A symmetric tensor class UniTensor (Abelian symmetry), with auxiliary classes for quantum numbers Qnum, blocks Block, and bonds Bond, and functions performing tensor operations
- A network class Network, where details of the graphical representations of the networks are processed and stored
- An engine to construct and analyze the contraction tree for a given network
- A heuristic algorithm to search for an optimal binary contraction order based on the computational and memory constraints
- Supports acceleration with NVIDIA CUDA-based GPUs
- Provides wrappers for Matlab (soon) and Python (pyUni10)
- Open source: LGPL with a cite-me license

Class overview: a Network is assembled from UniTensor objects with putTensor() and contracted with launch(). A UniTensor exchanges dense blocks with Matrix objects via putBlock()/getBlock(), and is built from Bond, Qnum, and Block objects.

Spin-1 Heisenberg Model
H = S_1 · S_2, with |S_z⟩ ∈ {|1⟩, |0⟩, |−1⟩} on each site.
Construct a symmetric basis: group the two-site states |S_z1⟩|S_z2⟩ into blocks labeled by the total S_z quantum number (e.g. the S_z = 1 sector).

The tensor is then defined on bonds that carry the quantum numbers; addRawElem() maps raw elements into the corresponding block elements:

```cpp
Qnum q0(0), q1(1), qm1(-1);
Bond bdr(BD_IN, [qm1, q0, q1]);
Bond bdc(BD_OUT, [qm1, q0, q1]);
UniTensor T([bdr, bdc]);
T.addRawElem(raw);  // raw elements -> block elements
```

Contraction by labels: bonds with matching labels (here 3 and 4) are contracted.

```cpp
Ta.addLabel([1, 2, 3, 4]);
Tb.addLabel([3, 4, 5, 6]);
Ta * Tb;
```

UniTensor
- Initialization: T.orthoRand(); T.eye();
- Operations: Tc = Ta * Tb; Tc = Ta + Tb; Tc = 2.13 * Ta; permute() and transpose(), e.g. T.permute([3, 4, 1, 2], 1)
- I/O: write: Ta.save("out"); read: UniTensor Ta("out"); print: cout << Ta;
Functionalities are similar to iTensor.

Symmetries
- U(1): total S_z (spin systems); total particle number (fermionic/bosonic systems)
- Z_2: parity; the key to fermionic systems
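The bookkeeping behind these symmetries is simple: under fusion, U(1) charges add and Z_2 parities add mod 2. A plain-C++ sketch (not the Uni10 Qnum API; `QN` and `fuse` are illustrative names):

```cpp
// Illustrative sketch only (not the Uni10 Qnum class): a quantum number
// carrying a U(1) charge and a Z2 parity.
struct QN {
    int u1;      // U(1) charge (e.g. total Sz or particle number)
    int parity;  // Z2 parity: 0 = even, 1 = odd
};

// Fusing two sectors: U(1) charges add, Z2 parities add mod 2.
QN fuse(QN a, QN b) {
    return QN{a.u1 + b.u1, (a.parity + b.parity) % 2};
}
```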

Fermionic system
Fermionic operators anticommute, which produces minus signs:
C1† C2† |0⟩ = −C2† C1† |0⟩, i.e. exchanging the two particles maps |1,1⟩ to −|1,1⟩.

Particle parity
Each bond state carries a parity: |0⟩ has p = 0, |1⟩ has p = 1. For the four-bond tensor A, only the parity-conserving elements (numbered 1–8) are nonzero; the rest (✗) vanish:

|          | \|0,0⟩ | \|0,1⟩ | \|1,0⟩ | \|1,1⟩ |
|----------|--------|--------|--------|--------|
| \|0,0⟩   | 1      | ✗      | ✗      | 2      |
| \|0,1⟩   | ✗      | 3      | 4      | ✗      |
| \|1,0⟩   | ✗      | 5      | 6      | ✗      |
| \|1,1⟩   | 7      | ✗      | ✗      | 8      |

Fermionic contraction and the minus sign
Tensor A carries labels [1, 2, 4, 3] and tensor B carries labels [3, 4, 5, 6]; contracting Ta * Tb directly would miss a minus sign. Before the contraction, reorder Ta's bonds:

```cpp
// Before contraction:
Ta.permute([1, 2, 3, 4], 2);  // [1, 2, 4, 3] --> [1, 2, 3, 4]
```

Swap gates
The parity-symmetric tensor A (bonds 1–4 carrying |0⟩ and |1⟩) has eight nonzero elements. Swap gates are added in permute(); after the permutation, elements 2 and 8 acquire minus signs:

|          | \|0,0⟩ | \|0,1⟩ | \|1,0⟩ | \|1,1⟩  |
|----------|--------|--------|--------|---------|
| \|0,0⟩   | 1      |        |        | 2 → −2  |
| \|0,1⟩   |        | 3      | 4      |         |
| \|1,0⟩   |        | 5      | 6      |         |
| \|1,1⟩   | 7      |        |        | 8 → −8  |
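The minus signs above come from the swap-gate rule: exchanging two fermionic indices contributes a factor of −1 exactly when both carry odd particle parity. A one-line sketch of the rule (not the Uni10 internals; `swapSign` is an illustrative name):

```cpp
// Swap-gate sign rule for fermionic tensors: exchanging two indices
// picks up a factor of -1 iff both indices have odd parity.
int swapSign(int parityA, int parityB) {
    return (parityA % 2 != 0 && parityB % 2 != 0) ? -1 : 1;
}
```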

UniTensor & Matrix
- Interactions: T.getBlock(q3); T.putBlock(q3); T.qnums();
  getBlock(q3) extracts the dense matrix of the block with quantum number q3; putBlock(q3) writes it back.
- Matrix operations: M.eigh(); M.svd(); Mc = Ma * Mb; M.transpose();
  svd() factorizes M = U * Σ * Vᵀ.

Network
File for network connection: demo.txt

```
C: -1; 7 1
D: -2; 2 8
A: 1 2; 3 4
B: 3 4; 5 6
E: 7 5; -3
F: 6 8; -4
```

The six tensors A–F are contracted over internal bonds 1–8, with external bonds −1 to −4 (output order ABCDEF).
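Each line of the file names a tensor, then lists its labels before the ";" and its labels after it. A hypothetical parser sketch for this format (plain C++; `parseNetLine` and `NetLine` are not Uni10 functions):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parsed form of one network-file line such as "A: 1 2; 3 4":
// tensor name, labels before ';', labels after ';'.
struct NetLine {
    std::string name;
    std::vector<int> in, out;
};

NetLine parseNetLine(const std::string& line) {
    NetLine r;
    std::size_t colon = line.find(':');
    r.name = line.substr(0, colon);
    std::string rest = line.substr(colon + 1);
    std::size_t semi = rest.find(';');
    std::istringstream ins(rest.substr(0, semi));
    std::istringstream outs(semi == std::string::npos ? "" : rest.substr(semi + 1));
    for (int x; ins >> x;) r.in.push_back(x);
    for (int x; outs >> x;) r.out.push_back(x);
    return r;
}
```

For example, `parseNetLine("A: 1 2; 3 4")` yields name "A", labels {1, 2} before the semicolon and {3, 4} after it.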

The network (tensors C, D, A, B, E, F joined by internal bonds 1–8, with external bonds −1, −2, −3, −4) is loaded from the file:

```cpp
Network net("demo.txt");
```


Network.putTensor
Tensors are assigned to the network's slots 0–5 (in file order: C, D, A, B, E, F):

```cpp
net.putTensor(0, C);
net.putTensor(2, A);
```

Network.launch

```cpp
Tresult = net.launch();
```

The network is reusable: replace one tensor and launch again.

```cpp
net.putTensor(3, B');
Tresult = net.launch();
```

Contraction order
Finding the optimal contraction order for a general tensor network is NP-hard (C.-C. Lam et al., Parallel Processing Letters 07, 157 (1997)), so Uni10 uses heuristics.

Network.launch
net.launch() uses heuristics to generate an "optimal" binary contraction tree, e.g.

(((A * B) * C) * E) * (D * F)

Why binary?
D_{abc} = Σ_{lmn} A_{aln} B_{bml} C_{cnm}
Contracting A, B, C in a single (direct) sum costs a·b·c·l·m·n operations.

Pairwise contractions instead build an intermediate tensor: contracting B * C first costs b·c·l·m·n and produces an intermediate with indices b, c, l, n; contracting it with A then costs a·b·c·l·n. Pairwise beats the direct sum when
b·c·l·m·n + a·b·c·l·n < a·b·c·l·m·n, i.e. (1/a + 1/m) < 1.

The three pairwise choices (time is relative to the direct-sum cost a·b·c·l·m·n):

| first pair | relative time  | intermediate (space) |
|------------|----------------|----------------------|
| B * C      | (1/a + 1/m)    | b·c·l·n              |
| C * A      | (1/b + 1/n)    | a·c·l·m              |
| A * B      | (1/c + 1/l)    | a·b·m·n              |

Searching all contraction trees:
getCost(Nz) = min over all bipartitions (Nx, Ny) of Nz of [ Cost(Nx * Ny) + getCost(Nx) + getCost(Ny) ]
In total there are n!(n−1)!/2^{n−1} binary contraction trees for n tensors.
Pfeifer et al., arXiv:1304.6112 (2013).
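The recursion can be written as a memoized search over subsets. Below is a small plain-C++ sketch (not Uni10's engine; `ContractionDP` is an illustrative name) under a crude cost model: each global index occurs on at most two tensors, so the free indices of an intermediate are the XOR of its members' index bitmasks, and one contraction step costs the product of the dimensions of all indices involved (the union of the two operands' free indices):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Memoized form of:
//   getCost(Nz) = min over bipartitions (Nx, Ny) of Nz of
//                 Cost(Nx * Ny) + getCost(Nx) + getCost(Ny)
struct ContractionDP {
    std::vector<uint32_t> tensorIdx;  // index bitmask of each input tensor
    std::vector<long long> dim;       // dimension of each global index
    std::unordered_map<uint32_t, long long> memo;

    // Product of dimensions of the indices selected by idxMask.
    long long prodDims(uint32_t idxMask) const {
        long long p = 1;
        for (std::size_t i = 0; i < dim.size(); ++i)
            if (idxMask & (1u << i)) p *= dim[i];
        return p;
    }

    // Free indices of the intermediate built from a set of tensors:
    // XOR works because each index occurs on at most two tensors.
    uint32_t freeIdx(uint32_t tensorSet) const {
        uint32_t f = 0;
        for (std::size_t t = 0; t < tensorIdx.size(); ++t)
            if (tensorSet & (1u << t)) f ^= tensorIdx[t];
        return f;
    }

    long long getCost(uint32_t tensorSet) {
        if ((tensorSet & (tensorSet - 1)) == 0) return 0;  // 0 or 1 tensor
        auto it = memo.find(tensorSet);
        if (it != memo.end()) return it->second;
        long long best = -1;
        // Enumerate each unordered bipartition tensorSet = sub | rest once.
        for (uint32_t sub = (tensorSet - 1) & tensorSet; sub;
             sub = (sub - 1) & tensorSet) {
            uint32_t rest = tensorSet & ~sub;
            if (sub < rest) continue;
            long long step = prodDims(freeIdx(sub) | freeIdx(rest));
            long long c = step + getCost(sub) + getCost(rest);
            if (best < 0 || c < best) best = c;
        }
        return memo[tensorSet] = best;
    }
};
```

For the D_abc = Σ_{lmn} A_aln B_bml C_cnm example with a = 4 and all other dimensions 2, the search finds the (B * C)-first tree with total cost 32 + 64 = 96, beating both alternatives (128 each).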

Assuming a = b = c, the three time factors differ only through the shared index (m, n, or l), so the best first pair is the one sharing the largest dimension.

The heuristic scores a candidate pair by its element-count efficiency. For B * C:
(ElemNum(B) + ElemNum(C)) / ElemNum(B * C) = (l·m·b + n·m·c) / (l·n·b·c) = m/(n·c) + m/(l·b)
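The efficiency formula is straightforward to evaluate (plain C++ sketch; `pairEfficiency` is an illustrative name):

```cpp
// Element-count efficiency of contracting B (l x m x b) with
// C (n x m x c) over the shared index m:
//   (ElemNum(B) + ElemNum(C)) / ElemNum(B * C)
//   = (l*m*b + n*m*c) / (l*n*b*c) = m/(n*c) + m/(l*b)
double pairEfficiency(double l, double m, double n, double b, double c) {
    return (l * m * b + n * m * c) / (l * n * b * c);
}
```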

Specifying the order by hand
Since the optimal order is NP-hard to find, you can also use human intelligence: give a suggested contraction sequence in the network file.

ORDER: ABCEDF

The heuristic is not always optimal, but it is fast: O(log n) ~ O(n), versus the n!(n−1)!/2^{n−1} possible trees, and several trials can give a better sequence.

Network.launch: building the tree
launch() grows the binary contraction tree step by step: A, (A * B), ((A * B) * C), (((A * B) * C) * E), then combines with (D * F):

(((A * B) * C) * E) * (D * F)

Network.launch
net.launch() evaluates ORDER: (((A * B) * C) * E) * (D * F).
For a fermionic system, external bond orders are inequivalent: CDABEF != ABCEDF. The output in the order CDABEF specified by the network file is obtained with

```cpp
CDABEF = net.launch();
```

Example: Corner Transfer Matrix
- Used in iPEPS to estimate the environment
- Different bond dimensions: χbd (boundary index), χint (internal index), d (physical index)
- Computes the expectation value of an operator ⟨Ô⟩
Román Orús and Guifré Vidal, PRB 80, 094403 (2009)

CTM.net

```
C1: 1 2
C2: 34 35
C3: 41 42
C4: 19 16
T1b: 1 20 3 4
T1a: 20 34 21 22
T2a: 35 38 36 37
T2b: 38 41 39 40
T3b: 42 33 31 32
T3a: 33 19 17 18
T4a: 16 13 14 15
T4b: 13 2 5 6
A1: 4 24 10 6 8
A1T: 3 23 9 5 7
B1: 22 37 27 24 25
B1T: 21 36 26 23 25
A2: 27 40 32 29 30
A2T: 26 39 31 28 30
B2: 10 29 18 15 12
B2T: 9 28 17 14 11
O: 7 8 11 12
TOUT:
ORDER: C1 T1b T4b A1T A1 O B2T B2 T4a C4 T3a T1a B1T B1 A2T A2 T3b C2 T2a T2b C3
```

With the user-suggested left-deep order ((((((((((((((((((((C1 T1b) T4b) A1T) A1) O) B2T) B2) T4a) C4) T3a) T1a) B1T) B1) A2T) A2) T3b) C2) T2a) T2b) C3):

- Memory requirement: 37,846,846,304
- Sum of memory usage: 47,505,684,008
- Largest tensor: elemNum 4,304,672,100 (10 bonds, labels 13, 14, 15, 17, 18, 20, 23, 24, 28, 29)

With the contraction tree found by the heuristic search:

- Memory requirement: 6,810,336,800
- Sum of memory usage: 16,039,257,608
- Largest tensor: elemNum 425,152,800 (9 bonds, labels 10, 12, 13, 14, 17, 20, 23, 24, 28)

Memory requirement: 6,810,336,800 / 37,846,846,304 = 0.180
Total memory usage: 16,039,257,608 / 47,505,684,008 = 0.338

[Figure: memory usage vs bond dimension (χ8 → χ7). Courtesy: Marc Ziegler @ U Mainz]

Code Sample: MERA
Ascending operator: renormalization of a two-site operator. One layer of the MERA is applied, with isometries w1, w2 (and w1†, w2†), disentanglers u, u†, and internal legs labeled 0–11.

Define quantum numbers:

```cpp
#include "uni10.hpp"

// Define quantum numbers
Qnum q10(1, PRT_EVEN);
Qnum q_10(-1, PRT_EVEN);
Qnum q30(3, PRT_EVEN);
Qnum q11(PRTF_ODD, 1, PRT_EVEN);
Qnum q_11(PRTF_ODD, -1, PRT_EVEN);
Qnum q_31(PRTF_ODD, -3, PRT_EVEN);
```

Assign quantum numbers to bonds:

```cpp
// Assign quantum numbers to bonds
vector<Qnum> qnums;
qnums.push_back(q10);
qnums.push_back(q_11);
Bond bdr(BD_IN, qnums);
Bond bdc(BD_OUT, qnums);

vector<Bond> bonds;
bonds.push_back(bdr);
bonds.push_back(bdr);
bonds.push_back(bdc);
bonds.push_back(bdc);
```

0. . "Ob"). -1.addRawElem(H_elem). 1.0/2. 0.0/4. UniTensor H0(bonds. 0. 1. 1. -1. 0. 9 † w1 w2† !3 !4 Define Hamiltonian operator 0. 0.0/4.0/2.0/4. 0.Code Sample: MERA !1 !2 w1 w2 3 1 0 2 4 5 6 u 7 8 u † double H_elem[] = {1. 0.0/4}. 0. 10 11 0. H0.

Define w and its transpose:

```cpp
// Define w (isometry) and its transpose
vector<Qnum> qnums1;
qnums1.push_back(q30);
qnums1.push_back(q11);
qnums1.push_back(q11);
qnums1.push_back(q11);
qnums1.push_back(q_10);
qnums1.push_back(q_10);
qnums1.push_back(q_10);
qnums1.push_back(q_31);
Bond bdr1(BD_IN, qnums1);

bonds.clear();
bonds.push_back(bdr1);
bonds.push_back(bdc);
bonds.push_back(bdc);
bonds.push_back(bdc);

UniTensor W1(bonds, "W1");
W1.orthoRand();
UniTensor W1T = W1;
W1T.transpose();
```

Assign tensors into the network:

```cpp
// Assign tensors into the network
vector<UniTensor*> tens;
tens.push_back(&W1);
tens.push_back(&W2);
tens.push_back(&W1T);
tens.push_back(&W2T);
tens.push_back(&U);
tens.push_back(&UT);
tens.push_back(&H0);

Network asdL("AscendL.net", tens);
UniTensor H1;
H1 = asdL.launch();
```

AscendL.net

```
W1: -1; 0 1 3
W2: -2; 7 10 11
U: 3 7; 4 8
Ob: 1 4; 2 5
UT: 5 8; 6 9
W1T: 0 2 6; -3
W2T: 9 10 11; -4
TOUT: -1 -2; -3 -4
ORDER: (W1T ((W1 (Ob ((W2 * W2T) (U * UT))))))
```

Reuse the network:

```cpp
// Reuse the network
asdL.putTensor(0, &W1);
asdL.putTensor(1, &W2);
asdL.putTensor(2, &U);
asdL.putTensor(3, &H0);
asdL.putTensor(4, &UT);
asdL.putTensor(5, &W1T);
asdL.putTensor(6, &W2T);
H2 = asdL.launch();
```

The whole ascending operator takes fewer than 100 lines of C++!

pyUni10
- Programming in C++ is too hardcore; we need something that even a layman knows how to program.
- Take advantage of the beauty of Python.
- pyUni10 wraps the power of Uni10 in an easy-to-program Python interface.

Acceleration with GPU
The same contraction (e.g. A * B) runs on either the CPU or the GPU.
Unified API: switching from the CPU to the GPU version requires no change in your functions.

Uni10 Summary
- Geared toward tensor network algorithms
- Reduces development time
- Behind-the-scenes optimization
- Support for Python (pyUni10) and Matlab (soon)
- To do:
  - GUI tools for network generation
  - Non-abelian symmetries: SU(2)
  - OpenCL, MIC, multi-GPU/multi-node support
  - Better search algorithm for contraction order (Tensor Contraction Engine, etc.)

http://uni10.org


[Figure 2-2 from the CUDA C Programming Guide, Chapter 2 (Programming Model): the memory hierarchy — per-thread local memory, per-block shared memory, and global memory shared across grids of thread blocks.]