Professional Documents
Culture Documents
ch
@spcl_eth
Sor&ng Networks
and the LLVM Vector IR
2
spcl.inf.ethz.ch
@spcl_eth
Sor&ng Networks
3
spcl.inf.ethz.ch
@spcl_eth
N 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Depth 0 1 3 3 5 5 6 6 7 7 8 8 9 9 9 9 10
Size, 0 1 3 5 9 12 16 19 25 29 35 39 45 51 56 60 71
upper
bound
Size, 33 37 41 45 49 53 58
lower
bound
4
spcl.inf.ethz.ch
@spcl_eth
(0,2) (1,3)
(0,1) (2,3)
(1,2) (0,3)
5
spcl.inf.ethz.ch
@spcl_eth
6
spcl.inf.ethz.ch
@spcl_eth
%shuffle4 = shufflevector <2 x double> %0, <2 x double> %1, <2 x i32> <i32 0, i32 2>
%shuffle5 = shufflevector <2 x double> %0, <2 x double> %1, <2 x i32> <i32 1, i32 3>
%2 = tail call <2 x double> @llvm.x86.sse2.min.pd(<2 x double> %shuffle4, <2 x double> %shuffle5) #9
%3 = tail call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %shuffle4, <2 x double> %shuffle5) #9
%shuffle8 = shufflevector <2 x double> %2, <2 x double> %3, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
%shuffle9 = shufflevector <4 x double> %shuffle8, <4 x double> undef, <2 x i32> <i32 1, i32 0>
%shuffle10 = shufflevector <4 x double> %shuffle8, <4 x double> undef, <2 x i32> <i32 2, i32 3>
%4 = tail call <2 x double> @llvm.x86.sse2.min.pd(<2 x double> %shuffle9, <2 x double> %shuffle10) #9
%5 = tail call <2 x double> @llvm.x86.sse2.max.pd(<2 x double> %shuffle9, <2 x double> %shuffle10) #9
%shuffle13 = shufflevector <2 x double> %4, <2 x double> %5, <4 x i32> <i32 1, i32 0, i32 2, i32 3>
ret <4 x double> %shuffle13 } 7
spcl.inf.ethz.ch
@spcl_eth
Assembly Code
10
spcl.inf.ethz.ch
@spcl_eth
11
spcl.inf.ethz.ch
@spcl_eth
§ Validity
§ Innermost loop must be parallel (or behave aVer vectorizaRon as if it was)
§ No aliasing between different arrays
§ Profitability
§ Memory accesses must be “stride-one”
or
§ ComputaRonal cost must dominate the loop
12
spcl.inf.ethz.ch
@spcl_eth
§ Validity
§ Innermost loop must be parallel (or behave aVer vectorizaRon as if it was)
§ No aliasing between different arrays
§ Profitability
§ Memory accesses must be “stride-one”
or
§ ComputaRonal cost must dominate the loop
13
spcl.inf.ethz.ch
@spcl_eth
for (int i = 0; i <= n; i++) YES, the arrays are different objects
B[i] += A[i];
14
spcl.inf.ethz.ch
@spcl_eth
15
spcl.inf.ethz.ch
@spcl_eth
Pointer-to-Pointer Arrays
int[ ][ ] A = new int[N][M];
int[ ][ ] B = new int[N][M]; A[0] A[0][1] A[0][2] A[0][2] A[0][3] A[0][4]
16
spcl.inf.ethz.ch
@spcl_eth
17
spcl.inf.ethz.ch
@spcl_eth
18
spcl.inf.ethz.ch
@spcl_eth
19
spcl.inf.ethz.ch
@spcl_eth
/// \return The cost of a shuffle instruction of kind Kind and of type Tp.
/// The index and subtype parameters are used by the subvector insertion and
/// extraction shuffle kinds.
int getShuffleCost(ShuffleKind Kind, Type *Tp, int Index = 0,
Type *SubTp = nullptr) const; 20
spcl.inf.ethz.ch
@spcl_eth
LoopAccessAnalysis
21
spcl.inf.ethz.ch
@spcl_eth
22
spcl.inf.ethz.ch
@spcl_eth
Clang 3.5 Clang 3.6 Clang 3.7 Clang 3.8 Clang 3.9 Clang 4.0
23
spcl.inf.ethz.ch
@spcl_eth
24
spcl.inf.ethz.ch
@spcl_eth
Components
Backends
Frontends RunTime Libraries
AMDGPU
OpenCL OpenMP 4.0 / 4.5
NVPTX
CUDA Libclc
StreamExecutor
End-to-end Tools
25
spcl.inf.ethz.ch
@spcl_eth
26
spcl.inf.ethz.ch
@spcl_eth
/* Test kernel */
__kernel void test(__global float *in, __global float *out)
{
int index = get_global_id(0);
out[index] = 3.14159f * in[index] + in[index];
}
clang –S –emit-llvm test.cl –o -
27
spcl.inf.ethz.ch
@spcl_eth
28
spcl.inf.ethz.ch
@spcl_eth
!opencl.kernels = !{!0}
!llvm.ident = !{!6}
Address Spaces
0 1 2 3 4 5
Generic Global Region Local Constant Private
30
spcl.inf.ethz.ch
@spcl_eth
§ Support:
31
spcl.inf.ethz.ch
@spcl_eth
SPIR-V
32
spcl.inf.ethz.ch
@spcl_eth
33
spcl.inf.ethz.ch
@spcl_eth
34
spcl.inf.ethz.ch
@spcl_eth
Kernel IR PTX
Binary
Program.cu Mixed AST
35
spcl.inf.ethz.ch
@spcl_eth
36
spcl.inf.ethz.ch
@spcl_eth
Quasi-Affine Expression
37
spcl.inf.ethz.ch
@spcl_eth
Presburger Formula
§ Base
§ Boolean Constants (⊤, ⊥)
§ OperaRons
§ Comparisons of quasi-affine expressions
𝑒↓0 ⊕𝑒↓1 ,⊕{ <, ≤, =, ≠, ≥, >}
§ Boolean OperaRons between Presburger Formula
𝑝↓0 ⊗𝑝↓1 ,⊗{ ∧, ∨, ¬, ⇒, ⇐, ⇔}
§ QuanRfied Variables
∃𝑥∣𝑝(𝑥, …)
∀𝑥∣𝑝(𝑥, …)
38
spcl.inf.ethz.ch
@spcl_eth
Presburger Set
Presburger RelaDon
39
spcl.inf.ethz.ch
@spcl_eth
S = (𝑥,𝑦)1≤𝑥, 𝑦≤4∧(𝑥+𝑦≤3∨𝑥+𝑦≥5)
40
spcl.inf.ethz.ch
@spcl_eth
41
spcl.inf.ethz.ch
@spcl_eth
Presburger Arithme&c
§ Benefits
§ Decidable
§ Closed under common operaRons
∩, ∪, ∖, proj, ∘, not transiRve hull
Þ Precise results
§ ComputaRonal Complexity
§ Some operaRons double-exponenRal (in dimensions)
§ OVen lower complexity for bounded dimension
42
spcl.inf.ethz.ch
@spcl_eth
§ Does 𝑥↑3 +𝑦↑3 +𝑧↑3 =29 have a soluRon? Yes, (3, 1, 1)!
§ Does 𝑥 ↑3 +𝑦↑3 +𝑧↑3 =33 have a soluRon? No, this is unknown today!
Note: No general algorithm for solving polynomial equaRons over integers exists!
(Hilbert’s 10th problem)
44
spcl.inf.ethz.ch
@spcl_eth
45
spcl.inf.ethz.ch
@spcl_eth
i≤N=4
for (i = 0; i <= N; i++)
5
i = 4, j =i =0 4, j =i =14, ji == 24, j =i 3= 4, j = 4
i
4
for (j = 0; j <= i; j++) i = 3, j =i 0
= 3, j =i =13, ji == 23, j = 3
S(i,j); 3
0 2≤ j i = 2, j =i 0= 2, j =i =12, j = 2
i = 1, j =i 0= 1, j = 1
1
j≤i
i = 0, j = 1
0
N=4
00≤ i 1 2 3 4 5
D = { (i,j) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i }
46
spcl.inf.ethz.ch
@spcl_eth
§ Structured Control
§ IF-condiRons
§ Counted FOR-loops (Fortran style)
§ MulR-dimensional array accesses (and scalars)
§ Loop-condiRons and IF-condiRons are Presburger Formula
§ Loop increments are constant (non-parametric)
§ Array subscript expressions are piecewise-affine
47
spcl.inf.ethz.ch
@spcl_eth
• Access Relation
• Reads: {𝑆(𝑖,𝑗)→𝐴(𝑖,𝑗);𝑆(𝑖,𝑗)→𝐴(𝑖, 𝑗+1)}
• Writes: {𝑆(𝑖,𝑗)→𝐵(𝑖,𝑗)}
48
spcl.inf.ethz.ch
@spcl_eth
Model
𝐼↓𝑆 =𝑆(𝑖,𝑗)0≤𝑖≤𝑛∧0≤𝑗≤𝑖
𝜃↓𝑆 ={𝑆 (𝑖,𝑗)→(𝑖,𝑗)} →(⌊𝑖/4 ⌋, 𝑗, 𝑖 𝑚𝑜𝑑 4)
Code
for (i = 0; i <= n; i++)
for (j = 0; j <= i; j++)
S(i, j);
49
spcl.inf.ethz.ch
@spcl_eth
Model
𝐼↓𝑆 =𝑆(𝑖,𝑗)0≤𝑖≤𝑛∧0≤𝑗≤𝑖
𝜃↓𝑆 ={𝑆 (𝑖,𝑗)→(𝑖,𝑗)} →(⌊𝑖/4 ⌋, 𝑗, 𝑖 𝑚𝑜𝑑 4)
Code
for (c0 = 0; c0 <= n; c0++)
for (c1 = 0; c1 <= c0; c1++)
S(c0, c1);
50
spcl.inf.ethz.ch
@spcl_eth
Model
𝐼↓𝑆 =𝑆(𝑖,𝑗)0≤𝑖≤𝑛∧0≤𝑗≤𝑖
𝜃↓𝑆 ={𝑆 (𝑖,𝑗)→(𝑗, 𝑖)} →(⌊𝑖/4 ⌋, 𝑗, 𝑖 𝑚𝑜𝑑 4)
Code
for (c0 = 0; c0 <= n; c0++)
for (c1 = c0; c1 <= n; c1++)
S(c1, c0);
51
spcl.inf.ethz.ch
@spcl_eth
Model
𝐼↓𝑆 =𝑆(𝑖,𝑗)0≤𝑖≤𝑛∧0≤𝑗≤𝑖
𝜃↓𝑆 ={𝑆 (𝑖,𝑗)→(⌊𝑖/4 ⌋, 𝑗, 𝑖 𝑚𝑜𝑑 4)}
Code
for (c0 = 0; c0 <= floord(n, 4); c0++)
for (c1 = 0; c1 <= min(n, 4 * c0 + 3); c1++)
for (c2 = max(0, -4 * c0 + c1);
c1 <= min(3, n - 4 * c0); c2++)
S(4 * c0 + c2, c1);
52
spcl.inf.ethz.ch
@spcl_eth
Model
𝐼↓𝑆 =𝑆(𝑖,𝑗)0≤𝑖≤𝑛∧0≤𝑗≤𝑖
𝜃↓𝑆 ={𝑆 (𝑖,𝑗)→(⌊𝑖/4 ⌋, ⌊𝑗/4 ⌋, 𝑖 𝑚𝑜𝑑 4,𝑗 𝑚𝑜𝑑 4)}
Code
for (c0 = 0; c0 <= floord(n, 4); c0++)
for (c1 = 0; c1 <= c0; c1++)
for (c2 = 0; c2 <= min(3, n - 4 * c0); c2++)
for (c3 = 0; c3 <= min(3, 4 * c0 – 4 * c1 + c2); c3++)
S(4 * c0 + c2, 4 * c1 + c3);
53
spcl.inf.ethz.ch
@spcl_eth
54
spcl.inf.ethz.ch
@spcl_eth
// Original Loop
for (i = 0; i <= n; i+=1) 𝐷↓𝐼 =𝑆(𝑖)0≤𝑖≤𝑛
S(i); 𝑆↓𝑂𝑟𝑖𝑔 ={ 𝑆(𝑖)→(𝑖)}
𝑆↓𝑇 ={ 𝑆(𝑖)→(𝑛 −𝑖)}
// Transformed Loop
for (i = n; i >= 0; i-=1)
S(i);
55
spcl.inf.ethz.ch
@spcl_eth
// Original Loop
for (i = 0; i <= n; i+=1)
for (j = 0; j <= n; j+=1) 𝐷↓𝐼 = 𝑆(𝑖,𝑗)0≤𝑖,𝑗≤𝑛
S(i,j); 𝑆↓𝑂𝑟𝑖𝑔 ={ 𝑆(𝑖,𝑗)→(𝑖,𝑗)}
𝑆↓𝑇 ={𝑆(𝑖,𝑗)→(𝑗,𝑖)
// Transformed Loop
for (j = 0; j <= n; j+=1)
for (i = 0; i <= n; i+=1)
S(i,j);
56
spcl.inf.ethz.ch
@spcl_eth
// Original Loop
for (i = 0; i <= n; i+=1)
S1(i);
for (i = 0; i <= n; i+=1) 𝐷↓𝐼 = 𝑆(𝑖)0≤𝑖≤𝑛𝑇(𝑖)∣0≤𝑗≤𝑛
𝑆↓𝑂𝑟𝑖𝑔 ={ 𝑆(𝑖)→(0,𝑖);𝑇(𝑖)→(1,𝑖)}
S2(i); 𝑆↓𝑇 ={𝑆(𝑖)→(𝑖, 0);𝑇(𝑖)→(𝑖,1)}
// Transformed Loop
for (i = 0; i <= n; i+=1) {
S1(i);
S2(i);
}
57
spcl.inf.ethz.ch
@spcl_eth
// Transformed Loop
for (i = 0; i <= n; i+=1)
S1(i);
for (i = 0; i <= n; i+=1)
S2(i);
58
spcl.inf.ethz.ch
@spcl_eth
// Original Loop
for (i = 0; i <= n; i+=1)
for (j = 0; j <= n; j+=1)
𝐷↓𝐼 = 𝑆(𝑖,𝑗)0≤𝑖,𝑗≤𝑛
S(i,j); 𝑆↓𝑂𝑟𝑖𝑔 ={ 𝑆(𝑖,𝑗)→(𝑖,𝑗)}
𝑆↓𝑇 ={𝑆(𝑖,𝑗)→(𝑖,𝑖+𝑗)
// Transformed Loop
for (i = 0; i <= n; i+=1)
for (j = i+1; j <=n+i; j+=1)
S(i,j);
59
spcl.inf.ethz.ch
@spcl_eth
// Original Loop
for (i = 0; i <= 1024; i+=1)
𝐷↓𝐼 = 𝑆(𝑖)0≤𝑖≤𝑛
S(i); 𝑆↓𝑂𝑟𝑖𝑔 ={ 𝑆(𝑖)→(𝑖)}
𝑆↓𝑇 ={𝑆(𝑖)→(⌊𝑖/4 ⌋, 𝑖)}
// Transformed Loop
for (i = 0; i <= 1024; i+=4)
for (ii = i; ii <= i+3; ii+=1)
S(ii);
60
spcl.inf.ethz.ch
@spcl_eth
// Original Loop
for (i = 0; i <= 1024; i+=1)
for (j = 0; j <= 1024; j+=1)
𝐷↓𝐼 = 𝑆(𝑖,𝑗)0≤𝑖,𝑗≤𝑛
S(i,j); 𝑆↓𝑂𝑟𝑖𝑔 ={ 𝑆(𝑖,𝑗)→(𝑖,𝑗)}
𝑆↓𝑇 ={𝑆(𝑖)→(⌊𝑖/4 ⌋,⌊𝑗/4 ⌋,𝑖,𝑗)}
// Transformed Loop
for (i = 0; i <= 1024; i+=8)
for (j = 0; j <= 1024; j+=8)
for (ii = i; ii <= i+8; ii+=1)
for (jj = j; jj <= j+8; j+=1)
S(ii, jj);
61
spcl.inf.ethz.ch
@spcl_eth
1. Conflic&ng Accesses
Two statement instance access the same memory locaRon
2. Execu&on
Each statement instance is known to be executed
3. At least one write access
Two memory reads do not conflict
62
spcl.inf.ethz.ch
@spcl_eth
1. Conflic&ng Accesses
Two statement instance access the same memory locaRon
2. Execu&on
Each statement instance is known to be executed
3. At least one write access
Two memory reads do not conflict
63
spcl.inf.ethz.ch
@spcl_eth
§ Read-Arer-Write (true)
§ Flow (subset of RAW-dependences that carries data)
§ Write-Arer-Read (an&)
§ Write-Arer-Write (output)
§ Read-Arer-Read
64
spcl.inf.ethz.ch
@spcl_eth
§ Distance Vectors
Dependences are given through their integer distance D(1, 0, -1)
§ Presburger Sets
Dependences are described as Presburger RelaRons
{(𝐼,𝐽,𝐾)→(𝐼+1, 𝐽, 𝐾−1)∣0≤𝐼, 𝐽,𝐾≤𝑁}
65
spcl.inf.ethz.ch
@spcl_eth
Invariants on Dependences
66
spcl.inf.ethz.ch
@spcl_eth
Validity of a Schedule
67
spcl.inf.ethz.ch
@spcl_eth
68
spcl.inf.ethz.ch
@spcl_eth
Parallel Loops
69
spcl.inf.ethz.ch
@spcl_eth
A loop carried write-aVer-read (anR) Transform scalar tmp into an array TMP
dependence prevents parallel execuRon. that contains for each loop iteraRon
private storage.
70
spcl.inf.ethz.ch
@spcl_eth
71
spcl.inf.ethz.ch
@spcl_eth
72
spcl.inf.ethz.ch
@spcl_eth
Tiling of a 1D Stencil
73
spcl.inf.ethz.ch
@spcl_eth
74
spcl.inf.ethz.ch
@spcl_eth
75
spcl.inf.ethz.ch
@spcl_eth
76
spcl.inf.ethz.ch
@spcl_eth
77
spcl.inf.ethz.ch
@spcl_eth
78
spcl.inf.ethz.ch
@spcl_eth
79
spcl.inf.ethz.ch
@spcl_eth
80
spcl.inf.ethz.ch
@spcl_eth
81
spcl.inf.ethz.ch
@spcl_eth
82
spcl.inf.ethz.ch
@spcl_eth
83
spcl.inf.ethz.ch
@spcl_eth
84
spcl.inf.ethz.ch
@spcl_eth
85
spcl.inf.ethz.ch
@spcl_eth
86
spcl.inf.ethz.ch
@spcl_eth
87
spcl.inf.ethz.ch
@spcl_eth
88
spcl.inf.ethz.ch
@spcl_eth
Seman&c Unrolling
89
spcl.inf.ethz.ch
@spcl_eth
Isola&on
90
spcl.inf.ethz.ch
@spcl_eth
Isola&on
91
spcl.inf.ethz.ch
@spcl_eth
Isola&on
92
spcl.inf.ethz.ch
@spcl_eth
Schedule Trees
93
spcl.inf.ethz.ch
@spcl_eth
Schedule Trees
94
spcl.inf.ethz.ch
@spcl_eth
95
spcl.inf.ethz.ch
@spcl_eth
96
spcl.inf.ethz.ch
@spcl_eth
97
spcl.inf.ethz.ch
@spcl_eth
98
spcl.inf.ethz.ch
@spcl_eth
99
spcl.inf.ethz.ch
@spcl_eth
EvaluaRon
AST GeneraRon
100
spcl.inf.ethz.ch
@spcl_eth
101
spcl.inf.ethz.ch
@spcl_eth
102
spcl.inf.ethz.ch
@spcl_eth
103
spcl.inf.ethz.ch
@spcl_eth
104
spcl.inf.ethz.ch
@spcl_eth
105
spcl.inf.ethz.ch
@spcl_eth
106
spcl.inf.ethz.ch
@spcl_eth
107
spcl.inf.ethz.ch
@spcl_eth
108
spcl.inf.ethz.ch
@spcl_eth
ShiVed
CondiRonal
spcl.inf.ethz.ch
@spcl_eth
Polyhedral Unrolling
110
spcl.inf.ethz.ch
@spcl_eth
Polyhedral Unrolling
Two
variables
MulR
Bound
spcl.inf.ethz.ch
@spcl_eth
112
spcl.inf.ethz.ch
@spcl_eth
Heat 2D Heat 3D
113
spcl.inf.ethz.ch
@spcl_eth
114
spcl.inf.ethz.ch
@spcl_eth
115
spcl.inf.ethz.ch
@spcl_eth
116
spcl.inf.ethz.ch
@spcl_eth
Simplify Assump&ons
117
spcl.inf.ethz.ch
@spcl_eth
118
spcl.inf.ethz.ch
@spcl_eth
Op&mis&c Delineariza&on
119
spcl.inf.ethz.ch
@spcl_eth
120
spcl.inf.ethz.ch
@spcl_eth
121
spcl.inf.ethz.ch
@spcl_eth
Integer Overflow
122