Professional Documents
Culture Documents
Stéphane Vialle
Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy
Numerical algorithm
5
3 important issues to get optimized code on 1 core
6
1 - Three important issues to get
optimized code on one CPU core
CPU
Res Fast read and write
9
3 important issues to get optimized code on 1 core
x y z x y z x y z Tab
… RAM
… RAM
for (i=0; i<N; i++) {
Structure of arrays most adapted to this code:
Sx += pts.TabX[i];
‘𝑥 coordinates’ are contiguously stored in
}
RAM, and contiguously copied/stored in
MoyX = Sx/N;
cache: any valid value in cache should be
…
useful
11
1 - Three important issues to get
optimized code on one CPU core
13
3 important issues to get optimized code on 1 core
4 successive 1 vector
scalar operations operation
x4
Instru Instru
-ctions ALU -ctions ALUs
14
3 important issues to get optimized code on 1 core
• auto-vectorization:
Write clean standard code with respect to all vectorization constraints
and ask to the compiler to vectorize the code (compilation option)
for (j = 0; j < 1000; j++)
StaticTab[j] = …
gcc –O3 pgm.c –o pgm
16
3 important issues to get optimized code on 1 core
17
3 important issues to get optimized code on 1 core
19
Exp. on an Intel Xeon Haswell (2018)
21
« Cache blocking / Tiling »
Exemple of optimized matrix product: B
1. Read one row of A and one row of
Bt
Bt and compute one element of C
2. Compute the « next » element of C:
same line, next column
Avoids many cache misses and
supports auto-vectorization
But cache memory is small:
it can not store en entire A line, nor an entire B column!
Reads A[0] row … and reads each row of Bt
Reads A[1] row … and reads again each row of Bt
….
Always intensive use of the « memory ↔ cache » bus
(in spite of code optimizations)
Bottleneck when using multiple cores in the processor 22
« Cache blocking / Tiling »
= × + × + ×
Questions ?
24