ST7: High Performance Simulation

Serial optimizations and vectorization
(programming close to the architecture)

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy
Numerical algorithm

Optimized code on one core:


- Optimized compilation
- Serial optimizations
- Vectorization

Parallelized code on one node


- Multithreading
- Minimal/relaxed synchronization
- Data-Computation localization
- NUMA effect avoidance

Distributed code on a cluster:


- Message passing across processes
- Load balanced computations
- Minimal communications
- Overlapped computations and comms
2
1 - Three important issues to get
optimized code on one CPU core

1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances


3
3 important issues to get optimized code on 1 core

Compile with optimization options


Every compiler offers options for producing optimized executable code:
• requiring less memory
• or running faster

With Linux C/C++ compilers (ex: gcc/icc):
-O0:    no optimization
-O1:    1st level of standard optimizations
-O2:    2nd level of standard optimizations
-O3:    3rd level of standard optimizations
        - usual level for HPC codes
        - enables « vectorization »
-Ofast: more aggressive, better vectorization
        … sometimes wrong! Check your code!
+ many specific optimization options:
        Ex: gcc –funroll-loops... (see further)

With Windows compilers:
• similar and different options
• in the configuration menu of Visual Studio (for example)

 gcc –O3 –funroll-loops pgm.c –o pgm
4
3 important issues to get optimized code on 1 core

Compile with optimization options


Every compiler offers options for producing optimized executable code :
• Requiring less memory
• Or running faster

But an optimized compilation:


• Lasts longer
  Ex: the Intel compiler takes a long time attempting to vectorize the
  code, but produces very helpful vectorization reports!
• Requires a higher quality of the source code
  Some source codes compile and run successfully with -O0 or -O1,
  but fail with –O3 (or –Ofast)!
  (especially when tinkering with data alignment in memory…)

5
3 important issues to get optimized code on 1 core

Compile with optimization options


Every compiler offers options for producing optimized executable code :
• Requiring less memory
• Or running faster

HPC compilation rules:


1. Try to use standard compiler optimization options (ex: -O3)
2. Enhance your source code to accept standard optimization options
3. Examine and experiment with the specific optimization options of
   your compiler.

6
1 - Three important issues to get
optimized code on one CPU core

1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances


7
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Principles of the cache memory (CPU ⇄ cache ⇄ RAM):

• Cache memory: small and fast, large surface/byte, energy consuming
   fast reads and writes by the CPU
• RAM: large and slow, small surface/byte, not energy consuming
   data (e.g. arrays TabIn/TabOut) move between RAM and cache by
  efficient loads and writes of one full cache line at a time
8
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Memory hierarchy:

• L1 and L2 cache memory are « internal CPU cache »


• L3 cache is external, or internal, or does not exist: depends on the CPU

9
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Example of inefficient data storage and accesses:

typedef struct {        /* like classes & objects */
    double x, y, z;
} Point;
Point Tab[N];

double Sx = 0.0;
double MoyX;
…
for (i = 0; i < N; i++) {
    Sx += Tab[i].x;
}
MoyX = Sx/N;

Array of structures (objects): two ‘x coordinates’ are separated by two
other values (y and z), unused in this computation but taking space in
cache memory.
 In RAM, Tab is laid out as x y z x y z x y z …
 1 cache line of 4 data holds only 2 useful data (x)
10
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Example of efficient data storage and accesses:

typedef struct {        /* like classes & objects */
    double TabX[N];
    double TabY[N];
    double TabZ[N];
} Points;
Points pts;

double Sx = 0.0;
double MoyX;
…
for (i = 0; i < N; i++) {
    Sx += pts.TabX[i];
}
MoyX = Sx/N;

Structure of arrays, the layout most adapted to this code: the
‘x coordinates’ are contiguously stored in RAM, and contiguously copied
into cache: every valid value in cache is useful.
 In RAM, pts is laid out as x x x x … y y y y … z z z z
 1 cache line of 4 data holds 4 useful data
11
1 - Three important issues to get
optimized code on one CPU core

1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances


12
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Vector computing:

Each CPU core is a small vector machine: 1 instruction can be applied
on vectors of input data and can produce a vector of output data.

 One instruction can be executed in parallel by different ALUs, on
different data (the AVX units).

13
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Scalar vs vector operations:

• 4 successive scalar operations: the instruction is issued 4 times, to one ALU
• 1 vector operation: the instruction is issued once, to 4 ALUs  ×4
14
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Requirements to enable vector operations:
• Input and output data must be stored in arrays at contiguous indexes
  (compliant with an efficient data storage based on cache memory usage)
• The internal loop must have:
  - identical operations:
    - no if-then-else in the loop body
    - but if-then is possible, as an ALU can do the operation or nothing
  - independent operations:
    - executable in any order
      res[i] = input[i] + res[i-1] : impossible to vectorize
    - writes to different variables  reductions must be adapted
      res += input[i]*input[i] : limits/stops the vectorization
• Arrays must not overlap (« no aliasing »), otherwise operations cannot
  be vectorized / parallelized
15
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


How to write vector code (instead of scalar code)?
• Implement explicit vector code using an « intrinsics » low-level library…
  (see the 3A SI computer-architecture course)
• Add directives to guide the compiler to generate vector code:
  Ex: #pragma vector
      #pragma ivdep
      for (j = 0; j < n; j++)
          DynamicTab[j] = …
• Auto-vectorization:
  Write clean standard code with respect to all vectorization constraints,
  and ask the compiler to vectorize the code (compilation option)
      for (j = 0; j < 1000; j++)
          StaticTab[j] = …
      gcc –O3 pgm.c –o pgm
16
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Example of loop re-implementation compliant with auto-vectorization:

C = A × B (dense matrix product, C[i][j] = Σk A[i][k]·B[k][j])

for (int i = 0; i < N; i++)           // i-j-k order:
    for (int j = 0; j < N; j++)       // bad use of cache memory (column
        for (int k = 0; k < N; k++)   // accesses to B), and the reduction
            C[i][j] += A[i][k]*B[k][j]; // on C[i][j] can stop vectorization

for (int i = 0; i < n; i++)           // i-k-j order: the inner loop sweeps
    for (int k = 0; k < n; k++)       // contiguous rows of B and C
        for (int j = 0; j < n; j++)
            C[i][j] += A[i][k]*B[k][j];
17
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Example of loop re-implementation compliant with auto-vectorization:

for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            C[i][j] += A[i][k]*B[k][j];

… and ask the compiler to do its best to optimize and vectorize:

gcc –O3 –o pgm pgm.c
gcc –Ofast –o pgm pgm.c (2022)

But… aggressive optimization can lead to runtime errors!
18
1 - Three important issues to get
optimized code on one CPU core
1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances

19
Exp. on an Intel Xeon Haswell (2018)

Dense matrix product: 4096×4096, double precision

Processor: Intel Xeon Haswell E5-2637 v3 (2014)
(4 physical cores – 2 threads/core)

Seq. naive –O0:                        0.12 Gflops   ×1.0
Seq. naive –O3:                        0.35 Gflops   ×2.9
–O3 + optimized code + vectorization:  3.10 Gflops   ×25.8   (2 threads/core: best configuration)
BLAS monothread (OpenBLAS):            46.3 Gflops   ×385.8  (1 thread/core: best configuration)

 Use optimized HPC libraries when available
 Optimize your source code when an HPC library does not exist
20
2 – « Cache blocking / tiling » to improve data accesses on multicore CPUs

21
« Cache blocking / Tiling »
Example of an optimized matrix product (using Bt, the transpose of B):
1. Read one row of A and one row of Bt, and compute one element of C
2. Compute the « next » element of C: same row, next column
 Avoids many cache misses and supports auto-vectorization

But cache memory is small:
 it cannot store an entire A row, nor an entire B column!
Reads the A[0] row … and reads each row of Bt
Reads the A[1] row … and reads again each row of Bt
…
 Always intensive use of the « memory ↔ cache » bus
(in spite of code optimizations)
 Bottleneck when using multiple cores in the processor
22
« Cache blocking / Tiling »

Tile-wise product:
C(I,J) tile = A(I,1)×B(1,J) + A(I,2)×B(2,J) + A(I,3)×B(3,J) (sum over the K tiles)

Set the tile size to have: 3 × tile size ≤ cache size

At any time:
• 3 tiles in cache memory
• computation using only the 3 tiles
• maximum re-use of each tile while it is in cache
(+ optimizing the calculation schedule for C tiles)

 Useful for parallel codes on multi-cores, when all cores compete to
use the memory bus and access RAM
23
Serial optimizations and
vectorization
(programming close to the architecture)

Questions ?

24