ST7: High Performance Simulation

Serial optimizations and vectorization
(programming close to the architecture)

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy
Numerical algorithm

Optimized code on one core:


- Optimized compilation
- Serial optimizations
- Vectorization

Parallelized code on one node


- Multithreading
- Minimal/relaxed synchronization
- Data-Computation localization
- NUMA effect avoidance

Distributed code on a cluster:


- Message passing across processes
- Load balanced computations
- Minimal communications
- Overlapped computations and comms
2
1 - Three important issues to get
optimized code on one CPU core

1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances


3
3 important issues to get optimized code on 1 core

Compile with optimization options


Every compiler offers options for producing optimized executable code:
• requiring less memory
• or running faster

With Linux C/C++ compilers (ex: gcc/icc):
-O0:    no optimization
-O1:    1st level of standard optimizations
-O2:    2nd level of standard optimizations
-O3:    3rd level of standard optimizations
        - usual level for HPC codes
        - enables « vectorization »
-Ofast: more aggressive, better vectorization
        … sometimes wrong! Check your code!
+ many specific optimization options:
        Ex: gcc –funroll-loops... (see further)

With Windows compilers:
• similar and different options
• in the configuration menu of Visual Studio (for example)

 gcc –O3 –funroll-loops pgm.c –o pgm
4
3 important issues to get optimized code on 1 core

Compile with optimization options


Every compiler offers options for producing optimized executable code :
• Requiring less memory
• Or running faster

But an optimized compilation:


• Lasts longer
  Ex: the Intel compiler takes a long time attempting to vectorize the
  code, but produces very helpful vectorization reports!
• Requires a higher quality of the source code
  Some source codes compile and run successfully with -O0 or -O1,
  but fail with –O3 (or –Ofast)!
  (especially when tinkering with data alignment in memory…)

5
3 important issues to get optimized code on 1 core

Compile with optimization options


Every compiler offers options for producing optimized executable code :
• Requiring less memory
• Or running faster

HPC compilation rules:


1. Try to use standard compiler optimization options (ex: -O3)
2. Enhance your source code to accept standard optimization options
3. Examine and experiment with the specific optimization options of
   your compiler.

6
1 - Three important issues to get
optimized code on one CPU core

1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances


7
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Principles of the cache memory (CPU ⇄ cache ⇄ RAM):

• Cache memory: small and fast, large surface/byte, energy consuming
   fast reads and writes by the CPU
• RAM: large and slow, small surface/byte, not energy consuming
   data (e.g. arrays TabIn/TabOut) move between RAM and cache by
  efficient loads and writes of one full cache line at a time
8
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Memory hierarchy:

• L1 and L2 cache memory are « internal CPU cache »


• L3 cache is external, or internal, or does not exist: depends on the CPU

9
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Example of inefficient data storage and accesses:

typedef struct {        /* like classes & objects */
    double x, y, z;
} Point;
Point Tab[N];

double Sx = 0.0;
double MoyX;
…
for (i = 0; i < N; i++) {
    Sx += Tab[i].x;
}
MoyX = Sx/N;

Array of structures (objects): two ‘x coordinates’ are separated by two
other values (y and z), unused in this computation but taking space in
cache memory.
 In RAM, Tab is laid out as x y z x y z x y z …
 1 cache line of 4 data holds only 2 useful data (x)
10
3 important issues to get optimized code on 1 core

Take advantage of cache memory


Example of efficient data storage and accesses:

typedef struct {        /* like classes & objects */
    double TabX[N];
    double TabY[N];
    double TabZ[N];
} Points;
Points pts;

double Sx = 0.0;
double MoyX;
…
for (i = 0; i < N; i++) {
    Sx += pts.TabX[i];
}
MoyX = Sx/N;

Structure of arrays, the layout most adapted to this code: the
‘x coordinates’ are contiguously stored in RAM, and contiguously copied
into cache: every valid value in cache is useful.
 In RAM, pts is laid out as x x x x … y y y y … z z z z
 1 cache line of 4 data holds 4 useful data
11
1 - Three important issues to get
optimized code on one CPU core

1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances


12
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Vector computing:

Each CPU core is a small vector machine: 1 instruction can be applied
on vectors of input data and can produce a vector of output data.

 One instruction can be executed in parallel by different ALUs, on
different data (the AVX units).

13
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Scalar vs vector operations:

• 4 successive scalar operations: the instruction is issued 4 times, to one ALU
• 1 vector operation: the instruction is issued once, to 4 ALUs  ×4
14
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Requirements to enable vector operations:
• Input and output data must be stored in arrays at contiguous indexes
  (compliant with an efficient data storage based on cache memory usage)
• The internal loop must have:
  - identical operations:
    - no if-then-else in the loop body
    - but if-then is possible, as an ALU can do the operation or nothing
  - independent operations:
    - executable in any order
      res[i] = input[i] + res[i-1] : impossible to vectorize
    - writes to different variables  reductions must be adapted
      res += input[i]*input[i] : limits/stops the vectorization
• Arrays must not overlap (« no aliasing »), otherwise operations cannot
  be vectorized / parallelized
15
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


How to write vector code (instead of scalar code)?
• Implement explicit vector code using an « intrinsics » low-level library…
  (see the 3A SI computer-architecture course)
• Add directives to guide the compiler to generate vector code:
  Ex: #pragma vector
      #pragma ivdep
      for (j = 0; j < n; j++)
          DynamicTab[j] = …
• Auto-vectorization:
  Write clean standard code with respect to all vectorization constraints,
  and ask the compiler to vectorize the code (compilation option)
      for (j = 0; j < 1000; j++)
          StaticTab[j] = …
      gcc –O3 pgm.c –o pgm
16
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Example of loop re-implementation compliant with auto-vectorization:

C = A × B (dense matrix product, C[i][j] = Σk A[i][k]·B[k][j])

for (int i = 0; i < N; i++)           // i-j-k order:
    for (int j = 0; j < N; j++)       // bad use of cache memory (column
        for (int k = 0; k < N; k++)   // accesses to B), and the reduction
            C[i][j] += A[i][k]*B[k][j]; // on C[i][j] can stop vectorization

for (int i = 0; i < n; i++)           // i-k-j order: the inner loop sweeps
    for (int k = 0; k < n; k++)       // contiguous rows of B and C
        for (int j = 0; j < n; j++)
            C[i][j] += A[i][k]*B[k][j];
17
3 important issues to get optimized code on 1 core

Take advantage of vector computing units


Example of loop re-implementation compliant with auto-vectorization:

for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            C[i][j] += A[i][k]*B[k][j];

… and ask the compiler to do its best to optimize and vectorize:

gcc –O3 –o pgm pgm.c
gcc –Ofast –o pgm pgm.c (2022)

But… aggressive optimization can lead to runtime errors!
18
1 - Three important issues to get
optimized code on one CPU core
1.1 – Compile with optimization options

1.2 – Implement data storage and data accesses taking advantage of cache memory hierarchy

1.3 – Implement sequences of instructions taking advantage of vector computing units

1.4 – Experimental performances

19
Exp. on an Intel Xeon Haswell (2018)

Dense matrix product: 4096×4096, double precision

Processor: Intel Xeon Haswell E5-2637 v3 (2014)
(4 physical cores – 2 threads/core)

Seq. naive –O0:                        0.12 Gflops   ×1.0
Seq. naive –O3:                        0.35 Gflops   ×2.9
–O3 + optimized code + vectorization:  3.10 Gflops   ×25.8   (2 threads/core: best configuration)
BLAS monothread (OpenBLAS):            46.3 Gflops   ×385.8  (1 thread/core: best configuration)

 Use optimized HPC libraries when available
 Optimize your source code when an HPC library does not exist
20
2 – « Cache blocking / tiling » to improve data accesses on multicore CPUs

21
« Cache blocking / Tiling »
Example of an optimized matrix product (using Bt, the transpose of B):
1. Read one row of A and one row of Bt, and compute one element of C
2. Compute the « next » element of C: same row, next column
 Avoids many cache misses and supports auto-vectorization

But cache memory is small:
 it cannot store an entire A row, nor an entire B column!
Reads the A[0] row … and reads each row of Bt
Reads the A[1] row … and reads again each row of Bt
…
 Always intensive use of the « memory ↔ cache » bus
(in spite of code optimizations)
 Bottleneck when using multiple cores in the processor
22
« Cache blocking / Tiling »

Tile-wise product:
C(I,J) tile = A(I,1)×B(1,J) + A(I,2)×B(2,J) + A(I,3)×B(3,J) (sum over the K tiles)

Set the tile size to have: 3 × tile size ≤ cache size

At any time:
• 3 tiles in cache memory
• computation using only the 3 tiles
• maximum re-use of each tile while it is in cache
(+ optimizing the calculation schedule for C tiles)

 Useful for parallel codes on multi-cores, when all cores compete to
use the memory bus and access RAM
23
Serial optimizations and
vectorization
(programming close to the architecture)

Questions ?

24