
ST7: High Performance Simulation

Multithreading on multicores

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
Multithreading on multicores

1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Threads vs Processes

Multithreaded processes
A sequential process:
• in the RAM of one node
• running on one core

A multithreaded process:
• in the RAM of one node
• running on … one or several cores!

[Figure: the memory space of the process (and of its threads) is shared; each thread (x, y, z) has its own stack and executes the process code; the main thread owns the stack and code of the process]

The process threads will distribute themselves over the resources (RAM
and cores) accessible to the process: the whole node, or part of the node.
Threads vs Processes

Multithreaded processes
A multithreaded process:

[Figure repeated: memory space of the process shared by its threads x, y, z, each with its own stack]

• A process can create threads.

• There are several thread libraries (POSIX threads, OpenMP threads, MPI threads when calling asynchronous communications…).

• A modern OS can manage many threads, even on one core!

• But multiple cores are needed to run the threads simultaneously and obtain a speedup (one or two threads per core is efficient).
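As an illustration of thread creation, here is a minimal POSIX-threads sketch (the thread function hello and the thread count NB_THREADS are illustrative, not from the slides):

#include <stdio.h>
#include <pthread.h>

#define NB_THREADS 4

/* Function executed by each created thread */
void *hello(void *arg) {
    long rank = (long)arg;
    printf("Hello from thread %ld\n", rank);
    return NULL;
}

int main(void) {
    pthread_t th[NB_THREADS];
    /* The process creates its threads... */
    for (long i = 0; i < NB_THREADS; i++)
        pthread_create(&th[i], NULL, hello, (void *)i);
    /* ...and waits for their termination */
    for (long i = 0; i < NB_THREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}

(compile with: gcc -pthread)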
Threads vs Processes

Examples of deployment
One multithreaded process per node:

One multithreaded process per processor (per "socket"):
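As a usage sketch (assuming Open MPI and an OpenMP program ./prog; option names vary between MPI implementations), such deployments can be requested at launch time:

# One multithreaded process per node (e.g. 16 OpenMP threads each)
OMP_NUM_THREADS=16 mpirun --map-by ppr:1:node --bind-to none ./prog

# One multithreaded process per socket (e.g. 8 OpenMP threads each)
OMP_NUM_THREADS=8 mpirun --map-by ppr:1:socket --bind-to socket ./prog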
Multithreading on multicores

1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
OpenMP principles

Objectives
Sequential code development:
• design
• implementation
• debug

Initialisation();
for (int i=0; i<N; i++)
    Calcul(i);
Autre_calcul();

Then add compilation directives for parallelism:
• incremental parallelization with little extra code
• multithreading
• limited to the native parallelism of the initial code

Ex.: the same code with an added OpenMP directive:

#include <omp.h>
Initialisation();
#pragma omp parallel for
for (int i=0; i<N; i++)
    Calcul(i);
Autre_calcul();
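As a usage note (a minimal sketch; flags differ between compilers), such a program is built with OpenMP support enabled, and the thread count is set in the environment:

gcc -fopenmp -O2 prog.c -o prog    # enable the OpenMP directives
OMP_NUM_THREADS=8 ./prog           # run with 8 threads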
OpenMP principles
Parallel regions
main() {
    ……                      // 1 thread: seq. code
    #pragma omp parallel    // parallel region
    {
        ……                  // 3 threads: replicated code
    }
    ……                      // 1 thread: seq. code
}

Example with a 3-core machine(!):
• creation and destruction of 3 threads at the beginning and end of the parallel region
• automatic synchronization at the end of a parallel region
  → execution continues when all threads in the region have terminated
• very simple and concise syntax
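A minimal runnable sketch of a parallel region (the printed messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Sequential code: 1 thread\n");
    #pragma omp parallel          // fork: a team of threads is created
    {
        // Replicated code: executed by every thread of the team
        printf("Replicated code: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   // implicit barrier: join at the end of the parallel region
    printf("Sequential code again: 1 thread\n");
    return 0;
}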
OpenMP principles
Parallelism with directives
main() {
    ……                            // seq. code
    #pragma omp parallel          // parallel region
    {
        ……                        // replicated code
        #pragma omp for
        for (int i=0; i<N; i++){  // distributed computations
            ………………                //   of the same kind
        }
        #pragma omp single        // sequential computation
        { …… }
        #pragma omp critical      // "mutexed" computations
        { …… }
        ……                        // replicated code,
                                  //   with variable duration
        #pragma omp barrier       // -- synchronization --
        ……                        // replicated code
    }
    ……                            // seq. code
}
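A minimal runnable sketch combining these directives (array size and computations are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double T[N];
    double sum = 0.0;
    #pragma omp parallel
    {
        #pragma omp for              // iterations distributed among threads
        for (int i = 0; i < N; i++)
            T[i] = (double)i * i;

        #pragma omp single           // executed by one thread only
        printf("Array filled\n");

        double local = 0.0;          // replicated code: per-thread partial sum
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            local += T[i];

        #pragma omp critical         // mutual exclusion on the shared sum
        sum += local;

        #pragma omp barrier          // wait for all threads
    }
    printf("sum = %g\n", sum);
    return 0;
}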
OpenMP principles
Parallelism with directives
main() {
    ……                        // seq. code
    #pragma omp parallel      // parallel region
    {
        ……                    // replicated code
        #pragma omp sections  // distributed calculations
        {                     //   of various kinds
            #pragma omp section
            { …… }
            #pragma omp section
            { …… }
        }
        ……                    // replicated code
    }
    ……                        // seq. code
}
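A minimal runnable sketch of sections (the two tasks are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp sections   // each section goes to one thread
        {
            #pragma omp section
            { printf("Task A on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("Task B on thread %d\n", omp_get_thread_num()); }
        }   // implicit barrier at the end of sections
    }
    return 0;
}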
OpenMP principles
Parallelism with directives
Parallelization of a sequential function call:

Seq. code:
main() {
    ……
    f_lib(0, N, SharedTable);
    ……
}

Parallel code:
main() {
    ……
    #pragma omp parallel
    {
        // Lower boundary of the thread
        int inf = N/omp_get_num_threads() * omp_get_thread_num();
        // Upper boundary of the thread
        int sup = N/omp_get_num_threads() * (omp_get_thread_num()+1);
        // Call to the sequential library function
        f_lib(inf, sup, SharedTable);
    }
    ……
}

omp_get_num_threads(): number of threads in the current region
omp_get_thread_num(): rank of the thread

Replicated code, BUT with specific parameters for each thread.
Note: the function code must be reentrant (avoid global variables).
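Note also that the boundaries above drop the last N % nb_threads elements when N is not a multiple of the thread count. A minimal sketch of a safer variant, reusing the slide's names (f_lib, N and SharedTable are assumed to be defined as in the slide):

#pragma omp parallel
{
    int nbt  = omp_get_num_threads();
    int rank = omp_get_thread_num();
    int inf  = N / nbt * rank;
    // The last thread also handles the N % nbt remaining elements
    int sup  = (rank == nbt - 1) ? N : N / nbt * (rank + 1);
    f_lib(inf, sup, SharedTable);
}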
OpenMP principles
Parallelism with directives
Assumption: 3 OpenMP threads created on a machine with 3 CPU cores… and one of the CPU cores is dedicated to driving the GPU.

main() {
    ……
    #pragma omp parallel
    {
        switch (omp_get_thread_num()) {
        case 0:
            ………         // computation on the GPU
            break;
        default:
            ……          // computation on the CPU cores
            break;
        }
    }
    ……
}

Make thread 0 (which still exists) perform a special task:
• ex.: drive a GPU (or perform disk IO),
• in parallel with the calculations launched on the other CPU cores.
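A minimal runnable sketch of this pattern (the "GPU" work is simulated here by a message; the tasks are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        switch (omp_get_thread_num()) {
        case 0:
            // Special task: drive the GPU, or do disk IO...
            printf("thread 0: driving the accelerator (simulated)\n");
            break;
        default:
            // Regular computation on the other CPU cores
            printf("thread %d: CPU computation\n", omp_get_thread_num());
            break;
        }
    }
    return 0;
}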
OpenMP principles

Limitations of OpenMP
OpenMP encounters the classic multithreading limitations:

• synchronization problems
• contention problems
• false sharing problems ("cache war"), illustrated below

[Figure: threads blocked at synchronization points on the shared memory (ShM); two caches fighting over the same cache line]
Using OpenMP makes multithreading easier to implement, but it does not do away with the difficulties of parallel algorithms in shared memory.
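As an illustration of false sharing, here is a minimal sketch (array sizes, iteration count and padding are illustrative; PAD assumes a 64-byte cache line): each thread updates its own counter, but adjacent counters share a cache line, so the caches keep invalidating each other; padding each counter onto its own cache line avoids the "cache war":

#include <omp.h>

#define NB  8
#define PAD 64                       // assumed cache line size, in bytes

// Without padding: counters of different threads share cache lines
long count_shared[NB];

// With padding: one cache line per counter -> no false sharing
struct { long value; char pad[PAD - sizeof(long)]; } count_padded[NB];

int main(void) {
    #pragma omp parallel num_threads(NB)
    {
        int r = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++) {
            count_shared[r]++;            // false sharing: slow
            // count_padded[r].value++;   // padded version: much faster
        }
    }
    return 0;
}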
Multithreading on multicores

1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Memory access bottleneck
Hardware:
• k RAM access channels per processor
• L1 cache memory per core
• L2 cache memory per subset of cores, or per processor
• NUMA computing nodes (Non-Uniform Memory Access)

(Complicated) HPC development:
• serial optimizations to achieve cache-friendly data accesses
• cache blocking to relieve the memory bus of multicore nodes (see the sketch below)

But today: n cores × m vector units per processor, and on the rise!
→ Memory access remains a bottleneck!
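As an illustration of cache blocking, a minimal sketch of a tiled (blocked) matrix product (the matrix size N and block size BS are illustrative; BS is chosen so that three BS×BS blocks fit in cache):

#define N  1024
#define BS 64                      // block size tuned to the cache

double A[N][N], B[N][N], C[N][N];

// Blocked matrix product: each BS x BS block of A, B and C is
// reused from cache instead of being re-fetched from RAM
void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void) { matmul_blocked(); return 0; }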
Choose your processor
When the number of cores increases, the frequency decreases.
The number of memory channels does not increase as fast as the number of cores.

Do you prefer:
• 4 cores at 4.0 - 4.5 GHz, with 4 channels → easier to program
• 8/12/16 cores at 2.2 GHz, with 4 channels → higher theoretical peak performance
…??
Speedup limitation on multicores
Experiments (optimized OpenBLAS matrix product):
• performance does not increase linearly on multicores!
• our 2x8-core node was more expensive than our 2x4-core node, but is only a little bit faster!
[Figure: Gflops (double precision) vs number of threads (0 to 35), with min and max curves. Left: matrix product (BLAS) on a 2x8-core node at 2.1 GHz, peaking at 265 Gflops. Right: the same product on a 2x4-core node at 3.5 GHz, reaching 234 Gflops.]
Memory access bottleneck is the problem!
Multithreading on multicores

Questions ?
