
ST7: High Performance Simulation

Multithreading on multicores

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
Multithreading on multicores

1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Threads vs Processes

Multithreaded processes
A sequential process:
• in the RAM of one node
• running on one core

A multithreaded process:
• in the RAM of one node
• running on … one or several cores!

[Figure: the memory space of the process (and of its threads) is shared; each thread (x, y, z) has its own stack and executes the process code; the main thread owns the stack and code of the process]

The process threads will distribute themselves over the resources (RAM
and cores) accessible to the process: the whole node, or part of the node.
Threads vs Processes

Multithreaded processes
A multithreaded process:

[Figure repeated: memory space of the process shared by its threads x, y, z, each with its own stack]

• A process can create threads.

• There are several thread libraries (POSIX threads, OpenMP threads, MPI threads when calling asynchronous communications…).

• A modern OS can manage many threads, even on one core!

• But multiple cores are needed to run the threads simultaneously and obtain a speedup (one or two threads per core is efficient).
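As an illustration of thread creation, here is a minimal POSIX-threads sketch (the thread function hello and the thread count NB_THREADS are illustrative, not from the slides):

#include <stdio.h>
#include <pthread.h>

#define NB_THREADS 4

/* Function executed by each created thread */
void *hello(void *arg) {
    long rank = (long)arg;
    printf("Hello from thread %ld\n", rank);
    return NULL;
}

int main(void) {
    pthread_t th[NB_THREADS];
    /* The process creates its threads... */
    for (long i = 0; i < NB_THREADS; i++)
        pthread_create(&th[i], NULL, hello, (void *)i);
    /* ...and waits for their termination */
    for (long i = 0; i < NB_THREADS; i++)
        pthread_join(th[i], NULL);
    return 0;
}

(compile with: gcc -pthread)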
Threads vs Processes

Examples of deployment
One multithreaded process per node:

One multithreaded process per processor (per "socket"):
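As a usage sketch (assuming Open MPI and an OpenMP program ./prog; option names vary between MPI implementations), such deployments can be requested at launch time:

# One multithreaded process per node (e.g. 16 OpenMP threads each)
OMP_NUM_THREADS=16 mpirun --map-by ppr:1:node --bind-to none ./prog

# One multithreaded process per socket (e.g. 8 OpenMP threads each)
OMP_NUM_THREADS=8 mpirun --map-by ppr:1:socket --bind-to socket ./prog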
Multithreading on multicores

1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
OpenMP principles

Objectives
Sequential code development:
• design
• implementation
• debug

Initialisation();
for (int i=0; i<N; i++)
    Calcul(i);
Autre_calcul();

Then add compilation directives for parallelism:
• incremental parallelization with little extra code
• multithreading
• limited to the native parallelism of the initial code

Ex.: the same code with an added OpenMP directive:

#include <omp.h>
Initialisation();
#pragma omp parallel for
for (int i=0; i<N; i++)
    Calcul(i);
Autre_calcul();
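As a usage note (a minimal sketch; flags differ between compilers), such a program is built with OpenMP support enabled, and the thread count is set in the environment:

gcc -fopenmp -O2 prog.c -o prog    # enable the OpenMP directives
OMP_NUM_THREADS=8 ./prog           # run with 8 threads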
OpenMP principles
Parallel regions
main() {
    ……                      // 1 thread: seq. code
    #pragma omp parallel    // parallel region
    {
        ……                  // 3 threads: replicated code
    }
    ……                      // 1 thread: seq. code
}

Example with a 3-core machine(!):
• creation and destruction of 3 threads at the beginning and end of the parallel region
• automatic synchronization at the end of a parallel region
  → execution continues when all threads in the region have terminated
• very simple and concise syntax
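A minimal runnable sketch of a parallel region (the printed messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Sequential code: 1 thread\n");
    #pragma omp parallel          // fork: a team of threads is created
    {
        // Replicated code: executed by every thread of the team
        printf("Replicated code: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   // implicit barrier: join at the end of the parallel region
    printf("Sequential code again: 1 thread\n");
    return 0;
}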
OpenMP principles
Parallelism with directives
main() {
    ……                            // seq. code
    #pragma omp parallel          // parallel region
    {
        ……                        // replicated code
        #pragma omp for
        for (int i=0; i<N; i++){  // distributed computations
            ………………                //   of the same kind
        }
        #pragma omp single        // sequential computation
        { …… }
        #pragma omp critical      // "mutexed" computations
        { …… }
        ……                        // replicated code,
                                  //   with variable duration
        #pragma omp barrier       // -- synchronization --
        ……                        // replicated code
    }
    ……                            // seq. code
}
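A minimal runnable sketch combining these directives (array size and computations are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double T[N];
    double sum = 0.0;
    #pragma omp parallel
    {
        #pragma omp for              // iterations distributed among threads
        for (int i = 0; i < N; i++)
            T[i] = (double)i * i;

        #pragma omp single           // executed by one thread only
        printf("Array filled\n");

        double local = 0.0;          // replicated code: per-thread partial sum
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            local += T[i];

        #pragma omp critical         // mutual exclusion on the shared sum
        sum += local;

        #pragma omp barrier          // wait for all threads
    }
    printf("sum = %g\n", sum);
    return 0;
}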
OpenMP principles
Parallelism with directives
main() {
    ……                        // seq. code
    #pragma omp parallel      // parallel region
    {
        ……                    // replicated code
        #pragma omp sections  // distributed calculations
        {                     //   of various kinds
            #pragma omp section
            { …… }
            #pragma omp section
            { …… }
        }
        ……                    // replicated code
    }
    ……                        // seq. code
}
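A minimal runnable sketch of sections (the two tasks are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp sections   // each section goes to one thread
        {
            #pragma omp section
            { printf("Task A on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("Task B on thread %d\n", omp_get_thread_num()); }
        }   // implicit barrier at the end of sections
    }
    return 0;
}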
OpenMP principles
Parallelism with directives
Parallelization of a sequential function call:

Seq. code:
main() {
    ……
    f_lib(0, N, SharedTable);
    ……
}

Parallel code:
main() {
    ……
    #pragma omp parallel
    {
        // Lower boundary of the thread
        int inf = N/omp_get_num_threads() * omp_get_thread_num();
        // Upper boundary of the thread
        int sup = N/omp_get_num_threads() * (omp_get_thread_num()+1);
        // Call to the sequential library function
        f_lib(inf, sup, SharedTable);
    }
    ……
}

omp_get_num_threads(): number of threads in the current region
omp_get_thread_num(): rank of the thread

Replicated code, BUT with specific parameters for each thread.
Note: the function code must be reentrant (avoid global variables).
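Note also that the boundaries above drop the last N % nb_threads elements when N is not a multiple of the thread count. A minimal sketch of a safer variant, reusing the slide's names (f_lib, N and SharedTable are assumed to be defined as in the slide):

#pragma omp parallel
{
    int nbt  = omp_get_num_threads();
    int rank = omp_get_thread_num();
    int inf  = N / nbt * rank;
    // The last thread also handles the N % nbt remaining elements
    int sup  = (rank == nbt - 1) ? N : N / nbt * (rank + 1);
    f_lib(inf, sup, SharedTable);
}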
OpenMP principles
Parallelism with directives
Assumption: 3 OpenMP threads created on a machine with 3 CPU cores… and one of the CPU cores is dedicated to driving the GPU.

main() {
    ……
    #pragma omp parallel
    {
        switch (omp_get_thread_num()) {
        case 0:
            ………         // computation on the GPU
            break;
        default:
            ……          // computation on the CPU cores
            break;
        }
    }
    ……
}

Make thread 0 (which still exists) perform a special task:
• ex.: drive a GPU (or perform disk IO),
• in parallel with the calculations launched on the other CPU cores.
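A minimal runnable sketch of this pattern (the "GPU" work is simulated here by a message; the tasks are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        switch (omp_get_thread_num()) {
        case 0:
            // Special task: drive the GPU, or do disk IO...
            printf("thread 0: driving the accelerator (simulated)\n");
            break;
        default:
            // Regular computation on the other CPU cores
            printf("thread %d: CPU computation\n", omp_get_thread_num());
            break;
        }
    }
    return 0;
}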
OpenMP principles

Limitations of OpenMP
OpenMP encounters the classic multithreading limitations:

• synchronization problems
• contention problems
• false sharing problems ("cache war"), illustrated below

[Figure: threads blocked at synchronization points on the shared memory (ShM); two caches fighting over the same cache line]
Using OpenMP makes multithreading easier to implement, but it does not do away with the difficulties of parallel algorithms in shared memory.
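As an illustration of false sharing, here is a minimal sketch (array sizes, iteration count and padding are illustrative; PAD assumes a 64-byte cache line): each thread updates its own counter, but adjacent counters share a cache line, so the caches keep invalidating each other; padding each counter onto its own cache line avoids the "cache war":

#include <omp.h>

#define NB  8
#define PAD 64                       // assumed cache line size, in bytes

// Without padding: counters of different threads share cache lines
long count_shared[NB];

// With padding: one cache line per counter -> no false sharing
struct { long value; char pad[PAD - sizeof(long)]; } count_padded[NB];

int main(void) {
    #pragma omp parallel num_threads(NB)
    {
        int r = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++) {
            count_shared[r]++;            // false sharing: slow
            // count_padded[r].value++;   // padded version: much faster
        }
    }
    return 0;
}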
Multithreading on multicores

1. Threads vs Processes
2. OpenMP principles
3. Memory access bottleneck
Memory access bottleneck
Hardware:
• k RAM access channels per processor
• L1 cache memory per core
• L2 cache memory per subset of cores, or per processor
• NUMA computing nodes (Non-Uniform Memory Access)

(Complicated) HPC development:
• serial optimizations to achieve cache-friendly data accesses
• cache blocking to relieve the memory bus of multicore nodes (see the sketch below)

But today: n cores × m vector units per processor, and on the rise!
→ Memory access remains a bottleneck!
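As an illustration of cache blocking, a minimal sketch of a tiled (blocked) matrix product (the matrix size N and block size BS are illustrative; BS is chosen so that three BS×BS blocks fit in cache):

#define N  1024
#define BS 64                      // block size tuned to the cache

double A[N][N], B[N][N], C[N][N];

// Blocked matrix product: each BS x BS block of A, B and C is
// reused from cache instead of being re-fetched from RAM
void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void) { matmul_blocked(); return 0; }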
Choose your processor
When the number of cores increases, the frequency decreases.
The number of memory channels does not increase as fast as the number of cores.

Do you prefer:
• 4 cores at 4.0 - 4.5 GHz, with 4 channels → easier to program
• 8/12/16 cores at 2.2 GHz, with 4 channels → higher theoretical peak performance
…??
Speedup limitation on multicores
Experiments (optimized OpenBLAS matrix product):
• performance does not increase linearly on multicores!
• our 2x8-core node was more expensive than our 2x4-core node, but is only a little bit faster!
[Figure: Gflops (double precision) vs number of threads (0 to 35), with min and max curves. Left: matrix product (BLAS) on a 2x8-core node at 2.1 GHz, peaking at 265 Gflops. Right: the same product on a 2x4-core node at 3.5 GHz, reaching 234 Gflops.]
Memory access bottleneck is the problem!
Multithreading on multicores

Questions ?
