HPCA 2020 tutorial: GPU programming, part 2
January 2020
Caroline Collange
Inria Rennes – Bretagne Atlantique
https://team.inria.fr/pacap/members/collange/
caroline.collange@inria.fr
Agenda

Asynchronous execution
    Streams
    Task graphs
Fine-grained synchronization
    Atomics
    Memory consistency model
Unified memory
    Memory allocation
    Optimizing transfers
Asynchronous execution
Asynchronous transfers

Kernel launches are asynchronous: they return control to the CPU immediately.
cudaMemcpy is synchronous: it blocks the CPU until the transfer completes.

[Timeline: a blocking host-to-device cudaMemcpy, then kernel1, kernel2, kernel3 on the GPU, then a blocking device-to-host cudaMemcpy; the CPU spends most of its time waiting.]

Can we do better?
Multiple command queues / streams

[Software stack: the application's fat binary code calls the CUDA Runtime API (cudaXxx functions) or the lower-level CUDA Driver API (cuYyy functions); the user-mode CUDA Runtime and the kernel-mode GPU driver push commands into command queues, which the GPU pops and executes.]
Streams: pipelining commands

Streams are called command queues in OpenCL.
Commands from the same stream run in-order.
Commands from different streams run out-of-order.

cudaMemcpyAsync(..., cudaMemcpyHostToDevice, a);
kernel<<<,,,a>>>();
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, a);

cudaMemcpyAsync(..., cudaMemcpyHostToDevice, b);
kernel<<<,,,b>>>();
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, b);

cudaMemcpyAsync(..., cudaMemcpyHostToDevice, c);
kernel<<<,,,c>>>();
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, c);

cudaStreamSynchronize(a);
cudaStreamSynchronize(b);
cudaStreamSynchronize(c);

[Timeline: the CPU, the GPU, and the DMA engine work concurrently; the host-to-device copy for stream b overlaps kernel a, the device-to-host copy for a overlaps kernel b, and so on.]
Streams: benefits

Example
    Kernel<<<5,,,a>>>
    Kernel<<<4,,,b>>>

Serial kernel execution: all blocks of kernel a (a0..a4) run before any block of kernel b (b0..b3).
Concurrent kernel execution: blocks of kernel b run alongside blocks of kernel a on multiprocessors left idle.

[Diagram: serial — a0, a1, a2, a3, a4 then b0, b1, b2, b3; concurrent — b0..b3 scheduled next to a's blocks.]
Page-locked memory
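The heading above presumably covers page-locked (pinned) host memory: cudaMemcpyAsync only truly overlaps with computation when the host buffer is page-locked, because the DMA engine needs pages that cannot be swapped out. A minimal sketch (the buffer size is an arbitrary example):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;
    float *pinned;
    // cudaMallocHost allocates page-locked host memory that the DMA
    // engine can access directly; a plain malloc'd buffer would make
    // cudaMemcpyAsync fall back to a staged, effectively synchronous copy.
    cudaMallocHost(&pinned, n * sizeof(float));
    // ... use the buffer with cudaMemcpyAsync ...
    cudaFreeHost(pinned);   // pinned memory has its own free function
    return 0;
}
```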
Streams: example

Send data, execute, receive data

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);
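The setup above creates two streams and a pinned host buffer; a sketch of how such an example typically continues (device pointers, element counts, and the kernel are assumptions, following the usual CUDA pattern): each stream runs its own send/execute/receive pipeline, and the two pipelines overlap.

```cuda
// Continuation sketch: devIn, devOut, numElems, blocks, threads, and
// myKernel are illustrative names, not from the original slide.
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devIn + i * numElems, hostPtr + i * numElems,
                    size, cudaMemcpyHostToDevice, stream[i]);
    myKernel<<<blocks, threads, 0, stream[i]>>>(devOut + i * numElems,
                                                devIn + i * numElems);
    cudaMemcpyAsync(hostPtr + i * numElems, devOut + i * numElems,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();  // wait for both streams to drain
```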
Scheduling data dependency graphs

Events express dependencies across streams:

kernel1<<<,,,a>>>();
cudaEventRecord(e1, a);
cudaStreamWaitEvent(b, e1);
cudaMemcpyAsync(,,,,b);
cudaEventRecord(e2, b);
kernel3<<<,,,a>>>();
cudaEventRecord(e3, a);
cudaEventSynchronize(e2);

[Dependency graph: the memcpy in stream b waits on e1, recorded after kernel1 in stream a; the CPU code waits on e2, recorded after the memcpy; e3 marks completion of kernel3.]
From streams and events to task graphs
Recording asynchronous CUDA calls

Asynchronous calls issued under stream capture are recorded into a graph instead of being executed:

kernel3<<<,,,a>>>();
cudaEventRecord(e3, a);
cudaLaunchHostFunc(b, cpucode, params);
cudaStreamWaitEvent(b, e3);
kernel4<<<,,,b>>>();
cudaStreamEndCapture(stream, &graph);

[Dependency graph: kernel4 in stream b waits on event e3, recorded after kernel3 in stream a; cpucode runs as a host-function (CPU code) node.]
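Stream capture can be sketched end to end like this (stream and kernel names are illustrative; cudaStreamBeginCapture is the begin call that matches the cudaStreamEndCapture shown above):

```cuda
cudaGraph_t graph;
cudaStreamBeginCapture(a, cudaStreamCaptureModeGlobal);
kernel3<<<grid, block, 0, a>>>();
cudaEventRecord(e3, a);
cudaStreamWaitEvent(b, e3);       // the cross-stream dependency is captured too
kernel4<<<grid, block, 0, b>>>();
cudaStreamEndCapture(a, &graph);  // graph now holds the recorded work
```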
The same graph can also be built explicitly, node by node:

cudaGraph_t graph;
cudaGraphCreate(&graph, 0);
cudaGraphNode_t k1, k2, k3, k4, mc, cpu;
cudaGraphAddKernelNode(&k1, graph,
    0, 0, // no dependency yet
    paramsK1, 0);
...
cudaGraphAddKernelNode(&k4, graph,
    0, 0, paramsK4, 0);
cudaGraphAddMemcpyNode(&mc, graph,
    0, 0, paramsMC);

[Graph: kernel1, kernel2, kernel3, kernel4, a memcpy node, and a CPU-code (host function) node.]
struct cudaKernelNodeParams
{
void* func; // Kernel function
dim3 gridDim;
dim3 blockDim;
unsigned int sharedMemBytes;
void **kernelParams; // Array of pointers to arguments
void **extra; // (low-level alternative to kernelParams)
};
Instantiate the graph once, then launch the executable graph:

cudaGraph_t graph;
cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, 0, 0, 0);
cudaGraphLaunch(exec, stream);
cudaStreamSynchronize(stream);
Inter-thread/inter-warp communication

Atomics
    Fine-grained communication
    Can implement wait-free algorithms
    May be used for blocking algorithms (locks)
        among warps of the same block
        From Volta (CC≥7.0), also among threads of the same warp
Atomics
Available operators
Arithmetic: atomic{Add,Sub,Inc,Dec}
Min-max: atomic{Min,Max}
Synchronization primitives: atomic{Exch,CAS}
Bitwise: atomic{And,Or,Xor}
Example: reduction

Each block computes a partial sum and adds it to a single accumulator in global memory with atomicAdd.

[Diagram: blocks 0, 1, 2, …, n each issue an atomicAdd to the global-memory accumulator, which holds the result.]

Complexity?
Time including kernel launch overhead?
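The reduction above can be sketched as follows (kernel name and block size are illustrative; it assumes blockDim.x == 256): each block reduces its elements in shared memory, then one thread per block pushes the partial sum to the global accumulator.

```cuda
// Sketch: tree reduction in shared memory, one atomicAdd per block.
__global__ void reduce(const float *in, float *accumulator, int n)
{
    __shared__ float partial[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.f;
    __syncthreads();
    // Halve the number of active threads each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(accumulator, partial[0]); // one atomic per block, not per element
}
```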
Example: compare-and-swap
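A classic use of atomicCAS is building an atomic operation the hardware does not provide; the double-precision atomicAdd below follows the well-known retry-loop pattern from the CUDA programming guide (needed on GPUs before CC 6.0, which lack a native double atomicAdd):

```cuda
// Retry until no other thread modified *address between our read and our CAS.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // another thread won the race: try again
    return __longlong_as_double(old);
}
```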
Memory consistency model

[Examples elided: two-thread (T1, T2) communication patterns.]

Remember
    Floating-point addition is not associative
    Thread scheduling is not deterministic
Without FP atomics
    Evaluation order is independent of thread scheduling
    You should expect a deterministic result for a fixed combination of GPU, driver, and runtime parameters (thread block size, etc.)
With FP atomics
    Evaluation order depends on thread scheduling
    You will get different answers from one run to the next
Unified memory and other features

OK, we lied. You were told that:
    CPU and GPU have distinct memory spaces
    Blocks cannot communicate
    You need to synchronize threads inside a block
This is the least-common denominator across all CUDA GPUs; newer GPUs relax these restrictions.
Unified memory
VectorAdd using unified memory
int main()
{
int numElements = 50000;
size_t size = numElements * sizeof(float);
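A sketch of how the example might continue with unified memory (the kernel body and initialization are assumptions): cudaMallocManaged makes the buffers visible to both CPU and GPU at the same address, so no explicit cudaMemcpy is needed.

```cuda
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

int main()
{
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, size);   // accessible from both CPU and GPU
    cudaMallocManaged(&B, size);
    cudaMallocManaged(&C, size);
    for (int i = 0; i < numElements; ++i) { A[i] = 1.f; B[i] = 2.f; }
    int threads = 256;
    int blocks = (numElements + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(A, B, C, numElements);
    cudaDeviceSynchronize();       // wait before the CPU reads C
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```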
How it works: on-demand paging
Managed memory pages mapped in both CPU and GPU spaces
Same virtual address
Not necessarily allocated in physical memory
Typical flow
1. Data allocated in CPU memory
2. GPU code touches unallocated page, triggers page fault
3. Page fault handler allocates page in GPU mem, copies contents
4. If GPU modifies page contents, invalidate CPU copy
Next CPU access will cause data to be copied back from GPU mem
Prefetching data

On-demand paging has overhead.
Solution: load data in advance using cudaMemPrefetchAsync.

...
cudaStream_t stream;
cudaStreamCreate(&stream);
...
cudaDeviceSynchronize();
Display(C);
...

The device id passed to cudaMemPrefetchAsync selects the destination: cudaCpuDeviceId prefetches to CPU memory; 0, 1, … prefetch to GPU 0, GPU 1, …
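A sketch of how prefetching might slot into the unified-memory VectorAdd (the pointers, sizes, and kernel are assumed to come from that example; the GPU ordinal is an assumption):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);
int gpuId = 0;                                           // target GPU ordinal
cudaMemPrefetchAsync(A, size, gpuId, stream);            // migrate pages to GPU memory
cudaMemPrefetchAsync(B, size, gpuId, stream);
vectorAdd<<<blocks, threads, 0, stream>>>(A, B, C, numElements);
cudaMemPrefetchAsync(C, size, cudaCpuDeviceId, stream);  // results back to CPU memory
cudaStreamSynchronize(stream);                           // CPU can now read C without faults
```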