
CS401/807

Parallel

Parallel
Begin reading Kirk and Hwu
Book begins by focusing on CUDA, extends to OpenCL later
Check your GPU to figure out which language you can use
CUDA is mostly restricted to NVIDIA GPUs

Intro to Parallel
Latency devices (CPU) versus throughput
devices (GPU)
Latency devices add functionality to
reduce latency
Branch prediction, multithreading,
superscalar, cache optimization;
Complex control
Powerful ALU

GPUs: throughput devices


no branch prediction
no data forwarding
simpler ALU, focusing on throughput
e.g. single-cycle SIMD multiply
small cache
heavy pipelining
Many simpler cores versus few complex
cores

CPU vs GPU
[Figure: block diagrams of a CPU (a few ALUs with large Control and Cache units, attached to DRAM) and a GPU (many ALUs with small Control and Cache units, attached to DRAM)]

Writing CPU versus GPU code


Use CPU when appropriate
Serial operations, Decision making
faster than GPUs for these tasks
Use GPU when appropriate
parallel operations, SIMD (more
accurately SPMD)
Single instruction/program, multiple
data
faster than CPU for these tasks

Amdahl's law
Originally formulated for parallel computing
Limit on performance is the serial part
Don't neglect your CPU code
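
The limit described above can be made concrete with the usual statement of Amdahl's law: if a fraction p of the run time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The following is a minimal sketch in plain C; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the slides.

```c
/* Amdahl's law: overall speedup when a fraction p of the run time
 * (0 <= p <= 1) is sped up by a factor s and the rest stays serial. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the speedup as s grows without limit: 1 / (1 - p).
 * Assumes p < 1, i.e. some serial work remains. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

For example, with p = 0.9 the bound is about 10: even with unlimited parallel hardware, the remaining 10% of serial (CPU) code caps the speedup.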

Scalability, Portability, and Cores


Scalability: the code can run faster, given
access to more of the same core
parallel
Portability: the code can run efficiently,
given access to different cores
heterogeneous
Scalability plus portability means taking
advantage of all computing resources

GPGPU frameworks and portability


CUDA, OpenCL, etc. aim for portability
write once, deploy on different arrangements of CPU and GPU
General structure for GPGPU: Host plus device model
CPU is the host: makes decisions, performs serial code, issues tasks to the device
GPU is the device: executes tasks as efficiently as possible

Data parallelism

Given a vector of data, do the same operation to each element in the vector
Simplest parallel computation
E.g.: vector addition
c[i] = a[i] + b[i]


Vector Addition: Serial


One ALU
One addition at a
time
Each element is
presented to the ALU
in turn
O(n)

[Figure: one ALU takes A[k] and B[k] and produces C[k]]


Vector Addition: Parallel


Many ALUs
Each ALU performs
one slice of the
operation
All additions happen at once
O(1) time, given one ALU per element

[Figure: many ALUs, each taking one A[k] and B[k] and producing one C[k]]


CUDA organization
CUDA code executes general compute
requests on parallel hardware
Two-part model
Host: executes serial code, decision
making, issues commands to device
Device: executes parallel code, responds
to requests from host

Parallel execution on CUDA


a CUDA (parallel) device executes a grid (array) of related threads
the grid is divided into thread blocks
All threads in the grid execute the same code (thus: SIMD, or more accurately SPMD)
Each thread has its own index, used to identify which data it is operating on:
compute memory addresses, make control decisions

Thread Blocks
Parallel threads can be issued in blocks
collections of threads from the same
code
Threads in a block can communicate:
shared memory
atomic operations
barrier synchronization
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

Thread Blocks and Communication


[Figure: Thread Block 0, Thread Block 1, and Thread Block 2 each run the same code on their own slice of A[i], B[i], and C[i]]
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];


Thread Blocks and Communication


Threads in the same block can
communicate with each other
Threads in different blocks cannot, even
if they are executing the same code
block ID and thread ID are used together
to determine portion of the work being
done
Can be arranged as 1D, 2D, 3D data
structures
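
To illustrate how block ID and thread ID combine in the 1D case, here is a small sketch in plain C rather than CUDA. It emulates the expression blockIdx.x * blockDim.x + threadIdx.x with ordinary parameters; the names `global_index` and `covers_each_index_once` are hypothetical helpers introduced only for this illustration.

```c
/* Hypothetical plain-C stand-in for the CUDA expression
 * blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Visit every (block, thread) pair and count how often each global
 * index is produced. Returns 1 if every index in
 * [0, num_blocks * block_dim) is produced exactly once, 0 otherwise.
 * Assumes hits[] has room for num_blocks * block_dim counters. */
int covers_each_index_once(int num_blocks, int block_dim, int *hits)
{
    int total = num_blocks * block_dim;
    for (int i = 0; i < total; i++)
        hits[i] = 0;
    for (int b = 0; b < num_blocks; b++)
        for (int t = 0; t < block_dim; t++)
            hits[global_index(b, block_dim, t)]++;
    for (int i = 0; i < total; i++)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

With 3 blocks of 4 threads, thread 3 of block 2 gets index 11, and the twelve threads together cover indices 0 through 11 with no gaps or overlaps.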

Vector addition example: assembly


to complete a vector addition in a
traditional machine
move data from memory to registers
add registers
move result back to memory
repeat for every element

Vector addition example: Serial code


// Compute vector sum h_C = h_A + h_B, one element at a time
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    ...
    vecAdd(h_A, h_B, h_C, N);
}

Vector Addition: Parallel


to complete a vector addition in a host plus device machine
move data from host (serial processor) memory to device (parallel processor) memory
Issue threads to complete addition for all elements in the array
multiple blocks may be required
Move results from device memory back to host memory

Parallel device memory access


Device code can access per-thread registers and
shared global memory
Host code can move data to and from shared global
memory
[Figure: the host (Host CPU and host CPU memory) is connected over the system bus (DMA) to the parallel device, which holds global device memory and Thread Blocks (0,0) and (0,1); within each block, Thread 0, Thread 1, ..., Thread k each have their own registers]

CUDA memory management: Device


cudaMalloc()
Allocates memory in global device
memory
Two parameters
Address of a pointer to the allocated object
Size of allocated object in bytes
cudaFree()
Frees object from global device memory
requires pointer to object to be freed

CUDA memory management: Host


cudaMemcpy()
memory data transfer
Requires four parameters
Pointer to destination
Pointer to source
Number of bytes copied
Type/Direction of transfer
Transfer to device can be asynchronous: with pageable host memory the call may return before the DMA to the device has finished

CUDA host code for vector addition


void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device memory and copy the inputs to the device
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code

    // Copy the result back to the host and free device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

CUDA host code for vector addition


You should check for errors at each stage
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess)
{
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}

Vector addition device code

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}


aside: CUDA function declarations


Note: each __ is two underscore characters
__device__ float DeviceFunc()    execute on device, call from device
__global__ void KernelFunc()     execute on device, call from host
__host__ float HostFunc()        execute on host, call from host

Vector addition device code

Each thread calculates its global index from the block it's in, the size of the blocks, and its position within the block
Each thread validates that the calculated index is within the original index range
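
The two steps above (compute the index, then validate it) can be tried without a GPU. The following is a sketch in plain C, not CUDA: it runs the "threads" one after another in nested loops, and passes the block index, block size, and thread index as ordinary parameters. The names `vec_add_thread` and `vec_add_emulated` are hypothetical and exist only for this illustration.

```c
/* Body of one emulated thread: the same work as vecAddKernel, with the
 * CUDA built-ins passed in as ordinary parameters (an assumption made
 * so that this compiles as plain C). */
static void vec_add_thread(const float *A, const float *B, float *C, int n,
                           int block_idx, int block_dim, int thread_idx)
{
    int i = thread_idx + block_dim * block_idx;
    if (i < n)              /* threads past the end of the data do nothing */
        C[i] = A[i] + B[i];
}

/* Emulated launch: runs every thread of every block one after another.
 * A real device would run them in parallel. */
void vec_add_emulated(const float *A, const float *B, float *C, int n,
                      int block_dim)
{
    int num_blocks = (n - 1) / block_dim + 1;   /* enough blocks to cover n */
    for (int b = 0; b < num_blocks; b++)
        for (int t = 0; t < block_dim; t++)
            vec_add_thread(A, B, C, n, b, block_dim, t);
}
```

With n = 5 and blocks of 4 threads, two blocks (8 threads) are emulated; the three threads with i >= 5 fail the check and never touch memory beyond the arrays.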


Vector addition kernel host code

void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}


Vector addition kernel host code


Host code launches the kernel with blocks of 256 threads
Host code activates enough thread blocks to cover the original range
more general:
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // d_A, d_B, d_C allocations and copies omitted
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}

Thread Blocks and Grids


In our example, thread blocks are 256
threads
If we want to add 1000 elements, we
need 4 thread blocks
there will be 24 threads left over
which is why threads need to validate
array index
[Figure: four thread blocks, block 0 through block 3, laid side by side over the array]
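
The arithmetic for the 1000-element case can be checked with a short plain-C sketch. It uses the integer form (n-1)/256 + 1 for the block count; the function names `blocks_needed` and `idle_threads` are illustrative and not part of CUDA.

```c
/* Number of blocks of block_size threads needed to cover n elements
 * (n >= 1), using the integer form (n - 1) / block_size + 1. */
int blocks_needed(int n, int block_size)
{
    return (n - 1) / block_size + 1;
}

/* Threads launched beyond the n elements; these are the threads that
 * fail the i < n check in the kernel. */
int idle_threads(int n, int block_size)
{
    return blocks_needed(n, block_size) * block_size - n;
}
```

For n = 1000 and 256 threads per block this gives 4 blocks (1024 threads) and 24 idle threads; for n = 1024 there are no idle threads, and n = 1025 needs a fifth block.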


Thread Blocks and Multidimensional Grids

[Figure: an image (axes: image index x, image index y) covered by a 4 x 4 grid of thread blocks, Block 00 through Block 33]

Picture kernel block indexing


__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
    // Calculate the row # of the d_Pin and d_Pout element
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column # of the d_Pin and d_Pout element
    int Col = blockIdx.x*blockDim.x + threadIdx.x;
    // each thread computes one element of d_Pout if in range
    if ((Row < m) && (Col < n)) {
        d_Pout[Row*n+Col] = 2.0*d_Pin[Row*n+Col];
    }
}
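
To see how the 2D indexing covers an image whose size is not a multiple of the block size, here is a sketch in plain C, not CUDA. It emulates a 2D launch with nested loops, assuming (as in the kernel above) an image of m rows and n columns stored row-major. The name `picture_emulated` and the block-size parameters are hypothetical.

```c
/* Emulated 2D launch of the PictureKernel idea in plain C: every
 * (block, thread) pair computes a row and column, and only in-range
 * threads write. The image has m rows and n columns, stored row-major. */
void picture_emulated(const float *in, float *out, int n, int m,
                      int block_dim_x, int block_dim_y)
{
    int grid_x = (n - 1) / block_dim_x + 1;   /* blocks across the columns */
    int grid_y = (m - 1) / block_dim_y + 1;   /* blocks down the rows */

    for (int by = 0; by < grid_y; by++)
        for (int bx = 0; bx < grid_x; bx++)
            for (int ty = 0; ty < block_dim_y; ty++)
                for (int tx = 0; tx < block_dim_x; tx++) {
                    int row = by * block_dim_y + ty;
                    int col = bx * block_dim_x + tx;
                    if (row < m && col < n)   /* skip out-of-range threads */
                        out[row * n + col] = 2.0f * in[row * n + col];
                }
}
```

For a 3 x 5 image with 4 x 2 blocks, a 2 x 2 grid of blocks (32 emulated threads) is used, and the 17 threads that fall outside the image are filtered out by the range check.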
