
L8: Memory Hierarchy Optimization, Bandwidth
2/23/09

Administrative

•  Homework #2
   –  Due 5PM, TODAY
   –  Use handin program to submit
•  Project proposals
   –  Due 5PM, Friday, March 13 (hard deadline)
•  MPM sequential code and information posted on website
•  Class cancelled on Wednesday, Feb. 25
•  Questions?


Targets of Memory Hierarchy Optimizations

•  Reduce memory latency
   –  The latency of a memory access is the time (usually in cycles) between a memory request and its completion
•  Maximize memory bandwidth
   –  Bandwidth is the amount of useful data that can be retrieved over a time interval
•  Manage overhead
   –  Cost of performing optimization (e.g., copying) should be less than anticipated gain

Optimizing the Memory Hierarchy on GPUs

•  Device memory access times are non-uniform, so data placement significantly affects performance.
•  But controlling data placement may require additional copying, so consider overhead.
•  Optimizations to increase memory bandwidth. Idea: maximize utility of each memory access.
   –  Align data structures to address boundaries
   –  Coalesce global memory accesses
   –  Avoid memory bank conflicts to increase memory access parallelism




Outline

•  Bandwidth optimizations
   –  Parallel memory accesses in shared memory
   –  Maximize utility of every memory operation
      •  Load/store USEFUL data
•  Architecture issues
   –  Shared memory bank conflicts
   –  Global memory coalescing
   –  Alignment

Global Memory Accesses

•  Each thread issues memory accesses to data types of varying sizes, perhaps as small as 1-byte entities
•  Given an address to load or store, memory returns/updates "segments" of either 32 bytes, 64 bytes, or 128 bytes
•  Maximizing bandwidth:
   –  Operate on an entire 128-byte segment for each memory transfer


Understanding Global Memory Accesses

Memory protocol for compute capability 1.2* (CUDA Manual 5.1.2.1):

•  Start with the memory request by the smallest-numbered thread. Find the memory segment that contains the address (32-, 64-, or 128-byte segment, depending on data type)
•  Find other active threads requesting addresses within that segment and coalesce
•  Reduce transaction size if possible
•  Access memory and mark threads as "inactive"
•  Repeat until all threads in the half-warp are serviced

*Includes Tesla and GTX platforms

Protocol for most systems (including lab machines) even more restrictive

•  For compute capability 1.0 and 1.1
   –  Threads must access the words in a segment in sequence
   –  The kth thread must access the kth word
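
A minimal sketch of these rules (added for illustration, not from the original slides; the kernel names are hypothetical):

// On compute 1.0/1.1, the kth thread of a half-warp must access the kth
// word of a segment. This pattern coalesces into one transaction:
__global__ void copyCoalesced(float *d_out, const float *d_in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i];
}

// Here threads touch words 'stride' apart, so a half-warp spans several
// segments and the accesses are issued as separate transactions:
__global__ void copyStrided(float *d_out, const float *d_in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    d_out[i] = d_in[i];
}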

Memory Layout of a Matrix in C

[Figures: a 4x4 matrix M in row-major C layout, stored linearly as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 ... M3,3, with the access direction in the kernel code drawn over the matrix. The two slides contrast how threads T1–T4 proceed over time periods: in one pattern, each time period touches four widely separated addresses in the linear layout (uncoalesced); in the other, each time period touches four consecutive addresses (coalesced).]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
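
To connect the figures to code, a hedged sketch (added; kernel and parameter names are hypothetical): two ways a half-warp can walk a row-major matrix with 'width' floats per row.

// Consecutive threads read consecutive addresses of row k: coalesced.
__global__ void readAlongRow(const float *M, float *d_out, int width, int k) {
    d_out[threadIdx.x] = M[k * width + threadIdx.x];
}

// Consecutive threads read addresses 'width' floats apart in column k:
// each access falls in a different segment, so nothing coalesces.
__global__ void readAlongColumn(const float *M, float *d_out, int width, int k) {
    d_out[threadIdx.x] = M[threadIdx.x * width + k];
}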


Impact of Global Memory Coalescing
(Compute capability 1.1 and below example)

Consider the following CUDA kernel that reverses the elements of an array*:

__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

*From Dr. Dobb's Journal,
http://www.ddj.com/hpc-high-performance-computing/207603131

Shared Memory Version of Reverse Array

__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    extern __shared__ int s_data[];
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

From Dr. Dobb's Journal,
http://www.ddj.com/hpc-high-performance-computing/208801731
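A host-side launch sketch (added, not from the article): because s_data is declared extern __shared__, its byte size must be supplied as the third launch parameter. The names launchReverse, numThreads, and n are hypothetical.

void launchReverse(int *d_out, int *d_in, int n) {
    int numThreads = 256;
    int numBlocks  = n / numThreads;              // assumes n % numThreads == 0
    size_t smemBytes = numThreads * sizeof(int);  // one int per thread
    reverseArrayBlock<<<numBlocks, numThreads, smemBytes>>>(d_out, d_in);
}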



What Happened?

•  The first version is about 50% slower!
•  On examination, the same amount of data is transferred to/from global memory
•  Let's look at the access patterns
   –  More examples in the CUDA programming guide

Alignment

•  Addresses accessed within a half-warp may need to be aligned to the beginning of a segment to enable coalescing
   –  An aligned memory address is a multiple of the memory segment size
   –  In compute 1.0 and 1.1 devices, the address accessed by the lowest-numbered thread must be aligned to the beginning of the segment for coalescing
   –  In future systems, sometimes alignment can reduce the number of accesses


More on Alignment

•  Objects allocated statically or by cudaMalloc begin at aligned addresses
   –  But still need to think about index expressions (see the sketch below)
•  May want to align structures:

struct __align__(8) {
    float a;
    float b;
};

struct __align__(16) {
    float a;
    float b;
    float c;
};

What Can You Do to Improve Bandwidth to Global Memory?

•  Think about spatial reuse and access patterns across threads
   –  May need a different computation & data partitioning
   –  May want to rearrange data in shared memory, even if there is no temporal reuse
   –  Similar issues, but much better in future hardware generations
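
To make the alignment point concrete, a minimal sketch (added; the kernel name and offset are hypothetical):

// d_in itself is segment-aligned (cudaMalloc guarantees this), but the +1
// in the index expression shifts every half-warp off the segment boundary,
// defeating coalescing on compute 1.0/1.1 devices.
__global__ void copyShifted(float *d_out, const float *d_in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i + 1];   // read starts 4 bytes into a segment
}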


Bandwidth to Shared Memory: Parallel Memory Accesses

•  Consider each thread accessing a different location in shared memory
•  Bandwidth is maximized if each one is able to proceed in parallel
•  Hardware to support this
   –  Banked memory: each bank can support an access on every memory cycle

Bank Addressing Examples

•  No bank conflicts
   –  Linear addressing, stride == 1
•  No bank conflicts
   –  Random 1:1 permutation

[Figures: Thread 0–15 mapped to Bank 0–15; with stride-1 linear addressing each thread hits its own bank, and any 1:1 permutation is likewise conflict-free.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign


Bank Addressing Examples

•  2-way bank conflicts
   –  Linear addressing, stride == 2
•  8-way bank conflicts
   –  Linear addressing, stride == 8

[Figures: with stride 2, the 16 threads of a half-warp map pairwise onto the even banks (2-way conflict); with stride 8, eight threads map onto each of two banks (8-way conflict).]

How addresses map to banks on G80

•  Each bank has a bandwidth of 32 bits per clock cycle
•  Successive 32-bit words are assigned to successive banks
•  G80 has 16 banks
   –  So bank = address % 16
   –  Same as the size of a half-warp
•  No bank conflicts between different half-warps, only within a single half-warp

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
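
The mapping as a small helper (added for illustration; here wordIndex = byte address / 4):

// Successive 32-bit words fall in successive banks, so for G80:
int bankOf(int wordIndex) {
    return wordIndex % 16;       // 16 banks
}
// e.g. shared[threadIdx.x]     -> bank threadIdx.x % 16 (conflict-free)
//      shared[2 * threadIdx.x] -> even banks only, each hit twice (2-way)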



Shared memory bank conflicts

•  Shared memory is as fast as registers if there are no bank conflicts
•  The fast case:
   –  If all threads of a half-warp access different banks, there is no bank conflict
   –  If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
•  The slow case:
   –  Bank conflict: multiple threads in the same half-warp access the same bank
   –  Must serialize the accesses
   –  Cost = max # of simultaneous accesses to a single bank

Linear Addressing

•  Given:

__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];

•  This is only bank-conflict-free if s shares no common factors with the number of banks
   –  16 on G80, so s must be odd

[Figures: for s=1, threads 0–15 map one-to-one onto banks 0–15; for s=3, the mapping is again a 1:1 permutation of the 16 banks, so both are conflict-free.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
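
The "no common factors" condition can be checked directly; a small helper (added, assuming G80's 16 banks):

// A stride s is conflict-free across a half-warp exactly when
// gcd(s, 16) == 1, i.e., when s is odd.
int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

bool conflictFree(int s) {
    return gcd(s, 16) == 1;
}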


Data types and bank conflicts

•  This has no conflicts if the type of shared is 32 bits:

foo = shared[baseIndex + threadIdx.x];

•  But not if the data type is smaller
   –  4-way bank conflicts:

__shared__ char shared[];
foo = shared[baseIndex + threadIdx.x];

   –  2-way bank conflicts:

__shared__ short shared[];
foo = shared[baseIndex + threadIdx.x];

Structs and Bank Conflicts

•  Struct assignments compile into as many memory accesses as there are struct members:

struct vector { float x, y, z; };
struct myType {
    float f;
    int c;
};
__shared__ struct vector vectors[64];
__shared__ struct myType myTypes[64];

•  This has no bank conflicts for vector; the struct size is 3 words
   –  3 accesses per thread, contiguous banks (no common factor with 16)

struct vector v = vectors[baseIndex + threadIdx.x];

•  This has 2-way bank conflicts for myType (2 accesses per thread):

struct myType m = myTypes[baseIndex + threadIdx.x];

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
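
A common remedy, shown as a hedged sketch (added; myTypePadded is a hypothetical variant, not on the slides): pad the struct to an odd number of 32-bit words so the per-thread stride shares no factor with the 16 banks.

// 3 words per element; stride 3 is coprime with 16 banks, so the
// per-member accesses of a half-warp spread across distinct banks,
// at the cost of one wasted word per element.
struct myTypePadded {
    float f;
    int   c;
    int   pad;   // padding only; never read or written
};
__shared__ struct myTypePadded myTypesP[64];

struct myTypePadded m = myTypesP[baseIndex + threadIdx.x];  // conflict-free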



Common Bank Conflict Patterns, 1D Array

•  Each thread loads 2 elements into shared memory:
   –  2-way-interleaved loads result in 2-way bank conflicts:

int tid = threadIdx.x;
shared[2*tid]   = global[2*tid];
shared[2*tid+1] = global[2*tid+1];

•  This makes sense for traditional CPU threads: it exploits spatial locality in the cache line and reduces sharing traffic
   –  Not for shared memory usage, where there are no cache-line effects but there are banking effects

A Better Array Access Pattern

•  Each thread loads one element in every consecutive group of blockDim elements:

shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign


What Can You Do to Improve Bandwidth to Shared Memory?

•  Think about memory access patterns across threads
   –  May need a different computation & data partitioning
   –  Sometimes "padding" can be used on a dimension to align accesses (see the sketch after the summary)

Summary of Lecture

•  Maximize memory bandwidth!
   –  Make each memory access count
•  Exploit spatial locality in global memory accesses
•  The opposite goal in shared memory
   –  Each thread accesses independent memory banks
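
To make the padding bullet concrete, a minimal sketch (added, assuming a 16x16 thread block and a square matrix whose side is a multiple of 16; the kernel name is hypothetical):

// Shared-memory tile with one padding column. Without the +1, the column
// reads tile[threadIdx.x][threadIdx.y] would be 16 words apart and would
// all land in the same bank (16-way conflict on G80).
#define TILE 16

__global__ void transposeTile(float *d_out, const float *d_in, int width) {
    __shared__ float tile[TILE][TILE + 1];   // pads each row to 17 words

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = d_in[y * width + x];    // coalesced load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    d_out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}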


