
L8: Memory Hierarchy Optimization, Bandwidth
2/23/09

Administrative

•  Homework #2
   –  Due 5PM, TODAY
   –  Use handin program to submit
•  Project proposals
   –  Due 5PM, Friday, March 13 (hard deadline)
•  MPM sequential code and information posted on website
•  Class cancelled on Wednesday, Feb. 25
•  Questions?


Targets of Memory Hierarchy Optimizations

•  Reduce memory latency
   –  The latency of a memory access is the time (usually in cycles) between a memory request and its completion
•  Maximize memory bandwidth
   –  Bandwidth is the amount of useful data that can be retrieved over a time interval
•  Manage overhead
   –  Cost of performing optimization (e.g., copying) should be less than anticipated gain

Optimizing the Memory Hierarchy on GPUs

•  Device memory access times are non-uniform, so data placement significantly affects performance.
•  But controlling data placement may require additional copying, so consider overhead.
•  Optimizations to increase memory bandwidth. Idea: maximize utility of each memory access.
   –  Align data structures to address boundaries
   –  Coalesce global memory accesses
   –  Avoid memory bank conflicts to increase memory access parallelism




Outline

•  Bandwidth optimizations
   –  Parallel memory accesses in shared memory
   –  Maximize utility of every memory operation
      •  Load/store USEFUL data
•  Architecture issues
   –  Shared memory bank conflicts
   –  Global memory coalescing
   –  Alignment

Global Memory Accesses

•  Each thread issues memory accesses to data types of varying sizes, perhaps as small as 1-byte entities
•  Given an address to load or store, memory returns/updates "segments" of either 32 bytes, 64 bytes, or 128 bytes
•  Maximizing bandwidth:
   –  Operate on an entire 128-byte segment for each memory transfer


Understanding Global Memory Accesses

Memory protocol for compute capability 1.2* (CUDA Manual 5.1.2.1):

•  Start with the memory request by the smallest-numbered thread. Find the memory segment that contains the address (32-, 64-, or 128-byte segment, depending on data type)
•  Find other active threads requesting addresses within that segment and coalesce
•  Reduce transaction size if possible
•  Access memory and mark threads as "inactive"
•  Repeat until all threads in the half-warp are serviced

*Includes Tesla and GTX platforms

Protocol for most systems (including lab machines) even more restrictive

•  For compute capability 1.0 and 1.1
   –  Threads must access the words in a segment in sequence
   –  The kth thread must access the kth word
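
A minimal sketch of these rules (added for illustration, not from the original slides; the kernel names are hypothetical):

// On compute 1.0/1.1, the kth thread of a half-warp must access the kth
// word of a segment. This pattern coalesces into one transaction:
__global__ void copyCoalesced(float *d_out, const float *d_in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i];
}

// Here threads touch words 'stride' apart, so a half-warp spans several
// segments and the accesses are issued as separate transactions:
__global__ void copyStrided(float *d_out, const float *d_in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    d_out[i] = d_in[i];
}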

Memory Layout of a Matrix in C

[Figures: a 4x4 matrix M in row-major C layout, stored linearly as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 ... M3,3, with the access direction in the kernel code drawn over the matrix. The two slides contrast how threads T1–T4 proceed over time periods: in one pattern, each time period touches four widely separated addresses in the linear layout (uncoalesced); in the other, each time period touches four consecutive addresses (coalesced).]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
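
To connect the figures to code, a hedged sketch (added; kernel and parameter names are hypothetical): two ways a half-warp can walk a row-major matrix with 'width' floats per row.

// Consecutive threads read consecutive addresses of row k: coalesced.
__global__ void readAlongRow(const float *M, float *d_out, int width, int k) {
    d_out[threadIdx.x] = M[k * width + threadIdx.x];
}

// Consecutive threads read addresses 'width' floats apart in column k:
// each access falls in a different segment, so nothing coalesces.
__global__ void readAlongColumn(const float *M, float *d_out, int width, int k) {
    d_out[threadIdx.x] = M[threadIdx.x * width + k];
}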


Impact of Global Memory Coalescing
(Compute capability 1.1 and below example)

Consider the following CUDA kernel that reverses the elements of an array*:

__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

*From Dr. Dobb's Journal,
http://www.ddj.com/hpc-high-performance-computing/207603131

Shared Memory Version of Reverse Array

__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    extern __shared__ int s_data[];
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

From Dr. Dobb's Journal,
http://www.ddj.com/hpc-high-performance-computing/208801731
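A host-side launch sketch (added, not from the article): because s_data is declared extern __shared__, its byte size must be supplied as the third launch parameter. The names launchReverse, numThreads, and n are hypothetical.

void launchReverse(int *d_out, int *d_in, int n) {
    int numThreads = 256;
    int numBlocks  = n / numThreads;              // assumes n % numThreads == 0
    size_t smemBytes = numThreads * sizeof(int);  // one int per thread
    reverseArrayBlock<<<numBlocks, numThreads, smemBytes>>>(d_out, d_in);
}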



What Happened?

•  The first version is about 50% slower!
•  On examination, the same amount of data is transferred to/from global memory
•  Let's look at the access patterns
   –  More examples in the CUDA programming guide

Alignment

•  Addresses accessed within a half-warp may need to be aligned to the beginning of a segment to enable coalescing
   –  An aligned memory address is a multiple of the memory segment size
   –  In compute 1.0 and 1.1 devices, the address accessed by the lowest-numbered thread must be aligned to the beginning of the segment for coalescing
   –  In future systems, sometimes alignment can reduce the number of accesses


More on Alignment

•  Objects allocated statically or by cudaMalloc begin at aligned addresses
   –  But still need to think about index expressions (see the sketch below)
•  May want to align structures:

struct __align__(8) {
    float a;
    float b;
};

struct __align__(16) {
    float a;
    float b;
    float c;
};

What Can You Do to Improve Bandwidth to Global Memory?

•  Think about spatial reuse and access patterns across threads
   –  May need a different computation & data partitioning
   –  May want to rearrange data in shared memory, even if there is no temporal reuse
   –  Similar issues, but much better in future hardware generations
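
To make the alignment point concrete, a minimal sketch (added; the kernel name and offset are hypothetical):

// d_in itself is segment-aligned (cudaMalloc guarantees this), but the +1
// in the index expression shifts every half-warp off the segment boundary,
// defeating coalescing on compute 1.0/1.1 devices.
__global__ void copyShifted(float *d_out, const float *d_in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i + 1];   // read starts 4 bytes into a segment
}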


Bandwidth to Shared Memory: Parallel Memory Accesses

•  Consider each thread accessing a different location in shared memory
•  Bandwidth is maximized if each one is able to proceed in parallel
•  Hardware to support this
   –  Banked memory: each bank can support an access on every memory cycle

Bank Addressing Examples

•  No bank conflicts
   –  Linear addressing, stride == 1
•  No bank conflicts
   –  Random 1:1 permutation

[Figures: Thread 0–15 mapped to Bank 0–15; with stride-1 linear addressing each thread hits its own bank, and any 1:1 permutation is likewise conflict-free.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign


Bank Addressing Examples

•  2-way bank conflicts
   –  Linear addressing, stride == 2
•  8-way bank conflicts
   –  Linear addressing, stride == 8

[Figures: with stride 2, the 16 threads of a half-warp map pairwise onto the even banks (2-way conflict); with stride 8, eight threads map onto each of two banks (8-way conflict).]

How addresses map to banks on G80

•  Each bank has a bandwidth of 32 bits per clock cycle
•  Successive 32-bit words are assigned to successive banks
•  G80 has 16 banks
   –  So bank = address % 16
   –  Same as the size of a half-warp
•  No bank conflicts between different half-warps, only within a single half-warp

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
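
The mapping as a small helper (added for illustration; here wordIndex = byte address / 4):

// Successive 32-bit words fall in successive banks, so for G80:
int bankOf(int wordIndex) {
    return wordIndex % 16;       // 16 banks
}
// e.g. shared[threadIdx.x]     -> bank threadIdx.x % 16 (conflict-free)
//      shared[2 * threadIdx.x] -> even banks only, each hit twice (2-way)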



Shared memory bank conflicts

•  Shared memory is as fast as registers if there are no bank conflicts
•  The fast case:
   –  If all threads of a half-warp access different banks, there is no bank conflict
   –  If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
•  The slow case:
   –  Bank conflict: multiple threads in the same half-warp access the same bank
   –  Must serialize the accesses
   –  Cost = max # of simultaneous accesses to a single bank

Linear Addressing

•  Given:

__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];

•  This is only bank-conflict-free if s shares no common factors with the number of banks
   –  16 on G80, so s must be odd

[Figures: for s=1, threads 0–15 map one-to-one onto banks 0–15; for s=3, the mapping is again a 1:1 permutation of the 16 banks, so both are conflict-free.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
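
The "no common factors" condition can be checked directly; a small helper (added, assuming G80's 16 banks):

// A stride s is conflict-free across a half-warp exactly when
// gcd(s, 16) == 1, i.e., when s is odd.
int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

bool conflictFree(int s) {
    return gcd(s, 16) == 1;
}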


Data types and bank conflicts

•  This has no conflicts if the type of shared is 32 bits:

foo = shared[baseIndex + threadIdx.x];

•  But not if the data type is smaller
   –  4-way bank conflicts:

__shared__ char shared[];
foo = shared[baseIndex + threadIdx.x];

   –  2-way bank conflicts:

__shared__ short shared[];
foo = shared[baseIndex + threadIdx.x];

Structs and Bank Conflicts

•  Struct assignments compile into as many memory accesses as there are struct members:

struct vector { float x, y, z; };
struct myType {
    float f;
    int c;
};
__shared__ struct vector vectors[64];
__shared__ struct myType myTypes[64];

•  This has no bank conflicts for vector; the struct size is 3 words
   –  3 accesses per thread, contiguous banks (no common factor with 16)

struct vector v = vectors[baseIndex + threadIdx.x];

•  This has 2-way bank conflicts for myType (2 accesses per thread):

struct myType m = myTypes[baseIndex + threadIdx.x];

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
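
A common remedy, shown as a hedged sketch (added; myTypePadded is a hypothetical variant, not on the slides): pad the struct to an odd number of 32-bit words so the per-thread stride shares no factor with the 16 banks.

// 3 words per element; stride 3 is coprime with 16 banks, so the
// per-member accesses of a half-warp spread across distinct banks,
// at the cost of one wasted word per element.
struct myTypePadded {
    float f;
    int   c;
    int   pad;   // padding only; never read or written
};
__shared__ struct myTypePadded myTypesP[64];

struct myTypePadded m = myTypesP[baseIndex + threadIdx.x];  // conflict-free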



Common Bank Conflict Patterns, 1D Array

•  Each thread loads 2 elements into shared memory:
   –  2-way-interleaved loads result in 2-way bank conflicts:

int tid = threadIdx.x;
shared[2*tid]   = global[2*tid];
shared[2*tid+1] = global[2*tid+1];

•  This makes sense for traditional CPU threads: it exploits spatial locality in the cache line and reduces sharing traffic
   –  Not for shared memory usage, where there are no cache-line effects but there are banking effects

A Better Array Access Pattern

•  Each thread loads one element in every consecutive group of blockDim elements:

shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign


What Can You Do to Improve Bandwidth to Shared Memory?

•  Think about memory access patterns across threads
   –  May need a different computation & data partitioning
   –  Sometimes "padding" can be used on a dimension to align accesses (see the sketch after the summary)

Summary of Lecture

•  Maximize memory bandwidth!
   –  Make each memory access count
•  Exploit spatial locality in global memory accesses
•  The opposite goal in shared memory
   –  Each thread accesses independent memory banks
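
To make the padding bullet concrete, a minimal sketch (added, assuming a 16x16 thread block and a square matrix whose side is a multiple of 16; the kernel name is hypothetical):

// Shared-memory tile with one padding column. Without the +1, the column
// reads tile[threadIdx.x][threadIdx.y] would be 16 words apart and would
// all land in the same bank (16-way conflict on G80).
#define TILE 16

__global__ void transposeTile(float *d_out, const float *d_in, int width) {
    __shared__ float tile[TILE][TILE + 1];   // pads each row to 17 words

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = d_in[y * width + x];    // coalesced load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    d_out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}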


