Lesson 7 Memory Hierarchy Optimization
L8: Memory Hierarchy Optimization, Bandwidth

Administrative
• Homework #2
– Due 5PM, TODAY
– Use handin program to submit
• Project proposals
– Due 5PM, Friday, March 13 (hard deadline)
• MPM Sequential code and information posted on website
• Class cancelled on Wednesday, Feb. 25
• Questions?
CS6963, L8: Memory Hierarchy III, 2/23/09
[Figure: a 4×4 matrix M linearized in memory in the order M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3]
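The linearized order in the figure can be expressed as an index calculation. The sketch below is illustrative only: `linear_index` is a hypothetical helper, not a name from the slides, and it assumes the first subscript of Mx,y is the column and the second is the row, which is what the listed order implies.

```cpp
#include <cstddef>

// Hypothetical helper (not from the slides): offset of element "Mcol,row"
// in a matrix of the given width that is stored one row after another.
constexpr std::size_t linear_index(std::size_t col, std::size_t row,
                                   std::size_t width) {
    return row * width + col;
}

// Checks against the order shown in the figure for the 4x4 case.
static_assert(linear_index(0, 0, 4) == 0, "M0,0 comes first");
static_assert(linear_index(1, 0, 4) == 1, "M1,0 follows M0,0");
static_assert(linear_index(0, 1, 4) == 4, "M0,1 starts the second row");
static_assert(linear_index(3, 3, 4) == 15, "M3,3 comes last");
```

Neighboring elements along the first subscript are adjacent in memory, so threads that walk the first subscript touch consecutive addresses.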
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
What Happened?
• The first version is about 50% slower!
• On examination, the same amount of data is transferred to/from global memory
• Let’s look at the access patterns
– More examples in CUDA programming guide

Alignment
• Addresses accessed within a half-warp may need to be aligned to the beginning of a segment to enable coalescing
– An aligned memory address is a multiple of the memory segment size
– In compute 1.0 and 1.1 devices, address accessed by lowest numbered thread must be aligned to beginning of segment for coalescing
– In future systems, sometimes alignment can reduce number of accesses
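The alignment rule above reduces to a modulo test. The sketch below is a host-side model, not device code. The 64-byte segment size and the names `kSegmentBytes`, `is_aligned`, `thread_address`, and `half_warp_starts_aligned` are assumptions made for illustration; the slide does not state a segment size.

```cpp
#include <cstdint>

// ASSUMPTION: 64-byte segments, chosen only for illustration.
constexpr std::uint64_t kSegmentBytes = 64;

// An aligned address is a multiple of the segment size.
constexpr bool is_aligned(std::uint64_t addr, std::uint64_t segment_bytes) {
    return addr % segment_bytes == 0;
}

// Byte address read by thread `tid` when each thread of a half-warp reads
// one consecutive 4-byte word starting at `base`.
constexpr std::uint64_t thread_address(std::uint64_t base, unsigned tid) {
    return base + 4ull * tid;
}

// Compute 1.0/1.1 condition from the slide: the address accessed by the
// lowest-numbered thread must sit at the beginning of a segment.
constexpr bool half_warp_starts_aligned(std::uint64_t base) {
    return is_aligned(thread_address(base, 0), kSegmentBytes);
}

static_assert(half_warp_starts_aligned(128), "128 is a multiple of 64");
static_assert(!half_warp_starts_aligned(132), "shifted by one word");
```

Under these assumptions, shifting a buffer by a single 4-byte word is enough to break the condition, even though the same amount of data is read.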
• 2-way Bank Conflicts
– Linear addressing stride == 2
• 8-way Bank Conflicts
– Linear addressing stride == 8
[Figure: thread-to-bank mapping diagrams for stride == 2 and stride == 8; the stride-8 diagram marks the conflicting banks "x8"]

• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
– So bank = address % 16 (address counted in 32-bit words)
– Same as the size of a half-warp
• No bank conflicts between different half-warps, only within a single half-warp
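The bank mapping and the two strided examples can be checked with a small host-side model. This is a sketch rather than device code: `bank_of` and `max_threads_per_bank` are hypothetical names, addresses are counted in 32-bit words, and `base` and `stride` are assumed to be non-negative with `stride` nonzero.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 16;     // G80 bank count, from the slide
constexpr int kHalfWarp = 16;  // threads that can conflict with each other

// Successive 32-bit words map to successive banks.
int bank_of(int word_address) { return word_address % kBanks; }

// Largest number of half-warp threads that land in one bank when thread t
// reads the word at base + stride * t.
int max_threads_per_bank(int base, int stride) {
    std::array<int, kBanks> hits{};  // per-bank access counters, zeroed
    for (int t = 0; t < kHalfWarp; ++t) {
        ++hits[bank_of(base + stride * t)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With this model, `max_threads_per_bank(0, 2)` evaluates to 2 and `max_threads_per_bank(0, 8)` to 8, matching the 2-way and 8-way cases on the slide, while a stride of 1 gives 1.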
– If all threads of a half-warp access different banks, there is no bank conflict
– If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
– Bank Conflict: multiple threads in the same half-warp access the same bank
– Must serialize the accesses
– Cost = max # of simultaneous accesses to a single bank

• Given: a shared memory access with linear stride s
• This is only bank-conflict-free if s shares no common factors with the number of banks
– 16 on G80, so s must be odd
[Figure: thread-to-bank mapping for s=3]
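The common-factor rule can be written as a greatest-common-divisor test. The sketch below is a host-side illustration; `kNumBanks`, `gcd_u`, and `is_conflict_free` are hypothetical names, the stride is measured in 32-bit words, and s = 0 (every thread reading the same address, the broadcast case) is deliberately left out of scope.

```cpp
constexpr unsigned kNumBanks = 16;  // G80, from the slide

// Euclid's algorithm, written recursively so it stays constexpr in C++11.
constexpr unsigned gcd_u(unsigned a, unsigned b) {
    return b == 0 ? a : gcd_u(b, a % b);
}

// True when stride s shares no common factor with the number of banks.
// Not meaningful for s == 0, which is the broadcast case.
constexpr bool is_conflict_free(unsigned s) {
    return gcd_u(s, kNumBanks) == 1;
}

static_assert(is_conflict_free(1), "unit stride");
static_assert(is_conflict_free(3), "s = 3, as in the figure");
static_assert(!is_conflict_free(2), "shares the factor 2 with 16");
static_assert(!is_conflict_free(8), "shares the factor 8 with 16");
```

Because 16 is a power of two, its only prime factor is 2, so the test passes exactly for odd strides.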
foo = shared[baseIndex + threadIdx.x]
struct vector { float x, y, z; };
struct myType {
float f;
[Figure: thread-to-bank mapping diagrams, Threads 1-5 and Banks 1-5 shown]
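As a worked example of how a struct's size sets the access stride, the sketch below models a half-warp in which thread tid reads the `x` member of element tid from an array of the three-float struct. It is a host-side illustration under stated assumptions: the struct is renamed `Vec3` to avoid clashing with `std::vector`, the array is assumed to start at bank 0, `float` is assumed to be 4 bytes, and `bank_of_x` and `x_reads_conflict_free` are hypothetical names.

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };  // mirrors `struct vector` on the slide

constexpr std::size_t kWordBytes = 4;  // one bank word is 32 bits
constexpr std::size_t kBanks = 16;     // G80
static_assert(sizeof(Vec3) == 12, "assumes 4-byte float and no padding");

// Distance in 32-bit words between the x members of neighboring elements.
constexpr std::size_t kVectorStrideWords = sizeof(Vec3) / kWordBytes;

// Bank that holds element tid's x member (array assumed to start at bank 0).
constexpr std::size_t bank_of_x(std::size_t tid) {
    return (tid * kVectorStrideWords) % kBanks;
}

// True when the 16 threads of a half-warp hit 16 distinct banks.
bool x_reads_conflict_free() {
    bool used[kBanks] = {};
    for (std::size_t tid = 0; tid < 16; ++tid) {
        if (used[bank_of_x(tid)]) return false;
        used[bank_of_x(tid)] = true;
    }
    return true;
}
```

The stride works out to 3 words, which is odd, so each thread lands in a different bank in this model.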
int tid = threadIdx.x;
locality in cache line and reduces sharing traffic
banking effects

blockDim elements.
global[tid + blockDim.x];
[Figure: thread-to-bank mapping diagrams]
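The surviving fragments (`int tid = threadIdx.x;`, `global[tid + blockDim.x]`, "blockDim elements") suggest a comparison between two ways for each thread to load two elements. The sketch below is a host-side model built on that reading, which is an assumption: the interleaved pattern `2 * tid + k` does not appear in the surviving text, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 16;     // G80
constexpr int kHalfWarp = 16;

// ASSUMED interleaved pattern: thread tid owns words 2*tid and 2*tid + 1.
int interleaved_index(int tid, int k) { return 2 * tid + k; }

// Pattern suggested by the fragment: thread tid owns words tid and
// tid + block_dim, one element per group of block_dim elements.
int blocked_index(int tid, int k, int block_dim) { return tid + k * block_dim; }

// Largest number of half-warp threads that hit one bank for a given pattern.
template <typename IndexFn>
int conflict_degree(IndexFn index_of) {
    std::array<int, kBanks> hits{};
    for (int tid = 0; tid < kHalfWarp; ++tid) {
        ++hits[index_of(tid) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}

int interleaved_degree(int k) {
    return conflict_degree([k](int tid) { return interleaved_index(tid, k); });
}

int blocked_degree(int k, int block_dim) {
    return conflict_degree(
        [k, block_dim](int tid) { return blocked_index(tid, k, block_dim); });
}
```

In this model the interleaved pattern has stride 2 across threads and therefore a 2-way conflict on each of its two loads, while the blocked pattern keeps unit stride across threads and has none.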