MCC Iitg Sahu MPLec11

CS521 CSE IITG 11/23/2012
• SIMD architectures can exploit significant
data‐level parallelism for:
– matrix‐oriented scientific computing
– media‐oriented image and sound processors
• SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data
operation
– Makes SIMD attractive for personal mobile devices
• SIMD allows programmer to continue to think
sequentially
A Sahu, July‐Aug 2011
• Vector architectures
• SIMD extensions (MMX, SSE)
• Graphics Processor Units (GPUs)
• For x86 processors:
– Expect two additional cores per chip per year
– SIMD width to double every four years
– Potential speedup from SIMD to be twice that from MIMD!

• Basic idea:
– Read sets of data elements into “vector registers”
– Operate on those registers
– Disperse the results back into memory
• Registers are controlled by compiler
– Used to hide memory latency
– Leverage memory bandwidth
• Example architecture: VMIPS
– Loosely based on Cray‐1
– Vector registers
• Each register holds a 64‐element, 64 bits/element vector
• Register file has 16 read ports and 8 write ports
– Vector functional units
• Fully pipelined
• Data and control hazards are detected
– Vector load‐store unit
• Fully pipelined
• One word per clock cycle after initial latency
– Scalar registers
• 32 general‐purpose registers, 32 floating‐point registers
[Figure: VMIPS block diagram with main memory, vector load/store unit, vector registers, and scalar registers]
• Y = a × X + Y (DAXPY: double‐precision a × X plus Y; SAXPY: single precision)

//SAXPY: single precision
float X[64], Y[64];
float const a = av;
for (i = 0; i < 64; i++) {
    Y[i] = a * X[i] + Y[i];
}

//DAXPY: double precision
double X[64], Y[64];
double const a = av;
for (i = 0; i < 64; i++) {
    Y[i] = a * X[i] + Y[i];
}

• ADDVV.D: add two vectors
• ADDVS.D: add vector to a scalar
• LV/SV: vector load and vector store from address
• Example: DAXPY
L.D F0,a ; load scalar a
LV V1,Rx ; load vector X
MULVS.D V2,V1,F0 ; vector‐scalar multiply
LV V3,Ry ; load vector Y
ADDVV.D V4,V2,V3 ; add
SV Ry,V4 ; store the result
• Requires 6 instructions vs. almost 600 for MIPS
• Execution time depends on three factors:
– Length of operand vectors
– Structural hazards
– Data dependencies
• VMIPS functional units consume one element per clock cycle
– Execution time is approximately the vector length
• Convoy
– Set of vector instructions that could potentially execute together

• Sequences with read‐after‐write dependency hazards can be in the same convoy via chaining
• Chaining
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available
• Chime
– Unit of time to execute one convoy
– m convoys execute in m chimes
– For vector length of n, requires m x n clock cycles
• Example next slide
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector‐scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
RAW dependencies: MULVS.D reads V1 from the first LV; ADDVV.D reads V2 and V3; SV reads V4
Convoys (assume one FPU):
1 LV MULVS.D
2 LV ADDVV.D
3 SV
3 chimes, 2 FP ops per result, cycles per FLOP = 3/2 = 1.5
For 64‐element vectors, requires (64 elements) x (3 chimes) = 192 clock cycles

• Start up time
– Latency of vector functional unit
• All are pipelined
– Assume the same as Cray‐1
• Floating‐point add => 6 clock cycles (6 stages)
• Floating‐point multiply => 7 clock cycles (7 stages)
• Floating‐point divide => 20 clock cycles (20 stages)
• Vector load => 12 clock cycles (12 stages)
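The chime arithmetic above can be written as a small C sketch. The function names are illustrative, not part of VMIPS, and the estimate ignores start‐up latency, as the chime approximation does.

```c
#include <assert.h>

/* Chime approximation: m convoys over vectors of length n take about
 * m * n clock cycles (start-up latency is ignored). */
long chime_cycles(long convoys, long vector_length)
{
    assert(convoys >= 0 && vector_length >= 0);
    return convoys * vector_length;
}

/* Cycles per floating-point operation for a loop body that needs
 * `convoys` chimes and performs `flops_per_element` FP operations
 * on every element. */
double cycles_per_flop(long convoys, long flops_per_element)
{
    assert(convoys >= 0 && flops_per_element > 0);
    return (double)convoys / (double)flops_per_element;
}
```

For the DAXPY sequence, chime_cycles(3, 64) gives the 192 cycles quoted above and cycles_per_flop(3, 2) gives 1.5.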
• Improvements:
– Can we process > 1 element per clock cycle?
– Non‐64 wide vectors
– IF statements in vector code
– Memory system optimizations to support vector processors
– Multiple dimensional matrices
– Sparse matrices
– Programming a vector computer

• Element n of vector register A is “hardwired” to element n of vector register B
– Allows for multiple hardware lanes
• Vector length not known at compile time?
• Use Vector Length Register (VLR)
• Use strip mining for vectors over the maximum length:
low = 0;
VL = (n % MVL); /*find odd‐size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/
    for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/
        Y[i] = a * X[i] + Y[i]; /*main operation*/
    low = low + VL; /*start of next vector*/
    VL = MVL; /*reset the length to maximum vector length*/
}

• Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
• The same loop split into a predicate step and a predicated update:
for (i = 0; i < 64; i=i+1)
    M[i] = (X[i] != 0); //Predicate Instruction
for (i = 0; i < 64; i=i+1)
    if (M[i]) X[i] = X[i] - Y[i]; //Predicate Statements
• Use vector mask register to “disable” elements:
LV V1,Rx ;load vector X into V1
LV V2,Ry ;load vector Y
L.D F0,#0 ;load FP zero into F0
SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV Rx,V1 ;store the result in X
• GFLOPS rate decreases: masked‐off elements still take execution time but do no useful work
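A scalar C sketch of execution under a vector mask, assuming a byte array stands in for the vector mask register. The function names are hypothetical. The returned count of active elements shows why the useful FLOP rate drops: the loop still visits all n elements, but only the active ones do arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Step 1 (analogue of SNEVS.D): mask[i] = 1 where x[i] != 0.0, else 0. */
void set_mask_nonzero(const double *x, unsigned char *mask, size_t n)
{
    assert(x != NULL && mask != NULL);
    for (size_t i = 0; i < n; i++)
        mask[i] = (x[i] != 0.0);
}

/* Step 2 (analogue of SUBVV.D under mask): x[i] = x[i] - y[i] only where
 * mask[i] is set. Returns how many elements were actually updated. */
size_t masked_sub(double *x, const double *y,
                  const unsigned char *mask, size_t n)
{
    size_t active = 0;

    assert(x != NULL && y != NULL && mask != NULL);
    for (size_t i = 0; i < n; i++) {
        if (mask[i]) {
            x[i] -= y[i];
            active++;
        }
    }
    return active;
}
```

With 64 elements of which only 16 are nonzero, the pass still covers all 64 positions while performing 16 subtractions.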
• Memory system must be designed to support high BW for vector loads and stores
• Spread accesses across multiple banks
– Control bank addresses independently
– Load or store non sequential words
– Support multiple vector processors sharing the same memory
• Example:
– 32 processors, each generating 4 LD and 2 ST/cycle
– Proc cycle time is 2.167 ns, SRAM cycle time is 15 ns
– How many memory banks needed?
• SRAM is busy for ceil(15/2.167) = 7 processor cycles, so 32 x 6 x 7 = 1344 banks are needed

• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize multiplication of rows of B with columns of D
• Stride: distance separating elements to be gathered to a single vector register
• Use non‐unit stride for D (B stride = 1, D stride = 100)
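To make the stride idea concrete, here is a minimal C sketch of a strided load into a dense buffer, applied to one row of B (stride 1) and one column of D (stride 100). It assumes row‐major N x N matrices stored as flat arrays. The names load_strided and row_times_column are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define N 100  /* matrix dimension used in the loop nest */

/* Software analogue of a strided vector load: copy `count` elements that
 * sit `stride` elements apart in memory into a dense buffer. */
void load_strided(const double *base, size_t stride, size_t count, double *dst)
{
    assert(base != NULL && dst != NULL && stride > 0);
    for (size_t i = 0; i < count; i++)
        dst[i] = base[i * stride];
}

/* Dot product of row i of B with column j of D.
 * B and D are N x N, row-major, stored as flat arrays. */
double row_times_column(const double *B, const double *D, size_t i, size_t j)
{
    double row[N], col[N], sum = 0.0;

    assert(i < N && j < N);
    load_strided(B + i * N, 1, N, row);  /* B[i][0..N-1]: adjacent words */
    load_strided(D + j, N, N, col);      /* D[0..N-1][j]: N words apart  */
    for (size_t k = 0; k < N; k++)
        sum += row[k] * col[k];
    return sum;
}
```

Once both operands are in dense buffers, the multiply and accumulate steps no longer depend on how the data was laid out in memory.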
• Vector processor can handle stride >1
• Once a vector is loaded into a vector register
– It acts as if it had logically adjacent elements
• Bank conflict (stall) occurs when the same bank is hit faster than bank busy time:
– #banks / GCD(stride,#banks) < bank busy time

[Figure: processor connected to an interleaved memory of 8 banks; consecutive word addresses 0 to 23 are spread across the banks]
• 8 memory banks, bank busy time = 6 clocks
• Memory latency = 12 cycles
• How long does it take to load a 64‐element vector with stride = 1?
– 12 + 64 = 76 clocks, 1.2 cycles/elem
• With stride = 32?
– 12 + 1 + 6*63 = 391 clocks, 6.1 cycles/elem
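The two load times can be reproduced with a small C simulation. This is a sketch under simplifying assumptions that are not stated on the slide: word‐interleaved banks, the first element in bank 0, at most one access issued per clock, and an access stalling only while its bank is busy. The function name and the MAX_BANKS limit are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 64  /* arbitrary upper limit for this sketch */

/* Clock cycles to load n elements with the given stride from `banks`
 * interleaved banks, each busy for `busy` cycles after an access,
 * with an initial memory latency of `latency` cycles. */
long strided_load_cycles(int n, long stride, int banks, int busy, int latency)
{
    long bank_free[MAX_BANKS] = {0};  /* earliest cycle each bank accepts an access */
    long issue = 0;                   /* cycle at which the current access issues */

    assert(n > 0 && stride >= 0 && banks > 0 && banks <= MAX_BANKS);
    for (int i = 0; i < n; i++) {
        int bank = (int)(((long)i * stride) % banks);
        if (i > 0)
            issue += 1;               /* at most one access per cycle */
        if (issue < bank_free[bank])
            issue = bank_free[bank];  /* stall until the bank is free */
        bank_free[bank] = issue + busy;
    }
    return latency + issue + 1;       /* last access issued, plus latency */
}
```

With 8 banks, a busy time of 6 and a latency of 12, strided_load_cycles(64, 1, 8, 6, 12) returns 76 and strided_load_cycles(64, 32, 8, 6, 12) returns 391, matching the figures above. With stride 32 every access maps to bank 0, so each one waits out the full busy time.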
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Gather: takes the index vector and fetches the vector whose elements are at the addresses given by a base address plus the offsets in the index vector
– Result is a dense vector in a vector register
• Scatter: Store in expanded form using the same index vector
– LVI Load vector indexed or GATHER
– SVI Store vector indexed or SCATTER
• Use index vector:
LV Vk, Rk ;load K
LVI Va, (Ra+Vk) ;load A[K[]]
LV Vm, Rm ;load M
LVI Vc, (Rc+Vm) ;load C[M[]]
ADDVV.D Va, Va, Vc ;add them
SVI (Ra+Vk), Va ;store A[K[]]
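The gather, add, scatter sequence can be mirrored in scalar C as below. This is a sketch: gather, scatter, sparse_add and MAX_VL are illustrative names. It assumes K contains no repeated indices, since otherwise the result can differ from the original sequential loop.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VL 64  /* maximum vector length assumed for the temporaries */

/* Gather (analogue of LVI): dst[i] = base[index[i]], a dense result. */
void gather(const double *base, const size_t *index, size_t n, double *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = base[index[i]];
}

/* Scatter (analogue of SVI): base[index[i]] = src[i]. */
void scatter(double *base, const size_t *index, size_t n, const double *src)
{
    for (size_t i = 0; i < n; i++)
        base[index[i]] = src[i];
}

/* A[K[i]] = A[K[i]] + C[M[i]] for i in [0, n), with n <= MAX_VL. */
void sparse_add(double *A, const size_t *K,
                const double *C, const size_t *M, size_t n)
{
    double va[MAX_VL], vc[MAX_VL];

    assert(n <= MAX_VL);
    gather(A, K, n, va);            /* load A[K[]] */
    gather(C, M, n, vc);            /* load C[M[]] */
    for (size_t i = 0; i < n; i++)
        va[i] += vc[i];             /* dense add */
    scatter(A, K, n, va);           /* store A[K[]] */
}
```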
• Media applications operate on data types narrower than the native word size
– Example: disconnect carry chains to “partition” adder
• Limitations, compared to vector instructions:
– Number of data operands encoded into op code
– No sophisticated addressing modes (strided, scatter‐gather)
– No mask registers

• Implementations:
– Intel MMX (1996)
• Eight 8‐bit integer ops or four 16‐bit integer ops
– Streaming SIMD Extensions (SSE) (1999)
• Eight 16‐bit integer ops
• Four 32‐bit integer/fp ops or two 64‐bit integer/fp ops
– Advanced Vector Extensions (2010)
• Four 64‐bit integer/fp ops
– Operands must be consecutive and aligned memory locations
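The “partitioned adder” idea can be imitated in software. The C sketch below adds four 16‐bit lanes packed into one 64‐bit word while preventing any carry from crossing a lane boundary. It is a bit‐masking illustration of the concept rather than how MMX or SSE hardware is built, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Add four independent 16-bit lanes packed in a 64-bit word.
 * Each lane wraps around on overflow; no carry reaches the next lane. */
uint64_t add_4x16(uint64_t a, uint64_t b)
{
    const uint64_t HIGH = 0x8000800080008000ULL;   /* top bit of every lane */
    uint64_t low_sum = (a & ~HIGH) + (b & ~HIGH);  /* add the low 15 bits; a carry stops at bit 15 */
    return low_sum ^ ((a ^ b) & HIGH);             /* add the top bits, discarding the carry-out */
}

/* Extract lane 0..3 (lane 0 is the least significant 16 bits). */
uint16_t lane16(uint64_t word, unsigned lane)
{
    assert(lane < 4);
    return (uint16_t)(word >> (16u * lane));
}
```

For example, adding 1 to a word whose lowest lane holds 0xFFFF leaves that lane at 0 and the neighbouring lane untouched, whereas an ordinary 64‐bit addition would carry into it.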
• Example DAXPY:
L.D F0,a ;load scalar a
MOV F1, F0 ;copy a into F1 for SIMD MUL
MOV F2, F0 ;copy a into F2 for SIMD MUL
MOV F3, F0 ;copy a into F3 for SIMD MUL
DADDIU R4,Rx,#512 ;last address to load
Loop: L.4D F4,0[Rx] ;load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D F4,F4,F0 ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]
L.4D F8,0[Ry] ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D F8,F8,F4 ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D 0[Ry],F8 ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU Rx,Rx,#32 ;increment index to X
DADDIU Ry,Ry,#32 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,Loop ;check if done

• Basic idea (roofline performance model):
– Plot peak floating‐point throughput as a function of arithmetic intensity
– Ties together floating‐point performance and memory performance for a target machine
• Arithmetic intensity
– Floating‐point operations per byte read
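A minimal C sketch of the roofline bound, where attainable performance is the smaller of the memory‐bound rate (peak memory bandwidth times arithmetic intensity) and the peak floating‐point rate. The function names and the units (GB/s, FLOP/byte, GFLOP/s) are assumptions made for illustration.

```c
#include <assert.h>

/* Roofline bound in GFLOP/s.
 * peak_bw_gbs: peak memory bandwidth in GB/s
 * intensity:   arithmetic intensity in FLOP/byte
 * peak_gflops: peak floating-point performance in GFLOP/s */
double attainable_gflops(double peak_bw_gbs, double intensity, double peak_gflops)
{
    double memory_bound;

    assert(peak_bw_gbs >= 0.0 && intensity >= 0.0 && peak_gflops >= 0.0);
    memory_bound = peak_bw_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Arithmetic intensity at which the two limits meet (the ridge point). */
double ridge_point(double peak_bw_gbs, double peak_gflops)
{
    assert(peak_bw_gbs > 0.0);
    return peak_gflops / peak_bw_gbs;
}
```

On a hypothetical machine with 16 GB/s of bandwidth and a 64 GFLOP/s peak, a kernel at 0.25 FLOP/byte is memory bound at 4 GFLOP/s, while any kernel above the ridge point of 4 FLOP/byte is limited by the floating‐point peak.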
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating‐Point Perf.)
Crossroads
1999: When I start reading; superscalar and pipeline
2002: ILP and HT
2009: Shift to multicore! Reduced emphasis on ILP; introduce thread level P.
2011: Reduced ILP to 1 chapter! Request, data, thread, instruction level; introduce: GPU, cloud computing, smart phones, tablets!
If we extrapolate the trend, in a few generations, Pentium will look like:
Of course, we know it did not happen.
Q: What happened instead? Why?

Source: Nvidia CUDA Programming Guide
[Figure: chip area breakdown for Penryn, Bloomfield, Gulftown, and Beckton]
Less than 10% of total chip area is used for the real execution.
Q: Why?
Q: What can you observe? Why?
Power Wall: power expensive, transistors free
Memory Wall: Memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW
Power Wall + Memory Wall + ILP Wall = Brick Wall
David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007

"This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA
1. 2010 Eckert‐Mauchly (ENIAC inventors) Award, considered the highest prize in computer architecture
2. 2004 IEEE Computer Society Seymour Cray Computer Engineering Award
3. 2000 ACM Maurice Wilkes Award
Hide the memory latency through fine-grained interleaved threading.

The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency
Fine-grained interleaved multi-threading:
Pros: remove branch predictor, OOO scheduler, large cache
Cons: register pressure, etc.
Without and with fine-grained interleaved threading
Pros: reduce cache size, no branch predictor, no OOO scheduler
Cons: register pressure, thread scheduler, require huge parallelism

Reducing large cache gives 2x computational density.
Q: Can we make further improvements?
Hint: We have only utilized thread level parallelism (TLP) so far.
Supporting interleaved threading + SIMD execution

GPU uses wide SIMD: 8/16/24/... processing elements (PEs)
CPU uses short SIMD: usually has vector width of 4.
SSE has 4 data lanes; GPU has 8/16/24/... data lanes
Assume 32 threads are grouped into one warp.

Hide vector width using scalar threads.
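The grouping of scalar threads into warps comes down to simple index arithmetic, sketched here in plain C (not GPU code). It assumes a linear thread index and the warp size of 32 stated above. The helper names are hypothetical.

```c
#include <assert.h>

#define WARP_SIZE 32  /* threads per warp, as assumed above */

/* Warp that a linearly numbered thread belongs to. */
int warp_of(int thread_id)
{
    assert(thread_id >= 0);
    return thread_id / WARP_SIZE;
}

/* SIMD lane the thread occupies inside its warp. */
int lane_of(int thread_id)
{
    assert(thread_id >= 0);
    return thread_id % WARP_SIZE;
}

/* Number of warps needed to hold n_threads; the last warp may be partly idle. */
int warps_needed(int n_threads)
{
    assert(n_threads >= 0);
    return (n_threads + WARP_SIZE - 1) / WARP_SIZE;
}
```

The programmer writes code for one scalar thread; threads 0 to 31 then share one instruction stream as lanes of warp 0, threads 32 to 63 as warp 1, and so on.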
NVIDIA Fermi, 512 Processing Elements (PEs)

The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.
Lightweight PE: Fused Multiply Add (FMA)
SFU: Special Function Unit