• SIMD architectures can exploit significant
data‐level parallelism for:
– matrix‐oriented scientific computing
– media‐oriented image and sound processors
• SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data
operation
– Makes SIMD attractive for personal mobile devices
• SIMD allows programmer to continue to think
sequentially
A Sahu, July‐Aug 2011
• Vector architectures
• SIMD extensions (MMX, SSE)
• Graphics Processor Units (GPUs)
• For x86 processors:
– Expect two additional cores per chip per year
– SIMD width to double every four years
– Potential speedup from SIMD to be twice that from MIMD!
• Basic idea:
– Read sets of data elements into “vector registers”
– Operate on those registers
– Disperse the results back into memory
• Registers are controlled by compiler
– Used to hide memory latency
– Leverage memory bandwidth
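The read/operate/disperse pattern can be sketched in plain C, with a fixed-length array standing in for a vector register (a minimal illustration of mine, not VMIPS code; the remainder loop for n not divisible by the register length is omitted):

```c
#include <stddef.h>

#define VLEN 64  /* elements per (simulated) vector register */

/* Read sets of elements into "registers", operate, disperse back to memory. */
void vec_add(double *c, const double *a, const double *b, size_t n) {
    double va[VLEN], vb[VLEN];                 /* stand-ins for vector registers */
    for (size_t i = 0; i + VLEN <= n; i += VLEN) {
        for (size_t j = 0; j < VLEN; j++) {    /* load into vector registers */
            va[j] = a[i + j];
            vb[j] = b[i + j];
        }
        for (size_t j = 0; j < VLEN; j++)      /* register-register operation */
            va[j] += vb[j];
        for (size_t j = 0; j < VLEN; j++)      /* disperse results to memory */
            c[i + j] = va[j];
    }
}
```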
• Example architecture: VMIPS
– Loosely based on Cray‐1
– Vector registers
  • Each register holds a 64‐element, 64 bits/element vector
  • Register file has 16 read ports and 8 write ports
– Vector functional units
  • Fully pipelined
  • Data and control hazards are detected
– Vector load‐store unit
  • Fully pipelined
  • One word per clock cycle after initial latency
– Scalar registers
  • 32 general‐purpose registers, 32 floating‐point registers
(Figure: VMIPS block diagram: main memory, vector load/store unit, vector functional units, vector registers, scalar registers)
CS521 CSE IITG 11/23/2012
• Execution time depends on three factors:
– Length of operand vectors
– Structural hazards
– Data dependencies
• VMIPS functional units consume one element per clock cycle
– Execution time is approximately the vector length
• Convoy
– Set of vector instructions that could potentially execute together
– Sequences with read‐after‐write dependency hazards can be in the same convoy via chaining
• Chaining
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available
• Chime
– Unit of time to execute one convoy
– m convoys execute in m chimes
– For vector length of n, requires m × n clock cycles
• Example next slide
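The chime arithmetic is just m × n; the convoy grouping in the comment below (three convoys for a textbook-style LV / MULVS.D / LV / ADDVV.D / SV sequence with chaining) is my illustration, not taken from this slide:

```c
/* m convoys over n-element vectors take about m chimes = m * n clock cycles
   (the approximation ignores start-up latency). */
int chime_cycles(int m, int n) {
    return m * n;
}
/* Example: {LV | MULVS.D}, {LV | ADDVV.D}, {SV} chains into m = 3 convoys;
   with n = 64 that is 3 chimes, roughly 192 clock cycles. */
```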
• Improvements:
– Can we process > 1 element per clock cycle?
– Non‐64 wide vectors
– IF statements in vector code
– Memory system optimizations to support vector processors
– Multiple dimensional matrices
– Sparse matrices
– Programming a vector computer
• Multiple lanes
– Element n of vector register A is “hardwired” to element n of vector register B
– Allows for multiple hardware lanes
• Vector length not known at compile time?
– Use the Vector Length Register (VLR)
– Use strip mining for vectors over the maximum length
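Strip mining can be sketched in C (my sketch: MVL = 64 matches the VMIPS register length, and the DAXPY body is just an example workload):

```c
#define MVL 64  /* maximum vector length of the hardware */

/* Y = a*X + Y over n elements, processed in strips of at most MVL. */
void daxpy_strip(double a, const double *X, double *Y, int n) {
    int low = 0;
    int vl = n % MVL;                     /* odd-sized first strip */
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + vl; i++)   /* runs at the current VL */
            Y[i] = a * X[i] + Y[i];
        low += vl;
        vl = MVL;                          /* remaining strips are full length */
    }
}
```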
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
(Figure: processor with 8‐bank interleaved memory; consecutive words 0–23 spread across the banks)
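One reason strides matter in this kernel: for fixed j, successive D[k][j] accesses are one full row apart. A quick C check of the byte stride (assuming a row-major 100×100 array of doubles, as in the loop above):

```c
#include <stddef.h>

/* Byte distance between D[k][j] and D[k+1][j] for a row-major 100x100 array. */
ptrdiff_t row_stride_bytes(void) {
    static double D[100][100];
    return (char *)&D[1][0] - (char *)&D[0][0];  /* 100 * sizeof(double) */
}
```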
• Vector processors can handle strides > 1
– Once a vector is loaded into a vector register, it acts as if it had logically adjacent elements
• Bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time
• Example: 8 memory banks, bank busy time = 6 clocks, memory latency = 12 cycles
– How long does it take to load a 64‐element vector?
– With stride = 1: 12 + 64 = 76 clocks (1.2 cycles/element)
– With stride = 32: 12 + 1 + 6 × 63 = 391 clocks (6.1 cycles/element)
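Both timings can be reproduced with a simplified bank-conflict model (my own sketch: it takes the number of distinct banks touched as #banks / GCD(stride, #banks) and charges the full busy time per element whenever that falls below the bank busy time):

```c
enum { BANKS = 8, BUSY = 6, LATENCY = 12 };   /* parameters from the example */

static int gcd(int a, int b) {
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* Clock cycles to load n elements at the given stride. */
int load_time(int n, int stride) {
    int distinct = BANKS / gcd(stride, BANKS);  /* banks actually visited */
    int gap = (distinct < BUSY) ? BUSY : 1;     /* cycles between elements */
    return LATENCY + 1 + gap * (n - 1);
}
```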
• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
• Gather: takes an index vector and fetches the elements found at the addresses it points to
– Result is a dense vector in a vector register
• Scatter: stores back in expanded form using the same index vector
– LVI: load vector indexed, or GATHER
– SVI: store vector indexed, or SCATTER
• Use index vectors:
LV      Vk, Rk       ;load K
LVI     Va, (Ra+Vk)  ;load A[K[]]
LV      Vm, Rm       ;load M
LVI     Vc, (Rc+Vm)  ;load C[M[]]
ADDVV.D Va, Va, Vc   ;add them
SVI     (Ra+Vk), Va  ;store A[K[]]
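A scalar C emulation of the LVI / ADDVV.D / SVI sequence makes the gather and scatter steps explicit (my illustration; it assumes n ≤ 64 so one "register" holds the dense vector):

```c
/* Emulate gather, dense vector add, and scatter for A[K[i]] += C[M[i]]. */
void gather_add_scatter(double *A, const double *C,
                        const int *K, const int *M, int n) {
    double va[64], vc[64];        /* dense "vector registers", n <= 64 assumed */
    for (int i = 0; i < n; i++) va[i] = A[K[i]];   /* LVI Va,(Ra+Vk): gather  */
    for (int i = 0; i < n; i++) vc[i] = C[M[i]];   /* LVI Vc,(Rc+Vm): gather  */
    for (int i = 0; i < n; i++) va[i] += vc[i];    /* ADDVV.D: dense add      */
    for (int i = 0; i < n; i++) A[K[i]] = va[i];   /* SVI (Ra+Vk),Va: scatter */
}
```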
• Media applications operate on data types narrower than the native word size
– Example: disconnect carry chains to “partition” the adder
• Limitations, compared to vector instructions:
– Number of data operands encoded into opcode
– No sophisticated addressing modes (strided, scatter‐gather)
– No mask registers
• Implementations:
– Intel MMX (1996)
  • Eight 8‐bit integer ops or four 16‐bit integer ops
– Streaming SIMD Extensions (SSE) (1999)
  • Eight 16‐bit integer ops
  • Four 32‐bit integer/fp ops or two 64‐bit integer/fp ops
– Advanced Vector Extensions (2010)
  • Four 64‐bit integer/fp ops
– Operands must be consecutive and aligned memory locations
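"Disconnecting the carry chains" can be mimicked in software (a SWAR sketch of mine, not an MMX instruction): mask off the top bit of each 16-bit lane so carries cannot cross lane boundaries, add, then patch the top bits back with XOR:

```c
#include <stdint.h>

/* Four independent 16-bit adds packed in one 64-bit word. */
uint64_t padd16(uint64_t a, uint64_t b) {
    const uint64_t LOW = 0x7FFF7FFF7FFF7FFFULL;  /* everything but lane MSBs */
    uint64_t sum = (a & LOW) + (b & LOW);        /* carries stay inside lanes */
    return sum ^ ((a ^ b) & ~LOW);               /* restore the lane MSBs */
}
```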
• Example DAXPY:
      L.D      F0,a        ;load scalar a
      MOV      F1, F0      ;copy a into F1 for SIMD MUL
      MOV      F2, F0      ;copy a into F2 for SIMD MUL
      MOV      F3, F0      ;copy a into F3 for SIMD MUL
      DADDIU   R4,Rx,#512  ;last address to load
Loop: L.4D     F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D   F4,F4,F0    ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D     F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D   F8,F8,F4    ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D     0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU   Rx,Rx,#32   ;increment index to X
      DADDIU   Ry,Ry,#32   ;increment index to Y
      DSUBU    R20,R4,Rx   ;compute bound
      BNEZ     R20,Loop    ;check if done
• Roofline model, basic idea:
– Plot peak floating‐point throughput as a function of arithmetic intensity
– Ties together floating‐point performance and memory performance for a target machine
• Arithmetic intensity
– Floating‐point operations per byte read
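The assembly above computes DAXPY over 512 bytes = 64 doubles, four per loop iteration; its scalar C equivalent is simply:

```c
/* Y = a*X + Y over 64 elements (what the SIMD loop computes). */
void daxpy64(double a, const double *X, double *Y) {
    for (int i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
}
```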
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating‐Point Perf.)
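The Min() above is straightforward to encode; the sample machine numbers in the comment below are hypothetical, purely to show which side of the roofline binds:

```c
/* Attainable GFLOP/s = min(peak BW x arithmetic intensity, peak FP perf). */
double attainable_gflops(double peak_bw_gbs, double ai_flops_per_byte,
                         double peak_fp_gflops) {
    double memory_bound = peak_bw_gbs * ai_flops_per_byte;
    return memory_bound < peak_fp_gflops ? memory_bound : peak_fp_gflops;
}
/* e.g. 16 GB/s at AI = 0.25 flop/byte is memory-bound at 4 GFLOP/s,
   even if the machine's compute peak is 64 GFLOP/s. */
```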
Gulftown and Beckton die photos: less than 10% of total chip area is used for the real execution.
Q: What can you observe? Why?
David Patterson:
1. 2010 Eckert‐Mauchly Award (named after the ENIAC inventors), considered the highest prize in computer architecture
2. 2004 IEEE Computer Society Seymour Cray Computer Engineering Award
3. 2000 ACM Maurice Wilkes Award
David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007 (link).
SFU: Special Function Unit