Vector Processing
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
Discussion of papers (Con’t)
Complexity-effective superscalar processors
• “Complexity-effective superscalar processors”, Subbarao
Palacharla, Norman P. Jouppi and J. E. Smith.
– Several data structures analyzed for complexity WRT issue width
» Rename: Roughly Linear in IW, steeper slope for smaller feature size
» Wakeup: Roughly Linear in IW, but quadratic in window size
» Bypass: Strongly quadratic in IW
– Overall results:
» Bypass significant at high window size/issue width
» Wakeup+Select delay dominates otherwise
– Proposed Complexity-effective design:
» Replace issue window with FIFOs/steer dependent Insts to same FIFO
3/4/2009 cs252-S09, Lecture 12
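The steering idea above can be sketched in a few lines of C. This is a simplified model (names like `steer` and `fifo_tail` are invented here, not from the paper): an instruction is steered to the FIFO whose tail holds its producer, so chains of dependent instructions issue in order from one FIFO; otherwise it goes to an empty FIFO.

```c
#include <stddef.h>

#define NFIFOS 4

/* Hypothetical model: fifo_tail[f] holds the destination register of the
 * instruction currently at the tail of FIFO f, or -1 if the FIFO is empty. */
static int fifo_tail[NFIFOS] = { -1, -1, -1, -1 };

/* Steer an instruction (dest register, src register) to a FIFO:
 * prefer the FIFO whose tail produces src, else any empty FIFO,
 * else fall back to FIFO 0 (a real design would stall instead). */
int steer(int dest, int src) {
    for (int f = 0; f < NFIFOS; f++)
        if (fifo_tail[f] == src) { fifo_tail[f] = dest; return f; }
    for (int f = 0; f < NFIFOS; f++)
        if (fifo_tail[f] == -1) { fifo_tail[f] = dest; return f; }
    fifo_tail[0] = dest;
    return 0;
}
```

Because only FIFO heads need wakeup/select logic, the window's critical-path structures no longer scale with the full window size.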
Discussion of Papers: SPARCLE paper
• Example of close coupling between processor and memory
controller (CMMU)
– All of features mentioned in this paper implemented by combination of
processor and memory controller
– Some functions implemented as special “coprocessor” instructions
– Others implemented as “Tagged” loads/stores/swaps
• Coarse-Grained Multithreading
– Using SPARC register windows
– Automatic synchronous trap on cache miss
– Fast handling of all other traps/interrupts (great for message interface!)
– Multithreading half in hardware/half software (hence 14 cycles)
• Fine Grained Synchronization
– Full-Empty bit/32 bit word (effectively 33 bits)
» Groups of 4 words/cache line F/E bits put into memory TAG
– Fast TRAP on bad condition
– Multiple instructions. Examples:
» LDT (load/trap if empty)
» LDET (load/set empty/trap if empty)
» STF (Store/set full)
» STFT (store/set full/trap if full)
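The full/empty-bit semantics above can be sketched in C. This is a minimal model, not Sparcle code: the "33rd bit" is a struct field, and the fast synchronous trap is modeled as a return code (`TRAP`). Only the two unambiguous operations, LDT and STFT, are shown; the others follow the same pattern.

```c
#include <stdint.h>

/* Tagged word: 32 data bits plus one full/empty bit (the "33rd bit"). */
typedef struct { uint32_t data; int full; } tagged_t;

enum { OK = 0, TRAP = 1 };   /* TRAP models the fast synchronous trap */

/* LDT: load; trap if the word is empty (consumer waits for producer). */
int ldt(const tagged_t *w, uint32_t *out) {
    if (!w->full) return TRAP;
    *out = w->data;
    return OK;
}

/* STFT: store and set full; trap if already full (lost-update check). */
int stft(tagged_t *w, uint32_t v) {
    if (w->full) return TRAP;
    w->data = v;
    w->full = 1;
    return OK;
}
```

A consumer that traps on an empty word can spin or switch to another thread in the trap handler, giving fine-grained producer/consumer synchronization at word granularity.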
3/4/2009 cs252-S09, Lecture 12 3
Discussion of Papers: Sparcle (Con’t)
• Message Interface
– Closely couple with processor
» Interface at speed of first-level cache
– Atomic message launch:
» Describe message (including DMA ops) with simple stio insts
» Atomic launch instruction (ipilaunch)
– Message Reception
» Possible interrupt on message receive: use fast context switch
» Examine message with simple ldio instructions
» Discard in pieces, possibly with DMA
» Free message (ipicst, i.e. "coherent storeback")
• We will talk about the message interface in greater detail later
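The describe-then-launch protocol can be modeled in C. This is a sketch with invented state (a staging buffer and an outgoing-queue array), not the Sparcle interface itself: `stio` writes accumulate in the staging buffer, and nothing becomes visible to the network until the atomic `ipilaunch`.

```c
#include <string.h>

#define MSG_MAX 8

/* Staging buffer filled by "stio"-style writes; the network sees
 * nothing until the atomic launch commits the whole descriptor. */
static int stage[MSG_MAX], stage_len;
static int network[MSG_MAX], network_len;   /* models the outgoing queue */

/* Describe one word of the message (header, operand, or DMA descriptor). */
void stio(int word) { stage[stage_len++] = word; }

/* ipilaunch: atomically commit the whole descriptor to the network. */
void ipilaunch(void) {
    memcpy(network, stage, (size_t)stage_len * sizeof(int));
    network_len = stage_len;
    stage_len = 0;          /* descriptor consumed */
}
```

Atomicity matters because a context switch between `stio` writes must not expose a half-described message to the network.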
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an
I/O bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
[Figure: Vector programming model — scalar registers (r0, r1, r2), vector registers v0–v3 with elements [0], [1], [2], …, [VLRMAX-1], and the Vector Length Register VLR. A vector arithmetic instruction such as ADDV v3, v1, v2 applies + elementwise over elements [0]…[VLR-1]; vector memory accesses are formed from a base register (r1) and a stride register (r2).]
Vector Code Example
• Compact
– one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in the same pattern as previous instructions
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
• Scalable
– can run same object code on more parallel pipelines or lanes
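The properties above are easiest to see in the canonical strip-mined loop. The sketch below is C standing in for the vector instruction sequence (SETVLR / LV / MULVS / ADDV / SV): each inner block is what one set of vector instructions would execute, with the last, shorter strip handled by setting VLR below the maximum.

```c
#define MVL 64   /* maximum vector length (VLRMAX) */

/* Strip-mined Y = a*X + Y. The inner loop over j models the N
 * independent element operations encoded by one vector instruction. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ) {
        int vl = (n - i < MVL) ? (n - i) : MVL;  /* set VLR for this strip */
        for (int j = 0; j < vl; j++)             /* LV / MULVS / ADDV / SV */
            y[i + j] += a * x[i + j];
        i += vl;
    }
}
```

The same object code runs unchanged on a machine with more lanes; only the rate at which the element operations retire changes.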
[Figure: Vector unit structure, computing v3 <- v1 * v2 — vector register elements are striped across four lanes (elements 0, 4, 8, … in lane 0; 1, 5, 9, … in lane 1; 2, 6, 10, … in lane 2; 3, 7, 11, … in lane 3). An address generator feeds a memory subsystem of 16 interleaved banks (0–F) holding the A[i], B[i] operand pairs.]
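The element-to-lane striping in the figure above is a simple modular mapping; a two-line sketch in C (names invented here) makes it concrete.

```c
#define NLANES 4

/* Element i of a vector register lives in lane (i mod NLANES), at
 * in-lane register position (i div NLANES) — matching the layout
 * "elements 0,4,8,... in lane 0; 1,5,9,... in lane 1; ...". */
int lane_of(int elem) { return elem % NLANES; }
int slot_of(int elem) { return elem / NLANES; }
```

Because consecutive elements land in different lanes, all lanes do useful work every cycle on a unit-stride vector, and no inter-lane communication is needed for elementwise operations.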
T0 Vector Microprocessor (1995)
[Figure: Vector chaining — instruction issue of LV v1; MULV v3,v1,v2; ADDV v5,v3,v4 through vector registers V1–V5. Results chain from the load unit into the multiplier, and from the multiplier into the adder, so the Load, Mul, and Add operations overlap in time instead of running back-to-back.]
[Figure: Pipeline diagram — each element flows through R, X, X, X, W stages. After the last element of one vector instruction drains, the pipeline sits idle ("dead time") before the second vector instruction's elements can enter.]
3/4/2009 cs252-S09, Lecture 12
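Dead time matters most for short vectors, and a back-of-envelope model shows why: if a vector instruction keeps the pipeline busy for VL element-cycles and then idles for D dead cycles, utilization is VL/(VL+D). The helper below (an illustration, not from the slides) computes this in percent.

```c
/* Fraction of cycles doing useful work, in percent (truncated),
 * when each vector instruction runs vl busy element-cycles followed
 * by `dead` cycles of pipeline dead time. */
int utilization_pct(int vl, int dead) {
    return 100 * vl / (vl + dead);
}
```

At VL = 64 a few dead cycles cost little, but at VL = 4 the same dead time can waste half the machine, which is why eliminating dead time (as T0 does) is what makes short-vector code efficient.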
Dead Time and Short Vectors
[Figure: T0 pipeline with no dead time — all 64 cycles of a 64-element vector instruction are active, and the next vector instruction can start immediately.]
[Figure: Masked vector operations — mask register element M[i] drives the write enable on the write data port (e.g. M[0]=0 suppresses the write of C[0]); Compress and Expand operations rearrange elements under the mask.]
Used for density-time conditionals and also for general selection operations
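The compress operation can be sketched directly in C (function name invented here): pack the elements whose mask bit is set into the low positions of the destination, yielding a shorter vector that can then be processed in "density time", proportional to the number of active elements rather than to VL.

```c
/* Compress: pack elements of src whose mask bit is set into the low
 * positions of dst; returns the new (shorter) vector length. */
int vcompress(const int *src, const int *mask, int *dst, int vl) {
    int n = 0;
    for (int i = 0; i < vl; i++)
        if (mask[i])
            dst[n++] = src[i];
    return n;
}
```

Expand is the inverse: scatter a packed vector back to the positions where the mask is set.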
Vector Reductions
Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
sum += A[i]; # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary
tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0 # Vector of VL partial sums
for(i=0; i<N; i+=VL) # Stripmine VL-sized chunks
sum[0:VL-1] += A[i:i+VL-1]; # Vector sum
# Now have VL partial sums in one vector register
do {
VL = VL/2; # Halve vector length
sum[0:VL-1] += sum[VL:2*VL-1]; # Halve no. of partials
} while (VL>1);
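The pseudocode above can be checked with a plain C version (names invented, n assumed a multiple of VL for brevity): stripmine to build VL partial sums, then halve the number of partials with a binary tree until one remains.

```c
#define VL 8   /* vector length used for the partial-sum register */

/* Strip-mined vector sum followed by a binary-tree reduction of the
 * VL partial sums, mirroring the pseudocode above. */
int vsum(const int *a, int n) {
    int sum[VL] = { 0 };                    /* vector of VL partial sums */
    for (int i = 0; i < n; i += VL)         /* stripmine VL-sized chunks */
        for (int j = 0; j < VL; j++)
            sum[j] += a[i + j];             /* one vector add per chunk  */
    for (int vl = VL / 2; vl >= 1; vl /= 2) /* halve no. of partials     */
        for (int j = 0; j < vl; j++)
            sum[j] += sum[j + vl];
    return sum[0];
}
```

Re-association changes the order in which terms combine, which is exact for integers but can perturb floating-point results slightly; compilers only apply it when told it is safe.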
/* Do a vector-scalar multiply. */
prod[0:31] = b_vector[0:31]*a_scalar;