
The Vector-Thread Architecture

Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanović
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA

ISCA 2004

Goals For Vector-Thread Architecture


Primary goal is efficiency
High performance with low energy and small area

Take advantage of whatever parallelism and locality is available: DLP, TLP, ILP
Allow intermixing of multiple levels of parallelism

Programming model is key


- Encode parallelism and locality in a way that enables a complexity-effective implementation
- Provide clean abstractions to simplify coding and compilation

Vector and Multithreaded Architectures


[Figure: a vector processor (control processor issuing vector control to PE0..PEN) beside a multithreaded processor (PE0..PEN under per-thread control), both connected to memory]

Vector processors provide efficient DLP execution


- Amortize instruction control
- Amortize loop bookkeeping overhead
- Exploit structured memory accesses

- Unable to execute loops with loop-carried dependencies or complex internal control flow

Multithreaded processors can flexibly exploit TLP
- Unable to amortize common control overhead across threads
- Unable to exploit structured memory accesses across threads
- Costly memory-based synchronization and communication between threads

Vector-Thread Architecture
VT unifies the vector and multithreaded compute models
- A control processor interacts with a vector of virtual processors (VPs)
- Vector-fetch: the control processor fetches instructions for all VPs in parallel
- Thread-fetch: a VP fetches its own instructions
- VT allows seamless intermixing of vector and thread control
[Figure: the control processor issues vector-fetches to VP0..VPN; each VP can also thread-fetch its own instructions; the VPs connect to memory]

Outline
Vector-Thread Architectural Paradigm
- Abstract Model
- Physical Model
SCALE VT Processor
Evaluation
Related Work

Virtual Processor Abstraction


- VPs contain a set of registers
- VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
- VPs have no automatic program counter; AIBs must be explicitly fetched
- VPs contain pending vector-fetch and thread-fetch addresses

- A fetch instruction allows a VP to fetch its own AIB
  - May be predicated for a conditional branch

- If an AIB does not execute a fetch, the VP thread stops

[Figure: a VP (registers + ALUs) receives vector-fetches and thread-fetches; VP thread execution proceeds as a chain of AIBs linked by fetch instructions and thread-fetches]
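As a rough model of these semantics (hypothetical types and helper functions, not the SCALE ISA), a VP thread's lifetime can be sketched in C:

    /* Illustrative model of VP thread execution; all names are hypothetical. */
    #include <stdbool.h>
    typedef struct { bool is_fetch; /* opcode, operands, predicate ... */ } Inst;
    typedef struct { Inst *insts; int len; } AIB;
    typedef struct { bool fetch_pending; unsigned fetch_addr; /* regs ... */ } VP;
    AIB *lookup_aib(unsigned addr);            /* assumed helpers */
    void execute(VP *vp, Inst *in);
    bool pred_true(VP *vp, Inst *in);
    unsigned fetch_target(VP *vp, Inst *in);

    void run_vp_thread(VP *vp) {
        while (vp->fetch_pending) {
            AIB *aib = lookup_aib(vp->fetch_addr);  /* explicit fetch, no PC */
            vp->fetch_pending = false;
            for (int i = 0; i < aib->len; i++) {
                execute(vp, &aib->insts[i]);        /* RISC-like instruction */
                if (aib->insts[i].is_fetch && pred_true(vp, &aib->insts[i])) {
                    vp->fetch_addr = fetch_target(vp, &aib->insts[i]);
                    vp->fetch_pending = true;       /* thread continues */
                }
            }
            /* if the AIB executed no fetch, fetch_pending stays false
               and the VP thread stops */
        }
    }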

Virtual Processor Vector


A VT architecture includes a control processor and a virtual processor vector
- Two interacting instruction sets
- A vector-fetch command allows the control processor to fetch an AIB for all the VPs in parallel
- Vector-load and vector-store commands transfer blocks of data between memory and the VP registers
[Figure: the control processor issues vector-fetch, vector-load, and vector-store commands; a vector memory unit transfers data between memory and the registers of VP0..VPN]
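As a sketch of how the two instruction sets interact (the vt_* names are hypothetical stand-ins, not actual SCALE commands), the control processor's view of one strip of a simple loop might be:

    /* Hypothetical control-processor command sequence for one strip. */
    int vlen = vt_setvl(n);           /* hardware picks the vector length */
    vt_vector_load(a, REG_A, vlen);   /* a[0..vlen-1] -> each VP's rA */
    vt_vector_load(b, REG_B, vlen);   /* b[0..vlen-1] -> each VP's rB */
    vt_vector_fetch(aib_compute);     /* every VP executes the AIB in parallel */
    vt_vector_store(c, REG_C, vlen);  /* each VP's result -> c[0..vlen-1] */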

Cross-VP Data Transfers


Cross-VP connections provide fine-grain data operand communication and synchronization
- VP instructions may target nextVP as a destination or use prevVP as a source
- The cross-VP queue holds wrap-around data; the control processor can push and pop
- The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs

[Figure: VP0..VPN form a ring; each VP's ALUs read from prevVP and write to nextVP, and a cross-VP queue carries the wrap-around value, which the control processor can push and pop]
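A minimal sketch of the usage pattern, with prevVP/nextVP modeled as blocking receive/send operations (hypothetical names):

    /* Per-VP view of a carried value circulating around the ring. */
    int acc = recv_prevVP();   /* blocks until the previous VP sends */
    acc += my_element;         /* this VP's contribution */
    send_nextVP(acc);          /* forward the running value to the next VP */
    /* The control processor pushes the initial value into the cross-VP
       queue before the vector-fetch and pops the final value afterward. */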

Mapping Loops to VT
A broad class of loops maps naturally to VT
- Vectorizable loops
- Loops with loop-carried dependencies
- Loops with internal control flow

Each VP executes one loop iteration


- The control processor manages the execution
- Stripmining enables implementation-dependent vector lengths (see the sketch below)

Programmer or compiler only schedules one loop iteration on one VP


- No cross-iteration scheduling
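A minimal stripmining sketch, reusing the hypothetical vt_* command names from above; the hardware chooses each strip's vector length:

    /* Process n elements in implementation-sized strips (illustrative). */
    int i = 0;
    while (i < n) {
        int vlen = vt_setvl(n - i);           /* clamp to remaining work */
        vt_vector_load(&a[i], REG_A, vlen);
        vt_vector_fetch(aib_body);            /* each VP runs one iteration */
        vt_vector_store(&c[i], REG_C, vlen);
        i += vlen;
    }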

Vectorizable Loops
Data-parallel loops with no internal control flow are mapped using vector commands
- Predication handles small conditionals

[Figure: operation DAG for one loop iteration (ld, ld, <<, +, ×, st); the control processor issues vector-load, vector-load, vector-fetch, and vector-store commands, and VP0..VPN each execute one iteration]
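For concreteness, one plausible C loop with this operation mix (illustrative only; the slide's actual source loop is not shown):

    /* Each VP executes one iteration: ld, ld, <<, +, multiply, st. */
    for (int i = 0; i < n; i++)
        c[i] = ((a[i] << 2) + b[i]) * k;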

Loop-Carried Dependencies
Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers
- A vector-fetch introduces a chain of prevVP receives and nextVP sends
- Vector-memory commands are still used even when the compute is non-vectorizable

[Figure: per-iteration DAG (ld, ld, <<, +, ×, st) with the carried value flowing from each VP to the next over the cross-VP network; the control processor issues vector-load, vector-load, vector-fetch, and vector-store]
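An illustrative C loop of this shape; the carried variable y is what travels around the prevVP/nextVP ring:

    /* Iteration i needs y from iteration i-1, so each VP receives y via
       prevVP and sends its updated y via nextVP (illustrative). */
    int y = y0;
    for (int i = 0; i < n; i++) {
        y = (y + a[i]) << 1;   /* the loop-carried value */
        b[i] = y;              /* stores still use vector-memory commands */
    }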

Loops with Internal Control Flow


Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches
- Vector commands and thread-fetches are freely intermixed
- Once launched, the VP threads execute to completion before the next control processor command

[Figure: per-iteration DAG with a data-dependent branch (ld, ld, ==, br, st); after a vector-load and vector-fetch, VP0..VPN each thread-fetch a data-dependent number of AIBs before the final vector-store]
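An illustrative C loop of this shape; on VT the test sits in the vector-fetched AIB, and only the uncommon path conditionally thread-fetches further AIBs:

    /* Data-dependent control flow (illustrative names). */
    for (int i = 0; i < n; i++) {
        if (a[i] == key)
            out[i] = cheap_path(a[i]);   /* common case: one AIB */
        else
            out[i] = costly_path(a[i]);  /* rare case: extra thread-fetched AIBs */
    }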

VT Physical Model
A Vector-Thread Unit contains an array of lanes with physical register files and execution units
- VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
- Independent parallel lanes exploit parallelism across VPs and data operand locality within VPs
[Figure: the control processor commands a vector-thread unit with four lanes; lane 0 holds VP0, VP4, VP8, VP12, lane 1 holds VP1, VP5, VP9, VP13, and so on; each lane has an ALU, and a vector memory unit connects the lanes to memory]
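The striping in the figure is a simple modulo mapping; a sketch assuming four lanes:

    /* VP-to-lane mapping implied by the figure. */
    int lane_of(int vp) { return vp % 4; }  /* lane 0: VP0, VP4, VP8, VP12 */
    int slot_of(int vp) { return vp / 4; }  /* time-multiplexing order within a lane */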

Lane Execution
- Lanes execute decoupled from each other
- A command management unit handles vector-fetch and thread-fetch commands
- An execution cluster executes instructions in order from a small AIB cache (e.g., 32 instructions)
  - AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
[Figure: lane 0's command management unit queues vector-fetch and thread-fetch commands and checks the AIB tags; a miss sends the AIB address to the fill unit]

- Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for
  - For a thread-fetch command, the lane executes the AIB for the requesting VP
  - For a vector-fetch command, the lane executes the AIB for every VP


- AIBs and vector-fetch commands reduce control overhead
  - 10s–100s of instructions are executed per fetch-address tag check, even for non-vectorizable loops

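A sketch of how a lane might drain its execute directives (all structures and helpers hypothetical):

    /* Illustrative lane control loop: each execute directive names an AIB
       and the VP(s) it applies to. */
    ExecuteDirective ed;
    while (ed_queue_pop(lane, &ed)) {
        AIB *aib = aib_cache_lookup(lane, ed.aib_addr);  /* tag check */
        if (aib == NULL)
            aib = aib_fill(lane, ed.aib_addr);           /* miss: fill unit */
        if (ed.is_vector_fetch)                          /* AIB for every VP */
            for (int vp = lane->first_vp; vp < num_vps; vp += num_lanes)
                run_aib(lane, aib, vp);
        else                                             /* AIB for one VP */
            run_aib(lane, aib, ed.vp);
    }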

VP Execution Interleaving
- Hardware provides the benefits of loop unrolling by interleaving VPs
- Time-multiplexing can hide thread-fetch, memory, and functional unit latencies
[Figure: four lanes time-multiplex their VPs (lane 0: VP0, VP4, VP8, VP12, and so on); vector-fetched AIBs and thread-fetches interleave over time]

VP Execution Interleaving
Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
- No static cross-iteration scheduling
- Tolerant of variable dynamic latencies
[Figure: the same four-lane time-multiplexing with cross-VP transfers; each transfer waits only for its producer, so execution follows the dynamic critical path]

SCALE Vector-Thread Processor


SCALE is designed to be a complexity-effective all-purpose embedded processor
- Exploit all available forms of parallelism and locality to achieve high performance and low energy

Constrained to a small area (estimated 10 mm² in 0.18 µm)


- Reduce wire delay and complexity
- Support tiling of multiple SCALE processors for increased throughput

Careful balance between software and hardware for code mapping and scheduling
- Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model

SCALE Clusters
VPs are partitioned into four clusters to exploit ILP and to allow lane implementations to optimize area, energy, and circuit delay
- Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer multiply/divide unit
- Clusters execute decoupled from each other
[Figure: a SCALE VP spans clusters c0–c3; the control processor and AIB fill unit feed four lanes, each containing clusters c0–c3, with c0 connecting to the L1 cache]

SCALE Registers and VP Configuration


- Atomic instruction blocks allow VPs to share temporary state that is only valid within the AIB
- VP general registers are divided into private and shared
- Chain registers at the ALU inputs avoid reading and writing the general register file, to save energy
[Figure: cluster c0's register file holds shared registers plus private registers for VP0, VP4, VP8, along with chain registers cr0 and cr1]

The number of VP registers in each cluster is configurable
- The hardware can support more VPs when each VP has fewer private registers
- Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
[Figure: example configurations of a 32-register cluster: 4 VPs with 0 shared and 8 private registers; 7 VPs with 4 shared and 4 private registers; 25 VPs with 7 shared and 1 private register]
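These configurations follow directly from the 32-register-per-cluster budget (see the prototype parameters later in the talk); a sketch of the arithmetic:

    /* VPs supported per cluster: shared registers are allocated once,
       and each VP gets its own private registers. */
    int vps_per_lane(int shared, int priv) { return (32 - shared) / priv; }
    /* vps_per_lane(0, 8) == 4;  vps_per_lane(4, 4) == 7;
       vps_per_lane(7, 1) == 25, matching the three examples above. */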

SCALE Micro-Ops
- The assembler translates the portable software ISA into hardware micro-ops
- Per-cluster micro-op bundles access local registers only
- Inter-cluster data transfers are broken into transports and writebacks

Software VP code:
    cluster  operation     destinations
    c0       xor pr0, pr1  c1/cr0, c2/cr0
    c1       sll cr0, 2    c2/cr1
    c2       add cr0, cr1  pr0

Hardware micro-ops:
    Cluster 0 bundle:  compute: xor pr0,pr1          transport: →c1, →c2
    Cluster 1 bundle:  writeback: c0→cr0             compute: sll cr0,2    transport: →c2
    Cluster 2 bundle:  writeback: c0→cr0, c1→cr1     compute: add cr0,cr1→pr0
    (Cluster 3 not shown)

SCALE Cluster Decoupling


Cluster execution is decoupled
- Cluster AIB caches hold micro-op bundles
- Each cluster has its own execute-directive queue and local control
- Inter-cluster data transfers synchronize with handshake signals
[Figure: clusters 0–3 each hold their own AIBs and registers; each cluster's compute stage feeds writeback and transport stages that connect to the other clusters]

Memory Access Decoupling

(see paper)
- A load data queue enables continued execution after a cache miss
- A decoupled store queue enables loads to slip ahead of stores
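A conceptual sketch of the load-side decoupling (hypothetical queue API): execution stalls at the use of the data, not at the load itself:

    /* Load data queue (illustrative). */
    ldq_issue(&ldq, addr);     /* may miss in the cache; no stall here */
    do_independent_work();     /* execution continues past the miss */
    int v = ldq_pop(&ldq);     /* stalls only if the data is not back yet */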

SCALE Prototype and Simulator


Prototype SCALE processor in development
- Control processor: MIPS, 1 instruction/cycle
- VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
- Primary I/D cache: 32 KB, 4×128 b per cycle, non-blocking
- DRAM: 64-bit, 200 MHz DDR2 (64 b at 400 Mb/s: 3.2 GB/s)
- Estimated 10 mm² in 0.18 µm, 400 MHz (25 FO4)

Cycle-level execution-driven microarchitectural simulator


- Detailed VTU and memory system model

[Floorplan: 4 mm × 2.5 mm; control processor (PC, RF, CP0), multiply/divide unit, memory interface/cache control, cache tags, four 8 KB cache banks, crossbar, memory unit, and four lanes, each lane holding four clusters (ctrl, RF, ALU, shifter) with load data queues]

Benchmarks
Diverse set of 22 benchmarks chosen to evaluate a range of applications with different types of parallelism
- 16 from EEMBC; 6 from MiBench, Mediabench, and SpecInt
- Hand-written VP assembly code linked with C code compiled for the control processor using gcc
  - Reflects the typical usage model for embedded processors

EEMBC enables comparison to other processors running hand-optimized code, but it is not an ideal comparison
- Performance depends greatly on programmer effort; algorithmic changes are allowed for some benchmarks and are often unpublished
- Performance depends greatly on special compute instructions or subword SIMD operations (for the current evaluation, SCALE does not provide these)
- Processors use unequal silicon area, process technologies, and circuit styles

Overall results: SCALE is competitive with larger and more complex processors on a wide range of codes from different domains
- See the paper for detailed results
- Results are a snapshot; the SCALE microarchitecture and benchmark mappings continue to improve

Mapping Parallelism to SCALE


Data-parallel loops with no complex control flow
- Use vector-fetch and vector-memory commands
- EEMBC rgbcmy, rgbyiq, and hpg execute 6–10 ops/cycle for a 12×–32× speedup over the control processor; performance scales with the number of lanes

Loops with loop-carried dependencies
- Use vector-fetched AIBs with cross-VP data transfers
- Mediabench adpcm.dec: two loop-carried dependencies propagate in parallel; on average 7 loop iterations execute in parallel, for an 8× speedup
- MiBench sha has 5 loop-carried dependencies and exploits ILP

[Bar chart: speedup vs. control processor with 1, 2, 4 (prototype), and 8 lanes; rgbcmy, rgbyiq, and hpg reach up to ~70×, while adpcm.dec and sha reach up to ~9×]

Mapping Parallelism to SCALE


Data-parallel loops with large conditionals
- Use vector-fetched AIBs with conditional thread-fetches
- EEMBC dither: special case for white pixels (18 ops vs. 49)

Data-parallel loops with inner loops
- Use vector-fetched AIBs with thread-fetches for the inner loop
- EEMBC lookup: VPs execute pointer-chasing IP address lookups in a routing table

Free-running threads
- No control processor interaction
- VP worker threads get tasks from a shared queue using atomic memory operations (see the sketch below)
- EEMBC pntrch and MiBench qsort achieve significant speedups
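A sketch of the worker-thread pattern with a shared counter as the task queue; C11 atomics stand in for SCALE's atomic memory operations (illustrative):

    #include <stdatomic.h>
    extern atomic_int next_task;     /* shared queue head */
    extern int num_tasks;
    void process_task(int t);

    /* Free-running VP worker thread: no control processor interaction. */
    void worker(void) {
        for (;;) {
            int t = atomic_fetch_add(&next_task, 1);  /* claim a task */
            if (t >= num_tasks) break;                /* queue drained */
            process_task(t);
        }
    }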

[Bar chart: speedup vs. control processor with 1, 2, 4, and 8 lanes for dither, lookup, pntrch, and qsort (up to ~10×)]

Comparison to Related Work


- TRIPS and Smart Memories can also exploit multiple types of parallelism, but must explicitly switch modes
- Raw's tiled processors provide much lower compute density than SCALE's clusters, which factor out instruction and data overheads and use direct communication instead of programmed switches
- Multiscalar passes loop-carried register dependencies around a ring network, but it focuses on speculative execution and memory, whereas VT uses simple logic to support common loop types
- Imagine organizes computation into kernels to improve register file locality, but it exposes machine details with a low-level VLIW ISA, in contrast to VT's VP abstraction and AIBs
- CODE uses register renaming to hide its decoupled clusters from software, whereas SCALE simplifies hardware by exposing clusters and statically partitioning inter-cluster transport and writeback ops
- Jesshope's micro-threading is similar in spirit to VT, but its threads are dynamically scheduled and cross-iteration synchronization uses full/empty bits on global registers

Summary
The vector-thread architectural paradigm unifies the vector and multithreaded compute models
The VT abstraction introduces a small set of primitives that allow software to succinctly encode parallelism and locality and to seamlessly intermingle DLP, TLP, and ILP
- Virtual processors, AIBs, vector-fetch and vector-memory commands, thread-fetches, cross-VP data transfers

The SCALE VT processor efficiently achieves high performance on a wide range of embedded applications