From Superscalar OO to Multicore SST

Checkpoint and Transactional memory support for SST
© stratusdesign@gmail.com stratusdesign.blogspot.com

Author details
• No connection to Sun Microsystems/Oracle. Worked on low-level mainframe and embedded-class kernels and a distributed microkernel for international hardware/microprocessor companies, with extensive experience developing a proprietary ANSI C compiler toolchain for Intel, MIPS and Alpha processors. Also experience with parallel computers and DFT for signal processing using CUDA/GPU.
• Now looking for new contracts and projects (primarily software) in GB, Europe or US in the kernel, compiler, signal processing and parallel areas (as of Jun 2011). Contact dmctek@gmail.com


The OO Superscalar legacy
• OO, the legacy technique of the superscalar era: does it have a future in multicore?
• Used to utilise otherwise wasted cycles while waiting for memory
• State of the art

    Year   Inflight Instructions   Clock Speed
    1998   90                      600 MHz
    2008   200                     3200 MHz

• Ultimately limited by
  o Parallelism found in code
  o Logic for RRF/CDB cycle latencies
  o Hazard checking burden increases quadratically with size of issue queue/reservation station due to the CAM-like structure used for these queues
  o Ability to scale as memory wall increases

Speculative execution evolution
• Scout (Run-ahead) thread
  o During a data-dependent stall (e.g. L1 cache miss) enter run-ahead mode, continuing execution in program order
    - Helps to warm caches until the dependency is resolved and normal execution can be resumed
    - Throws away lots of instructions that could have been executed between the stall and resolution of the data dependency
• Evolution to SST
  o During a data-dependent stall (e.g. L1 cache miss) enter execute-ahead mode, doing speculative execution
(continued)

Evolution to SST
• Speculative Execution Depends on =>
  o Checkpointing
  o Transactional Memory
• Exploits => hardware threading
  o Ahead thread executing instructions speculatively
  o Behind thread executing instructions with resolved data dependencies
• Advantages =>
  o [+] Single-threaded software code is executed simultaneously from 2 different locations using hardware threads
  o [+] Achieves MLP and ILP
  o [-] Program locality already works to keep cache misses to a minimum, and the prefetcher may be able to produce the result with a very low cycle latency

Hazards
• Common to OO and SST
• Data
  o RAW, WAR, WAW
• Control
  o Branching, Exceptions
• Memory Consistency Protocols
  o Scheme must not break the effect of Total Store Ordering (the von Neumann/Turing ordering of a code). In other words, the results of the dynamic machine scheduling of code must not differ from those of the static program schedule.

OO & SST Differences
• Traditional OO
  o Stalls instructions with any data dependency, that is, there is no progression to the retirement unit
  o Uses register renaming to continue OO 'execute'
• SST
  o RAW => Defers instructions and any resolved operands in a deferred queue (DQ)
  o WAR, WAW => Speculatively retired

Data hazards
• Executing instructions out of order is problematic, as potentially N versions of operands are held in a finite set of registers
• When does the register have the correct value for the right instruction?
• RAW: a=5; a=10; b=a+1;
  o b should be 11 not 6
• WAR: a=5; b=a+1; a=6;
  o b should be 6 not 7
• WAW: b=5; b=50;
  o b should be 50 not 5
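The three hazard classes can be checked with a small sketch that runs the same tiny programs in program order and in a hazard-violating order. The `run` helper and the register names are illustrative assumptions, not part of the slides; each program is a list of (destination, function) pairs.

```python
def run(program, order):
    """Execute the (dest, compute) pairs of `program` in the given index order."""
    regs = {}
    for index in order:
        dest, compute = program[index]
        regs[dest] = compute(regs)
    return regs

# RAW: the read of `a` must see the most recent write (a=10).
raw = [("a", lambda r: 5), ("a", lambda r: 10), ("b", lambda r: r["a"] + 1)]
raw_ok = run(raw, [0, 1, 2])["b"]    # program order
raw_bad = run(raw, [0, 2, 1])["b"]   # read hoisted above the second write

# WAR: the later write a=6 must not overtake the earlier read of `a`.
war = [("a", lambda r: 5), ("b", lambda r: r["a"] + 1), ("a", lambda r: 6)]
war_ok = run(war, [0, 1, 2])["b"]    # program order
war_bad = run(war, [0, 2, 1])["b"]   # write hoisted above the read

# WAW: the last write in program order must be the one that survives.
waw = [("b", lambda r: 5), ("b", lambda r: 50)]
waw_ok = run(waw, [0, 1])["b"]       # program order
waw_bad = run(waw, [1, 0])["b"]      # writes swapped

print(raw_ok, raw_bad, war_ok, war_bad, waw_ok, waw_bad)
```

In every case the program-order result is the one the hardware must preserve, whatever order it actually executes in.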

SST handling of Data Hazards
• Ahead thread
  o Avoids RAW by using the NT bit and deferring the instruction
• Behind thread
  o Avoids WAR by saving resolved operands alongside the relevant instruction in the DQ
  o Avoids WAW: the NT bit determines whether it can update the ARF (architectural register file); if not, the WAW bit is set, preventing this, and the SRF register update may only be used for data forwarding
• Discovering and propagating data dependencies
  o Reg[dest] = Reg[operand_1] || ... || Reg[operand_n] (the NT bit of the destination is the OR of the NT bits of its operands)
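The propagation rule above can be sketched in a few lines, reading it as "the destination is Not There if any source operand is Not There". The `propagate_nt` helper, the dictionary of NT bits and the register names are assumptions made for illustration only.

```python
def propagate_nt(nt_bits, dest, sources):
    """Set dest's NT bit to the OR of its sources' NT bits.

    Returns True when the instruction would have to be deferred.
    """
    deferred = any(nt_bits.get(src, False) for src in sources)
    nt_bits[dest] = deferred
    return deferred

nt_bits = {"r1": True}                              # r1 is the target of a load that missed

add_deferred = propagate_nt(nt_bits, "r2", ["r1"])  # ADD r1, 4 -> r2 depends on r1
r2_after_add = nt_bits["r2"]

sethi_deferred = propagate_nt(nt_bits, "r2", [])    # SETHI imm -> r2 has no register sources
r2_after_sethi = nt_bits["r2"]
```

An instruction with no Not There sources executes normally and clears the NT bit of its destination, which is how a younger independent write makes a register valid again.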

SST handling of Control Hazards
• Speculation fails if any of the following occur
  o Branch Mis-Prediction
  o Transactional Memory Failure
    - Memory order violation detected by 'S' bit in cache
  o Exception
• Failed speculation causes
  o speculative checkpoint to be discarded, and
  o architectural checkpoint restored

SST Memory Consistency Protocol
• Load Order protocol
  o Speculative loads set the cache line "S" (speculatively read) bit (transactional memory support)
  o If cache logic evicts or invalidates a line with the 'S' bit set, then ahead-thread speculation has failed for this episode
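A minimal sketch of the load-order check, assuming a cache modelled as a dictionary of per-line S bits. The class and method names are hypothetical; real hardware tracks this in the cache tag state.

```python
class SpeculativeCache:
    """Toy model of the per-line 'S' (speculatively read) bit."""

    def __init__(self):
        self.s_bit = {}                  # line address -> S bit
        self.speculation_failed = False

    def speculative_load(self, line):
        # A load issued during speculation marks its line.
        self.s_bit[line] = True

    def evict_or_invalidate(self, line):
        # Losing a marked line means another agent may have changed data
        # that the speculation already consumed.
        if self.s_bit.pop(line, False):
            self.speculation_failed = True

    def end_episode(self):
        # Clear all marks when the episode commits or is rolled back.
        self.s_bit.clear()
        self.speculation_failed = False

cache = SpeculativeCache()
cache.speculative_load(0x40)
cache.evict_or_invalidate(0x80)          # unmarked line: harmless
failed_after_unmarked = cache.speculation_failed
cache.evict_or_invalidate(0x40)          # marked line: speculation fails
failed_after_marked = cache.speculation_failed
```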

Checkpoints
• For N=2
• At start of an SST episode 2 checkpoints are created
  o Architectural Checkpoint
    - Initially active
    - Once active, ahead-thread progresses with speculative execution
  o Speculative Checkpoint (inactive)
    - Behind thread wakes then makes it active; clears W bit vector
    - NT bit vector copied to SNT bit vector to detect WAW hazards
  o When deferred queue is empty for a speculative episode, a "merge" operation is performed
    - Merge is Ahead-thread results + Behind-thread results => Architectural Checkpoint
    - NT = SNT && W; SNT and W bit vectors cleared; Architectural Checkpoint is discarded; Speculative Checkpoint is made active, i.e. it becomes the new Architectural Checkpoint
  o When deferred queue is empty for all speculative episodes, a "join" operation is performed
    - Join is similar to Merge except nothing remains in the Deferred Queue and the speculative episode is ended, returning the Ahead-thread to Normal mode
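The bit-vector bookkeeping of wake-up and merge can be sketched as follows, taking the slide's formula NT = SNT && W literally and reading W as the Behind thread's counterpart of NT. That reading, the `Episode` class and the one-bit-per-register integer encoding are all assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    nt: int = 0    # Ahead-thread Not There bits, one bit per register
    snt: int = 0   # Speculatively Not There bits
    w: int = 0     # Behind-thread bits, used the way the Ahead thread uses NT

def wake_behind_thread(ep):
    """Behind thread wakes: clear W and copy NT into SNT."""
    ep.w = 0
    ep.snt = ep.nt

def merge(ep):
    """Merge: NT = SNT && W, then clear SNT and W."""
    ep.nt = ep.snt & ep.w
    ep.snt = 0
    ep.w = 0

R1, R2 = 1 << 1, 1 << 2

ep = Episode(nt=R1 | R2)   # r1 and r2 are Not There when the Behind thread wakes
wake_behind_thread(ep)
ep.w |= R2                 # assumed: the Behind thread still could not produce r2
merge(ep)
```

Under these assumptions only r2 remains Not There after the merge, because it is the only register marked in both SNT and W.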

SST new circuit structures
• To Handle N Checkpoints (assume N=2)
  o 2 Defer Queues
    - Hold instructions & resolved operands used by behind thread
  o 1 Architectural register file (aka Normal RF)
    - Initially read by Ahead-thread
  o 2 Working register files (aka speculative RF)
    - Ahead-thread initially reads ARF and updates SRF1 until the speculative checkpoint, when it updates SRF2; the behind-thread wakes and uses SRF1
  o Status bits NT, SNT, W, WAW
    - Not There, Speculatively Not There, Written, WAW
    - Behind thread uses W bit like Ahead thread uses NT bit
    - SNT bit is used to capture register state of Ahead thread when Behind thread initiates
    - NT =/= SNT => WAW when checked during SST episode
    - Any register with WAW set has its value dropped at end of SST episode
  o S bit in Cache line
    - Cache slot is waiting for a 'S'peculative Load

SST logic
[Flowchart of an SST episode; the diagram layout was lost in extraction. Recoverable box labels, grouped by apparent role:]
• Entry: High Level SW initiates a Memory Transaction; L1 Miss; Arch Checkpoint; Set 'S' bit in Cache; Begin SST Episode
• Checkpoint legend: Active = Architectural, Inactive = Speculative
• Ahead thread: Start Executing Main thread Speculatively ahead; Instr has no Data Dependencies?; Execute Instr and Retire OO; Instr has Data Dependencies?; Enqueue DQ with Instr & All Resolved Opr; DQ Full?; WAIT
• Behind thread: Start Behind thread in wait mode to handle Defers; L1 Resolved; Wakeup Behind Thread; Behind Thread Runs Thru DQ for Active Checkpoint; DQ Empty for current & spec ckpt?; WAIT more data expected
• Success: Speculation Successful; Program Execution resumes where speculation finished; Done (Ahead Thread: Normal Mode, Behind Thread: Pause)
• Failure: Tx Fail ('S' bit detects Mem Order Violation); Br Mispredict; Exception; Restore Checkpoint (Ahead Thread: Scout Mode, Behind Thread: Pause)

SST scheduling
Program Order

    LDX   addr1, %r1
    ADD   %r1, 0x04, %r2
    STX   %r2, addr2
    SETHI 0x01, %r2
    STX   %r2, addr3
    etc..

SST Order

    ; Ahead-Thread
    1 LDX   addr1, %r1        ; Load miss on addr1: defer and set R1[NT] (to Defer Q)
                              ; Checkpoint; start Ahead-Thread; Behind-Thread waits for data read
    2 ADD   %r1, 0x04, %r2    ; Source operand has NT bit set: defer and set R2[NT] (to Defer Q)
    3 STX   %r2, addr2        ; Source operand has NT bit set: defer (to Defer Q)
    4 SETHI 0x01, %r2         ; Ahead-Thread executes independently
    5 STX   %r2, addr3        ; Ahead-Thread executes independently & continues
                              ; speculative execution of more program instructions

    ; Load miss resolves, start Behind-Thread
    6 ADD   %r1, 0x04, %r2[NT=0,SNT=1]   ; NT was reset at 4, set WAW bit
    7 STX   %r2, addr2

    ; Deferred Queue
    LDX   addr1, %r1[NT]
    ADD   %r1[NT], 0x04, %r2[NT]
    STX   %r2[NT], addr2

• RAW: Deferring data-dependent instructions prevents RAW. Here %r2 was read at 3 but written before at 2.
• WAR: Saving operands in the DQ prevents WAR, as any valid data in a register at that time is captured and saved for the Behind-Thread to use later, regardless of future writes by the Ahead-Thread.
• WAW: Registers with the WAW bit set are not committed to Architectural state. Here %r2 was written at 4 & 6.
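The schedule above can be replayed with a toy two-pass model: an Ahead pass that executes or defers each instruction, and a Behind pass that drains the deferred queue once the load miss resolves. This is a sketch under stated assumptions, not the hardware algorithm. The `Instr` class, the helper names and the loaded value 100 are invented for illustration, checkpoints and rollback are omitted, WAW is detected from the NT bit alone, and the deferred store is taken to target addr2, as in the deferred queue.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Instr:
    op: str                       # "LDX", "ADD", "STX" or "SETHI"
    srcs: tuple = ()              # source register names
    imm: int = 0                  # immediate operand, if any
    dest: Optional[str] = None    # destination register, if any
    addr: Optional[str] = None    # symbolic memory address, if any
    captured: dict = field(default_factory=dict)  # operands saved at defer time

def compute(ins, values, memory):
    """Result of `ins` given its source values (a store returns the value stored)."""
    if ins.op == "LDX":
        return memory[ins.addr]
    if ins.op == "ADD":
        return values[ins.srcs[0]] + ins.imm
    if ins.op == "SETHI":
        return ins.imm << 10      # SPARC SETHI fills the bits above the low 10
    if ins.op == "STX":
        return values[ins.srcs[0]]
    raise ValueError(ins.op)

def ahead_thread(program, regs, nt, memory, missing):
    """Execute what can be executed, defer the rest; returns the deferred queue."""
    dq = []
    for ins in program:
        miss = ins.op == "LDX" and ins.addr in missing
        if miss or any(s in nt for s in ins.srcs):
            # Save operands that are already valid so later writes cannot clobber them (WAR).
            ins.captured = {s: regs[s] for s in ins.srcs if s not in nt}
            if ins.dest:
                nt.add(ins.dest)  # the result is Not There either (RAW)
            dq.append(ins)
            continue
        result = compute(ins, regs, memory)
        if ins.op == "STX":
            memory[ins.addr] = result
        else:
            regs[ins.dest] = result
            nt.discard(ins.dest)  # a fresh write makes the register valid again
    return dq

def behind_thread(dq, regs, nt, memory):
    """Replay the deferred queue; returns the set of registers flagged WAW."""
    forwarded, waw = {}, set()
    for ins in dq:
        values = {**forwarded, **ins.captured}
        result = compute(ins, values, memory)
        if ins.op == "STX":
            memory[ins.addr] = result
            continue
        if ins.dest in nt:
            regs[ins.dest] = result   # still awaited: commit the value
            nt.discard(ins.dest)
        else:
            waw.add(ins.dest)         # a younger write already landed: forward only
        forwarded[ins.dest] = result
    return waw

program = [
    Instr("LDX", dest="r1", addr="addr1"),
    Instr("ADD", srcs=("r1",), imm=0x04, dest="r2"),
    Instr("STX", srcs=("r2",), addr="addr2"),
    Instr("SETHI", imm=0x01, dest="r2"),
    Instr("STX", srcs=("r2",), addr="addr3"),
]
regs, nt, memory = {}, set(), {"addr1": 100}
dq = ahead_thread(program, regs, nt, memory, missing={"addr1"})
waw = behind_thread(dq, regs, nt, memory)
```

After both passes, `%r2` keeps the SETHI result and the deferred ADD result is only forwarded to the deferred store, which matches the WAW note above.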
