
SimpleScalar's out-of-order simulator (v3)

ECE1773
Andreas Moshovos

Visit www.simplescalar.com for additional info.
SimpleScalar was developed by Todd Austin, now at Michigan.
The first version was written while he was at UWisconsin.
It builds on the experience with other simulators that existed at the time at UWisc.
It introduced many simulation speed enhancements.
It can be used for free for academic purposes.
ECE ECE1773 Spring 02 A. Moshovos (Toronto)

What is sim-outorder?
Approximate model of a dynamically scheduled processor
Simulates:
  I and D caches
  Branch prediction
  I and D TLBs (constant latency)
  Combined reorder buffer and scheduler
  Register renaming
  Support for speculative execution after branches
  Load/store scheduler

How is sim-outorder structured


[Figure: block diagram. Pipeline stages: fetch, disp., sched., mem sched., exec, mem, WB, commit. Attached structures: bpred, I-cache (L1), I-TLB, D-cache (L1), D-TLB, U-cache, main memory, virtual memory.]

Main Simulator Loop


sim_main: forever do
  ruu_commit()
  ruu_release_fu()    internal bookkeeping of which functional units are available
  ruu_writeback()
  lsq_refresh()       load/store scheduler
  ruu_issue()         non-load/store instruction scheduler
  ruu_dispatch()
  ruu_fetch()

These correspond to the green boxes on the previous slide.
Every iteration is a single cycle; the sim_cycle variable counts them.

ruu_fetch()
Fetch and predict up to ruu_decode_width (times fetch_speed) instructions
Place them into the fetch_data[] buffer
Inputs: 2 globals
  fetch_regs_PC: what fetch thinks is the next PC to fetch from
  fetch_pred_PC: the predicted PC for after this instruction
Output: fetch_data[] buffer
  fetch_tail: used by ruu_fetch()
  fetch_head: used by ruu_dispatch()
  fetch_num: number of occupied fetch_data entries
  ruu_ifq_size: total number of fetch_data entries
Fetch places insts and dispatch consumes them
On misprediction:
  PCs are reset to appropriate values and fetch_data is drained
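The head/tail/count bookkeeping of the fetch buffer can be sketched as a small ring buffer. This is only an illustration, not the simulator's code: `IFQ_SIZE`, `struct ifq`, `ifq_push`, `ifq_pop` and `ifq_flush` are hypothetical names, and each entry holds just two PCs.

```c
#include <assert.h>

#define IFQ_SIZE 4                 /* stands in for ruu_ifq_size (assumed value) */

struct ifq_entry { unsigned regs_PC, pred_PC; };

struct ifq {
  struct ifq_entry data[IFQ_SIZE];
  int head;                        /* next entry dispatch will read */
  int tail;                        /* next entry fetch will write   */
  int num;                         /* number of occupied entries    */
};

/* fetch side: returns 0 if the queue is full */
static int ifq_push(struct ifq *q, unsigned pc, unsigned pred_pc)
{
  if (q->num == IFQ_SIZE)
    return 0;
  q->data[q->tail].regs_PC = pc;
  q->data[q->tail].pred_PC = pred_pc;
  q->tail = (q->tail + 1) % IFQ_SIZE;
  q->num++;
  return 1;
}

/* dispatch side: returns 0 if the queue is empty */
static int ifq_pop(struct ifq *q, struct ifq_entry *out)
{
  if (q->num == 0)
    return 0;
  *out = q->data[q->head];
  q->head = (q->head + 1) % IFQ_SIZE;
  q->num--;
  return 1;
}

/* misprediction recovery: drop everything that was fetched */
static void ifq_flush(struct ifq *q)
{
  q->head = q->tail = q->num = 0;
}
```

Keeping an explicit count next to head and tail avoids the usual full-versus-empty ambiguity of a ring buffer.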

ruu_fetch() - loop
If not a bogus address:
  Access the I-cache with fetch_regs_PC and get the latency of the access
  Access the I-TLB: hit/miss
  Determine the overall latency as the max of the two
If prediction is enabled:
  Access the predictor and get fetch_pred_PC plus a back-pointer to the predictor entry
Instruction, PCs and prediction info go into fetch_data[fetch_tail]
fetch_num++; fetch_tail = (fetch_tail + 1) MOD ruu_ifq_size

I-Cache Interface: cache.[ch]

cache_access(*cache_il1, Read/Write, Address, *IObuffer, nbytes, CycleNow, UserData, *repl_address)
  IObuffer, UserData and repl_address are usually NULL
  See cache.h
What it returns is a latency in cycles
  Checks if hit
  If miss, accesses L2, which in turn may access main memory
  Look for il1_access_fn() and ul2_access_fn()
An approximation:
  No real, event-driven simulation of the memory system
  Be careful how you interpret the simulation results
The I-TLB is also simulated as a cache, with few entries and a constant but large miss latency
The cache does not hold memory data, only the tags of cached blocks
  Instructions are read directly from memory (an optimization, be careful)
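The idea of a cache that keeps only tags and answers every access with a latency can be sketched as follows. This is a simplified, direct-mapped illustration and not the code in cache.c; the sizes, the latencies and the name `tag_cache_access` are assumptions.

```c
#include <assert.h>

#define NSETS      8      /* assumed number of sets (direct-mapped)     */
#define BLOCK_SIZE 32     /* assumed block size in bytes                */
#define HIT_LAT    1      /* assumed hit latency in cycles              */
#define MISS_LAT   20     /* assumed constant latency of the next level */

struct tag_cache {
  int valid[NSETS];
  unsigned tag[NSETS];    /* only tags are kept, never the block data */
};

/* Returns the access latency in cycles and updates the tag store. */
static int tag_cache_access(struct tag_cache *c, unsigned addr)
{
  unsigned block = addr / BLOCK_SIZE;
  unsigned set = block % NSETS;
  unsigned tag = block / NSETS;

  if (c->valid[set] && c->tag[set] == tag)
    return HIT_LAT;

  /* miss: the tag is installed right away and the whole miss
     latency is charged to this access; no events are scheduled */
  c->valid[set] = 1;
  c->tag[set] = tag;
  return HIT_LAT + MISS_LAT;
}
```

Because the tag is installed at the time of the call, a second access to the same block in the very next cycle already hits. That is the kind of approximation to keep in mind when reading results.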

Branch Prediction Interface: bpred.[ch]

bpred_lookup(*pred, PC, *target_address, opcode, Call?, Return?, *back-pointer for updates, *back-pointer for stack updates)
  Returns a predicted PC
  Can check whether it is taken or not by comparing with the next sequential PC, i.e., whether pred_PC == PC + sizeof(md_inst_t)
Eventually, call bpred_update(*pred, PC, actual target_address, taken?, pred_taken?, opcode, back-pointer, stack back-pointer)
  Can be called at writeback or commit
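The taken/not-taken test on a predicted PC amounts to one comparison. A minimal sketch, where `addr_t`, `INST_SIZE` and `predicted_taken` are hypothetical stand-ins for md_addr_t, sizeof(md_inst_t) and the inline comparison:

```c
#include <assert.h>

typedef unsigned int addr_t;   /* stands in for md_addr_t */
#define INST_SIZE 8u           /* stands in for sizeof(md_inst_t); assumed value */

/* A prediction counts as "taken" when it is not the fall-through PC. */
static int predicted_taken(addr_t pc, addr_t pred_pc)
{
  return pred_pc != pc + INST_SIZE;
}
```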

Fetch buffer: fetch_data[]


struct fetch_rec {
  md_inst_t IR;                       /* complete instruction */
  md_addr_t regs_PC;                  /* current PC */
  md_addr_t pred_PC;                  /* predicted PC */
  struct bpred_update_t dir_update;   /* bpred back-pointer */
  int stack_recover_idx;              /* stack back-pointer */
  unsigned int ptrace_seq;            /* print trace sequence id */
};

fetch_tail:   ruu_fetch writes there
fetch_head:   ruu_dispatch reads from there
fetch_num:    how many valid
ruu_ifq_size: max entries

ruu_fetch()
for (i=0, branch_cnt=0;
     /* fetch up to as many instructions as the DISPATCH stage can decode */
     i < (ruu_decode_width * fetch_speed)
     /* fetch until IFETCH -> DISPATCH queue fills */
     && fetch_num < ruu_ifq_size
     /* and no IFETCH blocking condition encountered */
     && !done;
     i++)
  {
    MAIN LOOP
  }

done is used for enforcing fetch break conditions.
Currently this happens only when the number of branches exceeds fetch_speed.

ruu_fetch(): Invalid Address Check

if (ld_text_base <= fetch_regs_PC
    && fetch_regs_PC < (ld_text_base+ld_text_size)
    && !(fetch_regs_PC & (sizeof(md_inst_t)-1)))
  {
    /* read instruction from memory */
    MD_FETCH_INST(inst, mem, fetch_regs_PC);

ruu_fetch(): I-Cache Access

if (cache_il1)
  {
    /* access the I-cache */
    lat = cache_access(cache_il1, Read, IACOMPRESS(fetch_regs_PC),
                       NULL, ISCOMPRESS(sizeof(md_inst_t)), sim_cycle,
                       NULL, NULL);
    if (lat > cache_il1_lat)
      last_inst_missed = TRUE;
  }
if (itlb)
  tlb_lat = cache_access(itlb, Read, IACOMPRESS(fetch_regs_PC)
...
lat = MAX(tlb_lat, lat);
if (lat != cache_il1_lat)
  {
    /* I-cache miss, block fetch until it is resolved */
    ruu_fetch_issue_delay += lat - 1;
    break;
  }

sim_main(): ruu_fetch() code

if (!ruu_fetch_issue_delay)
  ruu_fetch();
else
  ruu_fetch_issue_delay--;
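The effect of the ruu_fetch_issue_delay counter can be seen in a toy loop. This is a sketch under strong assumptions: `fetch_stage` and `cycle` are hypothetical, and every fetch is made to miss with the same latency.

```c
#include <assert.h>

static int fetch_issue_delay;   /* cycles fetch must still stay blocked      */
static int fetch_calls;         /* how many times the fetch stage really ran */

/* Hypothetical fetch stage: every call sees an access of latency lat. */
static void fetch_stage(int lat)
{
  fetch_calls++;
  if (lat > 1)
    fetch_issue_delay += lat - 1;   /* block for the remaining lat-1 cycles */
}

/* One simulated cycle of the main loop, fetch part only. */
static void cycle(int lat)
{
  if (!fetch_issue_delay)
    fetch_stage(lat);
  else
    fetch_issue_delay--;
}
```

With a latency of 5, fetch runs in one cycle and is then skipped for the next four, so it runs once every five cycles.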

ruu_dispatch()
Get next inst from the fetch buffer
Functionally execute the instruction
Split loads/stores into
  1. Address calculation
  2. Memory operation
Rename input dependences
Rename target register
Place into the scheduler RUU[] and, if necessary, the load/store scheduler LSQ[]
Determine if misprediction
Issue if ready

Functional and timing execution
Ignore mispredictions for the time being
SimpleScalar executes all instructions in order during dispatch
  They update registers and memory at that time
Then it tries to determine when they would actually execute, taking into consideration dependences and latencies
  This is simulation, so we can do this
Pros: fast, easy to debug
Cons: the timing model can be wrong and the simulation will still produce correct program results, so timing errors are easy to miss

Handling Mispredictions
Two modes: correct and mis-speculated
ruu_dispatch switches to the second when it decodes a mispredicted branch
  It knows because it executes the branch and figures out whether the prediction is correct
  Global spec_mode is 1 when in mis-speculated mode
Switch back to correct when the branch is resolved

Handling Mispredictions
Keep two states: correct and mis-speculated
  For regs there is regs_R[] and spec_regs_R[] (and _F)
  For memory, there is mem_access and spec_mem_access
    Speculative memory updates are kept in a temporary hash table
    Loads access this table first and then memory if needed
    Stores write only to it when in spec mode
If in the correct state, access the correct state
If in spec_mode, access the mis-speculated state
Effect: no need to restore state
  Incorrect, speculative updates do not clobber the correct state
When squashing we simply return to the correct state
  i.e., disregard the spec. hash mem table
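The two-state scheme for registers can be sketched with a second register array and one bit per register. This is an illustration only: `set_reg`, `get_reg` and `recover` are hypothetical helpers, and the bitmap is a plain `unsigned` rather than the simulator's bitmap macros.

```c
#include <assert.h>

#define NREGS 32

static int regs_R[NREGS];        /* correct (architectural) values     */
static int spec_regs_R[NREGS];   /* values written on the wrong path   */
static unsigned use_spec_R;      /* bit r set: read r from spec_regs_R */
static int spec_mode;            /* non-zero while on the wrong path   */

static void set_reg(int r, int v)
{
  if (spec_mode) {
    spec_regs_R[r] = v;          /* never touches the correct state */
    use_spec_R |= 1u << r;
  } else {
    regs_R[r] = v;
  }
}

static int get_reg(int r)
{
  return (use_spec_R & (1u << r)) ? spec_regs_R[r] : regs_R[r];
}

/* squash: forget every speculative write; no values are copied back */
static void recover(void)
{
  use_spec_R = 0;
  spec_mode = 0;
}
```

Recovery costs a single bitmap clear, which is why no checkpoint of the register file is needed.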

ruu_dispatch(): reading from fetch buffer


inst = fetch_data[fetch_head].IR;
regs.regs_PC = fetch_data[fetch_head].regs_PC;
pred_PC = fetch_data[fetch_head].pred_PC;
dir_update_ptr = &(fetch_data[fetch_head].dir_update);
stack_recover_idx = fetch_data[fetch_head].stack_recover_idx;
pseq = fetch_data[fetch_head].ptrace_seq;

Ignore all pseq; they are for a debugging/tracing facility.

Scheduler Structure
Circular buffer named RUU
Each entry contains
  The instruction, PC and pred_PC
  Valid bits for input registers
  A linked list of consumers per target register
  Branch prediction back-pointers
  Status flags, e.g., what state is this in, is it an address op
An instruction can execute when all source registers are available: readyq in ruu_issue()
On writeback:
  Walk the target list, set the ready bits of consumers, and place them on readyq if they become ready
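The wakeup mechanism (a producer walking its consumer list at writeback) can be sketched as below. The types are cut down to the minimum: `struct station`, `struct link`, `link_consumer` and `writeback` are hypothetical, each station has a single output, and the `queued` flag stands in for insertion into readyq.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDEPS 3

struct station;

/* one consumer waiting on a producer's result */
struct link {
  struct station *rs;     /* the consumer                           */
  int opnum;              /* which of its input operands is waiting */
  struct link *next;
};

struct station {
  int idep_ready[MAX_IDEPS];
  int queued;             /* stands in for "is on readyq"       */
  struct link *odep_list; /* consumers of this station's result */
};

static int operands_ready(const struct station *rs)
{
  int i;
  for (i = 0; i < MAX_IDEPS; i++)
    if (!rs->idep_ready[i])
      return 0;
  return 1;
}

/* dispatch side: consumer waits for operand opnum from producer */
static void link_consumer(struct station *producer, struct link *l,
                          struct station *consumer, int opnum)
{
  consumer->idep_ready[opnum] = 0;
  l->rs = consumer;
  l->opnum = opnum;
  l->next = producer->odep_list;
  producer->odep_list = l;
}

/* writeback side: wake up consumers, return how many became ready */
static int writeback(struct station *producer)
{
  int woken = 0;
  struct link *l;
  for (l = producer->odep_list; l != NULL; l = l->next) {
    l->rs->idep_ready[l->opnum] = 1;
    if (operands_ready(l->rs) && !l->rs->queued) {
      l->rs->queued = 1;
      woken++;
    }
  }
  producer->odep_list = NULL;
  return woken;
}
```

A consumer with two outstanding producers is woken only by whichever of them writes back last.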

Scheduler structure: RUU_station


struct RUU_station {
  md_inst_t IR;                          /* instruction bits */
  enum md_opcode op;                     /* decoded instruction opcode */
  md_addr_t PC, next_PC, pred_PC;        /* inst PC, next PC, predicted PC */
  int in_LSQ;                            /* non-zero if op is in LSQ */
  int ea_comp;                           /* non-zero if op is an addr comp */
  int recover_inst;                      /* start of mis-speculation? */
  int stack_recover_idx;                 /* non-speculative TOS for RSB pred */
  struct bpred_update_t dir_update;      /* bpred direction update info */
  int spec_mode;                         /* non-zero if issued in spec_mode */
  md_addr_t addr;                        /* effective address for ld/st's */
  INST_TAG_TYPE tag;                     /* RUU slot tag, increment to squash operation */
  INST_SEQ_TYPE seq;                     /* used to sort the ready list and tag inst */
  int queued;                            /* operands ready and queued */
  int issued;                            /* operation is/was executing */
  int completed;                         /* operation has completed execution */
  int onames[MAX_ODEPS];                 /* output logical names (NA=unused) */
  struct RS_link *odep_list[MAX_ODEPS];  /* chains to consuming operations */
  int idep_ready[MAX_IDEPS];             /* input operand ready? */
};

Scheduler State
RUU[]: in-order instructions to be executed
  Allocated at dispatch
  Deallocated at commit or on squash (tracer_recover())
  RUU_head, RUU_tail, RUU_num, RUU_size
LSQ[]: in-order loads and stores
  Same as above
  Scheduling is done by comparing addresses
  More on this soon

Determining Dependences
ruu_link_idep(rs, /* idep_ready[] index */0, reg_name);
ruu_install_odep(rs, /* odep_list[] index */0, reg_name);
Rename table: CREATE_VECTOR(reg_name)
  Returns a pointer to the RUU entry of the producer, or NULL if the result is available
  Actual data type is CV_link (RUU_station *rs, odep_num)
SET_CREATE_VECTOR(reg_name, RUU station)
  Makes this RUU_station the current producer of reg_name
Two copies of the create vector:
  create_vector and spec_create_vector

Renaming Non-Load/Store Instructions

ruu_link_idep(rs, /* idep_ready[] index */0, in1);
ruu_link_idep(rs, /* idep_ready[] index */1, in2);
ruu_link_idep(rs, /* idep_ready[] index */2, in3);
ruu_install_odep(rs, /* odep_list[] index */0, out1);
ruu_install_odep(rs, /* odep_list[] index */1, out2);

Renaming loads/stores

/* address computation (rs) */
ruu_link_idep(rs, /* idep_ready[] index */0, NA);
ruu_link_idep(rs, /* idep_ready[] index */1, in2);
ruu_link_idep(rs, /* idep_ready[] index */2, in3);
ruu_install_odep(rs, /* odep_list[] index */0, DTMP);
ruu_install_odep(rs, /* odep_list[] index */1, NA);

/* memory operation (lsq) */
ruu_link_idep(lsq, /* idep_ready[] index */STORE_OP_INDEX/* 0 */, in1);
ruu_link_idep(lsq, /* idep_ready[] index */STORE_ADDR_INDEX/* 1 */, DTMP);
ruu_link_idep(lsq, /* idep_ready[] index */2, NA);
ruu_install_odep(lsq, /* odep_list[] index */0, out1);
ruu_install_odep(lsq, /* odep_list[] index */1, out2);

ruu_link_idep(rs, idep_num, idep_name)

struct CV_link head;
struct RS_link *link;

/* no input dependence for this slot: operand is ready */
if (idep_name == NA)
  {
    rs->idep_ready[idep_num] = TRUE;
    return;
  }
/* locate the creator of the operand */
head = CREATE_VECTOR(idep_name);
/* no active creator: the value is already available */
if (!head.rs)
  {
    rs->idep_ready[idep_num] = TRUE;
    return;
  }
/* otherwise wait: link onto the creator's output dependence list */
rs->idep_ready[idep_num] = FALSE;
RSLINK_NEW(link, rs);
link->x.opnum = idep_num;
link->next = head.rs->odep_list[head.odep_num];
head.rs->odep_list[head.odep_num] = link;

CREATE_VECTOR(N): Register Rename Table Read

(BITMAP_SET_P(use_spec_cv, CV_BMAP_SZ, (N))
 ? spec_create_vector[N]
 : create_vector[N])

use_spec_cv is a bit vector: one bit per register
Bit N is set when we rename the target register N while in spec_mode

ruu_install_odep(rs, odep_num, odep_name)

struct CV_link cv;

if (odep_name == NA)
  {
    rs->onames[odep_num] = NA;
    return;
  }
rs->onames[odep_num] = odep_name;
rs->odep_list[odep_num] = NULL;
/* indicate this operation is latest creator of ODEP_NAME */
CVLINK_INIT(cv, rs, odep_num);
SET_CREATE_VECTOR(odep_name, cv);

SET_CREATE_VECTOR(odep_name, cv)
Set the current producer of register odep_name to the RUU entry stored in cv

SET_CREATE_VECTOR(N, L):
  if (spec_mode)
    BITMAP_SET(use_spec_cv, CV_BMAP_SZ, (N)), spec_create_vector[N] = (L)
  else
    create_vector[N] = (L)

No need to keep the old mapping around since we never have to restore

ruu_dispatch(): determining ready to issue insts


if (OPERANDS_READY(rs))
  {
    /* eff addr computation ready, queue it on ready list */
    readyq_enqueue(rs);
  }
/* issue may continue when the load/store is issued */
RSLINK_INIT(last_op, lsq);   /* for in-order simulation */

/* issue stores only, loads are issued by lsq_refresh() */
if (((MD_OP_FLAGS(op) & (F_MEM|F_STORE)) == (F_MEM|F_STORE))
    && OPERANDS_READY(lsq))
  {
    /* put operation on ready list, ruu_issue() issue it later */
    readyq_enqueue(lsq);
  }

Misprediction Detection

if (MD_OP_FLAGS(op) & F_CTRL)
  {
    sim_num_branches++;
    if (pred && bpred_spec_update == spec_ID)
      {
        /* update predictor if configured for spec. updates */
      }
  }
if (pred_PC != regs.regs_NPC && !fetch_redirected)
  {
    spec_mode = TRUE;
    rs->recover_inst = TRUE;
    recover_PC = regs.regs_NPC;
  }

ruu_issue(): Dynamic scheduling of non-loads/stores
Walk the readyq
  Try to get resources (FUs)
  Get latency of execution
  Put an entry into the eventq for the completion time
  If it cannot execute, place it back into readyq
The eventq is serviced by ruu_writeback

Who places instructions in readyq?
Being in readyq means the instruction is ready to issue
From dispatch:
  Non-load/store if all sources are available
    This includes the address component of lds/sts
  Stores if data is available
    Recall that address computation is a separate instruction
From writeback:
  Producer writes the last result a consumer waits for
From lsq_refresh():
  Called every cycle
  A load is ready when its address is known, all preceding store addresses are known, and there is no conflict with unavailable store data

ruu_issue(): main loop
Get next entry from readyq
If still valid (RSLINK_VALID(rs)), try to execute
If store, complete instantaneously: nothing to produce
fu = res_get(fu_pool, MD_OP_FUCLASS(rs->op));
  Get functional unit for instruction based on operation
Get latency of execution
  For loads, access data cache and TLB
Queue event in eventq for completion (ruu_writeback)
  eventq_queue_event(rs, sim_cycle + latency);
If it cannot execute, place it back in readyq

ruu_issue(): Loads
Get mem port resource
Scan LSQ for a matching preceding store
  For the load to be executing at all, any matching store must already have its data
  This is called store-load forwarding
If no match, access cache_dl1 and dtlb
  The latency is the max of the two
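The scan for a matching preceding store can be sketched as a backward walk over the queue. This is a simplification: `struct lsq_entry` and `find_forwarding_store` are hypothetical, indices are linear while the real LSQ is circular, and addresses are compared exactly.

```c
#include <assert.h>

struct lsq_entry {
  int is_store;
  unsigned addr;
};

/* Walk from the entry just before the load back toward the oldest entry.
   Returns the index of the youngest earlier store to the same address,
   or -1 if the load has to go to the data cache. */
static int find_forwarding_store(const struct lsq_entry *lsq, int load_idx)
{
  int i;
  for (i = load_idx - 1; i >= 0; i--)
    if (lsq[i].is_store && lsq[i].addr == lsq[load_idx].addr)
      return i;
  return -1;
}
```

Walking backward matters: with two earlier stores to the same address, the load must see the younger one.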

ruu_issue(): High-Level Structure
Temporary list: node = readyq; readyq = NULL
So long as there are issue slots available
  Get next element from node
  If still valid
    Try to get resource
    Determine latency
    Schedule eventq event
    If no resource is available, place back in readyq
Place remaining nodes back into readyq (readyq_enqueue() keeps it sorted by latency and age)
The order in readyq is the implicit issue priority

lsq_refresh(): Placing loads into readyq
LSQ uses the same elements as RUU
Scheduling is done based on the addr field and availability of operands
Scan forward (from LSQ_head, counting to LSQ_num)
  If store
    Stop if address is unknown: loads after it should wait
    If data unavailable, record address in std_unknowns: loads that need this data should wait
  If load and all register ops are ready
    Scan std_unknowns for a match
    Place in readyq if no match

lsq_refresh(): stores
if (!STORE_ADDR_READY(&LSQ[index]))
  break;
else if (!OPERANDS_READY(&LSQ[index]))
  std_unknowns[n_std_unknowns++] = LSQ[index].addr;
else /* STORE_ADDR_READY() && OPERANDS_READY() */
  {
    /* a later STD known hides an earlier STD unknown */
    for (j=0; j<n_std_unknowns; j++)
      if (std_unknowns[j] == /* STA/STD known */LSQ[index].addr)
        std_unknowns[j] = /* bogus addr */0;
  }

lsq_refresh(): Loads
if (/* load? */
    ((MD_OP_FLAGS(LSQ[index].op) & (F_MEM|F_LOAD)) == (F_MEM|F_LOAD))
    && /* queued? */!LSQ[index].queued
    && /* waiting? */!LSQ[index].issued
    && /* completed? */!LSQ[index].completed
    && /* regs ready? */OPERANDS_READY(&LSQ[index]))
  {
    for (j=0; j<n_std_unknowns; j++)
      if (std_unknowns[j] == LSQ[index].addr)
        break;
    if (j == n_std_unknowns)
      {
        /* no STA or STD unknown conflicts, put load on ready queue */
        readyq_enqueue(&LSQ[index]);
      }
  }
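Put together, the store and load cases of the scan can be exercised on a small array. The sketch below follows the same steps but is not the simulator's code: `struct lsq_entry` and `lsq_scan` are hypothetical, the queue is a linear array, and setting `queued` stands in for readyq_enqueue().

```c
#include <assert.h>

#define MAX_STD_UNKNOWNS 64

struct lsq_entry {
  int is_store;
  int addr_ready;    /* address known (for loads: all operands ready) */
  int data_ready;    /* store data available (unused for loads)       */
  unsigned addr;
  int queued;        /* set when the load is put on the ready queue   */
};

/* Walk the queue oldest-first; return the number of loads made ready. */
static int lsq_scan(struct lsq_entry *lsq, int n)
{
  unsigned std_unknowns[MAX_STD_UNKNOWNS];
  int n_std_unknowns = 0, made_ready = 0, i, j;

  for (i = 0; i < n; i++) {
    struct lsq_entry *e = &lsq[i];
    if (e->is_store) {
      if (!e->addr_ready)
        break;                      /* unknown store address: later loads wait */
      if (!e->data_ready) {
        assert(n_std_unknowns < MAX_STD_UNKNOWNS);
        std_unknowns[n_std_unknowns++] = e->addr;
      } else {
        /* a later store with data hides an earlier one without */
        for (j = 0; j < n_std_unknowns; j++)
          if (std_unknowns[j] == e->addr)
            std_unknowns[j] = 0;    /* bogus address */
      }
    } else if (e->addr_ready && !e->queued) {
      for (j = 0; j < n_std_unknowns; j++)
        if (std_unknowns[j] == e->addr)
          break;
      if (j == n_std_unknowns) {    /* no conflict found */
        e->queued = 1;
        made_ready++;
      }
    }
  }
  return made_ready;
}
```

In the usage below, only the load to 0x200 becomes ready: the load to 0x100 conflicts with a store whose data is missing, and the last load sits behind a store with an unknown address.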

ruu_writeback(): Producer notifies consumers


Get next event from eventq
If this is a recover instruction
  Squash all that follows
  ruu_recover(), tracer_recover() and bpred_recover()
If branch, update predictor
Update rename table if still the creator
  rs->spec_mode determines which one
  Subsequent consumers can get the result from the register file
Walk output dependence lists
  If link still valid
    Set idep_ready flags
    If consumer becomes ready, place on readyq for ruu_issue()

Recovering from Mispredictions
The instruction with rs->recover_inst set (by ruu_dispatch) writes back
ruu_recover()
  Walk back from the end of the RUU
  Clean up output dependence lists, freeing RS_links
  Same for the LSQ entry if it exists (1-to-1 correspondence with RUU entries that have rs->ea_comp set)
  rs->tag++ (invalidates all RS_links to this RUU entry; it could be that we linked to a producer that will not be squashed)
  Clear use_spec_cv (create vector)
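The tag trick can be shown in isolation: a link remembers the tag its target had when the link was made, and bumping the tag makes every such link stale at once. `struct station`, `struct link`, `link_init`, `link_valid` and `squash` below are hypothetical, reduced versions of the real structures and macros.

```c
#include <assert.h>

struct station { unsigned tag; };

struct link {
  struct station *rs;
  unsigned tag;          /* copy of rs->tag taken when the link was made */
};

static void link_init(struct link *l, struct station *rs)
{
  l->rs = rs;
  l->tag = rs->tag;
}

/* a link is stale once its station has been squashed and its tag bumped */
static int link_valid(const struct link *l)
{
  return l->tag == l->rs->tag;
}

static void squash(struct station *rs)
{
  rs->tag++;             /* invalidates every outstanding link at once */
}
```

This avoids searching other entries' lists for links to a squashed instruction; stale links are simply ignored when they are next encountered.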

tracer_recover()
Clear use_spec_R etc.
  Bitmaps indicating where register values are
  Set when writing to the register file in spec_mode
Clean up speculative memory store state
Reset the fetch stage by emptying fetch_data
  fetch_tail = fetch_head = fetch_num = 0
For bpred_recover look into bpred.c

ruu_commit()
Scan starting from the oldest inst in RUU (RUU_head)
If completed then try to commit
If store, get a memory port and write to memory
  Fail if can't get resource
  Does not simulate a write buffer
  Access data cache
If load/store, release LSQ entry
If branch, update predictor if so configured
Release RUU entry
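In-order retirement from the head of a circular buffer can be sketched as below. The memory port, cache and predictor steps are left out, `struct entry` and `commit` are hypothetical, and the `width` limit is an assumption of this sketch.

```c
#include <assert.h>

#define RUU_SIZE 8

struct entry { int completed; };

static struct entry ruu[RUU_SIZE];
static int ruu_head, ruu_num;

/* Retire up to 'width' completed instructions from the head, in order.
   Stops at the first entry that has not completed yet. */
static int commit(int width)
{
  int committed = 0;
  while (ruu_num > 0 && committed < width && ruu[ruu_head].completed) {
    ruu[ruu_head].completed = 0;           /* release the entry */
    ruu_head = (ruu_head + 1) % RUU_SIZE;
    ruu_num--;
    committed++;
  }
  return committed;
}
```

A completed instruction behind an incomplete one stays in the buffer until everything older has retired.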

How is sim-outorder structured


[Figure: block diagram. Pipeline stages: fetch, disp., sched., mem sched., exec, mem, WB, commit. Attached structures: bpred, I-cache (L1), I-TLB, D-cache (L1), D-TLB, U-cache, main memory, virtual memory.]

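fetch_data[]
[Figure: the fetch_data[] circular buffer with ruu_ifq_size entries, fetch_num of them valid. Each entry holds IR, regs_PC, pred_PC and the bpred pointers. ruu_fetch() writes at fetch_tail and ruu_dispatch() reads at fetch_head. tracer_recover (from ruu_writeback) also acts on the buffer.]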

struct RUU_station RUU[WINDOW]
[Figure: the RUU circular buffer with RUU_size entries, RUU_num of them valid. ruu_dispatch() allocates at RUU_tail and ruu_commit() releases at RUU_head. ruu_recover (from ruu_writeback) also acts on the buffer.]

struct RUU_station: Scheduling Related Entries
[Figure: scheduling fields of one RUU_station.
 Input ready flags: idep_ready[0], idep_ready[1], idep_ready[2]. All must be 1 to be ready. Labeled with ruu_dispatch() and ruu_writeback().
 Output registers: onames[0] and onames[1], each with a consumer list, odep_list[0] and odep_list[1]. Labeled with ruu_recover and ruu_writeback.
 Each consumer list element is a struct RS_link: next, &RUU[consumer], tag (unique ID), x.opnum.]

LSQ: Load/Store Scheduler
Same as RUU
[Figure: the LSQ circular buffer with LSQ_size entries, LSQ_num of them valid. ruu_dispatch() allocates at LSQ_tail and ruu_commit() releases at LSQ_head. ruu_recover (from ruu_writeback()) also acts on the buffer.]

Register Renaming Structures
[Figure: create_vector and spec_create_vector, each with one entry per register (reg 1 ... reg N). An entry links to an RUU or LSQ entry (*rs or *lsq) and one of its output regs (opnum, 0 or 1). The use_spec_cv bitmap selects which vector to use. Labeled with ruu_install_odep in ruu_dispatch(), ruu_writeback() and ruu_recover.]

Register State, e.g., regs_R[]
[Figure: regs.regs_R and spec_regs_R, each holding one value per register (reg 1 ... reg N). The use_spec_R bitmap selects which one to use. Labeled with ruu_dispatch(), ruu_writeback() and tracer_recover.]

Ready Queue
[Figure: readyq, a linked list of RS_link elements (next, *rs or *lsq, tag). ruu_writeback() and ruu_dispatch() insert non-loads if ready, lsq_refresh() inserts loads, and ruu_issue() removes entries and tries to execute them.]

Event Queue
[Figure: eventq, a linked list of RS_link elements (next, *rs or *lsq, tag, x.when). ruu_issue() inserts at sim_cycle + latency and ruu_writeback() removes upon completion.]
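An event queue of this kind can be sketched as a singly linked list kept sorted by completion time. The names `struct event`, `eventq_insert` and `eventq_next` are hypothetical; the simulator's own entries are RS_links with an x.when field.

```c
#include <assert.h>
#include <stddef.h>

struct event {
  int id;                /* stands in for the RUU/LSQ entry */
  long when;             /* completion cycle                */
  struct event *next;
};

static struct event *eventq;     /* sorted by increasing 'when' */

/* issue side: insert so that the list stays sorted */
static void eventq_insert(struct event *ev, long when)
{
  struct event **pp = &eventq;
  ev->when = when;
  while (*pp != NULL && (*pp)->when <= when)
    pp = &(*pp)->next;
  ev->next = *pp;
  *pp = ev;
}

/* writeback side: next event due at or before 'now', or NULL if none */
static struct event *eventq_next(long now)
{
  struct event *ev = eventq;
  if (ev == NULL || ev->when > now)
    return NULL;
  eventq = ev->next;
  return ev;
}
```

Keeping the list sorted means writeback only ever looks at the head: if the head is not due yet, nothing is.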

Summary of Concepts/Interfaces
ruu_fetch to ruu_dispatch via the fetch_data buffer
ruu_dispatch executes instructions in order
  Breaks load/store into addr and memory op
  Links to producer of input regs
  Renames output reg to RUU or LSQ
  Determines if entering misprediction mode
    Marks inst via rs->recover_inst
  Two states: mis-speculated and correct (reg files, memory, rename tables, etc.)
  May place insts in readyq if ready

Summary contd.
ruu_issue:
  Scan readyq trying to issue
  Who puts insts in readyq?
    ruu_dispatch: non-loads if inputs are ready
    lsq_refresh: loads when certain that there are no conflicts
    ruu_writeback: producer places consumers if they become ready
  Get fu, get latency, schedule event for writeback
lsq_refresh:
  Determines when loads can issue
  Wait until all preceding stores calculate their address
  Stall if conflict with a store that has no data

Summary contd.
ruu_writeback:
  Producer notifies consumers of result
  Determines if a consumer is ready and places it in readyq
  Updates rename tables to indicate that the result is now in the register file
  Calls recovery routines if this is a recover instruction (first mispredicted)
ruu_commit:
  Performs stores
  Releases RUU and LSQ entries

Caveats
SimpleScalar uses optimizations that favor simulation speed
It does not simulate an event-driven memory system
Be careful to make sure that you use it appropriately
