ECE1773, Andreas Moshovos
Visit www.simplescalar.com for additional info.
  - SimpleScalar was developed by Todd Austin, now at Michigan. The first version was written while he was at UWisconsin.
  - Builds on the experience with other simulators that existed at UWisc at the time.
  - Introduced many simulation speed enhancements.
  - Can be used for free for academic purposes.
ECE ECE1773 Spring 02 A. Moshovos (Toronto)
Figure: pipeline overview (I-cache, I-TLB, L1).
  - These stages correspond to the green boxes on the previous slide.
  - Every iteration is a single cycle: the sim_cycle variable counts them.
ruu_fetch()
  - Fetch and predict up to ruu_decode_width instructions.
  - Place them into the fetch_data[] buffer.
  - Inputs: 2 globals
      fetch_regs_PC: what fetch thinks is the next PC to fetch from
      fetch_pred_PC: the predicted PC to fetch from after this instruction
ruu_fetch(): loop
  - If not a bogus address:
      Access the I-cache with fetch_regs_PC and get the latency of the access.
      Access the I-TLB (hit/miss).
      Determine the overall latency as the max of the two.
  - If prediction is enabled: access the predictor and get fetch_pred_PC plus a back-pointer to the predictor entry.
  - Instruction, PCs and prediction info go into fetch_data[fetch_tail].
  - fetch_num++; fetch_tail = (fetch_tail + 1) MOD ruu_ifq_size
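The queue discipline described above can be sketched as a small circular buffer. This is a minimal illustration, not the simulator's code: the reduced `struct fetch_rec`, the helpers `ifq_push`, `ifq_pop` and `fetch_latency`, and the queue size of 4 are all assumptions made for the example.

```c
#include <assert.h>

#define IFQ_SIZE 4              /* assumed value; stands in for ruu_ifq_size */

struct fetch_rec {              /* hypothetical, reduced fetch_data[] entry */
    unsigned inst;              /* instruction bits */
    unsigned pc;                /* PC the instruction was fetched from */
    unsigned pred_pc;           /* predicted next PC */
};

static struct fetch_rec fetch_data[IFQ_SIZE];
static int fetch_head, fetch_tail, fetch_num;

/* Overall fetch latency is the slower of the I-cache and I-TLB accesses. */
static int fetch_latency(int cache_lat, int tlb_lat)
{
    return cache_lat > tlb_lat ? cache_lat : tlb_lat;
}

/* Producer side (fetch): returns 1 on success, 0 if the queue is full. */
static int ifq_push(unsigned inst, unsigned pc, unsigned pred_pc)
{
    if (fetch_num == IFQ_SIZE)
        return 0;
    fetch_data[fetch_tail].inst = inst;
    fetch_data[fetch_tail].pc = pc;
    fetch_data[fetch_tail].pred_pc = pred_pc;
    fetch_tail = (fetch_tail + 1) % IFQ_SIZE;   /* wrap around */
    fetch_num++;
    return 1;
}

/* Consumer side (dispatch): returns 1 and fills *out, or 0 if empty. */
static int ifq_pop(struct fetch_rec *out)
{
    if (fetch_num == 0)
        return 0;
    *out = fetch_data[fetch_head];
    fetch_head = (fetch_head + 1) % IFQ_SIZE;   /* wrap around */
    fetch_num--;
    assert(fetch_num >= 0);
    return 1;
}
```

Keeping a separate count (fetch_num) next to the head and tail indices makes the full and empty cases easy to tell apart, which is why the fetch loop can simply test fetch_num against the queue size.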
An approximation:
  - No real, event-driven simulation of the memory system.
  - Be careful how you interpret the simulation results.
  - The I-TLB is also simulated as a cache, with few entries and a constant but still large miss latency.
  - The cache does not hold memory data, only the tags of cached blocks. Instructions are read by accessing memory directly (an optimization, be careful).
Branch Prediction Interface: bpred.[ch]
  bpred_lookup(*pred, PC, *target_address, opcode, call?, return?, *back-pointer for updates, *back-pointer for stack updates)
  - Returns a predicted PC.
  - Can check whether the prediction is taken or not by comparing it with the next sequential PC: the prediction is not-taken when pred_PC = PC + sizeof(md_inst_t).
  Eventually, call bpred_update(*pred, PC, actual target_address, taken?, pred_taken?, opcode, back-pointer, stack back-pointer)
  - Can be called at writeback or commit.
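The taken/not-taken check and the later comparison against the real outcome can be sketched as follows. This is only an illustration under stated assumptions: `addr_t` stands in for md_addr_t, `INST_SIZE` stands in for sizeof(md_inst_t) with an assumed value of 8 bytes, and both helper functions are hypothetical names, not part of bpred.[ch].

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t addr_t;        /* stand-in for md_addr_t */
#define INST_SIZE 8u            /* assumed; stands in for sizeof(md_inst_t) */

/* A prediction counts as "taken" when it differs from the fall-through PC. */
static int predicted_taken(addr_t pc, addr_t pred_pc)
{
    return pred_pc != pc + INST_SIZE;
}

/* The prediction was wrong when it differs from the PC that actually follows. */
static int mispredicted(addr_t pred_pc, addr_t actual_next_pc)
{
    assert(INST_SIZE > 0);
    return pred_pc != actual_next_pc;
}
```

Note that a taken branch whose target happens to be the next sequential instruction cannot be told apart from a not-taken one by this comparison.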
Figure: fields of a fetch_data[] entry (bpred back-pointer, stack back-pointer, trace sequence id) and the queue pointers: where ruu_fetch writes, where ruu_dispatch reads, how many entries are valid, and the maximum number of entries.
ruu_fetch()
  for (i=0, branch_cnt=0;
       /* fetch up to as many instructions as the DISPATCH stage can decode */
       i < (ruu_decode_width * fetch_speed)
       /* fetch until IFETCH -> DISPATCH queue fills */
       && fetch_num < ruu_ifq_size
       /* and no IFETCH blocking condition encountered */
       && !done;
       i++)
    {
      MAIN LOOP
    }
  - done is used for enforcing fetch break conditions.
  - Currently this happens only when the number of branches exceeds fetch_speed.
ruu_fetch(): Invalid Address Check
  if (ld_text_base <= fetch_regs_PC
      && fetch_regs_PC < (ld_text_base+ld_text_size)
      && !(fetch_regs_PC & (sizeof(md_inst_t)-1)))
    {
      /* read instruction from memory */
      MD_FETCH_INST(inst, mem, fetch_regs_PC);
      ...
      lat = MAX(tlb_lat, lat);
      if (lat != cache_il1_lat)
        {
          /* I-cache miss, block fetch until it is resolved */
          ruu_fetch_issue_delay += lat - 1;
          break;
        }
      ...
    }
ruu_dispatch()
  - Get the next instruction from the fetch buffer.
  - Functionally execute the instruction.
  - Split loads/stores into:
      1. Address calculation
      2. Memory operation
  - Rename input dependences.
  - Rename the target register.
  - Place into the scheduler RUU[] and, if necessary, the load/store scheduler LSQ[].
  - Determine whether this is a misprediction.
  - Issue if ready.
Functional and timing execution
  - Ignore mispredictions for the time being.
  - SimpleScalar executes all instructions in order during dispatch. They update registers and memory at that time.
  - Then it tries to determine when they would actually execute, taking into consideration dependences and latencies.
  - This is simulation, so we can do this.
  - Pros: fast, easy to debug.
  - Cons: the timing model can be wrong and the simulation will still produce correct program results, so timing bugs can go unnoticed.
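The split between functional and timing execution can be illustrated with a toy model. This is a sketch under strong assumptions, not how sim-outorder computes timing (it uses the ready and event queues): there is a single hypothetical operation `exec_add`, unlimited functional units, and a per-register `reg_ready` table invented for the example.

```c
#include <assert.h>

#define NREGS 8

static long reg_value[NREGS];   /* architectural values, updated in program order */
static long reg_ready[NREGS];   /* cycle at which each value would be available */

/*
 * Functionally execute dst = src1 + src2 right away, then estimate when the
 * operation would really complete: it starts once it has been dispatched and
 * both sources are available, and finishes `latency` cycles later.
 * Returns the estimated completion cycle.
 */
static long exec_add(int dst, int src1, int src2,
                     long dispatch_cycle, long latency)
{
    long start = dispatch_cycle;

    assert(dst >= 0 && dst < NREGS);
    if (reg_ready[src1] > start) start = reg_ready[src1];
    if (reg_ready[src2] > start) start = reg_ready[src2];

    /* functional part: the result is computed immediately, in order */
    reg_value[dst] = reg_value[src1] + reg_value[src2];

    /* timing part: only the availability time is modeled */
    reg_ready[dst] = start + latency;
    return reg_ready[dst];
}
```

The values in reg_value never depend on reg_ready, which is the point made above: an error in the timing part changes the cycle counts but not the program's results.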
Handling Mispredictions
  - Two modes: correct and mis-speculated.
  - ruu_dispatch switches to the second when it decodes a mispredicted branch.
      It knows about it because it executes the branch and figures out whether the prediction is correct.
  - The global spec_mode is 1 when in mis-speculated mode.
Handling Mispredictions
  - Keep two states: correct and mis-speculated.
  - For registers there are regs_R[] and spec_regs_R[] (and _F).
  - For memory there are mem_access and spec_mem_access.
      Speculative memory updates are kept in a temporary hash table.
      Loads access this table first and then memory if needed.
      Stores only write to it when in spec mode.
  - If in the correct state, access the correct state. If in spec_mode, access the mis-speculated state.
  - Effect: no need to restore state. Incorrect, speculative updates do not clobber the correct state.
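The speculative memory overlay described above can be sketched as follows. This is a minimal illustration, not the simulator's spec_mem_access: the word-addressed `mem` array, the fixed-size open-addressing table `spec_tbl`, and the functions `mem_store`, `mem_load` and `spec_recover` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS  64
#define SPEC_SLOTS 16           /* assumed size of the speculative table */

static uint32_t mem[MEM_WORDS]; /* correct (architectural) memory */
static struct { int valid; uint32_t addr, data; } spec_tbl[SPEC_SLOTS];
static int spec_mode;           /* 1 while on a mis-speculated path */

/* Linear probing. Returns the slot holding addr, or (if for_insert) the
 * first free slot on its probe path, or -1 when neither exists. */
static int spec_find(uint32_t addr, int for_insert)
{
    for (int i = 0; i < SPEC_SLOTS; i++) {
        int s = (int)((addr + (uint32_t)i) % SPEC_SLOTS);
        if (spec_tbl[s].valid && spec_tbl[s].addr == addr)
            return s;
        if (!spec_tbl[s].valid)
            return for_insert ? s : -1;
    }
    return -1;
}

static void mem_store(uint32_t addr, uint32_t data)
{
    assert(addr < MEM_WORDS);
    if (spec_mode) {                    /* never touch the correct state */
        int s = spec_find(addr, 1);
        assert(s >= 0);                 /* sketch: table assumed not to fill */
        spec_tbl[s].valid = 1;
        spec_tbl[s].addr = addr;
        spec_tbl[s].data = data;
    } else {
        mem[addr] = data;
    }
}

static uint32_t mem_load(uint32_t addr)
{
    assert(addr < MEM_WORDS);
    if (spec_mode) {                    /* speculative table first ... */
        int s = spec_find(addr, 0);
        if (s >= 0)
            return spec_tbl[s].data;
    }
    return mem[addr];                   /* ... then memory if needed */
}

/* Recovery only discards the overlay; nothing has to be restored. */
static void spec_recover(void)
{
    memset(spec_tbl, 0, sizeof spec_tbl);
    spec_mode = 0;
}
```

Because entries are only ever dropped all at once, the probing scheme needs no per-entry deletion.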
  dir_update_ptr = &(fetch_data[fetch_head].dir_update);
  stack_recover_idx = fetch_data[fetch_head].stack_recover_idx;
  pseq = fetch_data[fetch_head].ptrace_seq;
  - Ignore all pseq references. They are for a debugging/tracing facility.
  - An instruction can execute when all its source registers are available. It is then placed on the readyq, which ruu_issue() processes.
  - On writeback: walk the target list, set the ready bits of the consumers, and place each consumer on the readyq if it becomes ready.
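The writeback wakeup can be sketched as a walk over the producer's consumer list. This is a simplified illustration: `struct station`, `struct link`, the `n_ready` counter that stands in for the readyq, and the `writeback` function are assumptions for the example, and the tag check the simulator uses to skip squashed entries is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDEPS 3

struct station;

struct link {                   /* one consumer waiting on a result */
    struct station *consumer;
    int opnum;                  /* which input operand of the consumer */
    struct link *next;
};

struct station {
    int idep_ready[MAX_IDEPS];  /* input operand ready? */
    int queued;                 /* already placed on the ready queue? */
    struct link *odep_list;     /* consumers of this station's result */
};

static int n_ready;             /* stands in for the ready queue */

static int operands_ready(const struct station *s)
{
    for (int i = 0; i < MAX_IDEPS; i++)
        if (!s->idep_ready[i])
            return 0;
    return 1;
}

/* Producer completes: mark its consumers' operands ready and enqueue any
 * consumer whose last missing operand this was. */
static void writeback(struct station *producer)
{
    assert(producer != NULL);
    for (struct link *l = producer->odep_list; l != NULL; l = l->next) {
        l->consumer->idep_ready[l->opnum] = 1;
        if (!l->consumer->queued && operands_ready(l->consumer)) {
            l->consumer->queued = 1;
            n_ready++;
        }
    }
    producer->odep_list = NULL; /* result delivered to every consumer */
}
```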
RUU entry fields (comments only):
  /* instruction bits */
  /* decoded instruction opcode */
  /* inst PC, next PC, predicted PC */
  /* non-zero if op is in LSQ */
  /* non-zero if op is an addr comp */
  /* start of mis-speculation? */
  /* non-speculative TOS for RSB pred */
  /* bpred direction update info */
  /* non-zero if issued in spec_mode */
  /* effective address for ld/st's */
  /* RUU slot tag, increment to squash operation */
  /* used to sort the ready list and tag inst */
  /* operands ready and queued */
  /* operation is/was executing */
  /* operation has completed execution */
  /* output logical names (NA=unused) */
  /* chains to consuming operations */
  /* input operand ready? */
Determining Dependences
  ruu_link_idep(rs, /* idep_ready[] index */0, reg_name);
  ruu_install_odep(rs, /* odep_list[] index */0, reg_name);
  Rename table: CREATE_VECTOR(reg_name)
  - Returns a pointer to the RUU entry of the producer, or NULL if the result is available.
  - Actual data type is CV_link (RUU_station *, next).
Renaming Non-Load/Store Instructions
  ruu_link_idep(rs, /* idep_ready[] index */0, in1);
  ruu_link_idep(rs, /* idep_ready[] index */1, in2);
  ruu_link_idep(rs, /* idep_ready[] index */2, in3);
  ruu_install_odep(rs, /* odep_list[] index */0, out1);
  ruu_install_odep(rs, /* odep_list[] index */1, out2);
Renaming loads/stores
  /* address computation (RUU entry) */
  ruu_link_idep(rs, /* idep_ready[] index */0, NA);
  ruu_link_idep(rs, /* idep_ready[] index */1, in2);
  ruu_link_idep(rs, /* idep_ready[] index */2, in3);
  ruu_install_odep(rs, /* odep_list[] index */0, DTMP);
  ruu_install_odep(rs, /* odep_list[] index */1, NA);
  /* memory operation (LSQ entry) */
  ruu_link_idep(lsq, /* idep_ready[] index */STORE_OP_INDEX/* 0 */, in1);
  ruu_link_idep(lsq, /* idep_ready[] index */STORE_ADDR_INDEX/* 1 */, DTMP);
  ruu_link_idep(lsq, /* idep_ready[] index */2, NA);
  ruu_install_odep(lsq, /* odep_list[] index */0, out1);
  ruu_install_odep(lsq, /* odep_list[] index */1, out2);
CREATE_VECTOR(N): Register Rename Table Read
  (BITMAP_SET_P(use_spec_cv, CV_BMAP_SZ, (N))
   ? spec_create_vector[N]
   : create_vector[N])
  - Bit N of use_spec_cv is set when we rename the target register N while in spec_mode.
  - use_spec_cv is a bit vector: one bit per register.
SET_CREATE_VECTOR(odep_name, cv)
  - Set the current producer of register odep_name to the RUU entry stored in cv.
SET_CREATE_VECTOR(N, L)
  if (spec_mode)
    {
      BITMAP_SET(use_spec_cv, CV_BMAP_SZ, (N));
      spec_create_vector[N] = (L);
    }
  else
    create_vector[N] = (L);
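The two rename tables and the selecting bit vector can be sketched as follows. This is an illustration only: the table is reduced to 32 registers so that a single `uint32_t` can serve as use_spec_cv, and `get_producer`, `set_producer` and `rename_recover` are hypothetical function names standing in for the macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NREGS 32                /* assumed; one bit of use_spec_cv each */

struct station { int id; };     /* reduced stand-in for an RUU entry */

static struct station *create_vector[NREGS];       /* correct state */
static struct station *spec_create_vector[NREGS];  /* mis-speculated state */
static uint32_t use_spec_cv;    /* bit n set: register n was renamed in spec mode */
static int spec_mode;

/* Read the rename table: producer of register n, or NULL if none pending. */
static struct station *get_producer(int n)
{
    assert(n >= 0 && n < NREGS);
    return ((use_spec_cv >> n) & 1u) ? spec_create_vector[n]
                                     : create_vector[n];
}

/* Write the rename table: speculative writes go to the spec copy only. */
static void set_producer(int n, struct station *s)
{
    assert(n >= 0 && n < NREGS);
    if (spec_mode) {
        use_spec_cv |= (uint32_t)1 << n;
        spec_create_vector[n] = s;
    } else {
        create_vector[n] = s;
    }
}

/* Recovery: clearing the bit vector makes every read fall back to the
 * correct table, so the speculative entries need no individual cleanup. */
static void rename_recover(void)
{
    use_spec_cv = 0;
    spec_mode = 0;
}
```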
  /* issue stores only, loads are issued by lsq_refresh() */
  if (((MD_OP_FLAGS(op) & (F_MEM|F_STORE)) == (F_MEM|F_STORE))
      && OPERANDS_READY(lsq))
    {
      /* put operation on ready list, ruu_issue() issues it later */
      readyq_enqueue(lsq);
    }
Misprediction Detection
  if (MD_OP_FLAGS(op) & F_CTRL)
    sim_num_branches++;
  if (pred && bpred_spec_update == spec_ID)
    {
      /* update predictor if configured for spec. updates */
    }
  if (pred_PC != regs.regs_NPC && !fetch_redirected)
    {
      spec_mode = TRUE;
      rs->recover_inst = TRUE;
      recover_PC = regs.regs_NPC;
    }
ruu_issue(): Dynamic scheduling of non-loads/stores
  - Walk the readyq.
  - Try to get resources (FUs).
  - Get the latency of execution.
  - Put an entry into the eventq for the completion time.
  - If the instruction cannot execute, place it back into the readyq.
Who places instructions in readyq?
  Being in readyq means the instruction is ready to issue.
  - From dispatch:
      Non-loads/stores, if all sources are available. This includes the address component of loads/stores.
      Stores, if the data is available. Recall that the address computation is a separate instruction.
  - From writeback:
      A producer writes the last result a consumer waits for.
  - From lsq_refresh (called every cycle):
      A load is ready when its address is known, all preceding store addresses are known, and there is no conflict with unavailable store data.
ruu_issue(): main loop
  - Get the next entry from readyq.
  - If still valid (RSLINK_VALID(rs)), try to execute.
  - If it is a store, complete it instantaneously: there is nothing to produce.
  - fu = res_get(fu_pool, MD_OP_FUCLASS(rs->op))
      Get a functional unit for the instruction based on its operation.
ruu_issue(): Loads
  - Get a memory port resource.
  - Scan the LSQ for a matching preceding store.
      For the load to be executing, it must be that if there is a matching store then it has its data.
      This is called store-load forwarding.
  - If there is no match, access cache_dl1 and dtlb. The latency is the max of the two.
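The search for a matching preceding store can be sketched as a backward scan of a circular queue. This is an illustration under assumptions: `struct lsq_entry` is a reduced, hypothetical record, addresses are compared exactly (no partial overlap handling), and `find_forwarding_store` is a name invented for the example.

```c
#include <assert.h>

struct lsq_entry {              /* hypothetical, reduced LSQ record */
    int is_store;
    unsigned addr;              /* effective address */
    unsigned data;              /* store data (unused for loads) */
};

/*
 * Starting just before the load and walking towards the head (oldest entry),
 * return the index of the youngest older store to the same address,
 * or -1 if there is none and the load has to go to the cache.
 */
static int find_forwarding_store(const struct lsq_entry *lsq, int lsq_size,
                                 int head, int load_idx)
{
    int i = load_idx;

    assert(lsq_size > 0);
    while (i != head) {
        i = (i + lsq_size - 1) % lsq_size;      /* step to the older entry */
        if (lsq[i].is_store && lsq[i].addr == lsq[load_idx].addr)
            return i;
    }
    return -1;
}
```

Scanning from youngest to oldest matters: when several older stores wrote the same address, the load must see the most recent one.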
ruu_issue(): High-Level Structure
  - Temporary list: node = readyq; readyq = NULL
  - So long as there are issue slots available, get the next element from node.
      If still valid: try to get a resource, determine the latency, and schedule an eventq event.
      If no resource is available, place it back in readyq.
  - Place the remaining nodes back into readyq (readyq_enqueue() keeps it sorted by latency and age).
  - The order in readyq is the implicit issue priority.
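The detach, issue, and re-enqueue structure can be sketched as follows. This is a simplified illustration: the queue here is sorted by age only (not by latency as well), validity checks and event scheduling are omitted, and `struct node`, `fu_free` and `issue_cycle` are hypothetical names for the example.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int seq;                    /* age: lower means older */
    int fu_class;               /* which kind of functional unit it needs */
    struct node *next;
};

static struct node *readyq;
static int fu_free[2];          /* assumed: free units per class this cycle */

/* Sorted insert, oldest first, so queue order is the issue priority. */
static void readyq_enqueue(struct node *n)
{
    struct node **pp = &readyq;

    while (*pp != NULL && (*pp)->seq < n->seq)
        pp = &(*pp)->next;
    n->next = *pp;
    *pp = n;
}

/* One issue cycle: returns how many operations obtained a functional unit. */
static int issue_cycle(int issue_width)
{
    struct node *node = readyq, *next;
    int n_issued = 0;

    assert(issue_width > 0);
    readyq = NULL;              /* work on a detached, temporary list */
    for (; node != NULL && n_issued < issue_width; node = next) {
        next = node->next;
        if (fu_free[node->fu_class] > 0) {
            fu_free[node->fu_class]--;
            n_issued++;         /* the simulator would schedule an event here */
        } else {
            readyq_enqueue(node);       /* no resource: retry next cycle */
        }
    }
    for (; node != NULL; node = next) { /* out of issue slots */
        next = node->next;
        readyq_enqueue(node);
    }
    return n_issued;
}
```

Detaching the list first means entries that are put back during the walk cannot be visited a second time in the same cycle.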
lsq_refresh(): Placing loads into readyq
  - The LSQ uses the same elements as the RUU.
  - Scheduling is done based on the addr field and the availability of operands.
  - Scan forward (from LSQ_head, counting to LSQ_num):
      If store:
        Stop if its address is unknown. Loads after it should wait.
        If its data is unavailable, record the address in std_unknowns. Loads that need this data should wait.
      If load and all register operands are ready:
        Scan std_unknowns for a match.
        Place in readyq if there is no match.
lsq_refresh(): stores
  if (!STORE_ADDR_READY(&LSQ[index]))
    break;
  else if (!OPERANDS_READY(&LSQ[index]))
    std_unknowns[n_std_unknowns++] = LSQ[index].addr;
  else /* STORE_ADDR_READY() && OPERANDS_READY() */
    {
      /* a later STD known hides an earlier STD unknown */
      for (j=0; j<n_std_unknowns; j++)
        {
          if (std_unknowns[j] == /* STA/STD known */LSQ[index].addr)
            std_unknowns[j] = /* bogus addr */0;
        }
    }
lsq_refresh(): Loads
  if (/* load? */
      ((MD_OP_FLAGS(LSQ[index].op) & (F_MEM|F_LOAD)) == (F_MEM|F_LOAD))
      && /* queued? */!LSQ[index].queued
      && /* waiting? */!LSQ[index].issued
      && /* completed? */!LSQ[index].completed
      && /* regs ready? */OPERANDS_READY(&LSQ[index]))
    {
      for (j=0; j<n_std_unknowns; j++)
        {
          if (std_unknowns[j] == LSQ[index].addr)
            break;
        }
      if (j == n_std_unknowns)
        {
          /* no STA or STD unknown conflicts, put load on ready queue */
          readyq_enqueue(&LSQ[index]);
        }
    }
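Putting the store and load cases together, the whole scan can be sketched as one function. This is a simplified illustration, not the simulator's lsq_refresh(): the queue is a plain array starting at index 0 instead of a circular buffer, `struct lsq_entry` is a reduced, hypothetical record, and setting `queued` stands in for calling readyq_enqueue().

```c
#include <assert.h>

#define MAX_STD_UNKNOWNS 64

struct lsq_entry {              /* hypothetical, reduced LSQ record */
    int is_store;
    int addr_ready;             /* store: address known; load: operands ready */
    int data_ready;             /* store data available (unused for loads) */
    int queued;                 /* load already placed on the ready queue */
    unsigned addr;
};

/* Scan oldest to youngest; returns how many loads became ready. */
static int lsq_refresh(struct lsq_entry *lsq, int n)
{
    unsigned std_unknowns[MAX_STD_UNKNOWNS];
    int n_std_unknowns = 0, n_ready = 0, j;

    for (int i = 0; i < n; i++) {
        struct lsq_entry *e = &lsq[i];

        if (e->is_store) {
            if (!e->addr_ready)
                break;          /* unknown address: later loads must wait */
            if (!e->data_ready) {
                assert(n_std_unknowns < MAX_STD_UNKNOWNS);
                std_unknowns[n_std_unknowns++] = e->addr;
            } else {
                /* a later known store hides an earlier unknown one */
                for (j = 0; j < n_std_unknowns; j++)
                    if (std_unknowns[j] == e->addr)
                        std_unknowns[j] = 0;    /* bogus address */
            }
        } else if (!e->queued && e->addr_ready) {
            for (j = 0; j < n_std_unknowns; j++)
                if (std_unknowns[j] == e->addr)
                    break;
            if (j == n_std_unknowns) {          /* no conflict found */
                e->queued = 1;
                n_ready++;
            }
        }
    }
    return n_ready;
}
```

As in the code above, address 0 is used as the "bogus" marker, so the sketch assumes no load targets address 0.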
  - Clean up the speculative memory store state.
  - Reset the fetch stage by emptying fetch_data: fetch_tail = fetch_head = fetch_num = 0
ruu_commit()
  - Scan starting from the oldest instruction in the RUU (RUU_head).
  - If completed, try to commit.
  - If store: get a memory port and write to memory (access the data cache).
      Fail if the resource cannot be obtained.
      Does not simulate a write buffer.
  - If load/store, release the LSQ entry.
  - If branch, update the predictor if so configured.
  - Release the RUU entry.
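The in-order commit scan can be sketched as follows. This is a minimal illustration under assumptions: `struct ruu_entry` is a reduced, hypothetical record, the LSQ release and predictor update are omitted, and the `commit` function with its `commit_width` and `mem_ports` parameters is invented for the example.

```c
#include <assert.h>

#define RUU_SIZE 8              /* assumed size for the example */

struct ruu_entry {              /* hypothetical, reduced RUU record */
    int completed;              /* operation has completed execution */
    int is_store;
};

static struct ruu_entry RUU[RUU_SIZE];
static int RUU_head, RUU_num;

/*
 * Retire completed instructions from the head, oldest first. The scan stops
 * at the first instruction that has not completed, or at a store that cannot
 * get a memory port this cycle. Returns the number of retired instructions.
 */
static int commit(int commit_width, int mem_ports)
{
    int committed = 0;

    assert(RUU_num >= 0 && RUU_num <= RUU_SIZE);
    while (RUU_num > 0 && committed < commit_width) {
        struct ruu_entry *rs = &RUU[RUU_head];

        if (!rs->completed)
            break;              /* commit is strictly in order */
        if (rs->is_store) {
            if (mem_ports == 0)
                break;          /* no port: try again next cycle */
            mem_ports--;        /* the store writes memory at commit */
        }
        RUU_head = (RUU_head + 1) % RUU_SIZE;   /* release the entry */
        RUU_num--;
        committed++;
    }
    return committed;
}
```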
Figure: fetch_data[] queue between ruu_fetch() and ruu_dispatch(). ruu_fetch() fills the entry at fetch_tail after the I-cache and I-TLB access; ruu_dispatch() reads from fetch_head; fetch_num counts the valid entries, up to ruu_ifq_size. Each entry holds IR, regs_PC, pred_PC and the bpred pointers. tracer_recover also operates on this queue.
Figure: RUU[] with RUU_size entries, oldest entry at RUU_head, accessed by ruu_writeback(), ruu_commit() and ruu_recover.
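Figure: output registers of an RUU entry. onames[0] and onames[1] name the output registers; odep_list[0] and odep_list[1] are the consumer lists. Each list element is a struct RS_link holding next, a pointer to the consumer (&RUU[consumer]), tag (a unique ID) and x.opnum.
Figure: LSQ[], oldest entry at LSQ_head, accessed by ruu_dispatch(), ruu_commit() and ruu_recover.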
Figure: rename tables. For each register (reg 1, reg 2, ..., reg N), create_vector and spec_create_vector each hold a pointer (*rs or *lsq) and an opnum (0 or 1); use_spec_cv says which vector to use. Labels on the figure: ruu_install_odep in ruu_dispatch(), ruu_writeback() and ruu_recover.
Ready Queue (readyq)
  - ruu_dispatch(), ruu_writeback(): insert non-loads if ready
  - lsq_refresh(): insert loads
  - ruu_issue(): remove and try to execute
Event Queue (eventq)
  - ruu_issue(): insert at sim_cycle + latency
  - ruu_writeback(): consumes the events
  - Each entry is an RS_link with a tag and x.when.
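An event queue of this kind can be sketched as a list kept sorted by completion time. This is an illustration only: `struct event`, `eventq_insert` and `eventq_next` are hypothetical names for the example and do not reproduce the simulator's RS_link-based implementation.

```c
#include <assert.h>
#include <stddef.h>

struct event {
    long when;                  /* cycle at which the operation completes */
    int id;                     /* stands in for the RUU entry and its tag */
    struct event *next;
};

static struct event *eventq;    /* sorted by `when`, earliest first */

/* Issue side: schedule completion at sim_cycle + latency. */
static void eventq_insert(struct event *ev, long sim_cycle, long latency)
{
    struct event **pp = &eventq;

    assert(latency >= 0);
    ev->when = sim_cycle + latency;
    while (*pp != NULL && (*pp)->when <= ev->when)  /* keep insertion order on ties */
        pp = &(*pp)->next;
    ev->next = *pp;
    *pp = ev;
}

/* Writeback side: pop the next event that is due, or NULL if none is. */
static struct event *eventq_next(long sim_cycle)
{
    struct event *ev = eventq;

    if (ev == NULL || ev->when > sim_cycle)
        return NULL;
    eventq = ev->next;
    ev->next = NULL;
    return ev;
}
```

Keeping the list sorted makes the per-cycle check cheap: writeback only ever has to look at the head.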
Summary of Concepts/Interfaces
  - ruu_fetch to ruu_dispatch via the fetch_data buffer
  - ruu_dispatch executes instructions in order
      Breaks loads/stores into an address op and a memory op
      Links to the producers of the input regs
      Renames the output reg to the RUU or LSQ entry
      Determines if entering misprediction mode; marks the inst via rs->recover_inst
      Two states: mis-speculated and correct (reg files, memory, rename tables, etc.)
      May place insts in readyq if ready
  - lsq_refresh: when loads can issue
      Wait until all preceding stores calculate their address
      Stall if there is a conflict with a store that has no data
  - ruu_commit:
      Perform stores
      Release RUU and LSQ entries
Caveats
  - SimpleScalar uses optimizations that favor simulation speed.
  - It does not simulate an event-driven memory system.
  - Be careful to make sure that you use it appropriately.