
Dynamic Branch Prediction 193

• Dynamic branch prediction makes predictions based on data accumulated
at run time, as instructions execute.
• This requires hardware that saves the past behaviour of branches and
uses that data to predict future branches.
• The prediction consists of two interacting parts:
– Branch outcome/direction prediction: Taken or Not Taken
– Branch target prediction: Branch target address
– (When a branch is predicted as taken, the branch target address is
required. While the branch instruction is being fetched, its target
address is not yet available, so it must also be predicted.)
Kuruvilla Varghese


Dynamic Branch Prediction 194

• Simple method: Look up the address of the branch instruction to see
if the conditional branch was taken the last time this instruction was
executed. If so, begin fetching new instructions from the same place
as the last time.
• This is implemented using a branch prediction buffer / branch history
table: a small memory indexed by the lower portion of the address of
the branch instruction. The memory contains a bit that says whether
the branch was recently taken or not.

Branch Instruction (for … loop) 195

C loop:

    for (i = 0; i < 10; i++)
    {
    }

RISC-V assembly:

    add x1, x0, x0       // i = 0
    addi x2, x0, 10      // loop bound
loop:
    addi x1, x1, 1       // i++
    bne x1, x2, loop     // branch back while i != 10


Dynamic Branch Prediction 196

• In steady state, a 1-bit predictor mis-predicts on the first and
last iterations of the loop.
• Misprediction in the last iteration is inevitable since the prediction bit
will indicate taken, as the branch has been taken previously.
• The misprediction on the first iteration happens because the bit was
flipped by the previous execution of the last iteration of the loop, since the
branch was not taken on that exiting iteration.
• Result: two incorrect predictions each time the loop is run.
• To remedy this weakness, 2-bit prediction schemes are often used. In
a 2-bit scheme, a prediction must be wrong twice before it is changed.
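
The two mispredictions per loop can be checked with a small simulation. This is a minimal sketch, assuming a single 1-bit entry that is initialised to not taken and a loop whose closing branch is taken on every pass except the last; the function names are illustrative only.

```python
def loop_branch_outcomes(iterations, runs):
    """Outcomes of the loop-closing branch: taken (True) except on the exit pass."""
    one_run = [True] * (iterations - 1) + [False]
    return one_run * runs


def mispredictions_one_bit(outcomes, initial=False):
    """Count mispredictions of a 1-bit entry that remembers only the last outcome."""
    last = initial
    misses = 0
    for taken in outcomes:
        if last != taken:
            misses += 1
        last = taken  # the bit simply records the most recent outcome
    return misses


outcomes = loop_branch_outcomes(iterations=10, runs=3)
print(mispredictions_one_bit(outcomes))  # 6: two misses for each of the three runs
```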

2-bit prediction Scheme 197

[Figure: state diagram of the 2-bit predictor. Four states: 0 and 1 (Predict Taken), 2 and 3 (Predict Not Taken); transitions are labelled Taken / Not Taken.]
• Output is indicated in the state name.
• We can call the states:
⁃ 0 Strong Taken
⁃ 1 Weak Taken
⁃ 2 Weak Not Taken
⁃ 3 Strong Not Taken
• ~85-90% accuracy for many programs with 2-bit counter based prediction.
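
A 2-bit predictor can be sketched as a saturating counter over the four states named above. The transition rule used here (move one state towards Strong Taken on a taken branch, one state towards Strong Not Taken otherwise) is an assumption, since the arrows of the diagram are not reproduced in the text; the function names are illustrative.

```python
STRONG_TAKEN, WEAK_TAKEN, WEAK_NOT_TAKEN, STRONG_NOT_TAKEN = 0, 1, 2, 3


def predict(state):
    """States 0 and 1 predict taken, states 2 and 3 predict not taken."""
    return state <= WEAK_TAKEN


def next_state(state, taken):
    """Assumed saturating-counter transitions between neighbouring states."""
    if taken:
        return max(state - 1, STRONG_TAKEN)
    return min(state + 1, STRONG_NOT_TAKEN)


def count_mispredictions(outcomes, state=WEAK_NOT_TAKEN):
    misses = 0
    for taken in outcomes:
        if predict(state) != taken:
            misses += 1
        state = next_state(state, taken)
    return misses


# Loop branch: taken nine times, then not taken, for three runs of the loop.
outcomes = ([True] * 9 + [False]) * 3
print(count_mispredictions(outcomes))  # 4: two misses in the first run, then one per run
```

After warm-up only the loop exit is mispredicted, because a single not-taken outcome moves the state from Strong Taken to Weak Taken without changing the prediction.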


2-bit prediction: Another scheme 198

[Figure: state diagram of an alternative 2-bit predictor. Four states: 0 and 1 (Predict Taken), 2 and 3 (Predict Not Taken); transitions are labelled Taken / Not Taken.]
• Output is indicated in the state name.
• We can call the states:
⁃ 0 Strong Taken
⁃ 1 Weak Taken
⁃ 2 Weak Not Taken
⁃ 3 Strong Not Taken
• ~85-90% accuracy for many programs with 2-bit counter based prediction.

Branch Target Buffer (BTB) 199

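[Figure: Branch Target Buffer. The PC is split into PC(63:8), compared (=) with the stored Tag, and PC(7:0), decoded to select an entry. Each entry holds Valid, Tag, Target Address and Dir bits (Branch History Table, BHT). The next fetch address is the stored Target Address or PC + 4. An entry is written with Valid = 1, the branch PC(63:8), the computed address, and Dir bits from the FSM next state logic (NSL), which takes the branch outcome as input.]
• The Branch Target Buffer structure shown here is of Direct Mapped type. It can be
of Set Associative or Fully Associative type also.
• This scheme can be extended to accommodate Global/Local branch history.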


Branch Target Buffer . 200

• The Branch Target Buffer, or branch target address cache, stores the
predicted branch target address.
• The BTB is accessed in the IF stage, and the predicted branch target address is
used as the next fetch address if the fetched instruction is a branch or
jump (a jump is always taken).
• Once the branch direction is determined and the branch target address
is computed in a later stage, these can be compared with the predicted
values, and the decision to continue or flush the pipeline can be taken.
• Once the real values are computed, the BTB values can be updated. For
the direction bits to be updated, FSM next state logic has to be
implemented.

Branch Target Buffer .. 201

• The BTB valid bits must be cleared at the start and during a context switch.
Similarly, the 2 outcome bits must be initialized to a default (e.g., weakly not
taken) at the start and during a context switch. The state assignment for the weakly
not taken state can be “00” for ease of implementation.
• When a branch address is looked up in the BTB, if the valid bit is ‘0’ or the address tag
does not match, the outcome can be predicted as the default (e.g., weakly not taken).
• When a branch address is looked up in the BTB, if the valid bit is ‘1’ and the address tag
matches, the outcome is fed to the NSL of the FSM. It can be registered at the NSL to
update the outcome when the branch decision is computed later, because by that time
another branch may be looking up the BTB.
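
The lookup and update rules above can be sketched as a small software model of a direct-mapped BTB. It is only an illustration under stated assumptions: the entry is indexed by PC(7:0) and tagged by the remaining PC bits as in the figure, the 2-bit encoding (with weakly not taken as 00) and the neighbouring-state transitions are assumed, and the update simply re-reads the entry instead of registering the looked-up state at the NSL.

```python
# Assumed 2-bit encoding; the slide only suggests "00" for weakly not taken.
WEAK_NT, STRONG_NT, WEAK_T, STRONG_T = 0b00, 0b01, 0b10, 0b11

NEXT_IF_TAKEN = {STRONG_NT: WEAK_NT, WEAK_NT: WEAK_T, WEAK_T: STRONG_T, STRONG_T: STRONG_T}
NEXT_IF_NOT_TAKEN = {STRONG_T: WEAK_T, WEAK_T: WEAK_NT, WEAK_NT: STRONG_NT, STRONG_NT: STRONG_NT}


class BranchTargetBuffer:
    def __init__(self, index_bits=8):
        self.index_bits = index_bits
        self.flush()

    def flush(self):
        """Clear valid bits and reset outcome bits (start-up or context switch)."""
        self.entries = [
            {"valid": False, "tag": 0, "target": 0, "state": WEAK_NT}
            for _ in range(1 << self.index_bits)
        ]

    def _split(self, pc):
        index = pc & ((1 << self.index_bits) - 1)
        tag = pc >> self.index_bits
        return index, tag

    def lookup(self, pc):
        """Return (predicted_taken, next_fetch_address) for the instruction at pc."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        hit = entry["valid"] and entry["tag"] == tag
        if hit and entry["state"] in (WEAK_T, STRONG_T):
            return True, entry["target"]
        return False, pc + 4  # miss or not-taken prediction: fall through

    def update(self, pc, taken, target):
        """Write back the real outcome and computed target once they are known."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if not (entry["valid"] and entry["tag"] == tag):
            entry.update(valid=True, tag=tag, state=WEAK_NT)  # allocate with default state
        table = NEXT_IF_TAKEN if taken else NEXT_IF_NOT_TAKEN
        entry["state"] = table[entry["state"]]
        entry["target"] = target


btb = BranchTargetBuffer()
print(btb.lookup(0x1000))                      # cold miss: not taken, pc + 4
btb.update(0x1000, taken=True, target=0x0F00)
print(btb.lookup(0x1000))                      # hit: taken, stored target
```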


Global/Local Branch Correlation 202

• The history of previous branches' outcomes may influence the outcome
of the current branch.
• E.g., take an if … then … else construct. If the if condition is
evaluated as true, all the following else conditions would be false.
Hence, this affects the conditional branches used in the if and else
conditions.
• Similarly, the previous outcome of the current branch itself may
influence the present outcome of the branch.
• E.g., as discussed earlier, think of the conditional branch at the
end of a for … loop. If the previous outcome is taken, the
current outcome will most probably also be taken.

Global Branch Correlation 203

• The outcomes of recently executed branches in the execution path are correlated
with the outcome of the next branch.
• If the first branch is not taken, the second is also not taken.
    if (cond1)
    if (cond1 AND cond2)
• If the first branch is taken, the second is not taken.
    if cond1 then (a = 2)
    if (a == 0)


Global Branch Correlation . 204

    if (cond1)
    if (cond2)
    if (cond1 AND cond2)

• If the first and second branches are both taken, then the third branch is also
taken.
• If the first or second branch is not taken, then the third branch is also not
taken.

Global Branch Correlation .. 205

    if (aa == 2) ; B1
        aa = 0;
    if (bb == 2) ; B2
        bb = 0;
    if (aa != bb) { ; B3
    }

• If B1 is taken (i.e., aa == 0 @B3) and B2 is taken (i.e., bb == 0 @B3),
then B3 is certainly not taken.


Global Branch Correlation 206

• Associate branch outcomes with the global outcome history of all
branches.
• Make a prediction based on the outcome of the branch the last time
the same global branch history was encountered.
• Keep track of the global outcome history of all branches in a register
called the Global History Register (GHR).
• Use the GHR to index into the Pattern History Table (a table of 2-bit
counters), which records the outcome that was seen for that GHR value
in the recent past.
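
The GHR and Pattern History Table can be sketched as follows. This is a minimal model under assumptions: the PHT is indexed by the GHR alone, each counter runs from 0 to 3 with values of 2 or more predicting taken (the opposite numbering to the state names used earlier), and the class and function names are illustrative. The example stream interleaves two hypothetical branches, B1 alternating between taken and not taken and B2 always following B1.

```python
class GlobalPredictor:
    def __init__(self, history_bits=2):
        self.history_bits = history_bits
        self.ghr = 0                           # Global History Register
        self.pht = [1] * (1 << history_bits)   # 2-bit counters, start at weak not taken

    def predict(self):
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        counter = self.pht[self.ghr]
        self.pht[self.ghr] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.ghr = ((self.ghr << 1) | int(taken)) & mask  # shift in the newest outcome


def count_mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Interleaved outcome stream B1, B2, B1, B2, ...: T, T, N, N repeated.
stream = [True, True, False, False] * 5
print(count_mispredictions(GlobalPredictor(), stream))  # 2, both during warm-up
```

With two bits of history every next outcome in this stream is determined by the GHR value, so the predictor is correct once each PHT entry has been trained.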

Global Branch Correlation 207

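[Figure: BTB (Valid, Tag, Target Address) with a separate Branch History Table (BHT) holding the Dir bits. The BHT is addressed through a decoder using the Global Branch History Register (shown holding 1 0 1 1) and the address. Other labels: PC(63:8) and PC(7:0), tag comparator (=), Computed Address, Outcome, FSM-NSL, PC + 4.]
Note: The sizes of the BTB and the BHT may not be the same.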


Local Branch Correlation 208

• Idea: Have a per-branch history register.
• Associate the predicted outcome of a branch with the
outcome history of the same branch.
• Make a prediction based on the outcome of the branch the
last time the same local branch history was encountered.
• Local history/branch predictor.
• Uses two levels of history (per-branch history register +
history at that history register value).

Local Branch Correlation . 209

• for (i = 1; i <= 4; i++) { }
• If the loop test is done at the end of the body, the
corresponding branch will execute the pattern (1110)^n, where
1 and 0 represent taken and not taken respectively, and n is
the number of times the loop is executed.
• If we know the direction this branch took in its last three
executions, we can predict the next branch direction.
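
The claim that three bits of local history are enough for the 1110 pattern can be checked with a small two-level model. This is a sketch under assumptions: one history register per branch PC, a shared table of 2-bit counters indexed by that history (counters run 0 to 3, with 2 or more predicting taken), and illustrative names throughout.

```python
class LocalPredictor:
    def __init__(self, history_bits=3):
        self.history_bits = history_bits
        self.histories = {}                    # per-branch history registers, keyed by PC
        self.pht = [1] * (1 << history_bits)   # 2-bit counters, start at weak not taken

    def predict(self, pc):
        return self.pht[self.histories.get(pc, 0)] >= 2

    def update(self, pc, taken):
        history = self.histories.get(pc, 0)
        counter = self.pht[history]
        self.pht[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.histories[pc] = ((history << 1) | int(taken)) & mask


def run(predictor, pc, outcomes):
    """Return one flag per branch execution: True if the prediction was correct."""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


pattern = [True, True, True, False] * 10   # the (1110)^n pattern
results = run(LocalPredictor(), pc=0x400, outcomes=pattern)
print(results[:8].count(False), results[8:].count(False))  # 5 0
```

All mispredictions fall in the first two passes through the pattern; afterwards even the loop exit is predicted correctly, which a single 2-bit counter cannot achieve.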


Local Branch Correlation 210

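[Figure: BTB with a per-entry Local Branch History Register (LBHR). Each entry holds Valid, Tag, Target Address, Dir Bits and LBHR. Other labels: PC(63:8) and PC(7:0), address decoder, tag comparator (=), Computed Address, FSM-NSL, Update History, PC + 4.]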

Tournament Branch Predictors 211

• Uses multiple predictors and tracks, for each branch, which predictor
yields the best results.
• A typical tournament predictor might contain two predictions for each
branch index: one based on local information and one based on global
branch behaviour.
• A selector chooses which predictor to use for any given
prediction.
• The selector can operate similarly to a 1-bit or 2-bit predictor,
favouring whichever of the two predictors has been more accurate.
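
A 2-bit selector of this kind might look like the sketch below. The update rule (move towards a predictor only when it was right and the other was wrong) is a common choice but is an assumption here, as are the class name and the counter convention.

```python
class Selector:
    """2-bit chooser: counter values 0-1 favour the local predictor, 2-3 the global one."""

    def __init__(self):
        self.counter = 1  # start by weakly favouring the local predictor

    def choose(self, local_prediction, global_prediction):
        return global_prediction if self.counter >= 2 else local_prediction

    def update(self, local_prediction, global_prediction, taken):
        local_ok = local_prediction == taken
        global_ok = global_prediction == taken
        if global_ok and not local_ok:
            self.counter = min(self.counter + 1, 3)
        elif local_ok and not global_ok:
            self.counter = max(self.counter - 1, 0)
        # If both were right or both were wrong, the selector is left unchanged.


selector = Selector()
print(selector.choose(True, False))            # True: local prediction is used
selector.update(True, False, taken=False)      # global was right, local was wrong
print(selector.choose(True, False))            # False: global prediction is used
```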


Exceptions - Interrupts 212

• Events other than branches that alter the flow of instructions.
• Initially designed to handle unexpected events like undefined
instructions, later extended so that I/O devices can communicate
with the processor.

Exceptions 213

• Exceptional events occur at run time within a
processor
– OS calls by user program (Internal)
– Arithmetic overflow, Divide by zero (Internal)
– Using undefined instructions (Internal)
– Misaligned instructions (Internal)
– Hardware malfunctions (e.g., Bus error) (Internal / External)
• Since exceptions occur as a result of executing an
instruction, they are synchronous by nature.

RISC-V Privilege Modes 214

• A RISC-V CPU has three privilege modes: Machine, Supervisor, and User.
Each privilege mode has its own user registers, control and status
registers (CSRs) for trap handling, and a stack area dedicated to it.
• While operating in User mode, a context switch is required to handle
an event in Supervisor mode.
• The software sets up the system for a context switch, and then an
ECALL instruction is executed, which synchronously switches control
to the environment-call-from-User-mode exception handler.
• Trap refers to the synchronous transfer of control to a trap handler
caused by an exceptional condition occurring within a RISC-V thread.
Trap handlers usually execute in a more privileged environment.

Interrupts 215

• Interrupts – service requests from hardware blocks to the
processor core.
– External devices / blocks
• Since interrupts may occur at any time and are typically
not part of the instruction execution sequence, they are
asynchronous by nature.


Exceptions - Interrupts 216

• E.g., Hardware malfunction during an add instruction.
• Machine Exception Program Counter (MEPC) – contains the
address of the offending instruction.
• Branch to a specific location, with a Cause Register
specifying the source of the exception (RISC-V).
• Or Vectored Interrupts – branching to a separate location
depending on the source (Base register + offset as per the
source).

What to do? 217

• The operating system can then take the appropriate action,
which may involve providing some service to the user
program, taking some predefined action in response to a
malfunction, or stopping the execution of the program and
reporting an error.
• After performing whatever action is required because of the
exception, the operating system can terminate the program or
may continue its execution, using the MEPC to determine
where to restart the execution of the program.


RISC-V Exceptions - Registers 218

• Machine Exception Program Counter (MEPC): A 64-bit
register used to hold the address of the affected instruction.
(Such a register is needed even when exceptions are
vectored.)
• Machine Exception Cause register (MCAUSE): A register
used to record the cause of the exception. In the RISC-V
architecture, this register is 64 bits, although most bits are currently
unused.

RISC-V Exceptions 219

• A pipelined implementation treats exceptions as another form of
control hazard.
• Suppose there is a hardware malfunction in an add instruction.
• The instructions that follow the add instruction must be flushed from
the pipeline, and instructions from the exception address should be
fetched.


RISC-V Exceptions . 220

• To flush the instruction in the IF stage, the instruction in the
IF/ID pipeline register can be turned into a nop.
• To flush the instruction in the ID stage, the control signal ID.Flush
from the hazard detection unit selects the multiplexer input in
the ID stage that zeros the control signals, as is done for stalls.
• To flush the instruction in the EX stage, a new signal
EX.Flush causes new multiplexers to zero the control lines.
• Instructions in the MEM and WB stages complete their
operation.

RISC-V Exceptions .. 221

• To start fetching instructions from the exception location, an
additional input is added to the PC multiplexer, which feeds the
exception location address.
• The final step is to save the address of the offending
instruction in the supervisor exception program counter
(SEPC).


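Exception Control: divide by zero Exception 222

[Figure: five-stage pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers) with exception control. An Exception Unit receives the exception conditions (including the ALU zero signal), writes MCAUSE and MEPC, and drives IF.Flush, ID.Flush and EX.Flush. The flush signals select zeroed control signals or a NOP, and the PC multiplexer has an additional Exception Address input. The instruction memory, register file, immediate generator, ALU and ALU control, data memory and Forwarding Unit are also shown.]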

RISC-V Exceptions … 223

• With five instructions active in any clock cycle, the hard part is to associate an
exception with the appropriate instruction. Also, multiple exceptions can occur
simultaneously in a single clock cycle.
• The solution is to prioritize the exceptions so that it is easy to determine which is
serviced first.
• In RISC-V implementations, the hardware sorts exceptions so that the earliest
instruction is interrupted.
• The MEPC register captures the address of the interrupted instruction, and the
MCAUSE register records the highest priority exception in a clock cycle if more
than one exception occurs.
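
The rule that the earliest instruction wins can be sketched as a scan over the pipeline stages from oldest to youngest. This is only an illustrative model: the stage names follow the five-stage pipeline, while the dictionary layout, the cause strings and the function name are assumptions.

```python
# Oldest instruction first: the instruction in WB entered the pipeline earliest.
STAGE_ORDER = ["WB", "MEM", "EX", "ID", "IF"]


def select_exception(pending):
    """pending maps a stage name to (pc, cause) for each exception raised this cycle.

    Returns the values to record for the earliest instruction, or None.
    """
    for stage in STAGE_ORDER:
        if stage in pending:
            pc, cause = pending[stage]
            return {"stage": stage, "mepc": pc, "mcause": cause}
    return None


# Two exceptions in the same cycle: the older instruction (in EX) is serviced first.
pending = {
    "ID": (0x108, "undefined instruction"),
    "EX": (0x104, "arithmetic exception"),
}
print(select_exception(pending))
```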


RISC-V Exceptions …. 224

• If the exception is due to instruction execution, PC – 8, or the PC value from the
ID/EX pipeline register, can be registered in SEPC, and the SCAUSE register can
be updated with the cause.
• If the instruction is undefined, this is detected at decode in the ID
stage. One needs to flush only the IF and ID stages, and PC – 4, or the PC value from the
IF/ID pipeline register, can be registered in SEPC, and the SCAUSE register can be
updated with the cause.
• In the same way, the other sources of exceptions/interrupts need to be worked out.
• I/O device requests and hardware malfunctions are not associated with a
specific instruction, so the implementation has some flexibility as to when to
interrupt the pipeline. The best way is to complete all the instructions in the
pipeline; for that, it is enough to load the PC with the
interrupt/exception address when the external interrupt is detected.

Hardware-Software Interface 225

• The hardware and the operating system must work in conjunction so that
exceptions behave as you would expect. The hardware contract is normally to
stop the offending instruction in midstream, let all prior instructions complete,
flush all following instructions, set a register to show the cause of the exception,
save the address of the offending instruction, and then branch to a prearranged
address.
• The operating system contract is to look at the cause of the exception and act
appropriately. For an undefined instruction or hardware failure, the operating
system normally kills the program and returns an indicator of the reason. For an
I/O device request or an operating system service call, the operating system saves
the state of the program, performs the desired task, and, at some point in the
future, restores the program to continue execution.


Precise / Imprecise Exceptions or Interrupts 226

• The difficulty of always associating the proper exception with the correct
instruction in pipelined computers has led some computer designers to relax this
requirement in noncritical cases.
• Such processors are said to have imprecise interrupts or imprecise exceptions.
E.g., suppose an exception is caused by an instruction in the EX stage. If the
hardware registers the current PC value in SEPC, rather than the address of that
instruction, it is an imprecise exception.
• RISC-V and the vast majority of computers today support precise interrupts or
precise exceptions.

RISC-V Interrupt Vector Address 227

• Many RISC-V computers store the exception entry address in a
special register named Machine Trap Vector (MTVEC), which the OS
can load with a value of its choosing.
• An exception address fixed in the hardware could not be changed by the OS,
hence this register.
• MTVEC is a CSR, and it is accessed using the CSR
instructions.


Interrupt Controllers 228

• Local Interrupt Controllers
– Core Local Interrupter (CLINT)
– Core Local Interrupt Controller (CLIC)
• Global Interrupt Controllers
– Platform-Level Interrupt Controller (PLIC)

Core Local Interrupter (CLINT) 229

• CLINT offers a compact design with a fixed priority scheme, with
preemption support for interrupts from higher privilege levels only.
• The primary purpose of the CLINT is to serve as a simple CPU
interrupter for software and timer interrupts, since it does not control
other local interrupts wired directly to the CPU.


Core Local Interrupt Controller (CLIC) 230

• CLIC is a fully featured local interrupt controller with configurations
that support programmable interrupt levels and priorities.
• The CLIC also supports nested interrupts (preemption) within a given
privilege level, based on the interrupt level and priority configuration.
• Both the CLINT and CLIC integrate registers mtime and mtimecmp to
configure timer interrupts, and msip to trigger software interrupts.
• Additionally, both the CLINT and the CLIC run at the core clock
frequency.

Global - Platform-Level Interrupt Controller (PLIC) 231

• PLIC provides system level flexibility for dispatching interrupts to a
single CPU or multiple CPUs in the system.
• Global interrupts that route through the PLIC arrive at the CPU
through a single interrupt connection with a dedicated interrupt ID.
• Each global interrupt has a programmable priority register available in
the PLIC memory map.
• There is also a system level programmable threshold register which
can be used to mask all interrupts below a certain level.
• The PLIC runs off a different clock than the local interrupt controllers,
typically an integer-divided ratio of the core clock.
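
The interplay of per-interrupt priority and the threshold register can be sketched as a selection function. The exact rules used here are assumptions based on the usual PLIC behaviour (an interrupt is delivered only if its priority is strictly greater than the threshold, the lowest ID wins a tie, and ID 0 means "no interrupt"); the function name and data layout are illustrative.

```python
def plic_select(pending, priority, threshold):
    """Return the ID of the pending interrupt to deliver, or 0 if none qualifies.

    pending:   set of pending interrupt IDs (1 and above)
    priority:  dict mapping interrupt ID to its programmed priority
    threshold: interrupts at or below this priority are masked (assumed rule)
    """
    best_id, best_priority = 0, threshold
    for irq in sorted(pending):                 # ascending IDs, so lowest ID wins ties
        if priority.get(irq, 0) > best_priority:
            best_id, best_priority = irq, priority[irq]
    return best_id


pending = {3, 5, 7}
priority = {3: 2, 5: 6, 7: 6}
print(plic_select(pending, priority, threshold=1))  # 5: highest priority, lowest ID on a tie
print(plic_select(pending, priority, threshold=6))  # 0: everything is masked
```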


Interrupt Control and Status Registers (CSRs) 232

• There are interrupt related CSRs contained in the CPU, as well as
memory mapped configuration registers in the respective interrupt
controllers.
• Both are used to configure and properly route interrupts to a CPU.
• Machine mode interrupt CSRs are discussed here.
• Many Machine mode interrupt CSRs have Supervisor or User mode
equivalents.

Interrupt Control and Status Registers (CSRs) 233

• mstatus: Status register containing interrupt enables for all privilege
modes, previous privilege mode, and other privilege level settings.
• mcause: Status register which indicates whether an exception or
interrupt occurred, along with a code to distinguish details of each
type.
• mie: Interrupt enable register for local interrupts when using CLINT
modes of operation. In CLIC modes, this is hardwired to 0 and
interrupt enables are handled using clicintie[i] memory mapped
registers.
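
As an illustration of how mcause separates interrupts from exceptions, the sketch below splits a 64-bit mcause value into its interrupt flag (the most significant bit) and its code (the remaining bits). The function name is illustrative; the example codes used (7 for a machine timer interrupt, 2 for an illegal instruction) are the standard RISC-V assignments.

```python
XLEN = 64  # register width assumed for this sketch


def decode_mcause(mcause):
    """Split mcause into (is_interrupt, code)."""
    is_interrupt = bool(mcause >> (XLEN - 1))    # most significant bit
    code = mcause & ((1 << (XLEN - 1)) - 1)      # remaining XLEN-1 bits
    return is_interrupt, code


print(decode_mcause((1 << 63) | 7))  # (True, 7): machine timer interrupt
print(decode_mcause(2))              # (False, 2): illegal instruction exception
```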


Interrupt Control and Status Registers (CSRs) 234

• mip: Interrupt pending register for local interrupts when using CLINT
modes of operation. In CLIC modes, this is hardwired to 0 and
pending interrupts are handled using clicintip[i] memory mapped
registers.
• mtvec: Machine Trap Vector register which holds the base address of
the interrupt vector table, as well as the interrupt mode configuration
(direct or vectored) for CLINT and CLIC controllers.
• All synchronous exceptions also use mtvec as the base address for
exception handling in all CLINT and CLIC modes.
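
For the CLINT direct and vectored modes, the handler address derived from mtvec can be sketched as below. The field layout (mode in the two least significant bits, base in the remaining bits) and the vectored rule (base + 4 × cause, for interrupts only) follow the RISC-V privileged architecture; treating every other mode value like direct mode is a simplification made for this sketch, and the function name is illustrative.

```python
def trap_target(mtvec, is_interrupt, code):
    """Return the address the pc is set to when a trap is taken."""
    base = mtvec & ~0b11   # BASE field, 4-byte aligned
    mode = mtvec & 0b11    # 0 = direct, 1 = vectored
    if mode == 1 and is_interrupt:
        return base + 4 * code   # one table slot per interrupt cause
    return base                  # direct mode, and all synchronous exceptions


print(hex(trap_target(0x80000101, True, 7)))    # 0x8000011c: vectored timer interrupt
print(hex(trap_target(0x80000101, False, 2)))   # 0x80000100: exceptions use the base
```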

Interrupt Control and Status Registers (CSRs) 235

• mtvt: Used only in CLIC modes of operation. Contains the base
address of the interrupt vector table for selectively vectored interrupts
in CLIC direct mode, and for all vectored interrupts in CLIC vectored
mode. This register does not exist on designs with a CLINT.


Common Registers to CLIC and CLINT 236

• msip: Machine mode software interrupt pending register,
used to assert a software interrupt for a CPU.
• mtime: Machine mode timer register which runs at a constant
frequency. Part of the CLINT and CLIC designs. There is a
single mtime register on designs that contain one or more
CPUs.
• mtimecmp: Memory mapped machine mode timer compare
register, used to trigger an interrupt when mtime is
greater than or equal to mtimecmp. There is an mtimecmp
dedicated to each CPU.
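
The timer comparison can be sketched in a few lines. The pending condition (mtime greater than or equal to mtimecmp) follows the RISC-V privileged architecture; the helper for scheduling the next tick, the interval value and the function names are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1  # mtime and mtimecmp are 64-bit registers


def timer_interrupt_pending(mtime, mtimecmp):
    """The timer interrupt is pending while mtime >= mtimecmp."""
    return mtime >= mtimecmp


def next_mtimecmp(mtime, interval):
    """Value a handler could write to mtimecmp to request the next tick (hypothetical helper)."""
    return (mtime + interval) & MASK64


mtimecmp = next_mtimecmp(mtime=1000, interval=500)
print(timer_interrupt_pending(1499, mtimecmp))  # False
print(timer_interrupt_pending(1500, mtimecmp))  # True
```

Writing a new, later value to mtimecmp is what clears the pending timer interrupt, which is why a handler typically ends by scheduling the next tick.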

Timer Interrupts 237

• Timer interrupts always trap to Machine mode, unless
delegated to Supervisor mode using the mideleg register.
• Similarly, Machine mode exceptions may be delegated to
Supervisor mode using the medeleg register.
• For designs that also implement User mode, there exist
sideleg and sedeleg registers to delegate Supervisor interrupts
and exceptions to User mode.


Entry Behaviour for Interrupt Handlers 238

• Whenever an interrupt occurs, hardware will automatically save and
restore important registers. The following steps are completed as an
interrupt handler is entered.
• Save pc to mepc
• Save Privilege level to mstatus.mpp
• Save mstatus.mie to mstatus.mpie
• Set pc to interrupt handler address, based on mode of operation
• Disable interrupts by setting mstatus.mie = 0
• At this point control is handed over to software, where the interrupt
processing begins.

Exit Behaviour for Interrupt Handlers 239

• At the end of the interrupt handler, the mret instruction will do the
following.
– Restore mepc to pc
– Restore mstatus.mpp to Privilege level
– Restore mstatus.mpie to mstatus.mie
• There may be additional instructions or functionality contained within
the handler, based on the interrupt controller and mode of operation.
• For example, saving/restoring additional user registers, enabling
preemption, and handling global interrupts routed through the PLIC
which require a claim/complete step.
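
The hardware steps on handler entry and on mret can be modelled together as below. This is a simplified sketch: the relevant mstatus fields are kept as separate attributes rather than packed bits, only traps into Machine mode are modelled, and the class and function names are illustrative. The two extra mret effects (setting mpie to 1 and resetting mpp to User mode) come from the RISC-V privileged architecture rather than from the list above.

```python
from dataclasses import dataclass

USER, SUPERVISOR, MACHINE = 0, 1, 3   # RISC-V privilege level encodings


@dataclass
class Hart:
    pc: int = 0
    privilege: int = USER
    mie: bool = True     # mstatus.mie
    mpie: bool = False   # mstatus.mpie
    mpp: int = USER      # mstatus.mpp
    mepc: int = 0
    mcause: int = 0


def enter_trap(hart, cause, handler_address):
    """Steps completed by hardware as the handler is entered."""
    hart.mepc = hart.pc            # save pc
    hart.mcause = cause
    hart.mpp = hart.privilege      # save privilege level
    hart.mpie = hart.mie           # save interrupt enable
    hart.mie = False               # disable interrupts
    hart.privilege = MACHINE
    hart.pc = handler_address


def mret(hart):
    """Steps performed by the mret instruction at the end of the handler."""
    hart.pc = hart.mepc
    hart.privilege = hart.mpp
    hart.mie = hart.mpie
    hart.mpie = True               # per the privileged architecture
    hart.mpp = USER                # per the privileged architecture


hart = Hart(pc=0x1000, privilege=USER, mie=True)
enter_trap(hart, cause=(1 << 63) | 7, handler_address=0x80000100)
print(hex(hart.pc), hart.privilege, hart.mie)   # in the handler, interrupts disabled
mret(hart)
print(hex(hart.pc), hart.privilege, hart.mie)   # back in the interrupted program
```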

Thank You
