You are on page 1of 108

CPU Organization

Datapath Design:
Capabilities & performance characteristics of principal
Functional Units (FUs):
(e.g., Registers, ALU, Shifters, Logic Units, ...)
Ways in which these components are interconnected (buses
connections, multiplexors, etc.).
How information flows between components.

Control Unit Design:


Logic and means by which such information flow is controlled.
Control and coordination of FUs operation to realize the targeted
Instruction Set Architecture to be implemented (can either be
implemented using a finite state machine or a microprogram).

Hardware description with a suitable language, possibly


using Register Transfer Notation (RTN).
EECC550 - Shaaban
#1 Final Review Winter 2000 2-19-2001

Hierarchy of Computer Architecture


High-Level Language Programs

Software

Assembly Language
Programs

Application
Operating
System

Machine Language
Program

Compiler

Software/Hardware
Boundary

Firmware

Instr. Set Proc. I/O system

Instruction Set
Architecture

Datapath & Control

Hardware

Digital Design
Circuit Design

Microprogram

Layout
Logic Diagrams

Register Transfer
Notation (RTN)

Circuit Diagrams

EECC550 - Shaaban
#2 Final Review Winter 2000 2-19-2001

Computer Performance Measures:


Program Execution Time
For a specific program compiled to run on a specific machine
A, the following parameters are provided:
The total instruction count of the program.
The average number of cycles per instruction (average CPI).
Clock cycle of machine A

How can one measure the performance of this machine running this
program?
Intuitively the machine is said to be faster or has better
performance running this program if the total execution time is
shorter.
Thus the inverse of the total measured program execution time is
a possible performance measure or metric:
PerformanceA = 1 / Execution TimeA
How to compare performance of different machines?
What factors affect performance? How to improve performance?

EECC550 - Shaaban
#3 Final Review Winter 2000 2-19-2001

CPU Execution Time: The CPU Equation


A program is comprised of a number of instructions
Measured in:
instructions/program
The average instruction takes a number of cycles per
instruction (CPI) to be completed.
Measured in: cycles/instruction
CPU has a fixed clock cycle time = 1/clock rate
Measured in:
seconds/cycle

CPU execution time is the product of the above three


parameters as follows:
CPU
CPUtime
time

== Seconds
Seconds
Program
Program

==Instructions
xx Seconds
Instructions xx Cycles
Cycles
Seconds
Program
Instruction
Cycle
Program
Instruction
Cycle

EECC550 - Shaaban
#4 Final Review Winter 2000 2-19-2001

Factors Affecting CPU Performance


CPU
CPUtime
time

== Seconds
Seconds
Program
Program

==Instructions
xx Seconds
Instructions xx Cycles
Cycles
Seconds
Program
Instruction
Cycle
Program
Instruction
Cycle

Instruction
Count

CPI

Program

Compiler

Instruction Set
Architecture (ISA)

Organization
Technology

Clock Rate

X
X
EECC550 - Shaaban
#5 Final Review Winter 2000 2-19-2001

Aspects of CPU Execution Time


CPU Time = Instruction count x CPI x Clock cycle
Depends on:
Program Used
Compiler
ISA

Instruction Count

Depends on:
Program Used
Compiler
ISA
CPU Organization

CPI

Clock
Cycle

Depends on:
CPU Organization
Technology

EECC550 - Shaaban
#6 Final Review Winter 2000 2-19-2001

Instruction Types & CPI: An Example


An instruction set has three instruction classes:
Instruction class
A
B
C

CPI
1
2
3

Two code sequences have the following instruction counts:


Code Sequence
1
2

Instruction counts for instruction class


A
B
C
2
1
2
4
1
1

CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles


CPI for sequence 1 = clock cycles / instruction count
= 10 /5 = 2
CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5
EECC550 - Shaaban
#7 Final Review Winter 2000 2-19-2001

Instruction Frequency & CPI


Given a program with n types or classes of
instructions with the following characteristics:
Ci = Count of instructions of type i
CPIi = Average cycles per instruction of typei
Fi = Frequency of instruction typei
= Ci/ total instruction count
Then:
n

CPI =
i =1

(CPI F )
i

EECC550 - Shaaban
#8 Final Review Winter 2000 2-19-2001

Instruction Type Frequency & CPI:


A RISC Example
Base Machine (Reg / Reg)
Op
Freq Cycles CPI(i)
ALU
50%
1
.5
Load
20%
5
1.0
Store
10%
3
.3
Branch
20%
2
.4

% Time
23%
45%
14%
18%

Typical Mix

CPI =

(CPI
n

i =1

F )
i

CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2 = 2.2
EECC550 - Shaaban
#9 Final Review Winter 2000 2-19-2001

Metrics of Computer Performance


Execution time: Target workload,
SPEC95, etc.

Application
Programming
Language
Compiler
ISA

(millions) of Instructions per second MIPS


(millions) of (F.P.) operations per second MFLOP/s

Datapath
Control

Megabytes per second.

Function Units
Transistors Wires Pins

Cycles per second (clock rate).

Each metric has a purpose, and each can be misused.

EECC550 - Shaaban
#10 Final Review Winter 2000

2-19-2001

Performance Enhancement Calculations:


Amdahl's Law
The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is used

Amdahls Law:
Performance improvement or speedup due to enhancement E:
Execution Time without E
Speedup(E) = -------------------------------------Execution Time with E

Performance with E
= --------------------------------Performance without E

Suppose that enhancement E accelerates a fraction F of the


execution time by a factor S and the remainder of the time is
unaffected then:
Execution Time with E = ((1-F) + F/S) X Execution Time without E
Hence speedup is given by:
Execution Time without E
1
Speedup(E) = --------------------------------------------------------- = -------------------((1 - F) + F/S) X Execution Time without E
(1 - F) + F/S
Note: All fractions here refer to original execution time.

EECC550 - Shaaban
#11 Final Review Winter 2000

2-19-2001

Pictorial Depiction of Amdahls Law


Enhancement E accelerates fraction F of execution time by a factor of S
Before:
Execution Time without enhancement E:

Unaffected, fraction: (1- F)

Affected fraction: F

Unchanged

Unaffected, fraction: (1- F)

F/S

After:
Execution Time with enhancement E:
Execution Time without enhancement E
1
Speedup(E) = ------------------------------------------------------ = -----------------Execution Time with enhancement E
(1 - F) + F/S

EECC550 - Shaaban
#12 Final Review Winter 2000

2-19-2001

Amdahl's Law With Multiple Enhancements:


Example

Three CPU performance enhancements are proposed with the following


speedups and percentage of the code execution time affected:
Speedup1 = S1 = 10
Speedup2 = S2 = 15
Speedup3 = S3 = 30

Percentage1 = F1 = 20%
Percentage1 = F2 = 15%
Percentage1 = F3 = 10%

While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
What is the resulting overall speedup?
Speedup

( (1 F
i

)+

F
S

Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)]


= 1/ [
.55
+
.0333
]
= 1 / .5833 = 1.71

EECC550 - Shaaban
#13 Final Review Winter 2000

2-19-2001

MIPS Instruction Formats


31

R-Type

26
op

rs

6 bits

I-Type: ALU

31

26

31

J-Type: Jumps

5 bits

11

rd

shamt

funct

5 bits

5 bits

6 bits

16

0
immediate

rt
5 bits

16 bits

26
op
6 bits

5 bits
21

rs

6 bits

16
rt

5 bits

op

Load/Store, Branch

21

0
target address
26 bits

op: Opcode, operation of the instruction.


rs, rt, rd: The source and destination register specifiers.
shamt: Shift amount.
funct: selects the variant of the operation in the op field.
address / immediate: Address offset or immediate value.
target address: Target address of the jump instruction.

EECC550 - Shaaban
#14 Final Review Winter 2000

2-19-2001

A Single Cycle MIPS Datapath

nPC_sel

RegDst

00

MemtoReg

Rs Rt
5
busA
Rw Ra Rb
32 32-bit
Registers
busB
32
5

imm16

16

32
Data In

32

ExtOp

Clk

32

0
Mux

Clk

Extender

Clk

32
Mux

PC

Mux
Adder

PC Ext

imm16

ALUctr MemWr

Equal

ALU

Adder

32

Imm16

RegWr 5
busW

Rd

Rd Rt

1
4

Rt

Instruction<31:0>

<0:15>

Rs

<11:15>

Adr

<16:20>

<21:25>

Inst
Memory

WrEn Adr

Data
Memory

ALUSrc

EECC550 - Shaaban
#15 Final Review Winter 2000

2-19-2001

<0:25>

Rd

<0:15>

Rs

<11:15>

Rt

<16:20>

Op Fun

<21:25>

Adr

Instruction<31:0>
<21:25>

Instruction
Memory

Imm16 Jump_target

Control Unit
nPC_sel RegWr RegDst ExtOp ALUSrc ALUctr MemWr MemtoReg Jump

Equal

DATA PATH
EECC550 - Shaaban
#16 Final Review Winter 2000

2-19-2001

The Truth Table For The Main Control


op

00 0000

00 1101 10 0011 10 1011 00 0100 00 0010

R-type

ori

lw

sw

beq

jump

RegDst

ALUSrc

MemtoReg

RegWrite

MemWrite

Branch

Jump

ExtOp

R-type

Or

Add

Add

Subtract

xxx

ALUop <2>

ALUop <1>

ALUop <0>

ALUop (Symbolic)

EECC550 - Shaaban
#17 Final Review Winter 2000

2-19-2001

Worst Case Timing (Load)


Clk
PC Old Value

Clk-to-Q
New Value
Instruction Memoey Access Time
New Value

Rs, Rt, Rd,


Op, Func

Old Value

ALUctr

Old Value

ExtOp

Old Value

New Value

ALUSrc

Old Value

New Value

MemtoReg

Old Value

New Value

RegWr

Old Value

New Value

busA
busB

Delay through Control Logic


New Value

Register
Write Occurs

Register File Access Time


New Value

Old Value
Delay through Extender & Mux
Old Value

New Value
ALU Delay

Address

Old Value

New Value
Data Memory Access Time

busW

Old Value

New

EECC550 - Shaaban
#18 Final Review Winter 2000

2-19-2001

MIPS Single Cycle Instruction


Timing Comparison
Arithmetic & Logical
PC
Inst Memory

Reg File

mux

ALU

mux

setup

Load
PC

Inst Memory

Reg File mux


Critical Path

ALU

Data Mem

Store
PC

Inst Memory

Reg File

ALU

Data Mem

Branch
PC

Inst Memory

Reg File

mux

cmp

mux setup

mux

EECC550 - Shaaban
#19 Final Review Winter 2000

2-19-2001

CPU Design Steps


1. Analyze instruction set operations using independent
RTN => datapath requirements.
2. Select set of datapath components & establish clock
methodology.
3. Assemble datapath meeting the requirements.
4. Analyze implementation of each instruction to determine
setting of control points that effects the register transfer.
5. Assemble the control logic.

EECC550 - Shaaban
#20 Final Review Winter 2000

2-19-2001

Drawback of Single Cycle Processor


Long cycle time.
All instructions must take as much time as the slowest:
Cycle time for load is longer than needed for all other
instructions.

Real memory is not as well-behaved as idealized memory


Cannot always complete data access in one (short) cycle.

EECC550 - Shaaban
#21 Final Review Winter 2000

2-19-2001

Reducing Cycle Time: Multi-Cycle Design


Cut combinational dependency graph by inserting registers / latches.
The same work is done in two or more fast cycles, rather than one slow
cycle.
storage element

storage element

Acyclic
Combinational
Logic (A)

Acyclic
Combinational
Logic

=>

storage element

storage element
Acyclic
Combinational
Logic (B)
storage element

EECC550 - Shaaban
#22 Final Review Winter 2000

2-19-2001

Instruction Processing Cycles


Instruction
Fetch
Next

Obtain instruction from program storage

Update program counter to address

Instruction

of next instruction

Instruction

Determine instruction type

Decode

Obtain operands from registers

Execute

Compute result value or status

Result
Store

Common
steps
for all
instructions

Store result in register/memory if needed


(usually called Write Back).

EECC550 - Shaaban
#23 Final Review Winter 2000

2-19-2001

Partitioning The Single Cycle Datapath

Result Store

MemWr

MemRd
MemWr

RegDst
RegWr

Reg.
File
Data
Mem

Exec

Mem
Access

ALUctr

ALUSrc

ExtOp

Operand
Fetch

Instruction
Fetch

PC

Next PC

nPC_sel

Add registers between smallest steps

EECC550 - Shaaban
#24 Final Review Winter 2000

2-19-2001

RegDst
Reg.
RegWr
File
Equal

MemToReg

ALUSrc
ALUctr

MemRd
MemWr

M
Result Store

Operand
Fetch

Instruction
Fetch

Data
Mem

Mem
Access

Reg
File

Ext
ALU

ExtOp

IR

PC

Next PC

nPC_sel

Example Multi-cycle Datapath

Registers added:
IR:
Instruction register
A, B: Two registers to hold operands read from register file.
R:
or ALUOut, holds the output of the ALU
M:
or Memory data register (MDR) to hold data read from data memory

EECC550 - Shaaban
#25 Final Review Winter 2000

2-19-2001

Operations In Each Cycle


R-Type

Logic
Immediate

Load

Store

Branch

Instruction
Fetch

IR Mem[PC]

IR Mem[PC]

IR Mem[PC]

IR Mem[PC]

Instruction
Decode

A R[rs]

A R[rs]

A R[rs]

A R[rs]

R[rs]

B R[rt]

R[rt]

B R[rt]

IR Mem[PC]

If Equal = 1

Execution

R A + B

R A OR ZeroExt[imm16]

R A + SignEx(Im16)

PC PC + 4 +
R A + SignEx(Im16)

(SignExt(imm16) x4)
else
PC PC + 4

Memory

M Mem[R]

Mem[R]

PC PC + 4

Write
Back

R[rd] R

R[rt] R

PC PC + 4

PC PC + 4

R[rd]

PC PC + 4

EECC550 - Shaaban
#26 Final Review Winter 2000

2-19-2001

Finite State Machine (FSM) Control Model


State specifies control points for Register Transfer.
Transfer occurs upon exiting state (same falling edge).
inputs (conditions)

Next State
Logic

State X

Control State

Register Transfer
Control Points
Depends on Input

Output Logic
outputs (control points)

EECC550 - Shaaban
#27 Final Review Winter 2000

2-19-2001

Control Specification For Multi-cycle CPU


Finite State Machine (FSM)
instruction fetch

IR MEM[PC]

A R[rs]
B R[rt]

R A or ZX

R[rd] R
PC PC + 4

R[rt] R
PC PC + 4

To instruction fetch

LW

SW

BEQ & Equal


BEQ & ~Equal
PC PC + 4

R A + SX

R A + SX

M MEM[R]

MEM[R] B
PC PC + 4

PC PC +
SX || 00

To instruction fetch

Write-back

R A fun B

ORi

Memory

Execute

R-type

decode / operand fetch

R[rt] M
PC PC + 4
To instruction fetch

EECC550 - Shaaban
#28 Final Review Winter 2000

2-19-2001

Traditional FSM Controller


datapath + state diagram => control
Translate RTN statements into
control points.
Assign states.
Implement the controller.
EECC550 - Shaaban
#29 Final Review Winter 2000

2-19-2001

Mapping RTNs To Control Points Examples


& State Assignments
IR MEM[PC]

instruction fetch

0000
imem_rd, IRen

A R[rs]
B R[rt]

Aen, Ben

decode / operand fetch

0001
ALUfun, Sen

ORi

LW

R A or ZX

R A + SX

0110

1000

RegDst,
RegWr,
PCen

M MEM[S]
1001

BEQ & Equal

SW

BEQ & ~Equal

R A + SX
1011

MEM[S] B
PC PC + 4
1100

R[rd] R
PC PC + 4

R[rt] R
PC PC + 4

0101

0111

To instruction fetch state 0000

PC PC + 4
0011

PC PC +
SX || 00
0010

To instruction fetch
state 0000

Write-back

R A fun B
0100

Memory

Execute

R-type

R[rt] M
PC PC + 4
1010
To instruction fetch state 0000

EECC550 - Shaaban
#30 Final Review Winter 2000

2-19-2001

Detailed Control Specification

State Op field Eq Next IR PC


en sel
0000 ?????? ? 0001 1
0001 BEQ
0 0011
0001 BEQ
1 0010
0001 R-type x 0100
0001 orI
x 0110
0001 LW
x 1000
0001 SW
x 1011
0010 xxxxxx x 0000
1 1
0011 xxxxxx x 0000
1 0
0100 xxxxxx x 0101
0101 xxxxxx x 0000
1 0
0110 xxxxxx x 0111
0111 xxxxxx x 0000
1 0
1000 xxxxxx x 1001
1001 xxxxxx x 1010
1010 xxxxxx x 0000
1 0
1011 xxxxxx x 1100
1100 xxxxxx x 0000
1 0

Ops
AB

Exec
Ex Sr ALU S

Mem
RWM

Write-Back
M-R Wr Dst

11
11
11
11
11
11

0 1 fun

1
0 1 1

0 0 or

1
0 1 0

1 0 add 1
1 0 0
1

1 0

1 0 add 1
0 1

EECC550 - Shaaban
#31 Final Review Winter 2000

2-19-2001

Alternative Multiple Cycle Datapath (In Textbook)

Shared instruction/data memory unit


A single ALU shared among instructions
Shared units require additional or widened multiplexors
Temporary registers to hold data between clock cycles of the instruction:
Additional registers: Instruction Register (IR),
Memory Data Register (MDR), A, B, ALUOut

EECC550 - Shaaban
#32 Final Review Winter 2000

2-19-2001

Operations In Each Cycle


R-Type
Instruction
Fetch

IR Mem[PC]
PC PC + 4
A R[rs]

Instruction
Decode

Execution

B R[rt]

Logic
Immediate
IR Mem[PC]
PC PC + 4

Load

Store

IR Mem[PC]
PC PC + 4

IR Mem[PC]
PC PC + 4

A R[rs]

A R[rs]

B R[rt]

B R[rt]

ALUout PC +
(SignExt(imm16)
x4)

ALUout PC +

ALUout A + B

ALUout

(SignExt(imm16) x4)

A OR ZeroExt[imm16]

R[rt]

ALUout PC +
(SignExt(imm16) x4)

ALUout

Branch
IR Mem[PC]
PC PC + 4
A R[rs]

R[rs]

ALUout PC +

ALUout PC +

(SignExt(imm16) x4)

(SignExt(imm16) x4)

If Equal = 1

ALUout

A + SignEx(Im16)

R[rt]

PC ALUout

A + SignEx(Im16)

Memory
M Mem[ALUout]

Write
Back

R[rd] ALUout

R[rt] ALUout

R[rd]

Mem[ALUout]

Mem

EECC550 - Shaaban
#33 Final Review Winter 2000

2-19-2001

Finite State Machine (FSM) Specification


IR MEM[PC]
PC PC + 4

instruction fetch

0000

A R[rs]
B R[rt]

ALUout
PC +SX

decode

0001
LW

ALUout
A fun B

ALUout
A op ZX

ALUout
A + SX

0100

0110

1000
M
MEM[ALUout]

1001

BEQ
SW
ALUout
A + SX

If A = B then
PC ALUout

1011

0010

MEM[ALUout]
B

To instruction fetch

Write-back

ORi

Memory

Execute

R-type

1100
R[rd]
ALUout

R[rt]
ALUout

0101

0111

R[rt] M

1010

To instruction fetch
To instruction fetch

EECC550 - Shaaban
#34 Final Review Winter 2000

2-19-2001

MIPS Multi-cycle Datapath


Performance Evaluation
What is the average CPI?
State diagram gives CPI for each instruction type
Workload below gives frequency of each type
Type

CPIi for type

Frequency

CPIi x freqIi

Arith/Logic

40%

1.6

Load

30%

1.5

Store

10%

0.4

branch

20%

0.6

Average CPI:

4.1

Better than CPI = 5 if all instructions took the same number


of clock cycles (5).

EECC550 - Shaaban
#35 Final Review Winter 2000

2-19-2001

Control Implementation Alternatives


Control may be designed using one of several initial representations. The
choice of sequence control, and how logic is represented, can then be
determined independently; the control can then be implemented with one
of several methods using a structured logic technique.
Initial Representation

Sequencing Control

Logic Representation

Implementation
Technique

Finite State Diagram

Microprogram

Explicit Next State


Function

Microprogram counter
+ Dispatch ROMs

Logic Equations

PLA
hardwired control

Truth Tables

ROM
microprogrammed control

EECC550 - Shaaban
#36 Final Review Winter 2000

2-19-2001

Microprogrammed Control
Finite state machine control for a full set of instructions is very
complex, and may involve a very large number of states:
Slight microoperation changes require new FSM controller.
Microprogramming: Designing the control as a program that
implements the machine instructions.
A microprogam for a given machine instruction is a symbolic
representation of the control involved in executing the instruction
and is comprised of a sequence of microinstructions.

Each microinstruction defines the set of datapath control signals


that must asserted (active) in a given state or cycle.
The format of the microinstructions is defined by a number of
fields each responsible for asserting a set of control signals.
Microarchitecture:
Logical structure and functional capabilities of the hardware as
seen by the microprogrammer.
EECC550 - Shaaban
#37 Final Review Winter 2000

2-19-2001

Microprogrammed Control Unit


Microprogram
Storage
Outputs
Control
ROM/PLA
Signal

Multicycle
Datapath

Fields
Microinstruction Address

Inputs
1

Adder

Sequencing
Control
Field

State Reg
Address Select Logic

Types of branching
Set state to 0 (fetch)
Dispatch i (state 1)
Use incremented
address (seq) state
number 2
Microprogram
Counter, MicroPC

Opcode
EECC550 - Shaaban
#38 Final Review Winter 2000

2-19-2001

Multiple Bit Control

Single Bit Control

List of control Signals Grouped Into Fields


Signal name
ALUSelA
RegWrite
MemtoReg
RegDst
MemRead

Effect when deasserted


1st ALU operand = PC
None
Reg. write data input = ALU
Reg. dest. no. = rt
None

Effect when asserted


1st ALU operand = Reg[rs]
Reg. is written
Reg. write data input = memory
Reg. dest. no. = rd
Memory at address is read,
MDR Mem[addr]
Memory at address is written
Memory address = S
IR Memory
PC PCSource
IF ALUzero then PC PCSource
PCSource = ALUout

MemWrite None
IorD
Memory address = PC
IRWrite
None
PCWrite
None
PCWriteCond None
PCSource
PCSource = ALU
Signal name Value Effect
ALUOp
00
ALU adds
01
ALU subtracts
10
ALU does function code
11
ALU does logical OR
ALUSelB
000
2nd ALU input = Reg[rt]
001
2nd ALU input = 4
010
2nd ALU input = sign extended IR[15-0]
011
2nd ALU input = sign extended, shift left 2 IR[15-0]
100
2nd ALU input = zero extended IR[15-0]

EECC550 - Shaaban
#39 Final Review Winter 2000

2-19-2001

Microinstruction Field Values


Field Name
ALU

Values for Field


Add
Subt.
Func code
Or
SRC1
PC
rs
SRC2
4
Extend
Extend0
Extshft
rt
destination
rd ALU
rt ALU
rt Mem
Memory
Read PC
Read ALU
Write ALU
Memory register IR
PC write
ALU
ALUoutCond
Sequencing
Seq
Fetch
Dispatch i

Function of Field with Specific Value


ALU adds
ALU subtracts
ALU does function code
ALU does logical OR
1st ALU input = PC
1st ALU input = Reg[rs]
2nd ALU input = 4
2nd ALU input = sign ext. IR[15-0]
2nd ALU input = zero ext. IR[15-0]
2nd ALU input = sign ex., sl IR[15-0]
2nd ALU input = Reg[rt]
Reg[rd] ALUout
Reg[rt] ALUout
Reg[rt] Mem
Read memory using PC
Read memory using ALU output
Write memory using ALU output, value B
IR Mem
PC ALU
IF ALU Zero then PC ALUout
Go to sequential instruction
Go to the first microinstruction
Dispatch using ROM.

EECC550 - Shaaban
#40 Final Review Winter 2000

2-19-2001

Microprogram for The Control Unit


Label

ALU

SRC1

SRC2

Fetch:

Add
Add

PC
PC

4
Extshft

Lw:

Add

rs

Extend

Dest.

Memory
Read PC

Mem. Reg. PC Write


IR

Sequencing

ALU

Seq
Seq
Fetch

Read ALU
rt MEM
Sw:

Add

rs

Extend

Seq
Fetch

Write ALU
Rtype:

Func

rs

rt

Seq
Fetch

rd ALU
Beq:

Subt.

rs

rt

Ori:

Or

rs

Extend0

Seq
Dispatch

ALUoutCond.

Fetch
Seq
Fetch

rt ALU

EECC550 - Shaaban
#41 Final Review Winter 2000

2-19-2001

MIPS Integer ALU Requirements


(1) Functional Specification:
inputs:
2 x 32-bit operands A, B, 4-bit mode
outputs:
32-bit result S, 1-bit carry, 1 bit overflow, 1 bit zero
operations: add, addu, sub, subu, and, or, xor, nor, slt, sltU

10 operations
thus 4 control bits

(2) Block Diagram:

32
c A
zero
ovf

32
B

ALU
S
32

4
m

00

add

01

addU

02

sub

03

subU

04

and

05

or

06

xor

07

nor

12

slt

13

sltU

EECC550 - Shaaban
#42 Final Review Winter 2000

2-19-2001

Building Block: 1-bit ALU


Performs: AND, OR,
addition on A, B or A, B inverted
invertB

Operation

CarryIn

and

1-bit
Full
Adder

Mux

or

Result

add

CarryOut

EECC550 - Shaaban
#43 Final Review Winter 2000

2-19-2001

32-Bit ALU Using 32 1-Bit ALUs


CarryIn0

32-bit rippled-carry adder

A0

1-bit
Result0
ALU
B0
CarryOut0
CarryIn1
A1
1-bit
Result1
ALU
B1
CarryOut1
CarryIn2
A2
B2

1-bit
ALU

Result2

CarryIn3

:
:
CarryIn31
A31
B31

CarryOut30
1-bit
ALU

(operation/invertB lines not shown)

Addition/Subtraction Performance:
Assume gate delay = T
Total delay =
=
=
=

32 x (1-Bit ALU Delay)


32 x 2 x gate delay
64 x gate delay
64 T

Result31

CarryOut31

EECC550 - Shaaban
#44 Final Review Winter 2000

2-19-2001

MIPS ALU With SLT Support Added


CarryIn0
Less

A0
B0

A1
B1
Less = 0
A2
B2
Less = 0

1-bit
Result0
ALU
CarryIn1 CarryOut0
1-bit
Result1
ALU
CarryIn2 CarryOut1
1-bit
ALU
CarryIn3

:
:

Zero

Result2

:
:
:
:

CarryOut30
CarryIn31
A31
1-bit
B31
Result31
ALU
Less = 0
CarryOut31

Overflow

EECC550 - Shaaban
#45 Final Review Winter 2000

2-19-2001

Improving ALU Performance:

Carry Look Ahead (CLA)


Cin
A0
B1

A
0
0
1
1

S
G
P
C1 =G0 + C0 P0

A
B

S
G
P

A
B

B
0
1
0
1

C-out
0
C-in
C-in
1

kill
propagate
propagate
generate

G = A and B
P = A xor B
C2 = G1 + G0 P1 + C0 P0 P1

S
G
P
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2

A
B

S
G
P

G
P

EECC550 - Shaaban
#46 Final Review Winter 2000

2-19-2001

C
L
A

G0
P0
C1 =G0 + C0 P0

Delay = 2 + 2 + 1 = 5 gate delays = 5T

4-bit
Adder

Cascaded Carry Look-ahead


C0
16-Bit Example

C2 = G1 + G0 P1 + C0 P0 P1

Assuming all
gates have
equal delay T

4-bit
Adder

4-bit
Adder

C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
G
P

C4 = . . .

EECC550 - Shaaban
#47 Final Review Winter 2000

2-19-2001

Unsigned Multiplication Example


Paper and pencil example (unsigned):
Multiplicand
1000
Multiplier
1001
1000
0000
0000
1000
Product
01001000
m bits x n bits = m + n bit product, m = 32, n = 32, 64 bit product.
The binary number system simplifies multiplication:
0 => place 0
1 => place a copy

( 0 x multiplicand).
( 1 x multiplicand).

We will examine 4 versions of multiplication hardware & algorithm:

Successive refinement of design.


EECC550 - Shaaban
#48 Final Review Winter 2000

2-19-2001

Operation of Combinational Multiplier


0

0
A3
A3

A3
A3
P7

P6

A2
P5

A2
A1
P4

A2
A1

0
A2
A1

0
A1

0
A0

B0

A0

B1

A0

B2

A0
P3

B3
P2

P1

P0

At each stage shift A left ( x 2).


Use next bit of B to determine whether to add in shifted multiplicand.
Accumulate 2n bit partial product at each stage.

EECC550 - Shaaban
#49 Final Review Winter 2000

2-19-2001

MULTIPLY HARDWARE Version 3


Combine Multiplier register and Product register:
32-bit Multiplicand register.
32 -bit ALU.
64-bit Product register, (0-bit Multiplier register).
Multiplicand
32 bits
32-bit ALU
Shift Right
Product (Multiplier)
64 bits

Control
Write

EECC550 - Shaaban
#50 Final Review Winter 2000

2-19-2001

Multiply Algorithm
Version 3 Product0 = 1

Start

1. Test
Product0

Product0 = 0

1a. Add multiplicand to the left half of product &


place the result in the left half of Product register

2. Shift the Product register right 1 bit.

32nd
repetition?

No: < 32 repetitions

Yes: 32 repetitions
Done

EECC550 - Shaaban
#51 Final Review Winter 2000

2-19-2001

Combinational Shifter from MUXes


Basic Building Block
sel

0
D

8-bit right shifter


A7

A6

A5

A4

A3

A2

A1

S2 S1 S0

A0

R7

R6

R5

R4

What comes in the MSBs?

How many levels for 32-bit shifter?

R3

R2

R1

R0

EECC550 - Shaaban
#52 Final Review Winter 2000

2-19-2001

Division
1001
Divisor 1000
1001010
1000
10
101
1010
1000
10

Quotient
Dividend

Remainder (or Modulo result)

See how big a number can be subtracted, creating quotient bit on each step:
Binary =>

1 * divisor or 0 * divisor

Dividend = Quotient x Divisor + Remainder


=> | Dividend | = | Quotient | + | Divisor |

3 versions of divide, successive refinement

EECC550 - Shaaban
#53 Final Review Winter 2000

2-19-2001

DIVIDE HARDWARE Version 3


32-bit Divisor register.
32 -bit ALU.
64-bit Remainder regegister (0-bit Quotient register).
Divisor
32 bits
32-bit ALU
HI

LO

Shift Left

Remainder (Quotient)
64 bits

Control
Write

EECC550 - Shaaban
#54 Final Review Winter 2000

2-19-2001

Divide Algorithm
Version 3

Start: Place Dividend in Remainder


1. Shift the Remainder register left 1 bit.

2. Subtract the Divisor register from the


left half of the Remainder register, & place the
result in the left half of the Remainder register.
Remainder >= 0

3a. Shift the


Remainder register
to the left setting
the new rightmost
bit to 1.

Test
Remainder

Remainder < 0

3b. Restore the original value by adding the Divisor


register to the left half of the Remainderregister,
&place the sum in the left half of the Remainder
register. Also shift the Remainder register to the
left, setting the new least significant bit to 0.

nth
repetition?

No: < n repetitions

Yes: n repetitions (n = 4 here)


Done. Shift left half of Remainder right 1 bit.

EECC550 - Shaaban
#55 Final Review Winter 2000

2-19-2001

Representation of Floating Point Numbers in


Single Precision

IEEE 754 Standard

Value = N = (-1)S X 2
0 < E < 255
Actual exponent is:
e = E - 127

Example:

1
sign S

E-127

X (1.M)

8
E

23
M

exponent:
excess 127
binary integer
added

0 = 0 00000000 0 . . . 0

Magnitude of numbers that


can be represented is in the range:
Which is approximately:

mantissa:
sign + magnitude, normalized
binary significand with
a hidden integer bit: 1.M

-1.5 = 1 01111111 10 . . . 0
2

-126

(1.0)

1.8 x 10

- 38

127

(2 - 2 -23 )

to

to

3.40 x 10

38

EECC550 - Shaaban
#56 Final Review Winter 2000

2-19-2001

Representation of Floating Point Numbers in


Single Precision

IEEE 754 Standard

Value = N = (-1)S X 2
0 < E < 255
Actual exponent is:
e = E - 127

Example:

1
sign S

E-127

X (1.M)

8
E

23
M

exponent:
excess 127
binary integer
added

0 = 0 00000000 0 . . . 0

Magnitude of numbers that


can be represented is in the range:
Which is approximately:

mantissa:
sign + magnitude, normalized
binary significand with
a hidden integer bit: 1.M

-1.5 = 1 01111111 10 . . . 0
2

-126

(1.0)

1.8 x 10

- 38

127

(2 - 2 -23 )

to

to

3.40 x 10

38

EECC550 - Shaaban
#57 Final Review Winter 2000

2-19-2001

Representation of Floating Point Numbers in


Double Precision

IEEE 754 Standard

Value = N = (-1)S X 2

0 < E < 2047


Actual exponent is:
e = E - 1023

Example:

1
sign S

E-1023

X (1.M)

11
E

52
M
Mantissa:
sign + magnitude, normalized
binary significand with
a hidden integer bit: 1.M

exponent:
excess 1023
binary integer
added

0 = 0 00000000000 0 . . . 0

Magnitude of numbers that


can be represented is in the range:
Which is approximately:

-1.5 = 1 01111111111 10 . . . 0
-1022

1023

(2 - 2 - 52 )

(1.0)

to

- 308
2.23 x 10

to

1.8 x 10

308

EECC550 - Shaaban
#58 Final Review Winter 2000

2-19-2001

Floating Point Conversion Example


The decimal number -2345.12510 is to be represented in the
IEEE 754 32-bit single precision format:
-2345.12510 = -100100101001.0012
(converted to binary)
= -1.00100101001001 x 211 (normalized binary)
Hidden

The mantissa is negative so the sign S is given by:


S=1
The biased exponent E is given by E = e + 127
E = 11 + 127 = 13810 = 100010102
Fractional part of mantissa M:
M = .00100101001001000000000 (in 23 bits)
The IEEE 754 single precision representation is given by:
1

10001010

1 bit

8 bits

00100101001001000000000
M
23 bits

EECC550 - Shaaban
#59 Final Review Winter 2000

2-19-2001

Start
Compare the exponents of the two numbers
shift the smaller number to the right until its
exponent matches the larger exponent

(1)

(2)

Floating Point
Addition
Flowchart

Add the significands (mantissas)


Normalize the sum, either shifting right and
incrementing the exponent or shifting left
and decrementing the exponent

(3)

(4)
No

Overflow or
Underflow ?

Yes

Generate exception
or return error

No
Still
normalized?
yes

Round the significand to the appropriate number of bits


If mantissa = 0, set exponent to 0

(5)

Done

EECC550 - Shaaban
#60 Final Review Winter 2000

2-19-2001

Floating Point Addition Hardware

EECC550 - Shaaban
#61 Final Review Winter 2000

2-19-2001

Floating Point
Multiplication Flowchart
(1)

Set the result to zero:


exponent = 0

Is one/both
operands =0?

(2)

Compute exponent:
biased exp.(X) + biased exp.(Y) - bias

(3)

Compute sign of result: Xs XOR Ys

(4)

Multiply the mantissas


Normalize mantissa if needed

(5)
Generate exception
or return error
No

Start

Yes

Overflow or
Underflow?

(6)

No
Round or truncate the result mantissa

Still
Normalized?
Yes

(7)

Done

EECC550 - Shaaban
#62 Final Review Winter 2000

2-19-2001

Extra Bits for Rounding


Extra bits used to prevent or minimize rounding errors.
How many extra bits?
IEEE: As if computed the result exactly and rounded.
Addition:
1.xxxxx

1.xxxxx

1.xxxxx

+ 1.xxxxx

0.001xxxxx

0.01xxxxx

1x.xxxxy

1.xxxxxyyy

1x.xxxxyyy

post-normalization

pre-normalization

pre and post

Guard Digits: digits to the right of the first p digits of significand to guard
against loss of digits can later be shifted left into first P places during
normalization.
Addition: carry-out shifted in
Subtraction: borrow digit and guard
Multiplication: carry and guard, Division requires guard

EECC550 - Shaaban
#63 Final Review Winter 2000

2-19-2001

Rounding Digits
Normalized result, but some non-zero digits to the right of the
significand --> the number should be rounded
E.g., B = 10, p = 3:
-

2-bias
0 2 1.69 = 1.6900 * 10
0 0 7.85 = - .0785 * 10 2-bias
0 2 1.61 = 1.6115 * 10 2-bias

One round digit must be carried to the right of the guard digit so that
after a normalizing left shift, the result can be rounded, according
to the value of the round digit.
IEEE Standard:
four rounding modes: round to nearest (default)
round towards plus infinity
round towards minus infinity
round towards 0
round to nearest:
round digit < B/2 then truncate
> B/2 then round up (add 1 to ULP: unit in last place)
= B/2 then round to nearest even digit
it can be shown that this strategy minimizes the mean error
introduced by rounding.

EECC550 - Shaaban
#64 Final Review Winter 2000

2-19-2001

Sticky Bit
Additional bit to the right of the round digit to better fine tune rounding
d0 . d1 d2 d3 . . . dp-1 0 0 0
0. 0 0 X... X XX S
XX S

Sticky bit: set to 1 if any 1 bits fall off


the end of the round digit

d0 . d1 d2 d3 . . . dp-1 0 0 0
0. 0 0 X... X XX 0

d0 . d1 d2 d3 . . . dp-1 0 0 0
0. 0 0 X... X XX 1
generates a borrow

Rounding Summary:
Radix 2 minimizes wobble in precision.
Normal operations in +,-,*,/ require one carry/borrow bit + one guard digit.
One round digit needed for correct rounding.
Sticky bit needed when round digit is B/2 for max accuracy.
Rounding to nearest has mean error = 0 if uniform distribution of digits
are assumed.

EECC550 - Shaaban
#65 Final Review Winter 2000

2-19-2001

Pipelining: Design Goals


The length of the machine clock cycle is determined by the time
required for the slowest pipeline stage.
An important pipeline design consideration is to balance the
length of each pipeline stage.
If all stages are perfectly balanced, then the time per
instruction on a pipelined machine (assuming ideal conditions
with no stalls):
Time per instruction on unpipelined machine
Number of pipe stages

Under these ideal conditions:


Speedup from pipelining = the number of pipeline stages = k
One instruction is completed every cycle: CPI = 1 .

EECC550 - Shaaban
#66 Final Review Winter 2000

2-19-2001

Pipelined Instruction Processing


Representation
Time in clock cycles

Clock cycle Number


Instruction Number

Instruction I
Instruction I+1
Instruction I+2
Instruction I+3
Instruction I +4

IF

ID
IF

EX
ID
IF

MEM
EX
ID
IF

WB
MEM
EX
ID
IF

WB
MEM
EX
ID

WB
MEM
EX

WB
MEM

WB

Time to fill the pipeline

Pipeline Stages:
IF
ID
EX
MEM
WB

= Instruction Fetch
= Instruction Decode
= Execution
= Memory Access
= Write Back

First instruction, I
Completed

Last instruction,
I+4 completed

EECC550 - Shaaban
#67 Final Review Winter 2000

2-19-2001

Single Cycle, Multi-Cycle, Pipeline:


Performance Comparison Example
For 1000 instructions, execution time:
Single Cycle Machine:
8 ns/cycle x 1 CPI x 1000 inst = 8000 ns

Multicycle Machine:
2 ns/cycle x 4.6 CPI (due to inst mix) x 1000 inst = 9200 ns

Ideal pipelined machine, 5-stages:


2 ns/cycle x (1 CPI x 1000 inst + 4 cycle fill) = 2008 ns
EECC550 - Shaaban
#68 Final Review Winter 2000

2-19-2001

Single Cycle, Multi-Cycle, Vs. Pipeline


Cycle 1

Cycle 2

Clk
Single Cycle Implementation:

8 ns

Load

Store

Waste

2ns
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk
Multiple Cycle Implementation:
Load
IF

Store
ID

EX

MEM

WB

MEM

WB

IF

R-type
ID

EX

MEM

IF

Pipeline Implementation:
Load IF

ID

Store IF

EX
ID

R-type IF

EX
ID

MEM
EX

WB
MEM

WB

EECC550 - Shaaban
#69 Final Review Winter 2000

2-19-2001

MIPS: A Pipelined Datapath


0
M
u
x
1

IF/ID

ID/EX

E X/ME M

MEM/WB

Ad d
4

A dd

A dd
result

PC

Add ress
In struction
memory

Ins truction

Shift
left 2
Read
register 1

Re ad
data 1
Read
register 2
Reg iste rs Re ad
Write
data 2
register
Write
data

Ze ro
0
M
u
x
1

ALU

ALU
result

A ddress

Read
da ta

1
M
u
x
0

Data
me mory
Wri te
data

16

IF
Instruction Fetch

Sign
e xtend

32

ID
Instruction Decode

EX
Execution

MEM
Memory

WB
Write Back

EECC550 - Shaaban
#70 Final Review Winter 2000

2-19-2001

Pipeline Control
Pass needed control signals along from one stage to the next as the
instruction travels through the pipeline just like the data

Instruction
R-format
lw
sw
beq

Execution/Address Calculation Memory access stage


stage control lines
control lines
Reg
ALU
ALU
ALU
Mem
Mem
Dst
Op1
Op0
Src Branch Read Write
1
1
0
0
0
0
0
0
0
0
1
0
1
0
X
0
0
1
0
0
1
X
0
1
0
1
0
0

Write-back
stage control
lines
Reg Mem to
write
Reg
1
0
1
1
0
X
0
X

WB
Instruction

IF/ID

Control

WB

EX

WB

ID/EX

EX/MEM

MEM/WB

EECC550 - Shaaban
#71 Final Review Winter 2000

2-19-2001

Basic Performance Issues In Pipelining


Pipelining increases the CPU instruction throughput:
The number of instructions completed per unit time.
Under ideal condition instruction throughput is one
instruction per machine cycle, or CPI = 1
Pipelining does not reduce the execution time of an
individual instruction: The time needed to complete all
processing steps of an instruction (also called instruction
completion latency).
It usually slightly increases the execution time of each
instruction over unpipelined implementations due to the
increased control overhead of the pipeline and pipeline
stage registers delays.
EECC550 - Shaaban
#72 Final Review Winter 2000

2-19-2001

Pipelining Performance Example


Example: For an unpipelined machine:
Clock cycle = 10ns, 4 cycles for ALU operations and branches and 5 cycles for
memory operations with instruction frequencies of 40%, 20% and 40%,
respectively.
If pipelining adds 1ns to the machine clock cycle then the speedup in
instruction execution from pipelining is:
Non-pipelined Average instruction execution time = Clock cycle x Average CPI

= 10 ns x ((40% + 20%) x 4 + 40%x 5) = 10 ns x 4.4 = 44 ns


In the pipelined five implementation five stages are used with
an average instruction execution time of: 10 ns + 1 ns = 11 ns
Speedup from pipelining = Instruction time unpipelined
Instruction time pipelined
= 44 ns / 11 ns = 4 times
EECC550 - Shaaban
#73 Final Review Winter 2000

2-19-2001

Pipeline Hazards
Hazards are situations in pipelining which prevent the next
instruction in the instruction stream from executing during
the designated clock cycle resulting in one or more stall cycles.
Hazards reduce the ideal speedup gained from pipelining and
are classified into three classes:

Structural hazards: Arise from hardware resource conflicts


when the available hardware cannot support all possible
combinations of instructions.
Data hazards: Arise when an instruction depends on the

results of a previous instruction in a way that is exposed by


the overlapping of instructions in the pipeline.

Control hazards: Arise from the pipelining of conditional


branches and other instructions that change the PC.
EECC550 - Shaaban
#74 Final Review Winter 2000

2-19-2001

Structural hazard Example:


Single Memory For Instructions & Data
Time (clock cycles)

Instr 4

Reg

Mem

Reg

Mem

Reg

Mem

Reg

Mem

Reg

Mem

Reg

Mem

Reg

ALU

Instr 3

Reg

ALU

Instr 2

Mem

Mem

ALU

Instr 1

Reg

ALU

O
r
d
e
r

Load

Mem

ALU

I
n
s
t
r.

Mem

Reg

Detection is easy in this case (right half highlight means read, left half write)

EECC550 - Shaaban
#75 Final Review Winter 2000

2-19-2001

Data Hazards Example

Problem with starting next instruction before first is


finished
Data dependencies here that go backward in time
create data hazards.

sub
and
or
add
sw

$2, $1, $3
$12, $2, $5
$13, $6, $2
$14, $2, $2
$15, 100($2)

Time (in clock cycles)


CC 1
Value of
register $2: 10

CC 2

CC 3

CC 4

CC 5

CC 6

CC 7

CC 8

CC 9

10

10

10

10/ 20

20

20

20

20

DM

Reg

Program
execution
order
(in instructions)
sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

IM

Reg

IM

DM

Reg

IM

DM

Reg

IM

Reg

DM

Reg

IM

Reg

Reg

Reg

DM

Reg

EECC550 - Shaaban
#76 Final Review Winter 2000

2-19-2001

Data Hazard Resolution: Stall Cycles


Stall the pipeline by a number of cycles.
The control unit must detect the need to insert stall cycles.
In this case two stall cycles are needed.
Time (in clock cycles)
CC 1
Value of
register $2: 10
Program
execution
order
(in instructions)
sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

IM

CC 2

CC 3

CC 4

CC 5

CC 6

CC 7

CC 8

CC 9

CC 10

CC 11

10

10

10

10/ 20

20

20

20

20

20

20

DM

Reg

DM

Reg

Reg

IM

STALL

STALL
Reg

STALL

STALL

IM

DM

Reg

IM

DM

Reg

IM

Reg

Reg

Reg

DM

Reg

EECC550 - Shaaban
#77 Final Review Winter 2000

2-19-2001

Data Hazard Resolution: Stall Cycles


Stall the pipeline by a number of cycles.
The control unit must detect the need to insert stall cycles.
In this case two stall cycles are needed.
Time (in clock cycles)
CC 1
Value of
register $2: 10
Program
execution
order
(in instructions)
sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

IM

CC 2

CC 3

CC 4

CC 5

CC 6

CC 7

CC 8

CC 9

CC 10

CC 11

10

10

10

10/ 20

20

20

20

20

20

20

DM

Reg

DM

Reg

Reg

IM

STALL

STALL
Reg

STALL

STALL

IM

DM

Reg

IM

DM

Reg

IM

Reg

Reg

Reg

DM

Reg

EECC550 - Shaaban
#78 Final Review Winter 2000

2-19-2001

Performance of Pipelines with Stalls


Hazards in pipelines may make it necessary to stall the
pipeline by one or more cycles and thus degrading
performance from the ideal CPI of 1.
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
If pipelining overhead is ignored and we assume that the stages are
perfectly balanced then:
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
When all instructions take the same number of cycles and is equal to
the number of pipeline stages then:
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

EECC550 - Shaaban
#79 Final Review Winter 2000

2-19-2001

Data Hazard Resolution: Forwarding


Register file forwarding to handle read/write to same register
ALU forwarding

EECC550 - Shaaban
#80 Final Review Winter 2000

2-19-2001

Data Hazard Example With Forwarding


Time (in clock cycles)
CC 1
Value of register $2 : 10
Value of EX/MEM : X
Value of MEM/WB : X

CC 2

CC 3

CC 4

CC 5

CC 6

CC 7

CC 8

CC 9

10
X
X

10
X
X

10
20
X

10/ 20
X
20

20
X
X

20
X
X

20
X
X

20
X
X

DM

Reg

Program
execution order
(in instructions)
sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

IM

Reg

IM

Reg

IM

DM

Reg

IM

Reg

DM

Reg

IM

Reg

DM

Reg

Reg

DM

Reg

EECC550 - Shaaban
#81 Final Review Winter 2000

2-19-2001

A Data Hazard Requiring A Stall


A load followed by an R-type instruction that uses the loaded value
Time (in clock cycles)
Program
CC 1
execution
order
(in instructions)
lw $2, 20($1)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

IM

CC 2

CC 3

Reg

IM

CC 4

CC 5

DM

Reg

Reg

IM

DM

Reg

IM

CC 6

CC 8

CC 9

Reg

DM

Reg

IM

CC 7

Reg

DM

Reg

Reg

DM

Reg

Even with forwarding in place a stall cycle is needed


This condition must be detected by hardware

EECC550 - Shaaban
#82 Final Review Winter 2000

2-19-2001

Compiler Scheduling Example


Reorder the instructions to avoid as many pipeline stalls as possible:
lw
$15, 0($2)
lw
$16, 4($2)
sw
$16, 0($2)
sw
$15, 4($2)
The data hazard occurs on register $16 between the second lw and the first sw
resulting in a stall cycle
With forwarding we need to find only one independent instructions to place
between them, swapping the lw instructions works:
lw
$15, 0($2)
lw
$16, 4($2)
sw
$15, 0($2)
sw
$16, 4($2)
Without forwarding we need three independent instructions to place between
them, so in addition two nops are added.
lw
$15, 0($2)
lw
$16, 4($2)
nop
nop
sw
$15, 0($2)
sw
$16, 4($2)

EECC550 - Shaaban
#83 Final Review Winter 2000

2-19-2001

Control Hazards: Example


Three other instructions are in the pipeline before branch
instruction target decision is made when BEQ is in MEM stage.
Time (in clock cycles)
Program
execution
CC 1
CC 2
order
(in instructions)
40 beq $1, $3, 7

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

IM

CC 3

Reg

IM

CC 4

CC 5

DM

Reg

Reg

IM

DM

Reg

IM

CC 6

CC 8

CC 9

Reg

DM

Reg

IM

CC 7

Reg

DM

Reg

Reg

DM

Reg

In the above diagram, we are predicting branch not taken


Need to add hardware for flushing the three following instructions if
we are wrong losing three cycles.

EECC550 - Shaaban
#84 Final Review Winter 2000

2-19-2001

Reducing Delay of Taken Branches

Next PC of a branch known in MEM stage: Costs three lost cycles if taken.
If next PC is known in EX stage, one cycle is saved.
Branch address calculation can be moved to ID stage using a register comparator, costing
only one cycle if branch is taken.
IF.Flush
H az ard
detection
unit
ID/EX

M
u
x

WB

C ontrol
0

M
u
x

IF/ID

WB

EX

MEM/WB
WB

Shift
left 2

R egisters
PC

EX/MEM

M
u
x

Instruc tion
m emory

Data
memory

ALU
M
u
x

M
u
x

Sign
extend

M
u
x
F orw arding
unit

EECC550 - Shaaban
#85 Final Review Winter 2000

2-19-2001

Pipeline Performance Example


Assume the following MIPS instruction mix:
Type
Arith/Logic
Load
Store
branch

Frequency
40%
30%
of which 25% are followed immediately by
an instruction using the loaded value
10%
20%
of which 45% are taken

What is the resulting CPI for the pipelined MIPS with


forwarding and branch address calculation in ID stage?
CPI = Ideal CPI + Pipeline stall clock cycles per instruction
=
=
=
=

1 +
1 +
1 +
1.165

stalls by loads + stalls by branches


.3x.25x1
+
.2 x .45x1
.075
+
.09

EECC550 - Shaaban
#86 Final Review Winter 2000

2-19-2001

Memory Hierarchy: Motivation


Processor-Memory (DRAM) Performance Gap

100

CPU

Processor-Memory
Performance Gap:
(grows 50% / year)

10
DRAM

Proc
60%/yr.

DRAM
7%/yr.

1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000

Performance

1000

EECC550 - Shaaban
#87 Final Review Winter 2000

2-19-2001

Processor-DRAM Performance Gap Impact:


Example
To illustrate the performance impact, assume a pipelined RISC
CPU with CPI = 1 using non-ideal memory.
Over an 10 year period, ignoring other factors, the cost of a full
memory access in terms of number of wasted instructions:

Year

CPU
speed
MHZ

1986:
1988:
1991:
1994:
1996:

8
33
75
200
300

CPU
cycle
ns

125
30
13.3
5
3.33

Memory
Access

Minimum CPU cycles or


instructions wasted

ns

190
175
155
130
100

190/125
175/30
155/13.3
130/5
110/3.33

= 1.5
= 5.8
= 11.65
= 26
= 33

EECC550 - Shaaban
#88 Final Review Winter 2000

2-19-2001

Memory Hierarchy: Motivation

The Principle Of Locality


Programs usually access a relatively small portion of their
address space (instructions/data) at any instant of time
(program working set).
Two Types of locality:
Temporal Locality: If an item is referenced, it will tend to
be referenced again soon.
Spatial locality: If an item is referenced, items whose
addresses are close will tend to be referenced soon.
The presence of locality in program behavior, makes it
possible to satisfy a large percentage of program access needs
using memory levels with much less capacity than program
address space.
EECC550 - Shaaban
#89 Final Review Winter 2000

2-19-2001

Levels of The Memory Hierarchy


Part of The On-chip
CPU Datapath
16-256 Registers
One or more levels (Static RAM):
Level 1: On-chip 16-64K
Level 2: On or Off-chip 128-512K
Level 3: Off-chip 128K-8M
Dynamic RAM (DRAM)
16M-16G
Interface:
SCSI, RAID,
IDE, 1394
4G-100G

Farther away from


The CPU
Lower Cost/Bit
Higher Capacity
Increased Access
Time/Latency
Lower Throughput

Registers

Cache
Main Memory

Magnetic Disc
Optical Disk or Magnetic Tape

EECC550 - Shaaban
#90 Final Review Winter 2000

2-19-2001

Cache Design & Operation Issues


Q1: Where can a block be placed cache?
(Block placement strategy & Cache organization)
Fully Associative, Set Associative, Direct Mapped

Q2: How is a block found if it is in cache?


(Block identification)
Tag/Block

Q3: Which block should be replaced on a miss?


(Block replacement)
Random, LRU

Q4: What happens on a write?


(Cache write policy)
Write through, write back

EECC550 - Shaaban
#91 Final Review Winter 2000

2-19-2001

Cache Organization Example

EECC550 - Shaaban
#92 Final Review Winter 2000

2-19-2001

Locating A Data Block in Cache


Each block frame in cache has an address tag.
The tags of every cache block that might contain the required data
are checked or searched in parallel.
A valid bit is added to the tag to indicate whether this entry contains
a valid address.
The address from the CPU to cache is divided into:
A block address, further divided into:
An index field to choose a block set in cache.
(no index field when fully associative).
A tag field to search and match addresses in the selected set.
A block offset to select the data from the block.

Block Address
Tag

Index

Block
Offset
EECC550 - Shaaban
#93 Final Review Winter 2000

2-19-2001

Address Field Sizes


Physical Address Generated by CPU

Block Address
Tag

Block
Offset

Index

Block offset size = log2(block size)

Index size = log2(Total number of blocks/associativity)

Tag size = address size - index size - offset size

EECC550 - Shaaban
#94 Final Review Winter 2000

2-19-2001

Four-Way Set Associative Cache:


MIPS Implementation Example
Addr ess

Tag
Field

31 3 0

12 11 10 9 8

22

Index

T ag

Data

3 2 1 0

Index
Field

Tag

Data

T ag

Data

T ag

Data

0
1
2
253
254
255
22

256 sets
1024 block frames

32

4- to- 1 m ultiplexo r

Hit

Da ta

EECC550 - Shaaban
#95 Final Review Winter 2000

2-19-2001

Cache Organization/Addressing Example


Given the following:
A single-level L1 cache with 128 cache block frames
Each block frame contains four words (16 bytes)
16-bit memory addresses to be cached (64K bytes main memory
or 4096 memory blocks)

Show the cache organization/mapping and cache


address fields for:
Fully Associative cache.
Direct mapped cache.
2-way set-associative cache.
EECC550 - Shaaban
#96 Final Review Winter 2000

2-19-2001

Cache Example: Fully Associative Case


V
V
All 128 tags must
be checked in parallel
by hardware to locate
a data block

Valid bit
V
Block Address = 12 bits
Tag = 12 bits

Block offset
= 4 bits

EECC550 - Shaaban
#97 Final Review Winter 2000

2-19-2001

Cache Example: Direct Mapped Case


V
V
V
Only a single tag must
be checked in parallel
to locate a data block

Valid bit
V

Block Address = 12 bits


Tag = 5 bits

Index = 7 bits

Block offset
= 4 bits

Main Memory

EECC550 - Shaaban
#98 Final Review Winter 2000

2-19-2001

Cache Example: 2-Way Set-Associative

Two tags in a set must


be checked in parallel
to locate a data block

Valid bits not shown


Block Address = 12 bits
Tag = 6 bits

Index = 6 bits

Block offset
= 4 bits

Main Memory

EECC550 - Shaaban
#99 Final Review Winter 2000

2-19-2001

Cache Replacement Policy


When a cache miss occurs the cache controller may have to
select a block of cache data to be removed from a cache block
frame and replaced with the requested data, such a block is
selected by one of two methods:
Random:
Any block is randomly selected for replacement providing
uniform allocation.
Simple to build in hardware.
The most widely used cache replacement strategy.
Least-recently used (LRU):
Accesses to blocks are recorded and and the block
replaced is the one that was not used for the longest period
of time.
LRU is expensive to implement, as the number of blocks
to be tracked increases, and is usually approximated.
EECC550 - Shaaban
#100 Final Review

Winter 2000

2-19-2001

Miss Rates for Caches with Different Size,


Associativity & Replacement Algorithm
Sample Data
Associativity:
Size
16 KB
64 KB
256 KB

2-way
LRU
Random
5.18% 5.69%
1.88% 2.01%
1.15% 1.17%

4-way
LRU Random
4.67% 5.29%
1.54% 1.66%
1.13% 1.13%

8-way
LRU
Random
4.39% 4.96%
1.39% 1.53%
1.12% 1.12%

EECC550 - Shaaban
#101 Final Review

Winter 2000

2-19-2001

Single Level Cache Performance


For a CPU with a single level (L1) of cache and no stalls for
cache hits:
With ideal memory
CPU time = (CPU execution clock cycles +
Memory stall clock cycles) x clock cycle time
Memory stall clock cycles =
(Reads x Read miss rate x Read miss penalty) +
(Writes x Write miss rate x Write miss penalty)
If write and read miss penalties are the same:
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

EECC550 - Shaaban
#102 Final Review

Winter 2000

2-19-2001

Single Level Cache Performance


CPUtime = IC x CPI x C
CPIexecution = CPI with ideal memory
CPI =

CPIexecution + Mem Stall cycles per instruction

CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x C


Mem Stall cycles per instruction = Mem accesses per instruction x Miss rate x Miss penalty

CPUtime = IC x (CPIexecution + Mem accesses per instruction x


Miss rate x Miss penalty ) x C
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x
C

EECC550 - Shaaban
#103 Final Review

Winter 2000

2-19-2001

Cache Performance Example


Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per cycle)
with a single level of cache.
CPIexecution = 1.1
Instruction mix: 50% arith/logic, 30% load/store, 20% control
Assume a cache miss rate of 1.5% and a miss penalty of 50 cycles.
CPI = CPIexecution + mem stalls per instruction
Mem Stalls per instruction =
Mem accesses per instruction x Miss rate x Miss penalty
Mem accesses per instruction = 1 + .3 = 1.3
Instruction fetch

Load/store

Mem Stalls per instruction = 1.3 x .015 x 50 = 0.975


CPI = 1.1 + .975 = 2.075
The ideal CPU with no misses is 2.075/1.1 = 1.88 times faster

EECC550 - Shaaban
#104 Final Review

Winter 2000

2-19-2001

Cache Performance Example


Suppose for the previous example we double the clock rate to
400 MHZ, how much faster is this machine, assuming similar
miss rate, instruction mix?
Since memory speed is not changed, the miss penalty takes
more CPU cycles:
Miss penalty = 50 x 2 = 100 cycles.
CPI = 1.1 + 1.3 x .015 x 100 = 1.1 + 1.95 = 3.05
Speedup = (CPIold x Cold)/ (CPInew x Cnew)
= 2.075 x 2 / 3.05 = 1.36
The new machine is only 1.36 times faster rather than 2
times faster due to the increased effect of cache misses.

CPUs with higher clock rate, have more cycles per cache miss and more
memory impact on CPI

EECC550 - Shaaban
#105 Final Review

Winter 2000

2-19-2001

3 Levels of Cache
CPU
L1 Cache
L2 Cache
L3 Cache

Hit Rate= H 1, Hit time = 1 cycle


Hit Rate= H 2, Hit time = T2 cycles

Hit Rate= H 3, Hit time = T3

Main Memory
Memory access penalty, M

EECC550 - Shaaban
#106 Final Review

Winter 2000

2-19-2001

3-Level Cache Performance


CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x C
Mem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access

For a system with 3 levels of cache, assuming no penalty


when found in L1 cache:
Stall cycles per memory access =
[miss rate L1] x [ Hit rate L2 x Hit time L2
+ Miss rate L2 x (Hit rate L3 x Hit time L 3
+ Miss rate L3 x Memory access penalty) ] =
[1 - H1] x [ H2 x T2
+ ( 1-H2 ) x (H3 x (T2 + T3)
+ (1 - H3) x M) ]
EECC550 - Shaaban
#107 Final Review

Winter 2000

2-19-2001

Three Level Cache Example

CPU with CPIexecution = 1.1 running at clock rate = 500 MHZ


1.3 memory accesses per instruction.
L1 cache operates at 500 MHZ with a miss rate of 5%
L2 cache operates at 250 MHZ with miss rate 3%, (T2 = 2 cycles)
L3 cache operates at 100 MHZ with miss rate 1.5%, (T3 = 5 cycles)

Memory access penalty, M= 100 cycles.

Find CPI.
With single L1, CPI = 1.1 + 1.3 x .05 x 100 = 7.6
CPI =

CPIexecution + Mem Stall cycles per instruction

Mem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access
Stall cycles per memory access = [1 - H1] x [ H 2 x T 2 + ( 1-H2 ) x (H 3 x (T2 + T3)
+ (1 - H 3) x M) ]
= [.05] x [ .97 x 2 + (.03) x ( .985 x (2+5)
+
.015 x 100)]
= .05 x [ 1.94 + .03 x ( 6.895 + 1.5) ]
= .05 x [ 1.94 + .274] = .11

CPI = 1.1 + 1.3 x .11 = 1.24


EECC550 - Shaaban
#108 Final Review

Winter 2000

2-19-2001

You might also like