
Computer Architecture CS1601

(Dr. Hari T.S. Narayanan)


Fundamentals of Computer Design

Unit 1
Contents
 Measuring and Reporting performance
 Quantitative principles of computer design
 Classifying instruction set architecture
 Memory addressing
 Addressing Modes
 Type and size of operands
 Operations in the instruction set
 Operands and operations for signal processing
 Instruction for control flow
 Encoding an instruction set
 Example architecture – MIPS and TM32.
Measuring and Reporting
Performance
 Response Time or Execution Time
 Throughput
 Benchmarks are used to measure
and compare performance of
different systems
Response Time Break up
 Response time
 Same as Elapsed time or Wall Clock
time
 Includes the following components
 User Time
 System Time or Privileged Time
 Waiting time and I/O time
 Finding User time in Windows
Choosing Programs to Evaluate
Performance

1. Real Applications (C compiler, Text Processing)
2. Modified (or Scripted) Applications
3. Kernels (Linpack, Livermore Loops)
4. Toy benchmarks (Quicksort, Sieve of Eratosthenes)
5. Synthetic benchmarks (Whetstone & Dhrystone)
Comparing & Summarizing Response
Time

Computer A B C
Program 1 1 10 20
Program 2 1000 100 20
Total 1001 110 40

 Total Execution Time


 Weighted Execution Time
 Normalized Execution Time with
benchmark
Problem
Computer A B C
P1 1 10 20
P2 1000 100 20
Total 1001 110 40

 Compare the performance for the following scenario:
 For every P2 there are 10 P1s
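The comparison can be checked with a short sketch (Python; the times come from the table above, the 10:1 weighting from the problem statement):

```python
# Execution times in seconds, from the table: (P1, P2)
times = {"A": (1, 1000), "B": (10, 100), "C": (20, 20)}

def workload_time(p1, p2, runs_p1=10, runs_p2=1):
    """Total time when every run of P2 comes with 10 runs of P1."""
    return runs_p1 * p1 + runs_p2 * p2

totals = {m: workload_time(p1, p2) for m, (p1, p2) in times.items()}
print(totals)  # {'A': 1010, 'B': 200, 'C': 220} -> B is fastest
```

Note how the weighting reverses the total-execution-time ranking, where C looked best.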
Some Related Terminologies
 Million Instructions Per Second (MIPS)
 FLOPS (Floating Point Operations Per Second)
 TeraFLOPS
Quantitative Principles
 Principle of Locality
 Make the common case fast
 Amdahl’s Law
 CPU Performance Equation
 Take advantage of parallelism
Principle of Locality
 Spatial Locality
 Temporal Locality
 80-20 Rule
 Applies to Instruction and Data
 More applicable to Instruction than data!
(Figure: instruction and data references plotted over address space and time; repeated add/sto instructions cluster in both dimensions)
Amdahl’s Law
 Performance Gain or Speedup
= Performance for entire task with enhancement
  / Performance for entire task without enhancement
OR
= Execution time of the entire task without enhancement
  / Execution time of the entire task with enhancement

Note: Performance ∝ 1/execution time


Enhancing the Execution time of a
task
 Each task is divided into two parts
 Part A – whose performance cannot be
improved/enhanced
 Part B – whose performance can be
improved/enhanced

 Amdahl’s law relies on this


classification
Amdahl’s Law - Example
Problem from Page 41 of your text book:

1. Total time for Normal Execution = 0.6+0.4 = 1.0 s

Fraction–enhanced = 0.4 s
Speedup-enhanced = 10 times
Total time for Enhanced execution = 0.6 + 0.4/10 s
= 0.64s

Speedup-overall = 1.0/0.64 ~ 1.56
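The worked example can be checked with a one-line sketch of Amdahl's Law (Python):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Speedup_overall = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.4, 10))  # 1.5625, i.e. the ~1.56 above
```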


Amdahl’s Law - Review
 Let Speedup-enhanced be n
 As we increase n, 0.4/n approaches 0
 Thus, the limit of the overall speedup
  = 1/(0.6 + 0.4/n) → 1/0.6 = 1/(1 − 0.4) ≈ 1.67
Short Problems
 A task with
 Part A – 0.9
 Part B – 0.1

 Another task with


 Part A – 0.1
 Part B – 0.9
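The two short problems can be answered with the limiting form of Amdahl's Law; as the lecture noted, the ceiling is 1/(1 − F), where F is the enhanceable fraction (Part B):

```python
def amdahl_speedup(f, s):
    # Overall speedup when fraction f is sped up by factor s.
    return 1.0 / ((1.0 - f) + f / s)

def speedup_limit(f):
    # Limiting speedup as s -> infinity: 1 / (1 - f)
    return 1.0 / (1.0 - f)

# Task 1: Part B = 0.1 -> ceiling of ~1.11x, no matter how fast Part B gets
# Task 2: Part B = 0.9 -> ceiling of 10x
print(speedup_limit(0.1), speedup_limit(0.9))
```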
The CPU Performance Equations

1. CPU Time for a program
   = CPU Clock Cycles x Clock Cycle Time

2. CPU time = CPU clock cycles for a program / Clock rate

3. CPIavg = CPU clock cycles for a program / Instruction Count

CPI – Clock Cycles Per Instruction
(Figure: a stream of n instructions — add, sto, … — each occupying clock cycles of 1 ns)

The CPU Performance Equations…

4. CPU Time
   = (Instructions/Program) x (Clock Cycles/Instruction) x (Seconds/Clock Cycle)
   = Seconds/Program
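Equation 4 above can be sketched directly (Python; the instruction count, CPI, and clock rate are made-up illustrative figures):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU time = (instructions/program) x (cycles/instruction) x (seconds/cycle)
    return instruction_count * cpi * (1.0 / clock_rate_hz)

# Hypothetical figures: 1e9 instructions at CPI 2 on a 1 GHz clock
print(cpu_time(1e9, 2.0, 1e9))  # 2.0 seconds
```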
Take Advantage of Parallelism
 Instruction level parallelism using
 Pipeline
 Memory Access using multiple
banks
 Parallel computation of two or more
possible outcomes
Assignment

1. Discuss why MIPS is not an


accurate measure for comparing
performance among computers.
2. Software Design Patterns
3. Problem 1.16 from your text book
4. Problem on p.44-45 – Do this
problem in the class without
referring to your book
Instruction Set Architecture

Unit I
Content
 Internal Storage (Reading Assignment)
 Memory Addressing (Reading Assignment)
 Addressing Mode for Signal Processing
 Types and Size of Operands
 Operands for Media and Signal Processing
 Operations in the Instruction Set
 Operations for Media and Signal Processing
 Instruction for Control Flow
 Encoding an Instruction Set
 Role of Compilers
Internal Storage
(Figure: four internal-storage organizations — stack (TOS), accumulator (ACC), register-memory, and register-register/load-store — each showing its ALU and its connections to memory)
Code Sequence with different Internal Storage

Stack     Accumulator   Register            Register
                        (register-memory)   (load-store)
Push A    Load A        Load R1,A           Load R1,A
Push B    Add B         Add R3,R1,B         Load R2,B
Add       Store C       Store R3,C          Add R3,R1,R2
Pop C                                       Store R3,C


Memory Operands and Register Operands

Number of    Maximum number    Type of           Example
memory       of operands       architecture      architectures
addresses    allowed
0            3                 Register-register Alpha, ARM, MIPS, PowerPC, SPARC, SuperH, Trimedia TM5200
1            2                 Register-memory   IBM 360/370, Intel 80x86, Motorola 68000, TI TMS320C54x
2            2                 Memory-memory     VAX (also has three-operand formats)
3            3                 Memory-memory     VAX (also has two-operand formats)
Comparison of Reg-Reg with Reg-Memory

Register-register (0,3)
Advantages:
1. Simple, fixed-length instruction encoding.
2. Simple code generation model.
3. Instructions take similar numbers of clocks to execute.
Disadvantages:
1. Higher instruction count than architectures with memory references in instructions.
2. More instructions and lower instruction density lead to larger programs.

Register-memory (1,2)
Advantages:
1. Data can be accessed without a separate load instruction.
2. Format tends to be easy to encode and yields good density.
Disadvantages:
1. Operands are not equivalent since a source operand in a binary operation is destroyed.
2. Encoding a register number and a memory address in each instruction may restrict the number of registers.
3. Clocks per instruction vary by operand location.
Memory Addressing
 Byte Order
 Big Endian
 Little Endian
 Network Byte Order (Big Endian)
 Addressing Modes
Little-endian and Big-endian
Value is 0x0201

Little Endian – least significant byte at the lowest address:
Address   Byte 0   Byte 1
Value     01       02

Big Endian – most significant byte at the lowest address:
Address   Byte 0   Byte 1
Value     02       01
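The byte layouts can be verified with Python's struct module, which exposes both byte orders:

```python
import struct

value = 0x0201
little = struct.pack("<H", value)  # least significant byte at lowest address
big = struct.pack(">H", value)     # most significant byte at lowest address

print(little.hex())  # 0102
print(big.hex())     # 0201
```

The `>` (big-endian) form is what "network byte order" means in practice.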
Alignment
 Alignment normally at
 2 Byte boundary
 4 Byte boundary
 Allowing misaligned access
 Optimizes the storage
 Makes the CPU logic more complex & slow
(Figure: byte addresses 4..0 showing a 4-byte integer and a char/int packed in memory)
Addressing Modes

Addressing Mode     Instruction       Meaning
Register            Add R4,R3         Regs[R4] ← Regs[R4] + Regs[R3]
Immediate           Add R4,#3         Regs[R4] ← Regs[R4] + 3
Displacement        Add R4,100(R1)    Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
Register indirect   Add R4,(R1)       Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Indexed             Add R3,(R1+R2)    Regs[R3] ← Regs[R3] + Mem[Regs[R1] + Regs[R2]]
Addressing Modes (Continued)

Addressing Mode      Instruction           Meaning
Direct or absolute   Add R4,(1001)         Regs[R4] ← Regs[R4] + Mem[1001]
Memory indirect      Add R4,@(R3)          Regs[R4] ← Regs[R4] + Mem[Mem[Regs[R3]]]
Auto-increment       Add R4,(R2)+          Regs[R4] ← Regs[R4] + Mem[Regs[R2]];
                                           Regs[R2] ← Regs[R2] + d
Auto-decrement       Add R1,-(R2)          Regs[R2] ← Regs[R2] − d;
                                           Regs[R1] ← Regs[R1] + Mem[Regs[R2]]
Scaled               Add R1,100(R2)[R3]    Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3]*d]
Type and Size of Operands
 Fixed Size – rarely used
 Encoded in the operation code
 Tag – rarely used
 Data widths - 16, 24, 32
 Accumulator width - 32, 64, 56, 40
 Integers – 8 bits, 16 bits, 32 bits, FP
Floating Point Number Representation

(Single precision: 8 exponent bits, 23 mantissa bits)

Type                    Exponent      Mantissa
Zeros                   0             0
Denormalized Numbers    0             non-zero
Normalized Numbers      1 to 2^8 − 2  any
Infinities              2^8 − 1       0
NaNs                    2^8 − 1       non-zero
Encoding −118.625 using the IEEE 754 system.

1. Negative number, so set the sign bit to "1".
2. Write the number using binary notation. The result is 1110110.101.
3. Move the radix point left, leaving only a 1 at its left:
   1110110.101 = 1.110110101 × 2^6.
   This is a normalized floating point number.
4. Compute the exponent field – the bias is 127, so 6 + 127 = 133.
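The hand encoding can be cross-checked against Python's struct module, which packs IEEE 754 singles:

```python
import struct

# Pack -118.625 as an IEEE 754 single and pull out the bit fields.
bits = int.from_bytes(struct.pack(">f", -118.625), "big")
sign = bits >> 31
exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF

print(sign, exponent)  # 1 133  -- matching steps 1 and 4 above
print(hex(bits))       # 0xc2ed4000
```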
Terminologies
 Precision – Resolution
 Round-off Error
 Radix – Exponent
 Mantissa – Significand
 Normalization
 Denormals
 Banker’s rounding
64-bit floating point representation

Double Precision Floating Point Representation
(1 sign bit, 11 exponent bits, 52 mantissa bits; exponent bias 1023)
Operations - grouping

Operator Type            Examples
Arithmetic and logical   Integer arithmetic and logical operations: add, subtract, and, or, multiply, divide
Data Transfer            Load-stores (move instructions on computers with memory addressing)
Control                  Branch, jump, procedure call and return, traps
System                   Operating system call, virtual memory management instructions
Floating Point           Floating-point operations: add, multiply, divide, compare
Decimal                  Decimal add, decimal multiply, decimal-to-character conversion
String                   String move, string compare, string search
Graphics                 Pixel and vertex operations, compression/decompression operations
Top 10 Frequently used instructions

Rank   80x86 instruction        Integer average (% total executed)
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
Total                           96%
Instructions for Control Flow
 Four different types of control flow
changes:
 Conditional Branches
 Jumps
 Procedure calls
 Procedure returns
 Addressing modes for these
changes
Three types of Instruction Encoding

(a) Variable (e.g. VAX, Intel 80x86):
Operation and no. of operands | Address specifier 1 | Address field 1 | … | Address specifier n | Address field n

(b) Fixed (e.g. Alpha, ARM, MIPS, PowerPC, SPARC, SuperH):
Operation | Address field 1 | Address field 2 | Address field 3

(c) Hybrid (e.g. IBM 360/370, MIPS16, Thumb):
Operation | Address specifier | Address field
Operation | Address specifier 1 | Address specifier 2 | Address field
Operation | Address specifier | Address field 1 | Address field 2
Topics to prepare and present
1. RISC versus CISC
2. IEEE 754 Floating point standard
3. Floating point Arithmetic
4. DEC VAX Assembler Instruction Set
5. Benchmarking - Whetstone, Dhrystone,
Linpack, SPEC, & TPC
6. Stack based versus Register
architecture
7. Problem – Example on page 106
Instruction Level Parallelism

Unit 2
What is Instruction-level parallelism (ILP)?
 Overlapping the execution of instructions
(Figure: a sequence of assembled-code instructions whose executions overlap in time)
Pipeline - Overlap
 Compare &
contrast
1. Serving an
individual to
completion
2. Serving in a
pipeline as
shown above

 Second case:
 Time to serve an individual

 Time to serve 100 people


What is pipelining?
Pipelining – Advantage, Speedup, Effect, &
Balance

 Instruction Fetch (IF)


 Instruction Decode (ID)
 Execution (EX)
 Memory Access/Data Access (MEM/DA)
 Write-back Cycle (WB)
Classic Five-stage Pipeline for a RISC m/c

Clock number          1    2    3    4    5    6    7    8    9
Instruction i         IF   ID   EX   MEM  WB
Instruction i+1            IF   ID   EX   MEM  WB
Instruction i+2                 IF   ID   EX   MEM  WB
Instruction i+3                      IF   ID   EX   MEM  WB
Instruction i+4                           IF   ID   EX   MEM  WB
Pipelining Data Paths
(Figure: program execution order vs. time in clock cycles CC1–CC9; each instruction flows through IM, Reg, ALU, DM, Reg, with successive instructions offset by one cycle)
Basic Performance Issues in
Pipelining
 Pipelining slows down individual
instructions.
 But increases the speed of
execution of your program!
Problem 1
 Consider the un-pipelined processor section.
Assume that it has a 1 ns clock cycle and that it
uses 4 cycles for ALU operations and branches and
5 cycles for memory operations. Assume that the
relative frequencies of these operations are
40%,20% and 40% respectively. Suppose that due
to clock skew and setup, pipelining the processor
adds 0.2 ns of overhead to the clock. Ignoring any
latency impact, how many speedup in the
instruction execution rate will we gain from a
pipeline?

 Assume 5-stage pipeline
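Problem 1 can be worked as a sketch: the unpipelined machine averages the cycle counts by frequency, while the pipelined machine completes one instruction per (stretched) clock:

```python
# Problem 1: average instruction times before and after pipelining.
freqs = {"alu": 0.40, "branch": 0.20, "mem": 0.40}
cycles = {"alu": 4, "branch": 4, "mem": 5}
clock_ns = 1.0

avg_unpipelined = sum(freqs[k] * cycles[k] for k in freqs) * clock_ns  # 4.4 ns
avg_pipelined = clock_ns + 0.2  # one instruction/cycle, plus skew/setup overhead

speedup = avg_unpipelined / avg_pipelined
print(round(speedup, 2))  # ~3.67
```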


Problem 2
 Consider the un-pipelined processor section.
Assume that it has a 2 GHz clock and that it uses 4
cycles for ALU operations, 3 cycles for branches,
and 5 cycles for memory operations. Assume that
the relative frequencies of these operations are
50%, 25% and 25% respectively. Suppose that due
to clock skew and setup, pipelining forces the
processor to work at half the clock rate. Ignoring any
latency impact, how much speedup in the
instruction execution rate will we gain from a
pipeline?

 Assume 5-stage pipeline


The Hazards of Pipelining
 What is a pipeline Hazard?
 What is a pipeline Stall
 Types of Hazards
 Structural
 Data
 Control
Effect of Stalls on Pipeline Performance

Speedup from pipelining
= Avg instruction time unpipelined / Avg instruction time pipelined
= (CPI unpipelined × Clock period unpipelined) / (CPI pipelined × Clock period pipelined)
= (CPI unpipelined / CPI pipelined) × (Clock period unpipelined / Clock period pipelined)
Speedup due to Improved CPI

Speedup = (CPI unpipelined / CPI pipelined) × (Clock period unpipelined / Clock period pipelined)

CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
              = 1 + Pipeline stall cycles per instruction

With equal clock periods and CPI unpipelined equal to the pipeline depth:

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
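The stall formula is a one-liner; the depth and stall rates below are illustrative values, not measurements:

```python
def pipeline_speedup(pipeline_depth, stalls_per_instruction):
    # Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)
    return pipeline_depth / (1.0 + stalls_per_instruction)

print(pipeline_speedup(5, 0.0))  # 5.0: ideal, no stalls
print(pipeline_speedup(5, 1.0))  # 2.5: one stall cycle per instruction halves it
```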
Pipeline Implementation with MIPS – IF Cycle

 Instruction Fetch Cycle

IR ← Mem[PC];
NPC ← PC + 4;
Pipeline Implementation with MIPS – ID Cycle

 Instruction decode/register fetch cycle (ID)

A ← Regs[rs];
B ← Regs[rt];
Imm ← sign-extended immediate field of IR;
Execution/effective address cycle (EX)
 The ALU operates on the operands prepared in the
prior cycle, performing one of four functions
depending on the MIPS instruction type.
 Memory Reference:
ALUOutput ← A + Imm;
 Register-Register ALU instruction:
ALUOutput ← A func B;
 Register-Immediate ALU instruction:
ALUOutput ← A op Imm;
 Branch:
ALUOutput ← NPC + (Imm << 2);
Cond ← (A == 0)
Pipeline Implementation with MIPS – Memory Access Cycle

 Memory access/branch completion cycle (MEM)

The PC is updated for all instructions: PC ← NPC;

 Memory reference:
LMD ← Mem[ALUOutput] or
Mem[ALUOutput] ← B;
 Branch:
if (cond) PC ← ALUOutput; # the target computed from NPC
Pipeline Implementation with MIPS – Write Back

 Register-Register ALU instruction:
Regs[rd] ← ALUOutput;
 Register-Immediate ALU instruction:
Regs[rt] ← ALUOutput;
 Load instruction:
Regs[rt] ← LMD;
What is Instruction-level parallelism
(ILP)?
 Overlapping the execution of
instructions
(Figure: a sequence of assembled-code instructions whose executions overlap in time)
Instruction Level Parallelism
 Basic block
 Parallelism within basic block
 Limitation of parallelism within …
 Parallelism across basic block
 Limitation of parallelism across …
(Figure: a basic block — Inst 1 through Inst 4 — with a single entry and a single exit)
MIPS Program Characteristics
(Figure: a 100-instruction listing in which branches — BE at 5, BNEZ at 10, JUMP at 92, BNE at 97 — appear every few instructions, bounding the basic blocks)
Vector Processors – Parallelism
across …

for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];

 Vector processors don’t scale well for different problem sizes!
 ILPs are different from vector processors
Data Dependences and ILP
 Two independent instructions can
be completely overlapped
 Two dependent instructions cannot
be completely overlapped
Relationship between …
 Data Dependence
 Hazard
 Stall
(Figure: a chain of instructions Inst 1–Inst 6 in which a dependence becomes a hazard, and the hazard forces a stall)
Data Dependences

 Three types of data


dependences
 True dependence
 Name dependence

 Control dependence
True Dependence
 Instruction i produces a result that
may be used by instruction j or
 Instruction j is data dependent on
instruction k, and instruction k is
data dependent on instruction i

Output from instruction i -> input to


instruction j
Code Sequence with Data
Dependence

Loop: L.D F0, 0(R1) ;


ADD.D F4, F0,F2;
S.D F4, 0(R1);
DADDUI R1, R1,#-8;
BNE R1,R2, Loop;
Data Dependence can be addressed

1. Maintaining the dependence but


avoiding a hazard
2. Eliminating the dependence by
transforming the code
Name Dependence
 Anti-dependence: When instruction j
writes to a register or memory
location that instruction i reads.
 Output dependence: Instruction i
and instruction j write to the same
register or memory location.
Control Dependence

if p1 {
S1;
}
if p2 {
S2;
}
Data Hazards
 RAW (Read After Write)
 WAW (Write After Write)
 WAR (Write After Read)
 RAR (Read After Read)
Read After Write - RAW
 Wrong execution: j tries to read a
source before i writes to it.
 Arises from true dependence
 Example:

Load R1, A

Add R1, R2, R1
Write After Write
 Arises from Output Dependence
 Correct order between writes to be
maintained
 Appears in only those pipelines
where there are multiple write stages
 Example:
 DMUL F4, F3, F4
 DLOAD F4, A
Write After Read
 Arises from Anti-dependence
 An instruction writes to a
destination before it is read by
another instruction, thus making the
latter read the new (incorrect) value
 Does not occur in static pipelines
Read After Read
 Not a hazard!
Dynamic Scheduling
 Summarize: Data Dependence, & Structural
Dependence
 Executing instructions in non-sequential
form
 Example:
L    R1, 0(R2)
SUB  R1, R1, R3
BEQZ R1, Same
S    R1, 0(R2)
JUMP End
Same: S R1, 0(R4)
Control Dependence

if p1 {
    S1;
}
if p2 {
    S2;
}

Example:
ADD R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)
RET
L1:
ADD R1, R2, R3
Two Constraints

 A control dependent instruction cannot be
moved before the branch
 An instruction that is not control
dependent cannot be converted to
control dependent.

ADD R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)   ← cannot be moved before the branch
RET            ← cannot be moved before the branch
L1:
ADD R1, R2, R3
Preserving Control & Data
Dependence

ADD R2, R3, R4


BEQZ R2, L1
LW R1, 0(R2)
L1:
Alternative to Control & Data
Dependence
 Sequential execution of a program guarantees
its intended correctness
 Preserving control and data dependence
guarantees correctness in dynamic scheduling
 Preserving two dependences is an easier way
to guarantee correctness
 But not the best way in terms of better
throughput!
 Preserving exception behavior and data flow
guarantees correctness and offers better
throughput!
What is Exception Behavior?

ADD R2, R3, R4


BEQZ R2, L1
LW R1, 0(R2)
RET
L1:
ADD R1, R2, R3
What is Data Flow?

ADD R1, R2, R3


BEQZ R4, L
SUB R1, R5, R6
L:
OR R7, R1, R8
To prove control dependence is not
critical

ADD R1, R2, R3


BEQZ R12, Skip
DSUB R4, R5, R6
ADD R5, R4, R9
Skip: OR R7, R8, R9
Scheduling & Speculation

 Executing DSUB R4, R5, R6 ahead of


BEQZ.
 Scheduling: reordering the
instructions to improve throughput.
 Speculation: Indicates scheduling
instructions ahead of their order
based on prediction
Dynamic Scheduling vs Static
Scheduling
 Static Scheduling
 Decision at Compile time
 No change to hardware
 Limited to compile time data
 Dynamic scheduling
 Decision at Run time
 Complex hardware
 Better throughput
Limitations of Static Scheduling
 Unnecessary Stalling of non-
dependent instructions

DIV.D F0, F2, F4


ADD.D F10, F0, F8
SUB.D F12, F8, F14
Issues with Dynamic Scheduling
 Hazards - WAR and WAW
 Exception handling
DIV.D F0, F2, F4
ADD.D F6, F0, F8    ; WAR on F8 with the SUB.D below
SUB.D F8, F10, F14  ; WAW on F6 between ADD.D and MUL.D
MUL.D F6, F10, F8
Simple 5-stage Pipeline with Integer ALU

IF → ID → EX → MEM → WB
(Instructions enter at IF and leave after WB; the integer ALU sits in EX)
5-Stage Pipeline with looped FP FUs
(Figure: IF and ID feed four execution units — integer unit, FP MUL, FP adder, FP DIV — whose EX stages are not pipelined; results rejoin at MEM and WB)
5-Stage Pipeline with pipelined FP units
(Figure: same structure, but the FP MUL, FP adder, and FP DIV EX stages are themselves pipelined)
5-Stage Pipeline with pipelined multiple FP units
(Figure: same structure, with multiple copies of each pipelined functional unit — integer units, FP MULs, FP adders, FP DIVs)
Hazards in Longer Latency Pipelines
 Divide is not fully pipelined; this can lead to a structural hazard.
 Instructions have varying running times, so the number of register
writes required in a cycle can exceed 1
 Instructions can complete in a different order than they were issued
 WAW hazard possible
 WAR hazard less likely
 Stalls due to RAW hazards will be more frequent because of the
longer latencies
Dynamic Scheduling Stages
 Dynamic scheduling implies out-of-order
execution
 To allow this we split ID pipe stage of 5-stage
pipe line into two stages
 Issue: Decode instructions, check for
structural hazards.
 Read operands: Wait until no data hazards,
then read operands
IF → Issue → RO → EX → MEM → WB
(Issue and RO together replace the old ID stage; instructions enter at IF and leave after WB)
Variations and other characteristics

 Instruction is fetched into a register Or


 Instruction is fetched into a queue
 Execution may take different number of
cycles depending upon the instruction
 EX stage follows read operand stage
 Multiple instructions in EX stage at the
same time using multiple FUs
 Stall and Bypass
Dynamic Scheduling using
scoreboard

 All instructions come to the issue stage in


order (set by the compiler)
 They can stall or bypass each other in the
read operand stage and enter execution
out-of-order.
 Scoreboarding is one technique for scheduling
instructions out-of-order; Tomasulo’s
algorithm is another.
Scoreboard …
 Objective: to execute one instruction per clock cycle
 From CDC 6600
 Recall the WAR hazard:

DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F8, F8, F14

IF → Issue → RO → EX → MEM → WB
(Instructions enter at IF and leave after WB)
Scoreboard content

 Instruction Status
 Indicates the current stage for an
instruction
 Functional Unit Status
 Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
 Register Result Status
 Indicates which functional unit will write
each register
Scoreboard Building blocks
(Figure: the register file connected to several functional units (FUs); the scoreboard exchanges control/status signals with the registers and with every FU)
Tomasulo’s Approach to Dynamic
Scheduling
 Addresses WAR and WAW hazards
 Based on renaming of registers

Before renaming:        After renaming (S, T are temporaries):
DIV.D F0, F2, F4        DIV.D F0, F2, F4
ADD.D F6, F0, F8        ADD.D S, F0, F8
S.D   F6, 0(R1)         S.D   S, 0(R1)
SUB.D F8, F10, F14      SUB.D T, F10, F14
MUL.D F6, F10, F8       MUL.D F6, F10, T
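Renaming can be sketched in a few lines. This simplification gives *every* write a fresh temporary (the slide renames only the two registers involved in the hazards), but the effect is the same: WAR and WAW name dependences disappear while true (RAW) dependences are preserved:

```python
def rename(instructions):
    """Give every register write a fresh temporary; readers always see
    the most recent name for a register."""
    latest, out = {}, []
    for n, (op, dst, *srcs) in enumerate(instructions):
        srcs = [latest.get(r, r) for r in srcs]  # read the current name
        new = f"t{n}"                            # fresh name per write
        latest[dst] = new
        out.append((op, new, *srcs))
    return out

code = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]
for inst in rename(code):
    print(inst)
```

After renaming, ADD.D still reads the old F8 while SUB.D writes a new name (WAR gone), and ADD.D and MUL.D write different names (WAW gone).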
3-Stages of Tomasulo’s Pipeline
 Issue
 Execute
 Write Result
Basic Structure of Tomasulo’s
Algorithm
(Figure: the instruction queue feeds FP operations to the reservation stations and load-store operations to an address unit with load and store buffers; reservation stations feed the FP adders and FP multipliers; all results return on the Common Data Bus to the FP registers, reservation stations, and store buffers)
Advantages of Tomasulo’s algorithm
 Distribution of Hazard detection logic
 No head of the queue blocking –
eliminates stalls due to WAW and WAR
hazards
 It can also eliminate name dependences
that cannot be eliminated by a compiler
 Common Data Bus (CDB) makes it easier
to copy one destination result to multiple
source registers
Reservation Station – Data Fields

 Instruction Status
 Indicates the current stage for an
instruction
 Reservation Station Status
 Busy, Op, Vj, Vk, Qj, Qk, A
 Register Status
 Indicates which functional unit will write to
each register
 Show what information tables look like
for the following code sequence when
only the first load has completed and
written its result:
1. L.D F6,34(R2)
2. L.D F2,45(R3)
3. MUL.D F0,F2,F4
4. SUB.D F8,F2,F6
5. DIV.D F10,F0,F6

6. ADD.D F6,F8,F2     
Example

Instruction status:
Instruction        Issue   Execute   Write Result
L.D F6,34(R2)      √       √         √
L.D F2,45(R3)      √       √
MUL.D F0,F2,F4     √
SUB.D F8,F2,F6     √
DIV.D F10,F0,F6    √
ADD.D F6,F8,F2     √

Reservation stations:
Name    Busy   Op     Vj   Vk                 Qj      Qk      A
Load1   no
Load2   yes    Load                                           45 + Regs[R3]
Add1    yes    SUB         Mem[34+Regs[R2]]   Load2
Add2    yes    ADD                            Add1    Load2
Add3    no
Mult1   yes    MUL         Regs[F4]           Load2
Mult2   yes    DIV         Mem[34+Regs[R2]]   Mult1

Register status:
Field   F0      F2      F6     F8     F10
Qi      Mult1   Load2   Add2   Add1   Mult2
Steps in Algorithm
 To understand the full power of eliminating WAW and WAR
hazards through dynamic renaming of registers, we must look at
a loop. Consider the following simple sequence for multiplying
the elements of an array by a scalar in F2:

LOOP: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, LOOP   ; branches if R1 ≠ R2
Steps in Algorithm

Instruction state: Issue

FP operation — wait until station r is empty:
  if (RegisterStat[rs].Qi ≠ 0)
    {RS[r].Qj ← RegisterStat[rs].Qi}
  else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
  if (RegisterStat[rt].Qi ≠ 0)
    {RS[r].Qk ← RegisterStat[rt].Qi}
  else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0};
  RS[r].Busy ← yes; RegisterStat[rd].Qi ← r;

Load or store — wait until buffer r is empty:
  if (RegisterStat[rs].Qi ≠ 0)
    {RS[r].Qj ← RegisterStat[rs].Qi}
  else {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0};
  RS[r].A ← imm; RS[r].Busy ← yes;

Load only:
  RegisterStat[rt].Qi ← r;

Store only:
  if (RegisterStat[rt].Qi ≠ 0)
    {RS[r].Qk ← RegisterStat[rt].Qi}
  else {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0};
Steps in Algorithm

Instruction state: Execute

FP operation — wait until RS[r].Qj = 0 and RS[r].Qk = 0:
  compute the result using the operands in Vj and Vk.

Load/store step 1 — wait until RS[r].Qj = 0 and r is head of the load-store queue:
  RS[r].A ← RS[r].Vj + RS[r].A;

Load step 2 — wait until load step 1 is complete:
  read from Mem[RS[r].A].

Instruction state: Write result

FP operation or load — wait until execution completes at r and the CDB is available:
  ∀x (if (RegisterStat[x].Qi = r) {Regs[x] ← result; RegisterStat[x].Qi ← 0});
  ∀x (if (RS[x].Qj = r) {RS[x].Vj ← result; RS[x].Qj ← 0});
  ∀x (if (RS[x].Qk = r) {RS[x].Vk ← result; RS[x].Qk ← 0});
  RS[r].Busy ← no;

Store — wait until execution completes at r and RS[r].Qk = 0:
  Mem[RS[r].A] ← RS[r].Vk; RS[r].Busy ← no;
Reducing Branch Cost
 Control hazard can cause a greater loss for MIPS pipeline
than data hazards.
 Branch Prediction is the mechanism to resolve the
outcome of a branch early, thus preventing control
dependences from causing stalls.
 Based on the fact that more often branches are
associated with loops
 Two types of branch prediction
 Static prediction
 Dynamic prediction
Terminologies related to Branch
Prediction
 Control or branch Hazard
 Taken (if a branch changes the PC to its target address)
 Not taken or untaken (otherwise)
 Freeze or flush
 Predicted-not-taken

ADD R1, R2, R3


BEQZ R4, L
SUB R1, R5, R6
ST R1, A
L:
OR R7, R1, R8
Branch Taken or Untaken

High-level program segment:
for (j = 0; j < 10; j++)
    for-body …
end-for
nextinst

Assembly program segment:
RET:      check & branch to for-loop
          nextinst
for-loop: …
          JUMP RET
Basic Branch Prediction – Prediction buffer
 Using a Branch History Table (prediction buffer) indexed by branch address:

Branch Address   Predict bit
baddr1           1
baddr2           0
baddr3           1

 With a 2-bit saturating counter, states 11 and 10 predict taken and states 01 and 00 predict untaken; a taken branch moves the counter toward 11 and an untaken branch toward 00, so a prediction must miss twice before it flips.
(Figure: the four-state transition diagram; the slide quotes roughly 80% and 90% prediction accuracy)
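The 2-bit counter in the state diagram can be sketched directly (a minimal model of one table entry, not of a full branch history table):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 3,2 predict taken; 1,0 predict not taken."""

    def __init__(self, state=3):  # start in "strongly taken"
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends of the 0..3 range.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
p.update(False)     # one miss...
print(p.predict())  # True  - still predicts taken
p.update(False)     # ...a second miss flips the prediction
print(p.predict())  # False
```

This is why a loop-closing branch mispredicts only once per loop exit instead of twice, the main win over a 1-bit scheme.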
Static Branch Prediction
 Useful when branch behavior is
highly predictable at compile time
 It can also assist dynamic predictors
 Loop unrolling is an example of this
Branch Causing 3-Cycle Stall

                       Time →
Branch instruction     IF   ID   EX   MEM   WB
Branch successor            IF   IF   IF    IF    ID   EX
Branch successor + 1                              IF   ID
Branch successor + 2                                   IF

ADD R1, R2, R3
BEQZ R4, L
SUB R1, R5, R6
ST R1, A
L:
OR R7, R1, R8
Reducing pipeline branch penalties –
Compile time schemes

 Freeze or Flush
 Assume that every branch is not taken
 Assume that every branch is taken
 Delayed branch
Delayed Branch
 Code segment
branch instruction
sequential successor

Loop: branch target if taken
 Branch delay slot
 Instruction in branch delay slot is always
executed
 Compiler can schedule the instruction(s)
for branch delay slot so that its execution
is useful even when the prediction is
wrong
Branch Delay of 1

Not Taken:
Untaken branch instruction       IF   ID   EX   MEM   WB
Branch delay instruction (j+1)        IF   ID   EX    MEM   WB
Instruction j+2                            IF   ID    EX    MEM   WB
Instruction j+3                                 IF    ID    EX    MEM   WB
Instruction j+4                                       IF    ID    EX    MEM   WB

Taken:
Taken branch instruction         IF   ID   EX   MEM   WB
Branch delay instruction (j+1)        IF   ID   EX    MEM   WB
Branch target                              IF   ID    EX    MEM   WB
Branch target + 1                               IF    ID    EX    MEM   WB
Branch target + 2                                     IF    ID    EX    MEM   WB
Scheduling the branch delay slot

(a) From before:
    DADD R1, R2, R3
    if R2 = 0 then
      [delay slot]
becomes
    if R2 = 0 then
      DADD R1, R2, R3

(b) From target:
    DSUB R4, R5, R6
    …
    DADD R1, R2, R3
    if R1 = 0 then
      [delay slot]
becomes
    DADD R1, R2, R3
    if R1 = 0 then
      DSUB R4, R5, R6

(c) From fall-through:
    DADD R1, R2, R3
    if R1 = 0 then
      [delay slot]
    OR R7, R8, R9
    DSUB R4, R5, R6
becomes
    DADD R1, R2, R3
    if R1 = 0 then
      OR R7, R8, R9
    DSUB R4, R5, R6

(a) is the best choice.
Branch Prediction – Correlating Prediction
 Prediction based on correlation

Source:
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { … }

Assembly:
     SUB  R3, R1, #2
     BNEZ R3, L1      ; branch b1 (aa != 2)
     ADD  R1, R0, R0  ; aa = 0
L1:  SUB  R3, R2, #2
     BNEZ R3, L2      ; branch b2 (bb != 2)
     ADD  R2, R0, R0  ; bb = 0
L2:  SUB  R3, R1, R2  ; R3 = aa - bb
     BEQZ R3, L3      ; branch b3 (aa == bb)

 If both the first and second branches are
not taken, then the last branch will be
taken
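The correlation can be checked by modeling the source fragment directly: when b1 and b2 both fall through, b3's outcome is fully determined:

```python
def b3_taken(aa, bb):
    # b1 not taken (aa == 2) executes aa = 0; b2 not taken (bb == 2) executes bb = 0.
    if aa == 2:
        aa = 0
    if bb == 2:
        bb = 0
    return aa == bb  # b3 (BEQZ) is taken when aa == bb

# If b1 and b2 are both not taken, b3 is always taken:
print(b3_taken(2, 2))  # True
```

A correlating (two-level) predictor exploits exactly this: it indexes the prediction by the outcomes of the preceding branches, which no per-branch counter can do.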
ILP

Software based approaches


Basic Compiler Technique
 Basic Pipeline Scheduling and Loop Unrolling

Source:
for (j = 1000; j > 0; j--)
    x[j] = x[j] + s;

MIPS code:
L:  L.D   F0, 0(R1)    ; F0 = array element
    ADD.D F4, F0, F2   ; add scalar in F2
    S.D   F4, 0(R1)    ; store result
    ADDI  R1, R1, #-8  ; decrement pointer
    BNE   R1, R2, L    ; branch if R1 != R2
Latencies of FP operations used

Instruction Producing Result   Instruction Using Result   Latency in Clock Cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

Assume an integer load latency of 1 and an integer ALU operation latency of 0.

Where do stalls appear?
L:  L.D   F0, 0(R1)    ; F0 = array element  ?
    ADD.D F4, F0, F2   ; add scalar in F2
    S.D   F4, 0(R1)    ; store result        ?
    ADDI  R1, R1, #-8  ; decrement pointer
    BNE   R1, R2, L    ; branch if R1 != R2  ?
Without Any Scheduling

                        Clock cycle issued
L:  L.D   F0, 0(R1)     1
    stall               2   ; NOP
    ADD.D F4, F0, F2    3
    stall               4   ; NOP
    stall               5   ; NOP
    S.D   F4, 0(R1)     6
    ADDI  R1, R1, #-8   7
    stall               8   ; note latency of 1 between ALU op and BNE
    BNE   R1, R2, L     9
    stall               10  ; NOP

10 clock cycles/element; 7 of them (ADDI, BNE, stalls) are loop overhead.

Assume an integer load latency of 1 and an integer ALU operation latency of 0.
With Scheduling

(Latencies as in the table above: FP ALU op → FP ALU op 3, FP ALU op → store double 2, load double → FP ALU op 1, load double → store double 0.)

                        Clock cycle issued
L:  L.D   F0, 0(R1)     1
    ADDI  R1, R1, #-8   2
    ADD.D F4, F0, F2    3
    stall               4
    BNE   R1, R2, L     5
    S.D   F4, 8(R1)     6   ; store moved into the branch delay slot

6 clock cycles/element; 3 of them (ADDI, BNE, stall) are loop overhead.
With Unrolling

L:  L.D   F0, 0(R1)
    L.D   F6, -8(R1)
    L.D   F10, -16(R1)
    L.D   F14, -24(R1)
    ADD.D F4, F0, F2
    ADD.D F8, F6, F2
    ADD.D F12, F10, F2
    ADD.D F16, F14, F2
    S.D   F4, 0(R1)
    S.D   F8, -8(R1)
    S.D   F12, -16(R1)
    S.D   F16, -24(R1)
    ADDI  R1, R1, #-32
    BNE   R1, R2, L

14 clock cycles/4 elements;
2 (ADDI, BNE) are loop overhead
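Using the per-loop cycle counts quoted on these slides (taken as given, not re-derived), the per-element costs compare as follows:

```python
# Cycles per element for each version of the loop, using the slides' counts.
versions = {
    "no scheduling": 10 / 1,
    "scheduled":      6 / 1,
    "unrolled x4":   14 / 4,
}
for name, cpe in sorted(versions.items(), key=lambda kv: kv[1]):
    print(f"{name}: {cpe} cycles/element")
```

Unrolling wins because the fixed ADDI/BNE overhead is amortized over four elements instead of one.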
With Unrolling & Scheduling

L:  L.D   F0, 0(R1)
    L.D   F6, -8(R1)
    L.D   F10, -16(R1)
    L.D   F14, -24(R1)
    L.D   F18, -32(R1)
    ADD.D F4, F0, F2
    ADD.D F8, F6, F2
    ADD.D F12, F10, F2
    ADD.D F16, F14, F2
    ADD.D F20, F18, F2
    S.D   F4, 0(R1)
    S.D   F8, -8(R1)
    S.D   F12, -16(R1)
    ADDI  R1, R1, #-40
    S.D   F16, 16(R1)
    BNE   R1, R2, L
    S.D   F20, 8(R1)    ; fills the branch delay slot

12 clock cycles/5 elements;
2 (ADDI, BNE) are loop overhead
The VLIW Approach

 The first multiple-issue processors required
the instruction stream to be explicitly
organized.
 Used wide instructions with multiple
operations per instruction.
 This architectural approach was named VLIW
(Very Long Instruction Word).
Advantage of VLIW Approach

 The advantage of VLIW increases as the
maximum issue rate grows.
 VLIW approaches make sense for wider
processors.
 For example, a VLIW processor might have
instructions that contain five operations,
including one integer operation, two floating-point
operations, and two memory references.


The Basic VLIW Approach

 There is no fundamental difference between
the superscalar and VLIW approaches.
 Since choosing the instructions to be issued
simultaneously falls on the compiler, the
hardware a superscalar uses to make these
issue decisions is unneeded.
The Basic VLIW Approach

 Scheduling Techniques
 Local Scheduling
 Global Scheduling
Example

 Suppose we have a VLIW that could issue two
memory references, two FP operations, and one
integer operation or branch in every clock cycle.
Show an unrolled version of the loop x[i] = x[i] + s
for such a processor. Unroll as many times as
necessary to eliminate any stalls. Ignore the
branch delay slot.
VLIW instructions occupy inner loop
and replace the unrolled sequence

Memory reference 1    Memory reference 2    FP operation 1       FP operation 2       Integer operation/branch
L.D  F0, 0(R1)        L.D  F6, -8(R1)
L.D  F10, -16(R1)     L.D  F14, -24(R1)
L.D  F18, -32(R1)     L.D  F22, -40(R1)     ADD.D F4, F0, F2     ADD.D F8, F6, F2
L.D  F26, -48(R1)                           ADD.D F12, F10, F2   ADD.D F16, F14, F2
                                            ADD.D F20, F18, F2   ADD.D F24, F22, F2
S.D  F4, 0(R1)        S.D  F8, -8(R1)       ADD.D F28, F26, F2
S.D  F12, -16(R1)     S.D  F16, -24(R1)                                               DADDUI R1, R1, #-56
S.D  F20, 24(R1)      S.D  F24, 16(R1)
S.D  F28, 8(R1)                                                                       BNE R1, R2, Loop
Problems of VLIW Model

 They are:
 Technical problems
 Logistical Problems
Technical Problems
 Increase in code size and the limitations of lockstep.
 Two different elements combine to increase code
size substantially for a VLIW.
 Generating enough operations in a straight-line
code fragment requires ambitiously unrolling
loops, thereby increasing code size.
 Whenever instructions are not full, the unused
functional units translate to wasted bits in the
instruction encoding.
Logistical Problems And Solution
 Logistical Problems
 Binary code compatibility has been a major problem.
 The different numbers of functional units and unit latencies require
different versions of the code.
 Migration problems

 Solution
 Approach:
 Use object-code translation or emulation. This technology is
developing quickly and could play a significant role in future
migration schemes.
 Temper the strictness of the approach so that binary
compatibility is still feasible.
Advantages of multiple-issue versus
vector processor

 Twofold.
 A multiple-issue processor has the potential
to extract some amount of parallelism from
less regularly structured code.
 It has the ability to use a more conventional,
and typically less expensive, cache-based
memory system.
Advanced compiler support for
exposing and exploiting ILP

 Compiler technology for increasing the amount
of parallelism.
 Defining when a loop is parallel and how
dependence can prevent a loop from being
parallel.
 Eliminating some types of dependences.
Detecting and Enhancing Loop-level
Parallelism

 Loop-level parallelism is normally analyzed at the
source level or close to it, while most analysis of ILP
is done once instructions have been generated by
the compiler.
 It involves determining what dependences exist
among the operands in a loop across the iterations of
that loop.
Detecting and Enhancing Loop-level
Parallelism
 The analysis of loop-level parallelism focuses on
determining whether data accesses in later iterations are
dependent on the data values produced in earlier
iterations; such a dependence is called a loop-carried
dependence.
 Loop-level parallel at the source representation:

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
Example 1
 Consider a loop:
for (i= 1; i <= 100; i = i + 1)
{
A [ i + 1] = A [ i ] + C [ i ] ;
B [ i + 1] = B [ i ] + A [ i + 1 ] ;
}
 Loop-carried dependence : execution of
an instance of a loop requires the
execution of a previous instance.
Example 2
 Consider a loop:
for (i= 1; i <= 100; i = i + 1)
{
A [ i ] = A [ i ] + B [ i ] ; /* S1 */
B [ i + 1] = C [ i ] + D [ i ] ;/* S2 */
}

What are the dependences between S1 and S2?
 Is this loop parallel? If not, show how to make it
parallel.
Scheduling
Iteration
1    A[1] = A[1] + B[1]
1    B[2] = C[1] + D[1]
2    A[2] = A[2] + B[2]
2    B[3] = C[2] + D[2]
3    A[3] = A[3] + B[3]
3    B[4] = C[3] + D[3]
4    A[4] = A[4] + B[4]
4    B[5] = C[4] + D[4]
...
99   A[99] = A[99] + B[99]
99   B[100] = C[99] + D[99]
100  A[100] = A[100] + B[100]
100  B[101] = C[100] + D[100]
Removing Loop-carried Dependency

 Consider the same example:

A[1] = B[1] + C[1];
for (i = 1; i <= 99; i = i + 1)
{
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
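The transformation above can be checked mechanically. A small C sketch (the function names are mine, and N = 100 follows the slide) runs both versions on the same data and compares every element:

```c
#include <assert.h>
#define N 100

/* Original loop of Example 2: S2's write of B[i+1] is consumed by S1
   in the NEXT iteration, so the dependence is loop-carried. */
static void original(double A[N + 2], double B[N + 2],
                     const double C[N + 2], const double D[N + 2]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];
        B[i + 1] = C[i] + D[i];
    }
}

/* Transformed version from the slide: the dependence between the two
   statements now stays within one iteration, so it is no longer
   loop-carried and the loop body is parallelizable. */
static void transformed(double A[N + 2], double B[N + 2],
                        const double C[N + 2], const double D[N + 2]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```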
Example 3

 Loop-carried dependences: this dependence
information is inexact; a dependence may or may not
actually exist.
 Consider an example:
for (i = 1; i <= 100; i = i + 1)
{
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
Recurrence
 Loop-carried dependences
for (i = 2; i <= 100; i = i + 1)
{
    Y[i] = Y[i-1] + Y[i];
}
for (i = 6; i <= 100; i = i + 1)
{
    Y[i] = Y[i-5] + Y[i];
}
 Dependence distance: an iteration i of the loop
depends on the completion of an earlier
iteration i-n; the dependence distance is n.
Finding Dependences

 Three tasks
 Good scheduling of code
 Determining which loops might contain
parallelism
 Eliminating name dependences
Finding Dependence

A dependence exists if two conditions hold:
 There are two iteration indices, j and k , both
within the limits of the for loop. That is, m<=j
<= n, m<=k <= n.
 The loop stores into an array element
indexed by a*j+b and later fetches from the
same array element when it is indexed by
c*k+d. That is, a*j+b=c*k+d.
Example 1
 Use the GCD test to determine whether
dependences exist in the loop:
for ( i=1 ; i<= 100 ; i = i + 1)
{
X[2*i+3]=X[2*i ]*5.0;
}

A = 2, B = 3, C = 2, D = 0
Since 2 (GCD(A, C)) does not divide -3 (D - B), no
dependence is possible.
GCD - Greatest Common Divisor
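The test itself is easy to code. A minimal C sketch (the helper names `gcd` and `gcd_test` are mine), assuming positive coefficients a and c as in the slide's loop:

```c
#include <assert.h>

/* Greatest common divisor via Euclid's algorithm. */
static int gcd(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test: a store indexed by a*j + b and a load indexed by c*k + d
   can refer to the same element only if GCD(a, c) divides d - b.
   Returns 1 when a dependence is possible, 0 when it is ruled out.
   Assumes a > 0 and c > 0. */
static int gcd_test(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}
```

For the slide's loop, `gcd_test(2, 3, 2, 0)` returns 0, matching the conclusion that no dependence is possible.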
Example 2
 The loop has multiple types of dependences. Find
all the true dependences, output dependences,
and antidependences, and eliminate the output
dependences and antidependences by renaming.
for (i = 1; i <= 100; i = i + 1)
{
    y[i] = x[i] / c;   /* S1 */
    x[i] = x[i] + c;   /* S2 */
    z[i] = y[i] + c;   /* S3 */
    y[i] = c - y[i];   /* S4 */
}
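One possible renaming, sketched in C: a temporary array `t` (a name introduced here for illustration) takes over the value produced by S1, which removes the output dependence S1 -> S4 and the antidependence S3 -> S4 on y, while the true dependences S1 -> S3 and S1 -> S4 remain:

```c
#include <assert.h>
#define N 100

/* Original S1-S4 sequence. */
static void original_loop(double x[N], double y[N], double z[N], double c) {
    for (int i = 0; i < N; i++) {
        y[i] = x[i] / c;   /* S1 */
        x[i] = x[i] + c;   /* S2 */
        z[i] = y[i] + c;   /* S3 */
        y[i] = c - y[i];   /* S4 */
    }
}

/* After renaming: t carries S1's value, so S3 and S4 read t and only
   S4 writes the real y.  The name dependences on y are gone. */
static void renamed_loop(double x[N], double y[N], double z[N], double c) {
    double t[N];
    for (int i = 0; i < N; i++) {
        t[i] = x[i] / c;   /* was S1 */
        x[i] = x[i] + c;   /* S2 */
        z[i] = t[i] + c;   /* S3 now reads t */
        y[i] = c - t[i];   /* S4 writes the final y */
    }
}
```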
Dependence Analysis
 There are a wide variety of situations in which array-oriented
dependence analysis cannot tell us what we might want to know,
including
 When objects are referenced via pointers rather than array
indices.
 When array indexing is indirect through another array, which
happens with many representations of sparse arrays.
 When a dependence may exist for some value of the inputs, but
does not exist in actuality when the code is run since the inputs
never take on those values.
 When an optimization depends on knowing more than just the
possibility of a dependence, but needs to know on which write
of a variable a read of that variable depends.
Basic Approach used in points-to
analysis
 The basic approach used in points-to analysis relies on
information from three major sources:
 Type information, which restricts what a pointer can point to.
 Information derived when an object is allocated or when the
address of an object is taken, which can be used to restrict
what a pointer can point to.
 For example if p always points to an object allocated in a
given source line and q never points to that object, then p
and q can never point to the same object.
 Information derived from pointer assignments.
 For example , if p may be assigned the value of q, then p
may point to anything q points to.
Analyzing pointers
 There are several cases where analyzing pointers has been
successfully applied and is extremely useful:
 When pointers are used to pass the address of an object as a
parameter, it is possible to use points-to analysis to determine
the possible set of objects referenced by a pointer. One
important use is to determine if two pointer parameters may
designate the same object.
 When a pointer can point to one of several types, it is
sometimes possible to determine the type of the data object
that a pointer designates at different parts of the program.
 It is often possible to separate out pointers that may only
point to a local object from those that may point to a global one.
Different types of limitations
 There are two different types of limitations that affect
our ability to do accurate dependence analysis for
large programs.
 Limitations arise from restrictions in the analysis
algorithms. Often, we are limited by the lack of
applicability of the analysis rather than a
shortcoming in dependence analysis per se.
 A second limitation is the need to analyze behavior
across procedure boundaries to get accurate
information.
Eliminating dependent computations
 Compilers can reduce the impact of dependent
computations so as to achieve more ILP.
 The key technique is to eliminate or reduce a
dependent computation by back substitution,
which increases the amount of parallelism and
sometimes increases the amount of
computation required. These techniques can be
applied both within a basic block and within
loops.
Copy propagation
 Within a basic block, algebraic simplifications of
expressions and an optimization called copy
propagation, which eliminates operations that
copy values, can be used to simplify sequences.
DADDUI R1, R2, #4
DADDUI R1, R1, #4

to

DADDUI R1, R2, #8
Example
 Consider the code sequences:
ADD R1,R2,R3
ADD R4,R1,R6
ADD R8,R4,R7
 Notice that this sequence requires at least three
execution cycles, since all the instructions depend on the
immediate predecessor. By taking advantage of
associativity, transform the code and rewrite:
ADD R1,R2,R3
ADD R4,R6,R7
ADD R8,R1,R4
 This sequence can be computed in two execution cycles.
When loop unrolling is used , opportunities for these types
of optimizations occur frequently.
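The same transformation can be expressed at the source level. Since integer addition is associative, both orderings give identical results; the function names below are illustrative:

```c
#include <assert.h>

/* Three dependent adds: each instruction uses the previous result,
   so the dependence chain is three operations long. */
static int chained(int r2, int r3, int r6, int r7) {
    int r1 = r2 + r3;   /* ADD R1, R2, R3 */
    int r4 = r1 + r6;   /* ADD R4, R1, R6 */
    int r8 = r4 + r7;   /* ADD R8, R4, R7 */
    return r8;
}

/* Re-associated: r2 + r3 and r6 + r7 are independent and can execute
   in the same cycle, leaving a two-operation dependence chain with
   the same final value. */
static int balanced(int r2, int r3, int r6, int r7) {
    int r1 = r2 + r3;   /* ADD R1, R2, R3 */
    int r4 = r6 + r7;   /* ADD R4, R6, R7 (independent of R1) */
    int r8 = r1 + r4;   /* ADD R8, R1, R4 */
    return r8;
}
```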
Recurrences


Recurrences are expressions whose value on
one iteration is given by a function that depends
on the previous iterations.
 When a loop with a recurrence is unrolled, we
may be able to algebraically optimize the
unrolled loop, so that the recurrence need only
be evaluated once per unrolled iteration.
Types of recurrences

 One common type of recurrence arises from an


explicit program statement:

Sum = Sum + X ;
 Unroll a recurrence – Unoptimized (5 dependent
instructions)

Sum = Sum + X1 + X2 + X3 + X4 + X5 ;
 Optimized: with 3 dependent operations:

Sum = ((Sum + X1) + (X2 + X3) ) + (X4 + X5);
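In C the two forms look like the sketch below (the function names are mine). One caveat worth labeling explicitly: floating-point addition is not strictly associative, so a compiler normally applies this reassociation only when it is told the reordering is acceptable; for the small integer-valued inputs used here the two forms agree exactly.

```c
#include <assert.h>

/* Serial evaluation: five dependent adds, each waiting on the last. */
static double sum_serial(double sum, const double x[5]) {
    for (int i = 0; i < 5; i++)
        sum = sum + x[i];
    return sum;
}

/* Tree-height-reduced form from the slide: the three adds in the inner
   parentheses are independent, leaving only 3 dependent operations on
   the critical path. */
static double sum_tree(double sum, const double x[5]) {
    return ((sum + x[0]) + (x[1] + x[2])) + (x[3] + x[4]);
}
```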


Software Pipelining : Symbolic Loop
Unrolling

 There are two important techniques:


 Software pipelining
 Trace scheduling

 Reading Assignment

Trace Scheduling

Software Pipelining versus Loop Unrolling
Software Pipelining

 This is a technique for reorganizing loops such
that each iteration in the software-pipelined
code is made from instructions chosen from
different iterations of the original loop.
 This loop interleaves instructions from different
iterations without unrolling the loop.
 This technique is the software counterpart to
what Tomasulo's algorithm does in hardware.
Software Pipelined Iteration
[Figure: iterations 0 through 4 of the original loop overlap in time; one
software-pipelined iteration draws an instruction from each of several
original iterations.]
Example
 Show a software-pipelined version of this loop, which
increments all the elements of an array whose
starting address is in R1 by the contents of F2:
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
You can omit the start-up and clean-up code.
Software Pipeline
Iteration i:    L.D   F0, 0(R1)    ; load M[i] into F0
                ADD.D F4, F0, F2   ; add to F0, hold the result in F4
                S.D   F4, 0(R1)    ; store F4 into M[i]
Iteration i+1:  L.D   F0, 0(R1)
                ADD.D F4, F0, F2
                S.D   F4, 0(R1)
Iteration i+2:  L.D   F0, 0(R1)
                ADD.D F4, F0, F2
                S.D   F4, 0(R1)

The software-pipelined loop takes one instruction from each of three iterations:

Loop: S.D    F4, 16(R1)   ; stores into M[i]
      ADD.D  F4, F0, F2   ; adds to M[i-1]
      L.D    F0, 0(R1)    ; loads M[i-2]
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Wind-down: store F4 into M[1]; ADD.D F4, F0, F2; store the result into M[0].
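At the C level, a software-pipelined version of x[i] = x[i] + s might look like the sketch below (variable and function names are illustrative, and the code assumes n >= 2): each steady-state body performs the store for iteration i, the add for iteration i+1, and the load for iteration i+2, with explicit start-up and wind-down code.

```c
#include <assert.h>

/* Software-pipelined x[i] = x[i] + s.  Assumes n >= 2. */
static void increment_pipelined(double *x, int n, double s) {
    double loaded, added;
    loaded = x[0];              /* start-up: load for iteration 0 */
    added  = loaded + s;        /* start-up: add  for iteration 0 */
    loaded = x[1];              /*           load for iteration 1 */
    for (int i = 0; i + 2 < n; i++) {
        x[i]   = added;         /* store for iteration i     */
        added  = loaded + s;    /* add   for iteration i + 1 */
        loaded = x[i + 2];      /* load  for iteration i + 2 */
    }
    x[n - 2] = added;           /* wind-down code */
    x[n - 1] = loaded + s;
}
```

The three statements in the loop body are mutually independent, which is exactly what lets the hardware overlap the load, add, and store latencies.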
Major Advantage of Software
Pipelining
 The major advantage of software pipelining over straight
loop unrolling is that software pipelining consumes less
code space.
 In addition to yielding a better-scheduled inner loop, each
technique reduces a different type of overhead.
 Loop unrolling reduces the overhead of the loop -
the branch and counter update code.
 Software pipelining reduces the time when the loop
is not running at peak speed to once per loop, at the
beginning and end.
Software pipelining and Unrolled
Loop
a) Software pipelining
[Figure: number of overlapped operations vs. time; start-up code fills the
pipeline at the beginning and wind-down code drains it at the end.]

b) Loop unrolling
[Figure: number of overlapped operations vs. time; overlap occurs between
unrolled iterations, and the overlapped portion is proportional to the
number of unrolls.]
Software Pipelining Difficulty
 In practice, compilation using software pipelining is quite difficult
for several reasons:
 Many loops require significant transformation before they can be
software pipelined, the trade-offs in terms of overhead versus
efficiency of the software-pipelined loop are complex, and the
issue of register management creates additional complexities.
 To help deal with the last two of these issues, the IA-64 added
extensive hardware support for software pipelining.
 Although this hardware can make it more efficient to apply
software pipelining, it does not eliminate the need for complex
compiler support, or the need to make difficult decisions about
the best way to compile a loop.
Global Code Scheduling
 Global code scheduling aims to compact a code
fragment with internal control structure into the
shortest possible sequence that preserves the
data and control dependence .
 Data Dependence:
 Data dependences force a partial order on operations;
they are overcome by unrolling and, in the case of
memory operations, by using dependence analysis to
determine whether two references refer to the same
address.
 Finding the shortest possible sequence for a
piece of code means finding the shortest
sequence for the critical path, which is the
longest sequence of dependent instructions.
Global Code Scheduling ….
 Control dependence
 It dictates instructions across which code cannot be
easily moved.
 Control dependences arising from loop branches are
reduced by unrolling.
 Global code scheduling can reduce the effect of
control dependences arising from conditional nonloop
branches by moving code.
 Since moving code across branches will often affect
the frequency of execution of such code, effectively
using global code motion requires estimates of the
relative frequency of different paths.
Global Code Scheduling ….

 Although global code motion cannot guarantee


faster code, if the frequency information is
accurate, the compiler can determine whether
such code movement is likely to lead to faster
code.
 Global code motion is important since many inner
loops contain conditional statements.
 Global code scheduling also requires complex
trade-offs to make code motion decisions.
A Typical Code Fragment
A[i] = A[i] + B[i];
if (A[i] == 0)
    B[i] = ...;   /* then part */
else
    X;            /* else part */
C[i] = ...;
LD R4,0(R1) ; load A
LD R5,0(R2) ; load B
DADDU R4,R4,R5 ; ADD to A
SD R4,0(R1) ; Store A
….
BNEZ R4,elsepart ; Test A
…. ; then part
SD ….,0(R2) ;Stores to B
….
J join ; jump over else
elsepart: …. ; else part
X ; Code for X
….
join: …. ; after if
SD ….,0(R3) ; Store C[i]
Factors for the compiler
 Consider the factors that the compiler would have to
consider in moving the computation and assignment of B:
 What are the relative execution frequencies of the then
case and the else case in the branch?
If the then case is much more frequent, the code
motion may be beneficial. If not, it is less likely, although
not impossible, that moving the code will be worthwhile.
 What is the cost of executing the computation and

assignment to B on the branch?


It may be that there are a number of empty
instruction issue slots in the code on the branch and
that the instructions for B can be placed into slots that
would otherwise go empty. This opportunity makes the
computation of B “free” at least to the first order.
Factors for the compiler …
 How will the movement of B change the execution
time for the then case?
If B is at the start of the critical path for the then
case, moving it may be highly beneficial.
 Is B the best code fragment that can be moved above
the branch? How does it compare with moving C or
other statements within the then case?
 What is the cost of the compensation code that may
be necessary for the else case? How effectively can
this code be scheduled, and what is its impact on
execution time?
Global code principle
 The two methods that follow (trace scheduling and
superblocks) rely on a common principle:
 Focus the attention of the compiler on a straight-line
code segment representing what is estimated to be the
most frequently executed code path.
 Unrolling is used to generate the straight-line code,
but, of course, the complexity arises in how conditional
branches are handled.
 In both cases, branches are effectively straightened by
choosing and scheduling the most frequent path.
Trace Scheduling: focusing on the critical
path
 It is useful for processors with a large number of
issues per clock, where conditional or predicated
execution is inappropriate or unsupported, and
where simple loop unrolling may not be sufficient
by itself to uncover enough ILP to keep the
processor busy.
 It is a way to organize the global code motion
process, so as to simplify the code scheduling
by incurring the costs of possible code motion
on the less frequent paths.
Steps to Trace Scheduling
 There are two steps to trace scheduling:
 Trace selection:
It tries to find a likely sequence of
basic blocks whose operations will be put
together into a smaller number of instructions;
this sequence is called a trace.
 Trace compaction:
It tries to squeeze the trace into a small
number of wide instructions. Trace compaction
is code scheduling; hence, it attempts to move
operations as early as it can in a sequence
(trace), packing the operations into as few wide
instructions (or issue packets) as possible.
Advantage of Trace Scheduling

 Its advantage is that it simplifies the decisions
concerning global code motion.
 In particular, branches are viewed as jumps
into or out of the selected trace, which is
assumed to be the most probable path.
 When code is moved across such trace entry
and exit points, additional book-keeping code
will often be needed at the entry or exit point.
Key Assumption:
 The key assumption is that the trace is so much more probable than
the alternatives that the cost of the book-keeping code need not be a
deciding factor:
 If an instruction can be moved and thereby make the main trace
execute faster, it is moved.
 Although trace scheduling has been successfully applied to scientific
code with its intensive loops and accurate profile data, it remains
unclear whether this approach is suitable for programs that are less
simply characterized and less loop-intensive.
 In such programs, the significant overheads of compensation code
may make trace scheduling an unattractive approach, or, at best, its
effective use will be extremely complex for the compiler.
Trace in the program fragment
[Figure: two unrolled copies of the code fragment. The selected trace
follows the then path through both copies; each "A[i] == 0?" branch to
the else part is a trace exit, and the join before "C[i] =" is a trace
entrance.]
Drawback of trace scheduling

 One of the major drawbacks of trace scheduling
is that the entries and exits into the middle of
the trace cause significant complications,
requiring the compiler to generate and track the
compensation code and often making it difficult
to assess the cost of such code.
Superblocks
 Superblocks are formed by a process similar to
that used for traces, but are a form of extended
basic blocks, which are restricted to a single entry
point but allow multiple exits.
 How can a superblock with only one entrance be
constructed?
The answer is to use tail duplication to create
a separate block that corresponds to the portion
of the trace after the entry.
 The superblock approach reduces the complexity
of book-keeping and scheduling versus the more
general trace generation approach, but may
enlarge code size more than a trace-based
approach.
Superblock results from unrolling the
code
[Figure: the code fragment unrolled (copies for n = 2 and n = 1). Tail
duplication of the code after the trace entrance creates a superblock
with a single entrance; each branch to the else part X is a superblock
exit.]
Exploiting ILP

 Loop unrolling, software pipelining, trace scheduling, and
superblock scheduling all aim at trying to increase the
amount of ILP that can be exploited by a processor issuing
more than one instruction on every clock cycle.
 The effectiveness of each of these techniques and their
suitability for various architectural approaches are among
the hottest topics being pursued by researchers and
designers of high-speed processors.
Hardware Support for Exposing More
Parallelism at Compile Time
 Techniques such as loop unrolling, software pipelining ,
and trace scheduling can be used to increase the
amount of parallelism available when the behavior of
branches is fairly predictable at compile time.
 The first is an extension of the instruction set to include
conditional or predicated instructions.
 Such instructions can be used to eliminate branches,
converting a control dependence into a data
dependence and potentially improving performances.
 Such approaches are useful with either hardware-intensive
or compiler-intensive schemes; predication can be used to
eliminate branches.
Conditional or Predicated
Instructions
 The concept behind these instructions is quite simple:
 An instruction refers to a condition, which is
evaluated as part of the instruction
execution.
 If the condition is true, the instruction is
executed normally;
 If the condition is false, the execution
continues as if the instruction were a no-op.
 Many newer architectures include some form
of conditional instructions.
Example:
 Consider the following code:
if (A==0) {S=T;}
Assuming that registers R1,R2 and R3 hold the
values of A,S and T respectively, show the code
for this statement with the branch and with the
conditional move.
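A hedged C sketch of the two alternatives (the function names are mine): the branch form mirrors a common MIPS-style answer (BNEZ R1, L / ADDU R2, R3, R0 / L:), while the conditional-move form corresponds to a single CMOVZ R2, R3, R1 that copies T into S only when A is zero:

```c
#include <assert.h>

/* With a branch: skip the move when A != 0.
       BNEZ  R1, L       ; branch if A != 0
       ADDU  R2, R3, R0  ; S = T
   L:                                           */
static int with_branch(int A, int S, int T) {
    if (A == 0)
        S = T;
    return S;
}

/* Conditional-move style: select the value with no control flow,
   as CMOVZ R2, R3, R1 would (move R3 to R2 if R1 == 0). */
static int with_cmov(int A, int S, int T) {
    return (A == 0) ? T : S;
}
```

The second form removes the control dependence entirely: there is no branch to predict or mispredict.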
Conditional Moves and Predication
 Conditional moves are the simplest form of conditional or
predicated instructions, and although useful for short
sequences, they have limitations.
 In particular, using conditional moves to eliminate branches
that guard the execution of large blocks of code can be
inefficient, since many conditional moves may need to be
introduced.
 To remedy the inefficiency of conditional moves,
some architectures support full predication, whereby the
execution of all instructions is controlled by a predicate.
 When the predicate is false, the instruction becomes a no-
op. Full predication allows us to simply convert large
blocks of code that are branch dependent.
 Predicated instructions can also be used to speculatively
move an instruction that is time critical, but may cause an
exception if moved before a guarding branch.
Example:
 Here is a code sequence for a two-issue superscalar
that can issue a combination of one memory
reference and one ALU operation, or a branch by
itself, every cycle:
First instruction slot       Second instruction slot
LW   R1, 40(R2)              ADD  R3, R4, R5
                             ADD  R6, R3, R7
BEQZ R10, L
LW   R8, 0(R10)
LW   R9, 0(R8)

 This sequence wastes a memory operation slot in


the second cycle and will incur a data dependence
stall if the branch is not taken, since the second LW
after the branch depends on the prior load. Show
how the code can be improved using a predicated
form of LW.
Predicated instructions
 When we convert an entire code segment to
predicated execution or speculatively move an
instruction and make it predicated, we remove a
control dependence.
 Correct code generation and the conditional
execution of predicated instructions ensure that
we maintain the data flow enforced by the
branch.
 To ensure that the exception behavior is also
maintained, a predicated instruction must not
generate an exception if the predicate is false.
Complication and Disadvantage
 The major complication in implementing predicated
instructions is deciding when to annul an instruction.
 Predicated instructions may either be annulled during
instruction issue or later in the pipeline before they
commit any results or raise an exception.
 Each choice has a disadvantage.
 If predicated instructions are annulled early in the pipeline, the
value of the controlling condition must be known early to
prevent a stall for a data hazard.
 Since data-dependent branch conditions, which tend to
be less predictable, are candidates for conversion to
predicated execution, this choice can lead to more pipeline
stalls.
Complication of Predication
 Because of this potential for data hazard stalls, no
design with predicated execution (or conditional
move) annuls instructions early.
 Instead, all existing processors annul instructions
later in the pipeline, which means that annulled
instructions will consume functional unit resources
and potentially have a negative impact on
performance.
 A variety of other pipeline implementation
techniques, such as forwarding, interact with
predicated instructions, further complicating the
implementation.
Factors Limiting Predicated or Conditional
Instructions
 Predicated or conditional instructions are extremely useful for
implementing short alternative control flows, for eliminating
some unpredictable branches, and for reducing the overhead of
global code scheduling. Nonetheless, the usefulness of
conditional instructions is limited by several factors:
 Predicated instructions that are annulled (i.e., those whose
conditions are false) still take some processor resources.
conditions are false) still take some processor resources.


 An annulled predicated instruction requires fetch resources
at a minimum, and in most processors functional unit
execution time.
 Therefore, moving an instruction across a branch and
making it conditional will slow the program down whenever
the moved instruction would not have been normally
executed.
Factors Limiting Predicated or Conditional
Instructions
 Likewise, predicating a control-dependent portion of code
and eliminating a branch may slow down the processor if
that code would not have been executed.
 An important exception to these situations occurs when
the cycles used by the moved instruction when it is not
performed would have been idle anyway.
 Moving an instruction across a branch or converting a
code segment to predicated execution is essentially
speculating on the outcome of the branch.
 Conditional instructions make this easier but do not
eliminate the execution time taken by an incorrect guess.
 In simple cases, where we trade a conditional move for a
branch and a move, using conditional moves or
predication is almost always better.
Factors Limiting Predicated or Conditional
Instructions…
 When longer code sequences are made
conditional, the benefits are more limited.
 Predicated instructions are most useful when
the predicate can be evaluated early.
 If the condition evaluation and predicated
instructions cannot be separated (because of
data dependence in determining the
condition) , then a conditional instruction may
result in a stall for a data hazard.
 With branch prediction and speculation, such
stalls can be avoided, at least when the
branches are predicted accurately.
Factors Limiting Predicated or Conditional
Instructions…
 The use of conditional instructions can be limited when the
control flow involves more than a simple alternative sequence.
 For example, moving an instruction across multiple branches
requires making it conditional on both branches, which
requires two conditions to be specified or requires additional
instructions to compute the controlling predicate.
 If such capabilities are not present, the overhead of if
conversion will be larger, reducing its advantage.
 Conditional instructions may have some speed penalty
compared with unconditional instructions.
 This may show up as a higher cycle count for such instructions
or a slower clock rate overall.
 If conditional instructions are more expensive, they will need
to be used judiciously.
Compiler Speculation With Hardware
Support

Many programs have branches that can be
accurately predicted at compile time either from the
program structure or by using a profile.
 In such cases, the compiler may want to speculate
either to improve the scheduling or to increase the
issue rate.
 Predicated instructions provide one method to
speculate, but they are really more useful when
control dependences can be completely eliminated
by if conversion.
 In many cases, we would like to move speculated
instructions not only before the branch, but before
the condition evaluation; predication cannot achieve this.
Speculate Ambitiously
To speculate ambitiously requires three capabilities:
 The ability of the compiler to find instructions that,
with the possible use of register renaming, can be
speculatively moved and not affect the program
data flow.
 The ability to ignore exceptions in speculated
instructions, until we know that such exceptions
should really occur.
 The ability to speculatively interchange loads and
stores, or stores and stores, which may have
address conflicts.
 The first of these is a compiler capability, while the
last two require hardware support.
Hardware Support for Preserving
Exception Behavior

 To speculate ambitiously, we must be able to


move any type of instruction and still preserve
its exception behavior.
 The key to being able to do this is to observe
that the results of a speculated sequence that is
mispredicted will not be used in the final
computation , and such a speculated instruction
should not cause an exception.
Investigated for supporting more
ambitious speculation
 There are four methods that have been investigated for
supporting more ambitious speculation without introducing
erroneous behavior:
 The hardware and operating system cooperatively ignore
exceptions for speculative instructions. This approach
preserves exception behavior for correct programs, but
not for incorrect ones. This approach may be viewed as
unacceptable for some programs, but it has been used,
under program control, as a "fast mode" in several
processors.
 Speculative instructions that never raise exceptions are
used, and checks are introduced to determine when an
exception should occur.
Investigated for supporting more
ambitious speculation

 A set of status bits, called poison bits, are attached
to the result registers written by speculated
instructions when the instructions cause exceptions.
The poison bits cause a fault when a normal
instruction attempts to use the register.
 A mechanism is provided to indicate that an
instruction is speculative , and the hardware buffers
the instruction result until it is certain that the
instruction is no longer speculative.
Schemes
 To understand these schemes, we need to distinguish between
exceptions that indicate a program error and would
normally cause termination, such as a memory
protection violation, and those that are handled and
normally resumed, such as a page fault.
 Exceptions that can be resumed can be accepted and
processed for speculative instructions just as if they
were normal instructions.
 If the speculative instruction should not have been
executed, handling the unneeded exception may have
some negative performance effects, but it cannot cause
incorrect execution.
Schemes..
 The cost of these exceptions may be high, however, and
some processors use hardware support to avoid taking such
exceptions, just as processors with hardware speculation may
take some exceptions in speculative mode, while avoiding
others until an instruction is known not to be speculative.
 Exceptions that indicate a program error should not occur in
correct programs, and the result of a program that gets such
an exception is not well defined, except perhaps when the
program is running in a debugging mode.
 If such exceptions arise in speculated instructions, we cannot
take the exception until we know that the instruction is no
longer speculative.
Schemes …
 In the simplest method for preserving exceptions, the
hardware and the operating system handle all resumable
exceptions when the exception occurs and simply return
an undefined value for any exception that would cause
termination.
 If the instruction generating the terminating exception
was not speculative, then the program is in error.
 Note that instead of terminating the program, the program
is allowed to continue, although it will almost certainly
generate incorrect results.
 If the instruction generating the terminating exception is
speculative, then the program may be correct and the
speculative result will simply be unused.
Schemes …
 Thus, returning an undefined value for the
instruction cannot be harmful.
 This scheme can never cause a correct program
to fail, no matter how much speculation is done.
 An incorrect program, which formerly might
have received a terminating exception, will get
an incorrect result.
 This is acceptable for some programs, assuming
the compiler can also generate a normal version
of the program, which does not speculate and
can receive a terminating exception.
Schemes …
 In such a scheme, it is not necessary to know that an
instruction is speculative. Indeed, it is helpful only
when a program is in error and receives a terminating
exception on a normal instruction; in such cases, if the
instruction were not marked as speculative, the
program could be terminated.
 A second approach to preserving exception behavior
when speculating introduces speculative versions of
instructions that do not generate terminating
exceptions and instructions to check for such
exceptions. This combination preserves the exception
behavior exactly.
Schemes …
 A third approach for preserving exception behavior tracks
exceptions as they occur but postpones any terminating
exception, although not in a completely precise fashion.
 The scheme is simple: a poison bit is added to every register, and
another bit is added to every instruction to indicate whether the
instruction is speculative. The poison bit of the destination register
is set whenever a speculative instruction results in a terminating
exception; all other exceptions are handled immediately.
 If a speculative instruction uses a register with its poison
bit turned on, the destination register of the instruction
simply has its poison bit turned on.
 If a normal instruction attempts to use a register source with
its poison bit turned on, the instruction causes a fault.
 In this way, any program that would have generated an exception
still generates one, albeit at the first instance where a result is
used by an instruction that is not speculative.
Schemes …
 Since poison bits exist only on register values and not memory
values, stores are never speculative and thus trap if either
operand is “poison”.
 One complication that must be overcome is how the OS saves
the user registers on a context switch if the poison bit is set.
 A special instruction is needed to save and reset the state of the
poison bits to avoid this problem.
 The fourth and final approach listed earlier relies on a hardware
mechanism that operates like a reorder buffer.
 In such an approach, instructions are marked by the compiler as
speculative and include an indicator of how many branches the
instruction was speculatively moved across and what branch
action (taken/not taken) the compiler assumed.
 The last piece of information basically tells the hardware the
location of the code block where the speculated instruction
originally was.
Schemes …
 In practice, the original location of the speculated
instruction is marked by a sentinel, which tells the
hardware that the earlier speculative instruction is
no longer speculative and values may be
committed.
 All instructions are placed in a reorder buffer when
issued and are forced to commit in order, as in the
hardware speculation approach. (Note, though,
that no actual speculative branch prediction or
dynamic scheduling occurs.)
 The reorder buffer tracks when instructions are
ready to commit and delays the “write-back”
portion of any speculative instruction.
Schemes …
 Speculative instructions are not allowed to commit
until the branches that have been speculatively
moved over are also ready to commit, or,
alternatively, until the corresponding sentinel is
reached.
 At that point, we know whether the speculated
instructions should have been executed or not.
 If it should have been executed and it generated a
terminating exception, then we know that the
program should be terminated.
 If the instruction should not have been executed, then
the exception can be ignored.
Example 1
 Consider the code fragment from an if-then-else
statement of the form:
if (A==0) A=B; else A=A+4; where A is at 0(R3) and B is
at 0(R2). Assume the then clause is almost always
executed. Compile the code using compiler-based
speculation. Assume R14 is unused and available.
LD R1,0(R3)        ; load A
BNEZ R1,L1         ; test A
LD R1,0(R2)        ; then clause
J L2               ; skip else
L1: DADDI R1,R1,#4 ; else clause
L2: SD R1,0(R3)    ; store A
Example 2
 Show how the code can be rewritten using a speculative
load (sLD) and a speculation check instruction
(SPECCK) to completely preserve exception
behavior. Assume R14 is unused and available.
LD R1,0(R3)        ; load A
BNEZ R1,L1         ; test A
LD R1,0(R2)        ; then clause
J L2               ; skip else
L1: DADDI R1,R1,#4 ; else clause
L2: SD R1,0(R3)    ; store A
Example 3
 Consider the code fragment and show how it
would be compiled with speculative instructions
and poison bits. Show where an exception for
the speculative memory reference would be
recognized. Assume R14 is unused and available.
LD R1,0(R3)        ; load A
BNEZ R1,L1         ; test A
LD R1,0(R2)        ; then clause
J L2               ; skip else
L1: DADDI R1,R1,#4 ; else clause
L2: SD R1,0(R3)    ; store A
Hardware support for memory
Reference Speculation
 Moving loads across stores is usually done when the
compiler is certain the addresses do not conflict.
 A special instruction to check for address conflicts can
be included in the architecture.
 The special instruction is left at the original location of
the load instruction (and acts like a guardian), and the
load is moved up across one or more stores.
 When a speculated load is executed, the hardware
saves the address of the accessed memory location.
 If a subsequent store writes to that address before
the check instruction, then the speculation has failed.
Hardware support for memory
Reference Speculation …
 If the location has not been touched, then the
speculation is successful.
 Speculation failure can be handled in two ways.
 If only the load instruction was speculated, then
it suffices to redo the load at the point of the
check instruction (which could supply the target
register in addition to the memory address).
 If additional instructions that depend on the load
were also speculated, then a fix-up sequence
that reexecutes all the speculated instructions
starting with the load is needed.
 In this case, the check instruction specifies the
address where the fix-up code is located.


Sections To be covered from Chapter
4
 Section 4.5
 Software versus Hardware based
scheduling
Memory Hierarchy

Memory Hierarchy & Cache


Memory
Entry Quiz

1. Primary cache or level 1 cache is


implemented in a separate chip

a. True
b. False
Entry Quiz

2. SRAM is implemented using

a. Flip-Flop
b. Magnetic core
c. Capacitor
d. Non-volatile Technology
Entry Quiz

3. Main memory (200 ns) is slower


compared to a register (0.2 ns) by an
order of

a. 3
b. 4
c. 1000
d. 10000
Entry Quiz

4. Virtual Memory is

a. Same as caching
b. Same as associative memory
c. Different from caching
d. Same as disk memory
Entry Quiz

5. Cache Miss occurs when the


a. Required instruction is not found in the
cache
b. Required data is not found in the cache
c. Required instruction or data is not
found in the cache
d. Required instruction or data is not
found in the main memory
e. For all of the above conditions
Module Objective
 To understand
1. Memory requirements of different computers
2. Memory hierarchy and the motivation behind it
3. Moore’s Law
4. Principles of Locality
5. Cache Memory and its implementation
6. Cache Performance
7. Terms: Cache, Cache Miss, Cache Hit, Latency,
Bandwidth, SRAM, DRAM, by an order of, Direct
Mapping, Associative Mapping, Set Mapping, Write
Through, Write Allocated, Write Back, Dirty Bit, and
Valid Bit
Memory Requirements

 In general we would like to have
 Faster memory (lower access time
or latency)
 Larger memory (capacity and
bandwidth)
 Simpler memory
Memory Requirements - Server, Desktop,
and Embedded devices
 Server
 Lower access time
 Higher bandwidth*
 Better protection*
 Larger memory
 Desktop
 Lower access time
 Larger memory
 Embedded
 Lower access time
 Simpler memory*
Processor-Memory Gap
Moore’s Law
 Transistor density on a chip die doubles
roughly every 1.5 years.

 Short reference:
http://en.wikipedia.org/wiki/Moore's_law
What is Memory Hierarchy and Why?
[Figure: CPU registers (0.25 ns) <-> main memory (250 ns) <-> bus adapter <-> storage & I/O devices (2,500,000 ns!)]
Memory Hierarchy & Cache
Cache
 Cache is a smaller, faster, and expensive memory.
 Improves the throughput/latency of the slower
memory next to it in the memory hierarchy.
 Blocking reads and delaying the writes to slower
memory offers better performance.
 There are two cache memories L1 and L2 between
CPU and main memory.
 L1 is built into CPU.
 L2 is an SRAM.
 Data, Instructions, and Addresses are cached.
Cache Operation

 Cache Hit: CPU finds the required data item (or


instruction) in the cache.
 Cache Miss: CPU does not find the required item
in the cache.
 CPU Stalled
 Hardware loads the entire block that contains the
required data item from the memory into cache.
 CPU continues to execute once the cache is loaded.
 Spatial Locality
 Principle of Locality
Hit and Miss
 Hit
 Hit Rate
 Hit Time
 Miss
Higher Level
 Miss Rate
 Miss Penalty
 Hit Time << Miss Penalty Lower Level
Cache Performance
Program Execution Time
 CPU Clock Cycle
 Cycle Time
 IC – Instruction Count
 Program Execution Time (simple model)
= (Useful CPU Cycles + Stalled CPU Cycles) x Cycle Time
Stalled CPU Cycles
Stalled CPU Cycles
= Number of Cache Misses x Miss Penalty
= IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty

Note: Miss Penalty is in CPU cycles; IC x (Memory Accesses / Instruction) is the number of cache references.
Separating Read Miss from Write
Miss
Stalled CPU Cycles
= IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
= IC x (Read Accesses / Instruction) x Read Miss Rate x Read Penalty
+ IC x (Write Accesses / Instruction) x Write Miss Rate x Write Penalty
Example 1
 Assume we have a computer where Clock
cycles Per Instruction (CPI) is 1.0 when all
memory accesses are cache hits. The
only data accesses are loads and stores
and these total 50% of the instructions. If
the miss penalty is 25 clock cycles and
miss rate is 2%, how much faster would
the computer be if all instructions were
cache hits?
Example 1 …

1. Execution time for the program in a
computer with 100% hits
= (Useful CPU Cycles + Stalled CPU Cycles) x Cycle Time
= (IC x CPI + 0) x Cycle Time
= IC x 1.0 x Cycle Time
Example 1 …
2. Stall cycles when there are cache misses:
= IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty x Cycle Time
= IC x (1 + 0.5) x 0.02 x 25 x Cycle Time
= IC x 0.75 x Cycle Time

3. Total Execution Time:
= (IC x 1 + IC x 0.75) x Cycle Time
= 1.75 x IC x Cycle Time
Example 1 …

4. Ratio
= (1.75 x IC x Cycle Time) / (IC x 1.0 x Cycle Time)
= 1.75

The computer with all cache hits is 1.75 times faster.
Example 2
 Assume we have a computer where Clock
cycles Per Instruction (CPI) is 1.0 when all
memory accesses are cache hits. The only
data accesses are loads and stores and
these total 50% of the instructions. Miss
penalty for read miss is 25 clock cycles
and miss penalty for write miss is 50 clock
cycles. Miss rate is 2% and out of this 80%
is Read miss. How much faster would the
computer be if all instructions were cache
hits?
Cache Operation

[Figure: a cache of C = 16 lines sits between the CPU and main memory. Each cache line holds a tag and a block of K words; main memory holds 2^N words organized into blocks of K words. The tag records which memory block currently occupies the line.]
Elements of Cache Design
 Cache Size
 Block Size
 Mapping Function
 Replacement Algorithm
 Write Policy
 Write Miss
 Number of caches
 Split versus Unified/Mixed Cache
Mapping Functions

[Figure: the address from the CPU is split into Tag, Line, and Word fields. 1. Select the cache line; 2. Copy the stored tag; 3. Compare it with the address tag; 4. Hit or Miss; 5. On a miss, load the block from main memory (blocks 0 to n) into the line.]
Mapping Function

 Direct
 The line field of the address points to exactly one line in the
cache: 1 tag comparison.
 Set Associative
 The line field points to a set of lines in the cache (typically
2/4/8, so 2/4/8 tag comparisons). This is known as 2/4/8-way set
associative.
 Associative
 The line field is always 0, i.e., an address can map to any of the
lines in the cache (m tag comparisons, 4 in the figure).
 Uses Content Addressable Memory (CAM) for comparison.
 Needs a non-trivial replacement algorithm.
[Figure: 2-way set associative example. The address from the CPU is split into Tag, Set, and Word fields; the set field selects Set 0 or Set 1 (two lines each), and the tags of both lines in the selected set are compared with the address tag. Main memory: 512 words in 128 four-word blocks (Block 0 = words 0-3, ..., Block 127 = words 508-511).]
Mapping Function Comparison

Cache Type             Hit Ratio                          Search Speed
Direct Mapping         Good                               Best
Fully Associative      Best                               Moderate
N-Way Set Associative  Very good, better as N increases   Good, worse as N increases
Replacement Algorithm
 Least Recently Used
 First In First Out
 Least Frequently Used
 Random
Write Policy
 Write-through
 Information written to cache and the memory

 Write back
 Information written only to cache. Content of the
cache is written to the main memory only when this
cache block is replaced or the program terminates.

 Dirty bit is used to indicate that a cache block


needs “Write back”
Write Miss
 Write-Allocate
 No Write-Allocate
Number of Caches
 Primary Cache (CPU)
 Secondary Cache (SRAM)
 L3 Cache (Cheaper SRAM)
Split versus Unified/Mixed Cache
 Single or two Level
 Unified or Split
 Misses per 1000 instructions with various cache
sizes (e.g., an 8 KB instruction cache: 8.1 misses
per 1000 instructions)
Improving Cache Performance
 Reducing Penalty
 Reducing Misses
 Compiler optimizations that reduce cache
misses fall under this category

Stalled Cycles
= IC x (Memory Accesses / Instruction) x Miss Rate x Miss Penalty
Improving Cache Performance Using
Compiler
 Compilers are built with the following optimizations:
Instructions:
Reordering instructions to avoid conflict misses
Data:
Merging Arrays
Loop Interchange
Loop Fusion
Blocking
Merging Arrays

/* Two separate arrays: key[i] and value[i] may conflict in the cache */
int key[size];
int value[size];

/* Instead: key,value pairs, no conflict */
struct merge {
int key;
int value;
};
struct merge merged_array[size];

[Figure: the key and value arrays laid out separately vs. interleaved as key,value pairs]
Loop Interchange
/* Before: the innermost loop strides through memory (poor locality) */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2*x[i][j];

/* Instead: the innermost loop touches consecutive addresses */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2*x[i][j];

[Figure: memory layout X(0,0), X(0,1), X(0,2), ..., X(0,4999), X(1,0), X(1,1), ...]
Blocking
Loop Fusion
for (i = 0; i < 5000; i = i+1)
  a[i] = i;

for (i = 0; i < 5000; i = i+1)
  b[i] = b[i]+a[i];

/* Instead */

for (i = 0; i < 5000; i = i+1) {
  a[i] = i;
  b[i] = b[i]+a[i];
}
Example 2
 Assume a fully associative write-back cache
with many cache entries that starts empty.
Below is a sequence of five memory operations
(the address is in square brackets).
 What are the numbers of hits and misses using
no-write allocate versus write allocate?

WriteMem[100];
WriteMem[100];
ReadMem[200];
WriteMem[200];
WriteMem[100];
Exit Quiz
1. In memory hierarchy top layer is occupied by
the

a. Fastest and the most expensive memory


b. Slowest and the most expensive memory
c. High band width and the fastest memory
d. Slowest and the least expensive memory
Exit Quiz

2. The memory technology suitable


for L2 cache is

a. EPROM
b. DRAM
c. SRAM
d. Flash
Exit Quiz

3. Server’s memory requirements are


different from desktop’s memory
requirements
a. True
b. False
Exit Quiz

4. Hit Rate is (1 – Miss Rate)


a. True
b. False

5. Gordon Moore’s law states that the


transistor density
a. Triples every year
b. Doubles every two years
c. Doubles every 1.5 years
d. Doubles every year
Improving Cache Performance
 Two options
 By Reducing Miss Penalty
 By Reducing Miss Rate
Reducing Cache Miss Penalty

1. Multilevel caches
2. Critical word first & early restart
3. Priority for Read misses over Write
4. Merging Write Buffers
5. Victim Cache
Reducing Miss Penalty - Multilevel
caches (1)

[Figure: (a) CPU + cache (2 ns, 16K, miss rate 0.04) + main memory (100 ns, 128M): faster but smaller. (b) CPU + cache (10 ns, 512K, miss rate 0.02) + main memory (100 ns, 128M): slower but larger. (c) CPU + Cache 1 (2 ns, 16K) + Cache 2 (10 ns, 512K), about 528K combined, + main memory (100 ns, 128M): faster and larger.]
Reducing Miss Rate & Global Miss Rate
(1)

[Figure: CPU + Cache 1 (2 ns, 16K, miss rate 0.05) + Cache 2 (10 ns, 512K, miss rate 0.02) + main memory (100 ns, 128M), behaving like a single 2.6 ns, 528K cache: faster and larger.]

Average memory access time = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2
Global Miss Rate = Miss RateL1 x Miss RateL2
Reducing Miss Penalty – Critical
Word First and Early Restart (2)
 Critical word first
 Early restart
 Lets discuss the following:
 How is Miss Penalty improved?
 Under what conditions this is useful?
 Is it worth going through all the
trouble?
Reducing Miss Penalty – Priority To
Read Miss (3)
 Read Miss takes priority
 Write to Memory is put on hold
 Lets discuss the following:
 Is there a problem?
 How is Miss Penalty improved?
 How this is done in Write-through?
 How this is done in Write-back?
 Under what conditions this is useful?
 Is it worth going through all the trouble?
Reducing Miss Penalty – Merging
Write Buffer (4)
 Write-through is sent to a write buffer
 Buffer & Write possible options:
1. Buffer is empty – write address and data to buffer
2. Buffer is not full – write or merge data
3. Buffer is full – Stall until buffer is written to memory.
 Note: A block write to memory is more efficient than
multiple single-word writes.
 Lets Discuss the following:
 How many different ways do we optimize Miss Penalty in
this scheme?
 Explain the Fig 5.12 in the book
 When there is no block write (like I/O device), no
merging
Reducing Miss Penalty – Fully
Associative Victim Cache (5)
 Key word is Recycle
 Another implicit key word is
complex (hardware)
 Aimed at handling conflict misses
 1 to 6 victim caches are ideal
Reducing Misses
 Misses due to 3Cs: Compulsory,
Capacity, and Conflict
 Reducing Misses Techniques
1. Larger Lines/Blocks
2. Larger Caches
3. Higher Associativity
4. Way prediction and pseudoassociative caches
5. Compiler Techniques

Notes: copy tables 5.14 and 5.17 to two slides


Reducing Cache Misses – Larger Block
Size (1)

 Larger block improves number of misses up to a


point!
 Points to discuss
 Why number of misses starts increasing for larger
blocks?
 Low latency encourages smaller blocks and higher
latency encourages larger blocks.
 Low bandwidth encourages smaller blocks and higher
bandwidth encourages larger blocks.
 Effect on Miss Penalty with larger blocks
 What is the other name for 1-way set associative mapping?
Reducing Cache Misses – Larger Caches
(2)

 Obviously!
Reducing Cache Misses – Higher
Associativity (3)

 Miss rate improves with associativity


 Points to discuss
 Complexity of set associative mapping
versus the improvement.
 Rule of thumb 1: 8-way associative is as
good as fully associative.
 Rule of thumb 2: 2-way associative is as
good as direct mapping.
 Greater associativity increases Hit time
Reducing Cache Misses – Pseudo-
Associative Caches (4)
 Let’s understand this using an 8-way
set associative instruction cache.
 Instructions exhibit better locality.
 Each access to the instruction cache
normally needs 8 comparisons.
 Using locality, predict the next block
to access in the set to reduce the
number of comparisons.
 Effect on hit time and cache misses

[Figure: tags of the ith set of the cache, Block 0 through Block 7]
Reducing Hit Time – Small and Simple
Cache (1)

 Tag comparison is complex,


specifically in associative mapping
 Tag checking can be overlapped
with data transfer to reduce the hit
time.
 Tag checking in CPU chip, cache in
a chip by itself. Provides better hit
time and larger cache capacity
Reducing Hit Time – Virtual Cache (2)

 Addressing
 VA -> Physical Address -> cache
 Skip two levels, VA maps to cache
 Problems:
 No page boundary checks
 Building direct mapping between VA
and Cache for every process is not
easy.
Reducing Hit Time – Pipelined Cache (3)

 Increases the cache throughput, not the
access time
Reducing Hit Time – Trace Cache

 Increasing instruction-level parallelism
 Instead of 4 consecutive cache locations,
load the next 4 instructions required by
the CPU using a trace.
 Folding branch prediction into the cache
Improving Parallelism with Cache
 Cache Hit under Cache Miss
 Cache Miss under Cache Miss
 Both require non-blocking cache, out
of order execution CPU
 Other methods to improve
performance using parallelism:
 Hardware pre-fetching of instruction
 Software (compiler) pre-fetching of data
Preparation

Write-back
 Complicates the cache coherency problem
 Low memory access overhead
 Better cache access time than write-through
 Requires higher memory bandwidth if blocking

Write-through
 Simplifies the cache coherency problem
 High overhead
 If the cache is blocking, higher access time
 Requires lower memory bandwidth

 Note: No need to describe these two policies
 Write-through does not buy anything extra for
a single-processor system due to the absence of
cache coherency issues
CPU Execution Time
CPU Execution Time
= (CPU Cycles + Stalled Cycles) x Cycle Time

 Stalled Cycles = misses x penalty
 Misses are given either as misses/1000 instructions or
misses/memory-access, AKA miss rate.
 Instruction Count, Cycles per Instruction, and Misses are also
required to compute CPU execution time.
Average Access Time with Cache
Average Access Time
= Hit Time + Miss Rate x Penalty

Multi-level Cache
Avg Access Time = Hit TimeL1 + Miss RateL1 x PenaltyL1
PenaltyL1 = Hit TimeL2 + Miss RateL2 x PenaltyL2
Addressing

[Figure: 2-way set associative addressing. The CPU address is split into Tag, Set, and Word fields. 1. Select the set; 2. Copy and 3. Compare the tags of both lines in the set; 4. Hit or Miss; 5. On a miss, load the block from a 512-word main memory of 128 blocks.]
Assignment I – Due same day next
week
 Mapping functions
 Replacement algorithms
 Write policies
 Write Miss policies
 Split Cache versus Unified Cache
 Primary Cache versus Secondary Cache
 Compiler cache optimization
techniques with examples
Assignment II - Due same day next
week
 Multilevel Cache
 Cache Inclusion/Exclusion Property
 Thumb rules of cache
 Compiler pre-fetch
 Multi-level Caching + one another Miss
Penalty Optimization technique
 Two miss Rate Optimization Techniques
Assignment III - Due 2nd class of next
week
 All odd numbered problems from
cache module of your text book.
Assignment IV - Due 2nd class of next
week
 All even numbered problems from
cache module of your text book.
CPU Execution Time & Average Access
Time

[Figure: CPU - cache (1 CC) - memory (100 CC)]

With Multi-level Cache

[Figure: CPU - Cache 1 (1 CC) - Cache 2 (10 CC) - memory (100 CC)]
Memory Hierarchy

Main Memory
Main Memory
 Module Objective
 To understand Main memory latency
and bandwidth
 Techniques to improve latency and
bandwidth
Memory Hierarchy & Cache
Main Memory – Cache – I/O

[Figure: CPU + cache (0.25 ns) - main memory (250 ns) - bus adapter - storage & I/O devices (2,500,000 ns!)]

 Cache prefers low-latency main memory
 I/O & multiprocessors prefer high-bandwidth/high-throughput
main memory
Main Memory Access Time

[Figure: CPU + cache connected to main memory by an address bus and a data bus. Sending the address takes 4 CC, the memory access takes 56 CC, and returning the data takes 4 CC (CC = clock cycle).]

Access time per word = 4 + 56 + 4 = 64 CC
One word is 8 bytes
Latency is 1 bit/CC
Improving Memory Performance
 Improving Latency ( time to access
1 memory unit - word)
 Improving bandwidth (bytes
accessed in unit time)
Improving Memory Bandwidth
[Figure: three organizations. Simple design: CPU - cache - memory over 64-bit buses. Wider bus: CPU - cache over 64 bits, cache - memory over 4x64 bits. Interleaved: CPU - cache - memory over 64-bit buses, with words 0-15 spread round-robin across Bank 0 to Bank 3 (cache block = 4 words, one word = 8 bytes).]
Bandwidth, Latency, Penalty
Interleaving Factor
Address allocation with Interleaving
Problem 1
 Interleaving factor
 What is the optimal interleaving factor if the memory
cycle is 8 CC (1+6+1)?
 Assume 4 banks

[Timeline: banks b1-b4 over cycles 1-20, with the 1st, 2nd, 3rd, and 4th read requests issued on successive cycles]
Problem 2 (page 452 in your text
book)
 Block size = 1 word
 Memory bus size = 1 word
 Miss rate = 3%
 Memory accesses per instruction = 1.2
 Cache miss penalty = 64 CC
 Avg cycles per instruction = 2
[Figure: CPU - cache - simple memory, 64 CC miss penalty]

Assume 1000 instructions in your program.
With no misses, the execution time is 2000 CC.
One instruction needs 1.2 memory accesses, so 1000 instructions -> 1200 accesses. With a 3% miss rate, the number of misses in 1200 accesses is 36.
The execution time is
= 2000 + 36 x 64 = 4304 CC
Average cycles per instruction = 4304/1000 = 4.3
Problem 2 (wider bus – 2 words)
 Block size = 4 words
 Memory bus size = 2 words
 Miss rate = 2.5%
 Memory accesses per instruction = 1.2
 Cache miss penalty = 128 CC
 Avg cycles per instruction = 2
[Figure: CPU - cache - memory over a 2-word bus, 64 CC per transfer]

Assume 1000 instructions in your program.
With no misses, the execution time is 2000 CC.
1000 instructions -> 1200 accesses; with a 2.5% miss rate, 30 misses.
The execution time is
= 2000 + 30 x 128 = 5840 CC
Average cycles per instruction = 5840/1000 = 5.84
Problem 2 (2-way interleaving)
 Block size = 4 words
 Memory bus size = 1 word
 Miss rate = 2.5%
 Memory accesses per instruction = 1.2
 Cache miss penalty = 68 x 2 CC
 Avg cycles per instruction = 2
[Figure: CPU - cache - 2-way interleaved memory]

Assume 1000 instructions in your program.
With no misses, the execution time is 2000 CC.
1000 instructions -> 1200 accesses; with a 2.5% miss rate, 30 misses.
The execution time is
= 2000 + 30 x 68 x 2 = 6080 CC
Average cycles per instruction = 6080/1000 = 6.08
DRAM (Dynamic Random Access
Memory)
 Address Multiplexing
 RAS and CAS
 Dynamic refreshing and implications
 Data to be written after a READ!
 Amdahl’s rule of thumb
 1000 MIPS should have 1000 M memory
 Read your text book for DRAM, SRAM,
RAMBUS, and Flash.
SRAM (Static Random Access
Memory)
 Uses six transistors per bit.
 Static
 NO NEED to write data back after a READ
 Comparison versus DRAM
 More transistors
 Less capacity: 1/4 to 1/8
 More expensive: 8-10 times
 No refresh
 Faster: 8-16 times
 Used in caches instead of main memory
Flash
 Similar to EEPROM
 Ideal choice for embedded systems
 Low power requirement, no moving parts
 Faster READ time, slower WRITE time
 Comparison versus EEPROM
 Finer and simpler control
 Better capacity
 WRITE operation: an entire block must be erased
and rewritten
 Writes are 50 to 100 times slower than reads
 NOR-type flash and NAND-type flash
Virtual Memory

[Figure: virtual pages A (0), B (4K), C (8K), and D (12K); A maps to physical address 16K, B to 24K, C to 4K, and D resides on disk.]
Virtual Memory Versus First-Level Cache

Parameter        First-Level Cache   Virtual Memory
Block/Page Size  16-128 B            4K-64K B
Hit Time         1-3 CC              50-150 CC
Miss Penalty     8-150 CC            1-10 M CC
  Access Time    6-130 CC            0.8-8 M CC
  Transfer Time  2-20 CC             0.2-2 M CC
Miss Rate        0.1-10%             0.00001-0.001%
Address Mapping  25-45 bit physical address   32-64 bit virtual address
                 to 14-20 bit cache address   to 25-45 bit physical address
Memory Hierarchy Questions
 Where can a block be placed?
 How is a block found if it is in?
 Which block should be replaced?
 What happens on a write?
Segmentation versus Paging

[Figure: paging divides code and data into fixed-size pages; segmentation divides them into variable-size segments.]

Remark                   Page                    Segment
Words per address        One                     Two (segment and offset)
Programmer visible       Invisible               Visible to programmer
Replacing a block        Trivial                 Hard
Memory use inefficiency  Internal fragmentation  External fragmentation
Efficient disk traffic   Yes (easy to tune       Not always (small
                         the page size)          segments)
I/O Systems
 Objective
 To understand mainly disk storage
technologies.
Storage Systems
 Individual Disks
 SCSI Subsystem
 RAID
 Storage Area Networks (NAS & SAN)
Disk Storage

 Disk storage is slower
than memory.
 Disk offers better
volume at a lower
unit price.
 Fits nicely into the
virtual memory
concept.
 A critical piece of the
performance puzzle.
 Critical piece of the
performance puzzle. Courtesy:
www.shortcourses.com/ choosing/storage/06.htm
Disk System
 Disk pack & Cache
 Disk Embedded Controller
 SCSI Host Adapter
 OS (kernel/driver)
 User Program
[Figure: host I/O bus - SCSI host adapter - terminated SCSI bus, with embedded controllers and disks hanging off the SCSI bus]
Individual Disks

Reference: http://www.stjulians.com/cs/diskstoragenotes.html
Components of Disk Access Time

 Seek Time
 Rotational Latency
 Internal Transfer Time
 Other Delays
 That is,
 Avg Access Time =
Avg Seek Time + Avg Rotational Delay +
Transfer Time + Other overhead
Problem (page 684)
 Seek Time = 5 ms/100 tracks
 RPM = 10000
 Transfer rate = 40 MB/sec
 Other delays = 0.1 ms
 Sector size = 512 B

Average Access Time =
Average Seek Time (5 ms) +
Average Rotational Delay (time for 1/2 revolution) +
Transfer Time (512/(40 x 10^6) s) +
Other overhead (0.1 ms)
= 5 + (60 x 10^3)/(2 x 10000) + (512 x 10^3)/(40 x 10^6) + 0.1 ms
= 5 + 3 + 0.0128 + 0.1 = 8.1128 ms
RAID
 RAID stands for Redundant Array of
Inexpensive/Individual Disks
 Uses a number of little disks instead of one large
disk.
 Six+ Types of RAID (0-5).
Terminologies
 Mean Time Between Failure (MTBF)
 Mean Time To Data Loss (MTTDL)
 Mean Time To Data Inaccessability
(MTTDI)
RAID-0 Striping

chunk Stripe width


Logical
Storage 01 02 03 04 05 06 07 08 09 10 11 12

01 02 03 04
05 06 07 08
Physical
09 10 11 12
Storage

n number of independent disks


RAID-0 Performance
 Throughput: best case, nearly n x the single-disk value
 Utilization: worst case, nearly (1/n) x the single-disk value
 Data reliability: r^n, where r is the reliability of one disk (r <= 1)
 Sequential access: fast
 Random access: multithreaded random access offers
better performance
 When r = 0.8 and n = 2, reliability is 0.64
RAID-1 Mirroring
 Issue Addressed: RAID-0
Reliability Problem.
 Shadowing or mirroring 01
is used. 02
 Data is not lost when a 03
disk goes down.
 One or more disks can be
used to mirror primary
disk.
 Writes are posted to 01
primary and shadowing 02
disks. 03
 Read from any of them.
RAID-1 Performance
 Reliability is improved with mirroring:
 1 - (1-r)(1-r) = 1 - (1-r)^2
 Example: when r is 0.8, the reliability of RAID-1 is
0.96.
 Writes are more complex – must be committed to
primary and all shadowing disks.
 Writes are much slower due to the atomicity requirement.
 Expensive – due to 1-to-1 redundancy.
 Throughput: Same as single disk value
 Utilization: ½ of single disk value
RAID 0+1 - Striping & Mirroring
RAID 0 + 1
RAID 1

RAID 0 RAID 0

01 02 01 02
03 04 03 04
05 06 05 06

n disks n disks
Performance RAID-0+1
 Let the Reliability of a RAID 0 sub-tree be R':
 Then the reliability of the RAID 1 tree R = 1 - (1-R')(1-R')
 Reliability R' is:
 R' = r^2 (reliability of a single disk is r)
 Throughput is same as RAID-0, however with 2 x
n disks
 Utilization is lower than RAID-0 due to mirroring
 “Write” is marginally slower due to atomicity
 When r = 0.9, R’ = 0.81, and R = 1 – (0.19)2 = .96
RAID 1+0 - Mirroring & Striping
RAID 1+0

RAID 0

RAID 1 RAID 1

01 01 02 02
03 03 04 04
05 05 06 06
Performance RAID-1+0
 Let the Reliability of a RAID 1 sub-tree be R':
 Then the reliability of the RAID 0 tree = (R')^2
 Reliability R' is:
 R' = 1 - (1-r)^2 (reliability of a single disk is r)
 Throughput is same as RAID-0, however with 2 x n disks
 Utilization is lower than RAID-0 due to mirroring
 “Write” is marginally slower due to its atomicity
 When r = 0.9, R’ = 0.99, and R = (0.99)2 = .98
RAID-2 Hamming Code Arrays
 Low commercial interest due to
complex nature of Hamming code
computation.
RAID-3 Striping with Parity

Single Bits
Or Words Stripe width
Logical
Storage 01 02 03 04 05 06 07 08 09 10 11 12

01 02 03 04 P0
Physical
Storage 05 06 07 08 P1
09 10 11 12 P2

Parity Disk
RAID-3 Operation
 Based on the principle of reversible form
of parity computation.
 Where Parity P = C0 ⊕ C1 ⊕ … ⊕ Cn-1 ⊕ Cn
 Missing Stripe Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm-1 ⊕ Cm+1 ⊕ … ⊕ Cn-1 ⊕ Cn
RAID-3 Performance
 RAID-1’s 1-to-1 redundancy issue is addressed by 1-for-n
Parity disk. Less expensive than RAID-1
 Rest of the performance is similar to RAID-0.
 This can withstand the failure of one of its disks.
 Reliability = P(all disks working) + P(exactly one disk failed)

= r^n + nC1 · r^(n-1) · (1-r)
 When r = 0.9 and n = 5
= 0.9^5 + 5 × 0.9^4 × (1 - 0.9)
= 0.59 + 5 × 0.6561 × 0.1
= 0.59 + 0.33 = 0.92
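The survival formula above (all disks working, or exactly one failed and rebuilt from parity) is a two-term binomial sum; a minimal Python sketch, with a function name of my own choosing:

```python
from math import comb

def raid3_reliability(r, n):
    # Array survives if all n disks work, or exactly one fails
    # (parity reconstructs the missing disk).
    all_working = r ** n
    one_failed = comb(n, 1) * r ** (n - 1) * (1 - r)
    return all_working + one_failed

print(round(raid3_reliability(0.9, 5), 2))  # → 0.92
```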
RAID-4 Performance
 Similar to RAID-3, but supports
larger chunks.
 Performance measures are similar
to RAID-3.
RAID-5 (Distributed Parity)
Chunk
Stripe width
Logical
Storage 01 02 03 04 05 06 07 08 09 10 11 12

Physical 01 02 03 04 P0
Storage P1 05 06 07 08
09 P2 10 11 12

Parity blocks distributed across the disks
RAID-5 Performance
 In RAID-3 and RAID-4, Parity Disk is
a bottleneck for Write operations.
 This issue is addressed in RAID-5.
Reading Assignment
 Optical Disk
 RAID study material
 Worked out problems from the text
book: p 452, p 537, p 539, p561, p
684, p 691, p 704
Amdahl’s Speedup Multiprocessors
Speedup = 1 / (Fraction_e/Speedup_e + (1 - Fraction_e))

Speedup_e - number of processors
Fraction_e - fraction of the program that runs
in parallel on Speedup_e processors
 
Assumption: Either the program runs in fully parallel
(enhanced) mode making use of all the processors or non-
enhanced mode.
Multiprocessor Architectures
 Single Instruction Stream, Single Data Stream (SISD):
 Single Instruction Stream, Multiple Data Stream (SIMD) 
 Multiple Instruction Streams, Single Data Stream (MISD)  
 Multiple Instruction Streams, Multiple Data Streams (MIMD)
Classification of MIMD Architectures
 Shared Memory:
 Centralized shared memory architecture OR
Symmetric (shared-memory) Multiprocessors (SMP)
OR Uniform Memory Access (UMA).
 Distributed shared memory architecture OR Non-
Uniform Memory Access (NUMA) architecture.
 Message Passing:
 Multiprocessor Systems based on messaging
Shared Memory versus Message
Passing
No | Shared Memory | Message Passing
1 | Compatibility with the well-understood shared memory mechanism used in centralized multiprocessor systems | Simpler hardware compared to scalable shared memory
2 | Simplifies compiler design | Communication is explicit. This means it is easier to understand
3 | No need to learn a messaging protocol | Improved modularity
4 | Low communication overhead | No need for expensive and complex synchronization mechanisms similar to the ones used in shared memory
5 | Caching to improve latency |

Two types of Shared Memory
architectures

Centralized Shared-Memory Architecture

[Figure: four processors, each with a cache, sharing a bus to Main Memory and the I/O Systems]

Distributed Shared-Memory Architecture

[Figure: processors, each with a cache and local Memory and I/O, connected by an Interconnection Network]
Symmetric versus Distributed
Memory MP
 SMP: Uses shared memory for Inter-process Communication
 Advantage:
 Close coupling due to shared memory

Sharing of data is faster between processors
 Disadvantage
 Scaling: Memory is a bottleneck

High and unpredictable Memory Latency 

 Distributed : Uses message passing for Inter-process Communication


 Advantage:

Low Memory Latency
 Scaling: Scales better than SMP
 Disadvantage

Control and management are more complex due to distributed memory
Performance Metrics
 Communication bandwidth  
 Communication Latency: Ideally latency is as low as
possible. Communication latency is
= Sender Overhead + Time of flight +
Transmission Time + Receiver Overhead

Communication Latency Hiding
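The latency sum on this slide can be sketched as a trivial Python helper; the numbers below are illustrative (all in microseconds), not from the text:

```python
def comm_latency(sender_overhead, time_of_flight, transmission, receiver_overhead):
    # Total end-to-end communication latency as defined on the slide.
    return sender_overhead + time_of_flight + transmission + receiver_overhead

print(comm_latency(1.0, 0.5, 2.0, 1.5))  # → 5.0
```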
Cache Coherence Problem
Time | Event | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
0 | | | | 1
1 | CPU A reads X | 1 | | 1
2 | CPU B reads X | 1 | 1 | 1
3 | CPU A stores 0 in X | 0 | 1 | 0
Cache Coherence

A Memory System is coherent if

1. A read by a processor to location X that follows a write by another


processor to X returns the written value if the read and write are
sufficiently separated in time and no other writes to X occur between
the two accesses. Defines coherent view of memory.

2. Writes to the same location are serialized; that is, two writes to the same
location by any two processors are seen in the same order by all
processors. For example, if the values 1 and then 2 are written to a
location, processors can never read the value of the location as 2 and
then later read it as 1.
Coherency and Consistency
 Coherency and consistency are complementary.

 Coherency defines the behavior of reads and writes to


the same memory location, while consistency defines
the behavior of reads and writes with respect to other
memory locations.
Features supported by Coherent
Multiprocessors
 Migration: Shared data in a cache can be moved to
another cache directly (without going through the main
memory). This is referred to as migration. It is used in a
transparent fashion. It reduces both the latency of access
(otherwise incurred by a remote access every time the data item
is accessed) and the demand on precious memory bandwidth.

 Replication: Cache also provides replication for shared


data items that are being simultaneously read, since
the caches make a copy of the data item in the local
cache. Replication reduces both latency of access and
contention for a read shared data item.
Migration and Replication

Centralized Shared -Memory Architecture (SMP)

[Figure: SMP with four processors; each processor has a cache with a cache controller (cc); all cache controllers snoop a shared bus connecting Main Memory and the I/O Systems]
Cache Coherent (CC) Protocols
 CC protocol implement cache coherency
 Two types:
 Snooping (Replicated): There are multiple copies of
the sharing status. Every cache that has a copy of the
data from a block of physical memory also has a copy
of the sharing status of the block, and no centralized
state is kept.
 Directory based (logically centralized): There is only
one copy of the sharing status of a block of physical
memory. All the processors use this one copy. This
copy could be in any of the participating processors.
Snooping Protocol
Centralized Shared-Memory Architecture (SMP)

Two ways:
1. Write Invalidation
2. Write Broadcast

[Figure: four processors, each with a cache and cache controller (cc), snooping a shared bus that connects Main Memory and the I/O Systems]
Invalidation of Snooping Protocol

Processor Activity | Bus Activity | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
 | | | | 0
CPU A reads X (block) | Cache miss for X | 0 | | 0
CPU B reads X (block) | Cache miss for X | 0 | 0 | 0
A writes 1 to X | Invalidation for X | 1 | | 0
B reads X | Cache miss for X | 1 | 1 | 1
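The invalidation sequence in this table can be mimicked with a toy Python model. The class and function names are my own, and the model is a deliberate simplification (write-back: memory is only updated when a dirty copy migrates on a later miss):

```python
class Cache:
    def __init__(self):
        self.value = None   # None means "not cached / invalidated"

def read(cache, other, memory):
    if cache.value is None:            # cache miss
        if other.value is not None:    # migrate (write back) the other copy
            memory[0] = other.value
        cache.value = memory[0]
    return cache.value

def write(cache, other, memory, value):
    other.value = None   # bus invalidate removes the other cache's copy
    cache.value = value  # write-back cache: memory updated later, on migration

memory = [0]
a, b = Cache(), Cache()
read(a, b, memory)         # CPU A reads X: miss, loads 0
read(b, a, memory)         # CPU B reads X: miss, loads 0
write(a, b, memory, 1)     # A writes 1: invalidation for X, B's copy dropped
print(read(b, a, memory))  # B misses again; A's dirty copy written back → 1
```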
Write Broadcast of Snooping Protocol

Processor Activity | Bus Activity | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
 | | | | 0
CPU A reads X (block) | Cache miss for X | 0 | | 0
CPU B reads X (block) | Cache miss for X | 0 | 0 | 0
A writes 1 to X | Write broadcast for X | 1 | 1 | 1
B reads X | | 1 | 1 | 1
Reading Assignment
 Section 7.4 (Reliability, Availability,
& Dependability) in your text book
 Page 554 and 555
 Section 7.11 I/O Design – attempt
all the 5 problems in this section.
Invalidation versus Write Distribute
 Multiple writes to the same data item with no intervening reads require multiple
write broadcast in an update protocol, but only one initial invalidation in a write
invalidate protocol.
 
 With multiword cache blocks, each word written in a cache block requires a write
broadcast in an update protocol, although only the first write to any word in the
block needs to generate an invalidate in an invalidation protocol.
 
 An invalidation protocol works on cache blocks, while an update protocol must
work on individual words.
 
 The delay between writing a word in one processor and reading the written value in
another processor is usually less in a write update scheme, since written data are
immediately updated in the reader’s cache. By comparison, in an invalidation
protocol, the reader is invalidated first, then later reads the data and is stalled until
a copy can be read and returned to the processor.
Coherence Misses
 Write Invalidate causes Read cache
misses
 True Miss: If an item X in a block is made
exclusive by write invalidate and the read of
the same item in another processor causes
read miss then it is a true miss.
 False Miss: If non-exclusive items in the
same block cause the miss then it is a false
miss.
Use of Valid, Shared, and Dirty bits
 Valid bit: Every time a block is loaded into a cache from memory, the tag for the
block is saved in cache and the valid bit is set to TRUE. A write update to the same
block in a different processor may reset this valid bit due to write invalidate. Thus,
when a cache block is accessed for READ or WRITE, tag should match AND the
value of the valid bit should be TRUE. If the tag matches but the valid bit is reset, then
it is a cache miss.
 
 Shared bit: When a memory block is loaded into a cache block for the first time the
shared bit is set to FALSE. When some other cache loads the same block, it is
turned into TRUE. When this block is updated, write invalidate uses the value of
shared bit to decide whether to send write-invalidate message or not. If the shared
bit is set then an invalidate message is sent, otherwise not.

 Dirty bit: Dirty bit is set to FALSE when a block is loaded into a cache memory. It is
set to TRUE when the block is updated the first time. When another processor
wants to load this block, this block is migrated to the processor instead of loading
from the memory.
Summary of Snooping Mechanism

Request | Source | State of addressed cache block | Function and Explanation
Read hit | processor | shared or exclusive | Read data in cache
Read miss | processor | invalid | Place read miss on bus
Read miss | processor | shared | Address conflict miss: place read miss on bus
Read miss | processor | exclusive | Address conflict miss: write back block, then place read miss on bus
Write hit | processor | exclusive | Write data in cache
Write hit | processor | shared | Place write miss on bus
Write miss | processor | invalid | Place write miss on bus
Write miss | processor | shared | Address conflict miss: place write miss on bus
Write miss | processor | exclusive | Address conflict miss: write back block, then place write miss on bus
Read miss | bus | shared | No action; allow memory to service read miss
Read miss | bus | exclusive | Attempt to share data: place cache block on bus and change state to shared
Write miss | bus | shared | Attempt to write shared block; invalidate the block
Write miss | bus | exclusive | Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid
State Transition
[Figure: Cache state transitions based on requests from the CPU]
- Invalid, CPU read miss: place read miss on bus; go to Shared (read only)
- Invalid, CPU write: place write miss on bus; go to Exclusive (read/write)
- Shared, CPU read hit: read data in cache; stay Shared
- Shared, CPU read miss: place read miss on bus; stay Shared
- Shared, CPU write: place write miss on bus; go to Exclusive
- Exclusive, CPU read hit or CPU write hit: stay Exclusive
- Exclusive, CPU read miss: write back cache block, place read miss on bus; go to Shared
- Exclusive, CPU write miss: write back cache block, place write miss on bus; stay Exclusive
State Transition
[Figure: Cache state transitions based on requests from the bus]
- Shared, write miss for this block: invalidate; go to Invalid
- Exclusive, read miss for this block: write back block, abort memory access; go to Shared
- Exclusive, write miss for this block: write back block, abort memory access; go to Invalid
Some Terminologies
 Polling: A process periodically checks if there is a message that it
needs to handle. This method of awaiting a message is called
polling. Polling reduces processor utilization.
 Interrupt: A process is notified when a message arrives using
built-in interrupt register. Interrupt increases processor utilization
in comparison to polling. 
 Synchronous: A process sends a message and waits for the
response to come before sending another message or carrying
out other tasks. This way of waiting is referred to as
synchronous communication.
 Asynchronous: A process sends a message and continues to
carry out other tasks while the requested message is processed.
This is referred to as asynchronous communication.
Communication Infrastructure
 Multiprocessor Systems with shared
memory can have two types of
communication infrastructure:
 Shared Bus
 Interconnect

Interconnect
Shared Bus
Directory Based Protocol

[Figure: Directory-based protocol — nodes 1 … n, each with its own directory, connected by an Interconnect]
State of the Block
 Shared: One or more processors have the block
cached, and the value in memory is up to date.
 Uncached: No processor has a copy of the block.
 Exclusive: Exactly one processor has a copy of the
cache block, and it has written the block, so the
memory copy is out of date. The processor is called the
owner of the block.
Local, Remote, Home Node
 Local Node: This is the node where
request originates
 Home Node: This is the node where
the memory location and the
directory entry of an address
reside.
 Remote Node: A remote node is a
node that has a copy of the cache block.
Shared State Operation
 Read-miss: The requesting processor is
sent the requested data from memory
and the requestor is added to the sharing
set.
 Write-miss: The requesting processor is
sent the value. All the processors in the
set Sharer are sent invalidate messages,
and the Sharer set is made to contain the
identity of the requesting processor. The
state of the block is made exclusive.
Uncached State Operation
 Read-miss: The requesting processor is sent
the requested data from memory and the
requestor is made the only sharing node. The
state of the block is made shared.
 Write-miss: The requesting processor is sent
the requested data and becomes the sharing
node. The block is made exclusive to indicate
that the only valid copy is cached. Sharer
indicates the identity of the owner.
Exclusive State Operation
 Read-miss: The owner processor is sent a data fetch message, which causes the
state of the block in the owner’s cache to transition to shared and causes the
owner to send the data to the directory, where it is written to memory and sent back
to the requesting processor. The identity of the requesting processor is added to
the set Sharer, which still contains the identity of the processor that was the owner
(since it still has a readable copy).

 Data write back: The owner processor is replacing the block and therefore must
write it back. This write back makes the memory copy up to date (the home
directory essentially becomes the owner), the block is now uncached and the
Sharer is empty.

 Write-miss: The block has a new owner. A message is sent to the old owner
causing the cache to invalidate the block and send the value to the directory, from
which it is sent to the requesting processor, which becomes the new owner. Sharer
is set to the identity of the new owner, and the state of the block remains exclusive.
Cache State Transition
[Figure: Directory state transitions]
- Uncached, read miss: data value reply; Sharers = {P}; go to Shared (read only)
- Uncached, write miss: data value reply; Sharers = {P}; go to Exclusive (read/write)
- Shared, read miss: data value reply; Sharers += {P}; stay Shared
- Shared, write miss: send invalidate to all Sharers; data value reply; Sharers = {P}; go to Exclusive
- Exclusive, read miss: fetch from owner; data value reply; Sharers += {P}; go to Shared
- Exclusive, data write-back: Sharers = {}; go to Uncached
- Exclusive, write miss: fetch/invalidate old owner; data value reply; Sharers = {P}; stay Exclusive
True Sharing Miss & False Sharing
Miss
Time | Processor P1 | Processor P2
1 | Write X1 |
2 | | Read X2
3 | Write X1 |
4 | | Write X2
5 | Read X2 |
Synchronization
 We need a set of hardware
primitives with the ability to
atomically read and modify a
memory location.
Example 1: Atomic Exchange
 Interchanges a value in a register
for a value in memory
 You could build a lock with this. If
the memory location contains 1
then the lock is on.
[Figure: atomic exchange swaps the value in register A with the 0/1 lock value held in physical main memory]
Other Operations
 Test-and-Set
 Fetch-and-Increment
Load Linked & Set Conditional
 Load Linked: Load linked is a primitive
operation that loads the content of a
specified location into cache.
 Set Conditional: Set conditional operation
is related to Load Linked operation by
operating on the same memory location.
This operation sets the location to a new
value only if the location contains the
same value read by Load Linked,
otherwise it will not.
Operation of LL & SC

try: LL R2, 0(R1) ; load linked


DADDUI R3, R2, #1 ; increment
SC R3, 0(R1) ; Store conditional
BEQZ R3, try ; branch if store fails

1. Address of the memory location loaded by LL is kept in a register


2. If there is an interrupt or invalidate then this register is cleared
3. SC checks if this register has been cleared to zero.
a. If it has, the store fails.
b. Otherwise, it stores the content of R3 at that memory address.
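Steps 1-3 can be modeled with a toy Python sketch. The class name is my own, and the version counter is my simplification standing in for the hardware link register that is cleared by an intervening store:

```python
class Memory:
    def __init__(self, value=0):
        self.value = value
        self.version = 0          # bumped on every store; models the broken link

    def ll(self):
        # Load linked: return the value plus a "link" to validate SC against.
        return self.value, self.version

    def sc(self, new_value, linked_version):
        # Store conditional: succeed only if nothing stored since the LL.
        if self.version != linked_version:
            return False
        self.value = new_value
        self.version += 1
        return True

m = Memory(10)
v, link = m.ll()
m.value, m.version = 99, m.version + 1   # intervening store breaks the link
assert not m.sc(v + 1, link)             # SC fails; caller must retry the loop
v, link = m.ll()                         # retry: reload and relink
assert m.sc(v + 1, link)                 # no intervening store, SC succeeds
print(m.value)  # → 100
```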
Single Processor System
 Concurrent Execution is the source of the problem

Process 1 X=0 Process 2

Memory

P1 P2
TOP READ X
BNEZ TOP
time P2 is scheduled
TOP READ X
BNEZ TOP
WRITE 1
P1 is scheduled
WRITE 1
With Interrupt Mask

Process 1 X=0 Process 2


Memory

P1 P2

TOP READ X
BNEZ TOP
MASK
WRITE 1
time UNMASK
MASK
TOP READ X
BNEZ TOP
WRITE 1
UNMASK
Multiprocessor System
 Masking the interrupt will not work

Processor1 X=0 Processor2


Memory

P1 P2

TOP READ X TOP READ X


BNEZ TOP BNEZ TOP
MASK MASK
WRITE 1
UNMASK WRITE 1
UNMASK
Multiprocessor System
 Uses Cache Coherence
 Bus Based Systems or Interconnect
Based System
 Coherence system arbitrates writes
(Invalidation/write distribute)

Thus, serializes writes!
Spin Locks
 Locks that a processor continuously
tries to acquire, spinning around a
loop until it succeeds.
 It is used when a lock is to be held
for a short amount of time.
Spin Lock implementation

lockit: LD R2, 0(R1) ;load of lock


BNEZ R2, lockit ;not available-spin
DADDUI R2, R0,#1 ;load locked value
EXCH R2, 0(R1) ;swap
BNEZ R2, lockit ;branch if lock wasn't 0
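A hedged analogue of the loop above in Python: `Lock.acquire(blocking=False)` behaves like an atomic test-and-set (it returns True only if the caller obtained the lock), so the spin can be mimicked as follows. The worker structure and counts are illustrative, not from the text:

```python
import threading

lock = threading.Lock()
counter = 0

def worker():
    global counter
    for _ in range(1000):
        # Spin until the atomic test-and-set succeeds, like the EXCH loop.
        while not lock.acquire(blocking=False):
            pass
        counter += 1             # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # → 4000
```

Real spin locks only make sense when the critical section is short; here the spin is just a stand-in for the swap-based loop on the slide.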
Cache Coherence steps
Step | P0 | P1 | P2 | Coherence state of lock | Bus/Directory activity
1 | Has lock | Spins, testing if lock=0 | Spins, testing if lock=0 | Shared | None
2 | Sets lock to 0 | Invalidate received | Invalidate received | Exclusive (P0) | Write invalidate of lock variable from P0
3 | | Cache miss | Cache miss | Shared | Bus/Directory services P2 cache miss; write back from P0
4 | | Waits for bus | Lock = 0 | Shared | Cache miss for P2 satisfied
5 | | Lock = 0 | Executes swap, gets cache miss | Shared | Cache miss for P1 satisfied
6 | | Executes swap, gets cache miss | Completes swap: returns 0 and sets lock = 1 | Exclusive (P2) | Bus/Directory services P2 cache miss; generates invalidate
7 | | Swap completes and returns 1, and sets lock = 1 | Enters critical section | Exclusive (P1) | Bus/Directory services P1 cache miss; generates write back
8 | | Spins, testing if lock=0 | | | None
Multi-threading
 Multi-threading – why and how?
 Fine-grain
 Almost round robin
 High overhead (compared to coarse grain)
 Better throughput than coarse grain
 Coarse-grain
 Threads switched only on long stalls (L2
cache miss)
 Low overhead
 Individual threads get better performance
Reading Assignment
 Memory Consistency
 Notes (Answers)
 Problem on page 596
 An example that uses LL (Load
Linked) and Set Conditional (SC).
End!
IETF – Internet Engineering Task Force
 Loosely self-organized groups (working groups)
of …
 Two types of documents: I-D and RFC
 An I-D's life is shorter compared to an RFC's life
 RFCs include proposed standards and
standards
 RFC Example: RFC 1213, RFC 2543
 I-D -> Proposed Standard -> Standard
SIP Evolution
 SIP I-D v1 ’97, v2 ’98
 SIP RFC 2543 ’99
 SIP RFC 3261 '02, obsoletes RFC 2543
 SIP working groups – SIPPING, SIMPLE, PINT,
SPIRITS, …
 SIPit – SIP interoperability test
Predictor for SPEC92 Benchmarks
[Figure: Instructions between mispredictions (y-axis, 0-300) across the SPEC92 benchmarks, comparing Predicted Taken against Profile-based branch prediction]