You are on page 1of 78

Review of Computer Architetcure

A Sahu Deptt. of C f Comp. S & Engg. Sc. IIT Guwahati

Outline
Computer organization Vs Architecture p g Processor architecture Pipeline h P l architecture
Data, resource and branch hazards ,

Superscalar & VLIW architecture Memory h hierarchy h Reference

Computer organization Vs Architecture
Comp Organization => Digital Logic Module l d l Logic and Low level ============================ p g , g Comp Architecture = > ISA Design, MicroArch Design Algorithm for Designing best micro architecture, Pipeline model, Branch prediction strategy, memory management Etc…..

Hardware abstraction
Register file CPU PC ALU System bus Memory bus

Bus interface

bridge

Main memory

I/O bus USB controller Mouse Keyboard Graphics adapter Display Disk Disk controller

Expansion slots for p other devices such as network adapters

Hardware/software interface
software f C++ m/c instr reg, adder dd hardware transistors Arch. focus

Instruction set architecture Lowest level visible to a programmer Micro architecture Fills the gap between instructions and logic modules

Instruction Set Architecture
Assembly Language View
Processor state (RF, mem) Instruction set and encoding
Application pp cat o Program Compiler ISA CPU Design Circuit Design Chip Layout OS

Layer of Abstraction y
Above: how to program , machine - HLL, OS Below: what needs to be built tricks to make it run fast

The Abstract Machine
CPU Registers PC ALU Condition Codes Memory Addresses Add Data Instructions Stack Code + Data

Programmer-Visible State
PC Program Counter g Register File

heavily used data y
Condition Codes

Memory y Byte array Code + data stack

Instructions
Language of M hi L f Machine Easily interpreted y p primitive compared to HLLs

Instruction set design goals
maximize performance, minimize cost, reduce design time

Instructions
All MIPS Instructions: 32 bit long, have 3 operands Operand order is fixed (destination first) Example: C code: A=B+C MIPS code: d add $ 0 $ 1 $ 2 dd $s0, $s1, $s2 (associated with variables by compiler)

Registers numbers 0 .. 31 e g 31, e.g., $t0=8,$t1=9,$s0=16,$s1=17 etc.

000000 10001 10010 01000 00000 100000
op rs rt rd shamt funct

Instructions LD/ST & Control
Load d L d and store instructions i i Example: A[8] = h + A[8]; [ ] [ ]; C code: MIPS code: lw $t0, 32($s3) add $t0, $s2, $t0 sw $t0 32($s3) $t0, Example: lw $t0, 32($s2) 35 18 9 32 op rs rt 16 bit number Example: if (i != j) beq $s4, $s5, Lab1 b $ 4 $ 5 L b1 h = i + j; add $s3, $s4, $s5 else j Lab2 h = i - j; Lab1: sub $s3, $s4, $s5 Lab2: ...

What constitutes ISA?
Set f basic/primitive operations S of b i i ii i
Arithmetic, Logical, Relational, Branch/jump, Data movement

Storage structure – registers/memory St t t it /
Register-less machine, ACC based machine, A few special purpose registers, Several Gen purpose registers, Large number of registers g , p p g , g g

How addresses are specified
Direct, Indirect, Base vs. Index, Auto incr and auto decr, Pre (post) incr/decr, Stack

How operand are specified
3 address machine r1 = r2 + r3, 2 address machine r1 = r1 + r2 1 address machine Acc = Acc + x (Acc is implicit) 0 address machine add values on (top of stack)

How instructions are encoded

RISC vs. CISC
RISC Uniformity of instructions, Simple set of operations and addressing modes, Register based architecture with 3 address instructions RISC: Virtually all new ISA since 1982 ARM, MIPS, SPARC, HP’s PA RISC, PowerPC, Alpha, HP s PA-RISC, CDC 6600 CISC : Mi i i code size, make assembly l Minimize d i k bl language easy VAX: instructions from 1 to 54 bytes long! Motorola 680x0, Intel 80x86

MIPS subset for implementation
Arithmetic - logic instructions g add, sub, and, or, slt Memory reference instructions lw, sw Control flow instructions beq, j Incremental changes in the design to include other instructions will be discussed later

Design overview
Use the program counter (PC) to supply instruction address Get the i h instruction f i from memory Read registers Use the instruction to decide exactly what to do

Da ta Re gis ter # Re gis te rs Re gis ter # Re gis ter # Da ta

PC

Addres s Ins truction memory

Ins truction

ALU

Addres s Da ta memory

Division into data path and control

control signals i l

status signals i l

CONTROLLER

Building block types
Two types of functional units: elements that operate on data values (combinational) output is function of current input, no memory Examples
gates: and, or, nand, nor, xor, inverter ,Multiplexer, decoder, adder, subtractor, comparator, ALU, array multipliers

elements that contain state (sequential) output is function of current and previous inputs state = memory Examples:
flip-flops, counters, registers, register files, memories

Components for MIPS subset Register, Adder ALU Multiplexer Register file Program memory Data memory Bit manipulation components

Components - register

32

PC

32

clock

Components - adder

PC
32 + 32

PC+4
32 + 32

4
32

offset
32

Components - ALU
operation a
32 a=b overflow ALU 32 32

result

b

Components - multiplexers C t lti l
PC+4
32

0

PC+4+offset
32

mux
1

32

select

Components - register file
5 Regis ter g numbe rs 5 5 Re R ad re gis ter 1 Re a d re gis ter 2 Write re gis ter Write data Re a d data 1 Data Re a d data 2

Registers

Data

Reg Write

Components - program memory C t g
Ins tructio n a ddre s s Ins I tructio n i

Instruction memory

MIPS components - d t memory t data
Me m Write

Ad dre s s

Re ad d a ta

Write da ta

Data memory

Me m R e a d

Components - bit manipulation circuits C t i l ti i it

16

Sign xtend

32

MSB

LSB MSB 32

shift

32 0 LSB

MIPS subset for implementation

Arithmetic - logic instructions add, sub, and, or, slt Memory reference instructions lw, l sw Control flow instructions beq, j

Datapath for add sub and or slt add,sub,and,or,slt
Fetch instruction Address the register file Pass operands to ALU Pass result to register file Increment PC I t Format: add $t0, $s1, $s2 op rs rt rd

actions required

000000 10001 10010 01000 00000 100000
shamt funct

Fetching instruction

PC

ad

ins

IM

Addressing RF

PC

ad

ins[25-21] ins[20-16] ins

IM

rad1 rad2 wad RF wd

rd1 rd2

Passing operands to ALU

PC

ad

ins[25-21] ins[20-16] ins

IM

rad1 rad2 wad RF wd

rd1 rd2

ALU

Passing the result to RF

ins

IM

ins[15-11]

rd2

ALU

PC

ad

ins[25-21] ins[20-16]

rad1 rad2 wad RF wd

rd1

Incrementing PC

4

+

PC

ad

ins[25-21] ins[20-16] ins

IM

ins[15-11]

rad1 rad2 wad RF wd

rd1 rd2

ALU

Load and Store instructions
format : I Example: lw $t0, 32($s2) 35 op 18 rs 9 rt t 32 16 bit number b

Adding “sw” instruction sw

4

+

PC

ad

ins[25-21] ins[20-16] ins

IM

ins[15-11]

rad1 rad2 wad RF wd 16

rd1 rd2

ALU

0 1

ad wd

rd

DM

ins[15-0]

sx

Adding “lw” instruction lw

4

+

PC

ad

ins[25-21] ins[20-16] ins
0

IM

ins[15-11]

1

rad1 rad2 wad RF wd 16

rd1 rd2

ALU

0 1

ad wd

rd

1 0

DM

ins[15-0]

sx

Adding “beq” instruction beq
0 1

+

4

PC

ad

ins[25-21] ins[20-16] ins
0

IM

ins[15-11]

1

rad1 rad2 wad RF wd 16

rd1 rd2

s2

ALU

+

0 1

ad wd

rd

1 0

DM

ins[15-0]

sx

Adding “j” instruction j
s2
ins[25-0] 28 ja[31-0]
0 0 1 1

PC+4[31-28]

+

4

PC

ad

ins[25-21] ins[20-16] ins
0

IM

ins[15-11]

1

rad1 rad2 wad RF wd 16

rd1 rd2

s2

ALU

+

0 1

ad wd

rd

1 0

DM

ins[15-0]

sx

Control signals
s2
ins[25-0] 28 ja[31-0]
0 1

jmp
0 1

PC+4[31-28]

+

4

0

IM

ins[15-11]

rd2

ALU

ins

Asrc

PC

ad

rd1
0 1

ad wd

rd

1

3

op

DM
MR

Rdst ins[15-0]

16

sx

M2R
1 0

ins[25-21] ins[20-16]

RW rad1 rad2 wad RF wd Z MW

s2

Psrc

+

Datapath + Control
s2
ins[25-0] 28 ja[31-0]
0 1

jmp
0 1

PC+4[31-28]

+

ins[31-26]

control

4

0

IM

ins[15-11]

rd2

ALU

ins

Asrc

PC

ad

rd1
0 1

ad wd

rd

1

3

op

DM
MR

ins[15-0] ins[5-0]

2

Actrl

Rdst

16

sx

opc

M2R
1 0

ins[25-21] ins[20-16]

RW rad1 rad2 wad RF wd Z MW

s2

brn

Psrc

+

Analyzing performance
Component delays

Register Adder ALU Multiplexer Register file Program memory Data memory y Bit manipulation components

0 t+ tA 0 tR tI tM 0

Delay for {add, sub, and, or, slt} y { , , , , }

4

+

t+ ⎧ max ⎨ ⎩t I + t R + t A + t R
ins[25-21] ins[20-16] ins[15-11] rad1 rad2 wad RF wd rd1 rd2

PC

ad

IM

ALU

ins

Delay for {sw}

+

4

t+ ⎧ max ⎨ ⎩t I + t R + t A + t M
ins[25-21] ins[20-16] rad1 rad2 wad RF wd 16 ins[15-0] rd1 rd2

PC

ad

ALU

ins

IM

ad wd

rd

DM

sx

Clock period in single cycle design
R-class R l lw sw beq tI tI tI tI t+ tI j t+ tI tR tR tR tR t+ t+ tA tA tA tA tR tM tM tR clock period

Clock period in multi cycle design multi-cycle
R-class lw sw beq tI tI tI tI t+ tI j t+ tI tR tR tR tR t+ t+ tA tA tA tA tR tM tM tR clock period

Cycle time and CPI
high
multi-cycle design

CPI
low
pipelined design single cycle design

short

cycle time long

PIpelined datapath (abstract)
IF
IF/ID +

ID
ID/EX

EX

Mem
EX/Mem

WB

Mem/WB

PC

ad

rad ins

rd1 rd2

ALU

+

4

IM

wad RF wd

ad wd

rd

DM

Fetch new instruction every cycle
IF
IF/ID +

ID
ID/EX

EX

Mem
EX/Mem

WB
Mem/WB

PC

ad

rad ins

rd1 rd2

ALU

+

4

IM

wad RF wd

ad wd

rd

DM

Pipelined processor design
1 0

IF/IDw

flush

control
0 4
rad1 d rad2 wad wd

bubble
0 1

+

PC

ins

IM
PCw PCw=0 IF/IDw=0 bubble=1

RF

rd2

ALU

ad

rd1
0 1

s2

+

ad wd

rd

1 0

DM

0 1

Actrl

sx

Graphical representation
5 stage pipeline g pp

actions

IF

ID

EX

ALU U

stages

IM

RF

DM

RF

Mem

WB

Usage of stages by instructions
ALU

lw sw add beq

IM

RF

DM

RF

ALU

IM

RF

ALU

IM

RF

ALU

IM

RF

DM

RF

DM

RF

DM

RF

Pipelining
Simple multicycle design : • Resource sharing across cycles • All instructions may not take same cycles

IF

D

RF

EX/AG

M

WB

• Faster throughput with pipelining

Degree of overlap
Serial

Depth
Shallow

Overlapped O l d

Deep Pipelined

Hazards in Pipelining
Procedural dependencies => Control hazards cond and uncond branches, calls/returns Data dependencies => Data hazards RAW (read after write) WAR ( i after read) (write f d) WAW (write after write) Resource conflicts => Structural hazards use of same resource in different stages g

Data Hazards
read/write previous instr current instr

read/write

delay = 3

Structural Hazards
Caused by Resource Conflicts
Use of a hard are resource in hardware more than one cycle
A B A A B A C A B C A C

Different sequences of resource usage by different instructions Non-pipelined multi-cycle resources

A

B A

C C

D B D

F

D F

X D

X X X

Control Hazards
cond eval branch b h instr next inline instr target instr
delay = 2

target addr gen

delay = 5

• the order of cond eval and target addr gen may be different y p • cond eval may be done in previous instruction

Pipeline Performance
T S stages

Frequency of interruptions - b

CPI = 1 + (S - 1) * b Time = CPI * T / S

Improving Branch Performance
Branch Elimination
Replace branch with other instructions

Branch Speed Up
Reduce time for computing CC and TIF

Branch Prediction
Guess the outcome and proceed, undo if necessary

Branch Target Capture g p
Make use of history

Branch Elimination
C T S C:S F Use conditional instructions (predicated execution)

OP1 BC CC = Z ∗ + 2 Z, ADD R3, R2, R1 OP2

OP1 ADD R3, R2, R1, NZ OP2

Branch Speed Up : Early target address generation
Assume each instruction is Branch Generate target address while decoding If target i same page omit translation in i l i After decoding discard target address if not Branch
BC
IF IF IF D AG TIF TIF TIF

Branch Prediction
Treat conditional branches as unconditional branches / NOP Undo if necessary y Strategies: Fixed (always guess inline) Static (guess on the basis of instruction type) Dynamic (guess based on recent history)

Static Branch Prediction
Instr uncond cond loop call/ret % 14.5 58 9.8 17.7 17 7 Guess always never always always Branch 100% 54% 91% 100% Correct 14.5% 27% 9% 17.7% 17 7%

Total 68.2% ota 68 %

Branch Target Capture
• Branch Target Buffer (BTB) • Target Instruction Buffer (TIB)

instr addr

pred stats

target target addr g target instr

prob of target change < 5%

BTB Performance
decision BTB miss g go inline inline .8 .2 delay 0 5 4 BTB hit g go to target g target .2 .8 0 .6*.8*0 = 0.88 (Eff.Delay)

.4 .6 4 6 target inline

result

.4*.8*0 + .4*.2*5 + .6*.2*4 +

Compute/fetch scheme
(no dynamic branch prediction)
A I I+1 Instruction I Fetch address F A I - cache R I+2 I+3

BTA IIFA Compute BTA

+
Next sequential address
BTI BTI+1 BTI+2 BTI+3

BTAC scheme
A I I+1 Instruction I Fetch address F A I - cache R I+2 I+3

BA BTA

BTA IIFA

BTAC

+
Next sequential address
BTI BTI+1 BTI+2 BTI+3

ILP in VLIW processors
Cache/ memory Fetch Unit Single multi-operation instruction

FU

FU

FU

Register file multi-operation instruction

ILP in Superscalar processors
Decode Cache/ memory Fetch Unit and issue unit Multiple instruction

FU

FU

FU

Sequential stream of instructions

Instruction/control Data FU Funtional Unit Register file

Why Superscalars are popular ?
Binary code compatibility among scalar & y p y g superscalar processors of same family Same compiler works for all processors (scalars and superscalars) of same family Assembly programming of VLIWs is tedious Code density in VLIWs is very poor y yp Instruction encoding schemes
slide 69

Hierarchical structure
S peed
Fastest CPU

S ize
S mallest

Cost / bit
Highest

Memory

Memory M

S lowest

Memory

Biggest

Lowest

Data transfer between levels
Processor

access

hi t miss

Data are transferred

unit of transfer = block

Principle of locality & Cache Policies
Temporal Locality Spatial Locality
references repeated in time f di i references repeated in space Special case: Sequential Locality

============================
Read d Sequential / Concurrent Simple / Forward Load Block load / Load forward / Wrap around Replacement ep ace e t LRU / LFU / FIFO / Random

Load policies
0 Cache miss on AU 1 Block Load Load Forward Fetch Bypass (wrap around load) l d) 4 AU Block 2 3 1

Fetch Policies Demand fetching fetch only when required (miss) y q ( ) Hardware prefetching automatically automaticall prefetch ne t block next Software prefetching programmer decides to prefetch questions: how much ahead (prefetch distance) how f h often

Write Policies
Write Hit Write Back Write Through Write Miss Write Back Write Through
With Write Allocate With No Write Allocate

Cache Types Instruction | Data | Unified | Split p Split vs. Unified:
Split allows specializing each part Unified allows best use of the capacity p y

On-chip | Off-chip
on-chip : fast but small hi f b ll off-chip : large but slow

Single level | Multi level

References
1. 2. 3. 4. 4 Patterson, D A.; Hennessy, J L. Computer Organization and Design:The Hardware/software Interface Morgan Kaufman Interface. Kaufman, 2000 Sima, T, FOUNTAIN, P KACSUK, Advanced Computer p Architectures: A Design Space Approach, Pearson Education, 1998 Flynn M J, Computer Architecture: Pipelined and Parallel l Processor Design, Narosa publishing India, 1999 John L. Hennessy, David A. Patterson, Computer L Hennessy A Patterson architecture: a quantitative approach, 2nd Ed, Morgan Kauffman, 2001 ,

Thanks