
Prepared by: Karlene M.

Black COA - Sem 1


 A communication pathway connecting two or
more devices
 A link in which information is usually
broadcast to all devices (not just intended
recipient)
 consists of multiple communication lines
transmitting digital information in parallel
 A component in which width is important in
determining performance
 often grouped
◦ A number of channels in one bus
e.g. a 32-bit data bus is 32 separate single-bit
channels

 systems can contain multiple buses


 Major advantages of bus organization are
versatility and low cost.
 Major disadvantage is that insufficient
bandwidth may create a communication
bottleneck, limiting the maximum I/O throughput.
 Performance factors
◦ Physical size (latency & bandwidth)
◦ Number and type of connected devices (taps)
 Parallel lines on circuit boards (often yellow)
 Ribbon cables
 Strip connectors on mother boards
 e.g. PCI

 Sets of wires
1. System (processor-memory) buses
 Address bus

 Data bus

 Control bus

2. I/O buses

3. Backplane buses
 contains the source or destination address of
the data on the data bus
e.g. CPU needs to read an instruction
(data) from a given location in memory

 Bus width determines maximum memory


capacity of system
e.g. 8080 has 16 bit address bus giving 64k address
space
 Used for moving data between modules
◦ Remember that there is no difference between
“data” and “instruction” at this level

 Width is a key determinant of performance


◦ 8, 16, 32, 64 bit
 set of lines used to control use of the data and
address lines by the attached devices
 includes
◦ Memory read/write signal
◦ Interrupt acknowledgement
◦ Interrupt request
◦ Clock signals
◦ I/O Read or Write
◦ Transfer ACK
◦ Bus request
◦ Bus grant
◦ Reset
 Type
• dedicated or multiplexed
 Arbitration
• centralised or distributed
 Timing
• synchronous or asynchronous
 Width

 Transfer type
• serial or parallel
 Dedicated
◦ Separate data & address lines

 Multiplexed
◦ Shared lines
◦ Address valid or data valid control line
◦ Advantage - fewer lines
◦ Disadvantages
 More complex control
 Degradation of performance
 Ensuring only one device uses the bus at a
time – avoiding collisions
 Choosing a master among multiple requests
◦ Try to implement priority and fairness (no device
“starves”)

 Uses a master-slave mechanism


 Two main schemes:
• centralized and distributed
Components
 Bus master: component that can initiate a bus
request
◦ Bus typically has several masters
◦ Processor, but I/O devices can also be masters
Components
 Daisy-chain: devices connect to bus in priority
order
◦ High-priority devices intercept/deny requests by low-
priority ones
◦ Simple, but slow and can’t ensure fairness
New trend: Point-to-point buses
• Pro: no arbitration, no “master”, fast,
simple, source synchronous
• Con: needs lots of wires or requires high
per-wire bandwidth
Centralized
◦ single hardware device controls bus access
- (bus controller or bus arbiter)
◦ May be part of CPU or separate
Distributed
◦ any module (except passive devices like
memory) can become the bus master e.g.
CPU and DMA controller
◦ Access control logic is on all modules
◦ Modules work together to control bus
 Co-ordination of events on bus
 Synchronous – events are controlled by a clock
 Asynchronous – timing is handled by well-
defined specifications, i.e., a response is
delivered within a specified time after a
request
 Events determined by clock signals
 Control Bus includes clock line
 A single 1-0 cycle is a bus cycle
 All devices can read clock line
 Usually sync on leading/rising edge
 Usually a single cycle for an event
 Analogy – Orchestra conductor with baton
 Usually stricter in terms of its timing
requirements
 Devices must have certain tolerances to
provide responses to signal stimuli
 More flexible allowing slower devices to
communicate on same bus with faster
devices.
 Performance of faster devices, however, is
limited to speed of bus
 The wider the bus, the better the data transfer
rate or the wider the addressable memory space

 Serial “width” is determined by


length/duration of frame
 Lots of devices on one bus leads to:
◦ Propagation delays
 Long data paths mean that co-ordination of bus use
can adversely affect performance
 If aggregate data transfer approaches bus capacity

 Most systems use multiple buses to overcome


these problems
 Historically, parallel has been used for high
speed peripherals (e.g., SCSI, parallel port
zip drives rather than serial port). High
speed serial, however, has begun to replace
this need

 Serial communication also used to be


restricted to point-to-point
communications. Now there's an increasing
prevalence of multipoint
 working space (temporary storage) available
to CPU (a MUST have)
 Number and function vary between processor
designs
 One of the major design decisions
 Top level of memory hierarchy
 hardware devices for holding values

 mainly used to:

◦ control execution of a program

◦ hold an instruction that is currently


being executed

 fast memory, almost always connected to
circuitry that allows various arithmetic,
logical, control, and other manipulations, as
well as possibly setting internal flags

 found in all computers but not all registers


can be manipulated by the programmer
(examples, IR, MAR, MBR)

 include those that may be controlled by the
programmer - called operational registers
and are often referred to as the register set
of the machine.

[Diagram: x86 register set – e.g. EAX and its 16-bit lower half AX; ESI]
 Make them general purpose
◦ Increase flexibility and programmer options
◦ Increase instruction size & complexity

 Make them specialized


◦ Smaller (faster) instructions
◦ Less flexibility
 General Purpose

 Data

 Address

 Condition Codes
 May be true general purpose
 May be restricted
 May be used for data or addressing
◦ Data
 Accumulator
◦ Addressing
 Segment
How Many?
 Between 8 - 64
 Fewer = more memory references
 More does not reduce memory references
and takes up processor real estate
How big?
 Large enough to hold full address
 Large enough to hold full word
 Often possible to combine two data registers
Example: C programming
◦ long long int a;  (a 64-bit integer may occupy a register pair)
◦ double a;  (likewise for a 64-bit floating-point value)
 register-to-register instructions
 memory-to-register instructions
 memory-to-memory instructions
 load-and-store instructions

 Ax – Used to accumulate the results of additions,
subtractions and so forth.

 Bx – Base register. Used to point to the starting


address (called the base) of a structure in memory.

 Cx – Count register. Frequently specifies the


number of times some operation is to repeat.

 Dx – Data register. Holds data, perhaps passed to a


subroutine for processing.

 Sp (stack pointer) – points to the top of the
processor’s stack

 Bp (base pointer) – usually addresses variables


stored inside the stack

 Si (Source index) and di (destination index) – aka


string registers

 IP (Instruction Pointer) – specifies the next


machine code instruction to be executed, relative to
the segment located by CS. IP is rarely referred to
directly.

 Cs (code segment) – addresses the start of the
program’s machine code in memory

 Ds (data segment) – addresses the start of the
program’s variables

 Ss (stack segment) – locates start of program’s stack


space

 Es (extra segment) – locates an additional data


segment if needed, although in many programs, es
and ds address the same memory, facilitating some
operations tied to these registers.

 Also called status bits

 Boolean variables within the processor status


(condition code) register

 Describe the status of the CPU and the


program currently being executed

 Values are set (1) or cleared (0) after the


execution of an instruction

 O or of – Overflow flag
 D or df – Direction flag
 I or if – interrupt enable flag
 T or tf – trap flag
 S or sf – sign flag
 Z or zf – zero flag
 A or af – auxiliary flag
 P or pf – parity flag
 C or cf – carry

 Sets of individual bits
◦ e.g. result of last operation was zero
 Can be read (implicitly) by programs
◦ e.g. Jump if zero
 Cannot (usually) be set by programs
 Program Counter
 Instruction Decoding Register
 Memory Address Register
 Memory Buffer Register
 May have registers pointing to:
◦ Process control blocks
◦ Interrupt Vectors

 N.B. CPU design and operating system design


are closely linked
 Two main classes – RISC and CISC
RISC = Reduced Instruction Set Computer.
- type of microprocessor architecture that
utilizes a small, highly-optimized set of
instructions,
CISC = Complex Instruction Set Computer
- type of microprocessor architecture that
utilizes a more specialized set of instructions
RISC
• Emphasis on software
• Relatively large general-purpose register sets
• Single-clock, reduced instructions only
• Register-to-register: "LOAD" and "STORE"
are independent instructions
• Instructions operate upon operands stored in
registers rather than in main memory
• Low cycles per instruction, large code sizes
• A hardwired control unit
• Relatively few addressing modes
• Spends more transistors on registers
• Fixed-length, easily decoded instruction formats

CISC
• Emphasis on hardware
• Memory-to-memory: "LOAD" and "STORE"
incorporated in instructions
• Includes multi-clock complex instructions
• A large number of instructions
• A large number of addressing modes
• Small code sizes, high cycles per instruction
• A range of variable-length instruction formats
with variable execution times
• Transistors used for storing complex instructions
• A microprogrammed control unit
 A.k.a. writing microcode
 method used to implement machine instructions
in a CPU relatively easily, often using less
hardware than with other methods.
 a set of very detailed and rudimentary lowest-
level routines which controls and sequences the
actions needed to execute particular
instructions, sometimes also to decode them.
 a machine instruction implemented by a series of
microinstructions - loosely comparable to how an
interpreter implements a high-level language statement
using a series of machine instructions

 Microprograms are carefully designed and optimized


for the fastest possible execution, since a slow
microprogram would yield a slow machine instruction
which would in turn cause all programs using that
instruction to be slow.
Microcode
 the element of a microprogram
 normally written by the CPU engineer during
the design phase
 generally not meant to be visible or changeable
by a normal programmer, not even an
assembly programmer
 can be dramatically changed with a new
microarchitecture generation
Microcode
 often used to let one microarchitecture emulate
another, usually more powerful, architecture.

 used as a synonym for firmware, whether or not


it actually implements the microprogramming of
a processor e.g. firmware used in a hard drive
Microcode
 Microinstruction providing the bits which control
the functional elements that internally compose a
CPU
 transforms a complex electronic design challenge
(the control of a CPU) into a less-complex
programming challenge
 can be characterized as horizontal or vertical.
 each microinstruction directly controls CPU
elements
 typically contained in a fairly wide control store,
it is not uncommon for each word to be 56 bits
or more.
 on each tick of a sequencer clock, a microcode
word is read, decoded, and used to control the
functional elements which make up the CPU.
 A microprogram word comprises fairly tightly
defined groups of bits.
 each microinstruction requires subsequent
decoding before it can control CPU elements,
since the bit fields may pass through
intermediate combinatory logic which in turn
generates the actual control signals for internal
CPU elements (ALU, registers, etc.).
 bit fields themselves directly produce the
control signals.
 requires small instruction lengths and storage
 requires more time to decode, resulting in a
slower CPU clock.
 may be just the assembly language of a simple
conventional computer that is emulating a more
complex computer.
 As transistors became cheaper, horizontal
microcode came to dominate the design of CPUs
using microcode, with vertical microcode no
longer being used.
 Each type of CPU has its own machine
language and assembly language, so an
assembly language program written for
one type of CPU won't run on another.

Machine languages
 consist entirely of numbers
 almost impossible for humans to read and
write.
Assembly languages
 have the same structure and set of commands
as machine languages
 enable a programmer to use names instead of
numbers

 Three basic characteristics differentiate
microprocessors:
1. Clock speed
2. Bandwidth/Bus speed
3. Instruction set

Clock speed
 also called clock rate
 the speed at which a microprocessor executes
instructions.
 regulated by an internal clock
 synchronizes all the various computer
components.

Clock speed
 faster the clock => more instructions that may be
executed per second
 stated in Hertz (Hz)
i.e: 1 MHz = 1 million cycles per second,
 Does not imply better performance
 major factor in determining the power of a
computer
Bus speed
 measured in MHz
 often hampers the performance of the computer
by having a slower speed than the processor
 Ideally should be the same as the CPU clock
speed

Instruction set:
 The complete collection of instructions that
the microprocessor can execute:
◦ Machine Code
◦ Binary
 Usually represented by assembly codes
 Determines a computer family

Instruction set:
 Dictates how programs (software) are written for a
microprocessor
example: the SIMP computer understands
10 instructions, and any program written for it
uses those ten instructions in various ways to
accomplish some surprisingly complicated tasks.

Instruction set:
 sometimes a larger instruction set will equal
better performance.
For example, successive Pentium generations
each extended the instruction set of their
predecessors (e.g. the MMX and SSE extensions).

 plays an important role in co-design of the
embedded computer system
 links the software and hardware part of the
system.
 design process supports software interface
implementation and hardware interface
synthesis.
 enables software running on the
microcontroller to control external devices
 consists of the sequential logic that
physically connects the devices to the
microcontroller and the software drivers
that allow code to access the device
functions.
 model that renders lower-level details of
computer systems temporarily invisible in
order to facilitate the design of sophisticated
systems
 the notion that we can concentrate on one
“level” of the big picture at a time, with
confidence that we can then connect
effectively with the levels above and below.
 framing the levels of abstraction
appropriately is one of the most important
skills in any undertaking.

 on the other hand, abstraction does not


mean being clueless about the neighboring
levels.
 interface between the hardware and the
lowest-level software - one of the most
important abstractions – referred to as the
Instruction Set Architecture (ISA)
 Omits unnecessary details
High Level Language
        ↓  (Compiler)
Assembly Language
        ↓  (Assembler)
Binary Machine Language

 What kind of interface should the hardware
present to the software?
• Types of instructions?
• Instruction representation?
• How do we get from instruction 1 to 2 (or to 7
instead)?
• Software’s view of storage? Where do variables live?
• Does the hardware help to support function calls? If
so, how?

 includes anything programmers need to
know to make a binary machine language
program work correctly, including
instructions, I/O devices, and so on.

Note: The operating system will encapsulate the
details of doing I/O, allocating memory, and other
low-level system functions, so that application
programmers do not need to worry about such
details. The combination of the basic instruction
set and the operating system interface provided for
application programmers is called the application
binary interface (ABI).
 allows computer designers to talk about
functions independently from the hardware
that performs them. Computer designers
distinguish architecture from an
implementation of an architecture along the
same lines: an implementation is hardware
that obeys the architecture abstraction.
 Operation code (Op code) - What to do

 Source Operand reference – Object(s) to use

 Result Operand reference – Where to store result

 Next Instruction Reference - When you have done


that, do this...
 a unique bit pattern for machine code

 a symbolic representation for human


readability
◦ e.g. ADD, SUB, LOAD
ADD A,B
 Operate Instructions
◦ process data (addition, logical operations, etc.)

 Data Movement Instructions …


◦ move data between memory locations and registers.
 Control Instructions …
◦ change the sequence of execution of instructions in
the stored program.
 The default is sequential execution: the PC is incremented
by 1 at the start of every Fetch, in preparation for the next
one.
 Control instructions set the PC to a new value during the
Execute phase, so the next instruction comes from a
different place in the program.
 This allows us to build control structures such as loops
and branches.
Length

1. Fixed length
• 32 or 64 bits (depends on architecture – MIPS16 is 16-bit)
+ Simple implementation: compute next PC using only the PC
– Code density: 32 or 64 bits for a NOP?
Length

2. Variable length

– Complex implementation

+ Code density

3. Compromise of fixed and variable: two lengths

Encoding

- transforming information from one format to


another

Length
• 32-bits
• MIPS16: 16-bit variants of common instructions for
density

Encoding
• 4 formats, simple encoding

R-Type Instruction (register to register)
| opcode (6) | rs (5) | rt (5) | rd (5) | sa (5) | function (6) |

I-Type Instruction (register with immediate operand)
| opcode (6) | rs (5) | rt (5) | immediate (16) |

J-Type Instruction (jump to target)
| opcode (6) | target (26) |
opcode (bits 31-26)

 Opcode is short for "operation code". The
opcode is a binary encoding for the
instruction.

rs (bits 25-21)

 This is the first source register. A source
register holds one of the arguments of the
operation.

rt (bits 20-16)

 This is the second source register.

rd (bits 15-11)

 This is the destination register: the register
where the result of the operation is stored.

sa (bits 10-6)

 The shift amount – the number of bits to shift.
Used in shift instructions.

function (bits 5-0)

 An additional 6 bits used to specify the
operation, in addition to the opcode.
 3 addresses
◦ Operand 1, Operand 2, Result
◦ a = b + c;
◦ May be a fourth – next instruction (usually
implicit)
◦ Not common
◦ Needs very long words to hold everything
 2 addresses
◦ One address doubles as operand and
result
◦ a = a + b
◦ Reduces length of instruction
◦ Requires some extra work
 Temporary storage to hold some results
 1 address
◦ Implicit second address
◦ Usually a register (accumulator)
◦ Common on early machines
 0 (zero) addresses
◦ All addresses implicit
◦ Uses a stack
◦ e.g. push a
◦ push b
◦ add
◦ pop c

◦ c = a + b
 More addresses
◦ More complex (powerful?) instructions
◦ More registers
 Inter-register operations are quicker
◦ Fewer instructions per program

 Fewer addresses
◦ Less complex (powerful?) instructions
◦ More instructions per program
◦ Faster fetch/execution of instructions
 Operation repertoire
◦ How many ops?
◦ What can they do?
◦ How complex are they?

 Data types
 Instruction formats
◦ Length of op code field
◦ Number of addresses
 Registers
◦ Number of CPU registers available
◦ Which operations can be performed on which
registers?

 Addressing modes ( will be discussed later…)

 RISC v CISC
 Addresses
 Numbers
◦ Integer/floating point

 Characters
◦ ASCII etc.

 Logical Data
◦ Bits or flags
 (Aside: Is there any difference between numbers and characters?
Ask a C programmer!)
 Operation type encoded in instruction opcode
 Operations act on operands (stored in registers)
◦ Data Transfer
◦ Arithmetic
◦ Logical
◦ Conversion
◦ I/O
◦ System Control
◦ Transfer of Control
Data Transfer
 Specify
◦ Source
◦ Destination
◦ Amount of data
 May be different instructions for different
movements
◦ e.g. IBM 370
 Or one instruction and different addresses
◦ e.g. VAX
Arithmetic
 Add, Subtract, Multiply, Divide, Mod/rem
◦ (signed/unsigned)
 Packed integer: padd, pmul, pand, por…
(saturating/wraparound)
 Floating point
◦ add, sub, mul, div, sqrt
 May include
◦ Increment (a++)
◦ Decrement (a--)
◦ Negate (-a)
Shift and Rotate Operations
Logical
 Bitwise operations
 AND, OR, NOT, XOR, SLL, SRL, SRA

Conversion
E.g. Binary to Decimal
Input/Output

 May be specific instructions

 May be done using data movement


instructions (memory mapped)

 May be done by a separate controller (DMA)


Systems Control

 Privileged instructions

 CPU needs to be in specific state


◦ Ring 0 on 80386+

◦ Kernel mode

 For operating systems use


Transfer of Control
 Branch
◦ e.g. branch to x if result is zero

 Skip
◦ e.g. increment and skip if zero
◦ ISZ Register1
◦ Branch xxxx
◦ ADD A

 Subroutine call
◦ c.f. interrupt call
 What order do we read numbers that occupy
more than one byte?
 Endianness
◦ byte ordering used to represent types of data
◦ the transmission order over a network or other
medium
Big endian
◦ most significant byte (MSB) value is stored at the
memory location with the lowest address
Little endian
◦ The least significant byte (LSB) value is stored at the
lowest address
 refers to the ways in which instructions
reference their operands.

 Programmers use statement labels to refer


to instructions and the names of variables
or expressions to refer to the variables
themselves.

 Direct – moves a byte or word between a
memory location specified in the instruction
and a register
MOV AL, [1234h]

 Register indirect (base relative or indexed) –


transfers a byte or word of data between a
register and the memory location addressed by
an index (DI or SI) or base register (BP or BX)
MOV AX, [BX]
 Register – transfers a byte or word from the
source register to the destination register
MOV BX, CX

 Immediate – transfers an immediate byte or


word of data into the destination register or
memory location
MOV AX, 3456h
 Base plus index (base relative indexed) –
transfers a byte or word of data between a
register and the memory location addressed by
a base register (BP or BX) plus an index
(DI or SI) register

MOV DX, [BX + DI]

 Register relative – transfers a byte or word of
data between a register and the memory location
addressed by an index (DI or SI) or base register
(BP or BX) plus a constant displacement

MOV AX, [BX + 1000h]

 Base relative plus index –
transfers a byte or word of data between a
register and the memory location addressed by
a base register (BP or BX) plus an index
register (DI or SI) plus a displacement

MOV DX, [BP + SI + 1000h]

 For the CPU to function effectively, it must:
◦ Fetch instructions
◦ Interpret instructions
◦ Fetch data
◦ Process data
◦ Write data
 The Control Unit orchestrates the complete
execution of each instruction:

◦ At its heart is a Finite State Machine (FSM) that sets


up the state of the logic circuits according to each
instruction.

◦ This process is governed by the system clock - the


FSM goes through one transition (“machine cycle”)
for each tick of the clock.
 Six phases of the complete Instruction Cycle
◦ Fetch: load IR with instruction from memory
◦ Decode: determine action to take (set up inputs for
ALU, RAM, etc.)
◦ Evaluate address: compute memory address of
operands, if any
◦ Fetch operands: read operands from memory or
registers
◦ Execute: carry out instruction
◦ Store results: write result to destination (register or
memory)
 Fetch
◦ The first step is to read an instruction from memory.
◦ This actually takes several smaller steps, or
“micro-instructions”:
 MAR ← (PC)      ; use the value in PC to access memory
 PC ← (PC) + 1   ; increment the value of PC
 MDR ← Mem[MAR]  ; read the memory location into MDR
 IR ← (MDR)      ; copy (MDR) to IR

◦ Steps 1, 2 & 4 take a single machine cycle each, but step 3
(memory access) can take many machine cycles.
 Decode

◦ The opcode is input to a decoder, which sets


up the ensuing sequence of events according
the instruction.
 Evaluate Address

◦ Computes the address of the operand (if


any), or of the memory location to be
accessed: e.g. the location from which to
obtain a value in a LOAD instruction.

 This is known as the Effective Address (EA).


 Fetch Operands
◦ Obtains the source operand(s), if required for
execution.
◦ Operands can come from Registers or RAM, or
be embedded in the instruction itself.
 The Effective Address (EA) determined in the
previous step is used to obtain an operand
from memory.
 Execute

◦ the instruction is executed.

 e.g. if the opcode was ADD, the two source


operands are added by the ALU.

 If the opcode was a control instruction, a


value is written to the PC
NB: Data Movement instructions don’t have an execute phase
 Store Result
◦ If there is a result from the operation it is
written to memory (using the EA), or to a
register.
Note: Some instructions don't need all 6 phases

 If only using registers, skip Evaluate Address


 If only moving data, skip Execute
 Start over …
◦ The control unit just keeps repeating this whole
process: so it now Fetches a new instruction from the
address currently stored in the PC.

 Recall that the PC was incremented in the first step


(FETCH), so the instruction retrieved will be the next in the
program as stored in memory - unless the instruction just
executed changed the contents of the PC.
 May require memory access to fetch operands

 Indirect addressing requires more memory


accesses

 Can be thought of as additional instruction


subcycle
 A “good” design goal of any system is to have all
of its components performing useful work all of
the time – high efficiency

 Following the instruction cycle in a sequential


fashion does not permit this level of efficiency

 Compare the instruction cycle to an automobile


assembly line… do you see a problem?
 Solution: Perform all tasks concurrently, but on
different (sequential) instructions
◦ i.e. overlap the stages of the instruction cycle
◦ The result is temporal parallelism → an instruction
pipeline

 An ideal pipeline divides a task into k


independent sequential subtasks
◦ Each subtask requires 1 time unit to complete
◦ The task itself then requires k time units to complete
 For n iterations of the task, the execution times
will be:
◦ With no pipelining: n x k time units
◦ With pipelining: k + (n -1) time units

 Speedup of a k-stage pipeline is thus:


S = nk / [k + (n - 1)] → k (for large n)
 does not decrease the time for individual
instruction execution

 Increases instruction throughput


◦ The throughput of the instruction pipeline is
determined by how often an instruction exits the
pipeline.

 introduces hazards as it changes the relative timing


of instructions by overlapping their execution
 situations in which a correct program ceases to
work correctly due to implementing the
processor with a pipeline

 three fundamental types of hazard:


◦ data hazards
◦ branch hazards
◦ structural hazards

Note: Only branch hazards and RAW data hazards are possible in MIPS pipeline
Data hazards
◦ when reads and writes of data occur in a different
order in the pipeline than in the program code,
resulting in data dependencies

◦ an instruction uses the result of a previous


instruction (RAW)
ADD R1, R2, R3        SW R1, 3(R2)
ADD R4, R1, R5   or   LW R3, 3(R2)

◦ Write After Read (WAR), Write After Write (WAW), and
Read After Write (RAW) hazards
Data hazards
WAR
◦ the reverse of a RAW: in the code a write occurs after
a read, but the pipeline causes write to happen first
WAW
◦ when two writes occur out of order - when there is no
read in between
RAW
◦ when, in the code as written, one instruction reads a
location after an earlier instruction writes new data to
it, but in the pipeline the write occurs after the read
(so the instruction doing the read gets stale data).
Example (Data hazard)
add $1,  $2,  $3
add $4,  $5,  $6
add $7,  $8,  $9
add $10, $11, $12
add $13, $14, $1     (data arrives early; OK)
add $15, $16, $7     (data arrives on time; OK)
add $17, $18, $13    (uh, oh)
add $19, $20, $17    (uh, oh again)
Branch/Control hazards
 occurs when a decision needs to be made,
but the information needed to make the
decision is not available yet
JMP LOOP

LOOP: ADD R1, R2, R3
Structural hazards
 occur when a single piece of hardware is used
in more than one stage of the pipeline, so it's
possible for two instructions to need it at the
same time
 Four possible techniques
1. Stall
◦ Can resolve any type of hazard
◦ Detect the hazard
◦ Freeze the pipeline up to the dependent stage until
the hazard is resolved - simply make the later
instruction wait until the hazard resolves itself
◦ Undesirable because it slows down the machine, but
may be necessary.
2. Bypass/Forward
◦ detects condition - If the data is available
somewhere, but is just not where we want it, create
extra data paths to “forward” the data to where it is
needed
◦ no need to stall – eliminates stalls for single-cycle
operations – reduces the longest stall to N-1 cycles
for N-cycle operations

◦ best solution, since it doesn't slow the machine down


and doesn't change the semantics of the instruction
set.
3. Document/punt –
 define instruction sequences that are
forbidden, or change the semantics of
instructions, to account for the hazard -
worst solution, both because it results in
obscure conditions on permissible
instruction sequences, and (more
importantly) because it ties the instruction
set to a particular pipeline implementation.
4. Add hardware
◦ most appropriate to structural hazards; if a
piece of hardware has to be used twice in
an instruction, see if there is a way to
duplicate the hardware.
                  1    2    3    4    5    6    7    8    9
ADD R1, R2, R3    IF   ID   EX   MEM  WB
SUB R4, R5, R1         IF   ID   EX   MEM  WB
AND R6, R1, R7              IF   ID   EX   MEM  WB
OR  R8, R1, R9                   IF   ID   EX   MEM  WB
XOR R10,R1,R11                        IF   ID   EX   MEM  WB
Analysis of hazards
 All the instructions after the ADD use the result of the
ADD instruction (in R1).

 The ADD instruction writes the value of R1 in the WB


stage and the SUB instruction reads the value during
ID stage resulting in a data hazard and unless
precautions are taken to prevent it, the SUB instruction
will read the wrong value and try to use it.

 For the AND instruction, the write of R1 does not


complete until the end of cycle 5. Thus, the AND
instruction that reads the registers during cycle 4 will
receive the wrong result.
Analysis of hazards
 The OR instruction can be made to operate without
incurring a hazard by a simple implementation
technique. The technique is to perform register file
reads in the second half of the cycle, and writes in the
first half. Because both WB for ADD and ID for OR are
performed in one cycle 5, the write to register file by
ADD will perform in the first half of the cycle, and the
read of registers by OR will perform in the second half
of the cycle.
Analysis of hazards
 The XOR instruction operates properly, because its
register read occur in cycle 6 after the register write
by ADD.
 Multiple Streams
 Prefetch Branch Target
 Loop buffer
 Branch prediction
 Delayed branching
Multiple Streams
 Have two pipelines
 Prefetch each branch into a separate pipeline
 Use appropriate pipeline
◦ Leads to bus & register contention
◦ Multiple branches lead to further pipelines
being needed
Prefetch Branch Target
 Target of branch is prefetched in addition to instructions following branch
 Keep target until branch is executed
 Used by IBM 360/91
Loop Buffer
 Very fast memory
 Maintained by fetch stage of pipeline
 Check buffer before fetching from memory
 Very good for small loops or jumps
 c.f. cache
 Used by CRAY-1
Loop Buffer Diagram
Branch Prediction
 Predict never taken
◦ Assume that jump will not happen
◦ Always fetch next instruction
◦ 68020 & VAX 11/780
◦ VAX will not prefetch after branch if a page fault would result (O/S v CPU design)
 Predict always taken
◦ Assume that jump will happen
◦ Always fetch target instruction
Branch Prediction
 Predict by Opcode
◦ Some instructions are more likely to result in a jump
◦ Can get up to 75% success
 Taken/Not taken switch
◦ Based on previous history
◦ Good for loops
 Delayed Branch
◦ Do not take jump until you have to
◦ Rearrange instructions
Performance of a machine is determined by:
 Instruction count (determined by compiler & ISA)
 Clock cycle time
 Clock cycles per instruction

Processor design (datapath and control) will determine:
 Clock cycle time
 Clock cycles per instruction
CPU Time = execution time
         = instruction count x CPI x clock cycle time
         = instruction count x CPI / clock rate

NB: clock cycle time = 1 / clock rate; CPI = cycles per instruction

Performance ~ 1 / Execution Time

CPU PerformanceA / CPU PerformanceB = Execution TimeB / Execution TimeA
Example
System A takes 12 seconds to execute a
program. System B takes 20 seconds to
execute the same program. Which system is
faster and by how much?
Example (soln)
NB: Slower time = faster machine

CPU PerformanceA / CPU PerformanceB = Execution TimeB / Execution TimeA
(Normally read: faster as compared to slower)

=> CPU PerformanceA / CPU PerformanceB = 20 seconds / 12 seconds

System A is 20/12 = 1.67 times, or 67%, faster than System B for this application: a 67% performance improvement.
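The same ratio can be checked in a couple of lines (a sketch; the variable names are illustrative):

```python
# Relative performance of two systems running the same program.
exec_time_a = 12  # seconds (System A)
exec_time_b = 20  # seconds (System B)

# Performance is the inverse of execution time, so the speedup of
# A over B is B's execution time divided by A's.
speedup = exec_time_b / exec_time_a
print(f"System A is {speedup:.2f}x faster ({speedup - 1:.0%} improvement)")
# System A is 1.67x faster (67% improvement)
```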
Computing CPI
 The CPI is the average number of cycles per instruction.
 If for each instruction type we know its frequency and the number of cycles needed to execute it, we can compute the overall CPI as follows:
 CPI = Σ (CPIi x Fi)

For example:
Op       F      CPI   CPI x F   % Time
ALU      50%    1     0.5       23
Load     20%    5     1.0       45
Store    10%    3     0.3       14
Branch   20%    2     0.4       18
Total    100%         2.2       100
(the frequencies must always add up to 100%)
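The weighted sum above is easy to compute programmatically. A sketch using the instruction mix from the table:

```python
# Weighted-average CPI from an instruction mix: CPI = sum(F_i * CPI_i).
mix = {                 # op: (frequency, cycles per instruction)
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}

# The frequencies must cover the whole instruction stream (100%).
assert abs(sum(f for f, _ in mix.values()) - 1.0) < 1e-9

cpi = sum(f * c for f, c in mix.values())
print(round(cpi, 2))  # 2.2

# The "% Time" column is each class's share of the total cycles.
time_share = {op: f * c / cpi for op, (f, c) in mix.items()}
```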
Estimating Performance Improvements
Assume a processor currently requires 10 seconds to execute a program and performance improves by 50% per year.

By what factor does processor performance improve in 5 years?
Factor of improvement (foi) = (1 + %)^years
Factor of improvement = (1 + 0.5)^5 = 7.59

How long will it take a processor to execute the program after 5 years?
Exec_Timenew = Exec_Timeold / foi = 10 / 7.59 = 1.32 seconds
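The compound-improvement formula translates directly into code (a sketch; the variable names are illustrative):

```python
# Compound annual performance improvement.
old_time = 10.0      # seconds to run the program today
annual_gain = 0.50   # performance improves 50% per year
years = 5

factor = (1 + annual_gain) ** years  # factor of improvement (foi)
new_time = old_time / factor

print(round(factor, 2))    # 7.59
print(round(new_time, 2))  # 1.32
```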
Example:
You have a system that contains a special processor for doing floating-point operations. You have determined that 50% of your computations can use the floating-point processor. The speedup of the floating-point processor is 15. What is the overall speedup?
Execution timenew
  = Execution timeold x [(1 – Fractionenhanced) + Fractionenhanced / Speedupenhanced]

The overall speedup is the ratio of the execution times:

Speedupoverall = Execution timeold / Execution timenew
              = 1 / [(1 – Fractionenhanced) + Fractionenhanced / Speedupenhanced]
Overall speedup achieved by using the floating-point processor:

Overall speedup = 1 / [(1 – 0.5) + 0.5/15]
                = 1 / (0.5 + 0.033)
                = 1.876
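This is Amdahl's law, and it translates into a one-line function (a sketch; the function name is illustrative). Note that the exact value is 1.875; the 1.876 above comes from rounding 0.5/15 to 0.033:

```python
# Amdahl's law: overall speedup when only a fraction of the work
# benefits from an enhancement.
def overall_speedup(fraction_enhanced, speedup_enhanced):
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 50% of computations run on a floating-point unit that is 15x faster.
s = overall_speedup(0.5, 15)
print(round(s, 3))  # 1.875
```

Even an infinitely fast floating-point unit would cap the overall speedup at 1 / (1 - 0.5) = 2, which is why the enhanced fraction matters more than the enhancement itself.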
Example: (Impact of optimizing compiler)
Assume the following program makeup:

Operation   Freq.   Clock Cycles
ALU         43%     1
Load        21%     2
Store       12%     2
Branch      24%     2

Assume a 20 ns clock; the optimizing compiler eliminates 50% of all ALU operations.
Solution (NOT Optimized):
MIPS = instruction count / (Exec Time x 10^6) = clock rate / (CPI x 10^6)

Avg. CPI = (0.43 x 1) + (0.21 x 2) + (0.12 x 2) + (0.24 x 2) = 1.57

MIPS = 50 MHz / (1.57 x 10^6) = (50 x 10^6) / (1.57 x 10^6) = 31.8

NB: MIPS = millions of instructions per second
Solution (Optimized):
Avg. CPI = (0.43/2 x 1 + 0.21 x 2 + 0.12 x 2 + 0.24 x 2) / (1 - 0.43/2) = 1.73

MIPS = 50 MHz / (1.73 x 10^6) = (50 x 10^6) / (1.73 x 10^6) = 28.9
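The calculation above can be verified in code. A sketch (it uses the unrounded optimized CPI of 1.726, so the optimized MIPS prints as 29.0 rather than the 28.9 obtained from the rounded 1.73):

```python
# MIPS before and after an optimizing compiler removes half the ALU ops.
clock_rate = 50e6  # 20 ns clock -> 50 MHz

mix = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}

cpi_base = sum(f * c for f, c in mix.values())
mips_base = clock_rate / (cpi_base * 1e6)

# Optimized: half the ALU instructions disappear, so renormalize the
# mix over the remaining fraction of instructions.
remaining = 1 - mix["ALU"][0] / 2
cpi_opt = (mix["ALU"][0] / 2 * 1 + 0.21 * 2 + 0.12 * 2 + 0.24 * 2) / remaining
mips_opt = clock_rate / (cpi_opt * 1e6)

print(round(cpi_base, 2), round(mips_base, 1))  # 1.57 31.8
print(round(cpi_opt, 2), round(mips_opt, 1))    # 1.73 29.0
# The optimized program executes fewer instructions and finishes
# sooner, yet its MIPS rating DROPS -- a classic way MIPS misleads.
```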
 An implementation of a CPU
 Must be considered with the processor design (the sequencing and execution of instructions) and the individual components that are necessary:
◦ the use of a clock
◦ registers and memory
Datapath
 The interconnection of functional units that make up the processor, such as ALUs, decoders, and multiplexers, that perform data processing operations
 A component of most processors
 Works in conjunction with the control unit
Two implementations
 Single cycle implementation
◦ uses a single clock cycle for every instruction (start execution on one clock edge, and complete on the next)
◦ Advantage: one clock cycle per instruction, so control is simple
◦ Disadvantage: long cycle time, set by the slowest instruction
 Multicycle implementation
◦ instructions use multiple clock cycles
◦ Advantage: shorter cycle time; simpler instructions take fewer cycles, reducing execution time
◦ Disadvantage: more complex control
[Diagram: single-cycle MIPS datapath — PC and instruction memory feeding the register file, sign extend, ALU, and data memory, with control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, and MemtoReg]
 Cycle time = Σ(stages)
 Execution Time = IC * CPI * Cycle Time
 Processor design (datapath and control) will
determine:
◦ Clock cycle time
◦ Clock cycles per instruction
 All combinational logic must stabilize within
one clock cycle.
 All state elements will be written exactly
once at the end of the clock.
 CPI = 1
 All operations must occur in parallel
 Used to improve performance in real computer systems
 Divides instruction execution into multiple clock cycles
 Resources can be reused across cycles, saving hardware
◦ ALU used to compute address and to increment PC
◦ Memory used for instruction and data
 Control signals not determined solely by instruction
◦ e.g., what should the ALU do for a "subtract" instruction?
 Simpler instructions can use fewer clock cycles
 Total execution time is reduced
 Cycle time = time for longest stage
 Execution time = cycle time * IC * CPI
 Advantages
◦ The work required by the typical instructions can be divided over approximately equal, smaller elementary operations
◦ The clock cycle can be set to the longest elementary operation
Multicycle Datapath
 The performance of a pipelined processor may be considered:
 Cycle time = longest stage + pipeline overhead
 Execution time = cycle time * (no. of stages + IC – 1)
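The three timing models (single cycle, multicycle, pipelined) can be compared numerically using the formulas above. A sketch, assuming illustrative stage delays, pipeline overhead, instruction count, and multicycle CPI (none of these numbers are from the slides):

```python
# Compare single-cycle, multicycle, and pipelined execution time
# for the same program.
stages = [2.0, 1.0, 2.0, 2.0, 1.0]  # assumed stage delays in ns (IF ID EX MEM WB)
ic = 1_000_000                      # assumed instruction count
overhead = 0.1                      # assumed pipeline register overhead in ns

# Single cycle: cycle time = sum of all stage delays, CPI = 1.
t_single = sum(stages) * ic

# Multicycle: cycle time = longest stage; assume an average CPI of 4.
cpi_multi = 4.0
t_multi = max(stages) * cpi_multi * ic

# Pipelined: cycle = longest stage + overhead;
# execution time = cycle time * (no. of stages + IC - 1).
cycle_pipe = max(stages) + overhead
t_pipe = cycle_pipe * (len(stages) + ic - 1)

print(t_single, t_multi, t_pipe)
# For large IC, pipelining approaches one result per cycle, so it wins
# by roughly the number of stages (minus the overhead penalty).
```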
 the processor is not the only component in the system that determines overall system performance
 Speed is NOT performance; e.g. a system running with a Pentium 150 is not necessarily "50% faster" than one with a Pentium 100
 speeding up the processor only improves system performance for those aspects of system use that depend on the processor
 most processors are already much faster than the devices that support them, so they spend a great deal of time waiting around for data that they can use
 One of the most important factors that influences overall system performance is memory bus speed, e.g. a slower processor running on a 66 MHz system bus can provide comparable performance to a faster one on a 60 MHz bus (Pentium 133 vs Pentium 150 in some benchmarks)
 very high clock speeds sometimes result in minimal improvement in overall system performance even if the processor's performance increases a great deal (e.g. the Pentium 200, which by most benchmarks provides less than 10% system performance improvement over the Pentium 166, despite scoring 20% higher on processor-only benchmarks): sort of a law of diminishing returns in processor speed