Advanced Computer Systems Architecture Lect-6

Advanced Computer Systems
Architecture
Course Teacher: Dr.-Ing. Shehzad Hasan
CIS, NED University
Lecture # 6
Fall Semester 2015 CS-506 ACSA 1

Recap Lecture – 5
• Computer Arithmetic
• Non-Restoring Unsigned Division
• Non-Restoring Signed Division
– Quotient conversion from {-1, 1} to 2’s Complement
– Floating Point Arithmetic

• IEEE Floating Point Standard

Basic FP Operations
Addition & Subtraction
Assume $e_1 \ge e_2$; an alignment shift (preshift) is needed if $e_1 > e_2$:
$(\pm s_1 \times b^{e_1}) + (\pm s_2 \times b^{e_2}) = (\pm s_1 \times b^{e_1}) + (\pm s_2 / b^{e_1 - e_2}) \times b^{e_1}$
$= (\pm s_1 \pm s_2 / b^{e_1 - e_2}) \times b^{e_1} = \pm s \times b^{e}$
Example: Numbers to be added:
x = 2^5 × 1.00101101
y = 2^1 × 1.11101101        (operand with smaller exponent, to be preshifted)
Operands after alignment shift:
x = 2^5 × 1.00101101
y = 2^5 × 0.000111101101
Result of addition:
s = 2^5 × 1.010010111101    (extra bits to be rounded off)
s = 2^5 × 1.01001100        (rounded sum)
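The alignment, addition, and rounding steps of the example above can be sketched with exact rational arithmetic. This is a minimal sketch, assuming like-signed operands, 8 fractional significand bits, and round-to-nearest-even; the function name `fp_add_aligned` and its parameters are illustrative and not part of the lecture.

```python
from fractions import Fraction

FRAC_BITS = 8  # assumed number of fractional significand bits, as in the example


def fp_add_aligned(e1, s1, e2, s2, frac_bits=FRAC_BITS):
    """Add two positive operands s * 2**e, each s an exact Fraction in [1, 2).

    Returns (exponent, significand) after the alignment shift, addition,
    normalization, and rounding to frac_bits fractional bits.
    """
    if e1 < e2:  # make sure e1 >= e2 so the second operand is the one preshifted
        e1, s1, e2, s2 = e2, s2, e1, s1
    aligned = s2 / Fraction(2) ** (e1 - e2)  # alignment (pre)shift
    total = s1 + aligned                     # exact sum, in [1, 4)
    e = e1
    if total >= 2:                           # single-bit normalization right shift
        total /= 2
        e += 1
    # round() on a Fraction rounds to the nearest integer, ties to even
    sig = Fraction(round(total * 2 ** frac_bits), 2 ** frac_bits)
    if sig >= 2:                             # rounding can carry out to 2.0
        sig /= 2
        e += 1
    return e, sig


def from_bits(bits, frac_bits=FRAC_BITS):
    """Helper: turn a bit string such as '100101101' (1.00101101) into a Fraction."""
    return Fraction(int(bits, 2), 2 ** frac_bits)


x = (5, from_bits("100101101"))
y = (1, from_bits("111101101"))
print(fp_add_aligned(*x, *y))
```

The sketch keeps the sum exact and rounds once at the end; a hardware adder would instead keep only a few extra bits during alignment.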

FP Addition
• When operand signs are alike, a single-bit normalization shift is always enough, as 1 ≤ s < 4. If the result is in 2 ≤ s < 4, it must be reduced by a factor of 2 through a single-bit right shift (adding 1 to the exponent to compensate).
• When operands have different signs, the resulting
significand may be very close to 0 and left shifting by
many positions may be needed for normalization.
• Overflow/underflow can occur during the addition
step as well as due to normalization.

FP Addition
[Figure: block diagram of a floating-point adder/subtractor. Operands x and y are unpacked into signs, exponents, and significands. The significand path performs selective complement and possible swap, aligns the significands, adds, normalizes, rounds with selective complement, normalizes again, and packs the sum/difference s. The exponent path uses a subtractor and a multiplexer; control and sign logic takes the Add/Sub input.]
Unpack:
• Isolate the sign, exponent, significand
• Reinstate the hidden 1
• Convert operands to internal format
• Identify special operands, exceptions
Pack:
• Combine sign, exponent, significand
• Hide (remove) the leading 1
• Identify special outcomes, exceptions

Basic FP Operations
Multiplication
$(\pm s_1 \times b^{e_1}) \times (\pm s_2 \times b^{e_2}) = (\pm s_1 \times s_2) \times b^{e_1 + e_2}$
Because $s_1 \times s_2 \in [1, 4)$, postshifting may be needed for normalization.
Overflow or underflow can occur during multiplication or normalization.
Division
$(\pm s_1 \times b^{e_1}) / (\pm s_2 \times b^{e_2}) = (\pm s_1 / s_2) \times b^{e_1 - e_2}$
Because $s_1 / s_2 \in (0.5, 2)$, postshifting may be needed for normalization.
Overflow or underflow can occur during division or normalization.
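The postshift needed after multiplication or division can be illustrated as follows. This is a minimal sketch, assuming base b = 2, positive significands in [1, 2) held as exact fractions, and no rounding or overflow/underflow checks; `fp_mul` and `fp_div` are hypothetical names chosen for illustration.

```python
from fractions import Fraction


def fp_mul(e1, s1, e2, s2):
    """Multiply s1 * 2**e1 by s2 * 2**e2; return a normalized (exponent, significand)."""
    product = s1 * s2          # lies in [1, 4)
    e = e1 + e2
    if product >= 2:           # postshift right by one bit
        product /= 2
        e += 1
    return e, product


def fp_div(e1, s1, e2, s2):
    """Divide s1 * 2**e1 by s2 * 2**e2; return a normalized (exponent, significand)."""
    quotient = s1 / s2         # lies in (1/2, 2)
    e = e1 - e2
    if quotient < 1:           # postshift left by one bit
        quotient *= 2
        e -= 1
    return e, quotient


print(fp_mul(5, Fraction(5, 4), -67, Fraction(7, 4)))
print(fp_div(0, Fraction(1), 0, Fraction(3, 2)))
```

In both cases the postshift is by at most one bit, which follows directly from the ranges stated above.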

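Floating-Point Multipliers and Dividers
$(\pm s_1 \times b^{e_1}) \times (\pm s_2 \times b^{e_2}) = (\pm s_1 \times s_2) \times b^{e_1 + e_2}$
$s_1 \times s_2 \in [1, 4)$: may need postshifting.
Overflow or underflow can occur during multiplication or normalization.
[Figure: block diagram of a floating-point multiplier. The floating-point operands are unpacked; the signs are XORed, the exponents are added, and the significands are multiplied. The result is normalized (with exponent adjustment), rounded, normalized again (with exponent adjustment), and packed to form the product.]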

Rounding Schemes
The IEEE 754-2008 standard includes five rounding modes:
1. Round to nearest, ties away from 0 (rtna)
2. Round to nearest, ties to even (rtne) [default rounding mode]
3. Round toward zero (inward)
4. Round toward +∞ (upward)
5. Round toward –∞ (downward)
[Figure: plot of rtne(x) against x, both axes running from –4 to 4. Caption: Rounding to the nearest even number]
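The five modes can be tried out with Python's `decimal` module, whose rounding constants correspond to them. This is only an illustration of the rounding rules on decimal values rounded to integers (as in the rtne(x) plot), not of binary significand rounding in hardware; the helper name `round_to_int` is an assumption made for this sketch.

```python
from decimal import (Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

# Mapping from the five modes listed above to decimal's rounding constants
MODES = {
    "rtna (nearest, ties away from 0)": ROUND_HALF_UP,
    "rtne (nearest, ties to even)": ROUND_HALF_EVEN,
    "toward zero (inward)": ROUND_DOWN,
    "toward +inf (upward)": ROUND_CEILING,
    "toward -inf (downward)": ROUND_FLOOR,
}


def round_to_int(text, mode):
    """Round the decimal number given as a string to an integer under the given mode."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=mode))


for name, mode in MODES.items():
    print(name, [round_to_int(v, mode) for v in ("-2.5", "-1.5", "1.5", "2.5")])
```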

Rounding and Exceptions
Adder result = $(c_{out}\, z_1 z_0 . z_{-1} z_{-2} \ldots z_{-l}\; G\; R\; S)_{\text{2's-compl}}$
where G is the guard bit, R is the round bit, and S is the sticky bit (the OR of all bits shifted past R).
Why only 3 extra bits? Consider the amount of alignment right-shift:
One bit: G holds the bit that is shifted out, so no precision is lost.
Two bits or more: The shifted significand has a magnitude in [0, 1/2) and the unshifted significand has a magnitude in [1, 2), so the difference of the aligned significands has a magnitude in [1/2, 2). The normalization left-shift will therefore be by at most one bit: shift left if the difference is in (1/2, 1), no shift if it is in [1, 2).
If a normalization left-shift actually takes place:
R = 0: round down, discarded part < ulp/2
R = 1: round up, discarded part ≥ ulp/2
The only remaining question is establishing whether the discarded part is exactly ulp/2 (for round to nearest even); S provides this information.
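How the round and sticky bits decide a round-to-nearest-even outcome can be sketched as below. This is a simplified illustration, assuming the value is already normalized and held as an integer whose low `extra` bits are to be discarded; R is taken to be the first discarded bit and S the OR of the rest. The guard bit's role during the normalization shift is not modelled, and `round_nearest_even` is a hypothetical helper name.

```python
def round_nearest_even(value, extra):
    """Drop the low `extra` bits of a non-negative integer using round-to-nearest-even."""
    kept = value >> extra
    if extra == 0:
        return kept
    r = (value >> (extra - 1)) & 1                         # round bit: first discarded bit
    s = int((value & ((1 << (extra - 1)) - 1)) != 0)       # sticky bit: OR of the remaining bits
    # R = 0: discarded part < ulp/2, round down.
    # R = 1, S = 1: discarded part > ulp/2, round up.
    # R = 1, S = 0: exactly ulp/2, round up only if that makes the result even.
    if r and (s or (kept & 1)):
        kept += 1
    return kept


# Sum from the earlier addition example: 1.010010111101 rounded to 8 fractional bits
print(bin(round_nearest_even(0b1010010111101, 4)))
```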
Examples
• Add
– 0 01111101 00000000000000000000000
– 0 10000101 10010000000000000000000
• Multiply
– 0 10000100 0100…. 00
– 1 00111100 1100…. 00
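The two Add operands above can be decoded and summed as a check. This is a minimal sketch, assuming the 1 + 8 + 23 bit patterns are IEEE 754 single-precision encodings; the helper names are illustrative. The Multiply operands are not decoded here because their fraction fields are abbreviated in the slide.

```python
import struct


def bits_to_float32(bits):
    """Interpret a 'sign exponent fraction' bit string as an IEEE 754 single."""
    word = int(bits.replace(" ", ""), 2)
    return struct.unpack(">f", struct.pack(">I", word))[0]


def float32_to_bits(x):
    """Encode x as IEEE 754 single precision and format it as 'sign exponent fraction'."""
    word = struct.unpack(">I", struct.pack(">f", x))[0]
    b = f"{word:032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"


a = bits_to_float32("0 01111101 00000000000000000000000")
b = bits_to_float32("0 10000101 10010000000000000000000")
print(a, b, a + b, float32_to_bits(a + b))
```

The sum happens to be exactly representable in single precision, so no rounding occurs in this particular example.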

Invalidated Laws of Algebra
Many laws of algebra do not hold for floating-point arithmetic
(some don’t even hold approximately)
This can be a source of confusion and incompatibility
Associative law of addition: a + (b + c) = (a + b) + c
$a = 0.12341 \times 10^5$, $b = -0.12340 \times 10^5$, $c = 0.14321 \times 10^1$
$a +_{fp} (b +_{fp} c)$
$= 0.12341 \times 10^5 +_{fp} (-0.12340 \times 10^5 +_{fp} 0.14321 \times 10^1)$
$= 0.12341 \times 10^5 -_{fp} 0.12339 \times 10^5$
$= 0.20000 \times 10^1$
$(a +_{fp} b) +_{fp} c$
$= (0.12341 \times 10^5 -_{fp} 0.12340 \times 10^5) +_{fp} 0.14321 \times 10^1$
$= 0.10000 \times 10^1 +_{fp} 0.14321 \times 10^1$
$= 0.24321 \times 10^1$
The results differ by more than 20%!
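The example can be reproduced with Python's `decimal` module. This is a sketch assuming a 5-digit decimal significand with round-to-nearest-even, which matches the digits shown above; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Assumed format: 5 significant decimal digits, round to nearest even
ctx = Context(prec=5, rounding=ROUND_HALF_EVEN)

a = Decimal("0.12341E5")
b = Decimal("-0.12340E5")
c = Decimal("0.14321E1")

left = ctx.add(a, ctx.add(b, c))    # a +fp (b +fp c)
right = ctx.add(ctx.add(a, b), c)   # (a +fp b) +fp c

print(left, right)
print((right - left) / left)        # relative difference with respect to the first result
```

The inner sum b + c loses the low-order digits of c when it is rounded to five digits, which is why the first ordering gives the less accurate result.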

Other Invalidated Laws of Algebra with FLP Arithmetic
Associative law of multiplication: a × (b × c) = (a × b) × c
Cancellation law (for a > 0): a × b = a × c implies b = c
Distributive law: a × (b + c) = (a × b) + (a × c)
Multiplication canceling division: a × (b / a) = b
Before the IEEE 754 floating-point standard became available and widely adopted, these problems were exacerbated by the use of many incompatible formats.

Instruction Level Parallelism

Basic Pipelining
Consider a 5-stage instruction pipeline as shown below:
IF ID EX M WB
A time-space diagram is used to describe the progress of instructions through the pipeline.
Stage | Clock cycle →
      |  1    2    3    4    5    6    7    8    9    10
WB    |                      I1   I2   I3   I4   I5   I6
M     |                 I1   I2   I3   I4   I5   I6   I7
EX    |            I1   I2   I3   I4   I5   I6   I7   I8
ID    |       I1   I2   I3   I4   I5   I6   I7   I8   I9
IF    |  I1   I2   I3   I4   I5   I6   I7   I8   I9   I10

Instruction latency (the time it takes to complete an instruction) = 5 cycles
Instruction throughput (for 10 cycles) = 6/10 IPC = 0.6 IPC
Speedup of a k-stage pipeline for a program having n instructions, with clock period τ:
$S_p = \dfrac{n k \tau}{(k + n - 1)\,\tau} = \dfrac{n k}{k - 1 + n}$
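The latency, throughput, and speedup figures can be checked numerically. This is a small sketch of the formulas above for an ideal pipeline without stalls; the function names are chosen here for illustration.

```python
def pipeline_cycles(n, k):
    """Cycles needed to complete n instructions on an ideal k-stage pipeline."""
    return k + n - 1


def speedup(n, k):
    """Speedup over a non-pipelined machine that takes k cycles per instruction."""
    return n * k / (k + n - 1)


def throughput_ipc(completed, cycles):
    """Instructions completed per clock cycle."""
    return completed / cycles


print(pipeline_cycles(6, 5))     # 6 instructions finish in 10 cycles on 5 stages
print(throughput_ipc(6, 10))
print(speedup(6, 5), speedup(1_000_000, 5))
```

As n grows, the speedup approaches the number of stages k, since the k − 1 fill cycles become negligible.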

MIPS Pipelined Architecture

Instruction Fetch (IF) Stage
• Instruction Fetch
The instruction address held in the PC is applied to the instruction memory, which makes the addressed instruction available on the output lines of the instruction memory.
• Updating PC
The value written to the PC is determined by the control signal PCSrc. Depending on the state of PCSrc, the PC is loaded with either the branch target address (BTA) or the sequential address (PC + 4).

Instruction Format

Instruction Decode (ID) Stage

• Instruction is decoded by the
control unit that takes 6-bit
opcode and generates control
signals.
• The control signals are buffered
in the pipeline registers until
they are used in the concerned
stage by the corresponding
instruction.

Instruction Decode (ID) Stage
• Registers are also read in this stage.
– The first source register's identifier in every instruction is at bit positions [25:21], and the second source register's identifier (if any) is at bit positions [20:16].
– The destination register's identifier is either at bit positions [15:11] (for R-type) or at [20:16] (for load and immediate instructions).
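The field positions described above can be extracted with shifts and masks. This is a minimal sketch: the register field positions come from the slide, while placing the 6-bit opcode at [31:26] and the 6-bit funct at [5:0] follows the standard MIPS32 encoding and is an assumption with respect to this lecture. The names `decode_fields` and `dest_register` are hypothetical.

```python
def decode_fields(word):
    """Split a 32-bit MIPS instruction word into its register and control fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits [31:26] (assumed standard layout)
        "rs": (word >> 21) & 0x1F,      # bits [25:21], first source register
        "rt": (word >> 16) & 0x1F,      # bits [20:16], second source or destination
        "rd": (word >> 11) & 0x1F,      # bits [15:11], destination for R-type
        "funct": word & 0x3F,           # bits [5:0] (assumed standard layout)
    }


def dest_register(fields, is_rtype):
    """Select the destination register identifier: rd for R-type, rt otherwise."""
    return fields["rd"] if is_rtype else fields["rt"]


# Example word built from fields: rs = 4, rt = 5, rd = 3, funct = 0x20
word = (4 << 21) | (5 << 16) | (3 << 11) | 0x20
fields = decode_fields(word)
print(hex(word), fields, dest_register(fields, is_rtype=True))
```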

Execution (EX) Stage

• This stage is marked by the
use of ALU that performs the
desired operation on registers
(R-type), calculates address
(memory reference
instructions), or compares
registers (branch).

Execution (EX) Stage

• The ALU control unit accepts the 6-bit funct field and the 2-bit control signal ALUOp and generates the required control signal for the ALU.
• The Branch Target Address (BTA) is also calculated in the EX stage, by a separate adder.

Memory (M) Stage
• Data memory is read (load) or written (store) using the address calculated by the ALU in the EX stage.
• Branch decisions are taken in this stage.
– The ZERO output of the ALU and the BRANCH signal generated by the control unit are ANDed to determine the fate of the branch (taken or not taken).
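The branch decision and the resulting PC update can be sketched together. This is an illustrative model, assuming PCSrc is simply the AND of the BRANCH and ZERO signals and that signals are represented as 0/1 integers; the function names are not from the lecture.

```python
def pc_src(branch, zero):
    """PCSrc: 1 when the instruction is a branch and the ALU reports ZERO."""
    return branch & zero


def next_pc(pc, bta, branch, zero):
    """Select the next PC: branch target address if taken, else PC + 4."""
    return bta if pc_src(branch, zero) else pc + 4


print(next_pc(100, 200, branch=1, zero=1))  # taken branch
print(next_pc(100, 200, branch=1, zero=0))  # branch not taken
print(next_pc(100, 200, branch=0, zero=1))  # not a branch instruction
```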

Write Back (WB) Stage
• The result produced by the ALU in the EX stage (R-type) or the data read from data memory in the M stage (lw) is written to the destination register.
• The data to be written to the destination register is selected via a multiplexer controlled by the control signal MemToReg.

Consider pipelined execution of the following MIPS instructions:
ld R1, 10(R2)
dadd R3, R4, R5
The load instruction uses all stages in the pipeline, but the add instruction doesn't access data memory.
        C1   C2   C3   C4   C5
ld      IF   ID   EX   M    WB
dadd         IF   ID   EX   WB
A resource conflict is indicated in CC5: two different instructions attempt to use the same hardware in the same cycle.
This can be averted by ensuring uniformity: make all instructions pass through all the stages in the same order.
As a consequence, some instructions will do nothing in some stages (accomplished by disabling the corresponding control signals).
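The conflict in CC5, and its disappearance once both instructions use all five stages, can be checked with a small scheduling sketch. The representation (a start cycle plus a list of stage names, one stage per cycle) and the name `find_conflicts` are assumptions made for this illustration.

```python
from collections import defaultdict


def find_conflicts(schedules):
    """Return {(cycle, stage): [instructions]} for every stage used twice in one cycle.

    schedules is a list of (name, start_cycle, stages), one stage per cycle.
    """
    usage = defaultdict(list)
    for name, start, stages in schedules:
        for offset, stage in enumerate(stages):
            usage[(start + offset, stage)].append(name)
    return {key: names for key, names in usage.items() if len(names) > 1}


# dadd skips the M stage: both instructions reach WB in cycle 5
skipping = [("ld", 1, ["IF", "ID", "EX", "M", "WB"]),
            ("dadd", 2, ["IF", "ID", "EX", "WB"])]
# Uniform pipeline: dadd passes through M without doing anything
uniform = [("ld", 1, ["IF", "ID", "EX", "M", "WB"]),
           ("dadd", 2, ["IF", "ID", "EX", "M", "WB"])]

print(find_conflicts(skipping))
print(find_conflicts(uniform))
```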

Pipeline Hazards
• A pipeline hazard is a situation that prevents an instruction
from using a pipeline stage during the designated clock cycle.
• Hazards reduce the performance from the ideal speedup
gained by pipelining. There are three classes of hazards:
1. Structural hazards arise from resource conflicts when the
hardware cannot support all possible combinations of instructions
simultaneously in overlapped execution.
2. Data hazards arise when an instruction depends on the results of
a previous instruction in a way that is exposed by the overlapping of
instructions in the pipeline.
3. Control hazards arise from the pipelining of branches and other
instructions that change the PC.

Structural Hazards
Some pipelined processors share a single memory for data and instructions. As a result, when an instruction contains a data memory reference, it will conflict with the instruction reference (fetch) for a later instruction.

Structural Hazards
Solution: Stall the pipeline for 1 clock cycle when the data memory
access occurs
No instruction is initiated on clock cycle 4 (which normally would initiate instruction i+3).
Because the instruction being fetched is stalled, all other instructions in the pipeline
before the stalled instruction can proceed normally.
In the above figure it is assumed that instructions i+1 and i+2 are not memory references

Structural Hazards
• Structural hazards are typically averted by
employing replicated resources.
• This structural hazard can be avoided by
having Harvard Architecture i.e. separate
memory units for instructions and data
– One for instruction fetch and another for data
read/write.

Data Hazards
• Data hazards occur when the pipeline changes the order
of read/write accesses to operands so that the order
differs from the order seen by sequentially executing
instructions on a non-pipelined processor
• Consider the following sequence of instructions:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

Data Hazards

Data Hazards
There are a number of data dependencies between various pairs of instructions, as detailed below:
1) The DADD instruction writes the value of R1 in the WB stage, but the DSUB instruction reads the value during its ID stage.
2) The AND instruction is also affected by this hazard. The write of R1 does not complete until the end of clock cycle 5. Thus, the AND instruction, which reads the registers during clock cycle 4, will receive the wrong result.
3) The OR instruction operates without incurring a hazard because we perform the register file reads in the second half of the cycle and the writes in the first half.
4) The XOR instruction also operates properly because its register read occurs in clock cycle 6, after the register write.
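The four cases above follow from comparing the cycle in which each instruction reads its registers (ID) with the cycle in which the producer writes its result (WB). The sketch below assumes a 5-stage pipeline with no stalls or forwarding, one instruction fetched per cycle, and a register file that writes in the first half of a cycle and reads in the second half. It ignores intervening redefinitions of a register, and `raw_hazards` is a hypothetical name.

```python
def raw_hazards(program):
    """List (producer, consumer, register) triples where the consumer reads too early.

    program is a list of (name, dest, sources); instruction i (0-based)
    is fetched in clock cycle i + 1.
    """
    hazards = []
    for i, (producer, dest, _) in enumerate(program):
        write_cycle = (i + 1) + 4            # WB is the fifth stage
        for j in range(i + 1, len(program)):
            consumer, _, sources = program[j]
            read_cycle = (j + 1) + 1         # ID is the second stage
            # Reading in the same cycle as the write is safe (write first, read second)
            if dest in sources and read_cycle < write_cycle:
                hazards.append((producer, consumer, dest))
    return hazards


program = [
    ("DADD", "R1", ("R2", "R3")),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]
print(raw_hazards(program))
```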

Data Hazards
• The problem can be solved with a simple hardware
technique called forwarding (also called bypassing).
• Forwarding can be generalized to include passing a
result directly to the functional unit that requires it.
• In the previous example the result is not really
needed by the DSUB instruction until after the DADD
instruction actually produces it.
• If the result can be moved from the pipeline register
where the DADD stores it to where the DSUB needs
it, then the need for a stall can be avoided.

Data Hazards

Data Hazards
• Consider another example:
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
To prevent a stall in this sequence, we would need to forward the values of the ALU output and memory unit output from the pipeline registers to the ALU and data memory inputs.

Data Hazards
Where to find the ALU result?
• The ALU result generated in the EX stage is passed
through the pipeline registers to the MEM and WB
stages, before it is finally written to the register file.
• Since the pipeline registers already contain the ALU
result, we could just forward that value to
subsequent instructions, to prevent data hazards.

Forwarding Unit
Forwarding unit selects the correct ALU inputs for the
EX stage.
• If there is no hazard, the ALU's operands will come from the register file, just like before.
• If there is a hazard, the operands will come from either the
EX/MEM or MEM/WB pipeline registers instead.
• The ALU sources will be selected by two new multiplexers,
with control signals named ForwardA and ForwardB.

Forwarding Unit

Detecting Data Hazards
EX Hazard (ALU-ALU forwarding)
• An EX hazard occurs between the instruction currently in its EX stage and the previous instruction if:
1. The previous instruction will write to the register file, and
2. The destination is one of the ALU source registers in the EX stage.
The first ALU source comes from the pipeline register when necessary.
• if (EX/MEM.RegWrite && (EX/MEM.RegisterRd == ID/EX.RegisterRs))
then ForwardA = 10 (binary)
The second ALU source is treated in a similar fashion.
• if (EX/MEM.RegWrite && (EX/MEM.RegisterRd == ID/EX.RegisterRt))
then ForwardB = 10 (binary)

Detecting Data Hazards
MEM hazard (MEM-ALU forwarding)
• A MEM hazard may occur between an instruction in the EX
stage and the instruction from two cycles ago.
To detect and handle a MEM hazard for the first ALU source:
• if (MEM/WB.RegWrite && (MEM/WB.RegisterRd == ID/EX.RegisterRs)
&& ((EX/MEM.RegisterRd != ID/EX.RegisterRs) || !EX/MEM.RegWrite))
then ForwardA = 01
The second ALU operand is handled similarly.
• if (MEM/WB.RegWrite && (MEM/WB.RegisterRd == ID/EX.RegisterRt)
&& ((EX/MEM.RegisterRd != ID/EX.RegisterRt) || !EX/MEM.RegWrite))
then ForwardB = 01
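The EX and MEM hazard conditions can be combined into one selection function per ALU source. This is a minimal sketch of the conditions as stated above: checking the EX/MEM register first gives the same priority as the extra clause in the MEM hazard test. The default value 00 (operand from the register file) is an assumption, and the sketch does not add a check for a hard-wired zero register, which a real MIPS forwarding unit would also need.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB) multiplexer selects for the two ALU sources."""

    def select(src):
        if ex_mem_reg_write and ex_mem_rd == src:
            return 0b10   # EX hazard: forward from the EX/MEM pipeline register
        if mem_wb_reg_write and mem_wb_rd == src:
            return 0b01   # MEM hazard: forward from the MEM/WB pipeline register
        return 0b00       # no hazard: operand comes from the register file

    return select(id_ex_rs), select(id_ex_rt)


# DSUB R4, R1, R5 in EX while DADD R1, R2, R3 sits in EX/MEM
print(forwarding_unit(1, 5, True, 1, False, 0))
# AND R6, R1, R7 in EX: DSUB (Rd = 4) in EX/MEM, DADD (Rd = 1) in MEM/WB
print(forwarding_unit(1, 7, True, 4, True, 1))
```

When both pipeline registers hold a pending write to the same source register, the EX/MEM value is chosen because it is the more recent result.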

Data Hazards
Consider the following sequence of instructions:
LD R1,0(R2)
DSUB R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9
The LD instruction does not have the data until the end of clock cycle 4 (its MEM cycle), while the DSUB instruction needs to have the data by the beginning of that clock cycle.

Data Hazards
• The load instruction has a delay or latency that cannot be
eliminated by forwarding alone. Instead, we need to add
hardware, called a pipeline interlock, to preserve the correct
execution pattern.

Detecting Stalls
Consider
– LD R1, 0(R2)
– DSUB R4,R1,R5
• A load-use hazard occurs between the current instruction in its ID stage and the preceding instruction in the EX stage if:
– The preceding instruction is a load instruction, and
– The load's destination register is one of the current instruction's source registers.
• The condition that tests the above follows:
– if (ID/EX.MemRead == 1 && ((ID/EX.RegisterRt == IF/ID.RegisterRs) ||
(ID/EX.RegisterRt == IF/ID.RegisterRt)))
then stall
Detecting Stalls
