[Figure P7.1: a single-ALU datapath. Registers MBR, IR, PC, D0, and D1 sit between bus B and bus A; each register has a clock (CMBR, CIR, CPC, CD0, CD1), a gate (GMBR, GIR, GPC, GD0, GD1), and a bus-B enable (EMBR, EIR, EPC, ED0, ED1). Bus B feeds ALU latches 1 and 2, clocked by CL1 and CL2, which supply the ALU inputs P and Q; the ALU computes F = f(P,Q) under function-select bits F2 F1 F0 and drives bus A.]
You should describe the actions that occur in plain English (e.g., “Put data from this register on that bus”) and as a
sequence of events (e.g., Read = 1, EMBR). The table below defines the effect of the ALU’s function code. Note that
all data has to pass through the ALU (the copy function) to get from bus B or bus C to bus A.
F2 F1 F0   Operation
0  0  0    Copy P to bus A         A = P
0  0  1    Copy Q to bus A         A = Q
0  1  0    Copy P + 1 to bus A     A = P + 1
0  1  1    Copy Q + 1 to bus A     A = Q + 1
1  0  0    Copy P - 1 to bus A     A = P - 1
1  0  1    Copy Q - 1 to bus A     A = Q - 1
1  1  0    Copy P + Q to bus A     A = P + Q
1  1  1    Copy P - Q to bus A     A = P - Q
109
© 2014 Cengage Learning. All Rights Reserved. May not be scanned, copied, or duplicated, or posted to a publicly available website, in whole or in part.
SOLUTION
To perform the addition, D0 must be latched into one ALU latch and D1 into the other, the ALU set to add,
and the result latched into D1. That is,
ED0 = 1, CL1 ;we can do D0 or D1 in any order and we can use latch L1 or latch L2
ED1 = 1, CL2 ;copy D1 via bus B into latch 2
ALU(f2,f1,f0) = 1,1,0, CD1 ;perform addition and latch result in D1.
2. For the architecture of Figure P7.1 write the sequence of signals and control actions necessary to implement the
fetch cycle.
SOLUTION
The fetch cycle involves reading the data at the address in the PC, moving the instruction read from memory to
the IR, and updating the PC.
EPC = 1, CL1 ;move PC via B bus to latch 1
ALU(f2,f1,f0) = 0,0,0, CMAR ;pass PC through ALU and clock into MAR
;the PC is in L1 so we can increment it
ALU(f2,f1,f0) = 0,1,0, CPC ;use the ALU to increment L1 and move to PC
Read = 1, EMBR = 1, CL1 ;move instruction from memory to latch 1 via B bus
ALU(f2,f1,f0) = 0,0,0, CIR ;pass instruction through ALU and clock into IR
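As a sanity check, the sequence above can be mirrored in a few lines of Python, treating each register as a plain variable (the memory contents and the address 100 are hypothetical test values):

```python
# Register-transfer sketch of the fetch cycle on the Figure P7.1 datapath.
# Each assignment models one micro-operation from the sequence above.
memory = {100: 0x1234}   # hypothetical: instruction 0x1234 stored at address 100

pc = 100
l1 = pc                  # EPC = 1, CL1: PC via bus B into latch 1
mar = l1                 # ALU = 0,0,0 (copy): latch 1 through ALU into MAR
pc = l1 + 1              # ALU = 0,1,0 (increment): latch 1 + 1 into PC
l1 = memory[mar]         # Read = 1, EMBR = 1, CL1: memory data into latch 1
ir = l1                  # ALU = 0,0,0 (copy): latch 1 through ALU into IR

print(hex(ir), pc)       # prints: 0x1234 101
```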
SOLUTION
There is only one bus to the ALU input and no direct connection between bus B and bus A. This means that
all data has to go through the ALU, which becomes a bottleneck.
SOLUTION
Three of the operations are repeated. Since there is only one input to the ALU from bus B, via latch L1 or L2, it
does not matter whether data is passed from bus B to bus A via L1 or L2.
5. For the architecture of Figure P7.1, write the sequence of signals and control actions necessary to execute the
instruction ADD M,D0 that adds the contents of memory location M to data register D0 and deposits the results
in D0. Assume that the address M is in the instruction register IR.
SOLUTION
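A plausible sequence, using the signal conventions of the answers to problems 1 and 2 (the exact grouping of micro-operations may differ from the book's own solution):

```text
EIR = 1, CL1                ;move address M from IR via bus B to latch 1
ALU(f2,f1,f0) = 0,0,0, CMAR ;pass M through the ALU and clock it into MAR
Read = 1, EMBR = 1, CL1     ;read the operand from memory into latch 1
ED0 = 1, CL2                ;copy D0 via bus B into latch 2
ALU(f2,f1,f0) = 1,1,0, CD0  ;add latch 1 to latch 2 and clock the sum into D0
```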
6. This question asks you to implement register indirect addressing. For the architecture of Figure P7.1, write the
sequence of signals and control actions necessary to execute the instruction ADD (D1),D0 that adds the
contents of the memory location pointed at by the contents of register D1 to register D0, and deposits the results
in D0. This instruction is defined in RTL form as [D0] ← [[D1]] + [D0].
SOLUTION
Here, we have to read the contents of a register, use it as an address, and read from memory.
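Continuing in the same conventions, a plausible sequence (signal names as in the earlier answers):

```text
ED1 = 1, CL1                ;move the pointer in D1 via bus B to latch 1
ALU(f2,f1,f0) = 0,0,0, CMAR ;use [D1] as the memory address
Read = 1, EMBR = 1, CL1     ;read the operand into latch 1
ED0 = 1, CL2                ;copy D0 into latch 2
ALU(f2,f1,f0) = 1,1,0, CD0  ;add and clock the sum into D0
```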
7. This question asks you to implement memory indirect addressing. For the architecture of Figure P7.1, write the
sequence of signals and control actions necessary to execute the instruction ADD [M],D0 that adds the
contents of the memory location pointed at by the contents of memory location M to register D0, and deposits the
results in D0. This instruction is defined in RTL form as [D0] ← [[M]] + [D0].
SOLUTION
We have to read the contents of a memory location, use it as an address, and read from memory. We can begin
with the same code we used for ADD M,D0.
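One possible sequence along those lines (signal names as in the earlier answers; note the second pass through the MAR):

```text
EIR = 1, CL1                ;address M from IR to latch 1
ALU(f2,f1,f0) = 0,0,0, CMAR ;first access: address M
Read = 1, EMBR = 1, CL1     ;read the pointer stored at M into latch 1
ALU(f2,f1,f0) = 0,0,0, CMAR ;second access: use the pointer as the address
Read = 1, EMBR = 1, CL1     ;read the actual operand into latch 1
ED0 = 1, CL2                ;copy D0 into latch 2
ALU(f2,f1,f0) = 1,1,0, CD0  ;add and clock the sum into D0
```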
8. This question asks you to implement memory indirect addressing with index. For the architecture of Figure P7.1,
write the sequence of signals and control actions necessary to execute the instruction ADD [M,D1],D0, that
adds the contents of the memory location pointed at by the contents of memory location M plus the contents of
register D1 to register D0, and deposits the results in D0. This instruction is defined in RTL form as
[D0] ← [[M]+[D1]] + [D0].
SOLUTION
We have to read the contents of a memory location, generate an address by adding this to a data register, and
then use the sum to get the actual data. We can begin with the same code we used for ADD [M],D0.
Note how microprogramming can implement any arbitrarily complex addressing mode.
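A plausible sequence (the effective-address addition reuses the ALU add function; signal names as in the earlier answers):

```text
EIR = 1, CL1                ;address M from IR to latch 1
ALU(f2,f1,f0) = 0,0,0, CMAR ;access memory location M
Read = 1, EMBR = 1, CL1     ;read the pointer at M into latch 1
ED1 = 1, CL2                ;index register D1 into latch 2
ALU(f2,f1,f0) = 1,1,0, CMAR ;effective address [M] + [D1] into MAR
Read = 1, EMBR = 1, CL1     ;read the operand into latch 1
ED0 = 1, CL2                ;copy D0 into latch 2
ALU(f2,f1,f0) = 1,1,0, CD0  ;add and clock the sum into D0
```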
9. For the microprogrammed architecture of Figure P7.1, define the sequence of actions (i.e., micro‐operations)
necessary to implement the instruction TXP1 (D0)+,D1 that is defined as:
[D1] ← 2*[M([D0])] + 1
[D0] ← [D0] + 1
Explain the actions in plain English and as a sequence of enables, ALU controls, memory controls and clocks. This
is quite a complex instruction because it requires a register‐indirect access to memory to get the operand and it
requires multiplication by two (there is no ALU multiplication instruction). You will probably have to use a
temporary register to solve this problem and you will find that it requires several cycles to implement this
instruction. A cycle is a sequence of operations that terminates in clocking data into a register.
SOLUTION
Now we have to perform quite a complex operation; that is, read from memory using a register indirect address.
The address is obtained by reading the data in the location pointed at by D0, multiplying this value by 2 and
adding 1. We have no multiplier or shifter, so we must add the number to itself.
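One plausible sequence, assuming the MBR holds its contents long enough to be enabled onto bus B in two successive cycles, and using D1 itself as the temporary register:

```text
ED0 = 1, CL1                ;pointer in D0 to latch 1
ALU(f2,f1,f0) = 0,0,0, CMAR ;use [D0] as the memory address
Read = 1, EMBR = 1, CL1     ;operand from memory into latch 1
EMBR = 1, CL2               ;and into latch 2 (MBR still holds the data)
ALU(f2,f1,f0) = 1,1,0, CD1  ;D1 = operand + operand = 2*operand
ED1 = 1, CL1                ;D1 back to latch 1
ALU(f2,f1,f0) = 0,1,0, CD1  ;D1 = 2*operand + 1
ED0 = 1, CL1                ;D0 to latch 1
ALU(f2,f1,f0) = 0,1,0, CD0  ;D0 = D0 + 1
```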
10. Why was microprogramming such a popular means of implementing control units in the 1980s?
SOLUTION
In the 1980s memory was horrendously expensive by comparison with the cost of memory today. Every byte was
precious. Consequently, complex instructions were created to do a lot of work per instruction. These instructions
were interpreted in microcode in the CPU. Today, memory is cheap and simple, regular instructions are the order
of the day (i.e., RISC). However, some processors like the IA32 have legacy complex instructions that are still
interpreted by means of microcode.
SOLUTION
Microcode is not generally used today in new processors because executing microcode involves too many data
paths in series. In particular, there are several ROM look‐up paths in series. First, it is necessary to look up the
instruction to decode it. Then you have to look up each microinstruction in the microinstruction memory. Today,
RISC‐like processors with 32‐bit instructions are encoded so that the instruction word itself is able to directly
generate the signals necessary to interpret the instruction in a single cycle. In other words, the machine itself has
become the new microcode.
12. Figure P7.12 from the text demonstrates the execution of a conditional branch instruction in a flow‐through
computer. The grayed out sections of the computer are not required by a conditional branch instruction. Can you
think of any way in which these unused elements of the computer could be used during the execution of a
conditional branch?
[Figure P7.12: execution of BRA Target in a flow-through computer. The literal L is sign-extended to 32 bits, left-shifted twice, and added to [PC] + 4 by the branch adder, giving the target address [PC] + 4 + 4*L. The Z-bit from the CCR controls the PC multiplexer, which selects between the next address and the branch address. The register file, ALU, and data memory paths are grayed out because a conditional branch does not use them.]
SOLUTION
In this example, the register file, ALU, and data memory are not in use. This raises an interesting question. Could a
branch be combined with another operation that could be performed in parallel (rather like the VLIW (very long
instruction word) computers that we look at in Chapter 8)? For example, you could imagine an instruction BEQ
target: r0++ which performs a conditional branch to target and also increments register r0. Of course, the
price of such an extension would be to reduce the number of bits available for the target address.
13. What modifications would have to be made to the architecture of the computer in Figure P7.12 to implement
predicated execution like the ARM?
SOLUTION
The ARM predicates instructions; for example, ADDEQ r0,r1,r2. A predicated instruction is executed if the
stated condition is true; in this case, ADDEQ r0,r1,r2 is executed if the Z-bit of the status register is set. One way of
implementing predicated execution would be to jam a NOP (no operation) instruction into the
instruction register if the predicated condition is false. Another solution would be to put AND gates in all paths
that generate signals that clock or update registers and status values. If the predicated condition is false, all
signals that perform an update are negated and the state of the processor does not change.
14. What modifications would have to be added to the computer of Figure P7.12 to add a conditional move
instruction with the format MOVZ r1,r2,r3 that performs [r1] ← [r2] if [r3] == 0?
SOLUTION
The basic data movement can be implemented in the normal way using existing data paths from the register file,
through the ALU and the memory multiplexer, and back to the register file. To implement the conditional action, register r3
must be routed to the ALU and compared with zero. The result of the comparison is used to determine whether a
writeback (i.e., writing r2 into r1) would take place in the next pipeline stage.
15. What modifications would have to be made to the architecture of the computer in Figure P7.12 to implement
operand shifting (as part of a normal instruction) like the ARM?
SOLUTION
As in the case of the ARM processor family, it would require a barrel shifter in one of the inputs to the ALU so that
the operand is shifted before use. The number of shifts to be performed could be taken from the op‐code (for
example, from the literal field). However, the existing structure could not implement an ARM‐like dynamic shift
ADD r0,r1,r2, lsl r3, because the register file does not have three read address inputs. In order to provide
dynamic shifts, it would be necessary to add an extra address input and read data port to the register file.
16. Derive an expression for the speedup ratio (i.e., the ratio of the execution time without pipelining to the
execution time with pipelining) of a pipelined processor in terms of the number of stages in the pipeline m and
the number of instructions to be executed N.
SOLUTION
Suppose that the number of instructions to be executed is N. It would take N + m - 1 clock cycles to execute. The
factor (m - 1) is due to the time for the last instruction to pass through the pipeline. The speedup relative to an
unpipelined system that would require N⋅m cycles (N instructions, each taking m stage-times) is N⋅m/(N + m - 1).
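The ratio can be checked numerically; the sketch below simply evaluates the formula just derived:

```python
# Speedup of an m-stage pipeline over an unpipelined machine for N instructions:
# S = N*m / (N + m - 1). For large N this approaches m, the stage count.
def speedup(m, N):
    return (N * m) / (N + m - 1)

print(speedup(5, 1000))   # close to the ideal speedup of 5
print(speedup(4, 1))      # a single instruction gains nothing: 1.0
```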
17. In what ways is the formula for the speedup of the pipeline derived in the previous question flawed?
SOLUTION
There are two flaws. The first is that the pipeline can be exploited fully only if the pipeline is continually supplied
with instructions. However, interactions between data elements, competition for resources, and branch
operations reduce the efficiency of a pipeline. These factors can introduce stall cycles (wait states for resources)
or force the pipeline to be flushed.
However, there is another factor to consider. In order to pipeline a process, it is necessary to place a register
between stages. The register has a setup and hold time which must be taken into account; that is, the pipeline
register increases the effective length of each stage.
18. A processor executes an instruction in the following six stages. The time required by each stage in picoseconds
(1,000 ps = 1 ns) is given for each stage.
SOLUTION
a. Add up the individual times: 300 + 150 + 250 + 350 + 700 + 200 = 1950ps = 1.950ns
b. The longest stage is 700 ps which determines the clock period. With 20 ps for the latches, the time is 720 × 6
= 4320 ps = 4.32 ns.
d. 75% of instructions are not taken branches; these contribute 0.75 × 720 ps = 540 ps to the average. The 25%
that are taken branches cost 3 cycles each and contribute 0.25 × 3 × 720 ps = 540 ps. The average time per
instruction is therefore 540 + 540 = 1080 ps.
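The weighted-average arithmetic in part (d) can be verified directly:

```python
# Average time per instruction when 75% of instructions cost one cycle and
# 25% (taken branches) cost three cycles; the clock is the longest stage
# (700 ps) plus the 20 ps latch overhead.
clock_ps = 700 + 20                # 720 ps clock period
t_not_taken = 0.75 * clock_ps      # contribution of the 1-cycle instructions
t_taken = 0.25 * 3 * clock_ps      # contribution of the 3-cycle taken branches
print(t_not_taken + t_taken)       # 1080.0 (picoseconds)
```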
19. Both RISC and CISC processors have registers. Answer the following questions about registers.
a. Is it true that a larger number of registers in any architecture is always better than a smaller number?
b. What limits the number of registers that can be implemented by any ISA?
c. What are the relative advantages and disadvantages of dedicated registers like the IA32 architecture
compared to general purpose registers like ARM and MIPS?
d. If you have an m-bit register select field in an instruction, you can’t have more than 2^m registers. There are, in
fact, ways round this restriction. Suggest ways of increasing the number of registers beyond 2^m while keeping
an m-bit register select field.
SOLUTION
a. In principle yes, as long as you don’t have to pay a price for them. More registers means fewer memory
accesses and that is good. However, if you have to perform a context switch when you run a new task, having
to save a lot of registers may be too time‐consuming. Having more registers requires more bits in an
instruction to specify them. If you allocate too many bits to register specification then you have a more
limited instruction set.
b. Today, it’s the number of bits required to specify a register. A processor like the Itanium IA64 with a much
longer instruction word can specify more registers.
c. Having fixed special purpose registers permits more compressed code. For example, if you have a counter
register, any instruction using the counter doesn’t need to specify the register – because that is fixed. The
weakness is that you can’t have two counter registers. Computers that originated in the CISC area like the
IA32 architecture use special‐purpose registers, because they were designed when saving bits (reducing
instruction size) was important. Remember that early 8‐bit microprocessors had an 8‐bit instruction set.
More recent architectures are RISC-based and have general-purpose register sets. ARM processors are
unusual in the sense that their small general-purpose register set includes two special-purpose
registers: a link register for return addresses and the program counter itself.
d. Of course, you can’t address more than 2^m registers with an m-bit address field. But you can use a set of more
than 2^m registers of which only 2^m are currently visible. Such a so-called windowing technique has been used
in, for example, the Berkeley RISC and the SPARC processor. Essentially, every time you call a
subroutine/function you get a new set of register windows (these are still numbered r0 to r31). However,
each function has its own private registers that cannot be accessed from other functions. There are also
global registers common to all functions and parameter passing registers that are shared with parent and
child functions. Such mechanisms have not proved popular. The problem is that if you deeply nest
subroutines, you end up having to dump registers to memory.
20. Someone once said, “RISC is to hardware what UNIX is to software”. What do you think this statement means and
is it true?
SOLUTION
This is one of those pretentious statements that people make for effect. UNIX is the operating system loved by
many computer scientists and is often contrasted with operating systems from large commercial organizations
such as Microsoft. By analogy, RISC processors were once seen as an opportunity for small companies and
academics to develop hardware at a time when existing processors were being developed by large corporations at
considerable expense. Relatively small teams were required to design MIPS or the ARM processor compared to an
Intel IA32 processor. In that sense RISC/UNIX were seen as returning hardware/software to the masses. Over the
years, the distinction between RISC and CISC processors has become very blurred, even though the computing world
is still, to some extent, divided into UNIX and Windows spheres.
21. What are the characteristics of a RISC processor that distinguish it from a CISC processor? Does it matter whether
this question is asked in 2015 or 1990?
SOLUTION
The classic distinction between RISC processors and CISC processors is that RISC processors are pipelined and
have small, simple, and highly regular instruction sets. RISC processors are also called load/store processors because
the only memory access operations are load and store. All data processing operations are register-to-register.
CISC processors tend to have irregular instruction sets, special-purpose registers, complex instruction
interpretation hardware, and memory-to-memory operations. However, the difference between modern RISC and
CISC processors is blurred and the distinction is no longer as significant as it was. RISC techniques have been
applied to CISC processors and even traditional complex instruction set processors are highly pipelined. Equally,
some RISC processors have quite complex instruction sets. One difference is that today’s RISC processors have not
returned to memory‐to‐memory or memory‐to‐register instruction formats.
22. What, in the context of pipelined processors, is a bubble and why is it detrimental to the performance of a
pipelined processor?
SOLUTION
As an instruction flows through a pipeline, various operations are applied to it. For example, in the first stage it is
fetched from memory and it may be decoded. In the second stage any operands it requires are read from the
register file, and so on. Sometimes, it is not possible to perform an operation on an instruction. For example, if an
operand is required and that operand is not ready, the stage processing the operand cannot continue. This results
in a bubble or a stall when ‘nothing happens’. Equally, bubbles appear when a branch is taken and instructions
following the branch are no longer going to be executed. So, a bubble is any condition that leads to a stage in the
pipeline not performing its normal operation because it cannot proceed. A bubble is detrimental to performance
because it means that an operation that could be executed is not executed and its time slot is wasted.
23. To say that the RISC philosophy was all about reducing the size of instruction sets would be wrong and entirely
miss the point. What enduring trends or insights did the so‐called RISC revolution bring to computer architecture
including both RISC and CISC design?
SOLUTION
Designers learned to look at the whole picture rather than just optimizing one or two isolated aspects of the
processor. In particular there was a movement toward the use of benchmarks to improve performance. That is,
engineers applied more rigorous design techniques to the construction of new processors.
24. There are RAW, WAR, and WAW data hazards. What about RAR (read‐after‐read)? Can a RAR operation cause
problems in a pipelined machine?
SOLUTION
In a sequence such as ADD r0,r1,r2 followed by ADD r3,r2,r4, register r2 is read by both instructions. Since the
value of r2 is altered by neither instruction and it does not matter (semantically) which instruction is executed
first, a RAR dependency can cause no problem.
25. Consider the instruction sequence in a five‐stage pipeline IF, OF, E, M, OS:
1. ADD r0,r1,r2
2. ADD r3,r0,r5
3. STR r6,[r7]
4. LDR r8,[r7]
Instructions 1 and 2 will create a RAW hazard. What about instructions 3 and 4? Will they also create a RAW
hazard?
SOLUTION
Yes, possibly. Register r6 may not have been stored in memory before the next instruction reads that location. Of
course, part of the problem is the bad code: you are storing a value in memory and then immediately reading it back. You
should replace the LDR r8,[r7] by MOV r8,r6.
26. A RISC processor has a three‐address instruction format and typical arithmetic instructions (i.e., ADD, SUB, MUL,
DIV etc.). Write a suitable sequence of instructions to evaluate the following expression in the minimum time:
X = ((A+B)(A+B+C)E + H) / (G + A + B + D + F(A+B-C))
Assume that all variables are in registers and that the RISC does not include a hardware mechanism for the
elimination of data dependency. Each instance of data dependency causes one bubble in the pipeline and wastes
one clock cycle.
SOLUTION
It is necessary to write the code with the minimum number of RAWs. For example,
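one possible ordering, written in a generic three-address RISC assembly (the temporaries t1-t6 and the use of variable names as register operands are illustrative):

```text
ADD t1,A,B        ; t1 = A+B, the common subexpression
ADD t4,G,D        ; independent work separates dependent pairs
ADD t2,t1,C       ; t2 = A+B+C
SUB t3,t1,C       ; t3 = A+B-C
MUL t5,t1,t2      ; t5 = (A+B)(A+B+C)
MUL t6,F,t3       ; t6 = F(A+B-C)
MUL t5,t5,E       ; t5 = (A+B)(A+B+C)E
ADD t4,t4,t1      ; t4 = G+A+B+D
ADD t5,t5,H       ; t5 = numerator
ADD t4,t4,t6      ; t4 = denominator
DIV X,t5,t4       ; X = numerator/denominator
```

Every dependent pair is separated by at least one independent instruction except the final DIV, which necessarily consumes the most recently generated value, so one bubble before the DIV appears unavoidable.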
27. Figure P7.27 gives a partial skeleton diagram of a pipelined processor. What is the purpose of the flip‐flops
(registers) in the information paths?
SOLUTION
The problem with an architecture like that of Figure P7.27 is that when an instruction is processed (e.g., an operation
and its operands), all the information must be in place at the same time. For example, if you perform a = b + c
followed by p = q ‐ r, it would be unfortunate if q and r arrived at the ALU at the same time as the + operator. This
would lead to the erroneous operation p = q + r.
Once an instruction goes from PC to instruction memory to instruction register, it is divided into fields (operands,
constants, instructions) and each of these fields provides data that flows along different paths. For example, the
op‐code goes to the ALU immediately, whereas the operands (during a register‐to‐register operation) go via the
register file where operand addresses are translated into operand values. The flip‐flops equalize the time at which
data and operations arrive at the ALU. It is also necessary to put a delay in the destination address path because
the destination address has to wait an extra cycle – the time required for the ALU to perform an operation.
28. Explain why branch operations reduce the efficiency of a pipelined architecture, and describe how branch prediction
improves the performance of a RISC processor and minimizes the effect of branches.
SOLUTION
[Figure: a four-stage pipeline (IF, OF, E, S) showing the bubble created when a taken branch flushes the instructions that follow it.]
The figure demonstrates the effect of a bubble in a pipelined architecture due to a branch. The pipeline inputs a
string of instructions and executes them in stages; in this example there are four. Once the pipe is full, four instructions
are in varying stages of completion. If a branch is read into the pipeline and that branch is taken, the instructions
following the branch are not going to be executed. Instructions ahead of the branch will be executed. A bubble is
the term used to describe a pipeline state where the current instruction must be rejected. In this figure it takes
two clocks before normal operation can be resumed.
29. Assume that a RISC processor uses branch prediction to improve its performance. The table below gives the
number of cycles taken for predicted and actual branch outcomes. These figures include both the cycles taken by
the branch itself and the branch penalty associated with branch instructions.
                       Actual
Prediction     Not taken     Taken
Not taken          1           4
Taken              2           1
If pb is the probability that a particular instruction is a branch, pt is the probability that a branch is taken, and pw is
the probability of a wrong prediction, derive an expression for the average number of cycles per instruction, TAVE.
All non‐branch instructions take one cycle to execute.
SOLUTION
The average number of cycles per instruction is the sum of: non-branch cycles + branches not taken and
predicted not taken + branches not taken and predicted taken + branches taken and predicted taken +
branches taken and predicted not taken.
In each case, we multiply the probability of the event by the cost of the event; that is:
TAVE = (1 - pb)⋅1 + pb⋅[(1 - pt)(1 - pw)⋅1 + (1 - pt)⋅pw⋅2 + pt⋅(1 - pw)⋅1 + pt⋅pw⋅4]
Remember that if pt is the probability of a branch being taken, 1 - pt is the probability of a branch not being taken.
If pw is the probability of a wrong prediction, (1 - pw) is the probability of a correct prediction.
Expanding the bracket and collecting terms, the average number of cycles is
TAVE = 1 - pb⋅(-pw - 2⋅pt⋅pw) = 1 + pb⋅pw⋅(1 + 2⋅pt)
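The closed form can be checked against a direct case-by-case evaluation; the probability values below are arbitrary test inputs:

```python
# Compare the case-by-case sum with the closed form T_AVE = 1 + pb*pw*(1 + 2*pt).
def t_ave_cases(pb, pt, pw):
    branch = ((1 - pt) * (1 - pw) * 1 +   # not taken, predicted not taken
              (1 - pt) * pw * 2 +         # not taken, predicted taken
              pt * (1 - pw) * 1 +         # taken, predicted taken
              pt * pw * 4)                # taken, predicted not taken
    return (1 - pb) * 1 + pb * branch

def t_ave_closed(pb, pt, pw):
    return 1 + pb * pw * (1 + 2 * pt)

# the two forms agree for arbitrary probabilities
print(t_ave_cases(0.2, 0.6, 0.1), t_ave_closed(0.2, 0.6, 0.1))
```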
30. IDT application note AN33 [IDT89] gives an expression for the average number of cycles per instruction in a RISC
system as TAVE = Pb(1 + b) + Pm(1 + m) + (1 - Pb - Pm), where Pb and Pm are the probabilities that an
instruction is a branch or a memory access, and b and m are the corresponding penalties.
SOLUTION
The first term, Pb(1 + b), is the probability of a branch multiplied by the total cost of a branch (i.e., 1 plus the
branch penalty). The second Pm(1 + m) term deals with memory accesses and is the probability of a memory
access multiplied by the total memory access cost. The final term, (1 - Pb - Pm), covers what’s left over:
instructions that are neither branches nor memory accesses, each costing one cycle.
This formula is limited in the sense that it does not distinguish between branches that are taken and
not taken, or between memory accesses that hit the cache and those that miss. However, its message is clear:
reduce both branches and memory accesses.
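Evaluating the AN33 expression for illustrative figures makes the message concrete (the penalty and probability values below are hypothetical, not from the application note):

```python
# AN33-style cycles per instruction: Pb, Pm are the probabilities that an
# instruction is a branch or memory access; b, m are the extra cycle penalties.
def cycles_per_instruction(Pb, b, Pm, m):
    return Pb * (1 + b) + Pm * (1 + m) + (1 - Pb - Pm)

# e.g. 20% branches with a 2-cycle penalty, 30% memory accesses with 1 extra cycle
print(cycles_per_instruction(0.2, 2, 0.3, 1))   # about 1.7 cycles per instruction
```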
31. RISC processors rely (to some extent) on on‐chip registers for their performance increase. A cache memory can
provide a similar level of performance increase without restricting the programmer to a fixed set of registers.
Discuss the validity of this statement.
SOLUTION
Memory accesses can take orders of magnitude longer than register accesses. Because RISC style processors have
far more registers than CISC processors, it is possible to operate on a subset of data stored within the chip and to
reduce memory accesses.
However, cache memory, which is a copy of some frequently‐used memory, can reduce the memory access
penalty by keeping data in the on‐chip cache.
One argument in favor of cache is that it is handled automatically by the hardware. Registers have to be allocated
by the programmer or the compiler. If the number of registers is limited, it is possible that the on‐chip registers
may be used/allocated non‐optimally.
Cache memory also has the advantage that it supports dynamic data structures like the stack. Most computers do
not allow dynamic data structures based on registers (that is, you can’t access register ri, where i is an index). The
Itanium IA64 that we discuss in Chapter 8 does indeed have dynamic registers.
32. RISC processors best illustrate the difference between architecture and implementation. To what extent is this
statement true (or not true)?
SOLUTION
We have already stated that architecture and organization are orthogonal; that is, they are independent. In
principle, this statement is true. You can create an instruction set on paper and then implement it any way you
want; via direct logic (called random logic) or via a structure such as microprogramming. However, some design
or organization techniques may be suited or unsuited to a particular architecture. CISC processors are
characterized both by complicated instructions (i.e., multiple-part instructions or instructions with complex
addressing modes; for example, BFFFO, which locates the first bit set to 1, can be regarded as a
complex instruction) and by irregular instruction encodings. Consequently, CISC instruction sets are well-suited to
implementation/interpretation via microcode. The instruction lookup table simply translates a machine code
value into the location of the appropriate microcode. It doesn’t matter how odd the instruction encoding is.
RISC processors with simple instructions are well suited to implementation by pipelining because of the regularity
of a pipeline; that is, all instructions are executed in approximately the same way.
33. A RISC processor executes the following code. There are no data dependencies.
ADD r0,r1,r2
ADD r3,r4,r5
ADD r6,r7,r8
ADD r9,r10,r11
ADD r12,r13,r14
ADD r15,r16,r17
a. Assuming a 4‐stage pipeline fetch, operand fetch, execute, write, what registers are being read during the 6th
clock cycle and what register is being written?
b. Assuming a 5‐stage pipeline fetch, operand fetch, execute, write, store, what registers are being read during
the 6th clock cycle and what register is being written?
SOLUTION
a. Four-stage pipeline:

Cycle              1   2   3   4   5   6   7   8   9
ADD r0,r1,r2       IF  OF  E   W
ADD r3,r4,r5           IF  OF  E   W
ADD r6,r7,r8               IF  OF  E   W
ADD r9,r10,r11                 IF  OF  E   W
ADD r12,r13,r14                    IF  OF  E   W
ADD r15,r16,r17                        IF  OF  E   W

During the 6th clock cycle, operands r13 and r14 are being read and register r6 is being written.

b. Five-stage pipeline (IF, OF, E, W, S, with the register write taking place in the final store stage): during the
6th clock cycle, operands r13 and r14 are being read and register r3 is being written.
34. A RISC processor executes the following code. There are data dependencies but no internal forwarding. A source
operand cannot be used until it has been written.
ADD r0,r1,r2
ADD r3,r0,r4
ADD r5,r3,r6
ADD r7,r0,r8
ADD r9,r0,r3
ADD r0,r1,r3
a. Assuming a 4‐stage pipeline: fetch, operand fetch, execute, result write, what registers are being read during
the 10th clock cycle and what register is being written?
b. How long will it take to execute the entire sequence?
SOLUTION
Cycle              1   2   3   4   5   6   7   8   9   10  11  12  13
ADD r0,r1,r2       IF  OF  E   W
ADD r3,r0,r4           IF  -   -   OF  E   W
ADD r5,r3,r6               IF  -   -   -   OF  E   W
ADD r7,r0,r8                   IF  -   -   -   OF  E   W
ADD r9,r0,r3                       IF  -   -   -   OF  E   W
ADD r0,r1,r3                           IF  -   -   -   OF  E   W

a. In the 10th cycle registers r0 and r3 are being read (instruction 5) and register r5 is being written (instruction 3).
b. The final write occurs in cycle 13, so the sequence takes 13 clock cycles to execute.
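The stall arithmetic can be cross-checked with a small Python sketch of this no-forwarding four-stage pipeline (the scheduling model, tuple encoding, and helper name are illustrative, not from the text):

```python
# Cycle counter for a 4-stage (IF, OF, E, W) in-order pipeline with no
# forwarding: a source register cannot be read (OF) until the cycle after
# the instruction that writes it completes W. Instructions are (dest, src1, src2).
code = [
    ("r0", "r1", "r2"),
    ("r3", "r0", "r4"),
    ("r5", "r3", "r6"),
    ("r7", "r0", "r8"),
    ("r9", "r0", "r3"),
    ("r0", "r1", "r3"),
]

def total_cycles(code):
    write_cycle = {}    # register -> cycle in which its value is written
    prev_of = 0         # OF cycle of the previous instruction (in-order issue)
    last_w = 0
    for i, (dst, s1, s2) in enumerate(code, start=1):
        # earliest OF: after the previous instruction's OF, after our own IF
        # (cycle i), and after every source operand has been written
        of = max(prev_of + 1, i + 1,
                 *(write_cycle.get(s, 0) + 1 for s in (s1, s2)))
        w = of + 2      # E follows OF, W follows E
        write_cycle[dst] = w
        prev_of, last_w = of, w
    return last_w

print(total_cycles(code))   # 13 for the sequence above
```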
35. A RISC processor has an eight stage pipeline: F D O E1 E2 MR MW WB (fetch, decode, register read operands,
execute 1, execute 2, memory read, memory write, result writeback to register). Simple logical and arithmetic
operations are complete by the end of E1. Multiplication is complete by the end of E2. How many cycles are
required to execute the following code assuming that internal forwarding is not used?
MUL r0,r1,r2
ADD r3,r1,r4
ADD r5,r1,r6
ADD r6,r5,r7
LDR r1,[r2]
SOLUTION
Cycle            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
1 MUL r0,r1,r2   F  D  O  E1 E2 MR MW WB
2 ADD r3,r1,r4      F  D  O  E1 E2 MR MW WB
3 ADD r5,r1,r6         F  D  O  E1 E2 MR MW WB
4 ADD r6,r5,r7            F  D  -  -  -  -  -  O  E1 E2 MR MW WB
5 LDR r1,[r2]                F  D  -  -  -  -  -  O  E1 E2 MR MW WB

There’s only one RAW dependency, in instruction 4, involving r5. Without forwarding, instruction 4 cannot read r5
until it has been written back in cycle 10, so its operand fetch is delayed to cycle 11. The total number of cycles is 17.
36. Repeat the previous problem assuming that internal forwarding is implemented.
SOLUTION
Cycle            1  2  3  4  5  6  7  8  9  10 11 12
1 MUL r0,r1,r2   F  D  O  E1 E2 MR MW WB
2 ADD r3,r1,r4      F  D  O  E1 E2 MR MW WB
3 ADD r5,r1,r6         F  D  O  E1 E2 MR MW WB
4 ADD r6,r5,r7            F  D  O  E1 E2 MR MW WB
5 LDR r1,[r2]                F  D  O  E1 E2 MR MW WB

With internal forwarding, the result of instruction 3 (r5, complete at the end of E1 in cycle 6) is forwarded
directly to instruction 4’s E1 stage in cycle 7, so no stalls are needed and the code executes in 12 cycles.
37. Consider the same structure as question 35 but with the following code fragment. Assume that internal
forwarding is possible and an operand can be used as soon as it is generated. Show the execution of this code.
LDR r0,[r2]
ADD r3,r0,r1
MUL r3,r3,r4
ADD r6,r5,r7
STR r3,[r2]
ADD r6,r5,r7
SOLUTION
Cycle            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
1 LDR r0,[r2]    F  D  O  E1 E2 MR MW WB
2 ADD r3,r0,r1      F  D  O  .  .  E1 E2 MR MW WB
3 MUL r3,r3,r4         F  D  O  .  .  E1 E2 MR MW WB
4 ADD r6,r5,r7            F  D  O  .  .  E1 E2 MR MW WB
5 STR r3,[r2]                F  D  O  .  .  E1 E2 MR MW WB
6 ADD r6,r5,r7                  F  D  O  .  .  E1 E2 MR MW WB
The load's result r0 is not available until the end of MR (cycle 6), so instruction 2 stalls for two cycles before entering E1; the remaining instructions simply trail it in order. The MUL's r3 is generated at the end of E2 (cycle 9), before the STR needs it. Total: 15 cycles.
38. The following table gives a sequence of instructions that are performed on a 4‐stage pipelined computer. Detect
all hazards. For example if instruction m uses operand r2 generated by instruction m‐1, then write m‐1,r2 in the
RAW column in line m.
SOLUTION
Note that some of the hazards are technical hazards and not real hazards. For example, instruction 3 does not
suffer a RAW hazard on r1 because any delay will have been swallowed by the previous instruction.
The processor has a five-stage pipeline F O E M S; that is, instruction fetch, operand fetch, execute, memory
access, and operand writeback to the register file.
a. How many cycles does this code take to execute assuming internal forwarding is not used?
b. How many cycles does this code take to execute assuming internal forwarding is used?
c. How many cycles does the code take to execute assuming that it is reordered (no internal forwarding)?
d. How many cycles does the code take to execute assuming reordering and internal forwarding?
SOLUTION
a. No forwarding
(An operand is read in O only after the producing instruction's S stage; . = stall.)
Cycle             1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21
1 LDR r1,[r6]     F  O  E  M  S
2 ADD r1,r1,#1       F  .  .  .  O  E  M  S
3 LDR r2,[r6,#4]        F  .  .  .  O  E  M  S
4 ADD r2,r2,#1             F  .  .  .  .  .  .  O  E  M  S
5 ADD r3,r1,r2                F  .  .  .  .  .  .  .  .  .  O  E  M  S
6 ADD r8,r8,#4                   F  .  .  .  .  .  .  .  .  .  O  E  M  S
7 STR r2,[r6,#8]                    F  .  .  .  .  .  .  .  .  .  O  E  M  S
8 SUB r4,r4,#64                        F  .  .  .  .  .  .  .  .  .  O  E  M  S
Total: 21 cycles.
b. Forwarding
(An operand is consumed in E in the cycle after it is generated: the end of E for ALU results, the end of M for loads; . = stall.)
Cycle             1  2  3  4  5  6  7  8  9  10 11 12 13 14
1 LDR r1,[r6]     F  O  E  M  S
2 ADD r1,r1,#1       F  O  .  E  M  S
3 LDR r2,[r6,#4]        F  O  .  E  M  S
4 ADD r2,r2,#1             F  O  .  .  E  M  S
5 ADD r3,r1,r2                F  O  .  .  E  M  S
6 ADD r8,r8,#4                   F  O  .  .  E  M  S
7 STR r2,[r6,#8]                    F  O  .  .  E  M  S
8 SUB r4,r4,#64                        F  O  .  .  E  M  S
Total: 14 cycles.
c. Reordering
(No forwarding; . = stall.)
Cycle             1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
1 LDR r1,[r6]     F  O  E  M  S
2 LDR r2,[r6,#4]     F  O  E  M  S
3 ADD r8,r8,#4          F  O  E  M  S
4 ADD r1,r1,#1             F  .  O  E  M  S
5 ADD r2,r2,#1                F  .  O  E  M  S
6 SUB r4,r4,#64                  F  .  O  E  M  S
7 ADD r3,r1,r2                      F  .  .  .  O  E  M  S
8 STR r2,[r6,#8]                       F  .  .  .  O  E  M  S
Total: 15 cycles.
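The stall placement in this problem follows two simple rules: without forwarding, a consumer reads its operands in O only after the producer's S stage; with forwarding, it consumes them in E one cycle after they are generated (end of E for ALU results, end of M for loads). The sketch below (illustrative Python, not from the text) encodes those rules and computes the total cycle count for each case; under the same assumptions it also answers part (d), where reordering plus forwarding removes every stall.

```python
# Illustrative sketch: schedule an in-order 5-stage (F O E M S) pipeline
# under the stall rules assumed in this solution.
def schedule(prog, forwarding):
    done = {}    # register -> cycle in which its value becomes usable
    prev = 0     # cycle of the previous instruction's stalled stage
    last = 0
    for i, (is_load, dest, srcs) in enumerate(prog, start=1):
        dep = max((done.get(r, 0) for r in srcs), default=0)
        if forwarding:
            e = max(prev + 1, i + 2, dep + 1)   # natural E is cycle i+2
            ready = e + 1 if is_load else e     # loads deliver after M
            s = e + 2
        else:
            e = max(prev + 1, i + 1, dep + 1)   # this models the O stage
            ready = e + 3                       # value usable only after S
            s = e + 3
        prev = e
        if dest:
            done[dest] = ready
        last = s
    return last                                 # cycle of the final S

# (is_load, destination, sources) for the code in this problem
orig = [
    (True,  "r1", ["r6"]),         # LDR r1,[r6]
    (False, "r1", ["r1"]),         # ADD r1,r1,#1
    (True,  "r2", ["r6"]),         # LDR r2,[r6,#4]
    (False, "r2", ["r2"]),         # ADD r2,r2,#1
    (False, "r3", ["r1", "r2"]),   # ADD r3,r1,r2
    (False, "r8", ["r8"]),         # ADD r8,r8,#4
    (False, None, ["r2", "r6"]),   # STR r2,[r6,#8]
    (False, "r4", ["r4"]),         # SUB r4,r4,#64
]
reordered = [orig[0], orig[2], orig[5], orig[1], orig[3], orig[7], orig[4], orig[6]]

print(schedule(orig, False), schedule(orig, True),
      schedule(reordered, False), schedule(reordered, True))   # 21 14 15 12
```

The last figure is the part (d) case: with both reordering and forwarding, every consumer is already far enough behind its producer, so the eight instructions drain in 12 cycles.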
40. Why do conditional branches have a greater effect on a pipelined processor than unconditional branches?
SOLUTION
The outcome of an unconditional branch is known the moment it is detected, so instructions at the target
address can be fetched immediately. The outcome of a conditional branch is not known until the condition has
been tested, which may not happen until a later stage in the pipeline.
41. Describe the various types of change of flow‐of‐control operations that modify the normal sequence in which a
processor executes instructions. How frequently do these operations occur in typical programs?
SOLUTION
Flow-of-control operations include conditional and unconditional branches, jumps, subroutine calls and returns, traps (software interrupts), and hardware interrupts and exceptions; all of them cause non-sequential instruction execution. Interrupts and
exceptions are relatively rare (expressed as a percentage of total instructions executed). The frequency of
branches and jumps may be expressed statically or dynamically. The static frequency is the fractional number of
branches in the code. The dynamic frequency is more meaningful and is the number of branches executed when
the code is run. Branch instructions make up about 20% of a typical program. Subroutine calls and returns are less
frequent (of the order of 2%).
42. Suppose this ARM-like code is executed on a 4-stage pipeline with internal forwarding. The load instruction has
a one-cycle penalty and the multiply instruction introduces two stall cycles into the execute phase. Assume the
taken branch has no penalty.
SOLUTION
a. There are two pre‐loop instructions and a 6‐instruction loop repeated 10 times. Total = 2 + 10 × 6 = 62.
b. The following shows the code of one pass round the loop
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1. LDR r1,[r0] F O E S
2. SUBS r2,r2,#1 F O E S
3. MUL r1,r1,#5 F O E S
4. STR r1,[r0] F O E S
5. ADD r0,r0,#4 F O E S
6. BNE Loop F O E S
1. LDR r1,[r0] F O E S (repeat)
c. It takes 11 cycles to make one pass round the loop. However, it takes 14 cycles to execute all the instructions
in a loop fully. The total number of cycles is 2 (preloop) + 10 × 11 + 3 (post loop) = 115.
43. Branch instructions may be taken or not taken. What is the relative frequency of taken to not taken, and why is
this so?
SOLUTION
At first sight it might appear that the probability of branches being taken or not taken is 50:50 because there are
two alternatives. However, this logic is entirely misleading because of the way in which branches are used. A
paper (albeit old) by Y. Wu and J.R. Larus (Static branch frequency and program profile analysis, MICRO‐27 Nov
1994) suggests that loop branches have a probability of 88% of being taken.
44. What is branchless computing?
SOLUTION
If branches are considered harmful because a misprediction can lead to bubbles in the pipeline, it is a good idea to
reduce the frequency of branches. Doing this is called branchless computing. In particular, it refers to predicated
computing where an instruction is conditionally executed; for example, the ARM’s ADDEQ r0,r0,#1
increments the value of register r0 if the result of the last operation that set the condition code was zero. The
IA32 MMX instruction set extension also permits branchless computing by turning a condition into a value; that is,
if a test yields true, the value 1111…1 is generated and if the condition is false the value 0000...0 is generated.
These two constants can then be used as masks in Boolean operations.
45. What is a delayed branch and how does it contribute to minimizing the effect of pipeline bubbles? Why are
delayed branch mechanisms less popular than they were?
SOLUTION
The term delayed in delayed branch is not a very good description. In a pipelined computer a taken branch means
that the pipeline must be (partially) flushed. If the instruction sequence is P,B,Q where P, B, and Q are three
instructions and B is a branch, the instruction Q is executed if the branch is not taken and not executed if the
branch is taken. A delayed branch mechanism always executes the instruction after the branch. Thus, the
sequence P,Q,B (where P and Q are executed before the branch) becomes P,B,Q, where Q is still executed before
the branch takes effect. Of course, if no suitable instruction can be found, the so-called delayed branch slot must
be filled with a NOP (no operation). Delayed branches are less popular than they were because the delay slot
bakes one particular pipeline depth into the instruction set: deeper modern pipelines would need several slots,
compilers often cannot fill them usefully, and effective branch prediction removes most of the benefit.
SOLUTION
In a pipelined processor, an instruction flows through the pipeline and is executed in stages. If an instruction is a
branch and the branch is taken, all instructions behind it in the pipeline have to be flushed. The earlier a branch is
detected and the outcome resolved the better. Branch prediction makes a guess about the direction (outcome) of
the branch; taken or not taken. If the branch is predicted not taken, nothing happens and execution continues. If
the branch is predicted as taken, instructions can be obtained from the branch target address and loaded into the
instruction stream immediately. If the prediction is incorrect, the pipeline has to be flushed in the normal way.
47. A pipelined computer has a four-stage pipeline: fetch/decode, operand fetch, execute, writeback. All operations
except loads and branches execute without stalls. A load introduces one stall cycle. A not-taken branch introduces
no stalls and a taken branch introduces two stall cycles. Consider the following loop.
a. Express this code in an ARM‐like assembly language (assume that you cannot use autoindexed addressing and
that the only addressing mode is register indirect of the form [r0]).
b. Show a single trip round the loop and indicate how many clock cycles are required.
c. How many cycles will it take to execute this code in total?
d. How can you modify the code to reduce the number of cycles?
SOLUTION
a. The code
mov r2,#1024
Loop ldr r0,[r1]
add r0,r0,#2
str r0,[r1]
add r1,r1,#4
subs r2,r2,#1
BNE Loop
b. A trip round the loop has 6 instructions. The load has a one cycle stall and the taken branch back has two
cycles. The total is 6 + 1 + 2 = 9 cycles.
c. The total number of cycles is 1 + 1,024 × 9 ‐ 2 (the minus 2 is there because the branch is not taken on the last
loop). This is 9,215 cycles.
d. You can speed up the code by unrolling the loop and performing multiple iterations per trip and avoiding the
two cycle branch delay. You could save a cycle of latency by inserting the increment r1 by 4 after the load to
hide the load stall.
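A quick check of the arithmetic in parts (b) and (c), assuming 1,024 loop iterations as the solution does:

```python
# One trip: 6 instructions + 1 load stall + 2 taken-branch stall cycles.
cycles_per_trip = 6 + 1 + 2
trips = 1024
# One pre-loop instruction, minus the 2 branch stalls saved on the final
# (not-taken) pass round the loop.
total = 1 + trips * cycles_per_trip - 2
print(total)   # 9215
```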
48. Suppose that you design an architecture with the following characteristics
SOLUTION
a. Average cycles = non-branch cycles + not-taken branches + taken branches with the slot filled + taken branches
with the slot unfilled
= 80% × 1 + 20% × (15% × 1 + 85% × (50% × 1 + 50% × 2))
= 0.80 + 0.20 × (0.15 + 0.85 × 1.5) = 0.80 + 0.20 × 1.425 = 1.085
b. The only difference is the fraction of unfilled slots. We can write
Average cycles = 80% × 1 + 20% × (15% × 1 + 85% × (95% × 1 + 5% × 2))
= 0.80 + 0.20 × (0.15 + 0.8925) = 1.0085
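The same calculation as a function of the slot fill rate, for checking both parts:

```python
# Average cycles per instruction: 80% non-branch (1 cycle), 20% branch;
# of branches, 15% not taken (1 cycle) and 85% taken, costing 1 cycle with
# a filled delay slot and 2 with an unfilled one.
def avg_cycles(fill_rate):
    return 0.80 * 1 + 0.20 * (0.15 * 1 + 0.85 * (fill_rate * 1 + (1 - fill_rate) * 2))

print(avg_cycles(0.50))   # part a: approximately 1.085
print(avg_cycles(0.95))   # part b: approximately 1.0085
```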
50. What is the difference between static and dynamic branch prediction?
SOLUTION
Static prediction takes place before any code is executed; that is, it does not use feedback from the actual running
of the code to make a prediction. Dynamic prediction uses information from the past behavior of the program to
predict the future behavior. Dynamic prediction is more accurate than static prediction.
Static prediction relies on factors such as the static behavior of individual branches (e.g., this branch type is
usually taken, this one is not). Such an approach is relatively crude. The compiler can analyze code and make a
guess about the outcome of branches and then set a hint bit in the code. The processor uses this hint bit to decide
whether the branch will be taken. Note that not all computers have a branch hint bit.
Dynamic branch prediction observes the history of branches (either individually or collectively) and the position of
branches in the program to decide whether to take or not take a branch. Dynamic prediction can be very accurate
in many circumstances.
51. A processor has a branch‐target buffer. If a branch is in the buffer and it is correctly predicted, there is no branch
penalty. The prediction rate is 85% correct. If it is incorrectly predicted, the penalty is 4 cycles. If the branch is not
in the buffer, and not taken, the penalty is 2 cycles. Seventy percent of branches are taken. If the branch is not in
the buffer and is taken the penalty is 3 cycles. The probability that a branch is in the buffer is 90%. What is the
average branch penalty?
SOLUTION
Branch penalty = mispredict penalty (in buffer) + taken penalty (not in buffer) + not taken penalty (not in buffer) =
90% × 15% × 4 + 10% × 70% × 3 + 10% × 30% × 2
= 0.54 + 0.21 + 0.06 = 0.81 cycles per branch.
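The arithmetic can be checked directly from the stated probabilities:

```python
# 90% of branches hit the BTB; 15% of those are mispredicted (4 cycles).
# Of the 10% that miss the BTB, 70% are taken (3 cycles) and 30% are
# not taken (2 cycles).
penalty = 0.90 * 0.15 * 4 + 0.10 * 0.70 * 3 + 0.10 * 0.30 * 2
print(round(penalty, 2))   # 0.81
```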
52. How can the compiler improve the efficiency of some processors with branch prediction mechanisms?
SOLUTION
Some processors allow the compiler to set/clear bits in the op‐code that tell the processor whether to treat this
branch as taken or not taken; for example, if you have a loop in a high level language, the terminating conditional
branch will be taken back to the start of the loop n‐1 times for n iterations. The compiler would set the take
branch bit in the opcode and the processor would automatically assume ‘branch taken’.
53. Consider the following two streams of branch outcomes (T = taken and N = not taken). In each case what is the
simplest form of branch prediction mechanism that would be effective in reducing the branch penalty?
a. T, T, T, T, T, N, T, T, T, T, T, T, T, N, T, T, T, T, T, N, T, T, T, T, T, T, T, N, T, T, T, T, T
b. T, T, T, T, T, N, N, N, N, N, N, N, N, N, T, T, T, T, T, T, T, T, T, T, T, N, N, N, N, N, N, N, N
SOLUTION
a. Static prediction (always predict taken) is effective: the branch is taken five times out of six, so a static
taken prediction is wrong only once per six branches.
b. The outcomes occur in long runs, so a simple dynamic 1-bit (last-outcome) predictor is effective; it mispredicts
only once at each change of direction.
54. A processor uses a 2‐bit saturation‐counter dynamic branch predictor with the states strongly taken, weakly
taken, weakly not taken, and strongly not taken. The symbol T indicates a branch that is taken and an N indicates
a branch that is not taken. Suppose that the following predicted sequence of branches is recorded: T T T N T X
SOLUTION
In order to make the N prediction, the previous two states would have to be not taken states. If the next
prediction is T then the previous branch must have been T to move from the weakly not taken predicted state to
the weakly predicted taken state. Therefore the next prediction X will be T.
55. The following sequence of branch outcomes is applied to a saturating counter branch predictor
TTTNTTNNNTNNNTTTTTNTTTNNTTTTNT. If the branch penalty is two cycles for a mispredicted branch, how
many additional cycles does the system incur for the above sequence of 30 branches? Assume that the predictor
is initially in the strongly predicted taken state.
SOLUTION
Branch sequence, predictor state after each branch, prediction made for each branch, and the resulting decision.
The states are ST (strongly taken), WT (weakly taken), WN (weakly not taken), and SN (strongly not taken); the
prediction is taken in states ST and WT.
Outcome    T  T  T  N  T  T  N  N  N  T  N  N  N  T  T  T  T  T  N  T  T  T  N  N  T  T  T  T  N  T
State      ST ST ST WT ST ST WT WN SN WN SN SN SN WN WT ST ST ST WT ST ST ST WT WN WT ST ST ST WT ST
Predict    T  T  T  T  T  T  T  T  N  N  N  N  N  N  N  T  T  T  T  T  T  T  T  T  N  T  T  T  T  T
Decision   C  C  C  W  C  C  W  W  C  W  C  C  C  W  W  C  C  C  W  C  C  C  W  W  W  C  C  C  W  C
The number of wrong decisions is 11, costing 11 × 2 = 22 cycles. This is no better than always guessing taken.
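The trace can be checked with a few lines of Python; this sketch models the 2-bit saturating counter exactly as above (states 0–3, with 3 = strongly taken; predict taken in states 2 and 3):

```python
# Count mispredictions of a 2-bit saturating counter over a branch trace,
# starting in the strongly taken state (3).
def mispredictions(outcomes, state=3):
    wrong = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            wrong += 1
        # Saturating update: increment on taken, decrement on not taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return wrong

seq = "TTTNTTNNNTNNNTTTTTNTTTNNTTTTNT"
print(mispredictions(c == "T" for c in seq))   # 11
```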
56. The state diagram below represents one of the many possible 2‐bit state machines that can be used to perform
prediction. Explain, in plain English, what it does.
[State diagram: four states S0-S3 connected by taken (T) and not-taken (NT) transitions, as analyzed in the
solution below.]
SOLUTION
We can regard S0 as a strongly not taken state and all not taken branches lead towards this state. States S0, S1,
S2, S3 behave exactly like the corresponding states in a saturating counter with respect to not taken branches.
The differences between this and a saturating counter are:
1. If you are in state S1 (not taken) and the next branch is taken, you go straight to state S3, the strongly taken
state.
2. If you are in state S3, a taken branch takes you to state S2 (rather than back to state S3). State S3 is not a
saturating state. If there is a sequence of taken branches, the system oscillates between S2 and S3. From
state S3 the next state is always state S2 (since a taken and a not taken have the same destination).
129
© 2014 Cengage Learning. All Rights Reserved. May not be scanned, copied, or duplicated, or posted to a publicly available website, in whole or in part.
57. What is a branch target buffer and how does it contribute to a reduction of the branch penalty?
SOLUTION
The fundamental problem with a branch is that if it is taken, instructions already in the pipeline have to be
flushed. Consequently, you want to detect a branch instruction as soon as possible. Then you can begin execution
at the target address.
Branch target prediction operates by detecting the branch, guessing its outcome and fetching instructions from
the next or target address as soon as possible.
The branch target buffer, BTB, is a form of memory cache that caches the addresses of branch instructions. The
program counter searches the BTB. If the current instruction address corresponds to a branch, the cache can be
accessed and the predicted outcome of the branch read (This is true only of BTBs that have a prediction bit. In
general, it is assumed that every cached branch will be taken). The BTB contains the address of the target of the
branch. This means that instructions can be loaded from that address immediately (without having to read the
branch instruction and compute the target address). If you also cache the instruction at the target address you
can get the instruction almost immediately. The BTB lets you resolve the branch much earlier in the pipeline and
therefore reduce the branch penalty.
58. Consider the 4-bit saturating counter as a branch predictor with 16 states from 1111 to 0000. Describe in words
the circumstances where such a counter might be effective.
SOLUTION
If the branch predictor works in the same way as a 2-bit saturating counter, it has 16 states, 8 of which predict
taken and 8 of which predict not taken. During a long run (more than 15) of taken or not-taken branches, the
counter sits in the strongest taken (or not taken) state, and it then takes a run of eight wrongly predicted
branches in sequence to reverse the decision. Therefore, you might use such a predictor in circumstances where
very long runs of branches go in one direction and you do not wish to reverse the prediction unless the change of
direction spans at least eight branches.
59. Draw the state diagram of a branch predictor using a three-bit saturating counter. Under what circumstances do
you think such a predictor might prove effective?
SOLUTION
The predictor will not change direction when fully saturated until four consecutive wrong decisions have been
made.
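A generic n-bit saturating counter can be sketched in a few lines; with n = 3 there are eight states (0–7), predicting taken in the upper half. The assertions below reproduce the point just made: from the saturated taken state, four consecutive not-taken branches are needed before the prediction flips.

```python
class SaturatingCounter:
    """n-bit saturating counter; predicts taken in the upper half of its range."""
    def __init__(self, bits=3):
        self.max = (1 << bits) - 1
        self.state = self.max               # start in the saturated taken state
    def predict(self):
        return self.state > self.max // 2   # True = predict taken
    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, self.max)
        else:
            self.state = max(self.state - 1, 0)

p = SaturatingCounter(3)
for _ in range(4):                # four consecutive not-taken branches...
    assert p.predict() is True    # ...are each mispredicted as taken
    p.update(False)
print(p.predict())                # the prediction has now flipped: False
```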
60. Given the branch sequence TTTTNTTNNTTTTTNNNNNNNNNTNTTTTTTTTTTTT and assuming that the 3‐bit
saturating predictor starts in its saturated T state, what will the predicted sequence be?
SOLUTION
Input (branch outcomes)
T T T T N T T N N T T T T T N N N N N N N N N T N T T T T T T T T T T T T
State after each branch (S7 = saturated taken; predict taken in S4-S7)
S7 S7 S7 S7 S6 S7 S7 S6 S5 S6 S7 S7 S7 S7 S6 S5 S4 S3 S2 S1 S0 S0 S0 S1 S0 S1 S2 S3 S4 S5 S6 S7 S7 S7 S7 S7 S7
Prediction made for each branch
T T T T T T T T T T T T T T T T T T N N N N N N N N N N N T T T T T T T T
Outcome (C = correct, W = wrong)
C C C C W C C W W C C C C C W W W W C C C C C W C W W W W C C C C C C C C
The predictor mispredicts 12 of the 37 branches.
61. Consider the following code fragment.
MOV r0,#4
B1 MOV r2,#5
SUB r2,r2,r0
B2 SUBS r2,r2,#1
BNE B2 ;Branch 1
SUBS r0,r0,#1
BNE B1 ;Branch 2
Assume that a 1‐bit branch predictor is used for both branch 1 and branch 2 and that both predictors are initially
set to N. Complete the following table by running through this code.
Branch 1 Branch 2
Cycle Branch prediction Branch outcome Cycle Branch prediction Branch outcome
1 N N 1 N T
2 2
3 3
4 4
5
6
7
8
9
10
Repeat the same exercise with the same initial conditions but assume a 2‐bit saturating counter branch predictor.
SOLUTION
Branch 1 Branch 2
Cycle Branch prediction Branch outcome Cycle Branch prediction Branch outcome
1 N N 1 N T
2 N T 2 T T
3 T N 3 T T
4 N T 4 T N
5 T T
6 T N
7 N T
8 T T
9 T T
10 T N
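The table above corresponds to the 1-bit case: the predictor simply predicts whatever the branch did last time. A sketch (illustrative Python, not from the text) that reproduces the Branch 2 column — the outer loop is taken three times, then falls through:

```python
# 1-bit (last-outcome) branch predictor: returns (prediction, outcome)
# pairs as "T"/"N" strings, starting from an initial prediction of N.
def run_1bit(outcomes, pred=False):
    rows = []
    for taken in outcomes:
        rows.append(("T" if pred else "N", "T" if taken else "N"))
        pred = taken                  # remember only the last outcome
    return rows

# Branch 2 (outer loop, r0 counts 4,3,2,1): taken, taken, taken, not taken.
print(run_1bit([True, True, True, False]))   # matches the Branch 2 column
```

The 2-bit saturating-counter table asked for in the second part can be generated the same way by keeping a 0–3 counter instead of a single bit.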
62. A processor executes all non‐branch instructions in one cycle. This processor implements branch prediction,
which incurs an additional penalty of 2 cycles if the prediction is correct and 4 cycles if the prediction is incorrect.
a. If conditional branch instructions occupy 15% of the instruction stream, and the probability of an incorrect
branch prediction is 20%, what is the average number of cycles per instruction?
b. If the same processor is to run no more than 28% slower than a machine with a zero branch penalty when up
to 20% of the instructions are conditional branches, what level of accuracy must the branch prediction
achieve on average?
SOLUTION
a. CPI = non‐branch cycles + branch cycles (correct prediction) + branch cycles (incorrect prediction)
= 0.85 × 1 + 0.15(0.80 × 2 + 0.20 × 4) = 0.85 + 0.15(2.4) = 0.85 + 0.36 = 1.21 CPI
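The printed solution stops after part (a). As a check, and as a hedged attempt at part (b) under the same reading (the 2- and 4-cycle figures taken as the total cost of a branch, as part (a) does, with a CPI ceiling of 1.28):

```python
# CPI for a given branch fraction and prediction accuracy, treating a
# correctly predicted branch as 2 cycles and a mispredicted one as 4.
def cpi(branch_frac, accuracy):
    return (1 - branch_frac) * 1 + branch_frac * (accuracy * 2 + (1 - accuracy) * 4)

print(cpi(0.15, 0.80))   # part a: approximately 1.21

# Part b (assumption): CPI <= 1.28 with 20% branches.
# 0.8 + 0.2*(4 - 2p) = 1.28  =>  p = 0.8, i.e. 80% accuracy.
p = (4 - (1.28 - 0.80) / 0.20) / 2
print(p)                 # approximately 0.8
```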
63. A computer has a branch target buffer, BTB. Derive an expression for the average branch penalty if:
• a branch not in the BTB that is not taken incurs a penalty of 0 cycles
• a branch not in the BTB that is taken incurs a penalty of 6 cycles
• a branch in the BTB that is not taken incurs a penalty of 4 cycles
• a branch in the BTB that is taken incurs a penalty of 0 cycles
• the probability that a branch instruction is cached in the BTB is 80%
• the probability that an instruction not in the BTB is taken is 20%
• the probability that an instruction in the BTB is taken is 90%
SOLUTION
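The printed solution is missing here. One way to work the stated figures (an assumption, not the book's answer): only two cases incur a penalty — a cached branch that is not taken (4 cycles) and an uncached branch that is taken (6 cycles).

```python
# Average penalty = P(in BTB) * P(not taken | in) * 4
#                 + P(not in BTB) * P(taken | not in) * 6
penalty = 0.80 * 0.10 * 4 + 0.20 * 0.20 * 6
print(round(penalty, 2))   # 0.56
```

So under these assumptions the average branch penalty is 0.56 cycles per branch.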
64. A RISC processor implements a subroutine call using a link register (i.e., the return address is saved in the link
register). The cost of a call is 2 cycles and the return costs 1 cycle. If a subroutine is called from another subroutine
(i.e., the subroutine is nested), the contents of the link register must be saved and later restored. The cost of
saving the link register is 6 cycles and the cost of restoring the link register is 8 cycles. Assume that a certain
instruction mix contains 20% subroutine calls and returns (i.e., 10% calls, 10% returns). The probability of a single
subroutine call and return without nesting is 60%. The probability that a subroutine call will be followed by a
single nested call is 40%. Assume that the probability of further nesting is vanishingly small. What is the overall
cost of subroutine calls? The average cost of all other instructions is 1.5 cycles. What is the average number of
cycles per instruction?
SOLUTION
There are five possibilities: an instruction is not a subroutine call or return, it is a single call, it is a nested call, it is
a single return, or it is a nested return. Note that when a subroutine is nested, it incurs the unnested call/return
cost plus the extra save/restore time. The probabilities and costs are:
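The table that followed did not survive; the sketch below reconstructs the arithmetic under the stated assumptions (a nested call pays the base call plus the link-register save, a nested return the base return plus the restore).

```python
# (probability, cost in cycles) for each of the five cases
cases = [
    (0.80, 1.5),           # not a call/return: average 1.5 cycles
    (0.10 * 0.60, 2),      # single call
    (0.10 * 0.40, 2 + 6),  # nested call: call + save link register
    (0.10 * 0.60, 1),      # single return
    (0.10 * 0.40, 1 + 8),  # nested return: return + restore link register
]
avg = sum(p * c for p, c in cases)
print(round(avg, 2))   # 2.06 cycles per instruction
```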
65. Why is the literal in the op‐code sign‐extended before use (in most computer architectures)?
SOLUTION
Literals in instructions are invariably shorter than the register size of the computer; for example, a 32‐bit
processor might have a 16‐bit literal and 32‐bit registers. When the literal is loaded into the low‐order bits of a
register, the upper order bits must either be cleared, left unchanged, or used to extend the loaded value to the
full length of the register (i.e., sign extension). Since many computer instructions operate with signed values or
with address offsets, it makes sense to sign-extend an operand when it is loaded. Some processors, like the 68K,
have separate address (pointer) and general-purpose data registers. Values loaded into address registers are
always sign-extended, whereas those loaded into data registers are not.
66. Why is the address offset shifted two places left in branch/jump operations in 32‐bit RISC‐like processors?
SOLUTION
Typical processors have 32‐bit, four‐byte, instructions, yet the memory is byte addressed. That is, words have the
hexadecimal address 0,4,8,C,10,14 … However, the address bus can access addresses at any location; for example,
you can access address 0xABC3 (which is not word‐aligned). Because the two lowest bits of an address are always
zero for an aligned address, there is no point in storing them when an address is stored in an instruction as an
offset; for example if the address offset is xxxxxxxx00, it is stored as xxxxxxxx. Consequently, when loaded it must
be shifted left by two places to generate xxxxxxxx00. Doing this extends the effective size of a literal by two bits.
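The mechanism can be sketched in a few lines. The 24-bit field width below is an assumption for illustration (it is not specified in the text): a word-aligned byte offset is stored with its two zero bits dropped and reconstructed, with sign extension, when the branch executes.

```python
# Store a word-aligned byte offset in a hypothetical 24-bit instruction field.
def encode(offset_bytes, bits=24):
    return (offset_bytes >> 2) & ((1 << bits) - 1)   # drop the two zero bits

# Reconstruct the byte offset: sign-extend the field, then shift left two places.
def decode(field, bits=24):
    if field & (1 << (bits - 1)):    # top bit set: negative offset
        field -= 1 << bits
    return field << 2                # restore the two low zero bits

print(decode(encode(-8)))     # -8
print(decode(encode(4096)))   # 4096
```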
67. Assume a 5‐stage pipeline (instruction fetch, operand fetch, execute, memory, write‐back). For the following code
show any stalls and indicate where operand forwarding would be needed.
ADD R9,R9,R8
MUL R1,R2,R3
LDA R5,(4,R1)
SUB R5,R5,R1
ADD R7,R8,R9
MUL R7,R1,R5
SOLUTION
With no internal forwarding (an operand is read in OF only after the producing instruction's WB; . = stall):
Cycle            1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
ADD R9,R9,R8     F  OF E  M  WB
MUL R1,R2,R3        F  OF E  M  WB
LDA R5,(4,R1)          F  .  .  .  OF E  M  WB
SUB R5,R5,R1              F  .  .  .  .  .  .  OF E  M  WB
ADD R7,R8,R9                 F  .  .  .  .  .  .  OF E  M  WB
MUL R7,R1,R5                    F  .  .  .  .  .  .  .  .  OF E  M  WB
The LDA stalls until the MUL has written R1 back, the SUB stalls until the LDA has written R5 back, and the final
MUL stalls until the SUB has written R5 back. Total: 18 cycles.
With internal forwarding (an operand can be used in the cycle after it is generated: the end of E for ALU results,
the end of M for loads):
Cycle            1  2  3  4  5  6  7  8  9  10 11
ADD R9,R9,R8     F  OF E  M  WB
MUL R1,R2,R3        F  OF E  M  WB
LDA R5,(4,R1)          F  OF E  M  WB
SUB R5,R5,R1              F  OF .  E  M  WB
ADD R7,R8,R9                 F  OF .  E  M  WB
MUL R7,R1,R5                    F  OF .  E  M  WB
Forwarding paths are needed from the MUL's E stage to the LDA's address calculation, from the LDA's M stage to
the SUB (the load-use dependency still costs one stall), and from the SUB's E stage to the final MUL. Total: 11
cycles.
134
© 2014 Cengage Learning. All Rights Reserved. May not be scanned, copied, or duplicated, or posted to a publicly available website, in whole or in part.